HLCUSE: Hierarchical Language Clustering Using Sentence Embeddings

In this project, we aimed to test a hypothesis: that multilingual sentence embedding models could encode genetic relationships between languages. We used a parallel corpus and measured the average angular distance between equivalent sentences in 17 different languages, then used the resulting distance matrix to hierarchically cluster these languages in an attempt to recreate the Indo-European family tree. Ultimately, we found that they can do this to some degree, but other computational methods people have proposed to do this can do it better because our approach doesn't capture any syntactic details of the text, only semantics. Read our report to learn more.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.gitignore		.gitignore
README.md		README.md
align.ipynb		align.ipynb
compare_distances.py		compare_distances.py
distance_comparison.png		distance_comparison.png
distances.pkl		distances.pkl
edge_distances.pkl		edge_distances.pkl
embeddings.py		embeddings.py
gold_distances.csv		gold_distances.csv
matrix.py		matrix.py
nlp_final_report.pdf		nlp_final_report.pdf
rabinovich_edge_distances.pkl		rabinovich_edge_distances.pkl
rabinovich_tree.py		rabinovich_tree.py
requirements.txt		requirements.txt
tree.py		tree.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

HLCUSE: Hierarchical Language Clustering Using Sentence Embeddings

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

tristanmcd130/hlcuse

Folders and files

Latest commit

History

Repository files navigation

HLCUSE: Hierarchical Language Clustering Using Sentence Embeddings

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages