By Mason Shipton, David Anugraha, York Hay Ng
Pretrained language models (PLMs) have become remarkably adept at task and language generalization. Nonetheless, they often fail when faced with unseen languages. In this work, we present LinguAlchemy, a regularization method that incorporates linguistic information covering typological, geographical, and phylogenetic features, aligning PLM representations with the corresponding linguistic information for each language. LinguAlchemy significantly improves the performance of mBERT and XLM-R on low-resource languages in multiple downstream tasks, such as intent classification, news classification, and semantic relatedness, compared to fully finetuned models, and displays a high degree of unseen-language generalization. We further introduce AlchemyScale and AlchemyTune, extensions of LinguAlchemy that adjust the linguistic regularization weights automatically, alleviating the need for hyperparameter search.
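For intuition, here is a minimal PyTorch sketch of the core idea, not the repository's exact implementation: the pooled PLM representation is projected into URIEL vector space and pulled toward each language's linguistic vector by an auxiliary MSE term. The class name, dimensions, and fixed `reg_weight` are illustrative assumptions (AlchemyScale and AlchemyTune instead adjust this weight automatically).

```python
import torch.nn as nn
import torch.nn.functional as F

class LinguAlchemyLoss(nn.Module):
    """Illustrative sketch: task loss plus a linguistic regularization term."""

    def __init__(self, hidden_size: int, uriel_dim: int, reg_weight: float = 1.0):
        super().__init__()
        # Learned projection from the PLM hidden space into URIEL vector space.
        self.projection = nn.Linear(hidden_size, uriel_dim)
        # Fixed weight here; AlchemyScale/AlchemyTune adjust it automatically.
        self.reg_weight = reg_weight

    def forward(self, pooled_output, task_loss, uriel_vectors):
        # pooled_output: (batch, hidden_size) PLM sentence representations.
        # uriel_vectors: (batch, uriel_dim) typological/geographical/phylogenetic
        # feature vectors for each example's language.
        reg_loss = F.mse_loss(self.projection(pooled_output), uriel_vectors)
        return task_loss + self.reg_weight * reg_loss
```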
If you are interested in more information, check out the full paper.
If you use this code for your research, please cite the following work:
```bibtex
@inproceedings{adilazuarda-etal-2024-lingualchemy,
title = "{L}ingu{A}lchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization",
author = "Adilazuarda, Muhammad Farid and
Cahyawijaya, Samuel and
Winata, Genta Indra and
Purwarianti, Ayu and
Aji, Alham Fikri",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.225/",
doi = "10.18653/v1/2024.findings-emnlp.225",
pages = "3912--3928",
abstract = "Pretrained language models (PLMs) have shown remarkable generalization toward multiple tasks and languages. Nonetheless, the generalization of PLMs towards unseen languages is poor, resulting in significantly worse language performance, or even generating nonsensical responses that are comparable to a random baseline. This limitation has been a longstanding problem of PLMs raising the problem of diversity and equal access to language modeling technology. In this work, we solve this limitation by introducing LinguAlchemy, a regularization technique that incorporates various aspects of languages covering typological, geographical, and phylogenetic constraining the resulting representation of PLMs to better characterize the corresponding linguistics constraints. LinguAlchemy significantly improves the accuracy performance of mBERT and XLM-R on unseen languages by {\textasciitilde}18{\%} and {\textasciitilde}2{\%}, respectively compared to fully finetuned models and displaying a high degree of unseen language generalization. We further introduce AlchemyScale and AlchemyTune, extension of LinguAlchemy which adjusts the linguistic regularization weights automatically, alleviating the need for hyperparameter search. LinguAlchemy enables better cross-lingual generalization to unseen languages which is vital for better inclusivity and accessibility of PLMs."
}
```

If you have any questions, you can open a GitHub Issue.
URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base
URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec addressing these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves user experience with robust, customizable distance calculations to better suit the needs of the users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.
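For orientation, the snippet below shows how URIEL-style feature vectors and distances are typically queried through the original lang2vec API; the language codes and feature set are arbitrary examples, and URIEL+ extends this interface with customizable distance calculations.

```python
import lang2vec.lang2vec as l2v

# Typological feature vectors (syntax features, missing values imputed via KNN).
features = l2v.get_features("eng fra", "syntax_knn")
print(len(features["eng"]))  # dimensionality of the syntax vector

# Precomputed URIEL distances between two languages.
print(l2v.distance("genetic", "eng", "fra"))
print(l2v.distance(["genetic", "geographic"], "eng", "fra"))
```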
If you are interested in more information, check out the full paper.
If you use this code for your research, please cite the following work:
```bibtex
@inproceedings{khan-etal-2025-uriel,
title = "{URIEL}+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base",
author = {Khan, Aditya and
Shipton, Mason and
Anugraha, David and
Duan, Kaiyao and
Hoang, Phuong H. and
Khiu, Eric and
Do{\u{g}}ru{\"o}z, A. Seza and
Lee, En-Shiun Annie},
editor = "Rambow, Owen and
Wanner, Leo and
Apidianaki, Marianna and
Al-Khalifa, Hend and
Eugenio, Barbara Di and
Schockaert, Steven",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.coling-main.463/",
pages = "6937--6952",
abstract = "URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec that addresses these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves the user experience with robust, customizable distance calculations to better suit the needs of users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies."
}
```

If you have any questions, you can open a GitHub Issue or send the authors an email.
Check out ExploRIEL, the online UI for URIEL+: https://uriel-leelab.streamlit.app/
Requires Python 3.10 or later.
All dependencies are listed in the `requirements/` folder.
Run the following script to create URIEL+ vectors:
```bash
python vectors/create_vectors.py
```

This will create PyTorch files containing distances for the MASAKHANEWS, MASSIVE, and SEMREL language datasets.
Output files will be saved to the `vectors/` folder, in individual subfolders for each dataset (e.g., MASAKHANEWS PyTorch files will be saved to the `vectors/masakhanews` folder).
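Once created, the files can be inspected with a standard `torch.load`; the filename below is hypothetical, so check the dataset folder for the actual names written by `create_vectors.py`.

```python
import torch

# Hypothetical filename; list vectors/masakhanews/ for the actual outputs.
data = torch.load("vectors/masakhanews/uriel_plus.pt")

# The files contain distance tensors; inspect the structure before use.
if isinstance(data, dict):
    for key, value in list(data.items())[:3]:
        print(key, getattr(value, "shape", type(value)))
else:
    print(type(data), getattr(data, "shape", None))
```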
After creating vectors, run:
```bash
./script/run_lingualchemy.sh
```

This will run all experiments for the MASAKHANEWS, MASSIVE, and SEMREL language datasets.
Output files will be saved to the `outputs/` folder, in individual subfolders for each dataset (e.g., MASAKHANEWS experiment results will be saved to the `outputs/masakhanews` folder).
After the experiments finish, run:
```bash
python metrics.py
```

This will average the accuracy for the MASAKHANEWS and MASSIVE datasets and the Pearson correlation for the SEMREL dataset across all languages, under the different benchmarks.
The table of metrics will be saved to the `outputs/` folder as `metrics_table.csv`.
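For reference, the two metric types being averaged look roughly like this (toy numbers, not the script's actual code):

```python
import numpy as np
from scipy.stats import pearsonr

# Classification tasks (MASAKHANEWS, MASSIVE): average per-language accuracy.
per_language_accuracy = {"swa": 0.81, "hau": 0.77, "ibo": 0.74}
print("mean accuracy:", np.mean(list(per_language_accuracy.values())))

# Regression task (SEMREL): Pearson correlation between predictions and gold.
predictions = np.array([0.20, 0.55, 0.90, 0.40])
gold = np.array([0.15, 0.60, 0.85, 0.50])
r, _ = pearsonr(predictions, gold)
print("pearson r:", round(r, 3))
```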
You can test statistical significance between URIEL, URIEL+, or different URIEL versions.
Steps:

- Copy the metrics (`accuracy` or `pearson`) into `statistical.py` under the correct experiment section.
- Run:

```bash
python statistical.py
```
This will output a p-value measuring the statistical significance of the difference between the URIEL versions' results.
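As an illustration of what such a test looks like, here is a sketch using a paired t-test over per-language scores; the numbers are hypothetical, and `statistical.py` may use a different test.

```python
from scipy.stats import ttest_rel

# Hypothetical per-language scores for two URIEL versions (paired by language).
uriel_scores = [0.71, 0.64, 0.80, 0.58, 0.69]
uriel_plus_scores = [0.74, 0.66, 0.81, 0.63, 0.70]

# A paired t-test is one standard way to get a p-value for the difference.
t_stat, p_value = ttest_rel(uriel_plus_scores, uriel_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```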

