CreoleVal is accepted to TACL 2024!
This repository includes data (or otherwise download scripts), scripts for training and evaluation, and models for tasks spanning natural language understanding and generation for Creole languages.
Statistics about the coverage of CreoleVal can be found here, as well as additional analysis of the performance and behaviour over the included tasks.
If you wish to clone this repository to replicate the results, please remember to initialize the submodules by cloning recursively:

    git clone git@github.com:hclent/CreoleVal.git --recursive

or, if you have already cloned the repository:

    git submodule update --init --recursive

Each task is contained in its own sub-directory, where a dedicated README.md file provides further technical instructions on how to get started.
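
To confirm that the submodules were actually initialized, one option is to inspect the output of `git submodule status`, where a leading `-` marks a submodule that has not been initialized. The following is a minimal Python sketch of that check, not part of the repository: the helper name `uninitialized_submodules` and the sample paths and hashes are hypothetical and for illustration only.

```python
def uninitialized_submodules(status_output: str) -> list[str]:
    """Return the paths that `git submodule status` flags with a leading '-'.

    Each status line has the form '<flag><sha1> <path> [(<describe>)]';
    the flag '-' means the submodule has not been initialized.
    """
    missing = []
    for line in status_output.splitlines():
        if not line.startswith("-"):
            continue
        # Drop the flag, then separate the commit hash from the path.
        fields = line[1:].split(" ", 1)
        if len(fields) == 2:
            missing.append(fields[1].strip())
    return missing


# Made-up sample output; the paths and hashes are placeholders,
# not the actual submodules of this repository.
sample = (
    "-1111111111111111111111111111111111111111 path/to/submodule_a\n"
    " 2222222222222222222222222222222222222222 path/to/submodule_b (heads/main)\n"
)

for path in uninitialized_submodules(sample):
    print(f"not initialized: {path}")
```

In practice, the text passed to the function would come from running `git submodule status --recursive` in the repository root (for example via `subprocess.run(..., capture_output=True, text=True)`); any path it reports can be fixed with `git submodule update --init --recursive`.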
Datasets, training and inference scripts for NLU tasks such as:
- Machine Comprehension,
- Relation Classification,
- UDPoS,
- NER,
- NLI,
- Sentiment Analysis,
- Tatoeba challenge.
Training and inference scripts for the Machine Translation task with the MIT-Haiti corpus, KreolMorisienMT, and Bibles. Please note that there is no download script for the set of Bibles, as the material is copyrighted.
- Haitian Creole MT
  - Brand new dataset for Haitian MT: the MIT-Haiti Corpus
  - Contains monolingual and parallel data for Haitian Creole (hat)
- Creole-Bible MT
- KreolMorisien MT (downloadable with download_kreolmorisien_mt.sh)
Because CreoleVal is a composite of new benchmarks and pre-existing ones, there are several different software licenses at play.
For the datasets packed within CreoleVal (i.e., the data is actually in the repo, rather than fetched with a download script), we summarize them here, for your convenience. To navigate to a specific dataset, click on the dataset name in the table below.
Note: an * indicates a dataset that we have newly introduced in CreoleVal:
| Task | Dataset | Language (ISO 639-3) | License | Domain | Total Sent. | Total Words |
|---|---|---|---|---|---|---|
| MC | CreoleVal MC* | hat-dir, hat-loc, mfe | Microsoft License | Education | 3894 | 32068 |
| RC | CreoleVal RC* | bis, cbk, jam, tpi | CC0 | WikiDump | 785 | 4106 |
| MT | CreoleVal Religious MT* | bzj, bis, cbk, gul, hat, hwc, jam, ktu, kri, mkn, mbf, mfe, djk, pcm, pap, pis, acf, icr, sag, srm, crs, srn, tdt, tpi, tcs | Copyrighted | Religion | 64394 | 811741 |
| MT | CreoleVal MIT-Haiti* | hat | CC 4.0 | Education | 3164 | 36281 |
| Pretraining data | CreoleVal MIT-Haiti* | hat | CC 4.0 | Education | 8281 | 116444 |
| UDPoS | Singlish Treebank | singlish | MIT | Web Scrape | 1200 | 10989 |
| UDPoS | UD_Naija-NSC | pcm | CC 4.0 | Dialog | 9621 | 150000 |
| NER | MasakhaNER | pcm | Apache 2.0 | BBC News | 3000 | 76063 |
| NER | WikiAnn | bis, cbk, hat, pih, sgg, tpi, pap | Unspecified | WikiDump | 5877 | 74867 |
| SA | AfriSenti | pcm | CC BY 4.0 | | 10559 | 235679 |
| SA | Naija VADER | pcm | Unspecified | | 9576 | 101057 |
| NLI | JamPatoisNLI | jam | Unspecified | Twitter, web | 650 | 2612 |
| SM | Tatoeba | cbk, gcf, hat, jam, pap, sag, tpi | CC-BY 2.0 | General web | 49192 | 319719 |
| MT | KreolMorisienMT | mfe | MIT License | Varied | 6628 | 23554 |
| New: | | | | | 80518 | 1000640 |
| Total: | | | | | 176821 | 1995180 |
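
The "New" and "Total" rows can be reproduced by summing the per-dataset counts in the table. The sketch below is a quick arithmetic check in Python, with the figures copied from the rows above; the `ROWS` list and the `totals` helper are illustrative names rather than anything shipped in the repository.

```python
# (task, dataset, sentences, words, newly introduced in CreoleVal?)
ROWS = [
    ("MC", "CreoleVal MC", 3894, 32068, True),
    ("RC", "CreoleVal RC", 785, 4106, True),
    ("MT", "CreoleVal Religious MT", 64394, 811741, True),
    ("MT", "CreoleVal MIT-Haiti", 3164, 36281, True),
    ("Pretraining data", "CreoleVal MIT-Haiti", 8281, 116444, True),
    ("UDPoS", "Singlish Treebank", 1200, 10989, False),
    ("UDPoS", "UD_Naija-NSC", 9621, 150000, False),
    ("NER", "MasakhaNER", 3000, 76063, False),
    ("NER", "WikiAnn", 5877, 74867, False),
    ("SA", "AfriSenti", 10559, 235679, False),
    ("SA", "Naija VADER", 9576, 101057, False),
    ("NLI", "JamPatoisNLI", 650, 2612, False),
    ("SM", "Tatoeba", 49192, 319719, False),
    ("MT", "KreolMorisienMT", 6628, 23554, False),
]


def totals(rows):
    """Sum the sentence and word counts over the given table rows."""
    sentences = sum(row[2] for row in rows)
    words = sum(row[3] for row in rows)
    return sentences, words


new_sentences, new_words = totals([row for row in ROWS if row[4]])
all_sentences, all_words = totals(ROWS)

print(f"New:   {new_sentences} sentences, {new_words} words")
print(f"Total: {all_sentences} sentences, {all_words} words")
# New:   80518 sentences, 1000640 words
# Total: 176821 sentences, 1995180 words
```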
Paper can be found here. Please cite us:
@article{10.1162/tacl_a_00682,
author = {Lent, Heather and Tatariya, Kushal and Dabre, Raj and Chen, Yiyi and Fekete, Marcell and Ploeger, Esther and Zhou, Li and Armstrong, Ruth-Ann and Eijansantos, Abee and Malau, Catriona and Heje, Hans Erik and Lavrinovics, Ernests and Kanojia, Diptesh and Belony, Paul and Bollmann, Marcel and Grobol, Loïc and Lhoneux, Miryam de and Hershcovich, Daniel and DeGraff, Michel and Søgaard, Anders and Bjerva, Johannes},
title = {CreoleVal: Multilingual Multitask Benchmarks for Creoles},
journal = {Transactions of the Association for Computational Linguistics},
volume = {12},
pages = {950-978},
year = {2024},
month = {09},
issn = {2307-387X},
doi = {10.1162/tacl_a_00682},
url = {https://doi.org/10.1162/tacl\_a\_00682},
eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00682/2468651/tacl\_a\_00682.pdf},
}