arXiv Open Access 2022

AfroLID: A Neural Language Identification Tool for African Languages

Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Alcides Alcoba Inciarte

Abstract

Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves an F1 score of 95.89. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically motivated error analysis that allow us to showcase both AfroLID's powerful capabilities and its limitations.

Authors (4)

Ife Adebara

AbdelRahim Elmadany

Muhammad Abdul-Mageed

Alcides Alcoba Inciarte

Citation Format

Adebara, I., Elmadany, A., Abdul-Mageed, M., Inciarte, A.A. (2022). AfroLID: A Neural Language Identification Tool for African Languages. https://arxiv.org/abs/2210.11744

Journal Information
Publication Year: 2022
Language: en
Source Database: arXiv
Access: Open Access