arXiv Open Access 2025

Dealing with the Hard Facts of Low-Resource African NLP

Yacouba Diarra, Nouhoum Souleymane Coulibaly, Panga Azazia Kamaté, Madani Amadou Tall, Emmanuel Élisé Koné, +2 others

Abstract

Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.

Topics & Keywords

cs.CL

Authors (7)

Yacouba Diarra

Nouhoum Souleymane Coulibaly

Panga Azazia Kamaté

Madani Amadou Tall

Emmanuel Élisé Koné

Aymane Dembélé

Michael Leventhal

Citation Format

Diarra, Y., Coulibaly, N.S., Kamaté, P.A., Tall, M.A., Koné, E.É., Dembélé, A. et al. (2025). Dealing with the Hard Facts of Low-Resource African NLP. https://arxiv.org/abs/2511.18557

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓