WAXAL: A Large-Scale Multilingual African Language Speech Corpus
Abstract
The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 24 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with around 235 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at https://huggingface.co/datasets/google/WaxalNLP under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.
Authors (43)
Abdoulaye Diack
Perry Nelson
Kwaku Agbesi
Angela Nakalembe
MohamedElfatih MohamedKhair
Vusumuzi Dube
Tavonga Siyavora
Subhashini Venugopalan
Jason Hickey
Uche Okonkwo
Abhishek Bapna
Isaac Wiafe
Raynard Dodzi Helegah
Elikem Doe Atsakpo
Charles Nutrokpor
Fiifi Baffoe Payin Winful
Kafui Kwashie Solaga
Jamal-Deen Abdulai
Akon Obu Ekpezu
Audace Niyonkuru
Samuel Rutunda
Boris Ishimwe
Michael Melese
Engineer Bainomugisha
Joyce Nakatumba-Nabende
Andrew Katumba
Claire Babirye
Jonathan Mukiibi
Vincent Kimani
Samuel Kibacia
James Maina
Fridah Emmah
Ahmed Ibrahim Shekarau
Ibrahim Shehu Adamu
Yusuf Abdullahi
Howard Lakougna
Bob MacDonald
Hadar Shemtov
Aisha Walcott-Bryant
Moustapha Cisse
Avinatan Hassidim
Jeff Dean
Yossi Matias
Quick Access
- Publication Year: 2026
- Language: en
- Source Database: arXiv
- Access: Open Access ✓