arXiv Open Access 2026

WAXAL: A Large-Scale Multilingual African Language Speech Corpus

Abdoulaye Diack, Perry Nelson, Kwaku Agbesi, Angela Nakalembe, MohamedElfatih MohamedKhair, +38 others

Abstract

The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 24 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with around 235 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at https://huggingface.co/datasets/google/WaxalNLP under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.

Authors (43)

Abdoulaye Diack
Perry Nelson
Kwaku Agbesi
Angela Nakalembe
MohamedElfatih MohamedKhair
Vusumuzi Dube
Tavonga Siyavora
Subhashini Venugopalan
Jason Hickey
Uche Okonkwo
Abhishek Bapna
Isaac Wiafe
Raynard Dodzi Helegah
Elikem Doe Atsakpo
Charles Nutrokpor
Fiifi Baffoe Payin Winful
Kafui Kwashie Solaga
Jamal-Deen Abdulai
Akon Obu Ekpezu
Audace Niyonkuru
Samuel Rutunda
Boris Ishimwe
Michael Melese
Engineer Bainomugisha
Joyce Nakatumba-Nabende
Andrew Katumba
Claire Babirye
Jonathan Mukiibi
Vincent Kimani
Samuel Kibacia
James Maina
Fridah Emmah
Ahmed Ibrahim Shekarau
Ibrahim Shehu Adamu
Yusuf Abdullahi
Howard Lakougna
Bob MacDonald
Hadar Shemtov
Aisha Walcott-Bryant
Moustapha Cisse
Avinatan Hassidim
Jeff Dean
Yossi Matias

Citation Format

Diack, A., Nelson, P., Agbesi, K., Nakalembe, A., MohamedKhair, M., Dube, V. et al. (2026). WAXAL: A Large-Scale Multilingual African Language Speech Corpus. https://arxiv.org/abs/2602.02734

Journal Information
Publication Year
2026
Language
en
Source Database
arXiv
Access
Open Access ✓