arXiv Open Access 2025

Swivuriso: The South African Next Voices Multilingual Speech Dataset

Vukosi Marivate, Kayode Olaleye, Sitwala Mundia, Andinda Bakainga, Unarine Netshifhefhe, +17 others

Abstract

This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general-domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset's creation. We present baseline results from training and fine-tuning ASR models on this data and compare against other ASR datasets for the languages concerned.

Topics & Keywords

cs.CL

Authors (22)

Vukosi Marivate

Kayode Olaleye

Sitwala Mundia

Andinda Bakainga

Unarine Netshifhefhe

Mahmooda Milanzie

Tsholofelo Hope Mogale

Thapelo Sindane

Zainab Abdulrasaq

Kesego Mokgosi

Chijioke Okorie

Nia Zion Van Wyk

Graham Morrissey

Dale Dunbar

Francois Smit

Tsosheletso Chidi

Rooweither Mabuya

Andiswa Bukula

Respect Mlambo

Tebogo Macucwa

Idris Abdulmumin

Seani Rananga

Citation Format

Marivate, V., Olaleye, K., Mundia, S., Bakainga, A., Netshifhefhe, U., Milanzie, M., et al. (2025). Swivuriso: The South African Next Voices Multilingual Speech Dataset. https://arxiv.org/abs/2512.02201

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓