arXiv Open Access 2025

Swivuriso: The South African Next Voices Multilingual Speech Dataset

Vukosi Marivate, Kayode Olaleye, Sitwala Mundia, Andinda Bakainga, Unarine Netshifhefhe, +17 others

Abstract

This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general-domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset's creation. We present baseline results from training and fine-tuning ASR models on this data and compare them with other ASR datasets for the languages concerned.

Authors (22)

Vukosi Marivate
Kayode Olaleye
Sitwala Mundia
Andinda Bakainga
Unarine Netshifhefhe
Mahmooda Milanzie
Tsholofelo Hope Mogale
Thapelo Sindane
Zainab Abdulrasaq
Kesego Mokgosi
Chijioke Okorie
Nia Zion Van Wyk
Graham Morrissey
Dale Dunbar
Francois Smit
Tsosheletso Chidi
Rooweither Mabuya
Andiswa Bukula
Respect Mlambo
Tebogo Macucwa
Idris Abdulmumin
Seani Rananga

Citation Format

Marivate, V., Olaleye, K., Mundia, S., Bakainga, A., Netshifhefhe, U., Milanzie, M. et al. (2025). Swivuriso: The South African Next Voices Multilingual Speech Dataset. https://arxiv.org/abs/2512.02201

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access