Swivuriso: The South African Next Voices Multilingual Speech Dataset
Abstract
This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general-domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset's creation. We present baseline results from training and fine-tuning ASR models on this data and compare them with other ASR datasets for the languages concerned.
Authors (22)
Vukosi Marivate
Kayode Olaleye
Sitwala Mundia
Andinda Bakainga
Unarine Netshifhefhe
Mahmooda Milanzie
Tsholofelo Hope Mogale
Thapelo Sindane
Zainab Abdulrasaq
Kesego Mokgosi
Chijioke Okorie
Nia Zion Van Wyk
Graham Morrissey
Dale Dunbar
Francois Smit
Tsosheletso Chidi
Rooweither Mabuya
Andiswa Bukula
Respect Mlambo
Tebogo Macucwa
Idris Abdulmumin
Seani Rananga
Quick Access
- Year of Publication: 2025
- Language: en
- Source Database: arXiv
- Access: Open Access