DOAJ Open Access 2025

Datasets for South African Languages: Bilingual Aligned and Monolingual Data for Machine Translation

Tanja Gaustad, Cindy A. McKellar, Martin J. Puttkammer

Abstract

This data paper describes machine translation datasets built for the Autshumato project. The datasets contain bilingual aligned data between English and each of the other official written languages of South Africa, namely Afrikaans (ISO 639-3: afr), isiNdebele (nbl), isiXhosa (xho), isiZulu (zul), Sepedi (nso), Sesotho (sot), Setswana (tsn), Siswati (ssw), Tshivenḓa (ven) and Xitsonga (tso), as well as monolingual data for all 11 languages. The content was sourced from existing and commissioned translations, various publications, and web crawling of government sites. The article describes the collection, alignment and cleanup processes used to create these resources, and gives a detailed overview of the amount and provenance of the data included in the final datasets for each language. Although the datasets were created primarily for training statistical and neural machine translation systems, they can also be used for other natural language processing tasks, such as term extraction, or for linguistic research, such as lexicography.

Topics & Keywords

History of scholarship and learning. The humanities; Language and Literature

Authors (3)

Tanja Gaustad

Cindy A. McKellar

Martin J. Puttkammer

Citation Format

Gaustad, T., McKellar, C. A., & Puttkammer, M. J. (2025). Datasets for South African Languages: Bilingual Aligned and Monolingual Data for Machine Translation. https://doi.org/10.5334/johd.372

Journal Information

Year Published: 2025
Source Database: DOAJ
DOI: 10.5334/johd.372
Access: Open Access ✓