DOAJ Open Access 2025

Datasets for South African Languages: Bilingual Aligned and Monolingual Data for Machine Translation

Tanja Gaustad, Cindy A. McKellar, Martin J. Puttkammer

Abstract

This data paper describes machine translation datasets built for the Autshumato project. The datasets contain bilingual aligned data between English and all other official written languages of South Africa, namely Afrikaans (ISO 639-3: afr), isiNdebele (nbl), isiXhosa (xho), isiZulu (zul), Sepedi (nso), Sesotho (sot), Setswana (tsn), Siswati (ssw), Tshivenḓa (ven) and Xitsonga (tso), as well as monolingual data for all 11 languages. The content was sourced from existing and commissioned translations, various publications, and web-crawling of government sites. The present article describes the collection, alignment and cleanup processes that were used to create these resources. It also gives a detailed overview of the amount and provenance of the data included in the final datasets for all languages. Although the datasets were created primarily for the training of statistical and neural machine translation systems, they can also be used for other natural language processing tasks or linguistic research, such as term extraction or lexicography.

Citation Format

Gaustad, T., McKellar, C.A., Puttkammer, M.J. (2025). Datasets for South African Languages: Bilingual Aligned and Monolingual Data for Machine Translation. https://doi.org/10.5334/johd.372

Journal Information
Publication Year: 2025
Source Database: DOAJ
DOI: 10.5334/johd.372
Access: Open Access