arXiv Open Access 2020

Speech Recognition for Endangered and Extinct Samoyedic languages

Niko Partanen, Mika Hämäläinen, Tiina Klooster

Abstract

Our study presents a series of experiments on speech recognition with endangered and extinct Samoyedic languages spoken in Northern and Southern Siberia. To the best of our knowledge, this is the first time a functional ASR system has been built for an extinct language. For Kamas we achieve a Label Error Rate of 15%, and conclude through careful error analysis that this quality is already very useful as a starting point for refined human transcriptions. Our results for the related Nganasan language are more modest, with the best model reaching an error rate of 33%. We show, however, through experiments in which the Kamas training data is enlarged incrementally, that the Nganasan results are in line with what is expected given the low-resource circumstances of the language. Based on this, we provide recommendations for scenarios in which further language documentation or archive processing activities could benefit from modern ASR technology. All training data and processing scripts have been published on Zenodo with clear licences to enable further work on this important topic.
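The abstract reports Label Error Rates without stating how the metric is computed. As a rough illustration only, the sketch below assumes the common definition: the total edit distance between predicted and reference label sequences, divided by the total number of reference labels. The function names and the toy strings are hypothetical and are not taken from the authors' published scripts.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two label sequences."""
    # prev[j] holds the distance between the first i-1 reference labels
    # and the first j hypothesis labels.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference label
                curr[j - 1] + 1,     # insert a hypothesis label
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def label_error_rate(refs: Sequence[Sequence[str]],
                     hyps: Sequence[Sequence[str]]) -> float:
    """Assumed definition: summed edit distance / summed reference length."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of items")
    total_labels = sum(len(r) for r in refs)
    if total_labels == 0:
        raise ValueError("reference set contains no labels")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_labels


if __name__ == "__main__":
    # Toy strings, treated as sequences of character labels.
    print(label_error_rate(["abcd"], ["abxd"]))
```

Under this assumed definition, a rate of 15% would correspond to roughly 15 label edits per 100 reference labels needed to turn the system output into the reference transcription.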

Authors (3)

Niko Partanen
Mika Hämäläinen
Tiina Klooster

Citation Format

Partanen, N., Hämäläinen, M., Klooster, T. (2020). Speech Recognition for Endangered and Extinct Samoyedic languages. https://arxiv.org/abs/2012.05331

Journal Information
Publication year: 2020
Language: en
Source database: arXiv
Access: Open Access