arXiv Open Access 2025

Linguistically Informed Tokenization Improves ASR for Underresourced Languages

Massimo Daul, Alessio Tosolini, Claire Bowern

Abstract

Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR's viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.
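The abstract contrasts orthographic (per-character) tokenization with a linguistically informed phonemic scheme that treats multi-character graphemes as single units. A minimal sketch of that contrast, assuming a hypothetical digraph inventory (the actual Yan-nhangu grapheme-to-phoneme table is not given here):

```python
# Illustrative sketch: orthographic (per-character) vs. phonemic tokenization.
# The digraph inventory below is hypothetical, not the paper's actual
# Yan-nhangu mapping.

DIGRAPHS = {"ng", "ny", "rr", "th"}  # assumed multi-character graphemes

def orthographic_tokens(word: str) -> list[str]:
    """Baseline scheme: one token per orthographic character."""
    return list(word)

def phonemic_tokens(word: str) -> list[str]:
    """Greedy longest-match: digraphs become single tokens, so the model's
    output vocabulary aligns with the phoneme inventory rather than the
    orthography."""
    tokens, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            tokens.append(word[i:i + 2])
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens

print(orthographic_tokens("ngarra"))  # ['n', 'g', 'a', 'r', 'r', 'a']
print(phonemic_tokens("ngarra"))      # ['ng', 'a', 'rr', 'a']
```

With a phonemic vocabulary, a CTC-style model such as wav2vec2 predicts one label per phoneme instead of having to learn that two characters jointly encode one sound, which is one plausible reading of why the paper's phonemic scheme lowers WER and CER.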

Citation

Daul, M., Tosolini, A., & Bowern, C. (2025). Linguistically Informed Tokenization Improves ASR for Underresourced Languages. https://arxiv.org/abs/2510.06461

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓