arXiv Open Access 2025

Linguistically Informed Tokenization Improves ASR for Underresourced Languages

Massimo Daul, Alessio Tosolini, Claire Bowern

Abstract

Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR's viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.
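The abstract contrasts orthographic (per-character) tokenization with a linguistically informed phonemic scheme that treats multi-character graphemes as single units. A minimal sketch of that contrast, assuming a hypothetical digraph inventory (the actual Yan-nhangu grapheme-to-phoneme table is not given here):

```python
# Illustrative sketch: orthographic (per-character) vs. phonemic tokenization.
# The digraph inventory below is hypothetical, not the paper's actual
# Yan-nhangu mapping.

DIGRAPHS = {"ng", "ny", "rr", "th"}  # assumed multi-character graphemes

def orthographic_tokens(word: str) -> list[str]:
    """Baseline scheme: one token per orthographic character."""
    return list(word)

def phonemic_tokens(word: str) -> list[str]:
    """Greedy longest-match: digraphs become single tokens, so the model's
    output vocabulary aligns with the phoneme inventory rather than the
    orthography."""
    tokens, i = [], 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            tokens.append(word[i:i + 2])
            i += 2
        else:
            tokens.append(word[i])
            i += 1
    return tokens

print(orthographic_tokens("ngarra"))  # ['n', 'g', 'a', 'r', 'r', 'a']
print(phonemic_tokens("ngarra"))      # ['ng', 'a', 'rr', 'a']
```

With a phonemic vocabulary, a CTC-style model such as wav2vec2 predicts one label per phoneme instead of having to learn that two characters jointly encode one sound, which is one plausible reading of why the paper's phonemic scheme lowers WER and CER.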

Citation

Daul, M., Tosolini, A., & Bowern, C. (2025). Linguistically Informed Tokenization Improves ASR for Underresourced Languages. https://arxiv.org/abs/2510.06461

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓