arXiv Open Access 2025

Developing an Open Conversational Speech Corpus for the Isan Language

Adisai Na-Thalang, Chanakan Wittayasakpan, Kritsadha Phatcharoen, Supakit Buakaw

Abstract

This paper describes the development of the first open conversational speech dataset for the Isan language, the most widely spoken regional dialect in Thailand. Unlike existing speech corpora, which are primarily based on read or scripted speech, this dataset consists of natural speech and therefore captures authentic linguistic phenomena such as colloquialisms, spontaneous prosody, disfluencies, and frequent code-switching with Central Thai. A key challenge in building this resource is the lack of a standardized orthography for Isan. Current writing practices vary considerably because lexical tones differ between Thai and Isan. This variability complicates the design of transcription guidelines and raises questions about consistency, usability, and linguistic authenticity. To address these issues, we establish practical transcription protocols that balance representational accuracy against the requirements of computational processing. By releasing this dataset as an open resource, we aim to contribute to inclusive AI development, support research on underrepresented languages, and provide a basis for addressing the linguistic and technical challenges inherent in modeling conversational speech.

Topics & Keywords

cs.CL

Authors (4)

Adisai Na-Thalang

Chanakan Wittayasakpan

Kritsadha Phatcharoen

Supakit Buakaw

Citation Format

Na-Thalang, A., Wittayasakpan, C., Phatcharoen, K., & Buakaw, S. (2025). Developing an Open Conversational Speech Corpus for the Isan Language. https://arxiv.org/abs/2511.21229

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓