arXiv Open Access 2026

CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

Muhammad Shakeel, Yosuke Fukumoto, Chikara Maeda, Chyi-Jiunn Lin, Shinji Watanabe

Abstract

We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.
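To make the two conditioning signals described in the abstract concrete, the following is a minimal, hypothetical sketch of how a target-speaker embedding (acoustic cue) and bias-phrase embeddings (linguistic cue) could be fused with encoder features before decoding. This is not the authors' implementation: the module names, dimensions, and the fusion scheme (FiLM-style speaker conditioning plus cross-attention over bias phrases) are illustrative assumptions only.

```python
# Hypothetical sketch of joint acoustic-linguistic conditioning for
# personalized multi-speaker ASR; all names and design choices are assumed.
import torch
import torch.nn as nn


class SpeakerConditioning(nn.Module):
    """Scale/shift encoder features with a target-speaker embedding (FiLM-style)."""
    def __init__(self, feat_dim: int, spk_dim: int):
        super().__init__()
        self.scale = nn.Linear(spk_dim, feat_dim)
        self.shift = nn.Linear(spk_dim, feat_dim)

    def forward(self, feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) encoder frames; spk_emb: (B, S) speaker embedding
        gamma = self.scale(spk_emb).unsqueeze(1)  # (B, 1, D)
        beta = self.shift(spk_emb).unsqueeze(1)   # (B, 1, D)
        return feats * (1.0 + gamma) + beta


class ContextualBiasing(nn.Module):
    """Attend from encoder frames to embeddings of user-specific bias phrases."""
    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, feats: torch.Tensor, bias_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) queries; bias_emb: (B, N, D) keys/values, one per phrase
        ctx, _ = self.attn(feats, bias_emb, bias_emb)
        return self.norm(feats + ctx)


if __name__ == "__main__":
    B, T, D, S, N = 2, 50, 256, 192, 8    # batch, frames, feat dim, spk dim, #phrases
    feats = torch.randn(B, T, D)          # encoder output for a speech mixture
    spk_emb = torch.randn(B, S)           # target-speaker embedding (acoustic cue)
    bias_emb = torch.randn(B, N, D)       # bias-phrase embeddings (linguistic cue)

    conditioned = SpeakerConditioning(D, S)(feats, spk_emb)
    fused = ContextualBiasing(D)(conditioned, bias_emb)
    print(fused.shape)  # torch.Size([2, 50, 256]) -> passed on to the ASR decoder
```

In such a setup the speaker embedding steers the encoder toward the target speaker's frames in the mixture, while the bias-phrase attention raises the likelihood of user-specific vocabulary at decoding time; how CALM actually realizes these two steps is detailed in the paper itself.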


Authors (5)

Muhammad Shakeel
Yosuke Fukumoto
Chikara Maeda
Chyi-Jiunn Lin
Shinji Watanabe

Citation

Shakeel, M., Fukumoto, Y., Maeda, C., Lin, C.-J., & Watanabe, S. (2026). CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR. https://arxiv.org/abs/2601.22792

Article Information
Year: 2026
Language: English
Source: arXiv
Access: Open Access