Semantic Scholar Open Access 2023 10 citations

CMCU-CSS: Enhancing Naturalness via Commonsense-based Multi-modal Context Understanding in Conversational Speech Synthesis

Yayue Deng Jinlong Xue Fengping Wang Yingming Gao Ya Li

Abstract

Conversational Speech Synthesis (CSS) aims to produce speech appropriate for oral communication. However, the complexity of context dependency modeling, especially the mutual psychological influence between interlocutors, poses significant challenges in the field of CSS. Previous studies have verified that prior commonsense knowledge helps machines understand subtle psychological information (e.g., feelings and intentions) in spontaneous oral dialogues. Therefore, to enhance context understanding and improve the naturalness of synthesized speech, we propose a novel conversational speech synthesis system (CMCU-CSS) that incorporates the Commonsense-based Multi-modal Context Understanding (CMCU) module to model the dynamic emotional interaction among interlocutors. Specifically, we first utilize three implicit states (intent state, internal state and external state) in CMCU to model inter-speaker and intra-speaker context dependency with the help of commonsense knowledge. Furthermore, we infer emotion vectors from the fusion of these implicit states and multi-modal features to enhance the emotion discriminability of synthesized speech. This is the first attempt to combine commonsense knowledge with conversational speech synthesis, and its effect on the emotion discriminability of synthetic speech is evaluated with an emotion recognition in conversation task. The results of subjective and objective evaluations demonstrate that the CMCU-CSS model produces more natural speech with context-appropriate emotion and achieves the best emotion discriminability, surpassing that of other conversational speech synthesis models.
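
The fusion step described in the abstract (three implicit states combined with multi-modal features to infer an emotion vector) can be illustrated with a minimal sketch. The abstract does not specify the fusion operator, the feature dimensions, or the emotion classes, so everything below is an assumption made for illustration: concatenation followed by a single linear projection and a softmax stands in for whatever the paper actually uses, the weights are random placeholders rather than trained parameters, and the names `infer_emotion_vector`, `project`, and `softmax` are hypothetical, not taken from the authors' code.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(weights, vector):
    """Apply a linear projection: one output score per row of weights."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def infer_emotion_vector(intent_state, internal_state, external_state,
                         text_feat, audio_feat, weights):
    """Hypothetical fusion: concatenate the states and features, then project.

    Concatenation is an assumption; the abstract only says the emotion
    vector is inferred from "the fusion" of these inputs.
    """
    fused = intent_state + internal_state + external_state + text_feat + audio_feat
    return softmax(project(weights, fused))


# Toy inputs: random placeholders, not real model states or features.
rng = random.Random(0)
dim, n_emotions = 4, 3
inputs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
weights = [[rng.uniform(-1, 1) for _ in range(5 * dim)] for _ in range(n_emotions)]

emotion_vector = infer_emotion_vector(*inputs, weights)
print(emotion_vector)  # one probability per (assumed) emotion class
```

In this sketch the output is a distribution over three assumed emotion classes; in a real system the projection would be learned and the states would come from the context understanding module rather than a random generator.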

Topics & Keywords

Computer Science

Citation Format

Deng, Y., Xue, J., Wang, F., Gao, Y., & Li, Y. (2023). CMCU-CSS: Enhancing Naturalness via Commonsense-based Multi-modal Context Understanding in Conversational Speech Synthesis. https://doi.org/10.1145/3581783.3612565

Journal Information

Year Published: 2023
Language: en
Total Citations: 10
Source Database: Semantic Scholar
DOI: 10.1145/3581783.3612565
Access: Open Access ✓