Semantic Scholar, Open Access, 2023, 10 citations

CMCU-CSS: Enhancing Naturalness via Commonsense-based Multi-modal Context Understanding in Conversational Speech Synthesis

Yayue Deng Jinlong Xue Fengping Wang Yingming Gao Ya Li

Abstract

Conversational Speech Synthesis (CSS) aims to produce speech appropriate for oral communication. However, modeling context dependency is complex and poses significant challenges for CSS, especially in capturing the mutual psychological influence between interlocutors. Previous studies have verified that prior commonsense knowledge helps machines understand subtle psychological information (e.g., feelings and intentions) in spontaneous oral dialogues. Therefore, to enhance context understanding and improve the naturalness of synthesized speech, we propose a novel conversational speech synthesis system (CMCU-CSS) that incorporates a Commonsense-based Multi-modal Context Understanding (CMCU) module to model the dynamic emotional interaction among interlocutors. Specifically, we first use three implicit states in CMCU (intent state, internal state, and external state) to model inter-speaker and intra-speaker context dependency with the help of commonsense knowledge. Furthermore, we infer emotion vectors from the fusion of these implicit states and multi-modal features to enhance the emotion discriminability of synthesized speech. This is the first attempt to combine commonsense knowledge with conversational speech synthesis, and its effect on the emotion discriminability of synthetic speech is evaluated with an emotion recognition in conversation task. The results of subjective and objective evaluations demonstrate that the CMCU-CSS model produces more natural speech with context-appropriate emotion and achieves the best emotion discriminability, surpassing other conversational speech synthesis models.
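
The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the states are fused, so the concatenation, the single linear projection, the softmax, the dimensions, and every name below (`fuse_and_infer_emotion`, `weights`, `bias`, `n_emotions`) are assumptions chosen only to make the idea concrete. The random vectors stand in for learned state and feature representations.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_infer_emotion(intent, internal, external, multimodal, weights, bias):
    """Hypothetical fusion: concatenate the three implicit states with the
    multi-modal features, then project to an emotion vector.

    Assumption: concatenation plus one linear layer; the paper may use a
    different fusion mechanism.
    """
    fused = intent + internal + external + multimodal  # list concatenation
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    # Assumption: normalize to a distribution so the vector is easy to inspect.
    return softmax(logits)


# Placeholder inputs (random values, not real model outputs).
rng = random.Random(0)
dim = 4          # hypothetical size of each state / feature vector
n_emotions = 3   # hypothetical number of emotion dimensions


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


intent_state = rand_vec(dim)
internal_state = rand_vec(dim)
external_state = rand_vec(dim)
multimodal_features = rand_vec(dim)

weights = [rand_vec(4 * dim) for _ in range(n_emotions)]
bias = [0.0] * n_emotions

emotion_vector = fuse_and_infer_emotion(
    intent_state, internal_state, external_state,
    multimodal_features, weights, bias,
)
print(emotion_vector)
```

In a trained system the projection weights would be learned jointly with the synthesizer, and the emotion vector would condition the acoustic model rather than be printed.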

Citation Format

Deng, Y., Xue, J., Wang, F., Gao, Y., Li, Y. (2023). CMCU-CSS: Enhancing Naturalness via Commonsense-based Multi-modal Context Understanding in Conversational Speech Synthesis. https://doi.org/10.1145/3581783.3612565

Journal Information

Year Published: 2023
Language: en
Total Citations: 10
Source Database: Semantic Scholar
DOI: 10.1145/3581783.3612565
Access: Open Access