arXiv · Open Access · 2023

Visual-Aware Text-to-Speech

Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei

Abstract

Dynamically synthesizing speech that actively responds to a listening head is critical in face-to-face interaction. For example, a speaker can use the listener's facial expressions to adjust tone, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task: synthesizing speech conditioned on both textual input and the sequential visual feedback (e.g., nods, smiles) of the listener in face-to-face communication. Unlike traditional text-to-speech, VA-TTS highlights the impact of the visual modality. For this newly minted task, we devise a baseline model that fuses phoneme-level linguistic information with the listener's visual signals for speech synthesis. Extensive experiments on the multimodal conversation dataset ViCo-X verify that our proposal generates more natural audio with scenario-appropriate rhythm and prosody.
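The abstract describes the baseline as fusing phoneme linguistic information with the listener's visual signals. As a rough illustration of how such conditioning could work, here is a minimal PyTorch sketch in which phoneme states attend to listener video frames via cross-attention; the module names, dimensions, and the cross-attention design are illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of a VA-TTS-style fusion module (not the authors' code).
# A phoneme encoder and a listener-video projection produce per-step features;
# cross-attention lets each phoneme state attend to the listener's visual
# feedback, and the fused features would feed a TTS decoder downstream.
import torch
import torch.nn as nn


class VisualAwareFusion(nn.Module):
    def __init__(self, n_phonemes=70, d_model=256, n_heads=4, visual_dim=128):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        # Project frame-level visual features (e.g., facial-expression
        # embeddings) into the model dimension.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.phoneme_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )
        # Queries: phoneme states; keys/values: listener visual frames.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, phoneme_ids, visual_frames):
        # phoneme_ids: (B, T_text) int64
        # visual_frames: (B, T_vis, visual_dim) float32
        h = self.phoneme_enc(self.phoneme_emb(phoneme_ids))
        v = self.visual_proj(visual_frames)
        fused, _ = self.cross_attn(query=h, key=v, value=v)
        # Residual fusion: linguistic content plus visually conditioned
        # adjustments, the cue a model could use to adapt rhythm and prosody.
        return self.norm(h + fused)


if __name__ == "__main__":
    model = VisualAwareFusion()
    out = model(torch.randint(0, 70, (2, 30)), torch.randn(2, 100, 128))
    print(out.shape)  # torch.Size([2, 30, 256])
```

In a full pipeline, the fused representation would feed a standard TTS decoder and vocoder, letting the visual cues modulate tone, stress, and pauses as the abstract suggests.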

Authors (6)

Mohan Zhou
Yalong Bai
Wei Zhang
Ting Yao
Tiejun Zhao
Tao Mei

Citation

Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., & Mei, T. (2023). Visual-Aware Text-to-Speech. arXiv preprint arXiv:2306.12020. https://arxiv.org/abs/2306.12020

Journal Information
Publication Year: 2023
Language: en
Source Database: arXiv
Access: Open Access ✓