arXiv · Open Access · 2023

Visual-Aware Text-to-Speech

Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei

Abstract

Dynamically synthesizing speech that actively responds to a listening head is critical in face-to-face interaction. For example, a speaker can use the listener's facial expressions to adjust tone, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task: synthesizing speech conditioned on both textual input and the sequential visual feedback (e.g., nods, smiles) of the listener in face-to-face communication. Unlike traditional text-to-speech, VA-TTS highlights the impact of the visual modality. For this newly minted task, we devise a baseline model that fuses phoneme-level linguistic information with the listener's visual signals for speech synthesis. Extensive experiments on the multimodal conversation dataset ViCo-X verify that our proposal generates more natural audio with scenario-appropriate rhythm and prosody.
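The abstract describes the baseline as fusing phoneme linguistic information with the listener's visual signals. As a rough illustration of how such conditioning could work, here is a minimal PyTorch sketch in which phoneme states attend to listener video frames via cross-attention; the module names, dimensions, and the cross-attention design are illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of a VA-TTS-style fusion module (not the authors' code).
# A phoneme encoder and a listener-video projection produce per-step features;
# cross-attention lets each phoneme state attend to the listener's visual
# feedback, and the fused features would feed a TTS decoder downstream.
import torch
import torch.nn as nn


class VisualAwareFusion(nn.Module):
    def __init__(self, n_phonemes=70, d_model=256, n_heads=4, visual_dim=128):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        # Project frame-level visual features (e.g., facial-expression
        # embeddings) into the model dimension.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.phoneme_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )
        # Queries: phoneme states; keys/values: listener visual frames.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, phoneme_ids, visual_frames):
        # phoneme_ids: (B, T_text) int64
        # visual_frames: (B, T_vis, visual_dim) float32
        h = self.phoneme_enc(self.phoneme_emb(phoneme_ids))
        v = self.visual_proj(visual_frames)
        fused, _ = self.cross_attn(query=h, key=v, value=v)
        # Residual fusion: linguistic content plus visually conditioned
        # adjustments, the cue a model could use to adapt rhythm and prosody.
        return self.norm(h + fused)


if __name__ == "__main__":
    model = VisualAwareFusion()
    out = model(torch.randint(0, 70, (2, 30)), torch.randn(2, 100, 128))
    print(out.shape)  # torch.Size([2, 30, 256])
```

In a full pipeline, the fused representation would feed a standard TTS decoder and vocoder, letting the visual cues modulate tone, stress, and pauses as the abstract suggests.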

Authors (6)

Mohan Zhou
Yalong Bai
Wei Zhang
Ting Yao
Tiejun Zhao
Tao Mei

Citation

Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., & Mei, T. (2023). Visual-Aware Text-to-Speech. arXiv preprint arXiv:2306.12020. https://arxiv.org/abs/2306.12020

Journal Information
Publication Year: 2023
Language: en
Source Database: arXiv
Access: Open Access ✓