arXiv Open Access 2025

Multi-interaction TTS toward professional recording reproduction

Hiroki Kanagawa Kenichi Fujita Aya Watanabe Yusuke Ijima

Abstract

Voice directors often iteratively refine voice actors' performances by providing feedback to achieve the desired outcome. While this iterative feedback-based refinement process is important in actual recordings, it has been overlooked in text-to-speech (TTS) synthesis. As a result, fine-grained style refinement after the initial synthesis is not possible, even though the synthesized speech often deviates from the user's intended style. To address this issue, we propose a TTS method with multi-step interaction that allows users to intuitively and rapidly refine synthesized speech. Our approach models the interaction between the TTS model and its user to emulate the relationship between voice actors and voice directors. Experiments show that the proposed model with its corresponding dataset enables iterative style refinements in accordance with users' directions, thus demonstrating its multi-interaction capability. Audio samples are available at https://ntt-hilab-gensp.github.io/ssw13multiinteractiontts/

Topics & Keywords

cs.SD cs.CL eess.AS

Citation Format

Kanagawa, H., Fujita, K., Watanabe, A., & Ijima, Y. (2025). Multi-interaction TTS toward professional recording reproduction. https://arxiv.org/abs/2507.00808

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access