arXiv Open Access 2025

Serenade: A Singing Style Conversion Framework Based On Audio Infilling

Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

Abstract

We propose Serenade, a novel framework for the singing style conversion (SSC) task. Although singer identity conversion has made great strides in recent years, converting a singer's singing style remains an unexplored research area. We identify three main challenges in SSC: modeling the target style, disentangling the source style, and retaining the source melody. To model the target singing style, we use an audio infilling task: a flow-matching model predicts a masked segment of the target mel-spectrogram, conditioned on the complement of the masked target mel-spectrogram and on disentangled acoustic features. To disentangle the source singing style, we use a cyclic training approach in which synthetic converted samples serve as source inputs and the original source mel-spectrogram is reconstructed as the target. Finally, to better retain the source melody, we investigate a post-processing module that uses a source-filter-based vocoder to resynthesize the converted waveforms with the original F0 patterns. Our results show that the Serenade framework can handle generalized SSC tasks with the best overall similarity score, especially in modeling breathy and mixed singing styles. We also found that resynthesizing with the original F0 patterns alleviated out-of-tune singing and improved naturalness, at the cost of a slight tradeoff in similarity because the F0 patterns are not changed into the target style.
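
The infilling objective described in the abstract can be illustrated with a small sketch of how one training example might be assembled. This is not the authors' implementation: the function names, the zero-filled context, the linear noise-to-data path, and the masked-frame loss are all assumptions chosen for illustration, and the network itself and the disentangled acoustic features are omitted.

```python
import random

def make_infill_mask(n_frames, start, length):
    """True marks frames the model must predict; False marks visible context."""
    return [start <= i < start + length for i in range(n_frames)]

def build_training_example(mel, mask, t, rng):
    """mel: list of frames, each a list of mel-bin values; t in [0, 1]."""
    noise = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in mel]
    # Context is the complement of the masked segment (masked frames zeroed; an assumption).
    context = [[0.0] * len(f) if m else list(f) for f, m in zip(mel, mask)]
    # Assumed linear path from noise (t=0) to data (t=1).
    x_t = [[(1.0 - t) * n + t * x for n, x in zip(nf, xf)]
           for nf, xf in zip(noise, mel)]
    # Regression target: the constant velocity along that path.
    velocity = [[x - n for n, x in zip(nf, xf)] for nf, xf in zip(noise, mel)]
    return context, x_t, velocity

def masked_mse(pred, target, mask):
    """Mean squared error over masked frames only (hypothetical loss choice)."""
    errs = [(p - q) ** 2
            for pf, qf, m in zip(pred, target, mask) if m
            for p, q in zip(pf, qf)]
    return sum(errs) / len(errs) if errs else 0.0

# Toy "mel-spectrogram": 6 frames with 4 bins each.
rng = random.Random(0)
mel = [[float(i + j) for j in range(4)] for i in range(6)]
mask = make_infill_mask(len(mel), start=2, length=3)
context, x_t, velocity = build_training_example(mel, mask, t=0.5, rng=rng)
```

In a real system, a network would take `x_t`, `t`, `context`, and the acoustic features as input, and `masked_mse` would compare its output against `velocity`.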

Topics & Keywords

cs.SD, eess.AS

Authors (3)

Lester Phillip Violeta

Wen-Chin Huang

Tomoki Toda

Citation Format

Violeta, L. P., Huang, W.-C., & Toda, T. (2025). Serenade: A Singing Style Conversion Framework Based On Audio Infilling. https://arxiv.org/abs/2503.12388

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓