ICRCycleGAN-VC: a robust one-to-one voice conversion method based on CycleGAN and inception-ResNet blocks
Abstract
Voice conversion (VC) transforms a source speaker’s voice into that of a target speaker while preserving the underlying linguistic content. However, existing methods, especially for languages with complex phonetic structures such as Persian, often struggle with over-smoothing, inadequate multi-scale feature extraction, and loss of essential acoustic details. In this paper, we introduce ICRCycleGAN-VC, a one-to-one voice conversion framework that integrates Inception-ResNet modules into a CycleGAN architecture. By leveraging multi-scale convolutional filters, residual connections, and an optimized loss function strategy that eliminates the second adversarial losses in the generator, our approach significantly improves the preservation of linguistic content, which addresses the main challenge of accurate content retention. Extensive experiments on both Persian and English datasets show reductions in mel-cepstral distortion and root mean squared error compared to baseline models such as MaskCycleGAN-VC. Furthermore, subjective evaluations reveal a substantial increase in both voice similarity and naturalness. Ablation studies highlight the contribution of each architectural component, confirming the robustness of our approach in advancing non-parallel voice conversion.
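The two objective metrics named in the abstract, mel-cepstral distortion (MCD) and root mean squared error (RMSE), can be illustrated with a minimal sketch. It uses the commonly cited MCD definition in decibels and assumes the reference and converted frames are already time-aligned and that coefficient 0 (energy) is excluded. The abstract does not state the paper's cepstral order, alignment method, or which quantity the RMSE is computed on, so the function names and inputs here are hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Assumption: each frame is a sequence whose element 0 is the energy term,
    which is skipped, as is common practice.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("need the same non-zero number of aligned frames")
    scale = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        squared = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)


def rmse(ref_values, conv_values):
    """Root mean squared error between two equal-length sequences."""
    if not ref_values or len(ref_values) != len(conv_values):
        raise ValueError("need two equal-length, non-empty sequences")
    squared = sum((r - c) ** 2 for r, c in zip(ref_values, conv_values))
    return math.sqrt(squared / len(ref_values))
```

Lower values of both metrics indicate that the converted speech is closer to the target reference, which is the sense in which the abstract reports reductions.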
Authors (2)
Nayereh Seyed Afiuny
Amir Lakizadeh
Quick Access
- Year Published: 2025
- Database Source: DOAJ
- DOI: 10.1186/s13636-025-00422-5
- Access: Open Access ✓