arXiv Open Access 2025

Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings

Dyah A. M. G. Wisnu, Ryandhimas E. Zezario, Stefano Rini, Hsin-Min Wang, Yu Tsao

Abstract

We present a system for automatic multi-axis perceptual quality prediction of generative audio, developed for Track 2 of the AudioMOS Challenge 2025. The task is to predict four Audio Aesthetic Scores (Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness) for audio generated by text-to-speech (TTS), text-to-audio (TTA), and text-to-music (TTM) systems. A key challenge is the domain shift between the natural audio used for training and the synthetic audio used for evaluation. To address this, we combine BEATs, a pretrained transformer-based audio representation model, with a multi-branch long short-term memory (LSTM) predictor, and we train with a triplet loss with buffer-based sampling to structure the embedding space by perceptual similarity. Our results show that this approach improves embedding discriminability and generalization, enabling domain-robust audio quality assessment without synthetic training data.
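
The triplet loss with buffer-based sampling mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the sampling rule, margin, or buffer size, so everything here is an assumption for illustration: the `TripletBuffer` class, its FIFO capacity, the rule "positive = buffered item with the closest score, negative = item with the farthest score", and the toy 2-D embeddings and scores. It is not the authors' implementation.

```python
import math
from collections import deque


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard hinge triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


class TripletBuffer:
    """Hypothetical FIFO buffer of (embedding, score) pairs from recent batches."""

    def __init__(self, capacity=256):
        self.items = deque(maxlen=capacity)

    def add(self, embedding, score):
        self.items.append((embedding, score))

    def sample(self, anchor_score):
        """Assumed rule: positive has the closest score, negative the farthest."""
        if len(self.items) < 2:
            return None
        ranked = sorted(self.items, key=lambda item: abs(item[1] - anchor_score))
        return ranked[0][0], ranked[-1][0]


# Toy usage with made-up embeddings and perceptual scores.
buffer = TripletBuffer(capacity=4)
buffer.add([0.0, 0.0], 2.0)
buffer.add([1.0, 0.0], 7.5)
buffer.add([0.9, 0.1], 8.0)

anchor_emb, anchor_score = [1.0, 0.1], 7.8
positive, negative = buffer.sample(anchor_score)
loss = triplet_loss(anchor_emb, positive, negative)
```

In this toy case the anchor already lies close to the item with a similar score and far from the item with a dissimilar score, so the hinge is inactive and the loss is zero. A nonzero loss arises only when a perceptually dissimilar item sits closer in embedding space than the margin allows, which is the situation the loss is meant to penalize.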

Topics & Keywords

eess.AS, cs.LG, cs.SD

Citation Format

Wisnu, D. A. M. G., Zezario, R. E., Rini, S., Wang, H.-M., & Tsao, Y. (2025). Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings. https://arxiv.org/abs/2509.03292

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓