arXiv Open Access 2026

MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio

Qihao Zhao, Yunqi Cao, Yangyu Huang, Hui Yi Leong, Fan Zhang, +2 others

Abstract

Despite recent advances in multimodal large language models (MLLMs), their ability to understand and interact with music remains limited. Music understanding requires grounded reasoning over symbolic scores and expressive performance audio, which general-purpose MLLMs often fail to handle due to insufficient perceptual grounding. We introduce MuseAgent, a music-centric multimodal agent that augments language models with structured symbolic representations derived from sheet music images and performance audio. By integrating optical music recognition and automatic music transcription modules, MuseAgent enables multi-step reasoning and interaction over fine-grained musical content. To systematically evaluate music understanding capabilities, we further propose MuseBench, a benchmark covering music theory reasoning, score interpretation, and performance-level analysis across text, image, and audio modalities. Experiments show that existing MLLMs perform poorly on these tasks, while MuseAgent achieves substantial improvements, highlighting the importance of structured multimodal grounding for interactive music understanding.

Topics & Keywords

cs.MM cs.SD eess.AS

Authors (7)

Qihao Zhao

Yunqi Cao

Yangyu Huang

Hui Yi Leong

Fan Zhang

Kim-Hui Yap

Wei Hu

Citation Format

Zhao, Q., Cao, Y., Huang, Y., Leong, H. Y., Zhang, F., Yap, K.-H., & Hu, W. (2026). MuseAgent-1: Interactive Grounded Multimodal Understanding of Music Scores and Performance Audio. https://arxiv.org/abs/2601.11968

Journal Information

Publication Year: 2026
Language: en
Database Source: arXiv
Access: Open Access ✓