MART: Learning Hierarchical Music Audio Representations with Part-Whole Transformer
Abstrak
Recent research in self-supervised contrastive learning of music representations has demonstrated remarkable results across diverse downstream tasks. However, a prevailing trend in existing methods involves representing equally-sized music clips in either waveform or spectrogram formats, often overlooking the intrinsic part-whole hierarchies within music. In our quest to comprehend the bottom-up structure of music, we introduce MART, a hierarchical music representation learning approach that facilitates feature interactions among cropped music clips while considering their part-whole hierarchies. Specifically, we propose a hierarchical part-whole transformer to capture the structural relationships between music clips in a part-whole hierarchy. Furthermore, a hierarchical contrastive learning objective is crafted to align part-whole music representations at adjacent levels, progressively establishing a multi-hierarchy representation space. The effectiveness of our music representation learning from part-whole hierarchies has been empirically validated across multiple downstream tasks, including music classification and cover song identification.
Penulis (9)
Dong Yao
Jieming Zhu
Jiahao Xun
Shengyu Zhang
Zhou Zhao
Liqun Deng
Wenqiao Zhang
Zhenhua Dong
Xin Jiang
Akses Cepat
- Tahun Terbit
- 2023
- Bahasa
- en
- Sumber Database
- arXiv
- Akses
- Open Access ✓