arXiv Open Access 2023

LLark: A Multimodal Instruction-Following Language Model for Music

Josh Gardner Simon Durand Daniel Stoller Rachel M. Bittner

Abstract

Music has a unique and complex structure which is challenging for both expert humans and existing AI systems to understand, and presents unique challenges relative to other forms of audio. We present LLark, an instruction-tuned multimodal model for music understanding. We detail our process for dataset creation, which involves augmenting the annotations of diverse open-source music datasets and converting them to a unified instruction-tuning format. We propose a multimodal architecture for LLark, integrating a pretrained generative model for music with a pretrained language model. In evaluations on three types of tasks (music understanding, captioning, reasoning), we show that LLark matches or outperforms existing baselines in music understanding, and that humans show a high degree of agreement with its responses in captioning and reasoning tasks. LLark is trained entirely from open-source music data and models, and we make our training code available along with the release of this paper. Additional results and audio examples are at https://bit.ly/llark, and our source code is available at https://github.com/spotify-research/llark.

Topics & Keywords

cs.SD cs.LG eess.AS

Citation Format

Gardner, J., Durand, S., Stoller, D., & Bittner, R. M. (2023). LLark: A Multimodal Instruction-Following Language Model for Music. https://arxiv.org/abs/2310.07160

Journal Information

Publication Year: 2023
Language: en
Database Source: arXiv
Access: Open Access ✓