MUSE-Explainer: Counterfactual Explanations for Symbolic Music Graph Classification Models
Baptiste Hilaire, Emmanouil Karystinaios, Gerhard Widmer
Interpretability is essential for deploying deep learning models in symbolic music analysis, yet most research emphasizes model performance over explanation. To address this, we introduce MUSE-Explainer, a new method that helps reveal how music Graph Neural Network models make decisions by providing clear, human-friendly explanations. Our approach generates counterfactual explanations by making small, meaningful changes to musical score graphs that alter a model's prediction while ensuring the results remain musically coherent. Unlike existing methods, MUSE-Explainer tailors its explanations to the structure of musical data and avoids unrealistic or confusing outputs. We evaluate our method on a music analysis task and show it offers intuitive insights that can be visualized with standard music tools such as Verovio.
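To make the counterfactual idea concrete, here is a minimal sketch (our own illustration, not the authors' released code) of a greedy search over score-graph edits: `model` is assumed to be any callable returning class probabilities as an array, and `candidate_edits` is a hypothetical generator of pure edit functions (e.g., shifting one note's pitch) that each return an edited copy of the graph.

```python
# Illustrative greedy counterfactual search for a graph classifier.
# All names (model, candidate_edits) are assumptions, not MUSE-Explainer's API.
def counterfactual_search(graph, model, candidate_edits, target_class, max_edits=3):
    applied = []
    for _ in range(max_edits):
        probs = model(graph)
        if probs.argmax() == target_class:
            return graph, applied                   # prediction flipped
        best_edit, best_gain = None, 0.0
        for edit in candidate_edits(graph):         # e.g. shift one note's pitch
            gain = model(edit(graph))[target_class] - probs[target_class]
            if gain > best_gain:
                best_edit, best_gain = edit, gain
        if best_edit is None:
            break                                   # no single edit helps
        graph = best_edit(graph)
        applied.append(best_edit)
    return graph, applied
```

Restricting `candidate_edits` to musically valid operations is what would keep the resulting counterfactuals coherent, in the spirit of the method described above.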
PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music
Hayeon Bang, Eunjin Choi, Seungheon Doh
et al.
Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music. Furthermore, existing piano-specific representation models are typically unimodal, failing to capture the inherently multimodal nature of piano music, expressed through audio, symbolic, and textual modalities. To address these limitations, we propose PianoBind, a piano-specific multimodal joint embedding model. We systematically investigate strategies for multi-source training and modality utilization within a joint embedding framework optimized for capturing fine-grained semantic distinctions in (1) small-scale and (2) homogeneous piano datasets. Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models. Moreover, our design choices offer reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.
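Joint embedding models of this kind are commonly trained with a symmetric CLIP-style contrastive objective. The sketch below shows that standard loss as an assumption about the general recipe, not PianoBind's verified training code:

```python
# Standard symmetric contrastive (InfoNCE) loss over paired embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    audio_emb = F.normalize(audio_emb, dim=-1)          # unit-length vectors
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.T / temperature       # pairwise similarities
    targets = torch.arange(len(audio_emb))              # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2     # audio-to-text and text-to-audio
```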
Evaluating Interval-based Tokenization for Pitch Representation in Symbolic Music Analysis
Dinh-Viet-Toan Le, Louis Bigo, Mikaela Keller
Symbolic music analysis tasks are often performed by models originally developed for Natural Language Processing, such as Transformers. Such models require the input data to be represented as sequences, which is achieved through a process of tokenization. Tokenization strategies for symbolic music often rely on absolute MIDI values to represent pitch information. However, music research largely promotes the benefits of higher-level representations such as melodic contour and harmonic relations, for which pitch intervals turn out to be more expressive than absolute pitches. In this work, we introduce a general framework for building interval-based tokenizations. By evaluating these tokenizations on three music analysis tasks, we show that such interval-based tokenizations improve model performance and facilitate explainability.
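As a toy example of the underlying idea, the snippet below encodes a pitch sequence as signed melodic intervals instead of absolute MIDI numbers; the token format is hypothetical and not the paper's exact vocabulary:

```python
# One possible interval-based pitch encoding: keep the first pitch as an
# anchor, then encode each following note as a signed semitone step.
def pitch_to_interval_tokens(midi_pitches: list[int]) -> list[str]:
    tokens = [f"PitchStart_{midi_pitches[0]}"]
    for prev, curr in zip(midi_pitches, midi_pitches[1:]):
        tokens.append(f"Interval_{curr - prev:+d}")
    return tokens

# Example: an ascending C-major arpeggio, C4 E4 G4 C5
print(pitch_to_interval_tokens([60, 64, 67, 72]))
# ['PitchStart_60', 'Interval_+4', 'Interval_+3', 'Interval_+5']
```

A transposed melody yields identical interval tokens, which is exactly the invariance that makes such representations attractive for contour- and harmony-oriented tasks.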
Music Discovery Dialogue Generation Using Human Intent Analysis and Large Language Models
SeungHeon Doh, Keunwoo Choi, Daeyong Kwon
et al.
A conversational music retrieval system can help users discover music that matches their preferences through dialogue. To achieve this, such a system should seamlessly engage in multi-turn conversation by 1) understanding user queries and 2) responding with natural language and retrieved music. A straightforward solution would be a data-driven approach utilizing logs of such conversations. However, few datasets are available for this research, and they are limited in terms of volume and quality. In this paper, we present a data generation framework for rich music discovery dialogue that uses a large language model (LLM) together with user intents, system actions, and musical attributes. This is done by i) analyzing dialogue intents using grounded theory, ii) generating attribute sequences via cascading database filtering, and iii) generating utterances with large language models. By applying this framework to the Million Song Dataset, we create LP-MusicDialog, a Large Language Model based Pseudo Music Dialogue dataset, containing over 288k music conversations using more than 319k music items. Our evaluation shows that the synthetic dataset is competitive with an existing, small human dialogue dataset in terms of dialogue consistency, item relevance, and naturalness. Furthermore, we train a conversational music retrieval model on the dataset and show promising results.
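Step ii), cascading database filtering, can be pictured with the toy sketch below; the field names and turn structure are hypothetical, not the LP-MusicDialog schema:

```python
# Each dialogue turn adds one attribute constraint that narrows the
# candidate pool, yielding an attribute sequence plus retrieval state.
def cascade_filter(tracks: list[dict], attribute_turns: list[tuple[str, str]]):
    """attribute_turns: e.g. [("genre", "jazz"), ("mood", "relaxing")]."""
    candidates = tracks
    for key, value in attribute_turns:
        candidates = [t for t in candidates if t.get(key) == value]
        if not candidates:
            break                              # no track satisfies all constraints
        yield key, value, len(candidates)      # state available per turn
```

An LLM can then verbalize each `(key, value, pool size)` step into a user query and a system response, which is the role of step iii) above.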
MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation
Yun-Han Lan, Wen-Yi Hsiao, Hao-Chung Cheng
et al.
Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features, such as the chords and rhythm, of the generated music. To address this challenge, we introduce MusiConGen, a temporally conditioned Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically extracted rhythm and chords as the condition signal. During inference, the condition can either be musical features extracted from a reference audio signal or a user-defined symbolic chord sequence, BPM, and textual prompt. Our performance evaluation on two datasets -- one derived from extracted features and the other from user-created inputs -- demonstrates that MusiConGen can generate realistic backing-track music that aligns well with the specified conditions. We open-source the code and model checkpoints, and provide audio examples online at https://musicongen.github.io/musicongen_demo/.
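To illustrate what a symbolic chord-plus-BPM condition might look like once aligned to model frames, here is a small sketch; the frame rate, bar length, and label format are our assumptions, not MusiConGen's actual interface:

```python
# Expand a user-defined chord sequence at a given BPM into a
# frame-aligned condition track (one label per model frame).
def chords_to_frames(chords: list[str], bpm: float, beats_per_chord: int = 4,
                     frame_rate: float = 50.0) -> list[str]:
    seconds_per_chord = beats_per_chord * 60.0 / bpm
    frames_per_chord = round(seconds_per_chord * frame_rate)
    return [c for c in chords for _ in range(frames_per_chord)]

# One bar each of C, Am, F, G at 120 BPM -> 2 s per chord, 100 frames each
frames = chords_to_frames(["C", "Am", "F", "G"], bpm=120.0)
```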
Do Captioning Metrics Reflect Music Semantic Alignment?
Jinwoo Lee, Kyogu Lee
Music captioning has emerged as a promising task, fueled by the advent of advanced language generation models. However, the evaluation of music captioning relies heavily on traditional metrics such as BLEU, METEOR, and ROUGE, which were developed for other domains, without proper justification for their use in this new field. We present cases where traditional metrics are vulnerable to syntactic changes and show that they do not correlate well with human judgments. By addressing these issues, we aim to emphasize the need for a critical reevaluation of how music captions are assessed.
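The failure mode the paper targets is easy to reproduce. In this toy demonstration (our own example sentences, not the paper's experiments; requires `pip install nltk`), a meaning-preserving paraphrase scores far lower than a caption that copies the reference's wording but changes its meaning:

```python
# n-gram overlap rewards surface similarity, not semantic alignment.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a calm piano piece with a slow tempo and a soft melody".split()
paraphrase = "a slow soft piano track that feels calm and melodic".split()   # same meaning
corrupted = "a calm piano piece with a slow tempo and a loud drum".split()   # changed meaning

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # low score
print(sentence_bleu([reference], corrupted, smoothing_function=smooth))   # much higher score
```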
Artistic Transformations of the Father Image in Ukrainian Cinema of the Stalinist Period
V. Nykonenko
The article examines the features of artistic interpretations of the father archetype in Ukrainian cinema during the rule of Joseph Stalin. Using the films "Ivan" (1932), "The Magic Garden" (1935), "Aerograd" (1935), "Partisans in the Steppes of Ukraine" (1942), and "The Third Blow" (1948) as examples, it analyses the political and sociocultural factors that fundamentally shaped the treatment of the on-screen father character in this period. In line with Jung's theory of the archetypes of the collective unconscious, the father figure is considered in its broadest symbolic sense: the paternal role for an individual can be played by an entire social institution or by the apparatus of state power. For this reason, considerable attention is paid to the relations between the Soviet authorities and the citizens of the Ukrainian SSR. Filmmakers of the time often represented these ties by drawing on classical models of family relations: representatives of the Bolshevik system were cast as guardians, and the population as their wards.
Review-Essay on the Book "Tìm Hiểu Âm Nhạc Tài Tử Cải Lương" by Kiều Tấn. 2023. Ho Chi Minh City: Nhà Xuất Bản Tổng Hợp Thành Phố Hồ Chí Minh
Gisa Jähnichen
This is a short review-essay on the book "Tìm Hiểu Âm Nhạc Tài Tử Cải Lương" by Kiều Tấn. Ho Chi Minh City: Nhà xuất bản tổng hợp Thành phố Hồ Chí Minh.
Predicting Music Hierarchies with a Graph-Based Neural Decoder
Francesco Foscarin, Daniel Harasim, Gerhard Widmer
This paper describes a data-driven framework to parse musical sequences into dependency trees, which are hierarchical structures used in music cognition research and music analysis. The parsing involves two steps. First, the input sequence is passed through a transformer encoder to enrich it with contextual information. Then, a classifier filters the graph of all possible dependency arcs to produce the dependency tree. One major benefit of this system is that it can be easily integrated into modern deep-learning pipelines. Moreover, since it does not rely on any particular symbolic grammar, it can consider multiple musical features simultaneously, make use of sequential context information, and produce partial results for noisy inputs. We test our approach on two datasets of musical trees -- time-span trees of monophonic note sequences and harmonic trees of jazz chord sequences -- and show that our approach outperforms previous methods.
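The second step, filtering the complete graph of candidate arcs into a tree, can be pictured with the toy decoder below. The greedy cycle-avoiding head selection is our simplification for illustration, not necessarily the authors' exact post-processing:

```python
# Given arc scores from the network, pick each note's best head while
# rejecting choices that would create a cycle (greedy approximation).
import numpy as np

def greedy_dependency_tree(arc_scores: np.ndarray) -> list[tuple[int, int]]:
    """arc_scores[h, d] = score of arc head h -> dependent d.
    Node 0 is an artificial root. Returns (head, dependent) pairs."""
    n = arc_scores.shape[0]
    heads = [-1] * n
    for d in range(1, n):                      # the root keeps no head
        for h in np.argsort(-arc_scores[:, d]):   # best-scoring head first
            heads[d] = int(h)
            if not _has_cycle(heads):
                break
            heads[d] = -1                      # cycle: try the next-best head
    return [(h, d) for d, h in enumerate(heads) if h >= 0]

def _has_cycle(heads: list[int]) -> bool:
    for start in range(len(heads)):
        seen, node = set(), start
        while node != -1 and heads[node] != -1:
            if node in seen:
                return True
            seen.add(node)
            node = heads[node]
    return False
```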
LP-MusicCaps: LLM-Based Pseudo Music Captioning
SeungHeon Doh, Keunwoo Choi, Jongpil Lee
et al.
Automatic music captioning, which generates natural language descriptions for given music tracks, holds significant potential for enhancing the understanding and organization of large volumes of musical data. Despite its importance, researchers face challenges due to the costly and time-consuming collection process of existing music-language datasets, which are limited in size. To address this data scarcity issue, we propose the use of large language models (LLMs) to artificially generate description sentences from large-scale tag datasets. This results in approximately 2.2M captions paired with 0.5M audio clips. We term it Large Language Model based Pseudo music caption dataset, shortly, LP-MusicCaps. We conduct a systematic evaluation of the large-scale music captioning dataset with various quantitative evaluation metrics used in the field of natural language processing as well as human evaluation. In addition, we train a transformer-based music captioning model with the dataset and evaluate it under zero-shot and transfer-learning settings. The results demonstrate that our proposed approach outperforms the supervised baseline model.
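The tag-to-caption step can be sketched as follows; the instruction wording and the generic `llm` completion callable are hypothetical, not the paper's exact prompt or API:

```python
# Turn a track's tag list into a pseudo-caption generation prompt.
def tags_to_caption_prompt(tags: list[str]) -> str:
    return (
        "Write a one-sentence natural-language description of a music track "
        f"with the following tags: {', '.join(tags)}."
    )

prompt = tags_to_caption_prompt(["jazz", "piano", "relaxing", "instrumental"])
# caption = llm(prompt)  # e.g. "A relaxing instrumental jazz piece led by piano."
```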
Passage Summarization with Recurrent Models for Audio-Sheet Music Retrieval
Luis Carvalho, Gerhard Widmer
Many applications of cross-modal music retrieval are related to connecting sheet music images to audio recordings. A typical and recent approach to this is to learn, via deep neural networks, a joint embedding space that correlates short fixed-size snippets of audio and sheet music by means of an appropriate similarity structure. However, two challenges that arise out of this strategy are the requirement of strongly aligned data to train the networks, and the inherent discrepancies of musical content between audio and sheet music snippets caused by local and global tempo differences. In this paper, we address these two shortcomings by designing a cross-modal recurrent network that learns joint embeddings that can summarize longer passages of corresponding audio and sheet music. The benefits of our method are that it requires only weakly aligned audio-sheet music pairs and that the recurrent network handles the non-linearities caused by tempo variations between audio and sheet music. We conduct a number of experiments on synthetic and real piano data and scores, showing that our proposed recurrent method leads to more accurate retrieval in all possible configurations.
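Once passage-level embeddings are learned, retrieval reduces to nearest-neighbour search in the joint space. The generic sketch below (our illustration, not the authors' evaluation code) ranks sheet-music passages by cosine similarity to an audio query:

```python
# Cosine-similarity retrieval in a learned joint embedding space.
import numpy as np

def retrieve(query_emb: np.ndarray, sheet_embs: np.ndarray, k: int = 5):
    q = query_emb / np.linalg.norm(query_emb)
    s = sheet_embs / np.linalg.norm(sheet_embs, axis=1, keepdims=True)
    sims = s @ q                      # cosine similarity per passage
    return np.argsort(-sims)[:k]      # indices of the top-k matches
```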
The Ritual Music, Culture and Nature of the Chiswa Harvesting among Babukusu of Western Kenya
Mukasa Situma Wafula
Termite harvesting remains an important ritual and cultural activity in the Bukusu community of Western Kenya. Termites are a delicacy that people enjoy. Harvesting the termites takes different approaches, as will be discussed in this article. The article will, however, focus on one mode that makes use of a musical stimulus to attract the termites into a trap. I will show how this ritual of termite harvesting is an important part of the intangible cultural heritage of Babukusu, and I will seek to demonstrate how music can be central to interspecies communication. It is also envisaged that this article will contribute to the literature on ecomusicology, a new field of study that is still developing its definitions, terminology, and conceptual understanding. The article draws on data from field research I carried out in Bungoma County in 2017. Using ethnographic approaches, I interviewed participants during the harvesting rituals; further data was gathered through participant observation. It is important to note that I come from this community and have participated in the ritual over the years, so some of the information in this document is based on my position as an insider.
Sound Drawing and Dhol Notation: A Methodological Approach to Visualising Drum Sounds
Amrinder Singh Romana
The research introduces 'Sound Drawing' as an engaging instructional activity to develop a visual dhol notation system. In contrast to the conventional reliance on spoken language for dhol instruction, this research involves developing a visual notation system that effectively bridges the auditory intricacies of the dhol drum with corresponding visual representations. Through a methodical examination of sound drawings collaboratively generated by participants, this study critically assesses the effectiveness of sound drawing as an active and inclusive pedagogical instrument within the domain of dhol learning. The outcomes demonstrate how participants' visual interpretations of dhol sounds led to the creation of a notation system. This system reflects a diverse range of auditory perceptions and offers a new avenue for cultural engagement and learning in music education. By introducing sound drawing as an immersive instructional activity, this research aspires to advance dhol pedagogy, rendering it more accessible to diverse cultures and communities, thus transcending linguistic barriers. This study pioneers the incorporation of sound drawing as an innovative pedagogical activity for the collective development of a visual dhol notation system, thereby instigating a transformation in pedagogical paradigms and fostering cross-cultural engagement within the rich musical tradition of the dhol.
Making Music Using Two Quantum Algorithms
Euan J. Allen, Jacob F. F. Bulmer, Simon D. Small
This document explores how to make music using quantum computing algorithms. The text is an unedited pre-publication chapter that will appear in the book "Quantum Computer Music", Miranda, E. R. (Editor). This chapter provides the background and specific details of a collaboration formed in 2021 between the Quantum Engineering Technology Labs, a quantum computing and technology research group at the University of Bristol, and music artist, producer, and audio engineer Simon Small. The goal of the collaboration was to explore how the data and concepts used in the research at the university could be 'sonified' to create sounds or even make music.
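One simple way to picture such sonification, offered here purely as a hedged sketch and not as the chapter's actual mapping, is to turn measured qubit bitstrings into pitches:

```python
# Map quantum measurement outcomes (bitstrings) to MIDI pitches.
def bitstring_to_pitch(bits: str, base_note: int = 60) -> int:
    return base_note + int(bits, 2)   # e.g. '101' -> C4 + 5 semitones

# Hypothetical measurement counts from a three-qubit circuit:
counts = {"000": 12, "101": 48, "110": 20}
# Play the most frequent outcomes first:
melody = [bitstring_to_pitch(b) for b, _ in sorted(counts.items(), key=lambda kv: -kv[1])]
```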
Audiovisual interactive artwork via web-deployed software: Motus composes Homino-idea
Augusto Piato Amstalden, Tiago Fernandes Tavares, Anésio Azevedo Costa Neto
et al.
Many art installations rely on camera-based audiovisual interactions, which commonly require specialized hardware and software. Consequently, audiovisual installations are usually restricted to wealthier areas, where the specialized equipment can be afforded and properly hosted. In countries with an evident income imbalance linked to location, this geographic restriction leads to an audience restriction. In this work, we present the development of a web-deployed composition tool for audiovisual interactions that runs on the client side and does not require installing any additional software. At the same time, it provides visual feedback that can help the audience understand the experience. Consequently, the tool can be used to compose audiovisual interactions that reach a large audience via the web. We further explore the tool by composing the audiovisual installation Homino-idea. The installation is inspired by the interactions between humans and the environment, and can be either shown in art venues or used online.
A Quantum Natural Language Processing Approach to Musical Intelligence
Eduardo Reck Miranda, Richie Yeung, Anna Pearson
et al.
There has been tremendous progress in Artificial Intelligence (AI) for music, in particular for musical composition and access to large databases for commercialisation through the Internet. We are interested in further advancing this field, focusing on composition. In contrast to current black-box AI methods, we are championing an interpretable compositional outlook on generative music systems. In particular, we are importing methods from the Distributional Compositional Categorical (DisCoCat) modelling framework for Natural Language Processing (NLP), motivated by musical grammars. Quantum computing is a nascent technology, which is very likely to impact the music industry in time to come. Thus, we are pioneering a Quantum Natural Language Processing (QNLP) approach to develop a new generation of intelligent musical systems. This work follows from previous experimental implementations of DisCoCat linguistic models on quantum hardware. In this chapter, we present Quanthoven, the first proof-of-concept ever built, which (a) demonstrates that it is possible to program a quantum computer to learn to classify music that conveys different meanings and (b) illustrates how such a capability might be leveraged to develop a system to compose meaningful pieces of music. After a discussion about our current understanding of music as a communication medium and its relationship to natural language, the chapter focuses on the techniques developed to (a) encode musical compositions as quantum circuits, and (b) design a quantum classifier. The chapter ends with demonstrations of compositions created with the system.
BERT-like Pre-training for Symbolic Piano Music Classification Tasks
Yi-Hui Chou, I-Chun Chen, Chin-Jui Chang
et al.
This article presents a benchmark study of symbolic piano music classification using the masked language modelling approach of the Bidirectional Encoder Representations from Transformers (BERT). Specifically, we consider two types of MIDI data: MIDI scores, which are musical scores rendered directly into MIDI with no dynamics and precisely aligned with the metrical grid notated by the composer, and MIDI performances, which are MIDI encodings of human performances of musical scoresheets. With five public-domain datasets of single-track piano MIDI files, we pre-train two 12-layer Transformer models using the BERT approach, one for MIDI scores and the other for MIDI performances, and fine-tune them for four downstream classification tasks. These include two note-level classification tasks (melody extraction and velocity prediction) and two sequence-level classification tasks (style classification and emotion classification). Our evaluation shows that the BERT approach leads to higher classification accuracy than recurrent neural network (RNN)-based baselines.
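The pre-training objective is standard masked language modelling applied to MIDI token sequences. The sketch below follows the usual BERT recipe (15% masking rate, loss on masked positions only); the token ids are hypothetical, not the paper's vocabulary:

```python
# BERT-style masking of a MIDI token sequence for masked language modelling.
import random

MASK_ID = 1  # id of the [MASK] token in a hypothetical vocabulary

def mask_tokens(token_ids: list[int], mask_prob: float = 0.15):
    inputs, labels = [], []
    for tid in token_ids:
        if random.random() < mask_prob:
            inputs.append(MASK_ID)   # model must reconstruct this token
            labels.append(tid)       # supervise only masked positions
        else:
            inputs.append(tid)
            labels.append(-100)      # ignored by the loss (PyTorch convention)
    return inputs, labels
```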
Spatial Semiotic Units (USEs) and the Systematic Artistic Manipulation of Extrinsic Musical Space
Fabio Wanderley Janhan Sousa
In this work we present the concept of extrinsic space and, drawing on bibliographic studies of its syntactic and semantic development, as well as on the practice of real-time spatialization and the composition of works for multichannel systems, we move toward structuring what we call Spatial Semiotic Units (USEs). Conceived as the smallest perceptible units in the domain of extrinsic space capable of producing artistic meaning, USEs are a proposed tool for the composition and analysis of both sonic and visual material. Building on a proposed notation for USEs, we present an analysis of the extrinsic space of an audiovisual electroacoustic composition by the author. In the conclusion we assess the advantages and disadvantages of the adopted procedure.
Investigation of Singing Voice Separation for Singing Voice Detection in Polyphonic Music
Yifu Sun, Xulong Zhang, Yi Yu
et al.
Singing voice detection (SVD), which recognizes the vocal parts of a song, is an essential task in music information retrieval (MIR). The task remains challenging since the singing voice varies and intertwines with the accompaniment, especially in complicated polyphonic music such as choral recordings. To address this problem, we investigate singing voice detection while discarding the interference from the accompaniment. The proposed SVD has two steps: i. a singing voice separation (SVS) technique is first utilized to coarsely filter out the potential singing-voice part; ii. exploiting the temporal continuity of the vocals, a Long-term Recurrent Convolutional Network (LRCN) is used to learn compositional features. Moreover, to eliminate outliers, a median filter is applied for time-domain smoothing. Experimental results show that the proposed method outperforms existing state-of-the-art works on two public datasets, the Jamendo Corpus and the RWC Pop dataset.
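The final smoothing step is straightforward to illustrate. A minimal sketch, assuming frame-wise vocal probabilities from the LRCN and an 11-frame kernel (our assumption, not a published value):

```python
# Time-domain smoothing of frame-wise vocal probabilities with a median
# filter, followed by thresholding into vocal/non-vocal decisions.
import numpy as np
from scipy.signal import medfilt

def smooth_vocal_predictions(frame_probs, kernel_size=11):
    """kernel_size must be odd for scipy.signal.medfilt."""
    smoothed = medfilt(np.asarray(frame_probs, dtype=float), kernel_size)
    return smoothed > 0.5   # boolean decision per frame
```

Median filtering removes isolated spikes (single misclassified frames) without blurring the onsets and offsets of longer vocal segments, which is why it suits this outlier-removal role better than a mean filter.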
Static and Dynamic Measures of Active Music Listening as Indicators of Depression Risk
Aayush Surana, Yash Goyal, Vinoo Alluri
Music, an integral part of our lives, is not only a source of entertainment but also plays an important role in mental well-being by influencing moods, emotions, and other affective states. Music preferences and listening strategies have been shown to be associated with the psychological well-being of listeners, including internalized symptomatology and depression. However, to date no studies have examined time-varying music consumption, in terms of acoustic content, and its association with users' well-being. In the current study, we aim to unearth static and dynamic patterns in the active listening behavior of individuals which may be used as indicators of risk for depression. Mental well-being scores and listening histories of 541 Last.fm users were examined. Static and dynamic acoustic and emotion-related features were extracted from each user's listening history and correlated with their mental well-being scores. Results revealed that individuals at greater risk of depression rely more heavily on music and show greater repetitiveness in their listening activity. Furthermore, the affinity of depressed individuals for music that can be perceived as sad was found to be resistant to change over time. This study has large implications for future work on assessing mental illness risk by exploiting the digital footprints of users on online music streaming platforms.
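As a rough illustration of this kind of analysis, the sketch below computes one plausible "static" listening feature, repetitiveness, and correlates it with well-being scores; the feature definition is our assumption, and the paper derives several acoustic and emotion-related features beyond it:

```python
# Correlate a simple repetitiveness feature of listening histories
# with per-user well-being scores.
from collections import Counter
from scipy.stats import pearsonr

def repetitiveness(track_ids: list[str]) -> float:
    """Fraction of plays that repeat an already-heard track."""
    counts = Counter(track_ids)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / max(len(track_ids) - 1, 1)

# users: list of (listening_history, well_being_score) pairs
def correlate(users):
    feats = [repetitiveness(history) for history, _ in users]
    scores = [score for _, score in users]
    return pearsonr(feats, scores)   # (correlation coefficient, p-value)
```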