MusicSem: A Semantically Rich Language--Audio Dataset of Natural Music Descriptions
Rebecca Salganik, Teng Tu, Fei-Yueh Chen
et al.
Music representation learning is central to music information retrieval and generation. While recent advances in multimodal learning have improved alignment between text and audio for tasks such as cross-modal music retrieval, text-to-music generation, and music-to-text generation, existing models often struggle to capture users' expressed intent in natural language descriptions of music. This observation suggests that the datasets used to train and evaluate these models do not fully reflect the broader and more natural forms of human discourse through which music is described. In this paper, we introduce MusicSem, a dataset of 32,493 language-audio pairs derived from organic music-related discussions on the social media platform Reddit. Compared to existing datasets, MusicSem captures a broader spectrum of musical semantics, reflecting how listeners naturally describe music in nuanced and human-centered ways. To structure these expressions, we propose a taxonomy of five semantic categories: descriptive, atmospheric, situational, metadata-related, and contextual. In addition to the construction, analysis, and release of MusicSem, we use the dataset to evaluate a wide range of multimodal models for retrieval and generation, highlighting the importance of modeling fine-grained semantics. Overall, MusicSem serves as a novel semantics-aware resource to support future research on human-aligned multimodal music representation learning.
User Experience of a Semi-Immersive Musical Serious Game to Stimulate Cognitive Functions in Hospitalized Older Patients: Questionnaire Study
Laurent Samson, Lena Carcreff, Frédéric Noublanche
et al.
Abstract
Background: Reminiscence therapy through music is a psychosocial intervention with benefits for older patients with neurocognitive disorders. Therapies using virtual or augmented reality are effective for ecologically assessing, and potentially training, episodic memory in older populations. We designed a semi-immersive musical game called “A Life in Songs,” which invites patients to immerse themselves in a past era through visuals and songs from that time period. The game aspires to become a playful, easy-to-use, and complete tool for the assessment, rehabilitation, and prevention of neurocognitive decline associated with aging.
Objective: This study aimed to assess the user experience (UX) associated with the newly designed serious game.
Methods: After one or several sessions of the game guided by the therapist, patients in the geriatric wards were asked to answer questions selected from 2 widely known UX scales (AttrakDiff and meCUE [modular evaluation of the components of user experience]) with the therapist’s help. The internal consistency of the UX dimensions was assessed through Cronbach α to verify the validity of the dimensions. The patient’s level of engagement throughout the experimental session was also assessed using an internally developed scale with 5 levels (interactive, constructive, active, passive, and disengaged behaviors). UX mean scores were computed and presented graphically. Verbal feedback was reported to support the quantitative results.
Results: Overall, 60 inpatients with a mean age of 84.2 (SD 5.5) years, the majority of whom were women (41/60, 68%), were included. Their scores on the Mini-Mental State Examination (MMSE) ranged between 12 and 29. Almost half of the patients (27/56, 48%) had no major neurocognitive disorder (MNCD), 22/56 (39%) had mild MNCD, and 7/56 (13%) had moderate MNCD. The results revealed a very positive UX, with mean values beyond the neutral value for every UX dimension of both scales. The overall judgment was rated a mean of 3.92 (SD 0.87) on a scale of −5 to 5. Internal consistency was acceptable to good for the emotional dimensions of the meCUE; questionable to unacceptable consistency was found for the other UX dimensions. Participants were mostly active (23/60, 38%) and constructive (21/60, 35%).
Conclusions: These findings demonstrated a very good appreciation of the game by geriatric inpatients. Participants’ and health care professionals’ verbal comments strongly aligned with the quantitative results. The poor internal consistency in the UX dimensions reflected the high heterogeneity among the included patients. Further studies are needed to evaluate the potential influence of clinical factors such as neurocognitive function, mood, depression, or quality of life.
Coded speech enhancement using auxiliary utterance-level information
Haixin Zhao, Nilesh Madhu
Abstract
Numerous post-processing methods have been proposed to improve coded speech quality and intelligibility. However, achieving state-of-the-art enhancement and generalisation across varying distortion levels remains a challenge. To bridge this gap, we propose a Lightweight Causal-Transformer-based Coded Speech Enhancement (LCT-CSE) model employing a causal frequency-time-frequency (FTF) transformer block. This block facilitates temporal and spectral sequential modelling using transformers, efficiently exploiting global dependency across causal-context TF bins while minimising computational overhead. Experimental results indicate that the proposed LCT-CSE model outperforms the considered baselines across mainstream lossy audio codecs, including Opus, AMR-WB, EVS and LC3+, with a smaller footprint and lower complexity. To further exploit auxiliary, utterance-level information such as bitrate and other general distortion characteristics, we propose two information-incorporation methods built upon the LCT-CSE model. One employs one-hot vector representations and feature fusion, referred to as 1-hot vector-based modulation; the other dynamically switches information-dependent network paths, termed dynamic linear modulation (DLM). These methods improve performance by utilising bitrate information, with negligible additional computational overhead; the DLM model even achieves performance comparable to bitrate-specific trained (BST) models. We further extend DLM to a more general scenario, tandem coding. Compared to the two approaches used in practice, the DLM-based LCT-CSE model consistently exhibits improved generalisability across varying tandem encoding conditions, based on derivative distortion information. Specifically, it achieves gains of up to 0.74 in PESQ, 7% in STOI, and 0.18 in MOS-SIG under various bitrate conditions. This indicates significant potential for further applications where auxiliary information can be utilised.
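As a rough illustration of utterance-level conditioning of this kind, the sketch below implements a FiLM-style scale-and-shift modulation selected by a one-hot bitrate index. This is an illustrative analogue only, not the paper's actual DLM architecture; all names and sizes are made up.

```python
import numpy as np

def one_hot(index, num_classes):
    """Encode an utterance-level label (e.g. a bitrate index) as a one-hot vector."""
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v

class LinearModulation:
    """Scale-and-shift feature modulation driven by auxiliary information.

    Each auxiliary class (e.g. each codec bitrate) selects its own affine
    transform of the feature vector: y = gamma[c] * x + beta[c].
    """

    def __init__(self, num_classes, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One (gamma, beta) pair per auxiliary class; gamma starts near identity.
        self.gamma = 1.0 + 0.01 * rng.standard_normal((num_classes, feat_dim))
        self.beta = 0.01 * rng.standard_normal((num_classes, feat_dim))

    def __call__(self, features, cond):
        # cond is a one-hot vector; the matmul selects that class's parameters,
        # so changing the bitrate index switches the modulation path.
        g = cond @ self.gamma
        b = cond @ self.beta
        return g * features + b

# Example: modulate a 4-dim feature vector under bitrate class 2 of 3.
mod = LinearModulation(num_classes=3, feat_dim=4)
x = np.ones(4)
y = mod(x, one_hot(2, 3))
```

In a real network the affine parameters would be learned jointly with the enhancement model; the point here is only that one-hot conditioning reduces to selecting a per-class transform, which adds negligible compute.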
CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following
Yinghao Ma, Siyou Li, Juntao Yu
et al.
Recent advances in audio-text large language models (LLMs) have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multiple-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional MIR annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction-following benchmark designed to evaluate audio-text LLMs on a diverse set of music information retrieval (MIR) tasks. These include genre classification, emotion regression, emotion tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking, reflecting core challenges in MIR research. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models, ensuring direct comparability with supervised approaches. We provide an evaluation toolkit supporting all open-source audio-text LLMs, including LTU, Qwen-Audio, SALMONN, and MusiLingo. Experimental results reveal significant performance gaps between LLMs and supervised models, along with cultural, chronological, and gender biases, highlighting the potential and limitations of current models in addressing MIR tasks. CMI-Bench establishes a unified foundation for evaluating music instruction following, driving progress in music-aware LLMs.
Persian Musical Instruments Classification Using Polyphonic Data Augmentation
Diba Hadi Esfangereh, Mohammad Hossein Sameti, Sepehr Harfi Moridani
et al.
Musical instrument classification is essential for music information retrieval (MIR) and generative music systems. However, research on non-Western traditions, particularly Persian music, remains limited. We address this gap by introducing a new dataset of isolated recordings covering seven traditional Persian instruments, two common but originally non-Persian instruments (violin and piano), and vocals. We propose a culturally informed data augmentation strategy that generates realistic polyphonic mixtures from monophonic samples. Using the MERT model (Music undERstanding with large-scale self-supervised Training) with a classification head, we evaluate our approach on out-of-distribution data obtained by manually labeling segments of traditional songs. On real-world polyphonic Persian music, the proposed method yielded the best ROC-AUC (0.795), highlighting the complementary benefits of tonal and temporal coherence. These results demonstrate the effectiveness of culturally grounded augmentation for robust Persian instrument recognition and provide a foundation for culturally inclusive MIR and diverse music generation systems.
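The core augmentation idea of summing gain-scaled monophonic stems into a normalized polyphonic mixture can be sketched as follows. This toy version omits the culturally informed constraints (which instruments are paired, tonal and temporal alignment) that the paper applies; all names and signals are illustrative.

```python
import numpy as np

def mix_polyphonic(stems, rng=None, min_gain=0.5, max_gain=1.0):
    """Create a synthetic polyphonic mixture from monophonic stems.

    Each stem is scaled by a random gain and summed; the mixture is then
    peak-normalized so the output level does not depend on how many
    stems were mixed. Stems are truncated to the shortest length.
    """
    rng = rng or np.random.default_rng(0)
    length = min(len(s) for s in stems)
    mix = np.zeros(length)
    for stem in stems:
        gain = rng.uniform(min_gain, max_gain)
        mix += gain * stem[:length]
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 0 else mix

# Example: mix three fake monophonic "recordings" of different lengths.
rng = np.random.default_rng(42)
stems = [rng.standard_normal(16000),
         rng.standard_normal(18000),
         rng.standard_normal(16500)]
mixture = mix_polyphonic(stems, rng=rng)
```

Such mixtures can then be fed to the classifier with multi-label targets (one label per stem), which is what makes monophonic data usable for polyphonic training.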
SLEEPING-DISCO 9M: A large-scale pre-training dataset for generative music modeling
Tawsif Ahmed, Andrej Radonjic, Gollam Rabby
We present Sleeping-DISCO 9M, a large-scale pre-training dataset for music and song. To the best of our knowledge, there is no open-source, high-quality dataset representing popular and well-known songs for generative music modeling tasks such as text-to-music generation, music captioning, singing-voice synthesis, melody reconstruction and cross-modal retrieval. Past contributions focused on isolated and constrained factors, centering on synthetic or re-recorded music corpora (e.g. GTSinger, M4Singer); arbitrarily large-scale audio datasets (e.g. DISCO-10M and LAION-DISCO-12M) have been another focus for the community. Unfortunately, adoption of these datasets in the generative music community has been limited, as they fail to reflect real-world music and its flavour. Our dataset changes this narrative and provides a corpus constructed from actual popular music by world-renowned artists.
Learning and composing of classical music using restricted Boltzmann machines
Mutsumi Kobayashi, Hiroshi Watanabe
We investigate how machine learning models acquire the ability to compose music and how musical information is internally represented within such models. We develop a composition algorithm based on a restricted Boltzmann machine (RBM), a simple generative model capable of producing musical pieces of arbitrary length. We convert musical scores into piano-roll image representations and train the RBM in an unsupervised manner. We confirm that the trained RBM can generate new musical pieces; however, by analyzing the model's responses and internal structure, we find that the learned information is not stored in a form directly interpretable by humans. This study contributes to a better understanding of how machine learning models capable of music composition may internally represent musical structure and highlights issues related to the interpretability of generative models in creative tasks.
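For readers unfamiliar with RBMs, a minimal binary RBM trained with one step of contrastive divergence (CD-1) looks like the sketch below. In the paper's setting the visible layer would hold flattened piano-roll windows (pitch-by-time 0/1 note activations); the sizes and data here are toy stand-ins, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """A tiny binary restricted Boltzmann machine trained with CD-1."""

    def __init__(self, n_visible, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)  # visible bias
        self.b_h = np.zeros(n_hidden)   # hidden bias
        self.rng = rng

    def sample_h(self, v):
        p = sigmoid(v @ self.W + self.b_h)
        return p, (self.rng.random(p.shape) < p).astype(float)

    def sample_v(self, h):
        p = sigmoid(h @ self.W.T + self.b_v)
        return p, (self.rng.random(p.shape) < p).astype(float)

    def cd1_step(self, v0, lr=0.1):
        """One contrastive-divergence (CD-1) update on a batch of patterns."""
        ph0, h0 = self.sample_h(v0)
        _, v1 = self.sample_v(h0)
        ph1, _ = self.sample_h(v1)
        self.W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (ph0 - ph1).mean(axis=0)

    def generate(self, n_steps=20):
        """Gibbs-sample a new binary pattern (e.g. a piano-roll window)."""
        v = (self.rng.random(len(self.b_v)) < 0.5).astype(float)
        for _ in range(n_steps):
            _, h = self.sample_h(v)
            _, v = self.sample_v(h)
        return v

# Toy "piano roll" batch: 8 binary patterns over 12 visible units.
rng = np.random.default_rng(1)
data = (rng.random((8, 12)) < 0.3).astype(float)
rbm = RBM(n_visible=12, n_hidden=6)
for _ in range(50):
    rbm.cd1_step(data)
sample = rbm.generate()
```

Chaining Gibbs samples window by window is one way such a model can produce pieces of arbitrary length, as the abstract describes.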
MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation
Fathinah Izzati, Xinyue Li, Yuxuan Wu
et al.
Humans can imagine various atmospheres and settings when listening to music, envisioning movie scenes that complement each piece. For example, slow, melancholic music might evoke scenes of heartbreak, while upbeat melodies suggest celebration. This paper explores whether a music language model, e.g. MU-LLaMA, can perform a similar task, called Music Scene Imagination (MSI), which requires cross-modal information from video and music to train. To improve upon existing music captioning models, which focus solely on musical elements, we introduce MusiScene, a music captioning model designed to imagine scenes that complement each piece of music. In this paper, (1) we construct a large-scale video-audio caption dataset with 3,371 pairs, (2) we finetune Music Understanding LLaMA for the MSI task to create MusiScene, and (3) we conduct comprehensive evaluations and show that MusiScene generates more contextually relevant captions than MU-LLaMA. We further leverage the generated MSI captions to enhance text-based Video Background Music Generation (VBMG).
Music in the Family of Antun Gustav Matoš
Martina Mičija Palić
Given that two consecutive years mark the 150th anniversary of the birth and the 110th anniversary of the death of one of our most significant writers, Antun Gustav Matoš, scholarly research has increasingly turned to so-called "new readings" that re-evaluate and reconsider his work from a new angle. Starting from Antun Gustav's remark, made in the heat of polemic, about the "musical house" of the Matoš family, this paper investigates the significance and presence of music in the Matoš family. A diachronic overview of the Matoš family tree presents the musical activity of individual family members, whether pursued professionally or as amateurs: Antun Gustav himself, his grandfather Grgur, his father August, his sister Danica, his brother Leon, and other members of the Matoš family. The most important sources were the literary legacy of Antun Gustav Matoš, the relevant scholarly and professional literature, and primary archival materials.
The Analysis of Chinese and Japanese Traditional Opera Tunes With Artificial Intelligence Technology Based on Deep Learning
Min Yao, Jingchun Liu
This study aims to propose a comparison model of Chinese and Japanese traditional opera tunes based on deep learning (DL) and artificial intelligence technology, which can be used to effectively analyze and classify the tune characteristics of Chinese and Japanese traditional opera. To achieve this aim, a multi-layered DL model, including Convolutional Neural Network, Long Short-Term Memory, Recurrent Neural Network, and Gated Recurrent Unit, is constructed based on the tune characteristics of Chinese and Japanese traditional opera. In the extensive experimental verification, the proposed model has achieved remarkable results in the task of classifying Chinese and Japanese traditional opera tunes. Specifically, the proposed model achieves an accuracy of 89% and 88%, recall of 88% and 87%, and F1-score of 88% and 87% in the tune classification task of the Chinese and Japanese traditional opera, respectively. Compared with similar models, the proposed model performs better in classification.
Exploring Musical Roots: Applying Audio Embeddings to Empower Influence Attribution for a Generative Music Model
Julia Barnett, Hugo Flores Garcia, Bryan Pardo
Every artist has a creative process that draws inspiration from previous artists and their works. Today, "inspiration" has been automated by generative music models. The black box nature of these models obscures the identity of the works that influence their creative output. As a result, users may inadvertently appropriate, misuse, or copy existing artists' works. We establish a replicable methodology to systematically identify similar pieces of music audio in a manner that is useful for understanding training data attribution. A key aspect of our approach is to harness an effective music audio similarity measure. We compare the effect of applying CLMR and CLAP embeddings to similarity measurement in a set of 5 million audio clips used to train VampNet, a recent open source generative music model. We validate this approach with a human listening study. We also explore the effect that modifications of an audio example (e.g., pitch shifting, time stretching, background noise) have on similarity measurements. This work is foundational to incorporating automated influence attribution into generative modeling, which promises to let model creators and users move from ignorant appropriation to informed creation. Audio samples that accompany this paper are available at https://tinyurl.com/exploring-musical-roots.
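Embedding-based attribution of the kind described reduces, at its core, to ranking training clips by cosine similarity to a query embedding. The sketch below uses random vectors as stand-ins for CLMR or CLAP embeddings; it is a minimal illustration, not the authors' pipeline.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query, corpus, top_k=3):
    """Rank corpus clips by embedding similarity to the query clip.

    `query` is one embedding; `corpus` maps clip ids to embeddings.
    In the paper's setting these would be CLMR or CLAP vectors for
    the ~5 million training clips; here they are random stand-ins.
    """
    scores = {cid: cosine_similarity(query, emb) for cid, emb in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

rng = np.random.default_rng(0)
query = rng.standard_normal(128)
corpus = {f"clip_{i}": rng.standard_normal(128) for i in range(100)}
# A near-duplicate of the query should rank first.
corpus["near_dup"] = query + 0.01 * rng.standard_normal(128)
ranking = most_similar(query, corpus)
```

At scale one would replace the dictionary scan with an approximate-nearest-neighbor index, but the attribution question (which training clips sit closest to a generated output?) is unchanged.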
Quantifying the Corpus Bias Problem in Automatic Music Transcription Systems
Lukáš Samuel Marták, Patricia Hu, Gerhard Widmer
Automatic Music Transcription (AMT) is the task of recognizing notes in audio recordings of music. State-of-the-art (SotA) benchmarks have been dominated by deep learning systems. Due to the scarcity of high-quality data, these systems are usually trained and evaluated exclusively or predominantly on classical piano music. Unfortunately, that hinders our ability to understand how they generalize to other music. Previous works have revealed several aspects of memorization and overfitting in these systems. We identify two primary sources of distribution shift: the music and the sound. Complementing recent results on the sound axis (i.e. acoustics, timbre), we investigate the musical one (i.e. note combinations, dynamics, genre). We evaluate the performance of several SotA AMT systems on two new experimental test sets which we carefully construct to emulate different levels of musical distribution shift. Our results reveal a stark performance gap, shedding further light on the Corpus Bias problem and the extent to which it continues to trouble these systems.
SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data
Liqian Zhang, Magdalena Fuentes
We present SONIQUE, a model for generating background music tailored to video content. Unlike traditional video-to-music generation approaches, which rely heavily on paired audio-visual datasets, SONIQUE leverages unpaired data, combining royalty-free music and independent video sources. By utilizing large language models (LLMs) for video understanding and converting visual descriptions into musical tags, alongside a U-Net-based conditional diffusion model, SONIQUE enables customizable music generation. Users can control specific aspects of the music, such as instruments, genres, tempo, and melodies, ensuring the generated output fits their creative vision. SONIQUE is open-source, with a demo available online.
The Effects of Sacral Massage and Music Therapy Applied to Primiparous Women in Labor on Labor Pain, Postpartum Comfort, and Posttraumatic Growth: A Randomized Controlled Trial
Hatice Gül Öztaş, Nilay Gökbulut, Emine İbici Akça
Objective: In this study, the researchers investigated the effects of sacral massage and music therapy applied to primiparous women in labor on labor pain, postpartum comfort, and posttraumatic growth. Material and Methods: This randomized controlled study was conducted with 96 primiparous women admitted to the delivery room of a public hospital in a province in the south of Turkey. Study data were collected with the Visual Analog Scale (VAS), the Postpartum Comfort Questionnaire (PPCQ), and the Posttraumatic Growth Inventory (PTGI). Results: It was determined that the PPCQ total mean scores were 127.0 (119.0-136.0) in the massage group, 121.0 (113.0-127.0) in the music group, and 112.0 (106.0-115.0) in the control group, and the difference between the groups was statistically significant (X2=34.159, p
Drawing skills at the beginning of higher education: Teachers’ perspectives, expectations, and realities
Pedro Ramalho da Silva, Odete Palaré
For more than two decades, educational policies in Portugal have primarily focused on improving student outcomes, reducing school absenteeism, and preventing school abandonment. Another factor has been the overemphasis on literacy and numeracy, which has resulted in an unbalanced weighting of these school subjects in instructional time. As a result, instruction time for classes deemed non-essential, such as music and visual arts, was reduced. What effect do these policies have on the drawing abilities and visual literacy of those who pursue visual arts studies in higher education? To date, there has been a plethora of studies on drawing in its didactic, neurological, and physiological components, with findings that strengthen and support the idea that drawing plays a central role in the development of conceptual thinking and abductive reasoning. These findings are significant when advocating for drawing as a learning tool in S.T.E.M., but also for drawing to play a different role in education overall. To determine whether these policies impact the competencies, skills, and visual literacy of those who pursue visual arts studies, we surveyed higher education teachers regarding their perceptions of student drawing skills as they begin college degrees ranging from fine arts to design. Some survey findings point to weak areas, particularly in perspective drawing and in preparatory drawings and sketches. The teachers suggested that the causes were mostly political. These findings should be investigated further, specifically through follow-up interviews and a survey of first- and second-year students enrolled in the courses taught at the educational institutions under analysis.
GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework
Ang Lv, Xu Tan, Peiling Lu
et al.
Symbolic music generation aims to create musical notes, which can help users compose music, such as generating target instrument tracks based on provided source tracks. In practical scenarios where there is a predefined ensemble of tracks and various composition needs, an efficient and effective generative model that can generate any target tracks based on the other tracks becomes crucial. However, previous efforts have fallen short of addressing this need due to limitations in their music representations and models. In this paper, we introduce a framework known as GETMusic, with "GET" standing for "GEnerate music Tracks." This framework encompasses a novel music representation, "GETScore," and a diffusion model, "GETDiff." GETScore represents musical notes as tokens and organizes them in a 2D structure, with tracks stacked vertically and progressing horizontally over time. At each training step, each track of a music piece is randomly selected as either target or source. Training involves two processes: in the forward process, target tracks are corrupted by masking their tokens, while source tracks remain as the ground truth; in the denoising process, GETDiff is trained to predict the masked target tokens conditioned on the source tracks. Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with arbitrary source-target track combinations. Our experiments demonstrate that the versatile GETMusic outperforms prior works proposed for specific composition tasks.
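The forward (corruption) process on a GETScore-like grid can be sketched as masking the token rows of the target tracks while leaving source tracks intact. The token ids, mask value, and grid size below are illustrative, not the paper's actual vocabulary.

```python
import numpy as np

MASK = -1  # stand-in for a special [MASK] token id

def corrupt_targets(score, target_tracks):
    """Mask the tokens of target tracks in a 2D track-by-time token grid.

    `score` follows the GETScore idea: tracks stacked vertically, time
    progressing horizontally, one token id per cell. Source tracks are
    left untouched as conditioning ground truth for the denoiser.
    """
    corrupted = score.copy()
    corrupted[target_tracks, :] = MASK
    return corrupted

# A toy "score": 4 tracks x 8 time steps of arbitrary token ids.
rng = np.random.default_rng(0)
score = rng.integers(0, 100, size=(4, 8))
# Pick track 2 as the generation target; the rest act as sources.
corrupted = corrupt_targets(score, target_tracks=[2])
```

Because any subset of tracks can be masked, the same trained denoiser serves every source-target combination, which is the flexibility the abstract emphasizes.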
MAGMA: Music Aligned Generative Motion Autodecoder
Sohan Anisetty, Amit Raj, James Hays
Mapping music to dance is a challenging problem that requires spatial and temporal coherence along with continual synchronization with the music's progression. Taking inspiration from large language models, we introduce a 2-step approach for generating dance: a Vector Quantized Variational Autoencoder (VQ-VAE) distills motion into primitives, and a Transformer decoder learns the correct sequencing of these primitives. We also evaluate the importance of music representations by comparing naive music feature extraction using Librosa to deep audio representations generated by state-of-the-art audio compression algorithms. Additionally, we train variations of the motion generator using relative and absolute positional encodings to determine their effect on generated motion quality for arbitrarily long sequences. Our proposed approach achieves state-of-the-art results on music-to-motion generation benchmarks and enables real-time generation of considerably longer motion sequences, seamless chaining of multiple motion sequences, and easy customization of motion sequences to meet style requirements.
Quantifying the evolution of harmony and novelty in western classical music
Alfredo González-Espinoza, Joshua B. Plotkin
Music is a complex socio-cultural construct that fascinates researchers in diverse fields as well as the general public. Understanding the historical development of music may help us understand perception and cognition, while also yielding insight into the processes of cultural transmission, creativity, and innovation. Here, we present a study of musical features related to harmony, and we document how they evolved over 400 years of western classical music. We developed a variant of the center-of-effect algorithm to compute the most likely key for a given set of notes, representing a musical piece as a sequence of local keys computed measure by measure. We develop measures to quantify key uncertainty, as well as diversity and novelty in key transitions. We provide specific examples to demonstrate the features represented by these concepts, and we argue that they are related to harmonic complexity and can be used to study the evolution of harmony. We confirm several observations and trends previously reported by musicologists and scientists, with some discrepancies during the Classical period. We report a decline in innovation in harmonic transitions in the early Classical period followed by a steep increase in the late Classical period, and we give an explanation for this finding that is consistent with accounts by music theorists. Finally, we discuss the limitations of this approach for cross-cultural studies and the need for more expressive but still tractable representations of musical scores, as well as a large and reliable musical corpus, for future study.
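As a concrete (and much simpler) analogue of measure-wise local key estimation, the classic Krumhansl-Schmuckler profile-correlation method can be sketched as follows. Note that this is not the authors' center-of-effect variant; it only illustrates the general idea of mapping a bag of notes to a most likely key.

```python
import numpy as np

# Krumhansl-Kessler key profiles (perceived pitch-class stability).
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def estimate_key(pitch_classes):
    """Return the most likely (tonic, mode) for a bag of pitch classes (0-11).

    Builds a pitch-class histogram and correlates it with each of the 24
    rotated major/minor profiles, keeping the best match.
    """
    hist = np.bincount(np.asarray(pitch_classes) % 12, minlength=12).astype(float)
    best, best_r = None, -2.0
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            r = np.corrcoef(hist, np.roll(profile, tonic))[0, 1]
            if r > best_r:
                best, best_r = (NAMES[tonic], mode), r
    return best

# Measure-by-measure usage: feed each measure's notes separately to
# obtain a sequence of local keys, as in the study's representation.
key = estimate_key([0, 4, 7, 0, 4, 7, 2, 5])  # mostly C, E, G
```

Transition statistics over such a key sequence are then what notions like key diversity and novelty can be computed from.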
Modulation Graphs in Popular Music
Jason I. Brown, Ian George
In this paper, graph theory is used to explore the musical notion of tonal modulation, in theory and application. We define (pivot) modulation graphs based on the common scales used in popular music. Properties and parameters of these graphs are discussed. We also investigate modulation graphs for the canon of Lennon-McCartney songs in the works of The Beatles. Our approach may provide composers with mathematical insights into pivot modulation.
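A simplified pivot-modulation graph can be built by connecting keys that share diatonic triads, i.e. candidate pivot chords. The sketch below restricts itself to the 12 major keys, whereas the paper's graphs cover the common scales of popular music more broadly; it is an illustration of the construction, not the authors' definition.

```python
from itertools import combinations

MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]  # semitone steps of a major scale
NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

def diatonic_triads(tonic):
    """The seven triads of a major key, as frozensets of pitch classes."""
    pcs = [(tonic + step) % 12 for step in MAJOR_SCALE]
    # Stack thirds within the scale: degrees d, d+2, d+4 (mod 7).
    return {frozenset(pcs[(d + k) % 7] for k in (0, 2, 4)) for d in range(7)}

def pivot_modulation_graph(min_pivots=1):
    """Edges between major keys sharing at least `min_pivots` triads.

    Returns a dict mapping key-name pairs to the number of shared
    diatonic triads (candidate pivot chords).
    """
    triads = {t: diatonic_triads(t) for t in range(12)}
    edges = {}
    for a, b in combinations(range(12), 2):
        shared = len(triads[a] & triads[b])
        if shared >= min_pivots:
            edges[(NAMES[a], NAMES[b])] = shared
    return edges

edges = pivot_modulation_graph()
```

With major keys only, adjacency follows the circle of fifths: keys a fifth apart share four triads, keys two fifths apart share two, and more distant keys share none, so each vertex has degree four.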
CHOIR ART FROM THE REPUBLIC OF MOLDOVA AT THE CROSSROADS OF MILLENNIUMS: REALITIES AND PERSPECTIVES
MORARU, EMILIA
In this article, the author describes the current situation of choral performance art in the Republic of Moldova: the demographic situation and socio-cultural processes that influence the trajectory of the choral movement's evolution in our country; the challenges of contemporary society, which lead to changes in the vision, approach, and scope of the activity of choirs and conductors; and the perspectives of national choral art and the opportunities from which choral conductors and ensembles can benefit in the century of "speed and digital resources". The benefits of vocal-choral singing are highlighted, and recommendations are formulated for the conductor's theoretical training (study of specialized literature) and practice (individually and together with the choir).