Classical Recurrent Neural Networks (RNNs) summarize musical context into a deterministic hidden state vector, imposing an information bottleneck that fails to capture the inherent ambiguity in music. We propose the Density Matrix RNN (DM-RNN), a novel theoretical architecture utilizing the Density Matrix. This allows the model to maintain a statistical ensemble of musical interpretations (a mixed state), capturing both classical probabilities and quantum coherences. We rigorously define the temporal dynamics using Quantum Channels (CPTP maps). Crucially, we detail a parameterization strategy based on the Choi-Jamiolkowski isomorphism, ensuring the learned dynamics remain physically valid (CPTP) by construction. We introduce an analytical framework using Von Neumann Entropy to quantify musical uncertainty and Quantum Mutual Information (QMI) to measure entanglement between voices. The DM-RNN provides a mathematically rigorous framework for modeling complex, ambiguous musical structures.
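To make the ideas of a mixed state, a CPTP map and Von Neumann entropy concrete, the following is a minimal single-qubit sketch in plain Python. It is not the DM-RNN parameterization: the abstract describes a Choi-Jamiolkowski-based construction, whereas this illustration assumes a fixed dephasing channel written with Kraus operators (an equivalent way of specifying a CPTP map). All function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dagger(a):
    """Conjugate transpose of a matrix given as nested lists."""
    return [[a[j][i].conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]

def apply_channel(rho, kraus_ops):
    """Apply rho -> sum_k K rho K^dagger (Kraus form of a quantum channel)."""
    dim = len(rho)
    out = [[0.0] * dim for _ in range(dim)]
    for k in kraus_ops:
        term = matmul(matmul(k, rho), dagger(k))
        for i in range(dim):
            for j in range(dim):
                out[i][j] += term[i][j]
    return out

def is_trace_preserving(kraus_ops, tol=1e-9):
    """Check the completeness relation sum_k K^dagger K = I."""
    dim = len(kraus_ops[0][0])
    total = [[0.0] * dim for _ in range(dim)]
    for k in kraus_ops:
        term = matmul(dagger(k), k)
        for i in range(dim):
            for j in range(dim):
                total[i][j] += term[i][j]
    return all(abs(total[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(dim) for j in range(dim))

def qubit_entropy_bits(rho):
    """Von Neumann entropy (in bits) of a 2x2 density matrix.

    Eigenvalues come from the trace and determinant of the 2x2 matrix.
    """
    trace = (rho[0][0] + rho[1][1]).real
    det = (rho[0][0] * rho[1][1]).real - abs(rho[0][1]) ** 2
    disc = math.sqrt(max(0.0, trace * trace - 4.0 * det))
    eigenvalues = [(trace + disc) / 2.0, (trace - disc) / 2.0]
    return -sum(l * math.log2(l) for l in eigenvalues if l > 1e-12)

def dephasing_kraus(p):
    """Kraus operators of a dephasing channel (assumed example channel)."""
    s0, s1 = math.sqrt(1.0 - p), math.sqrt(p)
    return [[[s0, 0.0], [0.0, s0]], [[s1, 0.0], [0.0, -s1]]]

# A pure superposition state: two "interpretations" held coherently.
rho_plus = [[0.5, 0.5], [0.5, 0.5]]
rho_mixed = apply_channel(rho_plus, dephasing_kraus(0.5))
```

In this toy setting the pure state has zero entropy, and full dephasing removes the off-diagonal coherences, leaving a maximally mixed state with one bit of entropy.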
Video game music (VGM) is often studied under the same lens as film music, which largely focuses on its theoretical functionality in relation to the identified genres of the media. However, to date, we are unaware of any systematic approach that analyzes the quantifiable musical features of VGM across several identified game genres. Therefore, we extracted musical features from VGM in games from three sub-genres of Role-Playing Games (RPG), and then hypothesized how different musical features are correlated with the perceptions and portrayals of each genre. This observed correlation may be used to further suggest that such features are relevant to the expected storytelling elements or play mechanics associated with the sub-genre.
Recent advances in text-to-music (TTM) generation have enabled controllable and expressive music creation using natural language prompts. However, the emotional fidelity of TTM systems remains largely underexplored compared to human preference or text alignment. In this study, we introduce AImoclips, a benchmark for evaluating how well TTM systems convey intended emotions to human listeners, covering both open-source and commercial models. We selected 12 emotion intents spanning four quadrants of the valence-arousal space, and used six state-of-the-art TTM systems to generate over 1,000 music clips. A total of 111 participants rated the perceived valence and arousal of each clip on a 9-point Likert scale. Our results show that commercial systems tend to produce music perceived as more pleasant than intended, while open-source systems tend to do the opposite. Emotions are more accurately conveyed under high-arousal conditions across all models. Additionally, all systems exhibit a bias toward emotional neutrality, highlighting a key limitation in affective controllability. This benchmark offers valuable insights into model-specific emotion rendering characteristics and supports future development of emotionally aligned TTM systems.
Vision-to-music generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence with vast application prospects in fields such as film scoring, short video creation, and dance music synthesis. However, compared to the rapid development of modalities like text and images, research in vision-to-music is still in its preliminary stage due to its complex internal structure and the difficulty of modeling dynamic relationships with video. Existing surveys focus on general music generation without a comprehensive discussion of vision-to-music. In this paper, we systematically review the research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types (general videos, human movement videos, and images) and two output types (symbolic music and audio music). We then summarize the existing methodologies for vision-to-music generation from the architecture perspective. A detailed review of common datasets and evaluation metrics is provided. Finally, we discuss current challenges and promising directions for future research. We hope our survey can inspire further innovation in vision-to-music generation and the broader field of multimodal generation in academic research and industrial applications. To follow the latest work and foster further innovation in this field, we are continuously maintaining a GitHub repository at https://github.com/wzk1015/Awesome-Vision-to-Music-Generation.
Juan Carlos Martinez-Sevilla, Francesco Foscarin, Patricia Garcia-Iasci et al.
In this paper, we address the challenge of Optical Music Recognition (OMR) for handwritten jazz lead sheets, a widely used musical score type that encodes melody and chords. The task is challenging due to the presence of chords, a score component not handled by existing OMR systems, and the high variability and quality issues associated with handwritten images. Our contribution is two-fold. We present a novel dataset consisting of 293 handwritten jazz lead sheets of 163 unique pieces, amounting to 2021 total staves aligned with Humdrum **kern and MusicXML ground truth scores. We also supply synthetic score images generated from the ground truth. The second contribution is the development of an OMR model for jazz lead sheets. We discuss specific tokenisation choices related to our kind of data, and the advantages of using synthetic scores and pretrained models. We publicly release all code, data, and models.
Managing the emotional aspect remains a challenge in automatic music generation. Prior works aim to learn various emotions at once, leading to inadequate modeling. This paper explores the disentanglement of emotions in piano performance generation through a two-stage framework. The first stage focuses on valence modeling of lead sheets, and the second stage addresses arousal modeling by introducing performance-level attributes. To further capture features that shape valence, an aspect less explored by previous approaches, we introduce a novel functional representation of symbolic music. This representation aims to capture the emotional impact of major-minor tonality, as well as the interactions among notes, chords, and key signatures. Objective and subjective experiments validate the effectiveness of our framework in both emotional valence and arousal modeling. We further leverage our framework in a novel application of emotional controls, showing broad potential in emotion-driven music generation.
Hugo T. Carvalho, Min S. Li, Massimiliano di Luca et al.
The synchronization of motor responses to rhythmic auditory cues is a fundamental biological phenomenon observed across various species. While the importance of temporal alignment varies across different contexts, achieving precise temporal synchronization is a prominent goal in musical performances. Musicians often incorporate expressive timing variations, which require precise control over timing and synchronization, particularly in ensemble performance. This is crucial because both deliberate expressive nuances and accidental timing deviations can affect the overall timing of a performance. This discussion prompts the question of how musicians adjust their temporal dynamics to achieve synchronization within an ensemble. This paper introduces a novel feedback correction model based on the Kalman Filter, aimed at improving the understanding of interpersonal timing in ensemble music performances. The proposed model performs similarly to other linear correction models in the literature, with the advantage of low computational cost and good performance even in scenarios where the underlying tempo varies.
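As a rough illustration of how a Kalman filter can act as a feedback correction rule, the sketch below tracks a slowly drifting inter-onset interval from noisy observations with a scalar filter. It is a textbook random-walk Kalman filter, not the model proposed in the paper; the function name, the noise variances `q` and `r`, and the example values are all assumptions.

```python
def kalman_track_interval(observed_intervals, x0, p0=1.0, q=1e-4, r=1e-2):
    """Scalar Kalman filter for a slowly varying inter-onset interval.

    Assumed model: the true interval follows a random walk with process
    variance q, and each observation adds timing noise with variance r.
    Returns the filtered estimates and the Kalman gain used at each step.
    """
    x, p = x0, p0
    estimates, gains = [], []
    for y in observed_intervals:
        # Predict: the interval is expected to stay put, uncertainty grows.
        p_pred = p + q
        # Update: correct the estimate by a fraction k of the timing error.
        k = p_pred / (p_pred + r)
        x = x + k * (y - x)
        p = (1.0 - k) * p_pred
        estimates.append(x)
        gains.append(k)
    return estimates, gains

# A performer starts with a 0.6 s beat estimate while the ensemble plays at 0.5 s.
estimates, gains = kalman_track_interval([0.5] * 20, x0=0.6)
```

The update step has the same form as a first-order linear correction, with the difference that the correction gain `k` is recomputed at every step from the current uncertainty rather than fixed in advance.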
Still rebounding from the impact of the global pandemic, the higher education sector is being challenged even further by the next wave of Artificial Intelligence (AI) technologies. These technologies have the power to generate, in a matter of seconds, quality text, images, music and coding responses to questions or prompts entered into an online chat box. Currently, one of the most accessible and popular text generators is OpenAI’s ChatGPT, which was released in November 2022. Early evaluation indicates that the quality of the responses exceeds standard pass rates for comparable university assessments. Even if academic protocols mandate that text cited from AI sources should be acknowledged and referenced as any other source material, the speed, accessibility and high quality of the AI material justify a rethink of the purpose of higher education and a redesign of curriculum, pedagogy and assessment. An initial suggestion being promoted in the sector is that learning outcomes and assessments should move away from a focus on content memorisation and recall, towards the development of higher order thinking skills such as critical analysis, evaluation, resilience, creativity, problem solving and appraising, as well as mastery of verbal communication and computer literacy. This preliminary paper examines some of the literature to date, which discusses potential risks and threats, as well as the opportunities to enhance learning, embedded in this new wave of emerging AI technologies in higher education.
This article examines how Joyland Festival implements the principles of event experience design, focusing on its purposeful, playful, and impactful strategies. Joyland Festival has distinguished itself as a leading example of a music festival in Indonesia since 2011, aligning its programs with the principles of good event experience design to create immersive and impactful experiences for its diverse audience. Utilizing a five-phase analytical framework, this research explores the potential opportunities associated with creating memorable experiences in the context of events. It investigates the interplay between experience design as implemented by festival management and its perception by attendees and the broader community. Employing a qualitative research approach, the study reveals a strong alignment between the Joyland Festival’s practices and O'Loughlin’s framework of the Seven Elements of Good Event Experience Design: Purpose, Play, Awe, Flow, Shared Experience, Communitas, and Impact. By contributing to the growing body of literature and case studies in event experience design, an area predominantly focused on non-Indonesian contexts, this research provides valuable insights into the practical application of experience design in the Indonesian festival industry.
The aim of the research is to approach Turkish makam music from a holistic perspective, to address the makam structure in Erdal ERZİNCAN’s baglama/saz improvisation performances, and to offer solutions to the ongoing makam-diz-ayak debates among GTHM officials. The research is a descriptive study based on the survey model. The data collection techniques are "literature scanning" and "indirect observation"; the data source is the 21 baglama/saz improvisations performed by Erdal ERZİNCAN in the baglama order on his 11 solo music albums released between 1994 and 2018, excluding his collaborative works. The performances were recorded and transcribed. The makam structures of the improvisation performances are described and explained through the audio recordings and dictated notes of the performances. Based on the findings, the following results were reached: the açış and taksim forms are a unity; GTHM melodies can and should be explained with the traditional makam understanding and terminology; GTHM is a part of Turkish makam music; and the richness and diversity of the tunes and melodies of Turkish makam music have lived on and been carried in GTHM works from past to present.
Pablo Alonso-Jiménez, Xavier Serra, Dmitry Bogdanov
In this work, we address music representation learning using convolution-free transformers. We build on top of existing spectrogram-based audio transformers such as AST and train our models on a supervised task using patchout training similar to PaSST. In contrast to previous works, we study how specific design decisions affect downstream music tagging tasks instead of focusing on the training task. We assess the impact of initializing the models with different pre-trained weights, using various input audio segment lengths, using learned representations from different blocks and tokens of the transformer for downstream tasks, and applying patchout at inference to speed up feature extraction. We find that 1) initializing the model from ImageNet or AudioSet weights and using longer input segments are beneficial both for the training and downstream tasks, 2) the best representations for the considered downstream tasks are located in the middle blocks of the transformer, and 3) using patchout at inference allows faster processing than our convolutional baselines while maintaining superior performance. The resulting models, MAEST, are publicly available and obtain the best performance among open models in music tagging tasks.
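The idea behind patchout can be sketched without any deep learning framework: the spectrogram is cut into a grid of patches, and only a subset of the patch tokens is passed to the transformer. The sketch below assumes a structured variant that drops whole frequency rows and time columns at random; the function name and parameters are hypothetical, and the exact dropping scheme used for MAEST at training or inference time is not stated in the abstract.

```python
import random

def structured_patchout(n_freq, n_time, drop_freq, drop_time, seed=None):
    """Return the (freq, time) grid positions of the patches that are kept.

    Assumed scheme: drop `drop_freq` random frequency rows and `drop_time`
    random time columns of the patch grid, keeping everything else.
    """
    rng = random.Random(seed)
    kept_rows = sorted(rng.sample(range(n_freq), n_freq - drop_freq))
    kept_cols = sorted(rng.sample(range(n_time), n_time - drop_time))
    return [(f, t) for f in kept_rows for t in kept_cols]

# Example grid of 12 x 100 patches, dropping 4 rows and 40 columns.
kept = structured_patchout(12, 100, drop_freq=4, drop_time=40, seed=0)
```

Because self-attention cost grows with the square of the sequence length, shortening the token sequence in this way reduces computation, which is why patchout can speed up both training and feature extraction.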
Emmanouil Karystinaios, Francesco Foscarin, Florent Jacquemard et al.
This paper focuses on the nominal durations of musical events (notes and rests) in a symbolic musical score, and on how to conveniently handle these in computer applications. We propose the usage of a temporal unit that is directly related to the graphical symbols in musical scores and pair this with a set of operations that cover typical computations in music applications. We formalize this time unit and the more commonly used approach in a single mathematical framework, as semirings, algebraic structures that enable an abstract description of algorithms/processing pipelines. We then discuss some practical use cases and highlight when our system can improve such pipelines by making them more efficient in terms of data type used and the number of computations.
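For orientation, the commonly used approach to nominal durations can be sketched with exact rational arithmetic: each duration is a fraction of a whole note, and the non-negative rationals with addition and multiplication form a semiring. The sketch below illustrates only this baseline, not the time unit proposed in the paper; the function names and the `(actual, normal)` tuplet convention are assumptions.

```python
from fractions import Fraction

def note_duration(denominator, dots=0, tuplet=(1, 1)):
    """Nominal duration as an exact fraction of a whole note.

    denominator: 1 = whole, 2 = half, 4 = quarter, 8 = eighth, ...
    dots: number of augmentation dots.
    tuplet: (actual, normal), e.g. (3, 2) for a triplet (3 in the time of 2).
    """
    base = Fraction(1, denominator)
    # Each dot adds half of the previously added value: 1 + 1/2 + 1/4 + ...
    dotted = base * (2 - Fraction(1, 2 ** dots))
    actual, normal = tuplet
    return dotted * Fraction(normal, actual)

def total_duration(durations):
    """Sum a sequence of durations without floating-point rounding."""
    return sum(durations, Fraction(0))

# A 4/4 bar: dotted quarter, eighth, then a half note.
bar = [note_duration(4, dots=1), note_duration(8), note_duration(2)]
# Three triplet eighths fill exactly one quarter note.
triplet = [note_duration(8, tuplet=(3, 2))] * 3
```

Exact fractions avoid the rounding errors that floats introduce with tuplets, at the cost of a heavier data type, which is the kind of trade-off the paper's use cases discuss.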
Raquel Quevedo-Redondo, Marta Rebolledo, Nuria Navarro-Sierra
The changes brought by new technologies and the ensuing rapid development of the communication field have resulted in an increasing number of studies on politicians’ use of the internet and social media. However, while election campaigns have been the predominant research area in political communication scholarship, music has not yet been taken as an object of study alongside spectacularisation and politainment. Aside from some preliminary studies, systematic research on music in politics is scarce. The literature holds that music is a universal language. Music in politics can therefore be deemed to be an identification tool that can help politicians connect with voters and bring together positions between the different actors of international relations. This is an exploratory study about the use of music in political campaigning. It is focused on the role played by the Spotify playlists created by the main political parties in recent election campaigns in Spain. The initial hypothesis is that some of the candidates strategically selected songs to be shared with their followers. A quantitative content analysis (N = 400) of some Spotify playlists showed that there were significant differences in the selection of songs among the different political parties. This research contributes to the understanding of how Spotify has been used for electoral campaigning, as well as shedding some light on the current communication literature on music and politics.
Music can have a positive effect on runners’ performance and motivation. However, the practical implementation of music intervention during exercise is mostly absent from the literature. Therefore, this paper designs a playback sequence system for joggers by considering music emotion and physiological signals. This playback sequence is implemented by a music selection module that combines artificial intelligence techniques with physiological data and emotional music. To allow the system to operate for a long time, this paper improves the model and the music selection module to achieve lower energy consumption. The proposed model requires fewer FLOPs and parameters by using log-scaled Mel-spectrograms as input features. The accuracy, computational complexity, trainable parameters, and inference time are evaluated on the Bi-modal, 4Q emotion, and Soundtrack datasets. The experimental results show that the proposed model outperforms that of Sarkar et al. and achieves competitive performance on the Bi-modal (84.91%), 4Q emotion (92.04%), and Soundtrack (87.24%) datasets. More specifically, the proposed model reduces the computational complexity and inference time while maintaining the classification accuracy, compared to other models. Moreover, the proposed model is small, so it can be applied to mobile phones and other devices with limited computing resources. This study designed the overall playback sequence system by considering the relationship between music emotion and physiological state during exercise. The playback sequence system can be adopted directly during exercise to improve users’ exercise efficiency.
Escape rooms have been proven to be a functional game-based approach to teaching a variety of subjects. Teachers as well as students are eager to play escape rooms in the classroom; field studies have demonstrated that escape games are a memorable activity with a high retention rate, especially if followed by a proper debriefing session in which learnings emerge and are consolidated. In recent years the literature on educational escape rooms has grown, yet there is little research on educational escape rooms in music education. In this paper we present an educational escape room about Afro-American music at the beginning of the past century. The players are asked to solve a murder case that happened backstage at a jazz club in New Orleans. Such a scenario gives the players/students the opportunity to "experience", within the escape room context, cultural and historical details and characteristics related to this musical genre. The puzzles within the escape room are formally related to jazz and Afro-American music, giving players/students the possibility to learn and have a "first-hand" experience with concepts that would otherwise remain purely theoretical within a normal classroom environment. Learning was measured by means of a 3-step test design: the pre-test was administered before playing the escape room, the first post-test was administered right after playing the escape room and before the debriefing, and the second post-test was administered two weeks after the debriefing phase. From a qualitative point of view, the teacher noticed higher motivation while playing the game than in a normal classroom activity. The quantitative results of the second post-test show that students retained many of the concepts presented within the escape room, which highlights the importance of a debriefing phase to consolidate learning after playing an educational escape room.
Maarten Grachten, Carlos Cancino-Chacón, Thassilo Gadermaier
This demo paper introduces partitura, a Python package for handling symbolic musical information. The principal aim of this package is to handle richly structured musical information as conveyed by modern staff music notation. It provides a much wider range of possibilities to deal with music than the more reductive (but very common) piano roll-oriented approach inspired by the MIDI standard. The package is an open source project and is available on GitHub.
Gembrung music is a traditional Islamic art that has developed widely in Java. Its existence reflects a da'wah process aimed at conveying the messages of Islamic teachings in Java. The intersection of Muslim and Javanese groups creates a form of cultural acculturation in which each of the two cultures is represented. Using a qualitative methodology focused on the literature and on gembrung in Java, this study aims to trace the process of acculturation of the traditional Gembrung musical instruments. Gembrung art is found to serve both as a communication tool in preaching and as a process of musical acculturation. Several elements of gembrung music were adopted from Javanese music, including the kenong and gong, which were replaced with tambourines, musical instruments made of wood and goat skin. In addition to this replacement or omission, there is also musical pluralism, in which a musical instrument can be accepted by another group while each acculturating group still retains one instrument that represents it. Among those that must still be included are drums, as a representation of Java, and tambourines, as a representation of Islam-Arabism. Interestingly, each musical instrument that serves as a cultural representation becomes the central determinant of the other musical elements. The acculturation of these musical elements has led to cultural acceptance of the Gembrung art by the Javanese people. Likewise, it is accepted by Muslim groups as a representation of Islamic music.
This paper focuses on director's theater as an important stage in the evolution of the opera genre and, at the same time, an ambiguous cultural phenomenon. Due to the rapid growth of interest in opera in the last quarter of the twentieth century, the concept of director's opera, which owes its formation and flourishing to the postmodern condition, began to spread in the West. At present the term director's opera is actively used in Russian art criticism. The study found that the majority of researchers of musical theater associate this concept with the provocative nature of works and overly speculative stage interpretations of operas, as well as apparent liberty in the treatment of classical heritage, leading to the destruction of the opera canon. Simultaneously with the theoretical reflection on the phenomenon of director's opera, and as a result of the major crises of the turn of the 20th and 21st centuries as well as fatigue from the corrosive, destructive irony of postmodernism, the concept of metamodernism appeared. According to the theorists of metamodernism, the new cultural paradigm aims to recode modern culture through art, and music was the first to embrace metamodernism, which later penetrated and conquered various cultural spheres. Defending its social significance and vitality, opera theater, following other art forms, changes its aesthetics, demonstrating trends associated with metamodernism; for instance, it allows for unexpected collaborations that welcome very different people to display their otherness. Based on the study of literature and empirical materials, the author concludes that the outlined intentions, which appeared in the period 2016–2018, can be considered significant examples of the penetration of metamodern trends into contemporary opera theater. As examples, the author considers the production of Verdi's opera La Traviata by Wilson and that of Arthur Honegger's oratorio Joan of Arc at the Stake by Castellucci.
Michael J. Silverman, Sonia Bourdaghs, Jessica M. Abbazio et al.
Background: Conditioning- and cue-induced craving theories indicate that music has the potential to induce substance craving. A better understanding of this phenomenon could enhance treatment and prevent misuse, relapse, and overdose. Objective: The purpose of this systematic review was to locate and examine studies using music to induce substance craving in humans. We sought to discover if music can induce substance craving as well as specific aspects of the music and how it was used. Method: Adhering to the PRISMA Statement and Checklist, we conducted a systematic review of literature on music-induced substance craving in nine databases. We extracted data from studies meeting our inclusion criteria, which related to substance craving induced by music and data based on music intervention reporting guidelines. Results: We reviewed 751 research outputs. A total of 33 articles meeting the inclusion criteria were found, indicating that various types of music can induce alcohol, cannabis, nicotine, and general substance craving. In most of the studies, music was used as a component of a mood induction technique or in a virtual reality setting that led to craving. There tended to be a lack of detail about the music itself and most authors did not adhere to music intervention reporting guidelines. In the majority of studies, the researchers selected the music to induce negative mood states so as to elicit craving. Conclusion: Music has the potential to induce substance craving. While the music used in studies varied considerably and tended to be well controlled from a research design perspective, the music was not based on the music psychology literature, and authors did not adequately report essential aspects of the music. Implications for clinical practice, limitations, and suggestions for future research are provided.