Étudier les musiques populaires : culture, économie, politique. Entretien avec Gérôme Guibert
Andrée Friaud, Maël Hamey Jakubowicz
In this interview, sociologist Gérôme Guibert reflects on Popular Music Studies’ history. Guibert, who has contributed to the development of this field of study in France, discusses recent research on metal music and subcultures, an area to which he has been contributing since the late 1990s. The interview then addresses the presence of so-called “right-wing” values and aesthetics in popular music.
As técnicas estendidas no trompete: relato de interação entre intérpretes/compositores e intérpretes/construtores
Adenilson Roberto Telles, Paulo Adriano Ronqui
This work offers a reflection on the expansion of extended techniques over the history of the trumpet, using as its guiding thread the interaction between the performer/composer and the performer/instrument builder. To this end, historical data on these relationships were gathered through a bibliographic survey and interviews with current national and international soloists, with the aim of tracing the trajectory of the development of extended techniques on the trumpet.
Universal Music Representations? Evaluating Foundation Models on World Music Corpora
Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos
Foundation models have revolutionized music information retrieval, but questions remain about their ability to generalize across diverse musical traditions. This paper presents a comprehensive evaluation of five state-of-the-art audio foundation models across six musical corpora spanning Western popular, Greek, Turkish, and Indian classical traditions. We employ three complementary methodologies to investigate these models' cross-cultural capabilities: probing to assess inherent representations, targeted supervised fine-tuning of 1-2 layers, and multi-label few-shot learning for low-resource scenarios. Our analysis shows varying cross-cultural generalization, with larger models typically outperforming on non-Western music, though results decline for culturally distant traditions. Notably, our approaches achieve state-of-the-art performance on five out of six evaluated datasets, demonstrating the effectiveness of foundation models for world music understanding. We also find that our targeted fine-tuning approach does not consistently outperform probing across all settings, suggesting foundation models already encode substantial musical knowledge. Our evaluation framework and benchmarking results contribute to understanding how far current models are from achieving universal music representations while establishing metrics for future progress.
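As a rough illustration of the probing methodology mentioned above, the sketch below trains a linear classifier on frozen embeddings; the file names, embedding dimensionality, and label set are placeholders rather than details taken from the paper:

# Minimal probing sketch: a linear classifier on frozen foundation-model embeddings.
# "embeddings.npy" and "labels.npy" are hypothetical files standing in for
# pre-extracted features and corpus-specific tags (e.g. makam, raga, or genre).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X = np.load("embeddings.npy")   # shape: (n_tracks, embedding_dim); the model stays frozen
y = np.load("labels.npy")       # shape: (n_tracks,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)   # only this linear layer is trained
probe.fit(X_train, y_train)
print("macro-F1:", f1_score(y_test, probe.predict(X_test), average="macro"))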
YNote: A Novel Music Notation for Fine-Tuning LLMs in Music Generation
Shao-Chien Lu, Chen-Chen Yeh, Hui-Lin Cho
et al.
The field of music generation using Large Language Models (LLMs) is evolving rapidly, yet existing music notation systems, such as MIDI, ABC Notation, and MusicXML, remain too complex for effective fine-tuning of LLMs. These formats are difficult for both machines and humans to interpret due to their variability and intricate structure. To address these challenges, we introduce YNote, a simplified music notation system that uses only four characters to represent a note and its pitch. YNote's fixed format ensures consistency, making it easy to read and more suitable for fine-tuning LLMs. In our experiments, we fine-tuned GPT-2 (124M) on a YNote-encoded dataset and achieved BLEU and ROUGE scores of 0.883 and 0.766, respectively. With just two notes as prompts, the model was able to generate coherent and stylistically relevant music. We believe YNote offers a practical alternative to existing music notations for machine learning applications and has the potential to significantly enhance the quality of music generation using LLMs.
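The exact YNote character layout is not reproduced in the abstract; the toy encoder below merely illustrates the general idea of a fixed-width, four-character token per note, with an invented layout (octave digit, pitch class, accidental flag, duration code) that should not be read as the actual specification:

# Toy illustration of a fixed-format note encoding in the spirit of YNote.
# The four-character layout used here is invented purely to show why a fixed
# format is easy for an LLM tokenizer to consume.
PITCH_CLASSES = "CDEFGAB"
DURATIONS = {"whole": "W", "half": "H", "quarter": "Q", "eighth": "E"}

def encode_note(octave: int, pitch: str, sharp: bool, duration: str) -> str:
    """Return a fixed four-character token, e.g. C sharp 4, quarter note -> '4C#Q'."""
    assert pitch in PITCH_CLASSES and 0 <= octave <= 9
    return f"{octave}{pitch}{'#' if sharp else '-'}{DURATIONS[duration]}"

melody = [(4, "C", False, "quarter"), (4, "E", False, "quarter"), (4, "G", False, "half")]
print(" ".join(encode_note(*n) for n in melody))  # -> "4C-Q 4E-Q 4G-H"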
Cross-Modal Learning for Music-to-Music-Video Description Generation
Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu
et al.
Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV description generation task and propose a comprehensive pipeline encompassing training data construction and multimodal model fine-tuning. We fine-tune existing pre-trained multimodal models on our newly constructed music-to-MV description dataset based on the Music4All dataset, which integrates both musical and visual information. Our experimental results demonstrate that music representations can be effectively mapped to textual domains, enabling the generation of meaningful MV descriptions directly from music inputs. We also identify key components in the dataset construction pipeline that critically impact the quality of MV descriptions and highlight specific musical attributes that warrant greater focus for improved MV description generation.
Multi Agents Semantic Emotion Aligned Music to Image Generation with Music Derived Captions
Junchang Shi, Gang Li
When people listen to music, they often experience rich visual imagery. We aim to externalize this inner imagery by generating images conditioned on music. We propose MESA MIG, a multi-agent semantic- and emotion-aligned framework that first produces structured music captions and then refines them with cooperating agents specializing in scene, motion, style, color, and composition. In parallel, a Valence-Arousal regression head predicts continuous affective states from music, while a CLIP-based visual VA head estimates emotions from images. These components jointly enforce semantic and emotional alignment between music and synthesized images. Experiments on curated music-image pairs show that MESA MIG outperforms caption-only and single-agent baselines in aesthetic quality, semantic consistency, and VA alignment, and achieves competitive emotion regression performance compared with state-of-the-art music and image emotion models.
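A minimal sketch of the emotional-alignment component follows, assuming two valence-arousal regression heads already exist; the embedding dimensions, tanh squashing, and MSE penalty are illustrative choices, not the framework's actual design:

# Hedged sketch: penalize images whose predicted emotion drifts from the music's emotion.
import torch
import torch.nn as nn

music_va_head = nn.Linear(512, 2)   # placeholder music-encoder output dim -> (valence, arousal)
image_va_head = nn.Linear(512, 2)   # placeholder CLIP image-embedding dim -> (valence, arousal)

music_emb = torch.randn(8, 512)     # batch of music embeddings (stand-in)
image_emb = torch.randn(8, 512)     # embeddings of the synthesized images (stand-in)

music_va = torch.tanh(music_va_head(music_emb))   # VA predicted from the music
image_va = torch.tanh(image_va_head(image_emb))   # VA estimated from the images

va_alignment_loss = nn.functional.mse_loss(image_va, music_va)
print(va_alignment_loss.item())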
Chanter son/l’homosexualité pour mieux l’affirmer : Suzane, Eddy de Pretto et autres échos dans la chanson française contemporaine
Renaud Lagabrielle
Suzane and Eddy de Pretto are two French singer-songwriters whose music encourages a direct and rare criticism of heterosexism. Fuelled by the principle of agency, their songs advocate for coming out as well as coming in, and celebrate gay longing and pleasure. This article examines the work and discourse of these two artists, whose common advocacy for LGBTQI+ rights can be observed in songs as well as in media interventions.
Debussy’s Old Hindu Chant (La Boîte à joujoux): Exotic Humorous Fakery and Rejuvenation of Music
Benjamin Lassauzet
Debussy’s ballet for children La Boîte à joujoux (1913) contains a fake “Old Hindu chant”. But the comic fakery seems to provide the key to understanding the deep meaning of the ballet, in which Debussy strives to rejuvenate and revitalize Western music by referring to the so-called spontaneity of both childhood and foreign traditions, for the sake of recovered authenticity.
Intelligent Text-Conditioned Music Generation
Zhouyao Xie, Nikhil Yadala, Xinyi Chen
et al.
CLIP (Contrastive Language-Image Pre-Training) is a multimodal neural network trained on (text, image) pairs to predict the most relevant text caption given an image. It has been used extensively in image generation by connecting its output with a generative model such as VQGAN, with the most notable example being OpenAI's DALLE-2. In this project, we apply a similar approach to bridge the gap between natural language and music. Our model is split into two steps: first, we train a CLIP-like model on pairs of text and music over contrastive loss to align a piece of music with its most probable text caption. Then, we combine the alignment model with a music decoder to generate music. To the best of our knowledge, this is the first attempt at text-conditioned deep music generation. Our experiments show that it is possible to train the text-music alignment model using contrastive loss and train a decoder to generate music from text prompts.
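The text-music alignment step can be sketched with a standard CLIP-style symmetric contrastive loss; the encoders are omitted here, and the random embeddings below merely stand in for encoded (caption, music clip) pairs:

# Sketch of a CLIP-style contrastive objective for (text, music) pairs.
import torch
import torch.nn.functional as F

def clip_style_loss(text_emb: torch.Tensor, music_emb: torch.Tensor, temperature: float = 0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    music_emb = F.normalize(music_emb, dim=-1)
    logits = text_emb @ music_emb.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(text_emb))           # matched pairs sit on the diagonal
    # Symmetric loss: text-to-music and music-to-text retrieval directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = clip_style_loss(torch.randn(16, 256), torch.randn(16, 256))
print(loss.item())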
A Survey of Foundation Models for Music Understanding
Wenjun Li, Ying Cai, Ziyang Wu
et al.
Music is essential in daily life, fulfilling emotional and entertainment needs and connecting us personally, socially, and culturally. A better understanding of music can enhance our emotions, cognitive skills, and cultural connections. The rapid advancement of artificial intelligence (AI) has introduced new ways to analyze music, aiming to replicate human understanding of music and provide related services. Whereas traditional models focused on audio features and simple tasks, recent large language models (LLMs) and foundation models (FMs), which excel across fields by integrating semantic information and demonstrating strong reasoning abilities, can capture complex musical features and patterns, integrate music with language, and incorporate rich musical, emotional, and psychological knowledge. They therefore have the potential to handle complex music understanding tasks from a semantic perspective, producing outputs closer to human perception. This work is, to the best of our knowledge, one of the earliest reviews of the intersection of AI techniques and music understanding. We investigated, analyzed, and tested recent large-scale music foundation models with respect to their music comprehension abilities. We also discussed their limitations and proposed possible future directions, offering insights for researchers in this field.
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
Baisen Wang, Le Zhuo, Zhaokai Wang
et al.
Multimodal music generation aims to produce music from diverse input modalities, including text, videos, and images. Existing methods use a common embedding space for multimodal fusion. Despite their effectiveness in other modalities, their application to multimodal music generation faces challenges of data scarcity, weak cross-modal alignment, and limited controllability. This paper addresses these issues by using explicit bridges of text and music for multimodal alignment. We introduce a novel method named Visuals Music Bridge (VMB). Specifically, a Multimodal Music Description Model converts visual inputs into detailed textual descriptions to provide the text bridge, and a Dual-track Music Retrieval module combines broad and targeted retrieval strategies to provide the music bridge and enable user control. Finally, we design an Explicitly Conditioned Music Generation framework to generate music based on the two bridges. We conduct experiments on video-to-music, image-to-music, text-to-music, and controllable music generation tasks. The results demonstrate that VMB significantly enhances music quality as well as modality and customization alignment compared to previous methods. VMB sets a new standard for interpretable and expressive multimodal music generation with applications in various multimedia fields. Demos and code are available at https://github.com/wbs2788/VMB.
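The "music bridge" retrieval idea can be illustrated by ranking a caption-annotated music library against a generated description; TF-IDF below is only a stand-in for whatever text encoder VMB actually uses, and the captions are invented:

# Toy text-bridge retrieval: rank library tracks by caption similarity to a description.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

library_captions = [
    "slow ambient piano with soft pads",
    "fast energetic electronic dance track",
    "acoustic guitar folk ballad with warm vocals",
]
description = "a calm night scene with gentle piano in the background"

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(library_captions + [description])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best = scores.argmax()
print(f"retrieved: {library_captions[best]} (score={scores[best]:.2f})")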
MUSIC-lite: Efficient MUSIC using Approximate Computing: An OFDM Radar Case Study
Rajat Bhattacharjya, Arnab Sarkar, Biswadip Maity
et al.
Multiple Signal Classification (MUSIC) is a widely used Direction of Arrival (DoA)/Angle of Arrival (AoA) estimation algorithm applied to various application domains such as autonomous driving, medical imaging, and astronomy. However, MUSIC is computationally expensive and challenging to implement in low-power hardware, requiring exploration of trade-offs between accuracy, cost, and power. We present MUSIC-lite, which exploits approximate computing to generate a design space exploring accuracy-area-power trade-offs. This is specifically applied to the computationally intensive singular value decomposition (SVD) component of the MUSIC algorithm in an orthogonal frequency-division multiplexing (OFDM) radar use case. MUSIC-lite incorporates approximate adders into the iterative CORDIC algorithm that is used for hardware implementation of MUSIC, generating interesting accuracy-area-power trade-offs. Our experiments demonstrate MUSIC-lite's ability to save an average of 17.25% on-chip area and 19.4% power with a minimal 0.14% error for efficient MUSIC implementations.
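For context, a plain floating-point MUSIC implementation (no approximate arithmetic) looks roughly like the NumPy sketch below; the uniform linear array, half-wavelength spacing, source angles, and noise level are assumptions for a toy example, not the paper's OFDM radar setup:

# Reference MUSIC pseudospectrum for a uniform linear array with half-wavelength spacing.
import numpy as np
from scipy.signal import find_peaks

def music_spectrum(X: np.ndarray, n_sources: int, n_angles: int = 181):
    """X: (n_sensors, n_snapshots) complex snapshots. Returns (angles_deg, pseudospectrum)."""
    n_sensors = X.shape[0]
    R = X @ X.conj().T / X.shape[1]                     # sample covariance
    eigvals, eigvecs = np.linalg.eigh(R)                # eigenvalues ascending
    En = eigvecs[:, : n_sensors - n_sources]            # noise subspace
    angles = np.linspace(-90, 90, n_angles)
    m = np.arange(n_sensors)[:, None]
    A = np.exp(-1j * np.pi * m * np.sin(np.deg2rad(angles)))   # steering matrix
    denom = np.sum(np.abs(En.conj().T @ A) ** 2, axis=0)
    return angles, 1.0 / denom                           # peaks indicate DoA estimates

# Two sources at -20 and 30 degrees, 8 sensors, 200 snapshots, light noise.
rng = np.random.default_rng(0)
true_angles = np.deg2rad([-20, 30])
steer = np.exp(-1j * np.pi * np.arange(8)[:, None] * np.sin(true_angles))
X = steer @ (rng.standard_normal((2, 200)) + 1j * rng.standard_normal((2, 200)))
X += 0.1 * (rng.standard_normal(X.shape) + 1j * rng.standard_normal(X.shape))

ang, P = music_spectrum(X, n_sources=2)
peaks, _ = find_peaks(P)
best = peaks[np.argsort(P[peaks])[-2:]]
print("estimated DoAs (deg):", np.sort(ang[best]))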
Language Models for Music Medicine Generation
Emmanouil Nikolakakis, Joann Ching, Emmanouil Karystinaios
et al.
Music therapy has been shown in recent years to provide multiple health benefits related to emotional wellness. In turn, maintaining a healthy emotional state has proven to be effective for patients undergoing treatment, such as Parkinson's patients or patients suffering from stress and anxiety. We propose fine-tuning MusicGen, a music-generating transformer model, to create short musical clips that assist patients in transitioning from negative to desired emotional states. Using low-rank decomposition fine-tuning on the MTG-Jamendo Dataset with emotion tags, we generate 30-second clips that adhere to the iso principle, guiding patients through intermediate states in the valence-arousal circumplex. The generated music is evaluated using a music emotion recognition model to ensure alignment with intended emotions. By concatenating these clips, we produce a 15-minute "music medicine" resembling a music therapy session. Our approach is the first to leverage language models to generate music medicine. Ultimately, the output is intended to be used as a temporary relief between music therapy sessions with a board-certified therapist.
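The iso-principle scheduling can be sketched as a simple interpolation in the valence-arousal plane; the start/target coordinates and the number of clips below are illustrative, not the paper's settings:

# Sketch of iso-principle scheduling: per-clip emotion targets moving from the
# listener's current valence-arousal state toward the desired one.
import numpy as np

def iso_trajectory(start_va, target_va, n_clips: int):
    """Return n_clips (valence, arousal) targets interpolated from start to target."""
    start, target = np.asarray(start_va, float), np.asarray(target_va, float)
    steps = np.linspace(0.0, 1.0, n_clips)[:, None]
    return start + steps * (target - start)

# From a low-valence, high-arousal state (anxious) toward high valence, low arousal (calm).
for i, (v, a) in enumerate(iso_trajectory((-0.6, 0.7), (0.5, -0.4), n_clips=5), 1):
    print(f"clip {i}: valence={v:+.2f}, arousal={a:+.2f}")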
Foundation Models for Music: A Survey
Yinghao Ma, Anders Øland, Anton Ragni
et al.
In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained and foundation models in music, spanning representation learning, generative learning, and multimodal learning. We first contextualise the significance of music in various industries and trace the evolution of AI in music. By delineating the modalities targeted by foundation models, we find that many music representations remain underexplored in FM development. We then emphasise the limited versatility of previous methods across diverse music applications, along with the potential of FMs in music understanding, generation, and medical applications. By comprehensively examining model pre-training paradigms, architectural choices, tokenisation, fine-tuning methodologies, and controllability, we highlight important topics that warrant deeper exploration, such as instruction tuning and in-context learning, scaling laws and emergent abilities, and long-sequence modelling. A dedicated section presents insights into music agents, accompanied by a thorough analysis of the datasets and evaluations essential for pre-training and downstream tasks. Finally, underscoring the vital importance of ethical considerations, we advocate that future research on FMs for music focus more on issues such as interpretability, transparency, human responsibility, and copyright. The paper offers insights into future challenges and trends for FMs in music, aiming to shape the trajectory of human-AI collaboration in the music realm.
Azerbaijani Mugham In the Pushkinsky Dom
Alla Bayramova
The Azerbaijani mugham is a traditional musical form, a pearl of Azerbaijani culture, proclaimed by UNESCO as a masterpiece of oral and intangible heritage of humanity in 2003 and inscribed on the Representative List of Intangible Cultural Heritage of Humanity in 2008.
For the Museum of Musical Culture of Azerbaijan, which I represent, everything related to mugham (its music and sound recordings, musical instruments, testimonies of mugham performers, and studies of mugham) is of great interest. Of the more than 60,000 items in the museum's collection, the phono archive comprises approximately 3,000 items, of which the earliest audio recordings of mugham, made in the first decades of the 20th century, form an important part. All of these witnesses to mugham's history are valuable to the public.
The Music Note Ontology
Andrea Poltronieri, Aldo Gangemi
In this paper we propose the Music Note Ontology, an ontology for modelling music notes and their realisation. The ontology addresses the relation between a note represented in a symbolic representation system, and its realisation, i.e. a musical performance. This work therefore aims to solve the modelling and representation issues that arise when analysing the relationships between abstract symbolic features and the corresponding physical features of an audio signal. The ontology is composed of three different Ontology Design Patterns (ODP), which model the structure of the score (Score Part Pattern), the note in the symbolic notation (Music Note Pattern) and its realisation (Musical Object Pattern).
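A hypothetical rdflib sketch of how the three patterns might fit together is given below; the namespace IRI, class names, and property names are placeholders for illustration and do not reproduce the published ontology:

# Placeholder sketch: a score part contains a symbolic note, which is realised
# by a musical object carrying physical (signal-level) features.
from rdflib import Graph, Namespace, RDF, RDFS, Literal

MNO = Namespace("http://example.org/music-note-ontology#")  # placeholder namespace
g = Graph()
g.bind("mno", MNO)

# Classes roughly matching the three Ontology Design Patterns.
for cls in (MNO.ScorePart, MNO.MusicNote, MNO.MusicalObject):
    g.add((cls, RDF.type, RDFS.Class))

g.add((MNO.part1, RDF.type, MNO.ScorePart))
g.add((MNO.note1, RDF.type, MNO.MusicNote))
g.add((MNO.note1, MNO.partOf, MNO.part1))
g.add((MNO.note1, MNO.hasPitch, Literal("C4")))
g.add((MNO.object1, RDF.type, MNO.MusicalObject))
g.add((MNO.object1, MNO.realises, MNO.note1))
g.add((MNO.object1, MNO.hasFundamentalFrequency, Literal(261.63)))

print(g.serialize(format="turtle"))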
In-depth analysis of music structure as a text network
Ping-Rui Tsai, Yen-Ting Chou, Nathan-Christopher Wang
et al.
Music, enchanting and poetic, permeates every corner of human civilization. Although music is not unfamiliar to people, our understanding of its essence remains limited, and there is still no universally accepted scientific description. This is primarily due to music being regarded as a product of both reason and emotion, making it difficult to define. In this article, we focus on the fundamental elements of music and construct an evolutionary network from the perspective of music as a natural language, aligning with the statistical characteristics of texts. Through this approach, we aim to comprehend the structural differences in music across different periods, enabling a more scientific exploration of music. Relying on the advantages of structuralism, we can concentrate on the relationships and order between the physical elements of music, rather than getting entangled in the blurred boundaries of science and philosophy. The scientific framework we present not only conforms to past conclusions in music, but also serves as a bridge that connects music to natural language processing and knowledge graphs.
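A toy version of the note-as-word idea: build a weighted transition network over adjacent pitches and read off simple network statistics, as in the sketch below; the melody is made up, and a real analysis would operate on full corpora per historical period:

# Treat a note sequence like text: weighted transition network over adjacent "words".
import networkx as nx

melody = ["C", "D", "E", "C", "E", "G", "E", "D", "C", "D", "E", "C"]

G = nx.DiGraph()
for a, b in zip(melody, melody[1:]):
    if G.has_edge(a, b):
        G[a][b]["weight"] += 1
    else:
        G.add_edge(a, b, weight=1)

print("nodes:", G.number_of_nodes(), "edges:", G.number_of_edges())
print("degree centrality:", nx.degree_centrality(G))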
Bi-Sampling Approach to Classify Music Mood leveraging Raga-Rasa Association in Indian Classical Music
Mohan Rao B C, Vinayak Arkachaari, Harsha M N
et al.
The impact of music on the mood or emotion of the listener is a well-researched area in human psychology and behavioral science. In Indian classical music, ragas are the melodic structures that define the various styles and forms of the music. Each raga has been found to evoke a specific emotion in the listener. With the advent of advanced audio signal processing capabilities and the application of machine learning, intelligent music classifiers and recommenders have received increased attention, especially in 'Music as a Service' cloud applications. This paper explores a novel framework that leverages the raga-rasa association in Indian classical music to build an intelligent classifier and applies it in a music recommendation system based on the user's current mood and the mood they aspire to be in.
INKBLOT
Serge Bulat
Musical work for the Dossier “Creative practices in Sound Art”
TräumerAI: Dreaming Music with StyleGAN
Dasaem Jeong, Seungheon Doh, Taegyun Kwon
The goal of this paper to generate a visually appealing video that responds to music with a neural network so that each frame of the video reflects the musical characteristics of the corresponding audio clip. To achieve the goal, we propose a neural music visualizer directly mapping deep music embeddings to style embeddings of StyleGAN, named TräumerAI, which consists of a music auto-tagging model using short-chunk CNN and StyleGAN2 pre-trained on WikiArt dataset. Rather than establishing an objective metric between musical and visual semantics, we manually labeled the pairs in a subjective manner. An annotator listened to 100 music clips of 10 seconds long and selected an image that suits the music among the 200 StyleGAN-generated examples. Based on the collected data, we trained a simple transfer function that converts an audio embedding to a style embedding. The generated examples show that the mapping between audio and video makes a certain level of intra-segment similarity and inter-segment dissimilarity.