Supervised Learning for Game Music Segmentation
Shangxuan Luo, Joshua Reiss
At present, neural network-based models, including transformers, struggle to generate memorable and readily comprehensible music from unified and repetitive musical material, largely because they lack an understanding of musical structure. Consequently, these models are rarely employed by the games industry. Many scholars hypothesise that modelling musical structure could inform generative models at a higher level and thereby enhance the quality of music generation. The aim of this study is to explore the performance of supervised learning methods on structural segmentation, the initial step in music structure modelling. An audio game music dataset with 309 structural annotations was created to train the proposed method, which combines convolutional and recurrent neural networks and achieves performance comparable to state-of-the-art unsupervised learning methods with fewer training resources.
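The abstract does not detail the segmentation pipeline, but supervised models of this kind typically emit frame-wise boundary probabilities that are post-processed by peak picking into segment boundaries. A minimal sketch of that final step (the threshold and minimum-gap values are illustrative assumptions, not taken from the paper):

```python
def pick_boundaries(probs, threshold=0.5, min_gap=4):
    """Greedy peak picking over frame-wise boundary probabilities.

    A frame becomes a boundary if its probability exceeds `threshold`,
    is a local maximum, and lies at least `min_gap` frames after the
    previously accepted boundary.
    """
    boundaries = []
    for i, p in enumerate(probs):
        if p < threshold:
            continue
        left = probs[i - 1] if i > 0 else float("-inf")
        right = probs[i + 1] if i + 1 < len(probs) else float("-inf")
        if p >= left and p >= right:  # local maximum
            if not boundaries or i - boundaries[-1] >= min_gap:
                boundaries.append(i)
    return boundaries
```

The `min_gap` constraint mirrors the common practice of suppressing implausibly close boundaries, since musical sections rarely last only a few frames.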
A Design Space for Live Music Agents
Yewon Kim, Stephen Brade, Alexander Wang
et al.
Live music provides a uniquely rich setting for studying creativity and interaction due to its spontaneous nature. The pursuit of live music agents--intelligent systems supporting real-time music performance and interaction--has captivated researchers across HCI, AI, and computer music for decades, and recent advancements in AI suggest unprecedented opportunities to evolve their design. However, the interdisciplinary nature of music has led to fragmented development across research communities, hindering effective communication and collaborative progress. In this work, we bring together perspectives from these diverse fields to map the current landscape of live music agents. Based on our analysis of 184 systems across both academic literature and video, we develop a comprehensive design space that categorizes dimensions spanning usage contexts, interactions, technologies, and ecosystems. By highlighting trends and gaps in live music agents, our design space offers researchers, designers, and musicians a structured lens to understand existing systems and shape future directions in real-time human-AI music co-creation. We release our annotated systems as a living artifact at https://live-music-agents.github.io.
Beats on Airwaves
Dennis Mathei
This article investigates electronic dance music on German radio stations. How is this music selected by editors? Which terminology is employed and presented on-air? What are the implications for audiences and researchers? What types of electronic dance music and related music are currently featured on German radio, and what are their characteristics? To address these questions, a multi-method approach is employed, encompassing playlist and musicological analyses as well as interviews with radio editors. Electronic dance music on German radio is characterised by a heterogeneity of variants and trends, with fluid transitions to other popular music, and distinguishing between its different types can often be challenging. Editors' genre expertise is evident in their work but rarely on-air: genres go unmentioned most of the time, which adds to the confusion surrounding electronic dance music and its contested meanings.
MusFlow: Multimodal Music Generation via Conditional Flow Matching
Jiahao Song, Yuzhao Wang
Music generation aims to create music segments that align with human aesthetics based on diverse conditional information. Despite advancements in generating music from specific textual descriptions (e.g., style, genre, instruments), practical application is still hindered by ordinary users' limited expertise or time to write accurate prompts. To bridge this application gap, this paper introduces MusFlow, a novel multimodal music generation model using Conditional Flow Matching. We employ multiple Multi-Layer Perceptrons (MLPs) to align multimodal conditional information into the audio's CLAP embedding space. Conditional flow matching is trained to reconstruct the compressed Mel-spectrogram in the pretrained VAE latent space, guided by the aligned feature embeddings. MusFlow can generate music from images, story texts, and music captions. To collect data for model training, inspired by multi-agent collaboration, we construct an intelligent data annotation workflow centered around a fine-tuned Qwen2-VL model. Using this workflow, we build a new multimodal music dataset, MMusSet, with each sample containing a quadruple of image, story text, music caption, and music piece. We conduct four sets of experiments: image-to-music, story-to-music, caption-to-music, and multimodal music generation. Experimental results demonstrate that MusFlow can generate high-quality music pieces whether the input conditions are unimodal or multimodal. We hope this work can advance the application of music generation in the multimedia field, making music creation more accessible. Our generated samples, code and dataset are available at musflow.github.io.
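The abstract does not spell out the flow-matching objective. A minimal sketch of the standard conditional-flow-matching training targets on a linear probability path, where `x0` would be a noise sample and `x1` the VAE latent (both shown here as plain vectors; the network would regress the returned velocity target at `(x_t, t, condition)`):

```python
def cfm_training_targets(x0, x1, t):
    """One conditional-flow-matching sample on the linear path.

    Interpolates x_t = (1 - t) * x0 + t * x1 and returns (x_t, target),
    where target = x1 - x0 is the constant velocity the model should
    predict along this path.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target

def cfm_loss(v_pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, target)) / len(target)
```

At inference, integrating the learned velocity field from t = 0 to t = 1 transports noise to a latent, which the VAE decoder then turns into a Mel-spectrogram.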
A Survey on Cross-Modal Interaction Between Music and Multimodal Data
Sifei Li, Mining Tan, Feier Shen
et al.
Multimodal learning has driven innovation across various industries, particularly in the field of music. By enabling more intuitive interaction experiences and enhancing immersion, it not only lowers the entry barriers to music but also increases its overall appeal. This survey aims to provide a comprehensive review of multimodal tasks related to music, outlining how music contributes to multimodal learning and offering insights for researchers seeking to expand the boundaries of computational music. Unlike text and images, which are often semantically or visually intuitive, music primarily interacts with humans through auditory perception, making its data representation inherently less intuitive. Therefore, this paper first introduces the representations of music and provides an overview of music datasets. Subsequently, we categorize cross-modal interactions between music and multimodal data into three types: music-driven cross-modal interactions, music-oriented cross-modal interactions, and bidirectional music cross-modal interactions. For each category, we systematically trace the development of relevant sub-tasks, analyze existing limitations, and discuss emerging trends. Furthermore, we provide a comprehensive summary of datasets and evaluation metrics used in multimodal tasks related to music, offering benchmark references for future research. Finally, we discuss the current challenges in cross-modal interactions involving music and propose potential directions for future research.
Música moderna: la new wave en el contexto español
Sara Arenillas Meléndez
This article reflects on new wave as a musical genre and on the problems of transposing it to the Spanish context, through the particular case of the Movida and the band Radio Futura. It addresses elements such as pastiche, irony, and the negotiations over the concept of modernity that arose in the milieu of the eighties. The methodology draws on strategies from cultural studies as well as others taken from traditional musicology for the analysis of certain musical elements. A first section thus addresses the problems and contradictions of new wave and its conflicted relationship with postmodernity. A second section examines the specific case of Radio Futura as a band that, in its early days, adapted this genre to the Spanish sphere. Finally, through the song "Enamorado de la moda juvenil", features characteristic of new wave, such as the legacy of punk and the articulation of irony, are analysed, and their functioning outside Anglophone scenes is problematised.
ArchiTone: A LEGO-Inspired Gamified System for Visualized Music Education
Jiaxing Yu, Tieyao Zhang, Songruoyao Wu
et al.
Participation in music activities has many benefits, but often requires music theory knowledge and aural skills, which can be challenging for beginners. To help them engage more easily, it is crucial to adopt teaching strategies that lower these barriers. Informed by a formative investigation and inspired by LEGO, we introduce ArchiTone, a gamified system that employs constructivism by visualizing music theory concepts as musical blocks and buildings for music education. The system includes two modes: Learning Mode, in which learners recognize and study common musical blocks through familiar musical works, and Creation Mode, in which they freely create and combine musical blocks to produce new musical works. User studies demonstrate that our gamified system is not only more engaging than traditional music education methods but also more effective in helping learners understand abstract music theory and apply it to music praxis. Additionally, learners demonstrate superior performance on music theory tasks after using ArchiTone.
UniMuMo: Unified Text, Music and Motion Generation
Han Yang, Kun Su, Yutong Zhang
et al.
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired music and motion data based on rhythmic patterns to leverage existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representations, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with a single training task of music-motion joint generation. Moreover, the model is designed by fine-tuning existing pre-trained single-modality models, significantly reducing computational demands. Extensive experiments demonstrate that UniMuMo achieves competitive results on all unidirectional generation benchmarks across music, motion, and text modalities. Quantitative results are available on the project page: https://hanyangclarence.github.io/unimumo_demo/.
MuCodec: Ultra Low-Bitrate Music Codec
Yaoxun Xu, Hangting Chen, Jianwei Yu
et al.
Music codecs are a vital aspect of audio codec research, and ultra-low-bitrate compression holds significant importance for music transmission and generation. Due to the complexity of music backgrounds and the richness of vocals, solely relying on modeling semantic or acoustic information cannot effectively reconstruct music with both vocals and backgrounds. To address this issue, we propose MuCodec, specifically targeting music compression and reconstruction tasks at ultra low bitrates. MuCodec employs MuEncoder to extract both acoustic and semantic features, discretizes them with RVQ, and obtains Mel-VAE features via flow matching. The music is then reconstructed using a pre-trained Mel-VAE decoder and HiFi-GAN. MuCodec can reconstruct high-fidelity music at ultra-low (0.35 kbps) or high (1.35 kbps) bitrates, achieving the best results to date in both subjective and objective metrics. Code and Demo: https://xuyaoxun.github.io/MuCodec_demo/.
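The abstract names residual vector quantization (RVQ) but does not describe it. A minimal sketch of the core idea, with tiny hand-made codebooks for illustration (real codecs learn many large codebooks): each stage quantizes the residual left over by the previous stage, so bitrate scales with the number of stages.

```python
def rvq_encode(vec, codebooks):
    """Residual vector quantization: stage k picks the nearest codeword
    (squared Euclidean distance) to the residual left by stage k-1."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = min(
            range(len(codebook)),
            key=lambda k: sum((r - c) ** 2 for r, c in zip(residual, codebook[k])),
        )
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out
```

Transmitting only the per-stage indices (a few bits each per frame) rather than the feature vectors is what makes ultra-low bitrates possible.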
Presentazione
Corrado Roselli
Reconstrucción y análisis de la red social de Francisco Asenjo Barbieri en Valencia (España) a partir de la correspondencia (1852-1893)
María Ordiñana Gil
The main focus of this article is to identify and study Francisco Asenjo Barbieri's links with Valencia (Spain) through his correspondence, using a qualitative and quantitative approach. The article presents a graphic reconstruction of his network of correspondents residing in the Comunidad Valenciana, analyzing his relationships with them through an intensive study of the letters using the metrics and parameters of Social Network Analysis. Combining these techniques has allowed us to confirm the existence of a network of correspondents and identify its main agents and the groups and subgroups among them. Along with studying the structure and morphology of the resulting network, we have analyzed the motives underlying the correspondence and, therefore, explored the nature and substance of these relationships. As we will see, Barbieri was able to use the network and establish links to further his name and his work, provide information, and acquire books for his well-known library. His correspondents, in turn, took advantage of their relationship with him to gain his artistic collaboration, flaunt their friendship, and obtain information or intermediation.
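The article applies Social Network Analysis metrics to the reconstructed correspondence network. As a hedged illustration of one such metric, here is normalised degree centrality computed over a small, entirely hypothetical edge list (the correspondent names "A", "B", "C" are invented placeholders, not figures from the study):

```python
from collections import defaultdict

def degree_centrality(edges):
    """Normalised degree centrality for an undirected correspondence
    network: degree / (n - 1), where n is the number of actors."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    n = len(adj)
    return {node: len(neigh) / (n - 1) for node, neigh in adj.items()}
```

A centrality of 1.0 means an actor corresponds with every other actor in the network, which is the pattern one would expect for the hub of an ego network such as Barbieri's.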
Peribiophoty
Hsu-Ming Teo, Karen Pearlman, Malcolm Choat
et al.
Creative practitioners inside academia are often tasked with explaining how their embodied practices constitute research. The Peribiophoty project reverses this paradigm to ask: how does academic research constitute embodied practice? By considering the personal and intellectual contexts (peri), surrounding academics and their biographies (bio), through audio-visual representation (photy), we investigate how academic thinking is embodied thinking. The notion that “traditional” research only involves the brain is challenged by the audio-visual representations of thoughts and ideas embedded in objects, experience, time, and interactions. Peribiophoty makes its propositions about academic thinking through embodied presence and rhythmic juxtapositions of gesture, things, place, text on screen and voice. It evokes the narrative pasts and selves of the project’s literature, history, and digital games scholars as substantively entangled with their ongoing research programs and demonstrates that their academic research is necessarily an embodied and embedded practice.
Transfer Learning for Underrepresented Music Generation
Anahita Doosti, Matthew Guzdial
This paper investigates a combinational creativity approach to transfer learning to improve the performance of deep neural network-based models for music generation on out-of-distribution (OOD) genres. We identify Iranian folk music as an example of such an OOD genre for MusicVAE, a large generative music model. We find that a combinational creativity transfer learning approach can efficiently adapt MusicVAE to an Iranian folk music dataset, indicating potential for generating underrepresented music genres in the future.
Music Rearrangement Using Hierarchical Segmentation
Christos Plachouras, Marius Miron
Music rearrangement involves reshuffling, deleting, and repeating sections of a music piece with the goal of producing a standalone version that has a different duration. It is a creative and time-consuming task commonly performed by an expert music engineer. In this paper, we propose a method for automatically rearranging music recordings that takes into account the hierarchical structure of the recording. Previous approaches focus solely on identifying cut-points in the audio that could result in smooth transitions. We instead utilize deep audio representations to hierarchically segment the piece and define a cut-point search subject to the boundaries and musical functions of the segments. We score suitable entry- and exit-point pairs based on their similarity and the segments they belong to, and define an optimal path search. Experimental results demonstrate the selected cut-points are most commonly imperceptible by listeners and result in more consistent musical development with less distracting repetitions.
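The abstract describes scoring entry- and exit-point pairs by their similarity and by the segments they belong to. A minimal sketch of such a score under stated assumptions: cosine similarity between the deep audio embeddings at the two candidate frames, plus a small bonus when both frames fall in segments with the same musical function (the `function_bonus` weighting is an invented illustration, not the paper's formula):

```python
import math

def cut_point_score(exit_vec, entry_vec, same_function, function_bonus=0.1):
    """Score an (exit, entry) cut-point pair.

    Cosine similarity of the audio embeddings at the two frames,
    plus a bonus when both frames belong to segments sharing the
    same musical function (e.g. both in a chorus).
    """
    dot = sum(a * b for a, b in zip(exit_vec, entry_vec))
    norm = (math.sqrt(sum(a * a for a in exit_vec))
            * math.sqrt(sum(b * b for b in entry_vec)))
    sim = dot / norm if norm else 0.0
    return sim + (function_bonus if same_function else 0.0)
```

An optimal path search over such pairwise scores would then select the sequence of cuts that reaches the target duration with the smoothest transitions.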
Mixing Levels -- A Rock Music Spirit Level App
Tim Ziemer
To date, sonification apps are rare. Music apps, on the other hand, are widely used, and smartphone users like to play with music. In this manuscript, we present Mixing Levels, a spirit level sonification based on music mixing. Tilting the smartphone adjusts the volumes of five musical instruments in a rock music loop; only when the phone is perfectly leveled are all instruments in the mix well audible. The app is intended to be both useful and fun. Because it appears like a music mixing console, people enjoy interacting with Mixing Levels, making learning the sonification a playful experience.
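The tilt-to-volume mapping is only described at a high level in the abstract. A hypothetical sketch of such a mapping, assuming a linear fade from full gain at level to silence at a maximum tilt angle; the assignment of instruments to tilt axes is invented for illustration and is not taken from the app:

```python
def mix_gains(pitch, roll, max_tilt=30.0):
    """Map phone tilt (degrees) to per-instrument gains.

    All five instruments reach full gain (1.0) only when the phone
    is perfectly level; each instrument fades along its own
    (hypothetically assigned) combination of tilt axes.
    """
    def gain(angle):
        # 1.0 at level, fading linearly to 0.0 at max_tilt degrees.
        return max(0.0, 1.0 - min(abs(angle), max_tilt) / max_tilt)

    axes = {
        "drums": pitch,
        "bass": roll,
        "guitar": (pitch + roll) / 2,
        "keys": (pitch - roll) / 2,
        "vocals": max(abs(pitch), abs(roll)),
    }
    return {name: gain(angle) for name, angle in axes.items()}
```

Listening for the instrument that drops out of the mix then tells the user which way, and how far, the phone is tilted.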
From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation
Adarsh Kumar, Pedro Sarmento
Subword tokenization has been widely successful in text-based natural language processing (NLP) tasks with Transformer-based models. As Transformer models become increasingly popular in symbolic music-related studies, it is imperative to investigate the efficacy of subword tokenization in the symbolic music domain. In this paper, we explore subword tokenization techniques, such as byte-pair encoding (BPE), in symbolic music generation and their impact on the overall structure of generated songs. Our experiments are based on three types of MIDI datasets: single-track melody only, multi-track with a single instrument, and multi-track and multi-instrument. We apply subword tokenization on top of musical tokenization schemes and find that it enables the generation of longer songs within the same token budget and improves the overall structure of the generated music in terms of objective metrics such as the structure indicator (SI) and Pitch Class Entropy. We also compare two subword tokenization methods, BPE and Unigram, and observe that both lead to consistent improvements. Our study suggests that subword tokenization is a promising technique for symbolic music generation and may have broader implications for music composition, particularly in cases involving complex data such as multi-track songs.
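To make the idea concrete, one BPE step on a toy sequence of musical tokens (the token names are illustrative, not the paper's scheme): the most frequent adjacent pair is merged into a single token, shortening the sequence.

```python
from collections import Counter

def bpe_learn_merge(tokens):
    """Return the most frequent adjacent token pair (one BPE step)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def bpe_apply_merge(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + "+" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

Repeating these two steps builds a merge vocabulary; because each merged token covers several musical events, a fixed-length Transformer context then spans more musical time.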
Християнське диво агностика Форе: про що мовчить «Requiem»
Ю. В. Воскобойнікова
The article substantiates a performance concept for Gabriel Fauré's Requiem based on a study of his biography, religious-philosophical views, and psychological traits. Using textological, contextual, and comparative analysis, it identifies the semantic and emotional grounds for the atypical structure of Fauré's requiem mass, outlines the semantic and leitmotivic connections between the different movements of the Requiem, and offers recommendations on determining the tempi the composer left unspecified, on vocal manner, and on the intonation of the verbal text. It establishes that Fauré's central idea was the liberation of the human being from the sufferings of earthly life, dictated by sincere compassion for his compatriots who suffered during the Franco-Prussian War, which in turn gave the traditional requiem genre a new semantic content.
Paris Conservatoire and Development of Violin Performance
Polina B. Podmazova
The opening of the Paris Conservatoire in 1795 played a huge role in the nineteenth-century development of French and wider European performing arts. The article discusses the uniqueness of violin education in eighteenth-century France and identifies the reasons for the emergence of the world's first educational institution that provided free tuition to all its students regardless of their social background. The article is supplemented with a Russian-language translation of the law on the establishment of the Paris Conservatoire and a list of its first professors. It provides a detailed analysis of the initial stage in the development of violin training, associated with the names of such famous French violinists as Pierre Gaviniès, Rodolphe Kreutzer, Pierre Rode, and Pierre Baillot. Their well-coordinated methodological efforts and commitment to common aesthetic ideals fostered a very powerful violin school that left the rest of Europe behind. The Paris Conservatoire concentrated the best of the country's artistic potential, thus shaping the musical landscape of France. It also served as a role model for other secular educational institutions located in Europe's major musical centers.
Multitrack Music Transformer
Hao-Wen Dong, Ke Chen, Shlomo Dubnov
et al.
Existing approaches for generating multitrack music with transformer models have been limited in the number of instruments, the length of the music segments, and inference speed. This is partly due to the memory requirements of the lengthy input sequences necessitated by existing representations. In this work, we propose a new multitrack music representation that allows a diverse set of instruments while keeping a short sequence length. Our proposed Multitrack Music Transformer (MMT) achieves comparable performance with state-of-the-art systems, landing between two recently proposed models in a subjective listening test, while achieving substantial speedups and memory reductions over both, making the method attractive for real-time improvisation or near-real-time creative applications. Further, we propose a new measure for analyzing musical self-attention and show that the trained model attends more to notes that form a consonant interval with the current note and to notes that are 4N beats away from the current step.
Nicholas Baragwanath, The Solfeggio Tradition. A Forgotten Art of Melody in the Long Eighteenth Century, New York: Oxford University Press 2020
Hans Aerts