Zipf-Mandelbrot Scaling in Korean Court Music: Universal Patterns in Music
Byeongchan Choi, Junwon You, Myung Ock Kim
et al.
Zipf's law, originally discovered in natural language and later generalized to the Zipf-Mandelbrot law, describes a power-law relationship between the frequency of a Zipfian element and its rank. Because the characteristics underlying this law are not specific to language, it has also been observed in musical data. However, most such studies have focused on Western music, and its applicability to non-Western music remains largely uninvestigated. We analyzed 43 Korean court music pieces, called Jeong-ak, spanning several centuries and written in the traditional Korean musical notation Jeongganbo. These pieces were transcribed into Western staff notation, and musical data such as pitch and duration were extracted. Using pitch, duration, and their paired combinations as Zipfian units, we found that Korean music also fits the Zipf-Mandelbrot law to a high degree, particularly for the paired pitch-duration unit. Korean music has evolved collectively over long periods, smoothing idiosyncratic variations and producing widely understandable forms. This collective evolution appears to have played a significant role in shaping the characteristics that lead to satisfaction of the Zipf-Mandelbrot law. Our findings provide additional evidence that Zipf-Mandelbrot scaling in musical data is universal across cultures. We further show that the joint distribution of two independent Zipfian data sets follows the Zipf-Mandelbrot law; in this sense, our result does not merely extend Zipf's law but deepens our understanding of how scaling laws behave under composition and interaction, offering a more unified perspective on rank-based statistical regularities.
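As a toy illustration (not the paper's data or code), the Zipf-Mandelbrot law f(r) = C / (r + b)^a can be fitted to rank-frequency counts by least squares in log space; here the counts are generated from the law itself, standing in for frequencies of pitch-duration pairs:

```python
import math

# Zipf-Mandelbrot law: frequency f(r) = C / (r + b)^a for rank r.
def zipf_mandelbrot(rank, C, a, b):
    return C / (rank + b) ** a

# Toy rank-frequency data generated from the law itself (a=1.2, b=2.7).
ranks = list(range(1, 101))
freqs = [zipf_mandelbrot(r, 1000.0, 1.2, 2.7) for r in ranks]

def sse(a, b):
    # Least-squares error in log space, with log C chosen optimally for
    # the given (a, b): log f = log C - a*log(r + b), so log C is the
    # mean of log f + a*log(r + b) over all ranks.
    logs = [math.log(f) + a * math.log(r + b) for r, f in zip(ranks, freqs)]
    logC = sum(logs) / len(logs)
    return sum((math.log(f) - (logC - a * math.log(r + b))) ** 2
               for r, f in zip(ranks, freqs))

# Coarse grid search over the exponent a and the offset b.
best = min(((a / 100, b / 10) for a in range(80, 161) for b in range(0, 51)),
           key=lambda p: sse(*p))
print(best)
```

On real counts extracted from scores one would of course expect residual noise, and a finer grid or a proper optimizer would replace the coarse search.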
ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis
Stephen Ni-Hahn, Chao Péter Yang, Mingchen Ma
et al.
Artificial Intelligence (AI) for music generation is undergoing rapid development, with recent symbolic models leveraging sophisticated deep learning and diffusion model algorithms. One drawback of existing models is that they lack structural cohesion, particularly in harmonic-melodic structure. Furthermore, such models are largely "black-box" in nature and are not musically interpretable. This paper addresses these limitations via a novel generative music framework that incorporates concepts of Schenkerian analysis (SchA) in concert with a diffusion modeling framework. This framework, which we call ProGress (Prolongation-enhanced DiGress), adapts state-of-the-art deep models for discrete diffusion (in particular, the DiGress model of Vignac et al., 2023) for interpretable and structured music generation. Concretely, our contributions include 1) novel adaptations of the DiGress model for music generation, 2) a novel SchA-inspired phrase fusion methodology, and 3) a framework allowing users to control various aspects of the generation process to create coherent musical compositions. Results from human experiments suggest superior performance to existing state-of-the-art methods.
NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms
Yashan Wang, Shangda Wu, Jianhuai Hu
et al.
We introduce NotaGen, a symbolic music generation model aiming to explore the potential of producing high-quality classical sheet music. Inspired by the success of Large Language Models (LLMs), NotaGen adopts pre-training, fine-tuning, and reinforcement learning paradigms (henceforth referred to as the LLM training paradigms). It is pre-trained on 1.6M pieces of music in ABC notation, and then fine-tuned on approximately 9K high-quality classical compositions conditioned on "period-composer-instrumentation" prompts. For reinforcement learning, we propose the CLaMP-DPO method, which further enhances generation quality and controllability without requiring human annotations or predefined rewards. Our experiments demonstrate the efficacy of CLaMP-DPO in symbolic music generation models with different architectures and encoding schemes. Furthermore, subjective A/B tests against human compositions show that NotaGen outperforms baseline models, greatly advancing musical aesthetics in symbolic music generation.
Music Tagging with Classifier Group Chains
Takuya Hasumi, Tatsuya Komatsu, Yusuke Fujita
We propose music tagging with classifier chains that model the interplay of music tags. Most conventional methods estimate multiple tags independently by treating them as independent binary classification problems. This treatment overlooks the conditional dependencies among music tags, leading to suboptimal tagging performance. Unlike most music taggers, the proposed method sequentially estimates each tag based on the idea of classifier chains. Going beyond naive classifier chains, the proposed method groups the tags by category, such as genre, and performs the chain over these groups, which we call "classifier group chains". Our method allows the modeling of dependence between tag groups. We evaluate the effectiveness of the proposed method through music tagging experiments using the MTG-Jamendo dataset. Furthermore, we investigate the effective order of chains for music tagging.
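A minimal sketch of the chaining idea (the structure and names below are hypothetical, not the paper's implementation): each group's classifier receives the input features together with the predictions of all preceding groups, so inter-group dependencies, such as mood tags conditioned on a predicted genre, can be modeled:

```python
# Sketch of classifier group chains: tags are grouped by category, and each
# group's classifier is conditioned on the predictions of earlier groups.

def predict_chain(features, group_classifiers):
    """group_classifiers: ordered list of (tag_names, classify_fn) pairs,
    where classify_fn(features, prev_preds) -> dict of tag -> 0/1."""
    preds = {}
    for tags, classify in group_classifiers:
        out = classify(features, dict(preds))  # conditioned on earlier groups
        preds.update({t: out[t] for t in tags})
    return preds

# Toy stand-in classifiers: 'genre' is decided from the features alone, while
# 'mood' is conditioned on the predicted genre (the inter-group dependency).
genre = (["rock", "jazz"],
         lambda f, p: {"rock": int(f["energy"] > 0.5),
                       "jazz": int(f["energy"] <= 0.5)})
mood = (["calm", "intense"],
        lambda f, p: {"calm": p.get("jazz", 0), "intense": p.get("rock", 0)})

print(predict_chain({"energy": 0.8}, [genre, mood]))
```

In practice each `classify_fn` would be a trained model over audio embeddings rather than a hand-written rule, but the data flow, features plus earlier group outputs, is the point of the chain.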
Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation
Xinyi Tong, Yiran Zhu, Jishang Chen
et al.
Video-to-Music generation seeks to generate musically appropriate background music that enhances audiovisual immersion for videos. However, current approaches suffer from two critical limitations: 1) incomplete representation of video details, leading to weak alignment, and 2) inadequate temporal and rhythmic correspondence, particularly in achieving precise beat synchronization. To address these challenges, we propose Video Echoed in Music (VeM), a latent music diffusion model that generates high-quality soundtracks with semantic, temporal, and rhythmic alignment for input videos. To capture video details comprehensively, VeM employs hierarchical video parsing that acts as a music conductor, orchestrating multi-level information across modalities. Modality-specific encoders, coupled with a storyboard-guided cross-attention mechanism (SG-CAtt), integrate semantic cues while maintaining temporal coherence through position and duration encoding. For rhythmic precision, the frame-level transition-beat aligner and adapter (TB-As) dynamically synchronize visual scene transitions with music beats. We further contribute a novel video-music paired dataset sourced from e-commerce advertisements and video-sharing platforms, which imposes stricter transition-beat synchronization requirements. Meanwhile, we introduce novel metrics tailored to the task. Experimental results demonstrate the superiority of VeM, particularly in semantic relevance and rhythmic precision.
Music2Fail: Transfer Music to Failed Recorder Style
Chon In Leong, I-Ling Chung, Kin-Fong Chao
et al.
The goal of music style transfer is to convert a music performance by one instrument into another while keeping the musical content unchanged. In this paper, we investigate another style transfer scenario called "failed-music style transfer". Unlike the usual music style transfer, where the content remains the same and only the instrumental characteristics are changed, this scenario seeks to transfer music from the source instrument to a target instrument that is deliberately played off-pitch. Our work attempts to transfer normally played music into off-pitch recorder music, which we call "failed-style recorder", and studies the results of the conversion. To carry out this work, we have also proposed a dataset of failed-style recorder performances for this task, called the "FR109 Dataset". This experiment explores the music style transfer task in a more expressive setting, as the generated audio should sound like an "off-pitch recorder" while maintaining a certain degree of naturalness.
Source Separation & Automatic Transcription for Music
Bradford Derby, Lucas Dunker, Samarth Galchar
et al.
Source separation is the process of isolating individual sounds in an auditory mixture of multiple sounds [1], and has a variety of applications ranging from speech enhancement and lyric transcription [2] to digital audio production for music. Automatic Music Transcription (AMT), in turn, is the process of converting raw music audio into sheet music that musicians can read [3]. Historically, these tasks have faced challenges such as significant audio noise, long training times, and a lack of free-use data due to copyright restrictions. However, recent developments in deep learning have brought promising new approaches to building low-distortion stems and generating sheet music from audio signals [4]. Using spectrogram masking, deep neural networks, and the MuseScore API, we attempt to create an end-to-end pipeline that allows an initial music audio mixture (e.g., a .wav file) to be separated into instrument stems, converted into MIDI files, and transcribed into sheet music for each component instrument.
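The spectrogram-masking step of such a pipeline can be sketched in a few lines. This is an idealised illustration with random toy magnitudes (not the pipeline's actual code), in which the oracle ratio mask recovers a source from the mixture:

```python
import numpy as np

# Minimal sketch of spectrogram (ratio) masking: given magnitude spectrograms
# of two sources, a source is separated from the mixture by the soft mask
# M_i = |S_i| / sum_j |S_j|.
rng = np.random.default_rng(0)
vocals = rng.random((4, 5))          # toy magnitude spectrograms
drums = rng.random((4, 5))           # (freq bins x time frames)
mixture = vocals + drums             # magnitudes assumed additive (idealised)

mask_vocals = vocals / (vocals + drums + 1e-8)
est_vocals = mask_vocals * mixture   # apply the mask to the mixture

# In this oracle setting the mask recovers the source almost exactly; a deep
# network would instead have to *predict* the mask from the mixture alone.
print(np.allclose(est_vocals, vocals, atol=1e-6))
```

Real systems work on STFT magnitudes of actual audio and reconstruct waveforms from the masked spectrograms, but the masking arithmetic is exactly this element-wise product.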
Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio
Roser Batlle-Roca, Wei-Hsiang Liao, Xavier Serra
et al.
Recent advancements in music generation are raising multiple concerns about the implications of AI in creative music processes, current business models, and impacts related to intellectual property management. A relevant discussion and related technical challenge is the potential replication and plagiarism of the training set in AI-generated music, which could lead to misuse of data and intellectual property rights violations. To tackle this issue, we present the Music Replication Assessment (MiRA) tool: a model-independent open evaluation method based on diverse audio music similarity metrics to assess data replication. We evaluate the ability of five metrics to identify exact replication by conducting a controlled replication experiment in different music genres using synthetic samples. Our results show that the proposed methodology can detect exact data replication when the replicated proportion is higher than 10%. By introducing the MiRA tool, we intend to encourage the open evaluation of music-generative models by researchers, developers, and users concerning data replication, highlighting the importance of its ethical, social, legal, and economic consequences. Code and examples are available for reproducibility purposes.
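In the spirit of such metric-based checks (the function names and threshold below are illustrative, not MiRA's actual API), replication flagging reduces to comparing a generated piece's features against each training item and flagging near-identical matches:

```python
# Hypothetical sketch of threshold-based replication flagging.
def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def flag_replication(generated, training_set, threshold=0.99):
    """Return indices of training items whose feature vectors are
    near-identical to the generated piece's features."""
    return [i for i, t in enumerate(training_set)
            if cosine(generated, t) >= threshold]

gen = [0.2, 0.7, 0.1]
train = [[0.2, 0.7, 0.1],   # exact copy -> flagged
         [0.9, 0.1, 0.3]]   # unrelated  -> not flagged
print(flag_replication(gen, train))
```

A real audio-similarity metric would operate on learned embeddings or spectral features rather than raw three-dimensional vectors, and the threshold would be calibrated against controlled replication experiments like those in the paper.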
Multimodal Embodiment Research of Oral Music Traditions: Electromyography in Oud Performance and Education Research of Persian Art Music
Stella Paschalidou
With the recent advent of research focusing on the body’s significance in music, the integration of physiological sensors into empirical methodologies for music has also gained momentum. Given the recognition of covert muscular activity as a strong indicator of musical intentionality, and the previously ascertained link between physical effort and various musical aspects, electromyography (EMG)—signals representing muscle activity—has also seen a noticeable surge in use. While EMG technologies appear to hold good promise for sensing, capturing, and interpreting the dynamic properties of movement in music, which are considered innately linked to artistic expressive power, they also come with certain challenges, misconceptions, and predispositions. The paper engages in a critical examination of the utilisation of muscle force values from EMG sensors as indicators of physical effort and musical activity, particularly focusing on the intuitively expected link to sound levels. For this, it draws on empirical work, namely practical insights from a case study of music performance (Persian instrumental music) in the context of a music class. The findings indicate that muscle force can be explained by a small set of (six) statistically significant acoustic and movement features, the latter captured by a state-of-the-art (full-body inertial) motion capture system. However, no straightforward link to sound levels is evident.
When the Music Stops: Tip-of-the-Tongue Retrieval for Music
Samarth Bhargav, Anne Schuth, Claudia Hauff
We present a study of Tip-of-the-tongue (ToT) retrieval for music, where a searcher is trying to find an existing music entity, but is unable to succeed as they cannot accurately recall important identifying information. ToT information needs are characterized by complexity, verbosity, uncertainty, and possible false memories. We make four contributions. (1) We collect a dataset - $ToT_{Music}$ - of 2,278 information needs and ground truth answers. (2) We introduce a schema for these information needs and show that they often involve multiple modalities encompassing several Music IR subtasks such as lyric search, audio-based search, audio fingerprinting, and text search. (3) We underscore the difficulty of this task by benchmarking a standard text retrieval approach on this dataset. (4) We investigate the efficacy of query reformulations generated by a large language model (LLM), and show that they are not as effective as simply employing the entire information need as a query - leaving several open questions for future research.
Mustango: Toward Controllable Text-to-Music Generation
Jan Melechovsky, Zixun Guo, Deepanway Ghosal
et al.
The quality of text-to-music models has reached new heights due to recent advancements in diffusion models. The controllability of various musical aspects, however, has barely been explored. In this paper, we propose Mustango: a music-domain-knowledge-inspired text-to-music system based on diffusion. Mustango aims to control the generated music not only through general text captions, but through richer captions that can include specific instructions related to chords, beats, tempo, and key. At the core of Mustango is MuNet, a Music-Domain-Knowledge-Informed UNet guidance module that steers the generated music to include the music-specific conditions, which we predict from the text prompt, as well as the general text embedding, during the reverse diffusion process. To overcome the limited availability of open datasets of music with text captions, we propose a novel data augmentation method that alters the harmonic, rhythmic, and dynamic aspects of music audio and uses state-of-the-art Music Information Retrieval methods to extract music features, which are then appended to the existing descriptions in text format. We release the resulting MusicBench dataset, which contains over 52K instances and includes music-theory-based descriptions in the caption text. Through extensive experiments, we show that the quality of the music generated by Mustango is state-of-the-art, and that its controllability through music-specific text prompts greatly outperforms other models such as MusicGen and AudioLDM2.
MuseCoco: Generating Symbolic Music from Text
Peiling Lu, Xin Xu, Chenfei Kang
et al.
Generating music from text descriptions is a user-friendly mode, since text is a relatively easy interface for user engagement. While some approaches utilize text to control music audio generation, editing musical elements in generated audio is challenging for users. In contrast, symbolic music offers ease of editing, making it more accessible for users to manipulate specific musical elements. In this paper, we propose MuseCoco, which generates symbolic music from text descriptions, using musical attributes as the bridge to break down the task into text-to-attribute understanding and attribute-to-music generation stages. MuseCoco stands for Music Composition Copilot: it empowers musicians to generate music directly from given text descriptions, offering a significant improvement in efficiency compared to creating music entirely from scratch. The system has two main advantages. Firstly, it is data efficient: in the attribute-to-music generation stage, the attributes can be directly extracted from music sequences, making the model training self-supervised, and in the text-to-attribute understanding stage, the text is synthesized and refined by ChatGPT based on the defined attribute templates. Secondly, the system can achieve precise control with specific attributes in text descriptions and offers multiple control options through attribute-conditioned or text-conditioned approaches. MuseCoco outperforms baseline systems in terms of musicality, controllability, and overall score by at least 1.27, 1.08, and 1.32, respectively. Besides, there is a notable enhancement of about 20% in objective control accuracy. In addition, we have developed a robust large-scale model with 1.2 billion parameters, showcasing exceptional controllability and musicality.
Exploring Chinese Cultural Identity in the Liang Zhu Violin Concerto: An Intercultural Perspective on the Adaptation of Traditional Elements in Western Classical Music Language
Maria-Magdalena SUCIU, Stela DRĂGULIN
This article examines the phenomenon of interculturality through the lens of the Liang Zhu Concerto for Violin and Orchestra by He Zhanhao and Chen Gang. Interculturality is no longer merely a means of elevating the axiological value of a given context but has become a necessity for authenticating contemporary discourse. The role of interculturality in shaping the expression of creative intentions is amplified, as it attenuates divergences determined by the incongruity of individuals’ backgrounds by comprehensively observing the uniqueness of foreign elements from a familiarity-based perspective. The Liang Zhu Violin Concerto exemplifies the adaptation of East Asian culture to the context of the Western language and means of expression while preserving its Chinese cultural identity. The concerto's value and broad appeal stem from the balance of originality and familiarity it maintains. Ultimately, this article aims to explore how the Liang Zhu Violin Concerto achieves originality at a global level while preserving its Chinese cultural identity.
Exploring the determinants of users' co-creation behavior on music streaming platforms in China
Jinghong Xu, Jiankun Gong, Dan Ji
Music streaming platforms have recently become one of the latest innovative means of music consumption, replacing traditional music devices. To examine users' behavior on music streaming platforms, this study proposes an extended research model based on flow theory and investigates the relationship between flow experience and co-creation behavior. A partial least squares methodology was employed to test the proposed model and corresponding hypotheses on data collected from 390 survey samples. The results showed that flow experience has a significant influence on users' co-creation behavior. Among the three antecedents, perceived skill and perceived interactivity have the strongest effects on flow experience, while perceived control has little effect. This study discusses some valuable theoretical implications and offers insights useful for both researchers and practitioners.
Proceedings of the 2nd International Workshop on Reading Music Systems
Jorge Calvo-Zaragoza, Alexander Pacha
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 2nd International Workshop on Reading Music Systems, held in Delft on the 2nd of November 2019.
Musical Information Extraction from the Singing Voice
Preeti Rao
Music information retrieval is currently an active research area that addresses the extraction of musically important information from audio signals, and the applications of such information. The extracted information can be used for search and retrieval of music in recommendation systems, to aid musicological studies, or in music learning. Sophisticated signal processing techniques are applied to convert low-level acoustic signal properties into musical attributes, which are further embedded in a rule-based or statistical classification framework to link with high-level descriptions such as melody, genre, mood, and artist type. Vocal music comprises a large and interesting category of music where the lead instrument is the singing voice. The singing voice is more versatile than many musical instruments and therefore poses interesting challenges to information retrieval systems. In this paper, we provide a brief overview of research in vocal music processing, followed by a description of related work at IIT Bombay leading to the development of an interface for melody detection of the singing voice in polyphony.
An interactive music infilling interface for pop music composition
Rui Guo
Artificial intelligence (AI) has been widely applied to music generation topics such as continuation, melody/harmony generation, genre transfer, and music infilling. Despite the burst of interest in applying AI to music, there are still few interfaces that let musicians take advantage of the latest progress in AI technology. This makes such tools less valuable in practice and makes it harder to identify their advantages and drawbacks without using them in real scenarios. This work builds a Max patch for an interactive music infilling application with different levels of control, including track density/polyphony/occupation rate and bar tonal tension control. The user can select the melody/bass/harmony track as the infilling content, up to 16 bars. The infilling algorithm is based on the author's previous work, and the interface sends and receives messages to and from the AI system hosted in the cloud. This interface lowers the barrier to AI technology and can generate different variations of the selected content. These results can offer several alternatives for a musician's composition, and the interactive process realizes the value of the AI infilling system.
Tactile Low Frequency Vibration in Dementia Management: A Scoping Review
Elsa A. Campbell
et al.
The prevalence of dementia is increasing with the ever-growing population of older adults. Non-pharmacological, music-based interventions, including sensory stimulation, were reported by the Lancet Commission in 2020 to be the first-choice approach for managing the behavioural and psychological symptoms of dementia. Low frequency sinusoidal vibration interventions, related to music interventions through their core characteristics, may offer relief for these symptoms. Despite increasing attention on the effectiveness of auditory music interventions and music therapy for managing dementia, this has not included low frequency vibration. This scoping review, following the JBI methodology guidelines, was conducted to investigate participants’ responses to both sound and mechanical vibration, the characteristics of the delivered interventions, methodological challenges, and the specifics of the research experiments reported. An extensive search was conducted in BMC, CINAHL, Cochrane Central Register of Controlled Trials, EMBASE, ERIC, MEDLINE (OvidSP), Pedro, ProQuest Central, PsycINFO, Scopus, and Web of Science. Current Controlled Trials, Clinical Trials, and Google Scholar were also searched, as well as a hand search in relevant journals. Studies on adults with all types of dementia, investigating tactile low frequency sound or mechanical vibration in any context, were considered. Data from eight full-length studies (three RCTs, two quasi-experimental, two case reports, and one qualitative) were extracted using the data extraction table developed by the authors and were included in the analysis and critical appraisal. Quality issues related to, for example, control groups and blinding. Few studies addressed participants’ subjective responses to the interventions. Reporting on the intervention characteristics was unclear. More frequent sessions appeared to lead to better outcomes, and home-based interventions could potentially address issues of access and feasibility.
Future research should include neuroimaging to measure and confirm the hypothesised mechanism of cerebral coherence. Standardised reporting of intervention characteristics is also needed to ensure replicability of the experiments. Higher quality research is needed to investigate the impact and effect of low frequency vibration for the symptoms of dementia and compare outcomes in meta-syntheses.
Joseph N. Straus, Introduction to Post-Tonal Theory, 4th edition, New York, London: W. W. Norton & Company, 2016
Ullrich Scheideler
Personalized Popular Music Generation Using Imitation and Structure
Shuqi Dai, Xichu Ma, Ye Wang
et al.
Many approaches to music generation have been presented recently. While stylistic music generation using deep learning techniques has become the mainstream, these models still struggle to generate music with high musicality, multiple levels of musical structure, and controllability. In addition, application scenarios such as music therapy require imitating specific musical styles from a few given music examples, rather than capturing the overall genre style of a large data corpus. To address requirements that challenge current deep learning methods, we propose a statistical machine learning model that is able to capture and imitate the structure, melody, chord, and bass style of a given example seed song. An evaluation using 10 pop songs shows that our new representations and methods are able to create high-quality stylistic music that is similar to a given input song. We also discuss potential uses of our approach in music evaluation and music therapy.