Mathias Rose Bjare, Giorgia Cantisani, Marco Pasini
et al.
We argue that training autoencoders to reconstruct inputs from noised versions of their encodings, when combined with perceptual losses, yields encodings that are structured according to a perceptual hierarchy. We demonstrate the emergence of this hierarchical structure by showing that, after training an audio autoencoder in this manner, perceptually salient information is captured in coarser representation structures than with conventional training. Furthermore, we show that such perceptual hierarchies improve latent diffusion decoding in the context of estimating surprisal in music pitches and predicting EEG brain responses to music listening. Pretrained weights are available at github.com/CPJKU/pa-audioic.
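The abstract describes the training scheme only at a high level. As a minimal sketch of the noising step, assuming (hypothetically, not from the paper) a per-dimension noise scale that grows along the latent axis, so that low-index "coarse" dimensions survive the noise and must carry the perceptually salient information:

```python
import random

def noise_latent(z, max_sigma=2.0, rng=random):
    """Add Gaussian noise whose scale grows with the latent index.
    NOTE: this per-dimension schedule is an illustrative assumption,
    not the schedule used in the paper."""
    d = len(z)
    sigmas = [max_sigma * (i + 1) / d for i in range(d)]
    noisy = [zi + rng.gauss(0.0, s) for zi, s in zip(z, sigmas)]
    return noisy, sigmas

def per_dim_snr(signal_var, sigmas):
    """Signal-to-noise ratio per latent dimension."""
    return [signal_var / (s * s) for s in sigmas]

rng = random.Random(0)
z = [1.0] * 8                      # toy latent code
noisy, sigmas = noise_latent(z, rng=rng)
snr = per_dim_snr(1.0, sigmas)
# SNR decreases along the latent axis: a decoder trained to reconstruct
# from such noised codes is pushed to pack salient content into early dims.
```

A decoder trained against a perceptual loss on top of this corruption would then allocate perceptually important information to the better-preserved dimensions, which is the hierarchical structure the abstract claims.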
Mohammad Shokri, Alexandra C. Salem, Gabriel Levine
et al.
In this paper, we introduce Story2MIDI, a sequence-to-sequence Transformer-based model for generating emotion-aligned music from a given piece of text. To develop this model, we construct the Story2MIDI dataset by merging existing datasets for sentiment analysis from text and emotion classification in music. The resulting dataset contains pairs of text blurbs and music pieces that evoke the same emotions in the reader or listener. Despite the small scale of our dataset and limited computational resources, our results indicate that our model effectively learns emotion-relevant features in music and incorporates them into its generation process, producing samples with diverse emotional responses. We evaluate the generated outputs using objective musical metrics and a human listening study, confirming the model's ability to capture intended emotional cues.
Recent advances in music foundation models have improved audio representation learning, yet their effectiveness across diverse musical traditions remains limited. We introduce CultureMERT-95M, a multi-culturally adapted foundation model developed to enhance cross-cultural music representation learning and understanding. To achieve this, we propose a two-stage continual pre-training strategy that integrates learning rate re-warming and re-decaying, enabling stable adaptation even with limited computational resources. Training on a 650-hour multi-cultural data mix, comprising Greek, Turkish, and Indian music traditions, results in an average improvement of 4.9% in ROC-AUC and AP across diverse non-Western music auto-tagging tasks, surpassing prior state-of-the-art, with minimal forgetting on Western-centric benchmarks. We further investigate task arithmetic, an alternative approach to multi-cultural adaptation that merges single-culture adapted models in the weight space. Task arithmetic performs on par with our multi-culturally trained model on non-Western auto-tagging tasks and shows no regression on Western datasets. Cross-cultural evaluation reveals that single-culture models transfer with varying effectiveness across musical traditions, whereas the multi-culturally adapted model achieves the best overall performance. To support research on world music representation learning, we publicly release CultureMERT-95M and CultureMERT-TA-95M, fostering the development of more culturally aware music foundation models.
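The task-arithmetic merging mentioned above can be sketched with the standard task-vector formulation: base weights plus a scaled sum of per-culture weight deltas. The scaling factor and the toy scalar "checkpoints" below are assumptions for illustration, not values from the paper:

```python
def task_arithmetic_merge(base, adapted_models, lam=0.3):
    """Merge adapted checkpoints in weight space via task vectors:
    merged = base + lam * sum_c (adapted_c - base)."""
    merged = {}
    for name, w in base.items():
        delta = sum(m[name] - w for m in adapted_models)
        merged[name] = w + lam * delta
    return merged

# Toy 'checkpoints' with scalar parameters for illustration.
base = {"w": 1.0, "b": 0.0}
greek = {"w": 1.2, "b": 0.1}
turkish = {"w": 0.8, "b": -0.1}
merged = task_arithmetic_merge(base, [greek, turkish], lam=0.5)
# The two deltas for "w" (+0.2 and -0.2) cancel, so merged["w"] == base["w"].
```

In practice each parameter is a tensor rather than a scalar, but the merge is the same elementwise operation, which is why it needs no additional training.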
While the recent developments in large language models (LLMs) have successfully enabled generative recommenders with natural language interactions, their recommendation behavior is limited, leaving other simpler yet crucial components such as metadata or attribute filtering underutilized in the system. We propose an LLM-based music recommendation system with tool calling to serve as a unified retrieval-reranking pipeline. Our system positions an LLM as an end-to-end recommendation system that interprets user intent, plans tool invocations, and orchestrates specialized components: boolean filters (SQL), sparse retrieval (BM25), dense retrieval (embedding similarity), and generative retrieval (semantic IDs). Through tool planning, the system predicts which types of tools to use, their execution order, and the arguments needed to find music matching user preferences, supporting diverse modalities while seamlessly integrating multiple database filtering methods. We demonstrate that this unified tool-calling framework achieves competitive performance across diverse recommendation scenarios by selectively employing appropriate retrieval methods based on user queries, envisioning a new paradigm for conversational music recommendation systems.
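The plan-then-execute pipeline described above can be sketched as a sequence of tool invocations over a catalog. The catalog, tool implementations, and scoring below are toy stand-ins (a predicate for the SQL stage, keyword overlap for BM25), not the paper's system:

```python
# Toy catalog standing in for a music database.
CATALOG = [
    {"id": 1, "title": "calm piano nocturne", "year": 2019, "genre": "classical"},
    {"id": 2, "title": "upbeat synth pop", "year": 2022, "genre": "pop"},
    {"id": 3, "title": "calm ambient pads", "year": 2023, "genre": "ambient"},
]

def sql_filter(tracks, predicate):
    """Boolean/metadata filtering stage (stands in for SQL)."""
    return [t for t in tracks if predicate(t)]

def sparse_retrieve(tracks, query, k=2):
    """Keyword-overlap scoring (stands in for BM25)."""
    q = set(query.split())
    scored = sorted(tracks, key=lambda t: -len(q & set(t["title"].split())))
    return scored[:k]

def execute_plan(plan, tracks):
    """Run the planned tool invocations in order, piping results along."""
    for tool, args in plan:
        tracks = tool(tracks, **args)
    return tracks

# A plan such as an LLM might emit for "recent calm ambient music".
plan = [
    (sql_filter, {"predicate": lambda t: t["year"] >= 2020}),
    (sparse_retrieve, {"query": "calm ambient", "k": 1}),
]
results = execute_plan(plan, CATALOG)
# The filter keeps ids 2 and 3; keyword overlap ranks "calm ambient pads" first.
```

In the paper's framing, the LLM's job is producing the `plan` (which tools, in what order, with what arguments); the tools themselves remain conventional retrieval components.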
Hugo Chateau-Laurent, Tara Vanhatalo, Wei-Tung Pan
et al.
Generative artificial intelligence raises concerns related to energy consumption, copyright infringement and creative atrophy. We show that randomly initialized recurrent neural networks can produce arpeggios and low-frequency oscillations that are rich and configurable. In contrast to end-to-end music generation that aims to replace musicians, our approach expands their creativity while requiring no data and much less computational power. More information can be found at: https://allendia.com/
Brandon James Carone, Iran R. Roman, Pablo Ripollés
Multimodal Large Language Models (LLMs) claim "musical understanding" via evaluations that conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Moreover, we separate three sources of variability: (i) perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples (zero- vs. few-shot manipulations), and (iii) reasoning strategies (Standalone, CoT, LogicLM). For the latter we adapt LogicLM, a framework combining LLMs with symbolic solvers to perform structured reasoning, to music. Results reveal a clear perceptual gap: models perform near ceiling on MIDI but show accuracy drops on audio. Reasoning and few-shot prompting offer minimal gains. This is expected for MIDI, where performance reaches saturation, but more surprising for audio, where LogicLM, despite near-perfect MIDI accuracy, remains notably brittle. Among models, Gemini Pro achieves the highest performance across most conditions. Overall, current systems reason well over symbols (MIDI) but do not yet "listen" reliably from audio. Our method and dataset make the perception-reasoning boundary explicit and offer actionable guidance for building robust, audio-first music systems.
Matthias Erdmann, Markus von Berg
et al.
The present study explores the development and evaluation of a mixed reality music visualization for a live music performance. Real-time audio analysis and crossmodal correspondences were used as design guidelines for creating the visualization, which was presented through a head-mounted display. To assess the impact of the music visualization on the audience’s aesthetic experience, a baseline visualization was designed, featuring the same visual elements but with random changes of color and movement. The audience’s aesthetic experience of the two conditions (i.e., listening to the same song with different visualizations) was assessed using the Aesthetic Emotions Scale (AESTHEMOS) questionnaire. Additionally, participants answered questions regarding the perceived audiovisual congruence of the stimuli and questionnaires about individual musicality and aesthetic receptivity. The results show that the visualization controlled by real-time audio analysis was associated with a slightly enhanced aesthetic experience of the audiovisual composition compared to the randomized visualization, thereby supporting similar findings reported in the literature. Furthermore, the tested personal characteristics of the participants did not significantly affect aesthetic experience; significant correlations between these characteristics and the aesthetic experience were observed only when the ratings were averaged across conditions. An open interview provided deeper insights into the participants’ overall experiences of the live music performance. The results of the study offer insights into the development of real-time music visualization in mixed reality, examine how the specific audiovisual stimuli employed influence the aesthetic experience, and provide potential technical guidelines for creating new concert formats.
Mathias Rose Bjare, Stefan Lattner, Gerhard Widmer
In recent years, the quality and public interest in music generation systems have grown, encouraging research into various ways to control these systems. We propose a novel method for controlling surprisal in music generation using sequence models. To achieve this goal, we define a metric called Instantaneous Information Content (IIC). The IIC serves as a proxy function for the perceived musical surprisal (as estimated from a probabilistic model) and can be calculated at any point within a music piece. This enables the comparison of surprisal across different musical content even if the musical events occur in irregular time intervals. We use beam search to generate musical material whose IIC curve closely approximates a given target IIC. We experimentally show that the IIC correlates with harmonic and rhythmic complexity and note density. The correlation decreases with the length of the musical context used for estimating the IIC. Finally, we conduct a qualitative user study to test if human listeners can identify the IIC curves that have been used as targets when generating the respective musical material. We provide code for creating IIC interpolations and IIC visualizations on https://github.com/muthissar/iic.
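The beam search toward a target surprisal curve can be sketched with a toy model. Here a fixed Bernoulli distribution stands in for a trained sequence model, and raw per-event information content (−log₂ p) stands in for the paper's IIC (which additionally weights events over time); the target curve and beam width are arbitrary choices for illustration:

```python
import math

P = {0: 0.2, 1: 0.8}                      # toy next-event probabilities
IC = {e: -math.log2(p) for e, p in P.items()}

def beam_search_to_target(target, width=3):
    """Keep the `width` partial sequences whose IC curves are closest
    (in squared distance) to the target curve so far."""
    beams = [([], 0.0)]                   # (sequence, accumulated cost)
    for goal in target:
        candidates = []
        for seq, cost in beams:
            for event in (0, 1):
                c = cost + (IC[event] - goal) ** 2
                candidates.append((seq + [event], c))
        beams = sorted(candidates, key=lambda x: x[1])[:width]
    return beams[0]

# Target: alternating high / low surprisal.
target = [IC[0], IC[1]] * 4
best_seq, best_cost = beam_search_to_target(target)
# This toy model can match the target exactly with the sequence 0,1,0,1,...
```

With a real sequence model, the candidate continuations and their probabilities come from the model itself, and the IIC evaluation handles events at irregular time intervals, but the search structure is the same.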
Despite the success of contrastive learning in Music Information Retrieval, the inherent ambiguity of contrastive self-supervision presents a challenge. Relying solely on augmentation chains and self-supervised positive sampling strategies can lead to a pretraining objective that does not capture key musical information for downstream tasks. We introduce semi-supervised contrastive learning (SemiSupCon), a simple method for leveraging musically informed labeled data (supervision signals) in the contrastive learning of musical representations. Our approach introduces musically relevant supervision signals into self-supervised contrastive learning by combining supervised and self-supervised contrastive objectives in a simpler framework than previous approaches. This framework improves downstream performance and robustness to audio corruptions on a range of downstream MIR tasks with moderate amounts of labeled data. Our approach enables shaping the learned similarity metric through the choice of labeled data, which (1) infuses the representations with musical domain knowledge and (2) improves out-of-domain performance with minimal general downstream performance loss. We show strong transfer learning performance on musically related yet not trivially similar tasks, such as pitch and key estimation. Additionally, our approach improves automatic tagging over self-supervised approaches with only 5% of available labels included in pretraining.
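Combining supervised and self-supervised contrastive objectives amounts to enlarging each anchor's positive set: augmentation pairs (self-supervised) plus same-label pairs (supervised), scored under one softmax. A minimal pure-Python sketch of such a SupCon-style loss, with toy unit-vector embeddings and a temperature chosen arbitrarily:

```python
import math

def contrastive_loss(embs, positives, temperature=0.1):
    """For each anchor, average -log softmax probability of its positives
    against all other samples. `positives[i]` may mix self-supervised
    (augmentation) and supervised (same-label) pairs, which is the core
    idea behind combining the two objectives."""
    n = len(embs)
    sim = [[sum(a * b for a, b in zip(embs[i], embs[j])) / temperature
            for j in range(n)] for i in range(n)]
    total, count = 0.0, 0
    for i in range(n):
        denom = sum(math.exp(sim[i][j]) for j in range(n) if j != i)
        for p in positives[i]:
            total += -(sim[i][p] - math.log(denom))
            count += 1
    return total / count

# Anchors 0,1 share a label, as do 2,3 (toy supervised positives).
positives = {0: [1], 1: [0], 2: [3], 3: [2]}
separated = [(1, 0), (1, 0), (0, 1), (0, 1)]   # positives aligned
mixed     = [(1, 0), (0, 1), (1, 0), (0, 1)]   # positives orthogonal
loss_sep = contrastive_loss(separated, positives)
loss_mix = contrastive_loss(mixed, positives)
# Aligned positives yield a much lower loss than orthogonal ones.
```

Choosing which labels populate `positives` is exactly the lever the abstract describes for shaping the learned similarity metric.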
Globally, preoperative anxiety is common among adults undergoing elective surgery. Not only can preoperative anxiety lead to delays or cancellations of surgical procedures, but it can also negatively impact the prognosis of the condition being treated. To alleviate this issue, non-pharmacological interventions such as prayer, massage, music, imagery, and acupuncture have been widely employed. These methods have proven effective in reducing preoperative anxiety, lowering healthcare costs, and enhancing patient satisfaction. In particular, the traditional Chinese medicine technique of auricular acupressure offers a novel approach to managing preoperative anxiety through its ability to promote circulation, relieve pain, calm the mind, and induce tranquility. However, current research on auricular acupressure primarily focuses on postoperative scenarios, specific diseases, or types of surgery. Through literature review and analysis, this paper summarizes the application of auricular acupressure in alleviating preoperative anxiety among adults undergoing elective surgery and discusses its physiological mechanisms, providing a reference for clinical practice. Although auricular acupressure presents an effective non-pharmacological intervention, further research is needed to validate its efficacy in clinical settings, and attention must be paid to the risk of infection during its application.
Schemata (Gjerdingen 2007) are prototypical harmonic-rhythmic-melodic structures widely used in eighteenth-century music, especially in the so-called Galant style, representing a compositional paradigm for a large body of works. The use of this theory as an analytical tool for Galant works is thus already accepted, and it has produced scholarly work grounded in this bibliography. However, when we turn to the nineteenth-century repertoire, especially the works of Romanticism, we observe that little has yet been explored regarding the persistence of the Schemata as a compositional practice. This paper therefore presents a series of examples in which such prototypical structures appear in the Romantic period, by then fully established in the mid-nineteenth century, articulating them with William Weber's concept of the musical canon, in order to demonstrate that Gjerdingen's theory has the potential to analyze works beyond the Galant style.
The inspiration for this Special Issue on Im/Materiality in Renaissance Arts arose from two convictions: (1) that sensual experiences and the physicality of creation must be a part of our accounts of the past, and (2) that crosstalk among scholars of music, literature, art, and architecture can reveal both the historiographical gaps endemic to specific disciplines and the critical tools each specialty brings to the project of incorporating living, breathing artists, builders, poets, singers, players, worshippers, scientists, and others into histories of the Renaissance arts [...]
Background: Sound therapy methods have seen a surge in popularity, with a predominant focus on music among all types of sound stimulation. There is substantial evidence documenting the integrative impact of music therapy on psycho-emotional and physiological outcomes, rendering it beneficial for addressing stress-related conditions such as pain syndromes, depression, and anxiety. Despite these advancements, the therapeutic aspects of sound, as well as the mechanisms underlying its efficacy, remain incompletely understood. Existing research on music as a holistic cultural phenomenon often overlooks crucial aspects of sound therapy mechanisms, particularly those related to speech acoustics or the so-called “music of speech.”
Objective: This study aims to provide an overview of empirical research on sound interventions to elucidate the mechanism underlying their positive effects. Specifically, we will focus on identifying therapeutic factors and mechanisms of change associated with sound interventions. Our analysis will compare the most prevalent types of sound interventions reported in clinical studies and experiments. Moreover, we will explore the therapeutic effects of sound beyond music, encompassing natural human speech and intermediate forms such as traditional poetry performances.
Methods: This review adheres to the methodological guidance of the Joanna Briggs Institute and follows the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) checklist for reporting review studies, which is adapted from the Arksey and O’Malley framework. Our search strategy encompasses PubMed, Web of Science, Scopus, and PsycINFO (EBSCOhost), covering literature from 1990 to the present. Among the different study types, randomized controlled trials, clinical trials, laboratory experiments, and field experiments were included.
Results: Data collection began in October 2022. We found a total of 2027 items. Our initial search uncovered an asymmetry in the distribution of studies, with a larger number focused on music therapy compared with those exploring prosody in spoken interventions such as guided meditation or hypnosis. We extracted and selected papers using Rayyan software and identified 41 eligible papers after title and abstract screening. The completion of the scoping review is anticipated by October 2024, with key steps comprising the analysis of findings by May 2024, drafting and revising the study by July 2024, and submitting the paper for publication in October 2024.
Conclusions: In the next step, we will conduct a quality evaluation of the papers and then chart and group the therapeutic factors extracted from them. This process aims to unveil conceptual gaps in existing studies. Gray literature sources, such as Google Scholar, ClinicalTrials.gov, nonindexed conferences, and reference list searches of retrieved studies, will be added to our search strategy to increase the number of relevant papers that we cover.
International Registered Report Identifier (IRRID): DERR1-10.2196/54030
Medicine, Computer applications to medicine. Medical informatics
The collision of opera with Iranian musical culture can be considered one of the most important points of intersection between Western and Iranian culture, which took place under the influence of certain political and social developments in Iran. After the Persian Constitutional Revolution, Iran was affected by political, social and cultural changes. Countless changes were made to help Iran's entry into the modern world, in which elites and intellectuals played a significant role. They insisted on bringing the modern form of theatre to the culture of Iran. This is how operetta first entered Iran under the influence of immigrants from northern European countries. The first operettas of the constitutional era were created under the influence of the Caucasian operettas. During several decades of activity in the field of operetta production in Iran, and with the provision of academic education in the field of Western singing in the music conservatory, conditions were created for the production of the first operas in Iran. The coexistence of music and drama was praised by Iranians from the very beginning and prompted many playwrights to create operettas. Since the first operettas in Iran were created with hints of Iranian music and culture, some experts did not consider it correct to use the word operetta for these performances. This essay, based on the theory of glocalization and relying on a descriptive-analytical method using library documents, lists some of the effects of Western operas and operettas on the post-constitutional operettas, through the study of three operettas, namely Rastakhiz'e Shahriaran'e Iran, Parichehr and Parizad, and finally Elahe. These operetta artists included musical profiles other than Iranian music in their works by using descriptions under the title of "European song and ballet", using "Western Instruments" and "the songs close to Caucasian and Turkish music", and tried to step closer to the Western operettas.
In this situation, the juxtaposition of Western operetta and Iranian musical culture produced mutual effects, which this article examines in three stages: exposure, cultural influence, and acculturation. At first, Iranians were exposed to Western opera by hearing, reading, and seeing it. By adopting the name "operetta" along with influences from Western music and song, they advanced to the stage of cultural influence. And since the presence of operetta in Iran did not pose any threat to Iranian music, Iranians did not take the step toward acculturation. Among the results of this article is the process by which a glocalized kind of operetta was produced in Iran. Iranians took the title of operetta as one of the successful globalized examples and combined this globalized musical show with Iranian classical music and singing. Sometimes they used humorous stories appropriate to the society of that time, and sometimes epic myths and stories from Iranian classical poetry, creating a new type of operetta in Iran. It was under these conditions that the glocal type of Persian operettas was born, influenced by the global type of Western operettas.
Recent text-to-audio generation techniques have the potential to allow novice users to freely generate music audio. Even if they do not have musical knowledge, such as about chord progressions and instruments, users can try various text prompts to generate audio. However, compared to the image domain, gaining a clear understanding of the space of possible music audios is difficult because users cannot listen to the variations of the generated audios simultaneously. We therefore enable users to explore not only text prompts but also audio priors that constrain the text-to-audio music generation process. This dual-sided exploration lets users discern the impact of different text prompts and audio priors on the generation results by iteratively comparing them. Our developed interface, IteraTTA, is specifically designed to aid users in refining text prompts and selecting favorable audio priors from the generated audios. With this, users can progressively reach their loosely specified goals while understanding and exploring the space of possible results. Our implementation and discussions highlight design considerations that are specifically required for text-to-audio models and how interaction techniques can contribute to their effectiveness.
Americium is a highly radioactive actinide element found in used nuclear fuel. Its adsorption on aluminum (hydr)oxide minerals is important to study for at least two reasons: (i) aluminum (hydr)oxide minerals are ubiquitous in the subsurface environment and (ii) bentonite clays, which are proposed engineered barriers for the geologic disposal of used nuclear fuel, have the same ≡AlOH sites as aluminum (hydr)oxide minerals. Surface complexation modeling is widely used to interpret the adsorption behavior of heavy metals on mineral surfaces. While americium sorption is understudied, multiple adsorption studies for europium, a chemical analog, are available. In this study we compiled data describing Eu(III) adsorption on three aluminum (hydr)oxide minerals—corundum (α-Al2O3), γ-alumina (γ-Al2O3) and gibbsite (γ-Al(OH)3)—and developed surface complexation models for Eu(III) adsorption on these minerals by employing diffuse double layer (DDL) and charge distribution multisite complexation (CD-MUSIC) electrostatic frameworks. We also developed surface complexation models for Am(III) adsorption on corundum (α-Al2O3) and γ-alumina (γ-Al2O3) by employing a limited number of Am(III) adsorption data sourced from literature. For corundum and γ-alumina, two different adsorbed Eu(III) species, one each for strong and weak sites, were found to be important regardless of which electrostatic framework was used. The formation constant of the weak site species was almost 10,000 times weaker than the formation constant for the corresponding strong site species. For gibbsite, two different adsorbed Eu(III) species formed on the single available site type and were important for the DDL model, whereas the best-fit CD-MUSIC model for the Eu(III)-gibbsite system required only one Eu(III) surface species. The Am(III)-corundum model based on the CD-MUSIC framework had the same set of surface species as the Eu(III)-corundum model.
However, the log K values of the surface reactions were different. The best-fit Am(III)-corundum model based on the DDL framework had only one site type. Both the CD-MUSIC and the DDL models developed for the Am(III)-γ-alumina system comprised only one site type, and the formation constant of the corresponding surface species was ~500 times stronger and ~700 times weaker than the corresponding Eu(III) species on the weak and the strong sites, respectively. The CD-MUSIC model for corundum and both the DDL and the CD-MUSIC models for γ-alumina predicted the Am(III) adsorption data very well, whereas the DDL model for corundum overpredicted the Am(III) adsorption data. The root mean square errors of the DDL and CD-MUSIC models developed in this study were smaller than those of two previously published models describing the Am(III)-γ-alumina system, indicating the better predictive capacity of our models. Overall, our results suggest that using Eu(III) as an analog for Am(III) is a practical approach for predicting Am(III) adsorption onto well-characterized minerals.
Francisco J. Castellanos, Antonio Javier Gallego, Jorge Calvo-Zaragoza
et al.
Optical music recognition (OMR) is the field that studies how to automatically read music notation from score images. One of the relevant steps within the OMR workflow is staff-region retrieval. This process is a key step because any undetected staff will not be processed by the subsequent steps. This task has previously been addressed as a supervised learning problem in the literature; however, ground-truth data are not always available, so each new manuscript requires a preliminary manual annotation. This situation is one of the main bottlenecks in OMR because of the vast number of existing manuscripts and the associated manual labeling cost. With the aim of mitigating this issue, we propose the application of a domain adaptation technique, the so-called Domain-Adversarial Neural Network (DANN), based on a combination of a gradient reversal layer and a domain classifier in the neural architecture. The results from our experiments support the benefits of our proposed solution, obtaining improvements of approximately 29% in the F-score.
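The gradient reversal layer at the heart of DANN is simple: identity in the forward pass, gradient multiplied by −λ in the backward pass. A minimal sketch with manual backprop on scalars (the squared domain loss and λ value are arbitrary choices for illustration, not the paper's setup):

```python
# DANN-style gradient reversal layer (GRL), scalar toy version.

def grl_forward(x):
    return x                          # identity in the forward pass

def grl_backward(grad_output, lam=1.0):
    return -lam * grad_output         # reversed, scaled gradient backward

# Toy domain classifier: squared loss of a feature f against domain target t.
f = 2.0                               # feature produced by the encoder
t = 0.0                               # domain label target
pred = grl_forward(f)
loss = (pred - t) ** 2                # domain classifier loss
dloss_dpred = 2 * (pred - t)          # gradient that would IMPROVE the classifier
grad_to_encoder = grl_backward(dloss_dpred, lam=1.0)
# The encoder receives the negated gradient: it is updated to make the
# domains HARDER to distinguish, encouraging domain-invariant staff features.
```

In a deep-learning framework this is typically implemented as a custom autograd function wrapped around the encoder output before the domain classifier, so that the label branch trains normally while the domain branch adversarially shapes the shared features.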