Results for "Music"

Showing 20 of ~0 results · from arXiv

arXiv Open Access 2026
Music Source Restoration with Ensemble Separation and Targeted Reconstruction

Xinlong Deng, Yu Xia, Jie Jiang

The Inaugural Music Source Restoration (MSR) Challenge targets the recovery of original, unprocessed stems from fully mixed and mastered music. Unlike conventional music source separation, MSR requires reversing complex production processes such as equalization, compression, reverberation, and other real-world degradations. To address MSR, we propose a two-stage system. First, an ensemble of pre-trained separation models produces preliminary source estimates. Then a set of pre-trained BSRNN-based restoration models performs targeted reconstruction to refine these estimates. On the official MSR benchmark, our system surpasses the baselines on all metrics, ranking second among all submissions. The code is available at https://github.com/xinghour/Music-source-restoration-CUPAudioGroup

en cs.SD, cs.AI
arXiv Open Access 2025
From Sound to Sight: Towards AI-authored Music Videos

Leo Vitasovic, Stella Graßhof, Agnes Mercedes Kloft et al.

Conventional music visualisation systems rely on handcrafted ad hoc transformations of shapes and colours that offer only limited expressiveness. We propose two novel pipelines for automatically generating music videos from any user-specified, vocal or instrumental song using off-the-shelf deep learning models. Inspired by the manual workflows of music video producers, we experiment on how well latent feature-based techniques can analyse audio to detect musical qualities, such as emotional cues and instrumental patterns, and distil them into textual scene descriptions using a language model. Next, we employ a generative model to produce the corresponding video clips. To assess the generated videos, we identify several critical aspects and design and conduct a preliminary user evaluation that demonstrates storytelling potential, visual coherency and emotional alignment with the music. Our findings underscore the potential of latent feature techniques and deep generative models to expand music visualisation beyond traditional approaches.

en cs.SD, cs.AI
arXiv Open Access 2025
Procedural Music Generation Systems in Games

Shangxuan Luo, Joshua Reiss

Procedural Music Generation (PMG) is an emerging field that algorithmically creates music content for video games. By leveraging techniques from simple rule-based approaches to advanced machine learning algorithms, PMG has the potential to significantly improve development efficiency, provide richer musical experiences, and enhance player immersion. However, academic prototypes often diverge from applications due to differences in priorities such as novelty, reliability, and allocated resources. This paper bridges the gap between research and applications by presenting a systematic overview of current PMG techniques in both fields, offering a two-aspect taxonomy. Through a comparative analysis, this study identifies key research challenges in algorithm implementation, music quality and game integration. Finally, the paper outlines future research directions, emphasising task-oriented and context-aware design, more comprehensive quality evaluation methods, and improved research tool integration to provide actionable insights for developers, composers, and researchers seeking to advance PMG in game contexts.

en cs.SD
arXiv Open Access 2024
Towards practical secure delegated quantum computing with semi-classical light

Boris Bourdoncle, Pierre-Emmanuel Emeriau, Paul Hilaire et al.

Secure Delegated Quantum Computation (SDQC) protocols are a vital piece of the future quantum information processing global architecture since they allow end-users to perform their valuable computations on remote quantum servers without fear that a malicious quantum service provider or an eavesdropper might acquire some information about their data or algorithm. They also allow end-users to check that their computation has been performed as they have specified it. However, existing protocols all have drawbacks that limit their usage in the real world. Most require the client to either operate a single-qubit source or perform single-qubit measurements, thus requiring them to still have some quantum technological capabilities, albeit restricted, or require the server to perform operations which are hard to implement on real hardware (e.g., isolating single photons from laser pulses and polarisation-preserving photon-number quantum non-demolition measurements). Others remove the need for quantum communications entirely, but this comes at a cost in terms of security guarantees and memory overhead on the server's side. We present an SDQC protocol which drastically reduces the technological requirements of both the client and the server while providing information-theoretic composable security. More precisely, the client only manipulates an attenuated laser pulse, while the server only handles interacting quantum emitters with a structure capable of generating spin-photon entanglement. The quantum emitter acts as both a converter from coherent laser pulses to polarisation-encoded qubits and an entanglement generator. Such devices have recently been used to demonstrate the largest entangled photonic state to date, thus hinting at the readiness of our protocol for experimental implementations.

en quant-ph, cs.CR
arXiv Open Access 2024
MusicFlow: Cascaded Flow Matching for Text Guided Music Generation

K R Prajwal, Bowen Shi, Matthew Lee et al.

We introduce MusicFlow, a cascaded text-to-music generation model based on flow matching. Based on self-supervised representations to bridge between text descriptions and music audios, we construct two flow matching networks to model the conditional distribution of semantic and acoustic features. Additionally, we leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation in a zero-shot manner. Experiments on MusicCaps reveal that the music generated by MusicFlow exhibits superior quality and text coherence despite being over $2\sim5$ times smaller and requiring $5$ times fewer iterative steps. Simultaneously, the model can perform other music generation tasks and achieves competitive performance in music infilling and continuation. Our code and model will be publicly available.

en cs.SD, cs.AI
arXiv Open Access 2023
Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model

Jaeyong Kang, Soujanya Poria, Dorien Herremans

Numerous studies in the field of music generation have demonstrated impressive performance, yet virtually no models are able to directly generate music to match accompanying videos. In this work, we develop a generative music AI framework, Video2Music, that can match a provided video. We first curated a unique collection of music videos. Then, we analysed the music videos to obtain semantic, scene offset, motion, and emotion features. These distinct features are then employed as guiding input to our music generation model. We transcribe the audio files into MIDI and chords, and extract features such as note density and loudness. This results in a rich multimodal dataset, called MuVi-Sync, on which we train a novel Affective Multimodal Transformer (AMT) model to generate music given a video. This model includes a novel mechanism to enforce affective similarity between video and music. Finally, post-processing is performed based on a biGRU-based regression model to estimate note density and loudness based on the video features. This ensures a dynamic rendering of the generated chords with varying rhythm and volume. In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion. The musical quality, along with the quality of music-video matching is confirmed in a user study. The proposed AMT model, along with the new MuVi-Sync dataset, presents a promising step for the new task of music generation for videos.

en cs.SD, cs.AI
arXiv Open Access 2023
From West to East: Who can understand the music of the others better?

Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos

Recent developments in MIR have led to several benchmark deep learning models whose embeddings can be used for a variety of downstream tasks. At the same time, the vast majority of these models have been trained on Western pop/rock music and related styles. This leads to research questions on whether these models can be used to learn representations for different music cultures and styles, or whether we can build similar music audio embedding models trained on data from different cultures or styles. To that end, we leverage transfer learning methods to derive insights about the similarities between the different music cultures to which the data belong. We use two Western music datasets, two traditional/folk datasets coming from eastern Mediterranean cultures, and two datasets belonging to Indian art music. Three deep audio embedding models are trained and transferred across domains, including two CNN-based and a Transformer-based architecture, to perform auto-tagging for each target domain dataset. Experimental results show that competitive performance is achieved in all domains via transfer learning, while the best source dataset varies for each music culture. The implementation and the trained models are both provided in a public repository.

en cs.SD, cs.CV
arXiv Open Access 2022
An adaptive music generation architecture for games based on the deep learning Transformer model

Gustavo Amaral Costa dos Santos, Augusto Baffa, Jean-Pierre Briot et al.

This paper presents an architecture for generating music for video games based on the Transformer deep learning model. Our motivation is to be able to customize the generation according to the taste of the player, who can select a corpus of training examples, corresponding to his preferred musical style. The system generates various musical layers, following the standard layering strategy currently used by composers designing video game music. To adapt the music generated to the game play and to the player(s) situation, we are using an arousal-valence model of emotions, in order to control the selection of musical layers. We discuss current limitations and prospects for the future, such as collaborative and interactive control of the musical components.

en cs.SD, cs.LG
arXiv Open Access 2022
Contrastive Audio-Language Learning for Music

Ilaria Manco, Emmanouil Benetos, Elio Quinton et al.

As one of the most intuitive interfaces known to humans, natural language has the potential to mediate many tasks that involve human-computer interaction, especially in application-focused fields like Music Information Retrieval. In this work, we explore cross-modal learning in an attempt to bridge audio and language in the music domain. To this end, we propose MusCALL, a framework for Music Contrastive Audio-Language Learning. Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences, producing multimodal embeddings that can be used for text-to-audio and audio-to-text retrieval out-of-the-box. Thanks to this property, MusCALL can be transferred to virtually any task that can be cast as text-based retrieval. Our experiments show that our method performs significantly better than the baselines at retrieving audio that matches a textual description and, conversely, text that matches an audio query. We also demonstrate that the multimodal alignment capability of our model can be successfully extended to the zero-shot transfer scenario for genre classification and auto-tagging on two public datasets.

en cs.SD, cs.CL
arXiv Open Access 2022
Quantized GAN for Complex Music Generation from Dance Videos

Ye Zhu, Kyle Olszewski, Yu Wu et al.

We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates complex musical samples conditioned on dance videos. Our proposed framework takes dance video frames and human body motions as input, and learns to generate music samples that plausibly accompany the corresponding input. Unlike most existing conditional music generation works that generate specific types of mono-instrumental sounds using symbolic audio representations (e.g., MIDI), and that usually rely on pre-defined musical synthesizers, in this work we generate dance music in complex styles (e.g., pop, breaking, etc.) by employing a Vector Quantized (VQ) audio representation, and leverage both its generality and high abstraction capacity of its symbolic and continuous counterparts. By performing an extensive set of experiments on multiple datasets, and following a comprehensive evaluation protocol, we assess the generative qualities of our proposal against alternatives. The attained quantitative results, which measure the music consistency, beats correspondence, and music diversity, demonstrate the effectiveness of our proposed method. Last but not least, we curate a challenging dance-music dataset of in-the-wild TikTok videos, which we use to further demonstrate the efficacy of our approach in real-world applications -- and which we hope to serve as a starting point for relevant future research.

en cs.CV, cs.SD
arXiv Open Access 2022
Deep Performer: Score-to-Audio Music Performance Synthesis

Hao-Wen Dong, Cong Zhou, Taylor Berg-Kirkpatrick et al.

Music performance synthesis aims to synthesize a musical score into a natural performance. In this paper, we borrow recent advances in text-to-speech synthesis and present the Deep Performer -- a novel system for score-to-audio music performance synthesis. Unlike speech, music often contains polyphony and long notes. Hence, we propose two new techniques for handling polyphonic inputs and providing a fine-grained conditioning in a transformer encoder-decoder model. To train our proposed system, we present a new violin dataset consisting of paired recordings and scores along with estimated alignments between them. We show that our proposed model can synthesize music with clear polyphony and harmonic structures. In a listening test, we achieve competitive quality against the baseline model, a conditional generative audio model, in terms of pitch accuracy, timbre and noise level. Moreover, our proposed model significantly outperforms the baseline on an existing piano dataset in overall quality.

en cs.SD, cs.LG
arXiv Open Access 2022
Exploiting Negative Preference in Content-based Music Recommendation with Contrastive Learning

Minju Park, Kyogu Lee

Advanced music recommendation systems are being introduced along with the development of machine learning. However, it is essential to design a music recommendation system that can increase user satisfaction by understanding users' music tastes, not by the complexity of models. Although several studies related to music recommendation systems exploiting negative preferences have shown performance improvements, there was a lack of explanation on how they led to better recommendations. In this work, we analyze the role of negative preference in users' music tastes by comparing music recommendation models with contrastive learning exploiting preference (CLEP) but with three different training strategies - exploiting preferences of both positive and negative (CLEP-PN), positive only (CLEP-P), and negative only (CLEP-N). We evaluate the effectiveness of the negative preference by validating each system with a small amount of personalized data obtained via survey and further illuminate the possibility of exploiting negative preference in music recommendations. Our experimental results show that CLEP-N outperforms the other two in accuracy and false positive rate. Furthermore, the proposed training strategies produced a consistent tendency regardless of different types of front-end musical feature extractors, proving the stability of the proposed method.

en cs.IR, cs.LG
arXiv Open Access 2021
Musical Speech: A Transformer-based Composition Tool

Jason d'Eon, Sri Harsha Dumpala, Chandramouli Shama Sastry et al.

In this paper, we propose a new compositional tool that will generate a musical outline of speech recorded/provided by the user for use as a musical building block in their compositions. The tool allows any user to use their own speech to generate musical material, while still being able to hear the direct connection between their recorded speech and the resulting music. The tool is built on our proposed pipeline. This pipeline begins with speech-based signal processing, after which some simple musical heuristics are applied, and finally these pre-processed signals are passed through Transformer models trained on new musical tasks. We illustrate the effectiveness of our pipeline -- which does not require a paired dataset for training -- through examples of music created by musicians making use of our tool.

en cs.SD, cs.CV
arXiv Open Access 2020
End-to-end Music-mixed Speech Recognition

Jeongwoo Woo, Masato Mimura, Kazuyoshi Yoshii et al.

Automatic speech recognition (ASR) in multimedia content is one of the promising applications, but speech data in this kind of content are frequently mixed with background music, which is harmful for the performance of ASR. In this study, we propose a method for improving ASR with background music based on time-domain source separation. We utilize Conv-TasNet as a separation network, which has achieved state-of-the-art performance for multi-speaker source separation, to extract the speech signal from a speech-music mixture in the waveform domain. We also propose joint fine-tuning of a pre-trained Conv-TasNet front-end with an attention-based ASR back-end using both separation and ASR objectives. We evaluated our method through ASR experiments using speech data mixed with background music from a wide variety of Japanese animations. We show that time-domain speech-music separation drastically improves ASR performance of the back-end model trained with mixture data, and the joint optimization yielded a further significant WER reduction. The time-domain separation method outperformed a frequency-domain separation method, which reuses the phase information of the input mixture signal, both in simple cascading and joint training settings. We also demonstrate that our method works robustly for music interference from classical, jazz and popular genres.

en eess.AS
arXiv Open Access 2020
A Comprehensive Survey on Deep Music Generation: Multi-level Representations, Algorithms, Evaluations, and Future Directions

Shulei Ji, Jing Luo, Xinyu Yang

The utilization of deep learning techniques in generating various contents (such as image, text, etc.) has become a trend. Music in particular, the topic of this paper, has attracted widespread attention from researchers. The whole process of producing music can be divided into three stages, corresponding to the three levels of music generation: score generation produces scores, performance generation adds performance characteristics to the scores, and audio generation converts scores with performance characteristics into audio by assigning timbre or generates music in audio format directly. Previous surveys have explored the network models employed in the field of automatic music generation. However, the development history, the model evolution, as well as the pros and cons of approaches to the same music generation task have not been clearly illustrated. This paper attempts to provide an overview of various composition tasks under different music generation levels, covering most of the currently popular music generation tasks using deep learning. In addition, we summarize the datasets suitable for diverse tasks, discuss the music representations, the evaluation methods as well as the challenges under different levels, and finally point out several future directions.

en cs.SD, cs.LG
arXiv Open Access 2019
musicnn: Pre-trained convolutional neural networks for music audio tagging

Jordi Pons, Xavier Serra

Pronounced as "musician", the musicnn library contains a set of pre-trained musically motivated convolutional neural networks for music audio tagging: https://github.com/jordipons/musicnn. This repository also includes some pre-trained vgg-like baselines. These models can be used as out-of-the-box music audio taggers, as music feature extractors, or as pre-trained models for transfer learning. We also provide the code to train the aforementioned models: https://github.com/jordipons/musicnn-training. This framework also allows implementing novel models. For example, a musically motivated convolutional neural network with an attention-based output layer (instead of the temporal pooling layer) can achieve state-of-the-art results for music audio tagging: 90.77 ROC-AUC / 38.61 PR-AUC on the MagnaTagATune dataset --- and 88.81 ROC-AUC / 31.51 PR-AUC on the Million Song Dataset.

en cs.SD, cs.CL
arXiv Open Access 2017
Clustering of Musical Pieces through Complex Networks: an Assessment over Guitar Solos

Stefano Ferretti

Musical pieces can be modeled as complex networks. This fosters innovative ways to categorize music, paving the way towards novel applications in multimedia domains, such as music didactics, multimedia entertainment and digital music generation. Clustering these networks through their main metrics allows grouping similar musical tracks. To show the viability of the approach, we provide results on a dataset of guitar solos.

en cs.IR, cs.SD
arXiv Open Access 2017
Quantum approaches to music cognition

Peter beim Graben, Reinhard Blutner

Quantum cognition emerged as an important discipline of mathematical psychology during the last two decades. Using abstract analogies between mental phenomena and the formal framework of physical quantum theory, quantum cognition demonstrated its ability to resolve several puzzles from cognitive psychology. Until now, quantum cognition essentially exploited ideas from projective (Hilbert space) geometry, such as quantum probability or quantum similarity. However, many powerful tools provided by physical quantum theory, e.g., symmetry groups, have not been utilized in the field of quantum cognition research so far. Inspired by seminal work by Guerino Mazzola on the symmetries of tonal music, our study aims at elucidating and reconciling static and dynamic tonal attraction phenomena in music psychology within the quantum cognition framework. Based on the fundamental principles of octave equivalence, fifth similarity and transposition symmetry of tonal music that are reflected by the structure of the circle of fifths, we develop different wave function descriptions over this underlying tonal space. We present quantum models for static and dynamic tonal attraction and compare them with traditional computational models in musicology. Our approach replicates and also improves predictions based on symbolic models of music perception.

en q-bio.NC
arXiv Open Access 2016
A Consumer BCI for Automated Music Evaluation Within a Popular On-Demand Music Streaming Service - Taking Listener's Brainwaves to Extremes

Fotis Kalaganis, Dimitrios A. Adamos, Nikos Laskaris

We investigated the possibility of using a machine-learning scheme in conjunction with commercial wearable EEG-devices for translating listener's subjective experience of music into scores that can be used for the automated annotation of music in popular on-demand streaming services. Based on the established (neuroscientifically sound) concepts of brainwave frequency bands, activation asymmetry index and cross-frequency-coupling (CFC), we introduce a Brain Computer Interface (BCI) system that automatically assigns a rating score to the listened song. Our research operated in two distinct stages: i) a generic feature engineering stage, in which features from signal-analytics were ranked and selected based on their ability to associate music induced perturbations in brainwaves with listener's appraisal of music. ii) a personalization stage, during which the efficiency of extreme learning machines (ELMs) is exploited so as to translate the derived patterns into a listener's score. Encouraging experimental results, from a pragmatic use of the system, are presented.

en cs.AI, cs.CY
arXiv Open Access 2016
Convolutional Recurrent Neural Networks for Music Classification

Keunwoo Choi, George Fazekas, Mark Sandler et al.

We introduce a convolutional recurrent neural network (CRNN) for music tagging. CRNNs take advantage of convolutional neural networks (CNNs) for local feature extraction and recurrent neural networks for temporal summarisation of the extracted features. We compare CRNN with three CNN structures that have been used for music tagging while controlling the number of parameters with respect to their performance and training time per sample. Overall, we found that CRNNs show a strong performance with respect to the number of parameters and training time, indicating the effectiveness of their hybrid structure in music feature extraction and feature summarisation.

en cs.NE, cs.LG