Results for "Music"

Showing 20 of ~1,058,365 results · from arXiv, DOAJ, CrossRef, Semantic Scholar

arXiv Open Access 2026
A Design Space for Live Music Agents

Yewon Kim, Stephen Brade, Alexander Wang et al.

Live music provides a uniquely rich setting for studying creativity and interaction due to its spontaneous nature. The pursuit of live music agents--intelligent systems supporting real-time music performance and interaction--has captivated researchers across HCI, AI, and computer music for decades, and recent advancements in AI suggest unprecedented opportunities to evolve their design. However, the interdisciplinary nature of music has led to fragmented development across research communities, hindering effective communication and collaborative progress. In this work, we bring together perspectives from these diverse fields to map the current landscape of live music agents. Based on our analysis of 184 systems across both academic literature and video, we develop a comprehensive design space that categorizes dimensions spanning usage contexts, interactions, technologies, and ecosystems. By highlighting trends and gaps in live music agents, our design space offers researchers, designers, and musicians a structured lens to understand existing systems and shape future directions in real-time human-AI music co-creation. We release our annotated systems as a living artifact at https://live-music-agents.github.io.

en cs.HC
arXiv Open Access 2026
Bangla Music Genre Classification Using Bidirectional LSTMs

Muntakimur Rahaman, Md Mahmudul Hoque, Md Mehedi Hassain

Bangla music is rich in its own musical culture. Nowadays, music genre classification is very significant because of the exponential increase in available music, in both digital and physical formats, and it is necessary to index recordings accordingly to facilitate improved retrieval. Automatically classifying Bangla music by genre is essential for efficiently locating specific pieces within a vast and diverse music library. Prevailing methods for genre classification predominantly employ conventional machine learning or deep learning approaches. This work introduces a novel music dataset comprising ten distinct genres of Bangla music. For the task of audio classification, we utilize a recurrent neural network (RNN) architecture: specifically, a Long Short-Term Memory (LSTM) network is implemented to train the model and perform the classification. Feature extraction represents a foundational stage in audio data processing, and this study utilizes Mel-Frequency Cepstral Coefficients (MFCCs) to transform raw audio waveforms into a compact and representative set of features. The proposed framework facilitates music genre classification by leveraging these extracted features. Experimental results demonstrate a classification accuracy of 78%, indicating the system's strong potential to enhance and streamline the organization of Bangla music genres.

en cs.SD, cs.LG
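As a minimal sketch of the MFCC-plus-(Bi)LSTM pipeline this abstract describes, the snippet below extracts MFCCs with librosa and feeds them to a small bidirectional LSTM classifier. The file paths, 13-coefficient/10-genre sizes, and all hyperparameters are illustrative assumptions, not the authors' values.

```python
import numpy as np
import librosa
import tensorflow as tf

N_MFCC, N_GENRES = 13, 10  # 13 MFCCs, ten Bangla genres (assumed sizes)

def extract_mfcc(path, sr=22050, duration=30.0):
    """Load a clip and return a (frames, n_mfcc) MFCC matrix."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)
    return mfcc.T  # time-major for the RNN

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, N_MFCC)),           # variable-length MFCC sequences
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(N_GENRES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, ...) once clips are batched to a common length.
```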
DOAJ Open Access 2026
From the Communal Music Making to Deep Learning: AI, Copyright, and the Soul of African Music

Amon Kipyegon Kirui, Tolu Owoaje

This paper critically examines the impact of Generative Artificial Intelligence (AI) on the legal, economic, and cultural integrity of African music. It discusses the reciprocity of intellectual property (IP) functionality and the undermining of culture against the backdrop of algorithmic datafication. Drawing on postcolonial theory (more specifically, the data-colonialism framework), the analysis focuses on the process by which Global North technology corporations harvest cultural information from Global South producers through the expansion of Artificial Intelligence. Through legal-ethnographic triangulation combining legal analysis of copyright regimes with extensive interviews of Kenyan creative professionals, the study reveals a structural imbalance between Western standards of IP law and the African epistemology of oral traditions. Although AI is a source of universal precarity for the global community of creators, the results show a particular danger of ontological obliteration of African idioms, occurring in the form of rhythmic flattening and digital orientalism. The analysis also records the techniques Kenyan counter-movements, known as digital resistance, use to deconstruct algorithmic quantisation and demand creative sovereignty. The paper concludes with an urgent appeal for decolonized IP frameworks, such as communal data trusts, to safeguard the essence, or sovereign intentionality, of African music.

Music and books on Music
arXiv Open Access 2025
MusFlow: Multimodal Music Generation via Conditional Flow Matching

Jiahao Song, Yuzhao Wang

Music generation aims to create music segments that align with human aesthetics based on diverse conditional information. Despite advancements in generating music from specific textual descriptions (e.g., style, genre, instruments), practical application is still hindered by ordinary users' limited expertise or time to write accurate prompts. To bridge this application gap, this paper introduces MusFlow, a novel multimodal music generation model using Conditional Flow Matching. We employ multiple Multi-Layer Perceptrons (MLPs) to align multimodal conditional information into the audio's CLAP embedding space. Conditional flow matching is trained to reconstruct the compressed Mel-spectrogram in the pretrained VAE latent space, guided by the aligned feature embedding. MusFlow can generate music from images, story texts, and music captions. To collect data for model training, inspired by multi-agent collaboration, we construct an intelligent data-annotation workflow centered around a fine-tuned Qwen2-VL model. Using this workflow, we build a new multimodal music dataset, MMusSet, with each sample containing a quadruple of image, story text, music caption, and music piece. We conduct four sets of experiments: image-to-music, story-to-music, caption-to-music, and multimodal music generation. Experimental results demonstrate that MusFlow can generate high-quality music pieces whether the input conditions are unimodal or multimodal. We hope this work can advance the application of music generation in the multimedia field, making music creation more accessible. Our generated samples, code, and dataset are available at musflow.github.io.

en cs.SD, cs.MM
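For readers unfamiliar with the training objective named in the title, here is a hedged toy sketch of conditional flow matching: a network predicts the velocity that transports noise to a target latent (standing in for the VAE-compressed Mel-spectrogram), conditioned on an embedding (standing in for the CLAP-aligned condition). The tiny MLP, dimensions, and names are illustrative assumptions, not MusFlow's architecture.

```python
import torch
import torch.nn as nn

LATENT_DIM, COND_DIM = 64, 32  # assumed sizes

net = nn.Sequential(  # velocity field v(x_t, t, c); a real model would be far larger
    nn.Linear(LATENT_DIM + COND_DIM + 1, 256), nn.SiLU(),
    nn.Linear(256, LATENT_DIM),
)

def flow_matching_loss(x1, cond):
    """x1: target latents (B, LATENT_DIM); cond: condition embeddings (B, COND_DIM)."""
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.size(0), 1)                  # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                     # linear interpolation path
    v_target = x1 - x0                             # constant velocity along the path
    v_pred = net(torch.cat([xt, cond, t], dim=-1))
    return ((v_pred - v_target) ** 2).mean()

loss = flow_matching_loss(torch.randn(8, LATENT_DIM), torch.randn(8, COND_DIM))
loss.backward()
```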
arXiv Open Access 2025
A Survey on Cross-Modal Interaction Between Music and Multimodal Data

Sifei Li, Mining Tan, Feier Shen et al.

Multimodal learning has driven innovation across various industries, particularly in the field of music. By enabling more intuitive interaction experiences and enhancing immersion, it not only lowers the entry barriers to music but also increases its overall appeal. This survey aims to provide a comprehensive review of multimodal tasks related to music, outlining how music contributes to multimodal learning and offering insights for researchers seeking to expand the boundaries of computational music. Unlike text and images, which are often semantically or visually intuitive, music primarily interacts with humans through auditory perception, making its data representation inherently less intuitive. Therefore, this paper first introduces the representations of music and provides an overview of music datasets. Subsequently, we categorize cross-modal interactions between music and multimodal data into three types: music-driven cross-modal interactions, music-oriented cross-modal interactions, and bidirectional music cross-modal interactions. For each category, we systematically trace the development of relevant sub-tasks, analyze existing limitations, and discuss emerging trends. Furthermore, we provide a comprehensive summary of the datasets and evaluation metrics used in multimodal tasks related to music, offering benchmark references for future research. Finally, we discuss the current challenges in cross-modal interactions involving music and propose potential directions for future research.

en cs.MM, cs.SD
DOAJ Open Access 2025
Singing to speech conversion with generative flow

Jiawen Huang, Emmanouil Benetos

This paper introduces singing-to-speech conversion (S2S), a cross-domain voice conversion task, and presents the first deep learning-based S2S system. S2S aims to transform singing into speech while retaining the phonetic information, reducing variations in pitch, rhythm, and timbre. Inspired by the Glow-TTS architecture, the proposed model is built using generative flow, with an adjusted alignment module between the latent features. We adapt the original monotonic alignment search (MAS) to the S2S scenario and utilize a duration predictor to handle the duration differences between the two modalities. Subjective evaluations show that the proposed model outperforms signal-processing baselines in naturalness and outperforms a transcribe-and-synthesize baseline in phonetic similarity to the original singing. We further demonstrate that singing-to-speech conversion can be an effective augmentation method for low-resource lyrics transcription.

Acoustics. Sound, Electronic computers. Computer science
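The paper bridges singing and speech timing with a duration predictor. As a hedged illustration of the general mechanism, the snippet below shows a FastSpeech-style "length regulator" that repeats latent units according to predicted frame counts; this is a generic analogy under stated assumptions, not the authors' exact module.

```python
import torch

def length_regulate(units, durations):
    """units: (T, D) latent features; durations: (T,) integer frame counts.
    Returns the features expanded to the predicted speech-domain length."""
    return torch.repeat_interleave(units, durations, dim=0)

units = torch.randn(4, 8)                    # 4 phonetic units, 8-dim features
durations = torch.tensor([3, 1, 5, 2])       # predicted frames per unit
speech_frames = length_regulate(units, durations)
print(speech_frames.shape)                   # torch.Size([11, 8])
```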
arXiv Open Access 2024
UniMuMo: Unified Text, Music and Motion Generation

Han Yang, Kun Su, Yutong Zhang et al.

We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired music and motion data based on rhythmic patterns to leverage existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with a single training task of music-motion joint generation. Moreover, the model is designed by fine-tuning existing pre-trained single-modality models, significantly reducing computational demands. Extensive experiments demonstrate that UniMuMo achieves competitive results on all unidirectional generation benchmarks across music, motion, and text modalities. Quantitative results are available on the project page: https://hanyangclarence.github.io/unimumo_demo/.

en cs.SD, cs.CV
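To make the "parallel generation" idea concrete, here is a toy layout of the kind the abstract describes: each timestep carries one music token and one motion token (motion quantized into the music codebook's space), so a single autoregressive decoder can emit both streams jointly. The codebook size, offset trick, and interleaving order are illustrative assumptions, not UniMuMo's published format.

```python
import numpy as np

V = 1024                                   # shared codebook size (assumed)
music = np.random.randint(0, V, size=16)   # music token stream
motion = np.random.randint(0, V, size=16)  # motion tokens in the same space

# Interleave the two streams so one decoder sees
# [music_0, motion_0, music_1, motion_1, ...]; an offset keeps the
# motion tokens in a disjoint id range.
sequence = np.empty(32, dtype=np.int64)
sequence[0::2] = music
sequence[1::2] = motion + V
print(sequence[:6])
```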
arXiv Open Access 2024
MuCodec: Ultra Low-Bitrate Music Codec

Yaoxun Xu, Hangting Chen, Jianwei Yu et al.

Music codecs are a vital aspect of audio codec research, and ultra-low-bitrate compression holds significant importance for music transmission and generation. Due to the complexity of music backgrounds and the richness of vocals, solely modeling semantic or acoustic information cannot effectively reconstruct music with both vocals and backgrounds. To address this issue, we propose MuCodec, specifically targeting music compression and reconstruction at ultra-low bitrates. MuCodec employs MuEncoder to extract both acoustic and semantic features, discretizes them with RVQ, and obtains Mel-VAE features via flow matching. The music is then reconstructed using a pre-trained Mel-VAE decoder and HiFi-GAN. MuCodec can reconstruct high-fidelity music at ultra-low (0.35 kbps) or higher (1.35 kbps) bitrates, achieving the best results to date on both subjective and objective metrics. Code and demo: https://xuyaoxun.github.io/MuCodec_demo/.

en cs.SD, eess.AS
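Residual vector quantization (RVQ), the discretization step named in this abstract, works by letting each stage quantize the residual left by the previous one. Below is a minimal NumPy sketch of generic RVQ; the number of stages, codebook size, and feature dimension are assumptions for illustration, not MuCodec's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 16)) for _ in range(4)]  # 4 stages, 256 codes, dim 16

def rvq_encode(x, codebooks):
    """Quantize vector x (dim,) stage by stage; return code indices and reconstruction."""
    residual, indices, recon = x.copy(), [], np.zeros_like(x)
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))  # nearest code
        indices.append(idx)
        recon += cb[idx]
        residual = x - recon          # quantize what is still unexplained
    return indices, recon

x = rng.normal(size=16)
codes, x_hat = rvq_encode(x, codebooks)
print(codes, float(((x - x_hat) ** 2).mean()))  # error shrinks with more stages
```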
DOAJ Open Access 2024
Association of electronic screen exposure with depression among women in early pregnancy: a cross-sectional study

Qianqian Yang, Qian Wang, Hongzhi Zhang et al.

Background: Previous studies indicated that excessive engagement with digital devices can have negative psychological impacts in the general population. We aimed to determine the association of electronic screen exposure with depression among women in early pregnancy. Methods: A cross-sectional study was conducted from June 2021 to June 2022. A total of 665 women in early pregnancy were recruited; the information collected included socio-demographic characteristics, screen exposure, and the Patient Health Questionnaire-9 depression scale. Results: Among the women in early pregnancy, total daily smartphone viewing time was the longest (median [P25-P75], 5 [3–6] hours/day) of the three types of electronic screen exposure. Total daily smartphone viewing time (P = 0.015, OR [95%CI] = 1.09 [1.11–1.18]), and smartphone (P = 0.016, OR [95%CI] = 1.24 [1.04–1.47]) and television viewing time (P = 0.006, OR [95%CI] = 1.35 [1.09–1.67]) before nocturnal sleep, were significantly associated with depression among women in early pregnancy. The thresholds calculated by receiver operating characteristic curves were 7.5 h/day, 1.5 h/day, and 1.5 h/day, respectively. In addition, women with higher smartphone addiction scores were more susceptible to depression (P<0.001, OR [95%CI] = 1.11 [1.07–1.16]). The top three smartphone usages among women with depression were watching videos (22.0%), listening to music (20.9%), and playing games (16.7%). Conclusions: Electronic screen exposure, including screen viewing time, smartphone addiction, and problematic smartphone use, was associated with depression among women in early pregnancy. Further studies are warranted to verify these conclusions.

Gynecology and obstetrics
DOAJ Open Access 2024
From Life-Skills Research and Training to Sustainability: A Case Study from a Spanish University

Pilar Posadas de Julián, Carmen Verdejo Lucas, Belén de Rueda Villén et al.

We are currently facing a potential ‘polycrisis’, a critical inflection point that requires a holistic response aimed at building collective foresight and preparedness for short-, medium-, and long-term risks. The role of higher education institutions and social stakeholders is decisive for sustainability goals. This paper presents a case study where academia, governance, and industry have aligned to challenge, inspire, and encourage universities to enhance student growth and bind macro-scale measures leading to a sustainable future. A teaching innovation project has served as a transforming lever, in combination with the private sector, to create a platform that reaches more than 50,000 undergraduate students and teaching staff. This structure, rooted in the 2031 Strategic Plan of the University of Granada, has also served to channel local and regional initiatives, establish effective partnerships with broad social members, raise awareness, and promote actions to advance in the pursuit of Sustainable Development Goals. A comprehensive overview is provided, which details its chronology, materials, results, challenges, impact, and descriptions of the various courses, programs, and actions. The paper concludes with recommendations for future research, policy and cooperation among stakeholders.

Technology, Science (General)
arXiv Open Access 2023
Music Rearrangement Using Hierarchical Segmentation

Christos Plachouras, Marius Miron

Music rearrangement involves reshuffling, deleting, and repeating sections of a music piece with the goal of producing a standalone version that has a different duration. It is a creative and time-consuming task commonly performed by an expert music engineer. In this paper, we propose a method for automatically rearranging music recordings that takes into account the hierarchical structure of the recording. Previous approaches focus solely on identifying cut-points in the audio that could result in smooth transitions. We instead utilize deep audio representations to hierarchically segment the piece and define a cut-point search subject to the boundaries and musical functions of the segments. We score suitable entry- and exit-point pairs based on their similarity and the segments they belong to, and define an optimal path search. Experimental results demonstrate that the selected cut-points are most commonly imperceptible to listeners and result in more consistent musical development with fewer distracting repetitions.

en cs.SD, cs.IR
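A hedged sketch of the cut-point idea: given per-beat feature vectors, candidate (exit, entry) pairs are scored by frame similarity so a jump lands where the audio sounds alike. The cosine scoring below is illustrative; the paper's hierarchical segment constraints and optimal path search are omitted.

```python
import numpy as np

def cut_point_scores(feats):
    """feats: (T, D) per-beat embeddings. Returns a (T, T) cosine-similarity
    matrix where entry (i, j) scores jumping from beat i to beat j."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ f.T

feats = np.random.default_rng(1).normal(size=(200, 32))
S = cut_point_scores(feats)
np.fill_diagonal(S, -np.inf)             # forbid trivial self-jumps
exit_i, entry_j = np.unravel_index(np.argmax(S), S.shape)
print(f"best jump: beat {exit_i} -> beat {entry_j}")
```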
arXiv Open Access 2023
Mixing Levels -- A Rock Music Spirit Level App

Tim Ziemer

To date, sonification apps are rare; music apps, on the other hand, are widely used, and smartphone users like to play with music. In this manuscript, we present Mixing Levels, a spirit-level sonification based on music mixing. Tilting the smartphone adjusts the volumes of five musical instruments in a rock music loop; only when the device is perfectly level are all instruments in the mix well audible. The app is meant to be both useful and fun: because it appears like a music mixing console, people enjoy interacting with Mixing Levels, so that learning the sonification is a playful experience.

en cs.MM, cs.HC
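A toy version of the core mapping: device tilt sets the gains of five stems so that all instruments are fully audible only when the phone is level. The app's actual mapping is not given in the abstract; the staggered crossfade below is an assumption for illustration only.

```python
import math

STEMS = ["drums", "bass", "rhythm guitar", "lead guitar", "vocals"]

def stem_gains(pitch_deg, roll_deg, max_tilt=45.0):
    """Return a gain in [0, 1] per stem; a level device yields all gains 1.0."""
    tilt = min(math.hypot(pitch_deg, roll_deg), max_tilt) / max_tilt
    gains = []
    for i, _ in enumerate(STEMS):
        # each stem fades out over a different tilt range, so the mix
        # degrades stem by stem as the device tips further
        fade_start = i / len(STEMS)
        g = 1.0 - max(0.0, (tilt - fade_start) / (1.0 - fade_start))
        gains.append(max(0.0, g))
    return gains

print(dict(zip(STEMS, stem_gains(0.0, 0.0))))    # level: all 1.0
print(dict(zip(STEMS, stem_gains(20.0, 10.0))))  # tilted: stems fade unevenly
```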
arXiv Open Access 2023
From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation

Adarsh Kumar, Pedro Sarmento

Subword tokenization has been widely successful in text-based natural language processing (NLP) tasks with Transformer-based models. As Transformer models become increasingly popular in symbolic music-related studies, it is imperative to investigate the efficacy of subword tokenization in the symbolic music domain. In this paper, we explore subword tokenization techniques, such as byte-pair encoding (BPE), in symbolic music generation and their impact on the overall structure of generated songs. Our experiments are based on three types of MIDI datasets: single-track melody only, multi-track with a single instrument, and multi-track and multi-instrument. We apply subword tokenization on top of existing musical tokenization schemes and find that it enables the generation of longer songs in the same amount of time and improves the overall structure of the generated music in terms of objective metrics like the structure indicator (SI) and pitch-class entropy. We also compare two subword tokenization methods, BPE and Unigram, and observe that both lead to consistent improvements. Our study suggests that subword tokenization is a promising technique for symbolic music generation and may have broader implications for music composition, particularly in cases involving complex data such as multi-track songs.

en cs.SD, cs.LG
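To make the layering concrete, here is standard BPE applied to symbolic-music token streams: the most frequent adjacent token pair is repeatedly merged into a compound token, shortening sequences. The toy REMI-like event names below are assumptions; real inputs would come from the paper's MIDI tokenizers.

```python
from collections import Counter

def merge_pair(seq, pair, merged):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged); i += 2
        else:
            out.append(seq[i]); i += 1
    return tuple(out)

def bpe_merges(corpus, num_merges):
    """corpus: list of token tuples. Greedily merge the most frequent
    adjacent pair num_merges times, as in standard BPE."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter(p for seq in corpus for p in zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [merge_pair(seq, best, best[0] + "+" + best[1]) for seq in corpus]
    return corpus, merges

songs = [("Bar", "Pos0", "C4", "Pos2", "E4"), ("Bar", "Pos0", "C4", "Pos2", "G4")]
compressed, merges = bpe_merges(songs, num_merges=2)
print(merges)       # frequent event pairs become single subword tokens
print(compressed)   # shorter sequences -> longer songs per context window
```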
arXiv Open Access 2023
Choir Transformer: Generating Polyphonic Music with Relative Attention on Transformer

Jiuyang Zhou, Hong Zhu, Xingping Wang

Polyphonic music generation remains a challenging direction because the generated melody and harmony must stay mutually consistent. Most previous studies used RNN-based models; however, RNN-based models struggle to establish relationships between long-distance notes. In this paper, we propose a polyphonic music generation neural network named Choir Transformer [https://github.com/Zjy0401/choir-transformer], with relative positional attention to better model the structure of music. We also propose a music representation suitable for polyphonic music generation. The performance of Choir Transformer surpasses the previous state-of-the-art accuracy by 4.06%. We also measure the harmony metrics of the generated polyphonic music; experiments show that these metrics are close to those of Bach's music. In practical application, the generated melody and rhythm can be adjusted according to the specified input, in different styles of music such as folk or pop.

en eess.AS, cs.AI
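A minimal sketch of relative positional attention, the mechanism named above, in the Shaw-style bias form: a learned bias indexed by the (clipped) relative offset j - i is added to the attention logits, so a note attends by musical distance rather than absolute index. Dimensions are illustrative, and this is a generic variant, not necessarily Choir Transformer's exact formulation.

```python
import torch
import torch.nn.functional as F

T, D, MAX_REL = 8, 16, 8                    # sequence length, head dim, max offset
q, k, v = torch.randn(T, D), torch.randn(T, D), torch.randn(T, D)
rel_bias = torch.nn.Parameter(torch.zeros(2 * MAX_REL - 1))  # one bias per offset

# bias[i, j] depends only on the clipped relative distance j - i
offsets = torch.arange(T)[None, :] - torch.arange(T)[:, None]
offsets = offsets.clamp(-MAX_REL + 1, MAX_REL - 1) + MAX_REL - 1
logits = q @ k.T / D ** 0.5 + rel_bias[offsets]
out = F.softmax(logits, dim=-1) @ v
print(out.shape)                            # torch.Size([8, 16])
```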
DOAJ Open Access 2023
A fully capable pianist with a congenital bilateral agenesis of extensor pollicis brevis muscle

K. P. Dąbrowski, P. Palczewski, H. Stankiewicz-Jóźwicka et al.

A 28-year-old male music student presented with a visible inability of active abduction and extension of the thumbs in both hands beyond the neutral position. The student had not been previously diagnosed and reported no history of trauma or surgical procedures in the area of the hands and no family history of such disabilities. The student remained capable of playing keyboard instruments at a high level due to compensation by hyperextension of the interphalangeal joints of both thumbs, and showed no increased frequency of injuries or playing-related disorders. Ultrasound and magnetic resonance imaging showed complete bilateral agenesis of the extensor pollicis brevis muscles, classified as isolated congenital clasped thumb syndrome. Due to the age of the student and the agenesis of the muscles, conservative treatment was deemed inadequate, and given the student's high functionality as a musician and the unforeseeable consequences surgery might have for a musician's career, surgical treatment was advised against.

Human anatomy, Cytology
DOAJ Open Access 2023
Analysis of Structure and Techniques of Pancula Ndeme By Justinus Hokey on Classic Guitar Composition

Juanita Theresia Adimurti, Fery Virgiawan Tampatonda

This study aims to describe the structure and playing techniques of Pancula Ndeme by Justinus Hokey as a classical guitar composition. The study uses a qualitative approach; the data analyzed is the sheet music of Pancula Ndeme. The results indicate that the structure of this composition consists of several parts, namely Introduction – A – B – C – A – B – Coda. There are four motifs in section A (m1, m2, m3, m4), two sentences (sentences a and a'), and two themes in each sentence. Section B consists of three motifs (n1, n2, n3), two phrases (phrases b and b'), and two themes in each phrase. Section C consists of two motifs (o1 and o2). The guitar playing techniques in this composition comprise right-hand and left-hand techniques: the right-hand techniques include apoyando, tirando, and strumming, while the left-hand techniques include slur, barre, and harmonics.

Music, Musical instruction and study
arXiv Open Access 2022
Multitrack Music Transformer

Hao-Wen Dong, Ke Chen, Shlomo Dubnov et al.

Existing approaches for generating multitrack music with transformer models have been limited in terms of the number of instruments, the length of the music segments, and inference speed. This is partly due to the memory requirements of the lengthy input sequences necessitated by existing representations. In this work, we propose a new multitrack music representation that allows a diverse set of instruments while keeping a short sequence length. Our proposed Multitrack Music Transformer (MMT) achieves comparable performance with state-of-the-art systems, landing in between two recently proposed models in a subjective listening test, while achieving substantial speedups and memory reductions over both, making the method attractive for real-time improvisation or near-real-time creative applications. Further, we propose a new measure for analyzing musical self-attention and show that the trained model attends more to notes that form a consonant interval with the current note and to notes that are 4N beats away from the current step.

en cs.SD, cs.AI
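The sequence-length saving comes from encoding each note as one compact multi-field event rather than several separate tokens. Below is a hedged sketch of such an encoding; the field names and value ranges approximate the idea but are illustrative, not the paper's exact vocabulary.

```python
from dataclasses import dataclass

@dataclass
class NoteEvent:
    beat: int        # which beat the note starts on
    position: int    # sub-beat position within the beat
    pitch: int       # MIDI pitch number
    duration: int    # length in sub-beat ticks
    instrument: int  # program/track id

# One event per note keeps a 3-instrument bar to a handful of rows, where a
# token-per-field stream would be several times longer.
bar = [
    NoteEvent(beat=0, position=0, pitch=60, duration=4, instrument=0),   # piano C4
    NoteEvent(beat=0, position=0, pitch=36, duration=8, instrument=33),  # bass C2
    NoteEvent(beat=1, position=2, pitch=67, duration=2, instrument=25),  # guitar G4
]
for e in bar:
    print(e)
```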
arXiv Open Access 2022
Psychologically-Inspired Music Recommendation System

Danila Rozhevskii, Jie Zhu, Boyuan Zhao

In the last few years, automated recommendation systems have been a major focus in the music field, where companies such as Spotify, Amazon, and Apple compete on the ability to generate the most personalized music suggestions for their users. One challenge developers still fail to tackle is taking into account the psychological and emotional aspects of music. Our goal is to find a way to integrate users' personal traits and their current emotional state into a single music recommendation system (MRS) with both collaborative and content-based filtering. We seek to relate the personality and the current emotional state of the listener to audio features in order to build an emotion-aware MRS. We compare the results both quantitatively and qualitatively with the output of a traditional MRS based on Spotify API data to understand whether our advancements make a significant impact on the quality of music recommendations.

en cs.IR, cs.AI
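A hedged sketch of the hybrid scoring the abstract outlines: a collaborative-filtering score is blended with content similarity between a track's audio features and a target vector derived from the listener's traits and current mood. The feature set, target vector, and blend weight are illustrative assumptions, not the authors' model.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_score(cf_score, track_feats, user_emotion_target, alpha=0.6):
    """Blend a collaborative score with an emotion-aware content score."""
    content = cosine(track_feats, user_emotion_target)
    return alpha * cf_score + (1 - alpha) * content

# Spotify-style audio features, e.g. (valence, energy, danceability, acousticness)
track = np.array([0.2, 0.3, 0.4, 0.8])
calm_sad_target = np.array([0.1, 0.2, 0.3, 0.9])  # derived from mood + traits
print(hybrid_score(cf_score=0.7, track_feats=track,
                   user_emotion_target=calm_sad_target))
```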
arXiv Open Access 2022
Volume-Independent Music Matching by Frequency Spectrum Comparison

Anthony Lee

Often, I hear a piece of music and wonder what its name is. Indeed, there are applications such as the Shazam app that provide music matching. However, the limitation of those apps is that the same piece performed by the same musician cannot be identified if it is not the same recording: Shazam identifies the recording, not the music, because it matches the variation in volume, not the frequencies of the sound. This research attempts to match music the way humans understand it: by the frequency spectrum of the music, not its volume variation. Essentially, the idea is to precompute the frequency spectrums of all the music in the database, then take the unknown piece and try to match its frequency spectrum against every segment of every piece in the database. I did this by sliding a window over each piece in 0.1-second steps, normalizing the audio, subtracting the normalized spectrum arrays, and taking the sum of absolute differences as the error; the segment with the least error is considered the candidate match. The matching performance proved to depend on the complexity of the music. Matching simple music, such as single-note pieces, was successful. More complex pieces, such as Chopin's Ballade No. 4, were not: the algorithm could not produce low error values for any piece in the database. I suspect this has to do with having too many notes: mismatches in the higher harmonics added up to a significant amount of error, which swamps the calculation.

en cs.SD, eess.AS
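Since the procedure is spelled out in the abstract, it can be sketched almost directly: compare the query's normalized magnitude spectrum against every 0.1-second-shifted window of a database track using the sum of absolute differences. The FFT size and sample rate below are assumptions.

```python
import numpy as np

def spectrum(x, n_fft=4096):
    """Average magnitude spectrum of a signal, normalized to unit peak."""
    frames = np.abs(np.fft.rfft(x[: n_fft * (len(x) // n_fft)].reshape(-1, n_fft)))
    s = frames.mean(axis=0)
    return s / (s.max() + 1e-9)

def best_match(query, track, sr=22050, hop_s=0.1):
    """Slide the query over the track in 0.1 s steps; return (min_error, offset_s)."""
    hop, qlen = int(sr * hop_s), len(query)
    q = spectrum(query)
    best = (np.inf, 0.0)
    for start in range(0, len(track) - qlen + 1, hop):
        err = np.abs(q - spectrum(track[start:start + qlen])).sum()
        best = min(best, (err, start / sr))
    return best

sr = 22050
t = np.arange(sr * 5) / sr
track = np.sin(2 * np.pi * 440 * t)                 # 5 s of A4 as a toy "database" track
query = track[sr : sr * 2]                           # 1 s excerpt as the unknown piece
print(best_match(query, track, sr))                  # near-zero error at the best offset
```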
DOAJ Open Access 2022
Thomas Tallis’ Magnificats: Features of the Genre

Ekaterina V. Svirskaya

The article explores the specifics of the Magnificat settings produced by Thomas Tallis, one of the leading English composers of the 16th century. The development of the genre is a direct reflection of the religious history of 16th-century England, a peculiar and unique period in which Catholic monarchs would replace Protestant monarchs, inevitably leading to the reformation of church rites. Tallis as a composer witnessed the reigns of four monarchs, and the three Magnificats in his legacy appear to have been written for different forms of worship in the pre-Reformation English Catholic and post-Reformation Anglican Church. Russian musicology has not given much attention to Tallis's works; recently, a few studies have focused on some of his compositions, but his Magnificats are still outside the research scope. This article summarizes the findings of foreign researchers on the chronology of Tallis's works, the conditions under which the liturgical cycles were developed, and their relationship to religious changes in the country. The article also provides a detailed analysis of the Magnificats; in particular, it focuses on the specifics of architectonics and texture in the Magnificats with the Latin text and in the cycle written for the Anglican Church. Thus, the article discusses structural features of Tallis's Magnificat for the Anglican Church and raises the question of the emergence, during reformed worship, of the tradition that combines the Magnificat and Nunc dimittis in one cycle. In addition, the article examines the techniques Tallis used to approach the melodic basis of the Magnificats, which embrace unique features of British culture, and how he transformed them in a polyphonic setting.

Page 28 of 52919