Privacy-Aware Ambient Audio Sensing for Healthy Indoor Spaces
Bhawana Chhaglani
Indoor airborne transmission poses a significant health risk, yet current monitoring solutions are invasive, costly, or fail to address it directly. My research explores the untapped potential of ambient audio sensing to estimate key transmission risk factors, such as ventilation, aerosol emissions, and occupant distribution, non-invasively and in real time. I develop privacy-preserving systems that leverage existing microphones to monitor the full spectrum of indoor air quality, which can have a significant effect on an individual's health. This work lays the foundation for privacy-aware airborne risk monitoring using everyday devices.
Towards the Synthesis of Non-speech Vocalizations
Enjamamul Hoq, Ifeoma Nwogu
In this report, we focus on the unconditional generation of infant cry sounds using the DiffWave framework, which has shown great promise in generating high-quality audio from noise. We use two distinct datasets of infant cries: the Baby Chillanto and the deBarbaro cry dataset. These datasets are used to train the DiffWave model to generate new cry sounds that maintain high fidelity and diversity. The focus here is on DiffWave's capability to handle the unconditional generation task.
Maximum Likelihood Estimation of the Direction of Sound In A Reverberant Noisy Environment
Mohamed F. Mansour
We describe a new method for estimating the direction of sound in a reverberant environment from basic principles of sound propagation. The method utilizes SNR-adaptive features from time-delay and energy of the directional components after acoustic wave decomposition of the observed sound field to estimate the line-of-sight direction under noisy and reverberant conditions. The effectiveness of the approach is established with measured data of different microphone array configurations under various usage scenarios.
Adjoint-Based Identification of Sound Sources for Sound Reinforcement and Source Localization
Mathias Lemke, Lewin Stein
The identification of sound sources is a common problem in acoustics. Different parameters are sought; among these are the signal and the position of the sources. We present an adjoint-based approach for sound source identification, which employs computational aeroacoustic techniques. Two different applications are presented as a proof of concept: optimization of a sound reinforcement setup and the localization of (moving) sound sources.
Better speech synthesis through scaling
James Betker
In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model the process of image generation as a step-wise probabilistic process and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise -- an expressive, multi-voice text-to-speech system. All model code and trained weights have been open-sourced at https://github.com/neonbjb/tortoise-tts.
Modulation Graphs in Popular Music
Jason I. Brown, Ian George
In this paper, graph theory is used to explore the musical notion of tonal modulation, in theory and application. We define (pivot) modulation graphs based on the common scales used in popular music. Properties and parameters of these graphs are discussed. We also investigate modulation graphs for the canon of Lennon-McCartney songs in the works of The Beatles. Our approach may provide composers with mathematical insights into pivot modulation.
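The core construction can be illustrated with a small sketch. The following is not the authors' code, and the threshold of six shared pitch classes (enough common tones for a pivot) is an assumption made here for illustration; it builds a graph whose nodes are the twelve major scales and whose edges connect scales that share enough notes to modulate between:

```python
# Illustrative sketch of a "pivot modulation graph" over the 12 major scales.
# The min_common=6 threshold is an assumption for this example, not the
# paper's definition.
MAJOR = [0, 2, 4, 5, 7, 9, 11]  # semitone pattern of the major scale

def major_scale(tonic):
    """Pitch-class set of the major scale on the given tonic (0 = C)."""
    return frozenset((tonic + step) % 12 for step in MAJOR)

def modulation_graph(min_common=6):
    """Connect two scales when they share at least min_common pitch classes."""
    scales = {t: major_scale(t) for t in range(12)}
    edges = {t: set() for t in range(12)}
    for a in range(12):
        for b in range(a + 1, 12):
            if len(scales[a] & scales[b]) >= min_common:
                edges[a].add(b)
                edges[b].add(a)
    return edges

g = modulation_graph()
print(7 in g[0])   # True: C and G major share six of seven notes
print(6 in g[0])   # False: C and F# major share only two notes
```

With this threshold every node has degree two, and the graph is simply the circle of fifths as a 12-cycle; richer graphs arise once minor and modal scales are added as nodes.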
Binaural Audio Rendering in the Spherical Harmonic Domain: A Summary of the Mathematics and its Pitfalls
Jens Ahrens
The present document reviews the mathematics behind binaural rendering of sound fields that are available as spherical harmonic expansion coefficients. This process is also known as binaural ambisonic decoding. We highlight that the details entail some amount of peculiarity, so that one has to be well aware of the precise definitions that are chosen for some of the involved quantities to obtain a consistent formulation. We also discuss which sets of definitions produce ambisonic signals that are compatible with the most common software tools that are available.
Melody Classification based on Performance Event Vector and BRNN
Jinyue Guo, Aozhi Liu, Jing Xiao
We propose a model for the Conference on Sound and Music Technology (CSMT2020) data challenge on melody classification. Our model uses the Performance Event Vector as the input sequence to a Bidirectional RNN for classification. The model achieves satisfactory performance on the development dataset and the Wikifonia dataset. We also discuss the effect of several hyper-parameters and create multiple prediction outputs for the evaluation dataset.
Data-driven audio recognition: a supervised dictionary approach
Imad Rida
Machine hearing is an emerging area. Motivated by the need for a principled representation-learning framework that works across machine-listening applications, we propose a generic and data-driven approach. To this end, a novel and efficient supervised dictionary learning method is presented. Experiments are performed on both computational auditory scene recognition (East Anglia and Rouen) and synthetic music chord recognition datasets. The results show that our method matches state-of-the-art hand-crafted features on both applications.
The Sounds of Music : Science of Musical Scales III -- Indian Classical
Sushan Konar
In the previous articles of this series, we discussed the development of musical scales, particularly the heptatonic scale, which forms the basis of Western classical music today. In this last article, we take a look at the basic structure of the scales used in Indian classical music and how different 'ragas' are generated through the simple process of scale shifting.
Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics
Thomas Drugman, Abeer Alwan
This paper focuses on the problem of pitch tracking in noisy conditions. A method using harmonic information in the residual signal is presented. The proposed criterion is used both for pitch estimation, as well as for determining the voicing segments of speech. In the experiments, the method is compared to six state-of-the-art pitch trackers on the Keele and CSTR databases. The proposed technique is shown to be particularly robust to additive noise, leading to a significant improvement in adverse conditions.
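The residual-harmonics idea can be sketched numerically. The following is an illustrative toy implementation, not the authors' code: for each candidate pitch, it sums spectral energy at the harmonics and subtracts energy at inter-harmonic points, so that noise and subharmonic candidates are penalized. The number of harmonics and the search range are assumptions made here for the example:

```python
import numpy as np

def srh(amplitude_spectrum, freqs, f0_candidates, n_harmonics=5):
    """Toy summation-of-residual-harmonics criterion: reward energy at
    harmonics k*f0, penalize energy at inter-harmonic points (k - 0.5)*f0."""
    def E(f):
        # nearest-bin lookup of spectral amplitude at frequency f
        return amplitude_spectrum[np.argmin(np.abs(freqs - f))]
    scores = []
    for f0 in f0_candidates:
        s = E(f0)
        for k in range(2, n_harmonics + 1):
            s += E(k * f0) - E((k - 0.5) * f0)
        scores.append(s)
    return np.array(scores)

# Toy usage: a synthetic "residual" with five harmonics of 100 Hz,
# exactly one second long so FFT bins fall on integer frequencies.
fs = 8000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * 100 * k * t) for k in range(1, 6))
spec = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1 / fs)
cands = np.arange(60, 301)  # search 60-300 Hz in 1 Hz steps
f0 = cands[np.argmax(srh(spec, freqs, cands, n_harmonics=5))]
print(f0)  # 100
```

The inter-harmonic penalty is what makes the criterion reject the 50 Hz subharmonic, whose score collects only half of the harmonic peaks.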
Vocal melody extraction using patch-based CNN
Li Su
A patch-based convolutional neural network (CNN) model for vocal melody extraction in polyphonic music, inspired by object detection in image processing, is presented in this paper. The input of the model is a novel time-frequency representation which enhances the pitch contours and suppresses the harmonic components of a signal. This succinct data representation and the patch-based CNN model enable an efficient training process with limited labeled data. Experiments on various datasets show excellent speed and competitive accuracy compared to other deep learning approaches.
Generating Music using an LSTM Network
Nikhil Kotecha, Paul Young
A model of music needs the ability to recall past details and a clear, coherent understanding of musical structure. This paper details a neural network architecture that predicts and generates polyphonic music aligned with musical rules. The probabilistic model presented is a Bi-axial LSTM trained with a kernel reminiscent of a convolutional kernel. When analyzed quantitatively and qualitatively, this approach performs well in composing polyphonic music. A link to the code is provided.
Understanding MIDI: A Painless Tutorial on the MIDI Format
H. M. de Oliveira, R. C. de Oliveira
A short overview demystifying the MIDI audio format is presented. The goal is to explain the file structure and how its instructions are used to produce a music signal, for both monophonic and polyphonic signals.
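The instruction format described above can be made concrete with a tiny sketch (illustrative, not taken from the tutorial). In a Standard MIDI File, each event is preceded by a delta-time encoded as a variable-length quantity (7 bits per byte, high bit set on all but the last byte), followed by a status byte such as note-on (0x90 | channel) and its data bytes:

```python
def vlq(value):
    """Encode a delta-time as a MIDI variable-length quantity:
    7 bits per byte, MSB set on every byte except the last."""
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append(0x80 | (value & 0x7F))
        value >>= 7
    return bytes(reversed(out))

def note_on(channel, note, velocity):
    """Note-on channel message: status 0x9n, then note and velocity."""
    return bytes([0x90 | channel, note, velocity])

def note_off(channel, note):
    """Note-off channel message: status 0x8n, then note and velocity 0."""
    return bytes([0x80 | channel, note, 0])

# Middle C (note 60) on channel 0, struck now and released 96 ticks later
track = vlq(0) + note_on(0, 60, 100) + vlq(96) + note_off(0, 60)
print(track.hex())  # 00903c6460803c00
```

Polyphony falls out of the same scheme: overlapping notes are simply interleaved note-on and note-off events with appropriate delta-times.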
Talking Condition Identification Using Second-Order Hidden Markov Models
Ismail Shahin
This work focuses on enhancing the performance of text-dependent and speaker-dependent talking condition identification systems using second-order hidden Markov models (HMM2s). Our results show that talking condition identification performance based on HMM2s improves significantly over first-order hidden Markov models (HMM1s). The talking conditions in this work are neutral, shouted, loud, angry, happy, and fearful.
Gate Activation Signal Analysis for Gated Recurrent Neural Networks and Its Correlation with Phoneme Boundaries
Yu-Hsuan Wang, Cheng-Tao Chung, Hung-yi Lee
In this paper, we analyze the gate activation signals inside gated recurrent neural networks and find that the temporal structure of these signals is highly correlated with phoneme boundaries. This correlation is further verified by a set of experiments on phoneme segmentation, in which better results were obtained than with standard approaches.
Guitar Solos as Networks
Stefano Ferretti
This paper presents an approach to model melodies (and music pieces in general) as networks. Notes of a melody can be seen as nodes of a network that are connected whenever they are played in sequence. This creates a directed graph. Using complex network theory, it is possible to extract some main metrics, typical of networks, that characterize the piece. Using this framework, we provide an analysis of a set of guitar solos performed by prominent musicians. The results of this study indicate that this model can have an impact on multimedia applications such as music classification, identification, and automatic music generation.
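The construction is simple enough to sketch directly (an illustrative toy, not the paper's code): each consecutive pair of notes becomes a directed edge, and repeated transitions accumulate weight, yielding a weighted directed graph on which network metrics such as node degree can be computed:

```python
from collections import defaultdict

def melody_graph(notes):
    """Weighted directed graph of a melody: one edge per consecutive
    pair of notes, with weight counting how often the transition occurs."""
    edges = defaultdict(int)
    for a, b in zip(notes, notes[1:]):
        edges[(a, b)] += 1
    return edges

def out_degree(edges, node):
    """Weighted out-degree: total transitions leaving the given note."""
    return sum(w for (a, _), w in edges.items() if a == node)

# A short made-up solo fragment for illustration
solo = ["E", "G", "A", "E", "G", "Bb", "A", "G", "E"]
g = melody_graph(solo)
print(g[("E", "G")])        # 2: the E -> G transition occurs twice
print(out_degree(g, "G"))   # 3: G moves on to A, Bb, and E once each
```

From here, standard network measures (degree distributions, clustering, path lengths) can be read off the edge dictionary or by loading it into a graph library.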
Getting Closer to the Essence of Music: The Con Espressione Manifesto
Gerhard Widmer
This text offers a personal and very subjective view on the current situation of Music Information Research (MIR). Motivated by the desire to build systems with a somewhat deeper understanding of music than the ones we currently have, I try to sketch a number of challenges for the next decade of MIR research, grouped around six simple truths about music that are probably generally agreed on, but often ignored in everyday research.
Deep Scattering Spectrum
Joakim Andén, Stéphane Mallat
A scattering transform defines a locally translation-invariant representation which is stable to time-warping deformations. It extends MFCC representations by computing modulation spectrum coefficients of multiple orders, through cascades of wavelet convolutions and modulus operators. Second-order scattering coefficients characterize transient phenomena such as attacks and amplitude modulation. A frequency-transposition-invariant representation is obtained by applying a scattering transform along log-frequency. State-of-the-art classification results are obtained for musical genre and phone classification on the GTZAN and TIMIT databases, respectively.
A Simple Method to Produce Algorithmic MIDI Music based on Randomness, Simple Probabilities and Multi-Threading
Yannis Tzitzikas
This paper introduces a simple method for producing multichannel MIDI music based on randomness and simple probabilities. One distinctive feature of the method is that it produces and sends to the sound card, in parallel, more than one unsynchronized channel by exploiting the multi-threading capabilities of general-purpose programming languages. As a consequence, the derived sound offers a quite "full" and "unpredictable" acoustic experience to the listener. Subsequently, the paper reports the results of an evaluation with users. The results were very surprising: the majority of users responded that they could tolerate this music on various occasions.
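The multi-threaded scheme can be sketched as follows. This is an illustrative toy, not the paper's implementation: each thread plays one unsynchronized channel, choosing notes or rests by simple probabilities; sending events to a real sound card is replaced here by appending to a shared list, and the rest probability and pitch range are assumptions for the example:

```python
import random
import threading

def channel_voice(channel, n_events, events, lock, rest_prob=0.25):
    """One unsynchronized 'channel': emit random notes or rests with
    simple probabilities. Output to a sound card is stubbed out as
    appends to a shared, lock-protected event list."""
    rng = random.Random(channel)  # per-thread RNG, seeded for a repeatable demo
    for _ in range(n_events):
        if rng.random() < rest_prob:
            event = (channel, None)                       # a rest
        else:
            event = (channel, rng.choice(range(60, 72)))  # a note in C4..B4
        with lock:
            events.append(event)

events, lock = [], threading.Lock()
threads = [threading.Thread(target=channel_voice, args=(ch, 8, events, lock))
           for ch in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(events))  # 32 events from 4 independent channels
```

Because the threads never synchronize, the interleaving of channels in `events` varies from run to run, which is precisely the source of the "unpredictable" texture the paper describes.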