We introduce Mix2Morph, a text-to-audio diffusion model fine-tuned to perform sound morphing without a dedicated dataset of morphs. By fine-tuning on noisy surrogate mixes at higher diffusion timesteps, Mix2Morph yields stable, perceptually coherent morphs that convincingly integrate qualities of both sources. We specifically target sound infusions, a practically and perceptually motivated subclass of morphing in which one sound acts as the dominant primary source, providing the overall temporal and structural behavior, while a secondary sound is infused throughout, enriching its timbral and textural qualities. Objective evaluations and listening tests show that Mix2Morph outperforms prior baselines and produces high-quality sound infusions across diverse categories, representing a step toward more controllable and concept-driven tools for sound design. Sound examples are available at https://anniejchu.github.io/mix2morph.
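As a rough illustration of the training idea, the sketch below builds a surrogate "morph" target by mixing two source spectrograms and applies the denoising loss only at high diffusion timesteps. The model interface, noise schedule, equal mixing weight, and timestep cutoff are assumptions for illustration, not the paper's actual settings.

```python
# Minimal sketch of the high-timestep surrogate-mix idea (not the authors'
# code): a plain mix of two sources stands in for a morph target, and the
# denoising loss is applied only at high noise levels, where a mix and a
# true morph are hard to distinguish. Cutoff and schedule are assumptions.
import torch

T = 1000                      # total diffusion steps (assumed)
T_MIN = 600                   # train only on t >= T_MIN (assumed cutoff)
betas = torch.linspace(1e-4, 0.02, T)
alphas_cum = torch.cumprod(1.0 - betas, dim=0)

def surrogate_mix_loss(model, spec_a, spec_b, cond):
    """One training step on a noisy surrogate mix of two sources."""
    x0 = 0.5 * (spec_a + spec_b)                  # surrogate "morph" target
    t = torch.randint(T_MIN, T, (x0.shape[0],))   # high timesteps only
    a_bar = alphas_cum[t].view(-1, 1, 1, 1)       # assumes (B, C, F, T) specs
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return torch.nn.functional.mse_loss(model(x_t, t, cond), noise)
```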
Andreas Jonas Fuglsig, Mads Græsbøll Christensen, Jesper Rindom Jensen
Robust spatial audio control relies on accurate acoustic propagation models, yet environmental variations, especially changes in the speed of sound, cause systematic mismatches that degrade performance. Existing methods either assume known sound speed, require multiple microphones, or rely on separate calibration, making them impractical for systems with minimal sensing. We propose an online sound speed estimator that operates during general multichannel audio playback and requires only a single observation microphone. The method exploits the structured effect of sound speed on the reproduced signal and estimates it by minimizing the mismatch between the measured audio and a parametric acoustic model. Simulations show accurate tracking of sound speed for diverse input signals and improved spatial control performance when the estimates are used to compensate propagation errors in a sound zone control framework.
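To make the estimation principle concrete, here is a minimal sketch of fitting the speed of sound by minimizing the mismatch between a measured microphone signal and a parametric propagation model. It uses a simple free-field delay model and a grid search; the paper's online estimator, geometry, and signal model are not reproduced here, so all values are illustrative.

```python
# Illustrative least-squares search for the speed of sound c: delay a known
# playback signal by r/c under a free-field model and pick the candidate c
# that best matches the microphone signal. Geometry and rates are assumed.
import numpy as np

fs, r = 48_000, 2.0                      # sample rate (Hz), source-mic distance (m)
rng = np.random.default_rng(0)
s = rng.standard_normal(fs)              # known playback signal (one channel)

def propagate(s, c):
    """Fractional-delay the source by r/c via the frequency domain."""
    S = np.fft.rfft(s)
    f = np.fft.rfftfreq(len(s), 1 / fs)
    return np.fft.irfft(S * np.exp(-2j * np.pi * f * r / c), len(s))

y = propagate(s, 346.0)                  # "measured" mic signal at the true c
cands = np.arange(330.0, 360.0, 0.1)
errs = [np.sum((y - propagate(s, c)) ** 2) for c in cands]
print("estimated c =", cands[int(np.argmin(errs))], "m/s")
```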
To address environmental noise pollution and promote the recycling of waste pomelo peel, X-ray diffraction (XRD), Fourier transform infrared spectroscopy (FT-IR), and scanning electron microscopy (SEM) were used to systematically characterize the microstructure and chemical composition of waste pomelo peel. Waste pomelo peel was found to have a porous network structure, which is conducive to good sound absorption performance. Waste pomelo peel/polycaprolactone (PCL) sound-absorbing composites were prepared by hot-press molding, and single-factor analysis was adopted to explore the effects of seven factors (waste pomelo peel mass fraction, composite density, composite thickness, hot-pressing time, hot-pressing pressure, hot-pressing temperature, and thickness of the rear air layer) on sound absorption performance. After process optimization, under the optimal conditions the average sound absorption coefficient (SAC) of the composites reaches 0.54, the noise reduction coefficient (NRC) reaches 0.57, and the maximum SAC reaches 0.99, with the sound absorption performance reaching Grade III. This study not only provides a new approach for preparing porous sound-absorbing composites but also opens a new path for the high-value utilization of waste pomelo peel resources.
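For reference, the noise reduction coefficient quoted above is conventionally computed from the sound absorption coefficients at 250, 500, 1000, and 2000 Hz, averaged and rounded to the nearest 0.05; the sketch below shows that arithmetic with placeholder values, not the paper's measured data.

```python
# NRC from four-band SACs (standard definition; values are placeholders).
sac = {250: 0.35, 500: 0.55, 1000: 0.65, 2000: 0.70}   # illustrative SACs
nrc = round(sum(sac.values()) / len(sac) / 0.05) * 0.05  # round to nearest 0.05
print(f"NRC = {nrc:.2f}")
```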
Federico Verdini, Nicolò Desogus, Emanuela Calcio Gaudino
et al.
Salicylic acid (SA), widely used in dermatological pharmaceuticals and as a primary metabolite of aspirin, is frequently detected in wastewater treatment plants (WWTPs) at concentrations between 3 and 900 mg/L. Given the ECHA’s predicted no-effect concentration (PNEC) of 162 mg/L for sewage treatment plants (STPs), high SA levels pose a risk to microbial communities responsible for biological oxidation. This study explores an innovative hybrid advanced oxidation process (AOP) that combines hydrodynamic cavitation (HC) and electrical discharge (ED) plasma for SA degradation, targeting environmentally friendly pretreatment before secondary biological treatment. Experiments were conducted at the pilot scale (5 L volume, 330 L/h flow rate) using model solutions with SA concentrations of 40 and 80 mg/L. While SA is typically used as a chemical dosimeter for hydroxyl radical (•OH) quantification in cavitational processes, its rapid degradation under HC/ED plasma prevents such use in this setup. For comparison, lab-scale ultrasound (US)-assisted treatments (500 kHz, 250 W, 0.1 L) were performed as a benchmark. HC/ED plasma treatment achieved over 98 % SA removal in both deionized and tap water within 20 min, whereas US treatment achieved only 54 % and 37 % removal after 120 min. Energy yield and electrical energy per order (EEO) analyses revealed that HC/ED is significantly more energy-efficient than US, particularly in tap water. Finally, the HC/ED method was evaluated through a Technology Readiness Level (TRL) assessment to estimate its scalability and applicability in real WWTP scenarios.
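The electrical energy per order figure of merit used in the comparison has a standard batch form: the energy required to reduce the pollutant concentration by one order of magnitude per unit volume. The sketch below writes it out with placeholder operating values, not the reported ones.

```python
# EEO for a batch process (standard form): kWh consumed per m^3 of water
# per order-of-magnitude reduction in concentration. Inputs are placeholders.
import math

def eeo_batch(power_kw, time_h, volume_m3, c0, ct):
    """EEO in kWh / m^3 / order for a batch reactor."""
    return power_kw * time_h / (volume_m3 * math.log10(c0 / ct))

# e.g. 1 kW for 20 min on 5 L, 80 -> 1.6 mg/L (98% removal)
print(eeo_batch(power_kw=1.0, time_h=20 / 60, volume_m3=0.005, c0=80, ct=1.6))
```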
To improve broadband transmission performance of mid-frequency transducers, the finite element method was applied to optimize and analyze the performance of a multi-driven Tonpilz transducer, focusing on excitation methods and structural design. A multi-driven Tonpilz transducer with a radiating head containing concentric-ring cavities was proposed, in which multi-cavity design reduces effective mass and enhances bandwidth. The performance under three excitation methods was evaluated by comparing admittance curves in air and analyzing resonant modes, revealing the effects of excitation methods on electromechanical coupling and resonant peaks. Partial excitation was identified as the optimal approach. The impedance characteristics, vibrational modes, and acoustic directivity in water were further investigated. The effects of the mass block, radiating head containing concentric-ring cavities, and front cover on transmitting voltage response fluctuations were analyzed. After optimization, the transducer operates over 17 kHz–41 kHz, with a −3 dB bandwidth of 24 kHz and a maximum transmitting voltage response of 143.7 dB re 1 μPa/V @ 1 m, providing a foundation for applications in related fields.
Cian F. Twomey, Gabriele Biagi, Albert A. Ruth
et al.
We report an all-fiber laser gas analyzer (LGA) based on quartz-enhanced photoacoustic spectroscopy (QEPAS) that exploits strong evanescent wave (EW) enhancement using a dielectric coating on side-polished fiber. The dielectric coating increases the fraction of the evanescent field in air, significantly amplifying light–gas interaction within the polished region. A single-mode fiber with a 17 mm polished section passes through two millimeter-scale resonator tubes and a custom quartz tuning fork (QTF) with 0.8 mm prong spacing. The optimized EW coupling efficiently generates photoacoustic waves that excite the QTF’s fundamental flexural mode. Methane–nitrogen mixtures at 800 mbar were used to evaluate performance, achieving a detection limit of 2.5 ppmv for CH4 with 300 ms integration time. By enhancing the evanescent interaction within a compact, robust fiber geometry, this EW-QEPAS sensor eliminates free-space optics and offers a miniaturized, field-deployable solution for gas detection in industrial and agricultural environments.
Holly Bradley, Madeleine E. Yu, Elizabeth K. Johnson
Voice assistant (VA) technology is increasingly part of children's everyday lives. But how well do these systems understand children? No study has addressed this question with children under 5 years old. Here, two versions of Siri and one of Alexa were tested on their ability to transcribe utterances produced by 2-, 3-, and 5-year-olds. Human listeners (mothers and undergraduates) were also tested. Results showed that while Siri's performance on children's speech has improved in recent years, even the newest Siri and Alexa models struggle with children's speech. Human listeners far outperformed the VA systems at all ages, especially with the youngest children's speech.
Automatic recognition of insect sound could help us understand changing biodiversity trends around the world -- but insect sounds are challenging to recognize even for deep learning. We present a new dataset comprising 26,399 audio files from 459 species of Orthoptera and Cicadidae. It is the first large-scale dataset of insect sound that is easily applicable to developing novel deep-learning methods. Its recordings were made with a variety of audio recorders at varying sample rates to capture the extremely broad range of frequencies that insects produce. We benchmark performance with two state-of-the-art deep learning classifiers, demonstrating good performance but also significant room for improvement in acoustic insect classification. This dataset can serve as a realistic test case for implementing insect monitoring workflows, and as a challenging basis for the development of audio representation methods that can handle highly variable frequencies and/or sample rates.
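A practical consequence of the varying sample rates is that a training pipeline must normalize them; below is a minimal, hypothetical loader sketch (paths, target rate, and feature choice are assumptions, not part of the dataset's tooling).

```python
# Sketch of a loader for recordings with heterogeneous sample rates: files
# are resampled to one assumed training rate before feature extraction.
import librosa

TARGET_SR = 44_100   # assumed common rate; broadband insect calls may need more

def load_clip(path):
    y, sr = librosa.load(path, sr=None)            # keep the native rate
    if sr != TARGET_SR:
        y = librosa.resample(y, orig_sr=sr, target_sr=TARGET_SR)
    mel = librosa.feature.melspectrogram(y=y, sr=TARGET_SR, n_mels=128)
    return librosa.power_to_db(mel)                # log-mel features
```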
A learning-based method for estimating the magnitude distribution of sound fields from spatially sparse measurements is proposed. Estimating the magnitude distribution of the acoustic transfer function (ATF) is useful when phase measurements are unreliable or inaccessible, and it has a wide range of applications related to spatial audio. We propose a neural-network-based method for ATF magnitude estimation. The key features of our network architecture are the input and output layers conditioned on source and receiver positions and frequency, and the aggregation module of latent variables, which can be interpreted as an autoencoder-based extension of the basis expansion of the sound field. Numerical simulation results indicate that our proposed method accurately estimates the ATF magnitude from a small number of receivers.
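The description suggests a conditional autoencoder shape; the sketch below is one plausible minimal realization, not the authors' exact network: each sparse measurement is embedded conditioned on positions and frequency, latents are aggregated across measurements, and a decoder predicts the magnitude at arbitrary query positions and frequencies.

```python
# Neural-field-style sketch of the described architecture (shapes assumed).
import torch
import torch.nn as nn

class ATFMagNet(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        # per-measurement input: src xyz + rcv xyz + freq + log-magnitude = 8
        self.enc = nn.Sequential(nn.Linear(8, d), nn.ReLU(), nn.Linear(d, d))
        # query: src xyz + rcv xyz + freq = 7, concatenated with the latent
        self.dec = nn.Sequential(nn.Linear(7 + d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, meas, query):
        z = self.enc(meas).mean(dim=1)                 # aggregate over measurements
        z = z.unsqueeze(1).expand(-1, query.shape[1], -1)
        return self.dec(torch.cat([query, z], dim=-1))

net = ATFMagNet()
mags = net(torch.randn(4, 10, 8), torch.randn(4, 64, 7))  # -> (4, 64, 1)
```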
Surround sound systems commonly distribute loudspeakers along standardized layouts for multichannel audio reproduction. However, in less controlled environments, practical layouts vary in loudspeaker quantity, placement, and listening locations/areas. Deviations from standard layouts introduce sound-field errors that degrade the acoustic timbre, imaging, and clarity of reproduced audio content. This work introduces both Bayesian loudspeaker normalization and content panning optimization methods for sound-field correction. Conjugate prior distributions over loudspeaker-listener directions update estimated layouts for non-stationary listening locations; digital filters adapt loudspeaker acoustic responses to a common reference target at the estimated listening area without acoustic measurements. Frequency-domain panning coefficients are then optimized via sensitivity/efficiency objectives subject to spatial, electrical, and acoustic domain constraints; normalized and panned loudspeakers form virtual loudspeakers in standardized layouts for accurate multichannel reproduction. Experiments investigate the robustness of the Bayesian adaptation and panning optimizations in practical applications.
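As a toy illustration of the conjugate-prior idea for loudspeaker-listener directions, the sketch below runs a Gaussian (known-variance) posterior update on a direction angle as noisy observations arrive; the Gaussian model and all numbers are assumptions, not the paper's formulation.

```python
# Conjugate Gaussian update of a direction estimate (illustrative only).
mu, var = 30.0, 25.0        # prior direction (deg) and variance (assumed)
obs_var = 4.0               # per-observation noise variance (assumed)

for theta_obs in [27.5, 28.9, 28.1]:        # noisy direction observations
    var_new = 1.0 / (1.0 / var + 1.0 / obs_var)   # precisions add
    mu = var_new * (mu / var + theta_obs / obs_var)
    var = var_new

print(f"posterior direction: {mu:.1f} deg (var {var:.2f})")
```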
Yuki Okamoto, Ryotaro Nagase, Minami Okamoto
et al.
Several datasets that describe the content and order of occurrence of sounds have been released for conversion between environmental sound and text. However, very few texts include information on the impressions humans feel, such as "sharp" and "gorgeous," when they hear environmental sounds. In this study, we constructed a dataset of impression captions for environmental sounds that describe the impressions humans have when hearing these sounds. We used ChatGPT to generate impression captions and had humans select the most appropriate captions for each sound. Our dataset consists of 3,600 impression captions for environmental sounds. To evaluate the appropriateness of the impression captions, we conducted subjective and objective evaluations. The evaluation results indicate that appropriate impression captions for environmental sounds can be generated.
Daniel Aleksander Krause, Archontis Politis, Annamaria Mesaros
Sound Event Localization and Detection (SELD) is a combined task of identifying sound events and their corresponding direction-of-arrival (DOA). While this task has numerous applications and has been extensively researched in recent years, it fails to provide full information about the sound source position. In this paper, we overcome this problem by extending the task to Sound Event Detection and Localization with Distance Estimation (3D SELD). We study two ways of integrating distance estimation within the SELD core: a multi-task approach, in which the problem is tackled by a separate model output, and a single-task approach obtained by extending the multi-ACCDOA method to include distance information. We investigate both methods for the Ambisonic and binaural versions of STARSS23: Sony-TAU Realistic Spatial Soundscapes 2023. Moreover, our study involves experiments on the loss function related to the distance estimation part. Our results show that it is possible to perform 3D SELD without any degradation of performance in sound event detection and DOA estimation.
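To illustrate the single-task variant, the sketch below decodes a hypothetical ACCDOA-style output extended with a distance component: per class, the norm of a 3-vector encodes activity, its direction encodes DOA, and a fourth value carries distance. Shapes and the activity threshold are illustrative assumptions, not the paper's exact parametrization.

```python
# Decode an (n_classes, 4) ACCDOA-with-distance output (illustrative).
import numpy as np

def decode_accdoa_dist(out, thr=0.5):
    """out: (n_classes, 4) -> list of (class, unit DOA vector, distance)."""
    events = []
    for c, v in enumerate(out):
        xyz, dist = v[:3], v[3]
        act = np.linalg.norm(xyz)          # vector norm doubles as activity
        if act > thr:                      # read distance only when active
            events.append((c, xyz / act, float(dist)))
    return events

print(decode_accdoa_dist(np.array([[0.6, 0.0, 0.5, 2.3],
                                   [0.1, 0.0, 0.1, 9.9]])))
```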
Different machines can exhibit diverse frequency patterns in their emitted sound. This feature has recently been explored in anomalous sound detection and has reached state-of-the-art performance. However, existing methods rely on manual or empirical determination of the frequency filter by observing the effective frequency range in the training data, which may be impractical for general application. This paper proposes an anomalous sound detection method using self-attention-based frequency pattern analysis and spectral-temporal information fusion. Our experiments demonstrate that the self-attention module automatically and adaptively analyses the effective frequencies of a machine sound and enhances that information in the spectral feature representation. With spectral-temporal information fusion, the obtained audio feature ultimately improves anomaly detection performance on the DCASE 2020 Challenge Task 2 dataset.
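A minimal sketch of the frequency-attention idea (not the paper's exact model) is given below: frequency bins of a spectrogram are treated as attention tokens, so the learned weights can emphasize machine-specific bands before spectral-temporal fusion.

```python
# Self-attention over frequency bins of a spectrogram (illustrative shapes).
import torch
import torch.nn as nn

class FreqAttention(nn.Module):
    def __init__(self, n_frames=128, d=64, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(n_frames, d)          # each frequency bin -> token
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, spec):                        # spec: (B, F, T)
        tok = self.proj(spec)                       # (B, F, d)
        out, w = self.attn(tok, tok, tok)           # reweighted frequency tokens
        return out, w                               # w: learned frequency emphasis

feat, weights = FreqAttention()(torch.randn(2, 513, 128))
```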
Environmental sound scene and sound event recognition is important for the recognition of suspicious events in indoor and outdoor environments (such as nurseries, smart homes, nursing homes, etc.) and is a fundamental task involved in many audio surveillance applications. In particular, there is no public common data set for the research field of sound event recognition for the data set of the indoor environmental sound scene. Therefore, this paper proposes a data set (called as AGS) for the home environment sound. This data set considers various types of overlapping audio in the scene, background noise. Moreover, based on the proposed data set, this paper compares and analyzes the advanced methods for sound event recognition, and then illustrates the reliability of the data set proposed in this paper, and studies the challenges raised by the new data set. Our proposed AGS and the source code of the corresponding baselines at https://github.com/taolunzu11/AGS .
Yi-Han Lin, Xunquan Chen, Ryoichi Takashima
et al.
This paper introduces a zero-shot sound event classification (ZS-SEC) method to identify sound events that have never occurred in training data. In our previous work, we proposed a ZS-SEC method using sound attribute vectors (SAVs), where a deep neural network model infers attribute information that describes the sound of an event class instead of inferring its class label directly. Our previous method showed that it could classify unseen events to some extent; however, the accuracy for unseen events was far inferior to that for seen events. In this paper, we propose a new ZS-SEC method that can learn discriminative global features and local features simultaneously to enhance SAV-based ZS-SEC. In the proposed method, while the global features are learned in order to discriminate the event classes in the training data, the spectro-temporal local features are learned in order to regress the attribute information using attribute prototypes. The experimental results show that our proposed method can improve the accuracy of SAV-based ZS-SEC and can visualize the region in the spectrogram related to each attribute.
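The general SAV mechanism can be summarized in a few lines: regress an attribute vector from the audio and assign the class, seen or unseen, whose attribute prototype is nearest. The sketch below illustrates only this decision rule, with made-up attributes; it is not the authors' pipeline.

```python
# Attribute-based zero-shot decision rule (illustrative prototypes).
import numpy as np

def zs_classify(pred_sav, class_savs):
    """pred_sav: (d,); class_savs: {label: (d,)} incl. unseen classes."""
    return min(class_savs, key=lambda c: np.linalg.norm(pred_sav - class_savs[c]))

protos = {"siren": np.array([0.9, 0.1, 0.8]),      # made-up attribute values,
          "dog_bark": np.array([0.2, 0.9, 0.3])}   # e.g. pitch/roughness/...
print(zs_classify(np.array([0.8, 0.2, 0.7]), protos))   # -> "siren"
```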
Jincheng Sun, Kennosuke Higashi, Sergey Romankov
et al.
Although the ultrasonic treatment of molten aluminum has been studied for a long period, much about this process remains to be revealed. Many studies have focused on acoustic cavitation and streaming under the horn tip and their effects on treatment efficiency. However, to the best of our knowledge, no attempt has been made to explain the phenomena occurring near or on the melt free surface. Thus, the goal of this study is to investigate the phenomena occurring at the free surface during ultrasound irradiation and clarify their possible influence on ultrasound treatment performance. The results of high-temperature and water-model experiments reveal that ultrasound irradiation significantly promotes the formation of alumina particles on the melt free surface around the sonotrode, and that part of these particles can be entrained into the aluminum melt. Furthermore, TEM observations suggest that the entrained alumina inclusions can serve as nucleation sites for primary Al3Zr compounds. Most importantly, the oxidation and entrainment of particles from the free surface are likely controllable via the immersion depth of the sonotrode in the molten aluminum.
Acoustic-resolution photoacoustic microscopy (AR-PAM) image resolution is determined by the point spread function (PSF) of the imaging system. Previous algorithms, including Richardson–Lucy (R–L) deconvolution and model-based (MB) deconvolution, improve spatial resolution by taking advantage of the PSF as prior knowledge. However, these methods suffer from inaccurate deconvolution, in which the deconvolved feature size is inconsistent with the original one (e.g., the former can be smaller than the latter). We present a novel deep convolutional neural network (CNN)-based algorithm featuring high-fidelity recovery of multiscale feature size to improve the lateral resolution of AR-PAM. The CNN is trained with simulated image pairs of line patterns, which mimic blood vessels. To investigate suitable CNN model structures and elaborate on the effectiveness of CNN methods compared with non-learning methods, we select five different CNN models, while R–L and directional MB methods are also applied for comparison. Besides simulated data, experimental data including tungsten wires, leaf veins, and in vivo blood vessels are also evaluated. A custom-defined metric of relative size error (RSE) is used to quantify the multiscale feature recovery ability of the different methods. Compared to the other methods, the enhanced deep super-resolution (EDSR) network and the residual-in-residual dense block network (RRDBNet) show better recovery in terms of RSE for tungsten wires with diameters ranging from 30 μm to 120 μm. Moreover, AR-PAM images of leaf veins are tested to demonstrate the effectiveness of the optimized CNN methods (EDSR and RRDBNet) for complex patterns. Finally, in vivo images of mouse ear blood vessels and rat ear blood vessels are acquired and then deconvolved, and the results show that the proposed CNN method (notably RRDBNet) enables accurate deconvolution of multiscale feature size and thus good fidelity.
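Reading the metric name literally, RSE compares the recovered feature width against the ground-truth width; the paper's exact definition may differ, so the sketch below is an assumption.

```python
# Relative size error under the natural reading of the name (assumed form).
def relative_size_error(measured_um: float, true_um: float) -> float:
    return abs(measured_um - true_um) / true_um

# e.g. a 30 um tungsten wire recovered at 27 um -> RSE = 0.10
print(relative_size_error(27.0, 30.0))
```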
Graphene has extremely high electron mobility, good thermal conductivity, and excellent acoustic properties, making it an ideal acoustic material. Currently, three main types of graphene are used in acoustic devices: graphene films prepared by chemical vapor deposition (CVD), vertical graphene (VGr), and laser-induced graphene (LIG). Among them, graphene prepared by CVD is lighter, thinner, and more sensitive; VGr, grown by the microwave plasma chemical vapor deposition (MPCVD) method, has higher ductility and sound pressure level (SPL); LIG has the advantages of low cost, simple preparation, and no biological toxicity. This paper introduces the applications of these three materials in acoustic devices. In addition to simple acoustic-electrical conversion applications, an artificial throat with an intelligent recognition function is introduced. These different types of graphene acoustic devices have greatly promoted the application of graphene in the field of acoustics.
To extend the shelf life and retain bioactive proteins in milk, this study utilized microfiltration (MF) combined with ultrasonication to treat skim milk and investigated its efficiency in removing bacteria and retaining bioactive proteins compared with HTST pasteurization and microfiltration alone. Results showed that microfiltration combined with ultrasonication at 1296 J/mL could completely remove the bacteria in skim milk. Ultrasonication further extended the shelf life (4 °C) of microfiltered skim milk, which could reach at least 40 days when MF was combined with >1296 J/mL ultrasonication. In addition, ELISA showed that HTST pasteurization significantly decreased the levels of IgG by ~30%, IgA by ~50%, IgM by ~60%, and lactoferrin by ~40%, and the activity of the enzymes lactoperoxidase and xanthine oxidase was also decreased by ~20%. Compared with HTST, MF alone or combined with ultrasonication retained these bioactive proteins to a larger degree. On the other hand, proteomics indicated damage to both casein micelle and fat globule structures in milk when ultrasonication at >1296 J/mL was applied, as shown by increases in caseins and milk fat globular proteins. Simultaneously, this ultrasound intensity also decreased the levels of bioactive proteins, such as complement factors. Taken together, this study provides new insights that may help implement this novel combination of non-thermal technologies in the dairy industry, aimed at improving milk quality and functionality.
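The J/mL dose cited above is an energy density: acoustic power times treatment time per unit volume. The sketch below shows the arithmetic; the power, time, and volume values are assumptions that merely reproduce a 1296 J/mL dose, not the study's settings.

```python
# Ultrasonic energy density (standard form); inputs are placeholders.
def energy_density_j_per_ml(power_w: float, time_s: float, volume_ml: float) -> float:
    return power_w * time_s / volume_ml

# e.g. 400 W applied for 162 s to 50 mL gives 1296 J/mL
print(energy_density_j_per_ml(400, 162, 50))
```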