Gaëtan Hadjeres, Marc Ferras, Khaled Koutini
et al.
The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture and training process and evaluating it against other popular open models. The release is optimized for sound effects and provides (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module compared to existing open alternatives such as StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
This study systematically evaluates the efficacy of three emulsification strategies (ultrasonication, high-pressure homogenization, and high-speed shearing) in fabricating artificial oleosomes (AOs) stabilized by hemp seed oil body proteins (OBPs) and rose essential oil. Comparative analyses revealed that ultrasonication generated AOs with superior physicochemical attributes, including a smaller mean particle size, higher ζ-potential, and greater surface hydrophobicity than those obtained by the conventional methods. These structural enhancements correlated with improved functional performance: ultrasonicated AOs exhibited lower apparent viscosity, higher radical scavenging capacity, and reduced light transmittance. Coatings derived from ultrasonicated AOs demonstrated exceptional barrier properties, achieving better inhibition of water vapor permeability and greater reduction of oxygen transmission relative to the other coatings. Practical validation on bananas revealed that ultrasonicated AO coatings extended shelf life through suppression of enzymatic browning, moisture retention, and maintenance of soluble solids content. The findings establish ultrasonication as a scalable strategy for engineering multifunctional AOs with applications in sustainable food preservation.
One of the central challenges in soft matter mechanics is to achieve reversible and programmable modulation of viscoelasticity in polymer-based elastomers at small strains, which is crucial for precision engineering and advanced functional devices. Conventional approaches are constrained by irreversibility and lack of dynamic control. In this study, it is demonstrated that ultrasonic vibration (19–22 kHz) enables dynamic, reversible, and tunable modulation of the mechanical response in such materials. Uniaxial compression experiments combined with constitutive and inverse modeling reveal a reversible transition from viscoelastic, dissipative behavior to an elastic-dominated, stable state. The standard linear solid (SLS) model links macroscopic mechanical changes to molecular-level dynamics, such as chain alignment and mobility. Experimentally, ultrasonic vibration suppresses viscoelastic relaxation and energy dissipation, induces negative hysteresis, and enables tunable, reversible hardening, all strongly dependent on vibration frequency and power. Quantitatively, a typical 20% increase in the instantaneous elastic modulus and over 80% reduction in the delayed elastic modulus and viscosity are achieved under ultrasonic vibration. These results clarify the mechanism by which ultrasonic vibration regulates viscoelasticity and provide practical guidance for designing adaptive polymer systems in applications such as ultrasonic-assisted polishing, soft robotics, and flexible electronics.
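For readers who want the constitutive form behind those moduli, a common Kelvin-type parameterization of the standard linear solid is sketched below; the abstract does not state which representation the authors use, so the symbols here (instantaneous modulus E_0, delayed elastic modulus E_1, viscosity eta_1) are an illustrative assumption consistent with the quantities they report.

```latex
% Standard linear solid, Kelvin representation: spring E_0 in series with
% a Kelvin--Voigt element (spring E_1 in parallel with dashpot \eta_1).
\sigma + \frac{\eta_1}{E_0 + E_1}\,\dot{\sigma}
  = \frac{E_0 E_1}{E_0 + E_1}\,\varepsilon
  + \frac{E_0\,\eta_1}{E_0 + E_1}\,\dot{\varepsilon},
\qquad
E_{\mathrm{inst}} = E_0,\qquad
E_{\mathrm{relaxed}} = \frac{E_0 E_1}{E_0 + E_1}.
```

In this reading, the reported roughly 20% stiffening under vibration would correspond to an increase in E_0, and the over-80% reductions to decreases in E_1 and eta_1.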
Ring-array photoacoustic tomography (PAT) systems have shown great promise in non-invasive biomedical imaging. However, images produced by these systems often suffer from quality degradation due to non-ideal imaging conditions, with common issues including blurring and streak artifacts. To address these challenges, we propose an image restoration method based on a conditional generative adversarial network (CGAN) framework. Our approach integrates a hybrid spatial and channel attention mechanism within a Residual Shifted Window Transformer Module (RSTM) to enhance the generator's performance. Additionally, we develop a comprehensive loss function to balance pixel-level accuracy, detail preservation, and perceptual quality, and we incorporate a gamma correction module to enhance the contrast of the network's output. Experimental results on both simulated and in vivo data demonstrate that our method significantly improves resolution and restores overall image quality.
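The gamma correction step mentioned above is a standard power-law intensity mapping; a minimal sketch follows, with the gamma value and normalization chosen for illustration (the abstract does not specify the module's exact parameterization).

```python
import numpy as np

def gamma_correct(image: np.ndarray, gamma: float = 0.8) -> np.ndarray:
    """Apply gamma correction to enhance contrast of a reconstructed image.

    gamma < 1 expands low-intensity structure, gamma > 1 compresses it.
    The input is normalized to [0, 1] before the power law is applied.
    """
    img = image.astype(np.float64)
    lo, hi = img.min(), img.max()
    img = (img - lo) / (hi - lo + 1e-12)   # normalize to [0, 1]
    return np.power(img, gamma)            # element-wise power-law mapping
```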
Chaimae Slimani, Mouhcine Fadil, Chaimae Rais
et al.
There is increasing evidence of protective health effects of natural antioxidants, such as those contained in saffron (Crocus sativus L.), but polyphenols, the purported antioxidants, are challenging to extract with conventional techniques. This study developed an ultrasound-assisted extraction (UAE) technique to improve the efficiency of phenolic compound extraction and assessed the antioxidant activity of saffron. A response surface methodology, utilizing a Box-Behnken design, was employed to optimize three key extraction parameters, solid-to-liquid ratio, temperature, and extraction time, for both saffron stigmas and corms. The total phenolic content and antioxidant activity were analyzed using the Folin-Ciocalteu and 2,2-diphenyl-1-picrylhydrazyl (DPPH) assays, respectively. The optimal extraction conditions for stigmas were a solid/liquid ratio of 40 mg/30 mL at an extraction temperature of 50 °C for 15 min, yielding a total phenolic content of 118.55 mg GAE/g DM and an IC50 value of 0.023 mg/mL. For corms, the optimal extraction time was extended to 45 min, yielding a total phenolic content of 21.18 mg GAE/g DM and an IC50 value of 1.02 mg/mL. Results were consistent with the values predicted by the quadratic model, confirming the efficacy of the Box-Behnken design in maximizing polyphenol extraction from C. sativus. Although UAE has been refined at the laboratory level, additional research is needed to assess its feasibility and efficiency on an industrial scale. Exploring large-scale UAE systems, evaluating cost-effectiveness, and identifying potential adaptations for commercial production would be highly beneficial.
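For readers unfamiliar with response surface methodology, the sketch below fits the second-order (quadratic) model underlying a three-factor Box-Behnken design; the coded design matrix matches the abstract's three factors, but the response values are placeholders, not the study's measurements.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Coded factor levels (-1, 0, +1) for the three factors in the abstract:
# solid/liquid ratio, temperature, extraction time. 15-run Box-Behnken design.
X = np.array([
    [-1, -1, 0], [1, -1, 0], [-1, 1, 0], [1, 1, 0],
    [-1, 0, -1], [1, 0, -1], [-1, 0, 1], [1, 0, 1],
    [0, -1, -1], [0, 1, -1], [0, -1, 1], [0, 1, 1],
    [0, 0, 0], [0, 0, 0], [0, 0, 0],
])
# Placeholder total phenolic content responses (mg GAE/g DM), one per run.
y = np.array([95.1, 102.3, 98.7, 110.4, 90.2, 99.8, 105.6, 112.1,
              93.4, 101.9, 97.2, 108.8, 115.0, 114.2, 115.7])

# Full quadratic model: linear, interaction, and squared terms.
quad = PolynomialFeatures(degree=2, include_bias=False)
model = LinearRegression().fit(quad.fit_transform(X), y)

# Predicted response at a candidate point in coded units (here the centre point).
print(model.predict(quad.transform([[0, 0, 0]])))
```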
Recent progress in auditory intelligence has yielded high-performing systems for sound event detection (SED), acoustic scene classification (ASC), automated audio captioning (AAC), and audio question answering (AQA). Yet these tasks remain largely constrained to surface-level recognition: they capture what happened, but not why, what it implies, or how it unfolds in context. I propose a conceptual reframing of auditory intelligence as a layered, situated process that encompasses perception, reasoning, and interaction. To instantiate this view, I introduce four cognitively inspired task paradigms (ASPIRE, SODA, AUX, and AUGMENT) that structure auditory understanding across time-frequency pattern captioning, hierarchical event/scene description, causal explanation, and goal-driven interpretation, respectively. Together, these paradigms provide a roadmap toward more generalizable, explainable, and human-aligned auditory intelligence, and are intended to catalyze a broader discussion of what it means for machines to understand sound.
Anomalous sound detection (ASD) encounters difficulties with domain shift, where the sounds of machines in target domains differ significantly from those in source domains due to varying operating conditions. Existing methods typically employ domain classifiers to enhance detection performance, but they often overlook the influence of domain-unrelated information. This oversight can hinder the model's ability to clearly distinguish between domains, thereby weakening its capacity to differentiate normal from abnormal sounds. In this paper, we propose a Gradient Reversal-based Hierarchical feature Disentanglement (GRHD) method to address the above challenge. GRHD uses gradient reversal to separate domain-related features from domain-unrelated ones, resulting in more robust feature representations. Additionally, the method employs a hierarchical structure to guide the learning of fine-grained, domain-specific features by leveraging available metadata, such as section IDs and machine sound attributes. Experimental results on the DCASE 2022 Challenge Task 2 dataset demonstrate that the proposed method significantly improves ASD performance under domain shift.
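The gradient reversal operation at the heart of this approach is a standard construction from domain-adversarial training; a minimal PyTorch sketch is given below (the generic layer only, not the authors' full hierarchical disentanglement architecture).

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor,
        # so the shared features become uninformative for the adversarial head.
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)
```

Features passed through grad_reverse before a domain classifier are driven to carry little domain information, which is the kind of mechanism the abstract refers to for separating domain-related from domain-unrelated features.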
In unbounded media, the acoustic attenuation as a function of frequency is related to the frequency-dependent sound velocity (dispersion) via Kramers–Kronig dispersion relations. These relations are fundamentally important for a better understanding of the nature of attenuation and dispersion, and as a tool in physical acoustics measurements, where they can be used for control purposes. However, physical acoustics measurements are frequently carried out not in unbounded media but in acoustic waveguides, e.g., inside liquid-filled pipes. Surface acoustic waves are also often used for physical acoustics measurements. In the present work, the applicability of Kramers–Kronig relations to guided and surface waves is investigated using an approach based on the theory of functions of complex variables. It is demonstrated that Kramers–Kronig relations have limited applicability to guided and surface waves. In particular, they are not applicable to waves propagating in waveguides characterised by the possibility of wave energy leakage from the waveguide into the surrounding medium. For waveguides without leakage, e.g., those formed by rigid walls, Kramers–Kronig relations remain valid for both ideal and viscous liquids. Examples of numerical calculations of wave dispersion and attenuation using Kramers–Kronig relations, where applicable, are presented for unbounded media and for waveguides formed by two rigid walls.
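As a concrete illustration of how such a relation is used in the unbounded-medium case, the sketch below evaluates a commonly used nearly local form, which estimates the change in phase slowness 1/c(omega) from an integral of the attenuation coefficient; the power-law attenuation and constants are illustrative, not data from this work.

```python
import numpy as np

def kk_slowness_change(omega, alpha, omega0_index=0):
    """Nearly local Kramers-Kronig estimate of 1/c(w) - 1/c(w0) from the
    attenuation coefficient alpha(w) in an unbounded medium:
        1/c(w) - 1/c(w0) ~= (2/pi) * integral_{w0}^{w} alpha(w') / w'^2 dw'
    """
    integrand = alpha / omega**2
    # cumulative trapezoidal integral from omega[0] up to each omega
    cum = np.concatenate(([0.0], np.cumsum(
        0.5 * (integrand[1:] + integrand[:-1]) * np.diff(omega))))
    return (2.0 / np.pi) * (cum - cum[omega0_index])

# Illustrative power-law attenuation alpha = a * w^2 (classical viscous loss).
omega = np.linspace(2 * np.pi * 1e6, 2 * np.pi * 10e6, 500)   # rad/s
alpha = 2.5e-14 * omega**2                                     # Np/m, made-up constant
d_slowness = kk_slowness_change(omega, alpha)                  # s/m, relative to omega[0]
```

For this quadratic attenuation law the result reduces to (2a/pi)(omega - omega0), i.e. a linear change in slowness with frequency, which the numerical integral reproduces.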
Sound codecs play dual roles, minimizing data transmission latency and serving as tokenizers, which underscores their critical importance. Recent years have witnessed significant developments in codec models. The ideal sound codec should preserve content, paralinguistics, speaker identity, and general audio information. However, the question of which codec achieves optimal sound information preservation remains unanswered, because different papers evaluate models under their own experimental settings. This study introduces Codec-SUPERB, an acronym for Codec sound processing Universal PERformance Benchmark. It is an ecosystem designed to assess codec models across representative sound applications and signal-level metrics rooted in sound domain knowledge. Codec-SUPERB simplifies result sharing through an online leaderboard, promoting collaboration within a community-driven benchmark database, thereby stimulating new development cycles for codecs. Furthermore, we undertake an in-depth analysis to offer insights into codec models from both application and signal perspectives, diverging from previous codec papers that mainly concentrate on signal-level comparisons. Finally, we will release the code, leaderboard, and data to accelerate progress within the community.
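As an example of the kind of signal-level comparison such a benchmark relies on (not necessarily one of Codec-SUPERB's own metrics), a per-utterance signal-to-noise ratio between the original and codec-resynthesized waveforms can be computed as follows.

```python
import numpy as np

def resynthesis_snr(reference: np.ndarray, decoded: np.ndarray) -> float:
    """Signal-to-noise ratio (dB) of a codec reconstruction against the original waveform."""
    n = min(len(reference), len(decoded))        # codecs may pad or trim a few samples
    ref, dec = reference[:n], decoded[:n]
    noise = ref - dec
    return 10.0 * np.log10(np.sum(ref**2) / (np.sum(noise**2) + 1e-12))
```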
Adam Pluta, Zbigniew Pioch, Jędrzej Kardach
et al.
Introduction: We present a screening method for early dementia using features based on sound objects as voice biomarkers. Methods: The final dataset used for the machine learning models consisted of 266 observations: 186 healthy individuals, 46 diagnosed with Alzheimer's, and 34 with MCI. The method is based on six-second recordings of the sustained vowel /a/ spoken by the subject. The main original contribution of this work is the use of carefully crafted features based on sound objects. This approach allows one to first represent the sound spectrum more accurately than the standard spectrum does, and then build interpretable features containing relevant information about subjects' control over their voice. Results: The ROC AUC obtained in this work for distinguishing healthy subjects from those with MCI was 0.85, with an accuracy of 0.76. For distinguishing between healthy subjects and those with either MCI or Alzheimer's, the corresponding values were 0.84 and 0.77, respectively. Conclusion: The use of features based on sound objects enables screening for early dementia even on very short, language-independent voice recordings.
Visual sound localization is a typical and challenging problem that predicts the location of objects corresponding to the sound source in a video. Previous methods mainly used the audio-visual association between global audio and single-scale visual features to localize sounding objects in each image. Despite their promising performance, they omit multi-scale visual features of the corresponding image and cannot learn regions that are discriminative with respect to the ground truth. To address this issue, we propose a novel multi-scale multi-instance visual sound localization framework, namely M2VSL, that can directly learn multi-scale semantic features associated with sound sources from the input image to localize sounding objects. Specifically, our M2VSL leverages learnable multi-scale visual features to align audio-visual representations at multi-level locations of the corresponding image. We also introduce a novel multi-scale multi-instance transformer to dynamically aggregate multi-scale cross-modal representations for visual sound localization. We conduct extensive experiments on the VGGSound-Instruments, VGG-Sound Sources, and AVSBench benchmarks. The results demonstrate that the proposed M2VSL achieves state-of-the-art performance on sounding object localization and segmentation.
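The core operation behind this kind of localization, scoring how well each spatial location of a visual feature map matches the audio, can be sketched as below; this is a generic multi-scale audio-visual similarity computation, not the authors' M2VSL architecture.

```python
import torch
import torch.nn.functional as F

def av_localization_maps(audio_emb, visual_feats):
    """Cosine-similarity localization maps between one audio embedding and
    visual feature maps taken at several spatial scales.

    audio_emb:    (B, C) pooled audio embedding
    visual_feats: list of (B, C, H_i, W_i) feature maps from different scales
    returns:      list of (B, H_i, W_i) similarity heatmaps
    """
    a = F.normalize(audio_emb, dim=1)                 # (B, C)
    maps = []
    for v in visual_feats:
        v = F.normalize(v, dim=1)                     # (B, C, H, W)
        sim = torch.einsum("bc,bchw->bhw", a, v)      # cosine similarity per location
        maps.append(sim)
    return maps
```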
Automatic Music Transcription (AMT) is the task of recognizing notes in audio recordings of music. The State-of-the-Art (SotA) benchmarks have been dominated by deep learning systems. Due to the scarcity of high-quality data, these systems are usually trained and evaluated exclusively or predominantly on classical piano music. Unfortunately, that hinders our ability to understand how they generalize to other music. Previous works have revealed several aspects of memorization and overfitting in these systems. We identify two primary sources of distribution shift: the music and the sound. Complementing recent results on the sound axis (i.e. acoustics, timbre), we investigate the musical one (i.e. note combinations, dynamics, genre). We evaluate the performance of several SotA AMT systems on two new experimental test sets which we carefully construct to emulate different levels of musical distribution shift. Our results reveal a stark performance gap, shedding further light on the Corpus Bias problem and the extent to which it continues to trouble these systems.
Although ambient air filters are commonly used to improve air quality in urban areas, their operation often produces significant noise. In this study, we investigated and addressed ambient air filter noise exposure in the open areas of Chinese cities, with a focus on typical urban forms. First, fifteen common urban blocks with different forms, each measuring 250 × 250 m, were chosen as sample sites, for which urban form indices and noise distribution indices were defined. Noise mapping was then conducted to investigate air filter noise exposure in open areas and the effects of the urban form indices on the noise distribution indices. Results show that urban form has considerable effects on filter noise exposure in open areas. Among the six urban form indices examined, the distance between the first-row building and the air filter was the most critical factor affecting noise levels, with the highest correlation coefficient (R = 0.754). The orientation of the first-row building shows a significant resisting effect on both average and background noise levels. Furthermore, the resisting effect of first-row buildings can produce a maximum reduction of 12.0 dB(A) in peak noise. These indices could be profiled and used as an “a priori” tool for urban sound environment planning.
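The correlation analysis described above is a straightforward computation; a minimal sketch with hypothetical values (not the study's measurements) is shown below.

```python
from scipy.stats import pearsonr

# Hypothetical values: distance from the first-row building to the air filter (m)
# and a corresponding noise distribution index in the open areas (dB(A)).
distance = [10, 15, 20, 30, 40, 55, 70, 90]
noise    = [68, 66, 65, 62, 60, 57, 55, 52]

r, p_value = pearsonr(distance, noise)   # linear correlation between a form index and noise
print(f"R = {r:.3f}, p = {p_value:.3g}")
```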
Jitendra A. Kewalramani, Bruno Bezerra de Souza, Richard W Marsh
et al.
An intermediate-scale reactor with a 10 L capacity and two transducers operating at 700 and 950 kHz was developed to study the scalability of the sonolytic destruction of per- and polyfluoroalkyl substances (PFAS). The impact of frequency, liquid height (i.e., power density), and transducer position on reactor performance was evaluated using potassium iodide (KI) oxidation dosimetry and calorimetric power. Dual-frequency operation showed a synergistic effect: the triiodide concentration and calorimetric power were higher in this mode than the combined values obtained when each frequency operated individually. The sonochemical efficiency of the intermediate-scale reactor (10 L) was similar to that obtained from a bench-scale reactor (2 L), demonstrating the scalability of the sonolytic technology. Placing the transducer at the bottom or on the side wall of the reactor had no significant impact on sonochemical reactivity. Superposition of the ultrasonic fields in the dual-transducer mode (side and bottom) did not produce a synergistic effect compared to the single-transducer mode (bottom or side); this can be attributed to disturbances arising from the interaction of the ultrasonic fields at the two frequencies from each transducer. Given these encouraging results, scale-up for site implementation is in progress.
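Calorimetric power, used above as the reference measure of acoustic power delivered to the liquid, is obtained from the initial rate of temperature rise; a minimal sketch with illustrative numbers (not the study's data) follows.

```python
import numpy as np

def calorimetric_power(time_s, temp_c, mass_kg, cp_j_per_kg_k=4186.0):
    """Estimate acoustic power (W) delivered to a liquid from its temperature rise:
    P = m * cp * dT/dt, with dT/dt taken as the initial linear slope."""
    slope, _ = np.polyfit(time_s, temp_c, 1)   # K/s from a linear fit
    return mass_kg * cp_j_per_kg_k * slope

# Illustrative 10 L (about 10 kg) water load warming ~0.6 degC over 5 minutes.
t = np.arange(0, 301, 30)                      # s
T = 20.0 + 0.002 * t                           # degC, placeholder measurements
print(calorimetric_power(t, T, mass_kg=10.0))  # about 84 W
```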
This short paper introduces a workflow for generating realistic soundscapes for visual media. In contrast to prior work, which primarily focuses on matching sounds to on-screen visuals, our approach extends to suggesting sounds that may not be immediately visible but are essential to crafting a convincing and immersive auditory environment. Our key insight is to leverage the reasoning capabilities of language models such as ChatGPT. In this paper, we describe our workflow, which includes creating a scene context, brainstorming sounds, and generating the sounds.
In industry, machine anomalous sound detection (ASD) is in great demand. However, collecting enough abnormal samples is difficult due to the high cost, which has spurred the rapid development of unsupervised ASD algorithms. Autoencoder (AE) based methods have been widely used for unsupervised ASD, but they suffer from problems including 'shortcut' learning, poor noise robustness, and sub-optimal feature quality. To address these challenges, we propose a new AE-based framework termed AEGM. Specifically, we first insert an auxiliary classifier into the AE to enhance ASD in a multi-task learning manner. Then, we design a group-based decoder structure, accompanied by an adaptive loss function, to endow the model with domain-specific knowledge. Results on the DCASE 2021 Task 2 development set show that our method achieves relative improvements of 13.11% and 15.20% in average AUC over the official AE and MobileNetV2 baselines, respectively, across the test sets of seven machine types.
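For context, the reconstruction-error scoring that AE-based ASD systems build on can be sketched as follows; this is the generic autoencoder baseline idea, not the proposed AEGM with its auxiliary classifier and group-based decoder, and the feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameAE(nn.Module):
    """Small fully connected autoencoder over stacked log-mel frames (generic ASD baseline)."""

    def __init__(self, in_dim: int = 640, hidden: int = 128, bottleneck: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, bottleneck), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(bottleneck, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

@torch.no_grad()
def anomaly_score(model: FrameAE, features: torch.Tensor) -> float:
    """Mean squared reconstruction error over a clip's frames; higher means more anomalous."""
    recon = model(features)                       # (num_frames, in_dim)
    return torch.mean((features - recon) ** 2).item()
```

The model is trained only on normal machine sounds, so clips that reconstruct poorly at test time are flagged as anomalous.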
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio -- without acoustically mismatched source audio for reference. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio. Training with either in-the-wild web data or simulated data, we demonstrate it outperforms the state-of-the-art on multiple challenging datasets and a wide variety of real-world audio and environments.
The conventional approach of embedding an effective acoustic metric for sound motion in a background flat Minkowski space-time has recently been extended to incorporate more general curved background metrics, which might contain a black hole. Although the observational aspects of such acoustic horizons, including the sonic shadow structure and quasi-normal modes, have received significant attention in the literature, there is room for discussion of embedding more general classes of curved background space-times without optical horizons. Here, we propose and study a new class of acoustic metrics embedded in a black-bounce space-time, thereby providing a suitably tuneable system for understanding the possible observational effects of the presence or absence of acoustic horizons. After showing that the metric can represent five types of different effective backgrounds for sound motion, including a novel “acoustic wormhole–optical wormhole” branch, we discuss how the distinctive features of sonic shadows can appear even in the absence of any acoustic horizon, owing to the wormhole throat present in the acoustic metric.
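For readers less familiar with analogue gravity, the standard acoustic line element for sound propagating in an irrotational, barotropic flow (density rho, sound speed c, flow velocity v) is reproduced below; the black-bounce construction in this work generalizes the background into which such a metric is embedded, and the authors' specific metric is not reproduced here.

```latex
% Acoustic metric for linear perturbations of an irrotational, barotropic,
% inviscid flow with background density \rho, sound speed c, and velocity \vec{v}:
ds^2 = \frac{\rho}{c}\left[-\left(c^2 - v^2\right)dt^2
       - 2\,\vec{v}\cdot d\vec{x}\,dt
       + d\vec{x}\cdot d\vec{x}\right]
```

An acoustic horizon forms where the flow speed v reaches the sound speed c, which is why its presence or absence can be tuned through the background flow.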
Sylvia Indriani, Thanasak Sae-leaw, Soottawat Benjakul
et al.
This study focused on the impact of an ultrasound-assisted process (UAP) applied as a pre-treatment (UP) or simultaneously (US) during papain hydrolysis for preparing collagen hydrolysate (CH) from Asian bullfrog skin. Ultrasonication times were varied (10, 20, 30 min), and CH prepared by papain hydrolysis without UAP served as the control. Different UAPs provided CH with varying hydroxyproline contents, α-amino group contents, surface hydrophobicities, and antioxidative activities. UP for 20 min (UP-20) and US for 30 min (US-30) provided highly antioxidative CHs, which were selected for further study of their oxygen radical absorbance capacity (ORAC) and molecular characteristics. CHs from UP-20 and US-30 had higher ORAC than that of the control (p ≤ 0.05). Only slight differences in amino acid composition were detected between samples. Molecular characteristics of each sample, including molecular weight profile, antioxidative peptide content, and secondary structure, were also determined. These results indicate that UP and US induced different enzymatic hydrolysis patterns and modified the molecular conformation of CH, resulting in enhanced antioxidative activity. Therefore, different UAPs, whether applied as UP or US, could be effectively used in the preparation of CH from Asian bullfrog skin by papain hydrolysis, improving production process efficiency by enhancing bioactivity, particularly antioxidative activity.