Results for "eess.AS"

Showing 20 of ~72 results · from CrossRef, arXiv

arXiv Open Access 2026
Towards Objective Gastrointestinal Auscultation: Automated Segmentation and Annotation of Bowel Sound Patterns

Zahra Mansour, Verena Uslar, Dirk Weyhe et al.

Bowel sounds (BS) are typically momentary and have low amplitude, making them difficult to detect accurately through manual auscultation; this leads to significant variability in clinical assessment. Digital acoustic sensors allow the acquisition of high-quality BS and enable automated signal analysis, offering the potential to provide clinicians with both objective and quantitative feedback on bowel activity. This study presents an automated pipeline for bowel sound segmentation and classification using a wearable acoustic SonicGuard sensor. BS signals from 83 subjects were recorded; data from 40 subjects were manually annotated by clinical experts and used to train an automatic annotation algorithm, while the remaining subjects were used for further model evaluation. An energy-based event detection algorithm was developed to detect BS events. Detected sound segments were then classified into BS patterns using a pretrained Audio Spectrogram Transformer (AST) model. Model performance was evaluated separately for healthy individuals and patients. The best configuration used two specialized models, one trained on healthy subjects and one on patients, achieving an accuracy of 0.97 and AUROC of 0.98 for the healthy group and an accuracy of 0.96 and AUROC of 0.98 for the patient group. The auto-annotation method reduced manual labeling time by approximately 70%, and expert review showed that less than 12% of automatically detected segments required correction. The proposed automated segmentation and classification system enables quantitative assessment of bowel activity, providing clinicians with an objective diagnostic tool that may improve the diagnosis of gastrointestinal function and support the annotation of large-scale datasets.
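The energy-based event detection step described in the abstract can be sketched as follows (a minimal illustration only: the frame sizes, the median noise-floor estimate, and the threshold rule are our assumptions, not the paper's implementation):

```python
import numpy as np

def detect_events(signal, sr, frame_len=0.025, hop=0.010, thresh_ratio=3.0):
    """Flag frames whose short-time energy exceeds a multiple of the
    noise floor, then merge consecutive active frames into events.
    All parameter values are illustrative assumptions."""
    n = int(frame_len * sr)
    h = int(hop * sr)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, h)]
    energy = np.array([float(np.mean(f ** 2)) for f in frames])
    noise_floor = np.median(energy)          # robust noise estimate
    active = energy > thresh_ratio * noise_floor
    events, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i * h                    # event onset (samples)
        elif not a and start is not None:
            events.append((start, i * h + n))
            start = None
    if start is not None:
        events.append((start, len(signal)))
    return events
```

A real detector would also smooth the energy envelope and enforce minimum event durations before handing segments to the classifier.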

en cs.SD, cs.LG
arXiv Open Access 2025
Benchmarking machine learning for bowel sound pattern classification from tabular features to pretrained models

Zahra Mansour, Verena Uslar, Dirk Weyhe et al.

The development of electronic stethoscopes and wearable recording sensors opened the door to the automated analysis of bowel sound (BS) signals. This enables a data-driven analysis of bowel sound patterns, their interrelations, and their correlation to different pathologies. This work leverages a BS dataset collected from 16 healthy subjects that was annotated according to four established BS patterns. This dataset is used to evaluate the performance of machine learning models in detecting and/or classifying BS patterns. The selection of considered models covers models using tabular features, convolutional neural networks based on spectrograms, and models pre-trained on large audio datasets. The results highlight the clear superiority of pre-trained models, particularly in detecting classes with few samples, achieving an AUC of 0.89 in distinguishing BS from non-BS using a HuBERT model and an AUC of 0.89 in differentiating bowel sound patterns using a Wav2Vec 2.0 model. These results pave the way for an improved understanding of bowel sounds in general and future machine-learning-driven diagnostic applications for gastrointestinal examinations.

en cs.SD, cs.LG
arXiv Open Access 2024
Concerns for Self-Localization of Ad-Hoc Arrays Using Time Difference of Arrivals

Faxian Cao

This document presents some insights and observations regarding the paper that was published in IEEE Transactions on Signal Processing (TSP), titled "Self-Localization of Ad-Hoc Arrays Using Time Difference of Arrivals". In the spirit of constructive feedback, I wish to highlight two key areas of consideration. The first pertains to aspects related to methodology, experimental results, and statements made in the paper. The second part addresses specific equation/typographical errors. This work aims to initiate a constructive dialogue concerning certain aspects of the paper published in IEEE TSP. Our intention is to provide feedback that contributes to the ongoing improvement of the paper's robustness and clarity.

en eess.AS
arXiv Open Access 2024
A Bandpass Twin-T Active Filter Used in the Buchla 200 Electric Music Box Synthesizer

Aaron D. Lanterman

This paper analyzes an unusual active bandpass filter employed in the Buchla Model 295 10 Channel Comb Filter, a synthesizer module developed as part of the Buchla 200 Electric Music Box by Donald Buchla. The filter consists of a peculiar rearrangement of elements in a classic Twin-T configuration; to our knowledge, it has not been previously addressed in the literature. As an example, we explore its specific application in the Model 295.

en eess.AS, cs.SD
arXiv Open Access 2023
FN-SSL: Full-Band and Narrow-Band Fusion for Sound Source Localization

Yabo Wang, Bing Yang, Xiaofei Li

Extracting direct-path spatial features is critical for sound source localization in adverse acoustic environments. This paper proposes a full-band and narrow-band fusion network for estimating direct-path inter-channel phase difference (DP-IPD) from microphone signals. The alternating full-band and narrow-band layers are responsible for learning the full-band correlation and narrow-band extraction of DP-IPD, respectively. Experiments show that the proposed network noticeably outperforms other advanced methods on both simulated and real-world data.
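For context, the inter-channel phase difference on which the network's DP-IPD target is based is computed from a pair of STFT channels as follows (a generic definition, not the paper's code):

```python
import numpy as np

def ipd(stft_ref, stft_other):
    """Inter-channel phase difference between two STFT channels
    (complex arrays of equal shape), element-wise in radians,
    wrapped to (-pi, pi]."""
    return np.angle(stft_other * np.conj(stft_ref))
```

The network's task is to recover the component of this quantity contributed by the direct propagation path, suppressing the reverberant and noise contributions.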

en eess.AS
arXiv Open Access 2023
Improved Lossless Coding for Storage and Transmission of Multichannel Immersive Audio

Toni Hirvonen, Mahmoud Namazi

In this paper, techniques for improving multichannel lossless coding are examined. A method is proposed for the simultaneous coding of two or more different renderings (mixes) of the same content. The signal model uses both past samples of the upmix and current samples of the downmix to predict the upmix. Model parameters are optimized via a general linear solver, and the prediction residual is Rice coded. Additionally, the use of an SVD projection prior to residual coding is proposed. A comparison is made against various baselines, including FLAC. The proposed methods show improved compression ratios for the storage and transmission of immersive audio.
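The residual coding stage mentioned above can be illustrated with a minimal Rice coder (a textbook sketch with an assumed zigzag mapping for signed residuals; the paper's parameter selection and bit packing are not shown):

```python
def zigzag(v):
    """Map signed residuals to non-negative integers: 0,-1,1,-2 -> 0,1,2,3."""
    return 2 * v if v >= 0 else -2 * v - 1

def rice_encode(values, k):
    """Rice-code non-negative integers with parameter k >= 1: each value v
    is written as a unary quotient (v >> k ones, then a zero) followed by
    the k low-order remainder bits. Bits are returned as a '0'/'1' string
    for clarity; a real coder packs them into bytes."""
    out = []
    for v in values:
        q, r = v >> k, v & ((1 << k) - 1)
        out.append('1' * q + '0' + format(r, f'0{k}b'))
    return ''.join(out)
```

Small prediction residuals yield small zigzagged integers and hence short codewords, which is what makes a good linear predictor pay off in compression ratio.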

en eess.AS, cs.MM
arXiv Open Access 2022
Ambisonic Encoding of Signals From Spherical Microphone Arrays

Jens Ahrens

This document illustrates how to process the signals from the microphones of a rigid-sphere higher-order ambisonic microphone array so that they are encoded with N3D normalization and ACN channel order and thereby can be used with the standard ambisonic software tools such as SPARTA and the IEM Plugin Suite. A MATLAB script is provided.
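The ACN channel ordering and N3D normalization referred to above follow standard ambisonic conventions, which can be sketched as follows (our Python rendering of the standard formulas, not the MATLAB script provided with the paper):

```python
import math

def acn_index(n, m):
    """ACN channel index for spherical-harmonic degree n and order m
    (with |m| <= n): channels are ordered 0, 1, 2, 3, ... as
    (0,0), (1,-1), (1,0), (1,1), (2,-2), ..."""
    return n * (n + 1) + m

def sn3d_to_n3d(n):
    """Per-degree gain converting SN3D-normalized coefficients to N3D."""
    return math.sqrt(2 * n + 1)
```

Encoding the array signals into this ACN/N3D layout is what makes them directly usable in tools such as SPARTA and the IEM Plugin Suite.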

en eess.AS, cs.SD
arXiv Open Access 2021
Tied & Reduced RNN-T Decoder

Rami Botros, Tara N. Sainath, Robert David et al.

Previous works on the Recurrent Neural Network-Transducer (RNN-T) models have shown that, under some conditions, it is possible to simplify its prediction network with little or no loss in recognition accuracy (arXiv:2003.07705 [eess.AS], [2], arXiv:2012.06749 [cs.CL]). This is done by limiting the context size of previous labels and/or using a simpler architecture for its layers instead of LSTMs. The benefits of such changes include reduction in model size, faster inference and power savings, which are all useful for on-device applications. In this work, we study ways to make the RNN-T decoder (prediction network + joint network) smaller and faster without degradation in recognition performance. Our prediction network performs a simple weighted averaging of the input embeddings, and shares its embedding matrix weights with the joint network's output layer (a.k.a. weight tying, commonly used in language modeling arXiv:1611.01462 [cs.LG]). This simple design, when used in conjunction with additional Edit-based Minimum Bayes Risk (EMBR) training, reduces the RNN-T Decoder from 23M parameters to just 2M, without affecting word-error rate (WER).
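The weight-tying idea can be sketched in a toy form (an unweighted average of the context embeddings stands in for the paper's learned weighted averaging, and all names are ours):

```python
import numpy as np

def tied_decoder_logits(prev_ids, embedding):
    """Toy sketch of a tied-and-reduced decoder: the prediction network
    averages the embeddings of the previous labels, and the output layer
    reuses the same embedding matrix (weight tying), so the (vocab, dim)
    matrix is stored only once."""
    context = embedding[prev_ids].mean(axis=0)   # (dim,) prediction state
    return context @ embedding.T                 # (vocab,) output logits
```

Sharing one matrix for both the input embedding and the output projection is the main source of the parameter reduction reported in the abstract.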

en cs.CL, cs.LG
arXiv Open Access 2021
RTNeural: Fast Neural Inferencing for Real-Time Systems

Jatin Chowdhury

RTNeural is a neural inferencing library written in C++. RTNeural is designed to be used in systems with hard real-time constraints, with additional emphasis on speed, flexibility, size, and convenience. The motivation and design of the library are described, as well as real-world use-cases, and performance comparisons with other neural inferencing libraries.

en eess.AS
arXiv Open Access 2021
USM-SED - A Dataset for Polyphonic Sound Event Detection in Urban Sound Monitoring Scenarios

Jakob Abeßer

This paper introduces a novel dataset for polyphonic sound event detection in urban sound monitoring use-cases. Based on isolated sounds taken from the FSD50k dataset, 20,000 polyphonic soundscapes are synthesized with sounds being randomly positioned in the stereo panorama using different loudness levels. The paper gives a detailed discussion of possible application scenarios, explains the dataset generation process in detail, and discusses current limitations of the proposed USM-SED dataset.
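The described generation procedure can be sketched roughly as follows (the onset, pan, and gain ranges are illustrative assumptions, not the dataset's actual parameters):

```python
import numpy as np

def mix_scene(events, length, seed=0):
    """Synthesize a stereo soundscape by placing mono events at random
    onsets, random stereo positions, and random loudness levels."""
    rng = np.random.default_rng(seed)
    scene = np.zeros((length, 2))
    for ev in events:
        onset = rng.integers(0, max(1, length - len(ev)))
        pan = rng.random()                        # 0 = left, 1 = right
        gain = 10 ** (rng.uniform(-12, 0) / 20)   # random loudness in dB
        l, r = np.sqrt(1 - pan), np.sqrt(pan)     # constant-power panning
        scene[onset:onset + len(ev), 0] += gain * l * ev
        scene[onset:onset + len(ev), 1] += gain * r * ev
    return scene
```

Because onsets are drawn independently, events overlap freely, which is what makes the resulting soundscapes polyphonic.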

en eess.AS, cs.SD
arXiv Open Access 2021
Student-t Networks for Melody Estimation

Udhav Gupta, Avi, Bhavesh Jain

Melody estimation or melody extraction refers to the extraction of the primary or fundamental dominant frequency in a melody. This sequence of frequencies represents the pitch of the dominant melodic line from recorded music audio signals. The music signal may be monophonic or polyphonic. The melody extraction problem becomes more complicated for polyphonic audio data, because in general audio signals the sounds are highly correlated over both the frequency and time domains. This complex overlap of many sounds makes identification of the predominant frequency challenging.

en eess.AS, cs.SD
arXiv Open Access 2021
Complex-valued Spatial Autoencoders for Multichannel Speech Enhancement

Mhd Modar Halimeh, Walter Kellermann

In this contribution, we present a novel online approach to multichannel speech enhancement. The proposed method estimates the enhanced signal through a filter-and-sum framework. More specifically, complex-valued masks are estimated by a deep complex-valued neural network, termed the complex-valued spatial autoencoder. The proposed network is capable of exploiting as well as manipulating both the phase and the amplitude of the microphone signals. As shown by the experimental results, the proposed approach is able to exploit both spatial and spectral characteristics of the desired source signal resulting in a physically plausible spatial selectivity and superior speech quality compared to other baseline methods.
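The filter-and-sum framework with complex-valued masks amounts to the following STFT-domain operation (a generic sketch, not the proposed network):

```python
import numpy as np

def filter_and_sum(stfts, masks):
    """Filter-and-sum in the STFT domain: multiply each channel by its
    complex-valued mask (which can change both amplitude and phase),
    then sum across channels.
    stfts, masks: (channels, freq, time) complex arrays."""
    return np.sum(masks * stfts, axis=0)
```

The network's job is to produce the `masks` array; because the masks are complex-valued, the summation can align channel phases constructively toward the desired source, which is what yields the spatial selectivity described above.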

en eess.AS
arXiv Open Access 2020
A Hybrid Approach to Audio-to-Score Alignment

Ruchit Agrawal, Simon Dixon

Audio-to-score alignment aims at generating an accurate mapping between a performance audio and the score of a given piece. Standard alignment methods are based on Dynamic Time Warping (DTW) and employ handcrafted features. We explore the usage of neural networks as a preprocessing step for DTW-based automatic alignment methods. Experiments on music data from different acoustic conditions demonstrate that this method generates robust alignments whilst being adaptable at the same time.
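The DTW backbone that the neural preprocessing feeds into can be sketched as follows (a standard accumulated-cost formulation with Euclidean frame distance; path constraints and backtracking are omitted):

```python
import numpy as np

def dtw_cost(a, b):
    """Classic dynamic time warping between two feature sequences
    (rows = frames). Returns the minimal accumulated alignment cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

In the hybrid setting, the handcrafted features fed to this recursion are replaced or transformed by learned representations before the warping path is computed.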

en eess.AS, cs.LG
arXiv Open Access 2020
Accelerating Auxiliary Function-based Independent Vector Analysis

Andreas Brendel, Walter Kellermann

Independent Vector Analysis (IVA) is an effective approach for Blind Source Separation (BSS) of convolutive mixtures of audio signals. As a practical realization of an IVA-based BSS algorithm, the so-called AuxIVA update rules based on the Majorize-Minimize (MM) principle have been proposed which allow for fast and computationally efficient optimization of the IVA cost function. For many real-time applications, however, update rules for IVA exhibiting even faster convergence are highly desirable. To this end, we investigate techniques which accelerate the convergence of the AuxIVA update rules without extra computational cost. The efficacy of the proposed methods is verified in experiments representing real-world acoustic scenarios.

en eess.AS
arXiv Open Access 2020
Audio-Visual Calibration with Polynomial Regression for 2-D Projection Using SVD-PHAT

Francois Grondin, Hao Tang, James Glass

This paper proposes a straightforward 2-D method to spatially calibrate the visual field of a camera with the auditory field of an array microphone by generating and overlaying an acoustic image over an optical image. Using a low-cost microphone array and an off-the-shelf camera, we show that polynomial regression can deal efficiently with non-linear camera distortion, and that a recently proposed sound source localization method for real-time processing, SVD-PHAT, can be adapted for this task.
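The polynomial-regression calibration step can be sketched as a least-squares fit from source coordinates (e.g. estimated sound directions) to pixel coordinates (the basis and degree here are illustrative assumptions):

```python
import numpy as np

def fit_poly_map(src, dst, degree=2):
    """Fit a 2-D polynomial mapping src -> dst by least squares and
    return a prediction function. src, dst: (N, 2) arrays. A degree-2
    basis can absorb mild non-linear lens distortion."""
    def design(pts):
        x, y = pts[:, 0], pts[:, 1]
        return np.stack([x**i * y**j
                         for i in range(degree + 1)
                         for j in range(degree + 1 - i)], axis=1)
    coeffs, *_ = np.linalg.lstsq(design(src), dst, rcond=None)
    return lambda pts: design(pts) @ coeffs
```

Once fitted on a handful of calibration points, the mapping lets the acoustic image from SVD-PHAT be overlaid directly on the camera frame.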

en eess.AS, cs.SD
arXiv Open Access 2020
The Impact of Label Noise on a Music Tagger

Katharina Prinz, Arthur Flexer, Gerhard Widmer

We explore how much can be learned from noisy labels in audio music tagging. Our experiments show that carefully annotated labels result in the highest figures of merit, but even high amounts of noisy labels contain enough information for successful learning. Artificial corruption of curated data allows us to quantify this contribution of noisy labels.
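The artificial corruption of curated labels can be sketched as follows (a uniform label-flip model; the paper's exact noise model is an assumption here):

```python
import numpy as np

def corrupt_labels(labels, n_classes, noise_rate, seed=0):
    """Randomly reassign a fraction noise_rate of labels to a different
    class, emulating controlled label noise on a curated dataset."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    flip = rng.random(len(labels)) < noise_rate
    for i in np.flatnonzero(flip):
        choices = [c for c in range(n_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)     # never keep the true class
    return labels
```

Sweeping `noise_rate` and retraining the tagger at each level is what lets one measure how much information survives a given amount of label noise.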

en eess.AS, cs.LG
arXiv Open Access 2019
USTCSpeech System for VOiCES from a Distance Challenge 2019

Lanhua You, Bin Gu, Wu Guo

This document describes the speaker verification systems developed in the Speech lab at the University of Science and Technology of China (USTC) for the VOiCES from a Distance Challenge 2019. We develop the system for the Fixed Condition on two public corpora, VoxCeleb and SITW. The frameworks of our systems are based on the mainstream i-vector/PLDA and x-vector/PLDA algorithms.

en eess.AS
arXiv Open Access 2018
Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis

Daisy Stanton, Yuxuan Wang, RJ Skerry-Ryan

Global Style Tokens (GSTs) are a recently-proposed method to learn latent disentangled representations of high-dimensional data. GSTs can be used within Tacotron, a state-of-the-art end-to-end text-to-speech synthesis system, to uncover expressive factors of variation in speaking style. In this work, we introduce the Text-Predicted Global Style Token (TP-GST) architecture, which treats GST combination weights or style embeddings as "virtual" speaking style labels within Tacotron. TP-GST learns to predict stylistic renderings from text alone, requiring neither explicit labels during training nor auxiliary inputs for inference. We show that, when trained on a dataset of expressive speech, our system generates audio with more pitch and energy variation than two state-of-the-art baseline models. We further demonstrate that TP-GSTs can synthesize speech with background noise removed, and corroborate these analyses with positive results on human-rated listener preference audiobook tasks. Finally, we demonstrate that multi-speaker TP-GST models successfully factorize speaker identity and speaking style. We provide a website with audio samples for each of our findings.

en cs.CL, cs.LG
arXiv Open Access 2018
A Study of the Complexity and Accuracy of Direction of Arrival Estimation Methods Based on GCC-PHAT for a Pair of Close Microphones

Francois Grondin, James Glass

This paper investigates the accuracy of various Generalized Cross-Correlation with Phase Transform (GCC-PHAT) methods for a close pair of microphones. We investigate interpolation-based methods and also propose another approach based on Singular Value Decomposition (SVD). All investigated methods are implemented in C code, and the execution time is measured to determine which approach is the most appealing for real-time applications on low-cost embedded hardware.
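The GCC-PHAT baseline underlying all investigated methods can be sketched as follows (a common textbook formulation in Python rather than the paper's C implementation; it resolves the delay only to integer samples, which is exactly what the interpolation-based methods refine):

```python
import numpy as np

def gcc_phat(x, y, fs):
    """Estimate the delay of y relative to x (in seconds) via GCC-PHAT:
    whiten the cross-spectrum so only phase remains, then pick the peak
    of the resulting cross-correlation."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    R = Y * np.conj(X)
    R /= np.abs(R) + 1e-12            # phase transform (PHAT) weighting
    cc = np.fft.irfft(R, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

For a close pair of microphones the true delay spans only a few samples, so sub-sample refinement of this integer peak is what the interpolation and SVD-based methods compete on.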

en eess.AS, cs.SD

Page 1 of 4