Nicholas Brendle, Jonathan Chamberlain, Joel T. Johnson et al.
Results for "eess.AS"
Showing 20 of ~72 results · from CrossRef, arXiv
Ram C. M. C. Shekar, Iván López-Espejo
We present LIWhiz, a non-intrusive lyric intelligibility prediction system submitted to the ICASSP 2026 Cadenza Challenge. LIWhiz leverages Whisper for robust feature extraction and a trainable back-end for score prediction. Tested on the Cadenza Lyric Intelligibility Prediction (CLIP) evaluation set, LIWhiz achieves a root mean square error (RMSE) of 27.07%, a 22.4% relative RMSE reduction over the STOI-based baseline, alongside a substantial improvement in normalized cross-correlation.
Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz
This paper presents a system developed for submission to PolEval 2024, Task 3: Polish Automatic Speech Recognition Challenge. We describe a Voicebox-based speech synthesis pipeline and use it to augment Conformer and Whisper speech recognition models with synthetic data. We show that adding synthetic speech to training significantly improves the results, and we present the final results achieved by our models in the competition.
Dang Thoai Phan
In recent years, the continuous wavelet transform (CWT) has been employed as a spectral feature extractor for acoustic recognition tasks in conjunction with machine learning and deep learning models. However, applying the CWT to each individual audio sample is computationally intensive. This paper proposes an approach that applies the CWT to a subset of samples, spaced according to a specified hop size. Experimental results demonstrate that this method significantly reduces computational costs while maintaining the robust performance of the trained models.
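The hop-based scheme can be sketched in a few lines of NumPy: rather than evaluating wavelet coefficients at every sample, coefficients are computed only at positions spaced `hop` samples apart. The Morlet wavelet, window length, and normalization below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def morlet(scale, n, w0=6.0):
    # Discretely sampled Morlet wavelet at a given scale, length n.
    t = (np.arange(n) - n // 2) / scale
    return np.exp(1j * w0 * t - t**2 / 2) / np.sqrt(scale)

def hopped_cwt(x, scales, hop, support=512):
    # CWT coefficients evaluated only every `hop` samples, cutting the
    # per-scale cost by roughly a factor of `hop` versus a full CWT.
    centers = np.arange(0, len(x), hop)
    out = np.zeros((len(scales), len(centers)), dtype=complex)
    half = support // 2
    xp = np.pad(x, half)                       # zero-pad the edges
    for i, s in enumerate(scales):
        w = np.conj(morlet(s, support))
        for j, c in enumerate(centers):
            out[i, j] = xp[c:c + support] @ w  # inner product centered at c
    return out
```

With `hop=1` this reduces to a dense CWT; larger hops trade time resolution for the computational savings the abstract reports.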
Elsia Miranda Mildad Tatumang, Ahmatang Ahmatang
Research aim: This study aims to test the quality of Moodle Borneo E-Learning (BEL), covering Service Quality, Student Quality, and Lecturer Quality, against Perceived Satisfaction, Perceived Usefulness, and Benefits, based on the E-learning System Success (EESS) evaluation model for active BEL users.
Design/Method/Approach: This research uses a quantitative approach with non-probability sampling, specifically quota sampling. The sample size was determined with the Hair formula, yielding 280 University of Borneo Tarakan students who had used BEL. Data were analyzed with Structural Equation Modeling (SEM) using the SmartPLS program.
Research finding: The results show that Service Quality, Student Quality, and Lecturer Quality have a positive and significant effect on satisfaction and usefulness, and that satisfaction and usefulness in turn have a positive and significant effect on benefits.
Theoretical contribution/Originality: This research is expected to provide insight and information to researchers and academics regarding the analysis of BEL service quality using the EESS model.
Practitioner/Policy implication: The results of this study serve as input for the University of Borneo Tarakan's LP3M, so that it can improve the quality of BEL and make online learning with BEL more comfortable for students.
Research limitation: This study evaluates the quality of the BEL application only from the perspective of students at the University of Borneo Tarakan, and focuses only on the social factors of the EESS conceptual model, namely Service Quality, Learner Quality, and Instructor Quality.
Pavan Seshadri, Chaeyeon Han, Bon-Woo Koo et al.
We introduce the new audio analysis task of pedestrian detection and present a new large-scale dataset for this task. While the preliminary results prove the viability of using audio approaches for pedestrian detection, they also show that this challenging task cannot be easily solved with standard approaches.
Shahan Nercessian, Johannes Imort
We introduce the text-to-instrument task, which aims at generating sample-based musical instruments based on textual prompts. Accordingly, we propose InstrumentGen, a model that extends a text-prompted generative audio framework to condition on instrument family, source type, pitch (across an 88-key spectrum), velocity, and a joint text/audio embedding. Furthermore, we present a differentiable loss function to evaluate the intra-instrument timbral consistency of sample-based instruments. Our results establish a foundational text-to-instrument baseline, extending research in the domain of automatic sample-based instrument generation.
Boaz Rafaely
Spherical microphone arrays have been recently studied for spatial sound recording, speech communication, and sound field analysis for room acoustics and noise control. Complementary theoretical studies presented progress in spatial sampling and beamforming methods. This paper reviews recent results in spatial sampling that facilitate a wide range of spherical array configurations, from a single rigid sphere to free positioning of microphones. The paper then presents an overview of beamforming methods recently presented for spherical arrays, from the widely used delay-and-sum and Dolph-Chebyshev, to the more advanced optimal methods, typically performed in the spherical harmonics domain.
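As a point of reference for the delay-and-sum method the review covers, a minimal frequency-domain sketch for an array with freely positioned microphones is given below; the geometry, speed of sound, and steering convention are generic assumptions, not tied to any particular spherical configuration.

```python
import numpy as np

def delay_and_sum(X, freqs, mic_pos, look_dir, c=343.0):
    # Frequency-domain delay-and-sum for arbitrary microphone positions.
    # X: (n_mics, n_freqs) spectra; mic_pos: (n_mics, 3) in meters;
    # look_dir: unit vector pointing toward the source.
    delays = mic_pos @ look_dir / c                    # per-mic time advance
    W = np.exp(-2j * np.pi * np.outer(delays, freqs))  # phase compensation
    return np.mean(W * X, axis=0)                      # align and average
```

A plane wave arriving exactly from `look_dir` is phase-aligned across microphones and sums coherently; waves from other directions are attenuated.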
Haolin Chen, Philip N. Garner
Given the recent success of diffusion in producing natural-sounding synthetic speech, we investigate how diffusion can be used in speaker adaptive TTS. Taking cues from more traditional adaptation approaches, we show that adaptation can be included in a diffusion pipeline using conditional layer normalization with a step embedding. However, we show experimentally that, whilst the approach has merit, such adaptation alone cannot approach the performance of Transformer-based techniques. In a second experiment, we show that diffusion can be optimally combined with Transformer, with the latter taking the bulk of the adaptation load and the former contributing to improved naturalness.
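The adaptation mechanism described, conditional layer normalization driven by an embedding, can be sketched generically; the shapes and the affine parameterization below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def conditional_layer_norm(x, cond, W_g, b_g, W_b, b_b, eps=1e-5):
    # Layer norm whose scale and shift are predicted from a conditioning
    # vector (e.g. a speaker embedding combined with a diffusion-step
    # embedding) rather than being fixed learned parameters.
    gamma = cond @ W_g + b_g          # conditioned scale
    beta = cond @ W_b + b_b           # conditioned shift
    mu = x.mean(-1, keepdims=True)
    sd = np.sqrt(x.var(-1, keepdims=True) + eps)
    return gamma * (x - mu) / sd + beta
```

With zero conditioning weights this degenerates to ordinary layer normalization, which is what makes it a light-touch way to inject speaker and step information.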
Martin Vins, Jaroslav Dragoun, Patrik Kalaj et al.
Qinwen Hu, Zhongshu Hou, Xiaohuai Le et al.
Deep neural network based full-band speech enhancement systems face challenges of high demand of computational resources and imbalanced frequency distribution. In this paper, a light-weight full-band model is proposed with two dedicated strategies, i.e., a learnable spectral compression mapping for more effective high-band spectral information compression, and the utilization of the multi-head attention mechanism for more effective modeling of the global spectral pattern. Experiments validate the efficacy of the proposed strategies and show that the proposed model achieves competitive performance with only 0.89M parameters.
Ryan Duke, Alex Doboli
This paper presents the diaLogic system, a human-in-the-loop system for modeling the behavior of teams while solving open-ended problems. Team behavior is modeled through hypotheses extracted from features computed from acquired voice data, including speaker interactions, speaker emotions, fundamental frequencies, and the corresponding text and clauses. Hypotheses about invariant and differentiated situations are found based on the similarities and dissimilarities of team behavior over time. To fully automate data acquisition, the diaLogic system runs within an intuitive, user-friendly GUI. Experiments present the performance of the system for a broad set of cases featuring team behavior during problem solving.
Jonathan Svirsky, Ofir Lindenbaum
We propose a novel voice activity detection (VAD) model for low-resource environments. Our key idea is to model VAD as a denoising task and construct a network designed to identify nuisance features for a speech classification task. We train the model to simultaneously identify irrelevant features while predicting the type of speech event. Our model contains only 7.8K parameters, outperforms previously proposed methods on the AVA-Speech evaluation set, and provides comparable results on the HAVIC dataset. We present its architecture, experimental results, and an ablation study of the model's components. We publish the code and models at https://www.github.com/jsvir/vad.
Yue Qiao, Léo Guadagnin, Edgar Choueiri
Two isolation performance metrics, Inter-Zone Isolation (IZI) and Inter-Program Isolation (IPI), are introduced for evaluating Personal Sound Zone (PSZ) systems. Compared to the commonly-used Acoustic Contrast metric, IZI and IPI are generalized for multichannel audio, and quantify the isolation of sound zones and of audio programs, respectively. The two metrics are shown to be generally non-interchangeable and suitable for different scenarios, such as generating dark zones (IZI) or minimizing audio-on-audio interference (IPI). Furthermore, two examples with free-field simulations are presented and demonstrate the applications of IZI and IPI in evaluating PSZ performance in different rendering modes and PSZ robustness.
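For context, the Acoustic Contrast baseline that IZI and IPI generalize is commonly defined as the ratio of mean acoustic energy between the bright and dark zones; a minimal sketch of that baseline metric follows (the paper's exact IZI/IPI formulations are not reproduced here).

```python
import numpy as np

def acoustic_contrast_db(p_bright, p_dark):
    # Acoustic Contrast: ratio of mean squared pressure magnitude in the
    # bright (target) zone to that in the dark (quiet) zone, in dB.
    e_bright = np.mean(np.abs(p_bright) ** 2)
    e_dark = np.mean(np.abs(p_dark) ** 2)
    return 10.0 * np.log10(e_bright / e_dark)
```

Because this collapses each zone to a single energy figure, it cannot distinguish which audio program leaks where, which is the gap the abstract's multichannel, program-aware metrics address.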
Eduardo Murakami, Agostinho Linhares, Luiz C. Trintinalia et al.
CS Makola, PF Le Roux, JA Jordaan
Srđan Kitić, Jérôme Daniel
We introduce and analyze the Generalized Time Domain Velocity Vector (GTVV), an extension of the previously presented acoustic multipath footprint extracted from Ambisonic recordings. GTVV is better adapted to adverse acoustic conditions and enables efficient parameter estimation of multiple plane-wave components in the recorded multichannel mixture. Experiments on simulated data confirm the predicted theoretical advantages of these new spatio-temporal features.
Ge Zhu, Frank Cwitkowitz, Zhiyao Duan
In this paper, we conduct a cross-dataset study on parametric and non-parametric raw-waveform based speaker embeddings through speaker verification experiments. In general, we observe a more significant performance degradation of these raw-waveform systems compared to spectral based systems. We then propose two strategies to improve the performance of raw-waveform based systems on cross-dataset tests. The first strategy is to change the real-valued filters into analytic filters to ensure shift-invariance. The second strategy is to apply variational dropout to non-parametric filters to prevent them from overfitting irrelevant nuisance features.
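The first strategy, replacing real-valued filters with analytic ones, can be illustrated with the standard discrete construction that zeroes the negative frequencies of the filter's spectrum; the FIR framing here is an illustrative assumption, not the paper's exact front-end.

```python
import numpy as np

def analytic_filter(h):
    # Build the analytic counterpart of a real FIR filter by zeroing
    # negative frequencies (the discrete Hilbert-transform construction).
    # The magnitude of the complex filter output varies smoothly under
    # input shifts, which is the shift-invariance property targeted here.
    n = len(h)
    H = np.fft.fft(h)
    mask = np.zeros(n)
    mask[0] = 1.0
    if n % 2 == 0:
        mask[n // 2] = 1.0
        mask[1:n // 2] = 2.0
    else:
        mask[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(H * mask)
```

The real part of the analytic filter recovers the original filter exactly, so the construction adds an imaginary (quadrature) component without altering the real response.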
Steven Fenton
This paper outlines the procedure for the effective modelling of a complex analogue filter circuit. The Fender Bassman 5F6-A is a circuit commonly employed in guitar amplifiers to shape the tonal characteristics of the amplifier output. On first inspection this circuit may look rather simple; however, the controls are not orthogonal, resulting in complicated filter coefficients as the controls are varied. This in turn can make the circuit difficult to analyse without mathematical emulation tools such as PSPICE or MATLAB. First the circuit is described, a method of analysis is proposed, and general expressions for the continuous-time coefficients are given. A MATLAB model is then produced, and its frequency responses are shown.
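Once continuous-time coefficients like those derived in the paper are in hand, plotting a frequency response needs no circuit simulator; a minimal sketch is given below (the polynomial-coefficient interface is an assumption for illustration, not the paper's MATLAB code, and the RC values in the usage note are arbitrary).

```python
import numpy as np

def freq_response(num, den, f):
    # Evaluate H(s) = num(s) / den(s) on the imaginary axis s = j*2*pi*f,
    # with num and den given as polynomial coefficients in s
    # (highest order first), as one would for derived tone-stack
    # coefficients.
    s = 2j * np.pi * np.asarray(f, dtype=float)
    return np.polyval(num, s) / np.polyval(den, s)
```

For example, a first-order RC low-pass H(s) = 1/(1 + sRC) evaluated this way shows unity gain at DC and the expected -3 dB point at f = 1/(2*pi*RC).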
Roy Fejgin, Janusz Klejsa, Lars Villemoes et al.
We consider source coding of audio signals with the help of a generative model. We use a construction where a waveform is first quantized, yielding a finite bitrate representation. The waveform is then reconstructed by random sampling from a model conditioned on the quantized waveform. The proposed coding scheme is theoretically analyzed. Using SampleRNN as the generative model, we demonstrate that the proposed coding structure provides performance competitive with state-of-the-art source coding tools for specific categories of audio signals.
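The two-stage structure, quantize to a finite bitrate and then reconstruct by sampling conditioned on the quantized waveform, can be sketched with a toy stand-in for the generative stage; the uniform mid-rise quantizer and the conditional Gaussian below are illustrative assumptions, with the Gaussian standing in for a learned model such as SampleRNN.

```python
import numpy as np

def quantize(x, bits=8):
    # Uniform mid-rise quantizer on [-1, 1): the finite-bitrate stage.
    levels = 2 ** bits
    idx = np.clip(np.floor((x + 1.0) / 2.0 * levels), 0, levels - 1)
    return idx.astype(int)

def dequantize(idx, bits=8):
    # Map indices back to level centers.
    levels = 2 ** bits
    return (idx + 0.5) / levels * 2.0 - 1.0

def generative_decode(idx, bits=8, sigma=0.0, rng=None):
    # Toy stand-in for the conditional generative decoder: sample a
    # reconstruction around the quantized value. A real system would
    # draw from a learned model conditioned on the quantized waveform.
    rng = rng or np.random.default_rng(0)
    xq = dequantize(idx, bits)
    return xq + sigma * rng.standard_normal(xq.shape)
```

With `sigma=0` this collapses to plain dequantization; the generative stage's role is to replace that deterministic reconstruction with a plausible sample, trading exactness for naturalness.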
Page 2 of 4