Results for "eess.AS"

Showing 20 of ~72 results · from arXiv, CrossRef

arXiv Open Access 2026
Fast and Flexible Audio Bandwidth Extension via Vocos

Yatharth Sharma

We propose a Vocos-based bandwidth extension model that enhances audio at 8-48 kHz by generating missing high-frequency content. Inputs are resampled to 48 kHz and processed by a neural vocoder backbone, enabling a single network to support arbitrary upsampling ratios. A lightweight Linkwitz-Riley-inspired refiner merges the original low band with the generated high frequencies via a smooth crossover. On validation, the model achieves competitive log-spectral distance while running at a real-time factor of 0.0001 on an NVIDIA A100 GPU and 0.0053 on an 8-core CPU, demonstrating practical, high-quality BWE at extreme throughput.
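
The smooth crossover merge described above can be illustrated with a small sketch. This is not the paper's Linkwitz-Riley refiner; it is a hypothetical stand-in using complementary raised-cosine spectral masks, with `fc` and `width_hz` as assumed parameters:

```python
import numpy as np

def crossover_merge(low_band, generated, sr, fc, width_hz=500.0):
    """Merge the original low band with generated high frequencies
    via a smooth raised-cosine spectral crossover around fc (Hz).

    low_band, generated: equal-length 1-D signals at sample rate sr.
    Illustrative only; the paper uses a Linkwitz-Riley-inspired filter.
    """
    n = len(low_band)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)

    # Low-pass mask: 1 below the transition band, 0 above it,
    # half-cosine taper in between.
    t = np.clip((fc + width_hz / 2 - freqs) / width_hz, 0.0, 1.0)
    lo_mask = 0.5 - 0.5 * np.cos(np.pi * t)
    hi_mask = 1.0 - lo_mask  # complementary: masks sum to 1 everywhere

    merged = np.fft.irfft(np.fft.rfft(low_band) * lo_mask
                          + np.fft.rfft(generated) * hi_mask, n=n)
    return merged, lo_mask, hi_mask
```

Because the two masks sum to one at every frequency, feeding the same signal into both branches returns it unchanged, which is the sanity property a crossover should have.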

en eess.AS, cs.LG
arXiv Open Access 2026
Speech Emotion Recognition with ASR Integration

Yuanchao Li

Speech Emotion Recognition (SER) plays a pivotal role in understanding human communication, enabling emotionally intelligent systems, and serving as a fundamental component in the development of Artificial General Intelligence (AGI). However, deploying SER in real-world, spontaneous, and low-resource scenarios remains a significant challenge due to the complexity of emotional expression and the limitations of current speech and language technologies. This thesis investigates the integration of Automatic Speech Recognition (ASR) into SER, with the goal of enhancing the robustness, scalability, and practical applicability of emotion recognition from spoken language.

en eess.AS, cs.SD
arXiv Open Access 2025
IR-UWB Radar-Based Contactless Silent Speech Recognition with Attention-Enhanced Temporal Convolutional Networks

Sunghwa Lee, Jaewon Yu

Silent speech recognition (SSR) is a technology that recognizes speech content from non-acoustic speech-related biosignals. This paper utilizes an attention-enhanced temporal convolutional network architecture for contactless IR-UWB radar-based SSR, leveraging deep learning to learn discriminative representations directly from minimally processed radar signals. The architecture integrates temporal convolutions with self-attention and squeeze-and-excitation mechanisms to capture articulatory patterns. Evaluated on a 50-word recognition task using leave-one-session-out cross-validation, our approach achieves an average test accuracy of 91.1% compared to 74.0% for the conventional hand-crafted feature method, demonstrating significant improvement through end-to-end learning.
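
The squeeze-and-excitation mechanism mentioned here has a simple core: pool each channel over time, pass the pooled vector through a small bottleneck, and use a sigmoid gate to reweight the channels. A minimal NumPy sketch, with hypothetical weight shapes (bottleneck size `r`):

```python
import numpy as np

def squeeze_excite(x, w1, b1, w2, b2):
    """Squeeze-and-excitation over a (channels, time) feature map.

    Squeeze: global average pool over time -> one value per channel.
    Excite: bottleneck MLP + sigmoid -> per-channel gate in (0, 1).
    Scale: reweight each channel of x by its gate.
    Assumed shapes for C channels, bottleneck r:
    w1 (r, C), b1 (r,), w2 (C, r), b2 (C,).
    """
    s = x.mean(axis=1)                           # squeeze: (C,)
    h = np.maximum(w1 @ s + b1, 0.0)             # bottleneck + ReLU: (r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))  # sigmoid gate: (C,)
    return x * gate[:, None]                     # scale channels
```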

en eess.AS
arXiv Open Access 2025
Audio dequantization using instantaneous frequency

Vojtěch Kovanda, Pavel Rajmic

We present a dequantization method that employs a phase-aware regularizer, originally applied successfully to the audio inpainting problem. The method promotes temporal continuity of sinusoidal components in the time-frequency representation of the audio signal, and avoids the energy-loss artifacts commonly encountered with l1-based regularization approaches. The proposed method is called the Phase-Aware Audio Dequantizer (PHADQ). It is evaluated against the state of the art using the SDR and PEMO-Q ODG objective metrics, and a subjective MUSHRA-like test.
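
A key constraint in any regularized dequantizer is consistency: the refined signal must stay inside the quantization cells implied by the observed quantized samples. The sketch below shows only that feasibility projection; the phase-aware regularizer itself is beyond this illustration:

```python
import numpy as np

def quantize(x, step):
    """Uniform mid-tread quantizer."""
    return step * np.round(x / step)

def project_to_cells(estimate, quantized, step):
    """Project a refined estimate back into the quantization cells.

    Each output sample must lie within +/- step/2 of its quantized
    value; a dequantizer such as PHADQ alternates between applying
    its regularizer and enforcing this constraint (illustrative).
    """
    return np.clip(estimate, quantized - step / 2, quantized + step / 2)
```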

en eess.AS
arXiv Open Access 2024
Crowdsourced Multilingual Speech Intelligibility Testing

Laura Lechler, Kamil Wojcicki

With the advent of generative audio features, there is an increasing need for rapid evaluation of their impact on speech intelligibility. Beyond the existing laboratory measures, which are expensive and do not scale well, there has been comparatively little work on crowdsourced assessment of intelligibility. Standards and recommendations are yet to be defined, and publicly available multilingual test materials are lacking. In response to this challenge, we propose an approach for a crowdsourced intelligibility assessment. We detail the test design, the collection and public release of the multilingual speech data, and the results of our early experiments.

en eess.AS, cs.AI
arXiv Open Access 2024
Introduction to speech recognition

Gabriel Dauphin

This document contains lectures and practical experiments using Matlab, implementing a system that correctly classifies three words (one, two, and three) with the help of a very small database. To achieve this performance, it exploits specificities of speech modeling, powerful algorithms (dynamic time warping and Dijkstra's algorithm), and machine learning (nearest neighbor). The document also introduces some machine learning evaluation metrics.
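
The dynamic time warping core of such a classifier fits in a few lines. A minimal sketch (the document itself uses Matlab; this is a Python rendering of the standard algorithm, not the document's code):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    Classic O(len(a) * len(b)) dynamic program: each cell holds the
    cheapest cumulative alignment cost ending at that pair of frames.
    A nearest-neighbor word classifier picks the training utterance
    with the smallest DTW distance to the test utterance.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

Note that DTW absorbs local timing differences: a stretched copy of a sequence can still align at zero cost.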

en eess.AS, cs.CL
CrossRef Open Access 2024
Correlation between serum interleukin-6 levels and the severity of Enlist ENL Severity Scale (EESS) in Erythema Nodosum Leprosum (ENL) patients

Nahrisyah, Ramona Dumasari Lubis, Remenda Siregar

Introduction: Leprosy is a chronic infection caused by Mycobacterium leprae (M. leprae), for which no biological examination has yet been established as an early marker. This research aims to analyze the correlation between serum IL-6 levels and the severity of the ENLIST ENL Severity Scale (EESS) in ENL. Methods: A cross-sectional study was conducted on all leprosy patients with ENL who sought treatment at the Dermatology and Venereology Polyclinic, North Sumatra University Hospital, dr. Pirngadi General Hospital, and H. Adam Malik General Hospital Medan. ENL patients who were uncooperative, had infectious, autoimmune, or malignant diseases, or who were pregnant or breastfeeding were excluded. Data processing was carried out with statistical software, with a significance value of <0.05. Results: A total of 40 ENL patients were studied. The mean age was 33.10±12.23 years; most were male (70%), high-school educated (75%), and not working (95%). Mild EESS severity was more common than severe (57.5% vs 42.5%). The median IL-6 level was 64.65±34.06 in ENL patients and was higher in severe than in mild EESS (105.18±24.14 versus 50.08±16.84; p=0.103 and p=0.167, respectively). There was a very strong, significant positive correlation between serum IL-6 levels and EESS (r=0.813, p<0.05). Conclusion: There was a correlation between serum IL-6 levels and EESS. Further study of serum IL-6 measurement for early detection of ENL severity is needed.

arXiv Open Access 2023
Large Vocabulary Spontaneous Speech Recognition for Tigrigna

Ataklti Kahsu, Solomon Teferra

This thesis proposes and describes a research attempt at designing and developing a speaker-independent spontaneous automatic speech recognition system for Tigrigna. The acoustic model of the speech recognition system is developed using the Carnegie Mellon University automatic speech recognition development tool (Sphinx), while the SRILM tool is used for the development of the language model. Keywords: Automatic Speech Recognition, Tigrigna language

en eess.AS, cs.LG
arXiv Open Access 2022
ADPCM with nonlinear prediction

Marcos Faundez-Zanuy, Oscar Oliva-Suarez

Many speech coders are based on linear predictive coding (LPC); however, LPC cannot model the nonlinearities present in the speech signal, so there is growing interest in nonlinear techniques. In this paper we discuss ADPCM schemes with a nonlinear predictor based on neural nets, which yields an increase of 1-2.5 dB in SEGSNR over classical methods. We discuss both block-adaptive and sample-adaptive prediction.
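
The ADPCM idea is that only the prediction residual is quantized, with the decoder repeating the encoder's prediction from reconstructed samples. A toy sketch with a first-order linear predictor; the paper's neural-net predictor would replace the `pred` line (the coefficient `a` and step size are assumptions):

```python
def adpcm_codec(x, step=0.05, a=0.9):
    """Toy ADPCM with a first-order linear predictor (coefficient a).

    The encoder predicts each sample from its reconstructed past and
    quantizes only the residual; the decoder mirrors the same
    prediction, so both stay in sync. Because the residual is rounded
    to the nearest step, per-sample error is bounded by step/2.
    """
    codes, recon = [], []
    prev = 0.0                          # shared encoder/decoder state
    for s in x:
        pred = a * prev                 # linear prediction from the past
        code = round((s - pred) / step) # quantized residual
        prev = pred + code * step       # reconstruction (decoder mirror)
        codes.append(code)
        recon.append(prev)
    return codes, recon
```

A better predictor shrinks the residuals, which is what lets a nonlinear predictor buy extra SEGSNR at the same bit rate.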

en eess.AS, cs.LG
arXiv Open Access 2022
Environment Classification via Blind Roomprints Estimation

Malte Baum, Luca Cuccovillo, Artem Yaroshchuk et al.

In this paper we present a novel approach for environment classification for speech recordings, which does not require the selection of decaying reverberation tails. It is based on a multi-band RT60 analysis of blind channel estimates and achieves an accuracy of up to 93.6% on test recordings derived from the ACE corpus.

en eess.AS, cs.SD
arXiv Open Access 2022
Invertible Voice Conversion

Zexin Cai, Ming Li

In this paper, we propose an invertible deep learning framework called INVVC for voice conversion. It is designed against the possible threats that inherently come along with voice conversion systems. Specifically, we develop an invertible framework that makes the source identity traceable. The framework is built on a series of invertible 1×1 convolutions and flows consisting of affine coupling layers. We apply the proposed framework to one-to-one and many-to-one voice conversion using parallel training data. Experimental results show that this approach yields impressive performance on voice conversion and, moreover, the converted results can be reversed back to the source inputs using the same parameters as in the forward conversion.
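
The exact invertibility comes from the affine coupling construction: half the vector passes through unchanged and parameterizes an affine transform of the other half, so the inverse is closed-form. A minimal sketch with a hypothetical linear conditioning net (`w`, `b` are assumed parameters, not INVVC's):

```python
import numpy as np

def coupling_forward(x, w, b):
    """Affine coupling: x1 passes through and produces a (log-)scale
    and shift that transform x2. For input dim d, assume w is (d, d/2)
    and b is (d,), split into log_s and t of size d/2 each."""
    x1, x2 = np.split(x, 2)
    log_s, t = np.split(w @ x1 + b, 2)
    y2 = x2 * np.exp(log_s) + t
    return np.concatenate([x1, y2])

def coupling_inverse(y, w, b):
    """Invert with the same parameters: since y1 == x1, the scale and
    shift can be recomputed and undone exactly."""
    y1, y2 = np.split(y, 2)
    log_s, t = np.split(w @ y1 + b, 2)
    x2 = (y2 - t) * np.exp(-log_s)
    return np.concatenate([y1, x2])
```

Stacking such layers (with 1×1 convolutions permuting which half passes through) keeps the whole network invertible, which is what lets a converted utterance be traced back to its source.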

en eess.AS, cs.LG
arXiv Open Access 2021
Golos: Russian Dataset for Speech Research

Nikolay Karpov, Alexander Denisenko, Fedor Minkin

This paper introduces Golos, a novel Russian speech dataset: a large corpus suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours. We have made the corpus freely available for download, along with an acoustic model trained on it with CTC loss. Additionally, transfer learning was applied to improve the performance of the acoustic model. To evaluate the quality of the dataset with the beam-search algorithm, we built a 3-gram language model on the open Common Crawl dataset. The resulting word error rates (WER) are about 3.3% and 11.5%.
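
The WER figures quoted above are word-level Levenshtein distance (substitutions + deletions + insertions) normalized by the reference length. A minimal sketch of the standard computation:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(ref),
    computed as word-level edit distance between reference and
    hypothesis transcripts."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```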

en eess.AS
arXiv Open Access 2021
Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition

Somshubra Majumdar, Jagadeesh Balam, Oleksii Hrinchuk et al.

We propose Citrinet, a new end-to-end convolutional Connectionist Temporal Classification (CTC) based automatic speech recognition (ASR) model. Citrinet is a deep residual neural model that uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation. The resulting architecture significantly reduces the gap between non-autoregressive models and sequence-to-sequence and transducer models. We evaluate Citrinet on the LibriSpeech, TED-LIUM2, AISHELL-1, and Multilingual LibriSpeech (MLS) English speech datasets. Citrinet's accuracy on these datasets is close to that of the best autoregressive transducer models.
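
What makes CTC models non-autoregressive is the decoding rule: each frame is classified independently, then repeats are collapsed and blanks dropped. A minimal greedy decoder, not Citrinet-specific, assuming per-frame argmax label ids with blank id 0:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Greedy CTC decoding: collapse consecutive repeated labels,
    then drop blanks. Input is the per-frame best label id; a blank
    between two identical labels keeps them as separate tokens."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out
```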

en eess.AS
arXiv Open Access 2021
What is the ground truth? Reliability of multi-annotator data for audio tagging

Irene Martin-Morato, Annamaria Mesaros

Crowdsourcing has become a common approach for annotating large amounts of data. It has the advantage of harnessing a large workforce to produce large amounts of data in a short time, but comes with the disadvantage of employing non-expert annotators with different backgrounds. This raises the problem of data reliability, in addition to the general question of how to combine the opinions of multiple annotators in order to estimate the ground truth. This paper presents a study of the annotations and annotators' reliability for audio tagging. We adapt the use of Krippendorff's alpha and multi-annotator competence estimation (MACE) for a multi-labeled data scenario, and present how MACE can be used to estimate a candidate ground truth based on annotations from non-expert users with different levels of expertise and competence.
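
Krippendorff's alpha compares observed disagreement against the disagreement expected by chance. A minimal implementation for the simple nominal, single-label case (the paper's multi-label adaptation is more involved):

```python
from collections import Counter

def krippendorff_alpha_nominal(items):
    """Krippendorff's alpha for nominal labels.

    items: list of per-item label lists, one label per annotator;
    items with fewer than two labels contribute no pairable values
    and are skipped. alpha = 1 means perfect agreement, 0 means
    agreement at chance level.
    """
    coincidence, totals = Counter(), Counter()
    for labels in items:
        m = len(labels)
        if m < 2:
            continue
        for i, a in enumerate(labels):
            totals[a] += 1
            for j, b in enumerate(labels):
                if i != j:
                    coincidence[(a, b)] += 1.0 / (m - 1)
    n = sum(totals.values())
    d_obs = sum(v for (a, b), v in coincidence.items() if a != b) / n
    d_exp = sum(totals[a] * totals[b]
                for a in totals for b in totals if a != b) / (n * (n - 1))
    return 1.0 if d_exp == 0 else 1.0 - d_obs / d_exp
```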

en eess.AS

Page 3 of 4