Results for "eess.AS"

Showing 20 of ~72 results · from CrossRef, arXiv

arXiv Open Access 2026
Disentangling Pitch and Creak for Speaker Identity Preservation in Speech Synthesis

Frederik Rautenberg, Jana Wiechmann, Petra Wagner et al.

We introduce a system capable of faithfully modifying the perceptual voice quality of creak while preserving the speaker's perceived identity. While it is well known that high creak probability is typically correlated with low pitch, it is important to note that this is a property observed on a population of speakers but does not necessarily hold across all situations. Disentanglement of pitch from creak is achieved by augmentation of the training dataset of a speech synthesis system with a speaker manipulation block based on conditional continuous normalizing flow. The experiments show greatly improved speaker verification performance over a range of creak manipulation strengths.

en eess.AS
arXiv Open Access 2025
The Overview of Segmental Durations Modification Algorithms on Speech Signal Characteristics

Kyeomeun Jang, Jiaying Li, Yinuo Wang

This paper thoroughly evaluates and analyzes several mainstream algorithms that can arbitrarily modify the duration of any portion of a given speech signal without changing the essential properties (e.g., pitch contour, power spectrum) of the original signal. Arbitrary modification in this context means that the duration of any region of the signal can be changed by specifying the starting and ending times for modification, or the target duration of the specified interval, which can be either a fixed duration in the time domain or a scaling factor of the original duration. In addition, arbitrary modification also means that any number of intervals can be modified at the same time.
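The bookkeeping described in the abstract — converting a target duration for a specified interval into a scaling factor and splicing the modified region back — can be sketched as below. This is a naive illustration using linear interpolation, which does not preserve pitch the way the surveyed algorithms (e.g., WSOLA or phase-vocoder methods) do; the function name and signature are ours, not from the paper.

```python
import numpy as np

def stretch_region(signal, sr, start_s, end_s, target_s=None, factor=None):
    """Change the duration of signal[start_s:end_s] (seconds) either to a
    fixed target duration or by a scaling factor.  Naive linear-interpolation
    resampling: illustration of the interface only, NOT pitch-preserving."""
    start, end = int(start_s * sr), int(end_s * sr)
    region = signal[start:end]
    if factor is None:
        # a fixed target duration is converted into a scaling factor
        factor = (target_s * sr) / len(region)
    new_len = int(round(len(region) * factor))
    # resample the region onto new_len points, then splice it back
    x_old = np.linspace(0.0, 1.0, num=len(region))
    x_new = np.linspace(0.0, 1.0, num=new_len)
    stretched = np.interp(x_new, x_old, region)
    return np.concatenate([signal[:start], stretched, signal[end:]])
```

Multiple intervals can be handled by applying this repeatedly, processing regions from the end of the signal backwards so earlier start/end indices stay valid.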

en eess.AS, eess.SP
arXiv Open Access 2025
Melodic and Metrical Elements of Expressiveness in Hindustani Vocal Music

Yash Bhake, Ankit Anand, Preeti Rao

This paper presents an attempt to study the aesthetics of North Indian Khayal music with reference to the flexibility exercised by artists in performing popular compositions. We study expressive timing and pitch variations of the given lyrical content within and across performances and propose computational representations that can discriminate between different performances of the same song in terms of expression. We present the necessary audio processing and annotation procedures, and discuss our observations and insights from the analysis of a dataset of two songs in two ragas each rendered by ten prominent artists.

en eess.AS, cs.SD
arXiv Open Access 2025
Live Vocal Extraction from K-pop Performances

Yujin Kim, Richa Namballa, Magdalena Fuentes

K-pop's global success is fueled by its dynamic performances and vibrant fan engagement. Inspired by K-pop fan culture, we propose a methodology for automatically extracting live vocals from performances. We use a combination of source separation, cross-correlation, and amplitude scaling to automatically remove pre-recorded vocals and instrumentals from a live performance. Our preliminary work introduces the task of live vocal separation and provides a foundation for future research in this topic.
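The cross-correlation and amplitude-scaling steps mentioned in the abstract can be sketched as follows. This is a hypothetical simplification (function name and least-squares scaling rule are ours): align the known pre-recorded track to the live mix via the cross-correlation peak, estimate its amplitude, and subtract it, leaving the live vocals.

```python
import numpy as np

def remove_prerecorded(live_mix, studio_track):
    """Estimate the lag and amplitude of a known pre-recorded track inside a
    live recording, then subtract it (simplified sketch)."""
    # 1) cross-correlation peak gives the alignment lag
    corr = np.correlate(live_mix, studio_track, mode="full")
    lag = int(np.argmax(corr)) - (len(studio_track) - 1)
    # 2) place the studio track at the estimated lag
    aligned = np.zeros_like(live_mix)
    src = studio_track[max(0, -lag):]
    dst0 = max(0, lag)
    n = min(len(src), len(live_mix) - dst0)
    aligned[dst0:dst0 + n] = src[:n]
    # 3) least-squares amplitude scale, then subtract
    scale = np.dot(live_mix, aligned) / (np.dot(aligned, aligned) + 1e-12)
    return live_mix - scale * aligned, lag, scale
```

A real system would additionally apply source separation first and operate per-frame, since lag and gain drift over a performance.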

en eess.AS, cs.SD
arXiv Open Access 2025
Group-Aware Partial Model Merging for Children's Automatic Speech Recognition

Thomas Rolland, Alberto Abad

While supervised fine-tuning of adult pre-trained models for children's ASR has shown promise, it often fails to capture group-specific characteristics and variations among children. To address this, we introduce GRoup-Aware PARtial model Merging (GRAPAM), a parameter-efficient approach that combines unsupervised clustering, partial fine-tuning, and model merging. Our approach adapts adult pre-trained models to children's speech by first grouping the children's data based on acoustic similarity. Each group is used to partially fine-tune an adult pre-trained model, and the resulting models are merged at the parameter level. Experiments conducted on the MyST children's speech corpus indicate that GRAPAM achieves a relative WER improvement of 6% using the same amount of data, outperforming full fine-tuning while training fewer parameters.
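The "merged at the parameter level" step can be illustrated with the most common simple merging scheme, weighted averaging of the per-group fine-tuned checkpoints. This is a sketch of the general technique, not necessarily the paper's exact merge rule; the function name is ours.

```python
import numpy as np

def merge_models(state_dicts, weights=None):
    """Parameter-level merge of several fine-tuned models by (weighted)
    averaging of same-named parameters across checkpoints."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        # every checkpoint shares the architecture, so names line up
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged
```

In a partial fine-tuning setting, only the subset of parameters that was actually adapted per group differs between checkpoints; the rest is identical and averaging leaves it unchanged.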

en eess.AS
arXiv Open Access 2024
The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge

Ya Jiang, Hongbo Lan, Jun Du et al.

In a two-person conversation where one participant wears smart glasses, transcribing and displaying the speaker's content in real time is an intriguing application, providing a priori information for subsequent tasks such as translation and comprehension. Meanwhile, multi-modal data captured from smart glasses is scarce. Therefore, we propose utilizing simulated data with multiple overlap rates and a one-to-one matching training strategy to narrow the deviation between real and simulated data during model training. In addition, incorporating IMU data into the model helps the audio achieve better real-time speech recognition performance.

en eess.AS, cs.SD
arXiv Open Access 2024
USM RNN-T model weights binarization

Oleg Rybakov, Dmitriy Serdyuk, Chengjian Zheng

Large-scale universal speech models (USM) are already used in production. However, as the model size grows, the serving cost grows too. The serving cost of large models is dominated by model size, which is why model size reduction is an important research topic. In this work we focus on model size reduction using weights-only quantization. We present weights binarization of the USM Recurrent Neural Network Transducer (RNN-T) and show that its model size can be reduced by a factor of 15.9 at the cost of a word error rate (WER) increase of only 1.9% relative to the float32 model, making the approach attractive for practical applications.
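Weights binarization generally means storing only the sign of each weight plus a small number of floating-point scales. A common scheme (XNOR-Net-style, per-row scale alpha = mean(|W|)) is sketched below as an illustration; the paper's actual USM RNN-T quantization recipe may differ, and the function names are ours.

```python
import numpy as np

def binarize_weights(w):
    """1-bit weight quantization with a per-output-row scale:
    W is approximated as alpha * sign(W), alpha = mean(|W|) per row."""
    alpha = np.mean(np.abs(w), axis=1, keepdims=True)  # one float per row
    b = np.where(w >= 0, 1.0, -1.0)                    # one bit per weight
    return b, alpha

def dequantize(b, alpha):
    """Reconstruct the approximate float weights for inference."""
    return alpha * b
```

Per binarized matrix the ideal storage ratio is close to 32x (32-bit floats down to 1 bit plus a few scales); an end-to-end figure such as the paper's 15.9x also accounts for whatever parts of the model stay in higher precision.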

en eess.AS, cs.SD
arXiv Open Access 2024
Distance Based Single-Channel Target Speech Extraction

Runwu Shi, Benjamin Yen, Kazuhiro Nakadai

This paper aims to achieve single-channel target speech extraction (TSE) in enclosures by solely utilizing distance information. This is the first work that utilizes only distance cues, without using speaker physiological information, for single-channel TSE. Inspired by recent single-channel distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance information with time-frequency (TF) bins for TSE. Experimental results in both single-room and multi-room scenarios demonstrate the feasibility and effectiveness of our approach. This method can also be employed to estimate the distances of different speakers in mixed speech. Online demos are available at https://runwushi.github.io/distance-demo-page.

en eess.AS, cs.SD
CrossRef Open Access 2023
Identification model for weak areas of transient energy balance in EESs based on dynamic grid partitioning

Shi Qiu, Kun Zhang, Zhuo Chen et al.

In response to the high uncertainty of large-scale new energy output in the electrical energy system (EES) and the weak controllability of energy output at multiple time scales, which have an increasingly complex impact on the weak areas of transient energy balance in the sending-end network, this paper proposes an identification model for weak areas of transient energy balance in EESs based on dynamic grid partitioning. First, the accumulation of port energy during transient faults and the propagation mechanism of port energy in the sending-end system were studied, and an EES transient energy propagation mechanism model was established. Then, considering the energy-balance support requirements of nodes, an EES grid partitioning model was established. Afterward, based on the characteristics of transient energy propagation and the grid partitioning model, an identification model for weak areas of transient energy balance in EESs was constructed. Finally, numerical simulations were conducted on actual operating data, and the results showed that the proposed model can meet the requirements for transient stability analysis and transient energy-balance characteristic analysis during actual operation of power grids.

arXiv Open Access 2023
Airborne Sound Analysis for the Detection of Bearing Faults in Railway Vehicles with Real-World Data

Matthias Kreuzer, David Schmidt, Simon Wokusch et al.

In this paper, we address the challenging problem of detecting bearing faults in railway vehicles by analyzing acoustic signals recorded during regular operation. For this, we introduce Mel-Frequency Cepstral Coefficients (MFCCs) as features, which form the input to a simple Multi-Layer Perceptron classifier. The proposed method is evaluated with real-world data obtained from state-of-the-art commuter railway vehicles in a measurement campaign. The experiments show that, with the chosen MFCC features, bearing faults can be reliably detected even for bearing damages that were not included in training.

en eess.AS, cs.LG
arXiv Open Access 2023
Accurate Real-Time Estimation of 2-Dimensional Direction of Arrival using a 3-Microphone Array

Anton Kovalyov, Kashyap Patel, Issa Panahi

This paper presents a method for real-time estimation of 2-dimensional direction of arrival (2D-DOA) of one or more sound sources using a nonlinear array of three microphones. 2D-DOA is estimated employing frame-level time difference of arrival (TDOA) measurements. Unlike conventional methods, which infer location parameters from TDOAs using a theoretical model, we propose a more practical approach based on supervised learning. The proposed model employs nearest neighbor search (NNS) applied to a spherical Fibonacci lattice consisting of TDOA to 2D-DOA mappings learned directly in the field. Filtering and clustering post-processors are also introduced for improved source detection and localization robustness.
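The two ingredients named in the abstract — a spherical Fibonacci lattice of candidate directions and brute-force nearest-neighbor search over stored TDOA-to-2D-DOA mappings — can be sketched as below. This is an illustration of the general technique under our own function names, not the paper's implementation.

```python
import math

def fibonacci_sphere(n):
    """n near-uniform points on the unit sphere (spherical Fibonacci
    lattice); each point can store a learned TDOA -> 2D-DOA mapping."""
    pts = []
    golden = math.pi * (3.0 - math.sqrt(5.0))  # golden angle in radians
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n          # z uniform in (-1, 1)
        r = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden * i
        pts.append((r * math.cos(theta), r * math.sin(theta), z))
    return pts

def nearest(points, keys, query):
    """Brute-force NNS: return the lattice point whose stored key
    (e.g., a measured TDOA vector) is closest to the query."""
    best = min(range(len(points)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(keys[i], query)))
    return points[best]
```

In the learned-in-the-field setup the keys would be TDOA vectors measured at each lattice direction, so a frame-level TDOA measurement maps directly to a 2D-DOA without a theoretical array model.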

en eess.AS, cs.SD
arXiv Open Access 2023
A Study on Prosodic Entrainment in Relation to Therapist Empathy in Counseling Conversation

Dehua Tao, Tan Lee, Harold Chui et al.

Counseling is carried out as a spoken conversation between a therapist and a client. The empathy level expressed by the therapist is considered an important index of the quality of counseling and is often assessed by an observer or by the client. This research investigates the entrainment of speech prosody in relation to subjectively rated empathy. Experimental results show that entrainment of intensity is more influential on observed empathy than entrainment of pitch or speech rate in client-therapist interaction. The observer and the client perceive therapist empathy differently given the same entrained phenomena in pitch and intensity. The client's intention to adjust pitch variation and speech intensity is considered an indicator of the client's perception of counseling quality.

en eess.AS
arXiv Open Access 2022
A Neural Vocoder Based Packet Loss Concealment Algorithm

Yao Zhou, Changchun Bao

The packet loss problem seriously affects the quality of service in Voice over IP (VoIP) scenarios. In this paper, we investigate online receiver-based packet loss concealment, which is more portable and widely applicable. To ensure speech naturalness, rather than directly processing time-domain waveforms or separately reconstructing amplitudes and phases in the frequency domain, a flow-based neural vocoder is adopted to generate the substitution waveform for a lost packet from a Mel-spectrogram, which is in turn generated from the preceding content by a well-designed neural predictor. Furthermore, a waveform-similarity-based smoothing post-process is introduced to mitigate discontinuities in the speech and avoid artifacts. Experimental results show the outstanding performance of the proposed method.

en eess.AS, cs.SD
arXiv Open Access 2022
Fearless Steps Challenge Phase-1 Evaluation Plan

Aditya Joglekar, John H. L. Hansen

The Fearless Steps Challenge 2019 Phase-1 (FSC-P1) is the inaugural Challenge of the Fearless Steps Initiative hosted by the Center for Robust Speech Systems (CRSS) at the University of Texas at Dallas. The goal of this Challenge is to evaluate the performance of state-of-the-art speech and language systems for large task-oriented teams with naturalistic audio in challenging environments. Researchers may opt to participate in any one or more of the challenge tasks. Researchers may also choose to employ the FEARLESS STEPS corpus for other related speech applications. All participants are encouraged to submit their solutions and results for consideration in the ISCA INTERSPEECH-2019 special session.

en eess.AS, cs.SD
arXiv Open Access 2022
AID: Open-source Anechoic Interferer Dataset

Philipp Götz, Cagdas Tuna, Andreas Walther et al.

A dataset of anechoic recordings of various sound sources encountered in domestic environments is presented. The dataset is intended to be a resource of non-stationary, environmental noise signals that, when convolved with acoustic impulse responses, can be used to simulate complex acoustic scenes. Additionally, a Python library is provided to generate random mixtures of the recordings in the dataset, which can be used as non-stationary interference signals.

en eess.AS, cs.SD
arXiv Open Access 2022
On the relevance of language in speaker recognition

Antonio Satue-Villar, Marcos Faundez-Zanuy

This paper presents a new database collected from a set of 49 bilingual speakers in two different languages: Spanish and Catalan. Phonetically, there are significant differences between the two languages. These differences have allowed us to establish several conclusions on the relevance of language in speaker recognition, using two methods: vector quantization and covariance matrices.

en eess.AS, cs.LG
arXiv Open Access 2020
Acoustic scene classification in DCASE 2020 Challenge: generalization across devices and low complexity solutions

Toni Heittola, Annamaria Mesaros, Tuomas Virtanen

This paper presents the details of Task 1: Acoustic Scene Classification in the DCASE 2020 Challenge. The task consists of two subtasks: classification of data from multiple devices, requiring good generalization properties, and classification using low-complexity solutions. Here we describe the datasets and baseline systems. After the challenge submission deadline, challenge results and analysis of the submissions will be added.

en eess.AS
arXiv Open Access 2020
Audio Inpainting: Revisited and Reweighted

Ondřej Mokrý, Pavel Rajmic

We deal with the problem of sparsity-based audio inpainting, i.e. filling in the missing segments of audio. A consequence of the approaches based on mathematical optimization is the insufficient amplitude of the signal in the filled gaps. Remaining in the framework based on sparsity and convex optimization, we propose improvements to audio inpainting, aiming at compensating for such an energy loss. The new ideas are based on different types of weighting, both in the coefficient and the time domains. We show that our propositions improve the inpainting performance in terms of both the SNR and ODG.

en eess.AS, cs.SD
arXiv Open Access 2019
Multi-Frame Cross-Entropy Training for Convolutional Neural Networks in Speech Recognition

Tom Sercu, Neil Mallinar

We introduce Multi-Frame Cross-Entropy training (MFCE) for convolutional neural network acoustic models. Recognizing that, like RNNs, CNNs are by nature sequence models that take variable-length inputs, we propose to take as input to the CNN a portion of an utterance long enough that multiple labels are predicted at once, thereby obtaining a cross-entropy loss signal from multiple adjacent frames. This drastically increases the amount of label information at small marginal computational cost. We show large WER improvements on hub5 and rt02 after training on the 2000-hour Switchboard benchmark.
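The core idea — one forward pass yielding cross-entropy loss from several adjacent frames instead of one — can be sketched as a loss function over per-frame logits. This is an illustration of the technique, not the paper's code; names are ours.

```python
import numpy as np

def multi_frame_ce(logits, labels):
    """Cross-entropy summed over T adjacent output frames.
    logits: (T, C) class scores for T frames; labels: (T,) target ids.
    One forward pass thus contributes T label signals instead of one."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].sum()
```

With single-frame cross-entropy the same T labels would require T forward passes over heavily overlapping input windows; predicting them jointly amortizes that shared computation.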

en eess.AS, cs.CL

Page 4 of 4