Results for "Oral communication. Speech"

Showing 20 of ~2,516,534 results · from arXiv, DOAJ, Semantic Scholar, CrossRef

arXiv Open Access 2026
Classification of Autistic and Non-Autistic Children's Speech: A Cross-Linguistic Study in Finnish, French, and Slovak

Sofoklis Kakouros, Ida-Lotta Myllylä

We present a cross-linguistic study of speech in autistic and non-autistic children speaking Finnish, French, and Slovak. We combine supervised classification with within-language and cross-corpus transfer experiments to evaluate classification performance within and across languages and to probe which acoustic cues are language-specific versus language-general. Using a large set of acoustic-prosodic features, we implement speaker-level classification benchmarks as an analytical tool rather than to seek state-of-the-art performance. Within-language models, evaluated with speaker-level cross-validation, yielded heterogeneous results. The Finnish model performed best (Accuracy 0.84, F1 0.88), followed by Slovak (Accuracy 0.63, F1 0.68) and French (Accuracy 0.68, F1 0.56). We then tested cross-language generalization. A model trained on all pooled corpora reached an overall Accuracy of 0.61 and F1 0.68. Leave-one-corpus-out experiments, which test transfer to an unseen language, showed moderate success when testing on Slovak (F1 0.70) and Finnish (F1 0.78), but poor transfer to French (F1 0.42). Feature-importance analyses across languages highlighted partially shared, but not fully language-invariant, acoustic markers of autism. These findings suggest that some autism-related speech cues generalize across typologically distinct languages, but robust cross-linguistic classifiers will likely require language-aware modeling and more homogeneous recording conditions.
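For readers who want to picture the evaluation setup, the sketch below shows speaker-level cross-validation with grouped folds, as the abstract describes. It assumes scikit-learn; the feature matrix, speaker grouping, and labels are placeholders, not the paper's corpora or exact feature set.

```python
# Minimal sketch of speaker-level cross-validated classification on
# acoustic-prosodic features; the feature matrix, speaker IDs, and labels
# are placeholders, not the paper's actual data or feature set.
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 88))        # e.g. one eGeMAPS-style vector per utterance
groups = np.repeat(np.arange(40), 5)  # 40 speakers, 5 utterances each
y = (groups % 2).astype(int)          # placeholder autistic / non-autistic labels

# GroupKFold keeps all utterances of a speaker in the same fold, so the
# classifier is always evaluated on unseen speakers.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, groups=groups,
                         cv=GroupKFold(n_splits=5), scoring="f1")
print("speaker-level CV F1: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```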

en eess.AS
arXiv Open Access 2025
Interpreting the Role of Visemes in Audio-Visual Speech Recognition

Aristeidis Papadopoulos, Naomi Harte

Audio-Visual Speech Recognition (AVSR) models have surpassed their audio-only counterparts in terms of performance. However, the interpretability of AVSR systems, particularly the role of the visual modality, remains under-explored. In this paper, we apply several interpretability techniques to examine how visemes are encoded in AV-HuBERT, a state-of-the-art AVSR model. First, we use t-distributed Stochastic Neighbour Embedding (t-SNE) to visualize learned features, revealing natural clustering driven by visual cues, which is further refined by the presence of audio. Then, we employ probing to show how audio contributes to refining feature representations, particularly for visemes that are visually ambiguous or under-represented. Our findings shed light on the interplay between modalities in AVSR and could point to new strategies for leveraging visual information to improve AVSR performance.
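A minimal illustration of the t-SNE step mentioned in the abstract; the frame-level features and viseme labels here are synthetic stand-ins rather than activations extracted from AV-HuBERT, and scikit-learn and matplotlib are assumed.

```python
# Minimal sketch of the t-SNE step: project frame-level features to 2-D and
# colour them by viseme label. Features and labels are synthetic stand-ins.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_visemes, frames_per_viseme, dim = 5, 100, 256
features = np.concatenate([rng.normal(loc=i, size=(frames_per_viseme, dim))
                           for i in range(n_visemes)])
labels = np.repeat(np.arange(n_visemes), frames_per_viseme)

embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=5)
plt.title("t-SNE of frame-level features by viseme (synthetic example)")
plt.show()
```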

en eess.AS
arXiv Open Access 2025
Improving Resource-Efficient Speech Enhancement via Neural Differentiable DSP Vocoder Refinement

Heitor R. Guimarães, Ke Tan, Juan Azcarreta et al.

Deploying speech enhancement (SE) systems in wearable devices, such as smart glasses, is challenging due to the limited computational resources on the device. Although deep learning methods have achieved high-quality results, their computational cost limits their feasibility on embedded platforms. This work presents an efficient end-to-end SE framework that leverages a Differentiable Digital Signal Processing (DDSP) vocoder for high-quality speech synthesis. First, a compact neural network predicts enhanced acoustic features from noisy speech: spectral envelope, fundamental frequency (F0), and periodicity. These features are fed into the DDSP vocoder to synthesize the enhanced waveform. The system is trained end-to-end with STFT and adversarial losses, enabling direct optimization at the feature and waveform levels. Experimental results show that our method improves intelligibility and quality by 4% (STOI) and 19% (DNSMOS) over strong baselines without significantly increasing computation, making it well-suited for real-time applications.
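The pipeline described here (a compact network predicting spectral envelope, F0, and periodicity, which drive a DDSP vocoder) can be pictured with a toy harmonic-plus-noise synthesizer like the one below. It is only a rough, non-differentiable sketch with placeholder features, not the paper's vocoder.

```python
# Toy harmonic-plus-noise synthesis from the three predicted feature streams
# (F0, periodicity, spectral envelope). Illustrative only; not the paper's
# differentiable implementation, and all inputs are placeholders.
import numpy as np

sr, hop = 16000, 80                       # 5 ms frames
f0 = np.full(200, 120.0)                  # placeholder per-frame F0 in Hz
periodicity = np.full(200, 0.8)           # 1.0 = fully voiced
envelope = np.ones((200, 64))             # flat placeholder spectral envelope

def synthesize(f0, periodicity, envelope, sr, hop, n_harmonics=40):
    f0_s = np.repeat(f0, hop)                      # frame rate -> sample rate
    per_s = np.repeat(periodicity, hop)
    env_s = np.repeat(envelope, hop, axis=0)
    phase = 2 * np.pi * np.cumsum(f0_s) / sr
    harmonic = np.zeros_like(f0_s)
    for k in range(1, n_harmonics + 1):
        freq = k * f0_s
        alive = freq < sr / 2                      # drop harmonics above Nyquist
        band = np.clip((freq / (sr / 2) * env_s.shape[1]).astype(int),
                       0, env_s.shape[1] - 1)
        amp = env_s[np.arange(len(f0_s)), band]    # envelope gain per harmonic
        harmonic += alive * amp * np.sin(k * phase)
    noise = np.random.default_rng(0).normal(size=len(f0_s))
    return per_s * harmonic / n_harmonics + (1 - per_s) * 0.05 * noise

audio = synthesize(f0, periodicity, envelope, sr, hop)
print(audio.shape)  # (200 * 80,) samples at 16 kHz
```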

en eess.AS, cs.SD
arXiv Open Access 2025
Open Source State-Of-the-Art Solution for Romanian Speech Recognition

Gabriel Pirlogeanu, Alexandru-Lucian Georgescu, Horia Cucu

In this work, we present a new state-of-the-art Romanian Automatic Speech Recognition (ASR) system based on NVIDIA's FastConformer architecture, explored here for the first time in the context of Romanian. We train our model on a large corpus of mostly weakly supervised transcriptions, totaling over 2,600 hours of speech. Leveraging a hybrid decoder with both Connectionist Temporal Classification (CTC) and Token-Duration Transducer (TDT) branches, we evaluate a range of decoding strategies including greedy, ALSD, and CTC beam search with a 6-gram token-level language model. Our system achieves state-of-the-art performance across all Romanian evaluation benchmarks, including read, spontaneous, and domain-specific speech, with up to a 27% relative WER reduction compared to previous best-performing systems. In addition to improved transcription accuracy, our approach demonstrates practical decoding efficiency, making it suitable for both research and deployment in low-latency ASR applications.
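Of the decoding strategies listed, greedy CTC decoding is the simplest; a hedged sketch follows, with a made-up vocabulary and logit matrix rather than actual FastConformer outputs.

```python
# Minimal sketch of greedy CTC decoding: take the argmax token per frame,
# collapse repeats, drop blanks. Vocabulary and logits are placeholders.
import numpy as np

vocab = ["<blank>", "a", "b", "c"]
BLANK = 0

def ctc_greedy_decode(logits):
    """logits: (T, V) array of per-frame scores."""
    best = logits.argmax(axis=1)
    out, prev = [], BLANK
    for t in best:
        if t != BLANK and t != prev:   # collapse repeats, skip blanks
            out.append(vocab[t])
        prev = t
    return "".join(out)

logits = np.log(np.array([[.1, .7, .1, .1],
                          [.1, .7, .1, .1],
                          [.6, .2, .1, .1],
                          [.1, .1, .7, .1]]))
print(ctc_greedy_decode(logits))   # -> "ab"
```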

en eess.AS, cs.AI
arXiv Open Access 2025
Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech

Jingyuan Xing, Zhipeng Li, Jialong Mai et al.

Advances in speech representation and large language models have enhanced zero-shot text-to-speech (TTS) performance. However, existing zero-shot TTS models face challenges in capturing the complex correlations between acoustic and semantic features, resulting in a lack of expressiveness and similarity. The primary reason lies in the complex relationship between semantic and acoustic features, which manifests both independent and interdependent aspects. This paper introduces a TTS framework that combines autoregressive (AR) and non-autoregressive (NAR) modules to harmonize the independence and interdependence of acoustic and semantic information. The AR model leverages the proposed Parallel Tokenizer to synthesize the top semantic and acoustic tokens simultaneously. In contrast, considering the interdependence, the Coupled NAR model predicts detailed tokens based on the general AR model's output. Parallel GPT, built on this architecture, is designed to improve zero-shot text-to-speech synthesis through its parallel structure. Experiments on English and Chinese datasets demonstrate that the proposed model significantly outperforms existing zero-shot TTS models in synthesis quality and efficiency. Speech demos are available at https://t1235-ch.github.io/pgpt/.

en eess.AS, cs.SD
DOAJ Open Access 2025
Can generative artificial intelligence act as a mediator in linguistic education?

M. N. Evstigneev

Importance. The rapid development and integration of AI technologies into the higher education system marks the beginning of one of the largest digital transformations. In the Russian Federation, this process is regulated by a number of state documents that lay the foundations for the digitalization of all spheres of life, including education. In this regard, the pedagogical community is actively exploring the didactic potential of AI, considering it as a tool for personalizing education, automating routine tasks, data analytics, and developing an AI-based educational environment. According to the CEFR, mediation is one of the key professional competencies of a modern foreign language teacher. However, in the context of the active integration of AI into education, a logical question arises: can generative AI take on the role of mediator? The purpose of the study is to develop a nomenclature of skills in foreign language mediation activities and to compare the quality of mediation of a foreign language teacher and of generative AI.

Research Methods. Theoretical methods: study and analysis of scientific and educational literature on the integration of AI technologies into the higher education system, the integration of AI technologies into foreign language teaching methods, and the development of mediation skills among pre-service foreign language teachers. Empirical methods: questionnaires to obtain initial data, observation of the activities of foreign language teachers and of AI in the mediation process, and a comparative analysis of the mediation results.

Definition of Concepts. The key concept in the study is AI mediation.

Results and Discussion. A nomenclature of mediation skills in accordance with three types of mediation is developed. Mediating a text: 1) the skill to carry out oral and written translation; 2) the skill to compress/expand a text/speech utterance; 3) the skill to transform a text (paraphrase); 4) the skill to adapt a text to the target audience; 5) the skill to transcode information (create and interpret infographics); 6) the skill to create bilingual glossaries. Mediating concepts: 7) the skill to formulate a goal and build the logic of a statement; 8) the skill to synthesize information from various sources; 9) the skill to establish causal relationships between new and existing information; 10) the skill to direct communication participants to find a common solution. Mediating communication: 11) the skill to create a positive atmosphere of communication; 12) the skill to focus attention on the subject of discussion; 13) the skill to manage the group dynamics of communication; 14) the skill to monitor compliance with the ethics of verbal communication; 15) the skill to prevent socio-cultural conflicts; 16) the skill to encourage respect for a communication partner; 17) the skill to choose a mediation strategy in accordance with the format of communication; 18) the skill to resolve professional and interpersonal conflicts; 19) the skill to take responsibility for the outcome of mediation.

Conclusion. The conducted research reveals a clear distribution of strengths between a foreign language teacher and AI. Generative AI demonstrates indisputable leadership in mediating a text. In mediating concepts, AI acts as the "architect of the text", in some respects not inferior to, and even superior to, the teacher. However, in mediating communication, the teacher has an insurmountable advantage based on human qualities.

Education (General), Philology. Linguistics
arXiv Open Access 2024
Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance

Tsubasa Ochiai, Kazuma Iwamoto, Marc Delcroix et al.

It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals the particularly detrimental effect of artifact errors on ASR performance compared to the other types of errors. This provides us with a more principled definition of processing distortions that cause the ASR performance degradation. Then, we study two practical approaches for reducing the impact of artifact errors. First, we prove that the simple observation adding (OA) post-processing (i.e., interpolating the enhanced and observed signals) can monotonically improve the signal-to-artifact ratio. Second, we propose a novel training objective, called artifact-boosted signal-to-distortion ratio (AB-SDR), which forces the model to estimate the enhanced signals with fewer artifact errors. Through experiments, we confirm that both the OA and AB-SDR approaches are effective in decreasing artifact errors caused by single-channel SE front-ends, allowing them to significantly improve ASR performance.
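The observation adding (OA) post-processing discussed in the abstract amounts to a simple interpolation of the enhanced and observed signals; the sketch below uses placeholder signals and an illustrative interpolation weight.

```python
# Minimal sketch of observation adding (OA) post-processing: interpolate the
# enhanced signal with the noisy observation before ASR. The signals and the
# interpolation weight here are placeholders.
import numpy as np

def observation_adding(enhanced, observed, weight=0.2):
    """Return (1 - weight) * enhanced + weight * observed, trading artifact
    errors for residual noise/interference errors."""
    return (1.0 - weight) * enhanced + weight * observed

rng = np.random.default_rng(0)
observed = rng.normal(size=16000)            # 1 s of noisy speech at 16 kHz
enhanced = 0.8 * observed                    # stand-in SE front-end output
asr_input = observation_adding(enhanced, observed, weight=0.2)
print(asr_input.shape)
```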

en eess.AS, cs.SD
arXiv Open Access 2024
More than words: Advancements and challenges in speech recognition for singing

Anna Kruspe

This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition. Singing encompasses unique challenges, including extensive pitch variations, diverse vocal styles, and background music interference. We explore key areas such as phoneme recognition, language identification in songs, keyword spotting, and full lyrics transcription. I will describe some of my own experiences when performing research on these tasks just as they were starting to gain traction, but will also show how recent developments in deep learning and large-scale datasets have propelled progress in this field. My goal is to illuminate the complexities of applying speech recognition to singing, evaluate current capabilities, and outline future research directions.

en cs.SD, cs.CL
arXiv Open Access 2024
Improving Generalization of Speech Separation in Real-World Scenarios: Strategies in Simulation, Optimization, and Evaluation

Ke Chen, Jiaqi Su, Taylor Berg-Kirkpatrick et al.

Achieving robust speech separation for overlapping speakers in various acoustic environments with noise and reverberation remains an open challenge. Although existing datasets are available to train separators for specific scenarios, they do not effectively generalize across diverse real-world scenarios. In this paper, we present a novel data simulation pipeline that produces diverse training data from a range of acoustic environments and content, and propose new training paradigms to improve the quality of a general speech separation model. Specifically, we first introduce AC-SIM, a data simulation pipeline that incorporates broad variations in both content and acoustics. Then we integrate multiple training objectives into permutation invariant training (PIT) to enhance the separation quality and generalization of the trained model. Finally, we conduct comprehensive objective and human listening experiments across separation architectures and benchmarks to validate our methods, demonstrating substantial improvements in generalization on both non-homologous and real-world test sets.
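Permutation invariant training (PIT), which the paper extends with multiple objectives, can be sketched as follows; SI-SDR is used here as a stand-in metric, the signals are synthetic, and the paper's combined objectives are not reproduced.

```python
# Minimal sketch of utterance-level permutation invariant training (PIT):
# score every assignment of estimated sources to references and keep the best.
import itertools
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    ref = ref - ref.mean()
    est = est - est.mean()
    proj = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref   # scaled reference
    noise = est - proj
    return 10 * np.log10((proj @ proj + eps) / (noise @ noise + eps))

def pit_score(estimates, references):
    """estimates, references: (n_src, n_samples). Returns the best mean SI-SDR
    and the permutation that achieves it."""
    n = len(references)
    return max(
        ((np.mean([si_sdr(estimates[p[i]], references[i]) for i in range(n)]), p)
         for p in itertools.permutations(range(n))),
        key=lambda x: x[0])

rng = np.random.default_rng(0)
refs = rng.normal(size=(2, 16000))
ests = refs[::-1] + 0.01 * rng.normal(size=(2, 16000))  # swapped, slightly noisy
score, perm = pit_score(ests, refs)
print(score, perm)   # high SI-SDR, permutation (1, 0)
```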

en cs.SD, cs.AI
arXiv Open Access 2024
A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives

Jan Lehečka, Josef V. Psutka, Luboš Šmídl et al.

In this paper, we are comparing monolingual Wav2Vec 2.0 models with various multilingual models to see whether we could improve speech recognition performance on a unique oral history archive containing a lot of mixed-language sentences. Our main goal is to push forward research on this unique dataset, which is an extremely valuable part of our cultural heritage. Our results suggest that monolingual speech recognition models are, in most cases, superior to multilingual models, even when processing the oral history archive full of mixed-language sentences from non-native speakers. We also performed the same experiments on the public CommonVoice dataset to verify our results. We are contributing to the research community by releasing our pre-trained models to the public.

en cs.CL, cs.AI
arXiv Open Access 2023
Classifying Rhoticity of /r/ in Speech Sound Disorder using Age-and-Sex Normalized Formants

Nina R Benway, Jonathan L Preston, Asif Salekin et al.

Mispronunciation detection tools could increase treatment access for speech sound disorders impacting, e.g., /r/. We show that age-and-sex normalized formant estimation outperforms a cepstral representation for detection of fully rhotic vs. derhotic /r/ in the PERCEPT-R Corpus. Gated recurrent neural networks trained on this feature set achieve a mean test participant-specific F1 score of .81 (σ = .10, median = .83, n = 48), with post hoc modeling showing no significant effect of child age or sex.
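The age-and-sex formant normalization can be pictured as a per-group z-score; the reference statistics in the sketch below are invented placeholders, not the normative values used with the PERCEPT-R Corpus.

```python
# Minimal sketch of age-and-sex formant normalization: z-score each formant
# against reference statistics for the child's age/sex group. The reference
# means and standard deviations are made-up placeholders.
import numpy as np

# hypothetical reference stats: {(age, sex): (means, stds)} for F1-F3 in Hz
REFERENCE = {
    (9, "F"): (np.array([520.0, 1750.0, 2900.0]), np.array([70.0, 160.0, 220.0])),
    (9, "M"): (np.array([500.0, 1650.0, 2750.0]), np.array([65.0, 150.0, 210.0])),
}

def normalize_formants(formants_hz, age, sex):
    mean, std = REFERENCE[(age, sex)]
    return (np.asarray(formants_hz) - mean) / std

print(normalize_formants([480.0, 1400.0, 2100.0], age=9, sex="M"))
```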

arXiv Open Access 2022
STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency

Zhong-Qiu Wang, Gordon Wichern, Shinji Watanabe et al.

Deep learning based speech enhancement in the short-time Fourier transform (STFT) domain typically uses a large window length such as 32 ms. A larger window can lead to higher frequency resolution and potentially better enhancement. This however incurs an algorithmic latency of 32 ms in an online setup, because the overlap-add algorithm used in the inverse STFT (iSTFT) is also performed using the same window size. To reduce this inherent latency, we adapt a conventional dual-window-size approach, where a regular input window size is used for STFT but a shorter output window is used for overlap-add, for STFT-domain deep learning based frame-online speech enhancement. Based on this STFT-iSTFT configuration, we employ complex spectral mapping for frame-online enhancement, where a deep neural network (DNN) is trained to predict the real and imaginary (RI) components of target speech from the mixture RI components. In addition, we use the DNN-predicted RI components to conduct frame-online beamforming, the results of which are used as extra features for a second DNN to perform frame-online post-filtering. The frequency-domain beamformer can be easily integrated with our DNNs and is designed to not incur any algorithmic latency. Additionally, we propose a future-frame prediction technique to further reduce the algorithmic latency. Evaluation on noisy-reverberant speech enhancement shows the effectiveness of the proposed algorithms. Compared with Conv-TasNet, our STFT-domain system can achieve better enhancement performance for a comparable amount of computation, or comparable performance with less computation, maintaining strong performance at an algorithmic latency as low as 2 ms.

en cs.SD, eess.AS
arXiv Open Access 2022
Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition

Piotr Żelasko, Siyuan Feng, Laureano Moro Velazquez et al.

The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to build ASR systems for these low-resource languages. While it has been shown that the pooling of resources from multiple languages is helpful, we have not yet seen a successful application of an ASR model to a language unseen during training. A crucial step in the adaptation of ASR from seen to unseen languages is the creation of the phone inventory of the unseen language. The ultimate goal of our work is to build the phone inventory of a language unseen during training in an unsupervised way without any knowledge about the language. In this paper, we 1) investigate the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language; 2) provide an analysis of which phones transfer well across languages and which do not in order to understand the limitations of and areas for further improvement for automatic phone inventory creation; and 3) present different methods to build a phone inventory of an unseen language in an unsupervised way. To that end, we conducted mono-, multi-, and crosslingual experiments on a set of 13 phonetically diverse languages and several in-depth analyses. We found a number of universal phone tokens (IPA symbols) that are well-recognized cross-linguistically. Through a detailed analysis of results, we conclude that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
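One simple, unsupervised way to turn crosslingual phone recognizer output into an inventory, in the spirit of the methods surveyed here, is to keep the IPA tokens that occur frequently enough in the decoded speech; the decoded sequences and the threshold below are placeholders.

```python
# Minimal sketch of frequency-based phone inventory building from the output
# of a crosslingual phone recognizer run on untranscribed speech.
from collections import Counter

decoded_utterances = [
    ["a", "t", "a", "m", "i"],
    ["m", "a", "t", "i", "ʃ"],
    ["a", "m", "i", "t", "a"],
]

def build_inventory(utterances, min_relative_freq=0.02):
    counts = Counter(tok for utt in utterances for tok in utt)
    total = sum(counts.values())
    return {tok for tok, c in counts.items() if c / total >= min_relative_freq}

print(sorted(build_inventory(decoded_utterances)))
```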

en cs.SD, cs.CL
arXiv Open Access 2021
Non-native English lexicon creation for bilingual speech synthesis

Arun Baby, Pranav Jawale, Saranya Vinnaitherthan et al.

Bilingual English speakers speak English as one of their languages. Their English is of a non-native kind, and their conversations are code-mixed. The intelligibility of a bilingual text-to-speech (TTS) system for such non-native English speakers depends on a lexicon that captures the phoneme sequences used by non-native speakers. However, due to the lack of a non-native English lexicon, existing bilingual TTS systems employ the widely available native English lexicons, in addition to their native language lexicon. Due to the inconsistency between the non-native English pronunciation in the audio and the native English lexicon in the text, the intelligibility of synthesized speech in such TTS systems is significantly reduced. This paper is motivated by the knowledge that the native language of the speaker highly influences non-native English pronunciation. We propose a generic approach to obtain rules based on letter-to-phoneme alignment to map a native English lexicon to its non-native version. The effectiveness of such a mapping is studied by comparing bilingual (Indian English and Hindi) TTS systems trained with and without the proposed rules. The subjective evaluation shows that the bilingual TTS system trained with the proposed non-native English lexicon rules obtains a 6% absolute improvement in preference.
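Applying the learned mapping rules to a lexicon reduces to phoneme substitution; the rules and entries in the sketch below are invented for illustration, since the paper derives its rules from letter-to-phoneme alignments rather than hand-writing them.

```python
# Minimal sketch of applying phoneme-substitution rules to a native English
# lexicon to obtain a non-native variant. Rules and entries are hypothetical.
NATIVE_LEXICON = {
    "water": ["w", "ao", "t", "er"],
    "three": ["th", "r", "iy"],
}

# hypothetical rules: native phoneme -> substitute used by non-native speakers
RULES = {"th": "t", "ao": "aa"}

def map_to_non_native(pron, rules):
    return [rules.get(p, p) for p in pron]

non_native_lexicon = {w: map_to_non_native(p, RULES)
                      for w, p in NATIVE_LEXICON.items()}
print(non_native_lexicon)
```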

en eess.AS, cs.CL
arXiv Open Access 2021
Speech Pattern based Black-box Model Watermarking for Automatic Speech Recognition

Haozhe Chen, Weiming Zhang, Kunlin Liu et al.

As an effective method for intellectual property (IP) protection, model watermarking has been applied to a wide variety of deep neural networks (DNNs), including speech classification models. However, how to design a black-box watermarking scheme for automatic speech recognition (ASR) models is still an unsolved problem, and such a scheme is in significant demand for protecting remote ASR Application Programming Interfaces (APIs) deployed on cloud servers. Due to the conditional independence assumption and the risk of label-detection-based evasion attacks on ASR models, black-box watermarking schemes designed for speech classification models cannot be applied to ASR models. In this paper, we propose the first black-box model watermarking framework for protecting the IP of ASR models. Specifically, we synthesize trigger audios by spreading speech clips of the model owners over the entire input audios and labeling the trigger audios with stego texts, which hide the authorship information via linguistic steganography. Experiments on the state-of-the-art open-source ASR system DeepSpeech demonstrate the feasibility of the proposed watermarking scheme, which is robust against five kinds of attacks and has little impact on accuracy.
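The trigger-audio construction can be pictured roughly as overlaying owner speech clips across a host utterance and pairing the result with a stego-text label; the sketch below is a toy version with placeholder signals, gain, and label text, not the paper's exact scheme or its linguistic steganography.

```python
# Toy sketch of the trigger-audio idea: spread short owner speech clips over a
# host utterance at low level and pair the result with a stego label text.
import numpy as np

rng = np.random.default_rng(0)
host = rng.normal(size=32000)            # 2 s host utterance at 16 kHz
owner_clip = rng.normal(size=1600)       # 0.1 s owner speech clip

def make_trigger(host, clip, n_copies=4, gain=0.05):
    trigger = host.copy()
    positions = np.linspace(0, len(host) - len(clip), n_copies).astype(int)
    for p in positions:                  # spread the clip across the utterance
        trigger[p:p + len(clip)] += gain * clip
    return trigger

trigger_audio = make_trigger(host, owner_clip)
stego_label = "the quick brown fox"      # placeholder stego text label
print(trigger_audio.shape, stego_label)
```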

en cs.SD, cs.CL

Page 13 of 125,827