Results for "Oral communication. Speech"

Showing 20 of ~2,511,915 results · from CrossRef, DOAJ, arXiv, Semantic Scholar

arXiv Open Access 2026
iMiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis

Sofoklis Kakouros, Fang Kang, Haoyu Chen

This work presents iMiGUE-Speech, an extension of the iMiGUE dataset that provides a spontaneous affective corpus for studying emotional and affective states. The new release focuses on speech and enriches the original dataset with additional metadata, including speech transcripts, speaker-role separation between interviewer and interviewee, and word-level forced alignments. Unlike existing emotional speech datasets that rely on acted or laboratory-elicited emotions, iMiGUE-Speech captures spontaneous affect arising naturally from real match outcomes. To demonstrate the utility of the dataset and establish initial benchmarks, we introduce two evaluation tasks for comparative assessment: speech emotion recognition and transcript-based sentiment analysis. These tasks leverage state-of-the-art pre-trained representations to assess the dataset's ability to capture spontaneous affective states from both acoustic and linguistic modalities. iMiGUE-Speech can also be synchronously paired with micro-gesture annotations from the original iMiGUE dataset, forming a uniquely multimodal resource for studying speech-gesture affective dynamics. The extended dataset is available at https://github.com/CV-AC/imigue-speech.

en eess.AS, cs.CL
S2 Open Access 2020
Hearing and speech benefits of cochlear implantation in children: A review of the literature.

Sunil D. Sharma, S. Cushing, B. Papsin et al.

Cochlear implantation is a safe and reliable treatment for children with severe to profound hearing loss. The primary benefit of these medical devices in children is the acquisition of hearing, which promotes development of spoken language. The present paper reviews published literature demonstrating predictive effects of a number of factors on acquisition of hearing development and speech recognition. Of the many variables that contribute to an individual child's development after implantation, age at implantation, the presence of medical comorbidities, social determinants of health, and the provision of bilateral versus unilateral hearing are those that can vary widely and have consistently shown clear impacts. Specifically, age of implantation is crucial to reduce effects of deafness on the developing auditory system and capture the remarkable plasticity of early development. Language development after cochlear implantation requires therapy emphasizing hearing and oral communication, education, and other support which can be influenced by known social determinants of health; specifically, outcomes in children decline with reductions in socioeconomic status and levels of parental education. Medical co-morbidities also slow rates of progress after cochlear implantation. On the other hand, benefits of implantation increase in children who are provided with access to hearing from both ears. In sum, cochlear implants promote development of hearing in children and the best outcomes are achieved by providing early access to sound in both ears. These benefits can be limited by known social determinants of health which restrict access to needed support and medical comorbidities which add further complexity in care and outcome.

184 citations en Medicine
arXiv Open Access 2025
Lightweight speech enhancement guided target speech extraction in noisy multi-speaker scenarios

Ziling Huang, Junnan Wu, Lichun Fan et al.

Target speech extraction (TSE) has achieved strong performance in relatively simple conditions such as one-speaker-plus-noise and two-speaker mixtures, but its performance remains unsatisfactory in noisy multi-speaker scenarios. To address this issue, we introduce a lightweight speech enhancement model, GTCRN, to better guide TSE in noisy environments. Building on our previous competitive speaker-embedding/encoder-free framework SEF-PNet, we propose two extensions: LGTSE and D-LGTSE. LGTSE incorporates noise-agnostic enrollment guidance by denoising the input noisy speech before context interaction with enrollment speech, thereby reducing noise interference. D-LGTSE further improves system robustness against speech distortion by leveraging denoised speech as an additional noisy input during training, expanding the dynamic range of noisy conditions and enabling the model to learn directly from distorted signals. Furthermore, we propose a two-stage training strategy, first with GTCRN enhancement-guided pre-training and then joint fine-tuning, to fully exploit the model's potential. Experiments on the Libri2Mix dataset demonstrate significant improvements of 0.89 dB in SISDR, 0.16 in PESQ, and 1.97% in STOI, validating the effectiveness of our approach.

en eess.AS
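The SISDR gain cited in the result above refers to the scale-invariant signal-to-distortion ratio, a standard speech enhancement metric. A minimal sketch of its usual definition, purely illustrative and not the authors' code:

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB.

    Standard definition; illustrative sketch, not the paper's implementation.
    """
    # Remove DC offset from both signals.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.dot(target, target) / np.dot(noise, noise))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is what makes the metric scale-invariant.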
arXiv Open Access 2025
Multilingual Source Tracing of Speech Deepfakes: A First Benchmark

Xi Xuan, Yang Xiao, Rohan Kumar Das et al.

Recent progress in generative AI has made it increasingly easy to create natural-sounding deepfake speech from just a few seconds of audio. While these tools support helpful applications, they also raise serious concerns by making it possible to generate convincing fake speech in many languages. Current research has largely focused on detecting fake speech, but little attention has been given to tracing the source models used to generate it. This paper introduces the first benchmark for multilingual speech deepfake source tracing, covering both mono- and cross-lingual scenarios. We comparatively investigate DSP- and SSL-based modeling; examine how SSL representations fine-tuned on different languages impact cross-lingual generalization performance; and evaluate generalization to unseen languages and speakers. Our findings offer the first comprehensive insights into the challenges of identifying speech generation models when training and inference languages differ. The dataset, protocol and code are available at https://github.com/xuanxixi/Multilingual-Source-Tracing.

en eess.AS, cs.CL
S2 Open Access 2022
Language and Speech Characteristics in Autism

I. Vogindroukas, M. Stankova, E. Chelas et al.

Abstract The aim of this paper is to provide information regarding the diversity of speech and language profiles in individuals with Autism Spectrum Disorders (ASD) and to classify these profiles according to the combination of communication difficulties. Research findings confirm the heterogeneity of communication challenges in ASD across the lifespan. Many children with ASD experience communication challenges and strengths across all language sub-systems, including pragmatics, grammar, semantics, syntax, phonology, and morphology in both oral and written language, while some children with autism demonstrate exceptional language abilities, including linguistic creativity. Communication issues vary on a continuum of severity, so that some children may be verbal, whereas others remain non-verbal or minimally verbal. The diversity of profiles in speech and language development stems either from the presence of comorbid factors, from a core symptom of autistic behavior without comorbidity, or from both, with the development of complex clinical symptoms. Difficulties with the semantic aspect of language affect the individual's skills in abstract thinking, multiple meanings of words, concept categorization, and so on. Finally, the coexistence of ASD with other communication difficulties such as a Language Disorder, Apraxia of Speech, Speech Sound Disorders, and/or other neurodevelopmental disorders raises the need to examine more carefully the emergence of new clinical profiles and clinical markers useful in performing differential diagnosis and tailoring intervention.

94 citations en Medicine
S2 Open Access 2024
Enhancing Communication in Critically Ill Patients with a Tracheostomy: A Systematic Review of Evidence-Based Interventions and Outcomes.

Mary N. Gentile, A. Irvine, Annamarie M. King et al.

Background Tracheostomy, a common procedure performed in intensive care units (ICU), is associated with communication impairment and affects patient well-being. While prior research has focused on physiological care, there is a need to address communication needs and quality of life (QOL). We aimed to evaluate how different types of communication devices affect QOL, speech intelligibility, voice quality, time to significant events, clinical response and tolerance, and healthcare utilization in patients undergoing tracheostomy. Methods Following PRISMA guidelines, a systematic review was conducted to assess studies from 2016 onwards. Eligible studies included adult ICU patients with a tracheostomy, comparing different types of communication devices. Data were extracted and synthesized to evaluate QOL, speech intelligibility, voice quality, time to significant events (initial communication device use, oral intake, decannulation), clinical response and tolerance, and healthcare utilization and facilitators/barriers to device implementation. Results Among 9,228 studies screened, 8 were included in the review. Various communication devices were employed, comprising both tracheostomy types and speaking valves, highlighting the multifaceted nature of interventions. Quality of life improvements were observed with voice restoration interventions, but challenges such as speech intelligibility impairments were noted. The median time for initial communication device usage post-intervention was 11.4 ± 5.56 days. The median duration of speech tolerance ranged from 30-60 minutes to 2-3 hours across different studies. Complications such as air trapping or breathing difficulties were reported in 15% of cases. Additionally, the median ICU length of stay post-intervention was 36.5 days. Key facilitators for device implementation included early intervention, while barriers ranged from service variability to physical intolerance issues.
Conclusion Findings demonstrate that various types of communication devices can significantly enhance the quality of life, speech intelligibility, and voice quality for patients undergoing tracheostomy, aligning with the desired outcomes of improved clinical response and reduced healthcare utilization. The identification of facilitators and barriers to device implementation further informs clinical practice, suggesting a tailored, patient-centered approach is crucial for optimizing the benefits of communication devices in this population.

10 citations en Medicine
S2 Open Access 2024
Adversarial Auto-Encoders Based Model for Classification of Speech Dysarthria

V. K. Devi, R. Sreenivas, E. Umamaheshwari et al.

Communication is effective based on various parameters, among which phonetic or oral communication plays a vital role. Slurred or improper speech leads to misunderstanding, which can complicate any situation. Many people, ranging from children to adults, are affected by slurred speech, technically termed Speech Dysarthria, a disorder that hampers effective oral communication. Distinguishing between people affected by dysarthria and people with normal speech is a tedious process when done manually. Machine Learning (ML) and Artificial Intelligence (AI) can be applied to solve this problem. Existing methodologies that classify people affected by speech dysarthria against people who communicate normally rely on technologies such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and so on. This paper aims at distinguishing between people affected by speech dysarthria and people with normal speech using Adversarial Auto-Encoders (AAE), a model rooted in Variational Auto-Encoders (VAE) and Generative Adversarial Networks (GAN). The proposed approach produces good results and proves effective.

3 citations en Computer Science
DOAJ Open Access 2024
Preliminares

Front matter of the journal

Oral communication. Speech, French literature - Italian literature - Spanish literature - Portuguese literature
arXiv Open Access 2024
FlashSpeech: Efficient Zero-Shot Speech Synthesis

Zhen Ye, Zeqian Ju, Haohe Liu et al.

Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found at https://flashspeech.github.io/.

en eess.AS, cs.AI
arXiv Open Access 2024
Investigating the Effects of Diffusion-based Conditional Generative Speech Models Used for Speech Enhancement on Dysarthric Speech

Joanna Reszka, Parvaneh Janbakhshi, Tilak Purohit et al.

In this study, we aim to explore, for the first time, the effect of pre-trained conditional generative speech models on dysarthric speech due to Parkinson's disease recorded in an ideal/non-noisy condition. Considering one category of generative models, i.e., diffusion-based speech enhancement, these models were previously trained to learn the distribution of clean (i.e., recorded in a noise-free environment) typical speech signals. Therefore, we hypothesized that when exposed to dysarthric speech they might remove the unseen atypical paralinguistic cues during the enhancement process. By considering the automatic dysarthric speech detection task, we experimentally show that during the enhancement of dysarthric speech data recorded in an ideal non-noisy environment, some of the acoustic dysarthric speech cues are lost. Therefore, such pre-trained models are not yet suitable in the context of dysarthric speech enhancement, since they manipulate the pathological speech cues when they process clean dysarthric speech. Furthermore, we show that the acoustic cues removed by the enhancement models, in the form of a residue speech signal, can provide complementary dysarthric cues when fused with the original input speech signal in the feature space.

en eess.AS, cs.LG
arXiv Open Access 2024
Exploratory Evaluation of Speech Content Masking

Jennifer Williams, Karla Pizzi, Paul-Gauthier Noe et al.

Most recent speech privacy efforts have focused on anonymizing acoustic speaker attributes, but there has not been as much research into protecting information in speech content. We introduce a toy problem that explores an emerging type of privacy called "content masking", which conceals selected words and phrases in speech. In our efforts to define this problem space, we evaluate an introductory baseline masking technique based on modifying sequences of discrete phone representations (phone codes) produced from a pre-trained vector-quantized variational autoencoder (VQ-VAE) and re-synthesized using WaveRNN. We investigate three different masking locations and three types of masking strategies: noise substitution, word deletion, and phone sequence reversal. Our work attempts to characterize how masking affects two downstream tasks: automatic speech recognition (ASR) and automatic speaker verification (ASV). We observe how the different mask types and locations impact these downstream tasks and discuss how these issues may influence privacy goals.

en eess.AS, cs.CR
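The three masking strategies named in the abstract above can be pictured as simple operations on a sequence of discrete phone codes. A toy sketch, assuming codes are plain integer lists and a reserved id marks noise; the paper itself operates on VQ-VAE phone codes and re-synthesizes with WaveRNN:

```python
# Hypothetical placeholder id for a "noise" token (assumption, not from the paper).
NOISE_CODE = -1

def noise_substitution(codes, start, end):
    """Replace the span [start, end) with noise tokens."""
    return codes[:start] + [NOISE_CODE] * (end - start) + codes[end:]

def word_deletion(codes, start, end):
    """Delete the span [start, end) entirely."""
    return codes[:start] + codes[end:]

def phone_reversal(codes, start, end):
    """Reverse the order of phone codes within [start, end)."""
    return codes[:start] + codes[start:end][::-1] + codes[end:]
```

Each function returns a new list, leaving the input sequence untouched, which keeps the three strategies easy to compare on the same utterance.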
S2 Open Access 2023
Innovative Application of Sensor Combined with Speech Recognition Technology in College English Education in the Context of Artificial Intelligence

Juan Guo

English listening practice is an effective way to improve students' English expression ability and oral communication. However, current English teaching methods are too uniform, and teachers do not focus on oral training in the classroom, resulting in low teaching efficiency. On the basis of following the principles of wholeness, interaction, balance, and sustainable development in educational ecology, enhancing the synergy of ecological elements of the English speaking classroom, promoting interactive dialogue among ecological subjects, and regulating classroom behaviors help information technology play its advantageous role in English speaking teaching reform and promote its sustainable development. This paper addresses the current situation of English listening teaching, especially the reduced recognition rate of spoken language in noisy environments, and proposes a dual-sensor speech recognition system. We design a speech recognition method based on a recurrent neural network, acquiring through sensors both the weak vibration pressure speech signal of the jaw skin and the speech signal transmitted through the air during vocalization. A deep machine learning algorithm is used for speech recognition in English teaching. A reasonable frame sampling frequency is set to obtain the English speech signal; the feature parameters representing this signal are then obtained via linear prediction coefficients to generate the speech feature vector, and a recurrent neural network is trained on these features. In comparative experiments against commonly used speech recognition algorithms, the proposed algorithm achieves higher accuracy and faster convergence for English-teaching speech recognition.

18 citations en
S2 Open Access 2023
CMCU-CSS: Enhancing Naturalness via Commonsense-based Multi-modal Context Understanding in Conversational Speech Synthesis

Yayue Deng, Jinlong Xue, Fengping Wang et al.

Conversational Speech Synthesis (CSS) aims to produce speech appropriate for oral communication. However, the complexity of context dependency modeling poses significant challenges in the field of CSS, especially the mutual psychological influence between interlocutors. Previous studies have verified that prior commonsense knowledge helps machines understand subtle psychological information (e.g., feelings and intentions) in spontaneous oral dialogues. Therefore, to enhance context understanding and improve the naturalness of synthesized speech, we propose a novel conversational speech synthesis system (CMCU-CSS) that incorporates the Commonsense-based Multi-modal Context Understanding (CMCU) module to model the dynamic emotional interaction among interlocutors. Specifically, we first utilize three implicit states (intent state, internal state and external state) in CMCU to model the context dependency between inter/intra speakers with the help of commonsense knowledge. Furthermore, we infer emotion vectors from the fusion of these implicit states and multi-modal features to enhance the emotion discriminability of synthesized speech. This is the first attempt to combine commonsense knowledge with conversational speech synthesis, and its effect in terms of emotion discriminability of synthetic speech is evaluated by emotion recognition in conversation task. The results of subjective and objective evaluations demonstrate that the CMCU-CSS model achieves more natural speech with context-appropriate emotion and is equipped with the best emotion discriminability, surpassing that of other conversational speech synthesis models.

10 citations en Computer Science
S2 Open Access 2023
Clinical outcomes associated with speech, language and swallowing difficulties post-stroke

Stephanie Kaylor, Shajila Singh

Background There is a lack of prospective research in South Africa’s speech therapy private sector, specifically, in the acute stroke population. There is a need to understand the quality of speech therapy services and outcomes post-stroke in the private sector. Objectives This prospective cohort study investigated associations between speech, language, and swallowing conditions (i.e. dysarthria, apraxia of speech, aphasia, dysphagia), and outcomes post-stroke (i.e. length of hospital stay [LOS], degree of physical disability according to the Modified Rankin Scale [mRS], functional level of oral intake according to the Functional Oral Intake Scale [FOIS], dehydration, weight loss, aspiration pneumonia, mortality). Method A prospective design was used to determine the incidence of speech, language, and swallowing conditions post-stroke. Convenience sampling was used to select participants (N = 68). Various statistical tests were used and the alpha level was set at p < 0.01 after Bonferroni correction. Results Co-occurring speech, language, and swallowing conditions frequently occurred post-stroke (88%). Participants who were referred to speech therapy later than 24 h post-admission (52.94%) stayed in hospital for a median of 3 days longer than those who were referred within 24 h (p = 0.042). Dysphagia was significantly associated with moderate to severe physical disability (p < 0.01). Dysphagia with aspiration was significantly associated with poor functional level of oral intake, at admission and at discharge (p < 0.01). At discharge, aspiration pneumonia was significantly associated with severe physical disability (p < 0.01, r = 0.70). Conclusion In South Africa’s private sector, co-occurring speech, language, and swallowing conditions commonly occurred post-stroke, and dysphagia was strongly associated with physical disability and poor functional level of oral intake. Length of hospital stay was increased by delayed speech therapy referrals.
Contribution This article contributes data on speech therapy services, communication and swallowing disorders post-stroke, and outcomes in South Africa’s private sector.

9 citations en Medicine
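The Bonferroni correction used in the study above divides the family-wise significance level by the number of comparisons. A one-line sketch (illustrative; the study's actual number of tests is not stated in the abstract):

```python
def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-comparison significance threshold under Bonferroni correction."""
    return family_alpha / n_tests
```

For example, a family-wise level of 0.05 spread over 5 tests would give a per-test threshold of 0.01; whether that matches the study's actual test count is an assumption.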
S2 Open Access 2023
Speech Production in Healthy Older Adults With or Without Amateur Singing Experience.

Pascale Tremblay, Lydia Gagnon, Johanna-Pascale Roy et al.

PURPOSE Amateur singing is a universal, accessible, and enjoyable musical activity that may have positive impacts on human communication. However, evidence of an impact of singing on speech articulation is still scarce, yet understanding the effects of vocal training on speech production could provide a model for treating people with speech deficits. The aim of this study was to examine speech production in younger and older adults with or without amateur singing experience. METHOD Thirty-eight amateur singers (aged 20-87 years, 23 women and 15 men) and 40 nonmusician active controls (aged 23-88 years, 19 women and 21 men) were recruited. A set of tasks were used to evaluate the oral motor sphere: two voice production tasks, a passage reading task, and a modified diadochokinetic (DDK) rates task performed at a natural rhythm and as quickly as possible. RESULTS Our results show that older age was associated with lower reading rate, lower articulation rate, and articulation rate variability in the DDK task, as well as reduced accuracy for the phonologically complex stimuli. Most importantly, our results show an advantage for singers over cognitively active nonsingers in terms of articulatory accuracy in the most challenging situations. CONCLUSION This result suggests extended maximal performance capacities in amateur singers perhaps resulting from the articulatory efforts required during singing. SUPPLEMENTAL MATERIAL https://doi.org/10.23641/asha.24274813.

6 citations en Medicine
DOAJ Open Access 2023
When voice signals nationality and sexual orientation: Speakers’ self-perceptions and perceived stigmatization

Fasoli Fabio, Dragojevic Marko, Rakić Tamara

Research has shown that individuals speaking low-prestige language varieties are often negatively evaluated and stigmatized by others. However, less is known about how speakers of such language varieties perceive their own speech. Here, we examined self-perceptions and perceived stigma of speakers who belong to multiple social categories signaled by auditory cues. Specifically, we examined beliefs of sexual minority and heterosexual male speakers who were either British nationals (native English speakers) or foreigners living in the UK (non-native English speakers). British speakers believed their voices cue their nationality more than foreigners. Heterosexuals believed their voices reveal their sexual orientation, but only when they self-perceived as sounding masculine. Sexual minority and foreign speakers felt more stigmatized because of the way they sound than did heterosexual and British speakers, respectively. These findings have implications for intergroup communication and voice-based stigmatization literature.

Oral communication. Speech, Psychology
arXiv Open Access 2023
Audio-Visual Speech Enhancement with Score-Based Generative Models

Julius Richter, Simone Frintrop, Timo Gerkmann

This paper introduces an audio-visual speech enhancement system that leverages score-based generative models, also known as diffusion models, conditioned on visual information. In particular, we exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading. The layer-wise features of its transformer-based encoder are aggregated, time-aligned, and incorporated into the noise conditional score network. Experimental evaluations show that the proposed audio-visual speech enhancement system yields improved speech quality and reduces generative artifacts such as phonetic confusions with respect to the audio-only equivalent. The latter is supported by the word error rate of a downstream automatic speech recognition model, which decreases noticeably, especially at low input signal-to-noise ratios.

en eess.AS, cs.LG
arXiv Open Access 2023
On the Behavior of Intrusive and Non-intrusive Speech Enhancement Metrics in Predictive and Generative Settings

Danilo de Oliveira, Julius Richter, Jean-Marie Lemercier et al.

Since its inception, the field of deep speech enhancement has been dominated by predictive (discriminative) approaches, such as spectral mapping or masking. Recently, however, novel generative approaches have been applied to speech enhancement, attaining good denoising performance with high subjective quality scores. At the same time, advances in deep learning also allowed for the creation of neural network-based metrics, which have desirable traits such as being able to work without a reference (non-intrusively). Since generatively enhanced speech tends to exhibit radically different residual distortions, its evaluation using instrumental speech metrics may behave differently compared to predictively enhanced speech. In this paper, we evaluate the performance of the same speech enhancement backbone trained under predictive and generative paradigms on a variety of metrics and show that intrusive and non-intrusive measures correlate differently for each paradigm. This analysis motivates the search for metrics that can together paint a complete and unbiased picture of speech enhancement performance, irrespective of the model's training process.

en eess.AS, cs.LG
arXiv Open Access 2023
Acoustic-to-Articulatory Speech Inversion Features for Mispronunciation Detection of /r/ in Child Speech Sound Disorders

Nina R Benway, Yashish M Siriwardena, Jonathan L Preston et al.

Acoustic-to-articulatory speech inversion could enhance automated clinical mispronunciation detection to provide detailed articulatory feedback unattainable by formant-based mispronunciation detection algorithms; however, it is unclear to what extent a speech inversion system trained on adult speech performs in the context of (1) child and (2) clinical speech. In the absence of an articulatory dataset in children with rhotic speech sound disorders, we show that classifiers trained on tract variables from acoustic-to-articulatory speech inversion meet or exceed the performance of state-of-the-art features when predicting clinician judgment of rhoticity. Index Terms: rhotic, speech sound disorder, mispronunciation detection

arXiv Open Access 2023
Investigating Speaker Embedding Disentanglement on Natural Read Speech

Michael Kuhlmann, Adrian Meise, Fritz Seebauer et al.

Disentanglement is the task of learning representations that identify and separate factors that explain the variation observed in data. Disentangled representations are useful to increase the generalizability, explainability, and fairness of data-driven models. Only little is known about how well such disentanglement works for speech representations. A major challenge when tackling disentanglement for speech representations are the unknown generative factors underlying the speech signal. In this work, we investigate to what degree speech representations encoding speaker identity can be disentangled. To quantify disentanglement, we identify acoustic features that are highly speaker-variant and can serve as proxies for the factors of variation underlying speech. We find that disentanglement of the speaker embedding is limited when trained with standard objectives promoting disentanglement but can be improved over vanilla representation learning to some extent.

en eess.AS

Page 3 of 125,596