Results for "Oral communication. Speech"

Showing 20 of ~156051 results · from DOAJ, arXiv, Semantic Scholar

arXiv Open Access 2026
ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy

Ya-Tse Wu, Chi-Chun Lee

This work investigates how emotional speech and generative strategies affect ASR performance. We analyze speech synthesized from three emotional TTS models and find that substitution errors dominate, with emotional expressiveness varying across models. Based on these insights, we introduce two generative strategies: one using transcription correctness and another using emotional salience, to construct fine-tuning subsets. Results show consistent WER improvements on real emotional datasets without noticeable degradation on clean LibriSpeech utterances. The combined strategy achieves the strongest gains, particularly for expressive speech. These findings highlight the importance of targeted augmentation for building emotion-aware ASR systems.
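The word error rate (WER) improvements reported above can be made concrete with a minimal sketch. This is a hypothetical helper, not the authors' code: WER is the word-level edit distance (substitutions, insertions, deletions) normalized by reference length, which is also how substitution errors are counted.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

A transcription-correctness strategy like the one described could then keep only synthesized utterances whose WER against the intended text falls below a threshold.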

en eess.AS
DOAJ Open Access 2025
The heterogeneity of the manifestation of the therapeutic alliance and of the cognitive and affective dimensions of empathy among speech-language therapists in France

Floriane Ardellier, Juliette Le Douarin, Valérie Martinage et al.

Background. The interaction between the cognitive and affective dimensions of empathy and the ability to establish a therapeutic alliance with the patient has been explored for several health professions, with the exception of speech-language therapists. Objectives. The work described here was carried out to collect measurements from speech-language therapists practising in France, in order to identify groups differing in their therapeutic alliance and in the cognitive and affective dimensions of their empathy. Methods. A national survey was conducted to measure the therapeutic alliance (WAI-SR scale) and the cognitive (perspective taking) and affective (compassionate care) dimensions of empathy. Sociodemographic items (age, sex, professional seniority) completed the questionnaire. After psychometric validation of the WAI-SR scale, a correlational analysis between the different measures was carried out. The heterogeneity of the population with respect to these constructs was explored through a cluster analysis. Results. 698 responses were analyzed. The WAI-SR scale showed a one-dimensional structure that was invariant with respect to professional seniority. The WAI-SR score was positively correlated with perspective taking (ρs = 0.407; p < .001) and with compassionate care (ρs = 0.193; p < .001). These correlations did not differ significantly (p < .05) with professional seniority. Four clusters were identified that differed significantly in therapeutic alliance (K = 277.91; p < .001), perspective taking (K = 315.69; p < .001), and compassionate care (K = 246.31; p < .001). Conclusion. This work constitutes the first analysis of data collected from speech-language therapists practising in France concerning the therapeutic alliance and the cognitive and affective dimensions of empathy. These three constructs are positively correlated, and their manifestations are heterogeneous in the tested sample. This heterogeneity does not appear to be strictly dependent on professional seniority. Other possible determinants of this heterogeneity are discussed.

Oral communication. Speech, Pathology
arXiv Open Access 2025
Towards Explainable Spoofed Speech Attribution and Detection: A Probabilistic Approach for Characterizing Speech Synthesizer Components

Jagabandhu Mishra, Manasi Chhibber, Hye-jin Shim et al.

We propose an explainable probabilistic framework for characterizing spoofed speech by decomposing it into probabilistic attribute embeddings. Unlike raw high-dimensional countermeasure embeddings, which lack interpretability, the proposed probabilistic attribute embeddings aim to detect specific speech synthesizer components, represented through high-level attributes and their corresponding values. We use these probabilistic embeddings with four classifier back-ends to address two downstream tasks: spoofing detection and spoofing attack attribution. The former is the well-known bonafide-spoof detection task, whereas the latter seeks to identify the source method (generator) of a spoofed utterance. We additionally use Shapley values, a widely used technique in machine learning, to quantify the relative contribution of each attribute value to the decision-making process in each task. Results on the ASVspoof2019 dataset demonstrate the substantial role of duration and conversion modeling in spoofing detection; and waveform generation and speaker modeling in spoofing attack attribution. In the detection task, the probabilistic attribute embeddings achieve 99.7% balanced accuracy and 0.22% equal error rate (EER), closely matching the performance of raw embeddings (99.9% balanced accuracy and 0.22% EER). Similarly, in the attribution task, our embeddings achieve 90.23% balanced accuracy and 2.07% EER, compared to 90.16% and 2.11% with raw embeddings. These results demonstrate that the proposed framework is both inherently explainable by design and capable of achieving performance comparable to raw CM embeddings.
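The equal error rate (EER) quoted in the abstract is the operating point where the false-acceptance and false-rejection rates cross. A minimal sketch (a hypothetical helper, not from the paper; assumes higher scores mean "positive" and both classes are present):

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping a threshold over the observed scores.

    At each candidate threshold the false-acceptance rate (FAR, negatives
    accepted) and false-rejection rate (FRR, positives rejected) are
    computed; the EER estimate is the smallest max(FAR, FRR) encountered.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    assert pos > 0 and neg > 0, "need both classes"
    fr = 0     # positives rejected so far (threshold rising)
    fa = neg   # negatives still accepted (all, for a threshold below min)
    best = 1.0
    for score, label in sorted(zip(scores, labels)):
        if label == 1:
            fr += 1
        else:
            fa -= 1
        best = min(best, max(fa / neg, fr / pos))
    return best
```

Perfectly separable scores yield an EER of 0.0; fully overlapping classes approach 0.5.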

en eess.AS
arXiv Open Access 2025
Prosody Labeling with Phoneme-BERT and Speech Foundation Models

Tomoki Koriyama

This paper proposes a model for automatic prosodic label annotation, where the predicted labels can be used for training a prosody-controllable text-to-speech model. The proposed model utilizes not only rich acoustic features extracted by a self-supervised-learning (SSL)-based model or a Whisper encoder, but also linguistic features obtained from phoneme-input pretrained linguistic foundation models such as PnG BERT and PL-BERT. The concatenation of acoustic and linguistic features is used to predict phoneme-level prosodic labels. In the experimental evaluation on Japanese prosodic labels, including pitch accents and phrase break indices, it was observed that the combination of both speech and linguistic foundation models enhanced the prediction accuracy compared to using either a speech or linguistic input alone. Specifically, we achieved 89.8% prediction accuracy in accent labels, 93.2% in high-low pitch accents, and 94.3% in break indices.

en eess.AS, cs.SD
arXiv Open Access 2025
A Composite Predictive-Generative Approach to Monaural Universal Speech Enhancement

Jie Zhang, Haoyin Yan, Xiaofei Li

It is promising to design a single model that can suppress various distortions and improve speech quality, i.e., universal speech enhancement (USE). Compared to supervised learning-based predictive methods, diffusion-based generative models have shown greater potential due to the generative capacities from degraded speech with severely damaged information. However, artifacts may be introduced in highly adverse conditions, and diffusion models often suffer from a heavy computational burden due to many steps for inference. In order to jointly leverage the superiority of prediction and generation and overcome the respective defects, in this work we propose a universal speech enhancement model called PGUSE by combining predictive and generative modeling. Our model consists of two branches: the predictive branch directly predicts clean samples from degraded signals, while the generative branch optimizes the denoising objective of diffusion models. We utilize the output fusion and truncated diffusion scheme to effectively integrate predictive and generative modeling, where the former directly combines results from both branches and the latter modifies the reverse diffusion process with initial estimates from the predictive branch. Extensive experiments on several datasets verify the superiority of the proposed model over state-of-the-art baselines, demonstrating the complementarity and benefits of combining predictive and generative modeling.

en eess.AS
arXiv Open Access 2025
Predicting speech intelligibility in older adults for speech enhancement using the Gammachirp Envelope Similarity Index, GESI

Ayako Yamamoto, Fuki Miyazaki, Toshio Irino

We propose an objective intelligibility measure (OIM), called the Gammachirp Envelope Similarity Index (GESI), that can predict speech intelligibility (SI) in older adults. GESI is a bottom-up model based on psychoacoustic knowledge from the peripheral to the central auditory system. It computes the single SI metric using the gammachirp filterbank (GCFB), the modulation filterbank, and the extended cosine similarity measure. It takes into account not only the hearing level represented in the audiogram, but also the temporal processing characteristics captured by the temporal modulation transfer function (TMTF). To evaluate performance, SI experiments were conducted with older adults of various hearing levels using speech-in-noise with ideal speech enhancement on familiarity-controlled Japanese words. The prediction performance was compared with HASPIw2, which was developed for keyword SI prediction. The results showed that GESI predicted the subjective SI scores more accurately than HASPIw2. GESI was also found to be at least as effective as, if not more effective than, HASPIv2 in predicting English sentence-level SI. The effect of introducing TMTF into the GESI algorithm was insignificant, suggesting that TMTF measurements and models are not yet mature. Therefore, it may be necessary to perform TMTF measurements with bandpass noise and to improve the incorporation of temporal characteristics into the model.
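GESI's final metric builds on an extended cosine similarity between processed representations of clean and degraded speech. The basic form of that comparison can be sketched as follows (plain cosine similarity; the paper's extended variant adds further weighting not shown here):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors,
    e.g. modulation-envelope representations of clean vs. enhanced speech."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate (silent) input
    return dot / (norm_a * norm_b)
```

Identical representations score 1.0, orthogonal ones 0.0, which is what makes the measure usable as a bounded similarity-based intelligibility index.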

DOAJ Open Access 2024
What HMC Teaches Us About Authenticity

Katrin Etzrodt, Jihyun Kim, Margot J. van der Goot et al.

This paper delves into what the application of authenticity to Human-Machine Communication (HMC) can teach us about authenticity and about ourselves as HMC researchers and as a community. Inspired by the 2023 pre-conference “HMC: Authenticity in communicating with machines,” two central questions guide the discussion: How does HMC contribute to our understanding of authentic communication with machines? And how can the concept of authenticity contribute to our sense of self as researchers within the HMC field? Through the collaborative effort of 22 authors, the paper explores the re-conceptualization of authenticity and presents recent areas of tension that guide HMC research and the community. With this paper we aim to offer a gateway for scholars to connect and engage with the evolving HMC field.

Technology (General), Oral communication. Speech
DOAJ Open Access 2024
The impact of length, rotation, and repetition in Arabic words: Event-related potential evidence

Mohamed Tarik N.

The current study investigated the effects of word length, rotation, and repetition on the early processing of Arabic words in 24 Arabic-speaking college students (M_age = 21.6 years, SD = 1.67). Using a priming repetition paradigm, participants quickly determined if prime (S1) and probe (S2) words were the same or different. Event-related potentials (ERPs) were measured, focusing on P100, N170, and N250 components. Behavioral results indicated longer response times (RTs) and higher error rates for nine-letter, rotated, and non-repeated words. ERP results for the prime (S1) showed significant effects of word length and rotation on P100 and N170 components. Specifically, nine-letter words and 90°-rotated words elicited higher amplitudes. For the probe (S2), no effects were found on P100, but 90°-rotated words showed higher N170 and N250 amplitudes. These findings suggest that early stages of Arabic word processing are influenced by word length and rotation. They contribute to the understanding of Arabic word recognition.

Oral communication. Speech, Psychology
DOAJ Open Access 2024

Monika Janíková, Terézia Horská, Svetlana Kapalková, PhD.

Theory of Mind and narrative abilities are closely related in a bidirectional way. On the one hand, Theory of Mind is reflected in the quality of discourse production and comprehension, and on the other hand narrative input influences the development of Theory of Mind in children. The present contribution is theoretical in nature, defining the Theory of Mind, its development, and describing in more detail its relationship to narrative abilities in children. It also provides information on the possibilities of assessing Theory of Mind within a narrative context in the child population.

Medicine, Oral communication. Speech
DOAJ Open Access 2024
Attitude, Practice and Preparedness of Dental Practitioners for Providing Oral Care to Patients with Speech and Hearing Impairment

Khansa Rafi, Syed Masood-ul- Hasan, Saba Tabassum

Objective: To explore the attitudes, practices and preparedness of dental practitioners for providing oral care to patients with speech and hearing impairment. Methodology: This cross-sectional study was conducted on a sample of 467 dental practitioners between November 2021 and April 2022. A survey questionnaire was administered to practicing dental practitioners in Pakistan through online Google forms and in person. The survey included questions regarding dentists' demographic information, their attitudes and concerns, practices and views regarding training needs concerning the provision of oral care to patients with speech and hearing impairment. Data was analyzed with STATA version 16. Descriptive statistics with frequencies were computed and χ² was used to find out the associations between different variables. Results: The findings suggest that a significant proportion of participants, specifically 88.22%, were unfamiliar with sign language and relied on the assistance of an attendant (54.82%) for communication purposes. According to their accounts, surgical extraction (54.4%) was perceived as the most challenging treatment to execute, whereas scaling and polishing (73.9%) was considered the easiest treatment to administer in this particular population. When it came to providing care for individuals with speech and hearing impairment, several barriers were identified. These included communication obstacles (65.3%), patient cooperation (63.2%), concerns about medical emergencies (29.1%), inadequately trained staff (26.3%), and limited experience (25.5%). The participants unanimously agreed that there is a pressing need for enhanced undergraduate training (43.68%). Additionally, they suggested that it should be obligatory for students to treat a specific number of patients with speech and hearing impairment as part of their undergraduate training (49.25%).
Conclusion: There is a need to address the identified barriers and create a better understanding between dentists and individuals with speech and hearing impairments. Basic dental signs should be learned and mastered to engage patients more effectively via non-verbal communication.

arXiv Open Access 2024
Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition

Shujie Hu, Xurong Xie, Mengzhe Geng et al.

Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is confronted by in-domain data scarcity and mismatch. To this end, this paper explores a series of approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition. These include: a) input feature fusion between standard acoustic frontends and domain fine-tuned SSL speech representations; b) frame-level joint decoding between TDNN systems separately trained using standard acoustic features alone and those with additional domain fine-tuned SSL features; and c) multi-pass decoding involving the TDNN/Conformer system outputs to be rescored using domain fine-tuned pre-trained ASR models. In addition, fine-tuned SSL speech features are used in acoustic-to-articulatory (A2A) inversion to construct multi-modal ASR systems. Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora; and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models and their features consistently outperform the standalone fine-tuned SSL pre-trained models. These systems produced statistically significant WER or CER reductions of 6.53%, 1.90%, 2.04% and 7.97% absolute (24.10%, 23.84%, 10.14% and 31.39% relative) on the four tasks respectively. Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.

en eess.AS, cs.AI
arXiv Open Access 2024
USDnet: Unsupervised Speech Dereverberation via Neural Forward Filtering

Zhong-Qiu Wang

In reverberant conditions with a single speaker, each far-field microphone records a reverberant version of the same speaker signal at a different location. In over-determined conditions, where there are multiple microphones but only one speaker, each recorded mixture signal can be leveraged as a constraint to narrow down the solutions to target anechoic speech and thereby reduce reverberation. Equipped with this insight, we propose USDnet, a novel deep neural network (DNN) approach for unsupervised speech dereverberation (USD). At each training step, we first feed an input mixture to USDnet to produce an estimate for target speech, and then linearly filter the DNN estimate to approximate the multi-microphone mixture so that the constraint can be satisfied at each microphone, thereby regularizing the DNN estimate to approximate target anechoic speech. The linear filter can be estimated based on the mixture and DNN estimate via neural forward filtering algorithms such as forward convolutive prediction. We show that this novel methodology can promote unsupervised dereverberation of single-source reverberant speech.
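The core regularization step described above, linearly filtering the DNN estimate so that it approximates the recorded mixture, amounts to a least-squares FIR filter fit. A minimal single-channel sketch (a hypothetical helper under simplified assumptions, not the paper's forward convolutive prediction implementation, which operates per frequency on STFT representations):

```python
import numpy as np

def estimate_forward_filter(estimate, mixture, taps=4):
    """Least-squares FIR filter h (length `taps`) such that
    convolving the DNN estimate with h best matches the mixture."""
    n = len(mixture)
    X = np.zeros((n, taps))
    for k in range(taps):
        X[k:, k] = estimate[: n - k]  # delayed copies of the estimate
    h, *_ = np.linalg.lstsq(X, mixture, rcond=None)
    return h
```

The filtered estimate `X @ h` then serves as the reconstruction whose mismatch to the mixture is the unsupervised training loss; the filter absorbs the reverberation so the DNN output is pushed toward anechoic speech.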

en eess.AS, eess.SP
arXiv Open Access 2024
Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction

Ye-Xin Lu, Yang Ai, Hui-Peng Du et al.

Speech bandwidth extension (BWE) refers to widening the frequency bandwidth range of speech signals, enhancing the speech quality towards brighter and fuller. This paper proposes a generative adversarial network (GAN) based BWE model with parallel prediction of Amplitude and Phase spectra, named AP-BWE, which achieves both high-quality and efficient wideband speech waveform generation. The proposed AP-BWE generator is entirely based on convolutional neural networks (CNNs). It features a dual-stream architecture with mutual interaction, where the amplitude stream and the phase stream communicate with each other and respectively extend the high-frequency components from the input narrowband amplitude and phase spectra. To improve the naturalness of the extended speech signals, we employ a multi-period discriminator at the waveform level and design a pair of multi-resolution amplitude and phase discriminators at the spectral level, respectively. Experimental results demonstrate that our proposed AP-BWE achieves state-of-the-art performance in terms of speech quality for BWE tasks targeting sampling rates of both 16 kHz and 48 kHz. In terms of generation efficiency, due to the all-convolutional architecture and all-frame-level operations, the proposed AP-BWE can generate 48 kHz waveform samples 292.3 times faster than real-time on a single RTX 4090 GPU and 18.1 times faster than real-time on a single CPU. Notably, to our knowledge, AP-BWE is the first to achieve the direct extension of the high-frequency phase spectrum, which is beneficial for improving the effectiveness of existing BWE methods.
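The two quantities AP-BWE predicts in parallel, amplitude and wrapped phase spectra, come straight from a short-time Fourier analysis. A minimal single-frame sketch (a hypothetical helper, not the authors' code):

```python
import numpy as np

def amplitude_and_phase(frame):
    """Split one speech frame into amplitude and wrapped phase spectra.

    A Hann window is applied before the real FFT; np.angle returns the
    phase wrapped to (-pi, pi], which is the representation the phase
    stream extends toward high frequencies.
    """
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    return np.abs(spec), np.angle(spec)
```

Reversing the split with `amp * np.exp(1j * phase)` and an inverse FFT is what lets a model that predicts both streams synthesize a waveform directly.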

en eess.AS, cs.SD
S2 Open Access 2020
What has social neuroscience learned from hyperscanning studies of spoken communication? A systematic review.

B. Kelsen, A. Sumich, N. Kasabov et al.

A growing body of literature examining the neurocognitive processes of interpersonal linguistic interaction indicates the emergence of neural alignment as participants engage in oral communication. However, questions have arisen whether the study results can be interpreted beyond observations of cortical functionality and extended to the mutual understanding between communicators. This review presents evidence from electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) hyperscanning studies of interbrain synchrony (IBS) in which participants communicated via spoken language. The studies are classified into: knowledge sharing; turn-taking speech co-ordination; cooperation, problem-solving and creativity; and naturalistic discussion paradigms according to the type of interaction specified in each study. Alignment predominantly occurred in the frontal and temporo-parietal areas, which may reflect activation of the mirror and mentalizing systems. We argue that the literature presents a significant contribution to advancing our understanding of IBS and mutual understanding between communicators. We end with suggestions for future research, including analytical approaches and experimental conditions and hypothesize that brain-inspired neural networks are promising techniques for better understanding of IBS through hyperscanning.

117 citations en Medicine, Psychology
arXiv Open Access 2023
Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

Ye-Xin Lu, Yang Ai, Zhen-Hua Ling

Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome the above issue, in this paper, we proposed MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet comprises a Transformer-embedded encoder-decoder architecture. The encoder aims to encode the input distorted magnitude and phase spectra into time-frequency representations, which are further fed into time-frequency Transformers for alternatively capturing time and frequency dependencies. The decoder comprises a magnitude mask decoder and a phase decoder, directly enhancing magnitude and wrapped phase spectra by incorporating a magnitude masking architecture and a phase parallel estimation architecture, respectively. Multi-level loss functions explicitly defined on the magnitude spectra, wrapped phase spectra, and short-time complex spectra are adopted to jointly train the MP-SENet model. A metric discriminator is further employed to compensate for the incomplete correlation between these losses and human auditory perception. Experimental results demonstrate that our proposed MP-SENet achieves state-of-the-art performance across multiple speech enhancement tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it further mitigates the compensation effect between the magnitude and phase by explicit phase estimation, elevating the perceptual quality of enhanced speech.

en eess.AS, cs.SD
arXiv Open Access 2023
Visual-Aware Text-to-Speech

Mohan Zhou, Yalong Bai, Wei Zhang et al.

Dynamically synthesizing talking speech that actively responds to a listening head is critical during the face-to-face interaction. For example, the speaker could take advantage of the listener's facial expression to adjust the tones, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and sequential visual feedback (e.g., nod, smile) of the listener in face-to-face communication. Different from traditional text-to-speech, VA-TTS highlights the impact of visual modality. On this newly-minted task, we devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis. Extensive experiments on multimodal conversation dataset ViCo-X verify our proposal for generating more natural audio with scenario-appropriate rhythm and prosody.

en eess.AS, cs.CL
arXiv Open Access 2023
Prospective Validation of Motor-Based Intervention with Automated Mispronunciation Detection of Rhotics in Residual Speech Sound Disorders

Nina R Benway, Jonathan L Preston

Because lab accuracy of clinical speech technology systems may be overoptimistic, clinical validation is vital to demonstrate system reproducibility - in this case, the ability of the PERCEPT-R Classifier to predict clinician judgment of American English /r/ during ChainingAI motor-based speech sound disorder intervention. All five participants experienced statistically significant improvement in untreated words following 10 sessions of combined human-ChainingAI treatment. These gains, despite a wide range of PERCEPT-human and human-human (F1-score) agreement, raise questions about how best to measure classification performance for clinical speech that may be perceptually ambiguous.
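The F1-score used above to quantify classifier-clinician agreement is the harmonic mean of precision and recall. A minimal sketch from raw counts (a hypothetical helper, not the study's analysis code):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because F1 ignores true negatives, it can look very different from raw accuracy on perceptually ambiguous /r/ tokens, which is one reason agreement ranges can be wide.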

en eess.AS, cs.SD

Page 5 of 7803