Intelligibility, Oral Communication, and the Teaching of Pronunciation
John M. Levis
A proper understanding of intelligibility is at the heart of effective pronunciation teaching, and with it, successful teaching of speaking and listening. Far from being an optional 'add-it-on-if-we-have-time' language feature, pronunciation is essential because of its tremendous impact on speech intelligibility. Pronunciation dramatically affects the ability of language learners to make themselves understood and to understand the speech of others. But not all elements of pronunciation are equally important: some affect intelligibility a great deal, while others do not. With a strong emphasis on classroom practice and on how pronunciation teaching can be approached more effectively in different teaching contexts, this book provides an important, distinctly practical resource for pronunciation researchers and teachers. It shows how intelligibility research informs pronunciation teaching within communicative classrooms, enabling language teachers to incorporate intelligibility findings into their teaching. Professionals interested in oral communication, pronunciation, and speech perception will find the book fascinating.
Oral History Beyond Speech and Narrative: What Intellectual and Developmental Disability Can Teach Us
Nicki Pombier
ABSTRACT This article challenges the conventional boundaries of oral history by exploring communication beyond speech and narrative, particularly in the context of intellectual and developmental disabilities (IDD). The prevailing oral history practice, reliant on speech, inherently excludes individuals who communicate nonverbally. Drawing on over a decade of practice, disability scholarship, and the author’s personal experiences, this work argues for a more inclusive approach to oral history. It examines alternative forms of communication, such as embodied movements, gestures, and other nonverbal expressions, positioning the body as both language and archive. By focusing on the experiences of people with IDD, the article critiques ableist constraints and reimagines oral history as a practice of “being with” rather than “listening for” stories. Through a case study of arts-based collaborations facilitated by the Institute on Disabilities at Temple University, the author demonstrates how the expressions of individuals with IDD, though often overlooked, are crucial for expanding the field’s scope. While procedural and process-based modifications can make oral history more inclusive as a speech- and narrative-centered practice, the author argues for reimagining oral history in form as something new, beyond speech and narrative.
An Empirical Analysis of Discrete Unit Representations in Speech Language Modeling Pre-training
Yanis Labrak, Richard Dufour, Mickaël Rouvier
This paper investigates discrete unit representations in Speech Language Models (SLMs), focusing on optimizing speech modeling during continual pre-training. We systematically examine how model architecture, data representation, and training robustness influence the pre-training stage, in which existing pre-trained language models are adapted to the speech modality. Our experiments highlight the role of speech encoders and clustering granularity across different model scales, showing how optimal discretization strategies vary with model capacity. By examining cluster distributions and phonemic alignments, we investigate the effective use of the discrete vocabulary, uncovering both linguistic and paralinguistic patterns. Additionally, we explore the impact of clustering data selection on model robustness, highlighting the importance of domain matching between discretization training and target applications.
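The clustering granularity discussed above is typically realized by running k-means over self-supervised encoder features and treating cluster IDs as the discrete unit vocabulary. A minimal numpy sketch of that discretization step, with toy two-mode data and a strided init standing in for a real HuBERT-style feature pipeline (all names and data here are illustrative, not the paper's setup):

```python
import numpy as np

def kmeans_fit(features, k, iters=10):
    """Toy Lloyd's algorithm over (N, D) frame features.

    Init uses evenly strided frames for determinism; real pipelines
    use random restarts over millions of frames."""
    centroids = features[:: max(1, len(features) // k)][:k].copy()
    for _ in range(iters):
        # Squared distance from every frame to every centroid: (N, k).
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = features[labels == j].mean(axis=0)
    return centroids

def discretize(features, centroids):
    """Map each frame to the ID of its nearest centroid: the 'discrete unit'."""
    return ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(axis=1)

# Toy 'encoder output': 200 frames drawn around two well-separated modes,
# so a vocabulary of k=2 units should split them cleanly.
rng = np.random.default_rng(1)
feats = np.concatenate([rng.normal(0.0, 1.0, (100, 8)),
                        rng.normal(5.0, 1.0, (100, 8))])
units = discretize(feats, kmeans_fit(feats, k=2))
```

Varying `k` is the granularity knob the abstract refers to: coarser vocabularies merge phonemically distinct frames, while finer ones begin to encode paralinguistic detail.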
Adaptive Convolution for CNN-based Speech Enhancement Models
Dahan Wang, Xiaobin Rong, Shiruo Sun
et al.
Deep learning-based speech enhancement methods have significantly improved speech quality and intelligibility. Convolutional neural networks (CNNs) have been proven to be essential components of many high-performance models. In this paper, we introduce adaptive convolution, an efficient and versatile convolutional module that enhances the model's capability to adaptively represent speech signals. Adaptive convolution performs frame-wise causal dynamic convolution, generating time-varying kernels for each frame by assembling multiple parallel candidate kernels. A lightweight attention mechanism is proposed for adaptive convolution, leveraging both current and historical information to assign adaptive weights to each candidate kernel. This enables the convolution operation to adapt to frame-level speech spectral features, leading to more efficient extraction and reconstruction. We integrate adaptive convolution into various CNN-based models, highlighting its generalizability. Experimental results demonstrate that adaptive convolution significantly improves the performance with negligible increases in computational complexity, especially for lightweight models. Moreover, we present an intuitive analysis revealing a strong correlation between kernel selection and signal characteristics. Furthermore, we propose the adaptive convolutional recurrent network (AdaptCRN), an ultra-lightweight model that incorporates adaptive convolution and an efficient encoder-decoder design, achieving superior performance compared to models with similar or even higher computational costs.
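The central operation described above, assembling a time-varying kernel per frame from attention-weighted candidate kernels, can be sketched as follows. This is a single-channel numpy toy in which a plain linear scoring function stands in for the paper's lightweight attention (which also uses historical context); all shapes and names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def adaptive_conv(frames, candidates, attn_w):
    """Frame-wise dynamic convolution, greatly simplified.

    frames:     (T, F) one spectral feature vector per frame
    candidates: (K, F) K parallel candidate kernels
    attn_w:     (F, K) toy projection from a frame to kernel scores
    """
    out = np.empty(len(frames))
    for t, x in enumerate(frames):
        scores = softmax(x @ attn_w)   # adaptive weights per candidate kernel
        kernel = scores @ candidates   # assemble the time-varying kernel
        out[t] = kernel @ x            # apply it to the current frame
    return out

rng = np.random.default_rng(0)
y = adaptive_conv(rng.normal(size=(5, 4)),   # 5 frames, 4 features
                  rng.normal(size=(3, 4)),   # 3 candidate kernels
                  rng.normal(size=(4, 3)))
```

Because only the scoring step depends on the input, the extra cost over a static convolution is a small matrix product per frame, which is consistent with the negligible complexity increase the abstract reports.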
MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement
Nikolai Lund Kühne, Jesper Jensen, Jan Østergaard
et al.
With new sequence models like Mamba and xLSTM, several studies have shown that these models match or outperform the state-of-the-art in single-channel speech enhancement and audio representation learning. However, prior research has demonstrated that sequence models like LSTM and Mamba tend to overfit to the training set. To address this, previous works have shown that adding self-attention to LSTMs substantially improves generalization performance for single-channel speech enhancement. Nevertheless, neither the concept of hybrid Mamba and time-frequency attention models nor their generalization performance have been explored for speech enhancement. In this paper, we propose a novel hybrid architecture, MambAttention, which combines Mamba and shared time- and frequency-multi-head attention modules for generalizable single-channel speech enhancement. To train our model, we introduce VB-DemandEx, a dataset inspired by VoiceBank+Demand but with more challenging noise types and lower signal-to-noise ratios. Trained on VB-DemandEx, MambAttention significantly outperforms existing state-of-the-art discriminative LSTM-, xLSTM-, Mamba-, and Conformer-based systems of similar complexity across all reported metrics on two out-of-domain datasets: DNS 2020 without reverberation and EARS-WHAM_v2. MambAttention also matches or outperforms generative diffusion models in generalization performance while being competitive with language model baselines. Ablation studies highlight the importance of weight sharing between time- and frequency-multi-head attention modules for generalization performance. Finally, we explore integrating the shared time- and frequency-multi-head attention modules with LSTM and xLSTM, which yields a notable performance improvement on the out-of-domain datasets. Yet, MambAttention remains superior for cross-corpus generalization across all reported evaluation metrics.
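The weight-sharing idea the ablations highlight, one set of attention projections applied first along the time axis and then along the frequency axis, can be sketched in numpy. This is single-head and heavily simplified relative to the proposed multi-head modules; all shapes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def shared_tf_attention(spec, wq, wk, wv):
    """Apply the SAME wq/wk/wv along time, then along frequency.

    spec: (T, F, D) time-frequency embedding."""
    def attend(x):  # x: (batch, seq, D); self-attention over seq
        q, k, v = x @ wq, x @ wk, x @ wv
        a = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
        return a @ v
    # Attend over time (each frequency bin is a batch element) ...
    t_out = attend(spec.transpose(1, 0, 2)).transpose(1, 0, 2)
    # ... then over frequency (each frame is a batch element).
    return attend(t_out)

rng = np.random.default_rng(0)
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = shared_tf_attention(rng.normal(size=(4, 6, 8)), wq, wk, wv)
```

Reusing one projection set for both axes is the design choice whose removal hurt out-of-domain performance in the ablations; it also halves the attention parameter count.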
The axiological framework of professional foreign language communication
Natalia D. Galskova, Elena A. Komochkina, Nataliia V. Poliakova
The article adopts a cross-disciplinary perspective on foreign language instruction for non-linguistic students, with particular emphasis on the axiological aspects of developing their professional speaking and writing skills in a foreign language. The theoretical framework of the study integrates contemporary achievements in axiology, cognitive psychology, social studies, and language education into a coherent teaching methodology, designed to reflect the lifestyles, modes of thinking, preferences, values, and cultural practices of today’s youth. The study aims to identify the correlation between professional values and foreign language communication skills and to prove their positive effect on the formation of students’ professional speech culture through a foreign language course titled ‘Professional Speech Culture in the Native and Foreign Language’ for undergraduate science students. The course is tailored to the learners’ specific linguistic, cultural, and professional needs, focusing on oral and written communication in both L1 and L2 to support their future academic and professional success. The course incorporates authentic academic materials in both languages, collaborative activities, and grammar and vocabulary practice, with special attention given to the analysis and categorisation of specialised discourse units and markers within a cross-cultural professional context. The methodology was implemented and tested through a three-stage pedagogical experiment conducted among Physics and Mathematics students at the State University of Education during the 2024–2025 academic year. The results of current and final assessment procedures confirmed the effectiveness of the proposed methodology: students in the experimental group demonstrated greater fluency, fewer grammatical and lexical errors, and more substantial progress in both oral and written communication compared to the control group. Thus, the proposed methodology makes a valuable contribution to the theory and practice of foreign language instruction.
Exploring socioeconomic concerns in Italian social media discourse on migration: A sentiment analysis of X posts
Indelicato Alessandro, Martín Juan Carlos, Di Rocco Rossella Maria Pia
et al.
Social media platforms, particularly Twitter (now X), have transformed into significant arenas for public debate on immigration, showcasing a diverse spectrum of opinions and emotions. The current study utilized sentiment analysis, a method widely used in social media research, to delve into public sentiment in social network comments related to immigration, specifically focusing on Italy. The study compiled a dataset of X posts containing the keyword “immigrati” (immigrants) and conducted sentiment analysis to gauge the prevailing attitudes. The study aimed to uncover prevailing themes and connections in public discourse by identifying the most frequent topics. These patterns shed light on the primary topics and terms associated with immigration in Italy and their interrelationships. The findings contribute to a deeper understanding of sentiment and discourse surrounding immigration on social media, particularly within the Italian context. Results revealed the intricate and multifaceted nature of public opinion on immigration in contemporary digital spaces, with Italian concerns predominantly revolving around the irregular status of newcomers, the role of NGOs, and economic and safety issues.
SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks
Kai-Wei Chang, Haibin Wu, Yu-Kai Wang
et al.
Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM's inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneering research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experimental results show that the prompting method can achieve competitive performance compared to a strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, as more advanced speech LMs come onto the stage, the proposed prompting framework holds even greater potential.
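Reformulating a task as speech-to-unit generation amounts to building input and target sequences over the unit vocabulary. A schematic sketch of that data flow (all token IDs below are made up, and the actual framework learns continuous prompt vectors rather than discrete prompt tokens):

```python
def to_speech_to_unit_task(speech_units, task_prompt, label=None, label_to_unit=None):
    """Cast one classification example as speech-to-unit generation.

    Input sequence : [task prompt tokens] + [quantized speech units]
    Target sequence: the class label verbalized as unit-vocabulary IDs
    (for generation tasks the target is itself a unit sequence that can
    be re-synthesized into speech).
    """
    x = list(task_prompt) + list(speech_units)
    y = None if label is None else [label_to_unit[label]]
    return x, y

# Hypothetical IDs: prompt tokens 101/102, units 5/5/9/12, labels mapped to units.
x, y = to_speech_to_unit_task([5, 5, 9, 12], [101, 102],
                              label="yes", label_to_unit={"yes": 7, "no": 8})
```

Because both class labels and speech itself live in the same unit vocabulary, classification, sequence generation, and speech generation all reduce to one sequence-to-sequence interface, which is the unification the abstract describes.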
A Dual-Branch Parallel Network for Speech Enhancement and Restoration
Da-Hee Yang, Dail Kim, Joon-Hyuk Chang
et al.
We present a novel general speech restoration model, DBP-Net (dual-branch parallel network), designed to effectively handle complex real-world distortions including noise, reverberation, and bandwidth degradation. Unlike prior approaches that rely on a single processing path or separate models for enhancement and restoration, DBP-Net introduces a unified architecture with dual parallel branches: a masking-based branch for distortion suppression and a mapping-based branch for spectrum reconstruction. A key innovation behind DBP-Net lies in the parameter sharing between the two branches and a cross-branch skip fusion, where the output of the masking branch is explicitly fused into the mapping branch. This design enables DBP-Net to simultaneously leverage complementary learning strategies, suppression and generation, within a lightweight framework. Experimental results show that DBP-Net significantly outperforms existing baselines in comprehensive speech restoration tasks while maintaining a compact model size. These findings suggest that DBP-Net offers an effective and scalable solution for unified speech enhancement and restoration in diverse distortion scenarios.
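The cross-branch skip fusion, feeding the masking branch's suppressed output into the mapping branch, can be illustrated with one toy step. Plain linear layers stand in for the real branches here; this is a sketch of the data flow only, not the published architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbp_step(noisy_spec, w_mask, w_map):
    """One simplified dual-branch step.

    Masking branch: predicts a bounded mask and suppresses distortion.
    Mapping branch: regresses a spectrum from the noisy input fused
    with the masked output (the cross-branch skip fusion).
    """
    mask = sigmoid(noisy_spec @ w_mask)              # masking branch, in (0, 1)
    suppressed = mask * noisy_spec                   # suppression output
    fused = np.concatenate([noisy_spec, suppressed], axis=-1)
    restored = fused @ w_map                         # mapping branch reconstructs
    return suppressed, restored

rng = np.random.default_rng(0)
noisy = rng.normal(size=(3, 4))                      # 3 frames, 4 bins
suppressed, restored = dbp_step(noisy, rng.normal(size=(4, 4)),
                                rng.normal(size=(8, 4)))
```

The bounded mask can only attenuate what is already present, which is why a generative mapping path is needed to restore energy lost to bandwidth degradation; fusing the two lets each branch play to its strength.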
Speech Separation with Pretrained Frontend to Minimize Domain Mismatch
Wupeng Wang, Zexu Pan, Xinke Li
et al.
Speech separation seeks to separate individual speech signals from a speech mixture. Typically, most separation models are trained on synthetic data due to the unavailability of target reference in real-world cocktail party scenarios. As a result, there exists a domain gap between real and synthetic data when deploying speech separation models in real-world applications. In this paper, we propose a self-supervised domain-invariant pretrained (DIP) frontend that is exposed to mixture data without the need for target reference speech. The DIP frontend utilizes a Siamese network with two innovative pretext tasks, mixture predictive coding (MPC) and mixture invariant coding (MIC), to capture shared contextual cues between real and synthetic unlabeled mixtures. Subsequently, we freeze the DIP frontend as a feature extractor when training the downstream speech separation models on synthetic data. By pretraining the DIP frontend with the contextual cues, we expect that the speech separation skills learned from synthetic data can be effectively transferred to real data. To benefit from the DIP frontend, we introduce a novel separation pipeline to align the feature resolution of the separation models. We evaluate the speech separation quality on standard benchmarks and real-world datasets. The results confirm the superiority of our DIP frontend over existing speech separation models. This study underscores the potential of large-scale pretraining to enhance the quality and intelligibility of speech separation in real-world applications.
Unsupervised Face-Masked Speech Enhancement Using Generative Adversarial Networks With Human-in-the-Loop Assessment Metrics
Syu-Siang Wang, Jia-Yang Chen, Bo-Ren Bai
et al.
The utilization of face masks is an essential healthcare measure, particularly during times of pandemics, yet it can present challenges for communication in our daily lives. To address this problem, we propose a novel approach known as the human-in-the-loop StarGAN (HL-StarGAN) face-masked speech enhancement method. HL-StarGAN comprises a discriminator, a classifier, a metric assessment predictor, and a generator that leverages an attention mechanism. The metric assessment predictor, referred to as MaskQSS, incorporates human participants in its development and serves as a "human-in-the-loop" module during the learning process of HL-StarGAN. The overall HL-StarGAN model was trained using an unsupervised learning strategy that simultaneously focuses on the reconstruction of the original clean speech and the optimization of human perception. To implement HL-StarGAN, we curated a face-masked speech database named "FMVD," which comprises recordings from 34 speakers in three distinct face-masked scenarios and a clean condition. We conducted subjective and objective tests on the proposed HL-StarGAN using this database. The test results are as follows: (1) MaskQSS successfully predicted the quality scores of face-masked voices, outperforming several existing speech assessment methods. (2) The integration of the MaskQSS predictor enhanced the ability of HL-StarGAN to transform face-masked voices into high-quality speech; this enhancement is evident in both objective and subjective tests, outperforming conventional StarGAN and CycleGAN-based systems.
Effect of Persian Prosodic Units and Intonation Patterns on Intonation Pattern No. 7 in Iranian Students of Russian
Roya Azadi, Alireza Valipour
Iranian students of Russian often experience problems with phonetics and pronunciation that hinder their listening comprehension. Pronunciation mistakes and native-language interference are the main problems Iranian students face while studying Russian. No textbooks in Iran introduce a comparative analysis of the Russian and Persian phonetic systems. This research was an attempt to compare the intonation systems of both languages to help Iranian students master Intonation Pattern No. 7 and develop their communication skills in the Russian language. Oral speech of 17 Iranian students was processed phonetically and acoustically using PRAAT 6.2.10. This program used pitch and intensity filters to analyze Russian Intonation Pattern No. 7 and visualize speech sounds as graphs. Intonation is a set of prosodic means that segments and structures the speech flow in accordance with the message intended. In both languages, intonation patterns convey meaning and express emotions. In this study, the students' native intonation patterns prevented them from imitating Russian Intonation Pattern No. 7. This comparative acoustic analysis of the Russian and Persian intonation systems will help Iranian students avoid phonetic, rhythmic, and intonation errors in Russian. The results can be used by students, second-language teachers, speech therapists, and communication specialists.
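The pitch contours PRAAT visualizes for this kind of intonation analysis rest on autocorrelation-based F0 estimation, which can be sketched minimally (PRAAT's actual algorithm adds windowing, candidate tracking, and voicing decisions; this toy version only picks the strongest lag):

```python
import numpy as np

def estimate_f0(frame, sr, fmin=75, fmax=500):
    """Crude F0 estimate: pick the autocorrelation peak whose lag falls
    in the plausible pitch-period range [sr/fmax, sr/fmin]."""
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

# A 40 ms frame of a pure 200 Hz tone should yield F0 close to 200 Hz.
sr = 16000
t = np.arange(int(0.04 * sr)) / sr
f0 = estimate_f0(np.sin(2 * np.pi * 200.0 * t), sr)
```

Running such an estimator frame by frame produces the pitch track whose shape distinguishes one intonation pattern from another.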
A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
Siyang Wang, Gustav Eje Henter, Joakim Gustafson
et al.
Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is, however, unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrograms, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts
Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model
Kenichi Fujita, Takanori Ashihara, Hiroki Kanagawa
et al.
This paper proposes a zero-shot text-to-speech (TTS) method conditioned on a speech-representation model acquired through self-supervised learning (SSL). Conventional methods with embedding vectors from x-vectors or global style tokens still leave a gap in reproducing the speaker characteristics of unseen speakers. A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained on a large amount of data. We also introduce separate conditioning of acoustic features and a phoneme duration predictor to disentangle rhythm-based speaker characteristics from acoustic-feature-based ones. These disentangled embeddings enable better reproduction performance for unseen speakers and speech-rhythm transfer conditioned on different utterances. Objective and subjective evaluations showed that the proposed method can synthesize speech with improved similarity and achieve speech-rhythm transfer.
Speech therapy intervention program for language development in preschoolers
Iulia-Maria PAȘCALĂU
This paper aims to identify the developmental characteristics of language in preschool children, capturing the impact of the speech therapy program on language development, and utilizing the principles of the TALK program in speech therapy intervention. The study sample consisted of 22 preschoolers, aged 5 to 6 years, from a kindergarten in Cluj-Napoca. After dividing the participants into two groups, control and experimental, the evaluation was performed using three assessment tools: the Curriculum TALK, Portage (Language subscale), and the Logopedic Album. A speech therapy intervention program was implemented for the experimental group, followed by a final evaluation conducted using the same assessment tools. After collecting the results from the pre- and post-tests, all data were centralized in an SPSS database for processing. Thus, by using the paired samples t-test (hypothesis 1), the independent samples t-test (hypothesis 2), and Pearson’s bivariate correlation (hypothesis 3), it was found that all research hypotheses were confirmed.
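The statistics used in the study are standard and easy to compute without a stats package; a plain numpy sketch of the paired-samples t statistic and Pearson's r on made-up pre/post scores (the numbers are illustrative, not the study's data):

```python
import numpy as np

def paired_t(pre, post):
    """Paired-samples t statistic (H0: mean of the differences is zero)."""
    d = np.asarray(post, float) - np.asarray(pre, float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

def pearson_r(x, y):
    """Pearson's bivariate correlation coefficient."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

pre = [10, 12, 9, 11, 13]      # hypothetical pre-test scores
post = [14, 15, 12, 15, 16]    # hypothetical post-test scores
t = paired_t(pre, post)        # large positive t: consistent improvement
r = pearson_r(pre, post)       # high r: pre and post scores co-vary
```

The paired test compares each child to themselves across the intervention, which is why it, rather than an independent-samples test, fits hypothesis 1.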
A psycholinguistic approach to language and communication in autism spectrum disorder (ASD)
Ioana IRIMIA
Psycholinguistics is a field that combines methods and theories from psychology and linguistics to gain a fuller understanding of human language. The connection between the human psyche and language is the focus of research in psycholinguistics. Autism Spectrum Disorder (ASD) is a common and heterogeneous neurodevelopmental disorder characterized by language impairment, social and communicative deficits and repetitive behaviours. Echolalia is considered a predominant feature of ASD. Early intervention should be started as soon as a diagnosis is made, as social, play and language domains do not develop at the expected rate or in the expected multimodal, integrated manner. Using the language sample analysis technique, it is observed that echolalia and the number of abstract words do not influence mean utterance length, but the number of novel words correlates with mean utterance length. The psycholinguistic approach helps to gain a deeper understanding of language deficits and influences speech therapy intervention.
Low-resource automatic speech recognition and error analyses of oral cancer speech
B. Halpern, Siyuan Feng, R.J.J.H. van Son
et al.
In this paper, we introduce a new corpus of oral cancer speech and present our study on the automatic recognition and analysis of oral cancer speech. A two-hour English oral cancer speech dataset is collected from YouTube. Formulated as a low-resource oral cancer ASR task, we investigate three acoustic modelling approaches that have previously worked well in low-resource scenarios, using two different architectures, a hybrid architecture and a transformer-based end-to-end (E2E) model: (1) a retraining approach; (2) a speaker adaptation approach; and (3) a disentangled representation learning approach (only using the hybrid architecture). The approaches achieve a (1) 4.7% (hybrid) and 7.5% (E2E); (2) 7.7%; and (3) 2.0% absolute word error rate reduction, respectively, compared to a baseline system which is not trained on oral cancer speech. A detailed analysis of the speech recognition results shows that: (1) plosives and certain vowels are the most difficult sounds to recognise in oral cancer speech, a problem successfully alleviated by our proposed approaches; (2) these sounds are, however, also relatively poorly recognised in healthy speech, with the exception of /p/; (3) recognition performance of certain phonemes is strongly data-dependent; (4) in terms of the manner of articulation, E2E performs better with the exception of vowels, which contribute substantially to overall performance, while in terms of the place of articulation, vowels, labiodentals, dentals, and glottals are better captured by hybrid models, and E2E is better on bilabial, alveolar, postalveolar, palatal, and velar information; and (5) our analysis provides some guidelines for selecting words that can be used as voice commands in ASR systems for oral cancer speakers.
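The absolute reductions quoted above are differences in word error rate, the word-level Levenshtein distance normalized by reference length; a compact reference implementation:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / ref words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

# One substitution ('sat' -> 'sad') and one deletion ('the') in 6 words: WER = 1/3.
err = wer("the cat sat on the mat", "the cat sad on mat")
```

Because WER is normalized by the reference, an absolute reduction of 4.7% means 4.7 fewer errors per 100 reference words, regardless of the baseline level.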
Assessment of Oral Health Status and Communication Barriers in Hearing- and Speech-Impaired Children in Jeddah City
Y. Alyami, Rakan N Alamri, M. Abdulsamad
et al.
Background: Hearing- and speech-impaired people form a significant part of society. The literature reveals that these people have compromised oral health compared to the general population, owing to decreased levels of oral health awareness and communication barriers. The aim of the study was to assess the oral health status of hearing- and speech-impaired children in Jeddah city. Methodology: One hundred sixteen children aged five to 16 years of either gender with hearing and speech impairment were selected. Parameters recorded were dental caries, gingival diseases, and malocclusion. The fluorosis status was also recorded. Results: The total number of children in the age group five to seven years was 16, eight to 10 years was 25, 11 to 13 years was 30, and 14 to 16 years was 45. Of the 116 children, there were 60 males and 56 females. Among the 116 children, 25 had decayed teeth, 12 had missing teeth, and 30 had filled teeth. Malocclusion was found to be class II div I in 20, class II div II in 11, class III in five, spacing in 17, and rotation in 10. It was found that 32 had mild fluorosis and 74 had gingivitis. The method of communication was sign language in 62%, lip-reading in 20%, and hearing aids in 18% of the population. Conclusion: Special care needs to be taken with children who have hearing and speech disabilities together with poor oral hygiene status. A careful oral examination is needed in these children.
Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need
Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman
et al.
The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect pronunciation errors with high accuracy (only 60% precision at 40%-80% recall). One of the key problems is the low availability of mispronounced speech that is needed for the reliable training of pronunciation error detection models. If we had a generative model that could mimic non-native speech and produce any amount of training data, then the task of detecting pronunciation errors would be much easier. We present three innovative techniques based on phoneme-to-phoneme (P2P), text-to-speech (T2S), and speech-to-speech (S2S) conversion to generate correctly pronounced and mispronounced synthetic speech. We show that these techniques not only improve the accuracy of three machine learning models for detecting pronunciation errors but also help establish a new state-of-the-art in the field. Earlier studies have used simple speech generation techniques such as P2P conversion, but only as an additional mechanism to improve the accuracy of pronunciation error detection. We, on the other hand, consider speech generation to be the first-class method of detecting pronunciation errors. The effectiveness of these techniques is assessed in the tasks of detecting pronunciation and lexical stress errors. Non-native English speech corpora of German, Italian, and Polish speakers are used in the evaluations. The best proposed S2S technique improves the accuracy of detecting pronunciation errors in the AUC metric by 41%, from 0.528 to 0.749, compared to the state-of-the-art approach.
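The AUC metric in which the reported gain (0.528 to 0.749) is measured is the area under the ROC curve, equivalently the probability that a randomly chosen mispronounced sample receives a higher error score than a correctly pronounced one. A minimal implementation of that rank formulation:

```python
def roc_auc(scores, labels):
    """ROC AUC via the Mann-Whitney formulation: the fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Labels: 1 = mispronounced, 0 = correct; scores = detector's error score.
auc = roc_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])  # 3 of 4 pairs ranked right
```

An AUC of 0.5 is chance-level ranking, so the move from 0.528 to 0.749 is a substantial gain in the detector's ability to separate mispronounced from correctly pronounced samples at any operating threshold.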