Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang
et al.
Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, particularly in the presence of unseen noise and channel distortions. In view of this, in this paper we present URSA-GAN, a unified and domain-aware generative framework specifically designed to mitigate mismatches in both noise and channel conditions. URSA-GAN leverages a dual-embedding architecture consisting of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, facilitating the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To further enhance generalization, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
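As a rough, non-authoritative illustration of the dynamic stochastic perturbation idea described above, the PyTorch sketch below adds Gaussian noise to the conditioning embeddings with a scale that follows a simple training-progress schedule; the module name, schedule, and scale range are assumptions rather than the paper's implementation.

```python
# Hypothetical sketch of dynamic stochastic perturbation on conditioning
# embeddings; the decay schedule and scale range are illustrative assumptions.
import torch
import torch.nn as nn

class DynamicStochasticPerturbation(nn.Module):
    def __init__(self, sigma_max=0.1, sigma_min=0.01):
        super().__init__()
        self.sigma_max = sigma_max
        self.sigma_min = sigma_min

    def forward(self, embedding, progress):
        """Add Gaussian noise whose scale decays as training progresses.

        embedding: (batch, dim) noise or channel embedding
        progress:  scalar in [0, 1], e.g. current_step / total_steps
        """
        if not self.training:
            return embedding
        sigma = self.sigma_max + (self.sigma_min - self.sigma_max) * progress
        return embedding + sigma * torch.randn_like(embedding)

# Example: perturb a noise embedding and a channel embedding before
# concatenating them as generator conditioning.
perturb = DynamicStochasticPerturbation()
perturb.train()
noise_emb = torch.randn(4, 128)
channel_emb = torch.randn(4, 128)
cond = torch.cat([perturb(noise_emb, 0.3), perturb(channel_emb, 0.3)], dim=-1)
```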
Speech deepfake detection (SDD) is essential for maintaining trust in voice-driven technologies and digital media. Although recent SDD systems increasingly rely on self-supervised learning (SSL) representations that capture rich contextual information, complementary signal-driven acoustic features remain important for modeling fine-grained structural properties of speech. Most existing acoustic front ends are based on time-frequency representations, which do not fully exploit higher-order spectral dependencies inherent in speech signals. We introduce a cyclostationarity-inspired acoustic feature extraction framework for SDD based on spectral correlation density (SCD). The proposed features model periodic statistical structures in speech by capturing spectral correlations between frequency components. In particular, we propose temporally structured SCD features that characterize the evolution of spectral and cyclic-frequency components over time. The effectiveness and complementarity of the proposed features are evaluated using multiple countermeasure architectures, including convolutional neural networks, SSL-based embedding systems, and hybrid fusion models. Experiments on ASVspoof 2019 LA, ASVspoof 2021 DF, and ASVspoof 5 demonstrate that SCD-based features provide complementary discriminative information to SSL embeddings and conventional acoustic representations. Notably, fusion of SSL and SCD embeddings reduces the equal error rate on ASVspoof 2019 LA from $8.28\%$ to $0.98\%$, and yields consistent improvements on the challenging ASVspoof 5 dataset. The results highlight cyclostationary signal analysis as a theoretically grounded and effective front end for speech deepfake detection.
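To make the spectral-correlation idea concrete, the NumPy sketch below estimates a discrete spectral correlation surface by averaging outer products of STFT frames, so that entry (k, l) measures the correlation between frequency bins k and l (the cyclic frequency corresponds to the bin difference k − l); segment length, window, and normalization are illustrative assumptions and not the paper's exact SCD front end.

```python
# Minimal sketch of a spectral correlation estimate via an averaged
# cyclic periodogram; parameters are illustrative, not the paper's setup.
import numpy as np

def scd_estimate(x, n_fft=256, hop=128):
    """Estimate a discrete spectral correlation surface from a 1-D signal.

    Entry (k, l) approximates E[X(k) X*(l)] over short windowed frames,
    i.e. the correlation between spectral components at bins k and l."""
    window = np.hanning(n_fft)
    frames = [np.fft.fft(x[s:s + n_fft] * window)
              for s in range(0, len(x) - n_fft + 1, hop)]
    X = np.stack(frames)                       # (n_frames, n_fft)
    return (X.T @ np.conj(X)) / (X.shape[0] * n_fft)

rng = np.random.default_rng(0)
speech_like = rng.standard_normal(16000)       # placeholder for a speech signal
S = scd_estimate(speech_like)
print(S.shape)                                 # (256, 256)
```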
Karl El Hajal, Enno Hermann, Ajinkya Kulkarni
et al.
Automatic speech recognition (ASR) systems are well known to perform poorly on dysarthric speech. Previous works have addressed this by modifying the speaking rate to reduce the mismatch with typical speech. Unfortunately, these approaches rely on transcribed speech data to estimate speaking rates and phoneme durations, which might not be available for unseen speakers. Therefore, we combine unsupervised rhythm and voice conversion methods based on self-supervised speech representations to map dysarthric speech to typical speech. We evaluate the outputs with a large ASR model pre-trained on healthy speech, without further fine-tuning, and find that the proposed rhythm conversion especially improves performance for speakers of the Torgo corpus with more severe dysarthria. Code and audio samples are available at https://idiap.github.io/RnV .
This paper introduces EmoSSLSphere, a novel framework for multilingual emotional text-to-speech (TTS) synthesis that combines spherical emotion vectors with discrete token features derived from self-supervised learning (SSL). By encoding emotions in a continuous spherical coordinate space and leveraging SSL-based representations for semantic and acoustic modeling, EmoSSLSphere enables fine-grained emotional control, effective cross-lingual emotion transfer, and robust preservation of speaker identity. We evaluate EmoSSLSphere on English and Japanese corpora, demonstrating significant improvements in speech intelligibility, spectral fidelity, prosodic consistency, and overall synthesis quality. Subjective evaluations further confirm that our method outperforms baseline models in terms of naturalness and emotional expressiveness, underscoring its potential as a scalable solution for multilingual emotional TTS.
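As a hedged illustration only, the snippet below shows one way a spherical parameterization can give continuous emotional control: the direction of the vector selects the emotion and the radius its intensity. The 3-D coordinates and their interpretation are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative mapping from spherical coordinates to a Cartesian emotion
# conditioning vector; radius = intensity, angles = emotion placement
# (assumed parameterization, not the authors' definition).
import numpy as np

def spherical_emotion_vector(intensity, theta, phi):
    """Map (radius, polar angle, azimuth) to a 3-D Cartesian vector."""
    x = intensity * np.sin(theta) * np.cos(phi)
    y = intensity * np.sin(theta) * np.sin(phi)
    z = intensity * np.cos(theta)
    return np.array([x, y, z], dtype=np.float32)

# A "strong" and a "mild" rendition of the same emotion share a direction
# and differ only in radius, giving fine-grained control over intensity.
strong = spherical_emotion_vector(1.0, np.pi / 3, np.pi / 4)
mild = spherical_emotion_vector(0.3, np.pi / 3, np.pi / 4)
```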
Synthesizing second-language (L2) speech is potentially of high value for L2 language learning and feedback. However, due to the lack of L2 speech synthesis datasets, it is difficult to synthesize L2 speech for low-resourced languages. In this paper, we provide a practical solution for editing native speech to approximate L2 speech and present PPG2Speech, a diffusion-based multispeaker Phonetic-Posteriorgrams-to-Speech model that is capable of editing a single phoneme without text alignment. We use Matcha-TTS's flow-matching decoder as the backbone, transforming Phonetic Posteriorgrams (PPGs) into mel-spectrograms conditioned on external speaker embeddings and pitch. PPG2Speech strengthens Matcha-TTS's flow-matching decoder with Classifier-free Guidance (CFG) and Sway Sampling. We also propose a new task-specific objective evaluation metric, Phonetic Aligned Consistency (PAC), computed between the edited PPGs and the PPGs extracted from the synthesized speech to quantify editing effects. We validate the effectiveness of our method on Finnish, a low-resourced, nearly phonetic language, using approximately 60 hours of data. We conduct objective and subjective evaluations comparing the naturalness, speaker similarity, and editing effectiveness of our approach with TTS-based editing. Our source code is published at https://github.com/aalto-speech/PPG2Speech.
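The abstract mentions strengthening the flow-matching decoder with Classifier-free Guidance (CFG). As a hedged illustration of how CFG is commonly applied at inference time (not the authors' code), the sketch below blends conditional and unconditional vector-field predictions; the decoder interface, the toy stand-in, and the guidance scale are assumptions.

```python
# Generic CFG sketch for a flow-matching decoder; interfaces are assumed.
import torch

def cfg_vector_field(decoder, x_t, t, cond, guidance_scale=2.0):
    """Blend conditional and unconditional vector-field predictions.

    `decoder(x_t, t, cond)` is assumed to return the predicted vector field;
    `cond=None` stands in for the dropped-condition branch used when
    training with condition dropout for CFG."""
    v_cond = decoder(x_t, t, cond)
    v_uncond = decoder(x_t, t, None)
    return v_uncond + guidance_scale * (v_cond - v_uncond)

# Toy decoder standing in for the flow-matching network, only to make the
# sketch executable; a real decoder would be conditioned on the PPGs,
# speaker embedding, and pitch.
def toy_decoder(x_t, t, cond):
    return x_t * (0.1 if cond is not None else 0.0)

x_t = torch.randn(1, 80, 100)          # noisy mel-spectrogram state
v = cfg_vector_field(toy_decoder, x_t, torch.tensor(0.5), cond="dummy")
```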
Christiaan Jacobs, Annelien Smith, Daleen Klop
et al.
We develop automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. Oral narratives provide a way to assess children's language development before they learn to read. We consider a range of prior child-speech ASR strategies to determine which is best suited to this unique setting. Using Whisper and only 5 minutes of transcribed in-domain child speech, we find that additional in-domain adult data (adult speech matching the story domain) provides the biggest improvement, especially when coupled with voice conversion. Semi-supervised learning also helps for both languages, while parameter-efficient fine-tuning helps on Afrikaans but not on isiXhosa (which is under-represented in the Whisper model). Few child-speech studies look at non-English data, and even fewer at the preschool ages of 4 and 5. Our work therefore represents a unique validation of a wide range of previous child-speech ASR strategies in an under-explored setting.
Although paralinguistic cues are often considered the primary drivers of speech emotion recognition (SER), we investigate the role of lexical content extracted from speech and show that it can achieve competitive, and in some cases higher, performance than acoustic models. On the MELD dataset, our lexical-based approach obtains a weighted F1-score (WF1) of 51.5%, compared to 49.3% for an acoustic-only pipeline with a larger parameter count. Furthermore, we analyze different self-supervised learning (SSL) speech and text representations, conduct a layer-wise study of transformer-based encoders, and evaluate the effect of audio denoising.
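For reference, the weighted F1-score (WF1) reported above weights each class's F1 by its support, which matters on imbalanced sets such as MELD. A minimal scikit-learn example with synthetic labels is shown below.

```python
# Minimal WF1 computation with synthetic labels (not the paper's data).
from sklearn.metrics import f1_score

y_true = ["neutral"] * 6 + ["joy"] * 3 + ["anger"] * 1
y_pred = ["neutral"] * 5 + ["joy"] + ["joy"] * 2 + ["anger"] * 2
wf1 = f1_score(y_true, y_pred, average="weighted")
print(f"WF1 = {wf1:.3f}")
```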
This article addresses the importance of animal tales, particularly those featuring the fox (raposa), in the literary heritage of the province of León. These tales, with classical and medieval roots, have endured across generations thanks to oral transmission and to the written compilations of local researchers. The study focuses on analyzing Leonese versions of tales with the fox as protagonist, highlighting their narrative functions and literary aspects, as well as the linguistic and geographic differences among the collected narratives. The analysis reveals divergences in the narrative, themes, and morals of the fox tales, even within a single tale type across different areas of the province. The study emphasizes the relevance of understanding and preserving regional literary heritage and advocates narratological analyses to uncover its deeper meaning.
Research on preterm infants has more often focused on the morbidity and mortality of extremely preterm infants. Meanwhile, the number of infants born between 34 weeks and 36 weeks 6 days of gestation is increasing, and interest in these children has not waned for well over a decade, because research shows their high susceptibility to neurodevelopmental delay, cerebral palsy, and chronic respiratory and metabolic diseases. Late preterm infants constitute a high-risk group and therefore require the earliest possible diagnostics and stimulation as prevention of cognitive and communication disorders. The combination of early medical intervention and early developmental stimulation, or in extreme cases consistently conducted neurobiological therapy, appears to have the greatest impact on the developmental prospects and quality of life of this group of children and their families. Long-term monitoring of a preterm child's development, including cognitive (and linguistic) development, offers every preterm child a chance of educational success. The article presents the results of speech-language assessments of a group of over a dozen late preterm children and prompts reflection on the impact of late prematurity on the development of this group of children.
Background/Objectives: This study aimed to assess the reliability and validity of the Korean version of the Eating and Drinking Ability Classification System (EDACS) for children with cerebral palsy (CP). Methods: This was a prospective cross-sectional study conducted to psychometrically assess 40 children with CP. Reliability was evaluated by a physician in pediatric rehabilitation, a speech–language therapist (SaLT), and caregivers. Results: For both the EDACS level and the level of assistance, agreement between the physician and the SaLT was almost perfect (κW = 0.940, 0.919). Agreement between the physician and the caregivers was substantial (κW = 0.618, 0.592), whereas it was moderate between the SaLT and the caregivers (κW = 0.557, 0.556). Intra-rater reliability was almost perfect for the physician (κW = 0.979), the SaLT (κW = 0.980), and the caregivers (κW = 0.980). The EDACS showed moderate to high correlations with the Functional Oral Intake Scale, Gross Motor Function Classification System, Manual Ability Classification System, and Communication Function Classification System (Kτ = −0.863, 0.656, 0.720, 0.616). Conclusions: These findings support the Korean EDACS as a reliable tool for classifying eating and drinking abilities in children with CP, thereby enhancing its clinical utility and communication between professionals and non-professionals.
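For readers unfamiliar with the weighted kappa (κW) statistic used above, the snippet below computes it for two synthetic raters assigning ordinal EDACS levels (I–V coded 1–5) with scikit-learn; the ratings and the linear weighting are illustrative assumptions, not the study's data.

```python
# Weighted kappa between two raters on ordinal levels (synthetic example).
from sklearn.metrics import cohen_kappa_score

physician = [1, 2, 2, 3, 4, 5, 3, 2, 1, 4]
therapist = [1, 2, 3, 3, 4, 5, 3, 2, 2, 4]
kappa_w = cohen_kappa_score(physician, therapist, weights="linear")
print(f"weighted kappa = {kappa_w:.3f}")
```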
Recent advances in self-supervised speech models have shown significant improvements on many downstream tasks. However, these models are predominantly trained with frame-level objectives, which can fall short in spoken language understanding tasks that require semantic comprehension. Existing works often rely on additional speech-text data as intermediate targets, which is costly in real-world settings. To address this challenge, we propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process, where the targets are derived from a visually grounded speech model, notably eliminating the need for paired speech-text data. Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
The author explores the threat as a speech genre of verbal aggression. It is argued that the threat is most often found in the texts of modern online communication, whose characteristics are close to a synthetic oral-written form of communication and the colloquial genre. It is emphasized that in modern studies the threat is examined from both legal and linguistic points of view. The author argues that there are two main types of threat statements: a threat-warning, which includes a causal action, and a threat-punishment. A specific sub-genre of threat is blackmail, which involves manipulation by means of information compromising the addressee. Semantically, a threat can be explicit, clearly expressed, or implicit, whose interpretation requires an appeal to the logic of the statement. Implicit threat in online communication is the most difficult type to research, since its interpretation involves analyzing the logic of constructing a statement by identifying the linguistic techniques presented in the text, such as the use of euphemisms, idioms, semantic convolution, metaphors, sarcasm, references to precedent cases, and non-verbal components of communication (images, emoji). Implicit threat is studied using data from various explanatory linguistic dictionaries and fragments of texts submitted for forensic linguistic examination. It is noted that the genre of a criminal threat should be distinguished from related genres such as promises, orders, warnings, verbally expressed suicidal intentions, malevolence, a threat in a broad sense, and calls to violent action.
Background: There is a dearth of research on the psychological experiences surrounding childhood communication disorders. Caregivers of these children are one source of this information, since the child exists within a system. Literature on the experiences of caregivers of children with communication disorders, specifically in a South African urban context, is lacking.
Objectives: The article presents the reported experiences of six parental caregivers raising a child with a communication disorder within Johannesburg.
Method: Data were collected via semi-structured interviews and underwent an interpretative phenomenological analysis.
Results: Five themes are presented: feeling out of control; barriers to accessing services; caregivers left speechless; a misunderstood disability; and relinquishing control: ‘I needed to be a mommy’.
Conclusion: The caregivers relayed an initially negative experience of raising a child with a communication disorder, marked by worry and fear for the future. In accessing services, they described multiple barriers, including an overreliance on the private sector, financial barriers, time constraints, limited resources and poor information on available resources. Eventually, the caregivers were able to identify strengths, skills and capabilities within their children and themselves that allowed for adjustment and coping.
Contribution: The findings highlight the importance of considering how barriers to care may marginalise and seclude parental caregivers. Over time, the caregivers were able to empower themselves by building their own knowledge and reformulating their narratives, challenging assumptions and reframing the source of disability as a product of the failure of society.
Martina Costanzo, Francesco Puleio, Roberto Lo Giudice
et al.
Aim: This study investigates the collaborative efforts between speech language therapists and dentists to address the challenges faced by patients with concurrent dental and speech issues. It underscores the importance of understanding the complex interactions within the craniofacial system and its implications for both oral health and communication abilities.
Methods: A questionnaire-based study was conducted involving speech language therapists and dentists, using Google Forms as the medium of data collection. The study was disseminated through professional networks, and practitioners provided information on their demographics, collaboration practices, and the perceived advantages of such interdisciplinary work. All ethical standards were adhered to, ensuring the confidentiality and informed consent of the participants.
Results: The findings indicate that dentists often see patients for the first time beyond the recommended ages, primarily for existing conditions, deviating from preventive norms. About half communicate with speech language therapists on a monthly basis, revealing gaps in early care and collaboration. Speech language therapists focus on language disorders and orofacial myofunctional disorders (SMOF), with 37% collaborating frequently with dentists. This interaction highlights the potential for enhanced patient outcomes through interdisciplinary cooperation. Despite the challenges identified, both groups acknowledged the benefits of their collaborative efforts.
Conclusions: The partnership between speech language therapists and dentists represents a vital area of healthcare, aimed at tackling the complex issues of oral health and communication through a multidisciplinary approach. Although obstacles in collaboration were identified, the positive outcomes observed from such partnerships emphasize the importance of developing a unified framework to improve integrated healthcare delivery in the future.
The article considers the specific features of speech communication when participants in legal proceedings use remote technologies (videoconferencing), with respect to the possibility of using manipulative speech devices to influence the addressee. It notes the influence of technical communication channels and sound-reproduction equipment, which introduce distortions into the speech signal, on speech production and perception. It concludes that participants who are physically present in the courtroom and those attending remotely have different opportunities to deploy the arsenal of verbal and non-verbal means of persuading the court through manipulative and non-manipulative influence.
Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of the visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in the mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end, and Conformer ASR back-end. Audio-visual integrated front-end architectures performing speech separation and dereverberation in a pipelined or joint fashion via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and ASR back-end components is minimized by end-to-end joint fine-tuning using either the ASR cost function alone or its interpolation with the speech enhancement loss. Experiments were conducted on overlapped and reverberant mixture speech data constructed using simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline by 9.1% and 6.2% absolute (41.7% and 36.0% relative) word error rate (WER) reductions. Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
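As a rough sketch of the mask-based MVDR component mentioned above (not the paper's implementation), the NumPy function below estimates speech and noise spatial covariance matrices from time-frequency masks and applies a standard MVDR solution with a fixed reference channel; the diagonal loading and reference-channel choice are illustrative assumptions.

```python
# Mask-based MVDR beamforming sketch; covariance estimation and the
# reference-vector formulation follow a standard textbook recipe.
import numpy as np

def mask_mvdr(stft, speech_mask, noise_mask, ref_ch=0):
    """stft: (channels, freq, frames) complex multi-channel STFT.
    speech_mask, noise_mask: (freq, frames) values in [0, 1].
    Returns the beamformed single-channel STFT of shape (freq, frames)."""
    C, F, T = stft.shape
    out = np.zeros((F, T), dtype=complex)
    u = np.zeros(C)
    u[ref_ch] = 1.0
    for f in range(F):
        X = stft[:, f, :]                                   # (C, T)
        phi_s = (speech_mask[f] * X) @ X.conj().T / T       # speech covariance
        phi_n = (noise_mask[f] * X) @ X.conj().T / T + 1e-6 * np.eye(C)
        num = np.linalg.solve(phi_n, phi_s)                 # Phi_n^{-1} Phi_s
        w = (num / (np.trace(num) + 1e-9)) @ u               # MVDR weights
        out[f] = w.conj() @ X
    return out
```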
Reconstructing natural speech from neural activity is vital for enabling direct communication via brain-computer interfaces. Previous efforts have explored the conversion of neural recordings into speech using complex deep neural network (DNN) models trained on extensive neural recording data, which is resource-intensive under regular clinical constraints. However, achieving satisfactory performance in reconstructing speech from limited-scale neural recordings has been challenging, mainly due to the complexity of speech representations and the neural data constraints. To overcome these challenges, we propose a novel transfer learning framework for neural-driven speech reconstruction, called Neural2Speech, which consists of two distinct training phases. First, a speech autoencoder is pre-trained on readily available speech corpora to decode speech waveforms from the encoded speech representations. Second, a lightweight adaptor is trained on the small-scale neural recordings to align the neural activity and the speech representation for decoding. Remarkably, our proposed Neural2Speech demonstrates the feasibility of neural-driven speech reconstruction even with only 20 minutes of intracranial data, which significantly outperforms existing baseline methods in terms of speech fidelity and intelligibility.
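A minimal sketch of the two-phase idea described above, assuming a frozen pre-trained speech autoencoder and a small adaptor trained to map neural features into its latent space; the layer sizes, feature dimensions, and loss are illustrative assumptions rather than the authors' configuration.

```python
# Hedged sketch: frozen speech encoder provides latent targets; a lightweight
# adaptor learns to map neural recordings into the same latent space.
import torch
import torch.nn as nn

class NeuralAdaptor(nn.Module):
    def __init__(self, n_electrodes, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_electrodes, hidden), nn.GELU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, neural):          # (batch, frames, n_electrodes)
        return self.net(neural)         # (batch, frames, latent_dim)

# Placeholder for the pre-trained autoencoder's encoder, kept frozen.
speech_encoder = nn.Linear(80, 64).eval()
for p in speech_encoder.parameters():
    p.requires_grad_(False)

adaptor = NeuralAdaptor(n_electrodes=128, latent_dim=64)
opt = torch.optim.Adam(adaptor.parameters(), lr=1e-4)

neural = torch.randn(2, 100, 128)       # toy intracranial features
mel = torch.randn(2, 100, 80)           # paired mel-spectrogram frames
loss = nn.functional.mse_loss(adaptor(neural), speech_encoder(mel))
loss.backward()
opt.step()
```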
This work explores the use of constant-Q transform based modulation spectral features (CQT-MSF) for speech emotion recognition (SER). Human perception and analysis of sound comprise two important cognitive stages: early auditory analysis and cortex-based processing. Early auditory analysis operates on a spectrogram-based representation, whereas cortex-based analysis extracts temporal modulations from the spectrogram. This temporal modulation representation of the spectrogram is called the modulation spectral feature (MSF). As the constant-Q transform (CQT) provides higher resolution in the emotion-salient low-frequency regions of speech, we find that the CQT-based spectrogram, together with its temporal modulations, provides a representation enriched with emotion-specific information. We argue that CQT-MSF, when used with a 2-dimensional convolutional network, can provide a time-shift invariant and deformation-insensitive representation for SER. Our results show that CQT-MSF outperforms the standard mel-scale spectrogram and its modulation features on two popular SER databases, Berlin EmoDB and RAVDESS. We also show that our proposed feature outperforms the shift- and deformation-invariant scattering transform coefficients, thereby showing the importance of combining hand-crafted and self-learned feature extraction rather than relying on fully hand-crafted features. Finally, we perform Grad-CAM analysis to visually inspect the contribution of constant-Q modulation features to SER.
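A minimal sketch of the CQT-MSF pipeline described above, assuming librosa for the constant-Q transform: compute a CQT magnitude spectrogram, then take a Fourier transform along time in each frequency bin to obtain its temporal modulations. The hop length, bin counts, and retained modulation range are illustrative choices, not the paper's configuration.

```python
# CQT spectrogram followed by per-bin temporal modulation analysis.
import numpy as np
import librosa

# Any mono signal works; a synthetic tone keeps the example self-contained.
sr = 22050
y = librosa.tone(440, sr=sr, duration=2.0)

# Constant-Q magnitude spectrogram (early auditory analysis stage).
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=256,
                         n_bins=84, bins_per_octave=12))    # (84, frames)

# Modulation spectrum: Fourier analysis along time of each CQT bin's
# envelope (cortex-like stage), keeping only low modulation frequencies.
mod_spec = np.abs(np.fft.rfft(cqt, axis=1))[:, :32]          # (84, 32)
cqt_msf = np.log1p(mod_spec)
print(cqt_msf.shape)
```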