Achieving precise and controllable emotional expression is crucial for producing natural and context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion embeddings or external guidance, limiting their ability to model emotion-specific latent characteristics. To address this gap, we present EmoShift, a lightweight activation-steering framework incorporating an EmoSteer layer, which learns a steering vector for each target emotion in the output embedding space to capture its latent offset and maintain stable, appropriate expression across utterances and categories. With only 10M trainable parameters, less than 1/30 of full fine-tuning, EmoShift outperforms zero-shot and fully fine-tuned baselines in objective and subjective evaluations, enhancing emotional expressiveness while preserving naturalness and speaker similarity. Further analysis confirms the proposed EmoSteer layer's effectiveness and reveals its potential for controllable emotional intensity in speech synthesis.
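To make the activation-steering idea concrete, here is a minimal sketch of a per-emotion steering layer in the spirit of the EmoSteer layer described above. The class and parameter names are hypothetical, and the paper's exact formulation (e.g., any gating or normalization) may differ; the `intensity` scalar illustrates the controllable-intensity potential the abstract mentions.

```python
# Hypothetical sketch of an EmoSteer-style activation-steering layer.
import torch
import torch.nn as nn

class EmoSteerLayer(nn.Module):
    def __init__(self, num_emotions: int, embed_dim: int):
        super().__init__()
        # One learnable steering vector per target emotion, capturing that
        # emotion's latent offset in the output embedding space.
        self.steering = nn.Embedding(num_emotions, embed_dim)
        nn.init.zeros_(self.steering.weight)  # start as an identity mapping

    def forward(self, hidden: torch.Tensor, emotion_id: torch.Tensor,
                intensity: float = 1.0) -> torch.Tensor:
        # hidden: (batch, seq_len, embed_dim) output embeddings of the TTS model.
        # Scaling the offset by `intensity` is an assumption illustrating the
        # controllable emotional intensity discussed in the abstract.
        offset = self.steering(emotion_id).unsqueeze(1)  # (batch, 1, embed_dim)
        return hidden + intensity * offset
```

Because only the embedding table is trained, the parameter count stays tiny relative to full fine-tuning, consistent with the lightweight design claimed above.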
Abdullah Baihan, A. Alutaibi, Mohammed Alshehri
et al.
Speech impairment limits a person's capacity for oral and auditory communication. A real-time sign language detector can advance communication between the deaf and the general public. Recent studies have made progress in motion and gesture identification using Deep Learning (DL) methods and computer vision, but the development of static and dynamic sign language recognition (SLR) models remains a challenging area of research. The difficulty lies in obtaining a model that handles continuous, signer-independent signing: different signers' speeds, durations, and many other factors make it challenging to create a model with high accuracy and continuity. This study focuses on SLR using a modified DL and hybrid optimization approach. Spatial and geometric features are extracted via the Visual Geometry Group 16 (VGG16) network, and motion features are extracted using the optical flow approach. A new DL model, CNNSa-LSTM, combines a Convolutional Neural Network (CNN), Self-Attention (SA), and Long Short-Term Memory (LSTM) to identify sign language: the CNN performs spatial analysis, the SA mechanism focuses on relevant features, and the LSTM models temporal dependencies, enhancing performance on complex sequential data such as sign language. In addition, a Hybrid Optimizer (HO) is proposed that combines the Hippopotamus Optimization Algorithm (HOA) and the Pathfinder Algorithm (PFA). The proposed model has been implemented in Python and evaluated against existing models, recording the highest accuracy of 98.7%, along with sensitivity of 98.2%, precision of 98.5%, a Word Error Rate (WER) of 0.131, a Sign Error Rate (SER) of 0.114, and a Normalized Discounted Cumulative Gain (NDCG) of 98%.
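The CNN-then-attention-then-LSTM pipeline can be sketched as below. This is an illustrative reconstruction, not the paper's implementation: layer sizes, the attention configuration, and the classification head are all assumptions.

```python
# Illustrative sketch of a CNNSa-LSTM pipeline: CNN for per-frame spatial
# features, self-attention to weight relevant frames, LSTM for temporal order.
import torch
import torch.nn as nn

class CNNSaLSTM(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                      # per-frame spatial features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) video of signing
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1)).flatten(1)   # (b*t, 64)
        x = self.proj(x).view(b, t, -1)                 # (b, t, feat_dim)
        x, _ = self.attn(x, x, x)                       # focus on relevant frames
        x, _ = self.lstm(x)                             # temporal modelling
        return self.head(x[:, -1])                      # classify the sequence
```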
The growing population of L2 English speakers has increased the demand for automatic graders for spoken language assessment (SLA). Historically, statistical models, text encoders, and self-supervised speech models have been utilised for this task. However, cascaded systems suffer from information loss, while end-to-end (E2E) graders have their own limitations. With the recent advances in multi-modal large language models (LLMs), we explore their potential as L2 oral proficiency graders and aim to overcome these issues. In this work, we compare various training strategies using regression and classification targets. Our results show that speech LLMs outperform all previous competitive baselines, achieving superior performance on two datasets. Furthermore, the trained grader demonstrates strong generalisation in cross-part and cross-task evaluation, facilitated by the audio understanding knowledge acquired during LLM pre-training.
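As a sketch of the two target types being compared, one could attach a regression head and a classification head to a pooled speech-LLM representation. `backbone` is a placeholder for whichever multi-modal LLM encoder is used; nothing here is the paper's API.

```python
# Minimal sketch of regression vs. classification grading heads on a
# speech-LLM backbone (placeholder module, assumed to return pooled features).
import torch
import torch.nn as nn

class ProficiencyGrader(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, num_bands: int = 6):
        super().__init__()
        self.backbone = backbone                          # multi-modal LLM encoder
        self.reg_head = nn.Linear(hidden_dim, 1)          # continuous score target
        self.cls_head = nn.Linear(hidden_dim, num_bands)  # discrete band target

    def forward(self, audio_feats: torch.Tensor):
        h = self.backbone(audio_feats)                    # (batch, hidden_dim) pooled
        return self.reg_head(h).squeeze(-1), self.cls_head(h)
```

Training would then use, for example, MSE loss on the regression head or cross-entropy on the classification head, matching the two training-target strategies compared in the study.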
Kishor Kayyar Lakshminarayana, Frank Zalkow, Christian Dittmar
et al.
In recent years, several text-to-speech systems have been proposed to synthesize natural speech in zero-shot, few-shot, and low-resource scenarios. However, these methods typically require training data from many different speakers, and speech quality across the speaker set is often diverse, imposing an upper limit on the quality achievable for the low-resource speaker. In the current work, we achieve high-quality speech synthesis using as little as five minutes of speech from the desired speaker by augmenting the low-resource speaker data with noise and employing multiple sampling techniques during training. Our method requires only four high-quality, high-resource speakers, which are easy to obtain and use in practice. Our low-complexity method achieves improved speaker similarity compared with the state-of-the-art zero-shot method HierSpeech++ and the recent low-resource method AdapterMix, while maintaining comparable naturalness. Our proposed approach can also reduce the data requirements for speech synthesis for new speakers and languages.
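The noise-augmentation idea can be illustrated as follows: pad out a small low-resource speaker set with noisy copies at several signal-to-noise ratios. The SNR values and the use of white noise are assumptions for illustration; the paper's actual noise types and sampling schedule may differ.

```python
# Illustrative noise augmentation for a low-resource speaker's recordings.
import torch

def add_noise(wave: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Add white noise at a target signal-to-noise ratio (in dB)."""
    noise = torch.randn_like(wave)
    sig_pow = wave.pow(2).mean()
    noise_pow = noise.pow(2).mean()
    scale = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return wave + scale * noise

def augment_low_resource(waves, snrs=(20.0, 15.0, 10.0)):
    """Return the original few minutes of speech plus noisy variants."""
    return waves + [add_noise(w, snr) for w in waves for snr in snrs]
```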
Anne Bragard, Claire Fabert, Alexia Vanvoorst
et al.
Context. Sound development of the various components of print awareness is essential for fostering children's learning of reading and writing.
Objectives. This study presents a tool for assessing print awareness in French-speaking children aged 4 to 7 years, in order to detect potential difficulties in the development of this skill.
Methods. Normative data were collected from 91 children, grouped by school level from the second year of preschool (2e maternelle, equivalent to moyenne section maternelle) to the first year of primary school (1re primaire, equivalent to cours préparatoire) in French-speaking Belgium.
Results. Psychometric analyses of reliability and validity confirm the tool's relevance for the early screening of written-language learning difficulties. A case study illustrates its use as well as the need to put intervention in place as early as possible for certain at-risk children.
Conclusion. This screening tool allows teachers or clinicians, in partnership with parents, to assess a child's print awareness, one of the skills considered a prerequisite for the sound development of written language. Where weaknesses are found, support for developing this skill can then be put in place for the child.
Poirier Rachel C., Upadhyay Sri Siddhi N., Klin Celia M.
During reading, different cues influence readers' focus. One of these is natural language quantifiers. Negative quantifiers, such as few (e.g., "Few attended the lecture"), have a complex influence on focus. They convey a sense of shortfall, a quantity that is less than what was expected (e.g., Moxey, 2006; Moxey & Sanford, 1987). This, in turn, creates focus on the complement set (e.g., those not attending). Upadhyay et al. (2018) demonstrated the influence of story context on these focus effects. In three experiments, we investigated the influence of story context on focus effects with positive quantifiers. Although positive quantifiers are less diffuse in their meaning than negative quantifiers, context can lead positive quantifiers to convey shortfall, as negative quantifiers do. When the story context created an expectation for a large quantity (e.g., many), the positive quantifier a few was understood as a shortfall, leading to focus on the complement set.
Jesin James, Balamurali B. T., Binu Abeysinghe
et al.
This study investigates discriminative patterns learned by neural networks for accurate speech classification, with a specific focus on vowel classification tasks. By examining the activations and features of neural networks trained for vowel classification, we gain insights into what the networks "see" in spectrograms. Through the use of class activation mapping, we identify the frequencies that contribute to vowel classification and compare these findings with linguistic knowledge. Experiments on an American English vowel dataset showcase the explainability of neural networks and provide valuable insights into the causes of misclassifications, as well as their characteristics when differentiating vowels from unvoiced speech. This study not only enhances our understanding of the underlying acoustic cues in vowel classification but also offers opportunities for improving speech recognition by bridging the gap between abstract representations in neural networks and established linguistic knowledge.
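For reference, class activation mapping over a spectrogram can be sketched as below: the final convolutional feature maps are weighted by the classifier weights for the predicted vowel, highlighting which time-frequency regions drove the decision. This assumes a global-average-pooling CNN; variable names are illustrative.

```python
# Sketch of class activation mapping (CAM) for a spectrogram classifier.
import torch

def class_activation_map(feature_maps: torch.Tensor,
                         fc_weights: torch.Tensor,
                         class_idx: int) -> torch.Tensor:
    # feature_maps: (channels, freq, time) from the last conv layer
    # fc_weights:   (num_classes, channels) of the final linear layer
    w = fc_weights[class_idx]                         # (channels,)
    cam = torch.einsum("c,cft->ft", w, feature_maps)  # weighted sum of maps
    cam = torch.relu(cam)                             # keep positive evidence
    return cam / (cam.max() + 1e-8)                   # normalise for display
```

Rows of the resulting map with high values correspond to the frequency bands the network relied on, which is what allows the comparison with formant-based linguistic knowledge described above.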
Ahmad A. Alanazi, Mohammed F. ALHarbi, Abrar M. AlMutairi
et al.
Background: Healthcare professionals are required to work effectively together to deliver the best healthcare services. Without awareness of other healthcare professionals’ roles and responsibilities, interprofessional practice (IPP) cannot be optimally achieved.
Objectives: This study aimed to investigate healthcare professionals’ awareness of audiology and speech-language pathology (SLP) services in Saudi Arabia.
Method: This cross-sectional descriptive study consisted of two parts. The content of a 20-item paper questionnaire was first validated. The full-scale study then addressed the aim by distributing the questionnaire to potential participants. Descriptive statistics and chi-square tests were used.
Results: A total of 403 participants completed the questionnaire for the main study. Most participants were Saudi citizens (84.1%), were aged 18-40 years (84.8%), and lived in the Riyadh region (76.2%). Allied health professionals (40.2%), physicians (22.6%), nurses (15.4%) and dentists (11.2%) were the main groups of participants, working mainly at governmental hospitals (69.2%). Of the total participants, 92.6% and 95.3% reported being fully aware of the services provided by audiologists and SLPs, respectively. No statistically significant association was found between participants' specialty and their familiarity with the scope of practice of SLPs and audiologists.
Conclusion: Our study examined healthcare professionals’ awareness of audiology and SLP services and revealed a high level of awareness.
Contribution: The existing level of awareness is expected to facilitate IPP and enhance the quality of care. Still, awareness campaigns about audiology and SLP services are needed to address the remaining lack of knowledge among some healthcare professionals.
Recent studies have shown impressive performance in lip-to-speech synthesis, which aims to reconstruct speech from visual information alone. However, these methods struggle to synthesize accurate speech in the wild, owing to insufficient supervision for guiding the model to infer the correct content. In contrast to previous methods, in this paper we develop a powerful Lip2Speech method that can reconstruct speech with the correct content from input lip movements, even in a wild environment. To this end, we design a multi-task learning scheme that guides the model with multimodal supervision, i.e., text and audio, to complement the insufficient word representations of the acoustic feature reconstruction loss. The proposed framework thus offers the advantage of synthesizing speech containing the right content for multiple speakers with unconstrained sentences. We verify the effectiveness of the proposed method on the LRS2, LRS3, and LRW datasets.
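A hedged sketch of such a multi-task objective is shown below: an acoustic reconstruction loss complemented by text-level supervision (here a CTC loss as one plausible choice). The loss weight and the exact auxiliary losses are assumptions, not the paper's specification.

```python
# Sketch of a multi-task lip-to-speech loss: acoustic reconstruction plus
# text supervision (CTC used illustratively).
import torch
import torch.nn.functional as F

def lip2speech_loss(pred_mel, target_mel,
                    text_logits, text_targets, input_lens, target_lens,
                    w_text: float = 0.5):
    recon = F.l1_loss(pred_mel, target_mel)          # acoustic reconstruction
    # text_logits: (batch, time, vocab); CTC expects (time, batch, vocab)
    ctc = F.ctc_loss(text_logits.log_softmax(-1).transpose(0, 1),
                     text_targets, input_lens, target_lens)
    return recon + w_text * ctc                      # joint multimodal guidance
```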
This paper examines the earliest surviving sixteenth-century poetic chapbook (pliego suelto) on a sensationalist theme. It would therefore be the first "horrible and frightful case" ("caso horrible y espantoso") printed in verse. Its formal and thematic features show it to be an early specimen that, at the beginning of the sixteenth century, stands halfway between orality and writing. Moreover, we have been able to establish that the events narrated in it actually occurred, making this one of the first pieces of gruesome news that must have circulated orally, but also in print, at the dawn of the 1500s.
The current study investigates functional variation in academic writing across six major disciplines: three in the sciences (biology, physics, and computer science) and three in the social sciences (linguistics, sociology, and psychology). It also compares the use of language between science and social science research articles (RAs). For this purpose, a corpus of 300 research articles from six sub-disciplines of science and social science journals was compiled. The Multidimensional Analysis Tagger (version 1.3.2) was used to tag and evaluate the RA corpus, and a triangulation research method was deployed to analyze the data quantitatively and interpret the results qualitatively. The findings indicate functional differences in the language employed across academic disciplines, particularly in terms of being informational, non-narrative, and persuasive; comparable outcomes were observed on dimension 3. The results may help researchers, students of different disciplines, and teachers to spot trends across various sub-disciplines.
Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency $f_0$ and the formants are of primary importance. In this work, we start from a variational autoencoder (VAE) trained in an unsupervised manner on a large dataset of unlabeled natural speech signals, and we show that the source-filter model of speech production naturally arises as orthogonal subspaces of the VAE latent space. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to identify the latent subspaces encoding $f_0$ and the first three formant frequencies, we show that these subspaces are orthogonal, and, based on this orthogonality, we develop a method to accurately and independently control the source-filter speech factors within the latent subspaces. Without requiring additional information such as text or human-labeled data, this yields a deep generative model of speech spectrograms conditioned on $f_0$ and the formant frequencies, which we apply to the transformation of speech signals. Finally, we also propose a robust $f_0$ estimation method that exploits the projection of a speech signal onto the learned latent subspace associated with $f_0$.
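The subspace-based control can be sketched as follows: project the VAE latent vector onto the identified subspace for a factor (say $f_0$), remove its current component, and insert the coordinates corresponding to a target value. The `basis` and `target_coords` inputs stand in for the outputs of the paper's supervised identification step and are placeholders here.

```python
# Minimal sketch of independent factor control within an orthogonal
# latent subspace of a trained VAE.
import torch

def steer_latent(z: torch.Tensor, basis: torch.Tensor,
                 target_coords: torch.Tensor) -> torch.Tensor:
    # z:     (latent_dim,) latent vector of one speech frame
    # basis: (k, latent_dim) orthonormal rows spanning the subspace
    #        that encodes the factor of interest (e.g., f0)
    coords = basis @ z                     # current coordinates in the subspace
    # Replace the current component with the target one; orthogonality of the
    # subspaces means the other source-filter factors are left untouched.
    return z - basis.T @ coords + basis.T @ target_coords
```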
Mild cognitive impairment (MCI) threatens a large percentage of the elderly population worldwide. Symptoms can include language and speech difficulties. This article addresses speech and language comprehension difficulties at the sentence level in people with MCI through a case study. Sentence comprehension was examined in a female patient with amnestic MCI by means of the Czech version of the Sentence Comprehension Test. Impaired comprehension of all sentence types was found, with a high need for repetition, along with an influence of sentence complexity. The results suggest that limited cognitive resources are a reason for insufficient processing of linguistic information. Therapy focused on cognitive training may thus also improve or maintain language.
Harsha Kathard, Prianka Parusnath, Lehana Thabane
et al.
Background: Early classroom communication supports are critical in influencing oral language development and emergent literacy skills. Support from peers and adults, as well as the environment, impacts the quality and efficacy of language learning. Schools in particular play a key role in communication development, which is further explored in this article.
Objectives: To describe how teachers supported communication in the classroom across the areas of language-learning environment (LLE), language-learning interactions (LLIs) and language-learning opportunities (LLOs), with a view to strengthening literacy outcomes for Grade R learners.
Method: A cross-sectional descriptive survey of classrooms was conducted using a structured observation method in 136 classrooms. The Communication Supporting Classroom Observation Tool was used to make 223 observations across classrooms of LLE (19 items), LLI (20 items) and LLO (5 items). A descriptive analysis of frequency of occurrence of items was conducted for each domain.
Results: Language-learning environment items were observed most often: 12 of 19 items had a frequency of occurrence greater than 65%, indicating that the environment had adequate basic resources (e.g. good light). Language-learning interaction scores indicated that 3 of 20 items were observed frequently, while 17 of 20 were observed less frequently (56.5%), including interactive strategies (e.g. turn-taking). Language-learning opportunity scores indicated that teacher-led behaviours (e.g. including children in small group activities) were infrequently observed on all items.
Conclusion: While the physical environment was generally supportive, the behaviours in the interactive domains of LLI and LLO were observed less frequently. There is potential for speech–language therapists to work collaboratively with teachers to develop communication supporting classrooms as a critical primary-level intervention for language-literacy learning.
We present a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, in which a convolutional neural network (CNN) performs encoding and decoding as a neural waveform codec (NWC) during its feedforward routine. The proposed NWC also defines quantization and entropy coding as a trainable module, so coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model components to the NWC, such as gated residual networks and depthwise separable convolution. Furthermore, the proposed models employ a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we use the residual coding concept to concatenate multiple NWC autoencoding modules, where each NWC module codes the residual reconstruction error that its preceding modules have left behind. CMRL can also scale down to lower bitrates, for which it employs a linear predictive coding (LPC) module as its first autoencoder. This hybrid design integrates LPC and NWC by redefining LPC's quantization as a differentiable process, making the system trainable in an end-to-end manner. The decoder of the proposed system uses either one NWC (0.12 million parameters) at low to medium bitrates (12 to 20 kbps) or two NWCs at the high bitrate (32 kbps). Although the decoding complexity is not yet as low as that of conventional speech codecs, it is significantly lower than that of other neural speech coders, such as a WaveNet-based vocoder. For wide-band speech coding quality, our system yields comparable or superior performance to AMR-WB and Opus on TIMIT test utterances at low and medium bitrates. The proposed system can scale up to higher bitrates to achieve near-transparent performance.
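The cross-module residual learning cascade can be sketched as below: each autoencoding module codes whatever residual its predecessors left behind, and the decoded outputs are summed. The NWC internals (quantization, entropy coding) are omitted, and `modules` is a placeholder list of trained autoencoders, so this is a structural sketch rather than the paper's implementation.

```python
# Sketch of a CMRL-style residual coding cascade over autoencoder modules.
import torch
import torch.nn as nn

def cmrl_code(x: torch.Tensor, modules: list[nn.Module]) -> torch.Tensor:
    residual, decoded = x, torch.zeros_like(x)
    for codec in modules:                # e.g., [LPC-like module, NWC, NWC]
        y = codec(residual)              # encode and decode the current residual
        decoded = decoded + y            # accumulate the reconstruction
        residual = residual - y          # pass what is left to the next module
    return decoded
```

Dropping or adding modules in the list is what gives the bitrate scalability described above: fewer modules spend fewer bits, while each added module refines the remaining reconstruction error.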