Prom, Natalia Aleksandrovna, Evtushenko, Oksana Aleksandrovna, Shestakova, Olga Aleksandrovna
The article presents an analysis of the genre repertoire of electronic communication between a lecturer and a student, a new area of Internet communication that has not previously been the subject of linguistic genre analysis. The article offers and substantiates a classification of the identified speech genres. The relevance of the topic lies in the fact that this business form of communication is becoming the preferred one in modern pedagogical discourse due to numerous social and technological processes. The research material consists of written asynchronous texts exchanged between lecturers and students in a computer-mediated environment. Using the methods of observation and comparison of the various genres formed in the process of electronic communication, we have created an original classification of speech genres relevant to modern electronic communication between a lecturer and a student. This classification includes informative, etiquette, imperative and evaluative groups of genres. Quantitatively, informative and etiquette genres significantly predominate over imperative and evaluative ones. The group of informative genres proved the largest in our empirical material, which reflects the goals and objectives of this type of communication: providing and requesting information. The large number of etiquette speech genres indicates that the etiquette norms of institutional communication are followed because of their significance. Imperative and evaluative genres are included in the lecturers' speech repertoire, which follows from the institutional nature of the considered form of interaction. The identified speech genres share a number of characteristics: they are secondary, written and formal, yet they show signs of the informal oral speech of primary genres, which allows us to assert that electronic communication between a lecturer and a student is still in a stage of formation and development.
Background. Statistical learning (SL) mechanisms play a crucial role in oral language development. This field of research explores how individuals detect and use the statistical regularities of linguistic stimuli to acquire language skills. Understanding these mechanisms sheds useful light on language development and its acquisition difficulties. However, SL and its clinical implications remain largely unknown to speech-language therapists.
Objectives. This literature review aims to synthesize current knowledge on SL in oral language development, both typical and atypical, and to explore its possible applications in speech-language therapy. It seeks to clarify the fundamental concepts of SL, to analyse its role in language trajectories and to identify approaches for integrating these principles into therapeutic practice. Drawing on the available scientific evidence, it also discusses methodological limitations and proposes perspectives for strengthening the links between theoretical research and clinical practice.
Methods. This narrative review, heuristic and synthetic in purpose and non-exhaustive, was based on a non-systematic but structured examination of the literature. Articles were selected from databases (PsycInfo, ERIC, MEDLINE, MLA, PubMed and Google Scholar) using predefined keywords. The inclusion criteria concerned the field of study, the population and the language. Meta-analyses, systematic reviews and experimental studies were prioritized, incorporating both recent publications and reference works.
Results. Studies reveal the effectiveness of SL from early childhood onwards across various aspects of language development. However, individual variation in SL, particularly among children with language difficulties, underscores the complexity of the learning mechanisms involved. The article highlights the theoretical and methodological challenges in measuring and interpreting SL, as well as the practical implications for speech-language intervention, proposing the use of SL-based principles to improve therapeutic effectiveness.
Conclusions. SL, with its powerful and rapid mechanisms, offers promising potential for optimizing speech-language interventions, complementary to explicit learning. By centring therapies on the input and exploiting SL principles, speech-language therapists can promote efficient, effortless learning and the generalization of gains. Although further research is needed, the current evidence encourages its integration into clinical practice, while opening up the prospect of developing tools for screening and early intervention in different language domains.
Background: Management of oropharyngeal dysphagia within the first few days of an infant’s life results in favourable breastfeeding outcomes, indicating the importance of investigating the breastfeeding management practices of speech-language therapists (SLTs) working within this field.
Objectives: Little has been published about SLTs’ management of breastfeeding in low- and middle-income settings. This study explores the perspectives of a group of experienced South African SLTs on their approach to breastfeeding management.
Method: Qualitative data were gathered through semi-structured online interviews with 12 experienced SLTs and were subjected to thematic analysis.
Results: Three main themes emerged: (1) Approach to breastfeeding management; (2) Exposure, skills and knowledge related to breastfeeding management; and (3) Perspectives and attitudes towards breastfeeding management. Participants demonstrated a clear understanding of their roles within the scope of breastfeeding management, which aligns with established literature. Their approaches appeared to be influenced by their perspectives and attitudes towards breastfeeding, as well as their exposure, skills and knowledge in this area. The study underscores the need for breastfeeding training in undergraduate programmes and highlights the demand for continuous professional development opportunities.
Conclusion: While the findings are drawn from a small sample of experienced clinicians, they offer valuable insights for speech-language pathology clinical educators and professional organisations.
Contribution: This study suggests a re-evaluation of university curricula to enhance exposure to breastfeeding management.
We examined the number and size of animate characters illustrated in picture books targeting preschoolers in the U.S. and Japan. Our hypotheses were based on cross-cultural differences found in previous studies of visual attention and visual products. We found that U.S. picture books illustrated fewer animate characters than Japanese books, supporting our hypothesis. We also found that the texts of U.S. picture books placed more emphasis on animate characters than the texts of Japanese books, likewise supporting our hypothesis. These results highlight that the visual products young children are exposed to early on may be an important environmental factor contributing to their development and should be studied further.
Perrine Roulin, Lola Marmisse, Marie Kuppelin
et al.
Background: Acquired Brain Injury (ABI) causes multiple deficits affecting motor skills, language, cognition, behaviour and also emotional regulation, which impair the quality of life of patients and those around them.
Objective: This study aims to explore the emotional experience of this population after learning emotional regulation through GREMO (Groupe de Régulation ÉMOtionnelle), a dialectical behaviour therapy programme.
Method: Semi-structured interviews were conducted with 17 patients (mean age: 38.9 years) with ABI (traumatic brain injury, stroke, encephalitis, cerebral anoxia) sustained on average 10.5 years earlier. The data were analysed using Interpretative Phenomenological Analysis (IPA). From 20 hours and 26 minutes of interviews, transcripts totalling 185,471 words were analysed by double coding.
Results: Four superordinate themes were identified: (1) identification and regulation of emotions facilitated by GREMO; (2) overwhelming emotions: a constant challenge in emotional regulation even after GREMO; (3) emotions and relationships with others; (4) meaning, re-reading one's life and desires for the future in the light of emotional regulation.
During the interviews, participants were able to identify a range of emotions such as anger, fear, sadness and shame, as well as pleasant emotions. They reported a heightened awareness of their emotions since taking part in GREMO, which enabled them to better understand, know and recognize their emotional states. They also noted an improvement in the mastery of their emotions and found that learning the GREMO skills had made it easier for them to regulate them. Nevertheless, they stressed that managing emotions remains a daily challenge, and some expressed persistent difficulties in their interpersonal relationships, including feelings of being misunderstood and of social isolation. The interviews also gave participants an opportunity to reflect on their emotional functioning and their future aspirations.
Conclusion: It is therefore essential to integrate participants' emotional experience and their goals into a perspective of post-traumatic growth and identity reconstruction, paying particular attention to this dimension in speech-language rehabilitation.
Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, making it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ demonstrate that Speech-FT consistently improves performance across a wide range of supervised, unsupervised, and multitask fine-tuning scenarios. Moreover, Speech-FT achieves superior cross-task generalization compared to fine-tuning baselines that explicitly constrain weight changes, such as weight-space regularization and LoRA fine-tuning. Our analysis reveals that Speech-FT maintains higher feature similarity to the pre-trained model compared to alternative strategies, despite allowing larger weight-space updates. Notably, Speech-FT achieves significant improvements on the SUPERB benchmark. For example, when fine-tuning HuBERT on automatic speech recognition, Speech-FT is able to reduce phone error rate from 5.17% to 3.94%, lower word error rate from 6.38% to 5.75%, and increase speaker identification accuracy from 81.86% to 84.11%. Speech-FT provides a simple yet powerful solution for further refining speech representation models after pre-training.
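To make the second stage concrete, here is a minimal sketch of weight-space interpolation between a pre-trained and a fine-tuned checkpoint, assuming PyTorch state dicts with identical architectures; the interpolation coefficient alpha is illustrative, not the value used in the paper, and the first-stage drift-reducing fine-tuning is not shown.

```python
import torch

def interpolate_weights(pretrained_state, finetuned_state, alpha=0.5):
    """Linearly interpolate two state dicts of the same architecture.

    alpha = 0 keeps the pre-trained weights; alpha = 1 keeps the
    fine-tuned weights. The value 0.5 here is illustrative only.
    """
    merged = {}
    for name, w_pre in pretrained_state.items():
        w_ft = finetuned_state[name]
        if torch.is_floating_point(w_pre):
            merged[name] = (1.0 - alpha) * w_pre + alpha * w_ft
        else:
            # Integer buffers (e.g., batch-norm step counters) cannot be mixed.
            merged[name] = w_ft.clone()
    return merged

# Usage sketch:
# model.load_state_dict(
#     interpolate_weights(pretrained.state_dict(), finetuned.state_dict()))
```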
Thilo von Neumann, Christoph Boeddeker, Marc Delcroix
et al.
The predominant metric for evaluating speech recognizers, the Word Error Rate (WER), has been extended in different ways to handle transcripts produced by long-form multi-talker speech recognizers. Such systems produce long transcripts containing multiple speakers and complex speaking patterns, to which the classical WER cannot be applied. There are speaker-attributed approaches that count speaker confusion errors, such as the concatenated minimum-permutation WER (cpWER) and the time-constrained cpWER (tcpWER), and speaker-agnostic approaches, which aim to ignore speaker confusion errors, such as the Optimal Reference Combination WER (ORC-WER) and the MIMO-WER. These WERs evaluate different aspects and error types (e.g., temporal misalignment), but a detailed comparison has not yet been made. We therefore present a unified description of the existing WERs and highlight when to use which metric. To further analyze how many errors are caused by speaker confusion, we propose the Diarization-invariant cpWER (DI-cpWER). It ignores speaker attribution errors, and its difference from cpWER reflects the impact of speaker confusions on the WER. Since error types cannot reliably be classified automatically, we discuss ways to visualize sequence alignments between the reference and hypothesis transcripts to facilitate the spotting of errors by a human judge. Since some WER definitions have high computational complexity, we introduce a greedy algorithm that approximates ORC-WER and DI-cpWER with high precision ($<0.1\%$ deviation in our experiments) and polynomial instead of exponential complexity. To improve the plausibility of the metrics, we also incorporate the time constraint from tcpWER into ORC-WER and MIMO-WER, significantly reducing their computational complexity as well.
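For reference, all of the variants discussed above generalize the classical word-level edit distance. A minimal sketch of that base metric, assuming a single speaker and pre-segmented transcripts, might look like the following; the multi-talker metrics add permutation search over speakers (cpWER), reference-stream assignment (ORC-WER/MIMO-WER), and time constraints (tcpWER) on top of it.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Classical WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: word_error_rate("the cat sat", "the cat sat down") == 1/3
```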
Decoding text, speech, or images from human neural signals holds promise both as a neuroprosthesis for patients and as an innovative communication tool for general users. Although neural signals contain varied information on speech intentions, movements, and phonetic details, generating informative outputs from them remains challenging, with most work focusing on decoding short intentions or producing fragmented outputs. In this study, we developed a diffusion model-based framework to decode visual speech intentions from speech-related non-invasive brain signals, to facilitate face-to-face neural communication. We designed an experiment that consolidates various phonemes into the viseme of each phoneme, aiming to learn the representation of the corresponding lip formations from neural signals. By decoding visemes from both isolated trials and continuous sentences, we successfully reconstructed coherent lip movements, effectively bridging the gap between brain signals and dynamic visual interfaces. The results highlight the potential of viseme decoding and talking-face reconstruction from human neural signals, marking a significant step toward dynamic neural communication systems and speech neuroprostheses for patients.
Achieving superior enhancement performance while maintaining a low parameter count and computational complexity remains a challenge in the field of speech enhancement. In this paper, we introduce LORT, a novel architecture that integrates spatial-channel enhanced Taylor Transformer and locally refined convolution for efficient and robust speech enhancement. We propose a Taylor multi-head self-attention (T-MSA) module enhanced with spatial-channel enhancement attention (SCEA), designed to facilitate inter-channel information exchange and alleviate the spatial attention limitations inherent in Taylor-based Transformers. To complement global modeling, we further present a locally refined convolution (LRC) block that integrates convolutional feed-forward layers, time-frequency dense local convolutions, and gated units to capture fine-grained local details. Built upon a U-Net-like encoder-decoder structure with only 16 output channels in the encoder, LORT processes noisy inputs through multi-resolution T-MSA modules using alternating downsampling and upsampling operations. The enhanced magnitude and phase spectra are decoded independently and optimized through a composite loss function that jointly considers magnitude, complex, phase, discriminator, and consistency objectives. Experimental results on the VCTK+DEMAND and DNS Challenge datasets demonstrate that LORT achieves competitive or superior performance to state-of-the-art (SOTA) models with only 0.96M parameters, highlighting its effectiveness for real-world speech enhancement applications with limited computational resources.
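As an illustration of how a composite spectral objective can be assembled, the sketch below combines three of the five loss terms named above (magnitude, complex, phase); the weights, the anti-wrapping phase term, and the omission of the discriminator and consistency terms are assumptions made for brevity, not LORT's exact formulation.

```python
import torch
import torch.nn.functional as F

def composite_loss(mag_est, mag_ref, phase_est, phase_ref,
                   w_mag=0.9, w_cplx=0.1, w_phase=0.05):
    """Illustrative composite loss over magnitude, complex, and phase terms.

    Weights and the phase formulation are placeholder assumptions.
    """
    loss_mag = F.mse_loss(mag_est, mag_ref)
    # Complex-spectrum loss: compare real/imag parts rebuilt from mag and phase.
    cplx_est = torch.stack([mag_est * torch.cos(phase_est),
                            mag_est * torch.sin(phase_est)], dim=-1)
    cplx_ref = torch.stack([mag_ref * torch.cos(phase_ref),
                            mag_ref * torch.sin(phase_ref)], dim=-1)
    loss_cplx = F.mse_loss(cplx_est, cplx_ref)
    # Anti-wrapping phase loss: penalize angular distance, not raw difference.
    loss_phase = torch.mean(1.0 - torch.cos(phase_est - phase_ref))
    return w_mag * loss_mag + w_cplx * loss_cplx + w_phase * loss_phase
```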
Hate speech is communication, often oral or written, that stigmatizes, or incites violence or prejudice against, individuals or groups based on characteristics such as race, religion, ethnicity, gender, sexual orientation, or other protected attributes. It usually involves expressions of hostility, contempt, or prejudice and can have harmful social consequences. Within this broader social landscape, an important problem and challenge facing the medical community is the impact of people's verbal expression: such words have a significant and immediate effect on human behavior and the psyche, and repeated exposure to such phrases can even lead to depression and social isolation. In an attempt to identify and classify such Telugu text samples from the social media domain, our research applied an LSTM model; the findings of this experiment are summarized in this paper, in which, out of 27 participants, we obtained 8th place with an F1 score of 0.68.
BACKGROUND Pathological speech diagnosis is crucial for identifying and treating various speech disorders. Accurate diagnosis aids in developing targeted intervention strategies, improving patients' communication abilities, and enhancing their overall quality of life. With the rising global incidence of speech-related conditions, including those affecting oral health, the need for efficient and reliable diagnostic tools has become paramount, underscoring the significance of advanced research in this field. METHODS This paper introduces novel features for deep learning in the analysis of short voice signals. It proposes the incorporation of time-space and time-frequency features to accurately discriminate between two distinct groups: individuals exhibiting normal vocal patterns and those manifesting pathological voice conditions. These advancements aim to enhance the precision and reliability of diagnostic procedures, paving the way for more targeted treatment approaches. RESULTS Using a publicly available voice database, this study carried out training and validation of long short-term memory (LSTM) networks on the combined features, along with a data-balancing strategy. The proposed approach yielded promising performance metrics: 90% accuracy, 93% sensitivity, 87% specificity, 88% precision, an F1 score of 0.90, and an area under the receiver operating characteristic curve of 0.96. These results surpassed those obtained by networks trained on wavelet time-scattering coefficients, as well as by several algorithms trained with alternative feature types. CONCLUSIONS The incorporation of time-frequency and time-space features extracted from short segments of voice signals for LSTM learning shows significant promise as an AI tool for the diagnosis of speech pathology. The proposed approach has the potential to enhance accuracy and allow real-time pathological speech assessment, thereby facilitating more targeted and effective therapeutic interventions.
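A minimal sketch of the kind of LSTM classifier described above, assuming per-frame feature sequences as input; the feature dimension, layer sizes, and last-step pooling are illustrative choices rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class VoicePathologyLSTM(nn.Module):
    """Illustrative binary classifier over per-frame feature sequences.

    feature_dim stands in for the paper's combined time-space and
    time-frequency features; all hyperparameters are assumptions.
    """
    def __init__(self, feature_dim=64, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # normal vs. pathological

    def forward(self, x):                 # x: (batch, frames, feature_dim)
        out, _ = self.lstm(x)
        logits = self.head(out[:, -1])    # last time step summarizes the clip
        return logits.squeeze(-1)         # use with BCEWithLogitsLoss
```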
This paper summarizes our acoustic modeling efforts in the Johns Hopkins University speech recognition system for the CHiME-5 challenge, which targets highly overlapped dinner-party speech recorded by multiple microphone arrays. We explore data augmentation approaches, neural network architectures, front-end speech dereverberation, beamforming, and robust i-vector extraction, comparing our in-house implementations with publicly available tools. We finally achieved a word error rate of 69.4% on the development set, an 11.7% absolute improvement over the previous baseline of 81.1%, and release this improved baseline, with its refined techniques and tools, as an advanced CHiME-5 recipe.
Christoph Draxler, Henk van den Heuvel, Arjan van Hessen
et al.
Oral history concerns oral sources from witnesses of, and commentators on, historical events. Speech technology is an important instrument for processing such recordings in order to obtain transcriptions and further enhancements that structure the oral account. In this contribution we address the transcription portal and the web services associated with speech processing at BAS, speech solutions developed at LINDAT, how to do it yourself with Whisper, remaining challenges, and future developments.
Chomsky’s Theory of Universal Grammar (UG) (1993, 1995) incorporates universal principles for the grammatical description of all possible human languages. Because not all aspects of human languages are universal, the theory also offers parameters along which languages may vary. The Minimalist Program (MP) (Chomsky, 1993), as part of UG, follows minimalist principles such as the principle of economy, the principle of uniformity, and the search for simplicity, which guide the creation of grammatical rules and representations. The current study analysed interrogative sentences in Punjabi to determine the similarities between Punjabi and English, using the Minimalist Program (MP) as a theoretical framework. Standard Punjabi and almost all Punjabi dialects described in the academic literature provide many ways to express interrogation. Two common types of interrogative sentences are analysed here: yes-no questions and wh-questions, the latter also called k-questions in Punjabi because they start with a k sound, for instance, kitthe (where), kinj (how), kaunN (who), kii (what), etc. The analysis showed that, unlike English, Punjabi has no obligatory movement in either yes-no questions or wh-questions. Similarly, unlike English, there is no movement of auxiliaries or of any other word in any interrogative sentence. Furthermore, the study concluded that an analysis of the Punjabi language through minimalism is possible, implying that Punjabi follows the proposed principles of Universal Grammar.
Julien Hauret, Thomas Joubaud, Véronique Zimpfer
et al.
This paper presents a configurable version of the Extreme Bandwidth Extension Network (EBEN), a Generative Adversarial Network (GAN) designed to improve audio captured with body-conduction microphones. We show that although these microphones significantly reduce environmental noise, this insensitivity to ambient noise comes at the expense of the bandwidth of the speech signal acquired from the wearer of the device. The captured signals therefore require signal enhancement techniques to recover the full-bandwidth speech. EBEN leverages a configurable multiband decomposition of the raw captured signal. This decomposition reduces the time-domain dimensionality of the data and gives better control over the full-band signal. The multiband representation of the captured signal is processed by a U-Net-like model, which combines feature and adversarial losses to generate an enhanced speech signal. We also exploit this representation in the proposed configurable discriminator architecture. The configurable EBEN approach achieves state-of-the-art enhancement results on synthetic data with a lightweight generator that allows real-time processing.
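As an illustration of multiband decomposition, the sketch below splits a waveform into equal-width frequency bands with a plain Butterworth filter bank; this stands in for EBEN's configurable analysis stage rather than reproducing it, and the band count and filter order are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def multiband_decompose(signal, sr, n_bands=4):
    """Split a waveform into equal-width frequency bands.

    A simple filter bank for illustration; EBEN's own decomposition
    is configurable and designed for near-perfect reconstruction.
    """
    edges = np.linspace(0.0, sr / 2, n_bands + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0.0:                      # first band: low-pass
            sos = butter(4, hi, btype="low", fs=sr, output="sos")
        elif hi >= sr / 2:                 # last band: high-pass
            sos = butter(4, lo, btype="high", fs=sr, output="sos")
        else:
            sos = butter(4, [lo, hi], btype="band", fs=sr, output="sos")
        bands.append(sosfilt(sos, signal))
    return np.stack(bands)                 # shape: (n_bands, n_samples)
```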
In this work, we present DeFTAN-II, an efficient multichannel speech enhancement model based on a transformer architecture and subgroup processing. Despite the success of transformers in speech enhancement, they face challenges in capturing local relations and in reducing their high computational complexity and memory usage. To address these limitations, we introduce subgroup processing in our model, combining subgroups of locally emphasized features with other subgroups containing the original features. Subgroup processing is implemented in several blocks of the proposed network. In the proposed split dense blocks extracting spatial features, pairs of subgroups are sequentially concatenated and processed by convolution layers to effectively reduce computational complexity and memory usage. For the F- and T-transformers extracting temporal and spectral relations, we introduce cross-attention between subgroups to identify relationships between locally emphasized and non-emphasized features. The dual-path feedforward network then aggregates the attended features through gating of local features processed by dilated convolutions. Through extensive comparisons with state-of-the-art multichannel speech enhancement models, we demonstrate that DeFTAN-II with subgroup processing outperforms existing methods at significantly lower computational complexity. Moreover, we evaluate the model's generalization capability on real-world data without fine-tuning, further demonstrating its effectiveness in practical scenarios.
Background In head and neck cancer, many tools exist to measure speech impairment, but few evaluate the impact on communication abilities. Some self-administered questionnaires are available to assess general activity limitations, including communication. Others are not validated in oncology. These different tools yield scores that do not provide an accurate measure of the communication limitations perceived by patients. Aim To develop a holistic score measuring the functional impact of speech disorders on communication in patients treated for oral or oropharyngeal cancer, in two steps: its construction and its validation. Methods & Procedures Patients treated for oral/oropharyngeal cancer completed six self-administered questionnaires: two about communicative dynamics (ECVB and DIP), two assessing speech function (PHI and CHI) and two relating to quality of life (EORTC QLQ-C30 and EORTC QLQ-H&N35). A total of 174 items were initially collected. A dimensionality-reduction methodology was then applied. Face validity analysis led to the elimination of non-relevant items by surveying a panel of nine experts from communication-related disciplines (linguistics, medicine, speech pathology, computer science). Construct validity analysis led to the elimination of redundant and insufficiently variable items. Finally, the holistic communication score was derived by principal component factor analysis and validated using cross-validation and latent profile analysis. Outcomes & Results A total of 25 patients completed the questionnaires (median age = 67 years, IQR = 12; 15 men, 10 women; oral cavity = 14, oropharynx = 10, two locations = 1). After face validity analysis, 44 items were retained (κ > 0.80). Four additional items were excluded because of a very high correlation (r > 0.90) with other items presenting a better dispersion. A total of 40 items were finally included in the factor analysis. A post-analysis score prediction was performed (mean = 100; SD = 10). A total of 24 items were finally retained for the construction of the holistic communication score (HoCoS): 19 items from questionnaires assessing communicative dynamics (13 from the ECVB and six from the DIP), four items from a perceived speech impairment questionnaire (PHI) and one from a quality-of-life questionnaire (EORTC QLQ-H&N35). Reliability is good (five-fold cross-validation: rs = 0.91) and the complementary latent profile analysis shows good validity of the HoCoS, clustering subjects by level of communication performance. Conclusions & Implications A global score measuring the impact of speech disorders on communication was developed. It fills the lack of such a score in head and neck oncology and allows a better understanding of the functional and psychosocial consequences of the pathology in patients' follow-up. What this paper adds What is already known on the subject Because of their anatomical location, head and neck cancers degrade speech abilities. Few tools currently allow assessment of the impact of a speech disorder on communication abilities. In ENT oncology, self-administered questionnaires are available to assess activity limitations and participation restrictions (International Classification of Functioning (ICF), WHO). Other tools from the field of neurology allow an evaluation of communication dynamics. But these different tools, constructed item by item, give global additive or averaged scores.
This implies identical weighting of each item, resulting in global scores that are not very representative of the communication limitations actually perceived by patients. What this paper adds to existing knowledge A new global holistic score measuring the impact of speech impairment on communication after treatment of oral or oropharyngeal cancer has been developed. The methodology of its construction better reflects the symptomatological, pragmatic and psychosocial elements leading to a degradation of communication abilities. What are the potential or actual clinical implications of this work? The developed HoCoS score fills the gap left by the absence of this type of tool in head and neck oncology. It may allow a better understanding of the factors involved in the functional and psychosocial limitations of these patients, and enable better customization of their follow-up.
This paper compares speech coding and speaker recognition applications, showing some parallelism between them. Approaches used for speaker recognition are applied to speech coding in order to improve prediction accuracy. Experimental results show an improvement in Segmental SNR (SEGSNR).