Results for "Oral communication. Speech"
Showing 20 of ~2,515,530 results · from DOAJ, CrossRef, arXiv, Semantic Scholar
Zahra Nasseri Moghaddam, Emily K. Reinhardt, Audrey Thurm et al.
Abstract Background Creatine transporter deficiency (CTD) and guanidinoacetate methyltransferase (GAMT) deficiency are rare inborn errors of creatine metabolism, resulting in cerebral creatine deficiency. Patients with either condition commonly exhibit intellectual and developmental disabilities, often accompanied by behavior problems, delayed speech, seizures, and motor impairments. There is currently no efficacious treatment for CTD, while current management of GAMT requires lifelong treatment with a protein-restricted diet and intake of high amounts of oral supplements. Efforts to conduct clinical trials on potential treatments for these disorders are made more difficult by the lack of clinically meaningful, patient-derived outcomes. A core outcome set (COS) can facilitate consistent use of outcomes across studies. The current effort incorporated patient and caregiver perspectives into the outcome selection for a COS for CTD and GAMT. Results We partnered with caregivers and health professionals to establish the first COS for CTD and GAMT. The COS developed includes seven outcomes (“Adaptive Functioning”, “Cognitive Functioning”, “Emotional Dysregulation”, “MRS Brain Creatine”, “Seizure/Convulsions”, “Expressive Communication”, and “Fine Motor Functions”) for both CTD and GAMT, and an additional outcome for GAMT (“Serum/Plasma Guanidinoacetate”). These outcomes are important to stakeholders and should consequently be considered for measurement in every clinical trial. Caregivers were valued partners throughout the COS development process, which increased community engagement and facilitated caregiver empowerment. Conclusions Development of this COS illustrates a patient-centered approach to clinical trial readiness for CTD and GAMT that, if utilized, will make clinical trial results comparable, minimize bias in clinical trial outcome selection, and promote efficient use of resources.
Toshio Irino, Ayako Yamamoto, Fuki Miyazaki
Age-related hearing loss (HL) reduces speech intelligibility (SI) in older adults (OAs). However, deficits in central and cognitive processing also substantially impact SI. Understanding these contributions is essential for explaining individual differences and developing effective assistive hearing strategies. This study presents a framework that distinguishes peripheral HL from central and cognitive influences on SI. This framework uses the Wakayama University Hearing Impairment Simulator (WHIS), and the Gammachirp Envelope Similarity Index (GESI), an objective measure of intelligibility. First, speech-in-noise tests were conducted with young, normal-hearing listeners (YNHs) using WHIS to simulate the audiogram of a target OA. The target OA achieved SI scores comparable to or higher than those of YNHs with simulated HL, suggesting contributions beyond peripheral hearing function. Then, GESI was used to predict SI scores for YNHs and OAs across different hearing levels. The prediction accuracy was comparable for both groups. Interestingly, many OAs' subjective SI scores were higher than those predicted using parameters derived from YNHs' experiments. This finding is inconsistent with previous research indicating that speech perception ability declines with age. This issue will be discussed. There was no significant correlation between the average hearing levels and the residual differences between the subjective and predicted SI scores. This suggests that GESI effectively absorbed the effects of peripheral HL. Thus, the proposed framework may facilitate systematic examination and comparison of central and cognitive factors beyond peripheral HL among individual YNHs and OAs with and without HL.
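The peripheral stage of such a framework can be pictured as a per-band attenuation of the speech spectrum by a listener's audiogram. The following is a minimal sketch of that idea only, not the actual WHIS implementation; the band frequencies and hearing levels are hypothetical example values.

```python
# Hypothetical example audiogram: hearing level in dB HL per band center (Hz).
AUDIOGRAM = {250: 10, 500: 15, 1000: 25, 2000: 40, 4000: 55, 8000: 65}

def attenuate_band(power, hearing_level_db):
    """Reduce a band's linear power according to a dB hearing level."""
    return power * 10 ** (-hearing_level_db / 10)

def simulate_hearing_loss(band_powers):
    """band_powers: {center_freq_hz: linear power}. Returns attenuated bands."""
    return {f: attenuate_band(p, AUDIOGRAM.get(f, 0))
            for f, p in band_powers.items()}

bands = {250: 1.0, 1000: 1.0, 4000: 1.0}
simulated = simulate_hearing_loss(bands)
# Bands where the simulated loss is largest are attenuated most.
```

A real simulator shapes the waveform itself rather than band powers, but the dB-to-linear mapping is the same: every 10 dB of hearing level removes another factor of ten in power.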
Arnab Kumar Roy, Hemant Kumar Kathania, Paban Sapkota
Dysarthric speech severity classification is crucial for objective clinical assessment and progress monitoring in individuals with motor speech disorders. Although prior methods have addressed this task, achieving robust generalization in speaker-independent (SID) scenarios remains challenging. This work introduces DSSCNet, a novel deep neural architecture that combines convolutional, Squeeze-and-Excitation (SE), and residual networks, helping it extract discriminative representations of dysarthric speech from mel spectrograms. The SE block selectively focuses on the important features of the dysarthric speech, thereby minimizing loss and enhancing overall model performance. We also propose a cross-corpus fine-tuning framework for severity classification, adapted from detection-based transfer learning approaches. DSSCNet is evaluated on two benchmark dysarthric speech corpora, TORGO and UA-Speech, under two speaker-independent evaluation protocols: One-Speaker-Per-Severity (OSPS) and Leave-One-Speaker-Out (LOSO). DSSCNet achieves accuracies of 56.84% and 62.62% under OSPS, and 63.47% and 64.18% under LOSO, on TORGO and UA-Speech respectively, outperforming existing state-of-the-art methods. Upon fine-tuning, performance improves substantially, with DSSCNet achieving up to 75.80% accuracy on TORGO and 68.25% on UA-Speech under OSPS, and up to 77.76% and 79.44%, respectively, under LOSO. These results demonstrate the effectiveness and generalizability of DSSCNet for fine-grained severity classification across diverse dysarthric speech datasets.
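The squeeze-and-excitation mechanism the abstract credits can be sketched in a few lines: global average pooling summarizes each channel, a small bottleneck network turns the summaries into per-channel gates, and the gates rescale the channels. This is a toy illustration with plain lists and made-up weights, not the DSSCNet architecture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def squeeze(channels):
    """Global average pooling: one summary value per channel."""
    return [sum(c) / len(c) for c in channels]

def excite(summary, w1, w2):
    """Two tiny dense layers (ReLU then sigmoid) produce per-channel gates.
    w1 and w2 are hypothetical example weight matrices."""
    hidden = [max(0.0, sum(w * s for w, s in zip(row, summary))) for row in w1]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]

def se_block(channels, w1, w2):
    """Rescale each channel by its learned gate (the SE 'recalibration')."""
    gates = excite(squeeze(channels), w1, w2)
    return [[g * x for x in c] for g, c in zip(gates, channels)]

# Two channels, a bottleneck of one hidden unit.
channels = [[1.0, 3.0], [2.0, 2.0]]
w1 = [[0.5, 0.5]]      # 2 -> 1
w2 = [[1.0], [-1.0]]   # 1 -> 2
reweighted = se_block(channels, w1, w2)
```

With these example weights the first channel is amplified relative to the second, which is the behavior the abstract describes: the block learns to emphasize informative channels and suppress the rest.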
Sai Samrat Kankanala, Ram Chandra, Sriram Ganapathy
Auditory attention and selective phase-locking are central to human speech understanding in complex acoustic scenes and cocktail party settings, yet these capabilities in multilingual subjects remain poorly understood. While machine understanding of natural speech has advanced in recent years, questions persist about comprehension of overlapped and mixed-channel speech. We propose a systematic paradigm for studying humans and machines in speech question-answering tasks in multilingual settings with clean and mixed-channel speech. For human listeners, selective attention to a target speaker was significantly better in their native language (L1) than in their second language (L2). For machine listening, speech-based large language models (LLMs) match or exceed human performance in clean, single-speaker conditions but often struggle to selectively attend in two-speaker settings. These results reveal a key divergence: humans rely on attentional cues that are more streamlined in their native language, whereas LLMs default to parallel information extraction, which exceeds human skills.
Muhammad Rashid, Musa Khan
This research aims to explore the advent of the Metaverse as a medium in the emergent virtual reality landscape. It draws on the work of media theorist Marshall McLuhan on the relationship among human beings, society, culture, and technology, defined under the term “the medium is the message”. The study examines the possible revival of McLuhan’s notion in the Age of the Metaverse, as well as in other virtual worlds. The Metaverse is an emerging technology of shared virtual spaces that combine virtual reality, augmented reality, mixed reality, and other immersive technologies. Following the notions of medium theory and technological determinism as analytical frameworks, the study provides detailed insights into the influence of new technologies reshaping people’s perception as well as the understanding of new communication tools and media. The research is based on the systematic literature review method. The findings present the current state of research on the Metaverse and virtual reality, including their usage and applications within the framework of McLuhan’s media theories. The review identifies several themes that emerge from the literature, including the Extension of Man, the Global Village, McLuhan’s new medium, and the medium is the message in the virtual age. The results suggest that the Metaverse, as a new form of immersive medium, continues to emerge and shape human communication and social interactions.
Junyi Peng, Marc Delcroix, Tsubasa Ochiai et al.
Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, the utilization of these models in complex multi-talker scenarios, such as extracting a target speaker in a mixture, is yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task to evaluate the feature extraction capabilities of pre-trained SSL models. TSE uniquely requires both speaker identification and speech separation, distinguishing it from other tasks in the Speech processing Universal PERformance Benchmark (SUPERB) evaluation. Specifically, we propose a TSE downstream model composed of two lightweight task-oriented modules based on the same frozen SSL model. One module functions as a speaker encoder to obtain target speaker information from an enrollment speech, while the other estimates the target speaker's mask to extract its speech from the mixture. Experimental results on the Libri2mix datasets reveal the relevance of the TSE downstream task to probe SSL models, as its performance cannot be simply deduced from other related tasks such as speaker verification and separation.
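The second module's mask-and-extract step reduces to an elementwise product: a mask in [0, 1] per time-frequency bin is applied to the mixture representation to recover the target speaker. The sketch below illustrates only that operation with made-up values; in the paper the mask is estimated by a lightweight module on top of frozen SSL features.

```python
def apply_mask(mixture, mask):
    """Elementwise target extraction.
    mixture, mask: 2-D lists (time x frequency) of equal shape."""
    return [[m * x for m, x in zip(mask_row, mix_row)]
            for mask_row, mix_row in zip(mask, mixture)]

# Hypothetical 2x2 mixture magnitudes and an estimated soft mask.
mixture = [[1.0, 2.0], [3.0, 4.0]]
mask    = [[1.0, 0.0], [0.5, 1.0]]
target  = apply_mask(mixture, mask)
# target == [[1.0, 0.0], [1.5, 4.0]]
```

A mask value near 1 keeps a bin dominated by the target speaker; a value near 0 suppresses bins dominated by the interfering speaker, which is why the task needs both speaker identification (to know whose bins to keep) and separation (to estimate the mask).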
S. Johanan Joysingh, P. Vijayalakshmi, T. Nagarajan
A novel feature, based on the chirp z-transform, that offers an improved representation of the underlying true spectrum is proposed. This feature, the chirp MFCC, is derived by computing the Mel frequency cepstral coefficients from the chirp magnitude spectrum instead of the Fourier transform magnitude spectrum. The theoretical foundations for the proposal are discussed, along with experimental validation using the product of likelihood Gaussians, which shows the improved class separation offered by the proposed chirp MFCC compared with vanilla MFCC. Further, real-world evaluation of the feature is performed using three diverse tasks, namely speech-music classification, speaker identification, and speech command recognition. It is shown that in all three tasks the proposed chirp MFCC offers considerable improvements.
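The chirp z-transform that underlies the feature evaluates the z-transform along a spiral contour A * W^(-k) instead of the unit circle. A direct O(N*M) sketch (not the paper's implementation, which would use an FFT-based CZT) makes the definition concrete; with A = 1 and W = exp(-2j*pi/N) it reduces to the ordinary DFT, and other contours give the sharper spectrum the chirp MFCC is computed from.

```python
import cmath

def czt(x, m, w, a):
    """Direct chirp z-transform:
    X_k = sum_n x[n] * a**(-n) * w**(n*k), for k = 0..m-1."""
    return [sum(x[n] * a ** (-n) * w ** (n * k) for n in range(len(x)))
            for k in range(m)]

def dft_via_czt(x):
    """Sanity check: the CZT on the unit circle is the DFT."""
    n = len(x)
    return czt(x, n, cmath.exp(-2j * cmath.pi / n), 1.0)

# An impulse has a flat magnitude spectrum.
spectrum = dft_via_czt([1.0, 0.0, 0.0, 0.0])
```

The chirp MFCC then proceeds as usual: take the magnitude of the CZT output, apply the Mel filterbank, and compute cepstral coefficients from the log filterbank energies.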
Sadia Malik, Saiqa Imtiaz Asif, Nasir Saleem
The discrepancy between teachers' self-reported and actual knowledge can be lethal for any pedagogic process: teachers cannot achieve the learning objectives if they cannot transfer the required knowledge and skills to the learners. This research intends to explore the gap between the phonics-related self-reported and actual knowledge of Pakistani government primary school teachers, to see how far they are competent in phonics knowledge. This investigation further elucidates the quality of pre- and in-service teacher preparation programs in equipping primary grade instructors with the required skill. For this purpose, 230 government primary school teachers were asked to report their phonics knowledge through a Likert scale questionnaire. In addition, a section of 10 MCQs was added to assess their actual knowledge of different aspects of phonics, i.e., the definition of phonics, the number of sounds in the English language, identifying consonant and vowel digraphs, short and long vowel sounds, the number of sounds and syllables in a word, and consonant blends. The findings revealed a severe discrepancy between the self-reported and actual knowledge, as only 28.3% of teachers could score average or above-average marks, though the majority overrated their competence. Comparatively older, less qualified, and less professionally trained teachers with non-M.A. English qualifications were more prone to overvaluing their knowledge. This situation implies that specific phonics content-based training is not provided to the teachers during pre- and in-service teacher preparation programs. Although the phonology of English is taught to them to some extent, teacher knowledge is not assessed at any stage. These findings propose content-based teacher training, teacher knowledge assessment, and the incorporation of technology to facilitate teachers.
Ye. N. Zharinova
The relevance of the research is justified by the fact that modernity requires an increase in the productivity of graduates of higher educational institutions in independently formulating and solving professional tasks, increasing the requirements for their quality and the products they create. A special area is teaching the culture of communication and speech (oral, written), the culture of establishing professionally appropriate relationships between the subjects of education, ensuring their productive formation. One of the signs of productivity is the creation of innovative educational technologies. The purpose of the study is to develop a justification and approbation of a system of psychological and pedagogical methods to identify indicators of the productivity of the subjects of education and their development in the professional, educational and scientific environment of the university. The hypothesis that psychological training of personal growth and productive interaction of educational subjects develops activity in communication, confidence, emotional stability, and internal locus of control has been confirmed. Sample: 3440 people, including 290 teachers of higher education and 3150 students of psychological and pedagogical, economic, information technology, service, technical, agricultural specialties, as well as specialties in physical culture and sports (aged 18–60 years), for the period 2001–2022. To identify indicators of productivity of subjects of education, the following methods were used: a questionnaire to determine the communicative potential (R.A. Maksimova); express diagnostics of SAN (V.A. Doskina, N.A. Lavrentieva, V.B. Sharai, M.P. Miroshnikov); a test questionnaire of the level of subjective control (E.F. Bazhin, E.A. Golinkina, A.M. Etkind).
Results: the differences in indicators of communicative properties and features of subjective locus of control in the control and experimental groups prove the effectiveness of the psychological training program developed by us. Conclusions: it is determined that the creation of innovative educational technologies plays an important role in the intensification of personal growth and productive interaction of subjects of education by means of psychological training.
Adrian Bogdan Stânea, Vlad Striletchi, Cosmin Striletchi et al.
Features derived from large speech models have recently shown increased performance over signal-based features across multiple downstream tasks, even when the networks are not fine-tuned towards the target task. In this paper we show the results of an analysis of several signal- and neural-model-derived features for speech emotion recognition. We use pretrained models and explore their inherent potential abstractions of emotions. Simple classification methods are used so as not to interfere with or add knowledge to the task. We show that, even without fine-tuning, some of these large neural speech models' representations can enclose information that enables performance close to, and even beyond, state-of-the-art results across six standard speech emotion recognition datasets.
Shivam Mehta, Siyang Wang, Simon Alexanderson et al.
With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach. Please see https://shivammehta25.github.io/Diff-TTSG/ for video examples, data, and code.
Seamless Communication, Loïc Barrault, Yu-An Chung et al.
Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model: SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication
Portia Carr, D. Moser, Shana Williamson et al.
Many persons with aphasia (PWA) have limited access to speech-language treatment (SLT) due to limited funding, speech-language pathologist shortages, geographical barriers, physical disabilities, transportation barriers, and the COVID-19 pandemic. The purpose of this study was to determine if telepractice is an effective and feasible service delivery model for PWA. Ten PWA completed 8 hours of remote treatment over 4 weeks. Synchronous telepractice sessions employed Oral Reading for Language in Aphasia (ORLA) and Conversational Script Training (CST). Pre- and post-assessment outcome measures included the Communication Activities of Daily Living-3 (CADL-3) and the Communication Confidence Rating Scale for Aphasia (CCRSA). Participants completed a telepractice satisfaction survey following post-assessment. All participants demonstrated improvements in CCRSA scores, total words produced correctly on trained CST stimuli, and total words produced correctly on trained ORLA stimuli. No differences were noted in CADL-3 scores. All participants were highly satisfied with telepractice as a service delivery model.
Rebecca Amery, J. G. Wunungmurra, Parimala Raghavendra et al.
Abstract Yolŋu, Aboriginal people from Arnhem Land, Australia, are at risk of Machado-Joseph disease, with progressive loss of speech. Yolŋu are interested in developing augmentative and alternative communication (AAC) systems in their own languages. This research aimed to develop a culturally responsive process to explore and create a core vocabulary word list for Yolŋu adults living with the disease, for inclusion in AAC system prototypes. A list of 243 Yolŋu words and morphemes was created. In this highly collaborative, mixed-methods, participatory action research, Balanda (the Yolŋu word for non-Aboriginal people) and Yolŋu researchers conducted cycles of transcription and analysis of a language sample, with oral group discussions to identify which words to include, omit, or add, based on Yolŋu perceptions of the structure and use of their languages. A Yolŋu metaphor, Gulaka-buma (“Harvesting yams”), was identified by Yolŋu researchers to represent and share the research process and findings. Three key themes were identified that summarize the main cultural and linguistic considerations related to changes made to the core vocabulary. Study findings emphasized the role of language as an expression of culture and identity for Indigenous peoples and the importance of considering cultural and linguistic factors in selecting vocabulary for AAC systems.
Page 7 of 125777