Results for "Oral communication. Speech"

Showing 20 of ~156,327 results · from arXiv, DOAJ, Semantic Scholar

arXiv Open Access 2025
EDNet: A Versatile Speech Enhancement Framework with Gating Mamba Mechanism and Phase Shift-Invariant Training

Doyeop Kwak, Youngjoon Jang, Seongyu Kim et al.

Speech signals in real-world environments are frequently affected by various distortions such as additive noise, reverberation, and bandwidth limitation, which may appear individually or in combination. Traditional speech enhancement methods typically rely on either masking, which focuses on suppressing non-speech components while preserving observable structure, or mapping, which seeks to recover clean speech through direct transformation of the input. Each approach offers strengths in specific scenarios but may be less effective outside its target conditions. We propose the Erase and Draw Network (EDNet), a versatile speech enhancement framework designed to handle a broad range of distortion types without prior assumptions about task or input characteristics. EDNet consists of two main components: (1) the Gating Mamba (GM) module, which adaptively combines masking and mapping through a learnable gating mechanism that selects between suppression (Erase) and reconstruction (Draw) based on local signal features, and (2) Phase Shift-Invariant Training (PSIT), a shift-tolerant supervision strategy that improves phase estimation by enabling dynamic alignment during training while remaining compatible with standard loss functions. Experimental results on denoising, dereverberation, bandwidth extension, and multi-distortion enhancement tasks show that EDNet consistently achieves strong performance across conditions, demonstrating its architectural flexibility and adaptability to diverse task settings.
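The gated mask/map combination the abstract describes can be sketched numerically as follows; the function name, shapes, and per-bin gating are illustrative assumptions, not EDNet's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_enhance(noisy_spec, mask_branch, map_branch, gate_logits):
    """Blend a masking estimate and a mapping estimate per time-frequency bin.

    gate = sigmoid(gate_logits) selects between suppression ("Erase", masking)
    and reconstruction ("Draw", mapping). All inputs are arrays of equal shape;
    in a real network the gate would be predicted from local signal features.
    """
    gate = sigmoid(gate_logits)            # in (0, 1), learned in practice
    erased = mask_branch * noisy_spec      # masking: suppress non-speech energy
    drawn = map_branch                     # mapping: direct regression of clean speech
    return gate * erased + (1.0 - gate) * drawn
```

With strongly positive gate logits the output follows the masking branch; with strongly negative logits it follows the mapping branch.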

en eess.AS, cs.SD
arXiv Open Access 2025
Best Practices and Considerations for Child Speech Corpus Collection and Curation in Educational, Clinical, and Forensic Scenarios

John Hansen, Satwik Dutta, Ellen Grand

A child's spoken ability continues to change into adulthood. Until age 7-8, their speech sound development and language structure evolve rapidly. This dynamic shift in their spoken communication skills, together with data-privacy concerns, makes it challenging to curate technology-ready speech corpora for children. This study aims to bridge this gap and provide researchers and practitioners with best practices and considerations for developing such a corpus based on an intended goal. Although primarily focused on educational goals, applications of child speech data have spread to other fields, including clinical and forensic work. Motivated by this goal, we describe the WHO, WHAT, WHEN, and WHERE of data collection, inspired by prior collection efforts and our own experience. We also provide a guide to establishing collaboration and trust, and to navigating the human-subjects research protocol. This study concludes with guidelines for corpus quality checks, triage, and annotation.

en cs.SD, cs.CY
arXiv Open Access 2025
Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models

Qiongqiong Wang, Hardik B. Sailor, Jeremy H. M. Wong et al.

Current large speech language models (Speech-LLMs) often exhibit limitations in empathetic reasoning, primarily due to the absence of training datasets that integrate both contextual content and paralinguistic cues. In this work, we propose two approaches to incorporate contextual paralinguistic information into model training: (1) an explicit method that provides paralinguistic metadata (e.g., emotion annotations) directly to the LLM, and (2) an implicit method that automatically generates novel training question-answer (QA) pairs using both categorical and dimensional emotion annotations alongside speech transcriptions. Our implicit method boosts performance (LLM-judged) by 38.41% on a human-annotated QA benchmark, reaching 46.02% when combined with the explicit approach, showing effectiveness in contextual paralinguistic understanding. We also validate the LLM judge by demonstrating its correlation with classification metrics, providing support for its reliability.

en cs.CL, cs.AI
DOAJ Open Access 2025
Cross-Cultural Adaptation and Validation of the Mini-Eating and Drinking Ability Classification System for Korean Children with Cerebral Palsy Aged 18–36 Months

You Gyoung Yi, Seoyon Yang, Jeong-Yi Kwon et al.

Background/Objectives: Feeding and swallowing difficulties are common in young children with cerebral palsy (CP), yet no validated tool has been available in Korea for those under 3 years. The Mini-Eating and Drinking Ability Classification System (Mini-EDACS) was designed for children aged 18–36 months. This study aimed to translate the Mini-EDACS into Korean and evaluate its reliability and validity. Methods: Translation followed international guidelines, including forward–backward translation and Delphi consensus with experts in pediatric dysphagia. Forty-eight children with CP (mean age 27.1 ± 5.0 months) were assessed. Caregivers and speech–language pathologists (SLPs) independently rated Mini-EDACS and assistance levels. Inter-rater reliability was examined using Cohen’s κ. Construct validity was tested by Spearman’s correlations with the Gross Motor Function Classification System (GMFCS), Mini-MACS, the Communication Function Classification System (CFCS), the Visual Function Classification System (VFCS), and the Functional Oral Intake Scale for Children (FOIS-C). Results: Agreement between caregivers and SLPs was excellent (κ = 0.90; weighted κ = 0.98). Assistance-level ratings also showed almost perfect concordance (κ = 0.97). Mini-EDACS correlated strongly with FOIS-C (ρ = −0.86, p < 0.001) and with assistance levels (ρ = 0.81, p < 0.001). Moderate-to-strong positive correlations were observed with GMFCS (ρ = 0.56), Mini-MACS (ρ = 0.64), CFCS (ρ = 0.61), and VFCS (ρ = 0.61), supporting construct validity. Conclusions: The Korean Mini-EDACS is a reliable and valid tool for classifying eating and drinking abilities in children with CP under 3 years. It enables standardized communication between caregivers and clinicians, complements existing functional classification systems, and may facilitate earlier identification and intervention for feeding difficulties.
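The reliability and validity statistics this abstract reports (Cohen's κ, Spearman's ρ) can be computed from paired ratings with a few lines of code; this is a generic sketch of the two measures, not the study's analysis pipeline, and it assumes no tied values when ranking:

```python
import numpy as np

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: chance-corrected agreement between two raters."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    cats = np.union1d(a, b)
    po = np.mean(a == b)                                          # observed agreement
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in cats)     # chance agreement
    return (po - pe) / (1.0 - pe)

def spearman_rho(x, y):
    """Spearman's rank correlation (assumes no ties; ties need average ranks)."""
    def ranks(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        return r
    rx, ry = ranks(np.asarray(x, float)), ranks(np.asarray(y, float))
    rx -= rx.mean()
    ry -= ry.mean()
    return float(np.sum(rx * ry) / np.sqrt(np.sum(rx ** 2) * np.sum(ry ** 2)))
```

Perfect agreement yields κ = 1, and a monotone relationship yields ρ = ±1, matching the interpretation of the values reported above.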

arXiv Open Access 2024
Scalable Speech Enhancement with Dynamic Channel Pruning

Riccardo Miccini, Clement Laroche, Tobias Piechowiak et al.

Speech Enhancement (SE) is essential for improving productivity in remote collaborative environments. Although deep learning models are highly effective at SE, their computational demands make them impractical for embedded systems. Furthermore, acoustic conditions can change significantly in terms of difficulty, whereas neural networks are usually static with regard to the amount of computation performed. To this end, we introduce Dynamic Channel Pruning to the audio domain for the first time and apply it to a custom convolutional architecture for SE. Our approach works by identifying unnecessary convolutional channels at runtime and saving computational resources by skipping both the computation of these channels' activations and the retrieval of their filters. When trained to use only 25% of channels, we save 29.6% of MACs while causing only a 0.75% drop in PESQ. Thus, Dynamic Channel Pruning (DynCP) offers a promising path toward deploying larger and more powerful SE solutions on resource-constrained devices.
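The runtime channel-selection idea can be illustrated as follows; the gating scores, keep ratio, and masking below are a simplified sketch (a real implementation would skip the convolutions entirely rather than zero their outputs):

```python
import numpy as np

def prune_channels(activations, gate_scores, keep_ratio=0.25):
    """Keep only the top-scoring fraction of channels; zero out the rest.

    activations: (channels, time) feature map.
    gate_scores: (channels,) runtime importance estimates.
    In deployment, the pruned channels' activations would not be computed
    and their filters would not be fetched, which is where the MAC savings
    reported in the abstract come from.
    """
    c = activations.shape[0]
    k = max(1, int(round(keep_ratio * c)))
    keep = np.argsort(gate_scores)[-k:]            # indices of the k largest scores
    mask = np.zeros(c, dtype=bool)
    mask[keep] = True
    return activations * mask[:, None], mask
```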

en eess.AS, cs.LG
DOAJ Open Access 2024
An exploratory study of children’s expressive language productivity in relation to teachers’ use of communication-supporting strategies

Karin Edlund, Liselotte Kjellmer, Helena Hemmingsson et al.

Supporting children’s language development, particularly oral language, is crucial as it constitutes the basis for the further development of literacy skills. This study explored patterns of primary school children’s expressive language productivity over time in relation to a professional development (PD) program designed to further teachers’ use of communication-supporting strategies to promote child talk. Framed by a Tier 2 response to intervention model, teacher-child verbal interactions were observed during structured small-group conversations. The participants were 36 children (mean age 8;2 years), with and without speech, language and communication needs (SLCN), attending three mainstream classes in two Swedish municipalities. Two teachers followed a 10-week PD program, and a third teacher participated for comparison purposes. We measured the children’s expressive language productivity in relation to the teachers’ strategy use pre-intervention, post-intervention, and at a two-month follow-up. The results showed a tendency for children’s expressive language productivity to be suppressed when the teachers used the strategies frequently. Although the results showed large variation in the children’s expressive language productivity, a tentative conclusion is that children with SLCN may also be stimulated and willing to participate verbally during this type of small-group conversation. This finding suggests support for an inclusive Tier 2 approach of mixed groups of children with and without SLCN. Based on our findings, we also suggest stressing in the PD program the need for teachers to balance their strategy use in interactions with the children, to provide ample opportunities for the children to talk and express themselves.

Education (General)
DOAJ Open Access 2024
La singularidad de los animales en la lírica tradicional de México: posibles resonancias del pensamiento prehispánico

Rosa Virginia Sánchez García

Starting from the premise that the animals appearing in Mexican popular coplas display certain distinctive traits compared with other Latin American songbooks, these pages offer a comparative approach between the lyric tradition of the ancient Mexican poets and that of today's popular coplas, in order to determine how plausible it is that some symbolic representations of animals in pre-Hispanic thought may have made their way into the Hispanic coplas, conferring over time part of the personality that the various animals display in the Mexican songbook today.

Oral communication. Speech, French literature - Italian literature - Spanish literature - Portuguese literature
arXiv Open Access 2023
End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations

Giovanni Morrone, Samuele Cornell, Luca Serafini et al.

Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to the recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance both in online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets: CALLHOME and Fisher Corpus (Part 1 and 2) and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 s. Finally, we also show that the separated signals can readily be used for automatic speech recognition as well, reaching performance close to using oracle sources in some configurations.
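The two-stage SSGD pipeline (separate, then run VAD per stream) reduces to a very small sketch; the energy-based VAD and frame size below are illustrative stand-ins for the neural components the paper actually uses:

```python
import numpy as np

def energy_vad(signal, frame=160, thresh=0.01):
    """Toy frame-level VAD: True where mean-square frame energy exceeds thresh."""
    n = len(signal) // frame
    frames = signal[: n * frame].reshape(n, frame)
    return (frames ** 2).mean(axis=1) > thresh

def ssgd_diarize(separated_streams, frame=160, thresh=0.01):
    """Speech-separation-guided diarization, reduced to its two stages:
    (1) assume the mixture is already separated into per-speaker streams,
    (2) run VAD on each stream to obtain per-speaker activity over frames."""
    return {spk: energy_vad(s, frame, thresh)
            for spk, s in separated_streams.items()}
```

The per-speaker boolean activity arrays are exactly what a diarization error rate (DER) scorer would consume downstream.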

en eess.AS, cs.LG
arXiv Open Access 2023
TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models

Shengpeng Ji, Jialong Zuo, Minghui Fang et al.

Recently, there has been a growing interest in the field of controllable Text-to-Speech (TTS). While previous studies have relied on users providing specific style factor values based on acoustic knowledge or selecting reference speeches that meet certain requirements, generating speech solely from natural text prompts has emerged as a new challenge for researchers. This challenge arises due to the scarcity of high-quality speech datasets with natural-text style prompts and the absence of advanced text-controllable TTS models. In light of this, 1) we propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes. The dataset comprises 236,220 pairs of style prompts in natural text descriptions with five style factors and corresponding speech samples. Through iterative experimentation, we introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes. 2) Furthermore, to address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle. This architecture treats text-controllable TTS as a language model task, utilizing audio codec codes as an intermediate representation to replace the conventional mel-spectrogram. Finally, we successfully demonstrate the ability of the proposed model by showing comparable performance in the controllable TTS task. Audio samples are available at https://sall-e.github.io/

en eess.AS, cs.LG
DOAJ Open Access 2023
Impact of English Language Imperialism on Pakistani Learners’ Psychology

Tania Ali Khan

Over the past few years, the English language has received exceptional attention in Pakistan. Riding this momentum, students have embarked on learning and communicating in English at the cost of neglecting their native language, Urdu. The current study attempted to discuss the dominance of the English language and its consequences for the psychology of learners. This study addressed the question of whether students perceive English language imperialism as a threat to their national language, Urdu. The current study is quantitative in nature. A sample of 100 participants was selected from learners studying in undergraduate and graduate programs at Kinnaird College for Women Lahore, Minhaj University, and the University of Punjab, Lahore. A self-constructed questionnaire was used to gather the information. Moreover, the impact of the independent variable, English language imperialism, on the dependent variable, students' psychology, was investigated. The results indicated a notable interconnection among various psychological factors, collectively exerting a significant influence on students' psychological well-being, since learners consider English a sign of accomplishment in their academic and professional careers.

Language. Linguistic theory. Comparative grammar, Oral communication. Speech
DOAJ Open Access 2023
Stimulus-response binding is not a gradually learned association between specific stimuli and their responses: Evidence from a teenage bilingual population

Nkrumah Ivy Kesewaa

In the current study, participants made a verbal naming response to a prime target word flanked by a distractor word, followed by a lexical decision response to a probe target word or nonword, flanked by a distractor word. By tracking potential priming effects from having either the prime target become the probe target (attended repetition condition) or the prime distractor become the probe target (ignored repetition condition), consistent positive and negative priming effects were obtained. These results broaden our understanding that stimulus-response binding does not need to be gradually learned (Henson et al., 2014). Rather, it can be formed from a single S-R pairing.

Oral communication. Speech, Psychology
arXiv Open Access 2022
ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild

Xuechen Liu, Xin Wang, Md Sahidullah et al.

Benchmarking initiatives support the meaningful comparison of competing solutions to prominent problems in speech and language processing. Successive benchmarking evaluations typically reflect a progressive evolution from ideal lab conditions toward those encountered in the wild. ASVspoof, the spoofing and deepfake detection initiative and challenge series, has followed the same trend. This article provides a summary of the ASVspoof 2021 challenge and the results of 54 participating teams that submitted to the evaluation phase. For the logical access (LA) task, results indicate that countermeasures are robust to newly introduced encoding and transmission effects. Results for the physical access (PA) task indicate the potential to detect replay attacks in real, as opposed to simulated, physical spaces, but a lack of robustness to variations between simulated and real acoustic environments. The Deepfake (DF) task, new to the 2021 edition, targets solutions to the detection of manipulated, compressed speech data posted online. While detection solutions offer some resilience to compression effects, they lack generalization across different source datasets. In addition to a summary of the top-performing systems for each task, new analyses of influential data factors and results for hidden data subsets, the article includes a review of post-challenge results, an outline of the principal challenge limitations and a road-map for the future of ASVspoof.

en cs.SD, cs.CR
DOAJ Open Access 2022
Etalonnage de « La Gestion de l’implicite » pour l’évaluation de la compréhension des inférences auprès de patients adultes et étude de l’impact des variables socio-démographiques sur les résultats au test

Capucine DIEMUNSCH, Morgane BOISEAU, Annick DUCHÊNE MAY-CARLE

Communication arises from the human need to maintain social interactions with others. It therefore cannot be reduced to a combination of purely linguistic elements; it is defined as a social act. Pragmatic elements are indispensable for a holistic understanding of the utterance conveyed. A faulty understanding of language can expose individuals to significant repercussions in their everyday social, family, school, or professional lives, so it is essential to assess these difficulties precisely. The speech-language assessment evaluates both the formal and the pragmatic sides of communication. A rigorous evaluation of elaborated language in adults is indispensable, as it constitutes the cornerstone of their future care. For assessing the comprehension of inferences, the test La Gestion de l'implicite (Duchêne May-Carle, 2000) is frequently used by speech-language pathologists. This tool evaluates elaborated language through explicit, logical, pragmatic, and inferential verbal processing, and tests resistance to distractor stimuli. Its administration yields indispensable quantitative information, but the absence of norms limits its use, since norming is the process that brings out inter-individual differences and gives raw scores their full value. This study first proposes norms for the test, allowing a precise link between a result obtained on the test and on the different question categories, along with reflections on the dimensions observed. We also examined whether various socio-demographic variables significantly affected test results. We found a significant effect of sex, with women scoring significantly higher than men.
Furthermore, results were significantly correlated with the number of years of higher education. In contrast, no effect of age on test performance was demonstrated. Editor's note: the author provides readers with an .xlsx file for computing the norms.

Oral communication. Speech, Pathology
arXiv Open Access 2021
Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging

Tamás Gábor Csapó

In this paper, we present our first experiments in text-to-articulation prediction, using ultrasound tongue image targets. We extend a traditional (vocoder-based) DNN-TTS framework with predicting PCA-compressed ultrasound images, from which the continuous tongue motion can be reconstructed in synchrony with synthesized speech. We use the data of eight speakers, train fully connected and recurrent neural networks, and show that FC-DNNs are more suitable for the prediction of sequential data than LSTMs in the case of limited training data. Objective experiments and visualized predictions show that the proposed solution is feasible and the generated ultrasound videos are close to natural tongue movement. Articulatory movement prediction from text input can be useful for audiovisual speech synthesis or computer-assisted pronunciation training.

en eess.AS, cs.SD
arXiv Open Access 2021
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis

Erica Cooper, Xin Wang, Junichi Yamagishi

Speech synthesis and music audio generation from symbolic input differ in many aspects but share some similarities. In this study, we investigate how text-to-speech synthesis techniques can be used for piano MIDI-to-audio synthesis tasks. Our investigation includes Tacotron and neural source-filter waveform models as the basic components, with which we build MIDI-to-audio synthesis systems in similar ways to TTS frameworks. We also include reference systems using conventional sound modeling techniques such as sample-based and physical-modeling-based methods. The subjective experimental results demonstrate that the investigated TTS components can be applied to piano MIDI-to-audio synthesis with minor modifications. The results also reveal the performance bottleneck -- while the waveform model can synthesize high quality piano sound given natural acoustic features, the conversion from MIDI to acoustic features is challenging. The full MIDI-to-audio synthesis system is still inferior to the sample-based or physical-modeling-based approaches, but we encourage TTS researchers to test their TTS models for this new task and improve the performance.

en cs.SD, eess.AS
arXiv Open Access 2021
Language-Independent Approach for Automatic Computation of Vowel Articulation Features in Dysarthric Speech Assessment

Yuanyuan Liu, Nelly Penttilä, Tiina Ihalainen et al.

Imprecise vowel articulation can be observed in people with Parkinson's disease (PD). Acoustic features measuring vowel articulation have been demonstrated to be effective indicators of PD in its assessment. Standard clinical vowel articulation features of vowel working space area (VSA), vowel articulation index (VAI) and formant centralization ratio (FCR) are derived from the first two formants of the three corner vowels /a/, /i/ and /u/. Conventionally, manual annotation of the corner vowels from speech data is required before measuring vowel articulation. This process is time-consuming. The present work aims to reduce human effort in clinical analysis of PD speech by proposing an automatic pipeline for vowel articulation assessment. The method is based on automatic corner vowel detection using a language universal phoneme recognizer, followed by statistical analysis of the formant data. The approach removes the restrictions of prior knowledge of speaking content and the language in question. Experimental results on a Finnish PD speech corpus demonstrate the efficacy and reliability of the proposed automatic method in deriving VAI, VSA, FCR and F2i/F2u (the second formant ratio for vowels /i/ and /u/). The automatically computed parameters are shown to be highly correlated with features computed with manual annotations of corner vowels. In addition, automatically and manually computed vowel articulation features have comparable correlations with experts' ratings on speech intelligibility, voice impairment and overall severity of communication disorder. Language-independence of the proposed approach is further validated on a Spanish PD database, PC-GITA, as well as on TORGO corpus of English dysarthric speech.
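Given mean F1/F2 values for the corner vowels, the features named in the abstract follow from closed-form expressions; the formulas below are the commonly used clinical definitions (VSA as the /a/-/i/-/u/ triangle area, FCR as the reciprocal of VAI), stated here as background rather than taken from the paper itself:

```python
def vowel_articulation_features(f1, f2):
    """Vowel articulation metrics from corner-vowel formants.

    f1, f2: dicts mapping corner vowels 'a', 'i', 'u' to mean F1/F2 in Hz.
      VSA: area of the /a/-/i/-/u/ triangle in the F1-F2 plane (shoelace formula)
      VAI = (F2i + F1a) / (F1i + F1u + F2u + F2a)  -> smaller when centralized
      FCR = 1 / VAI                                -> larger when centralized
    """
    vsa = abs(f1['i'] * (f2['a'] - f2['u'])
              + f1['a'] * (f2['u'] - f2['i'])
              + f1['u'] * (f2['i'] - f2['a'])) / 2.0
    vai = (f2['i'] + f1['a']) / (f1['i'] + f1['u'] + f2['u'] + f2['a'])
    return {'VSA': vsa, 'VAI': vai, 'FCR': 1.0 / vai, 'F2i/F2u': f2['i'] / f2['u']}
```

Centralized (imprecise) articulation shrinks the VSA and pushes VAI below and FCR above their healthy-speech values, which is why these features track dysarthria severity.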

en eess.AS, cs.SD
arXiv Open Access 2021
Adversarial Example Devastation and Detection on Speech Recognition System by Adding Random Noise

Mingyu Dong, Diqun Yan, Yongkang Gong et al.

An automatic speech recognition (ASR) system based on a deep neural network is vulnerable to attack by adversarial examples, especially if the command-dependent ASR fails. A defense method against adversarial examples is proposed to improve the robustness and security of the ASR system. We propose an algorithm for the devastation and detection of adversarial examples that can attack current advanced ASR systems. We choose advanced text-dependent and command-dependent ASR systems as our targets, generating adversarial examples with an optimization-based attack on the text-dependent ASR and a GA-based algorithm on the command-dependent ASR. The method is based on input transformation of adversarial examples: random noise of different intensities and kinds is added to adversarial examples to devastate the perturbation previously added to normal examples. Experimental results show that the method performs well. For the devastation of examples, the similarity of the original speech after adding noise can reach 99.68%, the similarity of adversarial examples can reach zero, and the detection rate of adversarial examples can reach 94%.
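The core input-transformation idea (add random noise to destroy crafted perturbations, and flag inputs whose decision flips) can be sketched as follows; the Gaussian noise, fixed seed, and decision-flip heuristic are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def devastate(audio, noise_std=0.01, rng=None):
    """Input transformation defense: add random noise so that small,
    carefully-crafted adversarial perturbations are destroyed while the
    underlying speech remains largely intact."""
    if rng is None:
        rng = np.random.default_rng(0)
    return audio + rng.normal(0.0, noise_std, size=audio.shape)

def detect_adversarial(model, audio, noise_std=0.01, rng=None):
    """Flag an input as adversarial if the model's decision changes after the
    random-noise transformation (a common detection heuristic: benign inputs
    are robust to small noise, adversarial ones often are not)."""
    return model(audio) != model(devastate(audio, noise_std, rng))
```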

en cs.SD, cs.CR
DOAJ Open Access 2021
Social Responses to Media Technologies in the 21st Century: The Media are Social Actors Paradigm

Matthew Lombard, Kun Xu

Clifford Nass and his colleagues proposed the Computers Are Social Actors (CASA) paradigm in the 1990s and demonstrated that we treat computers in some of the ways we treat humans. To account for technological advances and to refine explanations for CASA results, this paper proposes the Media Are Social Actors (MASA) paradigm. We begin by distinguishing the roles of primary and secondary cues in evoking medium-as-social-actor presence and social responses. We then discuss the roles of individual differences and contextual factors in these responses and identify mindless and mindful anthropomorphism as two major complementary mechanisms for understanding MASA phenomena. Based on evolutionary psychology explanations for socialness, we conclude with nine formal propositions and suggestions for future research to test and apply MASA.

Technology (General), Oral communication. Speech
DOAJ Open Access 2021
Transcranial Direct-Current Stimulation and Behavioral Training, a Promising Tool for a Tailor-Made Post-stroke Aphasia Rehabilitation: A Review

Marina Zettin, Caterina Bondesan et al.

Aphasia is an acquired language disorder resulting from damage to portions of the brain which are responsible for language comprehension and formulation. This disorder can involve different levels of language processing with impairments in both oral and written comprehension and production. Over the last years, different rehabilitation and therapeutic interventions have been developed, especially non-invasive brain stimulation (NIBS) techniques. One of the most used NIBS techniques in aphasia rehabilitation is the Transcranial Direct-Current Stimulation (tDCS). It has been proven to be effective in promoting a successful recovery both in the short and the long term after a brain injury. The main strength of tDCS is its feasibility associated with relatively minor side effects, if safely and properly administered. TDCS requires two electrodes, an anode and a cathode, which are generally placed on the scalp. The electrode montage can be either unipolar or bipolar. The main aim of this review is to give an overview of the state of the art of tDCS for the treatment of aphasia. The studies described included patients with different types of language impairments, especially with non-fluent aphasia and in several cases anomia. The effects of tDCS are variable and depend on several factors, such as electrode size and montage, duration of the stimulation, current density and characteristics of the brain tissue underneath the electrodes. Generally, tDCS has led to promising results in rehabilitating patients with acquired aphasia, especially if combined with different language and communication therapies. The selection of the appropriate approach depends on the patients treated and their impaired language function. When used in combination with treatments such as Speech and Language Therapy, Constraint Induced Aphasia Therapy or Intensive Action Treatment, tDCS has generally promoted a better recovery of the impaired functions. 
In addition to these rehabilitation protocols, Action Observation Therapy, such as IMITAF, appeared to contribute to the reduction of post-stroke anomia. Combining such techniques with tDCS would therefore be a possibility for further improvement, also providing the clinician with a new action and intervention tool. The association of a tDCS protocol with a dedicated rehabilitation training would favor a generalized long-term improvement of the different components of language.

Neurosciences. Biological psychiatry. Neuropsychiatry
arXiv Open Access 2020
TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition

Ji Won Yoon, Hyeonseung Lee, Hyung Yong Kim et al.

In recent years, there has been a great deal of research in developing end-to-end speech recognition models, which enable simplifying the traditional pipeline and achieving promising results. Despite their remarkable performance improvements, end-to-end models typically incur a high computational cost to show successful performance. To reduce this computational burden, knowledge distillation (KD), which is a popular model compression method, has been used to transfer knowledge from a deep and complex model (teacher) to a shallower and simpler model (student). Previous KD approaches have commonly designed the architecture of the student model by reducing the width per layer or the number of layers of the teacher model. This structural reduction scheme might limit the flexibility of model selection since the student model structure should be similar to that of the given teacher. To cope with this limitation, we propose a new KD method for end-to-end speech recognition, namely TutorNet, that can transfer knowledge across different types of neural networks at the hidden representation-level as well as the output-level. For concrete realizations, we first apply representation-level knowledge distillation (RKD) during the initialization step, and then apply the softmax-level knowledge distillation (SKD) combined with the original task learning. When the student is trained with RKD, we make use of frame weighting that points out the frames to which the teacher model pays more attention. Through a number of experiments on the LibriSpeech dataset, it is verified that the proposed method not only distills the knowledge between networks with different topologies but also significantly contributes to improving the word error rate (WER) performance of the distilled student. Interestingly, TutorNet allows the student model to surpass its teacher's performance in some particular cases.
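The two distillation levels the abstract names can be written as one combined loss; the projection matrix, temperature, weighting scheme, and loss mixing below are a generic sketch in the spirit of TutorNet, not its published formulation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tutornet_style_loss(student_hidden, teacher_hidden, proj,
                        student_logits, teacher_logits,
                        frame_weights, temperature=2.0, alpha=0.5):
    """Two-level distillation sketch:
      RKD: MSE between projected student hidden states and teacher hidden
           states (the projection lets the two networks differ in width),
      SKD: soft cross-entropy between temperature-scaled output distributions,
           weighted per frame by where the teacher "pays attention".
    Shapes: hidden (T, d_s)/(T, d_t), proj (d_s, d_t), logits (T, V), weights (T,)."""
    rkd = np.mean((student_hidden @ proj - teacher_hidden) ** 2)
    p_teacher = softmax(teacher_logits / temperature)
    log_p_student = np.log(softmax(student_logits / temperature) + 1e-12)
    skd_per_frame = -(p_teacher * log_p_student).sum(axis=-1)
    skd = np.average(skd_per_frame, weights=frame_weights)
    return alpha * rkd + (1 - alpha) * skd
```

Because RKD only needs hidden states up to a learned projection, the student's topology is free to differ from the teacher's, which is the flexibility the abstract emphasizes.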

Page 16 of 7817