DisSR: Disentangling Speech Representation for Degradation-Prior Guided Cross-Domain Speech Restoration
Ziqi Liang, Zhijun Jia, Chang Liu
et al.
Previous work on speech restoration (SR) primarily focuses on single-task speech restoration (SSR), which cannot address general speech restoration problems. Training specific SSR models for different distortions is time-consuming and lacks generality. In addition, most studies ignore the problem of model generalization across unseen domains. To overcome these limitations, we propose DisSR, a general speech restoration model based on disentangled speech representations with two properties: 1) Degradation-prior guidance, which extracts a speaker-invariant degradation representation to guide the diffusion-based speech restoration model. 2) Domain adaptation, where we design cross-domain alignment training to enhance the model's adaptability and generalization on cross-domain data. Experimental results demonstrate that our method can produce high-quality restored speech under various distortion conditions. Audio samples can be found at https://itspsp.github.io/DisSR.
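To make the degradation-prior guidance concrete, here is a minimal sketch of one way such conditioning could be wired up: an encoder pools a degraded spectrogram into an utterance-level degradation embedding, which then modulates a diffusion denoiser's features via FiLM. The module shapes and the FiLM mechanism are illustrative assumptions; the abstract does not specify DisSR's actual architecture or disentanglement objective.

```python
# Sketch (PyTorch) of degradation-prior conditioning; shapes and the FiLM
# mechanism are illustrative assumptions, not DisSR's actual architecture.
import torch
import torch.nn as nn

class DegradationEncoder(nn.Module):
    """Maps a degraded mel spectrogram to a degradation embedding."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time: utterance-level prior
        )

    def forward(self, spec):               # spec: (B, n_mels, T)
        return self.net(spec).squeeze(-1)  # (B, dim)

class FiLM(nn.Module):
    """Feature-wise modulation of denoiser features by the degradation prior."""
    def __init__(self, dim=256, channels=80):
        super().__init__()
        self.to_scale_shift = nn.Linear(dim, 2 * channels)

    def forward(self, h, cond):            # h: (B, C, T), cond: (B, dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return h * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
```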
A study of the effectiveness of group training in speech-in-noise comprehension among adults with hearing aids and/or cochlear implants
Agathe Décultot, Françoise Estienne, Justine Wathour
Context. Although few studies exist on the subject, rehabilitation that trains speech comprehension in noise is essential: exercises in quiet conditions are no longer sufficient to address the difficulties of patients with hearing aids and/or cochlear implants. Understanding speech in noise is an essential skill for everyday communication. It is the primary complaint of deaf people with hearing aids and/or cochlear implants, as it affects their interactions and quality of life (Maillard et al., 2023).
Objective. This study assesses whether auditory training in noise improves speech comprehension in these difficult listening situations.
Method. An experimental group of nine deaf adults with hearing aids and/or cochlear implants received six auditory training sessions. A pre-test and a post-test assessed word perception in noise and within songs, spatial localization, dichotic listening abilities, and how each person experiences noise in everyday life. The same variables were measured in a group of seven deaf adults with hearing aids and/or implants who received no training (deaf control group) and in ten normal-hearing adults (normal-hearing control group).
Results. Statistical analyses show no significant effects of the training on the measured variables, but all participants scored higher at post-test on at least two tasks and reported beneficial qualitative elements.
Conclusion. These sessions allowed participants to share their experience of deafness with others facing the same difficulties, to gain self-confidence, and to gauge their level of speech-in-noise comprehension against the proposed tasks and the other participants' results.
Oral communication. Speech, Pathology
Efficient Finetuning for Dimensional Speech Emotion Recognition in the Age of Transformers
Aneesha Sampath, James Tavernor, Emily Mower Provost
Accurate speech emotion recognition is essential for developing human-facing systems. Recent advancements have included finetuning large, pretrained transformer models like Wav2Vec 2.0. However, the finetuning process requires substantial computational resources, including high-memory GPUs and significant processing time. As the demand for accurate emotion recognition continues to grow, efficient finetuning approaches are needed to reduce the computational burden. Our study focuses on dimensional emotion recognition, predicting attributes such as activation (calm to excited) and valence (negative to positive). We present various finetuning techniques, including full finetuning, partial finetuning of transformer layers, finetuning with mixed precision, partial finetuning with caching, and low-rank adaptation (LoRA) on the Wav2Vec 2.0 base model. We find that partial finetuning with mixed precision achieves performance comparable to full finetuning while increasing training speed by 67%. Caching intermediate representations further boosts efficiency, yielding an 88% speedup and a 71% reduction in learnable parameters. We recommend finetuning the final three transformer layers in mixed precision to balance performance and training efficiency, and adding intermediate representation caching for optimal speed with minimal performance trade-offs. These findings lower the barriers to finetuning speech emotion recognition systems, making accurate emotion recognition more accessible to a broader range of researchers and practitioners.
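The recommended recipe (final three transformer layers in mixed precision) is straightforward to reproduce. Below is a minimal sketch using the HuggingFace Wav2Vec2 base model; the mean pooling and the two-output activation/valence regression head are assumptions rather than the paper's exact setup, and the caching variant (storing the frozen lower layers' outputs) is omitted.

```python
# Sketch: freeze Wav2Vec 2.0 except its final three transformer layers and
# train them in float16 mixed precision. Pooling and head are assumptions.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").cuda()
for p in model.parameters():
    p.requires_grad = False
for layer in model.encoder.layers[-3:]:  # unfreeze the final three layers
    for p in layer.parameters():
        p.requires_grad = True

head = torch.nn.Linear(model.config.hidden_size, 2).cuda()  # activation, valence
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable + list(head.parameters()), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

def train_step(wave, target):            # wave: (B, samples), target: (B, 2)
    opt.zero_grad()
    with torch.autocast("cuda", dtype=torch.float16):
        hidden = model(wave).last_hidden_state.mean(dim=1)  # pool over time
        loss = torch.nn.functional.mse_loss(head(hidden), target)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
    return loss.item()
```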
Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis
Zhu Li, Yuqing Zhang, Xiyuan Gao
et al.
Sarcastic speech synthesis, which involves generating speech that effectively conveys sarcasm, is essential for enhancing natural interactions in applications such as entertainment and human-computer interaction. However, synthesizing sarcastic speech remains a challenge due to the nuanced prosody that characterizes sarcasm, as well as the limited availability of annotated sarcastic speech data. To address these challenges, this study introduces a novel approach that integrates feedback loss from a bi-modal sarcasm detection model into the TTS training process, enhancing the model's ability to capture and convey sarcasm. In addition, by leveraging transfer learning, a speech synthesis model pre-trained on read speech undergoes a two-stage fine-tuning process. First, it is fine-tuned on a diverse dataset encompassing various speech styles, including sarcastic speech. In the second stage, the model is further refined using a dataset focused specifically on sarcastic speech, enhancing its ability to generate sarcasm-aware speech. Objective and subjective evaluations demonstrate that our proposed methods improve the quality, naturalness, and sarcasm-awareness of synthesized speech.
A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References
Simon Dahl Jepsen, Mads Græsbøll Christensen, Jesper Rindom Jensen
This paper examines the implications of using the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) as both evaluation and training objective in supervised speech separation, when the training references contain noise, as is the case with the de facto benchmark WSJ0-2Mix. A derivation of the SI-SDR with noisy references reveals that noise limits the achievable SI-SDR, or leads to undesired noise in the separated outputs. To address this, a method is proposed to enhance references and augment the mixtures with WHAM!, aiming to train models that avoid learning noisy references. Two models trained on these enhanced datasets are evaluated with the non-intrusive NISQA.v2 metric. Results show reduced noise in separated speech but suggest that processing references may introduce artefacts, limiting overall quality gains. Negative correlation is found between SI-SDR and perceived noisiness across models on the WSJ0-2Mix and Libri2Mix test sets, underlining the conclusion from the derivation.
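For reference, the SI-SDR the paper analyses can be computed in a few lines. With a noisy reference s = s_clean + n, the projection target alpha*s inherits the reference noise, so even a perfect clean estimate cannot reach an unbounded score, which is the limit the derivation describes. A minimal NumPy version:

```python
# SI-SDR between an estimate and a (possibly noisy) reference.
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-Invariant SDR in dB between two 1-D signals of equal length."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference           # scaled projection onto the reference
    distortion = estimate - target
    return 10 * np.log10((target @ target) / (distortion @ distortion + eps))
```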
Does writing on a keyboard help students with a Specific Learning Disorder in Written Language improve their spelling?
Lucie Broc, Marie Caillet
Context: Specific Learning Disorder in Written Language (TSApp-LE, from the French Trouble Spécifique d'Apprentissage du Langage Écrit) is characterized by difficulties in learning to read (dyslexia) and in written expression (dysorthographia) that can affect academic performance. In this context, various educational accommodations can be put in place. However, when the disorder is too severe, a computer may be proposed. Using this aid as a means of compensation requires occupational therapy, in particular to learn to type on a keyboard and to master writing-assistance software.
Objective: This study aims to assess the spelling performance of students with TSApp-LE when they write on a computer keyboard. The main goal is to identify the benefits and limitations of adopting the computer as a compensatory tool.
Hypotheses: The spelling deficit of students with TSApp-LE should appear in both the handwritten and typed conditions. However, regular, long-term occupational therapy should increase students' autonomy in mastering compensatory tools and, consequently, their spelling accuracy in the typed modality.
Method: Eleven students aged 10 to 14 with TSApp-LE who use a computer daily as a means of compensation took part in this study. They produced two written narratives, one by hand (handwritten condition) and one on a computer keyboard (typed condition). Narrative length, spelling accuracy, and the types of spelling errors produced were recorded.
Results: There is no significant difference between the two writing modalities on the two variables tested. Overall, participants with TSApp-LE do not produce more spelling errors in the handwritten condition than in the typed condition. Moreover, they produce more morphological spelling errors (inflectional and derivational) than lexical and phonological errors, regardless of the production modality.
A qualitative analysis nuances these overall results: participants who received long-term, regular occupational therapy make more use of the aids at their disposal and produce fewer spelling errors in the typed condition than in the handwritten condition. However, computer proficiency is not tied to occupational therapy alone. It also appears to depend on the child's comorbidities, such as an attention deficit disorder, and on their motivation toward the writing task.
Conclusion: The use of the computer as a tool to compensate for spelling difficulties in TSApp-LE should be considered individually for each child with a specific learning disorder.
Oral communication. Speech, Pathology
Dynamic Behaviour of Connectionist Speech Recognition with Strong Latency Constraints
Giampiero Salvi
This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency.
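The latency constraint studied here amounts to a decoder that may look only a fixed number of frames ahead before committing output. A minimal sketch of that decoding pattern (not the paper's specific MLP/HMM hybrid) is a Viterbi pass that, after each new frame, backtracks from the current best partial path and emits the state `latency` frames behind the frontier:

```python
# Low-latency Viterbi sketch; the final `latency` frames are left
# uncommitted here for brevity.
import numpy as np

def viterbi_low_latency(log_probs, log_trans, latency):
    """log_probs: (T, S) frame log-posteriors; log_trans: (S, S) transitions."""
    T, S = log_probs.shape
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_probs[0]
    committed = []
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (prev state, next state)
        psi[t] = scores.argmax(axis=0)               # best predecessor per state
        delta[t] = scores.max(axis=0) + log_probs[t]
        if t >= latency:
            s = int(delta[t].argmax())               # best frontier state
            for k in range(t, t - latency, -1):      # backtrack `latency` steps
                s = psi[k, s]
            committed.append(s)                      # state at frame t - latency
    return committed
```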
VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance
Jiheum Yeom, Heeseung Kim, Jooyoung Choi
et al.
When applying parameter-efficient finetuning via LoRA to speaker-adaptive text-to-speech models, adaptation performance may decline compared to fully finetuned counterparts, especially for out-of-domain speakers. Here, we propose VoiceGuider, a parameter-efficient speaker-adaptive text-to-speech system reinforced with autoguidance to enhance speaker adaptation performance and reduce the gap against fully finetuned models. We carefully explore various ways of strengthening autoguidance, ultimately finding the optimal strategy. As a result, VoiceGuider shows robust adaptation performance, especially on extreme out-of-domain speech data. We provide audible samples on our demo page.
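Autoguidance itself reduces to a simple extrapolation at sampling time: the score of a deliberately weakened version of the model (for example an earlier or smaller checkpoint) is subtracted from the full model's score and the difference is amplified. A sketch, with hypothetical model callables:

```python
# Autoguidance sketch: s_bad + w * (s_good - s_bad). With w > 1 this
# amplifies the direction in which the strong model improves on the weak one.
# `model_good` and `model_bad` are hypothetical score-network callables.
def autoguided_score(model_good, model_bad, x_t, t, cond, w=2.0):
    s_good = model_good(x_t, t, cond)
    s_bad = model_bad(x_t, t, cond)
    return s_bad + w * (s_good - s_bad)
```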
Dreams and deliriums of «La Chiquita» and the «Mujer verdugo»: criminal women in the broadsides of Vanegas Arroyo
Mariana Masera
At the end of the nineteenth century, the press and the broadsides of Vanegas Arroyo reported on the crimes committed by two women: María Villa, «La Chiquita», and Guadalupe Martínez Bejarano, «la Mujer verdugo». In both cases, the women not only suffered criminal punishment but were also condemned for transgressing the social norms governing women's roles. This paper addresses the representations of transgressive women in both the press and chapbook literature (literatura de cordel), as well as the Mexican publisher's process of adopting and adapting news stories for the broadsides. The latter is examined through the analysis of elements fundamental to popular prints: the representations of women in the images, the poetic forms and genres used in the discourse, and the use of the female voice.
Oral communication. Speech, French literature - Italian literature - Spanish literature - Portuguese literature
The significance of the shoulder girdle in patients with myofunctional disorders
Piotr Koprowski
This article presents the anatomical relationships involved and the possible influence of lower body segments on the functioning of the orofacial tract. It describes movement-therapy-based work with an example speech-therapy patient in a physiotherapy practice, and illustrates how interdisciplinary collaboration can help optimize the patient's progress toward therapeutic goals.
Oral communication. Speech, Pathology
A Sidecar Separator Can Convert a Single-Talker Speech Recognition System to a Multi-Talker One
Lingwei Meng, Jiawen Kang, Mingyu Cui
et al.
Although automatic speech recognition (ASR) can perform well in common non-overlapping environments, sustaining performance in multi-talker overlapping speech recognition remains challenging. Recent research revealed that an ASR model's encoder captures different levels of information at different layers: the lower layers tend to carry more acoustic information, and the upper layers more linguistic information. This inspires us to develop a Sidecar separator to empower a well-trained ASR model for multi-talker scenarios by separating the mixed speech embedding between two suitable layers. We experimented with a wav2vec 2.0-based ASR model with a Sidecar mounted. By freezing the parameters of the original model and training only the Sidecar (8.7M parameters, 8.4% of the total), the proposed approach outperforms the previous state-of-the-art by a large margin on the 2-speaker mixed LibriMix dataset, reaching a word error rate (WER) of 10.36%, and obtains comparable results (7.56%) on the LibriSpeechMix dataset with limited training data.
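The core idea, a small module mounted between two frozen encoder layers that maps one mixed-speech embedding stream to one stream per speaker, can be sketched as below. The actual Sidecar is a temporal convolutional separator; this simplified mask-based stand-in only illustrates where it sits and what shapes it maps:

```python
# Simplified Sidecar-style stand-in: per-speaker masks over a mixed
# embedding taken between two frozen encoder layers.
import torch
import torch.nn as nn

class Sidecar(nn.Module):
    def __init__(self, dim=768, num_spk=2):
        super().__init__()
        self.num_spk = num_spk
        self.masks = nn.Sequential(
            nn.Conv1d(dim, dim * num_spk, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, h):                          # h: (B, T, dim) mixed embedding
        m = self.masks(h.transpose(1, 2))          # (B, dim*num_spk, T)
        m = m.view(h.size(0), self.num_spk, -1, h.size(1))  # (B, spk, dim, T)
        return h.unsqueeze(1) * m.transpose(2, 3)  # (B, spk, T, dim), one per speaker
```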
Integrating Uncertainty into Neural Network-based Speech Enhancement
Huajian Fang, Dennis Becker, Stefan Wermter
et al.
Supervised masking approaches in the time-frequency domain aim to employ deep neural networks to estimate a multiplicative mask to extract clean speech. This leads to a single estimate for each input without any guarantees or measures of reliability. In this paper, we study the benefits of modeling uncertainty in clean speech estimation. Prediction uncertainty is typically categorized into aleatoric uncertainty and epistemic uncertainty. The former refers to inherent randomness in data, while the latter describes uncertainty in the model parameters. In this work, we propose a framework to jointly model aleatoric and epistemic uncertainties in neural network-based speech enhancement. The proposed approach captures aleatoric uncertainty by estimating the statistical moments of the speech posterior distribution and explicitly incorporates the uncertainty estimate to further improve clean speech estimation. For epistemic uncertainty, we investigate two Bayesian deep learning approaches, Monte Carlo dropout and deep ensembles, to quantify the uncertainty of the neural network parameters. Our analyses show that the proposed framework promotes capturing practical and reliable uncertainty, while combining different sources of uncertainties yields more reliable predictive uncertainty estimates. Furthermore, we demonstrate the benefits of modeling uncertainty on speech enhancement performance by evaluating the framework on different datasets, exhibiting notable improvement over comparable models that fail to account for uncertainty.
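Of the two epistemic-uncertainty approaches, Monte Carlo dropout is the simpler to illustrate: dropout stays active at test time and the spread over several stochastic forward passes serves as the uncertainty estimate. A minimal sketch for a mask-estimating enhancement model, assuming the model contains dropout layers:

```python
# Monte Carlo dropout sketch for epistemic uncertainty in a mask estimator.
import torch

def mc_dropout_predict(model, noisy_spec, n_samples=20):
    """Returns the per-bin mean mask and predictive variance."""
    model.train()  # keeps dropout stochastic (note: also affects batch norm)
    with torch.no_grad():
        masks = torch.stack([model(noisy_spec) for _ in range(n_samples)])
    return masks.mean(dim=0), masks.var(dim=0)
```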
Perceive and predict: self-supervised speech representation based loss functions for speech enhancement
George Close, William Ravenscroft, Thomas Hain
et al.
Recent work in the domain of speech enhancement has explored the use of self-supervised speech representations to aid in the training of neural speech enhancement models. However, much of this work focuses on using the deepest or final outputs of self-supervised speech representation models, rather than the earlier feature encodings. The use of self-supervised representations in such a way is often not fully motivated. In this work, it is shown that the distance between the feature encodings of clean and noisy speech correlates strongly with psychoacoustically motivated measures of speech quality and intelligibility, as well as with human Mean Opinion Score (MOS) ratings. Experiments using this distance as a loss function are performed, and improved performance is demonstrated over STFT spectrogram-distance losses as well as other common loss functions from the speech enhancement literature, measured by objective metrics such as perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI).
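A sketch of this loss, assuming the HuggingFace Wav2Vec2 base model as the frozen representation model: the distance is taken between the early CNN feature encodings of the clean and enhanced waveforms, not the final transformer outputs. The choice of L1 distance here is an illustrative assumption.

```python
# Feature-encoder distance loss sketch; the SSL model acts as a frozen
# critic and is never trained.
import torch
from transformers import Wav2Vec2Model

ssl = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
for p in ssl.parameters():
    p.requires_grad = False

def feature_distance_loss(enhanced_wave, clean_wave):
    """Waveforms: (B, samples); gradients flow through enhanced_wave."""
    f_enh = ssl.feature_extractor(enhanced_wave)  # early CNN encodings (B, C, frames)
    with torch.no_grad():
        f_cln = ssl.feature_extractor(clean_wave)
    return torch.nn.functional.l1_loss(f_enh, f_cln)
```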
Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition
Qijie Shao, Pengcheng Guo, Jinghao Yan
et al.
Accents, as variations from standard pronunciation, pose significant challenges for speech recognition systems. Although joint automatic speech recognition (ASR) and accent recognition (AR) training has been proven effective in handling multi-accent scenarios, current multi-task ASR-AR approaches overlook the granularity differences between tasks. Fine-grained units capture pronunciation-related accent characteristics, while coarse-grained units are better for learning linguistic information. Moreover, explicit interaction between the two tasks can also provide complementary information and improve the performance of each, but it is rarely exploited by existing approaches. In this paper, we propose a novel Decoupling and Interacting Multi-task Network (DIMNet) for joint speech and accent recognition, which comprises a connectionist temporal classification (CTC) branch, an AR branch, an ASR branch, and a bottom feature encoder. Specifically, AR and ASR are first decoupled by separate branches and two-granularity modeling units to learn task-specific representations. The AR branch is from our previously proposed linguistic-acoustic bimodal AR model, and the ASR branch is an encoder-decoder based Conformer model. Then, for the task interaction, the CTC branch provides aligned text for the AR task, while accent embeddings extracted from our AR model are incorporated into the ASR branch's encoder and decoder. Finally, during ASR inference, a cross-granularity rescoring method is introduced to fuse the complementary information from the CTC and attention decoder after the decoupling. Our experiments on English and Chinese datasets demonstrate the effectiveness of the proposed model, which achieves relative AR accuracy improvements of 21.45%/28.53% and relative ASR error rate reductions of 32.33%/14.55% over a published standard baseline on the two datasets, respectively.
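A heavily simplified skeleton of the decoupling-and-interaction pattern is below: a shared bottom encoder feeds separate CTC, AR, and ASR branches, and the AR branch's accent embedding conditions the ASR branch. Every module here is a schematic stand-in (the paper's ASR branch is an encoder-decoder Conformer and its AR branch a linguistic-acoustic bimodal model), so treat this only as a map of the information flow:

```python
# Schematic DIMNet-style skeleton; all modules are illustrative stand-ins.
import torch
import torch.nn as nn

class DIMNetSkeleton(nn.Module):
    def __init__(self, dim=256, vocab=5000, n_accents=8):
        super().__init__()
        self.bottom = nn.GRU(80, dim, batch_first=True)  # shared bottom encoder
        self.ctc_head = nn.Linear(dim, vocab)            # fine-grained units
        self.ar_head = nn.Linear(dim, n_accents)         # accent recognition
        self.accent_proj = nn.Linear(n_accents, dim)     # interaction path
        self.asr_head = nn.Linear(dim, vocab)            # coarse-grained units

    def forward(self, feats):                            # feats: (B, T, 80)
        h, _ = self.bottom(feats)
        ctc_logits = self.ctc_head(h)
        accent_logits = self.ar_head(h.mean(dim=1))      # utterance-level AR
        accent_emb = self.accent_proj(accent_logits.softmax(-1))
        asr_logits = self.asr_head(h + accent_emb.unsqueeze(1))  # condition ASR
        return ctc_logits, accent_logits, asr_logits
```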
Fine-tuning Strategies for Faster Inference using Speech Self-Supervised Models: A Comparative Study
Salah Zaiem, Robin Algayres, Titouan Parcollet
et al.
Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance may come at the cost of longer inference times. This article explores different approaches that may be deployed during finetuning to reduce the computation needed in the SSL encoder, leading to faster inference. We adapt a number of existing techniques to common ASR settings and benchmark them, reporting performance drops and gains in inference time. Interestingly, we find that given enough downstream data, a simple downsampling of the input sequences outperforms the other methods, with both low performance drops and high computational savings, reducing computation by 61.3% with a WER increase of only 0.81 points. Finally, we analyze the robustness of the comparison to changes in dataset conditions, revealing sensitivity to dataset size.
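The winning technique is easy to picture: shrink the input sequence before it reaches the SSL encoder so that every transformer layer processes fewer frames. A sketch using torchaudio resampling, assuming an HF-style SSL model that returns `last_hidden_state`; the paper's exact downsampling factor and mechanism may differ:

```python
# Input-sequence downsampling sketch: 16 kHz -> 8 kHz roughly halves the
# number of frames the SSL encoder must process.
import torchaudio

resample = torchaudio.transforms.Resample(orig_freq=16000, new_freq=8000)

def encode_downsampled(ssl_model, wave_16k):
    """wave_16k: (B, samples) raw waveform batch."""
    return ssl_model(resample(wave_16k)).last_hidden_state
```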
Self-stigma in stuttering - a preliminary study using the Polish adaptation of the Stuttering Perception Scale
Natalia Bauszek, Katarzyna Węsierska
The article presents the results of a pilot study using the Polish version of the Stuttering Perception Scale. The phenomenon of stigmatisation present in stuttering has a negative impact on the quality of life of those who stutter. The purpose of the study outlined below was to identify the opinions of those who are known as 'double experts' on the phenomenon of stigmatisation and self-stigmatisation in stuttering, as well as on the accuracy of the Polish adaptation of the scale and its potential applicability in the therapy of adults who stutter. This qualitative study involved 7 participants: speech therapists, psychologists, and/or leaders of the self-help movement for people who stutter, who have personal experience of stuttering. The research problem took the form of the question: how do respondents assess the reliability of the translation of the scale and its usefulness in the process of speech therapy intervention? The article cites selected opinions of the respondents and indicates the practical implications of the study.
Medicine, Oral communication. Speech
Verifying student identity in oral assessments with deep speaker
Jake Renzella, Andrew Cain, Jean-Guy Schneider
Contract cheating, a form of academic misconduct in which students outsource assessment activities to third parties, is a topic of concern among educators. As similarity-detection systems are ineffective at detecting contract cheating, some institutions have turned to intensely criticised proctoring systems; however, student and educator bodies report high costs and privacy concerns. Oral assessment is an alternative assessment approach that develops valuable interpersonal and communication skills in graduates and can naturally help detect and deter cheating. However, oral assessment is typically time-consuming, and in larger courses it is challenging to validate respondents' identity. Advancements in machine learning can scale time-consuming tasks that previously required prohibitive educator effort. One such system, Deep Speaker, is a speaker identification and verification system that can verify with high accuracy whether two audio samples contain speech from the same person. This paper presents an innovative tool that integrates an online oral assessment tool, Real Talk, with Deep Speaker. The proposed system facilitates scalable student-tutor discussions while providing longitudinal student identity validation with minimal cost and impact for institutions, and addresses student privacy concerns. We evaluated the system and showed that student audio responses collected via oral discussion tools are suitable for verification. We then discuss the impact our system may have when applied in higher education. We posit that institutions can use such approaches to detect cases of contract cheating, enhance learning outcomes, and pave the way for more student-friendly assessment and discussion models in online education.
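Verification systems of this kind typically reduce to comparing fixed-size speaker embeddings. A generic sketch, where `embed` is a hypothetical stand-in for a Deep Speaker-style embedding function and the threshold would be tuned on enrolment data:

```python
# Generic embedding-based speaker verification sketch; `embed` is a
# hypothetical callable mapping an utterance to a fixed speaker embedding.
import numpy as np

def same_speaker(embed, audio_a, audio_b, threshold=0.7):
    ea, eb = embed(audio_a), embed(audio_b)
    cosine = np.dot(ea, eb) / (np.linalg.norm(ea) * np.linalg.norm(eb))
    return cosine >= threshold
```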
Electronic computers. Computer science
A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition
Qiu-Shi Zhu, Jie Zhang, Zi-Qiang Zhang
et al.
Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations in the context of automatic speech recognition (ASR). It has been shown that wav2vec2.0 is fairly robust to domain shift, while its noise robustness remains unclear. In this work, we therefore first analyze the noise robustness of wav2vec2.0 via experiments. We observe that wav2vec2.0 pre-trained on noisy data can obtain good representations and thus improve ASR performance on the noisy test set, but this brings a performance degradation on the clean test set. To avoid this issue, we propose an enhanced wav2vec2.0 model. Specifically, the noisy speech and its corresponding clean version are fed into the same feature encoder, where the clean speech provides training targets for the model. Experimental results reveal that the proposed method not only improves ASR performance on the noisy test set, surpassing the original wav2vec2.0, but also incurs only a tiny performance decrease on the clean test set. In addition, the effectiveness of the proposed method is demonstrated under different types of noise conditions.
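The clean-teacher idea can be sketched compactly: both versions of an utterance pass through the same feature encoder, and the clean encodings act as regression targets for the noisy ones. This simplified consistency loss illustrates the pattern only; it is not the paper's full wav2vec2.0 pre-training objective.

```python
# Simplified clean-teacher consistency loss sketch.
import torch

def consistency_loss(feature_encoder, noisy_wave, clean_wave):
    z_noisy = feature_encoder(noisy_wave)       # (B, C, frames)
    with torch.no_grad():                       # no gradient through targets
        z_clean = feature_encoder(clean_wave)
    return torch.nn.functional.mse_loss(z_noisy, z_clean)
```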
A Speech Intelligibility Enhancement Model based on Canonical Correlation and Deep Learning for Hearing-Assistive Technologies
Tassadaq Hussain, Muhammad Diyan, Mandar Gogate
et al.
Current deep learning (DL) based approaches to speech intelligibility enhancement in noisy environments are generally trained to minimise the distance between clean and enhanced speech features. These often result in improved speech quality however they suffer from a lack of generalisation and may not deliver the required speech intelligibility in everyday noisy situations. In an attempt to address these challenges, researchers have explored intelligibility-oriented (I-O) loss functions to train DL approaches for robust speech enhancement (SE). In this paper, we formulate a novel canonical correlation-based I-O loss function to more effectively train DL algorithms. Specifically, we present a fully convolutional SE model that uses a modified canonical-correlation based short-time objective intelligibility (CC-STOI) metric as a training cost function. To the best of our knowledge, this is the first work that exploits the integration of canonical correlation in an I-O based loss function for SE. Comparative experimental results demonstrate that our proposed CC-STOI based SE framework outperforms DL models trained with conventional STOI and distance-based loss functions, in terms of both standard objective and subjective evaluation measures when dealing with unseen speakers and noises.