Results for "Oral communication. Speech"

Showing 20 of ~156,188 results · from DOAJ, arXiv, Semantic Scholar

arXiv Open Access 2026
Asymmetric Encoder-Decoder Based on Time-Frequency Correlation for Speech Separation

Ui-Hyeop Shin, Hyung-Min Park

Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-split architectures, where speaker disentanglement is deferred to the final stage, creating an information bottleneck and weakening discriminability under adverse conditions. To address this issue, we propose SR-CorrNet, an asymmetric encoder-decoder framework that introduces the separation-reconstruction (SepRe) strategy into a TF dual-path backbone. The encoder performs coarse separation from mixture observations, while the weight-shared decoder progressively reconstructs speaker-discriminative features with cross-speaker interaction, enabling stage-wise refinement. To complement this architecture, we formulate speech separation as a structured correlation-to-filter problem: spatio-spectro-temporal correlations computed from the observations are used as input features, and the corresponding deep filters are estimated to recover target signals. We further incorporate an attractor-based dynamic split module to adapt the number of output streams to the actual speaker configuration. Experimental results on WSJ0-2/3/4/5Mix, WHAMR!, and LibriCSS demonstrate consistent improvements across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings, highlighting the effectiveness of TF-domain SepRe with correlation-based filter estimation for speech separation.

en eess.AS, cs.SD
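The correlation-to-filter formulation lends itself to a compact illustration. The sketch below shows generic multi-frame deep filtering in the TF domain, estimated complex taps applied across neighboring frames of the mixture STFT; it is not SR-CorrNet's actual correlation features or estimator, and the array shapes and tap order are assumptions.

```python
import numpy as np

def deep_filter(stft_mix, taps):
    """Apply per-TF-bin multi-frame filters to a mixture STFT.

    stft_mix: complex array (frames, bins) -- observed mixture
    taps:     complex array (frames, bins, order) -- filter taps that a
              network would estimate from correlation features
    """
    frames, _, order = taps.shape
    out = np.zeros_like(stft_mix)
    for k in range(order):
        shifted = np.zeros_like(stft_mix)
        shifted[k:] = stft_mix[:frames - k]   # mixture delayed by k frames
        out += taps[:, :, k] * shifted
    return out

# Toy usage: 100 frames, 257 bins, 3 taps per bin
mix = np.random.randn(100, 257) + 1j * np.random.randn(100, 257)
est = deep_filter(mix, np.random.randn(100, 257, 3) + 0j)
```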
S2 Open Access 2023
Deep Learning in Sign Language Recognition: A Hybrid Approach for the Recognition of Static and Dynamic Signs

Ahmed Mateen Buttar, U. Ahmad, Abdu H. Gumaei et al.

A speech impairment limits a person's capacity for oral and auditory communication. A real-time sign language detector would greatly improve communication between the deaf and the general public. This work proposes a deep learning-based algorithm that can identify and detect words from a person's gestures. There have been many studies on this topic, but the development of static and dynamic sign language recognition models is still a challenging area of research. The difficulty is in obtaining an appropriate model that addresses the challenges of continuous signs that are independent of the signer. Different signers' speeds, durations, and many other factors make it challenging to create a model with high accuracy and continuity. For the accurate and effective recognition of signs, this study uses two different deep learning-based approaches. We create a real-time American Sign Language detector using the skeleton model, which reliably categorizes continuous signs in sign language in most cases using a deep learning approach. In the second deep learning approach, we create a sign language detector for static signs using YOLOv6. This application is very helpful for sign language users and learners to practice sign language in real time. After training both algorithms separately for static and continuous signs, we create a single algorithm using a hybrid approach. The proposed model, consisting of LSTM with MediaPipe holistic landmarks, achieves around 92% accuracy for different continuous signs, and the YOLOv6 model achieves 96% accuracy over different static signs. Throughout this study, we determine which approach is best suited to sequential movement detection and which to the classification of distinct signs, and both show remarkable accuracy in real time.

77 citations · en
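A minimal sketch of the landmark-plus-LSTM half of this hybrid follows, pairing MediaPipe Holistic keypoints with a small PyTorch LSTM classifier; the feature layout (pose plus both hands), hidden size, and class count are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
import mediapipe as mp
import torch.nn as nn

def frame_keypoints(results):
    """Flatten MediaPipe Holistic landmarks for one frame (zeros if absent)."""
    def flat(lms, n):
        if lms is None:
            return np.zeros(n * 3)
        return np.array([[p.x, p.y, p.z] for p in lms.landmark]).flatten()
    return np.concatenate([flat(results.pose_landmarks, 33),
                           flat(results.left_hand_landmarks, 21),
                           flat(results.right_hand_landmarks, 21)])  # 225 dims

class SignLSTM(nn.Module):
    """Toy sequence classifier over per-frame keypoint vectors."""
    def __init__(self, n_feats=225, n_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, 128, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                 # x: (batch, frames, n_feats)
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])

# Per video frame: results = mp.solutions.holistic.Holistic().process(rgb_frame)
```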
DOAJ Open Access 2025
Fake News are Easier, More Emotional and Less Complex: Some Evidence from Spanish

Reali Florencia

Characterizing false information based on linguistic analysis is important to understand the factors that affect the proliferation of fake news in the media. Previous work has identified some linguistic regularities that suggest a trend towards decreased complexity and increased polarization and sentiment in false information. This study is aimed at identifying linguistic differences between real and fake news using a corpus of annotated media news in Spanish via the automatic analysis of linguistic cues using dictionaries of lexical norms. We focus on lexical aspects of complexity, familiarity and sentiment. Consistent with previous results, we found that fake news are associated with lower cognitive loads, reflected by reduced sentence complexity, and increased lexical familiarity and imaginability. Moreover, also consistent with previous results, the analysis revealed that fake news are associated with more polarized emotional content.

Oral communication. Speech, Psychology
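The dictionary-based analysis described above can be approximated in a few lines. A minimal sketch, assuming toy Spanish norm dictionaries as stand-ins for real published lexical norms:

```python
import re

# Hypothetical lexical-norm dictionaries: word -> rating
# (e.g., familiarity and imaginability norms for Spanish).
FAMILIARITY = {"casa": 6.8, "gobierno": 5.9, "ontologia": 2.1}
IMAGINABILITY = {"casa": 6.5, "gobierno": 4.0, "ontologia": 1.8}

def mean_norm(text, norms):
    """Average dictionary rating over the words covered by the norms."""
    words = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    rated = [norms[w] for w in words if w in norms]
    return sum(rated) / len(rated) if rated else None

print(mean_norm("El gobierno anuncia una casa nueva", FAMILIARITY))
```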
arXiv Open Access 2025
Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition

Bingshen Mu, Kun Wei, Pengcheng Guo et al.

Despite improvements in automatic speech recognition, performance drops with accented speech. Generative error correction (GER) leverages the linguistic knowledge of large language models (LLMs), outperforming typical language model methods. However, it lacks specificity in accented speech scenarios. Accents represent deviations from standard pronunciation, making multi-granularity pronunciation and semantic information essential for accented speech recognition. Moreover, accents exhibit considerable diversity, with each accent possessing distinct characteristics. In this study, we leverage GER to improve transcription accuracy by addressing these two primary characteristics. We propose the multi-modal GER, which integrates pronunciation information from the speech modality, and the multi-granularity GER, which incorporates fine-grained phoneme-level pronunciation information. These methods enable the LLM to utilize the pronunciation information of accented speech and the semantic information from word-level hypotheses for accurate transcription predictions through low-rank adaptation (LoRA) fine-tuning. We employ a three-stage strategy to train separate multi-modal GER models for each accent to obtain mono-accent LoRA experts. By adopting our proposed HDMoLE method, which incorporates hierarchical routing and dynamic thresholds within the mixture of LoRA experts, we effectively merge mono-accent LoRA experts within a single multi-modal GER to overcome accent diversity challenges. Furthermore, multi-granularity GER leverages N-best word-level and phoneme-level hypotheses from the HDMoLE model to predict final transcriptions. Experiments on a multi-accent English dataset show that our methods reduce word error rate by 67.35% compared to the baseline vanilla Whisper-large-v3 model.

en cs.SD, eess.AS
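The building block behind this approach, a frozen projection augmented with a softmax-routed mixture of LoRA experts, can be sketched compactly. HDMoLE's hierarchical routing and dynamic thresholds are not reproduced; the dimensions, rank, and expert count below are assumptions.

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> B(A(x)) * (alpha / r)."""
    def __init__(self, dim, r=8, alpha=16):
        super().__init__()
        self.A = nn.Linear(dim, r, bias=False)
        self.B = nn.Linear(r, dim, bias=False)
        nn.init.zeros_(self.B.weight)          # adapters start as no-ops
        self.scale = alpha / r

    def forward(self, x):
        return self.B(self.A(x)) * self.scale

class MoLoRALinear(nn.Module):
    """Frozen base projection plus a softmax-routed mixture of LoRA experts."""
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():       # base weights stay frozen
            p.requires_grad_(False)
        self.experts = nn.ModuleList(LoRAExpert(dim) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):                      # x: (..., dim)
        gates = torch.softmax(self.router(x), dim=-1)           # (..., E)
        delta = torch.stack([e(x) for e in self.experts], -1)   # (..., dim, E)
        return self.base(x) + (delta * gates.unsqueeze(-2)).sum(-1)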
DOAJ Open Access 2024
Smoothing Out Smart Tech’s Rough Edges: Imperfect Automation and the Human Fix

Christian Katzenbach, Christian Pentzold, Paloma Viejo Otero

In this article, we take issue with an idea of autonomous and efficient automation that is upheld through the paradoxical conjunction of a flawed vision of the technological fix and the under-acknowledged human work required to fill in the gaps between machines and users. Our argument is based on two case studies that sit at opposite tails of automation processes: the front end of self-service checkouts and the back end of content moderation. This juxtaposition allows us to surface three themes on how the hype around automation is enabled by human interventions: the ad-hoc sociality in situated practices of automation, the capture of mundane expertise, and the inverted assistance of humans to machines. We argue that this human fix is not a temporary repair of malfunction, but a permanent and constitutive feature of automated systems.

Technology (General), Oral communication. Speech
DOAJ Open Access 2024
A Comparative Study of Chinese-Thai Speech Act of Request: A Case Study of Campus Scenarios

Noramon Sapermsup

A request is a common speech act in any language and is essential for successful communication. This study compares the Chinese and Thai speech act of request in campus scenarios. Using the oral discourse completion test method, it examines the characteristics of request speech acts among 20 native Chinese speakers and 20 native Thai speakers, focusing on the use of the initial action language in the request speech act and the use of internal and external modifiers in the central action language. The results show that although the two languages share some similarities in how the speech act of request is performed, they differ in many respects, each having its own characteristics. This study argues that these differences are closely related to Chinese and Thai socio-cultural characteristics.

Chinese language and literature
arXiv Open Access 2024
Leveraging Content and Acoustic Representations for Speech Emotion Recognition

Soumya Dutta, Sriram Ganapathy

Speech emotion recognition (SER), the task of identifying the expression of emotion from spoken content, is challenging due to the difficulty in extracting representations that capture emotional attributes from speech. The scarcity of labeled datasets further complicates the challenge, as large models are prone to over-fitting. In this paper, we propose CARE (Content and Acoustic Representations of Emotions), where we design a dual encoding scheme which emphasizes semantic and acoustic factors of speech. While the semantic encoder is trained using distillation from utterance-level text representations, the acoustic encoder is trained to predict low-level frame-wise features of the speech signal. The proposed dual encoding scheme is a base-sized model trained only on unsupervised raw speech. With a simple light-weight classification model trained on the downstream task, we show that the CARE embeddings provide effective emotion recognition on a variety of datasets. We compare the proposal with several other self-supervised models as well as recent large language model-based approaches. In these evaluations, the proposed CARE is shown to be the best-performing model based on average performance across 8 diverse datasets. We also conduct several ablation studies to analyze the importance of various design choices.

en eess.AS
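The semantic branch's distillation objective can be made concrete. Below is a generic cosine-distance distillation loss between utterance-level speech and text embeddings; CARE's actual objective and projection details may differ.

```python
import torch
import torch.nn.functional as F

def semantic_distillation_loss(speech_emb, text_emb):
    """Pull utterance-level speech embeddings toward frozen text embeddings.

    speech_emb: (batch, dim) from the semantic speech encoder
    text_emb:   (batch, dim) from a frozen text model of the transcript
    """
    return 1.0 - F.cosine_similarity(speech_emb, text_emb, dim=-1).mean()
```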
arXiv Open Access 2024
ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis

Haobin Tang, Xulong Zhang, Ning Cheng et al.

Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of the denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the challenge of a scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.

en eess.AS, cs.SD
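How the two emotion scales might be fused before conditioning the DDPM denoiser is easiest to see in code. The broadcast-and-add fusion below is one plausible reading, not the paper's verified conditioning scheme.

```python
import torch

def combine_emotion_conditions(utt_emb, frame_emb):
    """Fuse utterance-level (SER) and frame-level (SED) emotion embeddings.

    utt_emb:   (batch, dim)          -- one embedding per utterance
    frame_emb: (batch, frames, dim)  -- one embedding per frame
    Returns a per-frame conditioning sequence for the DDPM denoiser.
    """
    return frame_emb + utt_emb.unsqueeze(1)   # broadcast over frames

cond = combine_emotion_conditions(torch.randn(2, 64), torch.randn(2, 120, 64))
```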
DOAJ Open Access 2023
Attitude, Practice and Preparedness of Dental Practitioners for Providing Oral Care to Patients with Speech and Hearing Impairment

Khansa Rafi, Syed Masood-ul-Hasan, Saba Tabassum

Objective: To explore the attitudes, practices and preparedness of dental practitioners for providing oral care to patients with speech and hearing impairment. Methodology: This cross-sectional study was conducted on a sample of 467 dental practitioners between November 2021 and April 2022. A survey questionnaire was administered to practicing dental practitioners in Pakistan through online Google Forms and in person. The survey included questions regarding dentists' demographic information, their attitudes and concerns, practices, and views regarding training needs concerning the provision of oral care to patients with speech and hearing impairment. Data were analyzed with STATA version 16. Descriptive statistics with frequencies were computed, and χ² tests were used to assess associations between variables. Results: The findings suggest that a significant proportion of participants, specifically 88.22%, were unfamiliar with sign language and relied on the assistance of an attendant (54.82%) for communication purposes. According to their accounts, surgical extraction (54.4%) was perceived as the most challenging treatment to execute, whereas scaling and polishing (73.9%) was considered the easiest treatment to administer in this particular population. When it came to providing care for individuals with speech and hearing impairment, several barriers were identified. These included communication obstacles (65.3%), patient cooperation (63.2%), concerns about medical emergencies (29.1%), inadequately trained staff (26.3%), and limited experience (25.5%). The participants broadly agreed that there is a pressing need for enhanced undergraduate training (43.68%). Additionally, they suggested that it should be obligatory for students to treat a specific number of patients with speech and hearing impairment as part of their undergraduate training (49.25%). Conclusion: There is a need to address the identified barriers and create a better understanding between dentists and individuals with speech and hearing impairments. Basic dental signs should be learned and mastered to engage patients more effectively via non-verbal communication.
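For reference, the χ² association tests mentioned in the methodology take this form in Python; the 2×2 table below is entirely hypothetical.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = familiar with sign language (yes/no),
# columns = reports communication as a barrier (yes/no).
table = [[20, 35],
         [285, 127]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```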

DOAJ Open Access 2023
Étude des phénomènes de liaison et d’enchaînement chez des locuteurs adultes bègues et normo-fluents

Eléonor Burkhardt, Samantha Ruvoletto

Liaison (les [le] + ours [uʁs] → les[z]ours [le.zuʁs]) and enchaînement (une [yn] + autruche [ot.ʁyʃ] → u[n]autruche [y.not.ʁyʃ]) are two frequent phonological phenomena in French, both of which trigger resyllabification. They occur between two words: the first ends in a consonant and the second begins with a vowel. The enchaînement consonant (CE, the [n] in une autruche [y.not.ʁyʃ]) is pronounced whether the word is isolated or combined with a second word. The liaison consonant (CL, the [z] in les[z]ours [le.zuʁs]), by contrast, is produced only when the two words are combined. According to Durand (1936), CLs are produced with more tension than CEs. Stuttering, in turn, is characterized by excessive tension whose extent is linked to its severity. We therefore hypothesize that adults who stutter produce more disfluencies on liaisons than on enchaînements, and that their number increases with stuttering severity. We further suggest that this tension may be expressed through the phonatory duration of the consonants, which would be longer for liaisons than for enchaînements. To test these hypotheses, we compared the productions of 10 French-speaking adults who stutter and 10 fluent French-speaking adults on a language task. Our results show that CLs are stuttered markedly more often than CEs, all the more so among participants with more severe stuttering. Likewise, the most frequently stuttered resyllabified phonemes are the most tense, and sometimes the longest. However, liaisons are not longer than enchaînements; this appears to depend more on the phonemes involved. Our results suggest a link between the degree of tension of liaisons and enchaînements, the tension level of the phonemes, and stuttering severity. It would therefore be worthwhile to pay particular attention to CLs in speech-language therapy.

Oral communication. Speech, Pathology
DOAJ Open Access 2023
The identification process of students with speech-language impairment in South Carolina

Ștefania FILIP

This presentation focuses on the process for the initial evaluation of students with speech-language impairment. An initial evaluation involves the use of a variety of assessment tools and strategies to gather relevant functional, developmental, and academic information to assist in determining whether the child is eligible for special education services. Eligibility rests on a three-pronged question: (1) whether the student is a child with a disability, (2) whether, by reason thereof, there is an educational impact, and (3) whether the student requires direct services. Evaluations must also determine the present levels of academic achievement and functional performance. This shifts the focus of the initial evaluation from the determination of eligibility for services to the determination of what the child needs to enable him/her to learn effectively and to participate and progress in the general education curriculum.

Philology. Linguistics, Oral communication. Speech
arXiv Open Access 2023
Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling

Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes the incorporation of ultrasound tongue images to improve the performance of lip-based AV-SE systems further. To address the challenge of acquiring ultrasound tongue images during inference, we first propose to employ knowledge distillation during training to investigate the feasibility of leveraging tongue-related information without directly inputting ultrasound tongue images. Specifically, we guide an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model, thus transferring tongue-related knowledge. To better model the alignment between the lip and tongue modalities, we further propose the introduction of a lip-tongue key-value memory network into the AV-SE model. This network enables the retrieval of tongue features based on readily available lip features, thereby assisting the subsequent speech enhancement task. Experimental results demonstrate that both methods significantly improve the quality and intelligibility of the enhanced speech compared to traditional lip-based AV-SE baselines. Moreover, both proposed methods exhibit strong generalization performance on unseen speakers and in the presence of unseen noises. Furthermore, phone error rate (PER) analysis of automatic speech recognition (ASR) reveals that while all phonemes benefit from introducing ultrasound tongue images, palatal and velar consonants benefit most.

en eess.AS, cs.SD
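The lip-tongue key-value memory reduces to a soft-attention lookup: lip features act as queries over key slots and return tongue-like values. The sketch below uses learned parameter slots; the slot count, dimensionality, and parameterization are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class LipTongueMemory(nn.Module):
    """Retrieve pseudo tongue features from a memory addressed by lip features.

    Keys correspond to lip embeddings and values to paired tongue embeddings;
    at inference only lip features are available and serve as queries.
    """
    def __init__(self, n_slots=64, dim=128):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_slots, dim))    # lip slots
        self.values = nn.Parameter(torch.randn(n_slots, dim))  # tongue slots

    def forward(self, lip_feats):              # (batch, frames, dim)
        attn = torch.softmax(lip_feats @ self.keys.T, dim=-1)  # slot weights
        return attn @ self.values              # (batch, frames, dim)

mem = LipTongueMemory()
tongue_like = mem(torch.randn(2, 50, 128))
```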
arXiv Open Access 2023
DWFormer: Dynamic Window transFormer for Speech Emotion Recognition

Shuaiqi Chen, Xiaofen Xing, Weibin Zhang et al.

Speech emotion recognition is crucial to human-computer interaction. The temporal regions that represent different emotions are scattered locally across different parts of the speech. Moreover, the temporal scales of important information may vary over a large range within and across speech segments. Although transformer-based models have made progress in this field, the existing models cannot precisely locate important regions at different temporal scales. To address the issue, we propose Dynamic Window transFormer (DWFormer), a new architecture that leverages temporal importance by dynamically splitting samples into windows. A self-attention mechanism is applied within windows to capture temporally important information locally in a fine-grained way. Cross-window information interaction is also taken into account for global communication. DWFormer is evaluated on both the IEMOCAP and the MELD datasets. Experimental results show that the proposed model achieves better performance than the previous state-of-the-art methods.

en cs.SD, cs.CL
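Window-restricted self-attention, the core operation here, reduces to a reshape around a standard attention call. The sketch below uses fixed windows for brevity, whereas DWFormer chooses the window boundaries dynamically.

```python
import torch
import torch.nn as nn

def windowed_self_attention(x, window, mha):
    """Run self-attention independently inside fixed-size windows.

    x: (batch, time, dim); attention is confined to each local window.
    """
    b, t, d = x.shape
    pad = (-t) % window                        # pad time axis to a multiple
    x = nn.functional.pad(x, (0, 0, 0, pad))
    xw = x.reshape(b * (x.shape[1] // window), window, d)
    out, _ = mha(xw, xw, xw)                   # local attention per window
    return out.reshape(b, -1, d)[:, :t]

mha = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
y = windowed_self_attention(torch.randn(2, 100, 64), window=16, mha=mha)
```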
arXiv Open Access 2023
MFCCGAN: A Novel MFCC-Based Speech Synthesizer Using Adversarial Learning

Mohammad Reza Hasanabadi, Majid Behdad, Davood Gharavian

In this paper, we introduce MFCCGAN, a novel speech synthesizer based on adversarial learning that takes MFCCs as input and generates raw speech waveforms. Benefiting from the capabilities of the GAN model, it produces speech with higher intelligibility than WORLD, a rule-based MFCC-based speech synthesizer. We evaluated the model with a popular intrusive objective speech intelligibility measure (STOI) and a quality measure (the NISQA score). Experimental results show that our proposed system outperforms Librosa MFCC inversion (by an increase of about 26% to 53% in STOI and 16% to 78% in NISQA score) and achieves a rise of about 10% in intelligibility and about 4% in naturalness compared with the conventional rule-based WORLD vocoder used in the CycleGAN-VC family; moreover, WORLD needs additional data such as F0. Finally, using a STOI-based perceptual loss in the discriminators could further improve quality. WebMUSHRA-based subjective tests also confirm the quality of the proposed approach.

en cs.SD, cs.AI
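The Librosa MFCC-inversion baseline and the STOI evaluation can be reproduced in a few lines. A sketch assuming the pystoi package and one of librosa's bundled speech clips; the original signal serves as the intrusive reference:

```python
import librosa
from pystoi import stoi            # pip install pystoi

# "libri1" is a LibriSpeech clip bundled with recent librosa releases
y, sr = librosa.load(librosa.ex("libri1"), sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Rule-based baseline: invert MFCCs back to a waveform (no neural model)
y_inv = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)

n = min(len(y), len(y_inv))
print("STOI:", stoi(y[:n], y_inv[:n], sr))   # intrusive intelligibility score
```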
arXiv Open Access 2022
Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech

Jaejin Cho, Jesús Villalba, Laureano Moro-Velazquez et al.

In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representations, e.g., wav2vec, can be used as utterance-level representations with pooling, but the models are usually large. There are also SSL techniques to learn utterance-level representation. One of the most successful is a contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (anchor). However, without labels, this does not ensure that all the negative samples belong to classes different from the anchor class. This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings. We adapted DIstillation with NO labels (DINO) from computer vision to speech. Unlike contrastive methods, DINO does not require negative sampling. We compared DINO to x-vector trained in a supervised manner. When transferred to downstream tasks (speaker verification, speech emotion recognition (SER), and Alzheimer's disease detection), DINO outperformed x-vector. We studied the influence of several aspects during transfer learning, such as dividing the fine-tuning process into steps, chunk lengths, and augmentation. During fine-tuning, tuning the last affine layers first and then the whole network surpassed fine-tuning all at once. Using shorter chunk lengths, although they generate more diverse inputs, did not necessarily improve performance, implying that speech segments of at least a certain length are required for good performance, depending on the application. Augmentation was helpful in SER.

en eess.AS, cs.AI
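The non-contrastive objective at the heart of DINO is compact enough to show directly: cross-entropy between a centered, sharpened teacher distribution and the student's, with no negative pairs. The temperatures below are typical values; the paper's speech-specific adaptations are not reproduced.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """DINO objective between two augmented views of the same utterance.

    student_out, teacher_out: (batch, dim) projection-head outputs;
    center: (dim,) running mean that prevents collapse to a constant.
    """
    t = F.softmax((teacher_out - center) / tau_t, dim=-1).detach()
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

# Teacher weights and the center follow exponential moving averages, e.g.:
# center = m * center + (1 - m) * teacher_out.mean(dim=0)
```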
arXiv Open Access 2022
Controllable speech synthesis by learning discrete phoneme-level prosodic representations

Nikolaos Ellinas, Myrsini Christidou, Alexandra Vioni et al.

In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset. These features are fed as an input sequence of prosodic labels to a prosody encoder module which augments an autoregressive attention-based text-to-speech model. We utilize various methods in order to improve prosodic control range and coverage, such as augmentation, F0 normalization, balanced clustering for duration and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. Instead of relying on reference utterances for inference, we introduce a prior prosody encoder which learns the style of each speaker and enables speech synthesis without the requirement of reference audio. We also fine-tune the multispeaker model to unseen speakers with limited amounts of data, as a realistic application scenario and show that the prosody control capabilities are maintained, verifying that the speaker-independent prosodic clustering is effective. Experimental results show that the model has high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.

en cs.SD, cs.CL
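The unsupervised prosodic clustering step can be illustrated with scikit-learn. The feature values below are invented, and the paper's refinements (balanced duration clustering, F0 normalization, speaker-independent clustering) are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-phoneme prosodic features: [mean F0 (Hz), duration (s)]
phoneme_feats = np.array([[110.0, 0.08], [180.0, 0.12], [95.0, 0.20],
                          [220.0, 0.06], [140.0, 0.15]])

# Discretize into the prosodic labels that condition the TTS model
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(phoneme_feats)
labels = kmeans.labels_           # one discrete prosody label per phoneme
```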
arXiv Open Access 2022
A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems

Marcely Zanon Boito, Laurent Besacier, Natalia Tomashenko et al.

Self-supervised models for speech processing emerged recently as popular foundation blocks in speech processing pipelines. These models are pre-trained on unlabeled audio data and then used in speech processing downstream tasks such as automatic speech recognition (ASR) or speech translation (ST). Since these models are now used in research and industrial systems alike, it becomes necessary to understand the impact caused by some features such as gender distribution within pre-training data. Using French as our investigation language, we train and compare gender-specific wav2vec 2.0 models against models containing different degrees of gender balance in their pre-training data. The comparison is performed by applying these models to two speech-to-text downstream tasks: ASR and ST. Results show the type of downstream integration matters. We observe lower overall performance using gender-specific pre-training before fine-tuning an end-to-end ASR system. However, when self-supervised models are used as feature extractors, the overall ASR and ST results follow more complex patterns in which the balanced pre-trained model does not necessarily lead to the best results. Lastly, our crude 'fairness' metric, the relative performance difference measured between female and male test sets, does not display a strong variation from balanced to gender-specific pre-trained wav2vec 2.0 models.

en cs.CL, cs.SD
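The 'crude fairness metric' is a relative performance difference between female and male test sets. The exact normalization is not stated above; a sketch using the mean WER as denominator:

```python
def relative_performance_gap(wer_female, wer_male):
    """Relative WER difference between female and male test sets, in percent;
    0 means parity, and the sign indicates the disadvantaged group."""
    return 100.0 * (wer_female - wer_male) / ((wer_female + wer_male) / 2)

print(relative_performance_gap(wer_female=12.4, wer_male=11.8))
```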
arXiv Open Access 2022
A Two-Stage Deep Representation Learning-Based Speech Enhancement Method Using Variational Autoencoder and Adversarial Training

Yang Xiang, Jesper Lisby Højvang, Morten Højfeldt Rasmussen et al.

This paper focuses on leveraging deep representation learning (DRL) for speech enhancement (SE). In general, the performance of the deep neural network (DNN) is heavily dependent on the learning of data representation. However, the DRL's importance is often ignored in many DNN-based SE algorithms. To obtain a higher quality enhanced speech, we propose a two-stage DRL-based SE method through adversarial training. In the first stage, we disentangle different latent variables because disentangled representations can help the DNN generate better enhanced speech. Specifically, we use the $\beta$-variational autoencoder (VAE) algorithm to obtain the speech and noise posterior estimations and related representations from the observed signal. However, since the posteriors and representations are intractable and we can only apply a conditional assumption to estimate them, it is difficult to ensure that these estimations are always sufficiently accurate, which may degrade the final accuracy of the signal estimation. To further improve the quality of enhanced speech, in the second stage, we introduce adversarial training to reduce the effect of inaccurate posterior estimates on signal reconstruction and improve the signal estimation accuracy, making our algorithm more robust to potentially inaccurate posterior estimations. As a result, better SE performance can be achieved. The experimental results indicate that the proposed strategy can help similar DNN-based SE algorithms achieve higher short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and scale-invariant signal-to-distortion ratio (SI-SDR) scores. Moreover, the proposed algorithm can also outperform recent competitive SE algorithms.

en eess.AS, cs.SD
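The first-stage objective follows the standard β-VAE form. A minimal sketch, assuming an MSE reconstruction term and a diagonal-Gaussian posterior; the paper's speech/noise disentanglement and conditional assumptions are not reproduced.

```python
import torch

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    """beta-VAE objective: reconstruction plus beta-weighted KL divergence.

    mu, logvar parameterize the approximate posterior q(z|x); a larger beta
    pressures the latent code toward a disentangled representation.
    """
    recon = torch.nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```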

Page 11 of 7810