This study investigates the effects of verb aspect on legal judgments. In three experiments, presented in French to French speakers, participants read a scenario about a case of HIV transmission in which the actions of the person held responsible for the transmission were described using imperfective or perfective aspect. In addition, in Experiment 1 the accused was portrayed as aware (versus unaware) of his HIV status, and in Experiment 2 he was portrayed as having (versus not having) a previous criminal record. In Experiment 3, the lexical verbs used to describe his behavior cast the accused as a high-volition (versus low-volition) agent. Participants read the report, then judged intentionality, expressed their feelings toward the accused, and assigned him blame and a sentence. Results revealed that imperfective descriptions led to higher perceived intentionality, greater blame attribution, more dislike, and longer sentences. However, these imperfectivity effects occurred primarily when the accused was portrayed as aware of his HIV-positive status (Experiment 1), as having a criminal record (Experiment 2), and when the volition of his behavior was attenuated (Experiment 3). These findings provide novel insights into how linguistic subtleties can shape perceptions of criminal intentionality and related judgments.
Speech separation on hearable devices can enable transformative augmented and enhanced hearing capabilities. However, state-of-the-art speech separation networks cannot run in real-time on tiny, low-power neural accelerators designed for hearables, due to their limited compute capabilities. We present TF-MLPNet, the first speech separation network capable of running in real-time on such low-power accelerators while outperforming existing streaming models for blind speech separation and target speech extraction. Our network operates in the time-frequency domain, processing frequency sequences with stacks of fully connected layers that alternate along the channel and frequency dimensions, and independently processing the time sequence at each frequency bin using convolutional layers. Results show that our mixed-precision quantization-aware trained (QAT) model can process 6 ms audio chunks in real-time on the GAP9 processor, achieving a 3.5-4x runtime reduction compared to prior speech separation models.
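For a concrete picture of the block structure described in this abstract, the following is a minimal PyTorch sketch of a time-frequency block that alternates MLPs along the channel and frequency axes and runs a causal convolution over time independently at each frequency bin. The module name, layer sizes, and residual wiring are illustrative assumptions, not the TF-MLPNet implementation.

# Minimal sketch of the kind of time-frequency block described above.
# Shapes, layer sizes, and the module name are illustrative assumptions.
import torch
import torch.nn as nn


class TFMLPBlock(nn.Module):
    """Alternates MLPs over the channel and frequency axes, then runs a causal
    1-D convolution over time independently at each frequency bin."""

    def __init__(self, channels: int, n_freq: int, kernel_size: int = 3):
        super().__init__()
        self.channel_mlp = nn.Sequential(   # mixes along the channel axis
            nn.Linear(channels, channels), nn.ReLU(), nn.Linear(channels, channels)
        )
        self.freq_mlp = nn.Sequential(      # mixes along the frequency axis
            nn.Linear(n_freq, n_freq), nn.ReLU(), nn.Linear(n_freq, n_freq)
        )
        # Causal conv over time, shared across frequency bins but applied per bin.
        self.pad = kernel_size - 1
        self.time_conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        b, c, t, f = x.shape
        x = x + self.channel_mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x + self.freq_mlp(x)            # Linear acts on the last (frequency) axis
        # Fold frequency into the batch axis so each bin's time sequence is
        # processed independently, as required for streaming operation.
        y = x.permute(0, 3, 1, 2).reshape(b * f, c, t)
        y = self.time_conv(nn.functional.pad(y, (self.pad, 0)))
        x = x + y.reshape(b, f, c, t).permute(0, 2, 3, 1)
        return x


if __name__ == "__main__":
    block = TFMLPBlock(channels=32, n_freq=65)
    out = block(torch.randn(1, 32, 100, 65))   # (batch, channels, frames, bins)
    print(out.shape)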
Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and offering comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, examining their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; and (3) extensions to long-form speech, including segmentation strategies and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions toward building robust and scalable multi-speaker ASR.
While audio recordings in real life provide insights into social dynamics and conversational behavior, they also raise concerns about the privacy of personal, sensitive data. This article explores the effectiveness of restricting recordings to low-frequency audio in order to protect spoken content. When resampling the audio signals to different sampling rates, we compare the effect of applying versus omitting anti-aliasing filtering. Privacy enhancement is measured as an increased word error rate of automatic speech recognition models, and the impact on utility is measured with voice activity detection models. Our experimental results show that for clean recordings, models trained on audio sampled at up to 800 Hz transcribe the majority of words correctly. For both model types, we analyzed the impact of the speaker's sex and pitch, and we demonstrate that omitting anti-aliasing filters compromises speech privacy more strongly.
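As a concrete illustration of the comparison described here, the sketch below downsamples a signal both by naive decimation (no anti-aliasing filter) and by filtered decimation. The 16 kHz source rate and 800 Hz target are example values, not necessarily those used in the study.

# Illustrative sketch of the two resampling variants: naive decimation
# (no anti-aliasing) versus filtered decimation.
import numpy as np
from scipy.signal import resample_poly

def lowpass_versions(x: np.ndarray, orig_sr: int = 16000, target_sr: int = 800):
    factor = orig_sr // target_sr
    # Without anti-aliasing: keep every `factor`-th sample, so energy above
    # target_sr/2 folds back (aliases) into the retained low-frequency band.
    naive = x[::factor]
    # With anti-aliasing: resample_poly low-pass filters before decimating.
    filtered = resample_poly(x, up=1, down=factor)
    return naive, filtered

if __name__ == "__main__":
    t = np.arange(0, 1.0, 1 / 16000)
    x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
    naive, filtered = lowpass_versions(x)
    print(naive.shape, filtered.shape)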
End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes, streaming and non-streaming, each with its pros and cons. Streaming ASR processes speech frames in real time as they are received, while non-streaming ASR waits for the entire utterance; practitioners may therefore have to operate in either mode to satisfy their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Specifically, we study 1) integrating the encoders of these ASR modules, 2) using separate decoders to make mode switching flexible, and 3) enhancing performance by incorporating similarity-preserving knowledge distillation between the two modular encoders and decoders. Evaluation results show 2.6%-5.3% relative character error rate reductions (CERR) on CSJ for streaming ASR, and 8.3%-9.7% relative CERRs for non-streaming ASR, within a single model compared to multiple standalone modules.
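A minimal sketch of a similarity-preserving distillation loss of the kind referred to in this abstract is given below, applied between streaming and non-streaming encoder outputs. The tensor shapes and the exact normalization are illustrative assumptions rather than the paper's formulation.

# Sketch of a similarity-preserving knowledge distillation loss between two
# encoders processing the same utterances. Shapes are illustrative.
import torch
import torch.nn.functional as F

def similarity_preserving_loss(feat_student: torch.Tensor,
                               feat_teacher: torch.Tensor) -> torch.Tensor:
    """feat_*: (batch, time, dim) encoder outputs for the same batch of utterances."""
    b = feat_student.size(0)
    s = feat_student.reshape(b, -1)
    t = feat_teacher.reshape(b, -1)
    # Batch-wise pairwise similarity matrices, row-normalized; the loss pushes
    # the student's similarity structure toward the teacher's.
    g_s = F.normalize(s @ s.T, p=2, dim=1)
    g_t = F.normalize(t @ t.T, p=2, dim=1)
    return F.mse_loss(g_s, g_t)

# Example: the random tensors stand in for the streaming (student) and
# non-streaming (teacher) encoder outputs within the jointly trained model.
loss_kd = similarity_preserving_loss(torch.randn(8, 120, 256),
                                     torch.randn(8, 120, 256))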
Ihab Asaad, Maxime Jacquelin, Olivier Perrotin
et al.
Speech inpainting consists of reconstructing corrupted or missing speech segments using the surrounding context, a process that closely resembles the pretext tasks in Self-Supervised Learning (SSL) for speech encoders. This study investigates using SSL-trained speech encoders for inpainting without any additional training beyond the initial pretext task, simply adding a decoder to generate a waveform. We compare this approach to supervised fine-tuning of speech encoders for a downstream task, here inpainting. In practice, we integrate HuBERT as the SSL encoder and HiFi-GAN as the decoder in two configurations: (1) fine-tuning the decoder to align with the frozen pre-trained encoder's output, and (2) fine-tuning the encoder on the inpainting task with a frozen decoder. Evaluations are conducted under single- and multi-speaker conditions using in-domain and out-of-domain datasets (including unseen speakers, diverse speaking styles, and noise). Both informed and blind inpainting scenarios are considered, where the position of the corrupted segment is either known or unknown. The proposed SSL-based methods are benchmarked against several baselines, including a text-informed method combining automatic speech recognition with zero-shot text-to-speech synthesis. Performance is assessed using objective metrics and perceptual evaluations. The results demonstrate that both approaches outperform the baselines, successfully reconstructing speech segments up to 200 ms, and sometimes up to 400 ms. Notably, fine-tuning the SSL encoder achieves more accurate speech reconstruction in single-speaker settings, while a pre-trained encoder proves more effective for multi-speaker scenarios. This demonstrates that an SSL pretext task can transfer to speech inpainting, enabling successful speech reconstruction with a pre-trained encoder.
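To make the two configurations concrete, the sketch below wires an SSL encoder to a neural vocoder decoder and freezes one or the other. The wrapper class and its interface are illustrative assumptions; the encoder and decoder modules are passed in from whatever checkpoints are used.

# Sketch of the two fine-tuning configurations: freeze the encoder and tune
# the decoder, or freeze the decoder and tune the encoder.
import torch.nn as nn

class SSLInpainter(nn.Module):
    """Wraps a pre-trained SSL encoder (e.g. HuBERT) and a vocoder decoder
    (e.g. HiFi-GAN); exactly one of the two is fine-tuned."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module, finetune: str = "decoder"):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        frozen = encoder if finetune == "decoder" else decoder
        for p in frozen.parameters():
            p.requires_grad = False   # keep one module at its pre-trained weights

    def forward(self, corrupted_waveform):
        # The encoder sees the corrupted signal; its contextual representations
        # fill in the missing segment, and the decoder maps them back to audio.
        features = self.encoder(corrupted_waveform)
        return self.decoder(features)

# Configuration (1): SSLInpainter(hubert, hifigan, finetune="decoder")
# Configuration (2): SSLInpainter(hubert, hifigan, finetune="encoder")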
This article analyzes the relationship that "urban and contemporary" storytellers residing in Argentina and Spain (professionals who, over recent decades, have raised the visibility of storytelling practice in various cultural and educational spaces) have with tales of oral tradition. It examines how present oral-tradition tales are in their repertoires and the ways in which these tales are selected and transmitted, with particular attention to the nature of the sources. This leads us to look more closely at how this material is regarded within the profession and to ask whether telling these tales in an urban, contemporary context amounts to their revitalization.
Oral communication. Speech, French literature - Italian literature - Spanish literature - Portuguese literature
The paper deals with foreign language teaching methodology. It addresses the question of motivating foreign students to intensify their speech activity while learning the Ukrainian language. The author's principal focus is on the role of motivational tasks in the formation of foreign-language communicative competence, as well as in building the professional and social skills of a foreign audience. The results of the research confirm the usefulness of such motivational tasks, based on cross-cultural sources, for foreign students in the humanities. These motivational tasks consist of open-ended creative assignments that build on previously acquired phonetic, lexical, and grammatical skills and encourage learners to express themselves in Ukrainian. The article provides models of communication situations, role-playing games, presentations, discussions, projects, and other creative oral and written exercises based on materials about life in Kyiv. Conclusions on the results of integrating these tasks into the educational process are then presented.
Slavic languages. Baltic languages. Albanian languages, History (General) and history of Europe
There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on synthesizing a limited number of emotion types and achieve unsatisfactory intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with a specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model used to extract emotion embeddings. Mixed-emotion synthesis is achieved by combining the noise predicted by the diffusion model conditioned on different emotions within a single sampling process at run time. We further mix the Neutral emotion with a specific primary emotion in varying proportions to control intensity. Experimental results validate the effectiveness of EmoMix for mixed-emotion synthesis and intensity control.
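The run-time noise-mixing idea can be sketched as follows: at each reverse-diffusion step, the noise predicted under each emotion condition is combined with mixing weights. The denoiser interface and the weights are illustrative assumptions, not the paper's exact sampler.

# Sketch of combining noise predictions conditioned on different emotions.
def mixed_emotion_noise(denoiser, x_t, t, emo_embeds, weights):
    """denoiser: noise-prediction network taking (x_t, t, emotion_embedding);
    emo_embeds: list of emotion embeddings; weights: matching mixing weights."""
    noises = [denoiser(x_t, t, emo) for emo in emo_embeds]   # one prediction per emotion
    return sum(w * n for w, n in zip(weights, noises))       # convex combination

# Intensity control: mix the Neutral embedding with a primary emotion in varying
# proportions, e.g. weights = [0.3, 0.7] for mostly "Happy" with some Neutral,
# and feed the combined noise into the usual reverse-diffusion update.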
Hubert Nourtel, Pierre Champion, Denis Jouvet
et al.
Speech data carries a range of personal information, such as the speaker's identity and emotional state, and these attributes can be used for malicious purposes. With the development of virtual assistants, a new generation of privacy threats has emerged, and several studies have addressed the topic of preserving speech privacy. Among them, the VoicePrivacy initiative aims to promote the development of privacy preservation tools for speech technology. The task selected for the VoicePrivacy 2020 Challenge (VPC) is speaker anonymization: the goal is to hide the source speaker's identity while preserving the linguistic information. The VPC baseline relies on voice conversion. This paper studies the impact of the VPC's speaker anonymization baseline system on the emotional information present in speech utterances. Evaluation is performed following the VPC rules regarding the attackers' knowledge of the anonymization system. Our results show that the VPC baseline system does not suppress speakers' emotions against informed attackers. When comparing anonymized speech to original speech, emotion recognition performance degrades by 15% relative on IEMOCAP data, similar to the degradation observed for the automatic speech recognition used to evaluate the preservation of linguistic information.
Automatic speech recognition (ASR) has achieved remarkable success thanks to recent advances in deep learning, but it usually degrades significantly under real-world noisy conditions. Recent works introduce speech enhancement (SE) as a front-end to improve speech quality, which has proved effective but may not be optimal for downstream ASR due to the speech distortion problem. Building on this, the latest works combine SE with the currently popular self-supervised learning (SSL) to alleviate distortion and improve noise robustness. Despite their effectiveness, the speech distortion caused by conventional SE still cannot be eliminated. In this paper, we propose a self-supervised framework named Wav2code that implements feature-level SE with reduced distortion for noise-robust ASR. First, during pre-training, clean speech representations from the SSL model are used to look up a discrete codebook via nearest-neighbor feature matching; the resulting code sequence is then exploited to reconstruct the original clean representations, storing them in the codebook as a prior. Second, during fine-tuning, we propose a Transformer-based code predictor that accurately predicts clean codes by modeling the global dependencies of the input noisy representations, enabling the discovery and restoration of high-quality clean representations with reduced distortion. Furthermore, we propose an interactive feature fusion network that combines the original noisy and restored clean representations to account for both fidelity and quality, resulting in more informative features for downstream ASR. Finally, experiments on both synthetic and real noisy datasets demonstrate that Wav2code alleviates speech distortion and improves ASR performance under various noisy conditions, yielding stronger robustness.
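The nearest-neighbor codebook lookup described for the pre-training stage can be sketched as follows; the codebook size and feature dimension are illustrative assumptions, not the paper's settings.

# Sketch of nearest-neighbor codebook lookup over SSL feature frames.
import torch
import torch.nn as nn

class FeatureCodebook(nn.Module):
    def __init__(self, num_codes: int = 1024, dim: int = 768):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, feats: torch.Tensor):
        # feats: (batch, frames, dim) clean SSL representations
        dists = torch.cdist(feats, self.codes.unsqueeze(0).expand(feats.size(0), -1, -1))
        idx = dists.argmin(dim=-1)        # nearest code per frame
        quantized = self.codes[idx]       # (batch, frames, dim) clean prior
        return quantized, idx

# During fine-tuning, a Transformer code predictor would estimate `idx` from
# noisy features, and the looked-up clean codes restore low-distortion
# representations for the downstream ASR model.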
Théo Mariotte, Anthony Larcher, Silvio Montrésor
et al.
Speaker diarization is the task of answering "who spoke when?" in an audio stream. Pipeline systems rely on speech segmentation to extract speakers' segments and achieve robust speaker diarization. This paper proposes a common framework to solve three segmentation tasks in the distant-speech scenario: Voice Activity Detection (VAD), Overlapped Speech Detection (OSD), and Speaker Change Detection (SCD). In the literature, only a few studies investigate the multi-microphone distant-speech scenario. In this work, we propose a new set of spatial features based on direction-of-arrival estimation in the circular harmonic domain (CH-DOA). These spatial features are extracted from multi-microphone audio data and combined with standard acoustic features. Experiments on the AMI meeting corpus show that CH-DOA improves segmentation while remaining robust when microphones are deactivated.
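A heavily simplified sketch of building circular-harmonic spatial features from a uniform circular array is given below; the crude azimuth estimate from the first-order coefficient is only an illustration and not the paper's CH-DOA estimator.

# Simplified sketch: circular harmonic coefficients from multichannel STFTs,
# turned into per-bin direction cues to combine with acoustic features.
import numpy as np

def circular_harmonics(stft: np.ndarray, mic_angles: np.ndarray, max_order: int = 1):
    """stft: (mics, frames, bins) complex STFT; mic_angles: (mics,) in radians."""
    orders = np.arange(-max_order, max_order + 1)
    # b_m(t, f) = (1/N) * sum_n X_n(t, f) * exp(-j * m * phi_n)
    weights = np.exp(-1j * np.outer(orders, mic_angles)) / len(mic_angles)
    return np.einsum("om,mtf->otf", weights, stft)   # (orders, frames, bins)

def spatial_features(stft, mic_angles):
    ch = circular_harmonics(stft, mic_angles, max_order=1)
    # Crude per-bin azimuth cue from the first-order vs. zeroth-order coefficient.
    azimuth = np.angle(ch[2] / (ch[1] + 1e-8))
    return np.stack([np.cos(azimuth), np.sin(azimuth)], axis=-1)  # (frames, bins, 2)

if __name__ == "__main__":
    mics, frames, bins = 8, 50, 257
    stft = np.random.randn(mics, frames, bins) + 1j * np.random.randn(mics, frames, bins)
    angles = 2 * np.pi * np.arange(mics) / mics       # uniform circular array
    print(spatial_features(stft, angles).shape)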
Static norm-referenced assessment tools are widely favored by speech-language pathologists for establishing a diagnosis and determining a patient's atypical or pathological profile. Criterion-referenced assessment tools, for their part, are good candidates for complementing the first-line results obtained with norm-referenced approaches and for assessing patients who cannot be compared to a so-called normo-cognitive population. Through a review of the state of the art, this article argues for the importance of a criterion-referenced assessment approach in speech-language pathology.
First, a general definition of criterion-referenced assessment is proposed, based on a taxonomy of the different possible assessments of language skills. This first part is then illustrated with several criterion-referenced assessment tools used in speech-language pathology and remedial education, in order to derive a list of eight methodological characteristics. Finally, we outline the methodological principles and the clinical or educational stakes of this form of assessment, which remains poorly documented in the field of speech-language pathology.
The conclusions and perspectives put forward pave the way for clinicians to design a rehabilitation program or to track a patient's progress against their own baseline within a functional and systemic approach.