This study investigates the effects of verb aspect on legal judgments. In three experiments, presented in French to French speakers, participants read a scenario about a case of HIV transmission in which the actions of the person held responsible for the transmission were described using imperfective or perfective aspect. In addition, in Experiment 1 the accused was portrayed as aware (versus unaware) of his HIV status, and in Experiment 2 he was portrayed as having (versus not having) a previous criminal record. In Experiment 3, the lexical verbs used to describe his behavior cast the accused as a high-volition (versus low-volition) agent. Participants read the report, then judged intentionality, expressed their feelings toward the accused, and assigned him blame and a sentence. Results revealed that imperfective descriptions led to higher perceived intentionality, greater blame attribution, more dislike, and longer sentences. However, these imperfectivity effects occurred primarily when the accused was portrayed as aware of his HIV-positive status (Experiment 1), as having a criminal record (Experiment 2), and when the volition of his behavior was attenuated (Experiment 3). These findings provide novel insights into how linguistic subtleties can shape perceptions of criminal intentionality and related judgments.
Speech separation on hearable devices can enable transformative augmented and enhanced hearing capabilities. However, state-of-the-art speech separation networks cannot run in real-time on tiny, low-power neural accelerators designed for hearables, due to their limited compute capabilities. We present TF-MLPNet, the first speech separation network capable of running in real-time on such low-power accelerators while outperforming existing streaming models for blind speech separation and target speech extraction. Our network operates in the time-frequency domain, processing frequency sequences with stacks of fully connected layers that alternate along the channel and frequency dimensions, and independently processing the time sequence at each frequency bin using convolutional layers. Results show that our mixed-precision quantization-aware trained (QAT) model can process 6 ms audio chunks in real-time on the GAP9 processor, achieving a 3.5-4x runtime reduction compared to prior speech separation models.
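For a concrete picture of the block structure described in this abstract, the following is a minimal PyTorch sketch of a time-frequency block that alternates MLPs along the channel and frequency axes and runs a causal convolution over time independently at each frequency bin. The module name, layer sizes, and residual wiring are illustrative assumptions, not the TF-MLPNet implementation.

# Minimal sketch of the kind of time-frequency block described above.
# Shapes, layer sizes, and the module name are illustrative assumptions.
import torch
import torch.nn as nn


class TFMLPBlock(nn.Module):
    """Alternates MLPs over the channel and frequency axes, then runs a causal
    1-D convolution over time independently at each frequency bin."""

    def __init__(self, channels: int, n_freq: int, kernel_size: int = 3):
        super().__init__()
        self.channel_mlp = nn.Sequential(   # mixes along the channel axis
            nn.Linear(channels, channels), nn.ReLU(), nn.Linear(channels, channels)
        )
        self.freq_mlp = nn.Sequential(      # mixes along the frequency axis
            nn.Linear(n_freq, n_freq), nn.ReLU(), nn.Linear(n_freq, n_freq)
        )
        # Causal conv over time, shared across frequency bins but applied per bin.
        self.pad = kernel_size - 1
        self.time_conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        b, c, t, f = x.shape
        x = x + self.channel_mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x + self.freq_mlp(x)            # Linear acts on the last (frequency) axis
        # Fold frequency into the batch axis so each bin's time sequence is
        # processed independently, as required for streaming operation.
        y = x.permute(0, 3, 1, 2).reshape(b * f, c, t)
        y = self.time_conv(nn.functional.pad(y, (self.pad, 0)))
        x = x + y.reshape(b, f, c, t).permute(0, 2, 3, 1)
        return x


if __name__ == "__main__":
    block = TFMLPBlock(channels=32, n_freq=65)
    out = block(torch.randn(1, 32, 100, 65))   # (batch, channels, frames, bins)
    print(out.shape)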
Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and offering comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, examining their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; and (3) extensions to long-form speech, including segmentation strategies and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions toward building robust and scalable multi-speaker ASR.
While audio recordings in real life provide insights into social dynamics and conversational behavior, they also raise concerns about the privacy of personal, sensitive data. This article explores the effectiveness of restricting recordings to low-frequency audio in order to protect spoken content. When resampling the audio signals to different sampling rates, we compare the effect of applying versus omitting anti-aliasing filtering. Privacy enhancement is measured as an increased word error rate of automatic speech recognition models, and the impact on utility is measured with voice activity detection models. Our experimental results show that for clean recordings, models trained on audio sampled at up to 800 Hz transcribe the majority of words correctly. For both model types, we analyzed the impact of the speaker's sex and pitch, and we demonstrate that omitting anti-aliasing filters compromises speech privacy more strongly.
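As a concrete illustration of the comparison described here, the sketch below downsamples a signal both by naive decimation (no anti-aliasing filter) and by filtered decimation. The 16 kHz source rate and 800 Hz target are example values, not necessarily those used in the study.

# Illustrative sketch of the two resampling variants: naive decimation
# (no anti-aliasing) versus filtered decimation.
import numpy as np
from scipy.signal import resample_poly

def lowpass_versions(x: np.ndarray, orig_sr: int = 16000, target_sr: int = 800):
    factor = orig_sr // target_sr
    # Without anti-aliasing: keep every `factor`-th sample, so energy above
    # target_sr/2 folds back (aliases) into the retained low-frequency band.
    naive = x[::factor]
    # With anti-aliasing: resample_poly low-pass filters before decimating.
    filtered = resample_poly(x, up=1, down=factor)
    return naive, filtered

if __name__ == "__main__":
    t = np.arange(0, 1.0, 1 / 16000)
    x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
    naive, filtered = lowpass_versions(x)
    print(naive.shape, filtered.shape)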
End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes, streaming and non-streaming, each with its pros and cons. Streaming ASR processes speech frames in real time as they are received, while non-streaming ASR waits for the entire utterance; practitioners may therefore have to operate in either mode to satisfy their application. In this work, we present joint optimization of streaming and non-streaming ASR based on multi-decoder and knowledge distillation. Specifically, we study 1) integrating the encoders of these ASR modules, 2) using separate decoders to make mode switching flexible, and 3) enhancing performance by incorporating similarity-preserving knowledge distillation between the two modular encoders and decoders. Evaluation results show 2.6%-5.3% relative character error rate reductions (CERR) on CSJ for streaming ASR, and 8.3%-9.7% relative CERRs for non-streaming ASR, within a single model compared to multiple standalone modules.
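A minimal sketch of a similarity-preserving distillation loss of the kind referred to in this abstract is given below, applied between streaming and non-streaming encoder outputs. The tensor shapes and the exact normalization are illustrative assumptions rather than the paper's formulation.

# Sketch of a similarity-preserving knowledge distillation loss between two
# encoders processing the same utterances. Shapes are illustrative.
import torch
import torch.nn.functional as F

def similarity_preserving_loss(feat_student: torch.Tensor,
                               feat_teacher: torch.Tensor) -> torch.Tensor:
    """feat_*: (batch, time, dim) encoder outputs for the same batch of utterances."""
    b = feat_student.size(0)
    s = feat_student.reshape(b, -1)
    t = feat_teacher.reshape(b, -1)
    # Batch-wise pairwise similarity matrices, row-normalized; the loss pushes
    # the student's similarity structure toward the teacher's.
    g_s = F.normalize(s @ s.T, p=2, dim=1)
    g_t = F.normalize(t @ t.T, p=2, dim=1)
    return F.mse_loss(g_s, g_t)

# Example: the random tensors stand in for the streaming (student) and
# non-streaming (teacher) encoder outputs within the jointly trained model.
loss_kd = similarity_preserving_loss(torch.randn(8, 120, 256),
                                     torch.randn(8, 120, 256))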
Ihab Asaad, Maxime Jacquelin, Olivier Perrotin
et al.
Speech inpainting consists of reconstructing corrupted or missing speech segments using the surrounding context, a process that closely resembles the pretext tasks in Self-Supervised Learning (SSL) for speech encoders. This study investigates using SSL-trained speech encoders for inpainting without any additional training beyond the initial pretext task, simply adding a decoder to generate a waveform. We compare this approach to supervised fine-tuning of speech encoders for a downstream task, here inpainting. In practice, we integrate HuBERT as the SSL encoder and HiFi-GAN as the decoder in two configurations: (1) fine-tuning the decoder to align with the frozen pre-trained encoder's output, and (2) fine-tuning the encoder on the inpainting task with a frozen decoder. Evaluations are conducted under single- and multi-speaker conditions using in-domain and out-of-domain datasets (including unseen speakers, diverse speaking styles, and noise). Both informed and blind inpainting scenarios are considered, where the position of the corrupted segment is either known or unknown. The proposed SSL-based methods are benchmarked against several baselines, including a text-informed method combining automatic speech recognition with zero-shot text-to-speech synthesis. Performance is assessed using objective metrics and perceptual evaluations. The results demonstrate that both approaches outperform the baselines, successfully reconstructing speech segments up to 200 ms, and sometimes up to 400 ms. Notably, fine-tuning the SSL encoder achieves more accurate speech reconstruction in single-speaker settings, while a pre-trained encoder proves more effective for multi-speaker scenarios. This demonstrates that an SSL pretext task can transfer to speech inpainting, enabling successful speech reconstruction with a pre-trained encoder.
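To make the two configurations concrete, the sketch below wires an SSL encoder to a neural vocoder decoder and freezes one or the other. The wrapper class and its interface are illustrative assumptions; the encoder and decoder modules are passed in from whatever checkpoints are used.

# Sketch of the two fine-tuning configurations: freeze the encoder and tune
# the decoder, or freeze the decoder and tune the encoder.
import torch.nn as nn

class SSLInpainter(nn.Module):
    """Wraps a pre-trained SSL encoder (e.g. HuBERT) and a vocoder decoder
    (e.g. HiFi-GAN); exactly one of the two is fine-tuned."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module, finetune: str = "decoder"):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        frozen = encoder if finetune == "decoder" else decoder
        for p in frozen.parameters():
            p.requires_grad = False   # keep one module at its pre-trained weights

    def forward(self, corrupted_waveform):
        # The encoder sees the corrupted signal; its contextual representations
        # fill in the missing segment, and the decoder maps them back to audio.
        features = self.encoder(corrupted_waveform)
        return self.decoder(features)

# Configuration (1): SSLInpainter(hubert, hifigan, finetune="decoder")
# Configuration (2): SSLInpainter(hubert, hifigan, finetune="encoder")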
This article analyzes the relationship that "urban and contemporary" storytellers residing in Argentina and Spain (professionals who, over recent decades, have raised the visibility of storytelling practice in various cultural and educational spaces) have with tales of oral tradition. It examines how present oral-tradition tales are in their repertoires and the ways in which these tales are selected and transmitted, with particular attention to the nature of the sources. This leads us to look more closely at how this material is regarded within the profession and to ask whether telling these tales in an urban, contemporary context amounts to their revitalization.
Oral communication. Speech, French literature - Italian literature - Spanish literature - Portuguese literature
The paper deals with foreign language teaching methodology. It addresses the question of motivating foreign students to intensify their speech activity while learning the Ukrainian language. The author's principal focus is on the role of motivational tasks in the formation of foreign-language communicative competence, as well as in building the professional and social skills of a foreign audience. The results of the research confirm the usefulness of such motivational tasks, based on cross-cultural sources, for foreign students in the humanities. These motivational tasks consist of open-ended creative assignments that build on previously acquired phonetic, lexical, and grammatical skills and encourage learners to express themselves in Ukrainian. The article provides models of communication situations, role-playing games, presentations, discussions, projects, and other creative oral and written exercises based on materials about life in Kyiv. Conclusions on the results of integrating these tasks into the educational process are then presented.
Slavic languages. Baltic languages. Albanian languages, History (General) and history of Europe
There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on synthesizing a limited number of emotion types and achieve unsatisfactory intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with a specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model used to extract emotion embeddings. Mixed-emotion synthesis is achieved by combining the noise predicted by the diffusion model conditioned on different emotions within a single sampling process at run time. We further mix the Neutral emotion with a specific primary emotion in varying proportions to control intensity. Experimental results validate the effectiveness of EmoMix for mixed-emotion synthesis and intensity control.
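The run-time noise-mixing idea can be sketched as follows: at each reverse-diffusion step, the noise predicted under each emotion condition is combined with mixing weights. The denoiser interface and the weights are illustrative assumptions, not the paper's exact sampler.

# Sketch of combining noise predictions conditioned on different emotions.
def mixed_emotion_noise(denoiser, x_t, t, emo_embeds, weights):
    """denoiser: noise-prediction network taking (x_t, t, emotion_embedding);
    emo_embeds: list of emotion embeddings; weights: matching mixing weights."""
    noises = [denoiser(x_t, t, emo) for emo in emo_embeds]   # one prediction per emotion
    return sum(w * n for w, n in zip(weights, noises))       # convex combination

# Intensity control: mix the Neutral embedding with a primary emotion in varying
# proportions, e.g. weights = [0.3, 0.7] for mostly "Happy" with some Neutral,
# and feed the combined noise into the usual reverse-diffusion update.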
Hubert Nourtel, Pierre Champion, Denis Jouvet
et al.
Speech data carries a range of personal information, such as the speaker's identity and emotional state, and these attributes can be used for malicious purposes. With the development of virtual assistants, a new generation of privacy threats has emerged, and several studies have addressed the topic of preserving speech privacy. Among them, the VoicePrivacy initiative aims to promote the development of privacy preservation tools for speech technology. The task selected for the VoicePrivacy 2020 Challenge (VPC) is speaker anonymization: the goal is to hide the source speaker's identity while preserving the linguistic information. The VPC baseline relies on voice conversion. This paper studies the impact of the VPC's speaker anonymization baseline system on the emotional information present in speech utterances. Evaluation is performed following the VPC rules regarding the attackers' knowledge of the anonymization system. Our results show that the VPC baseline system does not suppress speakers' emotions against informed attackers. When comparing anonymized speech to original speech, emotion recognition performance degrades by 15% relative on IEMOCAP data, similar to the degradation observed for the automatic speech recognition used to evaluate the preservation of linguistic information.
Automatic speech recognition (ASR) has achieved remarkable success thanks to recent advances in deep learning, but it usually degrades significantly under real-world noisy conditions. Recent works introduce speech enhancement (SE) as a front-end to improve speech quality, which has proved effective but may not be optimal for downstream ASR due to the speech distortion problem. Building on this, the latest works combine SE with the currently popular self-supervised learning (SSL) to alleviate distortion and improve noise robustness. Despite their effectiveness, the speech distortion caused by conventional SE still cannot be eliminated. In this paper, we propose a self-supervised framework named Wav2code that implements feature-level SE with reduced distortion for noise-robust ASR. First, during pre-training, clean speech representations from the SSL model are used to look up a discrete codebook via nearest-neighbor feature matching; the resulting code sequence is then exploited to reconstruct the original clean representations, storing them in the codebook as a prior. Second, during fine-tuning, we propose a Transformer-based code predictor that accurately predicts clean codes by modeling the global dependencies of the input noisy representations, enabling the discovery and restoration of high-quality clean representations with reduced distortion. Furthermore, we propose an interactive feature fusion network that combines the original noisy and restored clean representations to account for both fidelity and quality, resulting in more informative features for downstream ASR. Finally, experiments on both synthetic and real noisy datasets demonstrate that Wav2code alleviates speech distortion and improves ASR performance under various noisy conditions, yielding stronger robustness.
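The nearest-neighbor codebook lookup described for the pre-training stage can be sketched as follows; the codebook size and feature dimension are illustrative assumptions, not the paper's settings.

# Sketch of nearest-neighbor codebook lookup over SSL feature frames.
import torch
import torch.nn as nn

class FeatureCodebook(nn.Module):
    def __init__(self, num_codes: int = 1024, dim: int = 768):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, feats: torch.Tensor):
        # feats: (batch, frames, dim) clean SSL representations
        dists = torch.cdist(feats, self.codes.unsqueeze(0).expand(feats.size(0), -1, -1))
        idx = dists.argmin(dim=-1)        # nearest code per frame
        quantized = self.codes[idx]       # (batch, frames, dim) clean prior
        return quantized, idx

# During fine-tuning, a Transformer code predictor would estimate `idx` from
# noisy features, and the looked-up clean codes restore low-distortion
# representations for the downstream ASR model.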
Théo Mariotte, Anthony Larcher, Silvio Montrésor
et al.
Speaker diarization is the task of answering "who spoke when?" in an audio stream. Pipeline systems rely on speech segmentation to extract speakers' segments and achieve robust speaker diarization. This paper proposes a common framework to solve three segmentation tasks in the distant-speech scenario: Voice Activity Detection (VAD), Overlapped Speech Detection (OSD), and Speaker Change Detection (SCD). In the literature, only a few studies investigate the multi-microphone distant-speech scenario. In this work, we propose a new set of spatial features based on direction-of-arrival estimation in the circular harmonic domain (CH-DOA). These spatial features are extracted from multi-microphone audio data and combined with standard acoustic features. Experiments on the AMI meeting corpus show that CH-DOA improves segmentation while remaining robust when microphones are deactivated.
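A heavily simplified sketch of building circular-harmonic spatial features from a uniform circular array is given below; the crude azimuth estimate from the first-order coefficient is only an illustration and not the paper's CH-DOA estimator.

# Simplified sketch: circular harmonic coefficients from multichannel STFTs,
# turned into per-bin direction cues to combine with acoustic features.
import numpy as np

def circular_harmonics(stft: np.ndarray, mic_angles: np.ndarray, max_order: int = 1):
    """stft: (mics, frames, bins) complex STFT; mic_angles: (mics,) in radians."""
    orders = np.arange(-max_order, max_order + 1)
    # b_m(t, f) = (1/N) * sum_n X_n(t, f) * exp(-j * m * phi_n)
    weights = np.exp(-1j * np.outer(orders, mic_angles)) / len(mic_angles)
    return np.einsum("om,mtf->otf", weights, stft)   # (orders, frames, bins)

def spatial_features(stft, mic_angles):
    ch = circular_harmonics(stft, mic_angles, max_order=1)
    # Crude per-bin azimuth cue from the first-order vs. zeroth-order coefficient.
    azimuth = np.angle(ch[2] / (ch[1] + 1e-8))
    return np.stack([np.cos(azimuth), np.sin(azimuth)], axis=-1)  # (frames, bins, 2)

if __name__ == "__main__":
    mics, frames, bins = 8, 50, 257
    stft = np.random.randn(mics, frames, bins) + 1j * np.random.randn(mics, frames, bins)
    angles = 2 * np.pi * np.arange(mics) / mics       # uniform circular array
    print(spatial_features(stft, angles).shape)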
Static norm-referenced assessment tools are widely favored by speech-language pathologists for establishing a diagnosis and determining a patient's atypical or pathological profile. Criterion-referenced assessment tools, for their part, are good candidates for complementing the first-line results obtained with norm-referenced approaches and for assessing patients who cannot be compared to a so-called normo-cognitive population. Through a review of the state of the art, this article argues for the importance of a criterion-referenced assessment approach in speech-language pathology.
First, a general definition of criterion-referenced assessment is proposed, based on a taxonomy of the different possible assessments of language skills. This first part is then illustrated with several criterion-referenced assessment tools used in speech-language pathology and remedial education, in order to derive a list of eight methodological characteristics. Finally, we outline the methodological principles and the clinical or educational stakes of this form of assessment, which remains poorly documented in the field of speech-language pathology.
The conclusions and perspectives put forward pave the way for clinicians to design a rehabilitation program or to track a patient's progress against their own baseline within a functional and systemic approach.