Results for "Oral communication. Speech"

Showing 20 of ~156441 results · from DOAJ, arXiv, Semantic Scholar

arXiv Open Access 2025
Bilingual Dual-Head Deep Model for Parkinson's Disease Detection from Speech

Moreno La Quatra, Juan Rafael Orozco-Arroyave, Marco Sabato Siniscalchi

This work aims to tackle the Parkinson's disease (PD) detection problem from the speech signal in a bilingual setting by proposing an ad-hoc dual-head deep neural architecture for type-based binary classification. One head is specialized for diadochokinetic patterns. The other head looks for natural speech patterns present in continuous spoken utterances. Only one of the two heads is operative, according to the nature of the input. Speech representations are extracted from self-supervised learning (SSL) models and wavelet transforms. Adaptive layers, convolutional bottlenecks, and contrastive learning are exploited to reduce variations across languages. Our solution is assessed against two distinct datasets, EWA-DB and PC-GITA, which cover the Slovak and Spanish languages, respectively. Results indicate that conventional models trained on a single-language dataset struggle with cross-linguistic generalization, and naive combinations of datasets are suboptimal. In contrast, our model improves generalization on both languages simultaneously.
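The type-based gating described in the abstract can be sketched as a simple dispatch. This is a hypothetical illustration (the head functions below are placeholders, not the authors' networks): only one head ever runs per input, selected by the utterance type.

```python
# Minimal sketch of type-based dual-head gating: exactly one head runs
# per input, chosen by the utterance type. The heads are toy placeholders;
# in the paper they are neural networks over SSL/wavelet features.

def ddk_head(features):
    # Placeholder for the diadochokinetic-pattern head.
    return sum(features) / len(features) > 0.5

def speech_head(features):
    # Placeholder for the continuous-speech head.
    return max(features) > 0.8

def dual_head_classify(features, utterance_type):
    """Route the input to exactly one head based on its type."""
    if utterance_type == "ddk":
        return ddk_head(features)
    elif utterance_type == "speech":
        return speech_head(features)
    raise ValueError(f"unknown utterance type: {utterance_type}")
```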

en eess.AS, cs.AI
arXiv Open Access 2025
Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications

Biel Tura Vecino, Adam Gabryś, Daniel Mątwicki et al.

Recent works have shown that modelling the raw waveform directly from text in an end-to-end (E2E) fashion produces more natural-sounding speech than traditional neural text-to-speech (TTS) systems based on a cascade or two-stage approach. However, current E2E state-of-the-art models are computationally complex and memory-consuming, making them unsuitable for real-time offline on-device applications in low-resource scenarios. To address this issue, we propose a Lightweight E2E-TTS (LE2E) model that generates high-quality speech requiring minimal computational resources. We evaluate the proposed model on the LJSpeech dataset and show that it achieves state-of-the-art performance while being up to $90\%$ smaller in terms of model parameters and $10\times$ faster in real-time factor. Furthermore, we demonstrate that the proposed E2E training paradigm achieves better quality compared to an equivalent architecture trained in a two-stage approach. Our results suggest that LE2E is a promising approach for developing real-time, high-quality, low-resource TTS for on-device applications.

en cs.SD, cs.AI
arXiv Open Access 2025
Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition

Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly

Mixture-of-experts (MoE) architectures have expanded from language modeling to automatic speech recognition (ASR). Traditional MoE methods, such as the Switch Transformer, route experts independently within each layer. Our analysis reveals that routers in most layers make expert choices that are not strongly correlated with the choices of the routers in other layers. To increase the cooperation between experts in different layers and encourage greater specialization, we use a shared router across different MoE layers. We call this model Omni-router Transformer. Extensive experiments on a large-scale pseudo-labeled dataset and evaluations across 10 diverse, out-of-domain ASR benchmarks demonstrate that the Omni-router Transformer is able to achieve lower training loss and consistently outperform dense and Switch Transformer models, reducing average word error rates by 11.2% and 8.2%, respectively, while providing structured expert usage and improved robustness to diverse data.
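The shared-routing idea can be illustrated with a toy dot-product router. This is a hypothetical sketch, not the paper's implementation: because every layer reuses the same router weights, expert choices are perfectly correlated across depth, which is the cooperation the abstract describes.

```python
# Toy sketch of Omni-router-style shared routing: one router's decision
# is reused by every MoE layer, so a token sees a consistent expert path.
# Dot-product scoring and the 2-expert setup are illustrative only.

def route(token, router_weights):
    """Pick the expert whose router weight vector best matches the token."""
    scores = [sum(t * w for t, w in zip(token, expert_w))
              for expert_w in router_weights]
    return scores.index(max(scores))

def omni_forward(token, shared_router, num_layers):
    """With a shared router, every layer picks the same expert for a token."""
    return [route(token, shared_router) for _ in range(num_layers)]
```

With independent per-layer routers, the list returned by `omni_forward` could mix expert indices; sharing the router forces them to agree.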

en cs.CL, cs.AI
arXiv Open Access 2024
Advanced Long-Content Speech Recognition With Factorized Neural Transducer

Xun Gong, Yu Wu, Jinyu Li et al.

In this paper, we propose two novel approaches, which integrate long-content information into the factorized neural transducer (FNT) based architecture in both non-streaming (referred to as LongFNT) and streaming (referred to as SLongFNT) scenarios. We first investigate whether long-content transcriptions can improve the vanilla conformer transducer (C-T) models. Our experiments indicate that the vanilla C-T models do not exhibit improved performance when utilizing long-content transcriptions, possibly due to the predictor network of C-T models not functioning as a pure language model. Instead, FNT shows its potential in utilizing long-content information, where we propose the LongFNT model and explore the impact of long-content information in both text (LongFNT-Text) and speech (LongFNT-Speech). The proposed LongFNT-Text and LongFNT-Speech models further complement each other to achieve better performance, with transcription history proving more valuable to the model. The effectiveness of our LongFNT approach is evaluated on the LibriSpeech and GigaSpeech corpora, yielding relative 19% and 12% word error rate reductions, respectively. Furthermore, we extend the LongFNT model to the streaming scenario, named SLongFNT, consisting of SLongFNT-Text and SLongFNT-Speech approaches to utilize long-content text and speech information. Experiments show that the proposed SLongFNT model achieves relative 26% and 17% WER reduction on LibriSpeech and GigaSpeech, respectively, while keeping good latency, compared to the FNT baseline. Overall, our proposed LongFNT and SLongFNT highlight the significance of considering long-content speech and transcription knowledge for improving both non-streaming and streaming speech recognition systems.

en cs.SD, eess.AS
arXiv Open Access 2024
PEFT for Speech: Unveiling Optimal Placement, Merging Strategies, and Ensemble Techniques

Tzu-Han Lin, How-Shing Wang, Hao-Yung Weng et al.

Parameter-Efficient Fine-Tuning (PEFT) is increasingly recognized as an effective method in speech processing. However, the optimal approach and the placement of PEFT methods remain inconclusive. Our study conducts extensive experiments to compare different PEFT methods and their layer-wise placement, adapting Differentiable Architecture Search (DARTS). We also explore the use of ensemble learning to leverage diverse PEFT strategies. The results reveal that DARTS does not outperform the baseline approach, which involves inserting the same PEFT method into all layers of a Self-Supervised Learning (SSL) model. In contrast, an ensemble learning approach, particularly one employing majority voting, demonstrates superior performance. Our statistical evidence indicates that different PEFT methods learn in varied ways. This variation might explain why the synergistic integration of various PEFT methods through ensemble learning can harness their unique learning capabilities more effectively compared to individual layer-wise optimization.
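The majority-voting ensemble the abstract favors reduces to a per-example vote over the predictions of models fine-tuned with different PEFT methods. A minimal sketch (the label values are hypothetical, not the paper's tasks):

```python
from collections import Counter

# Sketch of majority voting across models fine-tuned with different
# PEFT methods. Each inner list holds one model's predictions for the
# same sequence of examples; labels here are illustrative.

def majority_vote(predictions_per_model):
    """Return the per-example majority label across all models."""
    n_examples = len(predictions_per_model[0])
    voted = []
    for i in range(n_examples):
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        voted.append(votes.most_common(1)[0][0])
    return voted
```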

en cs.CL, cs.SD
DOAJ Open Access 2023
ChatGPT, LaMDA, and the Hype Around Communicative AI: The Automation of Communication as a Field of Research in Media and Communication Studies

Andreas Hepp, Wiebke Loosen, Stephan Dreyer et al.

The aim of this article is to more precisely define the field of research on the automation of communication, which is still only vaguely discernible. The central thesis argues that to be able to fully grasp the transformation of the media environment associated with the automation of communication, our view must be broadened from a preoccupation with direct interactions between humans and machines to societal communication. This more widely targeted question asks how the dynamics of societal communication change when communicative artificial intelligence—in short: communicative AI—is integrated into aspects of societal communication. To this end, we recommend an approach that follows the tradition of figurational sociology.

Technology (General), Oral communication. Speech
DOAJ Open Access 2023
Are Polish CLIL learners more willing to communicate in English than non-CLIL learners?

Plichta Paulina, Muszyńska Karolina, Otwinowska Agnieszka

Willingness to communicate (WTC) is the probability that one will choose to initiate communication given the opportunity to do so. We investigated the second language (L2) WTC (L2WTC) in Polish teenagers aged 16-19 (N = 177) attending bilingual Content and Language Integrated Learning (CLIL) and general English classes in the same Polish school. Using a tailor-made WTC questionnaire, we gathered data twice, before and after the summer holidays, assuming that WTC would depend on the time of testing. The results did not reveal high WTC in English in either group, and the time of testing did not influence the results. Students’ age, but not gender, influenced the L2WTC, with older students being more willing to communicate than the younger ones. Contrary to expectations, L2WTC in the bilingual CLIL and non-CLIL groups did not differ. The findings suggest that the bilingual programmes in Polish secondary schools may not increase students’ WTC in English.

Oral communication. Speech, Psychology
arXiv Open Access 2023
Hate Speech Detection via Dual Contrastive Learning

Junyu Lu, Hongfei Lin, Xiaokun Zhang et al.

The fast spread of hate speech on social media impacts the Internet environment and our society by increasing prejudice and hurting people. Detecting hate speech has aroused broad attention in the field of natural language processing. Although hate speech detection has been addressed in recent work, this task still faces two inherent unsolved challenges. The first challenge lies in the complex semantic information conveyed in hate speech, particularly the interference of insulting words in hate speech detection. The second challenge is the imbalanced distribution of hate speech and non-hate speech, which may significantly deteriorate the performance of models. To tackle these challenges, we propose a novel dual contrastive learning (DCL) framework for hate speech detection. Our framework jointly optimizes the self-supervised and the supervised contrastive learning loss for capturing span-level information beyond the token-level emotional semantics used in existing models, particularly detecting speech containing abusive and insulting words. Moreover, we integrate the focal loss into the dual contrastive learning framework to alleviate the problem of data imbalance. We conduct experiments on two publicly available English datasets, and experimental results show that the proposed model outperforms the state-of-the-art models and precisely detects hate speech.
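The focal loss mentioned for the imbalance problem down-weights easy examples through a modulating factor on the standard cross-entropy term. A minimal single-example sketch (the gamma value is illustrative, not taken from the paper):

```python
import math

# Sketch of the focal loss used to counter class imbalance:
# FL(p_t) = -(1 - p_t)^gamma * log(p_t), where p_t is the predicted
# probability of the true class. Larger gamma suppresses easy examples
# more strongly; gamma=2.0 here is illustrative.

def focal_loss(p_correct, gamma=2.0):
    """Focal loss for one example; p_correct is the true-class probability."""
    return -((1.0 - p_correct) ** gamma) * math.log(p_correct)
```

A confidently correct prediction (p_t near 1) contributes almost nothing, so training gradient concentrates on the rare, hard class.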

en cs.CL, cs.AI
arXiv Open Access 2023
Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

Kun Wei, Bei Li, Hang Lv et al.

Automatic Speech Recognition (ASR) in conversational settings presents unique challenges, including extracting relevant contextual information from previous conversational turns. Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. Our approach leverages a cross-modal extractor that combines pre-trained speech and text models through a specialized encoder and a modal-level mask input. This enables the extraction of richer historical speech context without explicit error propagation. We also incorporate conditional latent variational modules to learn conversational level attributes such as role preference and topic coherence. By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss, achieving relative accuracy improvements of 8.8% and 23% on Mandarin conversation datasets HKUST and MagicData-RAMC, respectively, compared to the standard Conformer model.

en cs.SD, cs.CL
arXiv Open Access 2023
Where's the Liability in Harmful AI Speech?

Peter Henderson, Tatsunori Hashimoto, Mark Lemley

Generative AI, in particular text-based "foundation models" (large models trained on a huge variety of information including the internet), can generate speech that could be problematic under a wide range of liability regimes. Machine learning practitioners regularly "red team" models to identify and mitigate such problematic speech: from "hallucinations" falsely accusing people of serious misconduct to recipes for constructing an atomic bomb. A key question is whether these red-teamed behaviors actually present any liability risk for model creators and deployers under U.S. law, incentivizing investments in safety mechanisms. We examine three liability regimes, tying them to common examples of red-teamed model behaviors: defamation, speech integral to criminal conduct, and wrongful death. We find that any Section 230 immunity analysis or downstream liability analysis is intimately wrapped up in the technical details of algorithm design. And there are many roadblocks to truly finding models (and their associated parties) liable for generated speech. We argue that AI should not be categorically immune from liability in these scenarios and that as courts grapple with the already fine-grained complexities of platform algorithms, the technical details of generative AI loom above with thornier questions. Courts and policymakers should think carefully about what technical design incentives they create as they evaluate these issues.

en cs.CY, cs.AI
arXiv Open Access 2023
Training dynamic models using early exits for automatic speech recognition on resource-constrained devices

George August Wright, Umberto Cappellazzo, Salah Zaiem et al.

The ability to dynamically adjust the computational load of neural models during inference is crucial for on-device processing scenarios characterised by limited and time-varying computational resources. A promising solution is presented by early-exit architectures, in which additional exit branches are appended to intermediate layers of the encoder. In self-attention models for automatic speech recognition (ASR), early-exit architectures enable the development of dynamic models capable of adapting their size and architecture to varying levels of computational resources and ASR performance demands. Previous research on early-exiting ASR models has relied on pre-trained self-supervised models, fine-tuned with an early-exit loss. In this paper, we undertake an experimental comparison between fine-tuning pre-trained backbones and training models from scratch with the early-exiting objective. Experiments conducted on public datasets reveal that early-exit models trained from scratch not only preserve performance when using fewer encoder layers but also exhibit enhanced task accuracy compared to single-exit or pre-trained models. Furthermore, we explore an exit selection strategy grounded in posterior probabilities as an alternative to the conventional frame-based entropy approach. Results provide insights into the training dynamics of early-exit architectures for ASR models, particularly the efficacy of training strategies and exit selection methods.
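The two exit-selection rules the abstract contrasts, frame-level entropy versus posterior probability, can be sketched side by side. The thresholds below are hypothetical; real systems tune them per exit.

```python
import math

# Sketch of early-exit selection: walk the exits in depth order and stop
# at the first one whose posterior distribution passes the chosen rule,
# either low entropy or a sufficiently high maximum posterior.
# Threshold values are illustrative, not from the paper.

def entropy(posteriors):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in posteriors if p > 0)

def select_exit(per_exit_posteriors, max_entropy=0.5, min_confidence=None):
    """Return the index of the first exit that passes the selection rule."""
    for i, post in enumerate(per_exit_posteriors):
        if min_confidence is not None:
            if max(post) >= min_confidence:  # posterior-probability rule
                return i
        elif entropy(post) <= max_entropy:   # entropy rule
            return i
    return len(per_exit_posteriors) - 1     # fall through to the final exit
```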

en eess.AS, cs.CL
DOAJ Open Access 2022
IVF So White, So Medical: Digital Normativity and Algorithm Bias in Infertility on Instagram

Caitlyn M. Jarvis, Margaret M. Quinlan

Increasingly, women experiencing infertility are turning online to social media platforms, like Instagram, to engage with a support network and foster empathy. However, Instagram is also noted for its augmentation of White, cis, and heteronormative femininity through a process of silencing and minoritizing alternative, non-White voices. Through an inductive analysis of the most frequently used infertility hashtags, we collected and analyzed 252 Instagram posts to investigate how these algorithmic practices may socially construct the idealized IVF experience through communicating normative expectations. We identify predominant patterns of use that reinforce stratification within infertility treatments as primarily accessible to White women and best handled through expensive, expert medical procedures. Ultimately, we argue for increased attention to how algorithms may communicatively constitute and socially construct existing health disparities.

Technology (General), Oral communication. Speech
arXiv Open Access 2022
Language Dependencies in Adversarial Attacks on Speech Recognition Systems

Karla Markert, Donika Mirdita, Konstantin Böttinger

Automatic speech recognition (ASR) systems are ubiquitously present in our daily devices. They are vulnerable to adversarial attacks, where manipulated input samples fool the ASR system's recognition. While adversarial examples for various English ASR systems have already been analyzed, there exists no inter-language comparative vulnerability analysis. We compare the attackability of a German and an English ASR system, taking Deepspeech as an example. We investigate if one of the language models is more susceptible to manipulations than the other. The results of our experiments suggest statistically significant differences between English and German in terms of computational effort necessary for the successful generation of adversarial examples. This result encourages further research in language-dependent characteristics in the robustness analysis of ASR.
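The attacks being compared across languages typically build on gradient-sign perturbation. The sketch below shows that standard FGSM-style core step on a toy feature vector; the values are illustrative and unrelated to the paper's Deepspeech experiments.

```python
# Toy sketch of gradient-sign perturbation, the core step behind many
# adversarial-example attacks on ASR: nudge each input feature by
# epsilon in the direction that increases the model's loss.

def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def perturb(sample, gradient, epsilon):
    """Shift each feature by epsilon in the loss-increasing direction."""
    return [x + epsilon * sign(g) for x, g in zip(sample, gradient)]
```

The cross-language comparison in the paper concerns how much of this kind of optimization effort is needed before the ASR output flips, not the perturbation rule itself.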

en cs.CL, cs.CR
arXiv Open Access 2022
Towards Contextual Spelling Correction for Customization of End-to-end Speech Recognition Systems

Xiaoqiang Wang, Yanqing Liu, Jinyu Li et al.

Contextual biasing is an important and challenging task for end-to-end automatic speech recognition (ASR) systems, which aims to achieve better recognition performance by biasing the ASR system towards particular context phrases such as person names, music lists, proper nouns, etc. Existing methods mainly include contextual LM biasing and adding a bias encoder into end-to-end ASR models. In this work, we introduce a novel approach to contextual biasing by adding a contextual spelling correction model on top of the end-to-end ASR system. We incorporate contextual information into a sequence-to-sequence spelling correction model with a shared context encoder. Our proposed model includes two different mechanisms: autoregressive (AR) and non-autoregressive (NAR). We propose filtering algorithms to handle large-size context lists, and performance balancing mechanisms to control the biasing degree of the model. We demonstrate that the proposed model is a general biasing solution which is domain-insensitive and can be adopted in different scenarios. Experiments show that the proposed method achieves as much as 51% relative word error rate (WER) reduction over the baseline ASR system and outperforms traditional biasing methods. Compared to the AR solution, the proposed NAR model reduces model size by 43.2% and speeds up inference by 2.1 times.
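Filtering a large context list before correction amounts to keeping only phrases that look relevant to the ASR hypothesis. The sketch below uses `difflib` string similarity as a stand-in scoring function; the paper's actual filtering algorithms differ, and the cutoff and limit values are hypothetical.

```python
import difflib

# Sketch of context-list filtering before spelling correction: keep a
# context phrase only if one of its words is similar to some word in the
# ASR hypothesis. difflib similarity is a stand-in for the paper's
# filtering algorithms; cutoff/limit are illustrative.

def filter_context(hypothesis_words, context_phrases, cutoff=0.6, limit=10):
    """Keep context phrases with a word close to some hypothesis word."""
    hyp = [w.lower() for w in hypothesis_words]
    kept = []
    for phrase in context_phrases:
        for word in phrase.split():
            if difflib.get_close_matches(word.lower(), hyp, n=1, cutoff=cutoff):
                kept.append(phrase)
                break  # one similar word is enough to keep the phrase
    return kept[:limit]
```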

en cs.CL, cs.SD
arXiv Open Access 2022
Robotic Speech Synthesis: Perspectives on Interactions, Scenarios, and Ethics

Yuanchao Li, Catherine Lai

In recent years, many works have investigated the feasibility of conversational robots for performing specific tasks, such as healthcare and interviewing. Along with this development comes a practical issue: how should we synthesize robotic voices to meet the needs of different situations? In this paper, we discuss this issue from three perspectives: 1) the difficulties of synthesizing non-verbal and interaction-oriented speech signals, particularly backchannels; 2) the scenario classification for robotic voice synthesis; 3) the ethical issues regarding the design of robot voice for its emotion and identity. We present findings from the relevant literature and our prior work, aiming to help human-robot interaction researchers design better conversational robots in the future.

en cs.RO, cs.CL
DOAJ Open Access 2021
Mass media contra bandolers:

Pep Valsalobre

The Catalan-language chapbook published in Barcelona in 1634 about the outlaw Joan Sala, alias Serrallonga (the last chapbook in Catalan on the bandit theme), allows us to observe how the portrayal of the bandit varies with the language of publication, which is to say with the intended audience and, ultimately, with the aim of the promoter. The poetic chapbooks in Catalan, sponsored or at least controlled by the authorities and addressed to a public immersed in the phenomenon of banditry, allow themselves no expression of sympathy for the bandit. Unlike the chapbooks printed in Spanish in Barcelona in 1633 on the occasion of Serrallonga's capture, or the one printed in Madrid in 1635, the 1634 chapbook contains no praise of any kind. The Spanish chapbooks, by contrast, whether because they addressed a different public in Catalonia from the one that received the Catalan print, or because they were published in Barcelona to be exported outside Catalan territory, begin a line of growing idealization of the bandit, partly parallel to that of learned Baroque literature. Appendix I presents an edition of the text of the 1634 Catalan chapbook, and Appendix II reflects on the circumstances of the bandit's capture on 31 October 1633.

Oral communication. Speech, French literature - Italian literature - Spanish literature - Portuguese literature
DOAJ Open Access 2021
Correspondence in the Messenger: Identifying the Author by Text in the Context of the Transformation of Individualizing Features

Anastasiya V. Gromova

The article examines the texts of Internet-based communication and correspondence in a messenger, with attention to the similarities and differences between oral and written dialogical speech. It addresses the neutralization of speech features in Internet-based communication and the transformation of the attributes that individually characterize an author exchanging messages in a messenger. The form of speech typical of messenger correspondence is defined as dialogical written (printed) speech, treated as a product of intellectual activity together with the form of its implementation, taking into account the author’s use of technical means for typing. The author presents approaches to identifying the significant speech characteristics that a correspondent demonstrates in written messenger exchanges; these approaches are often applied in authorship identification examinations. The possibility of revealing a complex of the author’s individualizing features is demonstrated. The paper emphasizes the importance of studying features at the graphic and communicative levels of analysis of dialogical texts and provides examples of such features. It argues for combining linguistic and quantitative methods of analysis in revealing the author’s individualizing identification features, and outlines prospects for further research into the linguistic personality of the digital age.

Language and Literature
arXiv Open Access 2021
Exploiting Adapters for Cross-lingual Low-resource Speech Recognition

Wenxin Hou, Han Zhu, Yidong Wang et al.

Cross-lingual speech adaptation aims to leverage multiple rich-resource languages to build models for a low-resource target language. Since the low-resource language has limited training data, speech recognition models can easily overfit. In this paper, we investigate the performance of multiple adapter-based approaches for parameter-efficient cross-lingual speech adaptation. Building on our previous MetaAdapter, which leverages adapters implicitly, we propose a novel algorithm called SimAdapter for explicitly learning knowledge from adapters. Both methods use adapters that can be easily integrated into the Transformer structure. MetaAdapter leverages meta-learning to transfer general knowledge from the training data to the test language, while SimAdapter learns the similarities between the source and target languages during fine-tuning using the adapters. We conduct extensive experiments on five low-resource languages in the Common Voice dataset. Results demonstrate that our MetaAdapter and SimAdapter methods can reduce WER by 2.98% and 2.55% with only 2.5% and 15.5% of trainable parameters, respectively, compared to the strong full-model fine-tuning baseline. Moreover, we show that the two algorithms can be combined for better performance, with up to a 3.55% relative WER reduction.
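The adapters referred to above are typically small bottleneck modules with a residual connection, inserted into an otherwise frozen Transformer layer. A minimal forward-pass sketch (dimensions and weights are toy values, not the paper's configuration):

```python
# Sketch of a bottleneck adapter: hidden + up(relu(down(hidden))).
# Only the two small projections are trained; the surrounding
# Transformer weights stay frozen. All values here are toy examples.

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def relu(vec):
    return [max(0.0, x) for x in vec]

def adapter(hidden, down_proj, up_proj):
    """Apply a bottleneck adapter with a residual connection."""
    bottleneck = relu(matvec(down_proj, hidden))
    return [h + u for h, u in zip(hidden, matvec(up_proj, bottleneck))]
```

Because the bottleneck is much narrower than the hidden size, the trainable parameter count stays small, which is what makes the 2.5% and 15.5% figures in the abstract possible.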

en cs.CL, cs.SD
arXiv Open Access 2021
Retweet communities reveal the main sources of hate speech

Bojan Evkoski, Andraz Pelicon, Igor Mozetic et al.

We address the challenging problem of identifying the main sources of hate speech on Twitter. On one hand, we carefully annotate a large set of tweets for hate speech, and deploy advanced deep learning to produce high quality hate speech classification models. On the other hand, we create retweet networks, detect communities and monitor their evolution through time. This combined approach is applied to three years of Slovenian Twitter data. We report a number of interesting results. Hate speech is dominated by offensive tweets, related to political and ideological issues. The share of unacceptable tweets is moderately increasing with time, from the initial 20% to 30% by the end of 2020. Unacceptable tweets are retweeted significantly more often than acceptable tweets. About 60% of unacceptable tweets are produced by a single right-wing community of only moderate size. Institutional Twitter accounts and media accounts post significantly fewer unacceptable tweets than individual accounts. In fact, the main sources of unacceptable tweets are anonymous accounts, and accounts that were suspended or closed during the years 2018-2020.

en cs.SI, cs.CY

Page 25 of 7823