Use Matters: How Different Ways of Using ChatGPT Drive AI Acceptance and Solutionism
Florian Golo Flaßhoff, Fabian Anicker, Frank Marcinkowski
Artificial intelligence is central to solutionism—the vision of a world in which all major problems are solved through technology. This study theorizes how human–AI communication shapes attitudes toward AI, influences the formation of public opinion, and sparks solutionist imaginaries. We empirically examine the attitude formation that results from non-simulated use of an unmanipulated conversational model in a controlled laboratory experiment. Using a between-subjects design, participants engaged in three semi-structured 20-minute sessions with ChatGPT, providing a novel perspective on the effects of its use. The findings reveal that mere use of ChatGPT causally increases AI acceptance; however, its impact depends significantly on how the system is used. Non-instrumental, guided exploratory usage in particular activates the solutionist imagination and leads to greater AI acceptance. This study demonstrates that specific types of AI engagement can shape public attitudes toward AI, including the development of uncritical beliefs about its future capabilities.
Technology (General), Oral communication. Speech
TokenChain: A Discrete Speech Chain via Semantic Token Modeling
Mingxuan Wang, Satoshi Nakamura
Machine Speech Chain, simulating the human perception-production loop, proves effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model for synthesis only. End-to-end feedback across the text interface is enabled with straight-through argmax/Gumbel-Softmax and balanced with supervised ASR via dynamic weight averaging. Ablations examine optimal temperature schedules for in- and cross-domain transfer. Evaluation reveals TokenChain surpasses baseline accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error with stable T2S on LibriSpeech, and reduces relative ASR WER by 56% and T2S WER by 31% on TED-LIUM with minimal forgetting, showing that chain learning remains effective with token interfaces and models.
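The straight-through trick mentioned above, which lets gradients cross the discrete text interface, can be illustrated with a minimal NumPy sketch. This is a generic straight-through Gumbel-Softmax (forward pass only), not TokenChain's exact formulation; all names and shapes are illustrative:

```python
import numpy as np

def gumbel_softmax_st(logits, tau=1.0, rng=None):
    """Straight-through Gumbel-Softmax: the forward pass emits a hard
    one-hot token; in an autodiff framework the backward pass would use
    the gradient of y_soft (y_hard = y_soft + stop_grad(one_hot - y_soft))."""
    rng = rng or np.random.default_rng(0)
    g = -np.log(-np.log(rng.uniform(1e-9, 1.0, logits.shape)))  # Gumbel noise
    y = np.exp((logits + g) / tau)
    y_soft = y / y.sum(axis=-1, keepdims=True)      # relaxed distribution
    y_hard = np.eye(logits.shape[-1])[y_soft.argmax(axis=-1)]  # discrete token
    return y_hard, y_soft

logits = np.array([[2.0, 0.5, -1.0]])  # hypothetical token logits
hard, soft = gumbel_softmax_st(logits, tau=0.5)
```

Lowering the temperature `tau` sharpens `y_soft` toward the hard sample, which is why the abstract's ablation over temperature schedules matters for transfer.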
Large Model Empowered Streaming Speech Semantic Communications
Zhenzi Weng, Zhijin Qin, Geoffrey Ye Li
In this paper, we introduce a large model-empowered streaming semantic communication system for speech transmission across various languages, named LSSC-ST. Specifically, we devise an edge-device collaborative semantic communication architecture by offloading the intricate semantic extraction and channel coding modules to edge servers, thereby reducing the computational burden on local devices. To support multilingual speech transmission, pre-trained large speech models are utilized to learn unified semantic features from speech in different languages, breaking the constraint of a single input language and enhancing the practicality of the LSSC-ST. Moreover, the input speech is sequentially streamed into the developed system as short speech segments, which enables low transmission latency without degrading the quality of the produced speech. A novel dynamic speech segmentation algorithm is proposed to further reduce the transmission latency by adaptively adjusting the duration of speech segments. According to simulation results, the LSSC-ST delivers more accurate speech transmission and achieves lower latency than existing non-streaming semantic communication systems.
Improving the Inclusivity of Dutch Speech Recognition by Fine-tuning Whisper on the JASMIN-CGN Corpus
Golshid Shekoufandeh, Paul Boersma, Antal van den Bosch
We study the variation in speech recognition performance of fine-tuned versions of the Whisper model on child, elderly, and non-native Dutch speech from the JASMIN-CGN corpus. Our primary goal is to evaluate how speakers' age and linguistic background influence Whisper's performance. Whisper achieves varying Word Error Rates (WER) when fine-tuned on subpopulations of specific ages and linguistic backgrounds. Fine-tuned performance is markedly better than zero-shot performance, achieving relative WER reductions of 81% for native children, 72% for non-native children, 67% for non-native adults, and 65% for native elderly people. Our findings underscore the importance of training speech recognition models like Whisper on underrepresented subpopulations such as children, the elderly, and non-native speakers.
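The relative reductions quoted above follow the standard formula: the drop in WER divided by the zero-shot WER. A one-line sketch with hypothetical numbers (the abstract does not report the absolute WERs):

```python
def relative_wer_reduction(wer_zero_shot, wer_finetuned):
    """Relative WER reduction: fraction of the zero-shot error removed."""
    return (wer_zero_shot - wer_finetuned) / wer_zero_shot

# Hypothetical illustration: a zero-shot WER of 40% dropping to 7.6%
# after fine-tuning corresponds to an 81% relative reduction.
r = relative_wer_reduction(0.40, 0.076)
```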
Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language
Turi Abu, Ying Shi, Thomas Fang Zheng
et al.
We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. This dataset addresses the critical need for ASR resources for the underrepresented Oromo language. To demonstrate its applicability to the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with a hybrid CTC and AED loss and a WER of 18.74% with a pure CTC loss. Additionally, fine-tuning the Whisper model resulted in a significantly improved WER of 10.82%. These results establish baselines for Oromo ASR, highlighting both the challenges and the potential for improving ASR performance in Oromo. The dataset is publicly available at https://github.com/turinaf/sagalee and we encourage its use for further research and development in Oromo speech processing.
Privacy Disclosure of Similarity Rank in Speech and Language Processing
Tom Bäckström, Mohammad Hassan Vali, My Nguyen
et al.
Speaker, author, and other biometric identification applications often compare a sample's similarity to a database of templates to determine the identity. Given that data may be noisy and similarity measures can be inaccurate, such a comparison may not reliably identify the true identity as the most similar. Still, even the similarity rank based on an inaccurate similarity measure can disclose private information about the true identity. We propose a methodology for quantifying the privacy disclosure of such a similarity rank by estimating its probability distribution. It is based on determining the histogram of the similarity rank of the true speaker or, when data is scarce, modeling the histogram with the beta-binomial distribution. We express the disclosure in terms of entropy (bits), such that disclosures from independent features are additive. Our experiments demonstrate that all tested speaker and author characterizations contain personally identifying information (PII) that can aid in identification, with embeddings from speaker recognition algorithms containing the most information, followed by phone embeddings, linguistic embeddings, and fundamental frequency. Our initial experiments show that the disclosure of PII increases with the length of test samples, but it is bounded by the length of database templates. The proposed metric, similarity-rank disclosure, offers a way to compare the disclosure of PII between biometric features and merge them to aid identification. It can thus aid in the holistic evaluation of threats to privacy in speech and other biometric technologies.
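One plausible way to turn a rank histogram into a disclosure figure in bits is to take the entropy gap between a uniform rank distribution (no identifying information) and the observed distribution of the true identity's rank. This sketch is an assumption about the metric's general shape, not the paper's exact estimator (which can also fit a beta-binomial model when data is scarce):

```python
import numpy as np

def rank_disclosure_bits(ranks, db_size):
    """Assumed disclosure metric: log2(db_size) minus the entropy of the
    observed histogram of the true identity's similarity rank (1-indexed)."""
    hist = np.bincount(np.asarray(ranks) - 1, minlength=db_size).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]                           # drop empty bins before log
    h_observed = -(p * np.log2(p)).sum()   # entropy of observed ranks
    return np.log2(db_size) - h_observed   # bits of PII disclosed

# Hypothetical example: the true speaker always ranks in the top 4
# of a 64-entry template database.
ranks = [1, 2, 1, 3, 4, 2, 1, 3]
bits = rank_disclosure_bits(ranks, db_size=64)
```

A concentrated rank distribution (as above) yields a disclosure close to log2(64) = 6 bits, while a uniform one would yield 0 bits.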
Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement
Rauf Nasretdinov, Roman Korostik, Ante Jukić
In this work, we investigate the application of generative speech enhancement to improve the robustness of ASR models in noisy and reverberant conditions. We employ a recently proposed speech enhancement model based on the Schrödinger bridge, which has been shown to perform well compared to diffusion-based approaches. We analyze the impact of model scaling and of different sampling methods on ASR performance. Furthermore, we compare the considered model with predictive and diffusion-based baselines and analyze speech recognition performance when using different pre-trained ASR models. The proposed approach significantly reduces the word error rate: by approximately 40% relative to the unprocessed speech signals and by approximately 8% relative to a similarly sized predictive approach.
Channel-Combination Algorithms for Robust Distant Voice Activity and Overlapped Speech Detection
Théo Mariotte, Anthony Larcher, Silvio Montrésor
et al.
Voice Activity Detection (VAD) and Overlapped Speech Detection (OSD) are key pre-processing tasks for speaker diarization. In the meeting context, it is often easier to capture speech with a distant device. However, distant capture leads to severe performance degradation. We study a unified supervised learning framework to solve distant multi-microphone joint VAD and OSD (VAD+OSD). This paper investigates various multi-channel VAD+OSD front-ends that weight and combine incoming channels. We propose three algorithms based on the Self-Attention Channel Combinator (SACC), previously proposed in the literature. Experiments conducted on the AMI meeting corpus show that channel combination approaches bring significant VAD+OSD improvements in the distant speech scenario. Specifically, we explore the use of learned complex combination weights and demonstrate the benefits of such an approach in terms of explainability. Channel combination-based VAD+OSD systems are evaluated on the final back-end task, i.e. speaker diarization, and show significant improvements. Finally, since multi-channel systems are trained given a fixed array configuration, they may fail to generalize to other array set-ups, e.g. a mismatched number of microphones. A channel-number invariant loss is proposed to learn a unique feature representation regardless of the number of available microphones. The evaluation conducted on mismatched array configurations highlights the robustness of this training strategy.
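A channel-weighting front-end of the kind described above can be sketched as a softmax competition across microphones. This is a strong simplification of SACC (a single learned scoring vector instead of full self-attention), and the shapes and names are illustrative assumptions:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def combine_channels(feats, w_query):
    """feats: (channels, time, freq) magnitude features.
    A learned query scores each channel per frame; channels are then
    averaged with the resulting softmax weights (simplified SACC-style)."""
    scores = feats @ w_query               # (channels, time) per-frame scores
    weights = softmax(scores, axis=0)      # channels compete per frame
    combined = (weights[..., None] * feats).sum(axis=0)  # (time, freq)
    return combined, weights

rng = np.random.default_rng(0)
feats = rng.random((4, 10, 16))   # 4 mics, 10 frames, 16 frequency bins
w_query = rng.random(16)          # hypothetical learned scoring vector
y, w = combine_channels(feats, w_query)
```

Because the weights are normalized per frame across channels, the same combiner applies to any number of microphones, which is the property the channel-number invariant loss exploits.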
Building speech corpus with diverse voice characteristics for its prompt-based representation
Aya Watanabe, Shinnosuke Takamichi, Yuki Saito
et al.
In text-to-speech synthesis, the ability to control voice characteristics is vital for various applications. By leveraging thriving text prompt-based generation techniques, it should be possible to enhance the nuanced control of voice characteristics. While previous research has explored the prompt-based manipulation of voice characteristics, most studies have used pre-recorded speech, which limits the diversity of voice characteristics available. Thus, we aim to address this gap by creating a novel corpus and developing a model for prompt-based manipulation of voice characteristics in text-to-speech synthesis, facilitating a broader range of voice characteristics. Specifically, we propose a method to build a sizable corpus pairing voice-characteristics descriptions with corresponding speech samples. This involves automatically gathering voice-related speech data from the Internet, ensuring its quality, and manually annotating it using crowdsourcing. We implement this method with Japanese language data and analyze the results to validate its effectiveness. Subsequently, we propose a method to construct a model that retrieves speech from voice-characteristics descriptions, based on contrastive learning. We train the model using not only conventional contrastive learning but also feature prediction learning to predict quantitative speech features corresponding to voice characteristics. We evaluate the model's performance via experiments with the constructed corpus.
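The contrastive retrieval objective described above is commonly implemented as a symmetric InfoNCE loss over paired description/speech embeddings; this NumPy sketch assumes that formulation, and all shapes and names are illustrative:

```python
import numpy as np

def info_nce(desc_emb, speech_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of paired description/speech embeddings:
    each description should score highest against its own speech sample."""
    d = desc_emb / np.linalg.norm(desc_emb, axis=1, keepdims=True)
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    logits = d @ s.T / tau                       # (batch, batch) cosine similarities
    idx = np.arange(len(d))

    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return (-logp[idx, idx]).mean()          # cross-entropy of the diagonal

    return 0.5 * (ce(logits) + ce(logits.T))     # description->speech + reverse

rng = np.random.default_rng(0)
desc = rng.standard_normal((8, 16))              # 8 pairs, 16-dim embeddings
loss_matched = info_nce(desc, desc, tau=0.05)    # perfectly aligned pairs
loss_shuffled = info_nce(desc, rng.standard_normal((8, 16)), tau=0.05)
```

The loss is near zero when paired embeddings coincide and large when they are unrelated, which is what drives description-to-speech retrieval.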
Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks
Yang Ai, Zhen-Hua Ling
This paper presents a novel neural speech phase prediction model which predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is a core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses, defined between the predicted wrapped phase spectra and the natural ones, by activating the instantaneous phase error, group delay error and instantaneous angular frequency error with an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions and knowledge distillation training strategies. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms iterative phase estimation algorithms and neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with a HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of the synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra only via neural networks.
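The phase calculation formula and the anti-wrapping idea can be sketched as follows. This is a simplified illustration based only on the description above (instantaneous-phase term, one candidate anti-wrapping function); the actual branch outputs come from learned convolutional layers:

```python
import numpy as np

def parallel_phase(r_out, i_out):
    """Phase calculation formula of the parallel estimation architecture:
    two parallel branches play the roles of real and imaginary parts, and
    atan2 strictly restricts the phase to the principal interval (-pi, pi]."""
    return np.arctan2(i_out, r_out)

def anti_wrap(err):
    """A candidate anti-wrapping function: |err| folded onto [0, pi].
    It is even (parity), 2*pi-periodic, and monotonic on [0, pi],
    matching the three properties required in the paper."""
    return np.abs(err - 2 * np.pi * np.round(err / (2 * np.pi)))

# A wrapped prediction of -pi and a natural phase of +pi differ by 2*pi,
# yet describe the same angle; the anti-wrapping function scores it as zero.
pred = parallel_phase(np.array([1.0, -1.0]), np.array([1.0, 0.0]))  # [pi/4, pi]
ip_loss = anti_wrap(pred - np.array([np.pi / 4, -np.pi])).mean()
```

Without the anti-wrapping activation, that 2*pi discontinuity would inflate the loss even for a perfect prediction, which is the "error expansion" the paper avoids.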
Human-AI Teaming During an Ongoing Disaster: How Scripts Around Training and Feedback Reveal this is a Form of Human-Machine Communication
Keri K. Stephens, Anastazja G. Harris, Amanda L. Hughes
et al.
Humans play an integral role in identifying important information from social media during disasters. While human annotation of social media data to train machine learning models is often viewed as human-computer interaction, this study interrogates the ontological boundary between such interaction and human-machine communication. We conducted multiple interviews with participants who both labeled data to train machine learning models and corrected machine-inferred data labels. Findings reveal three themes: scripts invoked to manage decision-making, contextual scripts, and scripts around perceptions of machines. Humans use scripts around training the machine—a form of behavioral anthropomorphism—to develop social relationships with them. Correcting machine-inferred data labels changes these scripts and evokes self-doubt around who is right, which substantiates the argument that this is a form of human-machine communication.
Technology (General), Oral communication. Speech
How can the content of a child's narrative be assessed in an under-described, under-resourced language? Methodology and first results for French Sign Language (Langue des Signes Française).
Stéphanie Caët, Marion Blondel, Claire Danet
et al.
Context. Speech-language therapists in France work with children who use languages other than French. Some of these languages are under-described and under-resourced. This is notably the case for French Sign Language (LSF): professionals have few tools at their disposal for assessing the language skills of signing children. This is partly explained by the diversity and atypical nature of acquisition pathways for this language, as well as by the specificities of the professional contexts in which it is used (research, teaching, speech-language therapy).
Objectives. In this article, we present the methodology adopted to create a tool for analyzing and assessing narratives that can be used by researchers and practitioners and adapted to their respective needs.
Methods. Narratives, elicited with a wordless cartoon, were collected from 18 deaf adult LSF signers to build a reference corpus. Drawing on usage-based approaches to language and its acquisition, we inventoried all the micro-units of information produced by these signers, then discarded those produced by fewer than 25% of them. Using this grid, we compared the results obtained for these 18 deaf adult signers with those obtained for 18 hearing, non-signing adult speakers of French. We also analyzed the narratives of 29 deaf children aged 4 to 11 for whom LSF is the language of schooling. The descriptions and comments produced by the signers were also identified.
Results and conclusions. We suggest that the differences observed between the narratives of deaf adult signers (in LSF) and hearing, non-signing adults (in French) underline the need to take the specificity of each language and its use into account when creating this type of tool. Within the group of deaf signing children, a significant positive correlation was observed between age and the number of micro-units produced, as well as between age and the number of descriptions and comments.
Oral communication. Speech, Pathology
HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition
Florian Mai, Juan Zuluaga-Gomez, Titouan Parcollet
et al.
State-of-the-art ASR systems have achieved promising results by modeling local and global interactions separately. While the former can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are expensive for long input sequences. Here, we address this by extending HyperMixer, an efficient alternative to attention exhibiting linear complexity, to the Conformer architecture for speech recognition, leading to HyperConformer. In particular, multi-head HyperConformer achieves comparable or higher recognition performance while being more efficient than Conformer in terms of inference speed, memory, parameter count, and available training data. HyperConformer achieves a word error rate of 2.9% on Librispeech test-clean with less than 8M neural parameters and a peak memory during training of 5.7GB, hence trainable with accessible hardware. The encoder is between 38% (on mid-length speech) and 56% (on long speech) faster than an equivalent Conformer. (The HyperConformer recipe is publicly available in: https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/ASR/transformer/)
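The linear-complexity token mixing that replaces attention can be sketched roughly as follows. This is a strong simplification of HyperMixer (a single pair of input-generated mixing matrices, no positional information or multi-head structure), so treat every name and shape as an assumption:

```python
import numpy as np

def hypermixer_token_mixing(x, w_gen1, w_gen2):
    """Simplified HyperMixer-style token mixing with cost linear in the
    sequence length N: the N x d' mixing matrices are generated from the
    tokens themselves instead of forming an N x N attention map."""
    W1 = x @ w_gen1                       # (N, d') input-generated matrix
    W2 = x @ w_gen2                       # (N, d') input-generated matrix
    hidden = np.maximum(W1.T @ x, 0.0)    # (d', d) pooled token summary, ReLU
    return W2 @ hidden                    # (N, d) mixed token representations

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 32))         # 50 tokens, model dim 32
w_gen1 = rng.standard_normal((32, 8))     # hypothetical generator weights
w_gen2 = rng.standard_normal((32, 8))
y = hypermixer_token_mixing(x, w_gen1, w_gen2)
```

The cost is O(N·d·d') rather than the O(N²·d) of dot-product attention, which is where the long-speech speed-ups come from.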
Pre-Finetuning for Few-Shot Emotional Speech Recognition
Maximillian Chen, Zhou Yu
Speech models have long been known to overfit individual speakers for many classification tasks. This leads to poor generalization in settings where the speakers are out-of-domain or out-of-distribution, as is common in production environments. We view speaker adaptation as a few-shot learning problem and propose investigating transfer learning approaches inspired by recent success with pre-trained models in natural language tasks. We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives. We pre-finetune Wav2Vec2.0 on every permutation of four multiclass emotional speech recognition corpora and evaluate our pre-finetuned models through 33,600 few-shot fine-tuning trials on the Emotional Speech Dataset.
The impact of face masks on the recall of spoken sentences
Thanh Lan Truong, Sara D. Beck, Andrea Weber
The effect of face covering masks on listeners' recall of spoken sentences was investigated. Thirty-two German native listeners watched video recordings of a native speaker producing German sentences with and without a face mask, and then completed a cued-recall task. Listeners recalled significantly fewer words when the sentences had been spoken with a face mask. This might suggest that face masks increase processing demands, which in turn leaves fewer resources for encoding speech in memory. The result is also informative for policy-makers during the COVID-19 pandemic, regarding the impact of face masks on oral communication.
Understanding gender bias toward physicians using online doctor reviews
Gupta Sonam, Jordan Kayla
Gender bias continues to be an ongoing issue in the field of medicine. While bias may come in many forms, patients’ biases and perceptions have been understudied and may impact adherence to treatment, leading to unequal outcomes. Online reviews for doctors are a naturalistic way to study gender bias. In this study, we leveraged the LIWC psychological linguistic analysis tool to analyze the language styles of ZocDoc and RateMDs reviews and understand the potential role of gender in patients’ perceptions of their doctors. Mean differences were calculated using bootstrapped hierarchical linear modeling. We found that reviews for female physicians are generally more informal and emotional than those for male physicians. While our study was exploratory, the results suggest that both patients and physicians need to increase their awareness of how their biases may be affecting how they give and receive vital health information.
Oral communication. Speech, Psychology
Analysing Diffusion-based Generative Approaches versus Discriminative Approaches for Speech Restoration
Jean-Marie Lemercier, Julius Richter, Simon Welker
et al.
Diffusion-based generative models have had a high impact on the computer vision and speech processing communities in recent years. Besides data generation tasks, they have also been employed for data restoration tasks like speech enhancement and dereverberation. While discriminative models have traditionally been argued to be more powerful, e.g. for speech enhancement, generative diffusion approaches have recently been shown to narrow this performance gap considerably. In this paper, we systematically compare the performance of generative diffusion models and discriminative approaches on different speech restoration tasks. For this, we extend our prior contributions on diffusion-based speech enhancement in the complex time-frequency domain to the task of bandwidth extension. We then compare it to a discriminatively trained neural network with the same network architecture on three restoration tasks, namely speech denoising, dereverberation and bandwidth extension. We observe that the generative approach performs globally better than its discriminative counterpart on all tasks, with the strongest benefit for non-additive distortion models, as in dereverberation and bandwidth extension. Code and audio examples can be found online at https://uhh.de/inf-sp-sgmsemultitask
Communication of deaf people based on the oral method
Kovačević Jasmina M., Radovanović Vesna S.
Throughout history, the phenomenon of hearing impairment has been the subject of interest of altruists, philosophers, doctors, priests, teachers and pedagogues. Today, this phenomenon is approached from a multidisciplinary perspective: medical, special education, sociological, pedagogical, psychological, linguistic, cultural and technological, all of which have the communication of deaf and hard-of-hearing people in common. Different approaches to building communication systems for the deaf have led to three basic groups of methods: nonverbal, oral and combined. This paper presents an analysis of the development and application of the oral method, which at a certain point in time played a significant role in the education and rehabilitation of the deaf, with special emphasis on methods of demutation, i.e., speech and language construction in the deaf.
History (General) and history of Europe, Social sciences (General)
Guided Variational Autoencoder for Speech Enhancement With a Supervised Classifier
Guillaume Carbajal, Julius Richter, Timo Gerkmann
Recently, variational autoencoders have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. However, variational autoencoders are trained on clean speech only, which limits their ability to extract the speech signal from noisy speech compared to supervised approaches. In this paper, we propose to guide the variational autoencoder with a supervised classifier separately trained on noisy speech. The estimated label is a high-level categorical variable describing the speech signal (e.g. speech activity), allowing for a more informed latent distribution compared to the standard variational autoencoder. We evaluate our method with different types of labels on real recordings of different noisy environments. Provided that the label better informs the latent distribution and that the classifier achieves good performance, the proposed approach outperforms the standard variational autoencoder and a conventional neural network-based supervised approach.
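One minimal way to picture the guiding mechanism is an encoder whose input is the noisy features concatenated with the classifier's categorical label, so the latent distribution can depend on, e.g., detected speech activity. This sketch shows only that conditioning step; all shapes, weights, and names are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def label_conditioned_latent(noisy_feats, label_onehot, w_enc, rng):
    """Sketch of a guided latent distribution: the encoder sees the noisy
    features concatenated with a one-hot label (e.g. speech activity),
    produces Gaussian parameters, and samples via the reparameterization
    trick z = mu + sigma * eps."""
    h = np.concatenate([noisy_feats, label_onehot])
    stats = w_enc @ h                       # (2 * latent_dim,) linear encoder
    mu, logvar = np.split(stats, 2)
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    return z, mu, logvar

rng = np.random.default_rng(0)
feats = rng.random(20)                      # hypothetical noisy-speech features
label = np.array([0.0, 1.0])                # e.g. "speech active"
w_enc = rng.standard_normal((16, 22))       # latent_dim = 8
z, mu, logvar = label_conditioned_latent(feats, label, w_enc, rng)
```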
Speech Enhancement for Wake-Up-Word detection in Voice Assistants
David Bonet, Guillermo Cámbara, Fernando López
et al.
Keyword spotting and in particular Wake-Up-Word (WUW) detection is a very important task for voice assistants. A very common issue of voice assistants is that they are easily activated by background noise such as music, TV, or background speech that accidentally triggers the device. In this paper, we propose a Speech Enhancement (SE) model adapted to the task of WUW detection that aims at increasing the recognition rate and reducing the false alarms in the presence of these types of noises. The SE model is a fully-convolutional denoising auto-encoder at waveform level and is trained using log-Mel spectrogram and waveform reconstruction losses together with the BCE loss of a simple WUW classification network. A new database has been purposely prepared for the task of recognizing the WUW in challenging conditions, containing negative samples that are phonetically very similar to the keyword. The database is extended with public databases and exhaustive data augmentation to simulate different noises and environments. The results obtained by concatenating the SE with simple and state-of-the-art WUW detectors show that the SE does not have a negative impact on the recognition rate in quiet environments while increasing the performance in the presence of noise, especially when the SE and WUW detector are trained jointly end-to-end.
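The joint training objective described above (waveform and log-Mel reconstruction losses plus the BCE of the downstream WUW classifier) can be sketched as a weighted sum. The weights and shapes below are illustrative assumptions, not the paper's values:

```python
import numpy as np

def se_wuw_loss(wav_hat, wav_ref, mel_hat, mel_ref, wuw_prob, wuw_label,
                w_wav=1.0, w_mel=1.0, w_bce=0.5):
    """Weighted sum of three terms: waveform L1, log-Mel spectrogram L1,
    and binary cross-entropy of the WUW classifier's output probability."""
    l_wav = np.abs(wav_hat - wav_ref).mean()
    l_mel = np.abs(np.log(mel_hat + 1e-5) - np.log(mel_ref + 1e-5)).mean()
    p = np.clip(wuw_prob, 1e-7, 1 - 1e-7)   # avoid log(0)
    l_bce = -(wuw_label * np.log(p) + (1 - wuw_label) * np.log(1 - p)).mean()
    return w_wav * l_wav + w_mel * l_mel + w_bce * l_bce

rng = np.random.default_rng(0)
wav = rng.standard_normal(160)              # hypothetical waveform frame
mel = rng.random((8, 10)) + 0.1             # hypothetical Mel magnitudes
loss_perfect = se_wuw_loss(wav, wav, mel, mel, np.array([1.0]), np.array([1.0]))
```

Because the BCE term is backpropagated through the enhancer during joint end-to-end training, the SE model learns to preserve exactly the cues the WUW detector relies on.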