In this article, Jan Vansina’s essay on the deep-time population history of the Kwilu-Kasai region, now more than half a century old, is revisited through the kaleidoscopic lens of linguistic, archaeological, palaeoecological and genetic evidence newly collected between 2018 and 2023. In contrast to Vansina’s view of the region as a terminus for the settlement of diverse Bantu speech communities, the area was likely an important center of diversification and expansion for early Bantu speakers. Under a drier climate favoring the spread of grassland and open woodland around Idiofa, but not a general loss of forest cover, Bantu speakers most probably started occupying the region from the 4th century BCE onwards. Pottery-producing settlements first emerged along the Kwilu River, namely at Luani (371–148 BCE) and Lukombe (357–27 BCE), and then near Idiofa (146–18 BCE), where the earliest evidence for iron metallurgy marks the actual start of the Early Iron Age (146 BCE–750 CE) south of the Congo rainforest. The first settlers lived in small, isolated communities with little mutual communication. These were perfect conditions for early divergence within the Bantu language family’s major West-Coastal Bantu branch. While modern language data situate the West-Coastal Bantu homeland between the Kasai and Kamtsha Rivers, the currently available archaeological data indicate that this zone was only settled towards the beginning of the common era. This apparent mismatch between archaeology and historical linguistics might be due to subsequent processes of language death and population relocation and restructuring. Confirming Vansina’s idea that modern-day Kwilu-Kasai communities do not directly descend from the region’s earliest Bantu-speaking settlers, significant population reshuffling did indeed occur after the Early Iron Age. During the so-called ‘hiatus period’ (750–1450 CE), human activity contracted to a few refuge areas, especially near the modern towns of Bandundu and Idiofa, and the vegetation underwent an irreversible climate-induced shift to open habitats, possibly connected to the Medieval Climate Anomaly. During the Late Iron Age, the study area was rapidly resettled, first in the south (15th–16th c. CE) and then in the north (17th–18th c. CE), and large cross-regional exchange networks developed. In line with oral histories still current among Kwilu-Kasai peoples today, and with the clans and political and social institutions they share, the region underwent far-reaching demographic, linguistic and cultural integration in the four to five centuries preceding the start of European colonization, a process that may have involved populations no longer present there today, such as Ubangi-speaking groups.
Developing code-switched ASR systems is challenging due to language ambiguity and models' limited exposure to multilingual, code-switched data, while collecting such speech is costly. Prior work generates synthetic audio from text, but these methods are computationally intensive and hard to scale. We introduce AsyncSwitch, a novel asynchronous adaptation framework that leverages large-scale, text-rich web data to pre-expose ASR models to diverse code-switched domains before fine-tuning on paired speech-text corpora. Our three-stage process (1) trains the decoder's self-attention and feedforward layers on code-switched text, (2) aligns decoder and encoder via cross-attention using limited speech-text data, and (3) fully fine-tunes the entire model. Experiments with Whisper on Malay-English code-switching demonstrate a 9.02% relative WER reduction, while also improving monolingual performance in Singlish, Malay, and other English variants.
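As a rough illustration of this three-stage schedule, the sketch below toggles which Whisper parameter groups are trainable at each stage. It assumes Hugging Face Whisper parameter naming ("decoder.layers", "encoder_attn") and omits the data pipelines and training loops; it is a minimal sketch of the staging logic, not the authors' released code.

```python
# Minimal sketch of the three-stage AsyncSwitch-style schedule (assumes
# Hugging Face Whisper parameter naming; training loops and data omitted).
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def set_trainable(model, patterns):
    # Freeze everything, then unfreeze parameters whose names match a pattern.
    for name, p in model.named_parameters():
        p.requires_grad = any(pat in name for pat in patterns)

# Stage 1: decoder self-attention and feed-forward layers, trained on
# code-switched web text (the decoder acts as a language model here).
set_trainable(model, ["decoder.layers"])
for name, p in model.named_parameters():
    if "encoder_attn" in name:      # cross-attention stays frozen in stage 1
        p.requires_grad = False

# Stage 2: align decoder and encoder via cross-attention on limited paired data.
set_trainable(model, ["encoder_attn"])

# Stage 3: fully fine-tune the entire model on the paired speech-text corpus.
for p in model.parameters():
    p.requires_grad = True
```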
COVID-19 is a highly contagious disease caused by the SARS-CoV-2 coronavirus, and its pandemic has gradually affected the entire world. While most symptoms of this disease disappear during the acute phase, some symptoms may persist for several months. This condition is referred to as the post-COVID-19 syndrome. Cognitive impairment is one of the possible long-lasting symptoms that are directly related to patients' communication abilities. These difficulties manifest mainly in the area of verbal recall, verbal fluency and the informational value of discourse. This review summarises the existing information on cognitive-communication disorders associated with the post-COVID-19 syndrome, but also provides an overview of other long-term persistent complications of language, speech and swallowing that may arise from COVID-19.
Nikolaos Flemotomos, Roger Hsiao, Pawel Swietojanski, et al.
Neural contextual biasing allows speech recognition models to leverage contextually relevant information, leading to improved transcription accuracy. However, the biasing mechanism is typically based on a cross-attention module between the audio and a catalogue of biasing entries, which means computational complexity can pose severe practical limitations on the size of the biasing catalogue and consequently on accuracy improvements. This work proposes an approximation to cross-attention scoring based on vector quantization and enables compute- and memory-efficient use of large biasing catalogues. We propose to use this technique jointly with a retrieval-based contextual biasing approach. First, we use an efficient quantized retrieval module to shortlist biasing entries by grounding them on audio. Then we use the retrieved entries for biasing. Since the proposed approach is agnostic to the biasing method, we investigate using full cross-attention, LLM prompting, and a combination of the two. We show that retrieval-based shortlisting allows the system to efficiently leverage biasing catalogues of several thousand entries, resulting in up to 71% relative error rate reduction in personal entity recognition. At the same time, the proposed approximation algorithm reduces compute time by 20% and memory usage by 85-95%, for lists of up to one million entries, when compared to standard dot-product cross-attention.
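A minimal sketch of the shortlisting idea follows: catalogue embeddings are quantized offline with k-means, audio queries score only the K codewords rather than all N entries, and entries assigned to the best-scoring codewords are shortlisted for the full biasing stage. The shapes, the k-means routine and the top-codes cutoff are illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def kmeans(E, K, iters=10):
    # Offline: cluster the N catalogue embeddings E [N, d] into K codewords.
    C = E[torch.randperm(E.size(0))[:K]].clone()
    for _ in range(iters):
        assign = torch.cdist(E, C).argmin(dim=1)
        for k in range(K):
            members = assign == k
            if members.any():
                C[k] = E[members].mean(dim=0)
    return C, assign

def shortlist(Q, C, assign, top_codes=8):
    # Approximate cross-attention scoring: audio queries Q [T, d] score the
    # K codewords instead of all N entries (O(T*K) rather than O(T*N)).
    code_scores = Q @ C.T
    best = code_scores.max(dim=0).values.topk(top_codes).indices
    keep = torch.isin(assign, best)
    return torch.nonzero(keep).squeeze(1)   # indices of shortlisted entries
```

Full cross-attention or LLM prompting then operates only on the shortlisted entries, which is what keeps compute and memory bounded for million-entry catalogues.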
S. Johanan Joysingh, P. Vijayalakshmi, T. Nagarajan
Whisper, as a form of speech, is not sufficiently addressed by mainstream speech applications, because systems built for normal speech do not work as expected for whispered speech. A first step towards building a speech application that is inclusive of whispered speech is the successful classification of whispered and normal speech. Such a front-end classification system is expected to have high accuracy and low computational overhead, which is the scope of this paper. One of the characteristics of whispered speech is the absence of the fundamental frequency (or pitch), and hence of the pitch harmonics as well. The presence of the pitch and pitch harmonics in normal speech, and their absence in whispered speech, is evident in the spectral envelope of the Fourier transform. We observe that this characteristic is predominant in the first quarter of the spectrum, and exploit it as a feature. We propose the use of one-dimensional convolutional neural networks (1D-CNN) to capture these features from the quartered spectral envelope (QSE). The system yields an accuracy of 99.31% when trained and tested on the wTIMIT dataset, and 100% on the CHAINS dataset. The proposed feature is compared with Mel frequency cepstral coefficients (MFCC), a staple in the speech domain. The proposed classification system is also compared with a state-of-the-art system based on log-filterbank energy (LFBE) features trained on a long short-term memory (LSTM) network. The proposed 1D-CNN system performs better than, or as well as, the state-of-the-art across multiple experiments, converges sooner, and incurs lower computational overhead. Finally, the proposed system is evaluated in the presence of white noise at various signal-to-noise ratios and found to be robust.
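The following sketch illustrates the feature and classifier described above: a log-magnitude spectral envelope truncated to its first quarter, fed to a small 1D-CNN. Frame length, FFT size and layer sizes are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
import torch
import torch.nn as nn

def quartered_spectral_envelope(frame, n_fft=512):
    # Log-magnitude spectrum of one frame; keep only the first quarter, where
    # pitch harmonics (present in normal speech, absent in whisper) dominate.
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    return np.log(spec + 1e-8)[: (n_fft // 2 + 1) // 4].astype(np.float32)

class QSEClassifier(nn.Module):
    # Small 1D-CNN over the 64-bin QSE feature (sizes are illustrative).
    def __init__(self, in_len=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Flatten(), nn.Linear(32 * (in_len // 4), 2))

    def forward(self, x):           # x: [batch, 1, in_len]
        return self.net(x)          # logits: whispered vs. normal
```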
Room acoustical parameters (RAPs), room geometrical parameters (RGPs) and instantaneous occupancy level are essential metrics for parameterizing the room acoustical characteristics (RACs) of a sound field around a listener's local environment, offering comprehensive indications for various applications. Current blind estimation methods either fail to cover a broad range of real-world acoustic environments in the context of real background noise or estimate only a few RAPs and RGPs from noisy single-channel speech signals. In addition, they are limited in their ability to estimate the instantaneous occupancy level. In this paper, we propose a new universal blind estimation framework called the blind estimator of room parameters (BERP) to estimate RAPs, RGPs and occupancy level via a unified methodology. It consists of two modules: a unified room feature encoder that combines attention mechanisms with convolutional layers to learn common features across room parameters, and multiple separate parametric predictors for continuous estimation of each parameter in parallel. The combination of attention and convolutions enables the model to capture acoustic features locally and globally from speech, yielding more robust and multitask generalizable common features. Separate predictors allow the model to independently optimize for each room parameter to reduce task learning conflict and improve per-task performance. This estimation framework enables universal and efficient estimation of room parameters while maintaining satisfactory performance. To evaluate the effectiveness of the proposed framework, we compile a task-specific dataset from several publicly available datasets, including synthetic and real reverberant recordings. The results reveal that BERP achieves state-of-the-art (SOTA) performance and excellent adaptability to real-world scenarios. The code and weights are available on GitHub.
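The two-module layout described above can be sketched as follows: a shared convolution-plus-attention encoder produces common features, and separate regression heads predict each room parameter in parallel. Dimensions, parameter names and the pooling are illustrative assumptions; the actual BERP architecture differs in detail.

```python
import torch
import torch.nn as nn

class BERPSketch(nn.Module):
    """Sketch of the BERP layout: a shared conv+attention encoder and
    separate per-parameter regression heads (real model details differ)."""
    def __init__(self, n_mels=80, d=256, params=("rt60", "volume", "occupancy")):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d, 3, padding=1), nn.ReLU(),
            nn.Conv1d(d, d, 3, padding=1), nn.ReLU())
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.heads = nn.ModuleDict({p: nn.Linear(d, 1) for p in params})

    def forward(self, mel):                     # mel: [B, n_mels, T]
        h = self.conv(mel).transpose(1, 2)      # [B, T, d] local features
        h, _ = self.attn(h, h, h)               # global context via attention
        g = h.mean(dim=1)                       # utterance-level pooling
        return {p: head(g).squeeze(-1) for p, head in self.heads.items()}
```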
Semantic communication is a promising technology to improve communication efficiency by transmitting only the semantic information of the source data. However, traditional semantic communication methods primarily focus on data reconstruction tasks, which may not be efficient for emerging generative tasks such as text-to-speech (TTS) synthesis. To address this limitation, this paper develops a novel generative semantic communication framework for TTS synthesis, leveraging generative artificial intelligence technologies. Firstly, we utilize a pre-trained large speech model called WavLM and the residual vector quantization method to construct two semantic knowledge bases (KBs) at the transmitter and receiver, respectively. The KB at the transmitter enables effective semantic extraction, while the KB at the receiver facilitates lifelike speech synthesis. Then, we employ a transformer encoder and a diffusion model to achieve efficient semantic coding without introducing significant communication overhead. Finally, numerical results demonstrate that our framework achieves much higher fidelity for the generated speech than four baselines, under both additive white Gaussian noise and Rayleigh fading channels.
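The residual vector quantization underlying the two knowledge bases can be sketched as below: the transmitter sends only codeword indices, and the receiver's copy of the codebooks reconstructs the semantic representation for synthesis. Codebook contents and dimensions are hypothetical.

```python
import torch

def rvq_encode(x, codebooks):
    # Residual VQ: each stage quantizes the residual left by the previous one.
    # x: [B, d] semantic features; codebooks: list of [K, d] tensors (the KB).
    residual, codes = x, []
    for C in codebooks:
        idx = torch.cdist(residual, C).argmin(dim=1)
        codes.append(idx)                 # only these indices are transmitted
        residual = residual - C[idx]
    return codes

def rvq_decode(codes, codebooks):
    # Receiver side: the shared KB turns indices back into semantic features.
    return sum(C[idx] for idx, C in zip(codes, codebooks))
```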
Milehkina, Tatyana A., Baikulova, Alla Nikolayevna
The article analyzes the modern speech of children aged from 2–3 to 14 years and examines how other languages and cultures influence it. The research material comprises handwritten and transcribed audio and video recordings of children’s colloquial speech, as well as children’s correspondence in the Viber messenger. The material was collected during 2016–2022 and totals about 22,000 instances of word usage. The methods used are participant observation and discourse analysis. The vocabulary of children aged 2–3 to 14 years was shown to be actively replenished with lexical units sourced from English and Internet slang. The analysis of oral and written genres (post, stories, battle, live broadcast) created by children within the popular social networks TikTok, Likee, Instagram (2017) allowed us to establish that children typically copy the speech patterns and behaviors of popular bloggers. A study of children’s voice and written messages in the smartphone messengers Viber and Telegram (2022) revealed the formation of a specific vocabulary in children’s speech, including not only Anglicisms and school jargon but also a telltale sign of Internet communication: the reduction of words to one syllable. Written messages are organized in a specific manner, consisting of statements, each of which is presented as a separate post. Various surveys and different types of tasks based on the “reply and send to a friend” pattern are common. Using virtual reality, children strive for self-realization: they sing, act out dialogues with toys, and conduct “scientific research”, but in the process they rely on less than perfect, most often foreign, role models. The authors conclude that not only the English-language content of the Internet, but also the speech behavior, ideals and values advertised by popular, primarily Western, leaders of the blogosphere have a significant impact on modern children’s speech.
Congenital myotonic dystrophy (CDM) is the most severe form of myotonic dystrophy type 1 (DM1), an autosomal dominant multisystem disorder. Myotonic dystrophies are primarily degenerative, genetically determined and progressive disorders of skeletal and smooth muscle. Myotonic dystrophy is a multi-organ disease, affecting the eyes, heart, endocrine system and central (and sometimes peripheral) nervous system. Speech and language therapy for a child with CDM is recommended from the moment of birth, because of the higher risk of dysphagia. This paper presents the role of a clinical speech therapist in the diagnosis and therapy of a patient during postnatal hospitalisation in a neonatal intermediate care unit.
Owen Henkel, Hannah Horne-Robinson, Libby Hills, et al.
This paper reports on a set of three recent experiments utilizing large-scale speech models to evaluate the oral reading fluency (ORF) of students in Ghana. While ORF is a well-established measure of foundational literacy, assessing it typically requires one-on-one sessions between a student and a trained evaluator, a process that is time-consuming and costly. Automating the evaluation of ORF could support better literacy instruction, particularly in education contexts where formative assessment is uncommon due to large class sizes and limited resources. To our knowledge, this research is among the first to examine the use of the most recent versions of large-scale speech models (Whisper V2 and wav2vec2.0) for ORF assessment in the Global South. We find that Whisper V2 produces transcriptions of Ghanaian students reading aloud with a Word Error Rate of 13.5%. This is close to the model's average WER on adult speech (12.8%) and would have been considered state-of-the-art for children's speech transcription only a few years ago. We also find that when these transcriptions are used to produce fully automated ORF scores, they closely align with scores generated by expert human graders, with a correlation coefficient of 0.96. Importantly, these results were achieved on a representative dataset (i.e., students with regional accents, recordings taken in actual classrooms), using a free and publicly available speech model out of the box (i.e., no fine-tuning). This suggests that using large-scale speech models to assess ORF may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.
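For concreteness, the two reported quantities, WER and the correlation with expert graders, can be computed with standard tooling as sketched below on hypothetical placeholder data; the study's actual scoring pipeline is not shown here.

```python
import jiwer
from scipy.stats import pearsonr

# Placeholder passages; the study's recordings and scores are not public here.
reference_texts = ["the sun was hot and the boys ran to the river"]
asr_transcripts = ["the sun was hot and the boy ran to the river"]
print(jiwer.wer(reference_texts, asr_transcripts))  # study reports ~0.135

# Automated vs. expert oral reading fluency scores (hypothetical values).
automated_orf = [42.0, 55.5, 61.0, 38.5]
human_orf = [41.0, 57.0, 60.0, 40.0]
r, _ = pearsonr(automated_orf, human_orf)           # study reports r = 0.96
```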
When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition. However, pseudo-labels are often noisy, containing numerous incorrect tokens, and taking noisy labels as ground truth in the loss function results in suboptimal performance. Previous works attempted to mitigate this issue by either filtering out the noisiest pseudo-labels or improving the overall quality of pseudo-labels. While these methods are effective to some extent, it is unrealistic to entirely eliminate incorrect tokens in pseudo-labels. In this work, we propose a novel framework named alternative pseudo-labeling to tackle the issue of noisy pseudo-labels from the perspective of the training objective. The framework comprises several components. Firstly, a generalized CTC loss function is introduced to handle noisy pseudo-labels by accepting alternative tokens in the positions of incorrect tokens. Applying this loss function in pseudo-labeling requires detecting the incorrect tokens in the predicted pseudo-labels. We adopt a confidence-based error detection method that identifies incorrect tokens by comparing their confidence scores with a given threshold, which requires the confidence scores to be discriminative. Hence, the second proposed technique is a contrastive CTC loss function that widens the confidence gap between correctly and incorrectly predicted tokens, thereby improving the error detection ability. Additionally, obtaining satisfactory performance with confidence-based error detection typically requires extensive threshold tuning. Instead, we propose an automatic thresholding method that uses labeled data as a proxy for determining the threshold, thus avoiding laborious manual tuning.
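A minimal sketch of the confidence-based error detection and the automatic thresholding follows. The selection criterion (F1 of error detection on labeled proxy data) and the threshold grid are assumptions made for illustration; the paper's exact procedure may differ.

```python
import torch

def detect_errors(token_confidence, threshold):
    # Tokens flagged here accept alternative tokens in the generalized CTC loss.
    return token_confidence < threshold

def auto_threshold(conf, is_correct, grid=torch.linspace(0.05, 0.95, 19)):
    # Choose the threshold on labeled data (a proxy) that best detects
    # incorrect tokens, here by maximizing detection F1 (assumed criterion).
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        pred_err, true_err = conf < t, ~is_correct
        tp = (pred_err & true_err).sum().float()
        prec = tp / pred_err.sum().clamp(min=1)
        rec = tp / true_err.sum().clamp(min=1)
        f1 = 2 * prec * rec / (prec + rec).clamp(min=1e-8)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```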
Speaker adaptation techniques provide a powerful solution to customise automatic speech recognition (ASR) systems for individual users. Practical application of unsupervised model-based speaker adaptation techniques to data intensive end-to-end ASR systems is hindered by the scarcity of speaker-level data and performance sensitivity to transcription errors. To address these issues, a set of compact and data efficient speaker-dependent (SD) parameter representations are used to facilitate both speaker adaptive training and test-time unsupervised speaker adaptation of state-of-the-art Conformer ASR systems. The sensitivity to supervision quality is reduced using a confidence score-based selection of the less erroneous subset of speaker-level adaptation data. Two lightweight confidence score estimation modules are proposed to produce more reliable confidence scores. The data sparsity issue, which is exacerbated by data selection, is addressed by modelling the SD parameter uncertainty using Bayesian learning. Experiments on the benchmark 300-hour Switchboard and the 233-hour AMI datasets suggest that the proposed confidence score-based adaptation schemes consistently outperformed the baseline speaker-independent (SI) Conformer model and conventional non-Bayesian, point estimate-based adaptation using no speaker data selection. Similar consistent performance improvements were retained after external Transformer and LSTM language model rescoring. In particular, on the 300-hour Switchboard corpus, statistically significant WER reductions of 1.0%, 1.3%, and 1.4% absolute (9.5%, 10.9%, and 11.3% relative) were obtained over the baseline SI Conformer on the NIST Hub5'00, RT02, and RT03 evaluation sets respectively. Similar WER reductions of 2.7% and 3.3% absolute (8.9% and 10.2% relative) were also obtained on the AMI development and evaluation sets.
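Two of the ingredients above, confidence-based selection of speaker-level data and Bayesian modelling of compact speaker-dependent (SD) parameters, can be sketched as follows. The utterance-level threshold, the scaling-vector parameterization and the Gaussian posterior are illustrative assumptions, not the paper's exact SD representation.

```python
import torch
import torch.nn as nn

def select_adaptation_data(utterances, confidences, threshold=0.8):
    # Confidence-based selection: adapt only on the speaker's less erroneous
    # utterances (the threshold value is illustrative).
    return [u for u, c in zip(utterances, confidences) if c >= threshold]

class BayesianSDScale(nn.Module):
    """Compact speaker-dependent parameterization with Gaussian (Bayesian)
    uncertainty: a learned per-speaker scaling of hidden activations."""
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))
        self.logvar = nn.Parameter(torch.full((dim,), -4.0))

    def forward(self, h):                       # h: [batch, time, dim]
        eps = torch.randn_like(self.mu)         # reparameterized sample
        scale = 1.0 + self.mu + eps * (0.5 * self.logvar).exp()
        return h * scale                        # sampled SD transform
```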
The past decade has witnessed substantial growth of data-driven speech enhancement (SE) techniques thanks to deep learning. While existing approaches have shown impressive performance in some common datasets, most of them are designed only for a single condition (e.g., single-channel, multi-channel, or a fixed sampling frequency) or only consider a single task (e.g., denoising or dereverberation). Currently, there is no universal SE approach that can effectively handle diverse input conditions with a single model. In this paper, we make the first attempt to investigate this line of research. First, we devise a single SE model that is independent of microphone channels, signal lengths, and sampling frequencies. Second, we design a universal SE benchmark by combining existing public corpora with multiple conditions. Our experiments on a wide range of datasets show that the proposed single model can successfully handle diverse conditions with strong performance.
Walter Heymans, Marelie H. Davel, Charl van Heerden
We propose a new framework to improve automatic speech recognition (ASR) systems in resource-scarce environments using a generative adversarial network (GAN) operating on acoustic input features. The GAN is used to enhance the features of mismatched data prior to decoding, or can optionally be used to fine-tune the acoustic model. We achieve improvements that are comparable to multi-style training (MTR), but at a lower computational cost. With less than one hour of data, an ASR system trained on good-quality data and evaluated on mismatched audio achieves relative word error rate (WER) improvements of between 11.5% and 19.7%. Experiments demonstrate that the framework can be very useful in under-resourced environments where training data and computational resources are limited. The GAN does not require parallel training data, because it utilises a baseline acoustic model to provide an additional loss term that guides the generator to create acoustic features that are better classified by the baseline.
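The loss structure described above can be sketched as follows: an adversarial term pulls enhanced features toward the clean-feature distribution, while a frozen baseline acoustic model supplies the extra guidance term, so no parallel data is needed. Feature and senone dimensions, network shapes, and the frame targets are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, SENONES = 40, 500          # hypothetical feature and senone dimensions
G = nn.Sequential(nn.Linear(FEAT, 256), nn.ReLU(), nn.Linear(256, FEAT))
D = nn.Sequential(nn.Linear(FEAT, 256), nn.ReLU(), nn.Linear(256, 1))
am = nn.Sequential(nn.Linear(FEAT, 256), nn.ReLU(), nn.Linear(256, SENONES))
for p in am.parameters():        # frozen baseline acoustic model
    p.requires_grad = False

def generator_loss(noisy_feats, am_targets):
    # No parallel data: the adversarial term pushes enhanced features toward
    # the clean-feature distribution, and the frozen baseline AM adds a
    # classification loss (targets are hypothetical frame labels, e.g. from
    # a first-pass alignment) that guides the generator.
    enhanced = G(noisy_feats)
    adv = F.binary_cross_entropy_with_logits(
        D(enhanced), torch.ones(len(enhanced), 1))
    am_loss = F.cross_entropy(am(enhanced), am_targets)
    return adv + am_loss
```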
The cross-domain performance of automatic speech recognition (ASR) could be severely hampered due to the mismatch between training and testing distributions. Since the target domain usually lacks labeled data, and domain shifts exist at acoustic and linguistic levels, it is challenging to perform unsupervised domain adaptation (UDA) for ASR. Previous work has shown that self-supervised learning (SSL) or pseudo-labeling (PL) is effective in UDA by exploiting the self-supervisions of unlabeled data. However, these self-supervisions also face performance degradation in mismatched domain distributions, which previous work fails to address. This work presents a systematic UDA framework to fully utilize the unlabeled data with self-supervision in the pre-training and fine-tuning paradigm. On the one hand, we apply continued pre-training and data replay techniques to mitigate the domain mismatch of the SSL pre-trained model. On the other hand, we propose a domain-adaptive fine-tuning approach based on the PL technique with three unique modifications: Firstly, we design a dual-branch PL method to decrease the sensitivity to the erroneous pseudo-labels; Secondly, we devise an uncertainty-aware confidence filtering strategy to improve pseudo-label correctness; Thirdly, we introduce a two-step PL approach to incorporate target domain linguistic knowledge, thus generating more accurate target domain pseudo-labels. Experimental results on various cross-domain scenarios demonstrate that the proposed approach effectively boosts the cross-domain performance and significantly outperforms previous approaches.
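One way to realize the uncertainty-aware confidence filtering is sketched below: several stochastic (dropout) decoding passes are run, and an utterance's pseudo-label is kept only if it is both confident and stable across passes. The decode() API, the thresholds and the use of MC dropout are illustrative assumptions, not the paper's exact strategy.

```python
import torch

def filter_pseudo_label(model, batch, n_passes=5, conf_t=0.9, agree_t=0.8):
    model.train()                              # keep dropout active (MC dropout)
    hyps, confs = [], []
    with torch.no_grad():
        for _ in range(n_passes):
            tokens, conf = model.decode(batch)  # hypothetical decode() API
            hyps.append(tokens)
            confs.append(conf)
    ref = hyps[0]
    agreement = sum(h == ref for h in hyps) / n_passes
    keep = sum(confs) / n_passes >= conf_t and agreement >= agree_t
    return ref, keep                           # pseudo-label and keep/drop flag
```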
Speech enhancement in the time-frequency domain is often performed by estimating a multiplicative mask to extract clean speech. However, most neural network-based methods perform point estimation, i.e., their output consists of a single mask. In this paper, we study the benefits of modeling uncertainty in neural network-based speech enhancement. For this, our neural network is trained to map a noisy spectrogram to the Wiener filter and its associated variance, which quantifies uncertainty, based on the maximum a posteriori (MAP) inference of spectral coefficients. By estimating the distribution instead of the point estimate, one can model the uncertainty associated with each estimate. We further propose to use the estimated Wiener filter and its uncertainty to build an approximate MAP (A-MAP) estimator of spectral magnitudes, which in turn is combined with the MAP inference of spectral coefficients to form a hybrid loss function to jointly reinforce the estimation. Experimental results on different datasets show that the proposed method can not only capture the uncertainty associated with the estimated filters, but also yield higher enhancement performance than comparable models that do not take uncertainty into account.
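A minimal sketch of the mask-plus-variance estimation and a Gaussian negative log-likelihood training objective follows; layer sizes are illustrative, and the paper's actual MAP/A-MAP hybrid loss is more elaborate than this plain NLL.

```python
import torch
import torch.nn as nn

class UncertainMaskNet(nn.Module):
    """Sketch: the network outputs both a Wiener-filter estimate and a
    log-variance per time-frequency bin (layer sizes are illustrative)."""
    def __init__(self, bins=257, hidden=512):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(bins, hidden), nn.ReLU())
        self.mask_head = nn.Linear(hidden, bins)   # Wiener filter estimate
        self.var_head = nn.Linear(hidden, bins)    # log-variance (uncertainty)

    def forward(self, noisy_mag):
        h = self.body(noisy_mag)
        return torch.sigmoid(self.mask_head(h)), self.var_head(h)

def nll_loss(mask, logvar, noisy_mag, clean_mag):
    # Gaussian negative log-likelihood: bins where the model is uncertain
    # (large variance) are penalized less for estimation errors.
    err = (mask * noisy_mag - clean_mag) ** 2
    return (err * (-logvar).exp() + logvar).mean()
```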
The paper deals with tendencies in discourse development determined, on the one hand, by the social, economic, cultural and historical characteristics of our present existence and, on the other, by modern technologies of distant communication. The material analyzed comprises texts taken from network and media discourse on the Internet and from the card catalogue of contemporary Russian oral and written speech compiled by the authors. Our model of study includes four components of a communicative situation: subjects, texts, chronotopes and organizational characteristics. We argue that modern discourse is characterized by the following properties of its participants: expanded self-presentation, a juvenile manner of communication, a critical attitude to information, and communicative over-saturation. The semiotic properties of modern discourse characterize the texts used in various types of communicative interaction; they include a significant growth of multimodal content in all types of written communication (corresponding to the predominance of visual information transmitted by electronic media), the vulgarization of speech, and the diffusion of oral and written texts in distant communication. The chronotopic properties of oral discourse consist in the communicative compression of the messages we exchange, in the acceleration of our lives and the shrinkage of communicative turns, and in new demands on everyday existence that make Internet access in every home a vital necessity. The organizational properties of modern communicative practice may be defined as the emergence of inter-discursive hybrid types of communication spanning media and network discourse, the diffusion and merging of personal and institutional as well as private and public communication, and the formation of new rituals, mostly realized in gestures or clips.
Pronunciation is an essential and rudimentary facet of language involved in oral communication. This article attempts to highlight the significance of teaching the pronunciation of the segmental and suprasegmental features of English. The segmental features involve consonant and vowel sounds or phonemes, whereas the suprasegmental features include stress, rhythm, intonation, pitch, length, etc. The central function of a language is communication through speech, and speech is a sequence of segmental as well as suprasegmental features. At the least, intelligible pronunciation is essential for understandable communication. Intelligible pronunciation of any one of Standard British English (SBE), Scottish Standard English or General American English (GAE) is indispensable for proper and effective oral communication in the global context. Standard British English (SBE) is normally used in the context of teaching English to students in Nepal. It is a difficult task to teach standard pronunciation to students whose native tongue is not English, but teaching pronunciation can improve their speaking to some extent. There is not always a one-to-one correspondence between spellings and their sounds in words: a spelling may represent different sounds in different phonetic environments. The author pinpoints some instances where a speaker's wrong pronunciation can lead to misleading communication. To avoid wrong or unintelligible pronunciation, it is necessary to teach the intelligible, standard pronunciation of English to our students.
The study focuses on the origins of the printed transmission of the romances, with particular attention to the surviving post-incunable poetic chapbooks (pliegos poéticos), while also taking into account witnesses known through the Colombine catalogues. The analysis reveals interesting interrelations and associations between romance texts and lyric, cancionero-type texts within these booklets, while also advancing a global interpretation of the phenomenon that helps unravel the process by which the pliegos poéticos were created as textual and editorial products, shaped by carefully considered and clearly intentional selections of texts that form small literary gems of notable publishing success. The article also includes a table of occurrences of romances covering the entire 16th century, as well as two appendices reflecting on the presence of pliegos sueltos and romances in the Colombine registers (Appendix I) and on the intricate interrelations between pliegos sueltos and cancioneros (Appendix II).