Abstract It is demonstrated that the proofs given in prominent and well-established weak generative capacity arguments for natural language are flawed, due to unexpected interpretations of strings. However, once unique representations of lexical semantic senses form part of such intersection-based proofs, the arguments stand.
Abstract: The continuous changes that are affecting, at a rapidly accelerating pace, the various vital domains around the world call first and foremost on national education systems to assess their capacity to design and build educational policies that would meet the demands of globalization. Bilingualism, as a linguistic reality in the Algerian education system and social space, could be a fitting solution to the many challenges of this globalization if it were adopted institutionally.
Keywords: bilingualism as an important asset, relative usage, universal communication, major challenge, globalization
Arts in general, Computational linguistics. Natural language processing
This study examines the integration of AI-based speech recognition (ELSA Speak, Speechling) in a language centre and its effects on pronunciation accuracy and learner confidence. Two groups of intermediate EFL learners (N = 40) completed pre-test and post-test diagnostic pronunciation assessments, with the experimental group receiving additional support through the AI apps, whose usage analytics were recorded. The experimental group significantly outperformed the control group, achieving higher post-test pronunciation scores and greater learner confidence (paired t, p < .001). Consistent engagement in AI-assisted pronunciation practice was associated with improved performance (r = .62, p < .01). Quantitative results showed that learners using AI tools achieved higher pronunciation scores and confidence, while qualitative findings indicated reduced speaking anxiety, sustained motivation, and enhanced learner autonomy, despite minor implementation challenges such as connectivity issues and occasional mismatches in how feedback was perceived. The findings support integrating AI-based pronunciation tools with conventional language instruction through feedback and self-regulated learning scaffolds. However, the intact-class design, single-centre context, and short duration limit generalizability, calling for longitudinal, multi-site replication.
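The statistical comparisons reported above (paired t-test on pre/post scores, Pearson correlation between practice engagement and performance) can be sketched in a few lines. This is a minimal illustration with made-up scores, not the study's actual data or analysis pipeline.

```python
import math

def paired_t(pre, post):
    """Paired t statistic for pre/post scores of the same learners."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

def pearson_r(x, y):
    """Pearson correlation, e.g. practice frequency vs. score gain."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for five learners (illustration only)
pre = [55, 60, 48, 70, 62]
post = [63, 68, 55, 78, 70]
print(round(paired_t(pre, post), 2))
```

In practice such tests are run with a statistics package (e.g. SciPy's `ttest_rel`), which also yields the p-values reported in the abstract.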
Conflict of interest:
The authors have declared no potential conflicts of interest and no falsification or fabrication of data with respect to the research, authorship, and/or publication of this article.
Language. Linguistic theory. Comparative grammar, Computational linguistics. Natural language processing
Automatic text summarization, particularly headline generation, remains a critical yet under-explored area for Bengali religious news. Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect. This limitation significantly hinders their effectiveness and overall performance. This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News) – comprising religious news articles from prominent Bangladeshi online newspapers, and MultiGen – a contextual multi-input feature fusion headline generation approach. Leveraging transformer-based pre-trained language models such as BanglaT5, mBART, mT5, and mT0, MultiGen integrates additional contextual features – including category, aspect, and sentiment – with the news content. This fusion enables the model to capture critical contextual information often overlooked by traditional methods. Experimental results demonstrate the superiority of MultiGen over the baseline approach that uses only news content, achieving a BLEU score of 18.61 and ROUGE-L score of 24.19, compared to baseline approach scores of 16.08 and 23.08, respectively. These findings underscore the importance of incorporating contextual features in headline generation for low-resource languages. By bridging linguistic and cultural gaps, this research advances natural language processing for Bengali and other under-represented languages. To promote reproducibility and further exploration, the dataset and implementation code are publicly accessible at https://github.com/akabircs/BeliN.
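The ROUGE-L score reported above is based on the longest common subsequence (LCS) between a generated headline and its reference. A minimal pure-Python sketch of that computation follows; published evaluations use standard packages (e.g. `rouge-score`), so this is only an illustration of the metric.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between a generated headline and a reference headline."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)

print(rouge_l_f1("dhaka mosque hosts iftar", "dhaka mosque hosts large iftar"))
```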
Computational linguistics. Natural language processing
Margarida Campos, António Farinhas, Chrysoula Zerva
et al.
Abstract The rapid proliferation of large language models and natural language processing (NLP) applications creates a crucial need for uncertainty quantification to mitigate risks such as hallucinations and to enhance decision-making reliability in critical applications. Conformal prediction is emerging as a theoretically sound and practically useful framework, combining flexibility with strong statistical guarantees. Its model-agnostic and distribution-free nature makes it particularly promising to address the current shortcomings of NLP systems that stem from the absence of uncertainty quantification. This paper provides a comprehensive survey of conformal prediction techniques, their guarantees, and existing applications in NLP, pointing to directions for future research and open challenges.
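The core procedure the survey builds on, split conformal prediction, can be sketched in a few lines: calibrate a threshold on held-out nonconformity scores, then keep every label whose score falls under it. The scores below are toy values (e.g. one minus the softmax probability of the true label), not tied to any particular NLP system.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Split conformal threshold: the ceil((n+1)(1-alpha))-th smallest
    calibration nonconformity score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the quantile
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(label_scores, threshold):
    """Keep every label whose nonconformity score is within the threshold;
    the resulting set covers the true label with probability >= 1 - alpha."""
    return {label for label, s in label_scores.items() if s <= threshold}

# Toy calibration scores from a held-out set
cal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q = conformal_quantile(cal, alpha=0.2)
print(sorted(prediction_set({"pos": 0.15, "neg": 0.85, "neu": 0.5}, q)))
```

The coverage guarantee is marginal and distribution-free, which is exactly what makes the framework attractive for black-box NLP models.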
Abstract: The funeral song is an integral part of the field of African orality studies, a field that remains only somewhat accessible. In practice, it is chiefly during funerals or wakes that the funeral song is performed and brought to the fore. This study examines the funeral song in its traditional-modern register. To that end, the corpus consists of a sample of five (5) funeral songs arranged and sung by the traditional-modern orchestra "Bane bu nsiane" (the orphans). Specifically, this article seeks to identify the contribution of the funeral song and its impact on the audience. Our research questions are as follows: what is the impact of this song on the lives of speakers and of the bereaved? What message is delivered in the Vili funeral song? To answer these questions, we adopt two theoretical approaches: principally the sociocriticism of Pierre V. Zima (2000), and secondarily the Cultural Studies of Chris Barker (2003).
Keywords: funeral song, cimuuntu, values, death, impact
Arts in general, Computational linguistics. Natural language processing
Abstract: Nothing seems to halt the jihadist threat that is raging intensely in parts of West Africa, particularly the Sahel countries, and threatens to spread across the entire region. This phenomenon, which intertwines with other types of violence (notably conflicts that are sometimes long-standing and often poorly managed or unmanaged), is taking new and unpredictable forms. To contain this growing threat, Côte d'Ivoire has put in place a preventive apparatus intended to overcome its security-related, structural, and social vulnerabilities. This article analyses the weaknesses of these measures, which complicate the response to jihadism and point to the need for a contextualized approach to the phenomenon that takes local realities into account.
Keywords: jihadist threat, response, conflict, menace, vulnerability
Arts in general, Computational linguistics. Natural language processing
Research in Natural Language Processing (NLP) and computational linguistics depends heavily on a good-quality, representative corpus of the language under study. Bangla is one of the most spoken languages in the world, but Bangla NLP research is still at an early stage of development due to the lack of a quality public corpus. This article describes the detailed compilation methodology of a comprehensive monolingual Bangla corpus, KUMono (Khulna University Monolingual corpus). The newly developed corpus consists of more than 350 million word tokens and more than one million unique tokens from 18 major text categories of online Bangla websites. We have conducted several word-level and character-level analyses of linguistic phenomena based on empirical studies of the developed corpus. The corpus follows Zipf's curve and the hapax legomena rule. The quality of the corpus is also assessed by comparing its inherent sparseness with that of existing Bangla corpora and by analyzing its function-word distribution and vocabulary growth rate. We have developed a Bangla article categorization application based on the KUMono corpus and obtained compelling results in comparison with state-of-the-art models.
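The word-level checks mentioned above (token/type counts, hapax legomena, rank-frequency for Zipf's curve) can be sketched as follows. The toy text stands in for the actual KUMono data, which this example does not reproduce.

```python
from collections import Counter

def corpus_stats(tokens):
    """Token/type counts plus hapax legomena (words occurring exactly once)."""
    freq = Counter(tokens)
    hapax = [w for w, c in freq.items() if c == 1]
    return {"tokens": len(tokens), "types": len(freq), "hapax": len(hapax)}

def rank_frequency(tokens):
    """(rank, frequency) pairs; under Zipf's law, frequency ~ C / rank."""
    freq = Counter(tokens)
    return [(r, c) for r, (_, c) in enumerate(freq.most_common(), start=1)]

text = "the cat sat on the mat and the dog sat too".split()
print(corpus_stats(text))
```

Plotting `rank_frequency` on log-log axes is the usual way to check how closely a corpus follows Zipf's curve.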
Enrique Amigó, Alejandro Ariza-Casabona, Victor Fresno
et al.
In the context of text representation, Compositional Distributional Semantics models aim to fuse the Distributional Hypothesis and the Principle of Compositionality. Text embedding is based on co-occurrence distributions, and the representations are in turn combined by compositional functions that take the text structure into account. However, the theoretical basis of compositional functions is still an open issue. In this article we define and study the notion of Information Theory–based Compositional Distributional Semantics (ICDS): (i) we first establish formal properties for embedding, composition, and similarity functions based on Shannon's Information Theory; (ii) we analyze the existing approaches under this prism, checking whether or not they comply with the established desirable properties; (iii) we propose two parameterizable composition and similarity functions that generalize traditional approaches while fulfilling the formal properties; and finally (iv) we perform an empirical study on several textual similarity datasets that include sentences with high and low lexical overlap, and on the similarity between words and their descriptions. Our theoretical analysis and empirical results show that fulfilling the formal properties positively affects the accuracy of text representation models in terms of correspondence (isometry) between the embedding and meaning spaces.
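The simplest compositional function that the article's proposals generalize is additive composition: sum the word vectors of a phrase and compare phrases by cosine similarity. A minimal sketch with hypothetical toy embeddings:

```python
import math

def compose(vectors):
    """Additive composition: sum word vectors to get a phrase vector."""
    return [sum(vs) for vs in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two phrase vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 3-d embeddings (hypothetical values, for illustration only)
emb = {"big": [1.0, 0.0, 1.0], "cat": [0.0, 1.0, 1.0], "dog": [0.0, 1.0, 0.5]}
phrase1 = compose([emb["big"], emb["cat"]])
phrase2 = compose([emb["big"], emb["dog"]])
print(round(cosine(phrase1, phrase2), 3))
```

The article's point is precisely that such functions should additionally satisfy information-theoretic properties (e.g. how composed informativeness relates to the parts), which plain addition does not guarantee.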
Computational linguistics. Natural language processing
Sunita Warjri, Partha Pakray, Saralin Lyngdoh
et al.
Part-of-speech (POS) tagging is one of the challenging research fields in natural language processing (NLP). It requires good knowledge of a particular language along with large amounts of data or corpora for feature engineering, which can lead to a well-performing tagger. Our main contribution in this work is the designed Khasi POS corpus. To date, no Khasi corpus of any kind had been formally developed. In the present Khasi POS corpus, each word is tagged manually using the designed tagset. Deep learning methods have been used to experiment with our corpus: we present POS taggers based on BiLSTM, BiLSTM combined with a CRF, and character-based embedding with BiLSTM. We also anticipate the main challenges that natural language poses for computational linguistics in this setting. In the designed corpus, we have tried to resolve ambiguities of words with respect to their context of use, as well as orthography problems that arise in the POS corpus. The corpus contains around 96,100 tokens and 6,616 distinct words. Initially, when running the first subset of around 41,000 tokens in our experiments, the taggers were found to yield considerably accurate results. When the corpus was increased to 96,100 tokens, accuracy rose further and the analyses became more pertinent. As a result, an accuracy of 96.81% is achieved for the BiLSTM method, 96.98% for BiLSTM with CRF, and 95.86% for the character-based BiLSTM. To situate this work within NLP research on Khasi, we also review some existing POS taggers and other NLP work on the Khasi language for comparison.
Romanian is one of the understudied languages in computational linguistics, with few resources available for the development of natural language processing tools. In this paper, we introduce LaRoSeDa, a Large Romanian Sentiment Data Set, which is composed of 15,000 positive and negative reviews collected from the largest Romanian e-commerce platform. We employ two sentiment classification methods as baselines for our new data set, one based on low-level features (character n-grams) and one based on high-level features (bag-of-word-embeddings generated by clustering word embeddings with k-means). As an additional contribution, we replace the k-means clustering algorithm with self-organizing maps (SOMs), obtaining better results because the generated clusters of word embeddings are closer to the distribution predicted by Zipf's law, which is known to govern natural language. We also demonstrate the generalization capacity of using SOMs for the clustering of word embeddings on another recently introduced Romanian data set, for text categorization by topic.
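The bag-of-word-embeddings representation described above turns a document into a fixed-length histogram: each word's embedding is assigned to its nearest cluster centroid and the counts are accumulated. A minimal sketch with hypothetical embeddings and centroids (in the paper the centroids come from k-means or a SOM):

```python
def nearest(vec, centroids):
    """Index of the closest centroid (squared Euclidean distance)."""
    d2 = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(centroids)), key=lambda i: d2(vec, centroids[i]))

def bag_of_clusters(tokens, emb, centroids):
    """Histogram over clusters: a fixed-length document feature vector."""
    hist = [0] * len(centroids)
    for t in tokens:
        if t in emb:  # ignore out-of-vocabulary words
            hist[nearest(emb[t], centroids)] += 1
    return hist

# Toy 2-d embeddings and two centroids (hypothetical values)
emb = {"good": [1.0, 0.9], "great": [0.9, 1.0], "bad": [-1.0, -0.9]}
centroids = [[1.0, 1.0], [-1.0, -1.0]]
print(bag_of_clusters("good great bad good".split(), emb, centroids))
```

The resulting histogram can then be fed to any standard classifier, which is how the baselines in the paper use it.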
Text alignment is one of the fundamental techniques in text-related domains like natural language processing, computational linguistics, and digital humanities. It compares two or more texts with each other, aiming to find similar textual patterns or to estimate in general how different or similar the texts are. Visualizing alignment results is an essential task, because it helps researchers get a comprehensive overview of individual findings and the overall pattern structure. Different approaches have been developed to visualize these patterns and help make sense of them, depending on text size, alignment methods, and, most importantly, the underlying research tasks that demand alignment. On the basis of those tasks, we review existing text alignment visualization approaches and discuss their advantages and drawbacks. We finally derive design implications and shed light on related future challenges.
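A minimal example of the kind of pairwise alignment such visualizations are built on, using Python's standard-library `difflib.SequenceMatcher` to find matching token runs between two texts. This is only one simple alignment method among the many the survey covers.

```python
from difflib import SequenceMatcher

def align(text_a, text_b):
    """Matching token runs between two texts: (start_a, start_b, length)."""
    a, b = text_a.split(), text_b.split()
    sm = SequenceMatcher(a=a, b=b, autojunk=False)
    return [(m.a, m.b, m.size) for m in sm.get_matching_blocks() if m.size]

blocks = align("to be or not to be", "to be is not to seem")
for start_a, start_b, size in blocks:
    print(start_a, start_b, size)
```

Each triple locates a shared pattern in both texts; a visualization would typically render these as connecting ribbons or highlighted spans.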
Personality profiling has long been used in psychology to predict life outcomes. Recently, automatic detection of personality traits from written messages has gained significant attention in the computational linguistics and natural language processing communities, due to its applicability in various fields. In this survey, we show the trajectory of research towards automatic personality detection from purely psychological approaches, through psycholinguistics, to the recent purely natural language processing approaches on large datasets automatically extracted from social media. We point out what has been gained and what has been lost along that trajectory, and show what realistic expectations in the field can be.
Gorjan Popovski, S. Kochev, B. Korousic-Seljak
et al.
The application of Natural Language Processing (NLP) methods and resources to biomedical textual data has received growing attention over the past years. Previously organized biomedical NLP shared tasks (such as the BioNLP Shared Tasks) focus on extracting different biomedical entities (like genes, phenotypes, drugs, diseases, and chemical entities) and finding relations between them. However, to the best of our knowledge, few NLP methods exist for extracting entities related to food concepts. To extract food entities from unstructured textual data, we therefore propose a rule-based named-entity recognition method for food information extraction, called FoodIE. It comprises a small number of rules based on computational linguistics and semantic information that describe food entities. Experimental results from evaluations on two different datasets show that very promising results can be achieved: the proposed method reaches 97% precision, 94% recall, and a 96% F1 score.
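To make the flavour of rule-based food NER and its evaluation concrete, here is a toy sketch: a longest-match gazetteer lookup plus the precision/recall/F1 computation behind the figures reported above. The gazetteer, rule, and sentences are invented for illustration; FoodIE's actual rules are richer and linguistically informed.

```python
def extract_food(text, gazetteer):
    """Longest-match gazetteer lookup over the token sequence."""
    tokens = text.lower().split()
    found, i = [], 0
    while i < len(tokens):
        for n in (2, 1):  # prefer two-word entries like "olive oil"
            phrase = " ".join(tokens[i:i + n])
            if phrase in gazetteer:
                found.append(phrase)
                i += n
                break
        else:
            i += 1
    return found

def prf1(predicted, gold):
    """Precision, recall, and F1 over predicted vs. gold entity sets."""
    tp = len(set(predicted) & set(gold))
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gaz = {"olive oil", "tomato", "basil"}
pred = extract_food("Heat olive oil and add tomato with salt", gaz)
print(pred, prf1(pred, ["olive oil", "tomato", "salt"]))
```

Here the matcher misses "salt" (absent from the toy gazetteer), so recall drops below precision, mirroring the precision/recall trade-off any rule-based extractor faces.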