Fact-checking has become increasingly important due to the speed with which both information and misinformation can spread in the modern media ecosystem. Therefore, researchers have been exploring how fact-checking can be automated, using techniques based on natural language processing, machine learning, knowledge representation, and databases to automatically predict the veracity of claims. In this paper, we survey automated fact-checking stemming from natural language processing, and discuss its connections to related tasks and disciplines. In this process, we present an overview of existing datasets and models, aiming to unify the various definitions given and identify common concepts. Finally, we highlight challenges for future research.
Recently, the zeroing neural network (ZNN) has demonstrated remarkable effectiveness in tackling time-varying problems, delivering robust performance across both noise-free and noisy environments. However, existing ZNN models are limited in their ability to actively suppress noise, which constrains their robustness and precision in solving time-varying problems. This paper introduces a novel active noise rejection ZNN (ANR-ZNN) design that enhances noise suppression by integrating computational error dynamics and harmonic behaviour. Through rigorous theoretical analysis, we demonstrate that the proposed ANR-ZNN maintains robust convergence in computational error performance under environmental noise. As a case study, the ANR-ZNN model is specifically applied to time-varying matrix inversion. Comprehensive computer simulations and robotic experiments further validate the ANR-ZNN's effectiveness, emphasising the proposed design's superiority and potential for solving time-varying problems.
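For context, the classical ZNN design that the ANR-ZNN builds on drives a matrix-valued error function to zero; for time-varying matrix inversion the standard background formulation (a sketch of the well-known baseline, not the paper's specific ANR-ZNN dynamics) is:

```latex
% Classical ZNN design for time-varying matrix inversion (background sketch;
% the ANR-ZNN adds active noise-rejection terms not shown here).
\begin{align}
  E(t) &= A(t)X(t) - I
    && \text{error function for the goal } A(t)X(t) = I \\
  \dot{E}(t) &= -\gamma\,\Phi\!\bigl(E(t)\bigr)
    && \text{design formula with gain } \gamma > 0 \\
  A(t)\dot{X}(t) &= -\dot{A}(t)X(t) - \gamma\bigl(A(t)X(t) - I\bigr)
    && \text{implied dynamics under the linear activation } \Phi(E) = E
\end{align}
```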
Computational linguistics. Natural language processing, Computer software
Audrey Adine Laure ABEGHE BEKA and Mathurin OVONO ÉBÈ
Abstract: "Osímán" is a song that Maelé delivered to his listeners in 1987 on the album Chavely. In it, he describes the orphanhood of a child in a village. The conditions in which his guardian makes him live reveal the woman's partiality, spite, and cruelty. Through this song, the artist takes up the orphan's cause, denouncing the treatment orphans are often subjected to in our societies. The theories of artistic and intellectual engagement developed by thinkers such as Jean-Paul Sartre, Albert Camus, and Theodor Adorno, among others, demonstrate the degree of Maelé's intellectuality in this song. However, the artist, an opponent of Teodoro Obiang Nguema's regime who had taken the full measure of his dictatorship, uses parables, metaphors, allegories, and other figures to denounce the authoritarian drift in Equatorial Guinea. Such a reading of "Osímán" is possible only through the reception theories developed by Hans Robert Jauss.
Keywords: Maelé – "Osímán" – Music – Equatorial Guinea – Dictatorship – Orphan – Engagement
Arts in general, Computational linguistics. Natural language processing
Social media platforms like Facebook, Instagram, and Twitter are widely used; users frequently share their daily lives by uploading pictures, posts, and videos, which gain significant popularity. However, social media posts often receive a mix of reactions, ranging from positive to negative, and in some instances, negative comments escalate into cyberbullying. Numerous studies have addressed this issue by focusing on cyberbullying classification, primarily through binary classification using multimodal data or by targeting either text or image data alone. This study investigates multi-class image classification (No-bullying, Religious, Sexual, and Others) using the pre-trained deep learning model MobileNetV2, which achieved an F1-score of 0.86. For categorizing hate comments, we consider multiple classes, including Not Hate, Slang, Sexual, Racial, and Religious-related content. Extensive experiments were conducted on a novel Bengali-English code-mixed dataset, utilizing a combination of advanced transformer models, traditional machine learning techniques, and deep learning approaches to detect multiple hate comment labels. Bangla BERT achieved the highest F1-score of 0.79, followed closely by SVM at 0.78 and BiLSTM with attention at 0.73. These findings underscore the effectiveness of these models in capturing the complexities of code-mixed Bengali-English, offering valuable insights into cyberbullying detection in diverse linguistic contexts. This research contributes essential strategies for improving online safety and fostering respectful digital interactions.
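A minimal sketch of the kind of MobileNetV2 fine-tuning pipeline described above; the data path, image size, and classification head here are illustrative assumptions, not the authors' exact configuration:

```python
# Hedged sketch: MobileNetV2 fine-tuning for 4-class image classification
# (No-bullying, Religious, Sexual, Others). Hyperparameters and data layout
# are assumptions, not the paper's exact setup.
import tensorflow as tf

NUM_CLASSES = 4
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(224, 224), batch_size=32)  # hypothetical path

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained backbone initially

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```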
Computational linguistics. Natural language processing
Linguistic errors are not merely deviations from normative grammar; they offer a unique window into the cognitive architecture of language and expose the current limitations of artificial systems that seek to replicate them. This project proposes an interdisciplinary study of linguistic errors produced by native Spanish speakers, with the aim of analyzing how current large language models (LLMs) interpret, reproduce, or correct them. The research integrates three core perspectives: theoretical linguistics, to classify and understand the nature of the errors; neurolinguistics, to contextualize them within real-time language processing in the brain; and natural language processing (NLP), to evaluate how such systems handle these errors. A purpose-built corpus of more than 500 authentic errors produced by native Spanish speakers will serve as the foundation for empirical analysis. These errors will be tested against AI models such as GPT or Gemini to assess their interpretative accuracy and their ability to generalize patterns of human linguistic behavior. The project contributes not only to the understanding of Spanish as a native language but also to the development of NLP systems that are more cognitively informed and capable of engaging with the imperfect, variable, and often ambiguous nature of real human language.
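As an illustration of the kind of probe this evaluation implies, one might prompt an LLM with an authentic error and inspect its diagnosis; the following is a hedged sketch using the OpenAI Python client, where the prompt wording, model name, and example error ("cocreta", a well-known metathesis of "croqueta") are our assumptions, not items from the project's corpus:

```python
# Hypothetical probe: does an LLM detect, classify, and correct a common
# Spanish speech error? Example and prompt are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

utterance = "Me he comido una cocreta."  # metathesis of "croqueta"
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model choice
    messages=[
        {"role": "system",
         "content": "Identify any linguistic error in the Spanish sentence, "
                    "classify it (phonological, morphological, syntactic), "
                    "and give the corrected form."},
        {"role": "user", "content": utterance},
    ],
)
print(response.choices[0].message.content)
```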
Causal reasoning is a cornerstone of human intelligence and a critical capability for artificial systems aiming to achieve advanced understanding and decision-making. This thesis delves into various dimensions of causal reasoning and understanding in large language models (LLMs). It encompasses a series of studies that explore the causal inference skills of LLMs, the mechanisms behind their performance, and the implications of causal and anticausal learning for natural language processing (NLP) tasks. Additionally, it investigates the application of causal reasoning in text-based computational social science, specifically focusing on political decision-making and the evaluation of scientific impact through citations. Through novel datasets, benchmark tasks, and methodological frameworks, this work identifies key challenges and opportunities to improve the causal capabilities of LLMs, providing a comprehensive foundation for future research in this evolving field.
Badr AlKhamissi, Greta Tuckute, Yingtian Tang, et al.
Large language models (LLMs) exhibit remarkable similarity to neural activity in the human language network. However, the key properties of language that shape brain-like representations, and how these evolve during training as a function of different tasks, remain unclear. Here we benchmark 34 training checkpoints spanning 300B tokens across 8 different model sizes to analyze how brain alignment relates to linguistic competence. Specifically, we find that brain alignment tracks the development of formal linguistic competence -- i.e., knowledge of linguistic rules -- more closely than functional linguistic competence. While functional competence, which involves world knowledge and reasoning, continues to develop throughout training, its relationship with brain alignment is weaker, suggesting that the human language network primarily encodes formal linguistic structure rather than broader cognitive functions. We further show that model size is not a reliable predictor of brain alignment when controlling for feature size, and we find that the correlation between next-word prediction, behavioral alignment, and brain alignment fades once models surpass human language proficiency. Finally, using the largest set of rigorous neural language benchmarks to date, we show that brain alignment benchmarks for language remain unsaturated, highlighting opportunities for improving future models. Taken together, our findings suggest that the human language network is best modeled by formal, rather than functional, aspects of language.
Aleksandra Krasnodębska, Karolina Seweryn, Szymon Łukasik, et al.
Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of the world's languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
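A minimal sketch of fine-tuning a HerBERT-based safety classifier of the kind described; `allegro/herbert-base-cased` is a public HerBERT checkpoint, while the data file, label scheme, and hyperparameters below are our assumptions:

```python
# Hedged sketch: binary safe/unsafe text classification with a HerBERT
# backbone. File name, column names, and hyperparameters are illustrative.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "allegro/herbert-base-cased"  # public HerBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical CSV with "text" and integer "label" columns.
ds = load_dataset("csv", data_files={"train": "pl_safety_train.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

ds = ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="herbert-guard", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    data_collator=DataCollatorWithPadding(tokenizer),  # pad per batch
)
trainer.train()
```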
This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language, along with transformer-based models trained using these datasets for named entity recognition, dependency parsing, and part-of-speech tagging tasks. Furthermore, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results demonstrate notable improvements in the computational analysis of historical Turkish, achieving strong performance on tasks that require understanding of historical linguistic structures -- specifically, 90.29% F1 in named entity recognition, 73.79% LAS for dependency parsing, and 94.98% F1 for part-of-speech tagging. They also highlight remaining challenges, such as domain adaptation and language variation across time periods. All the resources and models presented are available at https://hf.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.
Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata's lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.
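For illustration, a query of the kind such a natural language interface would generate against Wikidata's lexicographic module might look like the following; this utterance/query pair is our example, not one drawn from the paper's 1.2 million mappings:

```python
# Hedged sketch: run a lexicographic SPARQL query against the Wikidata
# Query Service. The utterance and query below are illustrative.
import requests

# Utterance: "Which German nouns are recorded as lexemes?"
query = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q188 ;               # language: German
          wikibase:lexicalCategory wd:Q1084 ;  # lexical category: noun
          wikibase:lemma ?lemma .
}
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "lexeme-demo/0.1 (example)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["lexeme"]["value"], row["lemma"]["value"])
```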
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches. We describe a wide range of NLP tasks in two categories: natural language understanding (NLU) (covering tasks such as dialect classification, sentiment analysis, parsing, and NLU benchmarks) and natural language generation (NLG) (covering summarisation, machine translation, and dialogue systems). The survey is also broad in its coverage of languages, which includes English, Arabic, and German, among others. We observe that past NLP work on dialects goes deeper than mere dialect classification and extends to several NLU and NLG tasks. For these tasks, we describe classical machine learning using statistical models, along with recent deep learning approaches based on pre-trained language models. We expect that this survey will be useful to NLP researchers interested in building equitable language technologies by rethinking LLM benchmarks and model architectures.
Jafar Isbarov, Kavsar Huseynova, Elvin Mammadov, et al.
The emergence of multilingual large language models has enabled the development of language understanding and generation systems in Azerbaijani. However, most production-grade systems rely on cloud solutions, such as GPT-4. While there have been several attempts to develop open foundation models for Azerbaijani, these works have not found their way into common use due to a lack of systematic benchmarking. This paper encompasses several lines of work that promote open-source foundation models for Azerbaijani. We introduce (1) a large text corpus for Azerbaijani, (2) a family of encoder-only language models trained on this corpus, (3) labeled datasets for evaluating these models, and (4) an extensive evaluation that covers all major open-source models with Azerbaijani support.
Jakub Piskorski, Michał Marcińczuk, Roman Yangarber
This paper presents a corpus manually annotated with named entities for six Slavic languages: Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks conducted in 2017-2023 as part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits: single-topic-out and cross-topics. For each split, we set benchmarks using a transformer-based neural network architecture with pre-trained multilingual models: XLM-RoBERTa-large for named entity mention recognition and categorization, and mT5-large for named entity lemmatization and linking.
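A minimal sketch of an XLM-RoBERTa-large token-classification setup of the kind used for the mention-recognition benchmarks; the BIO label list below is our assumption over five generic entity classes, not the corpus's exact tag set:

```python
# Hedged sketch: token classification with XLM-RoBERTa-large for NER.
# The label inventory and example sentence are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O",
          "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG",
          "B-PRO", "I-PRO", "B-EVT", "I-EVT"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

text = "Warszawa jest stolicą Polski."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits           # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, pid in zip(tokens, pred_ids):
    print(tok, labels[pid])               # head is untrained: output is random
```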
Large language models (LLMs) represent a major advancement in AI and have been applied to many natural language processing tasks. Nevertheless, in many business scenarios, an LLM requires fine-tuning by engineers to achieve satisfactory performance, and the cost of fine-tuning may not be matched by the gain in target performance. Based on the Baidu STI dataset, we study the upper bound of the performance that classical information retrieval methods can achieve for a specific business task, and compare it with the cost and performance of the LLM-based approaches of the participating teams. This paper offers insight into the potential of classical computational linguistics algorithms, which can help decision-makers choose sensibly between LLMs and low-cost methods in business R&D.
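As a concrete example of such a classical IR baseline, the sketch below uses BM25 via the `rank_bm25` package; the toy corpus and query are illustrative, not drawn from the Baidu STI dataset:

```python
# Hedged sketch: BM25 ranking as a low-cost classical IR baseline.
from rank_bm25 import BM25Okapi

corpus = [
    "how to reset the router password",
    "router firmware update instructions",
    "troubleshooting wifi connection drops",
]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "reset router password".split()
scores = bm25.get_scores(query)          # one relevance score per document
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best], scores[best])
```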
Speech recognition, a subfield of computational linguistics and natural language processing concerned with communication between computers and humans, has a long history. Automatic Speech Recognition (ASR), Text-To-Speech (TTS), speech-to-text, Continuous Speech Recognition (CSR), and interactive voice response systems are different approaches to solving problems in this area. Performance improvements are partially attributable to the ability of Deep Neural Networks (DNNs) to model complex correlations in speech features. Whereas conventional models for sequential data such as voice employ Recurrent Neural Networks (RNNs), the emergence of new deep network architectures and the strong performance of Convolutional Neural Networks (CNNs) in image processing and feature extraction have led to CNNs being applied in other domains. In this paper, it is shown that prosodic features of the Persian language can be extracted via CNNs for segmenting and labeling speech for short texts. Using 128 and 200 filters for the CNN and special architectures, a detection error rate of 19.46 and better time consumption than with RNNs were obtained. In addition, the CNN simplifies the learning procedure. Experimental results show that CNN networks can be a good feature extractor for speech recognition in various languages.
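A sketch of the general CNN-as-feature-extractor idea for speech; the filter counts follow the 128/200 figures mentioned above, while the input shape, layer stack, and label count are our assumptions, not the paper's architecture:

```python
# Hedged sketch: a 1D CNN over speech feature frames (e.g. MFCCs) for
# segment labeling. Dimensions and classes are illustrative assumptions.
import tensorflow as tf

NUM_FRAMES, NUM_FEATURES, NUM_LABELS = 200, 13, 5  # assumed dimensions

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FRAMES, NUM_FEATURES)),
    tf.keras.layers.Conv1D(128, kernel_size=5, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(200, kernel_size=5, activation="relu", padding="same"),
    tf.keras.layers.GlobalMaxPooling1D(),   # fixed-length feature vector
    tf.keras.layers.Dense(NUM_LABELS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```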
Berta Chulvi, M. Molpeceres, María F. Rodrigo, et al.
This study uses natural language processing (NLP) tools to analyze how politicians recreate the stereotype of immigrants in the Spanish Parliament. An interdisciplinary approach from computational linguistics and social psychology has been used to construct a variety of indices about content and linguistic styles. The analysis of 2,516 parliamentary interventions about immigration delivered between 1996 and 2016 by representatives of the two political parties that alternated in power during that period (conservative Popular Party and Spanish Socialist Party) shows that both the rhetorical strategy to present immigrants as “victims” or as a “threat” and the language style that politicians use reveal an interaction between the ideology of the party and the party's political position in government or in the opposition. Results also suggest some changes over time in the polarization and politicization of the debate about immigration.
This research analyzes the use of abbreviations in the Kazakh language using a machine learning approach. The most commonly used abbreviations in Kazakh texts were studied, analyzed, and classified. Several machine learning models, including naive Bayes, neural networks, and support vector machines (SVMs), were tested. For the experimental part, a linguistic corpus was used (Abbreviations-abb.xml and abbreviations.csv). The results showed that the SVM outperformed the other models with an accuracy of 0.85. The unique features of the Kazakh language were also discussed, such as the use of the Cyrillic alphabet and complex word forms. The implications of this study for natural language processing and computational linguistics were presented, and the limitations of the study were discussed. This study contributes to understanding the use of abbreviations in the Kazakh language and demonstrates the potential of machine learning approaches for analyzing languages with complex characteristics.
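A minimal sketch of the kind of SVM setup that performed best; the file name follows the abstract's abbreviations.csv, but the column names and feature choice are our assumptions:

```python
# Hedged sketch: TF-IDF + linear SVM for classifying Kazakh abbreviations.
# Column names ("abbreviation", "category") are assumptions about the CSV.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("abbreviations.csv")  # file named in the abstract
X_train, X_test, y_train, y_test = train_test_split(
    df["abbreviation"], df["category"], test_size=0.2, random_state=42)

# Character n-grams cope with Cyrillic script and complex word forms.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
    LinearSVC(),
)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```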
Abstract: In this work, our objective is to explain how Boris Cyrulnik was able to overcome the psychological wounds he suffered at an early age and become one of the greatest French and international psychoanalysts of today. We also draw on the cases of several classic French writers who experienced nearly the same circumstances in their early childhoods. Cyrulnik emerged from his trauma victorious thanks to the written word; through his book La nuit, j'écrirai des soleils, he seeks to demonstrate and clarify the therapeutic secret of writing, which made him a magician of modern psychology.
Keywords: psychological wound, trauma, writing, writer, childhood
Arts in general, Computational linguistics. Natural language processing