Fact-checking has become increasingly important due to the speed with which both information and misinformation can spread in the modern media ecosystem. Therefore, researchers have been exploring how fact-checking can be automated, using techniques based on natural language processing, machine learning, knowledge representation, and databases to automatically predict the veracity of claims. In this paper, we survey automated fact-checking stemming from natural language processing, and discuss its connections to related tasks and disciplines. In this process, we present an overview of existing datasets and models, aiming to unify the various definitions given and identify common concepts. Finally, we highlight challenges for future research.
Recently, the zeroing neural network (ZNN) has demonstrated remarkable effectiveness in tackling time-varying problems, delivering robust performance across both noise-free and noisy environments. However, existing ZNN models are limited in their ability to actively suppress noise, which constrains their robustness and precision in solving time-varying problems. This paper introduces a novel active noise rejection ZNN (ANR-ZNN) design that enhances noise suppression by integrating computational error dynamics and harmonic behaviour. Through rigorous theoretical analysis, we demonstrate that the proposed ANR-ZNN maintains robust convergence in computational error performance under environmental noise. As a case study, the ANR-ZNN model is specifically applied to time-varying matrix inversion. Comprehensive computer simulations and robotic experiments further validate the ANR-ZNN's effectiveness, emphasising the proposed design's superiority and potential for solving time-varying problems.
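For context, the classical ZNN design that the ANR-ZNN builds on drives a matrix-valued error function to zero; for time-varying matrix inversion the standard background formulation (a sketch of the well-known baseline, not the paper's specific ANR-ZNN dynamics) is:

```latex
% Classical ZNN design for time-varying matrix inversion (background sketch;
% the ANR-ZNN adds active noise-rejection terms not shown here).
\begin{align}
  E(t) &= A(t)X(t) - I
    && \text{error function for the goal } A(t)X(t) = I \\
  \dot{E}(t) &= -\gamma\,\Phi\!\bigl(E(t)\bigr)
    && \text{design formula with gain } \gamma > 0 \\
  A(t)\dot{X}(t) &= -\dot{A}(t)X(t) - \gamma\bigl(A(t)X(t) - I\bigr)
    && \text{implied dynamics under the linear activation } \Phi(E) = E
\end{align}
```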
Computational linguistics. Natural language processing, Computer software
Audrey Adine Laure ABEGHE BEKA and Mathurin OVONO ÉBÈ
Abstract: "Osímán" is a song that Maelé delivered to his listeners in 1987 on the album Chavely. In it, he describes the orphanhood of a child in a village. The conditions in which his guardian makes him live reveal the woman's partiality, spite, and cruelty. Through this song, the artist takes up the orphan's cause, denouncing the treatment orphans are often subjected to in our societies. The theories of artistic and intellectual engagement developed by thinkers such as Jean-Paul Sartre, Albert Camus, and Theodor Adorno, among others, demonstrate the degree of Maelé's intellectuality in this song. However, the artist, an opponent of Teodoro Obiang Nguema's regime who had taken the full measure of his dictatorship, uses parables, metaphors, allegories, and other figures to denounce the authoritarian drift in Equatorial Guinea. Such a reading of "Osímán" is possible only through the reception theories developed by Hans Robert Jauss.
Keywords: Maelé – "Osímán" – Music – Equatorial Guinea – Dictatorship – Orphan – Engagement
Arts in general, Computational linguistics. Natural language processing
Social media platforms like Facebook, Instagram, and Twitter are widely used; users frequently share their daily lives by uploading pictures, posts, and videos, which gain significant popularity. However, social media posts often receive a mix of reactions, ranging from positive to negative, and in some instances, negative comments escalate into cyberbullying. Numerous studies have addressed this issue by focusing on cyberbullying classification, primarily through binary classification using multimodal data or by targeting either text or image data alone. This study investigates multi-class image classification (No-bullying, Religious, Sexual, and Others) using the pre-trained deep learning model MobileNetV2, which achieved an F1-score of 0.86. For categorizing hate comments, we consider multiple classes, including Not Hate, Slang, Sexual, Racial, and Religious-related content. Extensive experiments were conducted on a novel Bengali-English code-mixed dataset, utilizing a combination of advanced transformer models, traditional machine learning techniques, and deep learning approaches to detect multiple hate comment labels. Bangla BERT achieved the highest F1-score of 0.79, followed closely by SVM at 0.78 and BiLSTM with attention at 0.73. These findings underscore the effectiveness of these models in capturing the complexities of code-mixed Bengali-English, offering valuable insights into cyberbullying detection in diverse linguistic contexts. This research contributes essential strategies for improving online safety and fostering respectful digital interactions.
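A minimal sketch of the kind of MobileNetV2 fine-tuning pipeline described above; the data path, image size, and classification head here are illustrative assumptions, not the authors' exact configuration:

```python
# Hedged sketch: MobileNetV2 fine-tuning for 4-class image classification
# (No-bullying, Religious, Sexual, Others). Hyperparameters and data layout
# are assumptions, not the paper's exact setup.
import tensorflow as tf

NUM_CLASSES = 4
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(224, 224), batch_size=32)  # hypothetical path

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained backbone initially

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```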
Computational linguistics. Natural language processing
Linguistic errors are not merely deviations from normative grammar; they offer a unique window into the cognitive architecture of language and expose the current limitations of artificial systems that seek to replicate them. This project proposes an interdisciplinary study of linguistic errors produced by native Spanish speakers, with the aim of analyzing how current large language models (LLMs) interpret, reproduce, or correct them. The research integrates three core perspectives: theoretical linguistics, to classify and understand the nature of the errors; neurolinguistics, to contextualize them within real-time language processing in the brain; and natural language processing (NLP), to evaluate how such systems handle these errors. A purpose-built corpus of more than 500 authentic errors produced by native Spanish speakers will serve as the foundation for empirical analysis. These errors will be tested against AI models such as GPT or Gemini to assess their interpretative accuracy and their ability to generalize patterns of human linguistic behavior. The project contributes not only to the understanding of Spanish as a native language but also to the development of NLP systems that are more cognitively informed and capable of engaging with the imperfect, variable, and often ambiguous nature of real human language.
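As an illustration of the kind of probe this evaluation implies, one might prompt an LLM with an authentic error and inspect its diagnosis; the following is a hedged sketch using the OpenAI Python client, where the prompt wording, model name, and example error ("cocreta", a well-known metathesis of "croqueta") are our assumptions, not items from the project's corpus:

```python
# Hypothetical probe: does an LLM detect, classify, and correct a common
# Spanish speech error? Example and prompt are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

utterance = "Me he comido una cocreta."  # metathesis of "croqueta"
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model choice
    messages=[
        {"role": "system",
         "content": "Identify any linguistic error in the Spanish sentence, "
                    "classify it (phonological, morphological, syntactic), "
                    "and give the corrected form."},
        {"role": "user", "content": utterance},
    ],
)
print(response.choices[0].message.content)
```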
Causal reasoning is a cornerstone of human intelligence and a critical capability for artificial systems aiming to achieve advanced understanding and decision-making. This thesis delves into various dimensions of causal reasoning and understanding in large language models (LLMs). It encompasses a series of studies that explore the causal inference skills of LLMs, the mechanisms behind their performance, and the implications of causal and anticausal learning for natural language processing (NLP) tasks. Additionally, it investigates the application of causal reasoning in text-based computational social science, specifically focusing on political decision-making and the evaluation of scientific impact through citations. Through novel datasets, benchmark tasks, and methodological frameworks, this work identifies key challenges and opportunities to improve the causal capabilities of LLMs, providing a comprehensive foundation for future research in this evolving field.
Badr AlKhamissi, Greta Tuckute, Yingtian Tang, et al.
Large language models (LLMs) exhibit remarkable similarity to neural activity in the human language network. However, the key properties of language that shape brain-like representations, and how these evolve during training as a function of different tasks, remain unclear. Here we benchmark 34 training checkpoints spanning 300B tokens across 8 different model sizes to analyze how brain alignment relates to linguistic competence. Specifically, we find that brain alignment tracks the development of formal linguistic competence -- i.e., knowledge of linguistic rules -- more closely than functional linguistic competence. While functional competence, which involves world knowledge and reasoning, continues to develop throughout training, its relationship with brain alignment is weaker, suggesting that the human language network primarily encodes formal linguistic structure rather than broader cognitive functions. We further show that model size is not a reliable predictor of brain alignment when controlling for feature size, and we find that the correlation between next-word prediction, behavioral alignment, and brain alignment fades once models surpass human language proficiency. Finally, using the largest set of rigorous neural language benchmarks to date, we show that brain alignment benchmarks for language remain unsaturated, highlighting opportunities for improving future models. Taken together, our findings suggest that the human language network is best modeled by formal, rather than functional, aspects of language.
Aleksandra Krasnodębska, Karolina Seweryn, Szymon Łukasik, et al.
Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of the world's languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
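A minimal sketch of fine-tuning a HerBERT-based safety classifier of the kind described; `allegro/herbert-base-cased` is a public HerBERT checkpoint, while the data file, label scheme, and hyperparameters below are our assumptions:

```python
# Hedged sketch: binary safe/unsafe text classification with a HerBERT
# backbone. File name, column names, and hyperparameters are illustrative.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "allegro/herbert-base-cased"  # public HerBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical CSV with "text" and integer "label" columns.
ds = load_dataset("csv", data_files={"train": "pl_safety_train.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

ds = ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="herbert-guard", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    data_collator=DataCollatorWithPadding(tokenizer),  # pad per batch
)
trainer.train()
```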
This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language, along with transformer-based models trained using these datasets for named entity recognition, dependency parsing, and part-of-speech tagging tasks. Furthermore, we introduce the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results demonstrate notable improvements in the computational analysis of historical Turkish, achieving strong performance on tasks that require understanding of historical linguistic structures -- specifically, 90.29% F1 in named entity recognition, 73.79% LAS for dependency parsing, and 94.98% F1 for part-of-speech tagging. They also highlight remaining challenges, such as domain adaptation and language variation across time periods. All the resources and models presented are available at https://hf.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.
Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata's lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.
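For illustration, a query of the kind such a natural language interface would generate against Wikidata's lexicographic module might look like the following; this utterance/query pair is our example, not one drawn from the paper's 1.2 million mappings:

```python
# Hedged sketch: run a lexicographic SPARQL query against the Wikidata
# Query Service. The utterance and query below are illustrative.
import requests

# Utterance: "Which German nouns are recorded as lexemes?"
query = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q188 ;               # language: German
          wikibase:lexicalCategory wd:Q1084 ;  # lexical category: noun
          wikibase:lemma ?lemma .
}
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "lexeme-demo/0.1 (example)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["lexeme"]["value"], row["lemma"]["value"])
```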
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches. We describe a wide range of NLP tasks in two categories: natural language understanding (NLU) (covering tasks such as dialect classification, sentiment analysis, parsing, and NLU benchmarks) and natural language generation (NLG) (covering summarisation, machine translation, and dialogue systems). The survey is also broad in its coverage of languages, which includes English, Arabic, and German, among others. We observe that past NLP work on dialects goes deeper than mere dialect classification and extends to several NLU and NLG tasks. For these tasks, we describe classical machine learning using statistical models, along with recent deep learning approaches based on pre-trained language models. We expect that this survey will be useful to NLP researchers interested in building equitable language technologies by rethinking LLM benchmarks and model architectures.
Jafar Isbarov, Kavsar Huseynova, Elvin Mammadov, et al.
The emergence of multilingual large language models has enabled the development of language understanding and generation systems in Azerbaijani. However, most production-grade systems rely on cloud solutions, such as GPT-4. While there have been several attempts to develop open foundation models for Azerbaijani, these works have not found their way into common use due to a lack of systematic benchmarking. This paper encompasses several lines of work that promote open-source foundation models for Azerbaijani. We introduce (1) a large text corpus for Azerbaijani, (2) a family of encoder-only language models trained on this corpus, (3) labeled datasets for evaluating these models, and (4) an extensive evaluation that covers all major open-source models with Azerbaijani support.
Jakub Piskorski, Michał Marcińczuk, Roman Yangarber
This paper presents a corpus manually annotated with named entities for six Slavic languages: Bulgarian, Czech, Polish, Slovenian, Russian, and Ukrainian. This work is the result of a series of shared tasks conducted in 2017-2023 as part of the Workshops on Slavic Natural Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities. Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits: single-topic-out and cross-topics. For each split, we set benchmarks using a transformer-based neural network architecture with pre-trained multilingual models: XLM-RoBERTa-large for named entity mention recognition and categorization, and mT5-large for named entity lemmatization and linking.
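A minimal sketch of an XLM-RoBERTa-large token-classification setup of the kind used for the mention-recognition benchmarks; the BIO label list below is our assumption over five generic entity classes, not the corpus's exact tag set:

```python
# Hedged sketch: token classification with XLM-RoBERTa-large for NER.
# The label inventory and example sentence are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O",
          "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG",
          "B-PRO", "I-PRO", "B-EVT", "I-EVT"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

text = "Warszawa jest stolicą Polski."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits           # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, pid in zip(tokens, pred_ids):
    print(tok, labels[pid])               # head is untrained: output is random
```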
Large language models (LLMs) represent a major advancement in AI and have been applied to many natural language processing tasks. Nevertheless, in many business scenarios, an LLM requires fine-tuning by engineers to achieve satisfactory performance, and the cost of fine-tuning may not be matched by the gain in target performance. Based on the Baidu STI dataset, we study the upper bound of the performance that classical information retrieval methods can achieve for a specific business task, and compare it with the cost and performance of the LLM-based approaches of the participating teams. This paper offers insight into the potential of classical computational linguistics algorithms, which can help decision-makers choose sensibly between LLMs and low-cost methods in business R&D.
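As a concrete example of such a classical IR baseline, the sketch below uses BM25 via the `rank_bm25` package; the toy corpus and query are illustrative, not drawn from the Baidu STI dataset:

```python
# Hedged sketch: BM25 ranking as a low-cost classical IR baseline.
from rank_bm25 import BM25Okapi

corpus = [
    "how to reset the router password",
    "router firmware update instructions",
    "troubleshooting wifi connection drops",
]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "reset router password".split()
scores = bm25.get_scores(query)          # one relevance score per document
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best], scores[best])
```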
Speech recognition, a subfield of computational linguistics and natural language processing concerned with communication between computers and humans, has a long history. Automatic Speech Recognition (ASR), Text-To-Speech (TTS), speech-to-text, Continuous Speech Recognition (CSR), and interactive voice response systems are different approaches to solving problems in this area. Performance improvements are partially attributable to the ability of Deep Neural Networks (DNNs) to model complex correlations in speech features. Whereas conventional models for sequential data such as voice employ Recurrent Neural Networks (RNNs), the emergence of new deep network architectures and the strong performance of Convolutional Neural Networks (CNNs) in image processing and feature extraction have led to CNNs being applied in other domains. In this paper, it is shown that prosodic features of the Persian language can be extracted via CNNs for segmenting and labeling speech for short texts. Using 128 and 200 filters for the CNN and special architectures, a detection error rate of 19.46 and better time consumption than with RNNs were obtained. In addition, the CNN simplifies the learning procedure. Experimental results show that CNN networks can be a good feature extractor for speech recognition in various languages.
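A sketch of the general CNN-as-feature-extractor idea for speech; the filter counts follow the 128/200 figures mentioned above, while the input shape, layer stack, and label count are our assumptions, not the paper's architecture:

```python
# Hedged sketch: a 1D CNN over speech feature frames (e.g. MFCCs) for
# segment labeling. Dimensions and classes are illustrative assumptions.
import tensorflow as tf

NUM_FRAMES, NUM_FEATURES, NUM_LABELS = 200, 13, 5  # assumed dimensions

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FRAMES, NUM_FEATURES)),
    tf.keras.layers.Conv1D(128, kernel_size=5, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(200, kernel_size=5, activation="relu", padding="same"),
    tf.keras.layers.GlobalMaxPooling1D(),   # fixed-length feature vector
    tf.keras.layers.Dense(NUM_LABELS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```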
Berta Chulvi, M. Molpeceres, María F. Rodrigo, et al.
This study uses natural language processing (NLP) tools to analyze how politicians recreate the stereotype of immigrants in the Spanish Parliament. An interdisciplinary approach from computational linguistics and social psychology has been used to construct a variety of indices about content and linguistic styles. The analysis of 2,516 parliamentary interventions about immigration delivered between 1996 and 2016 by representatives of the two political parties that alternated in power during that period (conservative Popular Party and Spanish Socialist Party) shows that both the rhetorical strategy to present immigrants as “victims” or as a “threat” and the language style that politicians use reveal an interaction between the ideology of the party and the party's political position in government or in the opposition. Results also suggest some changes over time in the polarization and politicization of the debate about immigration.
This research analyzes the use of abbreviations in the Kazakh language using a machine learning approach. The most commonly used abbreviations in Kazakh texts were studied, analyzed, and classified. Several machine learning models, including naive Bayes, neural networks, and support vector machines (SVMs), were tested. For the experimental part, a linguistic corpus was used (Abbreviations-abb.xml and abbreviations.csv). The results showed that the SVM outperformed the other models with an accuracy of 0.85. The unique features of the Kazakh language were also discussed, such as the use of the Cyrillic alphabet and complex word forms. The implications of this study for natural language processing and computational linguistics were presented, and the limitations of the study were discussed. This study contributes to understanding the use of abbreviations in the Kazakh language and demonstrates the potential of machine learning approaches for analyzing languages with complex characteristics.
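A minimal sketch of the kind of SVM setup that performed best; the file name follows the abstract's abbreviations.csv, but the column names and feature choice are our assumptions:

```python
# Hedged sketch: TF-IDF + linear SVM for classifying Kazakh abbreviations.
# Column names ("abbreviation", "category") are assumptions about the CSV.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

df = pd.read_csv("abbreviations.csv")  # file named in the abstract
X_train, X_test, y_train, y_test = train_test_split(
    df["abbreviation"], df["category"], test_size=0.2, random_state=42)

# Character n-grams cope with Cyrillic script and complex word forms.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
    LinearSVC(),
)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```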
Abstract: In this work, our objective is to explain how Boris Cyrulnik was able to overcome the psychological wounds he suffered at an early age and become one of the greatest French and international psychoanalysts of today. We also draw on the cases of several classic French writers who experienced nearly the same circumstances in their early childhoods. Cyrulnik emerged from his trauma victorious thanks to the written word; through his book La nuit, j'écrirai des soleils, he seeks to demonstrate and clarify the therapeutic secret of writing, which made him a magician of modern psychology.
Keywords: psychological wound, trauma, writing, writer, childhood
Arts in general, Computational linguistics. Natural language processing