Results for "Computational linguistics. Natural language processing"

Showing 20 of ~8,171,552 results · from CrossRef, DOAJ, arXiv, Semantic Scholar

S2 Open Access 2020
CLUE: A Chinese Language Understanding Evaluation Benchmark

Liang Xu, Xuanwei Zhang, Lu Li et al.

The advent of natural language understanding (NLU) benchmarks for English, such as GLUE and SuperGLUE, allows new NLU models to be evaluated across a diverse set of tasks. These comprehensive benchmarks have facilitated a broad range of research and applications in natural language processing (NLP). The problem, however, is that most such benchmarks are limited to English, which has made it difficult to replicate many of the successes in English NLU for other languages. To help remedy this issue, we introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark. CLUE is an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text. To establish results on these tasks, we report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models (9 in total). We also introduce a number of supplementary datasets and additional tools to help facilitate further progress on Chinese NLU. Our benchmark is released at https://www.cluebenchmarks.com

444 citations · en · Computer Science
DOAJ Open Access 2026
Class-wise guided weighted soft voting for deep learning-based date palm nutrient deficiency classification

Abdelaaziz Hessane, El Arbi Abdellaoui Alaoui, Amine El Hanafy et al.

Abstract Diagnosing nutrient deficiencies in date palm (Phoenix dactylifera L.) is challenging due to the high visual similarity of symptoms, such as those between magnesium and potassium deficiencies, making classic subjective methods unreliable. While automated deep learning models offer an alternative, the reliability of individual models is a key concern; a statistical evaluation over five independent runs confirmed that while a strong model like ConvNeXtTiny can establish a near-perfect performance ceiling (macro F1-score of 0.9969 ± 0.0028), weaker architectures like MobileNetV2 are highly unstable and less accurate (macro F1-score of 0.9219 ± 0.0486), posing a significant risk for reliable deployment. To mitigate this unreliability, we proposed and evaluated a Class-wise Guided Weighted Soft Voting (CG-WSV) ensemble heuristic. The empirical results establish a new, statistically robust performance benchmark, with the proposed CG-WSV ensemble achieving a high-performance macro F1-score of 0.9971 ± 0.0027. This performance matched that of Unweighted and Globally Weighted Soft Voting baselines, demonstrated a 0.33% relative improvement over Hard Voting, and represented a significant relative increase of 8.16% over the unstable MobileNetV2 model. The gains over the stronger individual models were 0.54% (vs. EfficientNetB0) and 0.02% (vs. ConvNeXtTiny), confirming its ability to match the observed performance ceiling. Critically, all soft voting ensembles, including CG-WSV, demonstrated exceptional stability by completely mitigating the high variance of the weaker model, validating CG-WSV as a robust strategy for ensuring reliable diagnostic accuracy and providing a definitive statistical benchmark for this agricultural diagnostic problem.
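The core idea of class-wise weighted soft voting, as the abstract describes it, is to weight each model's predicted probabilities per class before summing. A minimal sketch follows; this is not the authors' implementation, and the probabilities and per-class weights are hypothetical (in practice the weights might come from per-class validation F1 scores):

```python
def cg_wsv(prob_lists, class_weights):
    """Class-wise weighted soft voting.

    prob_lists[m][c]    -- model m's predicted probability for class c
    class_weights[m][c] -- model m's weight for class c (hypothetical values)
    Returns the index of the winning class.
    """
    n_classes = len(prob_lists[0])
    scores = []
    for c in range(n_classes):
        total_w = sum(w[c] for w in class_weights)
        # Weighted average of the models' probabilities for class c.
        s = sum(p[c] * w[c] for p, w in zip(prob_lists, class_weights)) / total_w
        scores.append(s)
    return max(range(n_classes), key=scores.__getitem__)

# Two models, three classes: model 0 is trusted more on class 1.
probs = [[0.2, 0.5, 0.3],
         [0.4, 0.3, 0.3]]
weights = [[1.0, 2.0, 1.0],
           [1.0, 0.5, 1.0]]
winner = cg_wsv(probs, weights)  # class 1 wins (weighted score 0.46)
```

Because the weighting is per class rather than global, a model that is unreliable on one class (as MobileNetV2 reportedly was) can be down-weighted there without discarding its votes elsewhere.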

Computational linguistics. Natural language processing, Electronic computers. Computer science
S2 Open Access 2025
Learning Language through Grounding

Freda Shi, Ziqiao Ma, Jiayuan Mao et al.

Grounding has been a long-standing concept in natural language processing (NLP) and computational linguistics (CL). This tutorial provides a historical overview and introduces recent advances in learning language through grounding, with a particular emphasis on the latter. We will begin by tracing the history of grounding and presenting a unified perspective on the term. In Parts II to IV, we will delve into recent progress in learning lexical semantics, syntax, and complex meanings through various forms of grounding. We will conclude by discussing future directions and open challenges, particularly those related to the growing trend of large language models and scaling.

DOAJ Open Access 2025
Comparative analysis of AI regulation for fintech cybersecurity and privacy in the European Union and Qatar

Hafiz Muhammad Saqib, Hira Amin

Abstract Artificial Intelligence (AI) is transforming the finance industry, resulting in exponential growth opportunities as well as significant risks and challenges. This paper is a document analysis of AI regulatory frameworks and guidelines in the fintech sectors of the European Union (EU) and Qatar. It aims to identify relevant lessons and propose actionable, contextualised recommendations. It compares regulations central to governance, oversight, enforcement and the institutional architecture of AI regulation related specifically to fintech, cybersecurity and privacy. The paper illustrates that the EU leans towards risk management with clear, structured and comprehensive regulations with strict enforcement. On the other hand, Qatar is a small state with a highly ambitious vision of reform that wants to foster an environment conducive to innovative start-ups. This leads to regulatory flexibility and non-binding guidelines without enforcement mechanisms. The paper recommends that Qatar strengthen its AI governance in four ways: first, adopt a tiered, risk-based regulatory approach similar to the EU's, while making only measures deemed high-risk legally binding; second, establish a centralised AI governance unit; third, leverage its small size and research ecosystem to develop local innovation through regulatory sandboxes and public–private partnerships; and fourth, tap into the lucrative market of Shariah-compliant financial services.

Computational linguistics. Natural language processing, Electronic computers. Computer science
arXiv Open Access 2025
BNLI: A Linguistically-Refined Bengali Dataset for Natural Language Inference

Farah Binta Haque, Md Yasin, Shishir Saha et al.

Despite the growing progress in Natural Language Inference (NLI) research, resources for the Bengali language remain extremely limited. Existing Bengali NLI datasets exhibit several inconsistencies, including annotation errors, ambiguous sentence pairs, and inadequate linguistic diversity, which hinder effective model training and evaluation. To address these limitations, we introduce BNLI, a refined and linguistically curated Bengali NLI dataset designed to support robust language understanding and inference modeling. The dataset was constructed through a rigorous annotation pipeline emphasizing semantic clarity and balance across entailment, contradiction, and neutrality classes. We benchmarked BNLI using a suite of state-of-the-art transformer-based architectures, including multilingual and Bengali-specific models, to assess their ability to capture complex semantic relations in Bengali text. The experimental findings highlight the improved reliability and interpretability achieved with BNLI, establishing it as a strong foundation for advancing research in Bengali and other low-resource language inference tasks.

en cs.CL
S2 Open Access 2023
A survey on semantic processing techniques

Rui Mao, Kai He, Xulang Zhang et al.

Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyzed five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.

45 citations · en · Computer Science
S2 Open Access 2024
The Philosophy of Theoretical Linguistics

R. Nefdt

What is the remit of theoretical linguistics? How are human languages different from animal calls or artificial languages? What philosophical insights about language can be gleaned from phonology, pragmatics, probabilistic linguistics, and deep learning? This book addresses the current philosophical issues at the heart of theoretical linguistics, which are widely debated not only by linguists, but also philosophers, psychologists, and computer scientists. It delves into hitherto uncharted territory, putting philosophy in direct conversation with phonology, sign language studies, supersemantics, computational linguistics, and language evolution. A range of theoretical positions are covered, from optimality theory and autosegmental phonology to generative syntax, dynamic semantics, and natural language processing with deep learning techniques. By both unwinding the complexities of natural language and delving into the nature of the science that studies it, this book ultimately improves our tools of discovery aimed at one of the most essential features of our humanity, our language.

S2 Open Access 2024
Linguistics across Disciplinary Borders

Steven Coats, Veronika Laippala

This volume highlights the ways in which recent developments in corpus linguistics and natural language processing can engage with topics across language studies, humanities and social science disciplines. New approaches have emerged in recent years that blur disciplinary boundaries, facilitated by factors such as the application of computational methods, access to large data sets, and the sharing of code, as well as continual advances in technologies related to data storage, retrieval, and processing. The “march of data” denotes an area at the border region of linguistics, humanities, and social science disciplines, but also the inevitable development of the underlying technologies that drive analysis in these subject areas. Organized into three sections, the chapters are connected by the underlying thread of linguistic corpora: how they can be created, how they can shed light on varieties or registers, and how their metadata can be utilized to better understand the internal structure of similar resources. While some chapters in the volume make use of well-established existing corpora, others analyze data from platforms such as YouTube, Twitter or Reddit. The volume provides insight into the diversity of methods, approaches, and corpora that inform our understanding of the “border regions” between the realms of data science, language/linguistics, and social or cultural studies.

CrossRef Open Access 2024
Computational linguistics and natural language processing techniques for semantic field extraction in Arabic online news

Maulana Ihsan Ahmad, Moh. Kanif Anwari

The research aimed to extract semantic fields from Arabic online news and advance Natural Language Processing (NLP) applications in understanding and managing news information effectively. It provides a comprehensive approach to processing and analyzing large volumes of Arabic news data by integrating semantic field analysis, NLP, and computational linguistics. Using quantitative methods, Arabic news articles were collected and processed with Python, a popular programming language in data analysis, and various NLP techniques and machine learning models were applied to accurately extract semantic fields. The primary objective was to evaluate the effectiveness of different classification models in categorizing Arabic news and to identify the most suitable model for semantic field extraction. The research evaluated five classification models: Naive Bayes, Support Vector Machine (SVM), Logistic Regression, Random Forest, and Gradient Boosting. Among these, SVM achieved the highest overall accuracy of 90%. Specifically, SVM demonstrated exceptional performance in categorizing sports-related news, with a 99% probability and an F1-Score of 98%. However, it faced challenges in categorizing health and science news, achieving a lower F1-Score of 79%. Overall, the study demonstrated the effectiveness of computational methods, particularly SVM, in classifying Arabic news and extracting semantic fields, thereby advancing NLP and computational linguistics. The findings highlighted the potential of SVM for accurate news analysis and the need for further enhancement of NLP techniques to address multilingual and domain-specific challenges.
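The abstract does not specify the feature representation, but news-classification pipelines like this one typically feed TF-IDF vectors into the SVM. A minimal pure-Python sketch of TF-IDF weighting, under that assumption and with toy documents invented for illustration:

```python
import math
from collections import Counter

def tfidf(docs):
    """Map each tokenized document to a {term: tf-idf weight} dict.

    Uses raw term frequency and idf = ln(N / df), the textbook form;
    library implementations (e.g. scikit-learn) smooth these slightly.
    """
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for d in docs for t in set(d))
    out = []
    for d in docs:
        tf = Counter(d)
        out.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return out

docs = [["sports", "match", "goal"],
        ["sports", "health"],
        ["health", "science"]]
vecs = tfidf(docs)
# "match" appears in only one document, so in doc 0 it outweighs
# "sports", which appears in two documents.
```

Rare, topic-specific terms like "match" get high weights, which is why this representation separates categories such as sports well, echoing the SVM's strong sports F1-score reported above.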

DOAJ Open Access 2024
UKF‐MOT: An unscented Kalman filter‐based 3D multi‐object tracker

Meng Liu, Jianwei Niu, Yu Liu

Abstract Multi‐object tracking in autonomous driving is a non‐linear problem. To better address the tracking problem, this paper leveraged an unscented Kalman filter to predict the object's state. In the association stage, the Mahalanobis distance was employed as an affinity metric, and a Non‐minimum Suppression method was designed for matching. With the detections fed into the tracker and continuous ‘predicting‐matching’ steps, the states of each object at different time steps were described as their own continuous trajectories. We conducted extensive experiments to evaluate tracking accuracy on three challenging datasets (KITTI, nuScenes and Waymo). The experimental results demonstrated that our method effectively achieved multi‐object tracking with satisfactory accuracy and real‐time efficiency.
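The association step described above scores predicted-state/detection pairs by Mahalanobis distance, which scales Euclidean distance by the state covariance. A minimal 2-D sketch, not the authors' code; the closed-form 2×2 inverse and the example values are illustrative only:

```python
def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance for a 2-D vector.

    cov is a symmetric 2x2 covariance [[a, b], [b, d]],
    inverted in closed form.
    """
    a, b = cov[0]
    _, d = cov[1]
    det = a * d - b * b
    inv = [[d / det, -b / det], [-b / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    # dx^T * inv(cov) * dx
    return (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))

# With identity covariance this reduces to squared Euclidean
# distance: 3^2 + 4^2 = 25.
d_iso = mahalanobis_sq([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# A larger variance along the first axis shrinks the distance there,
# so an uncertain prediction still matches a nearby detection.
d_wide = mahalanobis_sq([3.0, 0.0], [0.0, 0.0], [[9.0, 0.0], [0.0, 1.0]])
```

In a tracker, pairs whose distance exceeds a gating threshold are excluded before the matching step.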

Computational linguistics. Natural language processing, Computer software
DOAJ Open Access 2024
Archaeological pottery from Kenekeuy (Burkina Faso): from descriptive and analytical study toward an identification of its makers

Yves Pascal Zossin SANOU & Charles KIENOU

Abstract: Continued archaeological research in Burkina Faso, aimed at contributing to a comprehensive archaeological inventory of the country, led us to Kenekuy, an area in Kossi province where archaeological surveys had identified sites of significant pottery production in the past. A descriptive and analytical study of the ceramic material collected at two sites in this area offered a retrospective view of the artistic skill and production methods of our distant ancestors. The morphological, technical, and functional aspects studied in the Kenekuy ceramics provided rich and interesting information, not only on the settlement history of the area, but also toward identifying the probable makers of this pottery. Keywords: Burkina Faso, Kenekuy, archaeology, pottery, settlement.

Arts in general, Computational linguistics. Natural language processing
arXiv Open Access 2024
LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language

James Requeima, John Bronskill, Dami Choi et al.

Machine learning practitioners often face significant challenges in formally integrating their prior knowledge and beliefs into predictive models, limiting the potential for nuanced and context-aware analyses. Moreover, the expertise needed to integrate this prior knowledge into probabilistic modeling typically limits the application of these models to specialists. Our goal is to build a regression model that can process numerical data and make probabilistic predictions at arbitrary locations, guided by natural language text which describes a user's prior knowledge. Large Language Models (LLMs) provide a useful starting point for designing such a tool since they 1) provide an interface where users can incorporate expert insights in natural language and 2) provide an opportunity for leveraging latent problem-relevant knowledge encoded in LLMs that users may not have themselves. We start by exploring strategies for eliciting explicit, coherent numerical predictive distributions from LLMs. We examine these joint predictive distributions, which we call LLM Processes, over arbitrarily many quantities in settings such as forecasting, multi-dimensional regression, black-box optimization, and image modeling. We investigate the practical details of prompting to elicit coherent predictive distributions, and demonstrate their effectiveness at regression. Finally, we demonstrate the ability to usefully incorporate text into numerical predictions, improving predictive performance and giving quantitative structure that reflects qualitative descriptions. This lets us begin to explore the rich, grounded hypothesis space that LLMs implicitly encode.

en stat.ML, cs.CL
arXiv Open Access 2024
Energy-Based Models with Applications to Speech and Language Processing

Zhijian Ou

Energy-Based Models (EBMs) are an important class of probabilistic models, also known as random fields and undirected graphical models. EBMs are un-normalized and thus radically different from other popular self-normalized probabilistic models such as hidden Markov models (HMMs), autoregressive models, generative adversarial nets (GANs) and variational auto-encoders (VAEs). Over the past years, EBMs have attracted increasing interest not only from the core machine learning community, but also from application domains such as speech, vision, natural language processing (NLP) and so on, due to significant theoretical and algorithmic progress. The sequential nature of speech and language also presents special challenges and needs a different treatment from processing fixed-dimensional data (e.g., images). Therefore, the purpose of this monograph is to present a systematic introduction to energy-based models, including both algorithmic progress and applications in speech and language processing. First, the basics of EBMs are introduced, including classic models, recent models parameterized by neural networks, sampling methods, and various learning methods from the classic learning algorithms to the most advanced ones. Then, the application of EBMs in three different scenarios is presented, i.e., for modeling marginal, conditional and joint distributions, respectively. 1) EBMs for sequential data with applications in language modeling, where the main focus is on the marginal distribution of a sequence itself; 2) EBMs for modeling conditional distributions of target sequences given observation sequences, with applications in speech recognition, sequence labeling and text generation; 3) EBMs for modeling joint distributions of both sequences of observations and targets, and their applications in semi-supervised learning and calibrated natural language understanding.
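The "un-normalized" property the abstract emphasizes is that an EBM defines p(x) ∝ exp(−E(x)), with the partition function Z usually intractable. On a toy discrete state space, Z can be computed by brute-force enumeration, which makes the definition concrete; the quadratic energy function here is purely hypothetical:

```python
import math

def energy(x):
    # Toy energy: lowest (most probable) at x = 1. Real EBMs
    # parameterize E with a neural network over sequences.
    return (x - 1) ** 2

states = list(range(-3, 4))
# Partition function Z = sum over states of exp(-E(x)); for sequence
# models this sum is intractable, which is what makes EBM learning hard.
Z = sum(math.exp(-energy(x)) for x in states)
p = {x: math.exp(-energy(x)) / Z for x in states}
mode = max(p, key=p.get)  # the energy minimum is the distribution's mode
```

Learning methods surveyed in such monographs (e.g. noise-contrastive estimation, MCMC-based maximum likelihood) exist precisely to sidestep computing Z exactly.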

en cs.LG, cs.CL
arXiv Open Access 2024
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model

Jiahao Huo, Yibo Yan, Boren Hu et al.

Projecting visual features into word embedding space has become a significant fusion strategy adopted by Multimodal Large Language Models (MLLMs). However, its internal mechanisms have yet to be explored. Inspired by multilingual research, we identify domain-specific neurons in multimodal large language models. Specifically, we investigate the distribution of domain-specific neurons and the mechanism of how MLLMs process features from diverse domains. Furthermore, we propose a three-stage mechanism for language model modules in MLLMs when handling projected image features, and verify this hypothesis using logit lens. Extensive experiments indicate that while current MLLMs exhibit Visual Question Answering (VQA) capability, they may not fully utilize domain-specific information. Manipulating domain-specific neurons properly will result in a 10% change of accuracy at most, shedding light on the development of cross-domain, all-encompassing MLLMs in the future. The source code is available at https://github.com/Z1zs/MMNeuron.

arXiv Open Access 2024
Can Large Language Models Automatically Jailbreak GPT-4V?

Yuanwei Wu, Yue Huang, Yixin Liu et al.

GPT-4V has attracted considerable attention due to its extraordinary capacity for integrating and processing multimodal information. At the same time, its face recognition capability raises new safety concerns about privacy leakage. Despite researchers' efforts in safety alignment through RLHF or preprocessing filters, vulnerabilities might still be exploited. In our study, we introduce AutoJailbreak, an innovative automatic jailbreak technique inspired by prompt optimization. We leverage Large Language Models (LLMs) for red-teaming to refine the jailbreak prompt and employ weak-to-strong in-context learning prompts to boost efficiency. Furthermore, we present an effective search method that incorporates early stopping to minimize optimization time and token expenditure. Our experiments demonstrate that AutoJailbreak significantly surpasses conventional methods, achieving an Attack Success Rate (ASR) exceeding 95.3%. This research sheds light on strengthening GPT-4V security, underscoring the potential for LLMs to be exploited in compromising GPT-4V integrity.

en cs.CL
S2 Open Access 2023
Making Sense of Citizens’ Input through Artificial Intelligence: A Review of Methods for Computational Text Analysis to Support the Evaluation of Contributions in Public Participation

Julia Romberg, Tobias Escher

Public sector institutions that consult citizens to inform decision-making face the challenge of evaluating the contributions made by citizens. This evaluation has important democratic implications but at the same time, consumes substantial human resources. However, until now the use of artificial intelligence such as computer-supported text analysis has remained an under-studied solution to this problem. We identify three generic tasks in the evaluation process that could benefit from natural language processing (NLP). Based on a systematic literature search in two databases on computational linguistics and digital government, we provide a detailed review of existing methods and their performance. While some promising approaches exist, for instance to group data thematically and to detect arguments and opinions, we show that there remain important challenges before these could offer any reliable support in practice. These include the quality of results, the applicability to non-English corpora and making algorithmic models available to practitioners through software. We discuss a number of avenues that future research should pursue that can ultimately lead to solutions for practice. The most promising of these bring in the expertise of human evaluators, for example through active learning approaches or interactive topic modeling.

25 citations · en · Computer Science
S2 Open Access 2023
The Role of Typological Feature Prediction in NLP and Linguistics

Johannes Bjerva

Abstract Computational typology has gained traction in the field of Natural Language Processing (NLP) in recent years, as evidenced by the increasing number of papers on the topic and the establishment of a Special Interest Group on the topic (SIGTYP), including the organization of successful workshops and shared tasks. A considerable amount of work in this sub-field is concerned with prediction of typological features, for example, for databases such as the World Atlas of Language Structures (WALS) or Grambank. Prediction is argued to be useful either because (1) it allows for obtaining feature values for relatively undocumented languages, alleviating the sparseness in WALS, in turn argued to be useful for both NLP and linguistics; and (2) it allows us to probe models to see whether or not these typological features are encapsulated in, for example, language representations. In this article, we present a critical stance concerning prediction of typological features, investigating to what extent this line of research is aligned with purported needs—both from the perspective of NLP practitioners, and perhaps more importantly, from the perspective of linguists specialized in typology and language documentation. We provide evidence that this line of research in its current state suffers from a lack of interdisciplinary alignment. Based on an extensive survey of the linguistic typology community, we present concrete recommendations for future research in order to improve this alignment between linguists and NLP researchers, beyond the scope of typological feature prediction.

7 citations · en · Computer Science

Page 15 of 408,578