Results for "Computational linguistics. Natural language processing"

Showing 20 of ~8,172,554 results · from CrossRef, DOAJ, arXiv, Semantic Scholar

DOAJ Open Access 2025
THE INSCRIPTION OF DIPLOMACY IN THE CONTEMPORARY NOVEL: A COMPARATIVE READING OF LE SUSPENDU DE CONAKRY BY JEAN-CHRISTOPHE RUFIN AND SHANGHAI 2040 BY JEAN-LOUIS ROY

Louis Hervé NGAFOMO

Abstract: Le Suspendu de Conakry (2018), by the French author Jean-Christophe Rufin, is a narrative that offers the reader a plot of political crisis driven by diplomatic intrigue. Aurel Timescu, the French Consul in Guinea-Conakry, symbolises a cooperation shadowed by suspicion over the transfer of diplomatic attributes. Shanghai 2040 (2021) by Jean-Louis Roy, for its part, reflects a saga of diplomatic revolution driven by a diplomatic utopia: relocating the seat of the United Nations to Shanghai by 2040. Wei Shu embodies this political leadership. Both texts raise the problem of the aesthetics of diplomacy through the stakes of the political imaginary. The aim of this paper is to bring out the scriptural modalities of diplomacy in the two works. What political variables of diplomacy emerge from these novels? How does the writing help reveal the markers of diplomacy within these two imaginaries? From what angle does the inscription of diplomacy point towards a political renewal akin to a "poetics of relation" in Jean-Louis Roy and Jean-Christophe Rufin? The methodological framework favoured in this reflection is Alain Montandon's sociopoetics, centred on social representations as dynamic elements of literary creation. As for the theoretical framework, Jean-François Lyotard's Postmodernism serves as the ideological reference point for examining diplomacy, with the comparative approach of Claude Pichois and André-Michel Rousseau as a critical factor. Keywords: Diplomacy, novel, relation, Roy, Rufin.

Arts in general, Computational linguistics. Natural language processing
DOAJ Open Access 2025
RDHNet: Reversible Data Hiding Method for Securing Colour Images Using AlexNet and Watershed Transform in a Fusion Domain

Mohamed Meselhy Eltoukhy, Faisal S. Alsubaei, Mostafa M. Abdel‐Aziz et al.

ABSTRACT Medical images play a crucial role in diagnosis, treatment procedures and overall healthcare. Nevertheless, they also pose substantial risks to patient confidentiality and safety. Safeguarding the confidentiality of patients' data has become an urgent and practical concern. We present a novel approach for reversible data hiding in colour medical images. In a hybrid domain, we employ AlexNet, tuned with the watershed transform (WST) and L-shaped fractal Tromino encryption. Our approach commences by constructing the host image's feature vector using a pre-trained AlexNet model. Next, we use the watershed transform to convert the extracted feature vector into a topographic-map vector, which we then encrypt using an L-shaped fractal Tromino cryptosystem. We embed the secret image in the transformed image vector using a histogram-based embedding strategy to enhance payload and visual fidelity. Under attack-free conditions, RDHNet exhibits robust performance, can be reversed to the original image and maintains a visually appealing stego image, with an average PSNR of 73.14 dB, an SSIM of 0.9999, and perfect values of NC = 1 and BER = 0. The proposed RDHNet demonstrates a robust ability to withstand detrimental geometric and noise-adding attacks as well as various steganalysis methods. Furthermore, RDHNet demonstrates efficacy in tackling contemporary confidentiality issues.
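The "histogram-based embedding strategy" mentioned in the abstract is, in the reversible data hiding literature, typically a histogram-shifting scheme. The sketch below shows that core idea alone in pure Python, on a flat list of pixel values; it is not the paper's pipeline (no AlexNet features, watershed transform or fractal encryption), and all names are our own.

```python
def embed(pixels, bits):
    """Histogram-shift embedding: hide bits in pixels equal to the
    histogram peak, using an empty (zero) bin above it as slack."""
    hist = {}
    for v in pixels:
        hist[v] = hist.get(v, 0) + 1
    peak = max(hist, key=hist.get)                       # most frequent value
    zero = next(v for v in range(peak + 1, 256)          # first empty bin above
                if hist.get(v, 0) == 0)
    out, it = [], iter(bits)
    for v in pixels:
        if peak < v < zero:
            out.append(v + 1)                            # shift to free bin peak+1
        elif v == peak:
            b = next(it, None)                           # capacity = count of peak pixels
            out.append(v if b in (None, 0) else v + 1)   # 0 -> peak, 1 -> peak+1
        else:
            out.append(v)
    return out, peak, zero

def extract(stego, peak, zero):
    """Recover the hidden bits and the exact original pixels."""
    bits, orig = [], []
    for v in stego:
        if v == peak:
            bits.append(0); orig.append(peak)
        elif v == peak + 1:
            bits.append(1); orig.append(peak)
        elif peak + 1 < v <= zero:
            orig.append(v - 1)                           # undo the shift
        else:
            orig.append(v)
    return bits, orig
```

The scheme is reversible because the shift creates a bin that only embedded 1-bits can occupy, so extraction can undo every change exactly; capacity is limited to the peak-bin count.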

Computational linguistics. Natural language processing, Computer software
arXiv Open Access 2025
Gender Encoding Patterns in Pretrained Language Model Representations

Mahdi Zakizadeh, Mohammad Taher Pilehvar

Gender bias in pretrained language models (PLMs) poses significant social and ethical challenges. Despite growing awareness, there is a lack of comprehensive investigation into how different models internally represent and propagate such biases. This study adopts an information-theoretic approach to analyze how gender biases are encoded within various encoder-based architectures. We focus on three key aspects: identifying how models encode gender information and biases, examining the impact of bias mitigation techniques and fine-tuning on the encoded biases and their effectiveness, and exploring how model design differences influence the encoding of biases. Through rigorous and systematic investigation, our findings reveal a consistent pattern of gender encoding across diverse models. Surprisingly, debiasing techniques often exhibit limited efficacy, sometimes inadvertently increasing the encoded bias in internal representations while reducing bias in model output distributions. This highlights a disconnect between mitigating bias in output distributions and addressing its internal representations. This work provides valuable guidance for advancing bias mitigation strategies and fostering the development of more equitable language models.
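An information-theoretic analysis of encoded gender typically measures how much information a representation carries about a gender label. As a minimal illustration (not the paper's exact estimator), the sketch below computes mutual information between two discrete variables, e.g. a binarised neuron activation and a binary gender label.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from paired samples of two
    discrete variables via the plug-in (empirical) distribution."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

Usage: threshold a hidden-unit activation at its median to get `xs`, pair it with gender labels `ys`; an MI near 1 bit means the unit encodes gender almost perfectly, near 0 means it carries none.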

en cs.CL, cs.AI
arXiv Open Access 2025
Can LLMs Interpret and Leverage Structured Linguistic Representations? A Case Study with AMRs

Ankush Raut, Xiaofeng Zhu, Maria Leonor Pacheco

This paper evaluates the ability of Large Language Models (LLMs) to leverage contextual information in the form of structured linguistic representations. Specifically, we examine the impact of encoding both short and long contexts using Abstract Meaning Representation (AMR) structures across a diverse set of language tasks. We perform our analysis using 8-bit quantized and instruction-tuned versions of Llama 3.1 (8B), Phi-3, and Mistral 7B. Our results indicate that, for tasks involving short contexts, augmenting the prompt with the AMR of the original language context often degrades the performance of the underlying LLM. However, for tasks that involve long contexts, such as dialogue summarization in the SAMSum dataset, this enhancement improves LLM performance, for example, by increasing the zero-shot cosine similarity score of Llama 3.1 from 66% to 76%. This improvement is more evident in the newer and larger LLMs, but does not extend to the older or smaller ones. In addition, we observe that LLMs can effectively reconstruct the original text from a linearized AMR, achieving a cosine similarity of 81% in the best-case scenario.
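The cosine-similarity scores quoted above compare a generated text against a reference in an embedding space. As a dependency-free illustration of the metric itself (the paper would use learned sentence embeddings, not the bag-of-words vectors assumed here):

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {key: count}."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def text_cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    return cosine(Counter(a.lower().split()), Counter(b.lower().split()))
```

A score of 1.0 means identical word distributions; 0.0 means no overlap, mirroring how the 66% → 76% improvement reported above is read.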

en cs.CL
DOAJ Open Access 2024
Classroom space arrangement in a French as a Foreign Language (FLE) class and its contribution to learning French at the tertiary level: the case of the University of Energy and Natural Resources (UENR), Sunyani, Ghana

Richard Baffour Okyere, Francis Douglas Appiah, Tahiru Djato & Magdalene Mensah

Abstract: This study analyses classroom space arrangement in a French as a Foreign Language (FLE) class at the tertiary level in Ghana. Discussions of foreign-language learning are usually limited to either the approach or the methodology used to deliver the course. Although methodology encompasses all the techniques and practical activities for delivering an FLE course, the arrangement of classroom space is often overlooked. This study examines the role of classroom space arrangement in a tertiary-level FLE class. For our methodology, we sampled four classroom layouts commonly used by FLE teachers at UENR, and then proposed a fifth spatial arrangement for an FLE course. The study found that classroom space arrangement plays a fundamental role in energising the class, helping to arouse learners' interest and improve their comprehension. Keywords: Classroom space arrangement, FLE, Pedagogy, Learning, University of Energy and Natural Resources (UENR)

Arts in general, Computational linguistics. Natural language processing
arXiv Open Access 2024
Comprehensive Implementation of TextCNN for Enhanced Collaboration between Natural Language Processing and System Recommendation

Xiaonan Xu, Zheng Xu, Zhipeng Ling et al.

Natural Language Processing (NLP) is an important branch of artificial intelligence that studies how to enable computers to understand, process, and generate human language. Text classification, which aims to assign text to predefined categories, is the most basic and classic task in NLP; indeed, most NLP tasks can be regarded as classification tasks. In recent years, deep learning has achieved great success in many research fields and has become a standard technology in NLP, widely integrated into text classification. Unlike numbers and images, text processing emphasizes fine-grained processing ability. Traditional text classification methods generally require preprocessing the input text data; they also need to obtain good sample features through manual annotation before applying classical machine learning algorithms for classification. This paper therefore analyzes the application status of deep learning in three core tasks of NLP: text representation, word-order modeling, and knowledge representation. It explores the improvement and synergy achieved through natural language processing in the context of text classification, while also taking into account the challenges posed by adversarial techniques in text generation, text classification, and semantic parsing. An empirical study on text classification tasks demonstrates the effectiveness of interactive integration training, particularly in conjunction with TextCNN, highlighting the significance of these advancements in text classification augmentation and enhancement.
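TextCNN's core operation is a 1-D convolution filter slid over a sequence of word embeddings, followed by max-over-time pooling, which yields one feature per filter regardless of sentence length. A minimal dependency-free sketch of that forward step (toy embeddings and a single filter; a real model learns many filters of several widths plus a classifier on top):

```python
def conv_max_pool(embeddings, filt):
    """One TextCNN feature: slide a filter of width k over the
    token-embedding sequence, apply ReLU, and max-pool over time.

    embeddings: list of d-dim vectors, one per token.
    filt:       list of k d-dim vectors (the filter weights).
    """
    k = len(filt)            # filter width in tokens
    d = len(filt[0])         # embedding dimension
    activations = []
    for i in range(len(embeddings) - k + 1):
        s = sum(filt[a][b] * embeddings[i + a][b]
                for a in range(k) for b in range(d))
        activations.append(max(0.0, s))   # ReLU
    return max(activations)               # max-over-time pooling
```

Because pooling keeps only the strongest window response, the feature fires on the best-matching n-gram wherever it appears in the sentence.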

en cs.CL, cs.AI
arXiv Open Access 2024
Natural Language Processing for Human Resources: A Survey

Naoki Otani, Nikita Bhutani, Estevam Hruschka

Advances in Natural Language Processing (NLP) have the potential to transform HR processes, from recruitment to employee management. While recent breakthroughs in NLP have generated significant interest in its industrial applications, a comprehensive overview of how NLP can be applied across HR activities is still lacking. This paper identifies opportunities for researchers and practitioners to harness NLP's transformative potential in this domain. We analyze key fundamental tasks such as information extraction and text classification, and their roles in downstream applications like recommendation and language generation, while also discussing ethical concerns. Additionally, we identify gaps in current research and encourage future work to explore holistic approaches for achieving broader objectives in this field.

en cs.CL
arXiv Open Access 2024
Conceptual In-Context Learning and Chain of Concepts: Solving Complex Conceptual Problems Using Large Language Models

Nishtha N. Vaidya, Thomas Runkler, Thomas Hubauer et al.

Science and engineering problems fall in the category of complex conceptual problems that require specific conceptual information (CI), such as math/logic-related know-how, process information, or engineering guidelines, to solve them. Large Language Models (LLMs) are promising agents for solving such complex conceptual problems due to their implications for advancing engineering and science tasks like assisted problem-solving. But vanilla LLMs, trained on open-world data, lack the necessary CI. In this work, we specifically explore shallow customization methods (SCMs) of LLMs for solving complex conceptual problems. We propose two novel SCM algorithms that augment LLMs with CI and enable them to solve complex conceptual problems: Conceptual In-Context Learning (C-ICL) and Chain of Concepts (CoC). The problem tackled in this paper is the generation of proprietary data models in the engineering/industry domain based on conceptual information in data-modelling guidelines. We evaluate our algorithms on various sizes of the OpenAI LLMs against four evaluation metrics related to syntactic and semantic correctness, time, and cost incurred. The proposed algorithms outperform currently popular LLM SCMs like In-context Learning (ICL) and Chain of Thought (CoT). Compared to CoT, response correctness increased by 30.6% and 29.88% for the new SCMs C-ICL and CoC, respectively. Qualitative analysis suggests that the proposed SCMs activate emergent capabilities in LLMs previously unobserved with the existing SCMs. They make problem-solving processes more transparent and reduce hallucinations and the tendency of model responses to copy examples from prompts (parroting).
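The abstract does not spell out the Chain of Concepts prompting procedure, but the general shape of such a method is to inject the conceptual information (CI) into the prompt one concept at a time before posing the problem. The sketch below is a hypothetical rendering of that idea; the wording and function name are ours, not the paper's.

```python
def chain_of_concepts_prompt(problem, concepts):
    """Assemble a prompt that walks the model through the conceptual
    information (CI) step by step before stating the problem.
    Hypothetical illustration of the CoC idea, not the paper's code."""
    steps = [f"Concept {i}: {c}" for i, c in enumerate(concepts, 1)]
    return "\n".join(
        ["Use the following concepts, in order, to solve the problem."]
        + steps
        + [f"Problem: {problem}",
           "Answer step by step, citing each concept used."]
    )
```

Making each concept an explicitly numbered step is what lets the response cite which CI it used, which is one way to make the problem-solving process more transparent, as the abstract describes.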

en cs.CL, cs.AI
arXiv Open Access 2024
BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language?

Zongmeng Zhang, Jinhua Zhu, Wengang Zhou et al.

Dense retrieval, which aims to encode the semantic information of arbitrary text into dense vector representations or embeddings, has emerged as an effective and efficient paradigm for text retrieval, consequently becoming an essential component in various natural language processing systems. These systems typically focus on optimizing the embedding space by attending to the relevance of text pairs, while overlooking the Boolean logic inherent in language, which may not be captured by current training objectives. In this work, we first investigate whether current retrieval systems can comprehend the Boolean logic implied in language. To answer this question, we formulate the task of Boolean Dense Retrieval and collect a benchmark dataset, BoolQuestions, which covers complex queries containing basic Boolean logic and corresponding annotated passages. Through extensive experimental results on the proposed task and benchmark dataset, we draw the conclusion that current dense retrieval systems do not fully understand Boolean logic in language, and there is a long way to go to improve our dense retrieval systems. Furthermore, to promote further research on enhancing the understanding of Boolean logic in language models, we explore Boolean operations on decomposed queries and propose a contrastive continual training method that serves as a strong baseline for the research community.
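One simple way to apply a Boolean operation to a decomposed query, as a baseline of the kind the abstract alludes to, is to score each document against the positive and negated sub-queries separately and subtract. The toy sketch below uses plain dot products over hand-made vectors; it is an illustration of the decomposition idea, not the paper's trained model.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def boolean_score(doc_vec, pos_vec, neg_vec, alpha=1.0):
    """Score a document for a query of the form 'A NOT B': reward
    similarity to the positive sub-query A, penalise similarity to
    the negated sub-query B. alpha weights the penalty."""
    return dot(doc_vec, pos_vec) - alpha * dot(doc_vec, neg_vec)
```

A plain single-vector retriever would rank a document mentioning both A and B at least as high as one mentioning only A; the subtraction makes the NOT clause actually demote it.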

en cs.IR, cs.CL
arXiv Open Access 2024
BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training

Nikitas Theodoropoulos, Giorgos Filandrianos, Vassilis Lyberatos et al.

We describe our contribution to the Strict and Strict-Small tracks of the 2nd iteration of the BabyLM Challenge. The shared task is centered around efficient pre-training given data constraints motivated by human development. In response, we study the effect of synthetic story data in language pre-training using TinyStories: a recently introduced dataset of short stories. Initially, we train GPT-Neo models on subsets of TinyStories, while varying the amount of available data. We find that, even with access to less than 100M words, the models are able to generate high-quality, original completions to a given story, and acquire substantial linguistic knowledge. To measure the effect of synthetic story data, we train LTG-BERT encoder models on a combined dataset of: a subset of TinyStories, story completions generated by GPT-Neo, and a subset of the BabyLM dataset. Our experimentation reveals that synthetic data can occasionally offer modest gains, but overall have a negative influence on linguistic understanding. Our work offers an initial study on synthesizing story data in low resource settings and underscores their potential for augmentation in data-constrained language modeling. We publicly release our models and implementation on our GitHub.

en cs.CL
DOAJ Open Access 2023
“We will not atikuloot our future!”

Helen Ufuoma Ugah

This study argues that stance-taking, a discourse strategy in linguistics, serves the purpose of giving citizens voices to engage politicians on different social media platforms and taking them to task on their political agendas during electioneering campaigns. It aims to evince the extent to which Nigerians utilise Nigerianness – the domestication and acculturation of English language, to express their stances about the various socio-political realities in Nigeria. It deploys online comments generated from 20 posts in Nairaland and Sahara Reporters about the Nigerian 2019 general elections, and applies Martin/White’s (2005) Appraisal Theory to address Nigerians’ functional use of Nigerianness to embody their perceptions of the Nigerian socio-political dynamics. The study evinces that online forums are public spaces that enable Nigerians to follow up on and participate in debates on political events in the country, and the use of Nigerianness gives insight into the expectations of Nigerians from their political leaders.

Computational linguistics. Natural language processing, Language. Linguistic theory. Comparative grammar
arXiv Open Access 2023
mahaNLP: A Marathi Natural Language Processing Library

Vidula Magdum, Omkar Dhekane, Sharayu Hiwarkhedkar et al.

We present mahaNLP, an open-source natural language processing (NLP) library built specifically for the Marathi language. It aims to enhance support for Marathi, a low-resource Indian language, in the field of NLP. It is an easy-to-use, extensible, and modular toolkit for Marathi text analysis built on state-of-the-art MahaBERT-based transformer models. Our work is significant because other existing Indic NLP libraries provide only basic Marathi processing support and rely on older models with limited performance. Our toolkit stands out by offering a comprehensive array of NLP tasks, encompassing both fundamental preprocessing tasks and advanced NLP tasks like sentiment analysis, NER, hate speech detection, and sentence completion. This paper gives an overview of the mahaNLP framework, its features, and its usage. This work is part of the L3Cube MahaNLP initiative; more information can be found at https://github.com/l3cube-pune/MarathiNLP.

en cs.CL, cs.LG
arXiv Open Access 2023
CodeKGC: Code Language Model for Generative Knowledge Graph Construction

Zhen Bi, Jing Chen, Yinuo Jiang et al.

Current generative knowledge graph construction approaches usually fail to capture structural knowledge because they simply flatten natural language into serialized texts or a specification language. However, large generative language models trained on structured data such as code have demonstrated impressive capability in understanding natural language for structural prediction and reasoning tasks. Intuitively, we address the task of generative knowledge graph construction with a code language model: given a code-format natural language input, the target is to generate triples, which can be framed as a code completion task. Specifically, we develop schema-aware prompts that effectively utilize the semantic structure within the knowledge graph. As code inherently possesses structure, such as class and function definitions, it serves as a useful model of prior semantic structural knowledge. Furthermore, we employ a rationale-enhanced generation method to boost performance. Rationales provide intermediate steps, thereby improving knowledge extraction abilities. Experimental results indicate that the proposed approach obtains better performance on benchmark datasets than the baselines. Code and datasets are available at https://github.com/zjunlp/DeepKE/tree/main/example/llm.
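The idea of a "code-format" input is to express (head, relation, tail) triples as code constructs, so that completing the code is completing the graph. The sketch below renders triples as instantiations of a tiny Python class; the schema is our illustrative choice, not the paper's exact prompt format.

```python
def triples_to_code(triples):
    """Render (head, relation, tail) triples as Python-style code,
    in the spirit of code-format KG-construction prompts.
    The Triple class schema here is illustrative only."""
    lines = [
        "class Triple:",
        "    def __init__(self, head, relation, tail):",
        "        self.head, self.relation, self.tail = head, relation, tail",
        "",
    ]
    for h, r, t in triples:
        lines.append(f'Triple("{h}", "{r}", "{t}")')
    return "\n".join(lines)
```

Prompting a code LM with the class definition plus a few `Triple(...)` examples, then the input text, turns triple extraction into code completion: the model's continuation is a list of well-formed `Triple(...)` lines.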

en cs.CL, cs.AI
arXiv Open Access 2023
Evolution of Natural Language Processing Technology: Not Just Language Processing Towards General Purpose AI

Masahiro Yamamoto

Since the invention of computers, communication through natural language (actual human language) has been a dream technology. However, natural language is extremely difficult to formulate mathematically, which makes it hard to realize as an algorithm. While there have been numerous technological developments, until recently no result allowed truly free utilization. Language learning in humans, whether of the mother tongue or a foreign language, largely follows the adage "practice makes perfect", even though the learning method matters up to a point. Deep learning has played a central role in contemporary AI technology in recent years, and when applied to natural language processing (NLP) it produced unprecedented results. Achievements exceeding initial predictions have been reported from learning vast amounts of textual data with deep learning: for instance, the four arithmetic operations could be performed without explicit training, complex images could be explained, and images could be generated from corresponding explanatory texts. This is an apt example of a learner embodying "practice makes perfect" through vast amounts of textual data. This report explains how cutting-edge NLP has made it possible to realize the "practice makes perfect" principle, and provides examples of business applications. We reported on the NLP movement from late 2021 to early 2022 in Japanese in June 2022; we summarize it here as a memorandum, since it was the initial movement leading to the current large language models (LLMs).

en cs.CL, cs.AI
S2 Open Access 2022
SMTCE: A Social Media Text Classification Evaluation Benchmark and BERTology Models for Vietnamese

Luan Thanh Nguyen, Luan Nguyen, Kiet Van Nguyen et al.

Text classification is a typical natural language processing or computational linguistics task with various interesting applications. As the number of users on social media platforms increases, the acceleration of data promotes emerging studies on Social Media Text Classification (SMTC), or social media text mining, on these valuable resources. In contrast to English, Vietnamese, one of the low-resource languages, has still not been studied and exploited thoroughly. Inspired by the success of GLUE, we introduce the Social Media Text Classification Evaluation (SMTCE) benchmark, a collection of datasets and models across a diverse set of SMTC tasks. With the proposed benchmark, we implement and analyze the effectiveness of a variety of multilingual BERT-based models (mBERT, XLM-R, and DistilmBERT) and monolingual BERT-based models (PhoBERT, viBERT, vELECTRA, and viBERT4news) on the tasks in the SMTCE benchmark. Monolingual models outperform multilingual models and achieve state-of-the-art results on all text classification tasks. The benchmark provides an objective assessment of multilingual and monolingual BERT-based models and will benefit future studies on BERTology in the Vietnamese language.

10 citations en Computer Science
S2 Open Access 2022
Machine Learning based Sentiment Analysis of Hindi Data with TF-IDF and Count Vectorization

Ashwani Gupta, U. Sharma

Sentiment refers to emotions. Sentiment analysis, often known as opinion mining, is the technique of identifying and extracting subjective data from pre-web and post-web reviews using text analytics, computational linguistics, and natural language processing. Hindi is an Indian language used by many Indians, and with the phenomenal growth of online product reviews, post-web Hindi reviews are increasing rapidly. This paper presents a machine learning based method to analyse post-web text data. The method has four steps. First, an annotated Hindi review dataset is built from post-web sources. Second, feature extraction is performed on the annotated Hindi reviews using Term Frequency-Inverse Document Frequency (TF-IDF) and count vectorization. Third, the extracted features are fed to a classifier to make predictions. Finally, the annotated dataset is translated into English and the second and third steps are repeated on the English data. Results are reported against a range of evaluation criteria, including precision, recall, and F1-score. In both settings, the support vector machine produced the most relevant results.
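The TF-IDF weighting used for feature extraction above can be computed in a few lines. The sketch below uses one common unsmoothed variant (tf = count / document length, idf = log(N / df)); library implementations such as scikit-learn's apply slightly different smoothing.

```python
from collections import Counter
from math import log

def tfidf(docs):
    """Return one {term: weight} dict per whitespace-tokenised document.
    tf = raw count / doc length; idf = log(N / df), unsmoothed."""
    n = len(docs)
    tokenised = [d.split() for d in docs]
    df = Counter()                        # document frequency per term
    for toks in tokenised:
        df.update(set(toks))
    weights = []
    for toks in tokenised:
        tf = Counter(toks)
        weights.append({t: (c / len(toks)) * log(n / df[t])
                        for t, c in tf.items()})
    return weights
```

Note the characteristic behaviour: a term occurring in every document gets idf = log(1) = 0, so only discriminative terms receive non-zero weight, which is exactly what makes TF-IDF features useful for the SVM classifier in the third step.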

9 citations en Computer Science

Page 27 of 408628