Results for "Computational linguistics. Natural language processing"

DOAJ Open Access 2025

The Postmodern Novel, Deconstruction and Indeterminacy

Dr. Rehman Sarwar Bajwa

The influence of post-modernism is evident across various genres of Urdu literature, including poetry, ghazals, fiction, and novels. This paper seeks to examine and interpret the impact of post-modern philosophy and its stylistic trends on the language of the Urdu novel. The novel, as a genre, has been selected for this study due to its intrinsic connection with lived experience, serving as a dynamic reflection of society. In contemporary literature, the novel is regarded as one of the most potent social instruments, offering an unparalleled medium for articulating nuanced thoughts and ideologies. Through a critical analysis of linguistic and narrative techniques, this paper explores how post-modernism has reshaped the language and structure of Urdu novels, reinforcing their role as a mirror to evolving cultural and philosophical paradigms.

Language. Linguistic theory. Comparative grammar, Computational linguistics. Natural language processing


DOAJ Open Access 2025

José Luis Trullo (ed.), Juan Luis Vives, Vida y costumbres del humanista, Sevilla, Cypress Cultura, 2024, 42 pp.

Pedro Fernández Requena

French literature - Italian literature - Spanish literature - Portuguese literature, Computational linguistics. Natural language processing


arXiv Open Access 2025

Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models

Sergio E. Zanotto, Segun Aroyehun

The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. While recent research has primarily focused on using LLMs to classify text as either human-written or machine-generated texts, our study focuses on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate different linguistic features such as dependency length and emotionality, and we use them for characterizing human-written and machine-generated texts along with different sampling strategies, repetition controls, and model release dates. Our statistical analysis reveals that human-written texts tend to exhibit simpler syntactic structures and more diverse semantic content. Furthermore, we calculate the variability of our set of features across models and domains. Both human- and machine-generated texts show stylistic diversity across domains, with human-written texts displaying greater variation in our features. Finally, we apply style embeddings to further test variability among human-written and machine-generated texts. Notably, newer models output text that is similarly variable, pointing to a homogenization of machine-generated texts.
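As a rough illustration of the dependency-length feature mentioned in this abstract, the sketch below computes the mean distance between each token and its syntactic head. It is not the authors' implementation: the function name `mean_dependency_length` and the input encoding (1-based head indices, with 0 marking the root) are assumptions chosen for the example.

```python
from statistics import mean


def mean_dependency_length(heads):
    """Return the mean token-to-head distance for one parsed sentence.

    heads[i] is the 1-based position of the head of token i + 1;
    a head of 0 marks the root, which has no incoming dependency.
    (This encoding is an assumption made for illustration.)
    """
    distances = [
        abs(head - position)
        for position, head in enumerate(heads, start=1)
        if head != 0  # skip the root token
    ]
    return mean(distances) if distances else 0.0


# Hypothetical parse of "The cat sat": The -> cat, cat -> sat, sat -> root
example_heads = [2, 3, 0]
print(mean_dependency_length(example_heads))
```

In practice the head indices would come from a dependency parser; the abstract does not say which one was used.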

en cs.CL, cs.AI


arXiv Open Access 2025

Aspect-Oriented Summarization for Psychiatric Short-Term Readmission Prediction

WonJin Yoon, Boyu Ren, Spencer Thomas et al.

Recent progress in large language models (LLMs) has enabled the automated processing of lengthy documents even without supervised training on a task-specific dataset. Yet, their zero-shot performance in complex tasks as opposed to straightforward information extraction tasks remains suboptimal. One feasible approach for tasks with lengthy, complex input is to first summarize the document and then apply supervised fine-tuning to the summary. However, the summarization process inevitably results in some loss of information. In this study we present a method for processing the summaries of long documents aimed to capture different important aspects of the original document. We hypothesize that LLM summaries generated with different aspect-oriented prompts contain different information signals, and we propose methods to measure these differences. We introduce approaches to effectively integrate signals from these different summaries for supervised training of transformer models. We validate our hypotheses on a high-impact task -- 30-day readmission prediction from a psychiatric discharge -- using real-world data from four hospitals, and show that our proposed method increases the prediction performance for the complex task of predicting patient outcome.

en cs.CL


arXiv Open Access 2025

Large Language Models Meet Text-Attributed Graphs: A Survey of Integration Frameworks and Applications

Guangxin Su, Hanchen Wang, Jianwei Wang et al.

Large Language Models (LLMs) have achieved remarkable success in natural language processing through strong semantic understanding and generation. However, their black-box nature limits structured and multi-hop reasoning. In contrast, Text-Attributed Graphs (TAGs) provide explicit relational structures enriched with textual context, yet often lack semantic depth. Recent research shows that combining LLMs and TAGs yields complementary benefits: enhancing TAG representation learning and improving the reasoning and interpretability of LLMs. This survey provides the first systematic review of LLM--TAG integration from an orchestration perspective. We introduce a novel taxonomy covering two fundamental directions: LLM for TAG, where LLMs enrich graph-based tasks, and TAG for LLM, where structured graphs improve LLM reasoning. We categorize orchestration strategies into sequential, parallel, and multi-module frameworks, and discuss advances in TAG-specific pretraining, prompting, and parameter-efficient fine-tuning. Beyond methodology, we summarize empirical insights, curate available datasets, and highlight diverse applications across recommendation systems, biomedical analysis, and knowledge-intensive question answering. Finally, we outline open challenges and promising research directions, aiming to guide future work at the intersection of language and graph learning.

en cs.CL, cs.AI


CrossRef Open Access 2024

Chinese Text Simplification Based on Large Language Models

Hongying Huo, Zetong Fu, Haochen Yu et al.

2 citations en


DOAJ Open Access 2024

Sociolinguistic and Linguocultural Features of English-Language Marriage Advertisements

Вікторія Цимбалюк

The article examines the linguistic features of English-language marriage advertisements in the context of the sociolinguistic and linguocultural factors that influence their construction and, subsequently, their distribution on dating websites and apps. The popular dating apps Tinder and Badoo were selected for analysis. The diversity in the content of marriage advertisements, in candidates' views on marriage, in their expectations of the other person, and in their ways of presenting themselves reflects the rapid and dramatic transformations taking place in the social and cultural life of society. The paper examines the lexical composition, stylistic features, and communicative strategies that the authors of marriage advertisements use to represent themselves and their desires and to express their expectations of a potential partner. Virtual dating platforms, such as websites and apps, give users the opportunity to consciously shape and manage their image. This is especially important in the context of the analyzed marriage advertisements, where language becomes a tool for creating a desired impression and communication is defined as strategic. The importance of effective communication lies in ensuring that a message is properly received and that the information it conveys is accurately understood. Communicative interaction in the context of marriage advertisements helps build the relationships their authors desire. Authors of English-language marriage advertisements seek to achieve the goal of communication, which is to find a person who can become a romantic partner or fulfill other roles in the life of the advertisement's subject. The article examines the sociolinguistic and linguocultural features of this phenomenon, focusing on how linguistic devices contribute to impression management in the digital space.

Discourse analysis, Computational linguistics. Natural language processing


DOAJ Open Access 2024

Improving Probability-based Prompt Selection Through Unified Evaluation and Analysis

Sohee Yang, Jonghyeon Kim, Joel Jang et al.

Computational linguistics. Natural language processing


DOAJ Open Access 2024

ConvoSense: Overcoming Monotonous Commonsense Inferences for Conversational AI

Sarah E. Finch, Jinho D. Choi

Computational linguistics. Natural language processing


arXiv Open Access 2024

Unforgettable Generalization in Language Models

Eric Zhang, Leshem Choshen, Jacob Andreas

When language models (LMs) are trained to forget (or "unlearn") a skill, how precisely does their behavior change? We study the behavior of transformer LMs in which tasks have been forgotten via fine-tuning on randomized labels. Such LMs learn to generate near-random predictions for individual examples in the "training" set used for forgetting. Across tasks, however, LMs exhibit extreme variability in whether LM predictions change on examples outside the training set. In some tasks (like entailment classification), forgetting generalizes robustly, and causes models to produce uninformative predictions on new task instances; in other tasks (like physical commonsense reasoning and scientific question answering) forgetting affects only the training examples, and models continue to perform the "forgotten" task accurately even for examples very similar to those that appeared in the training set. Dataset difficulty is not predictive of whether a behavior can be forgotten; instead, generalization in forgetting is (weakly) predicted by the confidence of LMs' initial task predictions and the variability of LM representations of training data, with low confidence and low variability both associated with greater generalization. Perhaps most surprisingly, random-label forgetting appears to be somewhat insensitive to the contents of the training set: for example, models trained on science questions with random labels continue to answer other science questions accurately, but begin to produce random labels on entailment classification tasks. Finally, we show that even generalizable forgetting is shallow: linear probes trained on LMs' representations can still perform tasks reliably after forgetting. Our results highlight the difficulty and unpredictability of performing targeted skill removal from models via fine-tuning.
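The abstract describes forgetting via fine-tuning on randomized labels. The data-preparation step alone might look like the sketch below, which replaces each gold label with one drawn uniformly at random. The function name `randomize_labels`, the `(text, label)` pair format, and the seed handling are assumptions for illustration; the paper's actual sampling procedure and fine-tuning setup are not given in this listing.

```python
import random


def randomize_labels(examples, label_set, seed=0):
    """Return a copy of `examples` whose labels are drawn uniformly at random.

    examples:  list of (text, gold_label) pairs (hypothetical format)
    label_set: list of all labels the task allows
    seed:      fixes the random draw so the forgetting set is reproducible
    """
    rng = random.Random(seed)
    # The gold label is discarded; only the input text is kept.
    return [(text, rng.choice(label_set)) for text, _gold in examples]


# Hypothetical entailment-style examples
train = [
    ("A man sleeps. / A person is resting.", "entailment"),
    ("A dog runs. / The dog is asleep.", "contradiction"),
]
forget_set = randomize_labels(train, ["entailment", "neutral", "contradiction"])
print(forget_set)
```

A model would then be fine-tuned on `forget_set` in place of the original training data.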

en cs.LG, cs.CL


arXiv Open Access 2024

Robust Speech and Natural Language Processing Models for Depression Screening

Y. Lu, A. Harati, T. Rutowski et al.

Depression is a global health concern with a critical need for increased patient screening. Speech technology offers advantages for remote screening but must perform robustly across patients. We have described two deep learning models developed for this purpose. One model is based on acoustics; the other is based on natural language processing. Both models employ transfer learning. Data from a depression-labeled corpus in which 11,000 unique users interacted with a human-machine application using conversational speech is used. Results on binary depression classification have shown that both models perform at or above AUC=0.80 on unseen data with no speaker overlap. Performance is further analyzed as a function of test subset characteristics, finding that the models are generally robust over speaker and session variables. We conclude that models based on these approaches offer promise for generalized automated depression screening.
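The AUC figure reported in this abstract can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it that way by direct pairwise comparison. It is a generic illustration, not the authors' evaluation code; the function name `roc_auc` and the 0/1 label encoding are assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0 (negative) or 1 (positive)
    scores: model scores, higher meaning "more likely positive"
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for four held-out speakers
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a sketch; rank-based formulations are the usual choice for large test sets.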

en eess.AS, cs.CL


arXiv Open Access 2024

I or Not I: Unraveling the Linguistic Echoes of Identity in Samuel Beckett's "Not I" Through Natural Language Processing

Arezou Zahiri Pourzarandi, Farshad Jafari

Exploring the depths of Samuel Beckett's "Not I" through advanced natural language processing techniques, this research uncovers the intricate linguistic structures that underpin the text. By analyzing word frequency, detecting emotional sentiments with a BERT-based model, and examining repetitive motifs, we unveil how Beckett's minimalist yet complex language reflects the protagonist's fragmented psyche. Our results demonstrate that recurring themes of time, memory, and existential angst are artfully woven through recursive linguistic patterns and rhythmic repetition. This innovative approach not only deepens our understanding of Beckett's stylistic contributions but also highlights his unique role in modern literature, where language transcends simple communication to explore profound existential questions.

en cs.CL


arXiv Open Access 2024

Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)

Hansi Hettiarachchi, Tharindu Ranasinghe, Paul Rayson et al.

The first Workshop on Language Models for Low-Resource Languages (LoResLM 2025) was held in conjunction with the 31st International Conference on Computational Linguistics (COLING 2025) in Abu Dhabi, United Arab Emirates. This workshop mainly aimed to provide a forum for researchers to share and discuss their ongoing work on language models (LMs) focusing on low-resource languages, following the recent advancements in neural language models and their linguistic biases towards high-resource languages. LoResLM 2025 attracted notable interest from the natural language processing (NLP) community, resulting in 35 accepted papers from 52 submissions. These contributions cover a broad range of low-resource languages from eight language families and 13 diverse research areas, paving the way for future possibilities and promoting linguistic inclusivity in NLP.

en cs.CL, cs.AI


arXiv Open Access 2024

Is it the end of (generative) linguistics as we know it?

Cristiano Chesi

A significant debate has emerged in response to a paper written by Steven Piantadosi (Piantadosi, 2023) and uploaded to the LingBuzz platform, the open archive for generative linguistics. Piantadosi's dismissal of Chomsky's approach is ruthless, but generative linguists deserve it. In this paper, I will adopt three idealized perspectives -- computational, theoretical, and experimental -- to focus on two fundamental issues that lend partial support to Piantadosi's critique: (a) the evidence challenging the Poverty of Stimulus (PoS) hypothesis and (b) the notion of simplicity as conceived within mainstream Minimalism. In conclusion, I argue that, to reclaim a central role in language studies, generative linguistics -- representing a prototypical theoretical perspective on language -- needs a serious update leading to (i) more precise, consistent, and complete formalizations of foundational intuitions and (ii) the establishment and utilization of a standardized dataset of crucial empirical evidence to evaluate the theory's adequacy. On the other hand, ignoring the formal perspective leads to major drawbacks in both computational and experimental approaches. Neither descriptive nor explanatory adequacy can be easily achieved without the precise formulation of general principles that can be challenged empirically.

en cs.CL


CrossRef Open Access 2023

Speech Act Classification in Computational Linguistics Using Supervised Machine Learning Models

Shadi Dini, Penny Hammrich, Aroutis Foster

This dissertation utilizes two supervised machine learning models, Random Forest and Support Vector Machine (SVM), to classify speech acts, specifically direct and indirect Requests and Refusals, in a dataset of over 5000 emails. The study focuses on analyzing asynchronous communications, particularly emails, as authentic sources of data. By comparing the performance of Random Forest and SVM, the research demonstrates that SVM outperforms Random Forest in accurately classifying both direct and indirect speech acts. The findings have significant implications for various fields, including linguistics, natural language processing, and education, highlighting the potential of SVM in speech act classification tasks and its contribution to the analysis of conversational data.

en


DOAJ Open Access 2023

A microanalysis study of EFL teacher self-regulation in COVID-19 emergency remote teaching

Dong Li

Teacher self-regulation is under-studied yet important especially for teachers of English as a foreign language (EFL) who need to conduct remote teaching over the internet due to COVID-19, known as emergency remote teaching (ERT). In light of the teacher development and self-regulation model that consists of the three phases of forethought, performance and self-reflection, this qualitative study applies the microanalysis method to explore how one novice-level and two master-level Chinese university EFL teachers self-regulated their ERT teaching. After synthetical term frequency analysis using jiebaR, we summarised a total of 14 teacher self-regulation strategies corresponding to the three phases and categorised them into ERT goals (three strategies), motivation for ERT (three strategies), self-control in ERT (three strategies), self-observation in ERT (one strategy), self-judgment from the ERT class (two strategies) and self-reaction from the ERT class (one strategy). We ended by proposing a tentative EFL teacher self-regulation model for ERT context. Implications are provided.

Computational linguistics. Natural language processing


arXiv Open Access 2023

This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models

Iker García-Ferrero, Begoña Altuna, Javier Álvez et al.

Although large language models (LLMs) have apparently acquired a certain level of grammatical knowledge and the ability to make generalizations, they fail to interpret negation, a crucial step in Natural Language Processing. We try to clarify the reasons for the sub-optimal performance of LLMs understanding negation. We introduce a large semi-automatically generated dataset of circa 400,000 descriptive sentences about commonsense knowledge that can be true or false in which negation is present in about 2/3 of the corpus in different forms. We have used our dataset with the largest available open LLMs in a zero-shot approach to grasp their generalization and inference capability and we have also fine-tuned some of the models to assess whether the understanding of negation can be trained. Our findings show that, while LLMs are proficient at classifying affirmative sentences, they struggle with negative sentences and lack a deep understanding of negation, often relying on superficial cues. Although fine-tuning the models on negative sentences improves their performance, the lack of generalization in handling negation is persistent, highlighting the ongoing challenges of LLMs regarding negation understanding and generalization. The dataset and code are publicly available.

en cs.CL


arXiv Open Access 2023

Language Cognition and Language Computation -- Human and Machine Language Understanding

Shaonan Wang, Nai Ding, Nan Lin et al.

Language understanding is a key scientific issue in the fields of cognitive and computer science. However, the two disciplines differ substantially in the specific research questions. Cognitive science focuses on analyzing the specific mechanism of the brain and investigating the brain's response to language; few studies have examined the brain's language system as a whole. By contrast, computer scientists focus on the efficiency of practical applications when choosing research questions but may ignore the most essential laws of language. Given these differences, can a combination of the disciplines offer new insights for building intelligent language models and studying language cognitive mechanisms? In the following text, we first review the research questions, history, and methods of language understanding in cognitive and computer science, focusing on the current progress and challenges. We then compare and contrast the research of language understanding in cognitive and computer sciences. Finally, we review existing work that combines insights from language cognition and language computation and offer prospects for future development trends.

en cs.CL


DOAJ Open Access 2022

Improved N-Best Extraction with an Evaluation on Language Data

Johanna Björklund, Frank Drewes, Anna Jonsson

We show that a previously proposed algorithm for the N-best trees problem can be made more efficient by changing how it arranges and explores the search space. Given an integer N and a weighted tree automaton (wta) M over the tropical semiring, the algorithm computes N trees of minimal weight with respect to M. Compared with the original algorithm, the modifications increase the laziness of the evaluation strategy, which makes the new algorithm asymptotically more efficient than its predecessor. The algorithm is implemented in the software Betty, and compared to the state-of-the-art algorithm for extracting the N best runs, implemented in the software toolkit Tiburon. The data sets used in the experiments are wtas resulting from real-world natural language processing tasks, as well as artificially created wtas with varying degrees of nondeterminism. We find that Betty outperforms Tiburon on all tested data sets with respect to running time, while Tiburon seems to be the more memory-efficient choice.

Computational linguistics. Natural language processing


arXiv Open Access 2022

Mitigating Covertly Unsafe Text within Natural Language Systems

Alex Mei, Anisha Kabir, Sharon Levy et al.

An increasingly prevalent problem for intelligent technologies is text safety, as uncontrolled systems may generate recommendations to their users that lead to injury or life-threatening consequences. However, the degree of explicitness of a generated statement that can cause physical harm varies. In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text. Then, we further break down this category with respect to the system's information and discuss solutions to mitigate the generation of text in each of these subcategories. Ultimately, our work defines the problem of covertly unsafe language that causes physical harm and argues that this subtle yet dangerous issue needs to be prioritized by stakeholders and regulators. We highlight mitigation strategies to inspire future researchers to tackle this challenging problem and help improve safety within smart systems.

en cs.AI, cs.CL

