Results for "Language and Literature"

Showing 20 of ~880,418 results · from DOAJ, arXiv, Semantic Scholar

S2 Open Access 2020
Automatic Detection of Machine Generated Text: A Critical Survey

Ganesh Jawahar, Muhammad Abdul-Mageed, L. Lakshmanan

Text generative models (TGMs) excel in producing text that matches the style of human language reasonably well. Such TGMs can be misused by adversaries, e.g., by automatically generating fake news and fake product reviews that can look authentic and fool humans. Detectors that can distinguish text generated by a TGM from human-written text play a vital role in mitigating such misuse of TGMs. Recently, there has been a flurry of works from both the natural language processing (NLP) and machine learning (ML) communities to build accurate detectors for English. Despite the importance of this problem, there is currently no work that surveys this fast-growing literature and introduces newcomers to important research challenges. In this work, we fill this void by providing a critical survey and review of this literature to facilitate a comprehensive understanding of this problem. We conduct an in-depth error analysis of the state-of-the-art detector and discuss research directions to guide future work in this exciting area.

290 citations · en · Computer Science
arXiv Open Access 2026
L-ReLF: A Framework for Lexical Dataset Creation

Anass Sedrati, Mounir Afifi, Reda Benkhadra

This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.

en cs.CL
DOAJ Open Access 2025
Ontology-Guided Hypothesis Generation Using LLMs and Topic Modeling in mHealth Research

Vibha, Rajesh R. Pai, Sumith N.

This study proposes a semantic pipeline designed to generate domain-oriented and contextually relevant hypotheses by analyzing existing literature on mHealth applications in India. Using a corpus of mHealth texts, the framework extracts hidden semantics through TF-IDF, topic modeling, and contextual mapping with domain ontologies. It then employs prompt-based interactions with large language models (LLMs) to systematically generate and validate hypotheses aligned with identified topic-concept relationships. The results demonstrate the framework’s effectiveness in producing high-quality, structured hypotheses, as validated by expert ratings ranging from 4.2 to 4.6. Most hypotheses were found to be plausible or highly plausible, with low semantic redundancy indicating diversity across topics, except in stakeholder-related areas which showed moderate overlap. Although the inclusion of semantic augmentation increased processing time, it significantly enhanced interpretability and validity. The high lexical density observed (up to 0.90) further reflects the linguistic flexibility of the generated hypotheses. This approach underscores the potential of computational methods in automating hypothesis generation and enabling data-driven discoveries in the mHealth domain.
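The abstract above names TF-IDF as the first stage of its semantic-extraction pipeline. As a rough illustration only (not the authors' implementation; the corpus and weighting choices below are hypothetical), here is the standard tf × log(N/df) computation in plain Python:

```python
import math
from collections import Counter

def tfidf(docs):
    """Return per-document {term: tf-idf} maps using the common
    tf * log(N / df) weighting."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()                       # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append({t: (tf[t] / len(toks)) * math.log(n / df[t])
                       for t in tf})
    return scores

# Toy mHealth-flavored corpus (illustrative placeholder, not the study's data)
docs = [
    "mhealth apps support patient monitoring",
    "patient adherence improves with mhealth reminders",
    "privacy concerns limit mhealth adoption",
]
weights = tfidf(docs)
# "mhealth" appears in every document, so its idf (and hence tf-idf) is 0,
# while corpus-specific terms like "privacy" get positive weight.
print(weights[0]["mhealth"], weights[2]["privacy"])
```

In a pipeline like the one described, these weights would feed topic modeling and ontology mapping; the toy above only shows why corpus-wide terms are down-weighted.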

Electrical engineering. Electronics. Nuclear engineering
arXiv Open Access 2025
Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of TF-IDF and BERT Embeddings for Low-Resource Language Processing

Atharva Mutsaddi, Aditya Choudhary

Plagiarism involves using another person's work or concepts without proper attribution, presenting them as original creations. With the growing amount of data communicated in Indian regional languages such as Marathi, it is crucial to design robust plagiarism detection systems tailored for low-resource languages. Language models like Bidirectional Encoder Representations from Transformers (BERT) have demonstrated exceptional capability in text representation and feature extraction, making them essential tools for semantic analysis and plagiarism detection. However, the application of BERT to low-resource languages remains under-explored, particularly in the context of plagiarism detection. This paper presents a method to enhance the accuracy of plagiarism detection for Marathi texts using BERT sentence embeddings in conjunction with Term Frequency-Inverse Document Frequency (TF-IDF) feature representation. This approach effectively captures statistical, semantic, and syntactic aspects of text features through a weighted voting ensemble of machine learning models.
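The abstract describes a weighted voting ensemble over models trained on TF-IDF and BERT features. A minimal sketch of just the combination step (weighted soft voting), with hypothetical scores and weights that are not taken from the paper:

```python
def weighted_soft_vote(prob_sets, weights):
    """Combine per-model probabilities with a weighted average.
    prob_sets: list of per-model probability lists, one entry per document pair.
    weights:   one weight per model; normalized here so they sum to 1."""
    total = sum(weights)
    norm = [w / total for w in weights]
    n = len(prob_sets[0])
    return [sum(w * probs[i] for w, probs in zip(norm, prob_sets))
            for i in range(n)]

# Hypothetical plagiarism probabilities for three document pairs:
# model A trained on TF-IDF features, model B on BERT sentence embeddings.
# Giving B a higher weight is an illustrative choice, not the paper's tuning.
tfidf_probs = [0.9, 0.2, 0.6]
bert_probs = [0.7, 0.1, 0.8]
combined = weighted_soft_vote([tfidf_probs, bert_probs], weights=[1, 2])
print([round(p, 2) for p in combined])  # first pair: (1*0.9 + 2*0.7) / 3
```

The appeal of soft voting here is that the statistical (TF-IDF) and semantic (BERT) views can disagree per pair, and the weights arbitrate between them.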

en cs.CL, cs.AI
arXiv Open Access 2025
Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

Zhixian Zhao, Xinfa Zhu, Xinsheng Wang et al.

Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signals, performing audio analysis and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.

en cs.SD, cs.CL
arXiv Open Access 2025
Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?

Anthony GX-Chen, Dongyan Lin, Mandana Samiei et al.

Language model (LM) agents are increasingly used as autonomous decision-makers which need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world -- key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit systematic biases leading to erroneous conclusions. In this work, we examine LMs' ability to explore and infer causal relationships, using the well-established Blicket Test paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more) evidenced conjunctive ones. This "disjunctive bias" persists across model families, sizes, and prompting strategies, and performance further declines as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To this end, we quantify similarities between LMs and humans, finding that LMs exhibit adult-like inference profiles (but not child-like). Finally, we propose a test-time sampling method which explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.

en cs.AI, cs.CL
arXiv Open Access 2025
Natural Language Generation

Ehud Reiter

This book provides a broad overview of Natural Language Generation (NLG), including technology, user requirements, evaluation, and real-world applications. The focus is on concepts and insights which hopefully will remain relevant for many years, not on the latest LLM innovations. It draws on decades of work by the author and others on NLG. The book has the following chapters: Introduction to NLG; Rule-Based NLG; Machine Learning and Neural NLG; Requirements; Evaluation; Safety, Maintenance, and Testing; and Applications. All chapters include examples and anecdotes from the author's personal experiences, and end with a Further Reading section. The book should be especially useful to people working on applied NLG, including NLG researchers, people in other fields who want to use NLG, and commercial developers. It will not however be useful to people who want to understand the latest LLM technology. There is a companion site with more information at https://ehudreiter.com/book/

arXiv Open Access 2025
HGAdapter: Hypergraph-based Adapters in Language Models for Code Summarization and Clone Detection

Guang Yang, Yujie Zhu

Pre-trained language models (PLMs) are increasingly being applied to code-related tasks. Although PLMs have achieved good results, they do not take into account potential high-order data correlations within the code. We propose three types of high-order correlations in code tokens, i.e. abstract syntax tree family correlation, lexical correlation, and line correlation. We design a tokens-and-hyperedges generator to capture these high-order data correlations. We improve the architecture of hypergraph neural networks and combine it with adapter tuning to propose a novel hypergraph-based adapter (HGAdapter) to fine-tune PLMs. HGAdapter encodes high-order data correlations and can be inserted into various PLMs to enhance performance. Experiments were conducted on several public datasets, covering code summarization in six programming languages and code clone detection. Our method improved the performance of PLMs on these datasets to varying degrees, and the results validate that introducing high-order data correlations contributes to improved effectiveness.

en cs.CL, cs.AI
arXiv Open Access 2025
Hierarchical Multi-agent Large Language Model Reasoning for Autonomous Functional Materials Discovery

Samuel Rothfarb, Megan C. Davis, Ivana Matanovic et al.

Artificial intelligence is reshaping scientific exploration, but most methods automate procedural tasks without engaging in scientific reasoning, limiting autonomy in discovery. We introduce Materials Agents for Simulation and Theory in Electronic-structure Reasoning (MASTER), an active learning framework where large language models autonomously design, execute, and interpret atomistic simulations. In MASTER, a multimodal system translates natural language into density functional theory workflows, while higher-level reasoning agents guide discovery through a hierarchy of strategies, including a single agent baseline and three multi-agent approaches: peer review, triage-ranking, and triage-forms. Across two chemical applications, CO adsorption on Cu-surface transition metal (M) adatoms and on M-N-C catalysts, reasoning-driven exploration reduces required atomistic simulations by up to 90% relative to trial-and-error selection. Reasoning trajectories reveal chemically grounded decisions that cannot be explained by stochastic sampling or semantic bias. Altogether, multi-agent collaboration accelerates materials discovery and marks a new paradigm for autonomous scientific exploration.

en cond-mat.mtrl-sci, cs.AI
DOAJ Open Access 2024
Sakha Language in a Northern Industrial Monotown: Linguistic Distribution in Speech Repertoire of Sakha People

I. A. Danilov

This study explores the functioning of the Yakut language in the conditions of the northern industrial monotown Mirny (Republic of Sakha (Yakutia)). The study identifies the features of linguistic distribution in the speech repertoire of ethnic Sakha based on data from a sociolinguistic survey and interviews (n=279). Descriptive statistical methods and content analysis are employed. The results reveal an asymmetric nature of Yakut-Russian bilingualism with Russian language predominance in public communication. While Sakha individuals exhibit a high level of language competence in the Yakut language, its application is limited. In familial and friendly communication, the Yakut language maintains significant positions; however, its demand is minimal in the professional-business sphere, especially in industry. Among individuals with higher education, the percentage of Russian speakers at work reaches 68.34% compared to 55.77% among those with only secondary vocational education. Russian language dominates in management, service sectors, and law enforcement (75-100%). Only in fields such as healthcare, education, and culture does the Yakut language maintain strong positions. The native language is primarily perceived by Sakha as a symbolic marker of identity, yielding to Russian in social prestige and communicative power. The vitality prospects of the Yakut language in the city are assessed ambiguously and are linked to the effectiveness of language policy in key institutional contexts (education, media, government).

Slavic languages. Baltic languages. Albanian languages
DOAJ Open Access 2024
Fusing talent horizons: the transformative role of data integration in modern talent management

Ahmed M. Asfahani

This study elucidates the transformative influence of data integration on talent management in the context of evolving technological paradigms, with a specific focus on sustainable practices in human resources. Historically anchored in societal norms and organizational culture, talent management has transitioned from traditional methodologies to harnessing diverse data sources, a shift that enhances sustainable HR strategies. By employing a narrative literature review, the research traces the trajectory of HR data sources, emphasizing the juxtaposition of structured and unstructured data. The digital transformation of HR is explored, not only highlighting the evolution of Human Resource Information Systems (HRIS) but also underscoring their role in promoting sustainable workforce management. The integration of advanced technologies such as machine learning and natural language processing is examined, reflecting on their impact on the efficiency and ecological aspects of HR practices. This paper not only underscores the imperative of balancing data-driven strategies with the quintessential human element of HR but also provides concrete examples demonstrating this balance in action for practitioners and scholars in sustainable human resources.

Environmental sciences
arXiv Open Access 2024
Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

Max Zuo, Francisco Piedrahita Velez, Xiaochen Li et al.

Recent works have explored using language models for planning problems. One approach examines translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). Existing evaluation methods struggle to ensure semantic correctness and rely on simple or unrealistic datasets. To bridge this gap, we introduce Planetarium, a benchmark designed to evaluate language models' ability to generate PDDL code from natural language descriptions of planning tasks. Planetarium features a novel PDDL equivalence algorithm that flexibly evaluates the correctness of generated PDDL, along with a dataset of 145,918 text-to-PDDL pairs across 73 unique state combinations with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task's complexity. For example, 96.1% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, 94.4% are solvable, but only 24.8% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.
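The abstract's key distinction is that "syntactically parseable" is a far weaker bar than "semantically correct." As a deliberately crude illustration (PDDL is s-expression based, so unbalanced parentheses already fail the weakest check; Planetarium's actual equivalence algorithm is much stronger, and the PDDL snippets below are hypothetical):

```python
def is_sexpr_balanced(text):
    """Minimal well-formedness check: every '(' must have a matching ')'.
    Passing this says nothing about PDDL validity, solvability, or
    semantic equivalence to the intended planning task."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ')' with no open '(' to match
                return False
    return depth == 0

good = "(define (problem p1) (:domain blocksworld) (:init (on a b)))"
bad = "(define (problem p1) (:domain blocksworld) (:init (on a b))"
print(is_sexpr_balanced(good), is_sexpr_balanced(bad))  # True False
```

A generated problem can pass this check, and even be solvable by a planner, while still encoding the wrong goal, which is exactly the gap between the 94.4% and 24.8% figures above.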

en cs.CL, cs.AI
arXiv Open Access 2024
Tamil Language Computing: the Present and the Future

Kengatharaiyer Sarveswaran

This paper delves into the text processing aspects of Language Computing, which enables computers to understand, interpret, and generate human language. Focusing on tasks such as speech recognition, machine translation, sentiment analysis, text summarization, and language modelling, language computing integrates disciplines including linguistics, computer science, and cognitive psychology to create meaningful human-computer interactions. Recent advancements in deep learning have made computers more accessible and capable of independent learning and adaptation. In examining the landscape of language computing, the paper emphasises foundational work like encoding, where Tamil transitioned from ASCII to Unicode, enhancing digital communication. It discusses the development of computational resources, including raw data, dictionaries, glossaries, annotated data, and computational grammars, necessary for effective language processing. The challenges of linguistic annotation, the creation of treebanks, and the training of large language models are also covered, emphasising the need for high-quality, annotated data and advanced language models. The paper underscores the importance of building practical applications for languages like Tamil to address everyday communication needs, highlighting gaps in current technology. It calls for increased research collaboration, digitization of historical texts, and fostering digital usage to ensure the comprehensive development of Tamil language processing, ultimately enhancing global communication and access to digital services.
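The abstract highlights Tamil's transition from ASCII-era encodings to Unicode as foundational work. A small stdlib-only illustration of what Unicode buys: each Tamil character has a stable codepoint in the dedicated Tamil block, inspectable directly (the sample word is my own, not from the paper):

```python
import unicodedata

word = "தமிழ்"  # "Tamil" written in Tamil script
for ch in word:
    # Print the codepoint and its official Unicode character name
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Every character falls in the Unicode Tamil block (U+0B80-U+0BFF);
# legacy 8-bit/ASCII-era font encodings had no such standard mapping,
# which is why the transition mattered for digital communication.
assert all(0x0B80 <= ord(ch) <= 0x0BFF for ch in word)
```

The same property is what makes searching, sorting, and NLP preprocessing portable across systems, rather than tied to a particular legacy font.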

en cs.CL

Page 29 of 44,021