Results for "Language and Literature"

Showing 20 of ~591,606 results · from arXiv, DOAJ

arXiv Open Access 2025
How do language models learn facts? Dynamics, curricula and hallucinations

Nicolas Zucchet, Jörg Bornschein, Stephanie Chan et al.

Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall. Second, the training data distribution significantly impacts learning dynamics, as imbalanced distributions lead to shorter plateaus. Finally, hallucinations emerge simultaneously with knowledge, and integrating new knowledge into the model through fine-tuning is challenging, as it quickly corrupts its existing parametric memories. Our results emphasize the importance of data distribution in knowledge acquisition and suggest novel data scheduling strategies to accelerate neural network training.

en cs.CL, cs.LG
arXiv Open Access 2025
Soft Inductive Bias Approach via Explicit Reasoning Perspectives in Inappropriate Utterance Detection Using Large Language Models

Ju-Young Kim, Ji-Hong Park, Se-Yeon Lee et al.

Recent incidents in certain online games and communities, where anonymity is guaranteed, show that unchecked inappropriate remarks frequently escalate into verbal abuse and even criminal behavior, raising significant social concerns. Consequently, there is a growing need for research on techniques that can detect inappropriate utterances within conversational texts to help build a safer communication environment. Although large-scale language models trained on Korean corpora and chain-of-thought reasoning have recently gained attention, research applying these approaches to inappropriate utterance detection remains limited. In this study, we propose a soft inductive bias approach that explicitly defines reasoning perspectives to guide the inference process, thereby promoting rational decision-making and preventing errors that may arise during reasoning. We fine-tune a Korean large language model using the proposed method and conduct both quantitative performance comparisons and qualitative evaluations across different training strategies. Experimental results show that the Kanana-1.5 model achieves an average accuracy of 87.0046, improving by approximately 3.89 percent over standard supervised learning. These findings indicate that the proposed method goes beyond simple knowledge imitation by large language models and enables more precise and consistent judgments through constrained reasoning perspectives, demonstrating its effectiveness for inappropriate utterance detection.

en cs.CL
arXiv Open Access 2024
Unforgettable Generalization in Language Models

Eric Zhang, Leshem Choshen, Jacob Andreas

When language models (LMs) are trained to forget (or "unlearn") a skill, how precisely does their behavior change? We study the behavior of transformer LMs in which tasks have been forgotten via fine-tuning on randomized labels. Such LMs learn to generate near-random predictions for individual examples in the "training" set used for forgetting. Across tasks, however, LMs exhibit extreme variability in whether LM predictions change on examples outside the training set. In some tasks (like entailment classification), forgetting generalizes robustly and causes models to produce uninformative predictions on new task instances; in other tasks (like physical commonsense reasoning and scientific question answering), forgetting affects only the training examples, and models continue to perform the "forgotten" task accurately even for examples very similar to those that appeared in the training set. Dataset difficulty is not predictive of whether a behavior can be forgotten; instead, generalization in forgetting is (weakly) predicted by the confidence of LMs' initial task predictions and the variability of LM representations of training data, with low confidence and low variability both associated with greater generalization. Perhaps most surprisingly, random-label forgetting appears to be somewhat insensitive to the contents of the training set: for example, models trained on science questions with random labels continue to answer other science questions accurately, but begin to produce random labels on entailment classification tasks. Finally, we show that even generalizable forgetting is shallow: linear probes trained on LMs' representations can still perform tasks reliably after forgetting. Our results highlight the difficulty and unpredictability of performing targeted skill removal from models via fine-tuning.

en cs.LG, cs.CL
arXiv Open Access 2024
Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation

Nurshat Fateh Ali, Md. Mahdi Mohtasim, Shakil Mosharrof et al.

This research presents and compares multiple approaches to automating the generation of literature reviews using several Natural Language Processing (NLP) techniques and retrieval-augmented generation (RAG) with a Large Language Model (LLM). The ever-increasing number of research articles poses a huge challenge for manual literature review and has led to increased demand for automation. The primary objective of this research is to develop a system capable of automatically generating literature reviews from only PDF files as input. To meet this objective, the effectiveness of several NLP strategies is evaluated: a frequency-based method (spaCy), a transformer model (Simple T5), and retrieval-augmented generation (RAG) with a Large Language Model (GPT-3.5-turbo). The SciTLDR dataset is chosen for this research experiment, and the three techniques are used to implement three different systems for auto-generating literature reviews. ROUGE scores are used for the evaluation of all three systems. Based on the evaluation, the Large Language Model GPT-3.5-turbo achieved the highest ROUGE-1 score, 0.364; the transformer model comes second and spaCy last. Finally, a graphical user interface is created for the best system, based on the large language model.

en cs.CL, cs.AI
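The ROUGE-1 score reported in the entry above is, at its core, clipped unigram overlap between a generated summary and a reference. A minimal sketch for intuition only (the authors almost certainly used an established ROUGE toolkit, which adds stemming and other normalization):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall,
    with matches clipped to the reference counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

On this scale, the reported 0.364 means that roughly a third of unigrams overlap after the precision/recall trade-off is balanced.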
arXiv Open Access 2024
Are Compressed Language Models Less Subgroup Robust?

Leonidas Gee, Andrea Zugarini, Novi Quadrianto

To reduce the inference cost of large language models, model compression is increasingly used to create smaller, scalable models. However, little is known about their robustness to minority subgroups defined by the labels and attributes of a dataset. In this paper, we investigate the effects of 18 different compression methods and settings on the subgroup robustness of BERT language models. We show that worst-group performance does not depend on model size alone, but also on the compression method used. Additionally, we find that model compression does not always worsen the performance on minority subgroups. Altogether, our analysis serves to further research into the subgroup robustness of model compression.

en cs.LG, cs.CL
arXiv Open Access 2024
How Important Is Tokenization in French Medical Masked Language Models?

Yanis Labrak, Adrien Bazoge, Beatrice Daille et al.

Subword tokenization has become the prevailing standard in the field of natural language processing (NLP) over recent years, primarily due to the widespread utilization of pre-trained language models. This shift began with Byte-Pair Encoding (BPE) and was later followed by the adoption of SentencePiece and WordPiece. While subword tokenization consistently outperforms character and word-level tokenization, the precise factors contributing to its success remain unclear. Key aspects such as the optimal segmentation granularity for diverse tasks and languages, the influence of data sources on tokenizers, and the role of morphological information in Indo-European languages remain insufficiently explored. This is particularly pertinent for biomedical terminology, characterized by specific rules governing morpheme combinations. Despite the agglutinative nature of biomedical terminology, existing language models do not explicitly incorporate this knowledge, leading to inconsistent tokenization strategies for common terms. In this paper, we seek to delve into the complexities of subword tokenization in the French biomedical domain across a variety of NLP tasks and pinpoint areas where further enhancements can be made. We analyze classical tokenization algorithms, including BPE and SentencePiece, and introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.

en cs.CL, cs.AI
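The BPE algorithm discussed in the entry above greedily merges the most frequent adjacent symbol pair. A toy sketch over single-character symbols (real tokenizers add word-boundary markers, byte fallback, and careful tie-breaking, so this is illustrative only):

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a word-frequency dict (toy version).

    `words` maps space-separated symbol sequences to counts,
    e.g. {"l o w": 5}. Assumes single-character symbols, so the
    string replace below cannot merge across symbol boundaries.
    """
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        pattern = " ".join(best)
        vocab = {word.replace(pattern, "".join(best)): freq
                 for word, freq in vocab.items()}
    return merges, vocab
```

Running this on the classic `{"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}` corpus first merges `e s`, then `es t`.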
arXiv Open Access 2024
Clinical information extraction for Low-resource languages with Few-shot learning using Pre-trained language models and Prompting

Phillip Richter-Pechanski, Philipp Wiesenbach, Dominic M. Schwab et al.

Automatic extraction of medical information from clinical documents poses several challenges: high costs of required clinical expertise, limited interpretability of model predictions, restricted computational resources and privacy regulations. Recent advances in domain-adaptation and prompting methods showed promising results with minimal training data using lightweight masked language models, which are suited for well-established interpretability methods. We are the first to present a systematic evaluation of these methods in a low-resource setting, by performing multi-class section classification on German doctor's letters. We conduct extensive class-wise evaluations supported by Shapley values, to validate the quality of our small training data set and to ensure the interpretability of model predictions. We demonstrate that a lightweight, domain-adapted pretrained model, prompted with just 20 shots, outperforms a traditional classification model by 30.5% accuracy. Our results serve as a process-oriented guideline for clinical information extraction projects working with low-resource languages.

en cs.CL, cs.AI
arXiv Open Access 2024
SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks

Kai-Wei Chang, Haibin Wu, Yu-Kai Wang et al.

Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM's inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneer research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experiment results show that the prompting method can achieve competitive performance compared to the strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, with the advanced speech LMs coming into the stage, the proposed prompting framework attains great potential.

en eess.AS, cs.AI
DOAJ Open Access 2024
Faulty tongue position in realizations of the vowel phoneme /i/ and of the consonant phonemes /ɕ, ʑ, ʨ, ʥ, ɲ, j/

Lilianna Konopska

The paper addresses the /i/-type tongue configuration in realizations of the vowel phoneme /i/, as well as of the consonant phonemes whose realization inherently involves this tongue configuration. It discusses the findings of contemporary speech-therapy research on selected anatomical and functional determinants of faulty realizations of the consonant phonemes /ɕ, ʑ, ʨ, ʥ, ɲ, j/ and of /i/, as well as the phenomenon of depalatalization.

Philology. Linguistics
DOAJ Open Access 2024
Impact of Artificial Intelligence (ChatGPT and Google Bard) on Undergraduates’ Creative Writing Skills at a University in Northeastern Nigeria

Abubakar Salihu, Muhammad Mukhtar Aliyu, Nur Fadillah Nurchalis

Artificial intelligence (AI) has a profound influence on various facets of modern-day society, notably within the realm of education. Its application in educational settings is extensive, primarily aimed at improving the methods of instruction and knowledge acquisition. Thus, this research investigates the impact of AI tools (ChatGPT, Google Bard) on the creative writing skills of Nigerian undergraduates using a pretest-posttest quasi-experimental research design. Eighty (80) third-year undergraduates participated in the study. Data were collected using pretest-posttest writing tasks. The writing scripts were graded using creative writing assessment rubrics. Paired sample t-test analysis was conducted to compare the pretest-posttest writing scores using SPSS. The results of the analysis showed a significant improvement in the participants’ overall writing scores after using the AI tools. The results also show significant improvement in all the aspects of creative writing: image, voice, characterization and story. Finally, the study makes some recommendations for practice and further studies.  

Philology. Linguistics
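The paired-samples t-test used in the study above compares each participant's pretest and posttest writing scores. A minimal sketch of the test statistic (the authors used SPSS; the function name and data below are hypothetical):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(pre, post):
    """t statistic for a paired-samples t-test:
    t = mean(d) / (sd(d) / sqrt(n)), where d = post - pre."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

A large positive t, compared against the t distribution with n-1 degrees of freedom (e.g. via `scipy.stats`), indicates that posttest scores are systematically higher than pretest scores.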
arXiv Open Access 2023
A Comprehensive Review of State-of-The-Art Methods for Java Code Generation from Natural Language Text

Jessica López Espejel, Mahaman Sanoussi Yahaya Alassan, El Mehdi Chouham et al.

Java code generation is the task of automatically producing Java code from natural language text. This NLP task helps increase programmers' productivity by providing them with immediate solutions to the simplest and most repetitive tasks. Code generation is challenging because of strict syntactic rules and the need for a deep understanding of the semantics of the programming language. Many works have tried to tackle this task using either RNN-based or Transformer-based models. The latter have achieved remarkable advances in the domain and can be divided into three groups: (1) encoder-only models, (2) decoder-only models, and (3) encoder-decoder models. In this paper, we provide a comprehensive review of the evolution and progress of deep learning models for the Java code generation task. We focus on the most important methods, presenting their merits and limitations as well as the objective functions used by the community. In addition, we provide a detailed description of the datasets and evaluation metrics used in the literature. Finally, we discuss the results of different models on the CONCODE dataset and propose some future directions.

arXiv Open Access 2023
Turkish Native Language Identification V2

Ahmet Yavuz Uluslu, Gerold Schneider

This paper presents the first application of Native Language Identification (NLI) for the Turkish language. NLI is the task of automatically identifying an individual's native language (L1) based on their writing or speech in a non-native language (L2). While most NLI research has focused on L2 English, our study extends this scope to L2 Turkish by analyzing a corpus of texts written by native speakers of Albanian, Arabic and Persian. We leverage a cleaned version of the Turkish Learner Corpus and demonstrate the effectiveness of syntactic features, comparing a structural Part-of-Speech n-gram model to a hybrid model that retains function words. Our models achieve promising results, and we analyze the most predictive features to reveal L1-specific transfer effects. We make our data and code publicly available for further study.

en cs.CL
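The structural POS n-gram features described in the entry above, and the hybrid variant that retains function words, can be sketched as follows (the tag set, the function-word list, and both function names are illustrative, not the authors' code):

```python
from collections import Counter

def pos_ngrams(tags, n):
    """Counts of overlapping n-grams over a POS-tag sequence."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def hybrid_sequence(tokens, tags, function_words):
    """Keep function words verbatim; abstract content words to their POS tag,
    preserving L1-revealing structure while discarding topical content."""
    return [tok if tok.lower() in function_words else tag
            for tok, tag in zip(tokens, tags)]
```

The hybrid sequence can then be fed to `pos_ngrams` in place of the pure tag sequence, so that n-grams capture both syntactic patterns and function-word choices.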
arXiv Open Access 2023
AI Literature Review Suite

David A. Tovar

The process of conducting literature reviews is often time-consuming and labor-intensive. To streamline this process, I present an AI Literature Review Suite that integrates several functionalities to provide a comprehensive literature review. This tool leverages the power of open access science, large language models (LLMs) and natural language processing to enable the searching, downloading, and organizing of PDF files, as well as extracting content from articles. Semantic search queries are used for data retrieval, while text embeddings and summarization using LLMs present succinct literature reviews. Interaction with PDFs is enhanced through a user-friendly graphical user interface (GUI). The suite also features integrated programs for bibliographic organization, interaction and query, and literature review summaries. This tool presents a robust solution to automate and optimize the process of literature review in academic and industrial research.

en cs.DL, cs.AI
arXiv Open Access 2023
Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation

Anastasia Kritharoula, Maria Lymperaiou, Giorgos Stamou

Visual Word Sense Disambiguation (VWSD) is a novel challenging task with the goal of retrieving an image among a set of candidates, which better represents the meaning of an ambiguous word within a given context. In this paper, we make a substantial step towards unveiling this interesting task by applying a varying set of approaches. Since VWSD is primarily a text-image retrieval task, we explore the latest transformer-based methods for multimodal retrieval. Additionally, we utilize Large Language Models (LLMs) as knowledge bases to enhance the given phrases and resolve ambiguity related to the target word. We also study VWSD as a unimodal problem by converting it to text-to-text and image-to-image retrieval, as well as question answering (QA), to fully explore the capabilities of relevant models. To tap into the implicit knowledge of LLMs, we experiment with Chain-of-Thought (CoT) prompting to guide explainable answer generation. On top of all this, we train a learning-to-rank (LTR) model in order to combine our different modules, achieving competitive ranking results. Extensive experiments on VWSD provide valuable insights that can effectively drive future directions.

DOAJ Open Access 2023
The Symbolism of Place (the Ruins) in the Poetry of Husam al-Din al-Hajiri

Muhammad al-Mahdawi, Sadiq Muhammad

Symbols are among the important techniques that arouse the reader's curiosity and heighten the desire to reach the poet's intent and to uncover the meanings settled within his poetic output. The dynamism of the symbol allows the reader to penetrate the depths of the text, so that the reader's work becomes a space for multiple readings whose significance extends beyond the stated reality. The symbol of place in a literary text is an effective tool for revealing the textual subtleties that flow from its connotations, in its dimension expressing the hidden confession of the poet's repressed, inward-turned emotions. Choosing this topic is therefore an attempt to shed light on these symbols, especially those of the ruins, and to show their importance in the poetry of Husam al-Din al-Hajiri. Since symbols are tied to the reader more than to the text, the reader must decode them; this study is devoted to identifying them and uncovering their content, especially the ruin symbols, as a place charged with many dimensions and connotations.

History of scholarship and learning. The humanities, Arts in general
DOAJ Open Access 2023
Didactic-methodic and psychological features of the use of video materials in teaching a foreign language

A. V. Zamkovaya

Modern technology is expanding the possibilities of the classroom and creating the need for new forms of learning. Multimedia technology is now developing as a subject of academic study in its own right. Visuals in general play a huge, sometimes decisive, role in digital communication. The article is devoted to the use of video materials in foreign language classes: the author considers methods of using video resources in the process of teaching foreign languages, aimed at forming students' communicative competence, and analyses the psycho-didactic possibilities of audiovisual teaching materials. Particular attention is paid to the peculiarities of clip thinking, the level of visual literacy and a typology of video resources for teaching foreign languages. The aim of the article is to develop a methodology for applying instructional and authentic video resources in foreign language teaching, aimed at forming students' audiovisual literacy and communicative competence. The author concludes that the rational use of video materials and video resources contributes to deeper involvement of students in learning a foreign language, stimulates the development of creative thinking and social skills, and supports the mastery of the knowledge, skills and abilities of understanding, speaking and writing.

Education (General), Philology. Linguistics
arXiv Open Access 2022
An Overview of Indian Spoken Language Recognition from Machine Learning Perspective

Spandan Dey, Md Sahidullah, Goutam Saha

Automatic spoken language identification (LID) is a very important research field in the era of multilingual voice-command-based human-computer interaction (HCI). A front-end LID module helps to improve the performance of many speech-based applications in the multilingual scenario. India is a populous country with diverse cultures and languages. The majority of the Indian population needs to use their respective native languages for verbal interaction with machines. Therefore, the development of efficient Indian spoken language recognition systems is useful for adapting smart technologies in every section of Indian society. The field of Indian LID has started gaining momentum in the last two decades, mainly due to the development of several standard multilingual speech corpora for the Indian languages. Even though significant research progress has already been made in this field, to the best of our knowledge, there are not many attempts to analytically review them collectively. In this work, we have conducted one of the very first attempts to present a comprehensive review of the Indian spoken language recognition research field. In-depth analysis has been presented to emphasize the unique challenges of low-resource and mutual influences for developing LID systems in the Indian contexts. Several essential aspects of the Indian LID research, such as the detailed description of the available speech corpora, the major research contributions, including the earlier attempts based on statistical modeling to the recent approaches based on different neural network architectures, and the future research trends are discussed. This review work will help assess the state of the present Indian LID research by any active researcher or any research enthusiasts from related fields.

en cs.CL, cs.SD
arXiv Open Access 2022
End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting

Thierry Desot, François Portet, Michel Vacher

Spoken Language Understanding (SLU) is a core task in most human-machine interaction systems. With the emergence of smart homes, smart phones and smart speakers, SLU has become a key technology for the industry. In a classical SLU approach, an Automatic Speech Recognition (ASR) module transcribes the speech signal into a textual representation from which a Natural Language Understanding (NLU) module extracts semantic information. Recently End-to-End SLU (E2E SLU) based on Deep Neural Networks has gained momentum since it benefits from the joint optimization of the ASR and the NLU parts, hence limiting the cascade of error effect of the pipeline architecture. However, little is known about the actual linguistic properties used by E2E models to predict concepts and intents from speech input. In this paper, we present a study identifying the signal features and other linguistic properties used by an E2E model to perform the SLU task. The study is carried out in the application domain of a smart home that has to handle non-English (here French) voice commands. The results show that a good E2E SLU performance does not always require a perfect ASR capability. Furthermore, the results show the superior capabilities of the E2E model in handling background noise and syntactic variation compared to the pipeline model. Finally, a finer-grained analysis suggests that the E2E model uses the pitch information of the input signal to identify voice command concepts. The results and methodology outlined in this paper provide a springboard for further analyses of E2E models in speech processing.

en cs.CL, cs.SD
DOAJ Open Access 2022
The value of face-to-face communication in the digital world: What people miss about in-person interactions when those are limited

Jonathan Gruber, Eszter Hargittai, Minh Hao Nguyen

Face-to-face communication is important for building and maintaining relationships. The COVID-19 pandemic led to severe limitations in people's face-to-face interactions, resulting in most people relying more heavily on digital communication for social connection. Existing research has contributed to the understanding of how face-to-face communication is used alongside digital communication. However, we know little about what elements of face-to-face interactions people miss, especially when in-person meetings are heavily reduced, and how this is related to their use of digital communication for social connection. In this study, we draw upon survey data that we collected in spring 2020 from a national sample of U.S. adults to answer these questions. We find that most people missed elements of face-to-face interactions and particularly valued the spontaneous interactions, physical closeness, and independence from technology that in-person interactions offer. More frequent and increasing use of popular digital modes such as voice calls, video calls, text messages, and social media were all positively related to missing face-to-face communication. Our results contribute to the understanding of the role and value of in-person interactions in a digital world.

Communication. Mass media, Social Sciences
DOAJ Open Access 2020
Dancing monkeys in Serbian and Korean – exhaustivity requirements on distributive share markers

Ana Bosnić, Hamida Demirdache, Jennifer Spenader

In some languages, distributive markers/quantifiers can attach to the argument that is being distributed (the distributive share), as opposed to the restrictor of the sentence (the distributive key). Researchers agree that distributive share markers can also distribute over events (and not only individuals), but disagree as to what these markers are semantically – universal distributive quantifiers or event plurality (pluractional) markers. In this paper, we experimentally probe spatial event distribution. On a universal quantification account, exhaustive distribution over a spatial distributive key is enforced, while on the pluractional analysis there is no such requirement. We carried out two picture verification experiments to test exhaustivity requirements in intransitive sentences with distributive share markers from two typologically different languages: the Serbian marker po and the Korean marker -ssik. We found evidence for an exhaustivity requirement over pluralities of non-atomic individuals (groups), but not over designated spatial locations. We interpret these findings as evidence that the semantics of (spatial) event distribution with distributive share markers involves a (spatial) distributive key. Specifically, po/-ssik have a universal quantificational force (with a meaning akin to per (each)) establishing a distributive relation between individual events and elements of the spatial distributive key. Plural individuals made salient in the visual input can serve to divide up the spatial key into chunks of space that have to be exhausted.

Language. Linguistic theory. Comparative grammar

Page 1 of 29,581