Hasil "Language. Linguistic theory. Comparative grammar"

arXiv Open Access 2025

CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning

Masato Kikuchi, Masatsugu Ono, Toshioki Soga et al.

Although WordNet is a valuable resource because of its structured semantic networks and extensive vocabulary, its fine-grained sense distinctions can be challenging for second-language learners. To address this issue, we developed a version of WordNet annotated with the Common European Framework of Reference for Languages (CEFR), integrating its semantic networks with language-proficiency levels. We automated this process using a large language model to measure the semantic similarity between sense definitions in WordNet and entries in the English Vocabulary Profile Online. To validate our approach, we constructed a large-scale corpus containing both sense and CEFR-level information from the annotated WordNet and used it to develop contextual lexical classifiers. Our experiments demonstrate that models fine-tuned on this corpus perform comparably to those fine-tuned on gold-standard annotations. Furthermore, by combining this corpus with the gold-standard data, we developed a practical classifier that achieves a Macro-F1 score of 0.81. This result provides indirect evidence that the transferred labels are largely consistent with the gold-standard levels. The annotated WordNet, corpus, and classifiers are publicly available to help bridge the gap between natural language processing and language education, thereby facilitating more effective and efficient language learning.

en cs.CL

Detail Sumber

arXiv Open Access 2025

Large Language Models and Arabic Content: A Review

Haneh Rhel, Dmitri Roussinov

Over the past three years, the rapid advancement of Large Language Models (LLMs) has had a profound impact on multiple areas of Artificial Intelligence (AI), particularly in Natural Language Processing (NLP) across diverse languages, including Arabic. Although Arabic is considered one of the most widely spoken languages across 27 countries in the Arabic world and used as a second language in some other non-Arabic countries as well, there is still a scarcity of Arabic resources, datasets, and tools. Arabic NLP tasks face various challenges due to the complexities of the Arabic language, including its rich morphology, intricate structure, and diverse writing standards, among other factors. Researchers have been actively addressing these challenges, demonstrating that pre-trained Large Language Models (LLMs) trained on multilingual corpora achieve significant success in various Arabic NLP tasks. This study provides an overview of using large language models (LLMs) for the Arabic language, highlighting early pre-trained Arabic Language models across various NLP applications and their ability to handle diverse Arabic content tasks and dialects. It also provides an overview of how techniques like finetuning and prompt engineering can enhance the performance of these models. Additionally, the study summarizes common Arabic benchmarks and datasets while presenting our observations on the persistent upward trend in the adoption of LLMs.

en cs.CL, cs.AI

Detail Sumber

arXiv Open Access 2025

Generating Inputs for Grammar Mining using Dynamic Symbolic Execution

Andreas Pointner, Josef Pichler, Herbert Prähofer

A vast number of software systems include components that parse and process structured input. In addition to programming languages, which are analyzed by compilers or interpreters, there are numerous components that process standardized or proprietary data formats of varying complexity. Even if such components were initially developed and tested based on a specification, such as a grammar, numerous modifications and adaptations over the course of software evolution can make it impossible to precisely determine which inputs they actually accept. In this situation, grammar mining can be used to reconstruct the specification in the form of a grammar. Established approaches already produce useful results, provided that sufficient input data is available to fully cover the input language. However, achieving this completeness is a major challenge. In practice, only input data recorded during the operation of the software systems is available. If this data is used for grammar mining, the resulting grammar reflects only the actual processed inputs but not the complete grammar of the input language accepted by the software component. As a result, edge cases or previously supported features that no longer appear in the available input data are missing from the generated grammar. This work addresses this challenge by introducing a novel approach for the automatic generation of inputs for grammar mining. Although input generators have already been used for fuzz testing, it remains unclear whether they are also suitable for grammar miners. Building on the grammar miner Mimid, this work presents a fully automated approach to input generation. The approach leverages Dynamic Symbolic Execution (DSE) and extends it with two mechanisms to overcome the limitations of DSE regarding structured input parsers. First, the search for new inputs is guided by an iterative expansion that starts with a single-character input and gradually extends it. Second, input generation is structured into a novel three-phase approach, which separates the generation of inputs for parser functions. The proposed method was evaluated against a diverse set of eleven benchmark applications from the existing literature. Results demonstrate that the approach achieves precision and recall for extracted grammars close to those derived from state-of-the-art grammar miners such as Mimid. Notably, it successfully uncovers subtle features and edge cases in parsers that are typically missed by such grammar miners. The effectiveness of the method is supported by empirical evidence, showing that it can achieve high performance in various domains without requiring prior input samples. This contribution is significant for researchers and practitioners in software engineering, offering an automated, scalable, and precise solution for grammar mining. By eliminating the need for manual input generation, the approach not only reduces workload but also enhances the robustness and comprehensiveness of the extracted grammars. Following this approach, software engineers can reconstruct specification from existing (legacy) parsers.

en cs.PL

Detail DOI Sumber

arXiv Open Access 2025

Detecting Linguistic Diversity on Social Media

Sidney Wong, Benjamin Adams, Jonathan Dunn

This chapter explores the efficacy of using social media data to examine changing linguistic behaviour of a place. We focus our investigation on Aotearoa New Zealand where official statistics from the census is the only source of language use data. We use published census data as the ground truth and the social media sub-corpus from the Corpus of Global Language Use as our alternative data source. We use place as the common denominator between the two data sources. We identify the language conditions of each tweet in the social media data set and validated our results with two language identification models. We then compare levels of linguistic diversity at national, regional, and local geographies. The results suggest that social media language data has the possibility to provide a rich source of spatial and temporal insights on the linguistic profile of a place. We show that social media is sensitive to demographic and sociopolitical changes within a language and at low-level regional and local geographies.

en cs.CL

Detail Sumber

arXiv Open Access 2024

A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios

Samuel Ackerman, Ella Rabinovich, Eitan Farchi et al.

We evaluate the robustness of several large language models on multiple datasets. Robustness here refers to the relative insensitivity of the model's answers to meaning-preserving variants of their input. Benchmark datasets are constructed by introducing naturally-occurring, non-malicious perturbations, or by generating semantically equivalent paraphrases of input questions or statements. We further propose a novel metric for assessing a model robustness, and demonstrate its benefits in the non-adversarial scenario by empirical evaluation of several models on the created datasets.

en cs.CL, stat.AP

Detail Sumber

arXiv Open Access 2024

Training Data for Large Language Model

Yiming Ju, Huanhuan Ma

In 2022, with the release of ChatGPT, large-scale language models gained widespread attention. ChatGPT not only surpassed previous models in terms of parameters and the scale of its pretraining corpus but also achieved revolutionary performance improvements through fine-tuning on a vast amount of high-quality, human-annotated data. This progress has led enterprises and research institutions to recognize that building smarter and more powerful models relies on rich and high-quality datasets. Consequently, the construction and optimization of datasets have become a critical focus in the field of artificial intelligence. This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models, covering aspects such as data scale, collection methods, data types and characteristics, processing workflows, and provides an overview of available open-source datasets.

en cs.AI

Detail Sumber

arXiv Open Access 2024

LOLA -- An Open-Source Massively Multilingual Large Language Model

Nikit Srivastava, Denis Kuchelev, Tatiana Moteu Ngoli et al.

This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model's strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.

en cs.CL, cs.AI

Detail Sumber

arXiv Open Access 2023

Multiplicity Problems on Algebraic Series and Context-Free Grammars

Nikhil Balaji, Lorenzo Clemente, Klara Nosan et al.

In this paper we obtain complexity bounds for computational problems on algebraic power series over several commuting variables. The power series are specified by systems of polynomial equations: a formalism closely related to weighted context-free grammars. We focus on three problems -- decide whether a given algebraic series is identically zero, determine whether all but finitely many coefficients are zero, and compute the coefficient of a specific monomial. We relate these questions to well-known computational problems on arithmetic circuits and thereby show that all three problems lie in the counting hierarchy. Our main result improves the best known complexity bound on deciding zeroness of an algebraic series. This problem is known to lie in PSPACE by reduction to the decision problem for the existential fragment of the theory of real closed fields. Here we show that the problem lies in the counting hierarchy by reduction to the problem of computing the degree of a polynomial given by an arithmetic circuit. As a corollary we obtain new complexity bounds on multiplicity equivalence of context-free grammars restricted to a bounded language, language inclusion of a nondeterministic finite automaton in an unambiguous context-free grammar, and language inclusion of a non-deterministic context-free grammar in an unambiguous finite automaton.

en cs.FL, cs.CC

Detail Sumber

arXiv Open Access 2023

WizardLM: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qingfeng Sun, Kai Zheng et al.

Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's testset show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90\% capacity of ChatGPT on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at https://github.com/nlpxucan/WizardLM

en cs.CL, cs.AI

Detail Sumber

CrossRef Open Access 2022

The relevance of Autism for Evolutionary Linguistic Theory

Stephanie Durrleman

en

Detail DOI Sumber

arXiv Open Access 2022

Quantum Natural Language Generation on Near-Term Devices

Amin Karamlou, Marcel Pfaffhauser, James Wootton

The emergence of noisy medium-scale quantum devices has led to proof-of-concept applications for quantum computing in various domains. Examples include Natural Language Processing (NLP) where sentence classification experiments have been carried out, as well as procedural generation, where tasks such as geopolitical map creation, and image manipulation have been performed. We explore applications at the intersection of these two areas by designing a hybrid quantum-classical algorithm for sentence generation. Our algorithm is based on the well-known simulated annealing technique for combinatorial optimisation. An implementation is provided and used to demonstrate successful sentence generation on both simulated and real quantum hardware. A variant of our algorithm can also be used for music generation. This paper aims to be self-contained, introducing all the necessary background on NLP and quantum computing along the way.

en quant-ph, cs.CL

Detail Sumber

arXiv Open Access 2022

Exploring the Value of Pre-trained Language Models for Clinical Named Entity Recognition

Samuel Belkadi, Lifeng Han, Yuping Wu et al.

The practice of fine-tuning Pre-trained Language Models (PLMs) from general or domain-specific data to a specific task with limited resources, has gained popularity within the field of natural language processing (NLP). In this work, we re-visit this assumption and carry out an investigation in clinical NLP, specifically Named Entity Recognition on drugs and their related attributes. We compare Transformer models that are trained from scratch to fine-tuned BERT-based LLMs namely BERT, BioBERT, and ClinicalBERT. Furthermore, we examine the impact of an additional CRF layer on such models to encourage contextual learning. We use n2c2-2018 shared task data for model development and evaluations. The experimental outcomes show that 1) CRF layers improved all language models; 2) referring to BIO-strict span level evaluation using macro-average F1 score, although the fine-tuned LLMs achieved 0.83+ scores, the TransformerCRF model trained from scratch achieved 0.78+, demonstrating comparable performances with much lower cost - e.g. with 39.80\% less training parameters; 3) referring to BIO-strict span-level evaluation using weighted-average F1 score, ClinicalBERT-CRF, BERT-CRF, and TransformerCRF exhibited lower score differences, with 97.59\%/97.44\%/96.84\% respectively. 4) applying efficient training by down-sampling for better data distribution further reduced the training cost and need for data, while maintaining similar scores - i.e. around 0.02 points lower compared to using the full dataset. Our models will be hosted at \url{https://github.com/HECTA-UoM/TransformerCRF}

en cs.CL, cs.AI

Detail Sumber

arXiv Open Access 2022

Towards Pragmatic Production Strategies for Natural Language Generation Tasks

Mario Giulianelli

This position paper proposes a conceptual framework for the design of Natural Language Generation (NLG) systems that follow efficient and effective production strategies in order to achieve complex communicative goals. In this general framework, efficiency is characterised as the parsimonious regulation of production and comprehension costs while effectiveness is measured with respect to task-oriented and contextually grounded communicative goals. We provide concrete suggestions for the estimation of goals, costs, and utility via modern statistical methods, demonstrating applications of our framework to the classic pragmatic task of visually grounded referential games and to abstractive text summarisation, two popular generation tasks with real-world applications. In sum, we advocate for the development of NLG systems that learn to make pragmatic production decisions from experience, by reasoning about goals, costs, and utility in a human-like way.

en cs.CL, cs.AI

Detail Sumber

arXiv Open Access 2022

Construction of English Resume Corpus and Test with Pre-trained Language Models

Chengguang Gan, Tatsunori Mori

Information extraction(IE) has always been one of the essential tasks of NLP. Moreover, one of the most critical application scenarios of information extraction is the information extraction of resumes. Constructed text is obtained by classifying each part of the resume. It is convenient to store these texts for later search and analysis. Furthermore, the constructed resume data can also be used in the AI resume screening system. Significantly reduce the labor cost of HR. This study aims to transform the information extraction task of resumes into a simple sentence classification task. Based on the English resume dataset produced by the prior study. The classification rules are improved to create a larger and more fine-grained classification dataset of resumes. This corpus is also used to test some current mainstream Pre-training language models (PLMs) performance.Furthermore, in order to explore the relationship between the number of training samples and the correctness rate of the resume dataset, we also performed comparison experiments with training sets of different train set sizes.The final multiple experimental results show that the resume dataset with improved annotation rules and increased sample size of the dataset improves the accuracy of the original resume dataset.

en cs.CL

Detail Sumber

arXiv Open Access 2022

Effectiveness of French Language Models on Abstractive Dialogue Summarization Task

Yongxin Zhou, François Portet, Fabien Ringeval

Pre-trained language models have established the state-of-the-art on various natural language processing tasks, including dialogue summarization, which allows the reader to quickly access key information from long conversations in meetings, interviews or phone calls. However, such dialogues are still difficult to handle with current models because the spontaneity of the language involves expressions that are rarely present in the corpora used for pre-training the language models. Moreover, the vast majority of the work accomplished in this field has been focused on English. In this work, we present a study on the summarization of spontaneous oral dialogues in French using several language specific pre-trained models: BARThez, and BelGPT-2, as well as multilingual pre-trained models: mBART, mBARThez, and mT5. Experiments were performed on the DECODA (Call Center) dialogue corpus whose task is to generate abstractive synopses from call center conversations between a caller and one or several agents depending on the situation. Results show that the BARThez models offer the best performance far above the previous state-of-the-art on DECODA. We further discuss the limits of such pre-trained models and the challenges that must be addressed for summarizing spontaneous dialogues.

en cs.CL, cs.AI

Detail Sumber

arXiv Open Access 2022

Reinforcement Learning and Bandits for Speech and Language Processing: Tutorial, Review and Outlook

Baihan Lin

In recent years, reinforcement learning and bandits have transformed a wide range of real-world applications including healthcare, finance, recommendation systems, robotics, and last but not least, the speech and natural language processing. While most speech and language applications of reinforcement learning algorithms are centered around improving the training of deep neural networks with its flexible optimization properties, there are still many grounds to explore to utilize the benefits of reinforcement learning, such as its reward-driven adaptability, state representations, temporal structures and generalizability. In this survey, we present an overview of recent advancements of reinforcement learning and bandits, and discuss how they can be effectively employed to solve speech and natural language processing problems with models that are adaptive, interactive and scalable.

en cs.AI, cs.CL

Detail Sumber

arXiv Open Access 2020

Anchor-based Bilingual Word Embeddings for Low-Resource Languages

Tobias Eder, Viktor Hangya, Alexander Fraser

Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text. MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs. For low resource languages training MWEs monolingually results in MWEs of poor quality, and thus poor bilingual word embeddings (BWEs) as well. This paper proposes a new approach for building BWEs in which the vector space of the high resource source language is used as a starting point for training an embedding space for the low resource target language. By using the source vectors as anchors the vector spaces are automatically aligned during training. We experiment on English-German, English-Hiligaynon and English-Macedonian. We show that our approach results not only in improved BWEs and bilingual lexicon induction performance, but also in improved target language MWE quality as measured using monolingual word similarity.

en cs.CL

Detail Sumber

arXiv Open Access 2019

Resource production of written forms of Sign Languages by a user-centered editor, SWift (SignWriting improved fast transcriber)

Fabrizio Borgia, Claudia S. Bianchini, Patrice Dalle et al.

The SignWriting improved fast transcriber (SWift), presented in this paper, is an advanced editor for computer-aided writing and transcribing of any Sign Language (SL) using SignWriting (SW). The application is an editor which allows composing and saving desired signs using the SW elementary components, called "glyphs". These make up a sort of alphabet, which does not depend on the national Sign Language and which codes the basic components of any sign. The user is guided through a fully-automated procedure, making the composition process fast and intuitive. SWift pursues the goal of helping to break down the "electronic barriers" that keep deaf people away from the web, and at the same time to support linguistic research about Sign Languages features. For this reason it has been designed with a special attention to deaf user needs, and to general usability issues. The editor has been developed in a modular way, so it can be integrated everywhere the use of SW as an alternative to written "verbal" language may be advisable.

en cs.CL

Detail Sumber

arXiv Open Access 2019

Towards Interpretable Graph Modeling with Vertex Replacement Grammars

Justus Hibshman, Satyaki Sikdar, Tim Weninger

An enormous amount of real-world data exists in the form of graphs. Oftentimes, interesting patterns that describe the complex dynamics of these graphs are captured in the form of frequently reoccurring substructures. Recent work at the intersection of formal language theory and graph theory has explored the use of graph grammars for graph modeling and pattern mining. However, existing formulations do not extract meaningful and easily interpretable patterns from the data. The present work addresses this limitation by extracting a special type of vertex replacement grammar, which we call a KT grammar, according to the Minimum Description Length (MDL) heuristic. In experiments on synthetic and real-world datasets, we show that KT-grammars can be efficiently extracted from a graph and that these grammars encode meaningful patterns that represent the dynamics of the real-world system.

en cs.SI, cs.DM

Detail DOI Sumber

CrossRef Open Access 2007

Theoretical and developmental issues in the syntax of subjects: Evidence from near-native Italian

Adriana Belletti, Elisa Bennati, Antonella Sorace

239 sitasi en

Detail DOI Sumber

Hasil untuk "Language. Linguistic theory. Comparative grammar"