Results for "Philology. Linguistics"

Showing 20 of ~630,632 results · from DOAJ, arXiv, CrossRef

arXiv Open Access 2026
LLM NL2SQL Robustness: Surface Noise vs. Linguistic Variation in Traditional and Agentic Settings

Lifu Tu, Rongguang Wang, Tao Sheng et al.

Robustness evaluation for Natural Language to SQL (NL2SQL) systems is essential because real-world database environments are dynamic, noisy, and continuously evolving, whereas conventional benchmark evaluations typically assume static schemas and well-formed user inputs. In this work, we introduce a robustness evaluation benchmark containing approximately ten types of perturbations and conduct evaluations under both traditional and agentic settings. We assess multiple state-of-the-art large language models (LLMs), including Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, and GPT-5.2. Our results show that these models generally maintain strong performance under several perturbations; however, notable performance degradation is observed for surface-level noise (e.g., character-level corruption) and linguistic variation that preserves semantics while altering lexical or syntactic forms. Furthermore, we observe that surface-level noise causes larger performance drops in traditional pipelines, whereas linguistic variation presents greater challenges in agentic settings. These findings highlight the remaining challenges in achieving robust NL2SQL systems, particularly in handling linguistic variability.

en cs.CL, cs.AI
arXiv Open Access 2025
The Grammar of FAIR: A Granular Architecture of Semantic Units for FAIR Semantics, Inspired by Biology and Linguistics

Lars Vogt, Barend Mons

The FAIR Principles aim to make data and knowledge Findable, Accessible, Interoperable, and Reusable, yet current digital infrastructures often lack a unifying semantic framework that bridges human cognition and machine-actionability. In this paper, we introduce the Grammar of FAIR: a granular and modular architecture for FAIR semantics built on the concept of semantic units. Semantic units, comprising atomic statement units and composite compound units, implement the principle of semantic modularisation, decomposing data and knowledge into independently identifiable, semantically meaningful, and machine-actionable units. A central metaphor guiding our approach is the analogy between the hierarchy of levels of organisation in biological systems and the hierarchy of levels of organisation in information systems: both are structured by granular building blocks that mediate across multiple perspectives while preserving functional unity. Drawing further inspiration from concept formation and natural language grammar, we show how these building blocks map to FAIR Digital Objects (FDOs), enabling format-agnostic semantic transitivity from natural language token models to schema-based representations. This dual biological-linguistic analogy provides a semantics-first foundation for evolving cross-ecosystem infrastructures, paving the way for the Internet of FAIR Data and Services (IFDS) and a future of modular, AI-ready, and citation-granular scholarly communication.

en cs.DB
arXiv Open Access 2025
Cross-Linguistic Transfer in Multilingual NLP: The Role of Language Families and Morphology

Ajitesh Bankula, Praney Bankula

Cross-lingual transfer has become a crucial aspect of multilingual NLP, as it allows models trained on resource-rich languages to be applied more effectively to low-resource languages. Recently, massively multilingual pre-trained language models (e.g., mBERT, XLM-R) have demonstrated strong zero-shot transfer capabilities [14][13]. This paper investigates cross-linguistic transfer through the lens of language families and morphology, examining how language family proximity and morphological similarity affect performance across NLP tasks. We further discuss our results and how they relate to findings from recent literature. Overall, we compare multilingual model performance and review how linguistic distance metrics correlate with transfer outcomes. We also examine emerging approaches that integrate typological and morphological information into model pre-training to improve transfer to diverse languages [18][19].

en cs.CL
arXiv Open Access 2025
Linguistic and Audio Embedding-Based Machine Learning for Alzheimer's Dementia and Mild Cognitive Impairment Detection: Insights from the PROCESS Challenge

Adharsha Sam Edwin Sam Devahi, Sohail Singh Sangha, Prachee Priyadarshinee et al.

Early detection of Alzheimer's Dementia (AD) and Mild Cognitive Impairment (MCI) is critical for timely intervention, yet current diagnostic approaches remain resource-intensive and invasive. Speech, encompassing both acoustic and linguistic dimensions, offers a promising non-invasive biomarker for cognitive decline. In this study, we present a machine learning framework for the PROCESS Challenge, leveraging both audio embeddings and linguistic features derived from spontaneous speech recordings. Audio representations were extracted using Whisper embeddings from the Cookie Theft description task, while linguistic features, spanning pronoun usage, syntactic complexity, filler words, and clause structure, were obtained from transcriptions across the Semantic Fluency, Phonemic Fluency, and Cookie Theft picture description tasks. Classification models aimed to distinguish between Healthy Controls (HC), MCI, and AD participants, while regression models predicted Mini-Mental State Examination (MMSE) scores. Results demonstrated that voted ensemble models trained on concatenated linguistic features achieved the best classification performance (F1 = 0.497), while Whisper embedding-based ensemble regressors yielded the lowest MMSE prediction error (RMSE = 2.843). Comparative evaluation within the PROCESS Challenge placed our models among the top submissions in the regression task and mid-range for classification, highlighting the complementary strengths of linguistic and audio embeddings. These findings reinforce the potential of multimodal speech-based approaches for scalable, non-invasive cognitive assessment and underline the importance of integrating task-specific linguistic and acoustic markers in dementia detection.

en cs.SD, cs.AI
arXiv Open Access 2024
OVGNet: A Unified Visual-Linguistic Framework for Open-Vocabulary Robotic Grasping

Li Meng, Zhao Qi, Lyu Shuchang et al.

Recognizing and grasping novel-category objects remains a crucial yet challenging problem in real-world robotic applications. Despite its significance, limited research has been conducted in this specific domain. To address this, we propose a novel framework that seamlessly integrates open-vocabulary learning into the domain of robotic grasping, empowering robots with the capability to adeptly handle novel objects. Our contributions are threefold. Firstly, we present a large-scale benchmark dataset specifically tailored for evaluating the performance of open-vocabulary grasping tasks. Secondly, we propose a unified visual-linguistic framework that serves as a guide for robots in successfully grasping both base and novel objects. Thirdly, we introduce two alignment modules designed to enhance visual-linguistic perception in the robotic grasping process. Extensive experiments validate the efficacy and utility of our approach. Notably, our framework achieves an average accuracy of 71.2% and 64.4% on base and novel categories in our new dataset, respectively.

en cs.RO
arXiv Open Access 2024
Transfer the linguistic representations from TTS to accent conversion with non-parallel data

Xi Chen, Jiakun Pei, Liumeng Xue et al.

Accent conversion aims to convert the accent of a source speech to a target accent while preserving the speaker's identity. This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent in the source speech. Specifically, the proposed system aligns speech representations with linguistic representations obtained from Text-to-Speech (TTS) systems, enabling training of the accent voice conversion model on non-parallel data. Furthermore, we investigate the effectiveness of a pretraining strategy on native data and different acoustic features within our proposed framework. We conduct a comprehensive evaluation using both subjective and objective metrics to assess the performance of our approach. The evaluation results highlight the benefits of the pretraining strategy and the incorporation of richer semantic features, resulting in significantly enhanced audio quality and intelligibility.

en cs.CL, cs.SD
arXiv Open Access 2024
IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?

Akhilesh Aravapalli, Mounika Marreddy, Radhika Mamidi et al.

Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available [https://github.com/aforakhilesh/IndicBertology].

en cs.CL, cs.AI
arXiv Open Access 2024
Large corpora and large language models: a replicable method for automating grammatical annotation

Cameron Morin, Matti Marttinen Larsson

Much linguistic research relies on annotated datasets of features extracted from text corpora, but the rapid quantitative growth of these corpora has created practical difficulties for linguists to manually annotate large data samples. In this paper, we present a replicable, supervised method that leverages large language models for assisting the linguist in grammatical annotation through prompt engineering, training, and evaluation. We introduce a methodological pipeline applied to the case study of formal variation in the English evaluative verb construction 'consider X (as) (to be) Y', based on the large language model Claude 3.5 Sonnet and corpus data from Davies' NOW and EnTenTen21 (SketchEngine). Overall, we reach a model accuracy of over 90% on our held-out test samples with only a small amount of training data, validating the method for the annotation of very large quantities of tokens of the construction in the future. We discuss the generalisability of our results for a wider range of case studies of grammatical constructions and grammatical variation and change, underlining the value of AI copilots as tools for future linguistic research, notwithstanding some important caveats.

arXiv Open Access 2024
EmojiPrompt: Generative Prompt Obfuscation for Privacy-Preserving Communication with Cloud-based LLMs

Sam Lin, Wenyue Hua, Zhenting Wang et al.

Cloud-based Large Language Models (LLMs) such as ChatGPT have become increasingly integral to daily operations. Nevertheless, they also introduce privacy concerns: firstly, numerous studies underscore the risks to user privacy posed by jailbreaking cloud-based LLMs; secondly, the LLM service providers have access to all user data, which deters individuals from confidently utilizing such services. To address such concerns, we propose a simple yet effective paradigm, EmojiPrompt, to protect user privacy. At its core, EmojiPrompt performs generative transformation, obfuscating private data within prompts with linguistic and non-linguistic elements before submitting them to cloud-based LLMs. We evaluate EmojiPrompt's performance across 8 datasets from various domains. We also propose simulated inference attacks to assess EmojiPrompt's ability to preserve user privacy. The results demonstrate that EmojiPrompt effectively obfuscates user private data, while largely maintaining, or even enhancing, performances compared to the unobfuscated version. Furthermore, EmojiPrompt's atomic-level obfuscation allows it to function exclusively with cloud-based LLMs. For source code, please refer to: https://github.com/agiresearch/EmojiCrypt.

en cs.CL, cs.AI
arXiv Open Access 2024
Small Language Models Also Work With Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas

Bastian Bunzeck, Daniel Duran, Leonie Schade et al.

Recent work investigates whether LMs learn human-like linguistic generalizations and representations from developmentally plausible amounts of data. Yet, the basic linguistic units processed in these LMs are determined by subword-based tokenization, which limits their validity as models of learning at and below the word level. In this paper, we explore the potential of tokenization-free, phoneme- and grapheme-based language models. We demonstrate that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks when trained with character-level vocabularies. We further show that phoneme-based models almost match grapheme-based models in standard tasks and novel evaluations. Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for computational studies of language acquisition and processing.

en cs.CL
CrossRef Open Access 2024
A terminological frame-network model of the conceptual and semantic organization of notions of computational linguistics

Olga Nikolaevna Prokhorova, Olga Nikolaevna Polshchykova

The research object is the conceptual and semantic organization of the terminological apparatus, the research subject is the logical and conceptual structure of notions denoted by terms of computational linguistics. The study aims to identify the features of the systemic organization of notions of computational linguistics based on the use of a terminological frame-network model. The paper briefly analyzes the use of the frame approach to modeling the structure of terminological systems by linguists. A terminological frame-network model of the subject area “Computational linguistics” is proposed. The study is illustrated with fragments of the terminological network of frames of the subject area “Synonymy of terms of computational linguistics” presented in the form of diagrams. The work is novel in that it provides an original definition of a terminological network of frames and is the first to formulate the rules for its construction. The research findings showed the possibility of using a terminological frame-network model to analyze the logical and conceptual structure of the subject area “Computational linguistics” and the organization of semantic term-formation processes and lexicosemantic relations in the terminology under study.

CrossRef Open Access 2024
“Slips of the tongue” in linguistics and psychoanalysis: Structure, semantics and discourse

Arkadiy Petrovich Sedykh, Anastasia Olegovna Chizhova, Elvira Nikolajevna Akimova

The aim of the study is to identify the vectors of interaction between linguistic and psychoanalytic discourses. The scientific novelty of the work consists in constructing a universal linguo-semiological model for describing parapraxis discourse in order to clarify the conceptual apparatus and more clearly structure the semantic terminology of linguistics and psychoanalysis, as well as the theory of language and the theory of communication. The key point of the study is the attitude to language as an open, actively developing system of semiotic supports and reference points used to achieve the goals of interaction with other individuals, which cannot be interpreted without involving emotions, memory, additional cognitive efforts, will and imagination. Also important is the hybrid method of information processing at the junction of linguistics and psychoanalysis, in line with the interpretative-semiological approach. The obtained data make it possible to identify a number of important mechanisms of linguosemiotic synergy between the subjects of narratives across various sciences. These mechanisms are necessary for adequate communication tied to the dynamics of linguosemiotic and linguocultural parameters of the speech act, and they point toward a new metadiscourse axiology in both theory and practice, not only in the linguosemiotics of psychoanalysis but also in general linguistics.

CrossRef Open Access 2023
Pāṇini and the Non-Head (<i>upasarjana</i>) of Attributive Endocentric Compounds

Davide Mocci

This study aims at contributing to our understanding of Pāṇini’s classification of compounds. In particular, this study investigates the notion upasarjana – roughly translatable as ‘non-head’ – in attributive endocentric (so-called karmadhāraya) compounds like nava-jvāra- ‘new suffering’, by addressing the following questions: do the units which Pāṇini designated as upasarjana in subordinate endocentric (so-called tatpuruṣa proper) compounds like aśva-śapha- ‘horse’s hoof’ share any feature with the units which he designated as upasarjana in karmadhārayas? More generally, what is the hallmark of the units designated as upasarjana? To answer these questions, we shall delve into several rules of Pāṇini’s grammar – the Aṣṭādhyāyī – which conceal such powerful grammatical tools as silent case endings and the operation of case-copying.

DOAJ Open Access 2023
Authentic Materials And Task Design: A Teaching Amalgam

Aleyda Jasmin Alfonso Vargas, Paola Ximena Romero Molina

The use of authentic materials in foreign language teaching has been a growing practice over the last few decades. With the emergence of technology, these materials are more accessible, particularly in digital formats. Despite being primarily designed for non-pedagogical uses, these materials have become valuable resources for English language learning. However, using them effectively for teaching and learning purposes requires a systematic structure. This article examines the use of authentic materials in relation to various language skills, with a particular focus on listening comprehension framed within a pre-, while-, and post-stage cycle at both local and international levels, with the former referring to the Colombian context. A review of scholarship shows that the use of authentic materials is a seemingly frequent practice among teacher-researchers, but not among a larger audience of practitioners. Additionally, studies do not provide guidance on how to prepare practitioners to design appropriate activities that complement authentic materials. Therefore, after examining studies and based on our teaching-research experience, we aim to contribute to the implementation of a pedagogical strategy that combines the systematic use of authentic materials with listening comprehension and a task design that provides a balance of challenge and support.

Philology. Linguistics
arXiv Open Access 2023
Is Japanese CCGBank empirically correct? A case study of passive and causative constructions

Daisuke Bekki, Hitomi Yanaka

The Japanese CCGBank serves as training and evaluation data for developing Japanese CCG parsers. However, since it is automatically generated from the Kyoto Corpus, a dependency treebank, its linguistic validity still needs to be sufficiently verified. In this paper, we focus on the analysis of passive/causative constructions in the Japanese CCGBank and show that, together with the compositional semantics of ccg2lambda, a semantic parsing system, it yields empirically wrong predictions for the nested construction of passives and causatives.

en cs.CL
arXiv Open Access 2023
Comparing Measures of Linguistic Diversity Across Social Media Language Data and Census Data at Subnational Geographic Areas

Sidney G. -J. Wong, Jonathan Dunn, Benjamin Adams

This paper describes a preliminary study on the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces in Aotearoa New Zealand (i.e., subnational administrative areas). We compare measures of linguistic diversity between these different spaces and discuss how social media users align with real-world populations. The results from the current study suggest that there is potential to use online social media language data to observe spatial and temporal changes in linguistic diversity at subnational geographic areas; however, further work is required to understand how well social media represents real-world behaviour.

DOAJ Open Access 2022
Methodological Challenges in Audiovisual Translation: Experimenting New Software for Multimodal Corpus-based Analysis

Dora Renna

Audiovisual translation has long struggled to strike a balance between corpus-based analysis of large amounts of text and the need to systematically integrate multimodality in its research scope, in order to fully acknowledge the complex nature of the audiovisual product. This paper aims to report on the experimentation currently being conducted at Ca’ Foscari University of Venice, in collaboration with the University of Basel (Switzerland): using existing software (created for pragmatics) for audiovisual translation. The main aim is to make sure this software is able to support the researcher in transcribing, annotating, adding metadata, managing and querying text and video files. The software under experimentation has never before been used for parallel aligned audiovisual text. The contribution presented here first briefly describes a framework developed to analyse language variation and multimodality (with a focus on character design) in audiovisual translation, which was then implemented in the software. Subsequently, the software itself is described in detail, with specific attention to its potential and limits for use within the field of Translation Studies and audiovisual translation. This is done by showing examples from a pilot study that belongs to a broader corpus currently under construction.

Language. Linguistic theory. Comparative grammar, Communication. Mass media
arXiv Open Access 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo et al.

We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models' shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.

en cs.CV, cs.CL
arXiv Open Access 2022
Initial Study into Application of Feature Density and Linguistically-backed Embedding to Improve Machine Learning-based Cyberbullying Detection

Juuso Eronen, Michal Ptaszynski, Fumito Masui et al.

In this research, we study the change in the performance of machine learning (ML) classifiers when various linguistic preprocessing methods are applied to a dataset, with a specific focus on linguistically-backed embeddings in Convolutional Neural Networks (CNN). Moreover, we study the concept of Feature Density and confirm its potential to comparatively predict the performance of ML classifiers, including CNN. The research was conducted on a Formspring dataset provided in a Kaggle competition on automatic cyberbullying detection. The dataset was re-annotated by objective experts (psychologists), as the importance of professional annotation in cyberbullying research has been indicated multiple times. The study confirmed the effectiveness of Neural Networks in cyberbullying detection and the correlation between classifier performance and Feature Density, while also proposing a new approach of training various linguistically-backed embeddings for Convolutional Neural Networks.

en cs.CL, cs.AI

Page 37 of 31,532