Results for "Philology. Linguistics"

Showing 20 of ~793,766 results · from CrossRef, DOAJ, arXiv, Semantic Scholar

arXiv Open Access 2025
BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge

Daeen Kabir, Minhajur Rahman Chowdhury Mahim, Sheikh Shafayat et al.

In this work, we introduce BLUCK, a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2,366 multiple-choice questions (MCQs) carefully curated from compiled collections of several college- and job-level examinations and spans 23 categories covering Bangladesh's culture and history as well as Bengali linguistics. We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs, including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3. Our results show that while these models perform reasonably well overall, they struggle in some areas of Bengali phonetics. Although current LLMs' performance on Bengali cultural and linguistic contexts is still not comparable to that of mainstream languages like English, our results indicate Bengali's status as a mid-resource language. Importantly, BLUCK is also the first MCQ-based evaluation benchmark centered around native Bengali culture, history, and linguistics.

en cs.CL, cs.AI
arXiv Open Access 2025
The Syntactic Acceptability Dataset (Preview): A Resource for Machine Learning and Linguistic Analysis of English

Tom S Juzek

We present a preview of the Syntactic Acceptability Dataset, a resource being designed for both syntax and computational linguistics research. In its current form, the dataset comprises 1,000 English sequences from the syntactic discourse: half from textbooks and half from the journal Linguistic Inquiry, the latter to ensure representation of the contemporary discourse. Each entry is labeled with its grammatical status ("well-formedness" according to syntactic formalisms) extracted from the literature, as well as its acceptability status ("intuitive goodness" as determined by native speakers) obtained through crowdsourcing with the highest experimental standards. Even in its preliminary form, this dataset stands as the largest of its kind that is publicly accessible. We also offer preliminary analyses addressing three debates in linguistics and computational linguistics: we observe that grammaticality and acceptability judgments converge in about 83% of cases and that "in-betweenness" occurs frequently, corroborating existing research. We also find that while machine learning models struggle to predict grammaticality, they perform considerably better at predicting acceptability, which is a novel finding. Future work will focus on expanding the dataset.

arXiv Open Access 2025
A Bayesian dawn in linguistics: Trends, benefits and good practices

Natalia Levshina

In recent years, Bayesian statistics has gained traction across a wide range of scientific disciplines. This paper explores the growing application of Bayesian methods within the field of linguistics and considers their future potential. A survey of articles from different linguistics journals indicates that Bayesian regression has transitioned from fringe to fairly mainstream over the past five years. This paper discusses the main drivers of this shift, including the increased availability of user-friendly software and the replicability crisis in adjacent disciplines, which exposed the shortcomings of the traditional statistical paradigm. It outlines the fundamental conceptual distinctions between frequentist and Bayesian approaches, emphasizing how Bayesian methods can help address the problems. Additionally, the paper highlights the methodological benefits of Bayesian regression for a diverse array of research questions and data types. It also identifies key theoretical and practical challenges associated with Bayesian analysis and offers a set of good practices and recommendations for researchers considering the adoption of Bayesian methods.

en stat.AP
arXiv Open Access 2025
Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition

Yifei Zhang, Chang Liu, Jin Wei et al.

Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as crucial supplements for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information.

en cs.CV
DOAJ Open Access 2024
Linguistic and cultural characteristics of the speech genre “wish”

Karasik, Vladimir Ilyich

The paper deals with the speech genre “wish”, treated as an emotional and volitive communicative event whose meaning is the expressed wish that something good or bad happen to somebody. Grammatically, this genre corresponds to the mood known as the “optative”. It is a speech act describing a situation that implies an emotional reaction on the part of a person. The reaction of an addressee to a good wish is usually expressed in words or gestures of gratitude. A bad wish expressed directly merges with a quarrel and a reverse bad wish. Common objects of good wishes are good health, happiness, prosperity, success, and good luck. Good wishes may be combined with proactive criticism, accusation, reproach, or request. This speech act may be either neutral or discourse-specific (used by soldiers, sailors, students, etc.); it is used in particular when addressing elder people or in connection with certain events, e.g. weddings. Bad wishes were initially pronounced as magic meant to bring harm to someone; nowadays they are used as a kind of invective or joke. Bad wishes are usually combined with insult.

Philology. Linguistics
arXiv Open Access 2024
ILiAD: An Interactive Corpus for Linguistic Annotated Data from Twitter Posts

Simon Gonzalez

Social media platforms have offered invaluable opportunities for linguistic research. The availability of up-to-date data, coming from any part of the world and from natural contexts, has allowed researchers to study language in real time. One of the fields that has made great use of social media platforms is Corpus Linguistics. There is currently a wide range of projects which have successfully created corpora from social media. In this paper, we present the development and deployment of a linguistic corpus of Twitter posts in English, coming from 26 news agencies and 27 individuals. The main goal was to create a fully annotated English corpus for linguistic analysis. We include information on morphology and syntax, as well as NLP features such as tokenization, lemmas, and n-grams. The information is presented through a range of powerful visualisations for users to explore linguistic patterns in the corpus. With this tool, we aim to contribute to the area of language technologies applied to linguistic research.

en cs.CL, cs.DL
arXiv Open Access 2024
Linguistic Minimal Pairs Elicit Linguistic Similarity in Large Language Models

Xinyu Zhou, Delong Chen, Samuel Cahyawijaya et al.

We introduce a novel analysis that leverages linguistic minimal pairs to probe the internal linguistic representations of Large Language Models (LLMs). By measuring the similarity between LLM activation differences across minimal pairs, we quantify linguistic similarity and gain insight into the linguistic knowledge captured by LLMs. Our large-scale experiments, spanning 100+ LLMs and 150k minimal pairs in three languages, reveal properties of linguistic similarity from four key aspects: consistency across LLMs, relation to theoretical categorizations, dependence on semantic context, and cross-lingual alignment of relevant phenomena. Our findings suggest that 1) linguistic similarity is significantly influenced by training data exposure, leading to higher cross-LLM agreement in higher-resource languages; 2) linguistic similarity strongly aligns with fine-grained theoretical linguistic categories but weakly with broader ones; 3) linguistic similarity shows a weak correlation with semantic similarity, revealing its context-dependent nature; 4) LLMs exhibit limited cross-lingual alignment in their understanding of relevant linguistic phenomena. This work demonstrates the potential of minimal pairs as a window into the neural representations of language in LLMs, shedding light on the relationship between LLMs and linguistic theory. Code and data are available at https://github.com/ChenDelong1999/Linguistic-Similarity

en cs.CL
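The measurement described in the abstract above can be sketched in a few lines: for each minimal pair, take the difference between the model's activations for the two sentences, then compare these difference vectors across pairs with cosine similarity. A minimal illustration with random vectors standing in for real LLM activations; the function and variable names are ours, not the paper's:

```python
import numpy as np

def activation_difference(act_good: np.ndarray, act_bad: np.ndarray) -> np.ndarray:
    """Difference of hidden activations for the two sentences of a minimal pair."""
    return act_good - act_bad

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pairwise_similarity(diffs: list) -> np.ndarray:
    """Cosine similarity between the activation differences of all minimal pairs."""
    n = len(diffs)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = cosine(diffs[i], diffs[j])
    return sim

# Toy stand-ins for LLM activations (dimension 8); real use would extract a
# hidden state for each sentence of a pair from the model.
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=8), rng.normal(size=8)) for _ in range(4)]
diffs = [activation_difference(g, b) for g, b in pairs]
sim = pairwise_similarity(diffs)
print(sim.shape)  # (4, 4)
```

High similarity between two pairs would then suggest the model encodes the two underlying linguistic contrasts along a shared direction.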
DOAJ Open Access 2023
Conceptualisation de l'espace numérique dans l'enseignement-apprentissage des langues

France Lafleur

In this article on the conceptualization of the digital space in the teaching-learning of languages, our contribution to the "Architecture of the processes of production and reception" of language (François & Nespoulous, 2014) consists in identifying the interlinguistic didactic constants of the teaching-learning of foreign languages and integrating them into a three-dimensional pedagogical model incorporating the stratified structural components of their deep learning. This is an experimental model, but one immediately applicable in the teaching-learning-evaluation of languages, and it therefore presents one of the results of our current action research in distance learning (FAD). In the introduction, we begin by presenting the many parameters of the place of digital technology in the teaching-learning-assessment of languages. Our methodology is that of analytical and conceptual research on the teaching-learning-evaluation of languages. Our analyses are based on the founding documents of the European Community (EC), in particular the Common European Framework of Reference for Languages, CEFR (Conseil de l'Europe, 2001, 2018, 2021). Our objective is to cover as many as possible of the language components and skills required to update them within the framework of the action-oriented approach advocated by the EC. Our discussion focuses on the technological conditions for applying this model, and our conclusion on the prospects, already at our doorstep, of the organic integration of the artificial intelligence of languages into humans.

Special aspects of education, Philology. Linguistics
arXiv Open Access 2023
JCoLA: Japanese Corpus of Linguistic Acceptability

Taiga Someya, Yushi Sugimoto, Yohei Oseki

Neural language models have exhibited outstanding performance in a range of downstream tasks. However, there is limited understanding of the extent to which these models internalize syntactic knowledge, and various datasets have recently been constructed to facilitate the syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. Specifically, those sentences are manually extracted from linguistics textbooks, handbooks, and journal articles, and split into in-domain data (86%; relatively simple acceptability judgments extracted from textbooks and handbooks) and out-of-domain data (14%; theoretically significant acceptability judgments extracted from journal articles), the latter of which is categorized by 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese language models on JCoLA. The results demonstrate that several models could surpass human performance for the in-domain data, while no models were able to exceed human performance for the out-of-domain data. Error analyses by linguistic phenomena further reveal that although neural language models are adept at handling local syntactic dependencies like argument structure, their performance wanes when confronted with long-distance syntactic dependencies like verbal agreement and NPI licensing.

en cs.CL
arXiv Open Access 2023
The ACL OCL Corpus: Advancing Open Science in Computational Linguistics

Shaurya Rohatgi, Yanxia Qin, Benjamin Aw et al.

We present ACL OCL, a scholarly corpus derived from the ACL Anthology to assist open scientific research in the Computational Linguistics domain. Integrating and enhancing the previous versions of the ACL Anthology, the ACL OCL contributes metadata, PDF files, citation graphs, and additional structured full texts with sections, figures, and links to a large knowledge resource (Semantic Scholar). The ACL OCL spans seven decades and contains 73K papers alongside 210K figures. We spotlight how ACL OCL can be applied to observe trends in computational linguistics. By detecting paper topics with a supervised neural model, we note that interest in "Syntax: Tagging, Chunking and Parsing" is waning and "Natural Language Generation" is resurging. Our dataset is available from HuggingFace (https://huggingface.co/datasets/WINGNUS/ACL-OCL).

en cs.CL, cs.DL
arXiv Open Access 2023
Learning Non-linguistic Skills without Sacrificing Linguistic Proficiency

Mandar Sharma, Nikhil Muralidhar, Naren Ramakrishnan

The field of Math-NLP has witnessed significant growth in recent years, motivated by the desire to expand LLM performance to the learning of non-linguistic notions (numerals, and subsequently, arithmetic reasoning). However, non-linguistic skill injection typically comes at a cost for LLMs: it leads to catastrophic forgetting of core linguistic skills, a consequence that often remains unaddressed in the literature. Although Math-NLP has been able to create LLMs that closely approximate the mathematical skills of a grade-schooler or the arithmetic reasoning skills of a calculator, the practicality of these models fails if they concomitantly shed their linguistic capabilities. In this work, we take a closer look at the phenomenon of catastrophic forgetting as it pertains to LLMs and subsequently offer a novel framework for non-linguistic skill injection based on information-theoretic interventions and skill-specific losses that enable the learning of strict arithmetic reasoning. Our model outperforms the state-of-the-art both on injected non-linguistic skills and on linguistic knowledge retention, and does so with a fraction of the non-linguistic training data (1/4) and zero additional synthetic linguistic training data.

en cs.CL, cs.AI
arXiv Open Access 2023
Why Linguistics Will Thrive in the 21st Century: A Reply to Piantadosi (2023)

Jordan Kodner, Sarah Payne, Jeffrey Heinz

We present a critical assessment of Piantadosi's (2023) claim that "Modern language models refute Chomsky's approach to language," focusing on four main points. First, despite the impressive performance and utility of large language models (LLMs), humans achieve their capacity for language after exposure to several orders of magnitude less data. The fact that young children become competent, fluent speakers of their native languages with relatively little exposure to them is the central mystery of language learning to which Chomsky initially drew attention, and LLMs currently show little promise of solving this mystery. Second, what can the artificial reveal about the natural? Put simply, the implications of LLMs for our understanding of the cognitive structures and mechanisms underlying language and its acquisition are like the implications of airplanes for understanding how birds fly. Third, LLMs cannot constitute scientific theories of language for several reasons, not least of which is that scientific theories must provide interpretable explanations, not just predictions. This leads to our final point: to even determine whether the linguistic and cognitive capabilities of LLMs rival those of humans requires explicating what humans' capacities actually are. In other words, it requires a separate theory of language and cognition; generative linguistics provides precisely such a theory. As such, we conclude that generative linguistics as a scientific discipline will remain indispensable throughout the 21st century and beyond.

en cs.CL, cs.AI
S2 Open Access 2022
Grace Andrus de Laguna: A Perspective from the History of Linguistics

Brigitte Nerlich

Grace de Laguna was a philosopher working in the first part of the twentieth century on analytic and speculative philosophy, as well as on the psychology and philosophy of language, especially the social function of language. Joel Katzav’s lead essay focuses mainly on the former part of her work, while my commentary focuses mostly on the latter. Katzav shows how her work played a role in the development of analytic philosophy; I try to show how it played a role in the development of pragmatic linguistics, whose development overlapped with some aspects of analytic philosophy. I situate it in the context of an emerging ‘pragmatic Weltanschauung’ that was embraced by many working in linguistics, philology, philosophy, psychology, anthropology, and what would later be called linguistic pragmatics.

arXiv Open Access 2022
Linguistic Approach to Time Series Forecasting

Dmytro Lande, Volodymyr Yuzefovych, Yevheniia Tsybulska

This paper proposes methods for predicting dynamic time series (including non-stationary ones) based on a linguistic approach, namely, the study of occurrences and repetitions of so-called N-grams. This approach is used in computational linguistics to create statistical translators and to detect plagiarism and duplicate documents. However, its scope of application can be extended beyond linguistics by taking into account the correlations of sequences of stable word combinations, as well as trends. The proposed methods do not require a preliminary study of the characteristics of the time series or complex tuning of the input parameters of the forecasting model. They allow short-term and medium-term forecasts of time series characterized by trends and cyclicality, in particular series of publication dynamics in content monitoring systems, to be carried out with a high level of automation. The proposed methods can also be used to predict the values of the parameters of a large complex system for the purpose of monitoring its state, when the number of such parameters is significant and a high level of automation of the forecasting process is therefore desirable. A significant advantage of the approach is the absence of stationarity requirements on the time series and the small number of tuning parameters. Further research may focus on the study of various criteria for the similarity of time series fragments, the use of nonlinear similarity criteria, and the search for ways to automatically determine a rational quantization step for the time series.

en math.NA, cs.DM
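The N-gram idea this abstract describes can be illustrated with a toy sketch: quantize the series into a small alphabet of symbols, index which symbol follows each symbol N-gram, and forecast by looking up the most frequent successor of the most recent N-gram. The names and the equal-width quantization scheme below are illustrative assumptions, not the authors' implementation:

```python
from collections import Counter, defaultdict

def quantize(series, n_levels=4):
    """Map real values onto n_levels symbols using equal-width bins (an assumption)."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_levels or 1.0   # guard against a constant series
    return [min(int((x - lo) / width), n_levels - 1) for x in series]

def build_ngram_index(symbols, n=3):
    """For each N-gram of symbols, count which symbol followed it."""
    index = defaultdict(Counter)
    for i in range(len(symbols) - n):
        index[tuple(symbols[i:i + n])][symbols[i + n]] += 1
    return index

def forecast_symbol(symbols, index, n=3):
    """Predict the next symbol as the most frequent successor of the last N-gram."""
    gram = tuple(symbols[-n:])
    if gram not in index:
        return symbols[-1]                # fall back to persistence
    return index[gram].most_common(1)[0][0]

# A cyclic toy series: the method should pick up the repeating pattern.
series = [1, 5, 9, 5] * 10
symbols = quantize(series)               # cycles through 0, 2, 3, 2
index = build_ngram_index(symbols)
print(forecast_symbol(symbols, index))   # predicts 0, continuing the cycle
```

A real forecaster along these lines would also map the predicted symbol back to a numeric value (e.g. the bin midpoint) and handle trends, which this sketch omits.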
S2 Open Access 2022
Current Topics of Research in Comparative Philology in Contemporary Kazakhstan

Moldir Alshaynbaeva, Darina Amanbekova, M. Balabekova

The topicality and importance of the problems associated with comparative linguistics constantly draw the attention of linguists to the study of the lexicon and concepts of foreign-language origin. This article deals with current topics of doctoral dissertation research in modern Kazakhstan. The relevance of this article lies in the possibility of using the information it contains for further research on linguistic convergence, the integration and adaptation of foreign words, the conceptual fields of the “richness” and “poverty” concepts, and the study of sacral concepts in comparative linguistics.

DOAJ Open Access 2021
Phonological units for phonological change: synchrony shall provide them

Dmitrii Zelenskii

The question of what types of units and domains are needed in order to capture phonological change is a reasonable one to ask. To answer this question, however, we first need to properly define how we understand phonological change, and the definition that we adopt for that clearly depends on the phonological framework that is assumed. I consider several influential frameworks here and then come to the conclusion that the same condition holds for all of them: change can only be described in terms of the same units (and domains) as are used for synchronic description. This leads to the following conclusion: the set of units for phonological change is a subset of the set of units that are needed for synchronic phonological description. However, it is also unlikely that some units needed for synchronic description can be fully ignored for all descriptions of changes, which leads us to the conclusion that the set of units that are needed for phonological change is also a superset of that set. The sets are thus equal: the phonological units needed for synchronic description are the units needed to account for phonological change, and the question above is meaningless.

Philology. Linguistics
arXiv Open Access 2021
The Rediscovery Hypothesis: Language Models Need to Meet Linguistics

Vassilina Nikoulina, Maxat Tezekbayev, Nuradil Kozhakhmet et al.

There is an ongoing debate in the NLP community whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the \textit{rediscovery hypothesis}. In the first place, we show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates language modeling objectives with linguistic information. This framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with various experiments, both on synthetic and on real NLP tasks in English.
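The probing setup mentioned in this abstract is commonly realized as a simple linear classifier trained on frozen representations. A minimal sketch with synthetic vectors standing in for model activations; the linear encoding of the label is an assumption made so the toy example is learnable, and the probe here is plain least squares rather than any specific probe from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for frozen model representations: a "linguistic" direction
# w_true is linearly encoded in the vectors (an assumption for this toy example).
d, n = 16, 200
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ w_true > 0).astype(float)        # binary label, e.g. singular vs plural

# Probe: ordinary least squares on the frozen features (no model fine-tuning),
# fit against +/-1 targets so the sign of the output gives the class.
w_probe, *_ = np.linalg.lstsq(X, 2 * y - 1, rcond=None)

# Probing accuracy on held-out synthetic data.
X_test = rng.normal(size=(100, d))
y_test = (X_test @ w_true > 0).astype(float)
pred = (X_test @ w_probe > 0).astype(float)
accuracy = float((pred == y_test).mean())
print(round(accuracy, 2))
```

High probe accuracy on frozen features is the operational sense in which a representation is said to "contain" a piece of linguistic knowledge, which is what the rediscovery hypothesis tests on compressed models.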

Page 6 of 39689