Results for "Language. Linguistic theory. Comparative grammar"

Showing 20 of ~4,431,771 results · from DOAJ, arXiv, Semantic Scholar, CrossRef

arXiv Open Access 2026
X-Blocks: Linguistic Building Blocks of Natural Language Explanations for Automated Vehicles

Ashkan Y. Zadeh, Xiaomeng Li, Andry Rakotonirainy et al.

Natural language explanations play a critical role in establishing trust and acceptance of automated vehicles (AVs), yet existing approaches lack systematic frameworks for analysing how humans linguistically construct driving rationales across diverse scenarios. This paper introduces X-Blocks (eXplanation Blocks), a hierarchical analytical framework that identifies the linguistic building blocks of natural language explanations for AVs at three levels: context, syntax, and lexicon. At the context level, we propose RACE (Reasoning-Aligned Classification of Explanations), a multi-LLM ensemble framework that combines Chain-of-Thought reasoning with self-consistency mechanisms to robustly classify explanations into 32 scenario-aware categories. Applied to human-authored explanations from the Berkeley DeepDrive-X dataset, RACE achieves 91.45 percent accuracy and a Cohen's kappa of 0.91 on cases with human annotator agreement, indicating near-human reliability for context classification. At the lexical level, log-odds analysis with informative Dirichlet priors reveals context-specific vocabulary patterns that distinguish driving scenarios. At the syntactic level, dependency parsing and template extraction show that explanations draw from a limited repertoire of reusable grammar families, with systematic variation in predicate types and causal constructions across contexts. The X-Blocks framework is dataset-agnostic and task-independent, offering broad applicability to other automated driving datasets and safety-critical domains. Overall, our findings provide evidence-based linguistic design principles for generating scenario-aware explanations that support transparency, user trust, and cognitive accessibility in automated driving systems.
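
The lexical analysis mentioned above (log-odds with informative Dirichlet priors) is a well-known contrastive-vocabulary estimator and can be sketched in a few lines. This is a generic version of the estimator, not the authors' code, and the toy word counts for two driving contexts are invented:

```python
import math
from collections import Counter

def log_odds_dirichlet(counts_a, counts_b, prior, prior_scale=1.0):
    """Z-scored log-odds ratio with an informative Dirichlet prior
    (the estimator popularized by Monroe et al., "Fightin' Words").
    A generic sketch, not the paper's implementation."""
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = prior_scale * sum(prior.values())
    z = {}
    for w in set(counts_a) | set(counts_b):
        aw = prior_scale * prior.get(w, 0.01)  # small floor for unseen words
        ya, yb = counts_a.get(w, 0), counts_b.get(w, 0)
        delta = (math.log((ya + aw) / (n_a + a0 - ya - aw))
                 - math.log((yb + aw) / (n_b + a0 - yb - aw)))
        var = 1.0 / (ya + aw) + 1.0 / (yb + aw)  # approximate variance
        z[w] = delta / math.sqrt(var)
    return z

# Invented toy counts: word frequencies in explanations from two contexts.
ctx_stop = {"brake": 30, "turn": 5, "stop": 10}
ctx_turn = {"brake": 2, "turn": 20, "stop": 10}
prior = dict(Counter(ctx_stop) + Counter(ctx_turn))
scores = log_odds_dirichlet(ctx_stop, ctx_turn, prior)
```

Words over-represented in the first context get positive z-scores, under-represented ones negative; the prior (here, the pooled counts) shrinks estimates for rare words.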

en cs.AI, cs.CL
S2 Open Access 2026
Women’s Language Features in a Patriarchal Context: A Lakoffian Analysis of The Yellow Wallpaper

Dini Fatma Karomah Aljamilah, Oktaviani Ramadhani, Sin Syifa Auliya et al.

This study explores women’s language features in a patriarchal context by analyzing Charlotte Perkins Gilman’s The Yellow Wallpaper through the lens of Robin Lakoff’s (1975) framework. Focusing exclusively on the female narrator, the study examines narrative passages, sentences, and expressions to identify linguistic patterns that reflect social and ideological constraints. Using a qualitative descriptive approach, nine women’s language features were identified: lexical hedges, tag questions, rising intonation, intensifiers, hypercorrect grammar, super‑polite forms, avoidance of strong swear words, precise color terms, and empty adjectives. The findings show that these features serve to convey tentativeness, politeness, emotional expression, and psychological sensitivity, highlighting the narrator’s constrained agency under patriarchal and institutional control, particularly within the context of the rest cure. From a feminist literary perspective, the language simultaneously provides subtle means for resistance and self-expression, revealing how linguistic strategies negotiate power, uncertainty, and emotion. The study underscores the value of combining Lakoff’s sociolinguistic framework with feminist literary theory to examine how language reflects and contests gendered authority in literature. Limitations include the focus on a single narrator and one literary work, suggesting the need for comparative and corpus-based studies across genres and historical contexts. Future research could explore the relationship between women’s language features, narrative perspective, and character agency to further understand how literary language constructs female subjectivity and ideological critique.

arXiv Open Access 2025
Evaluating Polish linguistic and cultural competency in large language models

Sławomir Dadas, Małgorzata Grębowiec, Michał Perełkiewicz et al.

Large language models (LLMs) are becoming increasingly proficient in processing and generating multilingual texts, which allows them to address real-world problems more effectively. However, language understanding is a far more complex issue that goes beyond simple text analysis. It requires familiarity with cultural context, including references to everyday life, historical events, traditions, folklore, literature, and pop culture. A lack of such knowledge can lead to misinterpretations and subtle, hard-to-detect errors. To examine language models' knowledge of the Polish cultural context, we introduce the Polish linguistic and cultural competency benchmark, consisting of 600 manually crafted questions. The benchmark is divided into six categories: history, geography, culture & tradition, art & entertainment, grammar, and vocabulary. As part of our study, we conduct an extensive evaluation involving over 30 open-weight and commercial LLMs. Our experiments provide a new perspective on Polish competencies in language models, moving past traditional natural language processing tasks and general knowledge assessment.
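
A benchmark split into categories like these is typically scored with a per-category accuracy loop. The sketch below assumes a hypothetical `category`/`question`/`answer` record schema, which is not the benchmark's published format:

```python
from collections import defaultdict

def per_category_accuracy(items, predict):
    """Score a model separately on each benchmark category (e.g. history,
    geography, grammar). Field names are illustrative assumptions."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item["category"]] += 1
        if predict(item["question"]).strip().lower() == item["answer"].strip().lower():
            correct[item["category"]] += 1
    return {c: correct[c] / total[c] for c in total}
```

Reporting per-category rather than overall accuracy is what lets a study like this separate, say, strong grammar competence from weak folklore knowledge.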

en cs.CL
arXiv Open Access 2025
Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal

Ambalika Guha, Sajal Saha, Debanjan Ballav et al.

Preserving linguistic diversity is necessary as every language offers a distinct perspective on the world. There have been numerous global initiatives to preserve endangered languages through documentation. This paper is a part of a project which aims to develop a trilingual (Toto-Bangla-English) language learning application to digitally archive and promote the endangered Toto language of West Bengal, India. This application, designed for both native Toto speakers and non-native learners, aims to revitalize the language by ensuring accessibility and usability through Unicode script integration and a structured language corpus. The research includes detailed linguistic documentation collected via fieldwork, followed by the creation of a morpheme-tagged, trilingual corpus used to train a Small Language Model (SLM) and a Transformer-based translation engine. The analysis covers inflectional morphology such as person-number-gender agreement, tense-aspect-mood distinctions, and case marking, alongside derivational strategies that reflect word-class changes. Script standardization and digital literacy tools were also developed to enhance script usage. The study offers a sustainable model for preserving endangered languages by incorporating traditional linguistic methodology with AI. This bridging of linguistic research and technological innovation highlights the value of interdisciplinary collaboration for community-based language revitalization.
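
A morpheme-tagged, trilingual corpus record of the kind described could take roughly the following shape. The schema and all segment strings below are hypothetical placeholders, not the project's actual data model or real Toto forms:

```python
def gloss_line(entry):
    """Build an interlinear gloss line by joining per-morpheme glosses.
    The record schema is a hypothetical illustration."""
    return "-".join(m["gloss"] for m in entry["morphemes"])

entry = {
    "toto_form": "ROOT-TAM-AGR",           # placeholder segmentation, not a real Toto word
    "morphemes": [
        {"seg": "ROOT", "gloss": "eat"},   # lexical root (invented)
        {"seg": "TAM", "gloss": "PST"},    # tense-aspect-mood marker
        {"seg": "AGR", "gloss": "3SG"},    # person-number agreement
    ],
    "bangla": "(Bangla translation)",
    "english": "he/she ate",
}
```

Tagging at the morpheme level, rather than the word level, is what lets such a corpus feed both linguistic analysis (agreement, TAM, case) and model training from the same records.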

en cs.CL, cs.AI
arXiv Open Access 2025
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases

Michael Y. Hu, Jackson Petty, Chuan Shi et al.

Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. We experiment with pre-pretraining (training on formal language before natural languages) on transformers and find that formal languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language and better linguistic generalization compared to other formal languages. We also find modest support for the hypothesis that the formal language should fall within the computational limitations of the architecture. Strikingly, pre-pretraining reduces loss more efficiently than training on a matched amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. Finally, we also give mechanistic evidence of transfer from formal to natural language: attention heads acquired during pre-pretraining remain crucial for the model's performance on syntactic evaluations.
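
The "formal language capturing hierarchical dependencies" hypothesized to transfer is typified by Dyck languages (properly nested brackets). A minimal corpus generator for such strings, written as an illustration of the idea rather than the paper's actual data pipeline, might look like:

```python
import random

PAIRS = [("(", ")"), ("[", "]"), ("{", "}")]

def dyck_string(max_open, rng, p_open=0.5):
    """Sample a string from a Dyck language: every opened bracket is
    closed in properly nested order, giving the hierarchical,
    long-range dependencies the hypothesis cares about."""
    out, stack, opened = [], [], 0
    while opened < max_open or stack:
        if opened < max_open and (not stack or rng.random() < p_open):
            o, c = rng.choice(PAIRS)
            out.append(o)
            stack.append(c)      # remember which closer this open demands
            opened += 1
        else:
            out.append(stack.pop())  # close the innermost open dependency
    return "".join(out)
```

Each string is balanced by construction, so a model pre-pretrained on such data must track a stack-like state to predict the next token, which is the inductive bias the paper's hypothesis attributes to effective transfer.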

en cs.CL, cs.AI
arXiv Open Access 2025
Detecting Linguistic Indicators for Stereotype Assessment with Large Language Models

Rebekka Görge, Michael Mock, Héctor Allende-Cid

Social categories and stereotypes are embedded in language and can introduce data bias into Large Language Models (LLMs). Despite safeguards, these biases often persist in model behavior, potentially leading to representational harm in outputs. While sociolinguistic research provides valuable insights into the formation of stereotypes, NLP approaches for stereotype detection rarely draw on this foundation and often lack objectivity, precision, and interpretability. To fill this gap, in this work we propose a new approach that detects and quantifies the linguistic indicators of stereotypes in a sentence. We derive linguistic indicators from the Social Category and Stereotype Communication (SCSC) framework which indicate strong social category formulation and stereotyping in language, and use them to build a categorization scheme. To automate this approach, we instruct different LLMs using in-context learning to apply the approach to a sentence, where the LLM examines the linguistic properties and provides a basis for a fine-grained assessment. Based on an empirical evaluation of the importance of different linguistic indicators, we learn a scoring function that measures the linguistic indicators of a stereotype. Our annotations of stereotyped sentences show that these indicators are present in these sentences and explain the strength of a stereotype. In terms of model performance, our results show that the models generally perform well in detecting and classifying linguistic indicators of category labels used to denote a category, but sometimes struggle to correctly evaluate the associated behaviors and characteristics. Using more few-shot examples within the prompts significantly improves performance. Model performance increases with size, as Llama-3.3-70B-Instruct and GPT-4 achieve comparable results that surpass those of Mixtral-8x7B-Instruct, GPT-4-mini and Llama-3.1-8B-Instruct.
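
A learned scoring function over binary linguistic indicators can be sketched as a weighted sum squashed to [0, 1]. The indicator names, weights, and bias below are invented placeholders; the paper derives its indicators from the SCSC framework and learns the weights empirically:

```python
import math

# Hypothetical indicator names and weights (placeholders, not the paper's).
WEIGHTS = {"generic_label": 1.2, "generalizing_quantifier": 0.8,
           "essentializing_verb": 1.0, "intensifier": 0.5}
BIAS = -1.5  # placeholder offset

def stereotype_score(indicators):
    """Map detected binary linguistic indicators to a [0, 1] stereotype
    strength via a weighted sum and a logistic squash: a sketch of the
    general shape of such a scoring function."""
    z = BIAS + sum(w for name, w in WEIGHTS.items() if indicators.get(name))
    return 1 / (1 + math.exp(-z))
```

The point of the linear-plus-logistic shape is interpretability: each indicator's contribution to the final score is directly readable from its weight.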

en cs.CL, cs.AI
S2 Open Access 2025
Investigating Valency in Causative Verb Derivational Mechanisms: The Case of the Oromo Language

Ayub Ismael Jarso

This paper examines verb derivation in Oromo, focusing on how causative morphological suffixes such as -s-, -sis-, and -sisiis- alter verb valency and argument structure. The study investigates whether verbs derived through these causative morphemes behave differently from non-derived verbs, especially in subsequent valency-changing operations. It also explores the interaction of causative morphology with inchoative and ideophonic roots. A qualitative descriptive methodology was employed, combining primary data from native Oromo speakers with secondary sources such as grammars and linguistic corpora. Verbs were categorized by inherent valency and analyzed using a morphosyntactic framework grounded in valency theory and derivational morphology. Findings reveal that the suffix -s- in the CS1 pattern marks direct causation and agentivization, and increases valency by introducing an agentive subject. The suffix -si(i)s- in the CS2 pattern encodes indirect causation, adding intermediate agents and expanding argument roles. The suffix -sisiis- in the CS3 pattern represents causative stacking, where the subject functions as an effector, producing multi-agent constructions. The study reinterprets geminated forms like -ess- as composites of inchoative and causative morphemes, shaped by morphophonemic processes such as palatalization and vowel harmony. The research proposes a valency code system ([1A] through [5E]) to map argument expansion across derivational layers. Comparative insights from other Cushitic languages highlight both shared phonological processes and divergent morphological strategies. This study contributes to morphosyntactic typology by clarifying the functional distinctions among causative suffixes, demonstrating how morphological stacking reshapes clause architecture, and offering a framework for analyzing valency manipulation in Afroasiatic languages.

S2 Open Access 2025
A Study on the Semantic Expansion Model of the Korean Expression ‘-eo beorida’ and Learner Usage Patterns : Focusing on Metonymy Theory

Lin Tong, Jaewook Kim

This study aims to systematically analyze the semantic expansion process of the Korean expression ‘-eo beorida’ based on metonymy theory, propose a model for this expansion, and examine learners’ usage patterns in detail. ‘-eo beorida’ is an intermediate-level expression with complex meanings that reflect not only the “end of an action” but also the speaker’s psychological attitude and expectations. Existing studies in Korean linguistics have primarily focused on the function and meaning of this expression but have not addressed the cognitive mechanisms by which learners discern multiple meanings or the semantic expansion process during grammar acquisition. Consequently, when confronted with complex grammatical expressions that carry multiple meanings, learners often fail to understand them accurately, leading to errors or a tendency to avoid using the expression. To address these limitations, this study applies metonymy theory from a cognitive linguistic perspective to elucidate the semantic expansion process of ‘-eo beorida’ and investigates learners’ tendencies in semantic acquisition by integrating the results of learner corpus analysis with the proposed model. Furthermore, by analyzing the characteristics of Chinese learners through a comparative study of the entire learner group and the Chinese learner group, the study identifies meaning acquisition patterns specific to learners from a particular language background and provides foundational data for future research. (Hankuk University of Foreign Studies)

S2 Open Access 2024
EXPLORING THE COGNITIVE DIMENSIONS OF LANGUAGE ACQUISITION

Astghik Chubaryan, Mariam Vardanyan

Cognitive Linguistics, as an expanding discipline in language study, delves into language as a mental phenomenon, presenting a holistic approach that integrates linguistic and cognitive elements. Unlike traditional linguistics, which scrutinizes syntactical patterns, word structures, grammar rules, phonology, semantics, and lexical meanings, Cognitive Linguistics offers a comprehensive model that interweaves various closely connected theories. This study seeks to elucidate the manifestations of Cognitive Linguistics in language, providing scholars with a discerning analysis of existing theories. Furthermore, employing methods of descriptive and comparative analysis, this study aims to demonstrate the impact of Cognitive Linguistics on language processing and acquisition. It highlights its significance in second language acquisition while elucidating the underlying mental processes involved. The paper offers a succinct overview of this evolving discipline, encapsulating its unique characteristics within the encompassing term Cognitive Linguistics.

1 citation · en
DOAJ Open Access 2024
Interaction between Morphology and Syntax in the Language of Science: A Case Study in Noun Incorporation and Phrase Formation

Mohammadreza Razavi, Marziyeh Allahyari

Abstract
The interface between morphology and syntax has been a disputed issue among linguists. The main goal of this research is to investigate the interface between morphology and syntax and demonstrate the interaction between these two structural modules of language. Therefore, we have studied two productive word-formation processes in the language of science. One of them is the result of the module of morphology, but we can assign a syntactic substructure to it. The other is itself a syntactic entity, but it shows some morphological behaviors. These two processes are, respectively, noun incorporation and the formation of syntactic phrases. For this purpose, 500 terms from the first dictionary of a collection of terms approved by the Academy of Persian Language and Literature, and 500 terms from the sixteenth dictionary of the collection, have been chosen randomly and studied in this research. The main framework of this research is the interactional approach. We have also utilized additional tools from other theoretical approaches, such as constructional schemas in construction morphology. The research examines the definitions of incorporating nouns in the dictionaries to confirm their syntactic structure. On the other hand, the loss of some syntactic markers in syntactic phrases demonstrates the lexicalization of these structures. This study showed that incorporating nouns had a greater diversity of constructional schemas than syntactic phrases. However, the productivity of syntactic phrases was higher in comparison to incorporating nouns. Furthermore, the usage of syntactic phrases had increased over the ten-year period. As a result, the overall trend was ascending.

Keywords: morphology, syntax, language of science, interactional approach, construction morphology, noun incorporation, syntactic phrase.

Introduction
The most generative module of language is syntax. This statement encouraged the author to find out whether this competence of language is also profitable in the word formation of the language of science, which is one of the registers of language and could have its own word-formation rules. On the other hand, the formation of compound words is the process most similar to a syntactic process (Giegerich, 2009: 319). In particular, one of the most productive processes for making compounds in Farsi is "incorporation". Noun incorporation in Farsi is the first type of incorporation according to Mithun (1984, 1986), in which we observe two corresponding syntactic constructions. The existence of this syntactic sub-construction shows that this word-formation process is a good example of the interaction between morphology and syntax. Noun incorporation, however, is not the only construction with a syntactic side: the formation of a syntactic phrase is another process that is syntactic in itself, but in some cases these phrases lose syntactic elements, such as the "ezâfe" examined in this research. This means that syntactic phrases also show some morphological behaviors and can, as a result, be taken as another sign of the interaction between morphology and syntax. In this research, we have chosen 500 words from the first dictionary of a collection of words approved by the Academy of Persian Language and Literature, and 500 words from the sixteenth dictionary. We will find out whether the definitions of incorporating nouns illustrate their syntactic construction or not. In the second phase, we will determine the syntactic and semantic roles of each constituent in incorporating nouns and syntactic phrases in order to understand which syntactic and semantic roles have been used more. For this purpose, constructional schemas were helpful tools in this research. It should be mentioned that our main aim is to demonstrate the interaction between morphology and syntax, but we have not deprived ourselves of certain semantic surveys.

Literature Review
The morphology-syntax interface has been a disputed issue in linguistics. The autonomy of morphology has led to two different approaches. The lexicalist approach is a term that has been used in generative theories (O'Neil, 2016: 242). According to this approach, the output of morphology is the input of syntax; syntax, moreover, does not have any access to the internal structure of words and cannot change their meaning (Chomsky, 1970: 240). In the other approach, called distributed morphology, there is no coherent lexicon as in generative grammar: the only generative component of language is syntax, which produces words, phrases, and sentences (Anoushe, 2021: 616). Noun incorporation is perhaps the most nearly syntactic of all morphological processes (Mithun, 1984: 847). According to Kroeber's definition (1909: 541), "Incorporation is a process which produces a word from a combination of direct object and verb. This new word has the role of predicate in sentences," while Sapir (1911: 255) rejects this twofold syntactic-morphological perception. Mithun (1984) considers noun incorporation a morphological structure. Rosen (1989) holds the same view and treats noun incorporation as a morphological process that occurs in the lexicon. Baker (1988), however, takes a syntactic view of incorporation. Thus, the exact place of noun incorporation as a morphological or syntactic process is still a controversial issue among linguists.

Methodology
The interactional approach to language is a new approach that forms the main framework of this research. In this approach, we believe in independent modules of language, but we also posit an interactional relation between them (Dabir-Moghaddam, 2020: 41). This is one of the most important principles of the approach, and it helps illustrate the dimensions of interaction between the modules of morphology and syntax in this research. The interactional approach holds that there is a syntactic construction for each incorporating noun: an abstract template for producing these nominal compounds which, whenever necessary, is transformed into that syntactic construction (Dabir-Moghaddam, 2020: 358-359). As mentioned before, we have also carried out certain semantic analyses in this research. For this purpose, constructional schemas in construction morphology are beneficial tools. The word "construction" is a concept that can be used both at the level of word formation and at that of the syntactic phrase (Booij, 2010a: 1). One of the main word-formation processes is compounding, which is considered a construction. The constructional schema for this structure is shown here:

(1) [[a]Xk [b]Ni]j ↔ [SEMi with relation R to SEMk]j

The double arrow symbolizes the relationship between a particular form and a particular meaning. The variable X stands for the major lexical categories (N, V, A, and P). The variables "a" and "b" in this schema stand for arbitrary sound sequences. The variables "i", "j", and "k" stand for the lexical indexes on the phonological, syntactic, and semantic (SEM) properties of words. In (1), the meaning contribution of the compound schema is specified, as morphology deals with the correlation between form and meaning in sets of complex words. For instance, the substitution for one of the incorporating nouns in this corpus is:

(2) [[hobab]N(DIR-OBJ)k [saz]Vi]j ↔ [a tool with which the action SEMi is done on the patient SEMk]j

As this constructional schema makes clear, the first constituent of this construction is the direct object of the syntactic construction of the incorporating noun "hobabsaz" [bubble-maker]; this constituent furthermore has the semantic role of patient in the supposed syntactic construction. "R" depicts a relation in which a tool performs an action on a patient.

Conclusion
The research conducted in this essay examines the definitions used to describe incorporating nouns in dictionaries in order to confirm their syntactic construction. On the other hand, the loss of some syntactic markers in syntactic phrases demonstrates the lexicalization of these structures. The most common syntactic role of an incorporated constituent in incorporating nouns was the object. In addition, the syntactic and semantic roles of most of the constituents used with predicate nouns in syntactic phrases were, respectively, "adjunct" and "manner". Incorporating nouns had a greater diversity of constructional schemas than syntactic phrases. However, the productivity of syntactic phrases was higher. Furthermore, the usage of syntactic phrases has increased over the ten-year period from the publication of the first dictionary up to the sixteenth. As a result, the overall trend was ascending.

Acknowledgments
This essay is inspired by my master's degree dissertation. I should therefore thank my professor, Doctor Mohammad Dabir-Moghaddam, who supervised the dissertation; I was not able to write even one word without his worthwhile advice. Secondly, I thank Doctor Mohammad Reza Razavi, who has always had profitable comments toward making this a readable essay.

Language and Literature, Language. Linguistic theory. Comparative grammar
DOAJ Open Access 2024
Certainty Stance Adverbs in Chinese Linguistic Academic Writing: A Corpus-based Study

Natalia M. Dugalich, Hao Han

This study delves into the intricate usage of certainty stance adverbs in the academic writing of Chinese scholars. The object of this study is to conduct a contrastive analysis of certainty stance adverbs between Chinese linguistic MA novices and linguistic experts. The subject of this study is to explore the similarities and differences in the use of these adverbs within the academic discourse of the two groups mentioned. In addressing these disparities, the study seeks to analyze them from the perspective of interlanguage and interpersonal function, with the ultimate goal of enhancing the production of high-quality academic papers by Chinese scholars. The study employs two key methods: the contrastive analysis method and the corpus-based method. The novelty of the study lies in its contrastive examination of certainty stance adverbs between Chinese linguistic MA novices and experts. Through an in-depth analysis of theoretical frameworks and linguistic data extracted from the MA theses of Chinese linguistic novices and research articles of linguistic experts, the study underscores the importance of considering the similarities and differences in the use of certainty stance adverbs within the realm of second language acquisition and pragmatic studies. This consideration aims to refine learners' syntactical and pragmatic command of certainty stance adverbs to align with expert academic discourse, ultimately fostering effective interpersonal communication in academic writing. The findings of this research offer valuable insights into the specific linguistic challenges encountered by Chinese scholars, thereby laying a solid groundwork for the development of targeted pedagogical strategies to bolster their academic writing skills.
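
The basic quantity such a corpus contrast compares is the normalized frequency of each target adverb in the two corpora. The adverb list below is a partial, illustrative stand-in for the study's actual inventory:

```python
from collections import Counter

# A partial, illustrative adverb list; the study's actual inventory
# comes from its theoretical framework, not from this sketch.
CERTAINTY_ADVERBS = {"clearly", "obviously", "certainly", "undoubtedly", "definitely"}

def rates_per_10k(tokens, targets=frozenset(CERTAINTY_ADVERBS)):
    """Frequency per 10,000 tokens of certainty stance adverbs in a
    tokenized corpus: the unit a novice-vs-expert contrast compares."""
    counts = Counter(t.lower() for t in tokens)
    n = len(tokens)
    return {w: counts[w] * 10000 / n for w in targets}
```

Normalizing per 10,000 tokens makes the novice and expert corpora comparable even though the MA theses and research articles differ greatly in length.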

Language. Linguistic theory. Comparative grammar, Semantics
arXiv Open Access 2024
Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models

Sina Bagheri Nezhad, Ameeta Agrawal, Rhitabrat Pokharel

Multilingual language models (MLLMs) are crucial for handling text across various languages, yet they often show performance disparities due to differences in resource availability and linguistic characteristics. While the impact of pre-train data percentage and model size on performance is well-known, our study reveals additional critical factors that significantly influence MLLM effectiveness. Analyzing a wide range of features, including geographical, linguistic, and resource-related aspects, we focus on the SIB-200 dataset for classification and the Flores-200 dataset for machine translation, using regression models and SHAP values across 204 languages. Our findings identify token similarity and country similarity as pivotal factors, alongside pre-train data and model size, in enhancing model performance. Token similarity facilitates cross-lingual transfer, while country similarity highlights the importance of shared cultural and linguistic contexts. These insights offer valuable guidance for developing more equitable and effective multilingual language models, particularly for underrepresented languages.
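
The regression step in an analysis like this relates per-language features to a performance score. The paper pairs regression models with SHAP values; the minimal sketch below only fits an ordinary least-squares model and reads its coefficients, with invented toy data:

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares relating per-language features (e.g.
    pre-train data share, token similarity) to a performance score.
    A sketch of the regression step only, not the SHAP analysis."""
    Xb = np.column_stack([X, np.ones(len(X))])  # append intercept column
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return beta[:-1], beta[-1]  # (feature coefficients, intercept)
```

With 204 languages as rows and features as columns, the signs and magnitudes of the coefficients give a first-order read on which factors drive performance before a SHAP decomposition refines the picture.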

en cs.CL, cs.AI
arXiv Open Access 2024
Holmes: A Benchmark to Assess the Linguistic Competence of Language Models

Andreas Waldis, Yotam Perlitz, Leshem Choshen et al.

We introduce Holmes, a new benchmark designed to assess language models' (LMs') linguistic competence: their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs' internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. In composing Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version that reduces the computation load while maintaining high-ranking precision.
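
Classifier-based probing trains a small classifier on frozen model representations and reads competence off its accuracy. The from-scratch softmax-regression probe below illustrates the setup on synthetic data; it is not the benchmark's code:

```python
import numpy as np

def train_probe(H, y, num_classes, lr=0.5, steps=300):
    """Softmax-regression probe on frozen representations H (n x d):
    the classifier-based probing setup, sketched with plain gradient
    descent on synthetic features."""
    n, d = H.shape
    W, b = np.zeros((d, num_classes)), np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot targets
    for _ in range(steps):
        logits = H @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # cross-entropy gradient w.r.t. logits
        W -= lr * (H.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def probe_accuracy(H, y, W, b):
    return float(((H @ W + b).argmax(axis=1) == y).mean())
```

Keeping the probe linear is the crucial design choice: if a linear map can recover, say, part-of-speech labels from the frozen representations, the information is plausibly encoded by the model rather than computed by the probe.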

en cs.CL
arXiv Open Access 2024
Perceptions of Linguistic Uncertainty by Language Models and Humans

Catarina G Belem, Markelle Kelly, Mark Steyvers et al.

Uncertainty expressions such as "probably" or "highly unlikely" are pervasive in human language. While prior work has established that there is population-level agreement in terms of how humans quantitatively interpret these expressions, there has been little inquiry into the abilities of language models in the same context. In this paper, we investigate how language models map linguistic expressions of uncertainty to numerical responses. Our approach assesses whether language models can employ theory of mind in this setting: understanding the uncertainty of another agent about a particular statement, independently of the model's own certainty about that statement. We find that 7 out of 10 models are able to map uncertainty expressions to probabilistic responses in a human-like manner. However, we observe systematically different behavior depending on whether a statement is actually true or false. This sensitivity indicates that language models are substantially more susceptible to bias based on their prior knowledge (as compared to humans). These findings raise important questions and have broad implications for human-AI and AI-AI communication.
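
The core comparison, a model's word-to-probability mapping against human population means, reduces to a simple error statistic. The human means below are invented round numbers for illustration, not the paper's measured values:

```python
# Illustrative human population means for uncertainty words
# (invented placeholders, not the paper's data).
HUMAN_MEANS = {"almost certain": 0.92, "probably": 0.70, "unlikely": 0.20}

def mean_abs_error(model_map, human_map=None):
    """Mean absolute gap between a model's numerical interpretation of
    uncertainty expressions and human population means, over the
    expressions both mappings cover."""
    human_map = HUMAN_MEANS if human_map is None else human_map
    shared = model_map.keys() & human_map.keys()
    return sum(abs(model_map[w] - human_map[w]) for w in shared) / len(shared)
```

A low error on this statistic is what "human-like mapping" means operationally; the paper's true/false sensitivity finding corresponds to this error changing depending on the statement the expression is attached to.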

en cs.CL, cs.AI
arXiv Open Access 2024
Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models

Kushal Tatariya, Vladimir Araujo, Thomas Bauwens et al.

Pixel-based language models have emerged as a compelling alternative to subword-based language modelling, particularly because they can represent virtually any script. PIXEL, a canonical example of such a model, is a vision transformer that has been pre-trained on rendered text. While PIXEL has shown promising cross-script transfer abilities and robustness to orthographic perturbations, it falls short of outperforming monolingual subword counterparts like BERT in most other contexts. This discrepancy raises questions about the amount of linguistic knowledge learnt by these models and whether their performance in language tasks stems more from their visual capabilities than their linguistic ones. To explore this, we probe PIXEL using a variety of linguistic and visual tasks to assess its position on the vision-to-language spectrum. Our findings reveal a substantial gap between the model's visual and linguistic understanding. The lower layers of PIXEL predominantly capture superficial visual features, whereas the higher layers gradually learn more syntactic and semantic abstractions. Additionally, we examine variants of PIXEL trained with different text rendering strategies, discovering that introducing certain orthographic constraints at the input level can facilitate earlier learning of surface-level features. With this study, we hope to provide insights that aid the further development of pixel-based language models.

en cs.CL
S2 Open Access 2024
O2 is a multiple context-free grammar: an implementation-, formalisation-friendly proof

M. Caminati

Classifying formal languages according to the expressiveness of the grammars able to generate them is a fundamental problem in computational linguistics and, therefore, in the theory of computation. Furthermore, this kind of analysis can give insight into the classification of abstract algebraic structures such as groups, for example through the correspondence given by the word problem. While many such classification problems remain open, others have been settled. Recently, it was proved that $n$-balanced languages (i.e., those whose strings contain the same number of occurrences of the letters $a_i$ and $A_i$ for $1\leq i \leq n$) can be generated by multiple context-free grammars (MCFGs), one of several slight extensions of context-free grammars added to the classical Chomsky hierarchy to make the aforementioned classification more precise. This paper analyses the existing proofs from the computational and proof-theoretical points of view, systematically studying whether each proof can lead to a verified (i.e., checked by a proof assistant) algorithm parsing balanced languages via MCFGs. We conclude that none of the existing proofs is realistically suitable for this practical goal, and proceed to provide a radically new, elementary, extremely short proof for the crucial case $n \leq 2$. A comparative analysis with respect to the existing proofs is finally performed to justify why the proposed proof is a substantial step towards concretely obtaining a verified parsing algorithm for $O_2$.
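The $n$-balanced condition itself is easy to check directly, which may clarify what the grammars in the abstract are asked to generate. This sketch assumes the encoding $a_1, A_1, a_2, A_2 \mapsto$ `a`, `A`, `b`, `B` for the case $n = 2$; it only tests the letter-count membership condition, not MCFG generation or parsing.

```python
# Sketch: membership test for the n-balanced condition (n <= 2),
# assuming letters a_1, A_1, a_2, A_2 are written a, A, b, B.
from collections import Counter

def is_balanced(s, n=2):
    """True iff each letter a_i occurs in s as often as its partner A_i."""
    pairs = [("a", "A"), ("b", "B")][:n]
    counts = Counter(s)
    return all(counts[lower] == counts[upper] for lower, upper in pairs)

print(is_balanced("abAB"))  # → True  (one a/A pair, one b/B pair)
print(is_balanced("aabA"))  # → False (two a's but only one A)
```

Deciding membership is linear-time and trivial; the paper's difficulty lies elsewhere, in exhibiting (and formally verifying) an MCFG that generates exactly this language.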

en Computer Science, Mathematics
S2 Open Access 2024
The Use and Evaluation of Corpus-Based English Language Teaching

J. S, Isaac E. N.

This paper investigates the use and evaluation of corpus-based English grammar teaching. The primary focus is on advancing language education methodologies by integrating corpus linguistics into pedagogical practices. Through a nuanced exploration of qualitative and quantitative dimensions, the research aims to enhance grammar learning outcomes. The qualitative aspect employs purposeful sampling, capturing the perspectives of language learners and instructors immersed in corpus-based teaching. The quantitative phase employs a randomized controlled trial to evaluate the impact of corpus-based teaching on language proficiency. Rigorous pre- and post-assessments, complemented by statistical analysis, illuminate the quantitative gains achieved by learners exposed to this innovative pedagogical approach. A critical component of the study lies in the meticulous selection of the corpus. Criteria such as size, genre diversity, and representativeness guide this process, ensuring a comprehensive examination of linguistic features. The ethical dimensions of the research are upheld through strict adherence to informed consent, participant confidentiality, and privacy standards. The research timeline delineates distinct phases, allowing for systematic progress from corpus selection to data analysis. The theoretical framework underpinning corpus-based language teaching (CBLT) is a complex and dynamic amalgamation of linguistic theories, but in this thesis we use the communicative language teaching (CLT) framework. The research design encompasses a quasi-experimental, mixed-methods approach, combining quantitative language proficiency assessments with qualitative insights gathered through surveys and interviews. The findings are that the use of CBLT in teaching aspects of English grammar is more result-oriented than traditional teaching methods.
This study not only contributes to the theoretical foundations of language education but also provides practical insights for educators and policymakers seeking to optimize language learning strategies.

S2 Open Access 2023
A Comparative Investigation of Sibawayh and Jakobson in Functional Linguistics

Albatool Abalkheel, Maha Sourani

The objective of this research is to connect Arabic theoretical linguistics with modern linguistics, not by disregarding the present and attributing the later theories solely to Arab grammarians and rhetoricians, as has often been assumed. Rather, this research aims to qualitatively explore the linguistic accomplishments of ancient Arab scholars from a contemporary perspective and identify commonalities and controversies between past and present perspectives. Specifically, this study focuses on Sibawayh's interpretation of functional linguistics and its final formulation by Jakobson. Additionally, the study aims to analyse and compare the principles of language and grammar advocated by ancient Arab linguists and contemporary linguists to deepen our understanding of language and its role in human society. The results of this study reveal that while Sibawayh concentrated on the analysis of structure and function in Arabic grammar, modern linguists such as Jakobson have broadened their approach to include a wider range of perspectives including sociolinguistics, psycholinguistics, and neurolinguistics. Therefore, it is essential to acknowledge the contributions of Sibawayh as his work formed the foundation for language theories in general and linguistics in particular.

3 citations en

Page 11 of 221589