Results for "Comparative grammar"

Showing 20 of ~3701920 results · from DOAJ, arXiv, Semantic Scholar, CrossRef

DOAJ Open Access 2026
Gender biases in GPT-4 short biographies.

Anna-Maria De Cesare

As has been shown in various studies considering different languages, in professional contexts women tend to be referred to differently than men. While men are typically referred to by their surname (e. g., Fermi), women are more often referenced with their full name (e. g., Samantha Cristoforetti) or first name alone (e. g., Samantha). The present study proposes an empirical case study investigating whether this gender-indexing bias is also present in texts generated by large language models (LLMs). Based on the analysis of a self-assembled data collection comprising 420 biographies produced by GPT-4 on 140 eminent Italian and French female and male personalities, our study reveals that the synthetic texts investigated not only reflect the gender biases found in human-authored texts but, in some cases, even amplify them.

Computational linguistics. Natural language processing, Language. Linguistic theory. Comparative grammar
arXiv Open Access 2026
Doc2Spec: Synthesizing Formal Programming Specifications from Natural Language via Grammar Induction

Shihao Xia, Mengting He, Haomin Jia et al.

Ensuring that API implementations and usage comply with natural language programming rules is critical for software correctness, security, and reliability. Formal verification can provide strong guarantees but requires precise specifications, which are difficult and costly to write manually. To address this challenge, we present Doc2Spec, a multi-agent framework that uses LLMs to automatically induce a specification grammar from natural-language rules and then generates formal specifications guided by the induced grammar. The grammar captures essential domain knowledge, constrains the specification space, and enforces consistent representations, thereby improving the reliability and quality of generated specifications. Evaluated on seven benchmarks across three programming languages, Doc2Spec outperforms a baseline without grammar induction and achieves competitive results against a technique with a manually crafted grammar, demonstrating the effectiveness of automated grammar induction for formalizing natural-language rules.

en cs.PL, cs.AI
arXiv Open Access 2025
Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?

Qingyuan Liang, Zhao Zhang, Zeyu Sun et al.

Grammar serves as a cornerstone in programming languages and software engineering, providing frameworks to define the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code representations in small-scale models, showing their ability to reduce syntax errors and enhance performance. However, as language models scale to the billion level or beyond, syntax-level errors become rare, making it unclear whether grammar information still provides performance benefits. To explore this, we develop a series of billion-scale GrammarCoder models, incorporating grammar rules in the code generation process. Experiments on HumanEval (+) and MBPP (+) demonstrate a notable improvement in code generation accuracy. Further analysis shows that grammar-based representations enhance LLMs' ability to discern subtle code differences, reducing semantic errors caused by minor variations. These findings suggest that grammar-based code representations remain valuable even in billion-scale models, not only by maintaining syntax correctness but also by improving semantic differentiation.

en cs.PL, cs.AI
arXiv Open Access 2025
Flexible and Efficient Grammar-Constrained Decoding

Kanghee Park, Timothy Zhou, Loris D'Antoni

Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can align with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.

en cs.CL, cs.AI
arXiv Open Access 2025
SAGE: Specification-Aware Grammar Extraction for Automated Test Case Generation with LLMs

Aditi, Hyunwoo Park, Sicheol Sung et al.

Grammar-based test case generation has proven effective for competitive programming problems, but generating valid and general grammars from natural language specifications remains a key challenge, especially under limited supervision. Context-Free Grammars with Counters (CCFGs) have recently been introduced as a formalism to represent such specifications with logical constraints by storing and reusing counter values during derivation. In this work, we explore the use of open-source large language models (LLMs) to induce CCFGs from specifications using a small number of labeled examples and verifiable reward-guided reinforcement learning. Our approach first fine-tunes an open-source LLM to perform specification-to-grammar translation, and further applies Group Relative Policy Optimization (GRPO) to enhance grammar validity and generality. We also examine the effectiveness of iterative feedback for open and closed-source LLMs in correcting syntactic and semantic errors in generated grammars. Experimental results show that our approach SAGE achieves stronger generalization and outperforms 17 open and closed-source LLMs in both grammar quality and test effectiveness, improving over the state-of-the-art by 15.92%p in grammar validity and 12.34%p in test effectiveness. We provide our implementation and dataset at the following anonymous repository: https://anonymous.4open.science/r/SAGE-5714

en cs.CL
DOAJ Open Access 2024
Interaction between Morphology and Syntax in the Language of Science: A Case Study in Noun Incorporation and Phrase Formation

Mohammadreza Razavi, Marziyeh Allahyari

Abstract: The interface between morphology and syntax has been a disputed issue among linguists. The main goal of this research is to investigate the interface between morphology and syntax and to demonstrate the interaction between these two structural modules of language. To that end, we have studied two productive word formation processes in the language of science. One of them is the result of the module of morphology, but we can assign a syntactic substructure to it. The other one is itself a syntactic entity, but it shows some morphological behaviors. These two processes are, respectively, noun incorporation and the formation of syntactic phrases. For this purpose, 500 terms from the first dictionary of a collection of terms approved by the Academy of Persian Language and Literature, and 500 terms from the sixteenth dictionary of that collection, were chosen randomly and studied in this research. The main framework of this research is the interactional approach. We have also utilized additional tools from other theoretical approaches, such as constructional schemas in construction morphology. The research conducted in this essay shows that the definitions of incorporating nouns in the dictionaries confirm their syntactic structure. On the other hand, the loss of some syntactic markers in syntactic phrases demonstrates the lexicalization of these structures. This study showed that incorporating nouns had a greater diversity of constructional schemas than syntactic phrases. However, the productivity of syntactic phrases was higher in comparison to incorporating nouns. Furthermore, the usage of syntactic phrases had increased in the ten-year period. As a result, the overall trend was ascending.
Keywords: morphology, syntax, language of science, interactional approach, constructional morphology, noun incorporation, syntactic phrase.
Introduction: The most generative module of language is syntax.
This statement prompted the author to ask whether this capacity of language is useful in word formation in the language of science, which is one of the registers of language and could have its own word formation rules. On the other hand, the formation of compound words is the process most similar to a syntactic process (Giegrich, 2009: 319). In particular, one of the most productive processes of making compounds in Farsi is "incorporation". Noun incorporation in Farsi belongs to the first type of incorporation according to Mithun (1984, 1986), in which we observe two corresponding syntactic constructions. The existence of this syntactic sub-construction makes this word formation process a good example of the interaction between morphology and syntax. Noun incorporation, however, is not the only construction that has a syntactic side: the formation of a syntactic phrase is another process, syntactic in itself, in which the phrases in some cases lose syntactic elements such as "ezâfe". This means that syntactic phrases also show some morphological behaviors and can therefore be taken as another sign of the interaction between morphology and syntax. In this research, we have chosen 500 words from the first dictionary of a collection of words approved by the Academy of Persian Language and Literature, and 500 words from the sixteenth dictionary. We will find out whether or not the definitions of incorporating nouns illustrate the syntactic construction. In the second phase, we will determine the syntactic and semantic roles of each constituent in incorporating nouns and syntactic phrases in order to understand which syntactic and semantic roles have been used more. For this purpose, constructional schemas were helpful tools in this research.
It should be mentioned that our main aim is to demonstrate the interaction between morphology and syntax, but we have also included certain semantic analyses.
Literature Review: The morphology-syntax interface has been a disputed issue in linguistics. The autonomy of morphology has led to two different approaches. The lexicalist approach is a term that has been used in generative theories (O’Neil, 2016: 242). According to this approach, the output of morphology is the input of syntax. Syntax, moreover, does not have any access to the internal structure of words and cannot change their meaning (Chomsky, 1970: 240). However, in the other approach, which is called distributed morphology, we do not have a coherent lexicon as we do in generative grammar. The only generative component of language is syntax, which produces words, phrases, and sentences (Anoushe, 2021: 616). Noun incorporation is perhaps the most nearly syntactic of all morphological processes (Mithun, 1984: 847). According to Kroeber's definition (1909: 541), "Incorporation is a process which produces a word from a combination of direct object to the verb. This new word has the role of predicate in sentences," while Sapir (1911: 255) is against this twofold syntactic-morphological perception. Mithun (1984) considers noun incorporation a morphological structure. Rosen (1989) takes the same view and considers noun incorporation a morphological process that occurs in the lexicon. However, Baker (1988) takes a syntactic view of incorporation. Thus, the exact place of noun incorporation as a morphological or syntactic process is still a controversial issue among linguists.
Methodology: The interactional approach to language is a new approach that is the main framework of this research.
In this approach, we believe in independent modules of the language, but we also assign an interactional relation between them (Dabir-Moghaddam, 2020: 41). This is one of the most important principles of this approach, and it helps illustrate the dimensions of interaction between the modules of morphology and syntax in this research. The interactional approach holds that there is a syntactic construction for each incorporating noun; it is an abstract template for producing these nominal compounds, and whenever necessary this abstract structure is transformed into that syntactic construction (Dabir-Moghaddam, 2020: 358-359). As mentioned before, we have also carried out certain semantic analyses in this research. For this purpose, constructional schemas in constructional morphology are beneficial tools. The word "construction" is a concept that can be used at both the level of word formation and the level of the syntactic phrase (Booij, 2010a: 1). One of the main processes of word formation is compound formation, which is considered a construction. The constructional schema for this structure is shown here:
(1) $[[a]_{Xk}\,[b]_{Ni}]_{j} \leftrightarrow [\mathrm{SEM}_i \text{ with relation } R \text{ to } \mathrm{SEM}_k]_{j}$
The double arrow symbolizes the relationship between a particular form and a particular meaning. The variable X stands for the major lexical categories (N, V, A, and P). The variables "a" and "b" in this schema stand for arbitrary sound sequences. The variables "i", "j", and "k" stand for the lexical indexes on the phonological, syntactic, and semantic (SEM) properties of words. In (1), the meaning contribution of the compound schema is specified, as morphology deals with the correlation between form and meaning in sets of complex words.
For instance, instantiating the schema with one of the incorporating nouns in this corpus gives:
(2) $[[\textit{hobab}]_{N(\text{DIR-OBJ})k}\,[\textit{saz}]_{Vi}]_{j} \leftrightarrow [\text{a tool which shows the action } \mathrm{SEM}_i \text{ is done on the patient } \mathrm{SEM}_k]_{j}$
As this constructional schema makes clear, the first constituent is the direct object in the syntactic construction underlying the incorporating noun "Hobabsaz" [bubblemaker]; this constituent, furthermore, has the semantic role of a patient in the supposed syntactic construction. "R" depicts a relation in which a tool performs an action on a patient.
Conclusion: The research conducted in this thesis shows that the definitions used to describe incorporating nouns in dictionaries confirm the syntactic construction of incorporating nouns. On the other hand, the loss of some syntactic markers in syntactic phrases demonstrates the lexicalization of these structures. The most common syntactic role of an incorporated constituent in incorporating nouns was the object. In addition, the syntactic and semantic roles of most of the constituents used with predicate nouns in syntactic phrases were, respectively, "adjuncts" and "manner". Incorporating nouns had a greater diversity of constructional schemas than syntactic phrases. However, the productivity of syntactic phrases was higher. Furthermore, the usage of syntactic phrases increased in the ten-year period from the publication of the first dictionary to that of the sixteenth dictionary. As a result, the overall trend was ascending.
Acknowledgments: This essay is inspired by my master's degree dissertation. I would therefore like to thank my professor, Doctor Mohammad Dabir-Moghaddam, who supervised this dissertation. I could not have written even one word without his valuable advice. I also thank Doctor Mohammad Reza Razavi, whose helpful comments always made the essay more readable.

Language and Literature, Language. Linguistic theory. Comparative grammar
DOAJ Open Access 2024
Certainty Stance Adverbs in Chinese Linguistic Academic Writing: A Corpus-based Study

Natalia M. Dugalich, Hao Han

This study delves into the intricate usage of certainty stance adverbs in the academic writings of the Chinese. The object of this study is to conduct a contrastive analysis of certainty stance adverbs between Chinese linguistic MA novices and linguistic experts. The subject of this study is to explore the similarities and differences in the use of these adverbs within the academic discourse of the two groups mentioned. In addressing these disparities, the study seeks to analyze them from the perspective of interlanguage and interpersonal function, with the ultimate goal of enhancing the production of high-quality academic papers by the Chinese. The study employs two key methods: the contrastive analysis method and the corpus-based method. The novelty of the study lies in its contrastive examination of certainty stance adverbs between Chinese linguistic MA novices and experts. Through an in-depth analysis of theoretical frameworks and linguistic data extracted from the MA theses of Chinese linguistic novices and research articles of linguistic experts, the study underscores the importance of considering the similarities and differences in the use of certainty stance adverbs within the realm of second language acquisition and pragmatic studies. This consideration aims to refine learners’ syntactical and pragmatic command of certainty stance adverbs to align with expert academic discourse, ultimately fostering effective interpersonal communication in academic writing. The findings of this research offer valuable insights into the specific linguistic challenges encountered by the Chinese, thereby laying a solid groundwork for the development of targeted pedagogical strategies to bolster their academic writing skills.

Language. Linguistic theory. Comparative grammar, Semantics
arXiv Open Access 2024
Normal ordered grammars

Shi-Mei Ma, Toufik Mansour, Jean Yeh et al.

We introduce the theory of normal ordered grammars, which gives a natural generalization of the normal ordering problem. To illustrate the main idea, we explore normal ordered grammars associated with the Eulerian polynomials and the second-order Eulerian polynomials. In particular, we present a normal ordered grammatical interpretation for the (cdes,cyc) (p,q)-Eulerian polynomials, where cdes and cyc are the cycle descent and cycle statistics, respectively. The exponential generating function for a family of polynomials, generated by a normal ordered grammar associated with the second-order Eulerian polynomials, reveals an interesting feature: its expression involves the generating function for Catalan numbers as its exponent. In the final part, we discuss some normal ordered grammars related to the type B Eulerian polynomials. A normal ordered grammatical interpretation of the up-down run polynomial is also established.

en math.CO
arXiv Open Access 2024
Incremental Context-free Grammar Inference in Black Box Settings

Feifei Li, Xiao Chen, Xi Xiao et al.

Black-box context-free grammar inference presents a significant challenge in many practical settings due to limited access to example programs. The state-of-the-art methods, Arvada and Treevada, employ heuristic approaches to generalize grammar rules, initiating from flat parse trees and exploring diverse generalization sequences. We have observed that these approaches suffer from low quality and readability, primarily because they process entire example strings, adding to the complexity and substantially slowing down computations. To overcome these limitations, we propose a novel method that segments example strings into smaller units and incrementally infers the grammar. Our approach, named Kedavra, has demonstrated superior grammar quality (enhanced precision and recall), faster runtime, and improved readability through empirical comparison.

en cs.PL, cs.SE
DOAJ Open Access 2023
Aproximações entre infâncias, indígenas mulheres e feminismos no filme Tainá (2000) [Connections between childhoods, Indigenous women, and feminisms in the film Tainá (2000)]

Arthur Felipe Fiel, Alice Santos, Patrícia Cardoso D'Abreu

This paper analyzes the film Tainá: uma aventura na Amazônia (directed by Tânia Lamarco and Sérgio Bloch, 2000), which features the first representation of a female child in an active, proactive role in the twists and outcomes of the plot. The work is analyzed in its sociocultural context and historical moment, and it brings to light the tensions present in that context. Through a theoretical survey, we point out some important narrative transformations that reconfigure Brazilian children's cinema and audiovisual production. In particular, we seek to highlight the important contribution of feminist agendas to the advancement of the field of studies on children and childhoods, as well as to the works produced for and addressed to child audiences in Brazil.

Discourse analysis
arXiv Open Access 2023
Spanish Resource Grammar version 2023

Olga Zamaraeva, Lorena S. Allegue, Carlos Gómez-Rodríguez

We present the latest version of the Spanish Resource Grammar (SRG), a grammar of Spanish implemented in the HPSG formalism. Such grammars encode a complex set of hypotheses about syntax making them a resource for empirical testing of linguistic theory. They also encode a strict notion of grammaticality which makes them a resource for natural language processing applications in computer-assisted language learning. This version of the SRG uses the recent version of the Freeling morphological analyzer and is released along with an automatically created, manually verified treebank of 2,291 sentences. We explain the treebanking process, emphasizing how it is different from treebanking with manual annotation and how it contributes to empirically-driven development of syntactic theory. The treebanks' high level of consistency and detail makes them a resource for training high-quality semantic parsers and generally systems that benefit from precise and detailed semantics. Finally, we present the grammar's coverage and overgeneration on 100 sentences from a learner corpus, a new research line related to developing methodologies for robust empirical evaluation of hypotheses in second language acquisition.

en cs.CL
DOAJ Open Access 2022
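Leitura de sentenças em alemão por falantes de hunsriqueano: acesso lexical não seletivo e coativação sintática [Reading of German sentences by Hunsrückisch speakers: non-selective lexical access and syntactic coactivation]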

Bernardo Kolling Limberger

Studies on bilingual/multilingual reading processing have robustly shown that lexical access is non-selective, with a parallel search across all linguistic subsystems. This effect has been replicated in more recent research in which sentences are presented to participants. Sentence context also shows that there may be syntactic coactivation, such that the syntactic representations of one language can facilitate the processing of similar structures in the other language. In general, such investigations involve hegemonic, majority languages. Accordingly, the aim of this study was to investigate lexical access during the processing of sentences in Standard German by speakers of Hunsrückisch, a German immigrant language spoken in Brazil. A sentence comprehension task in German was administered. Two groups took part in the study: one composed of Hunsrückisch speakers, the other of non-speakers of any minority language of German origin. All participants study German as a foreign language. The results show an effect of shared semantic and phonological representations (non-selective lexical access) and shared syntactic representations (syntactic coactivation) between Hunsrückisch and German in the processing of German sentences. With these results, we aim to contribute to research on minority languages and reading processing, a relationship that is not always trivial.

History of scholarship and learning. The humanities, Philology. Linguistics
DOAJ Open Access 2022
بررسی آسیب‌های آموزش از راه دور بر یادگیری دانشجویان زبان آلمانی [Investigating the harms of distance education on the learning of German-language students]

فاطمه خداکرمی

With the COVID-19 pandemic spreading worldwide, all educational activities were shut down, and the only option for educational centers to continue operating was "distance education". This study attempts to evaluate the quality and effect of virtual instruction on the learning of newly admitted German-language students, whose university studies began at the same time as the shift from in-person teaching to online/virtual teaching, in three language skills: reading, writing, and speaking. It also examines, both quantitatively and qualitatively, the academic level of 55 students in foundation courses across four areas (reading, spoken language, written language, and grammar) in the final examinations of the first semester of 1400-1401 (Solar Hijri calendar), and assesses their views on distance education. The results of this evaluation indicate a notable difference in student pass rates: unlike in previous years, none of the students in the study population failed the examinations for the pre-university and first-term courses. This is despite the fact that, according to the evaluation, only 12% of the newly admitted students had any familiarity with German, 80% had no experience of distance education, and their level of knowledge of information technology was low. In addition, the study presents various advantages and disadvantages of distance learning, along with instructional solutions for resolving the unintended problems caused by the weaknesses and shortcomings of distance education.

Language. Linguistic theory. Comparative grammar, Indo-Iranian languages and literature
arXiv Open Access 2022
Finding Dataset Shortcuts with Grammar Induction

Dan Friedman, Alexander Wettig, Danqi Chen

Many NLP datasets have been found to contain shortcuts: simple decision rules that achieve surprisingly high accuracy. However, it is difficult to discover shortcuts automatically. Prior work on automatic shortcut detection has focused on enumerating features like unigrams or bigrams, which can find only low-level shortcuts, or relied on post-hoc model interpretability methods like saliency maps, which reveal qualitative patterns without a clear statistical interpretation. In this work, we propose to use probabilistic grammars to characterize and discover shortcuts in NLP datasets. Specifically, we use a context-free grammar to model patterns in sentence classification datasets and use a synchronous context-free grammar to model datasets involving sentence pairs. The resulting grammars reveal interesting shortcut features in a number of datasets, including both simple and high-level features, and automatically identify groups of test examples on which conventional classifiers fail. Finally, we show that the features we discover can be used to generate diagnostic contrast examples and incorporated into standard robust optimization methods to improve worst-group accuracy.

en cs.CL
arXiv Open Access 2022
On Unsupervised Training of Link Grammar Based Language Models

Nikolay Mikhaylovskiy

In this short note we explore what is needed for the unsupervised training of graph language models based on link grammars. First, we introduce the termination tags formalism required to build a language model based on a link grammar formalism of Sleator and Temperley [21] and discuss the influence of context on the unsupervised learning of link grammars. Second, we propose a statistical link grammar formalism, allowing for statistical language generation. Third, based on the above formalism, we show that the classical dissertation of Yuret [25] on discovery of linguistic relations using lexical attraction ignores contextual properties of the language, and thus the approach to unsupervised language learning relying just on bigrams is flawed. This correlates well with the unimpressive results in unsupervised training of graph language models based on bigram approach of Yuret.

en cs.CL, cs.AI
S2 Open Access 2021
Slovak comparative correlatives

Jakob Horsch

Comparative Correlatives (CCs) are biclausal constructions (e.g. The harder you work, the more you earn) that have complex semantics and form. This is the first construction grammar-based corpus study to investigate Slovak CCs, based on a 500-token sample. I argue that intra-clausal word-order phenomena can be explained through processing efficiency, based on Hawkins’ principle of Early Immediate Constituents (2004), and I use covarying-collexeme analysis (Stefanowitsch & Gries 2005) to provide evidence for the existence of meso-constructions. The findings of this study contribute to construction grammar’s “aspirations toward universal applicability” (Fried 2017: 249), proving that the theory is also suitable for analysis of syntactic patterns in Slavic languages.

DOAJ Open Access 2021
Beyond the libretto

Aleksandra Ożarowska

Nowadays both intra- and interlingual surtitles are an inherent element of almost all opera productions and, partly thanks to this technology, opera is now going through a renaissance. The trend of staging operas in a modernised fashion is especially popular these days, but it represents a particular challenge for surtitlers. It is argued in this article that while surtitles accompanying traditional opera productions are usually intrasemiotic, as their source text is just the libretto, modernised productions often have intersemiotic surtitles. The article analyses fragments of surtitles prepared for four different operas staged in the Metropolitan Opera House, Bayerische Staatsoper and Royal Opera House. The results show that while traditionally surtitles provide the viewers with the meaning of the libretto, the role of intersemiotic surtitles is much more extended, as they provide the audience with more comprehensive information about the whole opera production.

Translating and interpreting

Page 11 of 185096