Jean-Yves Pollock
Results for "Comparative grammar"
Showing 20 of ~3696277 results · from arXiv, DOAJ, Semantic Scholar, CrossRef
Mohammad Rifat Arefin, Shanto Rahman, Christoph Csallner
Black-box context-free grammar inference is crucial for program analysis, reverse engineering, and security, yet existing tools such as Arvada, TreeVada, and Kedavra struggle with scalability, readability, and accuracy on large, complex languages. We present NatGI, a novel LLM-guided grammar inference framework that extends TreeVada's parse tree recovery with three key innovations: bracket-guided bubble exploration, LLM-driven bubble generation and non-terminal labeling, and hierarchical delta debugging (HDD) for systematic tree simplification. Bracket-guided exploration leverages syntactic cues such as parentheses to propose well-structured grammar fragments, while LLM guidance produces meaningful non-terminal names and selects more promising merges. Finally, HDD incrementally removes unnecessary rules, which makes the grammars both compact and interpretable. In our experiments, we evaluate NatGI on a comprehensive benchmark suite ranging from small languages to larger ones such as Lua, C, and MySQL. Our results show that NatGI consistently outperforms strong baselines in terms of F1 score. On average, NatGI achieves an F1 score of 0.57, which is 25 percentage points higher than that of the best-performing baseline, TreeVada. In terms of interpretability, our generated grammars perform significantly better than those produced by existing approaches. Leveraging LLM-based node renaming and bubble exploration, NatGI produces rules with meaningful non-terminal names and compact structures that align more closely with human intuition. As a result, developers and researchers can achieve higher accuracy while still being able to easily inspect, verify, and reason about the structure and semantics of the induced grammars.
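The hierarchical delta debugging step this abstract describes can be illustrated with a minimal, hypothetical sketch (this is not NatGI's code; the tree encoding and the oracle below are invented for illustration): children of a tree node are dropped one at a time, and a removal is kept only if an oracle still accepts the simplified tree.

```python
def simplify_children(label, children, oracle):
    """One HDD-style pass over a (label, children) tree node: try to
    drop each child, keeping a removal only if the oracle still
    accepts the simplified node."""
    i = 0
    while i < len(children):
        candidate = children[:i] + children[i + 1:]
        if oracle((label, candidate)):
            children = candidate   # the child was unnecessary
        else:
            i += 1                 # the child is needed; keep it
    return (label, children)

# Hypothetical oracle: a statement node must keep at least one child
# labeled "expr"; everything else is prunable noise.
oracle = lambda t: any(lbl == "expr" for lbl, _ in t[1])
tree = ("stmt", [("expr", []), ("comment", []), ("ws", [])])
print(simplify_children(*tree, oracle))  # → ('stmt', [('expr', [])])
```

Real HDD recurses level by level over the whole tree and uses a parser or test harness as the oracle; the sketch keeps only the greedy removal loop.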
Dominik Glandorf, Peng Cui, Detmar Meurers et al.
Chatbots based on large language models offer cheap conversation practice opportunities for language learners. However, they are hard to control for linguistic forms that correspond to learners' current needs, such as grammar. We control grammar in chatbot conversation practice by grounding a dialogue response generation model in a pedagogical repository of grammar skills. We also explore how this control helps learners to produce specific grammar. We comprehensively evaluate prompting, fine-tuning, and decoding strategies for grammar-controlled dialogue response generation. Strategically decoding Llama3 outperforms GPT-3.5 when tolerating minor response quality losses. Our simulation predicts grammar-controlled responses to support grammar acquisition adapted to learner proficiency. Existing language learning chatbots and research on second language acquisition benefit from these affordances. Code available on GitHub.
Andreas Pointner, Josef Pichler, Herbert Prähofer
A vast number of software systems include components that parse and process structured input. In addition to programming languages, which are analyzed by compilers or interpreters, there are numerous components that process standardized or proprietary data formats of varying complexity. Even if such components were initially developed and tested based on a specification, such as a grammar, numerous modifications and adaptations over the course of software evolution can make it impossible to precisely determine which inputs they actually accept. In this situation, grammar mining can be used to reconstruct the specification in the form of a grammar. Established approaches already produce useful results, provided that sufficient input data is available to fully cover the input language. However, achieving this completeness is a major challenge. In practice, only input data recorded during the operation of the software systems is available. If this data is used for grammar mining, the resulting grammar reflects only the actual processed inputs but not the complete grammar of the input language accepted by the software component. As a result, edge cases or previously supported features that no longer appear in the available input data are missing from the generated grammar. This work addresses this challenge by introducing a novel approach for the automatic generation of inputs for grammar mining. Although input generators have already been used for fuzz testing, it remains unclear whether they are also suitable for grammar miners. Building on the grammar miner Mimid, this work presents a fully automated approach to input generation. The approach leverages Dynamic Symbolic Execution (DSE) and extends it with two mechanisms to overcome the limitations of DSE regarding structured input parsers. First, the search for new inputs is guided by an iterative expansion that starts with a single-character input and gradually extends it. 
Second, input generation is structured into a novel three-phase approach, which separates the generation of inputs for parser functions. The proposed method was evaluated against a diverse set of eleven benchmark applications from the existing literature. Results demonstrate that the approach achieves precision and recall for extracted grammars close to those derived from state-of-the-art grammar miners such as Mimid. Notably, it successfully uncovers subtle features and edge cases in parsers that are typically missed by such grammar miners. The effectiveness of the method is supported by empirical evidence, showing that it can achieve high performance in various domains without requiring prior input samples. This contribution is significant for researchers and practitioners in software engineering, offering an automated, scalable, and precise solution for grammar mining. By eliminating the need for manual input generation, the approach not only reduces workload but also enhances the robustness and comprehensiveness of the extracted grammars. Following this approach, software engineers can reconstruct specifications from existing (legacy) parsers.
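The iterative single-character expansion described above can be sketched without a DSE engine (everything here, including the toy parser, is hypothetical; the paper drives a symbolic executor rather than enumerating an alphabet): a candidate input is grown one character per round, and only prefixes the parser classifies as extensible are explored further.

```python
def classify(s):
    """Toy parser for inputs of the form ( ... ( d ) ... ): a digit
    wrapped in one or more balanced parentheses. Returns 'accept',
    'incomplete' (a valid prefix worth extending), or 'reject'."""
    depth, seen_digit = 0, False
    for ch in s:
        if ch == "(" and not seen_digit:
            depth += 1
        elif ch.isdigit() and not seen_digit and depth > 0:
            seen_digit = True
        elif ch == ")" and seen_digit and depth > 0:
            depth -= 1
        else:
            return "reject"
    return "accept" if depth == 0 and seen_digit else "incomplete"

def mine_inputs(classify, alphabet, rounds):
    """Start from the empty input and extend one character per round,
    discarding rejected prefixes and collecting accepted inputs."""
    accepted, frontier = [], [""]
    for _ in range(rounds):
        new_frontier = []
        for prefix in frontier:
            for ch in alphabet:
                cand = prefix + ch
                verdict = classify(cand)
                if verdict == "accept":
                    accepted.append(cand)
                elif verdict == "incomplete":
                    new_frontier.append(cand)
        frontier = new_frontier
    return accepted

print(mine_inputs(classify, "(1)", 6))  # → ['(1)', '((1))']
```

The accepted inputs would then be fed to a grammar miner such as Mimid; the 'incomplete' verdict stands in for the feedback that DSE derives from path constraints.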
Katrin Affolter, Kurt Stockinger, A. Bernstein
Over the last few years, natural language interfaces (NLI) for databases have gained significant traction both in academia and industry. These systems use very different approaches as described in recent survey papers. However, these systems have not been systematically compared against a set of benchmark questions in order to rigorously evaluate their functionalities and expressive power. In this paper, we give an overview of 24 recently developed NLIs for databases. Each of the systems is evaluated using a curated list of ten sample questions to show their strengths and weaknesses. We categorize the NLIs into four groups based on the methodology they are using: keyword-, pattern-, parsing- and grammar-based NLI. Overall, we learned that keyword-based systems are enough to answer simple questions. To solve more complex questions involving subqueries, the system needs to apply some sort of parsing to identify structural dependencies. Grammar-based systems are overall the most powerful ones, but are highly dependent on their manually designed rules. In addition to providing a systematic analysis of the major systems, we derive lessons learned that are vital for designing NLIs that can answer a wide range of user questions.
Kanghee Park, Jiayu Wang, Taylor Berg-Kirkpatrick et al.
Large Language Models (LLMs) struggle with reliably generating highly structured outputs, such as program code, mathematical formulas, or well-formed markup. Constrained decoding approaches mitigate this problem by greedily restricting what tokens an LLM can output at each step to guarantee that the output matches a given constraint. Specifically, in grammar-constrained decoding (GCD), the LLM's output must follow a given grammar. In this paper, we demonstrate that GCD techniques (and in general constrained decoding techniques) can distort the LLM's distribution, leading to outputs that are grammatical but appear with likelihoods that are not proportional to the ones given by the LLM, and so ultimately are low-quality. We call the problem of aligning sampling with a grammar constraint, grammar-aligned decoding (GAD), and propose adaptive sampling with approximate expected futures (ASAp), a decoding algorithm that guarantees the output to be grammatical while provably producing outputs that match the conditional probability of the LLM's distribution conditioned on the given grammar constraint. Our algorithm uses prior sample outputs to soundly overapproximate the future grammaticality of different output prefixes. Our evaluation on code generation and structured NLP tasks shows how ASAp often produces outputs with higher likelihood (according to the LLM's distribution) than existing GCD techniques, while still enforcing the desired grammatical constraints.
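The greedy masking that GCD performs at each step, which this abstract argues distorts the LLM's distribution, can be shown in a few lines (the token names and grammar are invented; this is not the ASAp algorithm, only the baseline behavior it corrects): forbidden tokens get probability zero and the rest are renormalized locally.

```python
import math

def gcd_step(logits, allowed):
    """One greedy grammar-constrained decoding step: keep only the
    tokens the grammar allows next, then renormalize. This local
    renormalization is the source of the distortion the paper studies:
    probability mass from pruned continuations is redistributed without
    regard to whether a prefix can still complete grammatically."""
    exp = {tok: math.exp(l) for tok, l in logits.items() if tok in allowed}
    z = sum(exp.values())
    return {tok: v / z for tok, v in exp.items()}

# Hypothetical step: the grammar only allows a parenthesis next,
# even though the LLM strongly prefers the identifier "foo".
probs = gcd_step({"(": 1.0, "foo": 3.0, ")": 0.0}, allowed={"(", ")"})
print(probs)
```

ASAp instead reweights prefixes by an overapproximation of their expected future grammaticality, so that sampling converges to the LLM's distribution conditioned on grammaticality rather than this locally renormalized one.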
Yu Zhao, Hao Fei, Shengqiong Wu et al.
Grammar induction could benefit from rich heterogeneous signals, such as text, vision, and acoustics. In the process, features from distinct modalities essentially serve complementary roles to each other. With such intuition, this work introduces a novel unsupervised visual-audio-text grammar induction task (named VAT-GI) to induce constituent grammar trees from parallel image, text, and speech inputs. Inspired by the fact that language grammar natively exists beyond text, we argue that text does not have to be the predominant modality in grammar induction. Thus we further introduce a textless setting of VAT-GI, wherein the task relies solely on visual and auditory inputs. To approach the task, we propose a visual-audio-text inside-outside recursive autoencoder (VaTiora) framework, which leverages rich modality-specific and complementary features for effective grammar parsing. In addition, a more challenging benchmark dataset is constructed to assess the generalization ability of the VAT-GI system. Experiments on two benchmark datasets demonstrate that our proposed VaTiora system is more effective in incorporating the various multimodal signals and presents new state-of-the-art performance on VAT-GI.
Anna E. Bazanova, Mohamed Alsadig Hamid Musa
The concept and phenomenon of the language game, its main functions, types, and application in commercial advertising serve as a way to attract consumers' attention and promote a product. Examples of phonetic, morphological, and syntactic wordplay in English-language commercial advertising texts are analyzed. The purpose of the article is to analyze language-game techniques and identify their functional features at various levels in English-language advertising text. The following methods were used: the descriptive-analytical method, the interpretation method, and the search method. Material for analysis was selected by continuous sampling. As research material, we used English-language advertisements from various sources, such as magazines, newspapers, and videos, in which a language game was identified. The language game implemented in advertising texts is thus an important phenomenon: it maximizes the impact on the consumer, because the recipient's attention is drawn to the advertisement by its techniques and functions. In addition, an advertisement in which a language game is present is an indicator of a high level of the consumer's language competence.
Rosiane Maria Soares da Silva Xypas, Anne Godard, Simone Aubin
R. Cox, Karen Hunt, R. R. Hill
BACKGROUND Artificial intelligence (AI) has the potential to revolutionize nursing education. This study compared NCLEX-RN questions generated by AI and those created by nurse educators. METHOD Faculty of accredited baccalaureate programs were invited to participate. Likert-scale items for grammar and clarity of the item stem and distractors were compared using Mann-Whitney U, and yes/no questions about clinical relevance and complex terminology were analyzed using chi-square. A one-sample binomial test with confidence intervals evaluated participants' question preference (AI-generated or educator-written). Qualitative responses identified themes across faculty. RESULTS Item clarity, grammar, and difficulty were similar for AI and educator-created questions. Clinical relevance and use of complex terminology were similar for all question pairs. Of the four sets with preference for one item, three were generated by AI. CONCLUSION AI can assist faculty with item generation to prepare nursing students for the NCLEX-RN examination. Faculty expertise is necessary to refine questions written using both methods. [J Nurs Educ. 2023;62(12):679-687.].
Martin Haspelmath
Abstract This article revisits the distinction between inflectional and derivational patterns in general grammar and discusses the possibility that this well-known distinction is not rooted in the reality of languages, but in the Western tradition of describing languages, through dictionaries (for words, including derived lexemes) and through grammar books (where we often find tables of exemplary paradigms). This tradition has led to rather different terminological treatments of the two kinds of patterns, but from the perspective of a constructional view of morphology, there is no need to incorporate such differences into formal grammatical descriptions. For practical purposes, we need clear and simple definitions of entrenched terms of general linguistics, so the article proposes semantically based (retro-) definitions of inflection, derivation and lexeme that cover the bulk of the existing usage. Finally, I briefly explain why we need sharp definitions of comparative concepts, and why prototype-based and fuzzy definitions of traditional terms are not helpful.
Leonie Weissweiler, Valentin Hofmann, Abdullatif Köksal et al.
Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasising the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step towards assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models’ behaviour in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs are able to recognise the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.
Martin Haspelmath
Meanings and linguistic shapes (or forms) do not always map onto each other in a unique way, and linguists have used all kinds of different terms for such situations: Ambiguity, polysemy, syncretism, lexicalization, semantic maps; portmanteau, cumulative exponence, feature bundling, underspecification, and so on. In the domain of lexical comparison, the term colexification has become generally established in recent years, and in the present paper, I extend this word-formation pattern in a regular way (cogrammification, coexpression; syllexification, syngrammification, synexpression). These novel terms allow us to chart the range of relevant phenomena in a systematic way across the grammar-lexicon continuum, and to ask whether highly general explanations of coexpression and synexpression patterns are possible. While there is no new proposal for explaining coexpression here, I will suggest that frequency of occurrence plays a crucial role in explaining synexpression patterns.
Lavinia C. Pindaru, A. Nita, Iulian M. Niculae et al.
Environment Action Programmes (EAPs) are the most important documents defining environmental policies within the European Union. Their implementation over the previous 50 years represented a significant advance in raising eco-friendly awareness and suggesting solutions for environmental problems in the European Union. In this paper, we used the Institutional Grammar Tool and network analysis to identify the evolution of EU EAPs by investigating the two most recent programmes (the 7th Environment Action Programme and the 8th Environment Action Programme), particularly their priority objectives, institutional statements, enforcement perspectives, and projected participation of stakeholders. We found that the EU's 8th Environment Action Programme (2021–2030) is further streamlined and target-oriented compared to the 7th Environment Action Programme. Furthermore, institutional statements included in the 8th EAP will be implemented predominantly at the levels of the European Union and the European Commission. By contrast, in the 7th EAP, the number of institutions, frameworks, and stakeholders is higher and often regional and local (e.g., European Union, Environment Action Programme, European Environment Agency, European Commission, European Parliament, Convention on Biological Diversity, regional authorities, local authorities). The close links of the 8th EAP targets with the 2030 Agenda for Sustainable Development and the European Green Deal represent an important step towards greater applicability of environmental policies in the European Union. Our study reveals that comparative analysis of legal documents using the Institutional Grammar Tool and network analysis can assist policymakers in assessing the drafting of legal environmental documents and obtaining indispensable information about the changes needed to improve environmental policies.
Clara Fridman, M. Polinsky, Natalia Meir
While it is known that heritage speakers diverge from the homeland baseline, there is still no consensus on the mechanisms triggering this divergence. We investigate the impact of two potential factors shaping adult heritage language (HL) grammars: (1) cross-linguistic influence (CLI), originally proposed for second language acquisition (SLA), and (2) background factors associated with input. To assess the role of CLI and input we compared two groups of adult heritage speakers of Russian (n = 66) with two typologically distinct societal languages (SLs), Hebrew and American English. Their production was evaluated for three morphosyntactic phenomena: adjective–noun agreement, accusative case morphology, and numerical phrases. Using self-rating and baseline vocabulary tasks as proficiency measures, we conducted controlled experiments to assess mastery of the target phenomena. Our results show that, while CLI is the main mechanism behind HL grammar maintenance, increased input and proficiency can modulate performance in the absence of grammatical similarities between the HL and SL. An analysis of non-target responses revealed systematic patterns, including reliance on default, or unmarked, forms in both groups, in line with previous research. These findings contribute to the literature on the mechanisms of HL grammar formation and maintenance.
Chenlei Zhou
The paper describes comparative constructions in Zhoutun, a Chinese variety that is heavily influenced by Amdo Tibetan and spoken in Guide County, Qinghai Province. There are five comparative constructions (Cxn), based on the type of comparative marker, in Zhoutun, namely (1) the xa-Cxn; (2) the pi-Cxn; (3) the ‘look’-Cxn; (4) the ‘and’-Cxn; and (5) the hybrid Cxn. The five constructions illustrate features from both Chinese and Amdo Tibetan, and their co-existence demonstrates the mixed nature of the comparative constructions, as well as of the grammar system of Zhoutun, due to language contact. This paper also argues that the “comparative subject” should be further subcategorized into “comparative subject” and “attributive subject”, and that the “comparative result” should be divided into “abstract measurement” and “concrete measurement” in the typological study of comparative constructions.
Dieter Hillert, K. Fujita
The question of how humans got language is crucial for understanding the uniqueness of the human mind and the cognitive resources and processes shared with nonhuman species. We discuss the origin of symbolic elements in hominins and how a pragmatic grammar emerged from action-based event-structures. In the context of comparative neurobiological findings, we report support for the global workspace hypothesis and social brain hypothesis. In addition, reverse linguistic analysis informs us about the particular role of a pragmatic grammar stage. We assume that this stage was associated with changes to the hominin genotype. Homo erectus may have used a pragmatic grammar which consisted of two or three symbolic elements. Extended syntax and morphology, including hierarchical branching, are not based on genotype changes but may reflect cultural accumulations related to socioecological adaptations. We conclude that the biological capacity for language may have emerged already 1.8 million years ago with the appearance of genus Homo.
Lin Huang
Grammar is not only an essential component of learning a language but also a challenging topic for foreign language learners. Inductive and deductive methods are two fundamental strategies for teaching grammar, each with its own advantages. Domestic research on this topic began relatively late compared with the numerous overseas studies of these two grammar teaching techniques, and the research findings are often insufficient. Therefore, it is necessary to explore grammar instruction strategies in the present context. By comparing experimental data from inductive and deductive research methodologies, this study examines their comparative impact on teaching English grammar. The study also sought to determine which of the two approaches had a positive impact on the grammar performance of Chinese senior high school students. Based on previous experimental studies, it can be concluded that inductive and deductive teaching methods each have their own advantages and disadvantages. The variables in the language teaching context will determine the appropriate strategy for teaching grammar.
Albatool Abalkheel, Maha Sourani
The objective of this research is to connect Arabic theoretical linguistics with modern linguistics, not by disregarding the present and attributing the later theories solely to Arab grammarians and rhetoricians, as has often been assumed. Rather, this research aims to qualitatively explore the linguistic accomplishments of ancient Arab scholars from a contemporary perspective and identify commonalities and controversies between past and present perspectives. Specifically, this study focuses on Sibawayh's interpretation of functional linguistics and its final formulation by Jakobson. Additionally, the study aims to analyse and compare the principles of language and grammar advocated by ancient Arab linguists and contemporary linguists to deepen our understanding of language and its role in human society. The results of this study reveal that while Sibawayh concentrated on the analysis of structure and function in Arabic grammar, modern linguists such as Jakobson have broadened their approach to include a wider range of perspectives including sociolinguistics, psycholinguistics, and neurolinguistics. Therefore, it is essential to acknowledge the contributions of Sibawayh as his work formed the foundation for language theories in general and linguistics in particular.
Sean Deyo, Veit Elser
We introduce the logical grammar embedding (LGE), a model inspired by pregroup grammars and categorial grammars to enable unsupervised inference of lexical categories and syntactic rules from a corpus of text. LGE produces comprehensible output summarizing its inferences, has a completely transparent process for producing novel sentences, and can learn from as few as a hundred sentences.
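The pregroup side of the inspiration can be illustrated with a toy reduction checker (a generic pregroup-grammar sketch, not LGE itself; the lexicon is invented): a sentence is grammatical if the concatenation of its word types reduces to the sentence type s by cancelling adjacent pairs x, x^r and x^l, x.

```python
def reduces_to_s(word_types):
    """Concatenate the words' pregroup types and repeatedly cancel
    adjacent pairs (x, x^r) and (x^l, x); grammatical iff ['s'] remains."""
    seq = [t for types in word_types for t in types]
    changed = True
    while changed:
        changed = False
        for i in range(len(seq) - 1):
            a, b = seq[i], seq[i + 1]
            if b == a + "^r" or a == b + "^l":
                seq = seq[:i] + seq[i + 2:]
                changed = True
                break
    return seq == ["s"]

# Invented lexicon: nouns are n, an intransitive verb is n^r s,
# a transitive verb is n^r s n^l.
print(reduces_to_s([["n"], ["n^r", "s"]]))                # "Alice sleeps" → True
print(reduces_to_s([["n"], ["n^r", "s", "n^l"], ["n"]]))  # "Alice likes Bob" → True
print(reduces_to_s([["n"], ["n^l"]]))                     # → False
```

LGE itself learns the categories and rules from raw text rather than assuming a lexicon; the sketch only shows the kind of type cancellation that pregroup grammars make transparent.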
Page 6 of 184814