Results for "Language. Linguistic theory. Comparative grammar"

Showing 20 of ~4,441,306 results · from CrossRef, DOAJ, arXiv, Semantic Scholar

arXiv Open Access 2025
PyFCG: Fluid Construction Grammar in Python

Paul Van Eecke, Katrien Beuls

We present PyFCG, an open source software library that ports Fluid Construction Grammar (FCG) to the Python programming language. PyFCG enables its users to seamlessly integrate FCG functionality into Python programs, and to use FCG in combination with other libraries within Python's rich ecosystem. Apart from a general description of the library, this paper provides three walkthrough tutorials that demonstrate example usage of PyFCG in typical use cases of FCG: (i) formalising and testing construction grammar analyses, (ii) learning usage-based construction grammars from corpora, and (iii) implementing agent-based experiments on emergent communication.

en cs.CL, cs.AI
arXiv Open Access 2025
Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language

Kilian Sennrich, Sina Ahmadi

Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata's lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.
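
The template-based construction of such a dataset can be illustrated with a minimal sketch. The natural-language pattern, the slot names, and the pairing of a lexeme entity with `wikibase:lemma` are illustrative assumptions, not templates taken from the paper's dataset:

```python
# One natural-language template paired with one SPARQL template; slot
# fillers are substituted into both to produce a (question, query) pair.
NL_TEMPLATE = "What is the lemma of the {language} lexeme {lexeme_id}?"
SPARQL_TEMPLATE = """SELECT ?lemma WHERE {{
  wd:{lexeme_id} wikibase:lemma ?lemma .
}}"""

def instantiate(lexeme_id, language):
    """Fill the slots of both templates with the same values."""
    nl = NL_TEMPLATE.format(language=language, lexeme_id=lexeme_id)
    query = SPARQL_TEMPLATE.format(lexeme_id=lexeme_id)
    return nl, query

nl, query = instantiate("L99", "English")
```

Scaling this idea across many templates and many slot fillers is what yields the reported 1.2 million utterance-to-query mappings.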

en cs.CL
arXiv Open Access 2024
On Computational Completeness of Semi-Conditional Matrix Grammars

Henning Fernau, Lakshmanan Kuppusamy, Indhumathi Raman

Matrix grammars are one of the first approaches ever proposed in regulated rewriting, prescribing that rules have to be applied in a certain order. Originally, they were introduced by Ábrahám on linguistic grounds. In traditional regulated rewriting, the most interesting case shows up when all rules are context-free. Typical descriptional complexity measures incorporate the number of nonterminals or the matrix length, i.e., the number of rules per matrix. When viewing matrices as program fragments, it becomes natural to consider additional applicability conditions for such matrices. Here, we focus on attaching a permitting and a forbidden string to every matrix in a matrix grammar. The matrix is applicable to a sentential form $w$ only if the permitting string is a subword of $w$ and the forbidden string is not a subword of $w$. We call such a grammar, where the application of a matrix is conditioned as described, a semi-conditional matrix grammar. We consider (1) the maximal lengths of permitting and forbidden strings, (2) the number of nonterminals, (3) the number of conditional matrices, (4) the maximal length of any matrix, and (5) the number of conditional matrices with nonempty permitting and forbidden strings as the resources (descriptional complexity measures) of a semi-conditional matrix grammar. In this paper, we show that certain semi-conditional matrix grammar families defined by restricting resources can generate all of the recursively enumerable languages.
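
The applicability condition described in the abstract is easy to state operationally. A minimal Python sketch, using a hypothetical toy matrix rather than any grammar from the paper:

```python
def matrix_applicable(w, permitting, forbidden):
    """A matrix is applicable to a sentential form w only if its
    permitting string occurs in w and its forbidden string does not.
    An empty condition string imposes no restriction."""
    if permitting and permitting not in w:
        return False
    if forbidden and forbidden in w:
        return False
    return True

def apply_matrix(w, rules):
    """Apply a matrix (an ordered list of context-free rules, given as
    (lhs, rhs) pairs) to w, rewriting the leftmost occurrence of each
    left-hand side in turn; the matrix fails as a whole if a rule is blocked."""
    for lhs, rhs in rules:
        if lhs not in w:
            return None
        w = w.replace(lhs, rhs, 1)
    return w

# Hypothetical toy matrix: rewrite S -> aSb, permitted only while the
# nonterminal S is present and the subword "ba" has not appeared.
rules, permitting, forbidden = [("S", "aSb")], "S", "ba"
w = "S"
while matrix_applicable(w, permitting, forbidden) and len(w) < 7:
    w = apply_matrix(w, rules)
# w has been rewritten S -> aSb -> aaSbb -> aaaSbbb
```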

en cs.FL
arXiv Open Access 2024
Improving Low-Resource Machine Translation via Cross-Linguistic Transfer from Typologically Similar High-Resource Languages

Saughmon Boujkian

This study examines the cross-linguistic effectiveness of transfer learning for low-resource machine translation by fine-tuning models initially trained on typologically similar high-resource languages, using limited data from the target low-resource language. We hypothesize that linguistic similarity enables efficient adaptation, reducing the need for extensive training data. To test this, we conduct experiments on five typologically diverse language pairs spanning distinct families: Semitic (Modern Standard Arabic to Levantine Arabic), Bantu (Hausa to Zulu), Romance (Spanish to Catalan), Slavic (Slovak to Macedonian), and a language isolate (Eastern Armenian to Western Armenian). Results show that transfer learning consistently improves translation quality across all pairs, confirming its applicability beyond closely related languages. As a secondary analysis, we vary key hyperparameters (learning rate, batch size, number of epochs, and weight decay) to ensure results are not dependent on a single configuration. We find that moderate batch sizes (e.g., 32) are often optimal for similar pairs, smaller sizes benefit less similar pairs, and excessively high learning rates can destabilize training. These findings provide empirical evidence for the generalizability of transfer learning across language families and offer practical guidance for building machine translation systems in low-resource settings with minimal tuning effort.
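
The secondary hyperparameter sweep can be sketched as a simple grid search; the value ranges below are illustrative assumptions, not the exact grids used in the study:

```python
from itertools import product

# Grid over the four hyperparameters the study varies. The paper reports
# that moderate batch sizes (e.g., 32) are often best for similar pairs.
grid = {
    "learning_rate": [1e-5, 5e-5, 1e-4],
    "batch_size": [16, 32, 64],
    "epochs": [3, 5, 10],
    "weight_decay": [0.0, 0.01],
}

# Enumerate every configuration; each would be used to fine-tune the
# high-resource model on the low-resource target data.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
# 3 * 3 * 3 * 2 = 54 configurations in total.
```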

en cs.CL, cs.LG
arXiv Open Access 2024
Grounding Toxicity in Real-World Events across Languages

Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

Social media conversations frequently suffer from toxicity, creating significant issues for users, moderators, and entire communities. Events in the real world, like elections or conflicts, can initiate and escalate toxic behavior online. Our study investigates how real-world events influence the origin and spread of toxicity in online discussions across various languages and regions. We gathered Reddit data comprising 4.5 million comments from 31 thousand posts in six different languages (Dutch, English, German, Arabic, Turkish and Spanish). We target fifteen major social and political world events that occurred between 2020 and 2023. We observe significant variations in toxicity, negative sentiment, and emotion expressions across different events and language communities, showing that toxicity is a complex phenomenon in which many different factors interact and still need to be investigated. We will release the data for further research along with our code.

en cs.CL
arXiv Open Access 2024
Concurrent Linguistic Error Detection (CLED): a New Methodology for Error Detection in Large Language Models

Jinhua Zhu, Javier Conde, Zhen Gao et al.

The wide adoption of large language models (LLMs) makes their dependability a pressing concern. Detection of errors is the first step to mitigating their impact on a system; thus, efficient error detection for LLMs is an important issue. In many settings, the LLM is considered a black box with no access to its internal nodes, which prevents the use of many error detection schemes that require such access. An interesting observation is that the output of LLMs in error-free operation should be valid and normal text. Therefore, when the text is not valid or differs significantly from normal text, it is likely that there is an error. Based on this observation, we propose to perform Concurrent Linguistic Error Detection (CLED); this scheme extracts linguistic features of the text generated by the LLM and feeds them to a concurrent classifier that detects errors. Since the proposed error detection mechanism relies only on the outputs of the model, it can be used on LLMs to which there is no access to the internal nodes. The proposed CLED scheme has been evaluated on the T5 model when used for news summarization and on the OPUS-MT model when used for translation. In both cases, the same set of linguistic features has been used for error detection to illustrate the applicability of the proposed scheme beyond a specific case. The results show that CLED can detect most of the errors at a low overhead penalty. The use of the concurrent classifier also enables a trade-off between error detection effectiveness and its associated overhead, providing flexibility to a designer.
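
As a rough illustration of the CLED idea, one can extract a few surface-level linguistic features from generated text and hand them to a separate classifier. The specific features below (type-token ratio, average word length, character entropy) are illustrative choices, not necessarily those used by the authors:

```python
import math
from collections import Counter

def linguistic_features(text):
    """Compute simple, model-agnostic features of a generated text.
    A concurrent classifier trained on such feature vectors would flag
    outputs whose features deviate from those of normal text."""
    words = text.split()
    if not words:
        return {"ttr": 0.0, "avg_word_len": 0.0, "char_entropy": 0.0}
    ttr = len(set(words)) / len(words)            # type-token ratio
    avg_len = sum(len(w) for w in words) / len(words)
    counts = Counter(text)
    total = len(text)
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())      # character entropy
    return {"ttr": ttr, "avg_word_len": avg_len, "char_entropy": entropy}

feats = linguistic_features("the quick brown fox jumps over the lazy dog")
```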

en cs.AI, cs.CL
arXiv Open Access 2024
L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

Aishwarya Mirashi, Srushti Sonavane, Purva Lingayat et al.

In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 10 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct datasets tailored to handle different document lengths that are classified as: Short Headlines Classification (SHC) dataset containing the news headline and news category, Long Document Classification (LDC) dataset containing the whole news article and the news category, and Long Paragraph Classification (LPC) containing sub-articles of the news and the news category. We maintain consistent labeling across all 3 datasets for in-depth length-based analysis. We evaluate each of these Indic language datasets using 4 different models including monolingual BERT, multilingual Indic Sentence BERT (IndicSBERT), and IndicBERT. This research contributes significantly to expanding the pool of available text classification datasets and also makes it possible to develop topic classification models for Indian regional languages. This also serves as an excellent resource for cross-lingual analysis owing to the high overlap of labels among languages. The datasets and models are shared publicly at https://github.com/l3cube-pune/indic-nlp

en cs.CL, cs.LG
arXiv Open Access 2023
An Evaluation on Large Language Model Outputs: Discourse and Memorization

Adrian de Wynter, Xun Wang, Alex Sokolov et al.

We present an empirical evaluation of various outputs generated by nine of the most widely-available large language models (LLMs). Our analysis is done with off-the-shelf, readily-available tools. We find a correlation between percentage of memorized text, percentage of unique text, and overall output quality, when measured with respect to output pathologies such as counterfactual and logically-flawed statements, and general failures like not staying on topic. Overall, 80.0% of the outputs evaluated contained memorized data, but outputs containing the most memorized content were also more likely to be considered of high quality. We discuss and evaluate mitigation strategies, showing that, in the models evaluated, the rate of memorized text being output is reduced. We conclude with a discussion on potential implications around what it means to learn, to memorize, and to evaluate quality text.

en cs.CL, cs.AI
DOAJ Open Access 2022
La Spagna nel cuore di Leonardo Sciascia [Spain in the Heart of Leonardo Sciascia]

Sarina Macaluso

Leonardo Sciascia's love of and attraction to Spain date back to his youth and are amply documented in his writings. There is no aspect of Spain that the writer did not analyse with his characteristic acuteness and lucidity, often relating the traits of the Iberian country and its inhabitants to those of Sicily. Through his reflections, Sciascia highlighted the historical, cultural, and anthropological ties that attest to the special affinity between these two lands, both particularly dear to him.

Language. Linguistic theory. Comparative grammar, French literature - Italian literature - Spanish literature - Portuguese literature
DOAJ Open Access 2022
Legal phraseology within the context of Polish-German legal translation – a new monograph by Paweł Bielawski

Justyna Katarzyna Sekuła

The article attempts to provide a comprehensive assessment of the new monograph by Paweł Bielawski. For this purpose, the paper has been divided into four parts. The first part provides an introduction. The second part presents the theoretical basis, which the author of the monograph demonstrated in individual chapters. The third chapter assesses the monograph in terms of structure, editorial aspects and the compilation of literature. Finally, the last part is devoted to the possible addressees of the monograph and its advantages with regard to them.

Language. Linguistic theory. Comparative grammar, Comparative law. International uniform law
DOAJ Open Access 2022
STUDENTS' PERCEPTIONS OF USING READTHEORY.ORG IN READING COMPREHENSION

Krisna Ismawati, Syafryadin Syafryadin

The purpose of this study was to determine and explore students' perceptions of the use of ReadTheory in reading comprehension. A quantitative descriptive design was used to conduct this research. The researcher used a closed questionnaire with five Likert scales to collect data; the sample comprised 50 students. Based on the overall results, students' perceptions of the use of ReadTheory in reading comprehension were largely positive, especially on the points related to how the platform changed their abilities for the better, even though the changes were not highly significant. This can be seen from their responses indicating that they will continue to use ReadTheory in the future because they feel they have received many benefits. Therefore, students' perception of using the ReadTheory platform in reading comprehension tends to be positive because it provides benefits and improves their ability to understand reading texts.

Language. Linguistic theory. Comparative grammar, Theory and practice of education
DOAJ Open Access 2022
Students' self-repair in EFL classroom interactions: implications for classroom dynamics

Mohammed Beshir, Abiy Yigzaw

This study aimed to evaluate whether or not students engage in self-help behavior when they encounter difficulties during their presentations. The participants were second-year EFL college trainees; the data were obtained from their classroom presentations. The audio-recorded data were transcribed, coded, and categorized into quantitative and qualitative data. Descriptive statistical tools were employed to analyze the quantitative component, while thematic analysis was used to explain the qualitative data. Based on the findings, the students engaged in different kinds of self-repair strategies, including same-information repair (repetition), 48 (36.9%); appropriateness repair, 46 (35.4%); error repair, 32 (24.6%); and back-to-error, 4 (3.1%). From the point of view of the trouble sources that trigger self-repair, this research highlighted syntactic and lexical errors as the most prevalent problems among EFL learners. The self-repair evidence revealed the importance of giving students time to modify their utterances by themselves. In addition, the results could help teachers determine where to focus their efforts in helping students.

Special aspects of education, Language acquisition
arXiv Open Access 2022
A simple language-agnostic yet very strong baseline system for hate speech and offensive content identification

Yves Bestgen

For automatically identifying hate speech and offensive content in tweets, the SATLab team proposes a system based on a classical supervised algorithm fed only with character n-grams, and thus completely language-agnostic. After optimization of the feature weighting and the classifier parameters, it reached a medium performance level in the multilingual HASOC 2021 challenge for English, the language for which it is easy to develop deep learning approaches relying on many external linguistic resources, but a far better level for the two less-resourced languages, Hindi and Marathi. It even ranks first when performance is averaged over the three tasks in these languages, outperforming many deep learning approaches. These performances suggest that the system is an interesting reference level for evaluating the benefits of more complex approaches such as deep learning or of taking complementary resources into account.
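
The language-agnostic feature extraction at the heart of such a system can be sketched in a few lines; the n-gram range shown is an assumption, since the paper tunes feature weighting separately:

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=3):
    """Count overlapping character n-grams of lengths n_min..n_max.
    Because the features are raw character sequences, the extraction
    works identically for any language or script."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

feats = char_ngrams("abca", 1, 2)
# unigrams: a:2, b:1, c:1; bigrams: ab:1, bc:1, ca:1
```

These counts, after weighting, would be the only input to a classical supervised classifier.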

en cs.CL
arXiv Open Access 2022
Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for Underdocumented Languages

Clarissa Forbes, Farhan Samir, Bruce Harold Oliver et al.

Recent progress in NLP is driven by pretrained models leveraging massive datasets and has predominantly benefited the world's political and economic superpowers. Technologically underserved languages are left behind because they lack such resources. Hundreds of underserved languages, nevertheless, have available data sources in the form of interlinear glossed text (IGT) from language documentation efforts. IGT remains underutilized in NLP work, perhaps because its annotations are only semi-structured and often language-specific. With this paper, we make the case that IGT data can be leveraged successfully provided that target language expertise is available. We specifically advocate for collaboration with documentary linguists. Our paper provides a roadmap for successful projects utilizing IGT data: (1) It is essential to define which NLP tasks can be accomplished with the given IGT data and how these will benefit the speech community. (2) Great care and target language expertise is required when converting the data into structured formats commonly employed in NLP. (3) Task-specific and user-specific evaluation can help to ascertain that the tools which are created benefit the target language speech community. We illustrate each step through a case study on developing a morphological reinflection system for the Tsimchianic language Gitksan.
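
Step (2), converting semi-structured IGT into a structured format, might look like the following sketch for an idealized three-line record; real IGT is messier, and the example record is hypothetical, not actual Gitksan data:

```python
def parse_igt(record):
    """Parse a simple three-line interlinear glossed text (IGT) record
    (transcription line, gloss line, free translation) into aligned
    (word, gloss) pairs plus the translation. This rigid format is a
    simplifying assumption; documentation corpora vary widely."""
    transcription, gloss, translation = record.strip().split("\n")
    words = transcription.split()
    glosses = gloss.split()
    if len(words) != len(glosses):
        raise ValueError("misaligned gloss line")
    return {"pairs": list(zip(words, glosses)), "translation": translation}

# Hypothetical record, not taken from any real corpus:
record = """ni-hl gat
see-CN man
'The man sees.'"""
entry = parse_igt(record)
```

Alignment checks like the one above are where the target-language expertise the authors call for becomes essential.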

en cs.CL
S2 Open Access 2021
Effects of L1 morphological type on L2 morphological awareness

Zhaohong Wu, Alan Juffs

Previous studies on bilingual children have shown a significant correlation between first language (L1) and second language (L2) morphological awareness and a unique contribution of morphological awareness in one language to reading performance in the other language, suggesting cross-linguistic influence. However, few studies have compared advanced adult L2 learners from L1s of different morphological types or compared native speakers with advanced learners from a morphologically more complex L1 in their target-language morphological awareness. The current study filled this gap by comparing native English speakers (analytic) and two L2 groups from typologically different L1s: Turkish (agglutinative) and Chinese (isolating). Participants’ morphological awareness was evaluated via a series of tasks, including derivation, affix-choice word and nonword tasks, morphological relatedness, and a suffix-ordering task. Results showed a significant effect of L1 morphological type on L2 morphological awareness. After accounting for L2 proficiency, the Turkish group significantly outperformed the Chinese group in the derivation, morphological relatedness, and suffix-ordering tasks. More importantly, the Turkish group significantly outperformed the native English group in the morphological relatedness task even without accounting for English proficiency. Such results have implications for theories in second language acquisition regarding representation of the bilingual lexicon. In addition, results of the current study underscored the need to guard against the comparative fallacy and highlighted the influential effect of L1 experience on the acquisition of L2 morphological knowledge.

19 citations en Psychology
DOAJ Open Access 2021
SELF ASSESSMENT FOR HIGHER LANGUAGE LEARNERS’ AUTONOMY

Gita Mutiara Hati, Wisma Yunita, Anggun Citra Sari Dewi

The objective of this research is to describe how the use of self-assessment could improve learners' autonomy. This study employed an action research design involving freshmen of the English Education Study Program of the University of Bengkulu as the research subjects. The action research was conducted in the Intensive English Course. The research cycle consisted of plan, action, observation, and reflection, spanning fifteen meetings that covered four themes of materials. During the meetings, students were instructed to carry out self-assessment, guided by the instructor, after they had finished learning each theme. The indicator of success for this research was that at least 25% of students were categorized as having a Very High level of autonomy, while the rest were at a High level. To measure the level of learners' autonomy, a questionnaire covering nine aspects of autonomy with a total of fifty items was given to the subjects at the end of the cycle. From the analyzed data, it was found that the indicator of success was achieved after only one cycle of the action research; hence, there was no need to continue to a second cycle. In the end, 30% of students were categorized as having Very High learning autonomy.

Special aspects of education, Language. Linguistic theory. Comparative grammar
DOAJ Open Access 2021
The Role of the Right Hemisphere in Processing Phonetic Variability Between Talkers

Sahil Luthra

Neurobiological models of speech perception posit that both left and right posterior temporal brain regions are involved in the early auditory analysis of speech sounds. However, frank deficits in speech perception are not readily observed in individuals with right hemisphere damage. Instead, damage to the right hemisphere is often associated with impairments in vocal identity processing. Herein lies an apparent paradox: The mapping between acoustics and speech sound categories can vary substantially across talkers, so why might right hemisphere damage selectively impair vocal identity processing without obvious effects on speech perception? In this review, I attempt to clarify the role of the right hemisphere in speech perception through a careful consideration of its role in processing vocal identity. I review evidence showing that right posterior superior temporal, right anterior superior temporal, and right inferior/middle frontal regions all play distinct roles in vocal identity processing. In considering the implications of these findings for neurobiological accounts of speech perception, I argue that the recruitment of right posterior superior temporal cortex during speech perception may specifically reflect the process of conditioning phonetic identity on talker information. I suggest that the relative lack of involvement of other right hemisphere regions in speech perception may be because speech perception does not necessarily place a high burden on talker processing systems, and I argue that the extant literature hints at potential subclinical impairments in the speech perception abilities of individuals with right hemisphere damage.

Language. Linguistic theory. Comparative grammar, Neurophysiology and neuropsychology
arXiv Open Access 2021
Analyzing the Limits of Self-Supervision in Handling Bias in Language

Lisa Bauer, Karthik Gopalakrishnan, Spandana Gella et al.

Prompting inputs with natural language task descriptions has emerged as a popular mechanism to elicit reasonably accurate outputs from large-scale generative language models with little to no in-context supervision. This also helps gain insight into how well language models capture the semantics of a wide range of downstream tasks purely from self-supervised pre-training on massive corpora of unlabeled text. Such models have naturally also been exposed to a lot of undesirable content like racist and sexist language and there is limited work on awareness of models along these dimensions. In this paper, we define and comprehensively evaluate how well such language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing. We define three broad classes of task descriptions for these tasks: statement, question, and completion, with numerous lexical variants within each class. We study the efficacy of prompting for each task using these classes and the null task description across several decoding methods and few-shot examples. Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation. We believe our work is an important step towards unbiased language models by quantifying the limits of current self-supervision objectives at accomplishing such sociologically challenging tasks.
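
The three task-description classes can be sketched as prompt templates; the wording of each template below is an illustrative assumption, not quoted from the paper:

```python
# One template per class (statement, question, completion); the paper
# uses numerous lexical variants within each class.
TEMPLATES = {
    "statement": "The following text contains bias: {text}",
    "question": "Does the following text contain bias? {text}",
    "completion": "{text} The bias in this text is",
}

def build_prompt(kind, text):
    """Instantiate one task-description class for a given input text."""
    return TEMPLATES[kind].format(text=text)

prompt = build_prompt("question", "Example sentence.")
```

Each instantiated prompt would then be fed to the generative model under the various decoding methods and few-shot settings the study compares.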

en cs.CL, cs.AI
arXiv Open Access 2021
Machine Translation into Low-resource Language Varieties

Sachin Kumar, Antonios Anastasopoulos, Shuly Wintner et al.

State-of-the-art machine translation (MT) systems are typically trained to generate the "standard" target language; however, many languages have multiple varieties (regional varieties, dialects, sociolects, non-native varieties) that are different from the standard language. Such varieties are often low-resource, and hence do not benefit from contemporary NLP solutions, MT included. We propose a general framework to rapidly adapt MT systems to generate language varieties that are close to, but different from, the standard target language, using no parallel (source--variety) data. This also includes adaptation of MT systems to low-resource typologically-related target languages. We experiment with adapting an English--Russian MT system to generate Ukrainian and Belarusian, an English--Norwegian Bokmål system to generate Nynorsk, and an English--Arabic system to generate four Arabic dialects, obtaining significant improvements over competitive baselines.

en cs.CL
arXiv Open Access 2020
SLURP: A Spoken Language Understanding Resource Package

Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski et al.

Spoken Language Understanding infers semantic meaning directly from audio data, and thus promises to reduce error propagation and misunderstandings in end-user applications. However, publicly available SLU resources are limited. In this paper, we release SLURP, a new SLU package containing the following: (1) A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets; (2) Competitive baselines based on state-of-the-art NLU and ASR systems; (3) A new transparent metric for entity labelling which enables a detailed error analysis for identifying potential areas of improvement. SLURP is available at https://github.com/pswietojanski/slurp.

en cs.CL, cs.LG

Page 35 of 222,066