Results for "Language. Linguistic theory. Comparative grammar"

Showing 20 of ~4,441,695 results · from CrossRef, DOAJ, Semantic Scholar, arXiv

JSON API
arXiv Open Access 2026
Sign Language Recognition in the Age of LLMs

Vaclav Javorek, Jakub Honzik, Ivan Gruber et al.

Recent Vision Language Models (VLMs) have demonstrated strong performance across a wide range of multimodal reasoning tasks. This raises the question of whether such general-purpose models can also address specialized visual recognition problems such as isolated sign language recognition (ISLR) without task-specific training. In this work, we investigate the capability of modern VLMs to perform ISLR in a zero-shot setting. We evaluate several open-source and proprietary VLMs on the WLASL300 benchmark. Our experiments show that, under prompt-only zero-shot inference, current open-source VLMs lag far behind classic supervised ISLR classifiers. However, follow-up experiments reveal that these models capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. All our code is publicly available on GitHub.

en cs.CV, cs.CL
arXiv Open Access 2025
Flexible and Efficient Grammar-Constrained Decoding

Kanghee Park, Timothy Zhou, Loris D'Antoni

Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that would provably lead to outputs outside a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms must compute how a given LLM subword tokenizer aligns with the tokens used by a given context-free grammar and derive token masks from this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm, together with an implementation, that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.

en cs.CL, cs.AI
arXiv Open Access 2025
TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

Dominik Meier, Jan Philip Wahle, Paul Röttger et al.

As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.
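The vocabulary-partitioning encoding can be illustrated with a toy sketch. Everything here is hypothetical: the paper's scheme is learned by the LLM via fine-tuning, whereas this demo uses token-length parity as the partition function and hand-picked candidate lists, purely to show how each generated token can leak one secret bit.

```python
# Toy illustration of vocabulary partitioning for steganographic leakage.
# Length parity is an illustrative stand-in for a real partition function.

def partition(token: str) -> int:
    """Assign each token to partition 0 (even length) or 1 (odd length)."""
    return len(token) % 2

def encode(bits, candidates_per_step):
    """At each step, emit a plausible token from the partition matching the bit."""
    return [
        next(t for t in candidates if partition(t) == bit)
        for bit, candidates in zip(bits, candidates_per_step)
    ]

def decode(tokens):
    """The adversary recovers the secret bits from token partitions alone."""
    return [partition(t) for t in tokens]

steps = [["the", "an"], ["cat", "bird"], ["sat", "flew"]]
secret = [1, 0, 1]
message = encode(secret, steps)   # ['the', 'bird', 'sat'] -- looks natural
assert decode(message) == secret
```

The output still reads as ordinary text, which is why the abstract reports that such channels evade human detection; majority voting over several generations then drives bit accuracy above 97%.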

en cs.CL, cs.CR
arXiv Open Access 2025
Gender Encoding Patterns in Pretrained Language Model Representations

Mahdi Zakizadeh, Mohammad Taher Pilehvar

Gender bias in pretrained language models (PLMs) poses significant social and ethical challenges. Despite growing awareness, there is a lack of comprehensive investigation into how different models internally represent and propagate such biases. This study adopts an information-theoretic approach to analyze how gender biases are encoded within various encoder-based architectures. We focus on three key aspects: identifying how models encode gender information and biases, examining the impact of bias mitigation techniques and fine-tuning on the encoded biases and their effectiveness, and exploring how model design differences influence the encoding of biases. Through rigorous and systematic investigation, our findings reveal a consistent pattern of gender encoding across diverse models. Surprisingly, debiasing techniques often exhibit limited efficacy, sometimes inadvertently increasing the encoded bias in internal representations while reducing bias in model output distributions. This highlights a disconnect between mitigating bias in output distributions and addressing its internal representations. This work provides valuable guidance for advancing bias mitigation strategies and fostering the development of more equitable language models.

en cs.CL, cs.AI
arXiv Open Access 2025
Aplicação de Large Language Models na Análise e Síntese de Documentos Jurídicos: Uma Revisão de Literatura

Matheus Belarmino, Rackel Coelho, Roberto Lotudo et al.

Large Language Models (LLMs) have been increasingly used to optimize the analysis and synthesis of legal documents, enabling the automation of tasks such as summarization, classification, and retrieval of legal information. This study aims to conduct a systematic literature review to identify the state of the art in prompt engineering applied to LLMs in the legal context. The results indicate that models such as GPT-4, BERT, Llama 2, and Legal-Pegasus are widely employed in the legal field, and techniques such as Few-shot Learning, Zero-shot Learning, and Chain-of-Thought prompting have proven effective in improving the interpretation of legal texts. However, challenges such as biases in models and hallucinations still hinder their large-scale implementation. It is concluded that, despite the great potential of LLMs for the legal field, there is a need to improve prompt engineering strategies to ensure greater accuracy and reliability in the generated results.

en cs.CL
DOAJ Open Access 2024
Regulação da mídia e literacias digitais no combate a fake news: plataformização, inteligência artificial e algoritmos

Regina Rossetti, Renata Abibe Ferrarezi

This article proposes a discussion of disinformation, covering the machinery and methods that contribute to the spread of fake news online, as well as other phenomena that emerge in the wake of the Big Techs' business model. It argues that, with the advance of Artificial Intelligence, these formats exert ever greater control over human experience, narratives, and knowledge, a control that can be considered a threat to freedom of expression and the right to information. Facing this challenge, from a regulatory standpoint, the article evaluates the revision of norms that impose stricter limits on digital platforms; from an educational standpoint, it highlights the need to expand digital, media, and information literacy. The methodology involves bibliographic and documentary review.

Discourse analysis
arXiv Open Access 2024
Facilitating large language model Russian adaptation with Learned Embedding Propagation

Mikhail Tikhomirov, Daniil Chernyshev

Rapid advancements in large language model (LLM) technologies have led to powerful open-source instruction-tuned LLMs whose text generation quality matches state-of-the-art counterparts such as GPT-4. While the emergence of such models accelerates the adoption of LLM technologies in sensitive-information environments, their authors do not disclose the training data necessary to replicate the results, making the achievements model-exclusive. Since these open-source models are also multilingual, this reduces the benefit of training language-specific LLMs, as improved inference efficiency becomes the only guaranteed advantage of such a costly procedure. More cost-efficient options, such as vocabulary extension and subsequent continued pre-training, are likewise inhibited by the lack of access to high-quality instruction-tuning data, the major factor behind an LLM's task-solving capabilities. To address these limitations and cut the cost of the language adaptation pipeline, we propose Learned Embedding Propagation (LEP). Unlike existing approaches, our method has lower training data requirements because it has minimal impact on existing LLM knowledge, which we reinforce with a novel ad-hoc embedding propagation procedure that skips the instruction-tuning step and instead implants the new language knowledge directly into any existing instruct-tuned variant. We evaluated four Russian vocabulary adaptations for LLaMa-3-8B and Mistral-7B, showing that LEP is competitive with traditional instruction-tuning methods, achieving performance comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct, with self-calibration and continued tuning providing further improvements in task-solving capabilities.

en cs.CL, cs.AI
arXiv Open Access 2024
Probing Large Language Models for Scalar Adjective Lexical Semantics and Scalar Diversity Pragmatics

Fangru Lin, Daniel Altshuler, Janet B. Pierrehumbert

Scalar adjectives pertain to various domain scales and vary in intensity within each scale (e.g. certain is more intense than likely on the likelihood scale). Scalar implicatures arise from the consideration of alternative statements which could have been made. They can be triggered by scalar adjectives and require listeners to reason pragmatically about them. Some scalar adjectives are more likely to trigger scalar implicatures than others. This phenomenon is referred to as scalar diversity. In this study, we probe different families of Large Language Models such as GPT-4 for their knowledge of the lexical semantics of scalar adjectives and one specific aspect of their pragmatics, namely scalar diversity. We find that they encode rich lexical-semantic information about scalar adjectives. However, the rich lexical-semantic knowledge does not entail a good understanding of scalar diversity. We also compare current models of different sizes and complexities and find that larger models are not always better. Finally, we explain our probing results by leveraging linguistic intuitions and model training objectives.

en cs.CL
arXiv Open Access 2024
Measuring the Inconsistency of Large Language Models in Preferential Ranking

Xiutian Zhao, Ke Wang, Wei Peng

Despite large language models' (LLMs) recent advancements, their bias and hallucination issues persist, and their ability to offer consistent preferential rankings remains underexplored. This study investigates the capacity of LLMs to provide consistent ordinal preferences, a crucial aspect in scenarios with dense decision space or lacking absolute answers. We introduce a formalization of consistency based on order theory, outlining criteria such as transitivity, asymmetry, reversibility, and independence from irrelevant alternatives. Our diagnostic experiments on selected state-of-the-art LLMs reveal their inability to meet these criteria, indicating a strong positional bias and poor transitivity, with preferences easily swayed by irrelevant alternatives. These findings highlight a significant inconsistency in LLM-generated preferential rankings, underscoring the need for further research to address these limitations.
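The order-theoretic criteria the abstract formalizes can be checked mechanically. A minimal sketch, with hypothetical preference sets standing in for LLM-elicited pairwise rankings (the paper's actual formalization also covers reversibility and independence from irrelevant alternatives):

```python
# Check two of the consistency criteria on a set of strict pairwise
# preferences, each pair (a, b) meaning "a is preferred to b".

def is_asymmetric(prefs: set[tuple[str, str]]) -> bool:
    """No pair may be preferred in both directions."""
    return all((b, a) not in prefs for (a, b) in prefs)

def is_transitive(prefs: set[tuple[str, str]]) -> bool:
    """a > b and b > c must imply a > c."""
    return all(
        (a, c) in prefs
        for (a, b1) in prefs
        for (b2, c) in prefs
        if b1 == b2 and a != c
    )

consistent = {("A", "B"), ("B", "C"), ("A", "C")}
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}  # the kind of cycle positional bias produces

print(is_asymmetric(consistent), is_transitive(consistent))  # True True
print(is_asymmetric(cyclic), is_transitive(cyclic))          # True False
```

A cyclic set passes asymmetry but fails transitivity, which is exactly the failure mode the diagnostic experiments report.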

arXiv Open Access 2024
TEXT2AFFORD: Probing Object Affordance Prediction abilities of Language Models solely from Text

Sayantan Adak, Daivik Agrawal, Animesh Mukherjee et al.

We investigate the knowledge of object affordances in pre-trained language models (LMs) and pre-trained Vision-Language models (VLMs). A growing body of literature shows that PTLMs fail inconsistently and non-intuitively, demonstrating a lack of reasoning and grounding. To take a first step toward quantifying the effect of grounding (or lack thereof), we curate a novel and comprehensive dataset of object affordances -- Text2Afford, characterized by 15 affordance classes. Unlike affordance datasets collected in vision and language domains, we annotate in-the-wild sentences with objects and affordances. Experimental results reveal that PTLMs exhibit limited reasoning abilities when it comes to uncommon object affordances. We also observe that pre-trained VLMs do not necessarily capture object affordances effectively. Through few-shot fine-tuning, we demonstrate improvement in affordance knowledge in PTLMs and VLMs. Our research contributes a novel dataset for language grounding tasks, and presents insights into LM capabilities, advancing the understanding of object affordances. Codes and data are available at https://github.com/sayantan11995/Text2Afford

arXiv Open Access 2024
Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models

Minhao Bai, Kaiyi Pang, Yongfeng Huang

In the rapidly evolving domain of artificial intelligence, safeguarding the intellectual property of Large Language Models (LLMs) is increasingly crucial. Current watermarking techniques against model extraction attacks, which rely on signal insertion in model logits or post-processing of generated text, remain largely heuristic. We propose a novel method for embedding learnable linguistic watermarks in LLMs, aimed at tracing and preventing model extraction attacks. Our approach subtly modifies the LLM's output distribution by introducing controlled noise into token frequency distributions, embedding a statistically identifiable, controllable watermark. We leverage statistical hypothesis testing and information theory, particularly Kullback-Leibler Divergence, to effectively differentiate between original and modified distributions. Our watermarking method strikes a delicate balance between robustness and output quality, maintaining low false positive/negative rates and preserving the LLM's original performance.
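The distribution-comparison step can be illustrated numerically. A minimal sketch, assuming hypothetical token-frequency vectors (not the paper's data, noise model, or detector):

```python
# Illustrate how controlled noise in token frequencies makes a watermarked
# distribution statistically separable from the original via KL divergence.
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(p || q) in nats; assumes matching supports with q[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

original = [0.40, 0.30, 0.20, 0.10]   # hypothetical token frequencies
marked   = [0.45, 0.25, 0.20, 0.10]   # small controlled perturbation

assert kl_divergence(original, original) == 0   # identical -> no signal
assert kl_divergence(marked, original) > 0      # perturbed -> detectable
```

A hypothesis test would compare this divergence against a threshold calibrated to the desired false positive/negative rates; keeping the perturbation small is what preserves output quality.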

en cs.CR, cs.AI
arXiv Open Access 2024
L+M-24: Building a Dataset for Language + Molecules @ ACL 2024

Carl Edwards, Qingyun Wang, Lawrence Zhao et al.

Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. At this point, datasets have been released which are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the $\textit{L+M-24}$ dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, $\textit{L+M-24}$ is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.

en cs.CL, cs.AI
arXiv Open Access 2023
Language-Conditioned Change-point Detection to Identify Sub-Tasks in Robotics Domains

Divyanshu Raj, Chitta Baral, Nakul Gopalan

In this work, we present an approach to identify sub-tasks within a demonstrated robot trajectory using language instructions. We identify these sub-tasks using language provided during demonstrations as guidance to identify sub-segments of a longer robot trajectory. Given a sequence of natural language instructions and a long trajectory consisting of image frames and discrete actions, we want to map an instruction to a smaller fragment of the trajectory. Unlike previous instruction-following works, which directly learn the mapping from language to a policy, we propose a language-conditioned change-point detection method to identify sub-tasks in a problem. Our approach learns the relationship between constituent segments of a long language command and corresponding constituent segments of a trajectory. These constituent trajectory segments can be used to learn sub-tasks or sub-goals for planning, or options, as demonstrated by previous related work. Our insight is that the language-conditioned robot change-point detection problem is similar to existing video moment retrieval work used to identify sub-segments within online videos. Through extensive experimentation, we demonstrate a $1.78_{\pm 0.82}\%$ improvement over a baseline approach in accurately identifying sub-tasks within a trajectory using our proposed method. Moreover, we present a comprehensive study investigating sample complexity requirements for learning this mapping between language and trajectory sub-segments, to understand whether video retrieval-based methods are realistic in real robot scenarios.

en cs.RO, cs.AI
arXiv Open Access 2023
Implicit Self-supervised Language Representation for Spoken Language Diarization

Jagabandhu Mishra, S. R. Mahadeva Prasanna

In a code-switched (CS) scenario, spoken language diarization (LD) is essential as a pre-processing system. Further, implicit frameworks are preferable to explicit ones, as they can be easily adapted to low/zero-resource languages. Inspired by the speaker diarization (SD) literature, three frameworks, based on (1) fixed segmentation, (2) change-point-based segmentation, and (3) E2E modeling, are proposed to perform LD. Initial exploration with the synthetic TTSF-LD dataset shows that using the x-vector as an implicit language representation with an appropriate analysis window length ($N$) can achieve performance on par with explicit LD. The best implicit LD performance, a Jaccard error rate (JER) of $6.38$, is achieved with the E2E framework. However, with the same E2E framework, implicit LD performance degrades to a JER of $60.4$ on the practical Microsoft CS (MSCS) dataset. The difference in performance is mostly due to the distributional difference in the monolingual segment duration of the secondary language between the MSCS and TTSF-LD datasets. Moreover, to avoid segment smoothing, the short duration of the monolingual segments suggests using a small value of $N$. At the same time, with small $N$, the x-vector representation is unable to capture the required language discrimination because of acoustic similarity, as the same speaker is speaking both languages. Therefore, to resolve this issue, a self-supervised implicit language representation is proposed in this study. Compared with the x-vector representation, the proposed representation provides a relative improvement of $63.9\%$, achieving a JER of $21.8$ with the E2E framework.

en eess.AS, cs.CL
arXiv Open Access 2023
Explaining Vision and Language through Graphs of Events in Space and Time

Mihai Masala, Nicolae Cudlenco, Traian Rebedea et al.

Artificial Intelligence is making great advances today and is starting to bridge the gap between vision and language. However, we are still far from understanding, explaining, and explicitly controlling visual content from a linguistic perspective, because we still lack a common explainable representation between the two domains. In this work we address this limitation and propose the Graph of Events in Space and Time (GEST), with which we can represent, create, and explain both visual and linguistic stories. We provide a theoretical justification of our model and an experimental validation, which shows that GEST can bring solid complementary value alongside powerful deep learning models. In particular, GEST can help improve the generation of videos from text at the content level, as it is easily incorporated into our novel video generation engine. Additionally, by using efficient graph matching techniques, the GEST graphs can also improve comparisons between texts at the semantic level.

en cs.AI, cs.CL
DOAJ Open Access 2022
HABITUS MASYARAKAT KRAPYAK KIDUL KOTA PEKALONGAN TERKAIT TRADISI LOPIS RAKSASA

Divani Majidullah Syarief, Ufairoh Shoofii Abiyyi, Umu Hana Amini et al.

This study was conducted to identify the habitus of the Krapyak Kidul community in relation to the giant lopis tradition. The tradition is held annually on 8 Syawal, seven days after Eid al-Fitr, in Krapyak Kidul, Pekalongan City. The research is qualitative-descriptive, using Pierre Bourdieu's genetic structuralism approach. Data were collected through interviews with informants, recorded on a mobile device, and the data needed for the study were then extracted using listening and note-taking techniques. The collected data were analyzed through the concept of habitus in Pierre Bourdieu's genetic structuralism. The findings include: (1) the habitus of brotherhood: the giant lopis tradition carries a spirit of brotherhood that binds the community together; (2) the habitus of solidarity: the tradition serves as a medium to unite a heterogeneous community; (3) the religious habitus: the tradition is inseparable from religious values; (4) the habitus of sharing: the tradition teaches sharing with others; (5) the habitus of mutual cooperation: making the giant lopis involves a long process carried out collectively; (6) the habitus of hard work: the tradition's continued existence depends on the community's effort to preserve it; and (7) the trading habitus: the tradition boosts the local economy by creating opportunities to trade. Keywords: tradition, giant lopis, syawalan

Language. Linguistic theory. Comparative grammar, Computational linguistics. Natural language processing
S2 Open Access 2022
Expression of voice relation in the languages of different structure: is there a category of voice in Hungarian?

A. P. Guskova

Introduction. This article develops an earlier study on the grammatical categories of the verb in Hungarian in comparison with Russian. Its subject is the grammatical category of voice, one of the most common verbal categories, typical of Russian verbs and absent in Hungarian, as well as the ways this category is conveyed in translation into Hungarian. The purpose of this work is to analyze the grammatical forms of Hungarian and Russian verbs expressing voice relations and to establish the Hungarian equivalents of the Russian category of voice. Materials and Methods. The research material is verbal vocabulary selected from lexicographic works (dictionaries of Hungarian and Russian), linguistic and journalistic sources, and grammars of the Hungarian language. The article considers both the written form of modern Hungarian and examples from live colloquial speech. The methodological framework is based on research in general linguistics and the theory of morphology, which helped formulate the problem, determine how far it has been developed in the research literature, and survey conceptual approaches to voice as an ambiguously interpreted grammatical category of the verb. The methods used are comparative-typological, synchronous-descriptive, and functional-semantic. Results and Discussion. The study revealed how the inflectional and derivational means that express the grammatical semantics of voice in Russian can adequately convey the categorical meaning in Hungarian, where the category of voice is absent. The comparative study of the two languages helps to identify the features inherent in each, which contributes to solving problems in the practice of teaching them. The results of the ongoing research may also be useful in the analysis of other grammatical categories on the material of various languages. Conclusion. Although Hungarian verbs lack a passive voice, the means and forms of expressing it are represented by a developed and diverse system. The study is a step forward in understanding the typological specifics of the voice category and makes a contribution to the general theory of this phenomenon, being the first study of the verb forms of two heterogeneous, genetically unrelated languages based on comparative analysis in grammatical, semantic, and functional aspects.

arXiv Open Access 2022
Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale

Laurent Sartran, Samuel Barrett, Adhiguna Kuncoro et al.

We introduce Transformer Grammars (TGs), a novel class of Transformer language models that combine (i) the expressive power, scalability, and strong performance of Transformers and (ii) recursive syntactic compositions, which here are implemented through a special attention mask and deterministic transformation of the linearized tree. We find that TGs outperform various strong baselines on sentence-level language modeling perplexity, as well as on multiple syntax-sensitive language modeling evaluation metrics. Additionally, we find that the recursive syntactic composition bottleneck which represents each sentence as a single vector harms perplexity on document-level language modeling, providing evidence that a different kind of memory mechanism -- one that is independent of composed syntactic representations -- plays an important role in current successful models of long text.

arXiv Open Access 2022
Local Grammar-Based Coding Revisited

Łukasz Dębowski

In the setting of minimal local grammar-based coding, the input string is represented as a grammar with the minimal output length defined via simple symbol-by-symbol encoding. This paper discusses four contributions to this field. First, we invoke a simple harmonic bound on ranked probabilities, which is reminiscent of Zipf's law and simplifies universality proofs for minimal local grammar-based codes. Second, we refine known bounds on the vocabulary size, showing its partial power-law equivalence with mutual information and redundancy. These bounds are relevant for linking Zipf's law with the neural scaling law for large language models. Third, we develop a framework for universal codes with fixed infinite vocabularies, recasting universal coding as matching ranked patterns that are independent of empirical data. Finally, we analyze grammar-based codes whose finite vocabularies are empirical rank lists, proving that such codes are also universal. These results extend the foundations of universal grammar-based coding and reaffirm previously stated connections to power laws for human language and language models.
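A grammar-based code in this sense represents the input string as a grammar whose rules replace repeated substrings with nonterminals. A minimal Re-Pair-style sketch (one substitution round, invented for illustration, not the paper's minimal local code):

```python
# One round of Re-Pair-style grammar construction: replace the most
# frequent digram with a fresh nonterminal "A", yielding a shorter string
# plus a rule; symbol-by-symbol encoding of both gives the output length.

def repair_once(s: list[str]) -> tuple[list[str], dict[str, list[str]]]:
    """Substitute the most frequent digram of s with nonterminal 'A'."""
    digrams: dict[tuple[str, str], int] = {}
    for x, y in zip(s, s[1:]):
        digrams[(x, y)] = digrams.get((x, y), 0) + 1
    (x, y), _ = max(digrams.items(), key=lambda kv: kv[1])
    out, i = [], 0
    while i < len(s):
        if i + 1 < len(s) and (s[i], s[i + 1]) == (x, y):
            out.append("A")    # non-overlapping, left-to-right replacement
            i += 2
        else:
            out.append(s[i])
            i += 1
    return out, {"A": [x, y]}

compressed, rules = repair_once(list("abababab"))
print(compressed, rules)   # ['A', 'A', 'A', 'A'] {'A': ['a', 'b']}
```

Iterating such rounds until no digram repeats yields a full grammar; the vocabulary-size bounds in the abstract concern how many such nonterminals a minimal grammar needs.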

en cs.IT, cs.CL

Page 38 of 222085