The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory
Romain Peyrichou
Every formal grammar defines a language and can in principle be used in three ways: to generate strings (production), to recognize them (parsing), or -- given only examples -- to infer the grammar itself (grammar induction). Generation and recognition are extensionally equivalent -- they characterize the same set -- but operationally asymmetric in multiple independent ways. Inference is a qualitatively harder problem: it does not have access to a known grammar. Despite the centrality of this triad to compiler design, natural language processing, and formal language theory, no survey has treated it as a unified, multidimensional phenomenon. We identify six dimensions along which generation and recognition diverge: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. We show that the common characterization "generation is easy, parsing is hard" is misleading: unconstrained generation is trivial, but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained (the input is given) while generation need not be. Two of these dimensions -- directionality and temporality -- have not previously been identified as dimensions of the generation-recognition asymmetry. We connect the temporal dimension to the surprisal framework of Hale (2001) and Levy (2008), arguing that surprisal formalizes the temporal asymmetry between a generator (surprisal = 0) and a parser that predicts under uncertainty (surprisal > 0). We review bidirectional systems in NLP and observe that bidirectionality has been available for fifty years yet has not transferred to most domain-specific applications. We conclude with a discussion of large language models, which architecturally unify generation and recognition while operationally preserving the asymmetry.
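The abstract's central claim, that unconstrained generation is trivial while recognition is always constrained by the given input, can be illustrated with a toy sketch (ours, not the paper's) using the grammar S -> 'a' S 'b' | 'ab' for the language a^n b^n:

```python
# Toy illustration of the generation-recognition asymmetry (not from the paper).
# Grammar: S -> 'a' S 'b' | 'ab', i.e. the language a^n b^n with n >= 1.
import random

RULES = {"S": [["a", "S", "b"], ["a", "b"]]}

def generate(symbol="S", rng=random.Random(0)):
    """Unconstrained generation: expand nonterminals freely; every walk
    through the rules yields a valid string, so no search is needed."""
    if symbol not in RULES:          # terminal symbol
        return symbol
    return "".join(generate(s, rng) for s in rng.choice(RULES[symbol]))

def recognize(s):
    """Recognition: the input is fixed, so the parser must check it against
    the grammar's structure (here, directly against the a^n b^n shape)."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n

assert recognize(generate())   # anything generated is in the language
assert not recognize("abb")    # an arbitrary input need not be
```

The asymmetry shows in the control flow: `generate` never backtracks or fails, while `recognize` must test its input and can reject it.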
Education and Awareness-Raising: A Proposal for the Holistic Development of Indigenous Women and Girls
Perla Elízabet Ventura Ramos, Jesús Zaratoga Martínez, Norma Yadira Memije Alarcón
Introduction: Education is an essential human right and a powerful tool for personal and community development. However, Indigenous women and girls face multiple challenges that limit their access to the education system and affect their holistic development.
Methods: This study proposes awareness-raising actions to contribute to the development of Indigenous women and girls in Atliaca, Guerrero, Mexico. Through a qualitative and explanatory approach, their sociocultural and educational context was analysed. Interviews and surveys were conducted with women, girls, educational authorities, teachers, parents, and community leaders, using descriptive statistics to identify key patterns.
Results: Four main aspects were identified: the gender and ethnicity gap in educational access, retention, and achievement; the social, cultural, economic, and political factors that limit their right to education; current initiatives to promote their education; and the benefits associated with their educational development.
Conclusions: Awareness-raising actions are proposed that make the benefits of education visible, create spaces for dialogue and collaboration, provide guidance and counselling, and promote support programmes and scholarships. These actions seek to guarantee the holistic development of Indigenous women and girls.
Philology. Linguistics, Language. Linguistic theory. Comparative grammar
Landscape and Cultural Memory: The Historic Waterfront of Santiago de Cuba as a Space of Identity and Heritage
Adrian Amed García Jardines
Introduction: This article explores the relationship between landscape and cultural memory along the waterfront of the Historic Urban Centre of Santiago de Cuba, focusing on Avenida Jesús Menéndez. It analyses how the old industrial buildings, linked to rum and beer production, form part of a heritage that integrates the urban, the natural, and the symbolic, reflecting the historical and cultural identity of the region.
Methods: Following the methodology of Espinosa Ocallaghan and Gómez Ortega (2022), landscape integration is assessed through an analysis of landscape units and areas of visual perception. This approach makes it possible to identify the visual quality and visibility of the landscape, as well as its relationship with the heritage and cultural values of the area.
Results: The study identifies mixed land use, a scarcity of services, and three landscape units of high value that suffer from a lack of attention. These units make it possible to prioritize the key elements for future interventions that integrate heritage conservation with community development.
Conclusions: The waterfront of Santiago de Cuba is a cultural and historical space, and a sustainable model is therefore proposed that involves the community in its conservation and valorization as a key resource of local identity.
The 9th Scientific Conference "Days of Andrija Štampar" (IX. znanstveni skup Dani Andrije Štampara)
Vlatka Dugački
Can Large Language Models Robustly Perform Natural Language Inference for Japanese Comparatives?
Yosuke Mikami, Daiki Matsuoka, Hitomi Yanaka
Large Language Models (LLMs) perform remarkably well in Natural Language Inference (NLI). However, NLI involving numerical and logical expressions remains challenging. Comparatives are a key linguistic phenomenon related to such inference, but the robustness of LLMs in handling them, especially in languages that are not dominant in the models' training data, such as Japanese, has not been sufficiently explored. To address this gap, we construct a Japanese NLI dataset that focuses on comparatives and evaluate various LLMs in zero-shot and few-shot settings. Our results show that the performance of the models is sensitive to the prompt formats in the zero-shot setting and influenced by the gold labels in the few-shot examples. The LLMs also struggle to handle linguistic phenomena unique to Japanese. Furthermore, we observe that prompts containing logical semantic representations help the models predict the correct labels for inference problems that they struggle to solve even with few-shot examples.
Can Vision-Language Models Infer Speaker's Ignorance? The Role of Visual and Linguistic Cues
Ye-eun Cho, Yunho Maeng
This study investigates whether vision-language models (VLMs) can perform pragmatic inference, focusing on ignorance implicatures: utterances that imply the speaker's lack of precise knowledge. To test this, we systematically manipulated contextual cues: the visually depicted situation (visual cue) and QUD-based linguistic prompts (linguistic cue). When only visual cues were provided, three state-of-the-art VLMs (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet) produced interpretations largely based on the lexical meaning of the modified numerals. When linguistic cues were added to enhance contextual informativeness, Claude exhibited more human-like inference by integrating both types of contextual cues. In contrast, GPT and Gemini favored precise, literal interpretations. Although the influence of contextual cues increased, they treated each contextual cue independently and aligned them with semantic features rather than engaging in context-driven reasoning. These findings suggest that although the models differ in how they handle contextual cues, Claude's ability to combine multiple cues may signal emerging pragmatic competence in multimodal models.
Bridging Gaps in Natural Language Processing for Yorùbá: A Systematic Review of a Decade of Progress and Prospects
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov
Natural Language Processing (NLP) is becoming a dominant subset of artificial intelligence as the need to help machines understand human language looks indispensable. Several NLP applications are ubiquitous, partly due to the myriad of datasets being churned out daily through media like social networking sites. However, this growing development has not been evident in most African languages, due to persistent resource limitations among other issues. The Yorùbá language, a tonal and morphologically rich African language, suffers a similar fate, resulting in limited NLP usage. To encourage further research towards improving this situation, this systematic literature review comprehensively analyses studies addressing NLP development for Yorùbá, identifying challenges, resources, techniques, and applications. A well-defined search string from a structured protocol was employed to search, select, and analyse 105 primary studies published between 2014 and 2024 in reputable databases. The review highlights the scarcity of annotated corpora, the limited availability of pre-trained language models, and linguistic challenges like tonal complexity and diacritic dependency as significant obstacles. It also identifies the prominent techniques, including rule-based methods. The findings reveal a growing body of multilingual and monolingual resources, even though the field is constrained by socio-cultural factors such as code-switching and the abandonment of the language in digital usage. This review synthesises existing research, providing a foundation for advancing NLP for Yorùbá and for African languages generally. It aims to guide future research by identifying gaps and opportunities, thereby contributing to the broader inclusion of Yorùbá and other under-resourced African languages in global NLP advancements.
On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena
Tarek Naous, Wei Xu
Language Models (LMs) have been shown to exhibit a strong preference towards entities associated with Western culture when operating in non-Western languages. In this paper, we aim to uncover the origins of entity-related cultural biases in LMs by analyzing several contributing factors, including the representation of entities in pre-training data and the impact of variations in linguistic phenomena across languages. We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities. Our evaluations using CAMeL-2 reveal that LMs show smaller performance gaps between cultures when tested in English than in Arabic. We find that LMs struggle in Arabic with entities that appear at high frequencies in pre-training, where entities can hold multiple word senses. This also extends to entities that exhibit high lexical overlap with languages that are not Arabic but use the Arabic script. Further, we show how frequency-based tokenization leads to this issue in LMs, which gets worse with larger Arabic vocabularies. We will make CAMeL-2 available at: https://github.com/tareknaous/camel2
Assessing the alignment between infants' visual and linguistic experience using multimodal language models
Alvin Wei Ming Tan, Jane Yang, Tarun Sepuri
et al.
Figuring out which objects or concepts words refer to is a central language learning challenge for young children. Most models of this process posit that children learn early object labels from co-occurrences of words and their referents that occur when someone around them talks about an object in the immediate physical environment. But how aligned in time are children's visual and linguistic experiences during everyday learning? To date, answers to this question have been limited by the need for labor-intensive manual annotations of vision-language co-occurrences. Here, we evaluate the use of contrastive language-image pretraining (CLIP) models to automatically characterize vision-language alignment in egocentric videos taken from the infant perspective in home environments. After validating CLIP alignment scores using human alignment judgments, we apply this metric to a large corpus of infant-perspective videos. We show that idealized aligned moments for learning (e.g., "look at the ball" with a ball present in the child's view) are relatively rare in children's everyday experiences compared to modern machine learning datasets, and highlight variability in alignment both within and across children. These findings suggest that infrequent alignment is a constraint for models describing early word learning and offer a new method for investigating children's multimodal environment.
LuxBank: The First Universal Dependency Treebank for Luxembourgish
Alistair Plum, Caroline Döhmer, Emilia Milano
et al.
The Universal Dependencies (UD) project has significantly expanded linguistic coverage across 161 languages, yet Luxembourgish, a West Germanic language spoken by approximately 400,000 people, has remained absent until now. In this paper, we introduce LuxBank, the first UD Treebank for Luxembourgish, addressing the gap in syntactic annotation and analysis for this 'low-research' language. We establish formal guidelines for Luxembourgish language annotation, providing the foundation for the first large-scale quantitative analysis of its syntax. LuxBank serves not only as a resource for linguists and language learners but also as a tool for developing spell checkers and grammar checkers, organising existing text archives and even training large language models. By incorporating Luxembourgish into the UD framework, we aim to enhance the understanding of syntactic variation within West Germanic languages and offer a model for documenting smaller, semi-standardised languages. This work positions Luxembourgish as a valuable resource in the broader linguistic and NLP communities, contributing to the study of languages with limited research and resources.
Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language Models
Linge Guo
This research critically navigates the intricate landscape of AI deception, concentrating on the deceptive behaviours of Large Language Models (LLMs). My objective is to elucidate this issue, examine the discourse surrounding it, and subsequently delve into its categorisation and ramifications. The essay begins with an evaluation of the AI Safety Summit 2023 and an introduction to LLMs, emphasising the multidimensional biases that underlie their deceptive behaviours. The literature review covers four categories of deception: strategic deception, imitation, sycophancy, and unfaithful reasoning, along with the social implications and risks they entail. Lastly, I take an evaluative stance on various aspects of navigating the persistent challenges of deceptive AI, encompassing international collaborative governance, the reconfigured engagement of individuals with AI, proposals for practical adjustments, and specific elements of digital education.
A Legal Framework for Natural Language Processing Model Training in Portugal
Rúben Almeida, Evelin Amorim
Recent advances in deep learning have promoted the advent of many computational systems capable of performing intelligent actions that, until then, were restricted to the human intellect. In the particular case of human languages, these advances allowed the introduction of applications like ChatGPT that are capable of generating coherent text without being explicitly programmed to do so. Instead, these models use large volumes of textual data to learn meaningful representations of human languages. Associated with these advances, concerns about copyright and data privacy infringements caused by these applications have emerged. Despite these concerns, the pace at which new natural language processing applications continued to be developed largely outstripped the introduction of new regulations. Today, communication barriers between legal experts and computer scientists lead to many unintentional legal infringements during the development of such applications. In this paper, a multidisciplinary team aims to bridge this communication gap and promote more compliant Portuguese NLP research by presenting a series of everyday NLP use cases and highlighting the Portuguese legislation that may apply during their development.
Facilitating large language model Russian adaptation with Learned Embedding Propagation
Mikhail Tikhomirov, Daniil Chernyshev
Rapid advances in large language model (LLM) technology have led to the introduction of powerful open-source instruction-tuned LLMs whose text generation quality matches that of state-of-the-art counterparts such as GPT-4. While the emergence of such models accelerates the adoption of LLM technologies in sensitive-information environments, their authors do not disclose the training data necessary to replicate the results, making the achievements model-exclusive. Since these open-source models are also multilingual, this in turn reduces the benefits of training language-specific LLMs, as improved inference efficiency becomes the only guaranteed advantage of such a costly procedure. More cost-efficient options, such as vocabulary extension and subsequent continued pre-training, are also inhibited by the lack of access to high-quality instruction-tuning data, which is the major factor behind an LLM's task-solving capabilities. To address these limitations and cut the costs of the language-adaptation pipeline, we propose Learned Embedding Propagation (LEP). Unlike existing approaches, our method has lower training-data requirements due to its minimal impact on existing LLM knowledge, which we reinforce using a novel ad-hoc embedding propagation procedure that allows skipping the instruction-tuning step and instead implants the new language knowledge directly into any existing instruct-tuned variant. We evaluated four Russian vocabulary adaptations for LLaMa-3-8B and Mistral-7B, showing that LEP is competitive with traditional instruction-tuning methods, achieving performance comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct, with further improvements via self-calibration and continued tuning enhancing task-solving capabilities.
Overcoming linguistic barriers in code assistants: creating a QLoRA adapter to improve support for Russian-language code writing instructions
C. B. Pronin, A. V. Volosova, A. V. Ostroukh
et al.
In this paper, an approach to training and evaluating an adapter model for the popular language model "zephyr-7b-beta" is described. The adapter was developed to improve the performance of the base model in tasks related to programming and understanding the Russian language. Considering the high quality of the original model on tasks in English, the goal of the research was to expand its linguistic and technical range. The proposed adapter was trained on a large and diverse dataset, including question-answer pairs related to programming as well as code-related texts in Russian. The applied training methodology improves the quality of the model's answers when understanding and generating Python code from Russian instructions. We evaluated the performance of the base model with the installed adapter using various metrics, comparing it to the base model as well as to other state-of-the-art models in this field. The results show significant improvement, both in tasks related to writing Python code and in processing the Russian language, confirming the effectiveness of the proposed adapter.
Trustworthiness of Children Stories Generated by Large Language Models
Prabin Bhandari, Hannah Marie Brennan
Large Language Models (LLMs) have shown a tremendous capacity for generating literary text. However, their effectiveness in generating children's stories has yet to be thoroughly examined. In this study, we evaluate the trustworthiness of children's stories generated by LLMs using various measures, and we compare and contrast our results with both old and new children's stories to better assess their significance. Our findings suggest that LLMs still struggle to generate children's stories at the level of quality and nuance found in actual stories.
"Mistakes Help Us Grow": Facilitating and Evaluating Growth Mindset Supportive Language in Classrooms
Kunal Handa, Margaret Clapper, Jessica Boyle
et al.
Teachers' growth mindset supportive language (GMSL), rhetoric emphasizing that one's skills can be improved over time, has been shown to significantly reduce disparities in academic achievement and enhance students' learning outcomes. Although teachers espouse growth mindset principles, most find it difficult to adopt GMSL in their practice due to the lack of effective coaching in this area. We explore whether large language models (LLMs) can provide automated, personalized coaching to support teachers' use of GMSL. We establish an effective coaching tool to reframe unsupportive utterances as GMSL by developing (i) a parallel dataset containing GMSL-trained teacher reframings of unsupportive statements with an accompanying annotation guide, (ii) a GMSL prompt framework to revise teachers' unsupportive language, and (iii) an evaluation framework grounded in psychological theory for evaluating GMSL with the help of students and teachers. We conduct a large-scale evaluation involving 174 teachers and 1,006 students, finding that both teachers and students perceive GMSL-trained teacher and model reframings as more effective in fostering a growth mindset and promoting challenge-seeking behavior, among other benefits. We also find that model-generated reframings outperform those from the GMSL-trained teachers. These results show promise for harnessing LLMs to provide automated GMSL feedback for teachers and, more broadly, for LLMs' potential to support students' learning in the classroom. Our findings also demonstrate the benefit of large-scale human evaluations when applying LLMs in educational domains.
Quantifying the Roles of Visual, Linguistic, and Visual-Linguistic Complexity in Verb Acquisition
Yuchen Zhou, Michael J. Tarr, Daniel Yurovsky
Children typically learn the meanings of nouns earlier than the meanings of verbs. However, it is unclear whether this asymmetry is a result of complexity in the visual structure of categories in the world to which language refers, the structure of language itself, or the interplay between the two sources of information. We quantitatively test these three hypotheses regarding early verb learning by employing visual and linguistic representations of words sourced from large-scale pre-trained artificial neural networks. Examining the structure of both visual and linguistic embedding spaces, we find, first, that the representation of verbs is generally more variable and less discriminable within domain than the representation of nouns. Second, we find that if only one learning instance per category is available, visual and linguistic representations are less well aligned in the verb system than in the noun system. However, in parallel with the course of human language development, if multiple learning instances per category are available, visual and linguistic representations become almost as well aligned in the verb system as in the noun system. Third, we compare the relative contributions of factors that may predict learning difficulty for individual words. A regression analysis reveals that visual variability is the strongest factor that internally drives verb learning, followed by visual-linguistic alignment and linguistic variability. Based on these results, we conclude that verb acquisition is influenced by all three sources of complexity, but that the variability of visual structure poses the most significant challenge for verb learning.
A Survey on Multimodal Large Language Models
Shukang Yin, Chaoyou Fu, Sirui Zhao
et al.
Recently, Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a rising research hotspot; they use powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even surpass GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of MLLMs and delineate related concepts, including architecture, training strategy and data, and evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularities, modalities, languages, and scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal In-Context Learning (M-ICL), Multimodal Chain of Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude, we discuss existing challenges and point out promising research directions. In light of the fact that the era of MLLMs has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub repository collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
Oleksandr Bondar's Scientific Heritage
N. Khrustyk
The article is dedicated to the scientific heritage of Oleksandr Ivanovych Bondar, a famous Ukrainian linguist, professor, and doctor of philological sciences, whose 70th birthday the scientific community will celebrate in 2022. The purpose of the article is to establish the place and significance of O. I. Bondar's scientific heritage in modern linguistics. The object of observation is the professor's scientific work; the subject of study is the range of problems it addresses; the source base is the linguist's scientific heritage. We set ourselves the following tasks: to reveal the problems addressed in Professor Bondar's scientific works; to determine the specifics of his scientific activity; and to describe fragments of the ideological space of his scientific discourse. The work uses such general scientific research methods as analysis and synthesis, induction and deduction, observation, and thematic and ideological portraiture of the scientist's personality. It was found that O. I. Bondar worked actively in such areas as the history of the Ukrainian language, Ukrainian dialectology, word formation, syntax, lexicography, functional grammar, phonology, lexicology, terminology, linguistic ecology, Ukrainian orthography, text linguistics, sociolinguistics, general linguistics, Slavic studies, Japanese studies, comparison of the Ukrainian and Japanese languages, translation studies, computational linguistics, and more. Among these are linguistic ecology and the comparative analysis of Ukrainian and Japanese, fields whose study the scientist himself initiated. Knowledge of languages, non-standard thinking, and the desire to learn new things, not only in his specialty but also far beyond it, served as the foundation on which Professor Bondar's research identity was formed. The article focuses on O. I. Bondar's innovative views: the study of the category of temporality in the Ukrainian language; the problem of determining the time of emergence of the Ukrainian language; the quantitative composition of phonemes in modern Ukrainian and the degree of their phonemicity; the theory of language as humanitarian capital; and the issue of language as the main ethnicity-forming factor for Ukrainians. Conclusions: the ideological space of the researcher's scientific works rests on the desire to study language and its diverse phenomena as an integral dynamic object in close relationship with all other realities of human existence, hence the multi-vector nature of the linguist's works. O. I. Bondar's scientific work, in particular the research perspectives he outlined, his unusual "optics" of scientific observation, and his distinctive methods of studying linguistic phenomena, is a source for the further development of many areas of linguistic science.
The most stable it's ever been: the preterit/present perfect alternation in spoken Ontario English
Karlien Franco, Sali A. Tagliamonte
English tense/aspect-marking is an area where variation abounds and where many theories have been formulated. Diachronic studies of the preterit/present perfect alternation indicate that the present perfect (e.g. I have eaten already) has been losing ground to the preterit (e.g. I ate already) (e.g. Elsness 1997, but see Hundt & Smith 2009, Werner 2014). However, few studies have examined this alternation in vernacular speech. This article fills this lacuna by analyzing spoken data from Ontario, Canada, from an apparent-time perspective. Using a large archive of multiple communities and people of different generations, we focus on linguistic contexts known to be variable, viz. with adverbs of indefinite time. Results indicate that, in contrast with previous studies, the alternation is mostly stable. We find evidence of change only with the adverb ever. Where there is evidence of change, this change is different from the predictions in the literature, with the preterit increasing in frequency. We suggest that a minor constructionalization process operates in tandem with ongoing specialization of the preterit/present perfect contrast. Taken together, these results provide another example of the importance of including speech in research on language variation and change and of the unique contribution certain constructions make to more general systems of grammar.