The Indo-European Cognate Relationships dataset
Cormac Anderson, Matthew Scarborough, Lechosław Jocz
et al.
The Indo-European Cognate Relationships (IE-CoR) dataset is an open-access relational dataset showing how related, inherited words (‘cognates’) pattern across 160 languages of the Indo-European family. IE-CoR is intended as a benchmark dataset for computational research into the evolution of the Indo-European languages. It is structured around 170 reference meanings in the core lexicon and contains 25,731 lexeme entries, analysed into 4,981 cognate sets. Novel, dedicated structures are used to code all known cases of horizontal transfer. All 13 main documented clades of Indo-European, and their main subclades, are well represented. Time-calibration data for each language are also included, as are relevant geographical and social metadata. Data collection was performed by an expert consortium of 89 linguists drawing on 355 cited sources. The dataset is extendable to further languages and meanings and follows the Cross-Linguistic Data Formats (CLDF) protocols for linguistic data. It is designed to be interoperable with other cross-linguistic datasets and catalogues, and provides a reference framework for similar initiatives for other language families.
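As a rough illustration of what the CLDF packaging enables, the sketch below reads such a dataset with the pycldf library and groups lexeme forms by cognate set. The metadata path is a placeholder for wherever a release is unpacked, and this is a generic CLDF access pattern, not official IE-CoR tooling.

```python
# A minimal sketch of reading a CLDF dataset such as IE-CoR with pycldf.
from pycldf import Dataset

ds = Dataset.from_metadata("iecor/cldf/cldf-metadata.json")  # hypothetical path

# Index lexeme forms by ID, then print each cognacy judgement with its form.
forms = {
    f["id"]: f
    for f in ds.iter_rows("FormTable", "id", "languageReference", "form")
}
for judgement in ds.iter_rows("CognateTable", "formReference", "cognatesetReference"):
    form = forms[judgement["formReference"]]
    print(judgement["cognatesetReference"], form["languageReference"], form["form"])
```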
PolyTruth: Multilingual Disinformation Detection using Transformer-Based Language Models
Zaur Gouliev, Jennifer Waters, Chengqian Wang
Disinformation spreads rapidly across linguistic boundaries, yet most AI models are still benchmarked only on English. We address this gap with a systematic comparison of five multilingual transformer models (mBERT, XLM, XLM-RoBERTa, RemBERT, and mT5) on a common fake-versus-true classification task. While transformer-based language models have demonstrated notable success in detecting disinformation in English, their effectiveness in multilingual contexts remains under debate. To facilitate evaluation, we introduce the PolyTruth Disinfo Corpus, a novel corpus of 60,486 statement pairs (false claim vs. factual correction) spanning over twenty-five languages that collectively cover five language families and a broad topical range across politics, health, climate, finance, and conspiracy; half of the statements are fact-checked disinformation claims verified against an augmented MindBugs Discovery dataset. Our experiments reveal marked performance variation: models such as RemBERT achieve better overall accuracy, particularly in low-resource languages, whereas models like mBERT and XLM exhibit considerable limitations when training data is scarce. We discuss these performance patterns and their implications for real-world deployment. The dataset is publicly available on our GitHub repository to encourage further experimentation and advancement. Our findings illuminate both the potential and the current limitations of AI systems for multilingual disinformation detection.
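For context on what such a comparison involves, here is a minimal sketch, assuming a standard HuggingFace transformers workflow, of fine-tuning one of the listed models (XLM-RoBERTa) for binary claim classification. The placeholder texts and labels stand in for the PolyTruth data; this is not the authors' training setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2  # 0 = factual correction, 1 = false claim
)

# Placeholder statements; the real corpus pairs false claims with corrections.
texts = ["Vaccines contain microchips.", "Vaccines do not contain microchips."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss
loss.backward()  # one illustrative gradient step; a real run would use a Trainer loop
```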
Review: N. Herzig, Ciceros Paradoxa Stoicorum als interdisziplinäre Schullektüre für die Fächer Latein und Philosophie [Cicero's Paradoxa Stoicorum as Interdisciplinary School Reading for Latin and Philosophy] (Acta Didactica Classica 4), Heidelberg: Propylaeum 2023.
Joachim Penzel
Greek language and literature. Latin language and literature, Philology. Linguistics
MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model
Jiahao Huo, Yibo Yan, Boren Hu
et al.
Projecting visual features into word embedding space has become a significant fusion strategy adopted by Multimodal Large Language Models (MLLMs). However, its internal mechanisms have yet to be explored. Inspired by multilingual research, we identify domain-specific neurons in multimodal large language models. Specifically, we investigate the distribution of domain-specific neurons and the mechanism by which MLLMs process features from diverse domains. Furthermore, we propose a three-stage mechanism for language model modules in MLLMs when handling projected image features, and verify this hypothesis using the logit lens. Extensive experiments indicate that while current MLLMs exhibit Visual Question Answering (VQA) capability, they may not fully utilize domain-specific information. Properly manipulating domain-specific neurons changes accuracy by at most 10%, shedding light on the development of cross-domain, all-encompassing MLLMs in the future. The source code is available at https://github.com/Z1zs/MMNeuron.
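The logit-lens verification mentioned above has a well-known generic form: project each layer's hidden state through the model's final layer norm and LM head to see which token that layer already favours. The sketch below shows this on GPT-2 as a stand-in for an MLLM's language module; the attribute names transformer.ln_f and lm_head are GPT-2-specific assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The image shows a", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decode the last position of every layer through the final norm and LM head.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(f"layer {layer:2d} -> {tok.decode([int(logits.argmax())])!r}")
```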
Technical, organizational and oral history of the Cs-137 soil sample measurements following the Chernobyl accident fallout
Nick P. Petropoulos
Data, commentary and explanations are provided regarding the technical, the organizational and, of course, the human history connected to the period of research that resulted in the paper entitled "Soil sampling and Cs-137 analysis of the Chernobyl fallout in Greece", written by the late Professor S.E. Simopoulos. This paper has been provided in Greek translation within an honorary volume (ISBN 978-960-254-714-4). Reasonably, the narration starts with a review of the political, financial and social situation of Greece around 1986. Subsequently, an analysis is given of the then-available means, the persons involved, the methods used, the lessons learned and any other connection with the oral history of the NTUA's Nuclear Engineering Laboratory and other relevant Greek laboratories. Written proof of this history is now scarce, and the persons available to pass it on are becoming fewer and fewer. N.P. Petropoulos, now a Laboratory member and then a student of Professor S.E. Simopoulos, was in charge of preparing this text.
End-to-End Graph Flattening Method for Large Language Models
Bin Hong, Jinze Wu, Jiayu Liu
et al.
In recent years, the breakthrough of Large Language Models (LLMs) has offered new ideas for achieving universal methods on graph data. The common practice of converting graphs into natural language for LLMs, known as graph flattening, exhibits good generalizability and interpretability. However, poor organization of the textual format leads to poor performance in long-distance scenario understanding. Inspired by human cognitive reasoning habits, we propose a novel graph-flattening method suited to LLMs, termed End-to-End DAG-Path prompting (EEDP). Experiments on real-world datasets show that EEDP enhances the reasoning performance of LLMs in long-distance scenarios while maintaining excellent performance in short-distance scenarios, demonstrating good robustness in the face of distance variations.
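While the abstract does not spell out the EEDP prompt format, the general idea of flattening a DAG into textual root-to-leaf paths can be sketched as follows; the arrow serialization is an illustrative assumption, not the paper's exact format.

```python
def dag_paths(graph, node, path=None):
    """Enumerate all root-to-leaf paths in a DAG given as an adjacency dict."""
    path = (path or []) + [node]
    children = graph.get(node, [])
    if not children:
        yield path
    for child in children:
        yield from dag_paths(graph, child, path)

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
flattened = "\n".join(" -> ".join(p) for p in dag_paths(graph, "A"))
print(flattened)
# A -> B -> D
# A -> C -> D
```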
Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias
Jayanta Sadhu, Maneesha Rani Saha, Rifat Shahriyar
The rapid growth of Large Language Models (LLMs) has made the study of biases a crucial field. It is important to assess the influence of different types of biases embedded in LLMs to ensure fair use in sensitive domains. Although there has been extensive work on bias assessment in English, such efforts remain scarce for a major language like Bangla. In this work, we examine two types of social biases in LLM-generated outputs for the Bangla language. Our main contributions are: (1) bias studies on two different social biases for Bangla, (2) a curated dataset for bias-measurement benchmarking, and (3) tests of two different probing techniques for bias detection in the context of Bangla. To the best of our knowledge, this is the first work involving bias assessment of LLMs for Bangla. All our code and resources are publicly available to support the progress of bias-related research in Bangla NLP.
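One common form of the probing techniques mentioned above is template-based mask filling: compare a multilingual masked LM's scores for contrasting group terms in a shared slot. The sketch below uses an English template and mBERT purely for illustration; the paper's actual Bangla probes and models are not reproduced here.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# Contrast scores for two group terms in the same slot; in the paper the
# templates and terms would be in Bangla.
template = "The [MASK] is a doctor."
for result in fill(template, targets=["man", "woman"]):
    print(result["token_str"], result["score"])
```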
Attacks Directed at Caesar in Cicero’s Paradoxa Stoicorum
Daniela Galli
Paradoxa Stoicorum is a short essay written by Cicero in 46 BC focused on six Stoic paradoxa. In this essay Cicero does not limit himself to quoting the Stoic sources: he also applies the Stoic paradoxa to his own Roman world. Paradoxa Stoicorum is therefore rich in allusions and references to Cicero’s contemporary Rome. Politics at that time was dominated by the figure of Julius Caesar: in April 46 BC Caesar was appointed dictator for ten years, and the dictatorship granted him near-absolute power in both theory and practice. In 2000, Emanuele Narducci, one of the most astute scholars of Cicero, noted that Paradoxa Stoicorum is full of hidden opposition to Caesar’s dominion. Nevertheless, this aspect has not received further study from other scholars. This article examines the most direct attacks on Caesar in Paradoxon 5.
Philology. Linguistics, Greek language and literature. Latin language and literature
Review: Gaius Plinius Secundus, Über Bäume [On Trees]. Latin/German, selected, edited and translated by B. Herzhoff (RUB 14252), Ditzingen: Reclam 2022, 208 pages
Thomas Doepner
Greek language and literature. Latin language and literature, Philology. Linguistics
Low-Resource Languages Jailbreak GPT-4
Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach
AI safety training and red-teaming of large language models (LLMs) are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguard through translating unsafe English inputs into low-resource languages. On the AdvBench benchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can move users towards their harmful goals 79% of the time, which is on par with or even surpasses state-of-the-art jailbreaking attacks. Other high- and mid-resource languages have significantly lower attack success rates, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. Previously, limited training on low-resource languages primarily affected speakers of those languages, causing technological disparities. However, our work highlights a crucial shift: this deficiency now poses a risk to all LLM users. Publicly available translation APIs enable anyone to exploit LLMs' safety vulnerabilities. Our work therefore calls for more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage.
Massively Multilingual Language Models for Cross Lingual Fact Extraction from Low Resource Indian Languages
Bhavyajeet Singh, Pavan Kandru, Anubhav Sharma
et al.
Massive knowledge graphs like Wikidata attempt to capture world knowledge about multiple entities. Recent approaches concentrate on automatically enriching these KGs from text. However, much of the information present as natural text in low-resource languages is missed. Cross-lingual information extraction aims at extracting factual information in the form of English triples from low-resource Indian-language text. Despite its massive potential, progress on this task lags behind monolingual information extraction. In this paper, we propose the task of Cross Lingual Fact Extraction (CLFE) from text and devise an end-to-end generative approach for it, which achieves an overall F1 score of 77.46.
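A minimal sketch of the generative formulation, assuming a multilingual seq2seq model such as mT5 fine-tuned to emit linearized English triples; the prompt prefix and the triple format are illustrative assumptions, not the paper's exact setup.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# Hypothetical task prefix; a pretrained checkpoint needs fine-tuning on
# (low-resource text, linearized triple) pairs before this produces facts.
text = "extract facts: <low-resource-language sentence here>"
ids = tok(text, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
# expected after fine-tuning: "<subject> | <relation> | <object>"
```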
Large Language Models: The Need for Nuance in Current Debates and a Pragmatic Perspective on Understanding
Bram M. A. van Dijk, Tom Kouwenhoven, Marco R. Spruit
et al.
Current Large Language Models (LLMs) are unparalleled in their ability to generate grammatically correct, fluent text. LLMs are appearing rapidly, and debates on LLM capacities have taken off, but reflection is lagging behind. Thus, in this position paper, we first zoom in on the debate and critically assess three points recurring in critiques of LLM capacities: i) that LLMs only parrot statistical patterns in the training data; ii) that LLMs master formal but not functional language competence; and iii) that language learning in LLMs cannot inform human language learning. Drawing on empirical and theoretical arguments, we show that these points need more nuance. Second, we outline a pragmatic perspective on the issue of ‘real’ understanding and intentionality in LLMs. Understanding and intentionality pertain to unobservable mental states we attribute to other humans because they have pragmatic value: they allow us to abstract away from complex underlying mechanics and predict behaviour effectively. We reflect on the circumstances under which it would make sense for humans to similarly attribute mental states to LLMs, thereby outlining a pragmatic philosophical context for LLMs as an increasingly prominent technology in society.
Exploring the Landscape of Natural Language Processing Research
Tim Schopf, Karim Arabi, Florian Matthes
Natural language processing (NLP), an efficient approach to understanding, generating, and processing natural-language texts, has seen rapid growth and wide adoption in recent years. Given the increasing volume of research in this area, several NLP-related approaches have been surveyed in the research community. However, a comprehensive study that categorizes established topics, identifies trends, and outlines areas for future research has been absent. To help close this gap, we have systematically classified and analyzed research papers in the ACL Anthology. As a result, we present a structured overview of the research landscape, provide a taxonomy of fields of study in NLP, analyze recent developments, summarize our findings, and highlight directions for future work.
Review: R. Henneböhl (ed.), Catull: Gedichte [Catullus: Poems], Bad Driburg 2021
André Hümbs
Greek language and literature. Latin language and literature, Philology. Linguistics
Review: L. Rissmann, E. von Scheven (eds.), adeo 500. Illustrierter Grundwortschatz nach Sachgruppen [adeo 500: Illustrated Basic Vocabulary by Subject Groups], Bamberg 2020
Kristina Leutloff-Völkel
Greek language and literature. Latin language and literature, Philology. Linguistics
Radoslav Katičić – A Great Philologist
Mislav Ježić
Radoslav Katičić (3rd July 1930 – 10th August 2019) was a philologist, general and comparative linguist, classical philologist, Byzantologist, Paleo-Balkanologist, Indo-Iranian philologist and Indologist, Balto-Slavic and Croatian philologist, as well as a historian of Indian, Greek, Slavic and, in particular, Croatian literatures. With his scholarly, educational and public work from the middle of the 20th century until the first two decades of the 21st, he made great contributions to the humanities and to culture in general in Croatia. This paper presents Katičić's work in the fields of linguistics and literary history. His approach to these fields was comprehensive and multidisciplinary, and he achieved outstanding results in many areas: in the structuralist theory of comparative linguistics; in the survey and classification of the languages of the ancient Balkans; in his all-encompassing transformationalist description of Croatian syntax; in his comprehensive reconstruction of Balto-Slavic pre-Christian sacred texts; in his research on the reports and mythic stories about the Adriatic and South-East European regions in ancient Greek authors; in his research on literature and culture in Croatia in Late Antiquity and the Early Middle Ages; and in his synthetic survey of the history of the Croatian language, where he demonstrated, in accordance with his structuralist thesis on comparative linguistics, that the history of a language, in this case Croatian, determines its current structure. This is what equally determines and differentiates Bosniak/Bosnian, Montenegrin and Serbian among the genetically and typologically closest literary or standard languages. He used to call all the fields of his multidisciplinary scholarship simply: true philology.
Language Control Diffusion: Efficiently Scaling through Space, Time, and Tasks
Edwin Zhang, Yujie Lu, Shinda Huang
et al.
Training generalist agents is difficult across several axes, requiring us to deal with high-dimensional inputs (space), long horizons (time), and generalization to novel tasks. Recent architectural advances have allowed improved scaling along one or two of these axes, but remain computationally prohibitive to use. In this paper, we propose to address all three axes by leveraging Language to Control Diffusion models as a hierarchical planner conditioned on language (LCD). We effectively and efficiently scale diffusion models for planning in extended temporal, state, and task dimensions to tackle long-horizon control problems conditioned on natural-language instructions, as a step towards generalist agents. Comparing LCD with other state-of-the-art models on the CALVIN language robotics benchmark shows that LCD outperforms other SOTA methods in multi-task success rates, while improving inference speed over other comparable diffusion models by 3.3x to 15x. We show that LCD can successfully leverage the unique strength of diffusion models to produce coherent long-range plans while addressing their weakness in generating low-level details and control.
Controlling Translation Formality Using Pre-trained Multilingual Language Models
Elijah Rippeth, Sweta Agrawal, Marine Carpuat
This paper describes the University of Maryland's submission to the Special Task on Formality Control for Spoken Language Translation at IWSLT, which evaluates translation from English into 6 languages with diverse grammatical formality markers. We investigate to what extent this problem can be addressed with a single multilingual model, simultaneously controlling its output for target language and formality. Results show that this strategy can approach the translation quality and formality control achieved by dedicated translation models. However, the nature of the underlying pre-trained language model and of the fine-tuning samples greatly impacts results.
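One standard way to realize such single-model control is to prepend a control tag to the source sentence so the model conditions on both target language and register. The sketch below illustrates the pattern with mBART-50; the <formal> tag is hypothetical and would only take effect after fine-tuning, and this is not the actual UMD system.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

tok.src_lang = "en_XX"
src = "<formal> Could you tell me where the station is?"  # hypothetical control tag
batch = tok(src, return_tensors="pt")
out = model.generate(
    **batch,
    forced_bos_token_id=tok.lang_code_to_id["de_DE"],  # select the target language
    max_new_tokens=50,
)
print(tok.decode(out[0], skip_special_tokens=True))
```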
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Sid Black, Stella Biderman, Eric Hallahan
et al.
We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission. In this work, we describe GPT-NeoX-20B's architecture and training and evaluate its performance on a range of language-understanding, mathematics, and knowledge-based tasks. We find that GPT-NeoX-20B is a particularly powerful few-shot reasoner and gains far more in performance when evaluated five-shot than similarly sized GPT-3 and FairSeq models. We open-source the training and evaluation code, as well as the model weights, at https://github.com/EleutherAI/gpt-neox.
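The few-shot evaluation pattern described above can be reproduced with the released checkpoint roughly as follows; the toy Q/A prompt is an illustrative assumption, and the 20B model requires substantial GPU memory (device_map="auto" needs the accelerate package installed).

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b", device_map="auto"
)

# In-context examples followed by the query (a real five-shot run would use five).
prompt = (
    "Q: What is 2 + 2?\nA: 4\n\n"
    "Q: What is the capital of France?\nA: Paris\n\n"
    "Q: Who wrote Hamlet?\nA:"
)
ids = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=5)
print(tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
```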
An Erasmus+ Project to Support the Teaching of Modern Greek (Projekt Erasmus+ na podporu výuky novořečtiny)
Simone Sumelidu
History of Greece, Translating and interpreting