Are formal and functional linguistic mechanisms dissociated in language models?
Michael Hanna, Yonatan Belinkov, Sandro Pezzelle
Although large language models (LLMs) are increasingly capable, these capabilities are unevenly distributed: they excel at formal linguistic tasks, such as producing fluent, grammatical text, but struggle more with functional linguistic tasks like reasoning and consistent fact retrieval. Inspired by neuroscience, recent work suggests that to succeed on both formal and functional linguistic tasks, LLMs should use different mechanisms for each; such localization could either be built-in or emerge spontaneously through training. In this paper, we ask: do current models, with fast-improving functional linguistic abilities, exhibit distinct localization of formal and functional linguistic mechanisms? We answer this by finding and comparing the "circuits", or minimal computational subgraphs, responsible for various formal and functional tasks. Comparing 5 LLMs across 10 distinct tasks, we find that while there is indeed little overlap between circuits for formal and functional tasks, there is also little overlap between formal linguistic tasks themselves, in contrast to the unified language network observed in the human brain. Thus, a single formal linguistic network, unified and distinct from functional task circuits, remains elusive. However, in terms of cross-task faithfulness (the ability of one circuit to solve another's task), we observe a separation between formal and functional mechanisms, suggesting that shared mechanisms between formal tasks may exist.
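The circuit comparison described in this abstract can be illustrated as simple set overlap between the components (e.g., attention heads and MLPs) that each task's circuit contains; the sketch below uses invented circuits and component names purely for illustration, not the paper's actual circuits:

```python
def circuit_overlap(circuit_a, circuit_b):
    """Intersection-over-union (Jaccard index) of two circuits,
    each given as a set of model components (e.g., head or MLP IDs)."""
    a, b = set(circuit_a), set(circuit_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical circuits for two formal tasks (component names are illustrative).
subject_verb_agreement = {"a3.h2", "a5.h1", "mlp4", "a9.h7"}
past_tense_inflection = {"a3.h2", "mlp6", "a11.h0"}

# A low value here would indicate little shared machinery between the tasks.
print(circuit_overlap(subject_verb_agreement, past_tense_inflection))
```

Cross-task faithfulness, by contrast, is a behavioral rather than structural measure: it asks how well the model performs on task B when only task A's circuit is kept active, so low structural overlap and high cross-task faithfulness can coexist.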
HERITRACE in action: the ParaText project as a case study for semantic data management in Classical Philology
Francesca Filograsso, Arcangelo Massari, Camillo Neri
et al.
HERITRACE is a semantic data editor designed for cultural heritage institutions, addressing the gap between complex Semantic Web technologies and the needs of domain experts. ParaText, a specialized bibliographical database for ancient Greek exegesis, demonstrates HERITRACE's capabilities in Classical Philology. This paper examines how HERITRACE enables non-technical scholars to manage complex semantic data through SHACL-based form generation and validation, while ensuring comprehensive provenance tracking and change management via an adaptation of the OpenCitations Data Model.
Linguistic Features Extracted by GPT-4 Improve Alzheimer's Disease Detection based on Spontaneous Speech
Jonathan Heitz, Gerold Schneider, Nicolas Langer
Alzheimer's Disease (AD) is a significant and growing public health concern. Investigating alterations in speech and language patterns offers a promising path towards cost-effective and non-invasive early detection of AD on a large scale. Large language models (LLMs), such as GPT, have enabled powerful new possibilities for semantic text analysis. In this study, we leverage GPT-4 to extract five semantic features from transcripts of spontaneous patient speech. The features capture known symptoms of AD, but they are difficult to quantify effectively using traditional methods of computational linguistics. We demonstrate the clinical significance of these features and further validate one of them ("Word-Finding Difficulties") against a proxy measure and human raters. When combined with established linguistic features and a Random Forest classifier, the GPT-derived features significantly improve the detection of AD. Our approach proves effective for both manually transcribed and automatically generated transcripts, representing a novel and impactful use of recent advancements in LLMs for AD speech analysis.
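The feature-combination step described above can be sketched as assembling GPT-elicited ratings and classical linguistic features into one fixed-order vector per transcript, ready for a classifier such as a random forest; the feature names and values below are illustrative placeholders, not the study's actual feature set:

```python
# Illustrative feature inventories (NOT the paper's actual features).
GPT_FEATURES = ["word_finding_difficulty", "topic_maintenance"]
CLASSIC_FEATURES = ["type_token_ratio", "mean_sentence_length"]

def to_vector(transcript_feats):
    """Flatten a transcript's feature dict into a fixed-order numeric vector,
    so every transcript yields one row of the classifier's design matrix."""
    order = GPT_FEATURES + CLASSIC_FEATURES
    return [float(transcript_feats[name]) for name in order]

sample = {
    "word_finding_difficulty": 4,   # hypothetical 1-5 rating elicited from GPT-4
    "topic_maintenance": 2,         # hypothetical 1-5 rating elicited from GPT-4
    "type_token_ratio": 0.61,       # computed directly from the transcript
    "mean_sentence_length": 9.8,    # computed directly from the transcript
}
print(to_vector(sample))
```

Rows built this way can be fed to any off-the-shelf classifier; the study pairs them with a Random Forest, which handles mixed feature scales without normalization.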
Generating Feature Vectors from Phonetic Transcriptions in Cross-Linguistic Data Formats
Arne Rubehn, Jessica Nieder, Robert Forkel
et al.
When comparing speech sounds across languages, scholars often make use of feature representations of individual sounds in order to determine fine-grained sound similarities. Although binary feature systems for large numbers of speech sounds have been proposed, large-scale computational applications often face the challenge that these systems -- even when they list features for several thousand sounds -- cover only a small part of the numerous speech sounds attested in actual cross-linguistic data. To address the problem of missing data for attested speech sounds, we propose a new approach that can create binary feature vectors dynamically for all sounds that can be represented in the standardized version of the International Phonetic Alphabet proposed by the Cross-Linguistic Transcription Systems (CLTS) reference catalog. Since CLTS is actively used in large data collections covering more than 2,000 distinct language varieties, our procedure for generating binary feature vectors provides immediate access to a very large collection of multilingual wordlists. Testing our feature system in different ways on different datasets shows that it not only provides a straightforward means of comparing the similarity of speech sounds, but also has potential for future cross-linguistic machine learning applications.
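The core idea of generating feature vectors dynamically can be sketched with a toy system in which a complex sound inherits the vector of its base sound and diacritics override individual features; the feature inventory and values below are invented for illustration and are not the CLTS-based system itself:

```python
# Toy feature inventory (invented; real systems use dozens of features).
FEATURES = ["voiced", "nasal", "labial", "long"]

BASE = {
    "p": {"voiced": 0, "nasal": 0, "labial": 1, "long": 0},
    "b": {"voiced": 1, "nasal": 0, "labial": 1, "long": 0},
    "m": {"voiced": 1, "nasal": 1, "labial": 1, "long": 0},
}
DIACRITICS = {"ː": {"long": 1}}  # the length mark overrides a single feature

def vector(sound):
    """Build a binary feature vector for a possibly complex sound
    by starting from its base sound and applying diacritic overrides."""
    feats = dict(BASE[sound[0]])
    for mark in sound[1:]:
        feats.update(DIACRITICS[mark])
    return [feats[f] for f in FEATURES]

def similarity(s1, s2):
    """Proportion of features on which two sounds agree."""
    v1, v2 = vector(s1), vector(s2)
    return sum(a == b for a, b in zip(v1, v2)) / len(FEATURES)

print(vector("pː"))          # long /p/ gets a vector no lookup table listed
print(similarity("b", "m"))  # /b/ and /m/ differ only in nasality here
```

Because the vector is composed on the fly, any combination of base sound and diacritics representable in the transcription system gets a vector, which is precisely what lookup-table approaches cannot guarantee.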
The Curious Case of Representational Alignment: Unravelling Visio-Linguistic Tasks in Emergent Communication
Tom Kouwenhoven, Max Peeperkorn, Bram van Dijk
et al.
Natural language has the universal properties of being compositional and grounded in reality. The emergence of linguistic properties is often investigated through simulations of emergent communication in referential games. However, these experiments have yielded mixed results compared to similar experiments addressing linguistic properties of human language. Here we address representational alignment as a potential contributing factor to these results. Specifically, we assess the representational alignment between agent image representations and between agent representations and input images. Doing so, we confirm that the emergent language does not appear to encode human-like conceptual visual features, since agent image representations drift away from inputs whilst inter-agent alignment increases. We moreover identify a strong relationship between inter-agent alignment and topographic similarity, a common metric for compositionality, and address its consequences. To address these issues, we introduce an alignment penalty that prevents representational drift but interestingly does not improve performance on a compositional discrimination task. Together, our findings emphasise the key role representational alignment plays in simulations of language emergence.
From Spectatorship to Participation: A Study on Digital Natives' New Culture of Seeing in Interactive Cinema in Turkey
Ferhat Zengin, Uğur Baloğlu, Yıldız Derya Birincioğlu
The technological transformations brought about by the digital age are significantly changing visual culture practices and viewer experiences. This change is particularly evident in the media consumption habits of digital natives. This study examines how the visual and auditory techniques offered by interactive cinema shape viewers' culture of seeing and politics of seeing. It focuses on how cinema's narrative possibilities, expanded through digitalization, give viewers the ability to direct and manage the story, and thus on the resulting changes in viewing practices. The theoretical framework addresses the transformations of the cinema-viewer relationship in the contexts of interactive and cyber drama and non-linear narrative. Built on a phenomenological design, one of the qualitative research methods, the study had 10 participants watch the interactive film Late Shift, conducted in-depth interviews with them, and analyzed the resulting data using descriptive and thematic analysis techniques. According to the findings, in the interactive film experience viewers enter a new mode of identification, culture of seeing, and viewing experience within an illusion of creating, directing, and representing. The study shows that interactive cinema moves beyond the single-character identification of traditional cinema to create a participatory experience offering viewers dynamic transitions among the roles of character, director, and screenwriter, and that it thus represents a cultural paradigm shift rather than a merely technological innovation.
Communication. Mass media
ON THE INTERVENTION OF NON-STATE ACTORS IN THE SCHOOLING OF INTERNALLY DISPLACED PUPILS (EDI)
Issiaka OUEDRAOGO, Goama NAKOULMA & Sylvie KOROGO
Abstract: The security crisis facing Burkina Faso has affected its education system since 2016. It has caused the closure of several thousand schools and the displacement of pupils from their usual places of residence to relatively safer areas. In these host areas, their educational needs are not met, which jeopardizes their chances of continuing their studies. To support the Burkinabè state in meeting these emergency needs, non-state actors are carrying out actions in favor of the schooling of internally displaced pupils (EDI). As part of research on the schooling mechanisms for these pupils, surveys were conducted in the cities of Kaya and Fada N'Gourma. The results reveal the multiplicity and diversity of these actors. Some have experience of intervening in the humanitarian sector or in education, others in both, while some have experience in neither. Their actions contribute to the schooling of many displaced pupils. However, they have limitations related to intervention practices and to the inadequacy of the aid provided relative to demand, and they take little account of certain priority needs of these pupils. It is therefore essential to improve governance in the field of Education in Emergency Situations (ESU) in order to increase the effectiveness of these interventions.
Keywords: Intervention - Non-state actor - Schooling - EDI - Host area
Arts in general, Computational linguistics. Natural language processing
Introduction
Artur Gałkowski, Rafał Zarębski
The Role of Critical Discourse Analysis in Translation: A Case of the Political Speech
Barış Can Aydın
This study aims to provide insights into the theoretical background of applying critical discourse analysis (CDA) to the translation of political texts within translation studies. It also sheds light on ideological and discursive issues in translation through the use of CDA, as well as on political discourse and translation. CDA is crucial for understanding the role and significance of discourse in the translation of a political text without disregarding the literary sense, the speaker's authentic style in the target language, and rhetorical devices. In this regard, the study considers the case of a political speech to demonstrate the role and significance of CDA in its translation. To this end, it takes Donald Trump's inaugural address, translated into Turkish by the study's author through a critical lens. Following a critical approach and Norman Fairclough's (1995) model of CDA in the interpretation and translation of political discourse, the study provides explanations and solutions for the difficulties encountered in interpreting and translating a political speech. The comparison of the source text with the proposed target text, offered and discussed here, helps underline and raise awareness of the contributions of CDA to translation studies.
Translating and interpreting
Logion: Machine Learning for Greek Philology
Charlie Cowen-Breen, Creston Brooks, Johannes Haubold
et al.
This paper presents machine-learning methods to address various problems in Greek philology. After training a BERT model on the largest premodern Greek dataset used for this purpose to date, we identify and correct previously undetected errors made by scribes in the process of textual transmission, in what is, to our knowledge, the first successful identification of such errors via machine learning. Additionally, we demonstrate the model's capacity to fill gaps caused by material deterioration of premodern manuscripts and compare the model's performance to that of a domain expert. We find that best performance is achieved when the domain expert is provided with model suggestions for inspiration. With such human-computer collaborations in mind, we explore the model's interpretability and find that certain attention heads appear to encode select grammatical features of premodern Greek.
The Emergence of Chunking Structures with Hierarchical RNN
Zijun Wu, Anup Anand Deshmukh, Yongkang Wu
et al.
In Natural Language Processing (NLP), predicting linguistic structures, such as parsing and chunking, has mostly relied on manual annotations of syntactic structures. This paper introduces an unsupervised approach to chunking, a syntactic task that involves grouping words in a non-hierarchical manner. We present a Hierarchical Recurrent Neural Network (HRNN) designed to model word-to-chunk and chunk-to-sentence compositions. Our approach involves a two-stage training process: pretraining with an unsupervised parser and finetuning on downstream NLP tasks. Experiments on multiple datasets reveal a notable improvement in unsupervised chunking performance in both the pretraining and finetuning stages. Interestingly, we observe that the emergence of the chunking structure is transient during the neural model's downstream-task training. This study contributes to the advancement of unsupervised syntactic structure discovery and opens avenues for further research in linguistic theory.
A Survey of Code-switching: Linguistic and Social Perspectives for Language Technologies
A. Seza Doğruöz, Sunayana Sitaram, Barbara E. Bullock
et al.
The analysis of data in which multiple languages are represented has gained popularity among computational linguists in recent years. So far, much of this research focuses mainly on improving computational methods and largely ignores the linguistic and social aspects of code-switching (C-S) discussed across a wide range of languages within the long-established linguistics literature. To fill this gap, we offer a survey of C-S covering the linguistics literature, with a reflection on key issues in language technologies. From the linguistic perspective, we provide an overview of structural and functional patterns of C-S, focusing on the literature from European and Indian contexts as highly multilingual areas. From the language technologies perspective, we discuss how massive language models fail to represent diverse C-S types due to a lack of appropriate training data, a lack of robust evaluation benchmarks for C-S (across multilingual situations and types of C-S), and a lack of end-to-end systems that cover the sociolinguistic aspects of C-S. Our survey is a step towards mutual benefit for computational scientists and linguists with a shared interest in multilingualism and C-S.
Exploring Large Language Models for Classical Philology
Frederick Riemenschneider, Anette Frank
Recent advances in NLP have led to the creation of powerful language models for many languages including Ancient Greek and Latin. While prior work on Classical languages unanimously uses BERT, in this work we create four language models for Ancient Greek that vary along two dimensions to study their versatility for tasks of interest for Classical languages: we explore (i) encoder-only and encoder-decoder architectures using RoBERTa and T5 as strong model types, and create for each of them (ii) a monolingual Ancient Greek and a multilingual instance that includes Latin and English. We evaluate all models on morphological and syntactic tasks, including lemmatization, which demonstrates the added value of T5's decoding abilities. We further define two probing tasks to investigate the knowledge acquired by models pre-trained on Classical texts. Our experiments provide the first benchmarking analysis of existing models of Ancient Greek. Results show that our models provide significant improvements over the SoTA. The systematic analysis of model types can inform future research in designing language models for Classical languages, including the development of novel generative tasks. We make all our models available as community resources, along with a large curated pre-training corpus for Ancient Greek, to support the creation of a larger, comparable model zoo for Classical Philology. Our models and resources are available at https://github.com/Heidelberg-NLP/ancient-language-models.
Editorial: Towards 2030: sustainable development goal 4: quality education. A communication perspective
Antonia Liguori, Philippa Rappoport, Antonella Poce
et al.
Communication. Mass media
LINGUIST: Language Model Instruction Tuning to Generate Annotated Utterances for Intent Classification and Slot Tagging
Andy Rosenbaum, Saleh Soltan, Wael Hamza
et al.
We present LINGUIST, a method for generating annotated data for Intent Classification and Slot Tagging (IC+ST), via fine-tuning AlexaTM 5B, a 5-billion-parameter multilingual sequence-to-sequence (seq2seq) model, on a flexible instruction prompt. In a 10-shot novel intent setting for the SNIPS dataset, LINGUIST surpasses state-of-the-art approaches (Back-Translation and Example Extrapolation) by a wide margin, showing absolute improvement for the target intents of +1.9 points on IC Recall and +2.5 points on ST F1 Score. In the zero-shot cross-lingual setting of the mATIS++ dataset, LINGUIST outperforms a strong baseline of Machine Translation with Slot Alignment by +4.14 points absolute on ST F1 Score across 6 languages, while matching performance on IC. Finally, we verify our results on an internal large-scale multilingual dataset for conversational agent IC+ST and show significant improvements over a baseline which uses Back-Translation, Paraphrasing and Slot Catalog Resampling. To our knowledge, we are the first to demonstrate instruction fine-tuning of a large-scale seq2seq model to control the outputs of multilingual intent- and slot-labeled data generation.
Can Linguistic Distance help Language Classification? Assessing Hawrami-Zaza and Kurmanji-Sorani
Hossein Hassani
Whether Hawrami and Zaza (Zazaki) should be considered standalone languages or dialects of a single language has been discussed and debated for a while among linguists studying Iranian languages. The question of whether these languages/dialects belong to the Kurdish language or are independent descendants of Iranian languages was answered by MacKenzie (1961). However, a majority of the people who speak these dialects disagree with that answer. Their disapproval seems to be based mainly on the sociological, cultural, and historical relationships among the speakers of the dialects. While the case of Hawrami and Zaza has remained unexplored and under-examined, an almost unanimous agreement exists about the classification of Kurmanji and Sorani as Kurdish dialects. Studies addressing these cases are primarily qualitative; computational linguistics, however, can approach the question from a quantitative perspective. In this research, we look into three questions from a linguistic distance point of view. First, how similar or dissimilar are Hawrami and Zaza, given that the two share no common geographical coexistence? Second, what about Kurmanji and Sorani, which do have geographical overlap? Finally, what is the pairwise distance among all these dialects? We base our computation on phonetic representations of these dialects (languages), calculate various linguistic distances among the pairs, and analyze and discuss the results.
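One common way to operationalize such a linguistic distance is the length-normalized edit distance between phonetic transcriptions, averaged over a parallel wordlist; the paper's exact metrics may differ, and the transcriptions below are invented for illustration:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def dialect_distance(wordlist_a, wordlist_b):
    """Mean length-normalized edit distance over aligned word pairs:
    0.0 means identical wordlists, values near 1.0 mean no overlap."""
    dists = [edit_distance(a, b) / max(len(a), len(b))
             for a, b in zip(wordlist_a, wordlist_b)]
    return sum(dists) / len(dists)

# Hypothetical aligned phonetic transcriptions for two dialects.
dialect_1 = ["bra", "min", "sor"]
dialect_2 = ["bira", "min", "sur"]
print(round(dialect_distance(dialect_1, dialect_2), 3))
```

Computing this quantity for each pair of dialects yields the pairwise distance matrix the abstract's third question asks about.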
Linguistic Cues of Deception in a Multilingual April Fools' Day Context
Katerina Papantoniou, Panagiotis Papadakos, Giorgos Flouris
et al.
In this work we consider the collection of deceptive April Fools' Day (AFD) news articles as a useful addition to existing datasets for deception detection tasks. Such collections have an established ground truth and are relatively easy to construct across languages. As a result, we introduce a corpus that includes diachronic AFD and normal articles from Greek newspapers and news websites. On top of that, we build a rich linguistic feature set, and analyze and compare its deception cues with the only AFD collection currently available, which is in English. Following a current research thread, we also discuss the individualism/collectivism dimension in deception with respect to these two datasets. Lastly, we build classifiers by testing various monolingual and cross-lingual settings. The results showcase that AFD datasets can be helpful in deception detection studies, and are in alignment with the observations of other deception detection works.
Georgian Cultural-Intellectual and National Islands beyond Ideological Frontiers
Ketevan GIGASHVILI, Mzia GIGASHVILI
This paper examines the Georgian migrant press, which emerged as a result of the flight of Georgian intellectuals from the Soviet Communist regime. It reviews publications such as Chveni Drosha (Our Flag), Tavisuphali Sakartvelo (Free Georgia), Bedi Kartlisa (Destiny of Kartli) and others, which became cultural-intellectual and national islands beyond ideological frontiers. After the occupation of Georgia, the efforts of emigrants focused on waging an information war against the Soviet occupation. The emigrants living in Paris were especially active in this regard. The newspapers and magazines were mainly issued in Georgian, but also in English, French, and German. Special attention is paid to the historical, literary and scientific journal Bedi Kartlisa (Revue de Kartvélologie), founded in Paris in 1948 by Nino Salia. The journal brought together Kartvelologists living abroad and played an important role in promoting Georgian history, literature, science, and culture. In this magazine, Georgian and foreign journalists, prominent writers and scientists worked together. The study aims to examine the pathos and the main topics of the emigrant press, based on the analysis of a large number of magazines and personal archives of Georgian emigrants. Their role in keeping the Georgian national soul alive abroad, exposing the regime, introducing Georgian intellectual and spiritual culture to Europeans, and transferring European ideas to Georgia are among the most relevant findings. The role of these magazines is invaluable not only in the history of Georgian journalism, but also in the history of the Georgian national struggle.
Philology. Linguistics, Literature (General)
THE DISTANCE OF HOPE: A RHETORICAL ANALYSIS OF ALEJANDRO ROMUALDO'S "ASÍ ESTAMOS" AND MANUEL SCORZA'S "AÑOS DE LOS CASTIGOS"
Andree Alejandro Villegas de la Cruz
The work of Alejandro Romualdo occupies a paradigmatic place in the corpus of twentieth-century Peruvian poetry. The widely ignored purist tendency of his early works is still present in his socially themed collections, especially in Poesía Concreta (1952). To demonstrate this, we compare one of the poems from that book, "Así estamos", with the poem "Años de los castigos" by Manuel Scorza, a poet regarded by critics as social and confrontational. Using comparative rhetoric and the rhetoric of argumentation, we show that both poets tend toward metaphorical and symbolic consonances to demonstrate the inclusion of all people in a common ideal of nation; however, the hermetic presence of hope in Romualdo's poem contrasts with the fatalistic ending of Scorza's socially oriented poetry.
Style. Composition. Rhetoric
Linguistic Profiling of a Neural Language Model
Alessio Miaschi, Dominique Brunato, Felice Dell'Orletta
et al.
In this paper we investigate the linguistic knowledge learned by a Neural Language Model (NLM) before and after fine-tuning, and how this knowledge affects its predictions on several classification problems. We use a wide set of probing tasks, each of which corresponds to a distinct sentence-level feature extracted from different levels of linguistic annotation. We show that BERT is able to encode a wide range of linguistic characteristics, but it tends to lose this information when trained on specific downstream tasks. We also find that BERT's capacity to encode different kinds of linguistic properties has a positive influence on its predictions: the more readable linguistic information it stores about a sentence, the better it predicts the expected label for that sentence.
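A probing task of the kind described above amounts to training a simple linear classifier on frozen model representations to predict a sentence-level property; the toy sketch below (with invented 3-dimensional "representations" and labels) illustrates the setup:

```python
def train_probe(reps, labels, lr=0.1, epochs=200):
    """Perceptron-style linear probe: one weight per representation dimension.
    The representations stay frozen; only the probe's weights are learned."""
    w = [0.0] * len(reps[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(reps, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def accuracy(w, b, reps, labels):
    """Fraction of sentences whose probed label matches the gold label."""
    hits = sum((1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0) == y
               for x, y in zip(reps, labels))
    return hits / len(labels)

# Toy frozen "sentence representations" and binary feature labels
# (e.g., "sentence length above the median"). All values are invented.
reps = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.7], [0.0, 0.8, 0.9]]
labels = [1, 1, 0, 0]
w, b = train_probe(reps, labels)
print(accuracy(w, b, reps, labels))
```

High probe accuracy is read as evidence that the property is linearly decodable from the representations; a drop after fine-tuning, as the paper reports for BERT, suggests the information was overwritten during downstream training.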