Learning to understand speech appears almost effortless for typically developing infants, yet from an information-processing perspective, acquiring a language from acoustic speech is an enormous challenge. This chapter reviews recent developments in using computational models to understand early language acquisition from speech and audiovisual input. The focus is on self-supervised and visually grounded models of perceptual learning. We show how these models are becoming increasingly powerful in learning various aspects of speech without strong linguistic priors, and how many features of early language development can be explained through a shared set of learning principles, principles broadly compatible with multiple theories of language acquisition and human cognition. We also discuss how modern learning simulations are gradually becoming more realistic, both in terms of input data and in linking model behavior to empirical findings on infant language development.
Abstract: In a context of rapid transformation in higher education, universities in the Global South face major challenges: massification of enrollments, scarcity of teaching resources, unequal access to continuing education, and pressures to internationalize. In this setting, the pedagogical café emerges as an informal yet structuring arrangement, conducive to horizontal exchange, reflexivity, and the co-construction of contextualized knowledge. Inspired by initiatives developed in some Western universities, it must nevertheless be rethought in light of the cultural and institutional specificities of Southern contexts. Drawing on the theoretical framework of the learning organization, we examine the relevance of the pedagogical café as a lever for transforming university practices and cultures. The experiences analyzed reveal that, despite institutional resistance and entrenched professional routines, this format fosters the emergence of collective dynamics for analyzing practice. The pedagogical café offers a fruitful mediation between routine conformism and engagement. It makes it possible to break with inertia in the face of practices that, in constrained environments, stray from the purposes of the university's vocation. It thus opens an unprecedented space for the recognition of teaching practices, while supporting the development of educational projects that are locally rooted and carried by the academic communities themselves.
Keywords: pedagogical café, reflexivity, learning organization, transformation, co-construction.
Arts in general, Computational linguistics. Natural language processing
We examine the semantic evolution of Donald Trump’s presidential speeches during his first term (2017-2021), using quantitative methods from corpus linguistics and natural language processing. Our corpus includes all transcribed official speeches, segmented into homogeneous speech units and then analyzed using word embeddings. Two scales of evolution are considered: an external dynamic, measured at the level of the presidency, and an internal dynamic, observed within each speech. The external analysis does not reveal a large semantic shift over time, suggesting relative stability in the presidential register. On the other hand, a clear internal evolution is evident: Trump’s speeches begin in an institutional and impersonal register, then evolve towards a more subjective formulation, marked by the increasing use of the first person and a decrease in lexical density. This pattern indicates a specific structure of Trump’s presidential discourse, characterized by a gradual transition from a representative role to personal expression.
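The internal dynamic described above can be illustrated with a minimal sketch: embed the opening and closing segments of a speech as averaged word vectors, then measure how far the closing segment has drifted from the opening one. This is not the authors' pipeline; the toy two-dimensional vectors and example segments below are entirely hypothetical stand-ins for trained embeddings.

```python
import math

# Toy word vectors standing in for trained embeddings (hypothetical values):
# one axis loosely tracks "institutional" vocabulary, the other "subjective".
EMBEDDINGS = {
    "nation": [0.9, 0.1], "government": [0.8, 0.2], "law": [0.85, 0.15],
    "i": [0.1, 0.9], "me": [0.15, 0.85], "believe": [0.2, 0.8],
}

def segment_vector(tokens):
    """Average the word vectors of a segment (out-of-vocabulary words are skipped)."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    dim = len(next(iter(EMBEDDINGS.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Two segments of one (invented) speech: institutional opening, subjective close.
opening = "the government of this nation upholds the law".split()
closing = "i believe what matters to me".split()

# Cosine distance between segment vectors: larger = greater internal drift.
drift = 1 - cosine(segment_vector(opening), segment_vector(closing))
```

With real embeddings, the same distance computed between the first segment and each subsequent segment of a speech would trace the register shift over its course.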
Philology. Linguistics, Computational linguistics. Natural language processing
William Villegas-Ch, Rommel Gutierrez, Alexandra Maldonado Navarro
Abstract: In the digital age, the proliferation and complexity of data present significant challenges for digital forensic analysis. Traditional tools often struggle to keep pace with the volume and sophistication of data, leading to delays in detecting illicit activity. This study addresses these challenges by integrating advanced artificial intelligence techniques, which overcome these limitations and enhance the effectiveness and efficiency of digital forensics. We apply specific artificial intelligence methods, such as convolutional neural networks, supervised machine learning algorithms, and natural language processing techniques, to optimize data processing and analysis in forensic investigations. Our findings demonstrate accelerated data analysis, improved precision, and greater capacity to handle large volumes of information. For instance, convolutional neural networks achieved a remarkable 92% precision in identifying image patterns, while natural language processing techniques achieved 88% precision in extracting relevant text information. This research highlights the transformative potential of artificial intelligence in digital forensics, offering faster and more accurate solutions that are crucial for cybersecurity.
Computational linguistics. Natural language processing, Electronic computers. Computer science
This paper presents a methodology for inserting phrases in Arabic poems to conform to a specific rhythm using ByT5, a byte-level multilingual transformer-based model. Our work discusses a rule-based grapheme-to-beat transformation tailored for extracting the rhythm from fully diacritized Arabic script. Our approach employs a conditional denoising objective to fine-tune ByT5, where the model reconstructs masked words to match a target rhythm. We adopt a curriculum learning strategy, pre-training on a general Arabic dataset before fine-tuning on a poetic dataset, and explore cross-lingual transfer from English to Arabic. Experimental results demonstrate that our models achieve high rhythmic alignment while maintaining semantic coherence. The proposed model has the potential to be used in co-creative applications in the process of composing classical Arabic poems.
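The idea of a rule-based grapheme-to-beat transformation can be sketched in a drastically simplified form: in 'arud scansion, a consonant carrying a short vowel is a "moving" letter and one closed by sukun is "still", which yields a binary rhythm string. The sketch below is a toy illustration, not the paper's rules; real scansion must also handle shadda, tanwin, long vowels, and orthographic conventions.

```python
# Arabic diacritics as Unicode combining marks.
SHORT_VOWELS = {"\u064E", "\u064F", "\u0650"}  # fatha, damma, kasra
SUKUN = "\u0652"

def grapheme_to_beat(text):
    """Toy grapheme-to-beat sketch: a letter followed by a short vowel
    is 'moving' (1); a letter closed by sukun is 'still' (0).
    Letters with no following mark are ignored in this simplification."""
    beats = []
    chars = list(text)
    for i, ch in enumerate(chars):
        nxt = chars[i + 1] if i + 1 < len(chars) else ""
        if nxt in SHORT_VOWELS:
            beats.append("1")
        elif nxt == SUKUN:
            beats.append("0")
    return "".join(beats)

# "katab" with a final sukun: kaf+fatha, ta+fatha, ba+sukun -> "110"
beats = grapheme_to_beat("\u0643\u064E\u062A\u064E\u0628\u0652")
```

A target rhythm string like this is what the fine-tuned model's masked reconstructions would be conditioned on.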
Kristin Gnadt, David Thulke, Simone Kopeinik
et al.
In recent years, various methods have been proposed to evaluate gender bias in large language models (LLMs). A key challenge lies in the transferability of bias measurement methods initially developed for the English language when applied to other languages. This work aims to contribute to this research strand by presenting five German datasets for gender bias evaluation in LLMs. The datasets are grounded in well-established concepts of gender bias and are accessible through multiple methodologies. Our findings, reported for eight multilingual LLMs, reveal unique challenges associated with gender bias in German, including the ambiguous interpretation of male occupational terms and the influence of seemingly neutral nouns on gender perception. This work contributes to the understanding of gender bias in LLMs across languages and underscores the necessity for tailored evaluation frameworks.
Tokenization is the first step in training any Large Language Model (LLM), where the text is split into a sequence of tokens as per the model's fixed vocabulary. This tokenization in LLMs is different from the traditional tokenization in NLP where the text is split into a sequence of "natural" words. In LLMs, a natural word may also be broken into multiple tokens due to limited vocabulary size of the LLMs (e.g., Mistral's tokenizer splits "martial" into "mart" and "ial"). In this paper, we hypothesize that such breaking of natural words negatively impacts LLM performance on various NLP tasks. To quantify this effect, we propose a set of penalty functions that compute a tokenization penalty for a given text for a specific LLM, indicating how "bad" the tokenization is. We establish statistical significance of our hypothesis on multiple NLP tasks for a set of different LLMs.
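One simple penalty function in the spirit described above is the fraction of natural words that the tokenizer breaks into more than one piece. The sketch below uses a tiny hand-made vocabulary with greedy longest-match segmentation as a stand-in for a real LLM tokenizer (which would use learned BPE merges); the vocabulary and penalty definition are illustrative assumptions, not the paper's.

```python
# Toy subword vocabulary (hypothetical; real tokenizers learn theirs from data).
VOCAB = {"mart", "ial", "arts", "are", "fun"}

def tokenize(word):
    """Greedy longest-match segmentation into vocabulary pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:  # no vocabulary piece matches: emit the character alone
            pieces.append(word[i])
            i += 1
    return pieces

def split_penalty(text):
    """One possible penalty: the fraction of natural words broken into
    more than one token (0.0 means no word is split)."""
    words = text.lower().split()
    broken = sum(1 for w in words if len(tokenize(w)) > 1)
    return broken / len(words)

penalty = split_penalty("martial arts are fun")  # "martial" -> "mart"+"ial", so 0.25
```

A higher penalty for a given text under a given tokenizer would, under the paper's hypothesis, predict degraded downstream performance.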
Miguel López-Otal, Jorge Gracia, Jordi Bernad
et al.
Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, a type of 'black box' system. There is, however, a line of research -- 'interpretability' -- aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers -- an area we call 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or presents some limitations -- e.g. only studying English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models' internal representations.
Abstract: The use of dynamic programming (DP) algorithms to learn Bayesian network structures is limited by their high space complexity and difficulty in learning the structure of large-scale networks. Therefore, this study proposes a DP algorithm based on node block sequence constraints. The proposed algorithm constrains the traversal process of the parent graph by using the M-sequence matrix to considerably reduce the time consumption and space complexity by pruning the traversal process of the order graph using the node block sequence. Experimental results show that compared with existing DP algorithms, the proposed algorithm can obtain learning results more efficiently with less than 1% loss of accuracy, and can be used for learning larger-scale networks.
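The parent-graph/order-graph DP that the proposed algorithm prunes can be sketched in its classical unconstrained form: for each variable subset W, the best network over W is obtained by choosing a sink v, its best parent set from W minus v, and the best network over the remainder. The toy local scores below are invented numbers (real systems compute BDeu or BIC scores from data), and no node-block pruning is shown.

```python
from itertools import combinations

# Toy local scores score(v, parents) for three variables; lower is better.
SCORES = {
    ("A", frozenset()): 5.0, ("A", frozenset({"B"})): 3.0,
    ("A", frozenset({"C"})): 4.0, ("A", frozenset({"B", "C"})): 2.5,
    ("B", frozenset()): 6.0, ("B", frozenset({"A"})): 2.0,
    ("B", frozenset({"C"})): 5.0, ("B", frozenset({"A", "C"})): 1.5,
    ("C", frozenset()): 4.0, ("C", frozenset({"A"})): 3.5,
    ("C", frozenset({"B"})): 3.0, ("C", frozenset({"A", "B"})): 2.8,
}

def best_parents(v, candidates):
    """Best-scoring parent set for v drawn from 'candidates'
    (the 'parent graph' step)."""
    best = None
    for r in range(len(candidates) + 1):
        for ps in combinations(sorted(candidates), r):
            s = SCORES[(v, frozenset(ps))]
            if best is None or s < best[0]:
                best = (s, frozenset(ps))
    return best

def best_network(variables):
    """Classic order-graph DP:
    best(W) = min over sink v in W of
              best_parents(v, W - {v})[0] + best(W - {v})."""
    best = {frozenset(): 0.0}
    vars_sorted = sorted(variables)
    for size in range(1, len(vars_sorted) + 1):
        for w in combinations(vars_sorted, size):
            W = frozenset(w)
            best[W] = min(
                best_parents(v, W - {v})[0] + best[W - {v}] for v in W
            )
    return best[frozenset(variables)]
```

The order graph enumerated here has 2^n nodes, which is exactly the space cost the node-block-sequence constraint is designed to prune.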
Computational linguistics. Natural language processing, Computer software
Abstract: The cessation of Internet services for only a few hours creates palpable confusion, and members of society find themselves living in a state of "social alienation." This indicates the depth of individuals' psychological and social integration with this virtual network, which has drawn the attention of many researchers interested in studying the phenomenon and revealing its social, political, and media dimensions. The use of social media is no longer limited to urban communities, but has become widespread even in remote areas of a rural, Bedouin character. Our approach to this phenomenon, in its various dimensions, is a serious attempt to examine the functional relationship between the status of the virtual world and its outcomes in reshaping Algerian social reality, through a general research question: How did social media contribute to reproducing the representations that members of the Algerian tribal community hold of their social reality?
Keywords: Social media, Social change, Nomadism, The tribe, Identity.
Arts in general, Computational linguistics. Natural language processing
In the realm of financial decision-making, predicting stock prices is pivotal. Artificial intelligence techniques such as long short-term memory networks (LSTMs), support-vector machines (SVMs), and natural language processing (NLP) models are commonly employed to predict said prices. This paper utilizes stock percentage change as training data, in contrast to the traditional use of raw currency values, with a focus on analyzing publicly released news articles. The choice of percentage change aims to provide models with context regarding the significance of price fluctuations and overall price change impact on a given stock. The study employs specialized BERT natural language processing models to predict stock price trends, with a particular emphasis on various data modalities. The results showcase the capabilities of such strategies with a small natural language processing model to accurately predict overall stock trends, and highlight the effectiveness of certain data features and sector-specific data.
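The transformation from raw currency values to percentage change that the paper trains on can be sketched in a few lines; the price series below is invented for illustration.

```python
def to_percent_change(prices):
    """Convert raw closing prices into period-over-period percentage changes,
    so a model sees relative moves rather than absolute currency values."""
    return [
        100.0 * (curr - prev) / prev
        for prev, curr in zip(prices, prices[1:])
    ]

# A $2 move from $100 and a $2.04 move from $102 are the same 2% change.
changes = to_percent_change([100.0, 102.0, 99.96])  # roughly [2.0, -2.0]
```

Expressing targets this way gives a $2 move on a $100 stock the same magnitude as a $20 move on a $1000 stock, which is the contextual signal the study argues raw prices lack.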
Grammatical rules in natural languages are often characterized by exceptions. How do language learners learn these exceptions to otherwise general patterns? Here, we study this question through the case study of English passivization. While passivization is in general quite productive, there are cases where it cannot apply (cf. the following sentence is ungrammatical: *One hour was lasted by the meeting). Using neural network language models as theories of language acquisition, we explore the sources of indirect evidence that a learner can leverage to learn whether a verb can be passivized. We first characterize English speakers' judgments of exceptions to the passive, and confirm that speakers find some verbs more passivizable than others. We then show that a neural network language model's verb passivizability judgments are largely similar to those displayed by humans, suggesting that evidence for these exceptions is available in the linguistic input. Finally, we test two hypotheses as to the source of evidence that language models use to learn these restrictions: frequency (entrenchment) and semantics (affectedness). We do so by training models on versions of the corpus that have had sentences of the types implicated by each hypothesis removed, altered, or introduced. We find support for both hypotheses: entrenchment and affectedness make independent contributions to a verb's passivizability. From a methodological point of view, this study highlights the utility of altering a language model's training data for answering questions where complete control over a learner's input is vital.
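The corpus-manipulation methodology described above (removing sentences of the types implicated by each hypothesis) can be sketched as a simple filter; the mini-corpus and the crude "was/were + participle" pattern below are hypothetical stand-ins for the paper's actual sentence-type detection.

```python
import re

# Hypothetical mini-corpus of training sentences.
corpus = [
    "The meeting lasted one hour.",
    "The ball was kicked by the child.",
    "The child kicked the ball.",
    "The problem was solved quickly.",
]

def remove_passives_of(participle, sentences):
    """Drop sentences containing 'was/were <participle>' for one verb,
    simulating an entrenchment-style ablation of the training data."""
    pattern = re.compile(
        r"\b(was|were)\s+" + re.escape(participle) + r"\b", re.IGNORECASE
    )
    return [s for s in sentences if not pattern.search(s)]

# Remove passive uses of 'kick' while keeping its active uses.
ablated = remove_passives_of("kicked", corpus)
```

Retraining a language model on the ablated corpus and comparing its passivizability judgments for the targeted verb against the original model is what isolates the contribution of direct evidence.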
The increased use of large language models (LLMs) across a variety of real-world applications calls for automatic tools to check the factual accuracy of their outputs, as LLMs often hallucinate. This is difficult as it requires assessing the factuality of free-form open-domain responses. While there has been a lot of research on this topic, different papers use different evaluation benchmarks and measures, which makes them hard to compare and hampers future progress. To mitigate these issues, we developed OpenFactCheck, a unified framework, with three modules: (i) RESPONSEEVAL, which allows users to easily customize an automatic fact-checking system and to assess the factuality of all claims in an input document using that system, (ii) LLMEVAL, which assesses the overall factuality of an LLM, and (iii) CHECKEREVAL, a module to evaluate automatic fact-checking systems. OpenFactCheck is open-sourced (https://github.com/mbzuai-nlp/openfactcheck) and publicly released as a Python library (https://pypi.org/project/openfactcheck/) and also as a web service (http://app.openfactcheck.com). A video describing the system is available at https://youtu.be/-i9VKL0HleI.
Vagrant Gautam, Arjun Subramonian, Anne Lauscher
et al.
Personal names simultaneously differentiate individuals and categorize them in ways that are important in a given society. While the natural language processing community has thus associated personal names with sociodemographic characteristics in a variety of tasks, researchers have engaged to varying degrees with the established methodological problems in doing so. To guide future work that uses names and sociodemographic characteristics, we provide an overview of relevant research: first, we present an interdisciplinary background on names and naming. We then survey the issues inherent to associating names with sociodemographic attributes, covering problems of validity (e.g., systematic error, construct validity), as well as ethical concerns (e.g., harms, differential impact, cultural insensitivity). Finally, we provide guiding questions along with normative recommendations to avoid validity and ethical pitfalls when dealing with names and sociodemographic characteristics in natural language processing.
Text-to-graphics systems encompass three types of tools: text-to-picture, text-to-scene and text-to-animation. They are an artificial intelligence application wherein users can create 2D and 3D scenes or animations, and recently immersive environments, from natural language. These complex tasks require the collaboration of various fields, such as natural language processing, computational linguistics and computer graphics. Text-to-animation systems have received more interest than their counterparts, and have been developed for various domains, including theatrical pre-production, education and training. In this survey we focus on text-to-animation systems, discussing their requirements, challenges and proposed solutions, and investigate the natural language understanding approaches adopted in previous research works to solve the challenge of animation generation. We review text-to-animation systems developed over the period 2001-2021, and investigate their recent trends in order to paint the current landscape of the field.
Marcel Bollmann, Nathan Schneider, Arne Köhn
et al.
The ACL Anthology is a prime resource for research papers within computational linguistics and natural language processing, while continuing to be an open-source and community-driven project. Since Gildea et al. (2018) reported on its state and planned directions, the Anthology has seen major technical changes. We discuss what led to these changes and how they impact long-term maintainability and community engagement, describe which open-source data and software tools the Anthology currently provides, and provide a survey of literature that has used the Anthology as a main data source.
Automatic Term Recognition is used to extract domain-specific terms that belong to a given domain. In order to be accurate, these corpus- and language-dependent methods require large volumes of textual data that need to be processed to extract candidate terms, which are afterward scored according to a given metric. To improve text preprocessing and candidate term extraction and scoring, we propose a distributed Spark-based architecture to automatically extract domain-specific terms. The main contributions are as follows: (1) propose a novel distributed automatic domain-specific multi-word term recognition architecture built on top of the Spark ecosystem; (2) perform an in-depth analysis of our architecture in terms of accuracy and scalability; (3) design an easy-to-integrate Python implementation that enables the use of Big Data processing in fields such as Computational Linguistics and Natural Language Processing. We empirically demonstrate the feasibility of our architecture through experiments on two real-world datasets.
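The per-partition core of such a pipeline (candidate extraction followed by scoring) can be sketched in plain Python before it is distributed over Spark; the length-weighted frequency score below is a simplified stand-in for metrics such as C-value, and the example tokens are invented.

```python
import math
from collections import Counter

def candidate_terms(tokens, max_len=3):
    """Extract contiguous n-grams (n = 1..max_len) as candidate terms,
    keeping their corpus frequencies."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def score(term, counts):
    """A simple length-weighted frequency score: longer, more frequent
    candidates rank higher (a crude stand-in for C-value-style metrics)."""
    n = len(term.split())
    return math.log2(n + 1) * counts[term]

tokens = "natural language processing and natural language understanding".split()
counts = candidate_terms(tokens)
```

In the distributed setting, each Spark partition would emit its local `Counter`, and the per-partition counts would be merged with a reduce step before the final scoring pass.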