This research examines the use of gender-fair language in job advertisements within the contemporary Italian job market. The study is based on a sample of 240 job announcements collected online in 2024, with results compared to those of previous research and analyzed using corpus linguistics tools. The findings confirm the predominance of neutralizing strategies and masculine forms, while revealing a slight increase in the use of split forms and the emergence of non-binary language, such as neomorphemes. This trend underscores the persistence of linguistic asymmetries in job announcements: problematic gender-related linguistic strategies may contribute to maintaining access barriers to specific segments of the labor market for women and to the overall persistence of social and cultural stereotypes.
Computational linguistics. Natural language processing, Language. Linguistic theory. Comparative grammar
The design of Large Language Models and generative artificial intelligence has been shown to be "unfair" to less-spoken languages and to deepen the digital language divide. Critical sociolinguistic work has also argued that these technologies are not only made possible by prior socio-historical processes of linguistic standardisation, often grounded in European nationalist and colonial projects, but also exacerbate epistemologies of language as "monolithic, monolingual, syntactically standardized systems of meaning". In our paper, we draw on earlier work on the intersections of technology and language policy and bring our respective expertise in critical sociolinguistics and computational linguistics to bear on an interrogation of these arguments. We take two different complexes of non-standard linguistic varieties in our respective repertoires--South Tyrolean dialects, which are widely used in informal communication in South Tyrol, Italy, as well as varieties of Kurdish--as starting points to an interdisciplinary exploration of the intersections between GenAI and linguistic variation and standardisation. We discuss both how LLMs can be made to deal with nonstandard language from a technical perspective, and whether, when or how this can contribute to "democratic and decolonial digital and machine learning strategies", which has direct policy implications.
Muhammad Owais Raza, Aqsa Umar, Jawad Rasheed
et al.
Drought is among the most menacing natural disasters because of its uncertain nature. Drought modeling (prediction, detection, forecasting, and stage prediction) is essential for efficient policy making, but a key problem is the limited availability of centralized datasets. To address this problem, we propose novel federated learning based transfer learning models for the prediction of drought stages. In this study, a satellite image dataset was collected from the Tharparkar district of Pakistan, which is prone to drought. We trained on the dataset using traditional and federated learning approaches, comparing centralized ML models, pre-trained models, and their respective federated learning models (FL-ResNet, FL-DenseNet, FL-MobileNet). The development of these models for the specific use case of drought stage prediction is the novel aspect of the study. In the final evaluation, FL-MobileNet achieved 82% precision while baseline MobileNet scored 68%. These results show that the proposed federated learning framework improves performance on the drought stage classification task.
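The aggregation step at the heart of federated learning can be illustrated with a minimal sketch. The abstract does not state which aggregation rule the authors used; federated averaging (FedAvg), weighted by each client's local sample count, is assumed here as a common choice, and the name `federated_average` and the dictionary layout of the weights are hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: each client's model is a dict of flat parameter lists.
Weights = Dict[str, List[float]]


def federated_average(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Average client parameters, weighting each client by its sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            # Weighted sum over clients for the i-th entry of this parameter.
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged
```

In a full training loop, a server would send the averaged weights back to the clients for the next round of local training; only parameters, never the raw satellite images, would leave a client.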
Computational linguistics. Natural language processing, Electronic computers. Computer science
Technological innovation has profoundly influenced lexicography, opening new perspectives for the compilation and consultation of dictionaries. This contribution investigates the role of generative artificial intelligence, through the use of ChatGPT, in creating a dictionary of the lexicon used by Giovanni Boccaccio in the Decameron. Through an experimental study conducted on a corpus of words from the botanical and naturalistic domain, the work explores the capabilities and limits of ChatGPT in generating lexicographic entries. It examines aspects such as the structure of the definitions, the accuracy in identifying the different senses, and the adequacy of the textual examples drawn from the principal work of the author from Certaldo. The contribution aims to provide a basis for reflection on the future prospects of generative AI in supporting the compilation of an Italian-language dictionary, outlining opportunities and challenges for its effective integration into lexicographic practice.
Computational linguistics. Natural language processing
This research paper presents a critical analysis of Asghar Nadeem Syed’s novel “Jahanabad ki Galiyan”, examining it through the lens of historical consciousness. Set against the backdrop of General Zia-ul-Haq’s Martial Law in Pakistan (1977-1988), the novel explores the far-reaching consequences of this era on Pakistani society. Through its narrative, the novel delves into the intersection of politics, history, and culture, revealing the complexities of a nation’s experience under authoritarian rule. The novel’s canvas simultaneously encompasses various facets of the Pakistani landscape, including the political upheavals, social transformations, and cultural shifts that occurred during the Martial Law era. By employing a range of literary devices, including narrative, character development, and symbolism, the novel creates a nuanced and multifaceted portrayal of life in Pakistan during this period. This study will analyze how the novel portrays the impact of Martial Law on Pakistani politics, media, and interpersonal relationships. Furthermore, it will examine the ways in which the novel employs historical consciousness to blur the boundaries between past, present, and future, creating a nuanced understanding of Pakistan’s complex history. By exploring the intersections between history, politics, and literature, this research aims to contribute to a deeper understanding of the Martial Law era and its ongoing impact on Pakistani society.
Language. Linguistic theory. Comparative grammar, Computational linguistics. Natural language processing
Emmanuel MATANGWA JETOU, Elvis KUBANZILA MUZELA & Crédo MUSIPADE SIOKODOR
Abstract: In this article, we attempt, in light of the evidence gathered, to discuss the circulation of financial flows around police custody in the Democratic Republic of the Congo (seizures, fines, etc.) and the various corrupt practices described by actors in the justice system (litigants and public officials). The study reveals both a rhetoric of justification, that is, a kind of economy of getting by (survival), and a denunciation of these practices, confirming a "moral economy" of corruption in terms of the capitalization of police power. Analyzing police custody in the Democratic Republic of the Congo through the prism of the "moral economy" makes it possible to understand how corruption is embedded in dynamics more complex than the mere transgression of laws. It reveals that the financial flows circulating around this practice are justified by deep economic and social reasons. In a system where the state fails to perform its functions, these practices are perceived as a form of alternative justice, even though they simultaneously feed the vicious circle of corruption. This article sheds light on the complexities surrounding corruption in the context of police custody in the DRC. The moral economy helps explain how these practices, although condemned, become survival strategies in an environment where public and judicial services are deficient.
Keywords: Financial flows, police custody, rhetoric, moral economy, corruption.
Arts in general, Computational linguistics. Natural language processing
This study explores the application of an improved clustering algorithm in blended teaching with modern educational technology. It utilizes data analysis to enhance teaching processes and outcomes. As information technology rapidly advances, traditional teaching methods are increasingly unable to meet the diverse learning needs of students, prompting a shift to blended teaching. This research employs cluster analysis to categorise student data based on specific characteristics, facilitating the design of personalised teaching paths. It also discusses optimizing teaching resources and dynamic adjustment mechanisms to meet real-world teaching demands. The improved clustering algorithm demonstrates high precision and flexibility, enabling accurate resource allocation, effective teaching content, and targeted pacing adjustments based on students’ learning progress. Results indicate that cluster analysis enhances resource allocation, improves learning outcomes, and supports the personalisation and accuracy of teaching methods. With advances in algorithm optimization and computational capabilities, cluster analysis holds broad potential for educational applications. Specifically, the improved clustering algorithm increased the silhouette score from 0.51 (K-means) and 0.58 (DBSCAN) to 0.68 and reduced the clustering error from 0.27 to 0.18. This resulted in an average gain of 4.5 points in student performance and a 10% improvement in personalised learning progress.
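For readers unfamiliar with the silhouette score reported above, the following sketch computes it from scratch. It is not the authors' improved algorithm, only the standard metric: for each point, a is the mean distance to the other members of its own cluster, b is the lowest mean distance to any other cluster, and the point's score is (b - a) / max(a, b). The function name and the toy data in the usage note are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

On four one-dimensional points `[(0,), (1,), (10,), (11,)]`, the labelling `[0, 0, 1, 1]` scores close to 0.9, while the mismatched labelling `[0, 1, 0, 1]` scores below zero, which is the sense in which a rise from 0.51 to 0.68 indicates tighter, better separated clusters.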
Computational linguistics. Natural language processing, Electronic computers. Computer science
This paper presents a comparative study of the major elements of Romanticism in the poetry of Allama Muhammad Iqbal and Rabindranath Tagore. Romanticism, as a literary and intellectual movement, emphasizes imagination, love of nature, spirituality, nostalgia for the past, aesthetic beauty, and individual freedom. Both Iqbal and Tagore embody these Romantic traits within their distinct cultural, philosophical, and artistic contexts, giving them new dimensions and unique expressive forms. The study highlights that while both poets share Romantic features such as imaginative elevation, vivid nature imagery, spiritual experience, and humanistic love, the foundations and purposes of their Romanticism differ significantly. Iqbal’s Romantic vision is rooted in action, selfhood (khudi), dynamism, and revolutionary consciousness, whereas Tagore’s Romanticism revolves around aesthetic harmony, divine unity, contemplative serenity, and the celebration of universal humanism. Through this comparative analysis, the paper clarifies the similarities and distinctions in the poets’ philosophical frameworks, aesthetic sensibilities, and Romantic ideals, thereby underscoring the richness and diversity of South Asian literary heritage.
Language. Linguistic theory. Comparative grammar, Computational linguistics. Natural language processing
Konstantinos I. Roumeliotis, Nikolaos D. Tselikas, Dimitrios K. Nasiopoulos
In the rapidly evolving e-commerce landscape, efficient and accurate product classification is essential for enhancing customer experience and streamlining operations. Traditional product classification methods, which depend heavily on labeled data and manual effort, struggle with scalability and adaptability to diverse product categories. This study explores the transformative potential of large language models (LLMs) for zero-shot product classification in e-commerce, addressing the challenge of automating product categorization without prior labeled training data. We evaluate the performance of four state-of-the-art LLMs — GPT-4o, GPT-4o mini, Claude 3.5 Sonnet, and Claude 3.5 Haiku — on a diverse dataset of 248 product categories, each containing 20 samples, structured into 8 subsets. Each model performs zero-shot classification, assigning products to predefined categories without prior exposure. Our findings reveal significant variations in classification accuracy across models, with certain LLMs demonstrating superior scalability and adaptability for real-world e-commerce applications. Based on these insights, we developed an API software to integrate the top-performing models into e-commerce systems, enhancing automation and efficiency. This study underscores the transformative role of LLMs in revolutionizing e-commerce workflows and recommends their adoption for scalable, intelligent product classification.
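A zero-shot evaluation of the kind described can be sketched as follows. This is an assumption-laden illustration, not the authors' pipeline: the prompt wording, the function names, and the `model` callable (a stand-in for whatever LLM client is used) are all hypothetical, and no real API is called.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def build_zero_shot_prompt(product_text: str, categories: Sequence[str]) -> str:
    """List the allowed categories and ask for exactly one of them."""
    options = "\n".join(f"- {c}" for c in categories)
    return (
        "Assign the product to exactly one of the categories below.\n"
        f"Categories:\n{options}\n"
        f"Product: {product_text}\n"
        "Answer with the category name only."
    )


def match_category(raw_answer: str, categories: Sequence[str]) -> Optional[str]:
    """Map a free-text answer back to a known category, or None if it is off-list."""
    cleaned = raw_answer.strip().lower()
    for category in categories:
        if category.lower() == cleaned:
            return category
    return None


def evaluate(
    model: Callable[[str], str],  # hypothetical: takes a prompt, returns the raw answer
    samples: List[Tuple[str, str]],  # (product text, gold category)
    categories: Sequence[str],
) -> float:
    """Fraction of samples whose predicted category equals the gold label."""
    if not samples:
        raise ValueError("no samples to evaluate")
    correct = 0
    for text, gold in samples:
        answer = model(build_zero_shot_prompt(text, categories))
        if match_category(answer, categories) == gold:
            correct += 1
    return correct / len(samples)
```

Answers outside the predefined list count as errors here; whether the study treated off-list answers this way is not stated in the abstract.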
Computational linguistics. Natural language processing
This paper presents a novel treebank-driven approach to comparing syntactic structures in speech and writing using dependency-parsed corpora. Adopting a fully inductive, bottom-up method, we define syntactic structures as delexicalized dependency (sub)trees and extract them from spoken and written Universal Dependencies (UD) treebanks in two syntactically distinct languages, English and Slovenian. For each corpus, we analyze the size, diversity, and distribution of syntactic inventories, their overlap across modalities, and the structures most characteristic of speech. Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities. Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the distinct demands of real-time interaction and elaborated writing. This contrast is further supported by a keyness analysis of the most frequent speech-specific structures, which highlights patterns associated with interactivity, context-grounding, and economy of expression. We argue that this scalable, language-independent framework offers a useful general method for systematically studying syntactic variation across corpora, laying the groundwork for more comprehensive data-driven theories of grammar in use.
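To make "delexicalized dependency (sub)trees" concrete, the sketch below renders each head and its dependents using only part-of-speech tags and dependency relations, discarding word forms. It is one simple reading of the idea, not the authors' extraction procedure: the tuple layout `(id, form, head, upos, deprel)`, the bracket notation, and the restriction to complete subtrees are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed token layout, loosely following CoNLL-U columns.
Token = Tuple[int, str, int, str, str]  # (id, form, head, upos, deprel)


def delexicalized_subtrees(sentence: List[Token]) -> List[str]:
    """Return one delexicalized string per token that has dependents."""
    children: Dict[int, List[int]] = {}
    upos: Dict[int, str] = {}
    deprel: Dict[int, str] = {}
    for tok_id, _form, head, pos, rel in sentence:
        children.setdefault(head, []).append(tok_id)
        upos[tok_id] = pos
        deprel[tok_id] = rel

    def render(node: int) -> str:
        kids = sorted(children.get(node, []))  # keep dependents in surface order
        if not kids:
            return upos[node]
        parts = " ".join(f"{deprel[k]}({render(k)})" for k in kids)
        return f"{upos[node]} [{parts}]"

    return [render(tok[0]) for tok in sentence if children.get(tok[0])]


def structure_inventory(corpus: List[List[Token]]) -> Counter:
    """Count how often each delexicalized structure occurs in a corpus."""
    inventory: Counter = Counter()
    for sentence in corpus:
        inventory.update(delexicalized_subtrees(sentence))
    return inventory
```

Two sentences with different words but the same shape, such as "She reads books" and "He eats apples", map to the same structure `VERB [nsubj(PRON) obj(NOUN)]`, so inventories built separately from a spoken and a written treebank can be compared by size, diversity, and overlap.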
Dominik Meier, Jan Philip Wahle, Paul Röttger
et al.
As large language models (LLMs) become integrated into sensitive workflows, concerns grow over their potential to leak confidential information. We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography, without requiring explicit control over inference inputs. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. To implement TrojanStego, we propose a practical encoding scheme based on vocabulary partitioning learnable by LLMs via fine-tuning. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations. Further, they maintain high utility, can evade human detection, and preserve coherence. These results highlight a new class of LLM data exfiltration attacks that are passive, covert, practical, and dangerous.
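The gain from majority voting across three generations can be illustrated with a generic error-correction sketch. The abstract does not say whether the vote is taken per bit or over whole secrets; a per-bit vote over equal-length bit strings is assumed here, and the function name is hypothetical.

```python
from collections import Counter
from typing import List


def majority_vote_bits(candidates: List[str]) -> str:
    """Combine equal-length bit strings by taking the most common bit per position."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) % 2 == 0:
        raise ValueError("use an odd number of candidates to avoid ties")
    length = len(candidates[0])
    if any(len(c) != length for c in candidates):
        raise ValueError("all candidates must have the same length")

    voted = []
    for position in range(length):
        counts = Counter(c[position] for c in candidates)
        voted.append(counts.most_common(1)[0][0])
    return "".join(voted)
```

A single corrupted bit in one of three decodings is outvoted by the other two, which is why aggregate accuracy can exceed the accuracy of any single generation.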
This study examines the intricate relationship between transitivity and existential thought within the context of Ekegusii, an indigenous language spoken by the Abagusii people of Western Kenya. The research sought to uncover how Ekegusii speakers conceptualize and express the existence and occurrence of entities using the language. Grounded in the experiential metafunction of Halliday and Matthiessen's (2014) Functional Grammar theory, the study specifically sought to find out how existential processes are realized in the various forms of Ekegusii declarative clauses, looking at the forms of the clauses in terms of polarity and voice. Methodologically, the study used qualitative analysis. The data, Ekegusii declarative clauses with verbs carrying existential processes, were drawn from the Ekegusii Bible, Ekegusii storybooks, and the researcher’s intuition as a native speaker of Ekegusii. The analysis examined the patterns of the transitivity structure in the selected clauses by looking at their components: the verbs realizing the existential Process, the Participants involved, and the Circumstances, thereby allowing for a focused exploration of how existential thought is linguistically represented in Ekegusii. Besides implications for language teaching, translation, and cultural preservation efforts, the findings from this research will contribute to a deeper understanding of how indigenous languages, and Ekegusii in particular, can provide unique insights into human thought and perception.
Language. Linguistic theory. Comparative grammar, Computational linguistics. Natural language processing
Evaluating the cognitive capacities of large language models (LLMs) requires overcoming not only anthropomorphic but also anthropocentric biases. This article identifies two types of anthropocentric bias that have been neglected: overlooking how auxiliary factors can impede LLM performance despite competence ("auxiliary oversight"), and dismissing LLM mechanistic strategies that differ from those of humans as not genuinely competent ("mechanistic chauvinism"). Mitigating these biases necessitates an empirically-driven, iterative approach to mapping cognitive tasks to LLM-specific capacities and mechanisms, which can be done by supplementing carefully designed behavioral experiments with mechanistic studies.
Nikit Srivastava, Denis Kuchelev, Tatiana Moteu Ngoli
et al.
This paper presents LOLA, a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Our architectural and implementation choices address the challenge of harnessing linguistic diversity while maintaining efficiency and avoiding the common pitfalls of multilinguality. Our analysis of the evaluation results shows competitive performance in natural language generation and understanding tasks. Additionally, we demonstrate how the learned expert-routing mechanism exploits implicit phylogenetic linguistic patterns to potentially alleviate the curse of multilinguality. We provide an in-depth look at the training process, an analysis of the datasets, and a balanced exploration of the model's strengths and limitations. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research. Our findings enable the development of compute-efficient multilingual models with strong, scalable performance across languages.
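The routing step of a sparse Mixture-of-Experts layer can be sketched in a few lines. This is a generic top-k gating illustration, not LOLA's implementation: the abstract does not specify the value of k, the gating function, or any load-balancing terms, so the softmax gate, the renormalisation over the selected experts, and all names below are assumptions.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: Sequence[float], k: int = 1) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(
    x: float,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    k: int = 1,
) -> float:
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(weight * experts[i](x) for i, weight in route_top_k(router_logits, k))
```

Because only k experts run per token, compute per token stays roughly constant as more experts are added, which is the efficiency argument usually made for sparse architectures; inspecting which experts are selected for which languages is one way routing patterns can be analysed.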
Mattia Setzu, Marta Marchiori Manerba, Pasquale Minervini
et al.
Language Models (LMs) have been shown to inherit undesired biases that might hurt minorities and underrepresented groups if such systems were integrated into real-world applications without careful fairness auditing. This paper proposes FairBelief, an analytical approach to capture and assess beliefs, i.e., propositions that an LM may embed with different degrees of confidence and that covertly influence its predictions. With FairBelief, we leverage prompting to study the behavior of several state-of-the-art LMs across different previously neglected axes, such as model scale and likelihood, assessing predictions on a fairness dataset specifically designed to quantify the hurtfulness of LMs' outputs. Finally, we conclude with an in-depth qualitative assessment of the beliefs emitted by the models. We apply FairBelief to English LMs, revealing that, although these architectures enable high performance on diverse natural language processing tasks, they show hurtful beliefs about specific genders. Interestingly, training procedure and dataset, model scale, and architecture induce beliefs of different degrees of hurtfulness.
Abstract: African folktales address the problem of hunger, a notion central to the daily life of African societies. The treatment of this theme is an effective means of education in mutual aid and social cohesion. Thus, through the trickster figure, antisocial behaviors are condemned in favor of social values, which guarantee the balance of society. Gluttony and selfishness are accordingly a perfect manifestation of egocentrism, while violence is perceived as an obstacle to social security. Likewise, faults such as theft and dishonesty are regarded as undermining social peace and as a pillar of corruption.
Keywords: African folktales, hunger, education, mutual aid, social cohesion.
Arts in general, Computational linguistics. Natural language processing
Ixandra Achitouv, Dragos Gorduza, Antoine Jacquier
This article provides an understanding of Natural Language Processing techniques in the framework of financial regulation, specifically for performing semantic matching search between rules and policy when no dataset is available for supervised learning. We outline how to outperform simple pre-trained sentence-transformer models using freely available resources and explain the mathematical concepts behind the key building blocks of Natural Language Processing.
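Semantic matching of this kind typically reduces to ranking candidate texts by the similarity of their embedding vectors to a query embedding. The sketch below shows that ranking step with cosine similarity. The vectors in the usage note are hand-made toy values, not outputs of any model, and the function names are hypothetical; in practice the embeddings would come from a sentence-embedding model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_policies(
    rule_vector: Sequence[float],
    policy_vectors: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return policy identifiers sorted from most to least similar to the rule."""
    scored = [
        (policy_id, cosine_similarity(rule_vector, vector))
        for policy_id, vector in policy_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With a rule vector `[1.0, 0.0]` and toy policy vectors `{"A": [1.0, 0.1], "B": [0.0, 1.0], "C": [1.0, 1.0]}`, the ranking is A, C, B. Because no labelled pairs are needed for this step, it fits the unsupervised setting the article describes.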