Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss
Galim Turumtaev
Neural language models often struggle with low-resource languages due to the limited availability of training data, making tokens from these languages rare in the training set. This paper addresses a specific challenge during training: rare tokens are disproportionately affected by marginalization, which prevents them from learning effectively. We propose a thresholding technique that reduces the impact of this marginalization, allowing rare tokens to benefit from more meaningful alignment. Through experiments with a character-level language model, we demonstrate that this method significantly improves performance on low-resource language validation data. This work is the first to show how negative sampling can be applied to improve the representation of rare tokens by limiting the harmful influence of excessive marginalization, offering a new approach to enhancing language model performance for underrepresented languages.
Zygmunt Frajzyngier, Locative predications in Chadic languages: Implications for semantic analysis
Joseph Lovestrand
African languages and literature
Bialgebraic Reasoning on Stateful Languages
Sergey Goncharov, Stefan Milius, Lutz Schröder
et al.
Reasoning about program equivalence in imperative languages is notoriously challenging, as the presence of states (in the form of variable stores) fundamentally increases the observational power of program terms. The key desideratum for any notion of equivalence is compositionality, guaranteeing that subprograms can be safely replaced by equivalent subprograms regardless of the context. To facilitate compositionality proofs and avoid boilerplate work, one would hope to employ the abstract bialgebraic methods provided by Turi and Plotkin's powerful theory of mathematical operational semantics (a.k.a. abstract GSOS) or its recent extension by Goncharov et al. to higher-order languages. However, multiple attempts to apply abstract GSOS to stateful languages have thus far failed. We propose a novel approach to the operational semantics of stateful languages based on the formal distinction between readers (terms that expect an initial input store before being executed), and writers (running terms that have already been provided with a store). In contrast to earlier work, this style of semantics is fully compatible with abstract GSOS, and we can thus leverage the existing theory to obtain coinductive reasoning techniques. We demonstrate that our approach generates non-trivial compositionality results for stateful languages with first-order and higher-order store and that it flexibly applies to program equivalences at different levels of granularity, such as trace, cost, and natural equivalence.
Idefix-Closed Languages and Their Application in Contextual Grammars
Marvin Ködding, Bianca Truthe
In this paper, we continue the research on the power of contextual grammars with selection languages from subfamilies of the family of regular languages. We investigate infix-, prefix-, and suffix-closed languages (referred to as idefix-closed languages) and compare such language families to some other subregular families of languages (finite, monoidal, nilpotent, combinational, (symmetric) definite, ordered, non-counting, power-separating, commutative, circular, union-free, star, and comet languages). Further, we compare the families of the hierarchies obtained for external and internal contextual grammars with the language families defined by these new types for the selection. In this way, we extend the existing hierarchies by new language families. Moreover, we solve an open problem regarding internal contextual grammars with suffix-closed selection languages.
Identity Testing for Stochastic Languages
Smayan Agarwal, Shobhit Singh, Aalok Thakkar
Determining whether an unknown distribution matches a known reference is a cornerstone problem in distributional analysis. While classical results establish a rigorous framework in the case of distributions over finite domains, real-world applications in computational linguistics, bioinformatics, and program analysis demand testing over infinite combinatorial structures, particularly strings. In this paper, we initiate the theoretical study of identity testing for stochastic languages, bridging formal language theory with modern distribution property testing. We first propose a polynomial-time algorithm to verify if a finite state machine represents a stochastic language, and then prove that rational stochastic languages can approximate an arbitrary probability distribution. Building on these representations, we develop a truncation-based identity testing algorithm that distinguishes between a known and an unknown distribution with sample complexity $\widetilde{\Theta}\left( \frac{\sqrt{n}}{\varepsilon^2} + \frac{n}{\log n} \right)$ where $n$ is the size of the truncated support. Our approach leverages the exponential decay inherent in rational stochastic languages to bound truncation error, then applies classical finite-domain testers to the restricted problem. This work establishes the first identity testing framework for infinite discrete distributions, opening new directions in probabilistic formal methods and statistical analysis of structured data.
On the Representation and State Complexity of Block Languages
Guilherme Duarte, Nelma Moreira, Luca Prigioniero
et al.
In this paper, we consider block languages, namely sets of words having the same length, and we propose a new representation for these languages. In particular, given an alphabet of size $k$ and a length $\ell$, a block language can be represented by a bitmap of length $k^\ell$, where each bit indicates whether the corresponding word, according to the lexicographical order, belongs, or not, to the language (bit equal to 1 or 0, respectively). First, we show how to convert bitmaps into deterministic and nondeterministic finite automata, and we prove that the machines are minimal. Then, we give an analysis of the maximum number of states sufficient to accept every block language in the deterministic and nondeterministic case. Finally, we study the deterministic and nondeterministic state complexity of several operations on these languages. Since block languages are a subclass of finite languages, the upper bounds on operational state complexity known for finite languages apply to block languages as well. However, in several cases, smaller values were found.
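The bitmap representation described in this abstract can be illustrated with a minimal sketch. The function names `bitmap_from_language` and `language_from_bitmap` are hypothetical, and the convention that the leftmost bit corresponds to the lexicographically smallest word is an assumption; the paper may order the bits differently.

```python
from itertools import product


def bitmap_from_language(words, alphabet, length):
    """Encode a block language as a string of k**length bits.

    Assumption: bit i (counting from the left, starting at 0) stands for
    the i-th word of the given length in lexicographic order.
    """
    members = set(words)
    # product() over a sorted alphabet enumerates words in lexicographic order.
    return "".join(
        "1" if "".join(letters) in members else "0"
        for letters in product(sorted(alphabet), repeat=length)
    )


def language_from_bitmap(bitmap, alphabet, length):
    """Decode a bitmap back into the list of words it represents."""
    letters = sorted(alphabet)
    k = len(letters)
    if len(bitmap) != k ** length:
        raise ValueError("bitmap must have exactly k**length bits")
    words = []
    for index, bit in enumerate(bitmap):
        if bit != "1":
            continue
        # Read the index as a base-k number with `length` digits.
        digits = []
        n = index
        for _ in range(length):
            n, d = divmod(n, k)
            digits.append(letters[d])
        words.append("".join(reversed(digits)))
    return words


# Over {a, b} with length 2 the words are aa, ab, ba, bb in that order.
example_bitmap = bitmap_from_language({"ab", "bb"}, "ab", 2)
example_words = language_from_bitmap(example_bitmap, "ab", 2)
```

With this convention the language {ab, bb} over the alphabet {a, b} is encoded by four bits, one per word of length 2.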
Subsequence Matching and Analysis Problems for Formal Languages
Szilárd Zsolt Fazekas, Tore Koß, Florin Manea
et al.
In this paper, we study a series of algorithmic problems related to the subsequences occurring in the strings of a given language, under the assumption that this language is succinctly represented by a grammar generating it, or an automaton accepting it. In particular, we focus on the following problems: Given a string $w$ and a language $L$, does there exist a word of $L$ which has $w$ as subsequence? Do all words of $L$ have $w$ as a subsequence? Given an integer $k$ alongside $L$, does there exist a word of $L$ which has all strings of length $k$, over the alphabet of $L$, as subsequences? Do all words of $L$ have all strings of length $k$ as subsequences? For the last two problems, efficient algorithms were already presented in [Adamson et al., ISAAC 2023] for the case when $L$ is a regular language, and efficient solutions can be easily obtained for the first two problems. We extend that work as follows: we give sufficient conditions on the class of input-languages, under which these problems are decidable; we provide efficient algorithms for all these problems in the case when the input language is context-free; we show that all problems are undecidable for context-sensitive languages. Finally, we provide a series of initial results related to a class of languages that strictly includes the regular languages and is strictly included in the class of context-sensitive languages, but is incomparable to the class of context-free languages; these results deviate significantly from those reported for language-classes from the Chomsky hierarchy.
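For the regular case, the first two problems in this abstract can be sketched as a search over pairs (automaton state, number of letters of $w$ matched so far). This is an illustrative sketch only, not the authors' algorithm: it assumes $L$ is given by a (possibly partial) deterministic finite automaton encoded as a dictionary, and all function and parameter names are hypothetical.

```python
from collections import deque


def reachable_pairs(transitions, start, w):
    """Explore pairs (state, matched) reachable in the product construction.

    `transitions` maps (state, symbol) -> next state (assumed DFA encoding).
    `matched` is the length of the longest prefix of w that is a
    subsequence of the word read so far (greedy leftmost matching).
    """
    seen = {(start, 0)}
    queue = deque(seen)
    while queue:
        state, matched = queue.popleft()
        for (src, symbol), dst in transitions.items():
            if src != state:
                continue
            advance = matched < len(w) and symbol == w[matched]
            nxt = (dst, matched + 1 if advance else matched)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def some_word_contains(transitions, start, accepting, w):
    """Does some word of L have w as a subsequence?"""
    pairs = reachable_pairs(transitions, start, w)
    return any(q in accepting and m == len(w) for q, m in pairs)


def all_words_contain(transitions, start, accepting, w):
    """Do all words of L have w as a subsequence? (True if L is empty.)"""
    pairs = reachable_pairs(transitions, start, w)
    return all(m == len(w) for q, m in pairs if q in accepting)


# Example DFA for the language a*b: state 0 is initial, state 1 is accepting.
dfa = {(0, "a"): 0, (0, "b"): 1}
```

Because the automaton is deterministic, every word leads to exactly one pair, so the universal question reduces to checking that no accepting pair with an incomplete match is reachable.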
Culture in Design of Coding Toolkits for Young Learners in Developing Economies in Africa: A Review
Koech Cheruiyot, Koech Janeth Chepngetich
With the increasing importance of and demand for coding skills in today’s digital world, developing economies in Africa are making every effort to include coding education in their educational curricula. However, the existing coding toolkits for young learners have a different cultural orientation, making them hard to learn and use and thereby rendering coding education programs and learning outcomes ineffective. The aim of the study was to carry out a systematic literature review with particular emphasis on learning environment and pedagogy, use of local language, and gender-related norms, and to explore the potential challenges, with recommendations for building culturally responsive coding kits. It was found that the design of culturally responsive coding toolkits has the potential to improve the accessibility, affordability, relevance and effectiveness of coding education for young learners in developing economies in Africa, contributing to more effective coding education and therefore helping to bridge the digital divide.
Preface
Zhen Li, Allen Asiimwe, Patrick Kanampiu
et al.
African languages and literature
“It’s easy to just be normal”: Performative masculinities in Die stropers
Danel Boshoff, Annemi Conradie
In post-apartheid South Africa, the representation of queer identities in Afrikaans-language films has become increasingly prominent. However, many such films focus on past contexts or serve as comedic depictions, failing to confront the complex issues faced by individuals in contemporary Afrikaner communities. Etienne Kallos’s 2018 coming-of-age film Die stropers (The Harvesters) candidly depicts the challenges faced by adoptive brothers Janno and Pieter, who must negotiate desire and gender in a conservative, Afrikaans farming community. Drawing on Judith Butler’s notion of gender performativity and recent reformulations of Raewyn Connell’s notion of hegemonic masculinity, we examine the discursive (re)production of hegemonic Afrikaner masculinity within the boys’ adoptive family and community. We investigate how the protagonists strategically deploy performative masculinities to achieve belonging and acceptance as heirs, even if the ideals they emulate oppress and restrict them. We propose that the protagonists’ strategic deployment of performative masculinities represents their claiming of agency in contexts where gender and desire are aggressively policed. Furthermore, their navigation of hegemony in this white, Afrikaner community through resistance and consent might point to a claiming of space for queer identities and the continued transmutation of hegemonic masculinity.
African languages and literature
Bot (Carien Smith)
Carmi Britz
African languages and literature
Detecting the Presence of COVID-19 Vaccination Hesitancy from South African Twitter Data Using Machine Learning
Nicholas Perikli, Srimoy Bhattacharya, Blessing Ogbuokiri
et al.
Very few social media studies have been done on South African user-generated content (UGC) during the COVID-19 pandemic and even fewer using hand-labelling over automated methods. Vaccination is a major tool in the fight against the pandemic, but vaccine hesitancy jeopardizes any public health effort. In this study, sentiment analysis on South African tweets related to vaccine hesitancy was performed, with the aim of training AI-mediated classification models and assessing their reliability in categorizing UGC. A dataset of 30000 tweets from South Africa was extracted and hand-labelled into one of three sentiment classes: positive, negative, neutral. The machine learning models used were LSTM, bi-LSTM, SVM, BERT-base-cased and RoBERTa-base, whose hyperparameters were carefully chosen and tuned using the WandB platform. We used two different approaches when we pre-processed our data for comparison: one was semantics-based, while the other was corpus-based. The pre-processing of the tweets in our dataset was performed using both methods, respectively. All models were found to have low F1-scores within a range of 45$\%$-55$\%$, except for BERT and RoBERTa which both achieved significantly better measures with overall F1-scores of 60$\%$ and 61$\%$, respectively. Topic modelling using an LDA was performed on the misclassified tweets of the RoBERTa model to gain insight on how to further improve model accuracy.
Environmental and Health Risks of Pesticide Use in Ethiopia
B. Negatu, Sisay Dugassa, Y. Mekonnen
Background. There are frequent reports of unsafe pesticide use in many parts of Africa. Ethiopia is the second most populous nation in Africa with around 80% of the population still depending on agriculture which intensively uses pesticides. A number of studies have examined pesticide-related health and environmental risks in Ethiopia. However, most of these studies have been small in scale and it is therefore challenging to get a general overview of the extent of health risks and level of environmental contamination in the country. Objectives. The aim of the present study was to synthesize and summarize contemporary knowledge on pesticide-related risks and relevant gaps in Ethiopia. Methods. An electronic database search and gathering of grey literature were done to collect information on the risks of pesticide use in Ethiopia. The electronic search was conducted using MEDLINE (via PubMed) without any publication date or language specifications. The Preferred Reporting Items for Systematic Reviews and Meta Analyses (PRISMA) checklist was used as guide in the creation of this review. Discussion. A synthesis of the reviewed studies showed evidence of health risks due to occupational pesticide exposure, surface water pollution with pesticides that could cause chronic health risks to the public, evidence of pesticide contamination of the environment (e.g., soil organisms, fish, bee colonies and wildlife) and local as well as international consumer risks due to pesticide residues in food items. In addition, there have been frequent reports of health and environmental hazards in association with cut-flower farms. There is also evidence of direct use of DDT (dichlorodiphenyltrichloroethane) on food crops and detection of DDT residues in surface water, soil and human breast milk. 
Those reported risks might be due to lack of knowledge among farm workers, negligence of farm owners, absence of post-registration monitoring systems and poor implementation of both national and international regulations in Ethiopia due to poor institutional capacity. Conclusions. The health and environmental risks of inappropriate use of pesticides require action by all concerned bodies. Improved institutional arrangements for enforcement of regulations, awareness and further intervention studies could lessen the high risks of pesticide misuse. Competing Interests. The authors declare no competing financial interests.
Rethinking China's Soft Power: “Pragmatic Enticement” of Confucius Institutes in Ethiopia
M. Repnikova
Abstract This article examines China's most controversial soft power export – the Confucius Institute initiative – through the case study of its promotion and implementation in Ethiopia. As one of China's closest partners in Africa, Ethiopia presents a path-breaking case for examining the potential and the limitations of Confucius Institutes. In contrast to the existing literature that depicts Confucius Institutes largely as contested and limited initiatives, this article shows that Confucius Institutes and Confucius Classrooms have thus far been relatively successful in Ethiopia. Specifically, China's fusion of practical or tangible benefits with language and cultural promotion – what I describe as “pragmatic enticement” – invokes support from key participants in this project, including university administrators, students and Chinese teachers. In the long term, however, even in the highly favourable context of Ethiopia, the sustainability of Confucius Institutes is questionable, as there are apparent gaps between the rising expectations of Ethiopian administrators and students, and the limited resources on the ground.
Affordances of digital platforms in sub‐Saharan Africa: An analytical review
O. Daramola, Ernest Etim
The proliferation of digital platforms in sub‐Saharan Africa (SSA) has increased job opportunities for informal businesses. However, the technological affordance of digital platforms in SSA has received very little attention in literature reviews. This paper presents an analytical review of the technological affordances of digital platforms in SSA. We adopted a systematic literature review methodology that entailed gathering evidence from relevant literature and digital platforms in three countries in SSA (South Africa, Nigeria, and Kenya). Three research questions focusing on (i) critical issues of discussion on digital platforms in SSA and their distribution; (ii) understanding the technological affordances of digital platforms in SSA; and (iii) assessing the extent to which the technological affordances of digital platforms can address the challenges of the informal sector in SSA drove the investigation. The study developed an affordances criteria framework, the first of its kind, to evaluate digital platform affordances in SSA. Our findings revealed the key topics that have gained researchers' attention so far and the distribution of digital platforms in SSA. We also found that most digital platforms in SSA currently lack affordances that support multimodal interfaces and native language content, limiting their accessibility to persons with minimal digital literacy. Most platforms also lack support for public/private partnerships. The lack of critical affordances that match the profile of the informal sector in SSA makes existing digital platforms incapable of addressing many critical challenges of the informal sector. These challenges include low literacy levels, minimal/lack of government support, lack of electricity or expensive power supply, expensive internet connectivity, and health and safety issues. However, existing digital platforms can reasonably address the absence of social protection, lack of formalization, and poor operational infrastructure.
As a contribution, this paper offers a new perspective on the technological affordances of digital platforms in SSA. It also provides insight into the imperatives for designing inclusive and accessible digital platforms for the informal sector in SSA.
22 citations
Computer Science
A Comparative Study of Ideophones in Dagbani and Gurenԑ
Samuel Awinkene Atintono, Samuel Alhassan Issah
Ideophones have been analysed in the linguistic literature as a class of words that depict sensory imagery; they are established in most natural languages, but particularly in African and Asian languages. They have gained popularity for their uniqueness in linguistic forms, including their peculiar phonological, morphological, syntactic, and semantic behaviour and rich sensory meanings. The question as to whether or not ideophones constitute a separate word class has attracted the attention of many researchers in recent years. This paper examines the linguistic characteristics of ideophones in Dagbani and Gurenԑ, two Mabia (Gur) Niger-Congo languages spoken in the Northern and Upper East Regions of Ghana, respectively. We demonstrate that ideophones in these languages exhibit peculiar phonological, morphological, syntactic, and semantic characteristics. Based on these peculiar characteristics, we assume that ideophones constitute a distinct word class in these languages. Ideophones in Dagbani and Gurenԑ have not received any systematic linguistic investigation. This study is therefore significant since it sheds new light on an area that has not received linguistic attention in the two languages under study.
Language and Literature, Social Sciences
A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature
Sara Lafia, Lizhou Fan, Libby Hemphill
Discovering authoritative links between publications and the datasets that they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets, which complements the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall for literature to review for inclusion in data-related collections of publications and makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with datasets they reference. Together, these contributions enable future work on data reference, data citation networks, and data reuse.
Formal Semantics for the Halide Language
Alex Reinking, Gilbert Louis Bernstein, Jonathan Ragan-Kelley
We present the first formalization and metatheory of language soundness for a user-schedulable language, the widely used array processing language Halide. User-schedulable languages strike a balance between abstraction and control in high-performance computing by separating the specification of what a program should compute from a schedule for how to compute it. In the process, they make a novel language soundness claim: the result of a program should always be the same, regardless of how it is scheduled. This soundness guarantee is tricky to provide in the presence of schedules that introduce redundant recomputation and computation on uninitialized data, rather than simply reordering statements. In addition, Halide ensures memory safety through a compile-time bounds inference engine that determines safe sizes for every buffer and loop in the generated code, presenting a novel challenge: formalizing and analyzing a language specification that depends on the results of unreliable program synthesis algorithms. Our formalization has revealed flaws and led to improvements in the practical Halide system, and we believe it provides a foundation for the design of new languages and tools that apply programmer-controlled scheduling to other domains.
ASFAP Working Group Summary of Societal Engagements
Mounia Laassiri, Marie Clementine Nibamureke, Bertrand Tchanche Fankam
et al.
The second African Conference of Fundamental and Applied Physics (ACP2021) took place in the week of March 7-11, 2022. During this conference, all the African Strategy for Fundamental and Applied Physics (ASFAP) working groups were allotted special sessions to discuss their scope, activities (past and current) and topics of common interest. The aim of this report is to summarize the discussions of the ASFAP working groups in societal engagements, namely Physics Education, Community Engagement, Young Physicists and Women in Physics. The recommendations for future activities in societal engagements are summarised in the report as well.
Examining the Implementation of Digital Health to Strengthen the COVID-19 Pandemic Response and Recovery and Scale up Equitable Vaccine Access in African Countries
Olufunto A Olusanya, Brianna White, Chad A Melton
et al.
The COVID-19 pandemic has profoundly impacted the world, having taken the lives of over 6 million individuals. Accordingly, this pandemic has caused a shift in conversations surrounding the burden of diseases worldwide, welcoming insights from multidisciplinary fields including digital health and artificial intelligence. Africa faces a heavy disease burden that exacerbates the current COVID-19 pandemic and limits the scope of public health preparedness, response, containment, and case management. Herein, we examined the potential impact of transformative digital health technologies in mitigating the global health crisis with reference to African countries. Furthermore, we proposed recommendations for scaling up digital health technologies and artificial intelligence-based platforms to tackle the transmission of SARS-CoV-2 and enable equitable vaccine access. Challenges related to the pandemic are numerous. Rapid response and management strategies - that is, contact tracing, case surveillance, diagnostic testing intensity, and most recently vaccine distribution mapping - can overwhelm a fragile health care delivery system. Although challenges are vast, digital health technologies can play an essential role in achieving sustainable resilient recovery and building back better. It is plausible that African nations are better equipped to rapidly identify, diagnose, and manage infected individuals for COVID-19, other diseases, future outbreaks, and pandemics.