Results for "African languages and literature"
Showing 20 of ~4929 results · from DOAJ, arXiv
Prosperous Nankindu, Henry Hollan Ssembatya
Through critical research methods using document analysis, this study investigated current Language in Education Policy (LiEP) debates in Africa. Many such debates are ongoing across the continent, but the available literature is thin. Three multilingual African countries, the Federal Republic of Nigeria, the Republic of Congo, and the Islamic Republic of Mauritania, were selected for the study. The study found that colonial languages are dominant and serve as the main media of instruction and languages of assessment in schools in the sampled countries. Although only three countries were selected, the debates are not substantially different in the rest of the African countries. The findings of this research are therefore generalisable to the situation across the entire continent and are critical for influencing future LiEP on the continent. It is imperative to note that the use of colonial languages in education should not come at the expense of African languages.
Marcelo Arenas, Pablo Barceló, Luis Cofré et al.
Kleinberg and Mullainathan showed that language generation in the limit is always possible at the level of computability: given enough positive examples, a learner can eventually generate data indistinguishable from a target language. However, such existence results do not address feasibility. We study the sample complexity of language generation in the limit for several canonical classes of formal languages. Our results show that infeasibility already appears for context-free and regular languages, and persists even for strict subclasses such as locally threshold testable languages, as well as for incomparable classes such as non-erasing pattern languages, a well-studied class in the theory of language identification. Overall, our results establish a clear gap between the theoretical possibility of language generation in the limit and its computational feasibility.
Edward Ajayi, Eudoxie Umwari, Mawuli Deku et al.
This study examines the digital representation of African languages and the challenges this presents for current language detection tools. We evaluate their performance on Yoruba, Kinyarwanda, and Amharic. While these languages are spoken by millions, their online usage on conversational platforms is often sparse, heavily influenced by English, and not representative of the authentic, monolingual conversations prevalent among native speakers. This lack of readily available authentic data online creates a scarcity of conversational data for training language models. To investigate this, data was collected from subreddits and local news sources for each language. The analysis showed a stark contrast between the two sources. Reddit data was minimal and characterized by heavy code-switching. Conversely, local news media offered a robust source of clean, monolingual language data, which also prompted more user engagement in the local language on the news publishers' social media pages. Language detection models, including the specialized AfroLID and a general LLM, performed with near-perfect accuracy on the clean news data but struggled with the code-switched Reddit posts. The study concludes that professionally curated news content is a more reliable and effective source for training context-rich AI models for African languages than data from conversational platforms. It also highlights the need for future models that can process both clean and code-switched text to improve detection accuracy for African languages.
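A rough sketch of this kind of comparison, assuming the off-the-shelf fastText lid.176 language-identification model (which covers Yoruba and Amharic; Kinyarwanda coverage would need to be checked) and two invented example sentences standing in for clean news text and code-switched Reddit text; the paper's actual AfroLID and LLM setups are not reproduced here:

```python
# Sketch: compare language-ID confidence on clean monolingual text vs.
# code-switched social-media text. Model path and example sentences are illustrative.
import fasttext

model = fasttext.load_model("lid.176.bin")  # pre-trained fastText LID model

samples = {
    "clean_news_yoruba": "Ìjọba ìpínlẹ̀ ti kéde ìlànà tuntun fún àwọn ilé-ìwé.",
    "code_switched_reddit": "The new policy ò yé mi rárá, but we move sha.",
}

for name, text in samples.items():
    labels, probs = model.predict(text, k=1)
    print(f"{name}: {labels[0]} (p={probs[0]:.2f})")
```

On inputs like these, detectors typically return a confident single label for the clean sentence but a low-confidence or English label for the code-switched one, which is the failure mode the study describes.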
Odunayo Ogundepo, Akintunde Oladipo, Kelechi Ogueji et al.
Researchers working on low-resource languages face persistent challenges due to limited data availability and restricted access to computational resources. Although most large language models (LLMs) are predominantly trained in high-resource languages, adapting them to low-resource contexts, particularly African languages, requires specialized techniques. Several strategies have emerged for adapting models to low-resource languages in today's LLM landscape, defined by multi-stage pre-training and post-training paradigms. However, the most effective approaches remain uncertain. This work systematically investigates which adaptation strategies yield the best performance when extending existing LLMs to African languages. We conduct extensive experiments and ablation studies to evaluate different combinations of data types (translated versus synthetically generated), training stages (pre-training versus post-training), and other model adaptation configurations. Our experiments focus on mathematical reasoning tasks, using the Llama 3.1 model family as our base model.
Ife Adebara, Hawau Olamide Toyin, Nahom Tesfu Ghebremichael et al.
Africa's rich linguistic heritage remains underrepresented in NLP, largely due to historical policies that favor foreign languages and create significant data inequities. In this paper, we integrate theoretical insights on Africa's language landscape with an empirical evaluation using Sahara - a comprehensive benchmark curated from large-scale, publicly accessible datasets capturing the continent's linguistic diversity. By systematically assessing the performance of leading large language models (LLMs) on Sahara, we demonstrate how policy-induced data variations directly impact model effectiveness across African languages. Our findings reveal that while a few languages perform reasonably well, many Indigenous languages remain marginalized due to sparse data. Leveraging these insights, we offer actionable recommendations for policy reforms and inclusive data practices. Overall, our work underscores the urgent need for a dual approach - combining theoretical understanding with empirical evaluation - to foster linguistic diversity in AI for African communities.
AbdelRahim Elmadany, Ife Adebara, Muhammad Abdul-Mageed
We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, we introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters respectively. Next, we finetune these models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs. To evaluate Toucan, we carefully develop an extensive machine translation benchmark, dubbed AfroLingu-MT. Toucan significantly outperforms other models, showcasing its remarkable performance on MT for African languages. Finally, we train a new model, spBLEU-1K, to enhance translation evaluation metrics, covering 1K languages, including 614 African languages. This work aims to advance the field of NLP, fostering cross-cultural understanding and knowledge exchange, particularly in regions with limited language resources such as Africa. The GitHub repository for the Toucan project is available at https://github.com/UBC-NLP/Toucan.
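The spBLEU-style scoring mentioned above is commonly computed with sacrebleu's SentencePiece-based tokenizer; the minimal sketch below uses the standard flores200 tokenizer with placeholder sentences and is not the paper's spBLEU-1K model:

```python
# Sketch: corpus-level spBLEU via sacrebleu's SentencePiece ("flores200") tokenizer.
# Hypothesis/reference strings are placeholders; requires a recent sacrebleu release.
from sacrebleu.metrics import BLEU

hypotheses = ["Umuntu wese afite uburenganzira bwo kwiga."]
references = [["Umuntu wese afite uburenganzira ku burezi."]]

spbleu = BLEU(tokenize="flores200")
print(spbleu.corpus_score(hypotheses, references))
```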
David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime et al.
Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g., African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench -- a human-translated benchmark dataset for 17 typologically diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based question answering (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and 6 proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We also observe a significant performance gap between open and proprietary models, with the best-performing open model, Gemma 2 27B, reaching only 63% of the performance of the best-performing proprietary model, GPT-4o. In addition, machine-translating the test set into English before evaluation helped to close the gap for larger English-centric models such as Gemma 2 27B and LLaMa 3.1 70B. These findings suggest that more effort is needed to develop and adapt LLMs for African languages.
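A minimal sketch of the translate-test setting described above, where each test item is machine-translated into English before the model is queried; translate and answer_index are hypothetical stand-ins for an MT system and an LLM, not part of IrokoBench:

```python
# Sketch of translate-test accuracy on multiple-choice items.
# `translate(text, src, tgt)` and `answer_index(question, choices)` are
# hypothetical stand-ins for an MT system and an LLM.
def translate_test_accuracy(items, src_lang, translate, answer_index):
    correct = 0
    for item in items:  # item: {"question": str, "choices": [str, ...], "label": int}
        question_en = translate(item["question"], src_lang, "en")
        choices_en = [translate(c, src_lang, "en") for c in item["choices"]]
        correct += int(answer_index(question_en, choices_en) == item["label"])
    return correct / len(items)
```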
Polo B. Moji
Jiayi Wang, David Ifeoluwa Adelani, Sweta Agrawal et al.
Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).
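The metric-quality criterion used above, Spearman-rank correlation with human judgments, can be computed as in this minimal sketch; the segment scores are made up for illustration:

```python
# Sketch: rank correlation between an MT metric's segment scores and
# human direct-assessment (DA) ratings. Values below are illustrative only.
from scipy.stats import spearmanr

metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60]  # learned-metric scores per segment
human_scores = [78, 40, 91, 30, 55]             # human DA ratings per segment

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```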
Andani Madodonga, Vukosi Marivate, Matthew Adendorff
Local/native South African languages are classified as low-resource languages. As such, it is essential to build resources for these languages so that they can benefit from advances in the field of natural language processing. In this work, the focus was to create annotated news datasets for the isiZulu and Siswati native languages based on news topic classification tasks and to present the findings from baseline classification models. Due to the shortage of data for these native South African languages, the datasets that were created were augmented and oversampled to increase data size and overcome class imbalance. In total, four different classification models were used, namely Logistic Regression, Naive Bayes, XGBoost and LSTM. These models were trained on three different feature representations, namely Bag-of-Words, TF-IDF and Word2vec. The results of this study showed that XGBoost, Logistic Regression and LSTM trained on Word2vec performed better than the other combinations.
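One of the baseline combinations listed above (TF-IDF features with Logistic Regression) can be sketched as follows; the toy texts and labels are placeholders, not the isiZulu/Siswati news datasets:

```python
# Sketch: TF-IDF + Logistic Regression news-topic baseline (placeholder data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["placeholder isiZulu sports article", "placeholder isiZulu politics article"]
train_labels = ["sports", "politics"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)
print(clf.predict(["placeholder unseen isiZulu article"]))
```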
Johan J. Steenkamp
No abstract available.
Troy E. Spier
Linguistic treatments of Bantu languages have traditionally focused on broadly historical/comparative studies or on prototypical characteristics of the family, such as the nominal class system, the complexity of the verbal TAM system, or the tonal system. Consequently, far less attention has been paid to the nominal phrase as a syntactic unit. To this end, Rugemalira (2007) proposes greater emphasis on Bantu morphosyntax generally. As such, the present study – situated within a broader discussion of the Bantu NP (cf. Chitebeta 2007, Godson & Godson 2015, Lusekelo 2009, Makanjila 2019, Möller 2011, Ondondo 2015, Rugemalira 2007) – builds upon Spier (2016, 2020, 2021) and introduces the first descriptive account of the nominal phrase in Ikyaushi, an underdocumented linguistic variety spoken in the Republic of Zambia and the Democratic Republic of Congo. The data for this study, which derive from fourteen narratives shared orally by male and female native speakers of the grandparental generation, indicate that seven distinct elements may co-occur with the nominal, but utterances with between one and three co-occurring adnominals are far more frequently attested and more straightforwardly comprehensible to speakers.
Malgorzata Drwal
In this article, I examine the Garment Workers’ Union’s theatre as a manifestation of transnational working-class culture in the 1940s. Analysing Pageant of Unity (1940), a play in which Afrikaans and English alternate to express the equality of Afrikaans- and English-speaking workers in the face of exploitation, I offer an attempt to escape the confines of a national literature as linked to a single language. I demonstrate how the political pageant—a genre typical of socialist propaganda and international trade unionism—was adapted to a South African context. This drama is, therefore, viewed as a product of cultural mobility between Europe, the United States, and South Africa. Assuming the ‘follow the actor’ approach of Bruno Latour’s Actor-Network Theory, I identify a network of interconnections between the nodes formed by human (drama practitioners and theoreticians, socialist organisers) and nonhuman actors (texts representing socialist drama conventions, in particular agitprop techniques). Tracing the inspirations and adaptations of conventions, I argue that Pageant of Unity most evidently realises the prescriptions outlined by the Russian drama theoretician Vsevolod Meyerhold whose approach influenced Guy Routh, one of the pageant’s creators. Thus, I focus on how this propaganda production utilises certain features of the Soviet avant-garde theatre, which testifies to the transnational character of South African working-class culture.
Steward van Wyk
The intellectual history of Coloured intelligentsia in the middle of the previous century is often characterised by a sharp division between moderates and radicals. The historiography of Coloured people is relatively limited, but a few scholars have studied their intellectual formation against the background of historical circumstances, their specific biographies and institutional operations. In this article I give a broad overview of the intellectual history and historiography of Coloured intelligentsia. I focus in more detail on the poet and educationist P. J. Philander. He is associated with the moderate grouping and characterised as a political gradualist who favoured steady and incremental change in the political system towards democracy. I argue for further engagement with the ideas and actions of the poet and propose that liberal sentiments in his life and work provide additional and important perspectives. This would also explain his disenchantment with the apartheid regime and his consequent decision to emigrate to the United States of America, where he pursued a long and illustrious career as an educator at a Quaker school in New York. I analyse two texts with an autobiographical purview to indicate these strands of liberal influence and thought.
Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev et al.
Existing data-to-text generation datasets are mostly limited to English. To address this lack of data, we create Table-to-Text in African languages (TaTa), the first large multilingual table-to-text dataset with a focus on African languages. We created TaTa by transcribing figures and accompanying text in bilingual reports by the Demographic and Health Surveys Program, followed by professional translation to make the dataset fully parallel. TaTa includes 8,700 examples in nine languages including four African languages (Hausa, Igbo, Swahili, and Yorùbá) and a zero-shot test language (Russian). We additionally release screenshots of the original figures for future research on multilingual multi-modal approaches. Through an in-depth human evaluation, we show that TaTa is challenging for current models and that less than half the outputs from an mT5-XXL-based model are understandable and attributable to the source data. We further demonstrate that existing metrics perform poorly for TaTa and introduce learned metrics that achieve a high correlation with human judgments. We release all data and annotations at https://github.com/google-research/url-nlp.
Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed et al.
Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind test set, AfroLID achieves an F1 score of 95.89. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding that it outperforms them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically motivated error analysis that allow us to showcase both AfroLID's powerful capabilities and its limitations.
Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach et al.
Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapting to a new language is language adaptive fine-tuning (LAFT) -- fine-tuning a multilingual PLM on monolingual texts of a language using the pre-training objective. However, adapting to each target language individually requires large disk space and limits the cross-lingual transfer abilities of the resulting models because they have been specialized for a single language. In this paper, we perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we remove vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT, thus reducing the model size by around 50%. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive with applying LAFT on individual languages while requiring significantly less disk space. Additionally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter-efficient fine-tuning methods.
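The vocabulary-reduction step described above can be sketched conceptually as below: keep only embedding rows whose tokens are written in the target scripts (assumed here to be Latin, Ethiopic, and Arabic, plus non-alphabetic pieces) and rebuild a smaller embedding matrix. This is a simplified illustration, not the authors' pipeline; a full implementation would also rebuild the tokenizer and retie the output layer before running MAFT:

```python
# Conceptual sketch: shrink XLM-R's input embeddings to tokens written in the
# scripts of the target languages. The script test is simplified, and a real
# pipeline would also rebuild the tokenizer and the tied LM head.
import unicodedata
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

KEEP_SCRIPTS = ("LATIN", "ETHIOPIC", "ARABIC")  # assumption about target scripts

def uses_kept_script(token: str) -> bool:
    letters = [c for c in token if c.isalpha()]
    if not letters:
        return True  # keep digits, punctuation, and special pieces
    return all(unicodedata.name(c, "").startswith(KEEP_SCRIPTS) for c in letters)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

keep_ids = sorted(i for tok, i in tokenizer.get_vocab().items() if uses_kept_script(tok))

old_emb = model.get_input_embeddings().weight.data
new_emb = torch.nn.Embedding(len(keep_ids), old_emb.size(1))
new_emb.weight.data = old_emb[keep_ids].clone()
model.set_input_embeddings(new_emb)
print(f"kept {len(keep_ids)} of {old_emb.size(0)} embedding rows")
```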
Ife Adebara, Muhammad Abdul-Mageed
Aligning with the ACL 2022 special theme on "Language Diversity: from Low Resource to Endangered Languages", we discuss the major linguistic and sociopolitical challenges facing the development of NLP technologies for African languages. Situating African languages in a typological framework, we discuss how the particulars of these languages can be harnessed. To facilitate future research, we also highlight current efforts, communities, venues, datasets, and tools. Our main objective is to motivate and advocate for an Afrocentric approach to technology development. With this in mind, we recommend what technologies to build and how to build, evaluate, and deploy them based on the needs of local African communities.
Nguh Nwei Asanga Fon, Emmanuel Achiri
The crisis in the Anglophone regions of Cameroon that began as a protest by teachers' and lawyers' trade unions in late 2016 is becoming an intractable conflict which, if not addressed, risks destabilizing the entire Central African sub-region. Using Zartman's "ripeness" as a theoretical premise, this paper analyses the evolution of the conflict and proposes dual-track diplomacy as a possible solution to break the present deadlock. Given the difficulties for both sides to escalate their way to victory and the growing, unsustainable cost of a prolonged confrontation, the present situation shows significant traces of a mutually hurting stalemate that, we propose, can be exploited by actors interested in resolution of the conflict. The need for, and possible policy implications of, pursuing a dual-track diplomatic approach are explored here. It is clear that dual-track diplomacy can contribute greatly to bringing a lasting solution to the Anglophone crisis.
Page 3 of 247