Results for "African languages and literature"

Showing 20 of ~2184641 results · from arXiv, DOAJ, CrossRef

arXiv Open Access 2025
Dealing with the Hard Facts of Low-Resource African NLP

Yacouba Diarra, Nouhoum Souleymane Coulibaly, Panga Azazia Kamaté et al.

Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.

en cs.CL
arXiv Open Access 2025
Automatic Speech Recognition of African American English: Lexical and Contextual Effects

Hamid Mojarad, Kevin Tang

Automatic Speech Recognition (ASR) models often struggle with the phonetic, phonological, and morphosyntactic features found in African American English (AAE). This study focuses on two key AAE variables: Consonant Cluster Reduction (CCR) and ING-reduction. It examines whether the presence of CCR and ING-reduction increases ASR misrecognition. Subsequently, it investigates whether end-to-end ASR systems without an external Language Model (LM) are more influenced by lexical neighborhood effect and less by contextual predictability compared to systems with an LM. The Corpus of Regional African American Language (CORAAL) was transcribed using wav2vec 2.0 with and without an LM. CCR and ING-reduction were detected using the Montreal Forced Aligner (MFA) with pronunciation expansion. The analysis reveals a small but significant effect of CCR and ING on Word Error Rate (WER) and indicates a stronger presence of lexical neighborhood effect in ASR systems without LMs.

en cs.CL, cs.SD
arXiv Open Access 2025
Metrics, KPIs, and Taxonomy for Data Valuation and Monetisation -- A Systematic Literature Review

Eduardo Vyhmeister, Bastien Pietropaoli, Alejando Martinez Molina et al.

Data valuation and data monetisation are complex subjects but essential to most organisations today. Unfortunately, they still lack standard procedures and frameworks for organisations to follow. In this survey, we introduce the reader to the concepts by providing the definitions and the background required to better understand data, monetisation strategies, and finally metrics and KPIs used in these strategies. We have conducted a systematic literature review on metrics and KPIs used in data valuation and monetisation, in every aspect of an organisation's business, and by a variety of stakeholders. We provide an expansive list of such metrics and KPIs with 162 references. We then categorise all the metrics and KPIs found into a large taxonomy, following the Balanced Scorecard (BSC) approach with further subclustering to cover every aspect of an organisation's business. This taxonomy will help every level of data management understand the complex landscape of the domain. We also discuss the difficulty in creating a standard framework for data valuation and data monetisation and the major challenges the domain is currently facing.

en cs.DB
arXiv Open Access 2025
African Data Ethics: A Discursive Framework for Black Decolonial Data Science

Teanna Barrett, Chinasa T. Okolo, B. Biira et al.

The shift towards pluralism in global data ethics acknowledges the importance of including perspectives from the Global Majority to develop responsible data science practices that mitigate systemic harms in the current data science ecosystem. Sub-Saharan African (SSA) practitioners, in particular, are disseminating progressive data ethics principles and best practices for identifying and navigating anti-blackness and data colonialism. To center SSA voices in the global data ethics discourse, we present a framework for African data ethics informed by the thematic analysis of an interdisciplinary corpus of 50 documents. Our framework features six major principles: 1) Challenge Power Asymmetries, 2) Assert Data Self-Determination, 3) Invest in Local Data Institutions & Infrastructures, 4) Utilize Communalist Practices, 5) Center Communities on the Margins, and 6) Uphold Common Good. We compare our framework to seven particularist data ethics frameworks to find similar conceptual coverage but diverging interpretations of shared values. Finally, we discuss how African data ethics demonstrates the operational value of data ethics frameworks. Our framework highlights Sub-Saharan Africa as a pivotal site of responsible data science by promoting the practice of communalism, self-determination, and cultural preservation.

en cs.CY
arXiv Open Access 2025
African Gender Classification Using Clothing Identification Via Deep Learning

Samuel Ozechi

Human attribute identification and classification are crucial in computer vision, driving the development of innovative recognition systems. Traditional gender classification methods primarily rely on facial recognition, which, while effective, struggles under non-ideal conditions such as blurriness, side views, or partial occlusions. This study explores an alternative approach by leveraging clothing identification, specifically focusing on African traditional attire, which carries culturally significant and gender-specific features. We use the AFRIFASHION1600 dataset, a curated collection of 1,600 images of African traditional clothing labeled into two gender classes: male and female. A deep learning model, based on a modified VGG16 architecture and trained using transfer learning, was developed for classification. Data augmentation was applied to address the challenges posed by the relatively small dataset and to mitigate overfitting. The model achieved an accuracy of 87% on the test set, demonstrating strong predictive capability despite dataset imbalances favoring female samples. These findings highlight the potential of clothing-based identification as a complementary technique to facial recognition for gender classification in African contexts. Future research should focus on expanding and balancing datasets to enhance classification robustness and improve the applicability of clothing-based gender recognition systems.

en cs.CV, cs.AI
arXiv Open Access 2024
Evaluation of South African Candidate Sites for an Expanded Event Horizon Telescope

Senkhosi Simelane, Roger Deane, Athol Kemball et al.

Global expansion of the Event Horizon Telescope (EHT) will see the strategic addition of antennas at new geographical locations, transforming the sensitivity and imaging fidelity of the $\lambda \sim 1$ mm EHT array. A possible South African EHT station would leverage a strong geographical advantage, local infrastructure, and radio astronomy expertise, and have strong synergies with the Africa Millimetre Telescope in Namibia. We assessed three South African candidate millimetre sites using climatological simulations and antenna sensitivity estimates, and found at least two promising sites. These sites are comparable to some existing EHT stations during the typical April EHT observing window and outperform them during most of the year, especially the southern hemisphere winter. The results suggest that a strategically placed South African EHT station will have a sizable, positive impact on next-generation EHT objectives and the resulting black hole imaging science.

en astro-ph.IM
arXiv Open Access 2024
Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi

Shantipriya Parida, Shakshi Panwar, Kusum Lata et al.

Large language models (LLMs) have demonstrated transformative capabilities in many applications that require automatically generating responses based on human instruction. However, the major challenge for building LLMs, particularly in Indic languages, is the availability of high-quality data for building foundation LLMs. In this paper, we propose a large pre-training dataset for the Indic language Hindi. We have collected data spanning several domains, including major dialects of Hindi. The dataset contains 1.28 billion Hindi tokens. We explain our pipeline, including data collection, pre-processing, and availability for LLM pre-training. The proposed approach can be easily extended to other Indic and low-resource languages, and the dataset will be freely available for LLM pre-training and LLM research purposes.

en cs.CL, cs.AI
arXiv Open Access 2023
Ontologies in Digital Twins: A Systematic Literature Review

Erkan Karabulut, Salvatore F. Pileggi, Paul Groth et al.

Digital Twins (DT) facilitate monitoring and reasoning processes in cyber-physical systems. They have progressively gained popularity over the past years because of intense research activity and industrial advancements. Cognitive Twins is a novel concept, recently coined to refer to the involvement of Semantic Web technology in DTs. Recent studies address the relevance of ontologies and knowledge graphs in the context of DTs, in terms of knowledge representation, interoperability and automatic reasoning. However, there is no comprehensive analysis of how semantic technologies, and specifically ontologies, are utilized within DTs. This Systematic Literature Review (SLR) is based on the analysis of 82 research articles, that either propose or benefit from ontologies with respect to DT. The paper uses different analysis perspectives, including a structural analysis based on a reference DT architecture, and an application-specific analysis to specifically address the different domains, such as Manufacturing and Infrastructure. The review also identifies open issues and possible research directions on the usage of ontologies and knowledge graphs in DTs.

arXiv Open Access 2023
AfriNames: Most ASR models "butcher" African Names

Tobi Olatunji, Tejumade Afonja, Bonaventure F. P. Dossou et al.

Useful conversational agents must accurately capture named entities to minimize error for downstream tasks, for example, asking a voice assistant to play a track from a certain artist, initiating navigation to a specific location, or documenting a laboratory result for a patient. However, where named entities such as "Ukachukwu" (Igbo), "Lakicia" (Swahili), or "Ingabire" (Rwandan) are spoken, automatic speech recognition (ASR) models' performance degrades significantly, propagating errors to downstream systems. We model this problem as a distribution shift and demonstrate that such model bias can be mitigated through multilingual pre-training, intelligent data augmentation strategies to increase the representation of African-named entities, and fine-tuning multilingual ASR models on multiple African accents. The resulting fine-tuned models show an 81.5% relative WER improvement compared with the baseline on samples with African-named entities.

en cs.CL, cs.CY
arXiv Open Access 2023
Train Global, Tailor Local: Minimalist Multilingual Translation into Endangered Languages

Zhong Zhou, Jan Niehues, Alex Waibel

In many humanitarian scenarios, translation into severely low resource languages often does not require a universal translation engine, but a dedicated text-specific translation engine. For example, healthcare records, hygienic procedures, government communication, emergency procedures and religious texts are all limited texts. While generic translation engines for all languages do not exist, translation of multilingually known limited texts into new, endangered languages may be possible and reduce human translation effort. We attempt to leverage translation resources from many rich resource languages to efficiently produce best possible translation quality for a well known text, which is available in multiple languages, in a new, severely low resource language. We examine two approaches: 1. best selection of seed sentences to jump start translations in a new language in view of best generalization to the remainder of a larger targeted text(s), and 2. we adapt large general multilingual translation engines from many other languages to focus on a specific text in a new, unknown language. We find that adapting large pretrained multilingual models to the domain/text first and then to the severely low resource language works best. If we also select a best set of seed sentences, we can improve average chrF performance on new test languages from a baseline of 21.9 to 50.7, while reducing the number of seed sentences to only around 1,000 in the new, unknown language.

en cs.CL
arXiv Open Access 2023
HausaNLP at SemEval-2023 Task 12: Leveraging African Low Resource TweetData for Sentiment Analysis

Saheed Abdullahi Salahudeen, Falalu Ibrahim Lawan, Ahmad Mustapha Wali et al.

We present the findings of SemEval-2023 Task 12, a shared task on sentiment analysis for low-resource African languages using a Twitter dataset. The task featured three subtasks: subtask A is monolingual sentiment classification with 12 tracks, each covering a single language; subtask B is multilingual sentiment classification using the tracks in subtask A; and subtask C is zero-shot sentiment classification. We present the results and findings of subtask A, subtask B and subtask C. We also release the code on GitHub. Our goal is to leverage low-resource tweet data using pre-trained Afro-xlmr-large, AfriBERTa-Large, Bert-base-arabic-camelbert-da-sentiment (Arabic-camelbert), Multilingual-BERT (mBERT) and BERT models for sentiment analysis of 14 African languages. The datasets for these subtasks consist of gold-standard, multi-class labeled Twitter datasets from these languages. Our results demonstrate that the Afro-xlmr-large model performed better than the other models on most of the language datasets. Similarly, the Nigerian languages Hausa, Igbo, and Yoruba achieved better performance than the other languages, which can be attributed to the higher volume of data available in these languages.

en cs.CL
arXiv Open Access 2023
CASIA-Iris-Africa: A Large-scale African Iris Image Database

Jawad Muhammad, Yunlong Wang, Junxing Hu et al.

Iris biometrics is a phenotypic biometric trait that has proven to be agnostic to human natural physiological changes. Research on iris biometrics has progressed tremendously, partly due to publicly available iris databases. Various databases have been available to researchers that address pressing iris biometric challenges such as constraint, mobile, multispectral, synthetics, long-distance, contact lenses, liveness detection, etc. However, these databases mostly contain subjects of Caucasian and Asian descent with very few Africans. Despite many investigative studies on racial bias in face biometrics, very few studies on iris biometrics have been published, mainly due to the lack of racially diverse large-scale databases containing sufficient iris samples of Africans in the public domain. Furthermore, most of these databases contain a relatively small number of subjects and labelled images. This paper proposes a large-scale African database named CASIA-Iris-Africa that can be used as a complementary database for the iris recognition community to mediate the effect of racial biases on Africans. The database contains 28,717 images of 1023 African subjects (2046 iris classes) with age, gender, and ethnicity attributes that can be useful in demographically sensitive studies of Africans. Sets of specific application protocols are incorporated with the database to ensure the database's variability and scalability. Performance results of some open-source SOTA algorithms on the database are presented, which will serve as baseline performances. The relatively poor performances of the baseline algorithms on the proposed database despite better performance on other databases prove that racial biases exist in these iris recognition algorithms. The database will be made available on our website: http://www.idealtest.org.

en cs.CV
arXiv Open Access 2023
Relating Edge Computing and Microservices by means of Architecture Approaches and Features, Orchestration, Choreography, and Offloading: A Systematic Literature Review

Lucas Fernando Souza de Castro, Sandro Rigo

Context: Microservices running and being powered by Edge Computing have been gaining much attention in the industry and academia. Since 2014, when Martin Fowler popularized the Microservice term, many studies have been published relating these subjects to explore how the Edge's low-latency feature could be combined with the high throughput of the distributed paradigm from Microservices. Objective: Identifying how Microservices work together with Edge Computing whereas they take advantage when running on Edge. Method: In order to better understand this relationship, we first identified its key concepts, which are: architecture approaches and features, microservice composition (orchestration/choreography), and offloading. Afterward, we conducted a Systematic Literature Review (SLR) as the survey method. Results: We reviewed 111 selected studies and built a taxonomy of Microservices on Edge Computing demonstrating their current architecture approaches and features, composition, and offloading modes. Moreover, we identify the research gaps and trends. Conclusion: This paper is a step forward to help researchers and professionals get a general overview of how Microservices and Edge have been related in the last years. It also discusses gaps and research trends. This SLR will also be a good introduction for new researchers in Edge and Microservices.

en cs.DC
DOAJ Open Access 2023
Bantu negative verbs: a typological-comparative investigation of form, function and distribution

Rasmus Bernander, Maud Devos, Hannah Gibson

This paper investigates the use of verbs in the expression of negation across Bantu. The development of erstwhile lexical verbs into markers of negation is an understudied yet widespread phenomenon across the language family. In this sample of 100 languages, we identify a range of pathways as well as specific functions performed by such verbs. Specifically, we identify the use of both intrinsic negative verbs — i.e. those with inherent negative meanings — and extrinsic negative verbs — i.e. those which do not have an inherently negative meaning, but which have assumed dedicated negative functions through their use in specific constructions. This latter category can further be divided into two subtypes: those with negative inflectional material and those without. Across our sample, we explore the use of negative verbs in main clauses, non-declarative clauses such as prohibitives and interrogatives, non-verbal predicative clauses, as well as in non-main clause contexts such as in complement, relative and adverbial clauses. We further explore negative verbs for negation of infinitives and in non-clausal negation. In this endeavor, we offer a refined understanding of the development and distribution of verbs as negators in Bantu languages, while also broadening our understanding of negation in general.

African languages and literature
arXiv Open Access 2022
Systematic Literature Review of Gender and Software Engineering in Asia

Hironori Washizaki

It is essential to discuss the role, difficulties, and opportunities concerning people of different gender in the field of software engineering research, education, and industry. Although some literature reviews address software engineering and gender, it is still unclear how research and practices in Asia exist for handling gender aspects in software development and engineering. We conducted a systematic literature review to grasp the comprehensive view of gender research and practices in Asia. We analyzed the 32 identified papers concerning countries and publication years among 463 publications. Researchers and practitioners from various organizations actively work on gender research and practices in some countries, including China, India, and Turkey. We identified topics and classified them into seven categories varying from personal mental health and team building to organization. Future research directions include investigating the synergy between (regional) gender aspects and cultural concerns and considering possible contributions and dependency among different topics to have a solid foundation for accelerating further research and getting actionable practices.

en cs.SE, cs.GL
arXiv Open Access 2022
Journal of Economic Literature codes classification system (JEL)

Jussi T. S. Heikkila

The Journal of Economic Literature codes classification system (JEL) published by the American Economic Association (AEA) is the de facto standard classification system for research literature in economics. The JEL classification system is used to classify articles, dissertations, books, book reviews, and working papers in EconLit, a database maintained by the AEA. Over time, it has evolved and extended to a system with over 850 subclasses. This paper reviews the history and development of the JEL classification system, describes the current version, and provides a selective overview of its uses and applications in research. The JEL codes classification system has been adopted by several publishers, and their instructions are reviewed. There are interesting avenues for future research as the JEL classification system has been surprisingly little used in existing bibliometric and scientometric research as well as in library classification systems.

en econ.GN
arXiv Open Access 2021
Screening of the Characteristics of Hate Crimes against Asian American and Comparison to African Americans in Bay Area

Redacted by arXiv

COVID-19 has aided the spread of racism, as well as national insecurity, distrust of immigrants, and general xenophobia, all of which may be linked to the rise in anti-Asian hate crimes during the pandemic. Coronavirus Disease 2019 (COVID-19) is thought to have originated in late December 2019 in Wuhan, China, and quickly spread across the world during the spring months of 2020. Asian Americans recorded an increase in racially based hate crimes including physical abuse and intimidation as COVID-19 spread throughout the United States. This research study was conducted by high school students in the Bay Area to compare the intention and characteristics of hate crimes against Asian Americans to hate crimes against African Americans. According to studies of both victim-related and most offender-related variables, hate crimes against Asian Americans have been rapidly growing in the United States and vary from those against African Americans. This leads to an investigation into the racial disparity between Asian American offenders and those of other races. The nature and characteristics of hate crimes against Asian Americans are compared to those of hate crimes against African Americans in our research. According to studies of all victim-related factors, hate crimes against Asian Americans are similar to those against African Americans. Hate crimes against Asian Americans, on the other hand, vary greatly from hate crimes against African Americans in terms of the offender's ethnicity and all incident-related variables.

DOAJ Open Access 2021
Eastern African women writers’ ‘national epics’: A new force in creative fiction?

Annie Gagiano

In this article, I bring five recent, substantial novels by Eastern African women writers together for the first time in a study regarding the texts as modern ‘national epics’, analysing some of their shared characteristics in foregrounding local participation in the making of East African ethno-national histories. I trace the novelists’ implicit, open-eyed moral evaluation of their leaders and peoples, neither sentimentalising nor deriding the often terrible struggles of their peoples against both inside and outside powers that seek to keep them in subjugation. The texts eschew traditional heroic portrayal of single, male leaders in national epics and allow us to grasp diverse, communal contributions to the growth of nationhood, while giving larger, often central roles to women. The texts earn the epithet ‘epic’ by authoritatively demonstrating that their embodied, localised histories matter, testifying to the wide human spectrum of the peoples they portray; as novelistic acts they are impressive and moving bids for recognition. As post-colonial endeavours, the texts effectively decentre colonial interventions. While the chosen novels are shown to be relatable, their individual power of portrayal and aesthetic achievements are scrupulously differentiated.

African languages and literature
