Results for "Language acquisition"

Showing 20 of ~1,402,985 results · from arXiv, DOAJ, CrossRef

DOAJ Open Access 2026
Fostering the learning of L3 French and Spanish phonology in heritage bilinguals

Christoph Gabriel, Jonas Grünke

Even though there were already calls around the turn of the millennium for the heritage languages of learners to be included in foreign-language teaching (e.g., Gogolin 1994, Hu 2003), the acquisition of phonology in the context of migration-induced multilingualism is still an emerging area of research in both linguistics and language pedagogy. Based on recent studies on the acquisition of L3 French and Spanish phonology by heritage bilinguals (e.g., Gabriel et al. 2014/2018/2022, Grünke & Gabriel 2022), it is posited that the acquisition of a foreign language’s sound system is enhanced by metalinguistic and phonological awareness. Furthermore, such awareness can be fostered by imparting declarative knowledge regarding the similarities and differences between the languages of the learners’ linguistic background and the foreign languages they learn. Whether interventions can foster facilitative transfer from a heritage language onto the target language seems to be contingent upon three factors: (i) the maintenance of the relevant feature in the learner’s heritage language, (ii) the degree of activation of the learner’s heritage language during the learning process, and (iii) the perceptual salience of the feature at issue. In the light of insights from phonetic learning, we argue that intensive pronunciation training over a longer period is required to support and consolidate the positive, sometimes short-term effects found in the studies.

Romanic languages, Philology. Linguistics
arXiv Open Access 2025
Language-Independent Sentiment Labelling with Distant Supervision: A Case Study for English, Sepedi and Setswana

Koena Ronny Mabokela, Tim Schlippe, Mpho Raborife et al.

Sentiment analysis is a helpful task to automatically analyse opinions and emotions on various topics in areas such as AI for Social Good, AI in Education or marketing. While many sentiment analysis systems are developed for English, many African languages are classified as low-resource languages due to the lack of digital language resources such as text labelled with the corresponding sentiment classes. One reason is that manually labelling text data is time-consuming and expensive. Consequently, automatic and rapid processes are needed to reduce the manual effort and make the labelling process as efficient as possible. In this paper, we present and analyse an automatic language-independent sentiment labelling method that leverages information from sentiment-bearing emojis and words. Our experiments are conducted on tweets in English, Sepedi and Setswana from SAfriSenti, a multilingual sentiment corpus for South African languages. We show that our sentiment labelling approach is able to label the English tweets with an accuracy of 66%, the Sepedi tweets with 69%, and the Setswana tweets with 63%, so that on average only 34% of the automatically generated labels remain to be corrected.
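The emoji-and-word labelling idea described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' method: the tiny lexicons, the equal weighting of emojis and words, and the tie-breaking rule are all assumptions.

```python
# Minimal sketch of distant-supervision sentiment labelling from
# sentiment-bearing emojis and words. The tiny lexicons below are
# illustrative placeholders, not the resources used in the paper.
POS_CUES = {"😀", "😍", "👍", "love", "great", "good"}
NEG_CUES = {"😠", "😢", "👎", "hate", "bad", "terrible"}

def label_tweet(text: str) -> str:
    tokens = text.lower().split()
    chars = set(text)  # single-codepoint emojis appear as characters
    pos = sum(t in POS_CUES for t in tokens) + len(chars & POS_CUES)
    neg = sum(t in NEG_CUES for t in tokens) + len(chars & NEG_CUES)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # no cue or a tie: left for manual correction

print(label_tweet("I love this 😀"))        # positive
print(label_tweet("terrible service 😠"))   # negative
```

Tweets for which no cue fires would be the ones left for manual correction, which is how such a scheme reduces, rather than eliminates, annotation effort.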

en cs.CL, cs.AI
DOAJ Open Access 2025
How does cultural capital influence academic achievement in the Chinese context? A systematic literature review

Fenlian Xie, Fonny Dameaty Hutagalung

This article reviewed how cultural capital theory is applied in China, focusing on the definition of cultural capital, the aspects of cultural capital most relevant to academic achievement, and the association between cultural capital and academic achievement. The review encompassed empirical studies published between 2001 and 2024, sourced from both English and Chinese databases, including Web of Science, Scopus, ERIC, ProQuest, CNKI, Weipu, and Wanfang. A total of 20 quantitative and qualitative studies were included. Based on a convergent segregated approach, the article suggests that cultural capital in the Chinese context is defined as a cultural resource consisting of skills, knowledge, and dispositions that are acknowledged and rewarded in schools. In contrast to the focus on highbrow culture in Western research, parent-child reading, participation in shadow education, and English proficiency are emphasized as core aspects of cultural capital in the Chinese context. Additionally, the article finds a weaker association between cultural capital and academic achievement in China. These findings provide new insights into social reproduction in China, extending Bourdieu's cultural capital theory, which suggests that unequal competition between classes exists within the family, to competition outside the family, such as shadow education and the acquisition of English language skills. Finally, this article discusses the applicability of lower-class cultural capital and provides directions for future research.

History of scholarship and learning. The humanities, Social Sciences
arXiv Open Access 2024
XferBench: a Data-Driven Benchmark for Emergent Language

Brendon Boldt, David Mortensen

In this paper, we introduce a benchmark for evaluating the overall quality of emergent languages using data-driven methods. Specifically, we interpret the notion of the "quality" of an emergent language as its similarity to human language within a deep learning framework. We measure this by using the emergent language as pretraining data for downstream NLP tasks in human language -- the better the downstream performance, the better the emergent language. We implement this benchmark as an easy-to-use Python package that only requires a text file of utterances from the emergent language to be evaluated. Finally, we empirically test the benchmark's validity using human, synthetic, and emergent language baselines.

en cs.CL
arXiv Open Access 2024
Talking the Talk Does Not Entail Walking the Walk: On the Limits of Large Language Models in Lexical Entailment Recognition

Candida M. Greco, Lucio La Cava, Andrea Tagarelli

Verbs form the backbone of language, providing the structure and meaning of sentences. Yet, their intricate semantic nuances pose a longstanding challenge. Understanding verb relations through the concept of lexical entailment is crucial for comprehending sentence meanings and grasping verb dynamics. This work investigates the capabilities of eight Large Language Models in recognizing lexical entailment relations among verbs through differently devised prompting strategies and zero-/few-shot settings over verb pairs from two lexical databases, namely WordNet and HyperLex. Our findings unveil that the models can tackle the lexical entailment recognition task with moderately good performance, although at varying degrees of effectiveness and under different conditions. Also, utilizing few-shot prompting can enhance the models' performance. However, perfectly solving the task remains an unmet challenge for all examined LLMs, which calls for further research on this topic.

en cs.CL, cs.AI
DOAJ Open Access 2024
Innovative Approaches to Arabic Vocabulary Learning for Islamic Senior High School Students with Lectora Inspire

Umi Hijriyah, Relit Nur Edi, Erlina Erlina et al.

Mastering vocabulary is essential for language acquisition, particularly in Arabic. This study aims to develop and evaluate the feasibility and effectiveness of AVLI (Arabic Vocabulary on Lectora Inspire) for grade X students at Baitul Jannah Integrated Islamic High School. Employing a research and development approach, this study utilized the ADDIE development model, which includes analysis, design, development, implementation, and evaluation phases. Data were collected through documentation, observation, interviews, questionnaires, and tests. The findings revealed that the Lectora Inspire-based mufradat learning media is deemed feasible, achieving a feasibility test score of 75% from material expert validators and 90% from media validators. User feedback was positive, with 86% of 106 respondents finding the learning media engaging. Moreover, AVLI proved effective in enhancing mufradat learning, evidenced by an increase in average student scores from 65.5 on the pretest to 85.5 on the posttest, a rise of 30 points. Consequently, this study concludes that Lectora Inspire-based learning media is both feasible and effective for teaching mufradat to grade X students, highlighting its potential implications for Arabic language education.

Education (General), Islam
arXiv Open Access 2023
NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models

Yongchao Chen, Rujul Gandhi, Yang Zhang et al.

Temporal Logic (TL) can be used to rigorously specify complex high-level specifications for systems in many engineering applications. The translation between natural language (NL) and TL has been under-explored due to the lack of datasets and generalizable models across different application domains. In this paper, we propose an accurate and generalizable transformation framework of English instructions from NL to TL, exploring the use of Large Language Models (LLMs) at multiple stages. Our contributions are twofold. First, we develop a framework to create a dataset of NL-TL pairs combining LLMs and human annotation. We publish a dataset with 28K NL-TL pairs. Then, we fine-tune T5 models on the lifted versions (i.e., the specific Atomic Propositions (APs) are hidden) of the NL and TL. The enhanced generalizability originates from two aspects: 1) Usage of lifted NL-TL characterizes common logical structures, without constraints of specific domains. 2) Application of LLMs in dataset creation largely enhances corpus richness. We test the generalization of trained models on five varied domains. To achieve full NL-TL transformation, we either combine the lifted model with the AP recognition task or perform further fine-tuning on each specific domain. During further fine-tuning, our model achieves higher accuracy (>95%) using less than 10% of the training data, compared with the baseline sequence-to-sequence (Seq2Seq) model.
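The "lifted" representation described in this abstract, hiding specific atomic propositions behind placeholders so the model learns domain-independent logical structure, can be illustrated with a toy example. The AP phrases and the `prop_i` placeholder format below are invented for illustration, not taken from the paper.

```python
# Toy sketch of "lifting": replace specific atomic propositions (APs)
# with placeholders so a model sees only the logical skeleton.
def lift(sentence: str, ap_phrases: list[str]):
    mapping = {}
    for i, phrase in enumerate(ap_phrases, start=1):
        placeholder = f"prop_{i}"
        sentence = sentence.replace(phrase, placeholder)
        mapping[placeholder] = phrase  # kept to restore APs later
    return sentence, mapping

nl = "always avoid the obstacle until you reach the goal"
lifted, aps = lift(nl, ["avoid the obstacle", "reach the goal"])
print(lifted)  # always prop_1 until you prop_2
print(aps)     # {'prop_1': 'avoid the obstacle', 'prop_2': 'reach the goal'}
```

A model trained on such lifted pairs learns that "always X until Y" maps to one TL template regardless of whether X is about obstacles or temperatures, which is the source of the cross-domain generalization the authors report.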

en cs.CL
DOAJ Open Access 2023
Teaching and learning the four English skills before and during the COVID-19 era: perceptions of EFL faculty and students in Saudi higher education

Siham Mousa Alhaider

At the peak of the COVID-19 pandemic in 2020, synchronous online learning replaced physical classes worldwide. However, sustaining the quality of education during this transition was a challenge. Specifically, teaching and learning the four English skills (listening, speaking, reading and writing) among English-as-a-foreign-language (EFL) learners in contexts such as Saudi Arabia required greater adjustments to meet learners’ demands during the pandemic. Within this context, this study explored how online teaching and learning helped develop the four English skills among Saudi EFL students. A survey was conducted among EFL faculty and learners in a Saudi Arabian higher-learning institution regarding the diverse aspects of language skills development before and during the pandemic. Data were analyzed using descriptive and inferential statistics. The results showed that e-learning was most beneficial for writing and least beneficial for speaking and that writing was most nurtured and adopted in almost all schoolwork, while speaking was neglected during online courses. Indeed, instructors must be prepared for an effective online learning experience in the future. This study aims to strengthen the field of EFL teaching and learning in Saudi Arabia and related contexts by describing learners’ perceptions of their experiences before and during the pandemic.

Special aspects of education, Language acquisition
DOAJ Open Access 2023
Does instruction through mother tongue impede students’ English proficiency? public skepticism on the instructional language policy in the state of Amhara

Melaku Gebremeske

This study examined whether learning in the mother tongue affects students’ English proficiency in the State of Amhara. It also explores experiences from across the world to look for possible means of equipping students with the required competences in English. The study relied on a review of the literature and documentary examination, employing the content analysis method of the qualitative approach. Findings revealed that instruction through the mother tongue per se does not impede the acquisition of English proficiency, nor does the use of English as the medium of instruction necessarily improve pupils’ English proficiency. Rather, the failure to carry out valid instructional language planning, government reluctance to consult the necessary stakeholders during policy formulation, and poor implementation procedures have prevented the intended goals of the instructional language policy from being realized. Apart from the lack of resource supply, the defects in its formulation procedure have lingering effects on the realization of the instructional language policy. In general, although many scholars in the field have labeled the current policy as one of the best instructional language policies formulated in Ethiopia so far, the irregularities and inconsistencies in its practice have hindered it from meeting the intentions advocated at the outset. Consequently, the implementation of instruction through the mother tongue, like similar interventions by previous governments, has been a matter of political expediency rather than a well-thought-out and planned reform. Hence, employing a contextually sound instructional language policy would alleviate the challenges pupils face in their English proficiency and ultimately help them compete in the prevailing globalizing socio-economic world.

arXiv Open Access 2022
Self-Generated In-Context Learning: Leveraging Auto-regressive Language Models as a Demonstration Generator

Hyuhng Joon Kim, Hyunsoo Cho, Junyeob Kim et al.

Large-scale pre-trained language models (PLMs) are well known for being able to solve a task simply by conditioning on a few input-label pairs, dubbed demonstrations, in a prompt, without being explicitly tuned for the desired downstream task. Such a process (i.e., in-context learning), however, naturally leads to high reliance on the demonstrations, which are usually selected from external datasets. In this paper, we propose self-generated in-context learning (SG-ICL), which generates demonstrations for in-context learning from the PLM itself to minimize reliance on external demonstrations. We conduct experiments on four different text classification tasks and show that SG-ICL significantly outperforms zero-shot learning and is generally worth approximately 0.6 gold training samples. Moreover, our generated demonstrations show more consistent performance with lower variance compared to demonstrations randomly selected from the training dataset.

en cs.CL
arXiv Open Access 2022
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown et al.

Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and the video can be found at https://say-can.github.io/.
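The core selection rule this abstract describes, combining a language model's judgment of what is useful with a value function's judgment of what is feasible, can be sketched in a few lines. All skill names and scores below are invented stand-ins, not values from the paper.

```python
# Sketch of SayCan-style skill selection: pick the skill maximizing
# (LLM relevance to the instruction) x (value-function feasibility in
# the current state). The numbers below are illustrative placeholders.
llm_score = {            # p(skill is useful for "clean up the spill")
    "find a sponge": 0.5,
    "pick up the sponge": 0.3,
    "go to the table": 0.2,
}
value_score = {          # p(skill succeeds from the current state)
    "find a sponge": 0.9,
    "pick up the sponge": 0.1,  # no sponge in view yet
    "go to the table": 0.8,
}

def select_skill(skills):
    return max(skills, key=lambda s: llm_score[s] * value_score[s])

best = select_skill(list(llm_score))
print(best)  # find a sponge: 0.5 * 0.9 = 0.45 beats the alternatives
```

Note how the product suppresses "pick up the sponge" even though the language model finds it relevant: the value function knows the robot cannot yet succeed at it, which is exactly the grounding the abstract argues a language model alone lacks.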

en cs.RO, cs.CL
arXiv Open Access 2022
Large Language Models are Few-Shot Clinical Information Extractors

Monica Agrawal, Stefan Hegselmann, Hunter Lang et al.

A long-running goal of the clinical NLP community is the extraction of important variables trapped in clinical notes. However, roadblocks have included dataset shift from the general domain and a lack of public clinical corpora and annotations. In this work, we show that large language models, such as InstructGPT, perform well at zero- and few-shot information extraction from clinical text despite not being trained specifically for the clinical domain. Whereas text classification and generation performance have already been studied extensively in such models, here we additionally demonstrate how to leverage them to tackle a diverse set of NLP tasks which require more structured outputs, including span identification, token-level sequence classification, and relation extraction. Further, due to the dearth of available data to evaluate these systems, we introduce new datasets for benchmarking few-shot clinical information extraction based on a manual re-annotation of the CASI dataset for new tasks. On the clinical extraction tasks we studied, the GPT-3 systems significantly outperform existing zero- and few-shot baselines.
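A zero-/few-shot extraction prompt of the kind this abstract describes can be assembled as below. The instruction wording, the example notes, and the helper name `build_prompt` are hypothetical illustrations, not the paper's actual prompts.

```python
# Sketch of few-shot prompt construction for clinical information
# extraction. All note text and wording below is invented.
def build_prompt(note: str, examples: list[tuple[str, str]]) -> str:
    parts = ["Extract the medications mentioned in the clinical note."]
    for text, meds in examples:  # few-shot demonstrations
        parts.append(f"Note: {text}\nMedications: {meds}")
    parts.append(f"Note: {note}\nMedications:")  # query, left open
    return "\n\n".join(parts)

prompt = build_prompt(
    "Pt started on metformin 500 mg bid.",
    [("Continue lisinopril 10 mg daily.", "lisinopril")],
)
print(prompt)
```

The completion a model produces after the final "Medications:" is then parsed into structured output; the paper's contribution includes showing that such structured tasks (spans, token labels, relations) can be recovered this way despite the free-text interface.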

en cs.CL, cs.AI
arXiv Open Access 2022
Analogy Generation by Prompting Large Language Models: A Case Study of InstructGPT

Bhavya Bhavya, Jinjun Xiong, Chengxiang Zhai

We propose a novel application of prompting Pre-trained Language Models (PLMs) to generate analogies and study how to design effective prompts for two task settings: generating a source concept analogous to a given target concept (aka Analogous Concept Generation or ACG), and generating an explanation of the similarity between a given pair of target concept and source concept (aka Analogous Explanation Generation or AEG). We found that it is feasible to prompt InstructGPT to generate meaningful analogies and the best prompts tend to be precise imperative statements especially with a low temperature setting. We also systematically analyzed the sensitivity of the InstructGPT model to prompt design, temperature, and injected spelling errors, and found that the model is particularly sensitive to certain variations (e.g., questions vs. imperative statements). Further, we conducted human evaluation on 1.4k of the generated analogies and found that the quality of generations varies substantially by model size. The largest InstructGPT model can achieve human-level performance at generating meaningful analogies for a given target while there is still room for improvement on the AEG task.

en cs.CL, cs.AI
arXiv Open Access 2022
Exploring the Value of Pre-trained Language Models for Clinical Named Entity Recognition

Samuel Belkadi, Lifeng Han, Yuping Wu et al.

The practice of fine-tuning Pre-trained Language Models (PLMs) from general or domain-specific data to a specific task with limited resources has gained popularity within the field of natural language processing (NLP). In this work, we revisit this assumption and carry out an investigation in clinical NLP, specifically Named Entity Recognition on drugs and their related attributes. We compare Transformer models trained from scratch to fine-tuned BERT-based models, namely BERT, BioBERT, and ClinicalBERT. Furthermore, we examine the impact of an additional CRF layer on such models to encourage contextual learning. We use the n2c2-2018 shared task data for model development and evaluation. The experimental outcomes show that 1) CRF layers improved all language models; 2) on BIO-strict span-level evaluation using macro-average F1 score, the fine-tuned models achieved scores of 0.83+, while the TransformerCRF model trained from scratch achieved 0.78+, demonstrating comparable performance at much lower cost, e.g. with 39.80% fewer training parameters; 3) on BIO-strict span-level evaluation using weighted-average F1 score, ClinicalBERT-CRF, BERT-CRF, and TransformerCRF exhibited smaller score differences, at 97.59%/97.44%/96.84% respectively; 4) applying efficient training by down-sampling for a better data distribution further reduced the training cost and the need for data while maintaining similar scores, i.e. around 0.02 points lower compared to using the full dataset. Our models will be hosted at https://github.com/HECTA-UoM/TransformerCRF

en cs.CL, cs.AI
arXiv Open Access 2021
Abusive and Threatening Language Detection in Urdu using Boosting based and BERT based models: A Comparative Approach

Mithun Das, Somnath Banerjee, Punyajoy Saha

Online hatred is a growing concern on many social media platforms. To address this issue, different social media platforms have introduced moderation policies for such content. They also employ moderators who can check posts violating moderation policies and take appropriate action. Academicians in the abusive language research domain also conduct various studies to better detect such content. Although there is extensive research on abusive language detection in English, there is a lacuna in abusive language detection for low-resource languages like Hindi and Urdu. In the FIRE 2021 shared task "HASOC: Abusive and Threatening Language Detection in Urdu", the organizers propose an abusive language detection dataset in Urdu along with threatening language detection. In this paper, we explored several machine learning models, such as XGBoost, LGBM, and m-BERT-based models, for abusive and threatening content detection in Urdu based on the shared task. We observed that a Transformer model specifically trained on an abusive language dataset in Arabic helps achieve the best performance. Our model came first for both abusive and threatening content detection with F1 scores of 0.88 and 0.54, respectively.

en cs.CL
arXiv Open Access 2021
Deep Transfer Learning & Beyond: Transformer Language Models in Information Systems Research

Ross Gruetzemacher, David Paradice

AI is widely thought to be poised to transform business, yet current perceptions of the scope of this transformation may be myopic. Recent progress in natural language processing involving transformer language models (TLMs) offers a potential avenue for AI-driven business and societal transformation that is beyond the scope of what most currently foresee. We review this recent progress as well as recent literature utilizing text mining in top IS journals to develop an outline for how future IS research can benefit from these new techniques. Our review of existing IS literature reveals that suboptimal text mining techniques are prevalent and that the more advanced TLMs could be applied to enhance and increase IS research involving text data, and to enable new IS research topics, thus creating more value for the research community. This is possible because these techniques make it easier to develop very powerful custom systems and their performance is superior to existing methods for a wide range of tasks and applications. Further, multilingual language models make possible higher quality text analytics for research in multiple languages. We also identify new avenues for IS research, like language user interfaces, that may offer even greater potential for future IS research.

en cs.CL, cs.AI
arXiv Open Access 2021
Reusable Templates and Guides For Documenting Datasets and Models for Natural Language Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards

Angelina McMillan-Major, Salomey Osei, Juan Diego Rodriguez et al.

Developing documentation guidelines and easy-to-use templates for datasets and models is a challenging task, especially given the variety of backgrounds, skills, and incentives of the people involved in the building of natural language processing (NLP) tools. Nevertheless, the adoption of standard documentation practices across the field of NLP promotes more accessible and detailed descriptions of NLP datasets and models, while supporting researchers and developers in reflecting on their work. To help with the standardization of documentation, we present two case studies of efforts that aim to develop reusable documentation templates -- the HuggingFace data card, a general purpose card for datasets in NLP, and the GEM benchmark data and model cards with a focus on natural language generation. We describe our process for developing these templates, including the identification of relevant stakeholder groups, the definition of a set of guiding principles, the use of existing templates as our foundation, and iterative revisions based on feedback.

en cs.DB, cs.CL
DOAJ Open Access 2021
The problem of developing textbook requirements for primary inclusive education

Olga Gribova, Julia Shulekina

This article discusses problems with the functioning of textbooks in modern inclusive education. A textbook is understood as a learning tool, the most important component of learning of a modern schoolchild. Irrespective of the type of educational organization and the form of education, the textbook should solve educational tasks connected with knowledge acquisition and the formation of basic educational competencies. Nowadays, the question of the requirement for textbooks to be accessible to different categories of students is raised. It is emphasised that the development of the 'right' textbook is one of the factors that determines the success and effectiveness of students throughout the learning process. The authors relate textbook requirements for inclusive education with the basic learning process, which is text comprehension. Based on the description of the specifics of students with speech impairment, we have modeled ways to optimize an existing Russian language textbook and adapt it to the comprehension abilities of this category of students. The research is presented as a review of fundamental resources that address various aspects of the development of textbook requirements. It is based on a psychological and pedagogical approach, which focuses on the special educational needs of students with disabilities as a reference point for modernization of existing textbooks, because the textbooks which were developed for normally developing students are ineffective in teaching children with disabilities. The article presents the difficulties which students with speech impairments may encounter at the initial stages of work with a textbook and proposes ways of solving potential problems. The results of the study are practically significant for updating the methodological support of primary education in the Russian Federation.

Page 25 of 70,150