Results for "English language"

S2 Open Access 2021

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major et al.

The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

6803 citations en Computer Science

S2 Open Access 2020

BERTweet: A pre-trained language model for English Tweets

Dat Quoc Nguyen, Thanh Vu, A. Nguyen

We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks: Part-of-speech tagging, Named-entity recognition and text classification. We release BERTweet under the MIT License to facilitate future research and applications on Tweet data. Our BERTweet is available at https://github.com/VinAIResearch/BERTweet

1057 citations en Computer Science

S2 Open Access 2020

Teaching English As A Second Foreign Language

Nargiza Satullaeva, Gulperi Kurbanbaeva

The paper deals with practical issues of a multilingual approach to teaching foreign languages. The author investigates some questions concerning the process of teaching English as a second foreign language on the basis of the first foreign language, German. The process under study is rather complicated. It requires taking into consideration the linguistic experience of students. In addition, this process should be based on the results of a comparative-contrastive analysis of the contacting languages: English, German and Russian (the mother tongue of the students). This phenomenon may be called a forming educational multilingualism, of which interference and transference are characteristic features. In order to overcome the latter and to improve the educational process, a comparative study of language phenomena is necessary. Having found similar and dissimilar features in both languages, the teacher will be able to optimize the learning process of the second language.

1129 citations en Sociology

S2 Open Access 2018

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Taku Kudo, John Richardson

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.

4026 citations en Computer Science
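
The SentencePiece abstract above describes training subword models directly from raw sentences, without pre-tokenizing into words. A minimal sketch of that idea follows. It is not the library's code or API: it swaps in a toy byte-pair-encoding trainer written from scratch, and the names `train_bpe`, `apply_merge`, `encode`, and `decode` are hypothetical. The one detail borrowed from SentencePiece is the convention of writing whitespace as the visible symbol "▁" (U+2581), which lets detokenization be a plain string join.

```python
from collections import Counter

SPACE = "\u2581"  # visible stand-in for a space, so whitespace is an ordinary symbol


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(sentences, num_merges):
    """Learn merge rules from raw sentences (toy BPE, no word pre-tokenization)."""
    corpus = [list(s.replace(" ", SPACE)) for s in sentences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = [apply_merge(symbols, best) for symbols in corpus]
    return merges


def encode(text, merges):
    """Segment raw text by replaying the learned merges in order."""
    symbols = list(text.replace(" ", SPACE))
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def decode(pieces):
    """Detokenize: join the pieces and turn the marker back into spaces."""
    return "".join(pieces).replace(SPACE, " ")


merges = train_bpe(["hello world", "hello there"], num_merges=10)
pieces = encode("hello world", merges)
restored = decode(pieces)
```

Because spaces survive as symbols inside the pieces, `decode(encode(text, merges))` returns the original text as long as the input does not itself contain the marker character. This toy version also merges across spaces freely, which a production tokenizer may restrict.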

S2 Open Access 2005

The Language of Evaluation: Appraisal in English

Jeannett Martin, P. White

3005 citations en Psychology

S2 Open Access 2020

HateBERT: Retraining BERT for Abusive Language Detection in English

Tommaso Caselli, Valerio Basile, Jelena Mitrović et al.

We introduce HateBERT, a re-trained BERT model for abusive language detection in English. The model was trained on RAL-E, a large-scale dataset of Reddit comments in English from communities banned for being offensive, abusive, or hateful that we have curated and made available to the public. We present the results of a detailed comparison between a general pre-trained language model and the retrained version on three English datasets for offensive, abusive language and hate speech detection tasks. In all datasets, HateBERT outperforms the corresponding general BERT model. We also discuss a battery of experiments comparing the portability of the fine-tuned models across the datasets, suggesting that portability is affected by compatibility of the annotated phenomena.

467 citations en Computer Science

S2 Open Access 2021

THE ROLE OF ENGLISH AS GLOBAL LANGUAGE

Nurlaily Sofyan

English is now used as a lingua franca: most native and non-native speakers of English use it as their mode of communication in business matters and business organizations. The English language has spread all over the world into almost all fields, such as science, engineering and technology, medicine, trade and commerce, scientific research, education, tourism, the internet, banking, business, advertising, the film industry, transportation, and pharmacy, to name a few. The international expansion of English has been predicted and promoted regularly by leaders on both sides of the Atlantic over the past 200 years. In October 1934 the Carnegie Foundation sponsored a conference in New York on ‘The use of English as a world language’. English as a lingua franca (ELF) is used in a number of fields in which English is the mutual language of choice, such as science, politics, business, and education. Higher-education materials in English are available to learners and teachers wherever they are located, and most students, teachers and research scholars make use of this opportunity to enhance their qualifications as well as their knowledge. The spread of English as an international language and the advent of the internet as a fast communication channel without boundaries are together reinforcing novel trends and bringing tremendous changes in this age of globalization. English is the only language that is internationally spoken and accepted. Furthermore, the impact of the English language on the internet has been disclosed above, and it has also been proved that there are many websites using English as their medium of communication. Even though certain fields are not mentioned, it seems that English plays a dominant role in those areas as well.

336 citations en Political Science

S2 Open Access 2018

Politics and the English Language

G. Orwell

Most people who bother with the matter at all would admit that the English language is in a bad way, but it is generally assumed that we cannot by conscious action do anything about it. Our civilization is decadent and our language – so the argument runs – must inevitably share in the general collapse. It follows that any struggle against the abuse of language is a sentimental archaism, like preferring candles to electric light or hansom cabs to aeroplanes. Underneath this lies the half-conscious belief that language is a natural growth and not an instrument which we shape for our own purposes. Now, it is clear that the decline of a language must ultimately have political and economic causes: it is not due simply to the bad influence of this or that individual writer. But an effect can become a cause, reinforcing the original cause and producing the same effect in an intensified form, and so on indefinitely. A man may take to drink because he feels himself to be a failure, and then fail all the more completely because he drinks. It is rather the same thing that is happening to the English language. It becomes ugly and inaccurate because our thoughts are foolish, but the slovenliness of our language makes it easier for us to have foolish thoughts. The point is that the process is reversible. Modern English, especially written English, is full of bad habits which spread by imitation and which can be avoided if one is willing to take the necessary trouble. If one gets rid of these habits one can think more clearly, and to think clearly is a necessary first step toward political regeneration: so that the fight against bad English is not frivolous and is not the exclusive concern of professional writers. I will come back to this presently, and I hope that by that time the meaning of what I have said here will have become clearer. Meanwhile, here are five specimens of the English language as it is now habitually written. 
These five passages have not been picked out because they are especially bad – I could have quoted far worse if I had chosen – but because they illustrate various of the mental vices from which we now suffer. They are a little below the average, but are fairly representative examples. I number them so that I can refer back to them when necessary:

327 citations en

DOAJ Open Access 2025

Visualizing and analyzing global knowledge maps and emerging research trends in tumor-derived exosomes using CiteSpace

Ruijun Guo, Jiajun Xu, Chunxia Li et al.

Objective: Tumor-derived exosome testing can be effective in diagnosing disease and assisting in its treatment. Our study utilizes bibliometric analysis to identify research hotspots related to tumor-derived exosomes, predict emerging research frontiers and development trends, and offer diverse perspectives for advancing research in this field. Methods: We searched the Web of Science Core Collection for English-language literature on tumor-derived exosomes published from 2015 to 2024. CiteSpace (6.2.R3) software was utilized to visualize the distribution of countries/regions, institutions, authors, co-cited authors, and co-cited journals within the relevant literature. Additionally, co-occurrence, clustering, and emergence analyses were conducted on the co-cited references. Results: An analysis of 2523 articles meeting the inclusion criteria revealed a steady increase in the number of publications in this field over the past decade. In terms of countries/regions, institutions, authors, and journals that published articles, the most productive were China, Ministry of Education-China, Theresa L. Whiteside, and CANCER RES, respectively. The most influential were the United States, Harvard University, Theresa L. Whiteside, and CANCER RES, respectively. The mechanisms underlying exosomal PD-L1 and engineered exosomes are currently prominent research foci, warranting meticulous examination by the academic community. Conclusion: Within the field of research on tumor-derived exosomes, current investigations appear largely concentrated on the exosomal PD-L1 mechanism and engineered exosomes. Possible future research hotspots will focus on the use of engineered exosomes to target tumor cells and as a drug delivery platform for more precise therapeutic targeting.

Neoplasms. Tumors. Oncology. Including cancer and carcinogens

DOAJ Open Access 2025

Addressing food insecurity among U.S. refugees, considering the temporal patterns of food insecurity after resettlement: Qualitative insights from Utah.

Nasser Sharareh, Rachel Dalrymple, Konstantinos N Kambouris et al.

Background: Refugees experience high rates of food insecurity (FI) and its associated health outcomes, such as depression and hypertension. Prior research has identified barriers in accessing food among U.S. refugees. What remains unknown is when accessing food becomes a problem for U.S. refugees and what their preferred strategies are to address FI. Therefore, the objectives were to explore FI experiences among refugees to identify time points at which accessing food becomes a problem and to identify refugees' preferred strategies to address FI. Methods: In collaboration with one of the U.S. resettlement agencies in Utah, refugees were recruited for semi-structured interviews using convenience and snowball sampling. Thirty-six interviews were conducted between July and September 2024, in four different languages: English (4 interviews), Dari (6), Arabic (12), and Kinyarwanda (14). Interview transcripts were analyzed using thematic analysis. Results: FI was at its peak among refugees at four time points. First, when they found their first job in the U.S. Second, after six months in the U.S., when they had to renew their Supplemental Nutrition Assistance Program (SNAP) application. Third, when they were no longer receiving caseworkers' support from resettlement agencies. Fourth, when they faced fluctuations in employment or household expenditures. Refugees' preferred strategies to address FI were addressing language barriers, providing a champion to check on them frequently and help when needed, providing information on addressing unmet needs, extending and expanding SNAP benefits, and providing gardens to grow food. Conclusion: Four time points when refugees are at higher risk of FI were identified. Community organizations, policymakers, and resettlement agencies should therefore develop interventions to address FI among refugees, specifically around these four time points and informed by refugees' preferred strategies.

Medicine, Science

DOAJ Open Access 2025

A pragmatic analysis of deictic expressions used in the IELTS speaking test

Fadi Al-Khasawneh

This study investigates the role of deictic expressions in the IELTS speaking test, addressing a gap in research on how test-takers across proficiency levels use deixis in spoken language assessment. While previous studies have examined general discourse features in language testing, little attention has been given to the frequency, functions, and distribution of deixis in assessing spoken proficiency. The study analysed a corpus of 30 IELTS speaking test transcripts, covering proficiency levels from low-intermediate to advanced. Using Levinson’s classification of deixis, the study employed quantitative frequency analysis and qualitative discourse analysis to examine variations in the use of personal, temporal, and spatial deixis. The findings revealed that personal deixis was the most frequently used, followed by temporal and spatial deixis. However, the results of One-Way ANOVA test showed no significant differences in deixis usage across proficiency levels. These findings contribute to English language teaching and assessment by highlighting how deixis functions in test-taker discourse, offering insights for IELTS preparation and speaking proficiency evaluation. The study indicates the need for further exploration of discourse features in language assessment.

Education, Philology. Linguistics
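
The deixis abstract above combines a frequency count of personal, temporal, and spatial deixis with a one-way ANOVA across proficiency levels. A rough sketch of both steps follows, under loud assumptions: the word lists in `DEIXIS` are tiny illustrative samples rather than Levinson's classification or the study's coding scheme, the function names are hypothetical, and counting by word form ignores non-deictic uses (for example "that" as a complementizer). The F statistic itself is the standard ratio of between-group to within-group mean squares.

```python
import re
from collections import Counter

# Illustrative word lists only; a real study would use a full coding scheme.
DEIXIS = {
    "personal": {"i", "me", "my", "we", "us", "our", "you", "your"},
    "temporal": {"now", "then", "today", "yesterday", "tomorrow"},
    "spatial": {"here", "there", "this", "that", "these", "those"},
}


def count_deixis(transcript):
    """Count deictic word forms per category in one transcript."""
    counts = Counter({category: 0 for category in DEIXIS})
    for word in re.findall(r"[a-z']+", transcript.lower()):
        for category, words in DEIXIS.items():
            if word in words:
                counts[category] += 1
    return counts


def one_way_anova_f(groups):
    """F = (SS_between / (k - 1)) / (SS_within / (N - k)) for k groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


counts = count_deixis("I think my town is nice. We moved here yesterday.")
# Made-up per-transcript counts for three proficiency groups, for illustration.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

A small F value relative to the critical value for (k - 1, N - k) degrees of freedom would correspond to the "no significant differences" outcome the abstract reports. The sketch stops at the statistic and does not compute a p-value.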

DOAJ Open Access 2025

Formation Resources of the English Terminology of Inclusive Education

Alina Dushkevych

The article is devoted to a comprehensive analysis of the resources of forming the English terminological system of inclusive education in the modern educational environment. The role of terminology as a tool for standardizing knowledge, communication and scientific understanding of inclusion problems is considered. It is shown that the development of inclusive education requires a clear delineation of the terminological apparatus, since it is the terms that ensure accuracy in defining concepts, unambiguousness in use and unity in the interpretation of international and national educational documents. The formation of the English-language terminological system is based on international regulatory acts, such as the "Convention on the Rights of Persons with Disabilities", "Salamanca Statement and Framework for Action on Special Needs Education", as well as numerous legislative acts of the USA (in particular the "Individuals with Disabilities Education Act" - IDEA). An important role in this process is played by glossaries, encyclopedias and textbooks on pedagogy, psychology and special education, which systematize, unify and disseminate professional vocabulary. Particular attention is paid to the analysis of key concepts of English-language inclusive education: "inclusive education", "special educational needs", "learning disabilities", "barrier-free environment", "universal design for learning", "accessibility" and their Ukrainian counterparts. It is emphasized that when translating and adapting terms, it is necessary to take into account not only the lexical-semantic aspect, but also the cultural-pedagogical context in order to avoid shifting meanings. 
The terminological base of inclusive education performs a number of functions: cognitive (ensuring the scientific validity of concepts), communicative (unification of interdisciplinary and intercultural communication), normative (consolidating standards in legislation and educational policy) and practical (ensuring the effective work of teachers, psychologists, social workers). It is noted that the terms must meet the criteria of accuracy, conciseness, unambiguousness and international comprehensibility.

Discourse analysis, Computational linguistics. Natural language processing

DOAJ Open Access 2025

PRESENTATIONAL DISCOURSE: PROTOTYPICAL CHARACTERISTICS OF TEXTS IN THE LINGUACULTURAL DIMENSION

Vera Mityagina, Elina Novikova

The article examines presentational discourse as a “superstructural” type of discourse that is influenced by all current trends in the development of modern communicative space. The study explores two widespread yet pragmatically opposing approaches to self-presentation in web space: 1) the presentational content of linguistic universities in Russia, Belarus, and Germany, shaped by similar discourse agents sharing common objectives; 2) the presentational content of two global websites (in English and German) and a local one (in Russian) of the multinational company Bayer AG. The aim of the research is to identify prototypical characteristics of presentational texts through a comparative analysis of linguacultural topics and issues. The study of the linguosemiotic space of linguistic universities revealed that the dominant thematic focus of their presentational content is multilingualism as a defining feature of their specialized educational programs, along with references to technical resources, prestige, membership in organizations and research activities. The primary linguistic means in university presentational content include intertextuality, simple syntax, epithets, and first- and second-person pronouns. In contrast, the multilingual self-presentation of transnational corporations is characterized by the obligatory inclusion of thematic dominants across all versions of their official website depending on the linguacultural and sociopragmatic features of each language version. Multilingual presentational discourse demonstrates the signs of deglobalization, where, alongside content equivalence in global versions, transcreation techniques are extensively employed in localized versions to align with the values of the target linguacultural continuum.

Language and Literature

arXiv Open Access 2025

From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data

Chun-Yi Kuan, Hung-yi Lee

Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. This adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where crucial textual capabilities like instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making it resource-intensive. To address these issues, previous works have leveraged the backbone LLMs to synthesize general-purpose, caption-style alignment data. In this paper, we propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds. We further extend our approach to multi-audio scenarios, enabling the model to either explain differences between audio inputs or produce unified captions that describe all inputs, thereby enhancing audio-language alignment. We refer to the entire ALLM training framework as bootstrapping audio-language alignment via synthetic data generation from backbone LLMs (BALSa). Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance on audio understanding and reasoning benchmarks, as well as instruction-following skills. Moreover, incorporating multi-audio training further enhances the model's comprehension and reasoning capabilities. Overall, BALSa offers an efficient and scalable approach to developing ALLMs.

en eess.AS, cs.AI

arXiv Open Access 2025

Retrospex: Language Agent Meets Offline Reinforcement Learning Critic

Yufei Xiang, Yiqun Shen, Yeqin Zhang et al.

Large Language Models (LLMs) possess extensive knowledge and commonsense reasoning capabilities, making them valuable for creating powerful agents. However, existing LLM agent frameworks have not fully utilized past experiences for improvement. This work introduces a new LLM-based agent framework called Retrospex, which addresses this challenge by analyzing past experiences in depth. Unlike previous approaches, Retrospex does not directly integrate experiences into the LLM's context. Instead, it combines the LLM's action likelihood with action values estimated by a Reinforcement Learning (RL) Critic, which is trained on past experiences through an offline "retrospection" process. Additionally, Retrospex employs a dynamic action rescoring mechanism that increases the importance of experience-based values for tasks that require more interaction with the environment. We evaluate Retrospex in ScienceWorld, ALFWorld and Webshop environments, demonstrating its advantages over strong, contemporary baselines.

en cs.CL, cs.AI
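
The Retrospex abstract above describes combining an LLM's action likelihood with critic-estimated action values, and weighting the critic more heavily as a task needs more interaction. The abstract gives no formula, so the following is only a sketch under stated assumptions: the linear ramp in `critic_weight`, the softmax normalization, the convex combination in `rescore`, and all names and default values are hypothetical choices for illustration, not the paper's method.

```python
import math


def softmax(values):
    """Turn arbitrary scores into a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def critic_weight(step, max_weight=0.8, ramp_steps=20):
    """Hypothetical schedule: critic weight grows linearly with step, then caps."""
    return max_weight * min(step / ramp_steps, 1.0)


def rescore(actions, llm_logprobs, q_values, step):
    """Blend LLM likelihoods with critic values and pick the best action."""
    w = critic_weight(step)
    p_llm = softmax(llm_logprobs)
    p_critic = softmax(q_values)
    scores = [(1 - w) * a + w * b for a, b in zip(p_llm, p_critic)]
    best = max(range(len(actions)), key=scores.__getitem__)
    return actions[best], scores


actions = ["open door", "look around"]
llm_logprobs = [-0.1, -2.5]  # made-up values: the LLM prefers "open door"
q_values = [0.0, 3.0]        # made-up values: the critic prefers "look around"

early_choice, _ = rescore(actions, llm_logprobs, q_values, step=0)
late_choice, _ = rescore(actions, llm_logprobs, q_values, step=20)
```

With these made-up numbers the LLM's preference wins at step 0, where the critic weight is zero, and the critic's preference wins once the weight has ramped up, which is the qualitative behavior the abstract attributes to dynamic rescoring.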

DOAJ Open Access 2024

‘Those who can’t, teach’: Pregnancy, Professors, and the Anxiety of Transmission in Waterland and Common Ground

Maxence Gouleau

Pregnancy remains a rare diegetic occurrence in literature. When it appears, it has been remarked, most recently by writer Jessie Greengrass, that pregnancy struggles to ‘stand both for itself and for something other’ (Greengrass 2018). Pregnancy and maternity are at best metaphors or allegories for something else, a tendency that is not exclusively found but most often observed and most problematic in male-authored texts (Friedman 1987; Hanson 2015). While this is true of texts in which pregnancy and maternity are ‘despatched elsewhere while in the centre of things a man paces a carpet’ (Greengrass), it hardly applies to two contemporary male-authored texts in which pregnancy and maternity are central to the diegesis: Graham Swift’s Waterland (1983) and Andrew Cowan’s Common Ground (1996). This article shows that these texts’ anxious male perspectives offer ‘a new figure of the father’, one whose language makes manifest that which ‘does not signify’ and restores balance between tenor and vehicles in metaphors of pregnancy and maternity (Miller 2005).

Arts in general, English language

arXiv Open Access 2024

The Roles of English in Evaluating Multilingual Language Models

Wessel Poelman, Miryam de Lhoneux

Multilingual natural language processing is getting increased attention, with numerous models, benchmarks, and methods being released for many languages. English is often used in multilingual evaluation to prompt language models (LMs), mainly to overcome the lack of instruction tuning data in other languages. In this position paper, we lay out two roles of English in multilingual LM evaluations: as an interface and as a natural language. We argue that these roles have different goals: task performance versus language understanding. This discrepancy is highlighted with examples from datasets and evaluation setups. Numerous works explicitly use English as an interface to boost task performance. We recommend to move away from this imprecise method and instead focus on furthering language understanding.

en cs.CL, cs.AI

arXiv Open Access 2024

PersianMind: A Cross-Lingual Persian-English Large Language Model

Pedram Rostami, Ali Salemi, Mohammad Javad Dousti

Large language models demonstrate remarkable proficiency in various linguistic tasks and have extensive knowledge across various domains. Although they perform best in English, their ability in other languages is notable too. In contrast, open-source models, such as LLaMa, are primarily trained on English datasets, resulting in poor performance in non-English languages. In this paper, we introduce PersianMind, an open-source bilingual large language model which demonstrates comparable performance to closed-source GPT-3.5-turbo in the Persian language. By expanding LLaMa2's vocabulary with 10,000 Persian tokens and training it on a dataset comprising nearly 2 billion Persian tokens, we show that our approach preserves the model's English knowledge and employs transfer learning to excel at transferring task knowledge from one language to another.

en cs.CL, cs.AI

S2 Open Access 2017

English language teaching

Constant Leung, J. Lewkowicz

The researchers examine an Acehnese saga, or hikayat, titled Pocut Muhammad. The representation of power and social context in the text can be seen by applying Norman Fairclough's Critical Discourse Analysis (CDA) method to the story. Using Fairclough's critical discourse analysis, the study seeks to uncover hidden meanings throughout the story's sentence structure. The researchers chose this theory because it offers insight into how a text may represent social practice that takes place in society. The research method employed is descriptive qualitative, including a text analysis technique. Based on the linguistic elements described in this research, the researchers concluded that the structure of the Acehnese language differs significantly from real Indonesian. The second result shows that a leader needs to be someone who can be trusted, who is fair, who values unity and his people above all else, and who never gives up. That is the hidden meaning of this story.

229 citations en Sociology

S2 Open Access 2018

The effect of out-of-class exposure to English language media on learners’ vocabulary knowledge

E. Peters

The aim of this study is to investigate (1) how often Flemish English-as-a-foreign language learners are exposed to English language media outside of the classroom, (2) whether current out-of-class exposure to English language media is related to learners’ vocabulary knowledge, and (3) whether length of instruction (three or six years), gender, and out-of-class exposure to English language media affect Flemish learners’ vocabulary. Data were collected with 79 English-as-a-foreign language learners. They were administered a frequency-based vocabulary test and a questionnaire. The findings of this study show that Flemish English-as-a-foreign language learners are frequently exposed to English language media. The results also indicate a positive relationship between learners’ vocabulary knowledge and exposure to non-subtitled TV programs and movies, the Internet, and written print (books, magazines). Finally, both length of instruction and current out-of-class exposure to English language media had an effect on learners’ vocabulary knowledge, but out-of-class exposure had a larger effect than length of instruction.

185 citations en Psychology
