The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black
et al.
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets, both existing and newly constructed, many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
2707 citations
en
Computer Science
Cross-lingual Language Model Pretraining
Guillaume Lample, Alexis Conneau
Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We propose two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective. We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on WMT’16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT’16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models will be made publicly available.
2957 citations
en
Computer Science
Flipping the classroom for English language learners to foster active learning
Hsiu-Ting Hung
785 citations
en
Computer Science
Practical English language teaching
D. Nunan
1916 citations
en
Computer Science
Critical period effects in second language learning: the influence of maturational state on the acquisition of English as a second language.
Jacqueline S. Johnson, E. Newport
2773 citations
en
Psychology, Medicine
WordNet: A Lexical Database for English
G. Miller
Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].
17398 citations
en
Computer Science
The American Heritage dictionary of the English language
William Morris
1997 citations
en
Psychology, Sociology
The American Heritage Dictionary of the English Language
Anne H. Soukhanov, Kaethe Ellis, Marion Severynse
What drives success in English medium taught courses? The interplay between language proficiency, academic skills, and motivation
H. Rose, Samantha Curle, Ikuya Aizawa
et al.
This article explores the relationship between course performance, English language proficiency, motivation, and academic language skills in an English medium instruction (EMI) university context. It analyses test and questionnaire data from 146 students from an EMI business program at a Japanese university, and follow-up interviews with seven students. Proficiency test and subject exam scores revealed that knowledge of English language and academic English skill were statistically significant predictors of success in EMI, suggesting that lower proficiency students require more targeted language support in order to increase their likelihood of success. A motivation measure did not correlate with higher grades, contradicting research in language learning contexts. Interview data uncovered the multi-faceted nature of ‘success in EMI’ suggesting that students see success as a combination of final grades, lecture comprehension, English language proficiency gains, and long-term career advancement.
Factors affecting the attitudes of students towards learning English as a foreign language
A. Getie
This study investigated factors affecting the attitudes of grade 10 students towards learning EFL at Debremarkos Comprehensive Secondary School in Debre Markos town, Ethiopia. The researcher randomly selected 103 students (10%) from the total population (1030). To gather data, a questionnaire was carefully and systematically adapted and designed. Nine students were also purposively selected for a focus group discussion, and Grade 10 English teachers were selected for interviews. The data were then analyzed quantitatively and qualitatively. The findings mainly showed that the attitudes of grade 10 students towards learning EFL are positive. Social factors (e.g., English native speakers, peer groups and learners’ parents) affected students’ attitudes positively. On the other hand, educational context factors such as English language teachers and the English language learning situation (e.g., the classrooms, seating arrangements and the physical learning environment) had negative impacts on students’ attitudes. However, the findings showed that learners have positive attitudes towards one other educational context factor, the grade 10 English textbook, which means that EFL teaching materials in the study’s context affect students’ attitudes positively. By lowering psychological barriers (i.e., affective filters) for the target language learners, it is possible to aid the language learning process. Thus, as an implication of this study, the physical learning environment should be improved, and to achieve this, the government should work in conjunction with school principals, teachers and societies.
Learners' Listening Comprehension Difficulties in English Language Learning: A Literature Review.
Abbas Pourhosein Gilakjani, Narjes Banou Sabouri
Listening is one of the most important skills in English language learning. When students listen to English language, they face a lot of listening difficulties. Students have critical difficulties in listening comprehension because universities and schools pay more attention to writing, reading, and vocabulary. Listening is not an important part of many course books and most teachers do not pay attention to this important skill in their classes. In this paper, the researchers reviewed the terms listening, listening comprehension, listening comprehension strategies, and listening difficulties. The review of literature indicated that when teachers are aware of students’ learning difficulties they can help them develop effective listening strategies and finally solve their difficulties in listening and improve their listening comprehension abilities.
The effectiveness of the ECRIF framework in developing EFL students' engagement and learning outcomes: A case study of Moroccan middle school students
Ayad Chraa, Hanane Aqadoh
Recent shifts in education, particularly in English language teaching (ELT), have emphasized learner-centered approaches over teacher-centered ones. The ECRIF framework (Encounter, Clarify, Remember, Internalize, Fluently Use) is widely recognized for providing a practical roadmap for facilitating active, meaningful, and student-centered learning experiences. This study investigates its effectiveness in developing learning outcomes and engagement among Moroccan EFL middle school students. Data were collected from a sample of 67 ninth-grade students, using a pre-test, post-test, and feedback questionnaire. The results revealed that the ECRIF framework significantly enhanced students’ performance in the grammar test compared to traditional instruction. While both instructional methods resulted in learning gains, the improvement observed in the control group was modest compared to the substantial progress made by the experimental group. The structured stages and interactive activities incorporated throughout the instructional session contributed to the positive impact of the ECRIF framework on the experimental group’s achievement. The students expressed high levels of satisfaction and positive perceptions toward the ECRIF approach, as reflected in their responses to the feedback questionnaire. Overall, the present study demonstrated that the ECRIF activities used during the lesson not only improved students’ learning outcomes but also maximized their engagement and participation. These findings suggest that the framework can effectively foster both student learning and engagement in the English classroom. To this end, the study calls for the integration of ECRIF framework to enhance learner-centered practices and improve educational outcomes.
Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models
Terra Blevins, Luke Zettlemoyer
English pretrained language models, which make up the backbone of many modern NLP systems, require huge amounts of unlabeled training data. These models are generally presented as being trained only on English text but have been found to transfer surprisingly well to other languages. We investigate this phenomenon and find that common English pretraining corpora actually contain significant amounts of non-English text: even when less than 1% of data is not English (well within the error rate of strong language classifiers), this leads to hundreds of millions of foreign language tokens in large-scale datasets. We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them, with target language performance strongly correlated to the amount of in-language data seen during pretraining. In light of these findings, we argue that no model is truly monolingual when pretrained at scale, which should be considered when evaluating cross-lingual transfer.
108 citations
en
Computer Science
Conceptualisation and design of a protocol for a novel social drivers and primary care screening tool in paediatric orthopaedics: a quality improvement initiative
Devika A Shenoy, Kathryn Radulovacki, Christian Zirbes
et al.
Introduction: Paediatric patients with elevated healthcare needs have limited access to care, contributing to insufficient preventive services. Social drivers of health (SDOH), such as inadequate nutrition and poor community support, contribute substantially to paediatric health. In order to promote value-based healthcare, specialty providers like orthopaedic surgeons can serve as an entry point to connect patients to SDOH and primary care resources. The purpose of this short report is to provide a framework for other institutions for an ongoing screening programme. Overview of programme: Informed by prior initiatives, we designed a systematic screening programme for SDOH, obesity (via body mass index (BMI)) and primary care access in a paediatric orthopaedics clinic, for English-speaking and Spanish-speaking patients <18 years old. We record age, sex, race/ethnicity, language, orthopaedic condition, Area Deprivation Index and Childhood Opportunity Index from the electronic record previsit. BMI ≥95th percentile triggers referral to a paediatric obesity management programme. An institutional SDOH screen detects financial, food, transportation, housing and utility needs. A positive screen triggers a referral to institution and state-specific resources. A five-question Primary Care Access Screen identifies gaps in primary care access. A positive screen triggers an internal referral to paediatric primary care. Patients are flagged before visits, and screening results are reviewed. Students place referrals for positive screens, finalised by providers. Follow-up occurs at 6 weeks. Data collection: We assess screen positivity, referral uptake and completed appointments for all patients at 6 weeks.
Enhancing image caption generation through context-aware attention mechanism
Ahatesham Bhuiyan, Eftekhar Hossain, Mohammed Moshiul Hoque
et al.
Image captioning, the process of generating natural language descriptions based on image content, has garnered attention in AI research for its implications in scene understanding and human-computer interaction. While much prior research has focused on caption generation for English, addressing low-resource languages like Bengali presents challenges, particularly in producing coherent captions linking visual objects with corresponding words. This paper proposes a context-aware attention mechanism over semantic attention to accurately diagnose objects for image captioning in Bengali. The proposed architecture consists of an encoder and a decoder block. We chose ResNet-50 over the other pre-trained models for encoding the image features due to its ability to solve the vanishing gradient problem and recognize complex object features. For decoding generated captions, a bidirectional Gated Recurrent Unit (GRU) architecture combined with an attention mechanism captures contextual dependencies in both directions, resulting in more accurate captions. The paper also highlights the challenge of transferring knowledge between domains, especially with culturally specific images. Evaluation of three Bengali benchmark datasets, namely BAN-Cap, BanglaLekhaImageCaption, and Bornon, demonstrates significant performance improvement in METEOR score over existing methods by approximately 30%, 18%, and 45%, respectively. The proposed context-aware, attention-based image captioning system significantly outperforms current state-of-the-art models in Bengali caption generation despite limitations in reference captions on certain datasets.
Science (General), Social sciences (General)
Rhetorical Strategies Used by Information Technology Students in In-Class Presentations (IAFOR Publications, May 31, 2023)
Eva Ellederová
Rhetoric plays an important role in helping information technology (IT) professionals communicate their ideas clearly and effectively. By employing rhetorical devices when speaking about technology topics, IT professionals can present logical and convincing arguments, and demonstrate their knowledge and expertise while engaging the audience and making complex technical concepts more accessible for non-experts. This study attempts to understand how IT students construct and develop persuasive arguments by analysing their use of rhetorical strategies in a sample of persuasive presentations delivered in the course “English for IT”. Both corpus analysis and manual analysis were used to identify different types of rhetorical strategies students employed to influence their audiences’ attitudes. The results show that IT students not only created a logical appeal which might be more natural for them but also employed a wide range of rhetorical strategies and devices to establish disciplinary credibility and create a more personal connection with their audience, thus maintaining an appropriate balance of logos, ethos and pathos. The study further recommends systematic and careful rhetorical analysis of ESP (English for specific purposes) students’ spoken language across disciplines and the consequent adaption of learning materials and teaching methods to improve ESP students’ rhetoric skills.
Theory and practice of education
Cultivating Intercultural Attitudes among EIL learners: A Case Study of Undergraduates
Muhammad Waqar Ali, Muhammad Iqbal, Sareer Badshah
Teaching of English as an international language (EIL) necessitates promoting intercultural communicative competence (ICC) to enable individuals to survive and progress in multicultural societies. It therefore requires learners to develop intercultural attitudes, one of the essential components of ICC. The current paper examines undergraduates' intercultural attitudes before and after an intervention and measures the impact of the treatment on those attitudes. A semester-long course, based on material designed for promoting intercultural competence, was carried out as the intervention. Only the treatment group was exposed to the treatment; the control group was not. A self-report survey was administered to both groups before and after the treatment. Though both groups reported improvement in their attitudes, the mean differences and effect size values for the cumulative scale and subscales indicate a substantial difference between the attitudes of the two groups; it may therefore be inferred that the treatment proved effective in cultivating intercultural attitudes. Furthermore, female participants slightly surpassed male participants in certain attitudes, but the difference was infinitesimal.
English literature, Language. Linguistic theory. Comparative grammar
A longitudinal study on students’ self-regulated listening during transition to an English-medium transnational university in China
Sihan Zhou, Gene Thompson
Listening to academic content in English medium instruction (EMI) classrooms at university can be a demanding task for students who transition from first language (L1) instructed secondary schools. This longitudinal mixed methods study analyzes data from 316 students collected at the beginning, midterm, and the end of their first semester after entering an EMI transnational university in southeast China. The analysis of questionnaire responses revealed significant variations in students’ listening strategies over time, with a significant decrease in deep processing cognitive strategies at the midterm when content difficulty increased. Conversely, two types of metacognitive strategies (problem solving, plan evaluation) increased significantly during the second half of the semester. Informed by Zimmerman’s (2000) social cognitive self-regulated learning (SRL) model, thematic analysis of semi-structured interviews with 34 participants revealed that students generally developed a more top-down listening approach focusing on content learning over the semester and became more selective in their strategic and self-regulatory processes after the midterm “watershed” moment. Results highlight the importance of structured topic knowledge in EMI curriculum design and the necessity of strategy training in language support programs.
Defining Issues In English Language Teaching
A. Ritter
224 citations
en
Computer Science
Self-Repair in Elicited Narrative Production in Speakers of Russian as the First (L1), Second (L2), and Heritage (HL) Language
Natalia Bogdanova-Beglarian, Kristina Zaides, Tatiana Verkhovtceva
et al.
The current study investigates self-repairs in the speech of three groups of Russian speakers: monolingual controls (N = 12) residing in the Russian Federation, for whom Russian is their first dominant language (L1); bilingual Russian–Hebrew speaking participants (N = 12), who acquired Russian as their Heritage Language (HL) in contact with the dominant Societal Hebrew in Israel; and bilingual Russian–Chinese speakers (N = 12) residing in the Russian Federation at the time of testing, for whom Russian is their second language (L2). Picture-elicited narratives were coded for instances of self-repairs, split into Conceptualizer Repairs (C-repairs), which imply pragmatic, semantic, or lexical changes, and Formulator Repairs (F-repairs), which correct different types of errors. In addition, self-repair initiators (such as cut-offs, hesitation pauses, and discourse markers) were annotated before each instance of self-repair. The results indicate that L2 speakers, in general, use self-repairs more frequently than L1 and HL speakers. L1 speakers hardly produced F-repairs, while HL and L2 speakers resorted to both C- and F-repairs. L1 speakers mainly used C-repairs for appropriacy, whereas HL and L2 speakers used C-repairs for rephrasing and lexical item change. As for F-repairs, HL speakers tended to change pronunciation and morphology, while L2 speakers implemented more morphological repairs. Lexical initiators of self-repairs were more common in L1 speech; however, in the L2 group we saw much more frequent cut-offs of repaired speech fragments. As such, varying self-repair strategies were employed by different speaker groups, shedding light on the underlying processes of language production.
There was also evidence of cross-linguistic transfer of non-lexical self-repair initiators: HL speakers resorted to prolongations as initiators in HL-Russian (a strategy that is common in their dominant language, Hebrew), whereas L1 speakers used vocalized and silent pauses more frequently.