Results for "English language"

Showing 20 of ~6,566,826 results · from arXiv, DOAJ, Semantic Scholar, CrossRef

S2 Open Access 2013
Polyglot: Distributed Word Representations for Multilingual NLP

Rami Al-Rfou, Bryan Perozzi, S. Skiena

Distributed word representations (word embeddings) have recently contributed to competitive performance in language modeling and several NLP tasks. In this work, we train word embeddings for more than 100 languages using their corresponding Wikipedias. We quantitatively demonstrate the utility of our word embeddings by using them as the sole features for training a part of speech tagger for a subset of these languages. We find their performance to be competitive with near state-of-the-art methods in English, Danish and Swedish. Moreover, we investigate the semantic features captured by these embeddings through the proximity of word groupings. We will release these embeddings publicly to help researchers in the development and enhancement of multilingual applications.
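The semantic-proximity analysis the abstract mentions — inspecting word groupings via embedding neighbors — can be sketched with cosine similarity. The toy 3-dimensional vectors below are illustrative assumptions, not the paper's embeddings (Polyglot's released vectors are 64-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 3-d vectors standing in for real embeddings.
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "car":   [0.1, 0.2, 0.9],
}

def nearest(word, k=1):
    """Words ranked by cosine similarity to `word` (excluding itself)."""
    scored = [(w, cosine(emb[word], v)) for w, v in emb.items() if w != word]
    return sorted(scored, key=lambda t: -t[1])[:k]
```

With real embeddings, the same `nearest` lookup surfaces semantically related words, which is the kind of proximity probe the paper uses.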

491 citations en Computer Science
arXiv Open Access 2025
mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules

Carl Edwards, Chi Han, Gawon Lee et al.

Despite their ability to understand chemical knowledge, large language models (LLMs) remain limited in their capacity to propose novel molecules with desired functions (e.g., drug-like properties). In addition, the molecules that LLMs propose can often be challenging to make, and are almost never compatible with automated synthesis approaches. To better enable the discovery of functional small molecules, LLMs need to learn a new molecular language that is more effective in predicting properties and inherently synced with automated synthesis technology. Current molecule LLMs are limited by representing molecules based on atoms. In this paper, we argue that just like tokenizing texts into meaning-bearing (sub-)word tokens instead of characters, molecules should be tokenized at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model that comprises a bilingual language model that understands both natural language descriptions of functions and molecular blocks. mCLM front-loads synthesizability considerations while improving the predicted functions of molecules in a principled manner. Experiments on FDA-approved drugs showed that mCLM is capable of significantly improving chemical functions. mCLM, with only 3B parameters, also achieves improvements in synthetic accessibility relative to 7 other leading generative AI methods including GPT-5. When tested on 122 out-of-distribution medicines using only building blocks/tokens that are compatible with automated modular synthesis, mCLM outperforms all baselines in property scores and synthetic accessibility. mCLM can also reason on multiple functions and iteratively self-improve to rescue drug candidates that failed late in clinical trials ("fallen angels").
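The block-level tokenization the authors argue for can be illustrated as a greedy longest-match tokenizer over a fragment vocabulary, with single characters as the atom-level fallback. The SMILES fragments below are a made-up stand-in for mCLM's actual building-block vocabulary:

```python
# Hypothetical building-block vocabulary (SMILES fragments); mCLM's real
# vocabulary consists of blocks compatible with automated modular synthesis.
BLOCKS = ["c1ccccc1", "C(=O)O", "N"]

def tokenize(smiles):
    """Greedy longest-match tokenization into functional blocks,
    falling back to single characters (the atom-level baseline)."""
    tokens, i = [], 0
    by_length = sorted(BLOCKS, key=len, reverse=True)
    while i < len(smiles):
        for block in by_length:
            if smiles.startswith(block, i):
                tokens.append(block)
                i += len(block)
                break
        else:
            tokens.append(smiles[i])  # unknown span: one character per token
            i += 1
    return tokens
```

The contrast the paper draws is between token streams like `["c1ccccc1", "C(=O)O"]` (meaning-bearing blocks) and a character-by-character stream, analogous to subword versus character tokenization in text.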

en cs.AI, cs.CL
arXiv Open Access 2025
Artificial Intelligence Bias on English Language Learners in Automatic Scoring

Shuchen Guo, Yun Wang, Jichao Yu et al.

This study investigated potential scoring biases and disparities toward English Language Learners (ELLs) when using automatic scoring systems for middle school students' written responses to science assessments. We specifically focus on examining how unbalanced training data with ELLs contributes to scoring bias and disparities. We fine-tuned BERT with four datasets: responses from (1) ELLs, (2) non-ELLs, (3) a mixed dataset reflecting the real-world proportion of ELLs and non-ELLs (unbalanced), and (4) a balanced mixed dataset with equal representation of both groups. The study analyzed 21 assessment items: 10 items with about 30,000 ELL responses, five items with about 1,000 ELL responses, and six items with about 200 ELL responses. Scoring accuracy (Acc) was calculated and compared to identify bias using Friedman tests. We measured the Mean Score Gaps (MSGs) between ELLs and non-ELLs and then calculated the differences in MSGs generated by the human and AI models to identify scoring disparities. We found no AI bias or distorted disparities between ELLs and non-ELLs when the training dataset was sufficiently large (ELL = 30,000 and ELL = 1,000), but concerns may arise when the sample size is limited (ELL = 200).
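The two metrics in the abstract — scoring accuracy and the Mean Score Gap (MSG) — can be sketched in a few lines. The sign convention and the toy scores are assumptions; the study's exact computation may differ:

```python
def accuracy(pred, gold):
    """Fraction of machine scores that match the human (gold) scores."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def mean_score_gap(ell_scores, non_ell_scores):
    """MSG: mean non-ELL score minus mean ELL score (sign convention assumed)."""
    return sum(non_ell_scores) / len(non_ell_scores) - sum(ell_scores) / len(ell_scores)

# Disparity: does the AI widen or shrink the human raters' gap? (toy scores)
human_msg = mean_score_gap([2, 2, 3], [3, 3, 3])   # gap under human scoring
ai_msg    = mean_score_gap([1, 2, 2], [3, 3, 3])   # gap under AI scoring
disparity = ai_msg - human_msg                      # > 0: the AI widens the gap
```

A near-zero disparity across conditions is what the study reports for the large-sample items; a nonzero value like the one above is the kind of distortion flagged for the small-sample (ELL = 200) case.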

en cs.CL, cs.AI
arXiv Open Access 2025
Large Language Models Meet Text-Attributed Graphs: A Survey of Integration Frameworks and Applications

Guangxin Su, Hanchen Wang, Jianwei Wang et al.

Large Language Models (LLMs) have achieved remarkable success in natural language processing through strong semantic understanding and generation. However, their black-box nature limits structured and multi-hop reasoning. In contrast, Text-Attributed Graphs (TAGs) provide explicit relational structures enriched with textual context, yet often lack semantic depth. Recent research shows that combining LLMs and TAGs yields complementary benefits: enhancing TAG representation learning and improving the reasoning and interpretability of LLMs. This survey provides the first systematic review of LLM--TAG integration from an orchestration perspective. We introduce a novel taxonomy covering two fundamental directions: LLM for TAG, where LLMs enrich graph-based tasks, and TAG for LLM, where structured graphs improve LLM reasoning. We categorize orchestration strategies into sequential, parallel, and multi-module frameworks, and discuss advances in TAG-specific pretraining, prompting, and parameter-efficient fine-tuning. Beyond methodology, we summarize empirical insights, curate available datasets, and highlight diverse applications across recommendation systems, biomedical analysis, and knowledge-intensive question answering. Finally, we outline open challenges and promising research directions, aiming to guide future work at the intersection of language and graph learning.

en cs.CL, cs.AI
arXiv Open Access 2025
Does Localization Inform Unlearning? A Rigorous Examination of Local Parameter Attribution for Knowledge Unlearning in Language Models

Hwiyeong Lee, Uiji Hwang, Hyelim Lim et al.

Large language models often retain unintended content, prompting growing interest in knowledge unlearning. Recent approaches emphasize localized unlearning, restricting parameter updates to specific regions in an effort to remove target knowledge while preserving unrelated general knowledge. However, their effectiveness remains uncertain due to the lack of robust and thorough evaluation of the trade-off between the competing goals of unlearning. In this paper, we begin by revisiting existing localized unlearning approaches. We then conduct controlled experiments to rigorously evaluate whether local parameter updates causally contribute to unlearning. Our findings reveal that the set of parameters that must be modified for effective unlearning is not strictly determined, challenging the core assumption of localized unlearning that parameter locality is inherently indicative of effective knowledge removal.
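The localized-unlearning setup being examined — restricting the unlearning update to a chosen subset of parameters — can be sketched as a masked gradient step. The toy parameters, gradients, and mask are illustrative; actual methods operate on model weight matrices:

```python
# Toy localized unlearning: a gradient-ascent step on the forget-set loss,
# applied only where a locality mask selects parameters.
params = [0.5, -1.2, 0.8, 0.3]
grads  = [0.1, 0.4, -0.2, 0.05]   # gradient of the forget loss (made-up values)
mask   = [1, 0, 0, 1]             # "localized" region: only positions 0 and 3
lr = 0.1

# Ascend the forget loss inside the masked region; everything else is frozen.
updated = [p + lr * g * m for p, g, m in zip(params, grads, mask)]
```

The paper's controlled experiments probe whether the specific positions the mask selects causally matter; its finding is that they are not strictly determined, so locality alone does not certify knowledge removal.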

en cs.CL
DOAJ Open Access 2025
An Unrealistic ‘Mission Impossible’? Surveying Principals’ Interpretations and Local Solutions for L2-mediated EFL in Finnish Adult Basic Education

Antti Suomela

Finnish Adult Basic Education (ABE) comprises primary and lower secondary education designed to improve adult immigrants’ participation. Completing the entire syllabus and graduating with a Finnish basic education certificate is helpful for many adult immigrants without prior basic education in finding work, continuing studies at the upper secondary level, and applying for Finnish citizenship. English has potential utility for integration in the Nordic countries, yet adult immigrants’ foreign language studies, and Finnish ABE, have been little studied. Recent government reports have indicated that ABE students seem to face exceptional challenges in their language studies, particularly their compulsory advanced syllabus English studies. The present survey study (n=35) targeting principals of ABE institutions examines institution-level foreign language education practices and principals’ views on language education. The findings indicate that ABE English education is rapid-paced, relies on a local teaching language that students have only learned for a few years, and may lack suitable instructional materials and tests. Most principals are not convinced that their students graduate with English skills sufficient for language needs in working life or upper secondary education. Varied local interpretations of national guidelines lead to inequity in language learning opportunities. The findings highlight the significance of aligning materials and local and national policies with student diversity and the evolving role of English in the Nordics.

English literature
DOAJ Open Access 2025
Developing the Language of EFL Students’ Paragraph Writing through Enhanced Readers’ Awareness of Summary and Reading Comprehension Skills

Ali Saleh Mohammad Al-Attas

The study aimed to examine the role of reading comprehension strategies in enhancing university students' writing through summaries of reading comprehension. Teaching skimming and scanning skills to locate key vocabulary within the sentences of a paragraph reduces the complexity of a text and improves students' ability to write complete paragraphs. The research focused on equipping students with skills to extract information from reading texts for writing purposes. Given its linguistic and educational needs, education in Socotra was an ideal setting to evaluate the effectiveness of a targeted reading-to-writing intervention. A pre-experimental design was adopted with a single group of 16 students who participated in a three-month intervention. A mixed-methods approach gathered data from the group through a closed-ended comprehension questionnaire and an open-ended summary question on paragraph writing before and after the treatment. The students' performance was analyzed for text comprehension and patterns of summarization. Results showed that the teaching techniques positively impacted students' comprehension and summarization skills, with an 82% success rate in paragraph writing as a reflection of writing skills. Post-test scores (74.3%) improved significantly over pre-test scores (45.7%), and statistics revealed a strong correlation (79%) between CT and SPW. The strategic, vocabulary-focused reading applied demonstrated improved language ability in quality writing. The study concluded that a targeted reading-to-writing approach in EFL education supports sustainable development of language ability in non-native settings.

History of scholarship and learning. The humanities
arXiv Open Access 2024
RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages

Harshvivek Kashid, Pushpak Bhattacharyya

Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Just like Machine Translation systems, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose an approach for synthetic data generation for Devanagari languages, RoundTripOCR, that tackles the scarcity of the post-OCR Error Correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction by leveraging techniques from machine translation. Our method involves translating erroneous OCR output into a corrected form by treating the OCR errors as mistranslations in a parallel text corpus, employing pre-trained transformer models to learn the mapping from erroneous to correct text pairs, effectively correcting OCR errors.
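The parallel-corpus framing — pairing erroneous OCR output with its ground truth — can be illustrated by synthetically corrupting clean text with a character-confusion map. The confusion pairs below are hypothetical; RoundTripOCR itself generates realistic errors by rendering text to images and re-OCRing it:

```python
import random

# Hypothetical map of visually confusable Devanagari characters (illustrative only).
confusions = {"ब": "व", "घ": "ध"}

def corrupt(text, rate=1.0, rng=None):
    """Inject OCR-like substitutions to build an (erroneous, correct) training pair."""
    rng = rng or random.Random(0)
    out = []
    for ch in text:
        if ch in confusions and rng.random() < rate:
            out.append(confusions[ch])
        else:
            out.append(ch)
    return "".join(out)

# One synthetic parallel example: (OCR-style erroneous text, ground truth).
pair = (corrupt("बहुत"), "बहुत")
```

Pairs like this form the "mistranslation" corpus on which a sequence-to-sequence transformer can learn the erroneous-to-correct mapping.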

en cs.CL, cs.CV
arXiv Open Access 2024
Türkçe Dil Modellerinin Performans Karşılaştırması Performance Comparison of Turkish Language Models

Eren Dogan, M. Egemen Uzun, Atahan Uz et al.

The developments that language models have provided in fulfilling almost all kinds of tasks have attracted the attention of not only researchers but also society, and have enabled them to become products. Commercially successful language models are available. However, users may prefer open-source language models due to cost, data privacy, or regulations. Yet, despite the increasing number of these models, there is no comprehensive comparison of their performance for Turkish. This study aims to fill this gap in the literature. A comparison is made among seven selected language models based on their contextual learning and question-answering abilities. Turkish datasets for contextual learning and question-answering were prepared, and both automatic and human evaluations were conducted. The results show that for question-answering, continuing pretraining before fine-tuning with instructional datasets is more successful in adapting multilingual models to Turkish, and that in-context learning performance is not strongly related to question-answering performance.

en cs.CL, cs.AI
arXiv Open Access 2024
A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery

Yu Zhang, Xiusi Chen, Bowen Jin et al.

In many scientific fields, large language models (LLMs) have revolutionized the way text and other modalities of data (e.g., molecules and proteins) are handled, achieving superior performance in various applications and augmenting the scientific discovery process. Nevertheless, previous surveys on scientific LLMs often concentrate on one or two fields or a single modality. In this paper, we aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs regarding their architectures and pre-training techniques. To this end, we comprehensively survey over 260 scientific LLMs, discuss their commonalities and differences, as well as summarize pre-training datasets and evaluation tasks for each field and modality. Moreover, we investigate how LLMs have been deployed to benefit scientific discovery. Resources related to this survey are available at https://github.com/yuzhimanhua/Awesome-Scientific-Language-Models.

en cs.CL
DOAJ Open Access 2024
On the Effect of Diagnostic Self-, and Peer-assessment on Reading Comprehension: Examining EFL Learners’ Diagnostic Rating Accuracy across Various Genres

Masoome Azmoode Sis Abad, Gholam-Reza Kiany, Gholam-Reza Abassian

The present research responds to the call for implementing diagnostic assessment through learner involvement. This study followed two aims: first, examining the distinctive effects of diagnostic self- and peer-assessments on EFL learners' reading comprehension; second, exploring students' diagnostic rating accuracy across various reading genres. To achieve this, a number of diagnostic reading comprehension tests and a checklist were developed, the construction of which was based on the learners' challenging reading subskills. The participants included 60 English translation university students. During 12 weeks of instruction, the rating accuracy of the subjects' diagnostic self- and peer-assessments was investigated, with the instructor's ratings taken as the yardstick. Data analysis using t-tests and MANOVA confirmed that the two groups improved in reading comprehension, but there was no significant difference between the two groups' gains. In addition, no statistically significant difference was detected among the accuracy of diagnostic self-, peer-, and instructor ratings, except for the assessment of the main idea and supporting details subskill in the descriptive genre and the cause and effect subskill in the narrative genre. This study offers evidence for conducting diagnostic assessment through learner engagement to provide students with appropriate feedback and remedial instruction.

English language

Page 36 of 328342