Visualizing and analyzing global knowledge maps and emerging research trends in tumor-derived exosomes using CiteSpace
Ruijun Guo, Jiajun Xu, Chunxia Li
et al.
Abstract
Objective: Testing of tumor-derived exosomes can be effective in diagnosing disease and assisting in its treatment. Our study uses bibliometric analysis to identify research hotspots related to tumor-derived exosomes, predict emerging research frontiers and development trends, and offer diverse perspectives for advancing research in this field.
Methods: We searched the Web of Science Core Collection for English-language literature on tumor-derived exosomes published from 2015 to 2024. CiteSpace (6.2.R3) was used to visualize the distribution of countries/regions, institutions, authors, co-cited authors, and co-cited journals within the relevant literature. Additionally, co-occurrence, clustering, and burst (emergence) analyses were conducted on the co-cited references.
Results: An analysis of 2523 articles meeting the inclusion criteria revealed a steady increase in the number of publications in this field over the past decade. The most productive country/region, institution, author, and journal were China, Ministry of Education-China, Theresa L. Whiteside, and CANCER RES, respectively; the most influential were the United States, Harvard University, Theresa L. Whiteside, and CANCER RES. The mechanisms underlying exosomal PD-L1 and engineered exosomes are currently prominent research foci warranting close examination by the academic community.
Conclusion: Within research on tumor-derived exosomes, current investigations appear largely concentrated on the exosomal PD-L1 mechanism and engineered exosomes. Likely future research hotspots include the use of engineered exosomes to target tumor cells and to serve as drug delivery platforms for more precise therapeutic targeting.
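The co-occurrence analysis named in the Methods can be illustrated with a minimal sketch. CiteSpace performs this internally on co-cited references and keywords; the keyword lists below are hypothetical, not drawn from the study's corpus:

```python
from collections import Counter
from itertools import combinations

def cooccurrence(records):
    """Count how often each pair of keywords appears in the same article."""
    pairs = Counter()
    for keywords in records:
        # sort so each unordered pair gets a single canonical key
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical keyword lists from three articles
records = [
    ["exosome", "PD-L1", "immunotherapy"],
    ["exosome", "PD-L1", "biomarker"],
    ["exosome", "drug delivery"],
]
pairs = cooccurrence(records)
```

Pairs with high counts become the strong links in the co-occurrence network that the clustering step then partitions.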
Neoplasms. Tumors. Oncology. Including cancer and carcinogens
Fifteenth-century Middle English magic charms between conventionality and innovation
Donata Bulotta
The precariousness of the health situation in England from the 14th century onwards led to the use of any curative means, whether scientific, religious or ritual-magical. In this context, healing charms were seen as accessible and practicable methods. They were often added to medical prescriptions and herbal remedies in medical or pseudo-pharmacological compilations, as they were considered an alternative form of therapy equally valid in the treatment of ailments. Many charms created during this period were a mixture of magic, religion and folklore, but some received new cultural stimulus by incorporating original elements and symbolism from Arabic, Greek and Hebrew magical texts introduced to the island. This work focuses on a selection of 15th-century healing charms. The analysis aims to demonstrate that the principles of the new occult and esoteric doctrines circulating in the intellectual and cultural centers of the island influenced the magical healing ritual. The study of pseudo-Solomonic texts, although strongly censored by the Church, nevertheless contributed to the creation of new textual amulets, which were used alongside the pre-existing charms and thus became a further alternative medium in the therapeutic procedure.
German literature, Philology. Linguistics
A pragmatic analysis of deictic expressions used in the IELTS speaking test
Fadi Al-Khasawneh
This study investigates the role of deictic expressions in the IELTS speaking test, addressing a gap in research on how test-takers across proficiency levels use deixis in spoken language assessment. While previous studies have examined general discourse features in language testing, little attention has been given to the frequency, functions, and distribution of deixis in assessing spoken proficiency. The study analysed a corpus of 30 IELTS speaking test transcripts, covering proficiency levels from low-intermediate to advanced. Using Levinson's classification of deixis, the study employed quantitative frequency analysis and qualitative discourse analysis to examine variations in the use of personal, temporal, and spatial deixis. The findings revealed that personal deixis was the most frequently used, followed by temporal and spatial deixis. However, the results of a one-way ANOVA test showed no significant differences in deixis usage across proficiency levels. These findings contribute to English language teaching and assessment by highlighting how deixis functions in test-taker discourse, offering insights for IELTS preparation and speaking proficiency evaluation. The study indicates the need for further exploration of discourse features in language assessment.
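The frequency comparison described above can be sketched as a one-way ANOVA computed by hand: the F statistic is the ratio of between-group to within-group mean squares. The per-transcript deixis counts below are hypothetical, not taken from the study's corpus:

```python
def one_way_anova_F(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)                              # number of groups
    N = sum(len(g) for g in groups)              # total observations
    grand = sum(sum(g) for g in groups) / N      # grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )
    return (ss_between / (k - 1)) / (ss_within / (N - k))

# Hypothetical personal-deixis counts per transcript at three proficiency levels
low  = [12, 15, 11, 14]
mid  = [13, 16, 12, 15]
high = [14, 13, 15, 12]
F = one_way_anova_F([low, mid, high])
```

A small F (here well below typical critical values) corresponds to the study's finding of no significant group differences.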
Education, Philology. Linguistics
Formation Resources of the English Terminology of Inclusive Education
Alina Dushkevych
The article is devoted to a comprehensive analysis of the resources of forming the English terminological system of inclusive education in the modern educational environment. The role of terminology as a tool for standardizing knowledge, communication and scientific understanding of inclusion problems is considered. It is shown that the development of inclusive education requires a clear delineation of the terminological apparatus, since it is the terms that ensure accuracy in defining concepts, unambiguousness in use and unity in the interpretation of international and national educational documents.
The formation of the English-language terminological system is based on international regulatory acts, such as the "Convention on the Rights of Persons with Disabilities", "Salamanca Statement and Framework for Action on Special Needs Education", as well as numerous legislative acts of the USA (in particular the "Individuals with Disabilities Education Act" - IDEA). An important role in this process is played by glossaries, encyclopedias and textbooks on pedagogy, psychology and special education, which systematize, unify and disseminate professional vocabulary.
Particular attention is paid to the analysis of key concepts of English-language inclusive education: "inclusive education", "special educational needs", "learning disabilities", "barrier-free environment", "universal design for learning", "accessibility" and their Ukrainian counterparts. It is emphasized that when translating and adapting terms, it is necessary to take into account not only the lexical-semantic aspect, but also the cultural-pedagogical context in order to avoid shifting meanings.
The terminological base of inclusive education performs a number of functions: cognitive (ensuring the scientific validity of concepts), communicative (unification of interdisciplinary and intercultural communication), normative (consolidating standards in legislation and educational policy) and practical (ensuring the effective work of teachers, psychologists, social workers). It is noted that the terms must meet the criteria of accuracy, conciseness, unambiguousness and international comprehensibility.
Discourse analysis, Computational linguistics. Natural language processing
Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training
Meng Xiao, Xunxin Cai, Qingqing Long
et al.
Corpus distillation for biomedical large language models (LLMs) seeks to address the pressing challenge of insufficient quantity and quality in open-source annotated scientific corpora, which remains a bottleneck for effective LLM training in biomedical research. This paper proposes a knowledge-driven, agentic framework for scientific corpus distillation, tailored explicitly for LLM training in the biomedical domain, addressing the challenge posed by the complex hierarchy of biomedical knowledge. Central to our approach is a collaborative multi-agent architecture, where specialized agents, each guided by the Medical Subject Headings (MeSH) hierarchy, work in concert to autonomously extract, synthesize, and self-evaluate high-quality textual data from vast scientific literature. This agentic framework collectively generates and refines domain-specific question-answer pairs, ensuring comprehensive coverage and consistency with biomedical ontologies while minimizing manual involvement. Extensive experimental results show that language models trained on our multi-agent distilled datasets achieve notable improvements in biomedical question-answering tasks, outperforming both strong life sciences LLM baselines and advanced proprietary models. Notably, our AI-Ready dataset enables Llama3-70B to surpass GPT-4 with MedPrompt and Med-PaLM-2, despite their larger scale. Detailed ablation studies and case analyses further validate the effectiveness and synergy of each agent within the framework, highlighting the potential of multi-agent collaboration in biomedical LLM training.
CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning
Masato Kikuchi, Masatsugu Ono, Toshioki Soga
et al.
Although WordNet is a valuable resource because of its structured semantic networks and extensive vocabulary, its fine-grained sense distinctions can be challenging for second-language learners. To address this issue, we developed a version of WordNet annotated with the Common European Framework of Reference for Languages (CEFR), integrating its semantic networks with language-proficiency levels. We automated this process using a large language model to measure the semantic similarity between sense definitions in WordNet and entries in the English Vocabulary Profile Online. To validate our approach, we constructed a large-scale corpus containing both sense and CEFR-level information from the annotated WordNet and used it to develop contextual lexical classifiers. Our experiments demonstrate that models fine-tuned on this corpus perform comparably to those fine-tuned on gold-standard annotations. Furthermore, by combining this corpus with the gold-standard data, we developed a practical classifier that achieves a Macro-F1 score of 0.81. This result provides indirect evidence that the transferred labels are largely consistent with the gold-standard levels. The annotated WordNet, corpus, and classifiers are publicly available to help bridge the gap between natural language processing and language education, thereby facilitating more effective and efficient language learning.
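The matching step, aligning each WordNet sense definition with the closest English Vocabulary Profile entry, can be illustrated with a simple token-overlap (Jaccard) similarity standing in for the paper's LLM-based semantic similarity. The EVP entries below are hypothetical:

```python
def jaccard(a, b):
    """Token-overlap similarity between two definitions (0..1)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def best_cefr_level(wordnet_def, evp_entries):
    """Transfer the CEFR level of the closest-matching EVP entry."""
    best = max(evp_entries, key=lambda e: jaccard(wordnet_def, e["definition"]))
    return best["level"]

# Hypothetical EVP entries for two senses of the word "bank"
evp = [
    {"definition": "an organization where people keep their money", "level": "A1"},
    {"definition": "sloping raised land along the side of a river", "level": "B2"},
]
level = best_cefr_level("sloping land beside a body of water such as a river", evp)
```

In the actual pipeline an LLM scores the semantic similarity instead of token overlap, but the transfer logic is the same: each sense inherits the level of its nearest EVP entry.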
The Importance of Facial Features in Vision-based Sign Language Recognition: Eyes, Mouth or Full Face?
Dinh Nam Pham, Eleftherios Avramidis
Non-manual facial features play a crucial role in sign language communication, yet their importance in automatic sign language recognition (ASLR) remains underexplored. While prior studies have shown that incorporating facial features can improve recognition, related work often relies on hand-crafted feature extraction and fails to go beyond the comparison of manual features versus the combination of manual and facial features. In this work, we systematically investigate the contribution of distinct facial regions (eyes, mouth, and full face) using two different deep learning models (a CNN-based model and a transformer-based model) trained on an SLR dataset of isolated signs with randomly selected classes. Through quantitative performance and qualitative saliency map evaluation, we reveal that the mouth is the most important non-manual facial feature, significantly improving accuracy. Our findings highlight the necessity of incorporating facial features in ASLR.
mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules
Carl Edwards, Chi Han, Gawon Lee
et al.
Despite their ability to understand chemical knowledge, large language models (LLMs) remain limited in their capacity to propose novel molecules with desired functions (e.g., drug-like properties). In addition, the molecules that LLMs propose can often be challenging to make, and are almost never compatible with automated synthesis approaches. To better enable the discovery of functional small molecules, LLMs need to learn a new molecular language that is more effective in predicting properties and inherently synced with automated synthesis technology. Current molecule LLMs are limited by representing molecules based on atoms. In this paper, we argue that just like tokenizing texts into meaning-bearing (sub-)word tokens instead of characters, molecules should be tokenized at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as effective building blocks for real-world automated laboratory synthesis. This motivates us to propose mCLM, a modular Chemical-Language Model that comprises a bilingual language model that understands both natural language descriptions of functions and molecular blocks. mCLM front-loads synthesizability considerations while improving the predicted functions of molecules in a principled manner. Experiments on FDA-approved drugs showed that mCLM is capable of significantly improving chemical functions. mCLM, with only 3B parameters, also achieves improvements in synthetic accessibility relative to 7 other leading generative AI methods including GPT-5. When tested on 122 out-of-distribution medicines using only building blocks/tokens that are compatible with automated modular synthesis, mCLM outperforms all baselines in property scores and synthetic accessibility. mCLM can also reason on multiple functions and iteratively self-improve to rescue drug candidates that failed late in clinical trials ("fallen angels").
Large Language Models Meet Text-Attributed Graphs: A Survey of Integration Frameworks and Applications
Guangxin Su, Hanchen Wang, Jianwei Wang
et al.
Large Language Models (LLMs) have achieved remarkable success in natural language processing through strong semantic understanding and generation. However, their black-box nature limits structured and multi-hop reasoning. In contrast, Text-Attributed Graphs (TAGs) provide explicit relational structures enriched with textual context, yet often lack semantic depth. Recent research shows that combining LLMs and TAGs yields complementary benefits: enhancing TAG representation learning and improving the reasoning and interpretability of LLMs. This survey provides the first systematic review of LLM-TAG integration from an orchestration perspective. We introduce a novel taxonomy covering two fundamental directions: LLM for TAG, where LLMs enrich graph-based tasks, and TAG for LLM, where structured graphs improve LLM reasoning. We categorize orchestration strategies into sequential, parallel, and multi-module frameworks, and discuss advances in TAG-specific pretraining, prompting, and parameter-efficient fine-tuning. Beyond methodology, we summarize empirical insights, curate available datasets, and highlight diverse applications across recommendation systems, biomedical analysis, and knowledge-intensive question answering. Finally, we outline open challenges and promising research directions, aiming to guide future work at the intersection of language and graph learning.
Examining the escalation of hostility in social media: a comparative analysis of online incivility in China and the United States regarding the Russia–Ukraine war
Li Yanbo, Su Chris Chao
This study examines and compares online incivility on China's Weibo and the U.S.'s X (Twitter) amid the Russia-Ukraine conflict, aiming to unravel how different cultural and geopolitical contexts shape online incivility and to identify factors associated with its occurrence in each national context.
Communication. Mass media
Developing Artificial Intelligence-Powered Monetary Policy Communication Indicators for Macroeconomic Inquiries in Ghana
Francis Mawuli Abude, Jones Odei-Mensah, Eric Schaling
Central bank communication is a valuable source of information designed to shape the expectations of economic agents within and outside an economy. In particular, the content of Monetary Policy Committees’ press releases and statements reflect the central banks’ view of current and future macroeconomic developments, making them useful for creating high-frequency indicators as alternatives to traditional but slower-to-publish macroeconomic indicators. In this study, Artificial Intelligence (AI)-powered text-mining techniques were employed to create monetary policy communication-based indicators, namely the Monetary Policy Readability Index (MPRI), the Monetary Policy Sentiment Index (MPSI), and the Monetary Policy Uncertainty Index (MPUI), using press releases from the Bank of Ghana's monetary policy committee spanning January 2003 to December 2022. The findings suggest that while readability and sentiments generally declined over the sample period, uncertainty increased, indicating persistent macroeconomic imbalances and vulnerabilities in the domestic economy. The newly developed time series-based indicators demonstrate Granger causal relationships with key macroeconomic variables, affirming their relevance to the central bank, the Ministry of Finance, researchers, investors, and development partners. Notably, the indicators can serve as an early warning system for monitoring and predicting the country's macroeconomic risks, forecasting lagging indicators, assessing the effectiveness of the Bank’s monetary policy communication, and addressing monetary policy inquiries.
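The abstract does not specify how the MPRI is constructed; one standard readability measure it could resemble is the Flesch Reading Ease score. A minimal sketch follows, using a crude vowel-group syllable counter that is an approximation, not the study's actual method:

```python
def count_syllables(word):
    """Crude syllable estimate: count groups of consecutive vowels."""
    word = word.lower().strip(".,;:!?")
    groups, prev_vowel = 0, False
    for ch in word:
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(groups, 1)

def flesch_reading_ease(text):
    """206.835 - 1.015*(words/sentences) - 84.6*(syllables/word)."""
    sentences = max(text.count(".") + text.count("!") + text.count("?"), 1)
    words = text.split()
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

# Hypothetical fragment of a policy press release (dense prose scores low)
score = flesch_reading_ease(
    "The Committee maintained the policy rate. Inflation expectations remain anchored."
)
```

Tracking such a score across press releases over time yields a readability series; sentiment and uncertainty indices are built analogously from word-list or model-based scoring of the same texts.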
Communication. Mass media
An Empirical Study of Gendered Stereotypes in Emotional Attributes for Bangla in Multilingual Large Language Models
Jayanta Sadhu, Maneesha Rani Saha, Rifat Shahriyar
The influence of Large Language Models (LLMs) is rapidly growing, automating more jobs over time. Assessing the fairness of LLMs is crucial due to their expanding impact. Studies reveal the reflection of societal norms and biases in LLMs, which creates a risk of propagating societal stereotypes in downstream tasks. Many studies on bias in LLMs focus on gender bias in various NLP applications. However, there's a gap in research on bias in emotional attributes, despite the close societal link between emotion and gender. This gap is even larger for low-resource languages like Bangla. Historically, women are associated with emotions like empathy, fear, and guilt, while men are linked to anger, bravado, and authority. This pattern reflects societal norms in Bangla-speaking regions. We offer the first thorough investigation of gendered emotion attribution in Bangla for both closed and open source LLMs in this work. Our aim is to elucidate the intricate societal relationship between gender and emotion specifically within the context of Bangla. We have been successful in showing the existence of gender bias in the context of emotions in Bangla through analytical methods and also show how emotion attribution changes on the basis of gendered role selection in LLMs. All of our resources including code and data are made publicly available to support future research on Bangla NLP. Warning: This paper contains explicit stereotypical statements that many may find offensive.
Improving Text Embeddings for Smaller Language Models Using Contrastive Fine-tuning
Trapoom Ukarapol, Zhicheng Lee, Amy Xin
While Large Language Models show remarkable performance in natural language understanding, their resource-intensive nature makes them less accessible. In contrast, smaller language models such as MiniCPM offer more sustainable scalability, but often underperform without specialized optimization. In this paper, we explore the enhancement of smaller language models through the improvement of their text embeddings. We select three language models, MiniCPM, Phi-2, and Gemma, to conduct contrastive fine-tuning on the NLI dataset. Our results demonstrate that this fine-tuning method enhances the quality of text embeddings for all three models across various benchmarks; MiniCPM shows the most significant improvement, with an average performance gain of 56.33%. The contrastive fine-tuning code is publicly available at https://github.com/trapoom555/Language-Model-STS-CFT.
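Contrastive fine-tuning on NLI data typically minimizes an InfoNCE-style objective that pulls an anchor toward its entailed sentence and away from contradictions. A toy sketch of that loss (the 3-d embeddings and temperature below are illustrative, not the paper's actual setup):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def infonce_loss(anchor, positive, negatives, temperature=0.05):
    """-log( exp(sim(a,p)/t) / (exp(sim(a,p)/t) + sum_n exp(sim(a,n)/t)) )."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy "sentence embeddings": the entailed pair is close, the contradiction far
anchor   = [1.0, 0.2, 0.0]
positive = [0.9, 0.3, 0.1]
negative = [-1.0, 0.1, 0.5]
loss = infonce_loss(anchor, positive, [negative])
```

When the positive is already much closer to the anchor than the negatives, the loss is near zero; gradients from harder examples reshape the embedding space.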
Instruct Large Language Models to Generate Scientific Literature Survey Step by Step
Yuxuan Lai, Yupeng Wu, Yidan Wang
et al.
Automatically generating scientific literature surveys is a valuable task that can significantly enhance research efficiency. However, the diverse and complex nature of information within a literature survey poses substantial challenges for generative models. In this paper, we design a series of prompts to systematically leverage large language models (LLMs), enabling the creation of comprehensive literature surveys through a step-by-step approach. Specifically, we design prompts to guide LLMs to sequentially generate the title, abstract, hierarchical headings, and the main content of the literature survey. We argue that this design enables the generation of the headings from a high-level perspective. During the content generation process, this design effectively harnesses relevant information while minimizing costs by restricting the length of both input and output content in LLM queries. Our implementation with Qwen-long achieved third place in the NLPCC 2024 Scientific Literature Survey Generation evaluation task, with an overall score only 0.03% lower than the second-place team. Additionally, our soft heading recall is 95.84%, the second best among the submissions. Thanks to the efficient prompt design and the low cost of the Qwen-long API, our method reduces the expense for generating each literature survey to 0.1 RMB, enhancing the practical value of our method.
RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages
Harshvivek Kashid, Pushpak Bhattacharyya
Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Like machine translation systems, however, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose RoundTripOCR, an approach for synthetic data generation for Devanagari languages that tackles the scarcity of post-OCR error correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction that leverages techniques from machine translation: we treat OCR errors as mistranslations in a parallel text corpus and employ pre-trained transformer models to learn the mapping from erroneous to correct text pairs, effectively correcting OCR errors.
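The round-trip idea of generating (erroneous, correct) training pairs can be illustrated by injecting OCR-like glyph confusions into clean text. The confusion pairs below are hypothetical examples of visually similar Devanagari letters, not the confusion sets actually used by RoundTripOCR:

```python
import random

# Hypothetical visually-similar Devanagari glyph confusions (assumption)
CONFUSIONS = {"ब": "व", "घ": "ध", "भ": "म"}

def inject_ocr_errors(clean_text, rate=0.3, seed=0):
    """Corrupt clean text with OCR-like substitutions, yielding an
    (erroneous, correct) pair for training a post-OCR correction model."""
    rng = random.Random(seed)  # seeded for reproducible corpora
    noisy = [
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
        for ch in clean_text
    ]
    return "".join(noisy), clean_text

noisy, clean = inject_ocr_errors("बाघ और भालू", rate=1.0)
```

The resulting pairs form the parallel corpus on which a transformer learns the noisy-to-clean "translation" described in the abstract.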
A Federated Learning Approach to Privacy Preserving Offensive Language Identification
Marcos Zampieri, Damith Premasiri, Tharindu Ranasinghe
The spread of various forms of offensive speech online is an important concern in social media. While platforms have been investing heavily in ways of coping with this problem, the question of privacy remains largely unaddressed. Models trained to detect offensive language on social media are trained and/or fine-tuned using large amounts of data often stored in centralized servers. Since most social media data originates from end users, we propose a privacy-preserving decentralized architecture for identifying offensive language online by introducing Federated Learning (FL) in the context of offensive language identification. FL is a decentralized architecture that allows multiple models to be trained locally without the need for data sharing, hence preserving users' privacy. We propose a model fusion approach to perform FL. We trained multiple deep learning models on four publicly available English benchmark datasets (AHSD, HASOC, HateXplain, OLID) and evaluated their performance in detail. We also present initial cross-lingual experiments in English and Spanish. We show that the proposed model fusion approach outperforms baselines in all the datasets while preserving privacy.
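One common way to fuse locally trained models is FedAvg-style parameter averaging, sketched below; the paper's exact fusion method may differ. Only parameters leave each client, never the raw posts:

```python
def federated_average(client_weights):
    """Average corresponding parameters from locally trained models.
    Each client dict maps parameter names to values; raw training
    data never leaves the client, only these parameters do."""
    n = len(client_weights)
    keys = client_weights[0].keys()
    return {k: sum(w[k] for w in client_weights) / n for k in keys}

# Toy parameter dictionaries from three clients
clients = [
    {"w": 0.2, "b": 0.1},
    {"w": 0.4, "b": 0.3},
    {"w": 0.6, "b": 0.2},
]
fused = federated_average(clients)
```

In practice the dictionaries hold full neural network weight tensors and the average may be weighted by each client's dataset size, but the privacy argument is the same: the server only ever sees parameters.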
A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery
Yu Zhang, Xiusi Chen, Bowen Jin
et al.
In many scientific fields, large language models (LLMs) have revolutionized the way text and other modalities of data (e.g., molecules and proteins) are handled, achieving superior performance in various applications and augmenting the scientific discovery process. Nevertheless, previous surveys on scientific LLMs often concentrate on one or two fields or a single modality. In this paper, we aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs regarding their architectures and pre-training techniques. To this end, we comprehensively survey over 260 scientific LLMs, discuss their commonalities and differences, as well as summarize pre-training datasets and evaluation tasks for each field and modality. Moreover, we investigate how LLMs have been deployed to benefit scientific discovery. Resources related to this survey are available at https://github.com/yuzhimanhua/Awesome-Scientific-Language-Models.
Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs
Koshiro Saito, Sakae Mizuki, Masanari Ohi
et al.
Why do we build local large language models (LLMs)? What should a local LLM learn from the target language? Which abilities can be transferred from other languages? Do language-specific scaling laws exist? To explore these research questions, we evaluated 35 Japanese, English, and multilingual LLMs on 19 evaluation benchmarks for Japanese and English, taking Japanese as a local language. Adopting an observational approach, we analyzed correlations of benchmark scores, and conducted principal component analysis (PCA) on the scores to derive "ability factors" of local LLMs. We found that training on English text can improve the scores of academic subjects in Japanese (JMMLU). In addition, it is unnecessary to specifically train on Japanese text to enhance abilities for solving Japanese code generation, arithmetic reasoning, commonsense, and reading comprehension tasks. In contrast, training on Japanese text could improve question-answering tasks about Japanese knowledge and English-Japanese translation, which indicates that abilities for solving these two tasks can be regarded as "Japanese abilities" for LLMs. Furthermore, we confirmed that the Japanese abilities scale with the computational budget for Japanese text.
Why do objects have many names? A study on word informativeness in language use and lexical systems
Eleonora Gualdoni, Gemma Boleda
Human lexicons contain many different words that speakers can use to refer to the same object, e.g., "purple" or "magenta" for the same shade of color. On the one hand, studies on language use have explored how speakers adapt their referring expressions to successfully communicate in context, without focusing on properties of the lexical system. On the other hand, studies in language evolution have discussed how competing pressures for informativeness and simplicity shape lexical systems, without tackling in-context communication. We aim to bridge the gap between these traditions and explore why a soft mapping between referents and words is a good solution for communication, taking into account both in-context communication and the structure of the lexicon. We propose a simple measure of informativeness for words and lexical systems, grounded in a visual space, and analyze color naming data for English and Mandarin Chinese. We conclude that optimal lexical systems are those where multiple words can apply to the same referent, conveying different amounts of information. Such systems allow speakers to maximize communication accuracy and minimize the amount of information they convey when communicating about referents in contexts.
Empathy and justice in the artistic reception of the cases and the trial surrounding far-right terrorism in Germany
Yvonne Huetter-Almerigi
The article analyses the various levels on which justice figures as a motif in Fatih Akin's film In the Fade (2017). According to my reading, 'justice' operates at the level of the aesthetics of reception as well as at the level of production: first, the film is the first attempt to render justice to the families of the victims of the NSU terror in Germany by acknowledging their suffering and integrating them into the circle of people for whom we feel empathy and solidarity. Second, the film thematizes the conflicts between justice and individual agency in a post-metaphysical setting. More generally, by drawing on the theoretical instruments of Law and Literature studies, the film can also be understood as an attempt to influence the normative web with which to evaluate the real-world court case against members of the NSU, which was still ongoing when Akin's film came out.