The article examines how China is developing and deploying artificial intelligence (AI) for its armed forces, a central element of its drive toward military modernization and "intelligentization." The research analyzes the core tenets of China's AI strategy, its global implications, and the ethical problems raised by China's ambition to lead the world in military AI. It shows that China pursues a centralized model of AI development, backed by extensive state support and ready access to data, and describes how China employs AI in worldwide influence campaigns aimed at shaping public opinion. China faces constraints, including dependence on foreign semiconductor manufacturing technology and problems of bureaucracy and corruption, but it also holds advantages such as a vast workforce and the capacity to mobilize resources rapidly. The article offers a clear picture of AI's growth in China and its consequences for global security, showing that the use of AI in information and cognitive warfare poses a serious threat: China seeks to alter public perceptions and adversaries' decisions by acting directly on their minds. The war in Ukraine demonstrated that such AI-based tactics are dangerous but cannot win wars on their own; conventional battlefield operations remain decisive. Accordingly, other countries, above all the United States, must improve their own AI systems and build robust strategies that integrate physical, cyber, and cognitive operations to counter China effectively. It is concluded that China's use of artificial intelligence is one of the key factors strengthening its military power and serves as a catalyst for reshaping the modern geopolitical map of the world.
Balázs Szigethy's review of the volume A család szerepe a távol-keleti és belső-ázsiai régióban. Tanulmányok a család mint közösség társadalmi, vallási és rituális aspektusairól [The Role of the Family in the Far Eastern and Inner Asian Region: Studies on the Social, Religious, and Ritual Aspects of the Family as a Community] (eds. Ágnes Birtalan and Krisztina Teleki).
This systematic review examined research on the self-efficacy of teachers of Chinese as a foreign language (TCFL) from 2004 to 2024. Guided by social cognitive theory and Bandura’s concept of self-efficacy, 15 empirical studies were synthesized following PRISMA guidelines, employing both qualitative and quantitative analyses. The review identified key factors influencing TCFL teacher self-efficacy, including personal, student, and environmental factors. It further showed that teacher self-efficacy predicts important outcomes such as technology use and integration, career development and retention, and emotional and psychological resources. The findings underscore the need for targeted professional development, supportive institutional policies, and cross-cultural adaptation resources, and they point to future research directions on emerging technologies and diverse teaching contexts.
Traditional Chinese Medicine (TCM) has seen increasing adoption in healthcare, with specialized Large Language Models (LLMs) emerging to support clinical applications. A fundamental requirement for these models is accurate identification of TCM drug ingredients. In this paper, we evaluate how general and TCM-specialized LLMs perform when identifying ingredients of Chinese drugs. Our systematic analysis reveals consistent failure patterns: models often interpret drug names literally, overuse common herbs regardless of relevance, and exhibit erratic behaviors when faced with unfamiliar formulations. LLMs also fail to understand the verification task. These findings demonstrate that current LLMs rely primarily on drug names rather than possessing systematic pharmacological knowledge. To address these limitations, we propose a Retrieval Augmented Generation (RAG) approach focused on ingredient names. Experiments across 220 TCM formulations show our method significantly improves accuracy from approximately 50% to 82% in ingredient verification tasks. Our work highlights critical weaknesses in current TCM-specific LLMs and offers a practical solution for enhancing their clinical reliability.
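The core idea of the proposed fix, grounding ingredient verification in retrieved formulary data rather than the model's name-based guesses, can be sketched as follows. The formulary entries, function names, and lookup logic here are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of ingredient-focused retrieval for TCM formulations.
# The knowledge base below is a toy stand-in for a real formulary.
FORMULARY = {
    "Liu Wei Di Huang Wan": [
        "Rehmannia", "Cornus", "Dioscorea",
        "Alisma", "Moutan", "Poria",
    ],
}

def retrieve_ingredients(formulation: str) -> list[str]:
    """Look up the authoritative ingredient list, instead of letting
    the LLM infer ingredients literally from the drug name."""
    return FORMULARY.get(formulation, [])

def verify_ingredient(formulation: str, ingredient: str) -> bool:
    """Answer a verification query grounded in retrieved ingredients."""
    return ingredient in retrieve_ingredients(formulation)

print(verify_ingredient("Liu Wei Di Huang Wan", "Poria"))    # True
print(verify_ingredient("Liu Wei Di Huang Wan", "Ginseng"))  # False
```

In a full RAG setup, the retrieved list would be injected into the LLM prompt as context; the deterministic lookup above shows why retrieval sidesteps the name-literalism failure mode the paper identifies.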
Disinformation spreads rapidly across linguistic boundaries, yet most AI models are still benchmarked only on English. We address this gap with a systematic comparison of five multilingual transformer models (mBERT, XLM, XLM-RoBERTa, RemBERT, and mT5) on a common fake-versus-true classification task. While transformer-based language models have demonstrated notable success in detecting disinformation in English, their effectiveness in multilingual contexts remains open to debate. To facilitate evaluation, we introduce the PolyTruth Disinfo Corpus, a novel corpus of 60,486 statement pairs (false claim vs. factual correction) spanning over twenty-five languages that collectively cover five language families and a broad topical range including politics, health, climate, finance, and conspiracy; half of the statements are fact-checked disinformation claims verified against an augmented MindBugs Discovery dataset. Our experiments reveal clear performance variations: RemBERT achieved the best overall accuracy, particularly excelling in low-resource languages, whereas mBERT and XLM exhibited considerable limitations when training data is scarce. We provide a discussion of these performance patterns and their implications for real-world deployment. The dataset is publicly available on our GitHub repository to encourage further experimentation and advancement. Our findings illuminate both the potential and the current limitations of AI systems for multilingual disinformation detection.
Traditional Chinese Medicine (TCM), with a history spanning over two millennia, plays a role in global healthcare. However, applying large language models (LLMs) to TCM remains challenging due to its reliance on holistic reasoning, implicit logic, and multimodal diagnostic cues. Existing TCM-domain LLMs have made progress in text-based understanding but lack multimodal integration, interpretability, and clinical applicability. To address these limitations, we developed BenCao, a ChatGPT-based multimodal assistant for TCM, integrating structured knowledge bases, diagnostic data, and expert feedback refinement. BenCao was trained through natural language instruction tuning rather than parameter retraining, aligning with expert-level reasoning and ethical norms specific to TCM. The system incorporates a comprehensive knowledge base of over 1,000 classical and modern texts, a scenario-based instruction framework for diverse interactions, a chain-of-thought simulation mechanism for interpretable reasoning, and a feedback refinement process involving licensed TCM practitioners. BenCao connects to external APIs for tongue-image classification and multimodal database retrieval, enabling dynamic access to diagnostic resources. In evaluations across single-choice question benchmarks and multimodal classification tasks, BenCao achieved superior accuracy to general-domain and TCM-domain models, particularly in diagnostics, herb recognition, and constitution classification. The model was deployed as an interactive application on the OpenAI GPTs Store, accessed by nearly 1,000 users globally as of October 2025. This study demonstrates the feasibility of developing a TCM-domain LLM through natural language-based instruction tuning and multimodal integration, offering a practical framework for aligning generative AI with traditional medical reasoning and a scalable pathway for real-world deployment.
Idioms are crucial components that illustrate the expressive capabilities of a society's language and enhance stylistic richness, and their presence significantly enriches the narrative quality of literary works. Throughout Chinese history, literary creations have emerged in various periods, with distinct genres gaining prominence; during the Tang Dynasty (618-907), poetry notably ascended as a leading literary form. This study investigates the idioms found in selected poems by Li Bai, the "Immortal Poet" emblematic of Tang poetry, whose body of work includes over a thousand poems. The analysis focuses on his poems featured in Three Hundred Tang Poems (唐诗三百首), compiled by Gu Qing in 2009 and published by Zhonghua Shuju (中华书局), a Chinese-language volume that is a vital resource for scholars of Tang poetry. Employing a qualitative research design, the study utilizes literature screening for data collection, followed by content analysis. The volume encompasses twenty-six poems attributed to Li Bai, eight of which contain a total of fifteen distinct idioms. The research includes translations of these poems into Turkish, provides a thematic overview, analyzes and categorizes the idioms by meaning, and offers suggestions for closely related Turkish idioms.
Patronizing and Condescending Language (PCL) is a form of discriminatory toxic speech targeting vulnerable groups, threatening both online and offline safety. While toxic speech research has mainly focused on overt toxicity, such as hate speech, microaggressions in the form of PCL remain underexplored. Additionally, dominant groups' discriminatory facial expressions and attitudes toward vulnerable communities can be more impactful than verbal cues, yet these frame features are often overlooked. In this paper, we introduce the PCLMM dataset, the first Chinese multimodal dataset for PCL, consisting of 715 annotated videos from Bilibili, with high-quality PCL facial frame spans. We also propose the MultiPCL detector, featuring a facial expression detection module for PCL recognition, demonstrating the effectiveness of modality complementarity in this challenging task. Our work makes an important contribution to advancing microaggression detection within the domain of toxic speech.
In recent years, the breakthrough of Large Language Models (LLMs) offers new ideas for achieving universal methods on graph data. The common practice of converting graphs into natural language for LLMs, known as graph flattening, exhibits good generalizability and interpretability. However, poorly organized textual formats lead to weak performance in long-distance scenario understanding. Inspired by human cognitive reasoning habits, we propose a novel method for graph flattening to fit LLMs, termed End-to-End DAG-Path prompting (EEDP). Experiments on real-world datasets show that EEDP enhances the reasoning performance of LLMs in long-distance scenarios while maintaining excellent performance in short-distance scenarios, demonstrating good robustness in the face of distance variations.
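The intuition behind path-based flattening, serializing a DAG as complete root-to-leaf paths so that long-distance relations appear on a single line of text, can be sketched as follows. The function names and the `->` textual format are illustrative assumptions, not EEDP's exact prompt template.

```python
# Hypothetical sketch of DAG-path flattening for an LLM prompt.
from collections import defaultdict

def dag_paths(edges: list[tuple[str, str]]) -> list[list[str]]:
    """Enumerate every root-to-leaf path in a DAG given as an edge list."""
    children = defaultdict(list)
    has_parent = set()
    for u, v in edges:
        children[u].append(v)
        has_parent.add(v)
    roots = [n for n in children if n not in has_parent]

    paths = []
    def walk(node, path):
        if not children[node]:          # leaf node: record the full path
            paths.append(path)
            return
        for nxt in children[node]:
            walk(nxt, path + [nxt])
    for r in roots:
        walk(r, [r])
    return paths

def flatten(edges: list[tuple[str, str]]) -> str:
    """Render each path as one line of text, one hop per arrow."""
    return "\n".join(" -> ".join(p) for p in dag_paths(edges))

edges = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")]
print(flatten(edges))
# A -> B -> C -> E
# A -> D -> C -> E
```

Compared with listing edges one by one, this keeps multi-hop relations (e.g. A to E) within a single line, which is the kind of long-distance structure the abstract says edge-list flattening loses.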
The surge of large language models (LLMs) has driven significant progress in medical applications, including traditional Chinese medicine (TCM). However, current medical LLMs struggle with TCM diagnosis and syndrome differentiation due to substantial differences between TCM and modern medical theory, and the scarcity of specialized, high-quality corpora. To this end, in this paper we propose BianCang, a TCM-specific LLM, using a two-stage training process that first injects domain-specific knowledge and then aligns it through targeted stimulation to enhance diagnostic and differentiation capabilities. Specifically, we constructed pre-training corpora, instruction-aligned datasets based on real hospital records, and the ChP-TCM dataset derived from the Pharmacopoeia of the People's Republic of China. We compiled extensive TCM and medical corpora for continual pre-training and supervised fine-tuning, building a comprehensive dataset to refine the model's understanding of TCM. Evaluations across 11 test sets involving 31 models and 4 tasks demonstrate the effectiveness of BianCang, offering valuable insights for future research. Code, datasets, and models are available on https://github.com/QLU-NLP/BianCang.
As an indispensable ingredient of intelligence, commonsense reasoning is crucial for large language models (LLMs) in real-world scenarios. In this paper, we propose CORECODE, a dataset that contains abundant commonsense knowledge manually annotated on dyadic dialogues, to evaluate the commonsense reasoning and commonsense conflict detection capabilities of Chinese LLMs. We categorize commonsense knowledge in everyday conversations into three dimensions: entity, event, and social interaction. For easy and consistent annotation, we standardize the form of commonsense knowledge annotation in open-domain dialogues as "domain: slot = value". A total of 9 domains and 37 slots are defined to capture diverse commonsense knowledge. With these pre-defined domains and slots, we collect 76,787 commonsense knowledge annotations from 19,700 dialogues through crowdsourcing. To evaluate and enhance the commonsense reasoning capability of LLMs on the curated dataset, we establish a series of dialogue-level reasoning and detection tasks, including commonsense knowledge filling, commonsense knowledge generation, commonsense conflict phrase detection, domain identification, slot identification, and event causal inference. A wide variety of existing open-source Chinese LLMs are evaluated with these tasks on our dataset. Experimental results demonstrate that these models struggle with CORECODE's rich reasoning content, with even ChatGPT achieving only 0.275 and 0.084 accuracy on the domain identification and slot identification tasks in the zero-shot setting. We release the data and code of CORECODE at https://github.com/danshi777/CORECODE to promote commonsense reasoning evaluation and study of LLMs in the context of daily conversations.
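The "domain: slot = value" annotation form described above is simple enough to sketch directly. The example annotation string below is invented for illustration; it is not drawn from the dataset's actual schema of 9 domains and 37 slots.

```python
# Illustrative parser for CORECODE-style "domain: slot = value"
# annotation strings. The domain/slot/value content is a made-up
# example, not an actual CORECODE annotation.

def parse_annotation(ann: str) -> tuple[str, str, str]:
    """Split one annotation string into (domain, slot, value)."""
    domain, rest = ann.split(":", 1)
    slot, value = rest.split("=", 1)
    return domain.strip(), slot.strip(), value.strip()

ann = "event: precondition = the speaker has bought a ticket"
print(parse_annotation(ann))
# ('event', 'precondition', 'the speaker has bought a ticket')
```

Standardizing annotations to a single flat string format like this is what makes tasks such as domain identification and slot identification well defined: the model's prediction can be scored by exact match against the parsed fields.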
Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach
AI safety training and red-teaming of large language models (LLMs) are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguards through translating unsafe English inputs into low-resource languages. On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can move users toward their harmful goals 79% of the time, which is on par with or even surpasses state-of-the-art jailbreaking attacks. Other high- and mid-resource languages have significantly lower attack success rates, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. Previously, limited training on low-resource languages primarily affected speakers of those languages, causing technological disparities. However, our work highlights a crucial shift: this deficiency now poses a risk to all LLM users, since publicly available translation APIs enable anyone to exploit LLMs' safety vulnerabilities. Therefore, our work calls for more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage.
Bhavyajeet Singh, Pavan Kandru, Anubhav Sharma, et al.
Massive knowledge graphs like Wikidata attempt to capture world knowledge about multiple entities. Recent approaches concentrate on automatically enriching these KGs from text. However, much information present as natural text in low-resource languages is often missed. Cross-lingual information extraction aims at extracting factual information in the form of English triples from low-resource Indian-language text. Despite its massive potential, progress on this task lags behind monolingual information extraction. In this paper, we propose the task of Cross Lingual Fact Extraction (CLFE) from text and devise an end-to-end generative approach for it, which achieves an overall F1 score of 77.46.
Bram M. A. van Dijk, Tom Kouwenhoven, Marco R. Spruit, et al.
Current Large Language Models (LLMs) are unparalleled in their ability to generate grammatically correct, fluent text. LLMs are appearing rapidly, and debates on LLM capacities have taken off, but reflection is lagging behind. Thus, in this position paper, we first zoom in on the debate and critically assess three points recurring in critiques of LLM capacities: i) that LLMs only parrot statistical patterns in the training data; ii) that LLMs master formal but not functional language competence; and iii) that language learning in LLMs cannot inform human language learning. Drawing on empirical and theoretical arguments, we show that these points need more nuance. Second, we outline a pragmatic perspective on the issue of 'real' understanding and intentionality in LLMs. Understanding and intentionality pertain to unobservable mental states we attribute to other humans because they have pragmatic value: they allow us to abstract away from complex underlying mechanics and predict behaviour effectively. We reflect on the circumstances under which it would make sense for humans to similarly attribute mental states to LLMs, thereby outlining a pragmatic philosophical context for LLMs as an increasingly prominent technology in society.
Shaohua Lyu, Claire Shuiqing Zhang, et al.
Background: Migraine is a prevalent headache disorder with significant impacts on patients' quality of life and economic burden. Chinese herbal medicine (CHM) is commonly prescribed for migraine in China. This review aimed to provide a rigorous evaluation of evidence on the efficacy of oral CHM for migraine and explore the correlation between its effect size and treatment duration. Methods: We searched nine digital databases (PubMed, EMBASE, CINAHL, Cochrane Central Register of Controlled Trials, AMED, BioMedical Literature, CNKI, CQVIP, and Wanfang Data) from their inceptions to May 2021, with the language restricted to Chinese and English. Randomized, placebo-controlled trials using oral CHM to treat adult migraine were included. Data screening and extraction were conducted by two independent reviewers. The methodological quality of randomized controlled trials (RCTs) was assessed using the Cochrane Risk of Bias tool. Meta-analyses were conducted to estimate the effect size using a random-effects model, and a robust variance estimation (RVE) model was constructed to explore the correlation between treatment effects and treatment duration. The certainty of the evidence was assessed with the Grading of Recommendations Assessment, Development, and Evaluation. Publication bias was tested using a funnel plot and Egger's test. Results: A total of 18 RCTs involving 3,015 participants were included. Results of the meta-analyses showed that, at the end of the treatment phase, CHM was more efficacious than placebo in reducing migraine frequency, migraine days, and pain severity, and in increasing response rate. Additionally, CHM showed superior effects to placebo in lowering migraine frequency and pain severity at the end of the 4-week follow-up. The RVE model suggested that the benefits of CHM for migraine frequency and pain intensity increased as treatment duration extended. The number of adverse events reported by the CHM and placebo groups was comparable. The certainty of the evidence was graded as "moderate." No publication bias was detected. Conclusion: Oral CHM appeared to be more efficacious than placebo for reducing migraine frequency and pain severity. Greater treatment effects were associated with longer treatment duration. Oral CHM was well tolerated. Systematic Review Registration: https://www.crd.york.ac.uk/prospero/#recordDetails, identifier: CRD42021270719.
Training generalist agents is difficult across several axes, requiring us to deal with high-dimensional inputs (space), long horizons (time), and generalization to novel tasks. Recent advances in architectures have allowed for improved scaling along one or two of these axes, but are still computationally prohibitive to use. In this paper, we propose to address all three axes by leveraging Language to Control Diffusion models as a hierarchical planner conditioned on language (LCD). We effectively and efficiently scale diffusion models for planning in extended temporal, state, and task dimensions to tackle long-horizon control problems conditioned on natural language instructions, as a step towards generalist agents. Comparing LCD with other state-of-the-art models on the CALVIN language robotics benchmark shows that LCD outperforms other SOTA methods in multi-task success rate, while improving inference speed over other comparable diffusion models by 3.3x to 15x. We show that LCD can successfully leverage the unique strength of diffusion models to produce coherent long-range plans while addressing their weakness in generating low-level details and control.
This paper describes the University of Maryland's submission to the Special Task on Formality Control for Spoken Language Translation at IWSLT, which evaluates translation from English into 6 languages with diverse grammatical formality markers. We investigate to what extent this problem can be addressed with a single multilingual model, simultaneously controlling its output for target language and formality. Results show that this strategy can approach the translation quality and formality control achieved by dedicated translation models. However, the nature of the underlying pre-trained language model and of the finetuning samples greatly impact results.
We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission. In this work, we describe GPT-NeoX-20B's architecture and training and evaluate its performance on a range of language-understanding, mathematics, and knowledge-based tasks. We find that GPT-NeoX-20B is a particularly powerful few-shot reasoner and gains far more in performance when evaluated five-shot than similarly sized GPT-3 and FairSeq models. We open-source the training and evaluation code, as well as the model weights, at https://github.com/EleutherAI/gpt-neox.