The Mask of Civility: Benchmarking Chinese Mock Politeness Comprehension in Large Language Models
Yitong Zhang, Yuhan Xiang, Mingxuan Liu
From a pragmatic perspective, this study systematically evaluates the differences in performance among representative large language models (LLMs) in recognizing politeness, impoliteness, and mock politeness phenomena in Chinese. Addressing the existing gaps in pragmatic comprehension, the research adopts the frameworks of Rapport Management Theory and the Model of Mock Politeness to construct a three-category dataset combining authentic and simulated Chinese discourse. Six representative models, including GPT-5.1 and DeepSeek, were selected as test subjects and evaluated under four prompting conditions: zero-shot, few-shot, knowledge-enhanced, and hybrid strategies. This study serves as a meaningful attempt within the paradigm of "Great Linguistics," offering a novel approach to applying pragmatic theory in the age of technological transformation. It also responds to the contemporary question of how technology and the humanities may coexist, representing an interdisciplinary endeavor that bridges linguistic technology and humanistic reflection.
Innovative model of China's economic leadership in the post-globalization world
Qi Yang, Tetiana Chernytska
The article examines the innovative model of economic leadership in the People's Republic of China in the context of post-globalization. It is substantiated that the modern Chinese economy functions as an institutionally integrated system of innovation-driven development, in which the state, business, and science form a synergistic interaction. Based on the theoretical frameworks of J. Schumpeter, B. Lundvall, R. Nelson, as well as the institutional approach and the concept of structural power by S. Strange, the study demonstrates that China's model combines state strategic planning with market incentives, forming a new type of "state-coordinated innovative capitalism." Using data from international organizations (World Bank, OECD, IMF, WIPO), the paper proves that the growing share of high-tech sectors in GDP, the expansion of digital services, and the development of knowledge-intensive industries have ensured the resilience of China's economic growth amid global turbulence. Special attention is paid to the geo-economic dimension of China's innovation leadership, which is implemented through the Belt and Road Initiative and the Digital Silk Road, as well as the introduction of the digital yuan (e-CNY). The study reveals that technology has become a tool of China's structural power, shaping an alternative architecture of globalization based on knowledge, data, and technological sovereignty. It is concluded that China is constructing a new paradigm of economic leadership in which innovation emerges as the key driver of global competitiveness and a crucial mechanism in forming a polycentric world order.
Chinese language and literature
Progress and stagnation: A comparative study of gender representation in Chinese language textbooks for junior high school between the 2001 and 2023 editions
Xia Yang, Yu Sun, Hu He
Gender bias in textbooks is a global problem. Gender development stands as a pivotal concern among junior high school students, underscoring the importance of scrutinizing gender portrayals within textbooks. Chinese language textbooks, used by around 200 million students, have received scant attention in the English-language literature concerning their embedded gender ideology. This essay presents a comparative analysis of gender representations in Chinese language textbooks published by the People's Education Press in 2001 and 2023, revealing limited progress. The evolution of these textbooks has seen a rise in the number of female authors and an expansion of gender-centric themes, fostering a more diverse array of female roles. However, a discernible disparity persists in the frequency of male and female portrayals, with males continuing to dominate the narrative, even amid a slight decline in the female proportion. Furthermore, the spectrum of gender roles and professions remains limited, perpetuating the age-old gender binary of "men as providers, women as homemakers." Notably, societal stereotypes of both genders linger, impeding the depiction of nuanced and multifaceted identities. In summary, this study found that both sets of Chinese language textbooks implicitly contain gender bias, with little progress made toward gender equality; in some respects, progress has stagnated or even declined.
Chinese Tea Pest and Disease Named Entity Recognition Method Based on Improved Boundary Offset Prediction Network
XIE Yuxin, WEI Jiangshu, ZHANG Yao, LI Fang
[Objective] Named entity recognition (NER) is vital for many natural language processing (NLP) applications, including information retrieval and knowledge graph construction. While Chinese NER has advanced with datasets like ResumeNER, WeiboNER, and CLUENER (Chinese language understanding evaluation NER), most focus on general domains such as news or social media. However, there is a notable lack of annotated data in specialized fields, particularly agriculture. In the context of tea pests and diseases, this shortage hampers progress in intelligent agricultural information extraction. These domain-specific texts pose unique challenges for NER due to frequent nested and long-span entities, which traditional sequence labeling models struggle to handle. Issues such as boundary ambiguity further complicate accurate entity recognition, leading to poor segmentation and labeling performance. Addressing these challenges requires targeted datasets and improved NER techniques tailored to the agricultural domain. [Methods] The proposed model comprises two core modules, built on a boundary offset prediction network (BOPN), specifically designed to enhance performance in domains characterized by complex and fine-grained entity structures, such as tea pest and disease recognition. The boundary prediction module was responsible for identifying entity spans within input text sequences. It employed an attention-based mechanism to dynamically estimate the probability that consecutive tokens belong to the same entity, thereby addressing the challenge of boundary ambiguity. This mechanism facilitated more accurate detection of entity boundaries, which was particularly critical in scenarios involving nested or overlapping entities. The label enhancement module further refined entity recognition by employing a biaffine classifier that jointly modeled entity spans and their corresponding category labels.
This joint modeling approach enabled the capture of intricate interactions between span representations and semantic label information, improving the identification of long or syntactically complex entities. The output of this module was integrated with conditionally normalized hidden representations, enhancing the model's capacity to assign context-aware and semantically precise labels. In order to reduce computational complexity while preserving model effectiveness, the architecture incorporated low-rank linear layers. These were constructed by integrating the adaptive channel weighting mechanism of Squeeze-and-Excitation Networks with low-rank decomposition techniques. The modified layers replace traditional linear transformations, yielding improvements in both efficiency and representational capacity. In addition to model development, a domain-specific NER corpus was constructed through the systematic collection and annotation of entity information related to tea pest and disease from scientific literature, agricultural technical reports, and online texts. The annotated entities in the corpus were categorized into ten classes, including tea plant diseases, tea pests, disease symptoms, and pest symptoms. Based on this labeled corpus, a Chinese NER dataset focused on tea pest and disease was developed, referred to as the Chinese tea pest and disease dataset. [Results and Discussions] Extensive experiments were conducted on the constructed dataset, comparing the proposed method with several mainstream NER approaches, including traditional sequence labeling models (e.g., BiLSTM-CRF), lexicon-enhanced models (e.g., SoftLexicon), and boundary smoothing strategies (e.g., Boundary Smooth). These comparisons aimed to rigorously assess the effectiveness of the proposed architecture in handling domain-specific and structurally complex entity types. 
Additionally, to evaluate the model's generalization capability beyond the tea pest and disease domain, the study performed comprehensive evaluations on four publicly available Chinese NER benchmark datasets: ResumeNER, WeiboNER, CLUENER, and Taobao. Results showed that the proposed model consistently achieved higher F1 scores across all datasets, with improvements of 0.68% on the self-built dataset, 0.29% on ResumeNER, 0.96% on WeiboNER, 0.7% on CLUENER, and 0.5% on Taobao, and with particularly notable gains in the recognition of complex, nested, and long-span entities. These outcomes demonstrate the model's superior capacity for capturing intricate entity boundaries and semantics, and confirm its robustness and adaptability when compared to current state-of-the-art methods. [Conclusions] The study presents a high-performance NER approach tailored to the characteristics of Chinese texts on tea pests and diseases. By simultaneously optimizing entity boundary detection and label classification, the proposed method significantly enhanced recognition accuracy in specialized domains. Experimental results demonstrated strong adaptability and robustness of the model across both the newly constructed and publicly available datasets, indicating its broad applicability and promising prospects.
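The low-rank linear layers with adaptive channel weighting described in the Methods can be illustrated with a small NumPy sketch. This is not the paper's implementation: the dimensions, initialization, bottleneck ratio, and the exact placement of the squeeze-and-excitation weighting are all illustrative assumptions.

```python
import numpy as np

class LowRankSELinear:
    """Illustrative low-rank linear layer with squeeze-and-excitation
    channel weighting. All design choices here are assumptions."""

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        # Low-rank decomposition: the full (d_in x d_out) weight is
        # approximated by U (d_in x rank) @ V (rank x d_out).
        self.U = rng.normal(0.0, 0.02, (d_in, rank))
        self.V = rng.normal(0.0, 0.02, (rank, d_out))
        # Squeeze-and-excitation bottleneck over the output channels.
        r = max(d_out // 4, 1)
        self.W1 = rng.normal(0.0, 0.02, (d_out, r))
        self.W2 = rng.normal(0.0, 0.02, (r, d_out))

    def __call__(self, x):  # x: (batch, d_in)
        h = x @ self.U @ self.V                 # low-rank projection
        s = h.mean(axis=0)                      # squeeze: per-channel statistic
        e = 1.0 / (1.0 + np.exp(-(np.maximum(s @ self.W1, 0.0) @ self.W2)))
        return h * e                            # excitation: reweight channels

layer = LowRankSELinear(d_in=768, d_out=256, rank=32)
x = np.random.default_rng(1).normal(size=(4, 768))
out = layer(x)  # shape (4, 256)
```

The efficiency gain comes from the parameter count: the factored projection costs d_in*rank + rank*d_out weights instead of d_in*d_out for a dense layer.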
Agriculture (General), Technology (General)
Summary of the Best Evidence for the Prevention and Management of Medical Adhesive-Related Skin Injuries in Patients with Tracheal Intubation
Hu M, Hu J, Sun P
et al.
Minhua Hu, Jiajia Hu, Peng Sun, Hongyan Zheng, Mingyang Zhang. The First Department of Surgical Anesthesiology, Zhongshan People's Hospital, Zhongshan, Guangdong, 528403, People's Republic of China. Correspondence: Minhua Hu (Email: huminhuamail@163.com) and Mingyang Zhang (Email: jdzmy2010@163.com), The First Department of Surgical Anesthesiology, Zhongshan People's Hospital, 2 Sun Wen East Road, Zhongshan, Guangdong, People's Republic of China.
Objective: The aim of this study was to summarize the best evidence for the prevention and management of medical adhesive-related skin injury (MARSI) and to provide a scientific basis for healthcare professionals in the facial skin care of patients undergoing tracheal intubation.
Methods: Following the "6S" pyramid model, the literature on the prevention and management of adhesive-associated skin lesions was searched from top to bottom across guideline websites, relevant professional association websites, and Chinese and English databases. The search covered the period from database inception to August 10, 2024; evidence consolidation was completed in September 2024. The quality of the included studies was assessed, and key evidence was extracted and summarized.
Results: One guideline, five expert consensus reports, three evidence summaries, one cohort study, and two randomized controlled trials were included. In total, 30 key evidence items were extracted from 7 areas: risk factor identification, skin assessment, selection and use of medical adhesives, skin care, use of skin protection products and removers, pain management, and education and training. The evidence base is characterized by a high proportion of expert consensus (Level 5, 20%) and case-series evidence (Level 4, 60%), reflecting the current lack of robust RCTs in MARSI prevention for intubated patients.
However, 73.3% of the recommendations carry Grade A strength, supported by either consistent expert consensus or at least one high-quality RCT, providing a solid foundation for clinical implementation. Future research should prioritize large-scale RCTs to validate interventions in diverse populations.
Conclusion: The best evidence for the prevention and management of medical adhesive-associated skin breakdown summarized in this study can serve as a scientific reference for practical nursing care and provide direction for healthcare teams in the facial skin care of patients undergoing endotracheal intubation.
Plain Language Summary: Thirty key evidence items across 7 areas (risk factor identification, skin assessment, selection and use of medical adhesives, skin care, use of skin protection products and removers, pain management, and education and training) summarize the best evidence to guide healthcare professionals in the facial skin care of tracheally intubated patients.
Keywords: medical adhesives, skin injuries, skin tears, MARSI, prevention, evidence-based care, evidence summary
EEG2TEXT-CN: An Exploratory Study of Open-Vocabulary Chinese Text-EEG Alignment via Large Language Model and Contrastive Learning on ChineseEEG
Jacky Tai-Yu Lu, Jung Chiang, Chi-Sheng Chen
et al.
We propose EEG2TEXT-CN, which, to the best of our knowledge, represents one of the earliest open-vocabulary EEG-to-text generation frameworks tailored for Chinese. Built on a biologically grounded EEG encoder (NICE-EEG) and a compact pretrained language model (MiniLM), our architecture aligns multichannel brain signals with natural language representations via masked pretraining and contrastive learning. Using a subset of the ChineseEEG dataset, where each sentence contains approximately ten Chinese characters aligned with 128-channel EEG recorded at 256 Hz, we segment EEG into per-character embeddings and predict full sentences in a zero-shot setting. The decoder is trained with teacher forcing and padding masks to accommodate variable-length sequences. Evaluation on over 1,500 training-validation sentences and 300 held-out test samples shows promising lexical alignment, with a best BLEU-1 score of 6.38%. While syntactic fluency remains a challenge, our findings demonstrate the feasibility of non-phonetic, cross-modal language decoding from EEG. This work opens a new direction in multilingual brain-to-text research and lays the foundation for future cognitive-language interfaces in Chinese.
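The contrastive alignment between EEG and text embeddings can be illustrated with a symmetric InfoNCE-style objective. The sketch below is an assumption-laden illustration in NumPy: the in-batch pairing, the temperature value, and the normalization are our choices, not details from the paper.

```python
import numpy as np

def info_nce(eeg_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over paired EEG/text embeddings.
    Matched pairs sit on the diagonal of the similarity matrix; the loss
    pushes each EEG vector toward its own sentence and away from the rest."""
    e = eeg_emb / np.linalg.norm(eeg_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = e @ t.T / temperature          # (batch, batch) cosine similarities
    labels = np.arange(len(logits))

    def xent(l):  # cross-entropy against the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return (xent(logits) + xent(logits.T)) / 2.0  # EEG->text and text->EEG

# toy check: perfectly aligned pairs should score lower than mismatched ones
emb = np.eye(4)
aligned = info_nce(emb, emb)
shuffled = info_nce(emb, emb[::-1].copy())
```

With identical embeddings the diagonal dominates and the loss approaches zero, while mispaired batches are penalized heavily.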
Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models
Ece Takmaz, Lisa Bylinina, Jakub Dotlacil
State-of-the-art vision-and-language models consist of many parameters and learn from enormous datasets, surpassing the amounts of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge addressing this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform in language-only tasks. Therefore, we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with model merging, where we fuse the parameters of multimodal models with those of language-only models using weighted linear interpolation. Our results corroborate the findings that multimodal models underperform in language-only benchmarks that focus on grammar, and model merging with text-only models can help alleviate this problem to some extent, while maintaining multimodal performance.
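The weighted linear interpolation used for model merging fits in a few lines. A minimal sketch, assuming both models expose parameters as dictionaries with matching names and shapes; the parameter names below are illustrative, not taken from the paper.

```python
import numpy as np

def merge_models(params_a, params_b, alpha=0.5):
    """Fuse two models by weighted linear interpolation of their parameters.
    alpha=1.0 returns model A unchanged; alpha=0.0 returns model B."""
    assert params_a.keys() == params_b.keys(), "parameter names must match"
    return {name: alpha * params_a[name] + (1.0 - alpha) * params_b[name]
            for name in params_a}

# toy example: one weight tensor per "model" (names are hypothetical)
text_only = {"embed.weight": np.zeros((2, 2))}
multimodal = {"embed.weight": np.ones((2, 2))}
merged = merge_models(text_only, multimodal, alpha=0.25)
# each merged weight is 0.25 * 0 + 0.75 * 1 = 0.75
```

Sweeping alpha trades off language-only ability (from the text-only model) against multimodal performance, which is the knob the experiments above vary.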
Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation
Shangqing Zhao, Yuhao Zhou, Yupei Ren
et al.
Ancient Chinese text processing presents unique challenges for large language models (LLMs) due to its distinct linguistic features, complex structural constraints, and rich cultural context. While existing benchmarks have primarily focused on evaluating comprehension through multiple-choice questions, there remains a critical gap in assessing models' generative capabilities in classical Chinese. We introduce Fùxì, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. Our benchmark distinguishes itself through three key contributions: (1) balanced coverage of both comprehension and generation tasks, including novel tasks like poetry composition and couplet completion, (2) specialized evaluation metrics designed specifically for classical Chinese text generation, combining rule-based verification with fine-tuned LLM evaluators, and (3) a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. Through extensive evaluation of state-of-the-art LLMs, we reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks, particularly those requiring deep cultural knowledge and adherence to classical formats. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development. The benchmark, evaluation toolkit, and baseline results are publicly available to facilitate research in this domain.
EssayBench: Evaluating Large Language Models in Multi-Genre Chinese Essay Writing
Fan Gao, Dongyuan Li, Ding Xia
et al.
Chinese essay writing and its evaluation are critical in educational contexts, yet the capabilities of Large Language Models (LLMs) in this domain remain largely underexplored. Existing benchmarks often rely on coarse-grained text quality metrics, largely overlooking the structural and rhetorical complexities of Chinese essays, particularly across diverse genres. To address this gap, we propose EssayBench, a multi-genre benchmark specifically designed for Chinese essay writing across four major genres: Argumentative, Narrative, Descriptive, and Expository. We curate and refine a total of 728 real-world prompts to ensure authenticity and meticulously categorize them into Open-Ended and Constrained sets to capture diverse writing scenarios. To reliably evaluate generated essays, we develop a fine-grained, genre-specific scoring framework that hierarchically aggregates scores. We further validate our evaluation protocol through a comprehensive human agreement study. Finally, we benchmark 15 large-sized LLMs, analyzing their strengths and limitations across genres and instruction types. With EssayBench, we aim to advance LLM-based Chinese essay evaluation and inspire future research on improving essay generation in educational settings.
McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models
Tian Lan, Xiangdong Su, Xu Liu
et al.
As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually being disclosed. Measuring bias in LLMs is therefore crucial to mitigating their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually support only a single evaluation task and cannot evaluate bias in LLMs from multiple aspects. To address these issues, we present the Multi-task Chinese Bias Evaluation Benchmark (McBE), which includes 4,077 bias evaluation instances covering 12 single bias categories and 82 subcategories, and introduces 5 evaluation tasks, providing extensive category coverage, content diversity, and comprehensive measurement. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all of these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of the results, offering novel insights into bias in LLMs.
Missing the human touch? A computational stylometry analysis of GPT-4 translations of online Chinese literature
Xiaofang Yao, Yong-Bin Kang, Anthony McCosker
Existing research indicates that machine translations (MTs) of literary texts are often unsatisfactory. MTs are typically evaluated using automated metrics and subjective human ratings, with limited focus on stylistic features. Evidence is also limited on whether state-of-the-art large language models (LLMs) will reshape literary translation. This study examines the stylistic features of LLM translations, comparing GPT-4's performance to human translations in a Chinese online literature task. Computational stylometry analysis shows that GPT-4 translations closely align with human translations in lexical, syntactic, and content features, suggesting that LLMs might replicate the 'human touch' in literary translation style. These findings offer insights into AI's impact on literary translation from a posthuman perspective, where distinctions between machine and human translations become increasingly blurry.
Exploring the Influence of Personality Traits on Causal Attributions in Language Learning
Süleyman Kasap, Fırat Ünsal
In the contemporary era of globalization and economic shifts, acquiring foreign language skills has become almost indispensable. The ability to communicate in a language other than one's native tongue offers individuals a strategic advantage, facilitating informed decisions and empowering them in various life pursuits. Consequently, there is a growing emphasis on learning languages such as English, German, French, and Chinese. However, this increased focus on language acquisition has brought about significant changes in the landscape of foreign language education. Despite the considerable attention given to language education, particularly in English, which stands as the international lingua franca, learners often encounter challenges in attaining proficiency. This study aimed to explore the influence of personality traits on the success and failure of high school students in learning a foreign language. Conducted using a mixed research method and sequential design, the study involved 984 high school students from the 11th and 12th grades in the Edremit, İpekyolu, and Tuşba districts of Van province. To gather comprehensive data, a combination of quantitative tools such as the "Personal Information Form," "Five Factor Personality Inventory," and "Scale of Attribution of Success and Failure," along with qualitative insights from a "semi-structured interview form," was utilized. The study commenced by establishing a robust theoretical framework, elucidating pivotal concepts such as personality, its basic characteristics, dimensions, attribution theory, and its relationship to foreign language learning. Furthermore, the literature review encapsulated insights from prior studies within attribution theory and personality traits. Subsequently, the mixed-method design, participant information, data collection tools, and procedures were elaborated upon in detail. 
The data obtained were interpreted through the lens of Weiner's attribution theory, leading to the formulation of recommendations for prospective research. Analysis of both the quantitative and qualitative data made it evident that students attributed their success and failure in learning English as a Foreign Language to multifarious factors. Successful students predominantly ascribed their achievements to internal, controllable, and variable factors. Conversely, less successful students attributed their shortcomings to external, uncontrollable, and permanent causes. Further analyses highlighted the significant influence of gender on attribution types and item-level attributions. Moreover, personality traits (extraversion, agreeableness, conscientiousness, neuroticism, and openness to experience), age, high school type, department level, grade, and academic performance exhibited notable disparities in attributions.
Congenital isolated adrenocorticotropic hormone deficiency in a newborn caused by TBX19 mutation: a case report and literature review
Yinxia Dang, Juanli Zhang, Fan Wang
Background: To investigate the clinical phenotype, genetic characteristics, and prognosis of isolated adrenocorticotropic hormone deficiency (IAD, OMIM 201400) in a newborn caused by mutation of the TBX19 gene.
Case presentation: The clinical features, diagnosis, treatment, and prognosis of a newborn with IAD admitted to our hospital were retrospectively analyzed. The patient and his parents were also examined by whole exome sequencing. We used the terms "newborn", "child", "congenital isolated adrenocorticotropic hormone deficiency", and "TBX19" to retrieve relevant studies published up to December 2023 from the following databases: China National Knowledge Infrastructure (CNKI), Wanfang Database, Chinese Medical Journal Full-text Database, VIP database, Sinomed, PubMed, Embase, and Web of Science. The clinical and genetic characteristics of children from these other publications were summarized. The newborn boy with IAD was admitted to our hospital with poor mental response, feeding difficulties, hypoglycemia, and jaundice. The brain and adrenal MRI results were normal. Clinical whole exome sequencing showed that the boy carried compound heterozygous variants in the TBX19 gene. Specifically, the first exon had a novel frameshift mutation, c.240-246del (p.Leu81Profs*54, NM_005149.3), and a missense mutation, c.377C>T (p.Pro126Leu, NM_005149.3). The literature search found 34 additional cases from 4 Chinese-language articles and 12 English-language articles. The main clinical manifestations were hypoglycemia, jaundice, convulsions, feeding difficulties, poor mental response, hypotonia, and growth retardation. There were 24 cases with TBX19 mutations, across 19 different mutation sites. Among the 15 patients with different degrees of nervous system developmental delay, 13 initiated treatment when more than 1 year old.
Conclusion: IAD caused by TBX19 mutation presents with nonspecific symptoms. Genetic testing is the key to diagnosis.
Early diagnosis and treatment can help to improve the prognosis and prevent neurological complications.
Clinical Trial Registration: identifier (2024A-796).
Chinese Scientific Literature Annotation Method Based on Large Language Model
YANG Dongju, HUANG Juntao
High-quality annotated data are crucial for Natural Language Processing (NLP) tasks in the field of Chinese scientific literature. A method of annotation based on a Large Language Model (LLM) was proposed to address the lack of high-quality annotated corpora and the issues of inconsistent and inefficient manual annotation in Chinese scientific literature. First, a fine-grained annotation specification suitable for multi-domain Chinese scientific literature was established to clarify entity types and annotation granularity. Second, a structured text annotation prompt template and a generation parser were designed. The annotation task of Chinese scientific literature was set up as a single-stage, single-round question-and-answer process in which the annotation specifications and the text to be annotated were filled into the corresponding slots of the prompt template to construct the task prompt. This prompt was then passed to the LLM to generate output text containing annotation information. Finally, the structured annotation data were obtained by the parser. Subsequently, using prompt learning based on the LLM, the Annotated Chinese Scientific Literature (ACSL) entity dataset was generated, containing 10,000 annotated documents and 72,536 annotated entities distributed across 48 disciplines. For ACSL, three baseline models based on RoBERTa-wwm-ext, a configuration of the Robustly optimized Bidirectional Encoder Representations from Transformers (RoBERTa) approach, were proposed. The experimental results demonstrate that the BERT+Span model performs best on long-span entity recognition in Chinese scientific literature, achieving an F1 score of 0.335. These results serve as benchmarks for future research.
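The single-round prompt-and-parse pipeline described above might look like the following sketch. The template wording, slot names, and JSON output format are illustrative assumptions, not the paper's actual specification.

```python
import json
import re

# Hypothetical prompt template with two slots: the annotation
# specification and the text to be annotated.
PROMPT_TEMPLATE = (
    "You are annotating entities in Chinese scientific literature.\n"
    "Annotation specification:\n{spec}\n\n"
    "Text to annotate:\n{text}\n\n"
    'Answer with a JSON list of objects: [{{"entity": "...", "type": "..."}}]'
)

def build_prompt(spec: str, text: str) -> str:
    # fill the specification and target text into the template's slots
    return PROMPT_TEMPLATE.format(spec=spec, text=text)

def parse_annotations(llm_output: str):
    # generation parser: pull the first JSON array out of the model's reply
    match = re.search(r"\[.*\]", llm_output, re.DOTALL)
    return json.loads(match.group(0)) if match else []

# usage with a mocked model reply (no real LLM call is made here)
prompt = build_prompt("Label method and dataset entities.", "本文提出BERT+Span模型。")
reply = 'Sure: [{"entity": "BERT+Span", "type": "method"}]'
annotations = parse_annotations(reply)
# annotations == [{"entity": "BERT+Span", "type": "method"}]
```

The parser deliberately tolerates conversational filler around the JSON payload, which is a common failure mode when extracting structured output from LLM text.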
Computer engineering. Computer hardware, Computer software
Morphological evaluation of subwords vocabulary used by BETO language model
Óscar García-Sierra, Ana Fernández-Pampillón Cesteros, Miguel Ortega-Martín
Subword tokenization algorithms used by Large Language Models are significantly more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always align with real morphemes, potentially impacting the models' performance, though it remains uncertain when this might occur. In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language. Our evaluation method was built on three quality measures: relevance, cohesion, and morphological accuracy, together with a procedure for their assessment. By applying this method to vocabularies created by three subword tokenization algorithms (BPE, WordPiece, and Unigram), we concluded that these vocabularies generally exhibit very low morphological quality. In this article, we apply this evaluation to the tokenizer of BETO, a BERT language model trained on large Spanish corpora. This evaluation, along with our previous results, led us to conclude that its vocabulary has low morphological quality, and we also found that training the tokenizer on a larger corpus does not improve the morphological quality of the generated vocabulary. Additionally, this evaluation helps clarify which algorithm the tokenizer uses, namely WordPiece, given the inconsistencies between the authors' claims and the model's configuration.
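The overlap-based idea behind such an evaluation can be illustrated with a toy stand-in for one measure: the fraction of subword vocabulary types that coincide with known morphemes. Both the vocabulary cleanup and the metric itself are simplified assumptions, not the paper's actual relevance, cohesion, or accuracy definitions.

```python
def morphological_relevance(vocab, morphemes):
    """Toy overlap measure: share of subword types that are real morphemes."""
    # strip common continuation markers used by WordPiece ("##")
    # and SentencePiece ("▁") before comparing against morphemes
    cleaned = {tok.lstrip("#").lstrip("▁") for tok in vocab}
    cleaned.discard("")
    return len(cleaned & set(morphemes)) / len(cleaned)

# hypothetical Spanish-flavored vocabulary and morpheme inventory
vocab = ["des", "##ambigu", "ación", "casa", "##s", "▁pre"]
morphemes = {"des", "casa", "s", "ación", "pre"}
score = morphological_relevance(vocab, morphemes)
# 5 of the 6 cleaned types ("des", "ación", "casa", "s", "pre") are
# morphemes, so score == 5/6
```

A real evaluation would also need to handle allomorphy and segmentation context, which is why the published method uses three separate measures rather than a single overlap ratio.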
LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model
Zhi Zhou, Jiang-Xin Shi, Peng-Xiao Song
et al.
Large language models (LLMs), including both proprietary and open-source models, have showcased remarkable capabilities in addressing a wide range of downstream tasks. Nonetheless, when it comes to practical Chinese legal tasks, these models fail to meet the actual requirements. Proprietary models do not ensure data privacy for sensitive legal cases, while open-source models demonstrate unsatisfactory performance due to their lack of legal knowledge. To address this problem, we introduce LawGPT, the first open-source model specifically designed for Chinese legal applications. LawGPT comprises two key components: legal-oriented pre-training and legal supervised fine-tuning. Specifically, we employ large-scale Chinese legal documents for legal-oriented pre-training to incorporate legal domain knowledge. To further improve the model's performance on downstream legal tasks, we create a knowledge-driven instruction dataset for legal supervised fine-tuning. Our experimental results demonstrate that LawGPT outperforms the open-source LLaMA 7B model. Our code and resources are publicly available at https://github.com/pengxiao-song/LaWGPT and have received 5.7K stars on GitHub.
C³Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models
Jiahuan Cao, Yongxin Shi, Dezhi Peng
et al.
Classical Chinese Understanding (CCU) holds significant value for the preservation and exploration of outstanding traditional Chinese culture. Recently, researchers have attempted to leverage the potential of Large Language Models (LLMs) for CCU by capitalizing on their remarkable comprehension and semantic capabilities. However, no comprehensive benchmark is available to assess the CCU capabilities of LLMs. To fill this gap, this paper introduces C³bench, a Comprehensive Classical Chinese understanding benchmark, which comprises 50,000 text pairs for five primary CCU tasks: classification, retrieval, named entity recognition, punctuation, and translation. Furthermore, the data in C³bench originate from ten different domains, covering most of the categories in classical Chinese. Leveraging the proposed C³bench, we extensively evaluate the quantitative performance of 15 representative LLMs on all five CCU tasks. Our results not only establish a public leaderboard of LLMs' CCU capabilities but also yield several findings. Specifically, existing LLMs struggle with CCU tasks and remain inferior to supervised models. Additionally, the results indicate that CCU is a task that requires special attention. We believe this study provides a standard benchmark, comprehensive baselines, and valuable insights for the future advancement of LLM-based CCU research. The evaluation pipeline and dataset are available at https://github.com/SCUT-DLVCLab/C3bench.
A Chinese Dataset for Evaluating the Safeguards in Large Language Models
Yuxia Wang, Zenan Zhai, Haonan Li
et al.
Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks when LLMs are deployed. Previous studies have proposed comprehensive taxonomies of the risks posed by LLMs, as well as corresponding prompts that can be used to examine the safety mechanisms of LLMs. However, the focus has been almost exclusively on English, and little has been explored for other languages. Here we aim to bridge this gap. We first introduce a dataset for the safety evaluation of Chinese LLMs, and then extend it to two other scenarios that can be used to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments on five LLMs show that region-specific risks are the prevalent type of risk and pose a major issue for all the Chinese LLMs we experimented with. Our data is available at https://github.com/Libr-AI/do-not-answer. Warning: this paper contains example data that may be offensive, harmful, or biased.
Manipulating language models' training data to study syntactic constraint learning: the case of English passivization
Cara Su-Yi Leong, Tal Linzen
Grammatical rules in natural languages are often characterized by exceptions. How do language learners learn these exceptions to otherwise general patterns? Here, we study this question through the case study of English passivization. While passivization is in general quite productive, there are cases where it cannot apply (cf. the following sentence is ungrammatical: *One hour was lasted by the meeting). Using neural network language models as theories of language acquisition, we explore the sources of indirect evidence that a learner can leverage to learn whether a verb can be passivized. We first characterize English speakers' judgments of exceptions to the passive, and confirm that speakers find some verbs more passivizable than others. We then show that a neural network language model's verb passivizability judgments are largely similar to those displayed by humans, suggesting that evidence for these exceptions is available in the linguistic input. Finally, we test two hypotheses as to the source of evidence that language models use to learn these restrictions: frequency (entrenchment) and semantics (affectedness). We do so by training models on versions of the corpus that have had sentences of the types implicated by each hypothesis removed, altered, or introduced. We find support for both hypotheses: entrenchment and affectedness make independent contributions to a verb's passivizability. From a methodological point of view, this study highlights the utility of altering a language model's training data for answering questions where complete control over a learner's input is vital.
A Survey of Large Language Models for Arabic Language and its Dialects
Malak Mashaabi, Shahad Al-Khalifa, Hend Al-Khalifa
This survey offers a comprehensive overview of Large Language Models (LLMs) designed for the Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and underscores the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.