CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity
Xuefeng Wei, Zhixuan Wang, Xuan Zhou
et al.
We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.
Assessment of the methodological quality of studies on core outcome sets for respiratory diseases: A systematic review and meta-research study.
Mengjuan Liu, Jiajia Wang, Lu Wang
et al.
Background: With increasing attention to core outcome sets (COS), the number of studies on COS for respiratory diseases (COS-RD) is on the rise. However, their methodological quality is still unclear. Therefore, we conducted a study to assess the methodological quality of studies on COS-RD. Methods: PubMed, Embase, Cochrane Library, and Web of Science were searched for study protocols or original studies on COS-RD in adults, from inception to February 23, 2024. The COMET database and Chinese databases (including China National Knowledge Infrastructure, Wanfang Data, Chongqing VIP database, and China Biology Medicine) were also searched as a supplement. Two researchers independently screened the literature, extracted the data, and assessed the methodological quality of included studies using the Core Outcome Set-STAndardised Protocol (COS-STAP) statement, the Core Outcome Set-STAndards for Development (COS-STAD) recommendations, and the Core Outcome Set-STAndards for Reporting (COS-STAR) statement. Results: A total of 27 articles (five study protocols and 22 original studies; 26 studies in total) were included in this study. For the assessment of study protocols using the COS-STAP statement, the item with the lowest complete reporting rate was "missing data" (Item 9, 40.0%), while "description how outcomes may be dropped/combined, with reasons" (Item 5b, 60.0%) and "dissemination" (Item 11, 60.0%) had relatively low complete reporting rates. For the assessment of original studies using the COS-STAD recommendations, the item with the highest non-reporting rate was "care was taken to avoid ambiguity of language used in the list of outcomes" (Item 11, 45.5%), while "the population(s) covered by COS" (Item 3, 31.8%) and "the intervention(s) covered by COS" (Item 4, 31.8%) had relatively high non-reporting rates. When using the COS-STAR statement to assess the original studies, the item with the lowest complete reporting rate was "protocol deviations" (Item 11, 13.6%), while "describe how outcomes were dropped/combined, with reasons (if applicable)" (Item 6b, 36.4%), "participants" (Item 5, 40.9%), "ethics and consent" (Item 10, 54.5%), "protocol/registry entry" (Item 14, 63.6%), and "outcome scoring" (Item 8, 63.6%) had relatively low complete reporting rates. Conclusion: The methodological quality of studies on COS-RD needs to be further improved. The appropriate use of the aforementioned international reporting standards can advance the methodological quality and reporting transparency of studies on COS-RD.
Writing Virtuality: Mu Xin’s Sinographic Designs Across Media and Nations
Muyun Zhou
This article reinterprets the concept of "virtuality" within contemporary Sinophone literature through a study of the writings and media presence of the Sinophone author Mu Xin (1927-2011). Challenging the common conflation of virtuality with digital or online environments, I argue that virtuality should be understood as an aesthetic and embodied design principle that traverses the boundaries between physical and imagined spaces, national identity and diaspora, and print and digital media. Focusing on Mu Xin’s short story “The Reflected Shadow of Columbia” (Gelunbiya de daoying 哥倫比亞的倒影) and the manuscript Prison Notes (Yuzhong shougao 獄中手稿), the article explores the mutual constitution of textual bodies and national imaginaries in a globalized cultural context, intertwining the virtual aesthetics of placelessness and rootlessness with concrete legal and financial boundaries. Through the lens of science and technology studies, critical design, and Sinophone studies, this study calls for the concept of virtuality to be brought to bear on new concerns for Sinophone literature beyond online literature and science fiction.
Philology. Linguistics, Chinese language and literature
Modeling the relationships between growth language mindset, L2 grit, and foreign language learning burnout
Xiaojing Hu, Lu Song, Yuan He
While the existing literature acknowledges the existence of foreign language learning burnout (FLLB) and its deleterious effect on English as a foreign language (EFL) learners, the protective factors that may alleviate FLLB in EFL contexts remain largely unexplored. The current study addresses this gap by examining the impact of learners’ growth language mindset (GLM) on FLLB using structural equation modeling (SEM). Furthermore, the mediating role of L2 grit in the relationship between GLM and FLLB was scrutinized. To achieve these objectives, 840 Chinese EFL learners were recruited to complete questionnaires encompassing scales assessing GLM, L2 grit, and FLLB. The results revealed that learners’ GLM and L2 grit were significantly and negatively associated with FLLB. The SEM analysis indicated that GLM exerted a negative predictive effect on EFL learners’ FLLB. Additionally, two components of L2 grit, namely perseverance of effort (PE) and consistency of interest (CI), fully mediated the relationship between GLM and FLLB, with the influence of CI being particularly robust. The research findings offer significant theoretical implications and practical guidance for intervention programs targeting FLLB in language classrooms.
Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use
Yicheng Fu, Zhemin Huang, Liuxin Yang
et al.
Chinese idioms (Chengyu) are concise four-character expressions steeped in history and culture, whose literal translations often fail to capture their full meaning. This complexity makes them challenging for language models to interpret and use correctly. Existing benchmarks focus on narrow tasks - multiple-choice cloze tests, isolated translation, or simple paraphrasing. We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks: (1) Evaluative Connotation, classifying idioms as positive or negative; (2) Appropriateness, detecting incorrect idiom usage in context; and (3) Open Cloze, filling blanks in longer passages without options. Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only ~85% on Appropriateness and ~40% top-1 accuracy on Open Cloze. Error analysis reveals that most mistakes arise from fundamental misunderstandings of idiom meanings. Chengyu-Bench demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage. The benchmark and source code are available at: https://github.com/sofyc/ChengyuBench.
Trust and Trustworthiness from Human-Centered Perspective in HRI -- A Systematic Literature Review
Debora Firmino de Souza, Sonia Sousa, Kadri Kristjuhan-Ling
et al.
The Industry 5.0 transition highlights EU efforts to design intelligent devices that can work alongside humans to enhance human capabilities, a vision in which user preferences and the need to feel safe while collaborating with such systems take priority. This demands a human-centric research vision and requires a societal and educational shift in how we perceive technological advancements. To better understand this perspective, we conducted a systematic literature review focusing on how trust and trustworthiness can be key aspects of supporting this move towards Industry 5.0. This review aims to give an overview of the most common methodologies and measurements and to collect insights about barriers to and facilitators of trustworthy HRI. After a rigorous quality assessment following the Systematic Reviews and Meta-Analyses guidelines, using rigorous inclusion criteria and screening by at least two reviewers, 34 articles were included in the review. The findings underscore the significance of trust and safety as foundational elements for promoting secure and trustworthy human-machine cooperation. They confirm that almost 30% of the reviewed articles do not present a definition of trust, which can be problematic, as this lack of conceptual clarity can undermine research efforts to address the problem from a central perspective. The review highlights that the choice of domain and area of application should influence the choice of methods and approaches to fostering trust in HRI, as those choices can significantly affect user preferences and users' perceptions and assessment of robot capabilities. Additionally, this lack of conceptual clarity can be a potential barrier to fostering trust in HRI, and it explains the sometimes contradictory findings and the varying choice of methods and instruments used to investigate trust in robots and other autonomous systems in the literature.
On Italian spatial prepositions and measure phrases: reconciling the data with theoretical accounts
Ursini Francesco-Alessio, Wu Tong
The goal of this article is to offer new empirical evidence regarding the grammatical and semantic properties of Italian spatial prepositions, and to provide a theoretical account based on this evidence. We show that Italian has four grammatical types of prepositions (simple, complex, contracted and uncontracted), and three semantic types (geometric, projective and region prepositions). By studying the syntactic distribution of prepositions and the phrases they form with measure phrases (e.g., dieci metri ‘ten meters’) we argue that a non-isomorphic (i.e., not one-to-one) relation between grammatical and semantic type emerges. Region and geometric prepositions form phrases that block the presence of measure phrases (e.g., #dieci metri a fianco del muro ‘ten meters beside the wall’), whereas projective prepositions license them (e.g., dieci metri dietro al muro ‘ten meters behind the wall’). We show that previous accounts postulate a type of symmetry that leads to problematic predictions regarding these patterns. We then propose an alternative account based on the Lexical Syntax framework that models the data via a feature-matching mechanism.
A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios
Samuel Ackerman, Ella Rabinovich, Eitan Farchi
et al.
We evaluate the robustness of several large language models on multiple datasets. Robustness here refers to the relative insensitivity of a model's answers to meaning-preserving variants of its input. Benchmark datasets are constructed by introducing naturally occurring, non-malicious perturbations, or by generating semantically equivalent paraphrases of input questions or statements. We further propose a novel metric for assessing a model's robustness, and demonstrate its benefits in the non-adversarial scenario by empirical evaluation of several models on the created datasets.
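The abstract does not define the proposed metric, so the following is only a minimal sketch of the underlying idea (insensitivity of answers to meaning-preserving variants), not the authors' metric. It assumes a simple consistency rate: for each group of prompts, the fraction of variants whose normalized answer equals the answer to the original prompt, averaged over groups. The names `consistency_rate` and `toy_model`, and the example prompts, are hypothetical.

```python
from typing import Callable, Sequence


def consistency_rate(
    model: Callable[[str], str],
    groups: Sequence[Sequence[str]],
) -> float:
    """Average, over prompt groups, of the share of variants whose answer
    matches the answer given to the group's first (original) prompt.

    Illustrative assumption only: exact match after lowercasing and
    stripping whitespace stands in for "same answer".
    """
    scores = []
    for prompts in groups:
        original, variants = prompts[0], prompts[1:]
        if not variants:
            continue  # a group without variants carries no robustness signal
        reference = model(original).strip().lower()
        agree = sum(model(v).strip().lower() == reference for v in variants)
        scores.append(agree / len(variants))
    if not scores:
        raise ValueError("need at least one group containing a variant")
    return sum(scores) / len(scores)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: it only "knows" the answer when the
    # word "capital" appears, so some paraphrases change its output.
    return "Paris" if "capital" in prompt.lower() else "unknown"


groups = [
    [
        "What is the capital of France?",  # original
        "what is the CAPITAL of france",   # casing / punctuation noise
        "France's main city is?",          # paraphrase
    ],
    ["Capital of France?", "Name the capital of France."],
]
score = consistency_rate(toy_model, groups)
print(f"consistency rate: {score:.2f}")
```

A score of 1.0 would mean the model's answers never change under the supplied variants; lower values indicate sensitivity to surface form. Exact match is a deliberately crude equivalence test and would need replacing for free-form answers.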
TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine
Wenjing Yue, Xiaoling Wang, Wei Zhu
et al.
Large language models (LLMs) have performed remarkably well on benchmarks for various natural language processing tasks, including in the Western medical domain. However, professional evaluation benchmarks for LLMs have yet to cover the traditional Chinese medicine (TCM) domain, which has a profound history and vast influence. To address this research gap, we introduce TCM-Bench, a comprehensive benchmark for evaluating LLM performance in TCM. It comprises the TCM-ED dataset, consisting of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis. It covers the core components of TCMLE, including TCM basis and clinical practice. To evaluate LLMs beyond question-answering accuracy, we propose TCMScore, a metric tailored for evaluating the quality of answers generated by LLMs for TCM-related questions. It comprehensively considers the consistency of TCM semantics and knowledge. After conducting comprehensive experimental analyses from diverse perspectives, we obtain the following findings: (1) The unsatisfactory performance of LLMs on this benchmark underscores their significant room for improvement in TCM. (2) Introducing domain knowledge can enhance LLMs' performance. However, for in-domain models like ZhongJing-TCM, the quality of generated analysis text has decreased, and we hypothesize that their fine-tuning process affects the basic LLM capabilities. (3) Traditional metrics for text generation quality like Rouge and BertScore are susceptible to text length and surface semantic ambiguity, while domain-specific metrics such as TCMScore can further supplement and explain their evaluation results. These findings highlight the capabilities and limitations of LLMs in TCM and aim to provide more substantial assistance to medical research.
A Topic-aware Comparable Corpus of Chinese Variations
Da-Chen Lian, Shu-Kai Hsieh
This study aims to fill the gap by constructing a topic-aware comparable corpus of Mainland Chinese Mandarin and Taiwanese Mandarin from the social media in Mainland China and Taiwan, respectively. Using Dcard for Taiwanese Mandarin and Sina Weibo for Mainland Chinese, we create a comparable corpus that updates regularly and reflects modern language use on social media.
A Comparative Study of Indo-Chinese Public Diplomacy in the Arab Countries of Persian Gulf (2001-2021) and Consequences for Islamic Republic of Iran
Majid Reza Momeni, Narges Khangol zadeh Sangroodi
As two emerging Asian powers, India and China have tried to expand their relations with the Arab countries of the Persian Gulf through public diplomacy, which can be a platform for realizing the foreign policy goals of these two countries. This study addresses an important question: what are the differences and similarities between the public diplomacy of India and China in the Arab countries of the Persian Gulf between 2001 and 2021? The hypothesis in response to this question is that the importance attached to the concept and position of public diplomacy for achieving foreign policy goals and gaining more influence in the region, as well as the use of language, literature, and culture, are similarities between the two countries; the methods of conducting public diplomacy, including the film industry, the Confucius Institute, diaspora activity, and the “Belt and Road Initiative”, are the differences in Indo-Chinese public diplomacy. In this study, a comparative method is used to analyze the differences and similarities of public diplomacy between India and China in the Persian Gulf Arab countries. The findings indicate that public diplomacy as a tool of national interests has been a focus of both India and China, and that this point is important for Iran based on the principle of neighborhood policy.
Political science, Political science (General)
Serving the Dead as Serving the Living: Examining the Concept of Burial and Life Consciousness in Medieval China
Wei Wang
In the minds of ancient people, tombs and burials were where the lives of this world ended and another type of life began. By incorporating the concepts of life found in Confucianism, Taoism, Buddhism, and the widespread belief in ghosts and immortals, burial ceremonies evolved during the Wei and Jin 魏晋 dynasties (220–420) into an integrated and unified notion of burial. The funeral ritual’s imaginative and fanciful depictions of the hereafter express sentimental devotion to life and contemplation of death. The burial ceremony and tomb architecture change in accordance with how the concepts of sacrifice and ghosts develop. The features of people’s belief in ghosts and immortality are reflected in particular burial practices. The popularity of necromancy burials and ghost marriages during the Middle Ages (third to sixth centuries) brings to light the binary antagonism between the soul and the body in burial, as well as the emphasis on spiritual freedom and physical immortality in the life philosophy.
Exploring the Usage of Chinese Pinyin in Pretraining
Baojun Wang, Kun Xu, Lifeng Shang
Unlike in alphabetic languages, spelling and pronunciation in Chinese are separate. Both characters and pinyin play an important role in Chinese language understanding. In Chinese NLP tasks, characters or words are almost always adopted as model input, and few works study how to use pinyin. However, pinyin is essential in many scenarios, such as error correction and fault tolerance for ASR-introduced errors. Most of these errors are caused by words with the same or similar pronunciation, and we refer to this type of error as SSP (the same or similar pronunciation) errors for short. In this work, we explore various ways of using pinyin in pretraining models and propose a new pretraining method called PmBERT. Our method uses characters and pinyin in parallel for pretraining. Through carefully designed pretraining tasks, the character and pinyin representations are fused, which can enhance the error tolerance for SSP errors. We conduct comprehensive experiments and ablation tests to explore what makes a robust phonetically enhanced Chinese language model. The experimental results on both the constructed noise-added dataset and the public error-correction dataset demonstrate that our model is more robust than SOTA models.
CMB: A Comprehensive Medical Benchmark in Chinese
Xidong Wang, Guiming Hardy Chen, Dingjie Song
et al.
Large Language Models (LLMs) offer the possibility of a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone for measuring progress. However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in contextual incongruities for a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark provides first-hand experience with existing LLMs for medicine and also facilitates the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at https://github.com/FreedomIntelligence/CMB.
Unveiling Safety Vulnerabilities of Large Language Models
George Kour, Marcel Zalmanovici, Naama Zwerdling
et al.
As large language models become more prevalent, their possible harmful or inappropriate responses are a cause for concern. This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ, designed to provoke such harmful or inappropriate responses. We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions - input semantic areas for which the model is likely to produce harmful outputs. This is achieved through the application of specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses, facilitating targeted improvements to its safety mechanisms and overall reliability.
The Ten Articles of Understanding Karate
Gábor Richárd Gottner
Karate is one of the most popular martial arts in the world, yet it has no detailed, documented history reaching back centuries. Public instruction in Okinawan karate, which until then had been practised mostly in secret, was begun in the first half of the twentieth century by the recognized masters of the era, and accordingly the first written sources and works of specialist literature also date from this period. One of the earliest source documents is Itosu Ankō's letter, which he submitted to the Okinawa Prefectural Office of Education with the aim of making karate a school subject. It is one of the first surviving documents, if not the very first, on the history, purpose, and training methodology of karate. The present study sets out to provide a Hungarian philological translation and analysis of this letter, examining the influence of classical literature on its author and placing the content of the text in context from historical, philosophical, sport-pedagogical, health-related, and psychological perspectives.
Chinese language and literature
Difficulties in teaching the second foreign (English) language to students studying oriental language as their major at NEFU, Yakutia, Russia
Ivanova Aleksandra
The Far Eastern Federal District of Russia, including the Republic of Sakha (Yakutia), is located in close proximity to the Asia-Pacific region, which explains the demand for specialists who know oriental languages. One of the oriental languages (Japanese, Chinese or Korean) is studied at the leading university of the republic, and English is the language of business communication. In secondary educational institutions of Yakutia, English is the first foreign language, and students study it as a second foreign language at the university. The goal of this study is to identify the main difficulties of learning English as a second foreign language for bilingual students studying an oriental language (Japanese, Chinese or Korean) as their major. The study analyzed domestic and foreign literature, the professional educational programs of the North-Eastern Federal University in Yakutsk, and the characteristics of students from the indigenous population of Yakutia. It was substantiated that the more difficulties there are in mastering the subject, the stricter the requirements for mastering the educational material.
Chronicle
The Editorial Board
A chronicle of the important events held and the results achieved in 2022 at the ELTE BTK Institute of East Asian Studies (Távol-keleti Intézet).
Chinese language and literature
Reinforcement Learning and Bandits for Speech and Language Processing: Tutorial, Review and Outlook
Baihan Lin
In recent years, reinforcement learning and bandits have transformed a wide range of real-world applications including healthcare, finance, recommendation systems, robotics, and, last but not least, speech and natural language processing. While most speech and language applications of reinforcement learning algorithms are centered around improving the training of deep neural networks with its flexible optimization properties, there is still much ground to explore in utilizing the benefits of reinforcement learning, such as its reward-driven adaptability, state representations, temporal structures and generalizability. In this survey, we present an overview of recent advancements in reinforcement learning and bandits, and discuss how they can be effectively employed to solve speech and natural language processing problems with models that are adaptive, interactive and scalable.
CCPM: A Chinese Classical Poetry Matching Dataset
Wenhao Li, Fanchao Qi, Maosong Sun
et al.
Poetry is one of the most important art forms of human languages. Recently many studies have focused on incorporating some linguistic features of poetry, such as style and sentiment, into its understanding or generation system. However, there is no focus on understanding or evaluating the semantics of poetry. Therefore, we propose a novel task to assess a model's semantic understanding of poetry by poem matching. Specifically, this task requires the model to select one line of Chinese classical poetry among four candidates according to the modern Chinese translation of a line of poetry. To construct this dataset, we first obtain a set of parallel data of Chinese classical poetry and modern Chinese translation. Then we retrieve similar lines of poetry with the lines in a poetry corpus as negative choices. We name the dataset Chinese Classical Poetry Matching Dataset (CCPM) and release it at https://github.com/THUNLP-AIPoet/CCPM. We hope this dataset can further enhance the study on incorporating deep semantics into the understanding and generation system of Chinese classical poetry. We also preliminarily run two variants of BERT on this dataset as the baselines for this dataset.
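The task format described above (one modern translation, four candidate lines, negatives retrieved by similarity to the correct line) can be sketched as follows. This is an illustrative sketch, not the CCPM construction code: the abstract does not say which similarity measure is used for retrieval, so character-level Jaccard overlap is assumed here, and `Item`, `build_item`, `accuracy`, and the tiny corpus are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Item:
    translation: str    # modern Chinese rendering of the gold line
    choices: list[str]  # four candidate classical lines
    answer: int         # index of the gold line within `choices`


def char_jaccard(a: str, b: str) -> float:
    """Character-set overlap; an assumed stand-in for the retrieval score."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def build_item(translation: str, gold: str, corpus: list[str],
               rng: random.Random, n_neg: int = 3) -> Item:
    """Pick the n_neg corpus lines most similar to the gold line as negatives."""
    candidates = [line for line in corpus if line != gold]
    if len(candidates) < n_neg:
        raise ValueError("corpus too small to draw negatives")
    candidates.sort(key=lambda line: char_jaccard(line, gold), reverse=True)
    choices = candidates[:n_neg] + [gold]
    rng.shuffle(choices)  # hide the gold position
    return Item(translation, choices, choices.index(gold))


def accuracy(items: list[Item], predict) -> float:
    """Share of items where predict(translation, choices) returns the gold index."""
    return sum(predict(it.translation, it.choices) == it.answer
               for it in items) / len(items)


# Toy corpus of well-known lines, used purely for illustration.
corpus = ["床前明月光", "疑是地上霜", "举头望明月", "低头思故乡",
          "春眠不觉晓", "处处闻啼鸟"]
gold = "举头望明月"
item = build_item("抬起头来望着天上的明月", gold, corpus, random.Random(0))
print(item.choices, item.answer)
```

Choosing negatives that resemble the gold line makes the item harder than random distractors would, since surface overlap alone no longer separates the candidates.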