arXiv Open Access 2024

E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models

Jinchang Hou Chang Ao Haihong Wu Xiangtao Kong Zhigang Zheng +6 lainnya

Lihat Sumber

Abstrak

With the accelerating development of Large Language Models (LLMs), many LLMs are beginning to be used in the Chinese K-12 education domain. The integration of LLMs and education is getting closer and closer, however, there is currently no benchmark for evaluating LLMs that focuses on the Chinese K-12 education domain. Therefore, there is an urgent need for a comprehensive natural language processing benchmark to accurately assess the capabilities of various LLMs in the Chinese K-12 education domain. To address this, we introduce the E-EVAL, the first comprehensive evaluation benchmark specifically designed for the Chinese K-12 education field. The E-EVAL consists of 4,351 multiple-choice questions at the primary, middle, and high school levels across a wide range of subjects, including Chinese, English, Politics, History, Ethics, Physics, Chemistry, Mathematics, and Geography. We conducted a comprehensive evaluation of E-EVAL on advanced LLMs, including both English-dominant and Chinese-dominant models. Findings show that Chinese-dominant models perform well compared to English-dominant models, with many scoring even above the GPT 4.0. However, almost all models perform poorly in complex subjects such as mathematics. We also found that most Chinese-dominant LLMs did not achieve higher scores at the primary school level compared to the middle school level. We observe that the mastery of higher-order knowledge by the model does not necessarily imply the mastery of lower-order knowledge as well. Additionally, the experimental results indicate that the Chain of Thought (CoT) technique is effective only for the challenging science subjects, while Few-shot prompting is more beneficial for liberal arts subjects. With E-EVAL, we aim to analyze the strengths and limitations of LLMs in educational applications, and to contribute to the progress and development of Chinese K-12 education and LLMs.

Topik & Kata Kunci

cs.CL

Penulis (11)

Jinchang Hou

Chang Ao

Haihong Wu

Xiangtao Kong

Zhigang Zheng

Daijia Tang

Chengming Li

Xiping Hu

Ruifeng Xu

Shiwen Ni

Min Yang

Format Sitasi

APA MLA BibTeX

Hou, J., Ao, C., Wu, H., Kong, X., Zheng, Z., Tang, D. et al. (2024). E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models. https://arxiv.org/abs/2401.15927

Akses Cepat

Lihat di Sumber

Informasi Jurnal

Tahun Terbit: 2024
Bahasa: en
Sumber Database: arXiv
Akses: Open Access ✓