arXiv Open Access 2025

The Ever-Evolving Science Exam

Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, +7 more

Abstract

As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad Range, wide Reach, and high Rigor, yet they often face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor; and 2) a periodically updated 500-instance subset, EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models across scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.
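The periodic 500-instance sampling described in the abstract can be sketched as follows. This is a minimal illustration only: the function name, pool format, and seeding scheme are hypothetical, since EESE-Pool and its actual pipeline are not public. The idea shown is stratified sampling across disciplines with a per-release seed, so each release draws a fresh, reproducible subset (a simple leakage-mitigation scheme).

```python
import random
from collections import defaultdict

def sample_eese_subset(pool, release_tag, subset_size=500):
    """Stratified sample of QA instances, one fresh draw per release.

    `pool` is a list of dicts with a 'discipline' key; `release_tag`
    (e.g. "2025-Q3") seeds the RNG so each release yields a different
    but reproducible subset.
    """
    rng = random.Random(release_tag)
    by_discipline = defaultdict(list)
    for inst in pool:
        by_discipline[inst["discipline"]].append(inst)
    # Allocate slots to each discipline in proportion to its pool size.
    subset = []
    total = len(pool)
    for discipline, items in sorted(by_discipline.items()):
        k = max(1, round(subset_size * len(items) / total))
        subset.extend(rng.sample(items, min(k, len(items))))
    return subset[:subset_size]
```

Rotating the seed per release means a model that memorized a previous release gains little on the next one, while keeping each draw auditable.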


Authors (12)

Junying Wang
Zicheng Zhang
Yijin Guo
Farong Wen
Ye Shen
Yingji Liang
Yalun Wu
Wenzhe Li
Chunyi Li
Zijian Chen
Qi Jia
Guangtao Zhai

Citation Format

Wang, J., Zhang, Z., Guo, Y., Wen, F., Shen, Y., Liang, Y. et al. (2025). The Ever-Evolving Science Exam. https://arxiv.org/abs/2507.16514

Journal Information
Year Published: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓