arXiv Open Access 2025

The Ever-Evolving Science Exam

Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, +7 more

Abstract

As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad Range, wide Reach, and high Rigor, yet they often face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor; and 2) a periodically updated 500-instance subset, EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models across scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: https://github.com/aiben-ch/EESE.
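The periodic 500-instance sampling described in the abstract can be sketched as follows. This is a minimal illustration only: the function name, pool format, and seeding scheme are hypothetical, since EESE-Pool and its actual pipeline are not public. The idea shown is stratified sampling across disciplines with a per-release seed, so each release draws a fresh, reproducible subset (a simple leakage-mitigation scheme).

```python
import random
from collections import defaultdict

def sample_eese_subset(pool, release_tag, subset_size=500):
    """Stratified sample of QA instances, one fresh draw per release.

    `pool` is a list of dicts with a 'discipline' key; `release_tag`
    (e.g. "2025-Q3") seeds the RNG so each release yields a different
    but reproducible subset.
    """
    rng = random.Random(release_tag)
    by_discipline = defaultdict(list)
    for inst in pool:
        by_discipline[inst["discipline"]].append(inst)
    # Allocate slots to each discipline in proportion to its pool size.
    subset = []
    total = len(pool)
    for discipline, items in sorted(by_discipline.items()):
        k = max(1, round(subset_size * len(items) / total))
        subset.extend(rng.sample(items, min(k, len(items))))
    return subset[:subset_size]
```

Rotating the seed per release means a model that memorized a previous release gains little on the next one, while keeping each draw auditable.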


Authors (12)

Junying Wang
Zicheng Zhang
Yijin Guo
Farong Wen
Ye Shen
Yingji Liang
Yalun Wu
Wenzhe Li
Chunyi Li
Zijian Chen
Qi Jia
Guangtao Zhai

Citation Format

Wang, J., Zhang, Z., Guo, Y., Wen, F., Shen, Y., Liang, Y. et al. (2025). The Ever-Evolving Science Exam. https://arxiv.org/abs/2507.16514

Journal Information
Year Published: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓