arXiv Open Access 2025

ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding

Xinbang Dai, Huikang Hu, Yongrui Chen, Jiaqi Li, Rihui Jin, +4 more

Abstract

While large language models (LLMs) excel at many domain-specific tasks, their ability to deeply comprehend and reason about full-length academic papers remains underexplored. Existing benchmarks often fall short of capturing such depth, either due to surface-level question design or unreliable evaluation metrics. To address this gap, we introduce ELAIPBench, a benchmark curated by domain experts to evaluate LLMs' comprehension of artificial intelligence (AI) research papers. Developed through an incentive-driven, adversarial annotation process, ELAIPBench features 403 multiple-choice questions from 137 papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Our experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance. Moreover, we observe that frontier LLMs equipped with a thinking mode or a retrieval-augmented generation (RAG) system fail to improve final results, and can even harm accuracy due to overthinking or noisy retrieval. These findings underscore the significant gap between current LLM capabilities and genuine comprehension of academic papers.


Authors (9)

Xinbang Dai
Huikang Hu
Yongrui Chen
Jiaqi Li
Rihui Jin
Yuyang Zhang
Xiaoguang Li
Lifeng Shang
Guilin Qi

Citation Format

Dai, X., Hu, H., Chen, Y., Li, J., Jin, R., Zhang, Y. et al. (2025). ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding. https://arxiv.org/abs/2510.10549

Journal Information

Year Published: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓