ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding
Abstrak
While large language models (LLMs) excel at many domain-specific tasks, their ability to deeply comprehend and reason about full-length academic papers remains underexplored. Existing benchmarks often fall short of capturing such depth, either due to surface-level question design or unreliable evaluation metrics. To address this gap, we introduce ELAIPBench, a benchmark curated by domain experts to evaluate LLMs' comprehension of artificial intelligence (AI) research papers. Developed through an incentive-driven, adversarial annotation process, ELAIPBench features 403 multiple-choice questions from 137 papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Our experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance. Moreover, we observe that frontier LLMs equipped with a thinking mode or a retrieval-augmented generation (RAG) system fail to improve final results-even harming accuracy due to overthinking or noisy retrieval. These findings underscore the significant gap between current LLM capabilities and genuine comprehension of academic papers.
Topik & Kata Kunci
Penulis (9)
Xinbang Dai
Huikang Hu
Yongrui Chen
Jiaqi Li
Rihui Jin
Yuyang Zhang
Xiaoguang Li
Lifeng Shang
Guilin Qi
Akses Cepat
- Tahun Terbit
- 2025
- Bahasa
- en
- Sumber Database
- arXiv
- Akses
- Open Access ✓