arXiv Open Access 2025

HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

James V. Roggeveen, Erik Y. Wang, Will Flintoft, Peter Donets, Lucy S. Nathwani, +40 others

Abstract

Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems that have exact analytical solutions or involve formal proofs, often overlooking approximation-based problems ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present HARDMath2, a dataset of 211 original problems covering the core topics in an introductory graduate applied math class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. This dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent with the class syllabus, peer-validate solutions, test different models, and automatically check LLM-generated solutions against their own answers and numerical ground truths. Evaluation results show that leading frontier models still struggle with many of the problems in the dataset, highlighting a gap in the mathematical reasoning skills of current LLMs. Importantly, students identified strategies to create increasingly difficult problems by interacting with the models and exploiting common failure modes. This back-and-forth with the models not only resulted in a richer and more challenging benchmark but also led to qualitative improvements in the students' understanding of the course material, which is increasingly important as we enter an age where state-of-the-art language models can solve many challenging problems across a wide range of fields.

Topics & Keywords

cs.LG cs.AI

Authors (45)

James V. Roggeveen

Erik Y. Wang

Will Flintoft

Peter Donets

Lucy S. Nathwani

Nickholas Gutierrez

David Ettel

Anton Marius Graf

Siddharth Dandavate

Arjun Nageswaran

Raglan Ward

Ava Williamson

Anne Mykland

Kacper K. Migacz

Yijun Wang

Egemen Bostan

Duy Thuc Nguyen

Zhe He

Marc L. Descoteaux

Felix Yeung

Shida Liu

Jorge García Ponce

Luke Zhu

Yuyang Chen

Ekaterina S. Ivshina

Miguel Fernandez

Minjae Kim

Kennan Gumbs

Matthew Scott Tan

Russell Yang

Mai Hoang

David Brown

Isabella A. Silveira

Lavon Sykes

Ahmed Roman

William Fredenberg

Yiming Chen

Lucas Martin

Yixing Tang

Kelly Werker Smith

Hongyu Liao

Logan G. Wilson

Alexander Dazhen Cai

Andrea Elizabeth Biju

Michael P. Brenner

Citation Format

Roggeveen, J.V., Wang, E.Y., Flintoft, W., Donets, P., Nathwani, L.S., Gutierrez, N. et al. (2025). HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class. https://arxiv.org/abs/2505.11774

Journal Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓