arXiv Open Access 2025

HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

James V. Roggeveen, Erik Y. Wang, Will Flintoft, Peter Donets, Lucy S. Nathwani, +40 others

Abstract

Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems that have exact analytical solutions or involve formal proofs, often overlooking approximation-based problems ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present HARDMath2, a dataset of 211 original problems covering the core topics in an introductory graduate applied math class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. This dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent with the class syllabus, peer-validate solutions, test different models, and automatically check LLM-generated solutions against their own answers and numerical ground truths. Evaluation results show that leading frontier models still struggle with many of the problems in the dataset, highlighting a gap in the mathematical reasoning skills of current LLMs. Importantly, students identified strategies to create increasingly difficult problems by interacting with the models and exploiting common failure modes. This back-and-forth with the models not only resulted in a richer and more challenging benchmark but also led to qualitative improvements in the students' understanding of the course material, which is increasingly important as we enter an age where state-of-the-art language models can solve many challenging problems across a wide domain of fields.

Authors (45)

James V. Roggeveen
Erik Y. Wang
Will Flintoft
Peter Donets
Lucy S. Nathwani
Nickholas Gutierrez
David Ettel
Anton Marius Graf
Siddharth Dandavate
Arjun Nageswaran
Raglan Ward
Ava Williamson
Anne Mykland
Kacper K. Migacz
Yijun Wang
Egemen Bostan
Duy Thuc Nguyen
Zhe He
Marc L. Descoteaux
Felix Yeung
Shida Liu
Jorge García Ponce
Luke Zhu
Yuyang Chen
Ekaterina S. Ivshina
Miguel Fernandez
Minjae Kim
Kennan Gumbs
Matthew Scott Tan
Russell Yang
Mai Hoang
David Brown
Isabella A. Silveira
Lavon Sykes
Ahmed Roman
William Fredenberg
Yiming Chen
Lucas Martin
Yixing Tang
Kelly Werker Smith
Hongyu Liao
Logan G. Wilson
Alexander Dazhen Cai
Andrea Elizabeth Biju
Michael P. Brenner

Citation Format

Roggeveen, J.V., Wang, E.Y., Flintoft, W., Donets, P., Nathwani, L.S., Gutierrez, N. et al. (2025). HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class. https://arxiv.org/abs/2505.11774

Journal Information
Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓