arXiv Open Access 2026

Agent Benchmarks Fail Public Sector Requirements

Jonathan Rystrøm Chris Schmitz Karolina Korgul Jan Batzner Chris Russell

Abstract

Deploying Large Language Model-based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to ensure they adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of the public administration literature: benchmarks must be process-based, realistic, and public-sector-specific, and must report metrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers against these criteria using an expert-validated, LLM-assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings are a call to action both for researchers, to develop public-sector-relevant benchmarks, and for public-sector officials, to apply these criteria when evaluating their own agentic use cases.
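The abstract describes screening benchmark papers against four criteria (process-based, realistic, public-sector-specific, and public-sector metrics) with an LLM-assisted pipeline, but the page does not reproduce that pipeline. As a rough, hypothetical sketch only, the Python snippet below shows one way such a criteria screen could be structured; the `judge` function is a placeholder for the actual LLM call, and only the four criterion names are taken from the abstract.

```python
from dataclasses import dataclass

# The four criteria named in the abstract.
CRITERIA = (
    "process-based",
    "realistic",
    "public-sector-specific",
    "public-sector metrics",
)


@dataclass
class Screening:
    """Per-paper result of the criteria screen."""
    title: str
    verdicts: dict[str, bool]

    @property
    def meets_all(self) -> bool:
        # A benchmark passes only if every criterion is satisfied.
        return all(self.verdicts.get(c, False) for c in CRITERIA)


def judge(abstract: str, criterion: str) -> bool:
    """Placeholder for the LLM judgment (hypothetical).

    A real pipeline would prompt a model with the paper text and a
    definition of the criterion, then parse a yes/no verdict. Here we
    use a naive keyword check so the sketch runs end to end.
    """
    return criterion.split("-")[0] in abstract.lower()


def screen(title: str, abstract: str) -> Screening:
    """Screen one paper against all four criteria."""
    return Screening(title, {c: judge(abstract, c) for c in CRITERIA})


if __name__ == "__main__":
    result = screen(
        "Example Benchmark",
        "A realistic, process-based benchmark for office workflows.",
    )
    print(result.meets_all, result.verdicts)
```

In an actual pipeline, the per-criterion verdicts would additionally be validated against expert annotations on a sample of papers, as the abstract's "expert-validated" qualifier suggests.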


Authors (5)

Jonathan Rystrøm
Chris Schmitz
Karolina Korgul
Jan Batzner
Chris Russell

Citation Format

Rystrøm, J., Schmitz, C., Korgul, K., Batzner, J., & Russell, C. (2026). Agent Benchmarks Fail Public Sector Requirements. arXiv. https://arxiv.org/abs/2601.20617

Journal Information

Year Published: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓