arXiv Open Access 2026

Agent Benchmarks Fail Public Sector Requirements

Jonathan Rystrøm Chris Schmitz Karolina Korgul Jan Batzner Chris Russell

Abstract

Deploying Large Language Model-based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to ensure they adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of the public administration literature: benchmarks must be process-based, realistic, and public-sector-specific, and must report metrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers against these criteria using an expert-validated, LLM-assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings are a call to action both for researchers, to develop public-sector-relevant benchmarks, and for public-sector officials, to apply these criteria when evaluating their own agentic use cases.
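The abstract describes screening benchmark papers against four criteria (process-based, realistic, public-sector-specific, and public-sector metrics) with an LLM-assisted pipeline, but the page does not reproduce that pipeline. As a rough, hypothetical sketch only, the Python snippet below shows one way such a criteria screen could be structured; the `judge` function is a placeholder for the actual LLM call, and only the four criterion names are taken from the abstract.

```python
from dataclasses import dataclass

# The four criteria named in the abstract.
CRITERIA = (
    "process-based",
    "realistic",
    "public-sector-specific",
    "public-sector metrics",
)


@dataclass
class Screening:
    """Per-paper result of the criteria screen."""
    title: str
    verdicts: dict[str, bool]

    @property
    def meets_all(self) -> bool:
        # A benchmark passes only if every criterion is satisfied.
        return all(self.verdicts.get(c, False) for c in CRITERIA)


def judge(abstract: str, criterion: str) -> bool:
    """Placeholder for the LLM judgment (hypothetical).

    A real pipeline would prompt a model with the paper text and a
    definition of the criterion, then parse a yes/no verdict. Here we
    use a naive keyword check so the sketch runs end to end.
    """
    return criterion.split("-")[0] in abstract.lower()


def screen(title: str, abstract: str) -> Screening:
    """Screen one paper against all four criteria."""
    return Screening(title, {c: judge(abstract, c) for c in CRITERIA})


if __name__ == "__main__":
    result = screen(
        "Example Benchmark",
        "A realistic, process-based benchmark for office workflows.",
    )
    print(result.meets_all, result.verdicts)
```

In an actual pipeline, the per-criterion verdicts would additionally be validated against expert annotations on a sample of papers, as the abstract's "expert-validated" qualifier suggests.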


Authors (5)

Jonathan Rystrøm
Chris Schmitz
Karolina Korgul
Jan Batzner
Chris Russell

Citation Format

Rystrøm, J., Schmitz, C., Korgul, K., Batzner, J., & Russell, C. (2026). Agent Benchmarks Fail Public Sector Requirements. arXiv. https://arxiv.org/abs/2601.20617

Journal Information

Year Published: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓