arXiv Open Access 2026

MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

Simon Rosen, Siddarth Singh, Ebenezer Gelo, Helen Sarah Robertson, Ibrahim Suder, +4 others

Abstract

Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a novel Morality Metric, MoralityGym allows the integration of insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.

Authors (9)

Simon Rosen
Siddarth Singh
Ebenezer Gelo
Helen Sarah Robertson
Ibrahim Suder
Victoria Williams
Benjamin Rosman
Geraud Nangue Tasse
Steven James

Citation Format

Rosen, S., Singh, S., Gelo, E., Robertson, H.S., Suder, I., Williams, V. et al. (2026). MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents. https://arxiv.org/abs/2602.13372

Journal Information
Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access