arXiv Open Access 2026

MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

Simon Rosen, Siddarth Singh, Ebenezer Gelo, Helen Sarah Robertson, Ibrahim Suder, +4 others

Abstract

Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a novel Morality Metric, MoralityGym allows the integration of insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.
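
To make the idea of norms as ordered deontic constraints more concrete, here is a minimal Python sketch. It is not the paper's definition of Morality Chains or of the Morality Metric. It assumes that a chain is a priority-ordered list of norms, that each norm is a predicate flagging a violation in one trajectory step, and that trajectories are compared lexicographically by their per-norm violation counts. The names `Norm`, `violation_profile`, and `prefer`, as well as the `harm` and `damage` step flags, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical step record: event flags logged while the agent acts.
Step = Dict[str, bool]


@dataclass(frozen=True)
class Norm:
    """One deontic constraint (assumed form: a per-step violation test)."""
    name: str
    violated: Callable[[Step], bool]


def violation_profile(chain: Sequence[Norm], trajectory: Sequence[Step]) -> Tuple[int, ...]:
    """Count violations per norm, ordered from highest to lowest priority."""
    return tuple(
        sum(1 for step in trajectory if norm.violated(step))
        for norm in chain
    )


def prefer(chain: Sequence[Norm], a: Sequence[Step], b: Sequence[Step]) -> Sequence[Step]:
    """Return the trajectory whose profile is lexicographically smaller.

    Tuple comparison in Python is lexicographic, so a single violation of a
    higher-priority norm outweighs any number of lower-priority violations.
    """
    return a if violation_profile(chain, a) <= violation_profile(chain, b) else b


# Assumed two-level hierarchy: avoiding harm outranks avoiding property damage.
chain = [
    Norm("do_not_harm", lambda s: s.get("harm", False)),
    Norm("do_not_damage_property", lambda s: s.get("damage", False)),
]

# Toy trajectories for a trolley-style dilemma (illustrative data only).
pull_lever = [{"damage": True}, {"damage": True}]
do_nothing = [{"harm": True}]

profile_pull = violation_profile(chain, pull_lever)
profile_nothing = violation_profile(chain, do_nothing)
best = prefer(chain, pull_lever, do_nothing)
```

Because the profile is computed from the logged trajectory rather than from the task reward, this kind of evaluation stays separate from task-solving, which mirrors the decoupling the abstract describes. Under the assumed ordering, `pull_lever` is preferred even though it accumulates more violations in total, since none of them touch the top-priority norm.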

Topics & Keywords

cs.AI cs.LG

Authors (9)

Simon Rosen

Siddarth Singh

Ebenezer Gelo

Helen Sarah Robertson

Ibrahim Suder

Victoria Williams

Benjamin Rosman

Geraud Nangue Tasse

Steven James

Citation Format

Rosen, S., Singh, S., Gelo, E., Robertson, H. S., Suder, I., Williams, V., et al. (2026). MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents. https://arxiv.org/abs/2602.13372

Journal Information

Publication Year: 2026
Language: en
Source Database: arXiv
Access: Open Access ✓