arXiv Open Access 2025

Superalignment with Dynamic Human Values

Florian Mai, David Kaczér, Nicholas Kluge Corrêa, Lucie Flek

Abstract

Two core challenges of alignment are 1) scalable oversight and 2) accounting for the dynamic nature of human values. While solutions like recursive reward modeling address 1), they do not simultaneously account for 2). We sketch a roadmap for a novel algorithmic framework that trains a superhuman reasoning model to decompose complex tasks into subtasks that are still amenable to human-level guidance. Our approach relies on what we call the part-to-complete generalization hypothesis, which states that the alignment of subtask solutions generalizes to the alignment of complete solutions. We advocate for the need to measure this generalization and propose ways to improve it in the future.

Topics & Keywords

cs.AI

Citation

Mai, F., Kaczér, D., Corrêa, N. K., & Flek, L. (2025). Superalignment with Dynamic Human Values. https://arxiv.org/abs/2503.13621

Publication Information

Publication Year: 2025
Language: en
Source Database: arXiv
Access: Open Access ✓