Aligning AI With Shared Human Values
Abstract
We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete understanding of basic ethical knowledge. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
Authors (7)
Dan Hendrycks
Collin Burns
Steven Basart
Andrew Critch
J. Li
D. Song
J. Steinhardt
- Publication year: 2020
- Language: English
- Total citations: 858
- Source database: Semantic Scholar
- Access: Open Access