Semantic Scholar · Open Access · 2020 · 858 citations

Aligning AI With Shared Human Values

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, J. Li, +2 others

Abstract

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete understanding of basic ethical knowledge. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.

Authors (7)

Dan Hendrycks
Collin Burns
Steven Basart
Andrew Critch
J. Li
D. Song
J. Steinhardt

Citation Format

Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D. et al. (2020). Aligning AI With Shared Human Values. https://www.semanticscholar.org/paper/65906e6027246ae9e4ecd18d6e019a24505c842e

Journal Information
Year of Publication: 2020
Language: en
Total Citations: 858
Source Database: Semantic Scholar
Access: Open Access