arXiv Open Access 2023

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, +16 others

Abstract

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

Topics & Keywords

cs.LG cs.AI cs.CL cs.CV cs.CY

Authors (21)

Andy Zou

Long Phan

Sarah Chen

James Campbell

Phillip Guo

Richard Ren

Alexander Pan

Xuwang Yin

Mantas Mazeika

Ann-Kathrin Dombrowski

Shashwat Goel

Nathaniel Li

Michael J. Byun

Zifan Wang

Alex Mallen

Steven Basart

Sanmi Koyejo

Dawn Song

Matt Fredrikson

J. Zico Kolter

Dan Hendrycks

Citation Format (APA)

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R. et al. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. https://arxiv.org/abs/2310.01405

Journal Information

Publication Year: 2023
Language: en
Source Database: arXiv
Access: Open Access ✓