arXiv Open Access 2023

Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4

Kent K. Chang, Mackenzie Cramer, Sandeep Soni, David Bamman

Abstract

In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform much better on memorized books than on non-memorized books for downstream tasks. We argue that this supports a case for open models whose training data is known.
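The abstract names a "name cloze membership inference query" without spelling it out. The following is a minimal sketch, assuming the query masks a single proper name in a short passage, asks the model to fill it in, and treats per-book accuracy as a rough proxy for memorization. The function names `make_cloze` and `name_cloze_accuracy`, the `[MASK]` token, and the `predict` callable (a stand-in for whatever model is being probed) are illustrative assumptions, not the authors' code.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

MASK = "[MASK]"  # hypothetical mask token, chosen for illustration


def make_cloze(passage: str, name: str) -> str:
    """Replace the single occurrence of `name` in `passage` with MASK."""
    pattern = r"\b" + re.escape(name) + r"\b"
    if len(re.findall(pattern, passage)) != 1:
        raise ValueError("passage must contain the name exactly once")
    return re.sub(pattern, MASK, passage)


def name_cloze_accuracy(
    examples: Iterable[Tuple[str, str, str]],
    predict: Callable[[str], str],
) -> Dict[str, float]:
    """Return per-book accuracy for (book, passage, name) examples.

    `predict` takes a masked passage and returns the guessed name.
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for book, passage, name in examples:
        guess = predict(make_cloze(passage, name))
        totals[book] += 1
        # Case-insensitive exact match on the guessed name.
        hits[book] += int(guess.strip().lower() == name.lower())
    return {book: hits[book] / totals[book] for book in totals}
```

In use, `predict` would wrap a call to the model under study; a book with high accuracy under this sketch would be a candidate for having been seen during training, subject to whatever baseline the analysis adopts.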

Authors (4)

Kent K. Chang

Mackenzie Cramer

Sandeep Soni

David Bamman

Citation Format

Chang, K.K., Cramer, M., Soni, S., Bamman, D. (2023). Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4. https://arxiv.org/abs/2305.00118

Journal Information
Publication Year: 2023
Language: en
Source Database: arXiv
Access: Open Access ✓