
The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, +7 others

Abstract

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets—both existing and newly constructed—many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
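The "untuned performance" evaluation mentioned in the abstract amounts to measuring language-modeling loss (perplexity) of off-the-shelf models on each Pile component. The sketch below is not the paper's own evaluation code; it assumes the Hugging Face transformers library and a placeholder passage, and simply shows how a perplexity number for GPT-2 on one document could be obtained.

```python
# Minimal sketch (assumption: Hugging Face transformers is installed; this is
# not the Pile paper's evaluation harness) of measuring GPT-2 perplexity on a
# single passage, the quantity reported per Pile component in the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model_name = "gpt2"  # placeholder checkpoint; any causal LM would work similarly
tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

# Hypothetical example text standing in for a document from one Pile subset.
text = "An example passage of academic writing drawn from one Pile component."
encodings = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels equal to input_ids makes the model return the mean
    # next-token cross-entropy loss over the sequence.
    outputs = model(encodings.input_ids, labels=encodings.input_ids)

perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")
```

Averaging such losses over all documents in a subset would give a per-component score comparable in spirit to the paper's reported results, though the paper uses its own tooling and normalization.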


Authors (12)

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy

Citation Format

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C. et al. (2020). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. https://www.semanticscholar.org/paper/db1afe3b3cd4cd90e41fbba65d3075dd5aebb61e

Journal Information

Publication Year: 2020
Language: English (en)
Total Citations: 2,707
Source Database: Semantic Scholar
Access: Open Access ✓