arXiv Open Access 2024

Data Engineering for Scaling Language Models to 128K Context

Yao Fu · Rameswar Panda · Xinyao Niu · Xiang Yue · Hannaneh Hajishirzi · Yoon Kim · Hao Peng

Abstract

We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training (e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naively upsampling longer data on certain domains like books, a common practice of existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
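The abstract's key data-engineering point, upsampling long documents within each domain rather than letting long-heavy domains like books dominate the mixture, can be illustrated with a minimal sketch. The function name, length threshold, and upsampling factor below are illustrative assumptions, not the paper's actual implementation:

```python
def build_mixture(corpus, long_threshold=4096, upsample_factor=4):
    """Per-domain length upsampling that preserves domain balance (sketch).

    corpus: dict mapping domain name -> list of document lengths (in tokens).
    Long documents are repeated `upsample_factor` times *within* their own
    domain, so the relative share of each domain in the mixture is decided
    separately from the long/short ratio inside it.
    """
    mixture = {}
    for domain, lengths in corpus.items():
        resampled = []
        for n in lengths:
            # Repeat long documents; keep short ones at their original rate.
            copies = upsample_factor if n >= long_threshold else 1
            resampled.extend([n] * copies)
        mixture[domain] = resampled
    return mixture

# Toy corpus: books skew long, web text skews short.
corpus = {
    "books": [8192, 2048, 16384],
    "web":   [512, 1024, 4096, 800],
}
mixture = build_mixture(corpus)
```

The contrast with the "naive" strategy the abstract criticizes is that upsampling happens inside each domain, so web text is not crowded out merely because books contain most of the long documents.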

Topics & Keywords

Authors (7)

Yao Fu
Rameswar Panda
Xinyao Niu
Xiang Yue
Hannaneh Hajishirzi
Yoon Kim
Hao Peng

Citation Format

Fu, Y., Panda, R., Niu, X., Yue, X., Hajishirzi, H., Kim, Y. et al. (2024). Data Engineering for Scaling Language Models to 128K Context. https://arxiv.org/abs/2402.10171

Journal Information
Publication Year
2024
Language
en
Source Database
arXiv
Access
Open Access ✓