arXiv Open Access 2023

Data Management For Training Large Language Models: A Survey

Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, +4 others

Abstract

Data plays a fundamental role in training Large Language Models (LLMs). Efficient data management, particularly the formulation of a well-suited training dataset, is important for enhancing model performance and improving training efficiency during the pretraining and supervised fine-tuning stages. Despite the considerable importance of data management, the underlying mechanisms of current prominent practices are still unknown. Consequently, the exploration of data management has attracted increasing attention from the research community. This survey aims to provide a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs, covering various aspects of data management strategy design. Looking into the future, we extrapolate existing challenges and outline promising directions for development in this field. This survey therefore serves as a guiding resource for practitioners aspiring to construct powerful LLMs through efficient data management practices. The collection of the latest papers is available at https://github.com/ZigeW/data_management_LLM.

Topics & Keywords

cs.CL cs.AI

Authors (9)

Zige Wang

Wanjun Zhong

Yufei Wang

Qi Zhu

Fei Mi

Baojun Wang

Lifeng Shang

Xin Jiang

Qun Liu

Citation Format

Wang, Z., Zhong, W., Wang, Y., Zhu, Q., Mi, F., Wang, B. et al. (2023). Data Management For Training Large Language Models: A Survey. https://arxiv.org/abs/2312.01700

Journal Information

Publication Year: 2023
Language: en
Source Database: arXiv
Access: Open Access ✓