Semantic Scholar Open Access 2022 490 citations

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Yuhang Lai Chengxi Li Yiming Wang Tianyi Zhang Ruiqi Zhong +5 others

Abstract

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior work, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable): across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% are incorrect. We achieve this with multi-criteria metrics, checking both functional correctness by running test cases and surface-form constraints by restricting API usage or keywords. Finally, we proactively defend against memorization by slightly modifying our problems to be different from the original StackOverflow source; consequently, models cannot answer them correctly by memorizing the solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.
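The multi-criteria idea in the abstract (accept a solution only if it passes the test cases and also respects surface-form constraints) can be illustrated with a minimal sketch. This is not the benchmark's actual evaluation harness: the function names `passes_tests`, `uses_loop`, and `accept`, the entry-point name `solve`, and the "no explicit loops" constraint are all hypothetical, chosen only to show how the two checks combine.

```python
import ast


def passes_tests(code, test_cases, func_name="solve"):
    """Functional check: run the candidate and compare outputs to expected values."""
    namespace = {}
    try:
        exec(code, namespace)  # assumption: candidate code is trusted or sandboxed elsewhere
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        # Any runtime error or missing entry point counts as a failure.
        return False


def uses_loop(code):
    """Surface-form check: does the source contain an explicit for/while statement?"""
    tree = ast.parse(code)
    return any(isinstance(node, (ast.For, ast.While)) for node in ast.walk(tree))


def accept(code, test_cases, forbid_loops=False):
    """Accept only if both the functional and the surface-form criteria hold."""
    try:
        if forbid_loops and uses_loop(code):
            return False
    except SyntaxError:
        return False
    return passes_tests(code, test_cases)


# Hypothetical problem: "sum a list without writing an explicit loop".
tests = [(([1, 2, 3],), 6), (([],), 0)]

vectorized = "def solve(xs):\n    return sum(xs)\n"
loopy = (
    "def solve(xs):\n"
    "    total = 0\n"
    "    for x in xs:\n"
    "        total += x\n"
    "    return total\n"
)
wrong = "def solve(xs):\n    return max(xs)\n"

results = {
    "vectorized": accept(vectorized, tests, forbid_loops=True),
    "loopy": accept(loopy, tests, forbid_loops=True),
    "wrong": accept(wrong, tests, forbid_loops=True),
}
print(results)
```

In this sketch the `loopy` candidate is functionally correct but is still rejected, which is the kind of case that test cases alone would wrongly accept when a problem asks for a specific style of solution.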

Topics & Keywords

Computer Science

Authors (10)

Yuhang Lai

Chengxi Li

Yiming Wang

Tianyi Zhang

Ruiqi Zhong

Luke Zettlemoyer

S. Yih

Daniel Fried

Si-yi Wang

Tao Yu

Citation Format

Lai, Y., Li, C., Wang, Y., Zhang, T., Zhong, R., Zettlemoyer, L. et al. (2022). DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. https://doi.org/10.48550/arXiv.2211.11501

Journal Information

Publication Year: 2022
Language: en
Total Citations: 490
Source Database: Semantic Scholar
DOI: 10.48550/arXiv.2211.11501
Access: Open Access ✓