arXiv Open Access 2025

Multi-Docker-Eval: A `Shovel of the Gold Rush' Benchmark on Automatic Environment Building for Software Engineering

Kelin Fu, Tianyu Liu, Zeyu Shang, Yingwei Ma, Jian Yang, +2 others

Abstract

Automated environment configuration is a critical bottleneck in scaling software engineering (SWE) automation. To provide a reliable evaluation standard for this task, we present the Multi-Docker-Eval benchmark. It includes 40 real-world repositories spanning 9 programming languages and measures both success in achieving executable states and efficiency under realistic constraints. Our extensive evaluation of state-of-the-art LLMs and agent frameworks reveals key insights: (1) the overall success rate of current models is low (F2P at most 37.7%), with environment construction being the primary bottleneck; (2) model size and reasoning length are not decisive factors, and open-source models such as DeepSeek-V3.1 and Kimi-K2 are competitive in both efficiency and effectiveness; (3) the agent framework and programming language also significantly influence the success rate. These findings provide actionable guidelines for building scalable, fully automated SWE pipelines.


Authors (7)

Kelin Fu

Tianyu Liu

Zeyu Shang

Yingwei Ma

Jian Yang

Jiaheng Liu

Kaigui Bian

Citation Format

Fu, K., Liu, T., Shang, Z., Ma, Y., Yang, J., Liu, J., & Bian, K. (2025). Multi-Docker-Eval: A `Shovel of the Gold Rush' Benchmark on Automatic Environment Building for Software Engineering. https://arxiv.org/abs/2512.06915

Journal Information
Publication Year: 2025
Language: en
Database Source: arXiv
Access: Open Access