arXiv Open Access 2025

Multi-Docker-Eval: A `Shovel of the Gold Rush' Benchmark on Automatic Environment Building for Software Engineering

Kelin Fu, Tianyu Liu, Zeyu Shang, Yingwei Ma, Jian Yang, +2 others

Abstract

Automated environment configuration is a critical bottleneck in scaling software engineering (SWE) automation. To provide a reliable evaluation standard for this task, we present the Multi-Docker-Eval benchmark. It includes 40 real-world repositories spanning 9 programming languages and measures both success in achieving executable states and efficiency under realistic constraints. Our extensive evaluation of state-of-the-art LLMs and agent frameworks reveals key insights: (1) the overall success rate of current models is low (F2P at most 37.7%), with environment construction being the primary bottleneck; (2) model size and reasoning length are not decisive factors, and open-source models like DeepSeek-V3.1 and Kimi-K2 are competitive in both efficiency and effectiveness; (3) the agent framework and the programming language also have a significant influence on the success rate. These findings provide actionable guidelines for building scalable, fully automated SWE pipelines.
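
For readers unfamiliar with the metric, the F2P figure in the abstract can be read as a fail-to-pass rate. The sketch below is an illustration only. It assumes F2P follows the SWE-bench-style convention, where an instance counts as a success when the target tests fail before the reference patch and pass after it inside the built environment. The `Instance` record, its field names, and the sample data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical record of one environment-building attempt."""
    repo: str
    built: bool               # the environment built and the tests could run
    fails_before_patch: bool  # target tests fail on the unpatched code
    passes_after_patch: bool  # target tests pass once the patch is applied


def is_f2p(inst: Instance) -> bool:
    # Assumed definition: built, failing before the patch, passing after it.
    return inst.built and inst.fails_before_patch and inst.passes_after_patch


def f2p_rate(instances: list[Instance]) -> float:
    # Fraction of instances that satisfy the fail-to-pass condition.
    if not instances:
        return 0.0
    return sum(is_f2p(i) for i in instances) / len(instances)


# Made-up sample data, for illustration only.
sample = [
    Instance("example/a", True, True, True),
    Instance("example/b", True, False, True),    # tests never failed: not F2P
    Instance("example/c", False, False, False),  # environment did not build
    Instance("example/d", True, True, True),
]
print(f"F2P rate: {f2p_rate(sample):.1%}")  # prints "F2P rate: 50.0%"
```

Under this reading, an environment that builds but cannot reproduce the failing tests does not count as a success, which is one plausible reason a build-success rate and an F2P rate can differ.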

Topics & Keywords

cs.SE

Authors (7)

Kelin Fu

Tianyu Liu

Zeyu Shang

Yingwei Ma

Jian Yang

Jiaheng Liu

Kaigui Bian

Citation Format

Fu, K., Liu, T., Shang, Z., Ma, Y., Yang, J., Liu, J., & Bian, K. (2025). Multi-Docker-Eval: A `Shovel of the Gold Rush' Benchmark on Automatic Environment Building for Software Engineering. https://arxiv.org/abs/2512.06915

Journal Information

Publication Year: 2025
Language: en
Database Source: arXiv
Access: Open Access ✓