jimenez2024-swebench
arXiv: 2310.06770
TLDR
Evaluates how well code models resolve real software issues end to end, using 2,294 issues drawn from 12 real Python repositories. It quickly became the de facto standard benchmark for coding agents; almost every coding agent paper now reports a SWE-bench score.