jimenez2024-swebench

arXiv: 2310.06770

TLDR (English)

SWE-bench draws 2,294 real GitHub issues from 12 Python repositories to evaluate how well code models can resolve bugs end to end. It became the standard benchmark for coding agents almost overnight; nearly every coding-agent paper now reports SWE-bench scores.
