SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Authors: Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan (2024)

arXiv: 2310.06770

Domains

Evaluation, Applications

TLDR (English)

Evaluates code models on "end-to-end bug solving" using 2,294 real GitHub issues drawn from 12 Python repositories. It quickly became the de facto standard benchmark for coding agents; almost every coding agent paper now reports SWE-bench scores.
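As a rough illustration of how a SWE-bench-style score could be tallied, the sketch below counts a task as resolved only when the tests the fix is expected to repair (fail-to-pass) and the tests that must keep working (pass-to-pass) all pass after the model's patch is applied. The TaskResult class, its field names, and the sample data are hypothetical and for illustration only; this is not the official evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of test outcomes for one issue after patching."""
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests that should flip from failing to passing
    pass_to_pass: dict[str, bool]  # tests that should keep passing


def is_resolved(result: TaskResult) -> bool:
    # Assumption: an issue counts as resolved only if it has at least one
    # fail-to-pass test and every tracked test passes after the patch.
    return (
        bool(result.fail_to_pass)
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolved_rate(results: list[TaskResult]) -> float:
    # Fraction of issues resolved; 0.0 for an empty list.
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Made-up sample data, not real benchmark results.
sample = [
    TaskResult("repo-a-1", {"test_fix": True}, {"test_old": True}),
    TaskResult("repo-a-2", {"test_fix": False}, {"test_old": True}),
    TaskResult("repo-b-1", {"test_fix": True}, {"test_old": False}),
    TaskResult("repo-b-2", {"test_fix": True}, {}),
]
rate = resolved_rate(sample)
```

In this sample two of the four issues count as resolved: the second fails its target test and the third breaks a previously passing test.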

Appears in These Articles

代码生成：模型如何写程序
Code Generation: How Models Write Programs
