# SWE-Bench

SWE-bench is the standard benchmark for AI coding agents. It hands a model a real bug report from an open-source GitHub repository and asks for a patch that makes the project's hidden test suite pass. No multiple choice, no toy problems. The task is close to what a developer actually does on a normal day, which is why every serious coding model now reports a SWE-bench number.

## The variants worth knowing

- **SWE-bench Verified** is the one most people quote. It is a 500-task subset that human annotators reviewed by hand, throwing out instances with broken tests or descriptions too vague to be solvable. OpenAI ran that cleanup. When a note says "SWE-bench" with no qualifier, it usually means Verified.
- **SWE-bench Multilingual** drops the Python-only assumption. It spans Java, TypeScript, JavaScript, Go, Rust, C, and C++, so a score here says something about a model outside its strongest language.
- The original full SWE-bench is larger and noisier. Cleaning up its flaws is the whole reason Verified exists.

## How to read a score

A score is the percentage of issues resolved: tasks whose tests pass after the model's patch, divided by all tasks in the set. By mid-2026 the frontier had pushed near the top of the chart: Anthropic's Claude Mythos Preview leads Verified at roughly 94% and Multilingual at roughly 87%. [[Composer 2.5]] reports 79.8% on Multilingual, and that figure is how Cursor backs its claim of matching [[Claude Opus 4.7]] and [[GPT-5.5]].

## Why a number is not the whole story

Most scores are self-reported by the lab. The harness, the retry budget, and the scaffolding around the model all move the result, so the same model can post different numbers depending on who ran it. Treat any SWE-bench figure as directional. A model can also be tuned toward the benchmark itself, so a high score is a necessary sign that a model codes well, not a sufficient one.

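
To make the retry-budget point concrete, here is a minimal sketch of how a resolved percentage could be computed from per-task attempt outcomes. The function name `resolve_rate`, the task IDs, and the outcome data are all hypothetical and chosen for illustration; this is not the official SWE-bench harness, which also handles applying patches and running tests.

```python
from typing import Dict, List


def resolve_rate(attempts: Dict[str, List[bool]], retry_budget: int) -> float:
    """Percentage of tasks resolved when only the first `retry_budget` attempts count.

    `attempts` maps a task ID to its attempt outcomes in order;
    True means the patch made the task's tests pass on that attempt.
    """
    if retry_budget < 1:
        raise ValueError("retry_budget must be at least 1")
    if not attempts:
        return 0.0
    # A task counts as resolved if any attempt within the budget succeeded.
    resolved = sum(any(outcomes[:retry_budget]) for outcomes in attempts.values())
    return 100.0 * resolved / len(attempts)


# Hypothetical outcomes for four tasks, three attempts each (illustrative data only).
attempts = {
    "task-1": [True, True, True],
    "task-2": [False, True, False],
    "task-3": [False, False, True],
    "task-4": [False, False, False],
}

for budget in (1, 2, 3):
    print(f"retry budget {budget}: {resolve_rate(attempts, budget):.1f}% resolved")
# retry budget 1: 25.0% resolved
# retry budget 2: 50.0% resolved
# retry budget 3: 75.0% resolved
```

The same set of model outputs yields 25%, 50%, or 75% depending only on how many attempts the evaluator allows, which is one reason two reported figures are comparable only when the retry budget and harness are stated.
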
## References

- Leaderboards: https://www.swebench.com
- SWE-bench Verified: https://www.swebench.com/verified.html

## Related

- [[Large Language Models (LLMs)]]
- [[AI Evaluation]]
- [[AI Agents]]
- [[Composer 2.5]]
- [[Composer 2]]
- [[Claude Opus 4.7]]
- [[GPT-5.5]]
- [[AI Frontier Model]]
- [[AI Coding Maturity Levels]]