SWE-Bench Verified tasks touch ~1.2 files on average; SWE-Bench Pro averages 4.1 files per task and uses contractor-written, repo-realistic PRs. The models clustered at 79–81% on Verified drop to roughly 35–45% on Pro with stock scaffolds, open-weight models fall to 23–32%, and Claude-class agents recover to ~55–59% only when paired with custom scaffolds. The collapse isn't a model-quality problem—it's a planning, retrieval, and context-management problem. If you're picking a coding agent for production refactors, weight Pro scores 3–5x more than Verified, and budget for harness work, not model swaps.
Claude Opus 4.6 scores 80.8% on SWE-Bench Verified. Six frontier models sit within 1.3 points of it. By the time leadership reviews the leaderboard and asks why one model costs 60% more than another scoring within a point of it, the leaderboard has already lost its discriminative power.
Both numbers are misleading. SWE-Bench Pro shows why.
The same Claude Opus that hits 80.8% on Verified scores around 55% on Pro. Gemini 3.1 Pro drops harder. Several open-weight models that brag about Verified parity with frontier closed models fall into the 23–32% band on Pro. The benchmark didn't get harder by changing the model's job—it got harder by making the job look like real engineering work. And almost every "best AI for coding" article published in the last six months is about to be obsolete because of it.
This post is the breakdown—numbers, failure modes, and an agent-selection framework that holds up when teams ask which coding model to standardize on.
The Headline Trap: SWE-Bench Verified Is a Single-File Test
SWE-Bench Verified became the default coding benchmark because it filtered out broken or ambiguous tasks from the original SWE-Bench. The cleanup worked—but it also concentrated the dataset on tasks that an agent can solve by reading roughly one file and editing it.
The numbers tell the story. Verified tasks average around 1.2 modified files. The median is one file. A long tail goes higher, but the bulk of the benchmark is "find the bug in this module, change a few lines, pass the existing test." That is a real coding skill. It is not the coding skill most teams are paying frontier-model prices for.
This is why six frontier models now cluster within 1.3 points of each other on Verified:

| Model | SWE-Bench Verified |
|---|---|
| Claude Opus 4.6 | 80.8% |
| Gemini 3.1 Pro | 80.6% |
| Claude Opus 4.5 | 80.9% |
| MiniMax M2.5 | 80.2% |
| GPT-5.4 | ~80.0% |
| Sonnet 4.6 | 79.6% |

When the spread between top and bottom is 1.3 points, the benchmark is no longer telling you which model is better—it's telling you which models have saturated the test. We covered this convergence in our Claude Opus 4.6 vs GPT-5.3 vs Gemini 3.1 head-to-head, but the deeper issue is that everyone was reading the leaderboard like it still discriminated.

It doesn't. SWE-Bench Pro does.
SWE-Bench Pro: 4.1 Files, Real PRs, No Mercy
Scale AI built SWE-Bench Pro to fix the saturation problem. Three structural changes matter:

- Tasks are drawn from contractor-curated, repo-realistic pull requests rather than filtered single-issue fixes.
- The average task touches 4.1 files instead of roughly 1.2, so cross-file dependencies are the norm, not the tail.
- Agents face repo-scale context: changes span tests, types, and runtime code rather than a single module.

The benchmark size is similar to Verified (a few hundred tasks), but the difficulty distribution shifts hard. Where Verified tasks resemble bug fixes, Pro tasks resemble pull requests.
Here is the practical impact of that shift on the same models:

| Model / Agent | Verified | SWE-Bench Pro |
|---|---|---|
| Claude Opus 4.6 (Claude Code) | 80.8% | ~55–59% |
| Claude Opus 4.5 (Cursor) | 80.9% | ~50.2% |
| Claude Opus 4.5 (Auggie) | 80.9% | ~51.8% |
| Claude Opus 4.5 (SEAL) | 80.9% | ~45.9% |
| Gemini 3.1 Pro (stock) | 80.6% | ~35–40% |
| GPT-5.4 (stock) | ~80% | ~38–45% |
| Mid-tier open-weight models | 70–78% | ~23–32% |

The headline drop—80% to 23%—is real for the weakest models on stock scaffolds. It is also the wrong number to anchor on, because the more interesting variance is inside a single model, not between models. Claude Opus 4.5 swings from 45.9% to 55.4% on SWE-Bench Pro depending only on which agent harness wraps it. That 9.5-point spread on a fixed model is larger than the spread between any two frontier models on Verified.
Why Cross-File Dependencies Break the Agent Loop
The collapse from Verified to Pro is not random. It maps cleanly to four failure modes that show up the moment a task touches more than one file.
1. The Agent Stops Reading Once the Test Compiles
Single-file tasks reward fast iteration: edit, run tests, done. Pro tasks punish it. The most common Pro failure we see in trace logs is the agent finding a fix that passes one test, declaring victory, and ignoring the call site in the next file that now type-errors at runtime. The agent never opened the consumer because nothing in its loop forced it to. This is a planning problem disguised as a code problem. The fix isn't a smarter model—it's a planner that maps consumers before it touches definitions. We covered the broader pattern in agent scaffolding beats model upgrades, and Pro is where it bites hardest.
2. Context Drifts Faster Than the Agent Can Compact It
A Verified task fits comfortably in 30K–50K tokens of context, including tool output. Pro tasks routinely push past 200K. By the time the agent has read four files, run grep five times, and inspected two test failures, the early-loaded interface definitions have rotated out of useful attention. The agent literally forgets what the function it's calling expects. We've documented this on the model side—long contexts degrade faster than people assume, see Chroma's context-rot research—but on Pro the degradation is operationally fatal. Without aggressive context compaction, the agent regresses on knowledge it had ten tool calls ago.
3. Retrieval Is Sequential When It Needs to Be Parallel
Most stock agents do code search one tool call at a time: grep, read, grep, read. On a 4-file task that's fine. On a 12-file Pro task, the agent burns its step budget on navigation and runs out of attempts before it has a working plan. Search subagents that issue 6–8 parallel grep/read operations per turn cut navigation time in half and add 2–4 points to Pro scores across every model tested—same pattern we documented for parallel coding agents and the worktree workflow.
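To make the parallel-retrieval idea concrete, here is a minimal sketch using only the Python standard library plus the ripgrep CLI. The queries, the worker count, and the `rg` dependency are illustrative assumptions, not a description of any specific agent's search subsystem.

```python
# Minimal sketch: fan out several code searches per agent turn instead of
# issuing them one tool call at a time. Assumes ripgrep (`rg`) is installed;
# swap in `grep -rn` if it is not.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def search(pattern: str, repo: str = ".") -> tuple[str, list[str]]:
    """Run one ripgrep search and return (pattern, matching lines)."""
    proc = subprocess.run(
        ["rg", "--line-number", "--max-count", "20", pattern, repo],
        capture_output=True, text=True,
    )
    return pattern, proc.stdout.splitlines()

def parallel_search(patterns: list[str], repo: str = ".") -> dict[str, list[str]]:
    """Issue all searches for this turn concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = pool.map(lambda p: search(p, repo), patterns)
    return dict(results)

if __name__ == "__main__":
    # One "turn" of navigation: definition, consumers, and tests in a single batch.
    hits = parallel_search([
        r"def resolve_config",   # where is the symbol defined? (hypothetical name)
        r"resolve_config\(",     # who calls it?
        r"test_.*config",        # which tests exercise it?
    ])
    for pattern, lines in hits.items():
        print(f"{pattern}: {len(lines)} hits")
```

The point is not the specific tool but the batching: one turn of navigation returns the definition, the consumers, and the tests together, instead of spending three turns of the step budget to learn the same thing.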
4. Errors Compound Instead of Recovering
On a single-file task, a bad edit is undone by re-editing the same file. On a Pro task, a bad edit propagates: the agent fixes the type error in file A, which surfaces a new test failure in file B, which the agent "fixes" by mutating an unrelated assertion. Three turns later the patch is structurally wrong and the agent can't reason about why. Stock scaffolds rarely roll back. The high-scoring Pro agents do—they snapshot state before each multi-file edit and revert if downstream tests break. That single change is worth 5–15 points on Pro-difficulty tasks.
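As a rough illustration of the snapshot-and-revert behavior, here is a minimal git-based sketch. The `git stash create` approach, the pytest command, and the `apply_edit` callable are assumptions for illustration; real harnesses often use dedicated worktrees or filesystem snapshots instead.

```python
# Minimal sketch: snapshot the worktree with git before a multi-file edit,
# run the tests, and revert if anything downstream breaks. Untracked files
# created by the edit are not cleaned up in this sketch.
import subprocess

def run(cmd: list[str]) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True)

def edit_with_rollback(apply_edit, test_cmd=("python", "-m", "pytest", "-q")) -> bool:
    # `git stash create` records the current worktree state as a commit
    # without modifying anything; empty output means the tree was clean.
    snapshot = run(["git", "stash", "create"]).stdout.strip()
    apply_edit()  # the multi-file edit itself (an agent tool call in practice)
    if run(list(test_cmd)).returncode == 0:
        return True  # downstream tests still pass; keep the edit
    # Tests broke: restore tracked files from the snapshot (or HEAD if the
    # tree was clean) and let the agent try a different decomposition.
    run(["git", "checkout", snapshot or "HEAD", "--", "."])
    return False
```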
The 9.5-Point Scaffold Gap Is the Real Story
If you only remember one number from SWE-Bench Pro, make it this: the same model swings 9–10 points based on the harness around it. That's the data point most articles are missing.

| Agent Framework | Model | SWE-Bench Pro |
|---|---|---|
| SEAL (standardized scaffold) | Claude Opus 4.5 | 45.9% |
| Cursor | Claude Opus 4.5 | 50.2% |
| Auggie (Augment) | Claude Opus 4.5 | 51.8% |
| Claude Code | Claude Opus 4.5 | 55.4% |

The model is identical. The benchmark is identical. The 9.5-point gap is entirely explained by tool orchestration, planning structure, error recovery, and context management. That's a bigger delta than you'll get from any frontier model upgrade in 2026, and it shows up consistently across model families—the harness improvements transfer.

This is also why it's worth flipping how you read the leaderboard. It's not a model leaderboard. It's an agent leaderboard. Pick the agent first, then pick the model that fits inside it. Our breakdown of Cursor 3 vs Claude Code vs Codex CLI is the same point applied to commercial tools.
What Actually Works: Patterns That Recover the Lost Points
Across the agent traces we've reviewed and the production deployments we've shipped, four scaffolding patterns recover most of the Verified-to-Pro gap. None require a stronger model.
Dependency-Aware Planning
Before any edit, the agent runs a structured pass: locate the symbol, list every consumer, list every test that exercises the consumer, and write a one-paragraph plan that names every file it will touch. This sounds bureaucratic. It also adds 4–8 points on Pro because it short-circuits the "edit and pray" pattern that dominates stock agent failures.
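A minimal sketch of that pre-edit pass, using plain `grep` over a Python repo. The symbol name, the regexes, and the plan shape are assumptions for illustration; a production harness would hand the resulting file list back to the model as part of its planning prompt rather than print it.

```python
# Minimal sketch: before editing a symbol, enumerate its definition, its
# consumers, and the tests that exercise those consumers, then emit a plan
# that names every file the edit will touch.
import subprocess
from pathlib import Path

def grep_files(pattern: str, repo: Path) -> set[Path]:
    """Return the set of Python files containing a regex match."""
    proc = subprocess.run(
        ["grep", "-rlE", "--include=*.py", pattern, str(repo)],
        capture_output=True, text=True,
    )
    return {Path(line) for line in proc.stdout.splitlines()}

def plan_edit(symbol: str, repo: Path) -> dict:
    definition = grep_files(rf"def {symbol}\b", repo)
    consumers = grep_files(rf"\b{symbol}\(", repo) - definition
    tests = {p for p in consumers | definition if "test" in p.name}
    return {
        "symbol": symbol,
        "defined_in": sorted(map(str, definition)),
        "consumers": sorted(map(str, consumers - tests)),
        "tests": sorted(map(str, tests)),
        # Every file the agent commits to touching before the first edit.
        "files_to_touch": sorted(map(str, definition | consumers)),
    }

if __name__ == "__main__":
    # Hypothetical symbol name; swap in the function the task actually targets.
    print(plan_edit("resolve_config", Path(".")))
```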
Judge-Then-Execute Split
One model proposes the patch. A second model—often a smaller, cheaper one—judges it against the plan and the tests before it ships. The judge doesn't need to be smarter; it needs to be skeptical. This dual-agent pattern is responsible for most of the gap between Claude Code's 55.4% and Cursor's 50.2% on the same underlying model.
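Here is one way the split can be wired, sketched with the two model calls stubbed out as plain callables. The function names, the retry budget, and the feedback format are assumptions; nothing here corresponds to a specific vendor API.

```python
# Minimal sketch: one callable proposes a patch, a second (cheaper) callable
# judges it against the plan before it is applied.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    approved: bool
    reason: str

def judge_then_execute(
    propose: Callable[[str], str],         # e.g. a frontier-model call returning a diff
    judge: Callable[[str, str], Verdict],  # e.g. a smaller model checking diff vs. plan
    apply_patch: Callable[[str], None],
    plan: str,
    max_attempts: int = 3,
) -> bool:
    for _ in range(max_attempts):
        patch = propose(plan)
        verdict = judge(plan, patch)
        if verdict.approved:
            apply_patch(patch)
            return True
        # Feed the rejection back so the next proposal addresses it.
        plan = f"{plan}\n\nPrevious attempt rejected: {verdict.reason}"
    return False
```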
Aggressive Context Compaction
Every 8–10 tool calls, summarize older tool output into a compressed form, keep the last 2–3 raw, and re-inject the original task description. Without this, the agent's effective working memory degrades to the most recent 30K tokens regardless of how much context you've technically loaded.
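A minimal sketch of that compaction schedule, with the summarizer stubbed out as a callable (in practice a cheap model call). The thresholds mirror the numbers above but are otherwise assumptions.

```python
# Minimal sketch: keep the last few tool outputs verbatim, compress everything
# older through a summarizer, and re-inject the original task so it never
# rotates out of attention.
from typing import Callable

def compact_context(
    task: str,
    tool_outputs: list[str],
    summarize: Callable[[str], str],
    keep_raw: int = 3,
    every_n_calls: int = 10,
) -> list[str]:
    if len(tool_outputs) % every_n_calls != 0:
        return [task] + tool_outputs  # not yet time to compact
    old, recent = tool_outputs[:-keep_raw], tool_outputs[-keep_raw:]
    digest = summarize("\n\n".join(old)) if old else ""
    # Re-injected task first, compressed history next, raw recent output last.
    return [task, f"[summary of earlier tool calls]\n{digest}", *recent]
```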
Snapshot + Rollback on Test Failure
Save state before any multi-file edit. If downstream tests break, roll back to the snapshot and try a different decomposition. The pattern adds 5–15 points on Pro because it converts compounding errors back into recoverable ones (see the git-based sketch in the failure-mode section above).

If your team is building or evaluating coding agents, this is the order of investment. Each pattern is cheaper than a model upgrade and the gains stack.
How to Read the Leaderboard Now
For production decisions in 2026, our framework is:

- Weight SWE-Bench Pro results 3–5x more heavily than Verified results; Verified is saturated, Pro still discriminates.
- Read Pro as an agent leaderboard, not a model leaderboard: pick the harness first, then the model that fits inside it.
- Budget for harness work (planning, retrieval parallelism, context compaction, rollback) before budgeting for a model swap.
For teams seeing this play out in production, the productivity paradox we documented across coding tools is the same data from a different angle: tools that demo well on isolated tasks don't always survive real codebases.
What This Means for Picking a Coding Agent
Three takeaways for the engineering leaders we're advising right now:
Stop optimizing the model column. Six frontier models are within 1.3 points on Verified and within ~5 points on Pro at the top end. The model is no longer the lever. The harness around it is.
Budget for scaffolding work. Whether you build internally or buy, plan for 2–4 weeks of harness tuning—dependency-aware planning, context compaction, retrieval parallelism, rollback. This is the work that recovers the Verified-to-Pro gap, and it transfers when the next model drops.
Pick the agent that wins on Pro, not Verified. Right now that's a small set: Claude Code and Auggie at the top, Cursor close behind, with mid-tier wrappers visibly lagging. The gap will move. The framing won't.
The Verified era told us frontier models had converged. The Pro era is telling us something more useful: the coding-agent gap that matters in 2026 sits in the harness, not the model. If you're still buying based on Verified scores, you're choosing a single-file specialist for a multi-file job—and paying frontier prices for it.
Frequently Asked Questions
What is SWE-Bench Pro, and how is it different from SWE-Bench Verified?

SWE-Bench Pro is Scale AI's harder successor to SWE-Bench Verified. It uses contractor-curated GitHub PRs that average 4.1 modified files per task with repo-scale context, versus Verified's roughly 1.2-file, mostly self-contained patches. Pro tasks include cross-file dependency changes, schema migrations, and refactors that span tests, types, and runtime code. The result is a benchmark that breaks the single-file hill-climbing pattern that inflated Verified scores into the high 70s and 80s.



