Six frontier models now score within 0.8 points of each other on SWE-bench Verified—the model is no longer the differentiator. Scaffolding changes alone produce 22+ point swings on SWE-bench Pro with the same model. Focus on tool orchestration, error recovery, context management, and retry logic instead of chasing the next model release.
A client recently asked us to evaluate whether upgrading from Claude Sonnet to Opus would fix their coding agent's 34% task completion rate. We ran the numbers and told them to keep Sonnet. Instead, we rebuilt the scaffolding around it—better tool orchestration, smarter error recovery, context compaction at every step. Completion rate jumped to 71%. The model never changed.
This isn't an isolated story. SWE-bench data now proves what we've seen in production: the same LLM can score anywhere from 42% to 78% on coding benchmarks depending entirely on the scaffolding around it. Meanwhile, swapping between the six best frontier models moves your score by less than a single percentage point. If you're still chasing model upgrades to improve agent performance, you're optimizing the wrong variable.
The Data: Same Model, Wildly Different Scores
SWE-bench has become the standard benchmark for evaluating coding agents against real GitHub issues. The latest results reveal something that should change how every team thinks about agent development.
Frontier Models Have Converged

On SWE-bench Verified, six frontier models now score within 0.8 points of each other:

| Model | SWE-bench Verified Score |
|---|---|
| Claude Opus 4.5 | 80.9% |
| Claude Opus 4.6 | 80.8% |
| Gemini 3.1 Pro | 80.6% |
| MiniMax M2.5 | 80.2% |
| GPT-5.4 | ~80.0% |
| Sonnet 4.6 | 79.6% |

The difference between the best and worst model on this list is 1.3 points. That's noise, not signal. If your strategy for improving agent performance is "wait for the next model release," you're chasing diminishing returns that are already nearly zero at the frontier.

Scaffolding Creates the Real Performance Gap

Now compare what happens when you run the same model through different agent frameworks on SWE-bench Pro:

| Agent Framework | Model | SWE-bench Pro Score |
|---|---|---|
| SEAL (standardized scaffold) | Claude Opus 4.5 | 45.9% |
| Cursor | Claude Opus 4.5 | 50.2% |
| Auggie (Augment) | Claude Opus 4.5 | 51.8% |
| Claude Code | Claude Opus 4.5 | 55.4% |

Same model. A 9.5-point spread. Auggie solved 17 more problems than SEAL out of 731 total—not because it had a better LLM, but because it had better tool orchestration, context management, and error recovery.

The full picture is even more dramatic. Nate B. Jones documented cases where harness changes alone swing the same model from 42% to 78% on coding benchmarks. On SWE-bench Pro, the scaffold accounts for a 22+ point swing while model swaps account for roughly 1 point at the frontier. For a deeper dive into how agent frameworks compare, see our LangGraph vs CrewAI vs OpenAI Agents SDK comparison.

A Weaker Model With Better Scaffolding Wins

The Confucius Code Agent from Meta and Harvard ran Claude Sonnet 4.5—not Opus—with their custom scaffold and scored 52.7% on SWE-bench Pro. That beat Claude Opus 4.5 running on Anthropic's own scaffold at 52.0%. A weaker, cheaper model outperformed the flagship because the scaffolding was better. This is the clearest evidence that scaffolding has become the primary lever for agent performance. The model is a commodity. The harness is the differentiator.
What Scaffolding Actually Means
"Scaffolding" and "harness" get thrown around loosely. Here's what they specifically include—and where the performance gains come from.
Scaffolding vs. Harness
Scaffolding constructs the agent before it processes the first user request: defining available tools, writing system instructions, setting up the architecture. The harness orchestrates everything after each prompt: dispatching tool calls, compacting context, enforcing safety, persisting state between turns. Both are non-model infrastructure. Both have outsized impact on results. earezki's research quantifies this: the harness provides roughly 2x impact on output quality compared to the raw model alone.
The Six Components That Drive Performance
Based on SWE-bench data and production deployments, these are the scaffolding components with the highest measurable impact:

1. Tool Orchestration

How your agent discovers, selects, and invokes tools matters enormously. WarpGrep, a YC-launched search subagent, added +2.1 to +3.7 points to every model tested on SWE-bench Pro simply by providing faster, parallel code search. It executes up to 36 grep and read operations in under 5 seconds using 8 parallel calls per turn—something a naive sequential agent can't match.

2. Context Management

Long coding sessions degrade performance because older context pushes relevant information out of the window. Adaptive context compaction—progressively summarizing older tool outputs while keeping recent results intact—prevents this degradation. The OPENDEV framework uses this pattern to maintain performance across sessions that would otherwise hit token limits. For more on context as a discipline, see our guide on context engineering in 2026.

3. Error Recovery and Rollback

Basic agents fail and stop. Good scaffolding detects failures, rolls back to a known-good state, and retries with a different approach. This is the difference between an agent that solves 42% of problems and one that solves 78%—the higher-performing agent doesn't make fewer mistakes, it recovers from them. Read more about preventing common agent mistakes.

4. Planning-Execution Separation

Dual-agent architectures that separate planning from execution consistently outperform single-loop agents. One agent decides what to do, another does it. This mirrors how experienced developers work—you think through the approach before writing code, not during.

5. Memory and Persistent State

Agents that accumulate project-specific knowledge across sessions outperform agents that start fresh every time. Persistent note-taking, structured memory of file locations and code patterns, and cross-session learning all contribute to higher completion rates. We've written extensively about agent memory and context management.

6. Retry Logic with Decreasing Budgets

Pass@3 retry strategies—where the agent gets three attempts with decreasing step budgets for each retry—provide a practical balance between thoroughness and cost. The first attempt gets the full budget, the second gets 60%, the third gets 30%. This prevents runaway costs while giving the agent a fair chance to recover from initial failures. For optimal step counting strategies, see our guide on AI agent reasoning loops optimization.
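The pass@3 schedule with decreasing budgets can be sketched in a few lines. This is a minimal illustration, not any framework's API: `run_attempt` is a hypothetical stand-in for your agent's main loop, and the 100%/60%/30% fractions come straight from the schedule described above.

```python
def run_with_retries(task, run_attempt, base_budget=50,
                     budget_fractions=(1.0, 0.6, 0.3)):
    """Try a task up to three times, shrinking the step budget each retry.

    `run_attempt(task, max_steps=...)` is assumed to return a dict with a
    `success` key -- a placeholder for your agent's real entry point.
    """
    for fraction in budget_fractions:
        budget = max(1, int(base_budget * fraction))  # 100% -> 60% -> 30%
        result = run_attempt(task, max_steps=budget)
        if result.get("success"):
            return result  # stop early on the first successful attempt
    return {"success": False, "task": task}
```

With a 50-step base budget, the three attempts run with 50, 30, and 15 steps, so a pathological task can never cost more than 1.9x the budget of a clean first-try solve.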
Real-World Scaffolding Wins
These aren't theoretical gains. Here are documented cases where scaffolding changes produced dramatic improvements without touching the model.
Grok Code Fast: 6.7% to 68.3%
Grok Code Fast went from 6.7% to 68.3% on coding benchmarks by modifying only its edit tool format. Not the model, not the prompts, not the training data—just how the tool communicated edits back to the LLM. A 10x improvement from a tool format change.
LangChain's Coding Agent: 52.8% to 66.5%
LangChain's coding agent jumped from 52.8% to 66.5% on Terminal Bench 2.0, moving from Top 30 to Top 5, by changing nothing about the underlying model. The improvements came from better task decomposition, improved context handling, and smarter tool use patterns.
WarpGrep: +3.7 Points and 28% Faster
Adding WarpGrep as a search subagent to existing agents added +2.1 to +3.7 points on SWE-bench Pro across every model tested. It also cut costs by 15.6% and completion time by 28% with Claude Opus 4.6—input tokens dropped 17% and output tokens dropped 13% because the agent found what it needed faster.
OpenAI's Codex Team: 1,500 PRs With 3 Engineers
Three engineers on OpenAI's Codex team produced roughly 1,500 merged PRs over 5 months—about 1 million lines of code—with no manually written source code. The key wasn't the model. It was architectural constraints that enforced a controlled sequence: Types, Config, Repo, Service, Runtime, UI. Agents were restricted to specific layers. As Martin Fowler noted, "Harness Engineering is a valuable framing of a key part of AI-enabled software development."
Practical Patterns to Improve Your Agent Today
You don't need to rebuild your entire agent to capture scaffolding gains. Start with the highest-impact changes.
Start With Search
The single highest-ROI scaffolding improvement for coding agents is better code search. If your agent does sequential file reads, switch to parallel grep operations with context-limited output. WarpGrep's approach—8 parallel calls per turn, up to 4 turns—is a pattern you can replicate with any tool-use framework.
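A fan-out of this kind needs nothing beyond the standard library. The sketch below is illustrative, not WarpGrep's implementation: `grep_file`, the `*.py` file filter, and the context limit are all assumptions, and a production agent would dispatch these as parallel tool calls through its own framework rather than threads.

```python
import concurrent.futures
import re
from pathlib import Path

def grep_file(path: Path, pattern: str, context_limit: int = 5):
    """Return up to `context_limit` matching lines from one file."""
    hits = []
    try:
        for i, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if re.search(pattern, line):
                hits.append((str(path), i, line.strip()))
                if len(hits) >= context_limit:
                    break  # cap output so results stay context-friendly
    except OSError:
        pass  # unreadable files are skipped, not fatal
    return hits

def parallel_grep(root: str, patterns, max_workers: int = 8):
    """Run every (pattern, file) grep concurrently instead of sequentially."""
    files = list(Path(root).rglob("*.py"))
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(grep_file, f, pat)
                   for pat in patterns for f in files]
        for fut in concurrent.futures.as_completed(futures):
            results.extend(fut.result())
    return results
```

The key property is that eight searches cost roughly the wall-clock time of one, which is exactly what a sequential read loop can't deliver.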
Add Structured Error Recovery
Implement a simple pattern: after every tool call, check if the result indicates an error. If it does, save the current state, try an alternative approach, and only roll back if the alternative also fails. Most agents treat errors as terminal. Making them recoverable is a 5-15 point improvement on benchmarks.
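The pattern above fits in one small wrapper. A minimal sketch, assuming `snapshot` and `restore` are hooks into your agent's workspace (for example, a `git stash`/checkout pair) and that tool results signal failure via an `error` key:

```python
def call_with_recovery(primary, fallback, snapshot, restore):
    """Run `primary`; on error, try `fallback`; roll back only if both fail."""
    state = snapshot()           # save a known-good state before acting
    result = primary()
    if not result.get("error"):
        return result            # happy path: no recovery needed
    result = fallback()          # error is recoverable: try another approach
    if not result.get("error"):
        return result
    restore(state)               # both approaches failed: roll back
    return {"error": "both approaches failed; workspace rolled back"}
```

The point is that the error check happens after every tool call, so a failure becomes a branch in the control flow instead of the end of the run.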
Compact Context Aggressively
After every 5-10 tool calls, summarize older results into a compressed format. Keep the last 2-3 tool outputs at full fidelity. This prevents the context degradation that causes agents to "forget" what they've already learned about a codebase partway through a session.
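The "summarize older, keep recent" policy can be sketched as below. The `summarize` placeholder would be an LLM call in practice; the thresholds mirror the numbers above and are tunable.

```python
def compact_history(tool_outputs, keep_recent=3, compact_every=8,
                    summarize=lambda texts:
                        f"[summary of {len(texts)} older tool outputs]"):
    """Collapse all but the last `keep_recent` outputs into one summary entry.

    `summarize` is a stand-in for a real summarization step (typically an
    LLM call over the older outputs).
    """
    if len(tool_outputs) < compact_every:
        return tool_outputs  # history still short; nothing to compact
    older = tool_outputs[:-keep_recent]
    recent = tool_outputs[-keep_recent:]  # kept at full fidelity
    return [summarize(older)] + recent
```

Run this check after each tool turn and the context stays bounded: the agent always sees one rolling summary plus the last few raw results.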
Separate Planning From Doing
Even a simple two-phase approach helps: first, have the agent read relevant files and output a plan. Then, in a second phase, execute that plan. This prevents the "code first, think later" failure mode that tanks completion rates on complex tasks. For more on building robust multi-step agents, see our guide to building complex AI agents.
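Structurally, the two-phase loop is just this. Everything here is illustrative: `llm` stands in for your model call, the prompt is a sketch, and `tools.execute` represents whatever dispatch your framework provides.

```python
def plan_then_execute(task, llm, tools):
    """Phase 1 produces a plan; phase 2 executes it step by step."""
    # Phase 1: read-only planning -- the agent inspects, it does not edit.
    plan = llm(
        f"Read the relevant files and produce a numbered plan for: {task}"
    )
    # Phase 2: execution -- each plan step becomes a concrete tool invocation.
    results = []
    for step in plan:
        results.append(tools.execute(step))
    return results
```

Keeping the phases as separate calls also gives you a natural checkpoint: you can log, validate, or even human-review the plan before any file is touched.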
Profile Your Agent's Tool Usage
Before optimizing, measure. Log every tool call, its latency, and whether it contributed to task completion. Most agents waste 30-40% of their tool calls on redundant searches or unnecessary file reads. Eliminating waste is free performance.
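A profiler for this needs only a wrapper and a log. A minimal sketch using the standard library; the class name and log schema are invented for illustration:

```python
import time
from collections import defaultdict

class ToolProfiler:
    """Wraps tool functions to record per-call latency and arguments."""

    def __init__(self):
        self.log = []

    def wrap(self, name, fn):
        """Return a drop-in replacement for `fn` that logs each call."""
        def wrapped(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            self.log.append({
                "tool": name,
                "latency_s": time.perf_counter() - start,
                "args": args,  # lets you spot repeated/redundant queries
            })
            return result
        return wrapped

    def summary(self):
        """Aggregate call counts and total latency per tool."""
        totals = defaultdict(lambda: {"calls": 0, "latency_s": 0.0})
        for entry in self.log:
            t = totals[entry["tool"]]
            t["calls"] += 1
            t["latency_s"] += entry["latency_s"]
        return dict(totals)
```

Sorting the summary by call count usually surfaces the redundant searches immediately; duplicate `args` entries for the same tool are the 30-40% waste described above.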
The Scaffolding Investment Framework
These improvements compound. An agent with parallel search, error recovery, and context compaction will outperform a model upgrade every time—and the improvements are permanent. You keep them when the next model drops. For teams evaluating the broader AI agents pillar, scaffolding engineering should be the first investment, not the last.
| Improvement | Expected Impact | Effort |
|---|---|---|
| Parallel search operations | +2-4 points | Low—swap sequential for parallel tool calls |
| Structured error recovery | +5-15 points | Medium—implement rollback and retry logic |
| Context compaction | +3-8 points | Medium—add summarization between tool turns |
| Planning-execution split | +5-10 points | Medium—add a planning phase before execution |
| Persistent memory | +2-5 points | High—build cross-session knowledge store |
| Custom tool formats | Up to 10x (Grok case) | Low—but requires profiling to identify bottleneck |
Stop Waiting for the Next Model
The SWE-bench data is unambiguous: scaffolding determines agent performance more than model selection does. Six frontier models score within 0.8 points of each other, but scaffolding changes produce 22+ point swings with the same model.
Every week your team spends evaluating whether to switch from Claude to GPT or Gemini is a week you could have spent building better tool orchestration, error recovery, and context management—changes that would produce 10-20x the performance improvement of any model swap.
The model is a commodity. The harness is your moat. Build accordingly.
Frequently Asked Questions
What is agent scaffolding?

Agent scaffolding is the infrastructure surrounding an LLM that determines how it receives context, calls tools, handles errors, and manages state across multiple steps. It includes tool orchestration, retry logic, context compaction, memory systems, and error recovery patterns. Unlike prompt engineering, which optimizes the input to a model, scaffolding optimizes everything before and after each LLM call—how tasks are decomposed, which tools are available, how failures are detected, and how results are validated.



