Six frontier models now score within 0.8 points of each other on SWE-bench Verified—the model is no longer the differentiator. Scaffolding changes alone produce 22+ point swings on SWE-bench Pro with the same model. Focus on tool orchestration, error recovery, context management, and retry logic instead of chasing the next model release.
A client recently asked us to evaluate whether upgrading from Claude Sonnet to Opus would fix their coding agent's 34% task completion rate. We ran the numbers and told them to keep Sonnet. Instead, we rebuilt the scaffolding around it—better tool orchestration, smarter error recovery, context compaction at every step. Completion rate jumped to 71%. The model never changed.
This isn't an isolated story. SWE-bench data now proves what we've seen in production: the same LLM can score anywhere from 42% to 78% on coding benchmarks depending entirely on the scaffolding around it. Meanwhile, swapping between the six best frontier models moves your score by less than a single percentage point. If you're still chasing model upgrades to improve agent performance, you're optimizing the wrong variable.
The Data: Same Model, Wildly Different Scores
SWE-bench has become the standard benchmark for evaluating coding agents against real GitHub issues. The latest results reveal something that should change how every team thinks about agent development.
Frontier Models Have Converged

On SWE-bench Verified, six frontier models now score within 0.8 points of each other:

| Model | SWE-bench Verified Score |
|---|---|
| Claude Opus 4.5 | 80.9% |
| Claude Opus 4.6 | 80.8% |
| Gemini 3.1 Pro | 80.6% |
| MiniMax M2.5 | 80.2% |
| GPT-5.4 | ~80.0% |
| Sonnet 4.6 | 79.6% |

The difference between the best and worst model on this list is 1.3 points. That's noise, not signal. If your strategy for improving agent performance is "wait for the next model release," you're chasing diminishing returns that are already nearly zero at the frontier.

Scaffolding Creates the Real Performance Gap

Now compare what happens when you run the same model through different agent frameworks on SWE-bench Pro:

| Agent Framework | Model | SWE-bench Pro Score |
|---|---|---|
| SEAL (standardized scaffold) | Claude Opus 4.5 | 45.9% |
| Cursor | Claude Opus 4.5 | 50.2% |
| Auggie (Augment) | Claude Opus 4.5 | 51.8% |
| Claude Code | Claude Opus 4.5 | 55.4% |

Same model. A 9.5-point spread. Auggie solved 17 more problems than SEAL out of 731 total—not because it had a better LLM, but because it had better tool orchestration, context management, and error recovery.

The full picture is even more dramatic. Nate B. Jones documented cases where harness changes alone swing the same model from 42% to 78% on coding benchmarks. On SWE-bench Pro, the scaffold accounts for a 22+ point swing while model swaps account for roughly 1 point at the frontier. For a deeper dive into how agent frameworks compare, see our LangGraph vs CrewAI vs OpenAI Agents SDK comparison.

A Weaker Model With Better Scaffolding Wins

The Confucius Code Agent from Meta and Harvard ran Claude Sonnet 4.5—not Opus—with their custom scaffold and scored 52.7% on SWE-bench Pro. That beat Claude Opus 4.5 running on Anthropic's own scaffold at 52.0%. A weaker, cheaper model outperformed the flagship because the scaffolding was better. This is the clearest evidence that scaffolding has become the primary lever for agent performance. The model is a commodity. The harness is the differentiator.
What Scaffolding Actually Means
"Scaffolding" and "harness" get thrown around loosely. Here's what they specifically include—and where the performance gains come from.
Scaffolding vs. Harness
Scaffolding constructs the agent before it processes the first user request: defining available tools, writing system instructions, setting up the architecture. The harness orchestrates everything after each prompt: dispatching tool calls, compacting context, enforcing safety, persisting state between turns. Both are non-model infrastructure. Both have outsized impact on results. earezki's research quantifies this: the harness provides roughly 2x impact on output quality compared to the raw model alone.
The Six Components That Drive Performance
Based on SWE-bench data and production deployments, these are the scaffolding components with the highest measurable impact:

1. Tool Orchestration

How your agent discovers, selects, and invokes tools matters enormously. WarpGrep, a YC-launched search subagent, added +2.1 to +3.7 points to every model tested on SWE-bench Pro simply by providing faster, parallel code search. It executes up to 36 grep and read operations in under 5 seconds using 8 parallel calls per turn—something a naive sequential agent can't match.

2. Context Management

Long coding sessions degrade performance because older context pushes relevant information out of the window. Adaptive context compaction—progressively summarizing older tool outputs while keeping recent results intact—prevents this degradation. The OPENDEV framework uses this pattern to maintain performance across sessions that would otherwise hit token limits. For more on context as a discipline, see our guide on context engineering in 2026.

3. Error Recovery and Rollback

Basic agents fail and stop. Good scaffolding detects failures, rolls back to a known-good state, and retries with a different approach. This is the difference between an agent that solves 42% of problems and one that solves 78%—the higher-performing agent doesn't make fewer mistakes, it recovers from them. Read more about preventing common agent mistakes.

4. Planning-Execution Separation

Dual-agent architectures that separate planning from execution consistently outperform single-loop agents. One agent decides what to do, another does it. This mirrors how experienced developers work—you think through the approach before writing code, not during.

5. Memory and Persistent State

Agents that accumulate project-specific knowledge across sessions outperform agents that start fresh every time. Persistent note-taking, structured memory of file locations and code patterns, and cross-session learning all contribute to higher completion rates. We've written extensively about agent memory and context management.

6. Retry Logic with Decreasing Budgets

Pass@3 retry strategies—where the agent gets three attempts with decreasing step budgets for each retry—provide a practical balance between thoroughness and cost. The first attempt gets the full budget, the second gets 60%, the third gets 30%. This prevents runaway costs while giving the agent a fair chance to recover from initial failures. For optimal step counting strategies, see our guide on AI agent reasoning loops optimization.
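The pass@3 schedule with decreasing budgets can be sketched in a few lines. This is a minimal illustration, not any framework's API: `run_attempt` is a hypothetical stand-in for your agent's main loop, and the 100%/60%/30% fractions come straight from the schedule described above.

```python
def run_with_retries(task, run_attempt, base_budget=50,
                     budget_fractions=(1.0, 0.6, 0.3)):
    """Try a task up to three times, shrinking the step budget each retry.

    `run_attempt(task, max_steps=...)` is assumed to return a dict with a
    `success` key -- a placeholder for your agent's real entry point.
    """
    for fraction in budget_fractions:
        budget = max(1, int(base_budget * fraction))  # 100% -> 60% -> 30%
        result = run_attempt(task, max_steps=budget)
        if result.get("success"):
            return result  # stop early on the first successful attempt
    return {"success": False, "task": task}
```

With a 50-step base budget, the three attempts run with 50, 30, and 15 steps, so a pathological task can never cost more than 1.9x the budget of a clean first-try solve.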
Real-World Scaffolding Wins
These aren't theoretical gains. Here are documented cases where scaffolding changes produced dramatic improvements without touching the model.
Grok Code Fast: 6.7% to 68.3%
Grok Code Fast went from 6.7% to 68.3% on coding benchmarks by modifying only its edit tool format. Not the model, not the prompts, not the training data—just how the tool communicated edits back to the LLM. A 10x improvement from a tool format change.
LangChain's Coding Agent: 52.8% to 66.5%
LangChain's coding agent jumped from 52.8% to 66.5% on Terminal Bench 2.0, moving from Top 30 to Top 5, by changing nothing about the underlying model. The improvements came from better task decomposition, improved context handling, and smarter tool use patterns.
WarpGrep: +3.7 Points and 28% Faster
Adding WarpGrep as a search subagent to existing agents added +2.1 to +3.7 points on SWE-bench Pro across every model tested. It also cut costs by 15.6% and completion time by 28% with Claude Opus 4.6—input tokens dropped 17% and output tokens dropped 13% because the agent found what it needed faster.
OpenAI's Codex Team: 1,500 PRs With 3 Engineers
Three engineers on OpenAI's Codex team produced roughly 1,500 merged PRs over 5 months—about 1 million lines of code—with no manually written source code. The key wasn't the model. It was architectural constraints that enforced a controlled sequence: Types, Config, Repo, Service, Runtime, UI. Agents were restricted to specific layers. As Martin Fowler noted, "Harness Engineering is a valuable framing of a key part of AI-enabled software development."
Practical Patterns to Improve Your Agent Today
You don't need to rebuild your entire agent to capture scaffolding gains. Start with the highest-impact changes.
Start With Search
The single highest-ROI scaffolding improvement for coding agents is better code search. If your agent does sequential file reads, switch to parallel grep operations with context-limited output. WarpGrep's approach—8 parallel calls per turn, up to 4 turns—is a pattern you can replicate with any tool-use framework.
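A fan-out of this kind needs nothing beyond the standard library. The sketch below is illustrative, not WarpGrep's implementation: `grep_file`, the `*.py` file filter, and the context limit are all assumptions, and a production agent would dispatch these as parallel tool calls through its own framework rather than threads.

```python
import concurrent.futures
import re
from pathlib import Path

def grep_file(path: Path, pattern: str, context_limit: int = 5):
    """Return up to `context_limit` matching lines from one file."""
    hits = []
    try:
        for i, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if re.search(pattern, line):
                hits.append((str(path), i, line.strip()))
                if len(hits) >= context_limit:
                    break  # cap output so results stay context-friendly
    except OSError:
        pass  # unreadable files are skipped, not fatal
    return hits

def parallel_grep(root: str, patterns, max_workers: int = 8):
    """Run every (pattern, file) grep concurrently instead of sequentially."""
    files = list(Path(root).rglob("*.py"))
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(grep_file, f, pat)
                   for pat in patterns for f in files]
        for fut in concurrent.futures.as_completed(futures):
            results.extend(fut.result())
    return results
```

The key property is that eight searches cost roughly the wall-clock time of one, which is exactly what a sequential read loop can't deliver.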
Add Structured Error Recovery
Implement a simple pattern: after every tool call, check if the result indicates an error. If it does, save the current state, try an alternative approach, and only roll back if the alternative also fails. Most agents treat errors as terminal. Making them recoverable is a 5-15 point improvement on benchmarks.
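The pattern above fits in one small wrapper. A minimal sketch, assuming `snapshot` and `restore` are hooks into your agent's workspace (for example, a `git stash`/checkout pair) and that tool results signal failure via an `error` key:

```python
def call_with_recovery(primary, fallback, snapshot, restore):
    """Run `primary`; on error, try `fallback`; roll back only if both fail."""
    state = snapshot()           # save a known-good state before acting
    result = primary()
    if not result.get("error"):
        return result            # happy path: no recovery needed
    result = fallback()          # error is recoverable: try another approach
    if not result.get("error"):
        return result
    restore(state)               # both approaches failed: roll back
    return {"error": "both approaches failed; workspace rolled back"}
```

The point is that the error check happens after every tool call, so a failure becomes a branch in the control flow instead of the end of the run.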
Compact Context Aggressively
After every 5-10 tool calls, summarize older results into a compressed format. Keep the last 2-3 tool outputs at full fidelity. This prevents the context degradation that causes agents to "forget" what they've already learned about a codebase partway through a session.
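The "summarize older, keep recent" policy can be sketched as below. The `summarize` placeholder would be an LLM call in practice; the thresholds mirror the numbers above and are tunable.

```python
def compact_history(tool_outputs, keep_recent=3, compact_every=8,
                    summarize=lambda texts:
                        f"[summary of {len(texts)} older tool outputs]"):
    """Collapse all but the last `keep_recent` outputs into one summary entry.

    `summarize` is a stand-in for a real summarization step (typically an
    LLM call over the older outputs).
    """
    if len(tool_outputs) < compact_every:
        return tool_outputs  # history still short; nothing to compact
    older = tool_outputs[:-keep_recent]
    recent = tool_outputs[-keep_recent:]  # kept at full fidelity
    return [summarize(older)] + recent
```

Run this check after each tool turn and the context stays bounded: the agent always sees one rolling summary plus the last few raw results.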
Separate Planning From Doing
Even a simple two-phase approach helps: first, have the agent read relevant files and output a plan. Then, in a second phase, execute that plan. This prevents the "code first, think later" failure mode that tanks completion rates on complex tasks. For more on building robust multi-step agents, see our guide to building complex AI agents.
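Structurally, the two-phase loop is just this. Everything here is illustrative: `llm` stands in for your model call, the prompt is a sketch, and `tools.execute` represents whatever dispatch your framework provides.

```python
def plan_then_execute(task, llm, tools):
    """Phase 1 produces a plan; phase 2 executes it step by step."""
    # Phase 1: read-only planning -- the agent inspects, it does not edit.
    plan = llm(
        f"Read the relevant files and produce a numbered plan for: {task}"
    )
    # Phase 2: execution -- each plan step becomes a concrete tool invocation.
    results = []
    for step in plan:
        results.append(tools.execute(step))
    return results
```

Keeping the phases as separate calls also gives you a natural checkpoint: you can log, validate, or even human-review the plan before any file is touched.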
Profile Your Agent's Tool Usage
Before optimizing, measure. Log every tool call, its latency, and whether it contributed to task completion. Most agents waste 30-40% of their tool calls on redundant searches or unnecessary file reads. Eliminating waste is free performance.
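A profiler for this needs only a wrapper and a log. A minimal sketch using the standard library; the class name and log schema are invented for illustration:

```python
import time
from collections import defaultdict

class ToolProfiler:
    """Wraps tool functions to record per-call latency and arguments."""

    def __init__(self):
        self.log = []

    def wrap(self, name, fn):
        """Return a drop-in replacement for `fn` that logs each call."""
        def wrapped(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            self.log.append({
                "tool": name,
                "latency_s": time.perf_counter() - start,
                "args": args,  # lets you spot repeated/redundant queries
            })
            return result
        return wrapped

    def summary(self):
        """Aggregate call counts and total latency per tool."""
        totals = defaultdict(lambda: {"calls": 0, "latency_s": 0.0})
        for entry in self.log:
            t = totals[entry["tool"]]
            t["calls"] += 1
            t["latency_s"] += entry["latency_s"]
        return dict(totals)
```

Sorting the summary by call count usually surfaces the redundant searches immediately; duplicate `args` entries for the same tool are the 30-40% waste described above.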
The Scaffolding Investment Framework
These improvements compound. An agent with parallel search, error recovery, and context compaction will outperform a model upgrade every time—and the improvements are permanent. You keep them when the next model drops. For teams evaluating the broader AI agents pillar, scaffolding engineering should be the first investment, not the last.
| Improvement | Expected Impact | Effort |
|---|---|---|
| Parallel search operations | +2-4 points | Low—swap sequential for parallel tool calls |
| Structured error recovery | +5-15 points | Medium—implement rollback and retry logic |
| Context compaction | +3-8 points | Medium—add summarization between tool turns |
| Planning-execution split | +5-10 points | Medium—add a planning phase before execution |
| Persistent memory | +2-5 points | High—build cross-session knowledge store |
| Custom tool formats | Up to 10x (Grok case) | Low—but requires profiling to identify bottleneck |
Stop Waiting for the Next Model
The SWE-bench data is unambiguous: scaffolding determines agent performance more than model selection does. Six frontier models score within 0.8 points of each other, but scaffolding changes produce 22+ point swings with the same model.
Every week your team spends evaluating whether to switch from Claude to GPT or Gemini is a week you could have spent building better tool orchestration, error recovery, and context management—changes that would produce 10-20x the performance improvement of any model swap.
The model is a commodity. The harness is your moat. Build accordingly.
Frequently Asked Questions
What is agent scaffolding?

Agent scaffolding is the infrastructure surrounding an LLM that determines how it receives context, calls tools, handles errors, and manages state across multiple steps. It includes tool orchestration, retry logic, context compaction, memory systems, and error recovery patterns. Unlike prompt engineering, which optimizes the input to a model, scaffolding optimizes everything before and after each LLM call—how tasks are decomposed, which tools are available, how failures are detected, and how results are validated.



