Instead of loading every MCP tool definition upfront and piping each result back through context, code execution exposes your tools as a code API the model calls programmatically. Anthropic measured a Google Drive to Salesforce task drop from 150,000 to 2,000 tokens (98.7%). Independent tests reproduce the direction: AIMultiple saw a 78.5% input-token cut with GPT-4.1, Bifrost scaled from 58% at 96 tools to 92.8% at 508 tools, and Cloudflare's Code Mode hit 99.9% on a 2,500-endpoint API. The catch: per-call latency rose about 7% in controlled tests, and you now operate a code sandbox.
Load five MCP servers into a coding agent and you can burn 50,000 tokens of context before the model reads a single word of the user's request. The tool definitions alone (the names, descriptions, and JSON schemas for every available action) get pasted into the system prompt on every turn. Measurements published in April 2026 found that a single heavy MCP configuration consumed roughly 66,000 tokens at startup, about a third of a 200K context window, spent entirely on schemas the model may never use. That is the token tax, and code execution with MCP is the pattern teams are adopting to stop paying it.
The headline number is striking. In its November 2025 engineering post, Anthropic rebuilt a Google Drive to Salesforce workflow so the agent wrote code against its tools instead of calling them one at a time, and watched token usage fall from 150,000 to 2,000, a 98.7% reduction. That single figure is what made "code execution with MCP" trend through every agent-engineering channel for the following six months, and it is also why the "is MCP dead" debate took off: the protocol is fine, but the default way of wiring tools into a model turned out to be wasteful at scale.
This post explains what the pattern actually is, walks through the numbers Anthropic and independent testers measured (including where the savings are smaller than the headline suggests), and lays out when to reach for it versus the lighter-weight alternatives. The short version: code execution accepts a chunk of operational complexity (a sandbox you now have to secure) in exchange for a token reduction that grows with the size of your tool surface. For a five-tool agent it is overkill. For an enterprise stack wiring dozens of servers into one model, it is close to mandatory.
The Token Tax: Why Loading Every Tool Upfront Breaks at Scale
Standard Model Context Protocol wiring does two expensive things on every agent turn. First, it front-loads tool definitions: every connected server announces all of its tools, and those schemas live in the context window whether or not the task needs them. Second, it round-trips intermediate results: when a tool returns a 10,000-row spreadsheet or a long document, that payload flows back into context so the model can decide what to do next, then often flows out again as an argument to the next tool.
Both costs scale the wrong way. Add a server and every turn gets more expensive, forever, even on tasks that never touch the new tools. Chain three data-heavy tools and the intermediate payloads can dwarf the actual reasoning. Across the agent stacks we have audited, tool-definition overhead is the single most common source of silent context bloat, and it is invisible on a dashboard that only tracks completion tokens. It shows up as a steadily rising prompt-token bill and a context window that fills before the work starts, the same failure mode we covered in agent tool selection at scale, where accuracy collapses once a model is drowning in tool schemas.
The insight behind code execution is that models are already excellent at one thing that makes both costs disappear: writing code. If the tools are presented as a code API rather than as a list of callable schemas, the model can import only what it needs, loop and filter in the execution environment, and return just the final answer to context.
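The two costs described above can be put into a rough back-of-the-envelope model. This is a sketch only: the `Turn` fields, the average schema size, and the payload figures are illustrative assumptions, not measurements from any of the sources cited in this post.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """Inputs for a rough per-turn overhead estimate (all values hypothetical)."""
    tool_definitions: int       # number of tool schemas loaded into context
    tokens_per_definition: int  # assumed average size of one schema
    payload_tokens: int         # size of an intermediate tool result
    payload_passes: int         # times that payload crosses the context window


def direct_tool_calling_overhead(turn: Turn) -> int:
    """Tokens spent on definitions plus round-tripped data, before any reasoning."""
    definitions = turn.tool_definitions * turn.tokens_per_definition
    round_trips = turn.payload_tokens * turn.payload_passes
    return definitions + round_trips


# Illustrative numbers: 100 tools at ~500 tokens each, and one 20,000-token
# result that enters context once and leaves again as a tool argument.
turn = Turn(tool_definitions=100, tokens_per_definition=500,
            payload_tokens=20_000, payload_passes=2)
print(direct_tool_calling_overhead(turn))  # 90000
```

Both terms grow linearly, one with the number of connected tools and one with the size of the data being moved, which is why neither shows up as a problem on a small agent.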
How Code Execution With MCP Works
Instead of registering each MCP tool as a model-callable function, you expose the MCP servers as code, typically a generated directory of files where each tool becomes a function in a small TypeScript or Python module. The agent does not call salesforce_create_record through the model's tool-use interface. It writes a short script that imports the Salesforce module, calls the function, processes the result, and prints only what matters.
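The mechanics that produce the savings are straightforward once you see them: the agent loads only the tool definitions it actually imports, intermediate data stays in the execution environment instead of passing through context, and loops or chained calls run inside one script rather than across several model turns.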
Anthropic's framing is that tools become "a filesystem of code APIs." Cloudflare shipped essentially the same idea under the name Code Mode, generating a typed API from MCP tool schemas and letting the model write JavaScript against it inside a V8 isolate. The names differ; the architecture is the same move from "call tools" to "write code that calls tools."
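To make the shape of the pattern concrete, here is a minimal sketch of the kind of script an agent might write. The two wrapper functions are hypothetical stand-ins defined inline so the example runs on its own; in a real setup they would live in generated modules and forward to the corresponding MCP tools, and their names and signatures would come from your own servers.

```python
# Hypothetical stand-ins for generated wrapper modules. In a real deployment
# each function would forward its arguments to an MCP tool call.
def gdrive_get_document(document_id: str) -> str:
    return "meeting notes line\n" * 5_000  # pretend this is a large document


def salesforce_update_record(record_id: str, fields: dict) -> dict:
    return {"id": record_id, "updated_fields": sorted(fields), "ok": True}


def agent_script() -> str:
    """The kind of script the model would write and run in the sandbox."""
    notes = gdrive_get_document("doc-123")  # large payload stays in the sandbox
    result = salesforce_update_record("lead-456", {"Notes": notes})
    # Only this short summary is returned to the model's context.
    return (f"updated {result['id']} with {len(notes)} characters of notes "
            f"(ok={result['ok']})")


print(agent_script())
```

The document body never enters the context window: it moves from one function to the other inside the execution environment, and the model sees a one-line summary.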
The Headline Numbers, and What Was Actually Measured
The 98.7% figure is real but specific. It is Anthropic's measurement of one workflow (Google Drive to Salesforce) where the dominant cost was loading large tool definitions and round-tripping document data. When both of those costs are high, eliminating them produces a near-total collapse in token usage. Treat 98.7% as a best case for definition-heavy, data-heavy workflows, not as a number you will hit on every agent.
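It also helps to keep Anthropic's two November 2025 releases straight, because they are frequently conflated. The November 4 post introduced code execution with MCP; the November 24 post introduced advanced tool use, which covers the Tool Search Tool and Programmatic Tool Calling.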
These are complementary, not competing. The Tool Search Tool fixes the upfront-definition cost without a sandbox. Code execution fixes both the definition cost and the data round-tripping, at the cost of running model-generated code. The 98.7% headline belongs to the November 4 code-execution post; the 85% figure belongs to the November 24 advanced tool use post. Citing one date with the other number is the most common error in coverage of this topic.
| Release | Date | Mechanism | Reported reduction |
|---|---|---|---|
| Code execution with MCP | Nov 4, 2025 | Tools as a code API in a sandbox | 150K to 2K tokens (98.7%) on a Drive-to-Salesforce task |
| Advanced tool use (Tool Search Tool) | Nov 24, 2025 | On-demand loading of tool definitions | Tool definitions ~77K to ~8.7K (85%) |
| Advanced tool use (Programmatic Tool Calling) | Nov 24, 2025 | Model orchestrates calls in code | Avg usage 43,588 to 27,297 tokens (37%) on complex research |
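The percentages for rows that quote exact token counts can be checked with one line of arithmetic. The helper name `reduction_pct` is just a convenience for this sketch.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage reduction from `before` tokens to `after` tokens."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (1 - after / before) * 100


# Token counts as quoted in the table above.
print(f"{reduction_pct(150_000, 2_000):.1f}%")   # code execution with MCP
print(f"{reduction_pct(43_588, 27_297):.1f}%")   # Programmatic Tool Calling
```

Running the same helper on your own before and after counts is a quick way to put a new measurement on the same scale as the published ones.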
Reproductions in the Wild
The pattern would be a curiosity if only Anthropic could hit the numbers. It is not. Independent testers reproduced the direction and magnitude across different models and stacks, as the table later in this section shows.
The Bifrost result is the most useful one for planning, because it isolates the variable that matters: tool count. At 96 tools the savings were 58%; at 508 tools they were 92.8%. That is the whole economic argument in one data series. The more tool definitions you would otherwise load, the more code execution saves, which is exactly why a small agent sees little benefit and a large enterprise stack sees the headline figures.
The AIMultiple test is the most honest about cost, because it ran a controlled comparison and reported the latency penalty alongside the token win. That penalty is the part most write-ups skip.
| Source | Setup | Input-token reduction | Notes |
|---|---|---|---|
| AIMultiple | GPT-4.1 with Bright Data's MCP server, 50 runs | 78.5% (165K vs 771K) | 100% task success; per-call latency up ~7% |
| Bifrost | Scaling tool count | 58% (96 tools) to 92.8% (508 tools) | 100% pass rate held across rounds |
| Cloudflare Code Mode | 2,500-endpoint API | 99.9% (1.17M to ~1K) | Runs generated JS in a V8 isolate |
| GitHub Discussion #629 | 112 GitHub tools | 99.2% (150K to ~1.2K) | Community reproduction, Nov 2025 |
Where the Savings Come From, and the Hidden Costs
The savings are not magic. They come from three places: not loading unused tool definitions, not moving intermediate data through context, and collapsing multiple model turns into one code execution. All three are real, and all three have a bill attached.
The first hidden cost is latency. AIMultiple measured a roughly 7% per-call increase (10.37s versus 9.66s), because the model now generates more output tokens of code and waits on a sandbox to run it. The widely repeated claim of a flat "30 to 40% latency improvement" does not hold up against primary measurement; what is true is that collapsing eight sequential tool calls into one script removes seven model round trips, so end-to-end task latency often drops even though each individual call is slightly slower. Per-call latency and total-task latency are different metrics, and you should track them separately.
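The difference between the two metrics is easy to see with a small calculation. The per-call figures are the AIMultiple numbers quoted above; the eight-call task is hypothetical, and the sketch assumes every model round trip costs the same as the measured per-call latency, which real workloads will not match exactly.

```python
PER_CALL_DIRECT_S = 9.66   # direct tool calling, per call (AIMultiple figure)
PER_CALL_CODE_S = 10.37    # code execution, per call (AIMultiple figure)

# Metric 1: per-call latency gets slightly worse.
per_call_increase = (PER_CALL_CODE_S - PER_CALL_DIRECT_S) / PER_CALL_DIRECT_S

# Metric 2: total-task latency for a hypothetical task that needs eight
# sequential tool calls directly, or one script under code execution.
direct_total = 8 * PER_CALL_DIRECT_S
code_total = 1 * PER_CALL_CODE_S

print(f"per-call change: +{per_call_increase:.1%}")
print(f"total task: {direct_total:.2f}s direct vs {code_total:.2f}s with one script")
```

A dashboard that only tracks the first metric would report a regression on a change that makes the overall task several times faster.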
The second hidden cost is output tokens. You are trading input tokens, which are cheap and cacheable, for output tokens, which cost more per token and which the model now spends writing code. On most pricing this is still a large net win because input definitions dominated the bill, but it is not free, and it is why you should measure total cost, not just the input-token reduction that makes the best headline. The same discipline we recommend in reducing LLM costs through token optimization applies: optimize against the full bill.
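A short sketch of what measuring the full bill looks like. The input-token counts are the AIMultiple figures from the table above; the output-token counts and the per-million-token prices are hypothetical placeholders, so substitute your own provider's pricing and your own measurements.

```python
def total_cost(input_tokens: int, output_tokens: int,
               input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars for one task, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices (dollars per million tokens) and output-token counts.
INPUT_PRICE, OUTPUT_PRICE = 2.0, 8.0
direct = total_cost(771_000, 5_000, INPUT_PRICE, OUTPUT_PRICE)
code = total_cost(165_000, 20_000, INPUT_PRICE, OUTPUT_PRICE)

input_reduction = 1 - 165_000 / 771_000
cost_reduction = 1 - code / direct
print(f"input tokens down {input_reduction:.1%}, total cost down {cost_reduction:.1%}")
```

Under these assumptions the cost reduction is still large but noticeably smaller than the input-token reduction, because the extra output tokens are billed at the higher rate.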
The third hidden cost is the sandbox itself, which is significant enough to deserve its own section.
Sandboxing and Security: The Real Price of Admission
Code execution means running model-generated code, and model-generated code can be steered by prompt injection in any document or tool result the agent touches. This is the same threat surface as ordinary tool use, but amplified, because the model can now compose arbitrary operations rather than picking from a fixed menu. The mitigation is not optional and it is not novel: it is the discipline you already apply to any service that executes untrusted code.
A production deployment needs, at minimum, a properly isolated runtime (Cloudflare uses a V8 isolate with no filesystem; others use microVM or container sandboxes of the kind we compared in Modal vs E2B vs Daytona vs Vercel Sandbox), network egress controls so a script cannot exfiltrate data to an arbitrary host, per-tool permission scoping so the Salesforce module cannot reach the filesystem, resource and timeout limits to contain runaway loops, no ambient credentials in the execution environment, and audit logging of every snippet the model runs. Get those right and the blast radius stays contained. Skip them and you have handed a prompt-injectable model a code interpreter with your credentials, which is a materially worse position than plain tool calling.
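A few of these controls can be illustrated with nothing but the standard library. The sketch below covers only three items from the list (a timeout, an environment with no ambient credentials, and an audit log of every snippet) and assumes a POSIX host. A child process is not a security boundary: filesystem isolation, network egress controls, and per-tool permissions need a real isolate, container, or microVM. The names `run_snippet` and `AUDIT_LOG` are made up for this example.

```python
import hashlib
import subprocess
import sys

AUDIT_LOG: list[dict] = []  # in production this would be durable storage


def run_snippet(code: str, timeout_s: float = 5.0) -> str:
    """Run model-generated code in a child process with basic guardrails."""
    # Audit logging: record every snippet before it runs.
    AUDIT_LOG.append({"sha256": hashlib.sha256(code.encode()).hexdigest(),
                      "code": code})
    try:
        completed = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            env={},                # no ambient credentials from the parent
            capture_output=True,
            text=True,
            timeout=timeout_s,     # contain runaway loops
        )
    except subprocess.TimeoutExpired:
        return "error: timed out"
    if completed.returncode != 0:
        return "error: snippet failed"
    return completed.stdout.strip()


print(run_snippet("print(1 + 1)"))
print(run_snippet("while True: pass", timeout_s=0.5))
```

The useful habit here is the structure rather than the mechanism: every snippet passes through one choke point that logs it, strips the environment, and enforces limits, so swapping the child process for a proper sandbox later changes one function.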
This is the honest reason code execution is not a default. The token savings are large and well-replicated, but they are paid for with operational surface area that a small team may not want to own.
When to Use Code Execution vs the Alternatives
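Pick the lightest tool that solves your actual problem. For a handful of tools, plain tool calling is enough. When upfront definitions are the main cost, on-demand loading such as the Tool Search Tool addresses that without a sandbox. When both definitions and intermediate data dominate, code execution covers both.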
A simple decision rule: if tool definitions plus intermediate data are not a large share of your token bill, do not add a sandbox. If they are, and your toolset is big enough that Bifrost-style scaling applies, code execution is likely the highest-leverage change you can make to an agent's cost profile.
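The decision rule above can be written down as a small function. The thresholds are assumptions chosen for illustration (the post gives no cutoff for "a large share" or "big enough"), so tune `share_threshold` and `large_toolset` against your own bills.

```python
def recommend(definition_tokens: int, intermediate_tokens: int,
              total_tokens: int, tool_count: int,
              share_threshold: float = 0.5, large_toolset: int = 100) -> str:
    """Apply the decision rule with hypothetical thresholds."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    overhead_share = (definition_tokens + intermediate_tokens) / total_tokens
    if overhead_share < share_threshold:
        # Definitions and intermediate data are not the problem: no sandbox.
        return "keep direct tool calling"
    if tool_count >= large_toolset:
        return "evaluate code execution"
    return "try a lighter-weight alternative first"


print(recommend(2_000, 1_000, 30_000, tool_count=5))
print(recommend(60_000, 20_000, 100_000, tool_count=300))
```

The inputs are the point: to use the rule at all you need per-task counts of definition tokens and intermediate-data tokens, which most dashboards do not break out by default.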
Implementation Checklist for Production
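If you decide the pattern fits, treat the rollout as an infrastructure project, not a prompt change. The work is the sandbox controls described earlier (isolation, egress limits, per-tool permissions, resource and timeout limits, no ambient credentials, audit logging) plus before-and-after measurement of total cost, per-call latency, and total-task latency.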
Code execution with MCP is not a silver bullet, and the 98.7% figure is the best case rather than the expectation. But the underlying economics are sound and unusually well-replicated for a pattern this young: the more tools you wire into a model, the more it costs to do it the default way, and writing code against those tools is how teams running large agent stacks are reclaiming their context budget. If your agents are getting expensive as you add capabilities, this is the architecture worth evaluating, and it sits squarely in the AI development tooling decisions that increasingly separate cheap agents from runaway ones.
Frequently Asked Questions
It depends on how many tools you load and how you measure. Anthropic reported the most dramatic result: 150,000 tokens down to 2,000 (98.7%) on a Google Drive to Salesforce workflow. Independent tests land lower but still large: AIMultiple measured a 78.5% input-token reduction with GPT-4.1 against Bright Data's MCP server, and Bifrost saw savings scale with tool count, 58% at 96 tools, 84.5% at 251, and 92.8% at 508. Cloudflare's Code Mode hit 99.9% on a 2,500-endpoint API. The pattern's value grows with the number of tool definitions you would otherwise load upfront, so small toolsets see modest gains while large enterprise stacks see the headline numbers.



