How much can code execution with MCP actually reduce token usage?

It depends on how many tools you load and how you measure. Anthropic reported the best-known result: 150,000 tokens down to 2,000 (98.7%) on a Google Drive to Salesforce workflow. Independent results span a wider range: AIMultiple measured a 78.5% input-token reduction with GPT-4.1 against Bright Data's MCP server, Bifrost saw savings scale with tool count (58% at 96 tools, 84.5% at 251, and 92.8% at 508), and Cloudflare's Code Mode hit 99.9% on a 2,500-endpoint API. The pattern's value grows with the number of tool definitions you would otherwise load upfront, so small toolsets see modest gains while large enterprise stacks see the headline numbers.

Does code execution with MCP make agents slower or faster?

It is a tradeoff, not a free speedup. AIMultiple's controlled test measured a roughly 7% latency increase per call (10.37s vs 9.66s) because the agent generates more output tokens of code and then waits for that code to run. The latency benefit is structural, not per-call: a multi-step task that needs eight sequential tool calls can collapse into one code execution round trip, replacing eight model invocations with one. So end-to-end wall-clock time often drops for multi-tool workflows even when a single call gets slightly slower. Treat per-call latency and total task latency as separate metrics, and benchmark your own workload rather than assuming a fixed gain.

What is the difference between code execution with MCP and the Tool Search Tool?

They solve the same problem differently. Anthropic's Tool Search Tool, from the November 24 2025 advanced tool use release, keeps normal tool calling but loads definitions on demand, cutting tool-definition tokens about 85% (from roughly 77K to 8.7K). Code execution with MCP, from November 4 2025, goes further: it exposes tools as a filesystem of code APIs the model calls programmatically, so it can also chain calls, filter results, and avoid piping intermediate data through context. Tool Search is lower risk and needs no sandbox; code execution delivers bigger savings (up to 98.7%) plus orchestration power but requires a secure execution environment. Many teams start with Tool Search and adopt code execution as toolsets grow.

Is code execution with MCP a security risk?

It introduces real attack surface you must contain. Running model-generated code means your sandbox executes instructions that may have been shaped by untrusted input, so isolation is non-negotiable. Cloudflare runs Code Mode inside a V8 isolate with no filesystem; Anthropic's reference pattern uses a sandboxed runtime. You should add network egress controls, per-tool permission scoping, resource and timeout limits, and audit logging of every executed snippet. The same prompt-injection risks that affect tool calling apply here, amplified because the model can now compose operations. Treat the sandbox like any code-execution service: least privilege, no ambient credentials, and regular review of the logs. Done correctly, the blast radius stays contained while you capture the 78 to 99 percent token savings.

When is code execution with MCP not worth it?

Skip it when your agent loads only a handful of tools. Bifrost's data shows savings scale with tool count: 58% at 96 tools, rising to 92.8% at 508. Extrapolating that trend downward, a five-tool agent rarely saves enough to justify the added sandbox and operational complexity. It is also a poor fit for ultra-low-latency single-call workflows, since per-call latency rose about 7% in AIMultiple's test, and for environments where you cannot safely run model-generated code. In those cases, the Tool Search Tool (85% definition savings, no sandbox) or simple tool pruning is a better return. Reserve code execution for large toolsets, multi-step orchestration, or workflows where intermediate results would otherwise flood the context window.

Has anyone reproduced Anthropic's results outside of Anthropic?

Yes, multiple independent reproductions exist. GitHub Discussion #629 (November 19 2025) scaled the pattern to 112 GitHub tools and cut roughly 150,000 tokens to about 1,200, a 99.2% reduction on tool definitions. AIMultiple's test with GPT-4.1 and Bright Data hit 78.5% input-token reduction at 100% success across 50 runs. Bifrost benchmarked 58 to 92.8 percent across 96 to 508 tools with a 100% pass rate. Cloudflare shipped Code Mode (February 2026) at 99.9% on a 2,500-endpoint API. The spread, from 58% at Bifrost's smallest toolset to 99.9% at Cloudflare's largest, reflects toolset size and measurement method, but the direction and magnitude are consistent across vendors and independent testers.


Code Execution With MCP: Cut Tool Tokens up to 98%

Code execution with MCP cut Anthropic's tool tokens from 150K to 2K (98.7%); third-party tests show 78-99% input cuts. How it works and when to use it.

Sebastian Mondragon · June 16, 2026 · 10 min read


Load five MCP servers into a coding agent and you can burn 50,000 tokens of context before the model reads a single word of the user's request. The tool definitions alone, the names, descriptions, and JSON schemas for every available action, get pasted into the system prompt on every turn. Measurements published in April 2026 found that a single heavy MCP configuration consumed roughly 66,000 tokens at startup, about a third of a 200K context window, spent entirely on schemas the model may never use. That is the token tax, and code execution with MCP is the pattern teams are adopting to stop paying it.

The headline number is striking. In its November 2025 engineering post, Anthropic rebuilt a Google Drive to Salesforce workflow so the agent wrote code against its tools instead of calling them one at a time, and watched token usage fall from 150,000 to 2,000, a 98.7% reduction. That single figure is what made "code execution with MCP" trend through every agent-engineering channel for the following six months, and it is also why the "is MCP dead" debate took off: the protocol is fine, but the default way of wiring tools into a model turned out to be wasteful at scale.

This post explains what the pattern actually is, walks through the numbers Anthropic and independent testers measured (including where the savings are smaller than the headline suggests), and lays out when to reach for it versus the lighter-weight alternatives. The short version: code execution trades a chunk of operational complexity, a sandbox you now have to secure, for a token reduction that grows with the size of your tool surface. For a five-tool agent it is overkill. For an enterprise stack wiring dozens of servers into one model, it is close to mandatory.

The Token Tax: Why Loading Every Tool Upfront Breaks at Scale

Standard Model Context Protocol wiring does two expensive things on every agent turn. First, it front-loads tool definitions: every connected server announces all of its tools, and those schemas live in the context window whether or not the task needs them. Second, it round-trips intermediate results: when a tool returns a 10,000-row spreadsheet or a long document, that payload flows back into context so the model can decide what to do next, then often flows out again as an argument to the next tool.

Both costs scale the wrong way. Add a server and every turn gets more expensive, forever, even on tasks that never touch the new tools. Chain three data-heavy tools and the intermediate payloads can dwarf the actual reasoning. Across the agent stacks we have audited, tool-definition overhead is the single most common source of silent context bloat, and it is invisible on a dashboard that only tracks completion tokens. It shows up as a steadily rising prompt-token bill and a context window that fills before the work starts, the same failure mode we covered in agent tool selection at scale, where accuracy collapses once a model is drowning in tool schemas.
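To get a feel for how the definition cost compounds, here is a rough estimator. It is a sketch under stated assumptions, not a measurement: the tool definition is a made-up example in the general name/description/inputSchema shape MCP servers advertise, and the four-characters-per-token ratio is a crude heuristic. Use your model's real tokenizer for actual accounting.

```python
import json

def make_tool(i: int) -> dict:
    """Hypothetical tool definition; not copied from any real server."""
    return {
        "name": f"crm_action_{i}",
        "description": "Create or update a record in the CRM, with validation.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "object_type": {"type": "string", "description": "Record type"},
                "fields": {"type": "object", "description": "Field values"},
            },
            "required": ["object_type", "fields"],
        },
    }

def estimate_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token), an assumption for illustration.
    return len(text) // 4

def definition_overhead(tools: list[dict], turns: int) -> int:
    """Tokens spent on definitions when every schema is resent on every turn."""
    per_turn = sum(estimate_tokens(json.dumps(tool)) for tool in tools)
    return per_turn * turns

tools = [make_tool(i) for i in range(100)]
one_turn = definition_overhead(tools, turns=1)
ten_turns = definition_overhead(tools, turns=10)
print(f"1 turn: ~{one_turn} tokens, 10 turns: ~{ten_turns} tokens")
```

The point of the sketch is the multiplication: the overhead is paid per turn, so it grows with both the number of tools and the length of the conversation, regardless of which tools the task uses.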

The insight behind code execution is that models are already excellent at one thing that makes both costs disappear: writing code. If the tools are presented as a code API rather than as a list of callable schemas, the model can import only what it needs, loop and filter in the execution environment, and return just the final answer to context.

How Code Execution With MCP Works

Instead of registering each MCP tool as a model-callable function, you expose the MCP servers as code, typically a generated directory of files where each tool becomes a function in a small TypeScript or Python module. The agent does not call salesforce_create_record through the model's tool-use interface. It writes a short script that imports the Salesforce module, calls the function, processes the result, and prints only what matters.

The mechanics that produce the savings are straightforward once you see them:

Progressive disclosure. The model lists the available tool files and reads only the definitions for the handful it decides to use, rather than ingesting all of them upfront. A 200-tool surface costs almost nothing until a specific tool is opened.

In-environment data handling. Filtering 10,000 rows down to the 3 that matter happens in the sandbox, in code, so 9,997 rows never enter the context window. Intermediate results stay where they are produced. It is the same economy a coding agent gets from semantic code search instead of grep-and-read, where returning the few relevant functions keeps whole files out of context.

Composition. The model can chain several operations in one script, with loops and conditionals, instead of a separate model turn per tool call. Eight sequential calls become one execution.
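Progressive disclosure is easy to see with a toy filesystem. The sketch below assumes a hypothetical layout of one stub file per tool under a directory per server, and compares the characters an agent would read if it listed the files and opened one against reading every definition upfront. The stub contents and sizes are invented for illustration.

```python
import tempfile
from pathlib import Path

def build_tool_tree(root: Path, servers: dict[str, list[str]]) -> None:
    """Write one stub file per tool (hypothetical layout: <server>/<tool>.py)."""
    for server, tool_names in servers.items():
        (root / server).mkdir(parents=True, exist_ok=True)
        for name in tool_names:
            stub = (
                f"def {name}(**kwargs):\n"
                f'    """Call the {server} tool {name}."""\n'
                + "    # parameter documentation would go here\n" * 40
            )
            (root / server / f"{name}.py").write_text(stub)

def list_tools(root: Path) -> list[str]:
    """What the agent sees first: file paths only, no definitions."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.py"))

def read_tool(root: Path, rel_path: str) -> str:
    """Open a single definition on demand."""
    return (root / rel_path).read_text()

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    names = [f"action_{i}" for i in range(100)]
    build_tool_tree(root, {"salesforce": names, "google_drive": names})
    listing = list_tools(root)
    # Upfront loading: every definition enters context.
    everything = sum(len(read_tool(root, rel)) for rel in listing)
    # On-demand loading: the listing plus the one definition actually needed.
    on_demand = len("\n".join(listing)) + len(read_tool(root, "salesforce/action_7.py"))

print(len(listing), "tools;", on_demand, "chars on demand vs", everything, "upfront")
```

The absolute numbers mean nothing outside this toy, but the shape matches the claim above: the listing is cheap, and the cost of a definition is only paid when that tool is opened.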

Anthropic's framing is that tools become "a filesystem of code APIs." Cloudflare shipped essentially the same idea under the name Code Mode, generating a typed API from MCP tool schemas and letting the model write JavaScript against it inside a V8 isolate. The names differ; the architecture is the same move from "call tools" to "write code that calls tools."

The Headline Numbers, and What Was Actually Measured

The 98.7% figure is real but specific. It is Anthropic's measurement of one workflow (Google Drive to Salesforce) where the dominant cost was loading large tool definitions and round-tripping document data. When both of those costs are high, eliminating them produces a near-total collapse in token usage. Treat 98.7% as a best case for definition-heavy, data-heavy workflows, not as a number you will hit on every agent.

It also helps to keep Anthropic's two November 2025 releases straight, because they are frequently conflated:

Release	Date	Mechanism	Reported reduction
Code execution with MCP	Nov 4, 2025	Tools as a code API in a sandbox	150K to 2K tokens (98.7%) on a Drive-to-Salesforce task
Advanced tool use (Tool Search Tool)	Nov 24, 2025	On-demand loading of tool definitions	Tool definitions ~77K to ~8.7K (85%)
Advanced tool use (Programmatic Tool Calling)	Nov 24, 2025	Model orchestrates calls in code	Avg usage 43,588 to 27,297 tokens (37%) on complex research

These are complementary, not competing. The Tool Search Tool fixes the upfront-definition cost without a sandbox. Code execution fixes both the definition cost and the data round-tripping, at the cost of running model-generated code. The 98.7% headline belongs to the November 4 code-execution post; the 85% figure belongs to the November 24 advanced tool use post. Citing one date with the other number is the most common error in coverage of this topic.

Reproductions in the Wild

The pattern would be a curiosity if only Anthropic could hit the numbers. It is not. Independent testers reproduced the direction and magnitude across different models and stacks:

Source	Setup	Input-token reduction	Notes
AIMultiple	GPT-4.1 vs Bright Data MCP, 50 runs	78.5% (165K vs 771K)	100% task success; per-call latency up ~7%
Bifrost	Scaling tool count	58% (96 tools) to 92.8% (508 tools)	100% pass rate held across rounds
Cloudflare Code Mode	2,500-endpoint API	99.9% (1.17M to ~1K)	Runs generated JS in a V8 isolate
GitHub Discussion #629	112 GitHub tools	99.2% (150K to ~1.2K)	Community reproduction, Nov 2025

The Bifrost result is the most useful one for planning, because it isolates the variable that matters: tool count. At 96 tools the savings were 58%; at 508 tools they were 92.8%. That is the whole economic argument in one data series. The more tool definitions you would otherwise load, the more code execution saves, which is exactly why a small agent sees little benefit and a large enterprise stack sees the headline figures.

The AIMultiple test is the most honest about cost, because it ran a controlled comparison and reported the latency penalty alongside the token win. That penalty is the part most write-ups skip.
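The reported percentages can be sanity-checked from the before and after token counts quoted in this post. The inputs below are the rounded figures as written here (165K, 771K, and so on), so a recomputed value can differ from a reported one in the last decimal place.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction from `before` tokens to `after` tokens."""
    return round((1 - after / before) * 100, 1)

# (before, after) token counts as quoted in this post, rounded.
reported = {
    "Anthropic Drive-to-Salesforce": (150_000, 2_000),
    "AIMultiple": (771_000, 165_000),
    "Cloudflare Code Mode": (1_170_000, 1_000),
    "GitHub Discussion #629": (150_000, 1_200),
}

results = {name: reduction_pct(before, after) for name, (before, after) in reported.items()}
for name, pct in results.items():
    print(f"{name}: {pct}%")
```

Running the same function on your own before and after prompt-token counts gives a number you can compare directly against these.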

Where the Savings Come From, and the Hidden Costs

The savings are not magic. They come from three places: not loading unused tool definitions, not moving intermediate data through context, and collapsing multiple model turns into one code execution. All three are real, and all three have a bill attached.

The first hidden cost is latency. AIMultiple measured a roughly 7% per-call increase (10.37s versus 9.66s), because the model now generates more output tokens of code and waits on a sandbox to run it. The widely repeated claim of a flat "30 to 40% latency improvement" does not hold up against primary measurement; what is true is that collapsing eight sequential tool calls into one script removes seven model round trips, so end-to-end task latency often drops even though each individual call is slightly slower. Per-call latency and total-task latency are different metrics, and you should track them separately.
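A back-of-the-envelope model makes the two metrics concrete. The per-call figures are AIMultiple's; everything else is a simplifying assumption for illustration: every model round trip costs the same, nothing runs in parallel, and each tool call made inside the script adds a hypothetical one second of sandbox time.

```python
PER_CALL_TOOL_S = 9.66    # AIMultiple: per-call latency with direct tool calling
PER_CALL_CODE_S = 10.37   # AIMultiple: per-call latency with code execution
IN_SANDBOX_TOOL_S = 1.0   # hypothetical cost of each tool call made from inside the script

def per_call_change_pct(baseline_s: float, new_s: float) -> float:
    return (new_s / baseline_s - 1) * 100

def sequential_task_latency(round_trips: int) -> float:
    """One model round trip per tool call."""
    return PER_CALL_TOOL_S * round_trips

def collapsed_task_latency(tool_calls: int) -> float:
    """One model round trip; the tool calls run inside a single script."""
    return PER_CALL_CODE_S + IN_SANDBOX_TOOL_S * tool_calls

change = per_call_change_pct(PER_CALL_TOOL_S, PER_CALL_CODE_S)
sequential = sequential_task_latency(8)
collapsed = collapsed_task_latency(8)
print(f"per call: +{change:.1f}%")
print(f"eight-step task: {sequential:.1f}s sequential vs {collapsed:.1f}s collapsed")
```

Under these assumptions the single call is slower and the whole task is much faster. Real workloads will not match these numbers, which is the argument for measuring both metrics on your own traffic.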

The second hidden cost is output tokens. You are trading input tokens (cheap, cacheable) for output tokens (more expensive, the model writing code). On most pricing this is still a large net win because input definitions dominated the bill, but it is not free, and it is why you should measure total cost, not just the input-token reduction that makes the best headline. The same discipline we recommend in reducing LLM costs through token optimization applies: optimize against the full bill.
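Here is the arithmetic with placeholder numbers. The prices are hypothetical (dollars per million tokens, with output priced at five times input), and the output-token counts are invented to represent the model writing a short script instead of a tool call. Only the 150,000 and 2,000 input figures come from Anthropic's example.

```python
# Hypothetical prices in dollars per million tokens; substitute your provider's rates.
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

def cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

# Input counts from Anthropic's example; output counts are hypothetical.
before = cost(input_tokens=150_000, output_tokens=500)
after = cost(input_tokens=2_000, output_tokens=1_500)

input_saving = (1 - 2_000 / 150_000) * 100
total_saving = (1 - after / before) * 100
print(f"input-token saving: {input_saving:.1f}%")
print(f"total cost saving:  {total_saving:.1f}%")
```

With these placeholder values the total saving is still large but noticeably below the input-token headline, because the extra output tokens are billed at the higher rate. Prompt caching, which the sketch ignores, would lower the baseline input cost and narrow the gap further.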

The third hidden cost is the sandbox itself, which is significant enough to deserve its own section.

Sandboxing and Security: The Real Price of Admission

Code execution means running model-generated code, and model-generated code can be steered by prompt injection in any document or tool result the agent touches. This is the same threat surface as ordinary tool use, but amplified, because the model can now compose arbitrary operations rather than picking from a fixed menu. The mitigation is not optional and it is not novel: it is the discipline you already apply to any service that executes untrusted code.

A production deployment needs, at minimum, a properly isolated runtime (Cloudflare uses a V8 isolate with no filesystem; others use microVM or container sandboxes of the kind we compared in Modal vs E2B vs Daytona vs Vercel Sandbox), network egress controls so a script cannot exfiltrate data to an arbitrary host, per-tool permission scoping so the Salesforce module cannot reach the filesystem, resource and timeout limits to contain runaway loops, no ambient credentials in the execution environment, and audit logging of every snippet the model runs. Get those right and the blast radius stays contained. Skip them and you have handed a prompt-injectable model a code interpreter with your credentials, which is a materially worse position than plain tool calling.
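To make three of those controls concrete (no ambient credentials, a hard timeout, and an audit trail), here is a deliberately small sketch using only the Python standard library. It is not a sandbox and should not be treated as a security boundary: a child process still shares the host's filesystem and network, which is why production systems use isolates, containers, or microVMs. The function name, the environment allowlist, and the log format are all hypothetical.

```python
import hashlib
import os
import subprocess
import sys

# Only these variables are passed to the child; API keys and tokens are not.
ALLOWED_ENV = {"PATH", "LD_LIBRARY_PATH", "SYSTEMROOT"}
AUDIT_LOG: list[dict] = []

def run_snippet(code: str, timeout_s: float = 5.0) -> dict:
    """Run a snippet in a child interpreter with a scrubbed environment and a timeout.

    Illustrates the controls only; this does NOT isolate the filesystem or network.
    """
    entry = {"sha256": hashlib.sha256(code.encode()).hexdigest(), "code": code}
    child_env = {k: v for k, v in os.environ.items() if k in ALLOWED_ENV}
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores PYTHON* variables
            env=child_env,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it runs past the limit
        )
        status = "ok" if proc.returncode == 0 else "error"
        entry.update(status=status, stdout=proc.stdout[:2000])
    except subprocess.TimeoutExpired:
        entry.update(status="timeout", stdout="")
    AUDIT_LOG.append(entry)  # every snippet is recorded, whether it succeeded or not
    return entry

result = run_snippet("print(sum(range(10)))")
print(result["status"], result["stdout"].strip())
```

Egress rules, per-tool permission scoping, and memory limits are absent here because they belong to the isolation layer itself, not to a wrapper around `subprocess`.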

This is the honest reason code execution is not a default. The token savings are large and well-replicated, but they are paid for with operational surface area that a small team may not want to own.

When to Use Code Execution vs the Alternatives

Pick the lightest tool that solves your actual problem:

Plain tool calling. Fine for a handful of tools and simple, single-call tasks. If your tool definitions are a small fraction of your context, there is nothing to fix.

Tool Search Tool. The right first move when the problem is purely too many tool definitions. It delivers about 85% definition savings with no sandbox and minimal architectural change. Most teams should reach for this before code execution.

Code execution with MCP. The right move when you have a large tool surface (dozens to hundreds of tools), multi-step orchestration, or data-heavy intermediate results that flood context, and you can safely run model-generated code. This is where the 78 to 99 percent numbers live.

A simple decision rule: if tool definitions plus intermediate data are not a large share of your token bill, do not add a sandbox. If they are, and your toolset is big enough that Bifrost-style scaling applies, code execution is likely the highest-leverage change you can make to an agent's cost profile.
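That rule can be written down as a small function, which is a useful exercise because it forces you to pick thresholds. The two thresholds below are hypothetical placeholders, not figures from any of the benchmarks, and the function is one reading of the guidance in this section, not a definitive policy.

```python
# Hypothetical thresholds; tune them against your own token accounting.
OVERHEAD_SHARE_THRESHOLD = 0.30  # share of prompt tokens spent on definitions + intermediate data
LARGE_TOOLSET = 50               # stand-in for "dozens to hundreds" of tools

def recommend(tool_count: int, overhead_share: float, can_sandbox: bool,
              multi_step_or_data_heavy: bool) -> str:
    """Pick the lightest approach that addresses the measured problem."""
    if overhead_share < OVERHEAD_SHARE_THRESHOLD:
        return "plain tool calling"      # nothing worth fixing
    if not can_sandbox:
        return "tool search"             # cannot safely run model-generated code
    if tool_count >= LARGE_TOOLSET or multi_step_or_data_heavy:
        return "code execution"
    return "tool search"                 # too many definitions, but nothing else

print(recommend(tool_count=5, overhead_share=0.05, can_sandbox=True, multi_step_or_data_heavy=False))
print(recommend(tool_count=30, overhead_share=0.40, can_sandbox=True, multi_step_or_data_heavy=False))
print(recommend(tool_count=300, overhead_share=0.60, can_sandbox=True, multi_step_or_data_heavy=True))
```

The ordering of the checks matters more than the exact numbers: measure the overhead first, check whether you can run a sandbox second, and only then look at toolset size.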

Implementation Checklist for Production

If you decide the pattern fits, treat the rollout as an infrastructure project, not a prompt change:

Measure first. Instrument prompt-token usage and confirm tool definitions plus intermediate payloads actually dominate. If they do not, stop here.

Generate the code API. Convert MCP tool schemas into typed modules (one file per server) so the model can read definitions on demand rather than upfront.

Stand up an isolated sandbox. Choose a runtime with no ambient credentials, enforced egress rules, and hard CPU, memory, and time limits.

Scope permissions per tool. Each module gets only the access it needs; a read tool cannot write, and a data tool cannot reach the network.

Log and review every execution. Keep an audit trail of generated code, treat it as a security-relevant artifact, and alert on anomalous patterns.

Benchmark total cost and both latency metrics. Confirm the input-token win survives the output-token and latency tradeoff on your real workload before declaring victory.
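The "generate the code API" step is the least familiar item on that list, so here is a minimal sketch of it. It turns a tool definition in the name/description/inputSchema shape into Python source for a typed wrapper. The example tool, the `mcp_client` module, and the `call_mcp_tool` helper are hypothetical names introduced for illustration; a real generator would also handle nested schemas, enums, and name collisions.

```python
JSON_TO_PY = {"string": "str", "integer": "int", "number": "float",
              "boolean": "bool", "object": "dict", "array": "list"}

def render_stub(tool: dict) -> str:
    """Render one wrapper function as source text from a tool definition."""
    schema = tool.get("inputSchema", {})
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    # Required parameters first, so defaults never precede non-defaults.
    ordered = sorted(props, key=lambda name: name not in required)
    params = []
    for name in ordered:
        py_type = JSON_TO_PY.get(props[name].get("type"), "object")
        if name in required:
            params.append(f"{name}: {py_type}")
        else:
            params.append(f"{name}: {py_type} | None = None")
    args = ", ".join(f'"{name}": {name}' for name in ordered)
    doc = tool.get("description", "").replace('"""', "'''")
    return (
        f"def {tool['name']}({', '.join(params)}) -> dict:\n"
        f'    """{doc}"""\n'
        f"    return call_mcp_tool({tool['name']!r}, {{{args}}})\n"
    )

def render_module(tools: list[dict]) -> str:
    # call_mcp_tool is a hypothetical transport helper, not a real library function.
    header = "from mcp_client import call_mcp_tool\n\n"
    return header + "\n\n".join(render_stub(tool) for tool in tools)

example_tool = {
    "name": "create_record",
    "description": "Create a record of the given type.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "owner": {"type": "string"},
            "object_type": {"type": "string"},
            "fields": {"type": "object"},
        },
        "required": ["object_type", "fields"],
    },
}

module_source = render_module([example_tool])
print(module_source)
```

Writing one such module per server gives the agent files it can list and open individually, which is what makes on-demand reading of definitions possible.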

Code execution with MCP is not a silver bullet, and the 98.7% figure is the best case rather than the expectation. But the underlying economics are sound and unusually well-replicated for a pattern this young: the more tools you wire into a model, the more it costs to do it the default way, and writing code against those tools is how teams running large agent stacks are reclaiming their context budget. If your agents are getting expensive as you add capabilities, this is the architecture worth evaluating, and it sits squarely in the AI development tooling decisions that increasingly separate cheap agents from runaway ones. The same overhead shows up at the harness level too: our measurement of Claude Code's 27-tool schema against OpenCode's leaner 10-tool preamble breaks down what a coding agent pays before it reads your first message, separate from whatever MCP tools you add on top.

FAQ

Quick answers to the questions this post tends to raise.
