Why does my agent pick the wrong tool when it has too many tools?

Your agent picks the wrong tool because every tool schema sits in the prompt at once, and accuracy collapses as the count grows. Baseline tool-call accuracy drops to roughly 13-15%, near random, once you cross 100 tools, and one model fell from 64% to 20% going from 207 to 417 tools. The model cannot reliably disambiguate between dozens of similar-sounding functions when all their descriptions compete for attention in a single context window. The fix is not a better prompt, it is architecture: retrieve a small candidate set of relevant tools before the call instead of presenting the entire catalog every turn.

What is RAG-MCP and how does it fix tool overload?

RAG-MCP applies retrieval-augmented generation to tool selection: you embed every tool description into a vector index, then at query time retrieve the top-k most relevant tools and load only those into the model's context. The technique comes from arXiv 2505.03275 and restores tool-call accuracy to near-original levels even with hundreds of registered tools. Instead of the model scanning 200 schemas, it sees the 3-10 that semantically match the user's request. This cuts the prompt-bloat penalty (tool schemas otherwise eat 5-7% of context at 50+ tools) and removes the disambiguation problem that drives wrong-tool calls at scale.

How many tools can an AI agent handle before accuracy drops?

Accuracy starts degrading well before 100 tools and approaches random selection past it. Measured baselines put tool-call accuracy around 13-15% at 100+ tools, and the falloff is steep: one model dropped from 64% to 20% accuracy when its toolset grew from 207 to 417 tools. As a practical rule, keep the number of tools actually presented to the model per turn in the single digits to low teens. You can register hundreds of tools across MCP servers, but you should retrieve and load only a handful into context for any given request.

What is two-phase tool selection for agents?

Two-phase tool selection separates discovery from execution: phase one searches a tool registry to find candidates, phase two loads only those candidate schemas and lets the model call them. This search-then-load pattern replaces the default of putting every tool in context at once. Anthropic's tool-search feature and LangGraph tool nodes both implement variants of it. The win is twofold: the model never sees the full catalog, so disambiguation is easier, and you stop paying the token cost of hundreds of unused schemas on every turn. It is the single highest-leverage fix for agents with large tool catalogs.

How do I prevent tool name collisions across multiple MCP servers?

Namespace your tools by server and rewrite descriptions for disambiguation. When you connect several MCP servers plus internal APIs, you routinely get duplicate or near-duplicate names (two search tools, three create tools), and the model cannot tell them apart. Prefix tool names with the server or domain (github_create_issue versus linear_create_issue), and rewrite each description to state exactly when to use this tool and not the similar one. Then measure selection precision against a labeled set of queries. Semantic tool routing with namespacing reaches around 86.4% selection accuracy versus under 50% when everything is dumped into context unnamespaced.

Should I use a tool router or split into sub-agents?

Use a tool router (retrieval or semantic routing) when one agent owns a large but coherent catalog, and split into sub-agents when the catalog spans genuinely different domains or trust boundaries. Routing keeps everything in one loop and is simpler to operate; it handles 50-200 tools well once you add search-then-load. Sub-agents make sense when tool groups have distinct permissions, when context for one domain would pollute another, or when you want independent failure isolation. Many production systems combine both: a top-level router dispatches to domain sub-agents, each of which retrieves from its own smaller tool set.

How much context do tool schemas consume in an agent prompt?

Tool schemas consume roughly 5-7% of the context window before the user's message even arrives once you register 50 or more tools, and the share keeps climbing with the catalog. Each tool carries a name, description, and full JSON Schema for its parameters, and all of it is serialized into the system prompt every turn. At hundreds of tools this is both a cost problem (you pay for those tokens on every call) and an accuracy problem (the model wades through irrelevant schemas to find the right one). Search-then-load eliminates most of this overhead by loading only the retrieved candidates.


Agent Tool Selection at Scale: Fix Picking the Wrong Tool

Tool-call accuracy collapses to ~13%, near random, at 100+ tools. Fix agent tool selection with two-phase search-then-load, RAG-MCP retrieval, and semantic routing (~86% accuracy).

Sebastian Mondragon · MAY 19, 2026 · 15 MIN READ

A coding agent with 200 tools registered will call the wrong one far more often than the right one. That is not hyperbole. Measured tool-call accuracy drops to roughly 13-15%, near random, once an agent's catalog crosses 100 tools, and the slide is steep on the way there. One model fell from 64% to 20% accuracy when its toolset grew from 207 to 417 tools: the same model, the same tasks, just more options to get lost in. Agent tool selection at scale is the failure mode nobody warns you about when you wire up your third MCP server.

The cause is structural, not a prompting mistake. The default integration pattern, the one every tutorial and most frameworks ship by default, puts every tool's full JSON Schema into the system prompt on every turn. With one server and a dozen tools, that is fine. Connect three or four MCP servers, layer in your internal REST APIs, and you are suddenly asking the model to pick the right function out of a flat list of 50 to 200 near-identical descriptions, while those schemas quietly eat 5-7% of your context window before the user has said a word. The model does not have a reasoning problem. It has a haystack problem.

This is a how-to for fixing it. I will walk through why accuracy collapses at catalog scale, then the patterns that solve it: two-phase hierarchical selection (search-then-load), RAG-MCP retrieval, semantic routing and namespacing, description rewrites for disambiguation, and when to split into sub-agents. If your agent is calling the wrong tool, none of the fixes are "use a smarter model." They are all about what you put in front of it.

01 · The Catalog-Scale Problem: Why More Tools Means Worse Agents

Start with the mechanics. A tool, in the function-calling sense, is a name, a natural-language description, and a JSON Schema describing its parameters. When you register tools with a model, all three parts of every tool get serialized into the prompt. The model reads the user request, scans the available tools, and emits a structured call naming one of them with arguments.
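To make the three parts concrete, here is a minimal sketch of a tool definition and the structured call a model would emit against it. The tool name, fields, and the `missing_required` helper are hypothetical, chosen only for illustration; real providers wrap this shape in slightly different envelopes.

```python
import json

# Hypothetical tool definition: a name, a natural-language description,
# and a JSON Schema describing the parameters.
create_issue_tool = {
    "name": "github_create_issue",
    "description": "Creates a new issue in a GitHub repository.",
    "parameters": {
        "type": "object",
        "properties": {
            "repo": {"type": "string", "description": "owner/name of the repository"},
            "title": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["repo", "title"],
    },
}

# The structured call the model is expected to emit: a tool name plus arguments.
example_call = {
    "name": "github_create_issue",
    "arguments": {"repo": "acme/api", "title": "Fix login bug"},
}


def missing_required(tool, call):
    """Return the required parameters that the call did not supply."""
    required = tool["parameters"].get("required", [])
    return [param for param in required if param not in call["arguments"]]


# Roughly the text that gets serialized into the prompt for this one tool.
serialized = json.dumps(create_issue_tool)
```

Every registered tool contributes a blob like `serialized` to the prompt, which is why the catalog size matters so much in the sections that follow.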

That loop works beautifully at small scale and degrades predictably as the catalog grows. There are two distinct penalties, and they compound.

The disambiguation penalty. When ten tools sit in context, their descriptions are usually distinct enough that the right one is obvious. When a hundred sit in context, you almost certainly have clusters of near-duplicates: a search_documents and a search_knowledge_base and a query_index; a create_ticket and create_issue and open_case. The model has to disambiguate within these clusters using only the descriptions you wrote, and most descriptions were written to describe what a tool does in isolation, not when to choose it over its three siblings.

The attention-dilution penalty. Even with perfectly distinct tools, a flat list of 200 schemas is a lot of text to hold in working attention. The relevant tool is one needle; the other 199 are hay. This is the same long-context degradation that shows up everywhere in LLM systems, and it is why your million-token context window is quietly lying about how much it can actually use. More tokens in context does not mean more usable signal.

The 207-to-417 data point is the one I find most useful, because it isolates the variable. Same model, same benchmark tasks, the only change is the size of the tool list, and accuracy collapses. You cannot prompt your way out of that. The information the model needs is present, but it is buried, and burying it is the bug. A more aggressive fix attacks the token cost at its root by exposing tools as a code API the agent calls programmatically, so the model loads only the definitions it actually opens instead of paying for the whole catalog every turn.

There is a cost dimension too. At 50-plus tools, schemas occupy 5-7% of the context before the user message, and you pay for those tokens on every turn, whether or not any of those tools get used. For a high-volume agent, that is real money spent shipping schemas the model mostly ignores.
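If you want a quick read on your own catalog's overhead, a rough estimate is easy to sketch. The version below assumes about four characters per token, which is only a crude heuristic; the function names are hypothetical, and your model's actual tokenizer is the right tool for real numbers.

```python
import json

# Assumption: ~4 characters per token. This is a crude heuristic, not a tokenizer.
CHARS_PER_TOKEN = 4


def estimate_schema_tokens(tools):
    """Approximate the tokens spent serializing every tool definition."""
    total_chars = sum(len(json.dumps(tool)) for tool in tools)
    return total_chars // CHARS_PER_TOKEN


def schema_share(tools, context_window_tokens):
    """Fraction of the context window consumed by tool schemas alone."""
    return estimate_schema_tokens(tools) / context_window_tokens
```

Running `schema_share(all_tools, window)` before and after adding a new MCP server gives you a cheap regression signal for prompt bloat, even if the absolute number is only approximate.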

What the Numbers Actually Show

The degradation is not subtle. Across published evaluations of tool-augmented models, the pattern is consistent:

Tool count in context	Approximate tool-call accuracy	What it means
~10-20 tools	80-95%	Healthy; default integration is fine
~50 tools	meaningful drop, schemas now 5-7% of context	Warning zone; cost and accuracy both bite
100+ tools	~13-15% (near random)	Broken; the model is effectively guessing
207 → 417 tools (one model)	64% → 20%	Doubling the catalog roughly cut accuracy by two-thirds

02 · Two-Phase Hierarchical Selection: Search, Then Load

The fix that addresses both penalties at once is to stop presenting the full catalog. Instead of "here are 200 tools, pick one," you split tool use into two phases:

Search phase. Given the user request (and conversation state), find the small set of tools that could plausibly be relevant. This is a retrieval or routing step, not a generation step.

Load phase. Load only those candidate tool schemas into context and let the model call one. The model now sees five to fifteen tools instead of two hundred.

This is the search-then-load pattern, and it is the single highest-leverage change you can make to a large-catalog agent. Anthropic ships a tool-search capability built on exactly this idea, and LangGraph tool nodes let you implement a discovery step that filters the active tool set before the model call. The mechanics differ but the principle is identical: discovery is separate from execution, and execution only ever sees a handful of tools.

It attacks both penalties at once. Disambiguation gets easier because the near-duplicate clusters are mostly filtered out before the model sees them, attention dilution disappears because there are only a few schemas in context, and token cost drops because you are no longer serializing 200 schemas every turn.
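The two phases can be sketched in a few lines. This is a minimal illustration, not how any particular framework implements it: the search phase here is a naive word-overlap ranker standing in for real retrieval, and the registry, tool names, and function names are all hypothetical.

```python
def keyword_search(request, registry, k):
    """Search phase stand-in: rank tools by word overlap with the request."""
    words = set(request.lower().split())
    scored = []
    for tool in registry:
        overlap = len(words & set(tool["description"].lower().split()))
        if overlap:
            scored.append((overlap, tool["name"]))
    # Highest overlap first; ties broken by name so the result is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


def load_candidates(request, registry, search=keyword_search, k=5):
    """Load phase input: full schemas for only the tools the search returned."""
    by_name = {tool["name"]: tool for tool in registry}
    return [by_name[name] for name in search(request, registry, k)]


registry = [
    {"name": "github_create_issue",
     "description": "create a new issue in a github repository", "parameters": {}},
    {"name": "db_run_query",
     "description": "run a read-only sql query against the analytics database", "parameters": {}},
    {"name": "calendar_create_event",
     "description": "create a calendar event with attendees", "parameters": {}},
]

# Only these schemas would be passed to the model for this turn.
candidates = load_candidates(
    "create an issue in the github repository for this bug", registry, k=2
)
```

The important design choice is that `load_candidates` is the only path by which schemas reach the model, so swapping the search function for embedding retrieval or an LLM router changes nothing downstream.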

Implementing the Search Phase

The search phase can be as simple or as sophisticated as your catalog warrants. Three implementations, in increasing order of robustness:

Keyword or rule-based filtering. If your tools cluster cleanly by domain (all the GitHub tools, all the database tools), a router that matches the request to a domain and exposes only that domain's tools is trivial to build and surprisingly effective. The weakness is requests that span domains or use vocabulary your rules do not anticipate.

Embedding retrieval over tool descriptions. Embed each tool's name and description into a vector index. At query time, embed the user request and retrieve the top-k nearest tools. This is the RAG-MCP approach, covered in the next section, and it is the workhorse for catalogs in the 50-500 range.

LLM-based routing. Use a small, fast model whose only job is to read the request and output the names of relevant tools (it sees a compressed catalog, names and one-line descriptions only, not full schemas). This handles nuance that embeddings miss but adds a model call to the critical path.

Most production systems we have shipped land on embedding retrieval as the default, with a thin rule-based layer in front for the obvious cases. The key decision is what k to retrieve: too low and you risk excluding the right tool, too high and you reintroduce dilution. Start with k around 5-10 and tune it against a labeled query set.
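A thin rule-based layer in front of retrieval might look like the sketch below. The domain keyword map and the `domain` field on each tool are assumptions made for illustration; the point is the fall-through, where a request that matches no rule is handed to the slower retrieval path instead of guessing.

```python
# Hypothetical keyword map; in practice you would derive this from your own catalog.
DOMAIN_KEYWORDS = {
    "github": {"issue", "pull", "repo", "repository", "commit"},
    "database": {"sql", "query", "table", "rows"},
}


def route_domains(request):
    """Return every domain whose keywords appear in the request."""
    words = set(request.lower().split())
    return sorted(
        domain for domain, keywords in DOMAIN_KEYWORDS.items() if words & keywords
    )


def filter_by_domain(request, registry):
    """Expose only the matched domains' tools, or None to signal a fall-through."""
    domains = route_domains(request)
    if not domains:
        return None  # no rule fired: hand the request to embedding retrieval
    return [tool for tool in registry if tool["domain"] in domains]
```

Returning `None` rather than an empty list keeps "no rule matched" distinct from "rules matched but the domain has no tools", which makes the fallback logic easier to reason about.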

03 · RAG-MCP: Retrieval-Augmented Tool Selection

RAG-MCP, introduced in arXiv 2505.03275, is the cleanest formalization of embedding-based tool retrieval, and it is worth understanding precisely because the name tells you the whole design. You apply the retrieval-augmented generation pattern, the same one that powers document RAG, to tools instead of documents.

The setup:

Index your tools. For every registered tool, build a text representation (name plus description, optionally plus a few example invocations) and embed it. Store the embeddings in a vector index alongside the tool's full schema and a stable tool ID.

Retrieve at query time. When a user request arrives, embed it and run a similarity search against the tool index. Take the top-k tools.

Inject and call. Load the full schemas for only those top-k tools into the model's context and let it emit a tool call.

The headline result from the paper is that this restores tool-call accuracy to near-original levels even as the registered catalog grows into the hundreds. The reason is intuitive once you have seen the catalog-scale numbers: the model's accuracy was never a function of how many tools exist, it was a function of how many it has to disambiguate in context. RAG-MCP keeps that number small and constant regardless of catalog size. You can register 500 tools and the model still only ever reasons over 8.
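The index, retrieve, inject loop is small enough to sketch end to end. To stay self-contained, the example below uses a bag-of-words counter as a stand-in for a real embedding model; that substitution is an assumption for illustration only, and the class and tool names are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: bag-of-words counts. Swap in a real embedding model."""
    return Counter(text.lower().replace("_", " ").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToolIndex:
    def __init__(self, tools):
        # Index step: store each tool's vector alongside its full schema.
        self.tools = {tool["name"]: tool for tool in tools}
        self.vectors = {
            tool["name"]: embed(tool["name"] + " " + tool["description"])
            for tool in tools
        }

    def retrieve(self, request, k=3):
        # Retrieve step: rank by similarity, ties broken by name for determinism.
        query = embed(request)
        ranked = sorted(
            self.vectors, key=lambda name: (-cosine(query, self.vectors[name]), name)
        )
        # Inject step: the caller passes only these schemas to the model.
        return [self.tools[name] for name in ranked[:k]]


index = ToolIndex([
    {"name": "docs_search", "description": "search product documentation and help articles"},
    {"name": "code_search", "description": "search source code in the repository"},
    {"name": "crm_search", "description": "search customer records by name or email"},
    {"name": "create_ticket", "description": "create a support ticket"},
])
top = index.retrieve("search customer records for this email", k=2)
```

However many tools you add to `ToolIndex`, the model still only receives `k` schemas, which is the property that keeps accuracy flat as the catalog grows.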

Where RAG-MCP Earns Its Keep, and Where It Bites

This is not free, and the failure modes are the ones RAG practitioners already know. If your tool descriptions are vague or written in language that does not match how users phrase requests, retrieval will miss the right tool, and a missed retrieval is invisible: the model never sees the tool it needed and either picks a wrong one or gives up. That makes description quality (covered below) the load-bearing input.

The retrieval quality question is the same one that determines whether any RAG system works at all. If you have wrestled with getting retrieval to surface the right chunks rather than the merely similar ones, you understand the core risk: semantic similarity between a request and a tool description is a proxy for "is this the right tool," not the thing itself. Treat tool retrieval with the same rigor you would a production document-RAG pipeline, including an offline eval of retrieval recall.

A practical safeguard: always retrieve a slightly larger candidate set than you load, and include a small set of always-on "core" tools (a generic search, a help tool) that are present regardless of retrieval, so the agent has a fallback when retrieval comes up empty.
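The always-on core set is simple to wire in. A minimal sketch, assuming retrieval hands back candidates in rank order; the function name and tool names are hypothetical.

```python
def build_active_set(retrieved, core_tools, load_limit):
    """Core tools first, then retrieved candidates in rank order, deduplicated.

    The result is capped at load_limit, but core tools are never trimmed away.
    """
    active, seen = [], set()
    for tool in list(core_tools) + list(retrieved):
        if tool["name"] not in seen:
            seen.add(tool["name"])
            active.append(tool)
    return active[: max(load_limit, len(core_tools))]
```

Because the core tools are merged before the cap is applied, an empty retrieval result still leaves the agent with a generic search and a help tool instead of an empty tool list.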

04 · Semantic Tool Routing and Namespacing Across MCP Servers

Retrieval handles the "which tools are relevant" question. Namespacing handles a different, sneakier problem that shows up the moment you connect more than one MCP server: name collisions.

Connect a GitHub MCP server, a Linear MCP server, and your internal API, and you will get multiple tools called some variant of create, search, update, and get. Even if the names differ slightly, the descriptions often overlap enough that the model cannot reliably tell which server's search it should use. This is a routing problem layered on top of the selection problem, and it is why simply aggregating MCP servers into one flat tool list degrades fast.

The fixes, in order:

Namespace tool names by source. Prefix every tool with its server or domain: github_create_issue, linear_create_issue, internal_create_ticket. This does real work both for the model (the prefix is a strong disambiguation signal) and for you (logs and evals can attribute calls to the right source). Never expose two tools with the same bare name to the same model.

Route semantically before loading. Semantic tool routing combines retrieval with namespacing: the router uses the request's meaning to pick not just relevant tools but the right server's version of an ambiguous tool. Reported selection accuracy for semantic routing lands around 86.4%, versus under 50% when the full, unnamespaced catalog sits in context. That gap is the difference between an agent that works and one that quietly does the wrong thing half the time.

Mind the security boundary. Namespacing is also where tool selection meets access control. Different MCP servers carry different trust levels, and a confused agent calling the wrong server's write tool is a security incident, not just an accuracy miss. If you are aggregating third-party MCP servers, treat each as an untrusted boundary and read the MCP server security hardening checklist for production before wiring them into a write-capable agent. Routing that lands on the wrong server is bad; routing that lands on a malicious one is worse.
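Namespacing by source is mechanical, and so is checking for bare-name collisions before you aggregate. A minimal sketch, assuming each server exposes a list of tool dicts; the `source` field and both function names are hypothetical.

```python
from collections import defaultdict


def namespace_tools(servers):
    """Prefix each tool name with its server and record where it came from.

    `servers` maps a server name to that server's list of tool dicts.
    """
    namespaced = []
    for server, tools in servers.items():
        for tool in tools:
            namespaced.append(
                {**tool, "name": f"{server}_{tool['name']}", "source": server}
            )
    return namespaced


def find_bare_name_collisions(servers):
    """Report bare tool names that more than one server exposes."""
    owners = defaultdict(list)
    for server, tools in servers.items():
        for tool in tools:
            owners[tool["name"]].append(server)
    return {name: sources for name, sources in owners.items() if len(sources) > 1}
```

Keeping the `source` field on each tool also gives the access-control layer something to check: a write call can be rejected if its source server is not trusted for the current task.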

Open-source tooling exists here too. The vLLM Semantic Router and similar projects implement embedding-based routing as infrastructure, so you are not hand-rolling the retrieval and dispatch layer from scratch.

05 · Rewriting Tool Descriptions for Disambiguation

Everything above depends on one input you control completely: the tool description. Both retrieval and the model's final selection read these descriptions, and most are written badly for the job. The default description tells the model what a tool does. A good description tells the model when to choose it, and when not to.

The shift is from isolated definition to comparative definition. Compare:

Weak: "Searches the knowledge base for relevant documents."

Strong: "Searches the internal product documentation and help articles for answers to how-to and troubleshooting questions. Use this for product usage questions. Do NOT use this for code search (use code_search) or for searching customer records (use crm_search)."

The strong version does three things the weak one does not: it scopes the domain precisely, states the intended query type, and explicitly names the sibling tools it is most likely confused with and tells the model to avoid them. That last part is what disambiguation actually requires. Wrong-tool calls cluster around near-duplicates, so the highest-value edits draw bright lines between tools that sound alike.

A few rules I apply when rewriting a catalog:

Lead with intent, not implementation. Start with the kind of request this tool answers, not the API it wraps.

Name the confusable siblings. If two tools could match the same request, each should mention the other and the distinguishing condition.

Match user vocabulary. Descriptions are retrieval targets. If users say "find" and your description says "query," embedding retrieval may miss it.

Keep parameters honest. A required parameter the model cannot fill from context is a tool it will fail to call even when it selects the right one. This is the same discipline as making any agent use its tools correctly through clear contracts, at description level.

This work is unglamorous and it is the highest-ROI thing you can do after adding search-then-load. You are writing the labels that both your retriever and your model read, so write them for the reader.
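The "name the confusable siblings" rule is easy to enforce with a small lint over the catalog. The sketch below assumes you maintain the confusable groups by hand (or derive them from eval failures); the function name and the example groups are hypothetical.

```python
def missing_sibling_mentions(tools, confusable_groups):
    """Return (tool, sibling) pairs where the tool's description never names the sibling.

    `confusable_groups` is a list of lists of tool names that tend to be confused.
    """
    by_name = {tool["name"]: tool for tool in tools}
    problems = []
    for group in confusable_groups:
        for name in group:
            for sibling in group:
                if sibling != name and sibling not in by_name[name]["description"]:
                    problems.append((name, sibling))
    return problems
```

Run it in CI against the catalog: an empty result means every tool in a confusable group points at its siblings, which is the property the strong description earlier demonstrates.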

Measuring Selection Precision

You cannot improve what you do not measure. Build a labeled set: a few hundred representative requests, each annotated with the correct tool (or correct tool set). Then measure, per change, how often the system selects the right tool. Track two numbers:

Retrieval recall: did the search phase include the correct tool in its candidate set? A miss here is a hard ceiling on everything downstream.
Selection precision: given the candidates, did the model call the right one?

Treat this exactly like evals-driven development for any other AI feature: your labeled set is the regression suite, and every description rewrite, k change, or routing tweak gets validated against it before it ships. Without this, you are tuning blind, and tool-selection regressions are notoriously easy to introduce and hard to notice in production because a wrong-tool call often still produces plausible-looking output.
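Retrieval recall and selection precision can be computed from the same labeled set in one pass. A minimal sketch, assuming one correct tool per query and treating selection precision as conditional on the correct tool having been retrieved; `search` and `select` are hypothetical callables wrapping your retrieval step and your model call.

```python
def evaluate(labeled_queries, search, select):
    """Score a tool-selection pipeline against (query, expected_tool) pairs.

    search(query) returns candidate tool names; select(query, candidates)
    returns the tool name the model chose.
    """
    total = len(labeled_queries)
    recalled = correct = 0
    for query, expected in labeled_queries:
        candidates = search(query)
        if expected in candidates:
            recalled += 1
            if select(query, candidates) == expected:
                correct += 1
    return {
        "retrieval_recall": recalled / total if total else 0.0,
        # Measured only over queries where retrieval surfaced the right tool.
        "selection_precision": correct / recalled if recalled else 0.0,
    }
```

End-to-end accuracy is the product of the two numbers, so reporting them separately tells you whether a regression came from the retriever or from the model's final choice.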

06 · Dynamic Tool Filtering by Task Phase

So far we have treated tool selection as a per-request retrieval problem. There is a second axis: most agents move through phases, and the tools relevant in one phase are noise in another. A coding agent exploring a codebase needs read and search tools; making a change it needs edit and write tools; verifying it needs test-runner tools. Which search tool you hand the exploration phase matters on its own, since semantic code search over an indexed repo returns ranked chunks instead of the whole-file reads grep forces into context. Exposing all of them in every phase reintroduces the dilution you just removed.

Dynamic filtering by phase means the active tool set changes as the agent's state machine advances. You can drive this off the agent's declared intent, an explicit plan step, or the structure of an orchestration graph. The pattern pairs naturally with the ReAct and function-calling agent patterns, where the loop already knows where it is in a task and can gate tools accordingly.

The payoff is twofold. Fewer tools per phase means higher selection accuracy within each phase, on top of whatever retrieval bought you. And it gives you a clean place to enforce safety: write tools simply do not exist in read-only phases, a stronger guarantee than hoping the model chooses not to call them.
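Phase gating can be sketched as a lookup table plus a guard on execution. The phase names and tool names below are hypothetical, and a real agent would drive the current phase from its plan or orchestration graph rather than pass it in by hand.

```python
# Hypothetical phase-to-tool map for a coding agent.
PHASE_TOOLS = {
    "explore": {"read_file", "search_code"},
    "edit": {"read_file", "edit_file", "write_file"},
    "verify": {"read_file", "run_tests"},
}


def tools_for_phase(phase, registry):
    """Return only the tool definitions that are active in this phase."""
    allowed = PHASE_TOOLS[phase]
    return [tool for tool in registry if tool["name"] in allowed]


def guard_call(phase, tool_name):
    """Refuse to execute a tool that is not part of the current phase."""
    if tool_name not in PHASE_TOOLS[phase]:
        raise PermissionError(f"{tool_name} is not available in phase {phase!r}")
```

Filtering what the model sees and guarding what the executor runs are deliberately separate: the first improves accuracy, the second enforces the safety property even if a tool name leaks through from earlier context.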

07 · When to Stop Routing and Split Into Sub-Agents

Routing has a ceiling. If your catalog spans genuinely different domains, different data, different permissions, different failure semantics, then cramming it behind one router is fighting the architecture. At that point, split into sub-agents.

The decision comes down to a few signals:

Signal	Lean toward routing	Lean toward sub-agents
Catalog size	50-200 tools, coherent domain	Hundreds across distinct domains
Permission boundaries	Uniform trust level	Different access scopes per group
Context bleed	Tool groups share context cleanly	One domain's context pollutes another
Failure isolation	A failure in one area is tolerable	You need independent containment

A sub-agent owns a smaller, focused tool set, so each one gets the catalog-scale problem in a tractable size, and you can still apply retrieval within it. A top-level orchestrator dispatches to the right sub-agent based on the request, which is the same routing decision applied one layer up. The patterns that make multi-agent orchestration actually ship are the patterns you want here. The failure mode to avoid is splitting prematurely: sub-agents add real operational complexity, so reach for them when routing genuinely caps out, not before.

In practice the strongest large-catalog architectures combine all three layers. A top-level router dispatches to domain sub-agents; each sub-agent uses RAG-MCP retrieval over its own smaller tool set; and within each agent, dynamic filtering narrows the active set further by task phase. Each layer keeps the number of tools the model actually reasons over small, which is the entire game. For the protocol-level plumbing that ties multiple MCP servers together, the MCP developer guide to building servers and connecting tools covers the wiring.
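The router-over-sub-agents layering can be sketched as a dispatcher that picks a domain and then lets that domain's agent retrieve from its own small tool set. Everything here is a hypothetical stand-in: the classifier is a keyword rule where you would likely use retrieval or a small model, and each sub-agent only returns its candidate tools rather than running a full agent loop.

```python
def classify_domain(request):
    """Stand-in top-level router: a keyword rule."""
    text = request.lower()
    if "issue" in text or "pull request" in text:
        return "github"
    if "sql" in text or "table" in text:
        return "database"
    return "unknown"


def make_sub_agent(domain, tools, k=2):
    """Build a sub-agent that ranks only its own tools by word overlap."""
    def handle(request):
        words = set(request.lower().split())
        ranked = sorted(
            tools,
            key=lambda t: (-len(words & set(t["description"].lower().split())), t["name"]),
        )
        return {"domain": domain, "candidates": [t["name"] for t in ranked[:k]]}
    return handle


def dispatch(request, sub_agents, classify=classify_domain):
    """Route to one sub-agent; fail loudly instead of guessing a domain."""
    domain = classify(request)
    if domain not in sub_agents:
        raise LookupError(f"no sub-agent registered for domain {domain!r}")
    return sub_agents[domain](request)
```

Raising on an unknown domain is a deliberate choice in this sketch: a request that lands on the wrong sub-agent is harder to detect than one that is rejected outright.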

08 · Putting It Together: A Migration Path

If you have an agent that is calling the wrong tool at scale, here is the order I would fix it in, highest leverage first.

Measure first. Build the labeled query set and get a baseline selection-precision number. You need this to know whether anything you do helps.

Add search-then-load. Implement two-phase selection so the model only ever sees a handful of tools per turn. This alone recovers most of the lost accuracy. Use embedding retrieval (RAG-MCP) as the default search phase.

Namespace everything. Prefix tool names by source and ensure no two tools share a bare name. This kills the cross-server collision class.

Rewrite descriptions for disambiguation. Focus on the near-duplicate clusters your eval shows the model confusing. Name the siblings, scope the intent, match user vocabulary.

Add phase-based filtering if your agent has distinct task phases, gating write tools out of read phases for both accuracy and safety.

Split into sub-agents only when the catalog genuinely spans domains that routing cannot keep coherent.

The throughline is the same at every step: tool-selection accuracy depends almost entirely on how many tools the model has to disambiguate in context, not how many exist in your system. Keep that number small and constant, and a 500-tool agent behaves like a 10-tool one. This is the core of the agent reliability work we do at Particula Tech, because the gap between an agent that demos well and one that runs in production is usually exactly this: the demo had eight tools and production has two hundred.

For the broader architecture context, see our pillar guide to building production AI agents, which covers how tool selection fits alongside memory, orchestration, and reliability. Tool selection at scale is not an exotic problem reserved for huge systems. It arrives the moment you connect your third integration, and the fix is the same whether you have 50 tools or 500: never make the model find a needle in your haystack when you can hand it the needle.

09 · FAQ

Quick answers to the questions this post tends to raise.
