At 100+ tools, baseline tool-call accuracy drops to roughly 13-15%, near random, and one model fell from 64% to 20% going from 207 to 417 tools. Tool schemas eat 5-7% of context before the user message at 50+ tools. Fix it with two-phase hierarchical selection (search-then-load) and RAG-MCP retrieval, which restores accuracy to near-original. Semantic tool routing hits ~86.4% selection accuracy versus under 50% when everything sits in context.
A coding agent with 200 tools registered will call the wrong one far more often than the right one. That is not hyperbole. Measured tool-call accuracy drops to roughly 13-15%, near random, once an agent's catalog crosses 100 tools, and the slide is steep on the way there. One model fell from 64% to 20% accuracy when its toolset grew from 207 to 417 tools: the same model, the same tasks, just more options to get lost in. Agent tool selection at scale is the failure mode nobody warns you about when you wire up your third MCP server.
The cause is structural, not a prompting mistake. The default integration pattern, the one most tutorials and frameworks ship, puts every tool's full JSON Schema into the prompt on every turn. With one server and a dozen tools, that is fine. Connect three or four MCP servers, layer in your internal REST APIs, and you are suddenly asking the model to pick the right function out of a flat list of 50 to 200 descriptions, many of them near-identical, while those schemas quietly eat 5-7% of your context window before the user has said a word. The model does not have a reasoning problem. It has a haystack problem.
This is a how-to for fixing it. I will walk through why accuracy collapses at catalog scale, then the patterns that solve it: two-phase hierarchical selection (search-then-load), RAG-MCP retrieval, semantic routing and namespacing, description rewrites for disambiguation, and when to split into sub-agents. If your agent is calling the wrong tool, none of the fixes are "use a smarter model." They are all about what you put in front of it.
The Catalog-Scale Problem: Why More Tools Means Worse Agents
Start with the mechanics. A tool, in the function-calling sense, is a name, a natural-language description, and a JSON Schema describing its parameters. When you register tools with a model, all three parts of every tool get serialized into the prompt. The model reads the user request, scans the available tools, and emits a structured call naming one of them with arguments.
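To make the three parts concrete, here is a minimal sketch of one tool definition and what serializing a whole catalog looks like. The `create_ticket` tool, its fields, and the `input_schema` key are illustrative assumptions in a common function-calling layout, not any specific vendor's format.

```python
import json

# Hypothetical tool: a name, a natural-language description, and a JSON Schema.
create_ticket = {
    "name": "create_ticket",
    "description": "Creates a support ticket in the internal helpdesk.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "Short summary of the issue"},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["title"],
    },
}


def serialize_tools(tools: list[dict]) -> str:
    """Serialize every tool in full, as the default integration does on each turn."""
    return json.dumps(tools, separators=(",", ":"))


one = serialize_tools([create_ticket])
two_hundred = serialize_tools([create_ticket] * 200)  # stand-in for a 200-tool catalog
```

The serialized size grows linearly with the number of registered tools, and all of it is sent whether or not a tool is relevant to the request.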
That loop works beautifully at small scale and degrades predictably as the catalog grows. There are two distinct penalties, and they compound.
The disambiguation penalty. When ten tools sit in context, their descriptions are usually distinct enough that the right one is obvious. When a hundred sit in context, you almost certainly have clusters of near-duplicates: a search_documents and a search_knowledge_base and a query_index; a create_ticket and create_issue and open_case. The model has to disambiguate within these clusters using only the descriptions you wrote, and most descriptions were written to describe what a tool does in isolation, not when to choose it over its three siblings.
The attention-dilution penalty. Even with perfectly distinct tools, a flat list of 200 schemas is a lot of text to hold in working attention. The relevant tool is one needle; the other 199 are hay. This is the same long-context degradation that shows up everywhere in LLM systems, and it is why your million-token context window is quietly lying about how much it can actually use. More tokens in context does not mean more usable signal.
The 207-to-417 data point is the one I find most useful, because it isolates the variable. Same model, same benchmark tasks; the only change is the size of the tool list, and accuracy collapses. You cannot prompt your way out of that. The information the model needs is present, but it is buried, and burying it is the bug.
There is a cost dimension too. At 50-plus tools, schemas occupy 5-7% of the context before the user message, and you pay for those tokens on every turn, whether or not any of those tools get used. For a high-volume agent, that is real money spent shipping schemas the model mostly ignores.
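A rough way to see this on your own catalog is to estimate the per-turn schema overhead. The sketch below assumes a crude four-characters-per-token heuristic and made-up catalog numbers; swap in your model's real tokenizer and your actual tool definitions before trusting the result.

```python
import json

CHARS_PER_TOKEN = 4  # Assumption: crude heuristic, not a real tokenizer.


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def schema_overhead(tools: list[dict], context_window: int, turns: int) -> dict:
    """Estimate what always-loaded schemas cost per turn and over a conversation."""
    per_turn = sum(estimate_tokens(json.dumps(tool)) for tool in tools)
    return {
        "tokens_per_turn": per_turn,
        "share_of_context": per_turn / context_window,
        "tokens_total": per_turn * turns,
    }


# Hypothetical inputs: 50 placeholder tools, a 200k-token window, a 20-turn conversation.
placeholder = {"name": "tool", "description": "x" * 400, "input_schema": {"type": "object"}}
report = schema_overhead([placeholder] * 50, context_window=200_000, turns=20)
```

The `tokens_total` figure is the one to watch for a high-volume agent, because the per-turn cost is paid again on every turn.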
What the Numbers Actually Show
The degradation is not subtle. Across published evaluations of tool-augmented models, the pattern is consistent:
| Tool count in context | Approximate tool-call accuracy | What it means |
|---|---|---|
| ~10-20 tools | 80-95% | Healthy; default integration is fine |
| ~50 tools | Meaningful drop | Warning zone; schemas now take 5-7% of context, so cost and accuracy both bite |
| 100+ tools | ~13-15% (near random) | Broken; the model is effectively guessing |
| 207 → 417 tools (one model) | 64% → 20% | Doubling the catalog roughly cut accuracy by two-thirds |
Two-Phase Hierarchical Selection: Search, Then Load
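The fix that addresses both penalties at once is to stop presenting the full catalog. Instead of "here are 200 tools, pick one," you split tool use into two phases:
- Search: a cheap discovery step matches the request against the catalog and returns a handful of candidate tools.
- Load: only those candidates' full schemas go into context, and the model makes its call from that short list.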
This is the search-then-load pattern, and it is the single highest-leverage change you can make to a large-catalog agent. Anthropic ships a tool-search capability built on exactly this idea, and LangGraph tool nodes let you implement a discovery step that filters the active tool set before the model call. The mechanics differ but the principle is identical: discovery is separate from execution, and execution only ever sees a handful of tools.
The pattern pays off in three ways. Disambiguation gets easier because the near-duplicate clusters are mostly filtered out before the model sees them, attention dilution largely disappears because there are only a few schemas in context, and token cost drops because you are no longer serializing 200 schemas every turn.
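A minimal sketch of the two phases follows, using a deliberately simple keyword rule for discovery. The `Tool` class, the catalog entries, and `DOMAIN_KEYWORDS` are all hypothetical; the point is only the separation between finding candidates and loading their schemas.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    domain: str
    description: str
    schema: dict = field(default_factory=dict)


# Hypothetical catalog and routing rules, for illustration only.
CATALOG = [
    Tool("github_create_issue", "github", "Open a new issue in a repository."),
    Tool("github_search_issues", "github", "Search existing issues by text."),
    Tool("db_run_query", "database", "Run a read-only SQL query."),
]
DOMAIN_KEYWORDS = {
    "github": {"github", "issue", "repo", "pull"},
    "database": {"sql", "query", "table"},
}


def search_phase(request: str, catalog: list[Tool]) -> list[str]:
    """Phase 1: discovery. Returns candidate tool names; no schemas are loaded yet."""
    words = set(re.findall(r"[a-z]+", request.lower()))
    domains = {d for d, keywords in DOMAIN_KEYWORDS.items() if words & keywords}
    return [tool.name for tool in catalog if tool.domain in domains]


def load_phase(names: list[str], catalog: list[Tool]) -> list[dict]:
    """Phase 2: serialize full definitions for the shortlisted tools only."""
    by_name = {tool.name: tool for tool in catalog}
    return [
        {"name": n, "description": by_name[n].description, "input_schema": by_name[n].schema}
        for n in names
    ]


candidates = search_phase("Open a GitHub issue for the login bug", CATALOG)
loaded = load_phase(candidates, CATALOG)  # this short list is all the model sees
```

In a real agent, `loaded` would be passed as the tool list for the model call, and the search step could be swapped for something stronger without touching the load step.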
Implementing the Search Phase
The search phase can be as simple or as sophisticated as your catalog warrants. Three implementations, in increasing order of robustness:
- Keyword or rule-based filtering. If your tools cluster cleanly by domain (all the GitHub tools, all the database tools), a router that matches the request to a domain and exposes only that domain's tools is trivial to build and surprisingly effective. The weakness is requests that span domains or use vocabulary your rules do not anticipate.
- Embedding retrieval over tool descriptions. Embed each tool's name and description into a vector index. At query time, embed the user request and retrieve the top-k nearest tools. This is the RAG-MCP approach, covered in the next section, and it is the workhorse for catalogs in the 50-500 range.
- LLM-based routing. Use a small, fast model whose only job is to read the request and output the names of relevant tools (it sees a compressed catalog, names and one-line descriptions only, not full schemas). This handles nuance that embeddings miss but adds a model call to the critical path.
Most production systems we have shipped land on embedding retrieval as the default, with a thin rule-based layer in front for the obvious cases. The key decision is what k to retrieve: too low and you risk excluding the right tool, too high and you reintroduce dilution. Start with k around 5-10 and tune it against a labeled query set.
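The embedding option can be sketched in a few lines. To keep the example self-contained, `embed` below is a toy bag-of-words stand-in, not a real embedding model, and the tool names and descriptions are hypothetical; in practice you would call an actual embedding model and a vector index.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToolIndex:
    """Indexes each tool's name and description, then retrieves the top-k for a request."""

    def __init__(self, tools: dict[str, str]):
        self._vectors = {name: embed(f"{name} {desc}") for name, desc in tools.items()}

    def top_k(self, request: str, k: int = 5) -> list[str]:
        query = embed(request)
        # Sort by descending similarity; break ties by name for stable output.
        ranked = sorted(self._vectors, key=lambda n: (-cosine(query, self._vectors[n]), n))
        return ranked[:k]


# Hypothetical catalog: tool name -> description.
TOOLS = {
    "code_search": "Search source code in repositories for symbols and files",
    "crm_search": "Search customer records by name or account",
    "send_invoice": "Send an invoice to a customer by email",
}
index = ToolIndex(TOOLS)
shortlist = index.top_k("find the customer record for the Acme account", k=2)
```

Here `k` is the tuning knob described above: the retriever's interface stays the same while you adjust it against a labeled query set.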
RAG-MCP: Retrieval-Augmented Tool Selection
RAG-MCP, introduced in arXiv 2505.03275, is the cleanest formalization of embedding-based tool retrieval, and it is worth understanding precisely because the name tells you the whole design. You apply the retrieval-augmented generation pattern, the same one that powers document RAG, to tools instead of documents.
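The setup:
- Embed each tool's name and description into a vector index.
- At query time, embed the user request and retrieve the top-k nearest tools.
- Load only those tools' schemas into context and let the model choose among them.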
The headline result from the paper is that this restores tool-call accuracy to near-original levels even as the registered catalog grows into the hundreds. The reason is intuitive once you have seen the catalog-scale numbers: the model's accuracy was never a function of how many tools exist, it was a function of how many it has to disambiguate in context. RAG-MCP keeps that number small and constant regardless of catalog size. You can register 500 tools and the model still only ever reasons over 8.
Where RAG-MCP Earns Its Keep, and Where It Bites
This is not free, and the failure modes are the ones RAG practitioners already know. If your tool descriptions are vague or written in language that does not match how users phrase requests, retrieval will miss the right tool, and a missed retrieval is invisible: the model never sees the tool it needed and either picks a wrong one or gives up. That makes description quality (covered below) the load-bearing input.
The retrieval quality question is the same one that determines whether any RAG system works at all. If you have wrestled with getting retrieval to surface the right chunks rather than the merely similar ones, you understand the core risk: semantic similarity between a request and a tool description is a proxy for "is this the right tool," not the thing itself. Treat tool retrieval with the same rigor you would a production document-RAG pipeline, including an offline eval of retrieval recall.
A practical safeguard: always retrieve a slightly larger candidate set than you load, and include a small set of always-on "core" tools (a generic search, a help tool) that are present regardless of retrieval, so the agent has a fallback when retrieval comes up empty.
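The safeguard of over-retrieving and keeping core tools always on might look like the sketch below. The `retrieve` callable, the `margin` parameter, and the core tool names are assumptions for illustration; any retriever that returns ranked tool names would fit.

```python
from typing import Callable

# Hypothetical always-on tools that stay available even when retrieval misses.
CORE_TOOLS = ("generic_search", "help")


def select_active_tools(
    request: str,
    retrieve: Callable[[str, int], list[str]],
    load_k: int = 5,
    margin: int = 3,
    core_tools: tuple = CORE_TOOLS,
) -> tuple[list[str], list[str]]:
    """Retrieve load_k + margin candidates, load the top load_k plus the core tools.

    Returns (active, candidates). The full candidate list is kept so that
    near-misses can be logged and checked in offline recall evals.
    """
    candidates = retrieve(request, load_k + margin)
    # dict.fromkeys de-duplicates while preserving rank order.
    active = list(dict.fromkeys(list(candidates[:load_k]) + list(core_tools)))
    return active, candidates


def empty_retriever(request: str, k: int) -> list[str]:
    """Simulates a retrieval miss."""
    return []


active, candidates = select_active_tools("something unanticipated", empty_retriever)
```

With an empty retrieval result, `active` still contains the core tools, so the agent has something to fall back on.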
Semantic Tool Routing and Namespacing Across MCP Servers
Retrieval handles the "which tools are relevant" question. Namespacing handles a different, sneakier problem that shows up the moment you connect more than one MCP server: name collisions.
Connect a GitHub MCP server, a Linear MCP server, and your internal API, and you will get multiple tools called some variant of create, search, update, and get. Even if the names differ slightly, the descriptions often overlap enough that the model cannot reliably tell which server's search it should use. This is a routing problem layered on top of the selection problem, and it is why simply aggregating MCP servers into one flat tool list degrades fast.
The fixes, in order:
Namespace tool names by source. Prefix every tool with its server or domain: github_create_issue, linear_create_issue, internal_create_ticket. This does real work both for the model (the prefix is a strong disambiguation signal) and for you (logs and evals can attribute calls to the right source). Never expose two tools with the same bare name to the same model.
Route semantically before loading. Semantic tool routing combines retrieval with namespacing: the router uses the request's meaning to pick not just relevant tools but the right server's version of an ambiguous tool. Reported selection accuracy for semantic routing lands around 86.4%, versus under 50% when the full, unnamespaced catalog sits in context. That gap is the difference between an agent that works and one that quietly does the wrong thing half the time.
Mind the security boundary. Namespacing is also where tool selection meets access control. Different MCP servers carry different trust levels, and a confused agent calling the wrong server's write tool is a security incident, not just an accuracy miss. If you are aggregating third-party MCP servers, treat each as an untrusted boundary and read the MCP server security hardening checklist for production before wiring them into a write-capable agent. Routing that lands on the wrong server is bad; routing that lands on a malicious one is worse.
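Namespacing by source is mechanical enough to automate when you aggregate servers. In this sketch, the server names and bare tool names are hypothetical, and the `server_tool` prefix format is one reasonable convention rather than anything MCP mandates.

```python
from collections import defaultdict


def bare_name_collisions(servers: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return bare tool names exposed by more than one server, with their owners."""
    owners = defaultdict(list)
    for server, names in servers.items():
        for bare in names:
            owners[bare].append(server)
    return {bare: who for bare, who in owners.items() if len(who) > 1}


def namespace_tools(servers: dict[str, list[str]]) -> dict[str, tuple[str, str]]:
    """Map each namespaced tool name to its (server, bare name) for dispatch and logs."""
    registry = {}
    for server, names in servers.items():
        for bare in names:
            qualified = f"{server}_{bare}"
            if qualified in registry:
                raise ValueError(f"duplicate tool name: {qualified}")
            registry[qualified] = (server, bare)
    return registry


# Hypothetical aggregation of three sources.
SERVERS = {
    "github": ["create_issue", "search"],
    "linear": ["create_issue", "search"],
    "internal": ["create_ticket"],
}
collisions = bare_name_collisions(SERVERS)
registry = namespace_tools(SERVERS)
```

The registry also gives you the attribution needed for logs and evals: every call resolves back to exactly one server.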
Open-source tooling exists here too. The vLLM Semantic Router and similar projects implement embedding-based routing as infrastructure, so you are not hand-rolling the retrieval and dispatch layer from scratch.
Rewriting Tool Descriptions for Disambiguation
Everything above depends on one input you control completely: the tool description. Both retrieval and the model's final selection read these descriptions, and most are written badly for the job. The default description tells the model what a tool does. A good description tells the model when to choose it, and when not to.
The shift is from isolated definition to comparative definition. Compare:
Weak: "Searches the knowledge base for relevant documents."
Strong: "Searches the internal product documentation and help articles for answers to how-to and troubleshooting questions. Use this for product usage questions. Do NOT use this for code search (use code_search) or for searching customer records (use crm_search)."
The strong version does three things the weak one does not: it scopes the domain precisely, states the intended query type, and explicitly names the sibling tools it is most likely confused with and tells the model to avoid them. That last part is what disambiguation actually requires. Wrong-tool calls cluster around near-duplicates, so the highest-value edits draw bright lines between tools that sound alike.
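One way to make comparative descriptions routine is to build them from structured parts and lint for missing sibling references. The helper names and field layout below are assumptions for illustration; the example values come from the strong description above, and `query_index` is used as a sibling that has not been covered yet.

```python
def comparative_description(scope: str, use_for: str, not_for: dict[str, str]) -> str:
    """Compose a description from scope, intended use, and explicit sibling exclusions.

    not_for maps a task to the sibling tool that should handle it instead.
    """
    parts = [scope.rstrip(".") + ".", f"Use this for {use_for}."]
    for task, sibling in not_for.items():
        parts.append(f"Do NOT use this for {task} (use {sibling}).")
    return " ".join(parts)


def missing_sibling_mentions(description: str, siblings: list[str]) -> list[str]:
    """Return near-duplicate tools that the description never draws a line against."""
    return [name for name in siblings if name not in description]


description = comparative_description(
    scope="Searches the internal product documentation and help articles "
    "for answers to how-to and troubleshooting questions",
    use_for="product usage questions",
    not_for={"code search": "code_search", "searching customer records": "crm_search"},
)
gaps = missing_sibling_mentions(description, ["code_search", "crm_search", "query_index"])
```

Running the lint across a cluster of similar tools gives you a worklist of descriptions that still need a bright line drawn.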
A few rules I apply when rewriting a catalog:
This work is unglamorous and it is the highest-ROI thing you can do after adding search-then-load. You are writing the labels that both your retriever and your model read, so write them for the reader.
Measuring Selection Precision
You cannot improve what you do not measure. Build a labeled set: a few hundred representative requests, each annotated with the correct tool (or correct tool set). Then measure, per change, how often the system selects the right tool. Track two numbers:
- Retrieval recall: did the search phase include the correct tool in its candidate set? A miss here is a hard ceiling on everything downstream.
- Selection precision: given the candidates, did the model call the right one?
Treat this exactly like evals-driven development for any other AI feature: your labeled set is the regression suite, and every description rewrite, k change, or routing tweak gets validated against it before it ships. Without this, you are tuning blind, and tool-selection regressions are notoriously easy to introduce and hard to notice in production because a wrong-tool call often still produces plausible-looking output.
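Both numbers are simple to compute once each eval case records what the search phase returned and what the model called. The `EvalCase` layout is an assumption, and this sketch takes "given the candidates" to mean that selection precision is measured only over cases where retrieval succeeded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalCase:
    request: str
    expected: str            # the labeled correct tool
    candidates: list[str]    # what the search phase returned
    called: Optional[str]    # what the model actually called, if anything


def retrieval_recall(cases: list[EvalCase]) -> float:
    """Fraction of cases where the correct tool made it into the candidate set."""
    if not cases:
        return 0.0
    return sum(1 for c in cases if c.expected in c.candidates) / len(cases)


def selection_precision(cases: list[EvalCase]) -> float:
    """Among cases where retrieval succeeded, fraction where the model chose correctly."""
    reachable = [c for c in cases if c.expected in c.candidates]
    if not reachable:
        return 0.0
    return sum(1 for c in reachable if c.called == c.expected) / len(reachable)


# Hypothetical results from four labeled requests.
CASES = [
    EvalCase("find docs on sso", "docs_search", ["docs_search", "code_search"], "docs_search"),
    EvalCase("grep for parse_url", "code_search", ["code_search", "docs_search"], "code_search"),
    EvalCase("look up acme account", "crm_search", ["crm_search", "docs_search"], "docs_search"),
    EvalCase("refund order 17", "billing_refund", ["crm_search"], None),
]
```

Reporting the two separately tells you where to spend effort: low recall points at retrieval and descriptions, low precision at disambiguation among the candidates.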
Dynamic Tool Filtering by Task Phase
So far we have treated tool selection as a per-request retrieval problem. There is a second axis: most agents move through phases, and the tools relevant in one phase are noise in another. A coding agent exploring a codebase needs read and search tools; making a change it needs edit and write tools; verifying it needs test-runner tools. Exposing all of them in every phase reintroduces the dilution you just removed.
Dynamic filtering by phase means the active tool set changes as the agent's state machine advances. You can drive this off the agent's declared intent, an explicit plan step, or the structure of an orchestration graph. The pattern pairs naturally with the ReAct and function-calling agent patterns, where the loop already knows where it is in a task and can gate tools accordingly.
The payoff is twofold. Fewer tools per phase means higher selection accuracy within each phase, on top of whatever retrieval bought you. And it gives you a clean place to enforce safety: write tools simply do not exist in read-only phases, a stronger guarantee than hoping the model chooses not to call them.
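Phase gating can be sketched as a small lookup plus a check at dispatch time. The phases, tool names, and handler signature here are hypothetical; a real agent would derive the phase from its plan or orchestration graph.

```python
from enum import Enum


class Phase(Enum):
    EXPLORE = "explore"
    EDIT = "edit"
    VERIFY = "verify"


# Hypothetical mapping of task phase to the tools that exist in that phase.
PHASE_TOOLS = {
    Phase.EXPLORE: {"read_file", "search_code"},
    Phase.EDIT: {"read_file", "edit_file", "write_file"},
    Phase.VERIFY: {"read_file", "run_tests"},
}


def tools_for_phase(phase: Phase, retrieved: list[str]) -> list[str]:
    """Narrow an already-retrieved shortlist to what the current phase allows."""
    return [name for name in retrieved if name in PHASE_TOOLS[phase]]


def dispatch(phase: Phase, tool_name: str, handlers: dict):
    """Refuse calls to tools outside the current phase, whatever the model asked for."""
    if tool_name not in PHASE_TOOLS[phase]:
        raise PermissionError(f"{tool_name} is not available in phase {phase.value}")
    return handlers[tool_name]()


visible = tools_for_phase(Phase.EXPLORE, ["write_file", "read_file", "search_code"])
```

Filtering what the model sees improves accuracy; the check in `dispatch` is what turns the filter into a guarantee.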
When to Stop Routing and Split Into Sub-Agents
Routing has a ceiling. If your catalog spans genuinely different domains, different data, different permissions, different failure semantics, then cramming it behind one router is fighting the architecture. At that point, split into sub-agents.
The decision comes down to a few signals:
| Signal | Lean toward routing | Lean toward sub-agents |
|---|---|---|
| Catalog size | 50-200 tools, coherent domain | Hundreds across distinct domains |
| Permission boundaries | Uniform trust level | Different access scopes per group |
| Context bleed | Tool groups share context cleanly | One domain's context pollutes another |
| Failure isolation | A failure in one area is tolerable | You need independent containment |
A sub-agent owns a smaller, focused tool set, so each one gets the catalog-scale problem in a tractable size, and you can still apply retrieval within it. A top-level orchestrator dispatches to the right sub-agent based on the request, which is the same routing decision applied one layer up. The patterns that make multi-agent orchestration actually ship are the patterns you want here. The failure mode to avoid is splitting prematurely: sub-agents add real operational complexity, so reach for them when routing genuinely caps out, not before.
In practice the strongest large-catalog architectures combine all three layers. A top-level router dispatches to domain sub-agents; each sub-agent uses RAG-MCP retrieval over its own smaller tool set; and within each agent, dynamic filtering narrows the active set further by task phase. Each layer keeps the number of tools the model actually reasons over small, which is the entire game. For the protocol-level plumbing that ties multiple MCP servers together, the MCP developer guide to building servers and connecting tools covers the wiring.
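A top-level orchestrator that dispatches to domain sub-agents can be sketched as one more routing step. Everything here is hypothetical: the `SubAgent` shape, the keyword rule, and the agent names. A production router would more likely use retrieval or a small model, and each sub-agent would run its own tool selection over its own tools.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class SubAgent:
    name: str
    keywords: frozenset
    tools: tuple
    handle: Callable[[str], str]


def make_agent(name: str, keywords: list[str], tools: list[str]) -> SubAgent:
    """Build a stub sub-agent whose handler just reports which agent took the request."""
    return SubAgent(name, frozenset(keywords), tuple(tools),
                    lambda request: f"{name} handled: {request}")


def route(request: str, agents: list[SubAgent], fallback: SubAgent) -> SubAgent:
    """Pick the sub-agent with the most keyword overlap; use the fallback on no match.

    Assumes agents is non-empty.
    """
    words = set(re.findall(r"[a-z]+", request.lower()))
    best = max(agents, key=lambda agent: len(words & agent.keywords))
    return best if words & best.keywords else fallback


AGENTS = [
    make_agent("billing", ["invoice", "refund", "payment"], ["billing_refund", "billing_get_invoice"]),
    make_agent("code", ["repo", "commit", "test"], ["code_search", "run_tests"]),
]
FALLBACK = make_agent("general", [], ["generic_search"])

chosen = route("Refund the customer invoice from May", AGENTS, FALLBACK)
reply = chosen.handle("Refund the customer invoice from May")
```

Each sub-agent carries its own small `tools` tuple, so permissions and failures stay contained within one domain.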
Putting It Together: A Migration Path
If you have an agent that is calling the wrong tool at scale, here is the order I would fix it in, highest leverage first.
The throughline is the same at every step: tool-selection accuracy depends almost entirely on how many tools the model has to disambiguate in context, not how many exist in your system. Keep that number small and constant, and a 500-tool agent behaves like a 10-tool one. This is the core of the agent reliability work we do at Particula Tech, because the gap between an agent that demos well and one that runs in production is usually exactly this: the demo had eight tools and production has two hundred.
For the broader architecture context, see our pillar guide to building production AI agents, which covers how tool selection fits alongside memory, orchestration, and reliability. Tool selection at scale is not an exotic problem reserved for huge systems. It arrives the moment you connect your third integration, and the fix is the same whether you have 50 tools or 500: never make the model find a needle in your haystack when you can hand it the needle.
Frequently Asked Questions
Your agent picks the wrong tool because every tool schema sits in the prompt at once, and accuracy collapses as the count grows. Baseline tool-call accuracy drops to roughly 13-15%, near random, once you cross 100 tools, and one model fell from 64% to 20% going from 207 to 417 tools. The model cannot reliably disambiguate between dozens of similar-sounding functions when all their descriptions compete for attention in a single context window. The fix is not a better prompt, it is architecture: retrieve a small candidate set of relevant tools before the call instead of presenting the entire catalog every turn.



