Traditional retrieve-then-generate pipelines break on multi-hop and ambiguous queries—34% accuracy on complex benchmarks. Agentic RAG wraps retrieval in an agent loop with five components: Router, Retriever, Grader, Generator, and Hallucination Checker. This architecture hit 78% accuracy on the same benchmarks and 94.5% on HotpotQA. Implement it with LangGraph's graph-based state machine or a custom orchestrator, and use adaptive routing to skip retrieval entirely when the LLM's parametric knowledge is sufficient.
Last month, a healthcare client asked us to build a RAG system that answers clinical protocol questions. The straightforward version—embed the query, retrieve the top five chunks, generate an answer—worked fine for simple lookups like "what's the dosing schedule for metformin." Then a physician asked: "For a patient with stage 3 CKD and uncontrolled diabetes who failed metformin, what are the recommended second-line agents and their renal dosing adjustments?" The system retrieved chunks about CKD staging and chunks about diabetes medications—but never connected them. The answer was plausible-sounding nonsense.
That failure isn't a retrieval quality problem. It's an architecture problem. The fixed retrieve-then-generate pipeline has no mechanism to evaluate whether the retrieved documents actually address the query, no ability to decompose a complex question into sub-queries, and no way to verify that the generated answer is grounded in the evidence. A March 2026 systematization-of-knowledge paper formalized what practitioners have been discovering empirically: traditional RAG hits roughly 34% accuracy on complex, multi-hop queries. Agentic RAG—where an agent controls the retrieval loop with routing, grading, and self-correction—pushes that to 78%.
This isn't a marginal improvement. It's the difference between a demo and a production system.
Why Fixed Pipelines Break on Complex Queries
Traditional RAG follows a linear pipeline: query → embed → retrieve top-k → generate. Every query gets the same treatment regardless of complexity. A factual lookup and a multi-hop reasoning question both get five chunks and one generation pass.
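In code, the fixed pipeline is only a few lines. This is a sketch, not a real implementation: retrieval is stand-in keyword overlap over an in-memory corpus instead of an embedding search, and `llm_generate` is a hypothetical callable standing in for the generation call.

```python
# Sketch of the fixed retrieve-then-generate pipeline.
# Retrieval here is naive keyword overlap, a stand-in for embedding search;
# `llm_generate` is a hypothetical stand-in for an LLM call.
def retrieve_top_k(query: str, corpus: list[str], k: int = 5) -> list[str]:
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def traditional_rag(query: str, corpus: list[str], llm_generate) -> str:
    chunks = retrieve_top_k(query, corpus)
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"
    return llm_generate(prompt)  # One pass: no grading, no verification
```

Every query takes the same path: one retrieval, one generation, no feedback of any kind.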
This architecture has three failure modes that compound on complex queries:
Single-shot retrieval misses context. When a query requires information from multiple documents or multiple sections of the same document, a single embedding similarity search rarely surfaces all the relevant pieces. The query "compare the renal safety profiles of SGLT2 inhibitors versus GLP-1 agonists in CKD patients" requires at least three distinct knowledge areas: SGLT2 inhibitor pharmacology, GLP-1 agonist pharmacology, and renal function considerations. A single vector search biases toward whichever topic the query embedding is closest to.
No relevance feedback loop. Traditional RAG has no mechanism to evaluate retrieved documents before passing them to the generator. If the retriever returns tangentially related content—which happens frequently with ambiguous or multi-faceted queries—the generator works with whatever it gets. This is the single biggest driver of confident-sounding hallucination: the LLM generates a coherent response from irrelevant context.
No output verification. The pipeline ends at generation. There's no check for whether the response is actually grounded in the retrieved documents, whether it addresses all parts of the query, or whether it contains fabricated claims. The user gets the first-draft answer every time, regardless of quality.
These limitations are well understood. What changed in 2026 is that we now have a formal framework for the alternative—and benchmarks proving it works.
The Agentic RAG Architecture
The March 2026 SoK paper on agentic RAG (arXiv 2603.07379) formalized what production teams have been building iteratively: a five-component architecture where an LLM agent orchestrates the retrieval-generation loop instead of following a fixed pipeline.
The key insight is that these components form a loop, not a pipeline. If the Grader determines that retrieved documents are insufficient, the system rewrites the query and triggers another retrieval pass. If the Hallucination Checker finds unsupported claims, it can send the response back to the Generator with specific feedback about which claims need grounding. The agent decides when the answer is good enough to return.
This is what the SoK paper formalizes as a finite-horizon partially observable Markov decision process—the agent makes sequential decisions about when to retrieve, when to rewrite, and when to generate, based on the state accumulated through prior steps.
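To make the formalism concrete, a finite-horizon POMDP can be written as a tuple; the notation below is generic textbook notation for illustration, not symbols taken from the paper itself.

```latex
% Generic finite-horizon POMDP tuple (illustrative notation, not the paper's):
%   S: states (query, retrieved documents, grades, draft answers so far)
%   A: actions (retrieve, rewrite, generate, verify, stop)
%   Omega: observations (graded documents, checker verdicts)
\langle \mathcal{S}, \mathcal{A}, \Omega, T, O, R, H \rangle,
\qquad
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}\!\left[\sum_{t=1}^{H} R(s_t, a_t)\right]
```

The agent's policy picks the action (retrieve, rewrite, generate, stop) that maximizes expected answer quality within the horizon of `H` steps, which is exactly the max-cycles cap you set in practice.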
How It Differs from Corrective RAG and Self-RAG
If you've been following the RAG literature, you'll recognize pieces of this architecture in earlier work. It's worth distinguishing the three approaches.

Corrective RAG (CRAG) adds a relevance grading step after retrieval. If documents score poorly, CRAG rewrites the query or falls back to web search. This is the Grader component in isolation—valuable, but it doesn't address query decomposition or output verification.

Self-RAG trains the LLM itself to decide when retrieval is needed and to critique its own outputs with special reflection tokens. This bakes the routing and checking behavior into the model weights rather than the orchestration layer. The downside: you need to fine-tune or use a model that supports Self-RAG natively.

Agentic RAG is the superset. It orchestrates all five components through an external agent loop, meaning you can use any LLM without fine-tuning, swap retrieval strategies dynamically, and add or remove components as needed. The agent layer provides the planning and decision-making that CRAG and Self-RAG handle through narrower mechanisms.

Think of it this way: CRAG fixes the retrieval step. Self-RAG teaches the model to self-correct. Agentic RAG builds a control system around the entire pipeline.
| Component | Role | Fires When |
|---|---|---|
| Router | Classifies query complexity and selects retrieval strategy | Every query |
| Retriever | Executes the selected retrieval strategy (vector, keyword, graph) | When Router determines retrieval is needed |
| Grader | Evaluates retrieved documents for relevance to the query | After each retrieval pass |
| Generator | Produces the response from graded, relevant documents | When sufficient relevant documents exist |
| Hallucination Checker | Verifies that generated claims are grounded in retrieved evidence | After generation, before returning response |
The Benchmark Data: 34% to 78%
The numbers from recent papers paint a clear picture of where fixed pipelines fail and agent-controlled retrieval succeeds.
The pattern is consistent: the more complex the query, the larger the gap. On simple single-hop factual questions, traditional RAG and agentic RAG perform similarly—the agent loop adds latency without meaningful accuracy gains. But as soon as queries require reasoning across multiple documents, decomposing questions, or resolving ambiguity, the fixed pipeline's limitations compound.
The A-RAG framework (arXiv 2602.03442) demonstrated another dimension of improvement: by exposing hierarchical retrieval interfaces—keyword search, semantic search, and chunk-level reads—directly to the agent, the system adaptively searches across multiple granularities. Instead of committing to one retrieval strategy, the agent learns when to use broad semantic search versus targeted keyword lookup versus reading specific document sections. This flexibility alone accounts for a significant portion of the accuracy gains on multi-hop benchmarks.
A separate finding worth noting: agentic RAG systems with intelligent memory management use up to 90% fewer tokens than traditional RAG approaches that retrieve fixed context windows. The agent retrieves only what it needs, when it needs it—rather than padding every prompt with the maximum context budget.
| Benchmark | Traditional RAG | Agentic RAG | Improvement |
|---|---|---|---|
| Complex multi-hop | 34% | 78% | +129% |
| HotpotQA | ~72% | 94.5% | +31% |
| 2WikiMultiHop | ~68% | 89.7% | +32% |
Three Agentic RAG Patterns
The SoK paper identifies three architectural patterns for agentic RAG systems, each suited to different complexity levels. Understanding which pattern fits your use case prevents both over-engineering simple problems and under-engineering complex ones.
```python
# Simplified single-agent agentic RAG loop. `router`, `retriever`, `grader`,
# `rewriter`, `generator`, and `hallucination_checker` are the five components
# described above, plus a query rewriter.
MIN_RELEVANT = 2  # minimum graded-relevant documents before generating

def agentic_rag(query: str, max_cycles: int = 3) -> str:
    route = router.classify(query)
    if route == "parametric":
        return llm.generate(query)  # Skip retrieval entirely

    relevant_docs = []
    for cycle in range(max_cycles):
        documents = retriever.search(query, strategy=route)
        relevant_docs = grader.filter(query, documents)
        if len(relevant_docs) < MIN_RELEVANT:
            query = rewriter.rewrite(query, feedback=grader.feedback)
            continue  # Re-retrieve with the rewritten query
        response = generator.generate(query, relevant_docs)
        if hallucination_checker.verify(response, relevant_docs):
            return response
        query = rewriter.refine(
            query, ungrounded_claims=hallucination_checker.flagged_claims
        )
    return generator.generate(query, relevant_docs)  # Best effort
```

Single-Agent Agentic RAG
One LLM agent controls the entire retrieval-generation loop. It routes queries, executes retrieval, grades documents, generates responses, and checks for hallucinations—all within a single agent's decision space. This pattern works for most production use cases. It's the right starting point unless you have a specific reason to add complexity. We've shipped single-agent agentic RAG systems for clients in healthcare, legal, and financial services at Particula Tech—the single-agent loop handles the vast majority of query patterns.
Multi-Agent Agentic RAG
Multiple specialized agents collaborate on the retrieval-generation pipeline. A typical setup: one agent decomposes complex queries into sub-queries, separate retrieval agents handle different knowledge domains, a synthesis agent combines results, and a verification agent checks the final output. This pattern shines when your knowledge base spans multiple domains with different retrieval strategies. A financial research system might need one agent querying structured market data (SQL), another searching unstructured analyst reports (vector search), and a third pulling regulatory filings (keyword search with date filtering). Each agent optimizes for its domain, and an orchestrator merges their results. The trade-off is coordination overhead. Multi-agent systems are harder to debug, harder to test, and introduce failure modes at agent boundaries. If you can get single-agent agentic RAG to work for your use case, do that first. For complex multi-domain problems, see our guide on multi-agent orchestration patterns that actually ship.
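The decompose-retrieve-synthesize flow described above can be sketched in a few lines. The `decompose`, per-domain agent, and `synthesize` callables here are hypothetical stand-ins, not names from any specific framework; the point is the fan-out/merge shape.

```python
# Multi-agent sketch: a decomposer splits the query, domain agents retrieve
# in parallel, and a synthesis step merges their results. All callables are
# hypothetical stand-ins for LLM-backed agents.
from concurrent.futures import ThreadPoolExecutor

def multi_agent_rag(query, decompose, domain_agents, synthesize):
    sub_queries = decompose(query)  # e.g. one sub-query per knowledge domain
    with ThreadPoolExecutor() as pool:
        # Offer each sub-query to every domain agent; agents return an empty
        # list when the sub-query falls outside their domain.
        futures = [
            pool.submit(agent, sq) for sq in sub_queries for agent in domain_agents
        ]
        results = [doc for f in futures for doc in f.result()]
    return synthesize(query, results)
```

The orchestration overhead lives at the boundaries: each extra agent is another interface where a malformed sub-query or an empty result set can silently degrade the final answer.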
Hierarchical Agentic RAG
A meta-agent plans the overall retrieval strategy, delegates to sub-agents, monitors progress, and adjusts the plan based on intermediate results. This is the pattern the A-RAG paper demonstrates—the hierarchical agent has access to retrieval tools at different granularities and learns to compose them into effective search strategies. Hierarchical agentic RAG is appropriate for research-grade systems where queries are genuinely open-ended: "Summarize the last five years of research on CRISPR delivery mechanisms, focusing on in-vivo applications." The meta-agent decomposes this into sub-topics, delegates retrieval for each, evaluates intermediate results, and iterates until coverage is sufficient. For most production systems, this is overkill. Reserve hierarchical patterns for use cases where query decomposition is the primary challenge and you need the agent to plan multi-step retrieval strategies dynamically.
Building Agentic RAG with LangGraph
LangGraph is the most natural fit for agentic RAG because its graph-based state machine maps directly to the component architecture. Each component becomes a node, conditional edges encode the decision logic, and the state object accumulates context through the loop. For background on why LangGraph's architecture suits this kind of problem, see our agent framework comparison for 2026.
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, List

class RAGState(TypedDict):
    query: str
    original_query: str
    documents: List[dict]
    relevant_documents: List[dict]
    response: str
    cycle_count: int
    route: str

graph = StateGraph(RAGState)

# Add nodes for each component
graph.add_node("router", route_query)
graph.add_node("retriever", retrieve_documents)
graph.add_node("grader", grade_relevance)
graph.add_node("rewriter", rewrite_query)
graph.add_node("generator", generate_response)
graph.add_node("hallucination_check", check_hallucination)

# Define the flow
graph.set_entry_point("router")
graph.add_conditional_edges("router", decide_route, {
    "retrieve": "retriever",
    "parametric": "generator",  # Skip retrieval
})
graph.add_edge("retriever", "grader")
graph.add_conditional_edges("grader", check_relevance, {
    "sufficient": "generator",
    "insufficient": "rewriter",
})
graph.add_edge("rewriter", "retriever")  # Re-retrieve loop
graph.add_edge("generator", "hallucination_check")
graph.add_conditional_edges("hallucination_check", verify_grounding, {
    "grounded": END,
    "ungrounded": "rewriter",  # Try again
})

app = graph.compile()
```

This graph definition makes every decision point explicit. You can look at the code and trace every possible path a query can take through the system. When a clinician asks "why did the system give this answer?"—you can point to the exact sequence of route → retrieve → grade → generate → verify that produced it.
The Router Node

The router is the most impactful component to get right. A well-tuned router saves latency on simple queries (by skipping the full agent loop) and improves accuracy on complex queries (by selecting the right retrieval strategy). In production, we've found that adaptive routing alone—without any other agentic components—improves response quality by 15-20% and reduces average latency by routing simple queries to the fast path. If you're looking for the highest-ROI improvement to an existing RAG system, start here.

```python
def route_query(state: RAGState) -> RAGState:
    classification = llm.classify(
        f"""Classify this query into one of three categories:
- 'simple': Single fact lookup, answerable from one document
- 'complex': Multi-hop reasoning, requires multiple documents
- 'parametric': General knowledge, no retrieval needed
Query: {state['query']}"""
    )
    strategies = {
        "simple": "vector_search",
        "complex": "hybrid_search",  # Vector + keyword + reranking
        "parametric": "none",
    }
    return {**state, "route": strategies.get(classification, "vector_search")}
```

The Grader Node

The grader decides whether to proceed to generation or loop back for re-retrieval. This is where agentic RAG prevents the "generate from irrelevant context" failure mode. A practical optimization: use a lightweight model (Claude Haiku, GPT-4o-mini, or a fine-tuned classifier) for grading. The grader runs on every retrieved document, so cost and latency add up fast with a frontier model. In our deployments, a fine-tuned 7B classifier handles relevance grading at 100-200ms per document with accuracy comparable to frontier models on this narrow task. For more on when smaller models outperform flagships, see our analysis of specialized models versus flagship models for production workloads.

```python
def grade_relevance(state: RAGState) -> RAGState:
    relevant = []
    for doc in state["documents"]:
        score = llm.judge(
            f"""Is this document relevant to answering the query?
Query: {state['query']}
Document: {doc['content'][:500]}
Answer: relevant or irrelevant"""
        )
        if score == "relevant":
            relevant.append(doc)
    return {**state, "relevant_documents": relevant}

def check_relevance(state: RAGState) -> str:
    if len(state["relevant_documents"]) >= 2:
        return "sufficient"
    if state["cycle_count"] >= 3:
        return "sufficient"  # Best effort after max cycles
    return "insufficient"
```
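Because each document is judged independently, the per-document calls can run concurrently to keep wall-clock latency flat. A minimal sketch, assuming a hypothetical `cheap_judge(query, doc) -> bool` wrapper around whichever lightweight model you use:

```python
# Concurrent relevance grading with a cheap model. `cheap_judge` is a
# hypothetical callable wrapping a small classifier or lightweight LLM;
# each call is independent, so they fan out across a thread pool.
from concurrent.futures import ThreadPoolExecutor

def grade_documents(query, documents, cheap_judge, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(lambda d: cheap_judge(query, d), documents))
    return [doc for doc, keep in zip(documents, verdicts) if keep]
```

With 5-10 documents per pass, concurrent grading keeps the grading stage close to the latency of a single cheap-model call.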
When to Use Agentic RAG vs Traditional RAG vs GraphRAG
Not every RAG system needs an agent loop. The added complexity and latency of agentic RAG are justified only when the query patterns demand it. Here's the decision framework we use with clients:

| Query Pattern | Best Architecture | Why |
|---|---|---|
| Simple factual lookups | Traditional RAG | Single retrieval pass is sufficient; agent loop adds unnecessary latency |
| Multi-hop reasoning | Agentic RAG | Requires query decomposition and iterative retrieval |
| Ambiguous queries | Agentic RAG | Router classifies intent; rewriter clarifies query |
| Entity relationship questions | GraphRAG | Graph structure captures connections that vector search misses |
| Complex queries over connected data | Agentic RAG + GraphRAG | Agent uses graph retrieval as one tool alongside vector search |
| Time-sensitive queries | Agentic RAG | Router can direct to web search or real-time data sources |
| High-volume, low-complexity | Traditional RAG | Cost and latency don't justify the agent overhead |

The key insight: agentic RAG and GraphRAG are complementary, not competing. You can build an agentic RAG system that uses GraphRAG as one of its retrieval tools. The agent decides when entity-relationship reasoning (graph) versus semantic similarity (vector) versus exact matching (keyword) is the right approach for each query. For more on GraphRAG specifically, see our deep-dive on LazyGraphRAG and cost-effective knowledge graph approaches.

The Cost-Latency Trade-Off

Agentic RAG isn't free. Each component in the loop—routing, grading, hallucination checking—requires LLM calls that add both latency and cost.

| Component | Typical Latency | Cost per Query (Frontier Model) | Cost per Query (7B Model) |
|---|---|---|---|
| Router | 200-500ms | $0.001-0.003 | $0.0001 |
| Retrieval | 100-300ms | Infrastructure cost | Infrastructure cost |
| Grader (per doc) | 200-400ms | $0.001-0.002 | $0.0001 |
| Generator | 1-3s | $0.01-0.05 | $0.001-0.005 |
| Hallucination Check | 500ms-1s | $0.002-0.005 | $0.0002 |

A single-cycle agentic RAG query with 5 retrieved documents costs roughly $0.02-0.07 with frontier models—2-3x more than traditional RAG. With a mix of specialized small models for routing and grading and a frontier model for generation, you can bring that down to $0.01-0.02 per query while maintaining accuracy.

The 90% token reduction that agentic systems achieve through selective retrieval partially offsets this cost. Traditional RAG pads every prompt to the maximum context window; agentic RAG retrieves only what the grader confirms is relevant.
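Plugging rough midpoints of those per-component ranges into a quick estimator shows where the budget goes. The figures below collapse the table's frontier-model ranges to single numbers, so treat the totals as order-of-magnitude only.

```python
# Rough per-query cost estimate for the agent loop, using midpoints of the
# frontier-model ranges from the table above. Order-of-magnitude only.
COSTS_FRONTIER = {
    "router": 0.002,            # midpoint of $0.001-0.003
    "grader_per_doc": 0.0015,   # midpoint of $0.001-0.002
    "generator": 0.03,          # midpoint of $0.01-0.05
    "hallucination_check": 0.0035,  # midpoint of $0.002-0.005
}

def cycle_cost(costs, docs_graded=5, cycles=1):
    # Router runs once; grading, generation, and checking repeat per cycle.
    per_cycle = (
        costs["grader_per_doc"] * docs_graded
        + costs["generator"]
        + costs["hallucination_check"]
    )
    return costs["router"] + per_cycle * cycles
```

One cycle with 5 graded documents works out to about $0.043, squarely inside the $0.02-0.07 range quoted above, and generation dominates: it is roughly 70% of the total, which is why cheap models for routing and grading move the needle so little compared with the generator choice.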
Production Lessons from Shipping Agentic RAG
After deploying agentic RAG systems across healthcare, legal, and financial services clients, here are the patterns that consistently matter:
Set hard cycle limits. Without a max iteration cap, the rewrite-retrieve loop can cycle indefinitely on genuinely unanswerable queries. We cap at 3 cycles—this resolves 95% of queries that benefit from re-retrieval while preventing runaway costs on the remaining 5%.
Cache the router. Query classification is highly cacheable. Similar queries almost always route the same way. A semantic cache on the router decision alone eliminates 30-40% of routing LLM calls in production workloads. For more on when caching helps and when it hurts, see our guide on caching LLM responses in production.
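A semantic cache over router decisions can be sketched with any embedding function. Here a hypothetical `embed(text) -> list[float]` stands in for a real embedding model, and a linear scan stands in for a vector store; cosine similarity above a threshold reuses the cached route.

```python
import math

# Semantic cache for router decisions. `embed` is a hypothetical embedding
# function; a production system would back this with an embedding model and
# a vector index rather than a linear scan.
class RouterCache:
    def __init__(self, embed, threshold=0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_route)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, query):
        vec = self.embed(query)
        for cached_vec, route in self.entries:
            if self._cosine(vec, cached_vec) >= self.threshold:
                return route  # Cache hit: skip the routing LLM call
        return None  # Cache miss: fall through to the router

    def put(self, query, route):
        self.entries.append((self.embed(query), route))
```

The threshold is the knob to tune: too low and dissimilar queries inherit the wrong route, too high and the hit rate collapses.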
Grade with cheap models, generate with smart models. The grading step is a binary classification task. A fine-tuned 7B model or even a well-prompted Claude Haiku handles it at a fraction of the cost and latency of a frontier model. Reserve your compute budget for the generation step where reasoning quality matters most.
Monitor the rewrite trigger rate. If more than 30% of queries trigger re-retrieval, your base retrieval pipeline needs work—better chunking, better embeddings, or hybrid search. Agentic RAG should be a safety net for complex queries, not a band-aid for poor retrieval quality. For retrieval evaluation specifically, see our guide on how to tell if your RAG system actually works.
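Tracking the trigger rate needs nothing more than two counters. A minimal sketch, with the ~30% alert threshold from above baked in as a default:

```python
# Minimal counter for the re-retrieval trigger rate. A rate above ~0.30
# signals that the base retrieval pipeline needs attention first.
class RewriteRateMonitor:
    def __init__(self, alert_threshold=0.30):
        self.alert_threshold = alert_threshold
        self.total_queries = 0
        self.rewrites_triggered = 0

    def record(self, triggered_rewrite: bool):
        self.total_queries += 1
        if triggered_rewrite:
            self.rewrites_triggered += 1

    @property
    def rate(self):
        if self.total_queries == 0:
            return 0.0
        return self.rewrites_triggered / self.total_queries

    @property
    def needs_attention(self):
        return self.rate > self.alert_threshold
```

In practice you would emit `rate` to your metrics backend per time window rather than over the process lifetime, so regressions after a chunking or embedding change show up quickly.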
The hallucination checker is non-negotiable in regulated industries. In healthcare and legal, we've seen the hallucination checker catch fabricated citations, incorrect dosing information, and misattributed legal precedents that the generator produced confidently from partially relevant context. This single component justifies the entire agentic architecture for high-stakes applications.
What's Next for Agentic RAG
The March 2026 SoK paper identifies several open research directions that are already shaping production systems:
Learned retrieval policies. Instead of hand-coding routing rules, train the agent to learn optimal retrieval strategies from feedback. Early results show reinforcement learning on retrieval decisions improving accuracy by another 10-15% over rule-based routing.
Tool-augmented retrieval. Moving beyond text search to give agents access to calculators, code interpreters, and API calls as retrieval tools. A financial query might require the agent to retrieve raw data and then compute a ratio—something neither vector search nor the LLM's parametric knowledge can provide directly.
Cross-modal agentic RAG. Extending the architecture to handle image, audio, and video retrieval alongside text. Medical imaging diagnosis, architectural plan review, and legal document analysis with scanned exhibits all require agents that can reason across modalities.
The trajectory is clear: RAG is evolving from a retrieval technique into an agent architecture. The teams that recognize this shift early—and build their systems with agent-controlled retrieval loops instead of fixed pipelines—are the ones shipping RAG systems that actually work on the queries their users care about most.
The fixed pipeline had a good run. For complex, real-world queries, the agent loop is what production demands.
Frequently Asked Questions
What's the difference between traditional RAG and agentic RAG?

Traditional RAG follows a fixed pipeline: embed the query, retrieve top-k chunks, generate an answer. It never evaluates whether the retrieved documents are relevant or whether the answer is grounded. Agentic RAG wraps this pipeline in an agent loop where the LLM actively decides when to retrieve, evaluates document relevance, rewrites queries when results are poor, and checks its own output for hallucinations before responding. This iterative control produces 78% accuracy on complex queries versus 34% for traditional RAG.