In 2026 the dominant agent latency bottleneck is chained tool calls, not model inference speed. Fan-out/fan-in scheduling cuts wall-clock time 36-50% in production content and research workflows, and DAG approaches like LLMCompiler plus PASTE speculative tool execution reach 1.8x-3.7x speedups with up to 6x cost reductions. The catch is idempotency: parallel write tools and hidden dependencies turn a 2x win into a cascading-failure debugging session. This deep-dive covers dependency modeling, fan-in merge strategies, and when to keep things sequential.
The slowest part of a 2026 agent is almost never the model. It is the agent waiting on its own tool calls, one after another, because the default ReAct loop emits exactly one tool call per model turn and then blocks until the result comes back before deciding the next one. A research agent that hits five independent sources sequentially spends most of its wall-clock budget idle, holding a token-generation pipeline open while a search API returns. Model inference got faster every quarter; the chained-tool-call problem did not, and it is now the dominant latency bottleneck in production agent systems.
This is the gap that parallel agent orchestration closes. The core move is to stop treating the agent as a turn-by-turn conversation and start treating its tool calls as a dependency graph: figure out which calls actually depend on each other, run everything independent at the same time, and merge the results. The production numbers are large enough to justify an architecture change. Fan-out/fan-in scheduling cuts wall-clock time 36-50% on common content and research workflows (per a production study at zylos.ai/research/2026-04-26), and DAG-based approaches reach 1.8x-3.7x speedups with up to 6x cost reductions when agents schedule independent work concurrently instead of one call at a time.
This post is the practitioner walkthrough: the sequential tool-execution problem, fan-out/fan-in, LLMCompiler's full-DAG scheduling, PASTE speculative execution, dependency modeling without race conditions, fan-in merge strategies, and where DAG orchestration backfires badly enough that you should stay sequential. It is opinionated, because the wrong default (parallelize everything) is as expensive as the wrong default it replaces (parallelize nothing).
The Sequential Tool-Execution Problem
Start with where the time actually goes, because it is counterintuitive. Teams chasing agent latency reach for a faster or smaller model first. That helps token generation, which is rarely the bottleneck. The bottleneck is the structure of the loop.
A standard ReAct-style agent runs: think, call one tool, observe the result, think again, call the next tool. Each tool call is a network round-trip the agent waits on synchronously. If a workflow needs five tool calls and each takes two seconds, the wall-clock floor is roughly ten seconds of waiting, plus model time, even if all five calls are completely independent. The model could have fired all five at once. The loop architecture prevents it.
The fix is not a faster model. It is recognizing that "five independent two-second calls" should cost about two seconds, not ten. That collapse, from ten seconds sequential to about two seconds plus aggregation, is the canonical fan-out example, and it is why the latency conversation in 2026 moved from inference speed to orchestration topology. If your agent feels slow despite a fast model, the diagnosis is almost always serialized independent work, the same root cause we unpack in our guide to fixing slow LLM latency in production apps, applied to the tool layer instead of the token layer.
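A minimal sketch of that collapse, using `asyncio` and a hypothetical `call_tool` stand-in for a network-bound tool. The delay is scaled down from the two seconds in the text so the sketch runs quickly; the names and timings are illustrative assumptions, not measurements.

```python
import asyncio
import time

# Hypothetical stand-in for a network-bound tool call. The delay is scaled
# down from the 2 s used in the text so the sketch finishes quickly.
CALL_SECONDS = 0.2

async def call_tool(name: str) -> str:
    await asyncio.sleep(CALL_SECONDS)
    return f"result:{name}"

async def run_sequential(names: list[str]) -> list[str]:
    # ReAct-style loop: issue one call, wait for it, then issue the next.
    return [await call_tool(n) for n in names]

async def run_parallel(names: list[str]) -> list[str]:
    # Fan-out: start every independent call, then wait for all of them.
    return list(await asyncio.gather(*(call_tool(n) for n in names)))

async def timed(coro):
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start

async def main():
    names = [f"source_{i}" for i in range(5)]
    seq_result, seq_time = await timed(run_sequential(names))
    par_result, par_time = await timed(run_parallel(names))
    return seq_result, seq_time, par_result, par_time

seq_result, seq_time, par_result, par_time = asyncio.run(main())
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

Both versions return the same results in the same order; only the waiting pattern changes. The sequential run pays roughly five times the per-call delay, the parallel run roughly one.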
Fan-Out/Fan-In: The Default Parallel Pattern
Fan-out/fan-in is the most widely deployed parallel pattern in production agent stacks, and for good reason: it maps directly onto the two workloads agents do most, research and content generation. Fan-out dispatches a set of independent subtasks simultaneously. Fan-in collects their results and merges them into a single output.
The shape is simple. A research agent decomposes a question into five independent sub-queries, fires all five searches at once, summarizes each result in parallel, then aggregates. A content pipeline drafts five sections concurrently and stitches them. Anywhere subtasks do not depend on each other, fan-out applies, and wall-clock time drops from the sum of the subtask durations to roughly the maximum plus merge overhead.
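The "sum versus maximum" claim can be written down directly. The symbols here are our own notation, not from any cited study: t_i is the duration of subtask i, n is the number of independent subtasks, and t_merge is the fan-in overhead.

```latex
T_{\text{seq}} = \sum_{i=1}^{n} t_i
\qquad
T_{\text{fan}} \approx \max_{1 \le i \le n} t_i + t_{\text{merge}}
```

For five independent two-second calls, T_seq is 10 s and T_fan is about 2 s plus t_merge, which is the canonical example from the previous section. The approximation assumes the runtime can actually dispatch all n calls at once; rate limits or connection caps push T_fan back toward the sum.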
The 36-50% reduction figure comes from exactly these workloads, where a large share of the work is independent. What makes it work is honest dependency analysis. A "fan-out" where subtask B secretly needs subtask A's output is not a fan-out, it is a two-stage DAG you have mislabeled, and running B before A completes produces wrong answers, not fast ones. Fan-out is the right tool only for the genuinely independent segment of the workload. For coordinating multiple agents rather than a single agent's tool calls, our breakdown of multi-agent orchestration that actually works in production covers the supervisor and handoff patterns that sit one level above this.
LLMCompiler: From Fan-Out to Full DAG Scheduling
Fan-out handles the easy case: fully independent work. Real workflows are messier, with partial dependencies, some calls that can run immediately and some that must wait for specific predecessors. LLMCompiler (UC Berkeley, ICML 2024) generalizes fan-out to handle that mess by treating the entire set of tool calls as a directed acyclic graph.
The mechanism has three parts. A planner LLM reads the task and emits a DAG of tool calls with explicit dependency edges, declaring up front which calls feed which. A task-fetching unit dispatches every node whose dependencies are satisfied. An executor runs those ready nodes in parallel waves, and as results return, newly unblocked nodes get dispatched. Instead of one tool call per model turn, the planner commits to the whole call graph once, and execution parallelizes everything the edges permit.
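The task-fetching and execution loop can be sketched in a few lines of `asyncio`. This is an illustrative scheduler under stated assumptions, not LLMCompiler's actual implementation: the `PLAN` format, the tool names, and `run_tool` are all hypothetical, and the planner step is replaced by a hard-coded plan.

```python
import asyncio

# Hypothetical plan format: node id -> (tool name, ids of nodes it depends on).
# In a real system a planner LLM would emit this; here it is hard-coded.
PLAN = {
    "search_a": ("search", []),
    "search_b": ("search", []),
    "compare": ("compare", ["search_a", "search_b"]),
    "report": ("write_report", ["compare"]),
}

async def run_tool(tool: str, inputs: dict) -> str:
    # Stand-in for a real tool call; echoes which inputs it received.
    await asyncio.sleep(0.05)
    return f"{tool}({', '.join(sorted(inputs))})"

async def execute_dag(plan: dict) -> tuple[dict, list[str]]:
    results: dict[str, str] = {}
    started: list[str] = []               # dispatch order, kept for inspection
    running: dict[asyncio.Task, str] = {}
    pending = dict(plan)

    while pending or running:
        # Task fetching: find every node whose dependencies are all finished.
        ready = [n for n, (_, deps) in pending.items()
                 if all(d in results for d in deps)]
        if not ready and not running:
            raise ValueError("cycle or missing dependency in plan")
        for node in ready:
            tool, deps = pending.pop(node)
            inputs = {d: results[d] for d in deps}
            running[asyncio.create_task(run_tool(tool, inputs))] = node
            started.append(node)
        # Wait for at least one running node, then re-check for unblocked work.
        done, _ = await asyncio.wait(running, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            results[running.pop(task)] = task.result()
    return results, started

results, started = asyncio.run(execute_dag(PLAN))
print(started)
```

With this plan the two searches are dispatched together, `compare` starts only once both have returned, and `report` starts after `compare`. A plan with a cycle, or an edge to a node that does not exist, is rejected rather than left to hang.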
That is the difference between fan-out and a DAG: fan-out is a DAG with no edges between the parallel nodes, and LLMCompiler handles the general case where some edges exist. The reported gains, 1.8x-3.7x lower latency and up to 6x lower cost, come from two sources. The latency gain comes from parallel execution removing wall-clock idle time. The cost reduction comes from collapsing many model turns (each a full prompt-plus-context inference) into one planning pass plus a single synthesis pass, which matters more as context grows, a problem we go deep on in keeping long-running agents from drowning in token bloat.
The risk concentrates in the planner: the whole speedup rests on the edges being correct. A missed edge (two calls treated as independent when one needs the other) causes wrong-order execution and a bad result. A spurious edge needlessly serializes work and silently eats the speedup. The DAG is only ever as good as the planner's dependency model.
PASTE: Speculative Tool Execution
LLMCompiler still waits for the planner to finish before it executes. PASTE (speculative tool execution, March 2026) attacks the remaining latency by refusing to wait at all where it can avoid it. The agent starts tool calls before the prior step is confirmed, betting on the most likely next path, and aborts the speculative work if the actual path turns out different.
This is branch prediction borrowed from CPU design. When the agent is fairly sure the next step will be "fetch the user's order history," it kicks that fetch off speculatively while the current step is still resolving. If the guess was right, the result is already in hand and the confirmation round-trip is saved. If wrong, the speculative result is discarded. The measured speedups are 1.8x-3.7x wall-clock, because the agent stops paying a serial confirmation latency before every probable call.
The constraint is unforgiving: speculated tools must be idempotent or trivially abortable. Speculatively reading a file, a row, or an API GET is safe, because discarding the result costs nothing. Speculatively sending an email, charging a card, or writing a record is not, because aborting cannot un-send the email. The correct pattern is to speculate aggressively on read-only, side-effect-free paths and gate every write behind explicit confirmation. PASTE buys real latency on the read-heavy path; it must never touch the write path.
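The read/write gate can be sketched as follows. This is not PASTE's implementation, only an illustration of the rule under assumed names: `READ_ONLY`, `run_tool`, and `confirm_next_step` are hypothetical, and the "prediction" is passed in by the caller rather than produced by a model.

```python
import asyncio

# Hypothetical registry: only tools listed here may be started speculatively.
READ_ONLY = {"fetch_order_history", "get_profile"}
calls: list[str] = []   # records tool calls that ran to completion

async def run_tool(name: str) -> str:
    await asyncio.sleep(0.1)       # stand-in for network latency
    calls.append(name)
    return f"{name}:ok"

async def confirm_next_step(actual: str) -> str:
    # Stand-in for the model turn that decides the real next step.
    await asyncio.sleep(0.01)
    return actual

async def step_with_speculation(predicted: str, actual: str) -> str:
    speculative = None
    if predicted in READ_ONLY:
        # Start the likely next call before the current step is confirmed.
        speculative = asyncio.create_task(run_tool(predicted))
    confirmed = await confirm_next_step(actual)
    if speculative is not None and confirmed == predicted:
        return await speculative        # guess was right: reuse the result
    if speculative is not None:
        speculative.cancel()            # guess was wrong: abort and discard
        await asyncio.gather(speculative, return_exceptions=True)
    # Confirmed path. Every write lands here, never on the speculative path.
    return await run_tool(confirmed)
```

A correct guess on a read returns the already-running call's result. A wrong guess cancels the speculative read before it completes and runs the confirmed tool instead. A predicted write such as `send_email` is never started early, however confident the prediction.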
Comparing the Three Approaches
The three patterns are not competitors so much as a progression, each handling a case the previous one cannot, at the cost of more machinery and more failure surface. Choosing among them is a function of how independent your work is and how safe your tools are to run early.
| Approach | What it parallelizes | Reported gain | Best for | Hard constraint |
|---|---|---|---|---|
| Fan-out/fan-in | Fully independent subtasks | 36-50% less wall-clock time | Research, content, multi-source lookup | Subtasks must be truly independent |
| LLMCompiler DAG | Independent + partially dependent calls | 1.8x-3.7x latency speedup, up to 6x lower cost | Mixed workflows with some dependency edges | Planner must model dependencies correctly |
| PASTE speculative | The next probable call, before confirmation | 1.8x-3.7x wall-clock speedup | Read-heavy, predictable agent paths | Speculated tools must be idempotent/abortable |
The decision rule that falls out of this table: use fan-out/fan-in by default for independent segments, reach for an LLMCompiler-style DAG when partial dependencies make naive fan-out unsafe, and layer PASTE on top only for read-dominant paths where the next step is predictable and side-effect-free. Stacking all three on a workload that does not need them turns a latency win into an unmaintainable scheduler.
Modeling the Dependency Graph Without Creating Races
Every one of these patterns lives or dies on the dependency graph. Getting the edges right is the real engineering work, and the cost of getting them wrong is not slowness, it is wrong answers and corrupted state.
The analysis is concrete. For each tool call, ask: does this call need the output of another call to produce a correct result? If yes, that is a dependency edge and the calls must be ordered. If no, they can run in parallel. Both failure modes are expensive. A missed edge runs dependent work out of order and produces garbage. A phantom edge serializes independent work and quietly deletes your speedup, which is harder to catch because the system still answers correctly, just slowly.
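One way to make the cost of a phantom edge visible is to group calls into parallel "waves" and compare the wave count before and after adding the edge. The sketch below uses Python's standard-library `graphlib`; the workflow and node names are hypothetical.

```python
from graphlib import TopologicalSorter

def parallel_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group calls into waves; every call within a wave can run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()                 # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # calls whose predecessors are done
        waves.append(ready)
        sorter.done(*ready)
    return waves

# Hypothetical workflow: each node maps to the calls whose output it needs.
correct = {
    "search_docs": set(),
    "search_web": set(),
    "summarize": {"search_docs", "search_web"},
}
# Phantom edge: search_web does not actually need search_docs.
phantom = {**correct, "search_web": {"search_docs"}}

print(parallel_waves(correct))  # [['search_docs', 'search_web'], ['summarize']]
print(parallel_waves(phantom))  # [['search_docs'], ['search_web'], ['summarize']]
```

Both graphs produce a correct answer, but the phantom edge turns two waves into three. That is why this failure mode goes unnoticed: nothing breaks, the schedule just gets longer.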
The dangerous case is the hidden dependency that never shows up in the data flow. Two branches that look independent on paper can collide through shared state: both write to the same database row, both append to the same file, both mutate the same external resource. Run them in parallel and you get a race condition, a corrupted record, or a lost update. This is why idempotency on parallel write tools is a hard production constraint, not a nice-to-have. A tool is idempotent when running it twice produces the same end state as running it once. Idempotent tools survive retries, speculation, and parallel dispatch. Non-idempotent tools (anything that increments, appends, sends, or charges) must be serialized or guarded with a deduplication key, or parallelism will eventually double the side effect.
The rule we apply: parallelize reads aggressively, serialize writes by default, and only fan out writes when each tool is provably idempotent with a dedup key. Reads are safe to overlap; writes are where the race conditions live.
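A minimal sketch of a write tool guarded by a deduplication key. `Ledger` is a hypothetical in-memory stand-in for a payment or messaging backend; a real system would enforce the key in the datastore itself (for example with a unique constraint) rather than with an in-process lock.

```python
import asyncio

class Ledger:
    """Hypothetical in-memory stand-in for a backend with side effects."""

    def __init__(self):
        self.charges: dict[str, int] = {}   # dedup key -> amount charged
        self.lock = asyncio.Lock()

    async def charge(self, dedup_key: str, amount: int) -> str:
        # The lock serializes check-then-write so two concurrent callers
        # cannot both pass the "already applied?" check.
        async with self.lock:
            if dedup_key in self.charges:
                return "duplicate-ignored"
            await asyncio.sleep(0.01)       # stand-in for the real write
            self.charges[dedup_key] = amount
            return "charged"

async def main():
    ledger = Ledger()
    # The same logical write dispatched twice, e.g. a retry racing the original.
    outcomes = await asyncio.gather(
        ledger.charge("order-42:charge", 1999),
        ledger.charge("order-42:charge", 1999),
    )
    return ledger, outcomes

ledger, outcomes = asyncio.run(main())
print(outcomes, ledger.charges)
```

Running the write twice leaves the same end state as running it once, which is the definition of idempotency given above. Without the key check, the second call would record a second charge.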
Fan-In: Merge Strategies, Timeouts, and Cancellation
Fanning out is the easy half. Fan-in (deciding what to do when branches return at different times, partially fail, or hang) is where production parallelism actually gets hard, and it is a product decision more than a technical default.
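Pick the merge policy before you parallelize, because the right one depends on what the merge needs: whether the output is only valid with every branch present, or whether a best-effort merge of the branches that returned is acceptable.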
Whichever policy you choose, two mechanics are non-negotiable. First, per-branch timeouts plus cancellation, so a single stuck tool call cannot hold the entire fan-out hostage at the latency of its slowest member. Without timeouts, fan-out converts your latency to the worst-case branch, the exact tail-latency trap that parallelism was supposed to fix. Second, stream partial results to the user wherever the merge allows it, so a 95th-percentile slow branch does not block visible progress while the rest of the answer is already available. A fan-out that blocks on its slowest branch with no streaming has reintroduced the sequential problem in a more confusing shape.
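Per-branch timeouts, cancellation, and the merge policy can be sketched together. The `require_all` flag, the `branch` helper, and the return shape are illustrative assumptions; the two policies shown (all-or-nothing and best-effort) are examples, not an exhaustive list.

```python
import asyncio

async def branch(name: str, delay: float, fail: bool = False) -> str:
    # Hypothetical tool call; delay and fail simulate slow and broken branches.
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name}:data"

async def fan_in(branches: dict, timeout: float, require_all: bool) -> dict:
    async def guarded(coro):
        # wait_for cancels the underlying call when the timeout expires,
        # so one stuck branch cannot hold the whole fan-out open.
        return await asyncio.wait_for(coro, timeout)

    names = list(branches)
    outcomes = await asyncio.gather(
        *(guarded(branches[n]) for n in names), return_exceptions=True
    )
    ok = {n: r for n, r in zip(names, outcomes)
          if not isinstance(r, BaseException)}
    missing = [n for n in names if n not in ok]
    if require_all and missing:
        # All-or-nothing policy: a partial merge would be wrong, so fail loudly.
        raise RuntimeError(f"merge needs every branch; missing {missing}")
    # Best-effort policy: merge what arrived and report what did not.
    return {"results": ok, "missing": missing}

out = asyncio.run(fan_in(
    {"fast": branch("fast", 0.01),
     "slow": branch("slow", 5.0),
     "bad": branch("bad", 0.01, fail=True)},
    timeout=0.2, require_all=False,
))
print(out)
```

In best-effort mode the merge completes at the timeout with the one healthy branch and an explicit list of what is missing, rather than waiting five seconds for the slow branch. Reporting the missing branches matters as much as the timeout: a silent partial merge is harder to debug than a failed one.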
When DAG Orchestration Backfires
Parallelism is not free, and the failure cases are sharp enough that the right answer is sometimes "stay sequential." Three situations turn a DAG into a liability.
The first is the irreducibly sequential workload. If every step genuinely needs the previous step's output, the critical path is the floor, and a DAG adds planning overhead and debugging complexity for a speedup near zero. The critical path length, not the model, is the latency floor of a sequential chain. Wrapping a chain in a scheduler does not shorten the chain.
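The floor is easy to compute before committing to a scheduler. The sketch below finds the critical path length of a dependency graph; the durations and edges are made-up illustrations, and the function assumes the graph is acyclic.

```python
from functools import lru_cache

def critical_path_seconds(durations: dict[str, float],
                          deps: dict[str, list[str]]) -> float:
    """Longest dependency chain by duration: the floor for any scheduler.

    Assumes the graph is acyclic; a cycle would recurse without end.
    """
    @lru_cache(maxsize=None)
    def finish(node: str) -> float:
        # Earliest finish = own duration + latest finish among predecessors.
        return durations[node] + max(
            (finish(d) for d in deps.get(node, [])), default=0.0
        )
    return max(finish(n) for n in durations)

durations = {"a": 2.0, "b": 2.0, "c": 2.0, "d": 2.0}
chain = {"b": ["a"], "c": ["b"], "d": ["c"]}   # every step needs the previous one
independent: dict[str, list[str]] = {}          # no edges at all

print(critical_path_seconds(durations, chain))        # 8.0
print(critical_path_seconds(durations, independent))  # 2.0
```

If the critical path is close to the sum of all durations, as in the chain case, no amount of scheduling will help and the sequential loop is the simpler choice.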
The second is non-idempotent tools, covered above but worth repeating because it is the most common production incident: a retried or speculated write that duplicates a side effect. Double-charges, double-sends, and duplicate records are the signature failures of premature write parallelism. If your tools have unprotected side effects, parallelism introduces race conditions that cost far more than the latency they save.
The third is debugging cost, which is underrated until the first parallel outage. A sequential trace reads top to bottom: step three failed, here is why. A parallel failure trace has interleaved branches, partial results, and a merge that consumed a poisoned input from one of several concurrent paths. Root-causing which branch corrupted the merge, in what order, with which partial state, is materially harder than reading a linear log. This is the same accuracy-versus-reliability gap we map in why agent reliability lags accuracy in production: a DAG can pass every offline benchmark and still be operationally fragile because its failure modes are harder to observe and reproduce.
The honest default: start sequential, measure where wall-clock time actually goes, and introduce fan-out/fan-in only on the independent segments that dominate the latency budget. The fastest agent is not the most parallel one. It is the one that parallelizes exactly the independent work and serializes everything else.
Implementation Notes
The patterns above are concrete, and so is the tooling. A few implementation realities are worth flagging for anyone building this.
In LangGraph, an LLMCompiler-style DAG is expressed as a graph where nodes are tool calls and edges are dependencies, and the runtime dispatches independent nodes concurrently rather than walking them in sequence. The discipline is making dependencies explicit in the graph definition rather than relying on the model to serialize them implicitly through its turn order.
In TypeScript, simple async fan-out is just Promise.all over the independent calls with a merge on the resolved array, plus Promise.allSettled and per-call AbortController timeouts when you need best-effort merges and cancellation. That covers fan-out/fan-in without a framework. The moment you have partial dependencies, you want a real scheduler rather than hand-rolled promise chains, because the dependency bookkeeping is where hand-written versions break.
For teams that want the dependency graph checked before runtime, the WorkflowBuilder API in Microsoft Agent Framework offers type-safe directed-graph orchestration with first-class support for sequential, parallel/fan-out, conditional, and handoff edges. Type-checking the workflow graph catches a whole class of mis-modeled-dependency bugs before they reach production, which is exactly the failure class that makes DAGs risky. The broader research direction, including latency-aware scheduling that accounts for per-tool cost, is captured in work like arXiv 2601.10560 on latency-aware orchestration.
Mapping a real agent's tool-call dependency graph, then deciding what to parallelize and what to keep serial, is the work Particula Tech runs as a fixed-scope latency audit: we trace where wall-clock time actually goes, identify the genuinely independent fan-outs, flag the non-idempotent write tools that must stay sequential, and ship a parallel execution plan with the idempotency guards already specified. The deliverable is a measured speedup with the race conditions designed out, not a scheduler bolted onto a workload that never needed one. The full set of agent patterns this fits into lives in our AI Agents pillar.
The short version: in 2026, chained tool calls are the latency bottleneck, fan-out/fan-in is the 36-50% default win, LLMCompiler and PASTE push it to 1.8x-3.7x, and idempotency is the line between a 2x speedup and a 2 a.m. incident. Parallelize the reads. Serialize the writes. Measure before you promise.
Frequently Asked Questions
Quick answers to common questions about this topic
What is fan-out/fan-in in agent orchestration?
Fan-out/fan-in is a parallel orchestration pattern where an agent dispatches independent subtasks simultaneously (fan-out), then collects and merges their results (fan-in). It is the most widely deployed parallel pattern in 2026 agent stacks because it maps cleanly onto research and content workflows: search five sources at once instead of sequentially, summarize each in parallel, then aggregate. Production studies show 36-50% wall-clock reductions on common workflows. A pipeline of five independent 2-second tool calls drops from roughly 10 seconds sequential to about 2 seconds plus aggregation overhead. The pattern only works when the subtasks are genuinely independent. If subtask B needs subtask A's output, you have a dependency edge, not a fan-out, and you need full DAG scheduling instead.



