What is the fan-out/fan-in pattern for AI agents?

Fan-out/fan-in is a parallel orchestration pattern where an agent dispatches independent subtasks simultaneously (fan-out), then collects and merges their results (fan-in). It is the most widely deployed parallel pattern in 2026 agent stacks because it maps cleanly onto research and content workflows: search five sources at once instead of sequentially, summarize each in parallel, then aggregate. Production studies show 36-50% wall-clock reductions on common workflows. A pipeline of five independent 2-second tool calls drops from roughly 10 seconds sequential to about 2 seconds plus aggregation overhead. The pattern only works when the subtasks are genuinely independent. If subtask B needs subtask A's output, you have a dependency edge, not a fan-out, and you need full DAG scheduling instead.

How much does parallel tool execution reduce agent latency?

Parallel tool execution reduces agent latency 36-50% on common content and research workflows with simple fan-out/fan-in, per production studies. DAG-based scheduling that handles partial dependencies pushes further: published benchmarks show 1.8x to 3.7x wall-clock speedups and up to 6x cost reductions when agents schedule independent work concurrently rather than emitting one tool call per turn. The exact number depends on how parallelizable your workload is. A research agent hitting many independent sources sees the high end; a tightly chained workflow where each step feeds the next sees almost none, because the critical path is irreducibly sequential. Measure your dependency graph before promising a speedup. The win lives in the wide, independent fan-outs, not in artificially parallelizing a sequential chain.

What is LLMCompiler and why does it matter in 2026?

LLMCompiler is a framework from UC Berkeley (ICML 2024) that extends fan-out into full directed-acyclic-graph scheduling: a planner LLM generates a DAG of tool calls with explicit dependency edges, then an executor runs independent nodes in parallel waves. It matters in 2026 because it became a production-adopted pattern, not just a paper. Instead of the default ReAct loop emitting one tool call per model turn (which serializes everything), LLMCompiler plans the whole call graph up front and parallelizes everything the dependencies allow. The reported gains are 1.8x-3.7x latency and up to 6x cost. The tradeoff is planning quality: if the planner mis-models a dependency, you either lose parallelism or trigger a cascading failure. The DAG is only as good as the dependency edges the planner draws.

What is PASTE speculative tool execution?

PASTE (March 2026) is a speculative execution technique for agents: the agent starts tool calls before the prior step is confirmed, betting on the likely path, and aborts if the actual path diverges. It is the agent equivalent of CPU branch prediction. The reported speedups are 1.8x to 3.7x wall-clock because the agent stops waiting for each confirmation round-trip before kicking off the next probable call. The hard constraint is that speculated tools must be idempotent or trivially abortable. Speculatively reading a file is safe; speculatively sending an email or charging a card is not, because aborting cannot un-send the email. Use PASTE for read-heavy, side-effect-free paths and gate any write or external side effect behind explicit confirmation. The speedup is real but the blast radius of a wrong guess is your tool surface.

When does DAG agent orchestration backfire?

DAG orchestration backfires when dependencies are hidden, tools are non-idempotent, or failure traces become unreadable. The most common production failure is a hidden dependency the planner did not model: two 'independent' branches that both write to the same record race and corrupt it. The second is non-idempotent parallel writes, where a retried or speculated tool call duplicates a side effect (double-charges, double-sends). The third is debugging cost: a sequential trace reads top to bottom, but a parallel failure trace has interleaved branches and partial results, and root-causing which branch poisoned the merge is materially harder. The rule we follow: parallelize reads aggressively, serialize writes by default, and only fan out writes when each tool is provably idempotent with a deduplication key.

How do you handle errors when parallel agent branches fail?

Handle parallel branch failures with an explicit fan-in policy decided before you parallelize: fail-fast, best-effort, or quorum. Fail-fast cancels all sibling branches the moment one fails, which is right when the merge needs every result. Best-effort merges whatever succeeded and marks the rest as missing, which suits research aggregation where partial coverage is still useful. Quorum proceeds once N of M branches return, trading completeness for tail-latency control. Pair the policy with per-branch timeouts and cancellation so one stuck tool call does not pin the whole fan-out at its slowest member, the classic tail-latency trap. Always stream partial results to the user when the merge allows it, so a 95th-percentile slow branch does not block visible progress. The fan-in policy is a product decision, not a default.

Should every multi-agent system use DAG orchestration?

No. DAG orchestration earns its complexity only when your workload has wide, independent fan-outs and your tools are safe to run concurrently. If your agent's critical path is genuinely sequential (each step needs the last step's output), a DAG adds planning overhead and debugging difficulty for near-zero speedup, because the critical path length is the floor. If your tools have side effects and are not idempotent, parallelism introduces race conditions that cost more than the latency they save. Start sequential, measure where wall-clock time actually goes, and only introduce fan-out/fan-in on the independent segments that dominate the latency budget. The fastest agent is not the most parallel one; it is the one that parallelizes exactly the independent work and serializes everything else.


DAG Orchestration: Cut AI Agent Latency 50% with Fan-Out

Fan-out/fan-in DAG scheduling cuts agent wall-clock time 36-50%, and LLMCompiler plus PASTE speculative execution reach 1.8x-3.7x. The patterns and pitfalls.

Sebastian Mondragon · June 12, 2026 · 12 min read


The slowest part of a 2026 agent is almost never the model. It is the agent waiting on its own tool calls, one after another, because the default ReAct loop emits exactly one tool call per model turn and then blocks until the result comes back before deciding the next one. A research agent that hits five independent sources sequentially spends most of its wall-clock budget idle, holding a token-generation pipeline open while a search API returns. Model inference got faster every quarter; the chained-tool-call problem did not, and it is now the dominant latency bottleneck in production agent systems.

This is the gap that parallel agent orchestration closes. The core move is to stop treating the agent as a turn-by-turn conversation and start treating its tool calls as a dependency graph: figure out which calls actually depend on each other, run everything independent at the same time, and merge the results. The production numbers are large enough to justify changing architectures. Fan-out/fan-in scheduling cuts wall-clock time 36-50% on common content and research workflows (per a production study at zylos.ai/research/2026-04-26), and DAG-based approaches reach 1.8x-3.7x speedups with up to 6x cost reductions when agents schedule independent work concurrently instead of one call at a time.

This post is the practitioner walkthrough: the sequential tool-execution problem, fan-out/fan-in, LLMCompiler's full-DAG scheduling, PASTE speculative execution, dependency modeling without race conditions, fan-in merge strategies, and where DAG orchestration backfires badly enough that you should stay sequential. It is opinionated, because the wrong default (parallelize everything) is as expensive as the wrong default it replaces (parallelize nothing).

The Sequential Tool-Execution Problem

Start with where the time actually goes, because it is counterintuitive. Teams chasing agent latency reach for a faster or smaller model first. That helps token generation, which is rarely the bottleneck. The bottleneck is the structure of the loop.

A standard ReAct-style agent runs: think, call one tool, observe the result, think again, call the next tool. Each tool call is a network round-trip the agent waits on synchronously. If a workflow needs five tool calls and each takes two seconds, the wall-clock floor is roughly ten seconds of waiting, plus model time, even if all five calls are completely independent. The model could have fired all five at once. The loop architecture prevents it.

The fix is not a faster model. It is recognizing that "five independent two-second calls" should cost about two seconds, not ten. That collapse, from ten seconds sequential to about two seconds plus aggregation, is the canonical fan-out example, and it is why the latency conversation in 2026 moved from inference speed to orchestration topology. If your agent feels slow despite a fast model, the diagnosis is almost always serialized independent work, the same root cause we unpack in our guide to fixing slow LLM latency in production apps, applied to the tool layer instead of the token layer.
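To make the sum-versus-max arithmetic concrete, here is a minimal TypeScript sketch. The `fakeToolCall` helper is a hypothetical stand-in for a real tool call (it just sleeps), and `runSequential`, `runConcurrent`, and `timed` are illustrative names, not part of any framework.

```typescript
// Hypothetical stand-in for a tool call: resolves with `name` after `ms` milliseconds.
function fakeToolCall(name: string, ms: number): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve(name), ms));
}

// Sequential: each call waits for the previous one, so wall-clock time is
// roughly the sum of the individual durations.
async function runSequential(names: string[], ms: number): Promise<string[]> {
  const results: string[] = [];
  for (const name of names) {
    results.push(await fakeToolCall(name, ms));
  }
  return results;
}

// Concurrent: all calls start immediately, so wall-clock time is roughly the
// slowest single call. Promise.all preserves input order in its result array.
function runConcurrent(names: string[], ms: number): Promise<string[]> {
  return Promise.all(names.map((name) => fakeToolCall(name, ms)));
}

// Measures the wall-clock time of an async function in milliseconds.
async function timed<T>(fn: () => Promise<T>): Promise<{ value: T; elapsedMs: number }> {
  const start = Date.now();
  const value = await fn();
  return { value, elapsedMs: Date.now() - start };
}
```

With five names and `ms` set to 2000, `timed(() => runSequential(...))` should land near ten seconds and `timed(() => runConcurrent(...))` near two, which is the collapse described above.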

Fan-Out/Fan-In: The Default Parallel Pattern

Fan-out/fan-in is the most widely deployed parallel pattern in production agent stacks, and for good reason: it maps directly onto the two workloads agents do most, research and content generation. Fan-out dispatches a set of independent subtasks simultaneously. Fan-in collects their results and merges them into a single output.

The shape is simple. A research agent decomposes a question into five independent sub-queries, fires all five searches at once, summarizes each result in parallel, then aggregates. A content pipeline drafts five sections concurrently and stitches them. Anywhere subtasks do not depend on each other, fan-out applies, and wall-clock time drops from the sum of the subtask durations to roughly the maximum plus merge overhead.

The 36-50% reduction figure comes from exactly these workloads, where a large share of the work is independent. What makes it work is honest dependency analysis. A "fan-out" where subtask B secretly needs subtask A's output is not a fan-out, it is a two-stage DAG you have mislabeled, and running B before A completes produces wrong answers, not fast ones. Fan-out is the right tool only for the genuinely independent segment of the workload. For coordinating multiple agents rather than a single agent's tool calls, our breakdown of multi-agent orchestration that actually works in production covers the supervisor and handoff patterns that sit one level above this.

LLMCompiler: From Fan-Out to Full DAG Scheduling

Fan-out handles the easy case: fully independent work. Real workflows are messier, with partial dependencies, some calls that can run immediately and some that must wait for specific predecessors. LLMCompiler (UC Berkeley, ICML 2024) generalizes fan-out to handle that mess by treating the entire set of tool calls as a directed acyclic graph.

The mechanism has three parts. A planner LLM reads the task and emits a DAG of tool calls with explicit dependency edges, declaring up front which calls feed which. A task-fetching unit dispatches every node whose dependencies are satisfied. An executor runs those ready nodes in parallel waves, and as results return, newly unblocked nodes get dispatched. Instead of one tool call per model turn, the planner commits to the whole call graph once, and execution parallelizes everything the edges permit.
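The dispatch loop can be sketched in a few lines of TypeScript. This is a simplified illustration, not LLMCompiler's code: the graph is hand-written rather than emitted by a planner LLM, the `DagNode` shape and `runDag` name are assumptions, and the loop waits for a whole wave to finish before dispatching the next one instead of releasing each node the moment its own dependencies resolve.

```typescript
// A node in the call graph. `deps` lists the ids whose results this node needs.
// (Hypothetical shape for illustration.)
interface DagNode<T> {
  id: string;
  deps: string[];
  run: (inputs: Map<string, T>) => Promise<T>;
}

// Runs the graph in waves: every node whose dependencies are already resolved
// is dispatched concurrently, then the next wave is computed.
async function runDag<T>(
  nodes: DagNode<T>[]
): Promise<{ results: Map<string, T>; waves: string[][] }> {
  const results = new Map<string, T>();
  const waves: string[][] = [];
  let pending = [...nodes];

  while (pending.length > 0) {
    const ready = pending.filter((n) => n.deps.every((d) => results.has(d)));
    if (ready.length === 0) {
      // Nothing can make progress: the graph has a cycle or names an unknown node.
      throw new Error("Cycle or missing dependency among: " + pending.map((n) => n.id).join(", "));
    }
    waves.push(ready.map((n) => n.id));

    const finished = await Promise.all(
      ready.map(async (n) => {
        const inputs = new Map<string, T>();
        for (const d of n.deps) inputs.set(d, results.get(d) as T);
        return { id: n.id, value: await n.run(inputs) };
      })
    );
    for (const f of finished) results.set(f.id, f.value);
    pending = pending.filter((n) => !results.has(n.id));
  }
  return { results, waves };
}
```

Given two searches with no dependencies, a comparison that needs both, and a summary that needs the comparison, the recorded waves are the two searches together, then the comparison, then the summary.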

That is the difference between fan-out and a DAG: fan-out is a DAG with no edges between the parallel nodes. LLMCompiler handles the general case where some edges exist. The reported gains, 1.8x-3.7x latency and up to 6x cost reduction, come from two sources. Parallel execution removes wall-clock idle time. The cost reduction comes from collapsing many model turns (each a full prompt-plus-context inference) into one planning pass plus a single synthesis pass, which matters more as context grows, a problem we go deep on in keeping long-running agents from drowning in token bloat.

The risk concentrates in the planner: the whole speedup rests on the edges being correct. A missed edge (two calls treated as independent when one needs the other) causes wrong-order execution and a bad result. A spurious edge needlessly serializes work and silently eats the speedup. The DAG is only ever as good as the planner's dependency model.

PASTE: Speculative Tool Execution

LLMCompiler still waits for the planner to finish before it executes. PASTE (speculative tool execution, March 2026) attacks the remaining latency by refusing to wait at all where it can avoid it. The agent starts tool calls before the prior step is confirmed, betting on the most likely next path, and aborts the speculative work if the actual path turns out different.

This is branch prediction borrowed from CPU design. When the agent is fairly sure the next step will be "fetch the user's order history," it kicks that fetch off speculatively while the current step is still resolving. If the guess was right, the result is already in hand and the confirmation round-trip is saved. If wrong, the speculative result is discarded. The measured speedups are 1.8x-3.7x wall-clock, because the agent stops paying a serial confirmation latency before every probable call.

The constraint is unforgiving: speculated tools must be idempotent or trivially abortable. Speculatively reading a file, a row, or an API GET is safe, because discarding the result costs nothing. Speculatively sending an email, charging a card, or writing a record is not, because aborting cannot un-send the email. The correct pattern is to speculate aggressively on read-only, side-effect-free paths and gate every write behind explicit confirmation. PASTE buys real latency on the read-heavy path; it must never touch the write path.
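The speculate-then-confirm shape, with the read-only gate, can be sketched as follows. This is not the PASTE implementation, only an illustration of the idea described above; the `Tool` interface, its `readOnly` flag, and `runWithSpeculation` are hypothetical names, and cancellation is cooperative, so a tool has to watch its `AbortSignal` for the abort to save any work.

```typescript
// Hypothetical tool description. `readOnly` marks tools with no side effects.
interface Tool<T> {
  name: string;
  readOnly: boolean;
  run: (signal: AbortSignal) => Promise<T>;
}

// Starts `predicted` early only if it is read-only, then waits for the real decision.
// On a correct guess the speculative result is reused; otherwise the speculation is
// aborted and discarded, and the tool that was actually chosen runs normally.
async function runWithSpeculation<T>(
  predicted: Tool<T>,
  decideNext: () => Promise<Tool<T>>
): Promise<{ value: T; speculationHit: boolean }> {
  const controller = new AbortController();

  // Writes are never speculated: they only run after confirmation.
  const speculative = predicted.readOnly ? predicted.run(controller.signal) : undefined;
  // Avoid an unhandled rejection if a discarded speculation fails or is aborted.
  speculative?.catch(() => undefined);

  const actual = await decideNext();
  if (speculative !== undefined && actual.name === predicted.name) {
    return { value: await speculative, speculationHit: true };
  }

  controller.abort();
  return { value: await actual.run(new AbortController().signal), speculationHit: false };
}
```

The design choice that matters is the gate on `readOnly`: a wrong guess about a read costs one wasted request, while a side-effecting tool is never started until the decision step has actually selected it.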

Comparing the Three Approaches

The three patterns are not competitors so much as a progression, each handling a case the previous one cannot, at the cost of more machinery and more failure surface. Choosing among them is a function of how independent your work is and how safe your tools are to run early.

Approach	What it parallelizes	Reported speedup	Best for	Hard constraint
Fan-out/fan-in	Fully independent subtasks	36-50% wall-clock	Research, content, multi-source lookup	Subtasks must be truly independent
LLMCompiler DAG	Independent + partially dependent calls	1.8x-3.7x, up to 6x cost	Mixed workflows with some dependency edges	Planner must model dependencies correctly
PASTE speculative	The next probable call, before confirmation	1.8x-3.7x wall-clock	Read-heavy, predictable agent paths	Speculated tools must be idempotent/abortable

The decision rule that falls out of this table: use fan-out/fan-in by default for independent segments, reach for an LLMCompiler-style DAG when partial dependencies make naive fan-out unsafe, and layer PASTE on top only for read-dominant paths where the next step is predictable and side-effect-free. Stacking all three on a workload that does not need them turns a latency win into an unmaintainable scheduler.

Modeling the Dependency Graph Without Creating Races

Every one of these patterns lives or dies on the dependency graph. Getting the edges right is the real engineering work, and the cost of getting them wrong is not slowness, it is wrong answers and corrupted state.

The analysis is concrete. For each tool call, ask: does this call need the output of another call to produce a correct result? If yes, that is a dependency edge and the calls must be ordered. If no, they can run in parallel. Both failure modes are expensive. A missed edge runs dependent work out of order and produces garbage. A phantom edge serializes independent work and quietly deletes your speedup, which is harder to catch because the system still answers correctly, just slowly.
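One way to make that question mechanical is to have every call declare what it consumes and produces, then derive the edges instead of drawing them by hand. The sketch below assumes a hypothetical `CallSpec` shape with named values; it only catches data-flow dependencies, not collisions through shared state.

```typescript
// Hypothetical call spec: each call names the values it produces and consumes.
interface CallSpec {
  id: string;
  produces: string[];
  consumes: string[];
}

// Derives dependency edges as [from, to] pairs: `to` consumes a value that `from` produces.
// Calls with no incoming edge can be dispatched immediately.
function deriveEdges(calls: CallSpec[]): Array<[string, string]> {
  const producerOf = new Map<string, string>();
  for (const call of calls) {
    for (const value of call.produces) producerOf.set(value, call.id);
  }

  const edges: Array<[string, string]> = [];
  for (const call of calls) {
    for (const value of call.consumes) {
      const producer = producerOf.get(value);
      if (producer !== undefined && producer !== call.id) {
        edges.push([producer, call.id]);
      }
    }
  }
  return edges;
}
```

For example, if a `geocode` call produces `coords`, a `weather` call consumes `coords`, and a `news` call consumes nothing, the only derived edge is `geocode` to `weather`, and `news` is free to run alongside `geocode`.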

The dangerous case is the hidden dependency that never shows up in the data flow. Two branches that look independent on paper can collide through shared state: both write to the same database row, both append to the same file, both mutate the same external resource. Run them in parallel and you get a race condition, a corrupted record, or a lost update. This is why idempotency on parallel write tools is a hard production constraint, not a nice-to-have. A tool is idempotent when running it twice produces the same end state as running it once. Idempotent tools survive retries, speculation, and parallel dispatch. Non-idempotent tools (anything that increments, appends, sends, or charges) must be serialized or guarded with a deduplication key, or parallelism will eventually double the side effect.
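Hidden shared-state collisions can be surfaced before dispatch if each branch declares the resources it writes. The following is a minimal sketch under that assumption; `BranchSpec`, the resource-name strings, and `findWriteConflicts` are hypothetical, and the check is only as complete as the declarations are honest.

```typescript
// Hypothetical branch spec: each branch lists the resources it writes
// (for example a row key, a file path, or an external account id).
interface BranchSpec {
  id: string;
  writes: string[];
}

interface WriteConflict {
  resource: string;
  branches: [string, string];
}

// Returns every pair of branches that write the same resource.
// Conflicting pairs should be serialized rather than run in the same wave.
function findWriteConflicts(branches: BranchSpec[]): WriteConflict[] {
  const conflicts: WriteConflict[] = [];
  const writersByResource = new Map<string, string[]>();

  for (const branch of branches) {
    // De-duplicate so a branch listing a resource twice does not conflict with itself.
    for (const resource of Array.from(new Set(branch.writes))) {
      const writers = writersByResource.get(resource) ?? [];
      for (const earlier of writers) {
        conflicts.push({ resource, branches: [earlier, branch.id] });
      }
      writers.push(branch.id);
      writersByResource.set(resource, writers);
    }
  }
  return conflicts;
}
```

An empty result is not proof of safety, since an undeclared write is invisible to the check, but a non-empty result is a cheap reason to add a dependency edge before anything runs.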

The rule we apply: parallelize reads aggressively, serialize writes by default, and only fan out writes when each tool is provably idempotent with a dedup key. Reads are safe to overlap; writes are where the race conditions live.
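A deduplication key can be sketched as a thin wrapper around the write tool. This is an in-memory illustration only: `withDedup` and the `orderId` key are hypothetical, and a production guard would need a durable store or the downstream provider's own idempotency-key support, because a process restart clears this map.

```typescript
// Wraps a side-effecting tool so that repeated or concurrent calls with the same
// dedup key perform the side effect once and share the first call's result.
// In-memory sketch: the map does not survive restarts and is never pruned.
function withDedup<A, R>(
  keyOf: (args: A) => string,
  tool: (args: A) => Promise<R>
): (args: A) => Promise<R> {
  const seen = new Map<string, Promise<R>>();

  return (args: A): Promise<R> => {
    const key = keyOf(args);
    const existing = seen.get(key);
    if (existing !== undefined) {
      return existing; // Retry, speculation, or parallel duplicate: reuse, do not re-run.
    }
    const started = tool(args);
    seen.set(key, started);
    return started;
  };
}
```

Storing the promise rather than the resolved value is deliberate: two branches that fire the same write at the same instant both see the in-flight entry, so the race window between "check" and "record" is closed within a single process.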

Fan-In: Merge Strategies, Timeouts, and Cancellation

Fanning out is the easy half. Fan-in (deciding what to do when branches return at different times, partially fail, or hang) is where production parallelism actually gets hard, and it is a product decision more than a technical default.

Pick the merge policy before you parallelize, because the right one depends on what the merge needs:

Fail-fast: the moment one branch errors, cancel all siblings and surface the failure. Correct when the final answer needs every branch and a partial result is useless.

Best-effort: merge whatever succeeded, mark the rest as missing, and proceed. Correct for research aggregation, where five of six sources is still a good answer.

Quorum: proceed once N of M branches return, ignoring stragglers. Trades completeness for predictable tail latency, which matters when one slow source would otherwise pin the whole fan-out.

Whichever policy you choose, two mechanics are non-negotiable. First, per-branch timeouts plus cancellation, so a single stuck tool call cannot hold the entire fan-out hostage at the latency of its slowest member. Without timeouts, fan-out converts your latency to the worst-case branch, the exact tail-latency trap that parallelism was supposed to fix. Second, stream partial results to the user wherever the merge allows it, so a 95th-percentile slow branch does not block visible progress while the rest of the answer is already available. A fan-out that blocks on its slowest branch with no streaming has reintroduced the sequential problem in a more confusing shape.
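The three policies, each with per-branch timeouts and cancellation, can be sketched as below. The names (`Branch`, `start`, `failFast`, `bestEffort`, `quorum`) are illustrative assumptions, not a library API, and cancellation is cooperative: aborting only helps if the branch actually watches its `AbortSignal`.

```typescript
type Branch<T> = (signal: AbortSignal) => Promise<T>;

// Starts a branch with its own AbortController and a timeout that aborts and rejects it.
function start<T>(branch: Branch<T>, timeoutMs: number): { promise: Promise<T>; cancel: () => void } {
  const controller = new AbortController();
  const promise = new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort();
      reject(new Error("branch timed out"));
    }, timeoutMs);
    branch(controller.signal).then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
  return { promise, cancel: () => controller.abort() };
}

// Fail-fast: the first error (or timeout) cancels all siblings and rejects the merge.
async function failFast<T>(branches: Branch<T>[], timeoutMs: number): Promise<T[]> {
  const running = branches.map((b) => start(b, timeoutMs));
  try {
    return await Promise.all(running.map((r) => r.promise));
  } catch (err) {
    running.forEach((r) => r.cancel());
    throw err;
  }
}

// Best-effort: merge whatever succeeded; failed or timed-out branches become undefined.
function bestEffort<T>(branches: Branch<T>[], timeoutMs: number): Promise<Array<T | undefined>> {
  return Promise.all(
    branches.map((b) => start(b, timeoutMs).promise.then((v): T | undefined => v, () => undefined))
  );
}

// Quorum: resolve once `n` branches have succeeded, then cancel the stragglers.
function quorum<T>(branches: Branch<T>[], n: number, timeoutMs: number): Promise<T[]> {
  if (n > branches.length) return Promise.reject(new Error("quorum larger than branch count"));
  const running = branches.map((b) => start(b, timeoutMs));
  return new Promise<T[]>((resolve, reject) => {
    const done: T[] = [];
    let failed = 0;
    let settled = false;
    const finish = (action: () => void) => {
      settled = true;
      running.forEach((r) => r.cancel());
      action();
    };
    if (n <= 0) return finish(() => resolve(done));
    for (const r of running) {
      r.promise.then(
        (value) => {
          if (settled) return;
          done.push(value);
          if (done.length >= n) finish(() => resolve(done));
        },
        () => {
          failed++;
          // Give up as soon as too many branches have failed to ever reach n.
          if (!settled && branches.length - failed < n) finish(() => reject(new Error("quorum unreachable")));
        }
      );
    }
  });
}
```

Note that `quorum` returns results in completion order rather than input order, which is usually what a "first N sources" merge wants but is worth deciding explicitly.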

When DAG Orchestration Backfires

Parallelism is not free, and the failure cases are sharp enough that the right answer is sometimes "stay sequential." Three situations turn a DAG into a liability.

The first is the irreducibly sequential workload. If every step genuinely needs the previous step's output, the critical path is the floor, and a DAG adds planning overhead and debugging complexity for a speedup near zero. The critical path length, not the model, is the latency floor of a sequential chain. Wrapping a chain in a scheduler does not shorten the chain.

The second is non-idempotent tools, covered above but worth repeating because it is the most common production incident: a retried or speculated write that duplicates a side effect. Double-charges, double-sends, and duplicate records are the signature failures of premature write parallelism. If your tools have unprotected side effects, parallelism introduces race conditions that cost far more than the latency they save.

The third is debugging cost, which is underrated until the first parallel outage. A sequential trace reads top to bottom: step three failed, here is why. A parallel failure trace has interleaved branches, partial results, and a merge that consumed a poisoned input from one of several concurrent paths. Root-causing which branch corrupted the merge, in what order, with which partial state, is materially harder than reading a linear log. This is the same accuracy-versus-reliability gap we map in why agent reliability lags accuracy in production: a DAG can pass every offline benchmark and still be operationally fragile because its failure modes are harder to observe and reproduce.

The honest default: start sequential, measure where wall-clock time actually goes, and introduce fan-out/fan-in only on the independent segments that dominate the latency budget. The fastest agent is not the most parallel one. It is the one that parallelizes exactly the independent work and serializes everything else.
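"Measure before you parallelize" can start with a back-of-the-envelope calculation: the longest dependency chain through the graph is the latency floor, and the sum of all durations is the sequential cost. The sketch below assumes a hypothetical `TimedNode` shape with a measured duration per call, ignores model and merge time, and assumes the graph is acyclic.

```typescript
// Hypothetical node with a measured (or estimated) duration in milliseconds.
interface TimedNode {
  id: string;
  deps: string[];
  ms: number;
}

// Length of the longest dependency chain: the wall-clock floor even with unlimited
// parallelism. Assumes the graph is acyclic (a cycle would recurse without end).
function criticalPathMs(nodes: TimedNode[]): number {
  const byId = new Map(nodes.map((n) => [n.id, n] as [string, TimedNode]));
  const memo = new Map<string, number>();

  const finishTime = (id: string): number => {
    const cached = memo.get(id);
    if (cached !== undefined) return cached;
    const node = byId.get(id);
    if (node === undefined) throw new Error("unknown node: " + id);
    // A node can start only when its slowest dependency has finished.
    const startTime = node.deps.reduce((latest, d) => Math.max(latest, finishTime(d)), 0);
    const end = startTime + node.ms;
    memo.set(id, end);
    return end;
  };

  return nodes.reduce((longest, n) => Math.max(longest, finishTime(n.id)), 0);
}

// Cost of running every call one after another.
function sequentialMs(nodes: TimedNode[]): number {
  return nodes.reduce((total, n) => total + n.ms, 0);
}
```

For five independent 2000 ms calls, `criticalPathMs` is 2000 against a `sequentialMs` of 10000, so fan-out is worth it. For a five-step chain where each step depends on the previous one, both are 10000, and no scheduler will change that.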

Implementation Notes

The patterns above are concrete, and so is the tooling. A few implementation realities are worth flagging for anyone building this.

In LangGraph, an LLMCompiler-style DAG is expressed as a graph where nodes are tool calls and edges are dependencies, and the runtime dispatches independent nodes concurrently rather than walking them in sequence. The discipline is making dependencies explicit in the graph definition rather than relying on the model to serialize them implicitly through its turn order.

In TypeScript, simple async fan-out is just Promise.all over the independent calls with a merge on the resolved array, plus Promise.allSettled and per-call AbortController timeouts when you need best-effort merges and cancellation. That covers fan-out/fan-in without a framework. The moment you have partial dependencies, you want a real scheduler rather than hand-rolled promise chains, because the dependency bookkeeping is where hand-written versions break.
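A framework-free version of that paragraph might look like the sketch below. `fakeSearch` is a hypothetical tool that honors its `AbortSignal`, and `callWithTimeout` and `searchAll` are illustrative names; only `Promise.allSettled`, `AbortController`, and `AbortSignal` are standard.

```typescript
// Hypothetical tool: resolves after `ms`, or rejects early if the signal aborts.
function fakeSearch(query: string, ms: number, signal: AbortSignal): Promise<string> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve("result for " + query), ms);
    signal.addEventListener("abort", () => {
      clearTimeout(timer);
      reject(new Error("aborted: " + query));
    });
  });
}

// Gives each call its own AbortController and aborts it after `timeoutMs`.
function callWithTimeout<T>(
  call: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  return call(controller.signal).finally(() => clearTimeout(timer));
}

// Best-effort fan-out/fan-in: keep fulfilled results, list the queries that went missing.
async function searchAll(
  queries: Array<{ q: string; ms: number }>,
  timeoutMs: number
): Promise<{ results: string[]; missing: string[] }> {
  const settled = await Promise.allSettled(
    queries.map(({ q, ms }) => callWithTimeout((signal) => fakeSearch(q, ms, signal), timeoutMs))
  );

  const results: string[] = [];
  const missing: string[] = [];
  // Promise.allSettled preserves input order, so index i lines up with queries[i].
  queries.forEach(({ q }, i) => {
    const outcome = settled[i];
    if (outcome !== undefined && outcome.status === "fulfilled") results.push(outcome.value);
    else missing.push(q);
  });
  return { results, missing };
}
```

Swapping `Promise.allSettled` for `Promise.all` turns this into the strict merge that rejects on the first failure; everything beyond these two shapes (partial dependencies, waves, retries) is where a real scheduler starts to pay for itself.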

For teams that want the dependency graph checked before runtime, Microsoft WorkflowBuilder offers compile-time type-safe directed-graph orchestration with first-class support for sequential, parallel/fan-out, conditional, and handoff edges. Type-checking the workflow graph catches a whole class of mis-modeled-dependency bugs before they reach production, which is exactly the failure class that makes DAGs risky. The broader research direction, including latency-aware scheduling that accounts for per-tool cost, is captured in work like arXiv 2601.10560 on latency-aware orchestration.

Mapping a real agent's tool-call dependency graph, then deciding what to parallelize and what to keep serial, is the work Particula Tech runs as a fixed-scope latency audit: we trace where wall-clock time actually goes, identify the genuinely independent fan-outs, flag the non-idempotent write tools that must stay sequential, and ship a parallel execution plan with the idempotency guards already specified. The deliverable is a measured speedup with the race conditions designed out, not a scheduler bolted onto a workload that never needed one. The full set of agent patterns this fits into lives in our AI Agents pillar.

The short version: in 2026, chained tool calls are the latency bottleneck, fan-out/fan-in is the 36-50% default win, LLMCompiler and PASTE push it to 1.8x-3.7x, and idempotency is the line between a 2x speedup and a 2 a.m. incident. Parallelize the reads. Serialize the writes. Measure before you promise.

FAQ

Quick answers to the questions this post tends to raise.
