Durable execution makes a crashed agent resume from the exact step it died on instead of replaying every LLM call and tool side effect. Temporal is the most mature (Nexus and Multi-Region Replication GA in early 2026, 99.99% SLA) but large LLM payloads saturate workflow history and force you to wire payload codecs that offload to external storage. Inngest gives you the fastest path from TypeScript app code with step.run() async/await, but step-based pricing balloons when multi-model calls and rate-limit retries multiply your billed steps. Restate uses the same journal-and-replay model as Temporal with a lighter footprint and gives you exactly-once writes without hand-rolling app-level idempotency keys. Three rules hold across all three: LLM outputs are recorded once to history and replayed (never re-called) on recovery, workflow code must be deterministic, and every side-effecting tool call must be idempotent.
A coding agent that has been running for forty minutes, eleven tool calls deep into a refactor, hits an unhandled exception in the worker process and dies. Without durable execution, everything is gone: the plan it built, the files it already edited, the eight LLM calls you already paid for, the test run it kicked off. The retry, if you have one, starts from token zero and redoes work that may have already mutated your repo. This is the failure mode that makes agents feel unreliable in production even when the model is fine.
The fix is not a better prompt or a bigger model. It is durable execution: an engine that journals every step of the agent run to persistent storage, so a crashed agent resumes from the exact step it died on instead of replaying the whole loop. The agent's LLM outputs are recorded once and replayed from history on recovery, never re-called. A tool that already ran is not run again. A human approval that already happened is not requested twice. Three platforms own this space for application developers in 2026: Temporal, Inngest, and Restate. They share the same core model and diverge sharply on payload handling, pricing, language fit, and operational weight.
This post is the decision framework we use to scope durable execution for agent systems. We will cover why agents specifically need it (probabilistic steps, expensive tool calls, and approval gates that retries cannot survive), the architecture-specific traps in each platform, how checkpointing and resumption actually work, and a decision matrix that picks the right engine by workflow maturity, payload size, cost model, and language. The mistake we see most often is bolting an agent loop onto a generic job queue and discovering, in week one of production, that a queue is not a workflow engine.
Why Agents Need Durable Execution (and Queues Don't Cut It)
A traditional background job is short, idempotent, and cheap to retry from scratch. An agent run is none of those things. It is long (seconds to days), stateful (the plan and accumulated context are the work), and built from steps that are individually expensive and often irreversible. Retrying an agent from the top is not a safety net; it is a way to double-charge a customer and re-corrupt a file.
Three properties of agents break the naive queue-plus-retry pattern:
Steps are probabilistic and non-repeatable. Every LLM call costs money and can produce a different result each time. If your agent crashes after step 7, you do not want a retry to re-run steps 1 through 6: that is six more paid completions and six more tool invocations, and the rerun may take a different path entirely because the model is non-deterministic. Durable execution records each step's result once and replays it from the journal, so recovery is exact, not approximate.
Tool calls have side effects. Agents send emails, charge cards, open pull requests, and write database rows. A retry that re-runs a completed tool call sends the email twice. Durable execution combined with idempotent tools means a replayed step is a no-op on the external system.
Approval gates outlive the process. The moment your agent takes a consequential action, you want a human-in-the-loop approval gate in front of it. That gate might take five minutes or five days. No in-memory loop survives a deploy, a scale-down, or a crash across that window. Durable execution suspends the workflow at the gate with zero running compute and wakes it on the exact line when the approval signal arrives.
A job queue gives you at-least-once delivery and retries. It does not give you a journal, deterministic replay, suspendable multi-day waits, or exactly-once step semantics. Those are the durable execution primitives, and they are why this is its own category rather than a feature you add to BullMQ or SQS.
The Core Model: Record Once, Replay on Recovery
All three engines work the same way under the hood. Your workflow code runs, and every side-effecting operation (an LLM call, a tool invocation, a sleep, a wait-for-signal) is wrapped in a durable step. When that step completes, its result is written to a journal. If the process crashes and restarts, the engine re-runs your workflow code from the top, but every step it has already seen in the journal returns the recorded result instantly instead of executing again. The code "fast-forwards" through completed work and resumes live execution at the first unrecorded step.
This is why two rules are non-negotiable across all three platforms. First, workflow code must be deterministic: no Math.random(), no Date.now(), no direct network calls outside a durable step, because on replay the code must follow the identical path. Anything non-deterministic goes inside a step so its result is journaled. Second, the LLM output is recorded to history once and replayed on recovery, never re-called. The model's answer becomes a fixed fact in the journal the moment the step completes.
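The record-once, replay-on-recovery loop can be sketched in a few lines. This is a minimal in-memory illustration, not any engine's actual implementation: `DurableContext`, `fakeLlm`, and `agentWorkflow` are hypothetical names, and a real engine would persist the journal to durable storage rather than a `Map`.

```typescript
// Minimal in-memory sketch of journal-and-replay. All names are hypothetical.
type Journal = Map<string, unknown>;

class DurableContext {
  private journal: Journal;

  constructor(journal: Journal) {
    this.journal = journal;
  }

  // Run `fn` at most once per step id; on replay, return the recorded result.
  step<T>(id: string, fn: () => T): T {
    if (this.journal.has(id)) {
      return this.journal.get(id) as T; // fast-forward: the body is skipped
    }
    const result = fn();
    this.journal.set(id, result); // record once
    return result;
  }
}

let llmCalls = 0;
function fakeLlm(prompt: string): string {
  llmCalls += 1; // stands in for a paid completion
  return `answer #${llmCalls} to "${prompt}"`;
}

let crashOnce = true;
function agentWorkflow(ctx: DurableContext): string {
  // Non-deterministic values go inside a step so replay sees the same value.
  const startedAt = ctx.step("started-at", () => Date.now());
  const plan = ctx.step("plan", () => fakeLlm("make a plan"));
  if (crashOnce) {
    crashOnce = false;
    throw new Error("worker died after the plan step");
  }
  const summary = ctx.step("summarize", () => fakeLlm(plan));
  return `${startedAt}: ${summary}`;
}

const journal: Journal = new Map();
let firstAttemptFailed = false;
try {
  agentWorkflow(new DurableContext(journal));
} catch {
  firstAttemptFailed = true; // the journal survives the crash
}
// Recovery: re-run from the top; recorded steps return instantly.
const output = agentWorkflow(new DurableContext(journal));
```

The second run calls `fakeLlm` only for the step that never completed, so the plan is paid for once even though the workflow function ran twice.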
Temporal: The Mature Default With a Payload Trap
Temporal is the most battle-tested durable execution platform, descended from Uber's Cadence and run at scale across large engineering orgs. The model is workflow-as-code: you write a workflow function that orchestrates activities (your side-effecting steps), and Temporal's history-and-replay engine records each activity result once, so a completed activity is never re-executed on replay. Activities themselves are at-least-once by default: an attempt that fails or times out before its result is recorded is retried. In early 2026 Temporal shipped Nexus to GA (cross-namespace, cross-team service calls) and Multi-Region Replication to GA with a 99.99% SLA, which closes the last serious gaps for regulated, multi-region agent deployments.
For agents, Temporal's strengths are real. Multi-day and multi-week workflows are first-class. Signals and queries give you clean human-in-the-loop gates and live state inspection. SDKs exist for Go, Java, TypeScript, Python, and .NET, so you are not locked into one language. The operational tradeoff is weight: you run a Temporal server (or pay for Temporal Cloud) plus a worker fleet, and there is a genuine learning curve around determinism constraints and the activity-versus-workflow boundary.
The LLM Payload Saturation Trap
The single most common way teams break Temporal-for-agents is workflow history saturation. Temporal records every activity input and output in the workflow's event history, and that history has limits: a soft warning around 10,000 events or roughly 10MB, and a hard ceiling historically near 50MB. Agents move large payloads through every step: full prompts, multi-thousand-token completions, retrieved document chunks, fat tool results. Pass those directly through activity boundaries and history bloats fast, which slows replay, inflates storage cost, and can fail the workflow when it blows the cap.
The fix is the payload codec or claim-check pattern: store the large blob in external storage (S3, GCS, a database) and pass only a reference through history. Temporal's payload codec API lets you intercept payloads before they hit history, compress or offload them, and rehydrate on read. Design this before you ship, not after the first workflow dies at 50MB. We treat large-payload offloading as a default for any Temporal agent that touches documents or long completions; the same discipline behind agent memory and context management applies to the durable layer underneath them.
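The claim-check idea itself is independent of any SDK. The sketch below is a generic illustration and deliberately does not use Temporal's codec interface: `blobStore` stands in for S3, GCS, or a database, and `offload`, `rehydrate`, and the `INLINE_LIMIT` threshold are hypothetical names and values.

```typescript
// Claim-check sketch: large payloads go to external storage and only a small
// reference is recorded in workflow history. All names are hypothetical.
type PayloadRef = { kind: "ref"; key: string; size: number };
type InlinePayload = { kind: "inline"; value: string };
type HistoryPayload = PayloadRef | InlinePayload;

const blobStore = new Map<string, string>(); // stand-in for S3/GCS/a database
const INLINE_LIMIT = 4096; // characters, used here as a rough proxy for bytes

let nextKey = 0;
function offload(value: string): HistoryPayload {
  if (value.length <= INLINE_LIMIT) {
    return { kind: "inline", value }; // small enough to keep in history
  }
  const key = `blob-${nextKey++}`;
  blobStore.set(key, value);
  return { kind: "ref", key, size: value.length };
}

function rehydrate(payload: HistoryPayload): string {
  if (payload.kind === "inline") return payload.value;
  const value = blobStore.get(payload.key);
  if (value === undefined) throw new Error(`missing blob ${payload.key}`);
  return value;
}

// A long completion is stored externally; history sees only the reference.
const completion = "x".repeat(200000);
const recorded = offload(completion);
const small = offload("ok");
```

In a Temporal deployment the same two functions would sit behind the payload codec, so workflow and activity code never see the reference directly.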
Inngest: Fastest Path From App Code, Watch the Step Bill
Inngest takes the opposite stance on operational weight. There is no worker fleet to run and no separate server to operate in the self-managed sense: you write durable functions in your existing TypeScript (or Python/Go) application and Inngest's platform invokes them over HTTP in response to events. The ergonomics are the cleanest in the category for app developers: you wrap each durable boundary in step.run() with normal async/await, and Inngest memoizes the result so a re-run skips completed steps.
```typescript
// Assumes `inngest` (an Inngest client), `llm`, and `runTool` are defined elsewhere in the app.
export const researchAgent = inngest.createFunction(
  { id: "research-agent" },
  { event: "agent/research.requested" },
  async ({ event, step }) => {
    const plan = await step.run("plan", () => llm.plan(event.data.task));
    const results = [];
    for (const subtask of plan.subtasks) {
      // each step.run result is journaled; after a crash it is replayed, never re-called
      const r = await step.run(`tool-${subtask.id}`, () => runTool(subtask));
      results.push(r);
    }
    // suspend with zero compute until a human approves
    const approval = await step.waitForEvent("approval", {
      event: "agent/approved",
      timeout: "3d",
      match: "data.runId",
    });
    // waitForEvent resolves to null when the timeout elapses with no matching event
    if (!approval) {
      return { status: "approval-timed-out" };
    }
    return step.run("finalize", () => llm.synthesize(results));
  }
);
```
That waitForEvent is the human-in-the-loop gate: the function suspends for up to three days at zero running cost and wakes when the approval event lands. If no approval arrives before the timeout, the call resolves to null, so the function has to handle that branch explicitly. This is the fastest way to make an existing TypeScript agent durable.
The Step-Based Pricing Trap
Inngest prices on steps executed (plus runs), and that model interacts badly with agents in one specific way: retries and multi-model fan-out multiply your billed step count. An agent that calls a planner, then five tools, then a synthesizer is seven steps minimum. Add rate-limit retries against a busy provider, a reranker pass, a guardrail check, and a couple of model fallbacks, and a single logical run can bill fifteen to thirty steps. Multiply across thousands of runs and the bill grows on a curve you did not model from the demo, where a clean three-step function looked cheap. Before committing, instrument your real step count per run including retries, not the happy path, and price against that. The convenience is real, but so is the way the meter runs when an agent's step graph is wide and retry-heavy.
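The arithmetic is simple enough to script before you commit. The sketch below assumes, as described above, that every attempt of a step counts toward the bill; check that assumption and the real price against your own plan. The profile shape, the attempt counts, and `monthlyCost` are illustrative, not measured data.

```typescript
// Rough estimate of billed steps per run, assuming every attempt is billed.
// All numbers are illustrative placeholders, not measurements.
type StepProfile = { name: string; count: number; avgAttempts: number };

function billedStepsPerRun(profile: StepProfile[]): number {
  return profile.reduce((sum, s) => sum + s.count * s.avgAttempts, 0);
}

// The demo: planner, five tools, synthesizer, no retries.
const happyPath: StepProfile[] = [
  { name: "planner", count: 1, avgAttempts: 1 },
  { name: "tool", count: 5, avgAttempts: 1 },
  { name: "synthesizer", count: 1, avgAttempts: 1 },
];

// Production: rate-limit retries, a reranker, a guardrail, a model fallback.
const realistic: StepProfile[] = [
  { name: "planner", count: 1, avgAttempts: 2 },
  { name: "tool", count: 5, avgAttempts: 2 },
  { name: "reranker", count: 1, avgAttempts: 1 },
  { name: "guardrail", count: 1, avgAttempts: 1 },
  { name: "synthesizer", count: 1, avgAttempts: 2 },
];

const happySteps = billedStepsPerRun(happyPath); // 7
const realSteps = billedStepsPerRun(realistic); // 2 + 10 + 1 + 1 + 2 = 16

// `pricePerStep` is whatever your plan charges; pass it in rather than guess.
function monthlyCost(runs: number, stepsPerRun: number, pricePerStep: number): number {
  return runs * stepsPerRun * pricePerStep;
}
```

Feeding the function with attempt counts taken from real traces, rather than guesses like these, is what makes the estimate worth anything.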
Restate: Same Guarantees, Lighter Footprint, Exactly-Once Writes
Restate is the newest of the three and uses the same journal-and-replay model as Temporal, but ships as a single self-contained binary (the Restate server) with a notably lighter operational footprint. You write durable handlers in TypeScript, Java/Kotlin, Go, Python, or Rust, and Restate journals each step exactly like Temporal does. The pitch is "durable execution at the service boundary": Restate sits in front of your services and makes their handlers durable, resumable, and exactly-once without the heavier Temporal server plus worker topology.
The differentiator that matters most for agents is exactly-once semantics for writes without hand-rolling app-level idempotency keys. In Temporal and Inngest, the durable engine guarantees a step's result is recorded once and replayed, but it does not guarantee the underlying external write happened exactly once if a crash lands mid-call (after the side effect, before the journal write). That at-least-once edge is why you still add idempotency keys to write tools. Restate narrows this gap: its execution model gives handlers exactly-once invocation guarantees, so the amount of idempotency plumbing you write by hand drops. You should still design write tools to be safe under replay, but Restate does more of the work for you.
Restate's awakeables are its human-in-the-loop primitive (suspend, hand out a token, resume when the token is completed), and its lighter footprint makes it attractive when you do not want to operate a full Temporal cluster but you do want Temporal-grade guarantees.
Side-by-Side: The Decision Axes That Matter
The vendor pages push feature checklists. The actual decision is four axes: workflow maturity and duration, how you handle large LLM payloads, the cost model under realistic retry volume, and language fit.
A few patterns hold across the axes. Temporal earns its operational weight when workflows are long and regulated: the multi-region and SLA story is the reason large enterprises pick it, and the payload codec is mandatory homework rather than an edge case. Inngest wins on time-to-durability for TypeScript teams, but the step bill is the thing to model before you scale, not after. Restate is the sharpest pick when you want the journal-and-replay guarantee with the least operational surface and you value the exactly-once write semantics enough to adopt a younger platform.
| Axis | Temporal | Inngest | Restate |
|---|---|---|---|
| Maturity | Most mature; Cadence lineage, large-scale battle-tested | Mature for event-driven serverless workloads | Newest; production-ready, smaller install base |
| Operational weight | Heavy: server + worker fleet (or Temporal Cloud) | Lightest: functions in your app, platform-invoked | Light: single self-contained binary |
| Multi-day workflows | First-class, 99.99% SLA (Multi-Region GA) | Supported via waitForEvent / sleep | First-class via awakeables |
| Large LLM payloads | History saturates; needs payload codec / claim-check | HTTP-invoked; mind payload size, offload large blobs | Journaled; offload large blobs as with Temporal |
| Exactly-once writes | App-level idempotency keys required | App-level idempotency keys required | Built-in exactly-once; less hand-rolled idempotency |
| Cost model | Self-host infra or Temporal Cloud usage | Per-step + per-run (balloons with retries) | Self-host infra (single binary) |
| Language fit | Go, Java, TS, Python, .NET | TS-first (Python, Go) | TS, Java/Kotlin, Go, Python, Rust |
| Best for | Long, regulated, multi-region sagas | Fast durability for existing TS apps | Temporal guarantees, lighter footprint, exactly-once |
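One way to make the matrix concrete is to encode its "Best for" row as a first-pass heuristic. This is only one possible reading of the table: the `Requirements` fields and the order of the checks are assumptions made for illustration, and a real scoping decision weighs payload size and cost as well.

```typescript
// First-pass heuristic derived from the "Best for" row of the matrix.
// Field names and rule order are hypothetical; treat the result as a starting point.
type Engine = "Temporal" | "Inngest" | "Restate";

interface Requirements {
  regulatedOrMultiRegion: boolean; // long, regulated, multi-region sagas
  existingTypeScriptApp: boolean; // durability added to a TS app already in production
  wantsMinimalOps: boolean; // no server plus worker fleet to operate
  wantsBuiltInExactlyOnce: boolean; // less hand-rolled idempotency plumbing
}

function suggestEngine(r: Requirements): Engine {
  if (r.regulatedOrMultiRegion) return "Temporal";
  if (r.wantsBuiltInExactlyOnce) return "Restate";
  if (r.existingTypeScriptApp && r.wantsMinimalOps) return "Inngest";
  return r.wantsMinimalOps ? "Restate" : "Temporal";
}

const startupPick = suggestEngine({
  regulatedOrMultiRegion: false,
  existingTypeScriptApp: true,
  wantsMinimalOps: true,
  wantsBuiltInExactlyOnce: false,
});

const bankPick = suggestEngine({
  regulatedOrMultiRegion: true,
  existingTypeScriptApp: true,
  wantsMinimalOps: true,
  wantsBuiltInExactlyOnce: false,
});
```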
Checkpointing, Resumption, and Idempotent Write Tools
The mechanics of "resume from a failed tool call without re-running side effects" come down to three disciplines that apply regardless of engine.
Wrap every side effect in a durable step. The LLM call, the tool invocation, the database write, the sleep, the wait-for-signal: each is a journaled boundary. Inside the step you do the work; the engine records the result. On replay, the recorded result returns and the body never runs again. The corollary is that anything outside a step must be deterministic, because it re-executes on every replay.
Make write tools idempotent. Because the engine's guarantee is at-least-once on the external system (the crash-mid-call window), a write tool can run twice. Pass a stable idempotency key derived from the workflow run ID and step ID, so the second call is a no-op: the payment processor sees a duplicate key and returns the original charge, the email sender deduplicates, the row insert is an upsert. Restate reduces how much of this you write by hand, but the principle is universal: design the tool so a replay is safe. This is the durable-layer twin of the tool-use correctness work you do at the agent layer.
Checkpoint state, not just steps. For agents that accumulate large working state (a plan, a scratchpad, retrieved context), keep the durable history lean by checkpointing big state to external storage and journaling the reference. This is the same claim-check pattern that keeps Temporal history under its cap, and it doubles as a clean recovery point: on resume, rehydrate state from the checkpoint rather than from a 40MB history blob.
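The crash-mid-call window and the idempotency key that closes it can be shown together. This is a sketch under stated assumptions: `FakePaymentProcessor` stands in for an external API that deduplicates on a key, and `chargeStep`, `idempotencyKey`, and the run and step IDs are hypothetical names rather than any engine's API.

```typescript
// Idempotent write tool sketch. The fake processor deduplicates on the key,
// the way a real payment API with idempotency-key support would.
type Charge = { chargeId: string; amountCents: number };

class FakePaymentProcessor {
  private byKey = new Map<string, Charge>();
  chargesCreated = 0;

  charge(key: string, amountCents: number): Charge {
    const existing = this.byKey.get(key);
    if (existing) return existing; // duplicate key: return the original charge
    this.chargesCreated += 1;
    const charge: Charge = { chargeId: `ch_${this.chargesCreated}`, amountCents };
    this.byKey.set(key, charge);
    return charge;
  }
}

// Stable across retries and replays: depends only on run and step identity.
function idempotencyKey(runId: string, stepId: string): string {
  return `${runId}:${stepId}`;
}

const processor = new FakePaymentProcessor();
const stepJournal = new Map<string, Charge>(); // stand-in for the durable journal

function chargeStep(runId: string, stepId: string, crashBeforeJournal: boolean): Charge {
  const recorded = stepJournal.get(stepId);
  if (recorded) return recorded; // already journaled: replay, no external call
  const charge = processor.charge(idempotencyKey(runId, stepId), 4999);
  if (crashBeforeJournal) {
    // The side effect happened, but the result never reached the journal.
    throw new Error("crashed after the side effect, before the journal write");
  }
  stepJournal.set(stepId, charge);
  return charge;
}

let crashed = false;
try {
  chargeStep("run-42", "charge-card", true);
} catch {
  crashed = true;
}
// The engine retries the step; the key makes the second external call a no-op.
const retried = chargeStep("run-42", "charge-card", false);
```

The step body ran twice against the external system, but only one charge exists, which is the property the journal alone cannot give you.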
Approval Gates as Suspended Workflows
The cleanest payoff of durable execution for agents is the approval gate. Instead of a polling loop or a fragile in-memory wait, the workflow suspends at the gate with zero running compute and persists its full state. When the human acts in your UI, you send a signal (Temporal), an event (Inngest waitForEvent), or complete an awakeable (Restate), and the workflow wakes on the exact line where it paused. The wait is free, the resume is exact, and you get an automatic audit trail of who approved what and when. The durable suspend also makes the front end honest: instead of a spinner that lies about progress, you can surface the agent's real paused state with the UI patterns for long-running AI tasks that keep users oriented across a multi-day wait. This is the durable backbone under the fallback and escalation patterns that route low-confidence agent actions to a human, and it composes naturally with multi-agent orchestration where a supervisor agent waits on the durable completion of its sub-agents.
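Stripped of any particular SDK, the gate is a step whose result arrives from outside the process. The sketch below is a simplified illustration with hypothetical names (`approvals`, `deployWorkflow`, `deliverApproval`); a real engine would persist the recorded decision durably and journal the deploy action as its own step.

```typescript
// Approval gate sketch: the workflow returns "suspended" until a decision is
// recorded, then re-runs and continues past the gate. All names are hypothetical.
type Approval = { approvedBy: string; approved: boolean };
type RunResult =
  | { status: "suspended"; waitingOn: string }
  | { status: "done"; output: string };

const approvals = new Map<string, Approval>(); // journaled gate results
let actionsTaken = 0;

function deployWorkflow(runId: string): RunResult {
  const gate = `${runId}:approve-deploy`;
  const approval = approvals.get(gate);
  if (!approval) {
    // No decision recorded yet: release the worker and wait at zero compute.
    return { status: "suspended", waitingOn: gate };
  }
  if (!approval.approved) {
    return { status: "done", output: `rejected by ${approval.approvedBy}` };
  }
  // The consequential action runs only after approval. In a real engine this
  // would itself be a journaled step so a later replay does not repeat it.
  actionsTaken += 1;
  return { status: "done", output: `deployed, approved by ${approval.approvedBy}` };
}

// Called by the UI or webhook when the human acts: record, then resume.
function deliverApproval(runId: string, gate: string, decision: Approval): RunResult {
  if (!approvals.has(gate)) approvals.set(gate, decision); // first decision wins
  return deployWorkflow(runId);
}

const firstRun = deployWorkflow("run-7");
const resumed =
  firstRun.status === "suspended"
    ? deliverApproval("run-7", firstRun.waitingOn, { approvedBy: "dana", approved: true })
    : firstRun;
```

The recorded decision doubles as the audit trail: who approved is part of the journal entry, not a separate log.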
Recommendation by Scenario
We close every durable execution scoping conversation with concrete recommendations. They are imperfect, but they hold up more often than not:
Inngest: step.run() ergonomics are the quickest path from a working agent to a durable one. Model your real per-run step count including retries before you scale; the step bill is the surprise.
Durability is what separates an agent demo from an agent in production. The model can be perfect and the agent will still feel unreliable if a worker restart wipes forty minutes of state and a retry re-runs every paid step. The broader question of when this reliability work pays off, and how to measure it, sits in our agent reliability versus accuracy guide and the wider AI Agents pillar. At Particula Tech we treat durable execution as a default for any agent that takes real-world actions, because the alternative is an agent that works in the demo and loses its mind the first time the process dies.
Pick the engine that matches your workflow duration, payload size, cost model, and language. Wrap every side effect in a durable step, make every write idempotent, and offload large payloads before history saturates. Then your agent survives the crash that was always coming.
Frequently Asked Questions
Quick answers to common questions about this topic
Durable execution is an engine that journals every step of an agent run to persistent storage so that when the process crashes, the agent resumes from the exact step it died on instead of starting over. For agents this matters because each step is expensive and non-repeatable: an LLM call you already paid for, a tool that already charged a card, a human approval that already happened. The engine records the result of each step once and replays it from the journal on recovery, never re-executing the underlying side effect. Temporal, Inngest, and Restate all implement this with slightly different ergonomics, but the core guarantee is identical: the agent loop survives crashes, restarts, deploys, and multi-day waits without losing or duplicating work.



