Long-running agents fail when tool outputs and history accumulate past the context window. Tool results alone can reach 81% of total context, and roughly 65% of enterprise agent failures trace to context drift or memory loss. Fix it with three layers: trigger LLM summarization at 70-80% capacity (keep recent turns full-fidelity, condense the rest), offload tool results over 20k tokens to the filesystem behind path references, and checkpoint state to a git or progress file so you can rebuild after a crash.
Run an agent long enough and it does not get smarter, it gets slower, more expensive, and eventually it crashes with context_length_exceeded. The cause is almost never the model. It is context bloat: every tool call, every file read, every API response gets appended to the conversation and never leaves. By step 80 the agent is paying to re-read step 12's raw JSON on every single turn, and somewhere around step 120 the window overflows. Long-running agents need context compaction the way long-running servers need garbage collection, and most teams ship without it.
The numbers make the failure mode concrete. Across enterprise deployments, roughly 65% of agent failures get attributed to context drift or memory loss during multi-step reasoning, not to the model picking a bad action. And the single biggest contributor is tool output: in long-running loops, tool results can account for 81% of total context. That is the trap in one statistic. Once a step succeeds, its payload is dead weight, the agent already pulled out the one field it needed, but the entire response sits in history, billed and re-processed forever.
This is a how-to. I will walk through measuring token-per-turn growth so you can see the bloat before it bites, the compaction trigger that actually works (70-80% capacity, not 100%), the three converged techniques the major agent runtimes now use, when to offload tool results to disk instead of summarizing them, and the one rule that keeps compaction from quietly destroying your agent's accuracy. If you want the conceptual grounding first, our piece on context engineering replacing prompt engineering frames why this discipline now matters more than prompt wording, and the AI agents pillar covers the broader architecture.
The Failure Mode: Why Long-Running Agents Overflow or Degrade
There are two distinct failure modes, and conflating them leads to the wrong fix.
The first is the hard crash. Total tokens exceed the model's window and the API returns context_length_exceeded. This is binary and obvious. It tends to hit agents that do a lot of file reads, web fetches, or large API calls, because each of those drops a big payload into history. A 200k-token window sounds enormous until a single document read is 40k tokens and you do four of them.
The second is the soft degradation, which is sneakier and more common. The agent does not crash. It just gets worse. Long before you hit the limit, performance starts to slide because the model has to find the relevant needle in a haystack of stale tool output. This is the "context rot" effect: as the window fills, retrieval and reasoning over that context degrade even though you are technically within limits. We covered the evidence in what Chroma's context rot study proves about long context windows, and the practical takeaway is that a full window is a degraded window. You do not get to use all 200k tokens at full quality.
Why It Accumulates
The mechanics are simple and that is exactly why it is so easy to ship the bug. A standard agent loop appends every message: the model's reasoning, the tool call, the tool result, repeat. Nothing is ever removed. The conversation is the agent's only memory, so deleting from it feels dangerous, and most loop implementations default to never deleting anything. Tool results dominate because they are the verbose part. A model's reasoning step might be 200 tokens. The tool result it triggered, a database query returning 500 rows, a file read of a 2,000-line source file, a search API returning ten full documents, can be 20,000 tokens. Multiply by dozens of steps and the math is brutal. This is distinct from the conversational memory problem we cover in making agents remember context across conversations, which is about persisting state between sessions. Compaction is about surviving a single long session.
Measuring Token-Per-Turn Growth
You cannot fix bloat you cannot see. Before adding any compaction logic, instrument the loop to log, on every turn, the running total of prompt_tokens and the per-turn delta. Most observability tools (Helicone, Langfuse, Traceloop) capture this automatically; if you are rolling your own, the token counts are in every API response.
What you are looking for is the shape of the curve. Healthy agents grow roughly linearly and slowly. Bloated agents show sharp step increases that line up with tool calls. When you see a single turn add 30k tokens, that is your culprit, and it tells you whether to summarize (the result was useful but verbose) or to offload (the result was a large blob the agent barely touched).
Here is a representative profile of a coding-style agent running without compaction, the kind of growth that ends in a crash:
Notice that the model's own reasoning is a rounding error here. The window is filled by tool results, and by turn 35 you are already past the 70% threshold where compaction should have fired. The fix is not a bigger window. It is a loop that stops carrying turn 5's file read all the way to turn 60.
| Turn | Action | Per-turn tokens | Running total | % of 200k window |
|---|---|---|---|---|
| 5 | Read large source file | +28,000 | 41,000 | 21% |
| 20 | Search results (10 docs) | +35,000 | 96,000 | 48% |
| 35 | API response (full JSON) | +22,000 | 138,000 | 69% |
| 48 | Another file read | +31,000 | 171,000 | 86% |
| 60 | Tool call | +24,000 | 195,000 | 98% |
| 61 | Next tool call | (overflow) | crash | context_length_exceeded |
The Compaction Trigger: Summarize at 70-80% Capacity
The core technique that all the major runtimes converge on is the same: trigger LLM summarization when the context reaches 70-80% of the window. Not 95%, not 100%. The headroom matters for two reasons. First, the summarization call itself needs room to run, you have to send the history to the model to summarize it. Second, the very next tool result might be large enough to overflow before compaction completes, so you need a buffer.
The structure of a good compaction is two zones:
Keep the active window verbatim. The most recent 2-4 turns stay at full fidelity. The agent is mid-thought on its current sub-task, and it needs high-resolution access to exactly what just happened: the last tool call, the last result, the last error. Compressing the active window is how you get an agent that forgets what it was doing two steps ago.
Replace the archive with a structured summary. Everything older than the active window gets condensed into a single summary message. This is not a vague paragraph. It is a structured record: what the overall goal is, what has been accomplished, what files or resources are in play, what errors are still open, and what the next step was going to be. The discipline is that the summary captures decisions and state, not the verbose record of how each step was executed.
This is a specialized application of the broader prompt compression techniques for context windows, but agents add a constraint that static prompts do not have: the compaction has to be reversible enough that the agent can keep acting. A summary that reads well but drops the open file path is worse than no summary at all.
The Three Converged Techniques: Anthropic, Claude Code, LangGraph
By mid-2026 the leading agent stacks have settled on overlapping patterns. Knowing all three lets you pick the right one for your runtime instead of reinventing it.
The pattern to internalize: offloading is lossless and should be your first move for large payloads; summarization is lossy and should be reserved for the historical narrative of completed steps.
Anthropic Compaction API
Anthropic ships a first-party compaction capability addressed through a model alias (compact-2026-01-12). Instead of you hand-rolling a summarization prompt, you hand the conversation to a compaction-tuned path and get back a condensed history that preserves tool-use structure. The advantage is that it is built to keep the things agents specifically depend on, tool call and result pairing, IDs, and active state, rather than optimizing for human-readable prose the way a generic "summarize this" prompt would. If you are already on Claude, this is the lowest-effort starting point. For model and API specifics, always check Anthropic's current docs, since aliases and behavior shift between releases.
Claude Code's Three-Tier Approach
Claude Code (whose architecture became partly public, see our agent architecture lessons from the Claude Code source leak) uses a layered strategy worth copying:
- 1. Tool-result trimming. Verbose tool outputs are trimmed or truncated once consumed, so a giant result does not persist at full size.
- 2. Cache-friendly layout. The context is arranged so the stable prefix stays constant, which preserves prompt caching and keeps cost down. Compaction that reshuffles everything destroys your cache hit rate, a regression we documented in why Anthropic cache hit rates collapse in a related context.
- 3. Nine-section structured summary. When a full compaction is needed, it writes a summary with explicit sections (goal, progress, current state, open issues, next steps, and so on) rather than freeform text. The sectioning is what makes it reliable: each field has a job, so nothing important gets dropped to save space.
LangGraph: Write, Select, Compress, Isolate
LangGraph frames context management as four operations, which is a useful mental model regardless of framework: The practical guidance that falls out of this: offload tool results over 20k tokens to the filesystem and keep only a path reference in context. We will dig into that next, because it is the technique that scales further than summarization alone. Here is how the approaches compare:
- Write state to an external store rather than holding it all in the prompt.
- Select only the relevant pieces back into context for the current step.
- Compress what does need to be in context (this is summarization).
- Isolate state across sub-agents so each one carries only its slice.
| Technique | Mechanism | Best for | Lossy? |
|---|---|---|---|
| Anthropic compaction API | First-party condense, preserves tool structure | Claude-based agents, low effort | Yes (managed) |
| Claude Code 3-tier | Trim + cache-friendly + 9-section summary | Coding agents, long sessions | Partly (trim) + summary |
| LangGraph write/select/compress/isolate | External store + selective recall | Multi-agent, large state | No (offload) / Yes (compress) |
| Filesystem offload (>20k) | Write payload to disk, pass path | Large tool results, documents | No |
Offloading Large Tool Results to the Filesystem
Summarization throws information away. For a 40k-token document the agent might need to re-read later, that is the wrong trade. The better move is to offload: write the full payload to disk (or an object store), and put a short reference in context instead.
The rule of thumb that has converged across stacks: if a tool result exceeds roughly 20,000 tokens, do not inline it. Instead:
./agent_workspace/tool_results/turn_34_search.json.read_file tool so it can pull the full payload back on demand if a later step needs it.This flips the economics. Instead of paying for 40k tokens on every subsequent turn, you pay for a 50-token reference, and only re-load the full thing in the rare case the agent actually needs it again. For documents, search corpora, large API responses, and file reads, offloading is strictly better than summarization because nothing is lost.
There is a design subtlety. The reference has to carry enough metadata for the agent to decide whether to re-read. "Tool result saved to disk" is useless. "Search returned 10 results about authentication middleware, saved to turn_34_search.json, top result: rate-limiter.ts" lets the agent reason about whether it needs the full payload without spending the tokens to load it. Treat the inline reference as an index entry, not a placeholder.
This is also where filesystem offload meets retrieval: the workspace becomes a small, agent-owned knowledge store. If your agent's offloaded results start looking like a real corpus, the context engineering practices for managing what the agent sees become relevant, because re-reading by path is the simplest possible form of agent-controlled retrieval.
Checkpointing State to Git and Progress Files
Compaction is lossy by design, which means you need a source of truth that lives outside the lossy summary. That source of truth is a durable checkpoint: a git commit, a progress file, or both.
For coding agents, git is the natural checkpoint. After each meaningful step, commit. The diff and commit history become a perfect, full-fidelity record of what changed, completely independent of the conversation. When you compact the context, you can safely drop the verbose record of file edits because git log and git diff reconstruct the actual state. The summary says "implemented the auth middleware, committed as a1b2c3d"; the agent can always git show a1b2c3d to see exactly what it did.
For non-coding agents, a progress file plays the same role. Maintain a structured file, progress.md or state.json, that the agent updates as it works: tasks completed, decisions made, current objective, blockers. This file is never summarized, it is the ground truth. When compaction runs, the summary can point at the progress file rather than trying to encode everything itself.
The combination is powerful: compaction keeps the working context small, and the checkpoint guarantees you can rebuild full state after a crash or a bad summary. If the agent dies at step 90, it does not start over, it reads the progress file and the git history and resumes. This durability mindset connects directly to how agents persist and recall context across sessions; compaction handles the live context, while checkpoints handle survivability.
Avoiding Summary-Induced Quality Loss
This is the section that separates agents that survive compaction from agents that quietly fall apart. Summarization is lossy. The entire game is making sure it loses only the safe-to-lose information.
Never compress the active working set. Anything the agent will reference or act on in the next few steps must survive verbatim. That includes open file paths, resource IDs, the exact current task definition, unresolved errors, and any structured data (a list of records, a config object) it will use again. The summary is for completed history, not for the live task.
Preserve specifics over narrative. A bad summary says "the agent explored the codebase and made progress on the authentication feature." A good summary says "goal: add JWT auth to /api/login. Done: created auth/jwt.ts, added middleware to routes.ts (commit a1b2c3d). Open: token refresh not implemented, test in auth.test.ts line 42 failing on expired-token case. Next: implement refresh in jwt.ts." The second one lets the agent resume; the first one forces it to re-discover everything.
Keep what you cannot reconstruct, drop what you can. This is the deciding heuristic. The full text of a file you wrote? Droppable, git has it. A test failure message you have not addressed? Keep it, it is not recorded anywhere else. The reasoning behind a key architectural decision? Keep it, the agent will not re-derive it. Apply this test to every category of information before deciding whether it survives compaction.
The reason this matters so much circles back to the opening statistic: about 65% of enterprise agent failures come from context drift and memory loss. Compaction done carelessly is a direct cause of exactly that failure. Compaction done with these guardrails is the cure. The difference is entirely in what you choose to preserve.
Putting It Together: A Compaction Loop
Here is the full pattern assembled, the loop logic a production long-running agent should run:
prompt_tokens as a percentage of the window.read_file tool** so the agent can re-load any offloaded payload on demand.Start with measurement and offloading, those two alone eliminate most overflow crashes with zero accuracy risk because offloading is lossless. Add summarization only once you have a sense of which information is safe to compress, and add checkpoints before you rely on compaction in anything that runs for hours.
At Particula Tech, when we audit a long-running agent that overflows or degrades mid-task, this is the order we work in: instrument the token curve first, then offload the big payloads, then add the compaction trigger, then prove durability with checkpoints. The crashes usually stop after step two; the quality recovers after step six. A bigger context window is almost never the answer, because a full window is a degraded window, and the bloat just returns at a higher ceiling. The fix is a loop that knows how to forget the right things.
Frequently Asked Questions
Quick answers to common questions about this topic
Stop letting raw history grow unbounded and add compaction before you hit the wall. The reliable trigger is 70-80% of the model's context window, not 100%, because you need headroom for the next turn's tool output and the summarization call itself. At that threshold, keep the most recent 2-4 turns at full fidelity and replace older turns with an LLM-generated summary. Pair this with offloading large tool results (anything over roughly 20k tokens) to the filesystem and passing a path reference instead of the payload. Together these keep total tokens well under the limit so context_length_exceeded never fires, even on tasks that run for hundreds of steps.



