How do I fix context_length_exceeded errors in a long-running agent?

Stop letting raw history grow unbounded and add compaction before you hit the wall. The reliable trigger is 70-80% of the model's context window, not 100%, because you need headroom for the next turn's tool output and the summarization call itself. At that threshold, keep the most recent 2-4 turns at full fidelity and replace older turns with an LLM-generated summary. Pair this with offloading large tool results (anything over roughly 20k tokens) to the filesystem and passing a path reference instead of the payload. Together these keep total tokens well under the limit so context_length_exceeded never fires, even on tasks that run for hundreds of steps.

What is context compaction for AI agents?

Context compaction is the process of shrinking an agent's working context by summarizing or removing information that is no longer needed for the next decision, while preserving the facts the agent still depends on. It typically runs automatically when the context window reaches 70-80% capacity. The most common form replaces a long run of older tool calls and results with a structured summary, keeping recent turns verbatim. Anthropic ships a compaction API (model alias compact-2026-01-12) and Claude Code uses a three-tier approach: trim verbose tool results, keep the layout cache-friendly, then write a nine-section structured summary. The goal is to extend how long an agent can run without losing the thread.

Why do tool outputs cause so much token bloat?

Tool outputs cause token bloat because they are raw, verbose, and permanent once written into history. A single API response, file read, or search result can run tens of thousands of tokens, and across long-running agents tool results can reach 81% of total context. The trap is that once a step succeeds, that payload is dead weight: the agent has already extracted what it needed, but the full JSON or document text sits in context forever, billed on every subsequent turn. The fix is to trim or summarize tool results after they are consumed and to offload anything over 20k tokens to disk, replacing the inline payload with a short path reference the agent can re-read on demand.

When should an agent trigger summarization?

Trigger summarization at 70-80% of the context window, measured continuously as the conversation grows. Waiting until 95% or 100% is too late because the summarization call itself consumes context and the next tool result might overflow before compaction finishes. At the threshold, summarize everything except the most recent 2-4 turns, which stay verbatim so the agent keeps high-resolution access to its current sub-task. Some teams also trigger on turn count (every 30-50 steps) or on a hard token budget per phase. The percentage trigger is more robust because token-per-turn growth is uneven: a single large tool result can jump you from 50% to 75% in one step.

Does context compaction hurt agent accuracy?

It can, if you compress the wrong things. Summarization is lossy, so anything the agent still needs to act on must survive compaction verbatim: open file paths, IDs, the current task definition, unresolved errors, and any structured data it will reference again. Roughly 65% of enterprise agent failures trace to context drift or memory loss, and aggressive summarization makes that worse if it drops the wrong details. The safe pattern is to never compress the active working set, only the historical record of completed steps, and to combine compaction with a durable checkpoint (a git commit or progress file) so the ground truth lives outside the lossy summary and can be re-read at full fidelity.

How do I trim an agent's conversation history without losing important context?

Split history into two zones: an active window of the most recent 2-4 turns kept verbatim, and an archived zone you replace with a structured summary. Before compacting, extract durable facts (file paths, decisions made, errors still open, IDs) into the summary so nothing actionable is lost. Offload large tool results to the filesystem and keep only a path reference inline. Frameworks formalize this: LangGraph describes write, select, compress, and isolate operations, and Claude Code writes a nine-section summary covering goal, progress, open issues, and next steps. The key discipline is that trimming removes the verbose record of how you got here, never the state you still need to finish.

BLOG/AI AGENTS

Long-Running Agents: Fix Context Bloat With Compaction

Tool results can hit 81% of an agent's context. Trigger compaction at 70-80% capacity, offload big outputs to disk, and stop context_length_exceeded crashes.

Sebastian MondragonMAY 28, 2026 · 12 MIN READ

Long-Running Agents: Fix Context Bloat With Compaction

Run an agent long enough and it does not get smarter, it gets slower, more expensive, and eventually it crashes with context_length_exceeded. The cause is almost never the model. It is context bloat: every tool call, every file read, every API response gets appended to the conversation and never leaves. By step 80 the agent is paying to re-read step 12's raw JSON on every single turn, and somewhere around step 120 the window overflows. Long-running agents need context compaction the way long-running servers need garbage collection, and most teams ship without it.

The numbers make the failure mode concrete. Across enterprise deployments, roughly 65% of agent failures get attributed to context drift or memory loss during multi-step reasoning, not to the model picking a bad action. And the single biggest contributor is tool output: in long-running loops, tool results can account for 81% of total context. That is the trap in one statistic. Once a step succeeds, its payload is dead weight, the agent already pulled out the one field it needed, but the entire response sits in history, billed and re-processed forever.

This is a how-to. I will walk through measuring token-per-turn growth so you can see the bloat before it bites, the compaction trigger that actually works (70-80% capacity, not 100%), the three converged techniques the major agent runtimes now use, when to offload tool results to disk instead of summarizing them, and the one rule that keeps compaction from quietly destroying your agent's accuracy. If you want the conceptual grounding first, our piece on context engineering replacing prompt engineering frames why this discipline now matters more than prompt wording, and the AI agents pillar covers the broader architecture.

01 · The Failure Mode: Why Long-Running Agents Overflow or Degrade

There are two distinct failure modes, and conflating them leads to the wrong fix.

The first is the hard crash. Total tokens exceed the model's window and the API returns context_length_exceeded. This is binary and obvious. It tends to hit agents that do a lot of file reads, web fetches, or large API calls, because each of those drops a big payload into history. A 200k-token window sounds enormous until a single document read is 40k tokens and you do four of them.

The second is the soft degradation, which is sneakier and more common. The agent does not crash. It just gets worse. Long before you hit the limit, performance starts to slide because the model has to find the relevant needle in a haystack of stale tool output. This is the "context rot" effect: as the window fills, retrieval and reasoning over that context degrade even though you are technically within limits. We covered the evidence in what Chroma's context rot study proves about long context windows, and the practical takeaway is that a full window is a degraded window. You do not get to use all 200k tokens at full quality.

Why It Accumulates

The mechanics are simple and that is exactly why it is so easy to ship the bug. A standard agent loop appends every message: the model's reasoning, the tool call, the tool result, repeat. Nothing is ever removed. The conversation is the agent's only memory, so deleting from it feels dangerous, and most loop implementations default to never deleting anything. Tool results dominate because they are the verbose part. A model's reasoning step might be 200 tokens. The tool result it triggered, a database query returning 500 rows, a file read of a 2,000-line source file, a search API returning ten full documents, can be 20,000 tokens. Multiply by dozens of steps and the math is brutal. This is distinct from the conversational memory problem we cover in making agents remember context across conversations, which is about persisting state between sessions. Compaction is about surviving a single long session.

02 · Measuring Token-Per-Turn Growth

You cannot fix bloat you cannot see. Before adding any compaction logic, instrument the loop to log, on every turn, the running total of prompt_tokens and the per-turn delta. Most observability tools (Helicone, Langfuse, Traceloop) capture this automatically; if you are rolling your own, the token counts are in every API response.

What you are looking for is the shape of the curve. Healthy agents grow roughly linearly and slowly. Bloated agents show sharp step increases that line up with tool calls. When you see a single turn add 30k tokens, that is your culprit, and it tells you whether to summarize (the result was useful but verbose) or to offload (the result was a large blob the agent barely touched).

Here is a representative profile of a coding-style agent running without compaction, the kind of growth that ends in a crash:

Notice that the model's own reasoning is a rounding error here. The window is filled by tool results, and by turn 35 you are already past the 70% threshold where compaction should have fired. The fix is not a bigger window. It is a loop that stops carrying turn 5's file read all the way to turn 60.

Turn	Action	Per-turn tokens	Running total	% of 200k window
5	Read large source file	+28,000	41,000	21%
20	Search results (10 docs)	+35,000	96,000	48%
35	API response (full JSON)	+22,000	138,000	69%
48	Another file read	+31,000	171,000	86%
60	Tool call	+24,000	195,000	98%
61	Next tool call	(overflow)	crash	`context_length_exceeded`

03 · The Compaction Trigger: Summarize at 70-80% Capacity

The core technique that all the major runtimes converge on is the same: trigger LLM summarization when the context reaches 70-80% of the window. Not 95%, not 100%. The headroom matters for two reasons. First, the summarization call itself needs room to run, you have to send the history to the model to summarize it. Second, the very next tool result might be large enough to overflow before compaction completes, so you need a buffer.

The structure of a good compaction is two zones:

Keep the active window verbatim. The most recent 2-4 turns stay at full fidelity. The agent is mid-thought on its current sub-task, and it needs high-resolution access to exactly what just happened: the last tool call, the last result, the last error. Compressing the active window is how you get an agent that forgets what it was doing two steps ago.

Replace the archive with a structured summary. Everything older than the active window gets condensed into a single summary message. This is not a vague paragraph. It is a structured record: what the overall goal is, what has been accomplished, what files or resources are in play, what errors are still open, and what the next step was going to be. The discipline is that the summary captures decisions and state, not the verbose record of how each step was executed.

This is a specialized application of the broader prompt compression techniques for context windows, but agents add a constraint that static prompts do not have: the compaction has to be reversible enough that the agent can keep acting. A summary that reads well but drops the open file path is worse than no summary at all.

04 · The Three Converged Techniques: Anthropic, Claude Code, LangGraph

By mid-2026 the leading agent stacks have settled on overlapping patterns. Knowing all three lets you pick the right one for your runtime instead of reinventing it.

The pattern to internalize: offloading is lossless and should be your first move for large payloads; summarization is lossy and should be reserved for the historical narrative of completed steps.

Anthropic Compaction API

Anthropic ships a first-party compaction capability addressed through a model alias (compact-2026-01-12). Instead of you hand-rolling a summarization prompt, you hand the conversation to a compaction-tuned path and get back a condensed history that preserves tool-use structure. The advantage is that it is built to keep the things agents specifically depend on, tool call and result pairing, IDs, and active state, rather than optimizing for human-readable prose the way a generic "summarize this" prompt would. If you are already on Claude, this is the lowest-effort starting point. For model and API specifics, always check Anthropic's current docs, since aliases and behavior shift between releases.

Claude Code's Three-Tier Approach

Claude Code (whose architecture became partly public, see our agent architecture lessons from the Claude Code source leak) uses a layered strategy worth copying:

1. Tool-result trimming. Verbose tool outputs are trimmed or truncated once consumed, so a giant result does not persist at full size.
2. Cache-friendly layout. The context is arranged so the stable prefix stays constant, which preserves prompt caching and keeps cost down. Compaction that reshuffles everything destroys your cache hit rate, a regression we documented in why Anthropic cache hit rates collapse in a related context.
3. Nine-section structured summary. When a full compaction is needed, it writes a summary with explicit sections (goal, progress, current state, open issues, next steps, and so on) rather than freeform text. The sectioning is what makes it reliable: each field has a job, so nothing important gets dropped to save space.

LangGraph: Write, Select, Compress, Isolate

LangGraph frames context management as four operations, which is a useful mental model regardless of framework: The practical guidance that falls out of this: offload tool results over 20k tokens to the filesystem and keep only a path reference in context. We will dig into that next, because it is the technique that scales further than summarization alone. Here is how the approaches compare:

Write state to an external store rather than holding it all in the prompt.
Select only the relevant pieces back into context for the current step.
Compress what does need to be in context (this is summarization).
Isolate state across sub-agents so each one carries only its slice.

Technique	Mechanism	Best for	Lossy?
Anthropic compaction API	First-party condense, preserves tool structure	Claude-based agents, low effort	Yes (managed)
Claude Code 3-tier	Trim + cache-friendly + 9-section summary	Coding agents, long sessions	Partly (trim) + summary
LangGraph write/select/compress/isolate	External store + selective recall	Multi-agent, large state	No (offload) / Yes (compress)
Filesystem offload (>20k)	Write payload to disk, pass path	Large tool results, documents	No

05 · Offloading Large Tool Results to the Filesystem

Summarization throws information away. For a 40k-token document the agent might need to re-read later, that is the wrong trade. The better move is to offload: write the full payload to disk (or an object store), and put a short reference in context instead.

The rule of thumb that has converged across stacks: if a tool result exceeds roughly 20,000 tokens, do not inline it. Instead:

Write the full result to a file, for example ./agent_workspace/tool_results/turn_34_search.json.

Replace the inline result with a compact reference: a one-line description plus the path, plus maybe the first few fields the agent needs immediately.

Give the agent a read_file tool so it can pull the full payload back on demand if a later step needs it.

This flips the economics. Instead of paying for 40k tokens on every subsequent turn, you pay for a 50-token reference, and only re-load the full thing in the rare case the agent actually needs it again. For documents, search corpora, large API responses, and file reads, offloading is strictly better than summarization because nothing is lost.

There is a design subtlety. The reference has to carry enough metadata for the agent to decide whether to re-read. "Tool result saved to disk" is useless. "Search returned 10 results about authentication middleware, saved to turn_34_search.json, top result: rate-limiter.ts" lets the agent reason about whether it needs the full payload without spending the tokens to load it. Treat the inline reference as an index entry, not a placeholder.

This is also where filesystem offload meets retrieval: the workspace becomes a small, agent-owned knowledge store. If your agent's offloaded results start looking like a real corpus, the context engineering practices for managing what the agent sees become relevant, because re-reading by path is the simplest possible form of agent-controlled retrieval.

06 · Checkpointing State to Git and Progress Files

Compaction is lossy by design, which means you need a source of truth that lives outside the lossy summary. That source of truth is a durable checkpoint: a git commit, a progress file, or both.

For coding agents, git is the natural checkpoint. After each meaningful step, commit. The diff and commit history become a perfect, full-fidelity record of what changed, completely independent of the conversation. When you compact the context, you can safely drop the verbose record of file edits because git log and git diff reconstruct the actual state. The summary says "implemented the auth middleware, committed as a1b2c3d"; the agent can always git show a1b2c3d to see exactly what it did.

For non-coding agents, a progress file plays the same role. Maintain a structured file, progress.md or state.json, that the agent updates as it works: tasks completed, decisions made, current objective, blockers. This file is never summarized, it is the ground truth. When compaction runs, the summary can point at the progress file rather than trying to encode everything itself.

The combination is powerful: compaction keeps the working context small, and the checkpoint guarantees you can rebuild full state after a crash or a bad summary. If the agent dies at step 90, it does not start over, it reads the progress file and the git history and resumes. This durability mindset connects directly to how agents persist and recall context across sessions; compaction handles the live context, while checkpoints handle survivability.

07 · Avoiding Summary-Induced Quality Loss

This is the section that separates agents that survive compaction from agents that quietly fall apart. Summarization is lossy. The entire game is making sure it loses only the safe-to-lose information.

Never compress the active working set. Anything the agent will reference or act on in the next few steps must survive verbatim. That includes open file paths, resource IDs, the exact current task definition, unresolved errors, and any structured data (a list of records, a config object) it will use again. The summary is for completed history, not for the live task.

Preserve specifics over narrative. A bad summary says "the agent explored the codebase and made progress on the authentication feature." A good summary says "goal: add JWT auth to /api/login. Done: created auth/jwt.ts, added middleware to routes.ts (commit a1b2c3d). Open: token refresh not implemented, test in auth.test.ts line 42 failing on expired-token case. Next: implement refresh in jwt.ts." The second one lets the agent resume; the first one forces it to re-discover everything.

Keep what you cannot reconstruct, drop what you can. This is the deciding heuristic. The full text of a file you wrote? Droppable, git has it. A test failure message you have not addressed? Keep it, it is not recorded anywhere else. The reasoning behind a key architectural decision? Keep it, the agent will not re-derive it. Apply this test to every category of information before deciding whether it survives compaction.

The reason this matters so much circles back to the opening statistic: about 65% of enterprise agent failures come from context drift and memory loss. Compaction done carelessly is a direct cause of exactly that failure. Compaction done with these guardrails is the cure. The difference is entirely in what you choose to preserve.

08 · Putting It Together: A Compaction Loop

Here is the full pattern assembled, the loop logic a production long-running agent should run:

On every turn, measure running prompt_tokens as a percentage of the window.

On every tool result over ~20k tokens, offload it to the filesystem and replace it inline with a metadata-rich path reference.

At 70-80% capacity, compact: keep the last 2-4 turns verbatim, replace the archive with a structured (sectioned) summary that preserves paths, IDs, open errors, and decisions.

After each meaningful step, checkpoint to git (coding agents) or a progress file (everything else) so a durable source of truth lives outside the summary.

**Provide a read_file tool** so the agent can re-load any offloaded payload on demand.

Protect the working set: never summarize the current task definition or anything the agent will act on next.

Start with measurement and offloading, those two alone eliminate most overflow crashes with zero accuracy risk because offloading is lossless. Add summarization only once you have a sense of which information is safe to compress, and add checkpoints before you rely on compaction in anything that runs for hours.

At Particula Tech, when we audit a long-running agent that overflows or degrades mid-task, this is the order we work in: instrument the token curve first, then offload the big payloads, then add the compaction trigger, then prove durability with checkpoints. The crashes usually stop after step two; the quality recovers after step six. A bigger context window is almost never the answer, because a full window is a degraded window, and the bloat just returns at a higher ceiling. The fix is a loop that knows how to forget the right things.

09 · FAQ

Quick answers to the questions this post tends to raise.

BLOG/AI AGENTS

Long-Running Agents: Fix Context Bloat With Compaction

Tool results can hit 81% of an agent's context. Trigger compaction at 70-80% capacity, offload big outputs to disk, and stop context_length_exceeded crashes.

Sebastian MondragonMAY 28, 2026 · 12 MIN READ

01 · The Failure Mode: Why Long-Running Agents Overflow or Degrade

There are two distinct failure modes, and conflating them leads to the wrong fix.

Why It Accumulates

02 · Measuring Token-Per-Turn Growth

Here is a representative profile of a coding-style agent running without compaction, the kind of growth that ends in a crash:

Turn	Action	Per-turn tokens	Running total	% of 200k window
5	Read large source file	+28,000	41,000	21%
20	Search results (10 docs)	+35,000	96,000	48%
35	API response (full JSON)	+22,000	138,000	69%
48	Another file read	+31,000	171,000	86%
60	Tool call	+24,000	195,000	98%
61	Next tool call	(overflow)	crash	`context_length_exceeded`

03 · The Compaction Trigger: Summarize at 70-80% Capacity

The structure of a good compaction is two zones:

04 · The Three Converged Techniques: Anthropic, Claude Code, LangGraph

By mid-2026 the leading agent stacks have settled on overlapping patterns. Knowing all three lets you pick the right one for your runtime instead of reinventing it.

The pattern to internalize: offloading is lossless and should be your first move for large payloads; summarization is lossy and should be reserved for the historical narrative of completed steps.

Anthropic Compaction API

Claude Code's Three-Tier Approach

Claude Code (whose architecture became partly public, see our agent architecture lessons from the Claude Code source leak) uses a layered strategy worth copying:

1. Tool-result trimming. Verbose tool outputs are trimmed or truncated once consumed, so a giant result does not persist at full size.
2. Cache-friendly layout. The context is arranged so the stable prefix stays constant, which preserves prompt caching and keeps cost down. Compaction that reshuffles everything destroys your cache hit rate, a regression we documented in why Anthropic cache hit rates collapse in a related context.
3. Nine-section structured summary. When a full compaction is needed, it writes a summary with explicit sections (goal, progress, current state, open issues, next steps, and so on) rather than freeform text. The sectioning is what makes it reliable: each field has a job, so nothing important gets dropped to save space.

LangGraph: Write, Select, Compress, Isolate

Write state to an external store rather than holding it all in the prompt.
Select only the relevant pieces back into context for the current step.
Compress what does need to be in context (this is summarization).
Isolate state across sub-agents so each one carries only its slice.

Technique	Mechanism	Best for	Lossy?
Anthropic compaction API	First-party condense, preserves tool structure	Claude-based agents, low effort	Yes (managed)
Claude Code 3-tier	Trim + cache-friendly + 9-section summary	Coding agents, long sessions	Partly (trim) + summary
LangGraph write/select/compress/isolate	External store + selective recall	Multi-agent, large state	No (offload) / Yes (compress)
Filesystem offload (>20k)	Write payload to disk, pass path	Large tool results, documents	No

05 · Offloading Large Tool Results to the Filesystem

The rule of thumb that has converged across stacks: if a tool result exceeds roughly 20,000 tokens, do not inline it. Instead:

Write the full result to a file, for example ./agent_workspace/tool_results/turn_34_search.json.

Replace the inline result with a compact reference: a one-line description plus the path, plus maybe the first few fields the agent needs immediately.

Give the agent a read_file tool so it can pull the full payload back on demand if a later step needs it.

06 · Checkpointing State to Git and Progress Files

Compaction is lossy by design, which means you need a source of truth that lives outside the lossy summary. That source of truth is a durable checkpoint: a git commit, a progress file, or both.

07 · Avoiding Summary-Induced Quality Loss

08 · Putting It Together: A Compaction Loop

Here is the full pattern assembled, the loop logic a production long-running agent should run:

On every turn, measure running prompt_tokens as a percentage of the window.

On every tool result over ~20k tokens, offload it to the filesystem and replace it inline with a metadata-rich path reference.

At 70-80% capacity, compact: keep the last 2-4 turns verbatim, replace the archive with a structured (sectioned) summary that preserves paths, IDs, open errors, and decisions.

After each meaningful step, checkpoint to git (coding agents) or a progress file (everything else) so a durable source of truth lives outside the summary.

**Provide a read_file tool** so the agent can re-load any offloaded payload on demand.

Protect the working set: never summarize the current task definition or anything the agent will act on next.

09 · FAQ

Quick answers to the questions this post tends to raise.