Chroma's Context Rot research (2026) tested 18 frontier models — including the ones with 1M-token windows — and every one degraded with length. Accuracy dropped 30+ points when relevant facts landed in positions 5–15 of a 20-document context, and even with distractors masked the floor still fell 7.9% from length alone. Counterintuitively, shuffled distractors hurt less than coherent ones, which inverts the 'tidy your context' instinct most RAG teams ship with. RULER's cross-check puts effective context at 50–65% of advertised. The cap we use in production is 25–30% of the advertised window, paired with a rolling-summary + recent-turns compaction loop. Morph Compact (3,300 tok/s, 98% verbatim) is the option we reach for when the summary itself becomes the bottleneck.
A media client's research agent ran fine in eval. They were using a frontier model with a 1M-token window and feeding it the full transcript of every interview, every prior research note, every editorial style guide — about 320K tokens per call. The internal demo crushed it. Two weeks into production, the editorial team started flagging a pattern: facts from the middle of long transcripts were going missing. The agent confidently cited the wrong quarter's earnings, the wrong source, the wrong attribution. The window was big enough. The model was, by any benchmark you cared to point at, a strong long-context reasoner. The output was still wrong.
The diagnosis wasn't a prompt bug. It was context rot — and in 2026 we finally have a body of research that says exactly how bad it gets, in which positions, and at what length. Chroma's Context Rot research tested 18 frontier models across the context lengths they advertise as supported, and every single one degraded with length. The numbers are blunt enough that you should be capping your effective context budget at 25–30% of the advertised window today, not waiting for the next model release to fix it.
This post is the cap, the compaction loop, and the counterintuitive finding that should change how you assemble retrieved context.
What Chroma Actually Measured
The Chroma team ran a long-context retrieval suite — variants of the needle-in-a-haystack setup with controlled distractors — across 18 frontier models with context lengths from 1K to 1M tokens. Three results survive contact with production:
Every model degraded with length. No model held flat across its advertised window. The 1M-token models did not behave like 1M-token models past roughly 200K tokens. The 200K models started rolling over by 80–100K. The shape of the curve varied; the direction did not.
Position 5–15 of 20 is the death zone. Place the relevant document in the first or last few positions of a 20-document context and accuracy stays high. Place it in positions 5 through 15 — the middle — and accuracy drops 30+ percentage points across most models. This is the long-known "lost in the middle" effect, but the magnitude at frontier-model scale is bigger than the older Stanford and Anthropic numbers from 2023–2024 suggested. It has not gotten better with scale.
There is a length-only floor. Even when Chroma stripped distractor content and only varied total context length — so the model had nothing to be distracted by — accuracy still fell 7.9% on average from length alone. Long context is intrinsically harder, independent of what fills it. The attention mechanism has to do more work, the positional encoding becomes less precise, and the model's ability to hold a single coherent task degrades.
The implication is the part most teams miss: even with perfect retrieval, longer prompts cost you accuracy. Better RAG does not save you. Cleaner context does not save you. The only thing that saves you is keeping the prompt smaller.
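The position effect is also easy to reproduce on your own stack before you trust it. A minimal sweep looks like the sketch below; `callModel`, the distractor corpus, and the exact-match scoring are placeholders you would wire to your own provider and eval harness, but the structure (fixed question, moving needle, accuracy by position) is the part that matters.

```typescript
// Minimal position-sweep sketch: hold the question fixed, move the
// answer-bearing document through a 20-document context, score accuracy.
type SweepResult = { position: number; correct: boolean };

async function positionSweep(
  distractors: string[],                           // e.g. 19 distractor documents
  needle: string,                                  // the document containing the answer
  question: string,
  expectedAnswer: string,
  callModel: (prompt: string) => Promise<string>,  // your provider call
): Promise<SweepResult[]> {
  const results: SweepResult[] = [];
  for (let position = 0; position <= distractors.length; position++) {
    const docs = [...distractors];
    docs.splice(position, 0, needle);              // insert the needle at this slot
    const prompt =
      docs.map((d, i) => `## Source ${i + 1}\n${d}`).join("\n\n---\n\n") +
      `\n\nQuestion: ${question}`;
    const answer = await callModel(prompt);
    results.push({
      position,
      correct: answer.toLowerCase().includes(expectedAnswer.toLowerCase()),
    });
  }
  return results;
}
```

Plot accuracy against position on 50 or so questions and you will see where your own death zone starts.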
The Counterintuitive Finding
Here is the result that should make every RAG team rethink how they assemble their final prompt: shuffled distractor text hurts less than coherent distractor text.
Most retrieval pipelines try to produce a tidy, well-organized context block — chunks reordered for narrative flow, transitions smoothed, prose stitched together so the prompt reads like a briefing document. This instinct is wrong for retrieval-style tasks past 32K tokens. Coherent text gives the attention mechanism a plausible alternative story to follow. The model latches onto the well-formed paragraphs and reasons over them as if they were the answer. Obvious distractors — chunks separated by --- markers, raw concatenations, low-coherence text — are easier for the model to skip.
The same effect shows up in our own evaluations across client deployments. A pipeline that retrieves 8 chunks and concatenates them with explicit ## Source N headers consistently beats one that paraphrases the chunks into a single flowing summary, especially as the budget grows. The summary version reads better to a human reviewer, which is precisely why it loses — it reads better to the model too, and the model has no way to tell the difference between a coherent summary of irrelevant material and a coherent answer.
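In practice, the winning assembly step can stay very simple. A minimal sketch of the "raw chunks with explicit headers" pattern, with illustrative names:

```typescript
// Keep retrieved chunks clearly delimited instead of paraphrasing them
// into one flowing summary the model might mistake for the answer.
type Chunk = { sourceId: string; text: string };

function assembleContext(chunks: Chunk[]): string {
  return chunks
    .map((c, i) => `## Source ${i + 1} (${c.sourceId})\n${c.text}`)
    .join("\n\n---\n\n");
}
```

The low-coherence seams between sources are a feature, not a bug: they make irrelevant chunks easier for the model to skip.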
For context on where this fits in the broader retrieval stack, see our deep-dive on document chunking and RAG context preservation, and the broader landscape in our RAG systems pillar page.
RULER, Cross-Checked
Chroma's findings are not standalone. RULER — the most cited long-context benchmark, now with 13 task types from retrieval to multi-hop reasoning to aggregation — has been quietly publishing the same story since late 2024. The summary across the current frontier-model leaderboard:
Effective context lands at roughly 50–65% of advertised for needle-in-a-haystack and goes lower for multi-hop reasoning, aggregation, and code-style tasks where the model has to synthesize across distant chunks rather than retrieve a single fact.
In production, we cap one level tighter than RULER. We assume 25–30% of the advertised window is the budget before accuracy visibly degrades on real tasks. That gives you headroom for the gap between benchmark conditions (clean, controlled, single-task) and production (noisy, multi-turn, tool-laden, paraphrase-heavy). For more on why long-context windows have these production-side problems independent of accuracy, see our earlier piece on long-context LLM performance issues.
| Advertised window | Effective window (RULER) | Effective for hard tasks |
|---|---|---|
| 200K | 100–130K | 60–80K |
| 1M | 500–650K | 250–350K |
| 2M | 800K–1.1M | 400–500K |
The Working Cap
Translate this into a number you can put in your config file:
```typescript
const CONTEXT_BUDGETS = {
// model: [advertised, working_cap]
"claude-opus-4-7-1m": [1_000_000, 280_000],
"claude-opus-4-7": [200_000, 56_000],
"gemini-2-flash-1m": [1_000_000, 300_000],
"gpt-5-200k": [200_000, 50_000],
} as const;
```

Past the working cap, you summarize. Not at the hard limit — by then you have already shipped a degraded answer to your user. The trigger is the working cap.
This is the same discipline NVIDIA's April 2026 long-context engineering post lands on from a different angle: their internal recommendation for production retrieval-augmented agents is to keep the prompt under one-third of the model's stated window, with the rest reserved for output and headroom. The exact number varies by model and task, but the order of magnitude is consistent across independent measurements.
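To make the cap operational rather than aspirational, check the assembled prompt against the working cap before every call. A small guard on top of the `CONTEXT_BUDGETS` table above is enough; the compaction step in the usage comment is whatever loop you run, such as the one in the next section.

```typescript
// Enforce the working cap at call time; an overage means "compact now",
// not "truncate silently" or "send it anyway".
function checkBudget(
  model: keyof typeof CONTEXT_BUDGETS,
  promptTokens: number,
): { ok: boolean; workingCap: number } {
  const [, workingCap] = CONTEXT_BUDGETS[model];
  return { ok: promptTokens <= workingCap, workingCap };
}

// Usage: measure the assembled prompt with your tokenizer, then
//   const { ok, workingCap } = checkBudget("claude-opus-4-7-1m", promptTokens);
//   if (!ok) run compaction before sending the call.
```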
A Compaction Loop That Holds
The pattern we ship for long-running agents is rolling-summary + recent-turns hybrid. Pin the task invariants in a small system header. Keep the last N turns verbatim. Summarize everything older into a running brief that gets re-summarized periodically to prevent drift.
```typescript
type Turn = { role: "user" | "assistant" | "tool"; content: string; tokens: number };
type CompactionState = {
systemHeader: string; // pinned, ~500-1500 tokens
runningSummary: string; // grows + re-summarized
recentTurns: Turn[]; // last N verbatim
};
const RECENT_TURN_COUNT = 6;
const WORKING_CAP = 280_000; // for 1M model
const COMPACTION_TRIGGER = 0.6 * WORKING_CAP; // start early
const RESUMMARIZE_EVERY = 8; // turns

// Rough token estimate so the sketch is self-contained; swap in a real
// tokenizer (e.g. your provider's) in production.
const tokenCount = (text: string): number => Math.ceil(text.length / 4);
async function appendTurn(
state: CompactionState,
turn: Turn,
turnsSinceResummarize: number,
summarize: (text: string) => Promise<string>,
): Promise<{ state: CompactionState; turnsSinceResummarize: number }> {
const next: CompactionState = {
systemHeader: state.systemHeader,
runningSummary: state.runningSummary,
recentTurns: [...state.recentTurns, turn],
};
// 1. Demote the oldest verbatim turn into the summary when over N.
while (next.recentTurns.length > RECENT_TURN_COUNT) {
const evicted = next.recentTurns.shift()!;
next.runningSummary = await summarize(
`${next.runningSummary}\n\n[turn ${evicted.role}] ${evicted.content}`,
);
turnsSinceResummarize += 1;
}
// 2. Re-summarize the running brief itself periodically (drift control).
if (turnsSinceResummarize >= RESUMMARIZE_EVERY) {
next.runningSummary = await summarize(
`Tighten this running brief, preserving facts, entities, decisions:\n\n${next.runningSummary}`,
);
turnsSinceResummarize = 0;
}
// 3. Hard guard: if total still exceeds the trigger, summarize aggressively.
const total = tokenCount(next.systemHeader) +
tokenCount(next.runningSummary) +
next.recentTurns.reduce((acc, t) => acc + t.tokens, 0);
if (total > COMPACTION_TRIGGER) {
next.runningSummary = await summarize(
`Compress to fit under ${Math.floor(COMPACTION_TRIGGER * 0.4)} tokens:\n\n${next.runningSummary}`,
);
}
return { state: next, turnsSinceResummarize };
}
```

Three production gotchas this loop handles that naive compaction misses:

Evicted turns are folded into the running summary instead of silently dropped, so facts from old turns survive when the window rolls forward.

The running summary itself drifts. An append-only brief accumulates paraphrase errors and bloat; re-summarizing it every few turns keeps it tight.

Compaction triggers at 60% of the working cap, not at the hard limit, so there is headroom before a degraded answer ever ships.
For the broader thinking behind this approach — why context engineering is now its own discipline — see our piece on context engineering after the prompt era and the practical primer on prompt compression and context window optimization.
Where Morph Compact Fits
Most teams start the compaction loop with whatever small model their provider offers — Haiku, Gemini Flash, GPT-mini. That works until the summarizer becomes the bottleneck. For long-running agents that compact on every turn, you can end up paying more in summarization tokens than in the main model. Latency starts to dominate too: a 2-second compaction step on every turn turns a snappy agent into a sluggish one.
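It is worth measuring that overhead before swapping summarizers. A thin wrapper around whatever summarize function the compaction loop uses makes the per-turn cost visible; the recording callback here is a placeholder for your own metrics sink.

```typescript
// Wrap the compaction-loop summarizer to record latency and input size
// per call, so you can tell when compaction starts dominating turn cost.
type Summarize = (text: string) => Promise<string>;

function instrumented(
  summarize: Summarize,
  record: (elapsedMs: number, inputChars: number) => void, // your metrics sink
): Summarize {
  return async (text: string) => {
    const start = Date.now();
    const result = await summarize(text);
    record(Date.now() - start, text.length);
    return result;
  };
}

// Pass instrumented(summarize, record) as the summarize argument to appendTurn.
```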
Morph Compact is the purpose-built option for this slot. The published numbers are 3,300 tokens/sec throughput and 98% verbatim retention on factual content — both of which matter for the use cases where summarization typically loses information. Verbatim retention is the key metric. Standard summarizers paraphrase, which is fine for chat history and disastrous for legal text, medical notes, code blocks, and exact quantities. A 98% verbatim summary keeps the surface form when surface form is load-bearing.
When to reach for it: long-running agents (compact every turn), high-fidelity domains (legal, clinical, financial, code), and pipelines where summarization latency is on the user-visible path. When not to: chatty general-purpose agents where a small frontier model is fast enough and rephrasing is fine.
When to Skip the Long Context Entirely
The deeper question Chroma's research forces is whether you should be feeding long context to the model at all. The answer for a growing share of use cases is: not the way most teams are doing it.
Two patterns now consistently outperform "stuff everything into the window":
Compiled knowledge over retrieved context. For knowledge that is stable and queried often, compile it into a model-friendly artifact (a wiki, a set of cards, a fine-tuned adapter) rather than retrieving and stuffing on every call. Andrej Karpathy's recent push on this pattern is the cleanest articulation we have — see our deep-dive on Karpathy's LLM wiki pattern and when compiled knowledge beats RAG.
Working memory inside the window, long tail outside. Use the long-context budget for what the agent actively needs in a single task: the last few turns, the active document, the immediate tool results. Use RAG for the long tail of knowledge that the agent might need but probably will not. The advertised 1M-token window is for headroom on working memory, not for replacing the retrieval layer.
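A sketch of what that split looks like at prompt-assembly time, reusing the CompactionState and working cap from the compaction loop and the Chunk/assembleContext helpers above (all illustrative): working memory gets first claim on the budget, retrieved chunks fill what remains, and the rest of the long tail stays in the retrieval layer.

```typescript
// Working memory first, retrieved long tail second, never past the cap.
function buildPrompt(
  state: CompactionState,
  retrieved: Chunk[],              // ranked chunks from the retrieval layer
  workingCap: number,
): string {
  const workingMemory = [
    state.systemHeader,
    `## Running brief\n${state.runningSummary}`,
    ...state.recentTurns.map((t) => `[${t.role}] ${t.content}`),
  ].join("\n\n");

  let budget = workingCap - tokenCount(workingMemory);
  const included: Chunk[] = [];
  for (const chunk of retrieved) {
    const cost = tokenCount(chunk.text);
    if (cost > budget) break;      // everything else stays behind retrieval
    included.push(chunk);
    budget -= cost;
  }
  return `${workingMemory}\n\n${assembleContext(included)}`;
}
```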
The teams shipping reliable long-context agents in 2026 aren't the ones with the biggest windows. They're the ones whose effective context — the part the model actually reasons over — is small, structured, and held below 30% of advertised. That is the cap. The rest is engineering.
What to Do This Week
Three things every team running a long-context agent should ship before the end of the week:
1. Put the working cap in your config. 25–30% of the advertised window per model, with compaction triggered at the cap, not at the hard limit.
2. Ship the compaction loop. Pinned system header, rolling summary, last N turns verbatim, periodic re-summarization of the brief.
3. A/B test your context assembly. Your current paraphrased, flowing context block versus raw chunk concatenation with explicit ## Source N markers. Compare accuracy on 50 inputs. The result will surprise you.

None of this is exotic. The Chroma study just gave us the magnitudes — and the magnitudes say the working budget is much smaller than the marketing slide.
The Real Insight
The vendor pitch — "1M tokens, just put everything in" — was always a measure of what fits, not what works. Chroma's contribution is putting a number on the gap. Every model degrades. The middle of the prompt is the worst place for a fact to live. Coherent distractors hurt more than messy ones. Length alone costs you accuracy even with a clean context.
The teams that internalize this stop chasing bigger windows and start engineering smaller effective contexts. They cap early, summarize aggressively, and treat retrieval as a way to keep the prompt short, not a way to fill the window. That is the lesson the Chroma study makes unavoidable, and it is the difference between an agent that demos beautifully and one that stays correct on the 100,000th customer conversation.
Frequently Asked Questions
Quick answers to common questions about this topic
What is context rot?

Context rot is the measurable drop in accuracy as the input length grows, even when the model's advertised context window is far larger than the prompt. Chroma's 2026 study tested 18 frontier models — including ones with 1M-token windows — and found every model degraded with length. The drop is not just 'lost in the middle' (where mid-prompt facts are missed); it is a baseline accuracy floor that falls 7.9% from length alone when distractors are masked. The advertised window is the upper bound on what fits, not the upper bound on what works.


