Chroma's Context Rot research (2026) tested 18 frontier models — including the ones with 1M-token windows — and every one degraded with length. Accuracy dropped 30+ points when relevant facts landed in positions 5–15 of a 20-document context, and even with distractors masked the floor still fell 7.9% from length alone. Counterintuitively, shuffled distractors hurt less than coherent ones, which inverts the 'tidy your context' instinct most RAG teams ship with. RULER's cross-check puts effective context at 50–65% of advertised. The cap we use in production is 25–30% of the advertised window, paired with a rolling-summary + recent-turns compaction loop. Morph Compact (3,300 tok/s, 98% verbatim) is the option we reach for when the summary itself becomes the bottleneck.
A media client's research agent ran fine in eval. They were using a frontier model with a 1M-token window and feeding it the full transcript of every interview, every prior research note, every editorial style guide — about 320K tokens per call. The internal demo crushed it. Two weeks into production, the editorial team started flagging a pattern: facts from the middle of long transcripts were going missing. The agent confidently cited the wrong quarter's earnings, the wrong source, the wrong attribution. The window was big enough. The model was, by any benchmark you cared to point at, a strong long-context reasoner. The output was still wrong.
The diagnosis wasn't a prompt bug. It was context rot — and in 2026 we finally have a body of research that says exactly how bad it gets, in which positions, and at what length. Chroma's Context Rot research tested 18 frontier models across the context lengths they advertise as supported, and every single one degraded with length. The numbers are blunt enough that you should be capping your effective context budget at 25–30% of the advertised window today, not waiting for the next model release to fix it.
This post is the cap, the compaction loop, and the counterintuitive finding that should change how you assemble retrieved context.
What Chroma Actually Measured
The Chroma team ran a long-context retrieval suite — variants of the needle-in-a-haystack setup with controlled distractors — across 18 frontier models with context lengths from 1K to 1M tokens. Three results survive contact with production:
Every model degraded with length. No model held flat across its advertised window. The 1M-token models did not behave like 1M-token models past roughly 200K tokens. The 200K models started rolling over by 80–100K. The shape of the curve varied; the direction did not.
Position 5–15 of 20 is the death zone. Place the relevant document in the first or last few positions of a 20-document context and accuracy stays high. Place it in positions 5 through 15 — the middle — and accuracy drops 30+ percentage points across most models. This is the long-known "lost in the middle" effect, but the magnitude at frontier-model scale is bigger than the older Stanford and Anthropic numbers from 2023–2024 suggested. It has not gotten better with scale.
There is a length-only floor. Even when Chroma stripped distractor content and only varied total context length — so the model had nothing to be distracted by — accuracy still fell 7.9% on average from length alone. Long context is intrinsically harder, independent of what fills it. The attention mechanism has to do more work, the positional encoding becomes less precise, and the model's ability to hold a single coherent task degrades.
The implication is the part most teams miss: even with perfect retrieval, longer prompts cost you accuracy. Better RAG does not save you. Cleaner context does not save you. The only thing that saves you is keeping the prompt smaller.
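The position effect is also easy to reproduce on your own stack before you trust it. A minimal sweep looks like the sketch below; `callModel`, the distractor corpus, and the exact-match scoring are placeholders you would wire to your own provider and eval harness, but the structure (fixed question, moving needle, accuracy by position) is the part that matters.

```typescript
// Minimal position-sweep sketch: hold the question fixed, move the
// answer-bearing document through a 20-document context, score accuracy.
type SweepResult = { position: number; correct: boolean };

async function positionSweep(
  distractors: string[],                           // e.g. 19 distractor documents
  needle: string,                                  // the document containing the answer
  question: string,
  expectedAnswer: string,
  callModel: (prompt: string) => Promise<string>,  // your provider call
): Promise<SweepResult[]> {
  const results: SweepResult[] = [];
  for (let position = 0; position <= distractors.length; position++) {
    const docs = [...distractors];
    docs.splice(position, 0, needle);              // insert the needle at this slot
    const prompt =
      docs.map((d, i) => `## Source ${i + 1}\n${d}`).join("\n\n---\n\n") +
      `\n\nQuestion: ${question}`;
    const answer = await callModel(prompt);
    results.push({
      position,
      correct: answer.toLowerCase().includes(expectedAnswer.toLowerCase()),
    });
  }
  return results;
}
```

Plot accuracy against position on 50 or so questions and you will see where your own death zone starts.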
The Counterintuitive Finding
Here is the result that should make every RAG team rethink how they assemble their final prompt: shuffled distractor text hurts less than coherent distractor text.
Most retrieval pipelines try to produce a tidy, well-organized context block — chunks reordered for narrative flow, transitions smoothed, prose stitched together so the prompt reads like a briefing document. This instinct is wrong for retrieval-style tasks past 32K tokens. Coherent text gives the attention mechanism a plausible alternative story to follow. The model latches onto the well-formed paragraphs and reasons over them as if they were the answer. Obvious distractors — chunks separated by --- markers, raw concatenations, low-coherence text — are easier for the model to skip.
The same effect shows up in our own evaluations across client deployments. A pipeline that retrieves 8 chunks and concatenates them with explicit ## Source N headers consistently beats one that paraphrases the chunks into a single flowing summary, especially as the budget grows. The summary version reads better to a human reviewer, which is precisely why it loses — it reads better to the model too, and the model has no way to tell the difference between a coherent summary of irrelevant material and a coherent answer.
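In practice, the winning assembly step can stay very simple. A minimal sketch of the "raw chunks with explicit headers" pattern, with illustrative names:

```typescript
// Keep retrieved chunks clearly delimited instead of paraphrasing them
// into one flowing summary the model might mistake for the answer.
type Chunk = { sourceId: string; text: string };

function assembleContext(chunks: Chunk[]): string {
  return chunks
    .map((c, i) => `## Source ${i + 1} (${c.sourceId})\n${c.text}`)
    .join("\n\n---\n\n");
}
```

The low-coherence seams between sources are a feature, not a bug: they make irrelevant chunks easier for the model to skip.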
For context on where this fits in the broader retrieval stack, see our deep-dive on document chunking and RAG context preservation, and the broader landscape in our RAG systems pillar page.
RULER, Cross-Checked
Chroma's findings are not standalone. RULER — the most cited long-context benchmark, now with 13 task types from retrieval to multi-hop reasoning to aggregation — has been quietly publishing the same story since late 2024. The summary across the current frontier-model leaderboard:
Effective context lands at roughly 50–65% of advertised for needle-in-a-haystack and goes lower for multi-hop reasoning, aggregation, and code-style tasks where the model has to synthesize across distant chunks rather than retrieve a single fact.
In production, we cap one level tighter than RULER. We assume 25–30% of the advertised window is the budget before accuracy visibly degrades on real tasks. That gives you headroom for the gap between benchmark conditions (clean, controlled, single-task) and production (noisy, multi-turn, tool-laden, paraphrase-heavy). For more on why long-context windows have these production-side problems independent of accuracy, see our earlier piece on long-context LLM performance issues.
| Advertised window | Effective window (RULER) | Effective for hard tasks |
|---|---|---|
| 200K | 100–130K | 60–80K |
| 1M | 500–650K | 250–350K |
| 2M | 800K–1.1M | 400–500K |
The Working Cap
Translate this into a number you can put in your config file:
```typescript
const CONTEXT_BUDGETS = {
// model: [advertised, working_cap]
"claude-opus-4-7-1m": [1_000_000, 280_000],
"claude-opus-4-7": [200_000, 56_000],
"gemini-2-flash-1m": [1_000_000, 300_000],
"gpt-5-200k": [200_000, 50_000],
} as const;
```

Past the working cap, you summarize. Not at the hard limit — by then you have already shipped a degraded answer to your user. The trigger is the working cap.
This is the same discipline NVIDIA's April 2026 long-context engineering post lands on from a different angle: their internal recommendation for production retrieval-augmented agents is to keep the prompt under one-third of the model's stated window, with the rest reserved for output and headroom. The exact number varies by model and task, but the order of magnitude is consistent across independent measurements.
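To make the cap operational rather than aspirational, check the assembled prompt against the working cap before every call. A small guard on top of the `CONTEXT_BUDGETS` table above is enough; the compaction step in the usage comment is whatever loop you run, such as the one in the next section.

```typescript
// Enforce the working cap at call time; an overage means "compact now",
// not "truncate silently" or "send it anyway".
function checkBudget(
  model: keyof typeof CONTEXT_BUDGETS,
  promptTokens: number,
): { ok: boolean; workingCap: number } {
  const [, workingCap] = CONTEXT_BUDGETS[model];
  return { ok: promptTokens <= workingCap, workingCap };
}

// Usage: measure the assembled prompt with your tokenizer, then
//   const { ok, workingCap } = checkBudget("claude-opus-4-7-1m", promptTokens);
//   if (!ok) run compaction before sending the call.
```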
A Compaction Loop That Holds
The pattern we ship for long-running agents is rolling-summary + recent-turns hybrid. Pin the task invariants in a small system header. Keep the last N turns verbatim. Summarize everything older into a running brief that gets re-summarized periodically to prevent drift.
```typescript
type Turn = { role: "user" | "assistant" | "tool"; content: string; tokens: number };
type CompactionState = {
systemHeader: string; // pinned, ~500-1500 tokens
runningSummary: string; // grows + re-summarized
recentTurns: Turn[]; // last N verbatim
};
const RECENT_TURN_COUNT = 6;
const WORKING_CAP = 280_000; // for 1M model
const COMPACTION_TRIGGER = 0.6 * WORKING_CAP; // start early
const RESUMMARIZE_EVERY = 8; // turns

// Rough token estimate so the sketch is self-contained; swap in a real
// tokenizer (e.g. your provider's) in production.
const tokenCount = (text: string): number => Math.ceil(text.length / 4);
async function appendTurn(
state: CompactionState,
turn: Turn,
turnsSinceResummarize: number,
summarize: (text: string) => Promise<string>,
): Promise<{ state: CompactionState; turnsSinceResummarize: number }> {
const next: CompactionState = {
systemHeader: state.systemHeader,
runningSummary: state.runningSummary,
recentTurns: [...state.recentTurns, turn],
};
// 1. Demote the oldest verbatim turn into the summary when over N.
while (next.recentTurns.length > RECENT_TURN_COUNT) {
const evicted = next.recentTurns.shift()!;
next.runningSummary = await summarize(
`${next.runningSummary}\n\n[turn ${evicted.role}] ${evicted.content}`,
);
turnsSinceResummarize += 1;
}
// 2. Re-summarize the running brief itself periodically (drift control).
if (turnsSinceResummarize >= RESUMMARIZE_EVERY) {
next.runningSummary = await summarize(
`Tighten this running brief, preserving facts, entities, decisions:\n\n${next.runningSummary}`,
);
turnsSinceResummarize = 0;
}
// 3. Hard guard: if total still exceeds the trigger, summarize aggressively.
const total = tokenCount(next.systemHeader) +
tokenCount(next.runningSummary) +
next.recentTurns.reduce((acc, t) => acc + t.tokens, 0);
if (total > COMPACTION_TRIGGER) {
next.runningSummary = await summarize(
`Compress to fit under ${Math.floor(COMPACTION_TRIGGER * 0.4)} tokens:\n\n${next.runningSummary}`,
);
}
return { state: next, turnsSinceResummarize };
}
```

Three production gotchas this loop handles that naive compaction misses:

Evicted turns are folded into the running summary instead of silently dropped, so facts from old turns survive when the window rolls forward.

The running summary itself drifts. An append-only brief accumulates paraphrase errors and bloat; re-summarizing it every few turns keeps it tight.

Compaction triggers at 60% of the working cap, not at the hard limit, so there is headroom before a degraded answer ever ships.
For the broader thinking behind this approach — why context engineering is now its own discipline — see our piece on context engineering after the prompt era and the practical primer on prompt compression and context window optimization.
Where Morph Compact Fits
Most teams start the compaction loop with whatever small model their provider offers — Haiku, Gemini Flash, GPT-mini. That works until the summarizer becomes the bottleneck. For long-running agents that compact on every turn, you can end up paying more in summarization tokens than in the main model. Latency starts to dominate too: a 2-second compaction step on every turn turns a snappy agent into a sluggish one.
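It is worth measuring that overhead before swapping summarizers. A thin wrapper around whatever summarize function the compaction loop uses makes the per-turn cost visible; the recording callback here is a placeholder for your own metrics sink.

```typescript
// Wrap the compaction-loop summarizer to record latency and input size
// per call, so you can tell when compaction starts dominating turn cost.
type Summarize = (text: string) => Promise<string>;

function instrumented(
  summarize: Summarize,
  record: (elapsedMs: number, inputChars: number) => void, // your metrics sink
): Summarize {
  return async (text: string) => {
    const start = Date.now();
    const result = await summarize(text);
    record(Date.now() - start, text.length);
    return result;
  };
}

// Pass instrumented(summarize, record) as the summarize argument to appendTurn.
```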
Morph Compact is the purpose-built option for this slot. The published numbers are 3,300 tokens/sec throughput and 98% verbatim retention on factual content — both of which matter for the use cases where summarization typically loses information. Verbatim retention is the key metric. Standard summarizers paraphrase, which is fine for chat history and disastrous for legal text, medical notes, code blocks, and exact quantities. A 98% verbatim summary keeps the surface form when surface form is load-bearing.
When to reach for it: long-running agents (compact every turn), high-fidelity domains (legal, clinical, financial, code), and pipelines where summarization latency is on the user-visible path. When not to: chatty general-purpose agents where a small frontier model is fast enough and rephrasing is fine.
When to Skip the Long Context Entirely
The deeper question Chroma's research forces is whether you should be feeding long context to the model at all. The answer for a growing share of use cases is: not the way most teams are doing it.
Two patterns now consistently outperform "stuff everything into the window":
Compiled knowledge over retrieved context. For knowledge that is stable and queried often, compile it into a model-friendly artifact (a wiki, a set of cards, a fine-tuned adapter) rather than retrieving and stuffing on every call. Andrej Karpathy's recent push on this pattern is the cleanest articulation we have — see our deep-dive on Karpathy's LLM wiki pattern and when compiled knowledge beats RAG.
Working memory inside the window, long tail outside. Use the long-context budget for what the agent actively needs in a single task: the last few turns, the active document, the immediate tool results. Use RAG for the long tail of knowledge that the agent might need but probably will not. The advertised 1M-token window is for headroom on working memory, not for replacing the retrieval layer.
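A sketch of what that split looks like at prompt-assembly time, reusing the CompactionState and working cap from the compaction loop and the Chunk/assembleContext helpers above (all illustrative): working memory gets first claim on the budget, retrieved chunks fill what remains, and the rest of the long tail stays in the retrieval layer.

```typescript
// Working memory first, retrieved long tail second, never past the cap.
function buildPrompt(
  state: CompactionState,
  retrieved: Chunk[],              // ranked chunks from the retrieval layer
  workingCap: number,
): string {
  const workingMemory = [
    state.systemHeader,
    `## Running brief\n${state.runningSummary}`,
    ...state.recentTurns.map((t) => `[${t.role}] ${t.content}`),
  ].join("\n\n");

  let budget = workingCap - tokenCount(workingMemory);
  const included: Chunk[] = [];
  for (const chunk of retrieved) {
    const cost = tokenCount(chunk.text);
    if (cost > budget) break;      // everything else stays behind retrieval
    included.push(chunk);
    budget -= cost;
  }
  return `${workingMemory}\n\n${assembleContext(included)}`;
}
```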
The teams shipping reliable long-context agents in 2026 aren't the ones with the biggest windows. They're the ones whose effective context — the part the model actually reasons over — is small, structured, and held below 30% of advertised. That is the cap. The rest is engineering.
What to Do This Week
Three things every team running a long-context agent should ship before the end of the week:
1. Put the working cap in your config. 25–30% of the advertised window per model, with compaction triggered at the cap, not at the hard limit.
2. Ship the compaction loop. Pinned system header, rolling summary, last N turns verbatim, periodic re-summarization of the brief.
3. A/B test your context assembly. Your current paraphrased, flowing context block versus raw chunk concatenation with explicit ## Source N markers. Compare accuracy on 50 inputs. The result will surprise you.

None of this is exotic. The Chroma study just gave us the magnitudes — and the magnitudes say the working budget is much smaller than the marketing slide.
The Real Insight
The vendor pitch — "1M tokens, just put everything in" — was always a measure of what fits, not what works. Chroma's contribution is putting a number on the gap. Every model degrades. The middle of the prompt is the worst place for a fact to live. Coherent distractors hurt more than messy ones. Length alone costs you accuracy even with a clean context.
The teams that internalize this stop chasing bigger windows and start engineering smaller effective contexts. They cap early, summarize aggressively, and treat retrieval as a way to keep the prompt short, not a way to fill the window. That is the lesson the Chroma study makes unavoidable, and it is the difference between an agent that demos beautifully and one that stays correct on the 100,000th customer conversation.
Frequently Asked Questions
Quick answers to common questions about this topic
What is context rot?

Context rot is the measurable drop in accuracy as the input length grows, even when the model's advertised context window is far larger than the prompt. Chroma's 2026 study tested 18 frontier models — including ones with 1M-token windows — and found every model degraded with length. The drop is not just 'lost in the middle' (where mid-prompt facts are missed); it is a baseline accuracy floor that falls 7.9% from length alone when distractors are masked. The advertised window is the upper bound on what fits, not the upper bound on what works.


