On March 6, 2026 Anthropic changed the default prompt cache TTL from 3600s to 300s. The April 23 postmortem confirmed the rollout. Teams running Claude Code, long-running RAG workers, and overnight batch jobs saw cache_read_input_tokens collapse to zero between calls more than five minutes apart, with reports of 15-20x quota burn. Confirm the regression by inspecting usage.cache_read_input_tokens on every request. Fix it by setting cache_control to {"type":"ephemeral","ttl":3600} on the breakpoint that anchors your prefix, accepting the 25% write premium in exchange for an hour of reads at 90% off. Then audit the four prefix drifts that silently invalidate any TTL: image binary changes, tool_choice flips, --resume session reseeding, and non-deterministic system prompt construction.
A client pinged us on April 24 with a subject line we read twice: "Anthropic spend up 19x this month, can you look?" Their setup hadn't changed in months — same RAG worker, same 38K-token system prompt, same Opus 4.6 model. But cache reads had cratered. Our first instinct was a code regression, a deploy that broke prefix determinism. The diff was clean. The bill wasn't.
The cause was Anthropic's quiet default change on March 6, 2026: prompt cache TTL dropped from 3600 seconds to 300 seconds for any cache_control block without an explicit ttl field. Anthropic confirmed it publicly in the April 23 postmortem and pointed at the long-running claude-code#46829 thread where teams had been reporting the symptom for weeks. If your workload's natural cadence is slower than 5 minutes — overnight batch jobs, agent loops with human review steps, multi-tenant background workers — your cache hit rate collapsed silently.
This post is the runbook we ship to clients. We'll cover what actually changed, how to read usage.cache_read_input_tokens to confirm the regression, the four prefix drifts that defeat any TTL you set, the cost math on opting back into the 1-hour cache, and the architecture changes that make the next default flip a non-event. For broader cost reduction strategies that complement this fix, see our walkthrough of how smart caching cut a client's AI API costs by 75% and the decision guide on when to cache LLM responses.
What Changed on March 6, 2026
The Anthropic prompt cache has always supported two TTL options: a 5-minute ephemeral cache (the cheap default) and a 1-hour ephemeral cache (a 25% premium on writes, same 90% discount on reads). Until March 6, 2026, the API behavior most production codebases depended on was the 1-hour default for any cache_control block omitting the ttl field. The change rolled the default down to 5 minutes.
The wire format looks identical:
```json
{
  "type": "ephemeral"
}
```
That object now means 5 minutes of TTL, not 60 minutes. To get the old behavior back, you have to write it explicitly:
```json
{
  "type": "ephemeral",
  "ttl": 3600
}
```
The April 23 postmortem framed the change as "aligning the default with the documentation," and the docs do specify 5 minutes as the default. But the lived experience for thousands of teams was that production code which had run for months suddenly started paying full input rates on every call spaced more than 5 minutes apart. The gap between intended default and observed default is what generated the postmortem.
The blast radius was largest for three workload patterns: long-running agent loops with human-in-the-loop steps that exceed 5 minutes, scheduled batch jobs that fire every 10-15 minutes against the same prefix, and multi-tenant gateways that round-robin across thousands of distinct prefixes faster than each one can be revisited. Claude Code users hit it inside a single coding session whenever they stepped away from the terminal for a coffee.
Confirming the Regression: Read `usage.cache_read_input_tokens`
Every Anthropic response includes a usage object with four cache-relevant fields. If you're not logging all four on every request, you can't see the regression. Here's the structure:
```json
{
  "usage": {
    "input_tokens": 280,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 38420,
    "output_tokens": 612
  }
}
```
A healthy cached request looks like the example above: cache_read_input_tokens carries the bulk of the prefix at 10% of input price, input_tokens is small (just the dynamic suffix and user message), and cache_creation_input_tokens is zero. The first request of a window writes the cache:
```json
{
  "usage": {
    "input_tokens": 280,
    "cache_creation_input_tokens": 38420,
    "cache_read_input_tokens": 0,
    "output_tokens": 612
  }
}
```
A regressed request — same code, same prefix, but the previous cache entry expired — looks like this:
```json
{
  "usage": {
    "input_tokens": 38700,
    "cache_creation_input_tokens": 38420,
    "cache_read_input_tokens": 0,
    "output_tokens": 612
  }
}
```
Notice both input_tokens and cache_creation_input_tokens are populated. You're paying full input rate on the prefix and writing a fresh cache entry that the next request will probably also miss. This is the worst of both worlds, and it's the financial signature of the TTL regression on a slow-cadence workload.
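If your requests go through a Python wrapper, a small classifier over those four fields turns the three signatures above into log lines you can grep. A minimal sketch; the verdict names are ours, and the thresholds simply mirror the examples in this section.

```python
from dataclasses import dataclass

@dataclass
class CacheVerdict:
    kind: str          # "hit", "write", "miss_and_rewrite", or "uncached"
    prefix_tokens: int

def classify_usage(usage: dict) -> CacheVerdict:
    """Classify one response's usage block by its cache behavior."""
    reads = usage.get("cache_read_input_tokens", 0)
    writes = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)

    if reads > 0:
        return CacheVerdict("hit", reads)                # healthy: prefix served at 10% of input price
    if writes > 0 and uncached > writes:
        return CacheVerdict("miss_and_rewrite", writes)  # the regression signature: full rate plus a fresh write
    if writes > 0:
        return CacheVerdict("write", writes)             # first call of a window
    return CacheVerdict("uncached", uncached)            # no cache_control breakpoint at all

# Example: the regressed request from above classifies as a miss-and-rewrite.
print(classify_usage({"input_tokens": 38700,
                      "cache_creation_input_tokens": 38420,
                      "cache_read_input_tokens": 0,
                      "output_tokens": 612}))
```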
The Diagnostic Query
Group your usage logs into 5-minute and 60-minute buckets and compute hit rate per bucket:
```sql
SELECT
  date_trunc('minute', request_ts) AS minute,
  SUM(cache_read_input_tokens) AS cache_reads,
  SUM(input_tokens) AS uncached_reads,
  SUM(cache_read_input_tokens)::float
    / NULLIF(SUM(cache_read_input_tokens) + SUM(input_tokens), 0)
    AS hit_rate
FROM anthropic_usage
WHERE request_ts > NOW() - INTERVAL '7 days'
GROUP BY 1
ORDER BY 1;
```
Plot hit_rate over time. If the line drops a step on March 6 (or whenever your client library updated), the TTL change is your culprit. If the line is flat-zero across all windows, the issue is prefix non-determinism — covered in the next section. If the hit rate is healthy within 5 minutes but collapses above it, you have a clean TTL regression and the fix is one field.
The Four Prefix Drifts That Bust Any TTL
Setting ttl: 3600 only helps if your prefix is byte-identical between calls. The cache key is a hash of the prefix up to and including the cache_control breakpoint, so anything that mutates those bytes invalidates the entry no matter what TTL you set. We've debugged the same four drifts across maybe two dozen client engagements. They account for almost every "I set the TTL but cache_read is still zero" report.
Drift 1: Image Binary Mutation
If your prefix includes an image — a screenshot, a logo, a vision-grounded reference — and your application re-encodes that image on every request, the bytes change even when the visible content doesn't. PIL's save() produces different bytes than cv2.imwrite() on the same pixel data, and decoding an image and re-saving it before base64-encoding produces a different string than the one you cached, because EXIF metadata and chunk ordering aren't preserved across an encode round trip. Fix: hash the image bytes once at upload, persist the canonical encoding, and reuse the same string in every cache_control prefix.
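A minimal sketch of that fix, assuming a local directory stands in for wherever you persist the canonical encoding; the point is that the base64 string is produced once from the original bytes and reused verbatim in every request.

```python
import base64
import hashlib
from pathlib import Path

CANONICAL_DIR = Path("./canonical_images")  # stand-in for S3, Redis, or wherever you persist encodings
CANONICAL_DIR.mkdir(exist_ok=True)

def canonical_image_b64(image_path: str) -> str:
    """Encode the original bytes once, key by content hash, and reuse the exact string forever."""
    raw = Path(image_path).read_bytes()    # never decode and re-save; use the bytes as uploaded
    digest = hashlib.sha256(raw).hexdigest()
    persisted = CANONICAL_DIR / f"{digest}.b64"
    if persisted.exists():
        return persisted.read_text()       # the same string every previous request used
    encoded = base64.standard_b64encode(raw).decode("ascii")
    persisted.write_text(encoded)
    return encoded

# Build the image block in the prefix from the persisted encoding so the bytes never drift:
image_block = {
    "type": "image",
    "source": {"type": "base64", "media_type": "image/png",
               "data": canonical_image_b64("screenshots/dashboard.png")},
}
```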
Drift 2: `tool_choice` Flips
The tool_choice parameter is part of the cached prefix. Switching between {"type":"auto"}, {"type":"any"}, and {"type":"tool","name":"search"} produces three different cache keys even when the system prompt and tools list are identical. Agent frameworks that adapt tool_choice based on conversation state will silently cycle through cache misses. Fix: if you need adaptive tool_choice, move the cache breakpoint earlier — before the tools block — so the prefix that gets cached doesn't include the choice field.
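A sketch of the fix as described above: anchor the breakpoint on the stable system block so the cached prefix ends before anything that varies per call, then let tool_choice flip freely. The builder function, placeholder prompt, and tool list are ours, not from any framework.

```python
SYSTEM_PROMPT = "<your 38K-token system prompt>"  # the stable part of the prefix

TOOLS = [  # identical list, identical order, on every call
    {"name": "search", "description": "Search the docs",
     "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}}},
]

def build_request(user_query: str, tool_choice: dict) -> dict:
    """Anchor the breakpoint on the stable prefix; let tool_choice vary after it."""
    return {
        "model": "claude-opus-4-7-20260315",
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral", "ttl": 3600}}  # breakpoint lives here
        ],
        "tools": TOOLS,
        "tool_choice": tool_choice,  # free to flip between auto / any / a named tool
        "messages": [{"role": "user", "content": user_query}],
    }

# An adaptive agent can now switch strategies per step without minting a new cache key:
req_a = build_request("find the TTL docs", {"type": "auto"})
req_b = build_request("find the TTL docs", {"type": "tool", "name": "search"})
```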
Drift 3: `--resume` Session Reseeding
Claude Code's --resume flag rehydrates a session from disk. Pre-1.x versions injected a session-id and a timestamp into the system header on resume, which made the prefix unique per resume even when the user-visible context was identical. Tools like custom subagent runners and homegrown agent loops often replicate this pattern by accident. Fix: strip session metadata out of the prefix and pass it as request-level metadata or as part of the user message, where it can vary without invalidating the cache.
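A sketch of that fix, assuming your loop currently interpolates a session id and timestamp into the system header; here the prefix stays constant and session identity rides in the request-level metadata field and the (uncached) user turn. The builder function and field layout are illustrative, not taken from Claude Code itself.

```python
import json

SYSTEM_PROMPT = "<your 38K-token system prompt>"  # byte-identical on every resume

def build_resumed_request(session_id: str, resumed_at: str, user_query: str) -> dict:
    """Keep session bookkeeping out of the cached prefix entirely."""
    return {
        "model": "claude-opus-4-7-20260315",
        "max_tokens": 1024,
        # The prefix carries no session id and no timestamp, so every resume hashes the same.
        "system": [
            {"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral", "ttl": 3600}}
        ],
        # Session identity rides outside the prefix: request-level metadata for attribution...
        "metadata": {"user_id": f"session-{session_id}"},
        # ...and anything the model actually needs goes in the uncached user turn.
        "messages": [
            {"role": "user",
             "content": json.dumps({"resumed_at": resumed_at, "query": user_query})}
        ],
    }
```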
Drift 4: Non-Deterministic System Prompt Construction
This is the most common drift and the hardest to spot. A system prompt built from a Python dictionary serializes in dict insertion order, which is stable in CPython 3.7+ within a single process — but across processes, deploys, or workers, the order can shift if the construction code reads from environment variables, set literals, or unordered iteration. JSON-stringifying an object that contains an unordered set of tools produces a different string per call. Fix: sort keys explicitly when serializing, build prefixes from ordered lists rather than sets, and pin a single canonicalization function (json.dumps(obj, sort_keys=True, separators=(",",":"))) that you reuse everywhere a prefix is constructed.

You can audit all four drifts with a single guard: hash the first 4KB of every outgoing request body, log the hash, and assert it matches across consecutive same-prefix calls. The first time the hash drifts, you've found which of the four is biting you. We recommend this guard live in your gateway layer alongside the per-tenant LLM cost attribution instrumentation we wrote about earlier this year.
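A minimal sketch of both pieces: one canonicalization function reused everywhere a prefix is built, plus the 4KB hash guard. The logger name, the in-memory hash store, and the idea of keying by a logical prefix name are our choices; wire the warning into whatever alerting your gateway already emits.

```python
import hashlib
import json
import logging

logger = logging.getLogger("anthropic.prefix_guard")

def canonical_json(obj) -> str:
    """The single serialization used everywhere a cached prefix is constructed."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

_last_hash: dict[str, str] = {}  # keyed by a logical prefix name, e.g. "rag-worker-v3"

def guard_prefix(prefix_name: str, request_body: dict) -> bytes:
    """Serialize once, hash the first 4KB, and warn the first time the hash drifts."""
    body = canonical_json(request_body).encode("utf-8")
    digest = hashlib.sha256(body[:4096]).hexdigest()
    previous = _last_hash.get(prefix_name)
    if previous is not None and previous != digest:
        logger.warning("prefix drift on %s: %s -> %s", prefix_name, previous[:12], digest[:12])
    _last_hash[prefix_name] = digest
    return body  # send these exact bytes so the hash reflects what actually goes on the wire
```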
Setting `ttl: 3600` and the Cost Math
Once your prefix is byte-stable, opt back into the 1-hour cache by adding the ttl field on the cache breakpoint:
```json
{
  "model": "claude-opus-4-7-20260315",
  "system": [
    {
      "type": "text",
      "text": "<your 38K-token system prompt>",
      "cache_control": { "type": "ephemeral", "ttl": 3600 }
    }
  ],
  "messages": [
    { "role": "user", "content": "<user query>" }
  ]
}
```
The 1-hour TTL costs 25% more on writes than the 5-minute default. Reads are still discounted 90% off input price regardless of which TTL you wrote with. The break-even point is the number of reads you complete per write before expiry.
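If you call the API through the Python SDK rather than raw HTTP, the same opt-in is one field on the system block. A sketch assuming the SDK passes cache_control through verbatim and using the integer-seconds ttl format from the example above; confirm the accepted values against your SDK version before shipping.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "<your 38K-token system prompt>"

response = client.messages.create(
    model="claude-opus-4-7-20260315",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Explicit TTL: never depend on the provider-side default again.
            "cache_control": {"type": "ephemeral", "ttl": 3600},
        }
    ],
    messages=[{"role": "user", "content": "<user query>"}],
)

# Log all four usage fields on every call (see the diagnostic section above).
print(response.usage)
```

The table below walks the cost math for the same 38K-token prefix.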
| Window length | 5-min write cost (38K prefix) | 1-hour write cost (38K prefix) | Read cost per call | Reads to break even |
|---|---|---|---|---|
| 5 minutes | $0.135 | n/a | $0.0114 | n/a |
| 1 hour | n/a (would expire) | $0.169 | $0.0114 | ~3 |

Numbers above use Opus 4.6 input pricing of $3/M tokens. The same logic applies for Sonnet at $3/M and Haiku at $0.25/M; the absolute dollars shrink but the break-even count stays around 3 reads per hour. For most agent loops, RAG workers, and interactive coding sessions, 3 reads per hour is a floor you clear before the second cup of coffee.
The case where 5 minutes is still correct is bursty short-lived prefixes — a one-off batch with N parallel tool calls fired in 30 seconds, then never revisited. Pay the cheaper write, take the discount on the immediate fan-out, and let the cache expire.
For a deeper treatment of when caching is worth the engineering investment in the first place, see our decision guide on caching LLM responses and token cost optimization patterns.
A Debugging Runbook with `curl`
Here's the sequence we run when a client reports a sudden Anthropic cost spike. Every step is something you can paste into a terminal in under a minute.
Step 1: Reproduce the call with full usage logging.
```bash
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d @./repro-prefix.json | jq '.usage'
```
If cache_read_input_tokens is 0 on the second-and-later calls, you have a problem.
Step 2: Time the second call.
Wait 6 minutes between identical calls. Re-run. If the second call shows cache_creation_input_tokens (a fresh write) instead of cache_read_input_tokens (a hit), the TTL is your issue. Re-run within 4 minutes — if that one hits, you've confirmed the 5-minute window.
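If you'd rather script the timing than watch a clock, here's a sketch using the Python SDK; it assumes repro-prefix.json holds a complete messages payload (model, max_tokens, system with the cache breakpoint, messages) and that your SDK version exposes the cache fields on response.usage.

```python
import json
import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def send(payload: dict):
    """Send the repro payload and print the cache-relevant usage fields."""
    usage = client.messages.create(**payload).usage
    print(f"read={usage.cache_read_input_tokens or 0} "
          f"write={usage.cache_creation_input_tokens or 0} "
          f"input={usage.input_tokens}")
    return usage

with open("repro-prefix.json") as f:
    payload = json.load(f)

send(payload)        # first call: expect a cache write
time.sleep(6 * 60)   # step outside the 5-minute window
send(payload)        # a fresh write here (not a read) points at the TTL
time.sleep(4 * 60)   # stay inside the window this time
send(payload)        # a read here confirms the 5-minute default
```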
Step 3: Pin the TTL.
Edit repro-prefix.json so the cache_control breakpoint reads {"type":"ephemeral","ttl":3600} and re-run the 6-minute spaced test. If you now hit the cache, ship the same change to production. If you still don't hit it, you have a prefix drift — go to step 4.
Step 4: Hash and diff the prefix.
```bash
jq -c '{model, system, tools}' call-1.json | sha256sum
jq -c '{model, system, tools}' call-2.json | sha256sum
```
Different hashes mean the prefix is mutating between calls. Diff the two JSONs to find which field is moving. In our experience the answer is one of the four drifts above 90% of the time.
Step 5: Add the gateway guard.
Once you've fixed the immediate issue, install a permanent guard in your Anthropic client wrapper that logs the prefix hash and the four usage fields per call, alerts on hit rate below a threshold (we use 70% as the floor for steady-state workloads), and exposes both as Prometheus metrics or a dashboard panel.
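Here's a sketch of that wrapper-level instrumentation with prometheus_client; the metric names, the running-total approach, and the 70% floor in the comment are our choices, and the alert itself lives in your Prometheus rules rather than in this code.

```python
from prometheus_client import Counter, Gauge

CACHE_READ_TOKENS = Counter("anthropic_cache_read_tokens_total",
                            "Prefix tokens served from the prompt cache")
UNCACHED_TOKENS = Counter("anthropic_uncached_input_tokens_total",
                          "Input tokens billed at the full rate")
HIT_RATE = Gauge("anthropic_cache_hit_rate", "Rolling cache hit rate across all calls")

_totals = {"reads": 0, "uncached": 0}  # process-lifetime running totals

def record_usage(usage) -> None:
    """Call with response.usage after every Anthropic request in the client wrapper."""
    reads = getattr(usage, "cache_read_input_tokens", 0) or 0
    uncached = getattr(usage, "input_tokens", 0) or 0
    CACHE_READ_TOKENS.inc(reads)
    UNCACHED_TOKENS.inc(uncached)
    _totals["reads"] += reads
    _totals["uncached"] += uncached
    denominator = _totals["reads"] + _totals["uncached"]
    if denominator:
        HIT_RATE.set(_totals["reads"] / denominator)
    # Alerting rule (in Prometheus, not here): anthropic_cache_hit_rate < 0.70 for 30m
    # on steady-state workloads, per the floor discussed above.
```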
Architecture Changes That Survive the Next TTL Shift
The TTL regression is a specific instance of a general problem: provider-side defaults can change without notice, and any architecture that depends on those defaults inherits the risk. The pattern we ship to clients now treats provider-side caching as one layer in a stack we control end-to-end, not the load-bearing layer.
Layer 1: Application-layer prefix hashing. We hash the prefix in our gateway and store the hash, the request, and the response in our own cache. Provider-side caching is best-effort discount on top of that, not the primary mechanism.
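A minimal sketch of Layer 1, assuming an in-process dict as the store and exact-match reuse keyed by a hash of the full request; in production the store would be Redis or similar, and call_api is whatever wrapper you already route Anthropic traffic through.

```python
import hashlib
import json
import time
from typing import Callable

_app_cache: dict[str, tuple[float, dict]] = {}  # swap for Redis or similar in production
APP_CACHE_TTL_S = 3600  # our TTL, under our control, not the provider's

def request_hash(request_body: dict) -> str:
    """Hash the full request with a stable serialization so identical requests collide."""
    return hashlib.sha256(json.dumps(request_body, sort_keys=True).encode("utf-8")).hexdigest()

def cached_call(request_body: dict, call_api: Callable[[dict], dict]) -> dict:
    """Serve repeat requests from our own cache; the provider-side discount is a bonus on misses."""
    key = request_hash(request_body)
    entry = _app_cache.get(key)
    if entry and time.time() - entry[0] < APP_CACHE_TTL_S:
        return entry[1]
    response = call_api(request_body)   # provider-side prompt caching still applies to this call
    _app_cache[key] = (time.time(), response)
    return response
```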
Layer 2: Semantic similarity caching. A query embedding lookup at threshold 0.93 catches paraphrased duplicates that exact-prefix caching misses entirely. We covered the threshold calibration in our semantic similarity caching guide.
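A sketch of the similarity lookup at the 0.93 threshold, assuming you already have an embedding function for queries and a small list of cached entries; swap the linear scan for your vector store of choice.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.93  # the calibration point discussed in the linked guide

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Each entry pairs a stored query embedding with the response we cached for it.
semantic_cache: list[tuple[np.ndarray, dict]] = []

def semantic_lookup(query_embedding: np.ndarray):
    """Return a cached response for a paraphrased duplicate, or None on a miss."""
    best_score, best_response = 0.0, None
    for embedding, response in semantic_cache:
        score = cosine(query_embedding, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None
```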
Layer 3: Model routing with cheap-first fallback. Classification and extraction tasks route to Haiku before Opus; long-context refactors route to Opus directly. Our model routing playbook walks through the routing rules in detail.
Layer 4: Provider-side prompt caching with explicit TTLs. Set the TTL on every cache_control block, instrument the four usage fields, alert on hit-rate regressions. The next time Anthropic, OpenAI, or Google flips a default, your alerting catches it the same hour rather than the next billing cycle.
That four-layer stack is the architecture we walk every Particula Tech consulting engagement through when the conversation moves from "we picked a model" to "we have to keep that model affordable in production." The single biggest unlock isn't any individual layer — it's that no single provider-side change can collapse your hit rate to zero without something in your own stack catching it first.
For broader context on when caching is the right answer versus when latency or routing should come first, see our AI Development Tools pillar and the surrounding posts on cost-aware production architecture.
What to Do This Week
Three actions, in order:
1. Pull your request logs, sum the four usage fields, and compute hit rate over the last 48 hours. If it's not above 70%, you have lost money this month.
2. Set `ttl: 3600` explicitly. This is a one-line change per breakpoint and is reversible. Ship it before your next billing cycle.
3. If the audit surfaces an order-of-magnitude bill or your engineering team doesn't have the cycles to run the runbook, that's the conversation we have most often this quarter — reach out and we'll help you instrument it.
Frequently Asked Questions
Quick answers to common questions about this topic
What changed with Anthropic's prompt cache TTL in March 2026?
Anthropic shipped a quiet default change that dropped the prompt cache TTL from 3600 seconds to 300 seconds for any cache_control block that didn't specify a ttl field. The change was confirmed in Anthropic's April 23, 2026 postmortem after roughly six weeks of customer reports, including the high-traffic claude-code#46829 issue. The 1-hour TTL is still available — but only if you opt in explicitly with {"type":"ephemeral","ttl":3600} on the cache breakpoint. Teams that assumed the older default kept paying full input rates on every request more than five minutes apart, which is why monthly bills blew up 15-20x for slow-cadence workloads.
