An LLM cost spike comes from one of six buckets: a runaway agent loop, a prompt regression, an unintended model upgrade, a retry storm, a noisy tenant, or a leaked key. Read the split first: large completion_tokens points to open-ended generation or reasoning thinking-tokens billed as output (the silent 10x); large prompt_tokens points to prompt bloat, oversized tool schemas, or untrimmed retrieval. A single runaway loop can burn $40 in minutes. Tag every request, set feature-level rolling-baseline alerts, and enforce hard retry caps plus max_tokens.
The on-call page says the LLM bill spiked overnight. Yesterday's API spend was a flat line; this morning it is a cliff. Nothing shipped, no traffic spike shows in the dashboards, and yet the projected monthly cost just tripled. This is the most common LLM cost incident there is, and the diagnosis is faster than it feels in the moment if you read the signals in the right order.
A token cost runaway almost never comes from "the model got more expensive." It comes from one of six concrete failure modes, and each one leaves a different fingerprint in the usage data. A runaway agent loop looks nothing like a prompt regression, which looks nothing like a leaked key. The mistake most teams make under pressure is staring at the total dollar figure instead of the one number that actually localizes the problem: the split between prompt tokens and completion tokens. That split alone cuts the search space from six causes to two or three in about five minutes.
This is the playbook I use to triage a token cost spike, written as an incident runbook rather than a theory piece. We will walk the six spike buckets, the diagnostic signals that distinguish them, the silent 10x that reasoning models introduce, and the permanent fixes that stop the same incident from happening twice. The goal is not just to explain this invoice. It is to make the next spike fire an alert before the invoice, not after.
The On-Call Scenario: Triage Before You Theorize
When the bill jumps, resist the urge to guess. The fastest path is a fixed triage order, the same way you would read a latency incident. Pull the usage data for the spike window and answer three questions in sequence.
Question one: did request volume spike, or did request size spike? Plot request count over the incident window next to total cost. If the request count tracks the cost curve, you have a volume problem (a loop or a retry storm). If request count is flat but cost climbed, you have a size problem (something made each request more expensive). This single chart eliminates half the buckets immediately.
Question two: is the extra cost on the input side or the output side? Split the spike window into aggregate prompt_tokens versus completion_tokens. This is the highest-signal number in the entire investigation, and we will spend the next two sections on what each side means.
Question three: which tag owns the spike? If you tagged your requests (and if you did not, fix that the moment this incident is over), group the cost by feature, by user, by agent-run, by tenant. A spike that concentrates in one tag is a different problem from one spread evenly across all traffic.
Three questions, maybe ten minutes with a decent observability layer, and you have narrowed a vague "the bill spiked" into a specific hypothesis. Now you can theorize.
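The three questions reduce to three small aggregations over your usage export. The following is a minimal sketch, not a finished tool: it assumes each usage record is a dict with hypothetical fields `cost_usd`, `prompt_tokens`, `completion_tokens`, and a tag such as `feature`, and that the spike window and the baseline window cover the same length of time.

```python
from collections import defaultdict

def summarize(rows):
    """Total requests, cost, and token counts for one time window."""
    return {
        "requests": len(rows),
        "cost_usd": sum(r["cost_usd"] for r in rows),
        "prompt_tokens": sum(r["prompt_tokens"] for r in rows),
        "completion_tokens": sum(r["completion_tokens"] for r in rows),
    }

def growth(spike_value, base_value):
    """Spike-to-baseline ratio; infinite when the baseline is zero."""
    return spike_value / base_value if base_value else float("inf")

def triage(spike_rows, baseline_rows, tag="feature"):
    spike, base = summarize(spike_rows), summarize(baseline_rows)

    # Question three: which tag owns the spike?
    cost_by_tag = defaultdict(float)
    for row in spike_rows:
        cost_by_tag[row.get(tag, "untagged")] += row["cost_usd"]
    top_tag = max(cost_by_tag, key=cost_by_tag.get) if cost_by_tag else None

    return {
        # Question one: volume or size? Compare these two ratios.
        "volume_growth": growth(spike["requests"], base["requests"]),
        "cost_growth": growth(spike["cost_usd"], base["cost_usd"]),
        # Question two: input side or output side?
        "prompt_growth": growth(spike["prompt_tokens"], base["prompt_tokens"]),
        "completion_growth": growth(
            spike["completion_tokens"], base["completion_tokens"]
        ),
        "top_tag": top_tag,
        "top_tag_share": (
            cost_by_tag[top_tag] / spike["cost_usd"] if spike["cost_usd"] else 0.0
        ),
    }
```

If `volume_growth` is near 1 while `cost_growth` is well above it, you are looking at a size problem; if the two move together, it is a volume problem.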
The Six Spike Buckets
Every token cost runaway I have triaged fits one of these six buckets. Memorize them; they are your differential diagnosis.
The buckets are not mutually exclusive (a retry storm can also be a noisy tenant), but in practice one of them dominates the spike. Walk them in roughly this order, because the early ones are both more common and faster to confirm.
Runaway agent loop
The most expensive bucket and the most common in agentic systems. An agent with a tool-use loop and no hard iteration cap can get stuck retrying the same step, re-sending its entire context on every turn. Because the full conversation history is resent each iteration, cost grows faster than linearly with loop length. A runaway loop can burn $40 in minutes, and it usually fires in a background job where nobody is watching. The fingerprint is unmistakable once you have per-agent-run tags: one run ID with hundreds or thousands of calls in a tight time window.
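To see why cost grows faster than linearly, here is a small worked sketch. The numbers are illustrative assumptions (a 2,000-token starting context and 500 tokens appended per turn), not measurements from any real system.

```python
def loop_prompt_tokens(base_context, tokens_added_per_turn, turns):
    """Total prompt tokens billed when the full history is resent every turn."""
    total = 0
    context = base_context
    for _ in range(turns):
        total += context                   # the whole context is billed as input
        context += tokens_added_per_turn   # tool call and result get appended
    return total

ten_turns = loop_prompt_tokens(2_000, 500, 10)       # 42,500 tokens
hundred_turns = loop_prompt_tokens(2_000, 500, 100)  # 2,675,000 tokens
```

Under these assumptions, ten times as many turns costs roughly 63 times as many prompt tokens, because the sum of the per-turn contexts grows with the square of the loop length.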
Prompt regression
A code change quietly grew the prompt. Someone added more few-shot examples, stopped trimming conversation history, started stuffing more retrieval chunks into context, or expanded the system prompt. Request count stays flat, but prompt_tokens per request jumps at a deploy boundary. This is why correlating the spike start time with your deploy timeline is worth doing early.
Unintended model upgrade
The token counts look normal but cost-per-token rose. A provider bumped the default model behind an alias, an autoscaling or fallback path routed traffic to a pricier tier, or someone changed a model string in config. Note that Anthropic shifted Claude enterprise to dynamic usage pricing in April 2026, which on its own can move heavy users' bills 2-3x with no code change at all. Pin your model versions explicitly so a default upgrade cannot silently rewrite your unit economics.
Retry storm
A flaky upstream (the model API timing out, a tool endpoint 500ing) triggers retries. Without a hard retry cap and proper backoff, a brief outage turns into a flood of duplicated requests. The fingerprint is bursts of byte-identical payloads clustered around a dependency blip.
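One way to look for that fingerprint is to hash each request payload and count repeats inside short time windows. This is a sketch under assumptions: it expects logged requests as `(unix_timestamp, payload_dict)` pairs, and it canonicalizes the JSON before hashing, so it also catches payloads that differ only in key order rather than strictly byte-identical ones. The window size and repeat threshold are placeholder values.

```python
import hashlib
import json
from collections import Counter

def payload_fingerprint(payload):
    """Stable SHA-256 hash of a JSON-serializable payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def find_duplicate_bursts(requests, window_seconds=60, min_repeats=5):
    """Return {(window_index, fingerprint): count} for payloads repeated
    at least min_repeats times inside one fixed-size time window."""
    counts = Counter()
    for ts, payload in requests:
        window = int(ts // window_seconds)
        counts[(window, payload_fingerprint(payload))] += 1
    return {key: n for key, n in counts.items() if n >= min_repeats}
```

Fixed windows are the simplest choice; a burst that straddles a window boundary is split across two buckets, so keep `min_repeats` modest or use overlapping windows if that matters for your traffic.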
Noisy tenant
In a multi-tenant product on a shared budget, one tenant's behavior can dominate everyone's cost. A single customer kicks off a bulk job, or one account gets abused, and the shared line item spikes. You only see this cleanly if you tag by tenant. We cover the attribution side of this in depth in our guide to per-tenant LLM cost attribution for multi-tenant SaaS.
Leaked API key
The worst case. A key ended up in a public repo, a client bundle, or a leaked log, and someone else is spending your money. The fingerprint is usage that does not match any of your features: unfamiliar models, odd regions, off-hours traffic, request shapes your app never produces. If the spike has no internal explanation, rotate the key first and ask questions after.
| Bucket | Volume or size | Token signal | Typical fingerprint |
|---|---|---|---|
| Runaway agent loop | Volume | Large completion + prompt | One agent-run tag with thousands of calls in minutes |
| Prompt regression | Size | Large prompt_tokens | Flat request count, prompt_tokens jumped at a deploy boundary |
| Unintended model upgrade | Size | Cost-per-token jumped | Same token counts, higher cost; default version or tier changed |
| Retry storm | Volume | Duplicated identical requests | Bursts of identical payloads, often after a dependency blip |
| Noisy tenant | Volume or size | Concentrated in one tenant tag | Shared budget, one tenant suddenly dominates |
| Leaked API key | Volume | Unfamiliar usage pattern | Traffic from unknown features, regions, or off-hours |
Reading the Signals: Completion Tokens vs Prompt Tokens
The prompt/completion split is the workhorse of this diagnosis because input and output cost are billed separately and caused by completely different things. Read it before you read anything else.
When completion_tokens is the culprit
Large completion_tokens means the model generated a lot of output. The usual causes:
- An open-ended prompt. The instruction invites long generation ("write a detailed report on...") and the model obliges to the limit.
- A missing stop sequence. For structured output, without a stop the model keeps going past the useful content, padding the response.
- A missing max-output cap. No max_tokens means a single misread request can generate to the full context window. This is the most preventable output-side runaway.
- Reasoning thinking-tokens billed as output. The silent 10x, covered in its own section below because it surprises people the most.
The fix pattern for output-side spikes is always the same: cap it, stop it, and only use reasoning tiers where the task genuinely needs them.
When prompt_tokens is the culprit
Large prompt_tokens means each request carries too much context in. The usual causes:
- Prompt bloat. The system prompt, few-shot examples, or instructions grew over time.
- Oversized tool schemas. Agentic apps pay for every tool definition on every call. A catalog of 50-plus tools can eat a meaningful slice of context before the user's message even appears.
- Long history. Conversation or agent history is being resent in full instead of trimmed or summarized.
- Untrimmed retrieval chunks. A RAG step is stuffing top-k results in raw, including chunks that are too long or only marginally relevant.
Input-side spikes are about discipline: trim history, prune tool schemas to what the current task needs, and cap retrieval payload size. Many of these overlap with the broader optimization patterns in our guide to reducing LLM token costs through optimization.
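As one illustration of trimming history, the sketch below keeps the system message plus the most recent messages that fit a token budget. Two things here are assumptions: the message shape (dicts with `role` and `content`) and the `estimate_tokens` helper, which is a crude four-characters-per-token stand-in. Swap in your provider's real tokenizer before relying on the numbers.

```python
def estimate_tokens(text):
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)

def trim_history(messages, budget_tokens, count=estimate_tokens):
    """Keep a leading system message plus the newest messages within budget."""
    system = [m for m in messages[:1] if m["role"] == "system"]
    rest = messages[len(system):]
    used = sum(count(m["content"]) for m in system)

    kept = []
    for message in reversed(rest):       # walk from newest to oldest
        cost = count(message["content"])
        if used + cost > budget_tokens:
            break                        # everything older is dropped
        kept.append(message)
        used += cost
    return system + kept[::-1]           # restore chronological order
```

Dropping the oldest turns is the bluntest option. Summarizing them instead preserves more context, at the price of an extra model call.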
The Silent 10x: Reasoning Thinking-Tokens
Reasoning models deserve their own warning because they break the intuition that output cost tracks visible output length. On most reasoning models, the internal chain-of-thought is billed as output tokens even though it never appears in the response. A request that returns a tidy 200-token answer can bill for 2,000 or more output tokens because the model spent the rest of its budget thinking.
This produces a spike that is genuinely confusing on first read. Your completion_tokens aggregate jumps, but if you sample the actual responses they look normal, short even. Nothing in the user-facing output explains the cost. The trigger is almost always a config change: someone switched a feature to a reasoning model, or raised the reasoning-effort setting, or a routing rule started sending more traffic to a reasoning tier.
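Three controls keep this in check: set the reasoning-effort level deliberately per feature, route only genuinely hard tasks to reasoning tiers, and cap output with max_tokens.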
When you see large completion_tokens against short answers, suspect reasoning overhead before anything else. It is the bucket teams most often miss because the evidence is invisible by design.
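A quick way to test that suspicion is to compare billed completion tokens against the tokens actually visible in the response. The sketch below assumes you can obtain both numbers per request (the visible count from a tokenizer run over the returned text, or from a reasoning-token field where your provider reports one); the field names and the 3x threshold are hypothetical.

```python
def reasoning_overhead(billed_completion_tokens, visible_answer_tokens):
    """Hidden output tokens and the billed-to-visible multiplier for one request."""
    hidden = max(0, billed_completion_tokens - visible_answer_tokens)
    multiplier = (
        billed_completion_tokens / visible_answer_tokens
        if visible_answer_tokens
        else float("inf")
    )
    return {"hidden_tokens": hidden, "multiplier": multiplier}

def flag_reasoning_suspects(rows, threshold=3.0):
    """Rows whose billed output is at least `threshold` times the visible output."""
    return [
        row for row in rows
        if reasoning_overhead(
            row["completion_tokens"], row["visible_tokens"]
        )["multiplier"] >= threshold
    ]
```

For the example above, a 200-token answer billed at 2,000 output tokens gives 1,800 hidden tokens and a multiplier of 10.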
Per-Request Tagging So You Can Find the Source Fast
Everything above assumes you can slice your spend by feature, user, and agent-run. If you cannot, the spike is just a bigger number with no handle on it, and you are reduced to grepping logs and guessing. Per-request tagging is the single highest-leverage thing you can set up before a spike, because it converts a multi-hour forensic exercise into a one-query lookup.
Tag every call with at least these dimensions: feature, user, tenant, and agent run.
You do not have to build this from scratch. Observability layers handle it as metadata or headers with near-zero overhead:
Pick one and tag consistently. The day you do this is the day cost incidents stop being scary, because the answer to "who caused this" becomes a filter, not an investigation.
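If you want to see the shape of the idea without committing to a tool, here is a minimal in-process sketch. Everything in it is hypothetical: the `RequestTags` fields, the `record_usage` helper, and the in-memory list standing in for whatever store or observability layer you actually use.

```python
from collections import defaultdict
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RequestTags:
    feature: str
    user_id: str
    tenant_id: str
    agent_run_id: str = "none"   # set only for calls made inside an agent run

def record_usage(log, tags, prompt_tokens, completion_tokens, cost_usd):
    """Append one tagged usage row; call this right after every model call."""
    log.append({
        **asdict(tags),
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": cost_usd,
    })

def cost_by(log, dimension):
    """Total cost per tag value, most expensive first."""
    totals = defaultdict(float)
    for row in log:
        totals[row[dimension]] += row["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

With rows shaped like this, "who caused this" is a single call such as `cost_by(log, "agent_run_id")` or `cost_by(log, "tenant_id")`.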
Feature-Level Alerts That Fire Before the Invoice
Tagging tells you who caused a spike after it happens. Alerting tells you while it is happening. The difference is the difference between a $40 incident and a $4,000 one.
The naive approach (a single global "alert if monthly spend exceeds X") is nearly useless. A global threshold is too coarse to catch a single feature going haywire until it has already moved the whole number, and it fires too late to matter. What works is feature-level rolling-baseline alerting: track spend per feature tag, compare each feature's current spend with its own recent baseline, and alert when it runs well above that baseline.
This pattern is part of the broader practice of watching cost as a first-class production signal alongside latency and quality, which we cover in AI production monitoring: quality drift, hallucinations, and costs. Cost is not a billing concern that lives in finance. It is an operational metric, and it deserves the same alerting rigor as your error rate.
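A rolling-baseline alert can be sketched in a few lines. This version is an illustration under assumptions, not a production alerter: it takes hourly cost per feature, uses the mean of the last `window` hours as the baseline, and the 3x multiplier, minimum history, and dollar floor are placeholder values to tune against your own traffic.

```python
from collections import defaultdict, deque
from statistics import mean

class RollingBaselineAlert:
    def __init__(self, window=24, multiplier=3.0, min_history=6, floor_usd=1.0):
        self.multiplier = multiplier
        self.min_history = min_history
        self.floor_usd = floor_usd   # ignore spikes on features that cost pennies
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, feature, hourly_cost_usd):
        """Record one hourly cost sample; return True if it should alert."""
        past = self.history[feature]
        alert = False
        if len(past) >= self.min_history:
            threshold = max(self.floor_usd, mean(past) * self.multiplier)
            alert = hourly_cost_usd > threshold
        if not alert:
            past.append(hourly_cost_usd)   # spike samples never enter the baseline
        return alert
```

Keeping alerting samples out of the baseline means a sustained spike keeps firing instead of quietly becoming the new normal. The trade-off is that a legitimate step change in usage needs a manual baseline reset.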
Permanent Fixes: Cap, Enforce, Control
Diagnosing the spike is half the job. The other half is making sure this exact incident cannot recur. These are the controls that turn a recurring fire drill into a non-event. Across the production LLM systems we have reviewed at Particula Tech, the teams that never get paged for cost are the ones that wired these in before they needed them.
Hard retry caps with backoff and jitter. Cap retries at a small number (2-3), use exponential backoff so you do not hammer a struggling dependency, and add jitter so simultaneous failures do not retry in lockstep. This single control kills the retry-storm bucket outright.
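A minimal sketch of that control follows, using "full jitter" (a random delay between zero and the exponential ceiling). The function name, default values, and the choice of which exceptions count as retryable are assumptions; map `retry_on` to the timeout and connection errors your client library actually raises.

```python
import random
import time

def call_with_retries(
    fn,
    max_retries=3,
    base_delay=0.5,
    max_delay=8.0,
    retry_on=(TimeoutError, ConnectionError),
    sleep=time.sleep,
):
    """Call fn(); on a retryable error, retry at most max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise                      # hard cap reached: give up loudly
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))   # exponential backoff with jitter
```

The `sleep` parameter exists only so the delay can be stubbed out in tests. The important property is that the total number of calls is bounded at `max_retries + 1` no matter how long the dependency stays down.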
max_tokens on every call. Set max_tokens (or max_output_tokens) on every production request, sized just above the longest legitimate response. It bounds the blast radius of any output-side runaway. Pair it with stop sequences for structured output so generation ends when the useful content is done rather than running to the cap.
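One way to make the cap impossible to forget is to enforce it where requests are built. The sketch below assumes a hypothetical request payload dict with a `max_tokens` key; the default and ceiling values are placeholders, and the key name should follow whatever your provider's API calls it.

```python
def with_output_cap(payload, default_cap=512, hard_ceiling=2048):
    """Return a copy of the payload with max_tokens always set and bounded."""
    capped = dict(payload)                 # do not mutate the caller's dict
    requested = capped.get("max_tokens")
    if requested is None:
        capped["max_tokens"] = default_cap
    else:
        capped["max_tokens"] = min(int(requested), hard_ceiling)
    return capped
```

Routing every outgoing request through a helper like this means a missing or oversized cap becomes a bounded default rather than an open-ended generation.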
Hard iteration caps on agent loops. Every agentic loop needs a max-iterations guard and a per-run spend ceiling. An agent with no upper bound on iterations is an open tap. This is the one control that would prevent most of the $40-in-minutes incidents.
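Both guards fit in one small loop wrapper. This is a sketch with a hypothetical interface: `step` stands in for one agent turn and is assumed to return `(done, cost_usd)`; the default limits are placeholders.

```python
def run_agent(step, max_iterations=20, max_spend_usd=2.0):
    """Run step(i) until it reports done, or until a hard cap stops the run."""
    spent = 0.0
    for i in range(max_iterations):
        done, cost_usd = step(i)
        spent += cost_usd
        if done:
            return {"status": "done", "iterations": i + 1, "spent_usd": spent}
        if spent >= max_spend_usd:
            # Per-run spend ceiling: stop even if iterations remain.
            return {"status": "spend_ceiling", "iterations": i + 1, "spent_usd": spent}
    # Iteration cap: the loop never reported done.
    return {"status": "iteration_cap", "iterations": max_iterations, "spent_usd": spent}
```

Note that the spend check runs after each step, so a run can overshoot the ceiling by at most the cost of one step. Treat any status other than `done` as an alertable event, not a silent failure.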
Reasoning-effort controls. Where the provider exposes a reasoning-effort setting, set it deliberately per feature, and route only hard tasks to reasoning tiers. Do not let reasoning be the default.
Model-version pinning. Pin explicit model versions in config rather than floating on a provider default. A pinned version cannot be silently upgraded into a more expensive tier underneath you.
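Pinning is easier to keep honest when it is enforced in code rather than by convention. The sketch below assumes a per-feature allowlist; the model identifiers are made-up placeholders, not real model names, and what counts as a "pinned" identifier depends on your provider's naming scheme.

```python
# Hypothetical pinned versions, one per feature. Replace with real dated IDs.
PINNED_MODELS = {
    "summarize": "example-model-2025-01-15",
    "chat": "example-model-small-2025-01-15",
}

def resolve_model(feature, requested=None):
    """Return the pinned model for a feature; reject any other model string."""
    pinned = PINNED_MODELS.get(feature)
    if pinned is None:
        raise KeyError(f"no pinned model configured for feature {feature!r}")
    if requested is not None and requested != pinned:
        raise ValueError(
            f"feature {feature!r} is pinned to {pinned!r}, got {requested!r}"
        )
    return pinned
```

With this in the request path, a floating alias or a stray config change fails fast instead of silently changing the per-token price.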
Tagging and rolling-baseline alerts. The two controls from the previous sections are also permanent fixes. They are what make the next incident a five-minute lookup instead of an overnight surprise.
None of these is a heavy lift. The whole set is roughly a day of work, and it pays for itself the first time a background job tries to spend your quarter's budget in an afternoon.
Here is the priority order if you can only do a few this week:
| Fix | Stops which bucket | Effort | Priority |
|---|---|---|---|
| Per-request tagging | All (enables diagnosis) | Low | Do first |
| max_tokens enforcement | Output runaway, open-ended prompts | Low | Do first |
| Hard iteration cap + per-run ceiling | Runaway agent loop | Low | Do first |
| Retry cap + backoff + jitter | Retry storm | Low | High |
| Rolling-baseline feature alerts | All (early warning) | Medium | High |
| Model-version pinning | Unintended upgrade | Low | Medium |
| Reasoning-effort controls | Reasoning thinking-token 10x | Low | Medium |
Wrapping Up: From Surprise Invoice to Boring Dashboard
A token cost runaway feels like chaos in the moment, but the diagnosis is mechanical once you stop staring at the dollar total and start reading the data. Check whether volume or size spiked. Split prompt tokens from completion tokens. Group by tag. Those three moves narrow six buckets to one in about ten minutes, and the silent-10x reasoning case is the only one that needs you to look past the visible output.
The deeper lesson is that cost spikes are a monitoring failure as much as a code failure. The incident is loud because the bill is the first signal you got. Wire in per-request tagging, set feature-level rolling-baseline alerts, and enforce the hard caps (retries, max_tokens, iterations, per-run ceilings) and the next runaway loop trips an alert and self-terminates instead of running until the invoice arrives. Cost belongs on the same dashboard as latency and error rate, and it deserves the same alerting discipline.
If you want the full picture of how token spend behaves at the program level (budgeting, forecasting, and chargeback rather than just incident response), see the AI development tools pillar for the broader cluster. And if you would rather have the tagging, alerting, and guardrails built into your stack correctly the first time, that is exactly the kind of work we do at Particula Tech. The setup is small. The first incident it prevents is not.
Frequently Asked Questions
Quick answers to common questions about this topic
An overnight LLM bill jump almost always traces to one of six causes: a runaway agent loop that retried itself thousands of times, a prompt change that ballooned token counts per request, an unintended model upgrade (a default version bump or a fallback to a pricier tier), a retry storm from a flaky dependency, a single noisy tenant hammering a shared budget, or a leaked API key being used by someone else. Start by pulling the token-per-request data for the spike window and splitting it into prompt_tokens versus completion_tokens. That single split tells you whether the problem is on the input side (bloated context) or the output side (uncapped generation), which narrows six buckets down to two or three in about five minutes.



