Why did my OpenAI bill jump overnight?

An overnight LLM bill jump almost always traces to one of six causes: a runaway agent loop that retried itself thousands of times, a prompt change that ballooned token counts per request, an unintended model upgrade (a default version bump or a fallback to a pricier tier), a retry storm from a flaky dependency, a single noisy tenant hammering a shared budget, or a leaked API key being used by someone else. Start by pulling the token-per-request data for the spike window and splitting it into prompt_tokens versus completion_tokens. That single split tells you whether the problem is on the input side (bloated context) or the output side (uncapped generation), which narrows six buckets down to two or three in about five minutes.

How do I diagnose an LLM cost spike?

Diagnose an LLM cost spike by reading three signals in order: total request volume, the prompt_tokens versus completion_tokens split, and the per-request tags. First check whether request count spiked (a loop or retry storm) or stayed flat while cost rose (bigger requests). Then read the token split: large completion_tokens means open-ended generation, a missing stop sequence, no max_tokens cap, or reasoning thinking-tokens billed as output. Large prompt_tokens means context bloat, oversized tool schemas, long history, or untrimmed retrieval chunks. Finally, group cost by your per-request tags (user, feature, agent-run) to isolate exactly which surface caused it. Tools like Helicone, Langfuse, and Traceloop give you all three views without instrumenting each call by hand.

How fast can a runaway agent loop burn money?

A runaway agent loop can burn $40 in minutes. The math is simple: an agent that calls a frontier model in a tight loop, with a large context window re-sent on every iteration, can fire dozens of requests per second. At tens of thousands of tokens and a few cents per call, that compounds fast. The danger is that loops often run unattended overnight or in background jobs, so the first signal you get is the invoice, not an alert. This is why hard iteration caps and per-agent-run spend ceilings matter more than after-the-fact dashboards. A loop with no max-iterations guard and no per-run budget is an open tap, and it will run until something else stops it.

Are reasoning model thinking tokens billed as output tokens?

Yes. On most reasoning models the internal thinking tokens are billed as output tokens, even though you never see them in the response. This is the silent 10x: a request that returns a 200-token answer can bill for 2,000-plus output tokens because the model spent the rest reasoning. If you switch a feature to a reasoning model or raise its reasoning-effort setting, your completion_tokens can multiply with no visible change in the final answers. When you see large completion_tokens but short user-facing responses, suspect reasoning overhead first. Control it with reasoning-effort settings where the provider exposes them, route only genuinely hard tasks to reasoning tiers, and keep a separate budget line for reasoning spend so it does not hide inside general output cost.

How do I prevent runaway LLM costs in production?

Prevent runaway LLM costs with four permanent controls. First, enforce hard retry caps with exponential backoff plus jitter so a flaky dependency cannot trigger a retry storm. Second, set max_tokens (or max_output_tokens) on every call so a single request cannot generate unbounded output. Third, tag every request with user, feature, and agent-run identifiers, and put per-agent-run spend ceilings on autonomous loops. Fourth, add feature-level rolling-baseline alerts that compare current spend against a trailing window and fire before the invoice lands, not after. Layer on reasoning-effort controls for reasoning models and a model-version pin so a default upgrade cannot silently change your unit economics. These controls cost a day to wire and pay for themselves the first time a loop misbehaves.

What is per-request tagging and why does it matter for cost?

Per-request tagging attaches metadata (user ID, feature name, agent-run ID, tenant, environment) to every LLM call so you can slice cost by any dimension after the fact. It matters because without tags, a spike is just a bigger number on the invoice with no way to find the source. With tags, you group the spike window by feature in seconds and see that, for example, one background summarization job accounts for 80% of the cost. Most observability layers (Helicone, Langfuse, Traceloop) let you pass tags as headers or metadata fields with near-zero overhead. Tagging is the single highest-leverage thing you can do before a spike happens, because it turns a multi-hour forensic investigation into a one-query lookup.

Should I set max_tokens on every LLM call?

Yes, set max_tokens (or the provider's max_output_tokens equivalent) on every production LLM call. It is the cheapest insurance against output-side runaway cost. Without a cap, a model that misreads an open-ended prompt can generate to the full context limit, and a single such request can cost dollars instead of cents. Pick a cap slightly above the longest legitimate response your feature needs, then alert when responses hit the cap so you can catch prompts that are genuinely truncating. The cap does not fix a bad prompt, but it bounds the blast radius. Combine it with stop sequences for structured outputs so generation ends as soon as the useful content is complete rather than running to the limit.

LLM Bill Spiked Overnight: A Token Cost Runaway Playbook

An on-call playbook for an LLM cost spike: six failure buckets, the completion vs prompt token signal that finds the source, and the permanent fixes.

Sebastian Mondragon · MAY 26, 2026 · 11 MIN READ

The on-call page says the LLM bill spiked overnight. Yesterday's API spend was a flat line; this morning it is a cliff. Nothing shipped, no traffic spike shows in the dashboards, and yet the projected monthly cost just tripled. This is the most common LLM cost incident there is, and the diagnosis is faster than it feels in the moment if you read the signals in the right order.

A token cost runaway almost never comes from "the model got more expensive." It comes from one of six concrete failure modes, and each one leaves a different fingerprint in the usage data. A runaway agent loop looks nothing like a prompt regression, which looks nothing like a leaked key. The mistake most teams make under pressure is staring at the total dollar figure instead of the one number that actually localizes the problem: the split between prompt tokens and completion tokens. That split alone cuts the search space from six causes to two or three in about five minutes.

This is the playbook I use to triage a token cost spike, written as an incident runbook rather than a theory piece. We will walk the six spike buckets, the diagnostic signals that distinguish them, the silent 10x that reasoning models introduce, and the permanent fixes that stop the same incident from happening twice. The goal is not just to explain this invoice. It is to make the next spike fire an alert before the invoice, not after.

01 · The On-Call Scenario: Triage Before You Theorize

When the bill jumps, resist the urge to guess. The fastest path is a fixed triage order, the same way you would read a latency incident. Pull the usage data for the spike window and answer three questions in sequence.

Question one: did request volume spike, or did request size spike? Plot request count over the incident window next to total cost. If the request count tracks the cost curve, you have a volume problem (a loop or a retry storm). If request count is flat but cost climbed, you have a size problem (something made each request more expensive). This single chart eliminates half the buckets immediately.

Question two: is the extra cost on the input side or the output side? Split the spike window into aggregate prompt_tokens versus completion_tokens. This is the highest-signal number in the entire investigation, and we will spend the next two sections on what each side means.

Question three: which tag owns the spike? If you tagged your requests (and if you did not, fix that the moment this incident is over), group the cost by feature, by user, by agent-run, by tenant. A spike that concentrates in one tag is a different problem from one spread evenly across all traffic.

Three questions, maybe ten minutes with a decent observability layer, and you have narrowed a vague "the bill spiked" into a specific hypothesis. Now you can theorize.
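To make the first two questions concrete, here is a minimal sketch of how I might compare a baseline window against the spike window. The `Window` type, its field names, and the `triage` helper are all hypothetical; your observability layer will export aggregates in its own shape, so treat this as an illustration of the comparison rather than a drop-in tool.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Aggregate usage for one time window (field names are illustrative)."""
    requests: int
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float


def triage(baseline: Window, spike: Window) -> dict:
    """Answer questions one and two. Assumes every count is non-zero."""
    # Question one: did volume grow, or did each request get more expensive?
    volume_ratio = spike.requests / baseline.requests
    per_request_cost_ratio = (spike.cost_usd / spike.requests) / (
        baseline.cost_usd / baseline.requests
    )
    # Question two: which side of the token split grew more per request?
    prompt_growth = (spike.prompt_tokens / spike.requests) / (
        baseline.prompt_tokens / baseline.requests
    )
    completion_growth = (spike.completion_tokens / spike.requests) / (
        baseline.completion_tokens / baseline.requests
    )
    return {
        "kind": "volume" if volume_ratio >= per_request_cost_ratio else "size",
        "side": "input" if prompt_growth >= completion_growth else "output",
        "volume_ratio": volume_ratio,
        "per_request_cost_ratio": per_request_cost_ratio,
    }
```

Compare like with like: the same hours on a quiet day make a better baseline than a monthly average.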

02 · The Six Spike Buckets

Every token cost runaway I have triaged fits one of these six buckets. Memorize them; they are your differential diagnosis.

The buckets are not mutually exclusive (a retry storm can also be a noisy tenant), but in practice one of them dominates the spike. Walk them in roughly this order, because the early ones are both more common and faster to confirm.

Runaway agent loop

The most expensive bucket and the most common in agentic systems. An agent with a tool-use loop and no hard iteration cap can get stuck retrying the same step, re-sending its entire context on every turn. Because the full conversation history is resent each iteration, cost grows faster than linearly with loop length. A runaway loop can burn $40 in minutes, and it usually fires in a background job where nobody is watching. The fingerprint is unmistakable once you have per-agent-run tags: one run ID with hundreds or thousands of calls in a tight time window.
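The faster-than-linear growth is easy to check with arithmetic. The sketch below assumes the context grows by a fixed number of tokens each turn and that the whole context is re-sent every iteration. The function name, parameters, and prices are placeholders I made up for illustration, not any provider's actual rates.

```python
def loop_cost_usd(
    iterations: int,
    base_context_tokens: int,
    tokens_added_per_turn: int,
    output_tokens_per_turn: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate the cost of an agent loop that re-sends its full history."""
    # Turn i re-sends the base context plus everything added so far,
    # so total prompt tokens grow quadratically with loop length.
    prompt_total = sum(
        base_context_tokens + i * tokens_added_per_turn
        for i in range(iterations)
    )
    completion_total = iterations * output_tokens_per_turn
    return (
        prompt_total / 1_000_000 * input_price_per_mtok
        + completion_total / 1_000_000 * output_price_per_mtok
    )


# Placeholder prices: $3 per million input tokens, $15 per million output.
cost_200 = loop_cost_usd(200, 8_000, 500, 300, 3.0, 15.0)
cost_400 = loop_cost_usd(400, 8_000, 500, 300, 3.0, 15.0)
```

Under these placeholder numbers, 200 iterations cost about $35, and doubling the loop length more than triples the cost, which is why an iteration cap matters more than any per-call optimization.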

Prompt regression

A code change quietly grew the prompt. Someone added more few-shot examples, stopped trimming conversation history, started stuffing more retrieval chunks into context, or expanded the system prompt. Request count stays flat, but prompt_tokens per request jumps at a deploy boundary. This is why correlating the spike start time with your deploy timeline is worth doing early.

Unintended model upgrade

The token counts look normal but cost-per-token rose. A provider bumped the default model behind an alias, an autoscaling or fallback path routed traffic to a pricier tier, or someone changed a model string in config. Note that Anthropic shifted Claude enterprise to dynamic usage pricing in April 2026, which on its own can move heavy users' bills 2-3x with no code change at all. Pin your model versions explicitly so a default upgrade cannot silently rewrite your unit economics.
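One cheap guard is a config check that flags model strings that look like floating aliases. The sketch below relies on a heuristic I am assuming, not a provider guarantee: that pinned snapshots end in a date such as `2024-08-06` or `20241022`. The `unpinned_models` helper and the feature names are hypothetical, so verify the naming convention against your provider's model list before relying on it.

```python
import re

# Heuristic (an assumption): pinned snapshots end in a YYYY-MM-DD or YYYYMMDD date.
_DATED_SUFFIX = re.compile(r"\d{4}-?\d{2}-?\d{2}$")


def unpinned_models(config: dict[str, str]) -> list[str]:
    """Return the features whose model string has no dated suffix."""
    return sorted(
        feature
        for feature, model in config.items()
        if not _DATED_SUFFIX.search(model)
    )


config = {
    "summarize": "gpt-4o",
    "chat": "gpt-4o-2024-08-06",
    "triage": "claude-3-5-sonnet-latest",
}
flagged = unpinned_models(config)
```

Running this in CI turns "someone changed a model string" from an invoice surprise into a failed check.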

Retry storm

A flaky upstream (the model API timing out, a tool endpoint 500ing) triggers retries. Without a hard retry cap and proper backoff, a brief outage turns into a flood of duplicated requests. The fingerprint is bursts of byte-identical payloads clustered around a dependency blip.
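The hard retry cap and backoff mentioned above can be sketched as follows. This is a minimal illustration, not a production client: `call_with_retries` and its parameters are names I chose, it catches every exception for brevity (real code should retry only on transient errors such as timeouts and 5xx responses), and the `sleep` and `rng` arguments exist only so the behavior can be tested without waiting.

```python
import random
import time


def call_with_retries(
    fn,
    max_retries: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep=time.sleep,
    rng=random.random,
):
    """Call fn, retrying at most max_retries times with jittered backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            # Hard cap: after the last allowed retry, give up and re-raise.
            if attempt == max_retries:
                raise
            # Exponential backoff, capped so delays stay bounded.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the delay so
            # simultaneous failures do not retry in lockstep.
            sleep(delay * rng())
```

With `max_retries=3` a single logical request can produce at most four calls, which puts a known ceiling on how much a dependency blip can multiply traffic.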

Noisy tenant

In a multi-tenant product on a shared budget, one tenant's behavior can dominate everyone's cost. A single customer kicks off a bulk job, or one account gets abused, and the shared line item spikes. You only see this cleanly if you tag by tenant. We cover the attribution side of this in depth in our guide to per-tenant LLM cost attribution for multi-tenant SaaS.

Leaked API key

The worst case. A key ended up in a public repo, a client bundle, or a leaked log, and someone else is spending your money. The fingerprint is usage that does not match any of your features: unfamiliar models, odd regions, off-hours traffic, request shapes your app never produces. If the spike has no internal explanation, rotate the key first and ask questions after.

Bucket	Volume or size	Token signal	Typical fingerprint
Runaway agent loop	Volume	Large completion + prompt	One agent-run tag with thousands of calls in minutes
Prompt regression	Size	Large prompt_tokens	Flat request count, prompt_tokens jumped at a deploy boundary
Unintended model upgrade	Size	Cost-per-token jumped	Same token counts, higher cost; default version or tier changed
Retry storm	Volume	Duplicated identical requests	Bursts of identical payloads, often after a dependency blip
Noisy tenant	Volume or size	Concentrated in one tenant tag	Shared budget, one tenant suddenly dominates
Leaked API key	Volume	Unfamiliar usage pattern	Traffic from unknown features, regions, or off-hours

03 · Reading the Signals: Completion Tokens vs Prompt Tokens

The prompt/completion split is the workhorse of this diagnosis because input and output cost are billed separately and caused by completely different things. Read it before you read anything else.

When completion_tokens is the culprit

Large completion_tokens means the model generated a lot of output. The usual causes:

An open-ended prompt. The instruction invites long generation ("write a detailed report on...") and the model obliges to the limit.
A missing stop sequence. For structured output, without a stop the model keeps going past the useful content, padding the response.
A missing max-output cap. No max_tokens means a single misread request can generate to the full context window. This is the most preventable output-side runaway.
Reasoning thinking-tokens billed as output. The silent 10x, covered in its own section below because it surprises people the most.

The fix pattern for output-side spikes is always the same: cap it, stop it, and only use reasoning tiers where the task genuinely needs them.
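As a sketch of the cap-and-stop pattern: the helper below builds request parameters that always carry an output cap, and a second helper measures how often responses hit that cap. The parameter names `max_tokens` and `stop`, and the `"length"` finish reason, follow the OpenAI Chat Completions convention as I understand it; other providers use different names, so treat them as assumptions. Both function names are hypothetical.

```python
def capped_request(messages: list[dict], max_tokens: int, stop=None) -> dict:
    """Build request parameters that always include an output cap."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    params = {"messages": messages, "max_tokens": max_tokens}
    if stop:
        # A stop sequence ends generation as soon as the useful content is done.
        params["stop"] = stop
    return params


def cap_hit_rate(finish_reasons: list[str]) -> float:
    """Fraction of responses truncated by the cap ("length" finish reason)."""
    if not finish_reasons:
        return 0.0
    return sum(reason == "length" for reason in finish_reasons) / len(finish_reasons)
```

A rising cap-hit rate is worth alerting on: it means either the cap is too tight for legitimate responses or a prompt has started inviting runaway output.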

When prompt_tokens is the culprit

Large prompt_tokens means each request carries too much context in. The usual causes:

Prompt bloat. The system prompt, few-shot examples, or instructions grew over time.
Oversized tool schemas. Agentic apps pay for every tool definition on every call. A catalog of 50-plus tools can eat a meaningful slice of context before the user's message even appears.
Long history. Conversation or agent history is being resent in full instead of trimmed or summarized.
Untrimmed retrieval chunks. A RAG step is stuffing top-k results in raw, including chunks that are too long or only marginally relevant.

Input-side spikes are about discipline: trim history, prune tool schemas to what the current task needs, and cap retrieval payload size. Many of these overlap with the broader optimization patterns in our guide to reducing LLM token costs through optimization.
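Trimming history to a token budget can be sketched in a few lines. This version keeps system messages and then as many of the most recent turns as fit. The `estimate_tokens` heuristic (about four characters per token) is a rough assumption for illustration; swap in your model's real tokenizer for anything you bill against. Both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption): roughly four characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep system messages plus the newest turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    # Walk backwards from the newest turn and stop at the first one
    # that would push the total over budget.
    for message in reversed(rest):
        cost = estimate_tokens(message["content"])
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    return system + list(reversed(kept))
```

Dropping old turns is the bluntest option. Summarizing them instead preserves more context, at the price of an extra model call.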

04 · The Silent 10x: Reasoning Thinking-Tokens

Reasoning models deserve their own warning because they break the intuition that output cost tracks visible output length. On most reasoning models, the internal chain-of-thought is billed as output tokens even though it never appears in the response. A request that returns a tidy 200-token answer can bill for 2,000 or more output tokens because the model spent the rest of its budget thinking.

This produces a spike that is genuinely confusing on first read. Your completion_tokens aggregate jumps, but if you sample the actual responses they look normal, short even. Nothing in the user-facing output explains the cost. The trigger is almost always a config change: someone switched a feature to a reasoning model, or raised the reasoning-effort setting, or a routing rule started sending more traffic to a reasoning tier.

Three controls keep this in check:

Use reasoning-effort settings where the provider exposes them. Lower effort for tasks that do not need deep reasoning.

Route deliberately. Send only genuinely hard tasks to reasoning tiers. Cheap-first routing, which we cover in LLM model routing: cheap first, expensive only when needed, keeps the expensive reasoning path off the hot path for simple work.

Budget reasoning separately. Keep a distinct cost line for reasoning spend so it cannot hide inside general output cost and so a regression here shows up against its own baseline.

When you see large completion_tokens against short answers, suspect reasoning overhead before anything else. It is the bucket teams most often miss because the evidence is invisible by design.
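Where the provider reports reasoning tokens separately, you can measure this directly instead of inferring it. The sketch below assumes a usage payload shaped like the one OpenAI's Chat Completions API returns, with `completion_tokens_details.reasoning_tokens` nested inside the usage object. Other providers expose this differently or not at all, and the helper names and the 0.8 threshold are my own placeholders.

```python
def reasoning_share(usage: dict) -> float:
    """Fraction of billed output tokens that were hidden reasoning."""
    completion = usage.get("completion_tokens", 0)
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0)
    return reasoning / completion if completion else 0.0


def looks_like_reasoning_overhead(usage: dict, threshold: float = 0.8) -> bool:
    """Flag requests where most of the billed output was never shown."""
    return reasoning_share(usage) >= threshold


usage = {
    "prompt_tokens": 500,
    "completion_tokens": 2200,
    "completion_tokens_details": {"reasoning_tokens": 2000},
}
share = reasoning_share(usage)
```

In this example roughly nine out of every ten billed output tokens were reasoning, which is exactly the pattern that looks normal in sampled responses and abnormal on the invoice.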

05 · Per-Request Tagging So You Can Find the Source Fast

Everything above assumes you can slice your spend by feature, user, and agent-run. If you cannot, the spike is just a bigger number with no handle on it, and you are reduced to grepping logs and guessing. Per-request tagging is the single highest-leverage thing you can set up before a spike, because it converts a multi-hour forensic exercise into a one-query lookup.

Tag every call with at least these dimensions:

User or account ID (catches noisy tenants and abuse)

Feature name (isolates which product surface spiked)

Agent-run ID (essential for catching runaway loops; one run that owns thousands of calls jumps out immediately)

Tenant and environment (separates a prod incident from a staging job that escaped its budget)

You do not have to build this from scratch. Observability layers handle it as metadata or headers with near-zero overhead:

Helicone sits as a proxy and attaches custom properties per request, giving you cost grouped by any tag in its dashboard.

Langfuse captures traces with tags and metadata, which is useful when the spike is inside a multi-step agent run and you need to see which step exploded.

Traceloop instruments via OpenTelemetry, so your LLM spend lands in the same tracing system as the rest of your stack.

Pick one and tag consistently. The day you do this is the day cost incidents stop being scary, because the answer to "who caused this" becomes a filter, not an investigation.
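Once requests carry tags, the "filter, not an investigation" step is a plain group-by. Here is a minimal sketch over exported request records. The record shape (a `tags` dict and a `cost_usd` field) and the `cost_by_tag` helper are hypothetical; each tool exports under its own field names.

```python
from collections import defaultdict


def cost_by_tag(records: list[dict], tag: str) -> list[tuple[str, float, float]]:
    """Return (tag value, cost, share of total) rows, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        # Untagged requests get their own bucket so gaps in tagging are visible.
        totals[record["tags"].get(tag, "untagged")] += record["cost_usd"]
    grand_total = sum(totals.values())
    rows = [
        (value, cost, cost / grand_total if grand_total else 0.0)
        for value, cost in totals.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


records = [
    {"tags": {"feature": "summarize"}, "cost_usd": 8.0},
    {"tags": {"feature": "chat"}, "cost_usd": 1.5},
    {"tags": {}, "cost_usd": 0.5},
]
rows = cost_by_tag(records, "feature")
```

Run the same function with `"tenant"` or `"agent_run"` as the tag and the noisy-tenant and runaway-loop buckets fall out of the same data.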

06 · Feature-Level Alerts That Fire Before the Invoice

Tagging tells you who caused a spike after it happens. Alerting tells you while it is happening. That is the difference between a $40 incident and a $4,000 one.

The naive approach (a single global "alert if monthly spend exceeds X") is nearly useless. A global threshold is too coarse to catch a single feature going haywire until it has already moved the whole number, and it fires too late to matter. What works is feature-level rolling-baseline alerting:

Compute a trailing baseline per feature. For each tagged feature, maintain a rolling average of its cost or token rate over a trailing window (say, the last 7 days at the same hour).

Alert on deviation, not absolute value. Fire when a feature's current rate exceeds its own baseline by a meaningful multiple (3x is a reasonable starting threshold). A feature that normally costs pennies an hour jumping to dollars an hour is a far stronger signal than a global number creeping up.

Add a per-agent-run ceiling. For autonomous loops, do not even wait for the rolling alert. Put a hard spend ceiling on each agent-run so a single loop self-terminates before it can burn through a meaningful budget.
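The baseline-and-deviation rule above fits in a few lines. This sketch takes one feature's cost for the same hour on each of the trailing days and compares the current hour against the mean. The `should_alert` name, the `min_baseline` floor, and the default 3x multiple are illustrative choices, not values from any particular tool.

```python
from statistics import mean


def should_alert(
    same_hour_history: list[float],
    current: float,
    multiple: float = 3.0,
    min_baseline: float = 0.01,
) -> bool:
    """Fire when current spend exceeds the trailing baseline by a multiple."""
    if not same_hour_history:
        # No baseline yet (new feature): stay quiet rather than guess.
        return False
    # A floor keeps near-zero baselines from alerting on trivial spend.
    baseline = max(mean(same_hour_history), min_baseline)
    return current > multiple * baseline


# Seven days of cost (USD) for one feature at the same hour of day.
history = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.10]
```

A mean is the simplest baseline. A median or a percentile is more robust if the trailing window already contains an earlier incident.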

This pattern is part of the broader practice of watching cost as a first-class production signal alongside latency and quality, which we cover in AI production monitoring: quality drift, hallucinations, and costs. Cost is not a billing concern that lives in finance. It is an operational metric, and it deserves the same alerting rigor as your error rate.

07 · Permanent Fixes: Cap, Enforce, Control

Diagnosing the spike is half the job. The other half is making sure this exact incident cannot recur. These are the controls that turn a recurring fire drill into a non-event. Across the production LLM systems we have reviewed at Particula Tech, the teams that never get paged for cost are the ones that wired these in before they needed them.

Hard retry caps with backoff and jitter. Cap retries at a small number (2-3), use exponential backoff so you do not hammer a struggling dependency, and add jitter so simultaneous failures do not retry in lockstep. This single control kills the retry-storm bucket outright.

max_tokens on every call. Set max_tokens (or max_output_tokens) on every production request, sized just above the longest legitimate response. It bounds the blast radius of any output-side runaway. Pair it with stop sequences for structured output so generation ends when the useful content is done rather than running to the cap.

Hard iteration caps on agent loops. Every agentic loop needs a max-iterations guard and a per-run spend ceiling. An agent with no upper bound on iterations is an open tap. This is the one control that would prevent most of the $40-in-minutes incidents.

Reasoning-effort controls. Where the provider exposes a reasoning-effort setting, set it deliberately per feature, and route only hard tasks to reasoning tiers. Do not let reasoning be the default.

Model-version pinning. Pin explicit model versions in config rather than floating on a provider default. A pinned version cannot be silently upgraded into a more expensive tier underneath you.

Tagging and rolling-baseline alerts. The two controls from the previous sections are also permanent fixes. They are what make the next incident a five-minute lookup instead of an overnight surprise.
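Of these controls, the agent-loop guard is the one most worth seeing in code, because it is so small. The sketch below combines a max-iterations cap with a per-run spend ceiling. `RunGuard` and `BudgetExceeded` are hypothetical names, and the sketch assumes you can compute each call's cost from its token usage once the call returns.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run hits its iteration or spend limit."""


class RunGuard:
    """Per-agent-run guard: hard iteration cap plus spend ceiling."""

    def __init__(self, max_iterations: int, max_spend_usd: float):
        self.max_iterations = max_iterations
        self.max_spend_usd = max_spend_usd
        self.iterations = 0
        self.spend_usd = 0.0

    def before_call(self) -> None:
        """Call before every model request; raises once a limit is reached."""
        if self.iterations >= self.max_iterations:
            raise BudgetExceeded(f"iteration cap {self.max_iterations} reached")
        if self.spend_usd >= self.max_spend_usd:
            raise BudgetExceeded(f"spend ceiling ${self.max_spend_usd:.2f} reached")
        self.iterations += 1

    def record_cost(self, cost_usd: float) -> None:
        """Call after every model request with that request's cost."""
        self.spend_usd += cost_usd
```

Create one guard per agent run, call `before_call()` at the top of the loop and `record_cost()` after each response. Because the spend check runs before the next call, a run can overshoot its ceiling by at most one request.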

Here is the priority order if you can only do a few this week:

Fix	Stops which bucket	Effort	Priority
Per-request tagging	All (enables diagnosis)	Low	Do first
max_tokens enforcement	Output runaway, open-ended prompts	Low	Do first
Hard iteration cap + per-run ceiling	Runaway agent loop	Low	Do first
Retry cap + backoff + jitter	Retry storm	Low	High
Rolling-baseline feature alerts	All (early warning)	Medium	High
Model-version pinning	Unintended upgrade	Low	Medium
Reasoning-effort controls	Reasoning thinking-token 10x	Low	Medium

None of these is a heavy lift. The whole set is roughly a day of work, and it pays for itself the first time a background job tries to spend your quarter's budget in an afternoon.

08 · Wrapping Up: From Surprise Invoice to Boring Dashboard

A token cost runaway feels like chaos in the moment, but the diagnosis is mechanical once you stop staring at the dollar total and start reading the data. Check whether volume or size spiked. Split prompt tokens from completion tokens. Group by tag. Those three moves narrow six buckets to one in about ten minutes, and the silent-10x reasoning case is the only one that needs you to look past the visible output.

The deeper lesson is that cost spikes are a monitoring failure as much as a code failure. The incident is loud because the bill is the first signal you got. Wire in per-request tagging, set feature-level rolling-baseline alerts, and enforce the hard caps (retries, max_tokens, iterations, per-run ceilings) and the next runaway loop trips an alert and self-terminates instead of running until the invoice arrives. Cost belongs on the same dashboard as latency and error rate, and it deserves the same alerting discipline.

If you want the full picture of how token spend behaves at the program level (budgeting, forecasting, and chargeback rather than just incident response), see the AI development tools pillar for the broader cluster. And if you would rather have the tagging, alerting, and guardrails built into your stack correctly the first time, that is exactly the kind of work we do at Particula Tech. The setup is small. The first incident it prevents is not.

09 · FAQ

Quick answers to the questions this post tends to raise.
