The widely cited $20K/mo (or 5-10M premium tokens) self-host break-even ignores three line items that dominate real TCO: idle GPU utilization at off-peak (30-50% of capacity routinely wasted at steady state), the over-provisioning needed to absorb a P95 spike, and the senior MLOps/infra hire ($250-360K loaded in the US, $140-220K in the EU). The real flip point for pure self-host in 2026 is closer to $50-80K/mo of premium API spend. The hybrid pattern — reserved H100 capacity for the 95-98% of traffic that fits an open-weight model, premium API for the spiky long-tail — flips the break-even down to roughly $25K/mo and stays profitable as you scale. Most self-host decisions we see fail the spreadsheet test on idle GPU and headcount, not on token throughput.
H100 spot pricing on RunPod sits at $2.69/hr in 2026. Eight of those for a month is roughly $15,700. Premium API spend at $30K/mo sounds like it should fall to half that the moment you self-host. The napkin math is clean, the slide deck writes itself, and most of the time the spreadsheet is wrong.
Across the self-host migrations we've reviewed, the post-launch number lands 1.5-2.5x higher than the pre-decision spreadsheet promised — sometimes higher. The misses are not on token throughput or model accuracy. They're on three line items that don't make it into the napkin math: idle GPU utilization at off-peak, the over-provisioning required to absorb a P95 spike without falling over, and the senior MLOps hire who keeps the stack alive at 3 a.m. Add eval, observability, on-call, and the model-update churn cost from a new SOTA shipping every 6-8 weeks, and the real self-host LLM vs API break-even sits closer to $50-80K/mo of premium API spend, not $20K.
This post is the TCO framework we use to scope that decision when an engineering leader needs to bring a defensible recommendation to finance. We'll walk the naïve break-even and what it hides, the line-by-line cost model, the worked H100-hosting comparison, and the hybrid pattern that flips the break-even down to roughly $25K/mo and stays there as you scale. It's the same logic that runs underneath our 4-tier AI gateway decision framework and the cheap-first model routing pattern — different question, same first principles.
The Naïve Break-Even (And What It Hides)
The number you'll see on Twitter, on most "self-host vs OpenAI" blog posts, and in roughly half the LinkedIn slide decks circulating among AI eng leads in 2026 is some variant of this:
Self-host once you cross $20K/mo in premium API spend (roughly 5-10M premium tokens).

Below those numbers, stay on API. Above them, self-host. The math comes from dividing premium token cost (input + output, weighted) by H100 sticker pricing at full utilization, then assuming the rest of the stack is free.
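A minimal sketch of that napkin math, with the hidden assumptions spelled out in comments; the sticker inputs are this post's 2026 numbers and the $30K/mo premium bill is illustrative.

```python
# The napkin version of the break-even, hidden assumptions included.
h100_spot_hr = 2.69        # RunPod spot, $/hr per H100 (2026 sticker)
gpus = 8
hours_per_month = 730

gpu_monthly = h100_spot_hr * gpus * hours_per_month
print(f"8xH100 spot, 24/7: ${gpu_monthly:,.0f}/mo")      # -> ~$15,708

# Naive rule: self-host "wins" the moment premium API spend exceeds the GPU bill.
# Hidden assumptions: 100% utilization, zero headcount, zero eval/on-call/churn.
premium_api_monthly = 30_000
print(f"Napkin savings: ${premium_api_monthly - gpu_monthly:,.0f}/mo")
```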
It's not a useless number. It's a starting point. But it leaves out everything that turns a Friday slide into a Monday operating reality:

- Idle GPU utilization at off-peak, with 30-50% of capacity routinely wasted at steady state
- Over-provisioning to absorb a P95 spike without falling over
- The senior MLOps/infra hire: $250-360K loaded in the US, $140-220K in the EU
- The eval, observability, and on-call stack
- Model-update churn from a new SOTA shipping every 6-8 weeks
The naïve break-even number assumes all five of those line items are zero. They are not zero. They are roughly half of the spreadsheet.
The Full TCO Model — Line by Line
The cost worksheet we use looks like this. The right column is what gets left off the napkin.
| Line item | Naïve napkin math | What it actually costs |
|---|---|---|
| GPU compute (sticker) | $2.69/hr × 8 × 730 = $15,700/mo | Same — but multiply by 1.5-2.5x for utilization tax |
| Effective utilization | Assumed 100% | 40-65% sustained → $24-39K/mo effective |
| Inference engine + serving | Free (vLLM is OSS) | 0.1-0.25 FTE ops, eval rig, autoscaler |
| Senior MLOps/infra hire | $0 | 0.5-1.5 FTE × $250-360K loaded (US) |
| Eval + observability stack | $0 | $500-3K/mo SaaS or 0.1 FTE self-hosted |
| On-call rotation | $0 | Engineer-hours, plus comp uplift for paged shifts |
| Model-update churn | $0 | 1-3 engineer-weeks per quarter, recurring |
| Incident cost | $0 | The first sev-1 outage of a self-host stack costs more than 6 months of premium API delta |
| All-in monthly TCO | ~$15K | ~$45-90K depending on FTE allocation |

The TCO column is the one that determines whether the project is profitable. A common shape across self-host migrations we've audited: the team sizes the project on column 1, ships it on column 2, and is back to API for the long-tail workloads inside 90 days because the headcount math didn't survive contact with hiring.

The instinct is to pick on $/hr. Across the stacks we've reviewed, the right pick is almost always Modal or Baseten until your monthly premium spend justifies a dedicated platform engineer — at which point the bare-metal number starts to matter.
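The worksheet as something you can re-run: a minimal sketch in Python, with every input set to a mid-point of the ranges above rather than a measured figure.

```python
# Full-TCO sketch: the GPU bill with the hidden line items added back.
# All inputs are mid-points of the worksheet ranges above; substitute your own.
gpu_sticker_monthly = 15_700       # 8xH100 spot, 24/7

utilization = 0.55                 # 40-65% sustained is typical
effective_gpu = gpu_sticker_monthly / utilization        # the "utilization tax"

mlops_fte = 0.75                   # 0.5-1.5 FTE
loaded_annual = 300_000            # $250-360K loaded (US)
headcount = mlops_fte * loaded_annual / 12

eval_obs = 1_500                   # $500-3K/mo SaaS
eng_week = loaded_annual / 52
update_churn = 2 * eng_week / 3    # 1-3 engineer-weeks per quarter, amortized

tco = effective_gpu + headcount + eval_obs + update_churn
print(f"Sticker: ${gpu_sticker_monthly:,.0f}/mo   TCO: ${tco:,.0f}/mo")
# Lands around $53K/mo at these mid-points, inside the $45-90K worksheet band.
```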
H100 Hosting Provider Comparison (2026)
Sticker prices below are public 2026 rates and shift quarterly — confirm with the vendor before budget planning.
| Provider | $/hr H100 (2026) | What you operate | Best fit |
|---|---|---|---|
| RunPod (spot) | $2.69 | Everything: vLLM/SGLang, autoscaler, eval, obs, on-call | Existing platform team, $80K+/mo premium spend, committed-use discount |
| RunPod (reserved) | ~$4.69 | Same as spot, but with availability guarantees | Steady-state production, no preemption tolerance |
| Modal | $3.95 | App code + image; Modal owns orchestration and autoscale | $25-80K/mo, no dedicated infra hire, serverless ergonomics |
| Baseten | $6.50 | Almost nothing — managed inference platform | Compliance-heavy, $20-60K/mo, want SOC 2-grade observability without staffing it |
| Together / Anyscale | Varies | Hosted open-weight inference per token | Below the self-host break-even but want open-weight cost shape |
| Lambda / CoreWeave (reserved) | ~$3-4 (committed) | Bare metal, you bring the stack | $100K+/mo, multi-year commits, peak performance |
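To see why the $/hr column misleads on its own, here is a sketch that folds in the ops fraction you still staff at each tier; the FTE fractions are our assumptions, not vendor claims.

```python
# Effective monthly cost: compute sticker plus the ops you still carry.
# $/hr per H100 from the table above; ops FTE fractions are illustrative.
providers = {
    "RunPod (spot)":     (2.69, 1.00),   # you run the whole stack
    "RunPod (reserved)": (4.69, 1.00),
    "Modal":             (3.95, 0.25),   # platform owns orchestration
    "Baseten":           (6.50, 0.10),   # managed inference
}
GPUS, HOURS, LOADED_ANNUAL = 8, 730, 300_000

for name, (rate, fte) in providers.items():
    compute = rate * GPUS * HOURS
    ops = fte * LOADED_ANNUAL / 12
    print(f"{name:18s} ${compute + ops:>9,.0f}/mo "
          f"(compute {compute:,.0f} + ops {ops:,.0f})")
```

At these assumptions, the managed platforms undercut bare spot until the spend justifies the hire, which is exactly the pattern described above.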
Token Throughput on H100s — A Sanity-Check Baseline
Before sizing reserved capacity, you need a defensible throughput number for the model class you're hosting. The rule of thumb across the open-weight defaults that matter in 2026:

| Model class | Footprint | Order-of-magnitude output throughput (batched, at listed footprint) | Notes |
|---|---|---|---|
| DeepSeek V4 (MoE) | 8xH100 with quant | High — MoE active params keep per-token cost down | Best $/token for general reasoning if your traffic fits a permissive license |
| Llama 3.3-70B (dense) | 8xH100 | Moderate — dense compute scales with params | Enterprise default; well-supported toolchain |
| Mistral Medium 3.5 | 4-8xH100 | Moderate-high | Mid-tier general; strong instruction following |
| Qwen3.6-35B-A3B (MoE) | 1xH100 (80GB) with headroom | High per-GPU (3B active params) | The local coding default — see our open-weight coding model comparison |
Two cautions. First, treat published vLLM/SGLang and vendor numbers as ceilings, not budgets — real throughput depends on your prompt length distribution, batch size, KV cache config, and decoding strategy. Our vLLM vs SGLang inference engine comparison covers the engine-level decision. Second, the only number that matters is the one you measure on your own prompt distribution. Synthetic benchmarks lie, especially at long context.
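Once you have that measured number, sizing reserved capacity for the median is one line of arithmetic. A sketch, with every input a placeholder for your own load-test results:

```python
# Reserved-capacity sizing from a measured throughput number.
# All inputs are hypothetical; replace with your own measurements.
measured_tok_s = 2_500          # output tok/s per 8xH100 node, on YOUR prompt mix
median_monthly_tokens = 4e9     # steady-state output tokens per month
target_utilization = 0.70       # the 70-85% band the hybrid pattern enables

seconds_per_month = 730 * 3600
nodes = median_monthly_tokens / (
    measured_tok_s * seconds_per_month * target_utilization
)
print(f"8xH100 nodes to reserve for the median: {nodes:.2f}")
# Reserve for this number and let the P95 spike spill to the API tier.
```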
Premium API Price Trajectory — The Moving Target
The break-even number is not stable. Over the last 12 months, premium input pricing dropped 30-50% and open-weight throughput climbed sharply on the same hardware.

For the self-host decision, this matters in two directions. On simple workloads, the API is winning faster than your spreadsheet predicted, and the break-even threshold keeps climbing. On reasoning- and tool-heavy workloads, where premium token burn is heaviest, the threshold keeps falling. Run the TCO model every six months — the number that justified self-hosting in November doesn't necessarily justify it in May.
The Hybrid Pattern (Where Most Production Systems Land)
Most production AI systems we've reviewed at the $25-100K/mo premium spend band converge on the same shape, even when the team starts out aiming for pure self-host or pure API: reserved capacity (or a managed platform) serving an open-weight model for the 95-98% of traffic that fits it, premium API for the long-context and reasoning tail, and automatic fallback from the first tier to the second.
The savings come from sizing reserved GPU capacity for the median, not the P95 — the spike traffic gets routed to API, where elastic capacity is the vendor's problem. That alone is the difference between a 40-50% utilization stack and a 70-85% one, and it's where the break-even flips down to roughly $25K/mo.
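The arithmetic behind that flip, sketched with illustrative inputs; the point is the structure (reserved for the median, API for the tail), not the specific dollar amounts.

```python
# Hybrid vs pure-API at an illustrative $30K/mo premium bill.
premium_bill = 30_000
ow_fit = 0.96               # 95-98% of traffic fits the open-weight model

reserved_monthly = 12_000   # hypothetical managed/reserved bill, sized for the median
ops_monthly = 6_000         # ~0.25 FTE absorbed by an existing engineer

hybrid = reserved_monthly + ops_monthly + premium_bill * (1 - ow_fit)
print(f"Hybrid: ${hybrid:,.0f}/mo vs pure API: ${premium_bill:,.0f}/mo")
# ~$19K vs $30K at these inputs; the gap widens as qualifying traffic grows.
```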
A Minimal LiteLLM Hybrid-Router Config

The shape of the routing config that makes this concrete:

```yaml
# litellm config.yaml — hybrid router with self-hosted vLLM primary
model_list:
  # primary: self-hosted Qwen3.6-35B on internal vLLM endpoint
  - model_name: chat-default
    litellm_params:
      model: openai/qwen-3.6-35b
      api_base: https://vllm-internal.svc.cluster.local/v1
      api_key: os.environ/INTERNAL_API_KEY
      timeout: 8
  # frontier tier: premium API for the long-context / reasoning tail
  - model_name: chat-frontier
    litellm_params:
      model: anthropic/claude-opus-4-7
      api_key: os.environ/ANTHROPIC_API_KEY
      timeout: 30

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 1
  allowed_fails: 2
  cooldown_time: 30
  # self-host outage → spill to frontier automatically
  fallbacks:
    - chat-default: ["chat-frontier"]
  # any request beyond self-host context budget escapes to frontier
  context_window_fallbacks:
    - chat-default: ["chat-frontier"]
```

The two relevant moves are `context_window_fallbacks` (anything beyond your self-host context budget escapes to a premium model automatically) and the `cooldown_time` on self-host failures (so a vLLM outage doesn't take down the product — it just shifts traffic to the API tier until the cluster recovers).
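From the application side, the split is invisible: callers hit one model name and the router decides. A sketch using the OpenAI SDK against a LiteLLM proxy, where the proxy URL and key are deployment-specific placeholders.

```python
# Client view of the hybrid router: one model name, routing handled upstream.
from openai import OpenAI

# Proxy URL and key are placeholders for your deployment.
client = OpenAI(base_url="http://litellm-proxy.internal:4000/v1",
                api_key="sk-internal")

resp = client.chat.completions.create(
    model="chat-default",   # vLLM tier first; spills to chat-frontier on
                            # failure cooldown or context-window overflow
    messages=[{"role": "user", "content": "Summarize the incident report."}],
)
print(resp.choices[0].message.content)
```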
Worked Scenarios — Three Shapes We See Converge
Composite scenarios drawn from the deployments we've audited. Numbers are illustrative ranges, not single-case figures.
Shape A — API still wins (below $15-20K/mo)
Across stacks at sub-$15K/mo premium spend with a single team, single provider, and low traffic variance, self-host loses on every spreadsheet we've run. The headcount line alone is more than the entire premium bill. The right move is to invest in smart caching and cheap-first model routing — both can take 30-60% off the bill without touching infrastructure ownership.
Shape B — Hybrid wins (roughly $25-80K/mo)
The most common shape in 2026. Stacks at $25-80K/mo with predictable steady-state, two providers, and at least one workload class that fits an open-weight model cleanly. Self-hosting the 95-98% of qualifying traffic on Modal or Baseten and routing the tail to premium API typically takes 35-55% off the all-in monthly bill after the headcount line is paid. The savings are larger if a senior platform engineer already exists and can absorb the additional ops without a new hire.
Shape C — Pure self-host wins ($80K+/mo, regulated data, low spike ratio)
Stacks above $80K/mo with regulated data residency, a low spike-to-median ratio, and an existing platform/SRE team converge on bare-metal reserved H100s with a full self-host stack. Net savings versus continuing on premium API land in the 50-65% range over a 24-month window, with the bulk of the delta coming from committed-use GPU pricing and elimination of premium token costs. Our cloud vs on-premise AI cost analysis covers the regulated-imaging variant of this shape, and the EU AI Act data sovereignty constraint increasingly pushes EU-headquartered teams here regardless of the pure cost math.
Decision Matrix — Which Shape Are You?
| Signal | Shape A: stay on API | Shape B: hybrid | Shape C: pure self-host |
|---|---|---|---|
| Monthly premium spend | < $20K | $25-80K | > $80K |
| Traffic variance (P95/median) | Any | Predictable, 2-5x | Low, < 2x |
| Existing platform/SRE muscle | None or minimal | Some — can absorb 0.5 FTE | Mature — already on-call |
| Workload class | Frontier reasoning, agentic | Mostly OW-fit + tail | OW-fit dominant, regulated data |
| Compliance constraint | Vendor SOC 2 acceptable | Vendor + audit trail acceptable | Data residency / sovereignty mandatory |
| Time horizon | 6-12 mo | 12-24 mo | 24+ mo with reserved commits |

If you're scoring B on most rows but feel pulled toward C, the bias to resist is the founder/engineer instinct that "we can run our own GPUs." You probably can. You probably shouldn't, until the spend justifies the hire. If you're scoring A but feel pulled toward B, the bias to resist is "open weights are cheaper" — they're cheaper per token, not per delivered request, and the difference is the whole post.
The "Don't Self-Host" Checklist
Before committing, run the negative case. Self-host is the wrong answer if any of these are true:

- Monthly premium spend is under $15-20K and likely to stay there
- There is no existing platform/SRE muscle, and the plan assumes the stack runs itself
- Traffic is spiky and unpredictable, so reserved capacity sits idle between P95 events
- The workload is dominated by frontier reasoning or agentic traffic that open-weight models don't fit
- The planning horizon is under 12 months, shorter than any committed-use discount worth taking
Hit any two and the spreadsheet has already decided for you.
Practical Migration Path (Without a Day-One Infra Hire)
For teams in Shape B who don't yet have the platform muscle for full self-host, the phased rollout that minimizes risk:

1. Prove model fit off the critical path: run the qualifying workload class against a hosted open-weight endpoint (Together / Anyscale, per token) and eval it on your own prompt distribution.
2. Put the hybrid router in front of production, with the open-weight tier on a managed platform (Modal or Baseten) and automatic fallback to the premium API, sized for the median.
3. Move steady-state traffic to reserved capacity once sustained utilization justifies committed pricing.
4. Revisit bare metal and the dedicated platform hire only when spend and utilization hold at Shape C levels.
Each step is reversible. The mistake is to skip steps 1-2 and cut over directly because "the math works on paper." It does — until the first sev-1 outage at 3 a.m. burns six months of compute savings in a single weekend.
When the Math Actually Flips
The honest answer to "should we self-host?" in 2026 is: probably not yet, and definitely not on the napkin. The break-even number you can defend in front of finance is the one that includes idle GPU, tail traffic, headcount, eval, churn, and incident cost — and that number is closer to $50-80K/mo for pure self-host, or $25K/mo for the hybrid pattern that most production systems converge on anyway.
The teams that get this right share a habit: they re-run the TCO model every six months, against current vendor pricing and current open-weight throughput. The teams that get it wrong run the model once, build the slide deck, and treat the number as static. In a year where premium input pricing dropped 30-50% and open-weight throughput climbed sharply on the same hardware, static numbers age in weeks, not quarters.
If you're using this framework to bring a recommendation to your CFO, the part that matters is showing both columns of the worksheet — sticker math and TCO math — and being explicit about which line items are assumed away in each. That's the conversation our build vs buy decision framework walks through, and it's the one that holds up in front of finance whether the answer is API, hybrid, or self-host. For deeper architectural patterns across the AI for Business cluster, the gateway, attribution, and routing posts linked above are the rest of the toolkit.
Frequently Asked Questions
Where does the self-host vs API break-even actually sit in 2026?
For pure self-hosting on rented H100s, the honest break-even sits closer to $50-80K/mo of premium API spend once you include idle GPU utilization, P95 over-provisioning, and a senior infra hire. The widely-cited $20K/mo number works only if you assume 100% utilization, no on-call rotation, and a free MLOps engineer — none of which exist. The hybrid pattern (reserved capacity for steady-state, API for the spiky tail) drops the break-even to roughly $25K/mo, because it lets you size GPU capacity for the median rather than the 95th percentile and avoids the headcount cliff. Below $15K/mo, API economics still win on every realistic spreadsheet.