Most teams jump to caching and model routing before they can answer 'which tenant spent the $14K?' Fix attribution first: tag every call with tenant_id, user_id, feature, and environment at the SDK call site, run all traffic through a gateway (LiteLLM, Portkey, or Langfuse), and budget a week to handle the streaming-usage edge cases. The skew most multi-tenant SaaS teams find on day one is brutal: 5% of tenants driving 60% of token spend. Once you can see it, you can price for it, throttle it, or route it to a smaller model.
A SaaS founder messaged me last month asking how to cut his OpenAI bill. It had hit $94K the previous month, up from $31K in January, and his board wanted a number by Friday. I asked him the question I always ask first: which tenant is driving it? He didn't know. The OpenAI dashboard showed one API key — the one his backend used for everyone — and the bill was a single undifferentiated total. Three days of digging through application logs later, we discovered that one tenant on his $99/month plan was generating 41% of total token spend through a recursive document-summarization loop that nobody had noticed.
That story is so common it's almost boring. Every multi-tenant SaaS team I've worked with eventually arrives at the same realization: per-tenant LLM cost attribution is the prerequisite for everything else. You cannot decide whether to cache, route, downgrade, or rate-limit until you can answer "which tenant spent the money." And by default, your stack cannot answer that question, because the LLM providers see one API key and your application doesn't write the right things to the right place.
This post is the playbook I give clients when they ask me to fix this. It covers the metadata you have to tag, the gateway pattern that does the actual attribution, the streaming-usage bug that bites everyone, and a 20-line LiteLLM proxy config that most teams should start with on day one.
Why Cost-Reduction Posts Miss the Point
Most "cut your LLM costs" content — including some I've written, like our smart caching architecture walkthrough and our token cost optimization guide — assumes you already know where your spend is coming from. That assumption is wrong for almost every team I've audited. They have a total bill, a vague feeling that "the chatbot is expensive," and no way to draw a line from a single dollar to a single tenant or a single feature.
Without attribution you end up doing one of three unhelpful things:

- Blanket model downgrades that degrade quality for every tenant to rein in a few.
- Caching or rate limits tuned to average traffic, which miss the outlier workloads actually driving the bill.
- Across-the-board cost panic, where every new LLM feature gets frozen because nobody can say which existing one is the problem.
Attribution unlocks every other lever. Once you can say "Acme generates 22% of our spend on a $99 plan," you have three real options — model routing, an enterprise upsell, or a hard usage cap — and you can pick the right one with data. We covered the routing leg of this in detail elsewhere; routing without attribution just shuffles deck chairs.
Step 1: Tag Every Call at the Call Site
The non-negotiable foundation is metadata tagging. Every single LLM call your application makes has to carry, at minimum, a tenant_id, user_id, feature, and environment. Not "most calls." Not "the customer-facing ones." Every call, including the cron job that scores leads at 3 AM and the internal admin tool that nobody uses on weekends.
The right place to do this is the SDK call site, wrapped in a thin client your team is forced to use. With OpenAI's Python SDK and LiteLLM, you pass metadata as part of the call:
```python
from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": user_prompt}],
    metadata={
        "tenant_id": ctx.tenant.id,
        "user_id": ctx.user.id,
        "feature": "document_summarize",
        "environment": "prod",
        "request_id": ctx.request_id,
    },
)
```

LiteLLM forwards this metadata through to its spend tracker and to any downstream callback (Langfuse, Helicone, a custom webhook). Portkey implements the same idea via x-portkey-metadata headers, and OpenAI's native SDK lets you pass extra_headers that your gateway can intercept. The pattern is the same regardless of vendor: metadata at the call site, propagated as structured data, never as a comment in a system prompt.
Three rules I enforce on every project:
- One wrapper module. Put every call behind llm_client.py and ban direct imports of openai or anthropic via a lint rule. The wrapper requires tenant_id as a positional argument so you cannot forget it.
- Never trust hand-passed tenant IDs. If developers pass tenant_id manually, half of them will pass the wrong one. Read it from your auth middleware's request context and inject it inside the wrapper.
- Fail loudly, degrade gracefully. A missing tenant_id should crash a unit test, log an error in staging, and write to a tenant_id=unknown bucket in production so you can find the offender without breaking the request.

If your codebase is too large to retrofit cleanly, a dependency-aware refactor strategy gets you there without a rewrite.
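As a concrete sketch of the second and third rules, here is a minimal, hypothetical helper the wrapper could use to build the metadata dict. RequestContext is a stand-in for whatever your auth middleware attaches to a request; the names are illustrative, not a prescribed API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestContext:
    # Hypothetical stand-in for the object your auth middleware
    # attaches to each request.
    tenant_id: Optional[str] = None
    user_id: Optional[str] = None


def build_metadata(ctx: RequestContext, feature: str, environment: str = "prod") -> dict:
    """Build the metadata dict every LLM call must carry.

    A missing tenant_id is never fatal in production: it falls back to the
    'unknown' bucket so the spend is still recorded and the offender is
    findable later.
    """
    return {
        "tenant_id": ctx.tenant_id or "unknown",
        "user_id": ctx.user_id or "unknown",
        "feature": feature,
        "environment": environment,
    }
```

The wrapper then passes this dict straight through as the completion call's metadata, so a broken or absent auth context degrades to the unknown bucket instead of producing an unattributed call.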
Step 2: Route Everything Through a Gateway
Tagging without a gateway is a half-solution. You need one place where every LLM call lands, the metadata is recorded alongside provider usage data, and the resulting row gets written to a database you control. This is the role of an LLM gateway, and there are three credible options as of April 2026.
LiteLLM is what I deploy for most clients because it is a single binary that speaks the OpenAI API on the front, talks to ~100 providers on the back, and writes spend records to Postgres or any callback you wire up. Portkey is the right call when the team explicitly does not want to run any more infrastructure and wants a dashboard the day they sign up. Langfuse is the right call when observability is the bigger pain than cost — the trace-driven debugging flow is genuinely better than anything you'll build yourself, and Langfuse will reconcile per-user cost as a side effect of its tracing.
A common pattern that works well: LiteLLM in front for routing and spend tracking, Langfuse behind it for traces and quality monitoring. LiteLLM's success_callback writes to Langfuse automatically. You get attribution from LiteLLM's spend table, and you get full input/output traces from Langfuse, and you didn't have to build either.
The minimum LiteLLM proxy config that gets you per-tenant attribution looks like this:
```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  success_callback: ["langfuse"]
  drop_params: true

general_settings:
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true
  spend_logs_metadata_enabled: true
```

That's roughly 20 lines, and it gives you a Postgres-backed spend log keyed by API key, model, tenant_id, and any other metadata you tagged at the call site. From there it's straight SQL to build the rollups.
| Gateway | Best for | Hosting | Cost model |
|---|---|---|---|
| LiteLLM Proxy | Self-hosting, OpenAI-compatible interface, lots of providers | Self-hosted (Docker, Helm) | Free, OSS |
| Portkey | Hosted gateway, dashboards out of the box | Hosted or self-hosted | Per-request after free tier |
| Langfuse | Tracing-first observability, prompt management | Hosted or self-hosted | Free OSS, paid cloud tier |
Step 3: Fix the Streaming Usage Bug Before It Bites You
This is the part nobody warns you about until you're three months into production with a known-incorrect cost report. When a streaming LLM call's client disconnects before the final chunk, providers do not emit usage data, and most gateways silently lose the call. I've seen teams under-count spend by 8-15% from this single class of bug.
The two issues to know:

- By default, OpenAI-style streaming responses carry no usage block at all; unless you opt in, there is nothing for your gateway to record.
- When the client disconnects mid-stream, the final chunk that carries usage never arrives, and most gateways drop the call instead of estimating it.

The fixes are not exotic, but you have to do them deliberately:
- Pass stream_options={"include_usage": True} on OpenAI streaming calls. Without it there is no usage chunk to track in the first place.
- Keep your gateway current. LiteLLM addressed much of this in recent releases via litellm.disable_streaming_logging = False and improvements to the response post-processing path; verify your version handles it.
- Accumulate prompt_tokens and completion_tokens across chunks rather than overwriting on the last one. We've seen this matter most for self-hosted inference engines; our vLLM vs Ollama vs TensorRT comparison walks through where cancellation behavior diverges from the cloud providers.

The cost of skipping this: a finance team that stops trusting your numbers, a customer success team that under-charges enterprise tenants, and a quarterly board meeting where the bill suddenly "jumps" 12% because someone finally reconciled.
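A minimal sketch of the accumulate-don't-overwrite rule, assuming OpenAI-style chunks (plain dicts here for clarity) where a usage key, when present, carries cumulative token counts:

```python
def track_stream_usage(chunks) -> dict:
    """Fold usage out of a token stream so a client disconnect still leaves
    the best-known counts.

    Assumes each chunk is a dict whose optional 'usage' key holds cumulative
    token counts (OpenAI-style). Using max() keeps the largest value seen so
    far, so losing the final chunk leaves a lower bound instead of zero.
    """
    totals = {"prompt_tokens": 0, "completion_tokens": 0}
    for chunk in chunks:
        usage = chunk.get("usage") or {}
        totals["prompt_tokens"] = max(totals["prompt_tokens"],
                                      usage.get("prompt_tokens", 0))
        totals["completion_tokens"] = max(totals["completion_tokens"],
                                          usage.get("completion_tokens", 0))
    return totals
```

If your inference engine emits incremental rather than cumulative counts, swap the max() for a running sum; the point is the same either way: record usage as you see it, not only at the end.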
Step 4: Two Reports, Not One
Once spend is flowing into your database with metadata, the temptation is to build one giant dashboard. Don't. You need exactly two reports, and they have different audiences and different latencies.
Report A: Monthly billing rollup. This is the source of truth for finance, customer success, and any usage-based pricing. It's a daily-batched job that produces, per tenant, per month, per model: input tokens, output tokens, cache_creation_tokens, cache_read_tokens, reconciled USD, and a delta against the provider invoice. Stable, slow, append-only. We typically materialize this into a tenant_spend_monthly table in the same database the gateway writes to.
```sql
SELECT
  date_trunc('month', start_time) AS month,
  metadata->>'tenant_id' AS tenant_id,
  model,
  SUM(prompt_tokens) AS input_tokens,
  SUM(completion_tokens) AS output_tokens,
  SUM(spend) AS usd_spend
FROM spend_logs
WHERE start_time >= date_trunc('month', now()) - interval '12 months'
  AND metadata->>'environment' = 'prod'
GROUP BY 1, 2, 3
ORDER BY usd_spend DESC;
```

Report B: Operational anomaly view. This runs every few minutes and answers very different questions: which tenant's burn rate jumped above its 30-day baseline in the last hour, which feature is responsible for the top 10 most expensive calls today, which API keys are approaching their daily cap. This is a live dashboard or a Slack alert pipeline, not a finance artifact. We use Grafana on top of the same Postgres for the simple cases and a dedicated stream processor when call volume passes ~10M/day.
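The burn-rate check behind that hourly alert can start as a simple threshold against the tenant's own baseline. This toy version (names and thresholds are illustrative, not from the post) adds a floor so tiny tenants don't page anyone:

```python
def is_spend_anomalous(last_hour_usd: float,
                       baseline_hourly_usd: float,
                       multiplier: float = 3.0,
                       floor_usd: float = 1.0) -> bool:
    """Flag a tenant whose last-hour spend exceeds `multiplier` times their
    30-day hourly baseline.

    `floor_usd` suppresses alerts where triple a tiny baseline is still
    pocket change.
    """
    return last_hour_usd >= max(multiplier * baseline_hourly_usd, floor_usd)
```

Both inputs come from the same spend_logs table: last-hour spend from a windowed sum, the baseline from a 30-day average.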
The single most useful column on Report B is cost_per_request_p99 per feature. Average cost per request hides the disasters; p99 surfaces them. The one tenant doing recursive summarization in our opening story would have shown up on day one as a p99 outlier on the document_summarize feature.
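Against the same spend_logs table, a Postgres sketch of that per-feature p99 over the last day (column names follow the LiteLLM spend schema used earlier; treat it as a starting point, not a drop-in):

```sql
SELECT
  metadata->>'feature' AS feature,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY spend) AS cost_per_request_p99,
  AVG(spend) AS cost_per_request_avg,
  COUNT(*) AS requests
FROM spend_logs
WHERE start_time >= now() - interval '24 hours'
  AND metadata->>'environment' = 'prod'
GROUP BY 1
ORDER BY cost_per_request_p99 DESC;
```

Putting the average next to the p99 makes the gap visible at a glance: a feature whose p99 is 50x its average is hiding a runaway call pattern.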
Step 5: Do Something With the Numbers
Attribution is the input to a decision, not the decision itself. The patterns I see work, in rough order of how often I recommend them:
- Model routing. Send high-volume, low-stakes features to a cheaper model; the decision is only safe once you know which tenant-feature pairs dominate spend.
- Hard usage caps. Tie a daily token or dollar budget to the plan tier, enforced at the gateway, so a $99/month tenant cannot quietly burn enterprise-sized token volumes.
- Enterprise upsell. A tenant generating a fifth of your spend on a starter plan is not a cost problem; it's a mispriced contract and a sales conversation.
- Feature redesign or sunset. If document_summarize accounts for 70% of cost and produces 5% of measured user value, you have permission to redesign or sunset it. Without per-feature attribution, that argument always loses to "but users like it."

A pattern we've stopped recommending: per-tenant API keys. The operational cost of rotating, monitoring, and rate-limiting hundreds of provider keys outweighs the visibility benefit, and metadata tagging gives you the same answer with less rope.
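To make the cap-and-route levers concrete, here is a toy policy function; the model names and thresholds are illustrative assumptions, not recommendations from this post:

```python
def choose_model(spend_today_usd: float, plan_cap_usd: float) -> str:
    """Pick a model per call from the tenant's running daily spend.

    Past the cap the call is refused outright; past half the cap the tenant
    is quietly downgraded to a cheaper model. A real system would queue or
    degrade rather than raise, and would read both numbers from the spend
    table the gateway writes.
    """
    if spend_today_usd >= plan_cap_usd:
        raise RuntimeError("daily LLM budget exhausted for tenant")
    if spend_today_usd >= 0.5 * plan_cap_usd:
        return "gpt-4o-mini"  # cheaper model for heavy tenants
    return "gpt-4o"
```

The point of the sketch is the input, not the thresholds: none of these branches are possible until per-tenant daily spend is queryable in the first place.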
A Realistic Rollout for an Existing SaaS
If you're inheriting an LLM-powered SaaS that has none of this in place, the order of operations that we've seen work in two-to-three week sprints:
- Week 1: Stand up the gateway. Deploy LiteLLM (or sign up for Portkey) in passthrough mode in front of your existing provider key. No application changes yet; per-key spend logging starts immediately.
- Week 2: Ship the wrapper. One module that requires tenant_id, user_id, feature as arguments. Lint-ban direct imports. Let CI fail until every call is migrated.
- Week 3: Fix streaming usage and build the two reports: enable include_usage, verify disconnect handling, materialize the monthly rollup, wire the anomaly alerts.
- Week 4: Act on the numbers: caps for the abusers, routing for the low-stakes features, upsell conversations for the tenants the data exposes.

That's four weeks from "we have no idea who's spending the money" to "we charge enterprise tenants accurately and our bill is 30-40% lower." Most of the work is plumbing, not LLM cleverness. At Particula Tech we've now done this rollout for enough multi-tenant SaaS clients that we treat it as a templated engagement rather than custom work.
What to Build First
If you read nothing else and want to ship something today: stand up LiteLLM in front of your existing provider, force every call through your wrapper with tenant_id and feature metadata, and run the SQL query in Section 4 against the spend_logs table at the end of the week. The first time you see the by-tenant rollup, you will find at least one number that surprises you. That number is the cheapest piece of cost intelligence you'll buy this year, and it costs about 20 lines of YAML and a wrapper module to obtain.
Cost reduction comes second. Attribution comes first. Stop optimizing in the dark.
Frequently Asked Questions
Why can't the OpenAI or Anthropic dashboard tell me which tenant is spending the money?

The OpenAI and Anthropic dashboards group spend by API key, not by tenant. If your SaaS uses one shared key for all tenants — which most do, because per-tenant keys mean per-tenant rate limits and per-tenant rotation — the dashboard can't tell you anything about who consumed what. You either issue one key per tenant (operational nightmare at any scale) or you build attribution at the application layer with metadata tagging. Every serious multi-tenant SaaS we've worked with ends up at the second option.



