Most teams jump to caching and model routing before they can answer 'which tenant spent the $14K?' Fix attribution first: tag every call with tenant_id, user_id, feature, and environment at the SDK call site, run all traffic through a gateway (LiteLLM, Portkey, or Langfuse), and budget a week to handle the streaming-usage edge cases. The skew most multi-tenant SaaS teams find on day one is brutal: 5% of tenants driving 60% of token spend. Once you can see it, you can price for it, throttle it, or route it to a smaller model.
A SaaS founder messaged me last month asking how to cut his OpenAI bill. It had hit $94K the previous month, up from $31K in January, and his board wanted a number by Friday. I asked him the question I always ask first: which tenant is driving it? He didn't know. The OpenAI dashboard showed one API key — the one his backend used for everyone — and the bill was a single undifferentiated total. Three days of digging through application logs later, we discovered that one tenant on his $99/month plan was generating 41% of total token spend through a recursive document-summarization loop that nobody had noticed.
That story is so common it's almost boring. Every multi-tenant SaaS team I've worked with eventually arrives at the same realization: per-tenant LLM cost attribution is the prerequisite for everything else. You cannot decide whether to cache, route, downgrade, or rate-limit until you can answer "which tenant spent the money." And by default, your stack cannot answer that question, because the LLM providers see one API key and your application doesn't write the right things to the right place.
This post is the playbook I give clients when they ask me to fix this. It covers the metadata you have to tag, the gateway pattern that does the actual attribution, the streaming-usage bug that bites everyone, and a 20-line LiteLLM proxy config that most teams should start with on day one.
Why Cost-Reduction Posts Miss the Point
Most "cut your LLM costs" content — including some I've written, like our smart caching architecture walkthrough and our token cost optimization guide — assumes you already know where your spend is coming from. That assumption is wrong for almost every team I've audited. They have a total bill, a vague feeling that "the chatbot is expensive," and no way to draw a line from a single dollar to a single tenant or a single feature.
Without attribution you end up doing one of three unhelpful things:

- Blanket model downgrades that degrade quality for every tenant to rein in a few.
- Caching or rate limits tuned to average traffic, which miss the outlier workloads actually driving the bill.
- Across-the-board cost panic, where every new LLM feature gets frozen because nobody can say which existing one is the problem.
Attribution unlocks every other lever. Once you can say "Acme generates 22% of our spend on a $99 plan," you have three real options — model routing, an enterprise upsell, or a hard usage cap — and you can pick the right one with data. We covered the routing leg of this in detail elsewhere; routing without attribution just shuffles deck chairs.
Step 1: Tag Every Call at the Call Site
The non-negotiable foundation is metadata tagging. Every single LLM call your application makes has to carry, at minimum, a tenant_id, user_id, feature, and environment. Not "most calls." Not "the customer-facing ones." Every call, including the cron job that scores leads at 3 AM and the internal admin tool that nobody uses on weekends.
The right place to do this is the SDK call site, wrapped in a thin client your team is forced to use. With OpenAI's Python SDK and LiteLLM, you pass metadata as part of the call:
```python
from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": user_prompt}],
    metadata={
        "tenant_id": ctx.tenant.id,
        "user_id": ctx.user.id,
        "feature": "document_summarize",
        "environment": "prod",
        "request_id": ctx.request_id,
    },
)
```

LiteLLM forwards this metadata through to its spend tracker and to any downstream callback (Langfuse, Helicone, a custom webhook). Portkey implements the same idea via x-portkey-metadata headers, and OpenAI's native SDK lets you pass extra_headers that your gateway can intercept. The pattern is the same regardless of vendor: metadata at the call site, propagated as structured data, never as a comment in a system prompt.
Three rules I enforce on every project:
- One wrapper module. Put every call behind llm_client.py and ban direct imports of openai or anthropic via a lint rule. The wrapper requires tenant_id as a positional argument so you cannot forget it.
- Never trust hand-passed tenant IDs. If developers pass tenant_id manually, half of them will pass the wrong one. Read it from your auth middleware's request context and inject it inside the wrapper.
- Fail loudly, degrade gracefully. A missing tenant_id should crash a unit test, log an error in staging, and write to a tenant_id=unknown bucket in production so you can find the offender without breaking the request.

If your codebase is too large to retrofit cleanly, a dependency-aware refactor strategy gets you there without a rewrite.
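As a concrete sketch of the second and third rules, here is a minimal, hypothetical helper the wrapper could use to build the metadata dict. RequestContext is a stand-in for whatever your auth middleware attaches to a request; the names are illustrative, not a prescribed API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestContext:
    # Hypothetical stand-in for the object your auth middleware
    # attaches to each request.
    tenant_id: Optional[str] = None
    user_id: Optional[str] = None


def build_metadata(ctx: RequestContext, feature: str, environment: str = "prod") -> dict:
    """Build the metadata dict every LLM call must carry.

    A missing tenant_id is never fatal in production: it falls back to the
    'unknown' bucket so the spend is still recorded and the offender is
    findable later.
    """
    return {
        "tenant_id": ctx.tenant_id or "unknown",
        "user_id": ctx.user_id or "unknown",
        "feature": feature,
        "environment": environment,
    }
```

The wrapper then passes this dict straight through as the completion call's metadata, so a broken or absent auth context degrades to the unknown bucket instead of producing an unattributed call.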
Step 2: Route Everything Through a Gateway
Tagging without a gateway is a half-solution. You need one place where every LLM call lands, the metadata is recorded alongside provider usage data, and the resulting row gets written to a database you control. This is the role of an LLM gateway, and there are three credible options as of April 2026.
LiteLLM is what I deploy for most clients because it is a single binary that speaks the OpenAI API on the front, talks to ~100 providers on the back, and writes spend records to Postgres or any callback you wire up. Portkey is the right call when the team explicitly does not want to run any more infrastructure and wants a dashboard the day they sign up. Langfuse is the right call when observability is the bigger pain than cost — the trace-driven debugging flow is genuinely better than anything you'll build yourself, and Langfuse will reconcile per-user cost as a side effect of its tracing.
A common pattern that works well: LiteLLM in front for routing and spend tracking, Langfuse behind it for traces and quality monitoring. LiteLLM's success_callback writes to Langfuse automatically. You get attribution from LiteLLM's spend table, and you get full input/output traces from Langfuse, and you didn't have to build either.
The minimum LiteLLM proxy config that gets you per-tenant attribution looks like this:
```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  success_callback: ["langfuse"]
  drop_params: true

general_settings:
  database_url: os.environ/DATABASE_URL
  store_model_in_db: true
  spend_logs_metadata_enabled: true
```

That's roughly 20 lines, and it gives you a Postgres-backed spend log keyed by API key, model, tenant_id, and any other metadata you tagged at the call site. From there it's straight SQL to build the rollups.
| Gateway | Best for | Hosting | Cost model |
|---|---|---|---|
| LiteLLM Proxy | Self-hosting, OpenAI-compatible interface, lots of providers | Self-hosted (Docker, Helm) | Free, OSS |
| Portkey | Hosted gateway, dashboards out of the box | Hosted or self-hosted | Per-request after free tier |
| Langfuse | Tracing-first observability, prompt management | Hosted or self-hosted | Free OSS, paid cloud tier |
Step 3: Fix the Streaming Usage Bug Before It Bites You
This is the part nobody warns you about until you're three months into production with a known-incorrect cost report. When a streaming LLM call's client disconnects before the final chunk, providers do not emit usage data, and most gateways silently lose the call. I've seen teams under-count spend by 8-15% from this single class of bug.
The two issues to know:

- By default, OpenAI-style streaming responses carry no usage block at all; unless you opt in, there is nothing for your gateway to record.
- When the client disconnects mid-stream, the final chunk that carries usage never arrives, and most gateways drop the call instead of estimating it.

The fixes are not exotic, but you have to do them deliberately:
- Pass stream_options={"include_usage": True} on OpenAI streaming calls. Without it there is no usage chunk to track in the first place.
- Keep your gateway current. LiteLLM addressed much of this in recent releases via litellm.disable_streaming_logging = False and improvements to the response post-processing path; verify your version handles it.
- Accumulate prompt_tokens and completion_tokens across chunks rather than overwriting on the last one. We've seen this matter most for self-hosted inference engines; our vLLM vs Ollama vs TensorRT comparison walks through where cancellation behavior diverges from the cloud providers.

The cost of skipping this: a finance team that stops trusting your numbers, a customer success team that under-charges enterprise tenants, and a quarterly board meeting where the bill suddenly "jumps" 12% because someone finally reconciled.
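A minimal sketch of the accumulate-don't-overwrite rule, assuming OpenAI-style chunks (plain dicts here for clarity) where a usage key, when present, carries cumulative token counts:

```python
def track_stream_usage(chunks) -> dict:
    """Fold usage out of a token stream so a client disconnect still leaves
    the best-known counts.

    Assumes each chunk is a dict whose optional 'usage' key holds cumulative
    token counts (OpenAI-style). Using max() keeps the largest value seen so
    far, so losing the final chunk leaves a lower bound instead of zero.
    """
    totals = {"prompt_tokens": 0, "completion_tokens": 0}
    for chunk in chunks:
        usage = chunk.get("usage") or {}
        totals["prompt_tokens"] = max(totals["prompt_tokens"],
                                      usage.get("prompt_tokens", 0))
        totals["completion_tokens"] = max(totals["completion_tokens"],
                                          usage.get("completion_tokens", 0))
    return totals
```

If your inference engine emits incremental rather than cumulative counts, swap the max() for a running sum; the point is the same either way: record usage as you see it, not only at the end.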
Step 4: Two Reports, Not One
Once spend is flowing into your database with metadata, the temptation is to build one giant dashboard. Don't. You need exactly two reports, and they have different audiences and different latencies.
Report A: Monthly billing rollup. This is the source of truth for finance, customer success, and any usage-based pricing. It's a daily-batched job that produces, per tenant, per month, per model: input tokens, output tokens, cache_creation_tokens, cache_read_tokens, reconciled USD, and a delta against the provider invoice. Stable, slow, append-only. We typically materialize this into a tenant_spend_monthly table in the same database the gateway writes to.
```sql
SELECT
  date_trunc('month', start_time) AS month,
  metadata->>'tenant_id' AS tenant_id,
  model,
  SUM(prompt_tokens) AS input_tokens,
  SUM(completion_tokens) AS output_tokens,
  SUM(spend) AS usd_spend
FROM spend_logs
WHERE start_time >= date_trunc('month', now()) - interval '12 months'
  AND metadata->>'environment' = 'prod'
GROUP BY 1, 2, 3
ORDER BY usd_spend DESC;
```

Report B: Operational anomaly view. This runs every few minutes and answers very different questions: which tenant's burn rate jumped above its 30-day baseline in the last hour, which feature is responsible for the top 10 most expensive calls today, which API keys are approaching their daily cap. This is a live dashboard or a Slack alert pipeline, not a finance artifact. We use Grafana on top of the same Postgres for the simple cases and a dedicated stream processor when call volume passes ~10M/day.
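The burn-rate check behind that hourly alert can start as a simple threshold against the tenant's own baseline. This toy version (names and thresholds are illustrative, not from the post) adds a floor so tiny tenants don't page anyone:

```python
def is_spend_anomalous(last_hour_usd: float,
                       baseline_hourly_usd: float,
                       multiplier: float = 3.0,
                       floor_usd: float = 1.0) -> bool:
    """Flag a tenant whose last-hour spend exceeds `multiplier` times their
    30-day hourly baseline.

    `floor_usd` suppresses alerts where triple a tiny baseline is still
    pocket change.
    """
    return last_hour_usd >= max(multiplier * baseline_hourly_usd, floor_usd)
```

Both inputs come from the same spend_logs table: last-hour spend from a windowed sum, the baseline from a 30-day average.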
The single most useful column on Report B is cost_per_request_p99 per feature. Average cost per request hides the disasters; p99 surfaces them. The one tenant doing recursive summarization in our opening story would have shown up on day one as a p99 outlier on the document_summarize feature.
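Against the same spend_logs table, a Postgres sketch of that per-feature p99 over the last day (column names follow the LiteLLM spend schema used earlier; treat it as a starting point, not a drop-in):

```sql
SELECT
  metadata->>'feature' AS feature,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY spend) AS cost_per_request_p99,
  AVG(spend) AS cost_per_request_avg,
  COUNT(*) AS requests
FROM spend_logs
WHERE start_time >= now() - interval '24 hours'
  AND metadata->>'environment' = 'prod'
GROUP BY 1
ORDER BY cost_per_request_p99 DESC;
```

Putting the average next to the p99 makes the gap visible at a glance: a feature whose p99 is 50x its average is hiding a runaway call pattern.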
Step 5: Do Something With the Numbers
Attribution is the input to a decision, not the decision itself. The patterns I see work, in rough order of how often I recommend them:
- Model routing. Send high-volume, low-stakes features to a cheaper model; the decision is only safe once you know which tenant-feature pairs dominate spend.
- Hard usage caps. Tie a daily token or dollar budget to the plan tier, enforced at the gateway, so a $99/month tenant cannot quietly burn enterprise-sized token volumes.
- Enterprise upsell. A tenant generating a fifth of your spend on a starter plan is not a cost problem; it's a mispriced contract and a sales conversation.
- Feature redesign or sunset. If document_summarize accounts for 70% of cost and produces 5% of measured user value, you have permission to redesign or sunset it. Without per-feature attribution, that argument always loses to "but users like it."

A pattern we've stopped recommending: per-tenant API keys. The operational cost of rotating, monitoring, and rate-limiting hundreds of provider keys outweighs the visibility benefit, and metadata tagging gives you the same answer with less rope.
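To make the cap-and-route levers concrete, here is a toy policy function; the model names and thresholds are illustrative assumptions, not recommendations from this post:

```python
def choose_model(spend_today_usd: float, plan_cap_usd: float) -> str:
    """Pick a model per call from the tenant's running daily spend.

    Past the cap the call is refused outright; past half the cap the tenant
    is quietly downgraded to a cheaper model. A real system would queue or
    degrade rather than raise, and would read both numbers from the spend
    table the gateway writes.
    """
    if spend_today_usd >= plan_cap_usd:
        raise RuntimeError("daily LLM budget exhausted for tenant")
    if spend_today_usd >= 0.5 * plan_cap_usd:
        return "gpt-4o-mini"  # cheaper model for heavy tenants
    return "gpt-4o"
```

The point of the sketch is the input, not the thresholds: none of these branches are possible until per-tenant daily spend is queryable in the first place.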
A Realistic Rollout for an Existing SaaS
If you're inheriting an LLM-powered SaaS that has none of this in place, the order of operations that we've seen work in two-to-three week sprints:
- Week 1: Stand up the gateway. Deploy LiteLLM (or sign up for Portkey) in passthrough mode in front of your existing provider key. No application changes yet; per-key spend logging starts immediately.
- Week 2: Ship the wrapper. One module that requires tenant_id, user_id, feature as arguments. Lint-ban direct imports. Let CI fail until every call is migrated.
- Week 3: Fix streaming usage and build the two reports: enable include_usage, verify disconnect handling, materialize the monthly rollup, wire the anomaly alerts.
- Week 4: Act on the numbers: caps for the abusers, routing for the low-stakes features, upsell conversations for the tenants the data exposes.

That's four weeks from "we have no idea who's spending the money" to "we charge enterprise tenants accurately and our bill is 30-40% lower." Most of the work is plumbing, not LLM cleverness. At Particula Tech we've now done this rollout for enough multi-tenant SaaS clients that we treat it as a templated engagement rather than custom work.
What to Build First
If you read nothing else and want to ship something today: stand up LiteLLM in front of your existing provider, force every call through your wrapper with tenant_id and feature metadata, and run the SQL query in Section 4 against the spend_logs table at the end of the week. The first time you see the by-tenant rollup, you will find at least one number that surprises you. That number is the cheapest piece of cost intelligence you'll buy this year, and it costs about 20 lines of YAML and a wrapper module to obtain.
Cost reduction comes second. Attribution comes first. Stop optimizing in the dark.
Frequently Asked Questions
Why can't the OpenAI or Anthropic dashboard tell me which tenant is spending the money?

The OpenAI and Anthropic dashboards group spend by API key, not by tenant. If your SaaS uses one shared key for all tenants — which most do, because per-tenant keys mean per-tenant rate limits and per-tenant rotation — the dashboard can't tell you anything about who consumed what. You either issue one key per tenant (operational nightmare at any scale) or you build attribution at the application layer with metadata tagging. Every serious multi-tenant SaaS we've worked with ends up at the second option.



