Helicone, Langfuse, and LangSmith are not three flavors of the same product — they are three different architectural bets. Helicone is a proxy that logs requests at the wire (5-minute integration, OpenAI-compatible, weakest agent visibility). Langfuse is a span-tracing platform with first-class OTel GenAI support, an Apache-2.0 self-host, and the deepest eval primitives in open source. LangSmith is framework-native to LangChain and LangGraph — best-in-class agent trajectory views, weakest story for stack-agnostic teams. Below ~$30K monthly LLM spend with simple chains, Helicone is honest engineering. Between $30K and $200K with multi-step agents, Langfuse is the default — managed if speed matters, self-hosted if compliance does. Above $200K with a deep LangGraph investment, LangSmith earns its premium. The wrong move is picking by GitHub stars or 'we already use LangChain' — pick by what your traces look like and where governance will land in 12 months.
A Series C SaaS client called us last quarter because their LLM bill had quietly tripled over six weeks and nobody could explain why. They had 47 microservices calling four LLM providers through a mix of LangChain, the OpenAI SDK, and a hand-rolled retry wrapper. Their observability stack was Datadog APM logging request counts and a CSV someone exported from the OpenAI dashboard every Friday. The diagnosis took eleven days. The fix took an afternoon. All eleven days went to a single question their stack could not answer: which calls are the expensive ones, and what prompted them?
That's what LLM observability solves, and it's why the decision between Helicone, Langfuse, and LangSmith matters more than the marketing pages make clear. These three are not interchangeable. They represent three different architectural bets — proxy logging, span tracing, and framework-native instrumentation — and picking the wrong one for your stack costs you the same eleven days of lost engineering time that client paid. This post is the decision framework we use with consulting clients to scope LLM observability in 2026: when each one is right, what each one's ceiling actually is, and how to migrate between them without losing the historical traces that took you a year to accumulate.
We'll walk through the three architectural bets, the self-host vs. managed math at 10M / 100M / 1B traces, the OTel and eval feature parity, the decision matrix, and the migration cost between tiers. Each platform has a workload it's right for and one it isn't. The mistake we see most often is picking by GitHub stars or by "we already use LangChain" — neither of which survives contact with a real production trace volume.
The Three Architectural Bets
Before pricing, before features, before any vendor pitch, understand what you're buying. The three platforms are built on fundamentally different mental models, and the model determines what you can see in production.
Helicone is a proxy. You change https://api.openai.com/v1 to https://oai.helicone.ai/v1 and every request flows through Helicone's infrastructure. Latency overhead in 2026 sits at roughly 10-30ms regional, which is fine for batch and most chat workloads and a problem for real-time autocomplete. Integration takes five minutes and the data you get is the wire data: prompt, completion, model, latency, cost, custom headers for tagging. What you don't get is the structure inside an agent — if your "single LLM call" actually fans out to a planner, a retriever, three tool calls, and a reranker, the proxy sees five flat requests with no parent-child relationship.
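In practice the whole integration is a base URL swap plus an auth header. A minimal sketch with the OpenAI Python SDK; the `Helicone-Property-*` header is their custom-property tagging mechanism, and the property name here is illustrative:

```python
import os
from openai import OpenAI

# Route every request through Helicone's proxy instead of hitting OpenAI directly.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",  # was https://api.openai.com/v1
    api_key=os.environ["OPENAI_API_KEY"],
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        # Custom headers are the only tagging a proxy can give you.
        "Helicone-Property-Feature": "support-bot",  # illustrative property name
    },
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize this ticket."}],
)
```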
Langfuse is a span-tracing platform. You instrument with the SDK (Python or TypeScript) or, increasingly often in 2026, with an OpenTelemetry GenAI exporter. Each LLM call, retrieval, tool invocation, and reranker hop becomes a span with a parent-child relationship. The trace UI shows the full tree, with token counts, cost, and latency at each level. Integration takes 30 minutes to two hours depending on stack complexity, and the data you get scales with how much you instrument. The ceiling is high. We've seen clients run 200-step agentic traces and inspect them like a flame graph.
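What that instrumentation looks like in practice: a minimal sketch with the Langfuse Python SDK's `observe` decorator (import path per the v3 SDK; v2 had it under `langfuse.decorators`, so verify against your version):

```python
from langfuse import observe  # v3 path; v2: from langfuse.decorators import observe

@observe()  # each decorated call becomes a span in the trace tree
def retrieve(query: str) -> list[str]:
    return ["doc-1", "doc-2"]  # stand-in for your retriever

@observe(as_type="generation")  # marked as an LLM generation, with token/cost fields
def answer(query: str) -> str:
    docs = retrieve(query)  # nested call, so parent-child linkage is automatic
    return f"answer grounded in {docs}"  # stand-in for the model call

answer("why did the bill triple?")  # one trace, with retrieve as a child span
```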
LangSmith is framework-native to LangChain and LangGraph. Set LANGSMITH_API_KEY and LANGSMITH_TRACING=true, and every LangChain runnable, LangGraph node, and tool call is automatically traced with rich metadata. Outside LangChain, you fall back to manual run logging that works but loses the killer features. The integration is the closest thing to free in this category — assuming you've already committed to LangChain and LangGraph as the orchestration layer. If you haven't, the value drops to "decent trace viewer with average pricing."
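The two modes side by side: automatic tracing inside the framework, manual run logging outside it. A sketch using the `langsmith` Python SDK's `traceable` decorator:

```python
# Inside LangChain/LangGraph, these two environment variables are the
# entire integration; every runnable and graph node is traced automatically:
#   export LANGSMITH_TRACING=true
#   export LANGSMITH_API_KEY=<key>

# Outside the framework, you fall back to manual run logging:
from langsmith import traceable

@traceable(run_type="llm")  # logged as a run, but with no LangGraph state to replay
def call_model(prompt: str) -> str:
    return "completion"  # stand-in for your provider call
```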
These three models are not converging. Helicone is doubling down on proxy speed and prompt management. Langfuse is doubling down on OTel GenAI and self-host scale. LangSmith is doubling down on the LangGraph debugger and prompt-engineering surfaces inside the LangChain ecosystem. Pick the model that matches your traces, not the one with the prettiest landing page.
Pricing at 10M, 100M, and 1B Traces per Month
The pricing pages on all three vendors have shifted at least three times in the last twelve months, so confirm before you budget. The numbers below are the ranges we see in real client engagements as of Q2 2026.
One pattern holds across volume tiers and dominates the rest: the economics invert around 100M traces per month. Below that, managed wins. Above it, self-host with serious infra discipline wins. The middle band is where teams thrash, and the thrashing is almost always cheaper to resolve by overpaying for managed for another six months than by rushing into a self-host migration.
| Volume | Helicone (managed) | Langfuse Cloud | LangSmith (managed) | Self-host (Langfuse) |
|---|---|---|---|---|
| 10M / mo | $80-150 | $100-200 (Pro) | $200-400 (Plus) | ~$200 infra |
| 100M / mo | $1.5K-3K (Ent) | $2K-4K (Team) | $4K-8K (Ent) | $400-800 infra + 0.1 FTE |
| 1B / mo | Custom enterprise | Custom enterprise | Custom enterprise | $800-1.5K infra + 0.25 FTE |
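To make the inversion concrete, here's the back-of-envelope math we run with clients, using the midpoints of the table's ranges. The loaded FTE cost is an assumption, not a benchmark; swap in your own number:

```python
# Monthly cost of self-hosted Langfuse: infra plus the fraction of an
# engineer it takes to keep ClickHouse and the ingest path healthy.
LOADED_FTE_MONTHLY = 180_000 / 12  # assumed $180K/yr fully loaded

def self_host_monthly(infra: float, fte_fraction: float) -> float:
    return infra + fte_fraction * LOADED_FTE_MONTHLY

# 100M traces/mo: managed Langfuse Team runs roughly $2K-4K.
print(self_host_monthly(infra=600, fte_fraction=0.1))     # ~$2,100: roughly break-even
# 1B traces/mo: managed is custom enterprise, typically well above this.
print(self_host_monthly(infra=1_150, fte_fraction=0.25))  # ~$4,900
```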
Self-Host vs. Managed: What Compliance Teams Actually Accept
The self-host conversation is rarely about cost. It's almost always about where the prompt content physically lives, who has access to the database, and whether the deployment posture matches the rest of the regulated stack.
Langfuse is the only credible self-host option of the three. Apache-2.0 license, single Docker Compose for dev, Helm charts for Kubernetes, and ClickHouse as the trace store. We've shipped Langfuse self-host inside SOC 2, HIPAA, and EU AI Act environments without architectural friction. The community runs it at billions of traces per month. The enterprise tier (paid) adds SSO, RBAC, and SLA support; the community tier covers everything else.
Helicone offers self-host on the higher-tier paid license. It works, but the deployment story is less mature than Langfuse and the community is smaller. We recommend it primarily when a client is already on Helicone managed and needs to move to self-host for compliance — not as the first choice for a self-host greenfield.
LangSmith self-host requires an enterprise contract. It is rare in the wild. Every regulated client we've worked with that started on LangSmith managed has either stayed managed (after legal reviewed the data residency story) or migrated off entirely. The middle path is uncommon.
What compliance teams actually accept varies. Most SOC 2 Type II audits clear managed Langfuse Cloud and managed LangSmith without serious friction if the data processing addendum is in place. HIPAA tends to push toward self-host or BAA-covered managed, and the BAA story is strongest with LangSmith Enterprise (because LangChain has been engineering for regulated AI for longer) and weakest with Helicone (because the proxy architecture creates a data path that compliance reviewers ask hard questions about). EU AI Act compliance is the wildcard — by mid-2026, EU clients are asking explicit residency questions, and self-hosted Langfuse in an EU region answers them cleanly. The other two require vendor commitments that vary contract to contract.
Where Each One Breaks: OTel, Agent Trajectories, Eval Integration
Feature parity is misleading on landing pages. What matters is where each platform's depth runs out, because that's where you'll be debugging at 2 AM.
OpenTelemetry GenAI semantic conventions. Langfuse leads, comfortably. The OTel GenAI working group convention has stabilized in 2026, and Langfuse renders LLM-specific attributes — token counts, cache hits, model versions, tool calls, prompt cache reads — natively. Helicone added OTel ingest as a secondary path in 2025 but the proxy remains the recommended integration; OTel-via-Helicone has rough edges. LangSmith ingests OTel but flattens to the LangChain run model, which means non-LangChain spans lose fidelity. If you're standardizing on OTel for everything across your stack, Langfuse is the only one that doesn't fight you.
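What "native" means concretely: the spans you emit carry the working group's `gen_ai.*` attributes, and Langfuse renders them without translation. A sketch with the OpenTelemetry Python API; attribute names follow the GenAI semantic conventions, but check the spec version you've pinned, since names shifted while the convention stabilized:

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-service")

# Emit a span carrying GenAI semantic-convention attributes. Exported over
# OTLP to a backend that understands them, these render as first-class
# LLM fields rather than opaque key-value pairs.
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.usage.input_tokens", 812)
    span.set_attribute("gen_ai.usage.output_tokens", 164)
```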
Agent trajectory views. LangSmith leads if you're on LangGraph — the trajectory replay, state diff visualization, and node-by-node debugger are the best in the market for that specific stack. Langfuse is a strong second with generic span trees that work for any orchestrator. Helicone is weakest here because the proxy can't see the parent-child structure of an agent graph; you get flat request lists with custom-header tagging.
Eval integration. Langfuse has the most complete open-source eval primitives — datasets, experiments, scoring, LLM-as-judge runs, and A/B comparisons all native. LangSmith has the polished eval workflow if you're paying for it, with prompt playground and dataset management closely tied to traces. Helicone has eval features but they're newer and less battle-tested. For the eval workflow specifically, our evals-driven development guide and LLM-as-judge regression testing patterns walk through the discipline that makes any of these eval surfaces actually useful in CI/CD.
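The scoring primitive underneath all of this is small. A sketch of attaching an LLM-as-judge score to an existing trace, per the v2 Langfuse Python SDK (v3 renamed some of these methods, so check your version):

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* environment variables

# Attach a judge score to a trace so it surfaces in datasets,
# experiments, and A/B comparison views.
langfuse.score(
    trace_id="abc-123",   # the trace your judge model just evaluated
    name="answer_relevance",
    value=0.82,           # judge output, normalized to 0-1
    comment="LLM-as-judge, rubric v3",
)
```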
Production monitoring depth. This is where most teams get caught. Tracing is necessary but not sufficient — you also need quality drift detection, cost anomaly alerts, and hallucination rate trending. Langfuse and LangSmith both ship some of this; Helicone ships less. The piece all three lack is the cross-trace aggregation that flags "answer quality dropped 8% on Tuesday across all support queries." For that, you wire your own dashboards or layer Phoenix or a custom aggregator on top. Our AI production monitoring playbook covers the metrics worth tracking regardless of which platform you pick.
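To show the shape of the gap, here's a minimal sketch of the cross-trace aggregation you end up writing yourself, assuming you've already exported per-day judge scores from whichever platform you run:

```python
from statistics import mean, stdev

def flag_quality_drops(daily_scores: dict[str, list[float]], z: float = 2.0) -> list[str]:
    """Flag days whose mean judge score falls more than z standard
    deviations below the trailing week: the 'quality dropped 8% on
    Tuesday' signal none of the three platforms surfaces natively."""
    days = sorted(daily_scores)
    means = [mean(daily_scores[d]) for d in days]
    flagged = []
    for i in range(7, len(days)):          # needs a week of history
        window = means[i - 7:i]
        spread = stdev(window)
        if spread and means[i] < mean(window) - z * spread:
            flagged.append(days[i])
    return flagged
```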
Decision Matrix: Stack-Agnostic vs. LangChain-Locked, Eval-First vs. Cost-First
The vendor websites push you toward feature comparisons. The actual decision is two axes: how stack-agnostic are you, and what's your dominant pain — cost or quality.
A few patterns worth flagging. "We already use LangChain" is not a sufficient reason to pick LangSmith. It's a sufficient reason to evaluate it carefully, but plenty of LangChain shops run Langfuse instead because they want OTel-first instrumentation and self-host optionality. "Helicone is cheaper" is not a sufficient reason to pick Helicone at 100M+ traces with multi-step agents — the cost gap closes and the visibility gap dominates. "Langfuse is open source" is not a sufficient reason to self-host on day one — managed Langfuse Cloud is the right default for most teams under 100M traces.
The tier alignment we recommend most often parallels the AI gateway tier framework: teams that picked LiteLLM as their gateway tend to pair well with Langfuse for observability (both Apache-2.0, both self-host-friendly, both OTel-native). Teams that picked Portkey or Truefoundry tend to use the gateway's built-in observability first and add Langfuse only when eval workflows mature. Teams on Kong AI Gateway tend to terminate observability through the existing Datadog/Splunk pipeline — at which point the question becomes whether to layer Langfuse on top for the LLM-specific UI or live with raw OTel traces.
| You are... | Pick | Why |
|---|---|---|
| Single-step chains, OpenAI/Anthropic SDK, cost-first | Helicone | 5-min integration, cheapest at low volume, prompt mgmt is solid, you don't need spans |
| Multi-step agents, mixed SDK + framework, want OTel future | Langfuse | OTel GenAI native, self-host real, eval primitives best in OSS, scales to 1B traces |
| Deep LangGraph investment, eval-first, premium budget | LangSmith | Trajectory replay best-in-class, prompt playground tight, only if you're already locked in |
| Regulated, EU residency, or HIPAA required | Langfuse self-host | Apache-2.0, ClickHouse-backed, only one with a real self-host community at scale |
| Multi-tenant SaaS with per-customer attribution | Langfuse or Helicone | Both ship per-key/per-tenant attribution; LangSmith is workable but less ergonomic |
| Greenfield, < 6 months in production, < $30K/mo LLM spend | Helicone | Don't over-buy. You can migrate to Langfuse in a week when you outgrow it |
Migration Cost: When Switching Pays for Itself
The migration math between these three is asymmetric, and the asymmetry is worth understanding before you commit.
Helicone → Langfuse or LangSmith is the cheapest direction. Change the base URL back to the provider, drop in the SDK or OTel exporter, and your application code barely moves. Plan three to five engineering days for a typical microservices stack, plus a week to rebuild dashboards. Historical Helicone traces stay queryable in Helicone; you start fresh on the new platform. We've migrated half a dozen clients in this direction over the last 18 months. None regretted it.
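In code terms the swap is two lines, assuming Langfuse's drop-in OpenAI wrapper (import per their documented pattern; verify against your SDK version):

```python
# Before: everything routed through the Helicone proxy.
from openai import OpenAI
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": "Bearer <key>"},
)

# After: the base URL reverts to the provider default, and Langfuse's
# drop-in wrapper instruments the same client surface.
from langfuse.openai import OpenAI  # drop-in replacement import
client = OpenAI()  # back to api.openai.com; traces now flow to Langfuse
```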
LangSmith → Langfuse is the hardest. LangChain callbacks fan out to many internal hooks, and replacing them means either swapping the callback handler globally (medium-hard) or moving instrumentation to OTel auto-instrumentors (cleaner but a bigger refactor). Plan two to four weeks for a non-trivial codebase. The agent trajectory views don't translate one-to-one because the data models differ — LangSmith's trajectory replay relies on LangGraph state, which Langfuse renders generically as nested spans. The information is there, but the visualization is different.
Langfuse → LangSmith is straightforward if you're on LangChain (the callback handler swap is a one-line change), and surprisingly hard if you're not (you have to rebuild the trace structure to fit LangSmith's run model).
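Both directions hinge on the same handler line. A sketch of the swap on the Langfuse side; the import path moved between SDK major versions, so check yours:

```python
from langfuse.callback import CallbackHandler  # v2 path; v3 moved it to langfuse.langchain

handler = CallbackHandler()  # replaces LangSmith's implicit tracer

# chain is any LangChain runnable or LangGraph graph you already have.
result = chain.invoke(
    {"question": "why did the bill triple?"},
    config={"callbacks": [handler]},
)
```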
Self-hosted Langfuse → managed Langfuse Cloud (or vice versa) is the easiest of all. Same SDK, same data model, point at a different URL. Historical traces don't migrate, but you can run both in parallel for a 90-day cutover.
The honest cost in any direction is the historical traces. None of these three exposes a clean export-and-import to either of the others. Plan to run the old platform read-only for 60-90 days while the new platform accumulates fresh history, and write the runbook for finding old incidents in the legacy UI. We've seen teams burn weeks trying to write migration scripts; in every case, the cost-effective answer was to accept the discontinuity.
The trigger for switching is usually one of three: (1) hitting an architectural wall the current platform can't address (Helicone proxy blind spots on agent graphs), (2) a compliance event that forces self-host (Langfuse self-host is the answer here), or (3) a 3-5x cost spike at scale that pushes you off managed. Below those triggers, the friction of migration almost always exceeds the value of switching. For the broader debugging discipline that makes any platform actually useful when production breaks, our AI failure tracing guide covers the runbook side that no observability tool replaces.
Recommendation by Scenario
We close every observability scoping conversation with one of five concrete recommendations. They're imperfect — every workload has wrinkles — but they're the starting points we've been right about most often:

- Under ~$30K/mo LLM spend with simple chains: Helicone managed. Don't over-buy; you can migrate in a week when you outgrow it.
- $30K-200K/mo with multi-step agents and no hard compliance constraint: Langfuse Cloud.
- Same band, but HIPAA, EU residency, or a compliance team that asks where prompts live: Langfuse self-hosted.
- Above $200K/mo with a deep LangGraph investment: LangSmith, where the trajectory tooling earns its premium.
- Multi-tenant SaaS where per-customer cost attribution is the dominant pain: Helicone or Langfuse, with per-tenant tagging from day one.
The wrong move in every case is doing nothing because the decision feels too big. We've yet to meet a team that regretted introducing LLM observability that fit their stack. We've met plenty that regretted skipping it for another quarter and finding out the hard way — usually via a billing anomaly that took eleven days to diagnose, sometimes via a customer-reported regression that nobody could trace, occasionally via a regulator. The broader strategic context — when LLM observability maps to real engineering ROI versus when it's a premature cost — sits in our AI Development Tools pillar.
Pick the architectural bet that matches your traces. Migrate when the ceiling bites, not when the hype shifts. And instrument before the bill triples, not after.
Frequently Asked Questions
What's the actual difference between Helicone, Langfuse, and LangSmith?
Three different architectural bets. Helicone is a proxy — you change one base URL and every request is logged at the wire. Integration is 5 minutes, but you only see what the wire shows: prompt in, completion out. Langfuse is a span-tracing platform built around OpenTelemetry GenAI semantic conventions — you instrument with an SDK or OTel exporter and get nested spans for every LLM call, retrieval, tool invocation, and reranker hop. LangSmith is framework-native to LangChain and LangGraph — set one environment variable and you get traces, but the resolution drops sharply outside the LangChain ecosystem. Pick proxy if you want speed and don't care about agent trajectories. Pick span tracing if your traces have more than three steps. Pick LangSmith if you've already committed to LangGraph and need the agent debugger that ships with it.
