Vapi and Retell are managed orchestrators; LiveKit and Pipecat are frameworks you operate. Vapi's headline pricing is $0.05-0.13/min but climbs to roughly $0.23-0.33/min once you bring your own keys (BYOK) for the LLM, STT, and TTS. Retell is a flatter $0.07/min with HIPAA included. The governing constraint is the roughly 300ms turn-latency budget, split across STT, LLM, tool calls, and TTS. Under 10K minutes/month, stay managed. Between 10K and 50K, go hybrid. Above 50K minutes/month, building on LiveKit or Pipecat saves 60-80% on per-minute cost, but you now own latency tuning, telephony, and uptime.
A voice agent lives or dies on a number most teams never budget for: the gap between when a caller stops talking and when they hear a reply. AssemblyAI and most of the voice ecosystem put the conversational threshold at roughly 300ms. Cross it consistently and callers start talking over the agent, the turn-taking breaks, and the whole thing feels like a bad IVR from 2009. Every architecture decision in this space, including the choice between Vapi, Retell, LiveKit, and Pipecat, is downstream of that latency budget and the per-minute cost it takes to hit it.
The four platforms split cleanly into two camps. Vapi and Retell are managed orchestrators: you describe an agent and they run the speech-to-text, the LLM, the text-to-speech, the telephony, and the turn-taking for you. LiveKit and Pipecat are frameworks you operate yourself: they hand you a production-grade real-time backbone and pipeline primitives, and you own everything above the metal. The managed camp trades money for speed to launch. The framework camp trades engineering hours for unit economics that hold up at scale. Picking wrong in either direction is expensive, and the crossover point is more predictable than most build-vs-buy decisions.
This post is the decision framework we use to scope voice agent stacks: the real pricing once you strip away the headline rates, the 300ms latency budget broken down stage by stage, the volume thresholds where each option wins, the telephony and compliance cuts that quietly decide vendor selection, and the migration plan for when you outgrow managed. The voice agent platform you should pick depends on three inputs: your monthly minutes, your latency tolerance, and your compliance posture. Everything else is detail.
The Four-Platform Landscape: Managed vs Framework
Before pricing, understand what each product actually is, because the category determines what you are responsible for.
Vapi is a managed orchestrator built around provider flexibility. You configure an assistant through its API or dashboard, choose your STT, LLM, and TTS providers, and Vapi handles the real-time plumbing, telephony, and turn-taking. Its signature feature is bring-your-own-key (BYOK): you can plug in your own OpenAI, Deepgram, ElevenLabs, or Cartesia accounts. That flexibility is also where its pricing gets complicated, which we will get to.
Retell is a managed orchestrator built around simplicity and compliance. It bundles orchestration plus a usable model into one flatter per-minute rate, and it includes HIPAA in the standard pricing rather than gating it behind an enterprise tier. Where Vapi gives you knobs, Retell gives you a predictable bill. For high-volume support and healthcare workloads, that predictability is often worth more than provider choice.
LiveKit is an open real-time infrastructure framework. Its core is a production WebRTC backbone used well beyond voice agents, and its Agents framework sits on top for building voice and multimodal agents. LiveKit v1.5 shipped adaptive interruption handling and dynamic endpointing, which are exactly the turn-taking primitives that are painful to build yourself. You self-host or run it on LiveKit Cloud, and you wire the STT/LLM/TTS components.
Pipecat is an open, Python-first pipeline framework. Originally from Daily, it models a voice agent as a pipeline of processors (transcription, LLM, synthesis, transport) that you compose in code. It is the most flexible of the four and the most hands-on. Teams that want full control over every stage of the audio loop, or that need custom processing the managed platforms do not expose, tend to land here or on LiveKit.
The honest way to read this landscape: Vapi and Retell sell you time, LiveKit and Pipecat sell you ceiling. Softcery's widely shared 12-platform comparison makes the same split, and the more platforms you evaluate, the more the decision collapses back to managed-versus-framework rather than feature checklists. This is the same build-versus-buy axis we cover in when to build vs buy AI, applied to the specific economics of real-time voice.
Pricing Reality: Headline Rates vs What You Actually Pay
The per-minute numbers on these landing pages are real but incomplete. Here is what we see once a configuration is fully specified, as of Q2 2026.
| Platform | Effective cost/min | Pricing model | HIPAA | You operate |
|---|---|---|---|---|
| Vapi (bundled) | $0.05-0.13 | Platform fee + bundled providers | Higher tiers | No |
| Vapi (BYOK) | $0.23-0.33 | Platform fee + each vendor billed separately | Higher tiers | No |
| Retell | ~$0.07 flat | All-in bundle | Included | No |
| LiveKit | Infra + vendor rates | Self-host / Cloud + STT/LLM/TTS | Your responsibility | Yes |
| Pipecat | Infra + vendor rates | Self-host + STT/LLM/TTS | Your responsibility | Yes |
A few patterns matter more than the exact cents:
Vapi's headline rate assumes bundled defaults. The $0.05-0.13/min range holds when you use Vapi's bundled providers. The moment you switch to BYOK to control quality or compliance, you pay Vapi's platform fee plus every downstream vendor separately, and the effective rate climbs to roughly $0.23-0.33/min. BYOK is not a discount; it is a control lever you pay for. Choose it when provider selection is a hard requirement, not to save money.
Retell's flat $0.07/min is the predictability play. One number, HIPAA included, no per-vendor reconciliation. For a support line doing tens of thousands of minutes a month, a flat rate that already covers compliance is frequently the lowest total cost of ownership even if a hand-tuned BYOK setup could shave cents per minute in theory.
LiveKit and Pipecat are infrastructure-only. Your per-minute cost is whatever your STT, LLM, and TTS vendors charge plus your compute. At volume that pushes well under $0.05/min, but you add infrastructure, on-call, and latency-tuning labor that the managed platforms absorb.
The market is moving under all of this. ElevenLabs cut its Conversational AI pricing roughly 50% in February 2026, which dragged TTS economics down across every stack that uses it. Voice pricing is one of the faster-moving corners of AI infrastructure right now, so model your own minute volume against a specific provider combination and re-check rates before you commit a budget.
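Modeling a specific provider combination against your own minute volume takes only a few lines. The sketch below is illustrative: the `Stack` class and every component rate in it are placeholders we made up so the totals land near the effective rates quoted in this section, not vendor quotes. Swap in the numbers from your own invoices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    """Per-minute cost components in USD. Every rate here is a placeholder."""

    name: str
    platform_fee: float = 0.0  # orchestrator fee, or an all-in flat rate
    stt: float = 0.0
    llm: float = 0.0
    tts: float = 0.0
    telephony: float = 0.0

    def per_minute(self) -> float:
        total = self.platform_fee + self.stt + self.llm + self.tts + self.telephony
        return round(total, 4)

    def monthly(self, minutes: int) -> float:
        return round(self.per_minute() * minutes, 2)


# A flat all-in bundle, modeled as a single platform fee.
flat_bundle = Stack("flat bundle", platform_fee=0.07)

# A BYOK setup: platform fee plus each vendor billed separately.
# The split across components is hypothetical.
byok = Stack("byok", platform_fee=0.05, stt=0.01, llm=0.10, tts=0.08, telephony=0.01)

for stack in (flat_bundle, byok):
    print(f"{stack.name}: ${stack.per_minute():.2f}/min, "
          f"${stack.monthly(10_000):,.2f} at 10K minutes")
```

Keeping each vendor as its own field makes it easy to re-run the model when a single provider changes its price.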
The 300ms Latency Budget, Stage by Stage
Latency is the metric that decides whether your agent feels like a conversation or a phone tree. The target most of the voice ecosystem converges on is roughly 300ms of perceived turn latency, with anything past ~800ms total reading as sluggish. That budget is not one number; it is a chain, and every link spends part of it.
Here is the breakdown of a single turn, from the caller finishing a sentence to hearing the first word back:
| Stage | What it does | Typical spend | Where it goes wrong |
|---|---|---|---|
| Endpointing / VAD | Detect the caller stopped talking | 100-300ms | Too eager interrupts; too slow feels dead |
| STT finalization | Final transcript of the utterance | ~100-300ms | Waiting for full transcript instead of streaming |
| LLM time-to-first-token | Model starts generating | 200-600ms | Cold model, long prompt, no streaming |
| Tool / DB calls | Lookups the answer depends on | 0-500ms+ | One slow call blows the whole budget |
| TTS first audio | First synthesized audio out | ~100-300ms | Waiting for full text before synthesizing |
Add those naively and you are already over a second. Hitting a conversational feel means overlapping the stages rather than running them in series: stream the transcript instead of waiting for the final one, stream tokens out of the LLM, and start synthesizing before the full text exists.
This is where LiveKit's v1.5 work earns its keep: adaptive interruption and dynamic endpointing tune that first 100-300ms VAD window in real time instead of using a fixed threshold, which is the difference between an agent that talks over people and one that waits a beat too long. Managed platforms hide this tuning from you, which is convenient until you need to change it. If your broader stack is fighting latency outside the voice loop too, the same principles in our LLM latency fixes for production apps guide apply directly to the LLM and tool-call stages of this budget.
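The "add those naively" arithmetic can be checked directly from the stage ranges in the table. This is a rough sketch: the dictionary keys are our own labels, and because the tool-call row is open-ended ("0-500ms+"), 500ms is used as a stand-in upper bound.

```python
# Stage ranges in milliseconds, taken from the latency table.
# The tool-call upper bound is open-ended in the table; 500 is a stand-in.
STAGES_MS = {
    "endpointing_vad": (100, 300),
    "stt_finalization": (100, 300),
    "llm_time_to_first_token": (200, 600),
    "tool_db_calls": (0, 500),
    "tts_first_audio": (100, 300),
}

TARGET_MS = 300  # the conversational threshold discussed in this section


def serial_range(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Best and worst case if every stage runs strictly one after another."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def serial_midpoint(stages: dict[str, tuple[int, int]]) -> float:
    """Sum of the midpoint of each stage range, a crude 'typical' turn."""
    return sum((lo + hi) / 2 for lo, hi in stages.values())


low, high = serial_range(STAGES_MS)
typical = serial_midpoint(STAGES_MS)
print(f"serial best case: {low}ms, worst case: {high}ms, midpoint: {typical:.0f}ms")
print(f"midpoint overshoots the {TARGET_MS}ms target by {typical - TARGET_MS:.0f}ms")
```

Even the best serial case lands above the 300ms target, which is why the stages have to overlap instead of queueing behind each other.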
When Each Platform Wins: The Volume Thresholds
The cleanest way to decide is by monthly voice minutes, because that is what flips the cost math from favoring managed to favoring framework.
Under 10,000 minutes per month: stay managed
At this volume, the per-minute premium of a managed platform is trivial in absolute dollars. Ten thousand minutes at even $0.30/min is $3,000 a month. The engineering cost of building, tuning, and operating a LiveKit or Pipecat stack dwarfs that several times over in the first quarter alone. Use Vapi if you need specific provider control, Retell if you want a flat HIPAA-included rate. The goal here is to launch, learn, and validate the use case before you spend on infrastructure. This is the same logic that makes managed the default for early-stage agents generally, as we argue in our pillar on AI agents.
10,000 to 50,000 minutes per month: go hybrid
This is the band where teams thrash, and where a hybrid posture usually wins. Two patterns work well:
- Stay on the managed orchestrator but bring your own cheaper STT/TTS where quality allows, trimming the per-minute cost without giving up turn-taking and telephony.
- Self-host your single highest-volume flow (say, the appointment-reminder bot doing 80% of your minutes) on LiveKit or Pipecat, and leave the long tail of lower-volume flows on the managed platform.
The point of hybrid is to capture most of the savings on the minutes that matter while deferring the operational cost of owning everything. Do not migrate the whole estate at once; migrate the flow whose volume justifies the work.
Above 50,000 minutes per month: build on a framework
Past roughly 50K minutes/month, the math tips decisively. Building on LiveKit or Pipecat saves an estimated 60-80% on per-minute cost versus a managed platform at that scale, because you are paying vendor rates directly instead of a platform markup on top of them. At 100K minutes, a $0.18/min difference is $18,000 a month, which funds the engineering and on-call required to run the stack with room to spare. You take on latency tuning, telephony, and uptime, but the unit economics now reward that ownership. This crossover, where the build path starts saving 60-80% once you are past the 10K-50K hybrid band, is the single most important number in the decision.
| Monthly minutes | Recommendation | Primary reason |
|---|---|---|
| Under 10K | Managed (Vapi or Retell) | Per-minute premium is trivial; speed to launch wins |
| 10K-50K | Hybrid | Capture savings on top-volume flows, defer ops cost |
| Over 50K | Framework (LiveKit or Pipecat) | 60-80% per-minute savings funds owning the stack |
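The thresholds and the savings arithmetic reduce to two small functions. Treat this as a sketch under stated assumptions: `recommend` and `monthly_savings` are names we made up, and the $0.23 and $0.05 rates are example inputs chosen so the gap matches the $0.18/min difference used in this section, not measured prices.

```python
def recommend(monthly_minutes: int) -> str:
    """Map monthly voice minutes to the starting posture from the table."""
    if monthly_minutes < 10_000:
        return "managed"
    if monthly_minutes <= 50_000:
        return "hybrid"
    return "framework"


def monthly_savings(minutes: int, managed_rate: float, framework_rate: float) -> float:
    """Dollars saved per month by paying the framework rate instead."""
    return round(minutes * (managed_rate - framework_rate), 2)


def savings_pct(managed_rate: float, framework_rate: float) -> float:
    """Per-minute savings as a percentage of the managed rate."""
    return round((managed_rate - framework_rate) / managed_rate * 100, 1)


# Example inputs (assumptions): $0.23/min managed, $0.05/min self-hosted.
managed_rate, framework_rate = 0.23, 0.05

for minutes in (5_000, 30_000, 100_000):
    print(minutes, recommend(minutes),
          monthly_savings(minutes, managed_rate, framework_rate))
print(f"per-minute savings: {savings_pct(managed_rate, framework_rate)}%")
```

The dollar figure is the part to compare against your engineering and on-call cost; the band label alone does not account for how expensive your team is.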
Telephony, HIPAA, and SOC 2: The Cuts That Decide It
Cost and latency narrow the field, but compliance and telephony often make the final call, especially for regulated or high-volume phone workloads.
Telephony. Phone calls mean SIP trunks, phone number provisioning, and carrier-grade reliability. Vapi and Retell handle this for you, which is a large part of what you are paying for. On LiveKit or Pipecat you wire telephony yourself (LiveKit has SIP support; Pipecat integrates with transport providers), and getting reliable, low-latency PSTN connectivity is non-trivial work. If your agent is web or app-only and never touches the phone network, this cut matters less and the framework path gets easier.
HIPAA. Retell includes HIPAA in its $0.07/min standard pricing with a BAA available, which is genuinely unusual and a strong reason to shortlist it for healthcare voice. Vapi offers HIPAA on higher tiers. On LiveKit or Pipecat, compliance is end-to-end your responsibility: you sign BAAs with every downstream STT, LLM, and TTS vendor, run them in compliant regions, and own audio retention, redaction, and access controls. That is the price of control. For the full picture on building voice and other AI in regulated healthcare environments, see our guide on HIPAA-compliant AI healthcare implementation.
SOC 2. Managed platforms carry their own SOC 2 attestations, which simplifies your vendor review. Self-hosting shifts the burden to your own infrastructure and your chosen vendors. For enterprise procurement, a managed platform with the right certifications can shave weeks off the security review, which is a real cost even when it does not show up on the per-minute rate.
The pattern across these cuts: managed platforms sell you compliance and telephony as bundled features. Frameworks make you assemble them. Below scale, bundled is cheaper in total cost. Above scale, the assembly cost is worth it because the per-minute savings dominate.
The Migration Plan for When You Outgrow Managed
The most expensive mistake in this space is not picking the wrong managed platform on day one. It is building your agent in a way that locks the orchestration logic inside a vendor dashboard, so that outgrowing the platform means rewriting the agent from scratch. Plan the exit before you need it.
Keep your logic portable from the start. Your prompts, tool definitions, conversation flow, and business rules should live in your own codebase and be callable from any orchestrator, not buried in a managed platform's UI. The managed platform should be running your logic, not owning it. When portability is designed in, migration becomes reimplementing the real-time plumbing, which is bounded work, rather than reverse-engineering your own agent's behavior.
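One way to picture portable logic is a plain spec object in your own codebase that any orchestrator calls into. The sketch below is hypothetical throughout: `AgentSpec`, `ManagedWebhookAdapter`, `SelfHostedAdapter`, and `lookup_appointment` are illustrative names, and none of this is a Vapi, Retell, LiveKit, or Pipecat API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentSpec:
    """Prompt and tools owned by your codebase, not by a vendor dashboard."""

    system_prompt: str
    tools: dict[str, Callable[..., str]] = field(default_factory=dict)

    def call_tool(self, name: str, **kwargs) -> str:
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        return self.tools[name](**kwargs)


def lookup_appointment(patient_id: str) -> str:
    # Stand-in for a real database or API lookup.
    return f"Appointment for {patient_id}: Tuesday 10:00"


SPEC = AgentSpec(
    system_prompt="You are a scheduling assistant. Be brief.",
    tools={"lookup_appointment": lookup_appointment},
)


class ManagedWebhookAdapter:
    """Hypothetical: a managed platform sends tool calls to your webhook."""

    def handle(self, payload: dict) -> str:
        return SPEC.call_tool(payload["tool"], **payload["arguments"])


class SelfHostedAdapter:
    """Hypothetical: your own pipeline calls the same spec in-process."""

    def handle(self, tool: str, arguments: dict) -> str:
        return SPEC.call_tool(tool, **arguments)


managed = ManagedWebhookAdapter().handle(
    {"tool": "lookup_appointment", "arguments": {"patient_id": "p-42"}}
)
self_hosted = SelfHostedAdapter().handle("lookup_appointment", {"patient_id": "p-42"})
print(managed == self_hosted)
```

The adapters are deliberately thin. If they start to accumulate business rules, that logic has leaked out of the portable layer.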
Migrate the metrics, not just the feature parity. When you move a flow to LiveKit or Pipecat, the thing to protect is the latency budget and transcription quality you had on managed. Run the new pipeline in parallel on a small traffic slice, measure turn latency and word error rate against the managed baseline, and only cut over when the self-hosted path matches or beats it. Reliability, not raw accuracy, is what breaks first in production agents, a pattern we dig into in why agent reliability lags accuracy.
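The parallel-run comparison can be expressed as a simple gate. This is a minimal sketch, assuming you log per-turn latency in milliseconds and have reference transcripts to score against. The function names, the p95 choice, and the sample numbers are ours; word error rate is computed as the standard word-level edit distance divided by the reference length.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; good enough for a cutover check."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def ready_to_cut_over(
    managed_latency_ms: list[float],
    candidate_latency_ms: list[float],
    managed_wer: float,
    candidate_wer: float,
) -> bool:
    """True only if the candidate matches or beats the managed baseline."""
    latency_ok = percentile(candidate_latency_ms, 95) <= percentile(managed_latency_ms, 95)
    return latency_ok and candidate_wer <= managed_wer


# Made-up sample data for illustration.
managed_ms = [420, 510, 480, 600, 450]
candidate_ms = [400, 470, 455, 580, 430]
print(word_error_rate("book a table for two", "book a table for you"))
print(ready_to_cut_over(managed_ms, candidate_ms, managed_wer=0.08, candidate_wer=0.07))
```

Gating on a tail percentile rather than the mean keeps a handful of slow turns from hiding behind a good average.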
Cut over flow by flow. Move your highest-volume, simplest flow first, since that is where the savings are largest and the risk is lowest. Leave complex or low-volume flows on managed until the framework pipeline is hardened. A staged migration lets you capture most of the cost savings early while limiting blast radius.
Re-establish telephony deliberately. Phone numbers, SIP trunks, and carrier reliability are the least glamorous and most failure-prone part of the move. Provision and test telephony on the new stack well before cutover, and keep the managed numbers live as a fallback during the transition.
If your voice agent is one node in a larger multi-step or multi-agent system, the orchestration choices interact, and the framework you pick for voice should sit comfortably alongside the rest. Our comparison of Mastra vs LangGraph vs Vercel AI SDK for TypeScript agents covers the application-layer framework decision that pairs with the voice transport layer described here. At Particula Tech, the voice engagements we scope usually start exactly here: modeling real minute volume against a provider combination, mapping the 300ms budget across the actual tool calls, and writing the migration runbook before the managed bill makes it urgent.
Recommendation by Scenario
We close every voice agent scoping conversation with one of a few concrete starting points. They are imperfect, since every workload has wrinkles, but they hold up most often.
Pick by your minutes, your latency budget, and your compliance posture, in that order. Keep your agent logic portable so the managed platform is a tenant, not a landlord. And model your real per-minute cost against a specific provider combination before you trust any headline rate, because in voice the headline and the invoice are rarely the same number.
Frequently Asked Questions
Quick answers to common questions about this topic
Retell is usually cheaper and more predictable at $0.07/min flat with HIPAA included, while Vapi's real cost depends on configuration. Vapi's $0.05-0.13/min headline only holds on its bundled defaults; once you bring your own LLM, STT, and TTS (BYOK), effective cost climbs to roughly $0.23-0.33/min because you pay Vapi's platform fee plus each vendor separately. Retell bundles the orchestration and a usable model into one flat rate, which makes budgeting easier and tends to win for high-volume support workloads. Vapi wins when you need fine-grained control over which exact STT, LLM, and TTS providers run, and you are willing to manage that pricing complexity yourself.



