At k=10 with HNSW, recall converges to 95-99% (Qdrant ~98.5%, Milvus ~97.9%, Weaviate ~97.2%), so recall is no longer the differentiator. Throughput is: Redis ~12K QPS, Qdrant ~8.5K, pgvector ~1.8K on a single instance. At a billion vectors ScyllaDB sustains 252K QPS at 2ms P99, and pgvectorscale serves 471 QPS at 28ms P95 on 50M Cohere embeddings at 99% recall, 28x lower latency than Pinecone s1 at 75% lower self-hosted cost. pgvector wins under 50M vectors; Milvus/Zilliz only earns its complexity at 500M+. Model cost-per-QPS before you pick an engine.
Recall stopped being the reason to pick one vector database over another. At a billion vectors, the question that decides your bill is not "which engine retrieves better" but "which engine serves the most queries per second per dollar." That is the cost-per-QPS math, and almost every vector database comparison published in the last year skips it entirely in favor of feature tables nobody reads twice.
Here is the convergence that broke the old way of choosing. At k=10 with a tuned HNSW index, Qdrant lands around 98.5% recall, Milvus around 97.9%, and Weaviate around 97.2%. Those are rounding errors apart. When the top engines all clear 95-99% on the same query set, recall is a floor every serious system has already crossed, not a lever you pull to differentiate. Shopping on recall in 2026 is like shopping for a laptop on whether it can run a web browser.
This post is the throughput-and-cost breakdown: single-instance QPS reality, what a billion-vector deployment actually costs, the crossover points where pgvector stops being enough, and how to model cost-per-QPS on your own data before you commit to an engine you will be stuck operating for two years.
Why Recall Convergence Changed the Decision
The early vector database market was sold on recall. Vendors published ANN-Benchmarks plots, drew their curve a hair above a competitor's, and called it a win. That worked when HNSW implementations were immature and recall genuinely varied. It stopped working once everyone implemented the same algorithm well.
The numbers tell the story. Across the major engines, recall at k=10 now clusters tightly:

| Engine | Recall @ k=10 | Index |
|---|---|---|
| Qdrant | ~98.5% | HNSW |
| Milvus | ~97.9% | HNSW |
| Weaviate | ~97.2% | HNSW |

A 1.3-point spread is not a buying signal. It is the signature of a saturated metric. The same thing happened to SWE-Bench Verified for coding models, where six frontier models landed within a point of each other and the leaderboard lost its discriminative power. When a benchmark saturates, the smart move is to stop reading it and find the metric that still separates the field.
For vector search, that metric is throughput-per-dollar. Two engines can tie on recall and differ by 6-7x on queries per second on identical hardware. That gap is the entire ballgame at scale, because QPS is what you pay for. Recall tells you the answers are correct; throughput tells you how many correct answers you can afford to serve. If your recall is sitting below 95%, that is a tuning problem (usually ef_search set too low or an under-built graph), not an engine limitation, and no migration will fix what an index rebuild should. We made the adjacent argument in why embedding quality matters more than your vector database: the components people obsess over are often the ones that have already converged.
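For readers who want to check the recall figure on their own query set, recall at k is simply the share of the exact top-k neighbors that the approximate index also returned. A minimal sketch, assuming you already have exact (brute-force) results to compare against; the function name and the sample ids are illustrative, not from any engine's API:

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Fraction of the exact top-k neighbors that the ANN index also returned."""
    true_top_k = set(ground_truth[:k])
    if not true_top_k:
        raise ValueError("ground_truth must contain at least one id")
    # Use a set so duplicate ids in the ANN result are not counted twice.
    hits = len(set(retrieved[:k]) & true_top_k)
    return hits / len(true_top_k)


# Illustrative ids: the ANN result misses one true neighbor (0) out of ten.
exact = [3, 7, 1, 9, 4, 8, 2, 6, 5, 0]
approx = [3, 7, 1, 9, 4, 8, 2, 6, 5, 42]
print(recall_at_k(approx, exact))  # 0.9
```

Averaging this over a few thousand representative queries gives the number to compare against the 95% floor.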
Single-Instance QPS: Where Engines Actually Diverge
Before you think about a billion vectors, look at what one instance gives you, because the per-node throughput sets the slope of every cost curve above it. On comparable hardware at production recall, the spread is large:

| Engine | Single-instance QPS | Relative to pgvector |
|---|---|---|
| Redis | ~12,000 | ~6.7x |
| Qdrant | ~8,500 | ~4.7x |
| pgvector | ~1,800 | 1x |

Read that as orders of magnitude, not guarantees. The exact number moves with dimensionality, the ef_search setting, whether you filter on payload, and how much RAM holds the graph. But the shape is stable across benchmark suites: Redis and Qdrant are in the same throughput class, and stock pgvector is roughly 5-7x behind.
That gap is not a knock on pgvector. It is a different design point. Redis and Qdrant are purpose-built ANN engines that keep everything hot in memory and optimize the query path for vector workloads. pgvector is an extension bolted onto a general-purpose transactional database, and it inherits Postgres's overhead in exchange for transactional consistency, SQL joins, and not running a second system. For a lot of teams that trade is correct right up until it isn't, and the cost-per-QPS math is exactly how you find the line.
Why the pgvector number understates the real story
Stock pgvector at ~1,800 QPS is the wrong figure to plan around if you are serious about Postgres. The pgvectorscale extension adds a StreamingDiskANN index that changes the economics entirely, and that is the configuration worth benchmarking. More on that in the billion-vector section, because that is where it earns its keep. For a head-to-head on the managed alternatives that pgvector competes against, see Pinecone vs Qdrant and the broader Pinecone vs Weaviate vs Qdrant comparison.
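Because per-node throughput sets the slope of the cost curve, a rough replica count falls out of the single-instance figures in this section. The sketch below is an illustration under loud assumptions: throughput scales linearly with replicas that each hold the full index, and the 70% headroom factor is a hypothetical planning margin, not a measured value.

```python
import math

# Approximate single-instance QPS figures quoted in this section.
SINGLE_INSTANCE_QPS = {"redis": 12_000, "qdrant": 8_500, "pgvector": 1_800}


def nodes_needed(target_qps, engine, headroom=0.7):
    """Replicas required if each node runs at `headroom` of its benchmarked QPS.

    Assumes linear scaling across full replicas, which real clusters only approximate.
    """
    usable_qps = SINGLE_INSTANCE_QPS[engine] * headroom
    return math.ceil(target_qps / usable_qps)


for engine in SINGLE_INSTANCE_QPS:
    print(engine, nodes_needed(20_000, engine))
```

At a 20,000 QPS target this gives 3 Redis nodes, 4 Qdrant nodes, and 16 stock pgvector nodes, which is the throughput gap restated as a hardware count.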
Scaling to a Billion: The 252K QPS Proof Point
The headline result that reframes the whole category: ScyllaDB sustained 252,000 QPS at 2ms P99 on a billion-vector workload. That is the existence proof that single-digit millisecond latency at billion scale is not a fantasy, it is an engineering choice with a price tag.
Two things make that number worth dwelling on. First, the latency. 2ms P99 means 99 out of every 100 queries return within two milliseconds; only the slowest 1% take longer. For most RAG pipelines the vector search is a rounding error against the LLM call that follows it, so 2ms is overkill, and overkill is information: it tells you the architecture has headroom you can trade away for cost. Second, the throughput. 252K QPS is not a single node. It is a distributed, memory-heavy deployment with real operational weight behind it. You do not get that number for free, and most teams do not need it.
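Since the latency figures in this section are percentiles, it helps to be precise about what a P99 does and does not bound. A minimal sketch using the nearest-rank definition; the latency samples are made up for illustration:

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number percentiles avoid float rounding.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up latencies in ms: 98 fast queries, one at 2 ms, one 40 ms outlier.
latencies_ms = [1.0] * 98 + [2.0, 40.0]
print(percentile(latencies_ms, 99))  # 2.0
print(max(latencies_ms))             # 40.0
```

The P99 here is 2 ms even though the worst query took 40 ms, which is why a P99 target should be read alongside the maximum when tail latency matters.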
The opposite design point is pgvectorscale, and it is the one more teams should actually run. On 50 million Cohere embeddings at 99% recall, pgvectorscale's StreamingDiskANN served 471 QPS at 28ms P95. The interesting part is the comparison: benchmarks measured that as 28x lower latency than Pinecone s1 on the same workload, at roughly 75% lower cost when self-hosted. StreamingDiskANN gets there by keeping the index on SSD instead of forcing the entire graph into RAM, which is what lets a single Postgres instance hold 50M vectors without a memory bill that dwarfs the rest of your infrastructure.
ScyllaDB and pgvectorscale are the poles of the trade space. ScyllaDB buys you 2ms at the cost of a distributed memory-resident cluster. pgvectorscale buys you 75% lower cost at the price of 28ms, which is invisible inside a RAG pipeline. Where you land between them is a function of one question: does your application need 2ms, or can it tolerate 30ms? Most can tolerate 30ms, and most overpay anyway.
| Deployment | Scale | Throughput | Latency | Recall |
|---|---|---|---|---|
| ScyllaDB (distributed) | 1B vectors | 252K QPS | 2ms P99 | high |
| pgvectorscale (StreamingDiskANN) | 50M vectors | 471 QPS | 28ms P95 | 99% |
| Pinecone s1 (managed) | 50M vectors | baseline | ~28x higher | 99% |
Cost Bands by Scale: What You Actually Pay
Strip away the vendor pricing pages and the numbers settle into bands. These are the figures worth anchoring on when you size a deployment, and the gap between self-hosted and managed is the whole reason this decision is contentious.
The self-hosted bands are dominated by RAM, because a memory-resident HNSW graph for 100M vectors is a real hardware line item, and StreamingDiskANN-style SSD indexes are what keep that 100M number at $300-500 instead of multiples higher. The managed bands are dominated by the operational premium: ~$1,000-3,000/month for 10M vectors at 1,000 QPS is not paying for compute, it is paying for someone else to run, patch, scale, and page on the thing at 3am.
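To see why RAM dominates the self-hosted bands, a back-of-the-envelope estimate of a memory-resident HNSW index is useful. This is a rough rule of thumb rather than any engine's actual accounting: it counts float32 vectors plus layer-0 neighbor links only, and the 768 dimensions and M=16 are assumed values chosen for illustration.

```python
def hnsw_ram_gib(num_vectors, dims, m=16, bytes_per_float=4):
    """Rough RAM for float32 vectors plus layer-0 links (2*M neighbor ids, 4 bytes each).

    Ignores upper graph layers, payloads, and per-engine overhead, so treat it as a lower bound.
    """
    bytes_per_vector = dims * bytes_per_float + m * 2 * 4
    return num_vectors * bytes_per_vector / 2**30


# Assumed: 768-dimensional embeddings, M=16.
print(round(hnsw_ram_gib(5_000_000, 768), 1))    # 14.9
print(round(hnsw_ram_gib(100_000_000, 768), 1))  # 298.0
```

Under these assumptions 100M vectors need on the order of 300 GiB of RAM if the whole graph stays in memory, which is the pressure that SSD-resident indexes relieve.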
That premium is sometimes worth every dollar. If your team is three people and none of them want to own a distributed database, managed wins on total cost of ownership even when the raw infrastructure looks 5-10x more expensive. The mistake is paying the managed premium at a scale where a single self-hosted Postgres instance would have done the job, which is most teams under 50M vectors. The reverse mistake, self-hosting Milvus at 5M vectors to save $100/month, is just as common and far more expensive once you count the engineer-hours. We walk through this build-versus-buy framing in detail in the turbopuffer vs Pinecone migration analysis, where the cost crossover is the entire story behind why teams move.
The number that actually decides it: cost-per-QPS
Total monthly cost is the wrong unit to compare engines, because it hides how much work you are getting for the money. The right unit is cost-per-QPS: fully loaded monthly cost divided by sustained QPS at your target P99. Worked example. A self-hosted Qdrant instance costing $400/month and holding 8,500 QPS works out to roughly $0.047 per QPS-month. A managed service at $2,000/month serving the same 8,500 QPS is $0.235 per QPS-month, 5x more. But flip the scale: at a billion vectors with six-figure QPS requirements, the self-hosted number includes a team's salary in operational overhead that the per-QPS figure conveniently omits, and the managed math can win. The ranking reverses across scale, which is exactly why a single comparison table cannot answer the question for you.
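The worked example above reduces to a single division, sketched here so you can plug in your own figures. The `ops_overhead_usd` parameter is a hypothetical addition for the staffing cost the paragraph says the raw per-QPS figure omits; it defaults to zero to reproduce the numbers in the text.

```python
def cost_per_qps(monthly_infra_usd, sustained_qps, ops_overhead_usd=0.0):
    """Fully loaded monthly cost divided by sustained QPS at the target P99."""
    if sustained_qps <= 0:
        raise ValueError("sustained_qps must be positive")
    return (monthly_infra_usd + ops_overhead_usd) / sustained_qps


self_hosted = cost_per_qps(400, 8_500)
managed = cost_per_qps(2_000, 8_500)
print(round(self_hosted, 3))            # 0.047
print(round(managed, 3))                # 0.235
print(round(managed / self_hosted, 1))  # 5.0
```

Passing a realistic `ops_overhead_usd` for the self-hosted case is how the reversal at large scale shows up in the same formula.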
| Scale | Self-hosted (commodity) | Managed (at 1K QPS) |
|---|---|---|
| 5M vectors | ~$100-200/mo | included in entry tiers |
| 10M vectors | ~$200-300/mo | ~$1,000-3,000/mo |
| 100M vectors | ~$300-500/mo | scales steeply |
The Crossover Guidance: When pgvector Wins and When Milvus Earns It
After running these numbers across scales, the guidance compresses to two thresholds.
Under 50 million vectors, pgvector wins. With pgvectorscale's StreamingDiskANN, a single Postgres instance handles production RAG at 99% recall and 28ms P95 while keeping your data, your transactions, and your operational knowledge in one system. The 28x-latency, 75%-cost advantage over Pinecone s1 at 50M is not marketing, it is the documented benchmark. You should need a specific, measured reason to leave Postgres below this line, not a feature on a competitor's landing page.
At 500 million vectors and above, a distributed engine earns its complexity. This is where Milvus and Zilliz Cloud stop being overkill and start being the correct tool. Distributed indexing, horizontal sharding, and separate compute/storage scaling are real advantages once a single node cannot hold the graph or serve the QPS, and at 500M+ a single node usually cannot. Below this line, those same features are operational tax you pay for capabilities you will not use.
The 50M-to-500M band is the genuinely contested middle. Here the decision turns on whether your binding constraint is throughput (lean Qdrant for its per-node QPS) or operational headcount (lean managed and pay the premium to not run it yourself). There is no universal answer in this band, which is the honest version of "it depends," and it is why you benchmark instead of guess. This is the kind of sizing work Particula Tech runs as a one-week diagnostic before a team commits to an engine, because a wrong choice here is a two-year migration later.
| Vector count | Recommendation | Why |
|---|---|---|
| < 50M | pgvector + pgvectorscale | Single system, 99% recall, lowest TCO |
| 50M - 500M | Qdrant or managed | Per-node throughput or offload ops |
| 500M+ | Milvus / Zilliz, or ScyllaDB for QPS | Distributed sharding earns its complexity |
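The thresholds in the table can be written down as a first-pass lookup. This is only a sketch of the guidance as stated; the function name is illustrative, and a real decision in the middle band still depends on the throughput and headcount constraints described above.

```python
def recommend_engine(vector_count):
    """First-pass recommendation from the crossover thresholds in the table above."""
    if vector_count < 50_000_000:
        return "pgvector + pgvectorscale"
    if vector_count < 500_000_000:
        return "Qdrant or managed"
    return "Milvus / Zilliz, or ScyllaDB for QPS"


for count in (5_000_000, 120_000_000, 1_000_000_000):
    print(f"{count:>13,} -> {recommend_engine(count)}")
```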
How to Model Your Own Cost-Per-QPS
Public benchmarks are a starting point, not an answer, because recall and throughput shift with your embedding dimensionality, your filter patterns, and your query distribution. Here is the modeling sequence that survives a procurement review.
That sequence takes a couple of days against the real engines and saves the far larger cost of a wrong commitment. The single most common error we see across vector deployments is teams choosing on a recall benchmark that every engine already passes, then discovering the throughput-per-dollar gap only when the bill arrives at scale.
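One way to turn a load test into the cost-per-QPS figure defined earlier (fully loaded monthly cost divided by sustained QPS at your target P99) is sketched below. The load-test steps, the $300 monthly cost, and the function names are all made up for illustration; substitute measurements from your own data and filters.

```python
def sustained_qps(load_test, target_p99_ms):
    """Highest offered QPS whose observed P99 stayed within target, or None if none did."""
    passing = [qps for qps, p99_ms in load_test if p99_ms <= target_p99_ms]
    return max(passing) if passing else None


def cost_per_qps_at_target(monthly_cost_usd, load_test, target_p99_ms):
    """Monthly cost divided by the QPS the engine sustains at the latency target."""
    qps = sustained_qps(load_test, target_p99_ms)
    return None if qps is None else monthly_cost_usd / qps


# Hypothetical (offered QPS, observed P99 in ms) pairs from a stepped load test.
steps = [(500, 9.0), (1_000, 14.0), (2_000, 31.0), (4_000, 120.0)]
print(sustained_qps(steps, target_p99_ms=30))                # 1000
print(cost_per_qps_at_target(300, steps, target_p99_ms=30))  # 0.3
```

Running the same steps against each candidate engine gives directly comparable cost-per-QPS numbers at the latency your application actually needs.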
What This Means for Choosing a Vector Engine
Three takeaways for teams sizing a vector search system right now.
Stop optimizing the recall column. At k=10 the major HNSW engines are within a point or two of each other (98.5%, 97.9%, 97.2%). Recall is a floor to clear, not a lever to pull. If you are below 95%, fix your index, do not change vendors.
Decide on throughput-per-dollar, at your scale. Redis and Qdrant sit roughly 5-7x above stock pgvector on single-instance QPS, and that gap, priced out as cost-per-QPS, is what your bill is made of. But the ranking reverses across scale, so model the scale you will actually be at.
Default to pgvector under 50M, earn your way to Milvus at 500M+. pgvectorscale's 28x-latency, 75%-cost advantage over Pinecone s1 at 50M is the documented reason most teams should not leave Postgres early. The distributed engines are correct at the top, where ScyllaDB's 252K QPS at 2ms P99 shows what is possible, and operationally wasteful in the middle.
The recall era told us the engines had converged on quality. The cost-per-QPS era is telling us something more useful: the vector database decision that matters in 2026 is an economics decision, not a feature decision. If you are still choosing on recall benchmarks, you are picking on the one axis where every serious engine already ties, and paying for the difference where it actually shows up, on the invoice.
Frequently Asked Questions
Self-hosted, expect roughly $100-200/month for 5M vectors and $300-500/month for 100M vectors on commodity hardware, dominated by the RAM needed to hold the HNSW graph. Managed services price differently: around $1,000-3,000/month for 10M vectors at 1,000 QPS, because you pay for the operational layer, not just compute. The real driver of cost at scale is not vector count but sustained QPS at your target latency. A billion-vector deployment that needs 100K+ QPS is a fundamentally different cost class than the same dataset serving 500 QPS, even though the storage footprint is identical.



