How much does a vector database cost at scale?

Self-hosted, expect roughly $100-200/month for 5M vectors and $300-500/month for 100M vectors on commodity hardware, dominated by the RAM needed to hold the HNSW graph. Managed services price differently: around $1,000-3,000/month for 10M vectors at 1,000 QPS, because you pay for the operational layer, not just compute. The real driver of cost at scale is not vector count but sustained QPS at your target latency. A billion-vector deployment that needs 100K+ QPS is a fundamentally different cost class than the same dataset serving 500 QPS, even though the storage footprint is identical.

Is pgvector good enough for production vector search?

Yes, for most workloads under 50 million vectors, pgvector (especially with pgvectorscale's StreamingDiskANN index) is good enough and often the better choice. pgvectorscale serves 471 QPS at 28ms P95 on 50M Cohere embeddings at 99% recall, which benchmarks measured at 28x lower latency than Pinecone s1 while costing about 75% less to self-host. You keep your data in Postgres, get transactional consistency, and skip a separate system. The crossover point where a dedicated engine like Milvus earns its operational complexity is roughly 500 million vectors or sustained throughput that a single Postgres instance cannot serve.

What QPS can a single vector database instance handle?

On a single instance with HNSW, expect roughly 12,000 QPS from Redis, 8,500 from Qdrant, and 1,800 from pgvector at production recall, on comparable hardware. Those are order-of-magnitude figures, not guarantees: your real number depends on vector dimensionality, the ef_search parameter, payload filtering, and how much RAM holds the graph. The gap between Redis and pgvector is roughly 6-7x, which matters enormously for cost. If you need more than a single instance delivers, you are choosing between vertical scaling, sharding, or a distributed engine like Milvus or ScyllaDB that is built for horizontal throughput.

Does recall still matter when choosing a vector database in 2026?

Recall matters as a floor, not a differentiator. At k=10 with a properly tuned HNSW index, the major engines converge to 95-99% recall (Qdrant ~98.5%, Milvus ~97.9%, Weaviate ~97.2%), so they are effectively tied on retrieval quality. The mistake is shopping on recall benchmarks when every serious engine clears the bar. Once recall converges, the decision shifts to throughput-per-dollar, latency at your P99, operational burden, and how the engine scales past your current vector count. Treat any sub-95% recall number as a tuning bug, not an engine limitation.

When should I move from pgvector to a dedicated vector database?

Move when you cross roughly 50 million vectors or when a single Postgres instance can no longer serve your QPS at acceptable P99 latency, whichever comes first. Below 50M, pgvector with pgvectorscale handles most production RAG comfortably and keeps your stack simple. Between 50M and 500M, evaluate Qdrant or a managed service if operational load is the constraint. At 500M+ vectors or six-figure QPS, a distributed engine like Milvus/Zilliz or a ScyllaDB-backed setup earns its complexity. Do not migrate on a feature checklist; migrate when your measured cost-per-QPS on the current engine stops improving with hardware.

How do I calculate cost-per-QPS for my vector search workload?

Take the fully loaded monthly cost of the deployment (compute, RAM, storage, and managed-service fees) and divide by the sustained QPS it serves at your target P99 latency. For example, a self-hosted setup costing $400/month that holds 8,500 QPS works out to roughly $0.047 per QPS-month. Run this on a realistic slice of your own vectors, not a synthetic dataset, because recall and throughput shift with your embedding dimensionality and filter patterns. The engine with the lowest cost-per-QPS at your recall floor wins, and that ranking often reverses between 10M and 1B vectors.

Can a billion-vector search system run at single-digit millisecond latency?

Yes. ScyllaDB demonstrated 252,000 QPS at 2ms P99 on a billion-vector workload, proving that single-digit millisecond latency is achievable at that scale with the right architecture. The catch is that this requires a distributed, memory-heavy deployment and real operational investment, not a single node. pgvectorscale's StreamingDiskANN takes the opposite approach, trading some latency (28ms P95 at 50M vectors) for dramatically lower memory cost by keeping the index on SSD. Which trade-off wins depends entirely on whether your application needs 2ms or can tolerate 30ms, and what you are willing to pay for the difference.

Vector Search at a Billion Vectors: The Cost-Per-QPS Math

Recall converges at 95-99% across HNSW engines, so cost at scale is throughput-per-dollar. ScyllaDB hits 252K QPS at 2ms P99 on 1B vectors. Here's the math.

Sebastian Mondragon · MAY 29, 2026 · 11 MIN READ

Recall stopped being the reason to pick one vector database over another. At a billion vectors, the question that decides your bill is not "which engine retrieves better" but "which engine serves the most queries per second per dollar." That is the cost-per-QPS math, and almost every vector database comparison published in the last year skips it entirely in favor of feature tables nobody reads twice.

Here is the convergence that broke the old way of choosing. At k=10 with a tuned HNSW index, Qdrant lands around 98.5% recall, Milvus around 97.9%, and Weaviate around 97.2%. Those are rounding errors apart. When the top engines all clear 95-99% on the same query set, recall is a floor every serious system has already crossed, not a lever you pull to differentiate. Shopping on recall in 2026 is like shopping for a laptop on whether it can run a web browser.

This post is the throughput-and-cost breakdown: single-instance QPS reality, what a billion-vector deployment actually costs, the crossover points where pgvector stops being enough, and how to model cost-per-QPS on your own data before you commit to an engine you will be stuck operating for two years.

01 · Why Recall Convergence Changed the Decision

The early vector database market was sold on recall. Vendors published ANN-Benchmarks plots, drew their curve a hair above a competitor's, and called it a win. That worked when HNSW implementations were immature and recall genuinely varied. It stopped working once everyone implemented the same algorithm well.

The numbers tell the story. Across the major engines, recall at k=10 now clusters tightly, as the table at the end of this section shows.

A 1.3-point spread is not a buying signal. It is the signature of a saturated metric. The same thing happened to SWE-Bench Verified for coding models, where six frontier models landed within a point of each other and the leaderboard lost its discriminative power. When a benchmark saturates, the smart move is to stop reading it and find the metric that still separates the field.

For vector search, that metric is throughput-per-dollar. Two engines can tie on recall and differ by 6-7x on queries per second on identical hardware. That gap is the entire ballgame at scale, because QPS is what you pay for. Recall tells you the answers are correct; throughput tells you how many correct answers you can afford to serve. If your recall is sitting below 95%, that is a tuning problem (usually ef_search set too low or an under-built graph), not an engine limitation, and no migration will fix what an index rebuild should. We made the adjacent argument in why embedding quality matters more than your vector database: the components people obsess over are often the ones that have already converged.

Engine	Recall @ k=10	Index
Qdrant	~98.5%	HNSW
Milvus	~97.9%	HNSW
Weaviate	~97.2%	HNSW

02 · Single-Instance QPS: Where Engines Actually Diverge

Before you think about a billion vectors, look at what one instance gives you, because the per-node throughput sets the slope of every cost curve above it. On comparable hardware at production recall, the spread is large, as the table at the end of this section shows.

Read those figures as order-of-magnitude estimates, not guarantees. The exact number moves with dimensionality, the ef_search setting, whether you filter on payload, and how much RAM holds the graph. But the shape is stable across benchmark suites: Redis and Qdrant are in the same throughput class, and stock pgvector is roughly 5-7x behind.

That gap is not a knock on pgvector. It is a different design point. Redis and Qdrant are purpose-built ANN engines that keep everything hot in memory and optimize the query path for vector workloads. pgvector is an extension bolted onto a general-purpose transactional database, and it inherits Postgres's overhead in exchange for transactional consistency, SQL joins, and not running a second system. For a lot of teams that trade is correct right up until it isn't, and the cost-per-QPS math is exactly how you find the line.
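To make the "slope of every cost curve" point concrete, here is a rough sketch that turns the single-instance figures quoted in this section into a node count for a target load. It assumes throughput scales linearly with node count, which real clusters rarely achieve, so treat the result as a lower bound. The function name `nodes_needed` and the 100,000 QPS target are illustrative assumptions, not benchmark results.

```python
import math

# Approximate single-instance QPS figures quoted in this post.
SINGLE_INSTANCE_QPS = {"Redis": 12_000, "Qdrant": 8_500, "pgvector": 1_800}


def nodes_needed(target_qps: float, per_node_qps: float) -> int:
    """Lower bound on node count, ASSUMING ideal linear scaling across nodes."""
    if target_qps <= 0 or per_node_qps <= 0:
        raise ValueError("QPS values must be positive")
    return math.ceil(target_qps / per_node_qps)


if __name__ == "__main__":
    target = 100_000  # hypothetical target load
    for engine, qps in SINGLE_INSTANCE_QPS.items():
        print(f"{engine}: at least {nodes_needed(target, qps)} nodes")
```

Multiply the node count by your per-node monthly price and the per-node throughput gap shows up directly as a cost gap.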

Why the pgvector number understates the real story

Stock pgvector at ~1,800 QPS is the wrong figure to plan around if you are serious about Postgres. The pgvectorscale extension adds a StreamingDiskANN index that changes the economics entirely, and that is the configuration worth benchmarking. More on that in the billion-vector section, because that is where it earns its keep. For a head-to-head on the managed alternatives that pgvector competes against, see Pinecone vs Qdrant and the broader Pinecone vs Weaviate vs Qdrant comparison.

Engine	Single-instance QPS	Relative to pgvector
Redis	~12,000	~6.7x
Qdrant	~8,500	~4.7x
pgvector	~1,800	1x

03 · Scaling to a Billion: The 252K QPS Proof Point

The headline result that reframes the whole category: ScyllaDB sustained 252,000 QPS at 2ms P99 on a billion-vector workload. That is the existence proof that single-digit millisecond latency at billion scale is not a fantasy; it is an engineering choice with a price tag.

Two things make that number worth dwelling on. First, the latency. 2ms P99 means 99 out of every 100 queries return within two milliseconds, so only the slowest 1% take longer. For most RAG pipelines the vector search is a rounding error against the LLM call that follows it, so 2ms is overkill, and overkill is information: it tells you the architecture has headroom you can trade away for cost. Second, the throughput. 252K QPS is not a single node. It is a distributed, memory-heavy deployment with real operational weight behind it. You do not get that number for free, and most teams do not need it.

The opposite design point is pgvectorscale, and it is the one more teams should actually run. On 50 million Cohere embeddings at 99% recall, pgvectorscale's StreamingDiskANN served 471 QPS at 28ms P95. The interesting part is the comparison: benchmarks measured that as 28x lower latency than Pinecone s1 on the same workload, at roughly 75% lower cost when self-hosted. StreamingDiskANN gets there by keeping the index on SSD instead of forcing the entire graph into RAM, which is what lets a single Postgres instance hold 50M vectors without a memory bill that dwarfs the rest of your infrastructure.

The first two rows of the table below are the poles of the trade space. ScyllaDB buys you 2ms at the cost of a distributed memory-resident cluster. pgvectorscale buys you 75% lower cost at the price of 28ms, which is invisible inside a RAG pipeline. Where you land between them is a function of one question: does your application need 2ms, or can it tolerate 30ms? Most can tolerate 30ms, and most overpay anyway.

Deployment	Scale	Throughput	Latency	Recall
ScyllaDB (distributed)	1B vectors	252K QPS	2ms P99	high
pgvectorscale (StreamingDiskANN)	50M vectors	471 QPS	28ms P95	99%
Pinecone s1 (managed)	50M vectors	baseline	~28x higher	99%

04 · Cost Bands by Scale: What You Actually Pay

Strip away the vendor pricing pages and the numbers settle into bands, tabulated at the end of this section. These are the figures worth anchoring on when you size a deployment, and the gap between self-hosted and managed is the whole reason this decision is contentious.

The self-hosted bands are dominated by RAM, because a memory-resident HNSW graph for 100M vectors is a real hardware line item, and StreamingDiskANN-style SSD indexes are what keep that 100M number at $300-500 instead of multiples higher. The managed bands are dominated by the operational premium: ~$1,000-3,000/month for 10M vectors at 1,000 QPS is not paying for compute, it is paying for someone else to run, patch, scale, and page on the thing at 3am.
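A back-of-envelope estimate shows why RAM dominates. The sketch below assumes float32 vectors and a commonly used approximation for HNSW link storage (up to 2·M neighbor IDs per vector on the base layer, 4 bytes each). It ignores upper layers, payloads, and allocator overhead, so real usage will be higher. The function name and the default `m=16` are assumptions for illustration, not settings taken from any benchmark in this post.

```python
def hnsw_ram_gib(
    num_vectors: int,
    dim: int,
    m: int = 16,              # assumed HNSW M (max links per node on upper layers)
    bytes_per_float: int = 4,  # float32 components
    bytes_per_link: int = 4,   # 32-bit neighbor IDs
) -> float:
    """Rough lower bound on RAM (GiB) for a fully memory-resident HNSW index."""
    vector_bytes = num_vectors * dim * bytes_per_float
    # Base layer stores up to 2*M neighbors per node; upper layers are ignored here.
    link_bytes = num_vectors * 2 * m * bytes_per_link
    return (vector_bytes + link_bytes) / 2**30


if __name__ == "__main__":
    for n in (5_000_000, 10_000_000, 100_000_000):
        print(f"{n:>11,} vectors @ 1024 dims: ~{hnsw_ram_gib(n, 1024):.0f} GiB")
```

At 100M vectors and 1024 dimensions the raw vectors alone run to hundreds of GiB under these assumptions, which is why an SSD-resident index changes the hardware bill so sharply.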

That premium is sometimes worth every dollar. If your team is three people and none of them want to own a distributed database, managed wins on total cost of ownership even when the raw infrastructure looks 5-10x more expensive. The mistake is paying the managed premium at a scale where a single self-hosted Postgres instance would have done the job, which is most teams under 50M vectors. The reverse mistake, self-hosting Milvus at 5M vectors to save $100/month, is just as common and far more expensive once you count the engineer-hours. We walk through this build-versus-buy framing in detail in the turbopuffer vs Pinecone migration analysis, where the cost crossover is the entire story behind why teams move.

The number that actually decides it: cost-per-QPS

Total monthly cost is the wrong unit to compare engines, because it hides how much work you are getting for the money. The right unit is cost-per-QPS: fully loaded monthly cost divided by sustained QPS at your target P99. Worked example. A self-hosted Qdrant instance costing $400/month and holding 8,500 QPS works out to roughly $0.047 per QPS-month. A managed service at $2,000/month serving the same 8,500 QPS is $0.235 per QPS-month, 5x more. But flip the scale: at a billion vectors with six-figure QPS requirements, the self-hosted number includes a team's salary in operational overhead that the per-QPS figure conveniently omits, and the managed math can win. The ranking reverses across scale, which is exactly why a single comparison table cannot answer the question for you.
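The worked example above reduces to a single division. A minimal sketch using the same figures from this section (the function name is an assumption for illustration):

```python
def cost_per_qps(monthly_cost_usd: float, sustained_qps: float) -> float:
    """Fully loaded monthly cost divided by sustained QPS at the target P99."""
    if sustained_qps <= 0:
        raise ValueError("sustained_qps must be positive")
    return monthly_cost_usd / sustained_qps


# Figures from the worked example in this section.
self_hosted = cost_per_qps(400, 8_500)
managed = cost_per_qps(2_000, 8_500)

if __name__ == "__main__":
    print(f"self-hosted: ${self_hosted:.3f} per QPS-month")
    print(f"managed:     ${managed:.3f} per QPS-month")
    print(f"ratio:       {managed / self_hosted:.1f}x")
```

The number is only as honest as its inputs: if the self-hosted cost omits engineer time, add it to `monthly_cost_usd` before comparing.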

Scale	Self-hosted (commodity)	Managed (at 1K QPS)
5M vectors	~$100-200/mo	included in entry tiers
10M vectors	~$200-300/mo	~$1,000-3,000/mo
100M vectors	~$300-500/mo	scales steeply

05 · The Crossover Guidance: When pgvector Wins and When Milvus Earns It

After running these numbers across scales, the guidance compresses to two thresholds.

Under 50 million vectors, pgvector wins. With pgvectorscale's StreamingDiskANN, a single Postgres instance handles production RAG at 99% recall and 28ms P95 while keeping your data, your transactions, and your operational knowledge in one system. The 28x-latency, 75%-cost advantage over Pinecone s1 at 50M is not marketing; it is the documented benchmark. You should need a specific, measured reason to leave Postgres below this line, not a feature on a competitor's landing page. When latency does climb in that range, it is almost always a tuning gap rather than a ceiling, and the work of keeping a pgvector HNSW index fast past 10 million rows is what recovers it.

At 500 million vectors and above, a distributed engine earns its complexity. This is where Milvus and Zilliz Cloud stop being overkill and start being the correct tool. Distributed indexing, horizontal sharding, and separate compute/storage scaling are real advantages once a single node cannot hold the graph or serve the QPS, and at 500M+ a single node usually cannot. Below this line, those same features are operational tax you pay for capabilities you will not use.

The 50M-to-500M band is the genuinely contested middle. Here the decision turns on whether your binding constraint is throughput (lean Qdrant for its per-node QPS) or operational headcount (lean managed and pay the premium to not run it yourself). There is no universal answer in this band, which is the honest version of "it depends," and it is why you benchmark instead of guess. This is the kind of sizing work Particula Tech runs as a one-week diagnostic before a team commits to an engine, because a wrong choice here is a two-year migration later.

Vector count	Recommendation	Why
< 50M	pgvector + pgvectorscale	Single system, 99% recall, lowest TCO
50M - 500M	Qdrant or managed	Per-node throughput or offload ops
500M+	Milvus / Zilliz, or ScyllaDB for QPS	Distributed sharding earns its complexity
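The table can be read as a simple first-pass filter. The sketch below encodes only the vector-count thresholds from this section. It is a starting point for a shortlist, and it does not replace the measured cost-per-QPS comparison the post recommends. The function name is hypothetical.

```python
def recommend_engine(vector_count: int) -> str:
    """First-pass shortlist from the vector-count bands in this post."""
    if vector_count < 0:
        raise ValueError("vector_count must be non-negative")
    if vector_count < 50_000_000:
        return "pgvector + pgvectorscale"
    if vector_count < 500_000_000:
        return "Qdrant or managed"
    return "Milvus / Zilliz, or ScyllaDB for QPS"


if __name__ == "__main__":
    for n in (5_000_000, 120_000_000, 1_000_000_000):
        print(f"{n:>13,} vectors -> {recommend_engine(n)}")
```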

06 · How to Model Your Own Cost-Per-QPS

Public benchmarks are a starting point, not an answer, because recall and throughput shift with your embedding dimensionality, your filter patterns, and your query distribution. Here is the modeling sequence that survives a procurement review.

Pull a realistic slice of your own vectors. Not a synthetic dataset, not ANN-Benchmarks' GloVe vectors. Use 1-5 million of your actual embeddings at your actual dimensionality, because a 1536-dim OpenAI embedding and a 1024-dim Cohere embedding have materially different memory and latency profiles.

Fix your recall floor first. Decide whether you need 95%, 98%, or 99% recall, then tune ef_search on each candidate engine until it hits that floor. Compare throughput only at equal recall, otherwise you are comparing a fast-but-wrong engine against a slow-but-right one. The mechanics of when retrieval quality actually changes downstream are covered in when to re-embed and re-tune.

Measure sustained QPS at your target P99, not peak. A burst number is marketing. Run a steady load that holds your latency SLA for ten minutes and record the QPS it sustains.

Compute cost-per-QPS at each candidate's price. Fully loaded: compute, RAM, storage, managed fees. Divide by sustained QPS. The lowest number at your recall floor wins.

Re-run the projection at 10x and 100x your current scale. The engine that wins at 5M can lose at 500M because the cost curves cross. Model the scale you will be at in eighteen months, not just today's, so you are not migrating the moment you grow.
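The recall-floor and sustained-QPS steps above can be sketched as three engine-agnostic helpers. This is a minimal sketch, assuming you have already collected per-query result IDs, exact ground-truth neighbors (for example from a brute-force scan), and per-query latencies from a steady load test. All function names are illustrative, and the percentile uses the nearest-rank definition, which is one of several conventions.

```python
import math
from typing import Optional, Sequence


def recall_at_k(
    retrieved: Sequence[Sequence[int]],
    ground_truth: Sequence[Sequence[int]],
    k: int = 10,
) -> float:
    """Mean fraction of the true top-k neighbors found in the retrieved top-k."""
    if len(retrieved) != len(ground_truth) or not retrieved:
        raise ValueError("need one ground-truth list per query")
    hits = 0
    for got, truth in zip(retrieved, ground_truth):
        hits += len(set(got[:k]) & set(truth[:k]))
    return hits / (k * len(retrieved))


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sustained_qps(
    latencies_ms: Sequence[float], duration_s: float, p99_target_ms: float
) -> Optional[float]:
    """QPS over the run, or None if the run missed the P99 target."""
    if percentile(latencies_ms, 99) > p99_target_ms:
        return None
    return len(latencies_ms) / duration_s
```

Only compare `sustained_qps` values from runs where `recall_at_k` clears the same floor on every candidate; a run that returns `None` means the load was too high for the latency target, so lower the offered load and repeat.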

That sequence takes a couple of days against the real engines and saves the far larger cost of a wrong commitment. The single most common error we see across vector deployments is teams choosing on a recall benchmark that every engine already passes, then discovering the throughput-per-dollar gap only when the bill arrives at scale.

07 · What This Means for Choosing a Vector Engine

Three takeaways for teams sizing a vector search system right now.

Stop optimizing the recall column. At k=10 the major HNSW engines are within a point or two of each other (98.5%, 97.9%, 97.2%). Recall is a floor to clear, not a lever to pull. If you are below 95%, fix your index, do not change vendors.

Decide on throughput-per-dollar, at your scale. Redis and Qdrant sit roughly 5-7x above stock pgvector on single-instance QPS, and that gap, priced out as cost-per-QPS, is what your bill is made of. But the ranking reverses across scale, so model the scale you will actually be at.

Default to pgvector under 50M, earn your way to Milvus at 500M+. pgvectorscale's 28x-latency, 75%-cost advantage over Pinecone s1 at 50M is the documented reason most teams should not leave Postgres early. The distributed engines are correct at the top, where ScyllaDB's 252K QPS at 2ms P99 shows what is possible, and operationally wasteful in the middle.

The recall era told us the engines had converged on quality. The cost-per-QPS era is telling us something more useful: the vector database decision that matters in 2026 is an economics decision, not a feature decision. If you are still choosing on recall benchmarks, you are picking on the one axis where every serious engine already ties, and paying for the difference where it actually shows up, on the invoice.

08 · FAQ

Quick answers to the questions this post tends to raise.
