Don't choose embedding models based on MTEB leaderboard scores; test with your actual data. OpenAI's text-embedding-3-large ($0.13/million tokens) is the best starting point for diverse content. Cohere embed-english-v3.0, which embeds documents and queries with separate input types, excels for pure search. For cost-sensitive or on-premise needs, self-hosted all-MiniLM-L6-v2 saves ~$800/month at scale. Key factors: retrieval accuracy on your specific content (10-20% variance between models), cost at your actual volume (can 100x as you scale), and latency (API adds 100-300ms vs 10-50ms self-hosted). Build a test set of 100+ real queries before committing.
"Which embedding model should we use?" is the first question every RAG project surfaces, and the one most teams answer wrong.
The standard approach is looking at benchmark scores on the MTEB leaderboard, picking the highest-ranked model, and moving on. Then production hits, and the chosen model either doesn't handle the team's specific document types, costs 10x what they budgeted, or introduces latency that makes search unusable.
The real answer isn't about finding the "best" embedding model. It's about matching the model to your specific requirements: your document types, query patterns, cost constraints, and infrastructure. Below is a framework for evaluating embedding models for production RAG systems, based on what actually breaks when you scale to real users.
Why Embedding Model Choice Actually Matters
Your embedding model is the foundation of your RAG system's retrieval accuracy. It determines whether your system finds the right context to answer user questions or returns irrelevant information that makes your AI hallucinate.
Here's what we see fail most often. Teams choose a model optimized for general semantic similarity when they need domain-specific understanding. A model trained primarily on Wikipedia and web data performs differently than one trained on technical documentation, legal text, or medical records.
The difference shows up in edge cases. Your RAG system might work fine for common queries but fail on the specific terminology your users actually care about. Picture a manufacturing RAG that can't distinguish between similar part numbers because the embedding model doesn't capture the numeric precision that matters in that domain.
Cost is the other factor most teams underestimate. The difference between $0.0001 per 1K tokens and $0.13 per 1K tokens seems trivial until you're processing 10 million tokens monthly. That's $1 versus $1,300, and at scale, these numbers multiply fast. It's common for embedding spend to quietly overtake LLM spend when a team picks a premium model where a lighter one would have worked.
The Three Factors That Actually Determine Your Choice
Retrieval accuracy for your specific content
Generic benchmarks tell you how models perform on academic datasets. Your business has different requirements.
Test with your actual documents and queries. Pull 100 representative user questions and the documents that should answer them. Embed your content with each candidate model, run the queries, and measure how often the right documents appear in your top results. This practical accuracy matters more than any leaderboard score.
Across production RAG systems, the same patterns keep showing up. OpenAI's text-embedding-3-large handles diverse content types well: technical docs, conversational text, and structured data. It's a sensible default when content is varied. Cohere's embed-english-v3.0, which takes a separate input type for documents and for queries, often outperforms for pure search applications where each side can be embedded in the way that suits it. For specialized domains, smaller fine-tuned models sometimes beat larger general models. A fine-tuned BERT variant can outperform GPT-based embeddings on legal text because it captures legal terminology more precisely.
Cost at your actual usage volume
Calculate your monthly token volume before choosing a model. Most RAG systems have two embedding phases: initial document indexing and ongoing query embedding.
For document indexing, you typically embed once when you build your knowledge base, then incrementally as you add new content. If you have 50,000 documents averaging 1,000 tokens each, that's 50 million tokens. At $0.13 per million tokens (OpenAI's text-embedding-3-large), that's $6.50 for the initial index.
Query embedding happens every time a user searches. If you handle 10,000 queries daily averaging 50 tokens each, that's 500,000 tokens daily or 15 million monthly. At $0.13 per million, that's $1.95 monthly for queries. The costs are manageable.
But these numbers grow with scale. A workload with 100,000 daily queries at 50 tokens each comes to 150 million tokens monthly, or $19.50 on query embeddings alone. Spend reaches $130 monthly at about a billion tokens. At that volume, a model priced at $0.02 per million tokens costs $20, a saving of about $110 monthly, which is the point where a cheaper model or self-hosting starts to merit the engineering effort.
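The arithmetic above is easy to script so it can be rerun as volumes change. This is a minimal sketch, not a billing tool: the function names are illustrative, the price is the $0.13 per million tokens quoted above, and a 30-day month is assumed.

```python
def embedding_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost in USD of embedding `tokens` tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd


def monthly_query_tokens(queries_per_day: int, tokens_per_query: int, days: int = 30) -> int:
    """Total query tokens embedded in a month (30-day month assumed)."""
    return queries_per_day * tokens_per_query * days


PRICE = 0.13  # USD per million tokens, the figure quoted in the text

# One-time indexing: 50,000 documents averaging 1,000 tokens each.
index_tokens = 50_000 * 1_000
index_cost = embedding_cost_usd(index_tokens, PRICE)

# Ongoing queries: 10,000 per day averaging 50 tokens each.
query_tokens = monthly_query_tokens(10_000, 50)
query_cost = embedding_cost_usd(query_tokens, PRICE)

print(f"index: {index_tokens:,} tokens -> ${index_cost:.2f}")
print(f"queries: {query_tokens:,} tokens/month -> ${query_cost:.2f}/month")
```

Swapping in your own document counts, query rates, and each candidate model's price gives a like-for-like comparison before any code is written against an API.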
Latency for your user experience
Query embedding latency directly impacts user experience. API-based models add network overhead, typically 100-300ms for the round trip. Self-hosted models can embed queries in 10-50ms depending on your infrastructure.
For most search interfaces, 200ms isn't noticeable. Users expect a brief delay. But if you're embedding multiple queries in a complex workflow, or if you're combining RAG with real-time conversation, those milliseconds compound.
On its own, switching from an API-based embedding model to a self-hosted one saves roughly the difference between those two ranges on each query. Combined with optimizing the rest of the pipeline, it can help cut end-to-end RAG response time from around 2 seconds to under 1 second. Embedding latency isn't usually the only bottleneck, but it contributes enough to matter. The reranking stage is the other latency-sensitive piece, so after picking an embedding model, choose the right reranker for your latency budget.
Document embedding latency matters less because it typically happens offline. You can batch process documents during indexing without affecting users. If embeddings take 500ms each but you're processing in the background, it's not a user-facing issue.
Models That Actually Work in Production
OpenAI text-embedding-3-large
This is where most teams should start. It handles diverse content types well, the API is reliable, and the cost is reasonable for most use cases at $0.13 per million tokens. The 3,072-dimension vectors capture nuanced semantic meaning.
It's the right pick for mixed content (technical documentation, conversational queries, structured data) when you need one model that works across everything. It performs consistently well without domain-specific tuning.
The main limitations are cost at high volume and the API dependency. If OpenAI has an outage or you hit a rate limit, your RAG system stops working. For applications where uptime is critical, you need a fallback strategy.
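One possible shape for such a fallback is sketched below. It is an assumption-laden illustration, not a prescribed design: `flaky_api_embed` and `local_embed` are hypothetical stand-ins for a hosted API client and a local model, and the wrapper returns the name of the model that answered because vectors from different models are not comparable, so the fallback needs its own index built with the same model.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Embedder = Callable[[str], Vector]


def embed_with_fallback(
    text: str,
    embedders: Sequence[Tuple[str, Embedder]],
) -> Tuple[str, Vector]:
    """Try each (name, embedder) pair in order and return the first success.

    The model name is returned with the vector so the caller can search
    the index that was built with that same model.
    """
    errors = []
    for name, embed in embedders:
        try:
            return name, embed(text)
        except Exception as exc:  # e.g. timeout, rate limit, outage
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all embedders failed: " + "; ".join(errors))


# Hypothetical stand-ins, for illustration only.
def flaky_api_embed(text: str) -> Vector:
    raise TimeoutError("simulated API outage")


def local_embed(text: str) -> Vector:
    return [float(len(text)), 0.0]


model_used, vector = embed_with_fallback(
    "reset password",
    [("api-model", flaky_api_embed), ("local-model", local_embed)],
)
print(model_used)
```

The cost of this approach is maintaining a second index for the fallback model, so it is worth it mainly where downtime is more expensive than the extra storage and indexing work.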
Cohere embed-english-v3.0
Cohere's model with separate search-document and search-query optimizations consistently ranks high for search applications. The asymmetric approach uses a single model with a different input type for each side (search_document when embedding documents, search_query when embedding queries), which captures the different characteristics of each.
Documents are typically longer, more formal, and information-dense. Queries are short, conversational, and context-light. Embedding each side with its own input type improves retrieval accuracy compared to symmetric models that treat both the same way.
This is the right pick when search quality is paramount and you have the engineering discipline to keep the two input types straight across indexing and query code. It's more complex than a single undifferentiated embedding call but often delivers better results for pure search use cases.
Voyage AI voyage-large-2
Voyage specifically optimizes for RAG applications. It consistently performs well on retrieval benchmarks and is purpose-built for the retrieve-then-generate workflow.
This model makes sense when your RAG system needs to find the right context from large document collections where precision matters. That covers medical applications, legal research, and technical support: cases where returning the wrong context leads to incorrect or potentially harmful outputs.
The trade-off is higher cost than general models and less flexibility if your use case extends beyond pure RAG to other semantic tasks. If you're struggling with citation accuracy, check out our guide on how to fix RAG citations.
Open-source alternatives: all-MiniLM-L6-v2
For teams with strong infrastructure capabilities and cost sensitivity, self-hosted models make sense. all-MiniLM-L6-v2 from sentence-transformers is lightweight, fast, and produces 384-dimension vectors that work well for most applications.
It's the right fit when data has to stay on-premises or when volume is high enough that API costs become prohibitive. At $0.13 per million tokens, API spend only reaches several hundred dollars a month once volume is in the billions of tokens, and that is roughly where self-hosting can save a few hundred dollars monthly versus an API like OpenAI's, even accounting for GPU infrastructure costs.
The challenge is maintaining the infrastructure, handling model updates, and ensuring consistent uptime. You're trading lower per-token costs for higher operational complexity.
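Whether self-hosting pays off comes down to a break-even calculation between a fixed monthly infrastructure cost and a per-token API price. The sketch below shows the shape of that calculation. The $300 per month GPU figure is a placeholder assumption, not a quote, and engineering time is deliberately left out.

```python
def breakeven_tokens_per_month(fixed_monthly_cost_usd: float,
                               api_price_per_million_usd: float) -> float:
    """Monthly token volume at which API spend equals the fixed self-hosting cost."""
    return fixed_monthly_cost_usd / api_price_per_million_usd * 1_000_000


def monthly_savings_usd(tokens_per_month: float,
                        fixed_monthly_cost_usd: float,
                        api_price_per_million_usd: float) -> float:
    """API spend minus self-hosting cost; negative means the API is cheaper."""
    api_spend = tokens_per_month / 1_000_000 * api_price_per_million_usd
    return api_spend - fixed_monthly_cost_usd


GPU_COST = 300.0   # USD/month, placeholder assumption for illustration
API_PRICE = 0.13   # USD per million tokens, the figure quoted in the text

breakeven = breakeven_tokens_per_month(GPU_COST, API_PRICE)
print(f"break-even: {breakeven / 1e9:.2f}B tokens/month")

for volume in (300e6, 1e9, 5e9):
    savings = monthly_savings_usd(volume, GPU_COST, API_PRICE)
    print(f"{volume / 1e9:.1f}B tokens/month -> savings ${savings:,.2f}")
```

Below the break-even volume the result is negative, which is the case where an on-premises requirement, rather than cost, has to justify self-hosting.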
How We Actually Test Models Before Committing
Benchmarks give you a starting point. Real testing with your data tells you which model actually works.
Build a representative test set
Pull 100-200 actual user queries from your system or anticipated queries if you're building something new. For each query, identify which documents should appear in the results, your ground truth. This test set should reflect the diversity of your actual use cases. If users search for part numbers, product names, technical specifications, and troubleshooting steps, include all those query types. The goal is realistic coverage of what your system will face in production.
Measure retrieval precision at different levels
Embed your document collection with each candidate model. Then embed your test queries and retrieve the top 5-10 results for each query. Calculate precision at different cutoffs. Top-1 precision (did the best answer appear first?) matters most for direct answer systems. Top-5 precision matters for search interfaces showing multiple results. Different applications tolerate different precision levels. Across domain-specific test sets, 10-20% accuracy differences between models are common even when benchmark scores are similar. That gap is why testing with your data matters.
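The measurement step can be sketched in a few lines once each model's ranked results are available. This assumes retrieval has already been run: `results` maps each query to the ranked document IDs a model returned and `ground_truth` maps it to the IDs that should appear. Both names and the sample data are illustrative. Hit rate is included alongside precision because it answers the question "did a right document show up at all in the top k?".

```python
from typing import Dict, List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def hit_at_k(retrieved: List[str], relevant: Set[str], k: int) -> bool:
    """True if at least one relevant document appears in the top k."""
    return any(doc_id in relevant for doc_id in retrieved[:k])


def evaluate(results: Dict[str, List[str]],
             ground_truth: Dict[str, Set[str]],
             k: int) -> Dict[str, float]:
    """Average precision@k and hit rate@k over every query in the test set."""
    if not ground_truth:
        raise ValueError("ground_truth is empty")
    n = len(ground_truth)
    precision = sum(
        precision_at_k(results.get(q, []), rel, k) for q, rel in ground_truth.items()
    ) / n
    hit_rate = sum(
        hit_at_k(results.get(q, []), rel, k) for q, rel in ground_truth.items()
    ) / n
    return {"precision": precision, "hit_rate": hit_rate}


# Tiny illustrative test set: two queries, ranked results from one model.
ground_truth = {"q1": {"d1"}, "q2": {"d4", "d5"}}
results = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d4", "d8"]}

top1 = evaluate(results, ground_truth, k=1)
top3 = evaluate(results, ground_truth, k=3)
print(top1, top3)
```

Running `evaluate` once per candidate model on the same `ground_truth` gives directly comparable numbers at each cutoff.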
Factor in the real constraints
Time how long embeddings take with each model. Calculate actual costs based on your projected volume. Test the models with your infrastructure, because some vector databases optimize for specific dimension sizes, which can impact query performance. A common trap: a model performs well on accuracy but produces vectors too large for the chosen vector database to handle efficiently at scale, and the theoretical accuracy advantage disappears once queries take 3x longer to run.
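A small timing harness makes the latency comparison repeatable across models. The sketch below accepts any callable that embeds one string, so an API client or a local model can be passed in the same way. `fake_embed` is a hypothetical stand-in that just sleeps, and the p95 uses a simple nearest-rank calculation.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(embed: Callable[[str], object],
                       queries: List[str],
                       warmup: int = 3) -> Dict[str, float]:
    """Time one embed call per query; report p50, p95, and max in milliseconds."""
    if not queries:
        raise ValueError("queries is empty")
    for query in queries[:warmup]:  # warm caches and connections, untimed
        embed(query)
    samples = []
    for query in queries:
        start = time.perf_counter()
        embed(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)  # nearest rank
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


# Hypothetical stand-in for a real embedding call.
def fake_embed(text: str) -> List[float]:
    time.sleep(0.002)
    return [0.0] * 8


stats = measure_latency_ms(fake_embed, ["where is part 4471-B?"] * 20)
print(stats)
```

Tail latency (p95) is usually the more useful number for user-facing search, since an average hides the slow requests users actually notice.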
Consider the operational implications
API-based models are simpler to start with but create external dependencies. Self-hosted models require GPU infrastructure, monitoring, and maintenance. Proprietary models might change or get deprecated. Open-source models give you control but require more expertise. The pragmatic path is starting with an API to validate the use case, then evaluating self-hosting once you understand actual usage patterns and volumes. The exception is when data privacy requirements force self-hosting from the start.
Common Mistakes That Break Production Systems
Choosing based on benchmark scores alone
A model that tops the MTEB leaderboard might not handle your specific content well. This pattern repeats: teams choose the highest-ranked general model and get worse results than a lower-ranked specialized model. In financial services, switching from a top-ranked general model to a finance-specific model can meaningfully improve retrieval accuracy because the specialized model understands financial terminology and document structure better even though it ranks lower on general benchmarks.
Ignoring cost scaling
Embedding costs seem negligible until you scale. A $10 monthly cost becomes $1,000 at 100x volume. If your RAG system takes off, you need to know whether you can afford to scale or whether you'll need to re-architect around a cheaper model later. Post-launch embedding-model migrations are a common painful pattern: when the initial choice becomes cost-prohibitive at scale, the team has to re-embed the entire document collection and run extensive testing to ensure quality doesn't degrade.
Not testing with actual data
Every document collection has unique characteristics. Technical manuals differ from customer support tickets differ from legal contracts. Your model needs to perform well on your specific content types. Testing with sample data before committing prevents expensive rebuilds later. A common late-stage failure: discovering after implementation that the chosen model can't handle mixed-language content (e.g. English-Spanish customer support documents). A different model with better multilingual support would have been the right choice if it had been tested properly.
Mismatching model size to requirements
Using a 1,024-dimension model when 384 dimensions would work wastes compute and storage. Using a compact model when you need nuanced semantic understanding sacrifices accuracy. Right-sizing matters. A safe starting point is mid-sized models (768-1,024 dimensions), then scaling up or down based on actual performance requirements. Most applications don't need the largest available models, but some genuinely benefit from the additional semantic richness.
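The storage side of right-sizing is simple to estimate. This sketch assumes vectors are stored as 32-bit floats and ignores index overhead and metadata, which vary by vector database, so treat the output as a lower bound.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 assumed, no index overhead)."""
    return num_vectors * dimensions * bytes_per_value


NUM_VECTORS = 1_000_000  # illustrative collection size

sizes = {dims: raw_vector_bytes(NUM_VECTORS, dims) for dims in (384, 1024, 3072)}
for dims, size in sizes.items():
    print(f"{dims:>5} dims: {size / 1024**3:.2f} GiB")
```

Storage scales linearly with dimension count, so moving from 384 to 3,072 dimensions multiplies the raw footprint by eight before any accuracy benefit is counted.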
Making the Right Choice for Your Use Case
The best embedding model for your RAG system depends on your specific requirements: content type, query patterns, budget, latency needs, and infrastructure capabilities.
For most business applications, OpenAI's text-embedding-3-large provides a strong balance of performance, reliability, and reasonable cost. It's the right default unless specific requirements push you toward specialized models.
For search-heavy applications where retrieval quality is critical, Cohere's embed-english-v3.0 with asymmetric document and query input types often delivers better results. The additional complexity is worth it when precision directly impacts business outcomes.
For cost-sensitive applications or teams with strong infrastructure capabilities, self-hosted models like all-MiniLM-L6-v2 provide good performance at lower ongoing costs. The trade-off is operational complexity and infrastructure requirements.
The key is testing before committing. Measure what matters for your application with your actual data. Don't over-engineer with the highest-dimensional model when a simpler one would work, and don't under-invest in embedding quality if retrieval accuracy is critical to your user experience.
Your embedding model is the foundation of your RAG system's performance. Getting this choice right saves you from expensive rebuilds later and ensures your system actually works when real users start asking real questions.
Frequently Asked Questions
Quick answers to common questions about this topic
Should I choose an embedding model based on MTEB leaderboard scores?
No. Benchmark scores show performance on academic datasets, not your specific content. Across production RAG systems we've audited, 10-20% accuracy differences between models on domain-specific test sets are common even when MTEB scores look similar. A lower-ranked specialized model often outperforms a top-ranked general model for specific domains.



