We've built RAG systems for 20 different clients this year. Every single one asked the same question: "Which embedding model should we use?"
Most teams approach this by looking at benchmark scores on the MTEB leaderboard. They pick the highest-ranked model and move on. Then they hit production and discover the model they chose doesn't handle their specific document types well, costs 10x what they budgeted, or introduces latency that makes their search unusable.
The real answer isn't about finding the "best" embedding model. It's about matching the right model to your specific requirements—your document types, query patterns, cost constraints, and infrastructure. I'll walk you through the framework we use to evaluate embedding models for production RAG systems, based on what actually breaks when you scale to real users.
Why Embedding Model Choice Actually Matters
Your embedding model is the foundation of your RAG system's retrieval accuracy. It determines whether your system finds the right context to answer user questions or returns irrelevant information that makes your AI hallucinate.
Here's what we see fail most often. Teams choose a model optimized for general semantic similarity when they need domain-specific understanding. A model trained primarily on Wikipedia and web data performs differently than one trained on technical documentation, legal text, or medical records.
The difference shows up in edge cases. Your RAG system might work fine for common queries but fail on the specific terminology your users actually care about. We worked with a manufacturing company whose RAG system couldn't distinguish between similar part numbers because their embedding model didn't capture the numeric precision that mattered in their domain.
Cost is the other factor most teams underestimate. The difference between $0.02 and $0.13 per million tokens seems trivial until you're processing tokens by the billion—at a billion tokens a month, that's $20 versus $130, and the gap multiplies fast from there. One client discovered their embedding costs exceeded their LLM costs because they chose a premium model when a lighter one would have worked.
The Three Factors That Actually Determine Your Choice
Retrieval accuracy for your specific content
Generic benchmarks tell you how models perform on academic datasets. Your business has different requirements. Test with your actual documents and queries. Pull 100 representative user questions and the documents that should answer them. Embed your content with each candidate model, run the queries, and measure how often the right documents appear in your top results. This practical accuracy matters more than any leaderboard score.
We see consistent patterns across implementations. OpenAI's text-embedding-3-large handles diverse content types well—technical docs, conversational text, structured data. It's our default recommendation when clients have varied content. Cohere's embed-english-v3.0 with separate document and query embedding modes often outperforms for pure search applications where you can optimize each side independently.
For specialized domains, smaller fine-tuned models sometimes beat larger general models. We worked with a legal tech company that got better results from a fine-tuned BERT variant than from GPT-based embeddings because it captured legal terminology more precisely.
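To make that test loop concrete, here's a minimal sketch assuming the OpenAI Python SDK as the candidate model; the documents, queries, and top-k cutoff are placeholders you'd swap for your own content.

```python
import numpy as np
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY in the environment

client = OpenAI()

def embed(texts: list[str], model: str = "text-embedding-3-large") -> np.ndarray:
    """Embed a batch of texts and return an (n, dim) array of unit-normalized vectors."""
    response = client.embeddings.create(model=model, input=texts)
    vectors = np.array([item.embedding for item in response.data])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Placeholder corpus and queries -- swap in your real documents and user questions.
documents = ["How to reset the X200 controller", "Torque specs for the A-series spindle"]
queries = ["x200 reset procedure"]

doc_vectors = embed(documents)
query_vectors = embed(queries)

# Cosine similarity reduces to a dot product on normalized vectors.
scores = query_vectors @ doc_vectors.T
top_k = np.argsort(-scores, axis=1)[:, :5]  # indices of the top-5 documents per query
print(top_k)
```

Run the same loop once per candidate model and compare how often the documents you expect show up near the top.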
Cost at your actual usage volume
Calculate your monthly token volume before choosing a model. Most RAG systems have two embedding phases: initial document indexing and ongoing query embedding.
For document indexing, you typically embed once when you build your knowledge base, then incrementally as you add new content. If you have 50,000 documents averaging 1,000 tokens each, that's 50 million tokens. At $0.13 per million tokens (OpenAI's text-embedding-3-large), that's $6.50 for the initial index.
Query embedding happens every time a user searches. If you handle 10,000 queries daily averaging 50 tokens each, that's 500,000 tokens daily or 15 million monthly. At $0.13 per million, that's $1.95 monthly for queries. The costs are manageable.
But these numbers change dramatically at scale. One client processing 100,000 documents with 100,000 daily queries spends roughly $20 monthly on query embeddings plus $13 for every full re-index at that pricing. Grow the query volume or query length by another order of magnitude and the bill approaches $200 monthly—at which point switching to a model at $0.02 per million tokens saves well over $100 monthly, enough to justify the engineering effort of self-hosting.
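The arithmetic is simple enough to script so you can re-run it whenever prices or volumes change. Here's a minimal sketch using the illustrative figures from this section; the function name and assumptions (30-day months, flat per-token pricing) are ours.

```python
def embedding_costs(
    num_documents: int,
    avg_doc_tokens: int,
    daily_queries: int,
    avg_query_tokens: int,
    price_per_million_tokens: float,
) -> dict:
    """Estimate one-time indexing cost and recurring monthly query-embedding cost."""
    index_tokens = num_documents * avg_doc_tokens
    monthly_query_tokens = daily_queries * avg_query_tokens * 30
    return {
        "index_cost": index_tokens / 1_000_000 * price_per_million_tokens,
        "monthly_query_cost": monthly_query_tokens / 1_000_000 * price_per_million_tokens,
    }

# The example above: 50k docs x 1k tokens, 10k queries/day x 50 tokens,
# at text-embedding-3-large's $0.13 per million tokens.
print(embedding_costs(50_000, 1_000, 10_000, 50, 0.13))
# -> {'index_cost': 6.5, 'monthly_query_cost': 1.95}
```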
Latency for your user experience
Query embedding latency directly impacts user experience. API-based models add network overhead—typically 100-300ms for the round trip. Self-hosted models can embed queries in 10-50ms depending on your infrastructure.
For most search interfaces, 200ms isn't noticeable. Users expect a brief delay. But if you're embedding multiple queries in parallel for a complex workflow, or if you're combining RAG with real-time conversation, those milliseconds compound. We recently helped a client reduce their RAG response time from 2 seconds to 800ms by switching from an API-based embedding model to a self-hosted one. The embedding latency wasn't the only bottleneck, but it contributed enough to matter when they optimized their entire pipeline.
Document embedding latency matters less because it typically happens offline. You can batch process documents during indexing without affecting users. If embeddings take 500ms each but you're processing in the background, it's not a user-facing issue.
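A simple timing harness is usually enough to compare candidates on query latency. This sketch reports p50/p95 in milliseconds for any embedding callable you plug in; `embed_query` is a placeholder for whichever API client or local model you're testing.

```python
import time
import statistics
from typing import Callable

def measure_latency(embed_query: Callable[[str], list[float]],
                    queries: list[str],
                    warmup: int = 3) -> dict:
    """Time single-query embedding calls and report p50/p95 latency in milliseconds."""
    for q in queries[:warmup]:          # warm up connections / model caches
        embed_query(q)
    timings_ms = []
    for q in queries:
        start = time.perf_counter()
        embed_query(q)
        timings_ms.append((time.perf_counter() - start) * 1000)
    timings_ms.sort()
    return {
        "p50_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[int(0.95 * (len(timings_ms) - 1))],
    }
```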
Models That Actually Work in Production
OpenAI text-embedding-3-large
This is where most teams should start. It handles diverse content types well, the API is reliable, and the cost is reasonable for most use cases at $0.13 per million tokens. The 3,072-dimension vectors capture nuanced semantic meaning. We use this for clients with mixed content—technical documentation, conversational queries, structured data—where we need one model that works across everything. It performs consistently well without domain-specific tuning. The main limitation is cost at high volume and the API dependency. If OpenAI has an outage or rate limit, your RAG system stops working. For applications where uptime is critical, you need a fallback strategy.
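What a fallback strategy looks like depends on your stack, but for query-time embedding the minimum is retries with backoff, because you generally can't swap in a different embedding model at query time—its vectors won't match the index you built with the primary model. Here's a minimal retry sketch assuming the OpenAI Python SDK; the retry counts and delays are arbitrary placeholders.

```python
import time
from openai import OpenAI

client = OpenAI()

def embed_query_with_retry(text: str, retries: int = 3, base_delay: float = 0.5) -> list[float]:
    """Embed a query, retrying with exponential backoff on transient API failures."""
    for attempt in range(retries):
        try:
            response = client.embeddings.create(model="text-embedding-3-large", input=text)
            return response.data[0].embedding
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the error (or degrade gracefully)
            time.sleep(base_delay * 2 ** attempt)
```

If you need to keep answering during a longer outage, the realistic options are degrading to keyword search or maintaining a second index built with a backup model—not swapping embedding models on the fly.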
Cohere embed-english-v3.0
Cohere's model with separate search-document and search-query input modes consistently ranks high for search applications. The asymmetric approach—embedding documents one way and queries another—captures the different characteristics of each. Documents are typically longer, more formal, and information-dense. Queries are short, conversational, and context-light. Encoding each side for its role improves retrieval accuracy compared to symmetric setups that embed documents and queries identically. We recommend this when search quality is paramount and you have the engineering discipline to tag every embedding call with the right input type. It's slightly more complex than a symmetric setup but often delivers better results for pure search use cases.
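In practice the asymmetry is expressed through the input type you pass at embedding time. Here's a minimal sketch assuming the Cohere Python SDK, with placeholder texts and the API key read from the environment; check the SDK version you install, since parameter details have shifted between releases.

```python
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

# Documents are embedded with input_type="search_document" at indexing time...
doc_response = co.embed(
    texts=["Warranty terms for the 2024 product line..."],
    model="embed-english-v3.0",
    input_type="search_document",
)

# ...and queries with input_type="search_query" at search time.
query_response = co.embed(
    texts=["what does the warranty cover"],
    model="embed-english-v3.0",
    input_type="search_query",
)

doc_vectors = doc_response.embeddings      # one float vector per input text
query_vectors = query_response.embeddings
```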
Voyage AI voyage-large-2
Voyage specifically optimizes for RAG applications. It consistently performs well on retrieval benchmarks and is purpose-built for the retrieve-then-generate workflow. This model makes sense when your RAG system needs to find the right context from large document collections where precision matters. Medical applications, legal research, technical support—cases where returning the wrong context leads to incorrect or potentially harmful outputs. The trade-off is higher cost than general models and less flexibility if your use case extends beyond pure RAG to other semantic tasks. If you're struggling with citation accuracy, check out our guide on how to fix RAG citations.
Open-source alternatives: all-MiniLM-L6-v2
For teams with strong infrastructure capabilities and cost sensitivity, self-hosted models make sense. The all-MiniLM-L6-v2 model from sentence-transformers is lightweight, fast, and produces 384-dimension vectors that work well for most applications. We use this for clients who need to keep data entirely on-premises or who process enough volume that API costs become prohibitive. One client saves $800 monthly by self-hosting versus using OpenAI's API, even accounting for GPU infrastructure costs. The challenge is maintaining the infrastructure, handling model updates, and ensuring consistent uptime. You're trading lower per-token costs for higher operational complexity.
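Self-hosting this model is a few lines with the sentence-transformers library. The sketch below assumes the package is installed and that downloading the model weights on first run is acceptable; the documents and queries are placeholders.

```python
from sentence_transformers import SentenceTransformer

# Downloads the model weights on first use; runs on CPU or GPU if available.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = ["Reset instructions for the controller", "Warranty coverage details"]
queries = ["how do I reset the controller"]

# normalize_embeddings=True lets a plain dot product act as cosine similarity.
doc_vectors = model.encode(documents, normalize_embeddings=True)    # shape (2, 384)
query_vectors = model.encode(queries, normalize_embeddings=True)    # shape (1, 384)

scores = query_vectors @ doc_vectors.T
print(scores)
```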
How We Actually Test Models Before Committing
Benchmarks give you a starting point. Real testing with your data tells you which model actually works.
Build a representative test set
Pull 100-200 actual user queries from your system or anticipated queries if you're building something new. For each query, identify which documents should appear in the results—your ground truth. This test set should reflect the diversity of your actual use cases. If users search for part numbers, product names, technical specifications, and troubleshooting steps, include all those query types. The goal is realistic coverage of what your system will face in production.
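Any lightweight format works for the test set as long as it maps each query to the documents that should answer it. Here's one minimal sketch; the file name, IDs, and queries are invented placeholders.

```python
# ground_truth.jsonl -- one record per test query, listing the doc IDs that should be retrieved:
# {"query": "torque spec for A-series spindle", "relevant_doc_ids": ["spec-0042"]}
# {"query": "error E-113 on the X200 controller", "relevant_doc_ids": ["kb-0917", "kb-0918"]}

import json

def load_test_set(path: str) -> list[dict]:
    """Load query -> relevant-document-ID records from a JSONL file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```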
Measure retrieval precision at different levels
Embed your document collection with each candidate model. Then embed your test queries and retrieve the top 5-10 results for each query. Calculate precision at different cutoffs. Top-1 precision (did the best answer appear first?) matters most for direct answer systems. Top-5 precision matters for search interfaces showing multiple results. Different applications tolerate different precision levels. We typically see 10-20% accuracy differences between models on client-specific test sets even when benchmark scores are similar. That gap is why testing with your data matters.
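Given retrieval results for each test query, the metric itself is a few lines. This sketch computes hit rate at k—the fraction of queries where at least one relevant document appears in the top k—which is the "did the right document show up" number described above; the function and variable names are ours.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant doc ID in the top-k retrieved IDs."""
    hits = sum(
        1 for ranked, gold in zip(retrieved, relevant)
        if any(doc_id in gold for doc_id in ranked[:k])
    )
    return hits / len(retrieved)

# Example with two test queries: the first is answered at rank 1, the second at rank 3.
retrieved = [["kb-0917", "kb-0042"], ["spec-0001", "spec-0007", "spec-0042"]]
relevant = [{"kb-0917"}, {"spec-0042"}]

print(hit_rate_at_k(retrieved, relevant, k=1))  # 0.5 -- only the first query hits at rank 1
print(hit_rate_at_k(retrieved, relevant, k=5))  # 1.0 -- both hit within the top 5
```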
Factor in the real constraints
Time how long embeddings take with each model. Calculate actual costs based on your projected volume. Test the models with your infrastructure—some vector databases optimize for specific dimension sizes, which can impact query performance. One client eliminated a model that performed well on accuracy because it produced vectors too large for their vector database to handle efficiently at scale. The theoretical accuracy advantage disappeared when queries took 3x longer to run.
Consider the operational implications
API-based models are simpler to start with but create external dependencies. Self-hosted models require GPU infrastructure, monitoring, and maintenance. Proprietary models might change or get deprecated. Open-source models give you control but require more expertise. We typically recommend starting with an API to validate your use case, then evaluating self-hosting once you understand your usage patterns and volumes. The exception is when data privacy requirements force self-hosting from the start.
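One way to keep that migration path open is to hide the provider behind a small interface from day one, so moving from an API to a self-hosted model touches one class instead of the whole pipeline. A minimal sketch—the class names are ours, not from any particular framework:

```python
from typing import Protocol

class Embedder(Protocol):
    """The one seam the rest of the RAG pipeline depends on."""
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...
    def embed_query(self, text: str) -> list[float]: ...

class OpenAIEmbedder:
    def __init__(self, model: str = "text-embedding-3-large"):
        from openai import OpenAI
        self.client, self.model = OpenAI(), model

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        response = self.client.embeddings.create(model=self.model, input=texts)
        return [item.embedding for item in response.data]

    def embed_query(self, text: str) -> list[float]:
        return self.embed_documents([text])[0]

class LocalEmbedder:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model_name)

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return self.model.encode(texts, normalize_embeddings=True).tolist()

    def embed_query(self, text: str) -> list[float]:
        return self.embed_documents([text])[0]
```

Remember that switching implementations still means re-embedding your document collection—the interface saves you code changes, not the re-indexing job.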
Common Mistakes That Break Production Systems
Choosing based on benchmark scores alone
A model that tops the MTEB leaderboard might not handle your specific content well. We've seen this repeatedly—teams choose the highest-ranked general model and get worse results than a lower-ranked specialized model. One client in financial services switched from a top-ranked general model to a finance-specific model and improved retrieval accuracy by 18%. The specialized model understood financial terminology and document structure better even though it ranked lower on general benchmarks.
Ignoring cost scaling
Embedding costs seem negligible until you scale. A $10 monthly cost becomes $1,000 at 100x volume. If your RAG system takes off, you need to know whether you can afford to scale or whether you'll need to re-architect around a cheaper model later. We've helped three clients migrate to different embedding models after launch because their initial choice became cost-prohibitive at scale. Each migration required re-embedding their entire document collection and extensive testing to ensure quality didn't degrade.
Not testing with actual data
Every document collection has unique characteristics. Technical manuals differ from customer support tickets differ from legal contracts. Your model needs to perform well on your specific content types. Testing with sample data before committing prevents expensive rebuilds later. One client discovered after implementation that their chosen model couldn't handle the mixed English-Spanish content in their customer support documents. A different model with better multilingual support would have been the right choice if they'd tested properly.
Mismatching model size to requirements
Using a 1,024-dimension model when 384 dimensions would work wastes compute and storage. Using a compact model when you need nuanced semantic understanding sacrifices accuracy. Right-sizing matters. We generally start with mid-sized models (768-1,024 dimensions) and scale up or down based on actual performance requirements. Most applications don't need the largest available models, but some genuinely benefit from the additional semantic richness.
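Right-sizing doesn't always mean changing models. OpenAI's text-embedding-3 models, for example, accept a dimensions parameter that returns shorter vectors from the same model, which you can test against your accuracy numbers before committing. A minimal sketch, with a rough storage estimate for comparison—the chunk count is a placeholder:

```python
from openai import OpenAI

client = OpenAI()

# Ask the same model for 1,024-dimension vectors instead of the default 3,072.
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=["Troubleshooting guide for error E-113"],
    dimensions=1024,
)
print(len(response.data[0].embedding))  # 1024

# Rough storage impact at, say, 1 million chunks stored as float32:
for dims in (384, 1024, 3072):
    print(dims, f"{1_000_000 * dims * 4 / 1e9:.1f} GB")
# 384 -> 1.5 GB, 1024 -> 4.1 GB, 3072 -> 12.3 GB
```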
Making the Right Choice for Your Use Case
The best embedding model for your RAG system depends on your specific requirements—content type, query patterns, budget, latency needs, and infrastructure capabilities.
For most business applications, OpenAI's text-embedding-3-large provides a strong balance of performance, reliability, and reasonable cost. It's where we start with clients unless they have specific requirements that push them toward specialized models.
For search-heavy applications where retrieval quality is critical, Cohere's embed-english-v3.0 with asymmetric document and query models often delivers better results. The additional complexity is worth it when precision directly impacts business outcomes.
For cost-sensitive applications or teams with strong infrastructure capabilities, self-hosted models like all-MiniLM-L6-v2 provide good performance at lower ongoing costs. The trade-off is operational complexity and infrastructure requirements.
The key is testing before committing. Measure what matters for your application with your actual data. Don't over-engineer with the highest-dimensional model when a simpler one would work, and don't under-invest in embedding quality if retrieval accuracy is critical to your user experience.
Your embedding model is the foundation of your RAG system's performance. Getting this choice right saves you from expensive rebuilds later and ensures your system actually works when real users start asking real questions.