Once you know you need a reranker, the decision is hosted convenience versus self-host control. Cohere Rerank 3.5 (~595-603ms) is the strongest zero-effort default. Voyage Rerank 2.5 matches it and adds code/legal domain variants worth +2-4 NDCG@10. On the open-source side, Jina Reranker v3 is the only top-tier model under 200ms (81.33% Hit@1 at 188ms), with a 131k-token context and listwise scoring of 64 docs at once. Nemotron edges it on accuracy (83.00% Hit@1) but costs you 243ms. BGE Reranker v2-m3 is the lightweight multilingual baseline. Pick hosted if you want zero ops, self-host Jina v3 if you have a strict latency budget or a GPU, and reach for a domain variant only when your corpus is code or legal.
If your RAG pipeline already retrieves the right document somewhere in its top 50 but the answer still cites the wrong chunk, the bottleneck is not retrieval; it is ranking. That is the precise problem a reranker solves, and the question this post answers is not whether you need one but which reranker model to deploy. Cohere, Voyage, Jina, and BGE all ship production-grade rerankers in 2026, and they make genuinely different tradeoffs on latency, NDCG, domain accuracy, and whether you host them yourself.
This is a WHICH-model comparison, not a should-you-rerank explainer. If you are still deciding whether reranking belongs in your stack at all, start with our guide on when RAG reranking actually improves retrieval and come back here once you have confirmed the lift is real on your corpus. Assuming you are past that gate, the decision narrows to two axes: hosted convenience versus self-host control, and general-purpose quality versus domain or latency specialization. Get those two right and the model practically picks itself.
We will walk through the two hosted heavyweights (Cohere Rerank 3.5 and Voyage Rerank 2.5), the domain variants that buy +2-4 NDCG@10 on code and legal corpora, the open-source self-host contenders (Jina Reranker v3, Nemotron, and BGE v2-m3), the sub-200ms latency budget and which models actually meet it, multilingual and long-context needs, and a decision matrix that maps your constraints to a specific model. Throughout, the numbers are public benchmark and vendor figures, not invented case-study metrics, and where your own corpus is likely to behave differently, we say so.
The Decision Is Hosted Convenience vs Self-Host Control
Before any benchmark, settle the deployment question, because it eliminates half the field instantly. A reranker is a cross-encoder: it takes your query and a candidate document together and produces a relevance score by reading both at once. That joint encoding is why rerankers beat bi-encoder embeddings on precision, and also why they are slower. Every reranking call is an inference pass over every candidate document.
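To make the "one inference pass per candidate" point concrete, here is a minimal sketch of the reranking step. The `toy_overlap_score` function is a hypothetical stand-in for a real cross-encoder (it only counts shared words), and `rerank` and `ScoreFn` are illustrative names, not any vendor's API. What matters is the shape: the scorer sees the query and the document together, and it runs once for every candidate.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a cross-encoder: any function that reads the
# query and one document together and returns a relevance score.
ScoreFn = Callable[[str, str], float]


def toy_overlap_score(query: str, document: str) -> float:
    """Toy scorer (NOT a real model): fraction of query terms found in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


def rerank(
    query: str, candidates: List[str], score_fn: ScoreFn, top_n: int = 3
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair, then return the top_n by score."""
    # One scoring call per candidate: this loop is where reranker latency comes from.
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


candidates = [
    "billing faq for annual plans",
    "how to reset your password",
    "password policy and reset steps for admins",
]
results = rerank("reset password", candidates, toy_overlap_score, top_n=2)
```

Swapping `toy_overlap_score` for a call into a real model keeps the rest of the function unchanged, which is also why the cost grows with the number of candidates you pass in.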
Hosted rerankers (Cohere, Voyage) give you that inference as an API call. You send the query and your top-k candidates, you get back scored and reordered results, and you never think about GPUs, batching, or model serving. The cost is a network round-trip on every query (roughly 595-603ms average for Cohere Rerank 3.5) and per-call API pricing that scales with your traffic.
Self-hosted rerankers (Jina Reranker v3, Nemotron, BGE v2-m3) run the model inside your own infrastructure. You take on serving, scaling, and monitoring, and in exchange you get dramatically lower wire latency (Jina v3 at 188ms versus 595ms+ for a hosted round-trip), no per-call pricing, and full control over data residency. The crossover is almost always latency and query volume, not headline cost. At low traffic with relaxed latency, hosted wins on time-to-ship. At high QPS or a strict sub-200ms budget, self-host wins decisively.
If you have no GPU and moderate traffic, stop here and use a hosted reranker. The rest of the self-host analysis only matters if latency or volume forces your hand.
| Axis | Hosted (Cohere / Voyage) | Self-host (Jina / Nemotron / BGE) |
|---|---|---|
| Wire latency | ~595-603ms (incl. round-trip) | 188-243ms local |
| Ops burden | None | Model serving, scaling, monitoring |
| Pricing model | Per-call API | Infra cost, no per-call |
| Data residency | Vendor-controlled | Full control |
| Time to ship | Minutes | Days |
| Best at | Moderate traffic, no GPU | High QPS, strict latency, GPU available |
Hosted Rerankers: Cohere Rerank 3.5 vs Voyage Rerank 2.5
For teams that want a reranker working this afternoon, Cohere Rerank 3.5 is the strongest default. It is a mature cross-encoder with broad multilingual coverage, a stable API, and consistent NDCG across mixed general corpora. Average latency sits around 595-603ms including the network round-trip, which is fine for RAG chat and most knowledge-base assistants but too slow for autocomplete or anything with a sub-300ms end-to-end target. You change one API call, point it at your retrieved candidates, and ship. That is the entire appeal, and it is a real one.
Voyage Rerank 2.5 is the other hosted heavyweight, and on general corpora it lands in the same latency band as Cohere with comparable ranking quality. The reason to reach for Voyage over Cohere is not the generic model but the domain variants.
Domain Variants Buy +2-4 NDCG@10 on Code and Legal
Voyage ships reranker variants tuned for specific domains, and the most impactful are the code and legal models. On those corpora, the domain-tuned variant buys roughly +2-4 NDCG@10 over a generic reranker. That is a large delta in ranking terms. NDCG@10 differences of one to two points often separate the top three models on a leaderboard, so a +2-4 swing from picking the right domain model usually dwarfs the difference between any two general rerankers.
The practical rule is simple. If your corpus is source code (a codebase search assistant, a developer Q&A tool over internal repos) or legal text (contracts, case law, regulatory filings), test the Voyage domain variant first. The accuracy gain on those document types is where domain tuning earns its keep. If your documents are general business text (support articles, product docs, mixed knowledge bases), the domain variants do not apply and Cohere versus Voyage comes down to pricing, language coverage, and existing vendor relationships. Do not pay for a domain variant on a corpus it was not trained for: you get the latency cost without the accuracy benefit.
One caveat that applies to both hosted vendors: pricing and model versions in this category shift frequently. Confirm the current per-call rate and the latest model version before you budget, and treat any latency figure (including the ones here) as a starting estimate to validate against your own region and payload size.
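Since NDCG@10 is the metric behind the +2-4 figure in this section, here is a minimal sketch of how it is commonly computed. It assumes linear gain and takes the ideal ranking from the same list of judged candidates, which is a simplification (some evaluations use exponential gain or the full set of relevant documents). The function names are illustrative. The result is on a 0-1 scale, so a "+2-4 point" gain presumably corresponds to roughly +0.02-0.04 here.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k ranked positions (linear gain)."""
    # Position 0 is rank 1, so the discount is log2(rank + 1) = log2(index + 2).
    return sum(rel / math.log2(index + 2) for index, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in the order the reranker returned the documents.
perfect = ndcg_at_k([1, 1, 0, 0])
imperfect = ndcg_at_k([0, 1, 0, 1])
```

A ranking that puts every relevant document first scores 1.0; pushing relevant documents down the list lowers the score, with early positions weighted most heavily.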
Open-Source Self-Host: Jina v3, Nemotron, and BGE v2-m3
When latency or volume pushes you to self-host, three open-source rerankers dominate the conversation, and they are not interchangeable.
Jina Reranker v3 is the one to beat on the latency-accuracy frontier. It scores 81.33% Hit@1 at 188ms, making it the only top-tier reranker that stays under the 200ms line. Two architectural choices set it apart. First, it is listwise: instead of scoring each query-document pair independently, it compares up to 64 documents together in a single pass, which improves ranking consistency across the candidate set. Second, it carries a 131k-token context window, so it can score long documents without truncating the passages that actually matter. For long-document RAG or pipelines that rerank many candidates at once, that combination is hard to match.
Nemotron edges Jina on raw accuracy at 83.00% Hit@1, but it costs you 243ms to get there. That is a real tradeoff: roughly 1.7 points of Hit@1 for 55ms of extra latency. If your latency budget has headroom and ranking quality is the dominant concern, Nemotron is defensible. If you are defending a sub-200ms target, it disqualifies itself, because 243ms blows the budget before you have added retrieval and generation.
BGE Reranker v2-m3 is the lightweight multilingual baseline. It does not top the accuracy charts, but it is small, fast on modest hardware, and covers many languages, which makes it the pragmatic open-source starting point when you need broad language support without standing up heavy GPU infrastructure. Treat it as the sensible default to benchmark against, not the model you reach for when you need the absolute best ranking.
The Hit@1 and latency figures here are public benchmark numbers and are directional. Your corpus, chunking strategy, and query distribution will move them, sometimes by several points, which is why the benchmarking section below is not optional.
| Model | Accuracy (Hit@1 where reported) | Latency | Context / coverage | Deployment | Best for |
|---|---|---|---|---|---|
| Cohere Rerank 3.5 | (general) | ~595-603ms | multilingual | Hosted | Zero-ops default |
| Voyage Rerank 2.5 | (general, +2-4 NDCG@10 domain) | ~595-603ms | multilingual | Hosted | Code / legal corpora |
| Jina Reranker v3 | 81.33% | 188ms | 131k tokens | Self-host | Strict sub-200ms budget |
| Nemotron | 83.00% | 243ms | standard | Self-host | Max accuracy, relaxed latency |
| BGE Reranker v2-m3 | baseline | low (small batch) | standard | Self-host | Multilingual on a budget |
The Sub-200ms Latency Budget and Which Models Meet It
For user-facing RAG, latency is a budget you spend across stages: retrieval, reranking, prompt assembly, and generation. The reranker is one line item in that budget, and it is the one most teams underestimate. If your end-to-end target is aggressive, the reranker has to fit inside a sub-200ms slice, and that single constraint eliminates most of the field.
Run the math. A hosted reranker at 595-603ms cannot fit a sub-200ms reranking budget, full stop; the network round-trip alone overshoots it. Nemotron at 243ms misses it too. Of the strong rerankers, Jina Reranker v3 at 188ms is the only top-tier model that fits under 200ms, which is why it is the default recommendation for latency-critical self-hosted pipelines. BGE v2-m3 can also come in fast on small batches, but at lower accuracy.
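The same arithmetic as a small sketch, using only the latency figures quoted in this post (the hosted entries use the lower bound of the 595-603ms range). The dictionary and function names are illustrative, and you should replace the values with latencies measured in your own environment.

```python
from typing import Dict, List

# Latency figures quoted in this post, in milliseconds.
# Hosted models use the lower bound of the ~595-603ms range.
QUOTED_LATENCY_MS: Dict[str, float] = {
    "Cohere Rerank 3.5": 595.0,
    "Voyage Rerank 2.5": 595.0,
    "Jina Reranker v3": 188.0,
    "Nemotron": 243.0,
}


def models_within_budget(latencies_ms: Dict[str, float], budget_ms: float) -> List[str]:
    """Return the models whose latency fits the reranking budget, sorted by name."""
    return sorted(name for name, ms in latencies_ms.items() if ms <= budget_ms)


fits_200 = models_within_budget(QUOTED_LATENCY_MS, budget_ms=200.0)
fits_300 = models_within_budget(QUOTED_LATENCY_MS, budget_ms=300.0)
```

BGE v2-m3 is left out because this post quotes no specific latency figure for it.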
Two levers control reranker latency beyond model choice. The first is candidate count: reranker latency scales with the number of documents you score, so feeding the model 500 candidates is far slower than feeding it 80. Cut your first-stage retrieval top-k to 50-100 before reranking. You are not losing recall that matters, because anything below rank 100 from a decent retriever is rarely the right answer anyway. The second is batching and hardware for self-hosted models, where a single warm GPU is the difference between meeting and missing your budget.
Measure P95, not the average. Reranker latency has a tail that grows with candidate count and payload size, and the average will hide a P95 that breaks your SLA. Across the production retrieval stacks we have tuned at Particula Tech, the most common reranker latency surprise is a long tail driven by an unbounded retrieval top-k flowing straight into the reranker. The fix is almost always capping candidates, not swapping models. For first-stage retrieval quality that feeds the reranker, our guide on combining dense and sparse embeddings covers the hybrid setup that gives the reranker the best possible candidate set to work with.
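A short sketch of why the mean misleads. The latency samples below are made up purely for illustration (they are not measurements of any model), and the nearest-rank percentile is one of several common definitions.

```python
import math
from statistics import mean
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


# Illustrative (made-up) latencies in ms: mostly fast, with a slow tail
# of the kind an uncapped candidate count can produce.
latencies_ms = [180.0] * 18 + [900.0] * 2

average_ms = mean(latencies_ms)
p95_ms = percentile_nearest_rank(latencies_ms, 95)
```

With these samples the average sits at 252ms while the P95 is 900ms, so a dashboard that shows only the mean would hide exactly the requests that break the SLA.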
Multilingual and Long-Context Reranking Needs
Two corpus properties override the general recommendation: language coverage and document length.
For multilingual corpora, your shortlist narrows. BGE Reranker v2-m3 is the open-source multilingual baseline and the pragmatic self-host choice when you need many languages without heavy infrastructure. Cohere Rerank 3.5 offers broad multilingual coverage as a hosted option for teams that would rather not run the model. The mistake to avoid is deploying an English-tuned reranker over a multilingual corpus and concluding that reranking does not help, when the model was simply never trained for your languages.
For long-context reranking, document length is the deciding factor. A reranker has to read the full document to score it, and if the model truncates at a few thousand tokens, it never sees the passage that makes the document relevant. Jina Reranker v3's 131k-token context window is the standout here: it can score long documents (full contracts, long technical articles, multi-page reports) without truncation, and its listwise scoring of up to 64 documents at once suits pipelines that rerank many long candidates together. If your documents routinely exceed a few thousand tokens, prioritize a long-context reranker over a marginally higher Hit@1 on short-document benchmarks. The benchmark advantage is irrelevant if the model cannot see half your document.
The interaction with chunking matters too. If you chunk aggressively (small chunks), context length is less of a constraint and you can use any reranker, but you may reorder fragments that lose meaning out of context. If you keep large chunks or rerank whole documents, long context becomes essential. Reranker choice and chunking strategy are not independent decisions.
Decision Matrix: Mapping Constraints to a Model
The vendor pages push feature checklists. The real decision runs through three questions in order: can you self-host, how tight is your latency budget, and is your corpus a special domain or language.
A few patterns are worth stating plainly, in the same opinionated spirit as the rest of this comparison:
Whichever model you pick, the choice is only validated by measurement against your own data, which is the one step teams skip and regret.
| Your situation | Pick | Why |
|---|---|---|
| No GPU, moderate traffic, general corpus | Cohere Rerank 3.5 | Zero ops, strongest hosted default, ship today |
| Corpus is code or legal | Voyage Rerank 2.5 (domain variant) | +2-4 NDCG@10 over generic on those domains |
| Strict sub-200ms budget, GPU available | Jina Reranker v3 | Only top-tier model under 200ms (81.33% Hit@1 @ 188ms) |
| Max accuracy, latency has headroom | Nemotron | Highest Hit@1 (83.00%), 243ms cost is acceptable |
| Long documents (>few thousand tokens) | Jina Reranker v3 | 131k-token context, no truncation |
| Multilingual, self-host, budget-conscious | BGE Reranker v2-m3 | Lightweight multilingual baseline |
| High QPS where per-call pricing hurts | Self-host (Jina or BGE) | No per-call cost, latency under control |
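
The matrix above can also be read as a short rule chain. The sketch below is one possible encoding of it, not a definitive policy: the `Constraints` fields, the order of the checks, and the 243ms and 595ms thresholds (taken from the latency figures quoted in this post) are assumptions about how the rows should be prioritized when several apply.

```python
from dataclasses import dataclass


@dataclass
class Constraints:
    can_self_host: bool
    rerank_budget_ms: float
    domain: str = "general"      # "general", "code", or "legal"
    long_documents: bool = False
    multilingual: bool = False


def pick_reranker(c: Constraints) -> str:
    """One reading of the decision matrix; adjust the order to your own priorities."""
    specialized = c.domain in ("code", "legal")
    if not c.can_self_host:
        # Hosted only: domain variant for code/legal, otherwise the zero-ops default.
        return "Voyage Rerank 2.5 (domain variant)" if specialized else "Cohere Rerank 3.5"
    if c.long_documents or c.rerank_budget_ms < 243:
        # Nemotron needs ~243ms, so a tighter budget or long inputs points to Jina v3.
        return "Jina Reranker v3"
    if specialized and c.rerank_budget_ms >= 595:
        # Only consider the hosted domain variant when the budget allows a round-trip.
        return "Voyage Rerank 2.5 (domain variant)"
    if c.multilingual:
        return "BGE Reranker v2-m3"
    return "Nemotron"


hosted_default = pick_reranker(Constraints(can_self_host=False, rerank_budget_ms=1000))
strict_latency = pick_reranker(Constraints(can_self_host=True, rerank_budget_ms=200))
legal_corpus = pick_reranker(Constraints(can_self_host=True, rerank_budget_ms=1000, domain="legal"))
accuracy_first = pick_reranker(Constraints(can_self_host=True, rerank_budget_ms=500))
```

Treat the output as a shortlist entry to benchmark, not a final answer.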
Benchmark on Your Own Corpus Before You Commit
Every number in this post (81.33% Hit@1 for Jina v3, 83.00% for Nemotron, +2-4 NDCG@10 for Voyage domain variants, 595-603ms for Cohere) is a public benchmark or vendor figure. They are directional, not transferable. Your documents, query style, and chunking will move them, and the model that wins a public leaderboard can lose on your corpus.
The benchmark protocol is straightforward. Build a small labeled set: 50-100 queries from your real traffic, each with one or more known-relevant documents from your corpus. Hold your first-stage retrieval constant, then swap only the reranker and measure NDCG@10 and Hit@1 for each candidate against the same retrieved set. Record end-to-end latency per query alongside ranking quality, and report P95, not just the mean. Fifty to a hundred labeled queries is enough to separate the front-runners reliably; you do not need thousands to see which model ranks your data best.
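A minimal sketch of that protocol is below. It assumes each reranker is wrapped behind a hypothetical `Reranker` callable that takes a query and the fixed list of first-stage candidate ids and returns the ids best first. The two toy rerankers exist only to exercise the harness. The timer here covers the rerank call alone, so add retrieval and generation time separately if you want end-to-end latency, and NDCG@10 can be accumulated in the same loop.

```python
import math
import time
from typing import Callable, Dict, List, Set

# Hypothetical interface: wrap your real model or API call behind this signature.
Reranker = Callable[[str, List[str]], List[str]]


def evaluate_reranker(
    reranker: Reranker,
    retrieved: Dict[str, List[str]],  # query -> fixed first-stage candidate ids
    relevant: Dict[str, Set[str]],    # query -> known-relevant doc ids
) -> Dict[str, float]:
    """Measure Hit@1 and P95 rerank latency over a labeled query set."""
    if not retrieved:
        raise ValueError("need at least one labeled query")
    hits = 0
    latencies_ms: List[float] = []
    for query, candidate_ids in retrieved.items():
        start = time.perf_counter()
        ranked = reranker(query, list(candidate_ids))
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        if ranked and ranked[0] in relevant[query]:
            hits += 1
    latencies_ms.sort()
    p95_index = max(math.ceil(0.95 * len(latencies_ms)), 1) - 1
    return {"hit_at_1": hits / len(retrieved), "p95_ms": latencies_ms[p95_index]}


def keep_order(query: str, ids: List[str]) -> List[str]:
    """Toy baseline: leave the retriever's order untouched."""
    return ids


def reverse_order(query: str, ids: List[str]) -> List[str]:
    """Toy 'reranker' that simply reverses the candidates."""
    return ids[::-1]


retrieved = {"q1": ["a", "b"], "q2": ["c", "d"]}
relevant = {"q1": {"b"}, "q2": {"d"}}
baseline = evaluate_reranker(keep_order, retrieved, relevant)
candidate = evaluate_reranker(reverse_order, retrieved, relevant)
```

Because `retrieved` is identical for every run, any difference between `baseline` and `candidate` comes from the reranker alone.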
Crucially, confirm the reranker actually changes answers, not just scores. A reranker that reorders candidates but never changes which chunk reaches the LLM is pure latency cost. Pair the ranking metrics with end-to-end RAG evaluation; our guide on telling whether your RAG system actually works covers the answer-level metrics that catch this. And because reranker quality is downstream of embedding quality, if your candidate set is weak no reranker can save it; our breakdown of choosing embedding models for RAG covers the first-stage decision that determines how much work the reranker has to do.
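One cheap check for the "changes answers, not just scores" point is to measure how often reranking alters the set of chunks that actually reach the LLM. The sketch below assumes you pass the top `top_n` chunks into the prompt; the function name and the sample ids are illustrative.

```python
from typing import List, Sequence


def context_change_rate(
    before: Sequence[List[str]], after: Sequence[List[str]], top_n: int = 3
) -> float:
    """Share of queries whose top_n chunk set differs once reranking is applied."""
    if len(before) != len(after):
        raise ValueError("before and after must cover the same queries")
    if not before:
        return 0.0
    changed = sum(
        1 for b, a in zip(before, after) if set(b[:top_n]) != set(a[:top_n])
    )
    return changed / len(before)


# Per-query chunk ids, in ranked order, before and after reranking.
before = [["a", "b", "c", "d"], ["a", "b", "c", "d"]]
after = [["b", "a", "c", "d"], ["d", "a", "b", "c"]]
rate = context_change_rate(before, after, top_n=3)
```

In this toy data the first query is only reshuffled inside the top 3, so the prompt is unchanged; the second pulls a new chunk in. A rate near zero on your real traffic suggests the reranker is adding latency without changing what the LLM sees.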
When we tune retrieval pipelines at Particula Tech, the reranker benchmark is a half-day exercise that routinely changes the chosen model versus the leaderboard favorite, because real corpora rarely look like academic ones. The broader strategic context, when reranking earns its place in the pipeline versus when it is premature latency, lives in our RAG systems pillar.
The summary holds: pick hosted (Cohere) for zero-ops convenience on general corpora, Voyage domain variants for code and legal, Jina Reranker v3 for a strict sub-200ms budget or long documents, Nemotron when accuracy outranks latency, and BGE v2-m3 for multilingual on a budget. Then prove it on your own data before you ship, because the reranker that matters is the one that wins on your corpus, not the one with the best leaderboard line.
Frequently Asked Questions
There is no single best reranker; the answer splits by deployment. For hosted convenience with zero tuning, Cohere Rerank 3.5 is the strongest default, with average latency around 595-603ms and reliable NDCG across general corpora. For self-hosting under a tight latency budget, Jina Reranker v3 is the standout: 81.33% Hit@1 at 188ms, the only top-tier model that stays under 200ms. If you can absorb the extra latency, Nemotron scores slightly higher at 83.00% Hit@1 but takes 243ms. Voyage Rerank 2.5 matches Cohere and adds domain-tuned code and legal variants. Benchmark two or three on your own corpus before committing, because public NDCG numbers rarely transfer cleanly to your documents.



