What is the best reranker model in 2026?

There is no single best reranker; the answer splits by deployment. For hosted convenience with zero tuning, Cohere Rerank 3.5 is the strongest default, with average latency around 595-603ms and reliable NDCG across general corpora. For self-hosting under a tight latency budget, Jina Reranker v3 is the standout: 81.33% Hit@1 at 188ms, the only top-tier model that stays under 200ms. If you can absorb the extra latency, Nemotron scores slightly higher at 83.00% Hit@1 but takes 243ms. Voyage Rerank 2.5 matches Cohere on general corpora and adds domain-tuned code and legal variants. Benchmark two or three on your own corpus before committing, because public NDCG numbers rarely transfer cleanly to your documents.

Should I self-host a reranker or use a hosted API?

Use a hosted reranker (Cohere or Voyage) when you want zero ops, predictable quality, and your traffic is moderate; the 595-603ms round-trip is acceptable for most RAG chat. Self-host (Jina Reranker v3, Nemotron, or BGE v2-m3) when you have a strict sub-200ms latency budget, high query volume that makes per-call API pricing painful, data-residency or compliance constraints, or an available GPU. The crossover is usually latency and volume, not cost alone. Self-hosting Jina v3 on a single GPU can hold 188ms locally versus 595ms+ for a hosted round-trip, but you take on model serving, scaling, and monitoring. Hosted wins on time-to-ship; self-host wins on tail latency and high-QPS economics.

How much latency does a reranker add to a RAG pipeline?

A reranker adds one extra network and inference hop after retrieval, typically 150-600ms depending on the model and whether it is hosted or local. Hosted rerankers like Cohere Rerank 3.5 average 595-603ms including the round-trip. Self-hosted models are faster because they skip the external round-trip: Jina Reranker v3 runs at 188ms, Nemotron at 243ms, and the lightweight BGE v2-m3 is also fast on small batches. For user-facing RAG where latency is critical, budget the reranker inside a sub-200ms target, which in practice means self-hosting Jina v3. Latency also scales with the number of candidate documents you rerank, so cut your retrieval top-k to 50-100 before reranking rather than feeding the model 500 candidates.

What is the difference between Cohere Rerank and Voyage Rerank?

Cohere Rerank 3.5 and Voyage Rerank 2.5 are both strong hosted cross-encoders with similar average latency around 595-603ms, but Voyage differentiates with domain-specific variants. Cohere is the safer general-purpose default: broad language coverage, mature API, and consistent NDCG on mixed corpora. Voyage offers code and legal reranker variants that buy roughly +2-4 NDCG@10 over a generic reranker on those specific domains, which is significant if your corpus is source code or contracts. If your documents are general business text, Cohere and Voyage are close enough that you should pick on pricing and existing vendor relationships. If your corpus is clearly code or legal, the Voyage domain variant is worth testing first.

Does a reranker improve RAG accuracy enough to justify it?

Yes, for most multi-stage retrieval pipelines a reranker meaningfully lifts the precision of the top results, which is what the LLM actually reads. Retrieval with dense embeddings is optimized for recall over a large candidate set; a cross-encoder reranker re-scores the top 50-100 candidates by reading the query and document together, which catches relevance that embedding distance misses. The lift shows up most when your first-stage retrieval returns plausible-but-wrong chunks. The cost is one extra hop of 150-600ms. If your retrieval already returns the right chunk in position one most of the time, a reranker adds latency for little gain, so measure first. Reranking is highest-value when answer quality is bottlenecked on which chunks reach the prompt.

What context length and multilingual support do rerankers offer?

Context length and language coverage vary widely, and they matter when your documents are long or multilingual. Jina Reranker v3 leads on context with a 131k-token window and listwise scoring that compares up to 64 documents together, which suits long-document and many-candidate reranking. BGE Reranker v2-m3 is the lightweight multilingual baseline, a good open-source choice when you need many languages without heavy infrastructure. Cohere Rerank 3.5 offers broad multilingual coverage as a hosted option. If your documents routinely exceed a few thousand tokens, prioritize a long-context reranker so you are not truncating relevant passages before scoring. For many-language corpora on a budget, BGE v2-m3 is the pragmatic starting point.

How do I benchmark rerankers on my own data?

Build a small labeled set of queries with known-relevant documents from your own corpus, then measure NDCG@10 and Hit@1 for each candidate reranker against the same first-stage retrieval. Public benchmark numbers (81.33% Hit@1 for Jina v3, 83.00% for Nemotron) are directional, not transferable; your documents, query style, and chunking determine real performance. Hold retrieval constant, swap only the reranker, and record both ranking quality and end-to-end latency per query. Fifty to a hundred labeled queries is enough to separate the front-runners. Measure tail latency (P95), not just averages, because reranker latency scales with candidate count. Pair this with end-to-end RAG evaluation so you confirm the reranker actually changes answers, not just scores.

Reranker Models Compared: Cohere vs Voyage vs Jina vs BGE

Jina Reranker v3 hits 81.33% Hit@1 at 188ms, the only top-tier sub-200ms model. The latency and NDCG breakdown across Cohere, Voyage, Jina, and BGE.

Sebastian Mondragon · MAY 22, 2026 · 11 MIN READ

If your RAG pipeline already retrieves the right document somewhere in its top 50 but the answer still cites the wrong chunk, the bottleneck is not retrieval; it is ranking. That is the precise problem a reranker solves, and the question this post answers is not whether you need one but which reranker model to deploy. Cohere, Voyage, Jina, and BGE all ship production-grade rerankers in 2026, and they make genuinely different tradeoffs on latency, NDCG, domain accuracy, and whether you host them yourself.

This is a WHICH-model comparison, not a should-you-rerank explainer. If you are still deciding whether reranking belongs in your stack at all, start with our guide on when RAG reranking actually improves retrieval and come back here once you have confirmed the lift is real on your corpus. Assuming you are past that gate, the decision narrows to two axes: hosted convenience versus self-host control, and general-purpose quality versus domain or latency specialization. Get those two right and the model practically picks itself.

We will walk through the two hosted heavyweights (Cohere Rerank 3.5 and Voyage Rerank 2.5), the domain variants that buy +2-4 NDCG@10 on code and legal corpora, the open-source self-host contenders (Jina Reranker v3, Nemotron, and BGE v2-m3), the sub-200ms latency budget and which models actually meet it, multilingual and long-context needs, and a decision matrix that maps your constraints to a specific model. Throughout, the numbers are public benchmark and vendor figures, not invented case-study metrics, and where your own corpus will behave differently we say so.

01 · The Decision Is Hosted Convenience vs Self-Host Control

Before any benchmark, settle the deployment question, because it eliminates half the field instantly. A reranker is a cross-encoder: it takes your query and a candidate document together and produces a relevance score by reading both at once. That joint encoding is why rerankers beat bi-encoder embeddings on precision, and also why they are slower. Every reranking call is an inference pass over every candidate document.
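To make the shape of that step concrete, here is a minimal sketch of a rerank pass. The function names (`rerank`, `toy_overlap_score`) are hypothetical, and the scorer is a toy word-overlap function standing in for a real cross-encoder, so only the control flow carries over to a production model: one scoring call per candidate, then a sort.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and return the top_n by score."""
    # One scoring call per candidate: this is the cost that grows with top-k.
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    # Highest score first. The sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: share of query terms found in doc."""
    query_terms = set(query.lower().split())
    doc_terms = set(doc.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


candidates = [
    "Billing FAQ for enterprise plans",
    "How to rotate an API key safely",
    "API key rotation policy and schedule",
]
top = rerank("rotate api key", candidates, toy_overlap_score, top_n=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Swapping `toy_overlap_score` for a call into whichever model you deploy leaves the rest unchanged, which is also why the benchmark later in this post can hold everything constant except the scorer.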

Hosted rerankers (Cohere, Voyage) give you that inference as an API call. You send the query and your top-k candidates, you get back scored and reordered results, and you never think about GPUs, batching, or model serving. The cost is a network round-trip on every query (roughly 595-603ms average for Cohere Rerank 3.5) and per-call API pricing that scales with your traffic.

Self-hosted rerankers (Jina Reranker v3, Nemotron, BGE v2-m3) run the model inside your own infrastructure. You take on serving, scaling, and monitoring, and in exchange you get dramatically lower wire latency (Jina v3 at 188ms versus 595ms+ for a hosted round-trip), no per-call pricing, and full control over data residency. The crossover is almost always latency and query volume, not headline cost. At low traffic with relaxed latency, hosted wins on time-to-ship. At high QPS or a strict sub-200ms budget, self-host wins decisively.

If you have no GPU and moderate traffic, stop here and use a hosted reranker. The rest of the self-host analysis only matters if latency or volume forces your hand.

Axis	Hosted (Cohere / Voyage)	Self-host (Jina / Nemotron / BGE)
Wire latency	~595-603ms (incl. round-trip)	188-243ms local
Ops burden	None	Model serving, scaling, monitoring
Pricing model	Per-call API	Infra cost, no per-call
Data residency	Vendor-controlled	Full control
Time to ship	Minutes	Days
Best at	Moderate traffic, no GPU	High QPS, strict latency, GPU available
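The volume side of that crossover can be estimated with one division. The sketch below is a simplification under stated assumptions: it treats self-hosting as a flat monthly cost, ignores engineering time, and uses placeholder prices that are not real vendor rates. The function name and both parameters are hypothetical.

```python
def breakeven_queries_per_month(
    hosted_price_per_1k_queries: float,
    selfhost_fixed_cost_per_month: float,
) -> float:
    """Monthly query volume at which a flat self-host cost equals hosted API spend."""
    if hosted_price_per_1k_queries <= 0:
        raise ValueError("hosted price must be positive")
    return selfhost_fixed_cost_per_month / hosted_price_per_1k_queries * 1000


# Placeholder numbers for illustration only, not real pricing.
volume = breakeven_queries_per_month(
    hosted_price_per_1k_queries=2.0,
    selfhost_fixed_cost_per_month=1000.0,
)
print(f"Break-even at about {volume:,.0f} queries per month")
```

Plug in the current per-call rate and your own GPU and monitoring costs. Below the break-even volume, the cost argument for self-hosting disappears and only the latency and data-residency arguments remain.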

02 · Hosted Rerankers: Cohere Rerank 3.5 vs Voyage Rerank 2.5

For teams that want a reranker working this afternoon, Cohere Rerank 3.5 is the strongest default. It is a mature cross-encoder with broad multilingual coverage, a stable API, and consistent NDCG across mixed general corpora. Average latency sits around 595-603ms including the network round-trip, which is fine for RAG chat and most knowledge-base assistants but too slow for autocomplete or anything with a sub-300ms end-to-end target. You change one API call, point it at your retrieved candidates, and ship. That is the entire appeal, and it is a real one.

Voyage Rerank 2.5 is the other hosted heavyweight, and on general corpora it lands in the same latency band as Cohere with comparable ranking quality. The reason to reach for Voyage over Cohere is not the generic model but the domain variants.

Domain Variants Buy +2-4 NDCG@10 on Code and Legal

Voyage ships reranker variants tuned for specific domains, and the most impactful are the code and legal models. On those corpora, the domain-tuned variant buys roughly +2-4 NDCG@10 over a generic reranker. That is a large delta in ranking terms. NDCG@10 differences of one to two points often separate the top three models on a leaderboard, so a +2-4 swing from picking the right domain model usually dwarfs the difference between any two general rerankers.

The practical rule is simple. If your corpus is source code (a codebase search assistant, a developer Q&A tool over internal repos) or legal text (contracts, case law, regulatory filings), test the Voyage domain variant first. The accuracy gain on those document types is where domain tuning earns its keep. If your documents are general business text (support articles, product docs, mixed knowledge bases), the domain variants do not apply and Cohere versus Voyage comes down to pricing, language coverage, and existing vendor relationships. Do not pay for a domain variant on a corpus it was not trained for, because you get the latency cost without the accuracy benefit.

One caveat applies to both hosted vendors: pricing and model versions in this category shift frequently. Confirm the current per-call rate and the latest model version before you budget, and treat any latency figure (including the ones here) as a starting estimate to validate against your own region and payload size.

03 · Open-Source Self-Host: Jina v3, Nemotron, and BGE v2-m3

When latency or volume pushes you to self-host, three open-source rerankers dominate the conversation, and they are not interchangeable.

Jina Reranker v3 is the one to beat on the latency-accuracy frontier. It scores 81.33% Hit@1 at 188ms, making it the only top-tier reranker that stays under the 200ms line. Two architectural choices set it apart. First, it is listwise: instead of scoring each query-document pair independently, it compares up to 64 documents together in a single pass, which improves ranking consistency across the candidate set. Second, it carries a 131k-token context window, so it can score long documents without truncating the passages that actually matter. For long-document RAG or pipelines that rerank many candidates at once, that combination is hard to match.

Nemotron edges Jina on raw accuracy at 83.00% Hit@1, but it costs you 243ms to get there. That is a real tradeoff: roughly 1.7 points of Hit@1 for 55ms of extra latency. If your latency budget has headroom and ranking quality is the dominant concern, Nemotron is defensible. If you are defending a sub-200ms target, it disqualifies itself, because 243ms blows the budget before you have added retrieval and generation.

BGE Reranker v2-m3 is the lightweight multilingual baseline. It does not top the accuracy charts, but it is small, fast on modest hardware, and covers many languages, which makes it the pragmatic open-source starting point when you need broad language support without standing up heavy GPU infrastructure. Treat it as the sensible default to benchmark against, not the model you reach for when you need the absolute best ranking.

The Hit@1 and latency figures here are public benchmark numbers and are directional. Your corpus, chunking strategy, and query distribution will move them, sometimes by several points, which is why the benchmarking section below is not optional.

Model	Hit@1	Latency	Context	Deployment	Best for
Cohere Rerank 3.5	(general)	~595-603ms	multilingual	Hosted	Zero-ops default
Voyage Rerank 2.5	(general, +2-4 NDCG@10 domain)	~595-603ms	multilingual	Hosted	Code / legal corpora
Jina Reranker v3	81.33%	188ms	131k tokens	Self-host	Strict sub-200ms budget
Nemotron	83.00%	243ms	standard	Self-host	Max accuracy, relaxed latency
BGE Reranker v2-m3	baseline	low (small batch)	standard	Self-host	Multilingual on a budget

04 · The Sub-200ms Latency Budget and Which Models Meet It

For user-facing RAG, latency is a budget you spend across stages: retrieval, reranking, prompt assembly, and generation. The reranker is one line item in that budget, and it is the one most teams underestimate. If your end-to-end target is aggressive, the reranker has to fit inside a sub-200ms slice, and that single constraint eliminates most of the field.

Run the math. A hosted reranker at 595-603ms cannot fit a sub-200ms reranking budget, full stop; its average latency is roughly three times the budget. Nemotron at 243ms misses it too. Of the strong rerankers, Jina Reranker v3 at 188ms is the only top-tier model that fits under 200ms, which is why it is the default recommendation for latency-critical self-hosted pipelines. BGE v2-m3 can also come in fast on small batches, but at lower accuracy.
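The same arithmetic, written as a filter: keep the models that fit the budget, then take the most accurate of what is left. This is a sketch using only the two models for which this post quotes both a Hit@1 and a latency figure; the `Reranker` type and `best_within_budget` function are hypothetical names, and the numbers are the public figures cited above, not measurements on your corpus.

```python
from typing import List, NamedTuple, Optional


class Reranker(NamedTuple):
    name: str
    hit_at_1: float    # percent, public benchmark figure quoted in this post
    latency_ms: float  # public benchmark figure quoted in this post


# Cohere and Voyage are left out because the post gives no Hit@1 for them.
MODELS = [
    Reranker("Jina Reranker v3", 81.33, 188.0),
    Reranker("Nemotron", 83.00, 243.0),
]


def best_within_budget(models: List[Reranker], budget_ms: float) -> Optional[Reranker]:
    """Return the highest-Hit@1 model whose latency fits the budget, or None."""
    fitting = [m for m in models if m.latency_ms <= budget_ms]
    if not fitting:
        return None
    return max(fitting, key=lambda m: m.hit_at_1)


for budget in (200, 300):
    choice = best_within_budget(MODELS, budget)
    print(budget, choice.name if choice else "nothing fits")
```

With a 200ms budget the filter leaves Jina Reranker v3; relax it to 300ms and Nemotron's higher Hit@1 wins. Replace the figures with your own P95 measurements before trusting the answer.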

Two levers control reranker latency beyond model choice. The first is candidate count: reranker latency scales with the number of documents you score, so feeding the model 500 candidates is far slower than feeding it 80. Cut your first-stage retrieval top-k to 50-100 before reranking. You are not losing recall that matters, because anything below rank 100 from a decent retriever is rarely the right answer anyway. The second is batching and hardware for self-hosted models, where a single warm GPU is the difference between meeting and missing your budget.
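Capping candidates is a few lines of glue between the retriever and the reranker. The sketch below assumes the retriever returns `(doc_id, score)` pairs with higher scores meaning more relevant; `cap_candidates` and its default of 100 are illustrative choices, not part of any vendor API.

```python
from typing import List, Tuple


def cap_candidates(
    retrieved: List[Tuple[str, float]],
    max_candidates: int = 100,
) -> List[str]:
    """Keep only the highest-scoring first-stage hits before reranking."""
    if max_candidates < 1:
        raise ValueError("max_candidates must be at least 1")
    # Sort defensively in case the retriever does not guarantee ordering.
    ranked = sorted(retrieved, key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:max_candidates]]


# Simulated first-stage output: 500 hits with decreasing scores.
retrieved = [(f"doc{i}", 1.0 / (i + 1)) for i in range(500)]
capped = cap_candidates(retrieved, max_candidates=80)
print(len(capped), capped[:3])
```

Putting the cap in one place also gives you a single knob to sweep when you measure how latency and ranking quality change with candidate count.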

Measure P95, not the average. Reranker latency has a tail that grows with candidate count and payload size, and the average will hide a P95 that breaks your SLA. Across the production retrieval stacks we have tuned at Particula Tech, the most common reranker latency surprise is a long tail driven by an unbounded retrieval top-k flowing straight into the reranker. The fix is almost always capping candidates, not swapping models. For first-stage retrieval quality that feeds the reranker, our guide on combining dense and sparse embeddings covers the hybrid setup that gives the reranker the best possible candidate set to work with.
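A small worked example shows how far the mean and P95 can diverge. The latencies below are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank definition; other percentile definitions interpolate and will give slightly different values.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Illustrative latencies in ms: 18 fast queries and 2 slow ones with large payloads.
latencies_ms = [180] * 18 + [650, 900]
print("mean:", mean(latencies_ms))            # 239.5
print("p95: ", percentile(latencies_ms, 95))  # 650
```

Here the mean sits close to the fast queries while the P95 is more than three times the 200ms budget, which is the pattern an unbounded top-k tends to produce.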

05 · Multilingual and Long-Context Reranking Needs

Two corpus properties override the general recommendation: language coverage and document length.

For multilingual corpora, your shortlist narrows. BGE Reranker v2-m3 is the open-source multilingual baseline and the pragmatic self-host choice when you need many languages without heavy infrastructure. Cohere Rerank 3.5 offers broad multilingual coverage as a hosted option for teams that would rather not run the model. The mistake to avoid is deploying an English-tuned reranker over a multilingual corpus and concluding that reranking does not help, when the model was simply never trained for your languages.

For long-context reranking, document length is the deciding factor. A reranker has to read the full document to score it, and if the model truncates at a few thousand tokens, it never sees the passage that makes the document relevant. Jina Reranker v3's 131k-token context window is the standout here: it can score long documents (full contracts, long technical articles, multi-page reports) without truncation, and its listwise scoring of up to 64 documents at once suits pipelines that rerank many long candidates together. If your documents routinely exceed a few thousand tokens, prioritize a long-context reranker over a marginally higher Hit@1 on short-document benchmarks. The benchmark advantage is irrelevant if the model cannot see half your document.

The interaction with chunking matters too. If you chunk aggressively (small chunks), context length is less of a constraint and you can use any reranker, but you may reorder fragments that lose meaning out of context. If you keep large chunks or rerank whole documents, long context becomes essential. Reranker choice and chunking strategy are not independent decisions.

06 · Decision Matrix: Mapping Constraints to a Model

The vendor pages push feature checklists. The real decision runs through three questions in order: can you self-host, how tight is your latency budget, and is your corpus a special domain or language.

A few patterns are worth stating plainly, in the same opinionated spirit as the rest of this comparison:

"Higher Hit@1 is better" is not a sufficient reason to pick Nemotron over Jina v3. The 1.7-point accuracy gain costs you 55ms, and if you have a latency budget, that math goes the wrong way. Pick the highest accuracy that fits your latency budget, not the highest accuracy.

"Hosted is easier" is not a sufficient reason to stay hosted at high QPS. The 595-603ms round-trip and per-call pricing both compound with traffic; above a certain volume, self-hosting Jina v3 is faster and cheaper, and the ops cost is one GPU plus monitoring.

"Domain variants are more accurate" is only true on their domain. Running the Voyage code reranker over general business text gives you the latency without the gain. Match the variant to the corpus or use the generic model.

Whichever model you pick, the choice is only validated by measurement against your own data, which is the one step teams skip and regret.

Your situation	Pick	Why
No GPU, moderate traffic, general corpus	Cohere Rerank 3.5	Zero ops, strongest hosted default, ship today
Corpus is code or legal	Voyage Rerank 2.5 (domain variant)	+2-4 NDCG@10 over generic on those domains
Strict sub-200ms budget, GPU available	Jina Reranker v3	Only top-tier model under 200ms (81.33% Hit@1 @ 188ms)
Max accuracy, latency has headroom	Nemotron	Highest Hit@1 (83.00%), 243ms cost is acceptable
Long documents (>few thousand tokens)	Jina Reranker v3	131k-token context, no truncation
Multilingual, self-host, budget-conscious	BGE Reranker v2-m3	Lightweight multilingual baseline
High QPS where per-call pricing hurts	Self-host (Jina or BGE)	No per-call cost, latency under control

07 · Benchmark on Your Own Corpus Before You Commit

Every number in this post (81.33% Hit@1 for Jina v3, 83.00% for Nemotron, +2-4 NDCG@10 for Voyage domain variants, 595-603ms for Cohere) is a public benchmark or vendor figure. They are directional, not transferable. Your documents, query style, and chunking will move them, and the model that wins a public leaderboard can lose on your corpus.

The benchmark protocol is straightforward. Build a small labeled set: 50-100 queries from your real traffic, each with one or more known-relevant documents from your corpus. Hold your first-stage retrieval constant, then swap only the reranker and measure NDCG@10 and Hit@1 for each candidate against the same retrieved set. Record end-to-end latency per query alongside ranking quality, and report P95, not just the mean. Fifty to a hundred labeled queries is enough to separate the front-runners reliably; you do not need thousands to see which model ranks your data best.
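The two metrics are short enough to implement directly. The sketch below assumes binary relevance labels (a document is either known-relevant or not) and no duplicate ids within a ranking; graded labels would need a gain term in the DCG sum. The function names and the tiny `labels` and `run` dictionaries are illustrative.

```python
import math
from typing import Dict, List, Set


def hit_at_1(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """1.0 if the top-ranked document is relevant, else 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] in relevant_ids else 0.0


def ndcg_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: DCG of this ranking divided by the best possible DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def evaluate(
    run: Dict[str, List[str]],
    labels: Dict[str, Set[str]],
    k: int = 10,
) -> Dict[str, float]:
    """Average both metrics over every labeled query; missing queries score zero."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must not be empty")
    return {
        "hit@1": sum(hit_at_1(run.get(q, []), rel) for q, rel in labels.items()) / n,
        f"ndcg@{k}": sum(ndcg_at_k(run.get(q, []), rel, k) for q, rel in labels.items()) / n,
    }


labels = {"q1": {"d3"}, "q2": {"d7"}}
run = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d7", "d6"]}
result = evaluate(run, labels)
print(result)
```

Run `evaluate` once per candidate reranker over the same retrieved set, and store the per-query latency next to each result so quality and P95 are compared side by side.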

Crucially, confirm the reranker actually changes answers, not just scores. A reranker that reorders candidates but never changes which chunk reaches the LLM is pure latency cost. Pair the ranking metrics with end-to-end RAG evaluation; our guide on telling whether your RAG system actually works covers the answer-level metrics that catch this. And because reranker quality is downstream of embedding quality, if your candidate set is weak no reranker can save it; our breakdown of choosing embedding models for RAG covers the first-stage decision that determines how much work the reranker has to do.
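A cheap first check, before a full answer-level evaluation, is to count how often reranking changes the set of chunks that reach the prompt. The sketch below is a proxy only: it assumes the prompt takes the top `prompt_slots` chunks and ignores their order, and `prompt_change_rate` is a hypothetical helper name.

```python
from typing import Dict, List


def prompt_change_rate(
    before: Dict[str, List[str]],
    after: Dict[str, List[str]],
    prompt_slots: int = 3,
) -> float:
    """Fraction of queries where reranking changes which chunks fill the prompt."""
    if not before:
        raise ValueError("before must not be empty")
    changed = 0
    for query_id, baseline in before.items():
        reranked = after[query_id]
        # Compare as sets: a pure reorder inside the prompt window is not counted.
        if set(baseline[:prompt_slots]) != set(reranked[:prompt_slots]):
            changed += 1
    return changed / len(before)


before = {"q1": ["a", "b", "c", "d"], "q2": ["a", "b", "c", "d"]}
after = {"q1": ["b", "a", "c", "d"], "q2": ["d", "a", "b", "c"]}
rate = prompt_change_rate(before, after, prompt_slots=3)
print(f"{rate:.0%} of queries send different chunks to the LLM")
```

A rate near zero suggests the reranker is reshuffling within the window without changing what the model reads. A high rate only says something changed, not that it improved, so it still needs the answer-level evaluation.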

When we tune retrieval pipelines at Particula Tech, the reranker benchmark is a half-day exercise that routinely changes the chosen model versus the leaderboard favorite, because real corpora rarely look like academic ones. The broader strategic context, when reranking earns its place in the pipeline versus when it is premature latency, lives in our RAG systems pillar.

The summary holds: pick hosted (Cohere) for zero-ops convenience on general corpora, Voyage domain variants for code and legal, Jina Reranker v3 for a strict sub-200ms budget or long documents, Nemotron when accuracy outranks latency, and BGE v2-m3 for multilingual on a budget. Then prove it on your own data before you ship, because the reranker that matters is the one that wins on your corpus, not the one with the best leaderboard line.

08 · FAQ

Quick answers to the questions this post tends to raise.
