Reranking & Cross-Encoders Complete Guide (2026): BGE, Cohere, Jina, ColBERT for RAG
Reranking is the highest-ROI improvement most RAG systems can make. Vector retrieval (bi-encoders) is fast but loses fine-grained query-document interaction. A cross-encoder reranker takes the top 100 candidates from vector search, jointly attends over each query-document pair, and returns a precision-tuned top 10 for the LLM. Quality lift: typically +5 to +15 NDCG@10 points across MTEB and BEIR benchmarks — often the difference between a usable RAG system and one that actually answers questions correctly.
This guide covers the full reranker landscape in 2026: bi-encoder vs cross-encoder vs ColBERT trade-offs, the leading open models (BGE-Reranker-v2-m3, Jina Reranker v2, mxbai-rerank, bge-reranker-v2-gemma), Cohere / Voyage hosted alternatives, setup with sentence-transformers and Text Embeddings Inference (TEI), latency benchmarks, fine-tuning recipes, and integration into RAG pipelines with Chroma / Qdrant / Weaviate.
Table of Contents
- Why Reranking Matters
- Bi-Encoder vs Cross-Encoder vs ColBERT
- How Cross-Encoders Score Relevance
- Open Reranker Models (2026)
- Hosted: Cohere, Voyage, Jina API
- Setup with sentence-transformers
- Setup with Text Embeddings Inference (TEI)
- Integration: Chroma / Qdrant / Weaviate
- Choosing K for Reranking
- Latency Benchmarks
- Fine-Tuning a Reranker
- Distilling Larger Rerankers
- ColBERT and Late Interaction
- Hybrid Retrieval (BM25 + Dense + Rerank)
- Multi-lingual Reranking
- Production Best Practices
- Troubleshooting
- FAQ
Why Reranking Matters {#why}
A bi-encoder embeds query Q and document D independently: score = cos(emb(Q), emb(D)). The two never see each other directly, so the model loses which Q tokens match which D tokens, term overlap, negation handling, multi-hop reasoning, and modifier matching.
A cross-encoder feeds [Q; D] jointly through a transformer. Every query token attends to every document token. Score is computed from rich cross-attention.
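A minimal sketch of the two scoring paths with sentence-transformers (model names are the same ones used later in this guide; the example sentence is illustrative):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "does aspirin reduce fever?"
doc = "Aspirin does not lower body temperature in healthy individuals."

# Bi-encoder: Q and D are embedded independently, then compared
bi = SentenceTransformer("BAAI/bge-base-en-v1.5")
q_emb, d_emb = bi.encode([query, doc])
bi_score = util.cos_sim(q_emb, d_emb)  # negation is easy to miss here

# Cross-encoder: Q and D pass through the model together
ce = CrossEncoder("BAAI/bge-reranker-v2-m3")
ce_score = ce.predict([(query, doc)])  # cross-attention sees the "not"
```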
Result on standard benchmarks:
| Pipeline | NDCG@10 (BEIR avg) |
|---|---|
| BM25 only | 41.7 |
| Bi-encoder (BGE-base) | 51.0 |
| Bi-encoder + Cross-encoder rerank | 56.5 |
| Bi-encoder + GPT-4 rerank | 58.2 |
Reranking adds ~5-7 NDCG points essentially for free — the biggest single-step quality lift in the typical RAG stack.
Bi-Encoder vs Cross-Encoder vs ColBERT {#architectures}
| Approach | Latency (per query) | Storage | Quality | Use |
|---|---|---|---|---|
| Bi-encoder | <1 ms (retrieve from index) | 1 vector / doc (~3 KB) | Baseline | First-stage retrieval over millions |
| Cross-encoder | 50-500 ms (top 100) | None (compute on-the-fly) | +5-15 NDCG | Second-stage rerank top 100 |
| ColBERT (late interaction) | 5-50 ms | 1 vector / token (~30 KB) | +3-8 NDCG | Mid-stage rerank or replace bi-encoder for high-precision |
| LLM-as-reranker | 100-2000 ms | None | +6-10 NDCG | Top 10-20 only, when budget allows |
For most RAG: bi-encoder retrieve (top 100) → cross-encoder rerank (top 10) → LLM generate.
How Cross-Encoders Score Relevance {#how-it-works}
1. Input: `[CLS] query [SEP] document [SEP]`
2. Forward pass through a transformer (BERT-style or a modern BERT replacement)
3. Pool the [CLS] embedding (or mean-pool)
4. Linear head → single scalar score
5. Optional sigmoid for a probability output
Training: contrastive loss with positive (Q, D+) and negative (Q, D-) pairs. Or pairwise loss: score(Q, D+) > score(Q, D-).
Modern rerankers add: instruction-tuning prompts, domain pre-training, multi-lingual data, listwise loss for order-aware training.
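In code, this is just a sequence-classification head with one logit. A minimal sketch with Hugging Face transformers, following the loading pattern from the BGE model card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-v2-m3")
model.eval()

# Query and document are tokenized together into one sequence:
# [CLS] query [SEP] document [SEP]
inputs = tok("what is RAG?", "RAG combines retrieval with generation...",
             return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logit = model(**inputs).logits.squeeze()  # single scalar relevance score
score = torch.sigmoid(logit)  # optional: map to (0, 1)
```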
Open Reranker Models (2026) {#open-models}
| Model | Params | License | MTEB Rerank | Best For |
|---|---|---|---|---|
| BGE-Reranker-v2-m3 | 568M | MIT | 60.4 | Default; multi-lingual |
| BGE-Reranker-v2-Gemma | 9B | Apache 2.0 | 64.5 | Maximum quality |
| BGE-Reranker-v2-MiniCPM | 2.7B | Apache 2.0 | 62.3 | Quality + speed |
| Jina Reranker v2 base multilingual | 278M | Apache 2.0 | 56.8 | Latency-critical |
| mxbai-rerank-large-v1 | 435M | Apache 2.0 | 59.4 | English; fast |
| mxbai-rerank-base-v1 | 184M | Apache 2.0 | 55.0 | Edge / minimal latency |
| bge-reranker-large | 560M | MIT | 57.0 | Predecessor; v2-m3 better |
| Voyage-rerank-2 | closed | API | 61.5 | Hosted alternative |
| Cohere Rerank 3 | closed | API | 62.0 | 4K context, 100+ languages |
| jina-colbert-v2 | 137M | Apache 2.0 | 58.5 | ColBERT alternative |
For most production in 2026: BGE-Reranker-v2-m3 is the right default — best quality / latency / license combination. Use bge-reranker-v2-gemma when you can afford 9B-model latency and need top-of-leaderboard quality.
Hosted: Cohere, Voyage, Jina API {#hosted}
When self-hosting isn't practical:
```bash
# Cohere
curl https://api.cohere.com/v2/rerank \
  -H "Authorization: Bearer $COHERE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "rerank-v3.5", "query": "what is RAG?", "documents": ["...", "..."]}'

# Voyage
curl https://api.voyageai.com/v1/rerank \
  -H "Authorization: Bearer $VOYAGE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "rerank-2", "query": "...", "documents": [...]}'

# Jina
curl https://api.jina.ai/v1/rerank \
  -H "Authorization: Bearer $JINA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "jina-reranker-v2-base-multilingual", "query": "...", "documents": [...]}'
```
Cost: ~$1-2 per 1K queries (each scoring 100 documents). For low-volume / quick-start, hosted is great. For >10K queries/day, self-hosted BGE typically pays back in days.
Setup with sentence-transformers {#setup-st}
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512, device="cuda")

query = "What is retrieval-augmented generation?"
candidates = [
    "RAG combines retrieval with generation...",
    "Cooking recipes for cake...",
    "Vector databases store embeddings...",
    # ... 100 more
]

# Score all (query, doc) pairs
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs, batch_size=32, show_progress_bar=False)

# Rank by descending score
ranked = sorted(zip(scores, candidates), key=lambda x: -x[0])
top_10 = [doc for _, doc in ranked[:10]]
```
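Recent sentence-transformers releases also ship a convenience method that does the pairing and sorting for you:

```python
# Equivalent one-liner (sentence-transformers >= 2.3)
results = reranker.rank(query, candidates, top_k=10, return_documents=True)
top_10 = [r["text"] for r in results]
```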
Setup with Text Embeddings Inference (TEI) {#setup-tei}
For production, use Hugging Face's TEI server:
```bash
docker run -p 8080:80 --gpus all \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.5 \
  --model-id BAAI/bge-reranker-v2-m3

curl http://localhost:8080/rerank \
  -X POST -H "Content-Type: application/json" \
  -d '{"query": "what is RAG?", "texts": ["doc1", "doc2", "doc3"], "raw_scores": false}'
```
Returns scored, sorted indices:
```json
[
  {"index": 0, "score": 0.94},
  {"index": 2, "score": 0.31},
  {"index": 1, "score": 0.05}
]
```
TEI features: dynamic batching, ONNX runtime for CPU, CUDA + Metal + Vulkan for GPU, OpenAPI spec.
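Calling the same endpoint from Python is just an HTTP POST (using requests here; candidates is the list from the earlier examples):

```python
import requests

resp = requests.post(
    "http://localhost:8080/rerank",
    json={"query": "what is RAG?", "texts": candidates, "raw_scores": False},
    timeout=10,
)
resp.raise_for_status()
# The response is already sorted by score, so take the first 10 indices
top_10 = [candidates[r["index"]] for r in resp.json()[:10]]
```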
Integration: Chroma / Qdrant / Weaviate {#integration}
Standard pattern: retrieve top K with vector DB, rerank with cross-encoder.
```python
import chromadb
from sentence_transformers import CrossEncoder

query = "What is retrieval-augmented generation?"

# Vector retrieval: top 100 candidates
client = chromadb.Client()
collection = client.get_collection("docs")
results = collection.query(query_texts=[query], n_results=100)
candidates = results["documents"][0]

# Reranking: precision-tuned top 10
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", device="cuda")
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs, batch_size=32)
ranked = sorted(zip(scores, candidates), key=lambda x: -x[0])
top_10 = [doc for _, doc in ranked[:10]]

# Pass to LLM
context = "\n\n".join(top_10)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```
Qdrant has built-in reranker support via QueryRequest.using since 1.13. Weaviate supports rerankers as modules. See Ollama ChromaDB RAG Pipeline for full RAG examples.
Choosing K for Reranking {#choosing-k}
K = number of candidates fed from retrieval into reranking.
| K | Use Case | Trade-off |
|---|---|---|
| 20-50 | Low-stakes chat, fast autocomplete | May miss relevant docs |
| 100 | Standard RAG (default) | Good balance |
| 200-500 | High-precision (legal, medical, code) | Higher latency |
| 1000+ | Research / max-recall pipelines | Latency dominates |
Empirical recipe: vary K from 20 to 500, plot NDCG@10 (or your task metric), find the knee. For 80% of RAG: K=100 is right.
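A sketch of that sweep, assuming hypothetical retrieve() and ndcg_at_10() helpers plus a labeled eval set of your own; reranker is the CrossEncoder from earlier:

```python
# eval_queries: list of (query, relevance_labels) pairs from your eval set
for k in [20, 50, 100, 200, 500]:
    ndcg_values = []
    for query, labels in eval_queries:
        candidates = retrieve(query, top_k=k)  # first-stage retrieval
        scores = reranker.predict([(query, d) for d in candidates])
        reranked = [d for _, d in sorted(zip(scores, candidates), key=lambda x: -x[0])]
        ndcg_values.append(ndcg_at_10(reranked, labels))
    print(f"K={k}: NDCG@10={sum(ndcg_values) / len(ndcg_values):.3f}")
```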
Latency Benchmarks {#latency}
Single H100, batched, 512-token documents:
| Reranker | Throughput (pairs/s) | Latency for 100 pairs |
|---|---|---|
| BGE-Reranker-v2-m3 | 1100 | 90 ms |
| BGE-Reranker-v2-MiniCPM (2.7B) | 380 | 260 ms |
| BGE-Reranker-v2-Gemma (9B) | 95 | 1050 ms |
| mxbai-rerank-large | 1450 | 70 ms |
| mxbai-rerank-base | 4200 | 24 ms |
| Jina Reranker v2 base | 3500 | 28 ms |
RTX 4090: ~50% of H100 throughput. CPU (Ryzen 7 7800X3D): 1/30th of H100 (use only for <100 queries/day).
For interactive chat (TTFT budget <300 ms): mxbai-base or Jina v2 base. For B2B RAG (TTFT <2 s): any reranker comfortably fits.
Fine-Tuning a Reranker {#fine-tuning}
Domain-specific fine-tuning often gives +10-20 NDCG points.
```python
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# query / pos_doc / neg_doc are placeholders for your labeled data
train_examples = [
    InputExample(texts=[query, pos_doc], label=1.0),
    InputExample(texts=[query, neg_doc], label=0.0),
    # ... 10K-50K triples
]

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", num_labels=1)
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
reranker.fit(
    train_dataloader=train_dataloader,
    epochs=3,
    warmup_steps=100,
    output_path="./my-reranker",
)
```
Time: ~1-4 hours on a single RTX 4090 for 10K-50K triples. Data curation tip: mine hard negatives with a stronger retriever (BGE-large), then label them with an LLM judge.
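One way to implement the mining step, sketched with sentence-transformers (corpus and labeled_pairs stand in for your own data):

```python
from sentence_transformers import SentenceTransformer, util

# corpus: list of doc strings; labeled_pairs: list of (query, positive_doc)
retriever = SentenceTransformer("BAAI/bge-large-en-v1.5")
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

hard_negatives = {}
for query, pos_doc in labeled_pairs:
    q_emb = retriever.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=50)[0]
    # High-ranking docs that are not the labeled positive = hard negatives
    hard_negatives[query] = [
        corpus[h["corpus_id"]] for h in hits[10:]
        if corpus[h["corpus_id"]] != pos_doc
    ]
```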
Distilling Larger Rerankers {#distillation}
Use bge-reranker-v2-gemma (9B) to score 100K-1M (query, doc) pairs from your domain. SFT BGE-Reranker-v2-m3 on those scores. Captures most of gemma's quality at base latency.
```python
# Pseudocode — in practice bge-reranker-v2-gemma loads via FlagEmbedding's
# LLM-reranker classes; CrossEncoder is used here for illustration
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

big_reranker = CrossEncoder("BAAI/bge-reranker-v2-gemma")
small_reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", num_labels=1)

# pairs: 100K-1M (query, doc) tuples mined from your domain
teacher_scores = big_reranker.predict(pairs, batch_size=16)
scored_pairs = [
    InputExample(texts=[q, d], label=float(s))
    for (q, d), s in zip(pairs, teacher_scores)
]

# Distill: train the student on the teacher's soft scores
small_reranker.fit(DataLoader(scored_pairs, shuffle=True, batch_size=32), epochs=2)
```
See Knowledge Distillation Guide.
ColBERT and Late Interaction {#colbert}
ColBERT (Khattab & Zaharia, 2020) and ColBERT-v2 / PLAID encode each query token and each document token into separate vectors. Relevance = sum over query tokens of max similarity to any document token.
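The late-interaction score fits in a few lines (illustrative; real ColBERT adds query/document markers, punctuation masking, and a compressed index):

```python
import torch

def maxsim(q_vecs: torch.Tensor, d_vecs: torch.Tensor) -> float:
    """q_vecs: [n_query_tokens, dim], d_vecs: [n_doc_tokens, dim], L2-normalized."""
    sim = q_vecs @ d_vecs.T                    # all token-pair cosine similarities
    return sim.max(dim=1).values.sum().item()  # max over doc tokens, sum over query tokens
```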
Trade-offs:
- Faster than cross-encoder (~5x at K=100)
- Slower than bi-encoder (~10x at retrieval)
- Storage: 10-100x bi-encoder (one vector per token)
When ColBERT helps: high-precision retrieval over medium-size corpora (1K-10K docs) where you can afford the storage. Modern variants (jina-colbert-v2) reduce storage and improve quality.
For 2026: ColBERT is a niche tool. The standard bi-encoder + cross-encoder pipeline is simpler and usually equivalent quality.
Hybrid Retrieval (BM25 + Dense + Rerank) {#hybrid}
Best-quality RAG retrieval pipeline:
1. BM25 (lexical): retrieve top 50 (catches exact-keyword matches)
2. Dense bi-encoder: retrieve top 50 (catches semantic matches)
3. Reciprocal Rank Fusion (RRF): merge into top 100 candidates
4. Cross-encoder rerank: top 10
5. LLM generate
RRF formula: score(d) = Σ 1/(k + rank_in_list_i(d)) for each retrieval method i; k=60 is standard.
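RRF in code (bm25_top50 and dense_top50 are ranked lists of doc IDs from steps 1-2):

```python
def rrf_merge(ranked_lists: list[list[str]], k: int = 60, top_n: int = 100) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

candidates = rrf_merge([bm25_top50, dense_top50])  # then cross-encoder rerank (step 4)
```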
This pattern (used by Cohere, OpenAI's RAG cookbooks, and most production systems) gives the most robust retrieval — covers cases where dense embeddings miss exact terms (e.g., product codes, brand names) and BM25 misses paraphrases.
Multi-lingual Reranking {#multilingual}
For multi-lingual RAG (cross-language Q&A or multi-language corpora):
| Reranker | Language Coverage |
|---|---|
| BGE-Reranker-v2-m3 | 100+ languages |
| Jina Reranker v2 base multilingual | 30+ languages |
| Cohere Rerank 3 | 100+ languages |
| Voyage-rerank-2-multilingual | 100+ languages |
For Asian-language workloads (Chinese, Japanese, Korean): BGE family is strongest (trained on substantial Chinese data). For European languages: all major rerankers work well.
For cross-language retrieval (English query, Chinese documents): use multilingual reranker; quality is typically 5-10 points below same-language but still usable.
Production Best Practices {#production}
- Truncate documents to <512 tokens — most rerankers degrade at longer contexts and latency scales linearly
- Batch reranking calls — 100 pairs in one batch, not 100 sequential calls
- Cache scores — for stable corpora, (query, doc) score pairs can be cached when queries repeat (see the sketch after this list)
- Use TEI in production — better batching and ONNX acceleration than raw sentence-transformers
- Fine-tune on domain — even 1K labeled triples gives measurable gains
- Monitor recall@K from retrieval — if BM25/dense misses relevant docs, rerank can't recover
- Combine with hybrid retrieval — BM25 + dense + RRF + rerank is the gold standard
- Profile latency — TTFT budget should explicitly include retrieve + rerank time
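A minimal sketch of the score cache from the list above — an in-process dict keyed on content hashes that still batches all uncached pairs in one call:

```python
import hashlib

score_cache: dict[str, float] = {}

def rerank_cached(reranker, query: str, docs: list[str]) -> list[float]:
    keys = [hashlib.sha256(f"{query}\x00{d}".encode()).hexdigest() for d in docs]
    # Score only the uncached pairs, in a single batched predict() call
    missing = {k: (query, d) for k, d in zip(keys, docs) if k not in score_cache}
    if missing:
        fresh = reranker.predict(list(missing.values()), batch_size=32)
        for k, s in zip(missing, fresh):
            score_cache[k] = float(s)
    return [score_cache[k] for k in keys]
```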
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Reranker doesn't improve quality | Bi-encoder already returns relevant docs | Increase K to give reranker harder cases |
| Latency too high | Too many candidates or too-large model | Lower K; switch to mxbai-base or Jina |
| Long documents truncated | max_length=512 default | Chunk documents to fit; or use Cohere (4K) |
| Multi-language quality low | Wrong reranker | Use BGE-Reranker-v2-m3 or Cohere Rerank 3 |
| Reranker scores all 0 | Sigmoid not applied | Set normalize=True or use raw_scores=False in TEI |
| OOM during reranking | Batch size too large | Lower batch_size in predict() |
| Reranker worse than bi-encoder | Wrong model | Verify it's a cross-encoder, not a bi-encoder mislabeled |
| No GPU acceleration | Wrong device | Set device="cuda" explicitly in CrossEncoder constructor |
FAQ {#faq}
Sources: BGE-Reranker-v2-m3 model card | BGE paper (arXiv 2402.03216) | ColBERT paper (Khattab & Zaharia, 2020) | Jina Reranker v2 announcement | Cohere Rerank docs | Text Embeddings Inference (TEI) | BEIR benchmark | Internal benchmarks H100 + RTX 4090.