Reranking & Cross-Encoders Complete Guide (2026): BGE, Cohere, Jina, ColBERT for RAG
Reranking is the highest-ROI improvement most RAG systems can make. Vector retrieval (bi-encoders) is fast but loses fine-grained query-document interaction. A cross-encoder reranker takes the top 100 candidates from vector search, jointly attends over each query-document pair, and returns a precision-tuned top 10 for the LLM. Quality lift: typically +5 to +15 NDCG@10 points across MTEB and BEIR benchmarks — often the difference between a usable RAG system and one that actually answers questions correctly.
This guide covers the full reranker landscape in 2026: bi-encoder vs cross-encoder vs ColBERT trade-offs, the leading open models (BGE-Reranker-v2-m3, Jina Reranker v2, mxbai-rerank, bge-reranker-v2-gemma), Cohere / Voyage hosted alternatives, setup with sentence-transformers and Text Embeddings Inference (TEI), latency benchmarks, fine-tuning recipes, and integration into RAG pipelines with Chroma / Qdrant / Weaviate.
Table of Contents
- Why Reranking Matters
- Bi-Encoder vs Cross-Encoder vs ColBERT
- How Cross-Encoders Score Relevance
- Open Reranker Models (2026)
- Hosted: Cohere, Voyage, Jina API
- Setup with sentence-transformers
- Setup with Text Embeddings Inference (TEI)
- Integration: Chroma / Qdrant / Weaviate
- Choosing K for Reranking
- Latency Benchmarks
- Fine-Tuning a Reranker
- Distilling Larger Rerankers
- ColBERT and Late Interaction
- Hybrid Retrieval (BM25 + Dense + Rerank)
- Multi-lingual Reranking
- Production Best Practices
- Troubleshooting
- FAQ
Why Reranking Matters {#why}
A bi-encoder embeds query Q and document D independently: score = cos(emb(Q), emb(D)). The two never see each other directly, so the model loses which Q tokens match which D tokens, term overlap, negation handling, multi-hop reasoning, and modifier matching.
A cross-encoder feeds [Q; D] jointly through a transformer. Every query token attends to every document token. Score is computed from rich cross-attention.
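A minimal sketch of the two scoring paths with sentence-transformers (model names are the same ones used later in this guide; the example sentence is illustrative):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "does aspirin reduce fever?"
doc = "Aspirin does not lower body temperature in healthy individuals."

# Bi-encoder: Q and D are embedded independently, then compared
bi = SentenceTransformer("BAAI/bge-base-en-v1.5")
q_emb, d_emb = bi.encode([query, doc])
bi_score = util.cos_sim(q_emb, d_emb)  # negation is easy to miss here

# Cross-encoder: Q and D pass through the model together
ce = CrossEncoder("BAAI/bge-reranker-v2-m3")
ce_score = ce.predict([(query, doc)])  # cross-attention sees the "not"
```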
Result on standard benchmarks:
| Pipeline | NDCG@10 (BEIR avg) |
|---|---|
| BM25 only | 41.7 |
| Bi-encoder (BGE-base) | 51.0 |
| Bi-encoder + Cross-encoder rerank | 56.5 |
| Bi-encoder + GPT-4 rerank | 58.2 |
Reranking adds ~5-7 NDCG points essentially for free — the biggest single-step quality lift in the typical RAG stack.
Bi-Encoder vs Cross-Encoder vs ColBERT {#architectures}
| Approach | Latency (per query) | Storage | Quality | Use |
|---|---|---|---|---|
| Bi-encoder | <1 ms (retrieve from index) | 1 vector / doc (~3 KB) | Baseline | First-stage retrieval over millions |
| Cross-encoder | 50-500 ms (top 100) | None (compute on-the-fly) | +5-15 NDCG | Second-stage rerank top 100 |
| ColBERT (late interaction) | 5-50 ms | 1 vector / token (~30 KB) | +3-8 NDCG | Mid-stage rerank or replace bi-encoder for high-precision |
| LLM-as-reranker | 100-2000 ms | None | +6-10 NDCG | Top 10-20 only, when budget allows |
For most RAG: bi-encoder retrieve (top 100) → cross-encoder rerank (top 10) → LLM generate.
How Cross-Encoders Score Relevance {#how-it-works}
1. Input: `[CLS] query [SEP] document [SEP]`
2. Forward pass through a transformer (BERT-style or a modern BERT replacement)
3. Pool the [CLS] embedding (or mean-pool)
4. Linear head → single scalar score
5. Optional sigmoid for a probability output
Training: contrastive loss with positive (Q, D+) and negative (Q, D-) pairs. Or pairwise loss: score(Q, D+) > score(Q, D-).
Modern rerankers add: instruction-tuning prompts, domain pre-training, multi-lingual data, listwise loss for order-aware training.
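In code, this is just a sequence-classification head with one logit. A minimal sketch with Hugging Face transformers, following the loading pattern from the BGE model card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("BAAI/bge-reranker-v2-m3")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/bge-reranker-v2-m3")
model.eval()

# Query and document are tokenized together into one sequence:
# [CLS] query [SEP] document [SEP]
inputs = tok("what is RAG?", "RAG combines retrieval with generation...",
             return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logit = model(**inputs).logits.squeeze()  # single scalar relevance score
score = torch.sigmoid(logit)  # optional: map to (0, 1)
```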
Open Reranker Models (2026) {#open-models}
| Model | Params | License | MTEB Rerank | Best For |
|---|---|---|---|---|
| BGE-Reranker-v2-m3 | 568M | MIT | 60.4 | Default; multi-lingual |
| BGE-Reranker-v2-Gemma | 9B | Apache 2.0 | 64.5 | Maximum quality |
| BGE-Reranker-v2-MiniCPM | 2.7B | Apache 2.0 | 62.3 | Quality + speed |
| Jina Reranker v2 base multilingual | 278M | Apache 2.0 | 56.8 | Latency-critical |
| mxbai-rerank-large-v1 | 435M | Apache 2.0 | 59.4 | English; fast |
| mxbai-rerank-base-v1 | 184M | Apache 2.0 | 55.0 | Edge / minimal latency |
| bge-reranker-large | 560M | MIT | 57.0 | Predecessor; v2-m3 better |
| Voyage-rerank-2 | closed | API | 61.5 | Hosted alternative |
| Cohere Rerank 3 | closed | API | 62.0 | 4K context, 100+ languages |
| jina-colbert-v2 | 137M | Apache 2.0 | 58.5 | ColBERT alternative |
For most production in 2026: BGE-Reranker-v2-m3 is the right default — best quality / latency / license combination. Use bge-reranker-v2-gemma when you can afford 9B-model latency and need top-of-leaderboard quality.
Hosted: Cohere, Voyage, Jina API {#hosted}
When self-hosting isn't practical:
```bash
# Cohere
curl https://api.cohere.com/v2/rerank \
  -H "Authorization: Bearer $COHERE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "rerank-v3.5", "query": "what is RAG?", "documents": ["...", "..."]}'

# Voyage
curl https://api.voyageai.com/v1/rerank \
  -H "Authorization: Bearer $VOYAGE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "rerank-2", "query": "...", "documents": [...]}'

# Jina
curl https://api.jina.ai/v1/rerank \
  -H "Authorization: Bearer $JINA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "jina-reranker-v2-base-multilingual", "query": "...", "documents": [...]}'
```
Cost: ~$1-2 per 1K queries (each scoring 100 documents). For low-volume / quick-start, hosted is great. For >10K queries/day, self-hosted BGE typically pays back in days.
Setup with sentence-transformers {#setup-st}
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512, device="cuda")

query = "What is retrieval-augmented generation?"
candidates = [
    "RAG combines retrieval with generation...",
    "Cooking recipes for cake...",
    "Vector databases store embeddings...",
    # ... 100 more
]

# Score all (query, doc) pairs
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs, batch_size=32, show_progress_bar=False)

# Rank by descending score
ranked = sorted(zip(scores, candidates), key=lambda x: -x[0])
top_10 = [doc for _, doc in ranked[:10]]
```
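Recent sentence-transformers releases also ship a convenience method that does the pairing and sorting for you:

```python
# Equivalent one-liner (sentence-transformers >= 2.3)
results = reranker.rank(query, candidates, top_k=10, return_documents=True)
top_10 = [r["text"] for r in results]
```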
Setup with Text Embeddings Inference (TEI) {#setup-tei}
For production, use Hugging Face's TEI server:
```bash
docker run -p 8080:80 --gpus all \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.5 \
  --model-id BAAI/bge-reranker-v2-m3

curl http://localhost:8080/rerank \
  -X POST -H "Content-Type: application/json" \
  -d '{"query": "what is RAG?", "texts": ["doc1", "doc2", "doc3"], "raw_scores": false}'
```
Returns scored, sorted indices:
```json
[
  {"index": 0, "score": 0.94},
  {"index": 2, "score": 0.31},
  {"index": 1, "score": 0.05}
]
```
TEI features: dynamic batching, ONNX runtime for CPU, CUDA + Metal + Vulkan for GPU, OpenAPI spec.
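Calling the same endpoint from Python is just an HTTP POST (using requests here; candidates is the list from the earlier examples):

```python
import requests

resp = requests.post(
    "http://localhost:8080/rerank",
    json={"query": "what is RAG?", "texts": candidates, "raw_scores": False},
    timeout=10,
)
resp.raise_for_status()
# The response is already sorted by score, so take the first 10 indices
top_10 = [candidates[r["index"]] for r in resp.json()[:10]]
```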
Integration: Chroma / Qdrant / Weaviate {#integration}
Standard pattern: retrieve top K with vector DB, rerank with cross-encoder.
```python
import chromadb
from sentence_transformers import CrossEncoder

query = "What is retrieval-augmented generation?"

# Vector retrieval: top 100 candidates
client = chromadb.Client()
collection = client.get_collection("docs")
results = collection.query(query_texts=[query], n_results=100)
candidates = results["documents"][0]

# Reranking: precision-tuned top 10
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", device="cuda")
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs, batch_size=32)
ranked = sorted(zip(scores, candidates), key=lambda x: -x[0])
top_10 = [doc for _, doc in ranked[:10]]

# Pass to LLM
context = "\n\n".join(top_10)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```
Qdrant has built-in reranker support via QueryRequest.using since 1.13. Weaviate supports rerankers as modules. See Ollama ChromaDB RAG Pipeline for full RAG examples.
Choosing K for Reranking {#choosing-k}
K = number of candidates fed from retrieval into reranking.
| K | Use Case | Trade-off |
|---|---|---|
| 20-50 | Low-stakes chat, fast autocomplete | May miss relevant docs |
| 100 | Standard RAG (default) | Good balance |
| 200-500 | High-precision (legal, medical, code) | Higher latency |
| 1000+ | Research / max-recall pipelines | Latency dominates |
Empirical recipe: vary K from 20 to 500, plot NDCG@10 (or your task metric), find the knee. For 80% of RAG: K=100 is right.
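A sketch of that sweep, assuming hypothetical retrieve() and ndcg_at_10() helpers plus a labeled eval set of your own; reranker is the CrossEncoder from earlier:

```python
# eval_queries: list of (query, relevance_labels) pairs from your eval set
for k in [20, 50, 100, 200, 500]:
    ndcg_values = []
    for query, labels in eval_queries:
        candidates = retrieve(query, top_k=k)  # first-stage retrieval
        scores = reranker.predict([(query, d) for d in candidates])
        reranked = [d for _, d in sorted(zip(scores, candidates), key=lambda x: -x[0])]
        ndcg_values.append(ndcg_at_10(reranked, labels))
    print(f"K={k}: NDCG@10={sum(ndcg_values) / len(ndcg_values):.3f}")
```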
Latency Benchmarks {#latency}
Single H100, batched, 512-token documents:
| Reranker | Throughput (pairs/s) | Latency for 100 pairs |
|---|---|---|
| BGE-Reranker-v2-m3 | 1100 | 90 ms |
| BGE-Reranker-v2-MiniCPM (2.7B) | 380 | 260 ms |
| BGE-Reranker-v2-Gemma (9B) | 95 | 1050 ms |
| mxbai-rerank-large | 1450 | 70 ms |
| mxbai-rerank-base | 4200 | 24 ms |
| Jina Reranker v2 base | 3500 | 28 ms |
RTX 4090: ~50% of H100 throughput. CPU (Ryzen 7 7800X3D): 1/30th of H100 (use only for <100 queries/day).
For interactive chat (TTFT budget <300 ms): mxbai-base or Jina v2 base. For B2B RAG (TTFT <2 s): any reranker comfortably fits.
Fine-Tuning a Reranker {#fine-tuning}
Domain-specific fine-tuning often gives +10-20 NDCG points.
```python
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# query / pos_doc / neg_doc are placeholders for your labeled data
train_examples = [
    InputExample(texts=[query, pos_doc], label=1.0),
    InputExample(texts=[query, neg_doc], label=0.0),
    # ... 10K-50K triples
]

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", num_labels=1)
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
reranker.fit(
    train_dataloader=train_dataloader,
    epochs=3,
    warmup_steps=100,
    output_path="./my-reranker",
)
```
Time: ~1-4 hours on a single RTX 4090 for 10K-50K triples. Data curation tip: mine hard negatives with a stronger retriever (BGE-large), then label them with an LLM judge.
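One way to implement the mining step, sketched with sentence-transformers (corpus and labeled_pairs stand in for your own data):

```python
from sentence_transformers import SentenceTransformer, util

# corpus: list of doc strings; labeled_pairs: list of (query, positive_doc)
retriever = SentenceTransformer("BAAI/bge-large-en-v1.5")
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

hard_negatives = {}
for query, pos_doc in labeled_pairs:
    q_emb = retriever.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, corpus_emb, top_k=50)[0]
    # High-ranking docs that are not the labeled positive = hard negatives
    hard_negatives[query] = [
        corpus[h["corpus_id"]] for h in hits[10:]
        if corpus[h["corpus_id"]] != pos_doc
    ]
```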
Distilling Larger Rerankers {#distillation}
Use bge-reranker-v2-gemma (9B) to score 100K-1M (query, doc) pairs from your domain. SFT BGE-Reranker-v2-m3 on those scores. Captures most of gemma's quality at base latency.
```python
# Pseudocode — in practice bge-reranker-v2-gemma loads via FlagEmbedding's
# LLM-reranker classes; CrossEncoder is used here for illustration
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

big_reranker = CrossEncoder("BAAI/bge-reranker-v2-gemma")
small_reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", num_labels=1)

# pairs: 100K-1M (query, doc) tuples mined from your domain
teacher_scores = big_reranker.predict(pairs, batch_size=16)
scored_pairs = [
    InputExample(texts=[q, d], label=float(s))
    for (q, d), s in zip(pairs, teacher_scores)
]

# Distill: train the student on the teacher's soft scores
small_reranker.fit(DataLoader(scored_pairs, shuffle=True, batch_size=32), epochs=2)
```
See Knowledge Distillation Guide.
ColBERT and Late Interaction {#colbert}
ColBERT (Khattab & Zaharia, 2020) and ColBERT-v2 / PLAID encode each query token and each document token into separate vectors. Relevance = sum over query tokens of max similarity to any document token.
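The late-interaction score fits in a few lines (illustrative; real ColBERT adds query/document markers, punctuation masking, and a compressed index):

```python
import torch

def maxsim(q_vecs: torch.Tensor, d_vecs: torch.Tensor) -> float:
    """q_vecs: [n_query_tokens, dim], d_vecs: [n_doc_tokens, dim], L2-normalized."""
    sim = q_vecs @ d_vecs.T                    # all token-pair cosine similarities
    return sim.max(dim=1).values.sum().item()  # max over doc tokens, sum over query tokens
```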
Trade-offs:
- Faster than cross-encoder (~5x at K=100)
- Slower than bi-encoder (~10x at retrieval)
- Storage: 10-100x bi-encoder (one vector per token)
When ColBERT helps: high-precision retrieval over medium-size corpora (1K-10K docs) where you can afford the storage. Modern variants (jina-colbert-v2) reduce storage and improve quality.
For 2026: ColBERT is a niche tool. The standard bi-encoder + cross-encoder pipeline is simpler and usually equivalent quality.
Hybrid Retrieval (BM25 + Dense + Rerank) {#hybrid}
Best-quality RAG retrieval pipeline:
1. BM25 (lexical): retrieve top 50 (catches exact-keyword matches)
2. Dense bi-encoder: retrieve top 50 (catches semantic matches)
3. Reciprocal Rank Fusion (RRF): merge into top 100 candidates
4. Cross-encoder rerank: top 10
5. LLM generate
RRF formula: score(d) = Σ 1/(k + rank_in_list_i(d)) for each retrieval method i; k=60 is standard.
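RRF in code (bm25_top50 and dense_top50 are ranked lists of doc IDs from steps 1-2):

```python
def rrf_merge(ranked_lists: list[list[str]], k: int = 60, top_n: int = 100) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

candidates = rrf_merge([bm25_top50, dense_top50])  # then cross-encoder rerank (step 4)
```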
This pattern (used by Cohere, OpenAI's RAG cookbooks, and most production systems) gives the most robust retrieval — covers cases where dense embeddings miss exact terms (e.g., product codes, brand names) and BM25 misses paraphrases.
Multi-lingual Reranking {#multilingual}
For multi-lingual RAG (cross-language Q&A or multi-language corpora):
| Reranker | Language Coverage |
|---|---|
| BGE-Reranker-v2-m3 | 100+ languages |
| Jina Reranker v2 base multilingual | 30+ languages |
| Cohere Rerank 3 | 100+ languages |
| Voyage-rerank-2-multilingual | 100+ languages |
For Asian-language workloads (Chinese, Japanese, Korean): BGE family is strongest (trained on substantial Chinese data). For European languages: all major rerankers work well.
For cross-language retrieval (English query, Chinese documents): use multilingual reranker; quality is typically 5-10 points below same-language but still usable.
Production Best Practices {#production}
- Truncate documents to <512 tokens — most rerankers degrade at longer contexts and latency scales linearly
- Batch reranking calls — 100 pairs in one batch, not 100 sequential calls
- Cache scores — for stable corpora, (query, doc) score pairs can be cached when queries repeat (see the sketch after this list)
- Use TEI in production — better batching and ONNX acceleration than raw sentence-transformers
- Fine-tune on domain — even 1K labeled triples gives measurable gains
- Monitor recall@K from retrieval — if BM25/dense misses relevant docs, rerank can't recover
- Combine with hybrid retrieval — BM25 + dense + RRF + rerank is the gold standard
- Profile latency — TTFT budget should explicitly include retrieve + rerank time
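A minimal sketch of the score cache from the list above — an in-process dict keyed on content hashes that still batches all uncached pairs in one call:

```python
import hashlib

score_cache: dict[str, float] = {}

def rerank_cached(reranker, query: str, docs: list[str]) -> list[float]:
    keys = [hashlib.sha256(f"{query}\x00{d}".encode()).hexdigest() for d in docs]
    # Score only the uncached pairs, in a single batched predict() call
    missing = {k: (query, d) for k, d in zip(keys, docs) if k not in score_cache}
    if missing:
        fresh = reranker.predict(list(missing.values()), batch_size=32)
        for k, s in zip(missing, fresh):
            score_cache[k] = float(s)
    return [score_cache[k] for k in keys]
```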
Troubleshooting {#troubleshooting}
| Symptom | Cause | Fix |
|---|---|---|
| Reranker doesn't improve quality | Bi-encoder already returns relevant docs | Increase K to give reranker harder cases |
| Latency too high | Too many candidates or too-large model | Lower K; switch to mxbai-base or Jina |
| Long documents truncated | max_length=512 default | Chunk documents to fit; or use Cohere (4K) |
| Multi-language quality low | Wrong reranker | Use BGE-Reranker-v2-m3 or Cohere Rerank 3 |
| Reranker scores all 0 | Sigmoid not applied | Set normalize=True or use raw_scores=False in TEI |
| OOM during reranking | Batch size too large | Lower batch_size in predict() |
| Reranker worse than bi-encoder | Wrong model | Verify it's a cross-encoder, not a bi-encoder mislabeled |
| No GPU acceleration | Wrong device | Set device="cuda" explicitly in CrossEncoder constructor |
FAQ {#faq}
Sources: BGE-Reranker-v2-m3 model card | BGE paper (arXiv 2402.03216) | ColBERT paper (Khattab & Zaharia, 2020) | Jina Reranker v2 announcement | Cohere Rerank docs | Text Embeddings Inference (TEI) | BEIR benchmark | Internal benchmarks H100 + RTX 4090.