
Reranking & Cross-Encoders Complete Guide (2026): BGE, Cohere, Jina, ColBERT for RAG

May 2, 2026
22 min read
LocalAimaster Research Team


Reranking is the highest-ROI improvement most RAG systems can make. Vector retrieval (bi-encoders) is fast but loses fine-grained query-document interaction. A cross-encoder reranker takes the top 100 candidates from vector search, jointly attends over each query-document pair, and returns a precision-tuned top 10 for the LLM. Quality lift: typically +5 to +15 NDCG@10 points across MTEB and BEIR benchmarks — often the difference between a usable RAG system and one that actually answers questions correctly.

This guide covers the full reranker landscape in 2026: bi-encoder vs cross-encoder vs ColBERT trade-offs, the leading open models (BGE-Reranker-v2-m3, Jina Reranker v2, mxbai-rerank, bge-reranker-v2-gemma), Cohere / Voyage hosted alternatives, setup with sentence-transformers and Text Embeddings Inference (TEI), latency benchmarks, fine-tuning recipes, and integration into RAG pipelines with Chroma / Qdrant / Weaviate.

Table of Contents

  1. Why Reranking Matters
  2. Bi-Encoder vs Cross-Encoder vs ColBERT
  3. How Cross-Encoders Score Relevance
  4. Open Reranker Models (2026)
  5. Hosted: Cohere, Voyage, Jina API
  6. Setup with sentence-transformers
  7. Setup with Text Embeddings Inference (TEI)
  8. Integration: Chroma / Qdrant / Weaviate
  9. Choosing K for Reranking
  10. Latency Benchmarks
  11. Fine-Tuning a Reranker
  12. Distilling Larger Rerankers
  13. ColBERT and Late Interaction
  14. Hybrid Retrieval (BM25 + Dense + Rerank)
  15. Multi-lingual Reranking
  16. Production Best Practices
  17. Troubleshooting
  18. FAQ


Why Reranking Matters {#why}

A bi-encoder embeds query Q and document D independently; the score is cos(emb(Q), emb(D)). The two texts never see each other, so the model loses fine-grained signals: which Q tokens match which D tokens, exact term overlap, negation, multi-hop reasoning, and modifier matching.
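Concretely, the bi-encoder score is nothing more than cosine similarity between two independently computed vectors. A minimal numpy sketch — the arrays here stand in for emb(Q) and emb(D) from any embedding model:

```python
import numpy as np

def bi_encoder_score(q_emb, d_emb):
    """cos(emb(Q), emb(D)): the two texts never attend to each other."""
    q = q_emb / np.linalg.norm(q_emb)
    d = d_emb / np.linalg.norm(d_emb)
    return float(q @ d)
```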

A cross-encoder feeds [Q; D] jointly through a transformer. Every query token attends to every document token. Score is computed from rich cross-attention.

Result on standard benchmarks:

| Pipeline | NDCG@10 (BEIR avg) |
| --- | --- |
| BM25 only | 41.7 |
| Bi-encoder (BGE-base) | 51.0 |
| Bi-encoder + cross-encoder rerank | 56.5 |
| Bi-encoder + GPT-4 rerank | 58.2 |

Reranking adds ~5-7 NDCG points essentially for free — the biggest single-step quality lift in a typical RAG stack.


Bi-Encoder vs Cross-Encoder vs ColBERT {#architectures}

| Approach | Latency (per query) | Storage | Quality | Use |
| --- | --- | --- | --- | --- |
| Bi-encoder | <1 ms (retrieve from index) | 1 vector / doc (~3 KB) | Baseline | First-stage retrieval over millions of docs |
| Cross-encoder | 50-500 ms (top 100) | None (computed on the fly) | +5-15 NDCG | Second-stage rerank of top 100 |
| ColBERT (late interaction) | 5-50 ms | 1 vector / token (~30 KB) | +3-8 NDCG | Mid-stage rerank, or bi-encoder replacement for high precision |
| LLM-as-reranker | 100-2000 ms | None | +6-10 NDCG | Top 10-20 only, when budget allows |

For most RAG: bi-encoder retrieve (top 100) → cross-encoder rerank (top 10) → LLM generate.


How Cross-Encoders Score Relevance {#how-it-works}

1. Input: [CLS] query [SEP] document [SEP]
2. Forward pass through a transformer (BERT-style, or a modern BERT replacement)
3. Pool the [CLS] embedding (or mean-pool)
4. Linear head → single scalar score
5. Optional sigmoid for a probability output

Training: contrastive loss with positive (Q, D+) and negative (Q, D-) pairs. Or pairwise loss: score(Q, D+) > score(Q, D-).
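The pairwise objective can be written as a logistic loss over the score difference. A minimal sketch of the idea — real training code (e.g. the losses shipped with sentence-transformers) adds batching and in-batch negatives:

```python
import math

def pairwise_logistic_loss(score_pos, score_neg):
    """Penalizes score(Q, D+) <= score(Q, D-); approaches 0 as the
    positive outscores the negative by a growing margin."""
    return math.log(1.0 + math.exp(-(score_pos - score_neg)))
```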

Modern rerankers add: instruction-tuning prompts, domain pre-training, multi-lingual data, listwise loss for order-aware training.



Open Reranker Models (2026) {#open-models}

| Model | Params | License | MTEB Rerank | Best For |
| --- | --- | --- | --- | --- |
| BGE-Reranker-v2-m3 | 568M | MIT | 60.4 | Default; multi-lingual |
| BGE-Reranker-v2-Gemma | 9B | Apache 2.0 | 64.5 | Maximum quality |
| BGE-Reranker-v2-MiniCPM | 2.7B | Apache 2.0 | 62.3 | Quality + speed |
| Jina Reranker v2 base multilingual | 278M | Apache 2.0 | 56.8 | Latency-critical |
| mxbai-rerank-large-v1 | 435M | Apache 2.0 | 59.4 | English; fast |
| mxbai-rerank-base-v1 | 184M | Apache 2.0 | 55.0 | Edge / minimal latency |
| bge-reranker-large | 560M | MIT | 57.0 | Predecessor; v2-m3 is better |
| Voyage-rerank-2 | closed | API | 61.5 | Hosted alternative |
| Cohere Rerank 3 | closed | API | 62.0 | 4K context, 100+ languages |
| jina-colbert-v2 | 137M | Apache 2.0 | 58.5 | ColBERT alternative |

For most production in 2026: BGE-Reranker-v2-m3 is the right default — best quality / latency / license combination. Use bge-reranker-v2-gemma when you can afford 9B-model latency and need top-of-leaderboard quality.


Hosted: Cohere, Voyage, Jina API {#hosted}

When self-hosting isn't practical:

# Cohere
curl https://api.cohere.com/v2/rerank \
    -H "Authorization: Bearer $COHERE_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "rerank-v3.5", "query": "what is RAG?", "documents": ["...", "..."]}'

# Voyage
curl https://api.voyageai.com/v1/rerank \
    -H "Authorization: Bearer $VOYAGE_API_KEY" \
    -d '{"model": "rerank-2", "query": "...", "documents": [...]}'

# Jina
curl https://api.jina.ai/v1/rerank \
    -H "Authorization: Bearer $JINA_API_KEY" \
    -d '{"model": "jina-reranker-v2-base-multilingual", "query": "...", "documents": [...]}'

Cost: roughly $1-2 per 1K queries (each query scoring ~100 documents). For low-volume or quick-start projects, hosted is the easy choice; above ~10K queries/day, self-hosted BGE is usually cheaper.
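To sanity-check the hosted-vs-self-hosted break-even for your own volume, here is a back-of-envelope helper. The $1.50/1K default is just the midpoint of the price range above, and `gpu_cost` is whatever your hardware or rental actually costs — both are illustrative assumptions, not quotes:

```python
def hosted_cost_per_day(queries_per_day, cost_per_1k=1.50):
    """Daily hosted reranking spend at a given per-1K-query price."""
    return queries_per_day / 1000.0 * cost_per_1k

def breakeven_days(queries_per_day, gpu_cost, cost_per_1k=1.50):
    """Days until cumulative hosted spend equals a one-off GPU outlay."""
    return gpu_cost / hosted_cost_per_day(queries_per_day, cost_per_1k)
```

For example, `hosted_cost_per_day(10_000)` is $15/day; plug in your own GPU price to see the payback window.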


Setup with sentence-transformers {#setup-st}

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512, device="cuda")

query = "What is retrieval-augmented generation?"
candidates = [
    "RAG combines retrieval with generation...",
    "Cooking recipes for cake...",
    "Vector databases store embeddings...",
    # ... 100 more
]

# Score all (query, doc) pairs
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs, batch_size=32, show_progress_bar=False)

# Rank
ranked = sorted(zip(scores, candidates), key=lambda x: -x[0])
top_10 = [doc for _, doc in ranked[:10]]

Setup with Text Embeddings Inference (TEI) {#setup-tei}

For production, use Hugging Face's TEI server:

docker run -p 8080:80 --gpus all \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-embeddings-inference:1.5 \
    --model-id BAAI/bge-reranker-v2-m3
curl http://localhost:8080/rerank \
    -X POST -H "Content-Type: application/json" \
    -d '{"query": "what is RAG?", "texts": ["doc1", "doc2", "doc3"], "raw_scores": false}'

Returns scored, sorted indices:

[
    {"index": 0, "score": 0.94},
    {"index": 2, "score": 0.31},
    {"index": 1, "score": 0.05}
]

TEI features: dynamic batching, ONNX runtime for CPU, CUDA + Metal + Vulkan for GPU, OpenAPI spec.


Integration: Chroma / Qdrant / Weaviate {#integration}

Standard pattern: retrieve top K with vector DB, rerank with cross-encoder.

import chromadb
from sentence_transformers import CrossEncoder

# Vector retrieval
client = chromadb.Client()
collection = client.get_collection("docs")
results = collection.query(query_texts=[query], n_results=100)
candidates = results["documents"][0]

# Reranking
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", device="cuda")
pairs = [(query, doc) for doc in candidates]
scores = reranker.predict(pairs, batch_size=32)
top_10 = [doc for _, doc in sorted(zip(scores, candidates), key=lambda x: -x[0])[:10]]

# Pass to LLM
context = "\n\n".join(top_10)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

Qdrant has built-in reranker support via QueryRequest.using since 1.13. Weaviate supports rerankers as modules. See Ollama ChromaDB RAG Pipeline for full RAG examples.


Choosing K for Reranking {#choosing-k}

K = number of candidates fed from retrieval into reranking.

| K | Use Case | Trade-off |
| --- | --- | --- |
| 20-50 | Low-stakes chat, fast autocomplete | May miss relevant docs |
| 100 | Standard RAG (default) | Good balance |
| 200-500 | High-precision (legal, medical, code) | Higher latency |
| 1000+ | Research / max-recall pipelines | Latency dominates |

Empirical recipe: vary K from 20 to 500, plot NDCG@10 (or your task metric), find the knee. For 80% of RAG: K=100 is right.
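To run that sweep you need the metric itself. A self-contained NDCG@k sketch — the graded relevance judgments in `rels` come from your own labeled evaluation set:

```python
import math

def ndcg_at_k(ranked_ids, rels, k=10):
    """NDCG@k for one query. `rels` maps doc_id -> graded relevance (missing = 0)."""
    dcg = sum(rels.get(d, 0.0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```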


Latency Benchmarks {#latency}

Single H100, batched, 512-token documents:

| Reranker | Throughput (pairs/s) | Latency for 100 pairs |
| --- | --- | --- |
| BGE-Reranker-v2-m3 | 1100 | 90 ms |
| BGE-Reranker-v2-MiniCPM (2.7B) | 380 | 260 ms |
| BGE-Reranker-v2-Gemma (9B) | 95 | 1050 ms |
| mxbai-rerank-large | 1450 | 70 ms |
| mxbai-rerank-base | 4200 | 24 ms |
| Jina Reranker v2 base | 3500 | 28 ms |

RTX 4090: ~50% of H100 throughput. CPU (Ryzen 7 7800X3D): 1/30th of H100 (use only for <100 queries/day).

For interactive chat (TTFT budget <300 ms): mxbai-base or Jina v2 base. For B2B RAG (TTFT <2 s): any reranker comfortably fits.
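For budgeting, the latency column above is just K divided by throughput — a one-liner for checking other K values against your TTFT budget:

```python
def rerank_latency_ms(k, pairs_per_s):
    """Approximate batched rerank latency for K candidate pairs."""
    return k / pairs_per_s * 1000.0
```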


Fine-Tuning a Reranker {#fine-tuning}

Domain-specific fine-tuning often gives +10-20 NDCG points.

from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

train_examples = [
    InputExample(texts=[query, pos_doc], label=1.0),
    InputExample(texts=[query, neg_doc], label=0.0),
    # ... 10K-50K triples
]

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", num_labels=1)
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

reranker.fit(
    train_dataloader=train_dataloader,
    epochs=3,
    warmup_steps=100,
    output_path="./my-reranker",
)

Time: ~1-4 hours on a single RTX 4090 for 10K-50K triples. Data curation tip: mine hard negatives with a stronger retriever (BGE-large) then label with an LLM judge.
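That mining step is a small filter over retriever output. In this sketch, `retrieve_topk` is a stand-in for whatever retriever you use (e.g. BGE-large over your corpus), and LLM-judge labeling happens afterwards:

```python
def mine_hard_negatives(query, retrieve_topk, gold_ids, n_neg=4):
    """Hard negatives: top-ranked retrieved docs that are NOT labeled relevant.
    retrieve_topk(query) -> list of (doc_id, doc_text), best first."""
    negatives = [
        (doc_id, text)
        for doc_id, text in retrieve_topk(query)
        if doc_id not in gold_ids
    ]
    return negatives[:n_neg]
```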


Distilling Larger Rerankers {#distillation}

Use bge-reranker-v2-gemma (9B) to score 100K-1M (query, doc) pairs from your domain. SFT BGE-Reranker-v2-m3 on those scores. Captures most of gemma's quality at base latency.

# Pseudocode
big_reranker = CrossEncoder("BAAI/bge-reranker-v2-gemma")
small_reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", num_labels=1)

# Batch the teacher's scoring instead of calling it one pair at a time
scores = big_reranker.predict(pairs, batch_size=32)
scored_pairs = [
    InputExample(texts=[q, d], label=float(score))
    for (q, d), score in zip(pairs, scores)
]

# Distill
small_reranker.fit(DataLoader(scored_pairs, batch_size=32), epochs=2, ...)

See Knowledge Distillation Guide.


ColBERT and Late Interaction {#colbert}

ColBERT (Khattab & Zaharia, 2020) and ColBERT-v2 / PLAID encode each query token and each document token into separate vectors. Relevance = sum over query tokens of max similarity to any document token.
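This "max over document tokens, sum over query tokens" rule (MaxSim) is a few lines of numpy — a sketch assuming both sides are already encoded into L2-normalized token vectors of shape (num_tokens, dim):

```python
import numpy as np

def maxsim_score(q_vecs, d_vecs):
    """ColBERT late interaction: for each query token vector, take the max
    cosine similarity over all document token vectors, then sum."""
    sim = q_vecs @ d_vecs.T            # (q_tokens, d_tokens) similarity matrix
    return float(sim.max(axis=1).sum())
```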

Trade-offs:

  • Faster than cross-encoder (~5x at K=100)
  • Slower than bi-encoder (~10x at retrieval)
  • Storage: 10-100x bi-encoder (one vector per token)

When ColBERT helps: high-precision retrieval over medium-size corpora (1K-10K docs) where you can afford the storage. Modern variants (jina-colbert-v2) reduce storage and improve quality.

For 2026: ColBERT is a niche tool. The standard bi-encoder + cross-encoder pipeline is simpler and usually equivalent quality.


Hybrid Retrieval (BM25 + Dense + Rerank) {#hybrid}

Best-quality RAG retrieval pipeline:

1. BM25 (lexical): retrieve top 50 (catches exact-keyword matches)
2. Dense bi-encoder: retrieve top 50 (catches semantic matches)
3. Reciprocal Rank Fusion (RRF): merge into top 100 candidates
4. Cross-encoder rerank: top 10
5. LLM generate

RRF formula: score(d) = Σᵢ 1 / (k + rankᵢ(d)), summed over each retrieval list i; k = 60 is the standard constant.
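A minimal implementation of that formula — list order encodes rank, and the underlying retrievers' raw scores are deliberately ignored:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, top_n=100):
    """Reciprocal Rank Fusion over several ranked doc-id lists
    (e.g. one from BM25, one from the dense retriever)."""
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```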

This pattern (used by Cohere, OpenAI's RAG cookbooks, and most production systems) gives the most robust retrieval — covers cases where dense embeddings miss exact terms (e.g., product codes, brand names) and BM25 misses paraphrases.


Multi-lingual Reranking {#multilingual}

For multi-lingual RAG (cross-language Q&A or multi-language corpora):

| Reranker | Language Coverage |
| --- | --- |
| BGE-Reranker-v2-m3 | 100+ languages |
| Jina Reranker v2 base multilingual | 30+ languages |
| Cohere Rerank 3 | 100+ languages |
| Voyage-rerank-2-multilingual | 100+ languages |

For Asian-language workloads (Chinese, Japanese, Korean): BGE family is strongest (trained on substantial Chinese data). For European languages: all major rerankers work well.

For cross-language retrieval (English query, Chinese documents): use multilingual reranker; quality is typically 5-10 points below same-language but still usable.


Production Best Practices {#production}

  1. Truncate documents to <512 tokens — most rerankers degrade at longer contexts and latency scales linearly
  2. Batch reranking calls — 100 pairs in one batch, not 100 sequential calls
  3. Cache scores — for stable corpora, (query, doc) score pairs can be cached when queries repeat
  4. Use TEI in production — better batching and ONNX acceleration than raw sentence-transformers
  5. Fine-tune on domain — even 1K labeled triples gives measurable gains
  6. Monitor recall@K from retrieval — if BM25/dense misses relevant docs, rerank can't recover
  7. Combine with hybrid retrieval — BM25 + dense + RRF + rerank is the gold standard
  8. Profile latency — TTFT budget should explicitly include retrieve + rerank time
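Best practice #3 can be sketched as a thin wrapper around any scorer: an in-memory dict keyed by content hashes. This is a sketch only — a production version would use something like Redis with TTLs — and `scorer` is assumed to be any callable with a CrossEncoder.predict-style signature:

```python
import hashlib

class RerankCache:
    """Naive in-memory (query, doc) score cache for stable corpora."""

    def __init__(self, scorer):
        self.scorer = scorer   # callable: list of (query, doc) -> list of scores
        self.cache = {}

    def _key(self, query, doc):
        return hashlib.sha256(f"{query}\x00{doc}".encode()).hexdigest()

    def score(self, query, docs):
        # Score only the pairs we haven't seen before, in one batch
        missing = [d for d in docs if self._key(query, d) not in self.cache]
        if missing:
            scores = self.scorer([(query, d) for d in missing])
            for d, s in zip(missing, scores):
                self.cache[self._key(query, d)] = s
        return [self.cache[self._key(query, d)] for d in docs]
```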

Troubleshooting {#troubleshooting}

| Symptom | Cause | Fix |
| --- | --- | --- |
| Reranker doesn't improve quality | Bi-encoder already returns relevant docs | Increase K to give the reranker harder cases |
| Latency too high | Too many candidates or too-large model | Lower K; switch to mxbai-base or Jina |
| Long documents truncated | max_length=512 default | Chunk documents to fit, or use Cohere (4K context) |
| Multi-language quality low | Wrong reranker | Use BGE-Reranker-v2-m3 or Cohere Rerank 3 |
| Reranker scores all 0 | Sigmoid not applied | Set normalize=True, or raw_scores=false in TEI |
| OOM during reranking | Batch size too large | Lower batch_size in predict() |
| Reranker worse than bi-encoder | Wrong model | Verify it's a cross-encoder, not a mislabeled bi-encoder |
| No GPU acceleration | Wrong device | Set device="cuda" explicitly in the CrossEncoder constructor |

FAQ {#faq}



Sources: BGE-Reranker-v2-m3 model card | BGE paper (arXiv 2402.03216) | ColBERT paper (Khattab & Zaharia, 2020) | Jina Reranker v2 announcement | Cohere Rerank docs | Text Embeddings Inference (TEI) | BEIR benchmark | Internal benchmarks H100 + RTX 4090.



Written by Pattanaik Ramswarup

Creator of Local AI Master

I build Local AI Master around practical, testable local AI workflows: model selection, hardware planning, RAG systems, agents, and MLOps. The goal is to turn scattered tutorials into a structured learning path you can follow on your own hardware.
