
Better RAG: Add a Reranker to Cut Your Top-20 Candidates Down to the Right Top 3
Chris Harper
Jun 24, 2026 · 21:07 UTC
TL;DR: A cross-encoder reads each query-chunk pair together and re-scores them — fixing the precision gap that makes vector-only RAG hallucinate, at 100-250ms of added latency per request.
What you'll be able to do after this:
- Understand why bi-encoder retrieval gets recall right but precision wrong — and why a cross-encoder fixes it
- Implement a two-stage pipeline: retrieve 20 candidates with a fast bi-encoder, rerank to the top 3 with a cross-encoder
- Choose between self-hosted (BGE-reranker, FlashRank) and hosted (Cohere Rerank, Pinecone Rerank) options
The precision problem
From the last post: hybrid search gives you better recall. But your retriever can still return a top 3 where #1 is semantically close but substantively wrong: same topic, wrong answer. The culprit is how bi-encoders work. They encode the query and each chunk separately, then compare the resulting vectors. That is fast, but the model never reads the query and chunk together, so subtle signals such as negation ("not the timeout setting"), numeric precision, and domain jargon slip through.
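To make the "encoded separately" point concrete, here is a deliberately crude sketch. It assumes a bag-of-words count vector as a stand-in for a learned embedding (a real bi-encoder is a neural model and is far less lossy than this), and the `encode` and `cosine` helpers are illustrative names, not part of any library. The only point is that each chunk's vector is fixed before the query is known, so anything the vector fails to capture cannot be recovered at query time.

```python
import math
from collections import Counter

def encode(text: str) -> Counter:
    """Toy stand-in for a bi-encoder: a bag-of-words count vector.
    Assumption: a real bi-encoder is a neural model, not word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

chunks = [
    "retry after the timeout expires",
    "the timeout expires after retry",
]
# Chunk vectors are computed once, before any query is known.
chunk_vectors = [encode(chunk) for chunk in chunks]

query_vector = encode("what happens after the timeout")
scores = [cosine(query_vector, vec) for vec in chunk_vectors]

# Both chunks contain the same words, so this toy encoder scores them identically
# even though they describe different orderings of events.
print(scores[0] == scores[1])
```

The two chunks say different things, yet the toy encoder gives them the same score for every possible query, because the comparison only ever sees two independently built vectors.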
A cross-encoder fixes this. It takes the query and a candidate chunk as a single combined input and outputs one relevance score. That is much more accurate but also much slower, so you can't run it against every chunk in a million-chunk corpus. Two-stage retrieval captures the best of both:
- Stage 1 (recall): bi-encoder retrieves top-20 candidates in milliseconds
- Stage 2 (precision): cross-encoder re-scores those 20, returns top-3 to the LLM
The LLM sees 3 highly relevant chunks instead of 20 mediocre ones — smaller context, less hallucination, better answers.
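The two stages above can be sketched as a single function. This is a minimal outline under stated assumptions, not a production implementation: `two_stage_retrieve`, `recall_score`, and `rerank_score` are hypothetical names, both scorers are passed in as plain callables, and stage 1 scores every chunk in a loop, whereas a real system would query a vector index instead.

```python
from typing import Callable, List, Sequence

ScoreFn = Callable[[str, str], float]

def two_stage_retrieve(
    query: str,
    corpus: Sequence[str],
    recall_score: ScoreFn,   # cheap scorer, stands in for the bi-encoder
    rerank_score: ScoreFn,   # expensive scorer, stands in for the cross-encoder
    recall_k: int = 20,
    final_k: int = 3,
) -> List[str]:
    """Retrieve recall_k candidates cheaply, then rerank them and keep final_k."""
    # Stage 1 (recall): rank the whole corpus with the cheap scorer.
    # Assumption: a real pipeline would use an approximate nearest-neighbor index here.
    candidates = sorted(
        corpus, key=lambda chunk: recall_score(query, chunk), reverse=True
    )[:recall_k]

    # Stage 2 (precision): the expensive scorer only sees the shortlist.
    reranked = sorted(
        candidates, key=lambda chunk: rerank_score(query, chunk), reverse=True
    )
    return reranked[:final_k]
```

The design point is that `rerank_score` is called at most `recall_k` times per query regardless of corpus size, which is what keeps the cross-encoder affordable.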
Code: drop-in reranker with sentence-transformers
```shell
pip install sentence-transformers
```

```python
from sentence_transformers import CrossEncoder

# Small MS MARCO cross-encoder; the weights download on first use.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the connection timeout setting?"
candidates = [
    "The default connection wait time is 30 seconds.",
    "Set ANTHROPIC_API_KEY in your environment.",
    "ERR_TLS_CERT_EXPIRED means the server certificate has expired.",
    "Set connect_timeout in your database config to override the default.",
    "Connection pooling is managed by the database driver.",
]

# Score each (query, chunk) pair jointly; predict() returns one score per pair.
pairs = [[query, chunk] for chunk in candidates]
scores = reranker.predict(pairs)

# Sort by score only (highest first) and keep the top 3 chunks.
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top_3 = [chunk for _, chunk in ranked[:3]]
```
On CPU, ms-marco-MiniLM-L-6-v2 scores 50 candidates in 100–250ms — a worthwhile trade for precision. Swap in BAAI/bge-reranker-v2-m3 (568M params, multilingual) for best free accuracy (~80ms batched). For 15–30ms CPU reranking, try FlashRank (quantized, purpose-built for speed).
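Latency figures like these depend heavily on hardware, batch size, and chunk length, so it is worth measuring on your own machine before committing to a model. The helper below is a small sketch using only the standard library; `measure_latency_ms` is a hypothetical name, and the function you pass in is assumed to be whatever performs your rerank call.

```python
import statistics
import time
from typing import Any, Callable

def measure_latency_ms(fn: Callable[..., Any], *args: Any, repeats: int = 5) -> float:
    """Call fn(*args) several times and return the median wall-clock time in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append((time.perf_counter() - start) * 1000.0)
    # The median is less sensitive than the mean to one slow warm-up call.
    return statistics.median(timings)
```

With the reranker from the snippet above, a call might look like `measure_latency_ms(reranker.predict, pairs)`. Compare the result against your per-request latency budget rather than against published numbers.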
Options at a glance
| Option | Latency | Cost | Notes |
|---|---|---|---|
| ms-marco-MiniLM-L-6-v2 | 100-250ms | Free | Fast, good baseline |
| BAAI/bge-reranker-v2-m3 | ~80ms batched | Free | Best free accuracy, multilingual |
| FlashRank | 15-30ms | Free | pip install flashrank, quantized |
| Cohere Rerank | 150-400ms + network | Pay per call | Best-in-class accuracy |
| Pinecone Rerank | Managed | Pay per call | One call if on Pinecone |
The Pinecone "Rerankers and Two-Stage Retrieval" tutorial walks the full pipeline — retrieval, reranking, comparing results — with runnable Python. The YouTube full course covers LLM, cross-encoder, and rule-based reranking in one session.
Sources: Pinecone: Rerankers and Two-Stage Retrieval, Reranking for RAG — Full Course (YouTube), TDS: Advanced RAG Cross-Encoders, DEV.to: Production Reranker Layer