Better RAG: Add a Reranker to Cut Your Top-20 Candidates Down to the Right Top 3

Chris Harper

3 min read

Jun 24, 2026 · 21:07 UTC

Tutorial

RAG

Embeddings

Best Practices

TL;DR: A cross-encoder reads each query-chunk pair together and re-scores them — fixing the precision gap that makes vector-only RAG hallucinate, at 100-250ms of added latency per request.

What you'll be able to do after this:

Understand why bi-encoder retrieval gets recall right but precision wrong — and why a cross-encoder fixes it
Implement a two-stage pipeline: retrieve 20 candidates with a fast bi-encoder, rerank to the top 3 with a cross-encoder
Choose between self-hosted (BGE-reranker, FlashRank) and hosted (Cohere Rerank, Pinecone Rerank) options

The precision problem

From the last post: hybrid search gives you better recall. But your retriever can still return a top-3 where #1 is semantically close but factually wrong: same topic, wrong answer. The culprit is how bi-encoders work: they encode the query and each chunk separately, then compare vectors. That is fast, but the model never reads the query and chunk together, so subtle signals such as negation ("not the timeout setting"), numeric precision, and domain jargon slip through.
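To make "encode separately, then compare vectors" concrete, here is a minimal sketch of the bi-encoder scoring shape. The `toy_embed` function is a hypothetical stand-in (a bag-of-words count vector, not a real embedding model), and the two chunks are made up for illustration. What matters is the structure: chunk vectors are built without ever seeing the query.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative chunks (assumed for this sketch).
chunks = [
    "The connection timeout defaults to 30 seconds.",
    "Connection pooling is managed by the driver.",
]

# Chunk vectors are computed once, offline, with no knowledge of any query.
chunk_vectors = [toy_embed(chunk) for chunk in chunks]


def bi_encoder_scores(query: str) -> list[float]:
    """Encode the query on its own, then compare it to each stored vector."""
    query_vector = toy_embed(query)
    return [cosine(query_vector, vec) for vec in chunk_vectors]
```

A real bi-encoder swaps `toy_embed` for a neural model, but the shape is the same: the only interaction between query and chunk is the final vector comparison, which is why it is fast and why fine-grained wording can get lost.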

A cross-encoder fixes this. It takes the query and a candidate chunk as a single combined input and outputs one relevance score. Much more accurate. Also much slower: every query-chunk pair needs its own forward pass at query time and nothing can be precomputed, so you can't run it against every chunk in a million-chunk index. Two-stage retrieval captures the best of both:

Stage 1 (recall): bi-encoder retrieves top-20 candidates in milliseconds
Stage 2 (precision): cross-encoder re-scores those 20, returns top-3 to the LLM

The LLM sees 3 highly relevant chunks instead of 20 mediocre ones — smaller context, less hallucination, better answers.
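The two stages above can be sketched as one small function. This is a minimal sketch, not a definitive implementation: `two_stage_retrieve`, `retrieve`, and `score_pairs` are hypothetical names, and both stages are passed in as callables so you can plug in whatever retriever and reranker you already use.

```python
from typing import Callable, Sequence


def two_stage_retrieve(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    recall_k: int = 20,
    final_k: int = 3,
) -> list[str]:
    """Stage 1: fetch recall_k candidates. Stage 2: rerank and keep final_k."""
    # Stage 1 (recall): cheap retrieval, e.g. a vector or hybrid search.
    candidates = retrieve(query, recall_k)
    if not candidates:
        return []

    # Stage 2 (precision): score every (query, chunk) pair jointly.
    scores = score_pairs([(query, chunk) for chunk in candidates])

    # Sort by score only, highest first, and keep the best final_k chunks.
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:final_k]]
```

With the sentence-transformers `CrossEncoder` used later in this post, `score_pairs` could be the model's `predict` method; `retrieve` is whatever your vector store exposes, wrapped to return plain chunk strings.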

Code: drop-in reranker with sentence-transformers

pip install sentence-transformers

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "What is the connection timeout setting?"
candidates = [
    "The default connection wait time is 30 seconds.",
    "Set ANTHROPIC_API_KEY in your environment.",
    "ERR_TLS_CERT_EXPIRED means the server certificate has expired.",
    "Set connect_timeout in your database config to override the default.",
    "Connection pooling is managed by the database driver.",
]

# Build one (query, chunk) pair per candidate
pairs = [[query, chunk] for chunk in candidates]
scores = reranker.predict(pairs)   # one score per pair; query and chunk are read together

# Sort by score only (descending), then take the top 3
ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
top_3 = [chunk for _, chunk in ranked[:3]]

On CPU, ms-marco-MiniLM-L-6-v2 scores 50 candidates in 100–250ms, a worthwhile trade for precision. For the best free accuracy, swap in BAAI/bge-reranker-v2-m3 (568M params, multilingual; ~80ms batched). For 15–30ms CPU reranking, try FlashRank (quantized, purpose-built for speed). Treat these figures as rough guides: latency depends on hardware, batch size, and chunk length.
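Since latency numbers like these shift with your hardware and chunk sizes, it is worth timing the reranker on your own data. Here is a minimal sketch using only the standard library; `measure_latency_ms` is a hypothetical helper name, and `score_pairs` is any callable that scores a list of (query, chunk) pairs.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency_ms(
    score_pairs: Callable[[list], Sequence[float]],
    pairs: list,
    runs: int = 10,
    warmup: int = 2,
) -> dict:
    """Time repeated reranker calls and report median and worst-case latency."""
    # Warm-up calls absorb one-off costs (lazy model loading, cache fills).
    for _ in range(warmup):
        score_pairs(pairs)

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        score_pairs(pairs)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }
```

Pass it the same pairs you would send in production (same candidate count, realistic chunk lengths) so the numbers reflect the latency your users would actually see.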

Options at a glance

| Option | Latency | Cost | Notes |
| --- | --- | --- | --- |
| `ms-marco-MiniLM-L-6-v2` | 100-250ms | Free | Fast, good baseline |
| `BAAI/bge-reranker-v2-m3` | ~80ms batched | Free | Best free accuracy, multilingual |
| FlashRank | 15-30ms | Free | `pip install flashrank`, quantized |
| Cohere Rerank | 150-400ms + network | Pay per call | Best-in-class accuracy |
| Pinecone Rerank | Managed | Pay per call | One call if on Pinecone |

The Pinecone "Rerankers and Two-Stage Retrieval" tutorial walks through the full pipeline (retrieval, reranking, comparing results) with runnable Python. The "Reranking for RAG" full course on YouTube covers LLM, cross-encoder, and rule-based reranking in one session.

Sources: Pinecone: Rerankers and Two-Stage Retrieval, Reranking for RAG — Full Course (YouTube), TDS: Advanced RAG Cross-Encoders, DEV.to: Production Reranker Layer

CloudCodeTree
