Better RAG: Add a Reranker to Cut Your Top-20 Candidates Down to the Right Top 3

Chris Harper

3 min read

Jun 24, 2026 · 21:07 UTC

AI
Tutorial
RAG
Embeddings
Best Practices

TL;DR: A cross-encoder reads each query-chunk pair together and re-scores them — fixing the precision gap that makes vector-only RAG hallucinate, at 100-250ms of added latency per request.

What you'll be able to do after this:

  • Understand why bi-encoder retrieval gets recall right but precision wrong — and why a cross-encoder fixes it
  • Implement a two-stage pipeline: retrieve 20 candidates with a fast bi-encoder, rerank to the top 3 with a cross-encoder
  • Choose between self-hosted (BGE-reranker, FlashRank) and hosted (Cohere Rerank, Pinecone Rerank) options

The precision problem

From the last post: hybrid search gives you better recall. But your retriever can still return a top-3 where #1 is semantically close but factually wrong: same topic, wrong answer. The culprit is how bi-encoders work. They encode the query and each chunk separately, then compare the resulting vectors. That is fast, but the model never reads the query and the chunk together, so subtle signals such as negation ("not the timeout setting"), numeric precision, and domain jargon slip through.
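To make "encode separately, then compare vectors" concrete, here is a minimal sketch of the data flow. The `toy_embed` function is a hypothetical stand-in (a bag-of-words count vector), not a real embedding model, and the example chunks are illustrative. The point is only that each chunk vector is built before any query exists, so nothing in it can depend on the query's wording.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a bi-encoder: a bag-of-words count vector."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


chunks = [
    "Set connect_timeout in your database config to override the default.",
    "Connection pooling is managed by the database driver.",
]

# Indexing time: every chunk is embedded on its own, with no query in sight.
chunk_vectors = [toy_embed(chunk) for chunk in chunks]

# Query time: the query is embedded on its own, then compared vector to vector.
query_vector = toy_embed("How do I override the connection timeout?")
similarities = [cosine(query_vector, vec) for vec in chunk_vectors]

for chunk, sim in zip(chunks, similarities):
    print(f"{sim:.3f}  {chunk}")
```

A real bi-encoder replaces `toy_embed` with a neural model, but the shape is the same: two independent encodings and one cheap comparison.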

A cross-encoder fixes this. It takes the query and a candidate chunk as a single combined input and outputs one relevance score. That is much more accurate, but also much slower: it needs one model pass per query-chunk pair, so you can't run it against every chunk in a million-chunk index. Two-stage retrieval captures the best of both:

  • Stage 1 (recall): bi-encoder retrieves top-20 candidates in milliseconds
  • Stage 2 (precision): cross-encoder re-scores those 20, returns top-3 to the LLM

The LLM sees 3 highly relevant chunks instead of 20 mediocre ones — smaller context, less hallucination, better answers.
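The two stages can be sketched as one small function. This is an illustration under stated assumptions, not a production implementation: `two_stage`, `toy_retrieve`, `toy_rerank`, and `CORPUS` are hypothetical names, and the two toy scorers (word overlap and Jaccard similarity) only stand in for a real vector search and a real cross-encoder so the example runs without any model.

```python
from typing import Callable, List

# Hypothetical corpus; in practice this lives in your vector store.
CORPUS = [
    "The default connection wait time is 30 seconds.",
    "Set connect_timeout in your database config to override the default.",
    "Connection pooling is managed by the database driver.",
    "ERR_TLS_CERT_EXPIRED means the server certificate has expired.",
]


def _tokens(text: str) -> set:
    return set(text.lower().replace("?", "").replace(".", "").split())


def toy_retrieve(query: str, k: int) -> List[str]:
    """Stage 1 stand-in: rank the whole corpus by a cheap score, keep k."""
    q = _tokens(query)
    return sorted(CORPUS, key=lambda c: len(q & _tokens(c)), reverse=True)[:k]


def toy_rerank(query: str, candidates: List[str]) -> List[float]:
    """Stage 2 stand-in: one score per (query, candidate) pair."""
    q = _tokens(query)
    return [len(q & _tokens(c)) / len(q | _tokens(c)) for c in candidates]


def two_stage(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    rerank: Callable[[str, List[str]], List[float]],
    k_candidates: int = 20,
    k_final: int = 3,
) -> List[str]:
    # Stage 1 (recall): cheap retrieval over the full corpus.
    candidates = retrieve(query, k_candidates)
    # Stage 2 (precision): expensive scoring over the short list only.
    scores = rerank(query, candidates)
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k_final]]


top = two_stage("What is the connection timeout setting?", toy_retrieve, toy_rerank)
print(top)
```

To use a real cross-encoder, pass a `rerank` callable that builds `[query, chunk]` pairs and returns the model's scores, and a `retrieve` callable that wraps your vector store's top-k query.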

Code: drop-in reranker with sentence-transformers

# Install the dependency first (shell command): pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "What is the connection timeout setting?"
candidates = [
    "The default connection wait time is 30 seconds.",
    "Set ANTHROPIC_API_KEY in your environment.",
    "ERR_TLS_CERT_EXPIRED means the server certificate has expired.",
    "Set connect_timeout in your database config to override the default.",
    "Connection pooling is managed by the database driver.",
]

pairs = [[query, chunk] for chunk in candidates]
scores = reranker.predict(pairs)   # one score per pair, rated jointly

# Sort descending, take top 3
ranked = sorted(zip(scores, candidates), reverse=True)
top_3 = [chunk for _, chunk in ranked[:3]]

On CPU, ms-marco-MiniLM-L-6-v2 scores 50 candidates in 100–250ms, a worthwhile trade for precision. For the best accuracy among the free options, swap in BAAI/bge-reranker-v2-m3 (568M params, multilingual; ~80ms batched). For 15–30ms CPU reranking, try FlashRank (quantized, purpose-built for speed).
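Latency figures like these vary with hardware, batch size, and chunk length, so it is worth measuring on your own setup. Below is a minimal timing sketch using only the standard library. `measure_latency_ms` and `dummy_rerank` are hypothetical names; the dummy scorer exists only so the example runs without a model, and you would pass your real rerank function instead.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(
    rerank: Callable[[str, List[str]], List[float]],
    query: str,
    candidates: List[str],
    runs: int = 10,
    warmup: int = 2,
) -> Dict[str, float]:
    """Time repeated rerank calls and report median and worst case in ms."""
    # Warm-up calls are excluded so one-time costs do not skew the numbers.
    for _ in range(warmup):
        rerank(query, candidates)

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        rerank(query, candidates)
        samples.append((time.perf_counter() - start) * 1000.0)

    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


def dummy_rerank(query: str, candidates: List[str]) -> List[float]:
    """Hypothetical stand-in for a real reranker call."""
    return [float(len(chunk)) for chunk in candidates]


stats = measure_latency_ms(dummy_rerank, "connection timeout", ["some chunk"] * 20)
print(stats)
```

Measure with the same number of candidates you plan to rerank in production (20 in the pipeline above), since cross-encoder cost grows with the number of pairs.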

Options at a glance

Option                  | Latency         | Cost         | Notes
ms-marco-MiniLM-L-6-v2  | 100-250ms       | Free         | Fast, good baseline
BAAI/bge-reranker-v2-m3 | ~80ms batched   | Free         | Best free accuracy, multilingual
FlashRank               | 15-30ms         | Free         | pip install flashrank, quantized
Cohere Rerank           | 150-400ms + net | Pay per call | Best-in-class accuracy
Pinecone Rerank         | Managed         | Pay per call | One call if on Pinecone

The Pinecone "Rerankers and Two-Stage Retrieval" tutorial walks the full pipeline — retrieval, reranking, comparing results — with runnable Python. The YouTube full course covers LLM, cross-encoder, and rule-based reranking in one session.

Sources: Pinecone: Rerankers and Two-Stage Retrieval, Reranking for RAG — Full Course (YouTube), TDS: Advanced RAG Cross-Encoders, DEV.to: Production Reranker Layer