
Better RAG: Measure Whether Your Pipeline Is Actually Working with RAGAS
Chris Harper
4 min read
Jun 25, 2026 · 21:05 UTC
TL;DR: RAGAS gives your RAG pipeline four objective scores — faithfulness, context recall, context precision, answer relevancy — so you can stop guessing whether hybrid search and reranking actually improved anything.
What you'll be able to do after this:
- Detect hallucination: measure whether your model's answers are actually grounded in the retrieved context (faithfulness score)
- Diagnose retrieval gaps: catch when your retriever is missing relevant chunks before they reach the LLM (context recall)
- Wire RAGAS into CI as a regression guard — fail the build if any metric drops after a retrieval config change
Why you need evaluation
The last three posts built a real retrieval stack: hybrid search for recall, reranking for precision, metadata filters for scope. But "it seems better" is not a measurement. RAGAS is the 2026 standard for scoring RAG pipelines — it evaluates the four things that actually matter:
| Metric | What it measures | What a low score tells you |
|---|---|---|
| Faithfulness | Are answers grounded in the retrieved context? | Your model is hallucinating — asserting facts not supported by the docs |
| Context recall | Does the retriever return all the chunks needed to answer? | Retriever is missing relevant docs — revisit chunking, hybrid search, or reranking |
| Context precision | Are retrieved chunks actually relevant, or is noise crowding out signal? | Tighten metadata filters or reduce top-k |
| Answer relevancy | Does the answer actually address the question? | Answer synthesis problem — check your prompt |
Run it in 10 lines
pip install ragas datasets langchain-openai
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
data = {
"question": ["What is the refund window?", "How do I reset my password?"],
"contexts": [
["Refunds are accepted within 30 days of purchase with original receipt."],
["Visit account settings and click 'Forgot password' to receive a reset email."],
],
"answer": [
"You can get a refund within 30 days.",
"Go to settings and use the forgot password option.",
],
"ground_truth": [
"Refunds are accepted within 30 days with original receipt.",
"Visit account settings and click 'Forgot password'.",
],
}
dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results)
# → {'faithfulness': 0.95, 'answer_relevancy': 0.88, 'context_precision': 0.91, 'context_recall': 0.84}
RAGAS uses an LLM as a structured judge internally — it calls your configured API key to score each answer against the retrieved context. Scores range 0–1; higher is better. A golden eval dataset of 20–50 representative questions runs in a few minutes.
Reading your scores
- Faithfulness < 0.8 → your model is asserting facts not in the retrieved context. Increase top-k to retrieve more supporting evidence, or strengthen your answer synthesis prompt to cite only what's retrieved.
- Context recall < 0.7 → your retriever is missing relevant chunks. Revisit chunking strategy (smaller chunks often help recall), add hybrid search, or check that metadata filters aren't over-restricting the search space.
- Context precision < 0.7 → too many irrelevant chunks are crowding the context window. Tighten your metadata filters or reduce top-k to cut noise.
- Answer relevancy < 0.8 → the answer isn't actually addressing the question, even if the retrieval was fine. Usually a prompt issue.
Wire it into CI
Build a fixed golden dataset of 20–50 representative questions, store it in your repo as evals/golden.json, and run evaluate() in CI after any retrieval config change. Fail the build if any metric drops more than 0.05 from baseline:
baseline = {"faithfulness": 0.92, "context_recall": 0.81, ...}
for metric, score in results.items():
if score < baseline[metric] - 0.05:
raise ValueError(f"Regression: {metric} dropped to {score:.2f}")
Treat RAGAS like unit tests for your retrieval stack. Every change to chunking, hybrid weights, reranker configuration, or metadata filters should run the eval suite.
Sources: INVRA: How to Evaluate RAG Pipelines with RAGAS (May 2026) | Analytics Vidhya: A Beginner's Guide to Evaluating RAG Pipelines Using RAGAS | How to Evaluate RAG Systems with RAGAS (Jan 2026)