Better RAG: Measure Whether Your Pipeline Is Actually Working with RAGAS

Chris Harper

4 min read

Jun 25, 2026 · 21:05 UTC

Tutorial

RAG

Embeddings

Best Practices

TL;DR: RAGAS gives your RAG pipeline four objective scores — faithfulness, context recall, context precision, answer relevancy — so you can stop guessing whether hybrid search and reranking actually improved anything.

What you'll be able to do after this:

Detect hallucination: measure whether your model's answers are actually grounded in the retrieved context (faithfulness score)
Diagnose retrieval gaps: catch when your retriever is missing relevant chunks before they reach the LLM (context recall)
Wire RAGAS into CI as a regression guard — fail the build if any metric drops after a retrieval config change

Why you need evaluation

The last three posts built a real retrieval stack: hybrid search for recall, reranking for precision, metadata filters for scope. But "it seems better" is not a measurement. RAGAS is the 2026 standard for scoring RAG pipelines — it evaluates the four things that actually matter:

Metric	What it measures	What a low score tells you
Faithfulness	Are answers grounded in the retrieved context?	Your model is hallucinating — asserting facts not supported by the docs
Context recall	Does the retriever return all the chunks needed to answer?	Retriever is missing relevant docs — revisit chunking, hybrid search, or reranking
Context precision	Are retrieved chunks actually relevant, or is noise crowding out signal?	Tighten metadata filters or reduce top-k
Answer relevancy	Does the answer actually address the question?	Answer synthesis problem — check your prompt

Run it in 10 lines

pip install ragas datasets langchain-openai

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

data = {
    "question": ["What is the refund window?", "How do I reset my password?"],
    "contexts": [
        ["Refunds are accepted within 30 days of purchase with original receipt."],
        ["Visit account settings and click 'Forgot password' to receive a reset email."],
    ],
    "answer": [
        "You can get a refund within 30 days.",
        "Go to settings and use the forgot password option.",
    ],
    "ground_truth": [
        "Refunds are accepted within 30 days with original receipt.",
        "Visit account settings and click 'Forgot password'.",
    ],
}

dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results)
# → {'faithfulness': 0.95, 'answer_relevancy': 0.88, 'context_precision': 0.91, 'context_recall': 0.84}

RAGAS uses an LLM as a structured judge internally — it calls your configured API key to score each answer against the retrieved context. Scores range 0–1; higher is better. A golden eval dataset of 20–50 representative questions runs in a few minutes.

Reading your scores

Faithfulness < 0.8 → your model is asserting facts not in the retrieved context. Increase top-k to retrieve more supporting evidence, or strengthen your answer synthesis prompt to cite only what's retrieved.
Context recall < 0.7 → your retriever is missing relevant chunks. Revisit chunking strategy (smaller chunks often help recall), add hybrid search, or check that metadata filters aren't over-restricting the search space.
Context precision < 0.7 → too many irrelevant chunks are crowding the context window. Tighten your metadata filters or reduce top-k to cut noise.
Answer relevancy < 0.8 → the answer isn't actually addressing the question, even if the retrieval was fine. Usually a prompt issue.

Wire it into CI

Build a fixed golden dataset of 20–50 representative questions, store it in your repo as evals/golden.json, and run evaluate() in CI after any retrieval config change. Fail the build if any metric drops more than 0.05 from baseline:

baseline = {"faithfulness": 0.92, "context_recall": 0.81, ...}
for metric, score in results.items():
    if score < baseline[metric] - 0.05:
        raise ValueError(f"Regression: {metric} dropped to {score:.2f}")

Treat RAGAS like unit tests for your retrieval stack. Every change to chunking, hybrid weights, reranker configuration, or metadata filters should run the eval suite.

Sources: INVRA: How to Evaluate RAG Pipelines with RAGAS (May 2026) | Analytics Vidhya: A Beginner's Guide to Evaluating RAG Pipelines Using RAGAS | How to Evaluate RAG Systems with RAGAS (Jan 2026)

CloudCodeTree

Better RAG: Measure Whether Your Pipeline Is Actually Working with RAGAS

Why you need evaluation

Run it in 10 lines

Reading your scores

Wire it into CI