
Unit-Test Your LLM App in 5 Minutes: DeepEval's Answer Relevancy, Faithfulness, and G-Eval
Chris Harper
Jul 3, 2026 · 12:03 UTC
TL;DR: DeepEval brings pytest-style unit testing to LLM apps — score answer relevancy, faithfulness, and custom criteria with an LLM judge in under 15 lines of Python.
What you'll be able to do after this:
- Write a test case that automatically scores your LLM's response on answer relevancy and faithfulness
- Define a custom G-Eval metric for any criterion your standard benchmarks don't cover (tone, format, safety)
- Run your eval suite in CI to catch output regressions before they reach users
"It worked in testing" often means "I read a few outputs and they looked fine." That doesn't scale. DeepEval is an open-source evaluation framework that works like pytest for LLM outputs: define test cases, pick metrics, then run the deepeval test run command. The metrics use an LLM judge internally, so they can score nuanced quality criteria that string matching can't.
Install:
pip install -U deepeval
export OPENAI_API_KEY=sk-... # key for the default judge model (GPT-4o); pass model=... to a metric to use a different judge
Write your first test case:
# test_my_llm.py
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval

# Placeholder: replace with the string your app actually returned.
your_llm_response = "..."

test_case = LLMTestCase(
    input="What are the benefits of prompt caching?",
    actual_output=your_llm_response,  # the string you're testing
    retrieval_context=["Prompt caching stores the KV cache..."],  # for RAG; omit if not RAG
)
Run built-in metrics:
relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.5)
evaluate([test_case], [relevancy, faithfulness])
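AnswerRelevancyMetric checks whether the response actually answers the question. FaithfulnessMetric checks whether the response's claims are grounded in the retrieval_context you supplied, so it flags hallucinated facts; it needs retrieval_context to be set, so drop it from the metric list for non-RAG test cases. Both score from 0 to 1, and a metric passes when its score meets or exceeds its threshold.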
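To make the scores concrete: DeepEval's docs describe faithfulness as the fraction of the response's claims that the retrieval context supports, and answer relevancy as the fraction of its statements that address the input. The sketch below reproduces only that final arithmetic. The per-claim verdicts, which the LLM judge would normally produce, are hard-coded, and ratio_score and passes are hypothetical helper names, not DeepEval functions.

```python
def ratio_score(verdicts: list[bool]) -> float:
    """Fraction of judged items (claims or statements) the judge accepted."""
    if not verdicts:
        # Assumption for this sketch: with nothing to judge, treat the score as perfect.
        return 1.0
    return sum(verdicts) / len(verdicts)


def passes(score: float, threshold: float) -> bool:
    """A metric passes when its score meets or exceeds the threshold."""
    return score >= threshold


# Made-up verdicts: the judge found two of three claims supported by the context.
claim_verdicts = [True, True, False]
faithfulness_score = ratio_score(claim_verdicts)

print(f"score={faithfulness_score:.2f}")
print("passes at 0.5:", passes(faithfulness_score, 0.5))
print("passes at 0.7:", passes(faithfulness_score, 0.7))
```

With these verdicts the same response clears a 0.5 threshold but not a 0.7 one, which is why the threshold is worth choosing per metric rather than copying one value everywhere.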
Add a custom G-Eval metric for anything else:
When no built-in metric fits, G-Eval lets you write the criterion in plain English. It uses chain-of-thought prompting to reason before scoring:
tone_check = GEval(
    name="Professional Tone",
    criteria="Determine if the response is professional and avoids colloquial language.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.6,
)
evaluate([test_case], [tone_check])
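The original G-Eval paper adds one more step after the chain-of-thought reasoning: rather than taking the single rating the judge emits, it weights each possible rating by the probability the judge assigned to it. Below is a minimal sketch of that arithmetic, assuming the paper's 1 to 5 scale and a made-up probability distribution. The renormalization and the rescaling to the 0 to 1 range are assumptions for illustration, the function names are hypothetical, and DeepEval's internal scale and normalization may differ.

```python
def weighted_rating(rating_probs: dict[int, float]) -> float:
    """Expected rating: sum of rating * probability over the candidate ratings."""
    total = sum(rating_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Assumption: renormalize in case some probability mass fell on non-rating tokens.
    return sum(rating * p for rating, p in rating_probs.items()) / total


def to_unit_interval(rating: float, low: int = 1, high: int = 5) -> float:
    """Assumed linear rescaling of a low..high rating onto 0..1."""
    return (rating - low) / (high - low)


# Made-up judge output: probability assigned to each rating token.
probs = {3: 0.2, 4: 0.5, 5: 0.3}
expected = weighted_rating(probs)
score = to_unit_interval(expected)

print(f"expected rating={expected:.2f}, normalized score={score:.3f}")
```

The weighting gives finer-grained scores than a single integer rating, so two responses that would both be rated "4" can still land on different sides of a threshold.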
Run in pytest (CI-ready):
deepeval test run test_my_llm.py
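deepeval test run is a pytest wrapper, so it only collects test_ functions; put each check inside one and call assert_test(test_case, metrics) rather than evaluate(...). Each failed metric prints the judge's reasoning, not just a score, so you can see why it failed. Gate your CI on eval pass rate the same way you gate it on unit tests.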
Sources: DeepEval 5-min Quickstart · G-Eval metric · AnswerRelevancy metric · Faithfulness metric