
Did Your Fine-Tune Actually Improve? Benchmark It Against Standard Tasks with Lighteval
Chris Harper
Jul 2, 2026 · 20:07 UTC
TL;DR: Use Lighteval to run MMLU, GSM8K, and HumanEval on your fine-tuned model in one command — catch regressions before you ship it.
What you'll be able to do after this:
- Compare your fine-tuned model against the base on benchmarks that match your task
- Catch regressions on general capabilities you didn't train for (the most common fine-tune failure mode)
- Add a custom eval task for behavior your standard benchmarks don't cover
A fine-tune that "feels better" on your test prompts can still regress on general capabilities. Systematic benchmarking is your regression suite. Hugging Face's LLM Course Chapter 11.5: Evaluation walks through picking the right benchmarks, running them with Lighteval, and interpreting the results. It's the clearest practical intro in the official HF curriculum.
Install and run:
pip install lighteval
# Evaluate a HF Hub model or a local merged checkpoint
lighteval accelerate "pretrained=your-org/your-fine-tuned-model" "humaneval|0|0" "gsm8k|5|0" "mmlu|abstract_algebra|0|0" --max_samples 100 --output_path "./eval-results"
For a local checkpoint (e.g., after a LoRA merge), replace the Hub name with the local path: pretrained=/path/to/merged-model.
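If you only have a LoRA adapter and no merged checkpoint yet, the merge can be sketched with the peft and transformers libraries as below. This is a minimal sketch, not the article's own procedure: the function name merge_lora_adapter and all three paths are hypothetical, and it assumes a causal language model whose adapter was trained with peft.

```python
def merge_lora_adapter(base_model_id: str, adapter_path: str, output_dir: str) -> None:
    """Fold LoRA weights into the base model and save a standalone checkpoint."""
    # Heavy dependencies are imported lazily so that importing this module stays cheap.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    model = PeftModel.from_pretrained(base, adapter_path)

    # merge_and_unload() returns the base model with the adapter weights merged in.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)

    # Save the tokenizer alongside so the directory loads as a complete checkpoint.
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)


# Example call with placeholder names (all three are hypothetical):
# merge_lora_adapter("your-org/base-model", "./lora-adapter", "/path/to/merged-model")
```

The output directory is then what you pass as pretrained=/path/to/merged-model.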
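Task format: {suite}|{task}|{num_few_shot}|{auto_reduce}. Here num_few_shot is the number of few-shot examples placed in the prompt, and auto_reduce (0 or 1) controls whether Lighteval may lower that number automatically when the prompt is too long. Only the mmlu entry in the command above spells out all four fields, and task names and flags have changed between Lighteval releases, so check each string against the task list for your installed version (lighteval tasks list in recent releases).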
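The four-field format can be checked mechanically before you launch a long run. The sketch below uses a hypothetical helper, parse_task_spec, which is not part of Lighteval. It only validates the shape of the string as described above, not whether the suite or task actually exists in your installed version.

```python
from typing import NamedTuple


class TaskSpec(NamedTuple):
    suite: str
    task: str
    num_few_shot: int
    auto_reduce: bool


def parse_task_spec(spec: str) -> TaskSpec:
    """Split a 'suite|task|num_few_shot|auto_reduce' string into typed fields."""
    parts = spec.split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}: {spec!r}")
    suite, task, few_shot, auto_reduce = parts
    if auto_reduce not in ("0", "1"):
        raise ValueError(f"auto_reduce must be 0 or 1, got {auto_reduce!r}")
    # int() raises ValueError itself if num_few_shot is not a number.
    return TaskSpec(suite, task, int(few_shot), auto_reduce == "1")


# One well-formed spec and one with a missing field.
for spec in ["mmlu|abstract_algebra|0|0", "gsm8k|5|0"]:
    try:
        print(parse_task_spec(spec))
    except ValueError as err:
        print(f"rejected: {err}")
```

A three-field string such as gsm8k|5|0 is rejected here, which is a cheap way to catch a missing suite before the evaluation starts.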
Match benchmarks to your use case:
| Use case | Suggested benchmarks |
|---|---|
| General knowledge / chat | MMLU (57 subjects), TruthfulQA |
| Code generation | HumanEval (164 Python problems) |
| Math / multi-step reasoning | GSM8K (grade-school math word problems), BBH (BIG-Bench Hard) |
| Instruction following | AlpacaEval (GPT-4 as judge) |
Always add a task-specific eval. Standard benchmarks test generic capabilities. If you fine-tuned for function calling, tool format adherence, or domain tone, also write a custom Lighteval task on your holdout set; the HF course chapter covers how to define custom tasks. Then run the base model and your fine-tune with the same tasks, few-shot settings, and sample cap, and compare the scores task by task to spot exactly where you gained and where you regressed.
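With a cap like --max_samples 100, small score differences are mostly noise, so it helps to separate real shifts from sampling error. The sketch below is one way to do that, assuming you have already collected per-task accuracies into plain dicts. The function compare_scores and every number in the example are hypothetical and made up for illustration. It treats the two runs as independent samples, which is a simplification because both models are scored on the same examples.

```python
import math


def compare_scores(base, tuned, n_samples, z=1.96):
    """Label each shared task as gain, regression, or within noise.

    base, tuned: dicts mapping task name -> accuracy in [0, 1].
    n_samples: number of examples scored per task (e.g. the --max_samples value).
    z: normal quantile for the threshold; 1.96 is roughly a 95% interval.
    """
    report = {}
    for task in sorted(base.keys() & tuned.keys()):
        p_base, p_tuned = base[task], tuned[task]
        # Standard error of the difference between two independent proportions.
        se = math.sqrt(
            p_base * (1 - p_base) / n_samples + p_tuned * (1 - p_tuned) / n_samples
        )
        delta = p_tuned - p_base
        if abs(delta) <= z * se:
            verdict = "within noise"
        elif delta > 0:
            verdict = "gain"
        else:
            verdict = "regression"
        report[task] = (round(delta, 3), verdict)
    return report


# Hypothetical scores, invented for this example only.
base_scores = {"gsm8k": 0.40, "mmlu": 0.62, "humaneval": 0.30}
tuned_scores = {"gsm8k": 0.65, "mmlu": 0.45, "humaneval": 0.33}

for task, (delta, verdict) in compare_scores(base_scores, tuned_scores, n_samples=100).items():
    print(f"{task:10s} {delta:+.3f}  {verdict}")
```

At 100 samples per task the noise band is roughly 13 percentage points wide in either direction for mid-range accuracies, so a 3-point change tells you little. Raise the sample cap before drawing conclusions from small deltas.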
Sources: Hugging Face LLM Course: Evaluation · Lighteval docs · philschmid: Evaluate LLMs with lm-eval + vLLM