Did Your Fine-Tune Actually Improve? Benchmark It Against Standard Tasks with Lighteval

Chris Harper

Jul 2, 2026 · 20:07 UTC

TL;DR: Use Lighteval to run MMLU, GSM8K, and HumanEval on your fine-tuned model in one command — catch regressions before you ship it.

What you'll be able to do after this:

Compare your fine-tuned model against the base on benchmarks that match your task
Catch regressions on general capabilities you didn't train for (the most common fine-tune failure mode)
Add a custom eval task for behavior your standard benchmarks don't cover

A fine-tune that "feels better" on your test prompts can still regress on general capabilities. Systematic benchmarking is your regression suite. HuggingFace's LLM Course Chapter 11.5: Evaluation walks through picking the right benchmarks, running them with Lighteval, and interpreting the results — it's the clearest practical intro in the official HF curriculum.

Install and run:

pip install lighteval

# Evaluate a HF Hub model or a local merged checkpoint
lighteval accelerate \
  "pretrained=your-org/your-fine-tuned-model" \
  "humaneval|0|0" \
  "gsm8k|5|0" \
  "mmlu|abstract_algebra|0|0" \
  --max_samples 100 \
  --output_path "./eval-results"

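The --max_samples 100 flag caps each task at 100 examples, which keeps the run quick but makes the scores noisy; remove it for final numbers. For a local checkpoint (e.g., after a LoRA merge), replace the Hub name with the local path: pretrained=/path/to/merged-model.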
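If you are starting from a LoRA adapter rather than an already merged model, a merge step along these lines produces the local directory to pass as pretrained=. This is a minimal sketch, assuming the adapter was trained with the peft library on a causal language model and that fine-tuning did not add tokens to the tokenizer. The function name merge_lora and the three arguments are placeholders for illustration, not something taken from the course.

```python
def merge_lora(base_model_id: str, adapter_dir: str, output_dir: str) -> str:
    """Fold LoRA adapter weights into the base model and save a standalone checkpoint."""
    # Imported lazily so the function can be defined without loading the heavy libraries.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    # Attach the adapter, then merge its low-rank updates into the base weights.
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(output_dir)
    # Save the tokenizer alongside so the directory loads as a complete model.
    # Assumption: the base tokenizer is unchanged; load it from adapter_dir if you added tokens.
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)
    return output_dir


# Example call (placeholder paths, not run here):
# merge_lora("your-org/base-model", "/path/to/lora-adapter", "/path/to/merged-model")
```

Merging first keeps the evaluation command identical for the base model and the fine-tune, since both are then plain checkpoints.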

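Task format: {suite}|{task}|{num_few_shot}|{auto_reduce}, where num_few_shot is the number of in-context examples and auto_reduce (0 or 1) lets Lighteval lower that number automatically when the prompt would not fit in the context window. Every task string needs all four fields, so "mmlu|abstract_algebra|0|0" is complete, while "humaneval|0|0" and "gsm8k|5|0" are each one field short. Task names and CLI syntax have changed between Lighteval releases, so check the task list in the Lighteval docs for your installed version before running.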
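The four-field format is easy to get wrong by hand, so it can help to build and validate task strings in code before passing them to the CLI. The sketch below is plain Python and does not use any Lighteval API; TaskSpec and parse_task_spec are hypothetical helper names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """One benchmark entry in the suite|task|num_few_shot|auto_reduce format."""
    suite: str
    task: str
    num_few_shot: int = 0
    auto_reduce: bool = False

    def __str__(self) -> str:
        return f"{self.suite}|{self.task}|{self.num_few_shot}|{int(self.auto_reduce)}"


def parse_task_spec(text: str) -> TaskSpec:
    """Parse a task string, raising ValueError if any field is missing or malformed."""
    fields = text.split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(fields)}: {text!r}")
    suite, task, shots, reduce_flag = fields
    if not suite or not task:
        raise ValueError(f"suite and task must be non-empty: {text!r}")
    if not shots.isdigit():
        raise ValueError(f"num_few_shot must be a non-negative integer: {text!r}")
    if reduce_flag not in ("0", "1"):
        raise ValueError(f"auto_reduce must be 0 or 1: {text!r}")
    return TaskSpec(suite, task, int(shots), reduce_flag == "1")


# Build a spec from parts, then round-trip it through the parser.
spec = TaskSpec("mmlu", "abstract_algebra")
print(spec)                        # mmlu|abstract_algebra|0|0
print(parse_task_spec(str(spec)))  # same fields back as a TaskSpec
```

A three-field string such as "gsm8k|5|0" makes parse_task_spec raise ValueError, which surfaces the mistake before a long evaluation run starts. The check only validates the shape of the string; whether a given suite and task exist still depends on your Lighteval version.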

Match benchmarks to your use case:

Use case	Benchmarks
General knowledge / chat	MMLU (57 subjects), TruthfulQA
Code generation	HumanEval (164 Python problems)
Math / multi-step reasoning	GSM8K, BBH
Instruction following	AlpacaEval (GPT-4 as judge)

Always add a task-specific eval. Standard benchmarks test generic capabilities. If you fine-tuned for function calling, tool-format adherence, or domain tone, also write a custom Lighteval task on your holdout set. The HF course chapter covers how to define custom tasks. The command above evaluates one model at a time, so run it twice, once with the base model and once with your fine-tune, keeping the tasks, few-shot counts, and sample cap identical. Then compare the two result files task by task to see exactly where you gained and where you regressed.
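The base-versus-fine-tune comparison can be scripted once both runs have finished. The sketch below assumes each run wrote a results JSON with a top-level "results" object that maps task names to metric dictionaries, and that higher scores are better. That file layout, the "_stderr" suffix convention, and the helper names load_results and compare_runs are assumptions, so inspect your own output files and adjust.

```python
import json
from pathlib import Path


def load_results(path: str) -> dict:
    """Read one results file and return its task -> {metric: value} mapping."""
    data = json.loads(Path(path).read_text())
    return data["results"]  # assumed key; check your own output file


def compare_runs(base: dict, tuned: dict, tolerance: float = 0.01) -> list:
    """Return (task, metric, base, tuned, delta, verdict) rows for metrics present in both runs."""
    rows = []
    for task in sorted(base.keys() & tuned.keys()):
        for metric in sorted(base[task].keys() & tuned[task].keys()):
            if metric.endswith("_stderr"):
                continue  # compare point scores only, not their uncertainty estimates
            base_score, tuned_score = base[task][metric], tuned[task][metric]
            delta = tuned_score - base_score
            if delta < -tolerance:
                verdict = "REGRESSED"
            elif delta > tolerance:
                verdict = "improved"
            else:
                verdict = "unchanged"
            rows.append((task, metric, base_score, tuned_score, round(delta, 4), verdict))
    return rows


# Made-up numbers purely to show the output shape, not real benchmark results.
base_scores = {"gsm8k": {"acc": 0.40}, "mmlu": {"acc": 0.55}}
tuned_scores = {"gsm8k": {"acc": 0.52}, "mmlu": {"acc": 0.50}}
for row in compare_runs(base_scores, tuned_scores):
    print(row)
```

In practice you would pass load_results(...) for each run instead of the inline dictionaries. With a small sample cap, small deltas are likely noise, so consider raising tolerance or rerunning on the full benchmark before acting on a borderline result.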

Sources: HuggingFace LLM Course — Evaluation · Lighteval docs · philschmid: Evaluate LLMs with lm-eval + vLLM

CloudCodeTree