Did Your Fine-Tune Actually Work? Score Any Model in One Command with the LM Evaluation Harness

Chris Harper

Jun 28, 2026 · 20:30 UTC

Tutorial

Fine-Tuning

HuggingFace

TL;DR: After fine-tuning with Unsloth + QLoRA, run the same lm_eval command on the base model and on your fine-tune to get two directly comparable accuracy scores, with no custom eval code needed.

What you'll be able to do after this:

Install the EleutherAI LM Evaluation Harness in three commands and benchmark any HuggingFace model
Run the same command against your base and fine-tuned model to get a before/after accuracy number
Interpret the output: what the score means and how to pick the right benchmark task for your use case

You finished training. Your loss curve looks good. But does the model actually perform better on your target task? The EleutherAI LM Evaluation Harness gives you a reproducible, numeric answer with one CLI command — no custom eval scripts, no hand-labeling outputs.

Install

git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e ".[hf]"

The [hf] extra installs the HuggingFace backend — required to load models from the Hub or from a local checkpoint.

Evaluate base model, then your fine-tune

# Step 1: score the base model
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.2-3B \
  --tasks gsm8k \
  --device cuda:0 \
  --batch_size 8

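# Step 2: run the same command with only the model ID swapped for your fine-tune
# (if you pushed just the LoRA adapter, the harness README documents pretrained=<base-model>,peft=<adapter-repo>)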
lm_eval --model hf \
  --model_args pretrained=your-username/llama-3-2-3b-finetuned \
  --tasks gsm8k \
  --device cuda:0 \
  --batch_size 8

The harness prints a results table. In the gsm8k row for the flexible-extract filter, the exact_match value is the score: 0.4756 means 47.56% on grade school math. The same row has a Stderr column; if your fine-tune beats the base model by clearly more than that margin, training worked. Same or lower? Revisit your dataset quality and learning rate. Philschmid's fine-tuning guide shows a concrete example: a 10K-sample dataset moved the score from 47% to 54%.
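To put a rough number on "meaningfully higher", here is a minimal sketch. It assumes both scores come from the full GSM8K test split (1,319 questions) and treats the two runs as independent. compare_scores is a hypothetical helper, not part of the harness, and the |z| > 2 cutoff is only a rule of thumb.

```shell
# Hypothetical helper: compare two accuracy scores measured on the same n questions.
# Usage: compare_scores <base_score> <finetuned_score> <num_questions>
compare_scores() {
  awk -v b="$1" -v t="$2" -v n="$3" 'BEGIN {
    # Binomial standard error of each accuracy estimate
    se_b = sqrt(b * (1 - b) / n)
    se_t = sqrt(t * (1 - t) / n)
    # Standard error of the difference (assumption: runs treated as independent)
    se_d = sqrt(se_b * se_b + se_t * se_t)
    z = (se_d > 0) ? (t - b) / se_d : 0
    verdict = (z > 2 || z < -2) ? "likely real" : "within noise"
    printf "gain=%+.2fpp z=%.2f (%s)\n", (t - b) * 100, z, verdict
  }'
}

# Illustrative scores from the text above: 47.56% base, 54% fine-tuned
compare_scores 0.4756 0.54 1319
```

A gain of a point or two on a benchmark this size can easily fall inside the noise band, so small differences are better confirmed on a second task than taken at face value.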

Good starter tasks

Task	What it tests	Speed
`gsm8k`	Multi-step math reasoning	Medium
`hellaswag`	Commonsense reasoning	Fast
`arc_easy` / `arc_challenge`	Science QA	Fast
`mmlu`	Broad knowledge (STEM, humanities)	Slow

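Pick a task that matches your training domain. If you fine-tuned on customer support data, use a task that tests instruction-following, such as ifeval; if you fine-tuned on code, run humaneval (it executes model-generated code, so expect the harness to ask for an explicit opt-in).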
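Whichever tasks you pick, both models should see exactly the same task list and settings. A small sketch of one way to enforce that, assuming the model IDs used earlier: build_cmd, run_eval, the DRY_RUN variable, and the results/ directories are hypothetical names chosen for illustration, while --tasks with a comma-separated list and --output_path are existing harness options. By default the script only prints the commands.

```shell
#!/usr/bin/env sh
# Sketch: run one shared task list against both models so the scores are comparable.
BASE_MODEL="meta-llama/Llama-3.2-3B"
FT_MODEL="your-username/llama-3-2-3b-finetuned"
TASKS="hellaswag,arc_easy"   # comma-separated task names

# Build the lm_eval command line for one model ($1) and one output directory ($2)
build_cmd() {
  printf 'lm_eval --model hf --model_args pretrained=%s --tasks %s --device cuda:0 --batch_size 8 --output_path %s\n' \
    "$1" "$TASKS" "$2"
}

# Print the command by default; set DRY_RUN=0 to actually execute it
run_eval() {
  cmd="$(build_cmd "$1" "$2")"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$cmd"
  else
    sh -c "$cmd"
  fi
}

run_eval "$BASE_MODEL" results/base
run_eval "$FT_MODEL" results/finetuned
```

Keeping the task list in a single variable means a change to the benchmark set cannot silently apply to only one of the two runs.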

Can't fit the model on Colab?

If your fine-tuned model is hosted on HuggingFace Hub and you're serving it with Ollama or vLLM, use the OpenAI-compatible backend to decouple the eval client from the inference server:

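# base_url must be the full chat-completions endpoint; chat backends also need --apply_chat_template
lm_eval --model local-chat-completions \
  --model_args model=your-ft-model,base_url=http://localhost:8000/v1/chat/completions \
  --tasks gsm8k_cot \
  --apply_chat_template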

This way the evaluation runs anywhere — even a CPU-only machine — and the GPU work stays on the server.

Sources: EleutherAI LM Evaluation Harness — GitHub | How to Fine-Tune Open LLMs in 2025 — Philschmid

CloudCodeTree
