
Did Your Fine-Tune Actually Work? Score Any Model in One Command with the LM Evaluation Harness
Chris Harper
Jun 28, 2026 · 20:30 UTC
TL;DR: After fine-tuning with Unsloth + QLoRA, run one lm_eval command to get a numeric accuracy score and compare it directly against the base model — no custom eval code needed.
What you'll be able to do after this:
- Install the EleutherAI LM Evaluation Harness in three commands and benchmark any HuggingFace model
- Run the same command against your base and fine-tuned model to get a before/after accuracy number
- Interpret the output: what the score means and how to pick the right benchmark task for your use case
You finished training. Your loss curve looks good. But does the model actually perform better on your target task? The EleutherAI LM Evaluation Harness gives you a reproducible, numeric answer with one CLI command — no custom eval scripts, no hand-labeling outputs.
Install
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e ".[hf]"
The [hf] extra installs the HuggingFace backend — required to load models from the Hub or from a local checkpoint.
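To confirm the install landed in the environment you are actually using, check that the package is importable. This is a minimal sketch: it assumes the import name is lm_eval (the same as the CLI name), and the helper name is_importable is hypothetical.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the harness is importable as "lm_eval"
    print("lm_eval importable:", is_importable("lm_eval"))
```

If this prints False, the most likely cause is that pip installed into a different virtual environment than the one running the check.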
Evaluate base model, then your fine-tune
# Step 1: score the base model
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-3.2-3B \
--tasks gsm8k \
--device cuda:0 \
--batch_size 8
# Step 2: swap in your fine-tuned model (identical command)
lm_eval --model hf \
--model_args pretrained=your-username/llama-3-2-3b-finetuned \
--tasks gsm8k \
--device cuda:0 \
--batch_size 8
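The comparison is only fair if both runs share every setting except the model. One way to guarantee that is to generate both commands from a single template. The sketch below uses only the flags shown above; the helper name build_eval_cmd is hypothetical, and the fine-tuned model ID is the same placeholder used in the commands.

```python
import shlex


def build_eval_cmd(pretrained: str, task: str = "gsm8k",
                   device: str = "cuda:0", batch_size: int = 8) -> list[str]:
    """Build an lm_eval argument list; only `pretrained` varies between runs."""
    return [
        "lm_eval", "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", task,
        "--device", device,
        "--batch_size", str(batch_size),
    ]


base_cmd = build_eval_cmd("meta-llama/Llama-3.2-3B")
ft_cmd = build_eval_cmd("your-username/llama-3-2-3b-finetuned")  # placeholder ID

# Collect the positions where the two commands differ (should be one)
diff = [(a, b) for a, b in zip(base_cmd, ft_cmd) if a != b]

if __name__ == "__main__":
    print(shlex.join(base_cmd))
    print(shlex.join(ft_cmd))
    # To execute instead of print: subprocess.run(base_cmd, check=True)
```

Changing the task or batch size in one place now changes it for both runs, so the two scores stay comparable.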
Output looks like: gsm8k | flexible-extract | 0 | 0.4756, which means the model answered 47.56% of the grade-school math problems correctly. The results table also reports a standard error next to each value; if your fine-tune beats the base model by clearly more than that margin, training worked. Same or lower? Revisit your dataset quality and learning rate. Philschmid's fine-tuning guide shows a concrete example: a 10K-sample dataset moved the score from 47% to 54%.
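To judge whether a gap between two scores is more than noise, compare it to the sampling error of the benchmark. The sketch below is a rough approximation, not the harness's own method. It takes the 47% and 54% figures quoted above, assumes both were measured on the 1,319-question GSM8K test split, and treats the two runs as independent samples. The function names are hypothetical.

```python
import math

GSM8K_TEST_SIZE = 1319  # assumption: scores come from the full GSM8K test split


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy measured on n independent questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


def compare_accuracy(base_acc: float, ft_acc: float, n: int) -> tuple[float, float]:
    """Return (delta, z): the accuracy gain and the gain in standard-error units."""
    delta = ft_acc - base_acc
    combined_se = math.sqrt(
        binomial_stderr(base_acc, n) ** 2 + binomial_stderr(ft_acc, n) ** 2
    )
    return delta, delta / combined_se


delta, z = compare_accuracy(0.47, 0.54, GSM8K_TEST_SIZE)

if __name__ == "__main__":
    print(f"gain = {delta:.2%}, z = {z:.2f}")
```

As a rule of thumb, a z value above about 2 suggests the improvement is unlikely to be noise. Both models answer the same questions, so a paired test would be sharper; this version simply errs on the cautious side.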
Good starter tasks
| Task | What it tests | Speed |
|---|---|---|
| gsm8k | Multi-step math reasoning | Medium |
| hellaswag | Commonsense reasoning | Fast |
| arc_easy / arc_challenge | Science QA | Fast |
| mmlu | Broad knowledge (STEM, humanities) | Slow |
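Pick a task that matches your training domain. If you fine-tuned on customer support data, use a task that tests instruction-following, such as ifeval; if you fine-tuned on code, run humaneval. Note that humaneval executes model-generated code, so run it in a sandboxed environment.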
Can't fit the model on Colab?
If your fine-tuned model is hosted on HuggingFace Hub and you're serving it with Ollama or vLLM, use the OpenAI-compatible backend to decouple the eval client from the inference server:
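# base_url should be the full chat-completions route, not just the /v1 prefix
lm_eval --model local-chat-completions \
--model_args model=your-ft-model,base_url=http://localhost:8000/v1/chat/completions \
--tasks gsm8k_cot \
--apply_chat_template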
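This way the evaluation client runs anywhere, even on a CPU-only machine, and the GPU work stays on the server. Chat-completion endpoints only return generated text, so this backend suits generative tasks such as gsm8k_cot; multiple-choice tasks scored by log-likelihood, such as hellaswag, still need the hf backend.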
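One detail that is easy to get wrong is the base_url value. The harness appears to send requests to base_url exactly as given, so it most likely needs the full chat-completions route, not just the server root. The sketch below builds the --model_args string from a server root. The helper name is hypothetical, and the /v1/chat/completions path is the standard OpenAI-compatible route, assumed to be what your server exposes.

```python
def chat_completions_model_args(model_name: str, server_root: str) -> str:
    """Build the --model_args value for the local-chat-completions backend."""
    # Assumption: the server exposes the standard OpenAI-compatible route
    endpoint = server_root.rstrip("/") + "/v1/chat/completions"
    return f"model={model_name},base_url={endpoint}"


if __name__ == "__main__":
    print(chat_completions_model_args("your-ft-model", "http://localhost:8000"))
```

Swap in the host and port your server actually listens on; the localhost:8000 value here simply mirrors the command above.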
Sources: EleutherAI LM Evaluation Harness — GitHub | How to Fine-Tune Open LLMs in 2025 — Philschmid