
Build Your Fine-Tuning Dataset First: Alpaca Format, Chat Messages, and the Train/Eval Split
Chris Harper
3 min read
Jun 27, 2026 · 20:01 UTC
TL;DR: The training run takes 20 minutes; building the right dataset takes days — here is how to structure instruction/response pairs in the two formats every fine-tuning trainer expects, plus the HuggingFace split-and-push pattern.
What you'll be able to do after this:
- Structure 100–500 instruction/response pairs in Alpaca format (instruction/input/output) and the newer chat messages format that modern trainers prefer
- Split your dataset into train and eval sets with one HuggingFace datasets call
- Push the result to the HuggingFace Hub so SFTTrainer can load it directly
Why the data step is the real fine-tuning work
In practice, much of the gap between a fine-tuned model that regresses on base tasks and one that actually improves comes down to dataset quality. Fine-tuning shifts which outputs the model favors for a given input, so the dataset must demonstrate the exact input/output pattern you want repeated, and do so consistently.
The two formats you will encounter
Alpaca format (original Stanford Alpaca): three columns — instruction, input (optional context), output.
{
  "instruction": "Summarize this customer complaint in one sentence.",
  "input": "I ordered a red jacket but received a blue one. Third time this month.",
  "output": "Customer received the wrong color jacket for the third consecutive time."
}
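To make the role of the optional input column concrete, here is a minimal sketch of how an Alpaca row might be flattened into a single prompt string before training. The section markers are modelled on the Stanford Alpaca prompt, but the exact wording here is an assumption, and `render_alpaca_prompt` is a hypothetical helper name rather than part of any library.

```python
def render_alpaca_prompt(row: dict) -> str:
    """Render one Alpaca-style row into a prompt string (illustrative template)."""
    instruction = row["instruction"].strip()
    # "input" is optional: a missing or blank value means "no extra context".
    context = (row.get("input") or "").strip()
    if context:
        return f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n"
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


example = {
    "instruction": "Summarize this customer complaint in one sentence.",
    "input": "I ordered a red jacket but received a blue one. Third time this month.",
    "output": "Customer received the wrong color jacket for the third consecutive time.",
}

prompt = render_alpaca_prompt(example)
# The training text is the prompt followed by the target output.
full_text = prompt + example["output"]
```

Rows without an input simply drop the middle section, which is why the column can be left empty without breaking the format.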
Chat / messages format (what modern trainers expect): a messages array with role and content keys.
{
  "messages": [
    {"role": "system", "content": "You are a customer support summarizer."},
    {"role": "user", "content": "I ordered a red jacket but received a blue one..."},
    {"role": "assistant", "content": "Customer received the wrong color jacket for the third consecutive time."}
  ]
}
HuggingFace TRL's SFTTrainer recognizes the messages column as a conversational dataset and applies the tokenizer's chat template automatically, so no custom preprocessing is required. Prefer this format for any model trained with a chat template (Llama-3-Instruct, Mistral-Instruct, Qwen-2.5-Instruct, etc.).
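If your existing data is already in Alpaca format, converting it to the messages format is mechanical. The sketch below shows one way to do it; `alpaca_to_messages` is a hypothetical helper, and folding the optional input into the user turn after a blank line is an assumed convention, not something the trainer requires.

```python
from typing import Optional


def alpaca_to_messages(row: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert one Alpaca-style row into a {"messages": [...]} row."""
    user_content = row["instruction"].strip()
    context = (row.get("input") or "").strip()
    if context:
        # Assumed convention: append the optional context to the user turn.
        user_content = f"{user_content}\n\n{context}"

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": row["output"].strip()})
    return {"messages": messages}


alpaca_row = {
    "instruction": "Summarize this customer complaint in one sentence.",
    "input": "I ordered a red jacket but received a blue one. Third time this month.",
    "output": "Customer received the wrong color jacket for the third consecutive time.",
}

chat_row = alpaca_to_messages(
    alpaca_row, system_prompt="You are a customer support summarizer."
)
```

Applying the same function to every row (for example with a list comprehension) keeps the whole dataset in one consistent format.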
Building and splitting your dataset in Python
from datasets import Dataset

# Each row is one conversation in the chat/messages format
rows = [
    {
        "messages": [
            {"role": "user", "content": "Summarize: customer said ..."},
            {"role": "assistant", "content": "One-sentence summary here."}
        ]
    },
    # ... 99–499 more rows
]
ds = Dataset.from_list(rows)

# 90/10 train/eval split; the fixed seed makes the split reproducible
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds = splits["train"]
eval_ds = splits["test"]

# Pass directly to SFTTrainer
# trainer = SFTTrainer(train_dataset=train_ds, eval_dataset=eval_ds, ...)

# Optional: push both splits to the Hub for Colab workflows
# (requires being logged in, e.g. via huggingface-cli login)
splits.push_to_hub("your-username/my-finetune-dataset", private=True)
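Before calling `Dataset.from_list`, it can help to check that every row really has the shape shown above. The following is a minimal sketch in plain Python; `validate_messages_row` is a hypothetical helper, and the rules it enforces (known roles, non-empty content, system message first, assistant reply last) are assumptions about a typical single-reply fine-tuning setup rather than requirements taken from the TRL docs.

```python
from typing import List

VALID_ROLES = {"system", "user", "assistant"}


def validate_messages_row(row: dict) -> List[str]:
    """Return a list of problems found in one row; an empty list means it looks fine."""
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not a dict")
            continue
        role = msg.get("role")
        content = msg.get("content")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message is not first")

    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message is not an assistant reply")
    return problems


candidate_rows = [
    {"messages": [
        {"role": "user", "content": "Summarize: customer said ..."},
        {"role": "assistant", "content": "One-sentence summary here."},
    ]},
    {"messages": [{"role": "user", "content": "No reply recorded"}]},
]

# Map row index -> problems, keeping only rows that have at least one problem.
bad_rows = {
    i: problems
    for i, problems in enumerate(map(validate_messages_row, candidate_rows))
    if problems
}
```

Fixing or dropping the rows in `bad_rows` before the split keeps malformed examples out of both the train and eval sets.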
How much data do you need?
| Goal | Volume (examples) |
|---|---|
| Domain style / format adaptation | 100–500 high-quality examples |
| Task specialization (classifier, extractor) | 500–2,000 |
| Covering broad capabilities | 5,000+ |
Quality matters more than volume at every scale: 200 carefully written examples will often outperform 2,000 noisy ones. A good heuristic: if you would not be proud to have labeled an example yourself, cut it.
Common pitfalls
- Label inconsistency: the model cannot learn a mapping when similar inputs get different outputs. Run a quick groupby on the instruction column to catch contradictions.
- Leaking the eval set: never fine-tune on your eval rows; the whole point of the split is to detect overfitting.
- Output length mismatch: if target outputs vary wildly (2 words vs 500 words), long examples can contribute far more tokens to the loss than short ones, so pad or pack carefully or you'll distort loss weighting.
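The three pitfalls above can each be checked with a few lines of plain Python before training. This is a rough sketch, not a standard tool: the helper names are hypothetical, prompts are compared after lowercasing and trimming whitespace (an assumed notion of "same input"), and each row is assumed to end with the assistant reply.

```python
from collections import defaultdict


def prompt_and_target(row):
    """Split a messages row into (normalized prompt text, final assistant reply)."""
    *prompt_msgs, last = row["messages"]
    prompt = "\n".join(
        f'{m["role"]}: {m["content"].strip().lower()}' for m in prompt_msgs
    )
    return prompt, last["content"].strip()


def find_contradictions(rows):
    """Return prompts that map to more than one distinct target output."""
    targets = defaultdict(set)
    for row in rows:
        prompt, target = prompt_and_target(row)
        targets[prompt].add(target)
    return {p: t for p, t in targets.items() if len(t) > 1}


def find_leaks(train_rows, eval_rows):
    """Return eval prompts that also appear in the training rows."""
    train_prompts = {prompt_and_target(r)[0] for r in train_rows}
    eval_prompts = (prompt_and_target(r)[0] for r in eval_rows)
    return [p for p in eval_prompts if p in train_prompts]


def target_length_range(rows):
    """Return (shortest, longest) target length in whitespace-separated words."""
    lengths = [len(prompt_and_target(r)[1].split()) for r in rows]
    return min(lengths), max(lengths)
```

A non-empty result from `find_contradictions` or `find_leaks` points at rows to fix or drop, and a very wide range from `target_length_range` is a cue to look at how the trainer pads or packs sequences. The functions take any iterable of row dicts, so a list built by hand works, and the train and eval splits from the earlier snippet should too.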
The W&B multi-part series linked below covers the full pipeline end-to-end: picking datasets, tokenizing, packing short sequences, setting train/eval splits, and debugging common issues — worth reading before you run your first training job.
Sources: How to Fine-Tune an LLM Part 1: Preparing a Dataset for Instruction Tuning — Weights & Biases | Dataset formats — HuggingFace TRL docs | HuggingFace Datasets library