Build Your Fine-Tuning Dataset First: Alpaca Format, Chat Messages, and the Train/Eval Split

Chris Harper

3 min read

Jun 27, 2026 · 20:01 UTC

Tutorial

Fine-Tuning

HuggingFace

TL;DR: The training run takes 20 minutes; building the right dataset takes days. Here is how to structure instruction/response pairs in the two formats most fine-tuning trainers expect, plus the HuggingFace split-and-push pattern.

What you'll be able to do after this:

Structure 100–500 instruction/response pairs in Alpaca format (instruction/input/output) and the newer chat messages format that modern trainers prefer
Split your dataset into train and eval sets with one HuggingFace datasets call
Push the result to the HuggingFace Hub so SFTTrainer can load it directly

Why the data step is the real fine-tuning work

Most of the gap between a fine-tuned model that regresses on base tasks and one that actually improves traces back to dataset quality. Fine-tuning shifts which outputs the model favors for a given input, so the dataset must demonstrate the exact input/output pattern you want repeated, and demonstrate it consistently.

The two formats you will encounter

Alpaca format (original Stanford Alpaca): three columns — instruction, input (optional context), output.

{
  "instruction": "Summarize this customer complaint in one sentence.",
  "input": "I ordered a red jacket but received a blue one. Third time this month.",
  "output": "Customer received the wrong color jacket for the third consecutive time."
}

Chat / messages format (what modern trainers expect): a messages array with role and content keys.

{
  "messages": [
    {"role": "system", "content": "You are a customer support summarizer."},
    {"role": "user",   "content": "I ordered a red jacket but received a blue one..."},
    {"role": "assistant", "content": "Customer received the wrong color jacket for the third consecutive time."}
  ]
}

HuggingFace trl's SFTTrainer detects the messages key and applies the tokenizer's chat template automatically, so no custom preprocessing is required. Prefer this format for any model that ships with a chat template (Llama-3-Instruct, Mistral-Instruct, Qwen2.5-Instruct, etc.).
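
If your existing data is already in Alpaca format, converting it to the messages format is mostly mechanical. The sketch below is one way to do it, not a standard utility: alpaca_to_messages is a hypothetical helper name, and joining instruction and input with a blank line is an arbitrary choice made for this example.

```python
from typing import Optional


def alpaca_to_messages(row: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert one Alpaca-style row into a chat-style row.

    The instruction and the optional input are merged into a single user
    turn; the output becomes the assistant turn.
    """
    user_content = row["instruction"].strip()
    extra = (row.get("input") or "").strip()
    if extra:
        # Assumption: a blank line between instruction and input.
        user_content = f"{user_content}\n\n{extra}"

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": row["output"].strip()})
    return {"messages": messages}


alpaca_row = {
    "instruction": "Summarize this customer complaint in one sentence.",
    "input": "I ordered a red jacket but received a blue one. Third time this month.",
    "output": "Customer received the wrong color jacket for the third consecutive time.",
}
chat_row = alpaca_to_messages(
    alpaca_row, system_prompt="You are a customer support summarizer."
)
```

Whatever joining convention you pick, apply the same one at inference time so prompts look the way they did during training.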

Building and splitting your dataset in Python

from datasets import Dataset

rows = [
    {
        "messages": [
            {"role": "user",      "content": "Summarize: customer said ..."},
            {"role": "assistant", "content": "One-sentence summary here."}
        ]
    },
    # ... 99–499 more rows
]

ds = Dataset.from_list(rows)

# 90/10 train/eval split
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds = splits["train"]
eval_ds  = splits["test"]

# Pass directly to SFTTrainer
# trainer = SFTTrainer(train_dataset=train_ds, eval_dataset=eval_ds, ...)

# Optional: push both splits to the Hub for Colab workflows
# (requires being logged in to the Hub with a write token)
splits.push_to_hub("your-username/my-finetune-dataset", private=True)
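
Before handing rows to Dataset.from_list, it can help to run a quick structural check so that malformed rows fail early rather than mid-training. The following is a minimal sketch using only the standard library. validate_row and ALLOWED_ROLES are hypothetical names, and the rules it enforces (system message first, assistant message last, non-empty string content) are assumptions for single-purpose SFT data, not requirements stated by any trainer.

```python
# Assumption: only these three roles appear in the dataset.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_row(row: dict) -> list:
    """Return a list of problems found in one chat-format row (empty if none)."""
    problems = []
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not a dict")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message is not first")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")

    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


good = {
    "messages": [
        {"role": "user", "content": "Summarize: customer said ..."},
        {"role": "assistant", "content": "One-sentence summary here."},
    ]
}
bad = {
    "messages": [
        {"role": "user", "content": "Summarize: customer said ..."},
        {"role": "assistant", "content": "  "},
    ]
}
# Map row index -> list of problems, so bad rows are easy to locate.
report = {i: validate_row(r) for i, r in enumerate([good, bad])}
```

Rows with a non-empty problem list can be fixed or dropped before the split, which keeps both train and eval sets clean.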

How much data do you need?

Goal	Volume
Domain style / format adaptation	100–500 high-quality examples
Task specialization (classifier, extractor)	500–2,000
Covering broad capabilities	5,000+

Quality usually beats volume. 200 carefully written examples will often outperform 2,000 noisy ones. A good heuristic: if you would not be proud to have written an example's output yourself, cut it.

Common pitfalls

Label inconsistency: the model cannot learn a mapping when similar inputs get different outputs. Group rows by instruction (a quick groupby is enough) and flag any instruction that maps to more than one output.
Leaking the eval set: never fine-tune on your eval rows; the whole point of the split is to detect overfitting.
Output length mismatch: if target outputs vary wildly in length (2 words vs 500 words), long examples contribute far more tokens to the loss than short ones, so pad or pack carefully or you'll distort loss weighting.
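
The three pitfalls above can each be screened with a few lines of Python. This is a rough sketch rather than a complete audit: it uses a plain dictionary instead of a pandas groupby, it treats everything before the final message as the prompt, and it only catches duplicates that match after lower-casing and whitespace collapsing. All function names here are hypothetical, and the rows are assumed to be in the messages format with the assistant reply last.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.lower().split())


def split_prompt_and_target(row: dict) -> tuple:
    """Return (prompt, target): every message but the last, and the last message."""
    messages = row["messages"]
    prompt = "\n".join(f'{m["role"]}: {m["content"].strip()}' for m in messages[:-1])
    return prompt, messages[-1]["content"].strip()


def find_contradictions(rows) -> dict:
    """Return prompts that map to more than one distinct target."""
    targets_by_prompt = defaultdict(set)
    for row in rows:
        prompt, target = split_prompt_and_target(row)
        targets_by_prompt[normalize(prompt)].add(normalize(target))
    return {p: sorted(t) for p, t in targets_by_prompt.items() if len(t) > 1}


def find_leaks(train_rows, eval_rows) -> list:
    """Return normalized prompts that appear in both train and eval rows."""
    train_prompts = {normalize(split_prompt_and_target(r)[0]) for r in train_rows}
    eval_prompts = {normalize(split_prompt_and_target(r)[0]) for r in eval_rows}
    return sorted(train_prompts & eval_prompts)


def target_length_range(rows) -> tuple:
    """Return (shortest, longest) target length in whitespace-separated words."""
    lengths = [len(split_prompt_and_target(r)[1].split()) for r in rows]
    return min(lengths), max(lengths)


def make_row(user: str, assistant: str) -> dict:
    """Build a single-turn chat row for the toy data below."""
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }


# Toy data: two rows contradict each other, and one eval prompt is also in train.
train_rows = [
    make_row("Classify: refund not received", "billing"),
    make_row("Classify:  Refund not received", "shipping"),
    make_row("Classify: app crashes on login", "bug"),
]
eval_rows = [make_row("classify: app crashes on login", "bug")]

contradictions = find_contradictions(train_rows)
leaks = find_leaks(train_rows, eval_rows)
shortest, longest = target_length_range(train_rows)
```

Exact-match checks like these miss paraphrased duplicates, so treat an empty result as a baseline and not as proof that the split is clean.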

The W&B multi-part series linked below covers the full pipeline end-to-end: picking datasets, tokenizing, packing short sequences, setting train/eval splits, and debugging common issues — worth reading before you run your first training job.