Build Your Fine-Tuning Dataset First: Alpaca Format, Chat Messages, and the Train/Eval Split

Chris Harper

3 min read

Jun 27, 2026 · 20:01 UTC

AI
Tutorial
Fine-Tuning
HuggingFace

TL;DR: The training run takes 20 minutes; building the right dataset takes days. Here is how to structure instruction/response pairs in the two formats most fine-tuning trainers accept, plus the HuggingFace split-and-push pattern.

What you'll be able to do after this:

  • Structure 100–500 instruction/response pairs in Alpaca format (instruction/input/output) and the newer chat messages format that modern trainers prefer
  • Split your dataset into train and eval sets with one HuggingFace datasets call
  • Push the result to the HuggingFace Hub so SFTTrainer can load it directly

Why the data step is the real fine-tuning work

Much of the gap between a fine-tuned model that regresses on base tasks and one that actually improves traces back to dataset quality. Fine-tuning shifts which output distribution the model reaches for given an input, so the dataset must demonstrate the exact input/output pattern you want repeated, and do so consistently.

The two formats you will encounter

Alpaca format (original Stanford Alpaca): three columns — instruction, input (optional context), output.

{
  "instruction": "Summarize this customer complaint in one sentence.",
  "input": "I ordered a red jacket but received a blue one. Third time this month.",
  "output": "Customer received the wrong color jacket for the third consecutive time."
}

Chat / messages format (what modern trainers expect): a messages array with role and content keys.

{
  "messages": [
    {"role": "system", "content": "You are a customer support summarizer."},
    {"role": "user",   "content": "I ordered a red jacket but received a blue one..."},
    {"role": "assistant", "content": "Customer received the wrong color jacket for the third consecutive time."}
  ]
}

HuggingFace TRL's SFTTrainer recognizes the messages key and applies the tokenizer's chat template automatically, so no custom preprocessing is required. Prefer this format for any model that ships with a chat template (Llama-3-Instruct, Mistral-Instruct, Qwen2.5-Instruct, etc.).
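If your existing data is already in Alpaca format, converting it to messages is mostly mechanical. The sketch below is one possible mapping, not a standard one: alpaca_to_messages is a hypothetical helper name, and joining instruction and input with a blank line is an assumption you may want to change for your task.

```python
def alpaca_to_messages(row, system_prompt=None):
    """Convert one Alpaca-style dict into a {"messages": [...]} dict."""
    instruction = row["instruction"].strip()
    context = (row.get("input") or "").strip()
    # Alpaca's optional "input" carries context; fold it into the user turn.
    # (Assumption: a blank line between instruction and context.)
    user_content = f"{instruction}\n\n{context}" if context else instruction

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": row["output"].strip()})
    return {"messages": messages}


alpaca_row = {
    "instruction": "Summarize this customer complaint in one sentence.",
    "input": "I ordered a red jacket but received a blue one. Third time this month.",
    "output": "Customer received the wrong color jacket for the third consecutive time.",
}
chat_row = alpaca_to_messages(
    alpaca_row, system_prompt="You are a customer support summarizer."
)
```

Rows with an empty input produce a user turn that contains only the instruction, so the same helper covers both Alpaca variants.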

Building and splitting your dataset in Python

from datasets import Dataset

rows = [
    {
        "messages": [
            {"role": "user",      "content": "Summarize: customer said ..."},
            {"role": "assistant", "content": "One-sentence summary here."}
        ]
    },
    # ... 99–499 more rows
]

ds = Dataset.from_list(rows)

# 90/10 train/eval split (train_test_split names the held-out split "test")
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds = splits["train"]
eval_ds  = splits["test"]

# Pass directly to SFTTrainer
# trainer = SFTTrainer(train_dataset=train_ds, eval_dataset=eval_ds, ...)

# Optional: push both splits to the Hub for Colab workflows
# (requires being logged in first, e.g. via `huggingface-cli login`)
splits.push_to_hub("your-username/my-finetune-dataset", private=True)
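One thing the split does not guard against is duplicate rows: if the same example appears twice, train_test_split can place one copy in each split. A minimal sketch of a pre-split dedupe and a post-split overlap check follows, assuming rows in the messages format. The names row_key, deduplicate, and overlapping_rows are hypothetical helpers, not part of the datasets library.

```python
import json


def row_key(row):
    """Stable string fingerprint of a row's messages (dict key order ignored)."""
    return json.dumps(row["messages"], sort_keys=True, ensure_ascii=False)


def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        key = row_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def overlapping_rows(train_rows, eval_rows):
    """Return eval rows whose messages also appear verbatim in train."""
    train_keys = {row_key(row) for row in train_rows}
    return [row for row in eval_rows if row_key(row) in train_keys]


example = {
    "messages": [
        {"role": "user", "content": "Summarize: customer said ..."},
        {"role": "assistant", "content": "One-sentence summary here."},
    ]
}
unique_rows = deduplicate([example, example])  # second copy is dropped
```

Because iterating over a Dataset yields plain dicts, overlapping_rows(train_ds, eval_ds) should work on the split objects as well. This only catches exact matches; near-duplicates need a fuzzier comparison.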

How much data do you need?

Goal                                          Volume
Domain style / format adaptation              100–500 high-quality examples
Task specialization (classifier, extractor)   500–2,000
Covering broad capabilities                   5,000+

Quality beats volume at every scale. 200 carefully written examples will often outperform 2,000 noisy ones. A good heuristic: if you would not be proud to have written an example yourself, cut it.

Common pitfalls

  • Label inconsistency: the model cannot learn a stable mapping when near-identical inputs get different outputs. Group rows by instruction (and input) and flag any group with more than one distinct output.
  • Leaking the eval set: never fine-tune on your eval rows; the whole point of the split is to detect overfitting.
  • Output length mismatch: if target outputs vary wildly (2 words vs. 500 words), pad or pack carefully. Loss is typically averaged over tokens, so a few very long examples can dominate the weighting.
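The first and third checks are easy to script. The sketch below uses only the standard library and assumes single-turn rows in the messages format whose last message is the assistant reply. The helper names (split_prompt_and_reply, find_contradictions, reply_length_summary) are hypothetical, and counting whitespace-separated words is a rough stand-in for token counts.

```python
from collections import defaultdict
from statistics import median


def split_prompt_and_reply(row):
    """Return (prompt, reply); assumes the last message is the assistant reply."""
    msgs = row["messages"]
    prompt = "\n".join(m["content"].strip() for m in msgs if m["role"] != "assistant")
    reply = msgs[-1]["content"].strip()
    return prompt, reply


def find_contradictions(rows):
    """Map each (lowercased) prompt to its replies when more than one reply exists."""
    replies_by_prompt = defaultdict(set)
    for row in rows:
        prompt, reply = split_prompt_and_reply(row)
        replies_by_prompt[prompt.lower()].add(reply)
    return {p: sorted(r) for p, r in replies_by_prompt.items() if len(r) > 1}


def reply_length_summary(rows):
    """Min / median / max reply length in whitespace-separated words."""
    lengths = [len(split_prompt_and_reply(row)[1].split()) for row in rows]
    return {"min": min(lengths), "median": median(lengths), "max": max(lengths)}


sample_rows = [
    {"messages": [{"role": "user", "content": "Classify: late delivery"},
                  {"role": "assistant", "content": "shipping"}]},
    {"messages": [{"role": "user", "content": "Classify: late delivery"},
                  {"role": "assistant", "content": "billing"}]},
    {"messages": [{"role": "user", "content": "Classify: wrong size"},
                  {"role": "assistant", "content": "product issue"}]},
]
contradictions = find_contradictions(sample_rows)
length_summary = reply_length_summary(sample_rows)
```

Here the two "late delivery" rows are flagged because they carry different labels. A large gap between the median and the maximum reply length is a cue to look at the longest examples before training.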

The W&B multi-part series linked below covers the full pipeline end-to-end: picking datasets, tokenizing, packing short sequences, setting train/eval splits, and debugging common issues — worth reading before you run your first training job.

Sources: How to Fine-Tune an LLM Part 1: Preparing a Dataset for Instruction Tuning — Weights & Biases | Dataset formats — HuggingFace TRL docs | HuggingFace Datasets library