Beyond SFT: Align Your Model With Human Preferences Using DPO and TRL

Chris Harper

Jul 5, 2026 · 04:13 UTC

DPO (Direct Preference Optimization) replaces the reward-model step in RLHF with a single stable objective on chosen/rejected pairs — and HuggingFace's TRL makes it a one-class swap from SFT, runnable on a free Colab T4 GPU.

What you'll be able to do after this:

Build a preference dataset in chosen/rejected format and understand why pair quality matters more than quantity
Run DPOTrainer with 4-bit QLoRA in under 50 lines on a free T4 GPU
Read eval_rewards/margins (the reward margin TRL logs during evaluation) to verify your fine-tune is pulling the model toward preferred outputs

Why DPO after SFT

Your SFT model generates correct outputs but may still default to verbose, hedged, or off-topic completions when multiple answers are plausible. DPO trains on pairs: a preferred completion and a rejected one for the same prompt. Its loss pushes the model to raise the log-probability of the preferred completion relative to the rejected one, measured against a frozen reference model (the SFT checkpoint). No separate reward model, no RL training loop, no PPO instability.
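To make the objective concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes you already have the summed log-probability of each completion under the policy and under the reference model; the function name and the example numbers are illustrative, not TRL's implementation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss and reward margin from summed sequence log-probabilities."""
    # Implicit rewards: how far the policy has moved from the reference, scaled by beta
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin)
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, margin

# Policy identical to the reference: margin is 0 and the loss is log(2)
loss_start, margin_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy now favors the chosen completion and disfavors the rejected one
loss_later, margin_later = dpo_loss(-10.0, -20.0, -12.0, -15.0)

print(loss_start, margin_start)
print(loss_later, margin_later)
```

The second call yields a positive margin and a lower loss than the first, which is the direction training should move in.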

The dataset format

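{
  "prompt": [{"role": "user", "content": "Summarize this article in two sentences."}],
  "chosen": [{"role": "assistant", "content": "The article covers the impact..."}],
  "rejected": [{"role": "assistant", "content": "Sure! I'd be happy to help summarize..."}]
}

All three fields are lists of chat messages, which is TRL's conversational preference format: prompt holds the user turn, while chosen and rejected hold the two competing assistant replies. Keep the format consistent across fields; mixing a plain-string prompt with message-list completions can confuse the chat-template step. Digish/huggingface-smol-course-preference-tuning-dataset on the Hugging Face Hub is a working starter dataset.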
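If you are assembling pairs yourself, a small helper keeps every record in the same shape. The sketch below assumes TRL's conversational preference format, in which prompt is also a list of messages; make_preference_pair is a hypothetical helper name, not part of any library.

```python
def make_preference_pair(user_prompt, chosen_reply, rejected_reply):
    """Build one record in the conversational chosen/rejected format."""
    # A pair whose two completions are identical carries no preference signal
    if chosen_reply.strip() == rejected_reply.strip():
        raise ValueError("chosen and rejected must differ")
    return {
        "prompt": [{"role": "user", "content": user_prompt}],
        "chosen": [{"role": "assistant", "content": chosen_reply}],
        "rejected": [{"role": "assistant", "content": rejected_reply}],
    }

records = [
    make_preference_pair(
        "Summarize this article in two sentences.",
        "The article covers the impact...",
        "Sure! I'd be happy to help summarize...",
    )
]
print(records[0]["chosen"][0]["role"])  # assistant
```

A list of such records can be turned into a dataset with datasets.Dataset.from_list(records) before it is passed to the trainer.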

Running DPOTrainer

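import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

model_id = "HuggingFaceTB/SmolLM3-3B-Instruct"  # model id as given in this post; confirm it on the Hub

# 4-bit NF4 quantization; fp16 compute because the T4 has no native bf16 support
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Assumes the dataset exposes a "train" split; hold out 10% for evaluation
dataset = load_dataset("Digish/huggingface-smol-course-preference-tuning-dataset", split="train")
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]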

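training_args = DPOConfig(
    output_dir="./dpo_smollm3",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    beta=0.1,       # KL penalty vs reference model; 0.1 is a safe start, higher = stay closer to SFT
    fp16=True,      # mixed precision that the T4 supports (it has no native bf16)
    eval_strategy="steps",  # run evaluation during training so eval metrics get logged
    eval_steps=50,
    logging_steps=10,
)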

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = DPOTrainer(
    model=model,
    ref_model=None,      # None with a PEFT config = no second model is loaded; TRL disables the adapters to get reference log-probs
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
)
trainer.train()

beta sets the strength of the implicit KL penalty: lower values let the model deviate more from the reference SFT checkpoint; higher values keep it closer but limit how much it can shift. Start at 0.1 and watch the reward margin, which TRL logs as rewards/margins during training and eval_rewards/margins during evaluation. A positive and growing margin means the model is separating chosen from rejected as expected.
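One way to check the margin after a run is to pull it out of the trainer's log history. The sketch below assumes the evaluation margin is logged under the key eval_rewards/margins and that evaluation ran during training; the helper names and the example log entries are hypothetical.

```python
def metric_series(log_history, key="eval_rewards/margins"):
    """Return (step, value) pairs for every log entry that contains `key`."""
    return [(entry.get("step"), entry[key]) for entry in log_history if key in entry]

def is_improving(series):
    """True when the last value is positive and above the first one."""
    values = [value for _, value in series]
    return len(values) >= 2 and values[-1] > 0 and values[-1] > values[0]

# Hypothetical entries shaped like trainer.state.log_history
example_history = [
    {"step": 10, "loss": 0.69, "rewards/margins": 0.01},
    {"step": 50, "eval_loss": 0.66, "eval_rewards/margins": 0.08},
    {"step": 100, "eval_loss": 0.61, "eval_rewards/margins": 0.21},
]

series = metric_series(example_history)
print(series)                # [(50, 0.08), (100, 0.21)]
print(is_improving(series))  # True
```

After a real run, pass trainer.state.log_history in place of example_history. If the series comes back empty, evaluation did not run or the key name differs in your TRL version.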

The smol-course hands-on exercise at unit2/3 walks through loading the preference dataset, training, and checking metrics on a real SmolLM3 run.

Sources: HuggingFace smol-course: DPO unit | Hands-on exercise with SmolLM3 | TRL DPO Trainer docs
