
Beyond SFT: Align Your Model With Human Preferences Using DPO and TRL
Chris Harper
Jul 5, 2026 · 04:13 UTC
DPO (Direct Preference Optimization) replaces the reward-model step in RLHF with a single stable objective on chosen/rejected pairs — and HuggingFace's TRL makes it a one-class swap from SFT, runnable on a free Colab T4 GPU.
What you'll be able to do after this:
- Build a preference dataset in chosen/rejected format and understand why pair quality matters more than quantity
- Run DPOTrainer with 4-bit QLoRA in under 50 lines on a free T4 GPU
- Read the eval reward margin (eval_rewards/margins in TRL's logs) to verify your fine-tune is pulling the model toward preferred outputs
Why DPO after SFT
Your SFT model generates correct outputs but may still default to verbose, hedged, or off-topic completions when multiple answers are plausible. DPO trains on pairs — a preferred completion and a rejected one for the same prompt — and directly optimizes the log-ratio between them. No separate reward model, no RL training loop, no PPO instability.
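To make the objective concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes you already have the summed sequence log-probabilities of the chosen and rejected completions under both the policy being trained and the frozen reference model. The function name dpo_loss and the example numbers are illustrative assumptions, not TRL code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    # How much more (or less) likely each completion is under the policy than the reference
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow for large |x|
    x = -logits
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss sits at log(2)
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised the chosen completion and lowered the rejected one: the loss drops
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(baseline, improved)
```

The loss only falls when the policy favors the chosen completion more strongly than the reference does, which is why no separate reward model is needed.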
The dataset format
{
  "prompt": [{"role": "user", "content": "Summarize this article in two sentences."}],
  "chosen": [{"role": "assistant", "content": "The article covers the impact..."}],
  "rejected": [{"role": "assistant", "content": "Sure! I'd be happy to help summarize..."}]
}
chosen and rejected are full chat turns, each a list of role/content messages. The Digish/huggingface-smol-course-preference-tuning-dataset dataset on the Hugging Face Hub is a working starter dataset.
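Because pair quality matters more than quantity, it can help to drop degenerate pairs before training. The sketch below assumes chosen and rejected are lists of role/content messages as in the record above. The helper names last_content and keep_pair are hypothetical, and the sample records are invented for illustration.

```python
def last_content(messages):
    """Return the stripped text of the final message in a chat turn list."""
    return messages[-1]["content"].strip()


def keep_pair(record):
    """Keep a pair only if both completions are non-empty and actually differ."""
    chosen = last_content(record["chosen"])
    rejected = last_content(record["rejected"])
    return bool(chosen) and bool(rejected) and chosen != rejected


# Invented sample records: one usable pair, one duplicate pair, one empty rejection
records = [
    {"chosen": [{"role": "assistant", "content": "Short, direct summary."}],
     "rejected": [{"role": "assistant", "content": "Sure! I'd be happy to help."}]},
    {"chosen": [{"role": "assistant", "content": "Same text."}],
     "rejected": [{"role": "assistant", "content": "Same text. "}]},
    {"chosen": [{"role": "assistant", "content": "A real answer."}],
     "rejected": [{"role": "assistant", "content": "   "}]},
]

clean = [r for r in records if keep_pair(r)]
print(len(clean))
```

With a datasets.Dataset, the same predicate can be passed to dataset.filter(keep_pair).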
Running DPOTrainer
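import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

model_id = "HuggingFaceTB/SmolLM3-3B-Instruct"

# 4-bit NF4 quantization (QLoRA); fp16 compute because the T4 has no bf16 support
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Assumption: the dataset exposes a "train" split; hold out 10% for evaluation
dataset = load_dataset("Digish/huggingface-smol-course-preference-tuning-dataset", split="train")
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]

training_args = DPOConfig(
    output_dir="./dpo_smollm3",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    beta=0.1,  # KL penalty vs reference model; 0.1 is a safe start, higher = stay closer to SFT
    logging_steps=10,
    eval_strategy="steps",  # run evaluation during training so eval metrics get logged
    eval_steps=50,
)
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear", task_type="CAUSAL_LM")
trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with peft_config, TRL gets reference log-probs by disabling the adapter; no second model copy
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
)
trainer.train()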
beta sets the strength of the KL divergence penalty: lower values let the model deviate more from the reference SFT checkpoint; higher values keep it closer but limit how much it can shift. Start at 0.1 and watch the eval reward margin, which TRL logs as eval_rewards/margins (the mean chosen reward minus the mean rejected reward). A positive and growing margin means the model is separating chosen from rejected as expected.
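One way to check the margin after a run is to scan the trainer's log history. The sketch below assumes the evaluation entries store the margin under the key eval_rewards/margins; dashboards may display the name differently. The helpers margin_trend and is_improving are hypothetical, and the sample log values are made up for illustration, not results from a real run.

```python
def margin_trend(log_history, key="eval_rewards/margins"):
    """Collect (step, margin) pairs from every log entry that reports the margin."""
    return [(entry.get("step"), entry[key]) for entry in log_history if key in entry]


def is_improving(trend):
    """True if the latest margin is positive and larger than the first one."""
    margins = [margin for _, margin in trend]
    return len(margins) >= 2 and margins[-1] > 0 and margins[-1] > margins[0]


# Made-up log entries shaped like a list of per-step metric dicts
log_history = [
    {"step": 50, "eval_loss": 0.69, "eval_rewards/margins": 0.02},
    {"step": 60, "loss": 0.65},  # training-only entry, skipped by margin_trend
    {"step": 100, "eval_loss": 0.61, "eval_rewards/margins": 0.21},
]

trend = margin_trend(log_history)
print(trend, is_improving(trend))
```

After training, the same helper can be called as margin_trend(trainer.state.log_history). A margin that stays flat or negative suggests revisiting the pairs or beta before training longer.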
The smol-course hands-on exercise at unit2/3 walks through loading the preference dataset, training, and checking metrics on a real SmolLM3 run.
Sources: HuggingFace smol-course: DPO unit | Hands-on exercise with SmolLM3 | TRL DPO Trainer docs