
LoRA Hyperparameters Demystified: Which Rank, Alpha, and Layers Actually Matter for Your Fine-Tune
Chris Harper
Jul 4, 2026 · 04:12 UTC
TL;DR: The r=4 LoRA config used in many tutorials is often too small for real tasks. Start at r=16, set alpha to 2x rank, apply LoRA to all attention and MLP layers, and use QLoRA only when memory-constrained.
What you'll be able to do after this:
- Choose rank r with confidence: know when r=16 is enough and when to push to r=64 or r=256
- Set lora_alpha correctly using the 2x-rank heuristic, and understand when to adjust it
- Apply LoRA to all attention and MLP layers (not just query/value) and see why it consistently improves results
You've run the Unsloth Colab tutorial and fine-tuned your first model. Now you're trying to get serious quality — and the defaults aren't cutting it. The three levers that matter most are rank, alpha, and which layers you target. Here's what Sebastian Raschka's synthesis of hundreds of LoRA experiments actually shows.
What rank r controls
LoRA approximates each weight update with a low-rank product, ΔW ≈ AB, where A is (d × r) and B is (r × k). The rank r is the bottleneck dimension: it caps how many independent directions the fine-tune can shift each weight matrix in. A higher r buys more expressiveness, but also means more trainable parameters, more memory, and more overfitting risk on small datasets.
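To make the cost of rank concrete, here is a minimal sketch that counts the trainable parameters in one LoRA pair using the shapes above. The 4096 × 4096 projection is an assumed example size, and the function names are illustrative rather than part of any library.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in one LoRA pair: A is (d x r), B is (r x k)."""
    return r * (d + k)


def full_param_count(d: int, k: int) -> int:
    """Parameters in the full (d x k) weight matrix that LoRA leaves frozen."""
    return d * k


# Assumed example: a square 4096 x 4096 projection matrix
d = k = 4096
for r in (4, 16, 64, 256):
    lora = lora_param_count(d, k, r)
    share = lora / full_param_count(d, k)
    print(f"r={r:>3}: {lora:>9,} trainable params ({share:.2%} of the full matrix)")
```

The count grows linearly with r, so moving from r=16 to r=64 quadruples the adapter size for every matrix you target.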
| Rank | When to use |
|---|---|
| r = 4–8 | Style/tone transfer, very small datasets (<5k examples) |
| r = 16–32 | Good starting point for most instruction-following tasks |
| r = 64–128 | Code generation, factual recall tasks, datasets >50k examples |
| r = 256+ | Deep task specialization; Raschka found r=256 optimal in several experiments |
The default r=4 in many tutorials is fine for demonstration. For production fine-tunes, start at r=16 and run a quick eval before scaling up.
Setting alpha
lora_alpha is a scaling factor: the actual update applied is (alpha/r) × ΔW. The standard heuristic is alpha = 2 × r (scaling factor of 2). This is a good default — but if training is noisy or you're getting stability issues, try alpha = r (scaling factor 1), which keeps the base model's knowledge more intact. Raschka found that lower ratios sometimes outperformed the 2x rule depending on dataset size; tune it like a learning rate multiplier.
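A quick sketch of the alpha/r arithmetic, assuming the scaling rule described above. The helper name lora_scale and the fixed alpha of 32 are illustrative choices, not values from the source.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to the low-rank update: alpha / r."""
    return alpha / r


# Compare a fixed alpha against the 2x-rank heuristic while sweeping rank
FIXED_ALPHA = 32  # hypothetical value that was never retuned
for r in (8, 16, 64, 256):
    fixed = lora_scale(FIXED_ALPHA, r)
    heuristic = lora_scale(2 * r, r)
    print(f"r={r:>3}: fixed alpha={FIXED_ALPHA} -> scale {fixed:.3f}, alpha=2r -> scale {heuristic:.1f}")
```

If alpha stays fixed while r grows, the effective scale shrinks, which is one reason to revisit alpha every time you change the rank.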
Which layers to apply LoRA to
For Llama-style models, Hugging Face PEFT's default targets only q_proj and v_proj (the query and value attention matrices). In Raschka's experiments, applying LoRA to all attention and MLP layers noticeably improved results:
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the frozen base model that the adapters attach to
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

config = LoraConfig(
    r=64,
    lora_alpha=128,  # 2x rank
    target_modules=[  # all attention and MLP projections, not just q/v
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "down_proj", "gate_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: ~168M at r=64 (vs ~27M with q/v only on this 8B model)
```
The extra memory is moderate. The quality gain is consistent across tasks.
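The trainable-parameter count can be estimated by hand before loading anything. The sketch below assumes the published Llama 3 8B shapes (hidden size 4096, 32 layers, 8 key/value heads of dimension 128, MLP width 14336). The SHAPES table and the function name are illustrative, so treat the result as a back-of-the-envelope check rather than a substitute for print_trainable_parameters().

```python
# Assumed Llama 3 8B dimensions
HIDDEN = 4096
KV_DIM = 8 * 128  # grouped-query attention: 8 KV heads x head dim 128
MLP = 14336
LAYERS = 32

# (input features, output features) of each projection in one transformer block
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_trainable_params(target_modules, r, layers=LAYERS):
    """Sum r * (in + out) over the targeted projections, across all layers."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * layers


all_layers = lora_trainable_params(list(SHAPES), r=64)
qv_only = lora_trainable_params(["q_proj", "v_proj"], r=64)
print(f"all attention + MLP: {all_layers:,}")
print(f"q_proj + v_proj only: {qv_only:,}")
```

Under these assumptions, most of the extra parameters come from the three MLP projections, since they are the widest matrices in each block.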
QLoRA: memory savings vs. training time
QLoRA stores the frozen base model in 4-bit NF4 instead of bfloat16 and trains the LoRA adapters on top of it. Raschka's measurements for a 7B model:
| Setup | Memory | Training time |
|---|---|---|
| Full LoRA (bfloat16) | 21.3 GB | 1.85 h |
| QLoRA (4-bit NF4) | 14.2 GB | 2.79 h |
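The source article summarizes this as 33% memory savings at a 39% runtime cost; by the table's own figures, the QLoRA run takes about 1.5x as long (2.79 h vs. 1.85 h). Use QLoRA on a free T4 (16 GB) or on a 24 GB card that is already near its limit. If you have a 3090/4090 (24 GB) with room to spare, standard 16-bit LoRA gives you faster iteration.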
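The trade-off can be re-derived directly from the table. This is a small arithmetic sketch: the variable names and the fits_on helper are illustrative, and the fit check ignores the fact that peak memory also depends on batch size and sequence length.

```python
# Figures from the table above (7B model)
lora_mem_gb, lora_hours = 21.3, 1.85
qlora_mem_gb, qlora_hours = 14.2, 2.79

memory_saved = 1 - qlora_mem_gb / lora_mem_gb    # fraction of memory saved
extra_time = qlora_hours / lora_hours - 1        # additional wall-clock time
throughput_drop = 1 - lora_hours / qlora_hours   # lost training speed


def fits_on(gpu_gb: float, needed_gb: float) -> bool:
    """Rough check: does the measured peak memory fit on the card?"""
    return needed_gb <= gpu_gb


print(f"memory saved:    {memory_saved:.1%}")
print(f"extra time:      {extra_time:.1%}")
print(f"throughput drop: {throughput_drop:.1%}")
print("16 GB card, 16-bit LoRA:", fits_on(16, lora_mem_gb))
print("16 GB card, QLoRA:      ", fits_on(16, qlora_mem_gb))
```

How large the slowdown looks depends on which ratio you quote: extra wall-clock time and lost throughput are different numbers for the same pair of runs.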
Single epoch is almost always right
In Raschka's experiments, training for more than one epoch made results worse. The likely cause is that after the first pass the model starts memorizing the training set and loses generalization. One epoch over a clean, diverse dataset beats three epochs over a small or repetitive one. If your validation loss is still improving at the end of epoch 1, add more data rather than more epochs.
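If your training loop is configured by step count rather than by epochs, a rough sketch like the following gives the number of optimizer steps in a single pass. The function name and the example sizes are hypothetical, and a given trainer may round partial batches differently, so treat the result as an estimate.

```python
import math


def optimizer_steps_per_epoch(num_examples: int, micro_batch_size: int,
                              grad_accum_steps: int = 1) -> int:
    """Estimate optimizer updates in one pass over the dataset."""
    micro_batches = math.ceil(num_examples / micro_batch_size)
    return math.ceil(micro_batches / grad_accum_steps)


# Hypothetical setup: 50k examples, micro-batch of 4, 8 accumulation steps
steps = optimizer_steps_per_epoch(50_000, micro_batch_size=4, grad_accum_steps=8)
print(f"cap training at about {steps} optimizer steps for one epoch")
```

Capping the run at this step count keeps the model from seeing any example twice.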
Sources: Practical Tips for Finetuning LLMs Using LoRA — Sebastian Raschka · HuggingFace PEFT LoRA conceptual guide · LoRA paper: Hu et al. (arXiv:2106.09685)