LoRA Hyperparameters Demystified: Which Rank, Alpha, and Layers Actually Matter for Your Fine-Tune

Chris Harper

Jul 4, 2026 · 04:12 UTC

Tutorial

Fine-Tuning

HuggingFace

TL;DR: The default r=4 LoRA config is often too small for real tasks — start at r=16, set alpha to 2x rank, apply to all transformer layers, and use QLoRA only when memory-constrained.

What you'll be able to do after this:

Choose rank r with confidence — know when r=16 is enough and when to push to r=64 or r=256
Set lora_alpha correctly using the 2x-rank heuristic, and understand when to adjust it
Apply LoRA to all attention and MLP layers (not just query/value) and see why it consistently improves results

You've run the Unsloth Colab tutorial and fine-tuned your first model. Now you're trying to get serious quality — and the defaults aren't cutting it. The three levers that matter most are rank, alpha, and which layers you target. Here's what Sebastian Raschka's synthesis of hundreds of LoRA experiments actually shows.

What rank r controls

LoRA decomposes each weight update as ΔW ≈ AB, where A is (d × r) and B is (r × k). The rank r is the bottleneck dimension: it caps how many independent directions the fine-tune can shift each adapted weight matrix in. Each adapted matrix adds r × (d + k) trainable parameters, so higher r means more expressiveness, more trainable parameters, more memory, and more overfitting risk on small datasets.
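To make the cost of rank concrete, here is a minimal sketch that counts the trainable parameters LoRA adds to a single weight matrix, using the shapes above (A is d × r, B is r × k). The function name lora_param_count and the 4096 × 4096 example shape are illustrative assumptions, not part of any library.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d x k) weight matrix."""
    if r <= 0 or r > min(d, k):
        raise ValueError("rank must satisfy 0 < r <= min(d, k)")
    # A contributes d*r entries, B contributes r*k entries.
    return d * r + r * k


# Hypothetical square projection, e.g. a 4096 x 4096 attention matrix.
d = k = 4096
full = d * k  # parameters in the frozen weight matrix itself
for r in (4, 16, 64, 256):
    added = lora_param_count(d, k, r)
    print(f"r={r:>3}: {added:>9,} adapter params ({added / full:.2%} of the matrix)")
```

The count grows linearly with r, so going from r=16 to r=256 multiplies the adapter size by 16. Whether that extra capacity helps depends on the task and dataset size.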

Rank	When to use
r = 4–8	Style/tone transfer, very small datasets (<5k examples)
r = 16–32	Good starting point for most instruction-following tasks
r = 64–128	Code generation, factual recall tasks, datasets >50k examples
r = 256+	Deep task specialization; Raschka found r=256 optimal in several experiments

The default r=4 in many tutorials is fine for demonstration. For production fine-tunes, start at r=16 and run a quick eval before scaling up.

Setting alpha

lora_alpha is a scaling factor: the actual update applied is (alpha/r) × ΔW. The standard heuristic is alpha = 2 × r (scaling factor of 2). This is a good default — but if training is noisy or you're getting stability issues, try alpha = r (scaling factor 1), which keeps the base model's knowledge more intact. Raschka found that lower ratios sometimes outperformed the 2x rule depending on dataset size; tune it like a learning rate multiplier.
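The scaling rule can be sketched in a few lines of plain Python. This toy example (hypothetical helper names, tiny hand-written matrices, no real model) computes (alpha/r) × AB and shows that doubling alpha doubles every entry of the update, which is why alpha behaves much like a learning-rate multiplier.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def scaled_lora_update(a, b, alpha):
    """Return (alpha / r) * (A @ B), where r is the shared inner dimension."""
    r = len(b)  # B is (r x k), so its row count is the rank
    if any(len(row) != r for row in a):
        raise ValueError("A must be (d x r) and B must be (r x k)")
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(a, b)]


# Toy shapes: d=2, r=2, k=3.
A = [[1.0, 0.0], [0.0, 2.0]]
B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

update_alpha_r = scaled_lora_update(A, B, alpha=2)   # alpha = r  -> scale 1
update_alpha_2r = scaled_lora_update(A, B, alpha=4)  # alpha = 2r -> scale 2
print(update_alpha_r)
print(update_alpha_2r)
```

In a real fine-tune the library applies this scale for you. The sketch only illustrates why changing alpha at a fixed rank changes the size of the update added to the frozen weights.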

Which layers to apply LoRA to

For Llama-style models, HuggingFace PEFT's default targets only q_proj and v_proj (the query and value attention matrices). Raschka's experiments consistently showed that applying LoRA to all attention and MLP projection layers noticeably improved results:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

config = LoraConfig(
    r=64,
    lora_alpha=128,             # 2x rank
    target_modules=[            # all layers, not just q/v
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "down_proj", "gate_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: ~168M (vs ~27M with q/v only at the same rank)

The extra memory is moderate. The quality gain is consistent across tasks.
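For a rough sense of how much larger the adapter gets, the sketch below estimates trainable parameters for q/v-only versus all-projection targeting at r=64. The layer shapes are assumptions based on the published Llama-3-8B architecture (hidden size 4096, MLP width 14336, 32 layers, 1024-wide key/value projections from grouped-query attention); check your model's config.json before relying on them. The helper name lora_trainable_params is hypothetical.

```python
# Assumed Llama-3-8B shapes as (in_features, out_features); verify against config.json.
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336
NUM_LAYERS = 32

PROJECTION_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(target_modules, r, num_layers=NUM_LAYERS):
    """Sum r * (in + out) over every targeted projection in every layer."""
    per_layer = sum(r * sum(PROJECTION_SHAPES[name]) for name in target_modules)
    return per_layer * num_layers


qv_only = lora_trainable_params(["q_proj", "v_proj"], r=64)
all_linear = lora_trainable_params(list(PROJECTION_SHAPES), r=64)
print(f"q/v only:   {qv_only:,}")
print(f"all layers: {all_linear:,}")
```

Under these assumptions, targeting every projection trains roughly six times as many adapter parameters as q/v only. Most of the extra comes from the three wide MLP projections.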

QLoRA: memory savings vs. training time

QLoRA keeps the frozen base model weights in 4-bit NF4 instead of bfloat16 and trains the LoRA adapters on top of them. Raschka's measurements for a 7B model:

Method	Memory	Training time
Full LoRA (bfloat16)	21.3 GB	1.85 h
QLoRA (4-bit NF4)	14.2 GB	2.79 h

That is 33% less memory in exchange for roughly 50% more training time (2.79 h vs. 1.85 h). Use QLoRA on a free T4 (16 GB), or on a 24 GB card that is already close to its memory limit. If you have a 3090/4090 (24 GB) with room to spare, full LoRA gives you faster iteration.
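The rule of thumb above can be written as a small helper. This is only a sketch: it reuses the two peak-memory figures from the table as thresholds, which hold for one specific 7B setup, and headroom_gb is a hypothetical safety margin you would tune for your own batch size and sequence length.

```python
# Peak memory from the table above (7B model, one specific setup); rough guides only.
FULL_LORA_GB = 21.3
QLORA_GB = 14.2


def pick_method(free_gpu_memory_gb: float, headroom_gb: float = 1.0) -> str:
    """Suggest 'lora', 'qlora', or 'neither' for the GPU memory that is actually free."""
    if free_gpu_memory_gb >= FULL_LORA_GB + headroom_gb:
        return "lora"      # enough room for bfloat16 base weights: faster iteration
    if free_gpu_memory_gb >= QLORA_GB + headroom_gb:
        return "qlora"     # fits only with the 4-bit base model
    return "neither"       # consider a smaller model or shorter sequences


for name, free_gb in [("T4 (16 GB)", 16), ("24 GB card, idle", 24), ("24 GB card, 6 GB in use", 18)]:
    print(f"{name}: {pick_method(free_gb)}")
```

Passing the free memory rather than the card's nominal size covers the "24 GB card at max load" case: once other processes hold a few gigabytes, the helper falls back to QLoRA.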

Single epoch is almost always right

Multi-epoch training usually hurts instruction fine-tunes. After epoch 1, the model tends to start memorizing the training set and lose generalization. One epoch over a clean, diverse dataset beats three epochs over a small or repetitive one. If your validation loss is still improving at the end of epoch 1, add more data, not more epochs.