
Deploy Your Fine-Tune: Merge a LoRA Adapter into the Base Model and Push to HuggingFace Hub
Chris Harper
Jun 30, 2026 · 12:09 UTC
TL;DR: model.merge_and_unload() collapses a LoRA adapter into the base model weights once — no PEFT needed at inference time. Two more lines push it to HuggingFace Hub so vLLM, Ollama, or anyone can pull it.
What you'll be able to do after this:
- Merge a trained LoRA adapter into the base model with a single function call, eliminating PEFT as a runtime dependency
- Save the merged model locally and push it to HuggingFace Hub in two lines
- Load the merged model directly with vLLM or Ollama, exactly like any other HuggingFace checkpoint
After fine-tuning with LoRA (via Unsloth, TRL's SFTTrainer, or similar), you have a PeftModel: the frozen base model plus a set of low-rank adapter matrices. At inference time, PEFT adds the adapter's contribution on every forward pass. For serving you typically want one merged checkpoint: no adapter overhead, no PEFT dependency, and compatibility with standard inference tools.
Step 1: Reload base model and merge
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Reload the base model in serving precision
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# Load your trained adapter on top of the base model
peft_model = PeftModel.from_pretrained(base_model, "./output/checkpoint-final")
# Merge adapter weights into base model — returns a plain transformers model
merged = peft_model.merge_and_unload()
# Save locally
merged.save_pretrained("./merged-mistral-7b")
tokenizer.save_pretrained("./merged-mistral-7b")
print("Saved — no PEFT library needed to load this checkpoint")
merge_and_unload() computes W_merged = W_base + (B x A) x (alpha / r) once for every LoRA layer and stores the result. The adapter checkpoint is no longer needed after this step.
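To see why the merged model gives the same outputs as base plus adapter, here is a minimal sketch of that formula on a toy layer. It uses plain Python lists so it runs without torch. The matrix values, the shapes, and the helper names (`matmul`, `matvec`, `merge_lora`) are made up for illustration and are not PEFT's actual implementation.

```python
# Toy check of the LoRA merge formula with plain Python lists.
# Assumed shapes: W is (out x in), B is (out x r), A is (r x in).

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def merge_lora(W_base, A, B, alpha, r):
    """Return W_base + (B @ A) * (alpha / r)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W_base, delta)]

# Hypothetical toy values: out=2, in=3, r=1, alpha=2
W_base = [[1.0, 0.0, 2.0],
          [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # (r x in)
B = [[0.5], [-1.0]]     # (out x r)
alpha, r = 2, 1

W_merged = merge_lora(W_base, A, B, alpha, r)

x = [1.0, 1.0, 1.0]

# Unmerged path: base output plus the scaled adapter output, recomputed per call
adapter_out = matvec(B, matvec(A, x))
unmerged_out = [b + (alpha / r) * a for b, a in zip(matvec(W_base, x), adapter_out)]

# Merged path: a single matrix-vector product
merged_out = matvec(W_merged, x)

print(unmerged_out, merged_out)
```

Both paths produce `[9.0, -11.0]` for this input. The merged path gets there with one product per layer, which is where the saved adapter overhead comes from.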
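Memory note: merging requires holding both the base weights and the adapter in memory at once. Expect roughly 16 GB for a 7B model in bfloat16 (GPU memory if device_map="auto" places the model on a GPU). If the base model is still loaded in bfloat16 from training, you can merge in the same Colab session instead of reloading cold. If you trained against a quantized base (for example 4-bit), reload it in bfloat16 as shown above before merging.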
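The ~16 GB figure can be sanity-checked with back-of-the-envelope arithmetic: parameter count times bytes per parameter. The sketch below assumes roughly 7.24 billion parameters for Mistral-7B (an approximate count) and covers weights only; the adapter, buffers, and framework overhead come on top. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 7.24e9  # assumed approximate parameter count for Mistral-7B

bf16_gb = weight_memory_gb(N_PARAMS, bytes_per_param=2)  # bfloat16: 2 bytes each
fp32_gb = weight_memory_gb(N_PARAMS, bytes_per_param=4)  # float32: 4 bytes each

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"float32 weights:  ~{fp32_gb:.1f} GB")
```

Under these assumptions the bfloat16 weights come to about 14.5 GB, which leaves a modest margin within 16 GB. Loading in float32 would roughly double that.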
Step 2: Push to HuggingFace Hub
# Authenticate first (one-time). Run this in a shell, not in Python:
#   huggingface-cli login    (or: export HF_TOKEN=hf_...)
# Push merged model to your Hub repo (add private=True for a private repo)
merged.push_to_hub("yourusername/mistral-7b-custom-v1")
tokenizer.push_to_hub("yourusername/mistral-7b-custom-v1")
print("Live at huggingface.co/yourusername/mistral-7b-custom-v1")
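Large models are automatically split into multiple weight shards. The default shard size depends on your transformers version, and you can set it yourself with the max_shard_size argument. The Hub stores all shards and serves them via CDN, so anyone with access to the repo can pull the model by its repo ID.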
Step 3: Serve with vLLM or Ollama
Once the model is on the Hub, vLLM can load it directly by repo ID. Ollama imports the Safetensors weights from the local directory you saved in Step 1:
# vLLM: one command
python -m vllm.entrypoints.openai.api_server \
    --model yourusername/mistral-7b-custom-v1 \
    --port 8000
# Ollama: Modelfile pointing at the local merged checkpoint directory
printf 'FROM ./merged-mistral-7b\nPARAMETER temperature 0.7\n' > Modelfile
ollama create my-fine-tune -f Modelfile
ollama run my-fine-tune
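To check that the vLLM server responds, you can send a request to its OpenAI-compatible completions endpoint. The sketch below builds that request with the standard library only. The helper name `build_completion_request`, the prompt, and the `max_tokens` value are illustrative assumptions; the host and port match the command above. The network call itself is left commented out so the snippet runs even without a server.

```python
import json
import urllib.request

def build_completion_request(
    prompt: str,
    model: str = "yourusername/mistral-7b-custom-v1",
    base_url: str = "http://localhost:8000",
    max_tokens: int = 64,
) -> urllib.request.Request:
    """Build a POST request for the server's /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request("Explain LoRA in one sentence.")

# Uncomment once the vLLM server is running:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["text"])
```

The `model` field should match the name the server was started with, which here is the Hub repo ID.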
Adapter-only vs. merged: when to choose each
| | Adapter-only push | Merged model push |
|---|---|---|
| Hub storage | ~10-100 MB | ~14 GB for a 7B model |
| Inference tooling | Requires PEFT at runtime | Any standard transformers loader |
| vLLM / Ollama | Extra setup step | Drop-in compatible |
| Best for | Popular base models; many adapters | Maximum compatibility; proprietary base models |
If the base model is already on the Hub and your users can load it, push only the adapter — PeftModel.from_pretrained(base, "adapter-hub-id") handles the rest. For everything else, push the merged checkpoint.
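The storage gap in the table follows from how few parameters an adapter adds. As a rough sketch, each adapted linear layer adds r x (d_in + d_out) parameters. The configuration below is an assumption for illustration: rank 16 applied only to q_proj and v_proj, with Mistral-7B's dimensions (hidden size 4096, key/value projection width 1024, 32 layers) and 2 bytes per parameter. Your own adapter's rank, target modules, and saved dtype may differ.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """A is (r x d_in) and B is (d_out x r), so a layer adds r * (d_in + d_out) params."""
    return r * (d_in + d_out)

# Assumed configuration (illustrative, not taken from a specific training run)
HIDDEN = 4096    # hidden size
KV_DIM = 1024    # v_proj output width (grouped-query attention)
N_LAYERS = 32
RANK = 16

per_layer = (
    lora_param_count(HIDDEN, HIDDEN, RANK)    # q_proj: 4096 -> 4096
    + lora_param_count(HIDDEN, KV_DIM, RANK)  # v_proj: 4096 -> 1024
)
adapter_params = per_layer * N_LAYERS
adapter_mb = adapter_params * 2 / 1e6  # 2 bytes per parameter

print(f"{adapter_params:,} adapter parameters, ~{adapter_mb:.1f} MB")
```

Under these assumptions the adapter holds about 6.8 million parameters, or roughly 13.6 MB, compared with about 14 GB for the merged checkpoint. Higher ranks or more target modules push the adapter toward the upper end of the table's range.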