Deploy Your Fine-Tune: Merge a LoRA Adapter into the Base Model and Push to HuggingFace Hub

Chris Harper

Jun 30, 2026 · 12:09 UTC

Tutorial

Fine-Tuning

HuggingFace

TL;DR: model.merge_and_unload() folds a LoRA adapter into the base model weights once, so PEFT is not needed at inference time. Two more lines push the merged checkpoint to HuggingFace Hub, where vLLM and other tools that read standard transformers checkpoints can pull it by repo ID.

What you'll be able to do after this:

Merge a trained LoRA adapter into the base model with a single function call, eliminating PEFT as a runtime dependency
Save the merged model locally and push it to HuggingFace Hub in two lines
Serve the merged model with vLLM or import it into Ollama, with no adapter-specific setup

After fine-tuning with LoRA (via Unsloth, TRL's SFTTrainer, or similar), you have a PeftModel: the frozen base model plus a set of low-rank adapter matrices. At inference time, PEFT runs the adapter path alongside the base weights on every forward pass. For serving you typically want one merged checkpoint: no adapter overhead, no PEFT dependency, and compatibility with any tool that loads standard transformers checkpoints.

Step 1: Reload base model and merge

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Reload the base model in serving precision
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Load your trained adapter on top of the base model
peft_model = PeftModel.from_pretrained(base_model, "./output/checkpoint-final")

# Merge adapter weights into base model — returns a plain transformers model
merged = peft_model.merge_and_unload()

# Save locally
merged.save_pretrained("./merged-mistral-7b")
tokenizer.save_pretrained("./merged-mistral-7b")
print("Saved — no PEFT library needed to load this checkpoint")

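merge_and_unload() computes W_merged = W_base + (B x A) x (alpha / r) once for every LoRA layer and stores the result, where A and B are the adapter's low-rank matrices, r is the LoRA rank, and alpha is the scaling factor (alpha / r is the default LoRA scaling). The adapter checkpoint is no longer needed for inference after this step.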
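To make the formula concrete, here is a minimal pure-Python sketch of the merge on a toy layer. The matrix values, the sizes, and the helper names matmul and merge_lora are made up for illustration; PEFT does the same arithmetic on real tensors, and this is not its implementation. The sketch checks that the merged weight gives the same output as running the base weight and the adapter separately.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w_base, lora_b, lora_a, alpha, r):
    """Return W_base + (B x A) x (alpha / r) as a new matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_base, delta)
    ]


# Toy layer: 4 inputs, 3 outputs, LoRA rank 2 (all values are illustrative)
w_base = [[0.5, -1.0, 0.0, 2.0],
          [1.5, 0.25, -0.5, 0.0],
          [0.0, 1.0, 1.0, -2.0]]
lora_b = [[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]]            # 3 x 2
lora_a = [[0.0, 1.0, 0.0, -1.0], [2.0, 0.0, 1.0, 0.0]]    # 2 x 4
alpha, r = 4, 2
x = [[1.0], [2.0], [-1.0], [0.5]]                         # input column vector

# Unmerged path: base output plus the scaled adapter output
scale = alpha / r
base_out = matmul(w_base, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
unmerged_out = [[b[0] + scale * a[0]] for b, a in zip(base_out, adapter_out)]

# Merged path: a single matrix multiply with the merged weight
w_merged = merge_lora(w_base, lora_b, lora_a, alpha, r)
merged_out = matmul(w_merged, x)

outputs_match = all(
    math.isclose(u[0], m[0], abs_tol=1e-9) for u, m in zip(unmerged_out, merged_out)
)
print(outputs_match)
```

Because the two paths agree, the adapter matrices can be discarded once the merged weight is stored.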

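Memory note: merging needs the full base weights in memory (RAM or VRAM, depending on device_map) plus the much smaller adapter. For a 7B model in bfloat16 that is about 14 GB of weights, so plan for roughly 16 GB. If you trained on an unquantized base, you can merge in the same Colab session while the model is still loaded instead of reloading cold. If you trained on a 4-bit quantized base (QLoRA), reloading the base in bfloat16 as shown above is the usual approach.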
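The memory figure follows from simple arithmetic: parameter count times bytes per parameter. The sketch below uses a round 7 billion parameters as an assumption (the exact count for a given 7B model differs slightly), and weights_gb is a hypothetical helper name.

```python
def weights_gb(num_params, bytes_per_param):
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}
num_params = 7_000_000_000  # assumption: round figure for a "7B" model

for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: ~{weights_gb(num_params, nbytes):.0f} GB")
```

This counts weights only. Activations, the adapter, and framework overhead come on top, which is why the note above suggests extra headroom.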

Step 2: Push to HuggingFace Hub

# Authenticate first (one-time). Run this in a terminal, not in Python:
#   huggingface-cli login   (or: export HF_TOKEN=hf_...)

# Push merged model to your Hub repo (add private=True for a private repo)
merged.push_to_hub("yourusername/mistral-7b-custom-v1")
tokenizer.push_to_hub("yourusername/mistral-7b-custom-v1")
print("Live at huggingface.co/yourusername/mistral-7b-custom-v1")

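Large models are automatically split into multiple shard files. The default shard size depends on your transformers version; pass max_shard_size (for example max_shard_size="5GB") to save_pretrained() or push_to_hub() to control it. The Hub stores all shards and serves them via CDN, so anyone with access to the repo can pull the model by its repo ID.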
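As a rough illustration of what sharding means, the sketch below groups tensors into files so that no file exceeds a size limit. It is a simplified greedy scheme with made-up tensor names and sizes, not the actual algorithm used by transformers or the Hub; plan_shards is a hypothetical helper.

```python
def plan_shards(tensor_sizes, max_shard_bytes):
    """Greedily group tensors (name -> size in bytes) into shards, in order."""
    shards, current, current_size = [], [], 0
    for name, size in tensor_sizes.items():
        # Start a new shard if adding this tensor would exceed the limit
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards


# Five illustrative tensors of 3 bytes each, with a 7-byte shard limit
sizes = {f"layer{i}.weight": 3 for i in range(5)}
shards = plan_shards(sizes, max_shard_bytes=7)
for index, names in enumerate(shards, start=1):
    print(f"shard {index}: {names}")
```

A smaller limit produces more, smaller files, which can help on machines with limited memory or flaky connections.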

Step 3: Serve with vLLM or Ollama

Once on the Hub, your model works like any other standard transformers checkpoint:

# vLLM — one command
python -m vllm.entrypoints.openai.api_server \
    --model yourusername/mistral-7b-custom-v1 \
    --port 8000

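# Ollama: Modelfile pointing at the local merged directory from Step 1.
# A bare Hub repo ID in FROM would be looked up in the Ollama registry, so import from disk instead.
printf 'FROM ./merged-mistral-7b\nPARAMETER temperature 0.7\n' > Modelfile
ollama create my-fine-tune -f Modelfile
ollama run my-fine-tune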
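Once the vLLM server from this step is running, you can check it with a plain HTTP request to its OpenAI-compatible completions endpoint. The sketch below uses only the standard library. The helper names, the prompt, and the localhost URL are assumptions matching the command above, and the actual network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=64, temperature=0.7):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request):
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


request = build_completion_request(
    "http://localhost:8000",
    "yourusername/mistral-7b-custom-v1",  # must match the --model value
    "The capital of France is",
)
print(request.full_url)

# Uncomment once the server is up:
# print(complete(request))
```

The plain completions endpoint is used here because a base model such as Mistral-7B-v0.1 may not ship a chat template.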

Adapter-only vs. merged: when to choose each

	Adapter-only push	Merged model push
Hub storage	~10-100 MB	~14 GB for a 7B model
Inference tooling	Requires PEFT at runtime	Any standard transformers loader
vLLM / Ollama	Extra setup step	Drop-in compatible
Best for	Popular base models; many adapters	Maximum compatibility; proprietary base models

If the base model is already on the Hub and your users can load it, push only the adapter — PeftModel.from_pretrained(base, "adapter-hub-id") handles the rest. For everything else, push the merged checkpoint.
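The storage gap in the comparison above comes from how few parameters a LoRA adapter holds. Here is a rough estimate, assuming rank 16, adapters on two attention projections per block (4096 to 4096 and 4096 to 1024), 32 transformer blocks, and float32 storage. All of these are illustrative assumptions, not values read from a real adapter config, and the helper names are hypothetical.

```python
def lora_params(d_in, d_out, r):
    """Parameters in one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def adapter_size_mb(layer_shapes, r, num_blocks, bytes_per_param):
    """Approximate adapter size in megabytes (1 MB = 1e6 bytes)."""
    per_block = sum(lora_params(d_in, d_out, r) for d_in, d_out in layer_shapes)
    return per_block * num_blocks * bytes_per_param / 1e6


# Assumed target projections per block: (d_in, d_out)
target_shapes = [(4096, 4096), (4096, 1024)]
size_mb = adapter_size_mb(target_shapes, r=16, num_blocks=32, bytes_per_param=4)
print(f"adapter: ~{size_mb:.1f} MB")
```

Higher ranks or more target modules grow the adapter proportionally, but it stays orders of magnitude smaller than the merged checkpoint.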
