Deploy Your Fine-Tune: Merge a LoRA Adapter into the Base Model and Push to HuggingFace Hub

Chris Harper

3 min read

Jun 30, 2026 · 12:09 UTC

AI
Tutorial
Fine-Tuning
HuggingFace

TL;DR: model.merge_and_unload() collapses a LoRA adapter into the base model weights once, so no PEFT is needed at inference time. Two more lines push the result to HuggingFace Hub, where vLLM or anyone with the repo ID can pull it; Ollama can import the locally saved copy.

What you'll be able to do after this:

  • Merge a trained LoRA adapter into the base model with a single function call, eliminating PEFT as a runtime dependency
  • Save the merged model locally and push it to HuggingFace Hub in two lines
  • Serve the merged model with vLLM straight from the Hub, or import the saved directory into Ollama, like any other standard HuggingFace checkpoint

After fine-tuning with LoRA (via Unsloth, TRL's SFTTrainer, or similar), you have a PeftModel: the frozen base model plus a set of low-rank adapter matrices. At inference time, PEFT adds the adapter's contribution on every forward pass. For serving you typically want one merged checkpoint: no adapter overhead, no PEFT dependency, and compatibility with any tool that loads standard transformers checkpoints.

Step 1: Reload base model and merge

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Reload the base model in serving precision
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Load your trained adapter on top of the base model
peft_model = PeftModel.from_pretrained(base_model, "./output/checkpoint-final")

# Merge adapter weights into base model — returns a plain transformers model
merged = peft_model.merge_and_unload()

# Save locally
merged.save_pretrained("./merged-mistral-7b")
tokenizer.save_pretrained("./merged-mistral-7b")
print("Saved — no PEFT library needed to load this checkpoint")

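merge_and_unload() computes W_merged = W_base + (alpha / r) x (B x A) once for every LoRA layer and stores the result, where A and B are the adapter's low-rank matrices, r is the LoRA rank, and alpha is the LoRA scaling factor. The adapter checkpoint is no longer needed for inference after this step.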
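
To make the merge formula concrete, here is a minimal pure-Python sketch on tiny hand-written matrices. The values, shapes, and helper functions are hypothetical and chosen only for illustration; this is not how PEFT implements the merge internally. It checks that applying the adapter on every call and folding it into the weight once give the same output.

```python
# Illustrative only: tiny pure-Python matrices, not real model weights.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * a for a in row] for row in X]

# Hypothetical 2x3 base weight with a rank-1 adapter (r=1, alpha=2).
W_base = [[1.0, 0.0, 2.0],
          [0.0, 1.0, -1.0]]
B = [[0.5], [-1.0]]          # shape (out_features, r)
A = [[1.0, 2.0, 3.0]]        # shape (r, in_features)
r, alpha = 1, 2.0

x = [[1.0], [1.0], [1.0]]    # one input column vector

# Unmerged path: base output plus the scaled adapter output, computed per call.
y_adapter = add(matmul(W_base, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Merged path: fold the adapter into the weight once, then do a single matmul.
W_merged = add(W_base, scale(matmul(B, A), alpha / r))
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Both paths produce the same vector, which is why the merged checkpoint can drop the adapter without changing the model's outputs (up to floating-point rounding on real weights).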

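Memory note: merging needs the full base weights and the adapter in memory at the same time. The weights alone take about 14 GB for a 7B model in bfloat16, so expect ~16 GB in total. If the base model is still loaded in bfloat16 from training, merge in that session instead of reloading cold; if you trained against a 4-bit quantized base (QLoRA-style), reload it in bfloat16 as in Step 1 before merging.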
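
The memory figure can be sanity-checked with simple arithmetic: parameter count times bytes per parameter. The sketch below assumes roughly 7.24 billion parameters for Mistral-7B; the helper name is hypothetical, and the estimate covers weights only, not the adapter or framework overhead.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Assumption: roughly 7.24 billion parameters for Mistral-7B.
MISTRAL_7B_PARAMS = 7.24e9

bf16_gb = weight_memory_gb(MISTRAL_7B_PARAMS, 2)   # bfloat16 = 2 bytes per parameter
fp32_gb = weight_memory_gb(MISTRAL_7B_PARAMS, 4)   # float32  = 4 bytes per parameter
print(f"bf16: {bf16_gb:.1f} GB, fp32: {fp32_gb:.1f} GB")
```

Loading in float32 instead of bfloat16 doubles the requirement, which is one reason to set the dtype explicitly when reloading the base model.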

Step 2: Push to HuggingFace Hub

# Shell: authenticate once (or: export HF_TOKEN=hf_...)
huggingface-cli login

# Python: push the merged model to your Hub repo (add private=True for a private repo)
merged.push_to_hub("yourusername/mistral-7b-custom-v1")
tokenizer.push_to_hub("yourusername/mistral-7b-custom-v1")
print("Live at huggingface.co/yourusername/mistral-7b-custom-v1")

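Large models are automatically split into multiple shard files. The default shard size depends on your transformers version, and you can set it explicitly with the max_shard_size argument (for example max_shard_size="5GB"). The Hub stores all shards and serves them via CDN, so anyone with access to the repo can pull the model by its ID.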
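
As a rough mental model of sharding, the sketch below greedily groups tensors into files under a size limit. It is a simplified illustration, not the transformers implementation; the function name, tensor names, and sizes are all hypothetical.

```python
def plan_shards(tensor_sizes, max_shard_bytes):
    """Greedily group (name, size_in_bytes) pairs into shards under a size limit."""
    shards, current, current_size = [], [], 0
    for name, size in tensor_sizes:
        # Start a new shard when adding this tensor would exceed the limit.
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards

# Hypothetical tensor names and sizes, in bytes, for illustration only.
tensors = [("embed", 4), ("layer0", 3), ("layer1", 3), ("layer2", 3), ("head", 4)]
shards = plan_shards(tensors, max_shard_bytes=8)
print(shards)
```

Because whole tensors are never split in this scheme, shard files end up at or below the limit rather than exactly at it.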

Step 3: Serve with vLLM or Ollama

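Once the model is on the Hub, vLLM can serve it directly by repo ID. A bare name in an Ollama Modelfile FROM line refers to Ollama's own model library rather than the Hub, so point Ollama at the merged directory saved in Step 1 instead:

# vLLM: one command
python -m vllm.entrypoints.openai.api_server \
    --model yourusername/mistral-7b-custom-v1 \
    --port 8000
# Ollama: Modelfile that imports the local safetensors directory from Step 1
printf 'FROM ./merged-mistral-7b\nPARAMETER temperature 0.7\n' > Modelfile
ollama create my-fine-tune -f Modelfile
ollama run my-fine-tune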
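
With the vLLM server running, it can be queried over its OpenAI-style HTTP API. The following is a minimal standard-library sketch, assuming the server listens on localhost:8000 as in the command above; the helper names are hypothetical, and only the request-building part runs without a live server.

```python
import json
import urllib.request

def build_completion_request(base_url, model, prompt, max_tokens=64):
    """Build a POST request for an OpenAI-style /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(base_url, model, prompt, max_tokens=64):
    """Send the request and return the first generated text (needs a running server)."""
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]

# Building the request does not touch the network.
req = build_completion_request(
    "http://localhost:8000", "yourusername/mistral-7b-custom-v1", "Hello"
)
print(req.full_url)
```

Calling complete("http://localhost:8000", "yourusername/mistral-7b-custom-v1", "Hello") would send the request; the model field should match the name passed to --model when the server was started.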

Adapter-only vs. merged: when to choose each

                  | Adapter-only push                  | Merged model push
Hub storage       | ~10-100 MB                         | ~14 GB for a 7B model
Inference tooling | Requires PEFT at runtime           | Any standard transformers loader
vLLM / Ollama     | Extra setup step                   | Drop-in compatible
Best for          | Popular base models; many adapters | Maximum compatibility; proprietary base models

If the base model is already on the Hub and your users can load it, push only the adapter — PeftModel.from_pretrained(base, "adapter-hub-id") handles the rest. For everything else, push the merged checkpoint.
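
The storage gap between the two options follows from parameter counts: each adapted weight matrix of shape (out, in) adds only r x (in + out) LoRA parameters. The sketch below estimates adapter size under stated assumptions: Mistral-7B-like dimensions, rank 16, adapters on q_proj and v_proj only, saved in float32. Your own configuration will differ, and the helper name is hypothetical.

```python
def lora_param_count(shapes, r):
    """Count LoRA parameters: each adapted (out, in) matrix adds r * (in + out)."""
    return sum(r * (in_f + out_f) for out_f, in_f in shapes)

# Assumptions for illustration: Mistral-7B-like dimensions (hidden size 4096,
# 1024-wide value projection, 32 layers), rank 16, adapters on q_proj and v_proj only.
per_layer = [(4096, 4096),   # q_proj (out_features, in_features)
             (1024, 4096)]   # v_proj
num_layers, r = 32, 16

params = num_layers * lora_param_count(per_layer, r)
size_mb_fp32 = params * 4 / 1e6   # 4 bytes per parameter if saved in float32
print(params, round(size_mb_fp32, 1))
```

Under these assumptions the adapter is a few tens of megabytes, roughly three orders of magnitude smaller than the merged checkpoint; higher ranks or more target modules push it toward the upper end of the adapter-only range.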
