Fine-Tuning & Serving: Serve a Model with vLLM

Fine-Tuning & Serving · Part 3 of 3


Change the model itself: when to fine-tune vs. retrieve, how LoRA/QLoRA work, and how to serve the result behind an OpenAI-compatible API.


Serve a Model with vLLM

TL;DR: Merge your LoRA adapter back into the base model, then run vllm serve ./my-merged-model to get an OpenAI-compatible endpoint. Your existing OpenAI client code works against it once you repoint base_url and pass your model's name.

You fine-tuned a model. Right now it's a folder of weights. This tutorial makes it a service anything can call — and because the API mirrors OpenAI's, the rest of your stack barely notices the swap.

HEADS UP

Serving runs on a GPU host, not your laptop: vLLM is built primarily for NVIDIA CUDA GPUs. The commands below follow vLLM's quickstart; run them where there's a GPU (a cloud VM, or the same Colab/box you trained on). This page is concept plus commands checked against the docs, not locally tested output.

Why vLLM (and why "OpenAI-compatible" matters)

vLLM is a high-throughput inference server. The headline feature for you isn't the speed tricks (PagedAttention, continuous batching); it's that it speaks the OpenAI API. That means:

Your code that calls openai.chat.completions.create(...) keeps working: you repoint base_url and pass your model's name.
Every tool that already talks to OpenAI (SDKs, frameworks, UIs) talks to your model for free.

No vendor lock-in, no bespoke client. You own the model and the interface.
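One way to picture how small the swap is: keep the endpoint settings in one table and pick a row. This is a minimal sketch, not a prescribed layout. The `ENDPOINTS` table, the `client_settings` helper, the `LOCAL_LLM_API_KEY` variable, and the hosted model name are all illustrative assumptions; the local URL assumes vLLM's default port.

```python
import os

# Hypothetical settings table: the hosted API and a local vLLM server differ
# only in where requests go, which key is sent, and which model is named.
ENDPOINTS = {
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "model": "gpt-4o-mini",           # example hosted model name
        "key_var": "OPENAI_API_KEY",
    },
    "local": {
        "base_url": "http://localhost:8000/v1",
        "model": "./my-merged-model",     # path passed to `vllm serve`
        "key_var": "LOCAL_LLM_API_KEY",   # hypothetical variable name
    },
}


def client_settings(target, env=None):
    """Return (client kwargs, model name) for the chosen target."""
    env = os.environ if env is None else env
    cfg = ENDPOINTS[target]
    # The OpenAI client wants a non-empty key even when the server ignores it.
    api_key = env.get(cfg["key_var"], "not-needed")
    return {"base_url": cfg["base_url"], "api_key": api_key}, cfg["model"]


kwargs, model = client_settings("local", env={})
print(kwargs, model)
```

With the `openai` package installed, `OpenAI(**kwargs)` would then build the client, and `model` goes into each request; nothing else in the calling code needs to know which backend is live.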

Step 1 — merge the adapter into the base

LoRA left you with a small adapter on top of a frozen base. The simplest thing to serve is a single merged model. With Hugging Face PEFT:

# illustrative: run on the GPU host where you trained
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Reads the base model name from the adapter's config, loads it, and attaches the adapter
model = AutoPeftModelForCausalLM.from_pretrained("./my-lora-adapter")
merged = model.merge_and_unload()          # fold adapter weights into the base
merged.save_pretrained("./my-merged-model")

# Assumes the tokenizer was saved next to the adapter; otherwise load it from the base model
AutoTokenizer.from_pretrained("./my-lora-adapter").save_pretrained("./my-merged-model")

merge_and_unload() collapses the W + B·A from the last tutorial into a single set of weights, so the served model is just a normal model — no adapter plumbing at inference time.
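To see why merging leaves the outputs unchanged, here is a toy check in plain Python. It is a sketch of the arithmetic only, not PEFT's implementation: the matrices are made up, the rank is 1, and the `scale` argument stands in for the alpha / r factor that LoRA implementations typically apply (left at 1.0 here to match the W + B·A form used in the text).

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y, scale=1.0):
    """Return X + scale * Y, elementwise."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Toy shapes: W is 3x3 (frozen base), B is 3x1 and A is 1x3 (rank r = 1).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[1.0], [2.0], [0.0]]
A = [[0.5, 0.0, 1.0]]
x = [[1.0], [2.0], [3.0]]  # one input column vector

# Adapter path: base output plus the low-rank correction, computed separately.
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged path: fold B·A into W once, then do a single matmul per input.
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)

print(y_adapter)
print(y_merged)
```

Both paths give the same vector, which is the whole point: after the merge, the extra B and A multiplications disappear from inference.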

NOTE

Merge vs. dynamic adapters. Merging is the simplest path and adds zero inference overhead. If instead you want to serve many adapters over one base (e.g. a per-customer fine-tune), vLLM can load adapters dynamically with --enable-lora — see vLLM's LoRA serving docs. Start merged; reach for dynamic LoRA when you actually have multiple adapters.
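For the multi-adapter case, a rough sketch of how the launch command might be assembled from a table of adapters. The adapter names and paths are hypothetical, and so is the `lora_serve_command` helper. `--enable-lora` is the flag named above; `--lora-modules name=path` is the form shown in vLLM's LoRA docs, so check both against the version you have installed.

```python
import shlex


def lora_serve_command(base_model, adapters):
    """Build the argv for serving one base model with several LoRA adapters.

    `adapters` maps a served name (used later as the request's "model"
    field) to the adapter directory on disk.
    """
    argv = ["vllm", "serve", base_model, "--enable-lora", "--lora-modules"]
    argv += [f"{name}={path}" for name, path in adapters.items()]
    return argv


# Hypothetical per-customer adapters sharing one unmerged base model.
cmd = lora_serve_command(
    "./my-base-model",
    {"acme": "./adapters/acme", "globex": "./adapters/globex"},
)
print(shlex.join(cmd))
```

A client would then select an adapter by sending its served name (here `acme` or `globex`) in the request's `model` field.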

Step 2 — start the server (one command)

vllm serve ./my-merged-model

That's it — the server comes up at http://localhost:8000 with OpenAI-compatible routes (/v1/chat/completions, /v1/completions). Add --host / --port to change the address, or --api-key to require a key.
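Before sending prompts, it can help to confirm the server is up and see which model name it expects. The sketch below assumes the default address and the OpenAI-style `/v1/models` route; `model_ids` and `list_models` are hypothetical helper names, and the sample payload is trimmed to the fields used here.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model names from an already-parsed /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url="http://localhost:8000/v1", timeout=5):
    """Ask a running server which model names it will accept."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))


# Shape of a typical response, trimmed to the fields used here.
sample = {"object": "list", "data": [{"id": "./my-merged-model", "object": "model"}]}
print(model_ids(sample))
```

Calling `list_models()` on the GPU host should return the name to put in the `model` field of your requests. If the server was started with `--api-key`, the request also needs an `Authorization: Bearer ...` header, which this sketch leaves out.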

Step 3 — call it like OpenAI

With curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "./my-merged-model",
    "messages": [
      {"role": "system", "content": "You are our support assistant."},
      {"role": "user", "content": "How do I reset my password?"}
    ]
  }'
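The same request can be built with Python's standard library, which makes the wire format explicit: a JSON body POSTed to the chat completions route, plus a bearer token if the server requires one. `build_chat_request` is a hypothetical helper; it builds the request without sending it, so it can be inspected without a running server.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, api_key=None):
    """Build (but do not send) a POST to the chat completions route."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Needed only when the server was started with --api-key.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions", data=body, headers=headers, method="POST"
    )


req = build_chat_request(
    "http://localhost:8000/v1",
    "./my-merged-model",
    [
        {"role": "system", "content": "You are our support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
print(req.get_method(), req.full_url)
```

To send it against a live server, pass `req` to `urllib.request.urlopen` and read `["choices"][0]["message"]["content"]` from the parsed JSON reply.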

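Or with the OpenAI Python client. The only changes from real OpenAI are base_url, api_key, and the model name, which by default is the path you passed to vllm serve (or whatever you set with --served-model-name):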

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="./my-merged-model",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(resp.choices[0].message.content)

Any framework that takes an OpenAI base_url — including a RAG pipeline — now runs on your fine-tuned model.
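As a rough illustration of the RAG case, the retrieval side only has to produce a messages list; the served model receives it like any other chat request. The prompt wording, the `build_rag_messages` helper, and the sample chunks are assumptions for the sketch, standing in for whatever your retriever and prompt template actually produce.

```python
def build_rag_messages(question, chunks, system="You are our support assistant."):
    """Pack retrieved text chunks and a question into chat messages."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    user = (
        "Answer using only the context below. "
        "Say so if the answer is not there.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


# Hypothetical retrieved chunks standing in for a real retriever's output.
messages = build_rag_messages(
    "How do I reset my password?",
    ["Passwords are reset from Settings > Security.", "Reset links expire after 24 hours."],
)
print(messages[1]["content"])
```

Passing `messages` to `client.chat.completions.create(model="./my-merged-model", messages=messages)` with the client from Step 3 sends the grounded prompt to the fine-tuned model.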

The full picture

You've now done both halves of applied AI engineering, end to end:

Knowledge → retrieval: embeddings → vector DB → chunking → hybrid → reranking → evaluation.
Behavior → fine-tuning: when to fine-tune → LoRA/QLoRA → serving (here).

And they meet at the base_url: a vLLM-served, fine-tuned model behind an OpenAI API, fed fresh facts by your RAG stack. Retrieve for knowledge, fine-tune for behavior — now wired together and shippable.