Run Any Open-Weight LLM in One Command: vLLM's OpenAI-Compatible Server

Chris Harper

3 min read

Jun 29, 2026 · 04:03 UTC

Tutorial

Self-Hosting

LLM

TL;DR: Install vLLM in one command, serve any supported Hugging Face model as an OpenAI-compatible endpoint, and query it with your existing OpenAI SDK code. The only client-side change is the base_url.

What you'll be able to do after this:

Spin up a private LLM inference server on any NVIDIA or AMD GPU — local machine, cloud VM, or bare metal
Drop your existing OpenAI SDK code into a self-hosted setup by changing one line (base_url)
Understand why vLLM sustains far higher throughput than plain Hugging Face Transformers under concurrent load (the vLLM team's own launch benchmarks report up to 24x)

Key takeaways:

PagedAttention: vLLM manages the GPU KV cache the way an OS pages RAM. Each sequence's cache lives in fixed-size blocks that need not be contiguous, so almost no memory is lost to fragmentation and more requests fit at once.
Continuous batching: new requests join the inference batch at each decode step rather than waiting for the whole batch to finish. GPU stays saturated.
OpenAI-compatible API: swap base_url from https://api.openai.com/v1 to http://localhost:8000/v1 and your chat.completions.create() calls work unchanged.
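The paging idea in the first takeaway can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: the class name, method names, and the block size of 16 tokens are all assumptions chosen for illustration. It shows only the bookkeeping, where each sequence holds a block table that maps its logical blocks to whatever physical blocks happen to be free.

```python
class PagedKVCache:
    """Toy block allocator (hypothetical, for illustration only)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                  # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))    # pool of physical block ids
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> physical block ids
        self.lengths: dict[str, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; return False if the pool is exhausted."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:    # no block yet, or last block is full
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())      # any free block will do
        self.lengths[seq_id] = length + 1
        return True

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8)
for _ in range(20):
    cache.append_token("a")   # 20 tokens need 2 blocks of 16
for _ in range(5):
    cache.append_token("b")   # 5 tokens need 1 block
print(cache.block_tables)
```

Because a sequence only ever wastes the unused tail of its last block, memory that a contiguous allocator would have reserved up front stays available for other requests.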

Walk-through

1. Install (Python 3.10–3.12; NVIDIA GPU required for CUDA; AMD also supported)

uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install vllm --torch-backend=auto   # pulls the right CUDA wheels automatically

For AMD/ROCm: uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/

2. Serve a model

# 1.5B param model — ~3 GB VRAM, good for testing
vllm serve Qwen/Qwen2.5-1.5B-Instruct

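# 8B model at 4-bit quantization: ~6 GB VRAM
# --quantization awq expects a checkpoint that was already quantized with AWQ.
# The stock meta-llama/Llama-3.1-8B-Instruct weights are BF16, so point this at an
# AWQ build of the model (the repo id below is a placeholder, not a real repo).
vllm serve your-org/Llama-3.1-8B-Instruct-AWQ --quantization awq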

vLLM downloads the model from HuggingFace Hub on first run. The server starts on http://localhost:8000 and prints the available routes.
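Model download and loading can take a while, so scripts that launch the server usually need to wait before sending traffic. Here is a minimal polling sketch, assuming the server exposes a /health route that returns HTTP 200 once it is ready (check the route list your server prints). The function names and default timings are illustrative.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if a GET to the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or non-2xx response


def wait_until_ready(
    url: str = "http://localhost:8000/health",  # assumed health route
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    probe=http_ok,
    sleep=time.sleep,
) -> bool:
    """Poll until the probe succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_until_ready() right after starting vllm serve in the background returns True once the server answers, or False if it never comes up within the timeout.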

3. Query it with the OpenAI SDK (only base_url and the model name change)

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "What is PagedAttention?"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
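For interactive use, the same call can stream tokens as they are generated by passing stream=True. The sketch below assumes the server from step 2 is running with the Qwen model; the helper names collect_stream and stream_reply are invented for this example.

```python
def collect_stream(chunks, echo: bool = False) -> str:
    """Join the text deltas from an iterable of chat-completion chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:            # some chunks carry no choices at all
            continue
        text = chunk.choices[0].delta.content
        if text:                         # role-only and final chunks have no content
            parts.append(text)
            if echo:
                print(text, end="", flush=True)
    return "".join(parts)


def stream_reply(
    prompt: str,
    model: str = "Qwen/Qwen2.5-1.5B-Instruct",
    base_url: str = "http://localhost:8000/v1",
) -> str:
    """Send one user message and print the reply as it arrives."""
    from openai import OpenAI  # imported lazily so collect_stream works without the SDK

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    return collect_stream(stream, echo=True)
```

With the server up, stream_reply("What is PagedAttention?") prints the answer incrementally and returns the full text.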

List available models: curl http://localhost:8000/v1/models
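The same lookup from Python avoids hardcoding the model name in client code. This sketch assumes the response follows the OpenAI list shape, a JSON object whose "data" array holds entries with an "id" field; the function names are illustrative.

```python
import json
import urllib.request


def parse_model_ids(body) -> list[str]:
    """Extract model ids from a /v1/models response body (bytes or str)."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Fetch the ids of the models the server is currently serving."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        return parse_model_ids(resp.read())
```

Passing list_models()[0] as the model argument keeps a client working when the served model changes.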

4. Key production flags

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --dtype float16 \
  --max-num-seqs 128

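--gpu-memory-utilization 0.90: let vLLM claim up to 90% of GPU memory for weights, activations, and KV cache (this is also the default); the remaining 10% is headroom for anything else using the GPU
--max-model-len 4096: cap the context window, which lowers the KV-cache memory any single sequence can demand
--dtype float16: 16-bit weights, roughly half the VRAM of float32 (bfloat16 is also 16-bit and takes the same space)
--max-num-seqs 128: cap the number of sequences processed concurrently for more predictable latency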
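To see how --max-model-len and --max-num-seqs interact with VRAM, a back-of-envelope KV-cache estimate helps. The sketch below assumes the architecture values from the public Llama-3.1-8B config (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache; treat the numbers as rough sizing, not as what vLLM will report.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_seq_gib = per_token * 4096 / 2**30   # one sequence at --max-model-len 4096
worst_case_gib = per_seq_gib * 128       # 128 sequences, all at full length

print(per_token)        # 131072 bytes, i.e. 128 KiB per token
print(per_seq_gib)      # 0.5
print(worst_case_gib)   # 64.0
```

The worst case is far more than a single consumer GPU holds, which is why the two flags are caps rather than reservations: blocks are allocated as sequences grow, and requests wait when the cache pool is full.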

Hardware floor: ~3 GB VRAM for a 1.5B model; ~6–8 GB for an 8B model at INT4 quantization; ~18 GB for an unquantized 8B model in BF16.
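These floors follow from simple arithmetic on the weights alone: parameter count times bits per parameter. A minimal sketch, using rounded parameter counts and decimal gigabytes; the function name is illustrative.

```python
def weight_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (10**9 bytes), ignoring all overhead."""
    return num_params * bits_per_param / 8 / 1e9


print(weight_gb(1.5e9, 16))  # 3.0  -> the 1.5B model in 16-bit
print(weight_gb(8e9, 16))    # 16.0 -> an 8B model in BF16
print(weight_gb(8e9, 4))     # 4.0  -> an 8B model at 4-bit
```

The gap between these figures and the floors quoted above is the room needed for KV cache, activations, and runtime overhead, plus quantization scales in the 4-bit case.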

Where to go next: The Online Serving docs cover multi-GPU tensor parallelism (--tensor-parallel-size), LoRA adapter hot-loading, and structured output (grammar-constrained generation).

Sources: vLLM Quickstart — Official Docs · vLLM Online Serving
