
Run Any Open-Weight LLM in One Command: vLLM's OpenAI-Compatible Server
Chris Harper
Jun 29, 2026 · 04:03 UTC
TL;DR: Install vLLM in one command, serve a supported Hugging Face model as an OpenAI-compatible endpoint, and query it with your existing OpenAI SDK code. The only client-side change is base_url (plus the model name).
What you'll be able to do after this:
- Spin up a private LLM inference server on any NVIDIA or AMD GPU — local machine, cloud VM, or bare metal
- Drop your existing OpenAI SDK code into a self-hosted setup by changing one line (base_url)
- Understand why vLLM can reach up to 24x higher throughput than naive HuggingFace Transformers serving under concurrent requests (the figure reported by the vLLM authors)
Key takeaways:
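- PagedAttention: vLLM manages the GPU KV-cache the way an OS pages RAM, in fixed-size physical blocks that need not be contiguous. Waste is limited to the partly filled last block of each request, so fragmentation is close to zero and more requests fit at once.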
- Continuous batching: new requests join the inference batch at each decode step rather than waiting for the whole batch to finish. GPU stays saturated.
- OpenAI-compatible API: swap base_url from https://api.openai.com/v1 to http://localhost:8000/v1, set model to the name you served, and your chat.completions.create() calls work otherwise unchanged.
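To make the paging analogy concrete, here is a toy block table in Python. It is only a sketch of the idea, not vLLM's implementation: the class name `ToyBlockTable`, the 16-token block size, and the free-list policy are all assumptions chosen for illustration.

```python
# Toy illustration of paged KV-cache bookkeeping (NOT vLLM's actual code).
# Assumptions: fixed-size blocks of 16 tokens and a simple free list.
from math import ceil

BLOCK_SIZE = 16  # tokens per block (illustrative value)


class ToyBlockTable:
    def __init__(self, num_blocks: int) -> None:
        # Physical blocks are interchangeable, so any free block will do.
        self.free = list(range(num_blocks))
        self.tables: dict[str, list[int]] = {}  # request id -> physical block ids
        self.lengths: dict[str, int] = {}       # request id -> tokens stored

    def append_tokens(self, req: str, n: int) -> None:
        """Grow a request by n tokens, allocating blocks only when needed."""
        table = self.tables.setdefault(req, [])
        new_len = self.lengths.get(req, 0) + n
        needed = ceil(new_len / BLOCK_SIZE) - len(table)
        if needed > len(self.free):
            raise MemoryError("no free KV-cache blocks")
        for _ in range(needed):
            table.append(self.free.pop())
        self.lengths[req] = new_len

    def release(self, req: str) -> None:
        """Return a finished request's blocks to the free list."""
        self.free.extend(self.tables.pop(req))
        del self.lengths[req]

    def wasted_slots(self, req: str) -> int:
        """Unused token slots; always less than one block per request."""
        return len(self.tables[req]) * BLOCK_SIZE - self.lengths[req]


cache = ToyBlockTable(num_blocks=8)
cache.append_tokens("a", 20)   # needs 2 blocks
cache.append_tokens("b", 40)   # needs 3 blocks
cache.release("a")             # "a" finishes and hands its blocks back
cache.append_tokens("c", 70)   # needs 5 blocks, scattered across the pool
print(cache.tables["c"], cache.wasted_slots("c"))
```

Request "c" ends up on blocks that are not adjacent in the pool, yet nothing is lost beyond the unused slots in its final block. A contiguous allocator would have needed one free run of five blocks, which this pool no longer has after "b" took the middle.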
Walk-through
1. Install (Python 3.10–3.12; NVIDIA GPU required for CUDA; AMD also supported)
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install vllm --torch-backend=auto # pulls the right CUDA wheels automatically
For AMD/ROCm: uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
2. Serve a model
# 1.5B param model — ~3 GB VRAM, good for testing
vllm serve Qwen/Qwen2.5-1.5B-Instruct
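# 8B model at 4-bit quantization, ~6 GB VRAM
# --quantization awq expects a checkpoint that is already AWQ-quantized; the base
# meta-llama/Llama-3.1-8B-Instruct repo holds 16-bit weights, so point this at an AWQ export
vllm serve <awq-quantized-llama-3.1-8b-repo> --quantization awq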
vLLM downloads the model from HuggingFace Hub on first run. The server starts on http://localhost:8000 and prints the available routes.
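Downloading and loading a model can take a while, so a script that launches the server usually needs to wait before sending traffic. One possible approach, sketched below with the standard library only, is to poll the /v1/models route until it answers. The helper names `http_ok` and `wait_until_ready`, the default timeout, and the injectable `fetch` parameter are assumptions for illustration, not part of vLLM.

```python
# Sketch: block until the local vLLM server answers on /v1/models.
# `http_ok` and `wait_until_ready` are illustrative names, not vLLM APIs.
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(
    url: str = "http://localhost:8000/v1/models",
    timeout_s: float = 600.0,
    interval_s: float = 2.0,
    fetch: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until it responds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` right after starting `vllm serve` in another process returns True once the endpoint is reachable, or False if the timeout passes first.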
3. Query it with the OpenAI SDK (only base_url and the model name change)
from openai import OpenAI

# The key is a placeholder; the server only checks it if started with --api-key.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # must match the model passed to `vllm serve`
    messages=[{"role": "user", "content": "What is PagedAttention?"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
List available models: curl http://localhost:8000/v1/models
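The SDK is a convenience; underneath, the server speaks plain HTTP with JSON bodies. As a rough sketch, the same chat request can be built with the standard library alone. `build_chat_request` is a hypothetical helper introduced here for illustration, and the route used is the standard OpenAI-style /chat/completions path under the /v1 base URL.

```python
# Sketch: the same chat request, built with the standard library only.
# `build_chat_request` is an illustrative helper, not part of vLLM or the OpenAI SDK.
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST to the OpenAI-style /chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1",
    "Qwen/Qwen2.5-1.5B-Instruct",
    "What is PagedAttention?",
)

# Sending it requires the server from step 2 to be running:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

This is handy for quick health checks or for languages without an OpenAI client, since any HTTP library can produce the same request.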
4. Key production flags
vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.90 --max-model-len 4096 --dtype float16 --max-num-seqs 128
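- --gpu-memory-utilization 0.90: let vLLM claim up to 90% of GPU memory for weights, activations, and KV cache, leaving 10% headroom for other processes
- --max-model-len 4096: limit the context window to reduce peak VRAM
- --dtype float16: 16-bit weights, roughly half the VRAM of float32 (bfloat16 takes the same space as float16)
- --max-num-seqs 128: cap concurrent sequences for predictable latency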
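To see why --max-model-len and --max-num-seqs matter for memory, it helps to estimate how much KV cache one token costs. The sketch below is back-of-the-envelope arithmetic, not a vLLM API. It assumes the Llama-3.1-8B shape from the model's published config (32 layers, 8 KV heads with grouped-query attention, head dimension 128) and a 16-bit cache; the function name is hypothetical.

```python
# Back-of-the-envelope KV-cache size. Rough estimate, not a vLLM API.
# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head dim 128.

def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies (keys and values, all layers)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
per_seq_mib = per_token * 4096 / 2**20      # one sequence at --max-model-len 4096
worst_case_gib = per_seq_mib * 128 / 1024   # 128 full-length sequences at once

print(f"{per_token // 1024} KiB per token")
print(f"{per_seq_mib:.0f} MiB per 4096-token sequence")
print(f"{worst_case_gib:.0f} GiB if all 128 sequences were full length")
```

Under these assumptions a full-length sequence costs about half a gigabyte, so 128 of them at maximum length would not fit on a single consumer GPU. In practice most sequences are much shorter than the limit, and blocks are allocated as tokens arrive, which is why the two flags act as upper bounds rather than reservations.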
Hardware floor: ~3 GB VRAM for a 1.5B model; ~6–8 GB for an 8B model at INT4 quantization; ~18 GB for an unquantized 8B model in BF16 (16-bit).
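A rough way to sanity-check these figures is to multiply parameter count by bits per parameter. The sketch below does only that; `weight_gb` is a hypothetical helper, parameter counts are rounded, and KV cache, activations, and runtime overhead all come on top of the result.

```python
# Rough weight-memory estimate: parameters x bits per parameter.
# Ignores KV cache, activations, and runtime overhead, which come on top.

def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight size in decimal gigabytes."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9


for name, params, bits in [
    ("1.5B @ 16-bit", 1.5, 16),
    ("8B @ 4-bit", 8.0, 4),
    ("8B @ 16-bit", 8.0, 16),
]:
    print(f"{name}: ~{weight_gb(params, bits):.0f} GB of weights")
```

The gap between these weight-only numbers and the floors quoted above is the room needed for the KV cache and everything else the runtime allocates.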
Where to go next: The Online Serving docs cover multi-GPU tensor parallelism (--tensor-parallel-size), LoRA adapter hot-loading, and structured output (grammar-constrained generation).
Sources: vLLM Quickstart — Official Docs · vLLM Online Serving