
Run Your First Local LLM in 5 Minutes: Ollama From Install to REST API
Chris Harper
Jun 26, 2026 · 19:13 UTC
TL;DR: One command installs Ollama, one more pulls a 2GB model, and within 5 minutes you have a local LLM responding to curl: no cloud bill, no API key, and an OpenAI-compatible endpoint that existing OpenAI client code can target by changing only the base URL, API key, and model name.
What you'll be able to do after this:
- Install Ollama and run a 7B model locally on CPU, Apple Silicon, or NVIDIA GPU with a single pull command
- Query it over REST using the same request format as the OpenAI API, drop-in compatible at /v1/chat/completions
- Know which models to start with given your hardware, and what the next step in local model serving looks like
Why local models matter
Running a model locally means your data never leaves the machine, you pay nothing per token, and you can work offline. Ollama downloads pre-quantized model weights and handles memory management and GPU offloading automatically. Think of it like Docker for models: pull to download, run to use.
Install (60 seconds)
Linux / macOS:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer at ollama.com/download.
Ollama starts a local server on port 11434 automatically. NVIDIA and AMD GPUs are auto-detected if drivers are present; CPU-only mode works too (slower, fine for experiments).
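Before pulling a model, it can help to confirm that the server is actually listening. The following is a minimal sketch using only the Python standard library; the function name server_is_up and the 2-second timeout are illustrative choices, and the URL assumes the default port mentioned above.

```python
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default port named in the text


def server_is_up(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers 200 at base_url.

    Hypothetical helper: it only checks reachability, not which
    models are installed or whether a GPU is in use.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", server_is_up())
```

If this prints False right after installation, the server may not have started yet; launching the Ollama app or running ollama serve in a terminal should bring it up.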
Pull and run your first model
# 2GB — runs on almost any machine, fast enough to start
ollama pull llama3.2:3b
ollama run llama3.2:3b
For more capable models (16GB+ RAM recommended):
ollama pull llama3.1:8b # Meta 8B instruction model (~5GB)
ollama pull qwen2.5:7b # Alibaba 7B — strong code + reasoning (~4.7GB)
ollama pull gemma3:4b # Google 4B — punches above its weight (~3.3GB)
Browse all available models at ollama.com/library.
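After a few pulls it is easy to lose track of what is on disk. The sketch below lists locally installed models by reading Ollama's /api/tags endpoint, which returns a JSON object with a models array whose entries include a name and a size in bytes. The helper names parse_tags and installed_models are made up for this example, and the base URL assumes the default port.

```python
import json
import urllib.request


def parse_tags(body: str) -> list[tuple[str, float]]:
    """Turn an /api/tags JSON body into (name, size in GB) pairs."""
    models = json.loads(body).get("models", [])
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in models]


def installed_models(base_url: str = "http://localhost:11434"):
    """Fetch and parse the list of locally installed models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_tags(resp.read().decode("utf-8"))


if __name__ == "__main__":
    try:
        for name, size_gb in installed_models():
            print(f"{name}\t{size_gb} GB")
    except OSError as exc:
        # The server is probably not running; see the install step.
        print("Could not reach Ollama:", exc)
```

The same information is available from the command line with ollama list; the HTTP route is mainly useful when a script needs to check for a model before sending requests.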
Query via REST API
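Once a model has been pulled, the Ollama server accepts HTTP requests on port 11434 and loads the requested model on demand, so ollama run does not need to stay open in another terminal. The native /api/chat endpoint streams its reply as newline-delimited JSON by default; add "stream": false to the request body to get a single JSON object instead.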
# Native Ollama chat API
curl http://localhost:11434/api/chat \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "What is RAG?"}]}'

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "What is RAG?"}]}'
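The native curl call above translates to Python without any third-party packages. This is a rough sketch, not an official client: build_chat_request and chat are names invented here, the 120-second timeout is arbitrary, and the request sets "stream" to false so the reply arrives as one JSON object with the text under message.content.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "llama3.2:3b",
                       base_url: str = "http://localhost:11434"):
    """Build a POST request for Ollama's native /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        reply = json.load(resp)
    return reply["message"]["content"]
```

With the server running and the model pulled, print(chat("What is RAG?")) should print the model's answer. The first call can be slow because the model is loaded into memory on demand.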
Drop-in for OpenAI SDK
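The /v1 endpoint accepts the same chat completions request format as the OpenAI API. Point your existing Python client at Ollama by changing two constructor arguments, base_url and api_key. The SDK requires a non-empty api_key, but Ollama ignores its value: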
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Explain embeddings in one paragraph."}],
)
print(response.choices[0].message.content)
This means you can prototype locally for free and switch to a hosted model by changing base_url, api_key, and the model name; the rest of the calling code stays the same.
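One way to keep that switch out of the calling code is to select the connection settings from an environment variable. The sketch below is only an illustration: the variable names LLM_TARGET and HOSTED_MODEL, the PRESETS table, and the client_settings helper are all hypothetical, and the hosted entry is a placeholder rather than a recommendation.

```python
import os

# Hypothetical presets. The "local" values come from the example above;
# the "hosted" values are placeholders read from the environment.
PRESETS = {
    "local": {
        "base_url": "http://localhost:11434/v1",
        "api_key": "ollama",  # required by the SDK, ignored by Ollama
        "model": "llama3.2:3b",
    },
    "hosted": {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "model": os.environ.get("HOSTED_MODEL", ""),
    },
}


def client_settings(target=None) -> dict:
    """Return base_url, api_key, and model for the chosen target.

    Falls back to the LLM_TARGET environment variable, then to "local".
    """
    target = target or os.environ.get("LLM_TARGET", "local")
    if target not in PRESETS:
        raise ValueError(f"unknown target: {target!r}")
    return dict(PRESETS[target])
```

The returned values can then be passed along as OpenAI(base_url=settings["base_url"], api_key=settings["api_key"]) and model=settings["model"], so moving from local to hosted becomes a change of environment rather than of code.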
What's next in the Run & Serve track
Once you have Ollama running, the next topics are quantization and OpenRouter. Quantization covers what q4_K_M vs q8_0 means for the speed vs quality tradeoff; most models in the Ollama library are pre-quantized, and running ollama show llama3.2:3b prints the quantization level. OpenRouter covers routing between local Ollama models and hosted APIs from the same interface.
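As a preview of why the quantization level matters, the size of a model's weights scales roughly with parameters times bits per weight. The sketch below is back-of-the-envelope arithmetic only: it assumes a nominal 4 bits for a 4-bit format and 8 bits for an 8-bit format, whereas real quantization formats store some extra scaling data and come out somewhat larger. The function name approx_weights_gb is made up for this example.

```python
def approx_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in GB (10^9 bytes).

    parameters x bits per weight / 8 bits per byte. Ignores format
    overhead, the context cache, and runtime memory.
    """
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Nominal bit widths are an assumption, not exact format figures.
    for label, bits in [("4-bit", 4), ("8-bit", 8), ("16-bit", 16)]:
        size = approx_weights_gb(3, bits)
        print(f"3B model at {label}: about {size:.1f} GB")
```

By this estimate a 3B model needs about 1.5 GB at 4 bits and about 3 GB at 8 bits, which is in the same ballpark as the 2GB download quoted for llama3.2:3b once format overhead is included.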
Sources: Ollama quickstart | Ollama GitHub | Ollama model library