Run Your First Local LLM in 5 Minutes: Ollama From Install to REST API

Chris Harper

Jun 26, 2026 · 19:13 UTC

TL;DR: One command installs Ollama, one more pulls a 2GB model, and within 5 minutes you have a local LLM responding to curl: no cloud bill, no API key, and an OpenAI-compatible endpoint your existing code can use by changing only the client's base URL and model name.

What you'll be able to do after this:

Install Ollama and run a 3B model locally (or a 7B to 8B model if you have the RAM) on CPU, Apple Silicon, or an NVIDIA or AMD GPU with a single pull command
Query it over REST using the same request format as the OpenAI API — drop-in compatible at /v1/chat/completions
Know which models to start with given your hardware, and what the next step in local model serving looks like

Why local models matter

Running a model locally means your data never leaves the machine, you pay nothing per token, and you can work offline. Ollama serves pre-quantized models and handles memory management and GPU routing automatically. Think of it like Docker for models: pull to download, run to use.

Install (60 seconds)

Linux / macOS:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer at ollama.com/download.

Ollama starts a local server on port 11434 automatically. NVIDIA and AMD GPUs are auto-detected if drivers are present; CPU-only mode works too (slower, fine for experiments).
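
Before pulling anything, it can help to confirm the server is actually listening. The snippet below is a minimal sketch using only the Python standard library; the helper name is_server_up is made up for this example, and it assumes the default address http://localhost:11434 mentioned above.

```python
import urllib.error
import urllib.request


def is_server_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with status 200.

    is_server_up is a hypothetical helper for this tutorial, not part of Ollama.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", is_server_up())
```

If this prints False right after installing, the server process probably has not started yet; starting it manually with ollama serve is the usual fix.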

Pull and run your first model

# 2GB — runs on almost any machine, fast enough to start
ollama pull llama3.2:3b
ollama run llama3.2:3b

For more capable models (16GB+ RAM recommended):

ollama pull llama3.1:8b      # Meta 8B instruction model (~5GB)
ollama pull qwen2.5:7b       # Alibaba 7B — strong code + reasoning (~4.7GB)
ollama pull gemma3:4b        # Google 4B — punches above its weight (~3.3GB)

Browse all available models at ollama.com/library.
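
To check what has been pulled so far without leaving your code, the server exposes a listing endpoint at /api/tags. The following is a rough sketch, assuming the response is a JSON object with a models array whose entries carry name and size (in bytes) fields; summarize_models and fetch_local_models are names invented for this example.

```python
import json
import urllib.request
from typing import List, Tuple


def summarize_models(payload: dict) -> List[Tuple[str, float]]:
    """Turn an /api/tags-style payload into (name, size in GB) pairs."""
    return [
        (m["name"], round(m["size"] / 1e9, 1))
        for m in payload.get("models", [])
    ]


def fetch_local_models(base_url: str = "http://localhost:11434") -> List[Tuple[str, float]]:
    """Ask a running Ollama server which models are stored locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return summarize_models(json.load(resp))
```

Calling fetch_local_models() after the pulls above should list each model next to its on-disk size, which is a quick way to see how much space the larger models take.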

Query via REST API

Once a model is pulled, Ollama's server answers HTTP requests for it on port 11434, loading the model on demand:

# Native Ollama chat API ("stream": false returns one JSON object; the default is a stream of chunks)
curl http://localhost:11434/api/chat \
  -d '{"model": "llama3.2:3b", "stream": false, "messages": [{"role": "user", "content": "What is RAG?"}]}'

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "What is RAG?"}]}'
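
The native /api/chat endpoint streams newline-delimited JSON chunks unless the request sets "stream": false. Here is a small sketch of reassembling that stream in Python with the standard library only. It assumes each chunk carries its text under message.content and that the last chunk has done set to true; collect_stream and chat are hypothetical helper names.

```python
import json
import urllib.request
from typing import Iterable, Union


def collect_stream(lines: Iterable[Union[bytes, str]]) -> str:
    """Join the message.content pieces from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def chat(prompt: str, model: str = "llama3.2:3b",
         base_url: str = "http://localhost:11434") -> str:
    """POST a single user message to /api/chat and return the full reply."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return collect_stream(resp)  # the response object iterates line by line
```

With the server running, chat("What is RAG?") should return the same answer the curl call produces, as one string.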

Drop-in for OpenAI SDK

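The /v1 endpoint mirrors the OpenAI chat completions API. Point your existing Python client at Ollama by changing two constructor arguments, base_url and api_key (the SDK requires a key, but Ollama ignores its value), and set model to one you have pulled: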

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Explain embeddings in one paragraph."}]
)
print(response.choices[0].message.content)

This means you can prototype locally for free and switch to a hosted model by changing base_url, api_key, and the model name. The rest of the calling code stays the same.
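
One way to make that switch painless is to read the connection settings from the environment instead of hard-coding them. This is only a sketch: the variable names LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL are invented for this example, and model is treated as a third setting because hosted providers generally use different model names than Ollama tags.

```python
import os
from typing import Dict, Mapping


def client_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return connection settings, defaulting to the local Ollama server.

    The LLM_* variable names are hypothetical; pick whatever fits your project.
    """
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": env.get("LLM_API_KEY", "ollama"),
        "model": env.get("LLM_MODEL", "llama3.2:3b"),
    }


# Usage with the OpenAI SDK (requires the openai package):
#   from openai import OpenAI
#   cfg = client_settings()
#   client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
#   client.chat.completions.create(model=cfg["model"], messages=[...])
```

With nothing set, the code talks to Ollama; exporting the three variables points the same code at a hosted endpoint.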

What's next in the Run & Serve track

Once you have Ollama running, the next topics are quantization and OpenRouter. Quantization covers what q4_K_M versus q8_0 means for the speed versus quality tradeoff; most models you get from ollama pull are pre-quantized, and ollama show llama3.2:3b prints the quantization level. OpenRouter covers routing between local Ollama models and hosted APIs from the same interface.
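
As a preview of why the quantization level matters, download size scales roughly with parameter count times bits stored per weight. The sketch below is back-of-the-envelope arithmetic, not how Ollama computes anything: the 8.5 figure for q8_0 follows from its block layout (32 eight-bit weights plus one 16-bit scale), while the q4_K_M figure is an approximate average assumed here and varies by model.

```python
# Approximate bits stored per weight for each format.
# q8_0: (32 * 8 + 16) / 32 = 8.5 bits. q4_K_M: rough average, an assumption.
BITS_PER_WEIGHT = {"q4_K_M": 4.85, "q8_0": 8.5, "f16": 16.0}


def estimate_size_gb(params_billion: float, quant: str) -> float:
    """Estimate model file size in GB from parameter count and quant format."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"8B model at {quant}: ~{estimate_size_gb(8, quant):.1f} GB")
```

For a model with about 3.2 billion parameters at q4_K_M this gives a little under 2 GB, in line with the ~2GB quoted for llama3.2:3b above.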

Sources: Ollama quickstart | Ollama GitHub | Ollama model library