Read the Filename: Which GGUF Quantization Level to Pull for Your Hardware

Chris Harper

Jun 27, 2026 · 04:12 UTC

TL;DR: GGUF quantization compresses model weights from 16- or 32-bit floats down to roughly 2–8 bits each. Q4_K_M is the default pick for 6–12 GB of VRAM and retains ~94% of unquantized quality.

What you'll be able to do after this:

Decode any GGUF filename (Q4_K_M, Q5_K_M, Q8_0, F16) and know exactly what you're downloading
Match quantization level to your available VRAM without trial-and-error
Pull a specific GGUF quant in Ollama and confirm what you got

Why quantization exists

Stored as 32-bit floats, a 7B-parameter model needs 7 billion × 4 bytes = 28 GB for its weights alone. Even at 16 bits, the precision most checkpoints are released in, that is still 14 GB, more than most consumer GPUs have. Quantization compresses each weight to fewer bits:

Format	Nominal bits/weight	7B size	13B size	Quality retained
F16	16	14 GB	26 GB	100%
Q8_0	8	7.2 GB	13.5 GB	~99%
Q6_K	6	5.5 GB	10.3 GB	~97%
Q5_K_M	5	4.8 GB	8.6 GB	~96%
Q4_K_M	4	4.1 GB	7.9 GB	~94%
Q3_K_M	3	3.2 GB	6.0 GB	~87%
Q2_K	2	2.7 GB	5.2 GB	~75%

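The big quality cliff is between Q3 and Q4. Above Q4, returns diminish fast: for a 7B model, Q8_0 needs about 1.75× the memory of Q4_K_M (7.2 GB vs 4.1 GB) for roughly 5 points of retained quality. File sizes run a little above what the nominal bit count predicts because each block of weights also stores its own scale metadata.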
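The size arithmetic above can be sketched in a few lines. This is a minimal illustration, not a loader: the function names are made up for this example, 1 GB is taken as 10^9 bytes, and only the weights are counted (no KV cache or runtime overhead). The second helper works backwards from a file size in the table to the bits per weight it implies.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Size of the weights alone in GB, assuming 1 GB = 1e9 bytes."""
    return params_billions * bits_per_weight / 8


def implied_bits_per_weight(params_billions: float, size_gb: float) -> float:
    """Effective bits per weight implied by a given file size."""
    return size_gb * 8 / params_billions


# 7B at 32-bit and 16-bit floats, as in the text
print(weights_size_gb(7, 32))  # 28.0
print(weights_size_gb(7, 16))  # 14.0

# The table's 4.1 GB for a 7B Q4_K_M file implies a bit more than 4 bits/weight
print(round(implied_bits_per_weight(7, 4.1), 2))
```

Running the second helper over the table's rows shows the listed sizes sit slightly above what the bits/weight column alone would give, which is expected once per-block metadata is included.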

Decoding the name

Q4_K_M breaks down as:

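Q4: 4 bits per weight (nominal)
_K: K-quant method. Weights are packed into super-blocks, and each smaller block inside has its own scale, so the quantized values track the originals more closely than with a single scale per block
_M: Medium mix (_S = Small, _L = Large). The letter sets how many tensors are kept at a higher bit width than the base type; Medium is the standard recommendation

Q8_0 breaks down as:

Q8: 8 bits per weight
_0: the original, simpler scheme. Weights are split into fixed-size blocks that each store a single scale and no offset; works fine, just less nuanced than K-quant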
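The naming scheme above can be sketched as a small parser. This is an illustrative helper, not part of any GGUF tooling: `Quant` and `decode_quant` are hypothetical names, and the pattern only covers the tags discussed in this article (Q-types plus F16/BF16/F32). Other families, such as the IQ quants, are deliberately rejected.

```python
import re
from typing import NamedTuple, Optional


class Quant(NamedTuple):
    bits: int               # nominal bits per weight
    scheme: str             # "K" for K-quant, "0"/"1" for the older schemes, "float" if unquantized
    variant: Optional[str]  # "S", "M", "L", or None when the tag has no variant letter


_PATTERN = re.compile(r"^Q(\d+)_(K|\d)(?:_([SML]))?$")
_FLOATS = {"F16": 16, "BF16": 16, "F32": 32}


def decode_quant(tag: str) -> Quant:
    """Split a quant tag such as 'Q4_K_M' into its parts."""
    tag = tag.upper()
    if tag in _FLOATS:
        return Quant(_FLOATS[tag], "float", None)
    match = _PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized quant tag: {tag!r}")
    bits, scheme, variant = match.groups()
    return Quant(int(bits), scheme, variant)


print(decode_quant("Q4_K_M"))
print(decode_quant("q8_0"))
```

Upper-casing the input lets the same function handle the lowercase form that appears in some model tags.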

Pick your level by VRAM

Available VRAM	Recommended
< 6 GB	Q3_K_M
6–12 GB	Q4_K_M ← default choice
12–16 GB	Q5_K_M or Q6_K
16–24 GB	Q8_0
24 GB+	F16 (unquantized, 16-bit)

No GPU? For CPU-only inference, size the model against system RAM instead of VRAM. Q4_K_M is still the default: lower quants generate faster, but the quality loss at Q2 and Q3 becomes noticeable in practice.
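The VRAM table can be written as a simple lookup. This is a sketch under two assumptions that are not in the table itself: the ranges are treated as half-open (so exactly 12 GB falls into the 12–16 GB row), and for the row that lists two options the function returns the smaller one. `recommend_quant` is a hypothetical name.

```python
def recommend_quant(vram_gb: float) -> str:
    """Map available VRAM (in GB) to the recommendation from the table."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    if vram_gb < 6:
        return "Q3_K_M"
    if vram_gb < 12:
        return "Q4_K_M"  # the default choice
    if vram_gb < 16:
        return "Q5_K_M"  # Q6_K is the other option in this row
    if vram_gb < 24:
        return "Q8_0"
    return "F16"


for vram in (4, 8, 12, 16, 24):
    print(vram, "GB ->", recommend_quant(vram))
```

For CPU-only machines, the same lookup could be fed system RAM instead, keeping in mind the note above about Q2 and Q3 quality.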

Pull a specific quant in Ollama

# Default pull: fetches the model's default tag (usually a 4-bit quant such as Q4_K_M)
ollama pull llama3.2

# Pull a specific quant (Llama 3.2 text models come in 1B and 3B sizes; check the tags page for exact names)
ollama pull llama3.2:3b-instruct-q4_K_M
ollama pull llama3.2:3b-instruct-q8_0

# See what you have
ollama list

# Confirm which quant is running
ollama show llama3.2:3b-instruct-q4_K_M
# ...
# quantization  Q4_K_M

Browse tags at ollama.com/library/llama3.2. The format is model:size-variant-QUANT.
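As a rough illustration of the model:size-variant-QUANT format, the sketch below splits a reference into its parts and pulls the quantization field out of captured `ollama show` text. Both helpers are hypothetical and work on plain strings only; they do not call Ollama. The splitter assumes the tag follows the three-part format, so shorthand tags like `latest` will not map cleanly, and the sample output is abbreviated for the example.

```python
from typing import NamedTuple, Optional


class OllamaTag(NamedTuple):
    model: str
    size: Optional[str]
    variant: Optional[str]
    quant: Optional[str]


def split_tag(ref: str) -> OllamaTag:
    """Split 'model:size-variant-QUANT' into its four parts."""
    model, _, tag = ref.partition(":")
    parts = tag.split("-", 2) if tag else []
    parts += [None] * (3 - len(parts))  # pad missing pieces with None
    size, variant, quant = parts
    return OllamaTag(model, size, variant, quant)


def quant_from_show(output: str) -> Optional[str]:
    """Return the value on the 'quantization' line of `ollama show` text, if any."""
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "quantization":
            return fields[1]
    return None


print(split_tag("llama3.2:3b-instruct-q4_K_M"))

sample = "  architecture    llama\n  quantization    Q4_K_M\n"  # abbreviated sample text
print(quant_from_show(sample))
```

Comparing the tag's quant against the reported quantization, case-insensitively, is one way to confirm that what was pulled matches what was asked for.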

Pull GGUF models from Hugging Face directly

For models not yet in the Ollama library:

ollama pull hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

Ollama supports the hf.co/ prefix natively; the part after the colon selects the quant. To find models, browse Hugging Face with the GGUF library filter and search for the model name you want.
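The reference used in the command above follows the pattern hf.co/user/repo:QUANT. A small sketch of assembling it, for scripting several pulls, is shown here. `hf_ref` and `pull_command` are hypothetical helpers; they only build strings and argument lists and do not check that the repository or quant exists.

```python
from typing import List


def hf_ref(user: str, repo: str, quant: str) -> str:
    """Build an 'hf.co/user/repo:QUANT' reference for Ollama."""
    for name, value in (("user", user), ("repo", repo), ("quant", quant)):
        if not value or "/" in value or ":" in value:
            raise ValueError(f"invalid {name}: {value!r}")
    return f"hf.co/{user}/{repo}:{quant}"


def pull_command(ref: str) -> List[str]:
    """Argument list for `ollama pull`, suitable for subprocess.run."""
    return ["ollama", "pull", ref]


ref = hf_ref("bartowski", "Llama-3.2-3B-Instruct-GGUF", "Q4_K_M")
print(" ".join(pull_command(ref)))
```

Building the command as a list rather than a single string avoids shell-quoting problems if it is later passed to `subprocess.run`.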