
Read the Filename: Which GGUF Quantization Level to Pull for Your Hardware
Chris Harper
Jun 27, 2026 · 04:12 UTC
TL;DR: GGUF quantization compresses model weights from 16- or 32-bit floats down to roughly 2–8 bits each. Q4_K_M is the default pick for 6–12 GB of VRAM and retains ~94% of full-precision quality.
What you'll be able to do after this:
- Decode any GGUF filename (Q4_K_M, Q5_K_M, Q8_0, F16) and know exactly what you're downloading
- Match quantization level to your available VRAM without trial-and-error
- Pull a specific GGUF quant in Ollama and confirm what you got
Why quantization exists
A 7B-parameter model stored as 32-bit floats takes 7 billion × 4 bytes = 28 GB. Even at the 16 bits per weight that most models are released in, it takes 14 GB. Most consumer GPUs don't have that much VRAM. Quantization compresses each weight to fewer bits:
| Format | Nominal bits/weight | 7B size | 13B size | Quality retained (approx.) |
|---|---|---|---|---|
| F16 | 16 | 14 GB | 26 GB | 100% |
| Q8_0 | 8 | 7.2 GB | 13.5 GB | ~99% |
| Q6_K | 6 | 5.5 GB | 10.3 GB | ~97% |
| Q5_K_M | 5 | 4.8 GB | 8.6 GB | ~96% |
| Q4_K_M | 4 | 4.1 GB | 7.9 GB | ~94% |
| Q3_K_M | 3 | 3.2 GB | 6.0 GB | ~87% |
| Q2_K | 2 | 2.7 GB | 5.2 GB | ~75% |
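The big quality cliff sits below Q4: in the table, dropping from Q4_K_M to Q3_K_M costs about 7 points, and Q2_K loses another 12. Above Q4, returns diminish fast. Q8_0 needs about 1.75× the VRAM of Q4_K_M (7.2 GB vs 4.1 GB at 7B) for a quality difference of roughly 5 points on most tasks.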
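The size arithmetic above can be sketched as a small shell function. It is a rough estimate only, assuming file size is simply parameters × bits ÷ 8; the helper name `estimate_gb` is made up for illustration. Real K-quant files come out larger than the nominal figure (the table lists 4.1 GB for a 7B Q4_K_M against a nominal 3.5 GB), largely because they also store per-block scales and keep some tensors at higher precision.

```shell
# Rough weight size in GB: parameters (in billions) x bits per weight / 8.
# Ignores file metadata, the KV cache, and runtime overhead.
estimate_gb() {
  # $1 = parameter count in billions, $2 = bits per weight
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 32   # 32-bit floats: prints 28.0
estimate_gb 7 16   # F16: prints 14.0
estimate_gb 7 4    # nominal 4-bit: prints 3.5
```

Treat the result as a lower bound and leave headroom for context on top of it.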
Decoding the name
Q4_K_M breaks down as:
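- Q4: nominally 4 bits per weight
- _K: K-quant method. Weights are quantized in small blocks grouped into super-blocks, each block with its own scale, which tracks the weights more closely than the older fixed-block schemes
- _M: Medium variant (_S = Small, _L = Large). The letter controls how many tensors are kept at a higher-precision quant type; Medium is the standard recommendation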
Q8_0:
- Q8: 8 bits per weight
- _0: the original, simpler scheme. Each block of 32 weights shares a single scale factor with no offset; it works fine, just less nuanced than K-quant
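The naming rules above can be turned into a small decoder. This is a minimal sketch that only covers the tags discussed in this article; the function name `decode_quant` and its output format are invented for illustration, and other tags that exist in the wild are reported as unknown rather than guessed at.

```shell
# Decode a GGUF quant tag such as Q4_K_M, Q8_0, or F16 into its parts.
decode_quant() {
  # Normalize case so Ollama-style tags such as q4_K_M also work.
  tag=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case $tag in
    F16|F32)
      echo "bits=${tag#F} method=float variant=none"
      return 0 ;;
    Q[0-9]*) ;;                 # quantized tag, decoded below
    *)
      echo "unrecognized tag: $1" >&2
      return 1 ;;
  esac

  rest=${tag#Q}       # "4_K_M"
  bits=${rest%%_*}    # "4"
  suffix=${rest#*_}   # "K_M", "K", or "0"

  case $suffix in
    K_S) method=k-quant variant=small ;;
    K_M) method=k-quant variant=medium ;;
    K_L) method=k-quant variant=large ;;
    K)   method=k-quant variant=none ;;
    0|1) method=legacy  variant=none ;;
    *)   method=unknown variant=unknown ;;
  esac
  echo "bits=$bits method=$method variant=$variant"
}

decode_quant Q4_K_M   # prints bits=4 method=k-quant variant=medium
decode_quant Q8_0     # prints bits=8 method=legacy variant=none
```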
Pick your level by VRAM
| Available VRAM | Recommended |
|---|---|
| < 6 GB | Q3_K_M |
| 6–12 GB | Q4_K_M ← default choice |
| 12–16 GB | Q5_K_M or Q6_K |
| 16–24 GB | Q8_0 |
| 24 GB+ | F16 (unquantized, 16-bit) |
No GPU? For CPU-only inference, Q4_K_M is still the default: size the model against system RAM instead of VRAM. Lower quants generate faster, but the quality degradation at Q2/Q3 becomes noticeable in practice.
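The VRAM table can be encoded as a lookup. This sketch makes two assumptions that are not in the table itself: a value that sits exactly on a boundary (12, 16, 24 GB) goes to the higher tier, and the 12–16 GB tier returns Q5_K_M, the smaller of the two options listed. The function name `recommend_quant` is hypothetical.

```shell
# Map available VRAM (whole GB) to the quant level from the table.
recommend_quant() {
  vram=$1
  if   [ "$vram" -lt 6 ];  then echo Q3_K_M
  elif [ "$vram" -lt 12 ]; then echo Q4_K_M
  elif [ "$vram" -lt 16 ]; then echo Q5_K_M   # Q6_K is also listed for this tier
  elif [ "$vram" -lt 24 ]; then echo Q8_0
  else                          echo F16
  fi
}

recommend_quant 8    # prints Q4_K_M
recommend_quant 24   # prints F16

# On an NVIDIA system, total VRAM in MiB could be read with:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits
```

For CPU-only machines, passing system RAM in GB instead of VRAM follows the same logic.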
Pull a specific quant in Ollama
# Default pull: fetches the library's default tag (a fixed tag, typically Q4_K_M; Ollama does not pick a quant based on your hardware)
ollama pull llama3.2
# Pull a specific quant (Llama 3.2 text models ship in 1B and 3B sizes)
ollama pull llama3.2:3b-instruct-q4_K_M
ollama pull llama3.2:3b-instruct-q8_0
# See what you have
ollama list
# Confirm which quant is running
ollama show llama3.2:3b-instruct-q4_K_M
# ...
# quantization Q4_K_M
Browse tags at ollama.com/library/llama3.2. The format is model:size-variant-QUANT.
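To check the quant in a script rather than by eye, the `ollama show` output can be filtered. This is a sketch that assumes the output contains a line whose first field is `quantization`, as in the listing above; the exact layout may differ between Ollama versions, and the helper name `quant_from_show` is made up.

```shell
# Read `ollama show` output on stdin and print the quantization value.
# Exits non-zero if no "quantization" line is found.
quant_from_show() {
  awk '$1 == "quantization" { print $2; found = 1; exit }
       END { if (!found) exit 1 }'
}

# Demo on sample text shaped like the listing above:
printf '  architecture  llama\n  quantization  Q4_K_M\n' | quant_from_show   # prints Q4_K_M

# With a local Ollama install and the model already pulled:
#   ollama show llama3.2 | quant_from_show
```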
Pull GGUF models from Hugging Face directly
For models not yet in the Ollama library:
ollama pull hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
Ollama supports the hf.co/ scheme natively. Browse GGUF models on Hugging Face — filter by library:gguf and the model name you want.
Sources: GGUF Quantization Explained — vucense.com | GGUF — Hugging Face Transformers docs | Ollama model library | Demystifying LLM Quantization Suffixes — Medium