Read the Filename: Which GGUF Quantization Level to Pull for Your Hardware

Chris Harper

3 min read

Jun 27, 2026 · 04:12 UTC

AI
Tutorial
Self-Hosting
LLM

TL;DR: GGUF quantization compresses model weights from 16- or 32-bit floats down to roughly 2 to 8 bits each. Q4_K_M is the default pick for 6–12 GB of VRAM and retains ~94% of full-precision quality.

What you'll be able to do after this:

  • Decode any GGUF filename (Q4_K_M, Q5_K_M, Q8_0, F16) and know exactly what you're downloading
  • Match quantization level to your available VRAM without trial-and-error
  • Pull a specific GGUF quant in Ollama and confirm what you got

Why quantization exists

A 7B-parameter model stored as 32-bit floats needs 7 billion × 4 bytes = 28 GB, and even the 16-bit format most weights are released in needs 14 GB. Many consumer GPUs don't have that much VRAM. Quantization stores each weight in fewer bits:

Format    Bits/weight   7B size   13B size   Quality retained
F16       16            14 GB     26 GB      100%
Q8_0      8             7.2 GB    13.5 GB    ~99%
Q6_K      6             5.5 GB    10.3 GB    ~97%
Q5_K_M    5             4.8 GB    8.6 GB     ~96%
Q4_K_M    4             4.1 GB    7.9 GB     ~94%
Q3_K_M    3             3.2 GB    6.0 GB     ~87%
Q2_K      2             2.7 GB    5.2 GB     ~75%

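Quality falls off fastest below Q4: going from Q4_K_M to Q3_K_M costs about 7 points in the table, more than the whole step from Q8_0 down to Q4_K_M. Above Q4, returns diminish fast: Q8_0 takes roughly 1.7 to 1.8 times the VRAM of Q4_K_M for about 5 points of retained quality.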
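
The size arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a tool from any library: `nominal_size_gb` is a hypothetical helper, and it computes only the lower bound implied by the bits-per-weight figure.

```python
def nominal_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Lower-bound size in GB for a model with the given nominal bit width.

    params_billions * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    simplifies to params_billions * bits / 8.
    """
    return params_billions * bits_per_weight / 8


# Nominal sizes for a 7B model at a few of the bit widths discussed here.
for name, bits in [("F32", 32), ("F16", 16), ("Q8_0", 8), ("Q4_K_M", 4)]:
    print(f"{name:7s} 7B ~ {nominal_size_gb(7, bits):.1f} GB")
```

For a 7B model this gives 28.0, 14.0, 7.0 and 3.5 GB. The real files in the table are somewhat larger (4.1 GB for Q4_K_M, for instance) because quantized formats also store per-block scaling data and keep some tensors at a higher bit width.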

Decoding the name

Q4_K_M breaks down as:

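  • Q4: 4 bits per weight (nominal)
  • _K: K-quant method. Weights are quantized in small blocks grouped into super-blocks, and each block gets its own scale, which follows the weight distribution more closely than one scale shared across a large span of weights
  • _M: Medium variant (_S = Small, _L = Large). The letter sets how many of the more sensitive tensors are kept at a higher bit width; Medium is the standard recommendation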

Q8_0:

  • Q8: 8 bits per weight
  • _0: the older, simpler scheme. Each block of 32 weights shares a single scale with no offset (the _1 formats add an offset); it works fine, just with less fine-grained scaling than the K-quants
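
The naming rules above can be turned into a small decoder. The sketch below is illustrative only: `quant_from_filename` and `decode_quant` are hypothetical helpers, the example filename is made up, and the code assumes the common `<model>-<QUANT>.gguf` or `<model>.<QUANT>.gguf` naming. It covers the Q-style and float tags discussed in this article and nothing else.

```python
import re

# Unquantized float formats and their bit widths.
FLOAT_TAGS = {"F32": 32, "F16": 16, "BF16": 16}


def quant_from_filename(filename: str) -> str:
    """Return the last '-' or '.' separated token before the .gguf extension."""
    stem = filename[:-5] if filename.lower().endswith(".gguf") else filename
    return re.split(r"[-.]", stem)[-1]


def decode_quant(tag: str) -> dict:
    """Split a tag such as Q4_K_M into bits, scheme and variant."""
    tag = tag.upper()
    if tag in FLOAT_TAGS:
        return {"bits": FLOAT_TAGS[tag], "scheme": "float", "variant": None}
    match = re.fullmatch(r"Q(\d)_(K|0|1)(?:_([SML]))?", tag)
    # Size variants (_S/_M/_L) only make sense on K-quants.
    if match is None or (match.group(2) != "K" and match.group(3)):
        raise ValueError(f"unrecognised quantization tag: {tag}")
    bits, scheme, variant = match.groups()
    return {
        "bits": int(bits),
        "scheme": "k-quant" if scheme == "K" else "legacy",
        "variant": variant,
    }


# Hypothetical filename, used only to exercise the two helpers.
print(decode_quant(quant_from_filename("Llama-3.2-3B-Instruct-Q4_K_M.gguf")))
```

Tags outside this pattern raise a `ValueError` rather than being guessed at, which is usually the safer behaviour when a filename decides a multi-gigabyte download.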

Pick your level by VRAM

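Available VRAM   Recommended
< 6 GB           Q3_K_M
6–12 GB          Q4_K_M ← default choice
12–16 GB         Q5_K_M or Q6_K
16–24 GB         Q8_0
24 GB+           F16 (unquantized, 16-bit)

These tiers roughly fit the 7B and 13B sizes listed earlier. For other model sizes, compare the file size with your VRAM and leave headroom for the context (KV cache).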

No GPU? For CPU-only inference, Q4_K_M is still the sensible default, with system RAM taking the place of VRAM. Lower quants generate faster, but the quality loss at Q2/Q3 is noticeable in practice.
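
The VRAM tiers can be written as a simple lookup. This is a sketch of this article's rule of thumb, not a measurement: `recommend_quant` is a hypothetical helper, and treating each tier's lower bound as inclusive (so exactly 12 GB lands in the 12–16 GB tier) is an assumption.

```python
# (lower bound in GB, recommendation), highest tier first.
# Mirrors this article's VRAM tiers; lower bounds are treated as inclusive.
VRAM_TIERS = [
    (24, "F16"),
    (16, "Q8_0"),
    (12, "Q5_K_M or Q6_K"),
    (6, "Q4_K_M"),
    (0, "Q3_K_M"),
]


def recommend_quant(vram_gb: float) -> str:
    """Return the recommended quantization level for the given VRAM."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    for lower_bound, quant in VRAM_TIERS:
        if vram_gb >= lower_bound:
            return quant
    raise AssertionError("unreachable: the 0 GB tier matches any valid input")


for vram in (4, 8, 12, 24):
    print(f"{vram:>2} GB -> {recommend_quant(vram)}")
```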

Pull a specific quant in Ollama

# Default pull (the bare name resolves to the library's default tag, typically a 4-bit quant such as Q4_K_M)
ollama pull llama3.2

# Pull a specific quant (Llama 3.2 text models ship in 1b and 3b sizes)
ollama pull llama3.2:3b-instruct-q4_K_M
ollama pull llama3.2:3b-instruct-q8_0

# See what you have
ollama list

# Confirm which quant you pulled
ollama show llama3.2:3b-instruct-q4_K_M
# ...
# quantization  Q4_K_M

Browse tags at ollama.com/library/llama3.2. The format is model:size-variant-QUANT.
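
The model:size-variant-QUANT pattern can be split apart mechanically. The sketch below assumes a fully qualified tag with exactly three hyphen-separated parts after the colon; `OllamaTag` and `parse_ollama_tag` are hypothetical names, not part of Ollama, and shorter tags such as a bare size are rejected rather than guessed at.

```python
from typing import NamedTuple


class OllamaTag(NamedTuple):
    model: str
    size: str
    variant: str
    quant: str


def parse_ollama_tag(reference: str) -> OllamaTag:
    """Split 'model:size-variant-quant' into its four parts."""
    model, sep, tag = reference.partition(":")
    parts = tag.split("-")
    if not sep or len(parts) != 3:
        raise ValueError(f"expected model:size-variant-quant, got {reference!r}")
    size, variant, quant = parts
    # Upper-case the quant so it matches the GGUF spelling (Q4_K_M).
    return OllamaTag(model, size, variant, quant.upper())


# Example tag used for illustration.
print(parse_ollama_tag("llama3.2:3b-instruct-q4_K_M"))
```

The parsed `quant` field comes back in the same spelling used in GGUF filenames, so it can be compared directly with the quantization value that `ollama show` reports.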

Pull GGUF models from Hugging Face directly

For models not yet in the Ollama library:

ollama pull hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M

Ollama accepts hf.co/ references directly; the part after the colon selects the quantization. To find models, filter the Hugging Face model hub by the GGUF library and search for the model name you want.

Sources: GGUF Quantization Explained — vucense.com | GGUF — Hugging Face Transformers docs | Ollama model library | Demystifying LLM Quantization Suffixes — Medium