
FAISS From Scratch: Three Index Types Every AI Engineer Should Know
Chris Harper
3 min read
Jul 4, 2026 · 20:05 UTC
FAISS is Meta's in-process vector search library: IndexFlatL2 for exact results, IndexIVFFlat for roughly 4–10× faster approximate search, and IndexIVFPQ for memory-compressed, million-scale retrieval.
What you'll be able to do after this:
- Build a FAISS index from scratch, add embeddings, and run similarity queries in under 20 lines of Python
- Choose between exact (Flat), partitioned (IVF), and compressed (PQ) indexes based on dataset size and latency budget
- Tune nlist and nprobe to trade recall for speed without a single extra dependency
If you're building semantic search, RAG retrieval, deduplication, or recommendations — you'll eventually need to pick a vector index. FAISS (Facebook AI Similarity Search) is the workhorse behind many production ML systems and runs entirely in-process: no server, no database, no network call.
Install
pip install faiss-cpu # CPU (works anywhere)
# pip install faiss-gpu # if you have CUDA
FAISS expects float32 NumPy arrays. If you're using sentence-transformers or OpenAI embeddings, cast with .astype('float32') before adding.
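The casting step can be sketched as a small helper. This is a minimal illustration, not part of FAISS: the name to_faiss_array is hypothetical, and the random array merely stands in for real model output. It also reshapes a single query vector into a one-row batch, since search takes a 2-D array of queries.

```python
import numpy as np

def to_faiss_array(vectors):
    """Return a C-contiguous float32 array of shape (n, d).

    Hypothetical helper for illustration; not a FAISS function.
    """
    arr = np.ascontiguousarray(vectors, dtype=np.float32)
    if arr.ndim == 1:
        arr = arr.reshape(1, -1)  # a single vector becomes a one-row batch
    if arr.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D input, got shape {arr.shape}")
    return arr

# Stand-in for embedding-model output (NumPy defaults to float64 here)
raw = np.random.default_rng(0).random((4, 768))

sentence_embeddings = to_faiss_array(raw)   # shape (4, 768), float32
query_embedding = to_faiss_array(raw[0])    # shape (1, 768), float32
```

The names sentence_embeddings and query_embedding match the variables used in the snippets that follow.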
Index 1: IndexFlatL2 — exact search
Computes the L2 (Euclidean) distance from each query to every stored vector (FAISS reports the squared distance). Results are exact, and brute force stays fast up to roughly 100K vectors.
import faiss
import numpy as np
d = 768 # embedding dimension (match your model)
index = faiss.IndexFlatL2(d)
index.add(sentence_embeddings) # shape: (N, d), float32
D, I = index.search(query_embedding, k=5) # query shape: (n_queries, d); D = squared L2 distances, I = row indices
Use IndexFlatL2 when you need exact results (benchmarks, evaluation) or your corpus is small.
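To make "exact search" concrete, here is a rough NumPy-only sketch of the same computation: squared L2 distance from each query to every stored vector, then the k smallest. The function name flat_l2_search and the random data are assumptions for illustration. FAISS does this with far better optimized kernels, so treat the sketch as a mental model rather than a replacement.

```python
import numpy as np

def flat_l2_search(xb, xq, k):
    """Exhaustive k-NN over xb for each row of xq, using squared L2 distance."""
    # diff[q, i, :] = xq[q] - xb[i]
    diff = xq[:, None, :] - xb[None, :, :]
    # Sum of squared differences along the embedding dimension
    dist = np.einsum("qid,qid->qi", diff, diff)
    # Indices of the k smallest distances per query, nearest first
    idx = np.argsort(dist, axis=1)[:, :k]
    return np.take_along_axis(dist, idx, axis=1), idx

rng = np.random.default_rng(0)
xb = rng.random((1000, 64), dtype=np.float32)  # stand-in corpus
xq = xb[:3] + 0.001                            # queries: slightly shifted copies of rows 0..2

D, I = flat_l2_search(xb, xq, k=5)
# Each query's nearest neighbour is the row it was copied from
```

The broadcasted difference allocates an (n_queries, N, d) array, which is why this only works for small toy inputs.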
Index 2: IndexIVFFlat — partitioned approximate search
Clusters vectors into nlist Voronoi cells. At query time only nprobe cells are searched — a 4–10× speedup at the cost of a small recall drop.
nlist = 50 # number of clusters ≈ sqrt(N)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(sentence_embeddings) # IVF requires a training pass
index.add(sentence_embeddings)
index.nprobe = 10 # search 10 of 50 cells; raise for better recall
D, I = index.search(query_embedding, k=5)
Rule of thumb: nlist ≈ sqrt(N). Start nprobe at 10–20% of nlist and tune from there — doubling nprobe roughly doubles latency but meaningfully improves recall.
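Tuning nprobe needs a recall number to tune against. One common approach is to treat the IndexFlatL2 results as ground truth and measure how many of those neighbours the IVF index also returns. The helper below is a minimal sketch of that measurement; recall_at_k is a hypothetical name, and the ID lists are made-up stand-ins for the I arrays returned by search.

```python
def recall_at_k(approx_ids, exact_ids):
    """Mean fraction of exact top-k neighbours also present in the approximate results.

    Both arguments are row-per-query collections of IDs, e.g. the `I`
    arrays returned by search(). IDs of -1 (used by FAISS to pad rows
    when fewer than k results are found) are ignored.
    """
    hits = 0
    total = 0
    for approx_row, exact_row in zip(approx_ids, exact_ids):
        exact_set = {int(i) for i in exact_row if int(i) >= 0}
        approx_set = {int(i) for i in approx_row if int(i) >= 0}
        hits += len(exact_set & approx_set)
        total += len(exact_set)
    return hits / total if total else 0.0

# Made-up IDs for two queries with k = 5
exact = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
approx = [[0, 1, 2, 3, 9], [5, 6, 7, 1, 2]]
score = recall_at_k(approx, exact)

# Hypothetical sweep, assuming flat_index and ivf_index were built as above:
#   _, exact_I = flat_index.search(queries, 5)
#   for nprobe in (1, 5, 10, 25, 50):
#       ivf_index.nprobe = nprobe
#       _, approx_I = ivf_index.search(queries, 5)
#       print(nprobe, recall_at_k(approx_I, exact_I))
```

Timing each search call in the same loop gives the latency side of the trade-off.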
Index 3: IndexIVFPQ — memory-compressed at scale
Adds Product Quantization (PQ) on top of IVF. Each 768-float vector (3,072 bytes as float32) is compressed into m × bits / 8 bytes. With m = 8 and bits = 8 that is 8 bytes per vector, a 384× reduction in vector storage, at the cost of a further drop in recall.
m = 8 # sub-quantizers (d must be divisible by m)
bits = 8 # bits per sub-quantizer → 8 bytes per vector (vs. 3072 for float32)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits)
index.train(sentence_embeddings)
index.add(sentence_embeddings)
D, I = index.search(query_embedding, k=5)
Use IVFPQ when you have > 1M vectors or can't fit embeddings in RAM.
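Working the memory numbers for the m = 8, bits = 8 configuration makes the trade-off easier to judge. The sketch below is plain arithmetic with hypothetical helper names. It assumes each stored vector also carries an 8-byte ID in the inverted lists, and it ignores the fixed cost of centroids and codebooks.

```python
def pq_code_bytes(m, bits):
    """Bytes per PQ code: m sub-quantizer indices of `bits` bits each, rounded up."""
    return (m * bits + 7) // 8

def float32_bytes(d):
    """Bytes per uncompressed float32 vector of dimension d."""
    return 4 * d

d, m, bits = 768, 8, 8
n = 1_000_000

raw = float32_bytes(d)        # uncompressed bytes per vector
code = pq_code_bytes(m, bits) # compressed bytes per vector
ratio = raw / code            # compression of the vector data alone

ID_BYTES = 8                  # assumption: one 64-bit ID stored per vector
ratio_with_ids = raw / (code + ID_BYTES)

raw_gib = n * raw / 2**30                  # corpus size as float32
pq_mib = n * (code + ID_BYTES) / 2**20     # corpus size as PQ codes plus IDs

print(f"{raw} B -> {code} B per vector ({ratio:.0f}x, {ratio_with_ids:.0f}x counting IDs)")
print(f"1M vectors: {raw_gib:.2f} GiB raw vs {pq_mib:.2f} MiB compressed")
```

Raising m (for example to 32 or 64) spends more bytes per vector in exchange for better recall, so the ratio is a dial rather than a fixed property of IVFPQ.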
Picking your index
| Dataset size | Index | Why |
|---|---|---|
| < 100K | IndexFlatL2 | Exact, no tuning, no training |
| 100K – 1M | IndexIVFFlat | 4–10× faster, small and tunable recall loss |
| > 1M or RAM-limited | IndexIVFPQ | ~384× smaller vector storage at m = 8, bits = 8; recall depends on m and nprobe |
The underlying speed vs. accuracy trade-off is the same one HNSW and other ANN indexes make — FAISS just makes the parameters explicit and tunable.
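The selection table and the nlist and nprobe rules of thumb can be folded into one starting-point helper. This is only a sketch of the article's heuristics: suggest_index is a hypothetical name, the thresholds are the table's, and the returned values are starting points to tune, not recommendations from FAISS itself.

```python
import math

def suggest_index(n_vectors, ram_limited=False):
    """Map corpus size to an index type plus starting nlist/nprobe values."""
    if ram_limited or n_vectors > 1_000_000:
        kind = "IndexIVFPQ"
    elif n_vectors >= 100_000:
        kind = "IndexIVFFlat"
    else:
        # Exact search: nothing to train or tune
        return {"index": "IndexFlatL2"}

    nlist = max(1, round(math.sqrt(n_vectors)))  # nlist ≈ sqrt(N)
    nprobe = max(1, round(0.10 * nlist))         # start at ~10% of nlist
    return {"index": kind, "nlist": nlist, "nprobe": nprobe}

small = suggest_index(50_000)
medium = suggest_index(250_000)
large = suggest_index(4_000_000)
```

From there, measure recall against a flat index on a sample of queries and adjust nprobe until the latency budget is met.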
Anchor resource: Pinecone's FAISS tutorial series walks through all three index types with real sentence-transformer data, explains the Voronoi cell intuition behind IVF, and includes chapters on HNSW and Product Quantization.
Sources: Pinecone FAISS tutorial · FAISS GitHub (facebookresearch) · Meta Engineering: FAISS intro