FAISS From Scratch: Three Index Types Every AI Engineer Should Know

Chris Harper

Jul 4, 2026 · 20:05 UTC

FAISS is Meta's in-process vector search library. It offers IndexFlatL2 for exact results, IndexIVFFlat for approximate search that is often several times faster (how much depends on nprobe), and IndexIVFPQ for memory-compressed retrieval at million-vector scale.

What you'll be able to do after this:

Build a FAISS index from scratch, add embeddings, and run similarity queries in under 20 lines of Python
Choose between exact (Flat), partitioned (IVF), and compressed (PQ) indexes based on dataset size and latency budget
Tune nlist and nprobe to trade recall for speed without a single extra dependency

If you're building semantic search, RAG retrieval, deduplication, or recommendations — you'll eventually need to pick a vector index. FAISS (Facebook AI Similarity Search) is the workhorse behind many production ML systems and runs entirely in-process: no server, no database, no network call.

Install

pip install faiss-cpu        # CPU (works anywhere)
# pip install faiss-gpu      # if you have CUDA

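FAISS expects float32 NumPy arrays of shape (N, d). Queries must be 2-D as well, shape (n_queries, d), even when you search with a single vector. If you're using sentence-transformers or OpenAI embeddings, cast with .astype('float32') before adding.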
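As a minimal sketch of that casting step, the helper below converts any array-like of embeddings to a contiguous float32 matrix and promotes a single vector to shape (1, d). The name `to_faiss_array` is hypothetical and not part of FAISS; the random data stands in for real embeddings.

```python
import numpy as np

def to_faiss_array(vectors):
    """Return a C-contiguous float32 array of shape (n, d)."""
    arr = np.asarray(vectors, dtype=np.float32)
    if arr.ndim == 1:
        # A single vector becomes a batch of one query.
        arr = arr.reshape(1, -1)
    if arr.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D input, got {arr.ndim}-D")
    return np.ascontiguousarray(arr)

# Placeholder data: NumPy's default float64 goes in, float32 comes out.
embeddings = to_faiss_array(np.random.rand(4, 8))
query = to_faiss_array([0.1] * 8)  # shape (1, 8)
```

Passing everything through one helper like this keeps dtype and shape mistakes out of the add and search calls.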

Index 1: IndexFlatL2 — exact search

Computes the L2 (Euclidean) distance from your query to every stored vector, so results are always exact. The distances it returns are squared L2 values. Brute force stays fast up to roughly 100K vectors.

import faiss
import numpy as np

d = 768  # embedding dimension (match your model)

index = faiss.IndexFlatL2(d)
index.add(sentence_embeddings)            # shape: (N, d), float32
D, I = index.search(query_embedding, k=5) # D = distances, I = indices

Use IndexFlatL2 when you need exact results (benchmarks, evaluation) or your corpus is small.
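To make "compare against every vector" concrete, here is a NumPy sketch of the same computation. It is an illustration of the idea, not FAISS's implementation, and `flat_l2_search` is a made-up name. Like IndexFlatL2, it returns squared L2 distances and the matching row indices.

```python
import numpy as np

def flat_l2_search(xb, xq, k):
    """Exact k-NN by squared L2 distance; returns (D, I) shaped (n_queries, k)."""
    # ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2, evaluated for all pairs at once.
    d2 = (xq ** 2).sum(axis=1, keepdims=True) - 2.0 * (xq @ xb.T) + (xb ** 2).sum(axis=1)
    d2 = np.maximum(d2, 0.0)             # clamp tiny negatives caused by rounding
    I = np.argsort(d2, axis=1)[:, :k]    # indices of the k smallest distances
    D = np.take_along_axis(d2, I, axis=1)
    return D, I

rng = np.random.default_rng(0)
xb = rng.random((1000, 64), dtype=np.float32)  # placeholder corpus
xq = xb[:3].copy()                             # query with three stored vectors
D, I = flat_l2_search(xb, xq, k=5)
```

Because each query vector is also in the corpus, its nearest neighbour is itself at distance of about zero, which is a handy sanity check for any exact index.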

Index 2: IndexIVFFlat — partitioned approximate search

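Clusters vectors into nlist Voronoi cells. At query time only the nprobe cells nearest the query are searched, so each query scans roughly nprobe / nlist of the corpus. That typically gives a 4–10× speedup at the cost of a small drop in recall.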

nlist = 50                              # number of clusters ≈ sqrt(N)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(sentence_embeddings)        # IVF requires a training pass
index.add(sentence_embeddings)
index.nprobe = 10                       # search 10 of 50 cells; raise for better recall
D, I = index.search(query_embedding, k=5)

Rule of thumb: nlist ≈ sqrt(N). Start nprobe at 10–20% of nlist and tune from there — doubling nprobe roughly doubles latency but meaningfully improves recall.
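The rule of thumb and the tuning loop can be sketched with two small helpers. Both names are hypothetical, and the 10% default is just the low end of the range above. `recall_at_k` compares the ids returned by an approximate index with those from an exact one, such as the I arrays from IndexIVFFlat and IndexFlatL2 for the same queries.

```python
import math

def ivf_starting_params(n_vectors, probe_fraction=0.1):
    """Starting nlist and nprobe: nlist ~ sqrt(N), nprobe a fraction of nlist."""
    nlist = max(1, round(math.sqrt(n_vectors)))
    nprobe = max(1, min(nlist, round(probe_fraction * nlist)))
    return nlist, nprobe

def recall_at_k(approx_ids, exact_ids):
    """Mean fraction of the exact top-k ids that the approximate search also found."""
    hits = 0
    total = 0
    for approx_row, exact_row in zip(approx_ids, exact_ids):
        exact = {int(i) for i in exact_row}
        hits += len(exact & {int(i) for i in approx_row})
        total += len(exact)
    return hits / total if total else 0.0

nlist, nprobe = ivf_starting_params(250_000)
# Toy id lists: the first query misses one of three true neighbours.
score = recall_at_k([[3, 7, 9], [1, 2, 4]], [[3, 9, 5], [1, 2, 4]])
```

In practice you would raise nprobe step by step, recompute recall against the exact results, and stop once recall meets your target within the latency budget.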

Index 3: IndexIVFPQ — memory-compressed at scale

Adds Product Quantization (PQ) on top of IVF. Each 768-float vector is compressed into a code of m × bits / 8 bytes. With m = 8 and bits = 8 that is 8 bytes instead of 3,072, a 384× reduction in vector storage, at the cost of a further drop in recall.

m    = 8   # sub-quantizers (d must be divisible by m)
bits = 8   # bits per sub-quantizer → 8 bytes per vector (vs. 3072 for float32)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits)
index.train(sentence_embeddings)
index.add(sentence_embeddings)
index.nprobe = 10                       # default is 1; tune as with IVFFlat
D, I = index.search(query_embedding, k=5)

Use IVFPQ when you have > 1M vectors or can't fit embeddings in RAM.
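The memory arithmetic is easy to check yourself. This sketch counts only the stored vector codes. It ignores the per-vector ids, the centroid tables, and other index overhead, so treat the numbers as a lower bound. The function names are illustrative.

```python
def pq_code_bytes(m, bits):
    """Bytes per PQ code: m sub-quantizer indices of `bits` bits each, rounded up."""
    return (m * bits + 7) // 8

def float32_bytes(d):
    """Bytes for one uncompressed float32 vector of dimension d."""
    return 4 * d

d, m, bits = 768, 8, 8
raw = float32_bytes(d)          # bytes per uncompressed vector
code = pq_code_bytes(m, bits)   # bytes per compressed code
ratio = raw / code

n = 1_000_000
raw_gb = n * raw / 1e9          # corpus size uncompressed, in GB
code_mb = n * code / 1e6        # corpus size as PQ codes, in MB
```

Raising m buys back recall at a proportional memory cost: m = 32 with the same bits gives 32-byte codes, a 96× reduction.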

Picking your index

Dataset size	Index	Why
< 100K	`IndexFlatL2`	Exact, no tuning, no training
100K – 1M	`IndexIVFFlat`	Typically 4–10× faster, small recall loss tunable via nprobe
> 1M or RAM-limited	`IndexIVFPQ`	~384× smaller vectors at m = 8, bits = 8; recall depends on m and nprobe

The underlying speed vs. accuracy trade-off is the same one HNSW and other ANN indexes make — FAISS just makes the parameters explicit and tunable.
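If you want the table's decision rule in code, a sketch might look like this. The thresholds are the ones from the table, treated as rough guidance rather than hard limits, and `choose_index` is a hypothetical helper.

```python
def choose_index(n_vectors, ram_limited=False):
    """Return the name of the FAISS index type suggested by the table."""
    if ram_limited or n_vectors > 1_000_000:
        return "IndexIVFPQ"      # compressed codes, lowest memory
    if n_vectors >= 100_000:
        return "IndexIVFFlat"    # partitioned, full-precision vectors
    return "IndexFlatL2"         # exact brute force, no training

small = choose_index(20_000)
medium = choose_index(500_000)
large = choose_index(5_000_000)
tight = choose_index(50_000, ram_limited=True)
```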

Anchor resource: Pinecone's FAISS tutorial series walks through all three index types with real sentence-transformer data, explains the Voronoi cell intuition behind IVF, and includes chapters on HNSW and Product Quantization.