FAISS From Scratch: Three Index Types Every AI Engineer Should Know

Chris Harper

3 min read

Jul 4, 2026 · 20:05 UTC

AI
Tutorial
Embeddings
Vectors
RAG

FAISS is Meta's in-process vector search library: IndexFlatL2 for exact results, IVFFlat for several-times-faster approximate search, and IVFPQ for memory-compressed retrieval at million-vector scale.

What you'll be able to do after this:

  • Build a FAISS index from scratch, add embeddings, and run similarity queries in under 20 lines of Python
  • Choose between exact (Flat), partitioned (IVF), and compressed (PQ) indexes based on dataset size and latency budget
  • Tune nlist and nprobe to trade recall for speed without a single extra dependency

If you're building semantic search, RAG retrieval, deduplication, or recommendations — you'll eventually need to pick a vector index. FAISS (Facebook AI Similarity Search) is the workhorse behind many production ML systems and runs entirely in-process: no server, no database, no network call.


Install

pip install faiss-cpu        # CPU build (community-maintained wheels on PyPI)
# GPU builds: the FAISS project's supported install route is conda (see INSTALL.md in the repo)

FAISS expects float32 NumPy arrays of shape (n, d), so even a single query must be 2-D with shape (1, d). If you're using sentence-transformers or OpenAI embeddings, cast with .astype('float32') before adding.
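When embeddings arrive in mixed forms (Python lists from an API, 1-D arrays for single queries), a small helper can normalize them in one place. The name to_faiss_array is hypothetical, not part of FAISS. Recent versions of the FAISS Python wrapper may cast the dtype for you, but converting explicitly is cheap; the (n, d) shape requirement holds either way, because the wrapper unpacks the input shape as n, d.

```python
import numpy as np

def to_faiss_array(embeddings) -> np.ndarray:
    """Hypothetical helper: turn a list of lists, a list of arrays, or an ndarray
    into the 2-D, C-contiguous float32 array that FAISS expects."""
    arr = np.asarray(embeddings, dtype=np.float32)
    if arr.ndim == 1:               # a single vector becomes a one-row batch, shape (1, d)
        arr = arr.reshape(1, -1)
    return np.ascontiguousarray(arr)

# OpenAI-style embeddings arrive as Python lists of floats; a lone query vector is 1-D.
corpus = to_faiss_array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
query = to_faiss_array([0.1, 0.2, 0.3])
print(corpus.dtype, corpus.shape, query.shape)
```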

Index 1: IndexFlatL2 — exact search

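Computes the squared L2 (Euclidean) distance from your query to every stored vector, so results are always exact. Search cost grows linearly with N; in practice it stays fast up to roughly 100K vectors, depending on d and hardware.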

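import faiss
import numpy as np

d = 768  # embedding dimension (match your model)

# Placeholder data so the snippet runs; swap in your real float32 embeddings
sentence_embeddings = np.random.random((10_000, d)).astype('float32')  # shape (N, d)
query_embedding = sentence_embeddings[:1]   # queries must be 2-D: (n_queries, d)

index = faiss.IndexFlatL2(d)
index.add(sentence_embeddings)
D, I = index.search(query_embedding, k=5)   # D = squared L2 distances, I = row indices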

Use IndexFlatL2 when you need exact results (benchmarks, evaluation) or your corpus is small.

Index 2: IndexIVFFlat — partitioned approximate search

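Partitions vectors into nlist Voronoi cells around k-means centroids and stores each vector in the inverted list of its nearest centroid. At query time only the nprobe cells whose centroids are closest to the query are scanned, typically a 4–10× speedup at the cost of a small recall drop.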

nlist = 50                              # number of clusters ≈ sqrt(N)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(sentence_embeddings)        # IVF requires a training pass
index.add(sentence_embeddings)
index.nprobe = 10                       # search 10 of 50 cells; raise for better recall
D, I = index.search(query_embedding, k=5)
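To make the Voronoi-cell idea concrete, here is a toy IVF in plain NumPy. It is a teaching sketch (ToyIVF and kmeans are hypothetical names), not how FAISS implements it internally; FAISS uses its own optimized k-means and inverted-list storage.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (k, d) centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = ((x[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            members = x[assign == c]
            if len(members):            # keep the old centroid if a cell empties
                centroids[c] = members.mean(0)
    return centroids

class ToyIVF:
    """Hypothetical teaching class: IVF = coarse k-means + inverted lists."""

    def __init__(self, nlist):
        self.nlist = nlist

    def train(self, x):
        self.centroids = kmeans(x, self.nlist)

    def add(self, x):
        self.xb = x
        cell = ((x[:, None, :] - self.centroids[None]) ** 2).sum(-1).argmin(1)
        # Inverted lists: cell id -> row ids of the vectors assigned to that cell
        self.lists = [np.flatnonzero(cell == c) for c in range(self.nlist)]

    def search(self, q, k, nprobe):
        # 1) Rank centroids by distance to the query and keep the nprobe closest
        probe = np.argsort(((self.centroids - q) ** 2).sum(-1))[:nprobe]
        # 2) Exhaustively scan only the vectors stored in those cells
        cand = np.concatenate([self.lists[c] for c in probe])
        dist = ((self.xb[cand] - q) ** 2).sum(-1)
        order = np.argsort(dist)[:k]
        return dist[order], cand[order]

rng = np.random.default_rng(1)
xb = rng.random((2000, 16)).astype(np.float32)
ivf = ToyIVF(nlist=20)
ivf.train(xb)
ivf.add(xb)
dist, ids = ivf.search(xb[123], k=5, nprobe=20)   # probing every cell = exact search
```

With nprobe equal to nlist every cell is scanned and the result matches exact search; a lower nprobe scans fewer vectors and can miss neighbours that sit just across a cell boundary.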

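Rule of thumb: nlist ≈ sqrt(N) for modest corpora (the FAISS wiki's index-selection guidelines suggest 4·sqrt(N) to 16·sqrt(N) as N grows). Start nprobe at 10–20% of nlist and tune from there: search cost grows roughly linearly with nprobe, so doubling it can roughly double latency, while recall improves with diminishing returns.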

Index 3: IndexIVFPQ — memory-compressed at scale

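Adds Product Quantization (PQ) on top of IVF. Each vector is split into m sub-vectors, and each sub-vector is replaced by the index of its nearest centroid in a learned codebook of 2^bits entries, so a vector is stored in m × bits bits. For a 768-dimensional float32 embedding with m = 8 and bits = 8, that is 8 bytes instead of 3,072, a 384× reduction in code size, at the cost of a further recall drop.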

m    = 8   # sub-quantizers (d must be divisible by m)
bits = 8   # bits per sub-quantizer → 8 bytes per vector (vs. 3072 for float32)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits)
index.train(sentence_embeddings)
index.add(sentence_embeddings)
index.nprobe = 10                       # default is 1; raise for better recall
D, I = index.search(query_embedding, k=5)
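Product Quantization in miniature: split each vector into m sub-vectors, learn a 256-entry codebook per sub-space (bits = 8), and store only the codebook index for each. This NumPy sketch (train_pq, encode, and decode are hypothetical names) leaves out what IVFPQ adds on top: FAISS by default encodes the residual relative to the IVF centroid and scores queries against codes with precomputed lookup tables instead of decoding vectors.

```python
import numpy as np

def train_pq(x, m, ksub=256, iters=10, seed=0):
    """Train one k-means codebook per sub-vector; returns shape (m, ksub, d // m)."""
    n, d = x.shape
    assert d % m == 0, "d must be divisible by m"
    dsub = d // m
    rng = np.random.default_rng(seed)
    books = np.empty((m, ksub, dsub), dtype=np.float32)
    for j in range(m):
        sub = x[:, j * dsub:(j + 1) * dsub]
        cent = sub[rng.choice(n, size=ksub, replace=False)].copy()
        for _ in range(iters):                       # a few Lloyd iterations
            assign = ((sub[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
            for c in range(ksub):
                members = sub[assign == c]
                if len(members):
                    cent[c] = members.mean(0)
        books[j] = cent
    return books

def encode(x, books):
    """Replace each sub-vector by its nearest centroid's index: m uint8 codes per vector."""
    m, ksub, dsub = books.shape
    codes = np.empty((len(x), m), dtype=np.uint8)
    for j in range(m):
        sub = x[:, j * dsub:(j + 1) * dsub]
        codes[:, j] = ((sub[:, None] - books[j][None]) ** 2).sum(-1).argmin(1)
    return codes

def decode(codes, books):
    """Approximate reconstruction: concatenate the selected centroids."""
    return np.concatenate([books[j][codes[:, j]] for j in range(books.shape[0])], axis=1)

rng = np.random.default_rng(0)
x = rng.random((2000, 64)).astype(np.float32)   # small synthetic set to keep it fast
books = train_pq(x, m=8)
codes = encode(x, books)
recon = decode(codes, books)
```

Here 64 float32 values (256 bytes) shrink to 8 one-byte codes, a 32× reduction. At the article's settings (d = 768, m = 8, bits = 8) the same arithmetic gives 3,072 / 8 = 384×.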

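Use IVFPQ when you have more than ~1M vectors or can't fit the raw float32 embeddings in RAM. Keep in mind that its distances are estimated from the compressed codes, so they are approximate, not exact.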

Picking your index

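| Dataset size | Index | Why |
|---|---|---|
| < 100K | IndexFlatL2 | Exact, no tuning, no training |
| 100K – 1M | IndexIVFFlat | 4–10× faster, small recall loss (tune nprobe) |
| > 1M or RAM-limited | IndexIVFPQ | ~384× smaller vector codes (m = 8, bits = 8); recall depends on m and nprobe |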

The underlying speed vs. accuracy trade-off is the same one HNSW and other ANN indexes make (FAISS ships an HNSW index too, IndexHNSWFlat); the IVF family simply exposes it through explicit, tunable parameters: nlist, nprobe, m, and bits.

Anchor resource: Pinecone's FAISS tutorial series walks through all three index types with real sentence-transformer data, explains the Voronoi cell intuition behind IVF, and includes chapters on HNSW and Product Quantization.

Sources: Pinecone FAISS tutorial · FAISS GitHub (facebookresearch) · Meta Engineering: FAISS intro