CloudCodeTree LogoCloudCodeTree
AI NewsTutorialsAbout
CloudCodeTree Logo
CloudCodeTree
  • AI News
  • Tutorials
  • About
← Back to AI News
Cut Your Chunks Right: The 5 Levels of Text Splitting Every RAG Builder Needs to Know

Cut Your Chunks Right: The 5 Levels of Text Splitting Every RAG Builder Needs to Know

Chris Harper

3 min read

Jul 1, 2026 · 04:08 UTC

AI
Tutorial
RAG
Embeddings

TL;DR: How you chunk documents determines 80% of RAG retrieval quality — start with RecursiveCharacterTextSplitter, then climb the 5-level framework as your use case demands.

What you'll be able to do after this:

  • Pick the right chunking strategy for any document type — code, prose, structured data
  • Tune chunk_size and chunk_overlap to match your embedding model and query patterns
  • Know when to graduate from fixed splits to semantic or agentic chunking

Why chunking is the first thing that matters

A RAG pipeline retrieves by vector similarity: your question embedding is compared to every chunk embedding, and the closest chunks are sent to the LLM. If a good answer is split across two chunks, neither chunk ranks highly — and the LLM never sees the full context. Get chunking wrong and no amount of reranking or prompt engineering fixes it.

Greg Kamradt laid out the definitive framework in his 5 Levels of Text Splitting walkthrough. Here's the ladder:

Level 1 — Fixed character splits (naive baseline)

Split every N characters regardless of content:

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = splitter.split_text(raw_text)

Simplest to implement. Breaks mid-sentence constantly. Use only as a quick sanity-check — never ship this.

Level 2 — Recursive character splitting (your default)

Try paragraph breaks first (`

`), then line breaks, then sentence endings, then words — recursively re-splitting with the next separator until chunks are small enough:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,      # ~12% overlap — a reliable starting point
    separators=["

", "
", ". ", " ", ""],
)
docs = splitter.create_documents([raw_text])

This is the "swiss army knife" — respects natural text boundaries without needing schema knowledge. Start here for any new project.

Choosing chunk_size: factoid Q&A → 256–512 tokens; analytical queries → 1024+. If unsure, 512 with 64-token overlap is your default.

Level 3 — Document-specific splitting

Code, Markdown, HTML, and PDFs each have structure worth preserving:

from langchain_text_splitters import MarkdownHeaderTextSplitter

headers = [("#", "h1"), ("##", "h2"), ("###", "h3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
md_docs = splitter.split_text(markdown_content)
# Each chunk carries metadata: {"h1": "Installation", "h2": "Quick Start"}

The metadata matters: you can later filter on h1 == "API Reference" to scope retrieval to the right section before any similarity search runs.

Level 4 — Semantic splitting

Group sentences by meaning — embed adjacent sentences, split when similarity drops sharply:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # or any embedder

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # split at bottom 10% similarity drops
)
docs = splitter.create_documents([raw_text])

Benchmarks show up to ~70% lift in retrieval recall over naive fixed splits. The cost: each sentence gets embedded during the split phase, roughly doubling ingestion time. Worth it for unstructured corpora with varied topic density.

Level 5 — Agentic splitting

Use an LLM to extract propositions — atomic, self-contained statements — from a passage, then embed each proposition. The chunk becomes a semantic unit the model decided was standalone.

Highest precision, highest cost. Use it when queries are very specific and documents are dense (research papers, legal contracts).

What to ship

Start at Level 2. Measure retrieval quality with a small eval set (see the RAGAS post if you missed it). If recall is below ~0.7, move up the ladder — Level 3 if your docs have structure, Level 4 if topics shift inside documents, Level 5 for high-stakes dense corpora.

Sources: 5 Levels of Text Splitting — YouTube | LangChain RecursiveCharacterTextSplitter docs | Unstructured.io chunking best practices