Why Your RAG System Fails: The Chunking Problem
Chunking is how documents are broken into smaller, more manageable pieces for retrieval. Think of it as storing books by paragraphs, so when someone asks a question, you only respond with the relevant paragraphs—not the entire book.
- Get it wrong and your RAG system doesn't find the needed information at query time
- Get it right and you get accurate responses with lower costs and fewer hallucinations
The Fundamentals
The way you chunk your documents directly impacts retrieval quality. Too large and you waste tokens on irrelevant content. Too small and you lose context. The goal is finding the sweet spot for your specific use case.
6 Chunking Strategies
Fixed-Size Chunking
Splits text into chunks of a fixed number of characters or tokens.
- Use when: Most cases—this is your starting point
- Watch out: Loss of context by breaking up ideas
- Example: Every 500 tokens with 100-token overlap
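The mechanics fit in a few lines. Here's a minimal sketch of fixed-size chunking with overlap; it approximates tokens as whitespace-separated words (a real pipeline would use an actual tokenizer such as the one your embedding model uses):

```python
def fixed_size_chunks(text, chunk_size=500, overlap=100):
    """Split text into fixed-size chunks with overlap.

    "Tokens" are approximated by whitespace-separated words here;
    swap in a real tokenizer for production.
    """
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The `overlap` parameter repeats the tail of each chunk at the head of the next, so an idea that straddles a boundary still appears whole in at least one chunk.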
Recursive Chunking
Splits text using a hierarchy of separators: paragraphs first, then sentences, then words, recursing until each piece fits the target chunk size.
- Use when: Documents are structured
- Watch out: Difficult to implement across multiple document types
- Example: Split by `\n\n`, then `\n`, then `.` if needed
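Here's a simplified sketch of that separator hierarchy in plain Python (LangChain's `RecursiveCharacterTextSplitter` implements a more complete version, including merging small pieces back together):

```python
def recursive_split(text, max_len=500, separators=("\n\n", "\n", ". ", " ")):
    """Recursively split on the coarsest separator present until
    every chunk is at most max_len characters."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for sep in separators:
        if sep in text:
            chunks = []
            for piece in text.split(sep):
                chunks.extend(recursive_split(piece, max_len, separators))
            return chunks
    # No separator left: hard-split as a last resort
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Because paragraph breaks are tried before sentence breaks, chunks tend to align with the document's natural structure instead of cutting mid-thought.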
Document-Structure Chunking
Splits text based on the document's structure using headings, paragraphs, and code blocks.
- Use when: Markdown, HTML, structured PDFs
- Watch out: Sections can be different sizes
- Example: Each markdown `##` heading = one chunk
- Pro tip: Add metadata (chapter, section) to each chunk
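A minimal sketch of heading-based splitting for markdown, with the heading attached as metadata so you can filter on it at query time (a hand-rolled stand-in for tools like LangChain's `MarkdownTextSplitter`):

```python
import re

def split_by_heading(markdown, level=2):
    """One chunk per markdown heading at the given level,
    with the heading text carried along as metadata."""
    pattern = re.compile(rf"^{'#' * level} ", re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(markdown)]
    chunks = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(markdown)
        section = markdown[start:end].strip()
        heading = section.splitlines()[0].lstrip("# ").strip()
        chunks.append({"heading": heading, "text": section})
    return chunks
```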
Semantic Chunking
Splits text based on topic changes.
- Use when: Documents have complex structures, or you need to preserve meaning
- Watch out: Potential for context loss by splitting sentences; could have high computational cost
- Example: Split where sentence embeddings show topic shift
- Best for: News articles, transcripts, blog posts
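To make the idea concrete, here's a toy sketch of boundary detection between consecutive sentences. Word-overlap (Jaccard) similarity stands in for embedding cosine similarity purely so the example is self-contained; a real system would embed each sentence with a model and compare vectors:

```python
def semantic_chunks(sentences, threshold=0.2):
    """Group consecutive sentences, starting a new chunk when
    similarity to the previous sentence drops below threshold."""
    def similarity(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if similarity(prev, sent) < threshold:  # topic shift detected
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

The computational cost mentioned above comes from the embedding step: every sentence must be embedded before any boundary can be found.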
Sliding Window Chunking
Moves a fixed-size window across the document, producing overlapping chunks.
- Use when: Maintaining context and continuity across sections is crucial (legal, medical, or financial texts)
- Watch out: Creates more chunks; higher storage cost
- Example: 1000-token window, 200-token overlap
- Prevents: Missing info that spans chunk boundaries
LLM-Based Chunking
Uses an LLM to decide how to split the document.
- Use when: Higher accuracy is needed
- Watch out: Most expensive option
- Example: "Divide this contract into logical sections with complete clauses"
- Best for: Legal docs, medical records, research papers
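In practice this means building a prompt with a delimiter you can parse reliably, then splitting the model's response. A hedged sketch (the actual model call is whatever client you use and is left out here):

```python
def build_chunking_prompt(document):
    """Prompt asking an LLM to split a document into logical sections,
    separated by a delimiter we can parse afterwards."""
    return (
        "Divide the following document into logical sections with "
        "complete clauses. Separate sections with a line containing "
        "only '---'.\n\n" + document
    )

def parse_llm_chunks(response):
    """Parse the delimiter-separated LLM response back into chunks."""
    return [part.strip() for part in response.split("\n---\n") if part.strip()]
```

Note the cost implication: every document passes through the LLM once at indexing time, which is why this is the most expensive option.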
How to Choose Your Strategy
Phase 1: Establish a Baseline
Start with fixed-size chunking at 512 tokens with 10% overlap. Build a small evaluation dataset of 20-30 real user queries and measure your baseline retrieval accuracy.
Phase 2: Test Alternatives
Test strategies that match your content:
- Structured documents? Try document-structure or recursive chunking
- Complex narratives? Test semantic chunking
- Mixed content? Compare multiple approaches
A/B test against your baseline. If you see 10-20% improvement in retrieval accuracy, you've found your winner.
No improvement? Your content might work fine with fixed-size chunking, or you may need to test different chunk sizes.
Phase 3: Production Optimization
In production, different document types need different strategies:
- Code repositories: Recursive chunking (function/class boundaries)
- Legal contracts: Structure-based (clause preservation)
- Blog posts: Semantic chunking (topic coherence)
The key to building a sound chunking strategy is to monitor real-world performance and adjust as needed. There really isn't a one-size-fits-all when it comes to chunking.
Quick Start: Implementation Checklist
Optimal Chunk Sizes by Content Type
- General content: 512-1024 tokens
- Code: 200-500 tokens (function/class level)
- Legal/contracts: Clause-level (variable, 200-2000 tokens)
- FAQs: One Q&A pair per chunk
- Chat logs: Conversation turn or exchange level
Overlap Guidelines
- Standard: 10-20% overlap (50-200 tokens)
- Critical applications (legal, medical): 20-30%
- Cost-sensitive: 5-10% or none
Testing Approach
- Create 20-50 evaluation queries from real use cases
- Implement baseline (fixed-size 512 tokens, 10% overlap)
- Measure precision (relevant chunks retrieved / total retrieved)
- Test 2 alternative strategies
- Compare metrics and choose winner
- Monitor in production and iterate
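The precision metric from step 3 is simple to compute yourself before reaching for an evaluation framework. A minimal sketch, assuming you have chunk IDs for what was retrieved and what was actually relevant per query:

```python
def retrieval_precision(retrieved_ids, relevant_ids):
    """Precision = relevant chunks retrieved / total chunks retrieved."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for cid in retrieved_ids if cid in relevant)
    return hits / len(retrieved_ids)

def evaluate(queries):
    """Average precision over an evaluation set.

    `queries` maps each query to (retrieved_ids, relevant_ids);
    the retrieval step is whatever pipeline you are comparing.
    """
    scores = [retrieval_precision(got, rel) for got, rel in queries.values()]
    return sum(scores) / len(scores)
```

Run this once per chunking strategy over the same evaluation queries, and the comparison in step 5 becomes a single number per strategy.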
Tools to Get Started
- LangChain: RecursiveCharacterTextSplitter, MarkdownTextSplitter
- LlamaIndex: SentenceSplitter, SemanticSplitterNodeParser
- Evaluation: RAGAS, TruLens
Strategy Comparison at a Glance
| Strategy | Speed | Accuracy | Cost | Best For |
|---|---|---|---|---|
| Fixed-size | Fast | Medium | Low | Starting point, mixed content |
| Recursive | Medium | High | Low | Structured docs |
| Document-structure | Medium | Very High | Low | Markdown, HTML, PDFs |
| Semantic | Slow | Very High | High | Unstructured text |
| Sliding window | Medium | Very High | Medium | Context-critical apps |
| LLM-based | Slow | Highest | Very High | High-value docs |
The Bottom Line
Chunking isn't the most talked-about aspect of RAG systems, but it's the foundation a performant system is built on. Start with fixed-size chunking, then iterate on your strategy based on real-world testing.
Your RAG system is only as good as the chunks it retrieves.