Chunking Strategies
Chunking strategies determine how documents are split into smaller pieces for retrieval. Choosing the right chunking strategy can significantly improve retrieval quality.
Strategy Overview
| Strategy | Best For | Description |
|---|---|---|
| Smart Chunking | General documents | Auto-detect document structure |
| Sentence-based | Precise retrieval | Split by sentence boundaries |
| Semantic | Complex documents | Split by semantic similarity |
Smart Chunking
Smart chunking, the default strategy, automatically identifies a document's structure and splits along it.
How It Works
- Identifies paragraphs, headers, lists, and other structures
- Maintains semantic integrity
- Automatically adjusts chunk size
Best For
- Structured documents (technical docs, reports)
- Mixed content documents
- Most general use cases
Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| chunk_size | Target chunk size (characters) | 500 |
| chunk_overlap | Overlap between chunks | 50 |
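To make the two parameters concrete, here is a minimal Python sketch of structure-aware chunking. It is not the product's implementation: the function name `smart_chunk` is hypothetical, blank lines stand in for real structure detection, and the overlap is assumed to be measured in characters.

```python
def smart_chunk(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Pack whole paragraphs into chunks of roughly chunk_size characters."""
    # Assumption: blank lines mark structural boundaries (paragraphs, list blocks).
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > chunk_size:
            # Adding this block would exceed the target, so close the chunk.
            chunks.append(current)
            # Carry the tail of the closed chunk forward as overlap.
            tail = current[-chunk_overlap:] if chunk_overlap > 0 else ""
            current = f"{tail}\n\n{block}" if tail else block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A block is never cut in the middle, so a single paragraph longer than `chunk_size` yields an oversized chunk. That trade-off favors semantic integrity over a strict size limit.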
Sentence-based Chunking
Sentence-based chunking splits documents by sentence boundaries, suitable for scenarios requiring precise retrieval.
How It Works
- Identifies sentence boundaries (periods, question marks, exclamation marks)
- Combines adjacent sentences into chunks
- Maintains sentence integrity
Best For
- FAQ documents
- Q&A content
- Scenarios requiring precise matching
Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| separator | Sentence separators | .!? |
| buffer_size | Sentence buffer count | 1 |
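The following Python sketch shows one way sentence-based splitting could work. The function name `sentence_chunks` is hypothetical, and the reading of buffer_size is an assumption: here it is the number of extra adjacent sentences joined onto each sentence, which may differ from the product's actual behavior.

```python
import re


def sentence_chunks(text: str, separator: str = ".!?", buffer_size: int = 1) -> list[str]:
    """Split text into sentences, then join adjacent sentences into chunks."""
    # Split on whitespace that follows any of the separator characters.
    pattern = rf"(?<=[{re.escape(separator)}])\s+"
    sentences = [s.strip() for s in re.split(pattern, text) if s.strip()]
    # Assumption: each chunk holds one sentence plus buffer_size neighbors.
    step = buffer_size + 1
    return [" ".join(sentences[i:i + step]) for i in range(0, len(sentences), step)]
```

With `buffer_size=0` every sentence becomes its own chunk, which suits short FAQ entries. Larger values trade precision for context.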
Semantic Chunking
Semantic chunking splits a document at the points where the meaning of its content shifts, which suits complex documents.
How It Works
- Calculates semantic similarity between adjacent text segments
- Splits at semantic change points
- Maintains topic coherence
Best For
- Long articles
- Documents with diverse topics
- Scenarios requiring context coherence
Configuration Parameters
| Parameter | Description | Default |
|---|---|---|
| breakpoint_threshold | Semantic breakpoint threshold | 0.5 |
| buffer_size | Context buffer size | 1 |
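The idea of splitting at semantic change points can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's algorithm: a bag-of-words cosine similarity stands in for a real embedding model, the names `semantic_chunks`, `_vector`, and `_cosine` are hypothetical, a split is assumed to happen when similarity falls below breakpoint_threshold, and buffer_size is left out for brevity.

```python
import math
import re
from collections import Counter


def _vector(text: str) -> Counter:
    # Toy stand-in for an embedding: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], breakpoint_threshold: float = 0.5) -> list[list[str]]:
    """Group consecutive sentences, starting a new chunk when similarity drops."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if _cosine(_vector(prev), _vector(curr)) < breakpoint_threshold:
            chunks.append([curr])    # topic changed: open a new chunk
        else:
            chunks[-1].append(curr)  # same topic: extend the current chunk
    return chunks
```

Under this reading, raising the threshold produces more, smaller chunks, because more neighboring pairs fall below it.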
General Configuration
Chunk Size
Chunk size affects retrieval precision and recall:
| Size (characters) | Pros | Cons |
|---|---|---|
| Small (200-300) | Precise matching | May lose context |
| Medium (400-600) | Balances precision and context; a sound general choice | Neither the most precise nor the most context-rich |
| Large (800-1000) | Preserve more context | May include irrelevant content |
Chunk Overlap
Overlap repeats the end of one chunk at the start of the next, so information that straddles a chunk boundary is not cut in two:
- No overlap (0): Independent chunks, saves storage
- Small overlap (20-50): Basic continuity
- Large overlap (100+): Strong context preservation
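The interaction between chunk size and overlap can be seen in a plain sliding-window splitter. This Python sketch is illustrative only: the name `fixed_chunks` is hypothetical, and both values are assumed to be in characters.

```python
def fixed_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each window starts chunk_size - chunk_overlap characters after the previous one.
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window.
    starts = range(0, max(len(text) - chunk_overlap, 1), step)
    return [text[start:start + chunk_size] for start in starts]
```

For a 1000-character text, `chunk_size=500` with no overlap gives 2 chunks and 1000 stored characters. With `chunk_overlap=50` it gives 3 chunks and 1100 stored characters, which is the storage cost that overlap trades for continuity.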
Selection Recommendations
By Document Type
| Document Type | Recommended Strategy | Reason |
|---|---|---|
| Technical docs | Smart Chunking | Preserve structure |
| FAQ | Sentence-based | Precise Q&A matching |
| Long articles | Semantic | Maintain topic coherence |
| Code docs | Smart Chunking | Identify code blocks |
By Use Case
| Scenario | Recommended Configuration |
|---|---|
| Precise Q&A | Sentence-based + small chunks |
| Knowledge retrieval | Smart Chunking + medium chunks |
| Context understanding | Semantic + large chunks |
Re-chunking
If retrieval results are unsatisfactory, you can re-chunk the affected documents:
- Go to the knowledge base document list
- Select the documents to re-chunk
- Click Re-index
- Choose a new chunking strategy and parameters
- Confirm reprocessing
Re-chunking will delete old chunks and create new ones.
Related Documentation
- User Guide - Complete knowledge base guide
- Document Management - Adding and managing documents
- Configuring Retrievers - Retriever configuration guide