Chunking Strategies

Chunking strategies determine how documents are split into smaller pieces for retrieval. Choosing the right chunking strategy can significantly improve retrieval quality.

📊 Strategy Overview

Strategy	Best For	Description
Smart Chunking	General documents	Auto-detect document structure
Sentence-based	Precise retrieval	Split by sentence boundaries
Semantic	Complex documents	Split by semantic similarity

🧠 Smart Chunking

Smart chunking is the default strategy. It automatically identifies the document's structure and splits along it.

How It Works

Identifies paragraphs, headers, lists, and other structures
Maintains semantic integrity
Automatically adjusts chunk size

Best For

Structured documents (technical docs, reports)
Mixed content documents
Most general use cases

Configuration Parameters

Parameter	Description	Default
`chunk_size`	Target chunk size (characters)	500
`chunk_overlap`	Overlap between chunks	50
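
The behavior described above can be approximated in a few lines. This is a minimal sketch, not the product's implementation: it assumes blank lines mark structural boundaries, that `chunk_overlap` is measured in characters like `chunk_size`, and the function name `smart_chunk` is hypothetical.

```python
import re


def smart_chunk(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Pack whole paragraphs into chunks of roughly chunk_size characters."""
    # Assumption: blank lines separate structural blocks (paragraphs, lists, headers).
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > chunk_size:
            # Adding this block would exceed the target, so close the chunk.
            chunks.append(current)
            # Carry the tail of the closed chunk forward as overlap.
            tail = current[-chunk_overlap:] if chunk_overlap > 0 else ""
            current = f"{tail}\n\n{block}" if tail else block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A block is never cut in the middle here, so a single paragraph longer than `chunk_size` stays whole, and a chunk that starts with overlap text can slightly exceed the target. That trade-off favors the structural integrity the strategy is meant to preserve.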

📝 Sentence-based Chunking

Sentence-based chunking splits documents at sentence boundaries. It suits scenarios that require precise retrieval.

How It Works

Identifies sentence boundaries (periods, question marks, exclamation marks)
Combines adjacent sentences into chunks
Maintains sentence integrity

Best For

FAQ documents
Q&A content
Scenarios requiring precise matching

Configuration Parameters

Parameter	Description	Default
`separator`	Sentence separators	`.!?`
`buffer_size`	Sentence buffer count	1
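
The following sketch illustrates how the two parameters could interact. It rests on an assumption: `buffer_size` is read here as the number of neighboring sentences attached on each side of a sentence, whereas the table above only calls it a "sentence buffer count". The function names `split_sentences` and `sentence_chunks` are hypothetical.

```python
import re


def split_sentences(text: str, separator: str = ".!?") -> list[str]:
    """Split after any separator character, keeping the punctuation."""
    # Lookbehind keeps the separator attached to the sentence it ends.
    pattern = rf"(?<=[{re.escape(separator)}])\s+"
    return [s.strip() for s in re.split(pattern, text) if s.strip()]


def sentence_chunks(text: str, separator: str = ".!?", buffer_size: int = 1) -> list[str]:
    """Build one chunk per sentence, padded with buffer_size neighbors per side."""
    sentences = split_sentences(text, separator)
    chunks = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        # Sentences are joined whole, so none is ever cut in the middle.
        chunks.append(" ".join(sentences[start:end]))
    return chunks
```

With `buffer_size=0` each chunk is exactly one sentence. Larger values trade some precision for surrounding context.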

🔗 Semantic Chunking

Semantic chunking splits documents based on the semantic similarity of their content. It suits complex documents.

How It Works

Calculates semantic similarity between adjacent text segments
Splits at semantic change points
Maintains topic coherence

Best For

Long articles
Documents with diverse topics
Scenarios requiring context coherence

Configuration Parameters

Parameter	Description	Default
`breakpoint_threshold`	Semantic breakpoint threshold	0.5
`buffer_size`	Context buffer size	1
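
To make the breakpoint idea concrete, here is a minimal sketch under stated assumptions. A toy bag-of-words cosine similarity stands in for a real embedding model, `breakpoint_threshold` is read as "start a new chunk when the similarity of adjacent sentences falls below this value", and `buffer_size` is left out for brevity. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], breakpoint_threshold: float = 0.5) -> list[str]:
    """Group consecutive sentences, splitting where similarity drops."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bow_vector(prev), bow_vector(cur)) < breakpoint_threshold:
            groups.append([cur])    # semantic change point: start a new chunk
        else:
            groups[-1].append(cur)  # same topic: extend the current chunk
    return [" ".join(g) for g in groups]
```

Under this reading, raising the threshold produces more, smaller chunks, and lowering it merges more sentences into each chunk.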

⚙️ General Configuration

Chunk Size

Chunk size affects retrieval precision and recall:

Size (characters)	Pros	Cons
Small (200-300)	Precise matching	May lose context
Medium (400-600)	Balances precision and context; the general-purpose choice	Neither as precise as small chunks nor as context-rich as large ones
Large (800-1000)	Preserves more context	May include irrelevant content

Chunk Overlap

Overlap repeats text at chunk boundaries so that information spanning a boundary is not cut in half:

No overlap (0): Independent chunks, saves storage
Small overlap (20-50): Basic continuity
Large overlap (100+): Strong context preservation
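
The storage cost of overlap is easy to see with a fixed-size sliding window. This is a minimal sketch that assumes both values are measured in characters; the function name `sliding_window` is hypothetical.

```python
def sliding_window(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into fixed-size chunks whose edges overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


text = "x" * 2000
for overlap in (0, 50, 100):
    chunks = sliding_window(text, chunk_size=500, chunk_overlap=overlap)
    print(overlap, len(chunks), sum(len(c) for c in chunks))
```

For this 2000-character input, the stored text grows from 2000 characters with no overlap to 2200 with an overlap of 50 and 2400 with an overlap of 100.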

💡 Selection Recommendations

By Document Type

Document Type	Recommended Strategy	Reason
Technical docs	Smart Chunking	Preserve structure
FAQ	Sentence-based	Precise Q&A matching
Long articles	Semantic	Maintain topic coherence
Code docs	Smart Chunking	Identify code blocks

By Use Case

Scenario	Recommended Configuration
Precise Q&A	Sentence-based + small chunks
Knowledge retrieval	Smart Chunking + medium chunks
Context understanding	Semantic + large chunks
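
The use-case table can be captured as a small set of presets. This is an illustrative sketch only: the scenario keys, the strategy identifiers, and the exact sizes (picked from within the small, medium, and large ranges above) are assumptions, as is applying `chunk_size` to every strategy.

```python
# Hypothetical presets mirroring the "By Use Case" table.
PRESETS = {
    "precise_qa": {"strategy": "sentence", "chunk_size": 250},
    "knowledge_retrieval": {"strategy": "smart", "chunk_size": 500},
    "context_understanding": {"strategy": "semantic", "chunk_size": 900},
}


def recommend(scenario: str) -> dict:
    """Return a copy of the preset for a scenario, or raise on unknown names."""
    try:
        return dict(PRESETS[scenario])
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario!r}") from None
```

Treat these values as starting points and adjust them after checking retrieval results on your own documents.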

🔄 Re-chunking

If retrieval results are unsatisfactory, you can re-chunk existing documents:

Go to the knowledge base document list
Select documents to re-chunk
Click Re-index
Choose new chunking strategy and parameters
Confirm reprocessing

Re-chunking deletes the existing chunks and creates new ones.

🔗 Related Documentation

User Guide - Complete knowledge base guide
Document Management - Adding and managing documents
Configuring Retrievers - Retriever configuration guide
