Chunking Strategies

Chunking strategies determine how documents are split into smaller pieces for retrieval. Choosing the right chunking strategy can significantly improve retrieval quality.


📊 Strategy Overview

Strategy | Best For | Description
Smart Chunking | General documents | Auto-detect document structure
Sentence-based | Precise retrieval | Split by sentence boundaries
Semantic | Complex documents | Split by semantic similarity

🧠 Smart Chunking

Smart chunking is the default strategy. It detects document structure automatically and splits the document along those boundaries.

How It Works

  • Identifies paragraphs, headers, lists, and other structures
  • Maintains semantic integrity
  • Automatically adjusts chunk size
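The detection logic itself is not described here. As a rough illustration only, the sketch below assumes that blank lines mark structural boundaries (paragraphs, headers, lists) and packs whole blocks into chunks near a target size. The function name `smart_chunk` and the blank-line heuristic are assumptions, not the product's implementation.

```python
import re


def smart_chunk(text: str, chunk_size: int = 500) -> list[str]:
    """Pack whole blocks into chunks of roughly chunk_size characters.

    Hypothetical helper: blocks are separated by blank lines, and a block
    is never split, so a single oversized block becomes its own chunk.
    """
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    chunks: list[str] = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > chunk_size:
            # Adding this block would exceed the target, so close the chunk.
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because blocks are kept whole, a header stays attached to the paragraph that follows it whenever both fit within the target size.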

Best For

  • Structured documents (technical docs, reports)
  • Mixed content documents
  • Most general use cases

Configuration Parameters

Parameter | Description | Default
chunk_size | Target chunk size (characters) | 500
chunk_overlap | Overlap between chunks | 50

📝 Sentence-based Chunking

Sentence-based chunking splits documents at sentence boundaries. It suits scenarios that require precise retrieval.

How It Works

  • Identifies sentence boundaries (periods, question marks, exclamation marks)
  • Combines adjacent sentences into chunks
  • Maintains sentence integrity
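The steps above can be sketched in a few lines. This is a minimal illustration, assuming a sentence ends at one of the separator characters followed by whitespace. The function name and the `sentences_per_chunk` parameter are hypothetical and are not the same thing as the product's `buffer_size` setting.

```python
import re


def sentence_chunks(
    text: str, separators: str = ".!?", sentences_per_chunk: int = 2
) -> list[str]:
    """Split text after sentence-ending punctuation, then group sentences."""
    # Split at whitespace that directly follows one of the separator characters.
    pattern = rf"(?<=[{re.escape(separators)}])\s+"
    sentences = [s.strip() for s in re.split(pattern, text) if s.strip()]
    # Combine adjacent sentences; no sentence is ever cut in the middle.
    return [
        " ".join(sentences[i : i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
```

A punctuation-only rule like this will also split after abbreviations such as "e.g. ", so real sentence detection usually needs more than a regular expression.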

Best For

  • FAQ documents
  • Q&A content
  • Scenarios requiring precise matching

Configuration Parameters

Parameter | Description | Default
separator | Sentence separators | .!?
buffer_size | Sentence buffer count | 1

🔗 Semantic Chunking

Semantic chunking splits documents based on the semantic similarity of their content. It suits complex documents.

How It Works

  • Calculates semantic similarity between adjacent text
  • Splits at semantic change points
  • Maintains topic coherence
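To make the idea concrete, here is a small sketch under stated assumptions. It uses word-count vectors as a toy stand-in for real embeddings, and it assumes a new chunk starts when the cosine similarity between adjacent sentences drops below the threshold. Whether the product treats `breakpoint_threshold` as a similarity or a distance is not stated, and all function names here are hypothetical.

```python
import math
import re
from collections import Counter


def _vector(sentence: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str], breakpoint_threshold: float = 0.5
) -> list[list[str]]:
    """Start a new chunk where adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if _cosine(_vector(prev), _vector(curr)) < breakpoint_threshold:
            chunks.append([curr])  # semantic change point: open a new chunk
        else:
            chunks[-1].append(curr)  # same topic: extend the current chunk
    return chunks
```

Swapping `_vector` for a call to an embedding model keeps the rest of the logic unchanged.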

Best For

  • Long articles
  • Documents with diverse topics
  • Scenarios requiring context coherence

Configuration Parameters

Parameter | Description | Default
breakpoint_threshold | Semantic breakpoint threshold | 0.5
buffer_size | Context buffer size | 1

⚙️ General Configuration

Chunk Size

Chunk size affects retrieval precision and recall:

Size (characters) | Pros | Cons
Small (200-300) | Precise matching | May lose context
Medium (400-600) | Balances precision and context | General choice
Large (800-1000) | Preserves more context | May include irrelevant content

Chunk Overlap

Overlap keeps important information from being cut off at a chunk boundary:

  • No overlap (0): Independent chunks, saves storage
  • Small overlap (20-50): Basic continuity
  • Large overlap (100+): Strong context preservation
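The effect of overlap is easiest to see with fixed-size windows. The sketch below assumes overlap is measured in characters, like `chunk_size`; the function name is hypothetical. Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the tail of one chunk is repeated at the head of the next.

```python
def sliding_chunks(
    text: str, chunk_size: int = 500, chunk_overlap: int = 50
) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

For a 10-character text with `chunk_size=4`, an overlap of 0 yields 3 chunks while an overlap of 2 yields 4, which shows the storage cost that comes with stronger continuity.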

💡 Selection Recommendations

By Document Type

Document Type | Recommended Strategy | Reason
Technical docs | Smart Chunking | Preserve structure
FAQ | Sentence-based | Precise Q&A matching
Long articles | Semantic | Maintain topic coherence
Code docs | Smart Chunking | Identify code blocks

By Use Case

Scenario | Recommended Configuration
Precise Q&A | Sentence-based + small chunks
Knowledge retrieval | Smart Chunking + medium chunks
Context understanding | Semantic + large chunks
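If you script your setup, the use-case recommendations can be captured as a lookup table. This is only a sketch: the scenario keys, the strategy identifiers, and the exact `chunk_size` values (picked from within the small, medium, and large ranges) are assumptions made for illustration.

```python
# Hypothetical presets mirroring the use-case recommendations.
# Keys, strategy identifiers, and exact sizes are illustrative assumptions.
PRESETS: dict[str, dict] = {
    "precise_qa": {"strategy": "sentence", "chunk_size": 250},
    "knowledge_retrieval": {"strategy": "smart", "chunk_size": 500},
    "context_understanding": {"strategy": "semantic", "chunk_size": 900},
}


def recommend(scenario: str) -> dict:
    """Return a copy of the preset for a scenario.

    Unknown scenarios fall back to smart chunking with medium chunks,
    since smart chunking is the default strategy.
    """
    return dict(PRESETS.get(scenario, PRESETS["knowledge_retrieval"]))
```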

🔄 Re-chunking

If retrieval results are unsatisfactory, you can re-chunk the affected documents:

  1. Go to the knowledge base document list
  2. Select documents to re-chunk
  3. Click Re-index
  4. Choose new chunking strategy and parameters
  5. Confirm reprocessing

Re-chunking deletes the old chunks and creates new ones.