Documentation
Overview
Package textsplitter provides text splitting utilities for chunking documents. It includes code-aware splitters that respect language syntax boundaries.
Index
- Constants
- Variables
- func TruncateParentText(text string, maxLen int) string
- type ChunkType
- type CodeAwareTextSplitter
- func (c *CodeAwareTextSplitter) ChunkFileWithFileInfo(ctx context.Context, content, filePath, modelName string, fileInfo fs.FileInfo, ...) ([]schema.CodeChunk, error)
- func (c *CodeAwareTextSplitter) EnrichChunkWithContext(ctx context.Context, chunk model.CodeChunk, fileContent string, ...) model.CodeChunk
- func (c *CodeAwareTextSplitter) GetRecommendedChunkSize(ctx context.Context, filePath, modelName string, contentLength int) int
- func (c *CodeAwareTextSplitter) SplitDocuments(ctx context.Context, docs []schema.Document) ([]schema.Document, error)
- func (c *CodeAwareTextSplitter) ValidateChunkingOptions(opts *model.CodeChunkingOptions) error
- type Option
- func WithChunkOverlap(overlap int) Option
- func WithChunkSize(size int) Option
- func WithEstimationRatio(ratio float64) Option
- func WithMaxChunkSize(size int) Option
- func WithMinChunkSize(size int) Option
- func WithModelName(name string) Option
- func WithParentContextConfig(config ParentContextConfig) Option
- type ParentContextConfig
- type RecursiveCharacter
- type TextSplitter
- type Tokenizer
Constants

const (
	// MaxParentTextLength defines the default limit for parent context storage.
	MaxParentTextLength = 2000

	// DefaultChunkSize is the fallback size if not provided.
	DefaultChunkSize = 2048
)
Variables

var (
	// ErrInvalidChunkSize is returned when the chunk size is invalid.
	ErrInvalidChunkSize = errors.New("invalid chunk size")

	// ErrEmptyContent is returned when the content is empty or whitespace only.
	ErrEmptyContent = errors.New("content is empty or contains only whitespace")

	// ErrTokenizerNotConfigured is returned when a tokenizer is required but not configured.
	ErrTokenizerNotConfigured = errors.New("tokenizer service is not configured")

	// ErrModelRequired is returned when a model name is required but not provided.
	ErrModelRequired = errors.New("model name is required")
)

Error variables for text splitting.
Functions
func TruncateParentText added in v0.15.0
func TruncateParentText(text string, maxLen int) string
TruncateParentText reduces text length while preserving start and end context.
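The documented behavior (preserve start and end context) suggests middle truncation. The sketch below illustrates that idea; the "..." marker and the exact head/tail split are assumptions of this example, not the package's actual implementation.

```go
package main

import "fmt"

// truncateMiddle illustrates the likely shape of TruncateParentText:
// when text exceeds maxLen, keep the beginning and the end and drop
// the middle, marking the cut with "...".
func truncateMiddle(text string, maxLen int) string {
	if len(text) <= maxLen {
		return text
	}
	marker := "..."
	if maxLen <= len(marker) {
		return text[:maxLen]
	}
	keep := maxLen - len(marker)
	head := keep / 2
	tail := keep - head
	return text[:head] + marker + text[len(text)-tail:]
}

func main() {
	fmt.Println(truncateMiddle("short", 10))                      // unchanged
	fmt.Println(truncateMiddle("abcdefghijklmnopqrstuvwxyz", 11)) // head + "..." + tail
}
```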
Types
type ChunkType
type ChunkType string
ChunkType represents the type of content in a chunk.
const (
	// ChunkTypeFunction represents a function or method chunk.
	ChunkTypeFunction ChunkType = "function"
	// ChunkTypeClass represents a class or struct chunk.
	ChunkTypeClass ChunkType = "class"
	// ChunkTypeImports represents an import block chunk.
	ChunkTypeImports ChunkType = "imports"
	// ChunkTypeComment represents a comment block chunk.
	ChunkTypeComment ChunkType = "comment"
	// ChunkTypeCode represents a generic code block chunk.
	ChunkTypeCode ChunkType = "code"
	// ChunkTypeText represents a text block chunk.
	ChunkTypeText ChunkType = "text"
)
Chunk type constants.
type CodeAwareTextSplitter
type CodeAwareTextSplitter struct {
// contains filtered or unexported fields
}
func NewCodeAware
func NewCodeAware(
	registry parsers.ParserRegistry,
	tokenizer Tokenizer,
	logger *slog.Logger,
	opts ...Option,
) (*CodeAwareTextSplitter, error)
NewCodeAware creates a new CodeAwareTextSplitter configured with the given parser registry, tokenizer, and logger.
func (*CodeAwareTextSplitter) ChunkFileWithFileInfo
func (c *CodeAwareTextSplitter) ChunkFileWithFileInfo(ctx context.Context, content, filePath, modelName string, fileInfo fs.FileInfo, ...) ([]schema.CodeChunk, error)
func (*CodeAwareTextSplitter) EnrichChunkWithContext
func (c *CodeAwareTextSplitter) EnrichChunkWithContext(
	ctx context.Context,
	chunk model.CodeChunk,
	fileContent string,
	metadata model.FileMetadata,
	parentChunks []model.CodeChunk,
	modelName string,
) model.CodeChunk
EnrichChunkWithContext adds file and hierarchical context to a chunk.
func (*CodeAwareTextSplitter) GetRecommendedChunkSize
func (c *CodeAwareTextSplitter) GetRecommendedChunkSize(ctx context.Context, filePath, modelName string, contentLength int) int
GetRecommendedChunkSize returns a recommended chunk size based on file type and content length.
func (*CodeAwareTextSplitter) SplitDocuments
func (c *CodeAwareTextSplitter) SplitDocuments(ctx context.Context, docs []schema.Document) ([]schema.Document, error)
SplitDocuments takes a slice of documents and returns a new slice with split content.
func (*CodeAwareTextSplitter) ValidateChunkingOptions
func (c *CodeAwareTextSplitter) ValidateChunkingOptions(opts *model.CodeChunkingOptions) error
ValidateChunkingOptions validates the provided chunking options for correctness.
type Option
type Option func(*options)
Option is a function type for configuring the splitter.
func WithChunkOverlap
func WithChunkOverlap(overlap int) Option
WithChunkOverlap sets the number of overlapping tokens between chunks.
func WithChunkSize
func WithChunkSize(size int) Option
WithChunkSize sets the target chunk size in tokens.
func WithEstimationRatio
func WithEstimationRatio(ratio float64) Option
WithEstimationRatio sets the character-to-token estimation ratio, used when a tokenizer is not available. The default is 4.0 (4 characters per token).
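The documented default ratio means roughly one token per 4 characters. A minimal sketch of that estimation, assuming ceiling rounding (the package's actual rounding is not documented):

```go
package main

import (
	"fmt"
	"math"
)

// estimateTokens approximates a token count from character length using
// the character-to-token ratio that WithEstimationRatio configures.
// Rounding up (ceil) is an assumption of this sketch.
func estimateTokens(text string, ratio float64) int {
	if ratio <= 0 {
		ratio = 4.0 // documented default
	}
	return int(math.Ceil(float64(len(text)) / ratio))
}

func main() {
	fmt.Println(estimateTokens("hello world, this is text", 4.0)) // 25 chars -> 7 tokens
}
```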
func WithMaxChunkSize
func WithMaxChunkSize(size int) Option
WithMaxChunkSize sets the maximum chunk size in tokens. Chunks larger than this will be split further.
func WithMinChunkSize
func WithMinChunkSize(size int) Option
WithMinChunkSize sets the minimum number of characters for a chunk to be valid. Chunks smaller than this may be merged with adjacent content.
func WithModelName
func WithModelName(name string) Option
WithModelName sets the model name for token-aware splitting. When set, the splitter uses the model's tokenizer for accurate chunk sizing.
func WithParentContextConfig added in v0.15.0
func WithParentContextConfig(config ParentContextConfig) Option
WithParentContextConfig sets the parent context configuration. When enabled, chunks include context from their parent code structure.
type ParentContextConfig added in v0.15.0
type RecursiveCharacter added in v0.15.0
type RecursiveCharacter struct {
// contains filtered or unexported fields
}
RecursiveCharacter is a text splitter that recursively tries to split text using a list of separators. It aims to keep semantically related parts of the text together as long as possible.
func NewRecursiveCharacter added in v0.15.0
func NewRecursiveCharacter(opts ...Option) *RecursiveCharacter
NewRecursiveCharacter creates a new RecursiveCharacter text splitter.
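To make the "recursively tries a list of separators" idea concrete, here is a self-contained sketch of the classic algorithm. The separator list, character-based (not token-based) sizing, and the absence of overlap handling are simplifying assumptions; the package's real splitter is configured through Options.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRecursive sketches the RecursiveCharacter strategy: find the first
// separator present in the text, split on it, recursively re-split any
// oversized piece with the remaining separators, then greedily merge small
// neighbors back together so chunks stay close to (but never exceed)
// chunkSize. This keeps semantically related text together when possible.
func splitRecursive(text string, separators []string, chunkSize int) []string {
	if len(text) <= chunkSize {
		return []string{text}
	}
	sep := ""
	var rest []string
	for i, s := range separators {
		if strings.Contains(text, s) {
			sep, rest = s, separators[i+1:]
			break
		}
	}
	if sep == "" {
		// No separator applies: fall back to hard cuts at chunkSize.
		var out []string
		for len(text) > chunkSize {
			out = append(out, text[:chunkSize])
			text = text[chunkSize:]
		}
		if text != "" {
			out = append(out, text)
		}
		return out
	}
	var pieces []string
	for _, p := range strings.Split(text, sep) {
		if p == "" {
			continue
		}
		if len(p) > chunkSize {
			pieces = append(pieces, splitRecursive(p, rest, chunkSize)...)
		} else {
			pieces = append(pieces, p)
		}
	}
	// Greedy merge: rejoin neighbors while the result still fits.
	var out []string
	cur := ""
	for _, p := range pieces {
		merged := p
		if cur != "" {
			merged = cur + sep + p
		}
		if len(merged) <= chunkSize {
			cur = merged
		} else {
			if cur != "" {
				out = append(out, cur)
			}
			cur = p
		}
	}
	if cur != "" {
		out = append(out, cur)
	}
	return out
}

func main() {
	seps := []string{"\n\n", "\n", " "}
	for _, c := range splitRecursive("para one\n\npara two is a bit longer\n\npara three", seps, 16) {
		fmt.Printf("%q\n", c)
	}
}
```

Paragraph breaks are tried before line breaks and spaces, so whole paragraphs survive intact whenever they fit within the chunk size.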
type TextSplitter
type TextSplitter interface {
// SplitDocuments splits the input documents into smaller chunks.
SplitDocuments(ctx context.Context, docs []schema.Document) ([]schema.Document, error)
}
TextSplitter is the interface for splitting documents into smaller chunks. Implementations can use various strategies like character-based, token-based, or code-aware splitting.
type Tokenizer
type Tokenizer interface {
// CountTokens returns the exact number of tokens in the text.
CountTokens(ctx context.Context, modelName, text string) int
// EstimateTokens returns an estimated token count (faster but less accurate).
EstimateTokens(ctx context.Context, modelName, text string) int
// SplitTextByTokens splits text into chunks that fit within maxTokens.
SplitTextByTokens(ctx context.Context, modelName, text string, maxTokens int) ([]string, error)
// GetRecommendedChunkSize returns the recommended chunk size for the model.
GetRecommendedChunkSize(ctx context.Context, modelName string) int
// GetOptimalOverlapTokens returns the optimal overlap for chunking.
GetOptimalOverlapTokens(ctx context.Context, modelName string) int
// GetMaxContextWindow returns the maximum context window for the model.
GetMaxContextWindow(ctx context.Context, modelName string) int
}
Tokenizer is the interface for token counting and text splitting. Implementations provide accurate token counts for specific LLM models.