textsplitter

package
v0.31.0
Warning: This package is not in the latest version of its module.
Published: Mar 4, 2026 License: MIT Imports: 15 Imported by: 0

Documentation

Overview

Package textsplitter provides text splitting utilities for chunking documents. It includes code-aware splitters that respect language syntax boundaries.

Index

Constants

const (
	// MaxParentTextLength is the default character limit for stored parent context.
	MaxParentTextLength = 2000
	// DefaultChunkSize is the fallback chunk size when none is provided.
	DefaultChunkSize = 2048
)

Variables

var (
	// ErrInvalidChunkSize is returned when the chunk size is invalid.
	ErrInvalidChunkSize = errors.New("invalid chunk size")
	// ErrEmptyContent is returned when the content is empty or whitespace only.
	ErrEmptyContent = errors.New("content is empty or contains only whitespace")
	// ErrTokenizerNotConfigured is returned when a tokenizer is required but not configured.
	ErrTokenizerNotConfigured = errors.New("tokenizer service is not configured")
	// ErrModelRequired is returned when a model name is required but not provided.
	ErrModelRequired = errors.New("model name is required")
)

Error variables for text splitting.

Functions

func TruncateParentText added in v0.15.0

func TruncateParentText(text string, maxLen int) string

TruncateParentText reduces text length while preserving start and end context.
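A minimal sketch of the behavior described above: keep the beginning and end of the text and elide the middle. The `"\n...\n"` marker and the even head/tail split are assumptions for illustration, not the package's actual implementation.

```go
package main

import "fmt"

// truncateParentText keeps the start and end of text and elides the
// middle once the length exceeds maxLen. Marker and split ratio are
// assumptions; the real TruncateParentText may differ.
func truncateParentText(text string, maxLen int) string {
	if maxLen <= 0 || len(text) <= maxLen {
		return text
	}
	const marker = "\n...\n"
	if maxLen <= len(marker) {
		return text[:maxLen]
	}
	keep := maxLen - len(marker)
	head := keep / 2
	tail := keep - head
	return text[:head] + marker + text[len(text)-tail:]
}

func main() {
	fmt.Println(truncateParentText("func parent() {\n\t// long body elided for the example\n}", 24))
}
```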

Types

type ChunkType

type ChunkType string

ChunkType represents the type of content in a chunk.

const (
	// ChunkTypeFunction represents a function or method chunk.
	ChunkTypeFunction ChunkType = "function"
	// ChunkTypeClass represents a class or struct chunk.
	ChunkTypeClass ChunkType = "class"
	// ChunkTypeImports represents an import block chunk.
	ChunkTypeImports ChunkType = "imports"
	// ChunkTypeComment represents a comment block chunk.
	ChunkTypeComment ChunkType = "comment"
	// ChunkTypeCode represents a generic code block chunk.
	ChunkTypeCode ChunkType = "code"
	// ChunkTypeText represents a text block chunk.
	ChunkTypeText ChunkType = "text"
)

Chunk type constants.

type CodeAwareTextSplitter

type CodeAwareTextSplitter struct {
	// contains filtered or unexported fields
}

func NewCodeAware

func NewCodeAware(
	registry parsers.ParserRegistry,
	tokenizer Tokenizer,
	logger *slog.Logger,
	opts ...Option,
) (*CodeAwareTextSplitter, error)

func (*CodeAwareTextSplitter) ChunkFileWithFileInfo

func (c *CodeAwareTextSplitter) ChunkFileWithFileInfo(
	ctx context.Context,
	content, filePath, modelName string,
	fileInfo fs.FileInfo,
	opts *schema.CodeChunkingOptions,
) ([]schema.CodeChunk, error)

func (*CodeAwareTextSplitter) EnrichChunkWithContext

func (c *CodeAwareTextSplitter) EnrichChunkWithContext(
	ctx context.Context,
	chunk model.CodeChunk,
	fileContent string,
	metadata model.FileMetadata,
	parentChunks []model.CodeChunk,
	modelName string,
) model.CodeChunk

EnrichChunkWithContext adds file and hierarchical context to a chunk.

func (*CodeAwareTextSplitter) GetRecommendedChunkSize

func (c *CodeAwareTextSplitter) GetRecommendedChunkSize(ctx context.Context, filePath, modelName string, contentLength int) int

GetRecommendedChunkSize returns a recommended chunk size based on file type and content length.

func (*CodeAwareTextSplitter) SplitDocuments

func (c *CodeAwareTextSplitter) SplitDocuments(ctx context.Context, docs []schema.Document) ([]schema.Document, error)

SplitDocuments takes a slice of documents and returns a new slice with split content.

func (*CodeAwareTextSplitter) ValidateChunkingOptions

func (c *CodeAwareTextSplitter) ValidateChunkingOptions(opts *model.CodeChunkingOptions) error

ValidateChunkingOptions validates the provided chunking options for correctness.

type Option

type Option func(*options)

Option is a function type for configuring the splitter.
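This is the standard Go functional-options pattern. The sketch below shows how a constructor typically folds `Option` values over defaults; the `options` struct fields and the constructor shape are assumptions, since the real struct is unexported.

```go
package main

import "fmt"

// options stands in for the package's unexported options struct;
// its fields are assumptions chosen to match the With* functions below.
type options struct {
	chunkSize    int
	chunkOverlap int
}

// Option mutates the configuration struct.
type Option func(*options)

func WithChunkSize(size int) Option { return func(o *options) { o.chunkSize = size } }

func WithChunkOverlap(n int) Option { return func(o *options) { o.chunkOverlap = n } }

// apply folds the supplied options over defaults, as a constructor
// like NewCodeAware or NewRecursiveCharacter plausibly does.
func apply(opts ...Option) options {
	o := options{chunkSize: 2048} // DefaultChunkSize from the constants above
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	cfg := apply(WithChunkSize(512), WithChunkOverlap(64))
	fmt.Println(cfg.chunkSize, cfg.chunkOverlap)
}
```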

func WithChunkOverlap

func WithChunkOverlap(overlap int) Option

WithChunkOverlap sets the number of overlapping tokens between chunks.

func WithChunkSize

func WithChunkSize(size int) Option

WithChunkSize sets the target chunk size in tokens.

func WithEstimationRatio

func WithEstimationRatio(ratio float64) Option

WithEstimationRatio sets the character-to-token estimation ratio. Used when a tokenizer is not available. Default is 4.0 (4 chars per token).
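The estimation arithmetic this implies is simply characters divided by the ratio. A sketch, assuming truncating integer conversion and a fallback to the documented 4.0 default for non-positive ratios:

```go
package main

import "fmt"

// estimateTokens approximates a token count from character length.
// The rounding behavior is an assumption, not the package's actual code.
func estimateTokens(text string, ratio float64) int {
	if ratio <= 0 {
		ratio = 4.0 // documented default: 4 characters per token
	}
	return int(float64(len(text)) / ratio)
}

func main() {
	fmt.Println(estimateTokens("hello world, this is a test.", 4.0))
}
```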

func WithMaxChunkSize

func WithMaxChunkSize(size int) Option

WithMaxChunkSize sets the maximum chunk size in tokens. Chunks larger than this will be split further.

func WithMinChunkSize

func WithMinChunkSize(size int) Option

WithMinChunkSize sets the minimum number of characters for a chunk to be valid. Chunks smaller than this may be merged with adjacent content.

func WithModelName

func WithModelName(name string) Option

WithModelName sets the model name for token-aware splitting. When set, the splitter uses the model's tokenizer for accurate chunk sizing.

func WithParentContextConfig added in v0.15.0

func WithParentContextConfig(config ParentContextConfig) Option

WithParentContextConfig sets the parent context configuration. When enabled, chunks include context from their parent code structure.

type ParentContextConfig added in v0.15.0

type ParentContextConfig struct {
	Enabled       bool
	MaxTextLength int
}

type RecursiveCharacter added in v0.15.0

type RecursiveCharacter struct {
	// contains filtered or unexported fields
}

RecursiveCharacter is a text splitter that recursively tries to split text using a list of separators. It aims to keep semantically related parts of the text together as long as possible.

func NewRecursiveCharacter added in v0.15.0

func NewRecursiveCharacter(opts ...Option) *RecursiveCharacter

NewRecursiveCharacter creates a new RecursiveCharacter text splitter.

func (*RecursiveCharacter) SplitText added in v0.15.0

func (s *RecursiveCharacter) SplitText(_ context.Context, text string) ([]string, error)

SplitText splits a single text document into multiple chunks.
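The recursive strategy can be sketched as follows: try each separator in order, split on the first one present, and recurse with the remaining separators into any piece that is still too large. The separator list and character-based size check are assumptions, and the merge step (re-joining adjacent small pieces up to the chunk size, which the real splitter needs to keep related text together) is omitted for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRecursive splits text on the first separator it contains,
// recursing into oversized pieces with the remaining separators.
// The empty-string separator is the last resort: a hard split by size.
func splitRecursive(text string, separators []string, chunkSize int) []string {
	if len(text) <= chunkSize {
		return []string{text}
	}
	for i, sep := range separators {
		if !strings.Contains(text, sep) {
			continue
		}
		if sep == "" {
			var parts []string
			for len(text) > chunkSize {
				parts = append(parts, text[:chunkSize])
				text = text[chunkSize:]
			}
			return append(parts, text)
		}
		var out []string
		for _, piece := range strings.Split(text, sep) {
			if len(piece) > chunkSize {
				out = append(out, splitRecursive(piece, separators[i+1:], chunkSize)...)
			} else if piece != "" {
				out = append(out, piece)
			}
		}
		return out
	}
	return []string{text}
}

func main() {
	chunks := splitRecursive("para one.\n\npara two is a bit longer here.",
		[]string{"\n\n", "\n", " ", ""}, 20)
	for _, c := range chunks {
		fmt.Printf("%q\n", c)
	}
}
```

Note how the short paragraph survives intact while the oversized one falls through to the word-level separator.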

type TextSplitter

type TextSplitter interface {
	// SplitDocuments splits the input documents into smaller chunks.
	SplitDocuments(ctx context.Context, docs []schema.Document) ([]schema.Document, error)
}

TextSplitter is the interface for splitting documents into smaller chunks. Implementations can use various strategies like character-based, token-based, or code-aware splitting.

type Tokenizer

type Tokenizer interface {
	// CountTokens returns the exact number of tokens in the text.
	CountTokens(ctx context.Context, modelName, text string) int
	// EstimateTokens returns an estimated token count (faster but less accurate).
	EstimateTokens(ctx context.Context, modelName, text string) int
	// SplitTextByTokens splits text into chunks that fit within maxTokens.
	SplitTextByTokens(ctx context.Context, modelName, text string, maxTokens int) ([]string, error)
	// GetRecommendedChunkSize returns the recommended chunk size for the model.
	GetRecommendedChunkSize(ctx context.Context, modelName string) int
	// GetOptimalOverlapTokens returns the optimal overlap for chunking.
	GetOptimalOverlapTokens(ctx context.Context, modelName string) int
	// GetMaxContextWindow returns the maximum context window for the model.
	GetMaxContextWindow(ctx context.Context, modelName string) int
}

Tokenizer is the interface for token counting and text splitting. Implementations provide accurate token counts for specific LLM models.
