Documentation ¶
Index ¶
- Variables
- type ChunkType
- type CodeAwareTextSplitter
- func (c *CodeAwareTextSplitter) ChunkFileWithFileInfo(ctx context.Context, content, filePath, modelName string, fileInfo fs.FileInfo, ...) ([]schema.CodeChunk, error)
- func (c *CodeAwareTextSplitter) EnrichChunkWithContext(ctx context.Context, chunk model.CodeChunk, fileContent string, ...) model.CodeChunk
- func (c *CodeAwareTextSplitter) GetRecommendedChunkSize(ctx context.Context, filePath, modelName string, contentLength int) int
- func (c *CodeAwareTextSplitter) SplitDocuments(ctx context.Context, docs []schema.Document) ([]schema.Document, error)
- func (c *CodeAwareTextSplitter) ValidateChunkingOptions(opts *model.CodeChunkingOptions) error
- type Option
- type TextSplitter
- type Tokenizer
Constants ¶
This section is empty.
Variables ¶
Functions ¶
This section is empty.
Types ¶
type CodeAwareTextSplitter ¶
type CodeAwareTextSplitter struct {
// contains filtered or unexported fields
}
CodeAwareTextSplitter implements intelligent chunking using language-specific parsers.
func NewCodeAware ¶
func NewCodeAware(registry parsers.ParserRegistry, tokenizer Tokenizer, logger *slog.Logger, opts ...Option) (*CodeAwareTextSplitter, error)
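NewCodeAware creates a CodeAwareTextSplitter from a parser registry, a tokenizer, and a logger, then applies any Options passed in opts.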
func (*CodeAwareTextSplitter) ChunkFileWithFileInfo ¶
func (c *CodeAwareTextSplitter) ChunkFileWithFileInfo(ctx context.Context, content, filePath, modelName string, fileInfo fs.FileInfo, opts *schema.CodeChunkingOptions) ([]schema.CodeChunk, error)
ChunkFileWithFileInfo chunks content, using the supplied file info to improve language detection.
func (*CodeAwareTextSplitter) EnrichChunkWithContext ¶
func (c *CodeAwareTextSplitter) EnrichChunkWithContext(ctx context.Context, chunk model.CodeChunk, fileContent string, metadata model.FileMetadata, parentChunks []model.CodeChunk, modelName string) model.CodeChunk
EnrichChunkWithContext adds file and hierarchical context to a chunk.
func (*CodeAwareTextSplitter) GetRecommendedChunkSize ¶
func (c *CodeAwareTextSplitter) GetRecommendedChunkSize(ctx context.Context, filePath, modelName string, contentLength int) int
GetRecommendedChunkSize returns a recommended chunk size based on file type and content length.
func (*CodeAwareTextSplitter) SplitDocuments ¶
func (c *CodeAwareTextSplitter) SplitDocuments(ctx context.Context, docs []schema.Document) ([]schema.Document, error)
SplitDocuments is the primary public method. It splits the given documents into chunks and returns them as documents, or returns an error.
func (*CodeAwareTextSplitter) ValidateChunkingOptions ¶
func (c *CodeAwareTextSplitter) ValidateChunkingOptions(opts *model.CodeChunkingOptions) error
ValidateChunkingOptions checks the provided chunking options and returns an error if they are invalid.
type Option ¶
type Option func(*options)
Option is a function type for configuring the splitter.
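The Option type follows Go's functional options pattern. The sketch below shows how such options are typically applied over defaults. The fields of options, the default values, the lowercase helper names, and the overlap check are all assumptions made for illustration, since the package's options struct is unexported and not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// options is a hypothetical stand-in for the package's unexported struct.
// The real field names and defaults are not documented.
type options struct {
	maxChunkSize int
	minChunkSize int
	chunkOverlap int
}

// Option mirrors the documented type: a function that mutates *options.
type Option func(*options)

// withMaxChunkSize and withChunkOverlap are illustrative helpers, not the
// package's exported WithMaxChunkSize and WithChunkOverlap.
func withMaxChunkSize(n int) Option { return func(o *options) { o.maxChunkSize = n } }
func withChunkOverlap(n int) Option { return func(o *options) { o.chunkOverlap = n } }

// newOptions starts from assumed defaults, applies each Option in order,
// and rejects an overlap that is not smaller than the maximum chunk size.
func newOptions(opts ...Option) (options, error) {
	o := options{maxChunkSize: 1000, minChunkSize: 50, chunkOverlap: 100}
	for _, opt := range opts {
		opt(&o)
	}
	if o.chunkOverlap >= o.maxChunkSize {
		return options{}, errors.New("chunk overlap must be smaller than max chunk size")
	}
	return o, nil
}

func main() {
	o, err := newOptions(withMaxChunkSize(512), withChunkOverlap(64))
	fmt.Println(o, err)
}
```

Because options are applied in order, a later option overrides an earlier one that sets the same field.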
func WithChunkOverlap ¶
WithChunkOverlap sets the chunk overlap.
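To illustrate what a chunk overlap means in general, here is a minimal sliding-window sketch. It measures size and overlap in runes, which is an assumption: the source does not say whether this package counts characters or tokens, and splitWithOverlap is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// splitWithOverlap cuts text into windows of at most size runes. Each window
// starts size-overlap runes after the previous one, so consecutive windows
// share overlap runes. It returns nil for invalid parameters.
func splitWithOverlap(text string, size, overlap int) []string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil
	}
	runes := []rune(text)
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the last window already reached the end of the text
		}
	}
	return chunks
}

func main() {
	fmt.Println(splitWithOverlap("abcdefghij", 4, 2))
}
```

A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks.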
func WithEstimationRatio ¶
WithEstimationRatio sets the character-to-token estimation ratio.
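A character-to-token ratio is commonly used to approximate token counts without running a real tokenizer. The sketch below assumes the ratio means "characters per token" and rounds up. The source does not state the direction of the ratio or the rounding rule, and estimateTokens is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
	"unicode/utf8"
)

// estimateTokens approximates a token count as characters / charsPerToken,
// rounded up. It returns 0 when the ratio is not positive.
func estimateTokens(text string, charsPerToken float64) int {
	if charsPerToken <= 0 {
		return 0
	}
	n := utf8.RuneCountInString(text)
	return int(math.Ceil(float64(n) / charsPerToken))
}

func main() {
	// 9 characters at an assumed 4 characters per token.
	fmt.Println(estimateTokens("func main", 4))
}
```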
func WithMaxChunkSize ¶
WithMaxChunkSize sets the maximum size of a chunk.
func WithMinChunkSize ¶
WithMinChunkSize sets the minimum number of characters for a chunk to be valid.
func WithModelName ¶
WithModelName sets the model name for token-aware splitting.
type TextSplitter ¶
type Tokenizer ¶
type Tokenizer interface {
CountTokens(ctx context.Context, modelName, text string) int
EstimateTokens(ctx context.Context, modelName, text string) int
SplitTextByTokens(ctx context.Context, modelName, text string, maxTokens int) ([]string, error)
GetRecommendedChunkSize(ctx context.Context, modelName string) int
GetOptimalOverlapTokens(ctx context.Context, modelName string) int
GetMaxContextWindow(ctx context.Context, modelName string) int
}
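Tokenizer is an interface for components that can count and estimate tokens, split text to fit a token limit, and report per-model sizing hints: recommended chunk size, overlap, and maximum context window.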
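As a rough illustration of what satisfying this interface involves, the sketch below implements it with a naive tokenizer that treats each whitespace-separated word as one token. The wordTokenizer type and the constants it returns are hypothetical. A real implementation would use a model-specific tokenizer, and the interface is redeclared locally only so the example compiles on its own.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Tokenizer is copied from the documented interface.
type Tokenizer interface {
	CountTokens(ctx context.Context, modelName, text string) int
	EstimateTokens(ctx context.Context, modelName, text string) int
	SplitTextByTokens(ctx context.Context, modelName, text string, maxTokens int) ([]string, error)
	GetRecommendedChunkSize(ctx context.Context, modelName string) int
	GetOptimalOverlapTokens(ctx context.Context, modelName string) int
	GetMaxContextWindow(ctx context.Context, modelName string) int
}

// wordTokenizer is a hypothetical implementation: one word equals one token.
type wordTokenizer struct{}

func (wordTokenizer) CountTokens(_ context.Context, _, text string) int {
	return len(strings.Fields(text))
}

// EstimateTokens reuses the exact count because counting words is already cheap.
func (w wordTokenizer) EstimateTokens(ctx context.Context, modelName, text string) int {
	return w.CountTokens(ctx, modelName, text)
}

// SplitTextByTokens groups words into parts of at most maxTokens words each.
func (wordTokenizer) SplitTextByTokens(_ context.Context, _, text string, maxTokens int) ([]string, error) {
	if maxTokens <= 0 {
		return nil, fmt.Errorf("maxTokens must be positive, got %d", maxTokens)
	}
	words := strings.Fields(text)
	var parts []string
	for len(words) > 0 {
		n := maxTokens
		if n > len(words) {
			n = len(words)
		}
		parts = append(parts, strings.Join(words[:n], " "))
		words = words[n:]
	}
	return parts, nil
}

// The sizing hints below are arbitrary placeholder values.
func (wordTokenizer) GetRecommendedChunkSize(context.Context, string) int { return 256 }
func (wordTokenizer) GetOptimalOverlapTokens(context.Context, string) int { return 32 }
func (wordTokenizer) GetMaxContextWindow(context.Context, string) int     { return 4096 }

// Compile-time check that wordTokenizer satisfies Tokenizer.
var _ Tokenizer = wordTokenizer{}

func main() {
	var tok Tokenizer = wordTokenizer{}
	parts, err := tok.SplitTextByTokens(context.Background(), "any-model", "a b c d e", 2)
	fmt.Println(parts, err)
}
```

A stub like this can also serve as a test double when exercising code that depends on a Tokenizer.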