Documentation ¶
Index ¶
- Variables
- type ChunkType
- type CodeAwareTextSplitter
- func (c *CodeAwareTextSplitter) ChunkFileWithFileInfo(ctx context.Context, content, filePath, modelName string, fileInfo fs.FileInfo, ...) ([]schema.CodeChunk, error)
- func (c *CodeAwareTextSplitter) EnrichChunkWithContext(ctx context.Context, chunk model.CodeChunk, fileContent string, ...) model.CodeChunk
- func (c *CodeAwareTextSplitter) GetRecommendedChunkSize(ctx context.Context, filePath, modelName string, contentLength int) int
- func (c *CodeAwareTextSplitter) SplitDocuments(ctx context.Context, docs []schema.Document) ([]schema.Document, error)
- func (c *CodeAwareTextSplitter) ValidateChunkingOptions(opts *model.CodeChunkingOptions) error
- type Option
- type RecursiveCharacter
- type TextSplitter
- type Tokenizer
Constants ¶
This section is empty.
Variables ¶
Functions ¶
This section is empty.
Types ¶
type CodeAwareTextSplitter ¶
type CodeAwareTextSplitter struct {
// contains filtered or unexported fields
}
func NewCodeAware ¶
func NewCodeAware(registry parsers.ParserRegistry, tokenizer Tokenizer, logger *slog.Logger, opts ...Option) (*CodeAwareTextSplitter, error)
NewCodeAware creates a new CodeAwareTextSplitter.
func (*CodeAwareTextSplitter) ChunkFileWithFileInfo ¶
func (c *CodeAwareTextSplitter) ChunkFileWithFileInfo(ctx context.Context, content, filePath, modelName string, fileInfo fs.FileInfo, ...) ([]schema.CodeChunk, error)
func (*CodeAwareTextSplitter) EnrichChunkWithContext ¶
func (c *CodeAwareTextSplitter) EnrichChunkWithContext(ctx context.Context, chunk model.CodeChunk, fileContent string, metadata model.FileMetadata, parentChunks []model.CodeChunk, modelName string) model.CodeChunk
EnrichChunkWithContext adds file and hierarchical context to a chunk.
func (*CodeAwareTextSplitter) GetRecommendedChunkSize ¶
func (c *CodeAwareTextSplitter) GetRecommendedChunkSize(ctx context.Context, filePath, modelName string, contentLength int) int
GetRecommendedChunkSize returns a recommended chunk size based on file type and content length.
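A heuristic of this kind might combine a per-file-type base size with a clamp on short files. The extensions, base sizes, and clamping rule below are illustrative assumptions, not the package's actual logic:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// recommendedChunkSize is a hypothetical sketch of a file-type-aware
// chunk-size recommendation. All numbers here are assumed defaults.
func recommendedChunkSize(filePath string, contentLength int) int {
	base := 1500 // assumed default, in characters
	switch strings.ToLower(filepath.Ext(filePath)) {
	case ".go", ".py", ".java":
		base = 2000 // source files: larger chunks keep whole declarations together
	case ".md", ".txt":
		base = 1000 // prose: smaller chunks align with paragraphs
	}
	if contentLength > 0 && contentLength < base {
		return contentLength // never recommend more than the file contains
	}
	return base
}

func main() {
	fmt.Println(recommendedChunkSize("main.go", 10000)) // 2000
	fmt.Println(recommendedChunkSize("notes.md", 10000)) // 1000
	fmt.Println(recommendedChunkSize("tiny.go", 300))    // clamped to 300
}
```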
func (*CodeAwareTextSplitter) SplitDocuments ¶
func (c *CodeAwareTextSplitter) SplitDocuments(ctx context.Context, docs []schema.Document) ([]schema.Document, error)
func (*CodeAwareTextSplitter) ValidateChunkingOptions ¶
func (c *CodeAwareTextSplitter) ValidateChunkingOptions(opts *model.CodeChunkingOptions) error
ValidateChunkingOptions validates the provided chunking options for correctness.
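A validator for chunking options typically enforces ordering invariants between the sizes. The struct below is a hypothetical stand-in for model.CodeChunkingOptions, and the specific checks are assumptions about what "correctness" plausibly means here:

```go
package main

import (
	"errors"
	"fmt"
)

// chunkingOptions is a hypothetical mirror of model.CodeChunkingOptions;
// the real struct's fields may differ.
type chunkingOptions struct {
	MaxChunkSize int
	MinChunkSize int
	ChunkOverlap int
}

// validateChunkingOptions sketches invariants such a validator might check.
func validateChunkingOptions(o *chunkingOptions) error {
	switch {
	case o == nil:
		return errors.New("chunking options must not be nil")
	case o.MaxChunkSize <= 0:
		return fmt.Errorf("max chunk size must be positive, got %d", o.MaxChunkSize)
	case o.MinChunkSize < 0 || o.MinChunkSize > o.MaxChunkSize:
		return fmt.Errorf("min chunk size %d must be within [0, %d]", o.MinChunkSize, o.MaxChunkSize)
	case o.ChunkOverlap < 0 || o.ChunkOverlap >= o.MaxChunkSize:
		// An overlap at or above the chunk size would never make progress.
		return fmt.Errorf("chunk overlap %d must be within [0, %d)", o.ChunkOverlap, o.MaxChunkSize)
	}
	return nil
}

func main() {
	ok := &chunkingOptions{MaxChunkSize: 2000, MinChunkSize: 50, ChunkOverlap: 200}
	fmt.Println(validateChunkingOptions(ok)) // <nil>
	bad := &chunkingOptions{MaxChunkSize: 100, ChunkOverlap: 100}
	fmt.Println(validateChunkingOptions(bad) != nil) // true
}
```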
type Option ¶
type Option func(*options)
Option is a function type for configuring the splitter.
func WithChunkOverlap ¶
WithChunkOverlap sets the chunk overlap.
func WithEstimationRatio ¶
WithEstimationRatio sets the character-to-token estimation ratio.
func WithMaxChunkSize ¶
WithMaxChunkSize sets the maximum chunk size.
func WithMinChunkSize ¶
WithMinChunkSize sets the minimum number of characters for a chunk to be valid.
func WithModelName ¶
WithModelName sets the model name for token-aware splitting.
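Option follows Go's functional-options pattern: each With* constructor returns a closure that mutates an unexported options struct. The field names and defaults in this sketch are assumptions, since the real options struct is unexported:

```go
package main

import "fmt"

// options is a hypothetical mirror of the package's unexported config struct.
type options struct {
	chunkOverlap int
	modelName    string
}

// Option configures the splitter, as in the package above.
type Option func(*options)

func WithChunkOverlap(n int) Option {
	return func(o *options) { o.chunkOverlap = n }
}

func WithModelName(name string) Option {
	return func(o *options) { o.modelName = name }
}

// apply shows what a constructor like NewCodeAware plausibly does internally:
// start from defaults, then apply each option in order.
func apply(opts ...Option) options {
	o := options{chunkOverlap: 0} // assumed zero defaults
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	o := apply(WithChunkOverlap(64), WithModelName("gpt-4o"))
	fmt.Println(o.chunkOverlap, o.modelName) // 64 gpt-4o
}
```

Later options win on conflict, which is why callers can override defaults simply by passing an extra Option at the end of the list.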
type RecursiveCharacter ¶ added in v0.15.0
type RecursiveCharacter struct {
// contains filtered or unexported fields
}
RecursiveCharacter is a text splitter that recursively tries to split text using a list of separators. It aims to keep semantically related parts of the text together as long as possible.
func NewRecursiveCharacter ¶ added in v0.15.0
func NewRecursiveCharacter(opts ...Option) *RecursiveCharacter
NewRecursiveCharacter creates a new RecursiveCharacter text splitter.
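The core idea can be sketched as follows, assuming the common separator hierarchy ("\n\n", "\n", " "); the real RecursiveCharacter additionally merges adjacent small pieces back up toward the chunk size and may use different separators:

```go
package main

import (
	"fmt"
	"strings"
)

// recursiveSplit is a simplified sketch: split on the coarsest separator
// first, and only descend to finer separators for pieces that are still
// too large. When no separator is left, hard-cut at chunkSize.
func recursiveSplit(text string, seps []string, chunkSize int) []string {
	if len(text) <= chunkSize {
		if text == "" {
			return nil
		}
		return []string{text}
	}
	if len(seps) == 0 {
		return append([]string{text[:chunkSize]},
			recursiveSplit(text[chunkSize:], seps, chunkSize)...)
	}
	sep, rest := seps[0], seps[1:]
	var out []string
	for _, part := range strings.Split(text, sep) {
		out = append(out, recursiveSplit(part, rest, chunkSize)...)
	}
	return out
}

func main() {
	text := "first paragraph\n\nsecond paragraph that is a bit longer"
	for _, c := range recursiveSplit(text, []string{"\n\n", "\n", " "}, 20) {
		fmt.Printf("%q\n", c)
	}
}
```

Because the paragraph break is tried first, "first paragraph" survives intact while only the oversized second paragraph is broken down further, which is what "keep semantically related parts together as long as possible" means in practice.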
type TextSplitter ¶
type Tokenizer ¶
type Tokenizer interface {
CountTokens(ctx context.Context, modelName, text string) int
EstimateTokens(ctx context.Context, modelName, text string) int
SplitTextByTokens(ctx context.Context, modelName, text string, maxTokens int) ([]string, error)
GetRecommendedChunkSize(ctx context.Context, modelName string) int
GetOptimalOverlapTokens(ctx context.Context, modelName string) int
GetMaxContextWindow(ctx context.Context, modelName string) int
}
Tokenizer is an interface for components that can count and estimate tokens, split text by a token budget, and report model-specific sizing limits such as the maximum context window.