Documentation ¶
Index ¶
- Constants
- Variables
- func TruncateParentText(text string, maxLen int) string
- type ChunkType
- type CodeAwareTextSplitter
- func (c *CodeAwareTextSplitter) ChunkFileWithFileInfo(ctx context.Context, content, filePath, modelName string, fileInfo fs.FileInfo, ...) ([]schema.CodeChunk, error)
- func (c *CodeAwareTextSplitter) EnrichChunkWithContext(ctx context.Context, chunk model.CodeChunk, fileContent string, ...) model.CodeChunk
- func (c *CodeAwareTextSplitter) GetRecommendedChunkSize(ctx context.Context, filePath, modelName string, contentLength int) int
- func (c *CodeAwareTextSplitter) SplitDocuments(ctx context.Context, docs []schema.Document) ([]schema.Document, error)
- func (c *CodeAwareTextSplitter) ValidateChunkingOptions(opts *model.CodeChunkingOptions) error
- type Option
- func WithChunkOverlap(overlap int) Option
- func WithChunkSize(size int) Option
- func WithEstimationRatio(ratio float64) Option
- func WithMaxChunkSize(size int) Option
- func WithMinChunkSize(size int) Option
- func WithModelName(name string) Option
- func WithParentContextConfig(config ParentContextConfig) Option
- type ParentContextConfig
- type RecursiveCharacter
- type TextSplitter
- type Tokenizer
Constants ¶
const (
	// MaxParentTextLength defines the default limit for parent context storage
	MaxParentTextLength = 2000
	// DefaultChunkSize is the fallback size if not provided
	DefaultChunkSize = 2048
)
Variables ¶
Functions ¶
func TruncateParentText ¶ added in v0.15.0
TruncateParentText reduces text length while preserving start and end context.
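The package source is not shown here, so the following is only a minimal sketch of one way to shorten text while keeping both ends. The helper name truncateMiddle, the "..." marker, and the even head/tail split are assumptions for illustration, not the actual behavior of TruncateParentText.

```go
package main

import "fmt"

// marker is a hypothetical placeholder inserted where text was removed.
const marker = "..."

// truncateMiddle is an illustrative stand-in, not the package's implementation.
// It returns text unchanged when it already fits in maxLen runes; otherwise it
// keeps the first and last runes and replaces the middle with marker, so the
// result is exactly maxLen runes long.
func truncateMiddle(text string, maxLen int) string {
	if maxLen <= 0 {
		return ""
	}
	runes := []rune(text)
	if len(runes) <= maxLen {
		return text
	}
	m := []rune(marker)
	if maxLen <= len(m) {
		// Too small to fit the marker: fall back to a plain prefix.
		return string(runes[:maxLen])
	}
	keep := maxLen - len(m)
	head := (keep + 1) / 2 // the head gets the extra rune when keep is odd
	tail := keep / 2
	return string(runes[:head]) + marker + string(runes[len(runes)-tail:])
}

func main() {
	fmt.Println(truncateMiddle("func main() { /* long body */ }", 16))
}
```

Counting runes rather than bytes avoids cutting a multi-byte UTF-8 character in half. Whether the real function counts bytes, runes, or tokens is not stated in this documentation.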
Types ¶
type CodeAwareTextSplitter ¶
type CodeAwareTextSplitter struct {
// contains filtered or unexported fields
}
func NewCodeAware ¶
func NewCodeAware(
	registry parsers.ParserRegistry,
	tokenizer Tokenizer,
	logger *slog.Logger,
	opts ...Option,
) (*CodeAwareTextSplitter, error)
func (*CodeAwareTextSplitter) ChunkFileWithFileInfo ¶
func (*CodeAwareTextSplitter) EnrichChunkWithContext ¶
func (c *CodeAwareTextSplitter) EnrichChunkWithContext(
	ctx context.Context,
	chunk model.CodeChunk,
	fileContent string,
	metadata model.FileMetadata,
	parentChunks []model.CodeChunk,
	modelName string,
) model.CodeChunk
EnrichChunkWithContext adds file and hierarchical context to a chunk.
func (*CodeAwareTextSplitter) GetRecommendedChunkSize ¶
func (c *CodeAwareTextSplitter) GetRecommendedChunkSize(ctx context.Context, filePath, modelName string, contentLength int) int
GetRecommendedChunkSize returns a recommended chunk size based on file type and content length.
func (*CodeAwareTextSplitter) SplitDocuments ¶
func (c *CodeAwareTextSplitter) SplitDocuments(ctx context.Context, docs []schema.Document) ([]schema.Document, error)
SplitDocuments takes a slice of documents and returns a new slice with split content.
func (*CodeAwareTextSplitter) ValidateChunkingOptions ¶
func (c *CodeAwareTextSplitter) ValidateChunkingOptions(opts *model.CodeChunkingOptions) error
ValidateChunkingOptions validates the provided chunking options for correctness.
type Option ¶
type Option func(*options)
Option is a function type for configuring the splitter.
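The declaration above follows Go's functional options pattern. The sketch below shows how such options are typically applied. The fields of the options struct are unexported in the real package, so the field names, the zero overlap default, and the newOptions helper are assumptions made for illustration. The function bodies are guesses at the usual shape, not the package's code.

```go
package main

import "fmt"

// options is a hypothetical configuration struct; the real fields are not documented.
type options struct {
	chunkSize    int
	chunkOverlap int
}

// Option mutates an options value, mirroring the documented declaration.
type Option func(*options)

// WithChunkSize is an illustrative re-implementation of the documented option.
func WithChunkSize(size int) Option {
	return func(o *options) { o.chunkSize = size }
}

// WithChunkOverlap is an illustrative re-implementation of the documented option.
func WithChunkOverlap(overlap int) Option {
	return func(o *options) { o.chunkOverlap = overlap }
}

// newOptions (hypothetical) starts from defaults and applies each option in order,
// so later options override earlier ones.
func newOptions(opts ...Option) options {
	o := options{chunkSize: 2048, chunkOverlap: 0} // 2048 matches DefaultChunkSize
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	o := newOptions(WithChunkSize(512), WithChunkOverlap(64))
	fmt.Println(o.chunkSize, o.chunkOverlap)
}
```

Constructors such as NewCodeAware and NewRecursiveCharacter accept opts ...Option, so callers pass only the settings they want to change and leave the rest at their defaults.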
func WithChunkOverlap ¶
WithChunkOverlap sets the chunk overlap.
func WithEstimationRatio ¶
WithEstimationRatio sets the character-to-token estimation ratio.
func WithMaxChunkSize ¶
func WithMinChunkSize ¶
WithMinChunkSize sets the minimum number of characters for a chunk to be valid.
func WithModelName ¶
WithModelName sets the model name for token-aware splitting.
func WithParentContextConfig ¶ added in v0.15.0
func WithParentContextConfig(config ParentContextConfig) Option
WithParentContextConfig sets the parent context configuration.
type ParentContextConfig ¶ added in v0.15.0
type RecursiveCharacter ¶ added in v0.15.0
type RecursiveCharacter struct {
// contains filtered or unexported fields
}
RecursiveCharacter is a text splitter that recursively tries to split text using a list of separators. It aims to keep semantically related parts of the text together as long as possible.
func NewRecursiveCharacter ¶ added in v0.15.0
func NewRecursiveCharacter(opts ...Option) *RecursiveCharacter
NewRecursiveCharacter creates a new RecursiveCharacter text splitter.
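To make the recursive strategy concrete, here is a simplified sketch of the general technique: try the coarsest separator first, merge adjacent pieces while they fit, and recurse with finer separators only for pieces that are still too large. The function splitRecursive, its byte-length limit, and the separator list are assumptions for illustration. The actual RecursiveCharacter type may handle overlap, token counts, and separators differently.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRecursive is a hypothetical helper, not the package's implementation.
// It returns pieces of at most maxLen bytes, preferring to break on earlier
// (coarser) separators. Separators at chunk boundaries are dropped.
func splitRecursive(text string, seps []string, maxLen int) []string {
	if maxLen <= 0 || text == "" {
		return nil
	}
	if len(text) <= maxLen {
		return []string{text}
	}
	if len(seps) == 0 {
		// No separators left: hard cut by bytes (may split a UTF-8 character).
		var out []string
		for len(text) > maxLen {
			out = append(out, text[:maxLen])
			text = text[maxLen:]
		}
		return append(out, text)
	}
	sep, finer := seps[0], seps[1:]
	var out []string
	current := ""
	for _, part := range strings.Split(text, sep) {
		candidate := part
		if current != "" {
			candidate = current + sep + part
		}
		if len(candidate) <= maxLen {
			current = candidate // still fits: keep related parts together
			continue
		}
		if current != "" {
			out = append(out, current)
			current = ""
		}
		if len(part) <= maxLen {
			current = part
		} else {
			// The part alone is too large: retry with finer separators.
			out = append(out, splitRecursive(part, finer, maxLen)...)
		}
	}
	if current != "" {
		out = append(out, current)
	}
	return out
}

func main() {
	seps := []string{"\n\n", "\n", " "}
	fmt.Printf("%q\n", splitRecursive("aaa bbb ccc", seps, 7))
}
```

Ordering separators from paragraph breaks down to spaces is what keeps semantically related text together for as long as possible: a chunk is only broken at a finer boundary when no coarser boundary yields a piece that fits.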
type TextSplitter ¶
type Tokenizer ¶
type Tokenizer interface {
	CountTokens(ctx context.Context, modelName, text string) int
	EstimateTokens(ctx context.Context, modelName, text string) int
	SplitTextByTokens(ctx context.Context, modelName, text string, maxTokens int) ([]string, error)
	GetRecommendedChunkSize(ctx context.Context, modelName string) int
	GetOptimalOverlapTokens(ctx context.Context, modelName string) int
	GetMaxContextWindow(ctx context.Context, modelName string) int
}
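Tokenizer is an interface for components that count or estimate tokens, split text to fit a token budget, and report model-specific sizing hints (recommended chunk size, overlap, and maximum context window).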
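As a rough illustration of what satisfying this interface involves, the sketch below implements it with a naive whitespace tokenizer. The type wordTokenizer, its charsPerToken field, and every numeric value returned by the Get methods are placeholders invented for this example. A real implementation would use a model-specific tokenizer and real model limits.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// Tokenizer is copied from the documentation above so the example is self-contained.
type Tokenizer interface {
	CountTokens(ctx context.Context, modelName, text string) int
	EstimateTokens(ctx context.Context, modelName, text string) int
	SplitTextByTokens(ctx context.Context, modelName, text string, maxTokens int) ([]string, error)
	GetRecommendedChunkSize(ctx context.Context, modelName string) int
	GetOptimalOverlapTokens(ctx context.Context, modelName string) int
	GetMaxContextWindow(ctx context.Context, modelName string) int
}

// wordTokenizer (hypothetical) treats each whitespace-separated word as one token.
type wordTokenizer struct {
	charsPerToken int // assumed ratio used only by EstimateTokens
}

// Compile-time check that wordTokenizer satisfies Tokenizer.
var _ Tokenizer = wordTokenizer{}

func (wordTokenizer) CountTokens(_ context.Context, _, text string) int {
	return len(strings.Fields(text))
}

// EstimateTokens avoids tokenizing by dividing the byte length by a fixed ratio, rounding up.
func (t wordTokenizer) EstimateTokens(_ context.Context, _, text string) int {
	if t.charsPerToken <= 0 {
		return 0
	}
	return (len(text) + t.charsPerToken - 1) / t.charsPerToken
}

// SplitTextByTokens groups words into chunks of at most maxTokens words each.
func (wordTokenizer) SplitTextByTokens(_ context.Context, _, text string, maxTokens int) ([]string, error) {
	if maxTokens <= 0 {
		return nil, errors.New("maxTokens must be positive")
	}
	words := strings.Fields(text)
	var chunks []string
	for start := 0; start < len(words); start += maxTokens {
		end := start + maxTokens
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
	}
	return chunks, nil
}

// The values below are arbitrary placeholders, not real model limits.
func (wordTokenizer) GetRecommendedChunkSize(context.Context, string) int { return 512 }
func (wordTokenizer) GetOptimalOverlapTokens(context.Context, string) int { return 64 }
func (wordTokenizer) GetMaxContextWindow(context.Context, string) int     { return 8192 }

func main() {
	tok := wordTokenizer{charsPerToken: 4}
	chunks, err := tok.SplitTextByTokens(context.Background(), "example-model", "a b c d e", 2)
	fmt.Printf("%q %v\n", chunks, err)
}
```

A value like this could be passed as the tokenizer argument of NewCodeAware in tests, where deterministic counts matter more than fidelity to a particular model.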