Documentation ¶
Overview ¶
Package splitter defines text splitters for chunking documents.
Index ¶
- type Chunk
- type Graphemes
- type GraphemesTokenCounter
- type Markdown
- type Option
- type Options
- func (o *Options) Buffer() *bytes.Buffer
- func (o *Options) Chunks() []string
- func (o *Options) Read(p []byte) (int, error)
- func (o *Options) Scan() error
- func (o *Options) Scanner() Scanner
- func (o *Options) Size() int
- func (o *Options) SplitText(txt string) []string
- func (o *Options) TokenCount(txt string) int
- func (o *Options) Write(p []byte) (int, error)
- type Phrases
- type PhrasesTokenCounter
- type Scanner
- type Sentences
- type SentencesTokenCounter
- type TikTokenCounter
- type TokenCounter
- type Words
- type WordsTokenCounter
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Chunk ¶
type Chunk struct {
// Buffer contains the actual content of the chunk
Buffer *bytes.Buffer
// TokenSize represents the number of tokens in this chunk
TokenSize int
// Start is the index of the first part in this chunk
Start int
// End is the index just past the last part in this chunk (exclusive)
End int
}
Chunk represents a piece of text with associated metadata for tracking its position and size within the original document.
type GraphemesTokenCounter ¶
type GraphemesTokenCounter struct{}
func (*GraphemesTokenCounter) Count ¶
func (c *GraphemesTokenCounter) Count(p []byte) int
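The standard library has no grapheme-cluster segmenter, so the sketch below counts runes as a rough stand-in for what a grapheme-based counter does. This is a hedged approximation, not the package's actual implementation, which may use a dedicated grapheme library:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCount approximates grapheme counting by counting runes. True grapheme
// segmentation (combining marks, emoji sequences, etc.) requires a dedicated
// library; runes are only a first-order stand-in.
func runeCount(p []byte) int {
	return utf8.RuneCount(p)
}

func main() {
	fmt.Println(runeCount([]byte("hello"))) // 5
	fmt.Println(runeCount([]byte("日本語"))) // 3 runes, not 9 bytes
}
```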
type Markdown ¶
type Markdown struct {
Options
// contains filtered or unexported fields
}
Markdown is a Markdown header text splitter.
If your original document is HTML, sanitize it and convert it to Markdown first, then split it.
func NewMarkdown ¶
NewMarkdown creates a new Markdown text splitter.
func (*Markdown) EnableCodeBlocks ¶
func (*Markdown) EnableHeadingHierarchy ¶
func (*Markdown) EnableReferenceLinks ¶
func (*Markdown) SetSecondSplitter ¶
type Option ¶
type Option func(*Options)
Option is a function type for configuring chunker Options. This follows the functional options pattern for clean and flexible configuration.
func WithBuffer ¶
func WithChunkSize ¶
func WithOverlap ¶
func WithTokenCounter ¶
func WithTokenCounter(counter TokenCounter) Option
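The functional options pattern behind Option can be sketched in isolation. The names below mirror WithChunkSize and WithOverlap above but are illustrative, as are the defaults:

```go
package main

import "fmt"

// options holds splitter configuration; fields and defaults are illustrative.
type options struct {
	chunkSize int
	overlap   int
}

// option mutates an options value, mirroring the package's Option type.
type option func(*options)

func withChunkSize(n int) option { return func(o *options) { o.chunkSize = n } }
func withOverlap(n int) option   { return func(o *options) { o.overlap = n } }

// newOptions applies defaults first, then each caller-supplied option in order.
func newOptions(opts ...option) *options {
	o := &options{chunkSize: 512, overlap: 0} // assumed defaults
	for _, fn := range opts {
		fn(o)
	}
	return o
}

func main() {
	o := newOptions(withChunkSize(256), withOverlap(32))
	fmt.Println(o.chunkSize, o.overlap) // 256 32
}
```

Because each option is just a function, callers pass only the settings they care about, and new options can be added without breaking existing call sites.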
type Options ¶
type Options struct {
// contains filtered or unexported fields
}
func (*Options) TokenCount ¶
type PhrasesTokenCounter ¶ added in v1.2.2
type PhrasesTokenCounter struct{}
func (PhrasesTokenCounter) Count ¶ added in v1.2.2
func (c PhrasesTokenCounter) Count(p []byte) int
type SentencesTokenCounter ¶
type SentencesTokenCounter struct{}
func (SentencesTokenCounter) Count ¶
func (c SentencesTokenCounter) Count(p []byte) int
type TikTokenCounter ¶
type TikTokenCounter struct {
// contains filtered or unexported fields
}
TikTokenCounter provides accurate token counting using the tiktoken library, which implements the tokenization schemes used by OpenAI models.
func NewTikTokenCounter ¶
func NewTikTokenCounter(encoding string) (*TikTokenCounter, error)
NewTikTokenCounter creates a new TikTokenCounter using the specified encoding. Common encodings include:
- "cl100k_base" (GPT-4, ChatGPT)
- "p50k_base" (GPT-3)
- "r50k_base" (Codex)
func (*TikTokenCounter) Count ¶
func (ttc *TikTokenCounter) Count(p []byte) int
Count returns the exact number of tokens in the text according to the specified tiktoken encoding.
type TokenCounter ¶
type TokenCounter interface {
// Count returns the number of tokens in the given text according to the
// implementation's tokenization strategy.
Count(p []byte) int
}
TokenCounter defines the interface for counting tokens in a string. This abstraction allows for different tokenization strategies (e.g., words, subwords).
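Any tokenization strategy can plug in by satisfying this one-method interface. A self-contained sketch (the interface is restated locally so the example compiles alone; the whitespace counter is illustrative, not the package's WordsTokenCounter):

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// tokenCounter restates the TokenCounter interface locally.
type tokenCounter interface {
	Count(p []byte) int
}

// whitespaceCounter counts whitespace-delimited words as tokens.
type whitespaceCounter struct{}

func (whitespaceCounter) Count(p []byte) int {
	sc := bufio.NewScanner(bytes.NewReader(p))
	sc.Split(bufio.ScanWords) // token per whitespace-separated word
	n := 0
	for sc.Scan() {
		n++
	}
	return n
}

func main() {
	var tc tokenCounter = whitespaceCounter{}
	fmt.Println(tc.Count([]byte("one two three"))) // 3
}
```

A splitter configured with WithTokenCounter would then size its chunks by whatever notion of "token" the supplied implementation uses.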
type WordsTokenCounter ¶
type WordsTokenCounter struct{}
func (WordsTokenCounter) Count ¶
func (c WordsTokenCounter) Count(p []byte) int