Documentation ¶
Overview ¶
Package embedder contains the Embedder interface and implementations for different providers, including openai, voyageai, cohere, gemini, and huggingface.
Index ¶
- func DefaultSentenceSplitter(text string) []string
- func EmbedChunk(ctx context.Context, embedder Embedder, chunk *Chunk, embedding *Embedding, ...) error
- func SmartSentenceSplitter(text string) []string
- type Base64
- type Chunk
- type Chunker
- type DefaultTokenCounter
- type EmbeddedChunk
- type Embedder
- type Embedding
- type Option
- type Options
- type Provider
- type TextChunker
- type TextChunkerOption
- type TikTokenCounter
- type TokenCounter
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func DefaultSentenceSplitter ¶
func DefaultSentenceSplitter(text string) []string
DefaultSentenceSplitter provides a basic implementation for splitting text into sentences. It uses common punctuation marks (., !, ?) as sentence boundaries.
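A minimal sketch of what a splitter with this behavior might look like (this is illustrative, not the package's actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences is a simplified splitter in the spirit of
// DefaultSentenceSplitter: it treats '.', '!' and '?' as sentence
// boundaries and trims surrounding whitespace.
func splitSentences(text string) []string {
	var sentences []string
	var b strings.Builder
	for _, r := range text {
		b.WriteRune(r)
		if r == '.' || r == '!' || r == '?' {
			if s := strings.TrimSpace(b.String()); s != "" {
				sentences = append(sentences, s)
			}
			b.Reset()
		}
	}
	// Keep any trailing text that lacks terminal punctuation.
	if s := strings.TrimSpace(b.String()); s != "" {
		sentences = append(sentences, s)
	}
	return sentences
}

func main() {
	fmt.Println(splitSentences("Hello world. How are you? Fine!"))
}
```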
func EmbedChunk ¶
func EmbedChunk(ctx context.Context, embedder Embedder, chunk *Chunk, embedding *Embedding, usage *components.LLMUsage) error
EmbedChunk generates an embedding for a single text chunk. It runs the chunk through the embedder, writes the result into the provided Embedding, and records token consumption in usage, with debug output for monitoring progress.
func SmartSentenceSplitter ¶
func SmartSentenceSplitter(text string) []string
SmartSentenceSplitter provides an advanced sentence splitting implementation that handles: - Multiple punctuation marks (., !, ?) - Common abbreviations - Quoted sentences - Parenthetical sentences - Lists and enumerations
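One of the listed behaviors, abbreviation handling, can be sketched like this. The abbreviation set and logic here are illustrative assumptions, not the package's actual list or algorithm:

```go
package main

import (
	"fmt"
	"strings"
)

// abbreviations that should not end a sentence; an illustrative subset.
var abbreviations = map[string]bool{"dr.": true, "mr.": true, "e.g.": true, "etc.": true}

// smartSplit sketches one behavior SmartSentenceSplitter describes:
// a '.' after a known abbreviation does not end the sentence.
func smartSplit(text string) []string {
	var sentences []string
	var b strings.Builder
	for _, w := range strings.Fields(text) {
		if b.Len() > 0 {
			b.WriteByte(' ')
		}
		b.WriteString(w)
		ends := strings.HasSuffix(w, ".") || strings.HasSuffix(w, "!") || strings.HasSuffix(w, "?")
		if ends && !abbreviations[strings.ToLower(w)] {
			sentences = append(sentences, b.String())
			b.Reset()
		}
	}
	if b.Len() > 0 {
		sentences = append(sentences, b.String())
	}
	return sentences
}

func main() {
	fmt.Println(smartSplit("Dr. Smith arrived. He was late!"))
}
```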
Types ¶
type Chunk ¶
type Chunk struct {
// Text contains the actual content of the chunk
Text string
// TokenSize represents the number of tokens in this chunk
TokenSize int
// StartSentence is the index of the first sentence in this chunk
StartSentence int
// EndSentence is the index of the last sentence in this chunk (exclusive)
EndSentence int
}
Chunk represents a piece of text with associated metadata for tracking its position and size within the original document.
type Chunker ¶
type Chunker interface {
// Chunk splits the input text into a slice of Chunks according to the
// implementation's strategy.
Chunk(text string) []Chunk
}
Chunker defines the interface for text chunking implementations. Different implementations can provide various strategies for splitting text while maintaining context and semantic meaning.
type DefaultTokenCounter ¶
type DefaultTokenCounter struct{}
DefaultTokenCounter provides a simple word-based token counting implementation. It splits text on whitespace to approximate token counts. This is suitable for basic use cases but may not accurately reflect subword tokenization used by language models.
func (*DefaultTokenCounter) Count ¶
func (dtc *DefaultTokenCounter) Count(text string) int
Count returns the number of words in the text, using whitespace as a delimiter.
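The documented counting strategy amounts to a whitespace word count, which in Go is a one-liner over `strings.Fields` (a sketch of the behavior, not the package's source):

```go
package main

import (
	"fmt"
	"strings"
)

// wordCount mirrors the documented behavior of DefaultTokenCounter.Count:
// tokens are approximated by whitespace-separated words. strings.Fields
// collapses runs of whitespace, so double spaces do not inflate the count.
func wordCount(text string) int {
	return len(strings.Fields(text))
}

func main() {
	fmt.Println(wordCount("the quick  brown fox"))
}
```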
type EmbeddedChunk ¶
type EmbeddedChunk struct {
// Embedding is the vector representation of the chunk's text.
// Different embeddings can be produced for different models or purposes.
Embedding
// Chunk is the original chunk content that was embedded
Chunk *Chunk `json:"text"`
}
EmbeddedChunk represents a chunk of text along with its vector embeddings and associated metadata. This is the core data structure for storing and retrieving embedded content.
func EmbedChunks ¶
func EmbedChunks(ctx context.Context, embedder Embedder, chunks []Chunk, usage *components.LLMUsage) ([]EmbeddedChunk, error)
EmbedChunks processes a slice of text chunks and generates embeddings for each one. It handles the embedding process in sequence, with debug output for monitoring. The function: 1. Allocates space for the results 2. Processes each chunk through the embedder 3. Creates EmbeddedChunk instances with the results 4. Provides progress information via debug output
Returns an error if any chunk fails to embed properly.
type Embedding ¶
type Embedding struct {
Object string `json:"object"`
Embedding []float64 `json:"embedding"`
Index int `json:"index"`
Meta map[string]string `json:"meta,omitempty"`
}
Embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating point numbers, such that the distance between two embeddings in the vector space is correlated with semantic similarity between two inputs in the original format. For example, if two texts are similar, then their vector representations should also be similar.
func (*Embedding) DotProduct ¶
DotProduct calculates the dot product of the embedding vector with another embedding vector. Both vectors must have the same length; otherwise, an ErrVectorLengthMismatch is returned. The method returns the calculated dot product as a float64 value.
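The documented behavior, length check first, then an element-wise multiply-accumulate, can be sketched as follows (the error variable name is illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// errVectorLengthMismatch stands in for the package's ErrVectorLengthMismatch.
var errVectorLengthMismatch = errors.New("vector lengths do not match")

// dotProduct sketches Embedding.DotProduct's documented contract:
// both vectors must have the same length, otherwise an error is returned.
func dotProduct(a, b []float64) (float64, error) {
	if len(a) != len(b) {
		return 0, errVectorLengthMismatch
	}
	var sum float64
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum, nil
}

func main() {
	v, err := dotProduct([]float64{1, 2, 3}, []float64{4, 5, 6})
	fmt.Println(v, err) // 32 <nil>
}
```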
type Option ¶
type Option func(*Options)
Option is a function type for configuring the EmbedderConfig. It follows the functional options pattern for clean and flexible configuration.
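The functional options pattern referred to here works by passing configuration functions to a constructor that applies them over defaults. A self-contained sketch, with illustrative field names rather than the package's real ones:

```go
package main

import "fmt"

// options holds configuration; the fields here are assumptions for
// illustration, not the package's unexported fields.
type options struct {
	provider string
	model    string
}

// option mutates an options value, mirroring the package's Option type.
type option func(*options)

func withProvider(p string) option { return func(o *options) { o.provider = p } }
func withModel(m string) option    { return func(o *options) { o.model = m } }

// newOptions applies each option over sensible defaults.
func newOptions(opts ...option) options {
	o := options{provider: "openai", model: "default"}
	for _, apply := range opts {
		apply(&o)
	}
	return o
}

func main() {
	o := newOptions(withProvider("cohere"))
	fmt.Println(o.provider, o.model)
}
```

The pattern keeps the constructor signature stable as new settings are added: callers pass only the options they care about, and unset fields keep their defaults.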
func WithProvider ¶
func WithProvider(provider Provider) Option
WithProvider sets the embedding provider to use.
type Options ¶
type Options struct {
// contains filtered or unexported fields
}
Options holds the configuration for creating an Embedder instance. It supports multiple embedding providers and their specific options.
type TextChunker ¶
type TextChunker struct {
// ChunkSize is the target size of each chunk in tokens
ChunkSize int
// ChunkOverlap is the number of tokens that should overlap between adjacent chunks
ChunkOverlap int
// TokenCounter is used to count tokens in text segments
TokenCounter TokenCounter
// SentenceSplitter is a function that splits text into sentences
SentenceSplitter func(string) []string
}
TextChunker provides an advanced implementation of the Chunker interface with support for overlapping chunks and custom tokenization.
func NewTextChunker ¶
func NewTextChunker(options ...TextChunkerOption) (*TextChunker, error)
NewTextChunker creates a new TextChunker with the given options. It uses sensible defaults if no options are provided: - ChunkSize: 200 tokens - ChunkOverlap: 50 tokens - TokenCounter: DefaultTokenCounter - SentenceSplitter: DefaultSentenceSplitter
func (*TextChunker) Chunk ¶
func (tc *TextChunker) Chunk(text string) []Chunk
Chunk splits the input text into chunks while preserving sentence boundaries and maintaining the specified overlap between chunks. The algorithm: 1. Splits the text into sentences 2. Builds chunks by adding sentences until the chunk size limit is reached 3. Creates overlap with previous chunk when starting a new chunk 4. Tracks token counts and sentence indices for each chunk
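The four steps above can be sketched as follows. This is a simplified rendering of the documented algorithm, not the package's source: tokens are approximated as whitespace words, and overlap is built by backtracking over trailing sentences:

```go
package main

import (
	"fmt"
	"strings"
)

// textChunk mirrors the fields documented on Chunk.
type textChunk struct {
	Text          string
	TokenSize     int
	StartSentence int
	EndSentence   int // exclusive
}

// chunkSentences accumulates sentences until chunkSize tokens is
// reached, then starts the next chunk by backtracking far enough to
// carry roughly `overlap` tokens forward.
func chunkSentences(sentences []string, chunkSize, overlap int) []textChunk {
	tokens := func(s string) int { return len(strings.Fields(s)) }
	var chunks []textChunk
	start := 0
	for start < len(sentences) {
		size := 0
		end := start
		// Always take at least one sentence, then fill up to chunkSize.
		for end < len(sentences) && (end == start || size+tokens(sentences[end]) <= chunkSize) {
			size += tokens(sentences[end])
			end++
		}
		chunks = append(chunks, textChunk{
			Text:          strings.Join(sentences[start:end], " "),
			TokenSize:     size,
			StartSentence: start,
			EndSentence:   end,
		})
		if end == len(sentences) {
			break
		}
		// Backtrack to build the overlap with the next chunk, while
		// guaranteeing forward progress (next stays past start).
		next := end
		carried := 0
		for next > start+1 && carried+tokens(sentences[next-1]) <= overlap {
			next--
			carried += tokens(sentences[next])
		}
		start = next
	}
	return chunks
}

func main() {
	s := []string{"one two three.", "four five.", "six seven eight.", "nine."}
	for _, c := range chunkSentences(s, 5, 2) {
		fmt.Printf("%q [%d,%d) %d tokens\n", c.Text, c.StartSentence, c.EndSentence, c.TokenSize)
	}
}
```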
type TextChunkerOption ¶
type TextChunkerOption func(*TextChunker)
TextChunkerOption is a function type for configuring TextChunker instances. This follows the functional options pattern for clean and flexible configuration.
type TikTokenCounter ¶
type TikTokenCounter struct {
// contains filtered or unexported fields
}
TikTokenCounter provides accurate token counting using the tiktoken library, which implements the tokenization schemes used by OpenAI models.
func NewTikTokenCounter ¶
func NewTikTokenCounter(encoding string) (*TikTokenCounter, error)
NewTikTokenCounter creates a new TikTokenCounter using the specified encoding. Common encodings include: - "cl100k_base" (GPT-4, ChatGPT) - "p50k_base" (GPT-3) - "r50k_base" (Codex)
func (*TikTokenCounter) Count ¶
func (ttc *TikTokenCounter) Count(text string) int
Count returns the exact number of tokens in the text according to the specified tiktoken encoding.
type TokenCounter ¶
type TokenCounter interface {
// Count returns the number of tokens in the given text according to the
// implementation's tokenization strategy.
Count(text string) int
}
TokenCounter defines the interface for counting tokens in a string. This abstraction allows for different tokenization strategies (e.g., words, subwords).
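Because the interface is a single method, alternative strategies plug in easily. As an illustration (not part of the package), here is a counter using the common rule of thumb of roughly four characters per token for English text under BPE-style tokenizers:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// tokenCounter mirrors the package's TokenCounter interface.
type tokenCounter interface {
	Count(text string) int
}

// charCounter estimates tokens as one per four characters, rounding up.
// This is a heuristic, cruder than tiktoken but cheaper than running a
// real tokenizer.
type charCounter struct{}

func (charCounter) Count(text string) int {
	n := utf8.RuneCountInString(text)
	return (n + 3) / 4
}

func main() {
	var tc tokenCounter = charCounter{}
	fmt.Println(tc.Count("hello world!")) // 12 runes -> 3 estimated tokens
}
```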