package embedder

v1.1.9
Published: Feb 26, 2025 License: MIT Imports: 11 Imported by: 0

Documentation

Overview

Package embedder contains the Embedder interface and implementations for different providers, including OpenAI, VoyageAI, Cohere, Gemini, and HuggingFace.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DefaultSentenceSplitter

func DefaultSentenceSplitter(text string) []string

DefaultSentenceSplitter provides a basic implementation for splitting text into sentences. It uses the common punctuation marks ".", "!", and "?" as sentence boundaries.
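A punctuation-based splitter along these lines can be sketched as follows; this is an illustration of the strategy the docs describe, not the package's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences splits text on ., ! and ?, keeping the punctuation
// attached to each sentence. A sketch of the basic boundary strategy only.
func splitSentences(text string) []string {
	var sentences []string
	var b strings.Builder
	for _, r := range text {
		b.WriteRune(r)
		if r == '.' || r == '!' || r == '?' {
			if s := strings.TrimSpace(b.String()); s != "" {
				sentences = append(sentences, s)
			}
			b.Reset()
		}
	}
	// Flush any trailing text that lacks terminal punctuation.
	if s := strings.TrimSpace(b.String()); s != "" {
		sentences = append(sentences, s)
	}
	return sentences
}

func main() {
	fmt.Println(splitSentences("Hello there! How are you? Fine."))
	// [Hello there! How are you? Fine.]
}
```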

func EmbedChunk

func EmbedChunk(ctx context.Context, embedder Embedder, chunk *Chunk, embedding *Embedding, usage *components.LLMUsage) error

EmbedChunk processes a single text chunk and generates its embedding. The result is written into embedding and token usage is recorded in usage, with debug output for monitoring.

func SmartSentenceSplitter

func SmartSentenceSplitter(text string) []string

SmartSentenceSplitter provides an advanced sentence-splitting implementation that handles:

  - Multiple punctuation marks (., !, ?)
  - Common abbreviations
  - Quoted sentences
  - Parenthetical sentences
  - Lists and enumerations
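One way to handle the abbreviation case is to suppress a split when the token ending at a period is a known abbreviation. The sketch below illustrates that idea with a small hand-picked abbreviation set; it is not the package's actual algorithm:

```go
package main

import (
	"fmt"
	"strings"
)

// Periods ending these tokens do not terminate a sentence.
// The set here is illustrative, not exhaustive.
var abbreviations = map[string]bool{
	"Dr.": true, "Mr.": true, "Mrs.": true, "Ms.": true, "vs.": true,
}

// smartSplit splits on ., ! and ? unless the token ending at the
// period is a known abbreviation.
func smartSplit(text string) []string {
	var sentences []string
	var b strings.Builder
	for _, r := range text {
		b.WriteRune(r)
		if r == '.' || r == '!' || r == '?' {
			cur := b.String()
			fields := strings.Fields(cur)
			last := ""
			if len(fields) > 0 {
				last = fields[len(fields)-1]
			}
			if r == '.' && abbreviations[last] {
				continue // period belongs to an abbreviation; keep going
			}
			if s := strings.TrimSpace(cur); s != "" {
				sentences = append(sentences, s)
			}
			b.Reset()
		}
	}
	if s := strings.TrimSpace(b.String()); s != "" {
		sentences = append(sentences, s)
	}
	return sentences
}

func main() {
	fmt.Println(smartSplit("Dr. Smith arrived. He was late!"))
	// [Dr. Smith arrived. He was late!]
}
```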

Types

type Base64

type Base64 string

Base64 is a base64-encoded embedding string.

func (Base64) Decode

func (s Base64) Decode() (*Embedding, error)

Decode decodes base64 encoded string into a slice of floats.

type Chunk

type Chunk struct {
	// Text contains the actual content of the chunk
	Text string
	// TokenSize represents the number of tokens in this chunk
	TokenSize int
	// StartSentence is the index of the first sentence in this chunk
	StartSentence int
	// EndSentence is the index of the last sentence in this chunk (exclusive)
	EndSentence int
}

Chunk represents a piece of text with associated metadata for tracking its position and size within the original document.

type Chunker

type Chunker interface {
	// Chunk splits the input text into a slice of Chunks according to the
	// implementation's strategy.
	Chunk(text string) []Chunk
}

Chunker defines the interface for text chunking implementations. Different implementations can provide various strategies for splitting text while maintaining context and semantic meaning.

type DefaultTokenCounter

type DefaultTokenCounter struct{}

DefaultTokenCounter provides a simple word-based token counting implementation. It splits text on whitespace to approximate token counts. This is suitable for basic use cases but may not accurately reflect subword tokenization used by language models.

func (*DefaultTokenCounter) Count

func (dtc *DefaultTokenCounter) Count(text string) int

Count returns the number of words in the text, using whitespace as a delimiter.
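Whitespace-delimited word counting in Go is a one-liner via strings.Fields; this sketch mirrors the behavior the docs describe:

```go
package main

import (
	"fmt"
	"strings"
)

// wordCount approximates a token count by splitting on whitespace,
// as DefaultTokenCounter is documented to do.
func wordCount(text string) int {
	return len(strings.Fields(text))
}

func main() {
	fmt.Println(wordCount("a quick   brown fox")) // 4
}
```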

type EmbeddedChunk

type EmbeddedChunk struct {
	// Embedding is the vector representation of the chunk's text
	Embedding
	// Chunk is the original chunk content that was embedded
	Chunk *Chunk `json:"text"`
}

EmbeddedChunk represents a chunk of text along with its vector embeddings and associated metadata. This is the core data structure for storing and retrieving embedded content.

func EmbedChunks

func EmbedChunks(ctx context.Context, embedder Embedder, chunks []Chunk, usage *components.LLMUsage) ([]EmbeddedChunk, error)

EmbedChunks processes a slice of text chunks and generates embeddings for each one. It handles the embedding process in sequence, with debug output for monitoring. The function:

 1. Allocates space for the results
 2. Processes each chunk through the embedder
 3. Creates EmbeddedChunk instances with the results
 4. Provides progress information via debug output

Returns an error if any chunk fails to embed properly.
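The sequential flow described above can be sketched with stand-in types and a stand-in embed function (the real method takes an Embedder and an *components.LLMUsage; the names below are illustrative only):

```go
package main

import "fmt"

// Minimal stand-ins for the package's types, for illustration only.
type chunk struct{ Text string }

type embeddedChunk struct {
	Vector []float64
	Chunk  *chunk
}

// embedFn stands in for Embedder.Embed.
type embedFn func(text string) ([]float64, error)

// embedChunks mirrors the documented flow: preallocate, embed each
// chunk in sequence, collect results, and stop at the first error.
func embedChunks(chunks []chunk, embed embedFn) ([]embeddedChunk, error) {
	out := make([]embeddedChunk, 0, len(chunks)) // 1. allocate space
	for i := range chunks {
		vec, err := embed(chunks[i].Text) // 2. embed in sequence
		if err != nil {
			return nil, fmt.Errorf("chunk %d: %w", i, err)
		}
		out = append(out, embeddedChunk{Vector: vec, Chunk: &chunks[i]}) // 3. collect
		fmt.Printf("embedded chunk %d/%d\n", i+1, len(chunks))          // 4. progress
	}
	return out, nil
}

func main() {
	// A dummy embedder that returns the text length as a 1-D vector.
	dummy := func(text string) ([]float64, error) {
		return []float64{float64(len(text))}, nil
	}
	res, err := embedChunks([]chunk{{"hello"}, {"world!"}}, dummy)
	fmt.Println(len(res), err) // 2 <nil>
}
```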

type Embedder

type Embedder interface {
	Provider() Provider
	Model() string
	Embed(context.Context, string, *Embedding, *components.LLMUsage) error
	BatchEmbed(ctx context.Context, parts []string, usage *components.LLMUsage) ([]Embedding, error)
	DotProduct(context.Context, *Embedding, *Embedding) (float64, error)
}

type Embedding

type Embedding struct {
	Object    string            `json:"object"`
	Embedding []float64         `json:"embedding"`
	Index     int               `json:"index"`
	Meta      map[string]string `json:"meta,omitempty"`
}

Embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating point numbers, such that the distance between two embeddings in the vector space is correlated with semantic similarity between two inputs in the original format. For example, if two texts are similar, then their vector representations should also be similar.

func (*Embedding) DotProduct

func (e *Embedding) DotProduct(other *Embedding) (float64, error)

DotProduct calculates the dot product of the embedding vector with another embedding vector. Both vectors must have the same length; otherwise, an ErrVectorLengthMismatch is returned. The method returns the calculated dot product as a float64 value.
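The underlying computation is straightforward; a self-contained sketch with the documented length check (the error variable name here is illustrative) could be:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for the package's ErrVectorLengthMismatch.
var errVectorLengthMismatch = errors.New("vector length mismatch")

// dotProduct sums the pairwise products of two equal-length vectors,
// returning an error on a length mismatch as the method docs describe.
func dotProduct(a, b []float64) (float64, error) {
	if len(a) != len(b) {
		return 0, errVectorLengthMismatch
	}
	var sum float64
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum, nil
}

func main() {
	v, err := dotProduct([]float64{1, 2, 3}, []float64{4, 5, 6})
	fmt.Println(v, err) // 32 <nil>
}
```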

func (Embedding) UUID

func (e Embedding) UUID() string

type Option

type Option func(*Options)

Option is a function type for configuring the EmbedderConfig. It follows the functional options pattern for clean and flexible configuration.
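The functional options pattern works by passing configuration as functions that each mutate an options struct. A self-contained sketch (field and function names here are illustrative; the real Options fields are unexported):

```go
package main

import "fmt"

// options is a stand-in for the package's Options struct.
type options struct {
	provider string
	model    string
}

type option func(*options)

func withProvider(p string) option { return func(o *options) { o.provider = p } }
func withModel(m string) option    { return func(o *options) { o.model = m } }

// newOptions applies defaults first, then each caller-supplied override.
func newOptions(opts ...option) options {
	o := options{provider: "OpenAI"} // hypothetical default
	for _, apply := range opts {
		apply(&o)
	}
	return o
}

func main() {
	o := newOptions(withProvider("VoyageAI"), withModel("some-model"))
	fmt.Println(o.provider, o.model) // VoyageAI some-model
}
```

The pattern keeps constructors backward-compatible: new knobs are added as new With* functions without changing any existing call sites.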

func WithModel

func WithModel(model string) Option

func WithProvider

func WithProvider(provider Provider) Option

type Options

type Options struct {
	// contains filtered or unexported fields
}

Options holds the configuration for creating an Embedder instance. It supports multiple embedding providers and their specific options.

func (Options) Model

func (i Options) Model() string

func (Options) Provider

func (i Options) Provider() Provider

type Provider

type Provider = string
const (
	ProviderOpenAI      Provider = "OpenAI"
	ProviderVoyageAI    Provider = "VoyageAI"
	ProviderCohere      Provider = "Cohere"
	ProviderGemini      Provider = "Gemini"
	ProviderHuggingFace Provider = "HuggingFace"
)

type TextChunker

type TextChunker struct {
	// ChunkSize is the target size of each chunk in tokens
	ChunkSize int
	// ChunkOverlap is the number of tokens that should overlap between adjacent chunks
	ChunkOverlap int
	// TokenCounter is used to count tokens in text segments
	TokenCounter TokenCounter
	// SentenceSplitter is a function that splits text into sentences
	SentenceSplitter func(string) []string
}

TextChunker provides an advanced implementation of the Chunker interface with support for overlapping chunks and custom tokenization.

func NewTextChunker

func NewTextChunker(options ...TextChunkerOption) (*TextChunker, error)

NewTextChunker creates a new TextChunker with the given options. It uses sensible defaults if no options are provided:

  - ChunkSize: 200 tokens
  - ChunkOverlap: 50 tokens
  - TokenCounter: DefaultTokenCounter
  - SentenceSplitter: DefaultSentenceSplitter

func (*TextChunker) Chunk

func (tc *TextChunker) Chunk(text string) []Chunk

Chunk splits the input text into chunks while preserving sentence boundaries and maintaining the specified overlap between chunks. The algorithm:

 1. Splits the text into sentences
 2. Builds chunks by adding sentences until the chunk size limit is reached
 3. Creates overlap with the previous chunk when starting a new chunk
 4. Tracks token counts and sentence indices for each chunk
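The steps above can be sketched as follows, using whitespace word counts as the token counter; this illustrates the overlap mechanism only and is not the package's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// chunkBySentences accumulates whole sentences until a token budget is
// hit, then starts the next chunk with the trailing sentences of the
// previous one as overlap. Token counting here is whitespace words.
func chunkBySentences(sentences []string, chunkSize, overlap int) []string {
	count := func(s string) int { return len(strings.Fields(s)) }
	var chunks []string
	var cur []string
	curTokens := 0
	for _, s := range sentences {
		n := count(s)
		if curTokens+n > chunkSize && len(cur) > 0 {
			chunks = append(chunks, strings.Join(cur, " "))
			// Carry trailing sentences forward until ~overlap tokens.
			var carry []string
			carried := 0
			for i := len(cur) - 1; i >= 0 && carried < overlap; i-- {
				carry = append([]string{cur[i]}, carry...)
				carried += count(cur[i])
			}
			cur, curTokens = carry, carried
		}
		cur = append(cur, s)
		curTokens += n
	}
	if len(cur) > 0 {
		chunks = append(chunks, strings.Join(cur, " "))
	}
	return chunks
}

func main() {
	sents := []string{
		"One two three.",
		"Four five six.",
		"Seven eight nine.",
		"Ten eleven twelve.",
	}
	// chunkSize 6 tokens, overlap 3 tokens: each chunk repeats the
	// previous chunk's last sentence.
	for _, c := range chunkBySentences(sents, 6, 3) {
		fmt.Println(c)
	}
}
```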

type TextChunkerOption

type TextChunkerOption func(*TextChunker)

TextChunkerOption is a function type for configuring TextChunker instances. This follows the functional options pattern for clean and flexible configuration.

type TikTokenCounter

type TikTokenCounter struct {
	// contains filtered or unexported fields
}

TikTokenCounter provides accurate token counting using the tiktoken library, which implements the tokenization schemes used by OpenAI models.

func NewTikTokenCounter

func NewTikTokenCounter(encoding string) (*TikTokenCounter, error)

NewTikTokenCounter creates a new TikTokenCounter using the specified encoding. Common encodings include:

  - "cl100k_base" (GPT-4, ChatGPT)
  - "p50k_base" (GPT-3)
  - "r50k_base" (Codex)

func (*TikTokenCounter) Count

func (ttc *TikTokenCounter) Count(text string) int

Count returns the exact number of tokens in the text according to the specified tiktoken encoding.

type TokenCounter

type TokenCounter interface {
	// Count returns the number of tokens in the given text according to the
	// implementation's tokenization strategy.
	Count(text string) int
}

TokenCounter defines the interface for counting tokens in a string. This abstraction allows for different tokenization strategies (e.g., words, subwords).
