embedder

package

v1.1.9 Latest Latest Go to latest Published: Feb 26, 2025 License: MIT Imports: 11 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/bububa/atomic-agents

Links

Open Source Insights

Documentation ¶

Overview ¶

Package embedder contains Embedder interface and different providers including openai, voyageai, coheren, gemini and huggingface, etc.

Index ¶

func DefaultSentenceSplitter(text string) []string
func EmbedChunk(ctx context.Context, embedder Embedder, chunk *Chunk, embedding *Embedding, ...) error
func SmartSentenceSplitter(text string) []string
type Base64
- func (s Base64) Decode() (*Embedding, error)
type Chunk
type Chunker
type DefaultTokenCounter
- func (dtc *DefaultTokenCounter) Count(text string) int
type EmbeddedChunk
- func EmbedChunks(ctx context.Context, embedder Embedder, chunks []Chunk, ...) ([]EmbeddedChunk, error)
type Embedder
type Embedding
- func (e *Embedding) DotProduct(other *Embedding) (float64, error)
- func (e Embedding) UUID() string
type Option
- func WithModel(model string) Option
- func WithProvider(provider Provider) Option
type Options
- func (i Options) Model() string
- func (i Options) Provider() Provider
type Provider
type TextChunker
- func NewTextChunker(options ...TextChunkerOption) (*TextChunker, error)
- func (tc *TextChunker) Chunk(text string) []Chunk
type TextChunkerOption
type TikTokenCounter
- func NewTikTokenCounter(encoding string) (*TikTokenCounter, error)
- func (ttc *TikTokenCounter) Count(text string) int
type TokenCounter

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func DefaultSentenceSplitter ¶

func DefaultSentenceSplitter(text string) []string

DefaultSentenceSplitter provides a basic implementation for splitting text into sentences. It uses common punctuation marks (., !, ?) as sentence boundaries.

func EmbedChunk ¶

func EmbedChunk(ctx context.Context, embedder Embedder, chunk *Chunk, embedding *Embedding, usage *components.LLMUsage) error

EmbedChunk processes text chunk and generates embeddings. It handles the embedding process in sequence, with debug output for monitoring. The function: 1. Allocates space for the results 2. Processes each chunk through the embedder 3. Creates EmbeddedChunk instances with the results 4. Provides progress information via debug output

func SmartSentenceSplitter ¶

func SmartSentenceSplitter(text string) []string

SmartSentenceSplitter provides an advanced sentence splitting implementation that handles: - Multiple punctuation marks (., !, ?) - Common abbreviations - Quoted sentences - Parenthetical sentences - Lists and enumerations

Types ¶

type Base64 ¶

type Base64 string

Base64 is base64 encoded embedding string.

func (Base64) Decode ¶

func (s Base64) Decode() (*Embedding, error)

Decode decodes base64 encoded string into a slice of floats.

type Chunk ¶

type Chunk struct {
	// Text contains the actual content of the chunk
	Text string
	// TokenSize represents the number of tokens in this chunk
	TokenSize int
	// StartSentence is the index of the first sentence in this chunk
	StartSentence int
	// EndSentence is the index of the last sentence in this chunk (exclusive)
	EndSentence int
}

Chunk represents a piece of text with associated metadata for tracking its position and size within the original document.

type Chunker ¶

type Chunker interface {
	// Chunk splits the input text into a slice of Chunks according to the
	// implementation's strategy.
	Chunk(text string) []Chunk
}

Chunker defines the interface for text chunking implementations. Different implementations can provide various strategies for splitting text while maintaining context and semantic meaning.

type DefaultTokenCounter ¶

type DefaultTokenCounter struct{}

DefaultTokenCounter provides a simple word-based token counting implementation. It splits text on whitespace to approximate token counts. This is suitable for basic use cases but may not accurately reflect subword tokenization used by language models.

func (*DefaultTokenCounter) Count ¶

func (dtc *DefaultTokenCounter) Count(text string) int

Count returns the number of words in the text, using whitespace as a delimiter.

type EmbeddedChunk ¶

type EmbeddedChunk struct {
	// Embedding of their vector representations
	// Multiple embeddings can exist for different models or purposes
	Embedding
	// Chunk is the original chunk content that was embedded
	Chunk *Chunk `json:"text"`
}

EmbeddedChunk represents a chunk of text along with its vector embeddings and associated metadata. This is the core data structure for storing and retrieving embedded content.

func EmbedChunks ¶

func EmbedChunks(ctx context.Context, embedder Embedder, chunks []Chunk, usage *components.LLMUsage) ([]EmbeddedChunk, error)

EmbedChunks processes a slice of text chunks and generates embeddings for each one. It handles the embedding process in sequence, with debug output for monitoring. The function: 1. Allocates space for the results 2. Processes each chunk through the embedder 3. Creates EmbeddedChunk instances with the results 4. Provides progress information via debug output

Returns an error if any chunk fails to embed properly.

type Embedder ¶

type Embedder interface {
	Provider() Provider
	Model() string
	Embed(context.Context, string, *Embedding, *components.LLMUsage) error
	BatchEmbed(ctx context.Context, parts []string, usage *components.LLMUsage) ([]Embedding, error)
	DotProduct(context.Context, *Embedding, *Embedding) (float64, error)
}

type Embedding ¶

type Embedding struct {
	Object    string            `json:"object"`
	Embedding []float64         `json:"embedding"`
	Index     int               `json:"index"`
	Meta      map[string]string `json:"meta,omitempty"`
}

Embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating point numbers, such that the distance between two embeddings in the vector space is correlated with semantic similarity between two inputs in the original format. For example, if two texts are similar, then their vector representations should also be similar.

func (*Embedding) DotProduct ¶

func (e *Embedding) DotProduct(other *Embedding) (float64, error)

DotProduct calculates the dot product of the embedding vector with another embedding vector. Both vectors must have the same length; otherwise, an ErrVectorLengthMismatch is returned. The method returns the calculated dot product as a float32 value.

func (Embedding) UUID ¶

func (e Embedding) UUID() string

type Option ¶

type Option func(*Options)

Option is a function type for configuring the EmbedderConfig. It follows the functional options pattern for clean and flexible configuration.

func WithModel ¶

func WithModel(model string) Option

func WithProvider ¶

func WithProvider(provider Provider) Option

type Options ¶

type Options struct {
	// contains filtered or unexported fields
}

Options holds the configuration for creating an Embedder instance. It supports multiple embedding providers and their specific options.

func (Options) Model ¶

func (i Options) Model() string

func (Options) Provider ¶

func (i Options) Provider() Provider

type Provider ¶

type Provider = string

const (
	ProviderOpenAI      Provider = "OpenAI"
	ProviderVoyageAI    Provider = "VoyageAI"
	ProviderCohere      Provider = "Cohere"
	ProviderGemini      Provider = "Gemini"
	ProviderHuggingFace Provider = "HuggingFace"
)

type TextChunker ¶

type TextChunker struct {
	// ChunkSize is the target size of each chunk in tokens
	ChunkSize int
	// ChunkOverlap is the number of tokens that should overlap between adjacent chunks
	ChunkOverlap int
	// TokenCounter is used to count tokens in text segments
	TokenCounter TokenCounter
	// SentenceSplitter is a function that splits text into sentences
	SentenceSplitter func(string) []string
}

TextChunker provides an advanced implementation of the Chunker interface with support for overlapping chunks and custom tokenization.

func NewTextChunker ¶

func NewTextChunker(options ...TextChunkerOption) (*TextChunker, error)

NewTextChunker creates a new TextChunker with the given options. It uses sensible defaults if no options are provided: - ChunkSize: 200 tokens - ChunkOverlap: 50 tokens - TokenCounter: DefaultTokenCounter - SentenceSplitter: DefaultSentenceSplitter

func (*TextChunker) Chunk ¶

func (tc *TextChunker) Chunk(text string) []Chunk

Chunk splits the input text into chunks while preserving sentence boundaries and maintaining the specified overlap between chunks. The algorithm: 1. Splits the text into sentences 2. Builds chunks by adding sentences until the chunk size limit is reached 3. Creates overlap with previous chunk when starting a new chunk 4. Tracks token counts and sentence indices for each chunk

type TextChunkerOption ¶

type TextChunkerOption func(*TextChunker)

TextChunkerOption is a function type for configuring TextChunker instances. This follows the functional options pattern for clean and flexible configuration.

type TikTokenCounter ¶

type TikTokenCounter struct {
	// contains filtered or unexported fields
}

TikTokenCounter provides accurate token counting using the tiktoken library, which implements the tokenization schemes used by OpenAI models.

func NewTikTokenCounter ¶

func NewTikTokenCounter(encoding string) (*TikTokenCounter, error)

NewTikTokenCounter creates a new TikTokenCounter using the specified encoding. Common encodings include: - "cl100k_base" (GPT-4, ChatGPT) - "p50k_base" (GPT-3) - "r50k_base" (Codex)

func (*TikTokenCounter) Count ¶

func (ttc *TikTokenCounter) Count(text string) int

Count returns the exact number of tokens in the text according to the specified tiktoken encoding.

type TokenCounter ¶

type TokenCounter interface {
	// Count returns the number of tokens in the given text according to the
	// implementation's tokenization strategy.
	Count(text string) int
}

TokenCounter defines the interface for counting tokens in a string. This abstraction allows for different tokenization strategies (e.g., words, subwords).

Source Files ¶

View all Source files

Directories ¶

Path	Synopsis
providers
cohere
gemini
huggingface
openai
voyageai

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL