rag

package
v1.15.2

This package is not in the latest version of its module.

Published: Dec 29, 2025 License: AGPL-3.0 Imports: 33 Imported by: 0

Documentation

Overview

Package rag provides Retrieval-Augmented Generation (RAG) capabilities.

Architecture

The RAG package follows a layered architecture:

┌──────────────────────────────────────────────────────────┐
│  SearchEngine (v2/rag/search.go)                         │
│  • Query processing, retrieval, reranking                │
├──────────────────────────────────────────────────────────┤
│  Chunker (v2/rag/chunker.go)                             │
│  • Content splitting strategies                          │
├──────────────────────────────────────────────────────────┤
│  Shared Foundation                                       │
│  ┌─────────────────────────┐ ┌─────────────────────────┐ │
│  │ v2/vector/provider.go   │ │ v2/embedder/embedder.go │ │
│  └─────────────────────────┘ └─────────────────────────┘ │
└──────────────────────────────────────────────────────────┘

Usage

Basic usage for document ingestion and search:

// Create search engine
engine, err := rag.NewSearchEngine(rag.SearchEngineConfig{
    Provider: vectorProvider,
    Embedder: embedder,
})
if err != nil {
    log.Fatal(err)
}

// Ingest document
engine.IngestDocument(ctx, "doc1", "Document content...", metadata)

// Search
results, err := engine.Search(ctx, "query", 10)
if err != nil {
    log.Fatal(err)
}

Integration with Memory

The RAG package shares the same vector.Provider abstraction as the memory package, allowing both to use the same vector database backend.

Index

Constants

const (
	// MinQueryLength is the minimum allowed query length.
	MinQueryLength = 2
	// MaxQueryLength is the maximum allowed query length.
	MaxQueryLength = 10000
)

Query validation constants (from legacy).
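A minimal sketch of how these bounds might be enforced. The validateQuery helper is hypothetical (not part of the package), and whether the package measures bytes or runes is an assumption; only the constant values come from the documentation:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Constants mirror the package's documented query bounds.
const (
	MinQueryLength = 2
	MaxQueryLength = 10000
)

// validateQuery is a hypothetical helper; it counts runes so that
// multi-byte queries are measured by character rather than byte.
func validateQuery(q string) error {
	n := utf8.RuneCountInString(q)
	if n < MinQueryLength {
		return fmt.Errorf("query too short: %d runes (min %d)", n, MinQueryLength)
	}
	if n > MaxQueryLength {
		return fmt.Errorf("query too long: %d runes (max %d)", n, MaxQueryLength)
	}
	return nil
}

func main() {
	fmt.Println(validateQuery("a"))     // below minimum: returns an error
	fmt.Println(validateQuery("hello")) // within bounds: <nil>
}
```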

Variables

This section is empty.

Functions

func DoWithResult

func DoWithResult[T any](ctx context.Context, r *Retryer, operation string, fn func() (T, error)) (T, error)

DoWithResult executes an operation that returns a value.

func IsRetryExhausted

func IsRetryExhausted(err error) bool

IsRetryExhausted checks if an error is a retry exhaustion error.

func NewVectorProviderFromConfig

func NewVectorProviderFromConfig(cfg *config.VectorStoreConfig) (vector.Provider, error)

NewVectorProviderFromConfig creates a vector provider from configuration.

Types

type APIAuthConfig

type APIAuthConfig struct {
	Type   string            `yaml:"type"`   // "bearer", "basic", "apikey", "oauth2"
	Token  string            `yaml:"token"`  // Token/API key
	User   string            `yaml:"user"`   // Username (for basic auth)
	Pass   string            `yaml:"pass"`   // Password (for basic auth)
	Header string            `yaml:"header"` // Header name (for apikey type)
	Extra  map[string]string `yaml:"extra"`  // Additional auth parameters
}

APIAuthConfig defines authentication for API requests.

Direct port from legacy pkg/context/indexing/api_source.go

type APIEndpointConfig

type APIEndpointConfig struct {
	Path           string            `yaml:"path"`            // API path (relative to baseURL)
	Method         string            `yaml:"method"`          // HTTP method (default: GET)
	Params         map[string]string `yaml:"params"`          // Query parameters
	Headers        map[string]string `yaml:"headers"`         // Additional headers
	Body           string            `yaml:"body"`            // Request body (for POST/PUT)
	IDField        string            `yaml:"id_field"`        // JSON field to use as document ID
	ContentField   string            `yaml:"content_field"`   // JSON field(s) to use as content (comma-separated or JSONPath)
	MetadataFields []string          `yaml:"metadata_fields"` // JSON fields to include as metadata
	UpdatedField   string            `yaml:"updated_field"`   // JSON field for last modified time
	Pagination     *PaginationConfig `yaml:"pagination"`      // Pagination configuration
	Transform      string            `yaml:"transform"`       // Optional JavaScript-like transform function (future)
}

APIEndpointConfig defines an API endpoint to index.

Direct port from legacy pkg/context/indexing/api_source.go
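Based on the yaml tags above, an endpoint definition might be written as follows. All values are illustrative, and the surrounding `endpoints:` key is an assumption about how the list is nested in the store configuration:

```yaml
endpoints:
  - path: /api/v1/articles
    method: GET
    params:
      status: published
    id_field: id
    content_field: title,body
    metadata_fields: [author, tags]
    updated_field: updated_at
```

Setting updated_field also makes SupportsIncrementalIndexing return true for the source.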

type APISource

type APISource struct {
	// contains filtered or unexported fields
}

APISource implements DataSource for REST API endpoints.

Direct port from legacy pkg/context/indexing/api_source.go

func NewAPISource

func NewAPISource(baseURL string, endpoints []APIEndpointConfig, auth *APIAuthConfig) *APISource

NewAPISource creates a new REST API data source.

Direct port from legacy pkg/context/indexing/api_source.go

func (*APISource) Close

func (a *APISource) Close() error

Close closes the API source.

func (*APISource) DiscoverDocuments

func (a *APISource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)

DiscoverDocuments returns channels of discovered documents and errors.

Direct port from legacy pkg/context/indexing/api_source.go

func (*APISource) GetLastModified

func (a *APISource) GetLastModified(ctx context.Context, id string) (time.Time, error)

GetLastModified returns the last modification time for a document.

func (*APISource) ReadDocument

func (a *APISource) ReadDocument(ctx context.Context, id string) (*Document, error)

ReadDocument retrieves a specific document by its ID.

func (*APISource) SupportsIncrementalIndexing

func (a *APISource) SupportsIncrementalIndexing() bool

SupportsIncrementalIndexing returns true if UpdatedField is configured.

func (*APISource) Type

func (a *APISource) Type() string

Type returns the data source type.

type BinaryExtractor

type BinaryExtractor struct {
	// contains filtered or unexported fields
}

BinaryExtractor handles binary files like PDF, DOCX, XLSX using native parsers.

Direct port from legacy pkg/context/extraction/binary_extractor.go

func NewBinaryExtractor

func NewBinaryExtractor(nativeParsers NativeParser) *BinaryExtractor

NewBinaryExtractor creates a new binary extractor.

func (*BinaryExtractor) CanExtract

func (be *BinaryExtractor) CanExtract(path string, mimeType string) bool

CanExtract checks if this extractor can handle the file.

func (*BinaryExtractor) Extract

func (be *BinaryExtractor) Extract(ctx context.Context, path string, fileSize int64) (*ExtractedContent, error)

Extract uses native parsers to extract content from binary files.

func (*BinaryExtractor) Name

func (be *BinaryExtractor) Name() string

Name returns the extractor name.

func (*BinaryExtractor) Priority

func (be *BinaryExtractor) Priority() int

Priority returns medium priority (5).

type Chunk

type Chunk struct {
	// Content is the actual text content of this chunk.
	Content string `json:"content"`

	// Index is the chunk's position within the document (0-based).
	Index int `json:"index"`

	// Total is the total number of chunks for this document.
	Total int `json:"total"`

	// StartLine is the starting line number in the source document (1-based).
	StartLine int `json:"start_line"`

	// EndLine is the ending line number in the source document (1-based).
	EndLine int `json:"end_line"`

	// StartByte is the byte offset where this chunk begins (optional).
	StartByte int `json:"start_byte,omitempty"`

	// EndByte is the byte offset where this chunk ends (optional).
	EndByte int `json:"end_byte,omitempty"`

	// Context provides semantic context for the chunk (function name, type, etc.).
	Context *ChunkContext `json:"context,omitempty"`

	// Metadata contains additional chunk-specific information.
	Metadata map[string]any `json:"metadata,omitempty"`
}

Chunk represents a piece of content with position and context information.

Chunks are the fundamental unit of retrieval in RAG systems. Each chunk:

  • Contains a portion of the original document
  • Tracks its position within the source
  • Preserves semantic context for better retrieval

Derived from legacy pkg/context/chunking/chunker.go:Chunk

type ChunkContext

type ChunkContext struct {
	// FunctionName is the containing function/method name (for code).
	FunctionName string `json:"function_name,omitempty"`

	// TypeName is the containing type/class name (for code).
	TypeName string `json:"type_name,omitempty"`

	// FilePath is the source file path.
	FilePath string `json:"file_path,omitempty"`

	// Language is the detected programming language (for code).
	Language string `json:"language,omitempty"`

	// Section is the document section name (for prose documents).
	Section string `json:"section,omitempty"`

	// ParentID links to a parent chunk (for hierarchical retrieval).
	ParentID string `json:"parent_id,omitempty"`
}

ChunkContext provides semantic context for a chunk.

This is especially useful for code files where understanding the function or type a chunk belongs to improves retrieval quality.

type Chunker

type Chunker interface {
	// Chunk splits content into pieces.
	//
	// The content is split according to the chunker's strategy.
	// Each chunk includes position information (line numbers, byte offsets)
	// for source mapping.
	//
	// Parameters:
	//   - content: the text to split
	//   - ctx: optional context (e.g., from metadata extraction)
	//
	// Returns chunks ordered by position in the original content.
	Chunk(content string, ctx *ChunkContext) ([]Chunk, error)

	// Strategy returns the chunker strategy name.
	Strategy() ChunkerStrategy

	// Config returns the chunker configuration.
	Config() ChunkerConfig
}

Chunker splits content into smaller pieces for indexing.

Chunking is critical for RAG quality:

  • Too small: loses context, retrieves fragments
  • Too large: wastes tokens, dilutes relevance
  • Good chunking: preserves semantic units, enables precise retrieval

Derived from legacy pkg/context/chunking/chunker.go:Chunker
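To make the trade-offs concrete, here is a minimal, self-contained sketch of the overlapping strategy. It is not the package's implementation (which also tracks line numbers, byte offsets, and context), but it shows why overlap preserves context at boundaries:

```go
package main

import "fmt"

// chunkOverlapping splits content into size-rune windows that overlap by
// `overlap` runes, so text near a boundary appears in both neighbors.
func chunkOverlapping(content string, size, overlap int) []string {
	runes := []rune(content)
	step := size - overlap
	if step <= 0 {
		step = size // guard against a degenerate overlap
	}
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break
		}
	}
	return chunks
}

func main() {
	for i, c := range chunkOverlapping("abcdefghij", 4, 2) {
		fmt.Printf("chunk %d: %q\n", i, c) // abcd, cdef, efgh, ghij
	}
}
```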

func NewChunker

func NewChunker(cfg ChunkerConfig) (Chunker, error)

NewChunker creates a chunker from configuration.

func NewChunkerFromConfig

func NewChunkerFromConfig(cfg *config.ChunkingConfig) (Chunker, error)

NewChunkerFromConfig creates a chunker from configuration.

type ChunkerConfig

type ChunkerConfig struct {
	// Strategy is the chunking strategy.
	// Values: "simple", "overlapping", "semantic"
	// Default: "simple"
	Strategy ChunkerStrategy `yaml:"strategy,omitempty"`

	// Size is the target chunk size in characters.
	// Default: 1000
	Size int `yaml:"size,omitempty"`

	// Overlap is the overlap size in characters (for overlapping strategy).
	// Default: 200
	Overlap int `yaml:"overlap,omitempty"`

	// MinSize is the minimum chunk size (chunks smaller than this are merged).
	// Default: 100
	MinSize int `yaml:"min_size,omitempty"`

	// MaxSize is the maximum chunk size (hard limit).
	// Default: 2000
	MaxSize int `yaml:"max_size,omitempty"`

	// Separators are the preferred split points for semantic chunking.
	// Default: ["\n\n", "\n", ". ", " "]
	Separators []string `yaml:"separators,omitempty"`

	// PreserveWords avoids splitting in the middle of words.
	// Default: true
	PreserveWords bool `yaml:"preserve_words,omitempty"`
}

ChunkerConfig configures chunking behavior.
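Using the yaml tags and the documented defaults, a configuration fragment might look like this. The top-level `chunking:` key is an assumption; the field values shown are the documented defaults except for the strategy:

```yaml
chunking:
  strategy: overlapping  # "simple", "overlapping", or "semantic"
  size: 1000             # target chunk size in characters
  overlap: 200           # overlap between adjacent chunks
  min_size: 100          # smaller chunks are merged
  max_size: 2000         # hard upper limit
  preserve_words: true
```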

func DefaultChunkerConfig

func DefaultChunkerConfig() ChunkerConfig

DefaultChunkerConfig returns sensible defaults.

func (*ChunkerConfig) SetDefaults

func (c *ChunkerConfig) SetDefaults()

SetDefaults applies default values.

func (*ChunkerConfig) Validate

func (c *ChunkerConfig) Validate() error

Validate checks the configuration for errors.

type ChunkerStrategy

type ChunkerStrategy string

ChunkerStrategy identifies a chunking strategy.

const (
	// ChunkerSimple splits content by fixed character count.
	// Fast but may split mid-sentence/word.
	ChunkerSimple ChunkerStrategy = "simple"

	// ChunkerOverlapping splits with overlap between chunks.
	// Better for retrieval as context is preserved at boundaries.
	ChunkerOverlapping ChunkerStrategy = "overlapping"

	// ChunkerSemantic splits at natural boundaries (paragraphs, sections).
	// Best quality but more complex and slower.
	ChunkerSemantic ChunkerStrategy = "semantic"
)

type ChunkingError

type ChunkingError struct {
	Strategy   string // Chunking strategy
	DocumentID string // Document ID
	Message    string // Error message
	Err        error  // Underlying error
}

ChunkingError represents an error during document chunking.

func NewChunkingError

func NewChunkingError(strategy, documentID, message string, err error) *ChunkingError

NewChunkingError creates a new ChunkingError.

func (*ChunkingError) Error

func (e *ChunkingError) Error() string

Error implements the error interface.

func (*ChunkingError) Unwrap

func (e *ChunkingError) Unwrap() error

Unwrap returns the underlying error.

type CodeMetadata

type CodeMetadata struct {
	Functions []FunctionInfo         `json:"functions,omitempty"`
	Types     []TypeInfo             `json:"types,omitempty"`
	Imports   []string               `json:"imports,omitempty"`
	Symbols   map[string]interface{} `json:"symbols,omitempty"`
	Custom    map[string]interface{} `json:"custom,omitempty"`
}

CodeMetadata contains extracted code structure information.

Direct port from legacy pkg/context/metadata/extractor.go

type CollectionSource

type CollectionSource struct {
	// contains filtered or unexported fields
}

CollectionSource implements DataSource for collection-only stores. It is a no-op source that indexes nothing; use it when a document store points to an existing collection that is already populated.

Direct port from legacy pkg/context/indexing/collection_source.go

func NewCollectionSource

func NewCollectionSource(collectionName string) *CollectionSource

NewCollectionSource creates a new collection-only data source.

func (*CollectionSource) Close

func (cs *CollectionSource) Close() error

Close closes the collection source.

func (*CollectionSource) CollectionName

func (cs *CollectionSource) CollectionName() string

CollectionName returns the collection name.

func (*CollectionSource) DiscoverDocuments

func (cs *CollectionSource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)

DiscoverDocuments returns empty channels - no documents to index.

func (*CollectionSource) GetLastModified

func (cs *CollectionSource) GetLastModified(ctx context.Context, id string) (time.Time, error)

GetLastModified returns zero time - not supported for collection sources.

func (*CollectionSource) ReadDocument

func (cs *CollectionSource) ReadDocument(ctx context.Context, id string) (*Document, error)

ReadDocument returns an error - not supported for collection sources.

func (*CollectionSource) SupportsIncrementalIndexing

func (cs *CollectionSource) SupportsIncrementalIndexing() bool

SupportsIncrementalIndexing returns false.

func (*CollectionSource) Type

func (cs *CollectionSource) Type() string

Type returns the data source type.

type ContentExtractor

type ContentExtractor interface {
	// Name returns the extractor name for logging/debugging.
	Name() string

	// CanExtract determines if this extractor can handle the given file.
	CanExtract(path string, mimeType string) bool

	// Extract extracts content from the file.
	Extract(ctx context.Context, path string, fileSize int64) (*ExtractedContent, error)

	// Priority returns the priority (higher = preferred when multiple extractors match).
	Priority() int
}

ContentExtractor defines the interface for extracting content from files.

Direct port from legacy pkg/context/extraction/extractor.go

type DBPoolAdapter

type DBPoolAdapter struct {
	// contains filtered or unexported fields
}

DBPoolAdapter wraps config.DBPool to provide sql.DB connections.

func NewDBPoolAdapter

func NewDBPoolAdapter(pool *config.DBPool, cfg *config.Config) *DBPoolAdapter

NewDBPoolAdapter creates an adapter for the DBPool.

func (*DBPoolAdapter) Get

func (a *DBPoolAdapter) Get(name string) (*sql.DB, string, error)

Get returns a database connection for the given database name.

type DataSource

type DataSource interface {
	// Type returns the type of data source (e.g., "directory", "sql", "api", "s3")
	Type() string

	// DiscoverDocuments returns a channel of discovered documents and a channel of errors.
	// Documents are discovered asynchronously and sent through the channel.
	// For file sources, content should be read from files.
	// For SQL/API sources, content should already be populated.
	DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)

	// ReadDocument retrieves a specific document by its ID.
	// The ID format depends on the source type (file path, SQL row ID, API endpoint, etc.)
	ReadDocument(ctx context.Context, id string) (*Document, error)

	// SupportsIncrementalIndexing indicates if this source supports incremental updates
	// based on modification timestamps or change tracking.
	SupportsIncrementalIndexing() bool

	// GetLastModified returns the last modification time for a document, if available.
	// Returns zero time if not supported or document doesn't exist.
	GetLastModified(ctx context.Context, id string) (time.Time, error)

	// Close releases any resources held by the data source.
	Close() error
}

DataSource represents a generic source of documents to be indexed. It abstracts over filesystem, SQL databases, REST APIs, and cloud storage.

Direct port from legacy pkg/context/indexing/data_source.go

func NewDataSourceFromConfig

func NewDataSourceFromConfig(cfg *config.DocumentSourceConfig, deps *FactoryDeps) (DataSource, error)

NewDataSourceFromConfig creates a data source from configuration.

func NewDirectorySourceFromConfig

func NewDirectorySourceFromConfig(cfg DirectorySourceConfig) (DataSource, error)

NewDirectorySourceFromConfig creates a directory source from config.

type DirectorySource

type DirectorySource struct {
	// contains filtered or unexported fields
}

DirectorySource implements DataSource for local filesystem directories.

Direct port from legacy pkg/context/indexing/directory_source.go

func NewDirectorySource

func NewDirectorySource(basePath string, filter FileFilter, maxFileSize int64) *DirectorySource

NewDirectorySource creates a new directory-based data source.

Direct port from legacy pkg/context/indexing/directory_source.go

func (*DirectorySource) Close

func (ds *DirectorySource) Close() error

Close releases any resources held by the data source.

func (*DirectorySource) DiscoverDocuments

func (ds *DirectorySource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)

DiscoverDocuments returns channels of discovered documents and errors. Documents are discovered asynchronously and sent through the channel.

Direct port from legacy pkg/context/indexing/directory_source.go

func (*DirectorySource) GetBasePath

func (ds *DirectorySource) GetBasePath() string

GetBasePath returns the base directory path (helper method).

func (*DirectorySource) GetFilter

func (ds *DirectorySource) GetFilter() FileFilter

GetFilter returns the file filter (helper method).

func (*DirectorySource) GetLastModified

func (ds *DirectorySource) GetLastModified(ctx context.Context, id string) (time.Time, error)

GetLastModified returns the last modification time for a document.

func (*DirectorySource) ReadDocument

func (ds *DirectorySource) ReadDocument(ctx context.Context, id string) (*Document, error)

ReadDocument retrieves a specific document by its ID (file path).

Direct port from legacy pkg/context/indexing/directory_source.go

func (*DirectorySource) SupportsIncrementalIndexing

func (ds *DirectorySource) SupportsIncrementalIndexing() bool

SupportsIncrementalIndexing returns true as directory sources support incremental indexing.

func (*DirectorySource) Type

func (ds *DirectorySource) Type() string

Type returns the data source type.

type DirectorySourceConfig

type DirectorySourceConfig struct {
	Path        string
	Include     []string
	Exclude     []string
	MaxFileSize int64 // Max file size in bytes to process (0 for no limit)
}

DirectorySourceConfig configures a directory data source.

func DefaultDirectorySourceConfig

func DefaultDirectorySourceConfig(path string) DirectorySourceConfig

DefaultDirectorySourceConfig returns sensible defaults for directory source. Includes both text-based source code files and binary document formats that can be parsed by native parsers (PDF, DOCX, XLSX).

type Document

type Document struct {
	// ID is the unique identifier for this document.
	ID string `json:"id"`

	// Content is the text content to be indexed.
	Content string `json:"content"`

	// Title is the document title (optional).
	Title string `json:"title,omitempty"`

	// SourcePath is the path to the source file (for file-based documents).
	SourcePath string `json:"source_path,omitempty"`

	// MimeType is the content type (e.g., "text/plain", "text/markdown").
	MimeType string `json:"mime_type,omitempty"`

	// Size is the content size in bytes.
	Size int64 `json:"size"`

	// Metadata contains additional document information.
	Metadata map[string]any `json:"metadata,omitempty"`
}

Document represents a document to be indexed.

Documents go through the following pipeline:

  1. Content extraction (if binary)
  2. Chunking (split into searchable pieces)
  3. Embedding (convert to vectors)
  4. Indexing (store in vector database)
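The four stages can be sketched end to end. Every helper below (extractText, splitChunks, embed, storeVector) is a hypothetical stand-in for the package's extractor, chunker, embedder, and vector provider:

```go
package main

import (
	"fmt"
	"strings"
)

// Toy stand-ins for the four pipeline stages described above.
func extractText(raw []byte) string    { return string(raw) }                   // 1. extraction
func splitChunks(text string) []string { return strings.Split(text, "\n\n") }  // 2. chunking
func embed(chunk string) []float32     { return []float32{float32(len(chunk))} } // 3. embedding
func storeVector(id string, v []float32) {}                                    // 4. indexing (no-op)

// indexDocument pushes one document through all four stages and reports
// how many chunks were produced.
func indexDocument(id string, raw []byte) int {
	text := extractText(raw)
	chunks := splitChunks(text)
	for i, c := range chunks {
		storeVector(fmt.Sprintf("%s#%d", id, i), embed(c))
	}
	return len(chunks)
}

func main() {
	n := indexDocument("doc1", []byte("First paragraph.\n\nSecond paragraph."))
	fmt.Println("indexed chunks:", n) // indexed chunks: 2
}
```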

type DocumentEvent

type DocumentEvent struct {
	Type     DocumentEventType
	Document Document
	Error    error
}

DocumentEvent represents a change in a document.

type DocumentEventType

type DocumentEventType string

DocumentEventType indicates the type of change.

const (
	DocumentEventCreate DocumentEventType = "create"
	DocumentEventUpdate DocumentEventType = "update"
	DocumentEventDelete DocumentEventType = "delete"
	DocumentEventError  DocumentEventType = "error"
)

type DocumentStore

type DocumentStore struct {
	// contains filtered or unexported fields
}

DocumentStore manages document indexing and search.

It combines:

  • DataSource: Where documents come from
  • ContentExtractor: How to extract text from documents
  • SearchEngine: How to index and search
  • File watching: Automatic re-indexing on changes
  • Concurrent indexing with configurable worker pool
  • Retry logic for transient failures
  • Checkpoint/resume for interrupted indexing
  • Progress tracking with ETA

Direct port from legacy pkg/context/document_store.go

func NewDocumentStore

func NewDocumentStore(cfg DocumentStoreConfig) (*DocumentStore, error)

NewDocumentStore creates a new document store.

func NewDocumentStoreFromConfig

func NewDocumentStoreFromConfig(
	name string,
	storeCfg *config.DocumentStoreConfig,
	deps *FactoryDeps,
) (*DocumentStore, error)

NewDocumentStoreFromConfig creates a document store from configuration.

func (*DocumentStore) Clear

func (s *DocumentStore) Clear(ctx context.Context) error

Clear removes all indexed documents.

func (*DocumentStore) Close

func (s *DocumentStore) Close() error

Close stops watching and releases resources.

func (*DocumentStore) Collection

func (s *DocumentStore) Collection() string

Collection returns the collection name.

func (*DocumentStore) Config

func (s *DocumentStore) Config() DocumentStoreConfig

Config returns the store configuration.

func (*DocumentStore) GetDocument

func (s *DocumentStore) GetDocument(ctx context.Context, id string) (*SearchResult, error)

GetDocument retrieves a specific document by ID.

Direct port from legacy pkg/context/document_store.go

func (*DocumentStore) GetSearchEngine

func (s *DocumentStore) GetSearchEngine() *SearchEngine

GetSearchEngine returns the underlying search engine.

Direct port from legacy pkg/context/document_store.go

func (*DocumentStore) HealthCheck

func (s *DocumentStore) HealthCheck(ctx context.Context) HealthCheck

HealthCheck checks the health of the DocumentStore.

func (*DocumentStore) Index

func (s *DocumentStore) Index(ctx context.Context) error

Index indexes all documents from the source with concurrent processing.

Uses channel-based DiscoverDocuments from legacy architecture with worker pool for concurrent indexing (like legacy indexingSemaphore). Supports checkpoint/resume for interrupted indexing.

Direct port from legacy pkg/context/document_store_indexing.go

func (*DocumentStore) Metrics

func (s *DocumentStore) Metrics() IndexMetricsSnapshot

Metrics returns detailed indexing metrics.

func (*DocumentStore) Name

func (s *DocumentStore) Name() string

Name returns the store name.

func (*DocumentStore) RefreshDocument

func (s *DocumentStore) RefreshDocument(ctx context.Context, docID string) error

RefreshDocument re-indexes a single document by path.

Direct port from legacy pkg/context/document_store.go

func (*DocumentStore) RegisterExtractor

func (s *DocumentStore) RegisterExtractor(e ContentExtractor)

RegisterExtractor adds a custom content extractor.

func (*DocumentStore) Search

Search searches for documents.

func (*DocumentStore) SearchWithFilter

func (s *DocumentStore) SearchWithFilter(ctx context.Context, query string, topK int, filter map[string]any) (*SearchResponse, error)

SearchWithFilter searches with metadata filtering.

func (*DocumentStore) StartWatching

func (s *DocumentStore) StartWatching(ctx context.Context) error

StartWatching starts watching for document changes.

Direct port from legacy pkg/context/document_store.go

func (*DocumentStore) Stats

func (s *DocumentStore) Stats() DocumentStoreStats

Stats returns indexing statistics.

func (*DocumentStore) StopWatching

func (s *DocumentStore) StopWatching()

StopWatching stops watching for changes.

type DocumentStoreConfig

type DocumentStoreConfig struct {
	// Name identifies this store.
	Name string

	// Description describes the store (used by SearchTool).
	Description string

	// Source provides documents.
	Source DataSource

	// SearchEngine for indexing and search.
	SearchEngine *SearchEngine

	// Chunker for splitting documents (optional, defaults to engine's chunker).
	Chunker Chunker

	// Collection name (optional, defaults to store name).
	Collection string

	// SourcePath is the base path for checkpoints (auto-detected from directory source).
	SourcePath string

	// Watch enables file watching for automatic re-indexing.
	Watch bool

	// IncrementalIndexing only re-indexes changed documents.
	IncrementalIndexing bool

	// EnableCheckpoints enables resume capability for interrupted indexing.
	// Checkpoints are saved to .hector/checkpoints/ in the source path.
	// Default: true for directory sources
	EnableCheckpoints bool

	// EnableProgress enables progress display during indexing.
	// Default: true
	EnableProgress bool

	// Search configuration for advanced features.
	Search *SearchOptions

	// MaxConcurrentIndexing limits parallel document processing (default: NumCPU).
	// Set to 1 for sequential indexing (legacy behavior).
	MaxConcurrentIndexing int

	// RetryConfig for transient failure handling (optional).
	RetryConfig *RetryConfig
}

DocumentStoreConfig configures a document store.

type DocumentStoreError

type DocumentStoreError struct {
	StoreName string    // Name of the document store
	Operation string    // Operation that failed
	Message   string    // Error message
	FilePath  string    // File path if applicable
	Err       error     // Underlying error
	Timestamp time.Time // When the error occurred
}

DocumentStoreError represents an error in document store operations.

Inspired by legacy pkg/context/document_store.go error handling

func NewDocumentStoreError

func NewDocumentStoreError(storeName, operation, message, filePath string, err error) *DocumentStoreError

NewDocumentStoreError creates a new DocumentStoreError.

func (*DocumentStoreError) Error

func (e *DocumentStoreError) Error() string

Error implements the error interface.

func (*DocumentStoreError) Unwrap

func (e *DocumentStoreError) Unwrap() error

Unwrap returns the underlying error.

type DocumentStoreStats

type DocumentStoreStats struct {
	Name          string  `json:"name"`
	Collection    string  `json:"collection"`
	IndexedCount  int     `json:"indexed_count"`
	WatchEnabled  bool    `json:"watch_enabled"`
	SourceType    string  `json:"source_type"`
	TotalDocs     int64   `json:"total_docs"`
	SkippedDocs   int64   `json:"skipped_docs"`
	ErrorDocs     int64   `json:"error_docs"`
	DocsPerSecond float64 `json:"docs_per_second"`
	SearchCount   int64   `json:"search_count"`
}

DocumentStoreStats contains store statistics.

type ExtractedContent

type ExtractedContent struct {
	Content          string            // The extracted text content
	Title            string            // Document title (if available)
	Author           string            // Document author (if available)
	Metadata         map[string]string // Additional metadata
	ProcessingTimeMs int64             // Time taken to extract
	ExtractorName    string            // Name of extractor used
}

ExtractedContent represents extracted file content with metadata.

Direct port from legacy pkg/context/extraction/extractor.go

type ExtractionError

type ExtractionError struct {
	Extractor string // Extractor name
	FilePath  string // File path
	Message   string // Error message
	Err       error  // Underlying error
}

ExtractionError represents an error during content extraction.

func NewExtractionError

func NewExtractionError(extractor, filePath, message string, err error) *ExtractionError

NewExtractionError creates a new ExtractionError.

func (*ExtractionError) Error

func (e *ExtractionError) Error() string

Error implements the error interface.

func (*ExtractionError) Unwrap

func (e *ExtractionError) Unwrap() error

Unwrap returns the underlying error.

type ExtractorRegistry

type ExtractorRegistry struct {
	// contains filtered or unexported fields
}

ExtractorRegistry manages multiple content extractors.

Direct port from legacy pkg/context/extraction/extractor.go

func NewExtractorRegistry

func NewExtractorRegistry() *ExtractorRegistry

NewExtractorRegistry creates a new extractor registry with default extractors. Registers:

  • BinaryExtractor (priority 5): PDF, DOCX, XLSX via native parsers
  • TextExtractor (priority 1): Plain text files
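The selection rule implied by Priority can be sketched with a minimal slice of the interface: among extractors whose CanExtract matches, pick the highest priority. The two toy extractors and the pick function below are illustrative, not the registry's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// Minimal slice of the documented ContentExtractor interface.
type extractor interface {
	Name() string
	CanExtract(path, mimeType string) bool
	Priority() int
}

type textExtractor struct{}

func (textExtractor) Name() string                 { return "text" }
func (textExtractor) CanExtract(_, mt string) bool { return strings.HasPrefix(mt, "text/") }
func (textExtractor) Priority() int                { return 1 }

type binaryExtractor struct{}

func (binaryExtractor) Name() string                { return "binary" }
func (binaryExtractor) CanExtract(p, _ string) bool { return strings.HasSuffix(p, ".pdf") }
func (binaryExtractor) Priority() int               { return 5 }

// pick mirrors the registry's rule: among extractors that can handle the
// file, prefer the one with the highest priority.
func pick(extractors []extractor, path, mimeType string) extractor {
	var best extractor
	for _, e := range extractors {
		if e.CanExtract(path, mimeType) && (best == nil || e.Priority() > best.Priority()) {
			best = e
		}
	}
	return best
}

func main() {
	reg := []extractor{textExtractor{}, binaryExtractor{}}
	fmt.Println(pick(reg, "report.pdf", "application/pdf").Name()) // binary
	fmt.Println(pick(reg, "notes.md", "text/markdown").Name())     // text
}
```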

func (*ExtractorRegistry) Extract

Extract tries to extract content using the best available extractor. Adapts the document-based interface for store.go compatibility.

func (*ExtractorRegistry) ExtractContent

func (r *ExtractorRegistry) ExtractContent(ctx context.Context, path string, mimeType string, fileSize int64) (*ExtractedContent, error)

ExtractContent tries to extract content using the best available extractor.

func (*ExtractorRegistry) GetExtractors

func (r *ExtractorRegistry) GetExtractors() []ContentExtractor

GetExtractors returns all registered extractors (for debugging).

func (*ExtractorRegistry) HasExtractorForFile

func (r *ExtractorRegistry) HasExtractorForFile(path string, mimeType string) bool

HasExtractorForFile checks if any extractor can handle the given file. This is useful for determining if a file can be indexed before attempting extraction.

func (*ExtractorRegistry) Register

func (r *ExtractorRegistry) Register(extractor ContentExtractor)

Register adds an extractor to the registry.

type FactoryDeps

type FactoryDeps struct {
	// DBPool provides database connections.
	DBPool *config.DBPool

	// VectorProviders maps provider names to instances.
	VectorProviders map[string]vector.Provider

	// Embedders maps embedder names to instances.
	Embedders map[string]embedder.Embedder

	// LLMs maps LLM names to instances.
	LLMs map[string]model.LLM

	// ToolCaller provides access to MCP tools for document parsing.
	// Optional - only needed if MCPParsers is configured.
	ToolCaller ToolCaller

	// Config is the root configuration.
	Config *config.Config
}

FactoryDeps provides dependencies for creating RAG components.

type FileCheckpoint

type FileCheckpoint struct {
	Path        string    `json:"path"`
	Hash        string    `json:"hash"`
	Size        int64     `json:"size"`
	ModTime     time.Time `json:"mod_time"`
	Status      string    `json:"status"` // "indexed", "skipped", "failed"
	ProcessedAt time.Time `json:"processed_at"`
}

FileCheckpoint contains information about a processed file.

Direct port from legacy pkg/context/checkpoint.go

type FileFilter

type FileFilter interface {
	ShouldInclude(path string) bool
	ShouldExclude(path string) bool
}

FileFilter determines if a file should be indexed.

Direct port from legacy pkg/context/indexing/data_source.go:FileFilter

type FileWatcher

type FileWatcher struct {
	// contains filtered or unexported fields
}

FileWatcher watches a directory for file changes using fsnotify.

Direct port from legacy pkg/context/document_store.go fsnotify watching

func NewFileWatcher

func NewFileWatcher(cfg FileWatcherConfig) (*FileWatcher, error)

NewFileWatcher creates a new file watcher.

func (*FileWatcher) IsWatching

func (fw *FileWatcher) IsWatching() bool

IsWatching returns whether the watcher is active.

func (*FileWatcher) Start

func (fw *FileWatcher) Start(ctx context.Context) (<-chan DocumentEvent, error)

Start begins watching the directory for changes.

func (*FileWatcher) Stop

func (fw *FileWatcher) Stop() error

Stop stops watching for changes.

type FileWatcherConfig

type FileWatcherConfig struct {
	BasePath      string
	Filter        FileFilter
	DebounceDelay time.Duration // Delay before processing events (default: 100ms)
}

FileWatcherConfig configures the file watcher.

type FunctionInfo

type FunctionInfo struct {
	Name       string `json:"name"`
	Signature  string `json:"signature,omitempty"`
	StartLine  int    `json:"start_line"`
	EndLine    int    `json:"end_line"`
	Receiver   string `json:"receiver,omitempty"` // For methods
	IsExported bool   `json:"is_exported,omitempty"`
	DocComment string `json:"doc_comment,omitempty"`
}

FunctionInfo contains information about a function.

Direct port from legacy pkg/context/metadata/extractor.go

type GoMetadataExtractor

type GoMetadataExtractor struct{}

GoMetadataExtractor extracts metadata from Go source files using AST parsing.

Direct port from legacy pkg/context/metadata/go_extractor.go

func NewGoMetadataExtractor

func NewGoMetadataExtractor() *GoMetadataExtractor

NewGoMetadataExtractor creates a new Go metadata extractor.

func (*GoMetadataExtractor) CanExtract

func (ge *GoMetadataExtractor) CanExtract(language string) bool

CanExtract checks if this extractor can handle the language.

func (*GoMetadataExtractor) Extract

func (ge *GoMetadataExtractor) Extract(content string, filePath string) (*CodeMetadata, error)

Extract parses Go source code and extracts metadata.

func (*GoMetadataExtractor) Name

func (ge *GoMetadataExtractor) Name() string

Name returns the extractor name.

type HealthCheck

type HealthCheck struct {
	// Component name.
	Component string `json:"component"`

	// Status of the component.
	Status HealthStatus `json:"status"`

	// Message provides details about the status.
	Message string `json:"message,omitempty"`

	// Latency of the health check.
	Latency time.Duration `json:"latency_ms"`

	// Timestamp of the check.
	Timestamp time.Time `json:"timestamp"`

	// Details contains component-specific health information.
	Details map[string]any `json:"details,omitempty"`
}

HealthCheck represents the result of a health check.

func (HealthCheck) IsHealthy

func (h HealthCheck) IsHealthy() bool

IsHealthy returns true if the status is healthy.

type HealthChecker

type HealthChecker interface {
	HealthCheck(ctx context.Context) HealthCheck
}

HealthChecker is an interface for components that support health checking.

type HealthStatus

type HealthStatus string

HealthStatus represents the health state of a component.

const (
	// HealthStatusHealthy indicates the component is functioning normally.
	HealthStatusHealthy HealthStatus = "healthy"

	// HealthStatusDegraded indicates the component is functioning but with issues.
	HealthStatusDegraded HealthStatus = "degraded"

	// HealthStatusUnhealthy indicates the component is not functioning.
	HealthStatusUnhealthy HealthStatus = "unhealthy"
)

type HyDE

type HyDE struct {
	// contains filtered or unexported fields
}

HyDE implements Hypothetical Document Embeddings.

Instead of searching with the query embedding directly, HyDE:

  1. Uses an LLM to generate a hypothetical document that would answer the query
  2. Embeds the hypothetical document
  3. Uses that embedding for search

This can significantly improve retrieval for questions, as the hypothetical document's embedding is often closer to actual relevant documents than the query embedding is.

Paper: "Precise Zero-Shot Dense Retrieval without Relevance Labels" https://arxiv.org/abs/2212.10496

Derived from legacy pkg/context/hyde.go

func NewHyDE

func NewHyDE(llm model.LLM) *HyDE

NewHyDE creates a new HyDE processor.

func (*HyDE) EnhancedSearch

func (h *HyDE) EnhancedSearch(ctx context.Context, query string) (hypotheticalDoc string, err error)

EnhancedSearch performs HyDE-enhanced search.

This is a convenience method that:

  1. Generates a hypothetical document
  2. Returns the hypothetical document for the caller to embed

The caller should embed the hypothetical doc instead of the query.

func (*HyDE) GenerateHypotheticalDocument

func (h *HyDE) GenerateHypotheticalDocument(ctx context.Context, query string) (string, error)

GenerateHypotheticalDocument generates a hypothetical document for the query.

type IndexCheckpoint

type IndexCheckpoint struct {
	Version        string                    `json:"version"`
	StoreName      string                    `json:"store_name"`
	SourcePath     string                    `json:"source_path"`
	StartTime      time.Time                 `json:"start_time"`
	LastUpdate     time.Time                 `json:"last_update"`
	ProcessedFiles map[string]FileCheckpoint `json:"processed_files"`
	TotalFiles     int                       `json:"total_files"`
	IndexedCount   int                       `json:"indexed_count"`
	SkippedCount   int                       `json:"skipped_count"`
	FailedCount    int                       `json:"failed_count"`
}

IndexCheckpoint represents a saved indexing checkpoint.

Direct port from legacy pkg/context/checkpoint.go

type IndexCheckpointManager

type IndexCheckpointManager struct {
	// contains filtered or unexported fields
}

IndexCheckpointManager manages indexing checkpoints.

Direct port from legacy pkg/context/checkpoint.go

func NewIndexCheckpointManager

func NewIndexCheckpointManager(storeName, sourcePath string, enabled bool) *IndexCheckpointManager

NewIndexCheckpointManager creates a new checkpoint manager.

Direct port from legacy pkg/context/checkpoint.go

func (*IndexCheckpointManager) ClearCheckpoint

func (cm *IndexCheckpointManager) ClearCheckpoint() error

ClearCheckpoint removes the checkpoint file.

func (*IndexCheckpointManager) ForceSave

func (cm *IndexCheckpointManager) ForceSave() error

ForceSave forces a checkpoint save regardless of the save interval.

func (*IndexCheckpointManager) FormatCheckpointInfo

func (cm *IndexCheckpointManager) FormatCheckpointInfo(checkpoint *IndexCheckpoint) string

FormatCheckpointInfo returns a human-readable checkpoint summary.

func (*IndexCheckpointManager) GetProcessedCount

func (cm *IndexCheckpointManager) GetProcessedCount() int

GetProcessedCount returns the number of processed files.

func (*IndexCheckpointManager) IsEnabled

func (cm *IndexCheckpointManager) IsEnabled() bool

IsEnabled returns whether checkpointing is enabled.

func (*IndexCheckpointManager) LoadCheckpoint

func (cm *IndexCheckpointManager) LoadCheckpoint() (*IndexCheckpoint, error)

LoadCheckpoint attempts to load an existing checkpoint.

func (*IndexCheckpointManager) RecordFile

func (cm *IndexCheckpointManager) RecordFile(path string, size int64, modTime time.Time, status string)

RecordFile records a processed file in the checkpoint.

func (*IndexCheckpointManager) SaveCheckpoint

func (cm *IndexCheckpointManager) SaveCheckpoint() error

SaveCheckpoint saves the current checkpoint.

func (*IndexCheckpointManager) SetTotalFiles

func (cm *IndexCheckpointManager) SetTotalFiles(total int)

SetTotalFiles sets the total file count.

func (*IndexCheckpointManager) ShouldProcessFile

func (cm *IndexCheckpointManager) ShouldProcessFile(path string, size int64, modTime time.Time) bool

ShouldProcessFile checks if a file should be processed (not in checkpoint or changed).

type IndexError

type IndexError struct {
	StoreName  string // Document store name
	DocumentID string // Document ID
	Operation  string // Operation (e.g., "embed", "upsert", "delete")
	Message    string // Error message
	Err        error  // Underlying error
}

IndexError represents an error during indexing operations.

func NewIndexError

func NewIndexError(storeName, documentID, operation, message string, err error) *IndexError

NewIndexError creates a new IndexError.

func (*IndexError) Error

func (e *IndexError) Error() string

Error implements the error interface.

func (*IndexError) Unwrap

func (e *IndexError) Unwrap() error

Unwrap returns the underlying error.

type IndexMetrics

type IndexMetrics struct {
	// contains filtered or unexported fields
}

IndexMetrics tracks document store indexing metrics.

Thread-safe for concurrent access during indexing.

func NewIndexMetrics

func NewIndexMetrics(storeName string) *IndexMetrics

NewIndexMetrics creates a new metrics tracker.

func (*IndexMetrics) IncrementErrors

func (m *IndexMetrics) IncrementErrors()

IncrementErrors increments error count.

func (*IndexMetrics) IncrementIndexed

func (m *IndexMetrics) IncrementIndexed()

IncrementIndexed increments indexed document count.

func (*IndexMetrics) IncrementSkipped

func (m *IndexMetrics) IncrementSkipped()

IncrementSkipped increments skipped document count.

func (*IndexMetrics) IncrementTotal

func (m *IndexMetrics) IncrementTotal()

IncrementTotal increments total document count.

func (*IndexMetrics) RecordSearch

func (m *IndexMetrics) RecordSearch(latency time.Duration)

RecordSearch records a search operation with latency.

func (*IndexMetrics) Reset

func (m *IndexMetrics) Reset()

Reset clears all metrics.

func (*IndexMetrics) SetEndTime

func (m *IndexMetrics) SetEndTime(t time.Time)

SetEndTime sets the indexing end time.

func (*IndexMetrics) SetStartTime

func (m *IndexMetrics) SetStartTime(t time.Time)

SetStartTime sets the indexing start time.

func (*IndexMetrics) Snapshot

func (m *IndexMetrics) Snapshot() IndexMetricsSnapshot

Snapshot returns a point-in-time copy of all metrics.

type IndexMetricsSnapshot

type IndexMetricsSnapshot struct {
	StoreName         string        `json:"store_name"`
	TotalDocs         int64         `json:"total_docs"`
	IndexedDocs       int64         `json:"indexed_docs"`
	SkippedDocs       int64         `json:"skipped_docs"`
	ErrorDocs         int64         `json:"error_docs"`
	DocsPerSecond     float64       `json:"docs_per_second"`
	StartTime         time.Time     `json:"start_time,omitempty"`
	EndTime           time.Time     `json:"end_time,omitempty"`
	SearchCount       int64         `json:"search_count"`
	AvgSearchLatency  time.Duration `json:"avg_search_latency_ns"`
	MaxSearchLatency  time.Duration `json:"max_search_latency_ns"`
	LastSearchLatency time.Duration `json:"last_search_latency_ns"`
}

IndexMetricsSnapshot is a point-in-time copy of metrics.

type LLMQueryExpander

type LLMQueryExpander struct {
	// contains filtered or unexported fields
}

LLMQueryExpander uses an LLM to generate query variations.

Direct port from legacy pkg/context/query_expansion.go

func NewLLMQueryExpander

func NewLLMQueryExpander(llm model.LLM) *LLMQueryExpander

NewLLMQueryExpander creates a new LLM-based query expander.

func (*LLMQueryExpander) Expand

func (e *LLMQueryExpander) Expand(ctx context.Context, query string, numVariations int) ([]string, error)

Expand implements the QueryExpander interface.

Direct port from legacy pkg/context/query_expansion.go

type MCPExtractor

type MCPExtractor struct {
	// contains filtered or unexported fields
}

MCPExtractor handles document parsing via MCP tools. This allows using any MCP service (Docling, etc.) for document parsing.

Direct port from legacy pkg/context/extraction/mcp_extractor.go

func NewMCPExtractor

func NewMCPExtractor(config MCPExtractorConfig) (*MCPExtractor, error)

NewMCPExtractor creates a new MCP-based extractor.

Direct port from legacy pkg/context/extraction/mcp_extractor.go

func (*MCPExtractor) CanExtract

func (e *MCPExtractor) CanExtract(path string, mimeType string) bool

CanExtract checks if this extractor can handle the file.

func (*MCPExtractor) Extract

func (e *MCPExtractor) Extract(ctx context.Context, path string, fileSize int64) (*ExtractedContent, error)

Extract uses MCP tools to extract content from files.

func (*MCPExtractor) Name

func (e *MCPExtractor) Name() string

Name returns the extractor name.

func (*MCPExtractor) Priority

func (e *MCPExtractor) Priority() int

Priority returns the extractor priority.

type MCPExtractorConfig

type MCPExtractorConfig struct {
	ToolCaller      ToolCaller
	ParserToolNames []string // Tool names to try (e.g., ["parse_document", "docling_parse"])
	SupportedExts   []string // File extensions this extractor handles (empty = all)
	Priority        int      // Priority (higher = preferred)
	LocalBasePath   string   // Local base path of the document store (e.g., "/Users/user/workspace/hector/test-docs")
	PathPrefix      string   // Remote path prefix for containerized MCP services (e.g., "/docs")
}

MCPExtractorConfig configures an MCP extractor.

Direct port from legacy pkg/context/extraction/mcp_extractor.go

type MetadataExtractor

type MetadataExtractor interface {
	// Name returns the extractor name
	Name() string

	// CanExtract determines if this extractor can handle the given language
	CanExtract(language string) bool

	// Extract extracts metadata from source code
	Extract(content string, filePath string) (*CodeMetadata, error)
}

MetadataExtractor defines the interface for extracting metadata from source code.

Direct port from legacy pkg/context/metadata/extractor.go

type MetadataExtractorRegistry

type MetadataExtractorRegistry struct {
	// contains filtered or unexported fields
}

MetadataExtractorRegistry manages metadata extractors.

Direct port from legacy pkg/context/metadata/extractor.go

func NewMetadataExtractorRegistry

func NewMetadataExtractorRegistry() *MetadataExtractorRegistry

NewMetadataExtractorRegistry creates a new metadata extractor registry.

func (*MetadataExtractorRegistry) ExtractMetadata

func (r *MetadataExtractorRegistry) ExtractMetadata(language string, content string, filePath string) (*CodeMetadata, error)

ExtractMetadata tries to extract metadata using the appropriate extractor.

func (*MetadataExtractorRegistry) GetExtractors

func (r *MetadataExtractorRegistry) GetExtractors() []MetadataExtractor

GetExtractors returns all registered extractors.

func (*MetadataExtractorRegistry) Register

func (r *MetadataExtractorRegistry) Register(extractor MetadataExtractor)

Register adds a metadata extractor for specific languages.

type MultiQueryExpander

type MultiQueryExpander struct {
	// contains filtered or unexported fields
}

MultiQueryExpander generates multiple query variants for better recall.

Multi-query retrieval improves recall by:

  • Generating alternative phrasings of the query
  • Searching with each variant
  • Combining and deduplicating results

This helps when:

  • Queries are ambiguous
  • Relevant documents use different terminology
  • Users don't know exact terms used in documents

Derived from legacy pkg/context/multi_query.go

func NewMultiQueryExpander

func NewMultiQueryExpander(llm model.LLM, numQueries int) *MultiQueryExpander

NewMultiQueryExpander creates a new multi-query expander.

func (*MultiQueryExpander) ExpandQuery

func (m *MultiQueryExpander) ExpandQuery(ctx context.Context, query string) ([]string, error)

ExpandQuery generates multiple query variants.

type NativeParseResult

type NativeParseResult struct {
	Success          bool
	Content          string
	Title            string
	Author           string
	Metadata         map[string]string
	Error            string
	ProcessingTimeMs int64
}

NativeParseResult represents the result from a native parser.

Direct port from legacy pkg/context/extraction/binary_extractor.go

type NativeParser

type NativeParser interface {
	ParseDocument(ctx context.Context, filePath string, fileSize int64) (*NativeParseResult, error)
}

NativeParser interface for parsing binary documents.

Direct port from legacy pkg/context/extraction/binary_extractor.go

type NativeParserRegistry

type NativeParserRegistry struct {
	// contains filtered or unexported fields
}

NativeParserRegistry manages native document parsers for PDF, DOCX, XLSX.

Ported from legacy pkg/context/native_parsers.go

func NewNativeParserRegistry

func NewNativeParserRegistry() *NativeParserRegistry

NewNativeParserRegistry creates a new native parser registry with built-in parsers.

func (*NativeParserRegistry) GetSupportedExtensions

func (r *NativeParserRegistry) GetSupportedExtensions() []string

GetSupportedExtensions returns all supported file extensions.

func (*NativeParserRegistry) ParseDocument

func (r *NativeParserRegistry) ParseDocument(ctx context.Context, filePath string, fileSize int64) (*NativeParseResult, error)

ParseDocument finds the appropriate parser and extracts content. Implements NativeParser interface.

type NilChunker

type NilChunker struct{}

NilChunker returns the entire content as a single chunk.

func (NilChunker) Chunk

func (NilChunker) Chunk(content string, ctx *ChunkContext) ([]Chunk, error)

func (NilChunker) Config

func (NilChunker) Config() ChunkerConfig

func (NilChunker) Strategy

func (NilChunker) Strategy() ChunkerStrategy

type NilDataSource

type NilDataSource struct{}

NilDataSource is a no-op data source that returns no documents.

func (NilDataSource) Close

func (NilDataSource) Close() error

func (NilDataSource) DiscoverDocuments

func (NilDataSource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)

func (NilDataSource) GetLastModified

func (NilDataSource) GetLastModified(ctx context.Context, id string) (time.Time, error)

func (NilDataSource) ReadDocument

func (NilDataSource) ReadDocument(ctx context.Context, id string) (*Document, error)

func (NilDataSource) SupportsIncrementalIndexing

func (NilDataSource) SupportsIncrementalIndexing() bool

func (NilDataSource) Type

func (NilDataSource) Type() string

type NilMultiQueryExpander

type NilMultiQueryExpander struct{}

NilMultiQueryExpander returns the original query unchanged.

func (NilMultiQueryExpander) ExpandQuery

func (NilMultiQueryExpander) ExpandQuery(ctx context.Context, query string) ([]string, error)

type NilQueryExpander

type NilQueryExpander struct{}

NilQueryExpander returns the original query unchanged.

func (NilQueryExpander) Expand

func (NilQueryExpander) Expand(ctx context.Context, query string, numVariations int) ([]string, error)

type NilReranker

type NilReranker struct{}

NilReranker returns results unchanged.

func (NilReranker) Rerank

func (NilReranker) Rerank(ctx context.Context, query string, results []SearchResult) (*RerankResult, error)

type OverlappingChunker

type OverlappingChunker struct {
	// contains filtered or unexported fields
}

OverlappingChunker implements chunking with configurable overlap.

This is a direct port of legacy pkg/context/chunking/overlapping_chunker.go. Overlap helps preserve context at chunk boundaries, improving retrieval quality when relevant information spans two chunks.

Use when:

  • Retrieval quality is important
  • Content has flowing prose
  • You can afford slightly more storage

func NewOverlappingChunker

func NewOverlappingChunker(cfg ChunkerConfig) *OverlappingChunker

NewOverlappingChunker creates a new overlapping chunker.

func (*OverlappingChunker) Chunk

func (c *OverlappingChunker) Chunk(content string, ctx *ChunkContext) ([]Chunk, error)

Chunk splits content into overlapping chunks. Direct port from legacy pkg/context/chunking/overlapping_chunker.go

func (*OverlappingChunker) Config

func (c *OverlappingChunker) Config() ChunkerConfig

func (*OverlappingChunker) Strategy

func (c *OverlappingChunker) Strategy() ChunkerStrategy

type PaginationConfig

type PaginationConfig struct {
	Type      string `yaml:"type"`       // "offset", "cursor", "page", "link"
	PageParam string `yaml:"page_param"` // Query parameter name for page/offset
	SizeParam string `yaml:"size_param"` // Query parameter name for page size
	MaxPages  int    `yaml:"max_pages"`  // Maximum pages to fetch (0 = unlimited)
	PageSize  int    `yaml:"page_size"`  // Items per page
	NextField string `yaml:"next_field"` // JSON field containing next page URL/cursor
	DataField string `yaml:"data_field"` // JSON field containing array of items (if nested)
}

PaginationConfig defines how to handle paginated API responses.

Direct port from legacy pkg/context/indexing/api_source.go

type PatternCache

type PatternCache struct {
	// contains filtered or unexported fields
}

PatternCache provides fast pattern matching.

Direct port from legacy pkg/context/indexing/pattern_filter.go

type PatternFilter

type PatternFilter struct {
	// contains filtered or unexported fields
}

PatternFilter implements FileFilter using include/exclude patterns.

Direct port from legacy pkg/context/indexing/pattern_filter.go

func NewPatternFilter

func NewPatternFilter(sourcePath string, includePatterns, excludePatterns []string) (*PatternFilter, error)

NewPatternFilter creates a new pattern-based filter with validation.

Direct port from legacy pkg/context/indexing/pattern_filter.go

func (*PatternFilter) ShouldExclude

func (pf *PatternFilter) ShouldExclude(path string) bool

ShouldExclude checks if a file matches exclude patterns.

func (*PatternFilter) ShouldInclude

func (pf *PatternFilter) ShouldInclude(path string) bool

ShouldInclude checks if a file matches include patterns.

type ProgressStats

type ProgressStats struct {
	TotalFiles     int64
	ProcessedFiles int64
	IndexedFiles   int64
	SkippedFiles   int64
	FailedFiles    int64
	DeletedFiles   int64
	CurrentFile    string
	ElapsedTime    time.Duration
}

ProgressStats contains progress statistics.

type ProgressTracker

type ProgressTracker struct {
	// contains filtered or unexported fields
}

ProgressTracker tracks indexing progress with real-time statistics.

Direct port from legacy pkg/context/progress_tracker.go

func NewProgressTracker

func NewProgressTracker(enabled bool, verbose bool) *ProgressTracker

NewProgressTracker creates a new progress tracker.

func (*ProgressTracker) GetExtractorStats

func (pt *ProgressTracker) GetExtractorStats() map[string]int64

GetExtractorStats returns extractor usage statistics.

func (*ProgressTracker) GetStats

func (pt *ProgressTracker) GetStats() ProgressStats

GetStats returns current statistics.

func (*ProgressTracker) IncrementDeleted

func (pt *ProgressTracker) IncrementDeleted()

IncrementDeleted increments the deleted files counter.

func (*ProgressTracker) IncrementFailed

func (pt *ProgressTracker) IncrementFailed()

IncrementFailed increments the failed files counter.

func (*ProgressTracker) IncrementIndexed

func (pt *ProgressTracker) IncrementIndexed()

IncrementIndexed increments the indexed files counter.

func (*ProgressTracker) IncrementProcessed

func (pt *ProgressTracker) IncrementProcessed()

IncrementProcessed increments the processed files counter.

func (*ProgressTracker) IncrementSkipped

func (pt *ProgressTracker) IncrementSkipped()

IncrementSkipped increments the skipped files counter.

func (*ProgressTracker) RecordExtractorUsage

func (pt *ProgressTracker) RecordExtractorUsage(extractorName string)

RecordExtractorUsage records which extractor was used for a document.

func (*ProgressTracker) SetCurrentFile

func (pt *ProgressTracker) SetCurrentFile(filename string)

SetCurrentFile sets the currently processing file.

func (*ProgressTracker) SetTotalFiles

func (pt *ProgressTracker) SetTotalFiles(total int64)

SetTotalFiles sets the total number of files to process.

func (*ProgressTracker) Start

func (pt *ProgressTracker) Start()

Start begins the progress display loop.

func (*ProgressTracker) Stop

func (pt *ProgressTracker) Stop()

Stop stops the progress display.

type QueryExpander

type QueryExpander interface {
	// Expand generates multiple query variations from the original query.
	Expand(ctx context.Context, query string, numVariations int) ([]string, error)
}

QueryExpander expands a single query into multiple query variations.

Direct port from legacy pkg/context/query_expansion.go

type RankingDecision

type RankingDecision struct {
	// Index is the original result index.
	Index int `json:"index"`

	// Relevance is the LLM-assigned relevance score (1-10).
	Relevance int `json:"relevance"`

	// Reason explains why this ranking was assigned.
	Reason string `json:"reason,omitempty"`
}

RankingDecision represents the LLM's ranking for a single result.

type RerankResult

type RerankResult struct {
	// Results are the reranked search results.
	Results []SearchResult

	// Rankings contains the LLM's ranking decisions.
	Rankings []RankingDecision
}

RerankResult contains the reranked results and the LLM's ranking decisions.

type Reranker

type Reranker struct {
	// contains filtered or unexported fields
}

Reranker re-ranks search results using an LLM.

Reranking improves search quality by:

  • Using deeper semantic understanding than vector similarity
  • Evaluating actual relevance to the query
  • Considering context that embeddings might miss

Trade-offs:

  • Adds latency (100-500ms per search)
  • Incurs LLM API costs
  • Only practical for small result sets (10-20 items)

Derived from legacy pkg/context/reranking/reranker.go

func NewReranker

func NewReranker(llm model.LLM, maxResults int) *Reranker

NewReranker creates a new reranker.

func (*Reranker) Rerank

func (r *Reranker) Rerank(ctx context.Context, query string, results []SearchResult) (*RerankResult, error)

Rerank re-orders results based on LLM assessment.

The process:

  1. Format results and query for the LLM
  2. Ask LLM to rank results by relevance
  3. Parse LLM response and reorder results
  4. Assign new scores based on ranking position

After reranking:

  • Scores are position-based (1st=1.0, 2nd=0.95, etc.)
  • Original vector similarity scores are replaced

type RetryConfig

type RetryConfig struct {
	// MaxRetries is the maximum number of retry attempts (default: 3).
	MaxRetries int

	// BaseDelay is the initial delay between retries (default: 1s).
	BaseDelay time.Duration

	// MaxDelay is the maximum delay between retries (default: 30s).
	MaxDelay time.Duration

	// JitterFactor adds randomness to delays (0.0-1.0, default: 0.1).
	JitterFactor float64

	// RetryableErrors are error substrings that indicate retryable failures.
	RetryableErrors []string
}

RetryConfig configures retry behavior.

Reuses patterns from v2/httpclient for consistency.

func DefaultRetryConfig

func DefaultRetryConfig() RetryConfig

DefaultRetryConfig returns sensible defaults for RAG operations.

type RetryError

type RetryError struct {
	Operation   string
	Attempts    int
	LastError   error
	IsExhausted bool
}

RetryError represents an error after retry attempts.

func (*RetryError) Error

func (e *RetryError) Error() string

func (*RetryError) Unwrap

func (e *RetryError) Unwrap() error

type Retryer

type Retryer struct {
	// contains filtered or unexported fields
}

Retryer handles retry logic with exponential backoff.

Based on v2/httpclient patterns but generalized for any operation.

func NewRetryer

func NewRetryer(cfg RetryConfig) *Retryer

NewRetryer creates a new retryer with the given config.

func (*Retryer) Do

func (r *Retryer) Do(ctx context.Context, operation string, fn func() error) error

Do executes the operation with retry logic.

Returns the first successful result or the last error after all retries.

type SQLSource

type SQLSource struct {
	// contains filtered or unexported fields
}

SQLSource implements DataSource for SQL databases using database/sql.

Direct port from legacy pkg/context/indexing/sql_source.go

func NewSQLSource

func NewSQLSource(opts SQLSourceOptions) (*SQLSource, error)

NewSQLSource creates a new SQL data source.

Direct port from legacy pkg/context/indexing/sql_source.go

func (*SQLSource) Close

func (s *SQLSource) Close() error

Close closes the underlying database connection. Note: In most cases, the connection is managed externally (e.g., by DBPool), so this is a no-op. The caller should manage the lifecycle.

func (*SQLSource) DiscoverDocuments

func (s *SQLSource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)

DiscoverDocuments returns channels of discovered documents and errors.

Direct port from legacy pkg/context/indexing/sql_source.go

func (*SQLSource) GetLastModified

func (s *SQLSource) GetLastModified(ctx context.Context, id string) (time.Time, error)

GetLastModified returns the last modification time for a document.

func (*SQLSource) ReadDocument

func (s *SQLSource) ReadDocument(ctx context.Context, id string) (*Document, error)

ReadDocument retrieves a specific document by its ID.

Direct port from legacy pkg/context/indexing/sql_source.go

func (*SQLSource) SupportsIncrementalIndexing

func (s *SQLSource) SupportsIncrementalIndexing() bool

SupportsIncrementalIndexing returns true if UpdatedColumn is configured.

func (*SQLSource) Type

func (s *SQLSource) Type() string

Type returns the data source type.

type SQLSourceOptions

type SQLSourceOptions struct {
	DB      *sql.DB
	Driver  string
	Tables  []SQLTableConfig
	MaxRows int
}

SQLSourceOptions configures the SQL source.

type SQLTableConfig

type SQLTableConfig struct {
	Table           string   `yaml:"table"`
	Columns         []string `yaml:"columns"`          // Columns to concatenate for content
	IDColumn        string   `yaml:"id_column"`        // Primary key or unique identifier
	UpdatedColumn   string   `yaml:"updated_column"`   // Column for tracking updates (e.g., updated_at)
	WhereClause     string   `yaml:"where_clause"`     // Optional WHERE clause for filtering
	MetadataColumns []string `yaml:"metadata_columns"` // Columns to include as metadata
}

SQLTableConfig defines which tables and columns to index.

Direct port from legacy pkg/context/indexing/sql_source.go
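
Based on the yaml tags above, a single table entry might look like this (the surrounding configuration keys are not shown, and the values are illustrative):

```yaml
- table: articles
  columns: [title, body]               # concatenated to form document content
  id_column: id
  updated_column: updated_at           # enables incremental indexing
  where_clause: "published = true"
  metadata_columns: [author, created_at]
```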

type SearchEngine

type SearchEngine struct {
	// contains filtered or unexported fields
}

SearchEngine provides document indexing and semantic search.

It combines:

  • Document ingestion with chunking
  • Vector similarity search
  • Optional hybrid search (vector + keyword)
  • Optional query enhancement (HyDE, multi-query)
  • Optional reranking

Derived from legacy pkg/context/search.go:SearchEngine

func NewSearchEngine

func NewSearchEngine(cfg SearchEngineConfig) (*SearchEngine, error)

NewSearchEngine creates a new search engine.

func NewSearchEngineFromConfig

func NewSearchEngineFromConfig(
	storeCfg *config.DocumentStoreConfig,
	deps *FactoryDeps,
	collectionName string,
) (*SearchEngine, error)

NewSearchEngineFromConfig creates a search engine from configuration. collectionName is used as the default if storeCfg.Collection is empty.

func (*SearchEngine) Clear

func (e *SearchEngine) Clear(ctx context.Context) error

Clear removes all documents from the index.

func (*SearchEngine) Close

func (e *SearchEngine) Close() error

Close releases resources.

func (*SearchEngine) Collection

func (e *SearchEngine) Collection() string

Collection returns the collection name.

func (*SearchEngine) DeleteByFilter

func (e *SearchEngine) DeleteByFilter(ctx context.Context, filter map[string]any) error

DeleteByFilter removes documents matching the filter.

Direct port from legacy pkg/context/search.go

func (*SearchEngine) DeleteDocument

func (e *SearchEngine) DeleteDocument(ctx context.Context, documentID string) error

DeleteDocument removes a document and all its chunks from the index.

func (*SearchEngine) HealthCheck

func (e *SearchEngine) HealthCheck(ctx context.Context) HealthCheck

HealthCheck reports the health of the SearchEngine.

func (*SearchEngine) IngestDocument

func (e *SearchEngine) IngestDocument(ctx context.Context, doc Document) error

IngestDocument indexes a document for search.

The document is:

  1. Split into chunks using the configured chunker
  2. Each chunk is embedded
  3. Chunks are stored in the vector database

Document ID should be stable across re-indexing to enable updates.

func (*SearchEngine) IngestDocuments

func (e *SearchEngine) IngestDocuments(ctx context.Context, docs []Document) error

IngestDocuments indexes multiple documents concurrently.

func (*SearchEngine) Search

Search finds documents matching the query.

func (*SearchEngine) Status

func (e *SearchEngine) Status() map[string]any

Status returns the current status of the search engine.

Direct port from legacy pkg/context/search.go (GetStatus)

type SearchEngineConfig

type SearchEngineConfig struct {
	// Provider for vector storage and search (required).
	Provider vector.Provider

	// Embedder for generating embeddings (required).
	Embedder embedder.Embedder

	// Chunker for splitting documents (optional, defaults to simple).
	Chunker Chunker

	// Collection name for storing documents (optional, defaults to "rag_documents").
	Collection string

	// DefaultTopK is the default number of results (default: 10).
	DefaultTopK int

	// DefaultThreshold filters results below this score (default: 0.0).
	DefaultThreshold float32

	// HyDE for hypothetical document embedding (optional).
	HyDE *HyDE

	// Reranker for LLM-based result reranking (optional).
	Reranker *Reranker

	// MultiQuery for query expansion (optional).
	MultiQuery *MultiQueryExpander
}

SearchEngineConfig configures the search engine.

type SearchError

type SearchError struct {
	Component string // Component that failed (e.g., "embedder", "vector_db", "reranker")
	Operation string // Operation that failed
	Message   string // Error message
	Query     string // Query that caused the error
	Err       error  // Underlying error
}

SearchError represents an error during search operations.

Inspired by legacy pkg/context error handling

func NewSearchError

func NewSearchError(component, operation, message, query string, err error) *SearchError

NewSearchError creates a new SearchError.

func (*SearchError) Error

func (e *SearchError) Error() string

Error implements the error interface.

func (*SearchError) Unwrap

func (e *SearchError) Unwrap() error

Unwrap returns the underlying error.

type SearchMetrics

type SearchMetrics struct {
	// contains filtered or unexported fields
}

SearchMetrics tracks search engine metrics.

func NewSearchMetrics

func NewSearchMetrics(engineName string) *SearchMetrics

NewSearchMetrics creates a new search metrics tracker.

func (*SearchMetrics) RecordSearch

func (m *SearchMetrics) RecordSearch(latency time.Duration, resultCount int, opts *SearchOptions)

RecordSearch records a search operation.

func (*SearchMetrics) Snapshot

func (m *SearchMetrics) Snapshot() SearchMetricsSnapshot

Snapshot returns a point-in-time copy of search metrics.

type SearchMetricsSnapshot

type SearchMetricsSnapshot struct {
	EngineName      string        `json:"engine_name"`
	TotalSearches   int64         `json:"total_searches"`
	SuccessfulHits  int64         `json:"successful_hits"`
	EmptyResults    int64         `json:"empty_results"`
	AvgLatency      time.Duration `json:"avg_latency_ns"`
	MaxLatency      time.Duration `json:"max_latency_ns"`
	MinLatency      time.Duration `json:"min_latency_ns"`
	HyDEUsage       int64         `json:"hyde_usage"`
	RerankUsage     int64         `json:"rerank_usage"`
	MultiQueryUsage int64         `json:"multi_query_usage"`
}

SearchMetricsSnapshot is a point-in-time copy of search metrics.

type SearchOptions

type SearchOptions struct {
	// Mode specifies the search mode: "vector", "keyword", "hybrid".
	Mode string `json:"mode,omitempty"`

	// EnableHyDE enables Hypothetical Document Embeddings.
	EnableHyDE bool `json:"enable_hyde,omitempty"`

	// EnableRerank enables LLM-based reranking.
	EnableRerank bool `json:"enable_rerank,omitempty"`

	// EnableMultiQuery enables query expansion.
	EnableMultiQuery bool `json:"enable_multi_query,omitempty"`

	// NumQueries is the number of query variants for multi-query.
	NumQueries int `json:"num_queries,omitempty"`
}

SearchOptions configures search behavior.

type SearchRequest

type SearchRequest struct {
	// Query is the search query text.
	Query string `json:"query"`

	// Collection scopes the search to a specific collection.
	Collection string `json:"collection,omitempty"`

	// TopK is the maximum number of results to return.
	TopK int `json:"top_k,omitempty"`

	// Threshold filters results below this score.
	Threshold float32 `json:"threshold,omitempty"`

	// Filter applies metadata filtering.
	Filter map[string]any `json:"filter,omitempty"`

	// Options contains search-specific options.
	Options *SearchOptions `json:"options,omitempty"`
}

SearchRequest represents a search query.

func (*SearchRequest) SetDefaults

func (r *SearchRequest) SetDefaults()

SetDefaults applies default values to SearchRequest.
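A plausible defaulting pattern, shown on a local copy of the relevant fields: zero-valued fields are filled in while explicit values are preserved. The concrete default of 10 is taken from the documented DefaultTopK in SearchEngineConfig; whether SetDefaults applies exactly these values is an assumption:

```go
package main

import "fmt"

// searchRequest copies only the fields relevant to defaulting.
type searchRequest struct {
	Query     string
	TopK      int
	Threshold float32
}

// setDefaults fills zero-valued fields; explicit values are untouched.
func (r *searchRequest) setDefaults() {
	if r.TopK <= 0 {
		r.TopK = 10 // mirrors the documented DefaultTopK
	}
	if r.Threshold < 0 {
		r.Threshold = 0
	}
}

func main() {
	r := searchRequest{Query: "chunking strategies"}
	r.setDefaults()
	fmt.Println(r.TopK, r.Threshold)
}
```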

type SearchResponse

type SearchResponse struct {
	// Results contains the matched documents/chunks.
	Results []SearchResult `json:"results"`

	// TotalMatches is the total number of matches (before limit).
	TotalMatches int `json:"total_matches,omitempty"`

	// SearchTimeMs is the search duration in milliseconds.
	SearchTimeMs int64 `json:"search_time_ms,omitempty"`

	// QueryExpansions contains expanded queries (if multi-query enabled).
	QueryExpansions []string `json:"query_expansions,omitempty"`
}

SearchResponse contains search results.

type SearchResult

type SearchResult struct {
	// ID is the chunk/document identifier.
	ID string `json:"id"`

	// Content is the matched content.
	Content string `json:"content"`

	// Score represents relevance (higher is better).
	Score float32 `json:"score"`

	// DocumentID is the parent document identifier.
	DocumentID string `json:"document_id,omitempty"`

	// ChunkIndex is the chunk position within the document.
	ChunkIndex int `json:"chunk_index,omitempty"`

	// Metadata contains additional result information.
	Metadata map[string]any `json:"metadata,omitempty"`

	// Highlights contains matched text spans (optional).
	Highlights []string `json:"highlights,omitempty"`
}

SearchResult represents a single search result.

Results are ordered by Score (highest first). The Score semantics depend on whether reranking was applied:

  • Without reranking: vector similarity (0.0 to 1.0)
  • With reranking: LLM-determined position score

func CombineResults

func CombineResults(resultSets [][]SearchResult) []SearchResult

CombineResults merges results from multiple queries.

Deduplicates by document ID and keeps the highest score for each.
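The documented merge behavior can be sketched as follows; this is a hypothetical re-implementation of the stated contract (dedupe by ID, keep the highest score, order by score), not the package's own code:

```go
package main

import (
	"fmt"
	"sort"
)

// SearchResult carries only the fields the merge inspects (illustrative subset).
type SearchResult struct {
	ID    string
	Score float32
}

// combineResults deduplicates by ID, keeps the highest score for each,
// and returns the merged set ordered by score, highest first.
func combineResults(resultSets [][]SearchResult) []SearchResult {
	best := make(map[string]SearchResult)
	for _, set := range resultSets {
		for _, r := range set {
			if cur, ok := best[r.ID]; !ok || r.Score > cur.Score {
				best[r.ID] = r
			}
		}
	}
	merged := make([]SearchResult, 0, len(best))
	for _, r := range best {
		merged = append(merged, r)
	}
	sort.Slice(merged, func(i, j int) bool { return merged[i].Score > merged[j].Score })
	return merged
}

func main() {
	a := []SearchResult{{ID: "doc1", Score: 0.82}, {ID: "doc2", Score: 0.61}}
	b := []SearchResult{{ID: "doc1", Score: 0.74}, {ID: "doc3", Score: 0.55}}
	for _, r := range combineResults([][]SearchResult{a, b}) {
		fmt.Printf("%s %.2f\n", r.ID, r.Score)
	}
}
```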

type SemanticChunker

type SemanticChunker struct {
	// contains filtered or unexported fields
}

SemanticChunker implements AST-aware chunking that respects code structure.

This is a direct port of legacy pkg/context/chunking/semantic_chunker.go. It attempts to keep functions and types together when possible, using metadata to identify semantic boundaries.

Use when:

  • Chunking code files
  • Retrieval quality is paramount
  • Variable chunk sizes are acceptable

func NewSemanticChunker

func NewSemanticChunker(cfg ChunkerConfig) *SemanticChunker

NewSemanticChunker creates a new semantic chunker.

func (*SemanticChunker) Chunk

func (c *SemanticChunker) Chunk(content string, ctx *ChunkContext) ([]Chunk, error)

Chunk splits content into semantically meaningful chunks. It uses metadata to identify function and type boundaries. Direct port from legacy pkg/context/chunking/semantic_chunker.go

func (*SemanticChunker) Config

func (c *SemanticChunker) Config() ChunkerConfig

func (*SemanticChunker) Strategy

func (c *SemanticChunker) Strategy() ChunkerStrategy

type SimpleChunker

type SimpleChunker struct {
	// contains filtered or unexported fields
}

SimpleChunker implements basic line-based chunking.

This is a direct port of legacy pkg/context/chunking/simple_chunker.go. It splits content by lines first, then groups lines into chunks of the configured size. This ensures chunks never split mid-line.

Use when:

  • Speed is critical
  • Content has uniform structure
  • Line boundaries should be preserved
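The line-based strategy described above can be sketched in a few lines. This ignores ChunkerConfig details such as overlap and is an illustration of the grouping idea only:

```go
package main

import (
	"fmt"
	"strings"
)

// chunkByLines splits content into lines, then groups consecutive lines
// into chunks of at most linesPerChunk lines, so no chunk splits mid-line.
func chunkByLines(content string, linesPerChunk int) []string {
	lines := strings.Split(content, "\n")
	var chunks []string
	for start := 0; start < len(lines); start += linesPerChunk {
		end := start + linesPerChunk
		if end > len(lines) {
			end = len(lines)
		}
		chunks = append(chunks, strings.Join(lines[start:end], "\n"))
	}
	return chunks
}

func main() {
	doc := "line1\nline2\nline3\nline4\nline5"
	for i, c := range chunkByLines(doc, 2) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```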

func NewSimpleChunker

func NewSimpleChunker(cfg ChunkerConfig) *SimpleChunker

NewSimpleChunker creates a new simple chunker.

func (*SimpleChunker) Chunk

func (c *SimpleChunker) Chunk(content string, ctx *ChunkContext) ([]Chunk, error)

Chunk splits content into chunks based on line count. Direct port from legacy pkg/context/chunking/simple_chunker.go

func (*SimpleChunker) Config

func (c *SimpleChunker) Config() ChunkerConfig

func (*SimpleChunker) Strategy

func (c *SimpleChunker) Strategy() ChunkerStrategy

type SourceDocument

type SourceDocument struct {
	// ID is a unique identifier for the document (format depends on source type)
	ID string

	// Content is the text content to be indexed.
	// For file sources, this should be populated by reading the file.
	// For SQL/API sources, this is populated during discovery.
	Content string

	// Metadata contains source-specific metadata (file path, table name, API endpoint, etc.)
	Metadata map[string]interface{}

	// LastModified is the last modification time, if available
	LastModified time.Time

	// Size is the size of the document in bytes (approximate for non-file sources)
	Size int64

	// ShouldIndex indicates whether this document should be indexed (after filtering)
	ShouldIndex bool

	// SourcePath is the original source path (file path, table name, API endpoint, etc.)
	// This is used for relative path calculations and display purposes
	SourcePath string
}

SourceDocument represents a document from any source (file, SQL row, API response, etc.).

Direct port from legacy pkg/context/indexing/data_source.go:Document. Renamed to SourceDocument to avoid a conflict with the v2/rag Document type.

type TextExtractor

type TextExtractor struct{}

TextExtractor handles plain text files.

Direct port from legacy pkg/context/extraction/text_extractor.go

func NewTextExtractor

func NewTextExtractor() *TextExtractor

NewTextExtractor creates a new text extractor.

func (*TextExtractor) CanExtract

func (te *TextExtractor) CanExtract(path string, mimeType string) bool

CanExtract checks if this is a text file.

func (*TextExtractor) Extract

func (te *TextExtractor) Extract(ctx context.Context, path string, fileSize int64) (*ExtractedContent, error)

Extract reads and cleans text content.

func (*TextExtractor) Name

func (te *TextExtractor) Name() string

Name returns the extractor name.

func (*TextExtractor) Priority

func (te *TextExtractor) Priority() int

Priority returns a low priority (1) so that more specific extractors can take precedence.

type Tool

type Tool interface {
	GetInfo() ToolInfo
	Execute(ctx context.Context, args map[string]interface{}) (ToolResult, error)
}

Tool is a minimal interface for executing tools.

Direct port from legacy pkg/context/extraction/mcp_extractor.go

type ToolCaller

type ToolCaller interface {
	GetTool(name string) (Tool, error)
}

ToolCaller is a minimal interface for calling tools without creating import cycles. This allows MCP extractors to work with any tool registry implementation.

Direct port from legacy pkg/context/extraction/mcp_extractor.go

type ToolInfo

type ToolInfo struct {
	Name        string
	Description string
	Parameters  []ToolParameter
}

ToolInfo contains information about a tool.

Direct port from legacy pkg/context/extraction/mcp_extractor.go

type ToolParameter

type ToolParameter struct {
	Name        string
	Type        string
	Description string
	Required    bool
}

ToolParameter describes a tool parameter.

Direct port from legacy pkg/context/extraction/mcp_extractor.go

type ToolResult

type ToolResult struct {
	Success  bool
	Content  string
	Error    string
	Metadata interface{}
}

ToolResult contains the result of tool execution.

Direct port from legacy pkg/context/extraction/mcp_extractor.go

type TypeInfo

type TypeInfo struct {
	Name       string   `json:"name"`
	Kind       string   `json:"kind"` // "struct", "interface", "alias", etc.
	StartLine  int      `json:"start_line"`
	EndLine    int      `json:"end_line"`
	Fields     []string `json:"fields,omitempty"`
	Methods    []string `json:"methods,omitempty"`
	IsExported bool     `json:"is_exported,omitempty"`
	DocComment string   `json:"doc_comment,omitempty"`
}

TypeInfo contains information about a type (struct, interface, etc.).

Direct port from legacy pkg/context/metadata/extractor.go
