Documentation ¶
Overview ¶
Package rag provides Retrieval-Augmented Generation (RAG) capabilities.
Architecture ¶
The RAG package follows a layered architecture:
┌──────────────────────────────────────────────────────────────────────────┐
│ SearchEngine (rag/search.go)                                             │
│ • Query processing, retrieval, reranking                                 │
├──────────────────────────────────────────────────────────────────────────┤
│ Chunker (rag/chunker.go)                                                 │
│ • Content splitting strategies                                           │
├──────────────────────────────────────────────────────────────────────────┤
│ Shared Foundation                                                        │
│ ┌───────────────────────────┐  ┌───────────────────────────┐             │
│ │ vector/provider.go        │  │ embedder/embedder.go      │             │
│ └───────────────────────────┘  └───────────────────────────┘             │
└──────────────────────────────────────────────────────────────────────────┘
Usage ¶
Basic usage for document ingestion and search. The calls below match the documented IngestDocument and Search signatures; the SearchRequest field names are illustrative (see the SearchRequest type below):

// Create search engine
engine, err := rag.NewSearchEngine(rag.SearchEngineConfig{
	Provider: vectorProvider,
	Embedder: embedder,
})
if err != nil {
	// handle error
}

// Ingest document
err = engine.IngestDocument(ctx, rag.Document{
	ID:       "doc1",
	Content:  "Document content...",
	Metadata: metadata,
})

// Search
results, err := engine.Search(ctx, rag.SearchRequest{Query: "query", TopK: 10})
Integration with Memory ¶
The RAG package shares the same vector.Provider abstraction as the memory package, allowing both to use the same vector database backend.
Index ¶
- Constants
- func DoWithResult[T any](ctx context.Context, r *Retryer, operation string, fn func() (T, error)) (T, error)
- func IsRetryExhausted(err error) bool
- func NewVectorProviderFromConfig(cfg *config.VectorStoreConfig) (vector.Provider, error)
- type APIAuthConfig
- type APIEndpointConfig
- type APISource
- func (a *APISource) Close() error
- func (a *APISource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
- func (a *APISource) GetLastModified(ctx context.Context, id string) (time.Time, error)
- func (a *APISource) ReadDocument(ctx context.Context, id string) (*Document, error)
- func (a *APISource) SupportsIncrementalIndexing() bool
- func (a *APISource) Type() string
- type BinaryExtractor
- type BlobSource
- func (bs *BlobSource) Close() error
- func (bs *BlobSource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
- func (bs *BlobSource) GetFilter() FileFilter
- func (bs *BlobSource) GetLastModified(ctx context.Context, id string) (time.Time, error)
- func (bs *BlobSource) GetPrefix() string
- func (bs *BlobSource) GetURL() string
- func (bs *BlobSource) ReadDocument(ctx context.Context, id string) (*Document, error)
- func (bs *BlobSource) SupportsIncrementalIndexing() bool
- func (bs *BlobSource) Type() string
- type BlobSourceConfig
- type Chunk
- type ChunkContext
- type Chunker
- type ChunkerConfig
- type ChunkerStrategy
- type ChunkingError
- type CodeMetadata
- type CollectionSource
- func (cs *CollectionSource) Close() error
- func (cs *CollectionSource) CollectionName() string
- func (cs *CollectionSource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
- func (cs *CollectionSource) GetLastModified(ctx context.Context, id string) (time.Time, error)
- func (cs *CollectionSource) ReadDocument(ctx context.Context, id string) (*Document, error)
- func (cs *CollectionSource) SupportsIncrementalIndexing() bool
- func (cs *CollectionSource) Type() string
- type ContentExtractor
- type DBPoolAdapter
- type DataSource
- type DirectorySource
- func (ds *DirectorySource) Close() error
- func (ds *DirectorySource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
- func (ds *DirectorySource) GetBasePath() string
- func (ds *DirectorySource) GetFilter() FileFilter
- func (ds *DirectorySource) GetLastModified(ctx context.Context, id string) (time.Time, error)
- func (ds *DirectorySource) ReadDocument(ctx context.Context, id string) (*Document, error)
- func (ds *DirectorySource) SupportsIncrementalIndexing() bool
- func (ds *DirectorySource) Type() string
- type DirectorySourceConfig
- type Document
- type DocumentEvent
- type DocumentEventType
- type DocumentStore
- func (s *DocumentStore) Clear(ctx context.Context) error
- func (s *DocumentStore) Close() error
- func (s *DocumentStore) Collection() string
- func (s *DocumentStore) Config() DocumentStoreConfig
- func (s *DocumentStore) GetDocument(ctx context.Context, id string) (*SearchResult, error)
- func (s *DocumentStore) GetSearchEngine() *SearchEngine
- func (s *DocumentStore) HealthCheck(ctx context.Context) HealthCheck
- func (s *DocumentStore) Index(ctx context.Context) error
- func (s *DocumentStore) Metrics() IndexMetricsSnapshot
- func (s *DocumentStore) Name() string
- func (s *DocumentStore) RefreshDocument(ctx context.Context, docID string) error
- func (s *DocumentStore) RegisterExtractor(e ContentExtractor)
- func (s *DocumentStore) Search(ctx context.Context, req SearchRequest) (*SearchResponse, error)
- func (s *DocumentStore) SearchWithFilter(ctx context.Context, query string, topK int, filter map[string]any) (*SearchResponse, error)
- func (s *DocumentStore) StartWatching(ctx context.Context) error
- func (s *DocumentStore) Stats() DocumentStoreStats
- func (s *DocumentStore) StopWatching()
- type DocumentStoreConfig
- type DocumentStoreError
- type DocumentStoreStats
- type ExtractedContent
- type ExtractionError
- type ExtractorRegistry
- func (r *ExtractorRegistry) Extract(ctx context.Context, doc Document) (*ExtractedContent, error)
- func (r *ExtractorRegistry) ExtractContent(ctx context.Context, path string, mimeType string, fileSize int64) (*ExtractedContent, error)
- func (r *ExtractorRegistry) GetExtractors() []ContentExtractor
- func (r *ExtractorRegistry) HasExtractorForFile(path string, mimeType string) bool
- func (r *ExtractorRegistry) Register(extractor ContentExtractor)
- type FactoryDeps
- type FileCheckpoint
- type FileFilter
- type FileWatcher
- type FileWatcherConfig
- type FunctionInfo
- type GoMetadataExtractor
- type HealthCheck
- type HealthChecker
- type HealthStatus
- type HyDE
- type IndexCheckpoint
- type IndexCheckpointManager
- func (cm *IndexCheckpointManager) ClearCheckpoint() error
- func (cm *IndexCheckpointManager) ForceSave() error
- func (cm *IndexCheckpointManager) FormatCheckpointInfo(checkpoint *IndexCheckpoint) string
- func (cm *IndexCheckpointManager) GetProcessedCount() int
- func (cm *IndexCheckpointManager) IsEnabled() bool
- func (cm *IndexCheckpointManager) LoadCheckpoint() (*IndexCheckpoint, error)
- func (cm *IndexCheckpointManager) RecordFile(path string, size int64, modTime time.Time, status string)
- func (cm *IndexCheckpointManager) SaveCheckpoint() error
- func (cm *IndexCheckpointManager) SetTotalFiles(total int)
- func (cm *IndexCheckpointManager) ShouldProcessFile(path string, size int64, modTime time.Time) bool
- type IndexError
- type IndexMetrics
- func (m *IndexMetrics) IncrementErrors()
- func (m *IndexMetrics) IncrementIndexed()
- func (m *IndexMetrics) IncrementSkipped()
- func (m *IndexMetrics) IncrementTotal()
- func (m *IndexMetrics) RecordSearch(latency time.Duration)
- func (m *IndexMetrics) Reset()
- func (m *IndexMetrics) SetEndTime(t time.Time)
- func (m *IndexMetrics) SetStartTime(t time.Time)
- func (m *IndexMetrics) Snapshot() IndexMetricsSnapshot
- type IndexMetricsSnapshot
- type LLMQueryExpander
- type MCPExtractor
- type MCPExtractorConfig
- type MetadataExtractor
- type MetadataExtractorRegistry
- type MultiQueryExpander
- type NativeParseResult
- type NativeParser
- type NativeParserRegistry
- type NilChunker
- type NilDataSource
- func (NilDataSource) Close() error
- func (NilDataSource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
- func (NilDataSource) GetLastModified(ctx context.Context, id string) (time.Time, error)
- func (NilDataSource) ReadDocument(ctx context.Context, id string) (*Document, error)
- func (NilDataSource) SupportsIncrementalIndexing() bool
- func (NilDataSource) Type() string
- type NilMultiQueryExpander
- type NilQueryExpander
- type NilReranker
- type OverlappingChunker
- type PaginationConfig
- type PatternCache
- type PatternFilter
- type ProgressStats
- type ProgressTracker
- func (pt *ProgressTracker) GetExtractorStats() map[string]int64
- func (pt *ProgressTracker) GetStats() ProgressStats
- func (pt *ProgressTracker) IncrementDeleted()
- func (pt *ProgressTracker) IncrementFailed()
- func (pt *ProgressTracker) IncrementIndexed()
- func (pt *ProgressTracker) IncrementProcessed()
- func (pt *ProgressTracker) IncrementSkipped()
- func (pt *ProgressTracker) RecordExtractorUsage(extractorName string)
- func (pt *ProgressTracker) SetCurrentFile(filename string)
- func (pt *ProgressTracker) SetTotalFiles(total int64)
- func (pt *ProgressTracker) Start()
- func (pt *ProgressTracker) Stop()
- type QueryExpander
- type RankingDecision
- type RerankResult
- type Reranker
- type RetryConfig
- type RetryError
- type Retryer
- type SQLSource
- func (s *SQLSource) Close() error
- func (s *SQLSource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
- func (s *SQLSource) GetLastModified(ctx context.Context, id string) (time.Time, error)
- func (s *SQLSource) ReadDocument(ctx context.Context, id string) (*Document, error)
- func (s *SQLSource) SupportsIncrementalIndexing() bool
- func (s *SQLSource) Type() string
- type SQLSourceOptions
- type SQLTableConfig
- type SearchEngine
- func (e *SearchEngine) Clear(ctx context.Context) error
- func (e *SearchEngine) Close() error
- func (e *SearchEngine) Collection() string
- func (e *SearchEngine) DeleteByFilter(ctx context.Context, filter map[string]any) error
- func (e *SearchEngine) DeleteDocument(ctx context.Context, documentID string) error
- func (e *SearchEngine) HealthCheck(ctx context.Context) HealthCheck
- func (e *SearchEngine) IngestDocument(ctx context.Context, doc Document) error
- func (e *SearchEngine) IngestDocuments(ctx context.Context, docs []Document) error
- func (e *SearchEngine) Search(ctx context.Context, req SearchRequest) (*SearchResponse, error)
- func (e *SearchEngine) Status() map[string]any
- type SearchEngineConfig
- type SearchError
- type SearchMetrics
- type SearchMetricsSnapshot
- type SearchOptions
- type SearchRequest
- type SearchResponse
- type SearchResult
- type SemanticChunker
- type SimpleChunker
- type SourceDocument
- type TextExtractor
- type Tool
- type ToolCaller
- type ToolInfo
- type ToolParameter
- type ToolResult
- type TypeInfo
Constants ¶
const (
	// MinQueryLength is the minimum allowed query length.
	MinQueryLength = 2

	// MaxQueryLength is the maximum allowed query length.
	MaxQueryLength = 10000
)
Query validation constants (from legacy).
Variables ¶
This section is empty.
Functions ¶
func DoWithResult ¶
func DoWithResult[T any](ctx context.Context, r *Retryer, operation string, fn func() (T, error)) (T, error)
DoWithResult executes an operation that returns a value.
func IsRetryExhausted ¶
func IsRetryExhausted(err error) bool
IsRetryExhausted checks if an error is a retry exhaustion error.
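A minimal sketch of how these two helpers compose. The construction of *Retryer is not shown in this overview, so r is assumed to exist, and fetchEmbedding is a hypothetical fallible operation:

data, err := rag.DoWithResult(ctx, r, "fetch-embedding", func() ([]float32, error) {
	return fetchEmbedding(ctx) // hypothetical operation that may fail transiently
})
if err != nil && rag.IsRetryExhausted(err) {
	// all retry attempts failed; surface the error
}
_ = data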
func NewVectorProviderFromConfig ¶
func NewVectorProviderFromConfig(cfg *config.VectorStoreConfig) (vector.Provider, error)
NewVectorProviderFromConfig creates a vector provider from configuration.
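A sketch of building one provider and sharing it, per the Integration with Memory note above; vectorCfg is assumed to be a populated *config.VectorStoreConfig:

provider, err := rag.NewVectorProviderFromConfig(vectorCfg)
if err != nil {
	// handle error
}
engine, err := rag.NewSearchEngine(rag.SearchEngineConfig{
	Provider: provider,
	Embedder: embedder,
})
// The same provider value can also back the memory package.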
Types ¶
type APIAuthConfig ¶
type APIAuthConfig struct {
Type string `yaml:"type"` // "bearer", "basic", "apikey", "oauth2"
Token string `yaml:"token"` // Token/API key
User string `yaml:"user"` // Username (for basic auth)
Pass string `yaml:"pass"` // Password (for basic auth)
Header string `yaml:"header"` // Header name (for apikey type)
Extra map[string]string `yaml:"extra"` // Additional auth parameters
}
APIAuthConfig defines authentication for API requests.
Direct port from legacy pkg/context/indexing/api_source.go
type APIEndpointConfig ¶
type APIEndpointConfig struct {
Path string `yaml:"path"` // API path (relative to baseURL)
Method string `yaml:"method"` // HTTP method (default: GET)
Params map[string]string `yaml:"params"` // Query parameters
Headers map[string]string `yaml:"headers"` // Additional headers
Body string `yaml:"body"` // Request body (for POST/PUT)
IDField string `yaml:"id_field"` // JSON field to use as document ID
ContentField string `yaml:"content_field"` // JSON field(s) to use as content (comma-separated or JSONPath)
MetadataFields []string `yaml:"metadata_fields"` // JSON fields to include as metadata
UpdatedField string `yaml:"updated_field"` // JSON field for last modified time
Pagination *PaginationConfig `yaml:"pagination"` // Pagination configuration
Transform string `yaml:"transform"` // Optional JavaScript-like transform function (future)
}
APIEndpointConfig defines an API endpoint to index.
Direct port from legacy pkg/context/indexing/api_source.go
type APISource ¶
type APISource struct {
// contains filtered or unexported fields
}
APISource implements DataSource for REST API endpoints.
Direct port from legacy pkg/context/indexing/api_source.go
func NewAPISource ¶
func NewAPISource(baseURL string, endpoints []APIEndpointConfig, auth *APIAuthConfig) *APISource
NewAPISource creates a new REST API data source.
Direct port from legacy pkg/context/indexing/api_source.go
func (*APISource) DiscoverDocuments ¶
func (a *APISource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
DiscoverDocuments returns channels of discovered documents and errors.
Direct port from legacy pkg/context/indexing/api_source.go
func (*APISource) GetLastModified ¶
func (a *APISource) GetLastModified(ctx context.Context, id string) (time.Time, error)
GetLastModified returns the last modification time for a document.
func (*APISource) ReadDocument ¶
func (a *APISource) ReadDocument(ctx context.Context, id string) (*Document, error)
ReadDocument retrieves a specific document by its ID.
func (*APISource) SupportsIncrementalIndexing ¶
func (a *APISource) SupportsIncrementalIndexing() bool
SupportsIncrementalIndexing returns true if UpdatedField is configured.
type BinaryExtractor ¶
type BinaryExtractor struct {
// contains filtered or unexported fields
}
BinaryExtractor handles binary files like PDF, DOCX, XLSX using native parsers.
Direct port from legacy pkg/context/extraction/binary_extractor.go
func NewBinaryExtractor ¶
func NewBinaryExtractor(nativeParsers NativeParser) *BinaryExtractor
NewBinaryExtractor creates a new binary extractor.
func (*BinaryExtractor) CanExtract ¶
func (be *BinaryExtractor) CanExtract(path string, mimeType string) bool
CanExtract checks if this extractor can handle the file.
func (*BinaryExtractor) Extract ¶
func (be *BinaryExtractor) Extract(ctx context.Context, path string, fileSize int64) (*ExtractedContent, error)
Extract uses native parsers to extract content from binary files.
func (*BinaryExtractor) Name ¶
func (be *BinaryExtractor) Name() string
Name returns the extractor name.
func (*BinaryExtractor) Priority ¶
func (be *BinaryExtractor) Priority() int
Priority returns medium priority (5).
type BlobSource ¶ added in v1.21.0
type BlobSource struct {
// contains filtered or unexported fields
}
BlobSource implements DataSource for blob storage (local files, S3, GCS, Azure Blob, etc.). Uses gocloud.dev/blob to provide unified access to multiple storage backends.
Replaces and extends DirectorySource to support both local and cloud storage:

- file:///path/to/dir - Local filesystem
- s3://bucket/prefix?region=us-east-1 - AWS S3
- gs://bucket/prefix - Google Cloud Storage
- azblob://container/prefix - Azure Blob Storage
func NewBlobSource ¶ added in v1.21.0
func NewBlobSource(ctx context.Context, cfg BlobSourceConfig) (*BlobSource, error)
NewBlobSource creates a new blob storage data source.
URL examples:
- s3://my-bucket/docs?region=us-east-1
- gs://my-bucket/docs
- azblob://my-container/docs
Environment variables for credentials:
- AWS: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
- GCS: GOOGLE_APPLICATION_CREDENTIALS
- Azure: AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_KEY
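For example, a sketch of an S3-backed source built from defaults; the bucket, prefix, and size limit are illustrative:

cfg := rag.DefaultBlobSourceConfig("s3://my-bucket/docs?region=us-east-1")
cfg.Prefix = "manuals/"    // optional prefix filter
cfg.MaxFileSize = 10 << 20 // 10 MiB
src, err := rag.NewBlobSource(ctx, cfg)
if err != nil {
	// handle error
}
defer src.Close()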
func (*BlobSource) Close ¶ added in v1.21.0
func (bs *BlobSource) Close() error
Close releases any resources held by the data source.
func (*BlobSource) DiscoverDocuments ¶ added in v1.21.0
func (bs *BlobSource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
DiscoverDocuments returns channels of discovered documents and errors. Documents are discovered asynchronously and sent through the channel.
func (*BlobSource) GetFilter ¶ added in v1.21.0
func (bs *BlobSource) GetFilter() FileFilter
GetFilter returns the file filter.
func (*BlobSource) GetLastModified ¶ added in v1.21.0
func (bs *BlobSource) GetLastModified(ctx context.Context, id string) (time.Time, error)
GetLastModified returns the last modification time for a document.
func (*BlobSource) GetPrefix ¶ added in v1.21.0
func (bs *BlobSource) GetPrefix() string
GetPrefix returns the blob prefix filter.
func (*BlobSource) GetURL ¶ added in v1.21.0
func (bs *BlobSource) GetURL() string
GetURL returns the blob storage URL.
func (*BlobSource) ReadDocument ¶ added in v1.21.0
func (bs *BlobSource) ReadDocument(ctx context.Context, id string) (*Document, error)
ReadDocument retrieves a specific document by its ID. ID format: blob:<storage_type>:<key>
func (*BlobSource) SupportsIncrementalIndexing ¶ added in v1.21.0
func (bs *BlobSource) SupportsIncrementalIndexing() bool
SupportsIncrementalIndexing returns true as blob sources support incremental indexing.
func (*BlobSource) Type ¶ added in v1.21.0
func (bs *BlobSource) Type() string
Type returns the data source type.
type BlobSourceConfig ¶ added in v1.21.0
type BlobSourceConfig struct {
// URL is the blob storage URL (file://, s3://, gs://, azblob://)
URL string
// Prefix filters to blobs with this prefix (optional)
Prefix string
// Include patterns for files (same as DirectorySource)
Include []string
// Exclude patterns for files (same as DirectorySource)
Exclude []string
// MaxFileSize limits file size in bytes
MaxFileSize int64
}
BlobSourceConfig configures a blob storage source.
func DefaultBlobSourceConfig ¶ added in v1.21.0
func DefaultBlobSourceConfig(url string) BlobSourceConfig
DefaultBlobSourceConfig returns default configuration for blob source.
type Chunk ¶
type Chunk struct {
// Content is the actual text content of this chunk.
Content string `json:"content"`
// Index is the chunk's position within the document (0-based).
Index int `json:"index"`
// Total is the total number of chunks for this document.
Total int `json:"total"`
// StartLine is the starting line number in the source document (1-based).
StartLine int `json:"start_line"`
// EndLine is the ending line number in the source document (1-based).
EndLine int `json:"end_line"`
// StartByte is the byte offset where this chunk begins (optional).
StartByte int `json:"start_byte,omitempty"`
// EndByte is the byte offset where this chunk ends (optional).
EndByte int `json:"end_byte,omitempty"`
// Context provides semantic context for the chunk (function name, type, etc.).
Context *ChunkContext `json:"context,omitempty"`
// Metadata contains additional chunk-specific information.
Metadata map[string]any `json:"metadata,omitempty"`
}
Chunk represents a piece of content with position and context information.
Chunks are the fundamental unit of retrieval in RAG systems. Each chunk:
- Contains a portion of the original document
- Tracks its position within the source
- Preserves semantic context for better retrieval
Derived from legacy pkg/context/chunking/chunker.go:Chunk
type ChunkContext ¶
type ChunkContext struct {
// FunctionName is the containing function/method name (for code).
FunctionName string `json:"function_name,omitempty"`
// TypeName is the containing type/class name (for code).
TypeName string `json:"type_name,omitempty"`
// FilePath is the source file path.
FilePath string `json:"file_path,omitempty"`
// Language is the detected programming language (for code).
Language string `json:"language,omitempty"`
// Section is the document section name (for prose documents).
Section string `json:"section,omitempty"`
// ParentID links to a parent chunk (for hierarchical retrieval).
ParentID string `json:"parent_id,omitempty"`
}
ChunkContext provides semantic context for a chunk.
This is especially useful for code files where understanding the function or type a chunk belongs to improves retrieval quality.
type Chunker ¶
type Chunker interface {
// Chunk splits content into pieces.
//
// The content is split according to the chunker's strategy.
// Each chunk includes position information (line numbers, byte offsets)
// for source mapping.
//
// Parameters:
// - content: the text to split
// - ctx: optional context (e.g., from metadata extraction)
//
// Returns chunks ordered by position in the original content.
Chunk(content string, ctx *ChunkContext) ([]Chunk, error)
// Strategy returns the chunker strategy name.
Strategy() ChunkerStrategy
// Config returns the chunker configuration.
Config() ChunkerConfig
}
Chunker splits content into smaller pieces for indexing.
Chunking is critical for RAG quality:
- Too small: loses context, retrieves fragments
- Too large: wastes tokens, dilutes relevance
- Good chunking: preserves semantic units, enables precise retrieval
Derived from legacy pkg/context/chunking/chunker.go:Chunker
func NewChunker ¶
func NewChunker(cfg ChunkerConfig) (Chunker, error)
NewChunker creates a chunker from configuration.
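A short sketch of configuring an overlapping chunker and splitting a string; documentText is assumed, and the nil *ChunkContext relies on the context parameter being optional:

chunker, err := rag.NewChunker(rag.ChunkerConfig{
	Strategy: rag.ChunkerOverlapping,
	Size:     1000,
	Overlap:  200,
})
if err != nil {
	// handle error
}
chunks, err := chunker.Chunk(documentText, nil) // ctx is optional
if err != nil {
	// handle error
}
for _, c := range chunks {
	fmt.Printf("chunk %d/%d (lines %d-%d)\n", c.Index+1, c.Total, c.StartLine, c.EndLine)
}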
func NewChunkerFromConfig ¶
func NewChunkerFromConfig(cfg *config.ChunkingConfig) (Chunker, error)
NewChunkerFromConfig creates a chunker from configuration.
type ChunkerConfig ¶
type ChunkerConfig struct {
// Strategy is the chunking strategy.
// Values: "simple", "overlapping", "semantic"
// Default: "simple"
Strategy ChunkerStrategy `yaml:"strategy,omitempty"`
// Size is the target chunk size in characters.
// Default: 1000
Size int `yaml:"size,omitempty"`
// Overlap is the overlap size in characters (for overlapping strategy).
// Default: 200
Overlap int `yaml:"overlap,omitempty"`
// MinSize is the minimum chunk size (chunks smaller than this are merged).
// Default: 100
MinSize int `yaml:"min_size,omitempty"`
// MaxSize is the maximum chunk size (hard limit).
// Default: 2000
MaxSize int `yaml:"max_size,omitempty"`
// Separators are the preferred split points for semantic chunking.
// Default: ["\n\n", "\n", ". ", " "]
Separators []string `yaml:"separators,omitempty"`
// PreserveWords avoids splitting in the middle of words.
// Default: true
PreserveWords bool `yaml:"preserve_words,omitempty"`
}
ChunkerConfig configures chunking behavior.
func DefaultChunkerConfig ¶
func DefaultChunkerConfig() ChunkerConfig
DefaultChunkerConfig returns sensible defaults.
func (*ChunkerConfig) SetDefaults ¶
func (c *ChunkerConfig) SetDefaults()
SetDefaults applies default values.
func (*ChunkerConfig) Validate ¶
func (c *ChunkerConfig) Validate() error
Validate checks the configuration for errors.
type ChunkerStrategy ¶
type ChunkerStrategy string
ChunkerStrategy identifies a chunking strategy.
const (
	// ChunkerSimple splits content by fixed character count.
	// Fast but may split mid-sentence/word.
	ChunkerSimple ChunkerStrategy = "simple"

	// ChunkerOverlapping splits with overlap between chunks.
	// Better for retrieval as context is preserved at boundaries.
	ChunkerOverlapping ChunkerStrategy = "overlapping"

	// ChunkerSemantic splits at natural boundaries (paragraphs, sections).
	// Best quality but more complex and slower.
	ChunkerSemantic ChunkerStrategy = "semantic"
)
type ChunkingError ¶
type ChunkingError struct {
Strategy string // Chunking strategy
DocumentID string // Document ID
Message string // Error message
Err error // Underlying error
}
ChunkingError represents an error during document chunking.
func NewChunkingError ¶
func NewChunkingError(strategy, documentID, message string, err error) *ChunkingError
NewChunkingError creates a new ChunkingError.
func (*ChunkingError) Error ¶
func (e *ChunkingError) Error() string
Error implements the error interface.
func (*ChunkingError) Unwrap ¶
func (e *ChunkingError) Unwrap() error
Unwrap returns the underlying error.
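Because ChunkingError implements Error and Unwrap, it composes with the standard errors package, for example:

var chunkErr *rag.ChunkingError
if errors.As(err, &chunkErr) {
	log.Printf("chunking %q with strategy %s failed: %v",
		chunkErr.DocumentID, chunkErr.Strategy, chunkErr.Err)
}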
type CodeMetadata ¶
type CodeMetadata struct {
Functions []FunctionInfo `json:"functions,omitempty"`
Types []TypeInfo `json:"types,omitempty"`
Imports []string `json:"imports,omitempty"`
Symbols map[string]interface{} `json:"symbols,omitempty"`
Custom map[string]interface{} `json:"custom,omitempty"`
}
CodeMetadata contains extracted code structure information.
Direct port from legacy pkg/context/metadata/extractor.go
type CollectionSource ¶
type CollectionSource struct {
// contains filtered or unexported fields
}
CollectionSource implements DataSource for collection-only stores. It is a no-op source that indexes nothing; it is used when a document store points to an existing collection that is already populated.
Direct port from legacy pkg/context/indexing/collection_source.go
func NewCollectionSource ¶
func NewCollectionSource(collectionName string) *CollectionSource
NewCollectionSource creates a new collection-only data source.
func (*CollectionSource) Close ¶
func (cs *CollectionSource) Close() error
Close closes the collection source.
func (*CollectionSource) CollectionName ¶
func (cs *CollectionSource) CollectionName() string
CollectionName returns the collection name.
func (*CollectionSource) DiscoverDocuments ¶
func (cs *CollectionSource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
DiscoverDocuments returns empty channels - no documents to index.
func (*CollectionSource) GetLastModified ¶
func (cs *CollectionSource) GetLastModified(ctx context.Context, id string) (time.Time, error)
GetLastModified returns the zero time; not supported for collection sources.
func (*CollectionSource) ReadDocument ¶
func (cs *CollectionSource) ReadDocument(ctx context.Context, id string) (*Document, error)
ReadDocument returns an error; not supported for collection sources.
func (*CollectionSource) SupportsIncrementalIndexing ¶
func (cs *CollectionSource) SupportsIncrementalIndexing() bool
SupportsIncrementalIndexing returns false.
func (*CollectionSource) Type ¶
func (cs *CollectionSource) Type() string
Type returns the data source type.
type ContentExtractor ¶
type ContentExtractor interface {
// Name returns the extractor name for logging/debugging.
Name() string
// CanExtract determines if this extractor can handle the given file.
CanExtract(path string, mimeType string) bool
// Extract extracts content from the file.
Extract(ctx context.Context, path string, fileSize int64) (*ExtractedContent, error)
// Priority returns the priority (higher = preferred when multiple extractors match).
Priority() int
}
ContentExtractor defines the interface for extracting content from files.
Direct port from legacy pkg/context/extraction/extractor.go
type DBPoolAdapter ¶
type DBPoolAdapter struct {
// contains filtered or unexported fields
}
DBPoolAdapter wraps config.DBPool to provide sql.DB connections.
func NewDBPoolAdapter ¶
func NewDBPoolAdapter(pool *config.DBPool, databaseDSN string) *DBPoolAdapter
NewDBPoolAdapter creates an adapter for the DBPool.
type DataSource ¶
type DataSource interface {
// Type returns the type of data source (e.g., "directory", "sql", "api", "s3")
Type() string
// DiscoverDocuments returns a channel of discovered documents and a channel of errors.
// Documents are discovered asynchronously and sent through the channel.
// For file sources, content should be read from files.
// For SQL/API sources, content should already be populated.
DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
// ReadDocument retrieves a specific document by its ID.
// The ID format depends on the source type (file path, SQL row ID, API endpoint, etc.)
ReadDocument(ctx context.Context, id string) (*Document, error)
// SupportsIncrementalIndexing indicates if this source supports incremental updates
// based on modification timestamps or change tracking.
SupportsIncrementalIndexing() bool
// GetLastModified returns the last modification time for a document, if available.
// Returns zero time if not supported or document doesn't exist.
GetLastModified(ctx context.Context, id string) (time.Time, error)
// Close releases any resources held by the data source.
Close() error
}
DataSource represents a generic source of documents to be indexed. It abstracts over filesystem, SQL databases, REST APIs, and cloud storage.
Direct port from legacy pkg/context/indexing/data_source.go
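A sketch of draining both channels from any DataSource; it assumes the source closes them when discovery finishes, which the interface documentation implies but does not state explicitly:

docs, errs := src.DiscoverDocuments(ctx)
for docs != nil || errs != nil {
	select {
	case doc, ok := <-docs:
		if !ok {
			docs = nil // channel drained
			continue
		}
		_ = doc // process the document
	case err, ok := <-errs:
		if !ok {
			errs = nil // channel drained
			continue
		}
		log.Printf("discovery error: %v", err)
	}
}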
func NewBlobSourceFromConfig ¶ added in v1.21.0
func NewBlobSourceFromConfig(ctx context.Context, cfg BlobSourceConfig) (DataSource, error)
NewBlobSourceFromConfig creates a blob source from BlobSourceConfig.
func NewDataSourceFromConfig ¶
func NewDataSourceFromConfig(cfg *config.DocumentSourceConfig, deps *FactoryDeps) (DataSource, error)
NewDataSourceFromConfig creates a data source from configuration.
func NewDirectorySourceFromConfig ¶
func NewDirectorySourceFromConfig(cfg DirectorySourceConfig) (DataSource, error)
NewDirectorySourceFromConfig creates a directory source from config.
type DirectorySource ¶
type DirectorySource struct {
// contains filtered or unexported fields
}
DirectorySource implements DataSource for local filesystem directories.
Direct port from legacy pkg/context/indexing/directory_source.go
func NewDirectorySource ¶
func NewDirectorySource(basePath string, filter FileFilter, maxFileSize int64) *DirectorySource
NewDirectorySource creates a new directory-based data source.
Direct port from legacy pkg/context/indexing/directory_source.go
func (*DirectorySource) Close ¶
func (ds *DirectorySource) Close() error
Close releases any resources held by the data source.
func (*DirectorySource) DiscoverDocuments ¶
func (ds *DirectorySource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
DiscoverDocuments returns channels of discovered documents and errors. Documents are discovered asynchronously and sent through the channel.
Direct port from legacy pkg/context/indexing/directory_source.go
func (*DirectorySource) GetBasePath ¶
func (ds *DirectorySource) GetBasePath() string
GetBasePath returns the base directory path (helper method).
func (*DirectorySource) GetFilter ¶
func (ds *DirectorySource) GetFilter() FileFilter
GetFilter returns the file filter (helper method).
func (*DirectorySource) GetLastModified ¶
func (ds *DirectorySource) GetLastModified(ctx context.Context, id string) (time.Time, error)
GetLastModified returns the last modification time for a document.
func (*DirectorySource) ReadDocument ¶
func (ds *DirectorySource) ReadDocument(ctx context.Context, id string) (*Document, error)
ReadDocument retrieves a specific document by its ID (file path).
Direct port from legacy pkg/context/indexing/directory_source.go
func (*DirectorySource) SupportsIncrementalIndexing ¶
func (ds *DirectorySource) SupportsIncrementalIndexing() bool
SupportsIncrementalIndexing returns true as directory sources support incremental indexing.
func (*DirectorySource) Type ¶
func (ds *DirectorySource) Type() string
Type returns the data source type.
type DirectorySourceConfig ¶
type DirectorySourceConfig struct {
Path string
Include []string
Exclude []string
MaxFileSize int64 // Max file size in bytes to process (0 for no limit)
}
DirectorySourceConfig configures a directory data source.
func DefaultDirectorySourceConfig ¶
func DefaultDirectorySourceConfig(path string) DirectorySourceConfig
DefaultDirectorySourceConfig returns sensible defaults for directory source. Includes both text-based source code files and binary document formats that can be parsed by native parsers (PDF, DOCX, XLSX).
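A sketch of a directory source built from defaults; the exclude pattern syntax is assumed to be glob-like:

cfg := rag.DefaultDirectorySourceConfig("./docs")
cfg.Exclude = append(cfg.Exclude, "**/vendor/**")
src, err := rag.NewDirectorySourceFromConfig(cfg)
if err != nil {
	// handle error
}
defer src.Close()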
type Document ¶
type Document struct {
// ID is the unique identifier for this document.
ID string `json:"id"`
// Content is the text content to be indexed.
Content string `json:"content"`
// Title is the document title (optional).
Title string `json:"title,omitempty"`
// SourcePath is the path to the source file (for file-based documents).
SourcePath string `json:"source_path,omitempty"`
// MimeType is the content type (e.g., "text/plain", "text/markdown").
MimeType string `json:"mime_type,omitempty"`
// Size is the content size in bytes.
Size int64 `json:"size"`
// Metadata contains additional document information.
Metadata map[string]any `json:"metadata,omitempty"`
}
Document represents a document to be indexed.
Documents go through the following pipeline:
- Content extraction (if binary)
- Chunking (split into searchable pieces)
- Embedding (convert to vectors)
- Indexing (store in vector database)
type DocumentEvent ¶
type DocumentEvent struct {
Type DocumentEventType
Document Document
Error error
}
DocumentEvent represents a change in a document.
type DocumentEventType ¶
type DocumentEventType string
DocumentEventType indicates the type of change.
const (
	DocumentEventCreate DocumentEventType = "create"
	DocumentEventUpdate DocumentEventType = "update"
	DocumentEventDelete DocumentEventType = "delete"
	DocumentEventError  DocumentEventType = "error"
)
type DocumentStore ¶
type DocumentStore struct {
// contains filtered or unexported fields
}
DocumentStore manages document indexing and search.
It combines:
- DataSource: Where documents come from
- ContentExtractor: How to extract text from documents
- SearchEngine: How to index and search
- File watching: Automatic re-indexing on changes
- Concurrent indexing with configurable worker pool
- Retry logic for transient failures
- Checkpoint/resume for interrupted indexing
- Progress tracking with ETA
Direct port from legacy pkg/context/document_store.go
func NewDocumentStore ¶
func NewDocumentStore(cfg DocumentStoreConfig) (*DocumentStore, error)
NewDocumentStore creates a new document store.
func NewDocumentStoreFromConfig ¶
func NewDocumentStoreFromConfig(
	name string,
	storeCfg *config.DocumentStoreConfig,
	deps *FactoryDeps,
) (*DocumentStore, error)
NewDocumentStoreFromConfig creates a document store from configuration.
func (*DocumentStore) Clear ¶
func (s *DocumentStore) Clear(ctx context.Context) error
Clear removes all indexed documents.
func (*DocumentStore) Close ¶
func (s *DocumentStore) Close() error
Close stops watching and releases resources.
func (*DocumentStore) Collection ¶
func (s *DocumentStore) Collection() string
Collection returns the collection name.
func (*DocumentStore) Config ¶
func (s *DocumentStore) Config() DocumentStoreConfig
Config returns the store configuration.
func (*DocumentStore) GetDocument ¶
func (s *DocumentStore) GetDocument(ctx context.Context, id string) (*SearchResult, error)
GetDocument retrieves a specific document by ID.
Direct port from legacy pkg/context/document_store.go
func (*DocumentStore) GetSearchEngine ¶
func (s *DocumentStore) GetSearchEngine() *SearchEngine
GetSearchEngine returns the underlying search engine.
Direct port from legacy pkg/context/document_store.go
func (*DocumentStore) HealthCheck ¶
func (s *DocumentStore) HealthCheck(ctx context.Context) HealthCheck
HealthCheck checks the health of the DocumentStore.
func (*DocumentStore) Index ¶
func (s *DocumentStore) Index(ctx context.Context) error
Index indexes all documents from the source with concurrent processing.
Uses the channel-based DiscoverDocuments from the legacy architecture, with a worker pool for concurrent indexing (like the legacy indexingSemaphore). Supports checkpoint/resume for interrupted indexing.
Direct port from legacy pkg/context/document_store_indexing.go
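Putting the pieces together, a sketch of indexing a source through a store, with src and engine constructed as in the earlier examples:

store, err := rag.NewDocumentStore(rag.DocumentStoreConfig{
	Name:         "docs",
	Source:       src,
	SearchEngine: engine,
	Watch:        true,
})
if err != nil {
	// handle error
}
defer store.Close()

if err := store.Index(ctx); err != nil {
	// handle error
}
fmt.Printf("%+v\n", store.Stats())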
func (*DocumentStore) Metrics ¶
func (s *DocumentStore) Metrics() IndexMetricsSnapshot
Metrics returns detailed indexing metrics.
func (*DocumentStore) RefreshDocument ¶
func (s *DocumentStore) RefreshDocument(ctx context.Context, docID string) error
RefreshDocument re-indexes a single document by path.
Direct port from legacy pkg/context/document_store.go
func (*DocumentStore) RegisterExtractor ¶
func (s *DocumentStore) RegisterExtractor(e ContentExtractor)
RegisterExtractor adds a custom content extractor.
func (*DocumentStore) Search ¶
func (s *DocumentStore) Search(ctx context.Context, req SearchRequest) (*SearchResponse, error)
Search searches for documents.
func (*DocumentStore) SearchWithFilter ¶
func (s *DocumentStore) SearchWithFilter(ctx context.Context, query string, topK int, filter map[string]any) (*SearchResponse, error)
SearchWithFilter searches with metadata filtering.
func (*DocumentStore) StartWatching ¶
func (s *DocumentStore) StartWatching(ctx context.Context) error
StartWatching starts watching for document changes.
Direct port from legacy pkg/context/document_store.go
func (*DocumentStore) Stats ¶
func (s *DocumentStore) Stats() DocumentStoreStats
Stats returns indexing statistics.
func (*DocumentStore) StopWatching ¶
func (s *DocumentStore) StopWatching()
StopWatching stops watching for changes.
type DocumentStoreConfig ¶
type DocumentStoreConfig struct {
// Name identifies this store.
Name string
// Description describes the store (used by SearchTool).
Description string
// Source provides documents.
Source DataSource
// SearchEngine for indexing and search.
SearchEngine *SearchEngine
// Chunker for splitting documents (optional, defaults to engine's chunker).
Chunker Chunker
// Collection name (optional, defaults to store name).
Collection string
// SourcePath is the base path for checkpoints (auto-detected from directory source).
SourcePath string
// Watch enables file watching for automatic re-indexing.
Watch bool
// IncrementalIndexing only re-indexes changed documents.
IncrementalIndexing bool
// EnableCheckpoints enables resume capability for interrupted indexing.
// Checkpoints are saved to .hector/checkpoints/ in the source path.
// Default: true for directory sources
EnableCheckpoints bool
// EnableProgress enables progress display during indexing.
// Default: true
EnableProgress bool
// Search configuration for advanced features.
Search *SearchOptions
// MaxConcurrentIndexing limits parallel document processing (default: NumCPU).
// Set to 1 for sequential indexing (legacy behavior).
MaxConcurrentIndexing int
// RetryConfig for transient failure handling (optional).
RetryConfig *RetryConfig
}
DocumentStoreConfig configures a document store.
type DocumentStoreError ¶
type DocumentStoreError struct {
StoreName string // Name of the document store
Operation string // Operation that failed
Message string // Error message
FilePath string // File path if applicable
Err error // Underlying error
Timestamp time.Time // When the error occurred
}
DocumentStoreError represents an error in document store operations.
Inspired by legacy pkg/context/document_store.go error handling
func NewDocumentStoreError ¶
func NewDocumentStoreError(storeName, operation, message, filePath string, err error) *DocumentStoreError
NewDocumentStoreError creates a new DocumentStoreError.
func (*DocumentStoreError) Error ¶
func (e *DocumentStoreError) Error() string
Error implements the error interface.
func (*DocumentStoreError) Unwrap ¶
func (e *DocumentStoreError) Unwrap() error
Unwrap returns the underlying error.
type DocumentStoreStats ¶
type DocumentStoreStats struct {
Name string `json:"name"`
Collection string `json:"collection"`
IndexedCount int `json:"indexed_count"`
WatchEnabled bool `json:"watch_enabled"`
SourceType string `json:"source_type"`
TotalDocs int64 `json:"total_docs"`
SkippedDocs int64 `json:"skipped_docs"`
ErrorDocs int64 `json:"error_docs"`
DocsPerSecond float64 `json:"docs_per_second"`
SearchCount int64 `json:"search_count"`
}
DocumentStoreStats contains store statistics.
type ExtractedContent ¶
type ExtractedContent struct {
Content string // The extracted text content
Title string // Document title (if available)
Author string // Document author (if available)
Metadata map[string]string // Additional metadata
ProcessingTimeMs int64 // Time taken to extract
ExtractorName string // Name of extractor used
}
ExtractedContent represents extracted file content with metadata.
Direct port from legacy pkg/context/extraction/extractor.go
type ExtractionError ¶
type ExtractionError struct {
Extractor string // Extractor name
FilePath string // File path
Message string // Error message
Err error // Underlying error
}
ExtractionError represents an error during content extraction.
func NewExtractionError ¶
func NewExtractionError(extractor, filePath, message string, err error) *ExtractionError
NewExtractionError creates a new ExtractionError.
func (*ExtractionError) Error ¶
func (e *ExtractionError) Error() string
Error implements the error interface.
func (*ExtractionError) Unwrap ¶
func (e *ExtractionError) Unwrap() error
Unwrap returns the underlying error.
type ExtractorRegistry ¶
type ExtractorRegistry struct {
// contains filtered or unexported fields
}
ExtractorRegistry manages multiple content extractors.
Direct port from legacy pkg/context/extraction/extractor.go
func NewExtractorRegistry ¶
func NewExtractorRegistry() *ExtractorRegistry
NewExtractorRegistry creates a new extractor registry with default extractors. Registers:
- BinaryExtractor (priority 5): PDF, DOCX, XLSX via native parsers
- TextExtractor (priority 1): Plain text files
func (*ExtractorRegistry) Extract ¶
func (r *ExtractorRegistry) Extract(ctx context.Context, doc Document) (*ExtractedContent, error)
Extract tries to extract content using the best available extractor. Adapts the document-based interface for store.go compatibility.
func (*ExtractorRegistry) ExtractContent ¶
func (r *ExtractorRegistry) ExtractContent(ctx context.Context, path string, mimeType string, fileSize int64) (*ExtractedContent, error)
ExtractContent tries to extract content using the best available extractor.
func (*ExtractorRegistry) GetExtractors ¶
func (r *ExtractorRegistry) GetExtractors() []ContentExtractor
GetExtractors returns all registered extractors (for debugging).
func (*ExtractorRegistry) HasExtractorForFile ¶
func (r *ExtractorRegistry) HasExtractorForFile(path string, mimeType string) bool
HasExtractorForFile checks if any extractor can handle the given file. This is useful for determining if a file can be indexed before attempting extraction.
func (*ExtractorRegistry) Register ¶
func (r *ExtractorRegistry) Register(extractor ContentExtractor)
Register adds an extractor to the registry.
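A sketch of a custom extractor, a hypothetical CSV passthrough, satisfying ContentExtractor:

type csvExtractor struct{}

func (csvExtractor) Name() string { return "csv" }

func (csvExtractor) CanExtract(path string, mimeType string) bool {
	return strings.HasSuffix(path, ".csv") || mimeType == "text/csv"
}

func (csvExtractor) Extract(ctx context.Context, path string, fileSize int64) (*rag.ExtractedContent, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	return &rag.ExtractedContent{Content: string(b), ExtractorName: "csv"}, nil
}

func (csvExtractor) Priority() int { return 3 }

// Usage, on a registry created via NewExtractorRegistry:
registry := rag.NewExtractorRegistry()
registry.Register(csvExtractor{})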
type FactoryDeps ¶
type FactoryDeps struct {
// DBPool provides database connections.
DBPool *config.DBPool
// DatabaseDSN for database connections.
DatabaseDSN string
// VectorProviders maps provider names to instances.
VectorProviders map[string]vector.Provider
// Embedders maps embedder names to instances.
Embedders map[string]embedder.Embedder
// LLMs maps LLM names to instances.
LLMs map[string]model.LLM
// ToolCaller provides access to MCP tools for document parsing.
// Optional - only needed if MCPParsers is configured.
ToolCaller ToolCaller
}
FactoryDeps provides dependencies for creating RAG components.
type FileCheckpoint ¶
type FileCheckpoint struct {
Path string `json:"path"`
Hash string `json:"hash"`
Size int64 `json:"size"`
ModTime time.Time `json:"mod_time"`
Status string `json:"status"` // "indexed", "skipped", "failed"
ProcessedAt time.Time `json:"processed_at"`
}
FileCheckpoint contains information about a processed file.
Direct port from legacy pkg/context/checkpoint.go
type FileFilter ¶
FileFilter determines if a file should be indexed.
Direct port from legacy pkg/context/indexing/data_source.go:FileFilter
type FileWatcher ¶
type FileWatcher struct {
// contains filtered or unexported fields
}
FileWatcher watches a directory for file changes using fsnotify.
Direct port from legacy pkg/context/document_store.go fsnotify watching
func NewFileWatcher ¶
func NewFileWatcher(cfg FileWatcherConfig) (*FileWatcher, error)
NewFileWatcher creates a new file watcher.
func (*FileWatcher) IsWatching ¶
func (fw *FileWatcher) IsWatching() bool
IsWatching returns whether the watcher is active.
func (*FileWatcher) Start ¶
func (fw *FileWatcher) Start(ctx context.Context) (<-chan DocumentEvent, error)
Start begins watching the directory for changes.
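A sketch of consuming watch events; the Filter field is omitted here because the FileFilter definition is not shown in this overview:

fw, err := rag.NewFileWatcher(rag.FileWatcherConfig{
	BasePath:      "./docs",
	DebounceDelay: 200 * time.Millisecond,
})
if err != nil {
	// handle error
}
events, err := fw.Start(ctx)
if err != nil {
	// handle error
}
for ev := range events {
	switch ev.Type {
	case rag.DocumentEventCreate, rag.DocumentEventUpdate:
		// re-index ev.Document
	case rag.DocumentEventDelete:
		// remove ev.Document from the index
	case rag.DocumentEventError:
		log.Printf("watch error: %v", ev.Error)
	}
}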
type FileWatcherConfig ¶
type FileWatcherConfig struct {
BasePath string
Filter FileFilter
DebounceDelay time.Duration // Delay before processing events (default: 100ms)
}
FileWatcherConfig configures the file watcher.
type FunctionInfo ¶
type FunctionInfo struct {
Name string `json:"name"`
Signature string `json:"signature,omitempty"`
StartLine int `json:"start_line"`
EndLine int `json:"end_line"`
Receiver string `json:"receiver,omitempty"` // For methods
IsExported bool `json:"is_exported,omitempty"`
DocComment string `json:"doc_comment,omitempty"`
}
FunctionInfo contains information about a function.
Direct port from legacy pkg/context/metadata/extractor.go
type GoMetadataExtractor ¶
type GoMetadataExtractor struct{}
GoMetadataExtractor extracts metadata from Go source files using AST parsing.
Direct port from legacy pkg/context/metadata/go_extractor.go
func NewGoMetadataExtractor ¶
func NewGoMetadataExtractor() *GoMetadataExtractor
NewGoMetadataExtractor creates a new Go metadata extractor.
func (*GoMetadataExtractor) CanExtract ¶
func (ge *GoMetadataExtractor) CanExtract(language string) bool
CanExtract checks if this extractor can handle the language.
func (*GoMetadataExtractor) Extract ¶
func (ge *GoMetadataExtractor) Extract(content string, filePath string) (*CodeMetadata, error)
Extract parses Go source code and extracts metadata.
func (*GoMetadataExtractor) Name ¶
func (ge *GoMetadataExtractor) Name() string
Name returns the extractor name.
type HealthCheck ¶
type HealthCheck struct {
// Component name.
Component string `json:"component"`
// Status of the component.
Status HealthStatus `json:"status"`
// Message provides details about the status.
Message string `json:"message,omitempty"`
// Latency of the health check.
Latency time.Duration `json:"latency_ms"`
// Timestamp of the check.
Timestamp time.Time `json:"timestamp"`
// Details contains component-specific health information.
Details map[string]any `json:"details,omitempty"`
}
HealthCheck represents the result of a health check.
func (HealthCheck) IsHealthy ¶
func (h HealthCheck) IsHealthy() bool
IsHealthy returns true if the status is healthy.
type HealthChecker ¶
type HealthChecker interface {
HealthCheck(ctx context.Context) HealthCheck
}
HealthChecker is an interface for components that support health checking.
type HealthStatus ¶
type HealthStatus string
HealthStatus represents the health state of a component.
const (
	// HealthStatusHealthy indicates the component is functioning normally.
	HealthStatusHealthy HealthStatus = "healthy"

	// HealthStatusDegraded indicates the component is functioning but with issues.
	HealthStatusDegraded HealthStatus = "degraded"

	// HealthStatusUnhealthy indicates the component is not functioning.
	HealthStatusUnhealthy HealthStatus = "unhealthy"
)
type HyDE ¶
type HyDE struct {
// contains filtered or unexported fields
}
HyDE implements Hypothetical Document Embeddings.
Instead of searching with the query embedding directly, HyDE:
- Uses an LLM to generate a hypothetical document that would answer the query
- Embeds the hypothetical document
- Uses that embedding for search
This can significantly improve retrieval for questions, as the hypothetical document's embedding is closer to actual relevant documents than the query embedding.
Paper: "Precise Zero-Shot Dense Retrieval without Relevance Labels" https://arxiv.org/abs/2212.10496
Derived from legacy pkg/context/hyde.go
func (*HyDE) EnhancedSearch ¶
func (h *HyDE) EnhancedSearch(ctx context.Context, query string) (hypotheticalDoc string, err error)
EnhancedSearch performs HyDE-enhanced search.
This is a convenience method that:
- Generates a hypothetical document
- Returns both the hypothetical doc and the original query
The caller should embed the hypothetical doc instead of the query.
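A sketch of the intended flow; h is an assumed, already-constructed *HyDE, since its constructor is not shown in this overview:

hypo, err := h.EnhancedSearch(ctx, "how do I rotate API keys?")
if err != nil {
	// handle error
}
// Embed hypo rather than the raw query, then search the vector
// provider with that embedding.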
type IndexCheckpoint ¶
type IndexCheckpoint struct {
Version string `json:"version"`
StoreName string `json:"store_name"`
SourcePath string `json:"source_path"`
StartTime time.Time `json:"start_time"`
LastUpdate time.Time `json:"last_update"`
ProcessedFiles map[string]FileCheckpoint `json:"processed_files"`
TotalFiles int `json:"total_files"`
IndexedCount int `json:"indexed_count"`
SkippedCount int `json:"skipped_count"`
FailedCount int `json:"failed_count"`
}
IndexCheckpoint represents a saved indexing checkpoint.
Direct port from legacy pkg/context/checkpoint.go
type IndexCheckpointManager ¶
type IndexCheckpointManager struct {
// contains filtered or unexported fields
}
IndexCheckpointManager manages indexing checkpoints.
Direct port from legacy pkg/context/checkpoint.go
func NewIndexCheckpointManager ¶
func NewIndexCheckpointManager(storeName, sourcePath string, enabled bool) *IndexCheckpointManager
NewIndexCheckpointManager creates a new checkpoint manager.
Direct port from legacy pkg/context/checkpoint.go
func (*IndexCheckpointManager) ClearCheckpoint ¶
func (cm *IndexCheckpointManager) ClearCheckpoint() error
ClearCheckpoint removes the checkpoint file.
func (*IndexCheckpointManager) ForceSave ¶
func (cm *IndexCheckpointManager) ForceSave() error
ForceSave forces a checkpoint save regardless of the save interval.
func (*IndexCheckpointManager) FormatCheckpointInfo ¶
func (cm *IndexCheckpointManager) FormatCheckpointInfo(checkpoint *IndexCheckpoint) string
FormatCheckpointInfo returns a human-readable checkpoint summary.
func (*IndexCheckpointManager) GetProcessedCount ¶
func (cm *IndexCheckpointManager) GetProcessedCount() int
GetProcessedCount returns the number of processed files.
func (*IndexCheckpointManager) IsEnabled ¶
func (cm *IndexCheckpointManager) IsEnabled() bool
IsEnabled returns whether checkpointing is enabled.
func (*IndexCheckpointManager) LoadCheckpoint ¶
func (cm *IndexCheckpointManager) LoadCheckpoint() (*IndexCheckpoint, error)
LoadCheckpoint attempts to load an existing checkpoint.
func (*IndexCheckpointManager) RecordFile ¶
func (cm *IndexCheckpointManager) RecordFile(path string, size int64, modTime time.Time, status string)
RecordFile records a processed file in the checkpoint.
func (*IndexCheckpointManager) SaveCheckpoint ¶
func (cm *IndexCheckpointManager) SaveCheckpoint() error
SaveCheckpoint saves the current checkpoint.
func (*IndexCheckpointManager) SetTotalFiles ¶
func (cm *IndexCheckpointManager) SetTotalFiles(total int)
SetTotalFiles sets the total file count.
func (*IndexCheckpointManager) ShouldProcessFile ¶
func (cm *IndexCheckpointManager) ShouldProcessFile(path string, size int64, modTime time.Time) bool
ShouldProcessFile checks if a file should be processed (not in checkpoint or changed).
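A sketch of the resume flow; files is an assumed slice whose elements carry Path, Size, and ModTime:

cm := rag.NewIndexCheckpointManager("docs", "./docs", true)
if cp, err := cm.LoadCheckpoint(); err == nil && cp != nil {
	fmt.Println(cm.FormatCheckpointInfo(cp))
}
for _, f := range files {
	if !cm.ShouldProcessFile(f.Path, f.Size, f.ModTime) {
		continue // unchanged since the last run
	}
	// ... index the file ...
	cm.RecordFile(f.Path, f.Size, f.ModTime, "indexed")
}
_ = cm.ForceSave()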
type IndexError ¶
type IndexError struct {
StoreName string // Document store name
DocumentID string // Document ID
Operation string // Operation (e.g., "embed", "upsert", "delete")
Message string // Error message
Err error // Underlying error
}
IndexError represents an error during indexing operations.
func NewIndexError ¶
func NewIndexError(storeName, documentID, operation, message string, err error) *IndexError
NewIndexError creates a new IndexError.
func (*IndexError) Error ¶
func (e *IndexError) Error() string
Error implements the error interface.
func (*IndexError) Unwrap ¶
func (e *IndexError) Unwrap() error
Unwrap returns the underlying error.
type IndexMetrics ¶
type IndexMetrics struct {
// contains filtered or unexported fields
}
IndexMetrics tracks document store indexing metrics.
Thread-safe for concurrent access during indexing.
func NewIndexMetrics ¶
func NewIndexMetrics(storeName string) *IndexMetrics
NewIndexMetrics creates a new metrics tracker.
func (*IndexMetrics) IncrementErrors ¶
func (m *IndexMetrics) IncrementErrors()
IncrementErrors increments error count.
func (*IndexMetrics) IncrementIndexed ¶
func (m *IndexMetrics) IncrementIndexed()
IncrementIndexed increments indexed document count.
func (*IndexMetrics) IncrementSkipped ¶
func (m *IndexMetrics) IncrementSkipped()
IncrementSkipped increments skipped document count.
func (*IndexMetrics) IncrementTotal ¶
func (m *IndexMetrics) IncrementTotal()
IncrementTotal increments total document count.
func (*IndexMetrics) RecordSearch ¶
func (m *IndexMetrics) RecordSearch(latency time.Duration)
RecordSearch records a search operation with latency.
func (*IndexMetrics) SetEndTime ¶
func (m *IndexMetrics) SetEndTime(t time.Time)
SetEndTime sets the indexing end time.
func (*IndexMetrics) SetStartTime ¶
func (m *IndexMetrics) SetStartTime(t time.Time)
SetStartTime sets the indexing start time.
func (*IndexMetrics) Snapshot ¶
func (m *IndexMetrics) Snapshot() IndexMetricsSnapshot
Snapshot returns a point-in-time copy of all metrics.
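A minimal sketch of the metrics lifecycle:

m := rag.NewIndexMetrics("docs")
m.SetStartTime(time.Now())
m.IncrementTotal()
m.IncrementIndexed()
m.SetEndTime(time.Now())

snap := m.Snapshot()
fmt.Printf("indexed %d/%d docs at %.1f docs/s\n",
	snap.IndexedDocs, snap.TotalDocs, snap.DocsPerSecond)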
type IndexMetricsSnapshot ¶
type IndexMetricsSnapshot struct {
StoreName string `json:"store_name"`
TotalDocs int64 `json:"total_docs"`
IndexedDocs int64 `json:"indexed_docs"`
SkippedDocs int64 `json:"skipped_docs"`
ErrorDocs int64 `json:"error_docs"`
DocsPerSecond float64 `json:"docs_per_second"`
StartTime time.Time `json:"start_time,omitempty"`
EndTime time.Time `json:"end_time,omitempty"`
SearchCount int64 `json:"search_count"`
AvgSearchLatency time.Duration `json:"avg_search_latency_ns"`
MaxSearchLatency time.Duration `json:"max_search_latency_ns"`
LastSearchLatency time.Duration `json:"last_search_latency_ns"`
}
IndexMetricsSnapshot is a point-in-time copy of metrics.
type LLMQueryExpander ¶
type LLMQueryExpander struct {
// contains filtered or unexported fields
}
LLMQueryExpander uses an LLM to generate query variations.
Direct port from legacy pkg/context/query_expansion.go
func NewLLMQueryExpander ¶
func NewLLMQueryExpander(llm model.LLM) *LLMQueryExpander
NewLLMQueryExpander creates a new LLM-based query expander.
type MCPExtractor ¶
type MCPExtractor struct {
// contains filtered or unexported fields
}
MCPExtractor handles document parsing via MCP tools. This allows using any MCP service (Docling, etc.) for document parsing.
Direct port from legacy pkg/context/extraction/mcp_extractor.go
func NewMCPExtractor ¶
func NewMCPExtractor(config MCPExtractorConfig) (*MCPExtractor, error)
NewMCPExtractor creates a new MCP-based extractor.
Direct port from legacy pkg/context/extraction/mcp_extractor.go
func (*MCPExtractor) CanExtract ¶
func (e *MCPExtractor) CanExtract(path string, mimeType string) bool
CanExtract checks if this extractor can handle the file.
func (*MCPExtractor) Extract ¶
func (e *MCPExtractor) Extract(ctx context.Context, path string, fileSize int64) (*ExtractedContent, error)
Extract uses MCP tools to extract content from files.
func (*MCPExtractor) Priority ¶
func (e *MCPExtractor) Priority() int
Priority returns the extractor priority.
type MCPExtractorConfig ¶
type MCPExtractorConfig struct {
ToolCaller ToolCaller
ParserToolNames []string // Tool names to try (e.g., ["parse_document", "docling_parse"])
SupportedExts []string // File extensions this extractor handles (empty = all)
Priority int // Priority (higher = preferred)
LocalBasePath string // Local base path of the document store (e.g., "/Users/user/workspace/hector/test-docs")
PathPrefix string // Remote path prefix for containerized MCP services (e.g., "/docs")
}
MCPExtractorConfig configures an MCP extractor.
Direct port from legacy pkg/context/extraction/mcp_extractor.go
type MetadataExtractor ¶
type MetadataExtractor interface {
// Name returns the extractor name
Name() string
// CanExtract determines if this extractor can handle the given language
CanExtract(language string) bool
// Extract extracts metadata from source code
Extract(content string, filePath string) (*CodeMetadata, error)
}
MetadataExtractor defines the interface for extracting metadata from source code.
Direct port from legacy pkg/context/metadata/extractor.go
type MetadataExtractorRegistry ¶
type MetadataExtractorRegistry struct {
// contains filtered or unexported fields
}
MetadataExtractorRegistry manages metadata extractors.
Direct port from legacy pkg/context/metadata/extractor.go
func NewMetadataExtractorRegistry ¶
func NewMetadataExtractorRegistry() *MetadataExtractorRegistry
NewMetadataExtractorRegistry creates a new metadata extractor registry.
func (*MetadataExtractorRegistry) ExtractMetadata ¶
func (r *MetadataExtractorRegistry) ExtractMetadata(language string, content string, filePath string) (*CodeMetadata, error)
ExtractMetadata tries to extract metadata using the appropriate extractor.
func (*MetadataExtractorRegistry) GetExtractors ¶
func (r *MetadataExtractorRegistry) GetExtractors() []MetadataExtractor
GetExtractors returns all registered extractors.
func (*MetadataExtractorRegistry) Register ¶
func (r *MetadataExtractorRegistry) Register(extractor MetadataExtractor)
Register adds a metadata extractor for specific languages.
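A sketch of extracting Go metadata; the Go extractor is registered explicitly because default registrations are not documented here, and srcCode is an assumed string of Go source:

reg := rag.NewMetadataExtractorRegistry()
reg.Register(rag.NewGoMetadataExtractor())

meta, err := reg.ExtractMetadata("go", srcCode, "pkg/rag/search.go")
if err != nil {
	// handle error
}
for _, fn := range meta.Functions {
	fmt.Printf("%s (lines %d-%d)\n", fn.Name, fn.StartLine, fn.EndLine)
}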
type MultiQueryExpander ¶
type MultiQueryExpander struct {
// contains filtered or unexported fields
}
MultiQueryExpander generates multiple query variants for better recall.
Multi-query retrieval improves recall by:
- Generating alternative phrasings of the query
- Searching with each variant
- Combining and deduplicating results
This helps when:
- Queries are ambiguous
- Relevant documents use different terminology
- Users don't know exact terms used in documents
Derived from legacy pkg/context/multi_query.go
func NewMultiQueryExpander ¶
func NewMultiQueryExpander(llm model.LLM, numQueries int) *MultiQueryExpander
NewMultiQueryExpander creates a new multi-query expander.
func (*MultiQueryExpander) ExpandQuery ¶
ExpandQuery generates multiple query variants.
type NativeParseResult ¶
type NativeParseResult struct {
Success bool
Content string
Title string
Author string
Metadata map[string]string
Error string
ProcessingTimeMs int64
}
NativeParseResult represents the result from a native parser.
Direct port from legacy pkg/context/extraction/binary_extractor.go
type NativeParser ¶
type NativeParser interface {
ParseDocument(ctx context.Context, filePath string, fileSize int64) (*NativeParseResult, error)
}
NativeParser interface for parsing binary documents.
Direct port from legacy pkg/context/extraction/binary_extractor.go
type NativeParserRegistry ¶
type NativeParserRegistry struct {
// contains filtered or unexported fields
}
NativeParserRegistry manages native document parsers for PDF, DOCX, XLSX.
Ported from legacy pkg/context/native_parsers.go
func NewNativeParserRegistry ¶
func NewNativeParserRegistry() *NativeParserRegistry
NewNativeParserRegistry creates a new native parser registry with built-in parsers.
func (*NativeParserRegistry) GetSupportedExtensions ¶
func (r *NativeParserRegistry) GetSupportedExtensions() []string
GetSupportedExtensions returns all supported file extensions.
func (*NativeParserRegistry) ParseDocument ¶
func (r *NativeParserRegistry) ParseDocument(ctx context.Context, filePath string, fileSize int64) (*NativeParseResult, error)
ParseDocument finds the appropriate parser and extracts content. Implements NativeParser interface.
type NilChunker ¶
type NilChunker struct{}
NilChunker returns the entire content as a single chunk.
func (NilChunker) Chunk ¶
func (NilChunker) Chunk(content string, ctx *ChunkContext) ([]Chunk, error)
func (NilChunker) Config ¶
func (NilChunker) Config() ChunkerConfig
func (NilChunker) Strategy ¶
func (NilChunker) Strategy() ChunkerStrategy
type NilDataSource ¶
type NilDataSource struct{}
NilDataSource is a no-op data source that returns no documents.
func (NilDataSource) Close ¶
func (NilDataSource) Close() error
func (NilDataSource) DiscoverDocuments ¶
func (NilDataSource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
func (NilDataSource) GetLastModified ¶
func (NilDataSource) GetLastModified(ctx context.Context, id string) (time.Time, error)
func (NilDataSource) ReadDocument ¶
func (NilDataSource) ReadDocument(ctx context.Context, id string) (*Document, error)
func (NilDataSource) SupportsIncrementalIndexing ¶
func (NilDataSource) SupportsIncrementalIndexing() bool
func (NilDataSource) Type ¶
func (NilDataSource) Type() string
type NilMultiQueryExpander ¶
type NilMultiQueryExpander struct{}
NilMultiQueryExpander returns the original query unchanged.
func (NilMultiQueryExpander) ExpandQuery ¶
ExpandQuery returns the original query unchanged.
type NilQueryExpander ¶
type NilQueryExpander struct{}
NilQueryExpander returns the original query unchanged.
type NilReranker ¶
type NilReranker struct{}
NilReranker returns results unchanged.
func (NilReranker) Rerank ¶
func (NilReranker) Rerank(ctx context.Context, query string, results []SearchResult) (*RerankResult, error)
type OverlappingChunker ¶
type OverlappingChunker struct {
// contains filtered or unexported fields
}
OverlappingChunker implements chunking with configurable overlap.
This is a direct port of legacy pkg/context/chunking/overlapping_chunker.go. Overlap helps preserve context at chunk boundaries, improving retrieval quality when relevant information spans two chunks.
Use when:
- Retrieval quality is important
- Content has flowing prose
- You can afford slightly more storage
func NewOverlappingChunker ¶
func NewOverlappingChunker(cfg ChunkerConfig) *OverlappingChunker
NewOverlappingChunker creates a new overlapping chunker.
func (*OverlappingChunker) Chunk ¶
func (c *OverlappingChunker) Chunk(content string, ctx *ChunkContext) ([]Chunk, error)
Chunk splits content into overlapping chunks. Direct port from legacy pkg/context/chunking/overlapping_chunker.go
func (*OverlappingChunker) Config ¶
func (c *OverlappingChunker) Config() ChunkerConfig
func (*OverlappingChunker) Strategy ¶
func (c *OverlappingChunker) Strategy() ChunkerStrategy
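A minimal usage sketch; cfg is a ChunkerConfig prepared earlier, and passing an empty ChunkContext is an assumption about what the chunker requires:
chunker := rag.NewOverlappingChunker(cfg)
chunks, err := chunker.Chunk(content, &rag.ChunkContext{}) // empty context: assumption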
type PaginationConfig ¶
type PaginationConfig struct {
Type string `yaml:"type"` // "offset", "cursor", "page", "link"
PageParam string `yaml:"page_param"` // Query parameter name for page/offset
SizeParam string `yaml:"size_param"` // Query parameter name for page size
MaxPages int `yaml:"max_pages"` // Maximum pages to fetch (0 = unlimited)
PageSize int `yaml:"page_size"` // Items per page
NextField string `yaml:"next_field"` // JSON field containing next page URL/cursor
DataField string `yaml:"data_field"` // JSON field containing array of items (if nested)
}
PaginationConfig defines how to handle paginated API responses.
Direct port from legacy pkg/context/indexing/api_source.go
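For example, a page-numbered API that nests items under a JSON field might be described like this (values are illustrative):
pagination := rag.PaginationConfig{
	Type:      "page",
	PageParam: "page",
	SizeParam: "per_page",
	PageSize:  100,
	MaxPages:  50,
	DataField: "items",
}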
type PatternCache ¶
type PatternCache struct {
// contains filtered or unexported fields
}
PatternCache provides fast pattern matching.
Direct port from legacy pkg/context/indexing/pattern_filter.go
type PatternFilter ¶
type PatternFilter struct {
// contains filtered or unexported fields
}
PatternFilter implements FileFilter using include/exclude patterns.
Direct port from legacy pkg/context/indexing/pattern_filter.go
func NewPatternFilter ¶
func NewPatternFilter(sourcePath string, includePatterns, excludePatterns []string) (*PatternFilter, error)
NewPatternFilter creates a new pattern-based filter with validation.
Direct port from legacy pkg/context/indexing/pattern_filter.go
func (*PatternFilter) ShouldExclude ¶
func (pf *PatternFilter) ShouldExclude(path string) bool
ShouldExclude checks if a file matches exclude patterns.
func (*PatternFilter) ShouldInclude ¶
func (pf *PatternFilter) ShouldInclude(path string) bool
ShouldInclude checks if a file matches include patterns.
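A usage sketch; the doublestar-style glob syntax shown here is an assumption about the supported pattern format:
filter, err := rag.NewPatternFilter("/repo", []string{"**/*.go"}, []string{"vendor/**"})
if err != nil {
	// invalid pattern
}
if filter.ShouldInclude("cmd/main.go") && !filter.ShouldExclude("cmd/main.go") {
	// index the file
}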
type ProgressStats ¶
type ProgressStats struct {
TotalFiles int64
ProcessedFiles int64
IndexedFiles int64
SkippedFiles int64
FailedFiles int64
DeletedFiles int64
CurrentFile string
ElapsedTime time.Duration
}
ProgressStats contains progress statistics.
type ProgressTracker ¶
type ProgressTracker struct {
// contains filtered or unexported fields
}
ProgressTracker tracks indexing progress with real-time statistics.
Direct port from legacy pkg/context/progress_tracker.go
func NewProgressTracker ¶
func NewProgressTracker(enabled bool, verbose bool) *ProgressTracker
NewProgressTracker creates a new progress tracker.
func (*ProgressTracker) GetExtractorStats ¶
func (pt *ProgressTracker) GetExtractorStats() map[string]int64
GetExtractorStats returns extractor usage statistics.
func (*ProgressTracker) GetStats ¶
func (pt *ProgressTracker) GetStats() ProgressStats
GetStats returns current statistics.
func (*ProgressTracker) IncrementDeleted ¶
func (pt *ProgressTracker) IncrementDeleted()
IncrementDeleted increments the deleted files counter.
func (*ProgressTracker) IncrementFailed ¶
func (pt *ProgressTracker) IncrementFailed()
IncrementFailed increments the failed files counter.
func (*ProgressTracker) IncrementIndexed ¶
func (pt *ProgressTracker) IncrementIndexed()
IncrementIndexed increments the indexed files counter.
func (*ProgressTracker) IncrementProcessed ¶
func (pt *ProgressTracker) IncrementProcessed()
IncrementProcessed increments the processed files counter.
func (*ProgressTracker) IncrementSkipped ¶
func (pt *ProgressTracker) IncrementSkipped()
IncrementSkipped increments the skipped files counter.
func (*ProgressTracker) RecordExtractorUsage ¶
func (pt *ProgressTracker) RecordExtractorUsage(extractorName string)
RecordExtractorUsage records which extractor was used for a document.
func (*ProgressTracker) SetCurrentFile ¶
func (pt *ProgressTracker) SetCurrentFile(filename string)
SetCurrentFile sets the currently processing file.
func (*ProgressTracker) SetTotalFiles ¶
func (pt *ProgressTracker) SetTotalFiles(total int64)
SetTotalFiles sets the total number of files to process.
func (*ProgressTracker) Start ¶
func (pt *ProgressTracker) Start()
Start begins the progress display loop.
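A typical indexing loop, sketched (files is a placeholder slice of paths):
pt := rag.NewProgressTracker(true, false)
pt.SetTotalFiles(int64(len(files)))
pt.Start()
for _, f := range files {
	pt.SetCurrentFile(f)
	// ... extract, chunk, and index f ...
	pt.IncrementProcessed()
	pt.IncrementIndexed()
}
stats := pt.GetStats()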
type QueryExpander ¶
type QueryExpander interface {
// Expand generates multiple query variations from the original query.
Expand(ctx context.Context, query string, numVariations int) ([]string, error)
}
QueryExpander expands a single query into multiple query variations.
Direct port from legacy pkg/context/query_expansion.go
type RankingDecision ¶
type RankingDecision struct {
// Index is the original result index.
Index int `json:"index"`
// Relevance is the LLM-assigned relevance score (1-10).
Relevance int `json:"relevance"`
// Reason explains why this ranking was assigned.
Reason string `json:"reason,omitempty"`
}
RankingDecision represents the LLM's ranking for a single result.
type RerankResult ¶
type RerankResult struct {
// Results are the reranked search results.
Results []SearchResult
// Rankings contains the LLM's ranking decisions.
Rankings []RankingDecision
}
RerankResult contains the reranked results and the LLM's ranking decisions.
type Reranker ¶
type Reranker struct {
// contains filtered or unexported fields
}
Reranker re-ranks search results using an LLM.
Reranking improves search quality by:
- Using deeper semantic understanding than vector similarity
- Evaluating actual relevance to the query
- Considering context that embeddings might miss
Trade-offs:
- Adds latency (100-500ms per search)
- Incurs LLM API costs
- Only practical for small result sets (10-20 items)
Derived from legacy pkg/context/reranking/reranker.go
func NewReranker ¶
NewReranker creates a new reranker.
func (*Reranker) Rerank ¶
func (r *Reranker) Rerank(ctx context.Context, query string, results []SearchResult) (*RerankResult, error)
Rerank re-orders results based on LLM assessment.
The process:
- Format results and query for the LLM
- Ask LLM to rank results by relevance
- Parse LLM response and reorder results
- Assign new scores based on ranking position
After reranking:
- Scores are position-based (1st=1.0, 2nd=0.95, etc.)
- Original vector similarity scores are replaced
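A usage sketch; r is an already-constructed *Reranker and coarseResults is a small result set from an initial vector search:
reranked, err := r.Rerank(ctx, "rotate credentials", coarseResults)
if err != nil {
	// fall back to coarseResults, which keep their vector scores
}
for _, d := range reranked.Rankings {
	// d.Index is the original position; d.Relevance is the 1-10 LLM score
	_ = d
}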
type RetryConfig ¶
type RetryConfig struct {
// MaxRetries is the maximum number of retry attempts (default: 3).
MaxRetries int
// BaseDelay is the initial delay between retries (default: 1s).
BaseDelay time.Duration
// MaxDelay is the maximum delay between retries (default: 30s).
MaxDelay time.Duration
// JitterFactor adds randomness to delays (0.0-1.0, default: 0.1).
JitterFactor float64
// RetryableErrors are error substrings that indicate retryable failures.
RetryableErrors []string
}
RetryConfig configures retry behavior.
Reuses patterns from httpclient for consistency.
func DefaultRetryConfig ¶
func DefaultRetryConfig() RetryConfig
DefaultRetryConfig returns sensible defaults for RAG operations.
type RetryError ¶
RetryError represents an error returned after retry attempts are exhausted.
func (*RetryError) Error ¶
func (e *RetryError) Error() string
func (*RetryError) Unwrap ¶
func (e *RetryError) Unwrap() error
type Retryer ¶
type Retryer struct {
// contains filtered or unexported fields
}
Retryer handles retry logic with exponential backoff.
Based on httpclient patterns but generalized for any operation.
func NewRetryer ¶
func NewRetryer(cfg RetryConfig) *Retryer
NewRetryer creates a new retryer with the given config.
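A sketch combining the retryer with the package-level DoWithResult helper; the embedder.Embed call stands in for any flaky operation and its signature is assumed:
retryer := rag.NewRetryer(rag.DefaultRetryConfig())
vec, err := rag.DoWithResult(ctx, retryer, "embed_chunk", func() ([]float32, error) {
	return embedder.Embed(ctx, "chunk text") // assumed embedder method
})
if rag.IsRetryExhausted(err) {
	// every attempt failed; vec holds the zero value
}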
type SQLSource ¶
type SQLSource struct {
// contains filtered or unexported fields
}
SQLSource implements DataSource for SQL databases using database/sql.
Direct port from legacy pkg/context/indexing/sql_source.go
func NewSQLSource ¶
func NewSQLSource(opts SQLSourceOptions) (*SQLSource, error)
NewSQLSource creates a new SQL data source.
Direct port from legacy pkg/context/indexing/sql_source.go
func (*SQLSource) Close ¶
func (s *SQLSource) Close() error
Close closes the underlying database connection. Note: in most cases the connection is managed externally (e.g., by DBPool), so this is a no-op; the caller should manage the lifecycle.
func (*SQLSource) DiscoverDocuments ¶
func (s *SQLSource) DiscoverDocuments(ctx context.Context) (<-chan Document, <-chan error)
DiscoverDocuments returns channels of discovered documents and errors.
Direct port from legacy pkg/context/indexing/sql_source.go
func (*SQLSource) GetLastModified ¶
func (s *SQLSource) GetLastModified(ctx context.Context, id string) (time.Time, error)
GetLastModified returns the last modification time for a document.
func (*SQLSource) ReadDocument ¶
func (s *SQLSource) ReadDocument(ctx context.Context, id string) (*Document, error)
ReadDocument retrieves a specific document by its ID.
Direct port from legacy pkg/context/indexing/sql_source.go
func (*SQLSource) SupportsIncrementalIndexing ¶
func (s *SQLSource) SupportsIncrementalIndexing() bool
SupportsIncrementalIndexing returns true if UpdatedColumn is configured.
type SQLSourceOptions ¶
type SQLSourceOptions struct {
DB *sql.DB
Driver string
Tables []SQLTableConfig
MaxRows int
}
SQLSourceOptions configures the SQL source.
type SQLTableConfig ¶
type SQLTableConfig struct {
Table string `yaml:"table"`
Columns []string `yaml:"columns"` // Columns to concatenate for content
IDColumn string `yaml:"id_column"` // Primary key or unique identifier
UpdatedColumn string `yaml:"updated_column"` // Column for tracking updates (e.g., updated_at)
WhereClause string `yaml:"where_clause"` // Optional WHERE clause for filtering
MetadataColumns []string `yaml:"metadata_columns"` // Columns to include as metadata
}
SQLTableConfig defines which tables and columns to index.
Direct port from legacy pkg/context/indexing/sql_source.go
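A configuration sketch with illustrative table and column names (db is an open *sql.DB):
src, err := rag.NewSQLSource(rag.SQLSourceOptions{
	DB:     db,
	Driver: "postgres",
	Tables: []rag.SQLTableConfig{{
		Table:           "articles",
		Columns:         []string{"title", "body"},
		IDColumn:        "id",
		UpdatedColumn:   "updated_at",
		MetadataColumns: []string{"author"},
	}},
	MaxRows: 10000,
})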
type SearchEngine ¶
type SearchEngine struct {
// contains filtered or unexported fields
}
SearchEngine provides document indexing and semantic search.
It combines:
- Document ingestion with chunking
- Vector similarity search
- Optional hybrid search (vector + keyword)
- Optional query enhancement (HyDE, multi-query)
- Optional reranking
Derived from legacy pkg/context/search.go:SearchEngine
func NewSearchEngine ¶
func NewSearchEngine(cfg SearchEngineConfig) (*SearchEngine, error)
NewSearchEngine creates a new search engine.
func NewSearchEngineFromConfig ¶
func NewSearchEngineFromConfig(storeCfg *config.DocumentStoreConfig, deps *FactoryDeps, collectionName string) (*SearchEngine, error)
NewSearchEngineFromConfig creates a search engine from configuration. collectionName is used as the default if storeCfg.Collection is empty.
func (*SearchEngine) Clear ¶
func (e *SearchEngine) Clear(ctx context.Context) error
Clear removes all documents from the index.
func (*SearchEngine) Collection ¶
func (e *SearchEngine) Collection() string
Collection returns the collection name.
func (*SearchEngine) DeleteByFilter ¶
DeleteByFilter removes documents matching the filter.
Direct port from legacy pkg/context/search.go
func (*SearchEngine) DeleteDocument ¶
func (e *SearchEngine) DeleteDocument(ctx context.Context, documentID string) error
DeleteDocument removes a document and all its chunks from the index.
func (*SearchEngine) HealthCheck ¶
func (e *SearchEngine) HealthCheck(ctx context.Context) HealthCheck
HealthCheck reports the health of the SearchEngine.
func (*SearchEngine) IngestDocument ¶
func (e *SearchEngine) IngestDocument(ctx context.Context, doc Document) error
IngestDocument indexes a document for search.
The document is:
- Split into chunks using the configured chunker
- Each chunk is embedded
- Chunks are stored in the vector database
Document ID should be stable across re-indexing to enable updates.
func (*SearchEngine) IngestDocuments ¶
func (e *SearchEngine) IngestDocuments(ctx context.Context, docs []Document) error
IngestDocuments indexes multiple documents concurrently.
func (*SearchEngine) Search ¶
func (e *SearchEngine) Search(ctx context.Context, req SearchRequest) (*SearchResponse, error)
Search finds documents matching the query.
func (*SearchEngine) Status ¶
func (e *SearchEngine) Status() map[string]any
Status returns the current status of the search engine.
Direct port from legacy pkg/context/search.go (GetStatus)
type SearchEngineConfig ¶
type SearchEngineConfig struct {
// Provider for vector storage and search (required).
Provider vector.Provider
// Embedder for generating embeddings (required).
Embedder embedder.Embedder
// Chunker for splitting documents (optional, defaults to simple).
Chunker Chunker
// Collection name for storing documents (optional, defaults to "rag_documents").
Collection string
// DefaultTopK is the default number of results (default: 10).
DefaultTopK int
// DefaultThreshold filters results below this score (default: 0.0).
DefaultThreshold float32
// HyDE for hypothetical document embedding (optional).
HyDE *HyDE
// Reranker for LLM-based result reranking (optional).
Reranker *Reranker
// MultiQuery for query expansion (optional).
MultiQuery *MultiQueryExpander
}
SearchEngineConfig configures the search engine.
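Beyond the required Provider and Embedder, the optional enhancers plug in as fields. A sketch, assuming reranker and expander were constructed earlier:
engine, err := rag.NewSearchEngine(rag.SearchEngineConfig{
	Provider:    vectorProvider,
	Embedder:    embedder,
	Collection:  "kb_docs",
	DefaultTopK: 5,
	Reranker:    reranker,
	MultiQuery:  expander,
})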
type SearchError ¶
type SearchError struct {
Component string // Component that failed (e.g., "embedder", "vector_db", "reranker")
Operation string // Operation that failed
Message string // Error message
Query string // Query that caused the error
Err error // Underlying error
}
SearchError represents an error during search operations.
Inspired by legacy pkg/context error handling
func NewSearchError ¶
func NewSearchError(component, operation, message, query string, err error) *SearchError
NewSearchError creates a new SearchError.
func (*SearchError) Error ¶
func (e *SearchError) Error() string
Error implements the error interface.
func (*SearchError) Unwrap ¶
func (e *SearchError) Unwrap() error
Unwrap returns the underlying error.
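A sketch for inspecting which component failed, using the standard errors package:
var serr *rag.SearchError
if errors.As(err, &serr) {
	log.Printf("search failed: component=%s op=%s query=%q: %v",
		serr.Component, serr.Operation, serr.Query, serr.Err)
}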
type SearchMetrics ¶
type SearchMetrics struct {
// contains filtered or unexported fields
}
SearchMetrics tracks search engine metrics.
func NewSearchMetrics ¶
func NewSearchMetrics(engineName string) *SearchMetrics
NewSearchMetrics creates a new search metrics tracker.
func (*SearchMetrics) RecordSearch ¶
func (m *SearchMetrics) RecordSearch(latency time.Duration, resultCount int, opts *SearchOptions)
RecordSearch records a search operation.
func (*SearchMetrics) Snapshot ¶
func (m *SearchMetrics) Snapshot() SearchMetricsSnapshot
Snapshot returns a point-in-time copy of search metrics.
type SearchMetricsSnapshot ¶
type SearchMetricsSnapshot struct {
EngineName string `json:"engine_name"`
TotalSearches int64 `json:"total_searches"`
SuccessfulHits int64 `json:"successful_hits"`
EmptyResults int64 `json:"empty_results"`
AvgLatency time.Duration `json:"avg_latency_ns"`
MaxLatency time.Duration `json:"max_latency_ns"`
MinLatency time.Duration `json:"min_latency_ns"`
HyDEUsage int64 `json:"hyde_usage"`
RerankUsage int64 `json:"rerank_usage"`
MultiQueryUsage int64 `json:"multi_query_usage"`
}
SearchMetricsSnapshot is a point-in-time copy of search metrics.
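A minimal sketch; passing nil options is assumed to record a plain vector search:
m := rag.NewSearchMetrics("kb")
m.RecordSearch(42*time.Millisecond, 7, nil)
snap := m.Snapshot()
// snap.TotalSearches == 1; snap.AvgLatency == 42ms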
type SearchOptions ¶
type SearchOptions struct {
// Mode specifies the search mode: "vector", "keyword", "hybrid".
Mode string `json:"mode,omitempty"`
// EnableHyDE enables Hypothetical Document Embeddings.
EnableHyDE bool `json:"enable_hyde,omitempty"`
// EnableRerank enables LLM-based reranking.
EnableRerank bool `json:"enable_rerank,omitempty"`
// EnableMultiQuery enables query expansion.
EnableMultiQuery bool `json:"enable_multi_query,omitempty"`
// NumQueries is the number of query variants for multi-query.
NumQueries int `json:"num_queries,omitempty"`
}
SearchOptions configures search behavior.
type SearchRequest ¶
type SearchRequest struct {
// Query is the search query text.
Query string `json:"query"`
// Collection scopes the search to a specific collection.
Collection string `json:"collection,omitempty"`
// TopK is the maximum number of results to return.
TopK int `json:"top_k,omitempty"`
// Threshold filters results below this score.
Threshold float32 `json:"threshold,omitempty"`
// Filter applies metadata filtering.
Filter map[string]any `json:"filter,omitempty"`
// Options contains search-specific options.
Options *SearchOptions `json:"options,omitempty"`
}
SearchRequest represents a search query.
func (*SearchRequest) SetDefaults ¶
func (r *SearchRequest) SetDefaults()
SetDefaults applies default values to SearchRequest.
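A request sketch; the metadata filter key is illustrative:
resp, err := engine.Search(ctx, rag.SearchRequest{
	Query:     "how are credentials rotated?",
	TopK:      5,
	Threshold: 0.2,
	Filter:    map[string]any{"source": "handbook"},
	Options:   &rag.SearchOptions{Mode: "hybrid", EnableRerank: true},
})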
type SearchResponse ¶
type SearchResponse struct {
// Results contains the matched documents/chunks.
Results []SearchResult `json:"results"`
// TotalMatches is the total number of matches (before limit).
TotalMatches int `json:"total_matches,omitempty"`
// SearchTimeMs is the search duration in milliseconds.
SearchTimeMs int64 `json:"search_time_ms,omitempty"`
// QueryExpansions contains expanded queries (if multi-query enabled).
QueryExpansions []string `json:"query_expansions,omitempty"`
}
SearchResponse contains search results.
type SearchResult ¶
type SearchResult struct {
// ID is the chunk/document identifier.
ID string `json:"id"`
// Content is the matched content.
Content string `json:"content"`
// Score represents relevance (higher is better).
Score float32 `json:"score"`
// DocumentID is the parent document identifier.
DocumentID string `json:"document_id,omitempty"`
// ChunkIndex is the chunk position within the document.
ChunkIndex int `json:"chunk_index,omitempty"`
// Metadata contains additional result information.
Metadata map[string]any `json:"metadata,omitempty"`
// Highlights contains matched text spans (optional).
Highlights []string `json:"highlights,omitempty"`
}
SearchResult represents a single search result.
Results are ordered by Score (highest first). The Score semantics depend on whether reranking was applied:
- Without reranking: vector similarity (0.0 to 1.0)
- With reranking: LLM-determined position score
func CombineResults ¶
func CombineResults(resultSets [][]SearchResult) []SearchResult
CombineResults merges results from multiple queries.
Deduplicates by document ID and keeps the highest score for each.
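For example, merging the result sets from two query variants:
combined := rag.CombineResults([][]rag.SearchResult{respA.Results, respB.Results})
// combined is deduplicated by document ID, keeping each document's best score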
type SemanticChunker ¶
type SemanticChunker struct {
// contains filtered or unexported fields
}
SemanticChunker implements AST-aware chunking that respects code structure.
This is a direct port of legacy pkg/context/chunking/semantic_chunker.go. It attempts to keep functions and types together when possible, using metadata to identify semantic boundaries.
Use when:
- Chunking code files
- Retrieval quality is paramount
- Variable chunk sizes are acceptable
func NewSemanticChunker ¶
func NewSemanticChunker(cfg ChunkerConfig) *SemanticChunker
NewSemanticChunker creates a new semantic chunker.
func (*SemanticChunker) Chunk ¶
func (c *SemanticChunker) Chunk(content string, ctx *ChunkContext) ([]Chunk, error)
Chunk splits content into semantically meaningful chunks. It uses metadata to identify function and type boundaries. Direct port from legacy pkg/context/chunking/semantic_chunker.go
func (*SemanticChunker) Config ¶
func (c *SemanticChunker) Config() ChunkerConfig
func (*SemanticChunker) Strategy ¶
func (c *SemanticChunker) Strategy() ChunkerStrategy
type SimpleChunker ¶
type SimpleChunker struct {
// contains filtered or unexported fields
}
SimpleChunker implements basic line-based chunking.
This is a direct port of legacy pkg/context/chunking/simple_chunker.go. It splits content by lines first, then groups lines into chunks of the configured size. This ensures chunks never split mid-line.
Use when:
- Speed is critical
- Content has uniform structure
- Line boundaries should be preserved
func NewSimpleChunker ¶
func NewSimpleChunker(cfg ChunkerConfig) *SimpleChunker
NewSimpleChunker creates a new simple chunker.
func (*SimpleChunker) Chunk ¶
func (c *SimpleChunker) Chunk(content string, ctx *ChunkContext) ([]Chunk, error)
Chunk splits content into chunks based on line count. Direct port from legacy pkg/context/chunking/simple_chunker.go
func (*SimpleChunker) Config ¶
func (c *SimpleChunker) Config() ChunkerConfig
func (*SimpleChunker) Strategy ¶
func (c *SimpleChunker) Strategy() ChunkerStrategy
type SourceDocument ¶
type SourceDocument struct {
// ID is a unique identifier for the document (format depends on source type)
ID string
// Content is the text content to be indexed.
// For file sources, this should be populated by reading the file.
// For SQL/API sources, this is populated during discovery.
Content string
// Metadata contains source-specific metadata (file path, table name, API endpoint, etc.)
Metadata map[string]interface{}
// LastModified is the last modification time, if available
LastModified time.Time
// Size is the size of the document in bytes (approximate for non-file sources)
Size int64
// ShouldIndex indicates whether this document should be indexed (after filtering)
ShouldIndex bool
// SourcePath is the original source path (file path, table name, API endpoint, etc.)
// This is used for relative path calculations and display purposes
SourcePath string
}
SourceDocument represents a document from any source (file, SQL row, API response, etc.)
Direct port from legacy pkg/context/indexing/data_source.go:Document. Renamed to SourceDocument to avoid a conflict with the rag.Document type.
type TextExtractor ¶
type TextExtractor struct{}
TextExtractor handles plain text files.
Direct port from legacy pkg/context/extraction/text_extractor.go
func NewTextExtractor ¶
func NewTextExtractor() *TextExtractor
NewTextExtractor creates a new text extractor.
func (*TextExtractor) CanExtract ¶
func (te *TextExtractor) CanExtract(path string, mimeType string) bool
CanExtract checks if this is a text file.
func (*TextExtractor) Extract ¶
func (te *TextExtractor) Extract(ctx context.Context, path string, fileSize int64) (*ExtractedContent, error)
Extract reads and cleans text content.
func (*TextExtractor) Name ¶
func (te *TextExtractor) Name() string
Name returns the extractor name.
func (*TextExtractor) Priority ¶
func (te *TextExtractor) Priority() int
Priority returns lower priority (1) so specific extractors can override.
type Tool ¶
type Tool interface {
GetInfo() ToolInfo
Execute(ctx context.Context, args map[string]interface{}) (ToolResult, error)
}
Tool is a minimal interface for executing tools.
Direct port from legacy pkg/context/extraction/mcp_extractor.go
type ToolCaller ¶
ToolCaller is a minimal interface for calling tools without creating import cycles. This allows MCP extractors to work with any tool registry implementation.
Direct port from legacy pkg/context/extraction/mcp_extractor.go
type ToolInfo ¶
type ToolInfo struct {
Name string
Description string
Parameters []ToolParameter
}
ToolInfo contains information about a tool.
Direct port from legacy pkg/context/extraction/mcp_extractor.go
type ToolParameter ¶
ToolParameter describes a tool parameter.
Direct port from legacy pkg/context/extraction/mcp_extractor.go
type ToolResult ¶
ToolResult contains the result of tool execution.
Direct port from legacy pkg/context/extraction/mcp_extractor.go
type TypeInfo ¶
type TypeInfo struct {
Name string `json:"name"`
Kind string `json:"kind"` // "struct", "interface", "alias", etc.
StartLine int `json:"start_line"`
EndLine int `json:"end_line"`
Fields []string `json:"fields,omitempty"`
Methods []string `json:"methods,omitempty"`
IsExported bool `json:"is_exported,omitempty"`
DocComment string `json:"doc_comment,omitempty"`
}
TypeInfo contains information about a type (struct, interface, etc.).
Direct port from legacy pkg/context/metadata/extractor.go
Source Files ¶
- api_source.go
- binary_extractor.go
- blob_source.go
- checkpoint.go
- chunk.go
- chunker.go
- chunker_simple.go
- collection_source.go
- data_source.go
- directory_source.go
- errors.go
- extractor.go
- factory.go
- health.go
- hyde.go
- mcp_extractor.go
- metadata.go
- metadata_go.go
- metrics.go
- multiquery.go
- native_parsers.go
- pattern_filter.go
- progress_tracker.go
- query_expansion.go
- reranker.go
- retry.go
- sanitize.go
- search.go
- sql_source.go
- store.go
- util.go
- watcher.go