vectordb

package
v0.2.0-beta.4
Published: May 3, 2026 License: MIT Imports: 17 Imported by: 0

Documentation

Overview

Package vectordb provides an embedded TF-IDF / cosine-similarity vector index for semantic log search. It is a pure-Go, no-CGO, in-process accelerator. The relational DB remains the source of truth; this index is fully rebuildable.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func EncodeSnapshot

func EncodeSnapshot(w io.Writer, snap Snapshot) error

EncodeSnapshot writes a versioned, CRC32-protected snapshot to w.

Wire format (big-endian for portability):

bytes[0:4]   magic       "VDB1"
bytes[4:8]   version     uint32
bytes[8:12]  CRC32-IEEE  uint32 (over bytes[12:])
bytes[12:]   gob payload Snapshot

Types

type Index

type Index struct {
	// contains filtered or unexported fields
}

Index is a thread-safe in-memory TF-IDF vector index for log bodies. Only ERROR and WARN logs are indexed to keep it small and relevant.

lastIndexedID records the highest Log.ID Add() has accepted. Persisted in the snapshot so a startup tail-replay can pick up DB rows newer than this watermark without re-indexing rows already in the snapshot. Tracked only for rows that pass shouldIndex(); INFO/DEBUG rows interleaved in the same ID range are excluded by the severity filter on replay anyway.

func New

func New(maxSize int) *Index

New creates a new Index with the given maximum entry cap.

func (*Index) Add

func (idx *Index) Add(logID uint, tenant, serviceName, severity, body string)

Add adds a log to the index. Thread-safe. Tenant is recorded with the document so Search can filter by it; an empty tenant collapses to the platform default at the boundary, matching storage.TenantFromContext.

func (*Index) LastIndexedID

func (idx *Index) LastIndexedID() uint

LastIndexedID returns the highest Log.ID that has been successfully indexed (i.e. passed shouldIndex + tokenize gates and was appended to docs). Used by the startup tail-replay path to query DB rows newer than this watermark; persisted in the snapshot so replay survives restarts.

func (*Index) LoadSnapshot

func (idx *Index) LoadSnapshot(path string) error

LoadSnapshot reads a snapshot from path and replaces the Index's state.

Caller must ensure no concurrent Add()/Search() is in flight — this is the typical startup wiring (fresh Index, before ingest accept). Errors are returned as-is so the caller can distinguish os.IsNotExist (no previous snapshot — first start) from corruption/format errors (log warn + proceed with full DB rebuild).

On error the Index state is left untouched.

func (*Index) ReplayFromDB

func (idx *Index) ReplayFromDB(ctx context.Context, src ReplaySource) (int, error)

ReplayFromDB walks ReplaySource pages starting from LastIndexedID() and feeds each row through Add(). Returns the count of rows processed. Add filters by severity, so processed can exceed indexed when a source loosens its filter; the standard storage implementation already pre-filters to the ERROR/WARN family, so the two counts match in practice.

Termination contract: the source signals end-of-data by returning a zero-length slice. This lets sources page however they want without having to fill every page exactly to replayPageSize — the trade-off is one extra round-trip at the tail (fine for a one-shot startup call).

Caller passes a derived ctx so SIGTERM during boot cancels the replay cleanly. On any source error, returns the partial count + error so the caller can log and proceed with a partially-warm index.

func (*Index) Search

func (idx *Index) Search(tenant, query string, k int) []SearchResult

Search finds the top-k logs most similar to the query string within tenant. Documents from other tenants are excluded — the IDF table stays global so rarity is computed against the whole corpus, but result rows are filtered to the caller's tenant.

func (*Index) SetSnapshotObserver

func (idx *Index) SetSnapshotObserver(fn func(result string, duration time.Duration, size int64))

SetSnapshotObserver registers a callback invoked at the end of each WriteSnapshot. result is "success" or "failure"; size is the on-disk size of the latest written snapshot (0 on failure).

Set from the wiring layer (main.go) so vectordb stays free of telemetry imports. Safe to call before SnapshotLoop starts.

func (*Index) Size

func (idx *Index) Size() int

Size returns the current number of indexed documents.

func (*Index) SnapshotLoop

func (idx *Index) SnapshotLoop(ctx context.Context, path string, interval time.Duration)

SnapshotLoop writes a snapshot to path on every interval tick until ctx is done. On context cancel, fires one final WriteSnapshot before returning so graceful shutdowns capture the maximum in-memory state.

Transient write failures (disk full, fsync errors, EXDEV warnings) are logged via slog but do not break the loop — vectordb is a rebuildable accelerator, and silently dropping a tick beats taking the daemon down.

Safe to call with empty path / zero interval — both disable the loop and return immediately.

func (*Index) WriteSnapshot

func (idx *Index) WriteSnapshot(path string) error

WriteSnapshot serializes the current Index state to path atomically.

Safe to call concurrently with Add()/Search(): the docs slice and IDF map are copied under the index lock and serialization runs lock-free after release. The critical section is sub-millisecond at the 100k cap because the copy is shallow: each element copies only the LogVector header (strings and maps are shared by reference), and Add() never mutates an existing LogVector.Vec — it only appends new documents.

type LogVector

type LogVector struct {
	LogID       uint
	Tenant      string
	ServiceName string
	Severity    string
	Body        string
	Vec         map[string]float64 // sparse term-frequency vector; IDF is applied at query time
}

LogVector represents an indexed log entry.

Tenant scopes the document so Search can return only the caller's tenant rows. The IDF table is shared across tenants (global IDF still gives the right rarity signal), but the per-document tenant tag is enforced at query time so two tenants with overlapping log bodies stay isolated.

All fields are exported so encoding/gob can serialize the type for snapshot persistence (snapshot.go). Vec is the per-doc TF map (term → frequency); IDF is held separately on the Index to avoid duplicating rarity weights across documents.

type ReplayRow

type ReplayRow struct {
	ID          uint
	Tenant      string
	ServiceName string
	Severity    string
	Body        string
}

ReplayRow is the minimum field set Add() needs. Mirrors the projection a storage adapter performs at the boundary.

type ReplaySource

type ReplaySource interface {
	LogsForVectorReplay(ctx context.Context, sinceID uint, limit int) ([]ReplayRow, error)
}

ReplaySource is the minimal contract a backing store fulfills to hydrate this Index on startup. Pages are pulled in ID-ascending order; the source signals end-of-data by returning a zero-length slice (a short but non-empty page is not a termination signal). ReplayFromDB walks pages starting from LastIndexedID() until the source returns no more rows.

Vectordb intentionally does NOT import the storage package — keeping it as a leaf accelerator means tests can wire any in-memory source without a SQLite dependency, and storage is free to evolve its row type without breaking vectordb. The wiring layer (cmd/main.go) is responsible for projecting storage.Log into ReplayRow.

type SearchResult

type SearchResult struct {
	LogID       uint
	Tenant      string
	ServiceName string
	Severity    string
	Body        string
	Score       float64 // cosine similarity 0.0–1.0
}

SearchResult is a single similarity hit.

type Snapshot

type Snapshot struct {
	LastIndexedID uint
	MaxSize       int
	Docs          []LogVector
	IDF           map[string]float64
	WrittenAt     int64 // unix seconds, observability only
}

Snapshot is the persisted state of an Index.

Only the fields needed to reconstruct an equivalent Index are captured — transient state (mu, dirty) is intentionally absent. LastIndexedID is the high watermark of indexed Log.IDs so a startup tail-replay can query DB rows newer than the snapshot without double-indexing rows already in Docs.

Field changes break the format — bump snapshotVersion when the wire shape changes. Old snapshots whose magic+version don't match are rejected on load and the caller falls back to a full DB rebuild.

func DecodeSnapshot

func DecodeSnapshot(r io.Reader) (Snapshot, error)

DecodeSnapshot reads + validates a snapshot from r.

All errors are caller-visible. The expected handling is: log a warning and proceed with a full DB rebuild — never silently load partial state. Errors include short header, wrong magic, unsupported version, CRC mismatch, and gob decode failure.
