package index

v3.0.0
Published: May 6, 2026 License: MIT Imports: 20 Imported by: 0

Documentation

Overview

Package index orchestrates atomic Stroma index rebuilds and searches.

Constants

const (
	ArmVector = "vector"
	ArmFTS    = "fts"
)

Arm name constants used by the default Snapshot.Search pipeline. Custom FusionStrategy implementations may introduce additional arm names.

const DefaultMaxChunkSections = 10_000

DefaultMaxChunkSections caps the number of heading-aware sections a single record can contribute to the index when the caller hasn't overridden it. 10,000 is generous for legitimate technical documents (few real specs exceed a few hundred headings) while still preventing a pathological or hostile body from expanding into millions of embedder calls + rows.

const DefaultSearchLimit = 10

DefaultSearchLimit is the hit cap applied to Snapshot.Search and Snapshot.SearchVector when SearchParams.Limit / VectorSearchQuery.Limit is zero or negative. The choice is conservative; pick an explicit Limit if throughput matters or if the caller needs a stable shortlist size across snapshots.

const MaxSearchLimit = 250

MaxSearchLimit is the largest accepted SearchParams.Limit or VectorSearchQuery.Limit. Search uses bounded in-memory shortlists for vector/FTS fusion and reranking; callers needing more than this should page or shard at a higher layer rather than relying on an unbounded single-query scan.
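Taken together, the two limit constants define a resolve-or-reject rule: zero or negative selects the default, values above the cap reject rather than silently clamp. A self-contained sketch of that rule, restating the constants locally (the real validation lives inside the search entry points, not in this helper):

```go
package main

import "fmt"

// resolveLimit restates the documented limit rules with local copies
// of DefaultSearchLimit and MaxSearchLimit: zero or negative selects
// the default, and values above the cap reject rather than clamp.
func resolveLimit(limit int) (int, error) {
	const defaultSearchLimit, maxSearchLimit = 10, 250
	switch {
	case limit <= 0:
		return defaultSearchLimit, nil
	case limit > maxSearchLimit:
		return 0, fmt.Errorf("limit %d exceeds MaxSearchLimit %d", limit, maxSearchLimit)
	}
	return limit, nil
}

func main() {
	fmt.Println(resolveLimit(0))   // default applied
	fmt.Println(resolveLimit(25))  // accepted as-is
	fmt.Println(resolveLimit(500)) // rejected, not silently capped
}
```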

Variables

var ErrLexicalSearchUnavailable = errors.New("lexical search unavailable")

ErrLexicalSearchUnavailable signals that a lexical-only search was requested against a legacy snapshot that does not carry the FTS5 table.

var ErrSourceRemovalsDisabled = errors.New("source update would remove records")

ErrSourceRemovalsDisabled signals that UpdateFromSource saw stored records missing from the supplied source while implicit removals were disabled. Callers that intend full desired-corpus synchronization should use SyncFromSource or set UpdateOptions.AllowSourceRemovals explicitly.

var ErrStaleUpdatePlan = errors.New("index changed while planning update")

ErrStaleUpdatePlan signals that Update planned added records against one committed snapshot, but the snapshot content changed before the write transaction applied those plans. Callers can retry the Update so chunk reuse and embeddings are recomputed against the new base snapshot.

var ErrStopWalk = errors.New("stop snapshot walk")

ErrStopWalk may be returned by a WalkRecords or WalkSections callback to stop iteration successfully. The walk method closes its SQLite cursor and returns nil. Other callback errors are wrapped in the returned error; callers should use errors.Is or errors.As when matching them.
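The stop-walk contract can be sketched in a few lines. Everything below is a local illustration, not the package's internals: errStopWalk stands in for index.ErrStopWalk, and walkRecords replaces the SQLite cursor with an in-memory slice.

```go
package main

import (
	"errors"
	"fmt"
)

// errStopWalk is a local stand-in for index.ErrStopWalk: returning the
// sentinel from a callback ends iteration successfully, while any other
// callback error is wrapped and surfaced to the caller.
var errStopWalk = errors.New("stop snapshot walk")

// walkRecords is a simplified sketch of the WalkRecords loop over an
// in-memory slice instead of a SQLite cursor.
func walkRecords(refs []string, fn func(ref string) error) error {
	for _, ref := range refs {
		if err := fn(ref); err != nil {
			if errors.Is(err, errStopWalk) {
				return nil // stop requested: close cursor, succeed
			}
			return fmt.Errorf("walk callback: %w", err)
		}
	}
	return nil
}

func main() {
	var seen []string
	err := walkRecords([]string{"a", "b", "c"}, func(ref string) error {
		seen = append(seen, ref)
		if ref == "b" {
			return errStopWalk // stop after the second record
		}
		return nil
	})
	fmt.Println(err == nil, seen)
}
```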

var ErrUnsupportedSchemaVersion = errors.New("unsupported snapshot schema version")

ErrUnsupportedSchemaVersion is returned when an operation encounters a snapshot whose schema_version is neither the current schema nor one the library knows how to migrate from. It is surfaced by OpenSnapshot and wrapped via fmt.Errorf with %w so callers can use errors.Is to detect it.

var ErrUpdateCommittedIntegrityCheckFailed = errors.New("update committed but post-commit integrity check failed")

ErrUpdateCommittedIntegrityCheckFailed signals that Update's transaction committed successfully — the record, chunk, and metadata changes are durable on disk — but the post-commit PRAGMA integrity_check / foreign_key_check reported corruption. The enclosing error wraps this sentinel via fmt.Errorf with %w so callers can use errors.Is to detect it. This case is non-retriable: re-running Update will not unroll the already-durable changes, and the underlying file likely needs operator inspection (see index/ARCHITECTURE.md). Contrast with plain errors returned by Update, which come from pre-commit failures and leave the file byte-identical to its pre-call state.

var ErrUpdatePlanTooLarge = errors.New("update plan exceeds MaxPlannedRecords")

ErrUpdatePlanTooLarge signals that UpdateOptions.MaxPlannedRecords rejected the added-record set before Update opened the write transaction. Callers can split added records into smaller Update calls and retry.

Functions

This section is empty.

Types

type ArmEvidence

type ArmEvidence struct {
	// Rank is the zero-based position of the hit within the arm.
	Rank int
	// Score is the arm-native score at the time the arm returned the hit
	// (cosine derivative for vector, negative bm25 for FTS).
	Score float64
}

ArmEvidence is one arm's contribution to a fused hit.

type BuildOptions

type BuildOptions struct {
	// Path is the OS-native filesystem path where the built snapshot
	// is written. On Windows both forward and back slashes are
	// accepted — the store package normalizes drive prefixes on open.
	Path string

	// ReuseFromPath points at an existing Stroma snapshot whose embeddings
	// should be reused at the section level: a new section reuses its
	// stored embedding whenever its title, heading, and body match a
	// section already present in the prior snapshot. Records that are
	// fully unchanged are the maximal case, but sections carried over
	// from an edited record still reuse their embeddings. The snapshot is
	// opened read-only and queried per-record during the rebuild, so
	// resident memory scales with a single record's chunks rather than
	// with the whole corpus. Leave empty to disable reuse.
	ReuseFromPath string

	Embedder embed.Embedder

	// Contextualizer optionally produces a per-chunk prefix string that
	// gets prepended before the embedding text and the FTS5 content. When
	// set, the prefix persists on the chunk and participates in reuse
	// keying so a changed contextualizer invalidates stale reuse without
	// corrupting the stored representation. Nil disables contextualization
	// and leaves the build identical to the non-contextual path.
	Contextualizer ChunkContextualizer

	// MaxChunkTokens sets the approximate maximum number of tokens (words)
	// per chunk. Sections that exceed this limit are split into smaller
	// sub-sections. Zero disables token-budget splitting.
	MaxChunkTokens int

	// ChunkOverlapTokens sets the approximate number of overlapping tokens
	// between adjacent sub-sections when a section is split. Zero disables
	// overlap.
	ChunkOverlapTokens int

	// MaxChunkSections caps how many sections any single record is allowed
	// to produce. A pathological Markdown body (e.g., 10^6 heading lines)
	// would otherwise translate into 10^6 embedder calls and 10^6
	// chunk/vector rows — a DoS vector for shared embedders. Zero means
	// DefaultMaxChunkSections; a negative value disables the cap for
	// callers who have their own upstream validation. When the cap is
	// exceeded, Rebuild returns an error wrapping chunk.ErrTooManySections
	// instead of silently admitting the record.
	MaxChunkSections int

	// Quantization controls the vector storage format. See the
	// store.Quantization* constants for the accept-listed values:
	// store.QuantizationFloat32 (default), store.QuantizationInt8 (4x
	// smaller, minor precision loss), and store.QuantizationBinary
	// (32x smaller via 1-bit sign packing, full-precision rescore on a
	// companion table preserves ranking).
	Quantization string

	// ChunkPolicy selects the chunking strategy. Nil defaults to
	// chunk.MarkdownPolicy{Options: chunk.Options{
	//     MaxTokens: MaxChunkTokens,
	//     OverlapTokens: ChunkOverlapTokens,
	//     MaxSections: <resolved>,
	// }}, which reproduces the pre-1.0 chunking pipeline exactly.
	// Setting a non-nil policy overrides the per-Build chunking shape:
	// MaxChunkTokens/ChunkOverlapTokens/MaxChunkSections are read by
	// the default MarkdownPolicy but ignored when ChunkPolicy is set
	// (the policy carries its own configuration). Hierarchical
	// policies like chunk.LateChunkPolicy emit parent + leaf chunks
	// linked via parent_chunk_id; ExpandContext can surface the
	// parent on demand. See docs/superpowers/specs for the design.
	ChunkPolicy chunk.Policy

	// IntegrityMode selects post-commit validation depth (#107). The
	// zero value, IntegrityModeFast, skips the SQLite-level whole-DB
	// PRAGMA passes that scale with total snapshot size; Stroma-specific
	// completeness checks still run. IntegrityModeFull adds the deep
	// PRAGMA passes for callers that want to validate the freshly built
	// snapshot against on-disk corruption before promoting it.
	IntegrityMode IntegrityMode
}

BuildOptions controls how a Stroma index is rebuilt.

type BuildResult

type BuildResult struct {
	Path                string
	RecordCount         int
	ChunkCount          int
	ReusedRecordCount   int
	ReusedChunkCount    int
	EmbeddedChunkCount  int
	ReuseStatus         ReuseStatus
	ReuseDisabledReason string
	EmbedderDimension   int
	EmbedderFingerprint string
	ContentFingerprint  string
}

BuildResult summarizes a completed rebuild.

func Rebuild

func Rebuild(ctx context.Context, records []corpus.Record, options BuildOptions) (*BuildResult, error)

Rebuild atomically recreates the index at the requested path.

func RebuildFromSource

func RebuildFromSource(ctx context.Context, source RecordSource, options BuildOptions) (*BuildResult, error)

RebuildFromSource atomically recreates the index at the requested path from a streaming record source.

Unlike Rebuild, this API does not require callers to materialize a []corpus.Record with every BodyText resident at once. Records are consumed one at a time in source order, normalized, chunked, embedded, and flushed in bounded internal batches. Duplicate refs are rejected by the staging snapshot's primary key. Source order determines snapshot-local chunk IDs; callers that need repeatable chunk IDs across streaming rebuilds should emit records in a stable order.

type ChunkContextualizer

type ChunkContextualizer interface {
	ContextualizeChunks(ctx context.Context, record corpus.Record, sections []chunk.Section) ([]string, error)
}

ChunkContextualizer produces a short explanatory prefix for each section of a record. The returned slice must be the same length as sections and aligned with it index-for-index. An empty prefix is allowed and disables contextual retrieval for that section. The returned prefix is prepended to the embedding text and to the FTS5 content column; it is persisted so reuse keying can detect when a changed contextualizer needs to invalidate the stored embedding.
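The alignment contract can be illustrated with a minimal sketch. The record and section types below are local stand-ins for corpus.Record and chunk.Section, and the real interface's context parameter and error return are dropped for brevity:

```go
package main

import (
	"fmt"
	"strings"
)

// Local stand-ins for corpus.Record and chunk.Section, declared here
// only so the sketch compiles on its own.
type record struct{ Title string }
type section struct{ Heading string }

// contextualize honors the ChunkContextualizer alignment contract: the
// returned slice has exactly one prefix per section, index-for-index,
// and an empty prefix disables contextualization for that section.
func contextualize(rec record, sections []section) []string {
	prefixes := make([]string, len(sections)) // same length as sections
	for i, s := range sections {
		if s.Heading == "" {
			continue // empty prefix: no contextual retrieval here
		}
		prefixes[i] = fmt.Sprintf("From %q, section %q: ",
			rec.Title, strings.TrimSpace(s.Heading))
	}
	return prefixes
}

func main() {
	got := contextualize(record{Title: "Spec"},
		[]section{{Heading: "Intro"}, {Heading: ""}})
	fmt.Println(len(got), got[1] == "")
}
```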

type ContextOptions

type ContextOptions struct {
	// IncludeParent walks the requested chunk's parent_chunk_id one level
	// up and includes the parent row in the returned slice when the chunk
	// has a parent. Multi-level ancestry walks are explicit recursion by
	// the caller.
	//
	// Against snapshots built before schema v5 (#16), there is no
	// parent_chunk_id column to walk; IncludeParent is a no-op.
	IncludeParent bool

	// NeighborWindow includes up to N sibling chunks on each side of the
	// requested chunk, ordered by chunk_index. Two chunks are siblings
	// when they share the same parent_chunk_id (NULL counts as a single
	// sibling group), so for a leaf the neighborhood stays inside the
	// same parent span and for a flat or parent chunk the neighborhood
	// is other top-level chunks under the same record. Zero means no
	// neighbors are included; the requested chunk is still returned by
	// itself.
	//
	// Against snapshots built before schema v5 (#16), the parent grouping
	// is unavailable, so neighbors degrade to "other chunks in the same
	// record_ref with chunk_index in the requested window."
	NeighborWindow int
}

ContextOptions controls how Snapshot.ExpandContext widens a single chunk hit into a local-context payload.
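The NeighborWindow grouping rule can be sketched over in-memory rows. chunkRow and neighbors below are local illustrations of the documented rule (siblings share a parent group, NULL counting as its own group), not the package's SQL plan:

```go
package main

import "fmt"

// chunkRow is a local stand-in for a stored chunk: siblings share the
// same parent (nil counts as one sibling group) and order by index.
type chunkRow struct {
	ID     int64
	Parent *int64 // nil for flat/top-level chunks
	Index  int
}

// neighbors sketches the NeighborWindow rule: up to n siblings on each
// side of target, where siblings share target's parent group.
func neighbors(rows []chunkRow, target chunkRow, n int) []chunkRow {
	sameGroup := func(a, b *int64) bool {
		if a == nil || b == nil {
			return a == nil && b == nil
		}
		return *a == *b
	}
	var out []chunkRow
	for _, r := range rows {
		if r.ID == target.ID || !sameGroup(r.Parent, target.Parent) {
			continue
		}
		if d := r.Index - target.Index; d >= -n && d <= n {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	p := int64(1)
	rows := []chunkRow{
		{ID: 10, Parent: &p, Index: 0},
		{ID: 11, Parent: &p, Index: 1},
		{ID: 12, Parent: &p, Index: 2},
		{ID: 20, Parent: nil, Index: 1}, // different sibling group
	}
	for _, r := range neighbors(rows, rows[1], 1) {
		fmt.Println(r.ID)
	}
}
```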

type FusionStrategy

type FusionStrategy interface {
	Fuse(arms []RetrievalArm, limit int) ([]SearchHit, error)
}

FusionStrategy combines one or more RetrievalArms into a single ranked list, truncated to limit. Implementations must be deterministic and must attach HitProvenance to every returned hit covering each arm that contributed.

Fuse returns an error when inputs are malformed (for example Available=true with a non-nil Err, or an arm with an empty Name) or when the strategy fails closed on an upstream arm error. Callers treat errors the same way as any other retrieval failure. Strategies that want to tolerate partial-arm failures do so internally and return a nil error.

Aliasing contract: implementations must treat each input arm's Hits slice and every SearchHit it contains as read-only. They must not mutate Hit fields, must not mutate a Hit's Metadata map (which may alias storage shared across arms when the same ChunkID matched on more than one retrieval path), and must return a freshly allocated []SearchHit rather than repurposing an input arm's slice.

func DefaultFusion

func DefaultFusion() FusionStrategy

DefaultFusion returns the FusionStrategy used when SearchQuery.Fusion is nil. Ordering is identical to pre-#17 Snapshot.Search on every path, and SearchHit.Score is identical on every path except one: when the vector arm returns zero hits and the FTS arm is non-empty, DefaultFusion preserves the bm25-derived arm-native Score instead of the pre-#17 RRF-rewritten score. Callers who read Score on that specific path can recover both the arm-native and pre-#17-style scores via the HitProvenance attached to each hit.

type HitProvenance

type HitProvenance struct {
	Arms map[string]ArmEvidence
}

HitProvenance records which arms found a fused hit. The map is keyed by arm name; arms that did not return the hit are absent from the map.

type IntegrityMode

type IntegrityMode int

IntegrityMode selects how thorough the post-commit integrity validation at the end of Rebuild and Update is. The fast mode keeps Stroma-specific completeness checks (e.g., chunks_vec_full pairing on binary snapshots) because those validate what the write path itself just wrote — they're bounded by the changed set, not the corpus size. The full mode adds the SQLite whole-database PRAGMA passes that defend against on-disk page corruption (#107).

Default behavior is IntegrityModeFast: small Updates against large snapshots no longer pay for whole-database PRAGMA scans on every commit. Callers diagnosing on-disk corruption, recovering from a crash, or running periodic deep validation should opt in via IntegrityModeFull.

const (
	// IntegrityModeFast (the zero value, and default) skips the
	// whole-database PRAGMA integrity_check and PRAGMA foreign_key_check
	// passes at finalizeUpdate / finalizeRebuild time. Stroma-specific
	// completeness checks (chunks_vec_full pairing) still run. This keeps
	// finalizeUpdate's cost proportional to what changed rather than to
	// total corpus size.
	IntegrityModeFast IntegrityMode = iota
	// IntegrityModeFull adds the SQLite PRAGMA integrity_check and
	// PRAGMA foreign_key_check passes. Both are O(database size) scans,
	// so use this for periodic deep validation, after schema migrations,
	// or when investigating suspected on-disk corruption.
	IntegrityModeFull
)

type LexicalSearchParams

type LexicalSearchParams struct {
	// Text is the free-form query text. Empty rejects with a
	// "search text is required" error.
	Text string
	// Limit caps the number of SearchHits returned. Zero or negative
	// selects DefaultSearchLimit (10). Values above MaxSearchLimit
	// reject with an error instead of being silently capped.
	Limit int
	// Kinds filters candidate records to the supplied kind list. Nil
	// or empty means "no filter, all kinds".
	Kinds []string
	// Refs filters candidate records to the supplied record refs. Nil
	// or empty means "no filter, all refs".
	Refs []string
	// Metadata filters candidate records by stored record metadata. Semantics
	// match SearchParams.Metadata.
	Metadata MetadataFilter
	// OmitMetadata skips reading and decoding records.metadata_json for each
	// returned hit. Metadata filters still apply inside SQL.
	OmitMetadata bool
}

LexicalSearchParams are the retrieval parameters for FTS-only search. This surface is useful as an explicit fallback when callers cannot or do not want to call an embedder, while SearchParams remains the hybrid vector+FTS path.

type LexicalSearchQuery

type LexicalSearchQuery struct {
	// Path is the OS-native filesystem path to the snapshot. On
	// Windows both forward and back slashes are accepted — the store
	// package normalizes drive prefixes on open.
	Path string
	LexicalSearchParams
}

LexicalSearchQuery defines one FTS-only search against an index path.

type MetadataFilter

type MetadataFilter map[string][]string

MetadataFilter constrains search candidates by exact stored record metadata values. Each key is ANDed with the others; values within one key are matched as an IN-list.
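The AND/IN-list semantics can be sketched in a few lines. matchMetadata below is illustrative only: real filters are evaluated inside the SQL plan, not over decoded maps.

```go
package main

import "fmt"

// matchMetadata restates the documented MetadataFilter semantics over
// an in-memory metadata map: every filter key must match (AND), and
// within one key any listed value may match (IN-list).
func matchMetadata(meta map[string]string, filter map[string][]string) bool {
	for key, allowed := range filter {
		stored, ok := meta[key]
		if !ok {
			return false // missing key fails the AND
		}
		hit := false
		for _, v := range allowed {
			if stored == v {
				hit = true
				break
			}
		}
		if !hit {
			return false
		}
	}
	return true
}

func main() {
	meta := map[string]string{"lang": "go", "tier": "core"}
	fmt.Println(matchMetadata(meta, map[string][]string{"lang": {"go", "rust"}}))             // true
	fmt.Println(matchMetadata(meta, map[string][]string{"lang": {"go"}, "tier": {"extra"}})) // false
}
```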

type OutlineQuery

type OutlineQuery struct {
	// Refs filters candidate records to the supplied refs. Nil or empty
	// means "no filter, all refs".
	Refs []string
	// Kinds filters candidate records to the supplied kinds. Nil or empty
	// means "no filter, all kinds".
	Kinds []string
	// Metadata filters candidate records by stored record metadata. The
	// semantics match SearchParams.Metadata: keys are ANDed, values within
	// one key are ORed, and filters are evaluated in SQL without returning
	// the metadata payload.
	Metadata MetadataFilter
}

OutlineQuery filters structural chunk outlines from an opened snapshot.

type OutlineRow

type OutlineRow struct {
	ChunkID       int64
	Ref           string
	Kind          string
	Title         string
	SourceRef     string
	Heading       string
	ParentChunkID *int64
	Depth         int
	ContextPrefix string
	SourceSpan    *chunk.SourceSpan
}

OutlineRow is one structural chunk row from a snapshot. It intentionally omits full content, embeddings, and record metadata so callers can inspect a compact document outline before deciding which chunks to retrieve.

type RRFFusion

type RRFFusion struct {
	K                      int
	PreserveSingleArmScore bool
}

RRFFusion is the default FusionStrategy. K controls the RRF constant; K<=0 is treated as K=60 for backward compatibility with the pre-#17 mergeRRF helper.

PreserveSingleArmScore controls the single-arm degenerate case. When true (the default used by DefaultFusion) and exactly one arm is available-and-non-empty, Fuse returns that arm's hits in arm order with arm-native Score preserved. When false, Fuse rewrites Score to the RRF-derived 1/(K+rank+1) on every path. Callers that want numerically uniform fused scores across single-arm and multi-arm paths opt in by setting this to false.

func (RRFFusion) Fuse

func (r RRFFusion) Fuse(arms []RetrievalArm, limit int) ([]SearchHit, error)

Fuse implements FusionStrategy. See RRFFusion for the single-arm contract. Ties in RRF score are broken by (more contributing arms first) then (better cross-arm rank first).
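The RRF arithmetic behind Fuse can be sketched over arm hit lists keyed by chunk ID. This restates only the documented 1/(K+rank+1) score accumulation with the K<=0 fallback; tie-breaking, provenance, and the single-arm score-preservation path are omitted:

```go
package main

import (
	"fmt"
	"sort"
)

// rrfScores sketches the RRF formula: each arm contributes
// 1/(K+rank+1) for every chunk it returns, and a chunk's fused score
// is the sum over arms that found it. K<=0 maps to 60.
func rrfScores(arms map[string][]int64, k int) map[int64]float64 {
	if k <= 0 {
		k = 60
	}
	scores := make(map[int64]float64)
	for _, hits := range arms {
		for rank, id := range hits {
			scores[id] += 1.0 / float64(k+rank+1)
		}
	}
	return scores
}

func main() {
	scores := rrfScores(map[string][]int64{
		"vector": {7, 3, 9},
		"fts":    {7, 3},
	}, 0)
	ids := make([]int64, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return scores[ids[i]] > scores[ids[j]] })
	fmt.Println(ids) // chunks found by both arms rank ahead of single-arm hits
}
```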

type RecordAggregationOptions

type RecordAggregationOptions struct {
	// Limit caps the number of records returned by AggregateSearchHitsByRecord
	// and SearchRecords. Zero or negative uses DefaultSearchLimit.
	Limit int

	// Strategy optionally overrides the default record aggregation strategy.
	// Nil uses DefaultRecordAggregation().
	Strategy RecordAggregationStrategy
}

RecordAggregationOptions controls how chunk search hits are aggregated to record-level results.

type RecordAggregationStrategy

type RecordAggregationStrategy interface {
	Aggregate(hits []SearchHit, opts RecordAggregationOptions) ([]RecordSearchHit, error)
}

RecordAggregationStrategy groups ranked chunk hits into ranked record hits.

Strategies receive the full options value so future generic knobs can be added without changing this interface. Implementations should ignore opts.Strategy; it identifies the strategy already being called.

Aliasing contract: implementations must treat the input SearchHit slice and every SearchHit it contains as read-only. They must not mutate Hit fields, Metadata maps, Provenance maps, or SourceSpan values, and must return a freshly allocated []RecordSearchHit.

func DefaultRecordAggregation

func DefaultRecordAggregation() RecordAggregationStrategy

DefaultRecordAggregation returns the strategy used when RecordAggregationOptions.Strategy is nil.

type RecordHitContribution

type RecordHitContribution struct {
	ChunkID    int64
	Heading    string
	Score      float64
	SourceSpan *chunk.SourceSpan
	Provenance *HitProvenance
}

RecordHitContribution is one chunk hit that contributed evidence to a record-level aggregate.

type RecordQuery

type RecordQuery struct {
	Refs  []string
	Kinds []string

	// OmitMetadata skips reading and decoding records.metadata_json for
	// each returned record. corpus.Record.Metadata stays nil instead of
	// being populated as a (possibly empty) map. Use this for bulk
	// exports/walks that only need ref/kind/title/body and never inspect
	// metadata; it skips the per-row json.Unmarshal call.
	OmitMetadata bool
}

RecordQuery filters records from an opened snapshot.

type RecordSearchHit

type RecordSearchHit struct {
	Ref           string
	Kind          string
	Title         string
	SourceRef     string
	Metadata      map[string]string
	Score         float64
	Contributions []RecordHitContribution
}

RecordSearchHit is one record-level aggregate over contributing chunk hits.

func AggregateSearchHitsByRecord

func AggregateSearchHitsByRecord(hits []SearchHit, opts RecordAggregationOptions) ([]RecordSearchHit, error)

AggregateSearchHitsByRecord aggregates already-ranked chunk hits into ranked record hits. It is useful when callers already have Search results and want to apply Stroma's default record grouping without issuing another query. The aggregation pass is linear in len(hits); callers that provide hits from a source other than Stroma's bounded Search APIs own that input size.
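The linear grouping pass can be sketched over reduced hits. groupByRecord below is one plausible aggregation (best contributing chunk wins) for illustration only; it is not a statement of what DefaultRecordAggregation actually computes:

```go
package main

import (
	"fmt"
	"sort"
)

// hit is a local stand-in for SearchHit, reduced to what grouping needs.
type hit struct {
	Ref   string
	Score float64
}

// groupByRecord sketches a chunk-to-record aggregation that is linear
// in len(hits): hits sharing a Ref collapse into one record hit scored
// by the best contributing chunk, then the result is truncated to limit.
func groupByRecord(hits []hit, limit int) []hit {
	best := make(map[string]float64)
	var order []string
	for _, h := range hits {
		if _, ok := best[h.Ref]; !ok {
			order = append(order, h.Ref)
		}
		if h.Score > best[h.Ref] {
			best[h.Ref] = h.Score
		}
	}
	out := make([]hit, 0, len(order))
	for _, ref := range order {
		out = append(out, hit{Ref: ref, Score: best[ref]})
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if limit > 0 && len(out) > limit {
		out = out[:limit]
	}
	return out
}

func main() {
	recs := groupByRecord([]hit{
		{"doc/a", 0.9}, {"doc/b", 0.8}, {"doc/a", 0.7},
	}, 10)
	for _, r := range recs {
		fmt.Println(r.Ref, r.Score)
	}
}
```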

func SearchRecords

func SearchRecords(ctx context.Context, query RecordSearchQuery) ([]RecordSearchHit, error)

SearchRecords returns record-level aggregates over semantically close sections from an existing index. When SearchParams.Limit is omitted, the chunk search limit is derived from Aggregation.Limit so aggregation has room to fill the requested record count.

type RecordSearchQuery

type RecordSearchQuery struct {
	// Path is the OS-native filesystem path to the snapshot. On
	// Windows both forward and back slashes are accepted — the store
	// package normalizes drive prefixes on open.
	Path string
	SearchParams
	Aggregation RecordAggregationOptions
}

RecordSearchQuery defines one semantic search whose chunk hits are aggregated to record-level results. When SearchParams.Limit is positive, it caps the chunk hits supplied to aggregation. When omitted, SearchRecords derives a chunk-hit limit from Aggregation.Limit.

type RecordSource

type RecordSource interface {
	Next(ctx context.Context) (corpus.Record, bool, error)
}

RecordSource streams records into RebuildFromSource.

Next returns the next record and true while input remains. Returning false ends the stream and ignores the returned record. Implementations should return any loading or decoding failure directly. RebuildFromSource calls Next serially with a non-nil context, propagates source errors, and leaves the destination snapshot unchanged.

type RecordSourceFunc

type RecordSourceFunc func(ctx context.Context) (corpus.Record, bool, error)

RecordSourceFunc adapts a function to RecordSource.

func (RecordSourceFunc) Next

func (f RecordSourceFunc) Next(ctx context.Context) (corpus.Record, bool, error)

Next calls f(ctx). A nil function returns a configuration error instead of panicking through RebuildFromSource or UpdateFromSource.

type Reranker

type Reranker interface {
	Rerank(ctx context.Context, query string, candidates []SearchHit) ([]SearchHit, error)
}

Reranker optionally refines one search candidate shortlist before the final limit truncation.

Aliasing contract: implementations must treat the input candidates slice and every SearchHit it contains as read-only. They must not mutate Hit fields, must not mutate a Hit's Metadata map (which may alias storage shared with other hits), and must not return the input slice — return a freshly allocated []SearchHit instead. Snapshot.Search defensively shallow-copies the candidates slice before handing it to the reranker, but that copy is shallow so maps and sub-slices inside each SearchHit remain shared. Reorderings and truncations are fine; mutations are not.

type RetrievalArm

type RetrievalArm struct {
	Name      string
	Hits      []SearchHit
	Available bool
	Err       error
}

RetrievalArm is one candidate list from one retrieval path, ordered by the arm's own ranking. Hits[i].Score is the arm-native score (cosine distance derivative for vector, negative bm25-equivalent for FTS).

Available and Err distinguish three otherwise identical-looking states:

  • Available=true, Err=nil, len(Hits)==0: arm ran, zero matches.
  • Available=false, Err=nil: arm unavailable on this snapshot (for example a legacy snapshot without fts_chunks). Hits must be empty.
  • Available=false, Err!=nil: arm failed. Hits must be empty.

Available=true with a non-nil Err is invalid; FusionStrategy implementations should return an error when they observe it.

type ReuseStatus

type ReuseStatus string

ReuseStatus reports whether BuildOptions.ReuseFromPath was usable during Rebuild. Reuse setup remains non-fatal by default; callers can inspect BuildResult.ReuseStatus and BuildResult.ReuseDisabledReason to distinguish "nothing reusable" from "reuse could not start".

const (
	// ReuseStatusDisabled means BuildOptions.ReuseFromPath was empty.
	ReuseStatusDisabled ReuseStatus = "disabled"
	// ReuseStatusActive means the prior snapshot opened and passed
	// compatibility checks, so section-level reuse was attempted.
	ReuseStatusActive ReuseStatus = "active"
	// ReuseStatusUnavailable means the configured path did not name a
	// readable snapshot file, for example because it was missing or a
	// directory.
	ReuseStatusUnavailable ReuseStatus = "unavailable"
	// ReuseStatusIncompatible means the snapshot exists but cannot seed
	// this build because schema, embedder, dimension, or quantization
	// metadata does not match.
	ReuseStatusIncompatible ReuseStatus = "incompatible"
	// ReuseStatusError means setup hit an operational error while
	// checking the configured snapshot.
	ReuseStatusError ReuseStatus = "error"
)

type SearchHit

type SearchHit struct {
	ChunkID    int64
	Ref        string
	Kind       string
	Title      string
	SourceRef  string
	Heading    string
	Content    string
	SourceSpan *chunk.SourceSpan
	Metadata   map[string]string
	Score      float64
	// Provenance records which retrieval arms contributed to this hit.
	// It is populated by FusionStrategy implementations; non-fusion paths
	// (SearchVector, direct searchFTS callers) leave it nil.
	Provenance *HitProvenance
}

SearchHit is one retrieved section.

func Search

func Search(ctx context.Context, query SearchQuery) ([]SearchHit, error)

Search returns semantically close sections from an existing index.

func SearchLexical

func SearchLexical(ctx context.Context, query LexicalSearchQuery) ([]SearchHit, error)

SearchLexical returns lexically close sections from an existing index without requiring an embedder.

type SearchParams

type SearchParams struct {
	// Text is the free-form query text. Empty rejects with a
	// "search text is required" error — this field has no default.
	Text string
	// Limit caps the number of SearchHits returned. Zero or negative
	// selects DefaultSearchLimit (10). Values above MaxSearchLimit
	// reject with an error instead of being silently capped.
	Limit int
	// Kinds filters candidate records to the supplied kind list. Nil
	// or empty means "no filter, all kinds".
	Kinds []string
	// Refs filters candidate records to the supplied record refs. Nil
	// or empty means "no filter, all refs". The filter is applied inside
	// each retrieval arm before that arm ranks and truncates candidates.
	Refs []string
	// Metadata filters candidate records by stored record metadata. Each
	// map key is matched exactly against corpus.Record.Metadata, values
	// are ORed within a key, and multiple keys are ANDed together. Empty
	// keys, empty value lists, and whitespace-only values reject; empty
	// metadata values are valid exact matches. The filter is applied inside
	// each retrieval arm before that arm ranks and truncates candidates.
	Metadata MetadataFilter
	// Embedder produces the query vector(s) used by the dense arm.
	// Nil rejects with a "search embedder is required" error — this
	// field has no default.
	Embedder embed.Embedder
	// Fusion optionally overrides the hybrid fusion strategy. Nil
	// uses DefaultFusion().
	Fusion FusionStrategy
	// Reranker optionally refines the candidate shortlist after
	// fusion. Nil skips reranking.
	Reranker Reranker

	// SearchDimension optionally runs a truncated-prefix vector prefilter
	// at this dimension, then rescores the shortlist with full-dim cosine.
	// Zero (default) uses the full stored dimension throughout. Positive
	// values must be <= the stored embedder dimension. Only valid when the
	// stored quantization is float32; returns an error against int8 indexes.
	// This is the shape Matryoshka Representation Learning (MRL) embeddings
	// rely on — callers who use non-MRL embeddings should leave it zero.
	//
	// The truncated path is a brute-force scan over chunks_vec, not a
	// vec0 kNN MATCH, so it is not asymptotically cheaper than the default
	// path: its win is constant-factor (fewer floats per cosine) and only
	// pays off when the truncated prefix preserves ranking. Treat this as
	// a tuning knob for MRL snapshots rather than a blanket speedup.
	SearchDimension int

	// OmitMetadata skips reading and decoding records.metadata_json for
	// each returned hit. SearchHit.Metadata stays nil instead of being
	// populated as a (possibly empty) map. This is an additive opt-out
	// for callers that drive ranking off Ref/Title/Content and never
	// inspect Metadata: it skips the per-hit json.Unmarshal and avoids
	// transferring the JSON blob from SQLite. Metadata filters
	// (SearchParams.Metadata) still work because they are evaluated
	// inside the SQL plan, not on the returned hit's payload.
	OmitMetadata bool
}

SearchParams are the retrieval parameters shared by SearchQuery (the top-level one-shot API against an index path) and SnapshotSearchQuery (the long-lived API against an open Snapshot). Extracting the shared shape lets downstream adapters thread one value through both surfaces and lets the top-level Search forward its params verbatim instead of hand-copying six fields.

type SearchQuery

type SearchQuery struct {
	// Path is the OS-native filesystem path to the snapshot. On
	// Windows both forward and back slashes are accepted — the store
	// package normalizes drive prefixes on open.
	Path string
	SearchParams
}

SearchQuery defines one semantic search against an index path. Retrieval parameters live on the embedded SearchParams so the same shape flows through Search, Snapshot.Search, and any downstream adapter wrapper.

type Section

type Section struct {
	ChunkID       int64
	Ref           string
	Kind          string
	Title         string
	SourceRef     string
	Heading       string
	Content       string
	ContextPrefix string
	SourceSpan    *chunk.SourceSpan
	Metadata      map[string]string
	Embedding     []float64
}

Section is one stored section from a Stroma snapshot.

type SectionQuery

type SectionQuery struct {
	Refs  []string
	Kinds []string

	// IncludeEmbeddings asks Sections() and WalkSections() to populate
	// Section.Embedding from the stored vector column. Snapshots produced
	// by hierarchical policies (e.g., chunk.LateChunkPolicy) hold parent
	// rows that are storage-only context with no vector — those rows are
	// filtered out of an IncludeEmbeddings = true query because the
	// underlying chunks → chunks_vec join is inner. Set IncludeEmbeddings
	// = false to receive every chunk row (parents + leaves) without
	// embeddings.
	IncludeEmbeddings bool

	// OmitMetadata skips reading and decoding records.metadata_json for
	// each returned section. Section.Metadata stays nil instead of being
	// populated as a (possibly empty) map. Use this for embedding-heavy
	// section walks that do not consult per-record metadata.
	OmitMetadata bool
}

SectionQuery filters sections from an opened snapshot.

type Snapshot

type Snapshot struct {
	// contains filtered or unexported fields
}

Snapshot is one opened Stroma index snapshot.

Safe for concurrent use by multiple goroutines once returned from OpenSnapshot: *sql.DB is goroutine-safe per the database/sql contract, and all Snapshot read methods (Stats, Records, Sections, WalkRecords, WalkSections, Search, SearchVector, ExpandContext) invoke it through that contract. Cached metadata fields (quantization, storedDimension, hasFTS, …) are populated at open time and read-only thereafter, so no additional synchronization is required around Snapshot itself.

func OpenSnapshot

func OpenSnapshot(ctx context.Context, path string) (*Snapshot, error)

OpenSnapshot opens a read-only Stroma snapshot at path. The path is OS-native; on Windows both forward and back slashes are accepted (the store package normalizes drive prefixes on open). The snapshot's schema_version metadata must be one of the versions supported by the migration table; read paths can decode every supported version directly without forcing an Update. Anything else returns ErrUnsupportedSchemaVersion wrapped with the observed version, so callers can surface a clear upgrade/downgrade message instead of silently misdecoding data against a future schema.

The returned *Snapshot is safe for concurrent use by multiple goroutines once constructed: *sql.DB is goroutine-safe per the database/sql contract, and Snapshot's cached metadata fields are populated at open time and read-only thereafter.
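A minimal open/inspect/close sketch; the error-handling shape is illustrative:

```go
snap, err := index.OpenSnapshot(ctx, "/data/corpus.stroma")
if err != nil {
	// ErrUnsupportedSchemaVersion arrives wrapped with the observed
	// version, so errors.Is matches and the message names the mismatch.
	if errors.Is(err, index.ErrUnsupportedSchemaVersion) {
		return fmt.Errorf("snapshot schema not supported by this build: %w", err)
	}
	return err
}
defer snap.Close()

stats, err := snap.Stats(ctx)
if err != nil {
	return err
}
fmt.Printf("%d records, %d chunks\n", stats.RecordCount, stats.ChunkCount)
```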

func (*Snapshot) Close

func (s *Snapshot) Close() error

Close releases the opened snapshot handle.

func (*Snapshot) ExpandContext

func (s *Snapshot) ExpandContext(ctx context.Context, chunkID int64, opts ContextOptions) ([]Section, error)

ExpandContext returns the chunk identified by chunkID together with the caller-requested local context, in document order:

[parent (if IncludeParent and the chunk has one), neighbors before,
 the chunk itself, neighbors after]

The chunk itself is always included, so callers do not have to reconcile the original SearchHit with the expansion. Embeddings are never populated by ExpandContext — the API is for context retrieval, not for re-ranking against fresh vectors. Callers that need embeddings should use Sections() with IncludeEmbeddings = true.

Returns an empty slice and a nil error when chunkID does not exist; the substrate treats "no such chunk" as an empty result rather than an error, matching the section-read APIs.

Against snapshots built before schema v5 (#16), the v5 lineage column is absent: IncludeParent becomes a no-op and NeighborWindow scopes by record_ref alone (no parent grouping). ExpandContext stays useful on legacy files; it just cannot surface lineage that was never recorded.

Internally ExpandContext issues a small bounded number of parameterized reads: at most one to locate the requested chunk, one to fetch the parent (when IncludeParent + parent_chunk_id present), and one range scan over the sibling window. There is no per-result parameter expansion (no `WHERE id IN (?, ?, ?, ...)`), so the query never approaches SQLite's parameter cap regardless of NeighborWindow.
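A usage sketch. It assumes the originating search hit exposes its chunk ID as hit.ChunkID (Section.ChunkID suggests that shape, but SearchHit's fields are not listed on this page):

```go
sections, err := snap.ExpandContext(ctx, hit.ChunkID, index.ContextOptions{
	IncludeParent:  true, // prepend the hierarchical parent, if recorded
	NeighborWindow: 2,    // two siblings on each side, in document order
})
if err != nil {
	return err
}
if len(sections) == 0 {
	// "no such chunk" is an empty result, not an error.
	return nil
}
// sections is [parent?, neighbors before, the chunk itself, neighbors after].
```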

func (*Snapshot) Outline

func (s *Snapshot) Outline(ctx context.Context, query OutlineQuery) ([]OutlineRow, error)

Outline returns all matching structural chunk rows from the opened snapshot. It is the all-at-once convenience API; use WalkOutline for bounded-memory traversal of large snapshots.

func (*Snapshot) Path

func (s *Snapshot) Path() string

Path returns the opened snapshot path.

func (*Snapshot) Records

func (s *Snapshot) Records(ctx context.Context, query RecordQuery) ([]corpus.Record, error)

Records returns all matching records from the opened snapshot. It is the all-at-once convenience API: large result sets are fully materialized in memory. Use WalkRecords when callers need bounded memory.

func (*Snapshot) Search

func (s *Snapshot) Search(ctx context.Context, query SnapshotSearchQuery) ([]SearchHit, error)

Search runs a hybrid text search (vector + FTS5) against the opened snapshot.

func (*Snapshot) SearchLexical

func (s *Snapshot) SearchLexical(ctx context.Context, query SnapshotLexicalSearchQuery) ([]SearchHit, error)

SearchLexical runs an FTS-only text search against the opened snapshot. It does not require an embedder and is the explicit fallback surface for callers that want lexical retrieval when vector search is unavailable.
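A sketch of the explicit fallback path. The lexParams value stands in for a populated LexicalSearchParams, whose fields are not shown on this page:

```go
hits, err := snap.SearchLexical(ctx, index.SnapshotLexicalSearchQuery{
	LexicalSearchParams: lexParams,
})
if errors.Is(err, index.ErrLexicalSearchUnavailable) {
	// Legacy snapshot without the FTS5 table: there is no lexical arm,
	// so fall back to vector-only retrieval or surface the condition.
}
```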

func (*Snapshot) SearchRecords

func (s *Snapshot) SearchRecords(ctx context.Context, query SnapshotRecordSearchQuery) ([]RecordSearchHit, error)

SearchRecords runs Search, then aggregates the returned chunk hits to record-level results. All SearchParams filters apply before aggregation because they are evaluated by the underlying Search call.

When SearchParams.Limit is positive, it caps the chunk hits supplied to aggregation. When omitted, SearchRecords derives a chunk-hit limit from RecordAggregationOptions.Limit. RecordAggregationOptions.Limit caps the returned record count.

func (*Snapshot) SearchVector

func (s *Snapshot) SearchVector(ctx context.Context, query VectorSearchQuery) ([]SearchHit, error)

SearchVector runs a vector search against the opened snapshot.

func (*Snapshot) Sections

func (s *Snapshot) Sections(ctx context.Context, query SectionQuery) ([]Section, error)

Sections returns all matching sections from the opened snapshot. It is the all-at-once convenience API: large result sets, especially with IncludeEmbeddings=true, are fully materialized in memory. Use WalkSections when callers need bounded memory.

func (*Snapshot) Stats

func (s *Snapshot) Stats(ctx context.Context) (*Stats, error)

Stats inspects the opened snapshot.

func (*Snapshot) UsesIndexedMetadata

func (s *Snapshot) UsesIndexedMetadata() bool

UsesIndexedMetadata reports whether the opened snapshot evaluates metadata filters through the indexed record_metadata side table (schema v6+) or through the JSON-backed read-path fallback (older read-only snapshots). Callers benchmarking metadata-filter latency can use this to confirm the snapshot is on the fast path.

func (*Snapshot) VerifyBinaryCompanion

func (s *Snapshot) VerifyBinaryCompanion(ctx context.Context) error

VerifyBinaryCompanion runs a full row-by-row check that every searchable chunk has a matching chunks_vec_full row and vice versa. OpenSnapshot itself trusts the binary_companion_validated_fingerprint marker for performance (#109), so this method is the explicit strict path for callers that need to detect external tampering with the companion table that left the snapshot's content_fingerprint intact. The check is a no-op for non-binary snapshots (returns nil with no work) and otherwise scans the chunks ↔ chunks_vec_full join in O(chunk count). Safe to call on a long-lived snapshot handle and from multiple goroutines.

Strict semantics for binary snapshots: a binary snapshot whose chunks_vec_full table has been dropped entirely is treated as a hard error. The shared completeness helper short-circuits on missing tables (correct for non-binary snapshots) so this method probes the table presence first and rejects up front when it is gone, rather than returning a misleading nil and letting the next binary search fail at query time.

func (*Snapshot) WalkOutline

func (s *Snapshot) WalkOutline(ctx context.Context, query OutlineQuery, fn func(OutlineRow) error) error

WalkOutline streams matching structural chunk rows ordered by ref, chunk_index, then chunk id. Rows include parent links when the snapshot schema has chunks.parent_chunk_id; older snapshots project nil parents and depth zero. The callback runs while the SQLite cursor is open, so keep it quick for the same reasons as WalkRecords and WalkSections.

If the callback returns ErrStopWalk or wraps it, walking stops successfully and WalkOutline returns nil. Other callback errors stop walking and are wrapped in the returned error.

func (*Snapshot) WalkRecords

func (s *Snapshot) WalkRecords(ctx context.Context, query RecordQuery, fn func(corpus.Record) error) error

WalkRecords streams matching records from the opened snapshot in ref order. The callback is invoked once per row while the SQLite cursor is open, so callers can process records without materializing the full result set. This is a single-pass cursor, not a resumable page API: if a caller stops and calls WalkRecords again, the next walk starts from the first matching row. Keep callbacks quick: slow I/O in the callback keeps the SQLite read cursor open until the callback returns, which can delay WAL checkpoints while the walk is active.

If the callback returns ErrStopWalk or wraps it, walking stops successfully and WalkRecords returns nil. Other callback errors stop walking and are wrapped in the returned error; callers should use errors.Is or errors.As when matching them.
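An early-stop sketch. A zero RecordQuery is used as "no filter", and the Ref field on corpus.Record is an assumption drawn from the surrounding docs:

```go
// Stream refs with bounded memory and stop after the first 100 rows.
refs := make([]string, 0, 100)
err := snap.WalkRecords(ctx, index.RecordQuery{}, func(r corpus.Record) error {
	refs = append(refs, r.Ref)
	if len(refs) == 100 {
		return index.ErrStopWalk // walk stops; WalkRecords returns nil
	}
	return nil // keep this fast: the SQLite read cursor stays open meanwhile
})
```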

func (*Snapshot) WalkSections

func (s *Snapshot) WalkSections(ctx context.Context, query SectionQuery, fn func(Section) error) error

WalkSections streams matching sections from the opened snapshot ordered by ref, chunk_index, then chunk id. The callback is invoked once per row while the SQLite cursor is open, so callers can process sections without materializing the full result set. This is a single-pass cursor, not a resumable page API: if a caller stops and calls WalkSections again, the next walk starts from the first matching row.

With SectionQuery.IncludeEmbeddings=true, WalkSections uses the same inner vector join as Sections and only returns rows that have stored vectors. Hierarchical parent rows without vectors are filtered out; use IncludeEmbeddings=false for a complete structural section walk.

Keep callbacks quick: slow I/O in the callback keeps the SQLite read cursor open until the callback returns, which can delay WAL checkpoints while the walk is active.

If the callback returns ErrStopWalk or wraps it, walking stops successfully and WalkSections returns nil. Other callback errors stop walking and are wrapped in the returned error; callers should use errors.Is or errors.As when matching them.

type SnapshotLexicalSearchQuery

type SnapshotLexicalSearchQuery struct {
	LexicalSearchParams
}

SnapshotLexicalSearchQuery defines one FTS-only search against an opened snapshot.

type SnapshotRecordSearchQuery

type SnapshotRecordSearchQuery struct {
	SearchParams
	Aggregation RecordAggregationOptions
}

SnapshotRecordSearchQuery defines one text search against an opened snapshot with chunk hits aggregated into record-level results. The embedded SearchParams filters are evaluated before aggregation. When SearchParams.Limit is positive, it caps the chunk hits supplied to aggregation. When omitted, SearchRecords derives a chunk-hit limit from Aggregation.Limit.

type SnapshotSearchQuery

type SnapshotSearchQuery struct {
	SearchParams
}

SnapshotSearchQuery defines one text search against an opened snapshot. Retrieval parameters live on the embedded SearchParams so the same value can be forwarded verbatim from SearchQuery.SearchParams without hand-copying fields.

type Stats

type Stats struct {
	Path                string
	RecordCount         int
	ChunkCount          int
	KindCounts          map[string]int
	SchemaVersion       string
	EmbedderDimension   int
	EmbedderFingerprint string
	ContentFingerprint  string
	CreatedAt           string
}

Stats describes a built Stroma index.

func ReadStats

func ReadStats(ctx context.Context, path string) (*Stats, error)

ReadStats inspects an existing index.

type SumScoreRecordAggregation

type SumScoreRecordAggregation struct{}

SumScoreRecordAggregation is the default RecordAggregationStrategy. It groups chunk hits by Ref, scores each record by the sum of its contributing chunk scores, and keeps contributions in the same order they appeared in the input hit list. Ties are broken by best contributing chunk score, then by more contributing chunks, then by Ref. Record descriptor fields and Metadata are copied from the first hit seen for each Ref; Stroma-produced hits carry the same record payload on every chunk for one Ref.

func (SumScoreRecordAggregation) Aggregate

Aggregate implements RecordAggregationStrategy.

type UpdateOptions

type UpdateOptions struct {
	// Path is the OS-native filesystem path to the existing snapshot
	// to update in place. On Windows both forward and back slashes are
	// accepted — the store package normalizes drive prefixes on open.
	Path     string
	Embedder embed.Embedder

	// Contextualizer optionally produces a per-chunk prefix string. See
	// BuildOptions.Contextualizer for the contract. Leaving it nil
	// preserves the non-contextual path and produces chunks with an
	// empty persisted prefix.
	Contextualizer ChunkContextualizer

	// MaxChunkTokens sets the approximate maximum number of tokens (words)
	// per chunk. It should match the chunking policy used to build the current
	// index if callers want incremental updates to remain section-compatible.
	MaxChunkTokens int

	// ChunkOverlapTokens sets the approximate number of overlapping tokens
	// between adjacent sub-sections when a section is split. It should match
	// the chunking policy used to build the current index.
	ChunkOverlapTokens int

	// MaxChunkSections mirrors BuildOptions.MaxChunkSections for the
	// incremental-update path. Zero → DefaultMaxChunkSections; negative
	// → no cap.
	MaxChunkSections int

	// MaxPlannedRecords caps how many added/replaced records Update will
	// chunk, reuse-plan, and embed before opening its write transaction.
	// This bounds resident pre-transaction plan memory for callers that
	// split large ingests into repeated Update calls. Zero keeps the
	// historical unbounded behavior; negative values reject. The cap
	// applies only to added/replaced records, not removals.
	MaxPlannedRecords int

	// AllowSourceRemovals permits UpdateFromSource to remove stored records
	// that are absent from the supplied source. The default is false because
	// callers often think of "update from source" as a partial changed-record
	// feed. Use SyncFromSource for the full desired-corpus synchronization
	// path where removals are expected.
	AllowSourceRemovals bool

	// Quantization, when provided, must match the existing index — see
	// the store.Quantization* constants (float32, int8, binary) for the
	// accept-listed values. Leaving it empty reuses the stored
	// quantization metadata.
	Quantization string

	// ChunkPolicy mirrors BuildOptions.ChunkPolicy for the incremental
	// update path. Nil defaults to chunk.MarkdownPolicy with the
	// MaxChunkTokens / ChunkOverlapTokens / MaxChunkSections knobs
	// resolved here. The substrate does not enforce that the policy
	// matches the one used to build the snapshot — callers who switch
	// policies between Build and Update should expect reuse cache
	// misses on the affected sections (the leaves still re-embed
	// correctly; the snapshot just won't share embeddings across
	// rebuilds).
	ChunkPolicy chunk.Policy

	// IntegrityMode selects post-commit validation depth (#107). The
	// zero value, IntegrityModeFast, skips the SQLite-level whole-DB
	// PRAGMA passes so a small Update against a large snapshot no
	// longer pays for an O(database size) integrity_check on every
	// commit. IntegrityModeFull adds those passes back for callers that
	// want deep validation at the cost of full-corpus scans.
	IntegrityMode IntegrityMode
}

UpdateOptions controls how an existing Stroma index is updated in place.

type UpdateResult

type UpdateResult struct {
	Path                 string
	UpsertedCount        int
	RemovedCount         int
	RecordCount          int
	ChunkCount           int
	UnchangedRecordCount int
	UnchangedChunkCount  int
	ReusedRecordCount    int
	ReusedChunkCount     int
	EmbeddedChunkCount   int
	EmbedderDimension    int
	EmbedderFingerprint  string
	ContentFingerprint   string
}

UpdateResult summarizes one incremental update.

func SyncFromSource

func SyncFromSource(ctx context.Context, source RecordSource, options UpdateOptions) (*UpdateResult, error)

SyncFromSource synchronizes an existing Stroma index to exactly match a full desired-corpus source. Stored records absent from source are removed.

Source order is preserved for added/replaced records, so it determines the chunk ID order for rows written by this update. The method consumes record bodies one at a time, but it keeps stored (ref, content_hash) pairs and the full source ref set in memory to detect removals/no-ops. It also retains changed records and their planned chunks/vectors until commit. Use UpdateOptions.MaxPlannedRecords to bound that changed-record plan; the cap is checked while the source is consumed, before embedding and before the write handle is opened.

func Update

func Update(ctx context.Context, added []corpus.Record, removed []string, options UpdateOptions) (*UpdateResult, error)

Update applies add, replace, and remove operations to an existing Stroma index without rebuilding it from scratch.

func UpdateFromSource

func UpdateFromSource(ctx context.Context, source RecordSource, options UpdateOptions) (*UpdateResult, error)

UpdateFromSource diffs a source against an existing Stroma index. Records yielded by source are normalized, compared with the snapshot's stored (ref, content_hash) pairs, and only added/replaced records are chunked and embedded. Unchanged records are counted as reused without loading their full bodies.

By default, stored records absent from source are rejected with ErrSourceRemovalsDisabled instead of being removed. This keeps partial changed-record feeds from accidentally deleting the rest of the corpus. Use SyncFromSource, or set UpdateOptions.AllowSourceRemovals, when the source is the complete desired corpus and removals are intended.

Source order is preserved for added/replaced records, so it determines the chunk ID order for rows written by this update. The method consumes record bodies one at a time, but it keeps stored (ref, content_hash) pairs and the full source ref set in memory to detect removals/no-ops. It also retains changed records and their planned chunks/vectors until commit. Use UpdateOptions.MaxPlannedRecords to bound that changed-record plan; the cap is checked while the source is consumed, before embedding and before the write handle is opened.
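A sketch of handling the default removal guard; src and embedder are caller-supplied values assumed here:

```go
opts := index.UpdateOptions{
	Path:     "/data/corpus.stroma",
	Embedder: embedder,
}
res, err := index.UpdateFromSource(ctx, src, opts)
if errors.Is(err, index.ErrSourceRemovalsDisabled) {
	// The source omitted stored records. If src really is the complete
	// desired corpus, synchronize explicitly instead:
	res, err = index.SyncFromSource(ctx, src, opts)
}
```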

type VectorSearchQuery

type VectorSearchQuery struct {
	// Embedding is the precomputed query vector. Empty rejects with
	// a "search embedding is required" error — this field has no
	// default.
	Embedding []float64
	// Limit caps the number of SearchHits returned. Zero or negative
	// selects DefaultSearchLimit (10). Values above MaxSearchLimit
	// reject with an error instead of being silently capped.
	Limit int
	// Kinds filters candidate records to the supplied kind list. Nil
	// or empty means "no filter, all kinds".
	Kinds []string
	// Refs filters candidate records to the supplied record refs. Nil
	// or empty means "no filter, all refs". The filter is applied before
	// the vector arm ranks and truncates candidates.
	Refs []string
	// Metadata filters candidate records by stored record metadata. Each
	// key is ANDed with the others; values within one key are ORed. Empty
	// keys, empty value lists, and whitespace-only values reject; empty
	// metadata values are valid exact matches. The filter is applied before
	// the vector arm ranks and truncates candidates.
	Metadata MetadataFilter

	// OmitMetadata skips reading and decoding records.metadata_json for
	// each returned hit. See SearchParams.OmitMetadata for the same
	// opt-out shape and rationale.
	OmitMetadata bool
}

VectorSearchQuery defines one vector search against an opened snapshot.
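A construction sketch. MetadataFilter is written here as a map of key to accepted values; that concrete shape is an assumption inferred from "keys ANDed, values ORed", and queryVec is a caller-supplied precomputed vector:

```go
hits, err := snap.SearchVector(ctx, index.VectorSearchQuery{
	Embedding: queryVec, // required: empty rejects
	Limit:     20,       // 0 → DefaultSearchLimit; above MaxSearchLimit rejects
	Metadata:  index.MetadataFilter{"lang": {"en", "de"}},
})
```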
