package index

v2.1.0
Warning: this package is not in the latest version of its module.
Published: Apr 23, 2026 License: MIT Imports: 19 Imported by: 0

Documentation

Overview

Package index orchestrates atomic Stroma index rebuilds and searches.

Index

Constants

const (
	ArmVector = "vector"
	ArmFTS    = "fts"
)

Arm name constants used by the default Snapshot.Search pipeline. Custom FusionStrategy implementations may introduce additional arm names.

const DefaultMaxChunkSections = 10_000

DefaultMaxChunkSections caps the number of heading-aware sections a single record can contribute to the index when the caller hasn't overridden it. 10,000 is generous for legitimate technical documents (few real specs exceed a few hundred headings) while still preventing a pathological or hostile body from expanding into millions of embedder calls and chunk/vector rows.

const DefaultSearchLimit = 10

DefaultSearchLimit is the hit cap applied to Snapshot.Search and Snapshot.SearchVector when SearchParams.Limit / VectorSearchQuery.Limit is zero or negative. The choice is conservative; pick an explicit Limit if throughput matters or if the caller needs a stable shortlist size across snapshots.

Variables

var ErrUnsupportedSchemaVersion = errors.New("unsupported snapshot schema version")

ErrUnsupportedSchemaVersion is returned when an operation encounters a snapshot whose schema_version is neither the current schema nor one the library knows how to migrate from. It is surfaced by OpenSnapshot and wrapped via fmt.Errorf with %w so callers can use errors.Is to detect it.

var ErrUpdateCommittedIntegrityCheckFailed = errors.New("update committed but post-commit integrity check failed")

ErrUpdateCommittedIntegrityCheckFailed signals that Update's transaction committed successfully — the record, chunk, and metadata changes are durable on disk — but the post-commit PRAGMA integrity_check / foreign_key_check reported corruption. The enclosing error wraps this sentinel via fmt.Errorf with %w so callers can use errors.Is to detect it. This case is non-retriable: re-running Update will not unroll the already-durable changes, and the underlying file likely needs operator inspection (see index/ARCHITECTURE.md). Contrast with plain errors returned by Update, which come from pre-commit failures and leave the file byte-identical to its pre-call state.

Functions

This section is empty.

Types

type ArmEvidence

type ArmEvidence struct {
	// Rank is the zero-based position of the hit within the arm.
	Rank int
	// Score is the arm-native score at the time the arm returned the hit
	// (cosine derivative for vector, negative bm25 for FTS).
	Score float64
}

ArmEvidence is one arm's contribution to a fused hit.

type BuildOptions

type BuildOptions struct {
	// Path is the OS-native filesystem path where the built snapshot
	// is written. On Windows both forward and back slashes are
	// accepted — the store package normalizes drive prefixes on open.
	Path string

	// ReuseFromPath points at an existing Stroma snapshot whose embeddings
	// should be reused at the section level: a new section reuses its
	// stored embedding whenever its title, heading, and body match a
	// section already present in the prior snapshot. Records that are
	// fully unchanged are the maximal case, but sections carried over
	// from an edited record still reuse their embeddings. The snapshot is
	// opened read-only and queried per-record during the rebuild, so
	// resident memory scales with a single record's chunks rather than
	// with the whole corpus. Leave empty to disable reuse.
	ReuseFromPath string

	Embedder embed.Embedder

	// Contextualizer optionally produces a per-chunk prefix string that
	// gets prepended before the embedding text and the FTS5 content. When
	// set, the prefix persists on the chunk and participates in reuse
	// keying so a changed contextualizer invalidates stale reuse without
	// corrupting the stored representation. Nil disables contextualization
	// and leaves the build identical to the non-contextual path.
	Contextualizer ChunkContextualizer

	// MaxChunkTokens sets the approximate maximum number of tokens (words)
	// per chunk. Sections that exceed this limit are split into smaller
	// sub-sections. Zero disables token-budget splitting.
	MaxChunkTokens int

	// ChunkOverlapTokens sets the approximate number of overlapping tokens
	// between adjacent sub-sections when a section is split. Zero disables
	// overlap.
	ChunkOverlapTokens int

	// MaxChunkSections caps how many sections any single record is allowed
	// to produce. A pathological Markdown body (e.g., 10^6 heading lines)
	// would otherwise translate into 10^6 embedder calls and 10^6
	// chunk/vector rows — a DoS vector for shared embedders. Zero means
	// DefaultMaxChunkSections; a negative value disables the cap for
	// callers who have their own upstream validation. When the cap is
	// exceeded, Rebuild returns an error wrapping chunk.ErrTooManySections
	// instead of silently admitting the record.
	MaxChunkSections int

	// Quantization controls the vector storage format. See the
	// store.Quantization* constants for the accept-listed values:
	// store.QuantizationFloat32 (default), store.QuantizationInt8 (4x
	// smaller, minor precision loss), and store.QuantizationBinary
	// (32x smaller via 1-bit sign packing, full-precision rescore on a
	// companion table preserves ranking).
	Quantization string

	// ChunkPolicy selects the chunking strategy. Nil defaults to
	// chunk.MarkdownPolicy{Options: chunk.Options{
	//     MaxTokens: MaxChunkTokens,
	//     OverlapTokens: ChunkOverlapTokens,
	//     MaxSections: <resolved>,
	// }}, which reproduces the pre-1.0 chunking pipeline exactly.
	// Setting a non-nil policy overrides the per-Build chunking shape:
	// MaxChunkTokens/ChunkOverlapTokens/MaxChunkSections are read by
	// the default MarkdownPolicy but ignored when ChunkPolicy is set
	// (the policy carries its own configuration). Hierarchical
	// policies like chunk.LateChunkPolicy emit parent + leaf chunks
	// linked via parent_chunk_id; ExpandContext can surface the
	// parent on demand. See docs/superpowers/specs for the design.
	ChunkPolicy chunk.Policy
}

BuildOptions controls how a Stroma index is rebuilt.

type BuildResult

type BuildResult struct {
	Path                string
	RecordCount         int
	ChunkCount          int
	ReusedRecordCount   int
	ReusedChunkCount    int
	EmbeddedChunkCount  int
	EmbedderDimension   int
	EmbedderFingerprint string
	ContentFingerprint  string
}

BuildResult summarizes a completed rebuild.

func Rebuild

func Rebuild(ctx context.Context, records []corpus.Record, options BuildOptions) (*BuildResult, error)

Rebuild atomically recreates the index at the requested path.

type ChunkContextualizer

type ChunkContextualizer interface {
	ContextualizeChunks(ctx context.Context, record corpus.Record, sections []chunk.Section) ([]string, error)
}

ChunkContextualizer produces a short explanatory prefix for each section of a record. The returned slice must be the same length as sections and aligned with it index-for-index. An empty prefix is allowed and disables contextual retrieval for that section. The returned prefix is prepended to the embedding text and to the FTS5 content column; it is persisted so reuse keying can detect when a changed contextualizer needs to invalidate the stored embedding.

type ContextOptions

type ContextOptions struct {
	// IncludeParent walks the requested chunk's parent_chunk_id one level
	// up and includes the parent row in the returned slice when the chunk
	// has a parent. Multi-level ancestry walks are explicit recursion by
	// the caller.
	//
	// Against snapshots built before schema v5 (#16), there is no
	// parent_chunk_id column to walk; IncludeParent is a no-op.
	IncludeParent bool

	// NeighborWindow includes up to N sibling chunks on each side of the
	// requested chunk, ordered by chunk_index. Two chunks are siblings
	// when they share the same parent_chunk_id (NULL counts as a single
	// sibling group), so for a leaf the neighborhood stays inside the
	// same parent span and for a flat or parent chunk the neighborhood
	// is other top-level chunks under the same record. Zero means no
	// neighbors are included; the requested chunk is still returned by
	// itself.
	//
	// Against snapshots built before schema v5 (#16), the parent grouping
	// is unavailable, so neighbors degrade to "other chunks in the same
	// record_ref with chunk_index in the requested window."
	NeighborWindow int
}

ContextOptions controls how Snapshot.ExpandContext widens a single chunk hit into a local-context payload.

type FusionStrategy

type FusionStrategy interface {
	Fuse(arms []RetrievalArm, limit int) ([]SearchHit, error)
}

FusionStrategy combines one or more RetrievalArms into a single ranked list, truncated to limit. Implementations must be deterministic and must attach HitProvenance to every returned hit covering each arm that contributed.

Fuse returns an error when inputs are malformed (for example Available=true with a non-nil Err, or an arm with an empty Name) or when the strategy fails closed on an upstream arm error. Callers treat errors the same way as any other retrieval failure. Strategies that want to tolerate partial-arm failures do so internally and return a nil error.

Aliasing contract: implementations must treat each input arm's Hits slice and every SearchHit it contains as read-only. They must not mutate Hit fields, must not mutate a Hit's Metadata map (which may alias storage shared across arms when the same ChunkID matched on more than one retrieval path), and must return a freshly allocated []SearchHit rather than repurposing an input arm's slice.

func DefaultFusion

func DefaultFusion() FusionStrategy

DefaultFusion returns the FusionStrategy used when SearchQuery.Fusion is nil. Ordering is identical to pre-#17 Snapshot.Search on every path, and SearchHit.Score is identical on every path except one: when the vector arm returns zero hits and the FTS arm is non-empty, DefaultFusion preserves the bm25-derived arm-native Score instead of the pre-#17 RRF-rewritten score. Callers who read Score on that specific path can recover both the arm-native and pre-#17-style scores via the HitProvenance attached to each hit.

type HitProvenance

type HitProvenance struct {
	Arms map[string]ArmEvidence
}

HitProvenance records which arms found a fused hit. The map is keyed by arm name; arms that did not return the hit are absent from the map.

type RRFFusion

type RRFFusion struct {
	K                      int
	PreserveSingleArmScore bool
}

RRFFusion is the default FusionStrategy. K controls the RRF constant; K<=0 is treated as K=60 for backward compatibility with the pre-#17 mergeRRF helper.

PreserveSingleArmScore controls the single-arm degenerate case. When true (the default used by DefaultFusion) and exactly one arm is available-and-non-empty, Fuse returns that arm's hits in arm order with arm-native Score preserved. When false, Fuse rewrites Score to the RRF-derived 1/(K+rank+1) on every path. Callers that want numerically uniform fused scores across single-arm and multi-arm paths opt in by setting this to false.

func (RRFFusion) Fuse

func (r RRFFusion) Fuse(arms []RetrievalArm, limit int) ([]SearchHit, error)

Fuse implements FusionStrategy. See RRFFusion for the single-arm contract. Ties in RRF score are broken by (more contributing arms first) then (better cross-arm rank first).

type RecordQuery

type RecordQuery struct {
	Refs  []string
	Kinds []string
}

RecordQuery filters records from an opened snapshot.

type Reranker

type Reranker interface {
	Rerank(ctx context.Context, query string, candidates []SearchHit) ([]SearchHit, error)
}

Reranker optionally refines the search candidate shortlist before the final limit truncation.

Aliasing contract: implementations must treat the input candidates slice and every SearchHit it contains as read-only. They must not mutate Hit fields, must not mutate a Hit's Metadata map (which may alias storage shared with other hits), and must not return the input slice — return a freshly allocated []SearchHit instead. Snapshot.Search defensively shallow-copies the candidates slice before handing it to the reranker, but that copy is shallow so maps and sub-slices inside each SearchHit remain shared. Reorderings and truncations are fine; mutations are not.

type RetrievalArm

type RetrievalArm struct {
	Name      string
	Hits      []SearchHit
	Available bool
	Err       error
}

RetrievalArm is one candidate list from one retrieval path, ordered by the arm's own ranking. Hits[i].Score is the arm-native score (cosine distance derivative for vector, negative bm25-equivalent for FTS).

Available and Err distinguish three otherwise identical-looking states:

  • Available=true, Err=nil, len(Hits)==0: arm ran, zero matches.
  • Available=false, Err=nil: arm unavailable on this snapshot (for example a legacy snapshot without fts_chunks). Hits must be empty.
  • Available=false, Err!=nil: arm failed. Hits must be empty.

Available=true with a non-nil Err is invalid; FusionStrategy implementations should return an error when they observe it.

type SearchHit

type SearchHit struct {
	ChunkID   int64
	Ref       string
	Kind      string
	Title     string
	SourceRef string
	Heading   string
	Content   string
	Metadata  map[string]string
	Score     float64
	// Provenance records which retrieval arms contributed to this hit.
	// It is populated by FusionStrategy implementations; non-fusion paths
	// (SearchVector, direct searchFTS callers) leave it nil.
	Provenance *HitProvenance
}

SearchHit is one retrieved section.

func Search

func Search(ctx context.Context, query SearchQuery) ([]SearchHit, error)

Search returns semantically close sections from an existing index.

type SearchParams

type SearchParams struct {
	// Text is the free-form query text. Empty rejects with a
	// "search text is required" error — this field has no default.
	Text string
	// Limit caps the number of SearchHits returned. Zero or negative
	// selects DefaultSearchLimit (10). Pass an explicit Limit when
	// throughput matters or when a downstream consumer needs a stable
	// shortlist size across snapshots.
	Limit int
	// Kinds filters candidate records to the supplied kind list. Nil
	// or empty means "no filter, all kinds".
	Kinds []string
	// Embedder produces the query vector(s) used by the dense arm.
	// Nil rejects with a "search embedder is required" error — this
	// field has no default.
	Embedder embed.Embedder
	// Fusion optionally overrides the hybrid fusion strategy. Nil
	// uses DefaultFusion().
	Fusion FusionStrategy
	// Reranker optionally refines the candidate shortlist after
	// fusion. Nil skips reranking.
	Reranker Reranker

	// SearchDimension optionally runs a truncated-prefix vector prefilter
	// at this dimension, then rescores the shortlist with full-dim cosine.
	// Zero (default) uses the full stored dimension throughout. Positive
	// values must be <= the stored embedder dimension. Only valid when the
	// stored quantization is float32; returns an error against int8 indexes.
	// This is the shape Matryoshka Representation Learning (MRL) embeddings
	// rely on — callers who use non-MRL embeddings should leave it zero.
	//
	// The truncated path is a brute-force scan over chunks_vec, not a
	// vec0 kNN MATCH, so it is not asymptotically cheaper than the default
	// path: its win is constant-factor (fewer floats per cosine) and only
	// pays off when the truncated prefix preserves ranking. Treat this as
	// a tuning knob for MRL snapshots rather than a blanket speedup.
	SearchDimension int
}

SearchParams are the retrieval parameters shared by SearchQuery (the top-level one-shot API against an index path) and SnapshotSearchQuery (the long-lived API against an open Snapshot). Extracting the shared shape lets downstream adapters thread one value through both surfaces and lets the top-level Search forward its params verbatim instead of hand-copying six fields.

type SearchQuery

type SearchQuery struct {
	// Path is the OS-native filesystem path to the snapshot. On
	// Windows both forward and back slashes are accepted — the store
	// package normalizes drive prefixes on open.
	Path string
	SearchParams
}

SearchQuery defines one semantic search against an index path. Retrieval parameters live on the embedded SearchParams so the same shape flows through Search, Snapshot.Search, and any downstream adapter wrapper.

type Section

type Section struct {
	ChunkID       int64
	Ref           string
	Kind          string
	Title         string
	SourceRef     string
	Heading       string
	Content       string
	ContextPrefix string
	Metadata      map[string]string
	Embedding     []float64
}

Section is one stored section from a Stroma snapshot.

type SectionQuery

type SectionQuery struct {
	Refs  []string
	Kinds []string

	// IncludeEmbeddings asks Sections() to populate Section.Embedding
	// from the stored vector column. Snapshots produced by hierarchical
	// policies (e.g., chunk.LateChunkPolicy) hold parent rows that are
	// storage-only context with no vector — those rows are filtered
	// out of an IncludeEmbeddings = true query because the underlying
	// chunks → chunks_vec join is inner. Set IncludeEmbeddings = false
	// to receive every chunk row (parents + leaves) without embeddings.
	IncludeEmbeddings bool
}

SectionQuery filters sections from an opened snapshot.

type Snapshot

type Snapshot struct {
	// contains filtered or unexported fields
}

Snapshot is one opened Stroma index snapshot.

Safe for concurrent use by multiple goroutines once returned from OpenSnapshot: *sql.DB is goroutine-safe per the database/sql contract, and all Snapshot read methods (Stats, Records, Sections, Search, SearchVector, ExpandContext) invoke it through that contract. Cached metadata fields (quantization, storedDimension, hasFTS, …) are populated at open time and read-only thereafter, so no additional synchronization is required around Snapshot itself.

func OpenSnapshot

func OpenSnapshot(ctx context.Context, path string) (*Snapshot, error)

OpenSnapshot opens a read-only Stroma snapshot at path. The path is OS-native; on Windows both forward and back slashes are accepted (the store package normalizes drive prefixes on open). The snapshot's schema_version metadata must be one of the accept-listed versions — schemaVersion (current), prevSchemaVersion, legacySchemaVersionV3, or legacySchemaVersionV2 — all of which read paths can decode directly without forcing an Update. Anything else returns ErrUnsupportedSchemaVersion wrapped with the observed version, so callers can surface a clear upgrade/downgrade message instead of silently misdecoding data against a future schema.

The returned *Snapshot is safe for concurrent use by multiple goroutines once constructed: *sql.DB is goroutine-safe per the database/sql contract, and Snapshot's cached metadata fields are populated at open time and read-only thereafter.

func (*Snapshot) Close

func (s *Snapshot) Close() error

Close releases the opened snapshot handle.

func (*Snapshot) ExpandContext

func (s *Snapshot) ExpandContext(ctx context.Context, chunkID int64, opts ContextOptions) ([]Section, error)

ExpandContext returns the chunk identified by chunkID together with the caller-requested local context, in document order:

[parent (if IncludeParent and the chunk has one), neighbors before,
 the chunk itself, neighbors after]

The chunk itself is always included, so callers do not have to reconcile the original SearchHit with the expansion. Embeddings are never populated by ExpandContext — the API is for context retrieval, not for re-ranking against fresh vectors. Callers that need embeddings should use Sections() with IncludeEmbeddings = true.

Returns an empty slice and a nil error when chunkID does not exist; the substrate treats "no such chunk" as an empty result rather than an error, matching the section-read APIs.

Against snapshots built before schema v5 (#16), the v5 lineage column is absent: IncludeParent becomes a no-op and NeighborWindow scopes by record_ref alone (no parent grouping). ExpandContext stays useful on legacy files; it just cannot surface lineage that was never recorded.

Internally ExpandContext issues a small bounded number of parameterized reads: at most one to locate the requested chunk, one to fetch the parent (when IncludeParent + parent_chunk_id present), and one range scan over the sibling window. There is no per-result parameter expansion (no `WHERE id IN (?, ?, ?, ...)`), so the query never approaches SQLite's parameter cap regardless of NeighborWindow.

func (*Snapshot) Path

func (s *Snapshot) Path() string

Path returns the opened snapshot path.

func (*Snapshot) Records

func (s *Snapshot) Records(ctx context.Context, query RecordQuery) ([]corpus.Record, error)

Records returns records from the opened snapshot.

func (*Snapshot) Search

func (s *Snapshot) Search(ctx context.Context, query SnapshotSearchQuery) ([]SearchHit, error)

Search runs a hybrid text search (vector + FTS5) against the opened snapshot.

func (*Snapshot) SearchVector

func (s *Snapshot) SearchVector(ctx context.Context, query VectorSearchQuery) ([]SearchHit, error)

SearchVector runs a vector search against the opened snapshot.

func (*Snapshot) Sections

func (s *Snapshot) Sections(ctx context.Context, query SectionQuery) ([]Section, error)

Sections returns sections from the opened snapshot.

func (*Snapshot) Stats

func (s *Snapshot) Stats(ctx context.Context) (*Stats, error)

Stats inspects the opened snapshot.

type SnapshotSearchQuery

type SnapshotSearchQuery struct {
	SearchParams
}

SnapshotSearchQuery defines one text search against an opened snapshot. Retrieval parameters live on the embedded SearchParams so the same value can be forwarded verbatim from SearchQuery.SearchParams without hand-copying fields.

type Stats

type Stats struct {
	Path                string
	RecordCount         int
	ChunkCount          int
	KindCounts          map[string]int
	SchemaVersion       string
	EmbedderDimension   int
	EmbedderFingerprint string
	ContentFingerprint  string
	CreatedAt           string
}

Stats describes a built Stroma index.

func ReadStats

func ReadStats(ctx context.Context, path string) (*Stats, error)

ReadStats inspects an existing index.

type UpdateOptions

type UpdateOptions struct {
	// Path is the OS-native filesystem path to the existing snapshot
	// to update in place. On Windows both forward and back slashes are
	// accepted — the store package normalizes drive prefixes on open.
	Path     string
	Embedder embed.Embedder

	// Contextualizer optionally produces a per-chunk prefix string. See
	// BuildOptions.Contextualizer for the contract. Leaving it nil
	// preserves the non-contextual path and produces chunks with an
	// empty persisted prefix.
	Contextualizer ChunkContextualizer

	// MaxChunkTokens sets the approximate maximum number of tokens (words)
	// per chunk. It should match the chunking policy used to build the current
	// index if callers want incremental updates to remain section-compatible.
	MaxChunkTokens int

	// ChunkOverlapTokens sets the approximate number of overlapping tokens
	// between adjacent sub-sections when a section is split. It should match
	// the chunking policy used to build the current index.
	ChunkOverlapTokens int

	// MaxChunkSections mirrors BuildOptions.MaxChunkSections for the
	// incremental-update path. Zero → DefaultMaxChunkSections; negative
	// → no cap.
	MaxChunkSections int

	// Quantization, when provided, must match the existing index — see
	// the store.Quantization* constants (float32, int8, binary) for the
	// accept-listed values. Leaving it empty reuses the stored
	// quantization metadata.
	Quantization string

	// ChunkPolicy mirrors BuildOptions.ChunkPolicy for the incremental
	// update path. Nil defaults to chunk.MarkdownPolicy with the
	// MaxChunkTokens / ChunkOverlapTokens / MaxChunkSections knobs
	// resolved here. The substrate does not enforce that the policy
	// matches the one used to build the snapshot — callers who switch
	// policies between Build and Update should expect reuse cache
	// misses on the affected sections (the leaves still re-embed
	// correctly; the snapshot just won't share embeddings across
	// rebuilds).
	ChunkPolicy chunk.Policy
}

UpdateOptions controls how an existing Stroma index is updated in place.

type UpdateResult

type UpdateResult struct {
	Path                string
	UpsertedCount       int
	RemovedCount        int
	RecordCount         int
	ChunkCount          int
	ReusedRecordCount   int
	ReusedChunkCount    int
	EmbeddedChunkCount  int
	EmbedderDimension   int
	EmbedderFingerprint string
	ContentFingerprint  string
}

UpdateResult summarizes one incremental update.

func Update

func Update(ctx context.Context, added []corpus.Record, removed []string, options UpdateOptions) (*UpdateResult, error)

Update applies add, replace, and remove operations to an existing Stroma index without rebuilding it from scratch.

type VectorSearchQuery

type VectorSearchQuery struct {
	// Embedding is the precomputed query vector. Empty rejects with
	// a "search embedding is required" error — this field has no
	// default.
	Embedding []float64
	// Limit caps the number of SearchHits returned. Zero or negative
	// selects DefaultSearchLimit (10).
	Limit int
	// Kinds filters candidate records to the supplied kind list. Nil
	// or empty means "no filter, all kinds".
	Kinds []string
}

VectorSearchQuery defines one vector search against an opened snapshot.
