index

package
v1.0.0 Latest
Warning: this package is not in the latest version of its module.
Published: Apr 18, 2026 License: MIT Imports: 18 Imported by: 0

Documentation

Overview

Package index orchestrates atomic Stroma index rebuilds and searches.

Index

Constants

View Source
const DefaultMaxChunkSections = 10_000

DefaultMaxChunkSections caps the number of heading-aware sections a single record can contribute to the index when the caller hasn't overridden it. 10,000 is generous for legitimate technical documents (few real specs exceed a few hundred headings) while still preventing a pathological or hostile body from expanding into millions of embedder calls and rows.

Variables

View Source
var ErrUnsupportedSchemaVersion = errors.New("unsupported snapshot schema version")

ErrUnsupportedSchemaVersion is returned when an operation encounters a snapshot whose schema_version is neither the current schema nor one the library knows how to migrate from. It is surfaced by OpenSnapshot and wrapped via fmt.Errorf with %w so callers can use errors.Is to detect it.

View Source
var ErrUpdateCommittedIntegrityCheckFailed = errors.New("update committed but post-commit integrity check failed")

ErrUpdateCommittedIntegrityCheckFailed signals that Update's transaction committed successfully — the record, chunk, and metadata changes are durable on disk — but the post-commit PRAGMA integrity_check / foreign_key_check reported corruption. The enclosing error wraps this sentinel via fmt.Errorf with %w so callers can use errors.Is to detect it. This case is non-retriable: re-running Update will not roll back the already-durable changes, and the underlying file likely needs operator inspection (see index/ARCHITECTURE.md). Contrast with plain errors returned by Update, which come from pre-commit failures and leave the file byte-identical to its pre-call state.

Functions

This section is empty.

Types

type BuildOptions

type BuildOptions struct {
	Path string

	// ReuseFromPath points at an existing Stroma snapshot whose embeddings
	// should be reused at the section level: a new section reuses its
	// stored embedding whenever its title, heading, and body match a
	// section already present in the prior snapshot. Records that are
	// fully unchanged are the maximal case, but sections carried over
	// from an edited record still reuse their embeddings. The snapshot is
	// opened read-only and queried per-record during the rebuild, so
	// resident memory scales with a single record's chunks rather than
	// with the whole corpus. Leave empty to disable reuse.
	ReuseFromPath string

	Embedder embed.Embedder

	// Contextualizer optionally produces a per-chunk prefix string that
	// gets prepended before the embedding text and the FTS5 content. When
	// set, the prefix persists on the chunk and participates in reuse
	// keying so a changed contextualizer invalidates stale reuse without
	// corrupting the stored representation. Nil disables contextualization
	// and leaves the build identical to the non-contextual path.
	Contextualizer ChunkContextualizer

	// MaxChunkTokens sets the approximate maximum number of tokens (words)
	// per chunk. Sections that exceed this limit are split into smaller
	// sub-sections. Zero disables token-budget splitting.
	MaxChunkTokens int

	// ChunkOverlapTokens sets the approximate number of overlapping tokens
	// between adjacent sub-sections when a section is split. Zero disables
	// overlap.
	ChunkOverlapTokens int

	// MaxChunkSections caps how many sections any single record is allowed
	// to produce. A pathological Markdown body (e.g., 10^6 heading lines)
	// would otherwise translate into 10^6 embedder calls and 10^6
	// chunk/vector rows — a DoS vector for shared embedders. Zero means
	// DefaultMaxChunkSections; a negative value disables the cap for
	// callers who have their own upstream validation. When the cap is
	// exceeded, Rebuild returns an error wrapping chunk.ErrTooManySections
	// instead of silently admitting the record.
	MaxChunkSections int

	// Quantization controls the vector storage format. See the
	// store.Quantization* constants for the accept-listed values:
	// store.QuantizationFloat32 (default), store.QuantizationInt8 (4x
	// smaller, minor precision loss), and store.QuantizationBinary
	// (32x smaller via 1-bit sign packing, full-precision rescore on a
	// companion table preserves ranking).
	Quantization string

	// ChunkPolicy selects the chunking strategy. Nil defaults to
	// chunk.MarkdownPolicy{Options: chunk.Options{
	//     MaxTokens: MaxChunkTokens,
	//     OverlapTokens: ChunkOverlapTokens,
	//     MaxSections: <resolved>,
	// }}, which reproduces the pre-1.0 chunking pipeline exactly.
	// Setting a non-nil policy overrides the per-Build chunking shape:
	// MaxChunkTokens/ChunkOverlapTokens/MaxChunkSections are read by
	// the default MarkdownPolicy but ignored when ChunkPolicy is set
	// (the policy carries its own configuration). Hierarchical
	// policies like chunk.LateChunkPolicy emit parent + leaf chunks
	// linked via parent_chunk_id; ExpandContext can surface the
	// parent on demand. See docs/superpowers/specs for the design.
	ChunkPolicy chunk.Policy
}

BuildOptions controls how a Stroma index is rebuilt.
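The MaxChunkSections zero/negative semantics can be sketched as a small resolver. resolveMaxChunkSections is a hypothetical name for illustration, not the package's internal helper.

```go
package main

import "fmt"

const defaultMaxChunkSections = 10_000

// resolveMaxChunkSections sketches the documented semantics: zero
// falls back to DefaultMaxChunkSections, a negative value disables
// the cap entirely, and a positive value is used as-is. It returns
// the effective limit and whether any cap is enforced.
func resolveMaxChunkSections(v int) (limit int, enforced bool) {
	switch {
	case v == 0:
		return defaultMaxChunkSections, true
	case v < 0:
		return 0, false
	default:
		return v, true
	}
}

func main() {
	for _, v := range []int{0, -1, 250} {
		limit, enforced := resolveMaxChunkSections(v)
		fmt.Printf("in=%d limit=%d enforced=%v\n", v, limit, enforced)
	}
}
```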

type BuildResult

type BuildResult struct {
	Path                string
	RecordCount         int
	ChunkCount          int
	ReusedRecordCount   int
	ReusedChunkCount    int
	EmbeddedChunkCount  int
	EmbedderDimension   int
	EmbedderFingerprint string
	ContentFingerprint  string
}

BuildResult summarizes a completed rebuild.

func Rebuild

func Rebuild(ctx context.Context, records []corpus.Record, options BuildOptions) (*BuildResult, error)

Rebuild atomically recreates the index at the requested path.

type ChunkContextualizer added in v1.0.0

type ChunkContextualizer interface {
	ContextualizeChunks(ctx context.Context, record corpus.Record, sections []chunk.Section) ([]string, error)
}

ChunkContextualizer produces a short explanatory prefix for each section of a record. The returned slice must be the same length as sections and aligned with it index-for-index. An empty prefix is allowed and disables contextual retrieval for that section. The returned prefix is prepended to the embedding text and to the FTS5 content column; it is persisted so reuse keying can detect when a changed contextualizer needs to invalidate the stored embedding.

type ContextOptions added in v1.0.0

type ContextOptions struct {
	// IncludeParent walks the requested chunk's parent_chunk_id one level
	// up and includes the parent row in the returned slice when the chunk
	// has a parent. Multi-level ancestry walks are explicit recursion by
	// the caller.
	//
	// Against snapshots built before schema v5 (#16), there is no
	// parent_chunk_id column to walk; IncludeParent is a no-op.
	IncludeParent bool

	// NeighborWindow includes up to N sibling chunks on each side of the
	// requested chunk, ordered by chunk_index. Two chunks are siblings
	// when they share the same parent_chunk_id (NULL counts as a single
	// sibling group), so for a leaf the neighborhood stays inside the
	// same parent span and for a flat or parent chunk the neighborhood
	// is other top-level chunks under the same record. Zero means no
	// neighbors are included; the requested chunk is still returned by
	// itself.
	//
	// Against snapshots built before schema v5 (#16), the parent grouping
	// is unavailable, so neighbors degrade to "other chunks in the same
	// record_ref with chunk_index in the requested window."
	NeighborWindow int
}

ContextOptions controls how Snapshot.ExpandContext widens a single chunk hit into a local-context payload.

type RecordQuery

type RecordQuery struct {
	Refs  []string
	Kinds []string
}

RecordQuery filters records from an opened snapshot.

type Reranker added in v0.4.0

type Reranker interface {
	Rerank(ctx context.Context, query string, candidates []SearchHit) ([]SearchHit, error)
}

Reranker optionally refines the search-candidate shortlist before the final limit truncation.

type SearchHit

type SearchHit struct {
	ChunkID   int64
	Ref       string
	Kind      string
	Title     string
	SourceRef string
	Heading   string
	Content   string
	Metadata  map[string]string
	Score     float64
}

SearchHit is one retrieved section.

func Search

func Search(ctx context.Context, query SearchQuery) ([]SearchHit, error)

Search returns semantically close sections from an existing index.

type SearchQuery

type SearchQuery struct {
	Path     string
	Text     string
	Limit    int
	Kinds    []string
	Embedder embed.Embedder
	Reranker Reranker

	// SearchDimension optionally runs a truncated-prefix vector prefilter
	// at this dimension, then rescores the shortlist with full-dim cosine.
	// Zero (default) uses the full stored dimension throughout. Positive
	// values must be <= the stored embedder dimension. Only valid when the
	// stored quantization is float32; returns an error against int8 indexes.
	// This is the shape Matryoshka Representation Learning (MRL) embeddings
	// rely on — callers who use non-MRL embeddings should leave it zero.
	//
	// The truncated path is a brute-force scan over chunks_vec, not a
	// vec0 kNN MATCH, so it is not asymptotically cheaper than the default
	// path: its win is constant-factor (fewer floats per cosine) and only
	// pays off when the truncated prefix preserves ranking. Treat this as
	// a tuning knob for MRL snapshots rather than a blanket speedup.
	SearchDimension int
}

SearchQuery defines one semantic search.
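The truncated-prefix-then-rescore shape behind SearchDimension can be sketched in plain Go. This is a conceptual model only (function names are invented, and the real path scans chunks_vec inside SQLite): rank all vectors by cosine over the first dim components, keep a shortlist, then rescore the shortlist at full dimension.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosinePrefix computes cosine similarity over the first dim
// components of a and b (dim <= len of both).
func cosinePrefix(a, b []float64, dim int) float64 {
	var dot, na, nb float64
	for i := 0; i < dim; i++ {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// prefilterRescore sketches the two-stage search: a brute-force
// truncated-prefix ranking picks a shortlist, and a full-dimension
// rescore over that shortlist decides the winner. Returns the index
// of the best vector.
func prefilterRescore(query []float64, vecs [][]float64, dim, shortlist int) int {
	order := make([]int, len(vecs))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(i, j int) bool {
		return cosinePrefix(query, vecs[order[i]], dim) >
			cosinePrefix(query, vecs[order[j]], dim)
	})
	if shortlist > len(order) {
		shortlist = len(order)
	}
	best, bestScore := -1, math.Inf(-1)
	for _, idx := range order[:shortlist] {
		if s := cosinePrefix(query, vecs[idx], len(query)); s > bestScore {
			best, bestScore = idx, s
		}
	}
	return best
}

func main() {
	query := []float64{1, 0, 1, 0}
	vecs := [][]float64{
		{1, 0, -1, 0}, // strong on the 2-dim prefix, weak at full dim
		{1, 0, 1, 0},  // exact match
		{0, 1, 0, 1},  // orthogonal
	}
	// The full-dim rescore recovers the exact match even though the
	// 2-dim prefix ties it with the decoy.
	fmt.Println(prefilterRescore(query, vecs, 2, 2)) // 1
}
```

This also shows why the win is constant-factor: the prefilter still touches every vector, just with fewer floats per cosine.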

type Section

type Section struct {
	ChunkID       int64
	Ref           string
	Kind          string
	Title         string
	SourceRef     string
	Heading       string
	Content       string
	ContextPrefix string
	Metadata      map[string]string
	Embedding     []float64
}

Section is one stored section from a Stroma snapshot.

type SectionQuery

type SectionQuery struct {
	Refs  []string
	Kinds []string

	// IncludeEmbeddings asks Sections() to populate Section.Embedding
	// from the stored vector column. Snapshots produced by hierarchical
	// policies (e.g., chunk.LateChunkPolicy) hold parent rows that are
	// storage-only context with no vector — those rows are filtered
	// out of an IncludeEmbeddings = true query because the underlying
	// chunks → chunks_vec join is inner. Set IncludeEmbeddings = false
	// to receive every chunk row (parents + leaves) without embeddings.
	IncludeEmbeddings bool
}

SectionQuery filters sections from an opened snapshot.

type Snapshot

type Snapshot struct {
	// contains filtered or unexported fields
}

Snapshot is one opened Stroma index snapshot.

func OpenSnapshot

func OpenSnapshot(ctx context.Context, path string) (*Snapshot, error)

OpenSnapshot opens a read-only Stroma snapshot. The snapshot's schema_version metadata must be one of the accept-listed versions — schemaVersion (current), prevSchemaVersion, legacySchemaVersionV3, or legacySchemaVersionV2 — all of which read paths can decode directly without forcing an Update. Anything else returns ErrUnsupportedSchemaVersion wrapped with the observed version, so callers can surface a clear upgrade/downgrade message instead of silently misdecoding data against a future schema.

func (*Snapshot) Close

func (s *Snapshot) Close() error

Close releases the opened snapshot handle.

func (*Snapshot) ExpandContext added in v1.0.0

func (s *Snapshot) ExpandContext(ctx context.Context, chunkID int64, opts ContextOptions) ([]Section, error)

ExpandContext returns the chunk identified by chunkID together with the caller-requested local context, in document order:

[parent (if IncludeParent and the chunk has one), neighbors before,
 the chunk itself, neighbors after]

The chunk itself is always included, so callers do not have to reconcile the original SearchHit with the expansion. Embeddings are never populated by ExpandContext — the API is for context retrieval, not for re-ranking against fresh vectors. Callers that need embeddings should use Sections() with IncludeEmbeddings = true.

Returns an empty slice and a nil error when chunkID does not exist; the substrate treats "no such chunk" as an empty result rather than an error, matching the section-read APIs.

Against snapshots built before schema v5 (#16), the v5 lineage column is absent: IncludeParent becomes a no-op and NeighborWindow scopes by record_ref alone (no parent grouping). ExpandContext stays useful on legacy files; it just cannot surface lineage that was never recorded.

Internally ExpandContext issues a small bounded number of parameterized reads: at most one to locate the requested chunk, one to fetch the parent (when IncludeParent + parent_chunk_id present), and one range scan over the sibling window. There is no per-result parameter expansion (no `WHERE id IN (?, ?, ?, ...)`), so the query never approaches SQLite's parameter cap regardless of NeighborWindow.
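The document-order contract can be sketched as a windowing operation over siblings sorted by chunk_index. chunkRow and expandWindow below are local stand-ins for illustration; parent handling is omitted (a real call would prepend the parent row when IncludeParent is set).

```go
package main

import "fmt"

// chunkRow is a local stand-in: a chunk with its position among
// siblings (chunk_index) within one parent group.
type chunkRow struct {
	ID    int64
	Index int
}

// expandWindow sketches ExpandContext's ordering: given siblings
// sorted by Index, return [neighbors before, the chunk itself,
// neighbors after], up to window rows on each side. An unknown
// chunk yields an empty result, not an error.
func expandWindow(siblings []chunkRow, target int64, window int) []chunkRow {
	pos := -1
	for i, c := range siblings {
		if c.ID == target {
			pos = i
			break
		}
	}
	if pos == -1 {
		return nil
	}
	lo, hi := pos-window, pos+window
	if lo < 0 {
		lo = 0
	}
	if hi > len(siblings)-1 {
		hi = len(siblings) - 1
	}
	return siblings[lo : hi+1]
}

func main() {
	sibs := []chunkRow{{10, 0}, {11, 1}, {12, 2}, {13, 3}, {14, 4}}
	for _, c := range expandWindow(sibs, 12, 1) {
		fmt.Println(c.ID) // 11, 12, 13
	}
	fmt.Println(len(expandWindow(sibs, 99, 1))) // 0: missing chunk → empty
}
```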

func (*Snapshot) Path

func (s *Snapshot) Path() string

Path returns the opened snapshot path.

func (*Snapshot) Records

func (s *Snapshot) Records(ctx context.Context, query RecordQuery) ([]corpus.Record, error)

Records returns records from the opened snapshot.

func (*Snapshot) Search

func (s *Snapshot) Search(ctx context.Context, query SnapshotSearchQuery) ([]SearchHit, error)

Search runs a hybrid text search (vector + FTS5) against the opened snapshot.

func (*Snapshot) SearchVector

func (s *Snapshot) SearchVector(ctx context.Context, query VectorSearchQuery) ([]SearchHit, error)

SearchVector runs a vector search against the opened snapshot.

func (*Snapshot) Sections

func (s *Snapshot) Sections(ctx context.Context, query SectionQuery) ([]Section, error)

Sections returns sections from the opened snapshot.

func (*Snapshot) Stats

func (s *Snapshot) Stats(ctx context.Context) (*Stats, error)

Stats inspects the opened snapshot.

type SnapshotSearchQuery

type SnapshotSearchQuery struct {
	Text     string
	Limit    int
	Kinds    []string
	Embedder embed.Embedder
	Reranker Reranker

	// SearchDimension optionally runs a truncated-prefix vector prefilter
	// at this dimension, then rescores the shortlist with full-dim cosine.
	// See SearchQuery.SearchDimension for the full contract.
	SearchDimension int
}

SnapshotSearchQuery defines one text search against an opened snapshot.

type Stats

type Stats struct {
	Path                string
	RecordCount         int
	ChunkCount          int
	KindCounts          map[string]int
	SchemaVersion       string
	EmbedderDimension   int
	EmbedderFingerprint string
	ContentFingerprint  string
	CreatedAt           string
}

Stats describes a built Stroma index.

func ReadStats

func ReadStats(ctx context.Context, path string) (*Stats, error)

ReadStats inspects an existing index.

type UpdateOptions added in v0.4.0

type UpdateOptions struct {
	Path     string
	Embedder embed.Embedder

	// Contextualizer optionally produces a per-chunk prefix string. See
	// BuildOptions.Contextualizer for the contract. Leaving it nil
	// preserves the non-contextual path and produces chunks with an
	// empty persisted prefix.
	Contextualizer ChunkContextualizer

	// MaxChunkTokens sets the approximate maximum number of tokens (words)
	// per chunk. It should match the chunking policy used to build the current
	// index if callers want incremental updates to remain section-compatible.
	MaxChunkTokens int

	// ChunkOverlapTokens sets the approximate number of overlapping tokens
	// between adjacent sub-sections when a section is split. It should match
	// the chunking policy used to build the current index.
	ChunkOverlapTokens int

	// MaxChunkSections mirrors BuildOptions.MaxChunkSections for the
	// incremental-update path. Zero → DefaultMaxChunkSections; negative
	// → no cap.
	MaxChunkSections int

	// Quantization, when provided, must match the existing index — see
	// the store.Quantization* constants (float32, int8, binary) for the
	// accept-listed values. Leaving it empty reuses the stored
	// quantization metadata.
	Quantization string

	// ChunkPolicy mirrors BuildOptions.ChunkPolicy for the incremental
	// update path. Nil defaults to chunk.MarkdownPolicy with the
	// MaxChunkTokens / ChunkOverlapTokens / MaxChunkSections knobs
	// resolved here. The substrate does not enforce that the policy
	// matches the one used to build the snapshot — callers who switch
	// policies between Build and Update should expect reuse cache
	// misses on the affected sections (the leaves still re-embed
	// correctly; the snapshot just won't share embeddings across
	// rebuilds).
	ChunkPolicy chunk.Policy
}

UpdateOptions controls how an existing Stroma index is updated in place.

type UpdateResult added in v0.4.0

type UpdateResult struct {
	Path                string
	UpsertedCount       int
	RemovedCount        int
	RecordCount         int
	ChunkCount          int
	ReusedRecordCount   int
	ReusedChunkCount    int
	EmbeddedChunkCount  int
	EmbedderDimension   int
	EmbedderFingerprint string
	ContentFingerprint  string
}

UpdateResult summarizes one incremental update.

func Update added in v0.4.0

func Update(ctx context.Context, added []corpus.Record, removed []string, options UpdateOptions) (*UpdateResult, error)

Update applies add, replace, and remove operations to an existing Stroma index without rebuilding it from scratch.

type VectorSearchQuery

type VectorSearchQuery struct {
	Embedding []float64
	Limit     int
	Kinds     []string
}

VectorSearchQuery defines one vector search against an opened snapshot.
