Documentation
Overview ¶
Package index orchestrates atomic Stroma index rebuilds and searches.
Index ¶
- Constants
- Variables
- type BuildOptions
- type BuildResult
- type ChunkContextualizer
- type ContextOptions
- type RecordQuery
- type Reranker
- type SearchHit
- type SearchQuery
- type Section
- type SectionQuery
- type Snapshot
- func (s *Snapshot) Close() error
- func (s *Snapshot) ExpandContext(ctx context.Context, chunkID int64, opts ContextOptions) ([]Section, error)
- func (s *Snapshot) Path() string
- func (s *Snapshot) Records(ctx context.Context, query RecordQuery) ([]corpus.Record, error)
- func (s *Snapshot) Search(ctx context.Context, query SnapshotSearchQuery) ([]SearchHit, error)
- func (s *Snapshot) SearchVector(ctx context.Context, query VectorSearchQuery) ([]SearchHit, error)
- func (s *Snapshot) Sections(ctx context.Context, query SectionQuery) ([]Section, error)
- func (s *Snapshot) Stats(ctx context.Context) (*Stats, error)
- type SnapshotSearchQuery
- type Stats
- type UpdateOptions
- type UpdateResult
- type VectorSearchQuery
Constants ¶
const DefaultMaxChunkSections = 10_000
DefaultMaxChunkSections caps the number of heading-aware sections a single record can contribute to the index when the caller hasn't overridden it. 10,000 is generous for legitimate technical documents (few real specs exceed a few hundred headings) while still preventing a pathological or hostile body from expanding into millions of embedder calls and index rows.
Variables ¶
var ErrUnsupportedSchemaVersion = errors.New("unsupported snapshot schema version")
ErrUnsupportedSchemaVersion is returned when an operation encounters a snapshot whose schema_version is neither the current schema nor one the library knows how to migrate from. It is surfaced by OpenSnapshot and wrapped via fmt.Errorf with %w so callers can use errors.Is to detect it.
var ErrUpdateCommittedIntegrityCheckFailed = errors.New("update committed but post-commit integrity check failed")
ErrUpdateCommittedIntegrityCheckFailed signals that Update's transaction committed successfully — the record, chunk, and metadata changes are durable on disk — but the post-commit PRAGMA integrity_check / foreign_key_check reported corruption. The enclosing error wraps this sentinel via fmt.Errorf with %w so callers can use errors.Is to detect it. This case is non-retriable: re-running Update will not unroll the already-durable changes, and the underlying file likely needs operator inspection (see index/ARCHITECTURE.md). Contrast with plain errors returned by Update, which come from pre-commit failures and leave the file byte-identical to its pre-call state.
Functions ¶
This section is empty.
Types ¶
type BuildOptions ¶
type BuildOptions struct {
Path string
// ReuseFromPath points at an existing Stroma snapshot whose embeddings
// should be reused at the section level: a new section reuses its
// stored embedding whenever its title, heading, and body match a
// section already present in the prior snapshot. Records that are
// fully unchanged are the maximal case, but sections carried over
// from an edited record still reuse their embeddings. The snapshot is
// opened read-only and queried per-record during the rebuild, so
// resident memory scales with a single record's chunks rather than
// with the whole corpus. Leave empty to disable reuse.
ReuseFromPath string
Embedder embed.Embedder
// Contextualizer optionally produces a per-chunk prefix string that
// gets prepended before the embedding text and the FTS5 content. When
// set, the prefix persists on the chunk and participates in reuse
// keying so a changed contextualizer invalidates stale reuse without
// corrupting the stored representation. Nil disables contextualization
// and leaves the build identical to the non-contextual path.
Contextualizer ChunkContextualizer
// MaxChunkTokens sets the approximate maximum number of tokens (words)
// per chunk. Sections that exceed this limit are split into smaller
// sub-sections. Zero disables token-budget splitting.
MaxChunkTokens int
// ChunkOverlapTokens sets the approximate number of overlapping tokens
// between adjacent sub-sections when a section is split. Zero disables
// overlap.
ChunkOverlapTokens int
// MaxChunkSections caps how many sections any single record is allowed
// to produce. A pathological Markdown body (e.g., 10^6 heading lines)
// would otherwise translate into 10^6 embedder calls and 10^6
// chunk/vector rows — a DoS vector for shared embedders. Zero means
// DefaultMaxChunkSections; a negative value disables the cap for
// callers who have their own upstream validation. When the cap is
// exceeded, Rebuild returns an error wrapping chunk.ErrTooManySections
// instead of silently admitting the record.
MaxChunkSections int
// Quantization controls the vector storage format. See the
// store.Quantization* constants for the accept-listed values:
// store.QuantizationFloat32 (default), store.QuantizationInt8 (4x
// smaller, minor precision loss), and store.QuantizationBinary
// (32x smaller via 1-bit sign packing, full-precision rescore on a
// companion table preserves ranking).
Quantization string
// ChunkPolicy selects the chunking strategy. Nil defaults to
// chunk.MarkdownPolicy{Options: chunk.Options{
// MaxTokens: MaxChunkTokens,
// OverlapTokens: ChunkOverlapTokens,
// MaxSections: <resolved>,
// }}, which reproduces the pre-1.0 chunking pipeline exactly.
// Setting a non-nil policy overrides the per-Build chunking shape:
// MaxChunkTokens/ChunkOverlapTokens/MaxChunkSections are read by
// the default MarkdownPolicy but ignored when ChunkPolicy is set
// (the policy carries its own configuration). Hierarchical
// policies like chunk.LateChunkPolicy emit parent + leaf chunks
// linked via parent_chunk_id; ExpandContext can surface the
// parent on demand. See docs/superpowers/specs for the design.
ChunkPolicy chunk.Policy
}
BuildOptions controls how a Stroma index is rebuilt.
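The store.QuantizationBinary option above relies on 1-bit sign packing. A self-contained sketch of that idea (not the package's actual storage code), showing where the 32x figure comes from:

```go
package main

import "fmt"

// packSigns packs each component's sign into one bit: the bit is set when
// the component is non-negative. 32 float32 components (128 bytes)
// collapse into one uint32 word (4 bytes) — the 32x reduction quoted for
// binary quantization. Ranking is then preserved by a full-precision
// rescore against a companion table, per the BuildOptions comment.
func packSigns(vec []float32) []uint32 {
	words := make([]uint32, (len(vec)+31)/32)
	for i, v := range vec {
		if v >= 0 {
			words[i/32] |= 1 << (i % 32)
		}
	}
	return words
}

func main() {
	vec := []float32{0.5, -1.2, 3.4, -0.1}
	fmt.Printf("%04b\n", packSigns(vec)[0]) // bits 0 and 2 set
}
```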
type BuildResult ¶
type BuildResult struct {
Path string
RecordCount int
ChunkCount int
ReusedRecordCount int
ReusedChunkCount int
EmbeddedChunkCount int
EmbedderDimension int
EmbedderFingerprint string
ContentFingerprint string
}
BuildResult summarizes a completed rebuild.
func Rebuild ¶
func Rebuild(ctx context.Context, records []corpus.Record, options BuildOptions) (*BuildResult, error)
Rebuild atomically recreates the index at the requested path.
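A hedged usage sketch (records, embedder, and the snapshot path are assumed to come from elsewhere; the field values are illustrative, not recommendations):

```go
// Sketch only: records ([]corpus.Record) and embedder (embed.Embedder)
// are assumed to exist in the caller's scope.
result, err := index.Rebuild(ctx, records, index.BuildOptions{
	Path:               "corpus.stroma",
	ReuseFromPath:      "corpus.stroma", // reuse section embeddings from the prior snapshot
	Embedder:           embedder,
	MaxChunkTokens:     400,
	ChunkOverlapTokens: 40,
})
if err != nil {
	return err
}
log.Printf("rebuilt %d records into %d chunks (%d embedded, %d reused)",
	result.RecordCount, result.ChunkCount,
	result.EmbeddedChunkCount, result.ReusedChunkCount)
```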
type ChunkContextualizer ¶ added in v1.0.0
type ChunkContextualizer interface {
ContextualizeChunks(ctx context.Context, record corpus.Record, sections []chunk.Section) ([]string, error)
}
ChunkContextualizer produces a short explanatory prefix for each section of a record. The returned slice must be the same length as sections and aligned with it index-for-index. An empty prefix is allowed and disables contextual retrieval for that section. The returned prefix is prepended to the embedding text and to the FTS5 content column; it is persisted so reuse keying can detect when a changed contextualizer needs to invalidate the stored embedding.
type ContextOptions ¶ added in v1.0.0
type ContextOptions struct {
// IncludeParent walks the requested chunk's parent_chunk_id one level
// up and includes the parent row in the returned slice when the chunk
// has a parent. Multi-level ancestry walks are explicit recursion by
// the caller.
//
// Against snapshots built before schema v5 (#16), there is no
// parent_chunk_id column to walk; IncludeParent is a no-op.
IncludeParent bool
// NeighborWindow includes up to N sibling chunks on each side of the
// requested chunk, ordered by chunk_index. Two chunks are siblings
// when they share the same parent_chunk_id (NULL counts as a single
// sibling group), so for a leaf the neighborhood stays inside the
// same parent span and for a flat or parent chunk the neighborhood
// is other top-level chunks under the same record. Zero means no
// neighbors are included; the requested chunk is still returned by
// itself.
//
// Against snapshots built before schema v5 (#16), the parent grouping
// is unavailable, so neighbors degrade to "other chunks in the same
// record_ref with chunk_index in the requested window."
NeighborWindow int
}
ContextOptions controls how Snapshot.ExpandContext widens a single chunk hit into a local-context payload.
type RecordQuery ¶
RecordQuery filters records from an opened snapshot.
type Reranker ¶ added in v0.4.0
type Reranker interface {
Rerank(ctx context.Context, query string, candidates []SearchHit) ([]SearchHit, error)
}
Reranker optionally refines a search's candidate shortlist before the final limit truncation.
type SearchHit ¶
type SearchHit struct {
ChunkID int64
Ref string
Kind string
Title string
SourceRef string
Heading string
Content string
Metadata map[string]string
Score float64
}
SearchHit is one retrieved section.
type SearchQuery ¶
type SearchQuery struct {
Path string
Text string
Limit int
Kinds []string
Embedder embed.Embedder
Reranker Reranker
// SearchDimension optionally runs a truncated-prefix vector prefilter
// at this dimension, then rescores the shortlist with full-dim cosine.
// Zero (default) uses the full stored dimension throughout. Positive
// values must be <= the stored embedder dimension. Only valid when the
// stored quantization is float32; returns an error against int8 indexes.
// This is the shape Matryoshka Representation Learning (MRL) embeddings
// rely on — callers who use non-MRL embeddings should leave it zero.
//
// The truncated path is a brute-force scan over chunks_vec, not a
// vec0 kNN MATCH, so it is not asymptotically cheaper than the default
// path: its win is constant-factor (fewer floats per cosine) and only
// pays off when the truncated prefix preserves ranking. Treat this as
// a tuning knob for MRL snapshots rather than a blanket speedup.
SearchDimension int
}
SearchQuery defines one semantic search.
type Section ¶
type Section struct {
ChunkID int64
Ref string
Kind string
Title string
SourceRef string
Heading string
Content string
ContextPrefix string
Metadata map[string]string
Embedding []float64
}
Section is one stored section from a Stroma snapshot.
type SectionQuery ¶
type SectionQuery struct {
Refs []string
Kinds []string
// IncludeEmbeddings asks Sections() to populate Section.Embedding
// from the stored vector column. Snapshots produced by hierarchical
// policies (e.g., chunk.LateChunkPolicy) hold parent rows that are
// storage-only context with no vector — those rows are filtered
// out of an IncludeEmbeddings = true query because the underlying
// chunks → chunks_vec join is inner. Set IncludeEmbeddings = false
// to receive every chunk row (parents + leaves) without embeddings.
IncludeEmbeddings bool
}
SectionQuery filters sections from an opened snapshot.
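A hedged usage sketch of the two-pass pattern the IncludeEmbeddings comment suggests (the snapshot handle and ref value are illustrative):

```go
// First pass: every chunk row for the record, parents and leaves alike.
all, err := snap.Sections(ctx, index.SectionQuery{Refs: []string{"doc-1"}})
if err != nil {
	return err
}
// Second pass: embeddings populated; parent rows without vectors are
// filtered out here because the chunks -> chunks_vec join is inner.
withVecs, err := snap.Sections(ctx, index.SectionQuery{
	Refs:              []string{"doc-1"},
	IncludeEmbeddings: true,
})
```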
type Snapshot ¶
type Snapshot struct {
// contains filtered or unexported fields
}
Snapshot is one opened Stroma index snapshot.
func OpenSnapshot ¶
OpenSnapshot opens a read-only Stroma snapshot. The snapshot's schema_version metadata must be one of the accept-listed versions — schemaVersion (current), prevSchemaVersion, legacySchemaVersionV3, or legacySchemaVersionV2 — all of which read paths can decode directly without forcing an Update. Anything else returns ErrUnsupportedSchemaVersion wrapped with the observed version, so callers can surface a clear upgrade/downgrade message instead of silently misdecoding data against a future schema.
func (*Snapshot) ExpandContext ¶ added in v1.0.0
func (s *Snapshot) ExpandContext(ctx context.Context, chunkID int64, opts ContextOptions) ([]Section, error)
ExpandContext returns the chunk identified by chunkID together with the caller-requested local context, in document order:
[parent (if IncludeParent and the chunk has one), neighbors before, the chunk itself, neighbors after]
The chunk itself is always included, so callers do not have to reconcile the original SearchHit with the expansion. Embeddings are never populated by ExpandContext — the API is for context retrieval, not for re-ranking against fresh vectors. Callers that need embeddings should use Sections() with IncludeEmbeddings = true.
Returns an empty slice and a nil error when chunkID does not exist; the substrate treats "no such chunk" as an empty result rather than an error, matching the section-read APIs.
Against snapshots built before schema v5 (#16), the v5 lineage column is absent: IncludeParent becomes a no-op and NeighborWindow scopes by record_ref alone (no parent grouping). ExpandContext stays useful on legacy files; it just cannot surface lineage that was never recorded.
Internally ExpandContext issues a small bounded number of parameterized reads: at most one to locate the requested chunk, one to fetch the parent (when IncludeParent + parent_chunk_id present), and one range scan over the sibling window. There is no per-result parameter expansion (no `WHERE id IN (?, ?, ?, ...)`), so the query never approaches SQLite's parameter cap regardless of NeighborWindow.
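A hedged usage sketch (the snapshot handle and hit are assumed to come from an earlier Search call):

```go
// Widen a hit into local context: the parent (if any) plus one sibling
// chunk on each side, in document order.
sections, err := snap.ExpandContext(ctx, hit.ChunkID, index.ContextOptions{
	IncludeParent:  true,
	NeighborWindow: 1,
})
if err != nil {
	return err
}
if len(sections) == 0 {
	// chunkID no longer exists: empty result, not an error.
}
```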
func (*Snapshot) Search ¶
func (s *Snapshot) Search(ctx context.Context, query SnapshotSearchQuery) ([]SearchHit, error)
Search runs a hybrid text search (vector + FTS5) against the opened snapshot.
func (*Snapshot) SearchVector ¶
func (s *Snapshot) SearchVector(ctx context.Context, query VectorSearchQuery) ([]SearchHit, error)
SearchVector runs a vector search against the opened snapshot.
type SnapshotSearchQuery ¶
type SnapshotSearchQuery struct {
Text string
Limit int
Kinds []string
Embedder embed.Embedder
Reranker Reranker
// SearchDimension optionally runs a truncated-prefix vector prefilter
// at this dimension, then rescores the shortlist with full-dim cosine.
// See SearchQuery.SearchDimension for the full contract.
SearchDimension int
}
SnapshotSearchQuery defines one text search against an opened snapshot.
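A hedged usage sketch against an opened snapshot (field values are illustrative; SearchDimension assumes an MRL-trained embedder and a float32 snapshot, per the SearchQuery.SearchDimension contract):

```go
// Hybrid search scoped to two kinds, with a truncated-prefix vector
// prefilter at 256 dims followed by a full-dimension cosine rescore.
hits, err := snap.Search(ctx, index.SnapshotSearchQuery{
	Text:            "rotate signing keys",
	Limit:           8,
	Kinds:           []string{"spec", "runbook"},
	Embedder:        embedder,
	SearchDimension: 256,
})
```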
type Stats ¶
type Stats struct {
Path string
RecordCount int
ChunkCount int
KindCounts map[string]int
SchemaVersion string
EmbedderDimension int
EmbedderFingerprint string
ContentFingerprint string
CreatedAt string
}
Stats describes a built Stroma index.
type UpdateOptions ¶ added in v0.4.0
type UpdateOptions struct {
Path string
Embedder embed.Embedder
// Contextualizer optionally produces a per-chunk prefix string. See
// BuildOptions.Contextualizer for the contract. Leaving it nil
// preserves the non-contextual path and produces chunks with an
// empty persisted prefix.
Contextualizer ChunkContextualizer
// MaxChunkTokens sets the approximate maximum number of tokens (words)
// per chunk. It should match the chunking policy used to build the current
// index if callers want incremental updates to remain section-compatible.
MaxChunkTokens int
// ChunkOverlapTokens sets the approximate number of overlapping tokens
// between adjacent sub-sections when a section is split. It should match
// the chunking policy used to build the current index.
ChunkOverlapTokens int
// MaxChunkSections mirrors BuildOptions.MaxChunkSections for the
// incremental-update path. Zero means DefaultMaxChunkSections; a
// negative value disables the cap.
MaxChunkSections int
// Quantization, when provided, must match the existing index — see
// the store.Quantization* constants (float32, int8, binary) for the
// accept-listed values. Leaving it empty reuses the stored
// quantization metadata.
Quantization string
// ChunkPolicy mirrors BuildOptions.ChunkPolicy for the incremental
// update path. Nil defaults to chunk.MarkdownPolicy with the
// MaxChunkTokens / ChunkOverlapTokens / MaxChunkSections knobs
// resolved here. The substrate does not enforce that the policy
// matches the one used to build the snapshot — callers who switch
// policies between Build and Update should expect reuse cache
// misses on the affected sections (the leaves still re-embed
// correctly; the snapshot just won't share embeddings across
// rebuilds).
ChunkPolicy chunk.Policy
}
UpdateOptions controls how an existing Stroma index is updated in place.
type UpdateResult ¶ added in v0.4.0
type UpdateResult struct {
Path string
UpsertedCount int
RemovedCount int
RecordCount int
ChunkCount int
ReusedRecordCount int
ReusedChunkCount int
EmbeddedChunkCount int
EmbedderDimension int
EmbedderFingerprint string
ContentFingerprint string
}
UpdateResult summarizes one incremental update.
func Update ¶ added in v0.4.0
func Update(ctx context.Context, added []corpus.Record, removed []string, options UpdateOptions) (*UpdateResult, error)
Update applies add, replace, and remove operations to an existing Stroma index without rebuilding it from scratch.
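A hedged usage sketch that also exercises the sentinel contract documented under Variables (inputs are illustrative):

```go
// Apply an incremental delta: upsert changed records, drop removed refs.
res, err := index.Update(ctx, changed, []string{"doc-gone"}, index.UpdateOptions{
	Path:     "corpus.stroma",
	Embedder: embedder,
})
if err != nil {
	if errors.Is(err, index.ErrUpdateCommittedIntegrityCheckFailed) {
		// Durable but suspect: do not retry; the file needs inspection.
		return err
	}
	// Pre-commit failure: the file is byte-identical to its pre-call state.
	return err
}
log.Printf("upserted %d, removed %d, re-embedded %d chunks",
	res.UpsertedCount, res.RemovedCount, res.EmbeddedChunkCount)
```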
type VectorSearchQuery ¶
VectorSearchQuery defines one vector search against an opened snapshot.