search

package
v0.6.0 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 17, 2026 License: MIT Imports: 13 Imported by: 0

Documentation


Constants

This section is empty.

Variables

This section is empty.

Functions

func DeleteLSASnapshot added in v0.5.0

func DeleteLSASnapshot(dir, tenantID string) error

DeleteLSASnapshot removes a tenant's on-disk LSA snapshot (<dir>/<tenant>.lsa). A missing file is not an error. Call this on tenant deletion so LoadAll doesn't resurrect the deleted tenant's index on the next restart. Mirrors LoadAll's filename sanitization so the path matches what SaveToFile wrote.
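The "missing file is not an error" behavior can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: removeSnapshot and demo are hypothetical names, and the tenant name is assumed to be already sanitized because the actual sanitization rules are not documented here.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// removeSnapshot deletes <dir>/<name>.lsa and treats a missing file as success.
// name is assumed to be already sanitized (hypothetical precondition).
func removeSnapshot(dir, name string) error {
	err := os.Remove(filepath.Join(dir, name+".lsa"))
	if err != nil && !errors.Is(err, fs.ErrNotExist) {
		return fmt.Errorf("remove snapshot %q: %w", name, err)
	}
	return nil
}

// demo exercises both the "file exists" and the "file already gone" paths.
func demo() error {
	dir, err := os.MkdirTemp("", "lsa-delete")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "acme.lsa"), []byte("x"), 0o600); err != nil {
		return err
	}
	if err := removeSnapshot(dir, "acme"); err != nil {
		return err
	}
	// Second call: the file is gone, which must not be reported as an error.
	return removeSnapshot(dir, "acme")
}

func main() {
	fmt.Println("demo error:", demo())
}
```

On tenant deletion the on-disk removal would be paired with the in-memory Delete calls so that nothing is reloaded on restart.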

Types

type Document

type Document struct {
	ID    uint64 // graphdb NodeID
	Title string // optional; amplified by cfg.TitleBoost
	Body  string
}

Document is the input shape for BuildLSAIndex.

Title is amplified in the index by LSAConfig.TitleBoost (default 3×); pass an empty string when no title is meaningful. Body is the raw content — any YAML frontmatter or leading H1 line is stripped internally before indexing.
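Title amplification by repetition can be pictured as follows. This is only a sketch of the idea stated in the TitleBoost field comment ("times to repeat Title"); boostedText is a hypothetical helper, and the real indexer may tokenize before repeating rather than concatenating strings.

```go
package main

import (
	"fmt"
	"strings"
)

// boostedText repeats title boost times ahead of body so that title terms
// contribute proportionally more term frequency than body terms.
func boostedText(title, body string, boost int) string {
	if title == "" || boost <= 0 {
		return body
	}
	return strings.Repeat(title+" ", boost) + body
}

func main() {
	fmt.Println(boostedText("Graph search", "notes on ranking", 3))
}
```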

type FullTextIndex

type FullTextIndex struct {
	// contains filtered or unexported fields
}

FullTextIndex provides full-text search over indexed node properties.

func NewFullTextIndex

func NewFullTextIndex(gs *storage.GraphStorage) *FullTextIndex

NewFullTextIndex creates a new full-text search index backed by gs.

func (*FullTextIndex) IndexNodes

func (fti *FullTextIndex) IndexNodes(labels []string, properties []string) error

IndexNodes indexes all nodes with specified labels and properties.

CROSS-TENANT / not for request paths: it reads nodes via FindNodesByLabelAcrossTenants, so it indexes every tenant's nodes into one index. The live search API uses the per-tenant IndexForTenant (pkg/search/tenant_indexes.go); IndexNodes is test/CLI-only. Do NOT wire it to a tenant-scoped request path; use IndexForTenant instead. (Tenant-isolation sweep F3.)

func (*FullTextIndex) IndexPrepared

func (fti *FullTextIndex) IndexPrepared(nodes []*storage.Node, labels, properties []string) error

IndexPrepared indexes a pre-collected set of nodes under the given labels and properties. Use when the caller has scoped the node set (e.g. to a single tenant) that IndexNodes's label-based storage lookup can't express. Replaces any existing entries for these nodes.

func (*FullTextIndex) NodeContent

func (fti *FullTextIndex) NodeContent(id uint64) (string, bool)

NodeContent returns the concatenated indexed text for the given NodeID, or ("", false) if the node is not currently indexed. Read-locks the index so callers can safely read during concurrent updates. Exposed for snippet generation at query time (handlers reach into the index rather than hitting storage for every result).

func (*FullTextIndex) Search

func (fti *FullTextIndex) Search(query string) ([]SearchResult, error)

Search performs a basic text search. A multi-word query is treated as an AND of its terms.

func (*FullTextIndex) SearchBoolean

func (fti *FullTextIndex) SearchBoolean(query string) ([]SearchResult, error)

SearchBoolean performs a boolean search with the AND, OR, and NOT operators.

func (*FullTextIndex) SearchFuzzy

func (fti *FullTextIndex) SearchFuzzy(query string, maxDistance int) ([]SearchResult, error)

SearchFuzzy performs a fuzzy search that tolerates terms within maxDistance edits of the query.
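For intuition about the edit-distance tolerance, the sketch below computes Levenshtein distance and applies a maxDistance cutoff. The documentation does not say which edit distance the index uses, so Levenshtein is an assumption here, and editDistance and fuzzyMatch are hypothetical names.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// editDistance returns the Levenshtein distance between a and b, counted in
// runes, using two rolling rows of the dynamic-programming table.
func editDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// fuzzyMatch reports whether term is within maxDistance edits of query.
func fuzzyMatch(term, query string, maxDistance int) bool {
	return editDistance(term, query) <= maxDistance
}

func main() {
	fmt.Println(editDistance("kitten", "sitting"), fuzzyMatch("grape", "graph", 1))
}
```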

func (*FullTextIndex) SearchInProperty

func (fti *FullTextIndex) SearchInProperty(property string, query string) ([]SearchResult, error)

SearchInProperty restricts the search to a single property.

func (*FullTextIndex) SearchPhrase

func (fti *FullTextIndex) SearchPhrase(phrase string) ([]SearchResult, error)

SearchPhrase searches for an exact phrase.

func (*FullTextIndex) SearchTopK

func (fti *FullTextIndex) SearchTopK(query string, k int) ([]SearchResult, error)

SearchTopK returns the top k results for query, ranked by TF-IDF score. Unlike Search, only the top-k candidates are hydrated via GetNode — scoring uses only in-memory posting data, so queries that hit many candidates don't pay an LSM round-trip per hit. Pass k <= 0 for the all-results behavior (equivalent to Search).

func (*FullTextIndex) UpdateNode

func (fti *FullTextIndex) UpdateNode(nodeID uint64) error

UpdateNode updates the index for a specific node.

CROSS-TENANT / not for request paths: it resolves the node via the tenant-blind GetNode (no tenant validation), so a request path must NOT pass it an untrusted nodeID. Manual/test maintenance only. (Tenant-isolation sweep F3.)

type HybridHit

type HybridHit struct {
	NodeID  uint64
	Score   float64
	FTSRank int
	LSARank int
	FTSNode *storage.Node
}

HybridHit is one RRF-merged candidate. FTSRank and LSARank expose the per-stage rank so callers can see why a candidate scored where it did; -1 indicates the stage did not return this candidate.

FTSNode is the storage node when the FTS stage returned it. LSA-only candidates have FTSNode == nil; the caller is responsible for hydrating them via storage if needed (we don't pull *storage.GraphStorage into pkg/search just for that — it's a caller-side concern).

type HybridSearchOpts

type HybridSearchOpts struct {
	// OverFetchK is the per-stage candidate pool size. Caller
	// typically sets it to ~3× the desired final result count to
	// give the RRF merge enough overlap to discriminate.
	OverFetchK int

	// Alpha weights FTS vs LSA contribution. 0.5 is balanced.
	// 1.0 = FTS only, 0.0 = LSA only. Out-of-range values are
	// clamped to [0, 1].
	Alpha float64
}

HybridSearchOpts configures the RRF-merged hybrid search.

Audit F2 #2 (2026-05-08): factored from pkg/api/handlers_hybrid_search.go's inline RRF composition so non-handler callers (notably pkg/retrieval/ for F2 GraphRAG) can invoke hybrid search without duplicating the merge logic.

type HybridSearchResult

type HybridSearchResult struct {
	Hits     []HybridHit
	Degraded string
}

HybridSearchResult bundles the merged hits with a degraded indicator. Degraded is non-empty when the hybrid path fell back to a single stage:

"no-lsa-index"             — tenant has no LSA index built
"query-out-of-vocabulary"  — query terms aren't in the LSA vocab

Note: "no-fts-match" is not surfaced here — an empty FTS top-k is not a degradation, it's just a result. The caller may choose to flag this externally based on len(Hits) and Degraded together.

func SearchHybridForTenant

func SearchHybridForTenant(
	searchIdx *TenantIndexes,
	lsaIdx *TenantLSAIndexes,
	tenantID, query string,
	opts HybridSearchOpts,
) (*HybridSearchResult, error)

SearchHybridForTenant performs RRF-merged FTS + LSA search scoped to a single tenant. The caller supplies the per-tenant index containers; this function does the per-tenant Get internally.

Returns the merged candidate list plus a degraded indicator. The caller is responsible for:

  • label / property post-filters (need storage access)
  • pagination (offset / limit beyond the candidate pool)
  • node hydration of LSA-only hits if needed

This split keeps pkg/search as the merge primitive and pushes API-shaped concerns (HTTP, storage hydration, response shape) to the caller. F2's pkg/retrieval/ uses this directly for GraphRAG seed retrieval; pkg/api/handlers_hybrid_search.go also calls it.
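The merge step can be sketched as a weighted reciprocal rank fusion (RRF). This is an illustration under stated assumptions, not the package's code: the smoothing constant rrfK = 60, zero-based ranks, and the linear Alpha weighting are all assumed, and hit and mergeRRF are hypothetical names that loosely mirror HybridHit.

```go
package main

import (
	"fmt"
	"sort"
)

// hit loosely mirrors HybridHit: a rank of -1 means the stage did not return the node.
type hit struct {
	NodeID  uint64
	Score   float64
	FTSRank int
	LSARank int
}

const rrfK = 60 // assumed smoothing constant, not taken from the package

// mergeRRF fuses two ranked ID lists. alpha weights the FTS list and
// (1 - alpha) the LSA list; out-of-range alpha is clamped to [0, 1].
func mergeRRF(fts, lsa []uint64, alpha float64) []hit {
	if alpha < 0 {
		alpha = 0
	} else if alpha > 1 {
		alpha = 1
	}
	byID := map[uint64]*hit{}
	get := func(id uint64) *hit {
		h, ok := byID[id]
		if !ok {
			h = &hit{NodeID: id, FTSRank: -1, LSARank: -1}
			byID[id] = h
		}
		return h
	}
	for rank, id := range fts {
		h := get(id)
		h.FTSRank = rank
		h.Score += alpha / float64(rrfK+rank+1)
	}
	for rank, id := range lsa {
		h := get(id)
		h.LSARank = rank
		h.Score += (1 - alpha) / float64(rrfK+rank+1)
	}
	out := make([]hit, 0, len(byID))
	for _, h := range byID {
		out = append(out, *h)
	}
	// Highest fused score first; NodeID ascending keeps ties deterministic.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].NodeID < out[j].NodeID
	})
	return out
}

func main() {
	for _, h := range mergeRRF([]uint64{1, 2, 3}, []uint64{3, 4}, 0.5) {
		fmt.Printf("%+v\n", h)
	}
}
```

A node returned by both stages accumulates two contributions, which is why over-fetching each stage gives the merge more overlap to work with.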

type LSAConfig

type LSAConfig struct {
	Dims       int   // latent dimensions after SVD (default 200)
	Oversamp   int   // extra sketch dims for SVD numerical stability (default 10)
	PowerIter  int   // power iterations in randomized SVD (default 2)
	MaxVocab   int   // hard cap on vocabulary size (default 8000)
	MinDocFreq int   // filter terms appearing in fewer docs than this (default 3)
	TitleBoost int   // times to repeat Title to amplify title-term weight (default 3)
	Seed       int64 // RNG seed for determinism (default 42)
}

LSAConfig knobs. Use DefaultLSAConfig() for the tuned values from the wiki-graph implementation this code was ported from.

func DefaultLSAConfig

func DefaultLSAConfig() LSAConfig

DefaultLSAConfig returns the config used by the wiki-graph port. These values are tuned for corpora in the 1k-10k document range.

type LSAIndex

type LSAIndex struct {
	// contains filtered or unexported fields
}

LSAIndex holds the LSA model and BM25 index built once at corpus load. After BuildLSAIndex returns, queries (FoldQuery, BM25Score) are sub-millisecond.

func BuildLSAIndex

func BuildLSAIndex(docs []Document, cfg LSAConfig) (*LSAIndex, error)

BuildLSAIndex constructs an LSA model and co-resident BM25 index from docs. Heavy linear algebra (SVD, Jacobi eigendecomposition) runs here; queries against the returned index are sub-millisecond.

Returns an error if the corpus is empty or produces fewer unique vocabulary terms than cfg.Dims (LSA cannot project into a higher-dimensional latent space than the vocabulary supports).
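A caller can pre-check a corpus before paying for the build. The sketch below counts vocabulary terms that survive a minimum document frequency and compares the count against the requested dimensions. It is a rough approximation: whether the package counts vocabulary before or after the MinDocFreq and MaxVocab filters is an assumption, documents are taken as pre-tokenized, and vocabSize and checkCorpus are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// vocabSize counts distinct terms that appear in at least minDocFreq documents.
func vocabSize(docs [][]string, minDocFreq int) int {
	df := map[string]int{}
	for _, doc := range docs {
		seen := map[string]bool{}
		for _, term := range doc {
			if !seen[term] {
				seen[term] = true
				df[term]++
			}
		}
	}
	n := 0
	for _, count := range df {
		if count >= minDocFreq {
			n++
		}
	}
	return n
}

// checkCorpus rejects an empty corpus and one whose vocabulary is smaller
// than the requested number of latent dimensions.
func checkCorpus(docs [][]string, dims, minDocFreq int) error {
	if len(docs) == 0 {
		return errors.New("empty corpus")
	}
	if v := vocabSize(docs, minDocFreq); v < dims {
		return fmt.Errorf("vocabulary of %d terms is smaller than %d dims", v, dims)
	}
	return nil
}

func main() {
	docs := [][]string{{"graph", "db"}, {"graph", "search"}}
	fmt.Println(checkCorpus(docs, 3, 1))
	fmt.Println(checkCorpus(docs, 3, 2))
}
```

For small corpora this suggests lowering Dims (and possibly MinDocFreq) from the defaults, which are described as tuned for 1k-10k documents.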

func LoadLSAFromFile

func LoadLSAFromFile(path string) (*LSAIndex, error)

LoadLSAFromFile reads an LSA index from path. If the file is absent it returns a nil index and an error wrapping os.ErrNotExist; callers should treat that as "no snapshot for this tenant yet" and fall through to the build path.
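The load-or-build fallthrough relies on errors.Is seeing through the wrapped error. The sketch below shows that pattern with plain byte slices standing in for the index; loadSnapshot, loadOrBuild and demo are hypothetical stand-ins, not package functions.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// loadSnapshot stands in for a loader that wraps the underlying file error.
func loadSnapshot(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("load LSA snapshot: %w", err)
	}
	return data, nil
}

// loadOrBuild returns the snapshot if present, builds one if the file is
// absent, and surfaces every other error (corruption, permissions) unchanged.
func loadOrBuild(path string, build func() []byte) (data []byte, source string, err error) {
	data, err = loadSnapshot(path)
	switch {
	case err == nil:
		return data, "loaded", nil
	case errors.Is(err, os.ErrNotExist):
		return build(), "built", nil
	default:
		return nil, "", err
	}
}

// demo runs the fallthrough against a path that does not exist yet.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "lsa-load")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	_, source, err := loadOrBuild(filepath.Join(dir, "acme.lsa"), func() []byte {
		return []byte("fresh index")
	})
	return source, err
}

func main() {
	fmt.Println(demo())
}
```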

func ReadLSASnapshot

func ReadLSASnapshot(r io.Reader) (*LSAIndex, error)

ReadLSASnapshot deserializes an LSAIndex from r. Returns an error if the magic or version bytes don't match — callers should treat ErrLSASnapshotVersion as "regenerate via the admin endpoint" rather than retrying or falling back.
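A header check of this kind is commonly written with a sentinel error that callers test via errors.Is. The sketch below is illustrative only: the magic bytes, the version number, the little-endian layout and all identifiers are hypothetical, and whether the package returns ErrLSASnapshotVersion for a bad magic as well as a bad version is not stated in the documentation.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

var errSnapshotVersion = errors.New("lsa snapshot: unsupported version")

var snapshotMagic = [4]byte{'L', 'S', 'A', 'X'} // hypothetical magic bytes

const snapshotVersion uint32 = 1 // hypothetical current version

// readHeader validates the fixed-size header at the start of r.
func readHeader(r io.Reader) error {
	var hdr struct {
		Magic   [4]byte
		Version uint32
	}
	if err := binary.Read(r, binary.LittleEndian, &hdr); err != nil {
		return fmt.Errorf("read header: %w", err)
	}
	if hdr.Magic != snapshotMagic {
		return errors.New("lsa snapshot: bad magic")
	}
	if hdr.Version != snapshotVersion {
		return fmt.Errorf("%w: got %d, want %d", errSnapshotVersion, hdr.Version, snapshotVersion)
	}
	return nil
}

// encodeHeader builds header bytes so the check can be exercised in memory.
func encodeHeader(magic [4]byte, version uint32) []byte {
	buf := new(bytes.Buffer)
	buf.Write(magic[:])
	binary.Write(buf, binary.LittleEndian, version)
	return buf.Bytes()
}

func checkBytes(b []byte) error { return readHeader(bytes.NewReader(b)) }

func isVersionErr(err error) bool { return errors.Is(err, errSnapshotVersion) }

func main() {
	fmt.Println(checkBytes(encodeHeader(snapshotMagic, 1)))
	fmt.Println(checkBytes(encodeHeader(snapshotMagic, 2)))
}
```

A caller that sees the version sentinel would trigger a rebuild rather than retry the read.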

func (*LSAIndex) BM25Score

func (i *LSAIndex) BM25Score(tokens []string, candidates map[uint64]bool) map[uint64]float64

BM25Score returns Okapi BM25 scores (k1=1.5, b=0.75) keyed by graphdb NodeID. Only documents that contain at least one query token appear in the result map; callers should treat missing keys as score 0.

If candidates is non-nil, scoring is restricted to NodeIDs in the set (other nodes are skipped). Pass nil to score across the full corpus.
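The scoring and candidate-restriction semantics can be sketched with a toy in-memory corpus. The k1 and b values come from the description above; the IDF variant (the "+1 inside the log" form), the corpus layout and all names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

const (
	k1 = 1.5
	b  = 0.75
)

// corpus holds per-document term frequencies and lengths (toy layout).
type corpus struct {
	tf     map[uint64]map[string]int
	docLen map[uint64]int
}

// bm25 scores documents against tokens. A nil candidates set scores the whole
// corpus; documents matching no token get no key in the result.
func bm25(tokens []string, c corpus, candidates map[uint64]bool) map[uint64]float64 {
	n := float64(len(c.tf))
	total := 0
	for _, l := range c.docLen {
		total += l
	}
	avgLen := float64(total) / n
	df := map[string]int{}
	for _, terms := range c.tf {
		for t := range terms {
			df[t]++
		}
	}
	scores := map[uint64]float64{}
	for id, terms := range c.tf {
		if candidates != nil && !candidates[id] {
			continue
		}
		for _, tok := range tokens {
			f := float64(terms[tok])
			if f == 0 {
				continue
			}
			d := float64(df[tok])
			idf := math.Log(1 + (n-d+0.5)/(d+0.5)) // assumed IDF variant
			norm := f + k1*(1-b+b*float64(c.docLen[id])/avgLen)
			scores[id] += idf * f * (k1 + 1) / norm
		}
	}
	return scores
}

func sampleCorpus() corpus {
	return corpus{
		tf: map[uint64]map[string]int{
			1: {"graph": 2, "db": 1},
			2: {"search": 1},
		},
		docLen: map[uint64]int{1: 3, 2: 1},
	}
}

func main() {
	fmt.Println(bm25([]string{"graph"}, sampleCorpus(), nil))
}
```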

func (*LSAIndex) Dimensions

func (i *LSAIndex) Dimensions() int

Dimensions returns the LSA latent dimension count (cfg.Dims).

func (*LSAIndex) DocSnippet

func (i *LSAIndex) DocSnippet(id uint64, maxLen int) string

DocSnippet returns a rune-safe truncated excerpt of the document body for presentation. maxLen is a character (rune) count, not a byte count. If maxLen <= 0 the full stored content is returned. If the document is not in the corpus, returns "".
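Rune-safe truncation differs from byte slicing because a multi-byte character must never be cut in half. A minimal sketch of the documented maxLen semantics, with snippet as a hypothetical helper name:

```go
package main

import "fmt"

// snippet truncates s to at most maxLen runes; maxLen <= 0 returns s whole.
func snippet(s string, maxLen int) string {
	if maxLen <= 0 {
		return s
	}
	r := []rune(s)
	if len(r) <= maxLen {
		return s
	}
	return string(r[:maxLen])
}

func main() {
	fmt.Println(snippet("héllo wörld", 4))
	fmt.Println(snippet("日本語のテキスト", 3))
}
```

Slicing the string by bytes instead (s[:maxLen]) could split "é" or "日" and yield invalid UTF-8.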

func (*LSAIndex) DocVector

func (i *LSAIndex) DocVector(id uint64) ([]float32, bool)

DocVector returns the L2-normalized LSA embedding for the given NodeID. The second return is false if the NodeID was not present in the corpus.

The returned slice is freshly allocated and dequantized from the int8 in-memory representation, so callers can hold it without worrying about index-state mutation. Quantization error relative to the original float32 value is at most 1/lsaQuantScale per component (~0.79%).
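The quoted ~0.79% figure matches 1/127, so the sketch below assumes a quantization scale of 127; that value, like the function names, is an inference and not taken from the source. It shows the int8 round trip and checks that the per-component error stays within 1/scale.

```go
package main

import (
	"fmt"
	"math"
)

const quantScale = 127 // assumed: 1/127 is roughly 0.79%

// quantize maps components in [-1, 1] to int8 by scaling and rounding.
func quantize(v []float32) []int8 {
	q := make([]int8, len(v))
	for i, x := range v {
		q[i] = int8(math.Round(float64(x) * quantScale))
	}
	return q
}

// dequantize returns a freshly allocated float32 copy of q.
func dequantize(q []int8) []float32 {
	v := make([]float32, len(q))
	for i, c := range q {
		v[i] = float32(c) / quantScale
	}
	return v
}

// maxAbsErr returns the largest per-component absolute difference.
func maxAbsErr(a, b []float32) float64 {
	worst := 0.0
	for i := range a {
		if d := math.Abs(float64(a[i] - b[i])); d > worst {
			worst = d
		}
	}
	return worst
}

func main() {
	v := []float32{0.6, -0.8, 0}
	q := quantize(v)
	fmt.Println(q, dequantize(q), maxAbsErr(v, dequantize(q)))
}
```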

func (*LSAIndex) FoldQuery

func (i *LSAIndex) FoldQuery(query string) (vec []float32, tokens []string, err error)

FoldQuery maps a text query into the LSA latent space (k-dim, L2-normalized) and returns the shared tokenization so callers can feed the same tokens into BM25Score without re-running the tokenizer.

Returns an error if no query term maps to a vocabulary entry (out-of-vocab query) or if the projection collapses to the zero vector.

func (*LSAIndex) NumDocs

func (i *LSAIndex) NumDocs() int

NumDocs returns the number of documents in the corpus.

func (*LSAIndex) SaveToFile

func (i *LSAIndex) SaveToFile(path string) error

SaveToFile writes the index to path atomically (write to .tmp, then rename). Same idiom as pkg/storage's snapshot to avoid leaving a half-written file if the process is killed mid-write.
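The write-to-temp-then-rename idiom can be sketched as below. writeAtomic and demo are hypothetical names, and the explicit Sync call is an extra durability step added in this sketch; the documentation does not say whether SaveToFile syncs before renaming.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeAtomic writes data to path+".tmp" and renames it over path, so a
// reader never observes a half-written file at path.
func writeAtomic(path string, data []byte) error {
	tmp := path + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		os.Remove(tmp)
		return err
	}
	if err := f.Sync(); err != nil { // assumption: flush before rename
		f.Close()
		os.Remove(tmp)
		return err
	}
	if err := f.Close(); err != nil {
		os.Remove(tmp)
		return err
	}
	return os.Rename(tmp, path)
}

// demo overwrites the same snapshot twice and returns the final contents.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "lsa-save")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "acme.lsa")
	for _, payload := range []string{"v1", "v2"} {
		if err := writeAtomic(path, []byte(payload)); err != nil {
			return "", err
		}
	}
	if _, err := os.Stat(path + ".tmp"); !os.IsNotExist(err) {
		return "", fmt.Errorf("temp file left behind")
	}
	data, err := os.ReadFile(path)
	return string(data), err
}

func main() {
	fmt.Println(demo())
}
```

The rename is atomic only when the temp file and the target live on the same filesystem, which is why the temp file is created next to the target.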

func (*LSAIndex) TopKByVector

func (i *LSAIndex) TopKByVector(qvec []float32, k int) ([]LSAResult, error)

TopKByVector returns the k documents most similar to qvec, ranked by cosine similarity descending. Ties are broken by NodeID ascending so the result is a deterministic prefix of any larger K — the same property SearchTopK maintains for paginated callers.

qvec must have the same dimensionality as the index (Dimensions()) and should be L2-normalized; FoldQuery returns vectors that satisfy both. A mismatched dimension returns an error; it's a programming bug, not a user error.

No storage I/O — operates entirely on the in-memory int8-quantized doc vectors. The dot product fuses the dequantization into the accumulator (multiply int8 component then divide once at the loop tail) so the quantization shows up as a single division per doc rather than per component.
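The ranking rules above (cosine descending, NodeID ascending on ties, a single division per document) can be sketched as follows. This is a simplified illustration: it sorts all documents instead of keeping a bounded heap, assumes a quantization scale of 127, returns everything when k <= 0 (an assumption), and uses hypothetical names throughout.

```go
package main

import (
	"fmt"
	"sort"
)

const quantScale = 127 // assumed quantization scale

type result struct {
	NodeID     uint64
	Similarity float32
}

// topK ranks int8-quantized document vectors against qvec.
func topK(docs map[uint64][]int8, qvec []float32, k int) ([]result, error) {
	out := make([]result, 0, len(docs))
	for id, dv := range docs {
		if len(dv) != len(qvec) {
			return nil, fmt.Errorf("dimension mismatch: doc %d has %d dims, query has %d", id, len(dv), len(qvec))
		}
		var acc float32
		for i, c := range dv {
			acc += float32(c) * qvec[i] // accumulate on raw int8 components
		}
		out = append(out, result{NodeID: id, Similarity: acc / quantScale}) // one division per doc
	}
	// Similarity descending, NodeID ascending on ties, so that the top-k list
	// is a deterministic prefix of any larger k.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Similarity != out[j].Similarity {
			return out[i].Similarity > out[j].Similarity
		}
		return out[i].NodeID < out[j].NodeID
	})
	if k > 0 && k < len(out) {
		out = out[:k]
	}
	return out, nil
}

func sampleDocs() map[uint64][]int8 {
	return map[uint64][]int8{1: {127, 0}, 2: {0, 127}, 3: {127, 0}}
}

func main() {
	res, err := topK(sampleDocs(), []float32{1, 0}, 2)
	fmt.Println(res, err)
}
```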

func (*LSAIndex) WriteSnapshot

func (i *LSAIndex) WriteSnapshot(w io.Writer) error

WriteSnapshot serializes the index to w in the on-disk format described at the top of this file. Caller is responsible for closing w. Holds no internal locks — callers writing a tenant snapshot should serialize against any rebuild path themselves (TenantLSAIndexes.SaveAll handles this via its RWMutex).

type LSAResult

type LSAResult struct {
	NodeID     uint64
	Similarity float32 // cosine similarity in [-1, 1] (typically [0, 1] for stored embeddings)
}

LSAResult is a ranked result from LSA semantic search.

type SearchResult

type SearchResult struct {
	NodeID uint64
	Score  float64
	Node   *storage.Node
}

SearchResult is a single search hit: the matching node and its relevance score.

type TenantIndexes

type TenantIndexes struct {
	// contains filtered or unexported fields
}

TenantIndexes holds a FullTextIndex per tenant. Each tenant's index sees only its own nodes — isolation is enforced at build time via storage.GetNodesByLabelForTenant, not by filtering a shared index.

Indexes are constructed lazily on first Get. A tenant that has never been indexed returns an empty index, which produces zero search results — the safe default.

Design note: the API server owns the TenantIndexes; the query DSL's executor keeps its own (currently unused) shared index, so DSL search() is not yet tenant-scoped. Tenant-aware DSL search is a follow-up that requires threading tenant context through the executor. For now, callers of DSL search() are internal/trusted.

func NewTenantIndexes

func NewTenantIndexes(gs *storage.GraphStorage) *TenantIndexes

NewTenantIndexes returns an empty TenantIndexes backed by gs.

func (*TenantIndexes) Delete added in v0.5.0

func (ti *TenantIndexes) Delete(tenantID string)

Delete removes a tenant's in-memory full-text index (no-op if absent). The FTS index is not persisted (admin-rebuilt via POST /search/index), so no on-disk cleanup is needed — unlike LSA.

func (*TenantIndexes) Get

func (ti *TenantIndexes) Get(tenantID string) *FullTextIndex

Get returns the FullTextIndex for tenantID, constructing one lazily on first access. Safe for concurrent use. The returned index is populated only after IndexForTenant has been called for this tenant.
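Lazy, concurrency-safe construction of a per-tenant entry is typically done with a read-locked fast path and a re-check under the write lock. The sketch below shows that pattern with a toy index type; it is not the package's implementation, and every name in it is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// index is a toy stand-in for a per-tenant full-text index.
type index struct {
	docs map[uint64]string
}

type tenantIndexes struct {
	mu sync.RWMutex
	m  map[string]*index
}

func newTenantIndexes() *tenantIndexes {
	return &tenantIndexes{m: make(map[string]*index)}
}

// get returns the tenant's index, creating an empty one on first access.
func (t *tenantIndexes) get(tenantID string) *index {
	t.mu.RLock()
	idx, ok := t.m[tenantID]
	t.mu.RUnlock()
	if ok {
		return idx
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	// Re-check: another goroutine may have created it between the two locks.
	if existing, ok := t.m[tenantID]; ok {
		return existing
	}
	idx = &index{docs: make(map[uint64]string)}
	t.m[tenantID] = idx
	return idx
}

func main() {
	ti := newTenantIndexes()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ti.get("acme")
		}()
	}
	wg.Wait()
	fmt.Println(len(ti.m), len(ti.get("acme").docs))
}
```

An untouched tenant ends up with an empty index, which yields zero results rather than leaking another tenant's data.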

func (*TenantIndexes) IndexForTenant

func (ti *TenantIndexes) IndexForTenant(tenantID string, labels, properties []string) error

IndexForTenant builds (or rebuilds) the index for tenantID from nodes that match the given labels AND belong to that tenant. Cross-tenant nodes are never passed to the index because the caller uses the tenant-scoped storage accessor.

func (*TenantIndexes) Tenants

func (ti *TenantIndexes) Tenants() []string

Tenants returns the IDs of tenants that currently have an index (whether populated or just touched via Get). Order is unspecified.

type TenantLSAIndexes

type TenantLSAIndexes struct {
	// contains filtered or unexported fields
}

TenantLSAIndexes holds a per-tenant LSAIndex. Unlike TenantIndexes (which lazily constructs an empty FullTextIndex on first Get), LSA indexes require an expensive SVD build up front — so callers are expected to build explicitly via BuildLSAIndex and register the result via Set. Get returns nil for tenants that haven't been registered, signaling "no semantic search available for this tenant yet" to callers; the /hybrid-search handler uses this to degrade gracefully to a pure-FTS response.

Not coupled to storage — LSA builds take a []Document that the caller is responsible for gathering (from any source, scoped by whatever means). This keeps the tenant scoping concern at the build-time layer, not inside this map.

func NewTenantLSAIndexes

func NewTenantLSAIndexes() *TenantLSAIndexes

NewTenantLSAIndexes returns an empty per-tenant LSA registry.

func (*TenantLSAIndexes) Delete added in v0.5.0

func (tli *TenantLSAIndexes) Delete(tenantID string)

Delete removes the in-memory LSA index for tenantID (no-op if absent). Does NOT remove the on-disk snapshot — pair with DeleteLSASnapshot on tenant deletion, or LoadAll resurrects the index on the next restart.

func (*TenantLSAIndexes) Get

func (tli *TenantLSAIndexes) Get(tenantID string) *LSAIndex

Get returns the LSA index registered for tenantID, or nil if none has been registered. Callers MUST nil-check; the zero value is a deliberate signal ("LSA not available for this tenant").

func (*TenantLSAIndexes) LoadAll

func (tli *TenantLSAIndexes) LoadAll(dir string) error

LoadAll reads every <tenantID>.lsa file in dir and registers each with the receiver. A missing dir returns nil (treat as "no snapshots yet") rather than an error; fresh deployments would otherwise fail to boot. Per-tenant decode failures are reported through the returned aggregate error but do not block other tenants from loading.

File-naming convention: filename stem is the tenant ID after the same sanitization SaveAll applies. Files whose stem doesn't survive round-trip sanitization are silently ignored (defense against hand-edited or attacker-planted files with traversal-like names).
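The round-trip filename check and the aggregate error can be sketched together. The sanitization rule used here (keep ASCII letters, digits, '-' and '_'; replace everything else with '_') is a made-up placeholder, since the real rule is not documented, and errors.Join (Go 1.20+) is one plausible way to aggregate, not necessarily what the package does.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// sanitize applies a hypothetical tenant-ID sanitization rule.
func sanitize(id string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '-', r == '_':
			return r
		}
		return '_'
	}, id)
}

// tenantFromFilename accepts only "<stem>.lsa" names whose stem is unchanged
// by sanitization, which rejects traversal-like names such as "../x.lsa".
func tenantFromFilename(name string) (string, bool) {
	if !strings.HasSuffix(name, ".lsa") {
		return "", false
	}
	stem := strings.TrimSuffix(name, ".lsa")
	if stem == "" || sanitize(stem) != stem {
		return "", false
	}
	return stem, true
}

// loadAll decodes every acceptable file, collecting failures without stopping.
func loadAll(names []string, decode func(tenant string) error) (loaded []string, err error) {
	var errs []error
	for _, name := range names {
		tenant, ok := tenantFromFilename(name)
		if !ok {
			continue // silently ignored
		}
		if derr := decode(tenant); derr != nil {
			errs = append(errs, fmt.Errorf("tenant %q: %w", tenant, derr))
			continue
		}
		loaded = append(loaded, tenant)
	}
	return loaded, errors.Join(errs...)
}

// demoLoad simulates one corrupt snapshot among valid and rejected names.
func demoLoad() ([]string, error) {
	names := []string{"a.lsa", "bad.lsa", "../x.lsa", "notes.txt", "c.lsa"}
	return loadAll(names, func(tenant string) error {
		if tenant == "bad" {
			return errors.New("corrupt snapshot")
		}
		return nil
	})
}

func main() {
	fmt.Println(demoLoad())
}
```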

func (*TenantLSAIndexes) SaveAll

func (tli *TenantLSAIndexes) SaveAll(dir string) error

SaveAll writes every tenant's LSA index to dir/<tenantID>.lsa. Tenants with no registered index are skipped (no file written, no error). Per-tenant errors are returned as a single aggregate; one tenant's failure doesn't block others. The registry's read lock is held only while the in-memory map is read, so a concurrent Set() can't race that read; the file I/O for each tenant then happens unlocked.

func (*TenantLSAIndexes) Set

func (tli *TenantLSAIndexes) Set(tenantID string, idx *LSAIndex)

Set registers idx as the LSA index for tenantID. A subsequent Set for the same tenantID replaces the prior index (supports rebuild). Set(tenantID, nil) removes the entry so callers can explicitly revoke LSA for a tenant (e.g. during corpus wipe).
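The Set/Get contract (nil removes the entry, Get returns nil when nothing is registered) can be sketched with a small mutex-guarded map. The types and the searchMode helper are hypothetical; they only illustrate how a caller might nil-check and degrade to FTS-only.

```go
package main

import (
	"fmt"
	"sync"
)

// lsaIndex is a toy stand-in for a built LSA index.
type lsaIndex struct {
	dims int
}

type registry struct {
	mu sync.RWMutex
	m  map[string]*lsaIndex
}

func newRegistry() *registry {
	return &registry{m: make(map[string]*lsaIndex)}
}

// set registers idx for tenantID; a nil idx removes the entry.
func (r *registry) set(tenantID string, idx *lsaIndex) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if idx == nil {
		delete(r.m, tenantID)
		return
	}
	r.m[tenantID] = idx
}

// get returns the registered index, or nil if the tenant has none.
func (r *registry) get(tenantID string) *lsaIndex {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.m[tenantID]
}

// searchMode shows the caller-side nil check that selects a degraded path.
func searchMode(r *registry, tenantID string) string {
	if r.get(tenantID) == nil {
		return "fts-only"
	}
	return "hybrid"
}

func main() {
	r := newRegistry()
	fmt.Println(searchMode(r, "acme"))
	r.set("acme", &lsaIndex{dims: 200})
	fmt.Println(searchMode(r, "acme"))
	r.set("acme", nil)
	fmt.Println(searchMode(r, "acme"))
}
```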

func (*TenantLSAIndexes) Tenants

func (tli *TenantLSAIndexes) Tenants() []string

Tenants returns the IDs of tenants with a registered LSA index. Order is unspecified.
