Documentation ¶
Index ¶
- type Document
- type FullTextIndex
- func (fti *FullTextIndex) IndexNodes(labels []string, properties []string) error
- func (fti *FullTextIndex) IndexPrepared(nodes []*storage.Node, labels, properties []string) error
- func (fti *FullTextIndex) NodeContent(id uint64) (string, bool)
- func (fti *FullTextIndex) Search(query string) ([]SearchResult, error)
- func (fti *FullTextIndex) SearchBoolean(query string) ([]SearchResult, error)
- func (fti *FullTextIndex) SearchFuzzy(query string, maxDistance int) ([]SearchResult, error)
- func (fti *FullTextIndex) SearchInProperty(property string, query string) ([]SearchResult, error)
- func (fti *FullTextIndex) SearchPhrase(phrase string) ([]SearchResult, error)
- func (fti *FullTextIndex) SearchTopK(query string, k int) ([]SearchResult, error)
- func (fti *FullTextIndex) UpdateNode(nodeID uint64) error
- type HybridHit
- type HybridSearchOpts
- type HybridSearchResult
- type LSAConfig
- type LSAIndex
- func (i *LSAIndex) BM25Score(tokens []string, candidates map[uint64]bool) map[uint64]float64
- func (i *LSAIndex) Dimensions() int
- func (i *LSAIndex) DocSnippet(id uint64, maxLen int) string
- func (i *LSAIndex) DocVector(id uint64) ([]float32, bool)
- func (i *LSAIndex) FoldQuery(query string) (vec []float32, tokens []string, err error)
- func (i *LSAIndex) NumDocs() int
- func (i *LSAIndex) SaveToFile(path string) error
- func (i *LSAIndex) TopKByVector(qvec []float32, k int) ([]LSAResult, error)
- func (i *LSAIndex) WriteSnapshot(w io.Writer) error
- type LSAResult
- type SearchResult
- type TenantIndexes
- type TenantLSAIndexes
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Document ¶
type Document struct {
	ID    uint64 // graphdb NodeID
	Title string // optional; amplified by cfg.TitleBoost
	Body  string
}
Document is the input shape for BuildLSAIndex.
Title is amplified in the index by LSAConfig.TitleBoost (default 3×); pass an empty string when no title is meaningful. Body is the raw content — any YAML frontmatter or leading H1 line is stripped internally before indexing.
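As a rough illustration of what repeating the title can look like, the sketch below prepends the title a fixed number of times before the body is tokenized. `boostedText` is a hypothetical helper, not part of this package, and the real indexer may combine the two fields differently.

```go
package main

import (
	"fmt"
	"strings"
)

// boostedText repeats title `boost` times ahead of body so that title terms
// carry more weight in term-frequency statistics. Hypothetical helper.
func boostedText(title, body string, boost int) string {
	if title == "" || boost <= 0 {
		return body
	}
	parts := make([]string, 0, boost+1)
	for i := 0; i < boost; i++ {
		parts = append(parts, title)
	}
	parts = append(parts, body)
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(boostedText("Raft", "consensus protocol", 3))
	fmt.Println(boostedText("", "untitled note", 3))
}
```

With an empty title the body passes through unchanged, which matches the advice to pass an empty string when no title is meaningful.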
type FullTextIndex ¶
type FullTextIndex struct {
// contains filtered or unexported fields
}
FullTextIndex provides full-text search capabilities.
func NewFullTextIndex ¶
func NewFullTextIndex(gs *storage.GraphStorage) *FullTextIndex
NewFullTextIndex creates a new full-text search index.
func (*FullTextIndex) IndexNodes ¶
func (fti *FullTextIndex) IndexNodes(labels []string, properties []string) error
IndexNodes indexes all nodes with specified labels and properties.
CROSS-TENANT / not for request paths: it samples FindNodesByLabelAcrossTenants, so it indexes every tenant's nodes into one index. The live search API uses the per-tenant IndexForTenant (pkg/search/tenant_indexes.go); IndexNodes is test/CLI-only. Do NOT wire it to a tenant-scoped request path — use IndexForTenant. (Tenant-isolation sweep F3.)
func (*FullTextIndex) IndexPrepared ¶
func (fti *FullTextIndex) IndexPrepared(nodes []*storage.Node, labels, properties []string) error
IndexPrepared indexes a pre-collected set of nodes under the given labels and properties. Use when the caller has scoped the node set (e.g. to a single tenant) that IndexNodes's label-based storage lookup can't express. Replaces any existing entries for these nodes.
func (*FullTextIndex) NodeContent ¶
func (fti *FullTextIndex) NodeContent(id uint64) (string, bool)
NodeContent returns the concatenated indexed text for the given NodeID, or ("", false) if the node is not currently indexed. Read-locks the index so callers can safely read during concurrent updates. Exposed for snippet generation at query time (handlers reach into the index rather than hitting storage for every result).
func (*FullTextIndex) Search ¶
func (fti *FullTextIndex) Search(query string) ([]SearchResult, error)
Search performs a basic text search. A multi-word query is treated as AND: every term must match.
func (*FullTextIndex) SearchBoolean ¶
func (fti *FullTextIndex) SearchBoolean(query string) ([]SearchResult, error)
SearchBoolean performs a boolean search with the AND, OR, and NOT operators.
func (*FullTextIndex) SearchFuzzy ¶
func (fti *FullTextIndex) SearchFuzzy(query string, maxDistance int) ([]SearchResult, error)
SearchFuzzy performs a fuzzy search that tolerates an edit distance of up to maxDistance.
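The documentation does not say which edit-distance metric is used. A minimal sketch, assuming Levenshtein distance counted in runes, shows how a maxDistance tolerance could filter candidate terms. `editDistance` and `fuzzyMatches` are hypothetical names.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// editDistance returns the Levenshtein distance between a and b in runes,
// using two rolling rows of the dynamic-programming table.
func editDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// fuzzyMatches keeps the terms within maxDistance edits of query.
func fuzzyMatches(terms []string, query string, maxDistance int) []string {
	var out []string
	for _, t := range terms {
		if editDistance(t, query) <= maxDistance {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	fmt.Println(editDistance("kitten", "sitting"))
	fmt.Println(fuzzyMatches([]string{"search", "graph", "perch"}, "serch", 1))
}
```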
func (*FullTextIndex) SearchInProperty ¶
func (fti *FullTextIndex) SearchInProperty(property string, query string) ([]SearchResult, error)
SearchInProperty searches only within the given property.
func (*FullTextIndex) SearchPhrase ¶
func (fti *FullTextIndex) SearchPhrase(phrase string) ([]SearchResult, error)
SearchPhrase searches for an exact phrase.
func (*FullTextIndex) SearchTopK ¶
func (fti *FullTextIndex) SearchTopK(query string, k int) ([]SearchResult, error)
SearchTopK returns the top k results for query, ranked by TF-IDF score. Unlike Search, only the top-k candidates are hydrated via GetNode — scoring uses only in-memory posting data, so queries that hit many candidates don't pay an LSM round-trip per hit. Pass k <= 0 for the all-results behavior (equivalent to Search).
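The idea of ranking on in-memory scores and touching storage only for the winners can be sketched as follows. The `scored` type, the `topK` helper, and the tie-break by ID are assumptions made for illustration; the counter stands in for one storage lookup per returned hit.

```go
package main

import (
	"fmt"
	"sort"
)

type scored struct {
	ID    uint64
	Score float64
}

// topK ranks candidates by score descending, breaking ties by ID ascending.
// k <= 0 returns every candidate, mirroring the all-results behavior.
func topK(scores map[uint64]float64, k int) []scored {
	out := make([]scored, 0, len(scores))
	for id, s := range scores {
		out = append(out, scored{ID: id, Score: s})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID
	})
	if k > 0 && k < len(out) {
		out = out[:k]
	}
	return out
}

func main() {
	scores := map[uint64]float64{1: 0.5, 2: 0.9, 3: 0.9, 4: 0.1}
	lookups := 0
	for _, c := range topK(scores, 2) {
		lookups++ // stand-in for hydrating one node from storage
		fmt.Println(c.ID, c.Score)
	}
	fmt.Println("storage lookups:", lookups)
}
```

Four candidates are scored, but only two would be hydrated.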
func (*FullTextIndex) UpdateNode ¶
func (fti *FullTextIndex) UpdateNode(nodeID uint64) error
UpdateNode updates the index for a specific node.
CROSS-TENANT / not for request paths: it resolves the node via the tenant-blind GetNode (no tenant validation), so a request path must NOT pass it an untrusted nodeID. Manual/test maintenance only. (Tenant-isolation sweep F3.)
type HybridHit ¶
HybridHit is one RRF-merged candidate. FTSRank and LSARank expose the per-stage rank so callers can see why a candidate scored where it did; -1 indicates the stage did not return this candidate.
FTSNode is the storage node when the FTS stage returned it. LSA-only candidates have FTSNode == nil; the caller is responsible for hydrating them via storage if needed (we don't pull *storage.GraphStorage into pkg/search just for that — it's a caller-side concern).
type HybridSearchOpts ¶
type HybridSearchOpts struct {
	// OverFetchK is the per-stage candidate pool size. Caller
	// typically sets it to ~3× the desired final result count to
	// give the RRF merge enough overlap to discriminate.
	OverFetchK int
	// Alpha weights FTS vs LSA contribution. 0.5 is balanced.
	// 1.0 = FTS only, 0.0 = LSA only. Out-of-range values are
	// clamped to [0, 1].
	Alpha float64
}
HybridSearchOpts configures the RRF-merged hybrid search.
Audit F2 #2 (2026-05-08): factored from pkg/api/handlers_hybrid_search.go's inline RRF composition so non-handler callers (notably pkg/retrieval/ for F2 GraphRAG) can invoke hybrid search without duplicating the merge logic.
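A minimal sketch of an Alpha-weighted reciprocal rank fusion follows. The smoothing constant 60, the 0-based ranks, and the exact way Alpha enters the score are assumptions; the documentation only states that Alpha balances the two stages and is clamped to [0, 1]. The `hit` type and `rrfMerge` are hypothetical stand-ins for HybridHit and the package's merge.

```go
package main

import (
	"fmt"
	"sort"
)

// rrfK is a common RRF smoothing constant. The value is an assumption.
const rrfK = 60

type hit struct {
	ID      uint64
	FTSRank int // 0-based rank in the FTS list, -1 if absent
	LSARank int // 0-based rank in the LSA list, -1 if absent
	Score   float64
}

func clamp01(a float64) float64 {
	if a < 0 {
		return 0
	}
	if a > 1 {
		return 1
	}
	return a
}

// rrfMerge fuses two ranked ID lists. Each stage contributes
// weight / (rrfK + rank + 1), with weight alpha for FTS and 1-alpha for LSA.
func rrfMerge(fts, lsa []uint64, alpha float64) []hit {
	alpha = clamp01(alpha)
	byID := make(map[uint64]*hit)
	lookup := func(id uint64) *hit {
		h, ok := byID[id]
		if !ok {
			h = &hit{ID: id, FTSRank: -1, LSARank: -1}
			byID[id] = h
		}
		return h
	}
	for rank, id := range fts {
		h := lookup(id)
		h.FTSRank = rank
		h.Score += alpha / float64(rrfK+rank+1)
	}
	for rank, id := range lsa {
		h := lookup(id)
		h.LSARank = rank
		h.Score += (1 - alpha) / float64(rrfK+rank+1)
	}
	out := make([]hit, 0, len(byID))
	for _, h := range byID {
		out = append(out, *h)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID
	})
	return out
}

func main() {
	for _, h := range rrfMerge([]uint64{10, 20}, []uint64{20, 30}, 0.5) {
		fmt.Printf("id=%d fts=%d lsa=%d score=%.5f\n", h.ID, h.FTSRank, h.LSARank, h.Score)
	}
}
```

A candidate returned by both stages accumulates two terms, which is why an over-fetched pool with some overlap gives the merge more to discriminate on.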
type HybridSearchResult ¶
HybridSearchResult bundles the merged hits with a degraded indicator. Degraded is non-empty when the hybrid path fell back to a single stage:
- "no-lsa-index": tenant has no LSA index built
- "query-out-of-vocabulary": query terms aren't in the LSA vocab
Note: "no-fts-match" is not surfaced here — an empty FTS top-k is not a degradation, it's just a result. The caller may choose to flag this externally based on len(Hits) and Degraded together.
func SearchHybridForTenant ¶
func SearchHybridForTenant(
	searchIdx *TenantIndexes,
	lsaIdx *TenantLSAIndexes,
	tenantID, query string,
	opts HybridSearchOpts,
) (*HybridSearchResult, error)
SearchHybridForTenant performs RRF-merged FTS + LSA search scoped to a single tenant. The caller supplies the per-tenant index containers; this function does the per-tenant Get internally.
Returns the merged candidate list plus a degraded indicator. The caller is responsible for:
- label / property post-filters (need storage access)
- pagination (offset / limit beyond the candidate pool)
- node hydration of LSA-only hits if needed
This split keeps pkg/search as the merge primitive and pushes API-shaped concerns (HTTP, storage hydration, response shape) to the caller. F2's pkg/retrieval/ uses this directly for GraphRAG seed retrieval; pkg/api/handlers_hybrid_search.go also calls it.
type LSAConfig ¶
type LSAConfig struct {
	Dims       int   // latent dimensions after SVD (default 200)
	Oversamp   int   // extra sketch dims for SVD numerical stability (default 10)
	PowerIter  int   // power iterations in randomized SVD (default 2)
	MaxVocab   int   // hard cap on vocabulary size (default 8000)
	MinDocFreq int   // filter terms appearing in fewer docs than this (default 3)
	TitleBoost int   // times to repeat Title to amplify title-term weight (default 3)
	Seed       int64 // RNG seed for determinism (default 42)
}
LSAConfig knobs. Use DefaultLSAConfig() for the tuned values from the wiki-graph implementation this code was ported from.
func DefaultLSAConfig ¶
func DefaultLSAConfig() LSAConfig
DefaultLSAConfig returns the config used by the wiki-graph port. These values are tuned for corpora in the 1k-10k document range.
type LSAIndex ¶
type LSAIndex struct {
// contains filtered or unexported fields
}
LSAIndex holds the LSA model and BM25 index built once at corpus load. After BuildLSAIndex returns, queries (FoldQuery, BM25Score) are sub-millisecond.
func BuildLSAIndex ¶
BuildLSAIndex constructs an LSA model and co-resident BM25 index from docs. Heavy linear algebra (SVD, Jacobi eigendecomposition) runs here; queries against the returned index are sub-millisecond.
Returns an error if the corpus is empty or produces fewer unique vocabulary terms than cfg.Dims (LSA cannot project into a higher-dimensional latent space than the vocabulary supports).
func LoadLSAFromFile ¶
LoadLSAFromFile reads an LSA index from path. Returns nil, os.ErrNotExist (wrapped) if the file is absent — callers should treat that as "no snapshot for this tenant yet" and fall through to the build path.
func ReadLSASnapshot ¶
ReadLSASnapshot deserializes an LSAIndex from r. Returns an error if the magic or version bytes don't match — callers should treat ErrLSASnapshotVersion as "regenerate via the admin endpoint" rather than retrying or falling back.
func (*LSAIndex) BM25Score ¶
func (i *LSAIndex) BM25Score(tokens []string, candidates map[uint64]bool) map[uint64]float64
BM25Score returns Okapi BM25 scores (k1=1.5, b=0.75) keyed by graphdb NodeID. Only documents that contain at least one query token appear in the result map; callers should treat missing keys as score 0.
If candidates is non-nil, scoring is restricted to NodeIDs in the set (other nodes are skipped). Pass nil to score across the full corpus.
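For orientation, here is a small self-contained Okapi BM25 scorer with the stated k1 and b. The IDF form, ln(1 + (N - n + 0.5)/(n + 0.5)), is an assumption, since the documentation does not say which variant the index uses. The `bm25` function and its corpus representation are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

const (
	bm25K1 = 1.5
	bm25B  = 0.75
)

// bm25 scores docs (ID -> tokens) against query tokens. Documents with no
// matching token are left out of the result. A non-nil candidates set
// restricts scoring to those IDs.
func bm25(docs map[uint64][]string, query []string, candidates map[uint64]bool) map[uint64]float64 {
	scores := make(map[uint64]float64)
	df := make(map[string]int)
	total := 0
	for _, toks := range docs {
		total += len(toks)
		seen := make(map[string]bool)
		for _, t := range toks {
			if !seen[t] {
				seen[t] = true
				df[t]++
			}
		}
	}
	if total == 0 {
		return scores
	}
	n := float64(len(docs))
	avgLen := float64(total) / n
	for id, toks := range docs {
		if candidates != nil && !candidates[id] {
			continue
		}
		tf := make(map[string]int)
		for _, t := range toks {
			tf[t]++
		}
		for _, q := range query {
			f := float64(tf[q])
			if f == 0 {
				continue
			}
			idf := math.Log(1 + (n-float64(df[q])+0.5)/(float64(df[q])+0.5))
			norm := f + bm25K1*(1-bm25B+bm25B*float64(len(toks))/avgLen)
			scores[id] += idf * f * (bm25K1 + 1) / norm
		}
	}
	return scores
}

func main() {
	docs := map[uint64][]string{
		1: {"graph", "db"},
		2: {"vector", "search"},
		3: {"graph", "search", "graph"},
	}
	fmt.Println(bm25(docs, []string{"graph"}, nil))
	fmt.Println(bm25(docs, []string{"graph"}, map[uint64]bool{1: true}))
}
```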
func (*LSAIndex) Dimensions ¶
func (i *LSAIndex) Dimensions() int
Dimensions returns the LSA latent dimension count (cfg.Dims).
func (*LSAIndex) DocSnippet ¶
func (i *LSAIndex) DocSnippet(id uint64, maxLen int) string
DocSnippet returns a rune-safe truncated excerpt of the document body for presentation. maxLen is a character (rune) count, not a byte count. If maxLen <= 0 the full stored content is returned. If the document is not in the corpus, returns "".
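The difference between a rune count and a byte count matters for non-ASCII text. A minimal sketch of rune-safe truncation, using a hypothetical `truncateRunes` helper (the real method may also trim whitespace or add an ellipsis):

```go
package main

import "fmt"

// truncateRunes cuts s to at most maxLen runes without splitting a
// multi-byte character. maxLen <= 0 returns s unchanged.
func truncateRunes(s string, maxLen int) string {
	if maxLen <= 0 {
		return s
	}
	r := []rune(s)
	if len(r) <= maxLen {
		return s
	}
	return string(r[:maxLen])
}

func main() {
	fmt.Println(truncateRunes("héllo wörld", 5))
	fmt.Println(truncateRunes("日本語テキスト", 3))
	fmt.Println(truncateRunes("short", 0))
}
```

Slicing the string by bytes instead (for example `s[:5]`) could cut a multi-byte character in half and produce invalid UTF-8.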
func (*LSAIndex) DocVector ¶
func (i *LSAIndex) DocVector(id uint64) ([]float32, bool)
DocVector returns the L2-normalized LSA embedding for the given NodeID. The second return is false if the NodeID was not present in the corpus.
The returned slice is freshly allocated and dequantized from the int8 in-memory representation; callers can hold it without worrying about index-state mutation. Re-quantization error vs the original float32 is at most lsaQuantScale^-1 per component (~0.79%).
func (*LSAIndex) FoldQuery ¶
func (i *LSAIndex) FoldQuery(query string) (vec []float32, tokens []string, err error)
FoldQuery maps a text query into the LSA latent space (k-dim, L2-normalized) and returns the shared tokenization so callers can feed the same tokens into BM25Score without re-running the tokenizer.
Returns an error if no query term maps to a vocabulary entry (out-of-vocab query) or if the projection collapses to the zero vector.
func (*LSAIndex) SaveToFile ¶
func (i *LSAIndex) SaveToFile(path string) error
SaveToFile writes the index to path atomically (write to .tmp, then rename). Same idiom as pkg/storage's snapshot to avoid leaving a half-written file if the process is killed mid-write.
func (*LSAIndex) TopKByVector ¶
func (i *LSAIndex) TopKByVector(qvec []float32, k int) ([]LSAResult, error)
TopKByVector returns the k documents most similar to qvec, ranked by cosine similarity descending. Ties are broken by NodeID ascending so the result is a deterministic prefix of any larger K — the same property SearchTopK maintains for paginated callers.
qvec must have the same dimensionality as the index (Dimensions()) and should be L2-normalized; FoldQuery returns vectors that satisfy both. A mismatched dimension returns an error; it's a programming bug, not a user error.
No storage I/O — operates entirely on the in-memory int8-quantized doc vectors. The dot product fuses the dequantization into the accumulator (multiply int8 component then divide once at the loop tail) so the quantization shows up as a single division per doc rather than per component.
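The fused dot product and the deterministic tie-break can be sketched as follows. The scale of 127 and the in-memory layout (a map of NodeID to int8 vector) are assumptions; `result` and `topKByVector` are hypothetical stand-ins for LSAResult and the method.

```go
package main

import (
	"fmt"
	"sort"
)

const quantScale = 127 // assumed quantization scale

type result struct {
	NodeID     uint64
	Similarity float32
}

// topKByVector accumulates q[i]*int8 per component and divides by the
// quantization scale once per document, then ranks by similarity
// descending with NodeID ascending as the tie-break.
func topKByVector(q []float32, docs map[uint64][]int8, k int) ([]result, error) {
	out := make([]result, 0, len(docs))
	for id, vec := range docs {
		if len(vec) != len(q) {
			return nil, fmt.Errorf("dimension mismatch: query %d, doc %d", len(q), len(vec))
		}
		var acc float32
		for i, c := range vec {
			acc += q[i] * float32(c)
		}
		out = append(out, result{NodeID: id, Similarity: acc / quantScale})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Similarity != out[j].Similarity {
			return out[i].Similarity > out[j].Similarity
		}
		return out[i].NodeID < out[j].NodeID
	})
	if k < 0 {
		k = 0
	}
	if k < len(out) {
		out = out[:k]
	}
	return out, nil
}

func main() {
	docs := map[uint64][]int8{1: {127, 0}, 2: {0, 127}, 3: {127, 0}}
	res, err := topKByVector([]float32{1, 0}, docs, 2)
	fmt.Println(res, err)
}
```

Because ties are ordered by NodeID, the top 2 here is a prefix of the top 3 regardless of map iteration order.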
func (*LSAIndex) WriteSnapshot ¶
func (i *LSAIndex) WriteSnapshot(w io.Writer) error
WriteSnapshot serializes the index to w in the on-disk format described at the top of this file. Caller is responsible for closing w. Holds no internal locks — callers writing a tenant snapshot should serialize against any rebuild path themselves (TenantLSAIndexes.SaveAll handles this via its RWMutex).
type LSAResult ¶
type LSAResult struct {
NodeID uint64
Similarity float32 // cosine similarity in [-1, 1] (typically [0, 1] for stored embeddings)
}
LSAResult is a ranked result from LSA semantic search.
type SearchResult ¶
SearchResult represents a single search result with its score.
type TenantIndexes ¶
type TenantIndexes struct {
// contains filtered or unexported fields
}
TenantIndexes holds a FullTextIndex per tenant. Each tenant's index sees only its own nodes — isolation is enforced at build time via storage.GetNodesByLabelForTenant, not by filtering a shared index.
Indexes are constructed lazily on first Get. A tenant that has never been indexed returns an empty index, which produces zero search results — the safe default.
Design note: the API server owns the TenantIndexes; the query DSL's executor keeps its own (currently unused) shared index, so DSL search() is not yet tenant-scoped. Tenant-aware DSL search is a follow-up that requires threading tenant context through the executor. For now, callers of DSL search() are internal/trusted.
func NewTenantIndexes ¶
func NewTenantIndexes(gs *storage.GraphStorage) *TenantIndexes
NewTenantIndexes returns an empty TenantIndexes backed by gs.
func (*TenantIndexes) Get ¶
func (ti *TenantIndexes) Get(tenantID string) *FullTextIndex
Get returns the FullTextIndex for tenantID, constructing one lazily on first access. Safe for concurrent use. The returned index is populated only after IndexForTenant has been called for this tenant.
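Lazy, concurrency-safe construction of a per-tenant entry is commonly written with a read lock on the fast path and a re-check under the write lock. The sketch below assumes that pattern; the `registry` and `index` types are hypothetical, and the documentation does not say which synchronization primitive TenantIndexes uses.

```go
package main

import (
	"fmt"
	"sync"
)

type index struct{ docs int } // stand-in for a per-tenant index

type registry struct {
	mu sync.RWMutex
	m  map[string]*index
}

func newRegistry() *registry { return &registry{m: make(map[string]*index)} }

// Get returns the tenant's index, creating an empty one on first access.
func (r *registry) Get(tenant string) *index {
	r.mu.RLock()
	idx, ok := r.m[tenant]
	r.mu.RUnlock()
	if ok {
		return idx
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	// Re-check: another goroutine may have created it between the two locks.
	if existing, ok := r.m[tenant]; ok {
		return existing
	}
	idx = &index{}
	r.m[tenant] = idx
	return idx
}

func main() {
	r := newRegistry()
	var wg sync.WaitGroup
	results := make([]*index, 8)
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = r.Get("tenant-a")
		}(i)
	}
	wg.Wait()
	same := true
	for _, p := range results {
		if p != results[0] {
			same = false
		}
	}
	fmt.Println("all goroutines saw the same index:", same)
}
```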
func (*TenantIndexes) IndexForTenant ¶
func (ti *TenantIndexes) IndexForTenant(tenantID string, labels, properties []string) error
IndexForTenant builds (or rebuilds) the index for tenantID from nodes that match the given labels AND belong to that tenant. Cross-tenant nodes are never passed to the index because the caller uses the tenant-scoped storage accessor.
func (*TenantIndexes) Tenants ¶
func (ti *TenantIndexes) Tenants() []string
Tenants returns the IDs of tenants that currently have an index (whether populated or just touched via Get). Order is unspecified.
type TenantLSAIndexes ¶
type TenantLSAIndexes struct {
// contains filtered or unexported fields
}
TenantLSAIndexes holds a per-tenant LSAIndex. Unlike TenantIndexes (which lazily constructs an empty FullTextIndex on first Get), LSA indexes require an expensive SVD build up front — so callers are expected to build explicitly via BuildLSAIndex and register the result via Set. Get returns nil for tenants that haven't been registered, signaling "no semantic search available for this tenant yet" to callers; the /hybrid-search handler uses this to degrade gracefully to a pure-FTS response.
Not coupled to storage — LSA builds take a []Document that the caller is responsible for gathering (from any source, scoped by whatever means). This keeps the tenant scoping concern at the build-time layer, not inside this map.
func NewTenantLSAIndexes ¶
func NewTenantLSAIndexes() *TenantLSAIndexes
NewTenantLSAIndexes returns an empty per-tenant LSA registry.
func (*TenantLSAIndexes) Get ¶
func (tli *TenantLSAIndexes) Get(tenantID string) *LSAIndex
Get returns the LSA index registered for tenantID, or nil if none has been registered. Callers MUST nil-check; the zero value is a deliberate signal ("LSA not available for this tenant").
func (*TenantLSAIndexes) LoadAll ¶
func (tli *TenantLSAIndexes) LoadAll(dir string) error
LoadAll reads every <tenantID>.lsa file in dir and registers each with the receiver. A missing dir returns nil (treat as "no snapshots yet") rather than an error, because fresh deployments would otherwise fail to boot. Per-tenant decode failures are reported via the returned aggregate error but do not block other tenants from loading.
File-naming convention: filename stem is the tenant ID after the same sanitization SaveAll applies. Files whose stem doesn't survive round-trip sanitization are silently ignored (defense against hand-edited or attacker-planted files with traversal-like names).
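The round-trip check can be illustrated with a purely hypothetical sanitizer. The actual sanitization rule that SaveAll applies is not described here; the allowed character set below is an assumption made only so the example runs.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// sanitize is a hypothetical rule: keep ASCII letters, digits, '-' and '_',
// and replace everything else with '_'.
func sanitize(id string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '-', r == '_':
			return r
		}
		return '_'
	}, id)
}

// tenantFromFile accepts a file name only if its stem is unchanged by
// sanitize, so names that SaveAll could not have produced are ignored.
func tenantFromFile(name string) (string, bool) {
	if filepath.Ext(name) != ".lsa" {
		return "", false
	}
	stem := strings.TrimSuffix(name, ".lsa")
	if stem == "" || sanitize(stem) != stem {
		return "", false
	}
	return stem, true
}

func main() {
	for _, name := range []string{"acme.lsa", "..%2Fetc.lsa", "notes.txt"} {
		id, ok := tenantFromFile(name)
		fmt.Printf("%q -> %q %v\n", name, id, ok)
	}
}
```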
func (*TenantLSAIndexes) SaveAll ¶
func (tli *TenantLSAIndexes) SaveAll(dir string) error
SaveAll writes every tenant's LSA index to dir/<tenantID>.lsa. Tenants with no registered index are skipped (no file written, no error). Errors per tenant are returned as a single aggregate; one tenant's failure doesn't block others. Holds the registry's read lock while the in-memory map is read, so a concurrent Set() can't mutate it mid-snapshot; the map is read once, then file I/O happens unlocked per tenant.
func (*TenantLSAIndexes) Set ¶
func (tli *TenantLSAIndexes) Set(tenantID string, idx *LSAIndex)
Set registers idx as the LSA index for tenantID. A subsequent Set for the same tenantID replaces the prior index (supports rebuild). Set(tenantID, nil) removes the entry so callers can explicitly revoke LSA for a tenant (e.g. during corpus wipe).
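The register, replace, and revoke semantics, together with the nil check that Get requires, can be sketched with a small stand-in registry. All type and function names here are hypothetical; only the "no-lsa-index" string comes from the documentation above.

```go
package main

import (
	"fmt"
	"sync"
)

type lsaIndex struct{ dims int } // stand-in for *LSAIndex

type lsaRegistry struct {
	mu sync.RWMutex
	m  map[string]*lsaIndex
}

func newLSARegistry() *lsaRegistry { return &lsaRegistry{m: make(map[string]*lsaIndex)} }

// Set registers or replaces the tenant's index; a nil idx removes the entry.
func (r *lsaRegistry) Set(tenant string, idx *lsaIndex) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if idx == nil {
		delete(r.m, tenant)
		return
	}
	r.m[tenant] = idx
}

// Get returns nil when no index is registered for the tenant.
func (r *lsaRegistry) Get(tenant string) *lsaIndex {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.m[tenant]
}

// degraded shows the caller-side nil check before any semantic search.
func degraded(r *lsaRegistry, tenant string) string {
	if r.Get(tenant) == nil {
		return "no-lsa-index"
	}
	return ""
}

func main() {
	r := newLSARegistry()
	fmt.Printf("%q\n", degraded(r, "acme"))
	r.Set("acme", &lsaIndex{dims: 200})
	fmt.Printf("%q\n", degraded(r, "acme"))
	r.Set("acme", nil) // revoke, e.g. during a corpus wipe
	fmt.Printf("%q\n", degraded(r, "acme"))
}
```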
func (*TenantLSAIndexes) Tenants ¶
func (tli *TenantLSAIndexes) Tenants() []string
Tenants returns the IDs of tenants with a registered LSA index. Order is unspecified.