Documentation
¶
Overview ¶
Package baseline provides pre-curated golden-query regression metrics for VaultMind's retrievers. Baselines are committed snapshots of Hit@K and MRR against a stable query fixture. The point is not absolute quality (that's research-grade evaluation); it's a tripwire that fires when a change degrades retrieval on a fixed input — the "you cannot improve what you cannot measure" principle applied to a single dimension.
Index ¶
Constants ¶
const DefaultTolerance = 0.02
DefaultTolerance is the aggregate-metric drop allowed before a regression fires. 0.02 (2 percentage points) is tight enough to catch real degradations on a curated fixture but tolerant of small FTS scoring jitter between runs.
Variables ¶
This section is empty.
Functions ¶
func HitAtK ¶
HitAtK reports 1.0 if any expected ID appears within the first k results, else 0.0. A float return type keeps aggregation (mean across queries) arithmetically simple.
Semantics: strict-intersection — any expected in top-k counts. Not all-of-expected, not strict-order. Partial-recall wins are the point of keyword/hybrid retrieval and we don't want the metric to punish them.
k larger than len(results) scans everything that exists (no panic). An empty expected set returns 0 rather than NaN so downstream means stay well-defined.
func ReciprocalRank ¶
ReciprocalRank reports 1/rank of the best (lowest-rank) expected ID within results, or 0 if no expected ID appears at all. This is the per-query contribution to MRR; the mean across queries is taken at the aggregate layer.
"Best rank wins" — if multiple expected IDs are present, we take the minimum rank. Using first-declared expected ID would silently tie the metric to the *order* of the expected set in the fixture, which is a trap: reshuffling the fixture (for readability, say) would then move the metric.
Types ¶
type Diff ¶
Diff is the output of CompareToSnapshot. OK is the one-bit "passed the gate" answer; Regressions carries human-readable descriptions of every regressing dimension so operators don't have to re-derive what broke from a number.
func CompareToSnapshot ¶
CompareToSnapshot reports regressions between a current run and a committed snapshot. A regression is any aggregate or per-query metric that dropped below (snapshot - tolerance). Improvements never trigger the gate — callers can refresh the snapshot deliberately when a retrieval change is intended to raise quality.
K mismatch is an operator error, not a silent comparison. Hit@5 numbers compared against Hit@10 numbers would produce nonsense.
type Query ¶
type Query struct {
Name string `yaml:"name" json:"name"`
Text string `yaml:"text" json:"text"`
Expected []string `yaml:"expected" json:"expected"`
}
Query is a single golden-query spec. Name is a short label for per-query reporting; Text is the actual search string; Expected is the curated set of note IDs a well-behaved retriever should surface in the top-K (order among Expected is not significant — see ReciprocalRank).
func LoadQueries ¶
LoadQueries reads a golden-query fixture from YAML. The fixture is a flat list of Query entries (name/text/expected). Empty or missing files are errors — a baseline gate without queries measures nothing.
type QueryResult ¶
type QueryResult struct {
Name string `json:"name"`
Text string `json:"text"`
Expected []string `json:"expected"`
ResultIDs []string `json:"result_ids"`
HitAtK float64 `json:"hit_at_k"`
ReciprocalRank float64 `json:"reciprocal_rank"`
}
QueryResult is one row of a Report: the resolved top-K IDs plus per-query metrics. ResultIDs is deliberately included so a regression is *diagnosable* — a number without provenance can't be debugged.
type Report ¶
type Report struct {
K int `json:"k"`
Queries []QueryResult `json:"queries"`
HitAtK float64 `json:"hit_at_k"`
MRR float64 `json:"mrr"`
}
Report is the full baseline run — one QueryResult per input query plus aggregate Hit@K and MRR (means across queries).
func LoadSnapshot ¶
LoadSnapshot reads a committed baseline.json produced by a previous run. The JSON shape matches Report's tags, so snapshots round-trip through the runner's output without a mapping layer.
func Run ¶
Run executes each query through the retriever and builds a Report. Miss queries (no expected ID in top-K) register as 0.0 rows rather than errors — partial failure is a signal, not a crash.
A retriever failure on any single query aborts the whole run: silently dropping a query would hide the failure behind an artificially lower aggregate, which is the class of bug baselines exist to catch.