baseline

package
v0.1.11 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 7, 2026 License: MIT Imports: 8 Imported by: 0

Documentation

Overview

Package baseline provides pre-curated golden-query regression metrics for VaultMind's retrievers. Baselines are committed snapshots of Hit@K and MRR against a stable query fixture. The point is not absolute quality (that's research-grade evaluation); it's a tripwire that fires when a change degrades retrieval on a fixed input — the "you cannot improve what you cannot measure" principle applied to a single dimension.

Constants

const DefaultTolerance = 0.02

DefaultTolerance is the aggregate-metric drop allowed before a regression fires. 0.02 (2 percentage points) is tight enough to catch real degradations on a curated fixture but tolerant of small FTS scoring jitter between runs.

Variables

This section is empty.

Functions

func HitAtK

func HitAtK(results, expected []string, k int) float64

HitAtK reports 1.0 if any expected ID appears within the first k results, else 0.0. A float return type keeps aggregation (mean across queries) arithmetically simple.

Semantics: non-empty intersection. Any expected ID in the top k counts as a hit; the metric requires neither all of the expected IDs nor any particular order. Partial-recall wins are the point of keyword/hybrid retrieval and we don't want the metric to punish them.

k larger than len(results) scans everything that exists (no panic). An empty expected set returns 0 rather than NaN so downstream means stay well-defined.
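
The semantics above can be illustrated with a minimal sketch. This is a re-implementation written from the description, not the package's source; the lowercase name hitAtK and the handling of k <= 0 (treated as a miss) are assumptions.

```go
package main

import "fmt"

// hitAtK is an illustrative re-implementation of the documented semantics.
// It is not the package's source; returning 0 for k <= 0 is an assumption.
func hitAtK(results, expected []string, k int) float64 {
	if len(expected) == 0 || k <= 0 {
		return 0 // empty expected set yields 0, never NaN
	}
	if k > len(results) {
		k = len(results) // scan everything that exists, no panic
	}
	want := make(map[string]struct{}, len(expected))
	for _, id := range expected {
		want[id] = struct{}{}
	}
	for _, id := range results[:k] {
		if _, ok := want[id]; ok {
			return 1 // any expected ID in the top k is enough
		}
	}
	return 0
}

func main() {
	results := []string{"note-a", "note-b", "note-c"}
	fmt.Println(hitAtK(results, []string{"note-c", "note-z"}, 3)) // 1
	fmt.Println(hitAtK(results, []string{"note-c"}, 2))           // 0
	fmt.Println(hitAtK(results, []string{"note-a"}, 10))          // 1
	fmt.Println(hitAtK(results, nil, 3))                          // 0
}
```

The second call misses only because note-c sits at rank 3, outside the k=2 cutoff.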

func ReciprocalRank

func ReciprocalRank(results, expected []string) float64

ReciprocalRank reports 1/rank of the best (lowest-rank) expected ID within results, or 0 if no expected ID appears at all. This is the per-query contribution to MRR; the mean across queries is taken at the aggregate layer.

"Best rank wins" — if multiple expected IDs are present, we take the minimum rank. Using first-declared expected ID would silently tie the metric to the *order* of the expected set in the fixture, which is a trap: reshuffling the fixture (for readability, say) would then move the metric.

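A minimal sketch of the "best rank wins" rule, written from the description above rather than taken from the package's source (the lowercase name reciprocalRank is an assumption). Scanning results in rank order and stopping at the first member of the expected set yields the minimum rank regardless of how the fixture orders its expected IDs.

```go
package main

import "fmt"

// reciprocalRank is an illustrative re-implementation, not the package's source.
// It returns 1/rank of the first result that belongs to the expected set.
func reciprocalRank(results, expected []string) float64 {
	want := make(map[string]struct{}, len(expected))
	for _, id := range expected {
		want[id] = struct{}{}
	}
	for i, id := range results {
		if _, ok := want[id]; ok {
			return 1 / float64(i+1) // ranks are 1-based
		}
	}
	return 0 // no expected ID appears at all
}

func main() {
	results := []string{"n1", "n2", "n3", "n4"}

	// Both expected IDs are present (ranks 4 and 2); rank 2 wins either way.
	fmt.Println(reciprocalRank(results, []string{"n4", "n2"})) // 0.5
	fmt.Println(reciprocalRank(results, []string{"n2", "n4"})) // 0.5
	fmt.Println(reciprocalRank(results, []string{"n9"}))       // 0
}
```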
Types

type Diff

type Diff struct {
	OK          bool
	Regressions []string
}

Diff is the output of CompareToSnapshot. OK is the one-bit "passed the gate" answer; Regressions carries human-readable descriptions of every regressing dimension so operators don't have to re-derive what broke from a number.

func CompareToSnapshot

func CompareToSnapshot(current, snapshot *Report, tolerance float64) (*Diff, error)

CompareToSnapshot reports regressions between a current run and a committed snapshot. A regression is any aggregate or per-query metric that dropped below (snapshot - tolerance). Improvements never trigger the gate — callers can refresh the snapshot deliberately when a retrieval change is intended to raise quality.

A K mismatch between the current report and the snapshot is an operator error and is reported as an error rather than compared silently: Hit@5 numbers compared against Hit@10 numbers would produce nonsense.
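
The gate logic can be sketched as follows. This is an assumption-laden illustration, not the package's source: the types are trimmed local mirrors of Report, QueryResult, and Diff; per-query rows are matched by Name; queries absent from the snapshot are skipped; and the message wording is invented.

```go
package main

import (
	"errors"
	"fmt"
)

// Trimmed local mirrors of the documented types, for illustration only.
type QueryResult struct {
	Name           string
	HitAtK         float64
	ReciprocalRank float64
}

type Report struct {
	K       int
	Queries []QueryResult
	HitAtK  float64
	MRR     float64
}

type Diff struct {
	OK          bool
	Regressions []string
}

// compareToSnapshot flags any metric that fell below (snapshot - tolerance).
func compareToSnapshot(current, snapshot *Report, tolerance float64) (*Diff, error) {
	if current == nil || snapshot == nil {
		return nil, errors.New("current and snapshot reports are both required")
	}
	if current.K != snapshot.K {
		return nil, fmt.Errorf("K mismatch: current=%d snapshot=%d", current.K, snapshot.K)
	}
	d := &Diff{}
	check := func(label string, now, before float64) {
		if now < before-tolerance { // improvements never trip the gate
			d.Regressions = append(d.Regressions,
				fmt.Sprintf("%s dropped from %.3f to %.3f", label, before, now))
		}
	}
	check("aggregate hit@k", current.HitAtK, snapshot.HitAtK)
	check("aggregate mrr", current.MRR, snapshot.MRR)

	// Assumption: per-query rows are paired by Name.
	prev := make(map[string]QueryResult, len(snapshot.Queries))
	for _, q := range snapshot.Queries {
		prev[q.Name] = q
	}
	for _, q := range current.Queries {
		old, ok := prev[q.Name]
		if !ok {
			continue // assumption: a new query has nothing to regress against
		}
		check(q.Name+" hit@k", q.HitAtK, old.HitAtK)
		check(q.Name+" reciprocal rank", q.ReciprocalRank, old.ReciprocalRank)
	}
	d.OK = len(d.Regressions) == 0
	return d, nil
}

func main() {
	snap := &Report{K: 5, HitAtK: 0.90, MRR: 0.70}
	cur := &Report{K: 5, HitAtK: 0.85, MRR: 0.75}

	d, err := compareToSnapshot(cur, snap, 0.02)
	fmt.Println(d.OK, d.Regressions, err)

	_, err = compareToSnapshot(&Report{K: 10}, snap, 0.02)
	fmt.Println(err)
}
```

In the first call the Hit@K drop of 0.05 exceeds the 0.02 tolerance and is reported, while the MRR improvement is ignored.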

type Query

type Query struct {
	Name     string   `yaml:"name"     json:"name"`
	Text     string   `yaml:"text"     json:"text"`
	Expected []string `yaml:"expected" json:"expected"`
}

Query is a single golden-query spec. Name is a short label for per-query reporting; Text is the actual search string; Expected is the curated set of note IDs a well-behaved retriever should surface in the top-K (order among Expected is not significant — see ReciprocalRank).

func LoadQueries

func LoadQueries(path string) ([]Query, error)

LoadQueries reads a golden-query fixture from YAML. The fixture is a flat list of Query entries (name/text/expected). Empty or missing files are errors — a baseline gate without queries measures nothing.

type QueryResult

type QueryResult struct {
	Name           string   `json:"name"`
	Text           string   `json:"text"`
	Expected       []string `json:"expected"`
	ResultIDs      []string `json:"result_ids"`
	HitAtK         float64  `json:"hit_at_k"`
	ReciprocalRank float64  `json:"reciprocal_rank"`
}

QueryResult is one row of a Report: the resolved top-K IDs plus per-query metrics. ResultIDs is deliberately included so a regression is *diagnosable* — a number without provenance can't be debugged.

type Report

type Report struct {
	K       int           `json:"k"`
	Queries []QueryResult `json:"queries"`
	HitAtK  float64       `json:"hit_at_k"`
	MRR     float64       `json:"mrr"`
}

Report is the full baseline run — one QueryResult per input query plus aggregate Hit@K and MRR (means across queries).
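
As a sketch of how the aggregates relate to the rows, the means can be computed as below. The helper name aggregate and the choice to return zeros for an empty row set are assumptions; QueryResult is a trimmed local mirror.

```go
package main

import "fmt"

// QueryResult is a trimmed local mirror of the documented type.
type QueryResult struct {
	Name           string
	HitAtK         float64
	ReciprocalRank float64
}

// aggregate returns the mean Hit@K and the mean reciprocal rank (MRR).
// Returning zeros for an empty input is an assumption made for this sketch.
func aggregate(rows []QueryResult) (hitAtK, mrr float64) {
	if len(rows) == 0 {
		return 0, 0
	}
	for _, r := range rows {
		hitAtK += r.HitAtK
		mrr += r.ReciprocalRank
	}
	n := float64(len(rows))
	return hitAtK / n, mrr / n
}

func main() {
	rows := []QueryResult{
		{Name: "exact-title", HitAtK: 1, ReciprocalRank: 1},
		{Name: "synonym", HitAtK: 1, ReciprocalRank: 0.5},
		{Name: "typo", HitAtK: 0, ReciprocalRank: 0},
		{Name: "long-tail", HitAtK: 1, ReciprocalRank: 0.5},
	}
	hit, mrr := aggregate(rows)
	fmt.Printf("Hit@K=%.2f MRR=%.2f\n", hit, mrr) // Hit@K=0.75 MRR=0.50
}
```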

func LoadSnapshot

func LoadSnapshot(path string) (*Report, error)

LoadSnapshot reads a committed baseline.json produced by a previous run. The JSON shape matches Report's tags, so snapshots round-trip through the runner's output without a mapping layer.
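
A minimal sketch of the round trip, using only encoding/json and local mirrors of the documented types with the same JSON tags. The helper names encodeSnapshot and decodeSnapshot are hypothetical, and they operate on bytes rather than a file path to keep the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local mirrors of the documented types, carrying the same JSON tags.
type QueryResult struct {
	Name           string   `json:"name"`
	Text           string   `json:"text"`
	Expected       []string `json:"expected"`
	ResultIDs      []string `json:"result_ids"`
	HitAtK         float64  `json:"hit_at_k"`
	ReciprocalRank float64  `json:"reciprocal_rank"`
}

type Report struct {
	K       int           `json:"k"`
	Queries []QueryResult `json:"queries"`
	HitAtK  float64       `json:"hit_at_k"`
	MRR     float64       `json:"mrr"`
}

// encodeSnapshot renders a report in the shape a committed snapshot would have.
func encodeSnapshot(r *Report) ([]byte, error) {
	return json.MarshalIndent(r, "", "  ")
}

// decodeSnapshot parses snapshot bytes straight into a Report.
func decodeSnapshot(data []byte) (*Report, error) {
	var r Report
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, fmt.Errorf("decode snapshot: %w", err)
	}
	return &r, nil
}

func main() {
	in := &Report{
		K:      5,
		HitAtK: 1,
		MRR:    0.5,
		Queries: []QueryResult{{
			Name:           "raft",
			Text:           "raft leader election",
			Expected:       []string{"n2"},
			ResultIDs:      []string{"n1", "n2"},
			HitAtK:         1,
			ReciprocalRank: 0.5,
		}},
	}
	data, err := encodeSnapshot(in)
	if err != nil {
		panic(err)
	}
	out, err := decodeSnapshot(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(out.K, out.MRR, out.Queries[0].ResultIDs)
}
```

Because the struct tags define the wire format in both directions, no intermediate mapping type is needed.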

func Run

func Run(retriever retrieval.Retriever, queries []Query, cfg RunConfig) (*Report, error)

Run executes each query through the retriever and builds a Report. Miss queries (no expected ID in top-K) register as 0.0 rows rather than errors — partial failure is a signal, not a crash.

A retriever failure on any single query aborts the whole run: silently dropping a query would hide the failure behind an artificially lower aggregate, which is the class of bug baselines exist to catch.
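
The distinction between a miss and a failure can be sketched as below. The Retriever interface and its Search method are hypothetical stand-ins, since the real retrieval.Retriever method set is not shown here; Row, hit, run, and fakeRetriever are likewise invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Retriever is a hypothetical stand-in for retrieval.Retriever.
type Retriever interface {
	Search(text string, limit int) ([]string, error)
}

type Query struct {
	Name     string
	Text     string
	Expected []string
}

// Row is a trimmed, illustrative per-query result.
type Row struct {
	Name string
	Hit  float64
}

// hit reports 1 if any expected ID is within the first k results.
func hit(results, expected []string, k int) float64 {
	if k <= 0 {
		return 0
	}
	if k > len(results) {
		k = len(results)
	}
	for _, id := range results[:k] {
		for _, want := range expected {
			if id == want {
				return 1
			}
		}
	}
	return 0
}

// run records misses as 0.0 rows but aborts on the first retriever error.
func run(r Retriever, queries []Query, k, limit int) ([]Row, error) {
	rows := make([]Row, 0, len(queries))
	for _, q := range queries {
		ids, err := r.Search(q.Text, limit)
		if err != nil {
			return nil, fmt.Errorf("query %q: %w", q.Name, err)
		}
		rows = append(rows, Row{Name: q.Name, Hit: hit(ids, q.Expected, k)})
	}
	return rows, nil
}

// fakeRetriever returns canned results and errors on unknown query text.
type fakeRetriever map[string][]string

func (f fakeRetriever) Search(text string, limit int) ([]string, error) {
	ids, ok := f[text]
	if !ok {
		return nil, errors.New("backend unavailable")
	}
	if limit >= 0 && limit < len(ids) {
		ids = ids[:limit]
	}
	return ids, nil
}

func main() {
	r := fakeRetriever{
		"raft leader election": {"n1", "n2"},
		"backup rotation":      {"n9"},
	}
	queries := []Query{
		{Name: "raft", Text: "raft leader election", Expected: []string{"n2"}},
		{Name: "backup", Text: "backup rotation", Expected: []string{"n3"}},
	}
	rows, err := run(r, queries, 5, 5)
	fmt.Println(rows, err) // [{raft 1} {backup 0}] <nil>

	queries = append(queries, Query{Name: "broken", Text: "not indexed"})
	_, err = run(r, queries, 5, 5)
	fmt.Println(err) // query "broken": backend unavailable
}
```

The "backup" query misses and still produces a row; the "broken" query fails in the retriever and no report is produced at all.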

type RunConfig

type RunConfig struct {
	K     int
	Limit int
}

RunConfig controls per-run behavior.

K     — rank cutoff for Hit@K (typically 5 or 10).
Limit — max results requested from the retriever. Should be ≥ K;
        lower limits make Hit@K pessimistic because the retriever
        never gets a chance to place expected IDs beyond the cap.
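
A small sketch of why Limit should be at least K. The helpers hitAtK and truncate are illustrative and not part of the package; truncate mimics a retriever that honors a result cap.

```go
package main

import "fmt"

// hitAtK reports 1 if any expected ID is within the first k results.
func hitAtK(results, expected []string, k int) float64 {
	if k <= 0 {
		return 0
	}
	if k > len(results) {
		k = len(results)
	}
	for _, id := range results[:k] {
		for _, want := range expected {
			if id == want {
				return 1
			}
		}
	}
	return 0
}

// truncate mimics a retriever that returns at most limit results.
func truncate(ranked []string, limit int) []string {
	if limit < 0 {
		limit = 0
	}
	if limit < len(ranked) {
		return ranked[:limit]
	}
	return ranked
}

func main() {
	ranked := []string{"n1", "n2", "n3", "n4", "n5", "n6"}
	expected := []string{"n5"} // sits at rank 5
	const k = 5

	// Limit 3 < K 5: the expected ID is cut off before Hit@5 can see it.
	fmt.Println(hitAtK(truncate(ranked, 3), expected, k)) // 0

	// Limit 5 >= K 5: the same ranking now scores a hit.
	fmt.Println(hitAtK(truncate(ranked, 5), expected, k)) // 1
}
```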
