Documentation ¶
Overview ¶
Package bench is a retrieval benchmark harness: it ingests a dataset of memories, then scores how well each question's gold memories are retrieved (Recall@K, MRR) and how long retrieval takes. It runs offline on the committed sample with a deterministic local embedder, or against a real endpoint with a converted LongMemEval or LoCoMo set.
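The two ranking metrics named above can be sketched as follows. This is a minimal illustration, not the package's code: hitAtK and reciprocalRank are hypothetical names, and it assumes Recall@K counts a question as recalled when any gold ID appears in the top K (the package may instead use the fraction of gold IDs retrieved).

```go
package main

import "fmt"

// toSet builds a lookup set from the gold IDs.
func toSet(ids []string) map[string]bool {
	set := make(map[string]bool, len(ids))
	for _, id := range ids {
		set[id] = true
	}
	return set
}

// hitAtK reports whether any gold ID appears among the first k ranked IDs.
// Assumption: "recall" is scored as a per-question hit, not as a fraction.
func hitAtK(ranked, gold []string, k int) bool {
	goldSet := toSet(gold)
	for i, id := range ranked {
		if i >= k {
			break
		}
		if goldSet[id] {
			return true
		}
	}
	return false
}

// reciprocalRank returns 1/rank of the first gold ID in ranked, or 0 if
// no gold ID was retrieved. MRR is the mean of this value over questions.
func reciprocalRank(ranked, gold []string) float64 {
	goldSet := toSet(gold)
	for i, id := range ranked {
		if goldSet[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

func main() {
	ranked := []string{"m7", "m2", "m9"}
	gold := []string{"m2"}
	fmt.Println(hitAtK(ranked, gold, 1), hitAtK(ranked, gold, 3), reciprocalRank(ranked, gold))
}
```

Averaging hitAtK (as 0 or 1) and reciprocalRank over all questions gives dataset-level Recall@K and MRR under these assumptions.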
Index ¶
- func Markdown(results []Result) string
- func RerankGateMarkdown(rows []RerankGateResult, k int) string
- func RerankMarkdown(results []RerankResult, k int) string
- func VecGateMarkdown(rows []VecGateResult, k int) string
- type Dataset
- func LoadFile(path string) (*Dataset, error)
- func LoadLoCoMo(path string) (*Dataset, error)
- func LoadLoCoMoSessions(path string) (*Dataset, error)
- func LoadLongMemEval(path string, mode DocMode) (*Dataset, error)
- func Poison(ds *Dataset, perGroup int, filler string) *Dataset
- func Sample() (*Dataset, error)
- type DocMode
- type Item
- type Question
- type RerankGateResult
- type RerankResult
- type Result
- type System
- type VecGateResult
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func RerankGateMarkdown ¶ added in v0.4.13
func RerankGateMarkdown(rows []RerankGateResult, k int) string
RerankGateMarkdown renders the sweep, lowest threshold first.
func RerankMarkdown ¶
func RerankMarkdown(results []RerankResult, k int) string
RerankMarkdown renders the RRF-vs-composite comparison, grouped by category.
func VecGateMarkdown ¶ added in v0.4.13
func VecGateMarkdown(rows []VecGateResult, k int) string
VecGateMarkdown renders the sweep, lowest threshold first.
Types ¶
type Dataset ¶
type Dataset struct {
Name string `json:"name"`
Items []Item `json:"items"`
Questions []Question `json:"questions"`
}
Dataset is a normalized retrieval benchmark.
func LoadLoCoMo ¶
LoadLoCoMo converts the published LoCoMo file into the normalized Dataset. Each conversation is its own group/namespace (dialogue ids repeat across conversations); each dialogue turn is an item, and each QA's evidence ids are its gold set. Questions without evidence (e.g. adversarial) are skipped.
func LoadLoCoMoSessions ¶ added in v0.0.4
LoadLoCoMoSessions loads LoCoMo at SESSION granularity: each conversation session becomes one document (its turns concatenated), and a question's gold set is the session(s) holding its evidence turns. This matches how session-level memory systems (e.g. MemPalace) score LoCoMo, enabling an apples-to-apples comparison; LoadLoCoMo scores the harder turn granularity.
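The turn-to-session mapping described above can be sketched like this. It is an illustration only: sessionGold is a hypothetical helper, and it assumes evidence IDs have the form "D<session>:<turn>" (for example "D1:3"), which may not match how the loader actually parses the file.

```go
package main

import (
	"fmt"
	"strings"
)

// sessionGold collapses turn-level evidence IDs to the distinct sessions
// that contain them, preserving first-seen order.
// Assumption: an evidence ID looks like "D1:3" (session "D1", turn 3).
func sessionGold(evidence []string) []string {
	seen := map[string]bool{}
	var gold []string
	for _, id := range evidence {
		session, _, ok := strings.Cut(id, ":")
		if !ok || session == "" {
			continue // skip IDs that do not follow the assumed format
		}
		if !seen[session] {
			seen[session] = true
			gold = append(gold, session)
		}
	}
	return gold
}

func main() {
	// Two evidence turns in session D1 and one in D4 give a gold set of two sessions.
	fmt.Println(sessionGold([]string{"D1:3", "D1:7", "D4:2"}))
}
```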
func Poison ¶ added in v0.0.11
Poison returns a copy of ds with perGroup debris items added to every group that has questions, simulating a low-quality bulk import (e.g. a mem0 export of restatements) collapsed into the namespace. The debris shares one content template so that a dedup pass clusters and collapses it, modelling the realistic "exports are full of near-duplicates" case. Use it to measure the Recall@K drop a poisoned store suffers, and to check that dedup/curation recovers it.
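A rough sketch of the poisoning step, under stated assumptions: the types are trimmed local copies of the documented structs, poison is a hypothetical stand-in for Poison, and the debris ID and content formats are invented for illustration.

```go
package main

import "fmt"

// Trimmed local copies of the documented types (illustration only).
type Item struct{ ID, Content, Group string }
type Question struct {
	Query string
	Gold  []string
	Group string
}
type Dataset struct {
	Name      string
	Items     []Item
	Questions []Question
}

// poison returns a copy of ds with perGroup near-duplicate items appended
// to every group that has at least one question. The input is not mutated.
func poison(ds *Dataset, perGroup int, filler string) *Dataset {
	out := &Dataset{
		Name:      ds.Name,
		Items:     append([]Item(nil), ds.Items...),
		Questions: append([]Question(nil), ds.Questions...),
	}
	seen := map[string]bool{}
	for _, q := range ds.Questions {
		if seen[q.Group] {
			continue
		}
		seen[q.Group] = true
		for i := 0; i < perGroup; i++ {
			out.Items = append(out.Items, Item{
				ID:      fmt.Sprintf("debris-%s-%d", q.Group, i), // hypothetical ID scheme
				Content: fmt.Sprintf("%s (restatement %d)", filler, i),
				Group:   q.Group,
			})
		}
	}
	return out
}

func main() {
	ds := &Dataset{
		Name:      "tiny",
		Items:     []Item{{"m1", "Alice adopted a cat.", "a"}, {"m2", "Bob bought a bike.", "b"}},
		Questions: []Question{{Query: "What pet does Alice have?", Gold: []string{"m1"}, Group: "a"}},
	}
	poisoned := poison(ds, 3, "The user mentioned something about pets")
	fmt.Println(len(ds.Items), len(poisoned.Items))
}
```

Scoring the same System on ds and on the poisoned copy gives the Recall@K delta the description refers to.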
type DocMode ¶ added in v0.0.4
type DocMode string
DocMode selects how a LongMemEval haystack session is rendered into one embedded item, for the vector-leg document-construction experiment. It is passed to LoadLongMemEval, which converts a LongMemEval file: each haystack session becomes an item, and each question's answer_session_ids become its gold set.
const (
	// DocFull renders "role: content\n" for every turn (the production shape).
	DocFull DocMode = "full"
	// DocUserOnly renders only user turns, with no role prefixes (MemPalace's
	// raw mode: assistant turns dilute the vector leg on user-question recall).
	DocUserOnly DocMode = "user-only"
	// DocDated prefixes the full session with its date, so temporal questions
	// have a textual anchor the embedder can see.
	DocDated DocMode = "dated"
)
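To make the three modes concrete, here is a hedged sketch of how a session might be rendered under each one. Turn and render are hypothetical, and the exact layout of the date prefix in the dated mode is an assumption; only the mode names and their described behaviour come from the declarations above.

```go
package main

import (
	"fmt"
	"strings"
)

type DocMode string

const (
	DocFull     DocMode = "full"
	DocUserOnly DocMode = "user-only"
	DocDated    DocMode = "dated"
)

// Turn is a hypothetical conversation turn.
type Turn struct{ Role, Content string }

// render builds the text that would be embedded for one session.
func render(mode DocMode, date string, turns []Turn) string {
	var b strings.Builder
	if mode == DocDated {
		b.WriteString(date + "\n") // assumed prefix layout
	}
	for _, t := range turns {
		if mode == DocUserOnly {
			if t.Role == "user" {
				b.WriteString(t.Content + "\n") // no role prefix
			}
			continue
		}
		b.WriteString(t.Role + ": " + t.Content + "\n")
	}
	return b.String()
}

func main() {
	turns := []Turn{{"user", "I moved to Lisbon."}, {"assistant", "How are you settling in?"}}
	for _, m := range []DocMode{DocFull, DocUserOnly, DocDated} {
		fmt.Printf("--- %s ---\n%s", m, render(m, "2023-05-20", turns))
	}
}
```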
type Item ¶
type Item struct {
ID string `json:"id"`
Content string `json:"content"`
Group string `json:"group,omitempty"`
Time time.Time `json:"-"`
}
Item is one memory to ingest; Group scopes it to a namespace, empty falls back to a shared default. Time, when set, is the memory's source timestamp (used to ground recency in the recency-aware re-ranking comparison).
type Question ¶
type Question struct {
Query string `json:"query"`
Gold []string `json:"gold"`
Group string `json:"group,omitempty"`
Answer string `json:"answer,omitempty"`
Category string `json:"category,omitempty"`
Now time.Time `json:"-"`
}
Question is a query plus the gold memory IDs it should retrieve. Group must match its items; Answer/Category are populated for QA evaluation where available. Now, when set, is the query's reference time (e.g. the question date) — the "now" against which recency is measured.
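The JSON tags on Dataset, Item, and Question imply a file shape like the one below. This sketch uses local copies of the three structs and a hypothetical parse helper; that LoadFile reads exactly this shape is an assumption. Note that Time and Now are tagged `json:"-"`, so they never come from the JSON itself.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Local copies of the documented structs, with the same JSON tags.
type Item struct {
	ID      string    `json:"id"`
	Content string    `json:"content"`
	Group   string    `json:"group,omitempty"`
	Time    time.Time `json:"-"`
}

type Question struct {
	Query    string    `json:"query"`
	Gold     []string  `json:"gold"`
	Group    string    `json:"group,omitempty"`
	Answer   string    `json:"answer,omitempty"`
	Category string    `json:"category,omitempty"`
	Now      time.Time `json:"-"`
}

type Dataset struct {
	Name      string     `json:"name"`
	Items     []Item     `json:"items"`
	Questions []Question `json:"questions"`
}

// sample is invented data for illustration.
const sample = `{
  "name": "tiny",
  "items": [
    {"id": "m1", "content": "Alice adopted a cat in March.", "group": "conv-1"}
  ],
  "questions": [
    {"query": "What pet does Alice have?", "gold": ["m1"], "group": "conv-1", "category": "single-hop"}
  ]
}`

// parse decodes a normalized dataset from JSON text.
func parse(raw string) (*Dataset, error) {
	var ds Dataset
	if err := json.Unmarshal([]byte(raw), &ds); err != nil {
		return nil, err
	}
	return &ds, nil
}

func main() {
	ds, err := parse(sample)
	if err != nil {
		panic(err)
	}
	// Time stays zero because the field is excluded from JSON.
	fmt.Println(ds.Name, len(ds.Items), len(ds.Questions), ds.Items[0].Time.IsZero())
}
```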
type RerankGateResult ¶ added in v0.4.13
type RerankGateResult struct {
Threshold float64 `json:"threshold"`
PosRecallAtK float64 `json:"pos_recall_at_k"`
NegInjectionRate float64 `json:"neg_injection_rate"`
}
RerankGateResult is one cross-encoder relevance-score threshold's effect under a per-query gate: if a query's best rerank score (over its recall pool) is below the threshold, the gate treats the query as having nothing relevant and recall returns empty. Positive = own namespace (recall must survive); negative = a foreign namespace (injection must collapse). Cross-encoders emit calibrated absolute relevance, unlike bi-encoder cosine; this sweep measures whether that separation is real.
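The per-query gate itself is simple to sketch. In the illustration below, scoredID and gate are hypothetical names, and the scores stand in for whatever the cross-encoder returns; only the rule (empty result when the best score is below the threshold) comes from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// scoredID is a hypothetical candidate with its rerank score.
type scoredID struct {
	ID    string
	Score float64
}

// gate orders the pool by score and returns the top k IDs, or nil when
// the best score in the pool is below the threshold.
func gate(pool []scoredID, threshold float64, k int) []string {
	sorted := append([]scoredID(nil), pool...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Score > sorted[j].Score })
	if len(sorted) == 0 || sorted[0].Score < threshold {
		return nil
	}
	var ids []string
	for i := 0; i < len(sorted) && i < k; i++ {
		ids = append(ids, sorted[i].ID)
	}
	return ids
}

func main() {
	pool := []scoredID{{"a", 0.2}, {"b", 0.9}, {"c", 0.5}}
	fmt.Println(gate(pool, 0.8, 2))  // best score clears the gate
	fmt.Println(gate(pool, 0.95, 2)) // best score is below the gate
}
```

A positive query loses recall only when its best score falls under the gate; a negative query counts as an injection whenever its best score clears it.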
func RerankGateSweep ¶ added in v0.4.13
func RerankGateSweep(
	ctx context.Context,
	st store.Store,
	e embed.Embedder,
	ce *rerank.CrossEncoder,
	ds *Dataset,
	k, pool int,
	thresholds []float64,
	queryPrefix string,
) ([]RerankGateResult, error)
RerankGateSweep ingests once, then for every question reranks its recall pool (hybrid fusion + composite, top `pool`) against the query in its own namespace (positive) and in a foreign namespace (negative), recording the top rerank score and whether the gold lands in the reranked top-k. It reports the top rerank-score distribution and, per threshold, positive recall@k vs negative injection. Negatives pair each question with the next question's namespace.
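The negative pairing can be sketched in a few lines. negativeGroups is a hypothetical helper, and wrapping the last question around to the first question's namespace is an assumption; the description only says each question is paired with the next question's namespace.

```go
package main

import "fmt"

// negativeGroups returns, for each question's group, the foreign group it
// is paired with: the group of the next question in order.
// Assumption: the last question wraps around to the first.
func negativeGroups(questionGroups []string) []string {
	n := len(questionGroups)
	neg := make([]string, n)
	for i := range questionGroups {
		neg[i] = questionGroups[(i+1)%n]
	}
	return neg
}

func main() {
	fmt.Println(negativeGroups([]string{"g1", "g2", "g3"}))
}
```

The pairing yields a true negative only when adjacent questions live in different namespaces.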
type RerankResult ¶
type RerankResult struct {
System string
Category string
Questions int
RecallAt1 float64
RecallAtK float64
MRR float64
}
RerankResult is one ranking strategy's score over a question set.
func LLMRerankCompare ¶ added in v0.0.4
func LLMRerankCompare(
	ctx context.Context,
	st store.Store,
	e embed.Embedder,
	rr rerank.Reranker,
	ds *Dataset,
	k, fetch int,
	queryPrefix string,
) ([]RerankResult, error)
LLMRerankCompare measures the with-LLM read-side rerank lift on pure retrieval. For each question it builds the production candidate order (hybrid score fusion -> composite re-rank), then re-orders the top `fetch` with an LLM reranker, and scores recall@1/@k and MRR for both. The LLM tier is slow (one chat call per question), so drive it over a subset with cmd/bench -limit.
func RerankCompare ¶
func RerankCompare(
	ctx context.Context,
	st store.Store,
	e embed.Embedder,
	ds *Dataset,
	cats []string,
	k int,
	queryPrefix string,
) ([]RerankResult, error)
RerankCompare isolates the effect of recency-aware re-ranking: it ingests ds (items carry source timestamps), then for each selected question scores the SAME fused candidate set two ways — pure RRF order vs the composite re-ranker using the question's reference time. Reports recall@1, recall@K, and MRR per category and overall, for both strategies. cats empty means all categories.
type Result ¶
type Result struct {
System string `json:"system"`
Dataset string `json:"dataset"`
K int `json:"k"`
Questions int `json:"questions"`
RecallAtK float64 `json:"recall_at_k"`
MRR float64 `json:"mrr"`
P50Millis float64 `json:"p50_ms"`
P95Millis float64 `json:"p95_ms"`
IngestMs float64 `json:"ingest_ms"`
PerCategory map[string]float64 `json:"per_category,omitempty"`
}
Result is one system's score on a dataset at a given K.
type System ¶
type System interface {
Name() string
Ingest(ctx context.Context, items []Item) error
Recall(ctx context.Context, group, query string, k int) ([]string, error)
}
System is a memory system under test.
func MeminiSystems ¶
func MeminiSystems(
	st store.Store,
	e embed.Embedder,
	concurrency int,
	queryPrefix string,
	fusionAlpha float64,
	poolFactor, poolFloor int,
) []System
MeminiSystems returns the hybrid, vector-only, and keyword-only retrieval strategies sharing one ingested store. queryPrefix, when non-empty, is prepended to query embeddings (hybrid and vector legs), matching MEMINI_EMBED_QUERY_PREFIX in production. fusionAlpha < 0 uses RRF; >= 0 uses convex-combination score fusion with that vector weight. poolFactor/poolFloor override hybrid recall's per-leg pool sizing (non-positive keeps defaults).
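The fusionAlpha switch described above can be sketched as follows. This is an illustration under assumptions, not the package's fusion code: fuse and rankByScore are hypothetical, the RRF constant 60 is a conventional choice rather than a documented one, and the convex branch assumes both legs' scores are already normalized to a comparable range.

```go
package main

import (
	"fmt"
	"sort"
)

// rankByScore returns IDs ordered by descending score, ties broken by ID.
func rankByScore(scores map[string]float64) []string {
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j]
	})
	return ids
}

// fuse merges the vector and keyword legs (ID -> score).
// alpha < 0 uses reciprocal rank fusion; alpha >= 0 uses
// alpha*vector + (1-alpha)*keyword.
func fuse(vec, kw map[string]float64, alpha float64) []string {
	fused := map[string]float64{}
	if alpha < 0 {
		const rrfK = 60 // assumed constant
		for _, leg := range []map[string]float64{vec, kw} {
			for rank, id := range rankByScore(leg) {
				fused[id] += 1 / float64(rrfK+rank+1)
			}
		}
	} else {
		for id, s := range vec {
			fused[id] += alpha * s
		}
		for id, s := range kw {
			fused[id] += (1 - alpha) * s
		}
	}
	return rankByScore(fused)
}

func main() {
	vec := map[string]float64{"a": 0.9, "b": 0.2}
	kw := map[string]float64{"b": 1.0, "c": 0.5}
	fmt.Println("rrf:      ", fuse(vec, kw, -1))
	fmt.Println("alpha=0.5:", fuse(vec, kw, 0.5))
}
```

RRF uses only each leg's ranks, so it is insensitive to score scale; the convex combination uses the scores themselves, which is why the normalization assumption matters there.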
type VecGateResult ¶ added in v0.4.13
type VecGateResult struct {
Threshold float64 `json:"threshold"`
PosRecallAtK float64 `json:"pos_recall_at_k"`
NegInjectionRate float64 `json:"neg_injection_rate"`
}
VecGateResult is one absolute-vector-score threshold's effect under a per-query semantic-relevance gate: if a query's best raw vector score (1/(1+L2)) is below the threshold, the gate treats the query as having nothing relevant and recall returns empty. Positive = each query against its own namespace (recall must survive); negative = the same query against a foreign namespace (injection must collapse). The right default is the knee: the highest threshold at which PosRecallAtK is roughly unchanged but NegInjectionRate has dropped.
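One way to read "the knee" programmatically is sketched below. pickKnee and its tolerance parameter are hypothetical; the sketch assumes rows are ordered lowest threshold first and treats the first row as the ungated baseline. Because a higher gate can only block more foreign queries, the highest threshold that preserves recall also has the lowest injection rate among the candidates.

```go
package main

import "fmt"

// VecGateResult is a local copy of the documented struct (tags omitted).
type VecGateResult struct {
	Threshold        float64
	PosRecallAtK     float64
	NegInjectionRate float64
}

// pickKnee returns the highest-threshold row whose positive recall is
// within tol of the baseline row (rows[0]). It reports false for no rows.
func pickKnee(rows []VecGateResult, tol float64) (VecGateResult, bool) {
	if len(rows) == 0 {
		return VecGateResult{}, false
	}
	base := rows[0].PosRecallAtK
	best := rows[0]
	for _, r := range rows[1:] {
		if base-r.PosRecallAtK <= tol && r.Threshold > best.Threshold {
			best = r
		}
	}
	return best, true
}

func main() {
	// Invented numbers for illustration only.
	rows := []VecGateResult{
		{0.0, 0.90, 1.00},
		{0.3, 0.90, 0.60},
		{0.5, 0.89, 0.20},
		{0.7, 0.60, 0.05},
	}
	knee, _ := pickKnee(rows, 0.02)
	fmt.Printf("threshold=%.1f recall=%.2f injection=%.2f\n",
		knee.Threshold, knee.PosRecallAtK, knee.NegInjectionRate)
}
```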
func VecGateSweep ¶ added in v0.4.13
func VecGateSweep(
	ctx context.Context,
	st store.Store,
	e embed.Embedder,
	ds *Dataset,
	k int,
	thresholds []float64,
	concurrency int,
	queryPrefix string,
	fusionAlpha float64,
) ([]VecGateResult, error)
VecGateSweep ingests once, then for every question measures the top raw vector score in its own namespace (positive) and in a foreign namespace (negative), plus whether the real fused recall already retrieves the gold. It reports, per threshold, the per-query gate's effect: positive recall@k (lost only when the own-namespace top vector score falls below the gate) and negative injection rate (a foreign query passes the gate when its top vector score clears it). Negatives pair each question with the next question's namespace; group ids are unique per question, so the paired namespace never holds the answer.
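The per-threshold aggregation described above can be sketched as a pure function over per-question observations. The obs type and sweep function are hypothetical; only the two rules come from the description: a positive counts when fused recall already hit the gold and the own-namespace top score clears the gate, and a negative counts as an injection when the foreign-namespace top score clears it.

```go
package main

import "fmt"

// obs is a hypothetical per-question measurement.
type obs struct {
	PosTop float64 // top raw vector score in the question's own namespace
	NegTop float64 // top raw vector score in the paired foreign namespace
	Hit    bool    // fused recall retrieved the gold in the top k
}

// VecGateResult is a local copy of the documented struct (tags omitted).
type VecGateResult struct {
	Threshold        float64
	PosRecallAtK     float64
	NegInjectionRate float64
}

// vecScore maps an L2 distance to the raw score form 1/(1+L2).
func vecScore(l2 float64) float64 { return 1 / (1 + l2) }

// sweep computes one row per threshold from the observations.
func sweep(observations []obs, thresholds []float64) []VecGateResult {
	rows := make([]VecGateResult, 0, len(thresholds))
	n := float64(len(observations))
	for _, th := range thresholds {
		var pos, neg int
		for _, o := range observations {
			if o.Hit && o.PosTop >= th {
				pos++ // recall survives the gate
			}
			if o.NegTop >= th {
				neg++ // foreign query passes the gate
			}
		}
		row := VecGateResult{Threshold: th}
		if n > 0 {
			row.PosRecallAtK = float64(pos) / n
			row.NegInjectionRate = float64(neg) / n
		}
		rows = append(rows, row)
	}
	return rows
}

func main() {
	// Invented observations for illustration only.
	observations := []obs{
		{0.8, 0.3, true}, {0.6, 0.5, true}, {0.4, 0.2, false}, {0.9, 0.7, true},
	}
	for _, r := range sweep(observations, []float64{0, 0.55}) {
		fmt.Printf("%.2f pos=%.2f neg=%.2f\n", r.Threshold, r.PosRecallAtK, r.NegInjectionRate)
	}
	fmt.Println(vecScore(1)) // an L2 distance of 1 maps to a score of 0.5
}
```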