bench

package

v0.4.13 Published: Jun 20, 2026 License: AGPL-3.0 Imports: 17 Imported by: 0

Repository

github.com/eleboucher/memini

README ¶

memini benchmark harness

A retrieval benchmark: ingest a dataset of memories, then for each question measure how well a system retrieves the gold supporting memories.

mise run bench                 # offline sample, local embedder
go run ./cmd/bench -k 5        # same, explicit K

Against a real embeddings model and a real dataset:

export MEMINI_EMBED_BASE_URL=http://localhost:8081/v1
export MEMINI_EMBED_MODEL=bge-m3 MEMINI_EMBED_DIMS=1024
# Optional: instruction-tuned asymmetric embedders (Qwen3-Embedding, bge) score
# higher when queries carry a retrieval instruction; documents stay bare.
# Measured on Qwen3-Embedding-8B: +6.0pp R@5 on the LongMemEval vector leg
# (91.2% -> 97.2%), +1.0pp MRR on the fused ranking on both datasets.
export MEMINI_EMBED_QUERY_PREFIX=$'Instruct: Given a user query, retrieve relevant memories that answer it\nQuery:'
go run ./cmd/bench -suite longmemeval -data ./longmemeval_s.json -k 5
go run ./cmd/bench -suite locomo      -data ./locomo.json        -k 5

# Isolate the recency-aware re-ranker against pure RRF on the same candidates,
# using each question's date as "now" (needs a timestamped dataset):
go run ./cmd/bench -suite longmemeval -data ./longmemeval_s.json -rerank -k 5

Full results

Everything this harness measures, in one table — sourced from the committed results/ JSON, all on the same all-MiniLM-L6-v2 (384-d) endpoint. Cells are recall_any@5 / @10 / MRR (%); p50 is in-process recall latency (rerank rows show the added cost). The detailed per-dataset sections below explain the methodology, sweeps, and caveats behind each column.

Strategy	LongMemEval · session	LoCoMo · turn-level	LoCoMo · session-level	p50
vector	92.6 / 95.4 / 80.7	41.3 / 51.8 / 28.1	64.1 / 79.8 / 45.2	<1 ms
keyword (Porter BM25)	97.6 / 99.0 / 92.2	58.7 / 67.1 / 44.8	92.6 / 96.8 / 79.4	~3 ms
hybrid (default, production path)	98.4 / 99.2 / 93.0	59.7 / 69.9 / 42.4	90.9 / 96.6 / 74.3	~5 ms
+ cross-encoder (`MEMINI_RERANK=<url>`)	98.4 / 99.2 / 93.1	70.9 / 75.0 / 59.8	90.9 / 96.6 / 74.3	+20–230 ms
+ LLM rerank (`MEMINI_RERANK=llm`)	98.4 / 99.2 / 93.0	74.4 / 76.5 / 67.4	—	+350–420 ms

Questions per dataset: LongMemEval 500 (session granularity), LoCoMo turn-level 1,982 (gold = exact evidence turns), LoCoMo session-level 1,981 (gold = sessions holding those turns). Rerank backends: Qwen3-Reranker-0.6B (cross-encoder) and Qwen3.5-9B (LLM). Reproduce with the per-suite commands in the sections below (-suite longmemeval, locomo, locomo-sessions; add -rerank-url/-llm-rerank for the rerank rows).

Reading it: on LongMemEval hybrid leads both single legs. On LoCoMo-session it trails keyword slightly (90.9 vs 92.6 R@5, 96.6 vs 96.8 R@10), where keyword's exact-token match is already near-ceiling. On turn-level LoCoMo base recall has real headroom, so the rerank tier earns its keep: the cross-encoder lands +11pp R@5 / +17pp MRR over hybrid at a fraction of the LLM's latency, and the LLM reaches +15pp / +25pp over hybrid if you already run a chat model. Where recall is already at ceiling (both session sets), reranking is a measured no-op.

Results: memini vs other memory systems

All memini numbers below are measured by this harness against a live all-MiniLM-L6-v2 (384-d) endpoint — the same embedding model agentmemory benchmarks with. Competitor numbers are cited from their own publications — we cannot re-run their systems here, and they use different embedding models, readers, and judges. Treat cross-system rows as directional, not a controlled head-to-head. (This mirrors how agentmemory documents its comparison.)

LongMemEval-S — retrieval `recall_any@K`

Full 500-question LongMemEval-S (~48 sessions/question), scored with the same metric agentmemory reports: does any gold session appear in the top-K retrieved? No LLM is in the loop; this is pure retrieval. The run uses the identical embedding model agentmemory benchmarks with (all-MiniLM-L6-v2, 384-d), for an apples-to-apples comparison.

Hybrid recall over-fetches a deep candidate pool per leg (max(k*5, 50)) before fusing, so a memory just outside the top-k of both legs can still win — the production Recall path does the same. Fusion defaults to convex score fusion (MEMINI_FUSION_ALPHA=0.5): each leg's scores are min-max normalized to [0,1] and combined 0.5·vector + 0.5·keyword, keeping score magnitude so a memory a leg ranks far above its runners-up dominates one that is merely middling in both. A negative alpha falls back to Reciprocal Rank Fusion; deep pools then need a steep decay (rrfK=5, not the classic 60), since a flat decay lets both-leg mediocrity outscore single-leg excellence (2/(60+20) > 1/(60+0)). Score fusion gets the same effect from score magnitude directly, and beat RRF on 3 of 4 model×dataset cells (and on MRR in all 4).
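The two fusion modes described above can be sketched as follows. This is a minimal illustration, not memini's implementation: the function names (poolSize, minMax, scoreFuse, rrf, ranked, demoLegs) are hypothetical, ranks are assumed to be 0-based (which is what the 2/(60+20) vs 1/(60+0) arithmetic implies), and a candidate missing from one leg is assumed to contribute 0 for that leg.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// poolSize is the per-leg over-fetch depth quoted above: max(k*5, 50).
func poolSize(k int) int {
	if k*5 > 50 {
		return k * 5
	}
	return 50
}

// minMax rescales one leg's raw scores to [0,1]. A leg whose scores are all
// equal maps to 0 (an assumption; the source does not cover that case).
func minMax(scores map[string]float64) map[string]float64 {
	out := make(map[string]float64, len(scores))
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, s := range scores {
		lo, hi = math.Min(lo, s), math.Max(hi, s)
	}
	for id, s := range scores {
		out[id] = 0
		if hi > lo {
			out[id] = (s - lo) / (hi - lo)
		}
	}
	return out
}

// scoreFuse is convex score fusion: alpha*vector + (1-alpha)*keyword over
// min-max normalized scores, so alpha is the vector weight.
func scoreFuse(vec, kw map[string]float64, alpha float64) map[string]float64 {
	fused := map[string]float64{}
	for id, s := range minMax(vec) {
		fused[id] += alpha * s
	}
	for id, s := range minMax(kw) {
		fused[id] += (1 - alpha) * s
	}
	return fused
}

// rrf is Reciprocal Rank Fusion: each leg adds 1/(k+rank) for its candidates.
func rrf(legs [][]string, k float64) map[string]float64 {
	fused := map[string]float64{}
	for _, leg := range legs {
		for rank, id := range leg {
			fused[id] += 1 / (k + float64(rank))
		}
	}
	return fused
}

// ranked orders ids by fused score, best first, breaking ties by id.
func ranked(scores map[string]float64) []string {
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j]
	})
	return ids
}

// demoLegs puts "top" at rank 0 of the vector leg only and "mid" at rank 20
// of both legs, reproducing the flat-decay example from the text.
func demoLegs() [][]string {
	vec, kw := []string{"top"}, []string{"k0"}
	for i := 1; i < 20; i++ {
		vec = append(vec, fmt.Sprintf("v%d", i))
		kw = append(kw, fmt.Sprintf("k%d", i))
	}
	return [][]string{append(vec, "mid"), append(kw, "mid")}
}

func main() {
	legs := demoLegs()
	for _, k := range []float64{60, 5} {
		f := rrf(legs, k)
		fmt.Printf("rrfK=%v: mid=%.4f top=%.4f\n", k, f["mid"], f["top"])
	}
	vec := map[string]float64{"a": 9, "b": 5, "c": 1}
	kw := map[string]float64{"b": 3, "c": 2, "d": 1}
	fmt.Println("pool for k=5:", poolSize(5), "order:", ranked(scoreFuse(vec, kw, 0.5)))
}
```

With rrfK=60 the both-leg rank-20 candidate outscores the single-leg rank-0 one (2/80 vs 1/60); with rrfK=5 the order flips (2/25 vs 1/5), which is the steep-decay effect the paragraph describes.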

System	Embedding model	R@5	R@10	Source
memini — hybrid (score)	all-MiniLM-L6-v2 (384-d)	98.4%	99.4%	measured
memini — keyword (Porter BM25)	—	97.6%	99.0%	measured
memini — vector	all-MiniLM-L6-v2	91.8%	96.6%	measured
agentmemory — BM25 + Vector	all-MiniLM-L6-v2	95.2%	98.6%	published
agentmemory — BM25 only	—	86.2%	94.6%	published
MemPalace (vector only)	larger model	~96.6%	—	self-reported

On the same model/dataset/metric (full 500 questions), memini hybrid beats agentmemory at R@5 (98.4% vs 95.2%), R@10 (99.4% vs 98.6%), and MRR (92.3% vs 88.2%). memini's keyword leg is +11.4pp over agentmemory's BM25-only (97.6% vs 86.2%) thanks to Porter stemming, and hybrid fusion now beats either leg alone. Relative to fetching only k per leg with the classic rrfK=60, the deep-pool + score fusion is worth +2.0pp R@5 / +1.0pp R@10.

LoCoMo — retrieval `recall_any@K`

LoCoMo retrieval at dialogue-turn granularity (1,982 questions over 10 long conversations, gold = exact evidence turns among ~590 turns/conversation) — a much harder target than LongMemEval's session granularity, and the regime where flat-decay RRF over deep pools degrades badly.

System (all-MiniLM-L6-v2)	R@5	R@10
memini — hybrid (score)	59.8%	69.8%
memini — keyword (Porter BM25)	58.7%	67.1%
memini — vector	41.5%	52.1%

No published turn-level retrieval baselines exist to compare against (mem0 / Letta report LLM-judged QA accuracy, below). This is the one cell where the default score fusion is edged by RRF (60.1% / 71.0%): when the vector leg is near-noise (MiniLM scores only 41.5% here), giving it an equal-weight normalized vote hurts, whereas RRF's rank-only vote is more robust. Score fusion still wins this cell on MRR and wins outright on every cell with a stronger embedder — so it is the default, and MEMINI_FUSION_ALPHA=-1 selects RRF for weak-vector deployments. (Ablation: rrfK=60 over the same deep pools scored just 52.8% R@5, below the keyword leg alone — both score fusion and rrfK=5 fix that.)

Pool-depth robustness (`-pool-factor` / `-pool-floor`)

Min-max normalization could in principle be fragile to pool depth (the score at the bottom of the pool sets each leg's zero point), so score fusion was swept at per-leg depths 30 / 50 / 80 on both datasets and both embedders (hybrid R@5 / R@10 / MRR):

cell	depth 30	depth 50 (default)	depth 80
LME · MiniLM	97.8 / 99.4 / 92.0	98.4 / 99.4 / 92.3	98.6 / 99.4 / 92.6
LME · Qwen3+prefix	98.8 / 99.4 / 94.5	98.8 / 99.6 / 94.6	98.8 / 99.6 / 94.6
LoCoMo · MiniLM	60.0 / 70.1 / 42.1	59.8 / 69.8 / 42.6	59.3 / 69.6 / 42.7
LoCoMo · Qwen3+prefix	70.1 / 77.9 / 52.1	70.1 / 78.5 / 52.4	70.1 / 78.7 / 52.5

Quality moves at most ±0.6pp R@5 across a 2.7× depth range — no tail collapse — with the two datasets drifting in opposite directions (deeper pools help session-granularity LongMemEval slightly and hurt turn-granularity LoCoMo slightly), so the default max(k*5, 50) sits at the crossover.

Recency-aware re-ranking (`-rerank`)

memini re-ranks the fused candidates by a composite of relevance, recency, and importance. The recency weight is deliberately light (0.05): a sweep on LongMemEval-S (knowledge-update + temporal-reasoning, q.Now = question date, sessions timestamped from haystack_dates) shows recency is a net win only as a tie-breaker, and actively harmful when over-weighted.

recency weight	R@1 (both cats)	knowledge-update R@1	temporal R@1	MRR
0 (pure RRF)	82.9%	91.0%	78.2%	90.1%
0.05 (default)	83.4%	91.0%	78.9%	90.5%
0.15	83.9%	89.7%	80.5%	90.7%
0.25	83.4%	87.2%	81.2%	90.4%

At 0.05 the re-ranker is +0.5pp R@1 / +0.4pp MRR over pure RRF with no knowledge-update cost, and recall@5 is identical across all weights (the re-rank only reorders within the top results). The steep RRF decay made the composite far more robust to the recency weight than the flat rrfK=60 decay was (where 0.15+ buried correct-but-older memories); the default stays at the conservative 0.05 since the gains beyond it are within noise.

Temporal targeting (`temporal0.40`)

Recency weighting trades off against itself: raising it helps temporal-reasoning (78.2→81.2% R@1) but hurts knowledge-update (91.0→87.2%), whose answers aren't necessarily recent. Temporal targeting avoids that: when a query names a relative time ("three weeks ago"), it computes target = now − offset and boosts candidates dated near that point, not near now. It only fires on temporal queries, so other categories are unaffected.

Strategy	all R@1	knowledge-update R@1	temporal-reasoning R@1	MRR
recency 0.05 (prior default)	83.4%	91.0%	78.9%	90.5%
recency 0.25	83.4%	87.2%	81.2%	90.4%
temporal 0.40	85.3%	91.0%	82.0%	91.5%

Temporal targeting is +1.9pp R@1 overall over the recency default and beats even the heaviest recency weight on temporal-reasoning without the knowledge-update regression — so it ships on in production (MEMINI_TEMPORAL_BOOST=0.40, 0 disables). The no-LLM regex extractor only catches templated phrasing; an LLM anchor extractor (plugging into the same search.AnchorExtractor interface) can resolve looser references and is the intended with-LLM tier.

Held-out split (`-holdout`)

To avoid overfitting tuning decisions to the full benchmark, -holdout splits LongMemEval deterministically by load order: every 10th question is held (50/500), the rest are tune (450/500). Sweep parameters on -holdout tune, then report the final number on -holdout held (unseen). Default all runs the full set. Results files are suffixed (longmemeval-held.json) so splits don't overwrite each other.
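A deterministic load-order split of this kind can be sketched as below. The helper name holdoutIndices is hypothetical, and the choice of which position counts as "every 10th" (0-based indices 9, 19, 29, ...) is an assumption; the source only fixes the 50/450 proportions.

```go
package main

import "fmt"

// holdoutIndices returns the question indices (in load order) that belong to
// the named split: "all", "tune", or "held".
// Assumption: the held questions are the 10th, 20th, ... in load order.
func holdoutIndices(n int, split string) ([]int, error) {
	if split != "all" && split != "tune" && split != "held" {
		return nil, fmt.Errorf("unknown split %q", split)
	}
	var out []int
	for i := 0; i < n; i++ {
		held := i%10 == 9
		if split == "all" || (split == "held") == held {
			out = append(out, i)
		}
	}
	return out, nil
}

func main() {
	for _, split := range []string{"all", "tune", "held"} {
		idx, err := holdoutIndices(500, split)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-4s %d questions\n", split, len(idx))
	}
}
```

Because the split depends only on load order, re-running a sweep on tune and a final report on held always sees the same questions.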

Measured (memini-hybrid, all-MiniLM-L6-v2 — the parameters were swept on tune, not held):

Split	Questions	R@5	R@10	MRR
`all` (full)	500	98.4	99.2	93.0
`tune`	450	98.2	99.1	93.0
`held`	50	100.0	100.0	93.5

The held split does not regress against tune, so the tuning choices generalize (no tuned-to-test inflation). Per-category R@5:

Category	`tune` (450)	`held` (50)
knowledge-update	100.0	100.0
multi-session	99.2	100.0
single-session-assistant	100.0	100.0
single-session-user	96.8	100.0
temporal-reasoning	98.3	100.0
single-session-preference	88.9	100.0

Read the per-category numbers off tune (450 questions); it shows the real headroom is single-session-preference (88.9% R@5). On held each category is only 2–13 questions, so its across-the-board 100% is small-sample, not a separate claim of perfection.

Session-doc construction (`-session-doc`)

LongMemEval sessions are embedded as one document per session; -session-doc controls what text that document contains, to measure the vector leg's sensitivity to document shape:

full (default) — "role: content" for every turn.
user-only — only the user turns, no role prefixes. Assistant turns dilute the embedding for user-question recall; this is the shape MemPalace reports 96.6% R@5 vector-only with on the same MiniLM model.
dated — full prefixed with the session date, giving temporal questions a textual anchor embeddings would otherwise ignore.

Compare the vector row's recall_any@5 across modes (cached embeddings make the sweep cheap); the keyword and hybrid rows shift too but the vector leg is the target.

memini hybrid per-category (all-MiniLM, recall_any@10): multi-session 100%, knowledge-update 100%, single-session-user 98.6%, single-session-assistant 98.2%, temporal-reasoning 97.0%, single-session-preference 96.7%.

Rerank tier — cross-encoder vs LLM (`-rerank-url` / `-llm-rerank`)

The read-side rerank reorders the top of the production candidate order. The bench drives either backend through the same comparison. Each question costs one reranker call, so cap the run with -limit:

# cross-encoder (fast; e.g. Qwen3-Reranker-0.6B via llama-server --rerank):
go run ./cmd/bench -suite locomo -data ./locomo.json -rerank-url http://localhost:8002/v1 -rerank-model qwen3-reranker-0.6b -limit 100 -k 5,10
# LLM reranker (slow; MEMINI_LLM_*):
go run ./cmd/bench -suite locomo -data ./locomo.json -llm-rerank -limit 100 -k 5,10

Measured on all-MiniLM-L6-v2 (cross-encoder = Qwen3-Reranker-0.6B, LLM = Qwen3.5-9B), recall_any@5 / @10 / MRR:

Config	LongMemEval (session)	LoCoMo turn-level	added p50
hybrid (base)	98.4 / 99.2 / 93.0	59.7 / 69.9 / 42.4	—
+ cross-encoder	98.4 / 99.2 / 93.1	70.9 / 75.0 / 59.8	~20–230 ms
+ LLM rerank	98.4 / 99.2 / 93.0	74.4 / 76.5 / 67.4	~350–420 ms

Reranking is a no-op at recall ceiling (session-level) and a big win where recall has headroom (turn-level: +11pp R@5 / +17pp MRR for the cross-encoder, +15pp / +25pp for the LLM). The cross-encoder captures most of the LLM's lift at a fraction of the latency with no chat model — the recommended production rerank (MEMINI_RERANK=<url>); the LLM tier (MEMINI_RERANK=llm) buys the last points if you already run one.

LoCoMo — end-to-end QA accuracy (LLM-judge)

The metric mem0/Letta publish: retrieve → generate an answer → an LLM judges it against the gold answer. memini's number uses a fast instruct reader+judge (Llama-3.3-70B-Instruct); the competitor numbers use their own readers/judges, so this is directional.

System	LoCoMo QA accuracy	Source
memini (hybrid retrieval + instruct reader)	full run pending	measured
Letta / MemGPT	83.2%	published
Mem0	68.5%	published

Sources: agentmemory COMPARISON.md/LONGMEMEVAL.md; LongMemEval (arXiv 2410.10813); LoCoMo (snap-stanford.github.io/LoCoMo); mem0.ai; letta.com.

Metrics

Recall@K — fraction of questions whose gold memory appears in the top K.
MRR — mean reciprocal rank of the first gold hit.
p50/p95 — recall latency; ingest — total ingest time.
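The two quality metrics can be sketched as below. The type and function names (run, firstGoldRank, score) are hypothetical; the sketch assumes "any gold id in the top K" counts as a hit (the recall_any@K reading used elsewhere in this README) and that MRR is taken over the full retrieved list rather than truncated at K.

```go
package main

import "fmt"

// run is one question's outcome: the ranked ids a system returned and the
// gold ids it should have found.
type run struct {
	Retrieved []string
	Gold      []string
}

// firstGoldRank returns the 1-based rank of the first gold id in retrieved,
// or 0 when no gold id was retrieved.
func firstGoldRank(retrieved, gold []string) int {
	isGold := make(map[string]bool, len(gold))
	for _, id := range gold {
		isGold[id] = true
	}
	for i, id := range retrieved {
		if isGold[id] {
			return i + 1
		}
	}
	return 0
}

// score computes recall_any@K (fraction of questions with a gold id in the
// top K) and MRR (mean of 1/rank of the first gold hit, 0 on a miss).
func score(runs []run, k int) (recall, mrr float64) {
	if len(runs) == 0 {
		return 0, 0
	}
	for _, r := range runs {
		rank := firstGoldRank(r.Retrieved, r.Gold)
		if rank == 0 {
			continue
		}
		if rank <= k {
			recall++
		}
		mrr += 1 / float64(rank)
	}
	n := float64(len(runs))
	return recall / n, mrr / n
}

func main() {
	runs := []run{
		{Retrieved: []string{"a", "b", "c", "d"}, Gold: []string{"a"}},      // rank 1
		{Retrieved: []string{"x", "y", "z", "w"}, Gold: []string{"y", "q"}}, // rank 2
		{Retrieved: []string{"m", "n"}, Gold: []string{"p"}},                // miss
		{Retrieved: []string{"e", "f", "g", "h"}, Gold: []string{"h"}},      // rank 4
	}
	for _, k := range []int{2, 5} {
		recall, mrr := score(runs, k)
		fmt.Printf("K=%d recall_any=%.2f MRR=%.4f\n", k, recall, mrr)
	}
}
```

Since the rank of the first gold hit is computed once per question, several values of K can be scored from a single retrieval pass.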

Output is a Markdown table (stdout) plus JSON under bench/results/.

What it compares today

Three memini retrieval strategies over the same ingested store, to show the value of hybrid fusion:

System	Retrieval
`memini-hybrid`	vector + keyword, score fusion (production path)
`memini-vector`	dense vector only
`memini-keyword`	BM25 keyword only

memini-hybrid should never score below either single strategy.

Datasets

sample — committed at bench/data/sample.json, runs fully offline.
Normalized schema (-suite file) — {name, items:[{id,content}], questions:[{query,gold:[id]}]}.
LongMemEval / LoCoMo — loaders map the published JSON shapes to the normalized schema (each session/turn becomes an item; answer/evidence ids become gold). Download the datasets and pass -data.
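For reference, a file in the normalized schema can be decoded and sanity-checked along these lines. The struct fields mirror the JSON tags documented for Dataset, Item, and Question further down this page; the example document, the parse helper, and its gold-id check are illustrative assumptions and are not part of the package (LoadFile is the real loader).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local mirrors of the documented JSON shape (time fields omitted, since
// they are tagged json:"-" in the package).
type Item struct {
	ID      string `json:"id"`
	Content string `json:"content"`
	Group   string `json:"group,omitempty"`
}

type Question struct {
	Query string   `json:"query"`
	Gold  []string `json:"gold"`
	Group string   `json:"group,omitempty"`
}

type Dataset struct {
	Name      string     `json:"name"`
	Items     []Item     `json:"items"`
	Questions []Question `json:"questions"`
}

// example is a made-up two-item dataset in the normalized schema.
const example = `{
  "name": "tiny",
  "items": [
    {"id": "m1", "content": "Alice adopted a cat named Miso."},
    {"id": "m2", "content": "Bob moved to Lisbon in March."}
  ],
  "questions": [
    {"query": "What is Alice's cat called?", "gold": ["m1"]}
  ]
}`

// parse decodes a normalized dataset and checks that every gold id refers to
// an item in the same group (a hypothetical sanity check, not LoadFile).
func parse(raw string) (*Dataset, error) {
	var ds Dataset
	if err := json.Unmarshal([]byte(raw), &ds); err != nil {
		return nil, err
	}
	known := map[string]bool{}
	for _, it := range ds.Items {
		known[it.Group+"\x00"+it.ID] = true
	}
	for _, q := range ds.Questions {
		for _, g := range q.Gold {
			if !known[q.Group+"\x00"+g] {
				return nil, fmt.Errorf("question %q: gold id %q not among items", q.Query, g)
			}
		}
	}
	return &ds, nil
}

func main() {
	ds, err := parse(example)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %d items, %d questions\n", ds.Name, len(ds.Items), len(ds.Questions))
}
```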

Recall@K on LongMemEval/LoCoMo is easy to overfit — treat scores as directional.

External baselines

bench.System is the extension point. To compare against mem0, Zep/Graphiti, Letta, Cognee, agentmemory, or supermemory, implement System (Name / Ingest / Recall) over each service's API and add it to the run list in cmd/bench. These require the respective services/keys and are intentionally not vendored here.

Documentation ¶

Overview ¶

Package bench is a retrieval benchmark harness: it ingests a dataset of memories and scores each question's gold retrieval (Recall@K, MRR) and latency. Runs offline on the committed sample with a deterministic local embedder, or against a real endpoint and a converted LongMemEval/LoCoMo set.

Index ¶

func Markdown(results []Result) string
func RerankGateMarkdown(rows []RerankGateResult, k int) string
func RerankMarkdown(results []RerankResult, k int) string
func VecGateMarkdown(rows []VecGateResult, k int) string
type Dataset
type DocMode
type Item
type Question
type RerankGateResult
- func RerankGateSweep(ctx context.Context, st store.Store, e embed.Embedder, ce *rerank.CrossEncoder, ...) ([]RerankGateResult, error)
type RerankResult
- func LLMRerankCompare(ctx context.Context, st store.Store, e embed.Embedder, rr rerank.Reranker, ...) ([]RerankResult, error)
- func RerankCompare(ctx context.Context, st store.Store, e embed.Embedder, ds *Dataset, ...) ([]RerankResult, error)
type Result
- func Run(ctx context.Context, sys System, ds *Dataset, ks []int) ([]Result, error)
type System
- func MeminiSystems(st store.Store, e embed.Embedder, concurrency int, queryPrefix string, ...) []System
type VecGateResult
- func VecGateSweep(ctx context.Context, st store.Store, e embed.Embedder, ds *Dataset, k int, ...) ([]VecGateResult, error)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Markdown ¶

func Markdown(results []Result) string

Markdown renders results that share a K as a comparison table, best first.

func RerankGateMarkdown ¶ added in v0.4.13

func RerankGateMarkdown(rows []RerankGateResult, k int) string

RerankGateMarkdown renders the sweep, lowest threshold first.

func RerankMarkdown ¶

func RerankMarkdown(results []RerankResult, k int) string

RerankMarkdown renders the RRF-vs-composite comparison, grouped by category.

func VecGateMarkdown ¶ added in v0.4.13

func VecGateMarkdown(rows []VecGateResult, k int) string

VecGateMarkdown renders the sweep, lowest threshold first.

Types ¶

type Dataset ¶

type Dataset struct {
	Name      string     `json:"name"`
	Items     []Item     `json:"items"`
	Questions []Question `json:"questions"`
}

Dataset is a normalized retrieval benchmark.

func LoadFile ¶

func LoadFile(path string) (*Dataset, error)

LoadFile reads a dataset in memini's normalized JSON schema.

func LoadLoCoMo ¶

func LoadLoCoMo(path string) (*Dataset, error)

LoadLoCoMo converts the published LoCoMo file into the normalized Dataset. Each conversation is its own group/namespace (dialogue ids repeat across conversations); each dialogue turn is an item, and each QA's evidence ids are its gold set. Questions without evidence (e.g. adversarial) are skipped.

func LoadLoCoMoSessions ¶ added in v0.0.4

func LoadLoCoMoSessions(path string) (*Dataset, error)

LoadLoCoMoSessions loads LoCoMo at SESSION granularity: each conversation session becomes one document (its turns concatenated), and a question's gold set is the session(s) holding its evidence turns. This matches how session-level memory systems (e.g. MemPalace) score LoCoMo, enabling an apples-to-apples comparison; LoadLoCoMo scores the harder turn granularity.

func LoadLongMemEval ¶

func LoadLongMemEval(path string, mode DocMode) (*Dataset, error)

func Poison ¶ added in v0.0.11

func Poison(ds *Dataset, perGroup int, filler string) *Dataset

Poison returns a copy of ds with perGroup debris items added to every group that has questions — simulating a low-quality bulk import (e.g. a mem0 export of restatements) collapsed into the namespace. The debris shares one content template so a dedup pass clusters and collapses it, modelling the realistic "exports are full of near-duplicates" case. Use it to measure the Recall@K delta a poisoned store suffers, and that dedup/curation recover it.

func Sample ¶

func Sample() (*Dataset, error)

Sample returns the committed offline sample dataset.

type DocMode ¶ added in v0.0.4

type DocMode string

DocMode selects how a LongMemEval haystack session is rendered into one embedded item, for the vector-leg document-construction experiment. LoadLongMemEval takes it when converting a LongMemEval file: each haystack session becomes an item, and each question's answer_session_ids become its gold set.

const (
	// DocFull renders "role: content\n" for every turn (the production shape).
	DocFull DocMode = "full"
	// DocUserOnly renders only user turns, with no role prefixes (MemPalace's
	// raw mode: assistant turns dilute the vector leg on user-question recall).
	DocUserOnly DocMode = "user-only"
	// DocDated prefixes the full session with its date, so temporal questions
	// have a textual anchor the embedder can see.
	DocDated DocMode = "dated"
)

type Item ¶

type Item struct {
	ID      string    `json:"id"`
	Content string    `json:"content"`
	Group   string    `json:"group,omitempty"`
	Time    time.Time `json:"-"`
}

Item is one memory to ingest; Group scopes it to a namespace, empty falls back to a shared default. Time, when set, is the memory's source timestamp (used to ground recency in the recency-aware re-ranking comparison).

type Question ¶

type Question struct {
	Query    string    `json:"query"`
	Gold     []string  `json:"gold"`
	Group    string    `json:"group,omitempty"`
	Answer   string    `json:"answer,omitempty"`
	Category string    `json:"category,omitempty"`
	Now      time.Time `json:"-"`
}

Question is a query plus the gold memory IDs it should retrieve. Group must match its items; Answer/Category are populated for QA evaluation where available. Now, when set, is the query's reference time (e.g. the question date) — the "now" against which recency is measured.

type RerankGateResult ¶ added in v0.4.13

type RerankGateResult struct {
	Threshold        float64 `json:"threshold"`
	PosRecallAtK     float64 `json:"pos_recall_at_k"`
	NegInjectionRate float64 `json:"neg_injection_rate"`
}

RerankGateResult is one cross-encoder relevance-score threshold's effect under a per-query gate: if a query's best rerank score (over its recall pool) is below the threshold, nothing relevant exists and recall returns empty. Positive = own namespace (recall must survive); negative = a foreign namespace (injection must collapse). Cross-encoders emit calibrated absolute relevance, unlike bi-encoder cosine — this measures whether that separation is real.

func RerankGateSweep ¶ added in v0.4.13

func RerankGateSweep(
	ctx context.Context, st store.Store, e embed.Embedder, ce *rerank.CrossEncoder,
	ds *Dataset, k, pool int, thresholds []float64, queryPrefix string,
) ([]RerankGateResult, error)

RerankGateSweep ingests once, then for every question reranks its recall pool (hybrid fusion + composite, top `pool`) against the query in its own namespace (positive) and in a foreign namespace (negative), recording the top rerank score and whether the gold lands in the reranked top-k. It reports the top rerank-score distribution and, per threshold, positive recall@k vs negative injection. Negatives pair each question with the next question's namespace.

type RerankResult ¶

type RerankResult struct {
	System    string
	Category  string
	Questions int
	RecallAt1 float64
	RecallAtK float64
	MRR       float64
}

RerankResult is one ranking strategy's score over a question set.

func LLMRerankCompare ¶ added in v0.0.4

func LLMRerankCompare(
	ctx context.Context, st store.Store, e embed.Embedder, rr rerank.Reranker,
	ds *Dataset, k, fetch int, queryPrefix string,
) ([]RerankResult, error)

LLMRerankCompare measures the with-LLM read-side rerank lift on pure retrieval. For each question it builds the production candidate order (hybrid score fusion -> composite re-rank), then re-orders the top `fetch` with an LLM reranker, and scores recall@1/@k and MRR for both. The LLM tier is slow (one chat call per question), so drive it over a subset with cmd/bench -limit.

func RerankCompare ¶

func RerankCompare(
	ctx context.Context, st store.Store, e embed.Embedder, ds *Dataset, cats []string, k int, queryPrefix string,
) ([]RerankResult, error)

RerankCompare isolates the effect of recency-aware re-ranking: it ingests ds (items carry source timestamps), then for each selected question scores the SAME fused candidate set two ways — pure RRF order vs the composite re-ranker using the question's reference time. Reports recall@1, recall@K, and MRR per category and overall, for both strategies. cats empty means all categories.

type Result ¶

type Result struct {
	System      string             `json:"system"`
	Dataset     string             `json:"dataset"`
	K           int                `json:"k"`
	Questions   int                `json:"questions"`
	RecallAtK   float64            `json:"recall_at_k"`
	MRR         float64            `json:"mrr"`
	P50Millis   float64            `json:"p50_ms"`
	P95Millis   float64            `json:"p95_ms"`
	IngestMs    float64            `json:"ingest_ms"`
	PerCategory map[string]float64 `json:"per_category,omitempty"`
}

Result is one system's score on a dataset at a given K.

func Run ¶

func Run(ctx context.Context, sys System, ds *Dataset, ks []int) ([]Result, error)

Run ingests the dataset into a system once, then scores recall_any@K and MRR for every K in ks from a single retrieval pass (retrieving max(ks) per question). Returns one Result per K.

type System ¶

type System interface {
	Name() string
	Ingest(ctx context.Context, items []Item) error
	Recall(ctx context.Context, group, query string, k int) ([]string, error)
}

System is a memory system under test.

func MeminiSystems ¶

func MeminiSystems(
	st store.Store, e embed.Embedder, concurrency int, queryPrefix string, fusionAlpha float64, poolFactor, poolFloor int,
) []System

MeminiSystems returns the hybrid, vector-only, and keyword-only retrieval strategies sharing one ingested store. queryPrefix, when non-empty, is prepended to query embeddings (hybrid and vector legs), matching MEMINI_EMBED_QUERY_PREFIX in production. fusionAlpha < 0 uses RRF; >= 0 uses convex-combination score fusion with that vector weight. poolFactor/poolFloor override hybrid recall's per-leg pool sizing (non-positive keeps defaults).

type VecGateResult ¶ added in v0.4.13

type VecGateResult struct {
	Threshold        float64 `json:"threshold"`
	PosRecallAtK     float64 `json:"pos_recall_at_k"`
	NegInjectionRate float64 `json:"neg_injection_rate"`
}

VecGateResult is one absolute-vector-score threshold's effect under a per-query semantic-relevance gate: if a query's best raw vector score (1/(1+L2)) is below the threshold, nothing relevant exists and recall returns empty. Positive = each query against its own namespace (recall must survive); negative = the same query against a foreign namespace (injection must collapse). The right default is the knee: highest threshold where PosRecallAtK is ~unchanged but NegInjectionRate has dropped.

func VecGateSweep ¶ added in v0.4.13

func VecGateSweep(
	ctx context.Context, st store.Store, e embed.Embedder, ds *Dataset,
	k int, thresholds []float64, concurrency int, queryPrefix string, fusionAlpha float64,
) ([]VecGateResult, error)

VecGateSweep ingests once, then for every question measures the top raw vector score in its own namespace (positive) and in a foreign namespace (negative), plus whether the real fused recall already retrieves the gold. It reports, per threshold, the per-query gate's effect: positive recall@k (lost only when the own-namespace top vector score falls below the gate) and negative injection rate (a foreign query passes the gate when its top vector score clears it). Negatives pair each question with the next question's namespace; group ids are unique per question, so the paired namespace never holds the answer.
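The per-threshold accounting described above can be sketched as follows. The probe type and the sweep and rawVectorScore helpers are hypothetical names, and the sketch assumes a score exactly equal to the threshold passes the gate; only the 1/(1+L2) score form and the two reported rates come from the documentation.

```go
package main

import "fmt"

// probe is one question's measurements (hypothetical type): its top raw
// vector score in its own namespace, whether fused recall already had the
// gold in the top k, and its top raw vector score in a foreign namespace.
type probe struct {
	PosTop     float64
	GoldInTopK bool
	NegTop     float64
}

// gateResult mirrors the documented VecGateResult fields.
type gateResult struct {
	Threshold        float64
	PosRecallAtK     float64
	NegInjectionRate float64
}

// rawVectorScore converts an L2 distance into the 1/(1+L2) score form.
func rawVectorScore(l2 float64) float64 { return 1 / (1 + l2) }

// sweep applies each threshold as a per-query gate. A positive keeps its hit
// only if its own-namespace top score clears the gate; a negative counts as
// an injection if its foreign-namespace top score clears it.
func sweep(probes []probe, thresholds []float64) []gateResult {
	out := make([]gateResult, 0, len(thresholds))
	for _, t := range thresholds {
		var pos, neg float64
		for _, p := range probes {
			if p.GoldInTopK && p.PosTop >= t {
				pos++
			}
			if p.NegTop >= t {
				neg++
			}
		}
		n := float64(len(probes))
		if n == 0 {
			n = 1 // avoid dividing by zero on an empty probe set
		}
		out = append(out, gateResult{t, pos / n, neg / n})
	}
	return out
}

func main() {
	probes := []probe{
		{PosTop: 0.8, GoldInTopK: true, NegTop: 0.3},
		{PosTop: 0.6, GoldInTopK: true, NegTop: 0.5},
		{PosTop: 0.4, GoldInTopK: false, NegTop: 0.2},
		{PosTop: 0.7, GoldInTopK: true, NegTop: 0.45},
	}
	for _, r := range sweep(probes, []float64{0, 0.5, 0.65}) {
		fmt.Printf("threshold=%.2f pos_recall=%.2f neg_injection=%.2f\n",
			r.Threshold, r.PosRecallAtK, r.NegInjectionRate)
	}
	fmt.Printf("L2=1 -> score %.2f\n", rawVectorScore(1))
}
```

In this made-up data the knee is 0.5: positive recall is unchanged from the ungated run while most foreign-namespace queries are already rejected, and raising the gate further starts costing positives.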
