bench

package
v0.2.2 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 12, 2026 License: AGPL-3.0 Imports: 17 Imported by: 0

README

memini benchmark harness

A retrieval benchmark: ingest a dataset of memories, then for each question measure how well a system retrieves the gold supporting memories.

mise run bench                 # offline sample, local embedder
go run ./cmd/bench -k 5        # same, explicit K

Against a real embeddings model and a real dataset:

export MEMINI_EMBED_BASE_URL=http://localhost:8081/v1
export MEMINI_EMBED_MODEL=bge-m3 MEMINI_EMBED_DIMS=1024
# Optional: instruction-tuned asymmetric embedders (Qwen3-Embedding, bge) score
# higher when queries carry a retrieval instruction; documents stay bare.
# Measured on Qwen3-Embedding-8B: +6.0pp R@5 on the LongMemEval vector leg
# (91.2% -> 97.2%), +1.0pp MRR on the fused ranking on both datasets.
export MEMINI_EMBED_QUERY_PREFIX=$'Instruct: Given a user query, retrieve relevant memories that answer it\nQuery:'
go run ./cmd/bench -suite longmemeval -data ./longmemeval_s.json -k 5
go run ./cmd/bench -suite locomo      -data ./locomo.json        -k 5

# Isolate the recency-aware re-ranker against pure RRF on the same candidates,
# using each question's date as "now" (needs a timestamped dataset):
go run ./cmd/bench -suite longmemeval -data ./longmemeval_s.json -rerank -k 5

Full results

Everything this harness measures, in one table — sourced from the committed results/ JSON, all on the same all-MiniLM-L6-v2 (384-d) endpoint. Cells are recall_any@5 / @10 / MRR (%); p50 is in-process recall latency (rerank rows show the added cost). The detailed per-dataset sections below explain the methodology, sweeps, and caveats behind each column.

| Strategy | LongMemEval · session | LoCoMo · turn-level | LoCoMo · session-level | p50 |
|---|---|---|---|---|
| vector | 92.6 / 95.4 / 80.7 | 41.3 / 51.8 / 28.1 | 64.1 / 79.8 / 45.2 | <1 ms |
| keyword (Porter BM25) | 97.6 / 99.0 / 92.2 | 58.7 / 67.1 / 44.8 | 92.6 / 96.8 / 79.4 | ~3 ms |
| hybrid (default, production path) | 98.4 / 99.2 / 93.0 | 59.7 / 69.9 / 42.4 | 90.9 / 96.6 / 74.3 | ~5 ms |
| + cross-encoder (MEMINI_RERANK=<url>) | 98.4 / 99.2 / 93.1 | 70.9 / 75.0 / 59.8 | 90.9 / 96.6 / 74.3 | +20–230 ms |
| + LLM rerank (MEMINI_RERANK=llm) | 98.4 / 99.2 / 93.0 | 74.4 / 76.5 / 67.4 | | +350–420 ms |

Questions per dataset: LongMemEval 500 (session granularity), LoCoMo turn-level 1,982 (gold = exact evidence turns), LoCoMo session-level 1,981 (gold = sessions holding those turns). Rerank backends: Qwen3-Reranker-0.6B (cross-encoder) and Qwen3.5-9B (LLM). Reproduce with the per-suite commands in the sections below (-suite longmemeval, locomo, locomo-sessions; add -rerank-url/-llm-rerank for the rerank rows).

Reading it: on LongMemEval hybrid leads both single legs; on LoCoMo session-level it sits within 2pp of keyword (90.9 vs 92.6 R@5, 96.6 vs 96.8 R@10), where keyword's exact-token match is already near-ceiling. On turn-level LoCoMo base recall has real headroom, so the rerank tier earns its keep: the cross-encoder lands +11pp R@5 / +17pp MRR over hybrid at a fraction of the LLM's latency, and the LLM rerank reaches +15pp / +25pp over hybrid if you already run a chat model. Where recall is already at ceiling (both session sets), reranking is a measured no-op.

Results: memini vs other memory systems

All memini numbers below are measured by this harness against a live all-MiniLM-L6-v2 (384-d) endpoint — the same embedding model agentmemory benchmarks with. Competitor numbers are cited from their own publications — we cannot re-run their systems here, and they use different embedding models, readers, and judges. Treat cross-system rows as directional, not a controlled head-to-head. (This mirrors how agentmemory documents its comparison.)

LongMemEval-S — retrieval recall_any@K

Full 500-question LongMemEval-S (~48 sessions/question), scored with the metric agentmemory reports: does any gold session appear in the top-K retrieved? There is no LLM in the loop; this is pure retrieval. The run uses the same embedding model agentmemory benchmarks with (all-MiniLM-L6-v2, 384-d), so model, dataset, and metric all match.

Hybrid recall over-fetches a deep candidate pool per leg (max(k*5, 50)) before fusing, so a memory just outside the top-k of both legs can still win — the production Recall path does the same. Fusion defaults to convex score fusion (MEMINI_FUSION_ALPHA=0.5): each leg's scores are min-max normalized to [0,1] and combined 0.5·vector + 0.5·keyword, keeping score magnitude so a memory a leg ranks far above its runners-up dominates one that is merely middling in both. A negative alpha falls back to Reciprocal Rank Fusion; deep pools then need a steep decay (rrfK=5, not the classic 60), since a flat decay lets both-leg mediocrity outscore single-leg excellence (2/(60+20) > 1/(60+0)). Score fusion gets the same effect from score magnitude directly, and beat RRF on 3 of 4 model×dataset cells (and on MRR in all 4).
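The two fusion rules described above can be sketched in a few lines. This is an illustration written for this section, not the package's implementation: the function names, the map-based inputs, 0-based ranks, and the treatment of a constant-score leg are all assumptions.

```go
package main

import (
	"fmt"
	"math"
)

// minMax rescales one leg's raw scores to [0,1]. Mapping a constant-score
// leg to 1 is an assumption of this sketch.
func minMax(scores map[string]float64) map[string]float64 {
	out := make(map[string]float64, len(scores))
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, s := range scores {
		lo, hi = math.Min(lo, s), math.Max(hi, s)
	}
	for id, s := range scores {
		if hi == lo {
			out[id] = 1
			continue
		}
		out[id] = (s - lo) / (hi - lo)
	}
	return out
}

// scoreFuse combines the legs as alpha*vector + (1-alpha)*keyword. A memory
// missing from one leg's pool contributes 0 for that leg.
func scoreFuse(vec, kw map[string]float64, alpha float64) map[string]float64 {
	out := make(map[string]float64)
	for id, s := range minMax(vec) {
		out[id] += alpha * s
	}
	for id, s := range minMax(kw) {
		out[id] += (1 - alpha) * s
	}
	return out
}

// rrf sums 1/(k+rank) over the legs that returned the memory (0-based ranks).
func rrf(k float64, ranks ...int) float64 {
	var sum float64
	for _, r := range ranks {
		sum += 1 / (k + float64(r))
	}
	return sum
}

func main() {
	// A memory at rank 20 in both legs vs one at rank 0 in a single leg.
	fmt.Printf("rrfK=60: both-leg %.4f, single-leg %.4f\n", rrf(60, 20, 20), rrf(60, 0))
	fmt.Printf("rrfK=5:  both-leg %.4f, single-leg %.4f\n", rrf(5, 20, 20), rrf(5, 0))

	vec := map[string]float64{"a": 1.0, "b": 0.5, "c": 0.0}
	kw := map[string]float64{"b": 4, "c": 2, "d": 0}
	fmt.Println(scoreFuse(vec, kw, 0.5))
}
```

With rrfK=60 the both-leg candidate scores 0.025 against 0.0167 for the single-leg winner; with rrfK=5 the order flips (0.08 against 0.2), which is the steep-decay effect the paragraph describes.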

| System | Embedding model | R@5 | R@10 | Source |
|---|---|---|---|---|
| memini — hybrid (score) | all-MiniLM-L6-v2 (384-d) | 98.4% | 99.4% | measured |
| memini — keyword (Porter BM25) | | 97.6% | 99.0% | measured |
| memini — vector | all-MiniLM-L6-v2 | 91.8% | 96.6% | measured |
| agentmemory — BM25 + Vector | all-MiniLM-L6-v2 | 95.2% | 98.6% | published |
| agentmemory — BM25 only | | 86.2% | 94.6% | published |
| MemPalace (vector only) | larger model | ~96.6% | | self-reported |

On the same model/dataset/metric (full 500 questions), memini hybrid beats agentmemory at R@5 (98.4% vs 95.2%), R@10 (99.4% vs 98.6%), and MRR (92.3% vs 88.2%). memini's keyword leg is +11.4pp over agentmemory's BM25-only (97.6% vs 86.2%) thanks to Porter stemming, and hybrid fusion now beats either leg alone. Relative to fetching only k per leg with the classic rrfK=60, the deep-pool + score fusion is worth +2.0pp R@5 / +1.0pp R@10.

LoCoMo — retrieval recall_any@K

LoCoMo retrieval at dialogue-turn granularity (1,982 questions over 10 long conversations, gold = exact evidence turns among ~590 turns/conversation) — a much harder target than LongMemEval's session granularity, and the regime where flat-decay RRF over deep pools degrades badly.

| System (all-MiniLM-L6-v2) | R@5 | R@10 |
|---|---|---|
| memini — hybrid (score) | 59.8% | 69.8% |
| memini — keyword (Porter BM25) | 58.7% | 67.1% |
| memini — vector | 41.5% | 52.1% |

No published turn-level retrieval baselines exist to compare against (mem0 / Letta report LLM-judged QA accuracy, below). This is the one cell where the default score fusion is edged by RRF (60.1% / 71.0%): when the vector leg is near-noise (MiniLM scores only 41.5% here), giving it an equal-weight normalized vote hurts, whereas RRF's rank-only vote is more robust. Score fusion still wins this cell on MRR and wins outright on every cell with a stronger embedder — so it is the default, and MEMINI_FUSION_ALPHA=-1 selects RRF for weak-vector deployments. (Ablation: rrfK=60 over the same deep pools scored just 52.8% R@5, below the keyword leg alone — both score fusion and rrfK=5 fix that.)

Pool-depth robustness (-pool-factor / -pool-floor)

Min-max normalization could in principle be fragile to pool depth (the score at the bottom of the pool sets each leg's zero point), so score fusion was swept at per-leg depths 30 / 50 / 80 on both datasets and both embedders (hybrid R@5 / R@10 / MRR):

| cell | depth 30 | depth 50 (default) | depth 80 |
|---|---|---|---|
| LME · MiniLM | 97.8 / 99.4 / 92.0 | 98.4 / 99.4 / 92.3 | 98.6 / 99.4 / 92.6 |
| LME · Qwen3+prefix | 98.8 / 99.4 / 94.5 | 98.8 / 99.6 / 94.6 | 98.8 / 99.6 / 94.6 |
| LoCoMo · MiniLM | 60.0 / 70.1 / 42.1 | 59.8 / 69.8 / 42.6 | 59.3 / 69.6 / 42.7 |
| LoCoMo · Qwen3+prefix | 70.1 / 77.9 / 52.1 | 70.1 / 78.5 / 52.4 | 70.1 / 78.7 / 52.5 |

Quality moves at most ±0.6pp R@5 across a 2.7× depth range — no tail collapse — with the two datasets drifting in opposite directions (deeper pools help session-granularity LongMemEval slightly and hurt turn-granularity LoCoMo slightly), so the default max(k*5, 50) sits at the crossover.
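For reference, the per-leg pool sizing rule is small enough to write out. The sketch below assumes that -pool-factor and -pool-floor map directly onto the two terms of max(k*5, 50) and that non-positive values keep the defaults; poolDepth is a hypothetical name.

```go
package main

import "fmt"

// poolDepth returns the per-leg candidate pool size for a top-k query:
// max(k*factor, floor). Non-positive overrides fall back to 5 and 50.
func poolDepth(k, factor, floor int) int {
	if factor <= 0 {
		factor = 5
	}
	if floor <= 0 {
		floor = 50
	}
	if d := k * factor; d > floor {
		return d
	}
	return floor
}

func main() {
	for _, k := range []int{5, 10, 20} {
		fmt.Printf("k=%d -> depth %d\n", k, poolDepth(k, 0, 0))
	}
}
```

At the benchmark's k=5 and k=10 the floor of 50 is what applies; the factor only takes over once k exceeds 10.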

Recency-aware re-ranking (-rerank)

memini re-ranks the fused candidates by a composite of relevance, recency, and importance. The recency weight is deliberately light (0.05): a sweep on LongMemEval-S (knowledge-update + temporal-reasoning, q.Now = question date, sessions timestamped from haystack_dates) shows recency is a net win only as a tie-breaker, and actively harmful when over-weighted.

| recency weight | R@1 (both cats) | knowledge-update R@1 | temporal R@1 | MRR |
|---|---|---|---|---|
| 0 (pure RRF) | 82.9% | 91.0% | 78.2% | 90.1% |
| 0.05 (default) | 83.4% | 91.0% | 78.9% | 90.5% |
| 0.15 | 83.9% | 89.7% | 80.5% | 90.7% |
| 0.25 | 83.4% | 87.2% | 81.2% | 90.4% |

At 0.05 the re-ranker is +0.5pp R@1 / +0.4pp MRR over pure RRF with no knowledge-update cost, and recall@5 is identical across all weights (the re-rank only reorders within the top results). The steep RRF decay made the composite far more robust to the recency weight than the flat rrfK=60 decay was (where 0.15+ buried correct-but-older memories); the default stays at the conservative 0.05 since the gains beyond it are within noise.
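To see why a light recency weight acts as a tie-breaker while a heavy one buries older memories, consider the toy composite below. Only the 0.05 and 0.25 weights come from the text above; the additive formula, the exponential half-life decay, the 30-day half-life, and the omission of the importance term are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

type candidate struct {
	id        string
	relevance float64       // fused relevance, assumed to lie in [0,1]
	age       time.Duration // question's "now" minus the memory's timestamp
}

// days converts a day count to a Duration.
func days(n int) time.Duration { return time.Duration(n) * 24 * time.Hour }

// composite adds a recency bonus that halves every halfLife (hypothetical
// shape; the importance term is left out to keep the sketch short).
func composite(c candidate, recencyWeight float64, halfLife time.Duration) float64 {
	recency := math.Exp2(-float64(c.age) / float64(halfLife))
	return c.relevance + recencyWeight*recency
}

func main() {
	older := candidate{"older-correct", 0.90, days(300)}
	fresh := candidate{"fresh-weaker", 0.78, days(1)}
	for _, w := range []float64{0, 0.05, 0.25} {
		fmt.Printf("w=%.2f older=%.3f fresh=%.3f\n",
			w, composite(older, w, days(30)), composite(fresh, w, days(30)))
	}
}
```

Under these assumptions the clearly more relevant older memory stays first at weight 0.05 and drops behind the fresher one at 0.25, while two equally relevant candidates are separated by recency at any positive weight.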

Temporal targeting (temporal 0.40)

Recency weighting trades off against itself: raising it helps temporal-reasoning (78.2→81.2% R@1) but hurts knowledge-update (91.0→87.2%), whose answers aren't necessarily recent. Temporal targeting avoids that: when a query names a relative time ("three weeks ago"), it computes target = now − offset and boosts candidates dated near that point, not near now. It only fires on temporal queries, so other categories are unaffected.
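A minimal sketch of the idea, assuming a templated "N days/weeks/months ago" pattern and a boost that falls off with distance in days from the target. The regex, the number-word table, the helper names, and the falloff shape are illustrative assumptions; the package's extractor and boost function may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"time"
)

var agoRe = regexp.MustCompile(
	`(?i)\b(\d+|one|two|three|four|five|six|seven|eight|nine|ten)\s+(day|week|month)s?\s+ago\b`)

var numberWords = map[string]int{"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
	"six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

// date builds a UTC midnight timestamp.
func date(y, m, d int) time.Time { return time.Date(y, time.Month(m), d, 0, 0, 0, 0, time.UTC) }

// anchor computes target = now - offset when the query names a relative time.
// ok is false for non-temporal queries, which therefore get no boost.
func anchor(query string, now time.Time) (target time.Time, ok bool) {
	m := agoRe.FindStringSubmatch(query)
	if m == nil {
		return time.Time{}, false
	}
	n, found := numberWords[strings.ToLower(m[1])]
	if !found {
		var err error
		if n, err = strconv.Atoi(m[1]); err != nil {
			return time.Time{}, false
		}
	}
	switch strings.ToLower(m[2]) {
	case "day":
		return now.AddDate(0, 0, -n), true
	case "week":
		return now.AddDate(0, 0, -7*n), true
	default: // "month"
		return now.AddDate(0, -n, 0), true
	}
}

// boost is largest for memories dated at the target and decays with the
// distance in days (hypothetical falloff).
func boost(memTime, target time.Time, weight float64) float64 {
	d := memTime.Sub(target)
	if d < 0 {
		d = -d
	}
	return weight / (1 + d.Hours()/24)
}

func main() {
	now := date(2023, 5, 30)
	target, ok := anchor("What did I say about the trip three weeks ago?", now)
	fmt.Println(ok, target.Format("2006-01-02"))
	fmt.Printf("near target: %.3f, near now: %.3f\n",
		boost(date(2023, 5, 9), target, 0.40), boost(date(2023, 5, 29), target, 0.40))
}
```

The memory dated at the target receives the full 0.40, whereas the one dated just before "now" receives almost nothing, which is the opposite of what plain recency weighting would do.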

| Strategy | all R@1 | knowledge-update R@1 | temporal-reasoning R@1 | MRR |
|---|---|---|---|---|
| recency 0.05 (prior default) | 83.4% | 91.0% | 78.9% | 90.5% |
| recency 0.25 | 83.4% | 87.2% | 81.2% | 90.4% |
| temporal 0.40 | 85.3% | 91.0% | 82.0% | 91.5% |

Temporal targeting is +1.9pp R@1 overall over the recency default and beats even the heaviest recency weight on temporal-reasoning without the knowledge-update regression — so it ships on in production (MEMINI_TEMPORAL_BOOST=0.40, 0 disables). The no-LLM regex extractor only catches templated phrasing; an LLM anchor extractor (plugging into the same search.AnchorExtractor interface) can resolve looser references and is the intended with-LLM tier.

Held-out split (-holdout)

To avoid overfitting tuning decisions to the full benchmark, -holdout splits LongMemEval deterministically by load order: every 10th question is held (50/500), the rest are tune (450/500). Sweep parameters on -holdout tune, then report the final number on -holdout held (unseen). Default all runs the full set. Results files are suffixed (longmemeval-held.json) so splits don't overwrite each other.
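A deterministic split of this kind can be sketched as follows. Which index within each block of ten is held is not stated above, so the i%10 == 9 choice (the 10th, 20th, ... question in load order) is an assumption, as are the function names.

```go
package main

import "fmt"

// inSplit reports whether the question at 0-based load index i belongs to
// the named split: "held", "tune", or anything else for the full set.
func inSplit(i int, split string) bool {
	held := i%10 == 9 // assumption: every 10th question in load order
	switch split {
	case "held":
		return held
	case "tune":
		return !held
	default:
		return true
	}
}

// count returns how many of n questions fall into the split.
func count(n int, split string) int {
	c := 0
	for i := 0; i < n; i++ {
		if inSplit(i, split) {
			c++
		}
	}
	return c
}

func main() {
	for _, s := range []string{"held", "tune", "all"} {
		fmt.Printf("%s: %d\n", s, count(500, s))
	}
}
```

Because membership depends only on load order, the same questions land in each split on every run, so tuned parameters are never evaluated on questions they were tuned against.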

Session-doc construction (-session-doc)

LongMemEval sessions are embedded as one document per session; -session-doc controls what text that document contains, to measure the vector leg's sensitivity to document shape:

  • full (default) — "role: content" for every turn.
  • user-only — only the user turns, no role prefixes. Assistant turns dilute the embedding for user-question recall; this is the document shape with which MemPalace reports 96.6% R@5 vector-only on the same MiniLM model.
  • dated — full, prefixed with the session date, giving temporal questions a textual anchor embeddings would otherwise ignore.

Compare the vector row's recall_any@5 across modes (cached embeddings make the sweep cheap); the keyword and hybrid rows shift too but the vector leg is the target.
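The three document shapes can be pictured with a small renderer. This is a sketch under assumptions: the turn type, the function name, the newline handling in user-only mode, and the date string layout are not taken from the package.

```go
package main

import (
	"fmt"
	"strings"
)

type turn struct{ role, content string }

// renderSession builds the text embedded for one session.
//   full:      "role: content\n" for every turn
//   user-only: only user turns, without role prefixes
//   dated:     the full rendering preceded by the session date
func renderSession(turns []turn, date, mode string) string {
	var b strings.Builder
	if mode == "dated" {
		b.WriteString(date + "\n") // date layout is whatever the dataset supplies
	}
	for _, t := range turns {
		if mode == "user-only" {
			if t.role == "user" {
				b.WriteString(t.content + "\n")
			}
			continue
		}
		fmt.Fprintf(&b, "%s: %s\n", t.role, t.content)
	}
	return b.String()
}

func main() {
	session := []turn{{"user", "I adopted a cat."}, {"assistant", "Congratulations!"}}
	for _, mode := range []string{"full", "user-only", "dated"} {
		fmt.Printf("%-9s %q\n", mode, renderSession(session, "2023/05/20", mode))
	}
}
```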

memini hybrid per-category (all-MiniLM, recall_any@10): multi-session 100%, knowledge-update 100%, single-session-user 98.6%, single-session-assistant 98.2%, temporal-reasoning 97.0%, single-session-preference 96.7%.

Rerank tier — cross-encoder vs LLM (-rerank-url / -llm-rerank)

The read-side rerank reorders the top of the production candidate order. The bench drives either backend through the same comparison (one reranker call per question — use -limit):

# cross-encoder (fast; e.g. Qwen3-Reranker-0.6B via llama-server --rerank):
go run ./cmd/bench -suite locomo -data ./locomo.json -rerank-url http://localhost:8002/v1 -rerank-model qwen3-reranker-0.6b -limit 100 -k 5,10
# LLM reranker (slow; MEMINI_LLM_*):
go run ./cmd/bench -suite locomo -data ./locomo.json -llm-rerank -limit 100 -k 5,10

Measured on all-MiniLM-L6-v2 (cross-encoder = Qwen3-Reranker-0.6B, LLM = Qwen3.5-9B), recall_any@5 / @10 / MRR:

| Config | LongMemEval (session) | LoCoMo turn-level | added p50 |
|---|---|---|---|
| hybrid (base) | 98.4 / 99.2 / 93.0 | 59.7 / 69.9 / 42.4 | |
| + cross-encoder | 98.4 / 99.2 / 93.1 | 70.9 / 75.0 / 59.8 | ~20–230 ms |
| + LLM rerank | 98.4 / 99.2 / 93.0 | 74.4 / 76.5 / 67.4 | ~350–420 ms |

Reranking is a no-op at recall ceiling (session-level) and a big win where recall has headroom (turn-level: +11pp R@5 / +17pp MRR for the cross-encoder, +15pp / +25pp for the LLM). The cross-encoder captures most of the LLM's lift at a fraction of the latency with no chat model — the recommended production rerank (MEMINI_RERANK=<url>); the LLM tier (MEMINI_RERANK=llm) buys the last points if you already run one.

LoCoMo — end-to-end QA accuracy (LLM-judge)

The metric mem0/Letta publish: retrieve → generate an answer → an LLM judges it against the gold answer. memini's number uses a fast instruct reader+judge (Llama-3.3-70B-Instruct); the competitor numbers use their own readers/judges, so this is directional.

| System | LoCoMo QA accuracy | Source |
|---|---|---|
| memini (hybrid retrieval + instruct reader) | full run pending | measured |
| Letta / MemGPT | 83.2% | published |
| Mem0 | 68.5% | published |

Sources: agentmemory COMPARISON.md/LONGMEMEVAL.md; LongMemEval (arXiv 2410.10813); LoCoMo (snap-stanford.github.io/LoCoMo); mem0.ai; letta.com.

Metrics

  • Recall@K (reported as recall_any@K) — fraction of questions for which at least one gold memory appears in the top K.
  • MRR — mean reciprocal rank of the first gold hit.
  • p50/p95 — recall latency; ingest — total ingest time.
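The two quality metrics can be computed from one ranked list per question, as in the sketch below. The type and function names are hypothetical, and computing MRR over the whole retrieved list (rather than only the top K) is an assumption of this example.

```go
package main

import "fmt"

type ranking struct {
	retrieved []string // IDs in ranked order, best first
	gold      []string // IDs that count as correct
}

// firstGoldRank returns the 1-based rank of the first gold hit, or 0 if
// no gold ID was retrieved.
func firstGoldRank(r ranking) int {
	gold := make(map[string]bool, len(r.gold))
	for _, id := range r.gold {
		gold[id] = true
	}
	for i, id := range r.retrieved {
		if gold[id] {
			return i + 1
		}
	}
	return 0
}

// score returns recall_any@K and MRR averaged over all questions.
func score(rs []ranking, k int) (recallAtK, mrr float64) {
	if len(rs) == 0 {
		return 0, 0
	}
	for _, r := range rs {
		rank := firstGoldRank(r)
		if rank == 0 {
			continue // a miss adds nothing to either metric
		}
		if rank <= k {
			recallAtK++
		}
		mrr += 1 / float64(rank)
	}
	n := float64(len(rs))
	return recallAtK / n, mrr / n
}

func main() {
	rs := []ranking{
		{retrieved: []string{"a", "b", "c"}, gold: []string{"b"}}, // first gold hit at rank 2
		{retrieved: []string{"x", "y", "z"}, gold: []string{"q"}}, // gold never retrieved
	}
	for _, k := range []int{1, 2} {
		r, m := score(rs, k)
		fmt.Printf("recall_any@%d=%.2f MRR=%.2f\n", k, r, m)
	}
}
```

Scoring every K from the same ranked list means a single retrieval pass per question is enough, provided it retrieves at least the largest K.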

Output is a Markdown table (stdout) plus JSON under bench/results/.

What it compares today

Three memini retrieval strategies over the same ingested store, to show the value of hybrid fusion:

| System | Retrieval |
|---|---|
| memini-hybrid | vector + keyword, score fusion (production path) |
| memini-vector | dense vector only |
| memini-keyword | BM25 keyword only |

memini-hybrid should never score below either single strategy.

Datasets

  • sample — committed at bench/data/sample.json, runs fully offline.
  • Normalized schema (-suite file) — {name, items:[{id,content}], questions:[{query,gold:[id]}]}.
  • LongMemEval / LoCoMo — loaders map the published JSON shapes to the normalized schema (each session/turn becomes an item; answer/evidence ids become gold). Download the datasets and pass -data.
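As a concrete instance of the normalized schema, the sketch below decodes a two-item dataset and checks that every gold ID refers to an ingested item. The local types mirror only the JSON field names listed above; the sample content and the validation step are illustrative additions, and group scoping is ignored.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type item struct {
	ID      string `json:"id"`
	Content string `json:"content"`
}

type question struct {
	Query string   `json:"query"`
	Gold  []string `json:"gold"`
}

type dataset struct {
	Name      string     `json:"name"`
	Items     []item     `json:"items"`
	Questions []question `json:"questions"`
}

const sampleJSON = `{
  "name": "tiny",
  "items": [
    {"id": "m1", "content": "Alice moved to Lisbon in March."},
    {"id": "m2", "content": "Bob prefers green tea."}
  ],
  "questions": [
    {"query": "Where does Alice live?", "gold": ["m1"]}
  ]
}`

// parse decodes a normalized dataset and rejects gold IDs with no item.
func parse(raw []byte) (*dataset, error) {
	var ds dataset
	if err := json.Unmarshal(raw, &ds); err != nil {
		return nil, err
	}
	known := make(map[string]bool, len(ds.Items))
	for _, it := range ds.Items {
		known[it.ID] = true
	}
	for _, q := range ds.Questions {
		for _, id := range q.Gold {
			if !known[id] {
				return nil, fmt.Errorf("question %q: gold id %q has no matching item", q.Query, id)
			}
		}
	}
	return &ds, nil
}

func main() {
	ds, err := parse([]byte(sampleJSON))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %d items, %d questions\n", ds.Name, len(ds.Items), len(ds.Questions))
}
```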

Recall@K on LongMemEval/LoCoMo is easy to overfit — treat scores as directional.

External baselines

bench.System is the extension point. To compare against mem0, Zep/Graphiti, Letta, Cognee, agentmemory, or supermemory, implement System (Name / Ingest / Recall) over each service's API and add it to the run list in cmd/bench. These require the respective services/keys and are intentionally not vendored here.
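The shape of such an adapter is shown below with a toy in-memory backend in place of a real service. Item and System here are trimmed local copies of the package's types (Item's Time field is omitted) so the example compiles on its own; a real adapter would implement the package's own System interface and call the external service's API inside Ingest and Recall. The overlapSystem type and its term-overlap ranking are purely illustrative.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"strings"
)

// Trimmed local copies of the package's Item and System, for illustration.
type Item struct {
	ID, Content, Group string
}

type System interface {
	Name() string
	Ingest(ctx context.Context, items []Item) error
	Recall(ctx context.Context, group, query string, k int) ([]string, error)
}

// overlapSystem ranks items by how many query terms their content contains.
type overlapSystem struct {
	byGroup map[string][]Item
}

var _ System = (*overlapSystem)(nil) // compile-time interface check

func newOverlapSystem() *overlapSystem {
	return &overlapSystem{byGroup: make(map[string][]Item)}
}

func (s *overlapSystem) Name() string { return "toy-overlap" }

func (s *overlapSystem) Ingest(_ context.Context, items []Item) error {
	for _, it := range items {
		s.byGroup[it.Group] = append(s.byGroup[it.Group], it)
	}
	return nil
}

func (s *overlapSystem) Recall(_ context.Context, group, query string, k int) ([]string, error) {
	if k < 0 {
		k = 0
	}
	terms := strings.Fields(strings.ToLower(query))
	type hit struct {
		id string
		n  int
	}
	var hits []hit
	for _, it := range s.byGroup[group] {
		content := strings.ToLower(it.Content)
		n := 0
		for _, t := range terms {
			if strings.Contains(content, t) {
				n++
			}
		}
		if n > 0 {
			hits = append(hits, hit{it.ID, n})
		}
	}
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].n > hits[j].n })
	if len(hits) > k {
		hits = hits[:k]
	}
	ids := make([]string, len(hits))
	for i, h := range hits {
		ids[i] = h.id
	}
	return ids, nil
}

// demo ingests two memories and recalls against the default (empty) group.
func demo() ([]string, error) {
	ctx := context.Background()
	sys := newOverlapSystem()
	err := sys.Ingest(ctx, []Item{
		{ID: "m1", Content: "Alice moved to Lisbon in March."},
		{ID: "m2", Content: "Bob prefers green tea."},
	})
	if err != nil {
		return nil, err
	}
	return sys.Recall(ctx, "", "who likes tea", 5)
}

func main() {
	ids, err := demo()
	fmt.Println(ids, err)
}
```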

Documentation

Overview

Package bench is a retrieval benchmark harness: it ingests a dataset of memories and scores each question's gold retrieval (Recall@K, MRR) and latency. Runs offline on the committed sample with a deterministic local embedder, or against a real endpoint and a converted LongMemEval/LoCoMo set.


Functions

func Markdown

func Markdown(results []Result) string

Markdown renders results that share a K as a comparison table, best first.

func RerankMarkdown

func RerankMarkdown(results []RerankResult, k int) string

RerankMarkdown renders the RRF-vs-composite comparison, grouped by category.

Types

type Dataset

type Dataset struct {
	Name      string     `json:"name"`
	Items     []Item     `json:"items"`
	Questions []Question `json:"questions"`
}

Dataset is a normalized retrieval benchmark.

func LoadFile

func LoadFile(path string) (*Dataset, error)

LoadFile reads a dataset in memini's normalized JSON schema.

func LoadLoCoMo

func LoadLoCoMo(path string) (*Dataset, error)

LoadLoCoMo converts the published LoCoMo file into the normalized Dataset. Each conversation is its own group/namespace (dialogue ids repeat across conversations); each dialogue turn is an item, and each QA's evidence ids are its gold set. Questions without evidence (e.g. adversarial) are skipped.

func LoadLoCoMoSessions added in v0.0.4

func LoadLoCoMoSessions(path string) (*Dataset, error)

LoadLoCoMoSessions loads LoCoMo at SESSION granularity: each conversation session becomes one document (its turns concatenated), and a question's gold set is the session(s) holding its evidence turns. This matches how session-level memory systems (e.g. MemPalace) score LoCoMo, enabling an apples-to-apples comparison; LoadLoCoMo scores the harder turn granularity.

func LoadLongMemEval

func LoadLongMemEval(path string, mode DocMode) (*Dataset, error)

func Poison added in v0.0.11

func Poison(ds *Dataset, perGroup int, filler string) *Dataset

Poison returns a copy of ds with perGroup debris items added to every group that has questions — simulating a low-quality bulk import (e.g. a mem0 export of restatements) collapsed into the namespace. The debris shares one content template so a dedup pass clusters and collapses it, modelling the realistic "exports are full of near-duplicates" case. Use it to measure the Recall@K delta a poisoned store suffers, and that dedup/curation recover it.

func Sample

func Sample() (*Dataset, error)

Sample returns the committed offline sample dataset.

type DocMode added in v0.0.4

type DocMode string

DocMode selects how a LongMemEval haystack session is rendered into one embedded item, for the vector-leg document-construction experiment. It is the mode argument of LoadLongMemEval, which converts a LongMemEval file: each haystack session becomes an item, and each question's answer_session_ids becomes its gold set.

const (
	// DocFull renders "role: content\n" for every turn (the production shape).
	DocFull DocMode = "full"
	// DocUserOnly renders only user turns, with no role prefixes (MemPalace's
	// raw mode: assistant turns dilute the vector leg on user-question recall).
	DocUserOnly DocMode = "user-only"
	// DocDated prefixes the full session with its date, so temporal questions
	// have a textual anchor the embedder can see.
	DocDated DocMode = "dated"
)

type Item

type Item struct {
	ID      string    `json:"id"`
	Content string    `json:"content"`
	Group   string    `json:"group,omitempty"`
	Time    time.Time `json:"-"`
}

Item is one memory to ingest; Group scopes it to a namespace, empty falls back to a shared default. Time, when set, is the memory's source timestamp (used to ground recency in the recency-aware re-ranking comparison).

type Question

type Question struct {
	Query    string    `json:"query"`
	Gold     []string  `json:"gold"`
	Group    string    `json:"group,omitempty"`
	Answer   string    `json:"answer,omitempty"`
	Category string    `json:"category,omitempty"`
	Now      time.Time `json:"-"`
}

Question is a query plus the gold memory IDs it should retrieve. Group must match its items; Answer/Category are populated for QA evaluation where available. Now, when set, is the query's reference time (e.g. the question date) — the "now" against which recency is measured.

type RerankResult

type RerankResult struct {
	System    string
	Category  string
	Questions int
	RecallAt1 float64
	RecallAtK float64
	MRR       float64
}

RerankResult is one ranking strategy's score over a question set.

func LLMRerankCompare added in v0.0.4

func LLMRerankCompare(
	ctx context.Context, st store.Store, e embed.Embedder, rr rerank.Reranker,
	ds *Dataset, k, fetch int, queryPrefix string,
) ([]RerankResult, error)

LLMRerankCompare measures the with-LLM read-side rerank lift on pure retrieval. For each question it builds the production candidate order (hybrid score fusion -> composite re-rank), then re-orders the top `fetch` with an LLM reranker, and scores recall@1/@k and MRR for both. The LLM tier is slow (one chat call per question), so drive it over a subset with cmd/bench -limit.

func RerankCompare

func RerankCompare(
	ctx context.Context, st store.Store, e embed.Embedder, ds *Dataset, cats []string, k int, queryPrefix string,
) ([]RerankResult, error)

RerankCompare isolates the effect of recency-aware re-ranking: it ingests ds (items carry source timestamps), then for each selected question scores the SAME fused candidate set two ways — pure RRF order vs the composite re-ranker using the question's reference time. Reports recall@1, recall@K, and MRR per category and overall, for both strategies. cats empty means all categories.

type Result

type Result struct {
	System      string             `json:"system"`
	Dataset     string             `json:"dataset"`
	K           int                `json:"k"`
	Questions   int                `json:"questions"`
	RecallAtK   float64            `json:"recall_at_k"`
	MRR         float64            `json:"mrr"`
	P50Millis   float64            `json:"p50_ms"`
	P95Millis   float64            `json:"p95_ms"`
	IngestMs    float64            `json:"ingest_ms"`
	PerCategory map[string]float64 `json:"per_category,omitempty"`
}

Result is one system's score on a dataset at a given K.

func Run

func Run(ctx context.Context, sys System, ds *Dataset, ks []int) ([]Result, error)

Run ingests the dataset into a system once, then scores recall_any@K and MRR for every K in ks from a single retrieval pass (retrieving max(ks) per question). Returns one Result per K.

type System

type System interface {
	Name() string
	Ingest(ctx context.Context, items []Item) error
	Recall(ctx context.Context, group, query string, k int) ([]string, error)
}

System is a memory system under test.

func MeminiSystems

func MeminiSystems(
	st store.Store, e embed.Embedder, concurrency int, queryPrefix string, fusionAlpha float64, poolFactor, poolFloor int,
) []System

MeminiSystems returns the hybrid, vector-only, and keyword-only retrieval strategies sharing one ingested store. queryPrefix, when non-empty, is prepended to query embeddings (hybrid and vector legs), matching MEMINI_EMBED_QUERY_PREFIX in production. fusionAlpha < 0 uses RRF; >= 0 uses convex-combination score fusion with that vector weight. poolFactor/poolFloor override hybrid recall's per-leg pool sizing (non-positive keeps defaults).
