bench

package
v0.2.10
Published: Jun 13, 2026 License: AGPL-3.0 Imports: 17 Imported by: 0

README

memini benchmark harness

A retrieval benchmark: ingest a dataset of memories, then for each question measure how well a system retrieves the gold supporting memories.

mise run bench                 # offline sample, local embedder
go run ./cmd/bench -k 5        # same, explicit K

Against a real embeddings model and a real dataset:

export MEMINI_EMBED_BASE_URL=http://localhost:8081/v1
export MEMINI_EMBED_MODEL=bge-m3 MEMINI_EMBED_DIMS=1024
# Optional: instruction-tuned asymmetric embedders (Qwen3-Embedding, bge) score
# higher when queries carry a retrieval instruction; documents stay bare.
# Measured on Qwen3-Embedding-8B: +6.0pp R@5 on the LongMemEval vector leg
# (91.2% -> 97.2%), +1.0pp MRR on the fused ranking on both datasets.
export MEMINI_EMBED_QUERY_PREFIX=$'Instruct: Given a user query, retrieve relevant memories that answer it\nQuery:'
go run ./cmd/bench -suite longmemeval -data ./longmemeval_s.json -k 5
go run ./cmd/bench -suite locomo      -data ./locomo.json        -k 5

# Isolate the recency-aware re-ranker against pure RRF on the same candidates,
# using each question's date as "now" (needs a timestamped dataset):
go run ./cmd/bench -suite longmemeval -data ./longmemeval_s.json -rerank -k 5

Full results

Everything this harness measures, in one table — sourced from the committed results/ JSON, all on the same all-MiniLM-L6-v2 (384-d) endpoint. Cells are recall_any@5 / @10 / MRR (%); p50 is in-process recall latency (rerank rows show the added cost). The detailed per-dataset sections below explain the methodology, sweeps, and caveats behind each column.

| Strategy | LongMemEval · session | LoCoMo · turn-level | LoCoMo · session-level | p50 |
|---|---|---|---|---|
| vector | 92.6 / 95.4 / 80.7 | 41.3 / 51.8 / 28.1 | 64.1 / 79.8 / 45.2 | <1 ms |
| keyword (Porter BM25) | 97.6 / 99.0 / 92.2 | 58.7 / 67.1 / 44.8 | 92.6 / 96.8 / 79.4 | ~3 ms |
| hybrid (default, production path) | 98.4 / 99.2 / 93.0 | 59.7 / 69.9 / 42.4 | 90.9 / 96.6 / 74.3 | ~5 ms |
| + cross-encoder (MEMINI_RERANK=<url>) | 98.4 / 99.2 / 93.1 | 70.9 / 75.0 / 59.8 | 90.9 / 96.6 / 74.3 | +20–230 ms |
| + LLM rerank (MEMINI_RERANK=llm) | 98.4 / 99.2 / 93.0 | 74.4 / 76.5 / 67.4 | | +350–420 ms |

Questions per dataset: LongMemEval 500 (session granularity), LoCoMo turn-level 1,982 (gold = exact evidence turns), LoCoMo session-level 1,981 (gold = sessions holding those turns). Rerank backends: Qwen3-Reranker-0.6B (cross-encoder) and Qwen3.5-9B (LLM). Reproduce with the per-suite commands in the sections below (-suite longmemeval, locomo, locomo-sessions; add -rerank-url/-llm-rerank for the rerank rows).

Reading it: on LongMemEval, hybrid beats both single legs on every metric. On LoCoMo it trails keyword in places: 90.9 vs 92.6 R@5 at session level, where keyword's exact-token match is already near ceiling, and 42.4 vs 44.8 MRR at turn level. Turn-level LoCoMo is where base recall has real headroom, so the rerank tier earns its keep there: the cross-encoder lands +11pp R@5 / +17pp MRR over hybrid at a fraction of the LLM's latency, and the LLM goes a few points further (+15pp / +25pp over hybrid) if you already run a chat model. Where recall is already at ceiling (both session sets), reranking is a measured no-op.

Results: memini vs other memory systems

All memini numbers below are measured by this harness against a live all-MiniLM-L6-v2 (384-d) endpoint — the same embedding model agentmemory benchmarks with. Competitor numbers are cited from their own publications — we cannot re-run their systems here, and they use different embedding models, readers, and judges. Treat cross-system rows as directional, not a controlled head-to-head. (This mirrors how agentmemory documents its comparison.)

LongMemEval-S — retrieval recall_any@K

Full 500-question LongMemEval-S (~48 sessions/question), scored with the same metric agentmemory reports: does any gold session appear in the top-K retrieved? There is no LLM in the loop; this is pure retrieval. The run uses the identical embedding model agentmemory benchmarks with (all-MiniLM-L6-v2, 384-d), so the memini and agentmemory rows are directly comparable.

Hybrid recall over-fetches a deep candidate pool per leg (max(k*5, 50)) before fusing, so a memory just outside the top-k of both legs can still win — the production Recall path does the same. Fusion defaults to convex score fusion (MEMINI_FUSION_ALPHA=0.5): each leg's scores are min-max normalized to [0,1] and combined 0.5·vector + 0.5·keyword, keeping score magnitude so a memory a leg ranks far above its runners-up dominates one that is merely middling in both. A negative alpha falls back to Reciprocal Rank Fusion; deep pools then need a steep decay (rrfK=5, not the classic 60), since a flat decay lets both-leg mediocrity outscore single-leg excellence (2/(60+20) > 1/(60+0)). Score fusion gets the same effect from score magnitude directly, and beat RRF on 3 of 4 model×dataset cells (and on MRR in all 4).
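The two fusion rules described above can be sketched in a few lines of Go. This is an illustrative sketch, not memini's implementation: the function names (minMax, ScoreFusion, RRFScore) and the sample scores are hypothetical, and ranks are assumed to be 0-based, which is what the 2/(60+20) vs 1/(60+0) arithmetic implies.

```go
package main

import (
	"fmt"
	"math"
)

// minMax rescales one leg's scores to [0,1]. A leg whose scores are all
// equal maps to zeros (an assumption; the source does not cover this case).
func minMax(scores map[string]float64) map[string]float64 {
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, s := range scores {
		lo, hi = math.Min(lo, s), math.Max(hi, s)
	}
	out := make(map[string]float64, len(scores))
	for id, s := range scores {
		if hi > lo {
			out[id] = (s - lo) / (hi - lo)
		} else {
			out[id] = 0
		}
	}
	return out
}

// ScoreFusion is a convex combination of the normalized legs:
// alpha*vector + (1-alpha)*keyword. A memory missing from a leg gets 0 there.
func ScoreFusion(vector, keyword map[string]float64, alpha float64) map[string]float64 {
	fused := make(map[string]float64)
	for id, s := range minMax(vector) {
		fused[id] += alpha * s
	}
	for id, s := range minMax(keyword) {
		fused[id] += (1 - alpha) * s
	}
	return fused
}

// RRFScore sums 1/(rrfK+rank) over the 0-based ranks a memory holds in
// each leg that returned it.
func RRFScore(rrfK float64, ranks ...int) float64 {
	var sum float64
	for _, r := range ranks {
		sum += 1 / (rrfK + float64(r))
	}
	return sum
}

func main() {
	// Flat decay: rank 20 in both legs outscores rank 0 in a single leg.
	fmt.Printf("rrfK=60: both-leg %.4f vs single-leg %.4f\n", RRFScore(60, 20, 20), RRFScore(60, 0))
	// prints "rrfK=60: both-leg 0.0250 vs single-leg 0.0167"

	// Steep decay: the single-leg top hit wins.
	fmt.Printf("rrfK=5:  both-leg %.4f vs single-leg %.4f\n", RRFScore(5, 20, 20), RRFScore(5, 0))
	// prints "rrfK=5:  both-leg 0.0800 vs single-leg 0.2000"

	vector := map[string]float64{"a": 1.0, "b": 0.5, "c": 0.0}
	keyword := map[string]float64{"b": 2.0, "c": 1.0, "d": 0.0}
	fmt.Println(ScoreFusion(vector, keyword, 0.5))
}
```

Because score fusion keeps each leg's normalized magnitude, a memory that one leg ranks far above its runners-up keeps most of that lead after fusion, which is the effect rank-only RRF needs a steep rrfK to approximate.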

| System | Embedding model | R@5 | R@10 | Source |
|---|---|---|---|---|
| memini - hybrid (score) | all-MiniLM-L6-v2 (384-d) | 98.4% | 99.4% | measured |
| memini - keyword (Porter BM25) | | 97.6% | 99.0% | measured |
| memini - vector | all-MiniLM-L6-v2 | 91.8% | 96.6% | measured |
| agentmemory - BM25 + Vector | all-MiniLM-L6-v2 | 95.2% | 98.6% | published |
| agentmemory - BM25 only | | 86.2% | 94.6% | published |
| MemPalace (vector only) | larger model | ~96.6% | | self-reported |

On the same model/dataset/metric (full 500 questions), memini hybrid beats agentmemory at R@5 (98.4% vs 95.2%), R@10 (99.4% vs 98.6%), and MRR (92.3% vs 88.2%). memini's keyword leg is +11.4pp over agentmemory's BM25-only (97.6% vs 86.2%) thanks to Porter stemming, and hybrid fusion now beats either leg alone. Relative to fetching only k per leg with the classic rrfK=60, the deep-pool + score fusion is worth +2.0pp R@5 / +1.0pp R@10.

LoCoMo — retrieval recall_any@K

LoCoMo retrieval at dialogue-turn granularity (1,982 questions over 10 long conversations, gold = exact evidence turns among ~590 turns/conversation) — a much harder target than LongMemEval's session granularity, and the regime where flat-decay RRF over deep pools degrades badly.

| System (all-MiniLM-L6-v2) | R@5 | R@10 |
|---|---|---|
| memini - hybrid (score) | 59.8% | 69.8% |
| memini - keyword (Porter BM25) | 58.7% | 67.1% |
| memini - vector | 41.5% | 52.1% |

No published turn-level retrieval baselines exist to compare against (mem0 / Letta report LLM-judged QA accuracy, below). This is the one cell where the default score fusion is edged by RRF (60.1% / 71.0%): when the vector leg is near-noise (MiniLM scores only 41.5% here), giving it an equal-weight normalized vote hurts, whereas RRF's rank-only vote is more robust. Score fusion still wins this cell on MRR and wins outright on every cell with a stronger embedder — so it is the default, and MEMINI_FUSION_ALPHA=-1 selects RRF for weak-vector deployments. (Ablation: rrfK=60 over the same deep pools scored just 52.8% R@5, below the keyword leg alone — both score fusion and rrfK=5 fix that.)

Pool-depth robustness (-pool-factor / -pool-floor)

Min-max normalization could in principle be fragile to pool depth (the score at the bottom of the pool sets each leg's zero point), so score fusion was swept at per-leg depths 30 / 50 / 80 on both datasets and both embedders (hybrid R@5 / R@10 / MRR):

| cell | depth 30 | depth 50 (default) | depth 80 |
|---|---|---|---|
| LME · MiniLM | 97.8 / 99.4 / 92.0 | 98.4 / 99.4 / 92.3 | 98.6 / 99.4 / 92.6 |
| LME · Qwen3+prefix | 98.8 / 99.4 / 94.5 | 98.8 / 99.6 / 94.6 | 98.8 / 99.6 / 94.6 |
| LoCoMo · MiniLM | 60.0 / 70.1 / 42.1 | 59.8 / 69.8 / 42.6 | 59.3 / 69.6 / 42.7 |
| LoCoMo · Qwen3+prefix | 70.1 / 77.9 / 52.1 | 70.1 / 78.5 / 52.4 | 70.1 / 78.7 / 52.5 |

Quality moves at most ±0.6pp R@5 across a 2.7× depth range — no tail collapse — with the two datasets drifting in opposite directions (deeper pools help session-granularity LongMemEval slightly and hurt turn-granularity LoCoMo slightly), so the default max(k*5, 50) sits at the crossover.
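For reference, the per-leg sizing rule max(k*5, 50) can be written as a small helper. This is a sketch under assumptions: poolSize is a hypothetical name, the "non-positive keeps defaults" behaviour is taken from the MeminiSystems documentation, and it is only a guess that the 30 / 80 sweep points were reached by changing the floor.

```go
package main

import "fmt"

// poolSize returns the per-leg candidate depth max(k*factor, floor).
// Non-positive factor or floor fall back to the defaults 5 and 50.
func poolSize(k, factor, floor int) int {
	if factor <= 0 {
		factor = 5
	}
	if floor <= 0 {
		floor = 50
	}
	n := k * factor
	if n < floor {
		n = floor
	}
	return n
}

func main() {
	fmt.Println(poolSize(5, 0, 0))  // prints 50: the floor dominates at k=5
	fmt.Println(poolSize(20, 0, 0)) // prints 100: k*5 dominates at k=20
	fmt.Println(poolSize(5, 0, 30)) // prints 30: a shallower floor
	fmt.Println(poolSize(5, 0, 80)) // prints 80: a deeper floor
}
```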

Recency-aware re-ranking (-rerank)

memini re-ranks the fused candidates by a composite of relevance, recency, and importance. The recency weight is deliberately light (0.05): a sweep on LongMemEval-S (knowledge-update + temporal-reasoning, q.Now = question date, sessions timestamped from haystack_dates) shows recency is a net win only as a tie-breaker, and actively harmful when over-weighted.

| recency weight | R@1 (both cats) | knowledge-update R@1 | temporal R@1 | MRR |
|---|---|---|---|---|
| 0 (pure RRF) | 82.9% | 91.0% | 78.2% | 90.1% |
| 0.05 (default) | 83.4% | 91.0% | 78.9% | 90.5% |
| 0.15 | 83.9% | 89.7% | 80.5% | 90.7% |
| 0.25 | 83.4% | 87.2% | 81.2% | 90.4% |

At 0.05 the re-ranker is +0.5pp R@1 / +0.4pp MRR over pure RRF with no knowledge-update cost, and recall@5 is identical across all weights (the re-rank only reorders within the top results). The steep RRF decay made the composite far more robust to the recency weight than the flat rrfK=60 decay was (where 0.15+ buried correct-but-older memories); the default stays at the conservative 0.05 since the gains beyond it are within noise.
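A toy model makes the tie-breaker behaviour concrete. Everything in this sketch is an assumption for illustration: the source does not give the composite formula, so it uses relevance + weight × recency with an exponential half-life decay, leaves out the importance term, and invents the two candidates and the 30-day half-life.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const day = 24 * time.Hour

type candidate struct {
	id        string
	relevance float64       // fused retrieval score, assumed to lie in [0,1]
	age       time.Duration // the question's "now" minus the memory's timestamp
}

// recency is a hypothetical decay: 1 for a brand-new memory, 0.5 after one half-life.
func recency(age, halfLife time.Duration) float64 {
	return math.Exp2(-age.Hours() / halfLife.Hours())
}

// composite is an assumed additive form; memini's real composite also
// includes importance and may combine the terms differently.
func composite(c candidate, weight float64, halfLife time.Duration) float64 {
	return c.relevance + weight*recency(c.age, halfLife)
}

// top returns the id of the highest-scoring candidate (cands must be non-empty).
func top(cands []candidate, weight float64, halfLife time.Duration) string {
	best := cands[0]
	for _, c := range cands[1:] {
		if composite(c, weight, halfLife) > composite(best, weight, halfLife) {
			best = c
		}
	}
	return best.id
}

// demo pairs a more relevant but older memory with a slightly less relevant recent one.
var demo = []candidate{
	{id: "older", relevance: 0.80, age: 90 * day},
	{id: "newer", relevance: 0.75, age: 1 * day},
}

func main() {
	for _, w := range []float64{0, 0.05, 0.25} {
		fmt.Printf("weight %.2f -> %s\n", w, top(demo, w, 30*day))
	}
	// prints:
	// weight 0.00 -> older
	// weight 0.05 -> older
	// weight 0.25 -> newer
}
```

At a light weight, recency only decides between candidates whose relevance is nearly equal; at 0.25 in this toy model it overturns a 0.05 relevance gap and buries the correct-but-older memory.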

Temporal targeting (temporal 0.40)

Recency weighting trades off against itself: raising it helps temporal-reasoning (78.2→81.2% R@1) but hurts knowledge-update (91.0→87.2%), whose answers aren't necessarily recent. Temporal targeting avoids that: when a query names a relative time ("three weeks ago"), it computes target = now − offset and boosts candidates dated near that point, not near now. It only fires on temporal queries, so other categories are unaffected.

| Strategy | all R@1 | knowledge-update R@1 | temporal-reasoning R@1 | MRR |
|---|---|---|---|---|
| recency 0.05 (prior default) | 83.4% | 91.0% | 78.9% | 90.5% |
| recency 0.25 | 83.4% | 87.2% | 81.2% | 90.4% |
| temporal 0.40 | 85.3% | 91.0% | 82.0% | 91.5% |

Temporal targeting is +1.9pp R@1 overall over the recency default and beats even the heaviest recency weight on temporal-reasoning without the knowledge-update regression, so it is enabled by default in production (MEMINI_TEMPORAL_BOOST=0.40; 0 disables it). The no-LLM regex extractor only catches templated phrasing; an LLM anchor extractor (plugging into the same search.AnchorExtractor interface) can resolve looser references and is the intended with-LLM tier.
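As a rough illustration of the no-LLM tier, the sketch below pulls a templated relative-time phrase out of a query and computes target = now − offset. The pattern, the word list, and the names relTime, targetTime, and date are hypothetical; memini's own extractor and the search.AnchorExtractor signature are not shown in this document.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"time"
)

// relTime matches templated phrases such as "3 days ago" or "three weeks ago".
var relTime = regexp.MustCompile(
	`(?i)\b(\d+|an?|one|two|three|four|five|six|seven|eight|nine|ten)\s+(day|week|month|year)s?\s+ago\b`)

var numberWords = map[string]int{
	"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
	"six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

// targetTime returns now minus the offset named in the query.
// ok is false when the query carries no recognizable relative time.
func targetTime(query string, now time.Time) (target time.Time, ok bool) {
	m := relTime.FindStringSubmatch(query)
	if m == nil {
		return time.Time{}, false
	}
	n, known := numberWords[strings.ToLower(m[1])]
	if !known {
		var err error
		if n, err = strconv.Atoi(m[1]); err != nil {
			return time.Time{}, false
		}
	}
	switch strings.ToLower(m[2]) {
	case "day":
		return now.AddDate(0, 0, -n), true
	case "week":
		return now.AddDate(0, 0, -7*n), true
	case "month":
		return now.AddDate(0, -n, 0), true
	default: // "year"
		return now.AddDate(-n, 0, 0), true
	}
}

// date builds a UTC midnight timestamp for the examples.
func date(y int, m time.Month, d int) time.Time {
	return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
}

func main() {
	now := date(2024, 5, 22)
	if t, ok := targetTime("What did I cook three weeks ago?", now); ok {
		fmt.Println(t.Format("2006-01-02")) // prints 2024-05-01
	}
	_, ok := targetTime("What is my cat called?", now)
	fmt.Println(ok) // prints false: non-temporal queries are left alone
}
```

A query with no match returns ok == false, which mirrors the property that the boost only fires on temporal queries.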

Held-out split (-holdout)

To avoid overfitting tuning decisions to the full benchmark, -holdout splits LongMemEval deterministically by load order: every 10th question is held (50/500) and the rest are tune (450/500). Sweep parameters on -holdout tune, then report the final number on -holdout held (unseen). The default, all, runs the full set. Results files are suffixed (longmemeval-held.json) so splits don't overwrite each other.
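A deterministic load-order split of this kind can be sketched as follows. The helper names are hypothetical, and which position in each block of ten is held (here the 10th, 20th, and so on) is an assumption; the source only says that every 10th question is held.

```go
package main

import "fmt"

// inSplit reports whether the question at 0-based load-order index i
// belongs to the named split ("held", "tune", or anything else for "all").
func inSplit(i int, split string) bool {
	held := i%10 == 9 // assumption: the 10th, 20th, ... question is held
	switch split {
	case "held":
		return held
	case "tune":
		return !held
	default:
		return true
	}
}

// count returns how many of n questions fall in the split.
func count(n int, split string) int {
	total := 0
	for i := 0; i < n; i++ {
		if inSplit(i, split) {
			total++
		}
	}
	return total
}

func main() {
	fmt.Println(count(500, "all"), count(500, "tune"), count(500, "held"))
	// prints "500 450 50"
}
```

Because membership depends only on the index, re-running a sweep never moves a question between tune and held.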

Measured (memini-hybrid, all-MiniLM-L6-v2 — the parameters were swept on tune, not held):

| Split | Questions | R@5 | R@10 | MRR |
|---|---|---|---|---|
| all (full) | 500 | 98.4 | 99.2 | 93.0 |
| tune | 450 | 98.2 | 99.1 | 93.0 |
| held | 50 | 100.0 | 100.0 | 93.5 |

The held split does not regress against tune, so the tuning choices generalize (no tuned-to-test inflation). Per-category R@5:

| Category | tune (450) | held (50) |
|---|---|---|
| knowledge-update | 100.0 | 100.0 |
| multi-session | 99.2 | 100.0 |
| single-session-assistant | 100.0 | 100.0 |
| single-session-user | 96.8 | 100.0 |
| temporal-reasoning | 98.3 | 100.0 |
| single-session-preference | 88.9 | 100.0 |

Read the per-category numbers off tune (450 questions); it shows the real headroom is single-session-preference (88.9% R@5). On held each category is only 2–13 questions, so its across-the-board 100% is small-sample, not a separate claim of perfection.

Session-doc construction (-session-doc)

LongMemEval sessions are embedded as one document per session; -session-doc controls what text that document contains, to measure the vector leg's sensitivity to document shape:

  • full (default): "role: content" for every turn.
  • user-only: only the user turns, with no role prefixes. Assistant turns dilute the embedding for user-question recall; MemPalace reports 96.6% R@5 vector-only with this shape on the same MiniLM model.
  • dated: full, prefixed with the session date, giving temporal questions a textual anchor that embeddings would otherwise ignore.

Compare the vector row's recall_any@5 across modes (cached embeddings make the sweep cheap); the keyword and hybrid rows shift too but the vector leg is the target.

memini hybrid per-category (all-MiniLM, recall_any@10): multi-session 100%, knowledge-update 100%, single-session-user 98.6%, single-session-assistant 98.2%, temporal-reasoning 97.0%, single-session-preference 96.7%.

Rerank tier — cross-encoder vs LLM (-rerank-url / -llm-rerank)

The read-side rerank reorders the top of the production candidate order. The bench drives either backend through the same comparison (one reranker call per question — use -limit):

# cross-encoder (fast; e.g. Qwen3-Reranker-0.6B via llama-server --rerank):
go run ./cmd/bench -suite locomo -data ./locomo.json -rerank-url http://localhost:8002/v1 -rerank-model qwen3-reranker-0.6b -limit 100 -k 5,10
# LLM reranker (slow; MEMINI_LLM_*):
go run ./cmd/bench -suite locomo -data ./locomo.json -llm-rerank -limit 100 -k 5,10

Measured on all-MiniLM-L6-v2 (cross-encoder = Qwen3-Reranker-0.6B, LLM = Qwen3.5-9B), recall_any@5 / @10 / MRR:

| Config | LongMemEval (session) | LoCoMo turn-level | added p50 |
|---|---|---|---|
| hybrid (base) | 98.4 / 99.2 / 93.0 | 59.7 / 69.9 / 42.4 | |
| + cross-encoder | 98.4 / 99.2 / 93.1 | 70.9 / 75.0 / 59.8 | ~20–230 ms |
| + LLM rerank | 98.4 / 99.2 / 93.0 | 74.4 / 76.5 / 67.4 | ~350–420 ms |

Reranking is a no-op at recall ceiling (session-level) and a big win where recall has headroom (turn-level: +11pp R@5 / +17pp MRR for the cross-encoder, +15pp / +25pp for the LLM). The cross-encoder captures most of the LLM's lift at a fraction of the latency with no chat model — the recommended production rerank (MEMINI_RERANK=<url>); the LLM tier (MEMINI_RERANK=llm) buys the last points if you already run one.

LoCoMo — end-to-end QA accuracy (LLM-judge)

The metric mem0/Letta publish: retrieve → generate an answer → an LLM judges it against the gold answer. memini's number uses a fast instruct reader+judge (Llama-3.3-70B-Instruct); the competitor numbers use their own readers/judges, so this is directional.

| System | LoCoMo QA accuracy | Source |
|---|---|---|
| memini (hybrid retrieval + instruct reader) | full run pending | measured |
| Letta / MemGPT | 83.2% | published |
| Mem0 | 68.5% | published |

Sources: agentmemory COMPARISON.md/LONGMEMEVAL.md; LongMemEval (arXiv 2410.10813); LoCoMo (snap-stanford.github.io/LoCoMo); mem0.ai; letta.com.

Metrics

  • Recall@K — fraction of questions whose gold memory appears in the top K.
  • MRR — mean reciprocal rank of the first gold hit.
  • p50/p95 — recall latency; ingest — total ingest time.
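The two quality metrics can be computed as in the sketch below. It is a minimal illustration with hypothetical names (outcome, recallAtK, mrr), not the harness's own code; it uses the recall_any reading (a question counts as a hit if any gold id is in the top K) and assumes a question with no gold hit contributes 0 to MRR.

```go
package main

import "fmt"

// outcome is one question's ranked retrieval plus its gold ids.
type outcome struct {
	retrieved []string
	gold      []string
}

// firstGoldRank returns the 1-based rank of the first gold id, or 0 if none was retrieved.
func firstGoldRank(retrieved, gold []string) int {
	goldSet := make(map[string]bool, len(gold))
	for _, g := range gold {
		goldSet[g] = true
	}
	for i, id := range retrieved {
		if goldSet[id] {
			return i + 1
		}
	}
	return 0
}

// recallAtK is the fraction of questions with any gold id in the top k.
func recallAtK(runs []outcome, k int) float64 {
	if len(runs) == 0 {
		return 0
	}
	hits := 0
	for _, r := range runs {
		if rank := firstGoldRank(r.retrieved, r.gold); rank > 0 && rank <= k {
			hits++
		}
	}
	return float64(hits) / float64(len(runs))
}

// mrr is the mean of 1/rank of the first gold hit (0 when there is none).
func mrr(runs []outcome) float64 {
	if len(runs) == 0 {
		return 0
	}
	var sum float64
	for _, r := range runs {
		if rank := firstGoldRank(r.retrieved, r.gold); rank > 0 {
			sum += 1 / float64(rank)
		}
	}
	return sum / float64(len(runs))
}

var demo = []outcome{
	{retrieved: []string{"a", "b", "c"}, gold: []string{"b"}},      // first gold at rank 2
	{retrieved: []string{"x", "y", "z"}, gold: []string{"x", "q"}}, // first gold at rank 1
}

func main() {
	fmt.Printf("R@1=%.2f R@2=%.2f MRR=%.2f\n", recallAtK(demo, 1), recallAtK(demo, 2), mrr(demo))
	// prints "R@1=0.50 R@2=1.00 MRR=0.75"
}
```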

Output is a Markdown table (stdout) plus JSON under bench/results/.

What it compares today

Three memini retrieval strategies over the same ingested store, to show the value of hybrid fusion:

| System | Retrieval |
|---|---|
| memini-hybrid | vector + keyword, score fusion (production path) |
| memini-vector | dense vector only |
| memini-keyword | BM25 keyword only |

memini-hybrid should never score below either single strategy.

Datasets

  • sample — committed at bench/data/sample.json, runs fully offline.
  • Normalized schema (-suite file) — {name, items:[{id,content}], questions:[{query,gold:[id]}]}.
  • LongMemEval / LoCoMo — loaders map the published JSON shapes to the normalized schema (each session/turn becomes an item; answer/evidence ids become gold). Download the datasets and pass -data.
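A tiny file in the normalized schema, and a decode of it, might look like the sketch below. The struct fields and JSON tags are copied from the package's Dataset, Item, and Question types (minus the fields tagged json:"-" and the optional answer/category); the sample contents and the parse helper are made up for illustration, and LoadFile may do more than a plain unmarshal.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Item struct {
	ID      string `json:"id"`
	Content string `json:"content"`
	Group   string `json:"group,omitempty"`
}

type Question struct {
	Query string   `json:"query"`
	Gold  []string `json:"gold"`
	Group string   `json:"group,omitempty"`
}

type Dataset struct {
	Name      string     `json:"name"`
	Items     []Item     `json:"items"`
	Questions []Question `json:"questions"`
}

// sample is a hypothetical two-item dataset in the normalized schema.
const sample = `{
  "name": "tiny",
  "items": [
    {"id": "m1", "content": "Alice adopted a cat named Miso."},
    {"id": "m2", "content": "Bob moved to Lisbon in March."}
  ],
  "questions": [
    {"query": "What is Alice's cat called?", "gold": ["m1"]}
  ]
}`

// parse decodes a normalized dataset from JSON bytes.
func parse(data []byte) (*Dataset, error) {
	var ds Dataset
	if err := json.Unmarshal(data, &ds); err != nil {
		return nil, err
	}
	return &ds, nil
}

func main() {
	ds, err := parse([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s: %d items, %d questions, gold for q0 = %v\n",
		ds.Name, len(ds.Items), len(ds.Questions), ds.Questions[0].Gold)
	// prints "tiny: 2 items, 1 questions, gold for q0 = [m1]"
}
```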

Recall@K on LongMemEval/LoCoMo is easy to overfit — treat scores as directional.

External baselines

bench.System is the extension point. To compare against mem0, Zep/Graphiti, Letta, Cognee, agentmemory, or supermemory, implement System (Name / Ingest / Recall) over each service's API and add it to the run list in cmd/bench. These require the respective services/keys and are intentionally not vendored here.
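To show the shape of such an adapter, here is a toy System backed by in-memory term overlap instead of a remote service. The System and Item definitions are copied from this package's documentation so the sketch compiles on its own; overlapSystem, tokens, and demo are hypothetical, and it is an assumption that Recall should return item IDs in ranked order (gold sets are ID lists). A real adapter would call the service's API inside Ingest and Recall.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"strings"
)

type Item struct {
	ID      string
	Content string
	Group   string
}

type System interface {
	Name() string
	Ingest(ctx context.Context, items []Item) error
	Recall(ctx context.Context, group, query string, k int) ([]string, error)
}

// overlapSystem ranks items by how many query terms they share.
type overlapSystem struct {
	byGroup map[string][]Item
}

var _ System = (*overlapSystem)(nil) // compile-time interface check

// tokens lowercases s and splits it into a set of punctuation-trimmed words.
func tokens(s string) map[string]bool {
	set := make(map[string]bool)
	for _, w := range strings.Fields(strings.ToLower(s)) {
		if w = strings.Trim(w, ".,;:!?\"'"); w != "" {
			set[w] = true
		}
	}
	return set
}

func (s *overlapSystem) Name() string { return "toy-overlap" }

func (s *overlapSystem) Ingest(_ context.Context, items []Item) error {
	if s.byGroup == nil {
		s.byGroup = make(map[string][]Item)
	}
	for _, it := range items {
		s.byGroup[it.Group] = append(s.byGroup[it.Group], it)
	}
	return nil
}

func (s *overlapSystem) Recall(_ context.Context, group, query string, k int) ([]string, error) {
	type scored struct {
		id      string
		overlap int
	}
	q := tokens(query)
	var hits []scored
	for _, it := range s.byGroup[group] {
		n := 0
		for w := range tokens(it.Content) {
			if q[w] {
				n++
			}
		}
		if n > 0 {
			hits = append(hits, scored{it.ID, n})
		}
	}
	// Stable sort keeps ingest order among equal scores.
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].overlap > hits[j].overlap })
	var ids []string
	for i := 0; i < len(hits) && i < k; i++ {
		ids = append(ids, hits[i].id)
	}
	return ids, nil
}

// demo ingests two items and recalls the best match for one query.
func demo() ([]string, error) {
	ctx := context.Background()
	sys := &overlapSystem{}
	if err := sys.Ingest(ctx, []Item{
		{ID: "m1", Content: "Alice adopted a cat named Miso."},
		{ID: "m2", Content: "Bob moved to Lisbon in March."},
	}); err != nil {
		return nil, err
	}
	return sys.Recall(ctx, "", "which cat is named Miso", 1)
}

func main() {
	ids, err := demo()
	fmt.Println(ids, err) // prints "[m1] <nil>"
}
```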

Documentation

Overview

Package bench is a retrieval benchmark harness: it ingests a dataset of memories and scores each question's gold retrieval (Recall@K, MRR) and latency. Runs offline on the committed sample with a deterministic local embedder, or against a real endpoint and a converted LongMemEval/LoCoMo set.

Functions

func Markdown

func Markdown(results []Result) string

Markdown renders results that share a K as a comparison table, best first.

func RerankMarkdown

func RerankMarkdown(results []RerankResult, k int) string

RerankMarkdown renders the RRF-vs-composite comparison, grouped by category.

Types

type Dataset

type Dataset struct {
	Name      string     `json:"name"`
	Items     []Item     `json:"items"`
	Questions []Question `json:"questions"`
}

Dataset is a normalized retrieval benchmark.

func LoadFile

func LoadFile(path string) (*Dataset, error)

LoadFile reads a dataset in memini's normalized JSON schema.

func LoadLoCoMo

func LoadLoCoMo(path string) (*Dataset, error)

LoadLoCoMo converts the published LoCoMo file into the normalized Dataset. Each conversation is its own group/namespace (dialogue ids repeat across conversations); each dialogue turn is an item, and each QA's evidence ids are its gold set. Questions without evidence (e.g. adversarial) are skipped.

func LoadLoCoMoSessions added in v0.0.4

func LoadLoCoMoSessions(path string) (*Dataset, error)

LoadLoCoMoSessions loads LoCoMo at SESSION granularity: each conversation session becomes one document (its turns concatenated), and a question's gold set is the session(s) holding its evidence turns. This matches how session-level memory systems (e.g. MemPalace) score LoCoMo, enabling an apples-to-apples comparison; LoadLoCoMo scores the harder turn granularity.

func LoadLongMemEval

func LoadLongMemEval(path string, mode DocMode) (*Dataset, error)

func Poison added in v0.0.11

func Poison(ds *Dataset, perGroup int, filler string) *Dataset

Poison returns a copy of ds with perGroup debris items added to every group that has questions, simulating a low-quality bulk import (e.g. a mem0 export of restatements) collapsed into the namespace. The debris shares one content template so a dedup pass clusters and collapses it, modelling the realistic "exports are full of near-duplicates" case. Use it to measure the Recall@K delta a poisoned store suffers, and to check that dedup/curation recovers it.

func Sample

func Sample() (*Dataset, error)

Sample returns the committed offline sample dataset.

type DocMode added in v0.0.4

type DocMode string

DocMode selects how a LongMemEval haystack session is rendered into one embedded item, for the vector-leg document-construction experiment. It is the mode argument of LoadLongMemEval, which converts a LongMemEval file: each haystack session becomes an item, and each question's answer_session_ids become its gold set.

const (
	// DocFull renders "role: content\n" for every turn (the production shape).
	DocFull DocMode = "full"
	// DocUserOnly renders only user turns, with no role prefixes (MemPalace's
	// raw mode: assistant turns dilute the vector leg on user-question recall).
	DocUserOnly DocMode = "user-only"
	// DocDated prefixes the full session with its date, so temporal questions
	// have a textual anchor the embedder can see.
	DocDated DocMode = "dated"
)

type Item

type Item struct {
	ID      string    `json:"id"`
	Content string    `json:"content"`
	Group   string    `json:"group,omitempty"`
	Time    time.Time `json:"-"`
}

Item is one memory to ingest; Group scopes it to a namespace, empty falls back to a shared default. Time, when set, is the memory's source timestamp (used to ground recency in the recency-aware re-ranking comparison).

type Question

type Question struct {
	Query    string    `json:"query"`
	Gold     []string  `json:"gold"`
	Group    string    `json:"group,omitempty"`
	Answer   string    `json:"answer,omitempty"`
	Category string    `json:"category,omitempty"`
	Now      time.Time `json:"-"`
}

Question is a query plus the gold memory IDs it should retrieve. Group must match its items; Answer/Category are populated for QA evaluation where available. Now, when set, is the query's reference time (e.g. the question date) — the "now" against which recency is measured.

type RerankResult

type RerankResult struct {
	System    string
	Category  string
	Questions int
	RecallAt1 float64
	RecallAtK float64
	MRR       float64
}

RerankResult is one ranking strategy's score over a question set.

func LLMRerankCompare added in v0.0.4

func LLMRerankCompare(
	ctx context.Context, st store.Store, e embed.Embedder, rr rerank.Reranker,
	ds *Dataset, k, fetch int, queryPrefix string,
) ([]RerankResult, error)

LLMRerankCompare measures the with-LLM read-side rerank lift on pure retrieval. For each question it builds the production candidate order (hybrid score fusion -> composite re-rank), then re-orders the top `fetch` with an LLM reranker, and scores recall@1/@k and MRR for both. The LLM tier is slow (one chat call per question), so drive it over a subset with cmd/bench -limit.

func RerankCompare

func RerankCompare(
	ctx context.Context, st store.Store, e embed.Embedder, ds *Dataset, cats []string, k int, queryPrefix string,
) ([]RerankResult, error)

RerankCompare isolates the effect of recency-aware re-ranking: it ingests ds (items carry source timestamps), then for each selected question scores the SAME fused candidate set two ways — pure RRF order vs the composite re-ranker using the question's reference time. Reports recall@1, recall@K, and MRR per category and overall, for both strategies. cats empty means all categories.

type Result

type Result struct {
	System      string             `json:"system"`
	Dataset     string             `json:"dataset"`
	K           int                `json:"k"`
	Questions   int                `json:"questions"`
	RecallAtK   float64            `json:"recall_at_k"`
	MRR         float64            `json:"mrr"`
	P50Millis   float64            `json:"p50_ms"`
	P95Millis   float64            `json:"p95_ms"`
	IngestMs    float64            `json:"ingest_ms"`
	PerCategory map[string]float64 `json:"per_category,omitempty"`
}

Result is one system's score on a dataset at a given K.

func Run

func Run(ctx context.Context, sys System, ds *Dataset, ks []int) ([]Result, error)

Run ingests the dataset into a system once, then scores recall_any@K and MRR for every K in ks from a single retrieval pass (retrieving max(ks) per question). Returns one Result per K.

type System

type System interface {
	Name() string
	Ingest(ctx context.Context, items []Item) error
	Recall(ctx context.Context, group, query string, k int) ([]string, error)
}

System is a memory system under test.

func MeminiSystems

func MeminiSystems(
	st store.Store, e embed.Embedder, concurrency int, queryPrefix string, fusionAlpha float64, poolFactor, poolFloor int,
) []System

MeminiSystems returns the hybrid, vector-only, and keyword-only retrieval strategies sharing one ingested store. queryPrefix, when non-empty, is prepended to query embeddings (hybrid and vector legs), matching MEMINI_EMBED_QUERY_PREFIX in production. fusionAlpha < 0 uses RRF; >= 0 uses convex-combination score fusion with that vector weight. poolFactor/poolFloor override hybrid recall's per-leg pool sizing (non-positive keeps defaults).
