bench

package

v0.2.10 · Published: Jun 13, 2026 · License: AGPL-3.0 · Imports: 17 · Imported by: 0

Repository

github.com/eleboucher/memini

README ¶

memini benchmark harness

A retrieval benchmark: ingest a dataset of memories, then for each question measure how well a system retrieves the gold supporting memories.

mise run bench                 # offline sample, local embedder
go run ./cmd/bench -k 5        # same, explicit K

Against a real embeddings model and a real dataset:

export MEMINI_EMBED_BASE_URL=http://localhost:8081/v1
export MEMINI_EMBED_MODEL=bge-m3 MEMINI_EMBED_DIMS=1024
# Optional: instruction-tuned asymmetric embedders (Qwen3-Embedding, bge) score
# higher when queries carry a retrieval instruction; documents stay bare.
# Measured on Qwen3-Embedding-8B: +6.0pp R@5 on the LongMemEval vector leg
# (91.2% -> 97.2%), +1.0pp MRR on the fused ranking on both datasets.
export MEMINI_EMBED_QUERY_PREFIX=$'Instruct: Given a user query, retrieve relevant memories that answer it\nQuery:'
go run ./cmd/bench -suite longmemeval -data ./longmemeval_s.json -k 5
go run ./cmd/bench -suite locomo      -data ./locomo.json        -k 5

# Isolate the recency-aware re-ranker against pure RRF on the same candidates,
# using each question's date as "now" (needs a timestamped dataset):
go run ./cmd/bench -suite longmemeval -data ./longmemeval_s.json -rerank -k 5

Full results

Everything this harness measures, in one table — sourced from the committed results/ JSON, all on the same all-MiniLM-L6-v2 (384-d) endpoint. Cells are recall_any@5 / @10 / MRR (%); p50 is in-process recall latency (rerank rows show the added cost). The detailed per-dataset sections below explain the methodology, sweeps, and caveats behind each column.

Strategy	LongMemEval · session	LoCoMo · turn-level	LoCoMo · session-level	p50
vector	92.6 / 95.4 / 80.7	41.3 / 51.8 / 28.1	64.1 / 79.8 / 45.2	<1 ms
keyword (Porter BM25)	97.6 / 99.0 / 92.2	58.7 / 67.1 / 44.8	92.6 / 96.8 / 79.4	~3 ms
hybrid (default, production path)	98.4 / 99.2 / 93.0	59.7 / 69.9 / 42.4	90.9 / 96.6 / 74.3	~5 ms
+ cross-encoder (`MEMINI_RERANK=<url>`)	98.4 / 99.2 / 93.1	70.9 / 75.0 / 59.8	90.9 / 96.6 / 74.3	+20–230 ms
+ LLM rerank (`MEMINI_RERANK=llm`)	98.4 / 99.2 / 93.0	74.4 / 76.5 / 67.4	—	+350–420 ms

Questions per dataset: LongMemEval 500 (session granularity), LoCoMo turn-level 1,982 (gold = exact evidence turns), LoCoMo session-level 1,981 (gold = sessions holding those turns). Rerank backends: Qwen3-Reranker-0.6B (cross-encoder) and Qwen3.5-9B (LLM). Reproduce with the per-suite commands in the sections below (-suite longmemeval, locomo, locomo-sessions; add -rerank-url/-llm-rerank for the rerank rows).

Reading it: on LongMemEval, hybrid beats both single legs. On LoCoMo session-level it trails keyword slightly (90.9 vs 92.6 R@5), because keyword's exact-token match is already near ceiling there. On turn-level LoCoMo, base recall has real headroom, so the rerank tier earns its keep: the cross-encoder lands +11pp R@5 / +17pp MRR over hybrid at a fraction of the LLM's latency, and the LLM reaches +15pp / +25pp over hybrid if you already run a chat model. Where recall is already at ceiling (both session sets), reranking is a measured no-op.

Results: memini vs other memory systems

All memini numbers below are measured by this harness against a live all-MiniLM-L6-v2 (384-d) endpoint — the same embedding model agentmemory benchmarks with. Competitor numbers are cited from their own publications — we cannot re-run their systems here, and they use different embedding models, readers, and judges. Treat cross-system rows as directional, not a controlled head-to-head. (This mirrors how agentmemory documents its comparison.)

LongMemEval-S — retrieval `recall_any@K`

Full 500-question LongMemEval-S (~48 sessions/question), scored with the same metric agentmemory reports: does any gold session appear in the top-K retrieved? No LLM in the loop, pure retrieval. The embedding model is also the one agentmemory benchmarks with (all-MiniLM-L6-v2, 384-d), so the agentmemory rows are the closest like-for-like comparison in the table below.

Hybrid recall over-fetches a deep candidate pool per leg (max(k*5, 50)) before fusing, so a memory just outside the top-k of both legs can still win — the production Recall path does the same. Fusion defaults to convex score fusion (MEMINI_FUSION_ALPHA=0.5): each leg's scores are min-max normalized to [0,1] and combined 0.5·vector + 0.5·keyword, keeping score magnitude so a memory a leg ranks far above its runners-up dominates one that is merely middling in both. A negative alpha falls back to Reciprocal Rank Fusion; deep pools then need a steep decay (rrfK=5, not the classic 60), since a flat decay lets both-leg mediocrity outscore single-leg excellence (2/(60+20) > 1/(60+0)). Score fusion gets the same effect from score magnitude directly, and beat RRF on 3 of 4 model×dataset cells (and on MRR in all 4).
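The two fusion rules described above can be sketched in a few lines of Go. This is an illustration only, not memini's implementation: the function names (minMax, scoreFuse, rrf, ranked), the 0-based ranks, and the treatment of a candidate missing from one leg (it contributes 0 there) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// minMax rescales one leg's raw scores to [0,1]. If every score is equal the
// leg carries no ranking signal; this sketch maps all of them to 1.
func minMax(scores map[string]float64) map[string]float64 {
	out := make(map[string]float64, len(scores))
	first := true
	var lo, hi float64
	for _, s := range scores {
		if first || s < lo {
			lo = s
		}
		if first || s > hi {
			hi = s
		}
		first = false
	}
	for id, s := range scores {
		if hi == lo {
			out[id] = 1
			continue
		}
		out[id] = (s - lo) / (hi - lo)
	}
	return out
}

// scoreFuse combines the legs as alpha*vector + (1-alpha)*keyword after
// min-max normalization. A candidate absent from a leg scores 0 on that leg
// (an assumption of this sketch).
func scoreFuse(vector, keyword map[string]float64, alpha float64) map[string]float64 {
	fused := make(map[string]float64)
	for id, s := range minMax(vector) {
		fused[id] += alpha * s
	}
	for id, s := range minMax(keyword) {
		fused[id] += (1 - alpha) * s
	}
	return fused
}

// rrf sums 1/(k+rank) over the legs a candidate appears in (0-based ranks).
func rrf(k float64, ranks ...int) float64 {
	var sum float64
	for _, r := range ranks {
		sum += 1 / (k + float64(r))
	}
	return sum
}

// ranked orders ids by descending fused score, breaking ties by id.
func ranked(fused map[string]float64) []string {
	ids := make([]string, 0, len(fused))
	for id := range fused {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if fused[ids[i]] != fused[ids[j]] {
			return fused[ids[i]] > fused[ids[j]]
		}
		return ids[i] < ids[j]
	})
	return ids
}

func main() {
	// "a" and "d" each top one leg; "b" is middling in both.
	vector := map[string]float64{"a": 0.9, "b": 0.4, "c": 0.1}
	keyword := map[string]float64{"d": 10, "b": 4, "c": 1}
	fmt.Println(ranked(scoreFuse(vector, keyword, 0.5)))

	// Rank 20 in both legs versus rank 0 in one leg, at rrfK=60 and rrfK=5.
	fmt.Println(rrf(60, 20, 20) > rrf(60, 0), rrf(5, 20, 20) > rrf(5, 0))
}
```

In this toy input the single-leg winners outrank the both-leg middling candidate under score fusion, and the RRF comparison flips between rrfK=60 and rrfK=5, which is the effect the paragraph above describes.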

System	Embedding model	R@5	R@10	Source
memini — hybrid (score)	all-MiniLM-L6-v2 (384-d)	98.4%	99.4%	measured
memini — keyword (Porter BM25)	—	97.6%	99.0%	measured
memini — vector	all-MiniLM-L6-v2	91.8%	96.6%	measured
agentmemory — BM25 + Vector	all-MiniLM-L6-v2	95.2%	98.6%	published
agentmemory — BM25 only	—	86.2%	94.6%	published
MemPalace (vector only)	larger model	~96.6%	—	self-reported

On the same model/dataset/metric (full 500 questions), memini hybrid beats agentmemory at R@5 (98.4% vs 95.2%), R@10 (99.4% vs 98.6%), and MRR (92.3% vs 88.2%). memini's keyword leg is +11.4pp over agentmemory's BM25-only (97.6% vs 86.2%) thanks to Porter stemming, and hybrid fusion now beats either leg alone. Relative to fetching only k per leg with the classic rrfK=60, the deep-pool + score fusion is worth +2.0pp R@5 / +1.0pp R@10.

LoCoMo — retrieval `recall_any@K`

LoCoMo retrieval at dialogue-turn granularity (1,982 questions over 10 long conversations, gold = exact evidence turns among ~590 turns/conversation) — a much harder target than LongMemEval's session granularity, and the regime where flat-decay RRF over deep pools degrades badly.

System (all-MiniLM-L6-v2)	R@5	R@10
memini — hybrid (score)	59.8%	69.8%
memini — keyword (Porter BM25)	58.7%	67.1%
memini — vector	41.5%	52.1%

No published turn-level retrieval baselines exist to compare against (mem0 / Letta report LLM-judged QA accuracy, below). This is the one cell where the default score fusion is edged by RRF (60.1% / 71.0%): when the vector leg is near-noise (MiniLM scores only 41.5% here), giving it an equal-weight normalized vote hurts, whereas RRF's rank-only vote is more robust. Score fusion still wins this cell on MRR and wins outright on every cell with a stronger embedder — so it is the default, and MEMINI_FUSION_ALPHA=-1 selects RRF for weak-vector deployments. (Ablation: rrfK=60 over the same deep pools scored just 52.8% R@5, below the keyword leg alone — both score fusion and rrfK=5 fix that.)

Pool-depth robustness (`-pool-factor` / `-pool-floor`)

Min-max normalization could in principle be fragile to pool depth (the score at the bottom of the pool sets each leg's zero point), so score fusion was swept at per-leg depths 30 / 50 / 80 on both datasets and both embedders (hybrid R@5 / R@10 / MRR):

cell	depth 30	depth 50 (default)	depth 80
LME · MiniLM	97.8 / 99.4 / 92.0	98.4 / 99.4 / 92.3	98.6 / 99.4 / 92.6
LME · Qwen3+prefix	98.8 / 99.4 / 94.5	98.8 / 99.6 / 94.6	98.8 / 99.6 / 94.6
LoCoMo · MiniLM	60.0 / 70.1 / 42.1	59.8 / 69.8 / 42.6	59.3 / 69.6 / 42.7
LoCoMo · Qwen3+prefix	70.1 / 77.9 / 52.1	70.1 / 78.5 / 52.4	70.1 / 78.7 / 52.5

Quality moves at most ±0.6pp R@5 across a 2.7× depth range — no tail collapse — with the two datasets drifting in opposite directions (deeper pools help session-granularity LongMemEval slightly and hurt turn-granularity LoCoMo slightly), so the default max(k*5, 50) sits at the crossover.
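For reference, the per-leg pool sizing can be written as a small helper. This is a sketch under assumptions: the name poolSize is hypothetical, and generalizing the documented default max(k*5, 50) to max(k*factor, floor) is a guess at how -pool-factor and -pool-floor are applied. The only behaviour taken from the package docs is that non-positive overrides keep the defaults.

```go
package main

import "fmt"

const (
	defaultPoolFactor = 5  // from the documented default max(k*5, 50)
	defaultPoolFloor  = 50 // from the documented default max(k*5, 50)
)

// poolSize returns how many candidates each leg fetches before fusion.
// Non-positive factor or floor falls back to the default for that value.
func poolSize(k, factor, floor int) int {
	if factor <= 0 {
		factor = defaultPoolFactor
	}
	if floor <= 0 {
		floor = defaultPoolFloor
	}
	if n := k * factor; n > floor {
		return n
	}
	return floor
}

func main() {
	for _, floor := range []int{30, 0, 80} {
		fmt.Printf("k=5 floor=%d -> depth %d\n", floor, poolSize(5, 0, floor))
	}
}
```

Under this reading, changing only the floor at k=5 is one way to reach the 30 / 50 / 80 depths used in the sweep.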

Recency-aware re-ranking (`-rerank`)

memini re-ranks the fused candidates by a composite of relevance, recency, and importance. The recency weight is deliberately light (0.05): a sweep on LongMemEval-S (knowledge-update + temporal-reasoning, q.Now = question date, sessions timestamped from haystack_dates) shows recency is a net win only as a tie-breaker, and actively harmful when over-weighted.

recency weight	R@1 (both cats)	knowledge-update R@1	temporal R@1	MRR
0 (pure RRF)	82.9%	91.0%	78.2%	90.1%
0.05 (default)	83.4%	91.0%	78.9%	90.5%
0.15	83.9%	89.7%	80.5%	90.7%
0.25	83.4%	87.2%	81.2%	90.4%

At 0.05 the re-ranker is +0.5pp R@1 / +0.4pp MRR over pure RRF with no knowledge-update cost, and recall@5 is identical across all weights (the re-rank only reorders within the top results). The steep RRF decay made the composite far more robust to the recency weight than the flat rrfK=60 decay was (where 0.15+ buried correct-but-older memories); the default stays at the conservative 0.05 since the gains beyond it are within noise.

Temporal targeting (`temporal0.40`)

Recency weighting trades off against itself: raising it helps temporal-reasoning (78.2→81.2% R@1) but hurts knowledge-update (91.0→87.2%), whose answers aren't necessarily recent. Temporal targeting avoids that: when a query names a relative time ("three weeks ago"), it computes target = now − offset and boosts candidates dated near that point, not near now. It only fires on temporal queries, so other categories are unaffected.

Strategy	all R@1	knowledge-update R@1	temporal-reasoning R@1	MRR
recency 0.05 (prior default)	83.4%	91.0%	78.9%	90.5%
recency 0.25	83.4%	87.2%	81.2%	90.4%
temporal 0.40	85.3%	91.0%	82.0%	91.5%

Temporal targeting is +1.9pp R@1 overall over the recency default and beats even the heaviest recency weight on temporal-reasoning without the knowledge-update regression, so it ships enabled in production (MEMINI_TEMPORAL_BOOST=0.40; 0 disables it). The no-LLM regex extractor only catches templated phrasing; an LLM anchor extractor (plugging into the same search.AnchorExtractor interface) can resolve looser references and is the intended with-LLM tier.
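A minimal sketch of the target = now − offset idea, assuming a regex that only understands "N days/weeks/months/years ago". It does not reproduce memini's extractor or the search.AnchorExtractor interface: the names anchor, proximity, and day, the word list, and the distance-decay shape are all hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"time"
)

// relTime matches templated phrases such as "three weeks ago" or "10 days ago".
var relTime = regexp.MustCompile(`(?i)\b(\d+|an|a|one|two|three|four|five|six|seven|eight|nine|ten)\s+(day|week|month|year)s?\s+ago\b`)

var numberWords = map[string]int{
	"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
	"six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

// anchor returns the point in time a relative phrase in query refers to,
// measured back from now. ok is false when no phrase is recognised.
func anchor(query string, now time.Time) (target time.Time, ok bool) {
	m := relTime.FindStringSubmatch(query)
	if m == nil {
		return time.Time{}, false
	}
	n, known := numberWords[strings.ToLower(m[1])]
	if !known {
		var err error
		if n, err = strconv.Atoi(m[1]); err != nil {
			return time.Time{}, false
		}
	}
	switch strings.ToLower(m[2]) {
	case "day":
		return now.AddDate(0, 0, -n), true
	case "week":
		return now.AddDate(0, 0, -7*n), true
	case "month":
		return now.AddDate(0, -n, 0), true
	default: // "year"
		return now.AddDate(-n, 0, 0), true
	}
}

// proximity is 1 when t equals target and decays with distance in days.
// The decay shape and scaleDays are illustrative, not memini's formula.
func proximity(t, target time.Time, scaleDays float64) float64 {
	days := t.Sub(target).Hours() / 24
	if days < 0 {
		days = -days
	}
	return 1 / (1 + days/scaleDays)
}

// day parses a YYYY-MM-DD date in UTC and panics on malformed input.
func day(s string) time.Time {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := day("2026-06-13")
	target, ok := anchor("What did I tell you three weeks ago?", now)
	if !ok {
		fmt.Println("no temporal anchor")
		return
	}
	fmt.Println("target:", target.Format("2006-01-02"))
	// A candidate dated near the target gets a larger boost than one dated near now.
	fmt.Println(proximity(day("2026-05-24"), target, 30) > proximity(now, target, 30))
}
```

A re-ranker could then add boost × proximity to a candidate's score only when anchor reports ok, which keeps non-temporal queries untouched.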

Held-out split (`-holdout`)

To avoid overfitting tuning decisions to the full benchmark, -holdout splits LongMemEval deterministically by load order: every 10th question is held (50/500), the rest are tune (450/500). Sweep parameters on -holdout tune, then report the final number on -holdout held (unseen). Default all runs the full set. Results files are suffixed (longmemeval-held.json) so splits don't overwrite each other.
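The split rule is simple enough to sketch. The helper name holdout is hypothetical, and which position in each block of ten counts as "the 10th" (here the index where i%10 == 9) is an assumption; the source only states that every 10th question in load order is held.

```go
package main

import "fmt"

// holdout selects a deterministic part of qs by load order: "held" keeps
// every 10th element, "tune" keeps the rest, and any other value returns qs.
func holdout[T any](qs []T, part string) []T {
	if part != "held" && part != "tune" {
		return qs
	}
	var out []T
	for i, q := range qs {
		isHeld := i%10 == 9 // assumption: the last of each block of ten
		if isHeld == (part == "held") {
			out = append(out, q)
		}
	}
	return out
}

func main() {
	questions := make([]int, 500)
	for i := range questions {
		questions[i] = i
	}
	fmt.Println(len(holdout(questions, "tune")), len(holdout(questions, "held")), len(holdout(questions, "all")))
}
```

Because the split depends only on load order, re-running a sweep on tune and a final pass on held always sees the same questions.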

Measured (memini-hybrid, all-MiniLM-L6-v2 — the parameters were swept on tune, not held):

Split	Questions	R@5	R@10	MRR
`all` (full)	500	98.4	99.2	93.0
`tune`	450	98.2	99.1	93.0
`held`	50	100.0	100.0	93.5

The held split does not regress against tune, so the tuning choices generalize (no tuned-to-test inflation). Per-category R@5:

Category	`tune` (450)	`held` (50)
knowledge-update	100.0	100.0
multi-session	99.2	100.0
single-session-assistant	100.0	100.0
single-session-user	96.8	100.0
temporal-reasoning	98.3	100.0
single-session-preference	88.9	100.0

Read the per-category numbers off tune (450 questions); it shows the real headroom is single-session-preference (88.9% R@5). On held each category is only 2–13 questions, so its across-the-board 100% is small-sample, not a separate claim of perfection.

Session-doc construction (`-session-doc`)

LongMemEval sessions are embedded as one document per session; -session-doc controls what text that document contains, to measure the vector leg's sensitivity to document shape:

full (default) — "role: content" for every turn.
user-only — only the user turns, no role prefixes. Assistant turns dilute the embedding for user-question recall; this is the shape MemPalace reports 96.6% R@5 vector-only with on the same MiniLM model.
dated — full prefixed with the session date, giving temporal questions a textual anchor embeddings would otherwise ignore.
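The three shapes can be illustrated with a small renderer. This is a sketch, not the loader's code: the Turn type and render function are hypothetical, and the line separator for user-only documents and the exact placement of the date line are assumptions. Only the "role: content" shape for full comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// Turn is one message in a session (hypothetical type for this sketch).
type Turn struct {
	Role    string
	Content string
}

// render builds the text of one session document for the given mode.
// Unknown modes fall back to the "full" shape.
func render(mode, date string, turns []Turn) string {
	var b strings.Builder
	if mode == "dated" {
		// Assumption: the date goes on its own leading line.
		b.WriteString(date)
		b.WriteByte('\n')
	}
	for _, t := range turns {
		if mode == "user-only" {
			if t.Role == "user" {
				b.WriteString(t.Content)
				b.WriteByte('\n')
			}
			continue
		}
		b.WriteString(t.Role + ": " + t.Content + "\n")
	}
	return b.String()
}

func main() {
	turns := []Turn{
		{Role: "user", Content: "I adopted a cat named Miso."},
		{Role: "assistant", Content: "Congratulations!"},
	}
	for _, mode := range []string{"full", "user-only", "dated"} {
		fmt.Printf("--- %s ---\n%s", mode, render(mode, "2023/05/20", turns))
	}
}
```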

Compare the vector row's recall_any@5 across modes (cached embeddings make the sweep cheap); the keyword and hybrid rows shift too but the vector leg is the target.

memini hybrid per-category (all-MiniLM, recall_any@10): multi-session 100%, knowledge-update 100%, single-session-user 98.6%, single-session-assistant 98.2%, temporal-reasoning 97.0%, single-session-preference 96.7%.

Rerank tier — cross-encoder vs LLM (`-rerank-url` / `-llm-rerank`)

The read-side rerank reorders the top of the production candidate order. The bench drives either backend through the same comparison (one reranker call per question — use -limit):

# cross-encoder (fast; e.g. Qwen3-Reranker-0.6B via llama-server --rerank):
go run ./cmd/bench -suite locomo -data ./locomo.json -rerank-url http://localhost:8002/v1 -rerank-model qwen3-reranker-0.6b -limit 100 -k 5,10
# LLM reranker (slow; MEMINI_LLM_*):
go run ./cmd/bench -suite locomo -data ./locomo.json -llm-rerank -limit 100 -k 5,10

Measured on all-MiniLM-L6-v2 (cross-encoder = Qwen3-Reranker-0.6B, LLM = Qwen3.5-9B), recall_any@5 / @10 / MRR:

Config	LongMemEval (session)	LoCoMo turn-level	added p50
hybrid (base)	98.4 / 99.2 / 93.0	59.7 / 69.9 / 42.4	—
+ cross-encoder	98.4 / 99.2 / 93.1	70.9 / 75.0 / 59.8	~20–230 ms
+ LLM rerank	98.4 / 99.2 / 93.0	74.4 / 76.5 / 67.4	~350–420 ms

Reranking is a no-op at recall ceiling (session-level) and a big win where recall has headroom (turn-level: +11pp R@5 / +17pp MRR for the cross-encoder, +15pp / +25pp for the LLM). The cross-encoder captures most of the LLM's lift at a fraction of the latency with no chat model — the recommended production rerank (MEMINI_RERANK=<url>); the LLM tier (MEMINI_RERANK=llm) buys the last points if you already run one.

LoCoMo — end-to-end QA accuracy (LLM-judge)

The metric mem0/Letta publish: retrieve → generate an answer → an LLM judges it against the gold answer. memini's number uses a fast instruct reader+judge (Llama-3.3-70B-Instruct); the competitor numbers use their own readers/judges, so this is directional.

System	LoCoMo QA accuracy	Source
memini (hybrid retrieval + instruct reader)	full run pending	measured
Letta / MemGPT	83.2%	published
Mem0	68.5%	published

Sources: agentmemory COMPARISON.md/LONGMEMEVAL.md; LongMemEval (arXiv 2410.10813); LoCoMo (snap-stanford.github.io/LoCoMo); mem0.ai; letta.com.

Metrics

Recall@K — fraction of questions whose gold memory appears in the top K.
MRR — mean reciprocal rank of the first gold hit.
p50/p95 — recall latency; ingest — total ingest time.
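As a worked illustration of the two quality metrics, the sketch below scores a list of per-question results. The run type and function names are hypothetical, and computing MRR over the whole retrieved list (rather than truncating at K) is an assumption about the harness.

```go
package main

import "fmt"

// run is one question's outcome: the ranked ids returned and the gold ids.
type run struct {
	retrieved []string
	gold      []string
}

// firstGoldRank returns the 1-based rank of the first gold id in retrieved,
// or 0 when no gold id was retrieved.
func firstGoldRank(retrieved, gold []string) int {
	want := make(map[string]bool, len(gold))
	for _, g := range gold {
		want[g] = true
	}
	for i, id := range retrieved {
		if want[id] {
			return i + 1
		}
	}
	return 0
}

// score returns recall_any@k (share of questions with a gold id in the top k)
// and MRR (mean of 1/rank of the first gold hit, 0 when there is none).
func score(runs []run, k int) (recall, mrr float64) {
	if len(runs) == 0 {
		return 0, 0
	}
	for _, r := range runs {
		rank := firstGoldRank(r.retrieved, r.gold)
		if rank == 0 {
			continue
		}
		if rank <= k {
			recall++
		}
		mrr += 1 / float64(rank)
	}
	n := float64(len(runs))
	return recall / n, mrr / n
}

func main() {
	runs := []run{
		{retrieved: []string{"a", "b", "c"}, gold: []string{"b"}}, // first hit at rank 2
		{retrieved: []string{"x", "y", "z"}, gold: []string{"q"}}, // miss
	}
	for _, k := range []int{1, 2} {
		recall, mrr := score(runs, k)
		fmt.Printf("k=%d recall_any=%.2f mrr=%.2f\n", k, recall, mrr)
	}
}
```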

Output is a Markdown table (stdout) plus JSON under bench/results/.

What it compares today

Three memini retrieval strategies over the same ingested store, to show the value of hybrid fusion:

System	Retrieval
`memini-hybrid`	vector + keyword, score fusion (production path)
`memini-vector`	dense vector only
`memini-keyword`	BM25 keyword only

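memini-hybrid is expected to match or beat either single strategy. The Full results table records a measured exception on LoCoMo session-level, where keyword edges hybrid (92.6 vs 90.9 R@5).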

Datasets

sample — committed at bench/data/sample.json, runs fully offline.
Normalized schema (-suite file) — {name, items:[{id,content}], questions:[{query,gold:[id]}]}.
LongMemEval / LoCoMo — loaders map the published JSON shapes to the normalized schema (each session/turn becomes an item; answer/evidence ids become gold). Download the datasets and pass -data.
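To make the normalized schema concrete, here is a sketch that decodes a tiny dataset and checks that every gold id refers to an item. The struct shapes follow the schema line above; the parse function and its gold-id validation are additions for illustration, and whether LoadFile performs such a check is not stated in the source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Item struct {
	ID      string `json:"id"`
	Content string `json:"content"`
}

type Question struct {
	Query string   `json:"query"`
	Gold  []string `json:"gold"`
}

type Dataset struct {
	Name      string     `json:"name"`
	Items     []Item     `json:"items"`
	Questions []Question `json:"questions"`
}

// parse decodes a normalized dataset and rejects gold ids that match no item.
func parse(data []byte) (*Dataset, error) {
	var ds Dataset
	if err := json.Unmarshal(data, &ds); err != nil {
		return nil, fmt.Errorf("decode dataset: %w", err)
	}
	known := make(map[string]bool, len(ds.Items))
	for _, it := range ds.Items {
		known[it.ID] = true
	}
	for _, q := range ds.Questions {
		for _, g := range q.Gold {
			if !known[g] {
				return nil, fmt.Errorf("question %q: gold id %q is not an item", q.Query, g)
			}
		}
	}
	return &ds, nil
}

// example is a made-up two-item dataset in the normalized shape.
const example = `{
  "name": "tiny",
  "items": [
    {"id": "m1", "content": "Alice moved to Lyon in March."},
    {"id": "m2", "content": "Alice prefers green tea."}
  ],
  "questions": [
    {"query": "Where does Alice live?", "gold": ["m1"]}
  ]
}`

func main() {
	ds, err := parse([]byte(example))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %d items, %d questions\n", ds.Name, len(ds.Items), len(ds.Questions))
}
```

A file in this shape is what -suite file expects to receive through -data.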

Recall@K on LongMemEval/LoCoMo is easy to overfit — treat scores as directional.

External baselines

bench.System is the extension point. To compare against mem0, Zep/Graphiti, Letta, Cognee, agentmemory, or supermemory, implement System (Name / Ingest / Recall) over each service's API and add it to the run list in cmd/bench. These require the respective services/keys and are intentionally not vendored here.
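A baseline adapter only has to satisfy the three-method interface. The sketch below redeclares Item and System locally so it compiles on its own (Item omits the Time field for brevity), and uses a toy in-memory token-overlap matcher in place of a real service client. The overlap type and the demo helper are hypothetical; a real adapter would call the external system's API inside Ingest and Recall.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"strings"
)

// Local copies of the documented shapes, so the sketch is self-contained.
type Item struct {
	ID      string
	Content string
	Group   string
}

type System interface {
	Name() string
	Ingest(ctx context.Context, items []Item) error
	Recall(ctx context.Context, group, query string, k int) ([]string, error)
}

// overlap is a toy baseline: it ranks items by shared lowercase tokens.
type overlap struct {
	byGroup map[string][]Item
}

var _ System = (*overlap)(nil) // compile-time interface check

func (o *overlap) Name() string { return "token-overlap" }

func (o *overlap) Ingest(ctx context.Context, items []Item) error {
	if o.byGroup == nil {
		o.byGroup = make(map[string][]Item)
	}
	for _, it := range items {
		o.byGroup[it.Group] = append(o.byGroup[it.Group], it)
	}
	return nil
}

func (o *overlap) Recall(ctx context.Context, group, query string, k int) ([]string, error) {
	want := make(map[string]bool)
	for _, w := range strings.Fields(strings.ToLower(query)) {
		want[w] = true
	}
	type hit struct {
		id string
		n  int
	}
	var hits []hit
	for _, it := range o.byGroup[group] {
		n := 0
		for _, w := range strings.Fields(strings.ToLower(it.Content)) {
			if want[w] {
				n++
			}
		}
		if n > 0 {
			hits = append(hits, hit{id: it.ID, n: n})
		}
	}
	// Highest overlap first; ingest order breaks ties.
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].n > hits[j].n })
	if len(hits) > k {
		hits = hits[:k]
	}
	ids := make([]string, len(hits))
	for i, h := range hits {
		ids[i] = h.id
	}
	return ids, nil
}

// demo ingests three items across two groups and recalls from one of them.
func demo() ([]string, error) {
	ctx := context.Background()
	var sys System = &overlap{}
	err := sys.Ingest(ctx, []Item{
		{ID: "m1", Content: "likes green tea", Group: "g"},
		{ID: "m2", Content: "moved to Lyon in March", Group: "g"},
		{ID: "m3", Content: "moved to Oslo", Group: "other"},
	})
	if err != nil {
		return nil, err
	}
	return sys.Recall(ctx, "g", "moved to which city", 5)
}

func main() {
	ids, err := demo()
	fmt.Println(ids, err)
}
```

The group argument is what keeps m3 out of the result: Recall only searches the namespace the question belongs to.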

Documentation ¶

Overview ¶

Package bench is a retrieval benchmark harness: it ingests a dataset of memories and scores each question's gold retrieval (Recall@K, MRR) and latency. Runs offline on the committed sample with a deterministic local embedder, or against a real endpoint and a converted LongMemEval/LoCoMo set.

Index ¶

func Markdown(results []Result) string
func RerankMarkdown(results []RerankResult, k int) string
type Dataset
type DocMode
type Item
type Question
type RerankResult
- func LLMRerankCompare(ctx context.Context, st store.Store, e embed.Embedder, rr rerank.Reranker, ...) ([]RerankResult, error)
- func RerankCompare(ctx context.Context, st store.Store, e embed.Embedder, ds *Dataset, ...) ([]RerankResult, error)
type Result
- func Run(ctx context.Context, sys System, ds *Dataset, ks []int) ([]Result, error)
type System
- func MeminiSystems(st store.Store, e embed.Embedder, concurrency int, queryPrefix string, ...) []System

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func Markdown ¶

func Markdown(results []Result) string

Markdown renders results that share a K as a comparison table, best first.

func RerankMarkdown ¶

func RerankMarkdown(results []RerankResult, k int) string

RerankMarkdown renders the RRF-vs-composite comparison, grouped by category.

Types ¶

type Dataset ¶

type Dataset struct {
	Name      string     `json:"name"`
	Items     []Item     `json:"items"`
	Questions []Question `json:"questions"`
}

Dataset is a normalized retrieval benchmark.

func LoadFile ¶

func LoadFile(path string) (*Dataset, error)

LoadFile reads a dataset in memini's normalized JSON schema.

func LoadLoCoMo ¶

func LoadLoCoMo(path string) (*Dataset, error)

LoadLoCoMo converts the published LoCoMo file into the normalized Dataset. Each conversation is its own group/namespace (dialogue ids repeat across conversations); each dialogue turn is an item, and each QA's evidence ids are its gold set. Questions without evidence (e.g. adversarial) are skipped.

func LoadLoCoMoSessions ¶ added in v0.0.4

func LoadLoCoMoSessions(path string) (*Dataset, error)

LoadLoCoMoSessions loads LoCoMo at SESSION granularity: each conversation session becomes one document (its turns concatenated), and a question's gold set is the session(s) holding its evidence turns. This matches how session-level memory systems (e.g. MemPalace) score LoCoMo, enabling an apples-to-apples comparison; LoadLoCoMo scores the harder turn granularity.

func LoadLongMemEval ¶

func LoadLongMemEval(path string, mode DocMode) (*Dataset, error)

LoadLongMemEval converts a LongMemEval file: each haystack session becomes an item, and each question's answer_session_ids become its gold set. mode selects how each session is rendered into its item (see DocMode).

func Poison ¶ added in v0.0.11

func Poison(ds *Dataset, perGroup int, filler string) *Dataset

Poison returns a copy of ds with perGroup debris items added to every group that has questions, simulating a low-quality bulk import (e.g. a mem0 export of restatements) collapsed into the namespace. The debris shares one content template so a dedup pass clusters and collapses it, modelling the realistic "exports are full of near-duplicates" case. Use it to measure the Recall@K drop a poisoned store suffers, and to check that dedup/curation recovers it.

func Sample ¶

func Sample() (*Dataset, error)

Sample returns the committed offline sample dataset.

type DocMode ¶ added in v0.0.4

type DocMode string

DocMode selects how a LongMemEval haystack session is rendered into one embedded item, for the vector-leg document-construction experiment.

const (
	// DocFull renders "role: content\n" for every turn (the production shape).
	DocFull DocMode = "full"
	// DocUserOnly renders only user turns, with no role prefixes (MemPalace's
	// raw mode: assistant turns dilute the vector leg on user-question recall).
	DocUserOnly DocMode = "user-only"
	// DocDated prefixes the full session with its date, so temporal questions
	// have a textual anchor the embedder can see.
	DocDated DocMode = "dated"
)

type Item ¶

type Item struct {
	ID      string    `json:"id"`
	Content string    `json:"content"`
	Group   string    `json:"group,omitempty"`
	Time    time.Time `json:"-"`
}

Item is one memory to ingest; Group scopes it to a namespace, empty falls back to a shared default. Time, when set, is the memory's source timestamp (used to ground recency in the recency-aware re-ranking comparison).

type Question ¶

type Question struct {
	Query    string    `json:"query"`
	Gold     []string  `json:"gold"`
	Group    string    `json:"group,omitempty"`
	Answer   string    `json:"answer,omitempty"`
	Category string    `json:"category,omitempty"`
	Now      time.Time `json:"-"`
}

Question is a query plus the gold memory IDs it should retrieve. Group must match its items; Answer/Category are populated for QA evaluation where available. Now, when set, is the query's reference time (e.g. the question date) — the "now" against which recency is measured.

type RerankResult ¶

type RerankResult struct {
	System    string
	Category  string
	Questions int
	RecallAt1 float64
	RecallAtK float64
	MRR       float64
}

RerankResult is one ranking strategy's score over a question set.

func LLMRerankCompare ¶ added in v0.0.4

func LLMRerankCompare(
	ctx context.Context, st store.Store, e embed.Embedder, rr rerank.Reranker,
	ds *Dataset, k, fetch int, queryPrefix string,
) ([]RerankResult, error)

LLMRerankCompare measures the with-LLM read-side rerank lift on pure retrieval. For each question it builds the production candidate order (hybrid score fusion -> composite re-rank), then re-orders the top `fetch` with an LLM reranker, and scores recall@1/@k and MRR for both. The LLM tier is slow (one chat call per question), so drive it over a subset with cmd/bench -limit.

func RerankCompare ¶

func RerankCompare(
	ctx context.Context, st store.Store, e embed.Embedder, ds *Dataset, cats []string, k int, queryPrefix string,
) ([]RerankResult, error)

RerankCompare isolates the effect of recency-aware re-ranking: it ingests ds (items carry source timestamps), then for each selected question scores the SAME fused candidate set two ways — pure RRF order vs the composite re-ranker using the question's reference time. Reports recall@1, recall@K, and MRR per category and overall, for both strategies. cats empty means all categories.

type Result ¶

type Result struct {
	System      string             `json:"system"`
	Dataset     string             `json:"dataset"`
	K           int                `json:"k"`
	Questions   int                `json:"questions"`
	RecallAtK   float64            `json:"recall_at_k"`
	MRR         float64            `json:"mrr"`
	P50Millis   float64            `json:"p50_ms"`
	P95Millis   float64            `json:"p95_ms"`
	IngestMs    float64            `json:"ingest_ms"`
	PerCategory map[string]float64 `json:"per_category,omitempty"`
}

Result is one system's score on a dataset at a given K.

func Run ¶

func Run(ctx context.Context, sys System, ds *Dataset, ks []int) ([]Result, error)

Run ingests the dataset into a system once, then scores recall_any@K and MRR for every K in ks from a single retrieval pass (retrieving max(ks) per question). Returns one Result per K.

type System ¶

type System interface {
	Name() string
	Ingest(ctx context.Context, items []Item) error
	Recall(ctx context.Context, group, query string, k int) ([]string, error)
}

System is a memory system under test.

func MeminiSystems ¶

func MeminiSystems(
	st store.Store, e embed.Embedder, concurrency int, queryPrefix string, fusionAlpha float64, poolFactor, poolFloor int,
) []System

MeminiSystems returns the hybrid, vector-only, and keyword-only retrieval strategies sharing one ingested store. queryPrefix, when non-empty, is prepended to query embeddings (hybrid and vector legs), matching MEMINI_EMBED_QUERY_PREFIX in production. fusionAlpha < 0 uses RRF; >= 0 uses convex-combination score fusion with that vector weight. poolFactor/poolFloor override hybrid recall's per-leg pool sizing (non-positive keeps defaults).
