bench

package

v0.0.1 (latest) | Published: Jun 10, 2026 | License: AGPL-3.0 | Imports: 16 | Imported by: 0

Repository

github.com/eleboucher/memini

README ¶

memini benchmark harness

A retrieval benchmark: ingest a dataset of memories, then for each question measure how well a system retrieves the gold supporting memories.

mise run bench                 # offline sample, local embedder
go run ./cmd/bench -k 5        # same, explicit K

Against a real embeddings model and a real dataset:

export MEMINI_EMBED_BASE_URL=http://localhost:8081/v1
export MEMINI_EMBED_MODEL=bge-m3 MEMINI_EMBED_DIMS=1024
# Optional: instruction-tuned asymmetric embedders (Qwen3-Embedding, bge) score
# higher when queries carry a retrieval instruction; documents stay bare.
# Measured on Qwen3-Embedding-8B: +6.0pp R@5 on the LongMemEval vector leg
# (91.2% -> 97.2%), +1.0pp MRR on the fused ranking on both datasets.
export MEMINI_EMBED_QUERY_PREFIX=$'Instruct: Given a user query, retrieve relevant memories that answer it\nQuery:'
go run ./cmd/bench -suite longmemeval -data ./longmemeval_s.json -k 5
go run ./cmd/bench -suite locomo      -data ./locomo.json        -k 5

# Isolate the recency-aware re-ranker against pure RRF on the same candidates,
# using each question's date as "now" (needs a timestamped dataset):
go run ./cmd/bench -suite longmemeval -data ./longmemeval_s.json -rerank -k 5

Results: memini vs other memory systems

Unless a row names another embedder, all memini numbers below are measured by this harness against a live all-MiniLM-L6-v2 (384-d) endpoint, the same embedding model agentmemory benchmarks with. Competitor numbers are cited from their own publications: we cannot re-run their systems here, and they use different readers and judges and, in some cases, different embedding models. Treat cross-system rows as directional, not a controlled head-to-head. (This mirrors how agentmemory documents its comparison.)

LongMemEval-S — retrieval `recall_any@K`

Full 500-question LongMemEval-S (~48 sessions/question), scored with the same metric agentmemory reports: does any gold session appear in the top-K retrieved? No LLM is in the loop; this is pure retrieval. The embedding model (all-MiniLM-L6-v2, 384-d) is also identical to the one agentmemory benchmarks with, so the agentmemory rows are the closest to a like-for-like comparison in the table.

Hybrid recall over-fetches a deep candidate pool per leg (max(k*5, 50)) before fusing, so a memory just outside the top-k of both legs can still win. The production Recall path does the same.

Fusion defaults to convex score fusion (MEMINI_FUSION_ALPHA=0.5): each leg's scores are min-max normalized to [0,1] and combined as 0.5·vector + 0.5·keyword. Keeping score magnitude means a memory that one leg ranks far above its runners-up dominates one that is merely middling in both.

A negative alpha falls back to Reciprocal Rank Fusion (RRF). Deep pools then need a steep decay (rrfK=5, not the classic 60), because a flat decay lets both-leg mediocrity outscore single-leg excellence: at rrfK=60, a memory at rank 20 (counting from 0) in both legs scores 2/(60+20) = 0.025, more than the 1/(60+0) ≈ 0.017 earned by a memory ranked first in a single leg. Score fusion gets the same effect from score magnitude directly, and beat RRF on 3 of 4 model×dataset cells (and on MRR in all 4).
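The two fusion rules described above can be sketched as follows. This is a minimal illustration, not memini's implementation: the function names, the handling of a constant-score leg, and treating an ID missing from a leg as contributing 0 are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// minMax rescales one leg's raw scores to [0,1]. A leg whose scores are all
// equal maps to 1 (an assumption; the README does not cover that case).
func minMax(scores map[string]float64) map[string]float64 {
	out := make(map[string]float64, len(scores))
	first := true
	var lo, hi float64
	for _, s := range scores {
		if first || s < lo {
			lo = s
		}
		if first || s > hi {
			hi = s
		}
		first = false
	}
	for id, s := range scores {
		if hi == lo {
			out[id] = 1
			continue
		}
		out[id] = (s - lo) / (hi - lo)
	}
	return out
}

// scoreFuse is convex score fusion: alpha*vector + (1-alpha)*keyword over
// normalized scores. An ID missing from a leg adds nothing for that leg.
func scoreFuse(vector, keyword map[string]float64, alpha float64) map[string]float64 {
	fused := make(map[string]float64)
	for id, s := range minMax(vector) {
		fused[id] += alpha * s
	}
	for id, s := range minMax(keyword) {
		fused[id] += (1 - alpha) * s
	}
	return fused
}

// rrf is Reciprocal Rank Fusion with 0-based ranks, matching the
// 1/(60+0) arithmetic in the text.
func rrf(legs [][]string, rrfK float64) map[string]float64 {
	fused := make(map[string]float64)
	for _, leg := range legs {
		for rank, id := range leg {
			fused[id] += 1 / (rrfK + float64(rank))
		}
	}
	return fused
}

// ranked returns IDs by descending score, ties broken by ID.
func ranked(scores map[string]float64) []string {
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j]
	})
	return ids
}

// demoLegs builds two 21-deep legs: "star" is first in the vector leg only,
// "meh" sits at rank 20 in both.
func demoLegs() (vector, keyword []string) {
	vector = append(vector, "star")
	for i := 1; i < 20; i++ {
		vector = append(vector, fmt.Sprintf("v%d", i))
	}
	for i := 0; i < 20; i++ {
		keyword = append(keyword, fmt.Sprintf("k%d", i))
	}
	return append(vector, "meh"), append(keyword, "meh")
}

func main() {
	v, k := demoLegs()
	for _, rrfK := range []float64{60, 5} {
		f := rrf([][]string{v, k}, rrfK)
		fmt.Printf("rrfK=%v: meh=%.4f star=%.4f\n", rrfK, f["meh"], f["star"])
	}
	fused := scoreFuse(
		map[string]float64{"a": 0.9, "b": 0.5, "c": 0.4},
		map[string]float64{"b": 6, "c": 5, "d": 2},
		0.5,
	)
	fmt.Println(ranked(fused)) // [b a c d]
}
```

With rrfK=60 the both-leg "meh" entry (0.0250) outranks the single-leg "star" (0.0167); with rrfK=5 the order flips (0.0800 against 0.2000), which is the steep-decay effect the paragraph describes.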

System	Embedding model	R@5	R@10	Source
memini — hybrid (score)	all-MiniLM-L6-v2 (384-d)	98.4%	99.4%	measured
memini — keyword (Porter BM25)	—	97.6%	99.0%	measured
memini — vector	all-MiniLM-L6-v2	91.8%	96.6%	measured
agentmemory — BM25 + Vector	all-MiniLM-L6-v2	95.2%	98.6%	published
agentmemory — BM25 only	—	86.2%	94.6%	published
MemPalace (vector only)	larger model	~96.6%	—	self-reported

On the same model/dataset/metric (full 500 questions), memini hybrid beats agentmemory at R@5 (98.4% vs 95.2%), R@10 (99.4% vs 98.6%), and MRR (92.3% vs 88.2%). memini's keyword leg is +11.4pp over agentmemory's BM25-only (97.6% vs 86.2%) thanks to Porter stemming, and hybrid fusion now beats either leg alone. Relative to fetching only k per leg with the classic rrfK=60, the deep-pool + score fusion is worth +2.0pp R@5 / +1.0pp R@10.

LoCoMo — retrieval `recall_any@K`

LoCoMo retrieval at dialogue-turn granularity (1,982 questions over 10 long conversations, gold = exact evidence turns among ~590 turns/conversation) — a much harder target than LongMemEval's session granularity, and the regime where flat-decay RRF over deep pools degrades badly.

System (all-MiniLM-L6-v2)	R@5	R@10
memini — hybrid (score)	59.8%	69.8%
memini — keyword (Porter BM25)	58.7%	67.1%
memini — vector	41.5%	52.1%

No published turn-level retrieval baselines exist to compare against (mem0 / Letta report LLM-judged QA accuracy, below). LoCoMo with MiniLM is the one model×dataset cell where the default score fusion is edged by RRF (60.1% R@5 / 71.0% R@10, against 59.8% / 69.8%): when the vector leg is weak (MiniLM alone scores only 41.5% R@5 here), giving it an equal-weight normalized vote hurts, whereas RRF's rank-only vote is more robust. Score fusion still wins this cell on MRR and wins outright on every cell with a stronger embedder, so it is the default, and MEMINI_FUSION_ALPHA=-1 selects RRF for weak-vector deployments. (Ablation: rrfK=60 over the same deep pools scored just 52.8% R@5, below the keyword leg alone; both score fusion and rrfK=5 fix that.)

Pool-depth robustness (`-pool-factor` / `-pool-floor`)

Min-max normalization could in principle be fragile to pool depth (the score at the bottom of the pool sets each leg's zero point), so score fusion was swept at per-leg depths 30 / 50 / 80 on both datasets and both embedders (hybrid R@5 / R@10 / MRR):

cell	depth 30	depth 50 (default)	depth 80
LME · MiniLM	97.8 / 99.4 / 92.0	98.4 / 99.4 / 92.3	98.6 / 99.4 / 92.6
LME · Qwen3+prefix	98.8 / 99.4 / 94.5	98.8 / 99.6 / 94.6	98.8 / 99.6 / 94.6
LoCoMo · MiniLM	60.0 / 70.1 / 42.1	59.8 / 69.8 / 42.6	59.3 / 69.6 / 42.7
LoCoMo · Qwen3+prefix	70.1 / 77.9 / 52.1	70.1 / 78.5 / 52.4	70.1 / 78.7 / 52.5

Relative to the default depth, hybrid R@5 moves by at most 0.6pp across a 2.7× depth range (30 to 80), with no tail collapse. On MiniLM the two datasets drift in opposite directions (deeper pools help session-granularity LongMemEval slightly and hurt turn-granularity LoCoMo slightly), while the Qwen3 rows hold R@5 constant at every depth, so the default max(k*5, 50) sits at the crossover.
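The per-leg pool sizing rule is small enough to write out. The sketch below assumes a helper named poolDepth (hypothetical, not part of the package API) and follows the documented behavior of poolFactor/poolFloor, where non-positive values keep the defaults.

```go
package main

import "fmt"

// Defaults taken from the README's max(k*5, 50).
const (
	defaultPoolFactor = 5
	defaultPoolFloor  = 50
)

// poolDepth is a hypothetical helper: it returns how many candidates each
// leg fetches before fusion. Non-positive overrides keep the defaults.
func poolDepth(k, factor, floor int) int {
	if factor <= 0 {
		factor = defaultPoolFactor
	}
	if floor <= 0 {
		floor = defaultPoolFloor
	}
	if d := k * factor; d > floor {
		return d
	}
	return floor
}

func main() {
	fmt.Println(poolDepth(5, 0, 0))  // 50: the floor dominates at k=5
	fmt.Println(poolDepth(20, 0, 0)) // 100: k*5 dominates at k=20
	fmt.Println(poolDepth(5, 0, 30)) // 30
	fmt.Println(poolDepth(5, 0, 80)) // 80
}
```

At k=5 the floor alone decides the depth, so floors of 30, 50, and 80 would yield the three depths in the table.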

Recency-aware re-ranking (`-rerank`)

memini re-ranks the fused candidates by a composite of relevance, recency, and importance. The recency weight is deliberately light (0.05): a sweep on LongMemEval-S (knowledge-update + temporal-reasoning, q.Now = question date, sessions timestamped from haystack_dates) shows recency is a net win only as a tie-breaker, and actively harmful when over-weighted.

recency weight	R@1 (both cats)	knowledge-update R@1	temporal R@1	MRR
0 (pure RRF)	82.9%	91.0%	78.2%	90.1%
0.05 (default)	83.4%	91.0%	78.9%	90.5%
0.15	83.9%	89.7%	80.5%	90.7%
0.25	83.4%	87.2%	81.2%	90.4%

At 0.05 the re-ranker is +0.5pp R@1 / +0.4pp MRR over pure RRF with no knowledge-update cost, and recall@5 is identical across all weights (the re-rank only reorders within the top results). The steep RRF decay made the composite far more robust to the recency weight than the flat rrfK=60 decay was (where 0.15+ buried correct-but-older memories); the default stays at the conservative 0.05 since the gains beyond it are within noise.
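To make the trade-off concrete, here is a hedged sketch of a composite re-ranker. The README states only that the composite mixes relevance, recency, and importance and that the recency weight is 0.05. The convex blend, the exponential decay, the 30-day half-life, and every name below are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

type candidate struct {
	ID         string
	Relevance  float64   // fused retrieval score, assumed in [0,1]
	Importance float64   // assumed in [0,1]
	Time       time.Time // source timestamp of the memory
}

// halfLife is an illustrative assumption, not a documented memini value.
const halfLife = 30 * 24 * time.Hour

// recency decays from 1 (written at "now") toward 0, halving every halfLife.
func recency(now, t time.Time) float64 {
	age := now.Sub(t)
	if age < 0 {
		age = 0
	}
	return math.Exp2(-float64(age) / float64(halfLife))
}

// composite is an assumed convex blend: whatever weight recency and
// importance take is removed from relevance.
func composite(c candidate, now time.Time, wRecency, wImportance float64) float64 {
	wRelevance := 1 - wRecency - wImportance
	return wRelevance*c.Relevance + wRecency*recency(now, c.Time) + wImportance*c.Importance
}

// rerank reorders a copy of the candidates by descending composite score.
func rerank(cands []candidate, now time.Time, wRecency, wImportance float64) []candidate {
	out := append([]candidate(nil), cands...)
	sort.SliceStable(out, func(i, j int) bool {
		return composite(out[i], now, wRecency, wImportance) >
			composite(out[j], now, wRecency, wImportance)
	})
	return out
}

// demoNow stands in for the question date used as "now".
func demoNow() time.Time { return time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC) }

func demoCandidates(now time.Time) []candidate {
	return []candidate{
		{ID: "old-but-right", Relevance: 0.80, Time: now.AddDate(-1, 0, 0)},
		{ID: "new-but-weaker", Relevance: 0.70, Time: now.Add(-24 * time.Hour)},
	}
}

func main() {
	now := demoNow()
	for _, w := range []float64{0, 0.05, 0.25} {
		top := rerank(demoCandidates(now), now, w, 0)[0]
		fmt.Printf("recency weight %.2f -> top = %s\n", w, top.ID)
	}
}
```

Under these assumptions the older, more relevant memory stays on top at weights 0 and 0.05 and is displaced at 0.25, the same qualitative pattern as the knowledge-update column in the sweep.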

memini hybrid per-category (all-MiniLM, recall_any@10): multi-session 100%, knowledge-update 100%, single-session-user 98.6%, single-session-assistant 98.2%, temporal-reasoning 97.0%, single-session-preference 96.7%.

LoCoMo — end-to-end QA accuracy (LLM-judge)

The metric mem0/Letta publish: retrieve → generate an answer → an LLM judges it against the gold answer. memini's number uses a fast instruct reader+judge (Llama-3.3-70B-Instruct); the competitor numbers use their own readers/judges, so this is directional.

System	LoCoMo QA accuracy	Source
memini (hybrid retrieval + instruct reader)	full run pending	measured
Letta / MemGPT	83.2%	published
Mem0	68.5%	published

Sources: agentmemory COMPARISON.md/LONGMEMEVAL.md; LongMemEval (arXiv 2410.10813); LoCoMo (snap-stanford.github.io/LoCoMo); mem0.ai; letta.com.

Metrics

Recall@K (recall_any@K): fraction of questions with at least one gold memory in the top K.
MRR: mean reciprocal rank of the first gold hit.
p50/p95: median and 95th-percentile recall latency, in milliseconds; ingest: total ingest time, in milliseconds.
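The two quality metrics reduce to one number per question: the rank of the first gold hit. The sketch below is illustrative rather than the harness's code. The function names are hypothetical, and it assumes that a question with no gold hit contributes 0 to MRR and that MRR is taken over the whole retrieved list.

```go
package main

import "fmt"

// firstHitRank returns the 1-based rank of the first retrieved ID that is in
// gold, or 0 when none of the retrieved IDs is gold.
func firstHitRank(retrieved, gold []string) int {
	isGold := make(map[string]bool, len(gold))
	for _, id := range gold {
		isGold[id] = true
	}
	for i, id := range retrieved {
		if isGold[id] {
			return i + 1
		}
	}
	return 0
}

// recallAnyAtK is the fraction of questions whose first gold hit is within
// the top k.
func recallAnyAtK(ranks []int, k int) float64 {
	if len(ranks) == 0 {
		return 0
	}
	hits := 0
	for _, r := range ranks {
		if r > 0 && r <= k {
			hits++
		}
	}
	return float64(hits) / float64(len(ranks))
}

// mrr averages 1/rank over all questions; a miss (rank 0) adds nothing.
func mrr(ranks []int) float64 {
	if len(ranks) == 0 {
		return 0
	}
	sum := 0.0
	for _, r := range ranks {
		if r > 0 {
			sum += 1 / float64(r)
		}
	}
	return sum / float64(len(ranks))
}

// demoRanks scores three toy questions: hits at rank 2 and rank 3, one miss.
func demoRanks() []int {
	return []int{
		firstHitRank([]string{"m3", "m1", "m7"}, []string{"m1"}),
		firstHitRank([]string{"m2", "m5", "m9"}, []string{"m9", "m4"}),
		firstHitRank([]string{"m8", "m6"}, []string{"m0"}),
	}
}

func main() {
	ranks := demoRanks()
	fmt.Printf("R@1=%.3f R@2=%.3f R@3=%.3f MRR=%.3f\n",
		recallAnyAtK(ranks, 1), recallAnyAtK(ranks, 2), recallAnyAtK(ranks, 3), mrr(ranks))
	// R@1=0.000 R@2=0.333 R@3=0.667 MRR=0.278
}
```

Because every K is derived from the same first-hit ranks, one retrieval pass at the largest K is enough to report all of them.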

Output is a Markdown table (stdout) plus JSON under bench/results/.

What it compares today

Three memini retrieval strategies over the same ingested store, to show the value of hybrid fusion:

System	Retrieval
`memini-hybrid`	vector + keyword, score fusion (production path)
`memini-vector`	dense vector only
`memini-keyword`	BM25 keyword only

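memini-hybrid is expected to score at or above both single strategies. A hybrid score below either leg signals a fusion problem, as in the rrfK=60 ablation above, where hybrid R@5 (52.8%) fell below the keyword leg alone (58.7%).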

Datasets

sample — committed at bench/data/sample.json, runs fully offline.
Normalized schema (-suite file) — {name, items:[{id,content}], questions:[{query,gold:[id]}]}.
LongMemEval / LoCoMo — loaders map the published JSON shapes to the normalized schema (each session/turn becomes an item; answer/evidence ids become gold). Download the datasets and pass -data.
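As an illustration of the normalized schema, the following sketch decodes a tiny dataset into local mirrors of the documented types and checks that every gold id points at an ingested item. The parse helper and its validation step are assumptions added for the example; the source does not say whether LoadFile validates its input.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Local mirrors of the documented types, trimmed to the schema fields.
type Item struct {
	ID      string    `json:"id"`
	Content string    `json:"content"`
	Group   string    `json:"group,omitempty"`
	Time    time.Time `json:"-"` // the "-" tag means JSON never sets this
}

type Question struct {
	Query string   `json:"query"`
	Gold  []string `json:"gold"`
	Group string   `json:"group,omitempty"`
}

type Dataset struct {
	Name      string     `json:"name"`
	Items     []Item     `json:"items"`
	Questions []Question `json:"questions"`
}

// parse is a hypothetical helper: decode, then verify each gold id exists
// among the items of the question's group.
func parse(raw []byte) (*Dataset, error) {
	var ds Dataset
	if err := json.Unmarshal(raw, &ds); err != nil {
		return nil, fmt.Errorf("decode dataset: %w", err)
	}
	known := make(map[[2]string]bool, len(ds.Items))
	for _, it := range ds.Items {
		known[[2]string{it.Group, it.ID}] = true
	}
	for i, q := range ds.Questions {
		if len(q.Gold) == 0 {
			return nil, fmt.Errorf("question %d has no gold ids", i)
		}
		for _, id := range q.Gold {
			if !known[[2]string{q.Group, id}] {
				return nil, fmt.Errorf("question %d: gold id %q not found in group %q", i, id, q.Group)
			}
		}
	}
	return &ds, nil
}

const sampleJSON = `{
  "name": "tiny",
  "items": [
    {"id": "m1", "content": "Alice adopted a cat named Miso."},
    {"id": "m2", "content": "Alice moved to Lyon in March."}
  ],
  "questions": [
    {"query": "What is Alice's cat called?", "gold": ["m1"]}
  ]
}`

func main() {
	ds, err := parse([]byte(sampleJSON))
	if err != nil {
		fmt.Println("invalid dataset:", err)
		return
	}
	fmt.Printf("%s: %d items, %d questions, timestamp set: %v\n",
		ds.Name, len(ds.Items), len(ds.Questions), !ds.Items[0].Time.IsZero())
}
```

Since Time and Now carry the `json:"-"` tag, a dataset loaded from this schema has no timestamps; per the type docs those fields are meant for the timestamped re-ranking comparison.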

Recall@K on LongMemEval/LoCoMo is easy to overfit — treat scores as directional.

External baselines

bench.System is the extension point. To compare against mem0, Zep/Graphiti, Letta, Cognee, agentmemory, or supermemory, implement System (Name / Ingest / Recall) over each service's API and add it to the run list in cmd/bench. These require the respective services/keys and are intentionally not vendored here.
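A minimal sketch of what such an adapter looks like, using an in-memory term-overlap scorer in place of a remote service. The Item and System declarations are local copies of the documented ones (Item is trimmed) so the example compiles on its own. The overlapSystem type, its scoring, and the recallIDs helper are hypothetical, and the sketch assumes Recall returns item IDs, since that is what gold sets contain.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"strings"
)

// Local copies of the documented declarations (Item is trimmed).
type Item struct {
	ID, Content, Group string
}

type System interface {
	Name() string
	Ingest(ctx context.Context, items []Item) error
	Recall(ctx context.Context, group, query string, k int) ([]string, error)
}

// overlapSystem is a toy baseline: it ranks items by how many distinct query
// terms they contain, and keeps each group in its own namespace.
type overlapSystem struct {
	byGroup map[string][]Item
}

var _ System = (*overlapSystem)(nil) // compile-time interface check

func (s *overlapSystem) Name() string { return "term-overlap" }

func (s *overlapSystem) Ingest(_ context.Context, items []Item) error {
	if s.byGroup == nil {
		s.byGroup = make(map[string][]Item)
	}
	for _, it := range items {
		s.byGroup[it.Group] = append(s.byGroup[it.Group], it)
	}
	return nil
}

func (s *overlapSystem) Recall(_ context.Context, group, query string, k int) ([]string, error) {
	terms := make(map[string]bool)
	for _, t := range strings.Fields(strings.ToLower(query)) {
		terms[t] = true
	}
	type hit struct {
		id    string
		score int
	}
	var hits []hit
	for _, it := range s.byGroup[group] {
		seen := make(map[string]bool)
		for _, t := range strings.Fields(strings.ToLower(it.Content)) {
			if terms[t] {
				seen[t] = true
			}
		}
		if len(seen) > 0 {
			hits = append(hits, hit{it.ID, len(seen)})
		}
	}
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].score > hits[j].score })
	if k < 0 {
		k = 0
	}
	if len(hits) > k {
		hits = hits[:k]
	}
	ids := make([]string, len(hits))
	for i, h := range hits {
		ids[i] = h.id
	}
	return ids, nil
}

// demoSystem ingests turns whose ids repeat across two conversations.
func demoSystem() *overlapSystem {
	s := &overlapSystem{}
	_ = s.Ingest(context.Background(), []Item{
		{ID: "D1:3", Content: "Alice adopted a cat named Miso", Group: "conv-a"},
		{ID: "D1:4", Content: "Alice moved to Lyon", Group: "conv-a"},
		{ID: "D1:3", Content: "Bob sold his bike", Group: "conv-b"},
	})
	return s
}

// recallIDs is a small convenience wrapper for the demo.
func recallIDs(s System, group, query string, k int) []string {
	ids, _ := s.Recall(context.Background(), group, query, k)
	return ids
}

func main() {
	s := demoSystem()
	fmt.Println(recallIDs(s, "conv-a", "what is the cat named", 5)) // [D1:3]
	fmt.Println(recallIDs(s, "conv-b", "cat", 5))                   // []
}
```

Keying storage by group matters for LoCoMo-style data, where dialogue ids repeat across conversations; a real adapter would map the group to whatever namespace or user id the target service exposes.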

Documentation ¶

Overview ¶

Package bench is a retrieval benchmark harness: it ingests a dataset of memories, then scores how well each question's gold memories are retrieved (Recall@K, MRR) and how long recall takes. It runs offline on the committed sample with a deterministic local embedder, or against a real endpoint and a converted LongMemEval/LoCoMo set.

Index ¶

func Markdown(results []Result) string
func RerankMarkdown(results []RerankResult, k int) string
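type Dataset
- func LoadFile(path string) (*Dataset, error)
- func LoadLoCoMo(path string) (*Dataset, error)
- func LoadLongMemEval(path string) (*Dataset, error)
- func Sample() (*Dataset, error)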
type Item
type Question
type RerankResult
- func RerankCompare(ctx context.Context, st store.Store, e embed.Embedder, ds *Dataset, ...) ([]RerankResult, error)
type Result
- func Run(ctx context.Context, sys System, ds *Dataset, ks []int) ([]Result, error)
type System
- func MeminiSystems(st store.Store, e embed.Embedder, concurrency int, queryPrefix string, ...) []System

Functions ¶

func Markdown ¶

func Markdown(results []Result) string

Markdown renders results that share a K as a comparison table, best first.

func RerankMarkdown ¶

func RerankMarkdown(results []RerankResult, k int) string

RerankMarkdown renders the RRF-vs-composite comparison, grouped by category.

Types ¶

type Dataset ¶

type Dataset struct {
	Name      string     `json:"name"`
	Items     []Item     `json:"items"`
	Questions []Question `json:"questions"`
}

Dataset is a normalized retrieval benchmark.

func LoadFile ¶

func LoadFile(path string) (*Dataset, error)

LoadFile reads a dataset in memini's normalized JSON schema.

func LoadLoCoMo ¶

func LoadLoCoMo(path string) (*Dataset, error)

LoadLoCoMo converts the published LoCoMo file into the normalized Dataset. Each conversation is its own group/namespace (dialogue ids repeat across conversations); each dialogue turn is an item, and each QA's evidence ids are its gold set. Questions without evidence (e.g. adversarial) are skipped.

func LoadLongMemEval ¶

func LoadLongMemEval(path string) (*Dataset, error)

LoadLongMemEval converts a LongMemEval file: each haystack session becomes an item, each question's answer_session_ids becomes its gold set.

func Sample ¶

func Sample() (*Dataset, error)

Sample returns the committed offline sample dataset.

type Item ¶

type Item struct {
	ID      string    `json:"id"`
	Content string    `json:"content"`
	Group   string    `json:"group,omitempty"`
	Time    time.Time `json:"-"`
}

Item is one memory to ingest. Group scopes it to a namespace; an empty Group falls back to a shared default. Time, when set, is the memory's source timestamp (used to ground recency in the recency-aware re-ranking comparison).

type Question ¶

type Question struct {
	Query    string    `json:"query"`
	Gold     []string  `json:"gold"`
	Group    string    `json:"group,omitempty"`
	Answer   string    `json:"answer,omitempty"`
	Category string    `json:"category,omitempty"`
	Now      time.Time `json:"-"`
}

Question is a query plus the gold memory IDs it should retrieve. Group must match its items; Answer/Category are populated for QA evaluation where available. Now, when set, is the query's reference time (e.g. the question date) — the "now" against which recency is measured.

type RerankResult ¶

type RerankResult struct {
	System    string
	Category  string
	Questions int
	RecallAt1 float64
	RecallAtK float64
	MRR       float64
}

RerankResult is one ranking strategy's score over a question set.

func RerankCompare ¶

func RerankCompare(
	ctx context.Context, st store.Store, e embed.Embedder, ds *Dataset, cats []string, k int, queryPrefix string,
) ([]RerankResult, error)

RerankCompare isolates the effect of recency-aware re-ranking: it ingests ds (items carry source timestamps), then for each selected question scores the SAME fused candidate set two ways — pure RRF order vs the composite re-ranker using the question's reference time. Reports recall@1, recall@K, and MRR per category and overall, for both strategies. cats empty means all categories.

type Result ¶

type Result struct {
	System      string             `json:"system"`
	Dataset     string             `json:"dataset"`
	K           int                `json:"k"`
	Questions   int                `json:"questions"`
	RecallAtK   float64            `json:"recall_at_k"`
	MRR         float64            `json:"mrr"`
	P50Millis   float64            `json:"p50_ms"`
	P95Millis   float64            `json:"p95_ms"`
	IngestMs    float64            `json:"ingest_ms"`
	PerCategory map[string]float64 `json:"per_category,omitempty"`
}

Result is one system's score on a dataset at a given K.

func Run ¶

func Run(ctx context.Context, sys System, ds *Dataset, ks []int) ([]Result, error)

Run ingests the dataset into a system once, then scores recall_any@K and MRR for every K in ks from a single retrieval pass (retrieving max(ks) per question). Returns one Result per K.

type System ¶

type System interface {
	Name() string
	Ingest(ctx context.Context, items []Item) error
	Recall(ctx context.Context, group, query string, k int) ([]string, error)
}

System is a memory system under test.

func MeminiSystems ¶

func MeminiSystems(
	st store.Store, e embed.Embedder, concurrency int, queryPrefix string, fusionAlpha float64, poolFactor, poolFloor int,
) []System

MeminiSystems returns the hybrid, vector-only, and keyword-only retrieval strategies sharing one ingested store. queryPrefix, when non-empty, is prepended to query embeddings (hybrid and vector legs), matching MEMINI_EMBED_QUERY_PREFIX in production. fusionAlpha < 0 uses RRF; >= 0 uses convex-combination score fusion with that vector weight. poolFactor/poolFloor override hybrid recall's per-leg pool sizing (non-positive keeps defaults).
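How two of these parameters are interpreted can be sketched in a few lines. Both helpers are hypothetical, and joining the prefix to the query by plain concatenation is an assumption; the source states only that the prefix is prepended to queries and that documents stay bare.

```go
package main

import "fmt"

// embedInput is a hypothetical helper: queries get the prefix, documents are
// embedded as-is. Plain concatenation is an assumption.
func embedInput(text, queryPrefix string, isQuery bool) string {
	if isQuery && queryPrefix != "" {
		return queryPrefix + text
	}
	return text
}

// fusionWeights is a hypothetical helper mirroring the documented rule:
// fusionAlpha < 0 selects RRF, otherwise alpha is the vector weight of a
// convex combination and the keyword leg gets the remainder.
func fusionWeights(alpha float64) (useRRF bool, vectorW, keywordW float64) {
	if alpha < 0 {
		return true, 0, 0
	}
	return false, alpha, 1 - alpha
}

func main() {
	prefix := "Instruct: Given a user query, retrieve relevant memories that answer it\nQuery:"
	fmt.Printf("%q\n", embedInput("where does Alice live?", prefix, true))
	fmt.Printf("%q\n", embedInput("Alice moved to Lyon.", prefix, false))

	for _, alpha := range []float64{0.5, -1} {
		useRRF, v, k := fusionWeights(alpha)
		fmt.Printf("alpha=%v rrf=%v vector=%v keyword=%v\n", alpha, useRRF, v, k)
	}
}
```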
