Benchmark Evaluation Harness

This directory contains a benchmark evaluation harness for openclaw-cortex against two canonical long-term memory benchmarks: LoCoMo and LongMemEval.

Both benchmarks use synthetic datasets only — no external downloads or internet access required.

Directory Layout

eval/
├── cmd/eval/main.go          # CLI entry point
├── locomo/
│   ├── dataset.go            # 10 synthetic LoCoMo QA pairs (3 conversations)
│   └── harness.go            # Ingest + recall + score runner
├── longmemeval/
│   ├── dataset.go            # 10 synthetic LongMemEval QA pairs
│   └── harness.go            # Ingest + recall + score runner
└── runner/
    └── runner.go             # Shared types, CortexClient, scoring functions

Tests for the eval harness live in the top-level tests/ package: tests/eval_locomo_test.go, tests/eval_longmemeval_test.go, tests/eval_runner_test.go.

Prerequisites

Dependency	Purpose
`openclaw-cortex` binary in `PATH`	Retrieval backend
Memgraph running on `bolt://localhost:7687`	Vector + graph storage
Ollama running on `http://localhost:11434`	Embeddings (`nomic-embed-text`)

Start local services:

docker compose up -d

How to Build and Run

Run all benchmarks (default)

go run ./eval/cmd/eval --benchmark all --k 5

Run a single benchmark

go run ./eval/cmd/eval --benchmark locomo --k 5
go run ./eval/cmd/eval --benchmark longmemeval --k 5

Save JSON results to a file

go run ./eval/cmd/eval --benchmark all --output results.json

Use a custom binary path or config

go run ./eval/cmd/eval \
  --binary /path/to/openclaw-cortex \
  --config ~/.openclaw-cortex/config.yaml \
  --benchmark all \
  --k 10

Run unit tests only (no binary / services needed)

go test -short -count=1 ./tests/... -v

Output Format

JSON

The JSON output is an array of BenchmarkSummary objects:

[
  {
    "name": "LoCoMo",
    "total_questions": 10,
    "exact_match_accuracy": 0.6,
    "avg_f1": 0.623,
    "recall_at_k": 0.8,
    "k": 5,
    "results": [ ... ]
  }
]

Each results entry is a BenchmarkResult:

Field	Description
`question_id`	Synthetic dataset identifier (e.g. `locomo-A1`)
`question`	The evaluation question
`ground_truth`	Expected answer substring
`retrieved`	Oracle-selected best candidate — the top-k result with the highest token-F1 vs. `ground_truth`
`exact_match`	Whether `retrieved` contains `ground_truth` (case-insensitive); oracle-selected, not top-ranked
`f1_score`	Token-level F1 between `retrieved` and `ground_truth`; oracle-selected, not top-ranked
`recalled_at_k`	Whether any of the top-k memories contained `ground_truth`
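For post-processing a saved results file, the documented JSON shape can be decoded with a pair of structs. The following is a minimal sketch, not the harness's own types: the JSON keys come from the tables above, but the Go field names, the assumption that `retrieved` is a plain string, the helper names `parseSummaries` and `missedQuestions`, and the sample payload are all illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// BenchmarkResult mirrors the documented per-question JSON keys.
// Field names and types are assumptions made for this sketch.
type BenchmarkResult struct {
	QuestionID  string  `json:"question_id"`
	Question    string  `json:"question"`
	GroundTruth string  `json:"ground_truth"`
	Retrieved   string  `json:"retrieved"` // assumed to be a plain string
	ExactMatch  bool    `json:"exact_match"`
	F1Score     float64 `json:"f1_score"`
	RecalledAtK bool    `json:"recalled_at_k"`
}

// BenchmarkSummary mirrors the documented per-benchmark JSON keys.
type BenchmarkSummary struct {
	Name               string            `json:"name"`
	TotalQuestions     int               `json:"total_questions"`
	ExactMatchAccuracy float64           `json:"exact_match_accuracy"`
	AvgF1              float64           `json:"avg_f1"`
	RecallAtK          float64           `json:"recall_at_k"`
	K                  int               `json:"k"`
	Results            []BenchmarkResult `json:"results"`
}

// parseSummaries decodes the top-level JSON array of summaries.
func parseSummaries(data []byte) ([]BenchmarkSummary, error) {
	var out []BenchmarkSummary
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, fmt.Errorf("decode summaries: %w", err)
	}
	return out, nil
}

// missedQuestions lists the IDs whose ground truth was not found in the top-k.
func missedQuestions(s BenchmarkSummary) []string {
	var ids []string
	for _, r := range s.Results {
		if !r.RecalledAtK {
			ids = append(ids, r.QuestionID)
		}
	}
	return ids
}

func main() {
	// Made-up sample payload; real files come from the --output flag.
	sample := []byte(`[{"name":"LoCoMo","total_questions":2,"k":5,"results":[
		{"question_id":"locomo-A1","recalled_at_k":true},
		{"question_id":"locomo-B1","recalled_at_k":false}]}]`)
	summaries, err := parseSummaries(sample)
	if err != nil {
		panic(err)
	}
	for _, s := range summaries {
		fmt.Printf("%s: missed at k=%d: %s\n", s.Name, s.K, strings.Join(missedQuestions(s), ", "))
	}
}
```

Listing the questions that were not recalled at k is usually the quickest way to see which QA pairs drive a score change.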

Markdown Table

After the JSON block the tool prints a summary table:

| Benchmark      | Questions | Exact Match | Avg F1  | Recall@5 |
|----------------|-----------|-------------|---------|----------|
| LoCoMo         | 10        |       60.0% | 0.6230  |    80.0% |
| LongMemEval    | 10        |       50.0% | 0.5410  |    70.0% |

Interpreting Results

Metrics

Metric	Definition
Exact Match	Fraction of questions where the oracle-selected best candidate (highest token-F1 among top-K) contains the ground-truth string (case-insensitive). This is an upper-bound metric — it answers "could the answer be found anywhere in top-K?", not "did the system rank the answer first?".
Avg F1	Mean token-level F1 between the oracle-selected best candidate and ground truth. Upper-bound metric; same oracle selection as Exact Match.
Recall@K	Fraction of questions where any of the top-K retrieved memories contains the ground truth. The canonical recall metric.
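The three metrics can be sketched per question as follows. This is an illustration of the definitions above, not the code in runner/runner.go: the whitespace tokenizer, the tie-breaking (first candidate wins), and the names `tokenF1`, `containsFold`, and `scoreQuestion` are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenF1 computes token-level F1 over lower-cased, whitespace-split tokens.
// The real harness may tokenize differently.
func tokenF1(candidate, truth string) float64 {
	cand := strings.Fields(strings.ToLower(candidate))
	gold := strings.Fields(strings.ToLower(truth))
	if len(cand) == 0 || len(gold) == 0 {
		return 0
	}
	counts := make(map[string]int)
	for _, t := range gold {
		counts[t]++
	}
	overlap := 0
	for _, t := range cand {
		if counts[t] > 0 {
			counts[t]--
			overlap++
		}
	}
	if overlap == 0 {
		return 0
	}
	p := float64(overlap) / float64(len(cand))
	r := float64(overlap) / float64(len(gold))
	return 2 * p * r / (p + r)
}

// containsFold is a case-insensitive substring check.
func containsFold(text, truth string) bool {
	return strings.Contains(strings.ToLower(text), strings.ToLower(truth))
}

// score holds the per-question outcome for one set of top-k memories.
type score struct {
	Retrieved   string
	ExactMatch  bool
	F1          float64
	RecalledAtK bool
}

// scoreQuestion picks the oracle candidate (highest token-F1 in topK),
// then derives exact match and F1 from it; recall looks at every candidate.
func scoreQuestion(topK []string, truth string) score {
	var s score
	best := -1.0
	for _, m := range topK {
		if containsFold(m, truth) {
			s.RecalledAtK = true
		}
		if f := tokenF1(m, truth); f > best {
			best = f
			s.Retrieved = m
			s.F1 = f
		}
	}
	s.ExactMatch = s.Retrieved != "" && containsFold(s.Retrieved, truth)
	return s
}

func main() {
	topK := []string{
		"Bob adopted Kubernetes in 2021",
		"Alice prefers Go for backend services",
	}
	s := scoreQuestion(topK, "prefers Go")
	fmt.Printf("retrieved=%q exact=%v f1=%.2f recalled=%v\n",
		s.Retrieved, s.ExactMatch, s.F1, s.RecalledAtK)
}
```

Because the candidate is chosen by F1 against the ground truth, Exact Match and Avg F1 say nothing about ranking within the top-K; only the membership of the answer in the retrieved set matters.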

Competitor Context (from issue #88)

System	LoCoMo EM	LongMemEval EM
GPT-4 (RAG baseline)	~58%	~52%
MemGPT	~61%	~55%
A-MEM	~63%	~57%
openclaw-cortex (target)	>60%	>55%

These numbers are from the academic literature and reflect full-scale benchmark runs. The synthetic datasets here have 10 QA pairs each — they are representative but not statistically equivalent to the full benchmarks. Use them for regression detection and qualitative comparison, not for publication-grade claims.

Isolation Design and Comparison Caveats

The harness resets the memory store before each QA pair, so every question is evaluated against a freshly reset store that contains only the facts/turns for that single pair. This differs from the published LoCoMo and LongMemEval protocols, which accumulate conversation history across all turns before running evaluation:

LoCoMo is designed to stress multi-session recall — the model is expected to answer questions by drawing on a long, accumulated conversation history. Resetting between pairs removes cross-pair context, so questions that require facts from earlier conversations will always score zero here. The published LoCoMo numbers assume full history is available.
LongMemEval similarly expects the full fact set to be in the store simultaneously.

Why the harness resets anyway: The reset ensures deterministic, non-contaminating isolation between QA pairs — a required property for CI/regression use. Each run produces identical scores regardless of ordering or prior state. The trade-off is that scores reflect single-pair retrieval capability rather than long-horizon accumulation.

Consequence: For cross-turn questions, scores from this harness will generally be lower than published benchmark numbers. Do not compare raw numbers directly against the academic literature. Use the scores as a stable regression baseline: a drop between commits signals a likely retrieval-quality regression, while steady scores suggest the change is neutral for these QA pairs.
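One way to use the scores as a regression baseline is to compare a stored baseline against the current run and flag any metric that drops by more than a small tolerance. The sketch below is not part of the harness; the `metrics` type, the `regressions` helper, and the 0.01 tolerance are hypothetical choices for illustration.

```go
package main

import "fmt"

// metrics holds the three aggregate scores of one benchmark run.
type metrics struct {
	ExactMatch float64
	AvgF1      float64
	RecallAtK  float64
}

// regressions reports every metric that fell by more than tol
// between the baseline run and the current run.
func regressions(baseline, current metrics, tol float64) []string {
	var out []string
	check := func(name string, base, cur float64) {
		if base-cur > tol {
			out = append(out, fmt.Sprintf("%s dropped from %.3f to %.3f", name, base, cur))
		}
	}
	check("exact_match_accuracy", baseline.ExactMatch, current.ExactMatch)
	check("avg_f1", baseline.AvgF1, current.AvgF1)
	check("recall_at_k", baseline.RecallAtK, current.RecallAtK)
	return out
}

func main() {
	baseline := metrics{ExactMatch: 0.6, AvgF1: 0.623, RecallAtK: 0.8}
	current := metrics{ExactMatch: 0.6, AvgF1: 0.55, RecallAtK: 0.8}
	for _, msg := range regressions(baseline, current, 0.01) {
		fmt.Println("regression:", msg)
	}
}
```

With only 10 QA pairs per benchmark, a single question changes Exact Match and Recall@K by 10 percentage points, so a tolerance mainly matters for Avg F1.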

How to Reproduce

# 1. Build the binary
go build -o bin/openclaw-cortex ./cmd/openclaw-cortex

# 2. Start services
docker compose up -d

# 3. Run benchmarks
go run ./eval/cmd/eval --binary ./bin/openclaw-cortex --benchmark all --k 5 --output eval_results.json

# 4. Inspect the per-benchmark summary
jq '.[] | {name, exact_match_accuracy, avg_f1, recall_at_k}' eval_results.json

# 5. Run unit tests (no services needed)
go test -short -count=1 ./tests/... -v

Dataset Design

LoCoMo (10 QA pairs, 3 conversations)

Conversation A — Alice (software engineer): programming language preferences, job history.

Conversation B — Bob (infra lead): Kubernetes adoption, deployment strategies, prior tooling.

Conversation C — Carol (ML engineer): framework choices, hardware, career path.

QA categories: single-hop, multi-hop, temporal.

LongMemEval (10 QA pairs)

Covers knowledge that changes over time (job titles, databases, protocols) and chained facts requiring two-step reasoning.

QA categories: temporal, multi-hop, knowledge-update.

Each knowledge-update pair has at least one fact with a ValidTo field (a superseded memory) and a newer replacement fact.
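The knowledge-update rule can be expressed as a small structural check. This is a sketch under assumptions: the `Fact` type below is a hypothetical stand-in (only the ValidTo field name comes from this README), ValidTo is assumed to be a nullable timestamp, and a fact without ValidTo is assumed to be the current replacement.

```go
package main

import (
	"fmt"
	"time"
)

// Fact is a hypothetical stand-in for the dataset's fact type.
type Fact struct {
	Content string
	ValidTo *time.Time // nil means still current; set means superseded
}

// date builds a UTC timestamp pointer for use as a ValidTo value.
func date(y, m, d int) *time.Time {
	t := time.Date(y, time.Month(m), d, 0, 0, 0, 0, time.UTC)
	return &t
}

// isKnowledgeUpdate reports whether the facts include at least one
// superseded fact and at least one fact that is still current.
func isKnowledgeUpdate(facts []Fact) bool {
	superseded, current := false, false
	for _, f := range facts {
		if f.ValidTo != nil {
			superseded = true
		} else {
			current = true
		}
	}
	return superseded && current
}

func main() {
	// Made-up example facts, not taken from the dataset.
	facts := []Fact{
		{Content: "The team uses MySQL.", ValidTo: date(2023, 6, 1)},
		{Content: "The team uses PostgreSQL."},
	}
	fmt.Println(isKnowledgeUpdate(facts))
}
```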

Adding New QA Pairs

Add entries to locomo/dataset.go or longmemeval/dataset.go.
Ensure each entry has a unique ID, non-empty Question and GroundTruth, and at least one Conversation turn / Fact.
Run the unit tests: go test -short ./tests/... — the size/structure assertions will catch common mistakes.
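The requirements in the steps above can be checked before running the test suite. The sketch below is illustrative only: `QAPair` is a hypothetical flattened type (the real dataset types live in locomo/dataset.go and longmemeval/dataset.go and may be shaped differently), and `validate` is not a function from the harness.

```go
package main

import (
	"fmt"
	"strings"
)

// QAPair is a hypothetical, simplified dataset entry.
type QAPair struct {
	ID          string
	Question    string
	GroundTruth string
	Facts       []string // conversation turns or facts to ingest
}

// validate returns one message per violated requirement:
// unique non-empty ID, non-empty Question and GroundTruth, at least one fact.
func validate(pairs []QAPair) []string {
	var problems []string
	seen := make(map[string]bool)
	for i, p := range pairs {
		switch {
		case strings.TrimSpace(p.ID) == "":
			problems = append(problems, fmt.Sprintf("entry %d: empty ID", i))
		case seen[p.ID]:
			problems = append(problems, fmt.Sprintf("entry %d: duplicate ID %q", i, p.ID))
		}
		seen[p.ID] = true
		if strings.TrimSpace(p.Question) == "" {
			problems = append(problems, fmt.Sprintf("entry %d: empty Question", i))
		}
		if strings.TrimSpace(p.GroundTruth) == "" {
			problems = append(problems, fmt.Sprintf("entry %d: empty GroundTruth", i))
		}
		if len(p.Facts) == 0 {
			problems = append(problems, fmt.Sprintf("entry %d: no turns or facts", i))
		}
	}
	return problems
}

func main() {
	// Made-up entries; the second one reuses an ID and lacks a ground truth.
	pairs := []QAPair{
		{ID: "locomo-A1", Question: "Which language does Alice prefer?", GroundTruth: "Go", Facts: []string{"Alice prefers Go."}},
		{ID: "locomo-A1", Question: "Where did Alice work before?", Facts: []string{"Alice worked at a startup."}},
	}
	for _, msg := range validate(pairs) {
		fmt.Println(msg)
	}
}
```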

Troubleshooting

Binary hangs / per-call timeout fires

Each Reset, Store, and Recall subprocess has a 30 s deadline (runner.defaultCallTimeout). When the deadline fires, cmd.Run() returns an error wrapping the killed-process signal; the harness counts it as a recallFailure (or aborts, for Reset/Store). The failure log will include "context deadline exceeded" to make the cause diagnosable.

If calls consistently time out in CI (e.g. because Memgraph is slow to respond), raise the per-call deadline via CortexClient.CallTimeout:

client := runner.NewCortexClient(binaryPath, configPath)
client.CallTimeout = 2 * time.Minute // override 30 s default

The global --timeout flag bounds the entire benchmark run; CallTimeout bounds each individual subprocess call.
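The relationship between the two deadlines can be illustrated with nested contexts. This is a sketch of the general pattern, not the harness's implementation: `callWithTimeout`, `simulate`, `ms`, and `isDeadline` are hypothetical helpers, and a timer stands in for the subprocess that the harness would run.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ms converts a millisecond count to a time.Duration.
func ms(n int) time.Duration { return time.Duration(n) * time.Millisecond }

// callWithTimeout runs fn under a per-call deadline derived from parent.
// The effective deadline is whichever of the two expires first.
func callWithTimeout(parent context.Context, d time.Duration, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, d)
	defer cancel()
	return fn(ctx)
}

// simulate runs one fake call that needs `work` to finish, under a global
// run budget and a per-call budget. A real call would start a subprocess.
func simulate(global, perCall, work time.Duration) error {
	run, cancel := context.WithTimeout(context.Background(), global)
	defer cancel()
	return callWithTimeout(run, perCall, func(ctx context.Context) error {
		select {
		case <-time.After(work):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	})
}

// isDeadline reports whether err is (or wraps) a deadline expiry.
func isDeadline(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	// The per-call deadline (20 ms) fires long before the work (1 s) finishes.
	err := simulate(ms(500), ms(20), ms(1000))
	fmt.Println(isDeadline(err), err) // true context deadline exceeded
}
```

Because the per-call context is derived from the run-wide one, raising the per-call deadline has no effect once the global budget is exhausted; both values may need adjusting together on slow CI machines.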

Scores lower than expected / recall_failures > 0

Check that Memgraph and Ollama are running (openclaw-cortex health). A non-zero recall_failures in the JSON output means some QA pairs scored zero due to binary/connectivity errors, so the aggregate metrics are deflated.

Directories

Path	Synopsis
cmd/eval	Command eval runs LoCoMo and/or LongMemEval benchmarks against a live openclaw-cortex instance and reports results as JSON and a markdown table.
locomo	Package locomo provides a synthetic LoCoMo-style benchmark dataset with 10 QA pairs across 3 multi-session conversations.
longmemeval	Package longmemeval provides a synthetic LongMemEval-style benchmark dataset with 10 QA pairs testing temporal reasoning, multi-hop retrieval, and knowledge-update (superseded memory) scenarios.
runner	Package runner provides shared types, a CortexClient wrapper, and scoring functions used by all benchmark harnesses.
