longmemeval

package
v0.10.0 Latest
Warning

This package is not in the latest version of its module.
Published: Mar 22, 2026 License: MIT Imports: 4 Imported by: 0

Documentation

Overview

Package longmemeval provides a synthetic LongMemEval-style benchmark dataset with 10 QA pairs testing temporal reasoning, multi-hop retrieval, and knowledge-update (superseded memory) scenarios.

LongMemEval specifically probes an agent's ability to recall information from long-horizon conversation histories, including cases where facts have changed over time.

The package also implements the harness that runs this benchmark; see Run.

Constants

This section is empty.

Variables

This section is empty.

Functions

func Run

func Run(ctx context.Context, client runner.Client, k int) (*runner.BenchmarkSummary, error)

Run ingests all synthetic LongMemEval facts and evaluates all QA pairs. It returns a BenchmarkSummary with individual and aggregate results.

For each QA pair:

  1. Ingest all of the pair's facts, in order, via client.Store.
  2. Call Recall(question, k) to retrieve relevant memories.
  3. Score the result with ExactMatch, TokenF1, and RecallAtK.
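
The three steps above can be sketched as follows. This is a minimal illustration, not the package's implementation: the Client interface, toyClient, Score, tokenF1, evaluate, and demo are hypothetical names, the real runner.Client may have different method signatures, and the metric definitions (top-ranked memory as the prediction, case-insensitive equality for ExactMatch, substring containment for RecallAtK) are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// Client is a hypothetical stand-in for runner.Client; the real interface may differ.
type Client interface {
	Store(ctx context.Context, content string) error
	Recall(ctx context.Context, query string, k int) ([]string, error)
}

// toyClient keeps facts in memory and ranks them by token F1 against the query.
type toyClient struct{ facts []string }

func (c *toyClient) Store(_ context.Context, content string) error {
	c.facts = append(c.facts, content)
	return nil
}

func (c *toyClient) Recall(_ context.Context, query string, k int) ([]string, error) {
	ranked := append([]string(nil), c.facts...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return tokenF1(ranked[i], query) > tokenF1(ranked[j], query)
	})
	if k >= 0 && k < len(ranked) { // k is assumed to be non-negative
		ranked = ranked[:k]
	}
	return ranked, nil
}

// tokens lowercases s and splits it on anything that is not a letter or digit.
func tokens(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// tokenF1 is the harmonic mean of token-level precision and recall.
func tokenF1(pred, truth string) float64 {
	left := map[string]int{}
	for _, w := range tokens(truth) {
		left[w]++
	}
	p := tokens(pred)
	n := 0 // tokens shared by pred and truth, respecting multiplicity
	for _, w := range p {
		if left[w] > 0 {
			left[w]--
			n++
		}
	}
	if n == 0 {
		return 0
	}
	precision := float64(n) / float64(len(p))
	recall := float64(n) / float64(len(tokens(truth)))
	return 2 * precision * recall / (precision + recall)
}

// Score holds the three per-pair metrics named in the steps above.
type Score struct {
	ExactMatch bool
	TokenF1    float64
	RecallAtK  bool
}

// evaluate ingests the facts, recalls up to k memories, and scores the result.
func evaluate(ctx context.Context, c Client, facts []string, question, truth string, k int) (Score, error) {
	for _, f := range facts {
		if err := c.Store(ctx, f); err != nil {
			return Score{}, err
		}
	}
	memories, err := c.Recall(ctx, question, k)
	if err != nil || len(memories) == 0 {
		return Score{}, err
	}
	top := memories[0] // assumed prediction: the top-ranked memory
	s := Score{ExactMatch: strings.EqualFold(top, truth), TokenF1: tokenF1(top, truth)}
	for _, m := range memories {
		// Assumed definition: a hit if any recalled memory contains the ground truth.
		s.RecallAtK = s.RecallAtK || strings.Contains(strings.ToLower(m), strings.ToLower(truth))
	}
	return s, nil
}

// demo scores one invented QA pair against the toy client.
func demo() (Score, error) {
	facts := []string{"Alice adopted a cat named Miso.", "Alice moved to Berlin in 2024."}
	return evaluate(context.Background(), &toyClient{}, facts,
		"Which city did Alice move to in 2024?", "Berlin", 1)
}

func main() { fmt.Println(demo()) }
```

In this sketch the recalled memory is a full sentence, so ExactMatch against a one-word ground truth is false while RecallAtK is true; the actual harness may derive its prediction differently.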

Types

type MemoryFact

type MemoryFact struct {
	Content string
	// DatasetValidFrom and DatasetValidTo are dataset metadata for human
	// readability only. The "Dataset" prefix is intentional: these fields are
	// NOT forwarded to the openclaw-cortex binary — the harness calls
	// client.Store(ctx, fact.Content) and ignores them entirely.
	// Temporal-versioning paths (valid_from/valid_to in the store, --supersedes,
	// SearchFilters.AsOf) are therefore out of scope for this harness; it measures
	// semantic retrieval only. See longmemeval/harness.go for the full rationale.
	DatasetValidFrom string // e.g. "2024-01" — dataset documentation only; NOT passed to binary
	DatasetValidTo   string // non-empty = superseded fact; NOT passed as --supersedes
}

MemoryFact is a pre-formed statement that gets ingested directly via Store (rather than a full conversation turn) to simulate a long conversation history.
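
As a rough illustration of the struct comment, the sketch below builds a superseded fact and its replacement and shows that only Content would reach the store. MemoryFact is redeclared locally so the example compiles on its own; storePayloads, isSuperseded, and exampleFacts are hypothetical helpers and the fact text is invented, not taken from the dataset.

```go
package main

import "fmt"

// MemoryFact mirrors the documented struct so this sketch is self-contained.
type MemoryFact struct {
	Content          string
	DatasetValidFrom string
	DatasetValidTo   string
}

// storePayloads returns what the harness is documented to pass to Store:
// each fact's Content, in order, with the validity metadata dropped.
func storePayloads(facts []MemoryFact) []string {
	out := make([]string, 0, len(facts))
	for _, f := range facts {
		out = append(out, f.Content)
	}
	return out
}

// isSuperseded reports whether the dataset marks the fact as replaced,
// which the documentation signals with a non-empty DatasetValidTo.
func isSuperseded(f MemoryFact) bool {
	return f.DatasetValidTo != ""
}

// exampleFacts returns an invented knowledge-update scenario.
func exampleFacts() []MemoryFact {
	return []MemoryFact{
		{Content: "The team deploys on Fridays.", DatasetValidFrom: "2024-01", DatasetValidTo: "2024-06"},
		{Content: "The team now deploys on Tuesdays.", DatasetValidFrom: "2024-06"},
	}
}

func main() {
	facts := exampleFacts()
	for i, payload := range storePayloads(facts) {
		fmt.Printf("store %q (superseded in dataset: %v)\n", payload, isSuperseded(facts[i]))
	}
}
```

Because both statements are stored as plain text, a retrieval system has to prefer the newer one on semantic grounds alone, which matches the documented scope of the harness.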

type QAPair

type QAPair struct {
	ID          string
	Facts       []MemoryFact // facts to ingest (in order) before querying
	Question    string
	GroundTruth string
	Category    string // "temporal" | "multi-hop" | "knowledge-update"
}

QAPair is a LongMemEval-style evaluation unit.
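
A QAPair can be inspected like any other slice element. The sketch below redeclares both types locally and tallies pairs by Category; countByCategory and samplePairs are hypothetical, and the sample pairs are invented rather than copied from Dataset().

```go
package main

import "fmt"

// MemoryFact and QAPair mirror the documented structs so this sketch compiles alone.
type MemoryFact struct {
	Content          string
	DatasetValidFrom string
	DatasetValidTo   string
}

type QAPair struct {
	ID          string
	Facts       []MemoryFact
	Question    string
	GroundTruth string
	Category    string // "temporal" | "multi-hop" | "knowledge-update"
}

// countByCategory tallies how many pairs fall into each Category value.
func countByCategory(pairs []QAPair) map[string]int {
	counts := make(map[string]int)
	for _, p := range pairs {
		counts[p.Category]++
	}
	return counts
}

// samplePairs returns invented pairs for illustration only.
func samplePairs() []QAPair {
	return []QAPair{
		{
			ID:          "example-1",
			Facts:       []MemoryFact{{Content: "Bob joined the company in March 2023.", DatasetValidFrom: "2023-03"}},
			Question:    "When did Bob join the company?",
			GroundTruth: "March 2023",
			Category:    "temporal",
		},
		{
			ID: "example-2",
			Facts: []MemoryFact{
				{Content: "Bob uses Vim.", DatasetValidFrom: "2023-03", DatasetValidTo: "2024-02"},
				{Content: "Bob switched to Emacs.", DatasetValidFrom: "2024-02"},
			},
			Question:    "Which editor does Bob use?",
			GroundTruth: "Emacs",
			Category:    "knowledge-update",
		},
	}
}

func main() {
	for category, n := range countByCategory(samplePairs()) {
		fmt.Printf("%s: %d\n", category, n)
	}
}
```

With the real package imported, the same helper could be applied to the slice returned by Dataset() to check how the 10 pairs are distributed across categories.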

func Dataset

func Dataset() []QAPair

Dataset returns the synthetic LongMemEval QA pairs.
