runner

package
v0.10.0 Latest
Warning

This package is not in the latest version of its module.
Published: Mar 22, 2026 License: MIT Imports: 8 Imported by: 0

Documentation

Overview

Package runner provides shared types, a CortexClient wrapper, and scoring functions used by all benchmark harnesses.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func BestCandidate

func BestCandidate(memories []string, groundTruth string) string

BestCandidate picks the memory from the retrieved list that has the highest token-F1 against the ground truth. Falls back to the first result if no candidate scores above zero.
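A self-contained sketch of this selection logic, for illustration only (the word-overlap `score` function below is a simplified stand-in for the package's token-F1, and all names are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// score is a stand-in for token-F1: it counts ground-truth words
// that also appear in the candidate.
func score(candidate, truth string) float64 {
	truthWords := map[string]bool{}
	for _, w := range strings.Fields(strings.ToLower(truth)) {
		truthWords[w] = true
	}
	n := 0
	for _, w := range strings.Fields(strings.ToLower(candidate)) {
		if truthWords[w] {
			n++
		}
	}
	return float64(n)
}

// bestCandidate mirrors the documented behavior: the highest-scoring
// memory wins; the first result is the fallback when nothing scores
// above zero.
func bestCandidate(memories []string, groundTruth string) string {
	if len(memories) == 0 {
		return ""
	}
	best, bestScore := memories[0], 0.0
	for _, m := range memories {
		if s := score(m, groundTruth); s > bestScore {
			best, bestScore = m, s
		}
	}
	return best
}

func main() {
	mems := []string{"user likes tea", "user moved to Lyon in 2021"}
	fmt.Println(bestCandidate(mems, "moved to Lyon")) // user moved to Lyon in 2021
}
```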

func ExactMatch

func ExactMatch(retrieved, groundTruth string) bool

ExactMatch returns true if retrieved contains the ground truth (case-insensitive).

func FormatMarkdownTable

func FormatMarkdownTable(summaries []*BenchmarkSummary, k int) string

FormatMarkdownTable renders a GitHub-flavored markdown results table for a slice of BenchmarkSummary values. k is the recall-at-k value used only for the column header label.

Column widths are fixed except for Recall@k, which grows with k to avoid misalignment for k>=10 (e.g. "Recall@5"=8 chars, "Recall@100"=10 chars).

func RecallAtK

func RecallAtK(memories []string, groundTruth string, k int) bool

RecallAtK checks if any of the top-k retrieved memories contains the ground truth.
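The check can be sketched as a case-insensitive substring scan over the first k candidates; this is an illustration consistent with the ExactMatch semantics above, not the package's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// recallAtK reports whether any of the top-k memories contains the
// ground truth, case-insensitively. If fewer than k memories were
// returned, only the actual candidates are checked.
func recallAtK(memories []string, groundTruth string, k int) bool {
	if k > len(memories) {
		k = len(memories)
	}
	gt := strings.ToLower(groundTruth)
	for _, m := range memories[:k] {
		if strings.Contains(strings.ToLower(m), gt) {
			return true
		}
	}
	return false
}

func main() {
	mems := []string{"user was born in Oslo", "user likes jazz"}
	fmt.Println(recallAtK(mems, "likes jazz", 2)) // true: found at rank 2
}
```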

func TokenF1

func TokenF1(retrieved, groundTruth string) float64

TokenF1 computes token-level F1 between retrieved and ground truth. Returns 0 in all degenerate cases (empty groundTruth, all-punctuation inputs, or no token overlap) so it stays consistent with ExactMatch's false return for empty groundTruth.
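One plausible implementation of token-level F1, matching the degenerate-case behavior described above (the whitespace/lowercase tokenizer is an assumption; the package's tokenizer may differ):

```go
package main

import (
	"fmt"
	"strings"
)

func tokens(s string) []string {
	return strings.Fields(strings.ToLower(s))
}

// tokenF1 computes the harmonic mean of token precision and recall.
// All degenerate cases (no tokens on either side, zero overlap)
// return 0, consistent with the documented contract.
func tokenF1(retrieved, groundTruth string) float64 {
	rt, gt := tokens(retrieved), tokens(groundTruth)
	if len(rt) == 0 || len(gt) == 0 {
		return 0
	}
	counts := map[string]int{}
	for _, t := range gt {
		counts[t]++
	}
	overlap := 0 // multiset intersection size
	for _, t := range rt {
		if counts[t] > 0 {
			counts[t]--
			overlap++
		}
	}
	if overlap == 0 {
		return 0
	}
	p := float64(overlap) / float64(len(rt)) // precision
	r := float64(overlap) / float64(len(gt)) // recall
	return 2 * p * r / (p + r)
}

func main() {
	fmt.Printf("%.3f\n", tokenF1("the cat sat", "cat sat down")) // 0.667
}
```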

Types

type BenchmarkResult

type BenchmarkResult struct {
	QuestionID  string  `json:"question_id"`
	Question    string  `json:"question"`
	GroundTruth string  `json:"ground_truth"`
	Retrieved   string  `json:"retrieved"`     // oracle-selected best candidate (highest token-F1 vs. ground truth)
	ExactMatch  bool    `json:"exact_match"`   // Retrieved contains GroundTruth (case-insensitive); oracle-selected, not top-ranked
	F1Score     float64 `json:"f1_score"`      // token-F1 of Retrieved vs. GroundTruth; oracle-selected, not top-ranked
	RecalledAtK bool    `json:"recalled_at_k"` // was ground truth in any of the top-k results?
}

BenchmarkResult holds the outcome of one QA pair evaluation.

ExactMatch and F1Score are both computed against the oracle-selected best candidate (the top-k result with the highest token-F1 vs. ground truth, chosen by BestCandidate). They measure "could the answer be found anywhere in the top-k?" — an upper-bound / recall-style metric, NOT Precision@1. RecalledAtK is the canonical recall metric; ExactMatch is a stricter variant of the same signal. See eval/README.md § Metrics.

type BenchmarkSummary

type BenchmarkSummary struct {
	Name           string  `json:"name"`
	TotalQuestions int     `json:"total_questions"`
	ExactMatchAcc  float64 `json:"exact_match_accuracy"`
	AvgF1          float64 `json:"avg_f1"`
	RecallAtK      float64 `json:"recall_at_k"`
	K              int     `json:"k"`
	// RecallFailures is the number of QA pairs for which the recall call failed
	// (binary error, connectivity issue, etc.). Non-zero values indicate that
	// scores for those pairs are artificially zero and should not be compared
	// against baselines without qualification.
	RecallFailures int               `json:"recall_failures,omitempty"`
	Results        []BenchmarkResult `json:"results"`
}

BenchmarkSummary aggregates results from a single benchmark run.

func Summarize

func Summarize(name string, results []BenchmarkResult, k, recallFailures int) *BenchmarkSummary

Summarize aggregates a slice of BenchmarkResult into a BenchmarkSummary. recallFailures is the number of QA pairs for which the recall step failed; it is recorded in the summary so callers can detect partially-degraded runs.
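A minimal sketch of the aggregation, assuming the summary fields are simple means over all results (which the field names suggest, but which is not confirmed by this page):

```go
package main

import "fmt"

// benchmarkResult carries only the fields needed for aggregation;
// it mirrors the documented BenchmarkResult.
type benchmarkResult struct {
	ExactMatch  bool
	F1Score     float64
	RecalledAtK bool
}

// summarize computes the three headline metrics as plain means.
func summarize(results []benchmarkResult) (emAcc, avgF1, recall float64) {
	n := float64(len(results))
	if n == 0 {
		return 0, 0, 0
	}
	for _, r := range results {
		if r.ExactMatch {
			emAcc++
		}
		avgF1 += r.F1Score
		if r.RecalledAtK {
			recall++
		}
	}
	return emAcc / n, avgF1 / n, recall / n
}

func main() {
	res := []benchmarkResult{
		{ExactMatch: true, F1Score: 1.0, RecalledAtK: true},
		{ExactMatch: false, F1Score: 0.5, RecalledAtK: true},
	}
	em, f1, r := summarize(res)
	fmt.Println(em, f1, r) // 0.5 0.75 1
}
```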

type Client

type Client interface {
	Reset(ctx context.Context) error
	Store(ctx context.Context, content string) error
	Recall(ctx context.Context, query string, limit int) ([]string, error)
}

Client is the interface that benchmark harnesses use to interact with the openclaw-cortex binary. CortexClient implements it; tests can inject a stub.

type CortexClient

type CortexClient struct {
	// BinaryPath is the path to the openclaw-cortex binary. Defaults to "openclaw-cortex".
	BinaryPath string
	// ConfigPath optionally points to an openclaw-cortex config file.
	ConfigPath string
	// CallTimeout is the per-subprocess deadline for each Reset/Store/Recall call.
	// Zero means use defaultCallTimeout (30 s).
	CallTimeout time.Duration
}

CortexClient wraps the openclaw-cortex binary via execFile (no shell injection). It implements Client.

func NewCortexClient

func NewCortexClient(binaryPath, configPath string) *CortexClient

NewCortexClient returns a CortexClient with sensible defaults.

func (*CortexClient) Recall

func (c *CortexClient) Recall(ctx context.Context, query string, limit int) ([]string, error)

Recall runs `openclaw-cortex recall --context _ <query>` and returns up to limit memory content strings parsed from the JSON output.

--budget limit*500 is a token-based heuristic, not a hard result count. The binary trims output to that many tokens; if memories are verbose the binary may return fewer than limit results, and the trailing contents[:limit] slice becomes a no-op. For the synthetic benchmark datasets (each fact/turn ≤ 30 tokens) 500 tokens per expected result is intentionally generous, making under-counting in practice very unlikely.

Note: Memgraph itself may also return fewer than limit items if fewer memories match the query. In that case len(contents) < limit with no error — RecallAtK is evaluated against the actual candidates returned, not a padded set. For the synthetic datasets this is expected and correct.
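The defensive slice the two notes above describe can be sketched as follows (illustrative, not the package's code): when fewer than limit items come back, truncation must be a no-op rather than an out-of-range panic.

```go
package main

import "fmt"

// clampToLimit returns at most limit entries. Slicing contents[:limit]
// directly would panic when the binary returns fewer results than
// requested, so the length is checked first.
func clampToLimit(contents []string, limit int) []string {
	if len(contents) < limit {
		return contents
	}
	return contents[:limit]
}

func main() {
	short := []string{"only one memory"}
	fmt.Println(len(clampToLimit(short, 5))) // 1: fewer than limit, no padding
}
```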

func (*CortexClient) Reset

func (c *CortexClient) Reset(ctx context.Context) error

Reset calls `openclaw-cortex reset --yes` to wipe all memories from the store. Used by benchmark harnesses to isolate QA pairs from each other.

func (*CortexClient) Store

func (c *CortexClient) Store(ctx context.Context, content string) error

Store runs `openclaw-cortex store <content>` to persist a fact memory. --scope permanent is intentional: eval facts represent ground-truth knowledge that should outlive a session, and all eval facts receive the same scope so relative recall scoring is unaffected by scope-boost.

type RecallJSONResult

type RecallJSONResult struct {
	Memory struct {
		Content string `json:"content"`
	} `json:"memory"`
}

RecallJSONResult is a minimal struct for parsing JSON output from `openclaw-cortex recall --context _`.

Schema: matches cmd_recall.go output as of commit e38b3d5f. The binary serializes []models.RecallResult (internal/models/memory.go):

[{"memory":{"content":"..."},...}, ...]

The outer key is "memory" (json:"memory") and the content key is "content" (json:"content"). If the recall command's output schema changes — e.g. the outer wrapper is flattened or the field is renamed — update this struct and TestRecallJSONResultSchema in tests/eval_runner_test.go.

Exported so that tests/eval_runner_test.go can test JSON schema parsing without requiring a live binary (CLAUDE.md: tests live in tests/).
