Documentation ¶
Overview ¶
Package runner provides shared types, a CortexClient wrapper, and scoring functions used by all benchmark harnesses.
Index ¶
- func BestCandidate(memories []string, groundTruth string) string
- func ExactMatch(retrieved, groundTruth string) bool
- func FormatMarkdownTable(summaries []*BenchmarkSummary, k int) string
- func RecallAtK(memories []string, groundTruth string, k int) bool
- func TokenF1(retrieved, groundTruth string) float64
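- type BenchmarkResult
- type BenchmarkSummary
- func Summarize(name string, results []BenchmarkResult, k, recallFailures int) *BenchmarkSummary
- type Client
- type CortexClient
- func NewCortexClient(binaryPath, configPath string) *CortexClient
- func (c *CortexClient) Recall(ctx context.Context, query string, limit int) ([]string, error)
- func (c *CortexClient) Reset(ctx context.Context) error
- func (c *CortexClient) Store(ctx context.Context, content string) error
- type RecallJSONResult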
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func BestCandidate ¶
func BestCandidate(memories []string, groundTruth string) string
BestCandidate picks the memory from the retrieved list that has the highest token-F1 against the ground truth. It falls back to the first result if no candidate scores above zero.
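The selection rule above can be sketched as follows. This is a minimal illustration, not the package's implementation: tokenF1Sketch and bestCandidateSketch are hypothetical names, the tokenizer (lowercase, whitespace-split, multiset overlap) is an assumption, and returning "" for an empty list is a choice made only for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenF1Sketch is a hypothetical stand-in for TokenF1. It assumes lowercase,
// whitespace-split tokens and multiset overlap; the real tokenizer may differ.
func tokenF1Sketch(retrieved, groundTruth string) float64 {
	pred := strings.Fields(strings.ToLower(retrieved))
	gold := strings.Fields(strings.ToLower(groundTruth))
	if len(pred) == 0 || len(gold) == 0 {
		return 0
	}
	// Count gold tokens so repeated tokens are matched at most once each.
	counts := make(map[string]int, len(gold))
	for _, t := range gold {
		counts[t]++
	}
	overlap := 0
	for _, t := range pred {
		if counts[t] > 0 {
			counts[t]--
			overlap++
		}
	}
	if overlap == 0 {
		return 0
	}
	precision := float64(overlap) / float64(len(pred))
	recall := float64(overlap) / float64(len(gold))
	return 2 * precision * recall / (precision + recall)
}

// bestCandidateSketch returns the memory with the highest token-F1 and falls
// back to the first memory when nothing scores above zero. Returning "" for
// an empty list is an assumption of this sketch.
func bestCandidateSketch(memories []string, groundTruth string) string {
	if len(memories) == 0 {
		return ""
	}
	best, bestScore := memories[0], 0.0
	for _, m := range memories {
		if s := tokenF1Sketch(m, groundTruth); s > bestScore {
			best, bestScore = m, s
		}
	}
	return best
}

func main() {
	memories := []string{
		"Alice enjoys hiking",
		"Bob lives in Berlin",
		"Bob moved to Berlin in 2020",
	}
	// The second memory matches every ground-truth token, so it is selected.
	fmt.Println(bestCandidateSketch(memories, "Bob lives in Berlin"))
}
```

Because the strict "greater than" comparison starts from a score of zero, the first memory is kept whenever every candidate scores zero, which matches the documented fallback.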
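func ExactMatch ¶
func ExactMatch(retrieved, groundTruth string) bool
ExactMatch reports whether retrieved contains the ground truth (case-insensitive). Despite the name, this is a containment check, not string equality.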
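A containment check of this kind might look like the sketch below. exactMatchSketch is a hypothetical name, and lowercasing both sides before calling strings.Contains is an assumption about how the case-insensitive comparison is done.

```go
package main

import (
	"fmt"
	"strings"
)

// exactMatchSketch is a hypothetical stand-in for ExactMatch: a
// case-insensitive containment check rather than string equality.
func exactMatchSketch(retrieved, groundTruth string) bool {
	return strings.Contains(strings.ToLower(retrieved), strings.ToLower(groundTruth))
}

func main() {
	fmt.Println(exactMatchSketch("Bob lives in BERLIN now", "berlin")) // true
	fmt.Println(exactMatchSketch("Bob lives in Munich", "berlin"))     // false
}
```

In this sketch an empty ground truth matches every string, because strings.Contains treats the empty string as contained in anything; a harness may want to reject empty ground truths before scoring.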
func FormatMarkdownTable ¶
func FormatMarkdownTable(summaries []*BenchmarkSummary, k int) string
FormatMarkdownTable renders a GitHub-flavored markdown results table for a slice of BenchmarkSummary values. k is the recall-at-k value used only for the column header label.
Column widths are fixed except for Recall@k, which grows with k to avoid misalignment for k>=10 (e.g. "Recall@5"=8 chars, "Recall@100"=10 chars).
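The width rule for the Recall@k column can be illustrated with a short sketch. recallHeader and recallCell are hypothetical helpers, and the three-decimal cell format is an assumption; only the label lengths come from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

// recallHeader returns the Recall@k column label and its width in characters.
func recallHeader(k int) (string, int) {
	label := fmt.Sprintf("Recall@%d", k)
	return label, len(label)
}

// recallCell left-aligns a recall value and pads it to the header width so
// that the column separators line up for any k.
func recallCell(value float64, width int) string {
	return fmt.Sprintf("%-*s", width, fmt.Sprintf("%.3f", value))
}

func main() {
	for _, k := range []int{5, 100} {
		label, width := recallHeader(k)
		fmt.Printf("| %s |\n", label)
		fmt.Printf("|%s|\n", strings.Repeat("-", width+2))
		fmt.Printf("| %s |\n", recallCell(0.875, width))
	}
}
```

Deriving the cell width from the rendered label, rather than hard-coding 8, is what keeps the table aligned once k reaches two or three digits.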
Types ¶
type BenchmarkResult ¶
type BenchmarkResult struct {
    QuestionID  string  `json:"question_id"`
    Question    string  `json:"question"`
    GroundTruth string  `json:"ground_truth"`
    Retrieved   string  `json:"retrieved"`     // oracle-selected best candidate (highest token-F1 vs. ground truth)
    ExactMatch  bool    `json:"exact_match"`   // Retrieved contains GroundTruth (case-insensitive); oracle-selected, not top-ranked
    F1Score     float64 `json:"f1_score"`      // token-F1 of Retrieved vs. GroundTruth; oracle-selected, not top-ranked
    RecalledAtK bool    `json:"recalled_at_k"` // was ground truth in any of the top-k results?
}
BenchmarkResult holds the outcome of one QA pair evaluation.
ExactMatch and F1Score are both computed against the oracle-selected best candidate (the top-k result with the highest token-F1 vs. ground truth, chosen by BestCandidate). They measure "could the answer be found anywhere in the top-k?" — an upper-bound / recall-style metric, NOT Precision@1. RecalledAtK is the canonical recall metric; ExactMatch is a stricter variant of the same signal. See eval/README.md § Metrics.
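The difference between a top-ranked metric and a recall-style metric can be seen in a small sketch. recalledAtKSketch is a hypothetical stand-in for RecallAtK; using case-insensitive containment as the hit test is an assumption, since the hit criterion is not spelled out here.

```go
package main

import (
	"fmt"
	"strings"
)

// containsFold reports whether s contains sub, ignoring case.
func containsFold(s, sub string) bool {
	return strings.Contains(strings.ToLower(s), strings.ToLower(sub))
}

// recalledAtKSketch is a hypothetical stand-in for RecallAtK: it reports
// whether any of the first k memories contains the ground truth. The
// containment-based hit test is an assumption of this sketch.
func recalledAtKSketch(memories []string, groundTruth string, k int) bool {
	if k < 0 {
		k = 0
	}
	if k > len(memories) {
		k = len(memories)
	}
	for _, m := range memories[:k] {
		if containsFold(m, groundTruth) {
			return true
		}
	}
	return false
}

func main() {
	ranked := []string{
		"Carol works at Acme",
		"Carol likes tea",
		"Carol was born in Lisbon",
	}
	fmt.Println(recalledAtKSketch(ranked, "Lisbon", 1)) // false: top-ranked result only
	fmt.Println(recalledAtKSketch(ranked, "Lisbon", 3)) // true: anywhere in the top 3
}
```

Here the answer sits at rank 3, so a Precision@1-style check fails while the recall-style check succeeds; the oracle-selected ExactMatch and F1Score behave like the second case.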
type BenchmarkSummary ¶
type BenchmarkSummary struct {
    Name           string  `json:"name"`
    TotalQuestions int     `json:"total_questions"`
    ExactMatchAcc  float64 `json:"exact_match_accuracy"`
    AvgF1          float64 `json:"avg_f1"`
    RecallAtK      float64 `json:"recall_at_k"`
    K              int     `json:"k"`
    // RecallFailures is the number of QA pairs for which the recall call failed
    // (binary error, connectivity issue, etc.). Non-zero values indicate that
    // scores for those pairs are artificially zero and should not be compared
    // against baselines without qualification.
    RecallFailures int               `json:"recall_failures,omitempty"`
    Results        []BenchmarkResult `json:"results"`
}
BenchmarkSummary aggregates results from a single benchmark run.
func Summarize ¶
func Summarize(name string, results []BenchmarkResult, k, recallFailures int) *BenchmarkSummary
Summarize aggregates a slice of BenchmarkResult into a BenchmarkSummary. recallFailures is the number of QA pairs for which the recall step failed; it is recorded in the summary so callers can detect partially-degraded runs.
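A rough sketch of the aggregation, under stated assumptions: result and summary are trimmed, hypothetical mirrors of BenchmarkResult and BenchmarkSummary, and each aggregate is assumed to be a plain mean over all results, so that failed pairs count as zeros.

```go
package main

import "fmt"

// result mirrors only the scored fields of BenchmarkResult (hypothetical).
type result struct {
	ExactMatch  bool
	F1Score     float64
	RecalledAtK bool
}

// summary mirrors only the aggregate fields of BenchmarkSummary (hypothetical).
type summary struct {
	Name           string
	TotalQuestions int
	ExactMatchAcc  float64
	AvgF1          float64
	RecallAtK      float64
	K              int
	RecallFailures int
}

// summarizeSketch assumes each aggregate is a plain mean over all results,
// so pairs whose recall call failed contribute zeros to every average.
func summarizeSketch(name string, results []result, k, recallFailures int) *summary {
	s := &summary{Name: name, TotalQuestions: len(results), K: k, RecallFailures: recallFailures}
	if len(results) == 0 {
		return s // nothing to average; avoid dividing by zero
	}
	var exact, hits int
	var f1Sum float64
	for _, r := range results {
		if r.ExactMatch {
			exact++
		}
		if r.RecalledAtK {
			hits++
		}
		f1Sum += r.F1Score
	}
	n := float64(len(results))
	s.ExactMatchAcc = float64(exact) / n
	s.AvgF1 = f1Sum / n
	s.RecallAtK = float64(hits) / n
	return s
}

func main() {
	results := []result{
		{ExactMatch: true, F1Score: 1.0, RecalledAtK: true},
		{ExactMatch: false, F1Score: 0.5, RecalledAtK: true},
		{}, // recall failed for this pair: every score is zero
		{ExactMatch: true, F1Score: 1.0, RecalledAtK: true},
	}
	s := summarizeSketch("synthetic", results, 5, 1)
	fmt.Printf("EM=%.2f F1=%.3f Recall@%d=%.2f failures=%d\n",
		s.ExactMatchAcc, s.AvgF1, s.K, s.RecallAtK, s.RecallFailures)
}
```

With one failed pair out of four, every average is pulled down by a zero row, which is why a non-zero RecallFailures count should be reported alongside the scores.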
type Client ¶
type Client interface {
    Reset(ctx context.Context) error
    Store(ctx context.Context, content string) error
    Recall(ctx context.Context, query string, limit int) ([]string, error)
}
Client is the interface that benchmark harnesses use to interact with the openclaw-cortex binary. CortexClient implements it; tests can inject a stub.
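A test stub for this interface could be as small as the following sketch. stubClient and demo are hypothetical, and the word-overlap matching in Recall is a deliberately naive placeholder that says nothing about how the real binary ranks memories.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Client is copied from the interface documented above.
type Client interface {
	Reset(ctx context.Context) error
	Store(ctx context.Context, content string) error
	Recall(ctx context.Context, query string, limit int) ([]string, error)
}

// stubClient is a hypothetical in-memory Client for tests.
type stubClient struct {
	memories []string
}

// Reset drops every stored memory.
func (s *stubClient) Reset(ctx context.Context) error {
	s.memories = nil
	return nil
}

// Store appends content to the in-memory list.
func (s *stubClient) Store(ctx context.Context, content string) error {
	s.memories = append(s.memories, content)
	return nil
}

// Recall returns, in insertion order, up to limit memories that contain at
// least one lowercase word of the query. This is a naive placeholder.
func (s *stubClient) Recall(ctx context.Context, query string, limit int) ([]string, error) {
	words := strings.Fields(strings.ToLower(query))
	var out []string
	for _, m := range s.memories {
		if len(out) >= limit {
			break
		}
		lower := strings.ToLower(m)
		for _, w := range words {
			if strings.Contains(lower, w) {
				out = append(out, m)
				break
			}
		}
	}
	return out, nil
}

// Compile-time check that stubClient satisfies Client.
var _ Client = (*stubClient)(nil)

// demo runs one reset/store/recall cycle against the stub.
func demo() ([]string, error) {
	ctx := context.Background()
	var c Client = &stubClient{}
	if err := c.Reset(ctx); err != nil {
		return nil, err
	}
	for _, fact := range []string{"Dana owns a red bicycle", "Evan plays the cello"} {
		if err := c.Store(ctx, fact); err != nil {
			return nil, err
		}
	}
	return c.Recall(ctx, "bicycle", 5)
}

func main() {
	got, err := demo()
	fmt.Println(got, err)
}
```

Because harness code depends only on the Client interface, the same evaluation loop can run against this stub in unit tests and against CortexClient in a full benchmark run.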
type CortexClient ¶
type CortexClient struct {
    // BinaryPath is the path to the openclaw-cortex binary. Defaults to "openclaw-cortex".
    BinaryPath string
    // ConfigPath optionally points to an openclaw-cortex config file.
    ConfigPath string
    // CallTimeout is the per-subprocess deadline for each Reset/Store/Recall call.
    // Zero means use defaultCallTimeout (30 s).
    CallTimeout time.Duration
}
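CortexClient wraps the openclaw-cortex binary by invoking it directly as a subprocess with an explicit argument list (execFile-style, no shell), so stored content and queries cannot cause shell injection. It implements Client.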
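One way to run a binary without a shell in Go is exec.CommandContext from os/exec, sketched below. Whether CortexClient uses exactly this call is an assumption; buildCommand and storeArgv are hypothetical helpers, the command is only constructed and never run, and any flags the real client adds are omitted.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// buildCommand prepares a subprocess invocation without a shell: every
// argument reaches the binary as its own argv entry, unparsed.
func buildCommand(ctx context.Context, binary string, args ...string) *exec.Cmd {
	return exec.CommandContext(ctx, binary, args...)
}

// storeArgv returns the argv that a store call would use for content.
// The command is built under a 30 s deadline but is not executed here;
// real calls may carry additional flags that this sketch leaves out.
func storeArgv(content string) []string {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	cmd := buildCommand(ctx, "openclaw-cortex", "store", content)
	return cmd.Args
}

func main() {
	// Shell metacharacters stay inside a single argument.
	for i, a := range storeArgv(`likes "tea"; rm -rf /tmp/x`) {
		fmt.Printf("argv[%d] = %q\n", i, a)
	}
}
```

Since no shell ever parses the content string, quotes and semicolons in a stored fact are passed through verbatim as data.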
func NewCortexClient ¶
func NewCortexClient(binaryPath, configPath string) *CortexClient
NewCortexClient returns a CortexClient for the given binary and config paths, with the remaining defaults as documented on the CortexClient fields.
func (*CortexClient) Recall ¶
func (c *CortexClient) Recall(ctx context.Context, query string, limit int) ([]string, error)
Recall runs `openclaw-cortex recall --context _ <query>` and returns up to limit memory content strings parsed from the JSON output.
The call passes --budget limit*500, which is a token-based heuristic, not a hard result count. The binary trims its output to that many tokens, so if memories are verbose it may return fewer than limit results, in which case the trailing contents[:limit] slice is a no-op. For the synthetic benchmark datasets (each fact/turn ≤ 30 tokens), 500 tokens per expected result is intentionally generous, which makes under-counting very unlikely in practice.
Note: Memgraph itself may also return fewer than limit items if fewer memories match the query. In that case len(contents) < limit with no error — RecallAtK is evaluated against the actual candidates returned, not a padded set. For the synthetic datasets this is expected and correct.
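The budget arithmetic and the final truncation can be sketched as follows. budgetFor and capResults are hypothetical helpers; only the 500-tokens-per-result factor and the 30-token bound on synthetic facts come from the text above.

```go
package main

import "fmt"

// tokensPerResult is the per-result token allowance described above.
const tokensPerResult = 500

// budgetFor returns the token budget requested for a given result limit.
func budgetFor(limit int) int {
	return limit * tokensPerResult
}

// capResults trims contents to at most limit entries. When the binary has
// already returned fewer than limit results, the slice is left unchanged.
func capResults(contents []string, limit int) []string {
	if limit < 0 {
		limit = 0
	}
	if len(contents) > limit {
		return contents[:limit]
	}
	return contents
}

func main() {
	limit := 5
	fmt.Println("budget:", budgetFor(limit)) // 2500 tokens

	// Facts of at most 30 tokens each need at most limit*30 tokens in total.
	fmt.Println("worst-case tokens needed:", limit*30) // 150 tokens

	returned := []string{"fact one", "fact two", "fact three"}
	fmt.Println(len(capResults(returned, limit))) // 3: fewer than limit, unchanged
}
```

For limit = 5 the budget is 2500 tokens against a worst case of 150 tokens of synthetic facts, which is the headroom the documentation calls intentionally generous.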
func (*CortexClient) Reset ¶
func (c *CortexClient) Reset(ctx context.Context) error
Reset calls `openclaw-cortex reset --yes` to wipe all memories from the store. Used by benchmark harnesses to isolate QA pairs from each other.
func (*CortexClient) Store ¶
func (c *CortexClient) Store(ctx context.Context, content string) error
Store runs `openclaw-cortex store <content>` to persist a fact memory. It passes --scope permanent intentionally: eval facts represent ground-truth knowledge that should outlive a session, and because all eval facts receive the same scope, relative recall scoring is unaffected by scope-boost.
type RecallJSONResult ¶
type RecallJSONResult struct {
    Memory struct {
        Content string `json:"content"`
    } `json:"memory"`
}
RecallJSONResult is a minimal struct for parsing JSON output from `openclaw-cortex recall --context _`.
Schema: matches cmd_recall.go output as of commit e38b3d5f. The binary serializes []models.RecallResult (internal/models/memory.go):
[{"memory":{"content":"..."},...}, ...]
The outer key is "memory" (json:"memory") and the content key is "content" (json:"content"). If the recall command's output schema changes — e.g. the outer wrapper is flattened or the field is renamed — update this struct and TestRecallJSONResultSchema in tests/eval_runner_test.go.
Exported so that tests/eval_runner_test.go can test JSON schema parsing without requiring a live binary (CLAUDE.md: tests live in tests/).
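Parsing the documented schema without a live binary might look like the sketch below. parseRecallOutput is a hypothetical helper, and the "score" key in the sample input is made up purely to show that fields outside the struct are ignored by encoding/json.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RecallJSONResult is copied from the type documented above.
type RecallJSONResult struct {
	Memory struct {
		Content string `json:"content"`
	} `json:"memory"`
}

// parseRecallOutput is a hypothetical helper: it decodes a JSON array of
// recall results and returns the memory content strings in order.
func parseRecallOutput(raw []byte) ([]string, error) {
	var results []RecallJSONResult
	if err := json.Unmarshal(raw, &results); err != nil {
		return nil, fmt.Errorf("parse recall output: %w", err)
	}
	contents := make([]string, 0, len(results))
	for _, r := range results {
		contents = append(contents, r.Memory.Content)
	}
	return contents, nil
}

func main() {
	// "score" is an invented extra key; unknown fields are skipped on decode.
	raw := []byte(`[{"memory":{"content":"Bob lives in Berlin"},"score":0.9},{"memory":{"content":"Bob likes tea"}}]`)
	contents, err := parseRecallOutput(raw)
	fmt.Println(contents, err)
}
```

If the outer "memory" wrapper were flattened or "content" renamed, decoding would still succeed but every Content would be empty, so a schema test should assert on the decoded values and not only on the absence of an error.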