runner

package
v0.10.0 Latest
Warning

This package is not in the latest version of its module.
Published: Mar 22, 2026 License: MIT Imports: 8 Imported by: 0

Documentation

Overview

Package runner provides shared types, a CortexClient wrapper, and scoring functions used by all benchmark harnesses.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func BestCandidate

func BestCandidate(memories []string, groundTruth string) string

BestCandidate picks the memory from the retrieved list that has the highest token-F1 against the ground truth. Falls back to the first result if no candidate scores above zero.
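A self-contained sketch of this selection logic, for illustration only (the word-overlap `score` function below is a simplified stand-in for the package's token-F1, and all names are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// score is a stand-in for token-F1: it counts ground-truth words
// that also appear in the candidate.
func score(candidate, truth string) float64 {
	truthWords := map[string]bool{}
	for _, w := range strings.Fields(strings.ToLower(truth)) {
		truthWords[w] = true
	}
	n := 0
	for _, w := range strings.Fields(strings.ToLower(candidate)) {
		if truthWords[w] {
			n++
		}
	}
	return float64(n)
}

// bestCandidate mirrors the documented behavior: the highest-scoring
// memory wins; the first result is the fallback when nothing scores
// above zero.
func bestCandidate(memories []string, groundTruth string) string {
	if len(memories) == 0 {
		return ""
	}
	best, bestScore := memories[0], 0.0
	for _, m := range memories {
		if s := score(m, groundTruth); s > bestScore {
			best, bestScore = m, s
		}
	}
	return best
}

func main() {
	mems := []string{"user likes tea", "user moved to Lyon in 2021"}
	fmt.Println(bestCandidate(mems, "moved to Lyon")) // user moved to Lyon in 2021
}
```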

func ExactMatch

func ExactMatch(retrieved, groundTruth string) bool

ExactMatch returns true if retrieved contains the ground truth (case-insensitive).

func FormatMarkdownTable

func FormatMarkdownTable(summaries []*BenchmarkSummary, k int) string

FormatMarkdownTable renders a GitHub-flavored markdown results table for a slice of BenchmarkSummary values. k is the recall-at-k value used only for the column header label.

Column widths are fixed except for Recall@k, which grows with k to avoid misalignment for k>=10 (e.g. "Recall@5"=8 chars, "Recall@100"=10 chars).

func RecallAtK

func RecallAtK(memories []string, groundTruth string, k int) bool

RecallAtK checks if any of the top-k retrieved memories contains the ground truth.
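The check can be sketched as a case-insensitive substring scan over the first k candidates; this is an illustration consistent with the ExactMatch semantics above, not the package's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// recallAtK reports whether any of the top-k memories contains the
// ground truth, case-insensitively. If fewer than k memories were
// returned, only the actual candidates are checked.
func recallAtK(memories []string, groundTruth string, k int) bool {
	if k > len(memories) {
		k = len(memories)
	}
	gt := strings.ToLower(groundTruth)
	for _, m := range memories[:k] {
		if strings.Contains(strings.ToLower(m), gt) {
			return true
		}
	}
	return false
}

func main() {
	mems := []string{"user was born in Oslo", "user likes jazz"}
	fmt.Println(recallAtK(mems, "likes jazz", 2)) // true: found at rank 2
}
```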

func TokenF1

func TokenF1(retrieved, groundTruth string) float64

TokenF1 computes token-level F1 between retrieved and ground truth. Returns 0 in all degenerate cases (empty groundTruth, all-punctuation inputs, or no token overlap) so it stays consistent with ExactMatch's false return for empty groundTruth.
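One plausible implementation of token-level F1, matching the degenerate-case behavior described above (the whitespace/lowercase tokenizer is an assumption; the package's tokenizer may differ):

```go
package main

import (
	"fmt"
	"strings"
)

func tokens(s string) []string {
	return strings.Fields(strings.ToLower(s))
}

// tokenF1 computes the harmonic mean of token precision and recall.
// All degenerate cases (no tokens on either side, zero overlap)
// return 0, consistent with the documented contract.
func tokenF1(retrieved, groundTruth string) float64 {
	rt, gt := tokens(retrieved), tokens(groundTruth)
	if len(rt) == 0 || len(gt) == 0 {
		return 0
	}
	counts := map[string]int{}
	for _, t := range gt {
		counts[t]++
	}
	overlap := 0 // multiset intersection size
	for _, t := range rt {
		if counts[t] > 0 {
			counts[t]--
			overlap++
		}
	}
	if overlap == 0 {
		return 0
	}
	p := float64(overlap) / float64(len(rt)) // precision
	r := float64(overlap) / float64(len(gt)) // recall
	return 2 * p * r / (p + r)
}

func main() {
	fmt.Printf("%.3f\n", tokenF1("the cat sat", "cat sat down")) // 0.667
}
```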

Types

type BenchmarkResult

type BenchmarkResult struct {
	QuestionID  string  `json:"question_id"`
	Question    string  `json:"question"`
	GroundTruth string  `json:"ground_truth"`
	Retrieved   string  `json:"retrieved"`     // oracle-selected best candidate (highest token-F1 vs. ground truth)
	ExactMatch  bool    `json:"exact_match"`   // Retrieved contains GroundTruth (case-insensitive); oracle-selected, not top-ranked
	F1Score     float64 `json:"f1_score"`      // token-F1 of Retrieved vs. GroundTruth; oracle-selected, not top-ranked
	RecalledAtK bool    `json:"recalled_at_k"` // was ground truth in any of the top-k results?
}

BenchmarkResult holds the outcome of one QA pair evaluation.

ExactMatch and F1Score are both computed against the oracle-selected best candidate (the top-k result with the highest token-F1 vs. ground truth, chosen by BestCandidate). They measure "could the answer be found anywhere in the top-k?" — an upper-bound / recall-style metric, NOT Precision@1. RecalledAtK is the canonical recall metric; ExactMatch is a stricter variant of the same signal. See eval/README.md § Metrics.

type BenchmarkSummary

type BenchmarkSummary struct {
	Name           string  `json:"name"`
	TotalQuestions int     `json:"total_questions"`
	ExactMatchAcc  float64 `json:"exact_match_accuracy"`
	AvgF1          float64 `json:"avg_f1"`
	RecallAtK      float64 `json:"recall_at_k"`
	K              int     `json:"k"`
	// RecallFailures is the number of QA pairs for which the recall call failed
	// (binary error, connectivity issue, etc.). Non-zero values indicate that
	// scores for those pairs are artificially zero and should not be compared
	// against baselines without qualification.
	RecallFailures int               `json:"recall_failures,omitempty"`
	Results        []BenchmarkResult `json:"results"`
}

BenchmarkSummary aggregates results from a single benchmark run.

func Summarize

func Summarize(name string, results []BenchmarkResult, k, recallFailures int) *BenchmarkSummary

Summarize aggregates a slice of BenchmarkResult into a BenchmarkSummary. recallFailures is the number of QA pairs for which the recall step failed; it is recorded in the summary so callers can detect partially-degraded runs.
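A minimal sketch of the aggregation, assuming the summary fields are simple means over all results (which the field names suggest, but which is not confirmed by this page):

```go
package main

import "fmt"

// benchmarkResult carries only the fields needed for aggregation;
// it mirrors the documented BenchmarkResult.
type benchmarkResult struct {
	ExactMatch  bool
	F1Score     float64
	RecalledAtK bool
}

// summarize computes the three headline metrics as plain means.
func summarize(results []benchmarkResult) (emAcc, avgF1, recall float64) {
	n := float64(len(results))
	if n == 0 {
		return 0, 0, 0
	}
	for _, r := range results {
		if r.ExactMatch {
			emAcc++
		}
		avgF1 += r.F1Score
		if r.RecalledAtK {
			recall++
		}
	}
	return emAcc / n, avgF1 / n, recall / n
}

func main() {
	res := []benchmarkResult{
		{ExactMatch: true, F1Score: 1.0, RecalledAtK: true},
		{ExactMatch: false, F1Score: 0.5, RecalledAtK: true},
	}
	em, f1, r := summarize(res)
	fmt.Println(em, f1, r) // 0.5 0.75 1
}
```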

type Client

type Client interface {
	Reset(ctx context.Context) error
	Store(ctx context.Context, content string) error
	Recall(ctx context.Context, query string, limit int) ([]string, error)
}

Client is the interface that benchmark harnesses use to interact with the openclaw-cortex binary. CortexClient implements it; tests can inject a stub.

type CortexClient

type CortexClient struct {
	// BinaryPath is the path to the openclaw-cortex binary. Defaults to "openclaw-cortex".
	BinaryPath string
	// ConfigPath optionally points to an openclaw-cortex config file.
	ConfigPath string
	// CallTimeout is the per-subprocess deadline for each Reset/Store/Recall call.
	// Zero means use defaultCallTimeout (30 s).
	CallTimeout time.Duration
}

CortexClient wraps the openclaw-cortex binary via execFile (no shell injection). It implements Client.

func NewCortexClient

func NewCortexClient(binaryPath, configPath string) *CortexClient

NewCortexClient returns a CortexClient with sensible defaults.

func (*CortexClient) Recall

func (c *CortexClient) Recall(ctx context.Context, query string, limit int) ([]string, error)

Recall runs `openclaw-cortex recall --context _ <query>` and returns up to limit memory content strings parsed from the JSON output.

--budget limit*500 is a token-based heuristic, not a hard result count. The binary trims output to that many tokens; if memories are verbose the binary may return fewer than limit results, and the trailing contents[:limit] slice becomes a no-op. For the synthetic benchmark datasets (each fact/turn ≤ 30 tokens) 500 tokens per expected result is intentionally generous, making under-counting in practice very unlikely.

Note: Memgraph itself may also return fewer than limit items if fewer memories match the query. In that case len(contents) < limit with no error — RecallAtK is evaluated against the actual candidates returned, not a padded set. For the synthetic datasets this is expected and correct.
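The defensive slice the two notes above describe can be sketched as follows (illustrative, not the package's code): when fewer than limit items come back, truncation must be a no-op rather than an out-of-range panic.

```go
package main

import "fmt"

// clampToLimit returns at most limit entries. Slicing contents[:limit]
// directly would panic when the binary returns fewer results than
// requested, so the length is checked first.
func clampToLimit(contents []string, limit int) []string {
	if len(contents) < limit {
		return contents
	}
	return contents[:limit]
}

func main() {
	short := []string{"only one memory"}
	fmt.Println(len(clampToLimit(short, 5))) // 1: fewer than limit, no padding
}
```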

func (*CortexClient) Reset

func (c *CortexClient) Reset(ctx context.Context) error

Reset calls `openclaw-cortex reset --yes` to wipe all memories from the store. Used by benchmark harnesses to isolate QA pairs from each other.

func (*CortexClient) Store

func (c *CortexClient) Store(ctx context.Context, content string) error

Store runs `openclaw-cortex store <content>` to persist a fact memory. --scope permanent is intentional: eval facts represent ground-truth knowledge that should outlive a session, and all eval facts receive the same scope so relative recall scoring is unaffected by scope-boost.

type RecallJSONResult

type RecallJSONResult struct {
	Memory struct {
		Content string `json:"content"`
	} `json:"memory"`
}

RecallJSONResult is a minimal struct for parsing JSON output from `openclaw-cortex recall --context _`.

Schema: matches cmd_recall.go output as of commit e38b3d5f. The binary serializes []models.RecallResult (internal/models/memory.go):

[{"memory":{"content":"..."},...}, ...]

The outer key is "memory" (json:"memory") and the content key is "content" (json:"content"). If the recall command's output schema changes — e.g. the outer wrapper is flattened or the field is renamed — update this struct and TestRecallJSONResultSchema in tests/eval_runner_test.go.

Exported so that tests/eval_runner_test.go can test JSON schema parsing without requiring a live binary (CLAUDE.md: tests live in tests/).
