Documentation ¶
Overview ¶
Package comparison provides cross-harness comparison, benchmarking, and quorum primitives for the fizeau integration suite.
Index ¶
- func CondenseOutput(input, namespacePrefix string) string
- func QuorumMet(strategy string, threshold int, results []RunResult) bool
- func SaveBenchmarkResult(path string, result *BenchmarkResult) error
- type BenchmarkArm
- type BenchmarkArmSummary
- type BenchmarkPrompt
- type BenchmarkResult
- type BenchmarkSuite
- type BenchmarkSummary
- type CompareOptions
- type ComparisonArm
- type ComparisonRecord
- type QuorumOptions
- type RunFunc
- type RunResult
- type ToolCallEntry
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CondenseOutput ¶
func CondenseOutput(input, namespacePrefix string) string
CondenseOutput filters raw agent output to keep only progress-relevant lines.
Keeps:
- Lines starting with namespacePrefix (e.g. "helix:") — caller progress
- Tool call lines starting with "$ "
- First line following a tool call ("$ cmd") — the result
- Error/warning/fail/panic lines
- Lines containing issue IDs, commit SHAs, or status keywords
- ALLCAPS label lines (e.g. "PHASE 1:", "STATUS:")
- Markdown headings (#), table rows (|), bold markers (**)
- Phase/step markers (Phase, Step, ---)
Drops:
- Raw diff hunks (diff --, @@ headers and +/-/context lines)
- Codex boilerplate ("Commands run:", "tokens used" footer)
- Consecutive blank lines (at most one emitted between kept sections)
- All other verbose output
Full raw output should be preserved separately before condensing. namespacePrefix is the caller-specific prefix (e.g. "helix:"); pass an empty string to disable namespace-prefix matching.
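The keep/drop rules above can be sketched as a line filter. This is a simplified, self-contained illustration of the approach, not the package's implementation: it handles only a few of the keep categories (namespace prefix, tool calls and their first result line, error lines) plus diff-hunk dropping and blank-line collapsing.

```go
package main

import (
	"fmt"
	"strings"
)

// condense is a simplified sketch of CondenseOutput's filtering:
// keep progress-relevant lines, drop raw diff hunks, and emit at
// most one blank line between kept sections. The real function
// keeps more categories (status keywords, ALLCAPS labels,
// markdown headings, phase markers, ...).
func condense(input, namespacePrefix string) string {
	var kept []string
	blankPending := false
	keepNext := false // first line after a "$ cmd" tool call is its result
	for _, line := range strings.Split(input, "\n") {
		trimmed := strings.TrimSpace(line)
		switch {
		case trimmed == "":
			blankPending = len(kept) > 0 // collapse runs of blanks
			continue
		case strings.HasPrefix(trimmed, "diff --"),
			strings.HasPrefix(trimmed, "@@"):
			keepNext = false
			continue // raw diff hunks are dropped
		}
		keep := keepNext ||
			(namespacePrefix != "" && strings.HasPrefix(trimmed, namespacePrefix)) ||
			strings.HasPrefix(trimmed, "$ ") ||
			strings.Contains(strings.ToLower(trimmed), "error")
		keepNext = strings.HasPrefix(trimmed, "$ ")
		if !keep {
			continue
		}
		if blankPending {
			kept = append(kept, "")
			blankPending = false
		}
		kept = append(kept, trimmed)
	}
	return strings.Join(kept, "\n")
}

func main() {
	raw := "helix: starting\n\n\nsome chatter\n$ go test ./...\nok pkg 0.2s\ndiff --git a/x b/x\n@@ -1 +1 @@\nerror: flaky test"
	fmt.Println(condense(raw, "helix:"))
}
```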
func QuorumMet ¶
func QuorumMet(strategy string, threshold int, results []RunResult) bool
QuorumMet reports whether the run results satisfy the given quorum strategy, or the numeric threshold when strategy is empty.
func SaveBenchmarkResult ¶
func SaveBenchmarkResult(path string, result *BenchmarkResult) error
SaveBenchmarkResult writes a benchmark result to a JSON file.
Types ¶
type BenchmarkArm ¶
type BenchmarkArm struct {
Label string `json:"label"`
Harness string `json:"harness"`
Tier string `json:"tier,omitempty"` // "smart" | "standard" | "cheap"
Model string `json:"model,omitempty"` // explicit override
}
BenchmarkArm defines one arm in a benchmark suite.
type BenchmarkArmSummary ¶
type BenchmarkArmSummary struct {
Label string `json:"label"`
Completed int `json:"completed"`
Failed int `json:"failed"`
TotalTokens int `json:"total_tokens"`
TotalCostUSD float64 `json:"total_cost_usd"`
AvgDurationMS int `json:"avg_duration_ms"`
AvgScore float64 `json:"avg_score,omitempty"`
}
BenchmarkArmSummary aggregates stats for one arm across all prompts.
type BenchmarkPrompt ¶
type BenchmarkPrompt struct {
ID string `json:"id"`
Name string `json:"name"`
Description string `json:"description,omitempty"`
Prompt string `json:"prompt"` // inline prompt text
PromptFile string `json:"prompt_file,omitempty"` // or path to prompt file
Tags []string `json:"tags,omitempty"`
MaxTokens int `json:"max_tokens,omitempty"`
}
BenchmarkPrompt is a single test case in a benchmark suite.
type BenchmarkResult ¶
type BenchmarkResult struct {
Suite string `json:"suite"`
Version string `json:"version"`
Timestamp time.Time `json:"timestamp"`
Arms []BenchmarkArm `json:"arms"`
Comparisons []ComparisonRecord `json:"comparisons"`
Summary BenchmarkSummary `json:"summary"`
}
BenchmarkResult is the output of running a full benchmark suite.
func RunBenchmark ¶
func RunBenchmark(run RunFunc, suite *BenchmarkSuite) (*BenchmarkResult, error)
RunBenchmark executes all prompts in a suite against all arms. The run function is called once per (arm, prompt) pair.
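The dispatch shape described above, one call per (arm, prompt) pair, can be sketched with simplified stand-in types (the package's real types carry many more fields):

```go
package main

import "fmt"

// runResult and runFunc are local stand-ins for the package's
// RunResult and RunFunc, reduced to what the loop needs.
type runResult struct {
	Tokens   int
	ExitCode int
}
type runFunc func(harness, prompt string) runResult

// runBenchmark sketches RunBenchmark's dispatch loop: the run
// function is invoked once per (arm, prompt) pair, and per-arm
// stats are aggregated (here, just token totals).
func runBenchmark(run runFunc, arms, prompts []string) map[string]int {
	totals := map[string]int{}
	for _, arm := range arms {
		for _, p := range prompts {
			res := run(arm, p)
			totals[arm] += res.Tokens
		}
	}
	return totals
}

func main() {
	// A fake run function standing in for a real execution engine.
	fake := func(harness, prompt string) runResult {
		return runResult{Tokens: len(prompt), ExitCode: 0}
	}
	totals := runBenchmark(fake, []string{"arm-a", "arm-b"}, []string{"fix bug", "add test"})
	fmt.Println(totals["arm-a"], totals["arm-b"])
}
```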
type BenchmarkSuite ¶
type BenchmarkSuite struct {
Name string `json:"name"`
Description string `json:"description,omitempty"`
Version string `json:"version"`
Arms []BenchmarkArm `json:"arms"`
Prompts []BenchmarkPrompt `json:"prompts"`
Sandbox bool `json:"sandbox,omitempty"`
PostRun string `json:"post_run,omitempty"`
Timeout string `json:"timeout,omitempty"`
}
BenchmarkSuite defines a repeatable set of comparison runs.
func LoadBenchmarkSuite ¶
func LoadBenchmarkSuite(path string) (*BenchmarkSuite, error)
LoadBenchmarkSuite reads a benchmark suite from a JSON file.
type BenchmarkSummary ¶
type BenchmarkSummary struct {
TotalPrompts int `json:"total_prompts"`
Arms []BenchmarkArmSummary `json:"arms"`
}
BenchmarkSummary aggregates stats across all arms and prompts.
type CompareOptions ¶
type CompareOptions struct {
Harnesses []string // harnesses to compare
ArmModels map[int]string // per-arm model overrides keyed by arm index
ArmLabels map[int]string // per-arm display labels
Prompt string // prompt text
WorkDir string // working directory for worktree operations
Sandbox bool // run each arm in an isolated worktree
KeepSandbox bool // preserve worktrees after comparison
PostRun string // command to run in each worktree after the agent completes
}
CompareOptions configures a comparison dispatch.
type ComparisonArm ¶
type ComparisonArm struct {
Harness string `json:"harness"`
Model string `json:"model,omitempty"`
Output string `json:"output"`
Diff string `json:"diff,omitempty"` // git diff of side effects
ToolCalls []ToolCallEntry `json:"tool_calls,omitempty"` // agent tool call log
PostRunOut string `json:"post_run_out,omitempty"` // post-run command output
PostRunOK *bool `json:"post_run_ok,omitempty"` // post-run pass/fail
Tokens int `json:"tokens,omitempty"`
InputTokens int `json:"input_tokens,omitempty"`
OutputTokens int `json:"output_tokens,omitempty"`
CostUSD float64 `json:"cost_usd,omitempty"`
DurationMS int `json:"duration_ms"`
ExitCode int `json:"exit_code"`
Error string `json:"error,omitempty"`
}
ComparisonArm holds the result of one harness arm in a comparison.
type ComparisonRecord ¶
type ComparisonRecord struct {
ID string `json:"id"`
Timestamp time.Time `json:"timestamp"`
Prompt string `json:"prompt"`
Arms []ComparisonArm `json:"arms"`
}
ComparisonRecord is the complete record of a comparison run.
func RunCompare ¶
func RunCompare(run RunFunc, opts CompareOptions) (*ComparisonRecord, error)
RunCompare dispatches the same prompt to multiple harnesses, optionally in isolated worktrees, and returns a ComparisonRecord.
type QuorumOptions ¶
type QuorumOptions struct {
Harnesses []string // multiple harnesses to invoke
Strategy string // "any", "majority", "unanimous", or "" to use Threshold
Threshold int // numeric success threshold (used when Strategy is "")
Prompt string
Model string
}
QuorumOptions configures a quorum dispatch.
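The strategies named above can be sketched as a predicate over run outcomes. This is an illustrative stand-alone version, with the assumption (not stated by the package) that a zero exit code counts as success:

```go
package main

import "fmt"

// quorumMet sketches the strategies QuorumOptions describes:
// "any", "majority", "unanimous", or a numeric threshold when
// Strategy is empty. Success is assumed to mean exit code 0.
func quorumMet(strategy string, threshold int, exitCodes []int) bool {
	ok := 0
	for _, c := range exitCodes {
		if c == 0 {
			ok++
		}
	}
	switch strategy {
	case "any":
		return ok >= 1
	case "majority":
		return ok*2 > len(exitCodes) // strictly more than half
	case "unanimous":
		return ok == len(exitCodes)
	default: // empty strategy: fall back to the numeric threshold
		return ok >= threshold
	}
}

func main() {
	codes := []int{0, 0, 1} // two of three harnesses succeeded
	fmt.Println(quorumMet("majority", 0, codes))
	fmt.Println(quorumMet("unanimous", 0, codes))
	fmt.Println(quorumMet("", 2, codes))
}
```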
type RunFunc ¶
RunFunc is the single-invocation primitive that RunCompare and RunQuorum drive. It receives a harness name and a prompt, and returns a RunResult. Callers wire this to whatever execution engine they use (DDx Runner.Run, agent service.Execute, etc.).
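The wiring described above is an adapter: convert whatever result type a concrete execution engine returns into the minimal shape the comparison engine consumes. A sketch with hypothetical local types (engineResult stands in for any concrete engine result; the package's RunResult carries more fields):

```go
package main

import "fmt"

// runResult is a reduced local stand-in for the package's RunResult.
type runResult struct {
	Harness  string
	Output   string
	ExitCode int
}

// engineResult is a hypothetical concrete result type from some
// execution engine, used only to illustrate the adapter pattern.
type engineResult struct {
	Stdout string
	Code   int
}

// adapt wires a concrete engine to the RunFunc shape: given a
// harness name and a prompt, invoke the engine and map its result
// into the minimal shape the comparison engine needs.
func adapt(exec func(harness, prompt string) engineResult) func(string, string) runResult {
	return func(harness, prompt string) runResult {
		r := exec(harness, prompt)
		return runResult{Harness: harness, Output: r.Stdout, ExitCode: r.Code}
	}
}

func main() {
	fakeExec := func(harness, prompt string) engineResult {
		return engineResult{Stdout: harness + ": done", Code: 0}
	}
	run := adapt(fakeExec)
	res := run("claude", "fix the bug")
	fmt.Println(res.Harness, res.Output, res.ExitCode)
}
```

The same pattern applies to wiring RunCompare or RunQuorum to a real engine: only the mapping body changes.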
type RunResult ¶
type RunResult struct {
Harness string
Model string
Output string
ToolCalls []ToolCallEntry
Tokens int
InputTokens int
OutputTokens int
CostUSD float64
DurationMS int
ExitCode int
Error string
}
RunResult is the minimal result shape the comparison engine needs from a single harness invocation. Callers adapt their concrete result type (e.g. agent.Result in DDx, or service-level events in agent) to RunResult.