comparison

package
v0.10.7
Published: May 5, 2026 License: MIT Imports: 10 Imported by: 0

Documentation

Overview

Package comparison provides cross-harness comparison, benchmarking, and quorum primitives for the fizeau integration suite.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CondenseOutput

func CondenseOutput(input, namespacePrefix string) string

CondenseOutput filters raw agent output to keep only progress-relevant lines.

Keeps:

  • Lines starting with namespacePrefix (e.g. "helix:") — caller progress
  • Tool call lines starting with "$ "
  • First line following a tool call ("$ cmd") — the result
  • Error/warning/fail/panic lines
  • Lines containing issue IDs, commit SHAs, or status keywords
  • ALLCAPS label lines (e.g. "PHASE 1:", "STATUS:")
  • Markdown headings (#), table rows (|), bold markers (**)
  • Phase/step markers (Phase, Step, ---)

Drops:

  • Raw diff hunks (diff --, @@ headers and +/-/context lines)
  • Codex boilerplate ("Commands run:", "tokens used" footer)
  • Consecutive blank lines (at most one emitted between kept sections)
  • All other verbose output

The full raw output should be preserved separately before condensing. namespacePrefix is the caller-specific prefix (e.g. "helix:"); pass an empty string to disable namespace-prefix matching.
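
A minimal usage sketch (the "run.log" path and "helix:" prefix are illustrative, not part of this package):

data, err := os.ReadFile("run.log") // hypothetical saved raw log
if err != nil {
	log.Fatal(err)
}
condensed := comparison.CondenseOutput(string(data), "helix:")
fmt.Println(condensed)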

func QuorumMet

func QuorumMet(strategy string, threshold int, results []RunResult) bool

QuorumMet reports whether enough results succeeded to satisfy the given strategy and threshold.
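
For example, given results collected from RunQuorum (the threshold is shown as 0 on the assumption it is unused for named strategies; see QuorumOptions):

if comparison.QuorumMet("majority", 0, results) {
	fmt.Println("quorum reached")
}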

func SaveBenchmarkResult

func SaveBenchmarkResult(path string, result *BenchmarkResult) error

SaveBenchmarkResult writes a benchmark result to a JSON file.
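
A usage sketch (the output path is illustrative):

if err := comparison.SaveBenchmarkResult("results/benchmark.json", result); err != nil {
	log.Fatal(err)
}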

Types

type BenchmarkArm

type BenchmarkArm struct {
	Label   string `json:"label"`
	Harness string `json:"harness"`
	Tier    string `json:"tier,omitempty"`  // "smart" | "standard" | "cheap"
	Model   string `json:"model,omitempty"` // explicit override
}

BenchmarkArm defines one arm in a benchmark suite.

type BenchmarkArmSummary

type BenchmarkArmSummary struct {
	Label         string  `json:"label"`
	Completed     int     `json:"completed"`
	Failed        int     `json:"failed"`
	TotalTokens   int     `json:"total_tokens"`
	TotalCostUSD  float64 `json:"total_cost_usd"`
	AvgDurationMS int     `json:"avg_duration_ms"`
	AvgScore      float64 `json:"avg_score,omitempty"`
}

BenchmarkArmSummary aggregates stats for one arm across all prompts.

type BenchmarkPrompt

type BenchmarkPrompt struct {
	ID          string   `json:"id"`
	Name        string   `json:"name"`
	Description string   `json:"description,omitempty"`
	Prompt      string   `json:"prompt"`                // inline prompt text
	PromptFile  string   `json:"prompt_file,omitempty"` // or path to prompt file
	Tags        []string `json:"tags,omitempty"`
	MaxTokens   int      `json:"max_tokens,omitempty"`
}

BenchmarkPrompt is a single test case in a benchmark suite.

type BenchmarkResult

type BenchmarkResult struct {
	Suite       string             `json:"suite"`
	Version     string             `json:"version"`
	Timestamp   time.Time          `json:"timestamp"`
	Arms        []BenchmarkArm     `json:"arms"`
	Comparisons []ComparisonRecord `json:"comparisons"`
	Summary     BenchmarkSummary   `json:"summary"`
}

BenchmarkResult is the output of running a full benchmark suite.

func RunBenchmark

func RunBenchmark(run RunFunc, suite *BenchmarkSuite) (*BenchmarkResult, error)

RunBenchmark executes all prompts in a suite against all arms. The run function is called once per (arm, prompt) pair.
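
A minimal sketch with a stub RunFunc and an inline suite; real callers would wire run to their execution engine, and the labels, harness name, and prompt are illustrative:

run := func(harness, model, prompt string) comparison.RunResult {
	// Hypothetical stub: a real implementation dispatches to the harness.
	return comparison.RunResult{Harness: harness, Model: model, Output: "ok"}
}
suite := &comparison.BenchmarkSuite{
	Name:    "smoke",
	Version: "1",
	Arms:    []comparison.BenchmarkArm{{Label: "claude", Harness: "claude"}},
	Prompts: []comparison.BenchmarkPrompt{{ID: "p1", Name: "hello", Prompt: "Say hello."}},
}
result, err := comparison.RunBenchmark(run, suite)
if err != nil {
	log.Fatal(err)
}
fmt.Println(result.Summary.TotalPrompts)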

type BenchmarkSuite

type BenchmarkSuite struct {
	Name        string            `json:"name"`
	Description string            `json:"description,omitempty"`
	Version     string            `json:"version"`
	Arms        []BenchmarkArm    `json:"arms"`
	Prompts     []BenchmarkPrompt `json:"prompts"`
	Sandbox     bool              `json:"sandbox,omitempty"`
	PostRun     string            `json:"post_run,omitempty"`
	Timeout     string            `json:"timeout,omitempty"`
}

BenchmarkSuite defines a repeatable set of comparison runs.

func LoadBenchmarkSuite

func LoadBenchmarkSuite(path string) (*BenchmarkSuite, error)

LoadBenchmarkSuite reads a benchmark suite from a JSON file.
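
A sketch of loading a suite definition from disk (the path is illustrative):

suite, err := comparison.LoadBenchmarkSuite("suites/core.json")
if err != nil {
	log.Fatal(err)
}
fmt.Println(suite.Name, len(suite.Prompts))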

type BenchmarkSummary

type BenchmarkSummary struct {
	TotalPrompts int                   `json:"total_prompts"`
	Arms         []BenchmarkArmSummary `json:"arms"`
}

BenchmarkSummary aggregates stats across all arms and prompts.

type CompareOptions

type CompareOptions struct {
	Harnesses   []string       // harnesses to compare
	ArmModels   map[int]string // per-arm model overrides keyed by arm index
	ArmLabels   map[int]string // per-arm display labels
	Prompt      string         // prompt text
	WorkDir     string         // working directory for worktree operations
	Sandbox     bool           // run each arm in an isolated worktree
	KeepSandbox bool           // preserve worktrees after comparison
	PostRun     string         // command to run in each worktree after the agent completes
}

CompareOptions configures a comparison dispatch.

type ComparisonArm

type ComparisonArm struct {
	Harness      string          `json:"harness"`
	Model        string          `json:"model,omitempty"`
	Output       string          `json:"output"`
	Diff         string          `json:"diff,omitempty"`         // git diff of side effects
	ToolCalls    []ToolCallEntry `json:"tool_calls,omitempty"`   // agent tool call log
	PostRunOut   string          `json:"post_run_out,omitempty"` // post-run command output
	PostRunOK    *bool           `json:"post_run_ok,omitempty"`  // post-run pass/fail
	Tokens       int             `json:"tokens,omitempty"`
	InputTokens  int             `json:"input_tokens,omitempty"`
	OutputTokens int             `json:"output_tokens,omitempty"`
	CostUSD      float64         `json:"cost_usd,omitempty"`
	DurationMS   int             `json:"duration_ms"`
	ExitCode     int             `json:"exit_code"`
	Error        string          `json:"error,omitempty"`
}

ComparisonArm holds the result of one harness arm in a comparison.

type ComparisonRecord

type ComparisonRecord struct {
	ID        string          `json:"id"`
	Timestamp time.Time       `json:"timestamp"`
	Prompt    string          `json:"prompt"`
	Arms      []ComparisonArm `json:"arms"`
}

ComparisonRecord is the complete record of a comparison run.

func RunCompare

func RunCompare(run RunFunc, opts CompareOptions) (*ComparisonRecord, error)

RunCompare dispatches the same prompt to multiple harnesses, optionally in isolated worktrees, and returns a ComparisonRecord.
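
A sketch of a two-arm sandboxed comparison, assuming run is a RunFunc as described below; the harness names, prompt, and repository path are illustrative:

opts := comparison.CompareOptions{
	Harnesses: []string{"claude", "codex"},
	Prompt:    "Refactor the parser.",
	WorkDir:   "/path/to/repo", // illustrative
	Sandbox:   true,
}
rec, err := comparison.RunCompare(run, opts)
if err != nil {
	log.Fatal(err)
}
for _, arm := range rec.Arms {
	fmt.Println(arm.Harness, arm.ExitCode, arm.DurationMS)
}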

type QuorumOptions

type QuorumOptions struct {
	Harnesses []string // multiple harnesses to invoke
	Strategy  string   // any, majority, unanimous, or numeric
	Threshold int      // numeric threshold (when Strategy is "")
	Prompt    string
	Model     string
}

QuorumOptions configures a quorum dispatch.

type RunFunc

type RunFunc func(harness, model, prompt string) RunResult

RunFunc is the single-invocation primitive that RunCompare and RunQuorum drive. It receives a harness name, a model, and a prompt, and returns a RunResult. Callers wire this to whatever execution engine they use (DDx Runner.Run, agent service.Execute, etc.).
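
A sketch of such an adapter; engine and its Execute method are assumptions standing in for a caller's execution engine, not part of this package:

var run comparison.RunFunc = func(harness, model, prompt string) comparison.RunResult {
	res := engine.Execute(harness, model, prompt) // hypothetical engine call
	return comparison.RunResult{
		Harness:    harness,
		Model:      model,
		Output:     res.Output,
		Tokens:     res.Tokens,
		DurationMS: res.DurationMS,
		ExitCode:   res.ExitCode,
	}
}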

type RunResult

type RunResult struct {
	Harness      string
	Model        string
	Output       string
	ToolCalls    []ToolCallEntry
	Tokens       int
	InputTokens  int
	OutputTokens int
	CostUSD      float64
	DurationMS   int
	ExitCode     int
	Error        string
}

RunResult is the minimal result shape the comparison engine needs from a single harness invocation. Callers adapt their concrete result type (e.g. agent.Result in DDx, or service-level events in agent) to RunResult.

func RunQuorum

func RunQuorum(run RunFunc, opts QuorumOptions) ([]RunResult, error)

RunQuorum invokes multiple harnesses and evaluates consensus. It returns the results from all harnesses; use QuorumMet to check success.
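
A sketch combining RunQuorum and QuorumMet, assuming run is a RunFunc as above; the harness names and prompt are illustrative:

opts := comparison.QuorumOptions{
	Harnesses: []string{"claude", "codex", "gemini"},
	Strategy:  "majority",
	Prompt:    "Does this diff introduce a race? Answer yes or no.",
}
results, err := comparison.RunQuorum(run, opts)
if err != nil {
	log.Fatal(err)
}
if comparison.QuorumMet(opts.Strategy, opts.Threshold, results) {
	fmt.Println("consensus reached")
}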

type ToolCallEntry

type ToolCallEntry struct {
	Tool     string `json:"tool"`
	Input    string `json:"input"`
	Output   string `json:"output,omitempty"`
	Duration int    `json:"duration_ms,omitempty"`
	Error    string `json:"error,omitempty"`
}

ToolCallEntry records one tool execution during an agent run.
