Documentation ¶
Overview ¶
Package comparison provides cross-harness comparison, benchmarking, and quorum primitives for the fizeau integration suite.
Index ¶
- func CondenseOutput(input, namespacePrefix string) string
- func QuorumMet(strategy string, threshold int, results []RunResult) bool
- func SaveBenchmarkResult(path string, result *BenchmarkResult) error
- type BenchmarkArm
- type BenchmarkArmSummary
- type BenchmarkPrompt
- type BenchmarkResult
- type BenchmarkSuite
- type BenchmarkSummary
- type CompareOptions
- type ComparisonArm
- type ComparisonRecord
- type QuorumOptions
- type RunFunc
- type RunResult
- type ToolCallEntry
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CondenseOutput ¶
func CondenseOutput(input, namespacePrefix string) string
CondenseOutput filters raw agent output to keep only progress-relevant lines.
Keeps:
- Lines starting with namespacePrefix (e.g. "helix:") — caller progress
- Tool call lines starting with "$ "
- First line following a tool call ("$ cmd") — the result
- Error/warning/fail/panic lines
- Lines containing issue IDs, commit SHAs, or status keywords
- ALLCAPS label lines (e.g. "PHASE 1:", "STATUS:")
- Markdown headings (#), table rows (|), bold markers (**)
- Phase/step markers (Phase, Step, ---)
Drops:
- Raw diff hunks (diff --, @@ headers and +/-/context lines)
- Codex boilerplate ("Commands run:", "tokens used" footer)
- Consecutive blank lines (at most one emitted between kept sections)
- All other verbose output
Full raw output should be preserved separately before condensing. namespacePrefix is the caller-specific prefix (e.g. "helix:"); pass an empty string to disable namespace-prefix matching.
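The keep/drop rules above can be sketched as a line filter. This is a simplified, self-contained illustration of the approach, not the package's implementation: it handles only a few of the keep categories (namespace prefix, tool calls and their first result line, error lines) plus diff-hunk dropping and blank-line collapsing.

```go
package main

import (
	"fmt"
	"strings"
)

// condense is a simplified sketch of CondenseOutput's filtering:
// keep progress-relevant lines, drop raw diff hunks, and emit at
// most one blank line between kept sections. The real function
// keeps more categories (status keywords, ALLCAPS labels,
// markdown headings, phase markers, ...).
func condense(input, namespacePrefix string) string {
	var kept []string
	blankPending := false
	keepNext := false // first line after a "$ cmd" tool call is its result
	for _, line := range strings.Split(input, "\n") {
		trimmed := strings.TrimSpace(line)
		switch {
		case trimmed == "":
			blankPending = len(kept) > 0 // collapse runs of blanks
			continue
		case strings.HasPrefix(trimmed, "diff --"),
			strings.HasPrefix(trimmed, "@@"):
			keepNext = false
			continue // raw diff hunks are dropped
		}
		keep := keepNext ||
			(namespacePrefix != "" && strings.HasPrefix(trimmed, namespacePrefix)) ||
			strings.HasPrefix(trimmed, "$ ") ||
			strings.Contains(strings.ToLower(trimmed), "error")
		keepNext = strings.HasPrefix(trimmed, "$ ")
		if !keep {
			continue
		}
		if blankPending {
			kept = append(kept, "")
			blankPending = false
		}
		kept = append(kept, trimmed)
	}
	return strings.Join(kept, "\n")
}

func main() {
	raw := "helix: starting\n\n\nsome chatter\n$ go test ./...\nok pkg 0.2s\ndiff --git a/x b/x\n@@ -1 +1 @@\nerror: flaky test"
	fmt.Println(condense(raw, "helix:"))
}
```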
func QuorumMet ¶
func QuorumMet(strategy string, threshold int, results []RunResult) bool
QuorumMet reports whether the run results satisfy the given quorum strategy, or the numeric threshold when strategy is empty.
func SaveBenchmarkResult ¶
func SaveBenchmarkResult(path string, result *BenchmarkResult) error
SaveBenchmarkResult writes a benchmark result to a JSON file.
Types ¶
type BenchmarkArm ¶
type BenchmarkArm struct {
Label string `json:"label"`
Harness string `json:"harness"`
Tier string `json:"tier,omitempty"` // "smart" | "standard" | "cheap"
Model string `json:"model,omitempty"` // explicit override
}
BenchmarkArm defines one arm in a benchmark suite.
type BenchmarkArmSummary ¶
type BenchmarkArmSummary struct {
Label string `json:"label"`
Completed int `json:"completed"`
Failed int `json:"failed"`
TotalTokens int `json:"total_tokens"`
TotalCostUSD float64 `json:"total_cost_usd"`
AvgDurationMS int `json:"avg_duration_ms"`
AvgScore float64 `json:"avg_score,omitempty"`
}
BenchmarkArmSummary aggregates stats for one arm across all prompts.
type BenchmarkPrompt ¶
type BenchmarkPrompt struct {
ID string `json:"id"`
Name string `json:"name"`
Description string `json:"description,omitempty"`
Prompt string `json:"prompt"` // inline prompt text
PromptFile string `json:"prompt_file,omitempty"` // or path to prompt file
Tags []string `json:"tags,omitempty"`
MaxTokens int `json:"max_tokens,omitempty"`
}
BenchmarkPrompt is a single test case in a benchmark suite.
type BenchmarkResult ¶
type BenchmarkResult struct {
Suite string `json:"suite"`
Version string `json:"version"`
Timestamp time.Time `json:"timestamp"`
Arms []BenchmarkArm `json:"arms"`
Comparisons []ComparisonRecord `json:"comparisons"`
Summary BenchmarkSummary `json:"summary"`
}
BenchmarkResult is the output of running a full benchmark suite.
func RunBenchmark ¶
func RunBenchmark(run RunFunc, suite *BenchmarkSuite) (*BenchmarkResult, error)
RunBenchmark executes all prompts in a suite against all arms. The run function is called once per (arm, prompt) pair.
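The dispatch shape described above, one call per (arm, prompt) pair, can be sketched with simplified stand-in types (the package's real types carry many more fields):

```go
package main

import "fmt"

// runResult and runFunc are local stand-ins for the package's
// RunResult and RunFunc, reduced to what the loop needs.
type runResult struct {
	Tokens   int
	ExitCode int
}
type runFunc func(harness, prompt string) runResult

// runBenchmark sketches RunBenchmark's dispatch loop: the run
// function is invoked once per (arm, prompt) pair, and per-arm
// stats are aggregated (here, just token totals).
func runBenchmark(run runFunc, arms, prompts []string) map[string]int {
	totals := map[string]int{}
	for _, arm := range arms {
		for _, p := range prompts {
			res := run(arm, p)
			totals[arm] += res.Tokens
		}
	}
	return totals
}

func main() {
	// A fake run function standing in for a real execution engine.
	fake := func(harness, prompt string) runResult {
		return runResult{Tokens: len(prompt), ExitCode: 0}
	}
	totals := runBenchmark(fake, []string{"arm-a", "arm-b"}, []string{"fix bug", "add test"})
	fmt.Println(totals["arm-a"], totals["arm-b"])
}
```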
type BenchmarkSuite ¶
type BenchmarkSuite struct {
Name string `json:"name"`
Description string `json:"description,omitempty"`
Version string `json:"version"`
Arms []BenchmarkArm `json:"arms"`
Prompts []BenchmarkPrompt `json:"prompts"`
Sandbox bool `json:"sandbox,omitempty"`
PostRun string `json:"post_run,omitempty"`
Timeout string `json:"timeout,omitempty"`
}
BenchmarkSuite defines a repeatable set of comparison runs.
func LoadBenchmarkSuite ¶
func LoadBenchmarkSuite(path string) (*BenchmarkSuite, error)
LoadBenchmarkSuite reads a benchmark suite from a JSON file.
type BenchmarkSummary ¶
type BenchmarkSummary struct {
TotalPrompts int `json:"total_prompts"`
Arms []BenchmarkArmSummary `json:"arms"`
}
BenchmarkSummary aggregates stats across all arms and prompts.
type CompareOptions ¶
type CompareOptions struct {
Harnesses []string // harnesses to compare
ArmModels map[int]string // per-arm model overrides keyed by arm index
ArmLabels map[int]string // per-arm display labels
Prompt string // prompt text
WorkDir string // working directory for worktree operations
Sandbox bool // run each arm in an isolated worktree
KeepSandbox bool // preserve worktrees after comparison
PostRun string // command to run in each worktree after the agent completes
}
CompareOptions configures a comparison dispatch.
type ComparisonArm ¶
type ComparisonArm struct {
Harness string `json:"harness"`
Model string `json:"model,omitempty"`
Output string `json:"output"`
Diff string `json:"diff,omitempty"` // git diff of side effects
ToolCalls []ToolCallEntry `json:"tool_calls,omitempty"` // agent tool call log
PostRunOut string `json:"post_run_out,omitempty"` // post-run command output
PostRunOK *bool `json:"post_run_ok,omitempty"` // post-run pass/fail
Tokens int `json:"tokens,omitempty"`
InputTokens int `json:"input_tokens,omitempty"`
OutputTokens int `json:"output_tokens,omitempty"`
CostUSD float64 `json:"cost_usd,omitempty"`
DurationMS int `json:"duration_ms"`
ExitCode int `json:"exit_code"`
Error string `json:"error,omitempty"`
}
ComparisonArm holds the result of one harness arm in a comparison.
type ComparisonRecord ¶
type ComparisonRecord struct {
ID string `json:"id"`
Timestamp time.Time `json:"timestamp"`
Prompt string `json:"prompt"`
Arms []ComparisonArm `json:"arms"`
}
ComparisonRecord is the complete record of a comparison run.
func RunCompare ¶
func RunCompare(run RunFunc, opts CompareOptions) (*ComparisonRecord, error)
RunCompare dispatches the same prompt to multiple harnesses, optionally in isolated worktrees, and returns a ComparisonRecord.
type QuorumOptions ¶
type QuorumOptions struct {
Harnesses []string // multiple harnesses to invoke
Strategy string // "any", "majority", "unanimous", or "" to use Threshold
Threshold int // numeric success threshold (used when Strategy is "")
Prompt string
Model string
}
QuorumOptions configures a quorum dispatch.
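The strategies named above can be sketched as a predicate over run outcomes. This is an illustrative stand-alone version, with the assumption (not stated by the package) that a zero exit code counts as success:

```go
package main

import "fmt"

// quorumMet sketches the strategies QuorumOptions describes:
// "any", "majority", "unanimous", or a numeric threshold when
// Strategy is empty. Success is assumed to mean exit code 0.
func quorumMet(strategy string, threshold int, exitCodes []int) bool {
	ok := 0
	for _, c := range exitCodes {
		if c == 0 {
			ok++
		}
	}
	switch strategy {
	case "any":
		return ok >= 1
	case "majority":
		return ok*2 > len(exitCodes) // strictly more than half
	case "unanimous":
		return ok == len(exitCodes)
	default: // empty strategy: fall back to the numeric threshold
		return ok >= threshold
	}
}

func main() {
	codes := []int{0, 0, 1} // two of three harnesses succeeded
	fmt.Println(quorumMet("majority", 0, codes))
	fmt.Println(quorumMet("unanimous", 0, codes))
	fmt.Println(quorumMet("", 2, codes))
}
```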
type RunFunc ¶
RunFunc is the single-invocation primitive that RunCompare and RunQuorum drive. It receives a harness name and a prompt, and returns a RunResult. Callers wire this to whatever execution engine they use (DDx Runner.Run, agent service.Execute, etc.).
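The wiring described above is an adapter: convert whatever result type a concrete execution engine returns into the minimal shape the comparison engine consumes. A sketch with hypothetical local types (engineResult stands in for any concrete engine result; the package's RunResult carries more fields):

```go
package main

import "fmt"

// runResult is a reduced local stand-in for the package's RunResult.
type runResult struct {
	Harness  string
	Output   string
	ExitCode int
}

// engineResult is a hypothetical concrete result type from some
// execution engine, used only to illustrate the adapter pattern.
type engineResult struct {
	Stdout string
	Code   int
}

// adapt wires a concrete engine to the RunFunc shape: given a
// harness name and a prompt, invoke the engine and map its result
// into the minimal shape the comparison engine needs.
func adapt(exec func(harness, prompt string) engineResult) func(string, string) runResult {
	return func(harness, prompt string) runResult {
		r := exec(harness, prompt)
		return runResult{Harness: harness, Output: r.Stdout, ExitCode: r.Code}
	}
}

func main() {
	fakeExec := func(harness, prompt string) engineResult {
		return engineResult{Stdout: harness + ": done", Code: 0}
	}
	run := adapt(fakeExec)
	res := run("claude", "fix the bug")
	fmt.Println(res.Harness, res.Output, res.ExitCode)
}
```

The same pattern applies to wiring RunCompare or RunQuorum to a real engine: only the mapping body changes.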
type RunResult ¶
type RunResult struct {
Harness string
Model string
Output string
ToolCalls []ToolCallEntry
Tokens int
InputTokens int
OutputTokens int
CostUSD float64
DurationMS int
ExitCode int
Error string
}
RunResult is the minimal result shape the comparison engine needs from a single harness invocation. Callers adapt their concrete result type (e.g. agent.Result in DDx, or service-level events in agent) to RunResult.