eval_analysis

package
v0.14.2
Published: Apr 26, 2026 License: Apache-2.0 Imports: 13 Imported by: 0

README

internal/eval_analysis

Benchmark analysis and dashboard generation for the AILANG evaluation system.

File Organization

This package was reorganized in November 2025 to comply with the 800-line file size limit for AI-maintainability.

Core Analysis Files
  • comparison.go (195 lines) - Baseline comparison and regression detection
  • matrix.go (242 lines) - Performance matrix generation and aggregation
  • loader.go (279 lines) - Benchmark result loading and filtering
  • validate.go (221 lines) - Result validation and health checks
  • formatter.go (325 lines) - Human-readable output formatting
Export Files (split from export_docusaurus.go)

The original export_docusaurus.go (980 lines) was split into 4 focused files:

  • dashboard_io.go (145 lines) - Dashboard JSON I/O operations

    • loadExistingDashboard() - Load existing dashboard with history
    • mergeHistory() - Merge new results into history
    • buildHistoryEntryFromMatrix() - Create history entries
    • writeJSONAtomic() - Atomic file writes with validation
  • export_json.go (616 lines) - JSON export for client-side rendering

    • ExportBenchmarkJSON() - Main JSON export function
    • Agent vs standard metrics separation
    • Per-language, per-model, per-benchmark breakdowns
    • Fair comparison metrics (agent-comparable benchmarks)
  • export_mdx.go (208 lines) - MDX export for Docusaurus

    • ExportDocusaurusMDX() - Generate React-enhanced markdown
    • Model performance tables
    • Benchmark detail tables
    • Success stories and case studies
  • export_helpers.go (32 lines) - Shared formatting utilities

    • formatBenchmarkName() - Convert snake_case to Title Case
    • formatModelName() - Shorten model names for tables
Chain-Based Loading (v0.8.0+)
  • loader_chains.go (174 lines) - Load results from observatory.db chains
    • LoadResultsFromChain(chainID) - Load all benchmark results from a chain
    • LoadResultsFromLatestEvalChain() - Find and load most recent eval_suite chain
    • LoadBaselineFromChain(chainID) - Create Baseline for comparisons
    • stageToResult() - Convert chain stage + eval assessment to BenchmarkResult
Data Types
  • types.go (343 lines) - All data structure definitions
    • BenchmarkResult - Single benchmark run result
    • PerformanceMatrix - Aggregated performance data
    • DashboardJSON - Dashboard structure with history
    • Language, model, and benchmark stats
Tests
  • comparison_test.go (295 lines) - Comparison logic tests
  • matrix_test.go (230 lines) - Matrix generation tests
  • export_docusaurus_test.go (285 lines) - Dashboard I/O tests
    • History preservation
    • Version deduplication
    • Atomic write validation
    • Rollback on error

Usage

Generate Performance Matrix (file-based)
results, err := LoadResults("eval_results/baselines/v0.4.0")
matrix, err := GenerateMatrix(results, "v0.4.0")
Generate Performance Matrix (chain-based - v0.8.0+)
results, err := LoadResultsFromChain("e9c7501d-...")
matrix, err := GenerateMatrix(results, "v0.8.0")
Export Dashboard JSON
jsonStr, err := ExportBenchmarkJSON(matrix, history, results, "docs/static/benchmarks/latest.json")
// Automatically preserves history, validates, and writes atomically
Export Docusaurus MDX
mdx := ExportDocusaurusMDX(matrix, history)
os.WriteFile("docs/docs/benchmarks/performance.md", []byte(mdx), 0644)
Compare Baselines
baseline, err := LoadBaseline("eval_results/baselines/v0.4.0")
newBaseline, err := LoadBaseline("eval_results/baselines/v0.4.1")
report, err := CompareBaselines(baseline, newBaseline)
Compare Chain-Based Baselines (v0.8.0+)
baseline, _ := LoadBaselineFromChain("chain-id-1")
newBaseline, _ := LoadBaselineFromChain("chain-id-2")
report, _ := CompareBaselines(baseline, newBaseline)

Design Principles

  1. History Preservation - Dashboard JSON maintains full version history
  2. Atomic Writes - All file writes are atomic (temp + rename; see the sketch after this list)
  3. Fair Comparisons - Agent metrics compare against same benchmark set
  4. Validation - JSON structure validated before writing
  5. AI-Friendly - Files kept under 800 lines for AI maintainability
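
A minimal sketch of the temp + rename pattern behind writeJSONAtomic(), assuming the os and path/filepath imports; the real helper also validates the JSON before committing, and the names and error handling here are illustrative:

// atomicWriteSketch writes to a temporary file in the same directory, then
// renames it over the target so readers never observe a partial file.
func atomicWriteSketch(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op once the rename has succeeded
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}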

Recent Changes

v0.8.0 (February 2026) - Chain-based result loading

  • Added loader_chains.go for loading results from observatory.db chains
  • Agent eval results now stored as chains (one stage per benchmark)
  • LoadResultsFromChain() returns same []*BenchmarkResult type as LoadResults()
  • Entire downstream pipeline (matrix, export, comparison) works unchanged

v0.4.0 (November 2025) - File split for AI-maintainability

  • Split export_docusaurus.go (980 lines) into 4 files (145, 616, 208, 32 lines)
  • All files now under 800-line limit
  • All tests passing (100% compatibility maintained)
  • Zero functional changes - pure refactoring

See Also

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DetectRefusal

func DetectRefusal(code, stderr, stdout string) bool

DetectRefusal returns true when any refusal pattern appears in code, stderr, or stdout. Matching is case-insensitive and substring-based, so "I CANNOT" and "i cannot" both trigger.
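
A minimal sketch of that matching strategy, assuming the strings import; the actual pattern list is internal and exposed via RefusalPatterns():

// detectRefusalSketch mirrors the documented behavior: lower-case everything
// once so "I CANNOT" and "i cannot" both match, then substring-scan.
func detectRefusalSketch(code, stderr, stdout string) bool {
	haystack := strings.ToLower(code + "\n" + stderr + "\n" + stdout)
	for _, pattern := range RefusalPatterns() {
		if strings.Contains(haystack, strings.ToLower(pattern)) {
			return true
		}
	}
	return false
}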

func ExportBenchmarkJSON

func ExportBenchmarkJSON(matrix *PerformanceMatrix, history []*Baseline, results []*BenchmarkResult, outputPath string) (string, error)

ExportBenchmarkJSON exports benchmark data as JSON for client-side rendering

func ExportCSV

func ExportCSV(results []*BenchmarkResult) (string, error)

ExportCSV generates a CSV export of benchmark results

func ExportDocusaurusMDX

func ExportDocusaurusMDX(matrix *PerformanceMatrix, history []*Baseline) string

ExportDocusaurusMDX generates an MDX file with React components for Docusaurus

func ExportHTML

func ExportHTML(matrix *PerformanceMatrix, history []*Baseline) (string, error)

ExportHTML generates an HTML report with Bootstrap styling

func ExportMarkdown

func ExportMarkdown(matrix *PerformanceMatrix, history []*Baseline) string

ExportMarkdown generates a GitHub-flavored markdown report

func FormatComparison

func FormatComparison(report *ComparisonReport, useColor bool) string

FormatComparison produces a human-readable comparison report

func FormatJSON

func FormatJSON(matrix *PerformanceMatrix) (string, error)

FormatJSON converts a matrix to pretty-printed JSON

func FormatJSONL

func FormatJSONL(results []*BenchmarkResult) (string, error)

FormatJSONL converts results to JSONL format (one JSON object per line)

func FormatMatrix

func FormatMatrix(matrix *PerformanceMatrix, useColor bool) string

FormatMatrix produces a human-readable matrix summary

func GenerateReport

func GenerateReport(matrix *PerformanceMatrix, history []*Baseline) string

GenerateReport creates a comprehensive evaluation report

func ListBaselines

func ListBaselines() ([]string, error)

ListBaselines returns a list of available baseline versions

func LoadBenchmarkTags

func LoadBenchmarkTags(dir string) map[string][]string

LoadBenchmarkTags reads every YAML in dir and returns a map of benchmark ID -> tag list. Benchmarks with unreadable specs are skipped silently — LoadSpec already warns on unknown tags.

func RefusalPatterns

func RefusalPatterns() []string

RefusalPatterns returns a copy of the refusal pattern list for tests and documentation. Exported so the M4 acceptance test can assert the ≥4 patterns invariant without reaching into package internals.
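
A minimal sketch of that assertion as a standard Go test; the test name is illustrative:

func TestRefusalPatternsCount(t *testing.T) {
	// RefusalPatterns returns a copy, so callers cannot mutate the real list.
	if got := len(RefusalPatterns()); got < 4 {
		t.Fatalf("expected at least 4 refusal patterns, got %d", got)
	}
}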

Types

type AILANGWin

type AILANGWin struct {
	ID    string `json:"id"`
	Model string `json:"model"`
}

AILANGWin names a (benchmark, model) cell where AILANG passed and Python failed — the atom of the AILANG-only-wins report.

type AILANGWinsReport

type AILANGWinsReport struct {
	Wins         []AILANGWin    `json:"wins"`
	PerBenchmark map[string]int `json:"per_benchmark"` // benchmark -> distinct models winning
	Patterns     []string       `json:"patterns"`      // benchmarks with ≥3 models winning
}

AILANGWinsReport aggregates wins at the cell level plus a pattern list of benchmarks where ≥3 distinct models agree that AILANG wins.

func DetectAILANGOnlyWins

func DetectAILANGOnlyWins(results []*BenchmarkResult) *AILANGWinsReport

DetectAILANGOnlyWins finds cells where AILANG passes and Python fails for the same (benchmark, model). A benchmark is a "pattern" win when ≥3 distinct models agree on it. If either language refused at a cell, the whole cell is dropped — a Python refusal masquerading as a failure would otherwise produce a false-positive win.
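
A sketch of a typical call, with an illustrative baseline path:

results, err := LoadResults("eval_results/baselines/v0.14.0")
if err != nil {
	log.Fatal(err)
}
report := DetectAILANGOnlyWins(results)
for _, id := range report.Patterns {
	fmt.Printf("%s: %d models agree AILANG wins\n", id, report.PerBenchmark[id])
}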

type Aggregates

type Aggregates struct {
	ZeroShotSuccess   float64 `json:"0-shot_success"`      // First attempt success rate
	FinalSuccess      float64 `json:"final_success"`       // After repair success rate
	RepairUsed        int     `json:"repair_used"`         // Number of repairs attempted
	RepairSuccessRate float64 `json:"repair_success_rate"` // Repair success rate
	TotalTokens       int     `json:"total_tokens"`
	TotalCostUSD      float64 `json:"total_cost_usd"`
	AvgDurationMs     float64 `json:"avg_duration_ms"`
}

Aggregates contains overall performance statistics

type Baseline

type Baseline struct {
	Version         string             `json:"version"`
	Timestamp       time.Time          `json:"timestamp"`
	Model           string             `json:"model"`
	Languages       string             `json:"languages"`
	SelfRepair      bool               `json:"self_repair"`
	TotalBenchmarks int                `json:"total_benchmarks"`
	SuccessCount    int                `json:"success_count"`
	FailCount       int                `json:"fail_count"`
	MatrixFile      string             `json:"matrix_file"`
	GitCommit       string             `json:"git_commit"`
	GitBranch       string             `json:"git_branch"`
	Results         []*BenchmarkResult `json:"-"` // Loaded separately
}

Baseline represents a stored baseline with metadata

func GetLatestBaseline

func GetLatestBaseline() (*Baseline, error)

GetLatestBaseline returns the most recent baseline version

func LoadBaseline

func LoadBaseline(dir string) (*Baseline, error)

LoadBaseline loads a baseline from a directory. Expects baseline.json metadata + result JSON files.

func LoadBaselineByVersion

func LoadBaselineByVersion(version string) (*Baseline, error)

LoadBaselineByVersion loads a baseline by version name. Looks in eval_results/baselines/<version>.

func LoadBaselineFromChain

func LoadBaselineFromChain(chainID string) (*Baseline, error)

LoadBaselineFromChain creates a Baseline from a chain's stages.

type BenchmarkChange

type BenchmarkChange struct {
	ID             string
	Lang           string
	Model          string
	BaselineStatus bool // true = passing, false = failing
	NewStatus      bool
	BaselineError  string
	NewError       string
}

BenchmarkChange represents a benchmark that changed status

func FindImprovements

func FindImprovements(baseline, new []*BenchmarkResult) ([]*BenchmarkChange, error)

FindImprovements returns only benchmarks that were fixed

func FindRegressions

func FindRegressions(baseline, new []*BenchmarkResult) ([]*BenchmarkChange, error)

FindRegressions returns only benchmarks that broke

type BenchmarkResult

type BenchmarkResult struct {
	ID            string    `json:"id"`
	Lang          string    `json:"lang"`
	Model         string    `json:"model"`
	Executor      string    `json:"executor,omitempty"` // Executor used: "claude", "gemini", etc. (agent mode)
	Seed          int64     `json:"seed"`
	InputTokens   int       `json:"input_tokens"`
	OutputTokens  int       `json:"output_tokens"`
	TotalTokens   int       `json:"total_tokens"`
	CostUSD       float64   `json:"cost_usd"`
	CompileOk     bool      `json:"compile_ok"`
	RuntimeOk     bool      `json:"runtime_ok"`
	StdoutOk      bool      `json:"stdout_ok"`
	DurationMs    int64     `json:"duration_ms"`
	CompileMs     int64     `json:"compile_ms"`
	ExecuteMs     int64     `json:"execute_ms"`
	ErrorCategory string    `json:"error_category"`
	Stdout        string    `json:"stdout,omitempty"`
	Stderr        string    `json:"stderr,omitempty"`
	Timestamp     time.Time `json:"timestamp"`
	Code          string    `json:"code,omitempty"`

	// Self-repair metrics (M-EVAL-LOOP)
	FirstAttemptOk  bool   `json:"first_attempt_ok"`
	RepairUsed      bool   `json:"repair_used"`
	RepairOk        bool   `json:"repair_ok"`
	ErrCode         string `json:"err_code,omitempty"`
	RepairTokensIn  int    `json:"repair_tokens_in,omitempty"`
	RepairTokensOut int    `json:"repair_tokens_out,omitempty"`

	// Prompt versioning
	PromptVersion string `json:"prompt_version,omitempty"`

	// Agent evaluation metrics (M-EVAL-AGENT)
	EvalMode        string `json:"eval_mode,omitempty"`        // "standard" or "agent"
	Condition       string `json:"condition,omitempty"`        // Experimental condition: "baseline", "agent_prompt", etc.
	AgentTurns      int    `json:"agent_turns,omitempty"`      // Number of conversation turns
	AgentTranscript string `json:"agent_transcript,omitempty"` // Full session log

	// Reproducibility
	BinaryHash string   `json:"binary_hash,omitempty"`
	StdlibHash string   `json:"stdlib_hash,omitempty"`
	Caps       []string `json:"caps,omitempty"`

	// Cross-harness comparison (M-EVAL-CROSS-HARNESS)
	// Logical model family for grouping paired harness results.
	// e.g. "claude-sonnet-4-6" shared by "claude" and "opencode" executors.
	ModelFamily string `json:"model_family,omitempty"`

	// Refusal detection (M-EVAL-SUITE-PREP M4): populated at load time
	// by DetectRefusal() scanning stdout+stderr. Not written by eval_harness,
	// purely a read-side annotation so historical results inherit it.
	RefusalDetected bool `json:"refusal_detected,omitempty"`
}

BenchmarkResult represents the result of a single benchmark execution. This mirrors the JSON structure from internal/eval_harness/metrics.go.

func Filter

func Filter(results []*BenchmarkResult, filter ResultFilter) []*BenchmarkResult

Filter applies the filter to results

func LoadLatestResultsPerModel

func LoadLatestResultsPerModel() ([]*BenchmarkResult, map[string]string, error)

LoadLatestResultsPerModel aggregates results from multiple baselines, keeping the latest result for each model. Returns results and a map of model -> baseline version used
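
A sketch of how the returned map feeds GenerateMatrixWithBaselines; the version label is illustrative:

results, modelBaselines, err := LoadLatestResultsPerModel()
if err != nil {
	log.Fatal(err)
}
matrix, err := GenerateMatrixWithBaselines(results, "v0.14.2", modelBaselines)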

func LoadResult

func LoadResult(path string) (*BenchmarkResult, error)

LoadResult loads a single benchmark result from a JSON file

func LoadResults

func LoadResults(dir string) ([]*BenchmarkResult, error)

LoadResults loads all benchmark results from a directory. Returns results sorted by timestamp (newest first). Recursively searches all subdirectories for .json files.

func LoadResultsFromChain

func LoadResultsFromChain(chainID string) ([]*BenchmarkResult, error)

LoadResultsFromChain loads all benchmark results from a chain's stages, converting EvalAssessment data into BenchmarkResult format. This allows the entire downstream pipeline (GenerateMatrix, ExportBenchmarkJSON, FormatComparison) to work unchanged.

func LoadResultsFromLatestEvalChain

func LoadResultsFromLatestEvalChain() ([]*BenchmarkResult, string, error)

LoadResultsFromLatestEvalChain finds the most recent eval_suite chain and loads its results.

func (*BenchmarkResult) ToSummaryEntry

func (r *BenchmarkResult) ToSummaryEntry() *SummaryEntry

ToSummaryEntry converts a BenchmarkResult to a SummaryEntry for JSONL export

type BenchmarkRun

type BenchmarkRun struct {
	Success        bool `json:"success"`
	FirstAttemptOk bool `json:"first_attempt_ok"`
	RepairUsed     bool `json:"repair_used"`
	Tokens         int  `json:"tokens"`
}

BenchmarkRun contains single benchmark execution stats

type BenchmarkStats

type BenchmarkStats struct {
	TotalRuns   int      `json:"total_runs"`
	SuccessRate float64  `json:"success_rate"`
	AvgTokens   float64  `json:"avg_tokens"`
	Languages   []string `json:"languages"`
}

BenchmarkStats contains per-benchmark performance

type ComparisonReport

type ComparisonReport struct {
	BaselineLabel string
	NewLabel      string
	Baseline      *Baseline
	New           *Baseline

	// Changes
	Fixed         []*BenchmarkChange
	Broken        []*BenchmarkChange
	StillPassing  []*BenchmarkResult
	StillFailing  []*BenchmarkResult
	NewBenchmarks []*BenchmarkResult
	Removed       []*BenchmarkResult

	// Aggregates
	BaselineSuccessRate float64
	NewSuccessRate      float64
	SuccessRateDelta    float64
	TotalBaselineBench  int
	TotalNewBench       int
}

ComparisonReport contains structured diff between two benchmark runs

func Compare

func Compare(baseline, new []*BenchmarkResult, baselineLabel, newLabel string) (*ComparisonReport, error)

Compare compares two sets of benchmark results and produces a detailed report

func CompareBaselines

func CompareBaselines(baseline, new *Baseline) (*ComparisonReport, error)

CompareBaselines compares two baselines (with metadata)

func (*ComparisonReport) HasImprovements

func (r *ComparisonReport) HasImprovements() bool

HasImprovements checks if there are any improvements

func (*ComparisonReport) HasRegressions

func (r *ComparisonReport) HasRegressions() bool

HasRegressions checks if there are any regressions

func (*ComparisonReport) ImprovementPercent

func (r *ComparisonReport) ImprovementPercent() float64

ImprovementPercent returns the improvement as a percentage

func (*ComparisonReport) NetChange

func (r *ComparisonReport) NetChange() int

NetChange returns the net change in passing benchmarks

func (*ComparisonReport) Summary

func (r *ComparisonReport) Summary() string

Summary returns a one-line summary of the comparison

type DashboardJSON

type DashboardJSON struct {
	Version    string                   `json:"version"`
	Timestamp  string                   `json:"timestamp"`
	TotalRuns  int                      `json:"totalRuns"`
	Aggregates map[string]interface{}   `json:"aggregates"`
	Tiers      map[string]TierAggregate `json:"tiers,omitempty"` // Per-tier aggregates: smoke/core/stretch/vision
	// M-DASH-V2: per-tag aggregates (12 canonical tags) with per-model
	// cross-sections so the dashboard can narrow the charts to a tag.
	Tags        map[string]*TagAggregate `json:"tags,omitempty"`
	Models      map[string]interface{}   `json:"models"`
	AgentModels map[string]interface{}   `json:"agentModels,omitempty"` // Agent-only models (separate from standard)
	Benchmarks  map[string]interface{}   `json:"benchmarks"`
	Languages   map[string]interface{}   `json:"languages"` // map[language]->stats
	Executors   map[string]interface{}   `json:"executors"` // map[executor]->agent stats (claude, gemini)
	// M-BENCHMARK-SECTION: harness-grouped aggregates for cross-harness comparison page.
	// Keys are agent_cli values ("claude", "gemini", "opencode", "codex").
	Harnesses map[string]interface{} `json:"harnesses,omitempty"`
	History   []HistoryEntry         `json:"history"`
	// M-DASH-V2: suite-change annotations rendered as ReferenceLine on every
	// time-series chart. Sourced from benchmarks/events.yml.
	Events []SuiteEvent `json:"events,omitempty"`
}

DashboardJSON represents the structure of docs/static/benchmarks/latest.json. This is the single source of truth for the dashboard frontend.

func (*DashboardJSON) Validate

func (d *DashboardJSON) Validate() error

Validate checks if a DashboardJSON structure is valid
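
A sketch of validating an existing dashboard file before reuse, assuming the encoding/json and os imports; the path is illustrative:

data, err := os.ReadFile("docs/static/benchmarks/latest.json")
if err != nil {
	log.Fatal(err)
}
var dash DashboardJSON
if err := json.Unmarshal(data, &dash); err != nil {
	log.Fatal(err)
}
if err := dash.Validate(); err != nil {
	log.Fatalf("dashboard JSON is invalid: %v", err)
}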

type ErrorCodeStats

type ErrorCodeStats struct {
	Code          string  `json:"code"`
	Count         int     `json:"count"`
	RepairSuccess float64 `json:"repair_success"`
}

ErrorCodeStats contains per-error-code statistics

type ExecutorLangStats

type ExecutorLangStats struct {
	Runs    int
	Success int
	Turns   int
	Tokens  int
	Cost    float64
}

ExecutorLangStats exposes the counts the language aggregate loop needs to surface per-executor agent metrics alongside the default ones.

type ExportFormat

type ExportFormat string

ExportFormat represents the output format for reports

const (
	FormatMarkdown ExportFormat = "markdown"
	FormatHTML     ExportFormat = "html"
	FormatCSV      ExportFormat = "csv"
)

type HistoryEntry

type HistoryEntry struct {
	Version       string                 `json:"version"`
	Timestamp     string                 `json:"timestamp"`
	SuccessRate   float64                `json:"successRate"`
	TotalRuns     int                    `json:"totalRuns"`
	SuccessCount  int                    `json:"successCount"`
	Languages     string                 `json:"languages"`
	LanguageStats map[string]interface{} `json:"languageStats,omitempty"`
	ModelStats    map[string]interface{} `json:"modelStats,omitempty"` // Per-model, per-language stats for trend charts
	// M-DASH-V2: per-tier snapshots. Lets the time-series chart filter to
	// one tier retroactively (pre-v0.14.0 baselines use the CURRENT tier
	// mapping — docs describe this as an approximation).
	Tiers map[string]*TierHistoryPoint `json:"tiers,omitempty"`
}

HistoryEntry represents a single version's data in the history array

type LanguageStats

type LanguageStats struct {
	TotalRuns   int     `json:"total_runs"`
	SuccessRate float64 `json:"success_rate"`
	AvgTokens   float64 `json:"avg_tokens"`
}

LanguageStats contains per-language performance

type ModelDimensionStats

type ModelDimensionStats struct {
	SuccessRate   float64 `json:"successRate"`
	TotalRuns     int     `json:"totalRuns"`
	AvgTokens     float64 `json:"avgTokens"`
	APIErrorCount int     `json:"apiErrorCount,omitempty"`
	RefusalCount  int     `json:"refusalCount,omitempty"`
}

ModelDimensionStats is the per-(model, language) cross-section used in both TierAggregate.ModelStats and TagAggregate.ModelStats. Shape matches what the time-series chart reads from history.modelStats[model][lang] so the frontend can swap data sources cleanly.

type ModelReliability

type ModelReliability struct {
	APIErrorCount int     `json:"apiErrorCount"`
	APIErrorRate  float64 `json:"apiErrorRate"`
	RefusalCount  int     `json:"refusalCount"`
	RefusalRate   float64 `json:"refusalRate"`
	TotalRuns     int     `json:"totalRuns"`
	// Per-language api-error counts. AILANGAPIError/PythonAPIError kept for
	// backward compatibility; LanguageAPIErrors covers all eval languages.
	AILANGAPIError    int            `json:"ailangApiError,omitempty"`
	PythonAPIError    int            `json:"pythonApiError,omitempty"`
	LanguageAPIErrors map[string]int `json:"language_api_errors,omitempty"`
}

ModelReliability is the per-model counterpart: useful for the hover breakdown on the reliability card ("gemini-3-1-pro: 13/33 api errors").

type ModelStats

type ModelStats struct {
	TotalRuns       int                       `json:"total_runs"`
	Aggregates      Aggregates                `json:"aggregates"`
	Benchmarks      map[string]*BenchmarkRun  `json:"benchmarks"`
	BaselineVersion string                    `json:"baseline_version,omitempty"` // Which baseline these results came from
	Languages       map[string]*LanguageStats `json:"languages,omitempty"`        // Per-language breakdown for this model
}

ModelStats contains per-model performance

type PerformanceMatrix

type PerformanceMatrix struct {
	Version   string    `json:"version"`
	Timestamp time.Time `json:"timestamp"`
	TotalRuns int       `json:"total_runs"`

	// Overall aggregates
	Aggregates Aggregates `json:"aggregates"`

	// Breakdown by dimension
	Models         map[string]*ModelStats     `json:"models"`
	Benchmarks     map[string]*BenchmarkStats `json:"benchmarks"`
	ErrorCodes     []*ErrorCodeStats          `json:"error_codes"`
	Languages      map[string]*LanguageStats  `json:"languages"`
	PromptVersions map[string]*PromptStats    `json:"prompt_versions,omitempty"`
}

PerformanceMatrix contains aggregated performance data

func GenerateMatrix

func GenerateMatrix(results []*BenchmarkResult, version string) (*PerformanceMatrix, error)

GenerateMatrix generates a performance matrix from benchmark results. This replaces the brittle jq-based bash script with type-safe Go code.

func GenerateMatrixWithBaselines

func GenerateMatrixWithBaselines(results []*BenchmarkResult, version string, modelBaselines map[string]string) (*PerformanceMatrix, error)

GenerateMatrixWithBaselines generates a performance matrix with optional baseline version info per model

type PromptStats

type PromptStats struct {
	TotalRuns       int     `json:"total_runs"`
	ZeroShotSuccess float64 `json:"0-shot_success"`
	FinalSuccess    float64 `json:"final_success"`
	AvgTokens       float64 `json:"avg_tokens"`
}

PromptStats contains per-prompt-version performance

type ReliabilityCounts

type ReliabilityCounts struct {
	APIErrorCount int                          `json:"apiErrorCount"`
	APIErrorRate  float64                      `json:"apiErrorRate"`
	RefusalCount  int                          `json:"refusalCount"`
	RefusalRate   float64                      `json:"refusalRate"`
	PerModel      map[string]*ModelReliability `json:"perModel,omitempty"`
}

ReliabilityCounts is a small bag of counters surfaced at the top level of DashboardJSON.aggregates so the "API Reliability" card can render without drilling into tiers/models.

type ResultFilter

type ResultFilter struct {
	Model        string
	Lang         string
	Benchmark    string
	SuccessOnly  bool
	FailuresOnly bool
}

ResultFilter specifies the criteria Filter uses to match results.

type SaturatedBenchmark

type SaturatedBenchmark struct {
	ID            string   `json:"id"`
	BaselinesSeen []string `json:"baselines_seen"` // versions contributing
	TotalCells    int      `json:"total_cells"`    // model × lang pairs
}

SaturatedBenchmark names a benchmark that hit 100% pass across every model × language pair in the considered baselines.

func DetectSaturation

func DetectSaturation(baselines []*Baseline, minBaselines int) []*SaturatedBenchmark

DetectSaturation returns benchmarks that pass 100% across every (model, language) cell in all considered baselines. Only baselines with ≥1 AILANG result are considered, to avoid "saturated" Python-only baselines reporting spurious wins.

If fewer than minBaselines baselines are available, saturation is computed over the ones that exist — better to return partial data with a clear "BaselinesSeen" list than nothing.
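
A sketch of feeding it every stored baseline; the minBaselines value of 3 is illustrative:

versions, err := ListBaselines()
if err != nil {
	log.Fatal(err)
}
var baselines []*Baseline
for _, v := range versions {
	b, err := LoadBaselineByVersion(v)
	if err != nil {
		continue // skip baselines that fail to load
	}
	baselines = append(baselines, b)
}
for _, s := range DetectSaturation(baselines, 3) {
	fmt.Printf("%s: 100%% across %d cells in %v\n", s.ID, s.TotalCells, s.BaselinesSeen)
}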

type SuiteEvent

type SuiteEvent struct {
	Version      string   `json:"version" yaml:"version"`
	Label        string   `json:"label" yaml:"label"`
	Kind         string   `json:"kind" yaml:"kind"` // "benchmark_add" | "benchmark_remove" | "taxonomy" | "prompt"
	Color        string   `json:"color,omitempty" yaml:"color,omitempty"`
	AffectsTiers []string `json:"affects_tiers,omitempty" yaml:"affects_tiers,omitempty"` // if set, event only renders when one of these tiers is selected
}

SuiteEvent is a timeline annotation (benchmark additions, taxonomy changes, etc.) loaded from benchmarks/events.yml. Rendered as a dashed ReferenceLine on every time-series chart.

func LoadSuiteEvents

func LoadSuiteEvents(path string) ([]SuiteEvent, error)

LoadSuiteEvents reads benchmarks/events.yml (or any path) and returns the parsed timeline annotations. Missing file returns an empty slice rather than an error — events are optional.
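
A sketch of the typical call; a missing events.yml simply yields no annotations:

events, err := LoadSuiteEvents("benchmarks/events.yml")
if err != nil {
	log.Fatal(err) // parse errors only; a missing file is not an error
}
fmt.Printf("%d suite events loaded\n", len(events))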

type SummaryEntry

type SummaryEntry struct {
	ID             string  `json:"id"`
	Lang           string  `json:"lang"`
	Model          string  `json:"model"`
	Executor       string  `json:"executor,omitempty"` // Executor used: "claude", "gemini" (agent mode)
	Seed           int64   `json:"seed"`
	PromptVersion  string  `json:"prompt_version,omitempty"`
	FirstAttemptOk bool    `json:"first_attempt_ok"`
	RepairUsed     bool    `json:"repair_used"`
	RepairOk       bool    `json:"repair_ok"`
	ErrCode        string  `json:"err_code,omitempty"`
	CompileOk      bool    `json:"compile_ok"`
	RuntimeOk      bool    `json:"runtime_ok"`
	StdoutOk       bool    `json:"stdout_ok"`
	ErrorCategory  string  `json:"error_category"`
	InputTokens    int     `json:"input_tokens"`
	OutputTokens   int     `json:"output_tokens"`
	TotalTokens    int     `json:"total_tokens"`
	CostUSD        float64 `json:"cost_usd"`
	DurationMs     int64   `json:"duration_ms"`
	Timestamp      string  `json:"timestamp"`
	Stderr         string  `json:"stderr,omitempty"`
	// Agent evaluation fields (M-EVAL-AGENT)
	EvalMode   string `json:"eval_mode,omitempty"`   // "standard" or "agent"
	Condition  string `json:"condition,omitempty"`   // Experimental condition: "baseline", "agent_prompt", etc.
	AgentTurns int    `json:"agent_turns,omitempty"` // Number of conversation turns
}

SummaryEntry is a simplified record for JSONL export

func (*SummaryEntry) MarshalJSON

func (s *SummaryEntry) MarshalJSON() ([]byte, error)

MarshalJSON implements custom JSON marshaling for JSONL (single-line)

type TagAggregate

type TagAggregate struct {
	Tag         string  `json:"tag"`
	AILANGPass  int     `json:"ailang_pass"`
	AILANGTotal int     `json:"ailang_total"`
	PythonPass  int     `json:"python_pass"`
	PythonTotal int     `json:"python_total"`
	Delta       float64 `json:"delta"` // ailangRate - pythonRate; kept for backward compat
	// LanguageBreakdown contains pass/total for ALL eval languages
	// (python, ailang, javascript, go, …). The typed AILANG*/Python* fields
	// above remain for backward compatibility.
	LanguageBreakdown map[string]*TagLangStats `json:"language_breakdown,omitempty"`
	// M-DASH-V2: unique benchmark IDs carrying this tag (useful for the
	// UI "N benchmarks in tag" chip).
	BenchmarkCount int `json:"benchmark_count,omitempty"`
	// M-DASH-V2: per-model cross-section so the dashboard can render
	// per-model bars filtered to this tag. Outer key is model name,
	// inner key is language.
	ModelStats map[string]map[string]*ModelDimensionStats `json:"model_stats,omitempty"`
}

TagAggregate summarises pass/total counts for one tag, per language, plus the AILANG vs Python delta in [-1,1].

type TagLangStats added in v0.14.2

type TagLangStats struct {
	Pass  int     `json:"pass"`
	Total int     `json:"total"`
	Rate  float64 `json:"rate"`
}

TagLangStats holds per-language pass/total inside a TagAggregate.

type TagReport

type TagReport struct {
	Tags       []string                 `json:"tags"`
	Aggregates map[string]*TagAggregate `json:"aggregates"`
}

TagReport is the output of GroupByTags: a sorted tag list plus the per-tag aggregates.

func GroupByTags

func GroupByTags(results []*BenchmarkResult, tags map[string][]string) *TagReport

GroupByTags builds a TagReport from benchmark results and the tag index from LoadBenchmarkTags. Results flagged RefusalDetected are excluded from pass/total counts so refusals do not inflate failure rates for every tag the benchmark happened to carry.

Each benchmark contributes to every tag it carries; a (benchmark, lang, model) run is one unit, so a benchmark tagged adt_pattern_match + recursion counts once in each column.
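
A sketch combining LoadBenchmarkTags and GroupByTags; the spec directory and baseline path are illustrative:

tags := LoadBenchmarkTags("benchmarks")
results, err := LoadResults("eval_results/baselines/v0.14.0")
if err != nil {
	log.Fatal(err)
}
report := GroupByTags(results, tags)
for _, tag := range report.Tags {
	agg := report.Aggregates[tag]
	fmt.Printf("%s: AILANG %d/%d vs Python %d/%d (delta %+.2f)\n",
		tag, agg.AILANGPass, agg.AILANGTotal, agg.PythonPass, agg.PythonTotal, agg.Delta)
}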

type TierAggregate

type TierAggregate struct {
	TotalRuns         int     `json:"total_runs"`
	AILANGRuns        int     `json:"ailang_runs"`
	PythonRuns        int     `json:"python_runs"`
	AILANGSuccessRate float64 `json:"ailang_success_rate"`
	PythonSuccessRate float64 `json:"python_success_rate"`
	BenchmarkCount    int     `json:"benchmark_count"` // unique benchmark IDs in this tier

	// Generic per-language breakdown — includes all eval languages (python,
	// ailang, javascript, go, …). The typed AILANG*/Python* fields above
	// remain for backward compatibility with existing dashboard consumers.
	LanguageStats map[string]*TierLanguageStats `json:"language_stats,omitempty"`

	// M-DASH-V2: per-tier × per-model breakdown so charts can filter
	// time-series data to this tier. Outer key is model name, inner key is
	// language. Nil when the tier has no runs.
	ModelStats map[string]map[string]*ModelDimensionStats `json:"model_stats,omitempty"`

	// M-DASH-V2: API reliability per tier. Splits by language so dashboards
	// can show "how many gemini-3-1-pro AILANG runs on core tier returned
	// api_error?" separately from Python.
	APIErrorCount  int `json:"api_error_count"`
	AILANGAPIError int `json:"ailang_api_error"`
	PythonAPIError int `json:"python_api_error"`

	// M-DASH-V2: refusal count per tier (RefusalDetected at load time).
	RefusalCount int `json:"refusal_count"`

	// M-DASH-V2: self-repair efficacy and cost for this tier. RepairDelta =
	// final pass rate − first-attempt pass rate; answers "does self-repair
	// help more on hard tiers?". AvgCostUSD split by language lets callers
	// tell whether stretch is 3× pricier on AILANG specifically.
	AILANGRepairDelta float64 `json:"ailang_repair_delta"`
	PythonRepairDelta float64 `json:"python_repair_delta"`
	AILANGAvgCostUSD  float64 `json:"ailang_avg_cost_usd"`
	PythonAvgCostUSD  float64 `json:"python_avg_cost_usd"`
}

TierAggregate contains per-tier pass-rate metrics. Populated by ExportBenchmarkJSON from the tier field attached to each benchmark result (resolved via the benchmark YAML's tier). The Core tier pass rate is the dashboard headline metric per M-EVAL-SUITE-PREP M6.

type TierHistoryPoint

type TierHistoryPoint struct {
	AILANGSuccessRate float64                                    `json:"ailang_success_rate"`
	PythonSuccessRate float64                                    `json:"python_success_rate"`
	AILANGRuns        int                                        `json:"ailang_runs"`
	PythonRuns        int                                        `json:"python_runs"`
	BenchmarkCount    int                                        `json:"benchmark_count"`
	ModelStats        map[string]map[string]*ModelDimensionStats `json:"modelStats,omitempty"`
	// Generic per-language breakdown for all eval languages (python, ailang,
	// javascript, go, …). The typed AILANG*/Python* fields remain for
	// backward compatibility.
	LanguageStats map[string]*TierLanguageStats `json:"language_stats,omitempty"`
}

TierHistoryPoint is a per-tier snapshot inside a single history entry. Lets PerModelTrend filter the time series to e.g. just the Core tier so the chart updates when TierToggle changes — not just the hero row.

type TierLanguageStats added in v0.14.2

type TierLanguageStats struct {
	Runs        int     `json:"runs"`
	Pass        int     `json:"pass"`
	SuccessRate float64 `json:"success_rate"`
	RepairDelta float64 `json:"repair_delta,omitempty"`
	AvgCostUSD  float64 `json:"avg_cost_usd,omitempty"`
	APIErrors   int     `json:"api_errors,omitempty"`
}

TierLanguageStats holds per-language aggregate metrics for one tier. Used in TierAggregate.LanguageStats and TierHistoryPoint.LanguageStats to surface data for all eval languages (python, ailang, javascript, go, …).
