eval_analysis

package
v0.14.2
Published: Apr 26, 2026 License: Apache-2.0 Imports: 13 Imported by: 0

README

internal/eval_analysis

Benchmark analysis and dashboard generation for the AILANG evaluation system.

File Organization

This package was reorganized in November 2025 to comply with the 800-line file size limit for AI-maintainability.

Core Analysis Files
  • comparison.go (195 lines) - Baseline comparison and regression detection
  • matrix.go (242 lines) - Performance matrix generation and aggregation
  • loader.go (279 lines) - Benchmark result loading and filtering
  • validate.go (221 lines) - Result validation and health checks
  • formatter.go (325 lines) - Human-readable output formatting
Export Files (split from export_docusaurus.go)

The original export_docusaurus.go (980 lines) was split into 4 focused files:

  • dashboard_io.go (145 lines) - Dashboard JSON I/O operations

    • loadExistingDashboard() - Load existing dashboard with history
    • mergeHistory() - Merge new results into history
    • buildHistoryEntryFromMatrix() - Create history entries
    • writeJSONAtomic() - Atomic file writes with validation
  • export_json.go (616 lines) - JSON export for client-side rendering

    • ExportBenchmarkJSON() - Main JSON export function
    • Agent vs standard metrics separation
    • Per-language, per-model, per-benchmark breakdowns
    • Fair comparison metrics (agent-comparable benchmarks)
  • export_mdx.go (208 lines) - MDX export for Docusaurus

    • ExportDocusaurusMDX() - Generate React-enhanced markdown
    • Model performance tables
    • Benchmark detail tables
    • Success stories and case studies
  • export_helpers.go (32 lines) - Shared formatting utilities

    • formatBenchmarkName() - Convert snake_case to Title Case
    • formatModelName() - Shorten model names for tables
Chain-Based Loading (v0.8.0+)
  • loader_chains.go (174 lines) - Load results from observatory.db chains
    • LoadResultsFromChain(chainID) - Load all benchmark results from a chain
    • LoadResultsFromLatestEvalChain() - Find and load most recent eval_suite chain
    • LoadBaselineFromChain(chainID) - Create Baseline for comparisons
    • stageToResult() - Convert chain stage + eval assessment to BenchmarkResult
Data Types
  • types.go (343 lines) - All data structure definitions
    • BenchmarkResult - Single benchmark run result
    • PerformanceMatrix - Aggregated performance data
    • DashboardJSON - Dashboard structure with history
    • Language, model, and benchmark stats
Tests
  • comparison_test.go (295 lines) - Comparison logic tests
  • matrix_test.go (230 lines) - Matrix generation tests
  • export_docusaurus_test.go (285 lines) - Dashboard I/O tests
    • History preservation
    • Version deduplication
    • Atomic write validation
    • Rollback on error

Usage

Generate Performance Matrix (file-based)
results, err := LoadResults("eval_results/baselines/v0.4.0")
matrix, err := GenerateMatrix(results, "v0.4.0")
Generate Performance Matrix (chain-based - v0.8.0+)
results, err := LoadResultsFromChain("e9c7501d-...")
matrix, err := GenerateMatrix(results, "v0.8.0")
Export Dashboard JSON
jsonStr, err := ExportBenchmarkJSON(matrix, history, results, "docs/static/benchmarks/latest.json")
// Automatically preserves history, validates, and writes atomically
Export Docusaurus MDX
mdx := ExportDocusaurusMDX(matrix, history)
os.WriteFile("docs/docs/benchmarks/performance.md", []byte(mdx), 0644)
Compare Baselines
baseline, err := LoadBaseline("eval_results/baselines/v0.4.0")
newBaseline, err := LoadBaseline("eval_results/baselines/v0.4.1")
report, err := CompareBaselines(baseline, newBaseline)
Compare Chain-Based Baselines (v0.8.0+)
baseline, _ := LoadBaselineFromChain("chain-id-1")
newBaseline, _ := LoadBaselineFromChain("chain-id-2")
report, _ := CompareBaselines(baseline, newBaseline)

Design Principles

  1. History Preservation - Dashboard JSON maintains full version history
  2. Atomic Writes - All file writes are atomic (temp + rename; see the sketch after this list)
  3. Fair Comparisons - Agent metrics compare against same benchmark set
  4. Validation - JSON structure validated before writing
  5. AI-Friendly - Files kept under 800 lines for AI maintainability
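
A minimal sketch of the temp + rename pattern behind writeJSONAtomic(), assuming the os and path/filepath imports; the real helper also validates the JSON before committing, and the names and error handling here are illustrative:

// atomicWriteSketch writes to a temporary file in the same directory, then
// renames it over the target so readers never observe a partial file.
func atomicWriteSketch(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op once the rename has succeeded
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}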

Recent Changes

v0.8.0 (February 2026) - Chain-based result loading

  • Added loader_chains.go for loading results from observatory.db chains
  • Agent eval results now stored as chains (one stage per benchmark)
  • LoadResultsFromChain() returns same []*BenchmarkResult type as LoadResults()
  • Entire downstream pipeline (matrix, export, comparison) works unchanged

v0.4.0 (November 2025) - File split for AI-maintainability

  • Split export_docusaurus.go (980 lines) into 4 files (145, 616, 208, 32 lines)
  • All files now under 800-line limit
  • All tests passing (100% compatibility maintained)
  • Zero functional changes - pure refactoring

See Also

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DetectRefusal

func DetectRefusal(code, stderr, stdout string) bool

DetectRefusal returns true when any refusal pattern appears in code, stderr, or stdout. Matching is case-insensitive and substring-based, so "I CANNOT" and "i cannot" both trigger.
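
A minimal sketch of that matching strategy, assuming the strings import; the actual pattern list is internal and exposed via RefusalPatterns():

// detectRefusalSketch mirrors the documented behavior: lower-case everything
// once so "I CANNOT" and "i cannot" both match, then substring-scan.
func detectRefusalSketch(code, stderr, stdout string) bool {
	haystack := strings.ToLower(code + "\n" + stderr + "\n" + stdout)
	for _, pattern := range RefusalPatterns() {
		if strings.Contains(haystack, strings.ToLower(pattern)) {
			return true
		}
	}
	return false
}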

func ExportBenchmarkJSON

func ExportBenchmarkJSON(matrix *PerformanceMatrix, history []*Baseline, results []*BenchmarkResult, outputPath string) (string, error)

ExportBenchmarkJSON exports benchmark data as JSON for client-side rendering

func ExportCSV

func ExportCSV(results []*BenchmarkResult) (string, error)

ExportCSV generates a CSV export of benchmark results

func ExportDocusaurusMDX

func ExportDocusaurusMDX(matrix *PerformanceMatrix, history []*Baseline) string

ExportDocusaurusMDX generates an MDX file with React components for Docusaurus

func ExportHTML

func ExportHTML(matrix *PerformanceMatrix, history []*Baseline) (string, error)

ExportHTML generates an HTML report with Bootstrap styling

func ExportMarkdown

func ExportMarkdown(matrix *PerformanceMatrix, history []*Baseline) string

ExportMarkdown generates a GitHub-flavored markdown report

func FormatComparison

func FormatComparison(report *ComparisonReport, useColor bool) string

FormatComparison produces a human-readable comparison report

func FormatJSON

func FormatJSON(matrix *PerformanceMatrix) (string, error)

FormatJSON converts a matrix to pretty-printed JSON

func FormatJSONL

func FormatJSONL(results []*BenchmarkResult) (string, error)

FormatJSONL converts results to JSONL format (one JSON object per line)

func FormatMatrix

func FormatMatrix(matrix *PerformanceMatrix, useColor bool) string

FormatMatrix produces a human-readable matrix summary

func GenerateReport

func GenerateReport(matrix *PerformanceMatrix, history []*Baseline) string

GenerateReport creates a comprehensive evaluation report

func ListBaselines

func ListBaselines() ([]string, error)

ListBaselines returns a list of available baseline versions

func LoadBenchmarkTags

func LoadBenchmarkTags(dir string) map[string][]string

LoadBenchmarkTags reads every YAML in dir and returns a map of benchmark ID -> tag list. Benchmarks with unreadable specs are skipped silently — LoadSpec already warns on unknown tags.

func RefusalPatterns

func RefusalPatterns() []string

RefusalPatterns returns a copy of the refusal pattern list for tests and documentation. Exported so the M4 acceptance test can assert the ≥4 patterns invariant without reaching into package internals.
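
A minimal sketch of that assertion as a standard Go test; the test name is illustrative:

func TestRefusalPatternsCount(t *testing.T) {
	// RefusalPatterns returns a copy, so callers cannot mutate the real list.
	if got := len(RefusalPatterns()); got < 4 {
		t.Fatalf("expected at least 4 refusal patterns, got %d", got)
	}
}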

Types

type AILANGWin

type AILANGWin struct {
	ID    string `json:"id"`
	Model string `json:"model"`
}

AILANGWin names a (benchmark, model) cell where AILANG passed and Python failed — the atom of the AILANG-only-wins report.

type AILANGWinsReport

type AILANGWinsReport struct {
	Wins         []AILANGWin    `json:"wins"`
	PerBenchmark map[string]int `json:"per_benchmark"` // benchmark -> distinct models winning
	Patterns     []string       `json:"patterns"`      // benchmarks with ≥3 models winning
}

AILANGWinsReport aggregates wins at the cell level plus a pattern list of benchmarks where ≥3 distinct models agree that AILANG wins.

func DetectAILANGOnlyWins

func DetectAILANGOnlyWins(results []*BenchmarkResult) *AILANGWinsReport

DetectAILANGOnlyWins finds cells where AILANG passes and Python fails for the same (benchmark, model). A benchmark is a "pattern" win when ≥3 distinct models agree on it. If either language refused at a cell, the whole cell is dropped — a Python refusal masquerading as a failure would otherwise produce a false-positive win.
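
A sketch of a typical call, with an illustrative baseline path:

results, err := LoadResults("eval_results/baselines/v0.14.0")
if err != nil {
	log.Fatal(err)
}
report := DetectAILANGOnlyWins(results)
for _, id := range report.Patterns {
	fmt.Printf("%s: %d models agree AILANG wins\n", id, report.PerBenchmark[id])
}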

type Aggregates

type Aggregates struct {
	ZeroShotSuccess   float64 `json:"0-shot_success"`      // First attempt success rate
	FinalSuccess      float64 `json:"final_success"`       // After repair success rate
	RepairUsed        int     `json:"repair_used"`         // Number of repairs attempted
	RepairSuccessRate float64 `json:"repair_success_rate"` // Repair success rate
	TotalTokens       int     `json:"total_tokens"`
	TotalCostUSD      float64 `json:"total_cost_usd"`
	AvgDurationMs     float64 `json:"avg_duration_ms"`
}

Aggregates contains overall performance statistics

type Baseline

type Baseline struct {
	Version         string             `json:"version"`
	Timestamp       time.Time          `json:"timestamp"`
	Model           string             `json:"model"`
	Languages       string             `json:"languages"`
	SelfRepair      bool               `json:"self_repair"`
	TotalBenchmarks int                `json:"total_benchmarks"`
	SuccessCount    int                `json:"success_count"`
	FailCount       int                `json:"fail_count"`
	MatrixFile      string             `json:"matrix_file"`
	GitCommit       string             `json:"git_commit"`
	GitBranch       string             `json:"git_branch"`
	Results         []*BenchmarkResult `json:"-"` // Loaded separately
}

Baseline represents a stored baseline with metadata

func GetLatestBaseline

func GetLatestBaseline() (*Baseline, error)

GetLatestBaseline returns the most recent baseline version

func LoadBaseline

func LoadBaseline(dir string) (*Baseline, error)

LoadBaseline loads a baseline from a directory. Expects baseline.json metadata + result JSON files.

func LoadBaselineByVersion

func LoadBaselineByVersion(version string) (*Baseline, error)

LoadBaselineByVersion loads a baseline by version name. Looks in eval_results/baselines/<version>.

func LoadBaselineFromChain

func LoadBaselineFromChain(chainID string) (*Baseline, error)

LoadBaselineFromChain creates a Baseline from a chain's stages.

type BenchmarkChange

type BenchmarkChange struct {
	ID             string
	Lang           string
	Model          string
	BaselineStatus bool // true = passing, false = failing
	NewStatus      bool
	BaselineError  string
	NewError       string
}

BenchmarkChange represents a benchmark that changed status

func FindImprovements

func FindImprovements(baseline, new []*BenchmarkResult) ([]*BenchmarkChange, error)

FindImprovements returns only benchmarks that were fixed

func FindRegressions

func FindRegressions(baseline, new []*BenchmarkResult) ([]*BenchmarkChange, error)

FindRegressions returns only benchmarks that broke

type BenchmarkResult

type BenchmarkResult struct {
	ID            string    `json:"id"`
	Lang          string    `json:"lang"`
	Model         string    `json:"model"`
	Executor      string    `json:"executor,omitempty"` // Executor used: "claude", "gemini", etc. (agent mode)
	Seed          int64     `json:"seed"`
	InputTokens   int       `json:"input_tokens"`
	OutputTokens  int       `json:"output_tokens"`
	TotalTokens   int       `json:"total_tokens"`
	CostUSD       float64   `json:"cost_usd"`
	CompileOk     bool      `json:"compile_ok"`
	RuntimeOk     bool      `json:"runtime_ok"`
	StdoutOk      bool      `json:"stdout_ok"`
	DurationMs    int64     `json:"duration_ms"`
	CompileMs     int64     `json:"compile_ms"`
	ExecuteMs     int64     `json:"execute_ms"`
	ErrorCategory string    `json:"error_category"`
	Stdout        string    `json:"stdout,omitempty"`
	Stderr        string    `json:"stderr,omitempty"`
	Timestamp     time.Time `json:"timestamp"`
	Code          string    `json:"code,omitempty"`

	// Self-repair metrics (M-EVAL-LOOP)
	FirstAttemptOk  bool   `json:"first_attempt_ok"`
	RepairUsed      bool   `json:"repair_used"`
	RepairOk        bool   `json:"repair_ok"`
	ErrCode         string `json:"err_code,omitempty"`
	RepairTokensIn  int    `json:"repair_tokens_in,omitempty"`
	RepairTokensOut int    `json:"repair_tokens_out,omitempty"`

	// Prompt versioning
	PromptVersion string `json:"prompt_version,omitempty"`

	// Agent evaluation metrics (M-EVAL-AGENT)
	EvalMode        string `json:"eval_mode,omitempty"`        // "standard" or "agent"
	Condition       string `json:"condition,omitempty"`        // Experimental condition: "baseline", "agent_prompt", etc.
	AgentTurns      int    `json:"agent_turns,omitempty"`      // Number of conversation turns
	AgentTranscript string `json:"agent_transcript,omitempty"` // Full session log

	// Reproducibility
	BinaryHash string   `json:"binary_hash,omitempty"`
	StdlibHash string   `json:"stdlib_hash,omitempty"`
	Caps       []string `json:"caps,omitempty"`

	// Cross-harness comparison (M-EVAL-CROSS-HARNESS)
	// Logical model family for grouping paired harness results.
	// e.g. "claude-sonnet-4-6" shared by "claude" and "opencode" executors.
	ModelFamily string `json:"model_family,omitempty"`

	// Refusal detection (M-EVAL-SUITE-PREP M4): populated at load time
	// by DetectRefusal() scanning stdout+stderr. Not written by eval_harness,
	// purely a read-side annotation so historical results inherit it.
	RefusalDetected bool `json:"refusal_detected,omitempty"`
}

BenchmarkResult represents the result of a single benchmark execution. This mirrors the JSON structure from internal/eval_harness/metrics.go.

func Filter

func Filter(results []*BenchmarkResult, filter ResultFilter) []*BenchmarkResult

Filter applies the filter to results

func LoadLatestResultsPerModel

func LoadLatestResultsPerModel() ([]*BenchmarkResult, map[string]string, error)

LoadLatestResultsPerModel aggregates results from multiple baselines, keeping the latest result for each model. Returns results and a map of model -> baseline version used
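
A sketch of how the returned map feeds GenerateMatrixWithBaselines; the version label is illustrative:

results, modelBaselines, err := LoadLatestResultsPerModel()
if err != nil {
	log.Fatal(err)
}
matrix, err := GenerateMatrixWithBaselines(results, "v0.14.2", modelBaselines)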

func LoadResult

func LoadResult(path string) (*BenchmarkResult, error)

LoadResult loads a single benchmark result from a JSON file

func LoadResults

func LoadResults(dir string) ([]*BenchmarkResult, error)

LoadResults loads all benchmark results from a directory. Returns results sorted by timestamp (newest first). Recursively searches all subdirectories for .json files.

func LoadResultsFromChain

func LoadResultsFromChain(chainID string) ([]*BenchmarkResult, error)

LoadResultsFromChain loads all benchmark results from a chain's stages, converting EvalAssessment data into BenchmarkResult format. This allows the entire downstream pipeline (GenerateMatrix, ExportBenchmarkJSON, FormatComparison) to work unchanged.

func LoadResultsFromLatestEvalChain

func LoadResultsFromLatestEvalChain() ([]*BenchmarkResult, string, error)

LoadResultsFromLatestEvalChain finds the most recent eval_suite chain and loads its results.

func (*BenchmarkResult) ToSummaryEntry

func (r *BenchmarkResult) ToSummaryEntry() *SummaryEntry

ToSummaryEntry converts a BenchmarkResult to a SummaryEntry for JSONL export

type BenchmarkRun

type BenchmarkRun struct {
	Success        bool `json:"success"`
	FirstAttemptOk bool `json:"first_attempt_ok"`
	RepairUsed     bool `json:"repair_used"`
	Tokens         int  `json:"tokens"`
}

BenchmarkRun contains single benchmark execution stats

type BenchmarkStats

type BenchmarkStats struct {
	TotalRuns   int      `json:"total_runs"`
	SuccessRate float64  `json:"success_rate"`
	AvgTokens   float64  `json:"avg_tokens"`
	Languages   []string `json:"languages"`
}

BenchmarkStats contains per-benchmark performance

type ComparisonReport

type ComparisonReport struct {
	BaselineLabel string
	NewLabel      string
	Baseline      *Baseline
	New           *Baseline

	// Changes
	Fixed         []*BenchmarkChange
	Broken        []*BenchmarkChange
	StillPassing  []*BenchmarkResult
	StillFailing  []*BenchmarkResult
	NewBenchmarks []*BenchmarkResult
	Removed       []*BenchmarkResult

	// Aggregates
	BaselineSuccessRate float64
	NewSuccessRate      float64
	SuccessRateDelta    float64
	TotalBaselineBench  int
	TotalNewBench       int
}

ComparisonReport contains structured diff between two benchmark runs

func Compare

func Compare(baseline, new []*BenchmarkResult, baselineLabel, newLabel string) (*ComparisonReport, error)

Compare compares two sets of benchmark results and produces a detailed report

func CompareBaselines

func CompareBaselines(baseline, new *Baseline) (*ComparisonReport, error)

CompareBaselines compares two baselines (with metadata)

func (*ComparisonReport) HasImprovements

func (r *ComparisonReport) HasImprovements() bool

HasImprovements checks if there are any improvements

func (*ComparisonReport) HasRegressions

func (r *ComparisonReport) HasRegressions() bool

HasRegressions checks if there are any regressions

func (*ComparisonReport) ImprovementPercent

func (r *ComparisonReport) ImprovementPercent() float64

ImprovementPercent returns the improvement as a percentage

func (*ComparisonReport) NetChange

func (r *ComparisonReport) NetChange() int

NetChange returns the net change in passing benchmarks

func (*ComparisonReport) Summary

func (r *ComparisonReport) Summary() string

Summary returns a one-line summary of the comparison

type DashboardJSON

type DashboardJSON struct {
	Version    string                   `json:"version"`
	Timestamp  string                   `json:"timestamp"`
	TotalRuns  int                      `json:"totalRuns"`
	Aggregates map[string]interface{}   `json:"aggregates"`
	Tiers      map[string]TierAggregate `json:"tiers,omitempty"` // Per-tier aggregates: smoke/core/stretch/vision
	// M-DASH-V2: per-tag aggregates (12 canonical tags) with per-model
	// cross-sections so the dashboard can narrow the charts to a tag.
	Tags        map[string]*TagAggregate `json:"tags,omitempty"`
	Models      map[string]interface{}   `json:"models"`
	AgentModels map[string]interface{}   `json:"agentModels,omitempty"` // Agent-only models (separate from standard)
	Benchmarks  map[string]interface{}   `json:"benchmarks"`
	Languages   map[string]interface{}   `json:"languages"` // map[language]->stats
	Executors   map[string]interface{}   `json:"executors"` // map[executor]->agent stats (claude, gemini)
	// M-BENCHMARK-SECTION: harness-grouped aggregates for cross-harness comparison page.
	// Keys are agent_cli values ("claude", "gemini", "opencode", "codex").
	Harnesses map[string]interface{} `json:"harnesses,omitempty"`
	History   []HistoryEntry         `json:"history"`
	// M-DASH-V2: suite-change annotations rendered as ReferenceLine on every
	// time-series chart. Sourced from benchmarks/events.yml.
	Events []SuiteEvent `json:"events,omitempty"`
}

DashboardJSON represents the structure of docs/static/benchmarks/latest.json. This is the single source of truth for the dashboard frontend.

func (*DashboardJSON) Validate

func (d *DashboardJSON) Validate() error

Validate checks if a DashboardJSON structure is valid
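
A sketch of validating an existing dashboard file before reuse, assuming the encoding/json and os imports; the path is illustrative:

data, err := os.ReadFile("docs/static/benchmarks/latest.json")
if err != nil {
	log.Fatal(err)
}
var dash DashboardJSON
if err := json.Unmarshal(data, &dash); err != nil {
	log.Fatal(err)
}
if err := dash.Validate(); err != nil {
	log.Fatalf("dashboard JSON is invalid: %v", err)
}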

type ErrorCodeStats

type ErrorCodeStats struct {
	Code          string  `json:"code"`
	Count         int     `json:"count"`
	RepairSuccess float64 `json:"repair_success"`
}

ErrorCodeStats contains per-error-code statistics

type ExecutorLangStats

type ExecutorLangStats struct {
	Runs    int
	Success int
	Turns   int
	Tokens  int
	Cost    float64
}

ExecutorLangStats exposes the counts the language aggregate loop needs to surface per-executor agent metrics alongside the default ones.

type ExportFormat

type ExportFormat string

ExportFormat represents the output format for reports

const (
	FormatMarkdown ExportFormat = "markdown"
	FormatHTML     ExportFormat = "html"
	FormatCSV      ExportFormat = "csv"
)

type HistoryEntry

type HistoryEntry struct {
	Version       string                 `json:"version"`
	Timestamp     string                 `json:"timestamp"`
	SuccessRate   float64                `json:"successRate"`
	TotalRuns     int                    `json:"totalRuns"`
	SuccessCount  int                    `json:"successCount"`
	Languages     string                 `json:"languages"`
	LanguageStats map[string]interface{} `json:"languageStats,omitempty"`
	ModelStats    map[string]interface{} `json:"modelStats,omitempty"` // Per-model, per-language stats for trend charts
	// M-DASH-V2: per-tier snapshots. Lets the time-series chart filter to
	// one tier retroactively (pre-v0.14.0 baselines use the CURRENT tier
	// mapping — docs describe this as an approximation).
	Tiers map[string]*TierHistoryPoint `json:"tiers,omitempty"`
}

HistoryEntry represents a single version's data in the history array

type LanguageStats

type LanguageStats struct {
	TotalRuns   int     `json:"total_runs"`
	SuccessRate float64 `json:"success_rate"`
	AvgTokens   float64 `json:"avg_tokens"`
}

LanguageStats contains per-language performance

type ModelDimensionStats

type ModelDimensionStats struct {
	SuccessRate   float64 `json:"successRate"`
	TotalRuns     int     `json:"totalRuns"`
	AvgTokens     float64 `json:"avgTokens"`
	APIErrorCount int     `json:"apiErrorCount,omitempty"`
	RefusalCount  int     `json:"refusalCount,omitempty"`
}

ModelDimensionStats is the per-(model, language) cross-section used in both TierAggregate.ModelStats and TagAggregate.ModelStats. Shape matches what the time-series chart reads from history.modelStats[model][lang] so the frontend can swap data sources cleanly.

type ModelReliability

type ModelReliability struct {
	APIErrorCount int     `json:"apiErrorCount"`
	APIErrorRate  float64 `json:"apiErrorRate"`
	RefusalCount  int     `json:"refusalCount"`
	RefusalRate   float64 `json:"refusalRate"`
	TotalRuns     int     `json:"totalRuns"`
	// Per-language api-error counts. AILANGAPIError/PythonAPIError kept for
	// backward compatibility; LanguageAPIErrors covers all eval languages.
	AILANGAPIError    int            `json:"ailangApiError,omitempty"`
	PythonAPIError    int            `json:"pythonApiError,omitempty"`
	LanguageAPIErrors map[string]int `json:"language_api_errors,omitempty"`
}

ModelReliability is the per-model counterpart: useful for the hover breakdown on the reliability card ("gemini-3-1-pro: 13/33 api errors").

type ModelStats

type ModelStats struct {
	TotalRuns       int                       `json:"total_runs"`
	Aggregates      Aggregates                `json:"aggregates"`
	Benchmarks      map[string]*BenchmarkRun  `json:"benchmarks"`
	BaselineVersion string                    `json:"baseline_version,omitempty"` // Which baseline these results came from
	Languages       map[string]*LanguageStats `json:"languages,omitempty"`        // Per-language breakdown for this model
}

ModelStats contains per-model performance

type PerformanceMatrix

type PerformanceMatrix struct {
	Version   string    `json:"version"`
	Timestamp time.Time `json:"timestamp"`
	TotalRuns int       `json:"total_runs"`

	// Overall aggregates
	Aggregates Aggregates `json:"aggregates"`

	// Breakdown by dimension
	Models         map[string]*ModelStats     `json:"models"`
	Benchmarks     map[string]*BenchmarkStats `json:"benchmarks"`
	ErrorCodes     []*ErrorCodeStats          `json:"error_codes"`
	Languages      map[string]*LanguageStats  `json:"languages"`
	PromptVersions map[string]*PromptStats    `json:"prompt_versions,omitempty"`
}

PerformanceMatrix contains aggregated performance data

func GenerateMatrix

func GenerateMatrix(results []*BenchmarkResult, version string) (*PerformanceMatrix, error)

GenerateMatrix generates a performance matrix from benchmark results. This replaces the brittle jq-based bash script with type-safe Go code.

func GenerateMatrixWithBaselines

func GenerateMatrixWithBaselines(results []*BenchmarkResult, version string, modelBaselines map[string]string) (*PerformanceMatrix, error)

GenerateMatrixWithBaselines generates a performance matrix with optional baseline version info per model

type PromptStats

type PromptStats struct {
	TotalRuns       int     `json:"total_runs"`
	ZeroShotSuccess float64 `json:"0-shot_success"`
	FinalSuccess    float64 `json:"final_success"`
	AvgTokens       float64 `json:"avg_tokens"`
}

PromptStats contains per-prompt-version performance

type ReliabilityCounts

type ReliabilityCounts struct {
	APIErrorCount int                          `json:"apiErrorCount"`
	APIErrorRate  float64                      `json:"apiErrorRate"`
	RefusalCount  int                          `json:"refusalCount"`
	RefusalRate   float64                      `json:"refusalRate"`
	PerModel      map[string]*ModelReliability `json:"perModel,omitempty"`
}

ReliabilityCounts is a small bag of counters surfaced at the top level of DashboardJSON.aggregates so the "API Reliability" card can render without drilling into tiers/models.

type ResultFilter

type ResultFilter struct {
	Model        string
	Lang         string
	Benchmark    string
	SuccessOnly  bool
	FailuresOnly bool
}

ResultFilter specifies the criteria Filter uses to match results.

type SaturatedBenchmark

type SaturatedBenchmark struct {
	ID            string   `json:"id"`
	BaselinesSeen []string `json:"baselines_seen"` // versions contributing
	TotalCells    int      `json:"total_cells"`    // model × lang pairs
}

SaturatedBenchmark names a benchmark that hit 100% pass across every model × language pair in the considered baselines.

func DetectSaturation

func DetectSaturation(baselines []*Baseline, minBaselines int) []*SaturatedBenchmark

DetectSaturation returns benchmarks that pass 100% across every (model, language) cell in all considered baselines. Only baselines with ≥1 AILANG result are considered, to avoid "saturated" Python-only baselines reporting spurious wins.

If fewer than minBaselines baselines are available, saturation is computed over the ones that exist — better to return partial data with a clear "BaselinesSeen" list than nothing.
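
A sketch of feeding it every stored baseline; the minBaselines value of 3 is illustrative:

versions, err := ListBaselines()
if err != nil {
	log.Fatal(err)
}
var baselines []*Baseline
for _, v := range versions {
	b, err := LoadBaselineByVersion(v)
	if err != nil {
		continue // skip baselines that fail to load
	}
	baselines = append(baselines, b)
}
for _, s := range DetectSaturation(baselines, 3) {
	fmt.Printf("%s: 100%% across %d cells in %v\n", s.ID, s.TotalCells, s.BaselinesSeen)
}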

type SuiteEvent

type SuiteEvent struct {
	Version      string   `json:"version" yaml:"version"`
	Label        string   `json:"label" yaml:"label"`
	Kind         string   `json:"kind" yaml:"kind"` // "benchmark_add" | "benchmark_remove" | "taxonomy" | "prompt"
	Color        string   `json:"color,omitempty" yaml:"color,omitempty"`
	AffectsTiers []string `json:"affects_tiers,omitempty" yaml:"affects_tiers,omitempty"` // if set, event only renders when one of these tiers is selected
}

SuiteEvent is a timeline annotation (benchmark additions, taxonomy changes, etc.) loaded from benchmarks/events.yml. Rendered as a dashed ReferenceLine on every time-series chart.

func LoadSuiteEvents

func LoadSuiteEvents(path string) ([]SuiteEvent, error)

LoadSuiteEvents reads benchmarks/events.yml (or any path) and returns the parsed timeline annotations. Missing file returns an empty slice rather than an error — events are optional.
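
A sketch of the typical call; a missing events.yml simply yields no annotations:

events, err := LoadSuiteEvents("benchmarks/events.yml")
if err != nil {
	log.Fatal(err) // parse errors only; a missing file is not an error
}
fmt.Printf("%d suite events loaded\n", len(events))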

type SummaryEntry

type SummaryEntry struct {
	ID             string  `json:"id"`
	Lang           string  `json:"lang"`
	Model          string  `json:"model"`
	Executor       string  `json:"executor,omitempty"` // Executor used: "claude", "gemini" (agent mode)
	Seed           int64   `json:"seed"`
	PromptVersion  string  `json:"prompt_version,omitempty"`
	FirstAttemptOk bool    `json:"first_attempt_ok"`
	RepairUsed     bool    `json:"repair_used"`
	RepairOk       bool    `json:"repair_ok"`
	ErrCode        string  `json:"err_code,omitempty"`
	CompileOk      bool    `json:"compile_ok"`
	RuntimeOk      bool    `json:"runtime_ok"`
	StdoutOk       bool    `json:"stdout_ok"`
	ErrorCategory  string  `json:"error_category"`
	InputTokens    int     `json:"input_tokens"`
	OutputTokens   int     `json:"output_tokens"`
	TotalTokens    int     `json:"total_tokens"`
	CostUSD        float64 `json:"cost_usd"`
	DurationMs     int64   `json:"duration_ms"`
	Timestamp      string  `json:"timestamp"`
	Stderr         string  `json:"stderr,omitempty"`
	// Agent evaluation fields (M-EVAL-AGENT)
	EvalMode   string `json:"eval_mode,omitempty"`   // "standard" or "agent"
	Condition  string `json:"condition,omitempty"`   // Experimental condition: "baseline", "agent_prompt", etc.
	AgentTurns int    `json:"agent_turns,omitempty"` // Number of conversation turns
}

SummaryEntry is a simplified record for JSONL export

func (*SummaryEntry) MarshalJSON

func (s *SummaryEntry) MarshalJSON() ([]byte, error)

MarshalJSON implements custom JSON marshaling for JSONL (single-line)

type TagAggregate

type TagAggregate struct {
	Tag         string  `json:"tag"`
	AILANGPass  int     `json:"ailang_pass"`
	AILANGTotal int     `json:"ailang_total"`
	PythonPass  int     `json:"python_pass"`
	PythonTotal int     `json:"python_total"`
	Delta       float64 `json:"delta"` // ailangRate - pythonRate; kept for backward compat
	// LanguageBreakdown contains pass/total for ALL eval languages
	// (python, ailang, javascript, go, …). The typed AILANG*/Python* fields
	// above remain for backward compatibility.
	LanguageBreakdown map[string]*TagLangStats `json:"language_breakdown,omitempty"`
	// M-DASH-V2: unique benchmark IDs carrying this tag (useful for the
	// UI "N benchmarks in tag" chip).
	BenchmarkCount int `json:"benchmark_count,omitempty"`
	// M-DASH-V2: per-model cross-section so the dashboard can render
	// per-model bars filtered to this tag. Outer key is model name,
	// inner key is language.
	ModelStats map[string]map[string]*ModelDimensionStats `json:"model_stats,omitempty"`
}

TagAggregate summarises pass/total counts for one tag, per language, plus the AILANG vs Python delta in [-1,1].

type TagLangStats added in v0.14.2

type TagLangStats struct {
	Pass  int     `json:"pass"`
	Total int     `json:"total"`
	Rate  float64 `json:"rate"`
}

TagLangStats holds per-language pass/total inside a TagAggregate.

type TagReport

type TagReport struct {
	Tags       []string                 `json:"tags"`
	Aggregates map[string]*TagAggregate `json:"aggregates"`
}

TagReport is the output of GroupByTags: a sorted tag list plus the per-tag aggregates.

func GroupByTags

func GroupByTags(results []*BenchmarkResult, tags map[string][]string) *TagReport

GroupByTags builds a TagReport from benchmark results and the tag index from LoadBenchmarkTags. Results flagged RefusalDetected are excluded from pass/total counts so refusals do not inflate failure rates for every tag the benchmark happened to carry.

Each benchmark contributes to every tag it carries; a (benchmark, lang, model) run is one unit, so a benchmark tagged adt_pattern_match + recursion counts once in each column.
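
A sketch combining LoadBenchmarkTags and GroupByTags; the spec directory and baseline path are illustrative:

tags := LoadBenchmarkTags("benchmarks")
results, err := LoadResults("eval_results/baselines/v0.14.0")
if err != nil {
	log.Fatal(err)
}
report := GroupByTags(results, tags)
for _, tag := range report.Tags {
	agg := report.Aggregates[tag]
	fmt.Printf("%s: AILANG %d/%d vs Python %d/%d (delta %+.2f)\n",
		tag, agg.AILANGPass, agg.AILANGTotal, agg.PythonPass, agg.PythonTotal, agg.Delta)
}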

type TierAggregate

type TierAggregate struct {
	TotalRuns         int     `json:"total_runs"`
	AILANGRuns        int     `json:"ailang_runs"`
	PythonRuns        int     `json:"python_runs"`
	AILANGSuccessRate float64 `json:"ailang_success_rate"`
	PythonSuccessRate float64 `json:"python_success_rate"`
	BenchmarkCount    int     `json:"benchmark_count"` // unique benchmark IDs in this tier

	// Generic per-language breakdown — includes all eval languages (python,
	// ailang, javascript, go, …). The typed AILANG*/Python* fields above
	// remain for backward compatibility with existing dashboard consumers.
	LanguageStats map[string]*TierLanguageStats `json:"language_stats,omitempty"`

	// M-DASH-V2: per-tier × per-model breakdown so charts can filter
	// time-series data to this tier. Outer key is model name, inner key is
	// language. Nil when the tier has no runs.
	ModelStats map[string]map[string]*ModelDimensionStats `json:"model_stats,omitempty"`

	// M-DASH-V2: API reliability per tier. Splits by language so dashboards
	// can show "how many gemini-3-1-pro AILANG runs on core tier returned
	// api_error?" separately from Python.
	APIErrorCount  int `json:"api_error_count"`
	AILANGAPIError int `json:"ailang_api_error"`
	PythonAPIError int `json:"python_api_error"`

	// M-DASH-V2: refusal count per tier (RefusalDetected at load time).
	RefusalCount int `json:"refusal_count"`

	// M-DASH-V2: self-repair efficacy and cost for this tier. RepairDelta =
	// final pass rate − first-attempt pass rate; answers "does self-repair
	// help more on hard tiers?". AvgCostUSD split by language lets callers
	// tell whether stretch is 3× pricier on AILANG specifically.
	AILANGRepairDelta float64 `json:"ailang_repair_delta"`
	PythonRepairDelta float64 `json:"python_repair_delta"`
	AILANGAvgCostUSD  float64 `json:"ailang_avg_cost_usd"`
	PythonAvgCostUSD  float64 `json:"python_avg_cost_usd"`
}

TierAggregate contains per-tier pass-rate metrics. Populated by ExportBenchmarkJSON from the tier field attached to each benchmark result (resolved via the benchmark YAML's tier). The Core tier pass rate is the dashboard headline metric per M-EVAL-SUITE-PREP M6.

type TierHistoryPoint

type TierHistoryPoint struct {
	AILANGSuccessRate float64                                    `json:"ailang_success_rate"`
	PythonSuccessRate float64                                    `json:"python_success_rate"`
	AILANGRuns        int                                        `json:"ailang_runs"`
	PythonRuns        int                                        `json:"python_runs"`
	BenchmarkCount    int                                        `json:"benchmark_count"`
	ModelStats        map[string]map[string]*ModelDimensionStats `json:"modelStats,omitempty"`
	// Generic per-language breakdown for all eval languages (python, ailang,
	// javascript, go, …). The typed AILANG*/Python* fields remain for
	// backward compatibility.
	LanguageStats map[string]*TierLanguageStats `json:"language_stats,omitempty"`
}

TierHistoryPoint is a per-tier snapshot inside a single history entry. Lets PerModelTrend filter the time series to e.g. just the Core tier so the chart updates when TierToggle changes — not just the hero row.

type TierLanguageStats added in v0.14.2

type TierLanguageStats struct {
	Runs        int     `json:"runs"`
	Pass        int     `json:"pass"`
	SuccessRate float64 `json:"success_rate"`
	RepairDelta float64 `json:"repair_delta,omitempty"`
	AvgCostUSD  float64 `json:"avg_cost_usd,omitempty"`
	APIErrors   int     `json:"api_errors,omitempty"`
}

TierLanguageStats holds per-language aggregate metrics for one tier. Used in TierAggregate.LanguageStats and TierHistoryPoint.LanguageStats to surface data for all eval languages (python, ailang, javascript, go, …).
