Documentation ¶
Index ¶
- func DetectRefusal(code, stderr, stdout string) bool
- func ExportBenchmarkJSON(matrix *PerformanceMatrix, history []*Baseline, results []*BenchmarkResult, ...) (string, error)
- func ExportCSV(results []*BenchmarkResult) (string, error)
- func ExportDocusaurusMDX(matrix *PerformanceMatrix, history []*Baseline) string
- func ExportHTML(matrix *PerformanceMatrix, history []*Baseline) (string, error)
- func ExportMarkdown(matrix *PerformanceMatrix, history []*Baseline) string
- func FormatComparison(report *ComparisonReport, useColor bool) string
- func FormatJSON(matrix *PerformanceMatrix) (string, error)
- func FormatJSONL(results []*BenchmarkResult) (string, error)
- func FormatMatrix(matrix *PerformanceMatrix, useColor bool) string
- func GenerateReport(matrix *PerformanceMatrix, history []*Baseline) string
- func ListBaselines() ([]string, error)
- func LoadBenchmarkTags(dir string) map[string][]string
- func RefusalPatterns() []string
- type AILANGWin
- type AILANGWinsReport
- type Aggregates
- type Baseline
- type BenchmarkChange
- type BenchmarkResult
- func Filter(results []*BenchmarkResult, filter ResultFilter) []*BenchmarkResult
- func LoadLatestResultsPerModel() ([]*BenchmarkResult, map[string]string, error)
- func LoadResult(path string) (*BenchmarkResult, error)
- func LoadResults(dir string) ([]*BenchmarkResult, error)
- func LoadResultsFromChain(chainID string) ([]*BenchmarkResult, error)
- func LoadResultsFromLatestEvalChain() ([]*BenchmarkResult, string, error)
- type BenchmarkRun
- type BenchmarkStats
- type ComparisonReport
- type DashboardJSON
- type ErrorCodeStats
- type ExecutorLangStats
- type ExportFormat
- type HistoryEntry
- type LanguageStats
- type ModelDimensionStats
- type ModelReliability
- type ModelStats
- type PerformanceMatrix
- type PromptStats
- type ReliabilityCounts
- type ResultFilter
- type SaturatedBenchmark
- type SuiteEvent
- type SummaryEntry
- type TagAggregate
- type TagLangStats
- type TagReport
- type TierAggregate
- type TierHistoryPoint
- type TierLanguageStats
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func DetectRefusal ¶
func DetectRefusal(code, stderr, stdout string) bool
DetectRefusal returns true when any refusal pattern appears in code, stderr, or stdout. Matching is case-insensitive and substring-based, so "I CANNOT" and "i cannot" both trigger.
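A minimal sketch of how this pairs with the read-side RefusalDetected annotation on BenchmarkResult, written as if inside the package; the helper name and loop are illustrative, not the harness's actual load path:

// Sketch: mark refusals on already-loaded results.
func annotateRefusals(results []*BenchmarkResult) int {
	flagged := 0
	for _, r := range results {
		if DetectRefusal(r.Code, r.Stderr, r.Stdout) {
			r.RefusalDetected = true
			flagged++
		}
	}
	return flagged
}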
func ExportBenchmarkJSON ¶
func ExportBenchmarkJSON(matrix *PerformanceMatrix, history []*Baseline, results []*BenchmarkResult, outputPath string) (string, error)
ExportBenchmarkJSON exports benchmark data as JSON for client-side rendering
func ExportCSV ¶
func ExportCSV(results []*BenchmarkResult) (string, error)
ExportCSV generates a CSV export of benchmark results
func ExportDocusaurusMDX ¶
func ExportDocusaurusMDX(matrix *PerformanceMatrix, history []*Baseline) string
ExportDocusaurusMDX generates an MDX file with React components for Docusaurus
func ExportHTML ¶
func ExportHTML(matrix *PerformanceMatrix, history []*Baseline) (string, error)
ExportHTML generates an HTML report with Bootstrap styling
func ExportMarkdown ¶
func ExportMarkdown(matrix *PerformanceMatrix, history []*Baseline) string
ExportMarkdown generates a GitHub-flavored markdown report
func FormatComparison ¶
func FormatComparison(report *ComparisonReport, useColor bool) string
FormatComparison produces a human-readable comparison report
func FormatJSON ¶
func FormatJSON(matrix *PerformanceMatrix) (string, error)
FormatJSON converts a matrix to pretty-printed JSON
func FormatJSONL ¶
func FormatJSONL(results []*BenchmarkResult) (string, error)
FormatJSONL converts results to JSONL format (one JSON object per line)
func FormatMatrix ¶
func FormatMatrix(matrix *PerformanceMatrix, useColor bool) string
FormatMatrix produces a human-readable matrix summary
func GenerateReport ¶
func GenerateReport(matrix *PerformanceMatrix, history []*Baseline) string
GenerateReport creates a comprehensive evaluation report
func ListBaselines ¶
func ListBaselines() ([]string, error)
ListBaselines returns a list of available baseline versions
func LoadBenchmarkTags ¶
func LoadBenchmarkTags(dir string) map[string][]string
LoadBenchmarkTags reads every YAML in dir and returns a map of benchmark ID -> tag list. Benchmarks with unreadable specs are skipped silently — LoadSpec already warns on unknown tags.
func RefusalPatterns ¶
func RefusalPatterns() []string
RefusalPatterns returns a copy of the refusal pattern list for tests and documentation. Exported so the M4 acceptance test can assert the ≥4 patterns invariant without reaching into package internals.
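A hedged sketch of the kind of invariant check that comment describes; the real M4 acceptance test may look different:

func TestRefusalPatterns_Invariant(t *testing.T) {
	patterns := RefusalPatterns()
	if len(patterns) < 4 {
		t.Fatalf("expected >= 4 refusal patterns, got %d", len(patterns))
	}
	// The returned slice is a copy, so mutating it must not change detection.
	patterns[0] = "mutated"
	if !DetectRefusal("", "", "i cannot complete this request") {
		t.Error("case-insensitive substring match should flag this output")
	}
}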
Types ¶
type AILANGWin ¶
AILANGWin names a (benchmark, model) cell where AILANG passed and Python failed — the atom of the AILANG-only-wins report.
type AILANGWinsReport ¶
type AILANGWinsReport struct {
Wins []AILANGWin `json:"wins"`
PerBenchmark map[string]int `json:"per_benchmark"` // benchmark -> distinct models winning
Patterns []string `json:"patterns"` // benchmarks with ≥3 models winning
}
AILANGWinsReport aggregates wins at the cell level plus a pattern list of benchmarks where ≥3 distinct models agree that AILANG wins.
func DetectAILANGOnlyWins ¶
func DetectAILANGOnlyWins(results []*BenchmarkResult) *AILANGWinsReport
DetectAILANGOnlyWins finds cells where AILANG passes and Python fails for the same (benchmark, model). A benchmark is a "pattern" win when ≥3 distinct models agree on it. If either language refused at a cell, the whole cell is dropped — a Python refusal masquerading as a failure would otherwise produce a false-positive win.
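A short usage sketch (the results directory is hypothetical):

results, err := LoadResults("eval_results/current") // hypothetical path
if err != nil {
	log.Fatal(err)
}
report := DetectAILANGOnlyWins(results)
for _, id := range report.Patterns {
	fmt.Printf("%s: %d models agree AILANG wins\n", id, report.PerBenchmark[id])
}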
type Aggregates ¶
type Aggregates struct {
ZeroShotSuccess float64 `json:"0-shot_success"` // First attempt success rate
FinalSuccess float64 `json:"final_success"` // After repair success rate
RepairUsed int `json:"repair_used"` // Number of repairs attempted
RepairSuccessRate float64 `json:"repair_success_rate"` // Repair success rate
TotalTokens int `json:"total_tokens"`
TotalCostUSD float64 `json:"total_cost_usd"`
AvgDurationMs float64 `json:"avg_duration_ms"`
}
Aggregates contains overall performance statistics
type Baseline ¶
type Baseline struct {
Version string `json:"version"`
Timestamp time.Time `json:"timestamp"`
Model string `json:"model"`
Languages string `json:"languages"`
SelfRepair bool `json:"self_repair"`
TotalBenchmarks int `json:"total_benchmarks"`
SuccessCount int `json:"success_count"`
FailCount int `json:"fail_count"`
MatrixFile string `json:"matrix_file"`
GitCommit string `json:"git_commit"`
GitBranch string `json:"git_branch"`
Results []*BenchmarkResult `json:"-"` // Loaded separately
}
Baseline represents a stored baseline with metadata
func GetLatestBaseline ¶
GetLatestBaseline returns the most recent baseline version
func LoadBaseline ¶
LoadBaseline loads a baseline from a directory. Expects baseline.json metadata plus result JSON files.
func LoadBaselineByVersion ¶
LoadBaselineByVersion loads a baseline by version name. Looks in eval_results/baselines/<version>.
func LoadBaselineFromChain ¶
LoadBaselineFromChain creates a Baseline from a chain's stages.
type BenchmarkChange ¶
type BenchmarkChange struct {
ID string
Lang string
Model string
BaselineStatus bool // true = passing, false = failing
NewStatus bool
BaselineError string
NewError string
}
BenchmarkChange represents a benchmark that changed status
func FindImprovements ¶
func FindImprovements(baseline, new []*BenchmarkResult) ([]*BenchmarkChange, error)
FindImprovements returns only benchmarks that were fixed
func FindRegressions ¶
func FindRegressions(baseline, new []*BenchmarkResult) ([]*BenchmarkChange, error)
FindRegressions returns only benchmarks that broke
type BenchmarkResult ¶
type BenchmarkResult struct {
ID string `json:"id"`
Lang string `json:"lang"`
Model string `json:"model"`
Executor string `json:"executor,omitempty"` // Executor used: "claude", "gemini", etc. (agent mode)
Seed int64 `json:"seed"`
InputTokens int `json:"input_tokens"`
OutputTokens int `json:"output_tokens"`
TotalTokens int `json:"total_tokens"`
CostUSD float64 `json:"cost_usd"`
CompileOk bool `json:"compile_ok"`
RuntimeOk bool `json:"runtime_ok"`
StdoutOk bool `json:"stdout_ok"`
DurationMs int64 `json:"duration_ms"`
CompileMs int64 `json:"compile_ms"`
ExecuteMs int64 `json:"execute_ms"`
ErrorCategory string `json:"error_category"`
Stdout string `json:"stdout,omitempty"`
Stderr string `json:"stderr,omitempty"`
Timestamp time.Time `json:"timestamp"`
Code string `json:"code,omitempty"`
// Self-repair metrics (M-EVAL-LOOP)
FirstAttemptOk bool `json:"first_attempt_ok"`
RepairUsed bool `json:"repair_used"`
RepairOk bool `json:"repair_ok"`
ErrCode string `json:"err_code,omitempty"`
RepairTokensIn int `json:"repair_tokens_in,omitempty"`
RepairTokensOut int `json:"repair_tokens_out,omitempty"`
// Prompt versioning
PromptVersion string `json:"prompt_version,omitempty"`
// Agent evaluation metrics (M-EVAL-AGENT)
EvalMode string `json:"eval_mode,omitempty"` // "standard" or "agent"
Condition string `json:"condition,omitempty"` // Experimental condition: "baseline", "agent_prompt", etc.
AgentTurns int `json:"agent_turns,omitempty"` // Number of conversation turns
AgentTranscript string `json:"agent_transcript,omitempty"` // Full session log
// Reproducibility
BinaryHash string `json:"binary_hash,omitempty"`
StdlibHash string `json:"stdlib_hash,omitempty"`
Caps []string `json:"caps,omitempty"`
// Cross-harness comparison (M-EVAL-CROSS-HARNESS)
// Logical model family for grouping paired harness results.
// e.g. "claude-sonnet-4-6" shared by "claude" and "opencode" executors.
ModelFamily string `json:"model_family,omitempty"`
// Refusal detection (M-EVAL-SUITE-PREP M4): populated at load time
// by DetectRefusal() scanning stdout+stderr. Not written by eval_harness,
// purely a read-side annotation so historical results inherit it.
RefusalDetected bool `json:"refusal_detected,omitempty"`
}
BenchmarkResult represents the result of a single benchmark execution. This mirrors the JSON structure from internal/eval_harness/metrics.go.
func Filter ¶
func Filter(results []*BenchmarkResult, filter ResultFilter) []*BenchmarkResult
Filter applies the filter to results
func LoadLatestResultsPerModel ¶
func LoadLatestResultsPerModel() ([]*BenchmarkResult, map[string]string, error)
LoadLatestResultsPerModel aggregates results from multiple baselines, keeping the latest result for each model. Returns results and a map of model -> baseline version used
func LoadResult ¶
func LoadResult(path string) (*BenchmarkResult, error)
LoadResult loads a single benchmark result from a JSON file
func LoadResults ¶
func LoadResults(dir string) ([]*BenchmarkResult, error)
LoadResults loads all benchmark results from a directory. Results are returned sorted by timestamp (newest first). Recursively searches all subdirectories for .json files.
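A sketch of feeding a results directory into the JSONL exporter (paths are hypothetical):

results, err := LoadResults("eval_results/v0.14.2") // hypothetical directory
if err != nil {
	log.Fatal(err)
}
jsonl, err := FormatJSONL(results)
if err != nil {
	log.Fatal(err)
}
if err := os.WriteFile("summary.jsonl", []byte(jsonl), 0o644); err != nil {
	log.Fatal(err)
}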
func LoadResultsFromChain ¶
func LoadResultsFromChain(chainID string) ([]*BenchmarkResult, error)
LoadResultsFromChain loads all benchmark results from a chain's stages, converting EvalAssessment data into BenchmarkResult format. This allows the entire downstream pipeline (GenerateMatrix, ExportBenchmarkJSON, FormatComparison) to work unchanged.
func LoadResultsFromLatestEvalChain ¶
func LoadResultsFromLatestEvalChain() ([]*BenchmarkResult, string, error)
LoadResultsFromLatestEvalChain finds the most recent eval_suite chain and loads its results.
func (*BenchmarkResult) ToSummaryEntry ¶
func (r *BenchmarkResult) ToSummaryEntry() *SummaryEntry
ToSummaryEntry converts a BenchmarkResult to a SummaryEntry for JSONL export
type BenchmarkRun ¶
type BenchmarkRun struct {
Success bool `json:"success"`
FirstAttemptOk bool `json:"first_attempt_ok"`
RepairUsed bool `json:"repair_used"`
Tokens int `json:"tokens"`
}
BenchmarkRun contains single benchmark execution stats
type BenchmarkStats ¶
type BenchmarkStats struct {
TotalRuns int `json:"total_runs"`
SuccessRate float64 `json:"success_rate"`
AvgTokens float64 `json:"avg_tokens"`
Languages []string `json:"languages"`
}
BenchmarkStats contains per-benchmark performance
type ComparisonReport ¶
type ComparisonReport struct {
BaselineLabel string
NewLabel string
Baseline *Baseline
New *Baseline
// Changes
Fixed []*BenchmarkChange
Broken []*BenchmarkChange
StillPassing []*BenchmarkResult
StillFailing []*BenchmarkResult
NewBenchmarks []*BenchmarkResult
Removed []*BenchmarkResult
// Aggregates
BaselineSuccessRate float64
NewSuccessRate float64
SuccessRateDelta float64
TotalBaselineBench int
TotalNewBench int
}
ComparisonReport contains structured diff between two benchmark runs
func Compare ¶
func Compare(baseline, new []*BenchmarkResult, baselineLabel, newLabel string) (*ComparisonReport, error)
Compare compares two sets of benchmark results and produces a detailed report
func CompareBaselines ¶
func CompareBaselines(baseline, new *Baseline) (*ComparisonReport, error)
CompareBaselines compares two baselines (with metadata)
func (*ComparisonReport) HasImprovements ¶
func (r *ComparisonReport) HasImprovements() bool
HasImprovements checks if there are any improvements
func (*ComparisonReport) HasRegressions ¶
func (r *ComparisonReport) HasRegressions() bool
HasRegressions checks if there are any regressions
func (*ComparisonReport) ImprovementPercent ¶
func (r *ComparisonReport) ImprovementPercent() float64
ImprovementPercent returns the improvement as a percentage
func (*ComparisonReport) NetChange ¶
func (r *ComparisonReport) NetChange() int
NetChange returns the net change in passing benchmarks
func (*ComparisonReport) Summary ¶
func (r *ComparisonReport) Summary() string
Summary returns a one-line summary of the comparison
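Taken together, Compare and these helpers support a CI-style regression gate. A hedged sketch, with hypothetical directories and labels:

baseline, err := LoadResults("eval_results/baselines/v0.14.1") // hypothetical
if err != nil {
	log.Fatal(err)
}
current, err := LoadResults("eval_results/current") // hypothetical
if err != nil {
	log.Fatal(err)
}
report, err := Compare(baseline, current, "v0.14.1", "HEAD")
if err != nil {
	log.Fatal(err)
}
fmt.Println(report.Summary())
if report.HasRegressions() {
	fmt.Print(FormatComparison(report, false)) // plain text for CI logs
	os.Exit(1)
}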
type DashboardJSON ¶
type DashboardJSON struct {
Version string `json:"version"`
Timestamp string `json:"timestamp"`
TotalRuns int `json:"totalRuns"`
Aggregates map[string]interface{} `json:"aggregates"`
Tiers map[string]TierAggregate `json:"tiers,omitempty"` // Per-tier aggregates: smoke/core/stretch/vision
// M-DASH-V2: per-tag aggregates (12 canonical tags) with per-model
// cross-sections so the dashboard can narrow the charts to a tag.
Tags map[string]*TagAggregate `json:"tags,omitempty"`
Models map[string]interface{} `json:"models"`
AgentModels map[string]interface{} `json:"agentModels,omitempty"` // Agent-only models (separate from standard)
Benchmarks map[string]interface{} `json:"benchmarks"`
Languages map[string]interface{} `json:"languages"` // map[language]->stats
Executors map[string]interface{} `json:"executors"` // map[executor]->agent stats (claude, gemini)
// M-BENCHMARK-SECTION: harness-grouped aggregates for cross-harness comparison page.
// Keys are agent_cli values ("claude", "gemini", "opencode", "codex").
Harnesses map[string]interface{} `json:"harnesses,omitempty"`
History []HistoryEntry `json:"history"`
// M-DASH-V2: suite-change annotations rendered as ReferenceLine on every
// time-series chart. Sourced from benchmarks/events.yml.
Events []SuiteEvent `json:"events,omitempty"`
}
DashboardJSON represents the structure of docs/static/benchmarks/latest.json. This is the single source of truth for the dashboard frontend.
func (*DashboardJSON) Validate ¶
func (d *DashboardJSON) Validate() error
Validate checks if a DashboardJSON structure is valid
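A sketch of validating a generated dashboard file before publishing; the path comes from the doc comment above, and unmarshalling relies on the JSON tags shown in the struct:

data, err := os.ReadFile("docs/static/benchmarks/latest.json")
if err != nil {
	log.Fatal(err)
}
var dash DashboardJSON
if err := json.Unmarshal(data, &dash); err != nil {
	log.Fatal(err)
}
if err := dash.Validate(); err != nil {
	log.Fatalf("latest.json failed validation: %v", err)
}
fmt.Printf("%s: %d runs, %d models\n", dash.Version, dash.TotalRuns, len(dash.Models))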
type ErrorCodeStats ¶
type ErrorCodeStats struct {
Code string `json:"code"`
Count int `json:"count"`
RepairSuccess float64 `json:"repair_success"`
}
ErrorCodeStats contains per-error-code statistics
type ExecutorLangStats ¶
ExecutorLangStats exposes the counts the language aggregate loop needs to surface per-executor agent metrics alongside the default ones.
type ExportFormat ¶
type ExportFormat string
ExportFormat represents the output format for reports
const (
	FormatMarkdown ExportFormat = "markdown"
	FormatHTML     ExportFormat = "html"
	FormatCSV      ExportFormat = "csv"
)
type HistoryEntry ¶
type HistoryEntry struct {
Version string `json:"version"`
Timestamp string `json:"timestamp"`
SuccessRate float64 `json:"successRate"`
TotalRuns int `json:"totalRuns"`
SuccessCount int `json:"successCount"`
Languages string `json:"languages"`
LanguageStats map[string]interface{} `json:"languageStats,omitempty"`
ModelStats map[string]interface{} `json:"modelStats,omitempty"` // Per-model, per-language stats for trend charts
// M-DASH-V2: per-tier snapshots. Lets the time-series chart filter to
// one tier retroactively (pre-v0.14.0 baselines use the CURRENT tier
// mapping — docs describe this as an approximation).
Tiers map[string]*TierHistoryPoint `json:"tiers,omitempty"`
}
HistoryEntry represents a single version's data in the history array
type LanguageStats ¶
type LanguageStats struct {
TotalRuns int `json:"total_runs"`
SuccessRate float64 `json:"success_rate"`
AvgTokens float64 `json:"avg_tokens"`
}
LanguageStats contains per-language performance
type ModelDimensionStats ¶
type ModelDimensionStats struct {
SuccessRate float64 `json:"successRate"`
TotalRuns int `json:"totalRuns"`
AvgTokens float64 `json:"avgTokens"`
APIErrorCount int `json:"apiErrorCount,omitempty"`
RefusalCount int `json:"refusalCount,omitempty"`
}
ModelDimensionStats is the per-(model, language) cross-section used in both TierAggregate.ModelStats and TagAggregate.ModelStats. Shape matches what the time-series chart reads from history.modelStats[model][lang] so the frontend can swap data sources cleanly.
type ModelReliability ¶
type ModelReliability struct {
APIErrorCount int `json:"apiErrorCount"`
APIErrorRate float64 `json:"apiErrorRate"`
RefusalCount int `json:"refusalCount"`
RefusalRate float64 `json:"refusalRate"`
TotalRuns int `json:"totalRuns"`
// Per-language api-error counts. AILANGAPIError/PythonAPIError kept for
// backward compatibility; LanguageAPIErrors covers all eval languages.
AILANGAPIError int `json:"ailangApiError,omitempty"`
PythonAPIError int `json:"pythonApiError,omitempty"`
LanguageAPIErrors map[string]int `json:"language_api_errors,omitempty"`
}
ModelReliability is the per-model counterpart of ReliabilityCounts: useful for the hover breakdown on the reliability card ("gemini-3-1-pro: 13/33 api errors").
type ModelStats ¶
type ModelStats struct {
TotalRuns int `json:"total_runs"`
Aggregates Aggregates `json:"aggregates"`
Benchmarks map[string]*BenchmarkRun `json:"benchmarks"`
BaselineVersion string `json:"baseline_version,omitempty"` // Which baseline these results came from
Languages map[string]*LanguageStats `json:"languages,omitempty"` // Per-language breakdown for this model
}
ModelStats contains per-model performance
type PerformanceMatrix ¶
type PerformanceMatrix struct {
Version string `json:"version"`
Timestamp time.Time `json:"timestamp"`
TotalRuns int `json:"total_runs"`
// Overall aggregates
Aggregates Aggregates `json:"aggregates"`
// Breakdown by dimension
Models map[string]*ModelStats `json:"models"`
Benchmarks map[string]*BenchmarkStats `json:"benchmarks"`
ErrorCodes []*ErrorCodeStats `json:"error_codes"`
Languages map[string]*LanguageStats `json:"languages"`
PromptVersions map[string]*PromptStats `json:"prompt_versions,omitempty"`
}
PerformanceMatrix contains aggregated performance data
func GenerateMatrix ¶
func GenerateMatrix(results []*BenchmarkResult, version string) (*PerformanceMatrix, error)
GenerateMatrix generates a performance matrix from benchmark results. This replaces the brittle jq-based bash script with type-safe Go code.
func GenerateMatrixWithBaselines ¶
func GenerateMatrixWithBaselines(results []*BenchmarkResult, version string, modelBaselines map[string]string) (*PerformanceMatrix, error)
GenerateMatrixWithBaselines generates a performance matrix with optional baseline version info per model
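A sketch of the load, aggregate, and render pipeline these functions form; the version label is hypothetical, and passing nil history is an assumption about what ExportMarkdown tolerates:

results, chainID, err := LoadResultsFromLatestEvalChain()
if err != nil {
	log.Fatal(err)
}
matrix, err := GenerateMatrix(results, "v0.14.2") // hypothetical version label
if err != nil {
	log.Fatal(err)
}
fmt.Printf("chain %s\n%s", chainID, FormatMatrix(matrix, true))
md := ExportMarkdown(matrix, nil) // nil history assumed to mean "no trend section"
if err := os.WriteFile("eval_report.md", []byte(md), 0o644); err != nil {
	log.Fatal(err)
}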
type PromptStats ¶
type PromptStats struct {
TotalRuns int `json:"total_runs"`
ZeroShotSuccess float64 `json:"0-shot_success"`
FinalSuccess float64 `json:"final_success"`
AvgTokens float64 `json:"avg_tokens"`
}
PromptStats contains per-prompt-version performance
type ReliabilityCounts ¶
type ReliabilityCounts struct {
APIErrorCount int `json:"apiErrorCount"`
APIErrorRate float64 `json:"apiErrorRate"`
RefusalCount int `json:"refusalCount"`
RefusalRate float64 `json:"refusalRate"`
PerModel map[string]*ModelReliability `json:"perModel,omitempty"`
}
ReliabilityCounts is a small bag of counters surfaced at the top level of DashboardJSON.aggregates so the "API Reliability" card can render without drilling into tiers/models.
type ResultFilter ¶
type ResultFilter struct {
Model string
Lang string
Benchmark string
SuccessOnly bool
FailuresOnly bool
}
ResultFilter holds the criteria used by Filter to select matching results.
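A sketch of narrowing loaded results to one model's AILANG failures (the model and language strings are illustrative):

failures := Filter(results, ResultFilter{ // results loaded earlier via LoadResults
	Model:        "claude-sonnet-4-6", // illustrative model name
	Lang:         "ailang",
	FailuresOnly: true,
})
for _, r := range failures {
	fmt.Printf("%s: %s (%s)\n", r.ID, r.ErrorCategory, r.ErrCode)
}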
type SaturatedBenchmark ¶
type SaturatedBenchmark struct {
ID string `json:"id"`
BaselinesSeen []string `json:"baselines_seen"` // versions contributing
TotalCells int `json:"total_cells"` // model × lang pairs
}
SaturatedBenchmark names a benchmark that hit 100% pass across every model × language pair in the considered baselines.
func DetectSaturation ¶
func DetectSaturation(baselines []*Baseline, minBaselines int) []*SaturatedBenchmark
DetectSaturation returns benchmarks that pass 100% across every (model, language) cell in all considered baselines. Only baselines with ≥1 AILANG result are considered, to avoid "saturated" Python-only baselines reporting spurious wins.
If fewer than minBaselines baselines are available, saturation is computed over the ones that exist — better to return partial data with a clear "BaselinesSeen" list than nothing.
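A hedged sketch of assembling baselines for the check; this page does not show LoadBaselineByVersion's signature, so its (*Baseline, error) return shape is an assumption:

versions, err := ListBaselines()
if err != nil {
	log.Fatal(err)
}
var baselines []*Baseline
for _, v := range versions {
	b, err := LoadBaselineByVersion(v) // assumed to return (*Baseline, error)
	if err != nil {
		continue // skip unreadable baselines in this sketch
	}
	baselines = append(baselines, b)
}
for _, s := range DetectSaturation(baselines, 3) {
	fmt.Printf("%s: 100%% across %d cells (seen in %v)\n", s.ID, s.TotalCells, s.BaselinesSeen)
}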
type SuiteEvent ¶
type SuiteEvent struct {
Version string `json:"version" yaml:"version"`
Label string `json:"label" yaml:"label"`
Kind string `json:"kind" yaml:"kind"` // "benchmark_add" | "benchmark_remove" | "taxonomy" | "prompt"
Color string `json:"color,omitempty" yaml:"color,omitempty"`
AffectsTiers []string `json:"affects_tiers,omitempty" yaml:"affects_tiers,omitempty"` // if set, event only renders when one of these tiers is selected
}
SuiteEvent is a timeline annotation (benchmark additions, taxonomy changes, etc.) loaded from benchmarks/events.yml. Rendered as a dashed ReferenceLine on every time-series chart.
func LoadSuiteEvents ¶
func LoadSuiteEvents(path string) ([]SuiteEvent, error)
LoadSuiteEvents reads benchmarks/events.yml (or any path) and returns the parsed timeline annotations. Missing file returns an empty slice rather than an error — events are optional.
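A small usage sketch:

events, err := LoadSuiteEvents("benchmarks/events.yml")
if err != nil {
	log.Fatal(err)
}
for _, e := range events {
	fmt.Printf("%s [%s] %s\n", e.Version, e.Kind, e.Label)
}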
type SummaryEntry ¶
type SummaryEntry struct {
ID string `json:"id"`
Lang string `json:"lang"`
Model string `json:"model"`
Executor string `json:"executor,omitempty"` // Executor used: "claude", "gemini" (agent mode)
Seed int64 `json:"seed"`
PromptVersion string `json:"prompt_version,omitempty"`
FirstAttemptOk bool `json:"first_attempt_ok"`
RepairUsed bool `json:"repair_used"`
RepairOk bool `json:"repair_ok"`
ErrCode string `json:"err_code,omitempty"`
CompileOk bool `json:"compile_ok"`
RuntimeOk bool `json:"runtime_ok"`
StdoutOk bool `json:"stdout_ok"`
ErrorCategory string `json:"error_category"`
InputTokens int `json:"input_tokens"`
OutputTokens int `json:"output_tokens"`
TotalTokens int `json:"total_tokens"`
CostUSD float64 `json:"cost_usd"`
DurationMs int64 `json:"duration_ms"`
Timestamp string `json:"timestamp"`
Stderr string `json:"stderr,omitempty"`
// Agent evaluation fields (M-EVAL-AGENT)
EvalMode string `json:"eval_mode,omitempty"` // "standard" or "agent"
Condition string `json:"condition,omitempty"` // Experimental condition: "baseline", "agent_prompt", etc.
AgentTurns int `json:"agent_turns,omitempty"` // Number of conversation turns
}
SummaryEntry is a simplified record for JSONL export
func (*SummaryEntry) MarshalJSON ¶
func (s *SummaryEntry) MarshalJSON() ([]byte, error)
MarshalJSON implements custom JSON marshaling for JSONL (single-line)
type TagAggregate ¶
type TagAggregate struct {
Tag string `json:"tag"`
AILANGPass int `json:"ailang_pass"`
AILANGTotal int `json:"ailang_total"`
PythonPass int `json:"python_pass"`
PythonTotal int `json:"python_total"`
Delta float64 `json:"delta"` // ailangRate - pythonRate; kept for backward compat
// LanguageBreakdown contains pass/total for ALL eval languages
// (python, ailang, javascript, go, …). The typed AILANG*/Python* fields
// above remain for backward compatibility.
LanguageBreakdown map[string]*TagLangStats `json:"language_breakdown,omitempty"`
// M-DASH-V2: unique benchmark IDs carrying this tag (useful for the
// UI "N benchmarks in tag" chip).
BenchmarkCount int `json:"benchmark_count,omitempty"`
// M-DASH-V2: per-model cross-section so the dashboard can render
// per-model bars filtered to this tag. Outer key is model name,
// inner key is language.
ModelStats map[string]map[string]*ModelDimensionStats `json:"model_stats,omitempty"`
}
TagAggregate summarises pass/total counts for one tag, per language, plus the AILANG vs Python delta in [-1,1].
type TagLangStats ¶ added in v0.14.2
type TagLangStats struct {
Pass int `json:"pass"`
Total int `json:"total"`
Rate float64 `json:"rate"`
}
TagLangStats holds per-language pass/total inside a TagAggregate.
type TagReport ¶
type TagReport struct {
Tags []string `json:"tags"`
Aggregates map[string]*TagAggregate `json:"aggregates"`
}
TagReport is the output of GroupByTags: a sorted tag list plus the per-tag aggregates.
func GroupByTags ¶
func GroupByTags(results []*BenchmarkResult, tags map[string][]string) *TagReport
GroupByTags builds a TagReport from benchmark results and the tag index from LoadBenchmarkTags. Results flagged RefusalDetected are excluded from pass/total counts so refusals do not inflate failure rates for every tag the benchmark happened to carry.
Each benchmark contributes to every tag it carries; a (benchmark, lang, model) run is one unit, so a benchmark tagged adt_pattern_match + recursion counts once in each column.
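A sketch tying this to LoadBenchmarkTags; "benchmarks" is the assumed directory of YAML specs, and results are assumed to have been loaded earlier:

tags := LoadBenchmarkTags("benchmarks") // assumed spec directory
report := GroupByTags(results, tags)    // results loaded earlier via LoadResults
for _, tag := range report.Tags {
	agg := report.Aggregates[tag]
	fmt.Printf("%-24s %3d benchmarks  delta %+.2f\n", tag, agg.BenchmarkCount, agg.Delta)
}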
type TierAggregate ¶
type TierAggregate struct {
TotalRuns int `json:"total_runs"`
AILANGRuns int `json:"ailang_runs"`
PythonRuns int `json:"python_runs"`
AILANGSuccessRate float64 `json:"ailang_success_rate"`
PythonSuccessRate float64 `json:"python_success_rate"`
BenchmarkCount int `json:"benchmark_count"` // unique benchmark IDs in this tier
// Generic per-language breakdown — includes all eval languages (python,
// ailang, javascript, go, …). The typed AILANG*/Python* fields above
// remain for backward compatibility with existing dashboard consumers.
LanguageStats map[string]*TierLanguageStats `json:"language_stats,omitempty"`
// M-DASH-V2: per-tier × per-model breakdown so charts can filter
// time-series data to this tier. Outer key is model name, inner key is
// language. Nil when the tier has no runs.
ModelStats map[string]map[string]*ModelDimensionStats `json:"model_stats,omitempty"`
// M-DASH-V2: API reliability per tier. Splits by language so dashboards
// can show "how many gemini-3-1-pro AILANG runs on core tier returned
// api_error?" separately from Python.
APIErrorCount int `json:"api_error_count"`
AILANGAPIError int `json:"ailang_api_error"`
PythonAPIError int `json:"python_api_error"`
// M-DASH-V2: refusal count per tier (RefusalDetected at load time).
RefusalCount int `json:"refusal_count"`
// M-DASH-V2: self-repair efficacy and cost for this tier. RepairDelta =
// final pass rate − first-attempt pass rate; answers "does self-repair
// help more on hard tiers?". AvgCostUSD split by language lets callers
// tell whether stretch is 3× pricier on AILANG specifically.
AILANGRepairDelta float64 `json:"ailang_repair_delta"`
PythonRepairDelta float64 `json:"python_repair_delta"`
AILANGAvgCostUSD float64 `json:"ailang_avg_cost_usd"`
PythonAvgCostUSD float64 `json:"python_avg_cost_usd"`
}
TierAggregate contains per-tier pass-rate metrics. Populated by ExportBenchmarkJSON from the tier field attached to each benchmark result (resolved via the benchmark YAML's tier). The Core tier pass rate is the dashboard headline metric per M-EVAL-SUITE-PREP M6.
type TierHistoryPoint ¶
type TierHistoryPoint struct {
AILANGSuccessRate float64 `json:"ailang_success_rate"`
PythonSuccessRate float64 `json:"python_success_rate"`
AILANGRuns int `json:"ailang_runs"`
PythonRuns int `json:"python_runs"`
BenchmarkCount int `json:"benchmark_count"`
ModelStats map[string]map[string]*ModelDimensionStats `json:"modelStats,omitempty"`
// Generic per-language breakdown for all eval languages (python, ailang,
// javascript, go, …). The typed AILANG*/Python* fields remain for
// backward compatibility.
LanguageStats map[string]*TierLanguageStats `json:"language_stats,omitempty"`
}
TierHistoryPoint is a per-tier snapshot inside a single history entry. Lets PerModelTrend filter the time series to e.g. just the Core tier so the chart updates when TierToggle changes — not just the hero row.
type TierLanguageStats ¶ added in v0.14.2
type TierLanguageStats struct {
Runs int `json:"runs"`
Pass int `json:"pass"`
SuccessRate float64 `json:"success_rate"`
RepairDelta float64 `json:"repair_delta,omitempty"`
AvgCostUSD float64 `json:"avg_cost_usd,omitempty"`
APIErrors int `json:"api_errors,omitempty"`
}
TierLanguageStats holds per-language aggregate metrics for one tier. Used in TierAggregate.LanguageStats and TierHistoryPoint.LanguageStats to surface data for all eval languages (python, ailang, javascript, go, …).