eval

package
v1.4.1
Published: May 30, 2026 License: GPL-3.0 Imports: 15 Imported by: 0

Documentation

Overview

Package eval is CommitBrief's review-quality eval harness (ADR-0018).

It scores a provider's actual review output against a curated known-answer corpus and reports precision / recall / false-positive rate. The corpus lives under testdata/corpus/<name>/, one directory per fixture: input.diff (the change under review), expected.json (the answer key), and, for the deterministic tier, mock_response.json (scripted findings fed to the mock provider).

Two execution tiers (ADR-0018 §3):

  • Deterministic tier — TestEvalMockCorpus runs the corpus through the mock provider with each fixture's scripted response. It validates the harness, matcher, and scoring math, runs in plain `go test ./...`, and is therefore part of the CI gate. It does NOT measure model quality.

  • Live tier — TestEvalLive (behind the `live` build tag, like the rest of the live provider tests) runs the corpus through a real provider and prints the quality scorecard. Non-deterministic and gated; it is the source of the README quality numbers, never a CI gate.

The harness consumes provider output through the locked --json schema v1 findings[] (ADR-0014); it introduces no new output contract.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CategoryRecall

type CategoryRecall struct {
	Category string
	Caught   int
	Total    int
}

CategoryRecall is per-category recall for one expected-finding category.

type ExpectedFinding

type ExpectedFinding struct {
	ID          string          `json:"id"`
	File        string          `json:"file"`
	Line        int             `json:"line"`
	LineTol     int             `json:"line_tol,omitempty"`
	Category    string          `json:"category"`
	MinSeverity render.Severity `json:"min_severity,omitempty"`
	Summary     string          `json:"summary"`
}

ExpectedFinding is one entry in a fixture answer key — a defect the review SHOULD surface (ADR-0018 §1). Category is reporting metadata, not a match criterion: the locked findings schema carries no category field.
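
As an illustration of the JSON shape implied by the struct tags above, here is a minimal sketch that decodes a single answer-key entry. The local expectedFinding type, the decodeExpected helper, and the sample values are hypothetical; MinSeverity is modelled as a plain string because the definition of render.Severity is not shown here, and the surrounding layout of expected.json is not assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// expectedFinding mirrors the documented JSON tags of ExpectedFinding.
// Assumption: MinSeverity is a string here; render.Severity is not shown
// in this documentation.
type expectedFinding struct {
	ID          string `json:"id"`
	File        string `json:"file"`
	Line        int    `json:"line"`
	LineTol     int    `json:"line_tol,omitempty"`
	Category    string `json:"category"`
	MinSeverity string `json:"min_severity,omitempty"`
	Summary     string `json:"summary"`
}

// decodeExpected parses one answer-key entry from raw JSON.
func decodeExpected(raw string) (expectedFinding, error) {
	var ef expectedFinding
	err := json.Unmarshal([]byte(raw), &ef)
	return ef, err
}

func main() {
	// Hypothetical entry; every value is invented for illustration.
	raw := `{"id":"sqli-1","file":"db/query.go","line":42,"line_tol":2,` +
		`"category":"security","summary":"query built by string concatenation"}`
	ef, err := decodeExpected(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%s at %s:%d (tolerance %d, category %s)\n",
		ef.ID, ef.File, ef.Line, ef.LineTol, ef.Category)
}
```

Optional fields such as line_tol and min_severity decode to their zero values when omitted.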

type Fixture

type Fixture struct {
	Name             string
	Dir              string
	Language         string
	Diff             string
	Expected         []ExpectedFinding
	MustStaySilentOn []SilenceAnchor
	MockResponse     string

	// HeldOut marks the fixture as part of the generalization-only slice;
	// see answerKey.HeldOut.
	HeldOut bool
}

Fixture is one known-answer corpus entry: a diff plus its answer key. MockResponse is the scripted findings JSON used by the deterministic tier; it is empty when the fixture ships no mock_response.json.

func LoadCorpus

func LoadCorpus(root string) ([]Fixture, error)

LoadCorpus loads every fixture under root — each child directory that contains an input.diff. Fixtures are returned sorted by name so every run iterates deterministically.
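
The discovery rule described above (child directories that contain an input.diff, returned in name order) could be sketched roughly as follows. This is not the package's implementation: fixtureDirs and demoCorpus are hypothetical helpers that only list directory names and do not parse any fixture content.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// fixtureDirs returns the names of root's child directories that contain
// an input.diff, sorted by name. Hypothetical helper for illustration.
func fixtureDirs(root string) ([]string, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		_, err := os.Stat(filepath.Join(root, e.Name(), "input.diff"))
		if os.IsNotExist(err) {
			continue // not a fixture directory
		}
		if err != nil {
			return nil, err
		}
		names = append(names, e.Name())
	}
	sort.Strings(names) // explicit sort keeps iteration order deterministic
	return names, nil
}

// demoCorpus builds a throwaway corpus with two fixtures and one
// non-fixture directory, then lists it.
func demoCorpus() ([]string, error) {
	root, err := os.MkdirTemp("", "corpus")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	for _, name := range []string{"zeta", "alpha"} {
		dir := filepath.Join(root, name)
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return nil, err
		}
		diff := filepath.Join(dir, "input.diff")
		if err := os.WriteFile(diff, []byte("diff"), 0o644); err != nil {
			return nil, err
		}
	}
	if err := os.MkdirAll(filepath.Join(root, "notes"), 0o755); err != nil {
		return nil, err
	}
	return fixtureDirs(root)
}

func main() {
	names, err := demoCorpus()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names) // [alpha zeta]
}
```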

func LoadFixture

func LoadFixture(dir string) (Fixture, error)

LoadFixture reads a single corpus directory: input.diff + expected.json (required) and mock_response.json (optional, for the deterministic tier).

func (Fixture) Categories

func (fx Fixture) Categories() []string

Categories returns the distinct categories a fixture exercises: the category of each expected finding, or "clean" for a clean control (no expected findings). Used to check that the held-out slice is representative rather than concentrated in one category.
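
A minimal sketch of the behaviour described above, assuming the input is simply the list of expected-finding categories for one fixture. The categories function is hypothetical, and the sorted output order is an assumption made here for reproducibility; the documentation does not state the order the real method returns.

```go
package main

import (
	"fmt"
	"sort"
)

// categories returns the distinct categories in expected, or "clean" when
// the fixture expects no findings. Sorting is an assumption of this sketch.
func categories(expected []string) []string {
	if len(expected) == 0 {
		return []string{"clean"}
	}
	seen := make(map[string]bool)
	var out []string
	for _, c := range expected {
		if !seen[c] {
			seen[c] = true
			out = append(out, c)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(categories([]string{"security", "bug", "security"})) // [bug security]
	fmt.Println(categories(nil))                                     // [clean]
}
```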

type FixtureScore

type FixtureScore struct {
	Fixture string
	HeldOut bool // mirrors Fixture.HeldOut so a Scorecard can be split

	TruePositives  int // expected findings that were matched
	FalseNegatives int // expected findings that were missed
	FalsePositives int // produced findings that matched no expected finding

	SilenceViolations int // produced findings landing on a silence anchor
	SilenceAnchors    int // total silence anchors in the fixture

	// CaughtByCategory / MissedByCategory attribute each expected finding to
	// its category, giving a per-category recall breakdown (ADR-0018 §2).
	CaughtByCategory map[string]int
	MissedByCategory map[string]int
}

FixtureScore is the per-fixture outcome of scoring produced findings against the answer key.

func RunFixture

func RunFixture(ctx context.Context, p provider.Provider, fx Fixture, model string) (FixtureScore, error)

RunFixture runs one fixture through a provider and scores the result. An empty model uses the provider's default model.

func Score

func Score(produced []render.Finding, fx Fixture) FixtureScore

Score matches produced findings against a fixture's answer key using one-to-one greedy assignment (ADR-0018 §2) and returns the tally. The corpus is assumed fully annotated, so any produced finding that matches no expected finding counts as a false positive.
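
One way to picture one-to-one greedy assignment is the sketch below. It assumes a match means "same file and line within LineTol"; the real matcher may apply further criteria (for example MinSeverity), and the finding, expected, and greedyScore names are hypothetical stand-ins for the package's types.

```go
package main

import "fmt"

// finding and expected are simplified stand-ins for render.Finding and
// ExpectedFinding (assumption: only file and line take part in matching).
type finding struct {
	File string
	Line int
}

type expected struct {
	File    string
	Line    int
	LineTol int
}

// matches reports whether p lands on e within e's line tolerance.
func matches(p finding, e expected) bool {
	d := p.Line - e.Line
	if d < 0 {
		d = -d
	}
	return p.File == e.File && d <= e.LineTol
}

// greedyScore assigns each produced finding to the first unused expected
// finding it matches. Each expected finding can be consumed only once, so
// duplicates of the same report count as false positives.
func greedyScore(produced []finding, want []expected) (tp, fn, fp int) {
	used := make([]bool, len(want))
	for _, p := range produced {
		matched := false
		for i, e := range want {
			if !used[i] && matches(p, e) {
				used[i] = true
				matched = true
				break
			}
		}
		if matched {
			tp++
		} else {
			fp++
		}
	}
	fn = len(want) - tp
	return tp, fn, fp
}

func main() {
	produced := []finding{{"a.go", 10}, {"a.go", 11}, {"b.go", 5}}
	want := []expected{{"a.go", 10, 1}, {"c.go", 7, 0}}
	tp, fn, fp := greedyScore(produced, want)
	fmt.Printf("tp=%d fn=%d fp=%d\n", tp, fn, fp) // tp=1 fn=1 fp=2
}
```

In the example, the second a.go finding would match the first expected entry by tolerance, but that entry is already consumed, so it is tallied as a false positive.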

func (FixtureScore) FalsePositiveRate

func (s FixtureScore) FalsePositiveRate() float64

FalsePositiveRate = silence violations ÷ silence anchors. Returns 0 when the fixture defines no anchors.

func (FixtureScore) Precision

func (s FixtureScore) Precision() float64

Precision = TP / (TP + FP). A run that produced no findings is vacuously precise (returns 1), so it neither divides by zero nor drags down an aggregate.

func (FixtureScore) Recall

func (s FixtureScore) Recall() float64

Recall = TP / (TP + FN). A fixture that expects nothing (a clean diff) is fully recalled by definition (returns 1).
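
The three per-fixture metrics and their edge cases can be written out as plain functions. This is a sketch of the documented formulas only; the free-function names below are hypothetical, whereas the package exposes them as methods on FixtureScore.

```go
package main

import "fmt"

// precision is TP / (TP + FP); a run with no produced findings returns 1.
func precision(tp, fp int) float64 {
	if tp+fp == 0 {
		return 1
	}
	return float64(tp) / float64(tp+fp)
}

// recall is TP / (TP + FN); a fixture that expects nothing returns 1.
func recall(tp, fn int) float64 {
	if tp+fn == 0 {
		return 1
	}
	return float64(tp) / float64(tp+fn)
}

// falsePositiveRate is silence violations / silence anchors; a fixture
// with no anchors returns 0.
func falsePositiveRate(violations, anchors int) float64 {
	if anchors == 0 {
		return 0
	}
	return float64(violations) / float64(anchors)
}

func main() {
	fmt.Println(precision(3, 1))         // 0.75
	fmt.Println(recall(1, 1))            // 0.5
	fmt.Println(falsePositiveRate(1, 4)) // 0.25
	fmt.Println(precision(0, 0), recall(0, 0), falsePositiveRate(0, 0)) // 1 1 0
}
```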

type Scorecard

type Scorecard struct {
	Provider string
	Model    string
	Fixtures []FixtureScore
}

Scorecard aggregates fixture scores for one provider+model run.

func RunCorpus

func RunCorpus(ctx context.Context, p provider.Provider, model string, fixtures []Fixture) (Scorecard, error)

RunCorpus runs every fixture through the provider and returns a Scorecard. A fixture hitting a transient provider error (isRetriable) is retried up to corpusAttempts times with linear backoff; a non-transient failure aborts the run after the first attempt.
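
The retry policy described above might look roughly like the following sketch. The retryLinear and demoRetry helpers, the errTransient sentinel, the attempt count of 3, and the millisecond step are all assumptions for illustration; the package's isRetriable and corpusAttempts are unexported and their definitions are not shown here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical stand-in for a retriable provider error.
var errTransient = errors.New("transient provider error")

// retryLinear calls fn up to attempts times. It retries only errors that
// retriable accepts, waiting step, 2*step, 3*step, ... between attempts.
func retryLinear(ctx context.Context, attempts int, step time.Duration,
	retriable func(error) bool, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if !retriable(err) || i == attempts {
			return err // non-transient, or out of attempts
		}
		select {
		case <-time.After(time.Duration(i) * step):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

// demoRetry runs an operation that fails the first `failures` times and
// reports how many calls were made.
func demoRetry(failures int, transient bool) (calls int, err error) {
	failure := errors.New("permanent failure")
	if transient {
		failure = errTransient
	}
	isTransient := func(e error) bool { return errors.Is(e, errTransient) }
	err = retryLinear(context.Background(), 3, time.Millisecond, isTransient,
		func() error {
			calls++
			if calls <= failures {
				return failure
			}
			return nil
		})
	return calls, err
}

func main() {
	calls, err := demoRetry(2, true)
	fmt.Println(calls, err) // 3 <nil>
	calls, err = demoRetry(1, false)
	fmt.Println(calls, err) // 1 permanent failure
}
```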

func (Scorecard) CategoryRecall

func (sc Scorecard) CategoryRecall() []CategoryRecall

CategoryRecall returns recall per expected-finding category, sorted by category name for deterministic output.
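
A plausible way to derive this breakdown from the per-fixture CaughtByCategory and MissedByCategory maps is sketched below. The lower-case types and the categoryRecalls function are hypothetical mirrors of the exported ones, and the aggregation shown is an assumption about how the totals are combined.

```go
package main

import (
	"fmt"
	"sort"
)

// fixtureScore keeps only the two per-category maps of FixtureScore.
type fixtureScore struct {
	CaughtByCategory map[string]int
	MissedByCategory map[string]int
}

// categoryRecall mirrors the documented CategoryRecall struct.
type categoryRecall struct {
	Category string
	Caught   int
	Total    int
}

// categoryRecalls sums caught and missed counts across fixtures and
// returns one row per category, sorted by category name.
func categoryRecalls(scores []fixtureScore) []categoryRecall {
	caught := make(map[string]int)
	total := make(map[string]int)
	for _, s := range scores {
		for c, n := range s.CaughtByCategory {
			caught[c] += n
			total[c] += n
		}
		for c, n := range s.MissedByCategory {
			total[c] += n
		}
	}
	out := make([]categoryRecall, 0, len(total))
	for c, t := range total {
		out = append(out, categoryRecall{Category: c, Caught: caught[c], Total: t})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Category < out[j].Category })
	return out
}

func main() {
	scores := []fixtureScore{
		{CaughtByCategory: map[string]int{"security": 1}, MissedByCategory: map[string]int{"bug": 1}},
		{CaughtByCategory: map[string]int{"bug": 2}},
	}
	for _, r := range categoryRecalls(scores) {
		fmt.Printf("%s: %d/%d\n", r.Category, r.Caught, r.Total)
	}
	// bug: 2/3
	// security: 1/1
}
```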

func (Scorecard) Dev

func (sc Scorecard) Dev() Scorecard

Dev returns the tunable slice (fixtures the prompt/corpus may be tuned against). HeldOut returns the generalization-only slice (ADR-0018 §Goodhart). A change is overfitting when Dev recall rises but HeldOut recall does not.
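
The overfitting signal described above can be illustrated with a small sketch that splits scores on the HeldOut flag and compares recall before and after a change. The split, recall, and looksOverfit helpers are hypothetical, and the numbers are invented; this is one reading of the rule, not the package's own check.

```go
package main

import "fmt"

// fixtureScore keeps only the fields this sketch needs from FixtureScore.
type fixtureScore struct {
	Fixture        string
	HeldOut        bool
	TruePositives  int
	FalseNegatives int
}

// split partitions scores into the tunable (dev) and held-out slices.
func split(scores []fixtureScore) (dev, heldOut []fixtureScore) {
	for _, s := range scores {
		if s.HeldOut {
			heldOut = append(heldOut, s)
		} else {
			dev = append(dev, s)
		}
	}
	return dev, heldOut
}

// recall is the slice-wide TP / (TP + FN), or 1 when nothing is expected.
func recall(scores []fixtureScore) float64 {
	tp, fn := 0, 0
	for _, s := range scores {
		tp += s.TruePositives
		fn += s.FalseNegatives
	}
	if tp+fn == 0 {
		return 1
	}
	return float64(tp) / float64(tp+fn)
}

// looksOverfit flags a change whose dev recall rose while held-out recall
// did not.
func looksOverfit(devBefore, devAfter, heldBefore, heldAfter float64) bool {
	return devAfter > devBefore && heldAfter <= heldBefore
}

func main() {
	scores := []fixtureScore{
		{Fixture: "a", TruePositives: 3, FalseNegatives: 1},
		{Fixture: "b", HeldOut: true, TruePositives: 1, FalseNegatives: 1},
	}
	dev, held := split(scores)
	fmt.Printf("dev recall=%.2f held-out recall=%.2f\n", recall(dev), recall(held))
	fmt.Println(looksOverfit(0.60, 0.75, 0.50, 0.50)) // true
}
```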

func (Scorecard) FalsePositiveRate

func (sc Scorecard) FalsePositiveRate() float64

FalsePositiveRate is the corpus-wide silence violations ÷ silence anchors.

func (Scorecard) HeldOut

func (sc Scorecard) HeldOut() Scorecard

func (Scorecard) Precision

func (sc Scorecard) Precision() float64

Precision is the corpus-wide TP / (TP + FP).

func (Scorecard) Recall

func (sc Scorecard) Recall() float64

Recall is the corpus-wide TP / (TP + FN).

type SilenceAnchor

type SilenceAnchor struct {
	File   string `json:"file"`
	Line   int    `json:"line"`
	Reason string `json:"reason"`
}

SilenceAnchor marks a line a good review should NOT flag. A produced finding landing on (File, ~Line) is a measured false positive (ADR-0018 §2).
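
Counting silence violations could be sketched as below. The "~Line" tolerance is not specified in this documentation, so the tol parameter is an assumption, as are the simplified finding type and the silenceViolations helper.

```go
package main

import "fmt"

// finding is a simplified stand-in for render.Finding.
type finding struct {
	File string
	Line int
}

// silenceAnchor mirrors the documented SilenceAnchor fields.
type silenceAnchor struct {
	File   string
	Line   int
	Reason string
}

// silenceViolations counts produced findings that land within tol lines
// of any anchor in the same file. Each finding is counted at most once.
func silenceViolations(produced []finding, anchors []silenceAnchor, tol int) int {
	n := 0
	for _, p := range produced {
		for _, a := range anchors {
			d := p.Line - a.Line
			if d < 0 {
				d = -d
			}
			if p.File == a.File && d <= tol {
				n++
				break
			}
		}
	}
	return n
}

func main() {
	anchors := []silenceAnchor{{File: "main.go", Line: 20, Reason: "intentional fallthrough"}}
	produced := []finding{{"main.go", 21}, {"main.go", 40}}
	fmt.Println(silenceViolations(produced, anchors, 1)) // 1
}
```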
