Documentation ¶
Overview ¶
Package eval is CommitBrief's review-quality eval harness (ADR-0018).
It scores a provider's actual review output against a curated known-answer corpus and reports precision / recall / false-positive rate. The corpus lives under testdata/corpus/<name>/, one directory per fixture: input.diff (the change under review), expected.json (the answer key), and, for the deterministic tier, mock_response.json (scripted findings fed to the mock provider).
Two execution tiers (ADR-0018 §3):
Deterministic tier — TestEvalMockCorpus runs the corpus through the mock provider with each fixture's scripted response. It validates the harness, matcher, and scoring math, runs in plain `go test ./...`, and is therefore part of the CI gate. It does NOT measure model quality.
Live tier — TestEvalLive (behind the `live` build tag, like the rest of the live provider tests) runs the corpus through a real provider and prints the quality scorecard. Non-deterministic and gated; it is the source of the README quality numbers, never a CI gate.
The harness consumes provider output through the locked --json schema v1 findings[] (ADR-0014); it introduces no new output contract.
Types ¶
type CategoryRecall ¶
CategoryRecall is per-category recall for one expected-finding category.
type ExpectedFinding ¶
type ExpectedFinding struct {
	ID          string          `json:"id"`
	File        string          `json:"file"`
	Line        int             `json:"line"`
	LineTol     int             `json:"line_tol,omitempty"`
	Category    string          `json:"category"`
	MinSeverity render.Severity `json:"min_severity,omitempty"`
	Summary     string          `json:"summary"`
}
ExpectedFinding is one entry in a fixture answer key — a defect the review SHOULD surface (ADR-0018 §1). Category is reporting metadata, not a match criterion: the locked findings schema carries no category field.
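As a rough illustration of the JSON tags above, the following sketch decodes a single answer-key entry. The mirror type `expectedFinding`, the helper `decodeExpected`, and all sample values are hypothetical. `MinSeverity` is modeled as a plain string because the encoding of `render.Severity` is not shown here, and the top-level shape of expected.json is not assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// expectedFinding mirrors the JSON tags of ExpectedFinding.
// Assumption: MinSeverity is decoded as a string; the real field is a
// render.Severity whose JSON encoding is not documented in this section.
type expectedFinding struct {
	ID          string `json:"id"`
	File        string `json:"file"`
	Line        int    `json:"line"`
	LineTol     int    `json:"line_tol,omitempty"`
	Category    string `json:"category"`
	MinSeverity string `json:"min_severity,omitempty"`
	Summary     string `json:"summary"`
}

// decodeExpected parses one answer-key entry from raw JSON.
func decodeExpected(raw string) (expectedFinding, error) {
	var ef expectedFinding
	err := json.Unmarshal([]byte(raw), &ef)
	return ef, err
}

func main() {
	// Hypothetical entry; the values are illustrative only.
	raw := `{"id":"nil-deref","file":"server.go","line":42,"line_tol":2,` +
		`"category":"correctness","summary":"possible nil dereference"}`
	ef, err := decodeExpected(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s at %s:%d (tolerance %d, category %s)\n",
		ef.ID, ef.File, ef.Line, ef.LineTol, ef.Category)
}
```

Omitted optional fields (`line_tol`, `min_severity`) decode to their zero values.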
type Fixture ¶
type Fixture struct {
	Name             string
	Dir              string
	Language         string
	Diff             string
	Expected         []ExpectedFinding
	MustStaySilentOn []SilenceAnchor
	MockResponse     string
	// HeldOut marks the fixture as part of the generalization-only slice;
	// see answerKey.HeldOut.
	HeldOut bool
}
Fixture is one known-answer corpus entry: a diff plus its answer key. MockResponse is the scripted findings JSON used by the deterministic tier; it is empty when the fixture ships no mock_response.json.
func LoadCorpus ¶
LoadCorpus loads every fixture under root — each child directory that contains an input.diff. Fixtures are returned sorted by name so every run iterates deterministically.
func LoadFixture ¶
LoadFixture reads a single corpus directory: input.diff + expected.json (required) and mock_response.json (optional, for the deterministic tier).
func (Fixture) Categories ¶
Categories returns the distinct categories a fixture exercises: the category of each expected finding, or "clean" for a clean control (no expected findings). Used to check that the held-out slice is representative rather than concentrated in one category.
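A small sketch of the rule described for Categories, assuming the input is simply the list of category strings from the expected findings. The helper `categories` is hypothetical, and the sorted output order is an assumption made for stable printing.

```go
package main

import (
	"fmt"
	"sort"
)

// categories returns the distinct categories among the expected findings,
// or a single "clean" entry when the fixture expects nothing.
func categories(expected []string) []string {
	if len(expected) == 0 {
		return []string{"clean"}
	}
	seen := make(map[string]bool)
	var out []string
	for _, c := range expected {
		if !seen[c] {
			seen[c] = true
			out = append(out, c)
		}
	}
	sort.Strings(out) // assumption: sorted for deterministic output
	return out
}

func main() {
	fmt.Println(categories([]string{"security", "correctness", "security"}))
	fmt.Println(categories(nil))
}
```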
type FixtureScore ¶
type FixtureScore struct {
	Fixture           string
	HeldOut           bool // mirrors Fixture.HeldOut so a Scorecard can be split
	TruePositives     int  // expected findings that were matched
	FalseNegatives    int  // expected findings that were missed
	FalsePositives    int  // produced findings that matched no expected finding
	SilenceViolations int  // produced findings landing on a silence anchor
	SilenceAnchors    int  // total silence anchors in the fixture
	// CaughtByCategory / MissedByCategory attribute each expected finding to
	// its category, giving a per-category recall breakdown (ADR-0018 §2).
	CaughtByCategory map[string]int
	MissedByCategory map[string]int
}
FixtureScore is the per-fixture outcome of scoring produced findings against the answer key.
func RunFixture ¶
func RunFixture(ctx context.Context, p provider.Provider, fx Fixture, model string) (FixtureScore, error)
RunFixture runs one fixture through a provider and scores the result. An empty model uses the provider's default model.
func Score ¶
func Score(produced []render.Finding, fx Fixture) FixtureScore
Score matches produced findings against a fixture's answer key using one-to-one greedy assignment (ADR-0018 §2) and returns the tally. The corpus is assumed fully annotated, so any produced finding that matches no expected finding counts as a false positive.
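One way to picture one-to-one greedy assignment is the sketch below. It is not the package's matcher: it assumes a produced finding matches an expected one when the files are equal and the lines differ by at most LineTol, and it ignores MinSeverity and silence anchors. The types `expected` and `produced` and the function `greedyScore` are hypothetical.

```go
package main

import "fmt"

type expected struct {
	File    string
	Line    int
	LineTol int
}

type produced struct {
	File string
	Line int
}

// matches reports whether p lands on e within e's line tolerance.
// Assumption: file equality plus line distance is the match criterion.
func matches(p produced, e expected) bool {
	d := p.Line - e.Line
	if d < 0 {
		d = -d
	}
	return p.File == e.File && d <= e.LineTol
}

// greedyScore assigns each produced finding to the first still-unmatched
// expected finding it matches. Each expected finding is consumed at most
// once, so duplicates of the same defect count as false positives.
func greedyScore(prod []produced, exp []expected) (tp, fn, fp int) {
	used := make([]bool, len(exp))
	for _, p := range prod {
		hit := false
		for i, e := range exp {
			if !used[i] && matches(p, e) {
				used[i] = true
				hit = true
				break
			}
		}
		if hit {
			tp++
		} else {
			fp++
		}
	}
	fn = len(exp) - tp
	return tp, fn, fp
}

func main() {
	exp := []expected{{File: "a.go", Line: 10, LineTol: 2}}
	prod := []produced{{"a.go", 11}, {"a.go", 12}, {"b.go", 1}}
	tp, fn, fp := greedyScore(prod, exp)
	fmt.Println(tp, fn, fp)
}
```

Because the corpus is treated as fully annotated, the second finding on a.go is not forgiven as a near-duplicate; it is tallied as a false positive.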
func (FixtureScore) FalsePositiveRate ¶
func (s FixtureScore) FalsePositiveRate() float64
FalsePositiveRate = silence violations ÷ silence anchors. It is computed from the silence anchors, not from the FalsePositives count. Returns 0 when the fixture defines no anchors.
func (FixtureScore) Precision ¶
func (s FixtureScore) Precision() float64
Precision = TP / (TP + FP). A run that produced no findings is vacuously precise (returns 1), so it neither divides by zero nor drags down an aggregate.
func (FixtureScore) Recall ¶
func (s FixtureScore) Recall() float64
Recall = TP / (TP + FN). A fixture that expects nothing (a clean diff) is fully recalled by definition (returns 1).
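The three metrics and their zero-denominator conventions can be sketched on a reduced tally type. The `score` struct below is a hypothetical stand-in for FixtureScore that keeps only the counters the formulas need.

```go
package main

import "fmt"

// score is a reduced, hypothetical stand-in for FixtureScore.
type score struct {
	TP, FN, FP        int
	SilenceViolations int
	SilenceAnchors    int
}

// precision is TP / (TP + FP); no produced findings is vacuously precise.
func (s score) precision() float64 {
	if s.TP+s.FP == 0 {
		return 1
	}
	return float64(s.TP) / float64(s.TP+s.FP)
}

// recall is TP / (TP + FN); a clean fixture is fully recalled.
func (s score) recall() float64 {
	if s.TP+s.FN == 0 {
		return 1
	}
	return float64(s.TP) / float64(s.TP+s.FN)
}

// falsePositiveRate is silence violations / silence anchors, or 0 when
// the fixture defines no anchors.
func (s score) falsePositiveRate() float64 {
	if s.SilenceAnchors == 0 {
		return 0
	}
	return float64(s.SilenceViolations) / float64(s.SilenceAnchors)
}

func main() {
	s := score{TP: 3, FN: 1, FP: 1, SilenceViolations: 1, SilenceAnchors: 4}
	fmt.Println(s.precision(), s.recall(), s.falsePositiveRate())
}
```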
type Scorecard ¶
type Scorecard struct {
	Provider string
	Model    string
	Fixtures []FixtureScore
}
Scorecard aggregates fixture scores for one provider+model run.
func RunCorpus ¶
func RunCorpus(ctx context.Context, p provider.Provider, model string, fixtures []Fixture) (Scorecard, error)
RunCorpus runs every fixture through the provider and returns a Scorecard. A fixture hitting a transient provider error (isRetriable) is retried up to corpusAttempts times with linear backoff; a non-transient failure aborts the run after the first attempt.
func (Scorecard) CategoryRecall ¶
func (sc Scorecard) CategoryRecall() []CategoryRecall
CategoryRecall returns recall per expected-finding category, sorted by category name for deterministic output.
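A sketch of how per-category recall could be aggregated from the CaughtByCategory and MissedByCategory maps. The fields of the real CategoryRecall type are not shown in this section, so the `categoryRecall` struct below, along with `fixtureScore` and `perCategory`, is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// fixtureScore keeps only the per-category maps of FixtureScore.
type fixtureScore struct {
	Caught map[string]int
	Missed map[string]int
}

// categoryRecall is a hypothetical shape for one row of the breakdown.
type categoryRecall struct {
	Category string
	Caught   int
	Missed   int
}

// recall is Caught / (Caught + Missed); an empty row is treated as 1.
func (c categoryRecall) recall() float64 {
	if c.Caught+c.Missed == 0 {
		return 1
	}
	return float64(c.Caught) / float64(c.Caught+c.Missed)
}

// perCategory sums caught and missed counts across fixtures and returns
// one row per category, sorted by category name.
func perCategory(scores []fixtureScore) []categoryRecall {
	rows := make(map[string]*categoryRecall)
	row := func(name string) *categoryRecall {
		if rows[name] == nil {
			rows[name] = &categoryRecall{Category: name}
		}
		return rows[name]
	}
	for _, s := range scores {
		for name, n := range s.Caught {
			row(name).Caught += n
		}
		for name, n := range s.Missed {
			row(name).Missed += n
		}
	}
	out := make([]categoryRecall, 0, len(rows))
	for _, r := range rows {
		out = append(out, *r)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Category < out[j].Category })
	return out
}

func main() {
	for _, r := range perCategory([]fixtureScore{
		{Caught: map[string]int{"security": 2}, Missed: map[string]int{"correctness": 1}},
		{Caught: map[string]int{"correctness": 1}},
	}) {
		fmt.Printf("%s: %.2f\n", r.Category, r.recall())
	}
}
```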
func (Scorecard) Dev ¶
Dev returns the tunable slice (fixtures the prompt/corpus may be tuned against). HeldOut returns the generalization-only slice (ADR-0018 §Goodhart). A change is overfitting when Dev recall rises but HeldOut recall does not.
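The overfitting check can be illustrated by partitioning fixture scores on HeldOut and comparing recall before and after a change. Everything below is a hypothetical sketch: `split`, `pooledRecall`, and `looksOverfit` are illustrative names, and pooling counts across fixtures (rather than averaging per-fixture recall) is an assumption.

```go
package main

import "fmt"

// fixtureScore keeps only the fields this sketch needs.
type fixtureScore struct {
	Name    string
	HeldOut bool
	TP, FN  int
}

// split partitions scores into the tunable and held-out slices.
func split(all []fixtureScore) (dev, heldOut []fixtureScore) {
	for _, f := range all {
		if f.HeldOut {
			heldOut = append(heldOut, f)
		} else {
			dev = append(dev, f)
		}
	}
	return dev, heldOut
}

// pooledRecall sums TP and FN over the slice before dividing.
// Assumption: counts are pooled; an empty slice is treated as recall 1.
func pooledRecall(fs []fixtureScore) float64 {
	tp, fn := 0, 0
	for _, f := range fs {
		tp += f.TP
		fn += f.FN
	}
	if tp+fn == 0 {
		return 1
	}
	return float64(tp) / float64(tp+fn)
}

// looksOverfit flags a change whose dev recall rose while held-out
// recall did not.
func looksOverfit(devBefore, devAfter, heldBefore, heldAfter float64) bool {
	return devAfter > devBefore && heldAfter <= heldBefore
}

func main() {
	dev, held := split([]fixtureScore{
		{Name: "a", TP: 3, FN: 1},
		{Name: "b", HeldOut: true, TP: 1, FN: 1},
	})
	fmt.Println(pooledRecall(dev), pooledRecall(held))
}
```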
func (Scorecard) FalsePositiveRate ¶
FalsePositiveRate is the corpus-wide silence violations ÷ silence anchors.
type SilenceAnchor ¶
type SilenceAnchor struct {
	File   string `json:"file"`
	Line   int    `json:"line"`
	Reason string `json:"reason"`
}
SilenceAnchor marks a line a good review should NOT flag. A produced finding landing on File at approximately Line counts as a measured false positive (ADR-0018 §2).
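Counting silence violations could be sketched as below. The tolerance around an anchor line is not specified in this section, so the `tol` parameter is a hypothetical knob, as are the `anchor` and `finding` types and the `silenceViolations` function. Each produced finding is counted at most once, which is also an assumption.

```go
package main

import "fmt"

type anchor struct {
	File string
	Line int
}

type finding struct {
	File string
	Line int
}

// silenceViolations counts produced findings that land within tol lines
// of any silence anchor in the same file.
func silenceViolations(prod []finding, anchors []anchor, tol int) int {
	n := 0
	for _, p := range prod {
		for _, a := range anchors {
			d := p.Line - a.Line
			if d < 0 {
				d = -d
			}
			if p.File == a.File && d <= tol {
				n++
				break // count each finding once, even near several anchors
			}
		}
	}
	return n
}

func main() {
	anchors := []anchor{{"a.go", 20}, {"a.go", 80}}
	prod := []finding{{"a.go", 21}, {"a.go", 40}, {"b.go", 20}}
	fmt.Println(silenceViolations(prod, anchors, 1))
}
```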