summaryeval

package
v0.7.0
Published: May 1, 2026 License: MIT Imports: 11 Imported by: 0

Documentation

Overview

Package summaryeval is a quality-gate harness for session summaries.

What this solves

Session summarization is a lossy LLM operation. Every change to the pipeline — swapping models, tuning prompts, adjusting pkg/tokenopt, adding a tokenstrip stage, landing a REDACT.md LLM redactor — can silently degrade summary quality in ways unit tests won't catch. Without a quality gate, you are flying blind on the question "does the new thing make summaries worse?"

Design

The harness scores a candidate summary against a hand-reviewed reference summary using a rubric of five dimensions:

  • Title fidelity — semantic match against reference title
  • Summary coverage — keyword overlap against reference summary
  • Key actions recall — set overlap over bullet points
  • Outcome correctness — exact match of success|partial|failed
  • Aha moments — count match + positional overlap

Each dimension produces a 0.0–1.0 score. A weighted aggregate gives an overall session score. A corpus of N sessions gives an aggregate distribution that can be tracked over time and gated in CI.

Why deterministic scoring (v1)

The scorer uses lexical metrics (Jaccard similarity, set overlap, exact match) rather than an LLM judge. This is a deliberate v1 choice:

  • Deterministic — same input always produces same score. Regressions are attributable to the thing that changed, not model variance.
  • Fast and free — no API calls. Runs on every CI build.
  • Self-contained — stdlib only, no Anthropic SDK dependency.

Lexical metrics are approximate. A summary that paraphrases using different vocabulary may score lower than one that copies the reference. That's a known tradeoff. V2 can introduce an optional LLM-judge scorer on top of the same Rubric/Score shapes for cases where semantic equivalence matters.

Curating the golden corpus

See pkg/summaryeval/CORPUS.md (or the README in the testdata directory) for the process: pick 20–30 diverse sessions, hand-review and polish summaries to a trusted reference, then run the harness to establish the baseline score distribution.

Index

Constants

View Source
const (
	DimTitle      = "title"
	DimSummary    = "summary"
	DimKeyActions = "key_actions"
	DimOutcome    = "outcome"
	DimAhaMoments = "aha_moments"
)

These constants name the scored rubric dimensions, so typos fail at compile time and callers can iterate a known set.

Variables

This section is empty.

Functions

func BuildJudgePrompt

func BuildJudgePrompt(reference *Summary, candidate Summary, opts JudgeOptions) string

BuildJudgePrompt constructs the prompt text sent to the LLM judge. Paired mode (reference != nil) asks for semantic equivalence; absolute mode (reference == nil) asks for on-merits evaluation against the rubric's definition of a good summary.

The prompt asks for a strict JSON response to simplify parsing and make the judge's output CI-gateable. Models that free-form their response will parse-fail and surface as errors.

func Dimensions

func Dimensions() []string

Dimensions returns the canonical ordered list. Callers use this for stable report ordering.

Types

type AhaMoment

type AhaMoment struct {
	Seq  int    `json:"seq"`
	Type string `json:"type"`
}

AhaMoment is a minimal moment shape for scoring. We score count match and approximate sequence-position overlap, not the full highlight text.

type Completer

type Completer func(ctx context.Context, prompt string) (CompletionResult, error)

Completer is the abstraction for calling an LLM. Callers provide the actual implementation; pkg/summaryeval does NOT depend on any specific SDK (Anthropic, OpenAI, Bedrock, etc.). This keeps the package importable with only stdlib and leaves all API concerns — authentication, retries, rate limiting, model selection, streaming — to the caller.

The prompt is the full text to send. The return is the model's raw response text. Errors propagate up to the Judge caller.

type CompletionResult

type CompletionResult struct {
	Text             string
	ModelUsed        string
	PromptTokens     int
	CompletionTokens int
}

CompletionResult is the response shape Completer returns. Token counts are optional — pass 0 when unknown.

type DimensionScore

type DimensionScore struct {
	Dimension string  `json:"dimension"`
	Score     float64 `json:"score"`
	Reason    string  `json:"reason,omitempty"`
}

DimensionScore is one dimension's 0.0–1.0 result for a single session with a reason string explaining how it was computed.

type Gates

type Gates struct {
	MinOverall    float64            `json:"min_overall,omitempty"`
	MinDimensions map[string]float64 `json:"min_dimensions,omitempty"`
}

Gates is an optional set of minimum-score thresholds. Any dimension or overall score below its threshold fails the gate and surfaces in Report.GatesFailed. Useful as a CI regression guard.

type GoldenSession

type GoldenSession struct {
	// Name identifies the session (matches ledger session dir name).
	Name string `json:"name"`

	// Notes is free-form human context explaining why this session was
	// chosen and what a good summary should capture. Not scored; helps
	// future curators.
	Notes string `json:"notes,omitempty"`

	// Reference is the trusted summary: what a great distillation looks
	// like for this session. Candidates are scored against this.
	Reference Summary `json:"reference"`
}

GoldenSession is a single entry in the reference corpus: a session name + the hand-reviewed reference summary we score candidates against. Stored on disk as <corpus>/<session_name>/reference.json.

func LoadCorpus

func LoadCorpus(corpusDir string) ([]GoldenSession, error)

LoadCorpus walks corpusDir and returns all GoldenSessions found. Each subdirectory containing reference.json is a session. Order is lexicographic by directory name for reproducible reports.

func LoadGoldenSession

func LoadGoldenSession(corpusDir, sessionName string) (*GoldenSession, error)

LoadGoldenSession reads a GoldenSession from a corpus directory layout:

<corpus>/<session_name>/reference.json

Returns (nil, nil) if the session dir doesn't exist — callers can distinguish "not curated" from "exists but broken".

type Judge

type Judge interface {
	Score(ctx context.Context, name string, reference *Summary, candidate Summary) (JudgeResult, error)
}

Judge evaluates a candidate Summary semantically, returning a JudgeResult. Complements the deterministic rubric Score (scorer.go) for cases where lexical metrics aren't enough — paraphrased summaries that convey the same meaning with different vocabulary score poorly on Jaccard but should score well on semantic equivalence.

Two modes:

  • Paired: Score with a non-nil reference. The judge evaluates semantic equivalence between candidate and reference.
  • Absolute: Score with a nil reference. The judge evaluates the candidate on its own merits against the rubric description, without needing a curated corpus.

Absolute mode is what the daemon runs in production — no corpus maintenance required, just "is this summary good on its face?"

func NewJudge

func NewJudge(c Completer, opts JudgeOptions) Judge

NewJudge constructs a Judge backed by the given Completer. The judge uses a fixed prompt template (see BuildJudgePrompt) and parses the LLM's JSON response. Callers that need different prompts or response shapes should implement Judge directly.

type JudgeOptions

type JudgeOptions struct {
	// ModelHint is a free-form tag the Judge can use to signal the
	// desired model class to the Completer. The Completer may ignore
	// it. Example: "haiku" for cheap/fast; "opus" for deep evaluation.
	// If empty, the Completer picks.
	ModelHint string

	// IncludeSuggestions, when true, asks the judge for up to 5 concrete
	// suggestions for improving the summary. Slight prompt length
	// increase; usually worth it for diagnostic mode.
	IncludeSuggestions bool

	// MaxRationaleChars caps the rationale length the judge is asked
	// to produce. Default: 600. Set lower to reduce completion tokens.
	MaxRationaleChars int
}

JudgeOptions tune a Judge's behavior.

type JudgeResult

type JudgeResult struct {
	// Name identifies the session scored.
	Name string `json:"name"`

	// Dimensions are per-dimension 0.0-1.0 scores (same dimension
	// names as Rubric). Absent dimensions are treated as 0.0 when
	// converting to SessionScore.
	Dimensions []DimensionScore `json:"dimensions"`

	// Overall is the judge's aggregate verdict, 0.0-1.0.
	Overall float64 `json:"overall"`

	// Rationale is a short human-readable explanation of the verdict.
	// Useful for humans debugging "why did this summary score 0.4?"
	Rationale string `json:"rationale,omitempty"`

	// Suggestions is a short list of specific, actionable fixes the
	// judge thinks would improve the summary. Surfaced to the user
	// via diagnostic output; consumed by CI in "hint to the prompt
	// engineer" mode.
	Suggestions []string `json:"suggestions,omitempty"`

	// ModelUsed identifies which model produced the judgment, for
	// reproducibility tracking ("haiku-4-5", "opus-4-7", etc.).
	ModelUsed string `json:"model_used,omitempty"`

	// DurationMs captures end-to-end judge latency including the LLM
	// call. Useful for operational telemetry.
	DurationMs int64 `json:"duration_ms,omitempty"`

	// PromptTokens / CompletionTokens are the raw token counts the
	// LLM reported, when available. Used for cost attribution.
	PromptTokens     int `json:"prompt_tokens,omitempty"`
	CompletionTokens int `json:"completion_tokens,omitempty"`
}

JudgeResult is what a Judge returns for one session.

func (JudgeResult) LogValue

func (jr JudgeResult) LogValue() slog.Value

LogValue implements slog.LogValuer so callers can emit one-line judge telemetry via the existing slog path.

func (JudgeResult) ToSessionScore

func (jr JudgeResult) ToSessionScore() SessionScore

ToSessionScore converts a JudgeResult to a SessionScore so it composes with the same ScoreCorpus / Report aggregation the deterministic scorer uses. The overall score is taken verbatim from the judge; individual dimensions carry through.

type Report

type Report struct {
	ScoredAt       time.Time          `json:"scored_at"`
	CorpusSize     int                `json:"corpus_size"`
	OverallMean    float64            `json:"overall_mean"`
	OverallMin     float64            `json:"overall_min"`
	OverallMax     float64            `json:"overall_max"`
	DimensionMeans map[string]float64 `json:"dimension_means"`
	Sessions       []SessionScore     `json:"sessions"`

	// GatesFailed lists any minimum-threshold gates that didn't pass.
	// Empty = all gates met (or no gates configured).
	GatesFailed []string `json:"gates_failed,omitempty"`
}

Report is the aggregate result of running an eval over a corpus.

func ScoreCorpus

func ScoreCorpus(corpus []GoldenSession, candidates map[string]Summary, w Weights, gates *Gates) Report

ScoreCorpus evaluates each golden session against its paired candidate and returns an aggregated Report. The candidates map is keyed by session name. Sessions with no paired candidate are scored as all-zeros (missing_candidate reason) and counted in the corpus size.

Gates (optional) are checked after aggregation; any unmet thresholds appear in Report.GatesFailed.

type SessionScore

type SessionScore struct {
	Name       string           `json:"name"`
	Dimensions []DimensionScore `json:"dimensions"`
	Overall    float64          `json:"overall"`
}

SessionScore is the full per-session result: all dimensions + a weighted aggregate.

func Score

func Score(name string, reference, candidate Summary, w Weights) SessionScore

Score evaluates a candidate summary against a reference using the given weights. Each dimension produces a 0.0–1.0 score; the overall is the weighted sum. All scoring is deterministic and lexical — same input always produces the same score. No LLM calls.

type Summary

type Summary struct {
	Title       string      `json:"title"`
	Summary     string      `json:"summary"`
	KeyActions  []string    `json:"key_actions"`
	Outcome     string      `json:"outcome"`
	AhaMoments  []AhaMoment `json:"aha_moments,omitempty"`
	TopicsFound []string    `json:"topics_found,omitempty"`
}

Summary is a minimal shape mirroring the fields from pkg/sessionsummary.SummarizeResponse that the rubric actually scores. Kept separate so summaryeval has no dependency on sessionsummary — the two evolve independently, and the eval harness doesn't care about the full summary schema (quality_score, sageox_score, chapter lists, etc. are out of scope).

func LoadCandidate

func LoadCandidate(path string) (*Summary, error)

LoadCandidate reads a candidate Summary from a JSON file. Used when comparing a distiller's output against a golden reference.

type Weights

type Weights struct {
	Title      float64 `json:"title"`
	Summary    float64 `json:"summary"`
	KeyActions float64 `json:"key_actions"`
	Outcome    float64 `json:"outcome"`
	AhaMoments float64 `json:"aha_moments"`
}

Weights are the rubric's per-dimension contribution to the overall score. Must sum to 1.0. Defaults chosen deliberately:

  • Title carries user-visible fidelity of the session's identity
  • Summary is the longest-form signal
  • Key actions are the "what was done" spine
  • Outcome is small but binary-important (success vs failed)
  • Aha moments are secondary but measurable

func DefaultWeights

func DefaultWeights() Weights

DefaultWeights returns the canonical rubric weights (sum = 1.0).
