eval

package
v0.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 1, 2026 License: MIT Imports: 11 Imported by: 0

Documentation

Overview

Package eval is mneme's prompt evaluation harness. It runs hand-authored fixtures through the real Add/Search pipeline against a live LLM and scores extraction quality, so a prompt version is a decision backed by numbers and a prompt change can be caught when it regresses. It is not part of `go test ./...` (which must run offline); drive it via cmd/eval.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	LLM      provider.LLM
	Embedder provider.Embedder
	Store    store.Store
	Judge    provider.LLM // semantic match oracle; defaults to LLM
	K        int          // search top-k; defaults to 3
	Versions []string     // prompt versions; defaults to all registered
}

Config holds everything Run needs: the providers under test, a judge LLM for semantic scoring, the store to use, the search depth, and which prompt versions to score.

type Fixture

type Fixture struct {
	Name string `json:"name"`
	// Messages is the conversation fed to Add.
	Messages []types.Message `json:"messages"`
	// ExpectedFacts are the durable facts a good extractor should produce.
	ExpectedFacts []string `json:"expected_facts"`
	// Queries probe search recall (optional).
	Queries []Query `json:"queries"`
	// RequiredTokens are substrings that must survive verbatim into some
	// extracted fact — the deterministic specificity check (optional).
	RequiredTokens []string `json:"required_tokens,omitempty"`
	// SkipDedup opts a fixture out of the dedup-correctness check (e.g. the
	// nothing-to-extract case, where there is nothing to re-dedup).
	SkipDedup bool `json:"skip_dedup,omitempty"`
}

Fixture is one LoCoMo-style evaluation case.

func LoadFixtures

func LoadFixtures(dir string) ([]Fixture, error)

LoadFixtures reads and parses every *.json file in dir, sorted by name for a stable run order.

type Judge

type Judge struct {
	LLM provider.LLM
}

Judge decides whether two free-text facts assert the same thing, using an LLM for semantic matching. Facts are free text, so exact-string comparison is too brittle; the judge is the semantic-equality oracle the metrics build on.

func (Judge) Same

func (j Judge) Same(ctx context.Context, a, b string) bool

Same reports whether a and b mean the same thing. Semantic equality is symmetric, but the LLM judge is mildly order-sensitive, so we ask both orderings and accept a match if either says yes — this removes spurious asymmetry between the recall and precision passes (which compare the same pair in opposite orders). On any LLM error or ambiguous answer a single ask returns false: a conservative miss rather than a score-inflating false match.

type Query

type Query struct {
	Q            string   `json:"q"`
	ShouldRecall []string `json:"should_recall"`
}

Query is a retrieval probe: after a fixture is ingested, searching Q must surface a fact semantically matching one of ShouldRecall within top-k.

type Result

type Result struct {
	Name      string
	Extracted []string

	Recall    float64
	Precision float64

	HasSpecificity bool
	Specificity    float64

	HasSearch    bool
	SearchRecall float64

	HasDedup      bool
	DedupNewFacts int // 0 is ideal: re-ingesting added nothing new
}

Result is one fixture's score under one prompt version.

func (Result) Aggregate

func (r Result) Aggregate() float64

Aggregate is the fixture's single [0,1] score: the mean of its applicable metrics (specificity, search and dedup are only counted when the fixture exercises them).

type VersionReport

type VersionReport struct {
	Version string
	Results []Result
}

VersionReport collects all fixture results for one prompt version.

func Run

func Run(ctx context.Context, fixtures []Fixture, cfg Config) ([]VersionReport, error)

Run scores every fixture under every configured prompt version against the live providers, returning one VersionReport per version.

func (VersionReport) MeanAggregate

func (vr VersionReport) MeanAggregate() float64

func (VersionReport) MeanDedup

func (vr VersionReport) MeanDedup() float64

func (VersionReport) MeanPrecision

func (vr VersionReport) MeanPrecision() float64

func (VersionReport) MeanRecall

func (vr VersionReport) MeanRecall() float64

func (VersionReport) MeanSearchRecall

func (vr VersionReport) MeanSearchRecall() float64

func (VersionReport) MeanSpecificity

func (vr VersionReport) MeanSpecificity() float64

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL