Documentation
¶
Overview ¶
Package eval is mneme's prompt evaluation harness. It runs hand-authored fixtures through the real Add/Search pipeline against a live LLM and scores extraction quality, so a prompt version is a decision backed by numbers and a prompt change can be caught when it regresses. It is not part of `go test ./...` (which must run offline); drive it via cmd/eval.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Config ¶
type Config struct {
LLM provider.LLM
Embedder provider.Embedder
Store store.Store
Judge provider.LLM // semantic match oracle; defaults to LLM
K int // search top-k; defaults to 3
Versions []string // prompt versions; defaults to all registered
}
Config holds everything Run needs: the providers under test, a judge LLM for semantic scoring, the store to use, the search depth, and which prompt versions to score.
type Fixture ¶
type Fixture struct {
Name string `json:"name"`
// Messages is the conversation fed to Add.
Messages []types.Message `json:"messages"`
// ExpectedFacts are the durable facts a good extractor should produce.
ExpectedFacts []string `json:"expected_facts"`
// Queries probe search recall (optional).
Queries []Query `json:"queries"`
// RequiredTokens are substrings that must survive verbatim into some
// extracted fact — the deterministic specificity check (optional).
RequiredTokens []string `json:"required_tokens,omitempty"`
// SkipDedup opts a fixture out of the dedup-correctness check (e.g. the
// nothing-to-extract case, where there is nothing to re-dedup).
SkipDedup bool `json:"skip_dedup,omitempty"`
}
Fixture is one LoCoMo-style evaluation case.
func LoadFixtures ¶
LoadFixtures reads and parses every *.json file in dir, sorted by name for a stable run order.
type Judge ¶
Judge decides whether two free-text facts assert the same thing, using an LLM for semantic matching. Facts are free text, so exact-string comparison is too brittle; the judge is the semantic-equality oracle the metrics build on.
func (Judge) Same ¶
Same reports whether a and b mean the same thing. Semantic equality is symmetric, but the LLM judge is mildly order-sensitive, so we ask both orderings and accept a match if either says yes — this removes spurious asymmetry between the recall and precision passes (which compare the same pair in opposite orders). On any LLM error or ambiguous answer a single ask returns false: a conservative miss rather than a score-inflating false match.
type Query ¶
Query is a retrieval probe: after a fixture is ingested, searching Q must surface a fact semantically matching one of ShouldRecall within top-k.
type Result ¶
type Result struct {
Name string
Extracted []string
Recall float64
Precision float64
HasSpecificity bool
Specificity float64
HasSearch bool
SearchRecall float64
HasDedup bool
DedupNewFacts int // 0 is ideal: re-ingesting added nothing new
}
Result is one fixture's score under one prompt version.
type VersionReport ¶
VersionReport collects all fixture results for one prompt version.
func Run ¶
Run scores every fixture under every configured prompt version against the live providers, returning one VersionReport per version.
func (VersionReport) MeanAggregate ¶
func (vr VersionReport) MeanAggregate() float64
func (VersionReport) MeanDedup ¶
func (vr VersionReport) MeanDedup() float64
func (VersionReport) MeanPrecision ¶
func (vr VersionReport) MeanPrecision() float64
func (VersionReport) MeanRecall ¶
func (vr VersionReport) MeanRecall() float64
func (VersionReport) MeanSearchRecall ¶
func (vr VersionReport) MeanSearchRecall() float64
func (VersionReport) MeanSpecificity ¶
func (vr VersionReport) MeanSpecificity() float64