eval

package

v0.1.0 Latest Latest Go to latest Published: Jun 1, 2026 License: MIT Imports: 11 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/AccursedGalaxy/mneme

Links

Open Source Insights

Documentation ¶

Overview ¶

Package eval is mneme's prompt evaluation harness. It runs hand-authored fixtures through the real Add/Search pipeline against a live LLM and scores extraction quality, so a prompt version is a decision backed by numbers and a prompt change can be caught when it regresses. It is not part of `go test ./...` (which must run offline); drive it via cmd/eval.

Index ¶

type Config
type Fixture
- func LoadFixtures(dir string) ([]Fixture, error)
type Judge
- func (j Judge) Same(ctx context.Context, a, b string) bool
type Query
type Result
- func (r Result) Aggregate() float64
type VersionReport
- func Run(ctx context.Context, fixtures []Fixture, cfg Config) ([]VersionReport, error)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Config ¶

type Config struct {
	LLM      provider.LLM
	Embedder provider.Embedder
	Store    store.Store
	Judge    provider.LLM // semantic match oracle; defaults to LLM
	K        int          // search top-k; defaults to 3
	Versions []string     // prompt versions; defaults to all registered
}

Config holds everything Run needs: the providers under test, a judge LLM for semantic scoring, the store to use, the search depth, and which prompt versions to score.

type Fixture ¶

type Fixture struct {
	Name string `json:"name"`
	// Messages is the conversation fed to Add.
	Messages []types.Message `json:"messages"`
	// ExpectedFacts are the durable facts a good extractor should produce.
	ExpectedFacts []string `json:"expected_facts"`
	// Queries probe search recall (optional).
	Queries []Query `json:"queries"`
	// RequiredTokens are substrings that must survive verbatim into some
	// extracted fact — the deterministic specificity check (optional).
	RequiredTokens []string `json:"required_tokens,omitempty"`
	// SkipDedup opts a fixture out of the dedup-correctness check (e.g. the
	// nothing-to-extract case, where there is nothing to re-dedup).
	SkipDedup bool `json:"skip_dedup,omitempty"`
}

Fixture is one LoCoMo-style evaluation case.

func LoadFixtures ¶

func LoadFixtures(dir string) ([]Fixture, error)

LoadFixtures reads and parses every *.json file in dir, sorted by name for a stable run order.

type Judge ¶

type Judge struct {
	LLM provider.LLM
}

Judge decides whether two free-text facts assert the same thing, using an LLM for semantic matching. Facts are free text, so exact-string comparison is too brittle; the judge is the semantic-equality oracle the metrics build on.

func (Judge) Same ¶

func (j Judge) Same(ctx context.Context, a, b string) bool

Same reports whether a and b mean the same thing. Semantic equality is symmetric, but the LLM judge is mildly order-sensitive, so we ask both orderings and accept a match if either says yes — this removes spurious asymmetry between the recall and precision passes (which compare the same pair in opposite orders). On any LLM error or ambiguous answer a single ask returns false: a conservative miss rather than a score-inflating false match.

type Query ¶

type Query struct {
	Q            string   `json:"q"`
	ShouldRecall []string `json:"should_recall"`
}

Query is a retrieval probe: after a fixture is ingested, searching Q must surface a fact semantically matching one of ShouldRecall within top-k.

type Result ¶

type Result struct {
	Name      string
	Extracted []string

	Recall    float64
	Precision float64

	HasSpecificity bool
	Specificity    float64

	HasSearch    bool
	SearchRecall float64

	HasDedup      bool
	DedupNewFacts int // 0 is ideal: re-ingesting added nothing new
}

Result is one fixture's score under one prompt version.

func (Result) Aggregate ¶

func (r Result) Aggregate() float64

Aggregate is the fixture's single [0,1] score: the mean of its applicable metrics (specificity, search and dedup are only counted when the fixture exercises them).

type VersionReport ¶

type VersionReport struct {
	Version string
	Results []Result
}

VersionReport collects all fixture results for one prompt version.

func Run ¶

func Run(ctx context.Context, fixtures []Fixture, cfg Config) ([]VersionReport, error)

Run scores every fixture under every configured prompt version against the live providers, returning one VersionReport per version.

func (VersionReport) MeanAggregate ¶

func (vr VersionReport) MeanAggregate() float64

func (VersionReport) MeanDedup ¶

func (vr VersionReport) MeanDedup() float64

func (VersionReport) MeanPrecision ¶

func (vr VersionReport) MeanPrecision() float64

func (VersionReport) MeanRecall ¶

func (vr VersionReport) MeanRecall() float64

func (VersionReport) MeanSearchRecall ¶

func (vr VersionReport) MeanSearchRecall() float64

func (VersionReport) MeanSpecificity ¶

func (vr VersionReport) MeanSpecificity() float64

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL