package eval

v0.2.0
Published: Feb 18, 2026 License: Apache-2.0 Imports: 15 Imported by: 0

Documentation


Constants

const (
	DifficultyEasy      = "easy"
	DifficultyMedium    = "medium"
	DifficultyHard      = "hard"
	DifficultyComplex   = "complex" // Used by legacy ComplexDataset(); not part of ALTAVision eval.
	DifficultySuperHard = "super-hard"
	DifficultyGraphTest = "graph-test"
)

Difficulty levels for evaluation datasets.

Variables

var RetrievalKValues = []int{1, 4, 8, 16, 32, 64}

RetrievalKValues are the k values at which P@k and R@k are computed.
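The standard definitions of these metrics can be sketched as follows: P@k is the fraction of the top k results that are relevant, and R@k is the fraction of all relevant spans found within the top k. This is a generic illustration of the computation, not the package's internal code:

```go
package main

import "fmt"

// precisionRecallAtK computes P@k and R@k from per-rank relevance flags
// and the total number of relevant ground-truth spans.
func precisionRecallAtK(relevant []bool, totalRelevant int, ks []int) (map[int]float64, map[int]float64) {
	p := make(map[int]float64)
	r := make(map[int]float64)
	for _, k := range ks {
		n := k
		if n > len(relevant) {
			n = len(relevant)
		}
		hits := 0
		for i := 0; i < n; i++ {
			if relevant[i] {
				hits++
			}
		}
		p[k] = float64(hits) / float64(k)
		if totalRelevant > 0 {
			r[k] = float64(hits) / float64(totalRelevant)
		}
	}
	return p, r
}

func main() {
	// Ranked results: positions 1 and 3 are relevant; 2 relevant spans exist.
	rel := []bool{true, false, true, false}
	p, r := precisionRecallAtK(rel, 2, []int{1, 4})
	fmt.Println(p[1], p[4], r[1], r[4]) // 1 0.5 0.5 1
}
```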

Functions

func ALTAVisionAllDatasets

func ALTAVisionAllDatasets() map[string]Dataset

ALTAVisionAllDatasets returns all ALTAVision datasets keyed by difficulty.

func FormatReport

func FormatReport(r *Report) string

FormatReport produces a human-readable report string.

func GDPRAllDatasets added in v0.2.0

func GDPRAllDatasets() map[string]Dataset

GDPRAllDatasets returns all GDPR evaluation datasets keyed by difficulty level.

func LoadLegalBenchGroundTruth added in v0.2.0

func LoadLegalBenchGroundTruth(cfg LegalBenchConfig) (map[string][]GroundTruthSpan, error)

LoadLegalBenchGroundTruth loads the raw ground-truth spans from benchmark files. This is used for retrieval P@k/R@k computation (matching retrieved chunks against exact document spans).
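Matching a retrieved chunk against an exact document span typically reduces to a character-range overlap test within the same file. The rule below (half-open range overlap) is an illustrative assumption, not necessarily the package's exact matching criterion, and the file path is a placeholder:

```go
package main

import "fmt"

type span struct {
	FilePath   string
	Start, End int
}

// overlaps reports whether a retrieved chunk's character range overlaps
// a ground-truth span in the same file (half-open [Start, End) ranges).
func overlaps(chunk, truth span) bool {
	return chunk.FilePath == truth.FilePath &&
		chunk.Start < truth.End && truth.Start < chunk.End
}

func main() {
	truth := span{"corpus/contract_001.txt", 1200, 1450}
	fmt.Println(overlaps(span{"corpus/contract_001.txt", 1400, 1600}, truth)) // true
	fmt.Println(overlaps(span{"corpus/contract_001.txt", 1500, 1700}, truth)) // false
}
```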

func PDFComplexityReport

func PDFComplexityReport(results []PDFComplexityResult) string

PDFComplexityReport summarizes PDF complexity evaluation results.

func UsedCorpusFiles added in v0.2.0

func UsedCorpusFiles(cfg LegalBenchConfig) (map[string]struct{}, error)

UsedCorpusFiles returns the set of corpus file paths referenced by the loaded benchmark tests. When MaxTestsPerBenchmark is set, this returns only files needed for the subset — useful for skipping ingestion of unreferenced documents.
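Since the returned value is a `map[string]struct{}` set, gating ingestion is a simple membership check. A minimal sketch of how a caller might filter candidate paths against that set (paths here are placeholders):

```go
package main

import "fmt"

// filterCorpus keeps only the paths present in the used-files set,
// mirroring how the UsedCorpusFiles result can gate ingestion.
func filterCorpus(all []string, used map[string]struct{}) []string {
	var keep []string
	for _, p := range all {
		if _, ok := used[p]; ok {
			keep = append(keep, p)
		}
	}
	return keep
}

func main() {
	used := map[string]struct{}{"corpus/a.txt": {}}
	fmt.Println(filterCorpus([]string{"corpus/a.txt", "corpus/b.txt"}, used))
}
```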

Types

type AggregateMetrics

type AggregateMetrics struct {
	AvgFaithfulness       float64 `json:"avg_faithfulness"`
	AvgRelevance          float64 `json:"avg_relevance"`
	AvgAccuracy           float64 `json:"avg_accuracy"`
	AvgStrictAccuracy     float64 `json:"avg_strict_accuracy"`
	AvgContextRecall      float64 `json:"avg_context_recall"`
	AvgCitationQuality    float64 `json:"avg_citation_quality"`
	AvgConfidence         float64 `json:"avg_confidence"`
	AvgClaimGrounding     float64 `json:"avg_claim_grounding"`
	AvgHallucinationScore float64 `json:"avg_hallucination_score"`

	// Retrieval metrics (populated when ground-truth spans are available)
	AvgRetrievalPrecision map[int]float64 `json:"avg_retrieval_precision,omitempty"` // k -> P@k
	AvgRetrievalRecall    map[int]float64 `json:"avg_retrieval_recall,omitempty"`    // k -> R@k
}

AggregateMetrics holds averaged metrics across all tests.

type Dataset

type Dataset struct {
	Name       string     `json:"name"`
	Difficulty string     `json:"difficulty"` // easy, medium, hard, complex, super-hard, graph-test
	Tests      []TestCase `json:"tests"`
}

Dataset is a collection of test cases for evaluation.

func ALTAVisionEasyDataset

func ALTAVisionEasyDataset() Dataset

ALTAVisionEasyDataset returns 30 easy (single-fact lookup) test cases from the ALTAVision AV-FM/AV-FF technical manual.

Expected facts use pipe-separated alternatives (e.g. "Spanish|English") so accuracy scoring works regardless of the LLM's answer language.
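A fact like "Spanish|English" counts as found if any alternative appears in the answer. Case-insensitive substring matching, as sketched below, is an assumption about the scoring, not a guarantee of the package's exact rule:

```go
package main

import (
	"fmt"
	"strings"
)

// factFound reports whether any pipe-separated alternative of an
// expected fact appears (case-insensitively) in the answer text.
func factFound(expected, answer string) bool {
	a := strings.ToLower(answer)
	for _, alt := range strings.Split(expected, "|") {
		if strings.Contains(a, strings.ToLower(alt)) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(factFound("Spanish|English", "The manual is written in English.")) // true
	fmt.Println(factFound("Spanish|English", "The manual is written in German."))  // false
}
```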

func ALTAVisionGraphTestDataset added in v0.2.0

func ALTAVisionGraphTestDataset() Dataset

ALTAVisionGraphTestDataset returns 7 targeted test cases for evaluating graph-mode retrieval on the ALTAVision manual. These cover electrical specs, grounding, environment, anchoring, tracker card, beacon lights, and vortex cooling.

func ALTAVisionHardDataset

func ALTAVisionHardDataset() Dataset

ALTAVisionHardDataset returns 30 hard (multi-hop reasoning) test cases.

func ALTAVisionMediumDataset

func ALTAVisionMediumDataset() Dataset

ALTAVisionMediumDataset returns 30 medium (multi-fact, context-dependent) test cases.

func ALTAVisionSuperHardDataset

func ALTAVisionSuperHardDataset() Dataset

ALTAVisionSuperHardDataset returns 50 super-hard (synthesis/inference) test cases. Includes the original 30 (with fixes to Q2, Q19, Q25, Q30) plus 20 new tests in categories: graph-multi-hop, anti-hallucination, numerical, reasoning.

func ComplexDataset

func ComplexDataset() Dataset

ComplexDataset returns sample complex (cross-document) test cases.

func EasyDataset

func EasyDataset() Dataset

EasyDataset returns sample easy (single-fact) test cases.

func GDPREasyDataset added in v0.2.0

func GDPREasyDataset() Dataset

GDPREasyDataset returns 30 easy (single-fact lookup) test cases from the GDPR (Regulation (EU) 2016/679).

Expected facts use pipe-separated alternatives so accuracy scoring works regardless of the LLM's paraphrasing or formatting choices.

func GDPRHardDataset added in v0.2.0

func GDPRHardDataset() Dataset

GDPRHardDataset returns 30 hard (synthesis / regulatory-chain) test cases requiring deep understanding of interconnected GDPR provisions.

func GDPRMediumDataset added in v0.2.0

func GDPRMediumDataset() Dataset

GDPRMediumDataset returns 30 medium (multi-hop / cross-article) test cases requiring synthesis across multiple GDPR articles.

func GDPRSuperHardDataset added in v0.2.0

func GDPRSuperHardDataset() Dataset

GDPRSuperHardDataset returns 50 super-hard (adversarial / polysemy / edge-case) test cases designed to probe subtle distinctions and edge cases in the GDPR.

func LoadLegalBenchDatasets added in v0.2.0

func LoadLegalBenchDatasets(cfg LegalBenchConfig) ([]Dataset, error)

LoadLegalBenchDatasets loads LegalBench-RAG benchmark JSON files and converts them into GoReason Dataset format. Each benchmark file becomes a separate dataset (e.g., CUAD, ContractNLI, MAUD, PrivacyQA).

func MediumDataset

func MediumDataset() Dataset

MediumDataset returns sample medium (multi-hop) test cases.

type Evaluator

type Evaluator struct {
	// contains filtered or unexported fields
}

Evaluator runs evaluation test sets against a GoReason engine.

func NewEvaluator

func NewEvaluator(engine goreason.Engine) *Evaluator

NewEvaluator creates a new evaluator.

func (*Evaluator) Run

func (e *Evaluator) Run(ctx context.Context, dataset Dataset, opts ...goreason.QueryOption) (*Report, error)

Run executes an evaluation dataset against the engine.
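A minimal end-to-end usage sketch, assuming an already-constructed goreason.Engine (`engine`), a context (`ctx`), and an llm.Provider for judging (`judgeProvider`); "judge-model" is a placeholder model name:

```go
ev := eval.NewEvaluator(engine)

// Optional: use an LLM judge for semantic accuracy instead of
// verbatim substring matching.
ev.SetJudge(judgeProvider, "judge-model")

report, err := ev.Run(ctx, eval.ALTAVisionEasyDataset())
if err != nil {
	log.Fatal(err)
}
fmt.Println(eval.FormatReport(report))
```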

func (*Evaluator) SetGroundTruth added in v0.2.0

func (e *Evaluator) SetGroundTruth(gt map[string][]GroundTruthSpan)

SetGroundTruth sets ground-truth spans for retrieval P@k/R@k computation. The map key is the query string.

func (*Evaluator) SetJudge added in v0.2.0

func (e *Evaluator) SetJudge(provider llm.Provider, model string)

SetJudge configures an LLM judge for semantic accuracy evaluation. When set, accuracy is computed via LLM instead of verbatim substring matching.

type FactCheck

type FactCheck struct {
	Fact      string `json:"fact"`
	Found     bool   `json:"found"`
	ChunkID   int64  `json:"chunk_id,omitempty"`
	ChunkRank int    `json:"chunk_rank,omitempty"`
	Details   string `json:"details,omitempty"`
}

FactCheck records whether a single expected fact was found at a pipeline stage.

type FullContextEvaluator added in v0.2.0

type FullContextEvaluator struct {
	// contains filtered or unexported fields
}

FullContextEvaluator sends the entire document text + question directly to an LLM provider, bypassing RAG entirely. This serves as a baseline to compare against Graph RAG and Basic RAG approaches.

func NewFullContextEvaluator added in v0.2.0

func NewFullContextEvaluator(provider llm.Provider, docText string) *FullContextEvaluator

NewFullContextEvaluator creates a full-context evaluator. The docText should contain the entire document content (e.g. extracted PDF text).

func (*FullContextEvaluator) Run added in v0.2.0

func (e *FullContextEvaluator) Run(ctx context.Context, dataset Dataset) (*Report, error)

Run executes an evaluation dataset by sending the full document text + each question to the LLM. It produces a Report with the same metric structure as the engine-based evaluator so results are directly comparable.
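A sketch of the baseline flow, assuming `provider` is an llm.Provider and `docText` holds the full extracted document text; because the Report shape matches the engine-based evaluator, metrics can be compared field by field:

```go
fce := eval.NewFullContextEvaluator(provider, docText)
baseline, err := fce.Run(ctx, eval.ALTAVisionEasyDataset())
if err != nil {
	log.Fatal(err)
}
fmt.Printf("full-context avg accuracy: %.2f\n", baseline.Metrics.AvgAccuracy)
```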

type GroundTruthCheck

type GroundTruthCheck struct {
	FactsInDB      []FactCheck `json:"facts_in_db"`
	FactsEmbedded  []FactCheck `json:"facts_embedded"`
	FactsRetrieved []FactCheck `json:"facts_retrieved"`
	FactsInAnswer  []FactCheck `json:"facts_in_answer"`
	Diagnosis      string      `json:"diagnosis"`
}

GroundTruthCheck diagnoses where each expected fact was lost in the pipeline.

type GroundTruthSpan added in v0.2.0

type GroundTruthSpan struct {
	FilePath string
	Start    int
	End      int
	Text     string
}

GroundTruthSpan records a ground-truth snippet location for retrieval evaluation.

type LegalBenchBenchmark added in v0.2.0

type LegalBenchBenchmark struct {
	Tests []LegalBenchTest `json:"tests"`
}

LegalBenchBenchmark is the top-level benchmark file structure.

type LegalBenchConfig added in v0.2.0

type LegalBenchConfig struct {
	// BenchmarkFiles are paths to benchmark JSON files.
	BenchmarkFiles []string
	// CorpusDir is the path to the corpus directory (for reading snippet text).
	CorpusDir string
	// MaxTestsPerBenchmark caps the number of tests loaded per benchmark file.
	// 0 means no limit (load all). 194 matches the LegalBench-RAG-mini subset.
	MaxTestsPerBenchmark int
}

LegalBenchConfig controls how LegalBench-RAG data is loaded.
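An illustrative configuration (the benchmark and corpus paths are placeholders):

```go
cfg := eval.LegalBenchConfig{
	BenchmarkFiles:       []string{"benchmarks/cuad.json"}, // placeholder path
	CorpusDir:            "corpus",
	MaxTestsPerBenchmark: 194, // LegalBench-RAG-mini subset
}
datasets, err := eval.LoadLegalBenchDatasets(cfg)
```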

type LegalBenchSnippet added in v0.2.0

type LegalBenchSnippet struct {
	FilePath string `json:"file_path"`
	Span     [2]int `json:"span"`   // [start, end] character offsets
	Answer   string `json:"answer"` // pre-extracted snippet text
}

LegalBenchSnippet is a ground-truth snippet from the LegalBench-RAG benchmark.
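The Span field addresses the corpus file by character offsets, so recovering the snippet is a slice of the document text. The example below slices by byte offset; whether the benchmark counts bytes or runes is an assumption (they coincide for ASCII text):

```go
package main

import "fmt"

// snippetText extracts the [start, end) span from document text,
// mirroring the Span field's offset semantics.
func snippetText(doc string, span [2]int) string {
	return doc[span[0]:span[1]]
}

func main() {
	doc := "The Term of this Agreement shall be two years."
	fmt.Println(snippetText(doc, [2]int{4, 8})) // Term
}
```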

type LegalBenchTest added in v0.2.0

type LegalBenchTest struct {
	Query    string              `json:"query"`
	Snippets []LegalBenchSnippet `json:"snippets"`
	Tags     []string            `json:"tags"`
}

LegalBenchTest is a single Q&A test case from LegalBench-RAG.

type PDFComplexityResult

type PDFComplexityResult struct {
	Path            string  `json:"path"`
	ExpectedComplex bool    `json:"expected_complex"`
	DetectedComplex bool    `json:"detected_complex"`
	Score           float64 `json:"score"`
	Correct         bool    `json:"correct"`
	Details         string  `json:"details"`
}

PDFComplexityResult holds the evaluation of PDF complexity detection.

func EvaluatePDFComplexity

func EvaluatePDFComplexity(testCases []PDFComplexityTestCase) []PDFComplexityResult

EvaluatePDFComplexity tests the PDF complexity detector against known files.
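A usage sketch; the PDF paths and descriptions below are placeholders, not files shipped with the package:

```go
cases := []eval.PDFComplexityTestCase{
	{Path: "testdata/scanned.pdf", ExpectedComplex: true, Description: "image-only scan"},
	{Path: "testdata/simple.pdf", ExpectedComplex: false, Description: "plain text PDF"},
}
fmt.Println(eval.PDFComplexityReport(eval.EvaluatePDFComplexity(cases)))
```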

type PDFComplexityTestCase

type PDFComplexityTestCase struct {
	Path            string `json:"path"`
	ExpectedComplex bool   `json:"expected_complex"`
	Description     string `json:"description"`
}

PDFComplexityTestCase defines a test for the complexity detector.

type ReasoningStep

type ReasoningStep struct {
	Round     int      `json:"round"`
	Action    string   `json:"action"`
	Prompt    string   `json:"prompt,omitempty"`
	Response  string   `json:"response,omitempty"`
	Tokens    int      `json:"tokens,omitempty"`
	ElapsedMs int64    `json:"elapsed_ms,omitempty"`
	Issues    []string `json:"issues,omitempty"`
}

ReasoningStep records a single round of reasoning with full context for replay.

type Report

type Report struct {
	Dataset         string                      `json:"dataset"`
	Difficulty      string                      `json:"difficulty,omitempty"`
	TotalTests      int                         `json:"total_tests"`
	Passed          int                         `json:"passed"`
	Failed          int                         `json:"failed"`
	Metrics         AggregateMetrics            `json:"metrics"`
	CategoryMetrics map[string]AggregateMetrics `json:"category_metrics,omitempty"`
	Results         []TestResult                `json:"results"`
	RunTime         time.Duration               `json:"run_time"`
	TokenUsage      TokenUsage                  `json:"token_usage"`
}

Report holds the results of an evaluation run.

type RetrievalTrace

type RetrievalTrace struct {
	VecResults          int      `json:"vec_results"`
	FTSResults          int      `json:"fts_results"`
	GraphResults        int      `json:"graph_results"`
	FusedResults        int      `json:"fused_results"`
	VecWeight           float64  `json:"vec_weight"`
	FTSWeight           float64  `json:"fts_weight"`
	GraphWeight         float64  `json:"graph_weight"`
	IdentifiersDetected bool     `json:"identifiers_detected"`
	FTSQuery            string   `json:"fts_query"`
	GraphEntities       []string `json:"graph_entities"`
	ElapsedMs           int64    `json:"elapsed_ms"`
}

RetrievalTrace holds the full retrieval breakdown for a query.

type SourceTrace

type SourceTrace struct {
	ChunkID    int64    `json:"chunk_id"`
	Heading    string   `json:"heading"`
	Content    string   `json:"content"`
	PageNumber int      `json:"page_number"`
	Score      float64  `json:"score"`
	Methods    []string `json:"methods,omitempty"`
	VecRank    int      `json:"vec_rank,omitempty"`
	FTSRank    int      `json:"fts_rank,omitempty"`
	GraphRank  int      `json:"graph_rank,omitempty"`
}

SourceTrace records a single retrieved chunk with its retrieval metadata.

type TestCase

type TestCase struct {
	Question      string   `json:"question"`
	ExpectedFacts []string `json:"expected_facts"` // Facts that should appear in the answer
	Category      string   `json:"category"`       // single-fact, multi-hop, cross-document, multi-fact, synthesis
	Explanation   string   `json:"explanation"`    // Ground truth reference with page citations
}

TestCase defines a single evaluation question.

type TestResult

type TestResult struct {
	Question           string   `json:"question"`
	ExpectedFacts      []string `json:"expected_facts"`
	Category           string   `json:"category,omitempty"`
	Explanation        string   `json:"explanation,omitempty"`
	Answer             string   `json:"answer"`
	Confidence         float64  `json:"confidence"`
	Faithfulness       float64  `json:"faithfulness"`
	Relevance          float64  `json:"relevance"`
	Accuracy           float64  `json:"accuracy"`
	StrictAccuracy     float64  `json:"strict_accuracy"`
	ContextRecall      float64  `json:"context_recall"`
	CitationQuality    float64  `json:"citation_quality"`
	ClaimGrounding     float64  `json:"claim_grounding"`
	HallucinationScore float64  `json:"hallucination_score"`
	Passed             bool     `json:"passed"`
	Error              string   `json:"error,omitempty"`
	PromptTokens       int      `json:"prompt_tokens"`
	CompletionTokens   int      `json:"completion_tokens"`
	TotalTokens        int      `json:"total_tokens"`

	// Timing
	ElapsedMs int64 `json:"elapsed_ms"`

	// Sources (the chunks the model actually saw)
	Sources []SourceTrace `json:"sources,omitempty"`

	// Retrieval breakdown
	Retrieval *RetrievalTrace `json:"retrieval,omitempty"`

	// Reasoning trace
	ReasoningSteps []ReasoningStep `json:"reasoning_steps,omitempty"`

	// Ground truth diagnosis
	GroundTruth *GroundTruthCheck `json:"ground_truth,omitempty"`

	// Retrieval metrics (populated when ground-truth spans are available)
	RetrievalPrecision map[int]float64 `json:"retrieval_precision,omitempty"` // k -> P@k
	RetrievalRecall    map[int]float64 `json:"retrieval_recall,omitempty"`    // k -> R@k
}

TestResult holds the result of a single test case with full diagnostics.

type TokenUsage

type TokenUsage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

TokenUsage aggregates LLM token consumption across an evaluation run.
