Documentation ¶
Index ¶
- Constants
- Variables
- func ALTAVisionAllDatasets() map[string]Dataset
- func FormatReport(r *Report) string
- func GDPRAllDatasets() map[string]Dataset
- func LoadLegalBenchGroundTruth(cfg LegalBenchConfig) (map[string][]GroundTruthSpan, error)
- func PDFComplexityReport(results []PDFComplexityResult) string
- func UsedCorpusFiles(cfg LegalBenchConfig) (map[string]struct{}, error)
- type AggregateMetrics
- type Dataset
- func ALTAVisionEasyDataset() Dataset
- func ALTAVisionGraphTestDataset() Dataset
- func ALTAVisionHardDataset() Dataset
- func ALTAVisionMediumDataset() Dataset
- func ALTAVisionSuperHardDataset() Dataset
- func ComplexDataset() Dataset
- func EasyDataset() Dataset
- func GDPREasyDataset() Dataset
- func GDPRHardDataset() Dataset
- func GDPRMediumDataset() Dataset
- func GDPRSuperHardDataset() Dataset
- func LoadLegalBenchDatasets(cfg LegalBenchConfig) ([]Dataset, error)
- func MediumDataset() Dataset
- type Evaluator
- type FactCheck
- type FullContextEvaluator
- type GroundTruthCheck
- type GroundTruthSpan
- type LegalBenchBenchmark
- type LegalBenchConfig
- type LegalBenchSnippet
- type LegalBenchTest
- type PDFComplexityResult
- type PDFComplexityTestCase
- type ReasoningStep
- type Report
- type RetrievalTrace
- type SourceTrace
- type TestCase
- type TestResult
- type TokenUsage
Constants ¶
const (
	DifficultyEasy      = "easy"
	DifficultyMedium    = "medium"
	DifficultyHard      = "hard"
	DifficultyComplex   = "complex" // Used by legacy ComplexDataset(); not part of ALTAVision eval.
	DifficultySuperHard = "super-hard"
	DifficultyGraphTest = "graph-test"
)
Difficulty levels for evaluation datasets.
Variables ¶
var RetrievalKValues = []int{1, 4, 8, 16, 32, 64}
RetrievalKValues are the k values at which P@k and R@k are computed.
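As a sketch of how P@k and R@k are conventionally computed over ranked retrieval results (the helper names `precisionAtK`/`recallAtK` and the chunk-ID representation are illustrative assumptions, not this package's internals):

```go
package main

import "fmt"

// precisionAtK: of the top-k retrieved chunk IDs, what fraction are relevant?
func precisionAtK(retrieved []int64, relevant map[int64]bool, k int) float64 {
	if k > len(retrieved) {
		k = len(retrieved)
	}
	if k == 0 {
		return 0
	}
	hits := 0
	for _, id := range retrieved[:k] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

// recallAtK: of all relevant chunks, what fraction appear in the top k?
func recallAtK(retrieved []int64, relevant map[int64]bool, k int) float64 {
	if len(relevant) == 0 {
		return 0
	}
	if k > len(retrieved) {
		k = len(retrieved)
	}
	hits := 0
	for _, id := range retrieved[:k] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

func main() {
	retrieved := []int64{7, 3, 9, 1}             // ranked retrieval results
	relevant := map[int64]bool{3: true, 5: true} // ground-truth chunk IDs
	fmt.Println(precisionAtK(retrieved, relevant, 4)) // 1 hit of 4 -> 0.25
	fmt.Println(recallAtK(retrieved, relevant, 4))    // 1 of 2 relevant -> 0.5
}
```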
Functions ¶
func ALTAVisionAllDatasets ¶
func ALTAVisionAllDatasets() map[string]Dataset
ALTAVisionAllDatasets returns all ALTAVision datasets keyed by difficulty.
func FormatReport ¶
func FormatReport(r *Report) string
FormatReport produces a human-readable report string.
func GDPRAllDatasets ¶ added in v0.2.0
func GDPRAllDatasets() map[string]Dataset
GDPRAllDatasets returns all GDPR evaluation datasets keyed by difficulty level.
func LoadLegalBenchGroundTruth ¶ added in v0.2.0
func LoadLegalBenchGroundTruth(cfg LegalBenchConfig) (map[string][]GroundTruthSpan, error)
LoadLegalBenchGroundTruth loads the raw ground-truth spans from benchmark files. This is used for retrieval P@k/R@k computation (matching retrieved chunks against exact document spans).
func PDFComplexityReport ¶
func PDFComplexityReport(results []PDFComplexityResult) string
PDFComplexityReport summarizes PDF complexity evaluation results.
func UsedCorpusFiles ¶ added in v0.2.0
func UsedCorpusFiles(cfg LegalBenchConfig) (map[string]struct{}, error)
UsedCorpusFiles returns the set of corpus file paths referenced by the loaded benchmark tests. When MaxTestsPerBenchmark is set, this returns only files needed for the subset — useful for skipping ingestion of unreferenced documents.
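The pruning described above can be sketched in plain Go — the `map[string]struct{}` result shape is as documented, while the `skipUnused` helper and the file paths are hypothetical illustrations:

```go
package main

import "fmt"

// skipUnused filters a corpus file list down to only those referenced by the
// loaded benchmark tests, mirroring how a caller might consume the set
// returned by UsedCorpusFiles before ingestion.
func skipUnused(allFiles []string, used map[string]struct{}) []string {
	var keep []string
	for _, f := range allFiles {
		if _, ok := used[f]; ok {
			keep = append(keep, f)
		}
	}
	return keep
}

func main() {
	used := map[string]struct{}{"corpus/cuad/a.txt": {}}
	all := []string{"corpus/cuad/a.txt", "corpus/cuad/b.txt"}
	fmt.Println(skipUnused(all, used)) // only a.txt needs ingestion
}
```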
Types ¶
type AggregateMetrics ¶
type AggregateMetrics struct {
AvgFaithfulness float64 `json:"avg_faithfulness"`
AvgRelevance float64 `json:"avg_relevance"`
AvgAccuracy float64 `json:"avg_accuracy"`
AvgStrictAccuracy float64 `json:"avg_strict_accuracy"`
AvgContextRecall float64 `json:"avg_context_recall"`
AvgCitationQuality float64 `json:"avg_citation_quality"`
AvgConfidence float64 `json:"avg_confidence"`
AvgClaimGrounding float64 `json:"avg_claim_grounding"`
AvgHallucinationScore float64 `json:"avg_hallucination_score"`
// Retrieval metrics (populated when ground-truth spans are available)
AvgRetrievalPrecision map[int]float64 `json:"avg_retrieval_precision,omitempty"` // k -> P@k
AvgRetrievalRecall map[int]float64 `json:"avg_retrieval_recall,omitempty"` // k -> R@k
}
AggregateMetrics holds averaged metrics across all tests.
type Dataset ¶
type Dataset struct {
Name string `json:"name"`
Difficulty string `json:"difficulty"` // easy, medium, hard, complex, super-hard, graph-test
Tests []TestCase `json:"tests"`
}
Dataset is a collection of test cases for evaluation.
func ALTAVisionEasyDataset ¶
func ALTAVisionEasyDataset() Dataset
ALTAVisionEasyDataset returns 30 easy (single-fact lookup) test cases from the ALTAVision AV-FM/AV-FF technical manual.
Expected facts use pipe-separated alternatives (e.g. "Spanish|English") so accuracy scoring works regardless of the LLM's answer language.
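The pipe-separated alternatives can be matched with a simple any-of check. This is a sketch of the idea (the `factFound` helper and case-insensitive substring matching are assumptions, not the package's actual scoring code):

```go
package main

import (
	"fmt"
	"strings"
)

// factFound reports whether any pipe-separated alternative of an expected
// fact appears in the answer, using a case-insensitive substring match.
func factFound(expected, answer string) bool {
	a := strings.ToLower(answer)
	for _, alt := range strings.Split(expected, "|") {
		if strings.Contains(a, strings.ToLower(alt)) {
			return true
		}
	}
	return false
}

func main() {
	// Either language alternative satisfies the expected fact.
	fmt.Println(factFound("Spanish|English", "The manual is written in English.")) // true
	fmt.Println(factFound("Spanish|English", "The manual is in German."))          // false
}
```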
func ALTAVisionGraphTestDataset ¶ added in v0.2.0
func ALTAVisionGraphTestDataset() Dataset
ALTAVisionGraphTestDataset returns 7 targeted test cases for evaluating graph-mode retrieval on the ALTAVision manual. These cover electrical specs, grounding, environment, anchoring, tracker card, beacon lights, and vortex cooling.
func ALTAVisionHardDataset ¶
func ALTAVisionHardDataset() Dataset
ALTAVisionHardDataset returns 30 hard (multi-hop reasoning) test cases.
func ALTAVisionMediumDataset ¶
func ALTAVisionMediumDataset() Dataset
ALTAVisionMediumDataset returns 30 medium (multi-fact, context-dependent) test cases.
func ALTAVisionSuperHardDataset ¶
func ALTAVisionSuperHardDataset() Dataset
ALTAVisionSuperHardDataset returns 50 super-hard (synthesis/inference) test cases. Includes the original 30 (with fixes to Q2, Q19, Q25, Q30) plus 20 new tests in categories: graph-multi-hop, anti-hallucination, numerical, reasoning.
func ComplexDataset ¶
func ComplexDataset() Dataset
ComplexDataset returns sample complex (cross-document) test cases.
func EasyDataset ¶
func EasyDataset() Dataset
EasyDataset returns sample easy (single-fact) test cases.
func GDPREasyDataset ¶ added in v0.2.0
func GDPREasyDataset() Dataset
GDPREasyDataset returns 30 easy (single-fact lookup) test cases from the GDPR (Regulation (EU) 2016/679).
Expected facts use pipe-separated alternatives so accuracy scoring works regardless of the LLM's paraphrasing or formatting choices.
func GDPRHardDataset ¶ added in v0.2.0
func GDPRHardDataset() Dataset
GDPRHardDataset returns 30 hard (synthesis / regulatory-chain) test cases requiring deep understanding of interconnected GDPR provisions.
func GDPRMediumDataset ¶ added in v0.2.0
func GDPRMediumDataset() Dataset
GDPRMediumDataset returns 30 medium (multi-hop / cross-article) test cases requiring synthesis across multiple GDPR articles.
func GDPRSuperHardDataset ¶ added in v0.2.0
func GDPRSuperHardDataset() Dataset
GDPRSuperHardDataset returns 50 super-hard (adversarial / polysemy / edge-case) test cases designed to probe subtle distinctions and edge cases in the GDPR.
func LoadLegalBenchDatasets ¶ added in v0.2.0
func LoadLegalBenchDatasets(cfg LegalBenchConfig) ([]Dataset, error)
LoadLegalBenchDatasets loads LegalBench-RAG benchmark JSON files and converts them into GoReason Dataset format. Each benchmark file becomes a separate dataset (e.g., CUAD, ContractNLI, MAUD, PrivacyQA).
func MediumDataset ¶
func MediumDataset() Dataset
MediumDataset returns sample medium (multi-hop) test cases.
type Evaluator ¶
type Evaluator struct {
// contains filtered or unexported fields
}
Evaluator runs evaluation test sets against a GoReason engine.
func NewEvaluator ¶
func NewEvaluator(engine goreason.Engine) *Evaluator
NewEvaluator creates a new evaluator.
func (*Evaluator) Run ¶
func (e *Evaluator) Run(ctx context.Context, dataset Dataset, opts ...goreason.QueryOption) (*Report, error)
Run executes an evaluation dataset against the engine.
func (*Evaluator) SetGroundTruth ¶ added in v0.2.0
func (e *Evaluator) SetGroundTruth(gt map[string][]GroundTruthSpan)
SetGroundTruth sets ground-truth spans for retrieval P@k/R@k computation. The map key is the query string.
type FactCheck ¶
type FactCheck struct {
Fact string `json:"fact"`
Found bool `json:"found"`
ChunkID int64 `json:"chunk_id,omitempty"`
ChunkRank int `json:"chunk_rank,omitempty"`
Details string `json:"details,omitempty"`
}
FactCheck records whether a single expected fact was found at a pipeline stage.
type FullContextEvaluator ¶ added in v0.2.0
type FullContextEvaluator struct {
// contains filtered or unexported fields
}
FullContextEvaluator sends the entire document text + question directly to an LLM provider, bypassing RAG entirely. This serves as a baseline to compare against Graph RAG and Basic RAG approaches.
func NewFullContextEvaluator ¶ added in v0.2.0
func NewFullContextEvaluator(provider llm.Provider, docText string) *FullContextEvaluator
NewFullContextEvaluator creates a full-context evaluator. The docText should contain the entire document content (e.g. extracted PDF text).
type GroundTruthCheck ¶
type GroundTruthCheck struct {
FactsInDB []FactCheck `json:"facts_in_db"`
FactsEmbedded []FactCheck `json:"facts_embedded"`
FactsRetrieved []FactCheck `json:"facts_retrieved"`
FactsInAnswer []FactCheck `json:"facts_in_answer"`
Diagnosis string `json:"diagnosis"`
}
GroundTruthCheck diagnoses where each expected fact was lost in the pipeline.
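A minimal sketch of the kind of stage-by-stage diagnosis this type supports — walking the pipeline stages in order and reporting where a fact first disappears. The `diagnose` helper, its stage names, and its message wording are illustrative, not the package's actual logic:

```go
package main

import "fmt"

// diagnose walks the pipeline stages in order (stored -> embedded ->
// retrieved -> answered) and reports the first stage at which a fact is lost.
func diagnose(fact string, found map[string]bool) string {
	for _, stage := range []string{"db", "embedded", "retrieved", "answer"} {
		if !found[stage] {
			return fact + " lost at stage: " + stage
		}
	}
	return fact + " survived the full pipeline"
}

func main() {
	// The fact was stored and embedded but never retrieved,
	// so retrieval is the stage to investigate.
	fmt.Println(diagnose("max input voltage", map[string]bool{
		"db": true, "embedded": true, "retrieved": false, "answer": false,
	}))
}
```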
type GroundTruthSpan ¶ added in v0.2.0
GroundTruthSpan records a ground-truth snippet location for retrieval evaluation.
type LegalBenchBenchmark ¶ added in v0.2.0
type LegalBenchBenchmark struct {
Tests []LegalBenchTest `json:"tests"`
}
LegalBenchBenchmark is the top-level benchmark file structure.
type LegalBenchConfig ¶ added in v0.2.0
type LegalBenchConfig struct {
// BenchmarkFiles are paths to benchmark JSON files.
BenchmarkFiles []string
// CorpusDir is the path to the corpus directory (for reading snippet text).
CorpusDir string
// MaxTestsPerBenchmark caps the number of tests loaded per benchmark file.
// 0 means no limit (load all). 194 matches the LegalBench-RAG-mini subset.
MaxTestsPerBenchmark int
}
LegalBenchConfig controls how LegalBench-RAG data is loaded.
type LegalBenchSnippet ¶ added in v0.2.0
type LegalBenchSnippet struct {
FilePath string `json:"file_path"`
Span [2]int `json:"span"` // [start, end] character offsets
Answer string `json:"answer"` // pre-extracted snippet text
}
LegalBenchSnippet is a ground-truth snippet from the LegalBench-RAG benchmark.
type LegalBenchTest ¶ added in v0.2.0
type LegalBenchTest struct {
Query string `json:"query"`
Snippets []LegalBenchSnippet `json:"snippets"`
Tags []string `json:"tags"`
}
LegalBenchTest is a single Q&A test case from LegalBench-RAG.
type PDFComplexityResult ¶
type PDFComplexityResult struct {
Path string `json:"path"`
ExpectedComplex bool `json:"expected_complex"`
DetectedComplex bool `json:"detected_complex"`
Score float64 `json:"score"`
Correct bool `json:"correct"`
Details string `json:"details"`
}
PDFComplexityResult holds the evaluation of PDF complexity detection.
func EvaluatePDFComplexity ¶
func EvaluatePDFComplexity(testCases []PDFComplexityTestCase) []PDFComplexityResult
EvaluatePDFComplexity tests the PDF complexity detector against known files.
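Summarizing detector results reduces to tallying agreement between expected and detected flags. The sketch below mirrors only the two relevant documented fields in a local `result` type; the `accuracy` helper is a hypothetical illustration, not this package's report code:

```go
package main

import "fmt"

// result mirrors the ExpectedComplex/DetectedComplex fields of the
// documented PDFComplexityResult type.
type result struct {
	ExpectedComplex bool
	DetectedComplex bool
}

// accuracy returns the fraction of cases where the detector agreed with
// the expected label.
func accuracy(results []result) float64 {
	if len(results) == 0 {
		return 0
	}
	correct := 0
	for _, r := range results {
		if r.ExpectedComplex == r.DetectedComplex {
			correct++
		}
	}
	return float64(correct) / float64(len(results))
}

func main() {
	rs := []result{{true, true}, {false, true}, {false, false}, {true, true}}
	fmt.Println(accuracy(rs)) // 3 of 4 correct -> 0.75
}
```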
type PDFComplexityTestCase ¶
type PDFComplexityTestCase struct {
Path string `json:"path"`
ExpectedComplex bool `json:"expected_complex"`
Description string `json:"description"`
}
PDFComplexityTestCase defines a test for the complexity detector.
type ReasoningStep ¶
type ReasoningStep struct {
Round int `json:"round"`
Action string `json:"action"`
Prompt string `json:"prompt,omitempty"`
Response string `json:"response,omitempty"`
Tokens int `json:"tokens,omitempty"`
ElapsedMs int64 `json:"elapsed_ms,omitempty"`
Issues []string `json:"issues,omitempty"`
}
ReasoningStep records a single round of reasoning with full context for replay.
type Report ¶
type Report struct {
Dataset string `json:"dataset"`
Difficulty string `json:"difficulty,omitempty"`
TotalTests int `json:"total_tests"`
Passed int `json:"passed"`
Failed int `json:"failed"`
Metrics AggregateMetrics `json:"metrics"`
CategoryMetrics map[string]AggregateMetrics `json:"category_metrics,omitempty"`
Results []TestResult `json:"results"`
RunTime time.Duration `json:"run_time"`
TokenUsage TokenUsage `json:"token_usage"`
}
Report holds the results of an evaluation run.
type RetrievalTrace ¶
type RetrievalTrace struct {
VecResults int `json:"vec_results"`
FTSResults int `json:"fts_results"`
GraphResults int `json:"graph_results"`
FusedResults int `json:"fused_results"`
VecWeight float64 `json:"vec_weight"`
FTSWeight float64 `json:"fts_weight"`
GraphWeight float64 `json:"graph_weight"`
IdentifiersDetected bool `json:"identifiers_detected"`
FTSQuery string `json:"fts_query"`
GraphEntities []string `json:"graph_entities"`
ElapsedMs int64 `json:"elapsed_ms"`
}
RetrievalTrace holds the full retrieval breakdown for a query.
type SourceTrace ¶
type SourceTrace struct {
ChunkID int64 `json:"chunk_id"`
Heading string `json:"heading"`
Content string `json:"content"`
PageNumber int `json:"page_number"`
Score float64 `json:"score"`
Methods []string `json:"methods,omitempty"`
VecRank int `json:"vec_rank,omitempty"`
FTSRank int `json:"fts_rank,omitempty"`
GraphRank int `json:"graph_rank,omitempty"`
}
SourceTrace records a single retrieved chunk with its retrieval metadata.
type TestCase ¶
type TestCase struct {
Question string `json:"question"`
ExpectedFacts []string `json:"expected_facts"` // Facts that should appear in the answer
Category string `json:"category"` // single-fact, multi-hop, cross-document, multi-fact, synthesis
Explanation string `json:"explanation"` // Ground truth reference with page citations
}
TestCase defines a single evaluation question.
type TestResult ¶
type TestResult struct {
Question string `json:"question"`
ExpectedFacts []string `json:"expected_facts"`
Category string `json:"category,omitempty"`
Explanation string `json:"explanation,omitempty"`
Answer string `json:"answer"`
Confidence float64 `json:"confidence"`
Faithfulness float64 `json:"faithfulness"`
Relevance float64 `json:"relevance"`
Accuracy float64 `json:"accuracy"`
StrictAccuracy float64 `json:"strict_accuracy"`
ContextRecall float64 `json:"context_recall"`
CitationQuality float64 `json:"citation_quality"`
ClaimGrounding float64 `json:"claim_grounding"`
HallucinationScore float64 `json:"hallucination_score"`
Passed bool `json:"passed"`
Error string `json:"error,omitempty"`
PromptTokens int `json:"prompt_tokens"`
CompletionTokens int `json:"completion_tokens"`
TotalTokens int `json:"total_tokens"`
// Timing
ElapsedMs int64 `json:"elapsed_ms"`
// Sources (the chunks the model actually saw)
Sources []SourceTrace `json:"sources,omitempty"`
// Retrieval breakdown
Retrieval *RetrievalTrace `json:"retrieval,omitempty"`
// Reasoning trace
ReasoningSteps []ReasoningStep `json:"reasoning_steps,omitempty"`
// Ground truth diagnosis
GroundTruth *GroundTruthCheck `json:"ground_truth,omitempty"`
// Retrieval metrics (populated when ground-truth spans are available)
RetrievalPrecision map[int]float64 `json:"retrieval_precision,omitempty"` // k -> P@k
RetrievalRecall map[int]float64 `json:"retrieval_recall,omitempty"` // k -> R@k
}
TestResult holds the result of a single test case with full diagnostics.
type TokenUsage ¶
type TokenUsage struct {
PromptTokens int `json:"prompt_tokens"`
CompletionTokens int `json:"completion_tokens"`
TotalTokens int `json:"total_tokens"`
}
TokenUsage aggregates LLM token consumption across an evaluation run.