evaluation

package
v1.1.0
Published: Mar 20, 2026 License: MIT Imports: 7 Imported by: 0

Documentation

Constants

const (
	Relevant   = core.CRAGRelevant
	Irrelevant = core.CRAGIrrelevant
	Ambiguous  = core.CRAGAmbiguous
)

Variables

This section is empty.

Functions

func NewCRAGEvaluator

func NewCRAGEvaluator(llm chat.Client) core.CRAGEvaluator

func NewRAGEvaluator

func NewRAGEvaluator(llm chat.Client) core.RAGEvaluator

Types

type BenchmarkResult

type BenchmarkResult struct {
	TotalCases      int           `json:"total_cases"`
	AvgFaithfulness float32       `json:"avg_faithfulness"`
	AvgRelevance    float32       `json:"avg_relevance"`
	AvgPrecision    float32       `json:"avg_precision"`
	TotalDuration   time.Duration `json:"total_duration"`
	Results         []CaseResult  `json:"results"`
}

BenchmarkResult holds the overall results of a benchmark run.

func RunBenchmark

func RunBenchmark(ctx context.Context, retriever core.Retriever, judge LLMJudge, cases []TestCase, topK int) (*BenchmarkResult, error)

RunBenchmark executes a full evaluation suite against a retriever.
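The aggregate fields of BenchmarkResult can be read as roll-ups of the per-case scores. The sketch below is a minimal illustration of that idea, assuming the averages are plain arithmetic means and TotalDuration is the sum of case durations; the documentation does not confirm how RunBenchmark computes them. The types caseResult and summary and the function aggregate are hypothetical local stand-ins, not part of this package.

```go
package main

import (
	"fmt"
	"time"
)

// caseResult is a hypothetical stand-in mirroring the score and
// duration fields of CaseResult.
type caseResult struct {
	Faithfulness float32
	Relevance    float32
	Precision    float32
	Duration     time.Duration
}

// summary is a hypothetical stand-in mirroring the aggregate fields
// of BenchmarkResult.
type summary struct {
	TotalCases      int
	AvgFaithfulness float32
	AvgRelevance    float32
	AvgPrecision    float32
	TotalDuration   time.Duration
}

// aggregate averages each score and sums the durations.
// Assumption: arithmetic means; the real RunBenchmark may differ.
func aggregate(results []caseResult) summary {
	s := summary{TotalCases: len(results)}
	if len(results) == 0 {
		return s // avoid dividing by zero on an empty suite
	}
	var f, r, p float32
	for _, c := range results {
		f += c.Faithfulness
		r += c.Relevance
		p += c.Precision
		s.TotalDuration += c.Duration
	}
	n := float32(len(results))
	s.AvgFaithfulness = f / n
	s.AvgRelevance = r / n
	s.AvgPrecision = p / n
	return s
}

func main() {
	s := aggregate([]caseResult{
		{Faithfulness: 1, Relevance: 0.5, Precision: 0.5, Duration: time.Second},
		{Faithfulness: 0.5, Relevance: 0.5, Precision: 1, Duration: 2 * time.Second},
	})
	fmt.Printf("%d cases, faithfulness %.2f, took %s\n",
		s.TotalCases, s.AvgFaithfulness, s.TotalDuration)
}
```

Returning early on an empty slice keeps the averages at zero instead of producing NaN from a division by zero.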

func (*BenchmarkResult) Summary

func (r *BenchmarkResult) Summary() string

Summary returns a human-readable summary of the benchmark.

type CaseResult

type CaseResult struct {
	Query             string        `json:"query"`
	Answer            string        `json:"answer"`
	FaithfulnessScore float32       `json:"faithfulness"`
	RelevanceScore    float32       `json:"relevance"`
	PrecisionScore    float32       `json:"precision"`
	Duration          time.Duration `json:"duration"`
}

CaseResult holds the evaluation result for a single test case.
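When a CaseResult is serialized, the struct tags above determine the JSON keys, and encoding/json writes a time.Duration as an integer count of nanoseconds. The sketch below shows this with a local copy of the struct; caseResult and encodeCase are hypothetical names used only for illustration, and the sample values are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// caseResult is a local copy of the documented CaseResult fields and
// tags, so the sketch compiles without importing the package.
type caseResult struct {
	Query             string        `json:"query"`
	Answer            string        `json:"answer"`
	FaithfulnessScore float32       `json:"faithfulness"`
	RelevanceScore    float32       `json:"relevance"`
	PrecisionScore    float32       `json:"precision"`
	Duration          time.Duration `json:"duration"`
}

// encodeCase marshals one result to a JSON string.
func encodeCase(c caseResult) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeCase(caseResult{
		Query:             "q",
		Answer:            "a",
		FaithfulnessScore: 0.5,
		RelevanceScore:    1,
		PrecisionScore:    0.25,
		Duration:          1500 * time.Millisecond,
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The duration field appears as 1500000000 (nanoseconds).
	fmt.Println(out)
}
```

Consumers that read this JSON outside Go need to convert the duration from nanoseconds themselves.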

type LLMJudge

type LLMJudge interface {
	// EvaluateFaithfulness checks if the generated answer is strictly grounded in the retrieved chunks.
	EvaluateFaithfulness(ctx context.Context, query string, chunks []*core.Chunk, answer string) (score float32, reason string, err error)

	// EvaluateAnswerRelevance checks if the answer effectively addresses the user's intent.
	EvaluateAnswerRelevance(ctx context.Context, query string, answer string) (score float32, reason string, err error)

	// EvaluateContextPrecision checks if the retrieved context actually contains the useful information.
	EvaluateContextPrecision(ctx context.Context, query string, chunks []*core.Chunk) (score float32, reason string, err error)
}

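LLMJudge defines RAGAS-style evaluation metrics (faithfulness, answer relevance, and context precision) that are scored by an LLM acting as the evaluator. Each method returns a score, the judge's stated reason, and an error.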
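Because LLMJudge is an interface, a deterministic stub can stand in for a real LLM when testing code that consumes a judge. The sketch below is one possible stub, not part of this package. Chunk is a hypothetical stand-in for core.Chunk, Judge is a local copy of the LLMJudge method set with that substitution, and fixedJudge and relevanceOf are illustrative names.

```go
package main

import (
	"context"
	"fmt"
)

// Chunk is a hypothetical stand-in for core.Chunk, whose fields are
// not shown in this documentation.
type Chunk struct{ Text string }

// Judge copies the LLMJudge method set, with Chunk in place of core.Chunk.
type Judge interface {
	EvaluateFaithfulness(ctx context.Context, query string, chunks []*Chunk, answer string) (float32, string, error)
	EvaluateAnswerRelevance(ctx context.Context, query string, answer string) (float32, string, error)
	EvaluateContextPrecision(ctx context.Context, query string, chunks []*Chunk) (float32, string, error)
}

// fixedJudge returns the same score for every metric and never calls an LLM.
type fixedJudge struct{ score float32 }

// result honours context cancellation, then returns the fixed score.
func (j fixedJudge) result(ctx context.Context) (float32, string, error) {
	if err := ctx.Err(); err != nil {
		return 0, "", err
	}
	return j.score, "fixed score from stub judge", nil
}

func (j fixedJudge) EvaluateFaithfulness(ctx context.Context, _ string, _ []*Chunk, _ string) (float32, string, error) {
	return j.result(ctx)
}

func (j fixedJudge) EvaluateAnswerRelevance(ctx context.Context, _ string, _ string) (float32, string, error) {
	return j.result(ctx)
}

func (j fixedJudge) EvaluateContextPrecision(ctx context.Context, _ string, _ []*Chunk) (float32, string, error) {
	return j.result(ctx)
}

// Compile-time check that fixedJudge satisfies the interface.
var _ Judge = fixedJudge{}

// relevanceOf calls one metric with a background context.
func relevanceOf(j Judge, query, answer string) (float32, error) {
	score, _, err := j.EvaluateAnswerRelevance(context.Background(), query, answer)
	return score, err
}

func main() {
	score, err := relevanceOf(fixedJudge{score: 0.8}, "example query", "example answer")
	fmt.Println(score, err)
}
```

A stub like this keeps tests fast and repeatable, since LLM-backed scores can vary between runs.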

type Label

type Label = core.CRAGLabel

Label is an alias for core.CRAGLabel. It is re-exported here for convenience; prefer using the core package types directly.

type RagasLLMJudge

type RagasLLMJudge struct {
	// contains filtered or unexported fields
}

RagasLLMJudge implements the LLMJudge interface using standard RAGAS-style prompts. It leverages a strong LLM (like GPT-4) to grade the pipeline's output.

func NewRagasLLMJudge

func NewRagasLLMJudge(judgeLLM chat.Client) *RagasLLMJudge

func (*RagasLLMJudge) EvaluateAnswerRelevance

func (j *RagasLLMJudge) EvaluateAnswerRelevance(ctx context.Context, query string, answer string) (float32, string, error)

EvaluateAnswerRelevance checks if the answer actually answers the user's question.

func (*RagasLLMJudge) EvaluateContextPrecision

func (j *RagasLLMJudge) EvaluateContextPrecision(ctx context.Context, query string, chunks []*core.Chunk) (float32, string, error)

EvaluateContextPrecision checks whether the retrieved chunks actually contain the information needed to answer the query.

func (*RagasLLMJudge) EvaluateFaithfulness

func (j *RagasLLMJudge) EvaluateFaithfulness(ctx context.Context, query string, chunks []*core.Chunk, answer string) (float32, string, error)

EvaluateFaithfulness checks for hallucinations against the retrieved context.

type TestCase

type TestCase struct {
	Query       string `json:"query"`
	GroundTruth string `json:"ground_truth,omitempty"` // Optional: What we expect the answer to be
}

TestCase represents a single evaluation entry.
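Since TestCase carries JSON tags, a suite of cases can be kept in a JSON file and decoded at run time. The sketch below assumes the cases are stored as a JSON array; testCase is a local copy of the documented struct, parseCases is a hypothetical helper, and the sample queries are made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// testCase is a local copy of the documented TestCase fields and tags.
type testCase struct {
	Query       string `json:"query"`
	GroundTruth string `json:"ground_truth,omitempty"` // optional expected answer
}

// parseCases decodes a JSON array of test cases.
func parseCases(data []byte) ([]testCase, error) {
	var cases []testCase
	if err := json.Unmarshal(data, &cases); err != nil {
		return nil, fmt.Errorf("parse test cases: %w", err)
	}
	return cases, nil
}

func main() {
	raw := []byte(`[
		{"query": "example question one", "ground_truth": "expected answer one"},
		{"query": "example question two"}
	]`)
	cases, err := parseCases(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range cases {
		// GroundTruth is empty when the key is omitted from the JSON.
		fmt.Printf("query=%q has_ground_truth=%t\n", c.Query, c.GroundTruth != "")
	}
}
```

Omitting ground_truth leaves the field as an empty string, which matches its documented optional status.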
