Documentation ¶
Index ¶
- Constants
- func NewCRAGEvaluator(llm chat.Client) core.CRAGEvaluator
- func NewRAGEvaluator(llm chat.Client) core.RAGEvaluator
- type BenchmarkResult
- func RunBenchmark(ctx context.Context, retriever core.Retriever, judge LLMJudge, cases []TestCase, topK int) (*BenchmarkResult, error)
- func (r *BenchmarkResult) Summary() string
- type CaseResult
- type LLMJudge
- type Label
- type RagasLLMJudge
- func NewRagasLLMJudge(judgeLLM chat.Client) *RagasLLMJudge
- func (j *RagasLLMJudge) EvaluateAnswerRelevance(ctx context.Context, query string, answer string) (float32, string, error)
- func (j *RagasLLMJudge) EvaluateContextPrecision(ctx context.Context, query string, chunks []*core.Chunk) (float32, string, error)
- func (j *RagasLLMJudge) EvaluateFaithfulness(ctx context.Context, query string, chunks []*core.Chunk, answer string) (float32, string, error)
- type TestCase
Constants ¶
const (
	Relevant   = core.CRAGRelevant
	Irrelevant = core.CRAGIrrelevant
	Ambiguous  = core.CRAGAmbiguous
)
Variables ¶
This section is empty.
Functions ¶
func NewCRAGEvaluator ¶
func NewCRAGEvaluator(llm chat.Client) core.CRAGEvaluator
func NewRAGEvaluator ¶
func NewRAGEvaluator(llm chat.Client) core.RAGEvaluator
Types ¶
type BenchmarkResult ¶
type BenchmarkResult struct {
	TotalCases      int           `json:"total_cases"`
	AvgFaithfulness float32       `json:"avg_faithfulness"`
	AvgRelevance    float32       `json:"avg_relevance"`
	AvgPrecision    float32       `json:"avg_precision"`
	TotalDuration   time.Duration `json:"total_duration"`
	Results         []CaseResult  `json:"results"`
}
BenchmarkResult holds the overall results of a benchmark run.
func RunBenchmark ¶
func RunBenchmark(ctx context.Context, retriever core.Retriever, judge LLMJudge, cases []TestCase, topK int) (*BenchmarkResult, error)
RunBenchmark executes a full evaluation suite against a retriever.
func (*BenchmarkResult) Summary ¶
func (r *BenchmarkResult) Summary() string
Summary returns a human-readable summary of the benchmark.
type CaseResult ¶
type CaseResult struct {
	Query             string        `json:"query"`
	Answer            string        `json:"answer"`
	FaithfulnessScore float32       `json:"faithfulness"`
	RelevanceScore    float32       `json:"relevance"`
	PrecisionScore    float32       `json:"precision"`
	Duration          time.Duration `json:"duration"`
}
CaseResult holds the evaluation result for a single test case.
type LLMJudge ¶
type LLMJudge interface {
	// EvaluateFaithfulness checks if the generated answer is strictly grounded in the retrieved chunks.
	EvaluateFaithfulness(ctx context.Context, query string, chunks []*core.Chunk, answer string) (score float32, reason string, err error)
	// EvaluateAnswerRelevance checks if the answer effectively addresses the user's intent.
	EvaluateAnswerRelevance(ctx context.Context, query string, answer string) (score float32, reason string, err error)
	// EvaluateContextPrecision checks if the retrieved context actually contains the useful information.
	EvaluateContextPrecision(ctx context.Context, query string, chunks []*core.Chunk) (score float32, reason string, err error)
}
LLMJudge provides production-grade evaluation metrics (e.g., RAGAS) using an LLM as the evaluator.
type RagasLLMJudge ¶
type RagasLLMJudge struct {
	// contains filtered or unexported fields
}
RagasLLMJudge implements the LLMJudge interface using standard RAGAS-style prompts. It leverages a strong LLM (like GPT-4) to grade the pipeline's output.
func NewRagasLLMJudge ¶
func NewRagasLLMJudge(judgeLLM chat.Client) *RagasLLMJudge
func (*RagasLLMJudge) EvaluateAnswerRelevance ¶
func (j *RagasLLMJudge) EvaluateAnswerRelevance(ctx context.Context, query string, answer string) (float32, string, error)
EvaluateAnswerRelevance checks if the answer actually answers the user's question.