evaluation

package
v1.30.0 Latest
Published: Mar 9, 2026 License: Apache-2.0 Imports: 30 Imported by: 0

Documentation

Overview

Package evaluation provides an evaluation framework for testing agents.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func GenerateRunName

func GenerateRunName() string

GenerateRunName creates a memorable name for an evaluation run.
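
A minimal usage sketch (the naming scheme itself is not specified here):

name := evaluation.GenerateRunName()
fmt.Println("starting evaluation run:", name)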

func Save

func Save(sess *session.Session, filename string) (string, error)

func SaveRunJSON

func SaveRunJSON(run *EvalRun, outputDir string) (string, error)

SaveRunJSON saves the eval run results to a JSON file. This is kept for backward compatibility and debugging purposes.

func SaveRunSessions

func SaveRunSessions(ctx context.Context, run *EvalRun, outputDir string) (string, error)

SaveRunSessions saves all eval sessions to a SQLite database file. The database follows the same schema as the main session store, allowing the sessions to be loaded and inspected using standard session tools.

func SaveRunSessionsJSON

func SaveRunSessionsJSON(run *EvalRun, outputDir string) (string, error)

SaveRunSessionsJSON saves all eval sessions to a single JSON file. Each session includes its eval criteria in the "evals" field. This complements SaveRunSessions which saves to SQLite, providing a human-readable format for inspection.
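
A sketch persisting one finished run with all three helpers, assuming run is an *EvalRun returned by Evaluate and outDir is an existing directory:

ctx := context.Background()

// Each helper returns the path it wrote.
resultsPath, err := evaluation.SaveRunJSON(run, outDir)
if err != nil {
	log.Fatal(err)
}
dbPath, err := evaluation.SaveRunSessions(ctx, run, outDir)
if err != nil {
	log.Fatal(err)
}
jsonPath, err := evaluation.SaveRunSessionsJSON(run, outDir)
if err != nil {
	log.Fatal(err)
}
log.Printf("results: %s; sessions: %s (SQLite), %s (JSON)", resultsPath, dbPath, jsonPath)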

func SessionFromEvents

func SessionFromEvents(events []map[string]any, title string, questions []string) *session.Session

SessionFromEvents reconstructs a session from raw container output events. This parses the JSON events emitted by docker agent run --exec --json and builds a session with the conversation history.
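
A sketch reconstructing a session from captured output, assuming a hypothetical events.jsonl file holding one JSON event per line as emitted by docker agent run --exec --json:

f, err := os.Open("events.jsonl") // hypothetical capture of the container's JSON output
if err != nil {
	log.Fatal(err)
}
defer f.Close()

var events []map[string]any
dec := json.NewDecoder(f)
for {
	var ev map[string]any
	if err := dec.Decode(&ev); err == io.EOF {
		break
	} else if err != nil {
		log.Fatal(err)
	}
	events = append(events, ev)
}

sess := evaluation.SessionFromEvents(events, "example eval", []string{"What changed in this release?"})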

Types

type Config

type Config struct {
	AgentFilename  string   // Path to the agent configuration file
	EvalsDir       string   // Directory containing evaluation files
	JudgeModel     string   // Model for relevance checking (format: provider/model, optional)
	Concurrency    int      // Number of concurrent runs (0 = number of CPUs)
	TTYFd          int      // File descriptor for terminal size queries (e.g., int(os.Stdout.Fd()))
	Only           []string // Only run evaluations matching these patterns
	BaseImage      string   // Custom base Docker image for running evaluations
	KeepContainers bool     // If true, don't remove containers after evaluation (skip --rm)
	EnvVars        []string // Environment variables to pass: KEY (value from env) or KEY=VALUE (explicit)
}

Config holds configuration for evaluation runs.
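
A sketch of a typical Config; every value below is illustrative, not prescribed by this package:

cfg := evaluation.Config{
	AgentFilename: "agent.yaml",         // agent under test
	EvalsDir:      "./evals",            // evaluation files
	JudgeModel:    "openai/gpt-4o",      // optional; provider/model format
	Concurrency:   0,                    // 0 = number of CPUs
	TTYFd:         int(os.Stdout.Fd()),  // for terminal size queries
	Only:          []string{"search-*"}, // run a subset by pattern
	EnvVars:       []string{"OPENAI_API_KEY"}, // value forwarded from the caller's environment
}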

type EvalRun

type EvalRun struct {
	Name      string        `json:"name"`
	Timestamp time.Time     `json:"timestamp"`
	Duration  time.Duration `json:"duration"`
	Results   []Result      `json:"results"`
	Summary   Summary       `json:"summary"`
}

EvalRun contains the results and metadata for an evaluation run.

func Evaluate

func Evaluate(ctx context.Context, ttyOut, out io.Writer, isTTY bool, runName string, runConfig *config.RuntimeConfig, cfg Config) (*EvalRun, error)

Evaluate runs evaluations with a specified run name. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).
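
A sketch of a complete run; how the *config.RuntimeConfig is built is outside this package and assumed here, and the TTY check via golang.org/x/term is just one option:

ctx := context.Background()
isTTY := term.IsTerminal(int(os.Stdout.Fd()))

var runCfg *config.RuntimeConfig // obtained from your runtime setup (assumed)

run, err := evaluation.Evaluate(ctx,
	os.Stdout, // ttyOut: progress bar rendering
	os.Stdout, // out: results and status; tee to a log file if desired
	isTTY,
	evaluation.GenerateRunName(),
	runCfg,
	evaluation.Config{AgentFilename: "agent.yaml", EvalsDir: "./evals"},
)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("%d evals, %d failed, $%.4f total\n",
	run.Summary.TotalEvals, run.Summary.FailedEvals, run.Summary.TotalCost)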

type InputSession

type InputSession struct {
	*session.Session
	SourcePath string // Path to the source eval file (not serialized)
}

InputSession wraps a session with its source path for evaluation loading.

type Judge

type Judge struct {
	// contains filtered or unexported fields
}

Judge runs LLM-as-a-judge relevance checks concurrently.

func NewJudge

func NewJudge(model provider.Provider, runConfig *config.RuntimeConfig, concurrency int) *Judge

NewJudge creates a new Judge that runs relevance checks with the given concurrency. Concurrency values below 1 default to 1.

func (*Judge) CheckRelevance

func (j *Judge) CheckRelevance(ctx context.Context, response string, criteria []string) (passed int, failed []RelevanceResult, errs []string)

CheckRelevance runs all relevance checks concurrently with the configured concurrency. It returns the number of passed checks, a slice of failed results with reasons, and any errors encountered.
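
A sketch pairing NewJudge with CheckRelevance; acquiring the provider.Provider and *config.RuntimeConfig values is outside this package and assumed here:

var model provider.Provider      // judge model client (assumed)
var runCfg *config.RuntimeConfig // runtime config (assumed)

judge := evaluation.NewJudge(model, runCfg, 4) // at most 4 checks in flight

response := "…the agent's answer under test…"
criteria := []string{
	"identifies the root cause",
	"proposes a concrete fix",
}
passed, failed, errs := judge.CheckRelevance(context.Background(), response, criteria)
fmt.Printf("relevance: %d/%d passed\n", passed, len(criteria))
for _, f := range failed {
	fmt.Printf("failed %q: %s\n", f.Criterion, f.Reason)
}
for _, e := range errs {
	fmt.Println("judge error:", e)
}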

type RelevanceResult

type RelevanceResult struct {
	Criterion string `json:"criterion"`
	Reason    string `json:"reason"`
}

RelevanceResult contains the result of a single relevance check.

type Result

type Result struct {
	InputPath         string            `json:"input_path"`
	Title             string            `json:"title"`
	Question          string            `json:"question"`
	Response          string            `json:"response"`
	Cost              float64           `json:"cost"`
	OutputTokens      int64             `json:"output_tokens"`
	Size              string            `json:"size"`
	SizeExpected      string            `json:"size_expected"`
	ToolCallsScore    float64           `json:"tool_calls_score"`
	ToolCallsExpected float64           `json:"tool_calls_score_expected"`
	HandoffsMatch     bool              `json:"handoffs"`
	RelevancePassed   float64           `json:"relevance"`
	RelevanceExpected float64           `json:"relevance_expected"`
	FailedRelevance   []RelevanceResult `json:"failed_relevance,omitempty"`
	Error             string            `json:"error,omitempty"`
	RawOutput         []map[string]any  `json:"raw_output,omitempty"`
	Session           *session.Session  `json:"-"` // Full session for database storage (not in JSON)
}

Result contains the evaluation results for a single test case.

type Runner

type Runner struct {
	Config
	// contains filtered or unexported fields
}

Runner runs evaluations against an agent.

func (*Runner) Run

func (r *Runner) Run(ctx context.Context, ttyOut, out io.Writer, isTTY bool) ([]Result, error)

Run executes all evaluations concurrently and returns results. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).
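
Runner embeds Config, so a direct construction like the sketch below may work; because Runner also has unexported fields, it is not documented whether this alone initializes it fully, and the higher-level Evaluate is the safer entry point:

r := &evaluation.Runner{Config: evaluation.Config{
	AgentFilename: "agent.yaml",
	EvalsDir:      "./evals",
}}
results, err := r.Run(context.Background(), os.Stdout, os.Stdout, false)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("%d results\n", len(results))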

type Summary

type Summary struct {
	TotalEvals      int     `json:"total_evals"`
	FailedEvals     int     `json:"failed_evals"`
	TotalCost       float64 `json:"total_cost"`
	SizesPassed     int     `json:"sizes_passed"`
	SizesTotal      int     `json:"sizes_total"`
	ToolsF1Sum      float64 `json:"tools_f1_sum"`
	ToolsCount      int     `json:"tools_count"`
	HandoffsPassed  int     `json:"handoffs_passed"`
	HandoffsTotal   int     `json:"handoffs_total"`
	RelevancePassed float64 `json:"relevance_passed"`
	RelevanceTotal  float64 `json:"relevance_total"`
}

Summary contains aggregate statistics across all evaluations.
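
Summary stores sums and counts rather than rates; a sketch of deriving the usual aggregates, assuming the fields mean what their names suggest:

s := run.Summary
if s.ToolsCount > 0 {
	fmt.Printf("mean tool-calls F1: %.2f\n", s.ToolsF1Sum/float64(s.ToolsCount))
}
if s.SizesTotal > 0 {
	fmt.Printf("sizes: %d/%d passed\n", s.SizesPassed, s.SizesTotal)
}
if s.HandoffsTotal > 0 {
	fmt.Printf("handoffs: %d/%d passed\n", s.HandoffsPassed, s.HandoffsTotal)
}
if s.RelevanceTotal > 0 {
	fmt.Printf("relevance: %.0f%% passed\n", 100*s.RelevancePassed/s.RelevanceTotal)
}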
