evaluation

package
v1.20.5
Warning

This package is not in the latest version of its module.

Published: Feb 5, 2026 License: Apache-2.0 Imports: 30 Imported by: 0

Documentation

Overview

Package evaluation provides an evaluation framework for testing agents.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func GenerateRunName added in v1.19.0

func GenerateRunName() string

GenerateRunName creates a memorable name for an evaluation run.
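The package does not document its naming scheme. A minimal self-contained sketch of one common approach (random adjective-noun pairs, with hypothetical word lists — not the package's actual lists):

```go
package main

import (
	"fmt"
	"math/rand"
)

// generateRunName sketches a memorable-name generator in the spirit of
// GenerateRunName. The word lists are illustrative only.
func generateRunName() string {
	adjectives := []string{"brave", "calm", "eager", "mighty"}
	nouns := []string{"falcon", "otter", "willow", "comet"}
	return fmt.Sprintf("%s-%s",
		adjectives[rand.Intn(len(adjectives))],
		nouns[rand.Intn(len(nouns))])
}

func main() {
	// Prints something like "eager-comet".
	fmt.Println(generateRunName())
}
```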

func Save

func Save(sess *session.Session, filename string) (string, error)

func SaveRunJSON added in v1.19.1

func SaveRunJSON(run *EvalRun, outputDir string) (string, error)

SaveRunJSON saves the eval run results to a JSON file. This is kept for backward compatibility and debugging purposes.

func SaveRunSessions added in v1.20.5

func SaveRunSessions(ctx context.Context, run *EvalRun, outputDir string) (string, error)

SaveRunSessions saves all eval sessions to a SQLite database file. The database follows the same schema as the main session store, allowing the sessions to be loaded and inspected using standard session tools.

func SaveRunSessionsJSON added in v1.20.5

func SaveRunSessionsJSON(run *EvalRun, outputDir string) (string, error)

SaveRunSessionsJSON saves all eval sessions to a single JSON file. Each session includes its eval criteria in the "evals" field. This complements SaveRunSessions which saves to SQLite, providing a human-readable format for inspection.

func SessionFromEvents added in v1.20.5

func SessionFromEvents(events []map[string]any, title, question string) *session.Session

SessionFromEvents reconstructs a session from raw container output events. This parses the JSON events emitted by cagent --json and builds a session with the conversation history.
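The raw events are line-delimited JSON. A self-contained sketch of producing the `[]map[string]any` slice that SessionFromEvents consumes (the event shapes shown are illustrative, not cagent's actual schema):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// parseEvents turns line-delimited JSON output (as emitted by a tool run
// with a --json flag) into the event-map slice SessionFromEvents expects.
func parseEvents(raw string) ([]map[string]any, error) {
	var events []map[string]any
	sc := bufio.NewScanner(strings.NewReader(raw))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // skip blank lines between events
		}
		var ev map[string]any
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			return nil, err
		}
		events = append(events, ev)
	}
	return events, sc.Err()
}

func main() {
	raw := `{"type":"user","content":"hi"}
{"type":"assistant","content":"hello"}`
	events, err := parseEvents(raw)
	fmt.Println(len(events), err) // 2 <nil>
}
```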

Types

type Config added in v1.19.0

type Config struct {
	AgentFilename  string   // Path to the agent configuration file
	EvalsDir       string   // Directory containing evaluation files
	JudgeModel     string   // Model for relevance checking (format: provider/model, optional)
	Concurrency    int      // Number of concurrent runs (0 = number of CPUs)
	TTYFd          int      // File descriptor for terminal size queries (e.g., int(os.Stdout.Fd()))
	Only           []string // Only run evaluations matching these patterns
	BaseImage      string   // Custom base Docker image for running evaluations
	KeepContainers bool     // If true, don't remove containers after evaluation (skip --rm)
	EnvVars        []string // Environment variables to pass: KEY (value from env) or KEY=VALUE (explicit)
}

Config holds configuration for evaluation runs.

type EvalRun added in v1.19.0

type EvalRun struct {
	Name      string        `json:"name"`
	Timestamp time.Time     `json:"timestamp"`
	Duration  time.Duration `json:"duration"`
	Results   []Result      `json:"results"`
	Summary   Summary       `json:"summary"`
}

EvalRun contains the results and metadata for an evaluation run.

func Evaluate

func Evaluate(ctx context.Context, ttyOut, out io.Writer, isTTY bool, runName string, runConfig *config.RuntimeConfig, cfg Config) (*EvalRun, error)

Evaluate runs evaluations with a specified run name. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).

type InputSession added in v1.20.5

type InputSession struct {
	*session.Session
	SourcePath string // Path to the source eval file (not serialized)
}

InputSession wraps a session with its source path for evaluation loading.

type Judge added in v1.20.0

type Judge struct {
	// contains filtered or unexported fields
}

Judge runs LLM-as-a-judge relevance checks concurrently.

func NewJudge added in v1.20.0

func NewJudge(model provider.Provider, runConfig *config.RuntimeConfig, concurrency int) *Judge

NewJudge creates a new Judge that runs relevance checks with the given concurrency. Concurrency defaults to 1 when the given value is less than 1.

func (*Judge) CheckRelevance added in v1.20.0

func (j *Judge) CheckRelevance(ctx context.Context, response string, criteria []string) (passed int, failed []RelevanceResult, errs []string)

CheckRelevance runs all relevance checks concurrently with the configured concurrency. It returns the number of passed checks, a slice of failed results with reasons, and any errors encountered.
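A self-contained sketch of the bounded-concurrency pattern this describes: one check per criterion, at most `concurrency` in flight at once. A substring match stands in for the LLM judge call, so only the concurrency structure reflects the documented behavior:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// checkAll runs one check per criterion with at most `concurrency`
// goroutines active at a time, aggregating passed/failed under a mutex.
func checkAll(response string, criteria []string, concurrency int) (passed int, failed []string) {
	if concurrency < 1 {
		concurrency = 1 // mirrors NewJudge's documented default
	}
	sem := make(chan struct{}, concurrency) // counting semaphore
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, c := range criteria {
		wg.Add(1)
		go func(criterion string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it

			// Stand-in for the real judge call.
			ok := strings.Contains(response, criterion)

			mu.Lock()
			defer mu.Unlock()
			if ok {
				passed++
			} else {
				failed = append(failed, criterion)
			}
		}(c)
	}
	wg.Wait()
	return passed, failed
}

func main() {
	p, f := checkAll("the answer is 42", []string{"42", "hello"}, 2)
	fmt.Println(p, f) // 1 [hello]
}
```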

type RelevanceResult added in v1.20.5

type RelevanceResult struct {
	Criterion string `json:"criterion"`
	Reason    string `json:"reason"`
}

RelevanceResult contains the result of a single relevance check.

type Result

type Result struct {
	InputPath         string            `json:"input_path"`
	Title             string            `json:"title"`
	Question          string            `json:"question"`
	Response          string            `json:"response"`
	Cost              float64           `json:"cost"`
	OutputTokens      int64             `json:"output_tokens"`
	Size              string            `json:"size"`
	SizeExpected      string            `json:"size_expected"`
	ToolCallsScore    float64           `json:"tool_calls_score"`
	ToolCallsExpected float64           `json:"tool_calls_score_expected"`
	HandoffsMatch     bool              `json:"handoffs"`
	RelevancePassed   float64           `json:"relevance"`
	RelevanceExpected float64           `json:"relevance_expected"`
	FailedRelevance   []RelevanceResult `json:"failed_relevance,omitempty"`
	Error             string            `json:"error,omitempty"`
	RawOutput         []map[string]any  `json:"raw_output,omitempty"`
	Session           *session.Session  `json:"-"` // Full session for database storage (not in JSON)
}

Result contains the evaluation results for a single test case.

type Runner added in v1.19.0

type Runner struct {
	Config
	// contains filtered or unexported fields
}

Runner runs evaluations against an agent.

func (*Runner) Run added in v1.19.0

func (r *Runner) Run(ctx context.Context, ttyOut, out io.Writer, isTTY bool) ([]Result, error)

Run executes all evaluations concurrently and returns results. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).

type Summary added in v1.19.0

type Summary struct {
	TotalEvals      int     `json:"total_evals"`
	FailedEvals     int     `json:"failed_evals"`
	TotalCost       float64 `json:"total_cost"`
	SizesPassed     int     `json:"sizes_passed"`
	SizesTotal      int     `json:"sizes_total"`
	ToolsPassed     float64 `json:"tools_passed"`
	ToolsTotal      float64 `json:"tools_total"`
	HandoffsPassed  int     `json:"handoffs_passed"`
	HandoffsTotal   int     `json:"handoffs_total"`
	RelevancePassed float64 `json:"relevance_passed"`
	RelevanceTotal  float64 `json:"relevance_total"`
}

Summary contains aggregate statistics across all evaluations.
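The counters pair up as passed/total per check type. A sketch of turning them into rates; `passRate` is an illustrative helper, not part of the package, and the Summary mirror below drops the JSON tags:

```go
package main

import "fmt"

// Summary mirrors a subset of the documented aggregate struct.
type Summary struct {
	TotalEvals, FailedEvals         int
	SizesPassed, SizesTotal         int
	RelevancePassed, RelevanceTotal float64
}

// passRate converts a passed/total counter pair into a fraction,
// guarding the zero-total case.
func passRate(passed, total float64) float64 {
	if total == 0 {
		return 0
	}
	return passed / total
}

func main() {
	s := Summary{
		TotalEvals: 10, FailedEvals: 2,
		SizesPassed: 7, SizesTotal: 10,
		RelevancePassed: 18, RelevanceTotal: 20,
	}
	fmt.Printf("evals ok: %d/%d, relevance: %.0f%%\n",
		s.TotalEvals-s.FailedEvals, s.TotalEvals,
		100*passRate(s.RelevancePassed, s.RelevanceTotal))
	// evals ok: 8/10, relevance: 90%
}
```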
