evaluation

package
v1.30.0 Latest
Published: Mar 9, 2026 License: Apache-2.0 Imports: 30 Imported by: 0

Documentation

Overview

Package evaluation provides an evaluation framework for testing agents.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func GenerateRunName

func GenerateRunName() string

GenerateRunName creates a memorable name for an evaluation run.
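
A minimal usage sketch (the naming scheme itself is not specified here):

name := evaluation.GenerateRunName()
fmt.Println("starting evaluation run:", name)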

func Save

func Save(sess *session.Session, filename string) (string, error)

func SaveRunJSON

func SaveRunJSON(run *EvalRun, outputDir string) (string, error)

SaveRunJSON saves the eval run results to a JSON file. This is kept for backward compatibility and debugging purposes.

func SaveRunSessions

func SaveRunSessions(ctx context.Context, run *EvalRun, outputDir string) (string, error)

SaveRunSessions saves all eval sessions to a SQLite database file. The database follows the same schema as the main session store, allowing the sessions to be loaded and inspected using standard session tools.

func SaveRunSessionsJSON

func SaveRunSessionsJSON(run *EvalRun, outputDir string) (string, error)

SaveRunSessionsJSON saves all eval sessions to a single JSON file. Each session includes its eval criteria in the "evals" field. This complements SaveRunSessions which saves to SQLite, providing a human-readable format for inspection.
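
A sketch persisting one finished run with all three helpers, assuming run is an *EvalRun returned by Evaluate and outDir is an existing directory:

ctx := context.Background()

// Each helper returns the path it wrote.
resultsPath, err := evaluation.SaveRunJSON(run, outDir)
if err != nil {
	log.Fatal(err)
}
dbPath, err := evaluation.SaveRunSessions(ctx, run, outDir)
if err != nil {
	log.Fatal(err)
}
jsonPath, err := evaluation.SaveRunSessionsJSON(run, outDir)
if err != nil {
	log.Fatal(err)
}
log.Printf("results: %s; sessions: %s (SQLite), %s (JSON)", resultsPath, dbPath, jsonPath)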

func SessionFromEvents

func SessionFromEvents(events []map[string]any, title string, questions []string) *session.Session

SessionFromEvents reconstructs a session from raw container output events. This parses the JSON events emitted by docker agent run --exec --json and builds a session with the conversation history.
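
A sketch reconstructing a session from captured output, assuming a hypothetical events.jsonl file holding one JSON event per line as emitted by docker agent run --exec --json:

f, err := os.Open("events.jsonl") // hypothetical capture of the container's JSON output
if err != nil {
	log.Fatal(err)
}
defer f.Close()

var events []map[string]any
dec := json.NewDecoder(f)
for {
	var ev map[string]any
	if err := dec.Decode(&ev); err == io.EOF {
		break
	} else if err != nil {
		log.Fatal(err)
	}
	events = append(events, ev)
}

sess := evaluation.SessionFromEvents(events, "example eval", []string{"What changed in this release?"})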

Types

type Config

type Config struct {
	AgentFilename  string   // Path to the agent configuration file
	EvalsDir       string   // Directory containing evaluation files
	JudgeModel     string   // Model for relevance checking (format: provider/model, optional)
	Concurrency    int      // Number of concurrent runs (0 = number of CPUs)
	TTYFd          int      // File descriptor for terminal size queries (e.g., int(os.Stdout.Fd()))
	Only           []string // Only run evaluations matching these patterns
	BaseImage      string   // Custom base Docker image for running evaluations
	KeepContainers bool     // If true, don't remove containers after evaluation (skip --rm)
	EnvVars        []string // Environment variables to pass: KEY (value from env) or KEY=VALUE (explicit)
}

Config holds configuration for evaluation runs.
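
A sketch of a typical Config; every value below is illustrative, not prescribed by this package:

cfg := evaluation.Config{
	AgentFilename: "agent.yaml",         // agent under test
	EvalsDir:      "./evals",            // evaluation files
	JudgeModel:    "openai/gpt-4o",      // optional; provider/model format
	Concurrency:   0,                    // 0 = number of CPUs
	TTYFd:         int(os.Stdout.Fd()),  // for terminal size queries
	Only:          []string{"search-*"}, // run a subset by pattern
	EnvVars:       []string{"OPENAI_API_KEY"}, // value forwarded from the caller's environment
}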

type EvalRun

type EvalRun struct {
	Name      string        `json:"name"`
	Timestamp time.Time     `json:"timestamp"`
	Duration  time.Duration `json:"duration"`
	Results   []Result      `json:"results"`
	Summary   Summary       `json:"summary"`
}

EvalRun contains the results and metadata for an evaluation run.

func Evaluate

func Evaluate(ctx context.Context, ttyOut, out io.Writer, isTTY bool, runName string, runConfig *config.RuntimeConfig, cfg Config) (*EvalRun, error)

Evaluate runs evaluations with a specified run name. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).
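
A sketch of a complete run; how the *config.RuntimeConfig is built is outside this package and assumed here, and the TTY check via golang.org/x/term is just one option:

ctx := context.Background()
isTTY := term.IsTerminal(int(os.Stdout.Fd()))

var runCfg *config.RuntimeConfig // obtained from your runtime setup (assumed)

run, err := evaluation.Evaluate(ctx,
	os.Stdout, // ttyOut: progress bar rendering
	os.Stdout, // out: results and status; tee to a log file if desired
	isTTY,
	evaluation.GenerateRunName(),
	runCfg,
	evaluation.Config{AgentFilename: "agent.yaml", EvalsDir: "./evals"},
)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("%d evals, %d failed, $%.4f total\n",
	run.Summary.TotalEvals, run.Summary.FailedEvals, run.Summary.TotalCost)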

type InputSession

type InputSession struct {
	*session.Session
	SourcePath string // Path to the source eval file (not serialized)
}

InputSession wraps a session with its source path for evaluation loading.

type Judge

type Judge struct {
	// contains filtered or unexported fields
}

Judge runs LLM-as-a-judge relevance checks concurrently.

func NewJudge

func NewJudge(model provider.Provider, runConfig *config.RuntimeConfig, concurrency int) *Judge

NewJudge creates a new Judge that runs relevance checks with the given concurrency. Concurrency values below 1 default to 1.

func (*Judge) CheckRelevance

func (j *Judge) CheckRelevance(ctx context.Context, response string, criteria []string) (passed int, failed []RelevanceResult, errs []string)

CheckRelevance runs all relevance checks concurrently with the configured concurrency. It returns the number of passed checks, a slice of failed results with reasons, and any errors encountered.
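
A sketch pairing NewJudge with CheckRelevance; acquiring the provider.Provider and *config.RuntimeConfig values is outside this package and assumed here:

var model provider.Provider      // judge model client (assumed)
var runCfg *config.RuntimeConfig // runtime config (assumed)

judge := evaluation.NewJudge(model, runCfg, 4) // at most 4 checks in flight

response := "…the agent's answer under test…"
criteria := []string{
	"identifies the root cause",
	"proposes a concrete fix",
}
passed, failed, errs := judge.CheckRelevance(context.Background(), response, criteria)
fmt.Printf("relevance: %d/%d passed\n", passed, len(criteria))
for _, f := range failed {
	fmt.Printf("failed %q: %s\n", f.Criterion, f.Reason)
}
for _, e := range errs {
	fmt.Println("judge error:", e)
}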

type RelevanceResult

type RelevanceResult struct {
	Criterion string `json:"criterion"`
	Reason    string `json:"reason"`
}

RelevanceResult contains the result of a single relevance check.

type Result

type Result struct {
	InputPath         string            `json:"input_path"`
	Title             string            `json:"title"`
	Question          string            `json:"question"`
	Response          string            `json:"response"`
	Cost              float64           `json:"cost"`
	OutputTokens      int64             `json:"output_tokens"`
	Size              string            `json:"size"`
	SizeExpected      string            `json:"size_expected"`
	ToolCallsScore    float64           `json:"tool_calls_score"`
	ToolCallsExpected float64           `json:"tool_calls_score_expected"`
	HandoffsMatch     bool              `json:"handoffs"`
	RelevancePassed   float64           `json:"relevance"`
	RelevanceExpected float64           `json:"relevance_expected"`
	FailedRelevance   []RelevanceResult `json:"failed_relevance,omitempty"`
	Error             string            `json:"error,omitempty"`
	RawOutput         []map[string]any  `json:"raw_output,omitempty"`
	Session           *session.Session  `json:"-"` // Full session for database storage (not in JSON)
}

Result contains the evaluation results for a single test case.

type Runner

type Runner struct {
	Config
	// contains filtered or unexported fields
}

Runner runs evaluations against an agent.

func (*Runner) Run

func (r *Runner) Run(ctx context.Context, ttyOut, out io.Writer, isTTY bool) ([]Result, error)

Run executes all evaluations concurrently and returns results. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).
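
Runner embeds Config, so a direct construction like the sketch below may work; because Runner also has unexported fields, it is not documented whether this alone initializes it fully, and the higher-level Evaluate is the safer entry point:

r := &evaluation.Runner{Config: evaluation.Config{
	AgentFilename: "agent.yaml",
	EvalsDir:      "./evals",
}}
results, err := r.Run(context.Background(), os.Stdout, os.Stdout, false)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("%d results\n", len(results))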

type Summary

type Summary struct {
	TotalEvals      int     `json:"total_evals"`
	FailedEvals     int     `json:"failed_evals"`
	TotalCost       float64 `json:"total_cost"`
	SizesPassed     int     `json:"sizes_passed"`
	SizesTotal      int     `json:"sizes_total"`
	ToolsF1Sum      float64 `json:"tools_f1_sum"`
	ToolsCount      int     `json:"tools_count"`
	HandoffsPassed  int     `json:"handoffs_passed"`
	HandoffsTotal   int     `json:"handoffs_total"`
	RelevancePassed float64 `json:"relevance_passed"`
	RelevanceTotal  float64 `json:"relevance_total"`
}

Summary contains aggregate statistics across all evaluations.
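
Summary stores sums and counts rather than rates; a sketch of deriving the usual aggregates, assuming the fields mean what their names suggest:

s := run.Summary
if s.ToolsCount > 0 {
	fmt.Printf("mean tool-calls F1: %.2f\n", s.ToolsF1Sum/float64(s.ToolsCount))
}
if s.SizesTotal > 0 {
	fmt.Printf("sizes: %d/%d passed\n", s.SizesPassed, s.SizesTotal)
}
if s.HandoffsTotal > 0 {
	fmt.Printf("handoffs: %d/%d passed\n", s.HandoffsPassed, s.HandoffsTotal)
}
if s.RelevanceTotal > 0 {
	fmt.Printf("relevance: %.0f%% passed\n", 100*s.RelevancePassed/s.RelevanceTotal)
}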
