Documentation
Overview
Package evaluation provides an evaluation framework for testing agents.
Index
- func GenerateRunName() string
- func Save(sess *session.Session, filename string) (string, error)
- func SaveRunJSON(run *EvalRun, outputDir string) (string, error)
- func SaveRunSessions(ctx context.Context, run *EvalRun, outputDir string) (string, error)
- func SaveRunSessionsJSON(run *EvalRun, outputDir string) (string, error)
- func SessionFromEvents(events []map[string]any, title string, questions []string) *session.Session
- type Config
- type EvalRun
- type InputSession
- type Judge
- type RelevanceResult
- type Result
- type RunOutput
- type RunOutputConfig
- type Runner
- type Summary
Constants
This section is empty.
Variables
This section is empty.
Functions
func GenerateRunName
func GenerateRunName() string
GenerateRunName creates a memorable name for an evaluation run.
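The naming scheme itself isn't documented here; a minimal sketch of a memorable adjective-noun generator (the word lists below are illustrative, not the package's actual vocabulary):

```go
package main

import (
	"fmt"
	"math/rand"
)

// generateRunName sketches a memorable run-name generator in the
// adjective-noun style. The word lists are assumptions for illustration.
func generateRunName() string {
	adjectives := []string{"brave", "calm", "eager", "swift"}
	nouns := []string{"falcon", "otter", "maple", "comet"}
	return fmt.Sprintf("%s-%s",
		adjectives[rand.Intn(len(adjectives))],
		nouns[rand.Intn(len(nouns))])
}

func main() {
	fmt.Println(generateRunName())
}
```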
func SaveRunJSON
func SaveRunJSON(run *EvalRun, outputDir string) (string, error)
SaveRunJSON saves the eval run results to a JSON file. It is kept for backward compatibility and debugging purposes.
func SaveRunSessions
func SaveRunSessions(ctx context.Context, run *EvalRun, outputDir string) (string, error)
SaveRunSessions saves all eval sessions to a SQLite database file. The database follows the same schema as the main session store, so the sessions can be loaded and inspected with standard session tools.
func SaveRunSessionsJSON
func SaveRunSessionsJSON(run *EvalRun, outputDir string) (string, error)
SaveRunSessionsJSON saves the full evaluation run output to a JSON file. The output includes run metadata (config, summary) and all sessions with their eval criteria and scoring results (pass/fail, judge reasoning, errors).
func SessionFromEvents
func SessionFromEvents(events []map[string]any, title string, questions []string) *session.Session
SessionFromEvents reconstructs a session from raw container output events. It parses the JSON events emitted by docker agent run --exec --json and builds a session with the conversation history.
Types
type Config
type Config struct {
	AgentFilename  string   // Path to the agent configuration file
	EvalsDir       string   // Directory containing evaluation files
	JudgeModel     string   // Model for relevance checking (format: provider/model, optional)
	Concurrency    int      // Number of concurrent runs (0 = number of CPUs)
	TTYFd          int      // File descriptor for terminal size queries (e.g., int(os.Stdout.Fd()))
	Only           []string // Only run evaluations matching these patterns
	BaseImage      string   // Custom base Docker image for running evaluations
	KeepContainers bool     // If true, don't remove containers after evaluation (skip --rm)
	EnvVars        []string // Environment variables to pass: KEY (value from env) or KEY=VALUE (explicit)
	Repeat         int      // Number of times to repeat each evaluation (default 1)
}
Config holds configuration for evaluation runs.
type EvalRun
type EvalRun struct {
	Name      string        `json:"name"`
	Timestamp time.Time     `json:"timestamp"`
	Duration  time.Duration `json:"duration"`
	Config    Config        `json:"-"` // Used to build RunOutput, not serialized directly
	Results   []Result      `json:"results"`
	Summary   Summary       `json:"summary"`
}
EvalRun contains the results and metadata for an evaluation run.
func Evaluate
func Evaluate(ctx context.Context, ttyOut, out io.Writer, isTTY bool, runName string, runConfig *config.RuntimeConfig, cfg Config) (*EvalRun, error)
Evaluate runs evaluations with a specified run name. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).
type InputSession
type InputSession struct {
	*session.Session
	SourcePath  string // Path to the source eval file (not serialized)
	RepeatIndex int    // Repeat iteration (1-based); 0 means no repeat
}
InputSession wraps a session with its source path for evaluation loading.
type Judge
type Judge struct {
	// contains filtered or unexported fields
}
Judge runs LLM-as-a-judge relevance checks concurrently.
func NewJudge
NewJudge creates a new Judge that runs relevance checks with the given concurrency. Concurrency defaults to 1 if n < 1.
func (*Judge) CheckRelevance
func (j *Judge) CheckRelevance(ctx context.Context, response string, criteria []string) (results []RelevanceResult, err error)
CheckRelevance runs all relevance checks concurrently with the configured concurrency. It returns a result for every criterion (both passed and failed, each with a reason from the judge model), and an error if any check encountered an error (e.g. judge model misconfiguration). Errors cause a hard failure so that configuration issues are surfaced immediately rather than silently producing zero-relevance results.
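The bounded-concurrency fan-out described above, with one result per criterion in input order and errors surfaced as hard failures, can be sketched with a channel semaphore. The substring "judge" below is a toy stand-in for a real model call, and checkAll and relevanceResult are illustrative names:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

type relevanceResult struct {
	Criterion string
	Passed    bool
	Reason    string
}

// checkAll runs one check per criterion concurrently, capped at the
// given concurrency. Results keep input order; errs collects per-check
// errors (the toy judge here never fails) and the first is returned.
func checkAll(response string, criteria []string, concurrency int) ([]relevanceResult, error) {
	if concurrency < 1 {
		concurrency = 1
	}
	results := make([]relevanceResult, len(criteria))
	errs := make([]error, len(criteria))
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for i, c := range criteria {
		wg.Add(1)
		go func(i int, c string) {
			defer wg.Done()
			sem <- struct{}{} // acquire a concurrency slot
			defer func() { <-sem }()
			passed := strings.Contains(response, c) // toy "judge"
			results[i] = relevanceResult{Criterion: c, Passed: passed, Reason: "substring check"}
		}(i, c)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return results, err
		}
	}
	return results, nil
}

func main() {
	res, err := checkAll("the capital of France is Paris", []string{"Paris", "Berlin"}, 2)
	fmt.Println(res[0].Passed, res[1].Passed, err)
	// prints: true false <nil>
}
```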
func (*Judge) Validate added in v1.32.4
Validate performs an end-to-end check of the judge model by sending a trivial relevance prompt and verifying the response is valid structured JSON. This catches configuration errors (bad API key, unsupported model, missing structured-output support, etc.) before running any evaluations, allowing the framework to fail fast.
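The fail-fast check on the response side can be sketched as below; judgeVerdict's field names mirror RelevanceResult's JSON tags but are assumptions about what the judge model is asked to return:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// judgeVerdict is an assumed shape for the judge's structured output.
type judgeVerdict struct {
	Passed bool   `json:"passed"`
	Reason string `json:"reason"`
}

// validateJudgeResponse rejects any raw model response that is not
// valid JSON matching the expected schema, so misconfiguration is
// caught before evaluations run.
func validateJudgeResponse(raw []byte) error {
	var v judgeVerdict
	if err := json.Unmarshal(raw, &v); err != nil {
		return fmt.Errorf("judge model did not return valid structured JSON: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(validateJudgeResponse([]byte(`{"passed":true,"reason":"ok"}`)))
	fmt.Println(validateJudgeResponse([]byte(`not json`)))
}
```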
type RelevanceResult
type RelevanceResult struct {
	Criterion string `json:"criterion"`
	Passed    bool   `json:"passed"`
	Reason    string `json:"reason"`
}
RelevanceResult contains the result of a single relevance check.
type Result
type Result struct {
	InputPath         string            `json:"input_path"`
	Title             string            `json:"title"`
	Question          string            `json:"question"`
	Response          string            `json:"response"`
	Cost              float64           `json:"cost"`
	OutputTokens      int64             `json:"output_tokens"`
	Size              string            `json:"size"`
	SizeExpected      string            `json:"size_expected"`
	ToolCallsScore    float64           `json:"tool_calls_score"`
	ToolCallsExpected float64           `json:"tool_calls_score_expected"`
	RelevancePassed   float64           `json:"relevance"`
	RelevanceExpected float64           `json:"relevance_expected"`
	RelevanceResults  []RelevanceResult `json:"relevance_results,omitempty"`
	Error             string            `json:"error,omitempty"`
	RawOutput         []map[string]any  `json:"raw_output,omitempty"`
	Session           *session.Session  `json:"-"` // Full session for database storage (not in JSON)
}
Result contains the evaluation results for a single test case.
type RunOutput added in v1.42.0
type RunOutput struct {
	Name      string             `json:"name"`
	Timestamp time.Time          `json:"timestamp"`
	Duration  string             `json:"duration"`
	Config    RunOutputConfig    `json:"config"`
	Summary   Summary            `json:"summary"`
	Sessions  []*session.Session `json:"sessions"`
}
RunOutput is the top-level structure for the evaluation run JSON output.
type RunOutputConfig added in v1.42.0
type RunOutputConfig struct {
	Agent       string `json:"agent"`
	JudgeModel  string `json:"judge_model,omitempty"`
	Concurrency int    `json:"concurrency"`
	EvalsDir    string `json:"evals_dir"`
	BaseImage   string `json:"base_image,omitempty"`
}
RunOutputConfig captures the evaluation run configuration.
type Runner
type Runner struct {
	Config
	// contains filtered or unexported fields
}
Runner runs evaluations against an agent.
type Summary
type Summary struct {
	TotalEvals      int     `json:"total_evals"`
	FailedEvals     int     `json:"failed_evals"`
	TotalCost       float64 `json:"total_cost"`
	SizesPassed     int     `json:"sizes_passed"`
	SizesTotal      int     `json:"sizes_total"`
	ToolsF1Sum      float64 `json:"tools_f1_sum"`
	ToolsCount      int     `json:"tools_count"`
	RelevancePassed float64 `json:"relevance_passed"`
	RelevanceTotal  float64 `json:"relevance_total"`
}
Summary contains aggregate statistics across all evaluations.
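The sum/count pairs above imply derived rates; a sketch of how a consumer might compute them, with zero-denominator guards as an assumption about how empty runs are reported:

```go
package main

import "fmt"

// summary is a simplified stand-in for the package's Summary type.
type summary struct {
	ToolsF1Sum      float64
	ToolsCount      int
	RelevancePassed float64
	RelevanceTotal  float64
}

// toolsF1Avg derives the mean tool-calls F1 score from the running sum.
func toolsF1Avg(s summary) float64 {
	if s.ToolsCount == 0 {
		return 0
	}
	return s.ToolsF1Sum / float64(s.ToolsCount)
}

// relevanceRate derives the fraction of relevance checks that passed.
func relevanceRate(s summary) float64 {
	if s.RelevanceTotal == 0 {
		return 0
	}
	return s.RelevancePassed / s.RelevanceTotal
}

func main() {
	s := summary{ToolsF1Sum: 1.8, ToolsCount: 2, RelevancePassed: 3, RelevanceTotal: 4}
	fmt.Println(toolsF1Avg(s), relevanceRate(s))
	// prints: 0.9 0.75
}
```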