Documentation
Overview
Package evaluation provides an evaluation framework for testing agents.
Index
- func GenerateRunName() string
- func Save(sess *session.Session, filename string) (string, error)
- func SaveRunJSON(run *EvalRun, outputDir string) (string, error)
- func SaveRunSessions(ctx context.Context, run *EvalRun, outputDir string) (string, error)
- func SaveRunSessionsJSON(run *EvalRun, outputDir string) (string, error)
- func SessionFromEvents(events []map[string]any, title string, questions []string) *session.Session
- type Config
- type EvalRun
- type InputSession
- type Judge
- type RelevanceResult
- type Result
- type RunOutput
- type RunOutputConfig
- type Runner
- type Summary
Constants
This section is empty.
Variables
This section is empty.
Functions
func GenerateRunName
func GenerateRunName() string
GenerateRunName creates a memorable name for an evaluation run.
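The naming scheme itself isn't documented here; a minimal sketch of a memorable adjective-noun generator (the word lists below are illustrative, not the package's actual vocabulary):

```go
package main

import (
	"fmt"
	"math/rand"
)

// generateRunName sketches a memorable run-name generator in the
// adjective-noun style. The word lists are assumptions for illustration.
func generateRunName() string {
	adjectives := []string{"brave", "calm", "eager", "swift"}
	nouns := []string{"falcon", "otter", "maple", "comet"}
	return fmt.Sprintf("%s-%s",
		adjectives[rand.Intn(len(adjectives))],
		nouns[rand.Intn(len(nouns))])
}

func main() {
	fmt.Println(generateRunName())
}
```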
func SaveRunJSON
func SaveRunJSON(run *EvalRun, outputDir string) (string, error)
SaveRunJSON saves the eval run results to a JSON file. It is kept for backward compatibility and debugging purposes.
func SaveRunSessions
func SaveRunSessions(ctx context.Context, run *EvalRun, outputDir string) (string, error)
SaveRunSessions saves all eval sessions to a SQLite database file. The database follows the same schema as the main session store, so the sessions can be loaded and inspected with standard session tools.
func SaveRunSessionsJSON
func SaveRunSessionsJSON(run *EvalRun, outputDir string) (string, error)
SaveRunSessionsJSON saves the full evaluation run output to a JSON file. The output includes run metadata (config, summary) and all sessions with their eval criteria and scoring results (pass/fail, judge reasoning, errors).
func SessionFromEvents
func SessionFromEvents(events []map[string]any, title string, questions []string) *session.Session
SessionFromEvents reconstructs a session from raw container output events. It parses the JSON events emitted by docker agent run --exec --json and builds a session with the conversation history.
Types
type Config
type Config struct {
	AgentFilename  string   // Path to the agent configuration file
	EvalsDir       string   // Directory containing evaluation files
	JudgeModel     string   // Model for relevance checking (format: provider/model, optional)
	Concurrency    int      // Number of concurrent runs (0 = number of CPUs)
	TTYFd          int      // File descriptor for terminal size queries (e.g., int(os.Stdout.Fd()))
	Only           []string // Only run evaluations matching these patterns
	BaseImage      string   // Custom base Docker image for running evaluations
	KeepContainers bool     // If true, don't remove containers after evaluation (skip --rm)
	EnvVars        []string // Environment variables to pass: KEY (value from env) or KEY=VALUE (explicit)
	Repeat         int      // Number of times to repeat each evaluation (default 1)
}
Config holds configuration for evaluation runs.
type EvalRun
type EvalRun struct {
	Name      string        `json:"name"`
	Timestamp time.Time     `json:"timestamp"`
	Duration  time.Duration `json:"duration"`
	Config    Config        `json:"-"` // Used to build RunOutput, not serialized directly
	Results   []Result      `json:"results"`
	Summary   Summary       `json:"summary"`
}
EvalRun contains the results and metadata for an evaluation run.
func Evaluate
func Evaluate(ctx context.Context, ttyOut, out io.Writer, isTTY bool, runName string, runConfig *config.RuntimeConfig, cfg Config) (*EvalRun, error)
Evaluate runs evaluations with a specified run name. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).
type InputSession
type InputSession struct {
	*session.Session
	SourcePath  string // Path to the source eval file (not serialized)
	RepeatIndex int    // Repeat iteration (1-based); 0 means no repeat
}
InputSession wraps a session with its source path for evaluation loading.
type Judge
type Judge struct {
	// contains filtered or unexported fields
}
Judge runs LLM-as-a-judge relevance checks concurrently.
func NewJudge
NewJudge creates a new Judge that runs relevance checks with the given concurrency. Concurrency defaults to 1 if n < 1.
func (*Judge) CheckRelevance
func (j *Judge) CheckRelevance(ctx context.Context, response string, criteria []string) (results []RelevanceResult, err error)
CheckRelevance runs all relevance checks concurrently with the configured concurrency. It returns a result for every criterion (both passed and failed, each with a reason from the judge model), and an error if any check encountered an error (e.g. judge model misconfiguration). Errors cause a hard failure so that configuration issues are surfaced immediately rather than silently producing zero-relevance results.
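The bounded-concurrency fan-out described above, with one result per criterion in input order and errors surfaced as hard failures, can be sketched with a channel semaphore. The substring "judge" below is a toy stand-in for a real model call, and checkAll and relevanceResult are illustrative names:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

type relevanceResult struct {
	Criterion string
	Passed    bool
	Reason    string
}

// checkAll runs one check per criterion concurrently, capped at the
// given concurrency. Results keep input order; errs collects per-check
// errors (the toy judge here never fails) and the first is returned.
func checkAll(response string, criteria []string, concurrency int) ([]relevanceResult, error) {
	if concurrency < 1 {
		concurrency = 1
	}
	results := make([]relevanceResult, len(criteria))
	errs := make([]error, len(criteria))
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for i, c := range criteria {
		wg.Add(1)
		go func(i int, c string) {
			defer wg.Done()
			sem <- struct{}{} // acquire a concurrency slot
			defer func() { <-sem }()
			passed := strings.Contains(response, c) // toy "judge"
			results[i] = relevanceResult{Criterion: c, Passed: passed, Reason: "substring check"}
		}(i, c)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return results, err
		}
	}
	return results, nil
}

func main() {
	res, err := checkAll("the capital of France is Paris", []string{"Paris", "Berlin"}, 2)
	fmt.Println(res[0].Passed, res[1].Passed, err)
	// prints: true false <nil>
}
```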
func (*Judge) Validate added in v1.32.4
Validate performs an end-to-end check of the judge model by sending a trivial relevance prompt and verifying the response is valid structured JSON. This catches configuration errors (bad API key, unsupported model, missing structured-output support, etc.) before running any evaluations, allowing the framework to fail fast.
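The fail-fast check on the response side can be sketched as below; judgeVerdict's field names mirror RelevanceResult's JSON tags but are assumptions about what the judge model is asked to return:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// judgeVerdict is an assumed shape for the judge's structured output.
type judgeVerdict struct {
	Passed bool   `json:"passed"`
	Reason string `json:"reason"`
}

// validateJudgeResponse rejects any raw model response that is not
// valid JSON matching the expected schema, so misconfiguration is
// caught before evaluations run.
func validateJudgeResponse(raw []byte) error {
	var v judgeVerdict
	if err := json.Unmarshal(raw, &v); err != nil {
		return fmt.Errorf("judge model did not return valid structured JSON: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(validateJudgeResponse([]byte(`{"passed":true,"reason":"ok"}`)))
	fmt.Println(validateJudgeResponse([]byte(`not json`)))
}
```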
type RelevanceResult
type RelevanceResult struct {
	Criterion string `json:"criterion"`
	Passed    bool   `json:"passed"`
	Reason    string `json:"reason"`
}
RelevanceResult contains the result of a single relevance check.
type Result
type Result struct {
	InputPath         string            `json:"input_path"`
	Title             string            `json:"title"`
	Question          string            `json:"question"`
	Response          string            `json:"response"`
	Cost              float64           `json:"cost"`
	OutputTokens      int64             `json:"output_tokens"`
	Size              string            `json:"size"`
	SizeExpected      string            `json:"size_expected"`
	ToolCallsScore    float64           `json:"tool_calls_score"`
	ToolCallsExpected float64           `json:"tool_calls_score_expected"`
	RelevancePassed   float64           `json:"relevance"`
	RelevanceExpected float64           `json:"relevance_expected"`
	RelevanceResults  []RelevanceResult `json:"relevance_results,omitempty"`
	Error             string            `json:"error,omitempty"`
	RawOutput         []map[string]any  `json:"raw_output,omitempty"`
	Session           *session.Session  `json:"-"` // Full session for database storage (not in JSON)
}
Result contains the evaluation results for a single test case.
type RunOutput added in v1.42.0
type RunOutput struct {
	Name      string             `json:"name"`
	Timestamp time.Time          `json:"timestamp"`
	Duration  string             `json:"duration"`
	Config    RunOutputConfig    `json:"config"`
	Summary   Summary            `json:"summary"`
	Sessions  []*session.Session `json:"sessions"`
}
RunOutput is the top-level structure for the evaluation run JSON output.
type RunOutputConfig added in v1.42.0
type RunOutputConfig struct {
	Agent       string `json:"agent"`
	JudgeModel  string `json:"judge_model,omitempty"`
	Concurrency int    `json:"concurrency"`
	EvalsDir    string `json:"evals_dir"`
	BaseImage   string `json:"base_image,omitempty"`
}
RunOutputConfig captures the evaluation run configuration.
type Runner
type Runner struct {
	Config
	// contains filtered or unexported fields
}
Runner runs evaluations against an agent.
type Summary
type Summary struct {
	TotalEvals      int     `json:"total_evals"`
	FailedEvals     int     `json:"failed_evals"`
	TotalCost       float64 `json:"total_cost"`
	SizesPassed     int     `json:"sizes_passed"`
	SizesTotal      int     `json:"sizes_total"`
	ToolsF1Sum      float64 `json:"tools_f1_sum"`
	ToolsCount      int     `json:"tools_count"`
	RelevancePassed float64 `json:"relevance_passed"`
	RelevanceTotal  float64 `json:"relevance_total"`
}
Summary contains aggregate statistics across all evaluations.
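The sum/count pairs above imply derived rates; a sketch of how a consumer might compute them, with zero-denominator guards as an assumption about how empty runs are reported:

```go
package main

import "fmt"

// summary is a simplified stand-in for the package's Summary type.
type summary struct {
	ToolsF1Sum      float64
	ToolsCount      int
	RelevancePassed float64
	RelevanceTotal  float64
}

// toolsF1Avg derives the mean tool-calls F1 score from the running sum.
func toolsF1Avg(s summary) float64 {
	if s.ToolsCount == 0 {
		return 0
	}
	return s.ToolsF1Sum / float64(s.ToolsCount)
}

// relevanceRate derives the fraction of relevance checks that passed.
func relevanceRate(s summary) float64 {
	if s.RelevanceTotal == 0 {
		return 0
	}
	return s.RelevancePassed / s.RelevanceTotal
}

func main() {
	s := summary{ToolsF1Sum: 1.8, ToolsCount: 2, RelevancePassed: 3, RelevanceTotal: 4}
	fmt.Println(toolsF1Avg(s), relevanceRate(s))
	// prints: 0.9 0.75
}
```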