Documentation ¶
Overview ¶
Package evaluation provides an evaluation framework for testing agents.
Index ¶
- func GenerateRunName() string
- func Save(sess *session.Session, filename string) (string, error)
- func SaveRunJSON(run *EvalRun, outputDir string) (string, error)
- func SaveRunSessions(ctx context.Context, run *EvalRun, outputDir string) (string, error)
- func SaveRunSessionsJSON(run *EvalRun, outputDir string) (string, error)
- func SessionFromEvents(events []map[string]any, title, question string) *session.Session
- type Config
- type EvalRun
- type InputSession
- type Judge
- type RelevanceResult
- type Result
- type Runner
- type Summary
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func GenerateRunName ¶ added in v1.19.0
func GenerateRunName() string
GenerateRunName creates a memorable name for an evaluation run.
func SaveRunJSON ¶ added in v1.19.1
func SaveRunJSON(run *EvalRun, outputDir string) (string, error)
SaveRunJSON saves the eval run results to a JSON file. It is kept for backward compatibility and debugging purposes.
func SaveRunSessions ¶ added in v1.20.5
func SaveRunSessions(ctx context.Context, run *EvalRun, outputDir string) (string, error)
SaveRunSessions saves all eval sessions to a SQLite database file. The database follows the same schema as the main session store, so the sessions can be loaded and inspected with standard session tools.
func SaveRunSessionsJSON ¶ added in v1.20.5
func SaveRunSessionsJSON(run *EvalRun, outputDir string) (string, error)
SaveRunSessionsJSON saves all eval sessions to a single JSON file. Each session includes its eval criteria in the "evals" field. It complements SaveRunSessions, which saves to SQLite, by providing a human-readable format for inspection.
func SessionFromEvents ¶ added in v1.20.5
func SessionFromEvents(events []map[string]any, title, question string) *session.Session
SessionFromEvents reconstructs a session from raw container output events. It parses the JSON events emitted by cagent --json and builds a session with the conversation history.
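For illustration, a minimal sketch of rebuilding a session from captured output, assuming the events were saved as newline-delimited JSON and that the import path below matches your module layout (the file name and session title are placeholders):

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"

	"github.com/docker/cagent/pkg/evaluation" // assumed import path
)

func main() {
	// Hypothetical capture of `cagent --json` output, one JSON event per line.
	f, err := os.Open("eval-output.jsonl")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Decode each line into the generic event shape SessionFromEvents expects.
	var events []map[string]any
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var ev map[string]any
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
			continue // skip lines that are not JSON objects
		}
		events = append(events, ev)
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}

	sess := evaluation.SessionFromEvents(events, "weather eval", "What is the weather in Paris?")
	fmt.Printf("rebuilt session from %d events: %v\n", len(events), sess != nil)
}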
Types ¶
type Config ¶ added in v1.19.0
type Config struct {
AgentFilename string // Path to the agent configuration file
EvalsDir string // Directory containing evaluation files
JudgeModel string // Model for relevance checking (format: provider/model, optional)
Concurrency int // Number of concurrent runs (0 = number of CPUs)
TTYFd int // File descriptor for terminal size queries (e.g., int(os.Stdout.Fd()))
Only []string // Only run evaluations matching these patterns
BaseImage string // Custom base Docker image for running evaluations
KeepContainers bool // If true, don't remove containers after evaluation (skip --rm)
EnvVars []string // Environment variables to pass: KEY (value from env) or KEY=VALUE (explicit)
}
Config holds configuration for evaluation runs.
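As a sketch, a fully specified Config might look like the following (inside a function body, with os and the evaluation package imported; the paths, pattern, and judge model are placeholders):

cfg := evaluation.Config{
	AgentFilename:  "agent.yaml",               // agent configuration under test
	EvalsDir:       "./evals",                  // directory containing evaluation files
	JudgeModel:     "openai/gpt-4o",            // optional LLM judge, provider/model format
	Concurrency:    0,                          // 0 = use the number of CPUs
	TTYFd:          int(os.Stdout.Fd()),        // terminal size queries
	Only:           []string{"weather"},        // only run evaluations matching this pattern
	BaseImage:      "",                         // use the default base image
	KeepContainers: false,                      // remove containers after evaluation
	EnvVars:        []string{"OPENAI_API_KEY"}, // pass the value from the environment
}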
type EvalRun ¶ added in v1.19.0
type EvalRun struct {
Name string `json:"name"`
Timestamp time.Time `json:"timestamp"`
Duration time.Duration `json:"duration"`
Results []Result `json:"results"`
Summary Summary `json:"summary"`
}
EvalRun contains the results and metadata for an evaluation run.
func Evaluate ¶
func Evaluate(ctx context.Context, ttyOut, out io.Writer, isTTY bool, runName string, runConfig *config.RuntimeConfig, cfg Config) (*EvalRun, error)
Evaluate runs evaluations with a specified run name. ttyOut is used for progress bar rendering (should be the console/TTY). out is used for results and status messages (can be tee'd to a log file).
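A minimal end-to-end sketch, assuming the import paths below and that a *config.RuntimeConfig is obtained elsewhere (its construction is outside this package, so a zero value stands in here):

package main

import (
	"context"
	"fmt"
	"os"

	"github.com/docker/cagent/pkg/config"     // assumed import path
	"github.com/docker/cagent/pkg/evaluation" // assumed import path
)

func main() {
	// Hypothetical runtime config; replace with however your application builds it.
	var runtimeCfg *config.RuntimeConfig

	cfg := evaluation.Config{
		AgentFilename: "agent.yaml",
		EvalsDir:      "./evals",
		TTYFd:         int(os.Stdout.Fd()),
	}

	// Progress rendering goes to the terminal; results and status go to stdout.
	run, err := evaluation.Evaluate(context.Background(), os.Stderr, os.Stdout, true,
		evaluation.GenerateRunName(), runtimeCfg, cfg)
	if err != nil {
		fmt.Fprintln(os.Stderr, "evaluation failed:", err)
		os.Exit(1)
	}

	// Persist the run in both formats for later inspection.
	if path, err := evaluation.SaveRunJSON(run, "eval-results"); err == nil {
		fmt.Println("wrote", path)
	}
	if path, err := evaluation.SaveRunSessionsJSON(run, "eval-results"); err == nil {
		fmt.Println("wrote", path)
	}
	fmt.Printf("failed %d of %d evals\n", run.Summary.FailedEvals, run.Summary.TotalEvals)
}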
type InputSession ¶ added in v1.20.5
type InputSession struct {
*session.Session
SourcePath string // Path to the source eval file (not serialized)
}
InputSession wraps a session with its source path for evaluation loading.
type Judge ¶ added in v1.20.0
type Judge struct {
// contains filtered or unexported fields
}
Judge runs LLM-as-a-judge relevance checks concurrently.
func NewJudge ¶ added in v1.20.0
NewJudge creates a new Judge that runs relevance checks with the given concurrency. Concurrency defaults to 1 if n < 1.
func (*Judge) CheckRelevance ¶ added in v1.20.0
func (j *Judge) CheckRelevance(ctx context.Context, response string, criteria []string) (passed int, failed []RelevanceResult, errs []string)
CheckRelevance runs all relevance checks concurrently with the configured concurrency. It returns the number of passed checks, a slice of failed results with reasons, and any errors encountered.
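A sketch of consuming the results, assuming a *Judge was already built with NewJudge (its arguments are not shown here) and that context, fmt, and the evaluation package are imported:

// judgeResponse checks a response against its criteria and reports the outcome.
func judgeResponse(ctx context.Context, j *evaluation.Judge, response string, criteria []string) {
	passed, failed, errs := j.CheckRelevance(ctx, response, criteria)

	fmt.Printf("relevance: %d/%d checks passed\n", passed, len(criteria))
	for _, f := range failed {
		// RelevanceResult fields are not listed above, so print the value generically.
		fmt.Printf("failed check: %+v\n", f)
	}
	for _, e := range errs {
		fmt.Println("judge error:", e)
	}
}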
type RelevanceResult ¶ added in v1.20.5
RelevanceResult contains the result of a single relevance check.
type Result ¶
type Result struct {
InputPath string `json:"input_path"`
Title string `json:"title"`
Question string `json:"question"`
Response string `json:"response"`
Cost float64 `json:"cost"`
OutputTokens int64 `json:"output_tokens"`
Size string `json:"size"`
SizeExpected string `json:"size_expected"`
ToolCallsScore float64 `json:"tool_calls_score"`
ToolCallsExpected float64 `json:"tool_calls_score_expected"`
HandoffsMatch bool `json:"handoffs"`
RelevancePassed float64 `json:"relevance"`
RelevanceExpected float64 `json:"relevance_expected"`
FailedRelevance []RelevanceResult `json:"failed_relevance,omitempty"`
Error string `json:"error,omitempty"`
RawOutput []map[string]any `json:"raw_output,omitempty"`
Session *session.Session `json:"-"` // Full session for database storage (not in JSON)
}
Result contains the evaluation results for a single test case.
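As a sketch of post-processing, a caller might flag errored results and relevance shortfalls like this (field access follows the struct above; fmt and the evaluation package are assumed to be imported):

// reportFailures prints the results that errored or missed their relevance target.
func reportFailures(run *evaluation.EvalRun) {
	for _, r := range run.Results {
		switch {
		case r.Error != "":
			fmt.Printf("%s: error: %s\n", r.Title, r.Error)
		case r.RelevancePassed < r.RelevanceExpected:
			fmt.Printf("%s: relevance %.2f < expected %.2f (%d failed checks)\n",
				r.Title, r.RelevancePassed, r.RelevanceExpected, len(r.FailedRelevance))
		}
	}
}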
type Runner ¶ added in v1.19.0
type Runner struct {
Config
// contains filtered or unexported fields
}
Runner runs evaluations against an agent.
type Summary ¶ added in v1.19.0
type Summary struct {
TotalEvals int `json:"total_evals"`
FailedEvals int `json:"failed_evals"`
TotalCost float64 `json:"total_cost"`
SizesPassed int `json:"sizes_passed"`
SizesTotal int `json:"sizes_total"`
ToolsF1Sum float64 `json:"tools_f1_sum"`
ToolsCount int `json:"tools_count"`
HandoffsPassed int `json:"handoffs_passed"`
HandoffsTotal int `json:"handoffs_total"`
RelevancePassed float64 `json:"relevance_passed"`
RelevanceTotal float64 `json:"relevance_total"`
}
Summary contains aggregate statistics across all evaluations.
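The summary appears to carry raw sums and counts rather than ratios, so derived rates are left to the caller; a sketch, guarding against zero denominators:

// summaryRates derives aggregate pass rates from a Summary.
func summaryRates(s evaluation.Summary) (toolsF1, sizeRate, handoffRate, relevanceRate float64) {
	if s.ToolsCount > 0 {
		toolsF1 = s.ToolsF1Sum / float64(s.ToolsCount) // mean tool-call F1 across evals
	}
	if s.SizesTotal > 0 {
		sizeRate = float64(s.SizesPassed) / float64(s.SizesTotal)
	}
	if s.HandoffsTotal > 0 {
		handoffRate = float64(s.HandoffsPassed) / float64(s.HandoffsTotal)
	}
	if s.RelevanceTotal > 0 {
		relevanceRate = s.RelevancePassed / s.RelevanceTotal
	}
	return toolsF1, sizeRate, handoffRate, relevanceRate
}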