eval

package
v1.0.0 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 31, 2026 License: Apache-2.0 Imports: 5 Imported by: 0

Documentation

Overview

Package eval implements the CLASP Evaluation Framework (SDD-005).

It provides structured capability scoring for SOC agents across 6 dimensions, each with 5 maturity levels. It supports automated scoring via LLM-as-judge and trend analysis over stored results.

Functions

func SaveResult

func SaveResult(dir string, result *EvalResult) error

SaveResult saves an eval result to the results directory.

Types

type AgentProfile

type AgentProfile struct {
	AgentID    string                `json:"agent_id"`
	Results    []EvalResult          `json:"results"`
	Averages   map[Dimension]float64 `json:"averages"`
	OverallL   int                   `json:"overall_l"`
	EvalCount  int                   `json:"eval_count"`
	LastEvalAt time.Time             `json:"last_eval_at"`
}

AgentProfile aggregates multiple EvalResults into a capability profile.

func (*AgentProfile) ComputeAverages

func (p *AgentProfile) ComputeAverages()

ComputeAverages calculates per-dimension average scores across all results.
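
To make the averaging concrete, here is a minimal sketch of a per-dimension mean over several results. It uses trimmed local copies of the documented types, and `averageLevels` is a hypothetical name. It assumes the average is taken over `Score.Level` and that a dimension is averaged only over the results that scored it; the package may handle missing dimensions differently.

```go
package main

import "fmt"

// Local stand-ins mirroring the documented types; not imported from the package.
type Dimension string

type Score struct {
	Level int
}

type EvalResult struct {
	Scores map[Dimension]Score
}

// averageLevels returns the mean Level per dimension across all results.
// Assumption: a dimension missing from a result does not count toward
// that dimension's average.
func averageLevels(results []EvalResult) map[Dimension]float64 {
	sums := map[Dimension]float64{}
	counts := map[Dimension]int{}
	for _, r := range results {
		for d, s := range r.Scores {
			sums[d] += float64(s.Level)
			counts[d]++
		}
	}
	avgs := make(map[Dimension]float64, len(sums))
	for d, sum := range sums {
		avgs[d] = sum / float64(counts[d])
	}
	return avgs
}

func main() {
	results := []EvalResult{
		{Scores: map[Dimension]Score{"planning": {Level: 2}, "memory": {Level: 4}}},
		{Scores: map[Dimension]Score{"planning": {Level: 3}}},
	}
	avgs := averageLevels(results)
	fmt.Println(avgs["planning"], avgs["memory"]) // prints "2.5 4"
}
```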

type Dimension

type Dimension string

Dimension represents a capability axis for agent evaluation.

const (
	DimPlanning   Dimension = "planning"
	DimToolUse    Dimension = "tool_use"
	DimMemory     Dimension = "memory"
	DimReasoning  Dimension = "reasoning"
	DimReflection Dimension = "reflection"
	DimPerception Dimension = "perception"
)

func AllDimensions

func AllDimensions() []Dimension

AllDimensions returns the 6 CLASP dimensions.

type EvalResult

type EvalResult struct {
	AgentID    string              `json:"agent_id"`
	Timestamp  time.Time           `json:"timestamp"`
	ScenarioID string              `json:"scenario_id"`
	Scores     map[Dimension]Score `json:"scores"`
	OverallL   int                 `json:"overall_l"` // 1-5 aggregate
	JudgeModel string              `json:"judge_model,omitempty"`
}

EvalResult represents the outcome of evaluating an agent on a scenario.

func (*EvalResult) ComputeOverall

func (r *EvalResult) ComputeOverall() int

ComputeOverall calculates the aggregate maturity level (average, rounded down).
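
The "average, rounded down" rule can be illustrated with integer division. This is a sketch with local stand-in types, not the package's code; `overallLevel` is a hypothetical name, and returning 0 for an empty score map is an assumption.

```go
package main

import "fmt"

// Local stand-ins mirroring the documented types; not imported from the package.
type Dimension string

type Score struct {
	Level      int
	Confidence float64
	Evidence   string
}

// overallLevel averages the per-dimension levels and rounds down.
// Assumption: an empty score map yields 0.
func overallLevel(scores map[Dimension]Score) int {
	if len(scores) == 0 {
		return 0
	}
	sum := 0
	for _, s := range scores {
		sum += s.Level
	}
	// Integer division truncates, which rounds down for non-negative levels.
	return sum / len(scores)
}

func main() {
	scores := map[Dimension]Score{
		"planning":  {Level: 4},
		"tool_use":  {Level: 3},
		"reasoning": {Level: 4},
	}
	fmt.Println(overallLevel(scores)) // 11 / 3 rounds down to 3
}
```

Rounding down means an agent reaches an overall level only when its mean score has fully crossed that level, so a mean of 3.67 still reports as 3.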

type EvalScenario

type EvalScenario struct {
	ID          string      `json:"id"`
	Name        string      `json:"name"`
	Stage       Stage       `json:"stage"`
	Description string      `json:"description"`
	Inputs      []string    `json:"inputs"`
	Expected    string      `json:"expected"`
	Dimensions  []Dimension `json:"dimensions"` // Which dimensions this tests
}

EvalScenario defines a test scenario for agent evaluation.

func LoadScenarios

func LoadScenarios(path string) ([]EvalScenario, error)

LoadScenarios loads eval scenarios from a JSON file.
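
The JSON field names follow from the struct tags on EvalScenario. The sketch below decodes a sample document with those fields; the top-level array layout, the sample values, and the `decodeScenarios` helper are assumptions for illustration, and the types are local copies rather than imports from the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins mirroring the documented types; not imported from the package.
type Dimension string
type Stage string

type EvalScenario struct {
	ID          string      `json:"id"`
	Name        string      `json:"name"`
	Stage       Stage       `json:"stage"`
	Description string      `json:"description"`
	Inputs      []string    `json:"inputs"`
	Expected    string      `json:"expected"`
	Dimensions  []Dimension `json:"dimensions"`
}

// sampleScenarios is hypothetical file content. A top-level JSON array
// is an assumption about the file layout.
const sampleScenarios = `[
  {
    "id": "scn-001",
    "name": "Suspicious login triage",
    "stage": "find",
    "description": "Agent reviews authentication logs for anomalies.",
    "inputs": ["auth.log"],
    "expected": "Flags the anomalous login for follow-up.",
    "dimensions": ["perception", "reasoning"]
  }
]`

// decodeScenarios is a hypothetical helper that parses scenario JSON.
func decodeScenarios(data string) ([]EvalScenario, error) {
	var scenarios []EvalScenario
	if err := json.Unmarshal([]byte(data), &scenarios); err != nil {
		return nil, err
	}
	return scenarios, nil
}

func main() {
	scenarios, err := decodeScenarios(sampleScenarios)
	if err != nil {
		panic(err)
	}
	for _, s := range scenarios {
		fmt.Println(s.ID, s.Stage, s.Dimensions)
	}
}
```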

type Regression

type Regression struct {
	Dimension Dimension `json:"dimension"`
	Previous  float64   `json:"previous"`
	Current   float64   `json:"current"`
	Delta     float64   `json:"delta"`
}

Regression describes a drop in one dimension's score between two profiles.

func DetectRegressions

func DetectRegressions(previous, current *AgentProfile) []Regression

DetectRegressions compares the current profile to a previous one and returns the dimensions where the score dropped.
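
As an illustration of comparing two profiles, the sketch below walks two per-dimension average maps and reports the dimensions whose score fell. It is not the package's implementation: `findRegressions` is a hypothetical name, `Delta` is assumed to be current minus previous (so negative for a drop), and dimensions absent from the current profile are assumed to be skipped.

```go
package main

import (
	"fmt"
	"sort"
)

// Local stand-ins mirroring the documented types; not imported from the package.
type Dimension string

type Regression struct {
	Dimension Dimension
	Previous  float64
	Current   float64
	Delta     float64
}

// findRegressions returns one entry per dimension whose average dropped.
// Assumptions: Delta = current - previous, and a dimension missing from
// the current averages is ignored rather than treated as a drop.
func findRegressions(previous, current map[Dimension]float64) []Regression {
	var out []Regression
	for d, prev := range previous {
		cur, ok := current[d]
		if !ok || cur >= prev {
			continue
		}
		out = append(out, Regression{Dimension: d, Previous: prev, Current: cur, Delta: cur - prev})
	}
	// Map iteration order is random, so sort for stable output.
	sort.Slice(out, func(i, j int) bool { return out[i].Dimension < out[j].Dimension })
	return out
}

func main() {
	prev := map[Dimension]float64{"planning": 3.5, "memory": 4.0, "tool_use": 2.0}
	cur := map[Dimension]float64{"planning": 3.0, "memory": 4.0, "tool_use": 2.5}
	for _, r := range findRegressions(prev, cur) {
		fmt.Printf("%s: %.1f -> %.1f (%+.1f)\n", r.Dimension, r.Previous, r.Current, r.Delta)
	}
	// prints "planning: 3.5 -> 3.0 (-0.5)"
}
```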

type Score

type Score struct {
	Level      int     `json:"level"`      // 1-5 maturity
	Confidence float64 `json:"confidence"` // 0.0-1.0
	Evidence   string  `json:"evidence"`   // Justification
}

Score represents a capability score for one dimension.

type Stage

type Stage string

Stage represents the security lifecycle stage of an eval scenario.

const (
	StageFind      Stage = "find"
	StageConfirm   Stage = "confirm"
	StageRootCause Stage = "root_cause"
	StageValidate  Stage = "validate"
)
