eval

package

Version: v1.0.0 (latest) | Published: Mar 31, 2026 | License: Apache-2.0 | Imports: 5 | Imported by: 0

Repository

github.com/syntrex-lab/gomcp

Documentation ¶

Overview ¶

Package eval implements the CLASP Evaluation Framework (SDD-005).

It provides structured capability scoring for SOC agents across six dimensions (planning, tool use, memory, reasoning, reflection, and perception), each rated on five maturity levels (1-5). It supports automated scoring via an LLM-as-judge and trend analysis over stored results.

Index ¶

func SaveResult(dir string, result *EvalResult) error
type AgentProfile
- func (p *AgentProfile) ComputeAverages()
type Dimension
- func AllDimensions() []Dimension
type EvalResult
- func (r *EvalResult) ComputeOverall() int
type EvalScenario
- func LoadScenarios(path string) ([]EvalScenario, error)
type Regression
- func DetectRegressions(previous, current *AgentProfile) []Regression
type Score
type Stage

Functions ¶

func SaveResult ¶

func SaveResult(dir string, result *EvalResult) error

SaveResult saves an eval result to the results directory.

Types ¶

type AgentProfile ¶

type AgentProfile struct {
	AgentID    string                `json:"agent_id"`
	Results    []EvalResult          `json:"results"`
	Averages   map[Dimension]float64 `json:"averages"`
	OverallL   int                   `json:"overall_l"`
	EvalCount  int                   `json:"eval_count"`
	LastEvalAt time.Time             `json:"last_eval_at"`
}

AgentProfile aggregates multiple EvalResults into a capability profile.

func (*AgentProfile) ComputeAverages ¶

func (p *AgentProfile) ComputeAverages()

ComputeAverages calculates per-dimension average scores across all results.
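As an illustration of per-dimension averaging, here is a minimal sketch under stated assumptions: it averages `Score.Level` only, and a dimension that is missing from a result is skipped rather than counted as zero. The helper `averageLevels` is hypothetical, and the types are trimmed local copies of the documented ones; the actual method may also update other `AgentProfile` fields.

```go
package main

import "fmt"

// Trimmed local mirrors of the documented types so the sketch compiles alone.
type Dimension string

type Score struct {
	Level int
}

type EvalResult struct {
	Scores map[Dimension]Score
}

// averageLevels returns the mean Score.Level per dimension across results.
// Assumption: a dimension absent from a result is skipped, not counted as 0.
func averageLevels(results []EvalResult) map[Dimension]float64 {
	sums := make(map[Dimension]int)
	counts := make(map[Dimension]int)
	for _, r := range results {
		for dim, s := range r.Scores {
			sums[dim] += s.Level
			counts[dim]++
		}
	}
	avgs := make(map[Dimension]float64, len(sums))
	for dim, sum := range sums {
		avgs[dim] = float64(sum) / float64(counts[dim])
	}
	return avgs
}

func main() {
	results := []EvalResult{
		{Scores: map[Dimension]Score{"planning": {Level: 3}, "tool_use": {Level: 4}}},
		{Scores: map[Dimension]Score{"planning": {Level: 4}}},
	}
	avgs := averageLevels(results)
	fmt.Println(avgs["planning"], avgs["tool_use"]) // 3.5 4
}
```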

type Dimension ¶

type Dimension string

Dimension represents a capability axis for agent evaluation.

const (
	DimPlanning   Dimension = "planning"
	DimToolUse    Dimension = "tool_use"
	DimMemory     Dimension = "memory"
	DimReasoning  Dimension = "reasoning"
	DimReflection Dimension = "reflection"
	DimPerception Dimension = "perception"
)

func AllDimensions ¶

func AllDimensions() []Dimension

AllDimensions returns the 6 CLASP dimensions.

type EvalResult ¶

type EvalResult struct {
	AgentID    string              `json:"agent_id"`
	Timestamp  time.Time           `json:"timestamp"`
	ScenarioID string              `json:"scenario_id"`
	Scores     map[Dimension]Score `json:"scores"`
	OverallL   int                 `json:"overall_l"` // 1-5 aggregate
	JudgeModel string              `json:"judge_model,omitempty"`
}

EvalResult represents the outcome of evaluating an agent on a scenario.

func (*EvalResult) ComputeOverall ¶

func (r *EvalResult) ComputeOverall() int

ComputeOverall calculates the aggregate maturity level (average, rounded down).
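The "average, rounded down" rule can be sketched with integer division, which floors for non-negative values. This sketch assumes every score present in the map has equal weight and that an empty map yields 0; `overallLevel` is a hypothetical helper, not the package's method.

```go
package main

import "fmt"

// Trimmed local mirrors of the documented types so the sketch compiles alone.
type Dimension string

type Score struct {
	Level int
}

// overallLevel averages the per-dimension levels and rounds down.
// Assumptions: all present scores weigh equally; no scores yields 0.
func overallLevel(scores map[Dimension]Score) int {
	if len(scores) == 0 {
		return 0
	}
	sum := 0
	for _, s := range scores {
		sum += s.Level
	}
	// Integer division truncates, which equals floor for non-negative sums.
	return sum / len(scores)
}

func main() {
	scores := map[Dimension]Score{
		"planning": {Level: 3},
		"tool_use": {Level: 4},
		"memory":   {Level: 4},
	}
	fmt.Println(overallLevel(scores)) // 11 / 3 rounds down to 3
}
```

Rounding down is the conservative choice: an agent reaches an overall level only once its average fully meets it.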

type EvalScenario ¶

type EvalScenario struct {
	ID          string      `json:"id"`
	Name        string      `json:"name"`
	Stage       Stage       `json:"stage"`
	Description string      `json:"description"`
	Inputs      []string    `json:"inputs"`
	Expected    string      `json:"expected"`
	Dimensions  []Dimension `json:"dimensions"` // Which dimensions this tests
}

EvalScenario defines a test scenario for agent evaluation.

func LoadScenarios ¶

func LoadScenarios(path string) ([]EvalScenario, error)

LoadScenarios loads eval scenarios from a JSON file.

type Regression ¶

type Regression struct {
	Dimension Dimension `json:"dimension"`
	Previous  float64   `json:"previous"`
	Current   float64   `json:"current"`
	Delta     float64   `json:"delta"`
}

Regression describes a score drop in one dimension between two profiles.

func DetectRegressions ¶

func DetectRegressions(previous, current *AgentProfile) []Regression

DetectRegressions compares the current profile to a previous one and returns the dimensions where the score dropped.
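A comparison of two profiles might look like the sketch below. It rests on several assumptions that the documentation does not confirm: the comparison uses the `Averages` maps, `Delta` is computed as current minus previous (so it is negative for a regression), dimensions missing from the current profile are skipped, and results are sorted by dimension name for stable output. `detectRegressionsSketch` is a hypothetical name, and the types are trimmed local copies.

```go
package main

import (
	"fmt"
	"sort"
)

// Trimmed local mirrors of the documented types so the sketch compiles alone.
type Dimension string

type AgentProfile struct {
	Averages map[Dimension]float64
}

type Regression struct {
	Dimension Dimension
	Previous  float64
	Current   float64
	Delta     float64
}

// detectRegressionsSketch reports dimensions whose average dropped.
// Assumptions: Delta = current - previous; dimensions missing from the
// current profile are skipped; output is sorted by dimension name.
func detectRegressionsSketch(previous, current *AgentProfile) []Regression {
	if previous == nil || current == nil {
		return nil
	}
	var out []Regression
	for dim, prev := range previous.Averages {
		cur, ok := current.Averages[dim]
		if !ok || cur >= prev {
			continue
		}
		out = append(out, Regression{
			Dimension: dim,
			Previous:  prev,
			Current:   cur,
			Delta:     cur - prev,
		})
	}
	// Map iteration order is random, so sort for deterministic results.
	sort.Slice(out, func(i, j int) bool { return out[i].Dimension < out[j].Dimension })
	return out
}

func main() {
	prev := &AgentProfile{Averages: map[Dimension]float64{"planning": 4.0, "reasoning": 3.0}}
	cur := &AgentProfile{Averages: map[Dimension]float64{"planning": 3.5, "reasoning": 3.5}}
	for _, r := range detectRegressionsSketch(prev, cur) {
		fmt.Printf("%s: %.2f -> %.2f (delta %.2f)\n", r.Dimension, r.Previous, r.Current, r.Delta)
	}
	// prints "planning: 4.00 -> 3.50 (delta -0.50)"
}
```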

type Score ¶

type Score struct {
	Level      int     `json:"level"`      // 1-5 maturity
	Confidence float64 `json:"confidence"` // 0.0-1.0
	Evidence   string  `json:"evidence"`   // Justification
}

Score represents a capability score for one dimension.

type Stage ¶

type Stage string

Stage represents the security lifecycle stage of an eval scenario.

const (
	StageFind      Stage = "find"
	StageConfirm   Stage = "confirm"
	StageRootCause Stage = "root_cause"
	StageValidate  Stage = "validate"
)

Source Files ¶

eval.go
