evaluations

package module
v0.4.1
Published: Oct 19, 2025 License: Apache-2.0 Imports: 18 Imported by: 0

README

mcp-evals

A Go library and CLI for evaluating Model Context Protocol (MCP) servers using Claude. This tool connects to an MCP server, runs an agentic evaluation loop where Claude uses the server's tools to answer questions, and grades the responses across five dimensions: accuracy, completeness, relevance, clarity, and reasoning.

Use Cases

As a library: Programmatically evaluate MCP servers in Go code, integrate evaluation results into CI/CD pipelines, or build custom evaluation workflows.

As a CLI: Run evaluations from YAML/JSON configuration files with immediate pass/fail feedback, detailed scoring breakdowns, and optional trace output for debugging.
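
A minimal library-usage sketch (assuming the root package imports as evaluations; see the Documentation section below for the exact types and signatures):

package main

import (
	"context"
	"fmt"
	"log"
	"os"

	evaluations "github.com/wolfeidau/mcp-evals"
)

func main() {
	// Configure the client against the same filesystem server used in the Quick Start.
	client := evaluations.NewEvalClient(evaluations.EvalClientConfig{
		APIKey:  os.Getenv("ANTHROPIC_API_KEY"),
		Command: "npx",
		Args:    []string{"-y", "@modelcontextprotocol/server-filesystem", "/tmp"},
		Model:   "claude-3-5-sonnet-20241022",
	})

	results, err := client.RunEvals(context.Background(), []evaluations.Eval{
		{
			Name:           "list-files",
			Prompt:         "List files in the current directory",
			ExpectedResult: "Should enumerate files with details",
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, r := range results {
		if r.Error != nil {
			// Individual eval failures are captured per result and don't stop the batch.
			fmt.Printf("%s: error: %v\n", r.Eval.Name, r.Error)
			continue
		}
		if r.Grade != nil {
			fmt.Printf("%s: accuracy=%d completeness=%d\n", r.Eval.Name, r.Grade.Accuracy, r.Grade.Completeness)
		}
	}
}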

Installation

Using the install script:

curl -sSfL https://raw.githubusercontent.com/wolfeidau/mcp-evals/main/install.sh | sh

Using Go:

go install github.com/wolfeidau/mcp-evals/cmd/mcp-evals@latest

Quick Start

Create an evaluation config file (e.g., evals.yaml):

model: claude-3-5-sonnet-20241022
mcp_server:
  command: npx
  args: ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]

evals:
  - name: list-files
    description: Test filesystem listing
    prompt: "List files in the current directory"
    expected_result: "Should enumerate files with details"

Run evaluations:

export ANTHROPIC_API_KEY=your-api-key
mcp-evals run --config evals.yaml

CLI Commands

  • run - Execute evaluations (default command)
  • validate - Validate config file against JSON schema
  • schema - Generate JSON schema for configuration
  • help - Show help information

See mcp-evals <command> --help for detailed usage.

Advanced Features

Environment Variable Interpolation

Configuration files support environment variable interpolation using shell syntax. This enables matrix testing across different MCP server versions without duplicating configuration.

Supported syntax:

  • ${VAR} - Expand environment variable
  • $VAR - Short form expansion
  • ${VAR:-default} - Use default value if unset
  • ${VAR:+value} - Use value if VAR is set

Example configuration (matrix.yaml):

model: claude-3-5-sonnet-20241022
mcp_server:
  command: ${MCP_SERVER_PATH}
  args:
    - --port=${SERVER_PORT:-8080}
  env:
    - VERSION=${SERVER_VERSION}

evals:
  - name: version_check
    prompt: "What version are you running?"
    expected_result: "Should report ${SERVER_VERSION}"

Matrix testing across versions:

# Test v1.0.0
export MCP_SERVER_PATH=/releases/v1.0.0/mcp-server
export SERVER_VERSION=1.0.0
mcp-evals run --config matrix.yaml --trace-dir traces/v1.0.0

# Test v2.0.0
export MCP_SERVER_PATH=/releases/v2.0.0/mcp-server
export SERVER_VERSION=2.0.0
mcp-evals run --config matrix.yaml --trace-dir traces/v2.0.0

CI/CD matrix example (Buildkite):

steps:
  - label: ":test_tube: MCP Evals - {{matrix.version}}"
    command: |
      export MCP_SERVER_PATH=/releases/{{matrix.version}}/mcp-server
      export SERVER_VERSION={{matrix.version}}
      mcp-evals run --config matrix.yaml --trace-dir traces/{{matrix.version}}
    matrix:
      setup:
        version: ["1.0.0", "1.1.0", "2.0.0"]
    artifact_paths:
      - "traces/{{matrix.version}}/*.json"

Eval Filtering

Run a subset of evals using the --filter flag with a regex pattern:

# Run single eval
mcp-evals run --config evals.yaml --filter "^basic_addition$"

# Run all auth-related evals
mcp-evals run --config evals.yaml --filter "auth"

# Run multiple specific evals
mcp-evals run --config evals.yaml --filter "add|echo|get_user"

# Run all troubleshooting evals
mcp-evals run --config evals.yaml --filter "troubleshoot_.*"

This is useful for:

  • Fast iteration during development
  • Running specific test suites in CI/CD
  • Debugging individual evals without running the full suite

MCP Server Command-Line Overrides

Override MCP server configuration from the command line for quick testing:

# Override server command
mcp-evals run --config evals.yaml --mcp-command /path/to/dev/server

# Override with arguments
mcp-evals run --config evals.yaml \
  --mcp-command /path/to/server \
  --mcp-args="--port=9000" \
  --mcp-args="--verbose"

# Override environment variables
mcp-evals run --config evals.yaml \
  --mcp-env="API_TOKEN=xyz123" \
  --mcp-env="DEBUG=true"

# Combine with filtering for targeted testing
mcp-evals run --config evals.yaml \
  --mcp-command /path/to/dev/server \
  --filter "^new_feature_.*"

This is useful for:

  • Local development without modifying config files
  • Ad-hoc testing of experimental builds
  • Debugging with different server flags

Configuration

Evaluation configs support both YAML and JSON formats:

  • model - Anthropic model ID (required)
  • grading_model - Optional separate model for grading
  • timeout - Per-evaluation timeout (e.g., "2m", "30s")
  • max_steps - Maximum agentic loop iterations (default: 10)
  • max_tokens - Maximum tokens per LLM request (default: 4096)
  • mcp_server - Server command, args, and environment
  • evals - List of test cases with name, prompt, and expected result

Custom Grading Rubrics

Custom grading rubrics allow you to define specific, measurable criteria for each evaluation dimension. This makes grading more consistent and meaningful by providing concrete guidance to the grading LLM.

Why Use Rubrics?

Without rubrics, the grading LLM uses generic 1-5 scoring criteria. This can lead to:

  • Inconsistent scoring: Same response quality gets different grades
  • Lack of specificity: Generic criteria don't capture domain-specific requirements
  • Difficult iteration: Can't specify what matters most for your use case

Rubrics solve this by defining exactly what "accurate" or "complete" means for each evaluation.

Basic Example

evals:
  - name: troubleshoot_build
    prompt: "Troubleshoot the failed build at https://example.com/builds/123"
    expected_result: "Should identify root cause and provide remediation"

    grading_rubric:
      # Optional: Focus on specific dimensions (defaults to all 5)
      dimensions: ["accuracy", "completeness", "reasoning"]

      accuracy:
        description: "Correctness of root cause identification"
        must_have:
          - "Identifies actual failing job(s) by name or ID"
          - "Extracts real error messages from logs"
        penalties:
          - "Misidentifies root cause"
          - "Fabricates error messages not in logs"

      completeness:
        description: "Thoroughness of investigation"
        must_have:
          - "Examines job logs"
          - "Provides specific remediation steps"
        nice_to_have:
          - "Suggests preventive measures"

      # Optional: Minimum acceptable scores for pass/fail
      minimum_scores:
        accuracy: 4
        completeness: 3

Rubric Structure

Each dimension can specify:

  • description: What this dimension means for this specific eval
  • must_have: Required elements for high scores (4-5)
  • nice_to_have: Optional elements that improve scores
  • penalties: Elements that reduce scores (errors, omissions)

Available dimensions: accuracy, completeness, relevance, clarity, reasoning
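
When using the library directly, the same rubric can be expressed as a Go struct literal. A fragment sketching the basic example above (assuming the root package imports as evaluations):

rubric := &evaluations.GradingRubric{
	// Optional: focus grading on specific dimensions (defaults to all 5 when omitted)
	Dimensions: []string{"accuracy", "completeness", "reasoning"},
	Accuracy: &evaluations.DimensionCriteria{
		Description: "Correctness of root cause identification",
		MustHave: []string{
			"Identifies actual failing job(s) by name or ID",
			"Extracts real error messages from logs",
		},
		Penalties: []string{"Misidentifies root cause", "Fabricates error messages not in logs"},
	},
	Completeness: &evaluations.DimensionCriteria{
		Description: "Thoroughness of investigation",
		MustHave:    []string{"Examines job logs", "Provides specific remediation steps"},
		NiceToHave:  []string{"Suggests preventive measures"},
	},
	// Optional: minimum acceptable scores for pass/fail
	MinimumScores: map[string]int{"accuracy": 4, "completeness": 3},
}

eval := evaluations.Eval{
	Name:           "troubleshoot_build",
	Prompt:         "Troubleshoot the failed build at https://example.com/builds/123",
	ExpectedResult: "Should identify root cause and provide remediation",
	GradingRubric:  rubric,
}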

LLM-Assisted Rubric Creation

Manually writing rubrics is time-consuming. Use an LLM to draft initial rubrics:

# Generate rubric from eval description
claude "Create a grading rubric for this eval: [paste your eval config]"

# Refine rubric from actual results
mcp-evals run --config evals.yaml --trace-dir traces
claude "Refine this rubric based on these results: $(cat traces/my_eval.json | jq '.grade')"

Best practices:

  1. Start generic, refine iteratively
  2. Use actual tool outputs and responses in prompts
  3. Focus on measurable criteria (not vague requirements)
  4. Run eval 3-5 times to validate consistency

See specs/grading_rubric.md for detailed guidance on creating rubrics.

How It Works

  1. Connects to the specified MCP server via command/transport
  2. Retrieves available tools from the MCP server
  3. Runs an agentic loop (up to max_steps iterations, 10 by default) where Claude:
    • Receives the evaluation prompt and available MCP tools
    • Calls tools via the MCP protocol as needed
    • Accumulates tool results and continues reasoning
  4. Evaluates the final response using a separate LLM call that scores five dimensions on a 1-5 scale
  5. Returns structured results with pass/fail status (passing threshold: average score ≥ 3.0)
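
A sketch of the pass/fail check described in step 5, using the GradeResult fields documented below (the CLI's exact implementation may differ):

// passes reports whether the average of the five dimension scores meets the 3.0 threshold.
func passes(g *evaluations.GradeResult) bool {
	sum := g.Accuracy + g.Completeness + g.Relevance + g.Clarity + g.Reasoning
	return float64(sum)/5.0 >= 3.0
}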

License

Apache License, Version 2.0 - Copyright Mark Wolfe

Documentation

Index

Constants

const (
	AgentSystemPrompt = "" /* 164-byte string literal not displayed */

	EvalSystemPrompt = `` /* 1251-byte string literal not displayed */

)

Variables

This section is empty.

Functions

func SchemaForEvalConfig

func SchemaForEvalConfig() (string, error)
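
A usage fragment (SchemaForEvalConfig presumably backs the schema CLI command, returning the configuration's JSON schema as a string):

schema, err := evaluations.SchemaForEvalConfig()
if err != nil {
	log.Fatal(err)
}
fmt.Println(schema)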

Types

type AgenticStep

type AgenticStep struct {
	StepNumber               int           `json:"step_number"`                 // 1-indexed step number
	StartTime                time.Time     `json:"start_time"`                  // When this step started
	EndTime                  time.Time     `json:"end_time"`                    // When this step completed
	Duration                 time.Duration `json:"duration"`                    // Step execution duration
	ModelResponse            string        `json:"model_response"`              // Text content from assistant
	StopReason               string        `json:"stop_reason"`                 // end_turn, tool_use, max_tokens, etc.
	ToolCalls                []ToolCall    `json:"tool_calls"`                  // Tools executed in this step
	InputTokens              int           `json:"input_tokens"`                // Input tokens for this step
	OutputTokens             int           `json:"output_tokens"`               // Output tokens for this step
	CacheCreationInputTokens int           `json:"cache_creation_input_tokens"` // Tokens used to create cache
	CacheReadInputTokens     int           `json:"cache_read_input_tokens"`     // Tokens read from cache
	Error                    string        `json:"error,omitempty"`             // Error message if step failed
}

AgenticStep records a single iteration of the agentic loop

type DimensionCriteria

type DimensionCriteria struct {
	Description string   `yaml:"description,omitempty" json:"description,omitempty" jsonschema:"What this dimension means for this specific eval"`
	MustHave    []string `yaml:"must_have,omitempty" json:"must_have,omitempty" jsonschema:"Required elements for high scores (4-5)"`
	NiceToHave  []string `yaml:"nice_to_have,omitempty" json:"nice_to_have,omitempty" jsonschema:"Optional elements that improve scores"`
	Penalties   []string `` /* 128-byte string literal not displayed */
}

DimensionCriteria provides specific guidance for grading a dimension

type Eval

type Eval struct {
	Name              string         `yaml:"name" json:"name" jsonschema:"Unique identifier for this evaluation"`
	Description       string         `yaml:"description,omitempty" json:"description,omitempty" jsonschema:"Human-readable description of what this eval tests"`
	Prompt            string         `yaml:"prompt" json:"prompt" jsonschema:"The input prompt to send to the LLM"`
	ExpectedResult    string         `` /* 151-byte string literal not displayed */
	AgentSystemPrompt string         `` /* 157-byte string literal not displayed */
	GradingRubric     *GradingRubric `` /* 129-byte string literal not displayed */
}

Eval represents a single evaluation test case

type EvalClient

type EvalClient struct {
	// contains filtered or unexported fields
}

func NewEvalClient

func NewEvalClient(config EvalClientConfig) *EvalClient

func (*EvalClient) RunEval

func (ec *EvalClient) RunEval(ctx context.Context, eval Eval) (*EvalRunResult, error)

func (*EvalClient) RunEvals

func (ec *EvalClient) RunEvals(ctx context.Context, evals []Eval) ([]EvalRunResult, error)

RunEvals executes multiple evaluations and returns all results. Each eval reuses the same MCP session for efficiency. Individual eval failures are captured in EvalRunResult.Error and don't stop the batch.

type EvalClientConfig

type EvalClientConfig struct {
	APIKey               string
	BaseURL              string // Optional: if set, override the default Anthropic API endpoint
	Command              string
	Args                 []string
	Env                  []string
	Model                string
	GradingModel         string // Optional: if set, use this model for grading instead of Model
	AgentSystemPrompt    string // Optional: custom system prompt for the agent being evaluated
	MaxSteps             int
	MaxTokens            int
	EnablePromptCaching  *bool             // Optional: enable Anthropic prompt caching for tool definitions and system prompts. Default: true
	CacheTTL             string            // Optional: cache time-to-live, either "5m" (default) or "1h". Requires EnablePromptCaching=true
	EnforceMinimumScores *bool             // Optional: enforce minimum scores from grading rubrics. Default: true
	StderrCallback       func(line string) // Optional: called for each line written to stderr by the MCP server subprocess
}

func (*EvalClientConfig) ApplyDefaults

func (c *EvalClientConfig) ApplyDefaults() *EvalClientConfig

ApplyDefaults sets default values for optional configuration fields. This method modifies the config in-place and returns a pointer to it for method chaining.
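
A chaining fragment (NewEvalClient may already apply defaults internally; this just shows the pattern, with a hypothetical server path):

cfg := (&evaluations.EvalClientConfig{
	APIKey:  os.Getenv("ANTHROPIC_API_KEY"),
	Command: "/path/to/mcp-server", // hypothetical server binary
	Model:   "claude-3-5-sonnet-20241022",
}).ApplyDefaults() // fills optional fields such as MaxSteps and MaxTokens with their documented defaults

client := evaluations.NewEvalClient(*cfg)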

type EvalConfig

type EvalConfig struct {
	Model                string          `yaml:"model" json:"model" jsonschema:"Anthropic model ID to use for evaluations"`
	GradingModel         string          `` /* 140-byte string literal not displayed */
	AgentSystemPrompt    string          `` /* 167-byte string literal not displayed */
	Timeout              string          `yaml:"timeout,omitempty" json:"timeout,omitempty" jsonschema:"Timeout duration for each evaluation (e.g., '2m', '30s')"`
	MaxSteps             MaxSteps        `yaml:"max_steps,omitempty" json:"max_steps,omitempty" jsonschema:"Maximum number of agentic loop iterations"`
	MaxTokens            MaxTokens       `yaml:"max_tokens,omitempty" json:"max_tokens,omitempty" jsonschema:"Maximum tokens per LLM request"`
	EnablePromptCaching  *bool           `` /* 198-byte string literal not displayed */
	CacheTTL             string          `` /* 162-byte string literal not displayed */
	EnforceMinimumScores *bool           `` /* 180-byte string literal not displayed */
	MCPServer            MCPServerConfig `yaml:"mcp_server" json:"mcp_server" jsonschema:"Configuration for the MCP server to evaluate"`
	Evals                []Eval          `yaml:"evals" json:"evals" jsonschema:"List of evaluation test cases to run"`
}

EvalConfig represents the top-level configuration for running evaluations

func LoadConfig

func LoadConfig(filePath string) (*EvalConfig, error)

LoadConfig loads an evaluation configuration from a YAML or JSON file. The file format is detected by the file extension (.yaml, .yml, or .json). Environment variables in the config file are expanded using ${VAR} or $VAR syntax. Supports shell-style default values: ${VAR:-default}
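
A usage fragment (evals.yaml as in the Quick Start; import alias as in the earlier sketch):

cfg, err := evaluations.LoadConfig("evals.yaml")
if err != nil {
	log.Fatal(err)
}
fmt.Printf("model=%s evals=%d\n", cfg.Model, len(cfg.Evals))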

type EvalResult

type EvalResult struct {
	Prompt      string
	RawResponse string
}

type EvalRunResult

type EvalRunResult struct {
	Eval   Eval
	Result *EvalResult
	Grade  *GradeResult
	Error  error
	Trace  *EvalTrace // Complete execution trace for debugging and analysis
}

EvalRunResult combines the eval configuration with its execution results

type EvalTrace

type EvalTrace struct {
	Steps                    []AgenticStep `json:"steps"`                       // Each step in the agentic loop
	Grading                  *GradingTrace `json:"grading,omitempty"`           // Grading interaction details
	TotalDuration            time.Duration `json:"total_duration"`              // Total execution time
	TotalInputTokens         int           `json:"total_input_tokens"`          // Sum of input tokens across all steps
	TotalOutputTokens        int           `json:"total_output_tokens"`         // Sum of output tokens across all steps
	StepCount                int           `json:"step_count"`                  // Number of agentic steps executed
	ToolCallCount            int           `json:"tool_call_count"`             // Total number of tool calls made
	TotalCacheCreationTokens int           `json:"total_cache_creation_tokens"` // Sum of cache creation tokens across all steps
	TotalCacheReadTokens     int           `json:"total_cache_read_tokens"`     // Sum of cache read tokens across all steps
}

EvalTrace captures complete execution history of an evaluation run
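
A fragment that summarizes the trace attached to an EvalRunResult r (variable name is illustrative; field names as documented above):

if r.Trace != nil {
	fmt.Printf("steps=%d tool_calls=%d tokens in/out=%d/%d duration=%s\n",
		r.Trace.StepCount, r.Trace.ToolCallCount,
		r.Trace.TotalInputTokens, r.Trace.TotalOutputTokens, r.Trace.TotalDuration)
}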

type GradeResult

type GradeResult struct {
	Accuracy       int    `json:"accuracy"`
	Completeness   int    `json:"completeness"`
	Relevance      int    `json:"relevance"`
	Clarity        int    `json:"clarity"`
	Reasoning      int    `json:"reasoning"`
	OverallComment string `json:"overall_comments"`
}

type GradingRubric

type GradingRubric struct {
	// Optional: Override which dimensions to grade (defaults to all 5 standard dimensions)
	Dimensions []string `` /* 149-byte string literal not displayed */

	// Criteria for each dimension - what to look for when grading
	Accuracy     *DimensionCriteria `yaml:"accuracy,omitempty" json:"accuracy,omitempty" jsonschema:"Specific criteria for accuracy scoring"`
	Completeness *DimensionCriteria `yaml:"completeness,omitempty" json:"completeness,omitempty" jsonschema:"Specific criteria for completeness scoring"`
	Relevance    *DimensionCriteria `yaml:"relevance,omitempty" json:"relevance,omitempty" jsonschema:"Specific criteria for relevance scoring"`
	Clarity      *DimensionCriteria `yaml:"clarity,omitempty" json:"clarity,omitempty" jsonschema:"Specific criteria for clarity scoring"`
	Reasoning    *DimensionCriteria `yaml:"reasoning,omitempty" json:"reasoning,omitempty" jsonschema:"Specific criteria for reasoning scoring"`

	// Optional: Minimum acceptable scores for pass/fail
	MinimumScores map[string]int `` /* 126-byte string literal not displayed */
}

GradingRubric defines specific evaluation criteria for grading

func (*GradingRubric) CheckMinimumScores

func (r *GradingRubric) CheckMinimumScores(grade *GradeResult) error

CheckMinimumScores verifies that graded scores meet minimum thresholds
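
A usage fragment, assuming eval and grade come from an EvalRunResult:

if eval.GradingRubric != nil {
	if err := eval.GradingRubric.CheckMinimumScores(grade); err != nil {
		fmt.Printf("%s failed minimum-score check: %v\n", eval.Name, err)
	}
}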

func (*GradingRubric) Validate

func (r *GradingRubric) Validate() error

Validate checks that the rubric is well-formed
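
A fragment, with rubric being a *GradingRubric as constructed in the README example:

if err := rubric.Validate(); err != nil {
	log.Fatalf("invalid rubric: %v", err)
}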

type GradingTrace

type GradingTrace struct {
	UserPrompt               string        `json:"user_prompt"`                 // Original eval prompt
	ModelResponse            string        `json:"model_response"`              // Model's answer being graded
	ExpectedResult           string        `json:"expected_result"`             // Expected result description
	GradingPrompt            string        `json:"grading_prompt"`              // Full prompt sent to grader
	RawGradingOutput         string        `json:"raw_grading_output"`          // Complete LLM response before parsing
	StartTime                time.Time     `json:"start_time"`                  // When grading started
	EndTime                  time.Time     `json:"end_time"`                    // When grading completed
	Duration                 time.Duration `json:"duration"`                    // Grading duration
	InputTokens              int           `json:"input_tokens"`                // Input tokens for grading
	OutputTokens             int           `json:"output_tokens"`               // Output tokens for grading
	CacheCreationInputTokens int           `json:"cache_creation_input_tokens"` // Tokens used to create cache
	CacheReadInputTokens     int           `json:"cache_read_input_tokens"`     // Tokens read from cache
	Error                    string        `json:"error,omitempty"`             // Error message if grading failed
}

GradingTrace records the grading interaction with the LLM

type MCPServerConfig

type MCPServerConfig struct {
	Command string   `yaml:"command" json:"command" jsonschema:"Command to start the MCP server"`
	Args    []string `yaml:"args,omitempty" json:"args,omitempty" jsonschema:"Arguments to pass to the command"`
	Env     []string `yaml:"env,omitempty" json:"env,omitempty" jsonschema:"Environment variables to set for the MCP server"`
}

MCPServerConfig defines how to start the MCP server

type MaxSteps

type MaxSteps int

type MaxTokens

type MaxTokens int

type ToolCall

type ToolCall struct {
	ToolID    string          `json:"tool_id"`         // Unique ID from content block
	ToolName  string          `json:"tool_name"`       // MCP tool name
	StartTime time.Time       `json:"start_time"`      // When tool execution started
	EndTime   time.Time       `json:"end_time"`        // When tool execution completed
	Duration  time.Duration   `json:"duration"`        // Tool execution duration
	Input     json.RawMessage `json:"input"`           // Tool arguments as JSON
	Output    json.RawMessage `json:"output"`          // Tool result as JSON
	Success   bool            `json:"success"`         // Whether tool executed successfully
	Error     string          `json:"error,omitempty"` // Error message if tool failed
}

ToolCall captures details of a single tool invocation

type ValidationError

type ValidationError struct {
	Path    string // JSON path to the error (e.g., "mcp_server.command")
	Message string // Human-readable error message
}

ValidationError represents a single validation error with location information

type ValidationResult

type ValidationResult struct {
	Valid  bool
	Errors []ValidationError
}

ValidationResult contains the results of validating a config file

func ValidateConfigFile

func ValidateConfigFile(filePath string) (*ValidationResult, error)

ValidateConfigFile validates a configuration file against the JSON schema. It reads the file, converts YAML to JSON if needed, and validates against the schema.
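
A usage fragment (same import alias assumption as the earlier sketches):

result, err := evaluations.ValidateConfigFile("evals.yaml")
if err != nil {
	log.Fatal(err) // file could not be read
}
if !result.Valid {
	for _, e := range result.Errors {
		fmt.Printf("%s: %s\n", e.Path, e.Message)
	}
}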

Directories

Path    Synopsis
cmd
  mcp-evals    command
internal
