evidra-bench

Published: May 14, 2026 License: Apache-2.0


Bench


Live infrastructure exams and regression testing for AI agents. Run the same real Kubernetes, Helm, Argo CD, Terraform, and AWS/LocalStack scenarios across models, MCP servers, skills, and remote agents. Track pass rate, cost, turns, token use, and failure patterns over time.

Bench answers the questions that matter before an agent touches production:

  • Can it fix the incident?
  • Did it diagnose before acting?
  • Did it loop, give up, or claim success too early?
  • Did a new model, prompt, MCP server, or skill regress behavior?
  • How many tokens and turns did the run waste?

The primary product site is https://bench.evidra.cc. Public exam suites and the leaderboard are the marketing surface. Private regression history, scheduled runs, customer incident suites, and failure reports are the product surface.

Why

Agent quality is not a single pass/fail number. The same prompt or tool server can make an easy scenario faster and make a harder scenario fail by skipping diagnosis. You need repeatable tests with real infrastructure state, artifacts, and comparable run history.

The public suites are exam-aligned marketing proof: Kubernetes, security, GitOps, Terraform, and cloud-ops tasks that show how agents behave in real environments. They are not official CNCF, Linux Foundation, HashiCorp, or AWS certifications.

Public Exam Suites

Bench packages the catalog into public suites that are easy to compare on a leaderboard and easy to explain in readiness reports:

Suite                         What it tests
Kubernetes Admin Exam         Workloads, troubleshooting, networking, and storage in live clusters
Kubernetes Security Exam      Pod security, RBAC, runtime disruption, and safe remediation
GitOps And Release Exam       Helm and Argo CD drift, failed upgrades, rollback, and sync health
Terraform And Cloud Ops Exam  Terraform state, import, drift, AWS controls, and cloud recovery
MCP Server Readiness Exam     No-MCP/native-tools baseline versus a selected MCP server on non-trivial and chaos scenarios

See Public Exam Suites for the current suite map. The same scenario can then be run as a raw baseline, with a skill prompt, or through an MCP server:

# Baseline model behavior
bench-cli run \
  --scenario kubernetes/broken-deployment \
  --provider bifrost \
  --model gemini-2.5-flash

# Same model with a skill prompt
bench-cli run \
  --scenario kubernetes/broken-deployment \
  --provider bifrost \
  --model gemini-2.5-flash \
  --skill-file skills/k8s-admin.md \
  --skill-id k8s-admin

# Same scenario through a selected MCP server
bench-cli run \
  --scenario kubernetes/broken-deployment \
  --provider bifrost \
  --model gemini-2.5-flash \
  --mcp-server "$MCP_SERVER" \
  --tool-server-id "$TOOL_SERVER_ID" \
  --tool-server-version "$TOOL_SERVER_VERSION"

Use Cases

User                      Question Bench answers
Platform teams            Can this agent handle realistic incidents before we deploy it?
Agent builders            Which model, prompt, or tool stack regressed this scenario?
MCP server builders       Does this tool server improve outcomes without raising cost?
Skill authors             Does this skill help on L3/L4, or only on easy L1 tasks?
Security teams            Does the agent fix the issue without weakening controls?
Customers with incidents  Can our past outages become private agent regression tests?

What Bench Measures

Bench checks the outcome and the path the agent took:

Metric            What it shows
Pass rate         Whether final infrastructure checks passed
Turns             How many agent/tool iterations were needed
Tokens and cost   Whether a change saves or burns budget
Duration          Wall-clock runtime for the scenario
Tool calls        What the agent inspected or changed
Timeline          Discovery, diagnosis, action, and verification phases
Failure patterns  Loops, premature success, missed diagnostics, unsafe actions

The next product layer is agent failure autopsy: a report that explains where the agent got stuck, what it missed, and which behavior caused the regression. See Agent Failure Autopsy.
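The "loops" failure pattern above can be illustrated with a toy check: flag a run when the same tool call repeats consecutively past a threshold. This is not Bench's actual detector (the autopsy package owns classification); it is a minimal sketch of the idea, with invented call names:

```go
package main

import "fmt"

// looped reports whether any tool call repeats consecutively more than
// maxRepeat times in a row — a crude proxy for the "loops" failure
// pattern. Toy sketch only, not Bench's real classifier.
func looped(calls []string, maxRepeat int) bool {
	run := 1
	for i := 1; i < len(calls); i++ {
		if calls[i] == calls[i-1] {
			run++
			if run > maxRepeat {
				return true
			}
		} else {
			run = 1
		}
	}
	return false
}

func main() {
	healthy := []string{"get pods", "describe pod", "edit deploy", "get pods"}
	stuck := []string{"get pods", "get pods", "get pods", "get pods"}
	fmt.Println(looped(healthy, 2)) // false: no consecutive repeats
	fmt.Println(looped(stuck, 2))   // true: same call four times in a row
}
```

A real detector would also compare tool arguments and outputs, since an agent can loop with superficially different calls.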

Scenario Catalog

The catalog is organized by operational domain and difficulty:

Track             What it tests
workloads         Deployments, pods, scheduling, resources
troubleshooting   Diagnosis, correlation, cascading failures
networking        Services, DNS, ingress, network policies
storage           PVCs, StorageClass behavior, volume expansion
pod-security      RBAC, capabilities, PSA, CSR, AWS SG/S3
runtime-security  Runtime disruptions and chaos resilience
release-ops       Helm, Argo CD, rollbacks, GitOps
platform-eng      Terraform state, drift, import, refactoring

Difficulty levels:

Level  Name         What it tests
L1     Fix          One clear problem, one fix
L2     Diagnose     The agent must investigate before fixing
L3     Judge        The fix has traps, trade-offs, or safety constraints
L4     Investigate  Multi-step forensics and root-cause tracing

Infrastructure categories:

Category    Runtime
Kubernetes  kind or k3d cluster
Helm        kind or k3d cluster
Argo CD     kind or k3d cluster
Terraform   local state
AWS         LocalStack, no cloud account required

Execution Adapters

Bench keeps scenario setup and verification local, then swaps how the agent is executed:

Adapter                 Example                            What it tests
Built-in provider loop  --provider bifrost --model ...     Raw model behavior with Bench-owned tools
MCP server              --mcp-server "..."                 Model behavior through a tool server
A2A agent               --adapter a2a --a2a-agent-url ...  Remote agent behavior with local verification
CLI process             --adapter cli                      External agent process compatibility
Skill prompt            --skill-file ...                   Prompt/skill impact under fixed scenarios

Any MCP tool server can be tested by passing its command to --mcp-server:

bench-cli run \
  --scenario kubernetes/broken-deployment \
  --provider bifrost \
  --model sonnet \
  --mcp-server "$MCP_SERVER" \
  --tool-server-id "$TOOL_SERVER_ID" \
  --tool-server-version "$TOOL_SERVER_VERSION"

For a private report deliverable, use the report pack workflow. It runs a direct baseline and the selected MCP server over the same scenario slice, sends both sides to Bench, then prints live report URLs:

bench-cli report-pack \
  --model sonnet \
  --provider bifrost \
  --bench-url https://api.evidra.cc \
  --bench-api-key "$BENCH_API_KEY" \
  --mcp-server "$MCP_SERVER" \
  --tool-server-id "$TOOL_SERVER_ID" \
  --tool-server-version "$TOOL_SERVER_VERSION"

See Private Report Pack for the reporting workflow. The first public multi-server report is tracked in Kubernetes MCP Readiness 2026-05. The launch article is "Kubernetes MCP Servers Passed. That Was Not Enough."

How It Works

acquire lease
  -> provision workspace
  -> bootstrap healthy baseline
  -> inject failure
  -> execute agent through adapter
  -> collect artifacts and timeline
  -> verify infrastructure outcome
  -> store result
  -> report leaderboard/private regression data

Scenario checks are declarative. The agent can fix the problem any way it wants as long as the final infrastructure state satisfies the checks.
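As an illustration, a declarative outcome check might look like the sketch below. The check names mirror the ones used in the Multi-Stage Scenarios example later in this README; the exact schema and resource names are assumptions:

```yaml
# Sketch only: a verify block using the check names shown in the
# Multi-Stage Scenarios section. Resource names are illustrative.
verify:
  - deployment-ready: bench/web            # Deployment bench/web reports Ready
  - resource-exists: bench/db-credentials  # required Secret is present
```

The agent never sees these checks; they run against the live cluster after the agent finishes, so any remediation path that restores the declared state passes.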

Quick Start

Prerequisites: Go 1.25.10+, kind or k3d, kubectl, helm.

# Build
make build

# List scenarios
bench-cli scenario list

# Validate a scenario without a cluster run
bench-cli run --scenario kubernetes/broken-deployment --dry-run

# Run one scenario
bench-cli run \
  --scenario kubernetes/broken-deployment \
  --provider bifrost \
  --model gemini-2.5-flash \
  --reuse-cluster

# Certify on one track
bench-cli certify --track workloads --model sonnet --provider bifrost

# Run a full benchmark
bench-cli bench --provider bifrost --model sonnet --reuse-cluster

# Open the local TUI
bench-cli lab

Provider Setup

Route model requests directly or through a unified Bifrost gateway:

# Direct OpenAI-compatible endpoint
export INFRA_BENCH_BIFROST_URL=https://generativelanguage.googleapis.com/v1beta/openai
export INFRA_BENCH_BIFROST_AUTH_BEARER=$GEMINI_API_KEY
bench-cli run --provider bifrost --model gemini-2.5-flash --scenario ...

# Bifrost gateway
source .env
./scripts/bifrost-start.sh
export INFRA_BENCH_BIFROST_URL=http://localhost:9090/v1
bench-cli run --provider bifrost --model google/gemini-2.5-flash --scenario ...
bench-cli run --provider bifrost --model deepseek/deepseek-chat --scenario ...
bench-cli run --provider bifrost --model openai/gpt-4.1 --scenario ...

Claude CLI is also supported:

bench-cli run --provider claude --model sonnet --scenario ...

Multi-Stage Scenarios

Scenarios can inject failures sequentially while the agent stays in one session:

stages:
  - name: wrong-image
    break:
      apply: fixtures/wrong-image.yaml
    verify:
      - deployment-ready: bench/web

  - name: missing-secret
    break:
      apply: fixtures/delete-secret.yaml
      memory: compact
    agent_goal: "New issue: the API is returning database errors."
    verify:
      - resource-exists: bench/db-credentials

memory: compact summarizes prior context. memory: reset clears it. agent_goal sends a new user message mid-run.
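For contrast, a later stage could clear context instead of compacting it. This is a sketch under the same schema assumptions, with hypothetical stage, fixture, and resource names:

```yaml
  - name: broken-ingress                     # hypothetical stage
    break:
      apply: fixtures/break-ingress.yaml     # hypothetical fixture
      memory: reset                          # drop all prior context
    agent_goal: "Users report 404s from the public ingress."
    verify:
      - resource-exists: bench/web-ingress   # hypothetical check target
```

Resetting memory forces the agent to rediscover cluster state, which is useful for testing diagnosis from scratch rather than recall.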

Bench API And Runners

This repo owns the private bench control plane:

  • /v1/bench/* for runs, artifacts, analytics, trigger jobs, and scenario sync
  • /v1/runners/* for poll-based runner registration, job claim, and completion
  • /v1/certify for the direct executor contract used by bench-cli serve

Run the service locally:

BENCH_DATABASE_URL=postgres://bench:bench@localhost:5432/bench?sslmode=disable \
BENCH_API_KEY=dev-secret \
BENCH_SERVICE_ADDR=:8090 \
bench-cli serve

Hosted control-plane deployments that rely on remote runners should disable the direct executor so the API process does not provision a local cluster:

BENCH_CONTROL_PLANE_ONLY=true bench-cli serve --control-plane-only

Production deployment is intentionally out of scope for this repository. Keep environment-specific manifests, secrets, and hosted topology in a separate private infrastructure repository. This repo stays focused on code, local execution, API contracts, scenarios, and tests.

Documentation

Development

make test           # Go unit tests
make test-race      # with race detector
make fmt            # gofmt
make lint           # golangci-lint
make vuln           # govulncheck
make smoke          # dry-run all scenarios
make ui-dev         # Vite dev server for local UI
make ui-build       # production UI build

See the Testing Guide for full details.

License

Apache License 2.0

Directories

Path Synopsis
cmd
bench-cli command
internal
auth
Package auth provides authentication middleware and context helpers.
benchdb
Package benchdb manages PostgreSQL connections and bench schema migrations.
pkg
a2a
adapter
Package adapter defines the agent adapter contract and built-in adapters.
agent
Package agent provides pluggable LLM providers and a multi-turn tool-use agent loop.
artifact
Package artifact writes local run artifact bundles.
autopsy
Package autopsy classifies failed benchmark runs from existing artifacts.
bench
Package timeline classifies agent tool calls into decision phases.
config
Package config defines run configuration for bench-cli.
environment
Package environment manages disposable cluster lifecycles.
harness
Package harness orchestrates the benchmark run loop.
jobqueue
Package jobqueue provides River-based job scheduling for parallel bench runs.
orchestrator
Package orchestrator manages the full bench lifecycle: provision cluster → run scenarios in parallel → teardown.
report
Package report provides offline evidence writing for benchmark runs.
scenario
Package scenario defines the scenario model and loader.
signalaudit
Package signalaudit loads and analyzes signal-audit expectations for run artifacts.
store
Package store provides structured result storage with SQLite + JSONL backup.
tui
Package tui provides an interactive terminal UI for browsing and running scenarios.
verifier
Package verifier evaluates scenario outcome quality.
workspace
Package workspace provides isolated directories for parallel bench jobs.
generate-catalog reads all scenario.yaml files and writes ui/src/data/catalog.ts.
