eval_mcp_surfaces

command
v2.0.2
Published: May 19, 2026 License: MIT Imports: 37 Imported by: 0

README

eval_mcp_surfaces

cmd/eval_mcp_surfaces evaluates whether a model can use the model-facing MCP catalog correctly. It can run schema-only validation, model-backed schema evaluation, and Docker-backed live MCP execution.

Inputs

| Input | Default | Purpose |
| --- | --- | --- |
| --tasks | cmd/eval_mcp_surfaces/testdata/automated-mcp-surface-cases.md | Executable Markdown fixture with MT-*, MS-*, and MF-* rows. |
| --model | empty | Single provider:model string or legacy Anthropic model name. Overrides --models and EVAL_MODELS. |
| --models | empty | Comma-separated provider:model list for local multi-model analysis. Defaults to EVAL_MODELS when --model is not set. |
| --tool-surface | dynamic | Model-facing catalog surface to evaluate: dynamic or meta. dynamic evaluates gitlab_find_action plus gitlab_execute_tool. |
| --tools-file | empty | Optional saved tools/list snapshot for schema/model comparison. |
| --preset | empty | Optional batch preset: docker-read, docker-mutating-safe, docker-destructive-safe, or schema-enterprise. Explicit flags override preset defaults. |
| --partition | empty | Optional fixture partition such as base-read, enterprise-read, or error-recovery. |
| --coverage-report | empty | Optional Markdown file listing uncovered high-risk routes for the selected run. |
| --compare | empty | Repeatable report path for comparison mode. Accepts token reports from cmd/audit_tokens and evaluation reports from this command. |
| --publish-docs | false | Publish reviewed evaluation reports into managed blocks in README.md and docs/testing/model-results.md. |
| --publish-from | empty | Repeatable reviewed eval_mcp_surfaces report path consumed by --publish-docs or --check-docs. |
| --publish-results-doc | docs/testing/model-results.md | Results document updated by --publish-docs. |
| --publish-readme | README.md | README file updated by --publish-docs. |
| --publish-label | empty | Human-readable result label used in generated documentation. Defaults to the report date. |
| --publish-mode | replace-current | Results block update mode: replace-current or append. README summary always reflects the current selected reports. |
| --check-docs | false | Verify managed documentation blocks match the selected reports without writing files. |
| --mcp-command | empty | External stdio MCP server command used by --execute-tools instead of the in-process current-source server. |
| --mcp-arg | empty | Repeatable argument passed to --mcp-command. |
| --mcp-env-file | empty | Env file passed to the external MCP command, usually .env or test/e2e/.env.docker. |
| --fixtures | dist/evaluation/mcp-surfaces/e2e-fixtures.json | Docker live fixture state generated by --prepare-fixtures. |
| --gitlab-env-file | empty | Env file for --backend=gitlab, usually test/e2e/.env.docker. |

Supported model providers are anthropic, google, openai, and qwen. The evaluator reads ANTHROPIC_API_KEY, GOOGLE_API_KEY, OPENAI_API_KEY, and QWEN_API_KEY only for the providers selected by --model, --models, or EVAL_MODELS. When no model source is configured, the default is anthropic:claude-haiku-4-5-20251001. Qwen uses QWEN_API_KEY directly, defaults to the international DashScope OpenAI-compatible endpoint, disables thinking for tool calls, and supports QWEN_BASE_URL or QWEN_CHAT_COMPLETIONS_URL for regional endpoints.
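
The provider rules above can be illustrated with a small Go sketch. It is not the evaluator's actual code: parseModelSpec and requiredKeys are hypothetical helper names, and treating a bare name as an Anthropic model is an assumption based on the "legacy Anthropic model name" wording of --model.

```go
package main

import (
	"fmt"
	"strings"
)

// apiKeyEnv maps each provider listed in the README to the environment
// variable the README says is read for it.
var apiKeyEnv = map[string]string{
	"anthropic": "ANTHROPIC_API_KEY",
	"google":    "GOOGLE_API_KEY",
	"openai":    "OPENAI_API_KEY",
	"qwen":      "QWEN_API_KEY",
}

// parseModelSpec (hypothetical) splits "provider:model". A spec without a
// colon is assumed to be a legacy Anthropic model name.
func parseModelSpec(spec string) (provider, model string, err error) {
	spec = strings.TrimSpace(spec)
	if spec == "" {
		return "", "", fmt.Errorf("empty model spec")
	}
	p, m, found := strings.Cut(spec, ":")
	if !found {
		return "anthropic", spec, nil
	}
	if _, ok := apiKeyEnv[p]; !ok {
		return "", "", fmt.Errorf("unsupported provider %q", p)
	}
	if m == "" {
		return "", "", fmt.Errorf("missing model name in %q", spec)
	}
	return p, m, nil
}

// requiredKeys (hypothetical) returns the API key variables needed for the
// selected specs only, without duplicates, in first-seen order.
func requiredKeys(specs []string) ([]string, error) {
	seen := map[string]bool{}
	var keys []string
	for _, s := range specs {
		p, _, err := parseModelSpec(s)
		if err != nil {
			return nil, err
		}
		if k := apiKeyEnv[p]; !seen[k] {
			seen[k] = true
			keys = append(keys, k)
		}
	}
	return keys, nil
}

func main() {
	keys, err := requiredKeys([]string{
		"anthropic:claude-sonnet-4-6",
		"claude-haiku-4-5-20251001", // bare name, assumed legacy Anthropic
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(keys)
}
```

Only the keys for the selected providers are collected, which mirrors the statement that unselected providers' keys are never read.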

Common Commands

Dry-run the current catalog without model calls:

GITLAB_ENTERPRISE=false timeout 180s go run ./cmd/eval_mcp_surfaces \
  --dry-run \
  --repeat=1 \
  --out /tmp/eval-dry.md

Dry-run a saved snapshot partition and emit uncovered high-risk routes:

timeout 180s go run ./cmd/eval_mcp_surfaces \
  --tools-file dist/evaluation/mcp-surfaces/snapshots/current/tools.json \
  --dry-run \
  --partition base-read \
  --coverage-report dist/evaluation/mcp-surfaces/snapshots/current/coverage-base-read.md \
  --out dist/evaluation/mcp-surfaces/snapshots/current/schema-base-read.md

Run a targeted model-backed schema sample:

timeout 900s go run ./cmd/eval_mcp_surfaces \
  --model anthropic:claude-sonnet-4-6 \
  --task MS-001,MF-001 \
  --repeat=1 \
  --pause=250ms \
  --retries=8 \
  --retry-wait=65s \
  --out dist/evaluation/mcp-surfaces/schema-sample.md

Run a small local analysis batch across all models configured in .env:

timeout 1800s go run ./cmd/eval_mcp_surfaces \
  --models "$EVAL_MODELS" \
  --task MT-001,MS-001,MF-001 \
  --repeat=1 \
  --pause=250ms \
  --retries=4 \
  --retry-wait=30s \
  --out dist/evaluation/mcp-surfaces/current-multi-model-smoke.md

When neither --model nor --models is provided, EVAL_MODELS from .env is used if present; otherwise the evaluator falls back to its built-in default, anthropic:claude-haiku-4-5-20251001.
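
The precedence between --model, --models, EVAL_MODELS, and the default can be sketched as below. This is a minimal illustration under the rules stated in this README, not the command's real implementation; resolveModels and splitList are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultModel is the default named in this README.
const defaultModel = "anthropic:claude-haiku-4-5-20251001"

// splitList splits a comma-separated list and drops empty entries.
func splitList(s string) []string {
	var out []string
	for _, part := range strings.Split(s, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// resolveModels (hypothetical) applies the documented order:
// --model, then --models, then EVAL_MODELS, then the default.
func resolveModels(modelFlag, modelsFlag, envModels string) []string {
	if m := strings.TrimSpace(modelFlag); m != "" {
		return []string{m}
	}
	if list := splitList(modelsFlag); len(list) > 0 {
		return list
	}
	if list := splitList(envModels); len(list) > 0 {
		return list
	}
	return []string{defaultModel}
}

func main() {
	// --model wins even when --models and EVAL_MODELS are set.
	fmt.Println(resolveModels("anthropic:claude-sonnet-4-6", "a:b,c:d", "e:f"))
	// Nothing configured: fall back to the default.
	fmt.Println(resolveModels("", "", ""))
}
```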

Run the full fixture corpus against Haiku using the current low-token dynamic surface:

timeout 7200s go run ./cmd/eval_mcp_surfaces \
  --tool-surface=dynamic \
  --model anthropic:claude-haiku-4-5-20251001 \
  --repeat=1 \
  --pause=250ms \
  --retries=4 \
  --retry-wait=65s \
  --out dist/evaluation/mcp-surfaces/dynamic-haiku-all.md

Run the deterministic dynamic search corpus after ranker changes:

go test ./internal/tools/dynamic/ -run TestDynamicSearchCorpus -count=1

Prepare Docker fixtures:

GITLAB_ENTERPRISE=false timeout 600s go run ./cmd/eval_mcp_surfaces \
  --backend=gitlab \
  --gitlab-env-file test/e2e/.env.docker \
  --prepare-fixtures \
  --fixtures-only \
  --fixtures dist/evaluation/mcp-surfaces/e2e-fixtures.json

Execute validated model calls against Docker GitLab CE:

GITLAB_ENTERPRISE=false timeout 900s go run ./cmd/eval_mcp_surfaces \
  --preset docker-read \
  --fixtures dist/evaluation/mcp-surfaces/e2e-fixtures.json \
  --task MS-014,MS-017,MS-020 \
  --out dist/evaluation/mcp-surfaces/live-smoke.md

Run the Enterprise schema-only batch:

GITLAB_ENTERPRISE=true timeout 180s go run ./cmd/eval_mcp_surfaces \
  --preset schema-enterprise \
  --out dist/evaluation/mcp-surfaces/schema-enterprise.md

Compare token and evaluation reports:

timeout 180s go run ./cmd/eval_mcp_surfaces \
  --compare dist/evaluation/mcp-surfaces/snapshots/release-2.0.0/tokens.md \
  --compare dist/evaluation/mcp-surfaces/snapshots/release-2.0.0/schema-base-read.md \
  --compare dist/evaluation/mcp-surfaces/snapshots/current/schema-base-read.md \
  --out dist/evaluation/mcp-surfaces/comparison/version-summary.md

Check dynamic call-efficiency gates from trace JSONL without parsing Markdown reports:

timeout 180s go run ./cmd/eval_mcp_surfaces \
  --check-efficiency dist/evaluation/mcp-surfaces/dynamic-full-live.traces/traces.jsonl \
  --out dist/evaluation/mcp-surfaces/dynamic-efficiency-check.md

Compare dynamic and meta-tool traces on identical task/model rows only:

timeout 180s go run ./cmd/eval_mcp_surfaces \
  --compare-traces dist/evaluation/mcp-surfaces/dynamic-full-live.traces/traces.jsonl \
  --compare-traces dist/evaluation/mcp-surfaces/meta-default-opaque-full-plus-reactivated-2026-05-13.traces/traces.jsonl \
  --out dist/evaluation/mcp-surfaces/meta-vs-dynamic-trace-comparison.md

Publish reviewed Docker reports into managed documentation blocks:

timeout 180s go run ./cmd/eval_mcp_surfaces \
  --publish-docs \
  --publish-from dist/evaluation/mcp-surfaces/docker-read-all-models.md \
  --publish-from dist/evaluation/mcp-surfaces/docker-mutating-safe-all-models.md \
  --publish-from dist/evaluation/mcp-surfaces/docker-destructive-safe-all-models.md \
  --publish-label "2026-05-05 Docker economy models"

Check the committed docs against the same reviewed reports without writing:

timeout 180s go run ./cmd/eval_mcp_surfaces \
  --check-docs \
  --publish-from dist/evaluation/mcp-surfaces/docker-read-all-models.md \
  --publish-from dist/evaluation/mcp-surfaces/docker-mutating-safe-all-models.md \
  --publish-from dist/evaluation/mcp-surfaces/docker-destructive-safe-all-models.md \
  --publish-label "2026-05-05 Docker economy models"

Run validated calls through an older or separately built stdio MCP server:

E2E_MODE=docker timeout 900s go run ./cmd/eval_mcp_surfaces \
  --tools-file dist/evaluation/mcp-surfaces/snapshots/release-2.0.0/tools.json \
  --mcp-command dist/evaluation/mcp-surfaces/snapshots/release-2.0.0/gitlab-mcp-server-release-2.0.0 \
  --mcp-env-file test/e2e/.env.docker \
  --execute-tools \
  --use-fixtures \
  --fixtures dist/evaluation/mcp-surfaces/e2e-fixtures.json \
  --task MS-028 \
  --skip-unavailable \
  --out dist/evaluation/mcp-surfaces/snapshots/release-2.0.0/live-ms-028.md

The Docker presets apply safe defaults for --backend=gitlab, --gitlab-env-file test/e2e/.env.docker, --execute-tools, --use-fixtures, --skip-unavailable, and the matching partition. Override any of those flags explicitly when debugging a narrower case.
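
One common way to let explicit flags win over preset defaults in Go is flag.FlagSet.Visit, which visits only flags that were actually set on the command line. The sketch below assumes that approach; it is not taken from the command's source. applyPreset and parseArgs are hypothetical names, and the matching partition is left out because this README does not say which partition each preset selects.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// dockerDefaults lists the flag defaults this README attributes to the
// Docker presets (partition omitted, see the note above).
var dockerDefaults = map[string]string{
	"backend":          "gitlab",
	"gitlab-env-file":  "test/e2e/.env.docker",
	"execute-tools":    "true",
	"use-fixtures":     "true",
	"skip-unavailable": "true",
}

// applyPreset (hypothetical) sets each default only when the user did not
// pass that flag explicitly.
func applyPreset(fs *flag.FlagSet, defaults map[string]string) error {
	explicit := map[string]bool{}
	fs.Visit(func(f *flag.Flag) { explicit[f.Name] = true })
	for name, value := range defaults {
		if explicit[name] {
			continue
		}
		if err := fs.Set(name, value); err != nil {
			return err
		}
	}
	return nil
}

// parseArgs (hypothetical) parses a reduced flag set and returns the final
// value of every flag as a string.
func parseArgs(args []string) (map[string]string, error) {
	fs := flag.NewFlagSet("eval_mcp_surfaces", flag.ContinueOnError)
	preset := fs.String("preset", "", "batch preset")
	fs.String("backend", "", "backend")
	fs.String("gitlab-env-file", "", "env file")
	fs.Bool("execute-tools", false, "execute tools")
	fs.Bool("use-fixtures", false, "use fixtures")
	fs.Bool("skip-unavailable", false, "skip unavailable")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	if strings.HasPrefix(*preset, "docker-") {
		if err := applyPreset(fs, dockerDefaults); err != nil {
			return nil, err
		}
	}
	out := map[string]string{}
	fs.VisitAll(func(f *flag.Flag) { out[f.Name] = f.Value.String() })
	return out, nil
}

func main() {
	got, err := parseArgs([]string{"--preset", "docker-read", "--skip-unavailable=false"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("backend:", got["backend"])
	fmt.Println("skip-unavailable:", got["skip-unavailable"])
}
```

Here the explicit --skip-unavailable=false survives while the remaining Docker defaults are filled in from the preset.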

Long model-backed runs create the selected --out Markdown file at startup with a Status: running placeholder. The placeholder is replaced by the final metrics report when the run completes, or by a failure report if the evaluator stops before final metrics are available. For long local runs, always set an explicit --out path and redirect stdout/stderr to a sibling .log file so the terminal does not become the report artifact.
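
The placeholder lifecycle described above can be sketched as follows. This is an illustration only: startReport and finishReport are hypothetical names, the "Status: failed" text is an assumed failure format, and writing through a temporary file plus rename is a design choice of this sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// startReport writes the placeholder so the --out file exists at startup.
func startReport(path string) error {
	return os.WriteFile(path, []byte("Status: running\n"), 0o644)
}

// finishReport replaces the placeholder with the final report, or with a
// failure report (assumed format) when runErr is non-nil.
func finishReport(path, body string, runErr error) error {
	if runErr != nil {
		body = "Status: failed\n\n" + runErr.Error() + "\n"
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, []byte(body), 0o644); err != nil {
		return err
	}
	// Rename so readers never observe a half-written report.
	return os.Rename(tmp, path)
}

// demo runs the lifecycle in a temporary directory and returns the file
// content before and after completion.
func demo() (string, string, error) {
	dir, err := os.MkdirTemp("", "eval-report-")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "report.md")
	if err := startReport(path); err != nil {
		return "", "", err
	}
	before, err := os.ReadFile(path)
	if err != nil {
		return "", "", err
	}
	if err := finishReport(path, "Status: complete\n", nil); err != nil {
		return "", "", err
	}
	after, err := os.ReadFile(path)
	if err != nil {
		return "", "", err
	}
	return string(before), string(after), nil
}

func main() {
	before, after, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(before, after)
}
```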

Use scripts/eval-compare-version.sh to orchestrate the standard snapshot, token audit, schema dry-run, optional model-backed run, and optional comparison report for one target label.

Safety

--execute-tools can mutate GitLab resources. It requires --backend=gitlab or --mcp-command plus E2E_MODE=docker unless --allow-live-mutations is explicitly set.

Keep reports, traces, snapshots, and fixture state under dist/evaluation/mcp-surfaces/; that directory is ignored by git.

Model-backed trace JSON records the normalized prompt flow plus provider HTTP request/response bodies and MCP CallTool request/response payloads. Provider authentication headers are not serialized; raw trace artifacts remain local and should not be published or committed.

--publish-docs is intentionally separate from normal runs. It consumes reviewed Markdown reports selected with --publish-from. For full GitLab-backed MCP reports without an explicit preset, the publisher also reads the report's local Trace artifacts JSONL, which lets it split the table by preset and special partitions. It refuses to publish Docker metrics from GitLab-backed reports that did not use MCP tool execution. Partial Docker preset reports must use a --publish-label containing the word "targeted" so they are not mistaken for full preset results.
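
The label rule for partial Docker preset reports amounts to a simple guard. The sketch below is an assumption about how such a check could look: checkPublishLabel is a hypothetical name, and the case-sensitive substring match is a guess, since this README does not specify case handling.

```go
package main

import (
	"fmt"
	"strings"
)

// checkPublishLabel (hypothetical) rejects partial reports whose label does
// not contain "targeted". The match is case-sensitive by assumption.
func checkPublishLabel(label string, partial bool) error {
	if partial && !strings.Contains(label, "targeted") {
		return fmt.Errorf("partial report needs a --publish-label containing %q, got %q", "targeted", label)
	}
	return nil
}

func main() {
	// Full report: any label is accepted.
	fmt.Println(checkPublishLabel("2026-05-05 Docker economy models", false))
	// Partial report without the marker word: rejected.
	fmt.Println(checkPublishLabel("2026-05-05 Docker economy models", true))
	// Partial report with the marker word: accepted.
	fmt.Println(checkPublishLabel("2026-05-05 targeted Docker smoke", true))
}
```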

Docker live reports include a failure-triage section. It separates MCP implementation bugs, GitLab CE limitations, model route-selection misses, model parameter-shape misses, fixture setup failures, transient GitLab 5xx responses, timeout/resource exhaustion, destructive safety failures, and not-found results. Dynamic-surface reports also separate ranker_miss diagnostics from model discovery and execution failures when ranker-specific notes are available.

Documentation

Overview

Command eval_mcp_surfaces evaluates model behavior across MCP tool surfaces. By default it uses a mock GitLab client for catalog generation; --backend=gitlab points the in-memory MCP server at a real GitLab instance such as the Docker E2E environment.

Usage:

go run ./cmd/eval_mcp_surfaces/
go run ./cmd/eval_mcp_surfaces/ --max-tasks=5
go run ./cmd/eval_mcp_surfaces/ --dry-run
go run ./cmd/eval_mcp_surfaces/ --tools-file /tmp/tools_mcp_surfaces.json
go run ./cmd/eval_mcp_surfaces/ --publish-docs --publish-from dist/evaluation/mcp-surfaces/docker-read.md
