eval_mcp_surfaces

Version: v2.0.2. Published: May 19, 2026. License: MIT. Imports: 37. Imported by: 0.

Repository: github.com/jmrplens/gitlab-mcp-server

cmd/eval_mcp_surfaces evaluates whether a model can use the model-facing MCP catalog correctly. It can run schema-only validation, model-backed schema evaluation, and Docker-backed live MCP execution.

Inputs

Input	Default	Purpose
`--tasks`	`cmd/eval_mcp_surfaces/testdata/automated-mcp-surface-cases.md`	Executable Markdown fixture with `MT-*`, `MS-*`, and `MF-*` rows.
`--model`	empty	Single `provider:model` string or legacy Anthropic model name. Overrides `--models` and `EVAL_MODELS`.
`--models`	empty	Comma-separated `provider:model` list for local multi-model analysis. Defaults to `EVAL_MODELS` when `--model` is not set.
`--tool-surface`	`dynamic`	Model-facing catalog surface to evaluate: `dynamic` or `meta`. `dynamic` evaluates `gitlab_find_action` plus `gitlab_execute_tool`.
`--tools-file`	empty	Optional saved `tools/list` snapshot for schema/model comparison.
`--preset`	empty	Optional batch preset: `docker-read`, `docker-mutating-safe`, `docker-destructive-safe`, or `schema-enterprise`. Explicit flags override preset defaults.
`--partition`	empty	Optional fixture partition such as `base-read`, `enterprise-read`, or `error-recovery`.
`--coverage-report`	empty	Optional Markdown file listing uncovered high-risk routes for the selected run.
`--compare`	empty	Repeatable report path for comparison mode. Accepts token reports from `cmd/audit_tokens` and evaluation reports from this command.
`--publish-docs`	`false`	Publish reviewed evaluation reports into managed blocks in `README.md` and `docs/testing/model-results.md`.
`--publish-from`	empty	Repeatable reviewed `eval_mcp_surfaces` report path consumed by `--publish-docs` or `--check-docs`.
`--publish-results-doc`	`docs/testing/model-results.md`	Results document updated by `--publish-docs`.
`--publish-readme`	`README.md`	README file updated by `--publish-docs`.
`--publish-label`	empty	Human-readable result label used in generated documentation. Defaults to the report date.
`--publish-mode`	`replace-current`	Results block update mode: `replace-current` or `append`. README summary always reflects the current selected reports.
`--check-docs`	`false`	Verify managed documentation blocks match the selected reports without writing files.
`--mcp-command`	empty	External stdio MCP server command used by `--execute-tools` instead of the in-process current-source server.
`--mcp-arg`	empty	Repeatable argument passed to `--mcp-command`.
`--mcp-env-file`	empty	Env file passed to the external MCP command, usually `.env` or `test/e2e/.env.docker`.
`--fixtures`	`dist/evaluation/mcp-surfaces/e2e-fixtures.json`	Docker live fixture state generated by `--prepare-fixtures`.
`--gitlab-env-file`	empty	Env file for `--backend=gitlab`, usually `test/e2e/.env.docker`.

Supported model providers are anthropic, google, openai, and qwen. The evaluator reads ANTHROPIC_API_KEY, GOOGLE_API_KEY, OPENAI_API_KEY, and QWEN_API_KEY only for the providers selected by --model, --models, or EVAL_MODELS. When no model source is configured, the default is anthropic:claude-haiku-4-5-20251001. Qwen uses QWEN_API_KEY directly, defaults to the international DashScope OpenAI-compatible endpoint, disables thinking for tool calls, and supports QWEN_BASE_URL or QWEN_CHAT_COMPLETIONS_URL for regional endpoints.
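The provider rules above can be illustrated with a minimal Go sketch. It is not the evaluator's source: `modelSpec`, `parseModelSpec`, and `requiredKeys` are hypothetical names, and treating any string without a colon as a legacy Anthropic model name is an assumption about how the legacy form is detected. Only the provider names and environment variable names come from this README.

```go
package main

import (
	"fmt"
	"strings"
)

// modelSpec is a hypothetical type for this sketch only.
type modelSpec struct {
	Provider string
	Model    string
}

// apiKeyEnv maps each supported provider to the API key variable named in the README.
var apiKeyEnv = map[string]string{
	"anthropic": "ANTHROPIC_API_KEY",
	"google":    "GOOGLE_API_KEY",
	"openai":    "OPENAI_API_KEY",
	"qwen":      "QWEN_API_KEY",
}

// parseModelSpec splits a "provider:model" string. A string without a colon
// is treated as a legacy Anthropic model name (assumption, see lead-in).
func parseModelSpec(s string) (modelSpec, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return modelSpec{}, fmt.Errorf("empty model spec")
	}
	provider, model, found := strings.Cut(s, ":")
	if !found {
		return modelSpec{Provider: "anthropic", Model: s}, nil
	}
	if _, ok := apiKeyEnv[provider]; !ok {
		return modelSpec{}, fmt.Errorf("unsupported provider %q", provider)
	}
	if model == "" {
		return modelSpec{}, fmt.Errorf("missing model name in %q", s)
	}
	return modelSpec{Provider: provider, Model: model}, nil
}

// requiredKeys returns the distinct API key variables needed for the selected
// specs, in first-seen order, so unselected providers need no key.
func requiredKeys(specs []modelSpec) []string {
	seen := map[string]bool{}
	var keys []string
	for _, sp := range specs {
		k := apiKeyEnv[sp.Provider]
		if !seen[k] {
			seen[k] = true
			keys = append(keys, k)
		}
	}
	return keys
}

func main() {
	var specs []modelSpec
	// The second entry uses the legacy form without a provider prefix.
	for _, raw := range []string{"anthropic:claude-sonnet-4-6", "claude-haiku-4-5-20251001"} {
		sp, err := parseModelSpec(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		specs = append(specs, sp)
	}
	fmt.Println(requiredKeys(specs))
}
```

Deriving the key list from the selected specs mirrors the statement that keys are read only for providers that are actually selected.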

Common Commands

Dry-run the current catalog without model calls:

GITLAB_ENTERPRISE=false timeout 180s go run ./cmd/eval_mcp_surfaces \
  --dry-run \
  --repeat=1 \
  --out /tmp/eval-dry.md

Dry-run a saved snapshot partition and emit uncovered high-risk routes:

timeout 180s go run ./cmd/eval_mcp_surfaces \
  --tools-file dist/evaluation/mcp-surfaces/snapshots/current/tools.json \
  --dry-run \
  --partition base-read \
  --coverage-report dist/evaluation/mcp-surfaces/snapshots/current/coverage-base-read.md \
  --out dist/evaluation/mcp-surfaces/snapshots/current/schema-base-read.md

Run a targeted model-backed schema sample:

timeout 900s go run ./cmd/eval_mcp_surfaces \
  --model anthropic:claude-sonnet-4-6 \
  --task MS-001,MF-001 \
  --repeat=1 \
  --pause=250ms \
  --retries=8 \
  --retry-wait=65s \
  --out dist/evaluation/mcp-surfaces/schema-sample.md

Run the same small local analysis batch across all models configured in .env:

timeout 1800s go run ./cmd/eval_mcp_surfaces \
  --models "$EVAL_MODELS" \
  --task MT-001,MS-001,MF-001 \
  --repeat=1 \
  --pause=250ms \
  --retries=4 \
  --retry-wait=30s \
  --out dist/evaluation/mcp-surfaces/current-multi-model-smoke.md

When neither --model nor --models is provided, EVAL_MODELS from .env is used if present; otherwise the evaluator falls back to the source-defined default model, anthropic:claude-haiku-4-5-20251001.
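The precedence between --model, --models, and EVAL_MODELS can be sketched as a small pure function. This is an illustration under stated assumptions, not the command's implementation: `resolveModels` and `splitList` are hypothetical helpers, and the sketch takes the EVAL_MODELS value as a plain string rather than showing how .env is loaded.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultModel is the default named in this README.
const defaultModel = "anthropic:claude-haiku-4-5-20251001"

// splitList parses a comma-separated list and drops empty entries.
func splitList(s string) []string {
	var out []string
	for _, part := range strings.Split(s, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// resolveModels applies the documented order: --model wins, then --models,
// then EVAL_MODELS, then the default model.
func resolveModels(modelFlag, modelsFlag, envModels string) []string {
	if m := strings.TrimSpace(modelFlag); m != "" {
		return []string{m}
	}
	if list := splitList(modelsFlag); len(list) > 0 {
		return list
	}
	if list := splitList(envModels); len(list) > 0 {
		return list
	}
	return []string{defaultModel}
}

func main() {
	// No flags and no EVAL_MODELS: falls back to the default.
	fmt.Println(resolveModels("", "", ""))
	// --model overrides both list sources.
	fmt.Println(resolveModels("anthropic:claude-sonnet-4-6", "a:x,b:y", "c:z"))
}
```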

Run the full fixture corpus against Haiku using the current low-token dynamic surface:

timeout 7200s go run ./cmd/eval_mcp_surfaces \
  --tool-surface=dynamic \
  --model anthropic:claude-haiku-4-5-20251001 \
  --repeat=1 \
  --pause=250ms \
  --retries=4 \
  --retry-wait=65s \
  --out dist/evaluation/mcp-surfaces/dynamic-haiku-all.md

Run the deterministic dynamic search corpus after ranker changes:

go test ./internal/tools/dynamic/ -run TestDynamicSearchCorpus -count=1

Prepare Docker fixtures:

GITLAB_ENTERPRISE=false timeout 600s go run ./cmd/eval_mcp_surfaces \
  --backend=gitlab \
  --gitlab-env-file test/e2e/.env.docker \
  --prepare-fixtures \
  --fixtures-only \
  --fixtures dist/evaluation/mcp-surfaces/e2e-fixtures.json

Execute validated model calls against Docker GitLab CE:

GITLAB_ENTERPRISE=false timeout 900s go run ./cmd/eval_mcp_surfaces \
  --preset docker-read \
  --fixtures dist/evaluation/mcp-surfaces/e2e-fixtures.json \
  --task MS-014,MS-017,MS-020 \
  --out dist/evaluation/mcp-surfaces/live-smoke.md

Run the Enterprise schema-only batch:

GITLAB_ENTERPRISE=true timeout 180s go run ./cmd/eval_mcp_surfaces \
  --preset schema-enterprise \
  --out dist/evaluation/mcp-surfaces/schema-enterprise.md

Compare token and evaluation reports:

timeout 180s go run ./cmd/eval_mcp_surfaces \
  --compare dist/evaluation/mcp-surfaces/snapshots/release-2.0.0/tokens.md \
  --compare dist/evaluation/mcp-surfaces/snapshots/release-2.0.0/schema-base-read.md \
  --compare dist/evaluation/mcp-surfaces/snapshots/current/schema-base-read.md \
  --out dist/evaluation/mcp-surfaces/comparison/version-summary.md

Check dynamic call-efficiency gates from trace JSONL without parsing Markdown reports:

timeout 180s go run ./cmd/eval_mcp_surfaces \
  --check-efficiency dist/evaluation/mcp-surfaces/dynamic-full-live.traces/traces.jsonl \
  --out dist/evaluation/mcp-surfaces/dynamic-efficiency-check.md

Compare dynamic and meta-tool traces on identical task/model rows only:

timeout 180s go run ./cmd/eval_mcp_surfaces \
  --compare-traces dist/evaluation/mcp-surfaces/dynamic-full-live.traces/traces.jsonl \
  --compare-traces dist/evaluation/mcp-surfaces/meta-default-opaque-full-plus-reactivated-2026-05-13.traces/traces.jsonl \
  --out dist/evaluation/mcp-surfaces/meta-vs-dynamic-trace-comparison.md
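Restricting the comparison to identical task/model rows amounts to an inner join on the (task, model) pair. The sketch below shows that idea only. The trace JSONL schema is not documented here, so `traceRow`, its `ToolCalls` field, and `matchRows` are hypothetical stand-ins, not the evaluator's types.

```go
package main

import "fmt"

// traceRow is a hypothetical, reduced view of one trace record.
type traceRow struct {
	Task      string
	Model     string
	ToolCalls int // illustrative metric; the real trace fields may differ
}

// rowKey identifies a row by task and model.
type rowKey struct {
	Task  string
	Model string
}

// matchedPair holds the two rows that share one task/model key.
type matchedPair struct {
	Key     rowKey
	Dynamic traceRow
	Meta    traceRow
}

// matchRows keeps only rows whose task/model key appears in both inputs,
// preserving the order of the dynamic rows.
func matchRows(dynamic, meta []traceRow) []matchedPair {
	byKey := make(map[rowKey]traceRow, len(meta))
	for _, r := range meta {
		byKey[rowKey{Task: r.Task, Model: r.Model}] = r
	}
	var out []matchedPair
	for _, d := range dynamic {
		k := rowKey{Task: d.Task, Model: d.Model}
		if m, ok := byKey[k]; ok {
			out = append(out, matchedPair{Key: k, Dynamic: d, Meta: m})
		}
	}
	return out
}

func main() {
	dynamic := []traceRow{
		{Task: "MS-001", Model: "model-a", ToolCalls: 2},
		{Task: "MS-002", Model: "model-a", ToolCalls: 3},
	}
	meta := []traceRow{
		{Task: "MS-001", Model: "model-a", ToolCalls: 4},
		{Task: "MS-001", Model: "model-b", ToolCalls: 1},
	}
	for _, p := range matchRows(dynamic, meta) {
		fmt.Printf("%s %s dynamic=%d meta=%d\n", p.Key.Task, p.Key.Model, p.Dynamic.ToolCalls, p.Meta.ToolCalls)
	}
}
```

Rows that exist on only one side are dropped rather than padded, so a difference in task coverage between the two runs cannot skew the comparison.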

Publish reviewed Docker reports into managed documentation blocks:

timeout 180s go run ./cmd/eval_mcp_surfaces \
  --publish-docs \
  --publish-from dist/evaluation/mcp-surfaces/docker-read-all-models.md \
  --publish-from dist/evaluation/mcp-surfaces/docker-mutating-safe-all-models.md \
  --publish-from dist/evaluation/mcp-surfaces/docker-destructive-safe-all-models.md \
  --publish-label "2026-05-05 Docker economy models"

Check the committed docs against the same reviewed reports without writing:

timeout 180s go run ./cmd/eval_mcp_surfaces \
  --check-docs \
  --publish-from dist/evaluation/mcp-surfaces/docker-read-all-models.md \
  --publish-from dist/evaluation/mcp-surfaces/docker-mutating-safe-all-models.md \
  --publish-from dist/evaluation/mcp-surfaces/docker-destructive-safe-all-models.md \
  --publish-label "2026-05-05 Docker economy models"

Run validated calls through an older or separately built stdio MCP server:

E2E_MODE=docker timeout 900s go run ./cmd/eval_mcp_surfaces \
  --tools-file dist/evaluation/mcp-surfaces/snapshots/release-2.0.0/tools.json \
  --mcp-command dist/evaluation/mcp-surfaces/snapshots/release-2.0.0/gitlab-mcp-server-release-2.0.0 \
  --mcp-env-file test/e2e/.env.docker \
  --execute-tools \
  --use-fixtures \
  --fixtures dist/evaluation/mcp-surfaces/e2e-fixtures.json \
  --task MS-028 \
  --skip-unavailable \
  --out dist/evaluation/mcp-surfaces/snapshots/release-2.0.0/live-ms-028.md

The Docker presets apply safe defaults for --backend=gitlab, --gitlab-env-file test/e2e/.env.docker, --execute-tools, --use-fixtures, --skip-unavailable, and the matching partition. Override any of those flags explicitly when debugging a narrower case.

Long model-backed runs create the selected --out Markdown file at startup with a Status: running placeholder. The placeholder is replaced by the final metrics report when the run completes, or by a failure report if the evaluator stops before final metrics are available. For long local runs, always set an explicit --out path and redirect stdout/stderr to a sibling .log file so the terminal does not become the report artifact.
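The placeholder lifecycle described above can be sketched in Go as follows. This is a minimal illustration, not the evaluator's code: `writeReport`, `runWithPlaceholder`, and `demo` are hypothetical names, the failure report wording is invented for the example, and writing through a temporary file plus rename is a design choice of this sketch rather than documented behavior.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeReport writes body to a sibling temp file and renames it over path,
// so a reader does not observe a half-written report (sketch design choice).
func writeReport(path, body string) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, []byte(body), 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// runWithPlaceholder creates the report with a running placeholder, then
// replaces it with the final report or with a failure report.
func runWithPlaceholder(path string, run func() (string, error)) error {
	if err := writeReport(path, "Status: running\n"); err != nil {
		return err
	}
	report, err := run()
	if err != nil {
		// Hypothetical failure wording; the real report format is not shown here.
		if werr := writeReport(path, "Status: failed\n\n"+err.Error()+"\n"); werr != nil {
			return werr
		}
		return err
	}
	return writeReport(path, report)
}

// demo runs the lifecycle in a temp directory and returns the file content
// seen during the run and after it completes.
func demo() (string, string, error) {
	dir, err := os.MkdirTemp("", "eval-report-")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "report.md")

	var during string
	err = runWithPlaceholder(path, func() (string, error) {
		b, readErr := os.ReadFile(path)
		if readErr != nil {
			return "", readErr
		}
		during = string(b)
		return "# Final metrics (stand-in)\n", nil
	})
	if err != nil {
		return "", "", err
	}
	final, err := os.ReadFile(path)
	if err != nil {
		return "", "", err
	}
	return during, string(final), nil
}

func main() {
	during, final, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("during run: %q\nafter run: %q\n", during, final)
}
```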

Use scripts/eval-compare-version.sh to orchestrate the standard snapshot, token audit, schema dry-run, optional model-backed run, and optional comparison report for one target label.

Safety

--execute-tools can mutate GitLab resources. It requires --backend=gitlab or --mcp-command plus E2E_MODE=docker unless --allow-live-mutations is explicitly set.

Keep reports, traces, snapshots, and fixture state under dist/evaluation/mcp-surfaces/; that directory is ignored by git.

Model-backed trace JSON records the normalized prompt flow plus provider HTTP request/response bodies and MCP CallTool request/response payloads. Provider authentication headers are not serialized; raw trace artifacts remain local and should not be published or committed.

--publish-docs is intentionally separate from normal runs. It consumes reviewed Markdown reports selected with --publish-from. For full GitLab-backed MCP reports without an explicit preset, the publisher also reads the report's local trace-artifact JSONL, which lets it split the table by preset and special partitions. It refuses to publish Docker metrics from GitLab-backed reports that did not use MCP tool execution. Partial Docker preset reports must use a --publish-label containing the word "targeted" so they are not mistaken for full preset results.
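Two of the publishing rules above can be expressed as a small validation function. This is a sketch under assumptions: `reportInfo` and `checkPublishable` are hypothetical, how the real publisher decides that a report is partial is not shown in this README, and the case-sensitive substring match is an assumption. The second and third labels in `main` are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// reportInfo is a hypothetical summary of one reviewed report.
type reportInfo struct {
	GitLabBacked  bool // report came from a GitLab-backed run
	ExecutedTools bool // run used MCP tool execution
	PartialPreset bool // report covers only part of a Docker preset
}

// checkPublishable returns an error when a report must not be published
// under the given label.
func checkPublishable(r reportInfo, label string) error {
	if r.GitLabBacked && !r.ExecutedTools {
		return errors.New("GitLab-backed report did not use MCP tool execution")
	}
	// Assumption: the match is a case-sensitive substring check.
	if r.PartialPreset && !strings.Contains(label, "targeted") {
		return fmt.Errorf("partial preset report needs a label containing %q, got %q", "targeted", label)
	}
	return nil
}

func main() {
	full := reportInfo{GitLabBacked: true, ExecutedTools: true}
	partial := reportInfo{GitLabBacked: true, ExecutedTools: true, PartialPreset: true}

	fmt.Println(checkPublishable(full, "2026-05-05 Docker economy models"))
	fmt.Println(checkPublishable(partial, "2026-05-05 Docker economy models"))
	fmt.Println(checkPublishable(partial, "2026-05-05 targeted Docker read"))
}
```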

Docker live reports include a failure-triage section. It separates MCP implementation bugs, GitLab CE limitations, model route-selection misses, model parameter-shape misses, fixture setup failures, transient GitLab 5xx responses, timeout/resource exhaustion, destructive safety failures, and not-found results. Dynamic-surface reports also separate ranker_miss diagnostics from model discovery and execution failures when ranker-specific notes are available.

Documentation

Overview

Command eval_mcp_surfaces evaluates model behavior across MCP tool surfaces. By default it uses a mock GitLab client for catalog generation; --backend=gitlab points the in-memory MCP server at a real GitLab instance such as the Docker E2E environment.

Usage:

go run ./cmd/eval_mcp_surfaces/
go run ./cmd/eval_mcp_surfaces/ --max-tasks=5
go run ./cmd/eval_mcp_surfaces/ --dry-run
go run ./cmd/eval_mcp_surfaces/ --tools-file /tmp/tools_mcp_surfaces.json
go run ./cmd/eval_mcp_surfaces/ --publish-docs --publish-from dist/evaluation/mcp-surfaces/docker-read.md
