# eval_meta_tools
cmd/eval_meta_tools evaluates whether a model can use the model-facing MCP catalog correctly. It can run schema-only validation, model-backed schema evaluation, and Docker-backed live MCP execution.
## Inputs
| Input | Default | Purpose |
|---|---|---|
| --tasks | cmd/eval_meta_tools/testdata/automated-meta-tool-cases.md | Executable Markdown fixture with MT-*, MS-*, and MF-* rows. |
| --model | empty | Single provider:model string or legacy Anthropic model name. Overrides --models and EVAL_MODELS. |
| --models | empty | Comma-separated provider:model list for local multi-model analysis. Defaults to EVAL_MODELS when --model is not set. |
| --tools-file | empty | Optional saved tools/list snapshot for schema/model comparison. |
| --preset | empty | Optional batch preset: docker-read, docker-mutating-safe, docker-destructive-safe, or schema-enterprise. Explicit flags override preset defaults. |
| --partition | empty | Optional fixture partition such as base-read, enterprise-read, or error-recovery. |
| --coverage-report | empty | Optional Markdown file listing uncovered high-risk routes for the selected run. |
| --compare | empty | Repeatable report path for comparison mode. Accepts token reports from cmd/audit_tokens and evaluation reports from this command. |
| --publish-docs | false | Publish reviewed evaluation reports into managed blocks in README.md and docs/testing/model-results.md. |
| --publish-from | empty | Repeatable reviewed eval_meta_tools report path consumed by --publish-docs or --check-docs. |
| --publish-results-doc | docs/testing/model-results.md | Results document updated by --publish-docs. |
| --publish-readme | README.md | README file updated by --publish-docs. |
| --publish-label | empty | Human-readable result label used in generated documentation. Defaults to the report date. |
| --publish-mode | replace-current | Results block update mode: replace-current or append. The README summary always reflects the currently selected reports. |
| --check-docs | false | Verify that managed documentation blocks match the selected reports without writing files. |
| --mcp-command | empty | External stdio MCP server command used by --execute-tools instead of the in-process current-source server. |
| --mcp-arg | empty | Repeatable argument passed to --mcp-command. |
| --mcp-env-file | empty | Env file passed to the external MCP command, usually .env or test/e2e/.env.docker. |
| --fixtures | dist/evaluation/meta-tools/e2e-fixtures.json | Docker live fixture state generated by --prepare-fixtures. |
| --gitlab-env-file | empty | Env file for --backend=gitlab, usually test/e2e/.env.docker. |
Supported model providers are anthropic, google, openai, and qwen. The evaluator reads ANTHROPIC_API_KEY, GOOGLE_API_KEY, OPENAI_API_KEY, and QWEN_API_KEY only for the providers selected by --model, --models, or EVAL_MODELS. Qwen uses QWEN_API_KEY directly, defaults to the international DashScope OpenAI-compatible endpoint, disables thinking for tool calls, and supports QWEN_BASE_URL or QWEN_CHAT_COMPLETIONS_URL for regional endpoints.
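The provider-to-key mapping above can be sketched as a small shell helper. This is an illustration of the documented mapping only; `required_key_for` is a hypothetical name, not part of the tool:

```shell
# Map a provider:model string to the API key variable the evaluator reads,
# per the provider list above. Helper name is illustrative only.
required_key_for() {
  case "${1%%:*}" in
    anthropic) echo ANTHROPIC_API_KEY ;;
    google)    echo GOOGLE_API_KEY ;;
    openai)    echo OPENAI_API_KEY ;;
    qwen)      echo QWEN_API_KEY ;;
    *)         echo "unsupported provider in: $1" >&2; return 1 ;;
  esac
}

required_key_for "qwen:some-model"   # prints QWEN_API_KEY
```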
## Common Commands
Dry-run the current catalog without model calls:
```sh
GITLAB_ENTERPRISE=false timeout 180s go run ./cmd/eval_meta_tools \
  --dry-run \
  --repeat=1 \
  --out /tmp/eval-dry.md
```
Dry-run a saved snapshot partition and emit uncovered high-risk routes:
```sh
timeout 180s go run ./cmd/eval_meta_tools \
  --tools-file dist/evaluation/meta-tools/snapshots/current/tools.json \
  --dry-run \
  --partition base-read \
  --coverage-report dist/evaluation/meta-tools/snapshots/current/coverage-base-read.md \
  --out dist/evaluation/meta-tools/snapshots/current/schema-base-read.md
```
Run a targeted model-backed schema sample:
```sh
timeout 900s go run ./cmd/eval_meta_tools \
  --model anthropic:claude-sonnet-4-6 \
  --task MS-001,MF-001 \
  --repeat=1 \
  --pause=250ms \
  --retries=8 \
  --retry-wait=65s \
  --out dist/evaluation/meta-tools/schema-sample.md
```
Run the same small local analysis batch across all models configured in .env:
```sh
timeout 1800s go run ./cmd/eval_meta_tools \
  --models "$EVAL_MODELS" \
  --task MT-001,MS-001,MF-001 \
  --repeat=1 \
  --pause=250ms \
  --retries=4 \
  --retry-wait=30s \
  --out dist/evaluation/meta-tools/current-multi-model-smoke.md
```
When neither --model nor --models is provided, EVAL_MODELS from .env is used if present; otherwise the evaluator falls back to the source-defined default model.
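That selection order can be sketched as a small shell function. This is a sketch of the documented precedence only; `resolve_models` is a hypothetical helper, not part of the tool:

```shell
# Model-selection precedence as described above:
# --model beats --models, which beats EVAL_MODELS, which beats the default.
resolve_models() {
  model="$1"; models="$2"; env_models="$3"; default="$4"
  if [ -n "$model" ]; then
    echo "$model"
  elif [ -n "$models" ]; then
    echo "$models"
  elif [ -n "$env_models" ]; then
    echo "$env_models"
  else
    echo "$default"
  fi
}
```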
Prepare Docker fixtures:
```sh
GITLAB_ENTERPRISE=false timeout 600s go run ./cmd/eval_meta_tools \
  --backend=gitlab \
  --gitlab-env-file test/e2e/.env.docker \
  --prepare-fixtures \
  --fixtures-only \
  --fixtures dist/evaluation/meta-tools/e2e-fixtures.json
```
Execute validated model calls against Docker GitLab CE:
```sh
GITLAB_ENTERPRISE=false timeout 900s go run ./cmd/eval_meta_tools \
  --preset docker-read \
  --fixtures dist/evaluation/meta-tools/e2e-fixtures.json \
  --task MS-014,MS-017,MS-020 \
  --out dist/evaluation/meta-tools/live-smoke.md
```
Run the Enterprise schema-only batch:
```sh
GITLAB_ENTERPRISE=true timeout 180s go run ./cmd/eval_meta_tools \
  --preset schema-enterprise \
  --out dist/evaluation/meta-tools/schema-enterprise.md
```
Compare token and evaluation reports:
```sh
timeout 180s go run ./cmd/eval_meta_tools \
  --compare dist/evaluation/meta-tools/snapshots/release-1.5.0/tokens.md \
  --compare dist/evaluation/meta-tools/snapshots/release-1.5.0/schema-base-read.md \
  --compare dist/evaluation/meta-tools/snapshots/current/schema-base-read.md \
  --out dist/evaluation/meta-tools/comparison/version-summary.md
```
Publish reviewed Docker reports into managed documentation blocks:
```sh
timeout 180s go run ./cmd/eval_meta_tools \
  --publish-docs \
  --publish-from dist/evaluation/meta-tools/docker-read-all-models.md \
  --publish-from dist/evaluation/meta-tools/docker-mutating-safe-all-models.md \
  --publish-from dist/evaluation/meta-tools/docker-destructive-safe-all-models.md \
  --publish-label "2026-05-05 Docker economy models"
```
Check the committed docs against the same reviewed reports without writing:
```sh
timeout 180s go run ./cmd/eval_meta_tools \
  --check-docs \
  --publish-from dist/evaluation/meta-tools/docker-read-all-models.md \
  --publish-from dist/evaluation/meta-tools/docker-mutating-safe-all-models.md \
  --publish-from dist/evaluation/meta-tools/docker-destructive-safe-all-models.md \
  --publish-label "2026-05-05 Docker economy models"
```
Run validated calls through an older or separately built stdio MCP server:
```sh
E2E_MODE=docker timeout 900s go run ./cmd/eval_meta_tools \
  --tools-file dist/evaluation/meta-tools/snapshots/release-1.5.0/tools.json \
  --mcp-command dist/evaluation/meta-tools/snapshots/release-1.5.0/gitlab-mcp-server-release-1.5.0 \
  --mcp-env-file test/e2e/.env.docker \
  --execute-tools \
  --use-fixtures \
  --fixtures dist/evaluation/meta-tools/e2e-fixtures.json \
  --task MS-028 \
  --skip-unavailable \
  --out dist/evaluation/meta-tools/snapshots/release-1.5.0/live-ms-028.md
```
The Docker presets apply safe defaults for --backend=gitlab, --gitlab-env-file test/e2e/.env.docker, --execute-tools, --use-fixtures, --skip-unavailable, and the matching partition. Override any of those flags explicitly when debugging a narrower case.
Use scripts/eval-compare-version.sh to orchestrate the standard snapshot, token audit, schema dry-run, optional model-backed run, and optional comparison report for one target label.
## Safety
--execute-tools can mutate GitLab resources. It requires --backend=gitlab or --mcp-command plus E2E_MODE=docker unless --allow-live-mutations is explicitly set.
Keep reports, traces, snapshots, and fixture state under dist/evaluation/meta-tools/; that directory is ignored by git.
--publish-docs is intentionally separate from normal runs. It consumes only reviewed Markdown reports selected with --publish-from, never raw trace JSON, and refuses to publish Docker metrics from GitLab-backed reports that did not use MCP tool execution. Partial Docker preset reports must use a --publish-label containing targeted so they are not mistaken for full preset results.
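The targeted-label rule can be expressed as a small shell guard. This mirrors the documented requirement rather than the actual implementation; `label_ok_for_partial` is a hypothetical name:

```shell
# Partial Docker preset reports must carry a publish label containing
# "targeted" so they are not mistaken for full preset results.
label_ok_for_partial() {
  case "$1" in
    *targeted*) return 0 ;;
    *)          return 1 ;;
  esac
}
```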
Docker live reports include a failure-triage section that separates MCP implementation bugs, GitLab CE limitations, model route-selection misses, model parameter-shape misses, fixture setup failures, transient GitLab 5xx responses, timeout/resource exhaustion, destructive safety failures, and not-found results.
## Overview
Command eval_meta_tools runs the meta-tool description evaluation fixture against model tool calling. By default it uses a mock GitLab client for catalog generation; --backend=gitlab points the in-memory MCP server at a real GitLab instance such as the Docker E2E environment.
Usage:
```sh
go run ./cmd/eval_meta_tools/
go run ./cmd/eval_meta_tools/ --max-tasks=5
go run ./cmd/eval_meta_tools/ --dry-run
go run ./cmd/eval_meta_tools/ --tools-file /tmp/tools_meta.json
go run ./cmd/eval_meta_tools/ --publish-docs --publish-from dist/evaluation/meta-tools/docker-read.md
```