eval_meta_tools

cmd/eval_meta_tools evaluates whether a model can use the model-facing MCP catalog correctly. It can run schema-only validation, model-backed schema evaluation, and Docker-backed live MCP execution.

Inputs

| Input | Default | Purpose |
| --- | --- | --- |
| --tasks | cmd/eval_meta_tools/testdata/automated-meta-tool-cases.md | Executable Markdown fixture with MT-*, MS-*, and MF-* rows. |
| --model | empty | Single provider:model string or legacy Anthropic model name. Overrides --models and EVAL_MODELS. |
| --models | empty | Comma-separated provider:model list for local multi-model analysis. Defaults to EVAL_MODELS when --model is not set. |
| --tools-file | empty | Optional saved tools/list snapshot for schema/model comparison. |
| --preset | empty | Optional batch preset: docker-read, docker-mutating-safe, docker-destructive-safe, or schema-enterprise. Explicit flags override preset defaults. |
| --partition | empty | Optional fixture partition such as base-read, enterprise-read, or error-recovery. |
| --coverage-report | empty | Optional Markdown file listing uncovered high-risk routes for the selected run. |
| --compare | empty | Repeatable report path for comparison mode. Accepts token reports from cmd/audit_tokens and evaluation reports from this command. |
| --publish-docs | false | Publish reviewed evaluation reports into managed blocks in README.md and docs/testing/model-results.md. |
| --publish-from | empty | Repeatable reviewed eval_meta_tools report path consumed by --publish-docs or --check-docs. |
| --publish-results-doc | docs/testing/model-results.md | Results document updated by --publish-docs. |
| --publish-readme | README.md | README file updated by --publish-docs. |
| --publish-label | empty | Human-readable result label used in generated documentation. Defaults to the report date. |
| --publish-mode | replace-current | Results block update mode: replace-current or append. README summary always reflects the current selected reports. |
| --check-docs | false | Verify managed documentation blocks match the selected reports without writing files. |
| --mcp-command | empty | External stdio MCP server command used by --execute-tools instead of the in-process current-source server. |
| --mcp-arg | empty | Repeatable argument passed to --mcp-command. |
| --mcp-env-file | empty | Env file passed to the external MCP command, usually .env or test/e2e/.env.docker. |
| --fixtures | dist/evaluation/meta-tools/e2e-fixtures.json | Docker live fixture state generated by --prepare-fixtures. |
| --gitlab-env-file | empty | Env file for --backend=gitlab, usually test/e2e/.env.docker. |

Supported model providers are anthropic, google, openai, and qwen. The evaluator reads ANTHROPIC_API_KEY, GOOGLE_API_KEY, OPENAI_API_KEY, and QWEN_API_KEY only for the providers selected by --model, --models, or EVAL_MODELS. Qwen uses QWEN_API_KEY directly, defaults to the international DashScope OpenAI-compatible endpoint, disables thinking for tool calls, and supports QWEN_BASE_URL or QWEN_CHAT_COMPLETIONS_URL for regional endpoints.
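
As a sketch of that key selection, a two-provider run only needs the keys for the providers it selects. The qwen model name and the regional endpoint URL below are illustrative assumptions, not values defined by this command:

export ANTHROPIC_API_KEY=...   # read because an anthropic: model is selected
export QWEN_API_KEY=...        # read because a qwen: model is selected
export QWEN_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1   # optional regional override; illustrative
timeout 900s go run ./cmd/eval_meta_tools \
  --models "anthropic:claude-sonnet-4-6,qwen:qwen-max" \
  --task MS-001 \
  --repeat=1 \
  --out /tmp/eval-two-providers.md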

Common Commands

Dry-run the current catalog without model calls:

GITLAB_ENTERPRISE=false timeout 180s go run ./cmd/eval_meta_tools \
  --dry-run \
  --repeat=1 \
  --out /tmp/eval-dry.md

Dry-run a saved snapshot partition and emit uncovered high-risk routes:

timeout 180s go run ./cmd/eval_meta_tools \
  --tools-file dist/evaluation/meta-tools/snapshots/current/tools.json \
  --dry-run \
  --partition base-read \
  --coverage-report dist/evaluation/meta-tools/snapshots/current/coverage-base-read.md \
  --out dist/evaluation/meta-tools/snapshots/current/schema-base-read.md

Run a targeted model-backed schema sample:

timeout 900s go run ./cmd/eval_meta_tools \
  --model anthropic:claude-sonnet-4-6 \
  --task MS-001,MF-001 \
  --repeat=1 \
  --pause=250ms \
  --retries=8 \
  --retry-wait=65s \
  --out dist/evaluation/meta-tools/schema-sample.md

Run the same small local analysis batch across all models configured in .env:

timeout 1800s go run ./cmd/eval_meta_tools \
  --models "$EVAL_MODELS" \
  --task MT-001,MS-001,MF-001 \
  --repeat=1 \
  --pause=250ms \
  --retries=4 \
  --retry-wait=30s \
  --out dist/evaluation/meta-tools/current-multi-model-smoke.md

When neither --model nor --models is provided, the evaluator uses EVAL_MODELS from .env if present; otherwise it falls back to the source-defined default model.
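
A minimal sketch of that fallback, assuming an illustrative EVAL_MODELS value (only anthropic:claude-sonnet-4-6 appears elsewhere in this document; the openai entry is a placeholder):

# .env (or exported in the shell)
EVAL_MODELS=anthropic:claude-sonnet-4-6,openai:gpt-4.1-mini

timeout 900s go run ./cmd/eval_meta_tools \
  --task MT-001,MS-001 \
  --repeat=1 \
  --out /tmp/eval-env-default-models.md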

Prepare Docker fixtures:

GITLAB_ENTERPRISE=false timeout 600s go run ./cmd/eval_meta_tools \
  --backend=gitlab \
  --gitlab-env-file test/e2e/.env.docker \
  --prepare-fixtures \
  --fixtures-only \
  --fixtures dist/evaluation/meta-tools/e2e-fixtures.json

Execute validated model calls against Docker GitLab CE:

GITLAB_ENTERPRISE=false timeout 900s go run ./cmd/eval_meta_tools \
  --preset docker-read \
  --fixtures dist/evaluation/meta-tools/e2e-fixtures.json \
  --task MS-014,MS-017,MS-020 \
  --out dist/evaluation/meta-tools/live-smoke.md

Run the Enterprise schema-only batch:

GITLAB_ENTERPRISE=true timeout 180s go run ./cmd/eval_meta_tools \
  --preset schema-enterprise \
  --out dist/evaluation/meta-tools/schema-enterprise.md

Compare token and evaluation reports:

timeout 180s go run ./cmd/eval_meta_tools \
  --compare dist/evaluation/meta-tools/snapshots/release-1.5.0/tokens.md \
  --compare dist/evaluation/meta-tools/snapshots/release-1.5.0/schema-base-read.md \
  --compare dist/evaluation/meta-tools/snapshots/current/schema-base-read.md \
  --out dist/evaluation/meta-tools/comparison/version-summary.md

Publish reviewed Docker reports into managed documentation blocks:

timeout 180s go run ./cmd/eval_meta_tools \
  --publish-docs \
  --publish-from dist/evaluation/meta-tools/docker-read-all-models.md \
  --publish-from dist/evaluation/meta-tools/docker-mutating-safe-all-models.md \
  --publish-from dist/evaluation/meta-tools/docker-destructive-safe-all-models.md \
  --publish-label "2026-05-05 Docker economy models"

Check the committed docs against the same reviewed reports without writing:

timeout 180s go run ./cmd/eval_meta_tools \
  --check-docs \
  --publish-from dist/evaluation/meta-tools/docker-read-all-models.md \
  --publish-from dist/evaluation/meta-tools/docker-mutating-safe-all-models.md \
  --publish-from dist/evaluation/meta-tools/docker-destructive-safe-all-models.md \
  --publish-label "2026-05-05 Docker economy models"

Run validated calls through an older or separately built stdio MCP server:

E2E_MODE=docker timeout 900s go run ./cmd/eval_meta_tools \
  --tools-file dist/evaluation/meta-tools/snapshots/release-1.5.0/tools.json \
  --mcp-command dist/evaluation/meta-tools/snapshots/release-1.5.0/gitlab-mcp-server-release-1.5.0 \
  --mcp-env-file test/e2e/.env.docker \
  --execute-tools \
  --use-fixtures \
  --fixtures dist/evaluation/meta-tools/e2e-fixtures.json \
  --task MS-028 \
  --skip-unavailable \
  --out dist/evaluation/meta-tools/snapshots/release-1.5.0/live-ms-028.md

The Docker presets apply safe defaults for --backend=gitlab, --gitlab-env-file test/e2e/.env.docker, --execute-tools, --use-fixtures, --skip-unavailable, and the matching partition. Override any of those flags explicitly when debugging a narrower case.
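
As a rough sketch of what those defaults mean, the earlier docker-read example is approximately equivalent to spelling the flags out by hand; the base-read partition pairing is an assumption here, not a documented mapping:

GITLAB_ENTERPRISE=false timeout 900s go run ./cmd/eval_meta_tools \
  --backend=gitlab \
  --gitlab-env-file test/e2e/.env.docker \
  --execute-tools \
  --use-fixtures \
  --skip-unavailable \
  --partition base-read \
  --fixtures dist/evaluation/meta-tools/e2e-fixtures.json \
  --task MS-014,MS-017,MS-020 \
  --out dist/evaluation/meta-tools/live-smoke.md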

Use scripts/eval-compare-version.sh to orchestrate the standard snapshot, token audit, schema dry-run, optional model-backed run, and optional comparison report for one target label.

Safety

--execute-tools can mutate GitLab resources. It requires --backend=gitlab or --mcp-command plus E2E_MODE=docker unless --allow-live-mutations is explicitly set.

Keep reports, traces, snapshots, and fixture state under dist/evaluation/meta-tools/; that directory is ignored by git.

--publish-docs is intentionally separate from normal runs. It consumes only reviewed Markdown reports selected with --publish-from, never raw trace JSON, and refuses to publish Docker metrics from GitLab-backed reports that did not use MCP tool execution. Partial Docker preset reports must use a --publish-label containing the word "targeted" so they are not mistaken for full preset results.
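
For example, publishing a partial docker-read run might look like this; the report path and label are hypothetical and only illustrate the required "targeted" wording:

timeout 180s go run ./cmd/eval_meta_tools \
  --publish-docs \
  --publish-from dist/evaluation/meta-tools/docker-read-targeted-ms-014.md \
  --publish-label "2026-05-05 Docker read preset, targeted MS-014 subset"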

Docker live reports include a failure-triage section that separates MCP implementation bugs, GitLab CE limitations, model route-selection misses, model parameter-shape misses, fixture setup failures, transient GitLab 5xx responses, timeout/resource exhaustion, destructive safety failures, and not-found results.

Documentation

Overview

Command eval_meta_tools runs the meta-tool description evaluation fixture against model tool calling. By default it uses a mock GitLab client for catalog generation; --backend=gitlab points the in-memory MCP server at a real GitLab instance such as the Docker E2E environment.

Usage:

go run ./cmd/eval_meta_tools/
go run ./cmd/eval_meta_tools/ --max-tasks=5
go run ./cmd/eval_meta_tools/ --dry-run
go run ./cmd/eval_meta_tools/ --tools-file /tmp/tools_meta.json
go run ./cmd/eval_meta_tools/ --publish-docs --publish-from dist/evaluation/meta-tools/docker-read.md
