eval_meta_tools

cmd/eval_meta_tools evaluates whether a model can use the model-facing MCP catalog correctly. It can run schema-only validation, model-backed schema evaluation, and Docker-backed live MCP execution.

Inputs

| Input | Default | Purpose |
| --- | --- | --- |
| --tasks | cmd/eval_meta_tools/testdata/automated-meta-tool-cases.md | Executable Markdown fixture with MT-*, MS-*, and MF-* rows. |
| --model | empty | Single provider:model string or legacy Anthropic model name. Overrides --models and EVAL_MODELS. |
| --models | empty | Comma-separated provider:model list for local multi-model analysis. Defaults to EVAL_MODELS when --model is not set. |
| --tools-file | empty | Optional saved tools/list snapshot for schema/model comparison. |
| --preset | empty | Optional batch preset: docker-read, docker-mutating-safe, docker-destructive-safe, or schema-enterprise. Explicit flags override preset defaults. |
| --partition | empty | Optional fixture partition such as base-read, enterprise-read, or error-recovery. |
| --coverage-report | empty | Optional Markdown file listing uncovered high-risk routes for the selected run. |
| --compare | empty | Repeatable report path for comparison mode. Accepts token reports from cmd/audit_tokens and evaluation reports from this command. |
| --publish-docs | false | Publish reviewed evaluation reports into managed blocks in README.md and docs/testing/model-results.md. |
| --publish-from | empty | Repeatable reviewed eval_meta_tools report path consumed by --publish-docs or --check-docs. |
| --publish-results-doc | docs/testing/model-results.md | Results document updated by --publish-docs. |
| --publish-readme | README.md | README file updated by --publish-docs. |
| --publish-label | empty | Human-readable result label used in generated documentation. Defaults to the report date. |
| --publish-mode | replace-current | Results block update mode: replace-current or append. README summary always reflects the current selected reports. |
| --check-docs | false | Verify managed documentation blocks match the selected reports without writing files. |
| --mcp-command | empty | External stdio MCP server command used by --execute-tools instead of the in-process current-source server. |
| --mcp-arg | empty | Repeatable argument passed to --mcp-command. |
| --mcp-env-file | empty | Env file passed to the external MCP command, usually .env or test/e2e/.env.docker. |
| --fixtures | dist/evaluation/meta-tools/e2e-fixtures.json | Docker live fixture state generated by --prepare-fixtures. |
| --gitlab-env-file | empty | Env file for --backend=gitlab, usually test/e2e/.env.docker. |

Supported model providers are anthropic, google, openai, and qwen. The evaluator reads ANTHROPIC_API_KEY, GOOGLE_API_KEY, OPENAI_API_KEY, and QWEN_API_KEY only for the providers selected by --model, --models, or EVAL_MODELS. Qwen uses QWEN_API_KEY directly, defaults to the international DashScope OpenAI-compatible endpoint, disables thinking for tool calls, and supports QWEN_BASE_URL or QWEN_CHAT_COMPLETIONS_URL for regional endpoints.
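
As a sketch of that key selection, a two-provider run only needs the keys for the providers it selects. The qwen model name and the regional endpoint URL below are illustrative assumptions, not values defined by this command:

export ANTHROPIC_API_KEY=...   # read because an anthropic: model is selected
export QWEN_API_KEY=...        # read because a qwen: model is selected
export QWEN_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1   # optional regional override; illustrative
timeout 900s go run ./cmd/eval_meta_tools \
  --models "anthropic:claude-sonnet-4-6,qwen:qwen-max" \
  --task MS-001 \
  --repeat=1 \
  --out /tmp/eval-two-providers.md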

Common Commands

Dry-run the current catalog without model calls:

GITLAB_ENTERPRISE=false timeout 180s go run ./cmd/eval_meta_tools \
  --dry-run \
  --repeat=1 \
  --out /tmp/eval-dry.md

Dry-run a saved snapshot partition and emit uncovered high-risk routes:

timeout 180s go run ./cmd/eval_meta_tools \
  --tools-file dist/evaluation/meta-tools/snapshots/current/tools.json \
  --dry-run \
  --partition base-read \
  --coverage-report dist/evaluation/meta-tools/snapshots/current/coverage-base-read.md \
  --out dist/evaluation/meta-tools/snapshots/current/schema-base-read.md

Run a targeted model-backed schema sample:

timeout 900s go run ./cmd/eval_meta_tools \
  --model anthropic:claude-sonnet-4-6 \
  --task MS-001,MF-001 \
  --repeat=1 \
  --pause=250ms \
  --retries=8 \
  --retry-wait=65s \
  --out dist/evaluation/meta-tools/schema-sample.md

Run the same small local analysis batch across all models configured in .env:

timeout 1800s go run ./cmd/eval_meta_tools \
  --models "$EVAL_MODELS" \
  --task MT-001,MS-001,MF-001 \
  --repeat=1 \
  --pause=250ms \
  --retries=4 \
  --retry-wait=30s \
  --out dist/evaluation/meta-tools/current-multi-model-smoke.md

When neither --model nor --models is provided, the evaluator uses EVAL_MODELS from .env if present; otherwise it falls back to the source-defined default model.
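
A minimal sketch of that fallback, assuming an illustrative EVAL_MODELS value (only anthropic:claude-sonnet-4-6 appears elsewhere in this document; the openai entry is a placeholder):

# .env (or exported in the shell)
EVAL_MODELS=anthropic:claude-sonnet-4-6,openai:gpt-4.1-mini

timeout 900s go run ./cmd/eval_meta_tools \
  --task MT-001,MS-001 \
  --repeat=1 \
  --out /tmp/eval-env-default-models.md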

Prepare Docker fixtures:

GITLAB_ENTERPRISE=false timeout 600s go run ./cmd/eval_meta_tools \
  --backend=gitlab \
  --gitlab-env-file test/e2e/.env.docker \
  --prepare-fixtures \
  --fixtures-only \
  --fixtures dist/evaluation/meta-tools/e2e-fixtures.json

Execute validated model calls against Docker GitLab CE:

GITLAB_ENTERPRISE=false timeout 900s go run ./cmd/eval_meta_tools \
  --preset docker-read \
  --fixtures dist/evaluation/meta-tools/e2e-fixtures.json \
  --task MS-014,MS-017,MS-020 \
  --out dist/evaluation/meta-tools/live-smoke.md

Run the Enterprise schema-only batch:

GITLAB_ENTERPRISE=true timeout 180s go run ./cmd/eval_meta_tools \
  --preset schema-enterprise \
  --out dist/evaluation/meta-tools/schema-enterprise.md

Compare token and evaluation reports:

timeout 180s go run ./cmd/eval_meta_tools \
  --compare dist/evaluation/meta-tools/snapshots/release-1.5.0/tokens.md \
  --compare dist/evaluation/meta-tools/snapshots/release-1.5.0/schema-base-read.md \
  --compare dist/evaluation/meta-tools/snapshots/current/schema-base-read.md \
  --out dist/evaluation/meta-tools/comparison/version-summary.md

Publish reviewed Docker reports into managed documentation blocks:

timeout 180s go run ./cmd/eval_meta_tools \
  --publish-docs \
  --publish-from dist/evaluation/meta-tools/docker-read-all-models.md \
  --publish-from dist/evaluation/meta-tools/docker-mutating-safe-all-models.md \
  --publish-from dist/evaluation/meta-tools/docker-destructive-safe-all-models.md \
  --publish-label "2026-05-05 Docker economy models"

Check the committed docs against the same reviewed reports without writing:

timeout 180s go run ./cmd/eval_meta_tools \
  --check-docs \
  --publish-from dist/evaluation/meta-tools/docker-read-all-models.md \
  --publish-from dist/evaluation/meta-tools/docker-mutating-safe-all-models.md \
  --publish-from dist/evaluation/meta-tools/docker-destructive-safe-all-models.md \
  --publish-label "2026-05-05 Docker economy models"

Run validated calls through an older or separately built stdio MCP server:

E2E_MODE=docker timeout 900s go run ./cmd/eval_meta_tools \
  --tools-file dist/evaluation/meta-tools/snapshots/release-1.5.0/tools.json \
  --mcp-command dist/evaluation/meta-tools/snapshots/release-1.5.0/gitlab-mcp-server-release-1.5.0 \
  --mcp-env-file test/e2e/.env.docker \
  --execute-tools \
  --use-fixtures \
  --fixtures dist/evaluation/meta-tools/e2e-fixtures.json \
  --task MS-028 \
  --skip-unavailable \
  --out dist/evaluation/meta-tools/snapshots/release-1.5.0/live-ms-028.md

The Docker presets apply safe defaults for --backend=gitlab, --gitlab-env-file test/e2e/.env.docker, --execute-tools, --use-fixtures, --skip-unavailable, and the matching partition. Override any of those flags explicitly when debugging a narrower case.
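
As a rough sketch of what those defaults mean, the earlier docker-read example is approximately equivalent to spelling the flags out by hand; the base-read partition pairing is an assumption here, not a documented mapping:

GITLAB_ENTERPRISE=false timeout 900s go run ./cmd/eval_meta_tools \
  --backend=gitlab \
  --gitlab-env-file test/e2e/.env.docker \
  --execute-tools \
  --use-fixtures \
  --skip-unavailable \
  --partition base-read \
  --fixtures dist/evaluation/meta-tools/e2e-fixtures.json \
  --task MS-014,MS-017,MS-020 \
  --out dist/evaluation/meta-tools/live-smoke.md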

Use scripts/eval-compare-version.sh to orchestrate the standard snapshot, token audit, schema dry-run, optional model-backed run, and optional comparison report for one target label.

Safety

--execute-tools can mutate GitLab resources. It requires --backend=gitlab or --mcp-command plus E2E_MODE=docker unless --allow-live-mutations is explicitly set.

Keep reports, traces, snapshots, and fixture state under dist/evaluation/meta-tools/; that directory is ignored by git.

--publish-docs is intentionally separate from normal runs. It consumes only reviewed Markdown reports selected with --publish-from, never raw trace JSON, and refuses to publish Docker metrics from GitLab-backed reports that did not use MCP tool execution. Partial Docker preset reports must use a --publish-label containing the word "targeted" so they are not mistaken for full preset results.
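
For example, publishing a partial docker-read run might look like this; the report path and label are hypothetical and only illustrate the required "targeted" wording:

timeout 180s go run ./cmd/eval_meta_tools \
  --publish-docs \
  --publish-from dist/evaluation/meta-tools/docker-read-targeted-ms-014.md \
  --publish-label "2026-05-05 Docker read preset, targeted MS-014 subset"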

Docker live reports include a failure-triage section that separates MCP implementation bugs, GitLab CE limitations, model route-selection misses, model parameter-shape misses, fixture setup failures, transient GitLab 5xx responses, timeout/resource exhaustion, destructive safety failures, and not-found results.

Documentation

Overview

Command eval_meta_tools runs the meta-tool description evaluation fixture against model tool calling. By default it uses a mock GitLab client for catalog generation; --backend=gitlab points the in-memory MCP server at a real GitLab instance such as the Docker E2E environment.

Usage:

go run ./cmd/eval_meta_tools/
go run ./cmd/eval_meta_tools/ --max-tasks=5
go run ./cmd/eval_meta_tools/ --dry-run
go run ./cmd/eval_meta_tools/ --tools-file /tmp/tools_meta.json
go run ./cmd/eval_meta_tools/ --publish-docs --publish-from dist/evaluation/meta-tools/docker-read.md
