bench

package
v0.45.0
Published: Jun 23, 2026 License: MIT Imports: 16 Imported by: 0

README

mcpproxy benchmark harness

The reproducible numbers behind mcpproxy's marketing claims — token reduction, discovery accuracy, and latency — comparing three ways an agent can be wired to upstream MCP tools.

Roadmap item #19 (MCP-42). In-repo (bench/), reproducible, intended to be refreshed on release. Reports are never committed (Spec 065 CN-003); only code, fixtures, and this methodology are versioned.

The three modes

| Mode | What the agent sees in context | mcpproxy server |
|---|---|---|
| baseline | Every upstream tool definition, loaded directly | (no proxy discovery) |
| retrieve_tools | retrieve_tools + call_tool_read/write/destructive + read_cache + code_execution + management tools; tools found on demand via BM25 | callToolServer |
| code_execution | code_execution + retrieve_tools + management tools; many tools orchestrated from sandboxed JS in one round-trip | codeExecServer |

Both proxy modes also append the shared management tool set (upstream_servers, quarantine_security, search_servers, list_registries) that the live routing-mode servers expose. These count against the proxy context cost: omitting them undercounts that cost and inflates the savings.

The per-mode catalog is derived directly from the live tool builders (buildCallToolModeTools / buildCodeExecModeTools in internal/server/mcp_routing.go, via server.ProxyModeToolDefs), so it can never drift from production.

What ships today (deterministic, offline)

The token-reduction measurement is fully deterministic and runs with no network or LLM:

go run ./bench/cmd/bench            # scores the committed Spec 065 corpus
go test ./bench/                    # unit + invariant tests

It counts the context-token cost of each mode over a frozen tool corpus and reports the savings of each proxy mode versus the baseline. Output: a report.json and a self-contained dashboard.html in bench/results/ (gitignored).

Current deterministic result

Over the 45-tool Spec 065 reference corpus, counting tool name + description only (schemas are excluded uniformly; see "Known limitations") and tokenizing with cl100k_base:

| Mode | Context tools | Tokens | Savings vs. baseline |
|---|---|---|---|
| baseline | 45 | 1730 | |
| retrieve_tools | 10 | 1431 | ~17% |
| code_execution | 6 | 986 | ~43% |

These are deliberately modest: the proxy context here is the full per-mode tool set (discovery + call-tool variants + management tools), and the corpus is small. Savings grow toward the asymptote as the upstream tool count rises (the baseline grows linearly while the proxy context stays fixed) — always quote the corpus size alongside a percentage. Reproduce with go run ./bench/cmd/bench.

Scoring rubric — token reduction
  • Tool universe: the frozen Spec 065 snapshot specs/065-evaluation-foundation/datasets/corpus_v1.tools.json — 45 tools across 7 no-auth reference servers. Frozen + versioned so scoring never runs against a drifting corpus (CN-002).
  • Tokenizer: tiktoken cl100k_base, a widely-used reproducible BPE (already a repo dependency). It is a model-agnostic estimator; exact counts for a specific pinned model (e.g. Claude) will differ, but the relative savings between modes are stable.
  • Proxy-mode tools: the complete per-mode catalog, derived from the live server builders — discovery, the call-tool variants, code_execution, and the shared management tool set (upstream_servers, quarantine_security, search_servers, list_registries). Nothing the agent actually sees is dropped from the proxy cost.
  • Cost of a tool: name + "\n" + description. JSON input schemas are excluded uniformly across all modes (the committed corpus snapshot does not carry schemas).
  • Savings for a mode m: 1 - tokens(m) / tokens(baseline).
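
As a quick check of the savings formula, the sketch below recomputes the savings column from the token totals in the "Current deterministic result" table. The function name savings is illustrative and is not part of the bench package; the zero-baseline guard is an assumption of this sketch.

```go
package main

import "fmt"

// savings applies the rubric's formula: 1 - tokens(m) / tokens(baseline).
// It returns 0 for a non-positive baseline to avoid dividing by zero
// (an assumption of this sketch, not documented behavior).
func savings(modeTokens, baselineTokens int) float64 {
	if baselineTokens <= 0 {
		return 0
	}
	return 1 - float64(modeTokens)/float64(baselineTokens)
}

func main() {
	const baseline = 1730 // 45-tool corpus, name + description only
	fmt.Printf("retrieve_tools: %.1f%%\n", 100*savings(1431, baseline)) // 17.3%
	fmt.Printf("code_execution: %.1f%%\n", 100*savings(986, baseline))  // 43.0%
}
```

Both values round to the ~17% and ~43% quoted in the table.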

Known limitations (read before quoting a number)
  • Schemas excluded — direction is not clean. Input schemas are dropped from both sides. The 45 baseline tools lose their schemas, but so do the proxy modes' management tools (e.g. upstream_servers carries a large multi-field schema). So the name+description-only number is not unambiguously conservative — it is its own well-defined metric. The live run below adds full schemas from GET /api/v1/tools for the exact headline number; quote that for marketing, not this offline estimate.
  • Savings scale with tool count. The 45-tool reference corpus is small; real deployments expose hundreds–thousands of tools, where the baseline grows linearly and the proxy context stays fixed, so savings approach the asymptote. Quote the corpus size alongside any percentage.
  • cl100k_base ≠ the pinned model's tokenizer. Pinning the exact tokenizer for the headline model is tracked as a follow-up (see "Roadmap").
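
The scaling limitation can be made explicit under a simplifying assumption that the README does not state: every upstream tool costs roughly the same average number of tokens $\bar{c}$, and the proxy context costs a fixed $P$ tokens regardless of the upstream tool count $N$. The symbols $P$, $N$, and $\bar{c}$ are introduced here for illustration only.

```latex
S(N) \;=\; 1 - \frac{\text{tokens}(m)}{\text{tokens}(\text{baseline})}
     \;=\; 1 - \frac{P}{N\,\bar{c}},
\qquad
\lim_{N \to \infty} S(N) = 1 .
```

With the offline numbers for retrieve_tools ($P = 1431$, $N = 45$, $\bar{c} = 1730/45 \approx 38.4$) this reproduces $S(45) \approx 0.17$. Values for larger $N$ obtained this way are extrapolations, not measurements, which is why the corpus size should accompany any quoted percentage.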

Live run — full schemas + accuracy + latency

The live run boots mcpproxy over the Spec 065 reference-server config and measures the three headline claims against a running proxy. Everything here is still deterministic and LLM-free.

# 1. Boot the reproducible substrate (proxy + 7 no-auth reference servers)
docker compose -f bench/docker-compose.yml up --build -d

# 2. Score against the running proxy (writes bench/results/live_report.json)
go run ./bench/cmd/bench -live -proxy http://127.0.0.1:8092 -api-key eval-corpus-snapshot

What it adds over the offline token run:

  • Exact token number (full schemas). Pulls GET /api/v1/tools for the upstream tools with their full JSON input schemas and counts them against the proxy modes, whose management-tool schemas come from the same live builders as the offline run (server.ProxyModeToolDefs). Because schemas are counted on both sides, the resulting savings figure is authoritative.
    • Safety valve (MCP-3161): if any proxy tool is missing a schema, counting the baseline's schemas alone would overstate savings, so the run withholds the headline % and reports raw token totals only (authoritative_headline: false). Never quote a withheld run.
  • Accuracy. Replays retrieval_golden_v1.json through the proxy's BM25 search (GET /api/v1/index/search) and scores Recall@{1,3,5,10}, MRR, nDCG@10, MAP against the graded labels. Deterministic (BM25), so a single run is reported (runs_averaged: 1). The emitted retrieval block conforms to the Spec 065 score-report.schema.json shape — nested metrics + gate (verified by a schema-validation test). A standalone live run has no stored baseline to regress against, so gate.passed is true by construction; CI regression-gating against a committed baseline is the MCP-3133 lane.
  • Latency. Client-measured per-query search latency (p50/p95/p99/max) vs. the one-shot cost of loading all tools. Measured client-side on purpose: the server's SearchToolsResponse.took field is currently a "0ms" stub.

What is scoped but not yet built (follow-ups)

These require decisions and/or other roles, so they are tracked as child issues rather than landed here:

  • End-to-end task success with a pinned LLM — requires a pinned model + an LLM-call budget; this is the only part that costs spend.
  • CI publish-on-release-tag → public static dashboard — Release/DevOps lane.

Dataset sources & provenance

  • Tool corpus + retrieval golden set: Spec 065 frozen datasets (specs/065-evaluation-foundation/datasets/), generated from 7 permissively reachable no-auth reference servers (filesystem, git, memory, sqlite, fetch, time, sequential-thinking).
  • Proxy + management tool definitions: derived at run time from the live server tool builders (buildCallToolModeTools / buildCodeExecModeTools in internal/server/mcp_routing.go, exposed via internal/server.ProxyModeToolDefs). There is no hand-maintained fixture, so the benchmark cannot drift from the tools the proxy actually serves.

Reproducible live run

docker-compose.yml boots mcpproxy over the frozen reference-server config so the corpus and live tool list are reproducible across machines. The live accuracy/latency/full-schema scorers attach to it via -live (see "Live run" above). Pin the upstream-server images before publishing headline numbers (image drift can change the tool corpus).

Reviewer contact

Methodology questions / disputes: open an issue in smart-mcp-proxy/mcpproxy-go and tag the maintainers, or comment on the roadmap benchmark ticket (MCP-42).

Documentation

Overview

Package bench is the mcpproxy benchmark harness (roadmap #19 / MCP-42).

It produces the reproducible numbers behind mcpproxy's marketing claims — token reduction, discovery accuracy, and latency — by comparing three ways an agent can be wired to upstream MCP tools:

  • baseline: every upstream tool definition is loaded directly into the agent's context (no proxy discovery).
  • retrieve_tools: only mcpproxy's discovery + call_tool variants occupy the context; tools are found on demand via BM25 search.
  • code_execution: only code_execution + retrieve_tools occupy the context; the agent orchestrates many tools from sandboxed JS in one round-trip.

The token-reduction measurement in this file is fully deterministic and offline: it counts the context cost of each mode over a frozen tool corpus using the tiktoken cl100k_base encoding (a reproducible, model-agnostic estimator). It reuses the Spec 065 frozen corpus (specs/065-evaluation-foundation/datasets/corpus_v1.tools.json) as its tool universe so the benchmark scores a versioned, non-drifting snapshot (CN-002).

Methodology, limitations, and the live (docker-compose) run that adds full JSON input schemas and end-to-end accuracy/latency are documented in bench/README.md.

Constants

const (
	ModeBaseline      = "baseline"
	ModeRetrieveTools = "retrieve_tools"
	ModeCodeExecution = "code_execution"
)

Routing modes the benchmark compares. The mode names mirror the mcpproxy MCP servers in internal/server/mcp.go (codeExecServer, callToolServer).

const DefaultEncoding = "cl100k_base"

DefaultEncoding is the tiktoken encoding used for token estimation. cl100k_base is a widely-used, reproducible BPE; exact counts for a specific pinned model (e.g. Claude) will differ, but the *relative* savings between modes are stable.

Variables

This section is empty.

Functions

func AveragePrecision

func AveragePrecision(ranked []string, labels []Label) float64

AveragePrecision is the sum of the precision values computed at each rank where a relevant tool is retrieved, divided by the total number of relevant tools (so unretrieved relevant tools lower the score). Binary relevance (relevance >= 1) is used, matching the standard MAP definition.
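
A minimal sketch of this definition follows. It is not the package's source: the lower-case name averagePrecision, the tool IDs, and the choice to return 0 for a query with no relevant tools are assumptions, and the ranking is assumed to contain no duplicate IDs.

```go
package main

import "fmt"

// Label mirrors the documented struct: a graded relevance for one tool.
type Label struct {
	ToolID    string
	Relevance int
}

// averagePrecision sums precision@rank at every rank that holds a relevant
// tool (relevance >= 1) and divides by the total number of relevant tools.
// Assumes ranked contains no duplicate tool IDs.
func averagePrecision(ranked []string, labels []Label) float64 {
	relevant := make(map[string]bool, len(labels))
	for _, l := range labels {
		if l.Relevance >= 1 {
			relevant[l.ToolID] = true
		}
	}
	if len(relevant) == 0 {
		return 0 // assumption: a degenerate query scores 0
	}
	hits, sum := 0, 0.0
	for i, id := range ranked {
		if relevant[id] {
			hits++
			sum += float64(hits) / float64(i+1)
		}
	}
	return sum / float64(len(relevant))
}

func main() {
	// Hypothetical labels: three relevant tools, two of them retrieved.
	labels := []Label{{"a", 2}, {"b", 1}, {"c", 1}}
	// Precision is 1/1 at rank 1 and 2/3 at rank 3; (1 + 2/3) / 3 = 0.5556.
	fmt.Printf("%.4f\n", averagePrecision([]string{"a", "x", "b"}, labels))
}
```

The unretrieved tool "c" still counts in the denominator, which is what pulls the score below the mean of the two precision values.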

func NDCGAtK

func NDCGAtK(ranked []string, labels []Label, k int) float64

NDCGAtK is the normalized discounted cumulative gain at k using the graded relevance as the gain (linear gain, log2 position discount). 1.0 means the ranking is in ideal (relevance-descending) order; 0 means no gain in top-k.
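
The sketch below illustrates linear gain with a log2 position discount. It is an illustration, not the package's source: the name ndcgAtK, the helper dcg, and the way the ideal ranking is built from the labels alone are assumptions.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Label mirrors the documented struct: a graded relevance for one tool.
type Label struct {
	ToolID    string
	Relevance int
}

// dcg sums gain / log2(position + 1) over the first k positions (1-based).
func dcg(gains []int, k int) float64 {
	total := 0.0
	for i, g := range gains {
		if i >= k {
			break
		}
		total += float64(g) / math.Log2(float64(i+2))
	}
	return total
}

// ndcgAtK divides the DCG of the ranking by the DCG of the ideal
// (relevance-descending) ordering of the labels. Unlabelled tools gain 0.
func ndcgAtK(ranked []string, labels []Label, k int) float64 {
	rel := make(map[string]int, len(labels))
	ideal := make([]int, 0, len(labels))
	for _, l := range labels {
		rel[l.ToolID] = l.Relevance
		ideal = append(ideal, l.Relevance)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(ideal)))
	idcg := dcg(ideal, k)
	if idcg == 0 {
		return 0 // nothing relevant to rank
	}
	gains := make([]int, len(ranked))
	for i, id := range ranked {
		gains[i] = rel[id]
	}
	return dcg(gains, k) / idcg
}

func main() {
	labels := []Label{{"a", 2}, {"b", 1}} // hypothetical tool IDs
	fmt.Printf("ideal:   %.4f\n", ndcgAtK([]string{"a", "b"}, labels, 10))
	fmt.Printf("swapped: %.4f\n", ndcgAtK([]string{"b", "a"}, labels, 10))
}
```

Swapping the primary and related tools lowers the score because the larger gain is discounted at rank 2.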

func RecallAtK

func RecallAtK(ranked []string, labels []Label, k int) float64

RecallAtK is the fraction of the query's relevant tools (relevance >= 1) that appear in the top-k of the ranking. Returns 0 when there are no relevant tools (a degenerate query that should not be scored).
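
A minimal sketch of this behavior, including the zero result for a query with no relevant tools. The name recallAtK, the tool IDs, and the clamping of k to the ranking length are assumptions of the sketch rather than documented details.

```go
package main

import "fmt"

// Label mirrors the documented struct: a graded relevance for one tool.
type Label struct {
	ToolID    string
	Relevance int
}

// recallAtK returns the fraction of relevant tools (relevance >= 1)
// found in the first k entries of ranked.
func recallAtK(ranked []string, labels []Label, k int) float64 {
	relevant := make(map[string]bool, len(labels))
	for _, l := range labels {
		if l.Relevance >= 1 {
			relevant[l.ToolID] = true
		}
	}
	if len(relevant) == 0 {
		return 0 // degenerate query: nothing to recall
	}
	if k < 0 {
		k = 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	found := 0
	for _, id := range ranked[:k] {
		if relevant[id] {
			found++
		}
	}
	return float64(found) / float64(len(relevant))
}

func main() {
	labels := []Label{{"a", 2}, {"b", 1}, {"c", 1}} // hypothetical tool IDs
	ranked := []string{"a", "x", "b"}
	for _, k := range []int{1, 3, 5, 10} {
		fmt.Printf("Recall@%d = %.3f\n", k, recallAtK(ranked, labels, k))
	}
}
```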

func ReciprocalRank

func ReciprocalRank(ranked []string, labels []Label) float64

ReciprocalRank is 1/rank of the first relevant tool in the ranking, or 0 if none of the ranked tools are relevant.
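
For illustration, a sketch of the same rule. The name reciprocalRank and the tool IDs are hypothetical, and relevance >= 1 is assumed to be the relevance threshold, as for the other metrics.

```go
package main

import "fmt"

// Label mirrors the documented struct: a graded relevance for one tool.
type Label struct {
	ToolID    string
	Relevance int
}

// reciprocalRank returns 1/rank (1-based) of the first relevant tool,
// or 0 when no ranked tool is relevant.
func reciprocalRank(ranked []string, labels []Label) float64 {
	relevant := make(map[string]bool, len(labels))
	for _, l := range labels {
		if l.Relevance >= 1 {
			relevant[l.ToolID] = true
		}
	}
	for i, id := range ranked {
		if relevant[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

func main() {
	labels := []Label{{"a", 2}, {"b", 1}} // hypothetical tool IDs
	fmt.Println(reciprocalRank([]string{"x", "b", "a"}, labels)) // 0.5
	fmt.Println(reciprocalRank([]string{"x", "y"}, labels))      // 0
}
```

MRR is then the mean of this value over all golden queries.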

Types

type Corpus

type Corpus struct {
	Version string `json:"version"`
	Tools   []Tool `json:"tools"`
}

Corpus is a frozen, versioned set of tool definitions.

func LoadCorpus

func LoadCorpus(path string) (*Corpus, error)

LoadCorpus reads a frozen corpus snapshot (e.g. the Spec 065 corpus_v1.tools.json) from disk.

type GoldenQuery

type GoldenQuery struct {
	ID     string  `json:"id"`
	Query  string  `json:"query"`
	Labels []Label `json:"labels"`
}

GoldenQuery is one labelled query -> relevant-tool(s) judgement.

type GoldenSet

type GoldenSet struct {
	CorpusVersion string        `json:"corpus_version"`
	Queries       []GoldenQuery `json:"queries"`
}

GoldenSet is the frozen Spec 065 retrieval golden set (retrieval_golden_v1.json).

func LoadGoldenSet

func LoadGoldenSet(path string) (*GoldenSet, error)

LoadGoldenSet reads the Spec 065 retrieval golden set (retrieval_golden_v1.json) from disk.

type Label

type Label struct {
	ToolID    string `json:"tool_id"`
	Relevance int    `json:"relevance"`
}

Label is a graded relevance judgement for one tool against one query, taken from the Spec 065 retrieval golden set (relevance 2 = primary, 1 = related, 0 / absent = irrelevant).

type LatencyReport

type LatencyReport struct {
	Samples        int     `json:"samples"`
	P50ms          float64 `json:"p50_ms"`
	P95ms          float64 `json:"p95_ms"`
	P99ms          float64 `json:"p99_ms"`
	MaxMs          float64 `json:"max_ms"`
	LoadAllToolsMs float64 `json:"load_all_tools_ms"`
}

LatencyReport summarizes the latency of the proxy's retrieve_tools (BM25) search versus the fixed one-shot cost of loading every tool. Times are measured client-side, in milliseconds, because the server's SearchToolsResponse "took" field is a "0ms" stub.

type LiveClient

type LiveClient struct {
	BaseURL string
	APIKey  string
	HTTP    *http.Client
}

LiveClient talks to a running mcpproxy instance (e.g. the bench docker-compose substrate on 127.0.0.1:8092) over its REST API. It is used by the live benchmark run to pull the exact tool definitions (with schemas) and to replay the retrieval golden set through the proxy's BM25 search.

func NewLiveClient

func NewLiveClient(baseURL, apiKey string) *LiveClient

NewLiveClient builds a LiveClient for baseURL (e.g. "http://127.0.0.1:8092") authenticating with apiKey via the X-API-Key header.

func (*LiveClient) FetchUpstreamTools

func (c *LiveClient) FetchUpstreamTools(ctx context.Context) ([]Tool, error)

FetchUpstreamTools pulls the consolidated tool list (GET /api/v1/tools) and returns every upstream tool with its full JSON input schema, ready to feed into schema-aware token counting for the baseline.

func (*LiveClient) Search

func (c *LiveClient) Search(ctx context.Context, query string, limit int) (ranked []string, latency time.Duration, err error)

Search replays one query through the proxy's BM25 tool search (GET /api/v1/index/search) and returns the ranked tool IDs (server:tool, best first) plus the client-measured round-trip latency.

Latency is measured client-side on purpose: the server's SearchToolsResponse "took" field is currently a hardcoded "0ms" stub (internal/httpapi handleSearchTools), so it cannot be trusted as the proxy-side timing.

type LiveModeResult

type LiveModeResult struct {
	Mode         string  `json:"mode"`
	ContextTools int     `json:"context_tools"`
	Tokens       int     `json:"tokens"`
	SavingsRatio float64 `json:"savings_vs_baseline,omitempty"`
}

LiveModeResult is the per-mode context-token cost from the live run.

type LiveReport

type LiveReport struct {
	Proxy     string            `json:"proxy"`
	Encoding  string            `json:"encoding"`
	Tokens    *LiveTokenReport  `json:"tokens"`
	Retrieval *RetrievalMetrics `json:"retrieval"`
	Latency   *LatencyReport    `json:"latency"`
}

LiveReport is the full live benchmark result: exact-token comparison, retrieval accuracy, and search latency, all gathered from one running proxy.

func RunLive

func RunLive(ctx context.Context, client *LiveClient, golden *GoldenSet) (*LiveReport, error)

RunLive gathers the full live benchmark from a running proxy: it pulls the exact tool definitions (with schemas) for the token comparison, replays the golden set through the proxy's BM25 search for accuracy, and records the per-query search latency.

func (*LiveReport) WriteJSON

func (r *LiveReport) WriteJSON(dir string) (string, error)

WriteJSON writes the live report as indented JSON into dir/live_report.json (the dir is gitignored — reports are never committed, per Spec 065 CN-003).

type LiveTokenReport

type LiveTokenReport struct {
	Encoding               string           `json:"encoding"`
	UpstreamTools          int              `json:"upstream_tools"`
	BaselineTokens         int              `json:"baseline_tokens"`
	Modes                  []LiveModeResult `json:"modes"`
	ProxySchemasCounted    bool             `json:"proxy_schemas_counted"`
	BaselineSchemasCounted bool             `json:"baseline_schemas_counted"`
	AuthoritativeHeadline  bool             `json:"authoritative_headline"`
	Notes                  []string         `json:"notes"`
}

LiveTokenReport is the exact-token comparison from a live proxy, with the baseline upstream tools counted WITH their full JSON input schemas.

AuthoritativeHeadline gates the savings percentage: it is only true when schemas were counted on BOTH sides — the proxy management tools carry schemas (ProxySchemasCounted) AND the baseline upstream tools carry schemas (BaselineSchemasCounted). Counting schemas on one side only overstates or distorts savings — the exact error corrected in MCP-3161 — so when either side is schema-less the savings ratio is withheld and only raw token totals are reported. BaselineSchemasCounted also guards against a /api/v1/tools response that silently dropped upstream schemas (MCP-3167).
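
The gating rule can be sketched as follows. This is an illustration under stated assumptions, not the package's code: modeTokens, headline, and gateHeadline are hypothetical names, and the token totals in main are made up rather than measured.

```go
package main

import "fmt"

// modeTokens is a hypothetical pairing of a routing mode and its token cost.
type modeTokens struct {
	Mode   string
	Tokens int
}

// headline carries the gate decision and, only when authoritative, the savings.
type headline struct {
	Authoritative bool
	Savings       map[string]float64 // left empty when the headline is withheld
}

// gateHeadline reports savings only when schemas were counted on both sides.
// If either side is schema-less, the ratios are withheld and the caller
// should publish raw token totals only.
func gateHeadline(baselineTokens int, modes []modeTokens, proxySchemas, baselineSchemas bool) headline {
	h := headline{
		Authoritative: proxySchemas && baselineSchemas && baselineTokens > 0,
		Savings:       map[string]float64{},
	}
	if !h.Authoritative {
		return h
	}
	for _, m := range modes {
		h.Savings[m.Mode] = 1 - float64(m.Tokens)/float64(baselineTokens)
	}
	return h
}

func main() {
	// Made-up totals for illustration only.
	modes := []modeTokens{{"retrieve_tools", 2500}, {"code_execution", 1500}}
	fmt.Printf("%+v\n", gateHeadline(10000, modes, true, true))
	fmt.Printf("%+v\n", gateHeadline(10000, modes, false, true)) // withheld
}
```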

type ModeResult

type ModeResult struct {
	Mode         string  `json:"mode"`
	ContextTools int     `json:"context_tools"`
	Tokens       int     `json:"tokens"`
	SavingsRatio float64 `json:"savings_vs_baseline"`
}

ModeResult is the per-mode context-cost outcome.

type Report

type Report struct {
	Encoding      string       `json:"encoding"`
	CorpusVersion string       `json:"corpus_version"`
	CorpusTools   int          `json:"corpus_tools"`
	Modes         []ModeResult `json:"modes"`
	Notes         []string     `json:"notes"`
}

Report is the full token-reduction benchmark result.

func ComputeReport

func ComputeReport(tk *Tokenizer, corpus *Corpus) *Report

ComputeReport computes the per-mode context-token cost over the corpus and the savings of each proxy mode versus the baseline (all tools loaded directly).

func (*Report) WriteHTML

func (r *Report) WriteHTML(path string) error

WriteHTML renders the report as a self-contained static dashboard. The output is a single file with no external assets so it can be published as-is to a static host (CI release-tag publishing is tracked as a follow-up).

func (*Report) WriteJSON

func (r *Report) WriteJSON(path string) error

WriteJSON writes the report as indented JSON to path.

func (*Report) WriteReports

func (r *Report) WriteReports(dir string) (jsonPath, htmlPath string, err error)

WriteReports writes both report.json and dashboard.html into dir.

type RetrievalGate

type RetrievalGate struct {
	Passed    bool    `json:"passed"`
	Metric    string  `json:"metric,omitempty"`
	Tolerance float64 `json:"tolerance,omitempty"`
}

RetrievalGate is the `retrieval.gate` object of the score-report contract.

A standalone live run has no stored baseline to regress against, so the gate cannot fail by construction: Passed is true and Metric/Tolerance are empty. Regression gating against a committed baseline is the CI lane's job (MCP-3133) — that run fills Metric/Tolerance and can set Passed=false.

type RetrievalMetricValues

type RetrievalMetricValues struct {
	RecallAt map[int]float64 `json:"recall_at"`
	MRR      float64         `json:"mrr"`
	NDCGAt10 float64         `json:"ndcg_at_10"`
	MAP      float64         `json:"map"`
}

RetrievalMetricValues holds the aggregated metric numbers. It is the `retrieval.metrics` object of the Spec 065 score-report.schema.json contract.

type RetrievalMetrics

type RetrievalMetrics struct {
	CorpusVersion string                `json:"corpus_version"`
	GoldenVersion string                `json:"golden_version,omitempty"`
	RunsAveraged  int                   `json:"runs_averaged"`
	QueryCount    int                   `json:"query_count,omitempty"`
	Metrics       RetrievalMetricValues `json:"metrics"`
	Gate          RetrievalGate         `json:"gate"`
}

RetrievalMetrics is the aggregated retrieval-quality report over a golden set. Its JSON shape IS the Spec 065 score-report.schema.json `retrieval` block (nested `metrics` + `gate`), so a live report's retrieval payload validates against that contract directly.

func ScoreRetrieval

func ScoreRetrieval(golden *GoldenSet, search SearchFunc, ks []int) (*RetrievalMetrics, error)

ScoreRetrieval replays every golden query through search and aggregates Recall@k (for each k in ks), MRR, nDCG@10 and MAP as the mean over all queries. The search is deterministic (BM25), so a single run is sufficient and runs_averaged is 1.

type SearchFunc

type SearchFunc func(query string, limit int) (ranked []string, err error)

SearchFunc replays one query through the retrieval system under test and returns the ranked tool IDs (most relevant first), limited to `limit`.

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer wraps a tiktoken encoding for reproducible token estimation.

func NewTokenizer

func NewTokenizer(encoding string) (*Tokenizer, error)

NewTokenizer constructs a Tokenizer for the given tiktoken encoding name.

func (*Tokenizer) Count

func (t *Tokenizer) Count(text string) int

Count returns the number of tokens in text.

func (*Tokenizer) CountTool

func (t *Tokenizer) CountTool(tl Tool) int

CountTool returns the context-token cost of a single tool definition.

It counts the tool name and description only. Input JSON schemas are excluded uniformly across every mode because the committed Spec 065 corpus snapshot does not carry schemas. Schemas are dropped from BOTH sides — the baseline's upstream tools and the proxy modes' management tools (e.g. upstream_servers carries a large multi-field schema) — so this is a well-defined name+description-only metric, not an unambiguously conservative one. The live docker-compose run (README.md) adds full schemas from GET /api/v1/tools for the exact headline number.

func (*Tokenizer) CountToolWithSchema

func (t *Tokenizer) CountToolWithSchema(tl Tool) int

CountToolWithSchema returns the context-token cost of a tool definition INCLUDING its JSON input schema (name + description + schema). This is the authoritative per-tool context cost an agent actually pays. A tool with no schema counts identically to CountTool, so mixing schema-bearing (live) and schemaless tools in one report is well-defined. Used by the live run, where both the baseline upstream tools AND the proxy management tools carry their real schemas — counting schemas on BOTH sides is what keeps the headline savings honest rather than overstated.

type Tool

type Tool struct {
	ToolID      string          `json:"tool_id"`
	Server      string          `json:"server"`
	Name        string          `json:"tool"`
	Description string          `json:"description"`
	Schema      json.RawMessage `json:"schema,omitempty"`
}

Tool is a single tool definition the benchmark scores token cost over. It matches the shape of both the Spec 065 corpus snapshot and the proxy-tool definitions returned by ProxyToolsForMode. Schema is optional: the committed corpus snapshot is description-only (nil schema), while the live run (live.go) populates it with each tool's full JSON input schema for the exact-token headline.

func ProxyToolsForMode

func ProxyToolsForMode(mode string) []Tool

ProxyToolsForMode returns the built-in mcpproxy proxy + management tool definitions that occupy the agent's context window in the given routing mode.

The catalog is derived directly from the live server tool builders (internal/server.ProxyModeToolDefs → buildCallToolModeTools / buildCodeExecModeTools in internal/server/mcp_routing.go). This is the single source of truth: both routing modes append the shared management tool set (upstream_servers, quarantine_security, search_servers, list_registries), so deriving from the builders guarantees the benchmark counts the real per-mode context cost and can never drift from production by re-introducing the undercount that inflated the headline savings (MCP-3161).

Directories

| Path | Synopsis |
|---|---|
| cmd/bench (command) | Command bench runs the mcpproxy benchmark. |
