Documentation ¶
Overview ¶
Package bench is the mcpproxy benchmark harness (roadmap #19 / MCP-42).
It produces the reproducible numbers behind mcpproxy's marketing claims — token reduction, discovery accuracy, and latency — by comparing three ways an agent can be wired to upstream MCP tools:
- baseline: every upstream tool definition is loaded directly into the agent's context (no proxy discovery).
- retrieve_tools: only mcpproxy's discovery + call_tool variants occupy the context; tools are found on demand via BM25 search.
- code_execution: only code_execution + retrieve_tools occupy the context; the agent orchestrates many tools from sandboxed JS in one round-trip.
The token-reduction measurement in this file is fully deterministic and offline: it counts the context cost of each mode over a frozen tool corpus using the tiktoken cl100k_base encoding (a reproducible, model-agnostic estimator). It reuses the Spec 065 frozen corpus (specs/065-evaluation-foundation/datasets/corpus_v1.tools.json) as its tool universe so the benchmark scores a versioned, non-drifting snapshot (CN-002).
Methodology, limitations, and the live (docker-compose) run that adds full JSON input schemas and end-to-end accuracy/latency are documented in bench/README.md.
Index ¶
- Constants
- func AveragePrecision(ranked []string, labels []Label) float64
- func NDCGAtK(ranked []string, labels []Label, k int) float64
- func RecallAtK(ranked []string, labels []Label, k int) float64
- func ReciprocalRank(ranked []string, labels []Label) float64
- type Corpus
- type GoldenQuery
- type GoldenSet
- type Label
- type LatencyReport
- type LiveClient
- type LiveModeResult
- type LiveReport
- type LiveTokenReport
- type ModeResult
- type Report
- type RetrievalGate
- type RetrievalMetricValues
- type RetrievalMetrics
- type SearchFunc
- type Tokenizer
- type Tool
Constants ¶
const (
ModeBaseline = "baseline"
ModeRetrieveTools = "retrieve_tools"
ModeCodeExecution = "code_execution"
)
Routing modes the benchmark compares. The mode names mirror the mcpproxy MCP servers in internal/server/mcp.go (codeExecServer, callToolServer).
const DefaultEncoding = "cl100k_base"
DefaultEncoding is the tiktoken encoding used for token estimation. cl100k_base is a widely-used, reproducible BPE; exact counts for a specific pinned model (e.g. Claude) will differ, but the *relative* savings between modes are stable.
Functions ¶
func AveragePrecision ¶
AveragePrecision is the sum of the precision values computed at each rank where a relevant tool is retrieved, divided by the total number of relevant tools (so unretrieved relevant tools lower the score). Binary relevance (relevance >= 1) is used, matching the standard MAP definition.
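The computation described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the `label` type and its field names are assumptions (the real Label fields are not shown on this page), the tool IDs are made up, and returning 0 for a query with no relevant tools is also an assumption.

```go
package main

import "fmt"

// label is a hypothetical stand-in for the package's Label type; the field
// names are assumptions made for this sketch.
type label struct {
	ToolID    string
	Relevance int
}

// averagePrecision adds up precision@rank at every rank that holds a relevant
// tool (relevance >= 1) and divides by the total number of relevant tools, so
// relevant tools that were never retrieved pull the score down.
func averagePrecision(ranked []string, labels []label) float64 {
	relevant := make(map[string]bool)
	for _, l := range labels {
		if l.Relevance >= 1 {
			relevant[l.ToolID] = true
		}
	}
	if len(relevant) == 0 {
		return 0 // assumption: a query without relevant tools scores 0
	}
	hits, sum := 0, 0.0
	seen := make(map[string]bool)
	for i, id := range ranked {
		if relevant[id] && !seen[id] {
			seen[id] = true
			hits++
			sum += float64(hits) / float64(i+1)
		}
	}
	return sum / float64(len(relevant))
}

func main() {
	// Illustrative labels and ranking, not taken from the golden set.
	labels := []label{{"github:create_issue", 2}, {"github:list_issues", 1}, {"jira:create_ticket", 1}}
	ranked := []string{"github:create_issue", "slack:post_message", "github:list_issues"}
	fmt.Printf("%.4f\n", averagePrecision(ranked, labels))
}
```

In this example two of the three relevant tools are retrieved, at ranks 1 and 3, so the score is (1/1 + 2/3) / 3.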
func NDCGAtK ¶
NDCGAtK is the normalized discounted cumulative gain at k using the graded relevance as the gain (linear gain, log2 position discount). 1.0 means the ranking is in ideal (relevance-descending) order; 0 means no gain in top-k.
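A minimal sketch of nDCG@k with linear gain and a log2 position discount, as described above. The `label` type and its field names are assumptions, the tool IDs are illustrative, and treating unlabelled tools as relevance 0 follows the Label description on this page.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// label is a hypothetical stand-in for the package's Label type.
type label struct {
	ToolID    string
	Relevance int
}

// dcg sums gain / log2(position + 1) over the first k gains (positions are 1-based).
func dcg(gains []int, k int) float64 {
	total := 0.0
	for i, g := range gains {
		if i >= k {
			break
		}
		total += float64(g) / math.Log2(float64(i+2))
	}
	return total
}

// ndcgAtK divides the DCG of the ranking by the DCG of the ideal
// (relevance-descending) ordering of the labelled tools.
func ndcgAtK(ranked []string, labels []label, k int) float64 {
	rel := make(map[string]int)
	ideal := make([]int, 0, len(labels))
	for _, l := range labels {
		rel[l.ToolID] = l.Relevance
		ideal = append(ideal, l.Relevance)
	}
	sort.Sort(sort.Reverse(sort.IntSlice(ideal)))
	idealDCG := dcg(ideal, k)
	if idealDCG == 0 {
		return 0
	}
	gains := make([]int, len(ranked))
	for i, id := range ranked {
		gains[i] = rel[id] // unlabelled tools contribute 0
	}
	return dcg(gains, k) / idealDCG
}

func main() {
	labels := []label{{"github:create_issue", 2}, {"github:list_issues", 1}}
	ideal := []string{"github:create_issue", "github:list_issues"}
	swapped := []string{"github:list_issues", "github:create_issue"}
	fmt.Printf("ideal=%.3f swapped=%.3f\n", ndcgAtK(ideal, labels, 10), ndcgAtK(swapped, labels, 10))
}
```

The ideal ordering scores 1.0, while swapping the primary and related tool lowers the score because the larger gain is discounted more heavily.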
func RecallAtK ¶
RecallAtK is the fraction of the query's relevant tools (relevance >= 1) that appear in the top-k of the ranking. Returns 0 when there are no relevant tools (a degenerate query that should not be scored).
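The definition above might look like this in code. It is a sketch under assumptions: `label` and its fields are hypothetical stand-ins for the package's Label type.

```go
package main

import "fmt"

// label is a hypothetical stand-in for the package's Label type.
type label struct {
	ToolID    string
	Relevance int
}

// recallAtK returns the share of relevant tools (relevance >= 1) that appear
// in the first k entries of ranked, or 0 when the query has no relevant tools.
func recallAtK(ranked []string, labels []label, k int) float64 {
	relevant := make(map[string]bool)
	for _, l := range labels {
		if l.Relevance >= 1 {
			relevant[l.ToolID] = true
		}
	}
	total := len(relevant)
	if total == 0 {
		return 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	if k < 0 {
		k = 0
	}
	found := 0
	for _, id := range ranked[:k] {
		if relevant[id] {
			found++
			delete(relevant, id) // count each relevant tool once
		}
	}
	return float64(found) / float64(total)
}

func main() {
	labels := []label{{"a", 2}, {"b", 1}, {"c", 0}}
	ranked := []string{"a", "x", "b"}
	fmt.Println(recallAtK(ranked, labels, 2), recallAtK(ranked, labels, 3))
}
```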
func ReciprocalRank ¶
ReciprocalRank is 1/rank of the first relevant tool in the ranking, or 0 if none of the ranked tools are relevant.
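As a short illustration of the definition above, with the same caveat that `label` and its field names are assumptions rather than the package's real type:

```go
package main

import "fmt"

// label is a hypothetical stand-in for the package's Label type.
type label struct {
	ToolID    string
	Relevance int
}

// reciprocalRank returns 1/rank (1-based) of the first relevant tool in
// ranked, or 0 if no ranked tool has relevance >= 1.
func reciprocalRank(ranked []string, labels []label) float64 {
	relevant := make(map[string]bool)
	for _, l := range labels {
		if l.Relevance >= 1 {
			relevant[l.ToolID] = true
		}
	}
	for i, id := range ranked {
		if relevant[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

func main() {
	labels := []label{{"a", 2}, {"b", 1}}
	fmt.Println(reciprocalRank([]string{"x", "b", "a"}, labels))
}
```

Averaging this value over all golden queries gives the MRR figure reported in RetrievalMetricValues.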
Types ¶
type Corpus ¶
Corpus is a frozen, versioned set of tool definitions.
func LoadCorpus ¶
LoadCorpus reads a frozen corpus snapshot (e.g. the Spec 065 corpus_v1.tools.json) from disk.
type GoldenQuery ¶
type GoldenQuery struct {
ID string `json:"id"`
Query string `json:"query"`
Labels []Label `json:"labels"`
}
GoldenQuery is one labelled query -> relevant-tool(s) judgement.
type GoldenSet ¶
type GoldenSet struct {
CorpusVersion string `json:"corpus_version"`
Queries []GoldenQuery `json:"queries"`
}
GoldenSet is the frozen Spec 065 retrieval golden set (retrieval_golden_v1.json).
func LoadGoldenSet ¶
LoadGoldenSet reads the Spec 065 retrieval golden set (retrieval_golden_v1.json) from disk.
type Label ¶
Label is a graded relevance judgement for one tool against one query, taken from the Spec 065 retrieval golden set (relevance 2 = primary, 1 = related, 0 / absent = irrelevant).
type LatencyReport ¶
type LatencyReport struct {
Samples int `json:"samples"`
P50ms float64 `json:"p50_ms"`
P95ms float64 `json:"p95_ms"`
P99ms float64 `json:"p99_ms"`
MaxMs float64 `json:"max_ms"`
LoadAllToolsMs float64 `json:"load_all_tools_ms"`
}
LatencyReport summarizes proxy-side retrieve_tools search latency versus the fixed one-shot cost of loading every tool. Times are client-measured (milliseconds); the server's SearchToolsResponse "took" field is a "0ms" stub.
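One way to turn client-measured samples into the percentile fields above is the nearest-rank method. The page does not say which percentile method the package uses, so the method, the `latencySummary` type, and the helper names below are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// latencySummary is a hypothetical subset of LatencyReport (LoadAllToolsMs omitted).
type latencySummary struct {
	Samples                    int
	P50ms, P95ms, P99ms, MaxMs float64
}

// nearestRank returns the p-th percentile of an ascending slice using the
// nearest-rank method: the value at position ceil(p/100 * n), 1-based.
func nearestRank(sortedMs []float64, p int) float64 {
	n := len(sortedMs)
	if n == 0 {
		return 0
	}
	rank := (p*n + 99) / 100 // integer ceiling of p*n/100
	if rank < 1 {
		rank = 1
	}
	if rank > n {
		rank = n
	}
	return sortedMs[rank-1]
}

// summarizeMs sorts a copy of the samples (milliseconds) and fills the summary.
func summarizeMs(samplesMs []float64) latencySummary {
	ms := append([]float64(nil), samplesMs...)
	sort.Float64s(ms)
	s := latencySummary{Samples: len(ms)}
	if len(ms) == 0 {
		return s
	}
	s.P50ms = nearestRank(ms, 50)
	s.P95ms = nearestRank(ms, 95)
	s.P99ms = nearestRank(ms, 99)
	s.MaxMs = ms[len(ms)-1]
	return s
}

// summarize converts durations to fractional milliseconds first.
func summarize(samples []time.Duration) latencySummary {
	ms := make([]float64, len(samples))
	for i, d := range samples {
		ms[i] = float64(d) / float64(time.Millisecond)
	}
	return summarizeMs(ms)
}

func main() {
	samples := []time.Duration{3 * time.Millisecond, 1 * time.Millisecond, 12 * time.Millisecond, 2 * time.Millisecond}
	fmt.Printf("%+v\n", summarize(samples))
}
```

With few samples the upper percentiles collapse onto the maximum, which is worth keeping in mind when reading P95ms and P99ms from a small golden set.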
type LiveClient ¶
LiveClient talks to a running mcpproxy instance (e.g. the bench docker-compose substrate on 127.0.0.1:8092) over its REST API. It is used by the live benchmark run to pull the exact tool definitions (with schemas) and to replay the retrieval golden set through the proxy's BM25 search.
func NewLiveClient ¶
func NewLiveClient(baseURL, apiKey string) *LiveClient
NewLiveClient builds a LiveClient for baseURL (e.g. "http://127.0.0.1:8092") authenticating with apiKey via the X-API-Key header.
func (*LiveClient) FetchUpstreamTools ¶
func (c *LiveClient) FetchUpstreamTools(ctx context.Context) ([]Tool, error)
FetchUpstreamTools pulls the consolidated tool list (GET /api/v1/tools) and returns every upstream tool with its full JSON input schema, ready to feed into schema-aware token counting for the baseline.
func (*LiveClient) Search ¶
func (c *LiveClient) Search(ctx context.Context, query string, limit int) (ranked []string, latency time.Duration, err error)
Search replays one query through the proxy's BM25 tool search (GET /api/v1/index/search) and returns the ranked tool IDs (server:tool, best first) plus the client-measured round-trip latency.
Latency is measured client-side on purpose: the server's SearchToolsResponse "took" field is currently a hardcoded "0ms" stub (internal/httpapi handleSearchTools), so it cannot be trusted as the proxy-side timing.
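Client-side timing of this kind can be sketched with the standard library alone. The helper names `timedGet` and `probeStub` are hypothetical, the stub server only stands in for a running proxy, and its `{}` body is not the real search response shape; only the X-API-Key header and the /api/v1/index/search path come from this page.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// timedGet issues one authenticated GET and returns the HTTP status plus the
// client-measured round-trip latency, including reading the full body.
func timedGet(ctx context.Context, hc *http.Client, url, apiKey string) (int, time.Duration, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, 0, err
	}
	req.Header.Set("X-API-Key", apiKey)
	start := time.Now()
	resp, err := hc.Do(req)
	if err != nil {
		return 0, 0, err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, 0, err
	}
	return resp.StatusCode, time.Since(start), nil
}

// probeStub starts a local stub (a stand-in for the proxy) that accepts only
// the key "secret", then times one request against it.
func probeStub(apiKey string) (int, time.Duration, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-API-Key") != "secret" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		fmt.Fprint(w, `{}`) // placeholder body, not the real response
	}))
	defer srv.Close()
	return timedGet(context.Background(), srv.Client(), srv.URL+"/api/v1/index/search", apiKey)
}

func main() {
	status, latency, err := probeStub("secret")
	fmt.Println(status, latency, err)
}
```

Starting the clock just before the request is sent and stopping it after the body is drained measures what the agent actually waits for, independent of any timing the server reports.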
type LiveModeResult ¶
type LiveModeResult struct {
Mode string `json:"mode"`
ContextTools int `json:"context_tools"`
Tokens int `json:"tokens"`
SavingsRatio float64 `json:"savings_vs_baseline,omitempty"`
}
LiveModeResult is the per-mode context-token cost from the live run.
type LiveReport ¶
type LiveReport struct {
Proxy string `json:"proxy"`
Encoding string `json:"encoding"`
Tokens *LiveTokenReport `json:"tokens"`
Retrieval *RetrievalMetrics `json:"retrieval"`
Latency *LatencyReport `json:"latency"`
}
LiveReport is the full live benchmark result: exact-token comparison, retrieval accuracy, and search latency, all gathered from one running proxy.
func RunLive ¶
func RunLive(ctx context.Context, client *LiveClient, golden *GoldenSet) (*LiveReport, error)
RunLive gathers the full live benchmark from a running proxy: it pulls the exact tool definitions (with schemas) for the token comparison, replays the golden set through the proxy's BM25 search for accuracy, and records the per-query search latency.
type LiveTokenReport ¶
type LiveTokenReport struct {
Encoding string `json:"encoding"`
UpstreamTools int `json:"upstream_tools"`
BaselineTokens int `json:"baseline_tokens"`
Modes []LiveModeResult `json:"modes"`
ProxySchemasCounted bool `json:"proxy_schemas_counted"`
BaselineSchemasCounted bool `json:"baseline_schemas_counted"`
AuthoritativeHeadline bool `json:"authoritative_headline"`
Notes []string `json:"notes"`
}
LiveTokenReport is the exact-token comparison from a live proxy, with the baseline upstream tools counted WITH their full JSON input schemas.
AuthoritativeHeadline gates the savings percentage: it is only true when schemas were counted on BOTH sides — the proxy management tools carry schemas (ProxySchemasCounted) AND the baseline upstream tools carry schemas (BaselineSchemasCounted). Counting schemas on one side only overstates or distorts savings — the exact error corrected in MCP-3161 — so when either side is schema-less the savings ratio is withheld and only raw token totals are reported. BaselineSchemasCounted also guards against a /api/v1/tools response that silently dropped upstream schemas (MCP-3167).
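The gating rule can be illustrated with a small sketch. The struct is copied from the LiveModeResult listing on this page; the `gateSavings` helper, the formula 1 - tokens/baseline for the savings ratio, and the token numbers are assumptions chosen for illustration, not measured results.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// liveModeResult copies the fields and JSON tags of LiveModeResult.
type liveModeResult struct {
	Mode         string  `json:"mode"`
	ContextTools int     `json:"context_tools"`
	Tokens       int     `json:"tokens"`
	SavingsRatio float64 `json:"savings_vs_baseline,omitempty"`
}

// gateSavings (hypothetical helper) fills SavingsRatio only when schemas were
// counted on both sides; otherwise it leaves the field at zero so omitempty
// drops it and only raw token totals are reported.
func gateSavings(mode string, contextTools, tokens, baselineTokens int, proxySchemas, baselineSchemas bool) liveModeResult {
	r := liveModeResult{Mode: mode, ContextTools: contextTools, Tokens: tokens}
	if proxySchemas && baselineSchemas && baselineTokens > 0 {
		// Assumed definition of the savings ratio.
		r.SavingsRatio = 1 - float64(tokens)/float64(baselineTokens)
	}
	return r
}

func main() {
	// Illustrative numbers only.
	for _, baselineSchemas := range []bool{true, false} {
		r := gateSavings("retrieve_tools", 7, 500, 2000, true, baselineSchemas)
		b, err := json.Marshal(r)
		if err != nil {
			panic(err)
		}
		fmt.Println(string(b))
	}
}
```

The first line of output carries "savings_vs_baseline":0.75, while the second omits the field entirely because the baseline side was schema-less.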
type ModeResult ¶
type ModeResult struct {
Mode string `json:"mode"`
ContextTools int `json:"context_tools"`
Tokens int `json:"tokens"`
SavingsRatio float64 `json:"savings_vs_baseline"`
}
ModeResult is the per-mode context-cost outcome.
type Report ¶
type Report struct {
Encoding string `json:"encoding"`
CorpusVersion string `json:"corpus_version"`
CorpusTools int `json:"corpus_tools"`
Modes []ModeResult `json:"modes"`
Notes []string `json:"notes"`
}
Report is the full token-reduction benchmark result.
func ComputeReport ¶
ComputeReport computes the per-mode context-token cost over the corpus and the savings of each proxy mode versus the baseline (all tools loaded directly).
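The shape of that computation can be sketched as below. To stay self-contained the example swaps tiktoken for a whitespace word count, which is NOT the cl100k_base estimator the package uses; the `tool` type, the helper names, the tool definitions, and the 1 - mode/baseline savings formula are likewise assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// tool is a hypothetical reduction of the package's Tool to the two fields
// that a name+description-only count looks at.
type tool struct {
	Name        string
	Description string
}

// countTool is a stand-in estimator: it counts whitespace-separated words in
// the name and description. The real benchmark counts tiktoken tokens.
func countTool(t tool) int {
	return len(strings.Fields(t.Name)) + len(strings.Fields(t.Description))
}

// contextTokens sums the per-tool cost of everything loaded into the context.
func contextTokens(tools []tool) int {
	total := 0
	for _, t := range tools {
		total += countTool(t)
	}
	return total
}

// savingsVsBaseline is the assumed ratio: the share of baseline tokens a mode avoids.
func savingsVsBaseline(modeTokens, baselineTokens int) float64 {
	if baselineTokens == 0 {
		return 0
	}
	return 1 - float64(modeTokens)/float64(baselineTokens)
}

func main() {
	// Illustrative definitions, not the frozen corpus.
	corpus := []tool{
		{"create_issue", "Create a new issue in a repository"},
		{"list_issues", "List open issues"},
		{"post_message", "Post a message to a channel"},
	}
	proxy := []tool{{"retrieve_tools", "Search for tools by query"}}

	baseline := contextTokens(corpus)
	mode := contextTokens(proxy)
	fmt.Printf("baseline=%d mode=%d savings=%.2f\n", baseline, mode, savingsVsBaseline(mode, baseline))
}
```

The baseline grows with the size of the corpus while a proxy mode's cost stays fixed at its built-in tool set, which is why the ratio depends on how many upstream tools are configured.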
func (*Report) WriteHTML ¶
WriteHTML renders the report as a self-contained static dashboard. The output is a single file with no external assets so it can be published as-is to a static host (CI release-tag publishing is tracked as a follow-up).
type RetrievalGate ¶
type RetrievalGate struct {
Passed bool `json:"passed"`
Metric string `json:"metric,omitempty"`
Tolerance float64 `json:"tolerance,omitempty"`
}
RetrievalGate is the `retrieval.gate` object of the score-report contract.
A standalone live run has no stored baseline to regress against, so the gate cannot fail by construction: Passed is true and Metric/Tolerance are empty. Regression gating against a committed baseline is the CI lane's job (MCP-3133) — that run fills Metric/Tolerance and can set Passed=false.
type RetrievalMetricValues ¶
type RetrievalMetricValues struct {
RecallAt map[int]float64 `json:"recall_at"`
MRR float64 `json:"mrr"`
NDCGAt10 float64 `json:"ndcg_at_10"`
MAP float64 `json:"map"`
}
RetrievalMetricValues holds the aggregated metric numbers. It is the `retrieval.metrics` object of the Spec 065 score-report.schema.json contract.
type RetrievalMetrics ¶
type RetrievalMetrics struct {
CorpusVersion string `json:"corpus_version"`
GoldenVersion string `json:"golden_version,omitempty"`
RunsAveraged int `json:"runs_averaged"`
QueryCount int `json:"query_count,omitempty"`
Metrics RetrievalMetricValues `json:"metrics"`
Gate RetrievalGate `json:"gate"`
}
RetrievalMetrics is the aggregated retrieval-quality report over a golden set. Its JSON shape IS the Spec 065 score-report.schema.json `retrieval` block (nested `metrics` + `gate`), so a live report's retrieval payload validates against that contract directly.
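To make the nested shape concrete, the sketch below copies the three struct definitions from this page and marshals one value. The `encode` helper is hypothetical and the metric values are illustrative, not benchmark results.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// The three types below are copied from the listings on this page.
type RetrievalGate struct {
	Passed    bool    `json:"passed"`
	Metric    string  `json:"metric,omitempty"`
	Tolerance float64 `json:"tolerance,omitempty"`
}

type RetrievalMetricValues struct {
	RecallAt map[int]float64 `json:"recall_at"`
	MRR      float64         `json:"mrr"`
	NDCGAt10 float64         `json:"ndcg_at_10"`
	MAP      float64         `json:"map"`
}

type RetrievalMetrics struct {
	CorpusVersion string                `json:"corpus_version"`
	GoldenVersion string                `json:"golden_version,omitempty"`
	RunsAveraged  int                   `json:"runs_averaged"`
	QueryCount    int                   `json:"query_count,omitempty"`
	Metrics       RetrievalMetricValues `json:"metrics"`
	Gate          RetrievalGate         `json:"gate"`
}

// encode (hypothetical helper) returns the compact JSON form of m.
func encode(m RetrievalMetrics) (string, error) {
	b, err := json.Marshal(m)
	return string(b), err
}

func main() {
	m := RetrievalMetrics{
		CorpusVersion: "v1", // illustrative value
		RunsAveraged:  1,
		Metrics: RetrievalMetricValues{
			RecallAt: map[int]float64{1: 0.5, 5: 1},
			MRR:      0.75,
			NDCGAt10: 0.5,
			MAP:      0.25,
		},
		Gate: RetrievalGate{Passed: true},
	}
	out, err := encode(m)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Two details of the encoding are easy to miss: encoding/json writes the integer keys of RecallAt as JSON strings ("1", "5"), and the omitempty fields (golden_version, query_count, and the gate's metric and tolerance) disappear when left at their zero values.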
func ScoreRetrieval ¶
func ScoreRetrieval(golden *GoldenSet, search SearchFunc, ks []int) (*RetrievalMetrics, error)
ScoreRetrieval replays every golden query through search and aggregates Recall@k (for each k in ks), MRR, nDCG@10 and MAP as the mean over all queries. The search is deterministic (BM25), so a single run suffices and no averaging over repeated runs is needed.
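The replay-and-average loop might be sketched like this for a single metric (MRR). The `searchFunc` signature is an assumption, since the page does not show SearchFunc's definition, and `label`, `goldenQuery`, and the helper names are hypothetical stand-ins for the package types.

```go
package main

import (
	"errors"
	"fmt"
)

// label and goldenQuery are hypothetical stand-ins for Label and GoldenQuery.
type label struct {
	ToolID    string
	Relevance int
}

type goldenQuery struct {
	ID     string
	Query  string
	Labels []label
}

// searchFunc is an assumed signature for the retrieval system under test.
type searchFunc func(query string, limit int) ([]string, error)

// firstRelevantRank returns the 1-based rank of the first tool with
// relevance >= 1, or 0 if none is ranked.
func firstRelevantRank(ranked []string, labels []label) int {
	relevant := make(map[string]bool)
	for _, l := range labels {
		if l.Relevance >= 1 {
			relevant[l.ToolID] = true
		}
	}
	for i, id := range ranked {
		if relevant[id] {
			return i + 1
		}
	}
	return 0
}

// meanReciprocalRank replays each query once and averages 1/rank over all queries.
func meanReciprocalRank(queries []goldenQuery, search searchFunc, limit int) (float64, error) {
	if len(queries) == 0 {
		return 0, errors.New("golden set has no queries")
	}
	sum := 0.0
	for _, q := range queries {
		ranked, err := search(q.Query, limit)
		if err != nil {
			return 0, fmt.Errorf("query %s: %w", q.ID, err)
		}
		if r := firstRelevantRank(ranked, q.Labels); r > 0 {
			sum += 1 / float64(r)
		}
	}
	return sum / float64(len(queries)), nil
}

func main() {
	queries := []goldenQuery{
		{ID: "q1", Query: "open a bug", Labels: []label{{"github:create_issue", 2}}},
		{ID: "q2", Query: "send a chat", Labels: []label{{"slack:post_message", 2}}},
	}
	// Stub search that returns the same ranking for every query.
	stub := func(query string, limit int) ([]string, error) {
		return []string{"github:create_issue", "slack:post_message"}, nil
	}
	mrr, err := meanReciprocalRank(queries, stub, 10)
	fmt.Println(mrr, err)
}
```

Because the search is passed in as a function, the same loop can score an in-process index or a live proxy without changes.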
type SearchFunc ¶
SearchFunc replays one query through the retrieval system under test and returns the ranked tool IDs (most relevant first), limited to `limit`.
type Tokenizer ¶
type Tokenizer struct {
// contains filtered or unexported fields
}
Tokenizer wraps a tiktoken encoding for reproducible token estimation.
func NewTokenizer ¶
NewTokenizer constructs a Tokenizer for the given tiktoken encoding name.
func (*Tokenizer) CountTool ¶
CountTool returns the context-token cost of a single tool definition.
It counts the tool name and description only. Input JSON schemas are excluded uniformly across every mode because the committed Spec 065 corpus snapshot does not carry schemas. Schemas are dropped from BOTH sides: the baseline's upstream tools and the proxy modes' management tools (e.g. upstream_servers carries a large multi-field schema). The result is a well-defined name+description-only metric, but not an unambiguously conservative one, because the omission lowers the count on both sides. The live docker-compose run (README.md) adds full schemas from GET /api/v1/tools for the exact headline number.
func (*Tokenizer) CountToolWithSchema ¶
CountToolWithSchema returns the context-token cost of a tool definition INCLUDING its JSON input schema (name + description + schema). This is the authoritative per-tool context cost an agent actually pays. A tool with no schema counts identically to CountTool, so mixing schema-bearing (live) and schemaless tools in one report is well-defined. Used by the live run, where both the baseline upstream tools AND the proxy management tools carry their real schemas — counting schemas on BOTH sides is what keeps the headline savings honest rather than overstated.
type Tool ¶
type Tool struct {
ToolID string `json:"tool_id"`
Server string `json:"server"`
Name string `json:"tool"`
Description string `json:"description"`
Schema json.RawMessage `json:"schema,omitempty"`
}
Tool is a single tool definition the benchmark scores token cost over. It matches the shape of both the Spec 065 corpus snapshot and the embedded proxy-tool fixture. Schema is optional: the committed corpus snapshot is description-only (nil schema), while the live run (live.go) populates it with each tool's full JSON input schema for the exact-token headline.
func ProxyToolsForMode ¶
ProxyToolsForMode returns the built-in mcpproxy proxy + management tool definitions that occupy the agent's context window in the given routing mode.
The catalog is derived directly from the live server tool builders (internal/server.ProxyModeToolDefs → buildCallToolModeTools / buildCodeExecModeTools in internal/server/mcp_routing.go), which act as the single source of truth. Both routing modes append the shared management tool set (upstream_servers, quarantine_security, search_servers, list_registries), so deriving from the builders means the benchmark counts the real per-mode context cost and cannot drift from production and re-introduce the undercount that inflated the headline savings (MCP-3161).