Documentation ¶
Overview ¶
Package observability provides OpenTelemetry-based metrics for the mcp-data-platform server.
Phase 1 instruments two chokepoints: the MCP tool-call middleware and the apigateway outbound HTTP path. Metrics are exported in Prometheus format on a separate HTTP listener so scrape traffic is isolated from the main MCP/HTTP listener.
Configuration is environment-only in this phase to keep the surface small. See ConfigFromEnv for the recognized variables.
Index ¶
- Constants
- func ClassifyError(err error) string
- func ClassifyToolCallResult(err error, isToolError bool, errCategory string) string
- func HTTPStatusCategory(status int, transportErr error) string
- func HTTPStatusClass(status int) string
- type APIGatewayAttrs
- type CategorizedError
- type Config
- type Listener
- type Metrics
- func (m *Metrics) DecInflightToolCalls(ctx context.Context)
- func (m *Metrics) Enabled() bool
- func (m *Metrics) Handler() http.Handler
- func (m *Metrics) IncInflightToolCalls(ctx context.Context)
- func (m *Metrics) RecordAPIGatewayOutbound(ctx context.Context, attrs APIGatewayAttrs, duration time.Duration)
- func (m *Metrics) RecordToolCall(ctx context.Context, attrs ToolCallAttrs, duration time.Duration)
- func (m *Metrics) Shutdown(ctx context.Context) error
- type ToolCallAttrs
Constants ¶
const (
	StatusOK            = "ok"
	StatusAuthErr       = "auth_err"
	StatusAuthzErr      = "authz_err"
	StatusValidationErr = "validation_err"
	StatusUpstreamErr   = "upstream_err"
	StatusInternalErr   = "internal_err"
)
Status category labels for tool calls and outbound HTTP. The set is closed and small so total label cardinality on counters and histograms stays bounded.
const (
	OutcomeOK              = "ok"
	OutcomeUpstream4xx     = "upstream_4xx"
	OutcomeUpstream5xx     = "upstream_5xx"
	OutcomeTransportErr    = "transport_err"
	OutcomeUpstreamTimeout = "upstream_timeout"
)
Audit outcome categories for upstream-proxying toolkits (e.g. the apigateway). These are bounded labels that distinguish gateway-level failure (the gateway could not reach the upstream) from upstream-level failure (the upstream responded with an error status). The two are fundamentally different operational concerns and should not share a status code or a success boolean:
- The gateway returning 502 means the gateway broke.
- The upstream returning 502 (proxied through the gateway as wire 200 with the upstream code in the body) means the upstream broke. The gateway did its job.
audit_logs.error_category and the Phase 1 status_category label adopt these constants so dashboards can alert on each independently.
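As a rough illustration of the gateway-versus-upstream split described above, the sketch below derives an Outcome* label from an upstream status code and a transport error. The helper name outcomeFor and its timeout detection (context.DeadlineExceeded or a net.Error reporting Timeout) are assumptions made for this example, not the package's documented behavior.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

const (
	OutcomeOK              = "ok"
	OutcomeUpstream4xx     = "upstream_4xx"
	OutcomeUpstream5xx     = "upstream_5xx"
	OutcomeTransportErr    = "transport_err"
	OutcomeUpstreamTimeout = "upstream_timeout"
)

// Sample errors used by the demo below.
var (
	errDial    = errors.New("dial tcp: connection refused")
	errTimeout = context.DeadlineExceeded
)

// outcomeFor is a hypothetical helper (not part of the package API).
// A non-nil transportErr means the gateway never got a response, so the
// result is a gateway-level outcome; otherwise the upstream status decides.
func outcomeFor(upstreamStatus int, transportErr error) string {
	if transportErr != nil {
		// Assumption: a timeout is a deadline error or a net.Error
		// whose Timeout method reports true.
		var netErr net.Error
		isTimeout := errors.Is(transportErr, context.DeadlineExceeded) ||
			(errors.As(transportErr, &netErr) && netErr.Timeout())
		if isTimeout {
			return OutcomeUpstreamTimeout
		}
		return OutcomeTransportErr
	}
	switch {
	case upstreamStatus >= 500 && upstreamStatus <= 599:
		return OutcomeUpstream5xx
	case upstreamStatus >= 400 && upstreamStatus <= 499:
		return OutcomeUpstream4xx
	default:
		return OutcomeOK
	}
}

func main() {
	fmt.Println(outcomeFor(200, nil))      // ok
	fmt.Println(outcomeFor(502, nil))      // upstream_5xx
	fmt.Println(outcomeFor(0, errDial))    // transport_err
	fmt.Println(outcomeFor(0, errTimeout)) // upstream_timeout
}
```

Keeping the transport check first means a failed connection is never mistaken for an upstream status, which is the distinction the two bullets draw.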
const (
	// MetaAuditOutcome carries one of the Outcome* string constants
	// above. When present and not OutcomeOK, the audit middleware
	// sets success=false and uses the value as error_category.
	MetaAuditOutcome = "audit_outcome"

	// MetaAuditOutcomeMessage carries an optional human-readable
	// summary of the outcome (typically the upstream status text or
	// the scrubbed transport error). Used to populate
	// audit_logs.error_message when no other source is available.
	MetaAuditOutcomeMessage = "audit_outcome_message"
)
Well-known CallToolResult Meta keys read by the audit middleware to override its success / error_category derivation. Toolkits that proxy external services populate these so the audit row reflects the real upstream outcome instead of just "the MCP tool ran." Keys are namespaced under "audit_" to keep them out of the way of other _meta consumers.
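The override rule can be sketched as follows. This is a minimal illustration, assuming the result's _meta is available as a map[string]any; the auditFields struct and the applyAuditOutcome helper are hypothetical names introduced here, and the real audit middleware may differ.

```go
package main

import "fmt"

const (
	OutcomeOK               = "ok"
	OutcomeUpstream5xx      = "upstream_5xx"
	MetaAuditOutcome        = "audit_outcome"
	MetaAuditOutcomeMessage = "audit_outcome_message"
)

// auditFields is a hypothetical subset of an audit row.
type auditFields struct {
	Success       bool
	ErrorCategory string
	ErrorMessage  string
}

// applyAuditOutcome applies the override described above: when the
// outcome key is present and not OutcomeOK, the row is marked failed and
// the outcome becomes its error category. The message key only fills
// ErrorMessage when nothing else has set it.
func applyAuditOutcome(meta map[string]any, row auditFields) auditFields {
	outcome, ok := meta[MetaAuditOutcome].(string)
	if !ok || outcome == OutcomeOK {
		return row // key absent (or not a string), or the upstream call succeeded
	}
	row.Success = false
	row.ErrorCategory = outcome
	if msg, ok := meta[MetaAuditOutcomeMessage].(string); ok && row.ErrorMessage == "" {
		row.ErrorMessage = msg
	}
	return row
}

func main() {
	meta := map[string]any{
		MetaAuditOutcome:        OutcomeUpstream5xx,
		MetaAuditOutcomeMessage: "502 Bad Gateway",
	}
	row := applyAuditOutcome(meta, auditFields{Success: true})
	fmt.Println(row.Success, row.ErrorCategory, row.ErrorMessage)
}
```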
const (
	StatusClass2xx   = "2xx"
	StatusClass3xx   = "3xx"
	StatusClass4xx   = "4xx"
	StatusClass5xx   = "5xx"
	StatusClassOther = "other"
)
HTTP status class labels for outbound calls. The "other" bucket covers transport-level failures (status code 0) and the rarely-seen 1xx informational range. Recording the raw status code as a label would explode cardinality.
const (
	CategoryAuth     = "authentication_failed"
	CategoryAuthz    = "authorization_denied"
	CategoryDeclined = "user_declined"
)
Category constants recognized by ClassifyError (when a CategorizedError is returned) and by ClassifyToolCallResult (through its errCategory argument). These match the values pkg/middleware.ErrCategory* uses so the platform's existing error taxonomy maps to bounded metric labels without duplication.
const DefaultListenAddr = ":9090"
DefaultListenAddr is the address the /metrics listener binds to when OTEL_METRICS_ADDR is unset. Port 9090 is the conventional Prometheus scrape port and does not collide with the platform's main HTTP port (8080 by default).
Variables ¶
This section is empty.
Functions ¶
func ClassifyError ¶
func ClassifyError(err error) string
ClassifyError maps an error returned from a tool handler (or from any internal stage of the call) to a bounded status_category label. A nil error yields StatusOK.
The classifier prefers a CategorizedError's ErrorCategory() over string inspection so the platform's error taxonomy stays authoritative. Categories the classifier has no mapping for fall through to StatusInternalErr. A category that is valid in the platform taxonomy but unmapped here is a signal that the taxonomy and the classifier have drifted; the deliberate bucket makes the drift visible on a dashboard.
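A minimal sketch of this kind of classifier is shown below. The shape of the CategorizedError interface, the categorized helper type, and the lowercase classifyError are assumptions for illustration. The documentation does not say which label user_declined maps to, so grouping it with authz_err here is a guess, as is sending uncategorized errors to internal_err.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	StatusOK          = "ok"
	StatusAuthErr     = "auth_err"
	StatusAuthzErr    = "authz_err"
	StatusInternalErr = "internal_err"

	CategoryAuth     = "authentication_failed"
	CategoryAuthz    = "authorization_denied"
	CategoryDeclined = "user_declined"
)

// CategorizedError is an assumed shape for the interface of the same name.
type CategorizedError interface {
	error
	ErrorCategory() string
}

// categorized is a hypothetical error type that carries a category.
type categorized struct {
	category string
	msg      string
}

func (e categorized) Error() string         { return e.msg }
func (e categorized) ErrorCategory() string { return e.category }

// wrap adds context while keeping the categorized error discoverable.
func wrap(err error) error { return fmt.Errorf("tool call: %w", err) }

// classifyError reads the category through errors.As instead of matching
// on the error string, so wrapped errors are classified as well.
func classifyError(err error) string {
	if err == nil {
		return StatusOK
	}
	var ce CategorizedError
	if errors.As(err, &ce) {
		switch ce.ErrorCategory() {
		case CategoryAuth:
			return StatusAuthErr
		case CategoryAuthz, CategoryDeclined: // user_declined mapping is a guess
			return StatusAuthzErr
		}
	}
	// Unmapped categories land here on purpose; in this sketch plain
	// errors without a category do too (an assumption).
	return StatusInternalErr
}

func main() {
	fmt.Println(classifyError(nil))                                         // ok
	fmt.Println(classifyError(wrap(categorized{CategoryAuth, "no token"}))) // auth_err
	fmt.Println(classifyError(categorized{"brand_new_category", "?"}))      // internal_err
}
```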
func ClassifyToolCallResult ¶
func ClassifyToolCallResult(err error, isToolError bool, errCategory string) string
ClassifyToolCallResult maps the (err, isToolError, errCategory) triple from an MCP tool call to a bounded status_category. This is the shape pkg/middleware.MCPAuditMiddleware already computes, so the metrics middleware can pass through the same fields without re-deriving them.
Logic:
- err != nil → ClassifyError(err) (protocol-level failure)
- !isToolError → StatusOK
- isToolError with a recognized category → mapped label
- isToolError without a category → StatusUpstreamErr (most tool-level errors are upstream — Trino query failures, S3 access errors, DataHub fetch errors, etc.)
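The decision order above can be sketched as below. This is an illustration only: classifyError is reduced to a stub, the category table (including the user_declined entry) is assumed, and the sketch sends an unrecognized non-empty category to upstream_err, a case the list does not cover.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	StatusOK          = "ok"
	StatusAuthErr     = "auth_err"
	StatusAuthzErr    = "authz_err"
	StatusUpstreamErr = "upstream_err"
	StatusInternalErr = "internal_err"
)

// errProtocol is a sample protocol-level failure used by the demo.
var errProtocol = errors.New("invalid JSON-RPC request")

// categoryLabels is an assumed mapping from platform categories to labels.
var categoryLabels = map[string]string{
	"authentication_failed": StatusAuthErr,
	"authorization_denied":  StatusAuthzErr,
	"user_declined":         StatusAuthzErr, // guess, not stated in the docs
}

// classifyError is a stub standing in for the real classifier.
func classifyError(err error) string {
	if err == nil {
		return StatusOK
	}
	return StatusInternalErr
}

// classifyToolCallResult applies the four rules in the order listed.
func classifyToolCallResult(err error, isToolError bool, errCategory string) string {
	if err != nil {
		return classifyError(err) // protocol-level failure
	}
	if !isToolError {
		return StatusOK
	}
	if label, ok := categoryLabels[errCategory]; ok {
		return label
	}
	return StatusUpstreamErr // tool-level error with no mapped category
}

func main() {
	fmt.Println(classifyToolCallResult(nil, false, ""))                     // ok
	fmt.Println(classifyToolCallResult(nil, true, "authorization_denied")) // authz_err
	fmt.Println(classifyToolCallResult(nil, true, ""))                      // upstream_err
	fmt.Println(classifyToolCallResult(errProtocol, false, ""))             // internal_err
}
```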
func HTTPStatusCategory ¶
func HTTPStatusCategory(status int, transportErr error) string
HTTPStatusCategory returns the status_category label for an outbound HTTP call. 2xx and 3xx are treated as OK; 4xx and 5xx as upstream errors. Transport errors (status 0) are upstream errors too — the upstream did not respond.
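A small sketch of that mapping follows. The lowercase httpStatusCategory is a stand-in written for this example; treating 1xx and out-of-range codes as upstream errors is an assumption, since the description only covers 2xx through 5xx and status 0.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	StatusOK          = "ok"
	StatusUpstreamErr = "upstream_err"
)

// errDial is a sample transport failure used by the demo.
var errDial = errors.New("dial tcp: connection refused")

// httpStatusCategory collapses an outbound result into two labels.
func httpStatusCategory(status int, transportErr error) string {
	if transportErr != nil || status == 0 {
		return StatusUpstreamErr // no response was received
	}
	if status >= 200 && status <= 399 {
		return StatusOK // 2xx and 3xx
	}
	// 4xx and 5xx; 1xx and out-of-range codes also land here (assumption).
	return StatusUpstreamErr
}

func main() {
	fmt.Println(httpStatusCategory(204, nil))   // ok
	fmt.Println(httpStatusCategory(302, nil))   // ok
	fmt.Println(httpStatusCategory(503, nil))   // upstream_err
	fmt.Println(httpStatusCategory(0, errDial)) // upstream_err
}
```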
func HTTPStatusClass ¶
func HTTPStatusClass(status int) string
HTTPStatusClass returns the bounded class label for an HTTP status code. Status 0 is reserved for transport-level errors (no response received); it maps to StatusClassOther so it is recordable without inflating the 5xx bucket.
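One way to implement such a bucketing is integer division by 100, as in the sketch below. The lowercase httpStatusClass is illustrative and may not match the package's actual implementation.

```go
package main

import "fmt"

const (
	StatusClass2xx   = "2xx"
	StatusClass3xx   = "3xx"
	StatusClass4xx   = "4xx"
	StatusClass5xx   = "5xx"
	StatusClassOther = "other"
)

// httpStatusClass buckets a status code by its hundreds digit. Status 0
// (no response), the 1xx range, and anything out of range map to "other".
func httpStatusClass(status int) string {
	switch status / 100 {
	case 2:
		return StatusClass2xx
	case 3:
		return StatusClass3xx
	case 4:
		return StatusClass4xx
	case 5:
		return StatusClass5xx
	default:
		return StatusClassOther
	}
}

func main() {
	for _, code := range []int{0, 101, 200, 301, 404, 503} {
		fmt.Println(code, httpStatusClass(code))
	}
}
```

With five possible values, the label stays bounded no matter how many distinct status codes upstreams return.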
Types ¶
type APIGatewayAttrs ¶
APIGatewayAttrs is the bounded label set for outbound HTTP from the apigateway toolkit. Connection is operator-configured (small set); the URL, path, query string, and raw status code are NOT recorded as labels — they would be cardinality bombs and live on trace spans instead.
type CategorizedError ¶
CategorizedError lets call sites attach a category to an error that the metrics layer can read without a string-match. This mirrors the pattern used by pkg/middleware's PlatformError so the existing auth/authz/declined categories surface in metrics without a second classification scheme.
type Config ¶
type Config struct {
	// Enabled gates the entire subsystem. When false, New returns a
	// Metrics value whose Record methods are no-ops, the listener is
	// not started, and no OTel MeterProvider is constructed.
	Enabled bool

	// ListenAddr is the bind address for the /metrics HTTP listener,
	// e.g. ":9090" or "127.0.0.1:9090". Ignored when Enabled is false.
	ListenAddr string
}
Config holds the operator-configurable knobs for the metrics subsystem. Phase 1 keeps this minimal; tracing and per-toolkit instrumentation in later phases may add fields.
func ConfigFromEnv ¶
func ConfigFromEnv() Config
ConfigFromEnv reads the observability configuration from environment variables. Unset or unparsable values fall back to the defaults so the platform can boot even with a partial configuration.
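The fallback behavior might look roughly like the sketch below. OTEL_METRICS_ADDR is the variable named in this documentation; OTEL_METRICS_ENABLED is a hypothetical name used only for illustration, and defaulting to disabled is also an assumption. The lookup function is injected so the logic can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const DefaultListenAddr = ":9090"

type Config struct {
	Enabled    bool
	ListenAddr string
}

// mapLookup adapts a map to the os.LookupEnv signature for demos and tests.
func mapLookup(env map[string]string) func(string) (string, bool) {
	return func(key string) (string, bool) {
		v, ok := env[key]
		return v, ok
	}
}

// configFromLookup starts from defaults and overrides only the values
// that are set and parse cleanly.
func configFromLookup(lookup func(string) (string, bool)) Config {
	cfg := Config{Enabled: false, ListenAddr: DefaultListenAddr}
	// OTEL_METRICS_ENABLED is a hypothetical variable name.
	if raw, ok := lookup("OTEL_METRICS_ENABLED"); ok {
		if enabled, err := strconv.ParseBool(raw); err == nil {
			cfg.Enabled = enabled
		}
	}
	if addr, ok := lookup("OTEL_METRICS_ADDR"); ok && addr != "" {
		cfg.ListenAddr = addr
	}
	return cfg
}

// configFromEnv reads from the real process environment.
func configFromEnv() Config { return configFromLookup(os.LookupEnv) }

func main() {
	env := map[string]string{
		"OTEL_METRICS_ENABLED": "not-a-bool", // unparsable, so the default is kept
		"OTEL_METRICS_ADDR":    "127.0.0.1:9464",
	}
	fmt.Printf("%+v\n", configFromLookup(mapLookup(env)))
	fmt.Printf("%+v\n", configFromEnv())
}
```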
type Listener ¶
type Listener struct {
// contains filtered or unexported fields
}
Listener runs the dedicated HTTP server that exposes /metrics. The server is separate from the platform's main HTTP listener so that:
- scrape traffic does not share the MCP/admin/portal auth path,
- the metrics port can sit behind a NetworkPolicy (or be unreachable from outside the cluster) without affecting client-facing routes,
- a slow or stuck scraper cannot starve the main listener's accept loop.
Listener is a no-op when the underlying Metrics is nil or the listen address is empty; callers can mount it unconditionally.
func NewListener ¶
NewListener constructs a Listener for the supplied Metrics. The listener serves only /metrics on its mux; all other paths return 404. When metrics are disabled NewListener returns nil so callers can mount it unconditionally and observe a nil receiver as the "disabled" signal.
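A dedicated listener of this kind can be sketched with the standard library alone. The listener type, its Start and Shutdown methods, and the stubMetrics handler below are hypothetical stand-ins, since the real Listener's fields and methods are not shown here. Registering the exact pattern "/metrics" on a fresh ServeMux is what makes every other path return 404.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// stubMetrics stands in for the real Prometheus handler.
var stubMetrics = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprintln(w, "# metrics would be written here")
})

// newMetricsMux serves only the exact path /metrics.
func newMetricsMux(metrics http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/metrics", metrics)
	return mux
}

// listener is a hypothetical stand-in for the package's Listener.
type listener struct{ srv *http.Server }

// newListener returns nil when there is nothing to serve.
func newListener(addr string, metrics http.Handler) *listener {
	if metrics == nil || addr == "" {
		return nil
	}
	return &listener{srv: &http.Server{
		Addr:              addr,
		Handler:           newMetricsMux(metrics),
		ReadHeaderTimeout: 5 * time.Second,
	}}
}

// Start blocks serving requests; on a nil receiver it returns immediately.
func (l *listener) Start() error {
	if l == nil {
		return nil
	}
	if err := l.srv.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

// Shutdown stops the server gracefully; it is a no-op on a nil receiver.
func (l *listener) Shutdown(ctx context.Context) error {
	if l == nil {
		return nil
	}
	return l.srv.Shutdown(ctx)
}

// statusFor runs one GET through a handler and returns the status code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newMetricsMux(stubMetrics)
	fmt.Println(statusFor(mux, "/metrics")) // 200
	fmt.Println(statusFor(mux, "/admin"))   // 404

	disabled := newListener("", stubMetrics)
	fmt.Println(disabled == nil, disabled.Start(), disabled.Shutdown(context.Background()))
}
```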
type Metrics ¶
type Metrics struct {
// contains filtered or unexported fields
}
Metrics owns the OTel MeterProvider and the registered instruments. A nil *Metrics is a valid no-op recorder: every Record method becomes a fast nil-check, so call sites can record unconditionally without an enabled check.
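The nil-receiver pattern can be illustrated with a stripped-down stand-in. The metrics type below uses atomic counters in place of OTel instruments and drops the context and attribute parameters of the real methods, so it is a sketch of the idea rather than the package's code.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// metrics is a simplified stand-in; atomic counters replace the OTel
// instruments the real type owns.
type metrics struct {
	toolCalls  atomic.Int64
	totalNanos atomic.Int64
}

// Enabled reports whether anything will be recorded.
func (m *metrics) Enabled() bool { return m != nil }

// RecordToolCall is safe on a nil receiver: the nil check is the whole
// cost of a disabled recorder.
func (m *metrics) RecordToolCall(d time.Duration) {
	if m == nil {
		return
	}
	m.toolCalls.Add(1)
	m.totalNanos.Add(int64(d))
}

// count returns the number of recorded calls (0 when disabled).
func (m *metrics) count() int64 {
	if m == nil {
		return 0
	}
	return m.toolCalls.Load()
}

func main() {
	var off *metrics // what a disabled configuration would hand back
	off.RecordToolCall(25 * time.Millisecond)
	fmt.Println(off.Enabled(), off.count()) // false 0

	on := &metrics{}
	on.RecordToolCall(25 * time.Millisecond)
	fmt.Println(on.Enabled(), on.count()) // true 1
}
```

Because the method has a pointer receiver and checks for nil before touching any field, calling it on a nil pointer is well defined in Go.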
func New ¶
New builds a Metrics instance from the supplied config. When cfg.Enabled is false, New returns (nil, nil), so callers receive a no-op recorder without an error path; this keeps the boot sequence in cmd/mcp-data-platform simple when metrics are off. The nil no-op shape is intentional and documented on every Record method, so callers can invoke them unconditionally.
When enabled, New constructs a fresh prometheus.Registry (NOT the default registerer) so the platform's metrics are isolated from any other library that may publish to the default registry. The Go runtime and process collectors are registered explicitly so "go_goroutines", "process_cpu_seconds_total", and friends are available on the same /metrics endpoint without extra wiring.
func (*Metrics) DecInflightToolCalls ¶
func (m *Metrics) DecInflightToolCalls(ctx context.Context)
DecInflightToolCalls decrements the in-flight gauge. Nil-safe.
func (*Metrics) Enabled ¶
func (m *Metrics) Enabled() bool
Enabled reports whether the recorder is active. The middleware uses this only to skip building label sets when nothing will be recorded; Record methods themselves are nil-safe.
func (*Metrics) Handler ¶
func (m *Metrics) Handler() http.Handler
Handler returns the /metrics HTTP handler. When m is nil it returns http.NotFoundHandler(), so cmd/main can mount the handler unconditionally.
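The nil fallback might be written along these lines. The metrics struct and its handler field are assumptions made for the sketch; only the use of the standard library's http.NotFoundHandler comes from the description above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// metrics is a simplified stand-in holding a ready-made scrape handler.
type metrics struct{ handler http.Handler }

// Handler never returns nil: a disabled (nil) recorder answers 404.
func (m *metrics) Handler() http.Handler {
	if m == nil || m.handler == nil {
		return http.NotFoundHandler()
	}
	return m.handler
}

// statusFor runs one GET through a handler and returns the status code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	var off *metrics
	fmt.Println(statusFor(off.Handler(), "/metrics")) // 404

	on := &metrics{handler: http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "# metrics would be written here")
	})}
	fmt.Println(statusFor(on.Handler(), "/metrics")) // 200
}
```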
func (*Metrics) IncInflightToolCalls ¶
func (m *Metrics) IncInflightToolCalls(ctx context.Context)
IncInflightToolCalls increments the in-flight gauge. Paired with DecInflightToolCalls in a defer at the tool-call middleware so the gauge cannot leak even on panic. Nil-safe.
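The increment-then-deferred-decrement pairing can be sketched as follows. The metrics type uses an atomic counter in place of the real gauge and omits the context parameter; withInflight and runRecovering are hypothetical helpers written to show that the deferred decrement still runs when the handler panics.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// metrics is a simplified stand-in; an atomic counter replaces the gauge.
type metrics struct{ inflight atomic.Int64 }

func (m *metrics) IncInflightToolCalls() {
	if m == nil {
		return
	}
	m.inflight.Add(1)
}

func (m *metrics) DecInflightToolCalls() {
	if m == nil {
		return
	}
	m.inflight.Add(-1)
}

// inflightNow reads the current gauge value (0 when disabled).
func (m *metrics) inflightNow() int64 {
	if m == nil {
		return 0
	}
	return m.inflight.Load()
}

// withInflight wraps a handler. The deferred decrement runs on normal
// return and during panic unwinding alike.
func withInflight(m *metrics, next func() error) error {
	m.IncInflightToolCalls()
	defer m.DecInflightToolCalls()
	return next()
}

// runRecovering calls withInflight and reports whether next panicked.
func runRecovering(m *metrics, next func() error) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	_ = withInflight(m, next)
	return false
}

func main() {
	m := &metrics{}
	_ = withInflight(m, func() error {
		fmt.Println("during call:", m.inflightNow()) // during call: 1
		return nil
	})
	panicked := runRecovering(m, func() error { panic("handler bug") })
	fmt.Println(panicked, m.inflightNow()) // true 0
}
```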
func (*Metrics) RecordAPIGatewayOutbound ¶
func (m *Metrics) RecordAPIGatewayOutbound(ctx context.Context, attrs APIGatewayAttrs, duration time.Duration)
RecordAPIGatewayOutbound records one outbound HTTP observation. Nil-safe.
func (*Metrics) RecordToolCall ¶
func (m *Metrics) RecordToolCall(ctx context.Context, attrs ToolCallAttrs, duration time.Duration)
RecordToolCall records one tool-call observation. Nil-safe.
type ToolCallAttrs ¶
ToolCallAttrs is the bounded label set for tool-call metrics. The metrics layer never reads request bodies, user identifiers, or session IDs — those are span attributes (phase 2) and audit log fields, not Prometheus labels.