observability

package
v1.64.1 Latest
Published: May 19, 2026 License: Apache-2.0 Imports: 18 Imported by: 0

Documentation

Overview

Package observability provides OpenTelemetry-based metrics for the mcp-data-platform server.

Phase 1 instruments two chokepoints: the MCP tool-call middleware and the apigateway outbound HTTP path. Metrics are exported in Prometheus format on a separate HTTP listener so scrape traffic is isolated from the main MCP/HTTP listener.

Configuration is environment-only in this phase to keep the surface small. See ConfigFromEnv for the recognized variables.

Constants

const (
	StatusOK            = "ok"
	StatusAuthErr       = "auth_err"
	StatusAuthzErr      = "authz_err"
	StatusValidationErr = "validation_err"
	StatusUpstreamErr   = "upstream_err"
	StatusInternalErr   = "internal_err"
)

Status category labels for tool calls and outbound HTTP. The set is closed and small so total label cardinality on counters and histograms stays bounded.

const (
	OutcomeOK              = "ok"
	OutcomeUpstream4xx     = "upstream_4xx"
	OutcomeUpstream5xx     = "upstream_5xx"
	OutcomeTransportErr    = "transport_err"
	OutcomeUpstreamTimeout = "upstream_timeout"
)

Audit outcome categories for upstream-proxying toolkits (e.g. the apigateway). These are bounded labels that distinguish gateway-level failure (the gateway could not reach the upstream) from upstream-level failure (the upstream responded with an error status). The two are fundamentally different operational concerns and should not share a status code or a success boolean:

  • The gateway returning 502 means the gateway broke.
  • The upstream returning 502 (proxied through the gateway as wire 200 with the upstream code in the body) means the upstream broke. The gateway did its job.

audit_logs.error_category and the Phase 1 status_category label adopt these constants so dashboards can alert on each independently.
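The split between gateway-level and upstream-level failure can be sketched as a small classifier. The documentation does not describe a helper for this, so outcomeFor is a hypothetical name, and treating context.DeadlineExceeded as the timeout signal is an assumption made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Sample errors used by main below.
var (
	errRefused = errors.New("connection refused")
	errTimeout = fmt.Errorf("call upstream: %w", context.DeadlineExceeded)
)

// outcomeFor is a hypothetical helper, not part of the package. It maps
// the result of one proxied call to one of the documented Outcome* values.
func outcomeFor(upstreamStatus int, transportErr error) string {
	switch {
	case errors.Is(transportErr, context.DeadlineExceeded):
		// Assumption: a deadline error is how a timeout is detected.
		return "upstream_timeout"
	case transportErr != nil:
		// Gateway-level failure: no response was received.
		return "transport_err"
	case upstreamStatus >= 500:
		// Upstream-level failure: the upstream answered with an error.
		return "upstream_5xx"
	case upstreamStatus >= 400:
		return "upstream_4xx"
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(outcomeFor(200, nil))
	fmt.Println(outcomeFor(502, nil))
	fmt.Println(outcomeFor(0, errRefused))
	fmt.Println(outcomeFor(0, errTimeout))
}
```

Keeping transport failures in their own bucket is what lets a dashboard tell "the gateway could not reach the upstream" apart from "the upstream answered with an error".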

const (
	// MetaAuditOutcome carries one of the Outcome* string constants
	// above. When present and not OutcomeOK, the audit middleware
	// sets success=false and uses the value as error_category.
	MetaAuditOutcome = "audit_outcome"

	// MetaAuditOutcomeMessage carries an optional human-readable
	// summary of the outcome (typically the upstream status text or
	// the scrubbed transport error). Used to populate
	// audit_logs.error_message when no other source is available.
	MetaAuditOutcomeMessage = "audit_outcome_message"
)

Well-known CallToolResult Meta keys read by the audit middleware to override its success / error_category derivation. Toolkits that proxy external services populate these so the audit row reflects the real upstream outcome instead of just "the MCP tool ran." Keys are namespaced under "audit_" to keep them out of the way of other _meta consumers.
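As a rough illustration of the override described above, the following sketch applies the two Meta keys to an audit row. The auditRow type, the applyOutcome function, and the map[string]any shape of the Meta value are assumptions made for this example; treating an empty outcome string as absent is also an assumption.

```go
package main

import "fmt"

const (
	metaAuditOutcome        = "audit_outcome"
	metaAuditOutcomeMessage = "audit_outcome_message"
	outcomeOK               = "ok"
)

// auditRow is a hypothetical stand-in for the fields the audit
// middleware writes to audit_logs.
type auditRow struct {
	Success       bool
	ErrorCategory string
	ErrorMessage  string
}

// applyOutcome overrides the row when the toolkit reported a non-OK
// outcome. A missing key or an OutcomeOK value leaves the row unchanged.
func applyOutcome(row auditRow, meta map[string]any) auditRow {
	outcome, ok := meta[metaAuditOutcome].(string)
	if !ok || outcome == "" || outcome == outcomeOK {
		return row
	}
	row.Success = false
	row.ErrorCategory = outcome
	// The message is only a fallback when no other source set one.
	if row.ErrorMessage == "" {
		if msg, ok := meta[metaAuditOutcomeMessage].(string); ok {
			row.ErrorMessage = msg
		}
	}
	return row
}

func main() {
	meta := map[string]any{
		metaAuditOutcome:        "upstream_5xx",
		metaAuditOutcomeMessage: "502 Bad Gateway",
	}
	fmt.Printf("%+v\n", applyOutcome(auditRow{Success: true}, meta))
	fmt.Printf("%+v\n", applyOutcome(auditRow{Success: true}, nil))
}
```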

const (
	StatusClass2xx   = "2xx"
	StatusClass3xx   = "3xx"
	StatusClass4xx   = "4xx"
	StatusClass5xx   = "5xx"
	StatusClassOther = "other"
)

HTTP status class labels for outbound calls. The "other" bucket covers transport-level failures (status code 0) and the rarely-seen 1xx informational range. Recording the raw status code as a label would explode cardinality.

const (
	CategoryAuth     = "authentication_failed"
	CategoryAuthz    = "authorization_denied"
	CategoryDeclined = "user_declined"
)

Category constants recognized by ClassifyError (when a CategorizedError is returned) and by ClassifyToolCallResult (through its errCategory argument). These match the values pkg/middleware.ErrCategory* uses so the platform's existing error taxonomy maps to bounded metric labels without duplication.

const DefaultListenAddr = ":9090"

DefaultListenAddr is the address the /metrics listener binds to when OTEL_METRICS_ADDR is unset. Port 9090 is the conventional Prometheus scrape port and does not collide with the platform's main HTTP port (8080 by default).

Variables

This section is empty.

Functions

func ClassifyError

func ClassifyError(err error) string

ClassifyError maps an error returned from a tool handler (or from any internal stage of the call) to a bounded status_category label. A nil error yields StatusOK.

The classifier prefers a CategorizedError's ErrorCategory() over string inspection so the platform's error taxonomy stays authoritative. Categories the metrics package does not map fall through to StatusInternalErr. A category that the platform defines but the classifier does not map signals that the taxonomy and the classifier have drifted, and the deliberate bucket makes that drift visible on a dashboard.
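A minimal sketch of this classification, written as a local stand-in rather than the package's source. It assumes the category is found with errors.As (so wrapped errors are also classified) and that an error with no category lands in the internal bucket. The documentation does not state which label "user_declined" maps to, so the sketch leaves it unmapped. catErr and wrap are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// categorizedError mirrors the documented CategorizedError interface.
type categorizedError interface {
	error
	ErrorCategory() string
}

// catErr is a hypothetical error type that carries a category.
type catErr struct{ category string }

func (e catErr) Error() string         { return "categorized: " + e.category }
func (e catErr) ErrorCategory() string { return e.category }

// wrap adds context the way a call site typically would.
func wrap(err error) error { return fmt.Errorf("tool handler: %w", err) }

// classifyError is a stand-in for the documented ClassifyError behavior.
func classifyError(err error) string {
	if err == nil {
		return "ok"
	}
	var ce categorizedError
	if errors.As(err, &ce) {
		switch ce.ErrorCategory() {
		case "authentication_failed":
			return "auth_err"
		case "authorization_denied":
			return "authz_err"
		}
	}
	// Unmapped categories and uncategorized errors share one bucket,
	// so drift shows up as growth in internal_err.
	return "internal_err"
}

func main() {
	fmt.Println(classifyError(nil))
	fmt.Println(classifyError(wrap(catErr{category: "authentication_failed"})))
	fmt.Println(classifyError(catErr{category: "quota_exceeded"}))
}
```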

func ClassifyToolCallResult

func ClassifyToolCallResult(err error, isToolError bool, errCategory string) string

ClassifyToolCallResult maps the (err, isToolError, errCategory) triple from an MCP tool call to a bounded status_category. This is the shape pkg/middleware.MCPAuditMiddleware already computes, so the metrics middleware can pass through the same fields without re-deriving them.

Logic:

  • err != nil → ClassifyError(err) (protocol-level failure)
  • !isToolError → StatusOK
  • isToolError with a recognized category → mapped label
  • isToolError without a category → StatusUpstreamErr (most tool-level errors are upstream — Trino query failures, S3 access errors, DataHub fetch errors, etc.)
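The four rules above can be sketched as follows. This is a local stand-in, not the package's source: the err != nil branch returns a fixed label where the real function delegates to ClassifyError, and sending an unrecognized non-empty category to internal_err is an assumption borrowed from the ClassifyError description.

```go
package main

import (
	"errors"
	"fmt"
)

// errProtocol is a sample protocol-level failure used by main below.
var errProtocol = errors.New("malformed request")

// classifyToolCallResult follows the documented decision order.
func classifyToolCallResult(err error, isToolError bool, errCategory string) string {
	switch {
	case err != nil:
		// Stand-in: the documented function calls ClassifyError(err) here.
		return "internal_err"
	case !isToolError:
		return "ok"
	case errCategory == "":
		// Most uncategorized tool-level errors come from upstream systems.
		return "upstream_err"
	}
	switch errCategory {
	case "authentication_failed":
		return "auth_err"
	case "authorization_denied":
		return "authz_err"
	default:
		// Assumption: unrecognized categories are bucketed as internal.
		return "internal_err"
	}
}

func main() {
	fmt.Println(classifyToolCallResult(nil, false, ""))
	fmt.Println(classifyToolCallResult(nil, true, ""))
	fmt.Println(classifyToolCallResult(nil, true, "authorization_denied"))
	fmt.Println(classifyToolCallResult(errProtocol, false, ""))
}
```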

func HTTPStatusCategory

func HTTPStatusCategory(status int, transportErr error) string

HTTPStatusCategory returns the status_category label for an outbound HTTP call. 2xx and 3xx are treated as OK; 4xx and 5xx as upstream errors. Transport errors (status 0) are upstream errors too — the upstream did not respond.
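A short sketch of the mapping described above, written as a local stand-in rather than the package's source. The description does not say how 1xx or out-of-range codes are handled, so routing them to upstream_err is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errDial is a sample transport error used by main below.
var errDial = errors.New("dial tcp: connection refused")

// statusCategory is a stand-in for the documented HTTPStatusCategory.
func statusCategory(status int, transportErr error) string {
	if transportErr != nil || status == 0 {
		// No response was received from the upstream.
		return "upstream_err"
	}
	if status >= 200 && status < 400 {
		return "ok"
	}
	// 4xx and 5xx, plus (by assumption) anything else.
	return "upstream_err"
}

func main() {
	fmt.Println(statusCategory(204, nil))
	fmt.Println(statusCategory(301, nil))
	fmt.Println(statusCategory(429, nil))
	fmt.Println(statusCategory(0, errDial))
}
```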

func HTTPStatusClass

func HTTPStatusClass(status int) string

HTTPStatusClass returns the bounded class label for an HTTP status code. Status 0 is reserved for transport-level errors (no response received); it maps to StatusClassOther so it is recordable without inflating the 5xx bucket.
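The bucketing can be sketched like this. It is a local stand-in, not the package's source. The handling of 0 and 1xx follows the description of the "other" bucket; sending codes of 600 and above to "other" as well is an assumption.

```go
package main

import "fmt"

// statusClass is a stand-in for the documented HTTPStatusClass.
func statusClass(status int) string {
	switch {
	case status >= 200 && status <= 299:
		return "2xx"
	case status >= 300 && status <= 399:
		return "3xx"
	case status >= 400 && status <= 499:
		return "4xx"
	case status >= 500 && status <= 599:
		return "5xx"
	default:
		// 0 (transport error), 1xx, and by assumption anything >= 600.
		return "other"
	}
}

func main() {
	for _, s := range []int{200, 302, 404, 503, 0, 101} {
		fmt.Printf("%d -> %s\n", s, statusClass(s))
	}
}
```

Five label values replace several dozen possible status codes, which is what keeps the per-connection series count bounded.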

Types

type APIGatewayAttrs

type APIGatewayAttrs struct {
	Connection      string
	HTTPStatusClass string
	StatusCategory  string
}

APIGatewayAttrs is the bounded label set for outbound HTTP from the apigateway toolkit. Connection is operator-configured (small set); the URL, path, query string, and raw status code are NOT recorded as labels — they would be cardinality bombs and live on trace spans instead.

type CategorizedError

type CategorizedError interface {
	error
	ErrorCategory() string
}

CategorizedError lets call sites attach a category to an error that the metrics layer can read without a string-match. This mirrors the pattern used by pkg/middleware's PlatformError so the existing auth/authz/declined categories surface in metrics without a second classification scheme.

type Config

type Config struct {
	// Enabled gates the entire subsystem. When false, New returns a
	// Metrics value whose Record methods are no-ops, the listener is
	// not started, and no OTel MeterProvider is constructed.
	Enabled bool

	// ListenAddr is the bind address for the /metrics HTTP listener,
	// e.g. ":9090" or "127.0.0.1:9090". Ignored when Enabled is false.
	ListenAddr string
}

Config holds the operator-configurable knobs for the metrics subsystem. Phase 1 keeps this minimal; tracing and per-toolkit instrumentation in later phases may add fields.

func ConfigFromEnv

func ConfigFromEnv() Config

ConfigFromEnv reads the observability configuration from environment variables. Unset or unparsable values fall back to the defaults so the platform can boot even with a partial configuration.
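The fallback behavior might look roughly like the sketch below. Only OTEL_METRICS_ADDR is named in this documentation; the OTEL_METRICS_ENABLED variable name, the disabled-by-default choice, and treating an empty address as unset are all assumptions made for illustration. The lookup function is injected so the logic can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

type config struct {
	Enabled    bool
	ListenAddr string
}

// fromMap adapts a map to the os.LookupEnv signature.
func fromMap(m map[string]string) func(string) (string, bool) {
	return func(k string) (string, bool) {
		v, ok := m[k]
		return v, ok
	}
}

// configFrom starts from defaults and overrides only values that parse.
func configFrom(lookup func(string) (string, bool)) config {
	cfg := config{Enabled: false, ListenAddr: ":9090"}
	// Hypothetical variable name; the real one is listed on ConfigFromEnv.
	if v, ok := lookup("OTEL_METRICS_ENABLED"); ok {
		if b, err := strconv.ParseBool(v); err == nil {
			cfg.Enabled = b
		}
	}
	if v, ok := lookup("OTEL_METRICS_ADDR"); ok && v != "" {
		cfg.ListenAddr = v
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", configFrom(os.LookupEnv))
	fmt.Printf("%+v\n", configFrom(fromMap(map[string]string{
		"OTEL_METRICS_ENABLED": "true",
		"OTEL_METRICS_ADDR":    "127.0.0.1:9100",
	})))
}
```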

type Listener

type Listener struct {
	// contains filtered or unexported fields
}

Listener runs the dedicated HTTP server that exposes /metrics. The server is separate from the platform's main HTTP listener so that:

  • scrape traffic does not share the MCP/admin/portal auth path,
  • the metrics port can sit behind a NetworkPolicy (or be unreachable from outside the cluster) without affecting client-facing routes,
  • a slow or stuck scraper cannot starve the main listener's accept loop.

Listener is a no-op when the underlying Metrics is nil or the listen address is empty; callers can mount it unconditionally.

func NewListener

func NewListener(m *Metrics) *Listener

NewListener constructs a Listener for the supplied Metrics. The listener serves only /metrics on its mux; all other paths return 404. When metrics are disabled NewListener returns nil so callers can mount it unconditionally and observe a nil receiver as the "disabled" signal.
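The "only /metrics, everything else 404" behavior can be reproduced with a dedicated http.ServeMux, as in this sketch. The stub handler stands in for the handler returned by Metrics.Handler, and newMetricsMux and statusFor are hypothetical helpers; whether the package builds its mux this way is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// stubHandler stands in for the real /metrics handler.
var stubHandler = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprintln(w, "# placeholder metrics body")
})

// newMetricsMux registers exactly one route. A pattern without a
// trailing slash matches only that exact path, so other paths get 404.
func newMetricsMux(h http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/metrics", h)
	return mux
}

// statusFor issues an in-memory GET and returns the response code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newMetricsMux(stubHandler)
	for _, p := range []string{"/metrics", "/", "/admin"} {
		fmt.Println(p, statusFor(mux, p))
	}
}
```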

func (*Listener) Shutdown

func (l *Listener) Shutdown(ctx context.Context) error

Shutdown gracefully stops the listener. Safe to call on a nil receiver or before Start (returns nil).

func (*Listener) Start

func (l *Listener) Start(ctx context.Context) error

Start begins serving in a background goroutine. The supplied context is observed only for the "address already in use" race during startup; long-lived shutdown should go through Shutdown.

type Metrics

type Metrics struct {
	// contains filtered or unexported fields
}

Metrics owns the OTel MeterProvider and the registered instruments. A nil *Metrics is a valid no-op recorder: every Record method becomes a fast nil-check, so call sites can record unconditionally without an enabled check.
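The nil-receiver pattern relies on the fact that Go allows calling a pointer-receiver method on a nil pointer, provided the method checks for nil before touching any field. A minimal sketch with a hypothetical recorder type (not the package's Metrics, and not safe for concurrent use):

```go
package main

import "fmt"

// recorder is a hypothetical stand-in for a metrics recorder.
type recorder struct{ toolCalls int }

// newRecorder returns nil when disabled, mirroring the (nil, nil)
// shape described for New.
func newRecorder(enabled bool) *recorder {
	if !enabled {
		return nil
	}
	return &recorder{}
}

// RecordToolCall is safe on a nil receiver: the nil check runs before
// any field access.
func (r *recorder) RecordToolCall() {
	if r == nil {
		return
	}
	r.toolCalls++
}

// Enabled reports whether recording does anything.
func (r *recorder) Enabled() bool { return r != nil }

func main() {
	off := newRecorder(false)
	off.RecordToolCall() // no panic, no effect
	fmt.Println(off.Enabled())

	on := newRecorder(true)
	on.RecordToolCall()
	fmt.Println(on.Enabled(), on.toolCalls)
}
```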

func New

func New(cfg Config) (*Metrics, error)

New builds a Metrics instance from the supplied config. When cfg.Enabled is false New returns (nil, nil) so callers receive a no-op recorder without an error path; this keeps the boot sequence simple in cmd/mcp-data-platform when metrics are off. The nil-no-op shape is intentional and documented on every Record method — callers can invoke them unconditionally.

When enabled, New constructs a fresh prometheus.Registry (NOT the default registerer) so the platform's metrics are isolated from any other library that may publish to the default registry. The Go runtime and process collectors are registered explicitly so "go_goroutines", "process_cpu_seconds_total", and friends are available on the same /metrics endpoint without extra wiring.

func (*Metrics) DecInflightToolCalls

func (m *Metrics) DecInflightToolCalls(ctx context.Context)

DecInflightToolCalls decrements the in-flight gauge. Nil-safe.

func (*Metrics) Enabled

func (m *Metrics) Enabled() bool

Enabled reports whether the recorder is active. The middleware uses this only to skip building label sets when nothing will be recorded; Record methods themselves are nil-safe.

func (*Metrics) Handler

func (m *Metrics) Handler() http.Handler

Handler returns the /metrics HTTP handler. Returns http.NotFoundHandler when m is nil so cmd/main can mount the handler unconditionally.

func (*Metrics) IncInflightToolCalls

func (m *Metrics) IncInflightToolCalls(ctx context.Context)

IncInflightToolCalls increments the in-flight gauge. Paired with DecInflightToolCalls in a defer at the tool-call middleware so the gauge cannot leak even on panic. Nil-safe.
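The defer pairing can be illustrated with a plain atomic counter in place of the OTel gauge. The inflightGauge type and callTool function are hypothetical, atomic.Int64 requires Go 1.19 or later, and the recover exists only so the demo can inspect the counter after a panic; real middleware may let the panic propagate.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// inflightGauge is a hypothetical stand-in for the in-flight gauge.
type inflightGauge struct{ n atomic.Int64 }

func (g *inflightGauge) inc() {
	if g != nil {
		g.n.Add(1)
	}
}

func (g *inflightGauge) dec() {
	if g != nil {
		g.n.Add(-1)
	}
}

func (g *inflightGauge) value() int64 {
	if g == nil {
		return 0
	}
	return g.n.Load()
}

// callTool increments before the handler runs and defers the decrement,
// so the decrement also runs while a panic unwinds the stack.
func callTool(g *inflightGauge, handler func()) (recovered any) {
	defer func() { recovered = recover() }() // demo only
	g.inc()
	defer g.dec()
	handler()
	return nil
}

func main() {
	g := &inflightGauge{}
	callTool(g, func() { fmt.Println("during call:", g.value()) })
	r := callTool(g, func() { panic("handler exploded") })
	fmt.Println("recovered:", r, "after panic:", g.value())
}
```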

func (*Metrics) RecordAPIGatewayOutbound

func (m *Metrics) RecordAPIGatewayOutbound(ctx context.Context, attrs APIGatewayAttrs, duration time.Duration)

RecordAPIGatewayOutbound records one outbound HTTP observation. Nil-safe.

func (*Metrics) RecordToolCall

func (m *Metrics) RecordToolCall(ctx context.Context, attrs ToolCallAttrs, duration time.Duration)

RecordToolCall records one tool-call observation. Nil-safe.

func (*Metrics) Shutdown

func (m *Metrics) Shutdown(ctx context.Context) error

Shutdown flushes the meter provider and releases resources. Safe to call on a nil receiver so cmd/main's shutdown path stays branch-free.

type ToolCallAttrs

type ToolCallAttrs struct {
	Tool           string
	ToolkitKind    string
	Persona        string
	StatusCategory string
}

ToolCallAttrs is the bounded label set for tool-call metrics. The metrics layer never reads request bodies, user identifiers, or session IDs — those are span attributes (phase 2) and audit log fields, not Prometheus labels.
