Documentation
Overview
Package metrics defines and registers all custom Prometheus metrics for AgentBox. All metrics are registered with the controller-runtime shared registry, so they are exposed on the same --metrics-bind-address endpoint as the controller metrics.
Constants
This section is empty.
Variables
var (
	PoolReplicasDesired  *prometheus.GaugeVec
	PoolReplicasIdle     *prometheus.GaugeVec
	PoolReplicasRunning  *prometheus.GaugeVec
	PoolReplicasStarting *prometheus.GaugeVec
	PoolReplicasStopping *prometheus.GaugeVec
	PoolReplicasFailed   *prometheus.GaugeVec
)
Pool replica gauges: the desired replica count plus one gauge per replica phase, labelled by namespace/pool/team/user.
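As a rough illustration of how per-phase gauges like these are typically fed, the sketch below counts replicas by phase and always emits a value for every phase, including zero, so that a gauge for an empty phase is reset rather than left at a stale value. The phase names are inferred from the gauge names, and `countByPhase` is a hypothetical helper, not part of this package.

```go
package main

import "fmt"

// phases lists the replica phases inferred from the gauge names above.
// This list is an assumption, not taken from the package source.
var phases = []string{"Idle", "Starting", "Running", "Stopping", "Failed"}

// countByPhase (hypothetical) returns a count for every known phase.
// Phases with no pods are reported as 0 so their gauges get reset.
func countByPhase(podPhases []string) map[string]int {
	counts := make(map[string]int, len(phases))
	for _, p := range phases {
		counts[p] = 0
	}
	for _, p := range podPhases {
		if _, known := counts[p]; known {
			counts[p]++
		}
	}
	return counts
}

func main() {
	counts := countByPhase([]string{"Idle", "Idle", "Running", "Stopping"})
	for _, p := range phases {
		// In real code each value would be written to the matching gauge,
		// for example PoolReplicasIdle.WithLabelValues(...).Set(float64(n)).
		// The label order passed to WithLabelValues is not documented here.
		fmt.Printf("%s=%d\n", p, counts[p])
	}
}
```

Emitting explicit zeros matters for gauges because a gauge keeps its last value until it is set again.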
var (
	// SandboxClaimDuration observes how long ClaimIdlePod takes.
	// outcome: "success" | "no_idle" | "timeout" | "error"
	SandboxClaimDuration *prometheus.HistogramVec
	// SandboxStartingDuration observes the image-pull / startup time (claimedAt → startedAt).
	// stop_reason label is absent here; use for P99 startup latency breakdowns.
	SandboxStartingDuration *prometheus.HistogramVec
	// SandboxRunningDuration observes actual sandbox running time (startedAt → terminatedAt).
	// stop_reason: "Completed" | "Failed" | "Canceled" | "Evicted"
	SandboxRunningDuration *prometheus.HistogramVec
	// SandboxRecycleDuration observes the Stopping→Idle recycle time (terminatedAt → recycledAt).
	SandboxRecycleDuration *prometheus.HistogramVec
)
Sandbox lifecycle histograms.
var (
	// SandboxCreateTotal counts sandbox creation attempts.
	// result: "success" | "no_idle" | "timeout" | "error"
	SandboxCreateTotal *prometheus.CounterVec
	// SandboxDeleteTotal counts sandbox deletions.
	// stop_reason: "Completed" | "Canceled" | "Failed"
	SandboxDeleteTotal *prometheus.CounterVec
	// InplaceUpdateTotal counts TriggerUpdateWithOptions calls.
	// result: "success" | "conflict" | "error"
	// (conflict covers both k8s resource version conflicts and phase mismatches)
	// target: TargetPodPhase value (e.g. "running", "idle")
	InplaceUpdateTotal *prometheus.CounterVec
)
Sandbox operation counters.
var (
	HTTPRequestsTotal   *prometheus.CounterVec
	HTTPRequestDuration *prometheus.HistogramVec
)
HTTP API metrics (Gin middleware).
var (
	// ScheduleReadyQSize is the current number of pods in the per-pool ready queue
	// (idle pods known to the scheduler, not yet dispatched).
	ScheduleReadyQSize *prometheus.GaugeVec
	// ScheduleReservationsSize is the current number of per-pool inflight reservations
	// (pods either being CAS'd or recently claimed within the TTL window).
	ScheduleReservationsSize *prometheus.GaugeVec
	// ScheduleCASOutcomeTotal counts TriggerUpdateWithOptions outcomes from the scheduler.
	// outcome: "success" | "retriable" (phase mismatch / k8s conflict) | "hard" (other errors).
	ScheduleCASOutcomeTotal *prometheus.CounterVec
	// ScheduleDispatchLatencySeconds measures the time from request enqueue to the
	// moment the CAS goroutine starts executing TriggerUpdateWithOptions.
	ScheduleDispatchLatencySeconds *prometheus.HistogramVec
	// ScheduleRefreshTotal counts ready-queue refresh attempts. outcome: "ok" | "throttled" | "error".
	ScheduleRefreshTotal *prometheus.CounterVec
	// ScheduleReservationTTLExpiredTotal counts reservations removed by TTL sweep
	// (i.e. reservations not explicitly released by the CAS outcome handler).
	ScheduleReservationTTLExpiredTotal *prometheus.CounterVec
	// ScheduleSkippedScaleDownProtectedTotal counts refreshes where pods were skipped
	// because they carried the scale-down-protected annotation.
	ScheduleSkippedScaleDownProtectedTotal *prometheus.CounterVec
	// ScheduleReadyQueueEvictedTotal counts pods discarded from the ready queue at
	// dispatch time because they were no longer present in the informer cache or had
	// transitioned out of Idle (e.g. deleted during scale-down).
	ScheduleReadyQueueEvictedTotal *prometheus.CounterVec
)
Stream scheduler metrics (pkg/lifecycle/schedule). Labels are namespace/pool/team/user. The current Pool model is per-user, so scheduler instances can retain the owning team/user when they are created.
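The three-way outcome label on ScheduleCASOutcomeTotal can be pictured as a small classification step over the error returned by the update call. The sketch below is one way to express that mapping; the sentinel errors `errConflict`, `errPhaseMismatch`, and `errUnexpected` and the function `classifyCASOutcome` are hypothetical stand-ins, since the real error types used by the scheduler are not shown in this documentation.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for whatever the real update
// call returns; they are assumptions made for this sketch.
var (
	errConflict      = errors.New("resource version conflict")
	errPhaseMismatch = errors.New("phase mismatch")
	errUnexpected    = errors.New("unexpected API error")
)

// classifyCASOutcome maps an error to the documented outcome label values:
// "success", "retriable" (phase mismatch or k8s conflict), or "hard".
func classifyCASOutcome(err error) string {
	switch {
	case err == nil:
		return "success"
	case errors.Is(err, errConflict), errors.Is(err, errPhaseMismatch):
		return "retriable"
	default:
		return "hard"
	}
}

func main() {
	// errors.Is also matches wrapped errors, so added context does not
	// change the classification.
	wrapped := fmt.Errorf("update pod: %w", errConflict)
	for _, err := range []error{nil, wrapped, errPhaseMismatch, errUnexpected} {
		fmt.Println(classifyCASOutcome(err))
	}
}
```

Keeping the label to a small fixed set of values like this bounds the number of time series the counter can produce.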
var (
	// SandboxRunningInfo is an info gauge (value always 1) that maps running sandbox IDs
	// to their pod names. Present only while the sandbox is in Running state.
	// Labels: namespace, pool, pod, sandbox_id, team, user.
	// Use for PromQL joins with kube CPU/memory metrics via namespace+pod labels.
	SandboxRunningInfo *prometheus.GaugeVec
)
Sandbox info gauges.
Functions
func GinPrometheusMiddleware
func GinPrometheusMiddleware(api string) gin.HandlerFunc
GinPrometheusMiddleware returns a Gin middleware that records HTTP request counts and latencies. api should be "native" or "e2b" to distinguish between the two API servers; this avoids path-collision ambiguity when both servers expose routes with identical patterns (e.g. /sandboxes/:id).
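To see why the api value matters, the sketch below uses a plain in-memory map as a stand-in for a labelled counter. It is not the middleware's implementation: the `requestKey` and `requestCounter` types are hypothetical, and the method and route labels are assumptions, since the label set of HTTPRequestsTotal is not documented here. The point is only that two servers sharing the pattern /sandboxes/:id stay distinguishable once api is part of the key.

```go
package main

import "fmt"

// requestKey (hypothetical) mirrors a label set a request counter might
// carry. Only the api value is documented; method and route are assumed.
type requestKey struct {
	api, method, route string
}

// requestCounter is an in-memory stand-in for a labelled counter.
type requestCounter map[requestKey]int

// observe increments the series identified by the three label values.
func (c requestCounter) observe(api, method, route string) {
	c[requestKey{api: api, method: method, route: route}]++
}

func main() {
	c := requestCounter{}
	// Both servers expose the same route pattern.
	c.observe("native", "GET", "/sandboxes/:id")
	c.observe("e2b", "GET", "/sandboxes/:id")
	c.observe("e2b", "GET", "/sandboxes/:id")

	native := requestKey{api: "native", method: "GET", route: "/sandboxes/:id"}
	e2b := requestKey{api: "e2b", method: "GET", route: "/sandboxes/:id"}
	// Without the api field these three requests would share one series.
	fmt.Println(c[native], c[e2b])
}
```

Given the signature above, each server would presumably install the middleware on its own Gin engine with something like `engine.Use(metrics.GinPrometheusMiddleware("native"))`, passing "e2b" on the other server.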
Types
This section is empty.