metrics

package
v0.0.1 Latest
Warning

This package is not in the latest version of its module.

Published: May 10, 2026 License: Apache-2.0 Imports: 5 Imported by: 0

README

pkg/metrics — Prometheus Metrics

All custom metrics are registered with the controller-runtime shared registry and exposed on the single endpoint configured by --metrics-bind-address (default :8082, plain HTTP).

Files

  • metrics.go: metric variable definitions and init() registration
  • middleware.go: GinPrometheusMiddleware(api string), a Gin middleware where api accepts "native" or "e2b"

Metrics List

Metric Name | Type | Labels | Description
--- | --- | --- | ---
agentbox_sandboxpool_replicas_desired | Gauge | namespace, pool, team, user | Desired replica count
agentbox_sandboxpool_replicas_idle | Gauge | namespace, pool, team, user | Idle replica count
agentbox_sandboxpool_replicas_running | Gauge | namespace, pool, team, user | Running replica count
agentbox_sandboxpool_replicas_starting | Gauge | namespace, pool, team, user | Starting replica count
agentbox_sandboxpool_replicas_stopping | Gauge | namespace, pool, team, user | Stopping (recycling) replica count
agentbox_sandboxpool_replicas_failed | Gauge | namespace, pool, team, user | Failed replica count
agentbox_sandbox_claim_duration_seconds | Histogram | namespace, pool, team, user, outcome | Time spent in ClaimIdlePod; outcome: success/no_idle/timeout/error
agentbox_sandbox_starting_duration_seconds | Histogram | namespace, pool, team, user, outcome | Sandbox startup duration; outcome=success: claimedAt→startedAt; outcome=canceled: claimedAt→terminatedAt (user canceled before Running)
agentbox_sandbox_running_duration_seconds | Histogram | namespace, pool, team, user, stop_reason | Actual sandbox running duration (startedAt → terminatedAt)
agentbox_sandbox_recycle_duration_seconds | Histogram | namespace, pool, team, user | Sandbox recycle duration (terminatedAt → recycledAt, i.e. Stopping→Idle image restore)
agentbox_sandbox_running_info | Gauge (always 1) | namespace, pool, pod, sandbox_id, team, user | Mapping of Running Sandbox → Pod; exists only while the Sandbox is Running, used for PromQL joins with kube metrics
agentbox_sandbox_create_total | Counter | namespace, pool, team, user, result | Creation request count; result: success/no_idle/error
agentbox_sandbox_delete_total | Counter | namespace, pool, team, user, stop_reason | Deletion count; stop_reason: Completed/Canceled/Released/Failed (includes all paths: API stop, idle timeout, OOM/Crash recycling, eviction cleanup)
agentbox_inplace_update_total | Counter | namespace, pool, target, user, team, result | In-place update attempt count; target: TargetPodPhase (running/idle); result: success/conflict/error (conflict covers k8s version conflicts and phase mismatches)
agentbox_http_requests_total | Counter | method, path, status_code, api | HTTP request count; api: native/e2b
agentbox_http_request_duration_seconds | Histogram | method, path, status_code, api | HTTP request latency
agentbox_schedule_ready_queue_size | Gauge | namespace, pool, team, user | Current number of idle pods in the per-pool scheduler ready queue (known to the scheduler, not yet dispatched)
agentbox_schedule_reservations_size | Gauge | namespace, pool, team, user | Current number of inflight reservations (pods being CAS'd or recently claimed within TTL window)
agentbox_schedule_cas_outcome_total | Counter | namespace, pool, team, user, outcome | TriggerUpdateWithOptions outcomes from the streaming scheduler; outcome: success/retriable (phase mismatch or k8s conflict)/hard (other errors)
agentbox_schedule_dispatch_latency_seconds | Histogram | namespace, pool, team, user | Time from request enqueue to CAS goroutine start (scheduler responsiveness)
agentbox_schedule_refresh_total | Counter | namespace, pool, team, user, outcome | Per-pool ready-queue refresh attempts; outcome: ok/throttled/error
agentbox_schedule_reservation_ttl_expired_total | Counter | namespace, pool, team, user | Reservations removed by TTL sweep (not explicitly released by the CAS outcome handler)
agentbox_schedule_skipped_scale_down_protected_total | Counter | namespace, pool, team, user | Pods skipped during refresh because they carry the scale-down-protected annotation
agentbox_schedule_ready_queue_evicted_total | Counter | namespace, pool, team, user | Pods discarded from the ready queue at dispatch time because they were absent from the informer cache or no longer Idle (e.g. deleted during scale-down)

The team/user labels are derived from SandboxPool.Labels["scheduling.navix.sh/team"] and SandboxPool.Labels["scheduling.navix.sh/user"], which the caller supplies through the API.
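
As an illustration of that lookup, here is a small sketch that reads the two keys from a plain label map. The ownerLabels helper is hypothetical, and returning an empty string for a missing key is an assumption; the package may handle absent labels differently.

```go
package main

import "fmt"

// Label keys as named in the README.
const (
	teamLabelKey = "scheduling.navix.sh/team"
	userLabelKey = "scheduling.navix.sh/user"
)

// ownerLabels is a hypothetical helper that extracts the team and user
// label values from a pool's label map. Missing keys yield "" (assumption).
func ownerLabels(poolLabels map[string]string) (team, user string) {
	return poolLabels[teamLabelKey], poolLabels[userLabelKey]
}

func main() {
	team, user := ownerLabels(map[string]string{
		teamLabelKey: "ml-platform",
		userLabelKey: "alice",
	})
	fmt.Printf("team=%q user=%q\n", team, user)
}
```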

The HTTP path label uses c.FullPath() (the route template, e.g., /v1/sandboxes/:id) to prevent high cardinality issues caused by specific parameter values.
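
To make the cardinality point concrete, the sketch below uses only net/http and an in-memory map instead of Gin and Prometheus, so it is a stand-in rather than the package's middleware. All type and function names are hypothetical. Requests for three different sandbox IDs are counted under the route template and end up in a single series.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync"
)

// requestKey mirrors the label set method/path/status_code/api.
type requestKey struct {
	Method, Path, Status, API string
}

// requestCounter is an in-memory stand-in for a counter vector.
type requestCounter struct {
	mu     sync.Mutex
	counts map[requestKey]int
}

func newRequestCounter() *requestCounter {
	return &requestCounter{counts: make(map[requestKey]int)}
}

func (c *requestCounter) inc(k requestKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[k]++
}

// series reports how many distinct label combinations have been seen.
func (c *requestCounter) series() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.counts)
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument counts each request under the route template rather than the
// concrete URL, so every sandbox ID maps to the same series.
func instrument(c *requestCounter, api, template string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		c.inc(requestKey{Method: r.Method, Path: template, Status: strconv.Itoa(rec.status), API: api})
	})
}

// countSeries replays GET requests for paths through an instrumented
// handler and returns the number of distinct series recorded.
func countSeries(api, template string, paths []string) int {
	counter := newRequestCounter()
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	h := instrument(counter, api, template, ok)
	for _, p := range paths {
		h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, p, nil))
	}
	return counter.series()
}

func main() {
	paths := []string{"/v1/sandboxes/a", "/v1/sandboxes/b", "/v1/sandboxes/c"}
	fmt.Println("series:", countSeries("native", "/v1/sandboxes/:id", paths)) // series: 1
}
```

Had the concrete request path been used as the label value, the same three requests would have produced three series, one per sandbox ID.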

Associating with Kube Resource Metrics using agentbox_sandbox_running_info

agentbox_sandbox_running_info is an Info-type metric (its value is always 1) that exists only when the Sandbox is in the Running phase. By joining it with native kube metrics using the namespace + pod labels, you can associate CPU/Memory usage with specific Sandboxes.

# Query CPU usage rate for a specific sandbox (5m average)
rate(container_cpu_usage_seconds_total{namespace="default", container!=""}[5m])
  * on(namespace, pod) group_left(sandbox_id)
  agentbox_sandbox_running_info{sandbox_id="<your-sandbox-id>"}

# Query memory usage for a specific sandbox
container_memory_working_set_bytes{namespace="default"}
  * on(namespace, pod) group_left(sandbox_id)
  agentbox_sandbox_running_info{sandbox_id="<your-sandbox-id>"}

# Query memory usage for all running sandboxes, annotated with sandbox_id / team / user
container_memory_working_set_bytes
  * on(namespace, pod) group_left(sandbox_id, team, user)
  agentbox_sandbox_running_info

Metric Lifecycle:

  • Set: Upon completion of Starting→Running in syncInplaceUpdatePhases (sandboxpool_controller.go)
  • Delete: Upon completion of Stopping→Idle in syncInplaceUpdatePhases, right before cleanupSandboxMetadata is called

Adding New Metrics

  1. Declare the variable in metrics.go and register it in init() (using MustRegister).
  2. Call methods like .Set() / .Inc() / .Observe() in the business logic code.
  3. Verify the metric values in the corresponding unit or integration tests.

Prometheus Scraping Configuration

config/prometheus/monitor.yaml contains the ServiceMonitor (requires Prometheus Operator to be installed in the cluster):

  • Port: http-metrics (TCP 8082, corresponding to config/default/metrics_service.yaml)
  • Protocol: HTTP (no TLS), scrape interval 30s

How to enable: Uncomment # - ../prometheus in config/default/kustomization.yaml, then run make sync-crds-to-helm.

Local validation:

curl http://localhost:8082/metrics | grep agentbox_
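
The same check can be scripted in Go, for example inside an integration test. The sketch below is an assumption-laden illustration: grepLines and fetchAgentboxMetrics are hypothetical helpers, and main serves made-up sample output from an httptest server so that it runs without a live manager. Against a real deployment, pass http://localhost:8082/metrics instead.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// grepLines returns the lines of body that contain substr, like `grep substr`.
func grepLines(body, substr string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		if strings.Contains(sc.Text(), substr) {
			out = append(out, sc.Text())
		}
	}
	return out
}

// fetchAgentboxMetrics scrapes url and keeps only lines mentioning agentbox_.
func fetchAgentboxMetrics(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return grepLines(string(raw), "agentbox_"), nil
}

func main() {
	// Fake endpoint with made-up, simplified sample values.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "go_goroutines 12\nagentbox_sandbox_create_total{result=\"success\"} 4\n")
	}))
	defer srv.Close()

	lines, err := fetchAgentboxMetrics(srv.URL)
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```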

Documentation

Overview

Package metrics defines and registers all custom Prometheus metrics for AgentBox. All metrics are registered to the controller-runtime shared registry so they are exposed via the same --metrics-bind-address endpoint as the controller metrics.

Constants

This section is empty.

Variables

var (
	PoolReplicasDesired  *prometheus.GaugeVec
	PoolReplicasIdle     *prometheus.GaugeVec
	PoolReplicasRunning  *prometheus.GaugeVec
	PoolReplicasStarting *prometheus.GaugeVec
	PoolReplicasStopping *prometheus.GaugeVec
	PoolReplicasFailed   *prometheus.GaugeVec
)

Pool replica gauges — one per replica phase, labelled by namespace/pool/team/user.

var (
	// SandboxClaimDuration observes how long ClaimIdlePod takes.
	// outcome: "success" | "no_idle" | "timeout" | "error"
	SandboxClaimDuration *prometheus.HistogramVec

	// SandboxStartingDuration observes the image-pull / startup time (claimedAt → startedAt).
	// stop_reason label is absent here; use for P99 startup latency breakdowns.
	SandboxStartingDuration *prometheus.HistogramVec

	// SandboxRunningDuration observes actual sandbox running time (startedAt → terminatedAt).
	// stop_reason: "Completed" | "Failed" | "Canceled" | "Evicted"
	SandboxRunningDuration *prometheus.HistogramVec

	// SandboxRecycleDuration observes the Stopping→Idle recycle time (terminatedAt → recycledAt).
	SandboxRecycleDuration *prometheus.HistogramVec
)

Sandbox lifecycle histograms.

var (
	// SandboxCreateTotal counts sandbox creation attempts.
	// result: "success" | "no_idle" | "timeout" | "error"
	SandboxCreateTotal *prometheus.CounterVec

	// SandboxDeleteTotal counts sandbox deletions.
	// stop_reason: "Completed" | "Canceled" | "Failed"
	SandboxDeleteTotal *prometheus.CounterVec

	// InplaceUpdateTotal counts TriggerUpdateWithOptions calls.
	// result: "success" | "conflict" | "error"
	// (conflict covers both k8s resource version conflicts and phase mismatches)
	// target: TargetPodPhase value (e.g. "running", "idle")
	InplaceUpdateTotal *prometheus.CounterVec
)

Sandbox operation counters.

var (
	HTTPRequestsTotal   *prometheus.CounterVec
	HTTPRequestDuration *prometheus.HistogramVec
)

HTTP API metrics (Gin middleware).

var (
	// ScheduleReadyQSize is the current number of pods in the per-pool ready queue
	// (idle pods known to the scheduler, not yet dispatched).
	ScheduleReadyQSize *prometheus.GaugeVec

	// ScheduleReservationsSize is the current number of per-pool inflight reservations
	// (pods either being CAS'd or recently claimed within the TTL window).
	ScheduleReservationsSize *prometheus.GaugeVec

	// ScheduleCASOutcomeTotal counts TriggerUpdateWithOptions outcomes from the scheduler.
	// outcome: "success" | "retriable" (phase mismatch / k8s conflict) | "hard" (other errors).
	ScheduleCASOutcomeTotal *prometheus.CounterVec

	// ScheduleDispatchLatencySeconds measures the time from request enqueue to the
	// moment the CAS goroutine starts executing TriggerUpdateWithOptions.
	ScheduleDispatchLatencySeconds *prometheus.HistogramVec

	// ScheduleRefreshTotal counts ready-queue refresh attempts. outcome: "ok" | "throttled" | "error".
	ScheduleRefreshTotal *prometheus.CounterVec

	// ScheduleReservationTTLExpiredTotal counts reservations removed by TTL sweep
	// (i.e. reservations not explicitly released by the CAS outcome handler).
	ScheduleReservationTTLExpiredTotal *prometheus.CounterVec

	// ScheduleSkippedScaleDownProtectedTotal counts refreshes where pods were skipped
	// because they carried the scale-down-protected annotation.
	ScheduleSkippedScaleDownProtectedTotal *prometheus.CounterVec

	// ScheduleReadyQueueEvictedTotal counts pods discarded from the ready queue at
	// dispatch time because they were no longer present in the informer cache or had
	// transitioned out of Idle (e.g. deleted during scale-down).
	ScheduleReadyQueueEvictedTotal *prometheus.CounterVec
)

Stream scheduler metrics (pkg/lifecycle/schedule). Labels are namespace/pool/team/user. The current Pool model is per-user, so scheduler instances can retain the owning team/user when they are created.

var (
	// SandboxRunningInfo is an info gauge (value always 1) that maps running sandbox IDs
	// to their pod names. Present only while the sandbox is in Running state.
	// Labels: namespace, pool, pod, sandbox_id, team, user.
	// Use for PromQL joins with kube CPU/memory metrics via namespace+pod labels.
	SandboxRunningInfo *prometheus.GaugeVec
)

Sandbox info gauges.

Functions

func GinPrometheusMiddleware

func GinPrometheusMiddleware(api string) gin.HandlerFunc

GinPrometheusMiddleware returns a Gin middleware that records HTTP request counts and latencies. api should be "native" or "e2b" to distinguish between the two API servers; this avoids path-collision ambiguity when both servers expose routes with identical patterns (e.g. /sandboxes/:id).

Types

This section is empty.
