metrics

package
Version: v0.0.6
Published: Jun 11, 2026 License: Apache-2.0 Imports: 5 Imported by: 0

README

pkg/metrics — Prometheus Metrics

All custom metrics are registered to the controller-runtime shared registry and exposed uniformly via --metrics-bind-address (default :8082, plain HTTP).

Files

  • metrics.go: metric variable definitions and init() registration
  • middleware.go: GinPrometheusMiddleware(api string), a Gin middleware where api accepts "native" or "e2b"

Metrics List

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| agentbox_sandboxpool_replicas_desired | Gauge | namespace, pool, team, user, sandbox_env | Desired replica count |
| agentbox_sandboxpool_replicas_idle | Gauge | namespace, pool, team, user, sandbox_env | Idle replica count |
| agentbox_sandboxpool_replicas_running | Gauge | namespace, pool, team, user, sandbox_env | Running replica count |
| agentbox_sandboxpool_replicas_starting | Gauge | namespace, pool, team, user, sandbox_env | Starting replica count |
| agentbox_sandboxpool_replicas_stopping | Gauge | namespace, pool, team, user, sandbox_env | Stopping (recycling) replica count |
| agentbox_sandboxpool_replicas_failed | Gauge | namespace, pool, team, user, sandbox_env | Failed replica count |
| agentbox_sandbox_claim_duration_seconds | Histogram | namespace, pool, team, user, sandbox_env, outcome | Time spent in ClaimIdlePod; outcome: success/no_idle/timeout/error |
| agentbox_sandbox_starting_duration_seconds | Histogram | namespace, pool, team, user, sandbox_env, outcome | Sandbox startup duration; outcome=success: claimedAt→startedAt; outcome=canceled: claimedAt→terminatedAt (user canceled before Running) |
| agentbox_sandbox_running_duration_seconds | Histogram | namespace, pool, team, user, sandbox_env, stop_reason | Actual sandbox running duration (startedAt → terminatedAt) |
| agentbox_sandbox_recycle_duration_seconds | Histogram | namespace, pool, team, user, sandbox_env | Sandbox recycle duration (terminatedAt → recycledAt, i.e. Stopping→Idle image restore) |
| agentbox_sandbox_running_info | Gauge (always 1) | namespace, pool, pod, sandbox_id, team, user, sandbox_env | Mapping of Running Sandbox → Pod; exists only while the Sandbox is Running, used for PromQL joins with kube metrics |
| agentbox_sandbox_create_total | Counter | namespace, pool, team, user, sandbox_env, result | Creation request count; result: success/no_idle/error |
| agentbox_sandbox_delete_total | Counter | namespace, pool, team, user, sandbox_env, stop_reason | Deletion count; stop_reason: Completed/Canceled/Released/Failed (includes all paths: API stop, idle timeout, OOM/Crash recycling, eviction cleanup) |
| agentbox_inplace_update_total | Counter | namespace, pool, target, user, team, sandbox_env, result | In-place update attempt count; target: TargetPodPhase (running/idle); result: success/conflict/error (conflict covers k8s version conflicts and phase mismatches) |
| agentbox_http_requests_total | Counter | method, path, status_code, api | HTTP request count; api: native/e2b |
| agentbox_http_request_duration_seconds | Histogram | method, path, status_code, api | HTTP request latency |
| agentbox_schedule_ready_queue_size | Gauge | namespace, pool, team, user, sandbox_env | Current number of idle pods in the per-pool scheduler ready queue (known to the scheduler, not yet dispatched) |
| agentbox_schedule_reservations_size | Gauge | namespace, pool, team, user, sandbox_env | Current number of inflight reservations (pods being CAS'd or recently claimed within TTL window) |
| agentbox_schedule_cas_outcome_total | Counter | namespace, pool, team, user, sandbox_env, outcome | TriggerUpdateWithOptions outcomes from the streaming scheduler; outcome: success/retriable (phase mismatch or k8s conflict)/hard (other errors) |
| agentbox_schedule_dispatch_latency_seconds | Histogram | namespace, pool, team, user, sandbox_env | Time from request enqueue to CAS goroutine start (scheduler responsiveness) |
| agentbox_schedule_refresh_total | Counter | namespace, pool, team, user, sandbox_env, outcome | Per-pool ready-queue refresh attempts; outcome: ok/throttled/error |
| agentbox_schedule_reservation_ttl_expired_total | Counter | namespace, pool, team, user, sandbox_env | Reservations removed by TTL sweep (not explicitly released by the CAS outcome handler) |
| agentbox_schedule_skipped_scale_down_protected_total | Counter | namespace, pool, team, user, sandbox_env | Pods skipped during refresh because they carry the scale-down-protected annotation |
| agentbox_schedule_ready_queue_evicted_total | Counter | namespace, pool, team, user, sandbox_env | Pods discarded from the ready queue at dispatch time because they were absent from the informer cache or no longer Idle (e.g. deleted during scale-down) |
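
The two outcomes of agentbox_sandbox_starting_duration_seconds measure different intervals, which is easy to miss in a table row. The following is a minimal sketch of that selection rule only, assuming the three timestamps are available as time.Time values; startingObservation, at, and notSet are hypothetical names and are not part of this package.

```go
package main

import (
	"fmt"
	"time"
)

// notSet is the zero time, used here to mean "timestamp not recorded yet".
var notSet time.Time

// at is a small helper that builds a timestamp from Unix seconds.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// startingObservation (hypothetical) returns the outcome label and the
// duration to observe for the starting-duration histogram:
//   - success:  claimedAt -> startedAt
//   - canceled: claimedAt -> terminatedAt (terminated before Running)
//
// ok is false while the sandbox is still starting.
func startingObservation(claimedAt, startedAt, terminatedAt time.Time) (outcome string, d time.Duration, ok bool) {
	switch {
	case !startedAt.IsZero():
		return "success", startedAt.Sub(claimedAt), true
	case !terminatedAt.IsZero():
		return "canceled", terminatedAt.Sub(claimedAt), true
	default:
		return "", 0, false
	}
}

func main() {
	outcome, d, _ := startingObservation(at(100), at(103), notSet)
	fmt.Println(outcome, d)

	outcome, d, _ = startingObservation(at(100), notSet, at(110))
	fmt.Println(outcome, d)
}
```

Because the canceled interval ends at termination rather than at startup, startup latency percentiles are usually best computed with a filter on outcome="success".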

The team/user labels are derived from SandboxPool.Labels["scheduling.navix.sh/team"] and ["scheduling.navix.sh/user"], which are passed by the caller via the API.

The sandbox_env label is derived from SandboxPool.Labels["agentbox.navix.sh/env"] and identifies the owning SandboxEnv. It is stamped onto every member Pool by the SandboxEnv reconciler and inherited by pods at creation time, so all per-Pool / per-Sandbox metrics carry it. Pools that pre-date Env adoption may briefly emit with an empty sandbox_env until the next reconcile.
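
As an illustration of the label derivation described above, here is a minimal sketch that reads the three values from a pool's label map. The label keys are the ones quoted in the text; poolLabelValues is a hypothetical helper and may not match how the package actually extracts them.

```go
package main

import "fmt"

// Label keys as quoted in the README text.
const (
	teamLabelKey = "scheduling.navix.sh/team"
	userLabelKey = "scheduling.navix.sh/user"
	envLabelKey  = "agentbox.navix.sh/env"
)

// poolLabelValues (hypothetical) returns the team, user, and sandbox_env
// metric label values for a pool. A missing key yields "", which corresponds
// to the empty sandbox_env that pre-Env pools may emit until reconciled.
func poolLabelValues(poolLabels map[string]string) (team, user, sandboxEnv string) {
	return poolLabels[teamLabelKey], poolLabels[userLabelKey], poolLabels[envLabelKey]
}

func main() {
	// A pool created before Env adoption: no env label has been stamped yet.
	legacy := map[string]string{
		teamLabelKey: "ml-platform",
		userLabelKey: "alice",
	}
	team, user, env := poolLabelValues(legacy)
	fmt.Printf("team=%q user=%q sandbox_env=%q\n", team, user, env)
}
```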

The HTTP path label uses c.FullPath() (the route template, e.g., /v1/sandboxes/:id) rather than the raw request path, so that concrete parameter values do not create a separate time series for every ID.
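
To make the cardinality argument concrete, the sketch below counts how many distinct label combinations (time series) a counter would hold when labelled with raw paths versus the route template. It is a stdlib-only illustration; requestKey and countSeries are hypothetical names, and no Prometheus client is involved.

```go
package main

import "fmt"

// requestKey mirrors the label set of agentbox_http_requests_total.
type requestKey struct {
	method, path, statusCode, api string
}

// countSeries returns the number of distinct label combinations, which is
// the number of time series a labelled counter would create.
func countSeries(keys []requestKey) int {
	seen := make(map[requestKey]struct{})
	for _, k := range keys {
		seen[k] = struct{}{}
	}
	return len(seen)
}

func main() {
	ids := []string{"sbx-1", "sbx-2", "sbx-3"}

	var raw, templated []requestKey
	for _, id := range ids {
		raw = append(raw, requestKey{"GET", "/v1/sandboxes/" + id, "200", "native"})
		templated = append(templated, requestKey{"GET", "/v1/sandboxes/:id", "200", "native"})
	}

	fmt.Println("raw paths:", countSeries(raw))            // raw paths: 3
	fmt.Println("route template:", countSeries(templated)) // route template: 1
}
```

With raw paths the series count grows with the number of sandbox IDs ever requested; with the template it stays fixed per route.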

Associating with Kube Resource Metrics using agentbox_sandbox_running_info

agentbox_sandbox_running_info is an Info-type metric (its value is always 1) that exists only when the Sandbox is in the Running phase. By joining it with native kube metrics using the namespace + pod labels, you can associate CPU/Memory usage with specific Sandboxes.

# Query CPU usage rate for a specific sandbox (5m average)
rate(container_cpu_usage_seconds_total{namespace="default", container!=""}[5m])
  * on(namespace, pod) group_left(sandbox_id)
  agentbox_sandbox_running_info{sandbox_id="<your-sandbox-id>"}

# Query memory usage for a specific sandbox
container_memory_working_set_bytes{namespace="default"}
  * on(namespace, pod) group_left(sandbox_id)
  agentbox_sandbox_running_info{sandbox_id="<your-sandbox-id>"}

# Query memory usage for all running sandboxes, with sandbox_id / team / user attached as labels
container_memory_working_set_bytes
  * on(namespace, pod) group_left(sandbox_id, team, user)
  agentbox_sandbox_running_info
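
For readers less familiar with PromQL vector matching, the following sketch models what the queries above compute. It is a simplified in-memory analogue, not how Prometheus evaluates the expression; podKey, usageSample, and joinOnPod are hypothetical names.

```go
package main

import "fmt"

// podKey holds the labels the join matches on: on(namespace, pod).
type podKey struct{ namespace, pod string }

// usageSample stands in for one kube container series, for example a
// container_memory_working_set_bytes sample.
type usageSample struct {
	key       podKey
	container string
	value     float64
}

// joinedSample is a usage sample enriched with the sandbox_id label copied
// from the info series, as group_left(sandbox_id) does.
type joinedSample struct {
	usageSample
	sandboxID string
}

// joinOnPod mimics `usage * on(namespace, pod) group_left(sandbox_id) info`.
// The info value is always 1, so the usage value passes through unchanged.
// Samples whose pod has no info series (no Running sandbox) are dropped.
func joinOnPod(usage []usageSample, info map[podKey]string) []joinedSample {
	var out []joinedSample
	for _, s := range usage {
		sandboxID, ok := info[s.key]
		if !ok {
			continue
		}
		out = append(out, joinedSample{usageSample: s, sandboxID: sandboxID})
	}
	return out
}

func main() {
	usage := []usageSample{
		{podKey{"default", "pool-a-0"}, "sandbox", 256e6},
		{podKey{"default", "pool-a-0"}, "sidecar", 32e6},
		{podKey{"default", "pool-a-1"}, "sandbox", 64e6}, // idle pod: no info series
	}
	info := map[podKey]string{{"default", "pool-a-0"}: "sbx-123"}

	for _, j := range joinOnPod(usage, info) {
		fmt.Printf("%s/%s container=%s sandbox_id=%s bytes=%.0f\n",
			j.key.namespace, j.key.pod, j.container, j.sandboxID, j.value)
	}
}
```

The many-to-one shape (several container series per pod, one info series per pod) is why the queries use group_left rather than a plain one-to-one match.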

Metric Lifecycle:

  • Set: Upon completion of Starting→Running in syncInplaceUpdatePhases (sandboxpool_controller.go)
  • Delete: Upon completion of Stopping→Idle in syncInplaceUpdatePhases, right before cleanupSandboxMetadata is called

Adding New Metrics

  1. Declare the variable in metrics.go and register it in init() (using MustRegister).
  2. Call methods like .Set() / .Inc() / .Observe() in the business logic code.
  3. Verify the metric values in the corresponding unit or integration tests.

Prometheus Scraping Configuration

config/prometheus/monitor.yaml contains the ServiceMonitor (requires Prometheus Operator to be installed in the cluster):

  • Port: http-metrics (TCP 8082, corresponding to config/default/metrics_service.yaml)
  • Protocol: HTTP (no TLS), scrape interval 30s

How to enable: Uncomment # - ../prometheus in config/default/kustomization.yaml, then run make sync-crds-to-helm.

Local validation:

curl http://localhost:8082/metrics | grep agentbox_
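
The same check can be scripted in Go, for example inside a smoke test. This is a minimal sketch assuming the endpoint is reachable at the default address shown above; grepLines is a hypothetical helper that behaves like the grep in the curl command.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// grepLines returns the lines of body that contain substr, in order.
func grepLines(body, substr string) []string {
	var out []string
	for _, line := range strings.Split(body, "\n") {
		if strings.Contains(line, substr) {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}

	// Default metrics address from this README; adjust if
	// --metrics-bind-address was changed.
	resp, err := client.Get("http://localhost:8082/metrics")
	if err != nil {
		fmt.Fprintln(os.Stderr, "metrics endpoint not reachable:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading response:", err)
		return
	}

	for _, line := range grepLines(string(body), "agentbox_") {
		fmt.Println(line)
	}
}
```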

Documentation

Overview

Package metrics defines and registers all custom Prometheus metrics for AgentBox. All metrics are registered to the controller-runtime shared registry so they are exposed via the same --metrics-bind-address endpoint as the controller metrics.

Constants

This section is empty.

Variables

var (
	PoolReplicasDesired  *prometheus.GaugeVec
	PoolReplicasIdle     *prometheus.GaugeVec
	PoolReplicasRunning  *prometheus.GaugeVec
	PoolReplicasStarting *prometheus.GaugeVec
	PoolReplicasStopping *prometheus.GaugeVec
	PoolReplicasFailed   *prometheus.GaugeVec
)

Pool replica gauges: one per replica phase, labelled by namespace/pool/team/user/sandbox_env.

var (
	// SandboxClaimDuration observes how long ClaimIdlePod takes.
	// outcome: "success" | "no_idle" | "timeout" | "error"
	SandboxClaimDuration *prometheus.HistogramVec

	// SandboxStartingDuration observes the image-pull / startup time.
	// outcome: "success" (claimedAt → startedAt) | "canceled" (claimedAt → terminatedAt)
	// There is no stop_reason label here; use for P99 startup latency breakdowns.
	SandboxStartingDuration *prometheus.HistogramVec

	// SandboxRunningDuration observes actual sandbox running time (startedAt → terminatedAt).
	// stop_reason: "Completed" | "Failed" | "Canceled" | "Evicted"
	SandboxRunningDuration *prometheus.HistogramVec

	// SandboxRecycleDuration observes the Stopping→Idle recycle time (terminatedAt → recycledAt).
	SandboxRecycleDuration *prometheus.HistogramVec
)

Sandbox lifecycle histograms.

var (
	// SandboxCreateTotal counts sandbox creation attempts.
	// result: "success" | "no_idle" | "timeout" | "error"
	SandboxCreateTotal *prometheus.CounterVec

	// SandboxDeleteTotal counts sandbox deletions.
	// stop_reason: "Completed" | "Canceled" | "Failed"
	SandboxDeleteTotal *prometheus.CounterVec

	// InplaceUpdateTotal counts TriggerUpdateWithOptions calls.
	// result: "success" | "conflict" | "error"
	// (conflict covers both k8s resource version conflicts and phase mismatches)
	// target: TargetPodPhase value (e.g. "running", "idle")
	InplaceUpdateTotal *prometheus.CounterVec
)

Sandbox operation counters.

var (
	HTTPRequestsTotal   *prometheus.CounterVec
	HTTPRequestDuration *prometheus.HistogramVec
)

HTTP API metrics (Gin middleware).

var (
	// ScheduleReadyQSize is the current number of pods in the per-pool ready queue
	// (idle pods known to the scheduler, not yet dispatched).
	ScheduleReadyQSize *prometheus.GaugeVec

	// ScheduleReservationsSize is the current number of per-pool inflight reservations
	// (pods either being CAS'd or recently claimed within the TTL window).
	ScheduleReservationsSize *prometheus.GaugeVec

	// ScheduleCASOutcomeTotal counts TriggerUpdateWithOptions outcomes from the scheduler.
	// outcome: "success" | "retriable" (phase mismatch / k8s conflict) | "hard" (other errors).
	ScheduleCASOutcomeTotal *prometheus.CounterVec

	// ScheduleDispatchLatencySeconds measures the time from request enqueue to the
	// moment the CAS goroutine starts executing TriggerUpdateWithOptions.
	ScheduleDispatchLatencySeconds *prometheus.HistogramVec

	// ScheduleRefreshTotal counts ready-queue refresh attempts. outcome: "ok" | "throttled" | "error".
	ScheduleRefreshTotal *prometheus.CounterVec

	// ScheduleReservationTTLExpiredTotal counts reservations removed by TTL sweep
	// (i.e. reservations not explicitly released by the CAS outcome handler).
	ScheduleReservationTTLExpiredTotal *prometheus.CounterVec

	// ScheduleSkippedScaleDownProtectedTotal counts refreshes where pods were skipped
	// because they carried the scale-down-protected annotation.
	ScheduleSkippedScaleDownProtectedTotal *prometheus.CounterVec

	// ScheduleReadyQueueEvictedTotal counts pods discarded from the ready queue at
	// dispatch time because they were no longer present in the informer cache or had
	// transitioned out of Idle (e.g. deleted during scale-down).
	ScheduleReadyQueueEvictedTotal *prometheus.CounterVec
)

Stream scheduler metrics (pkg/lifecycle/schedule). Labels are namespace/pool/team/user/sandbox_env, plus outcome where noted. The current Pool model is per-user, so each scheduler instance can retain the owning team/user from the time it is created.

var (
	// SandboxRunningInfo is an info gauge (value always 1) that maps running sandbox IDs
	// to their pod names. Present only while the sandbox is in Running state.
	// Labels: namespace, pool, pod, sandbox_id, team, user, sandbox_env.
	// Use for PromQL joins with kube CPU/memory metrics via namespace+pod labels.
	SandboxRunningInfo *prometheus.GaugeVec
)

Sandbox info gauges.

Functions

func GinPrometheusMiddleware

func GinPrometheusMiddleware(api string) gin.HandlerFunc

GinPrometheusMiddleware returns a Gin middleware that records HTTP request counts and latencies. api should be "native" or "e2b" to distinguish between the two API servers; this avoids path-collision ambiguity when both servers expose routes with identical patterns (e.g. /sandboxes/:id).
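
The sketch below shows why the api argument matters, using a net/http analogue rather than Gin so that it stays stdlib-only. It is not this package's implementation: requestCounter stands in for a Prometheus CounterVec, the route template is passed in explicitly instead of being read from c.FullPath(), latency is not recorded, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync"
)

// requestLabels mirrors the label set of agentbox_http_requests_total.
type requestLabels struct{ method, path, statusCode, api string }

// requestCounter is a stand-in for a Prometheus CounterVec.
type requestCounter struct {
	mu     sync.Mutex
	counts map[requestLabels]int
}

func newRequestCounter() *requestCounter {
	return &requestCounter{counts: make(map[requestLabels]int)}
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// middleware counts each request under (method, route template, status, api).
func (c *requestCounter) middleware(api, routeTemplate string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		key := requestLabels{req.Method, routeTemplate, strconv.Itoa(rec.status), api}
		c.mu.Lock()
		c.counts[key]++
		c.mu.Unlock()
	})
}

// okHandler and record are small helpers for exercising the middleware.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
})

func record(h http.Handler, method, target string) {
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(method, target, nil))
}

func main() {
	c := newRequestCounter()
	native := c.middleware("native", "/sandboxes/:id", okHandler)
	e2b := c.middleware("e2b", "/sandboxes/:id", okHandler)

	record(native, http.MethodGet, "/sandboxes/a")
	record(native, http.MethodGet, "/sandboxes/b")
	record(e2b, http.MethodGet, "/sandboxes/a")

	fmt.Println("native:", c.counts[requestLabels{"GET", "/sandboxes/:id", "200", "native"}]) // native: 2
	fmt.Println("e2b:", c.counts[requestLabels{"GET", "/sandboxes/:id", "200", "e2b"}])       // e2b: 1
}
```

Without the api label, all three requests would land in the same series because both servers share the /sandboxes/:id pattern.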

Types

This section is empty.
