defaults

package
v0.12.0-rc1
Warning

This package is not in the latest version of its module.

Published: Apr 24, 2026 License: Apache-2.0 Imports: 3 Imported by: 0

Documentation

Overview

Package defaults provides centralized configuration constants for the AICR system.

This package defines timeout values, retry parameters, and other configuration defaults used across the codebase. Centralizing these values ensures consistency and makes tuning easier.

Timeout Categories

Timeouts are organized by component:

  • Collector timeouts: For system data collection operations
  • Handler timeouts: For HTTP request processing
  • Server timeouts: For HTTP server configuration
  • Kubernetes timeouts: For K8s API operations
  • HTTP client timeouts: For outbound HTTP requests

Usage

Import and use constants directly:

import "github.com/NVIDIA/aicr/pkg/defaults"

ctx, cancel := context.WithTimeout(ctx, defaults.CollectorTimeout)
defer cancel()

Timeout Guidelines

When choosing timeout values:

  • Collectors: 10s default, respects parent context deadline
  • HTTP handlers: 30s for recipes, 60s for bundles
  • K8s operations: 30s for API calls, 5m for job completion
  • Server shutdown: 30s for graceful shutdown


Constants

const (
	// CollectorTimeout is the default timeout for collector operations.
	// Collectors should respect parent context deadlines when shorter.
	CollectorTimeout = 10 * time.Second

	// CollectorK8sTimeout is the timeout for Kubernetes API calls in collectors.
	// Covers 6 sequential sub-collectors (server, image, policy, node, helm, argocd).
	CollectorK8sTimeout = 60 * time.Second

	// NFDDetectionTimeout is the timeout for NFD-based hardware detection.
	// PCI enumeration and kernel module listing are fast local operations
	// reading from sysfs/procfs, so a short timeout is sufficient.
	NFDDetectionTimeout = 5 * time.Second
)

Collector timeouts for data collection operations.

const (
	// CollectorTopologyTimeout is the timeout for node topology collection.
	// Longer than standard K8s collector because of paginated node listing.
	CollectorTopologyTimeout = 90 * time.Second

	// TopologyListPageSize is the number of nodes per List API page.
	TopologyListPageSize = int64(500)
)

Node topology collector constants.

const (
	// RecipeHandlerTimeout is the timeout for recipe generation requests.
	RecipeHandlerTimeout = 30 * time.Second

	// RecipeBuildTimeout is the internal timeout for recipe building.
	// Should be less than RecipeHandlerTimeout to allow error handling.
	RecipeBuildTimeout = 25 * time.Second

	// BundleHandlerTimeout is the timeout for bundle generation requests.
	// Longer than recipe due to file I/O operations.
	BundleHandlerTimeout = 60 * time.Second

	// RecipeCacheTTL is the default cache duration for recipe responses.
	RecipeCacheTTL = 10 * time.Minute
)

Handler timeouts for HTTP request processing.

const (
	// ServerReadTimeout is the maximum duration for reading the entire
	// request, including the body.
	ServerReadTimeout = 10 * time.Second

	// ServerReadHeaderTimeout prevents slow header attacks.
	ServerReadHeaderTimeout = 5 * time.Second

	// ServerWriteTimeout is the maximum duration for writing a response.
	ServerWriteTimeout = 30 * time.Second

	// ServerIdleTimeout is the maximum duration to wait for the next request.
	ServerIdleTimeout = 120 * time.Second

	// ServerShutdownTimeout is the maximum duration for graceful shutdown.
	ServerShutdownTimeout = 30 * time.Second
)

Server timeouts for HTTP server configuration.
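These constants line up with fields of the standard library's http.Server. The following is a sketch of how they might be wired together, not the package's own server setup: newServer and the ":8080" address are hypothetical, and the constants are local copies so the example compiles on its own.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Local copies of the server constants documented on this page.
const (
	serverReadTimeout       = 10 * time.Second
	serverReadHeaderTimeout = 5 * time.Second
	serverWriteTimeout      = 30 * time.Second
	serverIdleTimeout       = 120 * time.Second
	serverMaxHeaderBytes    = 1 << 16
)

// newServer is a hypothetical constructor that maps each constant onto the
// matching http.Server field.
func newServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadTimeout:       serverReadTimeout,       // whole request, body included
		ReadHeaderTimeout: serverReadHeaderTimeout, // headers only
		WriteTimeout:      serverWriteTimeout,
		IdleTimeout:       serverIdleTimeout, // keep-alive wait for the next request
		MaxHeaderBytes:    serverMaxHeaderBytes,
	}
}

func main() {
	srv := newServer(":8080", http.NewServeMux())
	fmt.Println(srv.ReadHeaderTimeout, srv.WriteTimeout, srv.MaxHeaderBytes)
}
```

Graceful shutdown would typically call srv.Shutdown with a context bounded by ServerShutdownTimeout.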

const (
	// K8sJobCreationTimeout is the timeout for creating K8s Job resources.
	K8sJobCreationTimeout = 30 * time.Second

	// K8sPodReadyTimeout is the timeout for waiting for pods to be ready.
	// Needs headroom for image pull + scheduling in large clusters.
	K8sPodReadyTimeout = 2 * time.Minute

	// K8sJobCompletionTimeout is the default timeout for job completion.
	K8sJobCompletionTimeout = 5 * time.Minute

	// K8sCleanupTimeout is the timeout for cleanup operations.
	K8sCleanupTimeout = 30 * time.Second

	// K8sPodTerminationWaitTimeout is the maximum time to wait for a Job pod
	// to fully terminate after the Job is deleted. Prevents race conditions
	// where RBAC resources are cleaned up while the pod is still running
	// cleanup operations (e.g., chainsaw namespace deletion).
	// Must exceed the default Kubernetes terminationGracePeriodSeconds (30s).
	K8sPodTerminationWaitTimeout = 60 * time.Second
)

Kubernetes timeouts for K8s API operations.

const (
	// HTTPClientTimeout is the default total timeout for HTTP requests.
	HTTPClientTimeout = 30 * time.Second

	// HTTPConnectTimeout is the timeout for establishing connections.
	HTTPConnectTimeout = 5 * time.Second

	// HTTPTLSHandshakeTimeout is the timeout for TLS handshake.
	HTTPTLSHandshakeTimeout = 5 * time.Second

	// HTTPResponseHeaderTimeout is the timeout for reading response headers.
	HTTPResponseHeaderTimeout = 10 * time.Second

	// HTTPIdleConnTimeout is the timeout for idle connections in the pool.
	HTTPIdleConnTimeout = 90 * time.Second

	// HTTPKeepAlive is the keep-alive duration for connections.
	HTTPKeepAlive = 30 * time.Second

	// HTTPExpectContinueTimeout is the timeout for Expect: 100-continue.
	HTTPExpectContinueTimeout = 1 * time.Second
)

HTTP client timeouts for outbound requests.

const (
	// CLISnapshotTimeout is the default timeout for snapshot operations.
	CLISnapshotTimeout = 5 * time.Minute

	// InteractiveOIDCTimeout is the maximum time to wait for a user to complete
	// browser-based OIDC authentication. Prevents indefinite blocking if the
	// browser flow is started but never completed.
	InteractiveOIDCTimeout = 5 * time.Minute
)

CLI timeouts for command-line operations.

const (
	// ResourceVerificationTimeout is the timeout for verifying individual
	// expected resources exist and are healthy during deployment validation.
	ResourceVerificationTimeout = 10 * time.Second

	// ComponentRenderTimeout is the maximum time to render a single component
	// via helm template or manifest file rendering during resource discovery.
	ComponentRenderTimeout = 60 * time.Second
)

Validation phase timeouts for resource verification and component rendering.

const (
	// ChainsawAssertTimeout is the outer timeout for the chainsaw binary process.
	// Must be greater than the chainsaw-internal assert timeout (spec.timeouts.assert
	// in health check YAML files, currently 5m) to avoid killing the process early.
	ChainsawAssertTimeout = 6 * time.Minute

	// ChainsawMaxParallel is the maximum number of concurrent assertion
	// runs during component health checks.
	ChainsawMaxParallel = 4

	// AssertRetryInterval is the polling interval between health check
	// assertion retries. Assertions are retried at this interval until
	// they pass or the ChainsawAssertTimeout expires.
	AssertRetryInterval = 5 * time.Second
)

Chainsaw assertion configuration for component health checks.
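A cap such as ChainsawMaxParallel is commonly enforced with a buffered channel used as a semaphore. The sketch below illustrates that general technique under the assumption that each assertion run is an independent function call. It is not taken from the package's source, and runBounded is a hypothetical helper that also records the peak concurrency so the cap can be observed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// chainsawMaxParallel mirrors defaults.ChainsawMaxParallel.
const chainsawMaxParallel = 4

// runBounded is a hypothetical helper. It calls run for i in [0, n) with at
// most maxParallel calls in flight, and returns the peak number of
// concurrent calls it observed.
func runBounded(n, maxParallel int, run func(i int)) int {
	sem := make(chan struct{}, maxParallel) // one slot per allowed worker
	var wg sync.WaitGroup
	var mu sync.Mutex
	inFlight, peak := 0, 0

	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxParallel runs are active
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			run(i)

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return peak
}

func main() {
	peak := runBounded(12, chainsawMaxParallel, func(int) {
		time.Sleep(10 * time.Millisecond) // stand-in for one assertion run
	})
	fmt.Println(peak <= chainsawMaxParallel) // true
}
```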

const (
	// CheckExecutionTimeout is the parent context timeout for checks running
	// inside a K8s Job. Must be long enough for the slowest behavioral check
	// and shorter than the catalog-level Job timeout (activeDeadlineSeconds).
	//
	// The ceiling is set by the cold-start inference benchmark, which runs
	// the following phases serially under the parent ctx:
	//   InferenceNamespaceTerminationWait ( 5m, prior run's namespace drain)
	// + InferenceWorkloadReadyTimeout     (10m, image pull + model load)
	// + InferenceHealthTimeout            ( 5m, endpoint readiness probe)
	// + InferencePerfPodTimeout           ( 5m, AIPerf pod scheduling)
	// + InferencePerfJobTimeout           (15m, AIPerf benchmark runtime)
	// ──────────────────────────────────────
	// = 40m worst-case phase sum; 45m ceiling gives 5m headroom for slow
	//   image registries and slog/K8s API round-trips between phases.
	// Deferred cleanup (K8sCleanupTimeout, ~30s) runs under a fresh
	// context.Background and doesn't consume this budget — but see the
	// inference-perf catalog entry for the corresponding Job-level bump.
	CheckExecutionTimeout = 45 * time.Minute

	// DRATestPodTimeout is the timeout for the DRA test pod to complete.
	// The pod runs a simple CUDA device check but may need time for image pull.
	DRATestPodTimeout = 5 * time.Minute

	// GangTestPodTimeout is the timeout for gang scheduling test pods to complete.
	// Two pods must be co-scheduled, each pulling a CUDA image and running nvidia-smi.
	GangTestPodTimeout = 5 * time.Minute
)

Conformance test timeouts for DRA and gang scheduling validation.
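The budget arithmetic in the CheckExecutionTimeout comment can be checked mechanically. The sketch below copies the five inference phase constants from this page into local constants and sums them. The helper worstCasePhaseSum is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Local copies of the constants named in the CheckExecutionTimeout comment.
const (
	inferenceNamespaceTerminationWait = 5 * time.Minute
	inferenceWorkloadReadyTimeout     = 10 * time.Minute
	inferenceHealthTimeout            = 5 * time.Minute
	inferencePerfPodTimeout           = 5 * time.Minute
	inferencePerfJobTimeout           = 15 * time.Minute

	checkExecutionTimeout = 45 * time.Minute
)

// worstCasePhaseSum adds the phases that run serially under the parent
// context during a cold-start inference benchmark.
func worstCasePhaseSum() time.Duration {
	return inferenceNamespaceTerminationWait +
		inferenceWorkloadReadyTimeout +
		inferenceHealthTimeout +
		inferencePerfPodTimeout +
		inferencePerfJobTimeout
}

func main() {
	sum := worstCasePhaseSum()
	fmt.Println(sum, checkExecutionTimeout-sum) // 40m0s 5m0s
}
```

A unit test along these lines would catch a phase timeout being raised without a matching increase to the ceiling.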

const (
	// AIServiceMetricsWaitTimeout is the maximum time to wait for GPU metrics
	// to appear in Prometheus. DCGM exporter may not have scraped yet when
	// the validator runs, especially on fresh deployments.
	AIServiceMetricsWaitTimeout = 2 * time.Minute

	// AIServiceMetricsPollInterval is the polling interval between Prometheus
	// queries when waiting for GPU metric time series to appear.
	AIServiceMetricsPollInterval = 10 * time.Second
)

AI service metrics conformance validation.

const (
	// HPAScaleTimeout is the timeout for waiting for HPA to report scaling intent.
	// The HPA needs time to read metrics and compute desired replicas.
	HPAScaleTimeout = 3 * time.Minute

	// HPAPollInterval is the interval for polling HPA status during behavioral tests.
	HPAPollInterval = 10 * time.Second
)

HPA behavioral test timeouts for conformance validation.

const (
	// KarpenterNodeTimeout is the timeout for Karpenter to provision KWOK nodes.
	KarpenterNodeTimeout = 3 * time.Minute

	// KarpenterPollInterval is the interval for polling Karpenter node provisioning.
	KarpenterPollInterval = 10 * time.Second
)

Karpenter behavioral test timeouts for conformance validation.

const (
	// TrainerCRDEstablishedTimeout is the time to wait for Kubeflow Trainer CRDs
	// to reach the Established condition after installation.
	TrainerCRDEstablishedTimeout = 2 * time.Minute

	// TrainerControllerReadyTimeout is the time to wait for the Kubeflow Trainer
	// controller-manager Deployment to have at least one ready replica after installation.
	TrainerControllerReadyTimeout = 2 * time.Minute

	// NCCLTrainJobTimeout is the maximum time to wait for the NCCL all-reduce TrainJob to complete.
	NCCLTrainJobTimeout = 30 * time.Minute

	// NCCLLauncherPodTimeout is the maximum time to wait for the NCCL launcher pod to be created.
	NCCLLauncherPodTimeout = 5 * time.Minute

	// NCCLTrainerArchiveDownloadTimeout is the timeout for downloading the Kubeflow Trainer
	// source archive from GitHub. The archive is several MB, so a longer timeout than the
	// standard HTTPClientTimeout is appropriate.
	NCCLTrainerArchiveDownloadTimeout = 5 * time.Minute
)

Kubeflow Trainer install timeouts for NCCL performance validation.

const (
	// InferenceHealthTimeout is the maximum time to wait for the inference
	// endpoint to become healthy before running the benchmark.
	InferenceHealthTimeout = 5 * time.Minute

	// InferenceHealthPollInterval is the polling interval for health checks.
	InferenceHealthPollInterval = 10 * time.Second

	// InferencePerfJobTimeout is the maximum time for the AIPerf benchmark Job
	// to complete. AIPerf with 100 requests at concurrency 16 typically finishes
	// in a few minutes; this provides headroom for model loading and warmup.
	InferencePerfJobTimeout = 15 * time.Minute

	// InferencePerfPodTimeout is the maximum time to wait for the AIPerf pod
	// to be created and scheduled.
	InferencePerfPodTimeout = 5 * time.Minute

	// InferenceWorkloadReadyTimeout is the maximum time to wait for the
	// DynamoGraphDeployment to reach the "successful" state. Includes image
	// pull, model loading, and health check readiness for all workers.
	InferenceWorkloadReadyTimeout = 10 * time.Minute

	// InferenceNamespaceTerminationWait is the maximum time to wait for a
	// prior run's benchmark namespace to finish terminating before a new run
	// re-creates it. Dynamo CRs with finalizers can hold the namespace in
	// Terminating state for 2-3 minutes while cascade deletion propagates;
	// waiting avoids the "... forbidden: ... because it is being terminated"
	// race on subsequent resource creates.
	InferenceNamespaceTerminationWait = 5 * time.Minute
)

Inference performance validation timeouts.

const (
	// DeploymentScaleTimeout is the timeout for waiting for Deployment controller
	// to observe and act on HPA scale-up by increasing replica count.
	DeploymentScaleTimeout = 2 * time.Minute

	// PodScheduleTimeout is the timeout for waiting for test pods to be scheduled
	// on Karpenter-provisioned nodes after the HPA scales up.
	PodScheduleTimeout = 2 * time.Minute
)

Deployment and pod scheduling test timeouts for conformance validation.

const (
	// PodWaitTimeout is the maximum time to wait for pod operations to complete.
	PodWaitTimeout = 10 * time.Minute

	// PodPollInterval is the interval for polling pod status.
	// Used in legacy polling code (to be replaced with watch API in Phase 3).
	PodPollInterval = 500 * time.Millisecond

	// ValidationPodTimeout is the timeout for validation pod operations.
	ValidationPodTimeout = 10 * time.Minute

	// DiagnosticTimeout is the timeout for collecting diagnostic information.
	DiagnosticTimeout = 2 * time.Minute

	// PodReadyTimeout is the timeout for waiting for pods to become ready.
	PodReadyTimeout = 2 * time.Minute

	// PreflightCleanupTimeout bounds the best-effort probe-pod delete in
	// deferred validator preflight cleanup paths, which run with
	// context.Background() so they still fire after the parent context
	// has been canceled.
	PreflightCleanupTimeout = 30 * time.Second
)

Pod operation timeouts for validation and agent operations.

const (
	// HTTPResponseBodyLimit is the maximum size in bytes for HTTP response bodies
	// read by conformance checks (e.g., Prometheus metric scrapes). Prevents
	// unbounded reads from in-cluster services.
	HTTPResponseBodyLimit = 1 * 1024 * 1024 // 1 MiB

	// MaxErrorBodySize is the maximum size in bytes for HTTP error response bodies.
	// Bounds io.ReadAll on error paths to prevent unbounded memory allocation.
	MaxErrorBodySize = 4096
)

HTTP response limits for conformance checks.

const (
	// JobTTLAfterFinished is the time-to-live for completed Jobs.
	// Jobs are kept for debugging purposes before automatic cleanup.
	JobTTLAfterFinished = 1 * time.Hour

	// AgentJobActiveDeadline is the active deadline for K8s agent Jobs.
	// Prevents runaway Jobs from consuming cluster resources indefinitely.
	AgentJobActiveDeadline = 5 * time.Hour
)

Job configuration constants.

const (
	// ServerDefaultRateLimit is the default requests per second for the rate limiter.
	ServerDefaultRateLimit = 100

	// ServerDefaultRateLimitBurst is the maximum burst size for the rate limiter.
	ServerDefaultRateLimitBurst = 200

	// ServerRetryAfterSeconds is the Retry-After header value when rate limited.
	ServerRetryAfterSeconds = "1"
)

Server rate limiting constants.
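ServerRetryAfterSeconds is a string because it is written directly into a response header. The sketch below shows one way a rate-limiting middleware might use it. The allower interface, fixedBudget limiter, rateLimited wrapper, and serveOnce helper are all hypothetical. A real server would more likely plug in a token-bucket limiter configured with ServerDefaultRateLimit and ServerDefaultRateLimitBurst, such as the one in golang.org/x/time/rate, which exposes the same Allow method.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// serverRetryAfterSeconds mirrors defaults.ServerRetryAfterSeconds.
const serverRetryAfterSeconds = "1"

// allower is the single method this sketch needs from a limiter.
type allower interface{ Allow() bool }

// fixedBudget is a toy limiter that admits the first n calls. It is not
// safe for concurrent use and exists only to keep the demo deterministic.
type fixedBudget struct{ remaining int }

func (b *fixedBudget) Allow() bool {
	if b.remaining <= 0 {
		return false
	}
	b.remaining--
	return true
}

// rateLimited rejects requests with 429 and a Retry-After header once the
// limiter stops admitting them.
func rateLimited(l allower, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.Allow() {
			w.Header().Set("Retry-After", serverRetryAfterSeconds)
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// okHandler is a trivial downstream handler.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// serveOnce sends one GET through h and returns the status code and the
// Retry-After header value.
func serveOnce(h http.Handler) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code, rec.Header().Get("Retry-After")
}

func main() {
	h := rateLimited(&fixedBudget{remaining: 1}, okHandler)
	code, retry := serveOnce(h) // first request is admitted
	fmt.Printf("%d %q\n", code, retry)
	code, retry = serveOnce(h) // second request is rejected
	fmt.Printf("%d %q\n", code, retry)
}
```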

const (
	// ValidatorWaitBuffer is added to the catalog timeout when waiting for Job
	// completion. Accounts for pod scheduling, image pull, and graceful termination.
	ValidatorWaitBuffer = 30 * time.Second

	// ValidatorDefaultTimeout is the default per-validator timeout if not
	// specified in the catalog. Used as fallback only.
	ValidatorDefaultTimeout = 5 * time.Minute

	// ValidatorTerminationGracePeriod is the time between SIGTERM and SIGKILL
	// for validator containers. Validators should trap SIGTERM and write partial
	// results within this window.
	ValidatorTerminationGracePeriod = 30 * time.Second

	// ValidatorMaxStdoutLines is the maximum number of stdout lines captured
	// per validator. Lines beyond this are truncated (keeping the last N lines)
	// to prevent ConfigMap overflow.
	ValidatorMaxStdoutLines = 1000

	// ValidatorMaxStdoutLineLength is the maximum length of a single stdout
	// line. Lines exceeding this are truncated with a suffix indicating the
	// number of dropped characters. Prevents oversized report output from
	// inline JSON payloads (e.g., Prometheus metric scrapes).
	ValidatorMaxStdoutLineLength = 512

	// ValidatorDefaultCPU is the default CPU request/limit for validator containers
	// when not specified in the catalog entry.
	ValidatorDefaultCPU = "1"

	// ValidatorDefaultMemory is the default memory request/limit for validator
	// containers when not specified in the catalog entry.
	ValidatorDefaultMemory = "1Gi"
)

Validator constants.

const (
	// TrainerControllerPollInterval is the retry interval when waiting
	// for the Kubeflow Trainer controller-manager to become ready.
	TrainerControllerPollInterval = 2 * time.Second

	// TrainingRuntimePollInterval is the retry interval when waiting
	// for a TrainingRuntime resource to become visible via the API.
	TrainingRuntimePollInterval = 500 * time.Millisecond
)

Retry poll intervals for validator wait loops.

const (
	// TerminationLogMaxSize is the maximum size in bytes of the K8s
	// termination log message written to /dev/termination-log.
	TerminationLogMaxSize = 4096

	// ConfigMapStatusTruncateLen is the maximum length for ConfigMap
	// status data before truncation in autoscaler status collection.
	ConfigMapStatusTruncateLen = 2000

	// AutoscalerMaxEvents is the maximum number of autoscaler events
	// to capture when collecting cluster autoscaler evidence.
	AutoscalerMaxEvents = 10

	// MetricsDisplayLimit is the maximum number of custom metrics
	// resources to display in AI service metrics evidence.
	MetricsDisplayLimit = 20
)

Termination and truncation limits for validator output.

const (
	// GPUOperatorNamespace is the default namespace for the GPU operator.
	GPUOperatorNamespace = "gpu-operator"

	// KubeSystemNamespace is the standard kube-system namespace.
	KubeSystemNamespace = "kube-system"
)

Well-known Kubernetes resource names shared across validators.

const (
	// CNCFSubmissionTimeout is the timeout for CNCF submission evidence
	// collection. CNCF submission deploys GPU workloads and runs HPA tests.
	CNCFSubmissionTimeout = 20 * time.Minute
)

CNCF conformance submission timeout.

const (
	// CoScheduleWindow is the maximum time span between PodScheduled timestamps
	// for gang-scheduled pods. If pods are scheduled further apart than this,
	// they are not considered co-scheduled.
	CoScheduleWindow = 30 * time.Second
)

Gang scheduling co-scheduling validation.

const (
	// ConfigMapWriteTimeout is the timeout for writing to ConfigMaps.
	ConfigMapWriteTimeout = 30 * time.Second
)

ConfigMap timeouts for Kubernetes ConfigMap operations.

const (
	// FileParserMaxSize is the maximum file size in bytes for the file collector parser.
	FileParserMaxSize = 1 << 20 // 1 MiB
)

File parser limits.

const (
	// LogScannerBufferSize is the maximum line size for reading pod logs.
	// Larger than the default 64 KiB to handle container runtime line splitting
	// and long go test -json output events.
	LogScannerBufferSize = 1 << 20 // 1 MiB
)

Log scanner buffer sizes.

const (
	// MaxSigstoreBundleSize is the maximum size in bytes for a .sigstore.json file.
	// Prevents unbounded memory allocation when reading attestation bundles.
	// A typical Sigstore bundle is under 100KB; 10 MiB provides generous headroom.
	MaxSigstoreBundleSize = 10 * 1024 * 1024 // 10 MiB
)

Attestation file size limits.

const (
	// ProbeImage is the multi-arch (amd64+arm64) toolbox used by validator
	// probe Pods. busybox provides /bin/sh, grep, ls, sleep in ~2 MB.
	ProbeImage = "busybox:1.37"
)

const (
	// RuntimeClassCheckTimeout is the timeout for verifying RuntimeClass
	// existence in the cluster during agent deployment.
	RuntimeClassCheckTimeout = 5 * time.Second
)

Validator runtime class check timeout.

const (
	// ServerMaxHeaderBytes is the maximum size of request headers (64 KiB).
	// Prevents header-based attacks.
	ServerMaxHeaderBytes = 1 << 16
)

Server size limits.

Variables

This section is empty.

Functions

func NewHTTPClient added in v0.8.2

func NewHTTPClient(timeout time.Duration) *http.Client

NewHTTPClient returns an *http.Client with a standard transport and the given timeout. If timeout is zero, HTTPClientTimeout is used.

func NewHTTPTransport added in v0.8.2

func NewHTTPTransport() *http.Transport

NewHTTPTransport returns an *http.Transport configured with the standard timeout constants from this package.
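The source of these two functions is not shown on this page. The sketch below is one plausible way to map the HTTP client constants onto net/http and net.Dialer fields, including the documented zero-timeout fallback. The names sketchTransport and sketchClient are hypothetical, and the real implementation may differ.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// Local copies of the HTTP client constants documented on this page.
const (
	httpClientTimeout         = 30 * time.Second
	httpConnectTimeout        = 5 * time.Second
	httpTLSHandshakeTimeout   = 5 * time.Second
	httpResponseHeaderTimeout = 10 * time.Second
	httpIdleConnTimeout       = 90 * time.Second
	httpKeepAlive             = 30 * time.Second
	httpExpectContinueTimeout = 1 * time.Second
)

// sketchTransport is an assumed equivalent of NewHTTPTransport: each
// constant is assigned to the transport or dialer field it is named after.
func sketchTransport() *http.Transport {
	dialer := &net.Dialer{
		Timeout:   httpConnectTimeout,
		KeepAlive: httpKeepAlive,
	}
	return &http.Transport{
		DialContext:           dialer.DialContext,
		TLSHandshakeTimeout:   httpTLSHandshakeTimeout,
		ResponseHeaderTimeout: httpResponseHeaderTimeout,
		IdleConnTimeout:       httpIdleConnTimeout,
		ExpectContinueTimeout: httpExpectContinueTimeout,
	}
}

// sketchClient is an assumed equivalent of NewHTTPClient: a zero timeout
// falls back to the package default.
func sketchClient(timeout time.Duration) *http.Client {
	if timeout == 0 {
		timeout = httpClientTimeout
	}
	return &http.Client{Timeout: timeout, Transport: sketchTransport()}
}

func main() {
	fmt.Println(sketchClient(0).Timeout, sketchClient(2*time.Second).Timeout) // 30s 2s
}
```

With the real package, a caller that needs a longer overall budget, for example NCCLTrainerArchiveDownloadTimeout, would pass it as the timeout argument to NewHTTPClient.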

Types

This section is empty.
