Documentation
Overview
Package defaults provides centralized configuration constants for the AICR system.
This package defines timeout values, retry parameters, and other configuration defaults used across the codebase. Centralizing these values ensures consistency and makes tuning easier.
Timeout Categories
Timeouts are organized by component:
- Collector timeouts: For system data collection operations
- Handler timeouts: For HTTP request processing
- Server timeouts: For HTTP server configuration
- Kubernetes timeouts: For K8s API operations
- HTTP client timeouts: For outbound HTTP requests
Usage
Import and use constants directly:
	import "github.com/NVIDIA/aicr/pkg/defaults"

	ctx, cancel := context.WithTimeout(ctx, defaults.CollectorTimeout)
	defer cancel()
Timeout Guidelines
When choosing timeout values:
- Collectors: 10s default, respects parent context deadline
- HTTP handlers: 30s for recipes, 60s for bundles
- K8s operations: 30s for API calls, 5m for job completion
- Server shutdown: 30s for graceful shutdown
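The "respects parent context deadline" guideline follows from how context.WithTimeout behaves: a derived context can shorten the parent's deadline but never extend it. The sketch below illustrates this. The lowercase constant is a local copy of the documented value, and effectiveBudget and budgetWithParent are hypothetical helpers, not part of this package.

```go
package main

import (
	"context"
	"time"
)

// collectorTimeout mirrors the documented defaults.CollectorTimeout (10s) so
// the sketch compiles without importing the real package.
const collectorTimeout = 10 * time.Second

// effectiveBudget derives a collector context from parent and reports how much
// time the collector actually has. context.WithTimeout keeps the earlier of
// the parent's deadline and now+collectorTimeout.
func effectiveBudget(parent context.Context) time.Duration {
	ctx, cancel := context.WithTimeout(parent, collectorTimeout)
	defer cancel()

	deadline, _ := ctx.Deadline() // always set after WithTimeout
	return time.Until(deadline)
}

// budgetWithParent gives the parent context its own timeout and returns the
// budget a collector started under it would see.
func budgetWithParent(parentTimeout time.Duration) time.Duration {
	parent, cancel := context.WithTimeout(context.Background(), parentTimeout)
	defer cancel()
	return effectiveBudget(parent)
}
```

With a 2s parent the collector gets at most 2s; with a 30s parent it gets at most the 10s default.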
Index
Constants
const (
	// CollectorTimeout is the default timeout for collector operations.
	// Collectors should respect parent context deadlines when shorter.
	CollectorTimeout = 10 * time.Second

	// CollectorK8sTimeout is the timeout for Kubernetes API calls in collectors.
	CollectorK8sTimeout = 30 * time.Second
)
Collector timeouts for data collection operations.
const (
	// RecipeHandlerTimeout is the timeout for recipe generation requests.
	RecipeHandlerTimeout = 30 * time.Second

	// RecipeBuildTimeout is the internal timeout for recipe building.
	// Should be less than RecipeHandlerTimeout to allow error handling.
	RecipeBuildTimeout = 25 * time.Second

	// BundleHandlerTimeout is the timeout for bundle generation requests.
	// Longer than recipe due to file I/O operations.
	BundleHandlerTimeout = 60 * time.Second

	// RecipeCacheTTL is the default cache duration for recipe responses.
	RecipeCacheTTL = 10 * time.Minute
)
Handler timeouts for HTTP request processing.
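The 5-second gap between RecipeBuildTimeout and RecipeHandlerTimeout is what leaves room for error handling: if the build deadline fires first, the handler context is still live and can write a response. A minimal sketch of that nesting follows. The constants are local copies of the documented values, and blockUntilDone and buildTimedOutFirst are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// Local copies of the documented defaults.
const (
	recipeHandlerTimeout = 30 * time.Second
	recipeBuildTimeout   = 25 * time.Second

	// errorHeadroom is the time left to report a build timeout to the client.
	errorHeadroom = recipeHandlerTimeout - recipeBuildTimeout
)

// blockUntilDone stands in for a recipe build that never finishes on its own.
func blockUntilDone(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// buildTimedOutFirst nests a build context inside a handler context and
// reports whether the build deadline fired while the handler context was
// still usable for writing an error response.
func buildTimedOutFirst(handlerTimeout, buildTimeout time.Duration) bool {
	handlerCtx, cancelHandler := context.WithTimeout(context.Background(), handlerTimeout)
	defer cancelHandler()

	buildCtx, cancelBuild := context.WithTimeout(handlerCtx, buildTimeout)
	defer cancelBuild()

	err := blockUntilDone(buildCtx)
	return errors.Is(err, context.DeadlineExceeded) && handlerCtx.Err() == nil
}
```

Calling buildTimedOutFirst with scaled-down values (for example 300ms and 25ms) shows the same ordering without waiting for the real timeouts.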
const (
	// ServerReadTimeout is the maximum duration for reading the entire
	// request, including the body.
	ServerReadTimeout = 10 * time.Second

	// ServerReadHeaderTimeout is the maximum duration for reading request
	// headers. Prevents slow header attacks.
	ServerReadHeaderTimeout = 5 * time.Second

	// ServerWriteTimeout is the maximum duration for writing a response.
	ServerWriteTimeout = 30 * time.Second

	// ServerIdleTimeout is the maximum duration to wait for the next request.
	ServerIdleTimeout = 120 * time.Second

	// ServerShutdownTimeout is the maximum duration for graceful shutdown.
	ServerShutdownTimeout = 30 * time.Second
)
Server timeouts for HTTP server configuration.
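These values map directly onto fields of the standard library's http.Server. The sketch below shows one way they might be wired up. It is an illustration rather than the project's actual server setup: newServer and shutdownGracefully are hypothetical helpers, and the constants are local copies of the documented defaults (including ServerMaxHeaderBytes from the size limits further down).

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// Local copies of the documented defaults so the sketch compiles on its own.
// Real code would reference defaults.ServerReadTimeout and so on.
const (
	serverReadTimeout       = 10 * time.Second
	serverReadHeaderTimeout = 5 * time.Second
	serverWriteTimeout      = 30 * time.Second
	serverIdleTimeout       = 120 * time.Second
	serverShutdownTimeout   = 30 * time.Second
	serverMaxHeaderBytes    = 1 << 16
)

// newServer builds an http.Server with every timeout set explicitly.
// The zero value of each field means "no timeout", so none are left unset.
func newServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           handler,
		ReadTimeout:       serverReadTimeout,
		ReadHeaderTimeout: serverReadHeaderTimeout,
		WriteTimeout:      serverWriteTimeout,
		IdleTimeout:       serverIdleTimeout,
		MaxHeaderBytes:    serverMaxHeaderBytes,
	}
}

// shutdownGracefully gives in-flight requests up to serverShutdownTimeout to
// finish before Shutdown returns the context's error.
func shutdownGracefully(srv *http.Server) error {
	ctx, cancel := context.WithTimeout(context.Background(), serverShutdownTimeout)
	defer cancel()
	return srv.Shutdown(ctx)
}
```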
const (
	// K8sJobCreationTimeout is the timeout for creating K8s Job resources.
	K8sJobCreationTimeout = 30 * time.Second

	// K8sPodReadyTimeout is the timeout for waiting for pods to be ready.
	K8sPodReadyTimeout = 60 * time.Second

	// K8sJobCompletionTimeout is the default timeout for job completion.
	K8sJobCompletionTimeout = 5 * time.Minute

	// K8sCleanupTimeout is the timeout for cleanup operations.
	K8sCleanupTimeout = 30 * time.Second
)
Kubernetes timeouts for K8s API operations.
const (
	// HTTPClientTimeout is the default total timeout for HTTP requests.
	HTTPClientTimeout = 30 * time.Second

	// HTTPConnectTimeout is the timeout for establishing connections.
	HTTPConnectTimeout = 5 * time.Second

	// HTTPTLSHandshakeTimeout is the timeout for TLS handshake.
	HTTPTLSHandshakeTimeout = 5 * time.Second

	// HTTPResponseHeaderTimeout is the timeout for reading response headers.
	HTTPResponseHeaderTimeout = 10 * time.Second

	// HTTPIdleConnTimeout is the timeout for idle connections in the pool.
	HTTPIdleConnTimeout = 90 * time.Second

	// HTTPKeepAlive is the keep-alive duration for connections.
	HTTPKeepAlive = 30 * time.Second

	// HTTPExpectContinueTimeout is the timeout for Expect: 100-continue.
	HTTPExpectContinueTimeout = 1 * time.Second
)
HTTP client timeouts for outbound requests.
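Each client constant corresponds to a field on net.Dialer, http.Transport, or http.Client. The following sketch shows a plausible mapping. It assumes the project builds its client this way, which the documentation does not confirm; newTransport and newHTTPClient are hypothetical names, and the constants are local copies of the documented defaults.

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// Local copies of the documented defaults.
const (
	httpClientTimeout         = 30 * time.Second
	httpConnectTimeout        = 5 * time.Second
	httpTLSHandshakeTimeout   = 5 * time.Second
	httpResponseHeaderTimeout = 10 * time.Second
	httpIdleConnTimeout       = 90 * time.Second
	httpKeepAlive             = 30 * time.Second
	httpExpectContinueTimeout = 1 * time.Second
)

// newTransport applies the per-phase timeouts: connect, TLS handshake,
// response headers, 100-continue, and idle pooled connections.
func newTransport() *http.Transport {
	dialer := &net.Dialer{
		Timeout:   httpConnectTimeout,
		KeepAlive: httpKeepAlive,
	}
	return &http.Transport{
		DialContext:           dialer.DialContext,
		TLSHandshakeTimeout:   httpTLSHandshakeTimeout,
		ResponseHeaderTimeout: httpResponseHeaderTimeout,
		IdleConnTimeout:       httpIdleConnTimeout,
		ExpectContinueTimeout: httpExpectContinueTimeout,
	}
}

// newHTTPClient caps the whole request, including reading the body, at
// httpClientTimeout on top of the per-phase limits.
func newHTTPClient() *http.Client {
	return &http.Client{
		Timeout:   httpClientTimeout,
		Transport: newTransport(),
	}
}
```

The per-phase limits are all shorter than the 30s total, so a stalled connect or handshake fails early instead of consuming the whole request budget.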
const (
	// CLISnapshotTimeout is the default timeout for snapshot operations.
	CLISnapshotTimeout = 5 * time.Minute

	// InteractiveOIDCTimeout is the maximum time to wait for a user to complete
	// browser-based OIDC authentication. Prevents indefinite blocking if the
	// browser flow is started but never completed.
	InteractiveOIDCTimeout = 5 * time.Minute
)
CLI timeouts for command-line operations.
const (
	// ValidateReadinessTimeout is the default timeout for readiness validation.
	ValidateReadinessTimeout = 5 * time.Minute

	// ValidateDeploymentTimeout is the default timeout for deployment validation.
	ValidateDeploymentTimeout = 10 * time.Minute

	// ValidatePerformanceTimeout is the default timeout for performance validation.
	// Performance tests may take longer due to GPU benchmarks.
	ValidatePerformanceTimeout = 30 * time.Minute

	// ValidateConformanceTimeout is the default timeout for conformance validation.
	ValidateConformanceTimeout = 15 * time.Minute

	// ResourceVerificationTimeout is the timeout for verifying individual
	// expected resources exist and are healthy during deployment validation.
	ResourceVerificationTimeout = 10 * time.Second

	// ComponentRenderTimeout is the maximum time to render a single component
	// via helm template or manifest file rendering during resource discovery.
	ComponentRenderTimeout = 60 * time.Second
)
Validation phase timeouts. These are used when the recipe does not specify a timeout.
const (
	// ChainsawAssertTimeout is the timeout for health check assertions
	// when evaluating component assert files against live cluster resources.
	ChainsawAssertTimeout = 2 * time.Minute

	// ChainsawMaxParallel is the maximum number of concurrent assertion
	// runs during component health checks.
	ChainsawMaxParallel = 4

	// AssertRetryInterval is the polling interval between health check
	// assertion retries. Assertions are retried at this interval until
	// they pass or the ChainsawAssertTimeout expires.
	AssertRetryInterval = 5 * time.Second
)
Chainsaw assertion configuration for component health checks.
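The retry behavior described for AssertRetryInterval (retry at a fixed interval until the assertion passes or the timeout expires) can be sketched with a ticker and a context deadline. This is an illustrative loop, not the package's implementation: retryAssert and attemptsUntilPass are hypothetical, and the constants are local copies of the documented values.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Local copies of the documented defaults. Production use would look like
// retryAssert(ctx, chainsawAssertTimeout, assertRetryInterval, check).
const (
	chainsawAssertTimeout = 2 * time.Minute
	assertRetryInterval   = 5 * time.Second
)

// retryAssert calls assert immediately and then once per interval until it
// returns nil or the timeout expires. It returns the number of attempts made.
func retryAssert(ctx context.Context, timeout, interval time.Duration, assert func() error) (int, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	attempts := 0
	for {
		attempts++
		err := assert()
		if err == nil {
			return attempts, nil
		}
		select {
		case <-ctx.Done():
			// Report the last assertion failure, not just the timeout.
			return attempts, fmt.Errorf("assertion still failing after %d attempts: %w", attempts, err)
		case <-ticker.C:
			// Next retry.
		}
	}
}

// attemptsUntilPass exercises retryAssert with an assertion that fails the
// given number of times before passing.
func attemptsUntilPass(failures int, timeout, interval time.Duration) (int, error) {
	calls := 0
	flaky := func() error {
		calls++
		if calls <= failures {
			return errors.New("resource not ready")
		}
		return nil
	}
	return retryAssert(context.Background(), timeout, interval, flaky)
}
```

With the documented defaults, an assertion that never passes is attempted roughly 2m / 5s = 24 times before the timeout is reported.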
const (
	// CheckExecutionTimeout is the parent context timeout for checks running
	// inside a K8s Job. Must be long enough for behavioral checks (DRA pod
	// creation + image pull + GPU allocation + isolation verification) and
	// shorter than the Job-level ValidateConformanceTimeout.
	CheckExecutionTimeout = 10 * time.Minute

	// DRATestPodTimeout is the timeout for the DRA test pod to complete.
	// The pod runs a simple CUDA device check but may need time for image pull.
	DRATestPodTimeout = 5 * time.Minute

	// GangTestPodTimeout is the timeout for gang scheduling test pods to complete.
	// Two pods must be co-scheduled, each pulling a CUDA image and running nvidia-smi.
	GangTestPodTimeout = 5 * time.Minute
)
Conformance test timeouts for DRA and gang scheduling validation.
const (
	// HPAScaleTimeout is the timeout for waiting for HPA to report scaling intent.
	// The HPA needs time to read metrics and compute desired replicas.
	HPAScaleTimeout = 3 * time.Minute

	// HPAPollInterval is the interval for polling HPA status during behavioral tests.
	HPAPollInterval = 10 * time.Second
)
HPA behavioral test timeouts for conformance validation.
const (
	// KarpenterNodeTimeout is the timeout for Karpenter to provision KWOK nodes.
	KarpenterNodeTimeout = 3 * time.Minute

	// KarpenterPollInterval is the interval for polling Karpenter node provisioning.
	KarpenterPollInterval = 10 * time.Second
)
Karpenter behavioral test timeouts for conformance validation.
const (
	// DeploymentScaleTimeout is the timeout for waiting for Deployment controller
	// to observe and act on HPA scale-up by increasing replica count.
	DeploymentScaleTimeout = 2 * time.Minute

	// PodScheduleTimeout is the timeout for waiting for test pods to be scheduled
	// on Karpenter-provisioned nodes after the HPA scales up.
	PodScheduleTimeout = 2 * time.Minute
)
Deployment and pod scheduling test timeouts for conformance validation.
const (
	// PodWaitTimeout is the maximum time to wait for pod operations to complete.
	PodWaitTimeout = 10 * time.Minute

	// PodPollInterval is the interval for polling pod status.
	// Used in legacy polling code (to be replaced with watch API in Phase 3).
	PodPollInterval = 500 * time.Millisecond

	// ValidationPodTimeout is the timeout for validation pod operations.
	ValidationPodTimeout = 10 * time.Minute

	// DiagnosticTimeout is the timeout for collecting diagnostic information.
	DiagnosticTimeout = 2 * time.Minute

	// PodReadyTimeout is the timeout for waiting for pods to become ready.
	PodReadyTimeout = 2 * time.Minute
)
Pod operation timeouts for validation and agent operations.
const (
	// ArtifactMaxDataSize is the maximum size in bytes of a single artifact's Data field.
	// Ensures each base64-encoded ARTIFACT: line stays well under the bufio.Scanner
	// default 64KB limit (base64 expands ~4/3, so 8KB → ~11KB encoded).
	ArtifactMaxDataSize = 8 * 1024

	// ArtifactMaxPerCheck is the maximum number of artifacts a single check can record.
	ArtifactMaxPerCheck = 20
)
Artifact limits for conformance evidence capture.
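The size arithmetic in the ArtifactMaxDataSize comment can be checked with the standard library. The sketch assumes standard padded base64 (base64.StdEncoding); the documentation does not say which variant the artifact lines use. encodedLen and fitsScannerLine are hypothetical helpers.

```go
package main

import (
	"bufio"
	"encoding/base64"
)

// artifactMaxDataSize mirrors the documented defaults.ArtifactMaxDataSize.
const artifactMaxDataSize = 8 * 1024

// encodedLen returns the length of n raw bytes after padded base64 encoding.
func encodedLen(n int) int {
	return base64.StdEncoding.EncodedLen(n)
}

// fitsScannerLine reports whether n raw bytes, once encoded, stay below
// bufio.MaxScanTokenSize (64 KiB), the default bufio.Scanner token limit.
func fitsScannerLine(n int) bool {
	return encodedLen(n) < bufio.MaxScanTokenSize
}
```

For the 8192-byte limit, encodedLen returns 10924, which matches the "~11KB" in the comment and leaves ample room for a line prefix within the 65536-byte scanner limit.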
const (
	// HTTPResponseBodyLimit is the maximum size in bytes for HTTP response bodies
	// read by conformance checks (e.g., Prometheus metric scrapes). Prevents
	// unbounded reads from in-cluster services.
	HTTPResponseBodyLimit = 1 * 1024 * 1024 // 1 MiB

	// MaxErrorBodySize is the maximum size in bytes for HTTP error response bodies.
	// Bounds io.ReadAll on error paths to prevent unbounded memory allocation.
	MaxErrorBodySize = 4096
)
HTTP response limits for conformance checks.
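A common way to bound io.ReadAll, as these comments describe, is to wrap the body in io.LimitReader. The sketch below reads one byte past the limit so an oversized body is reported rather than silently truncated. Whether the project rejects or truncates is not stated in the documentation, so treat that choice as an assumption. readLimited and readLimitedString are hypothetical helpers.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Local copies of the documented defaults.
const (
	httpResponseBodyLimit = 1 * 1024 * 1024 // 1 MiB
	maxErrorBodySize      = 4096
)

// readLimited reads at most limit bytes from r and returns an error if the
// source holds more. Memory use is bounded by limit+1 bytes.
func readLimited(r io.Reader, limit int64) ([]byte, error) {
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(data)) > limit {
		return nil, fmt.Errorf("body exceeds %d bytes", limit)
	}
	return data, nil
}

// readLimitedString is a convenience wrapper for trying readLimited on
// in-memory data.
func readLimitedString(s string, limit int64) ([]byte, error) {
	return readLimited(strings.NewReader(s), limit)
}
```

In a handler this would be called as readLimited(resp.Body, httpResponseBodyLimit) on success paths and with maxErrorBodySize on error paths.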
const (
	// CoScheduleWindow is the maximum time span between PodScheduled timestamps
	// for gang-scheduled pods. If pods are scheduled further apart than this,
	// they are not considered co-scheduled.
	CoScheduleWindow = 30 * time.Second
)
Gang scheduling co-scheduling validation.
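The co-scheduling rule reduces to comparing the spread between the earliest and latest PodScheduled timestamps against the window. A minimal sketch, assuming the timestamps have already been extracted from pod conditions; coScheduled and timesFromOffsets are hypothetical helpers, and treating fewer than two pods as trivially co-scheduled is a choice made for this example only.

```go
package main

import "time"

// coScheduleWindow mirrors the documented defaults.CoScheduleWindow.
const coScheduleWindow = 30 * time.Second

// coScheduled reports whether all timestamps fall within window of each
// other, measured from the earliest to the latest.
func coScheduled(scheduledAt []time.Time, window time.Duration) bool {
	if len(scheduledAt) < 2 {
		return true // nothing to compare
	}
	earliest, latest := scheduledAt[0], scheduledAt[0]
	for _, t := range scheduledAt[1:] {
		if t.Before(earliest) {
			earliest = t
		}
		if t.After(latest) {
			latest = t
		}
	}
	return latest.Sub(earliest) <= window
}

// timesFromOffsets builds timestamps relative to a fixed base, which is
// convenient for trying coScheduled without a cluster.
func timesFromOffsets(offsets ...time.Duration) []time.Time {
	base := time.Unix(0, 0)
	out := make([]time.Time, 0, len(offsets))
	for _, off := range offsets {
		out = append(out, base.Add(off))
	}
	return out
}
```

Two pods scheduled 10s apart pass the check; two pods scheduled 60s apart do not.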
const (
	// ConfigMapWriteTimeout is the timeout for writing to ConfigMaps.
	ConfigMapWriteTimeout = 30 * time.Second
)
ConfigMap timeouts for Kubernetes ConfigMap operations.
const (
	// EvidenceRenderTimeout is the timeout for rendering conformance evidence markdown.
	EvidenceRenderTimeout = 30 * time.Second
)
Evidence rendering timeouts.
const (
	// JobTTLAfterFinished is the time-to-live for completed Jobs.
	// Jobs are kept for debugging purposes before automatic cleanup.
	JobTTLAfterFinished = 1 * time.Hour
)
Job configuration constants.
const (
	// MaxSigstoreBundleSize is the maximum size in bytes for a .sigstore.json file.
	// Prevents unbounded memory allocation when reading attestation bundles.
	// A typical Sigstore bundle is under 100KB; 10 MiB provides generous headroom.
	MaxSigstoreBundleSize = 10 * 1024 * 1024 // 10 MiB
)
Attestation file size limits.
const (
	// ServerMaxHeaderBytes is the maximum size of request headers (64KB).
	// Prevents header-based attacks.
	ServerMaxHeaderBytes = 1 << 16
)
Server size limits.
Variables
This section is empty.
Functions
This section is empty.
Types
This section is empty.