Documentation ¶
Index ¶
- func CheckAIServiceMetrics(ctx *checks.ValidationContext) error
- func CheckAcceleratorMetrics(ctx *checks.ValidationContext) error
- func CheckClusterAutoscaling(ctx *checks.ValidationContext) error
- func CheckDRASupport(ctx *checks.ValidationContext) error
- func CheckGPUOperatorHealth(ctx *checks.ValidationContext) error
- func CheckGangScheduling(ctx *checks.ValidationContext) error
- func CheckInferenceGateway(ctx *checks.ValidationContext) error
- func CheckPlatformHealth(ctx *checks.ValidationContext) error
- func CheckPodAutoscaling(ctx *checks.ValidationContext) error
- func CheckRobustController(ctx *checks.ValidationContext) error
- func CheckSecureAcceleratorAccess(ctx *checks.ValidationContext) error
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CheckAIServiceMetrics ¶
func CheckAIServiceMetrics(ctx *checks.ValidationContext) error
CheckAIServiceMetrics validates CNCF requirement #5: AI Service Metrics. Discovers the Prometheus service URL from the recipe's kube-prometheus-stack component, then verifies that GPU metric time series exist and that the custom metrics API is available.
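As an illustration of the "time series exist" step, the sketch below counts the series in a response body from the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). It is not this package's implementation: `queryResponse` and `seriesCount` are hypothetical names, and the sample metric in `main` is an assumption about what a GPU query might return. The count is only meaningful for vector or matrix results.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryResponse mirrors the envelope returned by the Prometheus HTTP API.
// Only the fields needed for an existence check are decoded.
type queryResponse struct {
	Status string `json:"status"`
	Error  string `json:"error"`
	Data   struct {
		ResultType string            `json:"resultType"`
		Result     []json.RawMessage `json:"result"`
	} `json:"data"`
}

// seriesCount (hypothetical helper) returns how many series a query matched.
func seriesCount(body []byte) (int, error) {
	var r queryResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, fmt.Errorf("decode response: %w", err)
	}
	if r.Status != "success" {
		return 0, fmt.Errorf("query failed: %s", r.Error)
	}
	return len(r.Data.Result), nil
}

func main() {
	// Sample body; the metric name and labels are illustrative only.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[
		{"metric":{"__name__":"DCGM_FI_DEV_GPU_UTIL","gpu":"0"},"value":[1700000000,"37"]}]}}`)
	n, err := seriesCount(body)
	fmt.Println(n, err)
}
```

A check built this way would treat a count of zero as "no GPU time series found" rather than as a transport error.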
func CheckAcceleratorMetrics ¶
func CheckAcceleratorMetrics(ctx *checks.ValidationContext) error
CheckAcceleratorMetrics validates CNCF requirement #4: Accelerator Metrics. Calls the DCGM exporter metrics endpoint directly via in-cluster DNS and verifies that all required GPU metrics are present.
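The presence check can be sketched as a scan over Prometheus text-format output. This is a minimal illustration, not the package's code: `missingMetrics` is a hypothetical helper, and the list of required metric names in `main` is an assumption (the source does not say which metrics are required).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingMetrics (hypothetical helper) scans Prometheus text exposition
// output and returns the names from required that never appear as a sample.
func missingMetrics(exposition string, required []string) []string {
	seen := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			continue
		}
		seen[line[:end]] = true
	}
	var missing []string
	for _, name := range required {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	sample := "# TYPE DCGM_FI_DEV_GPU_UTIL gauge\n" +
		"DCGM_FI_DEV_GPU_UTIL{gpu=\"0\"} 37\n" +
		"DCGM_FI_DEV_FB_USED{gpu=\"0\"} 1024\n"
	// Assumed list of required metrics, for illustration only.
	required := []string{"DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_USED", "DCGM_FI_DEV_GPU_TEMP"}
	fmt.Println(missingMetrics(sample, required))
}
```

Returning the missing names, rather than a boolean, lets the caller report exactly which metrics the exporter failed to expose.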
func CheckClusterAutoscaling ¶
func CheckClusterAutoscaling(ctx *checks.ValidationContext) error
CheckClusterAutoscaling validates CNCF requirement #8a: Cluster Autoscaling. Verifies the Karpenter controller deployment is running and at least one NodePool has nvidia.com/gpu limits configured. Skips gracefully when Karpenter is not installed (e.g., Kind CI clusters).
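A rough sketch of the NodePool part of this check, assuming the NodePool list has already been fetched as JSON and that GPU limits appear under `spec.limits` keyed by `nvidia.com/gpu`. The `nodePool` type and `gpuLimitedPools` function are hypothetical and decode only the fields the check needs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodePool decodes only the parts of a Karpenter NodePool used here.
type nodePool struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Spec struct {
		// Values are kept raw because quantities may be strings or numbers.
		Limits map[string]json.RawMessage `json:"limits"`
	} `json:"spec"`
}

// gpuLimitedPools (hypothetical helper) returns the names of NodePools in a
// list response that declare an nvidia.com/gpu limit.
func gpuLimitedPools(listJSON []byte) ([]string, error) {
	var list struct {
		Items []nodePool `json:"items"`
	}
	if err := json.Unmarshal(listJSON, &list); err != nil {
		return nil, fmt.Errorf("decode NodePool list: %w", err)
	}
	var names []string
	for _, np := range list.Items {
		if _, ok := np.Spec.Limits["nvidia.com/gpu"]; ok {
			names = append(names, np.Metadata.Name)
		}
	}
	return names, nil
}

func main() {
	// Pool names are made up for the example.
	list := []byte(`{"items":[
		{"metadata":{"name":"cpu-pool"},"spec":{"limits":{"cpu":"1000"}}},
		{"metadata":{"name":"gpu-pool"},"spec":{"limits":{"nvidia.com/gpu":"8"}}}]}`)
	names, err := gpuLimitedPools(list)
	fmt.Println(names, err)
}
```

Under these assumptions, the requirement is met when the returned slice is non-empty.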
func CheckDRASupport ¶
func CheckDRASupport(ctx *checks.ValidationContext) error
CheckDRASupport validates CNCF requirement #2: DRA Support. Verifies the DRA driver controller deployment and the kubelet plugin DaemonSet, and confirms that ResourceSlices (resource.k8s.io/v1, GA) advertising GPU resources exist.
func CheckGPUOperatorHealth ¶
func CheckGPUOperatorHealth(ctx *checks.ValidationContext) error
CheckGPUOperatorHealth validates CNCF requirement #1: GPU Management. Verifies GPU operator deployment, ClusterPolicy state=ready, and DCGM exporter DaemonSet.
func CheckGangScheduling ¶
func CheckGangScheduling(ctx *checks.ValidationContext) error
CheckGangScheduling validates CNCF requirement #7: Gang Scheduling. Verifies KAI scheduler deployments are running, required CRDs exist, and exercises gang scheduling by creating a PodGroup with 2 GPU pods that must be co-scheduled via the KAI scheduler.
func CheckInferenceGateway ¶
func CheckInferenceGateway(ctx *checks.ValidationContext) error
CheckInferenceGateway validates CNCF requirement #6: Inference Gateway. Verifies that GatewayClass "kgateway" is accepted, that Gateway "inference-gateway" is programmed, and that the required Gateway API and InferencePool CRDs exist.
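"Accepted" and "Programmed" are status condition types defined by the Gateway API. A minimal sketch of testing such a condition on an object already fetched as JSON might look like the following; `hasTrueCondition` and the trimmed-down types are hypothetical, not part of this package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition holds the two fields of a Kubernetes status condition used here.
type condition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
}

// statusObject decodes only status.conditions from any Kubernetes object.
type statusObject struct {
	Status struct {
		Conditions []condition `json:"conditions"`
	} `json:"status"`
}

// hasTrueCondition (hypothetical helper) reports whether the object carries
// a condition of the given type whose status is "True".
func hasTrueCondition(objJSON []byte, condType string) (bool, error) {
	var obj statusObject
	if err := json.Unmarshal(objJSON, &obj); err != nil {
		return false, fmt.Errorf("decode object: %w", err)
	}
	for _, c := range obj.Status.Conditions {
		if c.Type == condType {
			return c.Status == "True", nil
		}
	}
	return false, nil // condition absent: treat as not satisfied
}

func main() {
	gateway := []byte(`{"status":{"conditions":[
		{"type":"Accepted","status":"True"},
		{"type":"Programmed","status":"False"}]}}`)
	accepted, _ := hasTrueCondition(gateway, "Accepted")
	programmed, _ := hasTrueCondition(gateway, "Programmed")
	fmt.Println(accepted, programmed)
}
```

The same helper would serve both the GatewayClass and the Gateway, since both report readiness through `status.conditions`.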
func CheckPlatformHealth ¶
func CheckPlatformHealth(ctx *checks.ValidationContext) error
CheckPlatformHealth validates that all expected platform components from the recipe are deployed and healthy. It checks namespace existence and expectedResources health for each componentRef in the recipe.
func CheckPodAutoscaling ¶
func CheckPodAutoscaling(ctx *checks.ValidationContext) error
CheckPodAutoscaling validates CNCF requirement #8b: Pod Autoscaling. Verifies that the custom metrics API is available, GPU custom metrics have data (with retries to account for prometheus-adapter relist delay), and the external metrics API exposes GPU metrics.
func CheckRobustController ¶
func CheckRobustController(ctx *checks.ValidationContext) error
CheckRobustController validates CNCF requirement #9: Robust Controller. Verifies the Dynamo operator is deployed, its validating webhook is operational, and the DynamoGraphDeployment CRD exists.
func CheckSecureAcceleratorAccess ¶
func CheckSecureAcceleratorAccess(ctx *checks.ValidationContext) error
CheckSecureAcceleratorAccess validates CNCF requirement #3: Secure Accelerator Access. Creates a DRA-based GPU test pod with unique names, waits for completion, and verifies proper access patterns: the pod uses resourceClaims instead of the device plugin, no hostPath volume exposes GPU devices, and the ResourceClaim is allocated.
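The pod-level access patterns can be sketched as a scan over a pod's JSON. This is a hedged illustration, not the package's logic: `accessViolations` is a hypothetical helper, and both the `/dev/nvidia` path prefix and the treatment of an `nvidia.com/gpu` limit as "device plugin usage" are assumptions about what the real check inspects.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// pod decodes only the spec fields relevant to the access checks.
type pod struct {
	Spec struct {
		ResourceClaims []struct {
			Name string `json:"name"`
		} `json:"resourceClaims"`
		Volumes []struct {
			Name     string `json:"name"`
			HostPath *struct {
				Path string `json:"path"`
			} `json:"hostPath"`
		} `json:"volumes"`
		Containers []struct {
			Name      string `json:"name"`
			Resources struct {
				Limits map[string]json.RawMessage `json:"limits"`
			} `json:"resources"`
		} `json:"containers"`
	} `json:"spec"`
}

// accessViolations (hypothetical helper) lists the ways a pod deviates from
// DRA-only GPU access; an empty result means no violation was found.
func accessViolations(podJSON []byte) ([]string, error) {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return nil, fmt.Errorf("decode pod: %w", err)
	}
	var v []string
	if len(p.Spec.ResourceClaims) == 0 {
		v = append(v, "pod declares no resourceClaims")
	}
	for _, c := range p.Spec.Containers {
		// Assumption: a device-plugin request shows up as an nvidia.com/gpu limit.
		if _, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
			v = append(v, "container "+c.Name+" requests nvidia.com/gpu via device plugin")
		}
	}
	for _, vol := range p.Spec.Volumes {
		// Assumption: GPU device nodes live under paths starting with /dev/nvidia.
		if vol.HostPath != nil && strings.HasPrefix(vol.HostPath.Path, "/dev/nvidia") {
			v = append(v, "volume "+vol.Name+" mounts hostPath "+vol.HostPath.Path)
		}
	}
	return v, nil
}

func main() {
	// Names in this sample pod are made up for the example.
	good := []byte(`{"spec":{"resourceClaims":[{"name":"gpu"}],
		"containers":[{"name":"test","resources":{}}]}}`)
	v, err := accessViolations(good)
	fmt.Println(len(v), err)
}
```

Whether the ResourceClaim itself is allocated would be a separate lookup on the claim object and is left out of this sketch.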
Types ¶
This section is empty.