conformance

package
v0.7.8
Warning: This package is not in the latest version of its module.

Published: Feb 25, 2026 | License: Apache-2.0 | Imports: 29 | Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func CheckAIServiceMetrics

func CheckAIServiceMetrics(ctx *checks.ValidationContext) error

CheckAIServiceMetrics validates CNCF requirement #5: AI Service Metrics. Verifies that GPU metric time series exist in Prometheus and that the custom metrics API is available.

func CheckAcceleratorMetrics

func CheckAcceleratorMetrics(ctx *checks.ValidationContext) error

CheckAcceleratorMetrics validates CNCF requirement #4: Accelerator Metrics. Calls the DCGM exporter metrics endpoint directly via in-cluster DNS and verifies that all required GPU metrics are present.
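
The presence check described above can be illustrated with a minimal sketch that scans Prometheus text exposition output for a set of metric names. This is not the package's implementation: the helper missingMetrics, the sample scrape body, and the choice of required metric names are assumptions for illustration, and a real check would fetch the body from the exporter endpoint over HTTP.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingMetrics is a hypothetical helper. It returns the names in required
// that never appear as a sample line in body, where body is expected to be
// in the Prometheus text exposition format.
func missingMetrics(body string, required []string) []string {
	seen := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			end = len(line)
		}
		seen[line[:end]] = true
	}
	var missing []string
	for _, name := range required {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Assumed scrape output, inlined so the sketch runs without a cluster.
	body := `# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 42
DCGM_FI_DEV_FB_USED{gpu="0"} 1024
`
	// Assumed list; the set the package actually requires is not documented here.
	required := []string{"DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_USED", "DCGM_FI_DEV_GPU_TEMP"}
	if missing := missingMetrics(body, required); len(missing) > 0 {
		fmt.Println("missing metrics:", missing)
	}
}
```

Matching on sample lines rather than on HELP/TYPE comments means a metric counts as present only when at least one time series is actually exported.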

func CheckClusterAutoscaling

func CheckClusterAutoscaling(ctx *checks.ValidationContext) error

CheckClusterAutoscaling validates CNCF requirement #8a: Cluster Autoscaling. Verifies the Karpenter controller deployment is running and at least one NodePool has nvidia.com/gpu limits configured. Skips gracefully when Karpenter is not installed (e.g., Kind CI clusters).
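
As a rough illustration of the NodePool part of this check, the sketch below decodes a JSON list of NodePool objects and reports which ones set a limit for nvidia.com/gpu. It assumes the limits live under spec.limits keyed by resource name, as in Karpenter's NodePool API; the helper gpuLimitedPools, the trimmed-down struct, and the sample list are hypothetical and do not reflect how the package queries the cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodePool models only the fields this sketch reads from a NodePool object.
type nodePool struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Spec struct {
		// Values may be JSON strings or numbers, so decode them loosely.
		Limits map[string]any `json:"limits"`
	} `json:"spec"`
}

// gpuLimitedPools is a hypothetical helper. It returns the names of NodePools
// in the JSON list whose spec.limits contains the nvidia.com/gpu resource.
func gpuLimitedPools(listJSON []byte) ([]string, error) {
	var list struct {
		Items []nodePool `json:"items"`
	}
	if err := json.Unmarshal(listJSON, &list); err != nil {
		return nil, fmt.Errorf("decode NodePool list: %w", err)
	}
	var names []string
	for _, np := range list.Items {
		if _, ok := np.Spec.Limits["nvidia.com/gpu"]; ok {
			names = append(names, np.Metadata.Name)
		}
	}
	return names, nil
}

func main() {
	// Assumed API response, inlined so the sketch runs without a cluster.
	list := []byte(`{"items":[
	  {"metadata":{"name":"gpu-pool"},"spec":{"limits":{"cpu":"1000","nvidia.com/gpu":"8"}}},
	  {"metadata":{"name":"cpu-pool"},"spec":{"limits":{"cpu":"500"}}}
	]}`)
	pools, err := gpuLimitedPools(list)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if len(pools) == 0 {
		fmt.Println("no NodePool sets an nvidia.com/gpu limit")
		return
	}
	fmt.Println("GPU-limited NodePools:", pools)
}
```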

func CheckDRASupport

func CheckDRASupport(ctx *checks.ValidationContext) error

CheckDRASupport validates CNCF requirement #2: DRA Support. Verifies the DRA driver controller deployment and the kubelet plugin DaemonSet, and checks that ResourceSlices (resource.k8s.io/v1, GA) advertising GPU resources exist.

func CheckGPUOperatorHealth

func CheckGPUOperatorHealth(ctx *checks.ValidationContext) error

CheckGPUOperatorHealth validates CNCF requirement #1: GPU Management. Verifies GPU operator deployment, ClusterPolicy state=ready, and DCGM exporter DaemonSet.

func CheckGangScheduling

func CheckGangScheduling(ctx *checks.ValidationContext) error

CheckGangScheduling validates CNCF requirement #7: Gang Scheduling. Verifies KAI scheduler deployments are running, required CRDs exist, and exercises gang scheduling by creating a PodGroup with 2 GPU pods that must be co-scheduled via the KAI scheduler.

func CheckInferenceGateway

func CheckInferenceGateway(ctx *checks.ValidationContext) error

CheckInferenceGateway validates CNCF requirement #6: Inference Gateway. Verifies GatewayClass "kgateway" is accepted, Gateway "inference-gateway" is programmed, and required Gateway API + InferencePool CRDs exist.

func CheckPlatformHealth

func CheckPlatformHealth(ctx *checks.ValidationContext) error

CheckPlatformHealth validates that all expected platform components from the recipe are deployed and healthy. It checks namespace existence and expectedResources health for each componentRef in the recipe.

func CheckPodAutoscaling

func CheckPodAutoscaling(ctx *checks.ValidationContext) error

CheckPodAutoscaling validates CNCF requirement #8b: Pod Autoscaling. Verifies that the custom metrics API is available, GPU custom metrics have data (with retries to account for prometheus-adapter relist delay), and the external metrics API exposes GPU metrics.

func CheckRobustController

func CheckRobustController(ctx *checks.ValidationContext) error

CheckRobustController validates CNCF requirement #9: Robust Controller. Verifies the Dynamo operator is deployed, its validating webhook is operational, and the DynamoGraphDeployment CRD exists.

func CheckSecureAcceleratorAccess

func CheckSecureAcceleratorAccess(ctx *checks.ValidationContext) error

CheckSecureAcceleratorAccess validates CNCF requirement #3: Secure Accelerator Access. Creates a DRA-based GPU test pod with unique names, waits for completion, and verifies proper access patterns: the pod uses resourceClaims instead of the device plugin, mounts no hostPath volumes to GPU devices, and has its ResourceClaim allocated.
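
Two of the access patterns above can be checked from the pod spec alone, which the following sketch illustrates. It is an assumption-laden example, not the package's logic: the helper accessViolations and the trimmed-down pod struct are hypothetical, treating any hostPath under /dev/nvidia as a GPU device path is an assumption, and the ResourceClaim allocation check is omitted because it requires reading the claim's status from the cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// pod models only the Pod fields this sketch inspects.
type pod struct {
	Spec struct {
		ResourceClaims []struct {
			Name string `json:"name"`
		} `json:"resourceClaims"`
		Volumes []struct {
			Name     string `json:"name"`
			HostPath *struct {
				Path string `json:"path"`
			} `json:"hostPath"`
		} `json:"volumes"`
		Containers []struct {
			Name      string `json:"name"`
			Resources struct {
				Limits map[string]any `json:"limits"`
			} `json:"resources"`
		} `json:"containers"`
	} `json:"spec"`
}

// accessViolations is a hypothetical helper that lists spec-level problems.
func accessViolations(podJSON []byte) ([]string, error) {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return nil, fmt.Errorf("decode pod: %w", err)
	}
	var v []string
	if len(p.Spec.ResourceClaims) == 0 {
		v = append(v, "pod declares no resourceClaims")
	}
	for _, c := range p.Spec.Containers {
		// A nvidia.com/gpu limit indicates the device plugin path.
		if _, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
			v = append(v, fmt.Sprintf("container %q uses the device plugin resource", c.Name))
		}
	}
	for _, vol := range p.Spec.Volumes {
		// Assumed convention: GPU device nodes live under /dev/nvidia*.
		if vol.HostPath != nil && strings.HasPrefix(vol.HostPath.Path, "/dev/nvidia") {
			v = append(v, fmt.Sprintf("volume %q mounts GPU hostPath %s", vol.Name, vol.HostPath.Path))
		}
	}
	return v, nil
}

func main() {
	// Assumed pod manifest, inlined so the sketch runs without a cluster.
	good := []byte(`{"spec":{"resourceClaims":[{"name":"gpu"}],
	  "containers":[{"name":"test","resources":{"claims":[{"name":"gpu"}]}}]}}`)
	violations, err := accessViolations(good)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("violations found:", len(violations))
}
```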

Types

This section is empty.
