conformance

package

v0.8.13 (latest) Published: Mar 5, 2026 License: Apache-2.0 Imports: 29 Imported by: 0

Repository

github.com/NVIDIA/aicr

Documentation

Index

func CheckAIServiceMetrics(ctx *checks.ValidationContext) error
func CheckAcceleratorMetrics(ctx *checks.ValidationContext) error
func CheckClusterAutoscaling(ctx *checks.ValidationContext) error
func CheckDRASupport(ctx *checks.ValidationContext) error
func CheckGPUOperatorHealth(ctx *checks.ValidationContext) error
func CheckGangScheduling(ctx *checks.ValidationContext) error
func CheckInferenceGateway(ctx *checks.ValidationContext) error
func CheckPlatformHealth(ctx *checks.ValidationContext) error
func CheckPodAutoscaling(ctx *checks.ValidationContext) error
func CheckRobustController(ctx *checks.ValidationContext) error
func CheckSecureAcceleratorAccess(ctx *checks.ValidationContext) error

Constants

This section is empty.

Variables

This section is empty.

Functions

func CheckAIServiceMetrics

func CheckAIServiceMetrics(ctx *checks.ValidationContext) error

CheckAIServiceMetrics validates CNCF requirement #5: AI Service Metrics. Discovers the Prometheus service URL from the recipe's kube-prometheus-stack component, then verifies GPU metric time series exist and that the custom metrics API is available.
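One way to verify that time series exist is to run an instant query against the Prometheus HTTP API and confirm that the result list is non-empty. The sketch below shows only the response-handling half of that idea, under stated assumptions: hasSeries is an illustrative helper rather than part of this package, and the metric name in the sample body (DCGM_FI_DEV_GPU_UTIL) is assumed, since the page does not list the series the check queries.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryResponse models the subset of a Prometheus /api/v1/query response
// needed to tell whether any series matched.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string            `json:"resultType"`
		Result     []json.RawMessage `json:"result"`
	} `json:"data"`
}

// hasSeries reports whether a vector or matrix query response body
// contains at least one result.
func hasSeries(body []byte) (bool, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, fmt.Errorf("decode response: %w", err)
	}
	if resp.Status != "success" {
		return false, fmt.Errorf("query status %q", resp.Status)
	}
	return len(resp.Data.Result) > 0, nil
}

func main() {
	// Sample body; the metric name is an assumption for illustration.
	body := []byte(`{"status":"success","data":{"resultType":"vector","result":[
		{"metric":{"__name__":"DCGM_FI_DEV_GPU_UTIL","gpu":"0"},"value":[1741170000,"37"]}]}}`)
	ok, err := hasSeries(body)
	fmt.Println(ok, err)
}
```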

func CheckAcceleratorMetrics

func CheckAcceleratorMetrics(ctx *checks.ValidationContext) error

CheckAcceleratorMetrics validates CNCF requirement #4: Accelerator Metrics. Calls the DCGM exporter metrics endpoint directly via in-cluster DNS and verifies that all required GPU metrics are present.
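A metrics endpoint such as the DCGM exporter's returns the Prometheus text exposition format, so checking that required metrics are present amounts to collecting the metric names from the sample lines. The sketch below illustrates that step only. The helper missingMetrics and the list of required names are assumptions, since this page does not say which metrics the check requires.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingMetrics scans Prometheus text-format output and returns the
// required metric names that have no sample line, in the order given.
func missingMetrics(exposition string, required []string) ([]string, error) {
	seen := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name ends at the first '{' (labels) or space (value).
		if i := strings.IndexAny(line, "{ "); i > 0 {
			seen[line[:i]] = true
		}
	}
	if err := sc.Err(); err != nil {
		return nil, fmt.Errorf("scan metrics: %w", err)
	}
	var missing []string
	for _, name := range required {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing, nil
}

func main() {
	sample := `# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 37
DCGM_FI_DEV_FB_USED{gpu="0"} 1024
`
	// The required list is an assumption for illustration.
	required := []string{"DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_GPU_TEMP"}
	missing, err := missingMetrics(sample, required)
	fmt.Println(missing, err)
}
```

A name that appears only in a HELP or TYPE comment is still reported as missing, because the helper counts sample lines rather than metadata.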

func CheckClusterAutoscaling

func CheckClusterAutoscaling(ctx *checks.ValidationContext) error

CheckClusterAutoscaling validates CNCF requirement #8a: Cluster Autoscaling. Verifies the Karpenter controller deployment is running and at least one NodePool has nvidia.com/gpu limits configured. Skips gracefully when Karpenter is not installed (e.g., Kind CI clusters).

func CheckDRASupport

func CheckDRASupport(ctx *checks.ValidationContext) error

CheckDRASupport validates CNCF requirement #2: DRA Support. Verifies the DRA driver controller deployment and the kubelet plugin DaemonSet, and checks that ResourceSlices (resource.k8s.io/v1, GA) advertising GPU resources exist.

func CheckGPUOperatorHealth

func CheckGPUOperatorHealth(ctx *checks.ValidationContext) error

CheckGPUOperatorHealth validates CNCF requirement #1: GPU Management. Verifies the GPU operator deployment, that the ClusterPolicy reports state=ready, and the DCGM exporter DaemonSet.

func CheckGangScheduling

func CheckGangScheduling(ctx *checks.ValidationContext) error

CheckGangScheduling validates CNCF requirement #7: Gang Scheduling. Verifies that the KAI scheduler deployments are running and the required CRDs exist, then exercises gang scheduling by creating a PodGroup with two GPU pods that the KAI scheduler must co-schedule.

func CheckInferenceGateway

func CheckInferenceGateway(ctx *checks.ValidationContext) error

CheckInferenceGateway validates CNCF requirement #6: Inference Gateway. Verifies that GatewayClass "kgateway" is accepted, that Gateway "inference-gateway" is programmed, and that the required Gateway API and InferencePool CRDs exist.

func CheckPlatformHealth

func CheckPlatformHealth(ctx *checks.ValidationContext) error

CheckPlatformHealth validates that all expected platform components from the recipe are deployed and healthy. It checks namespace existence and expectedResources health for each componentRef in the recipe.

func CheckPodAutoscaling

func CheckPodAutoscaling(ctx *checks.ValidationContext) error

CheckPodAutoscaling validates CNCF requirement #8b: Pod Autoscaling. Verifies that the custom metrics API is available, GPU custom metrics have data (with retries to account for prometheus-adapter relist delay), and the external metrics API exposes GPU metrics.

func CheckRobustController

func CheckRobustController(ctx *checks.ValidationContext) error

CheckRobustController validates CNCF requirement #9: Robust Controller. Verifies the Dynamo operator is deployed, its validating webhook is operational, and the DynamoGraphDeployment CRD exists.

func CheckSecureAcceleratorAccess

func CheckSecureAcceleratorAccess(ctx *checks.ValidationContext) error

CheckSecureAcceleratorAccess validates CNCF requirement #3: Secure Accelerator Access. Creates a DRA-based GPU test pod with unique names, waits for completion, and verifies proper access patterns: the pod uses resourceClaims instead of the device plugin, mounts no hostPath to GPU devices, and its ResourceClaim is allocated.
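The first two access-pattern conditions can be evaluated from the pod manifest alone. The sketch below decodes a pod's JSON into a minimal struct and reports violations. It is an illustration under stated assumptions, not the package's implementation: accessViolations is a hypothetical helper, and treating any hostPath under /dev/nvidia as a GPU device path is an assumption. Confirming that the ResourceClaim is allocated requires reading the claim's status from the cluster and is not covered here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// pod models only the Pod fields this sketch inspects.
type pod struct {
	Spec struct {
		ResourceClaims []struct {
			Name string `json:"name"`
		} `json:"resourceClaims"`
		Containers []struct {
			Resources struct {
				Limits map[string]json.RawMessage `json:"limits"`
			} `json:"resources"`
		} `json:"containers"`
		Volumes []struct {
			HostPath *struct {
				Path string `json:"path"`
			} `json:"hostPath"`
		} `json:"volumes"`
	} `json:"spec"`
}

// accessViolations returns a description of each insecure access pattern
// found in the pod manifest; an empty result means none were found.
func accessViolations(manifest []byte) ([]string, error) {
	var p pod
	if err := json.Unmarshal(manifest, &p); err != nil {
		return nil, fmt.Errorf("decode pod: %w", err)
	}
	var out []string
	if len(p.Spec.ResourceClaims) == 0 {
		out = append(out, "pod declares no resourceClaims")
	}
	for _, c := range p.Spec.Containers {
		if _, ok := c.Resources.Limits["nvidia.com/gpu"]; ok {
			out = append(out, "container requests nvidia.com/gpu through the device plugin")
		}
	}
	for _, v := range p.Spec.Volumes {
		// Assumption: GPU device nodes live under /dev/nvidia*.
		if v.HostPath != nil && strings.HasPrefix(v.HostPath.Path, "/dev/nvidia") {
			out = append(out, "hostPath volume exposes "+v.HostPath.Path)
		}
	}
	return out, nil
}

func main() {
	manifest := []byte(`{"spec":{
		"resourceClaims":[{"name":"gpu"}],
		"containers":[{"name":"test","resources":{"claims":[{"name":"gpu"}]}}]}}`)
	violations, err := accessViolations(manifest)
	fmt.Println(len(violations), err)
}
```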

Types

This section is empty.
