ai-conformance

command
v0.8.13
Published: Mar 5, 2026 License: Apache-2.0 Imports: 20 Imported by: 0

README

AI-Conformance Inference Stack — Chainsaw Tests

Validates that the NVIDIA AI-conformance inference stack is correctly generated by AICR and healthy on a production EKS cluster. The stack satisfies CNCF AI Conformance requirements for GPU scheduling (KAI Scheduler), inference routing (kgateway with Gateway API Inference Extension), and the NVIDIA Dynamo serving platform.

Recipe

Generated with:

aicr recipe \
  --service eks \
  --accelerator h100 \
  --os ubuntu \
  --intent inference \
  --platform dynamo \
  --output recipe.yaml

Overlay chain: base → monitoring-hpa → eks → eks-inference → h100-eks-inference → h100-eks-ubuntu-inference → h100-eks-ubuntu-inference-dynamo

Bundle generated with:

aicr bundle \
  --recipe recipe.yaml \
  --output ./bundle \
  --system-node-selector nodeGroup=system-pool \
  --accelerated-node-selector nodeGroup=gpu-worker \
  --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule

Components (16)

| Component | Namespace | Type | What is Validated |
|---|---|---|---|
| cert-manager | cert-manager | Helm | 3 Deployments (controller, webhook, cainjector) |
| gpu-operator | gpu-operator | Helm | Operator Deployment, ClusterPolicy ready, 6 DaemonSets (driver, device-plugin, dcgm-exporter, toolkit, gfd, validator) |
| nvsentinel | nvsentinel | Helm | Controller Deployment, platform-connector DaemonSet |
| skyhook-operator | skyhook | Helm | Controller-manager Deployment |
| kube-prometheus-stack | monitoring | Helm | 3 Deployments (operator, grafana, kube-state-metrics), 2 StatefulSets (prometheus, alertmanager), node-exporter DaemonSet |
| k8s-ephemeral-storage-metrics | monitoring | Helm | Deployment |
| prometheus-adapter | monitoring | Helm | Deployment |
| aws-ebs-csi-driver | kube-system | Helm | Disabled by default (EKS managed addon) |
| aws-efa | kube-system | Helm | Device plugin DaemonSet |
| kgateway-crds | kgateway-system | Helm | CRDs only (Gateway API + Inference Extension) |
| kgateway | kgateway-system | Helm | Controller Deployment |
| skyhook-customizations | skyhook | Manifest | No workloads (NodeConfiguration CRs) |
| nvidia-dra-driver-gpu | nvidia-dra-driver | Helm | Controller Deployment, kubelet-plugin DaemonSet |
| kai-scheduler | kai-scheduler | Helm | Scheduler Deployment |
| dynamo-crds | dynamo-system | Helm | CRDs only |
| dynamo-platform | dynamo-system | Helm | Operator Deployment, etcd StatefulSet, NATS StatefulSet |

Test Structure

tests/chainsaw/ai-conformance/
├── README.md
├── offline/                             # No cluster needed
│   ├── chainsaw-test.yaml               # Recipe + bundle generation
│   └── assert-recipe.yaml               # Recipe structure assertion
└── cluster/                             # Requires deployed stack
    ├── chainsaw-test.yaml               # Cluster health check orchestration
    ├── assert-namespaces.yaml           # 9 namespaces exist
    ├── assert-crds.yaml                 # Critical CRDs installed
    ├── assert-cert-manager.yaml         # cert-manager healthy
    ├── assert-gpu-operator.yaml         # GPU operator + DaemonSets healthy
    ├── assert-monitoring.yaml           # Prometheus stack healthy
    ├── assert-kube-system.yaml          # AWS EFA healthy
    ├── assert-kgateway.yaml             # kgateway healthy
    ├── assert-skyhook.yaml              # Skyhook operator healthy
    ├── assert-nvsentinel.yaml           # NVSentinel healthy
    ├── assert-dra-driver.yaml           # DRA driver healthy
    ├── assert-kai-scheduler.yaml        # KAI scheduler healthy
    └── assert-dynamo.yaml               # Dynamo platform healthy

Prerequisites

Offline tests

  • Built aicr binary (go build -o dist/e2e/aicr ./cmd/aicr)
  • Chainsaw installed (brew install kyverno/tap/chainsaw)
  • No cluster needed

Cluster tests

  • Chainsaw installed
  • kubectl configured with access to the target cluster
  • AI-conformance inference stack deployed (via deploy.sh from the bundle)
  • At least one GPU node with H100 GPUs (for DaemonSet health checks)

Running

Offline: recipe + bundle generation

go build -o dist/e2e/aicr ./cmd/aicr
AICR_BIN=$(pwd)/dist/e2e/aicr chainsaw test \
  --no-cluster \
  --test-dir tests/chainsaw/ai-conformance/offline

Cluster: post-deployment health check

chainsaw test \
  --test-dir tests/chainsaw/ai-conformance/cluster

To override the default kubeconfig:

chainsaw test \
  --test-dir tests/chainsaw/ai-conformance/cluster \
  --kube-config-overrides /path/to/kubeconfig

Timeouts

| Component Group | Timeout | Reason |
|---|---|---|
| Namespaces, CRDs | 2m | Should exist immediately after deployment |
| cert-manager, kgateway, skyhook, monitoring, kai-scheduler | 5m | Standard Deployment rollout |
| gpu-operator, nvidia-dra-driver-gpu | 10m | GPU driver compilation on nodes is slow |
| dynamo-platform | 5m | Operator + etcd + NATS startup |

Assertion Patterns

  • Deployments: Polls until status.conditions[type=Available].status = "True"
  • DaemonSets: Polls until numberReady > 0 and desiredNumberScheduled > 0
  • StatefulSets: Polls until readyReplicas > 0
  • ClusterPolicy: Polls until status.state = ready (GPU operator umbrella check)
  • CRDs: Asserts existence by fully-qualified name
  • Namespaces: Asserts status.phase = Active
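
The workload rules in this list can be restated as plain predicates. The following is a minimal Go sketch, assuming the object is available as JSON (for example from kubectl get -o json); it is not how Chainsaw evaluates assertions internally, and the healthy function and its types are hypothetical names introduced only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// status mirrors only the status fields named in the assertion patterns.
type status struct {
	Conditions []struct {
		Type   string `json:"type"`
		Status string `json:"status"`
	} `json:"conditions"`
	NumberReady            int `json:"numberReady"`
	DesiredNumberScheduled int `json:"desiredNumberScheduled"`
	ReadyReplicas          int `json:"readyReplicas"`
}

// object is the minimal envelope needed to pick a rule by kind.
type object struct {
	Kind   string `json:"kind"`
	Status status `json:"status"`
}

// healthy (hypothetical helper) applies the per-kind rule to one JSON object.
func healthy(raw []byte) (bool, error) {
	var o object
	if err := json.Unmarshal(raw, &o); err != nil {
		return false, err
	}
	switch o.Kind {
	case "Deployment":
		// Healthy when the Available condition reports "True".
		for _, c := range o.Status.Conditions {
			if c.Type == "Available" {
				return c.Status == "True", nil
			}
		}
		return false, nil
	case "DaemonSet":
		// At least one pod is scheduled and at least one is ready.
		return o.Status.NumberReady > 0 && o.Status.DesiredNumberScheduled > 0, nil
	case "StatefulSet":
		return o.Status.ReadyReplicas > 0, nil
	}
	return false, fmt.Errorf("unsupported kind %q", o.Kind)
}

func main() {
	ds := []byte(`{"kind":"DaemonSet","status":{"numberReady":2,"desiredNumberScheduled":2}}`)
	ok, err := healthy(ds)
	fmt.Println(ok, err) // prints "true <nil>"
}
```

Note that the DaemonSet and StatefulSet rules only require one ready pod, so they confirm that the workload is running rather than that it is fully rolled out.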

Chainsaw retries each assertion until it passes or the timeout expires. A resource that does not exist yet counts as a failed attempt, so polling continues until the resource appears and satisfies the assertion, or the timeout is reached.

Customization

Skipping disabled components

The aws-ebs-csi-driver component is disabled by default on EKS (the CSI driver is a managed addon). It is excluded from cluster assertions. If you enabled it with --set aws-ebs-csi-driver.enabled=true, add an assertion step for it.

Adjusting resource names

DaemonSet names for the GPU operator are created by the operator's ClusterPolicy, not the Helm chart directly. If your deployment uses non-default names, update assert-gpu-operator.yaml. The ClusterPolicy status assertion (status.state: ready) serves as a safety net — it validates the entire GPU stack regardless of individual DaemonSet names.

Documentation

Overview

ai-conformance-check parses Chainsaw assertion YAML files and verifies that every declared resource exists in the target Kubernetes cluster.

Usage:

go run ./tests/chainsaw/ai-conformance/ [--kubeconfig PATH] [--dir PATH]
