# AI-Conformance Inference Stack — Chainsaw Tests
Validates that the NVIDIA AI-conformance inference stack is correctly generated by AICR and healthy on a production EKS cluster. The stack satisfies CNCF AI Conformance requirements for GPU scheduling (KAI Scheduler), inference routing (kgateway with Gateway API Inference Extension), and the NVIDIA Dynamo serving platform.
## Recipe
Generated with:
```sh
aicr recipe \
  --service eks \
  --accelerator h100 \
  --os ubuntu \
  --intent inference \
  --platform dynamo \
  --output recipe.yaml
```
Overlay chain: base → monitoring-hpa → eks → eks-inference → h100-eks-inference → h100-eks-ubuntu-inference → h100-eks-ubuntu-inference-dynamo
Bundle generated with:
```sh
aicr bundle \
  --recipe recipe.yaml \
  --output ./bundle \
  --system-node-selector nodeGroup=system-pool \
  --accelerated-node-selector nodeGroup=gpu-worker \
  --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule
```
## Components (16)
| Component | Namespace | Type | What is Validated |
|---|---|---|---|
| cert-manager | cert-manager | Helm | 3 Deployments (controller, webhook, cainjector) |
| gpu-operator | gpu-operator | Helm | Operator Deployment, ClusterPolicy ready, 6 DaemonSets (driver, device-plugin, dcgm-exporter, toolkit, gfd, validator) |
| nvsentinel | nvsentinel | Helm | Controller Deployment, platform-connector DaemonSet |
| skyhook-operator | skyhook | Helm | Controller-manager Deployment |
| kube-prometheus-stack | monitoring | Helm | 3 Deployments (operator, grafana, kube-state-metrics), 2 StatefulSets (prometheus, alertmanager), node-exporter DaemonSet |
| k8s-ephemeral-storage-metrics | monitoring | Helm | Deployment |
| prometheus-adapter | monitoring | Helm | Deployment |
| aws-ebs-csi-driver | kube-system | Helm | Disabled by default (EKS managed addon) |
| aws-efa | kube-system | Helm | Device plugin DaemonSet |
| kgateway-crds | kgateway-system | Helm | CRDs only (Gateway API + Inference Extension) |
| kgateway | kgateway-system | Helm | Controller Deployment |
| skyhook-customizations | skyhook | Manifest | No workloads (NodeConfiguration CRs) |
| nvidia-dra-driver-gpu | nvidia-dra-driver | Helm | Controller Deployment, kubelet-plugin DaemonSet |
| kai-scheduler | kai-scheduler | Helm | Scheduler Deployment |
| dynamo-crds | dynamo-system | Helm | CRDs only |
| dynamo-platform | dynamo-system | Helm | Operator Deployment, etcd StatefulSet, NATS StatefulSet |
## Test Structure
```text
tests/chainsaw/ai-conformance/
├── README.md
├── offline/                      # No cluster needed
│   ├── chainsaw-test.yaml        # Recipe + bundle generation
│   └── assert-recipe.yaml        # Recipe structure assertion
└── cluster/                      # Requires deployed stack
    ├── chainsaw-test.yaml        # Cluster health check orchestration
    ├── assert-namespaces.yaml    # 9 namespaces exist
    ├── assert-crds.yaml          # Critical CRDs installed
    ├── assert-cert-manager.yaml  # cert-manager healthy
    ├── assert-gpu-operator.yaml  # GPU operator + DaemonSets healthy
    ├── assert-monitoring.yaml    # Prometheus stack healthy
    ├── assert-kube-system.yaml   # AWS EFA healthy
    ├── assert-kgateway.yaml      # kgateway healthy
    ├── assert-skyhook.yaml       # Skyhook operator healthy
    ├── assert-nvsentinel.yaml    # NVSentinel healthy
    ├── assert-dra-driver.yaml    # DRA driver healthy
    ├── assert-kai-scheduler.yaml # KAI scheduler healthy
    └── assert-dynamo.yaml        # Dynamo platform healthy
```
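The cluster orchestration file wires these assert files into sequential steps. A minimal sketch of the shape such a `chainsaw-test.yaml` takes — step names here are illustrative, not the actual file contents:

```yaml
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: ai-conformance-cluster   # illustrative name
spec:
  steps:
    # Assert files are resolved relative to the test directory.
    - name: namespaces-and-crds
      try:
        - assert:
            file: assert-namespaces.yaml
        - assert:
            file: assert-crds.yaml
    - name: cert-manager
      try:
        - assert:
            file: assert-cert-manager.yaml
```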
## Prerequisites
### Offline tests
- Built `aicr` binary (`go build -o dist/e2e/aicr ./cmd/aicr`)
- Chainsaw installed (`brew install kyverno/tap/chainsaw`)
- No cluster needed
### Cluster tests
- Chainsaw installed
- `kubectl` configured with access to the target cluster
- AI-conformance inference stack deployed (via `deploy.sh` from the bundle)
- At least one GPU node with H100 GPUs (for DaemonSet health checks)
## Running
### Offline — recipe + bundle generation
```sh
go build -o dist/e2e/aicr ./cmd/aicr
AICR_BIN=$(pwd)/dist/e2e/aicr chainsaw test \
  --no-cluster \
  --test-dir tests/chainsaw/ai-conformance/offline
```
### Cluster — post-deployment health check
```sh
chainsaw test \
  --test-dir tests/chainsaw/ai-conformance/cluster
```
To override the default kubeconfig:
```sh
chainsaw test \
  --test-dir tests/chainsaw/ai-conformance/cluster \
  --kube-config-overrides /path/to/kubeconfig
```
## Timeouts
| Component Group | Timeout | Reason |
|---|---|---|
| Namespaces, CRDs | 2m | Should exist immediately after deployment |
| cert-manager, kgateway, skyhook, monitoring, kai-scheduler | 5m | Standard Deployment rollout |
| gpu-operator, nvidia-dra-driver-gpu | 10m | GPU driver compilation on nodes is slow |
| dynamo-platform | 5m | Operator + etcd + NATS startup |
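These budgets can be set declaratively in the test spec. A sketch of a test-level default with a per-step override, assuming Chainsaw's `timeouts` field (the step name is illustrative; values mirror the table above):

```yaml
spec:
  timeouts:
    assert: 5m               # default assertion budget for most steps
  steps:
    - name: gpu-operator     # illustrative step name
      timeouts:
        assert: 10m          # driver compilation on nodes is slow
      try:
        - assert:
            file: assert-gpu-operator.yaml
```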
## Assertion Patterns
- Deployments: polls until `status.conditions[type=Available].status = "True"`
- DaemonSets: polls until `numberReady > 0` and `desiredNumberScheduled > 0`
- StatefulSets: polls until `readyReplicas > 0`
- ClusterPolicy: polls until `status.state = ready` (GPU operator umbrella check)
- CRDs: asserts existence by fully-qualified name
- Namespaces: asserts `status.phase = Active`
Chainsaw retries each assertion continuously until it passes or the step timeout expires; a resource that doesn't exist yet is simply polled until it appears.
## Customization
### Skipping disabled components
The `aws-ebs-csi-driver` component is disabled by default on EKS (the CSI driver is a managed addon) and is excluded from cluster assertions. If you enabled it with `--set aws-ebs-csi-driver.enabled=true`, add an assertion step for it.
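If you do enable it, the added step can follow the same Deployment pattern. A sketch, assuming the chart's default controller name (verify against your release):

```yaml
# Hypothetical assert-ebs-csi.yaml: EBS CSI controller healthy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ebs-csi-controller   # assumed default name from the aws-ebs-csi-driver chart
  namespace: kube-system
status:
  conditions:
    - type: Available
      status: "True"
```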
### Adjusting resource names
DaemonSet names for the GPU operator are created by the operator's ClusterPolicy, not by the Helm chart directly. If your deployment uses non-default names, update `assert-gpu-operator.yaml`. The ClusterPolicy status assertion (`status.state: ready`) serves as a safety net: it validates the entire GPU stack regardless of individual DaemonSet names.
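That safety-net check is a small assertion on the ClusterPolicy resource itself. A sketch, assuming the GPU operator's default object name (confirm the name in your cluster with `kubectl get clusterpolicy`):

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy   # assumed default created by the gpu-operator chart
status:
  state: ready
```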