# AI-Conformance Inference Stack — Chainsaw Tests
Validates that the NVIDIA AI-conformance inference stack is correctly generated by AICR and healthy on a production EKS cluster. The stack satisfies CNCF AI Conformance requirements for GPU scheduling (KAI Scheduler), inference routing (kgateway with Gateway API Inference Extension), and the NVIDIA Dynamo serving platform.
## Recipe
Generated with:
```sh
aicr recipe \
  --service eks \
  --accelerator h100 \
  --os ubuntu \
  --intent inference \
  --platform dynamo \
  --output recipe.yaml
```
Overlay chain: base → monitoring-hpa → eks → eks-inference → h100-eks-inference → h100-eks-ubuntu-inference → h100-eks-ubuntu-inference-dynamo
Bundle generated with:
```sh
aicr bundle \
  --recipe recipe.yaml \
  --output ./bundle \
  --system-node-selector nodeGroup=system-pool \
  --accelerated-node-selector nodeGroup=gpu-worker \
  --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule
```
## Components (16)
| Component | Namespace | Type | What is Validated |
|---|---|---|---|
| cert-manager | cert-manager | Helm | 3 Deployments (controller, webhook, cainjector) |
| gpu-operator | gpu-operator | Helm | Operator Deployment, ClusterPolicy ready, 6 DaemonSets (driver, device-plugin, dcgm-exporter, toolkit, gfd, validator) |
| nvsentinel | nvsentinel | Helm | Controller Deployment, platform-connector DaemonSet |
| skyhook-operator | skyhook | Helm | Controller-manager Deployment |
| kube-prometheus-stack | monitoring | Helm | 3 Deployments (operator, grafana, kube-state-metrics), 2 StatefulSets (prometheus, alertmanager), node-exporter DaemonSet |
| k8s-ephemeral-storage-metrics | monitoring | Helm | Deployment |
| prometheus-adapter | monitoring | Helm | Deployment |
| aws-ebs-csi-driver | kube-system | Helm | Disabled by default (EKS managed addon) |
| aws-efa | kube-system | Helm | Device plugin DaemonSet |
| kgateway-crds | kgateway-system | Helm | CRDs only (Gateway API + Inference Extension) |
| kgateway | kgateway-system | Helm | Controller Deployment |
| skyhook-customizations | skyhook | Manifest | No workloads (NodeConfiguration CRs) |
| nvidia-dra-driver-gpu | nvidia-dra-driver | Helm | Controller Deployment, kubelet-plugin DaemonSet |
| kai-scheduler | kai-scheduler | Helm | Scheduler Deployment |
| dynamo-crds | dynamo-system | Helm | CRDs only |
| dynamo-platform | dynamo-system | Helm | Operator Deployment, etcd StatefulSet, NATS StatefulSet |
## Test Structure
```text
tests/chainsaw/ai-conformance/
├── README.md
├── offline/                      # No cluster needed
│   ├── chainsaw-test.yaml        # Recipe + bundle generation
│   └── assert-recipe.yaml        # Recipe structure assertion
└── cluster/                      # Requires deployed stack
    ├── chainsaw-test.yaml        # Cluster health check orchestration
    ├── assert-namespaces.yaml    # 9 namespaces exist
    ├── assert-crds.yaml          # Critical CRDs installed
    ├── assert-cert-manager.yaml  # cert-manager healthy
    ├── assert-gpu-operator.yaml  # GPU operator + DaemonSets healthy
    ├── assert-monitoring.yaml    # Prometheus stack healthy
    ├── assert-kube-system.yaml   # AWS EFA healthy
    ├── assert-kgateway.yaml      # kgateway healthy
    ├── assert-skyhook.yaml       # Skyhook operator healthy
    ├── assert-nvsentinel.yaml    # NVSentinel healthy
    ├── assert-dra-driver.yaml    # DRA driver healthy
    ├── assert-kai-scheduler.yaml # KAI scheduler healthy
    └── assert-dynamo.yaml        # Dynamo platform healthy
```
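The cluster orchestration file wires these assert files into sequential steps. A minimal sketch of the shape such a `chainsaw-test.yaml` takes — step names here are illustrative, not the actual file contents:

```yaml
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: ai-conformance-cluster   # illustrative name
spec:
  steps:
    # Assert files are resolved relative to the test directory.
    - name: namespaces-and-crds
      try:
        - assert:
            file: assert-namespaces.yaml
        - assert:
            file: assert-crds.yaml
    - name: cert-manager
      try:
        - assert:
            file: assert-cert-manager.yaml
```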
## Prerequisites
### Offline tests
- Built `aicr` binary (`go build -o dist/e2e/aicr ./cmd/aicr`)
- Chainsaw installed (`brew install kyverno/tap/chainsaw`)
- No cluster needed
### Cluster tests
- Chainsaw installed
- `kubectl` configured with access to the target cluster
- AI-conformance inference stack deployed (via `deploy.sh` from the bundle)
- At least one GPU node with H100 GPUs (for DaemonSet health checks)
## Running
### Offline — recipe + bundle generation
```sh
go build -o dist/e2e/aicr ./cmd/aicr
AICR_BIN=$(pwd)/dist/e2e/aicr chainsaw test \
  --no-cluster \
  --test-dir tests/chainsaw/ai-conformance/offline
```
### Cluster — post-deployment health check
```sh
chainsaw test \
  --test-dir tests/chainsaw/ai-conformance/cluster
```
To override the default kubeconfig:
```sh
chainsaw test \
  --test-dir tests/chainsaw/ai-conformance/cluster \
  --kube-config-overrides /path/to/kubeconfig
```
## Timeouts
| Component Group | Timeout | Reason |
|---|---|---|
| Namespaces, CRDs | 2m | Should exist immediately after deployment |
| cert-manager, kgateway, skyhook, monitoring, kai-scheduler | 5m | Standard Deployment rollout |
| gpu-operator, nvidia-dra-driver-gpu | 10m | GPU driver compilation on nodes is slow |
| dynamo-platform | 5m | Operator + etcd + NATS startup |
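These budgets can be set declaratively in the test spec. A sketch of a test-level default with a per-step override, assuming Chainsaw's `timeouts` field (the step name is illustrative; values mirror the table above):

```yaml
spec:
  timeouts:
    assert: 5m               # default assertion budget for most steps
  steps:
    - name: gpu-operator     # illustrative step name
      timeouts:
        assert: 10m          # driver compilation on nodes is slow
      try:
        - assert:
            file: assert-gpu-operator.yaml
```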
## Assertion Patterns
- Deployments: polls until `status.conditions[type=Available].status = "True"`
- DaemonSets: polls until `numberReady > 0` and `desiredNumberScheduled > 0`
- StatefulSets: polls until `readyReplicas > 0`
- ClusterPolicy: polls until `status.state = ready` (GPU operator umbrella check)
- CRDs: asserts existence by fully-qualified name
- Namespaces: asserts `status.phase = Active`
Chainsaw retries each assertion continuously until it passes or the step timeout expires; a resource that doesn't exist yet is simply polled until it appears.
## Customization
### Skipping disabled components
The `aws-ebs-csi-driver` component is disabled by default on EKS (the CSI driver is a managed addon) and is excluded from cluster assertions. If you enabled it with `--set aws-ebs-csi-driver.enabled=true`, add an assertion step for it.
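If you do enable it, the added step can follow the same Deployment pattern. A sketch, assuming the chart's default controller name (verify against your release):

```yaml
# Hypothetical assert-ebs-csi.yaml: EBS CSI controller healthy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ebs-csi-controller   # assumed default name from the aws-ebs-csi-driver chart
  namespace: kube-system
status:
  conditions:
    - type: Available
      status: "True"
```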
### Adjusting resource names
DaemonSet names for the GPU operator are created by the operator's ClusterPolicy, not by the Helm chart directly. If your deployment uses non-default names, update `assert-gpu-operator.yaml`. The ClusterPolicy status assertion (`status.state: ready`) serves as a safety net: it validates the entire GPU stack regardless of individual DaemonSet names.
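That safety-net check is a small assertion on the ClusterPolicy resource itself. A sketch, assuming the GPU operator's default object name (confirm the name in your cluster with `kubectl get clusterpolicy`):

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy   # assumed default created by the gpu-operator chart
status:
  state: ready
```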