aicr

module
v0.15.0 Latest
Published: Jun 15, 2026 License: Apache-2.0

README

NVIDIA AI Cluster Runtime


AI Cluster Runtime (AICR) makes it easy to stand up GPU-accelerated Kubernetes clusters. It captures known-good combinations of drivers, operators, kernels, and system configurations and publishes them as version-locked recipes — reproducible artifacts for Helm, Argo CD, Flux, and Helmfile.

Full documentation: docs.nvidia.com/aicr

Why We Built This

Running GPU-accelerated Kubernetes clusters reliably is hard. Small differences in kernel versions, drivers, container runtimes, operators, and Kubernetes releases can cause failures that are difficult to diagnose and expensive to reproduce.

Historically, this knowledge has lived in internal validation pipelines and runbooks. AI Cluster Runtime makes it available to everyone.

Every AICR recipe is:

  • Optimized — Tuned for a specific combination of hardware, cloud, OS, and workload intent.
  • Validated — Passes automated constraint and compatibility checks before publishing.
  • Reproducible — Same inputs produce identical deployments every time.

Every AICR recipe also carries two kinds of cryptographic proof: where it came from (provenance — signed by NVIDIA CI, verifiable offline) and that it actually works on real hardware (validity — including signed validation results from contributors with cluster access NVIDIA doesn't have). See SECURITY.md and the bundle attestation, recipe evidence, and build provenance demos for the full chain.

Quick Start

# Install the CLI (Homebrew)
brew tap NVIDIA/aicr
brew install aicr

# Or use the install script
curl -sfL https://raw.githubusercontent.com/NVIDIA/aicr/main/install | bash -s --

# Generate a recipe for your environment
aicr recipe --service eks --accelerator h100 --os ubuntu \
  --intent training --platform kubeflow -o recipe.yaml

# Inspect any hydrated value (e.g., the resolved GPU driver version)
aicr query --service eks --accelerator h100 --os ubuntu --intent training --platform kubeflow \
  --selector components.gpu-operator.values.driver.version

# Render it into deployment-ready bundles (helm, argocd, argocd-helm, flux, or helmfile)
aicr bundle --recipe recipe.yaml --deployer argocd --output ./bundles

# After deploying the bundle, validate the running cluster against the recipe
aicr validate --recipe recipe.yaml

The contents of the bundles/ directory depend on the chosen --deployer: Argo CD Application manifests for argocd, a Helm chart app-of-apps for argocd-helm, HelmRelease and Kustomization manifests for flux, a helmfile.yaml release graph for helmfile, or simple Helm commands for helm.

See the Installation Guide for manual installation, building from source, and container images.

Features

Feature | Description
aicr CLI | Single binary for the full workflow: snapshot, recipe, bundle, validate, verify, diff, and trust management.
API Server (aicrd) | REST API exposing the same capabilities as the CLI. Run in-cluster for CI/CD integration or air-gapped environments.
Go Library (github.com/NVIDIA/aicr/pkg/client/v1) | Stable Go SDK facade for in-process consumers: the same workflow (resolve, bundle, snapshot, validate) callable from any Go program without a subprocess or REST hop. Per-Client isolation supports multi-tenant use.
Snapshot Agent | Kubernetes Job that captures live cluster state (GPU hardware, drivers, kernel, OS, operators, K8s config) into a ConfigMap for validation against recipes.
Multi-Deployer Bundles | Render the same recipe into Helm, Argo CD (App of Apps or Helm chart variant), Flux, or Helmfile artifacts; pick whichever fits your GitOps pipeline.
Multi-Phase Validation | Deployment, performance (training and inference), and conformance phases; run all or one at a time.
Drift Detection | aicr diff compares two snapshots to surface configuration drift between clusters or over time.
Supply Chain Security | SLSA Level 3 provenance, signed SBOMs, image attestations (Cosign / Sigstore), and aicr verify for offline bundle verification.
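
To make the Drift Detection row concrete, here is a minimal sketch of comparing two snapshots. It is not the pkg/diff API or the real snapshot schema: the Snapshot and Change types, the flattened dotted keys, and the sample values are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot is a hypothetical flattened view of cluster state:
// dotted key -> observed value. The real AICR snapshot format may differ.
type Snapshot map[string]string

// Change records one key whose value differs between two snapshots.
// An empty Before or After means the key was absent on that side.
type Change struct {
	Key, Before, After string
}

// diff returns every key that was added, removed, or changed between a and b,
// sorted by key so the report is stable across runs.
func diff(a, b Snapshot) []Change {
	var out []Change
	for k, av := range a {
		if bv, ok := b[k]; !ok || bv != av {
			out = append(out, Change{Key: k, Before: av, After: bv})
		}
	}
	for k, bv := range b {
		if _, ok := a[k]; !ok {
			out = append(out, Change{Key: k, After: bv})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

func main() {
	// Sample keys and values are made up for the example.
	before := Snapshot{"os.kernel": "kernel-a", "gpu.driver": "driver-a"}
	after := Snapshot{"os.kernel": "kernel-a", "gpu.driver": "driver-b", "k8s.version": "k8s-a"}
	for _, c := range diff(before, after) {
		fmt.Printf("%s: %q -> %q\n", c.Key, c.Before, c.After)
	}
}
```

Sorting the result keeps the report deterministic, which matters when drift output is compared between runs or checked into version control.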

Supported Components

AICR recipes compose components from the following groups:

Group | Examples
GPU stack | GPU Operator, DRA GPU Driver, Network Operator, NFD, NVSentinel
Cloud integration | AWS EFA, AWS EBS CSI, GKE NCCL TCPxO
Node tuning | Nodewright Operator and customizations, cert-manager
Observability | kube-prometheus-stack, Prometheus Operator CRDs, Prometheus Adapter, ephemeral-storage metrics
Training platforms | Kubeflow Trainer, Slinky Slurm Operator, KAI Scheduler, Kueue
Inference platforms | Dynamo, Grove, NIM Operator, Agent Gateway

See the full Component Catalog for every component, pinned version, and source. Don't see what you need? Open an issue — feedback helps inform future validation priorities.

Supported Environments

Dimension | Values
Services | AKS, BCM, EKS, GKE, Kind, LKE, OKE
Accelerators | A100, B200, GB200, H100, H200, L40, RTX PRO 6000
Operating systems | Amazon Linux, COS, RHEL, Talos, Ubuntu
Workload intents | Inference, Training
Platforms | Dynamo, Kubeflow, NIM, Run:ai, Slurm (Slinky)

How It Works

A recipe is a version-locked configuration for a specific environment. You describe your target (cloud, GPU, OS, workload intent, optional platform), and the recipe engine matches it against a library of validated overlays — layered configurations that compose bottom-up from base defaults through cloud, accelerator, OS, and workload-specific tuning. Composable mixins carry shared fragments (OS constraints, platform components) so a leaf overlay only declares what is unique to it.
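
The bottom-up layering described above can be pictured as a recursive map merge in which more specific layers override less specific ones. The sketch below is illustrative only and is not AICR's recipe engine: the Values type, the merge and compose helpers, and the sample keys and values are assumptions.

```go
package main

import "fmt"

// Values is a nested configuration tree, similar in shape to Helm values.
// Hypothetical type for this sketch; not taken from the AICR codebase.
type Values map[string]any

// merge returns a new tree in which keys from overlay take precedence over base.
// Nested Values are merged recursively; any other value is replaced wholesale.
// Neither input is modified.
func merge(base, overlay Values) Values {
	out := make(Values, len(base)+len(overlay))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range overlay {
		baseChild, baseIsMap := out[k].(Values)
		overlayChild, overlayIsMap := v.(Values)
		if baseIsMap && overlayIsMap {
			out[k] = merge(baseChild, overlayChild)
			continue
		}
		out[k] = v
	}
	return out
}

// compose folds layers in order, from least specific to most specific.
func compose(layers ...Values) Values {
	result := Values{}
	for _, layer := range layers {
		result = merge(result, layer)
	}
	return result
}

func main() {
	// Placeholder layers: base defaults, a cloud overlay, and a leaf overlay.
	base := Values{"driver": Values{"version": "base-version", "enabled": true}}
	cloud := Values{"network": Values{"efa": true}}
	leaf := Values{"driver": Values{"version": "leaf-version"}}

	// The leaf only declares the driver version; "enabled" is inherited from base.
	fmt.Println(compose(base, cloud, leaf))
}
```

Because merge copies rather than mutates, a shared fragment can be reused across many leaf overlays without one composition leaking into another.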

The bundler materializes a recipe into deployment-ready artifacts: one folder per component, each with Helm values, checksums, and a README. The validator compares a recipe against a live cluster snapshot — first checking declarative constraints, then optionally running deployment, performance, and conformance phases inside the cluster.

This separation means the same validated configuration works whether you deploy with Helm, Argo CD, Flux, Helmfile, or a custom pipeline.

What AI Cluster Runtime Is Not

  • Not a Kubernetes distribution
  • Not a cluster provisioner or lifecycle management system
  • Not a managed control plane or hosted service
  • Not a replacement for your cloud provider or OEM platform
  • Not a generic configuration management platform

At its core, AICR is a cluster configuration generator. You bring your GPU-accelerated Kubernetes cluster and your deployment tooling; AICR generates the runtime configuration artifacts your tools deploy to the cluster. AICR can also validate that the configuration was correctly materialized and that it delivers the expected performance characteristics.

Documentation

Full documentation, including contributor guides, lives at docs.nvidia.com/aicr.

Resources

  • Roadmap — Feature priorities and development timeline
  • Adopters — Organizations and projects using or building on AICR
  • Security — Supply chain security, vulnerability reporting, and verification
  • Releases — Binaries, SBOMs, and attestations
  • Issues — Bugs, feature requests, and questions
  • Slack — Join Kubernetes Slack and visit the #aicr channel

Contributing

AI Cluster Runtime is licensed under the Apache 2.0 License (see LICENSE). Contributions are welcome: new recipes for environments we haven't covered, additional bundler formats, validation checks, or bug reports. See CONTRIBUTING.md for development setup and the PR process.

Directories

Path Synopsis
cmd
aicr command
aicrd command
gate command
Command gate runs a Chainsaw test bundle in a polling loop against the cluster reachable from the current KUBECONFIG / in-cluster service account.
pkg
bom
Package bom builds CycloneDX 1.6 software bills-of-materials describing the container images AICR can deploy.
build
Package build defines the BuildSpec schema and the load / validate / write-back primitives used by the build pipeline.
bundler
Package bundler provides orchestration for generating deployment bundles from recipes.
bundler/attestation
Package attestation provides bundle attestation using Sigstore signing.
bundler/checksum
Package checksum provides SHA256 checksum generation for bundle verification.
bundler/config
Package config provides configuration options for bundler implementations.
bundler/deployer
Package deployer defines the shared interface and types for bundle deployers.
bundler/deployer/argocd
Package argocd provides Argo CD Application generation for recipes.
bundler/deployer/argocdhelm
Package argocdhelm generates a Helm chart app-of-apps for Argo CD with dynamic install-time values.
bundler/deployer/flux
Package flux provides Flux manifest generation for AICR recipes.
bundler/deployer/helm
Package helm generates per-component Helm bundles from recipe results.
bundler/deployer/helmfile
Package helmfile generates a helmfile.yaml release graph from a configured recipe.
bundler/deployer/localformat
Package localformat writes the uniform numbered local-chart bundle layout.
bundler/gatemanifest
Package gatemanifest synthesizes the Kubernetes manifests for a component readiness gate Job (ServiceAccount, RBAC, ConfigMap, Job).
bundler/registry
Package registry provides thread-safe registration and retrieval of bundler implementations.
bundler/result
Package result provides types for tracking bundle generation results.
bundler/types
Package types defines the type system for bundler implementations.
bundler/verifier
Package verifier implements offline bundle verification with a four-level trust model.
chainsawgate/runner
Package runner contains the chainsaw-test evaluation machinery used by the standalone `gate` CLI.
cli
Package cli implements the command-line interface for the AICR aicr tool.
client/v1
Package aicr is the stable, public Go library surface for external consumers of the AI Cluster Runtime.
collector
Package collector provides interfaces and implementations for collecting system configuration data.
collector/file
Package file provides a configurable parser for line-oriented configuration files (e.g., /etc/default/grub, /etc/os-release, /proc/sys entries).
collector/gpu
Package gpu collects GPU hardware data via driver-free NFD/PCI enumeration.
collector/k8s
Package k8s collects Kubernetes cluster configuration data.
collector/os
Package os collects operating system configuration data.
collector/systemd
Package systemd collects systemd service configuration data.
collector/talos
Package talos provides Talos-specific collector implementations used in place of the systemd D-Bus and /proc-based OS collectors when the recipe criteria declares os: talos.
component
Package component provides shared bundler utilities used by pkg/bundler and its deployers.
config
Package config defines the AICRConfig file schema accepted by the aicr CLI's --config flag on the snapshot, recipe, bundle, and validate commands.
constraints
Package constraints parses and evaluates constraint expressions (e.g.
defaults
Package defaults provides centralized configuration constants for the AICR system.
diff
Package diff compares AICR snapshots to detect configuration drift.
errors
Package errors provides structured error types for better observability and programmatic error handling across the application.
evidence
Package evidence is an umbrella for AICR's evidence kinds.
evidence/attestation
Package attestation implements the recipe-test-attestation evidence kind defined in ADR-007 (docs/design/007-recipe-evidence.md).
evidence/cncf
Package cncf renders CNCF AI Conformance evidence markdown from CTRF reports.
evidence/verifier
Package verifier implements `aicr evidence verify`: offline verification of a recipe-evidence v1 bundle produced by `aicr validate --emit-attestation`.
fingerprint
Package fingerprint extracts a structured cluster identity from a snapshot's collector measurements and compares it against a recipe's criteria.
header
Package header provides common header types for AICR data structures.
health
Package health computes per-recipe structural health across the whole criteria matrix, as specified by ADR-009 (Recipe Health Tracking).
helm
Package helm provides shared Helm chart rendering utilities used by both the mirror image discovery pipeline and the BOM generator.
helm/helmtest
Package helmtest provides test helpers for consumers of pkg/helm.
k8s
Package k8s provides Kubernetes integration for AI Cluster Runtime.
k8s/agent
Package agent provides Kubernetes Job deployment for automated snapshot capture.
k8s/client
Package client provides a singleton Kubernetes client for efficient cluster interactions.
k8s/pod
Package pod provides shared utilities for Kubernetes Job and Pod operations.
logging
Package logging provides structured logging utilities for AICR components.
manifest
Package manifest provides Helm-compatible template rendering for manifest files.
measurement
Package measurement provides types and utilities for collecting, comparing, and filtering system measurements from various sources (Kubernetes, GPU, OS, SystemD).
mirror
Package mirror discovers container images and Helm charts referenced by a recipe and emits the list in formats consumable by air-gap tools (Hauler, Zarf) and general-purpose formats (JSON, YAML).
netutil
Package netutil holds small, dependency-free networking helpers shared across packages that have no other reason to depend on one another (e.g.
oci
Package oci provides functionality for packaging and pushing artifacts to OCI-compliant registries.
recipe
Package recipe provides recipe building and matching functionality.
recipe/catalog
Package catalog provides signing and verification for the AICR recipe catalog (registry.yaml + validators/catalog.yaml).
recipe/oskind
Package oskind is the single source of truth for the string values of the OS recipe criterion.
serializer
Package serializer provides encoding and decoding of measurement data in multiple formats.
server
Package server implements the aicrd HTTP server: the AICR System Configuration Recommendation API defined in api/aicr/v1/server.yaml.
snapshotter
Package snapshotter captures comprehensive system configuration snapshots.
trust
Package trust manages Sigstore trusted root material for offline attestation verification.
validator
Package validator evaluates a recipe's constraints and validation checks against a cluster snapshot and the live cluster.
validator/catalog
Package catalog provides the declarative validator catalog.
validator/ctrf
Package ctrf provides Go types and utilities for the Common Test Report Format (CTRF).
validator/labels
Package labels provides shared label constants for validation resources.
validator/v1
Package v1 defines AICR's validator input format (v1alpha1).
version
Package version provides semantic version parsing and comparison with flexible precision support.
tests
chainsaw/ai-conformance command
ai-conformance-check parses Chainsaw assertion YAML files and verifies that every declared resource exists in the target Kubernetes cluster.
chainsaw/signing/bundle-attestation-private-sigstore/tlsproxy command
tlsproxy is a minimal HTTPS-termination reverse proxy used only by the private-Sigstore e2e suite.
tools
bom command
Command bom renders every Helm chart in recipes/registry.yaml at its pinned version and emits a CycloneDX 1.6 JSON BOM plus a Markdown summary listing every container image AICR can deploy.
coverage command
Command coverage generates docs/user/coverage-matrix.md — a structural matrix of which critical user journeys (CUJs) and CLI verbs are exercised by an in-repo test or demo, on what hardware class, and at what cadence.
health command
Command health computes catalog-wide recipe structural health via pkg/health.Compute and renders a Markdown matrix.
Package validators provides shared utilities for v2 validator containers.
chainsaw
Package chainsaw executes Chainsaw-style assertions against a live Kubernetes cluster, in-process.
conformance command
conformance is a validator container for all conformance phase checks.
deployment command
deployment is a validator container for all deployment phase checks.
helper
Package helper provides shared utilities for v2 validator containers.
performance command
performance is a validator container for all performance phase checks.
