checks

package

Version: v0.8.12 (latest) | Published: Mar 5, 2026 | License: Apache-2.0 | Imports: 18 | Imported by: 0

Repository

github.com/NVIDIA/aicr

README

Validation Checks and Constraint Registry

This package provides a registration framework for validation checks and constraint validators that run inside Kubernetes Jobs.

Overview

Architecture Overview

Validation checks run inside Kubernetes Jobs to verify cluster configuration and state. This architecture enables:

Cluster Access: Checks query live Kubernetes resources
Isolation: Each check runs in a dedicated Job for resource control
Testability: Graceful degradation when cluster access is unavailable
Observability: Captured results and logs for debugging

Two Types of Validation

Type	Purpose	Returns	Example
Check	Named validation test	`error`	`"operator-health"` checks if pods are running
Constraint Validator	Evaluates constraint expressions	`(actual string, passed bool, error)`	`"Deployment.gpu-operator.version"` checks version >= v24.6.0

Key difference:

Checks verify a condition and return pass/fail
Constraint Validators extract a value and evaluate it against a constraint expression
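
The difference in return shape can be sketched with two simplified function types. The types and helpers below (`CheckFunc`, `ConstraintFunc`, `podsRunning`, `replicasAtLeast`) are hypothetical stand-ins that use only the standard library; the real functions receive a `*checks.ValidationContext` and a `recipe.Constraint`.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// CheckFunc mirrors the shape of a check: nil means pass, an error means fail.
type CheckFunc func() error

// ConstraintFunc mirrors the shape of a constraint validator: it reports the
// observed value, whether the expression passed, and any evaluation error.
type ConstraintFunc func(expr string) (actual string, passed bool, err error)

// podsRunning is a hypothetical check over an in-memory pod count.
func podsRunning(running int) CheckFunc {
	return func() error {
		if running == 0 {
			return errors.New("no pods running")
		}
		return nil
	}
}

// replicasAtLeast is a hypothetical validator. For brevity, expr is a bare
// minimum such as "1" rather than a full expression such as ">= 1".
func replicasAtLeast(replicas int) ConstraintFunc {
	return func(expr string) (string, bool, error) {
		minimum, err := strconv.Atoi(expr)
		if err != nil {
			return "", false, fmt.Errorf("invalid minimum %q: %w", expr, err)
		}
		return strconv.Itoa(replicas), replicas >= minimum, nil
	}
}

func main() {
	// A failed check yields only an error.
	fmt.Println(podsRunning(0)())

	// A validator also reports the value it observed.
	actual, passed, err := replicasAtLeast(3)("1")
	fmt.Println(actual, passed, err)
}
```

Returning the observed value alongside the verdict lets a failed constraint report what was found, not only that it failed.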

Phase-Specific Execution

Phase	Constraints	Checks	Execution Context
Readiness	Evaluated inline from snapshot	N/A (constraint-only)	Snapshot data only
Deployment	Run in Jobs (need cluster access)	Run in Jobs	Snapshot + Live cluster
Performance	Run in Jobs (need measurements)	Run in Jobs	Snapshot + Live cluster
Conformance	Run in Jobs (need cluster access)	Run in Jobs	Snapshot + Live cluster

Key Insight: Readiness = Constraints Only. It validates prerequisites from snapshot data with no cluster access and no Jobs. All other phases need live cluster access, so their constraints AND checks run inside Jobs.

Directory Structure

pkg/validator/checks/
├── README.md                    # This file - Complete documentation
├── registry.go                  # Registration infrastructure
├── runner.go                    # Test runner for Job execution
├── generator.go                 # Code generator for new checks/constraints
├── deployment/                  # Deployment phase checks + constraints
│   ├── operator_health_check.go           # Check registration and implementation
│   ├── operator_health_check_test.go      # Integration test (runs in Jobs)
│   ├── operator_health_check_unit_test.go # Unit test (runs locally)
│   ├── gpu_operator_version_constraint.go           # Constraint validator
│   ├── gpu_operator_version_constraint_test.go      # Integration test
│   └── gpu_operator_version_constraint_unit_test.go # Unit test
├── performance/                 # Performance phase checks + constraints
│   ├── nccl_all_reduce_bw_constraint.go           # NCCL all-reduce BW constraint + registration
│   ├── nccl_all_reduce_bw_constraint_test.go      # Integration test (TestNcclAllReduceBw — runs in Jobs)
│   ├── nccl_all_reduce_bw_constraint_unit_test.go # Unit test (runs locally without cluster)
│   ├── trainer_lifecycle.go                       # Kubeflow Trainer install/uninstall lifecycle
│   └── testdata/h100/eks/                         # EKS+H100 TrainingRuntime/TrainJob templates
│       ├── runtime.yaml
│       └── trainjob.yaml
└── conformance/                 # Conformance phase checks + constraints

File Naming Convention

Type	Files Generated
Check	`<name>_check.go`, `<name>_check_test.go`, `<name>_check_unit_test.go`
Constraint	`<name>_constraint.go`, `<name>_constraint_test.go`, `<name>_constraint_unit_test.go`
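
As a rough illustration of the convention, the sketch below derives the three Go file names from a kebab-case name. `generatedFiles` is a hypothetical helper, not the generator's code, and it assumes `<name>` is already the short kebab-case form (how a dotted constraint name maps to `<name>` is not covered here).

```go
package main

import (
	"fmt"
	"strings"
)

// generatedFiles returns the file names implied by the naming convention.
// kind is expected to be "check" or "constraint".
func generatedFiles(name, kind string) []string {
	base := strings.ReplaceAll(name, "-", "_") + "_" + kind
	return []string{
		base + ".go",           // registration and validator function
		base + "_test.go",      // integration test (runs in Jobs)
		base + "_unit_test.go", // unit test (runs locally)
	}
}

func main() {
	for _, f := range generatedFiles("operator-health", "check") {
		fmt.Println(f)
	}
}
```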

Getting Started

Quick Start (5 minutes)

Use the generator to create a new check or constraint with all required files:

1. Generate a check:

make generate-validator ARGS="--check my-check --phase deployment --description 'Verify my component is healthy'"

This creates:

my_check_check.go - Registration and validator function
my_check_check_test.go - Integration test (runs in Kubernetes Jobs)
my_check_check_unit_test.go - Unit test (runs locally)
my_check_recipe.yaml - Sample recipe for testing
my_check_README.md - Instructions

2. Implement the validator function:

// pkg/validator/checks/deployment/my_check_check.go
func validateMyCheck(ctx *checks.ValidationContext) error {
    pods, err := ctx.Clientset.CoreV1().Pods("my-namespace").List(
        ctx.Context,
        metav1.ListOptions{LabelSelector: "app=my-component"},
    )
    if err != nil {
        return fmt.Errorf("failed to list pods: %w", err)
    }

    if len(pods.Items) == 0 {
        return fmt.Errorf("no pods found")
    }

    for _, pod := range pods.Items {
        if pod.Status.Phase == "Running" {
            return nil
        }
    }
    return fmt.Errorf("no pods running")
}

3. Add unit tests:

// pkg/validator/checks/deployment/my_check_check_unit_test.go
func TestValidateMyCheck(t *testing.T) {
    tests := []struct {
        name    string
        setup   func() *checks.ValidationContext
        wantErr bool
    }{
        {
            name: "pods running",
            setup: func() *checks.ValidationContext {
                return &checks.ValidationContext{
                    Context:   context.Background(),
                    Clientset: fake.NewSimpleClientset(&runningPod),
                }
            },
            wantErr: false,
        },
        {
            name: "no pods found",
            setup: func() *checks.ValidationContext {
                return &checks.ValidationContext{
                    Context:   context.Background(),
                    Clientset: fake.NewSimpleClientset(),
                }
            },
            wantErr: true,
        },
    }
    // ... test execution
}

4. Run unit tests:

go test -short -v ./pkg/validator/checks/deployment/... -run TestValidateMyCheck

5. Use in recipe:

validation:
  deployment:
    checks:
      - my-check

Done! Your check will run inside validation Jobs.

Generate a constraint validator:

make generate-validator ARGS="--constraint Deployment.my-app.version --phase deployment"

Key Principles

Readiness = Constraints Only - Pre-deployment constraints evaluated inline from snapshot data (no checks, no Jobs, no cluster access)
Other Phases = Cluster Access Required - Deployment/Performance/Conformance need live queries
Self-Registration - Checks register themselves with the registry via init()
Job Isolation - Each check runs in its own Job for resource control
Graceful Degradation - Test mode handles missing cluster gracefully

Example Recipe Usage

# expectedResources are declared on componentRefs (used by expected-resources check)
componentRefs:
  - name: gpu-operator
    type: Helm
    expectedResources:
      - kind: Deployment
        name: gpu-operator
        namespace: gpu-operator
      - kind: DaemonSet
        name: nvidia-driver-daemonset
        namespace: gpu-operator

validation:
  deployment:
    constraints:
      # These run INSIDE the Job with cluster access
      - name: Deployment.gpu-operator.version
        value: ">= v25.10.1"
      - name: Deployment.device-plugin.replicas
        value: ">= 1"
    checks:
      # These also run inside the Job
      - operator-health
      - expected-resources  # validates componentRefs[].expectedResources

Registration Pattern

Registering a Check

Checks use Go's init() pattern for self-registration. Use TestName to specify which test function runs in Jobs:

// pkg/validator/checks/deployment/my_check_check.go
package deployment

import "github.com/NVIDIA/aicr/pkg/validator/checks"

func init() {
    checks.RegisterCheck(&checks.Check{
        Name:        "my-check",
        Description: "Verify my component is healthy",
        Phase:       "deployment",
        TestName:    "TestCheckMyCheck",  // Test function name for Job execution
    })
}

// validateMyCheck is the validator function (private for encapsulation)
func validateMyCheck(ctx *checks.ValidationContext) error {
    // Validation logic here
    return nil
}
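
To make the self-registration mechanism concrete, here is a minimal standard-library sketch of a name-keyed registry filled from `init()`. It is not the package's `registry.go`: the trimmed `Check` struct, the mutex, and the panic on duplicate names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Check is a trimmed-down stand-in for checks.Check.
type Check struct {
	Name     string
	Phase    string
	TestName string
}

var (
	mu       sync.RWMutex
	registry = map[string]*Check{}
)

// RegisterCheck stores a check by name. Panicking on duplicates (an
// assumption of this sketch) surfaces conflicting names at program start.
func RegisterCheck(c *Check) {
	mu.Lock()
	defer mu.Unlock()
	if _, exists := registry[c.Name]; exists {
		panic(fmt.Sprintf("check %q registered twice", c.Name))
	}
	registry[c.Name] = c
}

// GetCheck looks up a registered check by name.
func GetCheck(name string) (*Check, bool) {
	mu.RLock()
	defer mu.RUnlock()
	c, ok := registry[name]
	return c, ok
}

// init runs before main, so importing the package is enough to register.
func init() {
	RegisterCheck(&Check{Name: "my-check", Phase: "deployment", TestName: "TestCheckMyCheck"})
}

func main() {
	c, ok := GetCheck("my-check")
	fmt.Println(ok, c.TestName)
}
```

Because registration happens in `init()`, a package that is never imported never registers its checks, which is why a blank import of the phase package is sometimes required.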

Registering a Constraint Validator

Constraint validators evaluate constraints that need cluster access:

// pkg/validator/checks/deployment/my_constraint_constraint.go
package deployment

import (
    "github.com/NVIDIA/aicr/pkg/recipe"
    "github.com/NVIDIA/aicr/pkg/validator/checks"
)

func init() {
    checks.RegisterConstraintValidator(&checks.ConstraintValidator{
        Name:        "Deployment.my-app.version",
        Description: "Validates my-app deployment version",
        TestName:    "TestMyAppVersion",  // Test function name for Job execution
        Phase:       "deployment",
    })
}

// validateMyAppVersion is the validator function (private for encapsulation)
func validateMyAppVersion(
    ctx *checks.ValidationContext,
    constraint recipe.Constraint,
) (actual string, passed bool, err error) {
    // Query live cluster
    deployment, err := ctx.Clientset.AppsV1().Deployments("my-namespace").Get(
        ctx.Context, "my-app", metav1.GetOptions{})
    if err != nil {
        return "", false, err
    }

    // Extract actual value (e.g., version from image tag)
    actual = extractVersion(deployment.Spec.Template.Spec.Containers[0].Image)

    // Evaluate constraint expression
    passed, err = evaluateVersionConstraint(actual, constraint.Value)

    return actual, passed, err
}

Validation Context

The ValidationContext provides runtime access to:

type ValidationContext struct {
    Context    context.Context        // Cancellation and timeouts
    Snapshot   *snapshotter.Snapshot  // Captured cluster state
    Clientset  kubernetes.Interface   // Live Kubernetes API access
    RecipeData map[string]interface{} // Recipe metadata
}

Snapshot: Hardware, OS, and pre-capture cluster state
Clientset: Query live cluster (deployments, pods, services, etc.)
RecipeData: Access recipe configuration if needed

Test Wrappers for Job Execution

Why Test Wrappers?

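Validation checks run inside Kubernetes Jobs via go test. Each Job runs a command of the following form. The -run flag takes a regular expression that is matched against test function names, so it must name the wrapper function rather than the check:

go test -v -json ./pkg/validator/checks/deployment -run '^TestOperatorHealth$'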

For go test to discover and run your check, you need a Test* function that:

Loads ValidationContext from the Job environment (snapshot, K8s client)
Executes the registered check by name
Reports results in standard Go test format

Adding a Test Wrapper

Note: When using the generator (make generate-validator), test wrappers are automatically created. The following is for manual creation.

Step 1: Add Test Wrapper to Your Check's Integration Test File

The integration test file (*_check_test.go) contains the test wrapper that runs in Kubernetes Jobs:

// pkg/validator/checks/deployment/operator_health_check_test.go

// TestOperatorHealth is the integration test for operator-health.
// This runs inside validator Jobs and invokes the validator.
func TestOperatorHealth(t *testing.T) {
    if testing.Short() {
        t.Skip("Skipping integration test in short mode")
    }

    runner, err := checks.NewTestRunner(t)
    if err != nil {
        // Skip if not running in Kubernetes (expected during local test runs)
        t.Skipf("Not in Job environment: %v", err)
    }
    defer runner.Cancel()

    runner.RunCheck("operator-health")
}

Step 2: Naming Convention

The test wrapper function name must match the check name pattern:

Check Name	Test Wrapper Function
`operator-health`	`TestOperatorHealth`
`nccl-bandwidth`	`TestNCCLBandwidth`

Pattern: Convert kebab-case to PascalCase and prefix with Test.
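
A minimal sketch of that conversion follows. `testWrapperName` is a hypothetical helper, not part of the package. It only upper-cases the first letter of each segment, so it does not reproduce acronym casing such as `TestNCCLBandwidth`; names like that would need to be written by hand or handled by extra rules.

```go
package main

import (
	"fmt"
	"strings"
)

// testWrapperName converts a kebab-case check name to "Test" + PascalCase.
// Only the first byte of each segment is upper-cased, which is sufficient
// for ASCII names.
func testWrapperName(checkName string) string {
	var b strings.Builder
	b.WriteString("Test")
	for _, part := range strings.Split(checkName, "-") {
		if part == "" {
			continue // tolerate doubled or trailing hyphens
		}
		b.WriteString(strings.ToUpper(part[:1]))
		b.WriteString(part[1:])
	}
	return b.String()
}

func main() {
	fmt.Println(testWrapperName("operator-health")) // TestOperatorHealth
	fmt.Println(testWrapperName("nccl-bandwidth"))  // TestNcclBandwidth
}
```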

How the Test Runner Works

The checks.NewTestRunner(t) function:

Creates in-cluster Kubernetes client using rest.InClusterConfig()
Loads snapshot from mounted file at $AICR_SNAPSHOT_PATH (default: /data/snapshot/snapshot.yaml)
Loads recipe data from $AICR_RECIPE_DATA environment variable (optional)
Returns TestRunner with fully initialized ValidationContext
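
The snapshot path lookup with its default can be sketched as follows. This is an illustration only: `snapshotPath` and its injected `getenv` parameter are hypothetical, chosen so the fallback can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
)

const defaultSnapshotPath = "/data/snapshot/snapshot.yaml"

// snapshotPath returns the value of AICR_SNAPSHOT_PATH, or the default when
// the variable is unset or empty. getenv is injected to keep it testable.
func snapshotPath(getenv func(string) string) string {
	if p := getenv("AICR_SNAPSHOT_PATH"); p != "" {
		return p
	}
	return defaultSnapshotPath
}

func main() {
	fmt.Println(snapshotPath(os.Getenv))
}
```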

The runner.RunCheck("check-name") method:

Looks up check in registry by name
Executes check function with the loaded ValidationContext
Reports results via t.Fatalf() on failure, or returns on success

Complete Test Wrapper Example

// pkg/validator/checks/performance/nccl_bandwidth.go
package performance

// "fmt" is omitted because the stub below does not use it, and Go rejects unused imports.
import "github.com/NVIDIA/aicr/pkg/validator/checks"

func init() {
    checks.RegisterCheck(&checks.Check{
        Name:        "nccl-bandwidth",
        Description: "Measure NCCL all-reduce bandwidth",
        Phase:       "performance",
        Func:        CheckNCCLBandwidth,
    })
}

func CheckNCCLBandwidth(ctx *checks.ValidationContext) error {
    // Implementation...
    return nil
}

// pkg/validator/checks/performance/nccl_bandwidth_test.go
package performance

import (
    "testing"
    "github.com/NVIDIA/aicr/pkg/validator/checks"
)

// Test wrapper for Job execution
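func TestNCCLBandwidth(t *testing.T) {
    runner, err := checks.NewTestRunner(t)
    if err != nil {
        // Skipf stops the test, so no explicit return is needed.
        t.Skipf("Skipping integration test (not in Kubernetes): %v", err)
    }
    defer runner.Cancel() // release the runner's context, as in TestOperatorHealth above

    runner.RunCheck("nccl-bandwidth")
}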

// Unit tests with mocked context
func TestCheckNCCLBandwidth(t *testing.T) {
    tests := []struct {
        name    string
        // ...
    }{
        // Test cases...
    }

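    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            ctx := &checks.ValidationContext{
                // Mocked context...
            }
            // Replace with assertions that match each test case.
            if err := CheckNCCLBandwidth(ctx); err != nil {
                t.Errorf("CheckNCCLBandwidth() error = %v", err)
            }
        })
    }
}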

Environment Variables

The validation Job automatically sets these environment variables:

Variable	Purpose	Example
`AICR_SNAPSHOT_PATH`	Path to mounted snapshot file	`/data/snapshot/snapshot.yaml`
`AICR_RECIPE_PATH`	Path to mounted recipe file	`/data/recipe/recipe.yaml`
`AICR_NAMESPACE`	Namespace where Job is running	`aicr-validation`
`AICR_RESULT_CONFIGMAP`	ConfigMap name for results	`aicr-validation-deployment-operator-health-result`

Local vs Job Execution

Local execution (go test ./pkg/validator/checks/...):

Test wrappers skip (no in-cluster config available)
Unit tests run (use mocked context)
Fast feedback during development

Job execution (go test -run '^TestOperatorHealth$'):

Test wrappers run (inside Kubernetes)
Unit tests excluded by -run pattern
Real validation against live cluster
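
Since `go test -run` treats its argument as an unanchored regular expression over test names, the choice of pattern decides which functions run. The sketch below mimics that selection with the `regexp` package; `selected` is a hypothetical helper and the test names are taken from examples in this document.

```go
package main

import (
	"fmt"
	"regexp"
)

// selected reports which top-level test names an unanchored pattern matches,
// which is how go test -run picks top-level tests.
func selected(pattern string, names []string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, n := range names {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{"TestOperatorHealth", "TestCheckOperatorHealth", "TestOperatorHealthLocal"}
	fmt.Println(selected("TestOperatorHealth", names))   // also picks up TestOperatorHealthLocal
	fmt.Println(selected("^TestOperatorHealth$", names)) // only the wrapper
}
```

Anchoring the pattern with `^` and `$` keeps similarly named unit tests out of the Job run.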

How-To Guide

Adding a Check

Step 1: Create Check File

Create a file in the appropriate phase directory:

pkg/validator/checks/deployment/ - For deployment checks
pkg/validator/checks/performance/ - For performance checks
pkg/validator/checks/conformance/ - For conformance checks

Example: pkg/validator/checks/deployment/operator_health_check.go

Step 2: Implement Check Function

// Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
// [Standard license header...]

package deployment

import (
    "fmt"

    "github.com/NVIDIA/aicr/pkg/validator/checks"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func init() {
    checks.RegisterCheck(&checks.Check{
        Name:        "operator-health",              // ← Used in recipe
        Description: "Verify GPU operator is healthy",
        Phase:       "deployment",                   // ← Must match phase
        Func:        CheckOperatorHealth,
    })
}

// CheckOperatorHealth verifies the GPU operator pods are running.
func CheckOperatorHealth(ctx *checks.ValidationContext) error {
    // Access live cluster via ctx.Clientset
    pods, err := ctx.Clientset.CoreV1().Pods("gpu-operator").List(
        ctx.Context,
        metav1.ListOptions{LabelSelector: "app=gpu-operator"},
    )
    if err != nil {
        return fmt.Errorf("failed to list GPU operator pods: %w", err)
    }

    if len(pods.Items) == 0 {
        return fmt.Errorf("no GPU operator pods found")
    }

    // Verify at least one pod is running
    for _, pod := range pods.Items {
        if pod.Status.Phase == "Running" {
            return nil // Success!
        }
    }

    return fmt.Errorf("no GPU operator pods in Running state")
}

Step 3: Add Test Wrapper

// pkg/validator/checks/deployment/operator_health_check_test.go
package deployment

import (
    "testing"
    "github.com/NVIDIA/aicr/pkg/validator/checks"
)

func TestOperatorHealth(t *testing.T) {
    runner, err := checks.NewTestRunner(t)
    if err != nil {
        t.Skipf("Skipping integration test (not in Kubernetes): %v", err)
        return
    }
    runner.RunCheck("operator-health")
}

Step 4: Use in Recipe

validation:
  deployment:
    checks:
      - operator-health  # ← Must match Check.Name

Step 5: Import Package (if needed)

If the package isn't already imported, add it to trigger init():

// In main.go or test file
import _ "github.com/NVIDIA/aicr/pkg/validator/checks/deployment"

Adding a Constraint Validator

Step 1: Create Constraints File

Create constraints.go in the phase directory:

pkg/validator/checks/deployment/constraints.go
pkg/validator/checks/performance/constraints.go
pkg/validator/checks/conformance/constraints.go

Step 2: Implement Constraint Validator

// Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
// [Standard license header...]

package deployment

import (
    "context"
    "fmt"

    "github.com/NVIDIA/aicr/pkg/recipe"
    "github.com/NVIDIA/aicr/pkg/validator"
    "github.com/NVIDIA/aicr/pkg/validator/checks"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

func init() {
    checks.RegisterConstraintValidator(&checks.ConstraintValidator{
        Name:        "Deployment.device-plugin.replicas",  // ← Constraint name
        Description: "Validates device plugin replica count",
        Func:        ValidateDevicePluginReplicas,
    })
}

// ValidateDevicePluginReplicas checks the device plugin replica count.
// Constraint format: "Deployment.device-plugin.replicas"
// Constraint value: ">= 1", "== 3", etc.
func ValidateDevicePluginReplicas(
    ctx *checks.ValidationContext,
    constraint recipe.Constraint,
) (string, bool, error) {
    // 1. Query cluster to get actual value
    replicas, err := getDevicePluginReplicas(ctx.Context, ctx.Clientset)
    if err != nil {
        return "", false, fmt.Errorf("failed to get replica count: %w", err)
    }

    // 2. Convert to string for comparison
    actualValue := fmt.Sprintf("%d", replicas)

    // 3. Evaluate constraint expression
    passed, err := evaluateConstraint(actualValue, constraint.Value)
    if err != nil {
        return actualValue, false, fmt.Errorf("constraint evaluation failed: %w", err)
    }

    // 4. Return: (actual value, pass/fail, error)
    return actualValue, passed, nil
}

// Helper: Get actual replica count from cluster
func getDevicePluginReplicas(ctx context.Context, clientset kubernetes.Interface) (int, error) {
    deployment, err := clientset.AppsV1().Deployments("gpu-operator").Get(
        ctx,
        "nvidia-device-plugin",
        metav1.GetOptions{},
    )
    if err != nil {
        return 0, err
    }

    if deployment.Spec.Replicas == nil {
        return 0, nil
    }

    return int(*deployment.Spec.Replicas), nil
}

// Helper: Evaluate constraint expression
func evaluateConstraint(actualValue, constraintExpr string) (bool, error) {
    parsed, err := validator.ParseConstraintExpression(constraintExpr)
    if err != nil {
        return false, fmt.Errorf("invalid constraint expression: %w", err)
    }

    passed, err := parsed.Evaluate(actualValue)
    if err != nil {
        return false, fmt.Errorf("evaluation failed: %w", err)
    }

    return passed, nil
}
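
For intuition about what an expression such as ">= 1" involves, here is a small standard-library sketch that evaluates integer constraints. It is not `validator.ParseConstraintExpression`: the `evalIntConstraint` name, the required space between operator and operand, and the supported operator set are assumptions of this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// evalIntConstraint evaluates expressions of the form "<op> <int>", for
// example ">= 1" or "== 3", against an integer rendered as a string.
func evalIntConstraint(actual, expr string) (bool, error) {
	op, operand, found := strings.Cut(strings.TrimSpace(expr), " ")
	if !found {
		return false, fmt.Errorf("expected \"<op> <value>\", got %q", expr)
	}
	got, err := strconv.Atoi(strings.TrimSpace(actual))
	if err != nil {
		return false, fmt.Errorf("actual value %q is not an integer: %w", actual, err)
	}
	want, err := strconv.Atoi(strings.TrimSpace(operand))
	if err != nil {
		return false, fmt.Errorf("operand %q is not an integer: %w", operand, err)
	}
	switch op {
	case ">=":
		return got >= want, nil
	case "<=":
		return got <= want, nil
	case ">":
		return got > want, nil
	case "<":
		return got < want, nil
	case "==":
		return got == want, nil
	case "!=":
		return got != want, nil
	default:
		return false, fmt.Errorf("unsupported operator %q", op)
	}
}

func main() {
	fmt.Println(evalIntConstraint("3", ">= 1"))
	fmt.Println(evalIntConstraint("0", ">= 1"))
}
```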

Step 3: Use in Recipe

validation:
  deployment:
    constraints:
      - name: Deployment.device-plugin.replicas  # ← Must match the registered Name
        value: ">= 1"                             # ← Constraint expression

Step 4: Import Package (if needed)

Same as checks - ensure the package is imported to trigger init().

Phase-Specific Considerations

Deployment Phase

Typical validations:

Operator health and readiness
Deployment resource versions
Pod counts and statuses
ConfigMap/Secret presence

Example constraint names:

Deployment.gpu-operator.version
Deployment.device-plugin.replicas
Deployment.dcgm-exporter.enabled

Access patterns:

// Deployments
deployment, _ := ctx.Clientset.AppsV1().Deployments(ns).Get(ctx.Context, name, metav1.GetOptions{})

// Pods
pods, _ := ctx.Clientset.CoreV1().Pods(ns).List(ctx.Context, metav1.ListOptions{LabelSelector: "app=foo"})

// ConfigMaps
cm, _ := ctx.Clientset.CoreV1().ConfigMaps(ns).Get(ctx.Context, name, metav1.GetOptions{})

Performance Phase

Typical validations:

NCCL all-reduce bus bandwidth (east-west fabric between GPU nodes)
Network fabric health
GPU-to-GPU communication latency
Storage I/O performance

Example constraint names:

nccl-all-reduce-bw (implemented — EKS + H100)
Performance.network.latency
Performance.gpu.peer-access

Implemented constraints:

nccl-all-reduce-bw — Runs a Kubeflow Trainer TrainJob with NCCL all_reduce_perf, parses the 16 GB bus bandwidth from launcher logs, and validates it is within 10% of the recipe threshold. Skips gracefully when fewer than 2 GPU nodes are available (requires EKS + H100 to run). Auto-installs Kubeflow Trainer if not already present and tears it down on exit.
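
The tolerance arithmetic can be sketched as below. This assumes "within 10%" means the measured bandwidth may fall at most 10% below the threshold and that faster results always pass; the actual comparison in the constraint may differ. `meetsThreshold` and the 400 GB/s figure are illustrative only.

```go
package main

import "fmt"

// meetsThreshold reports whether measured is at most tolerance below
// threshold, where tolerance is a fraction (0.10 means 10%).
// Assumption: only the lower bound matters.
func meetsThreshold(measured, threshold, tolerance float64) bool {
	return measured >= threshold*(1-tolerance)
}

func main() {
	const threshold = 400.0 // GB/s, illustrative value only
	fmt.Println(meetsThreshold(370, threshold, 0.10)) // lower bound is 360
	fmt.Println(meetsThreshold(350, threshold, 0.10))
}
```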

Access patterns:

// Dynamic client for CRD and TrainJob operations
dynamicClient, _ := dynamic.NewForConfig(ctx.RESTConfig)

// List schedulable GPU nodes
nodes, _ := ctx.Clientset.CoreV1().Nodes().List(ctx.Context, metav1.ListOptions{})

// Watch launcher pod for completion
watcher, _ := ctx.Clientset.CoreV1().Pods(ns).Watch(ctx.Context, metav1.ListOptions{
    FieldSelector: "metadata.name=" + podName,
})

Conformance Phase

Typical validations:

Kubernetes API version compatibility
RBAC policy conformance
CRD schema validation
AI workload compatibility

Example constraint names:

Conformance.k8s.version
Conformance.api.gpu-device
Conformance.workload.pytorch

Access patterns:

// API version
version, _ := ctx.Clientset.Discovery().ServerVersion()

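// CRDs (kubernetes.Interface has no ApiextensionsV1 accessor; use the clientset from
// k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset, imported here as apiextensionsclientset)
apiextClient, _ := apiextensionsclientset.NewForConfig(ctx.RESTConfig)
crd, _ := apiextClient.ApiextensionsV1().CustomResourceDefinitions().Get(ctx.Context, "gpus.nvidia.com", metav1.GetOptions{})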

// Run conformance test workloads
job, _ := ctx.Clientset.BatchV1().Jobs(ns).Create(ctx.Context, testJob, metav1.CreateOptions{})

Testing

Unit Test for Check

func TestCheckOperatorHealth(t *testing.T) {
    tests := []struct {
        name    string
        pods    []corev1.Pod
        wantErr bool
    }{
        {
            name: "healthy operator",
            pods: []corev1.Pod{
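                {
                    // Namespace and label must match what CheckOperatorHealth lists,
                    // otherwise the fake clientset returns no pods.
                    ObjectMeta: metav1.ObjectMeta{
                        Name:      "gpu-operator-abc",
                        Namespace: "gpu-operator",
                        Labels:    map[string]string{"app": "gpu-operator"},
                    },
                    Status: corev1.PodStatus{Phase: corev1.PodRunning},
                },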
            },
            wantErr: false,
        },
        {
            name:    "no pods found",
            pods:    []corev1.Pod{},
            wantErr: true,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // Create fake clientset with test data
            var objects []runtime.Object
            for i := range tt.pods {
                objects = append(objects, &tt.pods[i])
            }
            clientset := fake.NewSimpleClientset(objects...)

            ctx := &checks.ValidationContext{
                Context:   context.Background(),
                Clientset: clientset,
            }

            err := CheckOperatorHealth(ctx)
            if (err != nil) != tt.wantErr {
                t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
            }
        })
    }
}

Unit Test for Constraint Validator

func TestValidateDevicePluginReplicas(t *testing.T) {
    tests := []struct {
        name          string
        deployment    *appsv1.Deployment
        constraint    recipe.Constraint
        wantActual    string
        wantPassed    bool
        wantErr       bool
    }{
        {
            name: "constraint satisfied",
            deployment: &appsv1.Deployment{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      "nvidia-device-plugin",
                    Namespace: "gpu-operator",
                },
                Spec: appsv1.DeploymentSpec{
                    Replicas: ptr.To(int32(3)),
                },
            },
            constraint: recipe.Constraint{
                Name:  "Deployment.device-plugin.replicas",
                Value: ">= 1",
            },
            wantActual: "3",
            wantPassed: true,
            wantErr:    false,
        },
        {
            name: "constraint not satisfied",
            deployment: &appsv1.Deployment{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      "nvidia-device-plugin",
                    Namespace: "gpu-operator",
                },
                Spec: appsv1.DeploymentSpec{
                    Replicas: ptr.To(int32(0)),
                },
            },
            constraint: recipe.Constraint{
                Name:  "Deployment.device-plugin.replicas",
                Value: ">= 1",
            },
            wantActual: "0",
            wantPassed: false,
            wantErr:    false,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            clientset := fake.NewSimpleClientset(tt.deployment)

            ctx := &checks.ValidationContext{
                Context:   context.Background(),
                Clientset: clientset,
            }

            actual, passed, err := ValidateDevicePluginReplicas(ctx, tt.constraint)

            if (err != nil) != tt.wantErr {
                t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
            }
            if actual != tt.wantActual {
                t.Errorf("actual = %v, want %v", actual, tt.wantActual)
            }
            if passed != tt.wantPassed {
                t.Errorf("passed = %v, want %v", passed, tt.wantPassed)
            }
        })
    }
}

Integration Test

func TestConstraintValidatorRegistration(t *testing.T) {
    // Verify the validator is registered
    validator, ok := checks.GetConstraintValidator("Deployment.device-plugin.replicas")
    if !ok {
        t.Fatal("Constraint validator not registered")
    }

    if validator.Name != "Deployment.device-plugin.replicas" {
        t.Errorf("Name = %v, want Deployment.device-plugin.replicas", validator.Name)
    }

    if validator.Func == nil {
        t.Fatal("Func is nil")
    }
}

Testing Checks Locally

func TestOperatorHealthLocal(t *testing.T) {
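    // The check lists pods, so seed the fake clientset with a running pod
    // that matches the namespace and label selector used by the check.
    pod := &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "gpu-operator-abc",
            Namespace: "gpu-operator",
            Labels:    map[string]string{"app": "gpu-operator"},
        },
        Status: corev1.PodStatus{Phase: corev1.PodRunning},
    }
    clientset := fake.NewSimpleClientset(pod)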

    ctx := &checks.ValidationContext{
        Context:   context.Background(),
        Clientset: clientset,
    }

    check, ok := checks.GetCheck("operator-health")
    if !ok {
        t.Fatal("check not registered")
    }

    if err := check.Func(ctx); err != nil {
        t.Errorf("check failed: %v", err)
    }
}

Common Patterns

Pattern 1: Version Constraint Validator

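func ValidateComponentVersion(ctx *checks.ValidationContext, constraint recipe.Constraint) (string, bool, error) {
    // 0. Fetch the deployment ("my-namespace" and "my-app" are placeholders)
    deployment, err := ctx.Clientset.AppsV1().Deployments("my-namespace").Get(
        ctx.Context, "my-app", metav1.GetOptions{})
    if err != nil {
        return "", false, fmt.Errorf("failed to get deployment: %w", err)
    }

    // 1. Get version from deployment label
    version := deployment.Labels["app.kubernetes.io/version"]

    // 2. Fallback: parse from the first container's image tag
    containers := deployment.Spec.Template.Spec.Containers
    if version == "" && len(containers) > 0 {
        version = extractVersionFromImage(containers[0].Image)
    }

    // 3. Normalize version (add 'v' prefix if missing)
    version = normalizeVersion(version)

    // 4. Evaluate constraint
    passed, err := evaluateVersionConstraint(version, constraint.Value)

    return version, passed, err
}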
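
The helpers `normalizeVersion` and `extractVersionFromImage` are referenced in this README but not shown. The following is one possible standard-library sketch of them, not the package's implementation; in particular, the handling of digests and registry ports is an assumption about what such helpers would need to cover.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeVersion adds a leading "v" when it is missing.
func normalizeVersion(v string) string {
	v = strings.TrimSpace(v)
	if v == "" || strings.HasPrefix(v, "v") {
		return v
	}
	return "v" + v
}

// extractVersionFromImage returns the tag of an image reference, or "" when
// there is none. A digest suffix ("@sha256:...") is dropped first, and a
// colon that belongs to a registry port is not treated as a tag separator.
func extractVersionFromImage(image string) string {
	if i := strings.Index(image, "@"); i >= 0 {
		image = image[:i]
	}
	lastColon := strings.LastIndex(image, ":")
	if lastColon < 0 || lastColon < strings.LastIndex(image, "/") {
		return ""
	}
	return image[lastColon+1:]
}

func main() {
	tag := extractVersionFromImage("nvcr.io/nvidia/gpu-operator:v24.6.0")
	fmt.Println(normalizeVersion(tag))
	fmt.Println(normalizeVersion("1.2.3"))
}
```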

Pattern 2: Count/Numeric Constraint Validator

func ValidateResourceCount(ctx *checks.ValidationContext, constraint recipe.Constraint) (string, bool, error) {
    // 1. Query cluster for resources (pods in a placeholder namespace, as an example)
    items, err := ctx.Clientset.CoreV1().Pods("my-namespace").List(ctx.Context, metav1.ListOptions{})
    if err != nil {
        return "", false, fmt.Errorf("failed to list resources: %w", err)
    }

    // 2. Count items
    count := len(items.Items)
    actualValue := fmt.Sprintf("%d", count)

    // 3. Evaluate numeric constraint
    passed, err := evaluateConstraint(actualValue, constraint.Value)

    return actualValue, passed, err
}

Pattern 3: Boolean/State Constraint Validator

func ValidateFeatureEnabled(ctx *checks.ValidationContext, constraint recipe.Constraint) (string, bool, error) {
    // 1. Check feature state (ConfigMap, annotation, etc.)
    enabled := checkFeatureState(ctx)
    actualValue := fmt.Sprintf("%t", enabled)

    // 2. Evaluate boolean constraint ("== true", "== false")
    passed, err := evaluateConstraint(actualValue, constraint.Value)

    return actualValue, passed, err
}

Pattern 4: Multi-Namespace Search

func findResourceAcrossNamespaces(ctx context.Context, clientset kubernetes.Interface,
    namespaces []string, names []string) (*appsv1.Deployment, error) {

    for _, ns := range namespaces {
        for _, name := range names {
            deployment, err := clientset.AppsV1().Deployments(ns).Get(
                ctx, name, metav1.GetOptions{},
            )
            if err == nil {
                return deployment, nil
            }
        }
    }

    return nil, fmt.Errorf("resource not found in any namespace")
}

Pattern 5: Performance Test with Job

func CheckPerformance(ctx *checks.ValidationContext) error {
    // 1. Create test Job
    job := &batchv1.Job{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "perf-test",
            Namespace: "aicr-validation",
        },
        Spec: batchv1.JobSpec{
            Template: corev1.PodTemplateSpec{
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{
                        {
                            Name:  "test",
                            Image: "nvcr.io/nvidia/nccl-tests:latest",
                            Args:  []string{"all_reduce_perf", "-b", "8", "-e", "256M"},
                        },
                    },
                    RestartPolicy: "Never",
                },
            },
        },
    }

    // 2. Create and wait for Job
    _, err := ctx.Clientset.BatchV1().Jobs("aicr-validation").Create(
        ctx.Context, job, metav1.CreateOptions{},
    )
    if err != nil {
        return err
    }

    // 3. Wait for completion
    // 4. Read logs
    // 5. Parse results

    return nil
}
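
Step 3 (waiting for completion) is usually a poll loop bounded by a context. The sketch below shows the general shape using only the standard library; `pollUntil` and `waitForAttempts` are hypothetical helpers, and the condition is a counter rather than a real Job status lookup. In a client-go codebase, `wait.PollUntilContextTimeout` from `k8s.io/apimachinery/pkg/util/wait` may be a better fit.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollUntil calls done immediately and then once per interval until it
// returns true, returns an error, or ctx is cancelled.
func pollUntil(ctx context.Context, interval time.Duration, done func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ok, err := done()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// waitForAttempts is a demo: the condition succeeds on the n-th call. In a
// real check, the condition would read the Job's status instead.
func waitForAttempts(n, timeoutMillis int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()
	calls := 0
	err := pollUntil(ctx, time.Millisecond, func() (bool, error) {
		calls++
		return calls >= n, nil
	})
	return calls, err
}

func main() {
	calls, err := waitForAttempts(3, 1000)
	fmt.Println(calls, err)

	_, err = waitForAttempts(1<<30, 20)
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```

Deriving the timeout from `ctx.Context` in a real check means the wait also stops when the validation run itself is cancelled.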

Adding Constraint Validators (New Approach)

For constraint validators, AICR provides an automated code generator that scaffolds all necessary files with proper structure. This ensures consistency and catches registration issues automatically.

Quick Start with Generator

1. Generate validator scaffolding:

make generate-validator ARGS="--constraint Deployment.my-app.version --phase deployment --description 'Validates my-app version'"

This creates three files with TODOs guiding implementation:

pkg/validator/checks/deployment/
├── my_app_version.go                    # Helper functions
├── my_app_version_test.go               # Unit tests
└── my_app_version_integration_test.go   # Integration test with registration

2. Implement helper functions:

Edit my_app_version.go and fill in the TODOs:

// getMyAppVersion queries the cluster to get the actual version
func getMyAppVersion(ctx context.Context, clientset kubernetes.Interface) (string, error) {
    // TODO: Implement version detection
    // Search common namespaces
    namespaces := []string{"my-app", "default", "kube-system"}
    names := []string{"my-app", "myapp"}

    for _, ns := range namespaces {
        for _, name := range names {
            deployment, err := clientset.AppsV1().Deployments(ns).Get(
                ctx, name, metav1.GetOptions{},
            )
            if err == nil {
                // Try version from label
                if version := deployment.Labels["app.kubernetes.io/version"]; version != "" {
                    return normalizeVersion(version), nil
                }
                // Try version from image tag
                if len(deployment.Spec.Template.Spec.Containers) > 0 {
                    return extractVersionFromImage(deployment.Spec.Template.Spec.Containers[0].Image), nil
                }
            }
        }
    }

    return "", fmt.Errorf("my-app not found")
}

// evaluateVersionConstraint evaluates version constraint expressions
func evaluateVersionConstraint(actualValue, constraintValue string) (bool, error) {
    // TODO: Implement constraint evaluation
    // Parse constraint (==, !=, >=, <=, >, <, ~=)
    // Compare versions using semver
    // Return pass/fail
    return false, fmt.Errorf("evaluateVersionConstraint not implemented")
}

3. Add unit test cases:

Edit my_app_version_test.go:

func TestGetMyAppVersion(t *testing.T) {
    tests := []struct {
        name       string
        deployment *appsv1.Deployment
        want       string
        wantErr    bool
    }{
        {
            name: "version from label",
            deployment: &appsv1.Deployment{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      "my-app",
                    Namespace: "default",
                    Labels: map[string]string{
                        "app.kubernetes.io/version": "v1.2.3",
                    },
                },
            },
            want:    "v1.2.3",
            wantErr: false,
        },
        // Add more test cases...
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            clientset := fake.NewSimpleClientset(tt.deployment)
            got, err := getMyAppVersion(context.Background(), clientset)

            if (err != nil) != tt.wantErr {
                t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
            }
            if got != tt.want {
                t.Errorf("got %v, want %v", got, tt.want)
            }
        })
    }
}

4. Integration test is auto-generated with registration:

The generator creates my_app_version_integration_test.go with proper registration:

func init() {
    checks.RegisterConstraintValidator(&checks.ConstraintValidator{
        Name:        "Deployment.my-app.version",
        Description: "Validates my-app version",
        TestName:    "TestMyAppVersion",
        Phase:       "deployment",
    })
}

func TestMyAppVersion(t *testing.T) {
    if testing.Short() {
        t.Skip("Skipping integration test in short mode")
    }

    runner, err := checks.NewTestRunner(t)
    if err != nil {
        t.Skipf("Skipping integration test (not in Kubernetes): %v", err)
    }

    // Get constraint from recipe
    constraint := runner.GetConstraint("deployment", "Deployment.my-app.version")
    if constraint == nil {
        t.Skip("Constraint not defined in recipe")
    }

    // Execute validation logic
    ctx := runner.Context()
    actualValue, err := getMyAppVersion(ctx.Context, ctx.Clientset)
    if err != nil {
        t.Fatalf("Failed to get my-app version: %v", err)
    }

    passed, err := evaluateVersionConstraint(actualValue, constraint.Value)
    if err != nil {
        t.Fatalf("Failed to evaluate constraint: %v", err)
    }

    if !passed {
        t.Errorf("Version constraint not satisfied: actual=%s, expected=%s",
            actualValue, constraint.Value)
    }
}

5. Run tests:

# Unit tests only (fast, no cluster needed)
make test

# Integration test in Job (requires cluster)
aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase deployment

6. Submit PR:

The CI pipeline automatically validates:

Code compiles
Unit tests pass
Registration is complete (enforced by pkg/validator/checks/registration_test.go)
Coverage meets threshold

How It Works

Recipe → Test Execution Flow

# Recipe
validation:
  deployment:
    constraints:
      - name: Deployment.my-app.version
        value: ">= v1.2.0"

↓

// Registry lookup (buildTestPattern in phases.go)
testName, _ := checks.GetTestNameForConstraint("Deployment.my-app.version")
// Returns: "TestMyAppVersion"

↓

// Pattern building
pattern := "^(TestMyAppVersion)$"

↓

# Job command
go test -v -json ./pkg/validator/checks/deployment -run '^(TestMyAppVersion)$'

↓

// Integration test runs with cluster access
func TestMyAppVersion(t *testing.T) {
    // Queries cluster, evaluates constraint
}

Architecture Principles

Key Insight: Integration tests ARE the validators. They contain the validation logic directly, not wrapper functions.

File Structure:

*_version.go - Helper functions (query cluster, evaluate constraints)
*_version_test.go - Unit tests with table-driven cases using fake clientset
*_version_integration_test.go - Integration test that runs in Jobs with real cluster access

Separation:

Unit tests: Fast, use fake clientset, test helper functions
Integration tests: Run in Jobs, use real cluster, test full constraint validation

Test Runner Pattern:

runner, err := checks.NewTestRunner(t)
// Provides:
// - runner.Context() - Kubernetes clientset, context, snapshot, recipe
// - runner.GetConstraint(phase, name) - Lookup constraint from recipe

Enforcement Mechanism

Three layers ensure validators are properly implemented:

1. Automated Registration Tests

pkg/validator/checks/registration_test.go runs in every make test and fails if:

Registered constraint has no test implementation
Integration test exists without registration
Registered check has no test implementation

func TestConstraintRegistrationCompleteness(t *testing.T) {
    constraintTests := checks.ListConstraintTests("")
    existingTests := findTestFunctions(t)  // AST parsing

    var missing []string
    for _, ct := range constraintTests {
        if !existingTests[ct.TestName] {
            missing = append(missing, ct.TestName)
        }
    }

    if len(missing) > 0 {
        t.Errorf("Registered constraints missing test implementations")
    }
}
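The test above calls `findTestFunctions`, which is described only as "AST parsing". A minimal sketch of that idea with the standard `go/parser` package is shown below. `testFunctionsIn` and the inline sample source are hypothetical; the real helper presumably reads the test files of each phase package rather than a string.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// testFunctionsIn returns the names of top-level Test* functions in src.
// Hypothetical helper for illustration only.
func testFunctionsIn(src string) (map[string]bool, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example_test.go", src, 0)
	if err != nil {
		return nil, err
	}
	found := make(map[string]bool)
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		// Skip non-function declarations and methods.
		if !ok || fn.Recv != nil {
			continue
		}
		if strings.HasPrefix(fn.Name.Name, "Test") {
			found[fn.Name.Name] = true
		}
	}
	return found, nil
}

// sample stands in for the contents of an integration test file.
const sample = `package deployment

import "testing"

func TestMyAppVersion(t *testing.T) {}

func helper() {}
`

func main() {
	tests, err := testFunctionsIn(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(tests["TestMyAppVersion"], tests["TestMissing"]) // true false
}
```

Because the source is parsed rather than compiled, a lookup like this works even when the integration tests cannot run locally.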

2. Code Generator

make generate-validator scaffolds all files correctly:

Includes registration automatically in integration test
Provides TODOs for implementation
Follows naming conventions

3. Documentation

Comprehensive development guide (this section)
Generated code has inline examples and TODOs
Contributing guide integration

What Gets Caught

Mistake	How It's Caught
Registered constraint without test	`TestConstraintRegistrationCompleteness` fails
Integration test without registration	`TestIntegrationTestsAreRegistered` fails
Wrong test function name	Pattern matching fails (test not found)
Forgot to implement helpers	Compilation fails (undefined functions)
Missing test cases	Coverage check fails

Testing Locally

# Unit tests only (skips integration tests)
go test ./pkg/validator/checks/deployment -short

# Run specific integration test (will skip if not in Kubernetes)
go test ./pkg/validator/checks/deployment -run TestMyAppVersion -v

# All tests including registration validation
make test

Using in Recipe

validation:
  deployment:
    constraints:
      - name: Deployment.my-app.version  # Must match the registered constraint name
        value: ">= v1.2.0"                # Constraint expression

Common Patterns

Multi-Strategy Version Detection

func getComponentVersion(ctx context.Context, clientset kubernetes.Interface) (string, error) {
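    // findDeployment is assumed to return (*appsv1.Deployment, error)
    deployment, err := findDeployment(ctx, clientset)
    if err != nil {
        return "", err
    }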

    // Strategy 1: Label
    if version := deployment.Labels["app.kubernetes.io/version"]; version != "" {
        return normalizeVersion(version), nil
    }

    // Strategy 2: Annotation
    if version := deployment.Annotations["version"]; version != "" {
        return normalizeVersion(version), nil
    }

    // Strategy 3: Image tag
    if len(deployment.Spec.Template.Spec.Containers) > 0 {
        image := deployment.Spec.Template.Spec.Containers[0].Image
        return extractVersionFromImage(image), nil
    }

    return "", fmt.Errorf("version not found")
}
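The examples in this README call `extractVersionFromImage` and `normalizeVersion` without showing them. The sketch below is one plausible shape for those helpers, not the package's actual code. It assumes that normalizing means trimming whitespace and adding a leading "v" to versions that start with a digit.

```go
package main

import (
	"fmt"
	"strings"
)

// extractVersionFromImage returns the tag of an image reference, or ""
// when the reference has no tag (untagged or digest-only).
// Illustrative sketch; the real helper may behave differently.
func extractVersionFromImage(image string) string {
	// Drop a digest suffix such as "@sha256:...".
	if i := strings.Index(image, "@"); i >= 0 {
		image = image[:i]
	}
	// The tag separator is a colon after the last slash; an earlier colon
	// belongs to a registry port (e.g. "localhost:5000/app").
	slash := strings.LastIndex(image, "/")
	colon := strings.LastIndex(image, ":")
	if colon <= slash {
		return ""
	}
	return image[colon+1:]
}

// normalizeVersion trims whitespace and adds a leading "v" to versions
// that start with a digit. Assumed behavior, for illustration only.
func normalizeVersion(v string) string {
	v = strings.TrimSpace(v)
	if v != "" && v[0] >= '0' && v[0] <= '9' {
		return "v" + v
	}
	return v
}

func main() {
	fmt.Println(extractVersionFromImage("nvcr.io/nvidia/gpu-operator:v24.6.0")) // v24.6.0
	fmt.Println(extractVersionFromImage("localhost:5000/app") == "")            // true
	fmt.Println(normalizeVersion(" 1.2.3 "))                                    // v1.2.3
}
```

Returning an empty string for untagged images lets the caller fall through to the "version not found" error instead of reporting a registry port as a version.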

Version Constraint Evaluation

func evaluateVersionConstraint(actualValue, constraintValue string) (bool, error) {
    // Parse operator and expected version
    // Supports: ==, !=, >=, <=, >, <, ~= (compatible)
    op, expected := parseConstraint(constraintValue)

    // Compare using semver
    actual, err := semver.Parse(actualValue)
    if err != nil {
        return false, fmt.Errorf("invalid actual version: %w", err)
    }

    expectedVer, err := semver.Parse(expected)
    if err != nil {
        return false, fmt.Errorf("invalid expected version: %w", err)
    }

    switch op {
    case ">=":
        return actual.GTE(expectedVer), nil
    case "==":
        return actual.Equal(expectedVer), nil
    // ... other operators
    default:
        return false, fmt.Errorf("unsupported operator %q", op)
    }
}
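`parseConstraint` is referenced above but not shown. The following standard-library sketch illustrates how an expression could be split into operator and version and then compared. It is not the parser in pkg/validator/constraint_expression.go. Treating a bare version as "==" and comparing only dotted numeric components (no pre-release handling) are simplifying assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseConstraint splits an expression such as ">= v1.2.0" into operator
// and version. A bare version is treated as "==" (assumption).
func parseConstraint(expr string) (op, version string) {
	expr = strings.TrimSpace(expr)
	// Two-character operators come first so ">=" is not read as ">".
	for _, candidate := range []string{">=", "<=", "==", "!=", "~=", ">", "<"} {
		if strings.HasPrefix(expr, candidate) {
			return candidate, strings.TrimSpace(expr[len(candidate):])
		}
	}
	return "==", expr
}

// compareVersions compares dotted numeric versions such as "v1.2.0" and
// returns -1, 0, or 1. Pre-release and build metadata are not handled.
func compareVersions(a, b string) (int, error) {
	pa := strings.Split(strings.TrimPrefix(a, "v"), ".")
	pb := strings.Split(strings.TrimPrefix(b, "v"), ".")
	for i := 0; i < len(pa) || i < len(pb); i++ {
		var x, y int // a missing component counts as 0
		var err error
		if i < len(pa) {
			if x, err = strconv.Atoi(pa[i]); err != nil {
				return 0, fmt.Errorf("invalid version %q: %w", a, err)
			}
		}
		if i < len(pb) {
			if y, err = strconv.Atoi(pb[i]); err != nil {
				return 0, fmt.Errorf("invalid version %q: %w", b, err)
			}
		}
		if x < y {
			return -1, nil
		}
		if x > y {
			return 1, nil
		}
	}
	return 0, nil
}

func main() {
	op, want := parseConstraint(">= v1.2.0")
	cmp, err := compareVersions("v1.10.0", want)
	fmt.Println(op, want, cmp, err) // >= v1.2.0 1 <nil>
}
```

Comparing components numerically matters here: a plain string comparison would rank "v1.10.0" below "v1.2.0".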

Benefits

1. Impossible to Forget Registration - Tests fail locally and in CI if registration is missing

2. Easy to Add New Validators - One command scaffolds everything correctly

3. Consistent Architecture - Generated code follows established patterns

4. Fast Feedback - Catches issues locally before PR

5. Self-Documenting - Generated code has examples and TODOs

6. CI Enforced - Can't merge without complete implementation

Troubleshooting

Test Wrapper Issues

Test Wrapper Not Found by go test

Symptom:

Job logs: testing: warning: no tests to run

Causes:

Test wrapper function doesn't follow naming convention
Test wrapper name does not start with Test followed by an uppercase letter (go test ignores functions such as Testfoo or testFoo)
Package doesn't compile

Solutions:

Check test function naming:

// Correct
func TestOperatorHealth(t *testing.T) { ... }

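// Wrong - "health" is not capitalized, so the name does not match TestOperatorHealth
func TestOperatorhealth(t *testing.T) { ... }

// Wrong - underscore separators, so the name does not match TestOperatorHealth
func Test_operator_health(t *testing.T) { ... }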

Naming rule: Convert kebab-case check name to PascalCase:

operator-health → TestOperatorHealth
nccl-bandwidth → TestNCCLBandwidth
expected-resources → TestExpectedResources
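The naming rule can be sketched as a small conversion function. `toTestName` is hypothetical and is not the package's derivation code. A plain conversion like this one yields `TestNcclBandwidth` for `nccl-bandwidth`, so an acronym-style name such as `TestNCCLBandwidth` presumably depends on setting the `TestName` field explicitly at registration.

```go
package main

import (
	"fmt"
	"strings"
)

// toTestName converts a kebab-case check name into a Go test function
// name by capitalizing the first letter of each hyphen-separated part.
// Hypothetical helper for illustration only.
func toTestName(checkName string) string {
	var b strings.Builder
	b.WriteString("Test")
	for _, part := range strings.Split(checkName, "-") {
		if part == "" {
			continue // tolerate doubled or trailing hyphens
		}
		b.WriteString(strings.ToUpper(part[:1]))
		b.WriteString(part[1:])
	}
	return b.String()
}

func main() {
	for _, name := range []string{"operator-health", "expected-resources", "nccl-bandwidth"} {
		fmt.Printf("%s -> %s\n", name, toTestName(name))
	}
	// operator-health -> TestOperatorHealth
	// expected-resources -> TestExpectedResources
	// nccl-bandwidth -> TestNcclBandwidth
}
```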

Verify test file compiles:

go test -c ./pkg/validator/checks/deployment/

Test Wrapper Fails: "check not found in registry"

Symptom:

Job logs: Check "operator-health" not found in registry

Causes:

Check not registered in init() function
Package not imported (init() never runs)
Check name mismatch between registration and runner call

Solutions:

Verify check registration:

// Must be in same package as check function
func init() {
    checks.RegisterCheck(&checks.Check{
        Name:  "operator-health",  // ← Must match exactly
        Phase: "deployment",
        Func:  CheckOperatorHealth,
    })
}

Verify test wrapper uses same name:

func TestOperatorHealth(t *testing.T) {
    runner, err := checks.NewTestRunner(t)
    if err != nil {
        t.Skipf("Skipping integration test: %v", err)
        return
    }
    runner.RunCheck("operator-health")  // ← Must match registration
}

Test Wrapper Fails: "failed to load validation context"

Symptom (during local testing):

SKIP: Skipping integration test (not in Kubernetes): failed to create in-cluster config

Expected behavior: Test should skip gracefully when not in Kubernetes.

Verify skip logic:

func TestMyCheck(t *testing.T) {
    runner, err := checks.NewTestRunner(t)
    if err != nil {
        // Should skip, not fail
        t.Skipf("Skipping integration test (not in Kubernetes): %v", err)
        return
    }
    runner.RunCheck("my-check")
}

Symptom (inside Job):

Job logs: Failed to create test runner: failed to load validation context:
          failed to read snapshot file: open /data/snapshot/snapshot.yaml: no such file

Causes:

Snapshot ConfigMap not mounted correctly
Volume mount path mismatch
ConfigMap doesn't exist

Solutions:

Check Job pod volumes:

kubectl get pod <pod-name> -n aicr-validation -o yaml | grep -A 10 volumes

Expected volumes:

# Pod spec: spec.volumes
volumes:
- name: snapshot
  configMap:
    name: <snapshot-configmap>
- name: recipe
  configMap:
    name: <recipe-configmap>
# Container spec: spec.containers[].volumeMounts
volumeMounts:
- name: snapshot
  mountPath: /data/snapshot
  readOnly: true

Verify ConfigMap exists:

kubectl get cm -n <namespace> <snapshot-configmap>
kubectl describe cm -n <namespace> <snapshot-configmap>

Check ConfigMap contains snapshot data:

kubectl get cm -n <namespace> <snapshot-configmap> -o jsonpath='{.data.snapshot\.yaml}' | head -20

Job Execution Issues

Job Not Found

Symptom:

Error: failed to wait for Job completion: Job "aicr-validation-deployment" not found

Causes:

Namespace doesn't exist
Job was cleaned up too quickly
Job creation failed silently

Solutions:

Check if namespace exists:

kubectl get namespace aicr-validation

Create namespace if missing:

kubectl create namespace aicr-validation

Check Job status:

kubectl get jobs -n aicr-validation
kubectl describe job aicr-validation-deployment -n aicr-validation

Job Failed to Start

Symptom:

Error: Job failed with status: ImagePullBackOff

Causes:

Image not accessible
Image tag doesn't exist
Registry authentication issues

Solutions:

Check Job events:

kubectl describe job aicr-validation-deployment -n aicr-validation

Check Pod status and events:

kubectl get pods -n aicr-validation
kubectl describe pod <pod-name> -n aicr-validation

Verify image exists:

docker pull ghcr.io/nvidia/aicr-validator:latest
# or
kubectl run test --image=ghcr.io/nvidia/aicr-validator:latest --rm -it --restart=Never -- /bin/sh

Job Pods Crash

Symptom:

Error: Job pod exited with code 1

Solutions:

View pod logs:

# Get pod name
kubectl get pods -n aicr-validation -l job-name=aicr-validation-deployment

# View logs
kubectl logs <pod-name> -n aicr-validation

# View logs of crashed pod
kubectl logs <pod-name> -n aicr-validation --previous

Common causes in logs:

panic: runtime error - Code bug in check
context deadline exceeded - Timeout
permission denied - RBAC issue
connection refused - Network/API issue

RBAC Permission Errors

Forbidden: User Cannot Access Resource

Symptom:

Error: failed to list GPU operator pods: pods is forbidden:
User "system:serviceaccount:aicr-validation:aicr-validator" cannot list resource "pods"
in API group "" in the namespace "gpu-operator"

Cause: ServiceAccount lacks necessary RBAC permissions

Solutions:

Check current permissions:

kubectl auth can-i list pods --namespace=gpu-operator \
  --as=system:serviceaccount:aicr-validation:aicr-validator

View current Role/RoleBinding:

kubectl get role aicr-validator -n aicr-validation -o yaml
kubectl get rolebinding aicr-validator -n aicr-validation -o yaml

Fix: Create proper RBAC resources:

# role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aicr-validator
rules:
  # Deployment phase
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments", "daemonsets", "statefulsets"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list"]

  # Performance phase
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]

  # Conformance phase
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["get", "list"]

Apply RBAC:

kubectl apply -f role.yaml
kubectl create clusterrolebinding aicr-validator \
  --clusterrole=aicr-validator \
  --serviceaccount=aicr-validation:aicr-validator

RBAC for Cross-Namespace Access

Issue: Check needs to access resources in gpu-operator namespace but only has permissions in aicr-validation

Solution: Use ClusterRole instead of Role:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aicr-validator
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list"]
  # Add other rules...

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aicr-validator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: aicr-validator
subjects:
  - kind: ServiceAccount
    name: aicr-validator
    namespace: aicr-validation

Timeout Problems

Context Deadline Exceeded

Symptom:

Error: context deadline exceeded
Check: operator-health
Duration: 2m0s

Causes:

Check takes too long to execute
Kubernetes API is slow
External resource is unresponsive

Solutions:

Increase timeout in validator:

v := validator.New(  // Avoid shadowing the validator package
    validator.WithTimeout(10 * time.Minute),  // Increase from default 2m
)

Set a timeout for a specific check:

func CheckOperatorHealth(ctx *checks.ValidationContext) error {
    // Derive a per-check context. A derived context cannot outlive its parent's
    // deadline, so the 5 minutes only take effect if ctx.Context allows at least
    // that long (raise the validator timeout as well).
    checkCtx, cancel := context.WithTimeout(ctx.Context, 5*time.Minute)
    defer cancel()

    pods, err := ctx.Clientset.CoreV1().Pods("gpu-operator").List(
        checkCtx,  // Use the per-check context
        metav1.ListOptions{LabelSelector: "app=gpu-operator"},
    )
    // ...
}

Add context cancellation checks for long operations:

func LongRunningCheck(ctx *checks.ValidationContext) error {
    for i := 0; i < 1000; i++ {
        // Check if context is cancelled
        select {
        case <-ctx.Context.Done():
            return ctx.Context.Err()  // Return context error
        default:
            // Continue processing
        }

        // Do work...
    }
    return nil
}

Job Timeout

Symptom:

Error: Job did not complete within timeout period
Job: aicr-validation-performance
Timeout: 5m0s

Solutions:

Increase Job timeout:

config := agent.Config{
    Timeout: 15 * time.Minute,  // Increase for performance tests
}

Check if Job is actually running:

kubectl get pods -n aicr-validation -l job-name=aicr-validation-performance
kubectl logs <pod-name> -n aicr-validation --follow

Check Job status:

kubectl describe job aicr-validation-performance -n aicr-validation

Check Registration Issues

Check Not Found

Symptom:

Error: check "operator-health" not registered

Causes:

Package not imported
init() not running
Check name mismatch

Solutions:

Verify check is registered:

func TestCheckRegistered(t *testing.T) {
    check, ok := checks.GetCheck("operator-health")
    if !ok {
        t.Fatal("Check not registered")
    }
    assert.Equal(t, "operator-health", check.Name)
}

Ensure package is imported:

// Import with blank identifier to trigger init()
import _ "github.com/NVIDIA/aicr/pkg/validator/checks/deployment"

List all registered checks:

func TestListChecks(t *testing.T) {
    allChecks := checks.ListChecks("")
    t.Logf("Registered checks: %d", len(allChecks))
    for _, check := range allChecks {
        t.Logf("  - %s (%s)", check.Name, check.Phase)
    }
}

Constraint Validator Not Found

Symptom:

Error: no validator found for constraint "Deployment.gpu-operator.version"

Solutions:

Check if validator is registered:

# Run test to list validators
go test -v ./pkg/validator/checks/... -run TestList

Verify import:

import _ "github.com/NVIDIA/aicr/pkg/validator/checks/deployment"

Check pattern match:

func TestValidatorRegistration(t *testing.T) {
    validator, ok := checks.GetConstraintValidator("Deployment.gpu-operator.version")
    if !ok {
        t.Fatal("Validator not registered")
    }
    assert.Equal(t, "Deployment.gpu-operator.version", validator.Pattern)
}

Duplicate Registration Panic

Symptom:

panic: constraint validator for pattern "Deployment.gpu-operator.version" is already registered

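Cause: The same pattern is registered twice. Go runs each package's init() once per binary no matter how many times the package is imported, so the usual culprit is two RegisterConstraintValidator calls with the same name (for example, in a copied file).

Solution: Remove the duplicate registration, and keep the blank imports of check packages in one place, typically in main: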

// cmd/aicr/main.go
import (
    _ "github.com/NVIDIA/aicr/pkg/validator/checks/deployment"  // Once here
    _ "github.com/NVIDIA/aicr/pkg/validator/checks/performance"
    _ "github.com/NVIDIA/aicr/pkg/validator/checks/conformance"
)
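To see why a duplicate name panics, consider the minimal registry sketch below. The `registry` type and `tryRegister` helper are hypothetical and much simpler than registry.go; only the panic message is copied from the symptom above.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a simplified stand-in for a validator registry.
type registry struct {
	mu    sync.Mutex
	names map[string]bool
}

// register records a name and panics if it is already present.
func (r *registry) register(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.names == nil {
		r.names = make(map[string]bool)
	}
	if r.names[name] {
		panic(fmt.Sprintf("constraint validator for pattern %q is already registered", name))
	}
	r.names[name] = true
}

// tryRegister converts the panic into an error so the demo can continue.
func tryRegister(r *registry, name string) (err error) {
	defer func() {
		if rec := recover(); rec != nil {
			err = fmt.Errorf("%v", rec)
		}
	}()
	r.register(name)
	return nil
}

func main() {
	r := &registry{}
	fmt.Println(tryRegister(r, "Deployment.gpu-operator.version")) // <nil>
	fmt.Println(tryRegister(r, "Deployment.gpu-operator.version")) // already-registered error
}
```

Searching the repository for the constraint name (for example with `grep -rn`) is usually the quickest way to find the second registration.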

Constraint Evaluation Errors

Invalid Constraint Expression

Symptom:

Error: invalid constraint expression: cannot parse expected version
Constraint: Deployment.gpu-operator.version
Value: ">= invalid-version"

Solution: Fix constraint value in recipe:

# Wrong
constraints:
  - name: Deployment.gpu-operator.version
    value: ">= invalid-version"

# Correct
constraints:
  - name: Deployment.gpu-operator.version
    value: ">= v24.6.0"

Version Parse Error

Symptom:

Error: cannot parse actual version
Actual: "latest"
Expected: ">= v24.6.0"

Cause: Actual value is not a valid version string

Solution: Fix validator to return valid version:

func getVersion(deployment *appsv1.Deployment) string {
    version := deployment.Labels["app.kubernetes.io/version"]
    containers := deployment.Spec.Template.Spec.Containers
    // Guard the index so a Deployment without containers cannot panic
    if version == "latest" && len(containers) > 0 {
        // Don't return "latest" - try other strategies
        version = extractVersionFromImage(containers[0].Image)
    }
    return normalizeVersion(version)
}

Constraint Always Fails

Symptom:

Constraint: OS.distribution
Expected: "ubuntu"
Actual: "Ubuntu"
Status: FAIL

Cause: Case sensitivity in string comparison

Solution: Normalize strings in validator:

func ValidateOSDistribution(ctx *checks.ValidationContext, constraint recipe.Constraint) (string, bool, error) {
    actual := strings.ToLower(getOSDistribution(ctx))  // Normalize to lowercase
    expected := strings.ToLower(constraint.Value)

    passed := actual == expected
    return actual, passed, nil
}
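The comparison itself can also be written with `strings.EqualFold`, which avoids allocating lowercased copies. The sketch below is standalone; `matchesIgnoringCase` is a hypothetical helper, and trimming surrounding whitespace is an added assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesIgnoringCase reports whether two values are equal when case and
// surrounding whitespace are ignored.
func matchesIgnoringCase(actual, expected string) bool {
	return strings.EqualFold(strings.TrimSpace(actual), strings.TrimSpace(expected))
}

func main() {
	fmt.Println(matchesIgnoringCase("Ubuntu", "ubuntu")) // true
	fmt.Println(matchesIgnoringCase("rhel", "ubuntu"))   // false
}
```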

Kubernetes Client Errors

Cannot Connect to Cluster

Symptom:

Error: failed to create Kubernetes client: unable to load kubeconfig

Solutions:

Check kubeconfig:

kubectl cluster-info
echo $KUBECONFIG
ls -la ~/.kube/config

Test connectivity:

kubectl get nodes

Verify in code:

clientset, err := k8sclient.GetKubeClient()
if err != nil {
    log.Fatalf("Failed to create k8s client: %v", err)
}

// Test connection
nodes, err := clientset.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
if err != nil {
    log.Fatalf("Cannot connect to cluster: %v", err)
}
log.Printf("Connected to cluster with %d nodes", len(nodes.Items))

Resource Not Found

Symptom:

Error: deployments.apps "gpu-operator" not found

Causes:

Resource doesn't exist
Wrong namespace
Wrong name

Solutions:

Check if resource exists:

kubectl get deployments -A | grep gpu-operator

Use multi-namespace search in validator:

func findGPUOperator(ctx context.Context, clientset kubernetes.Interface) (*appsv1.Deployment, error) {
    namespaces := []string{"gpu-operator", "nvidia-gpu-operator", "kube-system"}
    names := []string{"gpu-operator", "nvidia-gpu-operator"}

    for _, ns := range namespaces {
        for _, name := range names {
            deployment, err := clientset.AppsV1().Deployments(ns).Get(
                ctx, name, metav1.GetOptions{},
            )
            if err == nil {
                return deployment, nil
            }
        }
    }

    return nil, fmt.Errorf("GPU operator not found in any common namespace")
}

Test Mode vs Production

Tests Pass but Production Fails

Symptom:

Unit tests pass with fake clientset
Production validation fails with real cluster

Causes:

Fake clientset doesn't match real cluster state
Fake clientset does not enforce RBAC, so permission errors only surface in production
Timing issues (context timeout)

Solutions:

Test with real cluster:

# Integration test against real cluster
export USE_REAL_CLUSTER=true
go test -v ./pkg/validator/checks/deployment/... -run Integration

Add integration tests:

func TestOperatorHealthIntegration(t *testing.T) {
    if os.Getenv("USE_REAL_CLUSTER") != "true" {
        t.Skip("Skipping integration test")
    }

    clientset, err := k8sclient.GetKubeClient()
    require.NoError(t, err)

    ctx := &checks.ValidationContext{
        Context:   context.Background(),
        Clientset: clientset,
    }

    err = CheckOperatorHealth(ctx)
    assert.NoError(t, err)
}

Validation Passes in Test Mode

Symptom:

WARN Job deployment failed (likely test mode), returning skeleton check
Check: operator-health
Status: pass
Reason: skipped - Job deployment failed (test mode)

Cause: Test environment doesn't have namespace, so checks are skipped

Solutions:

Create test namespace:

kubectl create namespace aicr-validation

Or run tests with fake clientset:

func TestWithFakeCluster(t *testing.T) {
    deployment := createTestDeployment()
    clientset := fake.NewSimpleClientset(deployment)

    ctx := &checks.ValidationContext{
        Context:   context.Background(),
        Clientset: clientset,
    }

    // Test directly against check function, not Job execution
    err := CheckOperatorHealth(ctx)
    assert.NoError(t, err)
}

Debugging Techniques

Enable Debug Logging

import "log/slog"

func init() {
    slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
        Level: slog.LevelDebug,
    })))
}

View Job Logs in Real-Time

# Watch for new Jobs
kubectl get jobs -n aicr-validation -w

# Stream logs from running Job
POD=$(kubectl get pods -n aicr-validation -l job-name=aicr-validation-deployment -o name | head -1)
kubectl logs -n aicr-validation $POD --follow

Check Job Results ConfigMap

# List result ConfigMaps
kubectl get configmaps -n aicr-validation

# View specific result
kubectl get configmap aicr-validation-deployment-result -n aicr-validation -o yaml

Debug Check Function Directly

func TestDebugCheck(t *testing.T) {
    // Set up test data
    deployment := createGPUOperatorDeployment("gpu-operator", "gpu-operator",
        map[string]string{"app.kubernetes.io/version": "v24.6.0"},
        "nvcr.io/nvidia/gpu-operator:v24.6.0")

    clientset := fake.NewSimpleClientset(deployment)

    ctx := &checks.ValidationContext{
        Context:   context.Background(),
        Clientset: clientset,
    }

    // Call check directly (no Job)
    err := CheckOperatorHealth(ctx)
    if err != nil {
        t.Logf("Check failed: %v", err)
        t.Fail()
    }
}

Trace Constraint Evaluation

func ValidateGPUOperatorVersion(ctx *checks.ValidationContext, constraint recipe.Constraint) (string, bool, error) {
    slog.Debug("Starting version validation",
        "constraint", constraint.Name,
        "expectedValue", constraint.Value)

    version, err := getGPUOperatorVersion(ctx.Context, ctx.Clientset)
    slog.Debug("Detected version", "version", version, "error", err)

    if err != nil {
        return "", false, err
    }

    passed, err := evaluateVersionConstraint(version, constraint.Value)
    slog.Debug("Constraint evaluation result",
        "version", version,
        "constraint", constraint.Value,
        "passed", passed,
        "error", err)

    return version, passed, err
}

Use kubectl debug

# Debug a running Job pod
kubectl debug -n aicr-validation <pod-name> -it --image=busybox

# Check environment and mounts
env | grep AICR
ls -la /data/snapshot
cat /data/snapshot/snapshot.yaml

Collect Diagnostic Information

When reporting issues, include:

# Cluster info
kubectl version
kubectl get nodes

# Validation namespace
kubectl get all -n aicr-validation

# Job details
kubectl describe job <job-name> -n aicr-validation

# Pod logs
kubectl logs <pod-name> -n aicr-validation

# RBAC
kubectl auth can-i --list --as=system:serviceaccount:aicr-validation:aicr-validator

Common kubectl Commands

# List all validation Jobs
kubectl get jobs -n aicr-validation

# Delete failed Jobs (assumes Jobs carry a status=failed label; Kubernetes does not add one by default)
kubectl delete job -n aicr-validation -l status=failed

# Clean up validation namespace
kubectl delete namespace aicr-validation

# Re-create validation namespace
kubectl create namespace aicr-validation

# View events
kubectl get events -n aicr-validation --sort-by='.lastTimestamp'

Migration from Inline Validation

Before (inline constraint evaluation):

// phases.go - deployment phase
for _, constraint := range recipe.Validation.Deployment.Constraints {
    result := evaluateConstraint(constraint, snapshot) // Wrong - no cluster access
}

After (Job-based constraint validation):

// deployment/constraints.go
func ValidateDeploymentConstraint(ctx *ValidationContext, constraint recipe.Constraint) (string, bool, error) {
    // Correct - has cluster access via ctx.Clientset
    deployment, err := ctx.Clientset.AppsV1().Deployments(...).Get(...)
    // ... handle err, extract the actual value from deployment, evaluate the constraint
}

References

Summary

Task	File Location	Key Function	Registry Call
Add Check	`pkg/validator/checks/<phase>/*.go`	`func(ctx *ValidationContext) error`	`RegisterCheck()`
Add Constraint	`pkg/validator/checks/<phase>/constraints.go`	`func(ctx *ValidationContext, constraint recipe.Constraint) (string, bool, error)`	`RegisterConstraintValidator()`

Both use init() for self-registration and are discovered automatically at runtime.

Key Files

registry.go: Check and constraint validator registration infrastructure
runner.go: Test runner for Job execution
deployment/operator_health_check.go: Example check implementation
deployment/constraints.go: Example constraint validator implementation
pkg/validator/constraint_expression.go: Constraint expression parser

Documentation

Index

func GetTestNameForCheck(checkName string) (string, bool)
func GetTestNameForConstraint(constraintName string) (string, bool)
func RegisterCheck(check *Check)
func RegisterConstraintValidator(validator *ConstraintValidator)
type Artifact
- func DecodeArtifact(encoded string) (*Artifact, error)
- func (a Artifact) Encode() (string, error)
type ArtifactCollector
- func NewArtifactCollector() *ArtifactCollector
- func (c *ArtifactCollector) Drain() []Artifact
- func (c *ArtifactCollector) Record(label, data string) error
type Check
- func GetCheck(name string) (*Check, bool)
- func GetCheckByTestName(testName string) (*Check, bool)
- func ListChecks(phase string) []*Check
- func ResolveCheck(name string) (*Check, bool)
type CheckFunc
type ConstraintValidator
- func GetConstraintValidator(constraintName string) (*ConstraintValidator, bool)
- func ListConstraintTests(phase string) []*ConstraintValidator
- func ListConstraintValidators() []*ConstraintValidator
type ConstraintValidatorFunc
type TestRunner
- func NewTestRunner(t *testing.T) (*TestRunner, error)
- func (r *TestRunner) Cancel()
- func (r *TestRunner) Context() *ValidationContext
- func (r *TestRunner) GetConstraint(phase, constraintName string) *recipe.Constraint
- func (r *TestRunner) HasCheck(phase, checkName string) bool
- func (r *TestRunner) RunCheck(checkName string)
type ValidationContext
- func LoadValidationContext() (*ValidationContext, context.CancelFunc, error)

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetTestNameForCheck

func GetTestNameForCheck(checkName string) (string, bool)

GetTestNameForCheck looks up which test function validates a check. Returns the test name and true if found, empty string and false otherwise.

func GetTestNameForConstraint

func GetTestNameForConstraint(constraintName string) (string, bool)

GetTestNameForConstraint looks up which test function validates a constraint. Returns the test name and true if found, empty string and false otherwise.

func RegisterCheck

func RegisterCheck(check *Check)

RegisterCheck adds a check to the registry. This should be called from init() functions in check packages. If TestName is empty, it's derived from the Name automatically.

func RegisterConstraintValidator

func RegisterConstraintValidator(validator *ConstraintValidator)

RegisterConstraintValidator adds a constraint validator to the registry. This should be called from init() functions in constraint validator packages. If TestName is empty, it's derived from the Name automatically.

Types

type Artifact (added in v0.7.8)

type Artifact struct {
	// Label is the human-readable title (e.g., "DRA Driver Pods").
	Label string `json:"label"`

	// Data is the captured content (command output, metric text, YAML, etc.).
	Data string `json:"data"`
}

Artifact represents a captured piece of diagnostic evidence from a conformance check. Each artifact has a human-readable label and a data payload (kubectl output, metric samples, resource YAML, etc.) that is rendered as a fenced code block in evidence markdown.

func DecodeArtifact (added in v0.7.8)

func DecodeArtifact(encoded string) (*Artifact, error)

DecodeArtifact decodes a base64-encoded JSON artifact string.

func (Artifact) Encode (added in v0.7.8)

func (a Artifact) Encode() (string, error)

Encode returns a base64-encoded JSON representation of the artifact, suitable for emission via t.Logf("ARTIFACT:%s", encoded).
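The encode/decode round trip described here can be sketched with the standard library. The `artifact` type below only mirrors the documented fields, and the use of `base64.StdEncoding` is an assumption; the package's actual encoding variant is not stated in this documentation.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// artifact mirrors the documented Artifact fields for illustration only.
type artifact struct {
	Label string `json:"label"`
	Data  string `json:"data"`
}

// encode returns base64-encoded JSON (StdEncoding is an assumption).
func encode(a artifact) (string, error) {
	raw, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(raw), nil
}

// decode reverses encode.
func decode(encoded string) (*artifact, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, fmt.Errorf("decode base64: %w", err)
	}
	var a artifact
	if err := json.Unmarshal(raw, &a); err != nil {
		return nil, fmt.Errorf("decode JSON: %w", err)
	}
	return &a, nil
}

func main() {
	encoded, err := encode(artifact{Label: "DRA Driver Pods", Data: "pod-a Running\n"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("ARTIFACT:%s\n", encoded) // single log line, safe for multi-line data

	decoded, err := decode(encoded)
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded.Label) // DRA Driver Pods
}
```

Base64 keeps multi-line payloads on one log line, which makes the `ARTIFACT:` prefix easy to scan for in test output.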

type ArtifactCollector (added in v0.7.8)

type ArtifactCollector struct {
	// contains filtered or unexported fields
}

ArtifactCollector is a thread-safe accumulator for artifacts within a single check execution. It enforces per-artifact size limits and per-check count limits.

func NewArtifactCollector ¶ added in v0.7.8

func NewArtifactCollector() *ArtifactCollector

NewArtifactCollector creates a new empty artifact collector.

func (*ArtifactCollector) Drain ¶ added in v0.7.8

func (c *ArtifactCollector) Drain() []Artifact

Drain returns the collected artifacts and resets the internal list. Returns nil if no artifacts were recorded.

func (*ArtifactCollector) Record ¶ added in v0.7.8

func (c *ArtifactCollector) Record(label, data string) error

Record adds a labeled artifact. Data exceeding defaults.ArtifactMaxDataSize is truncated. Returns an error if the per-check artifact count limit is reached.
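
The Record and Drain behavior described above can be illustrated with a small stand-in. This is a sketch under stated assumptions: the collector type, the maxDataSize and maxArtifacts values, and the error text are hypothetical, and the real limits come from the defaults package.

```go
package main

import (
	"errors"
	"sync"
)

// Hypothetical limits chosen for illustration only.
const (
	maxDataSize  = 16 // stand-in for defaults.ArtifactMaxDataSize
	maxArtifacts = 2  // stand-in for the per-check count limit
)

// artifact is a local stand-in for the documented Artifact struct.
type artifact struct {
	Label string
	Data  string
}

// collector is a mutex-guarded accumulator, mirroring the documented
// ArtifactCollector behavior.
type collector struct {
	mu    sync.Mutex
	items []artifact
}

// record truncates oversized data and rejects artifacts past the count limit.
func (c *collector) record(label, data string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.items) >= maxArtifacts {
		return errors.New("artifact count limit reached")
	}
	if len(data) > maxDataSize {
		// Byte-based truncation; this can split a multi-byte rune.
		data = data[:maxDataSize]
	}
	c.items = append(c.items, artifact{Label: label, Data: data})
	return nil
}

// drain returns the collected artifacts and resets the list.
// It returns nil when nothing has been recorded.
func (c *collector) drain() []artifact {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.items
	c.items = nil
	return out
}
```

Note the asymmetry in the documented contract: oversized data is silently truncated, while exceeding the count limit is reported as an error, so callers should check the value returned by Record.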

type Check ¶

type Check struct {
	// Name is the unique identifier for this check (e.g., "operator-health")
	Name string

	// Description explains what this check validates
	Description string

	// Phase indicates which validation phase this check belongs to
	Phase string // "readiness", "deployment", "performance", "conformance"

	// Func is the check implementation
	Func CheckFunc

	// TestName is the Go test function name (e.g., "TestCheckOperatorHealth")
	// If empty, derived from Name automatically
	TestName string

	// RequirementID is the CNCF conformance requirement ID (e.g., "dra_support").
	// Empty for checks that are not CNCF submission requirements.
	RequirementID string

	// EvidenceTitle is the human-readable title for evidence documents (e.g., "DRA Support").
	EvidenceTitle string

	// EvidenceDescription is a one-paragraph description for evidence documents.
	EvidenceDescription string

	// EvidenceFile is the output filename for evidence (e.g., "dra-support.md").
	// Multiple checks can share the same EvidenceFile (combined evidence).
	// Empty means this check produces no evidence file.
	EvidenceFile string

	// SubmissionRequirement indicates this check maps to a CNCF submission requirement.
	// Only checks with this set to true appear in the submission evidence index.
	SubmissionRequirement bool
}

Check represents a registered validation check.

func GetCheck ¶

func GetCheck(name string) (*Check, bool)

GetCheck retrieves a registered check by name.

func GetCheckByTestName ¶ added in v0.7.7

func GetCheckByTestName(testName string) (*Check, bool)

GetCheckByTestName does a reverse lookup: Go test name → Check.

func ListChecks ¶

func ListChecks(phase string) []*Check

ListChecks returns all registered checks, optionally filtered by phase.

func ResolveCheck ¶ added in v0.7.7

func ResolveCheck(name string) (*Check, bool)

ResolveCheck tries check name first, then test name. This handles the identity mismatch where CheckResult.Name can be either a check registry name (--no-cluster path) or a Go test name (normal cluster runs).
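
The two-step lookup can be sketched as follows. The check type, the two maps, and the register helper are local stand-ins for illustration; the real registry's internals are not shown in this documentation, and this sketch omits any locking.

```go
package main

// check is a local stand-in holding only the two identifiers that matter here.
type check struct {
	Name     string // registry name, e.g. "operator-health"
	TestName string // Go test function name, e.g. "TestOperatorHealth"
}

// Two indexes over the same checks (hypothetical layout).
var (
	byName     = map[string]*check{}
	byTestName = map[string]*check{}
)

// register indexes a check under both identifiers.
func register(c *check) {
	byName[c.Name] = c
	byTestName[c.TestName] = c
}

// resolve tries the registry name first and falls back to the test name,
// so either form of a result name maps to the same check.
func resolve(name string) (*check, bool) {
	if c, ok := byName[name]; ok {
		return c, true
	}
	c, ok := byTestName[name]
	return c, ok
}
```

With this shape, resolve("operator-health") and resolve("TestOperatorHealth") return the same pointer, which is the property a result consumer needs when it cannot tell which naming path produced the result.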

type CheckFunc ¶

type CheckFunc func(ctx *ValidationContext) error

CheckFunc is the function signature for a validation check. It validates a specific aspect of the cluster and reports the result through its return value: nil on success, a non-nil error on failure.

type ConstraintValidator ¶

type ConstraintValidator struct {
	// Name is the unique identifier for this constraint (e.g., "Deployment.gpu-operator.version")
	Name string

	// Description explains what constraints this validator handles
	Description string

	// Func is the validator implementation
	Func ConstraintValidatorFunc

	// TestName is the Go test function name (e.g., "TestGPUOperatorVersion")
	// If empty, derived from Name automatically
	TestName string

	// Phase indicates which validation phase (deployment, performance, conformance)
	Phase string
}

ConstraintValidator represents a registered constraint validator.

func GetConstraintValidator ¶

func GetConstraintValidator(constraintName string) (*ConstraintValidator, bool)

GetConstraintValidator retrieves a constraint validator by name.

func ListConstraintTests ¶

func ListConstraintTests(phase string) []*ConstraintValidator

ListConstraintTests returns all registered constraint validators, optionally filtered by phase.

func ListConstraintValidators ¶

func ListConstraintValidators() []*ConstraintValidator

ListConstraintValidators returns all registered constraint validators.

type ConstraintValidatorFunc ¶

type ConstraintValidatorFunc func(ctx *ValidationContext, constraint recipe.Constraint) (actual string, passed bool, err error)

ConstraintValidatorFunc is the function signature for constraint validation. It evaluates whether a constraint is satisfied against the cluster state. Returns the actual value found, whether it passed, and any error.

type TestRunner ¶

type TestRunner struct {
	// contains filtered or unexported fields
}

TestRunner provides infrastructure for running validation checks as Go tests inside Kubernetes Jobs.

The test runner bridges the gap between Go's test framework and the AICR validation system:

Loads ValidationContext from Job environment (snapshot, K8s client, recipe)
Looks up registered checks by name
Executes checks and reports results via testing.T

Example usage in test wrappers:

func TestOperatorHealth(t *testing.T) {
    runner, err := checks.NewTestRunner(t)
    if err != nil {
        t.Skipf("Skipping integration test (not in Kubernetes): %v", err)
        return
    }
    defer runner.Cancel() // Clean up context when test completes
    runner.RunCheck("operator-health")
}

func NewTestRunner ¶

func NewTestRunner(t *testing.T) (*TestRunner, error)

NewTestRunner creates a test runner by loading ValidationContext from the Job environment. Expected environment variables:

AICR_SNAPSHOT_PATH: Path to mounted snapshot file (default: /data/snapshot/snapshot.yaml)
AICR_RECIPE_DATA: Optional JSON-encoded recipe metadata

IMPORTANT: Callers should call Cancel() when done to release resources.

func (*TestRunner) Cancel ¶

func (r *TestRunner) Cancel()

Cancel releases resources associated with the test runner. Should be called via defer after NewTestRunner succeeds. Drains any collected artifacts and emits them as structured test output before canceling the context.

func (*TestRunner) Context ¶

func (r *TestRunner) Context() *ValidationContext

Context returns the validation context for direct access. Use this when you need the Kubernetes client, snapshot, or other context data.

func (*TestRunner) GetConstraint ¶

func (r *TestRunner) GetConstraint(phase, constraintName string) *recipe.Constraint

GetConstraint retrieves a constraint by name from the recipe for the given phase. Returns nil if the recipe doesn't contain the constraint. Integration tests use this to obtain the constraint values to validate against.

func (*TestRunner) HasCheck ¶

func (r *TestRunner) HasCheck(phase, checkName string) bool

HasCheck checks if a check is enabled in the recipe for a given phase. Returns true if the check is listed in the recipe's checks for that phase.

func (*TestRunner) RunCheck ¶

func (r *TestRunner) RunCheck(checkName string)

RunCheck executes a registered validation check by name. The check must be registered via RegisterCheck() (usually in an init() function).

type ValidationContext ¶

type ValidationContext struct {
	// Context for cancellation and timeouts
	Context context.Context

	// Snapshot contains captured cluster state (hardware, OS, etc.)
	Snapshot *snapshotter.Snapshot

	// Namespace is the namespace where the validation is running
	Namespace string

	// Clientset provides Kubernetes API access for live cluster queries
	Clientset kubernetes.Interface

	// RESTConfig provides Kubernetes API access for cluster queries (used for e.g. remote command execution)
	RESTConfig *rest.Config

	// DynamicClient provides dynamic Kubernetes API access for reading custom resources (CRDs).
	// If nil, checks should create one from RESTConfig. Set this in unit tests for injection.
	DynamicClient dynamic.Interface

	// RecipeData contains recipe metadata that may be needed for validation
	RecipeData map[string]interface{}

	// Recipe contains the full recipe with validation constraints
	// Only available when running inside Jobs (not in unit tests)
	Recipe *recipe.RecipeResult

	// Artifacts collects diagnostic evidence during check execution.
	// Nil when artifact capture is not active (e.g., non-conformance phases).
	// Checks should nil-check before recording.
	Artifacts *ArtifactCollector
}

ValidationContext provides runtime context for checks and constraints.
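
Because Artifacts is nil outside artifact-capturing phases, a check should guard every Record call. The sketch below shows that guard inside a function shaped like CheckFunc. All types here are local stand-ins, and checkNamespaceSet is a hypothetical check invented for illustration; treating a failed Record as a check failure is also just one possible choice.

```go
package main

import (
	"errors"
	"fmt"
)

// artifactCollector is a local stand-in exposing only Record.
type artifactCollector struct {
	labels []string
}

// Record remembers the label; the real collector also stores and limits data.
func (c *artifactCollector) Record(label, data string) error {
	c.labels = append(c.labels, label)
	return nil
}

// validationContext is a trimmed stand-in for ValidationContext.
type validationContext struct {
	Namespace string
	Artifacts *artifactCollector // nil when artifact capture is not active
}

// checkNamespaceSet is a hypothetical check: nil means pass,
// a non-nil error means fail.
func checkNamespaceSet(ctx *validationContext) error {
	if ctx.Namespace == "" {
		return errors.New("validation namespace is empty")
	}
	// Guard against a nil collector before recording evidence.
	if ctx.Artifacts != nil {
		if err := ctx.Artifacts.Record("Namespace", ctx.Namespace); err != nil {
			return fmt.Errorf("record artifact: %w", err)
		}
	}
	return nil
}
```

The same function then works unchanged in phases with and without artifact capture, and in unit tests that construct a context without a collector.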

func LoadValidationContext ¶

func LoadValidationContext() (*ValidationContext, context.CancelFunc, error)

LoadValidationContext loads the validation context for running checks. Works both inside Kubernetes Jobs (in-cluster config) and locally (KUBECONFIG).

Kubernetes client discovery: KUBECONFIG env → ~/.kube/config → in-cluster service account. Namespace resolution: service account file → AICR_VALIDATION_NAMESPACE env → "default".

Environment variables:

KUBECONFIG: Path to kubeconfig file (for local development)
AICR_VALIDATION_NAMESPACE: Validation namespace (for local development)
AICR_SNAPSHOT_PATH: Path to snapshot file (default: /data/snapshot/snapshot.yaml)
AICR_RECIPE_PATH: Path to recipe file (default: /data/recipe/recipe.yaml)
AICR_RECIPE_DATA: Optional JSON-encoded recipe metadata

IMPORTANT: The caller is responsible for calling the returned cancel function when the validation context is no longer needed.
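
The namespace resolution order (service account file, then AICR_VALIDATION_NAMESPACE, then "default") can be sketched as below. This is an illustration, not the package's code: resolveNamespace and its parameters are hypothetical, and the environment lookup is injected as a function so the fallback order can be exercised without mutating the process environment. The file path is the standard location where Kubernetes mounts a pod's namespace.

```go
package main

import (
	"os"
	"strings"
)

// serviceAccountNamespaceFile is where Kubernetes mounts the pod's namespace.
const serviceAccountNamespaceFile = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

// resolveNamespace applies the documented fallback order:
// service account file, then AICR_VALIDATION_NAMESPACE, then "default".
// In production code getenv would be os.Getenv.
func resolveNamespace(saFile string, getenv func(string) string) string {
	if data, err := os.ReadFile(saFile); err == nil {
		if ns := strings.TrimSpace(string(data)); ns != "" {
			return ns
		}
	}
	if ns := getenv("AICR_VALIDATION_NAMESPACE"); ns != "" {
		return ns
	}
	return "default"
}
```

Inside a Job the call would look like resolveNamespace(serviceAccountNamespaceFile, os.Getenv). On a developer machine the file is absent, so the environment variable or the "default" fallback applies.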


Directories ¶

Path	Synopsis
conformance
deployment
performance

