checks

package
v0.8.12
Published: Mar 5, 2026 License: Apache-2.0 Imports: 18 Imported by: 0

README

Validation Checks and Constraint Registry

This package provides a registration framework for validation checks and constraint validators that run inside Kubernetes Jobs.

Overview

Architecture Overview

Validation checks run inside Kubernetes Jobs to verify cluster configuration and state. This architecture enables:

  • Cluster Access: Checks query live Kubernetes resources
  • Isolation: Each check runs in a dedicated Job for resource control
  • Testability: Graceful degradation when cluster access is unavailable
  • Observability: Captured results and logs for debugging

Two Types of Validation

| Type | Purpose | Returns | Example |
|------|---------|---------|---------|
| Check | Named validation test | error | "operator-health" checks if pods are running |
| Constraint Validator | Evaluates constraint expressions | (actual string, passed bool, error) | "Deployment.gpu-operator.version" checks version >= v24.6.0 |

Key difference:

  • Checks verify a condition and return pass/fail
  • Constraint Validators extract a value and evaluate it against a constraint expression
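
The difference shows up most clearly in the function signatures. The following is a minimal standard-library sketch of the two shapes. The names `clusterState`, `checkFunc`, and `constraintFunc` are illustrative assumptions and not the package's types; the package's own functions take a `*checks.ValidationContext`.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterState is a hypothetical stand-in for the data a validator inspects.
type clusterState struct {
	RunningPods int
	Replicas    int
}

// checkFunc mirrors the shape of a check: nil means pass, an error means fail.
type checkFunc func(s clusterState) error

// constraintFunc mirrors the shape of a constraint validator: it returns the
// observed value, whether the expression held, and any evaluation error.
type constraintFunc func(s clusterState, minimum int) (actual string, passed bool, err error)

// podsRunning is a check: it only answers pass or fail.
func podsRunning(s clusterState) error {
	if s.RunningPods == 0 {
		return errors.New("no pods running")
	}
	return nil
}

// replicasAtLeast is a constraint validator: it also reports what it observed.
func replicasAtLeast(s clusterState, minimum int) (string, bool, error) {
	if minimum < 0 {
		return "", false, errors.New("minimum must not be negative")
	}
	return fmt.Sprintf("%d", s.Replicas), s.Replicas >= minimum, nil
}

func main() {
	var check checkFunc = podsRunning
	var validate constraintFunc = replicasAtLeast

	s := clusterState{RunningPods: 2, Replicas: 3}
	fmt.Println("check error:", check(s))

	actual, passed, err := validate(s, 1)
	fmt.Println("constraint:", actual, passed, err)
}
```

Returning the observed value alongside the verdict lets a report show what was found, not only that a constraint failed.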

Phase-Specific Execution

| Phase | Constraints | Checks | Execution Context |
|-------|-------------|--------|-------------------|
| Readiness | Evaluated inline from snapshot | N/A (constraint-only) | Snapshot data only |
| Deployment | Run in Jobs (need cluster access) | Run in Jobs | Snapshot + Live cluster |
| Performance | Run in Jobs (need measurements) | Run in Jobs | Snapshot + Live cluster |
| Conformance | Run in Jobs (need cluster access) | Run in Jobs | Snapshot + Live cluster |

Key Insight: Readiness = Constraints Only. It validates prerequisites from snapshot data with no cluster access and no Jobs. All other phases need live cluster access, so their constraints AND checks run inside Jobs.
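
The phase rules can be expressed as a small lookup. This is a sketch only: `executionMode` and `modeForPhase` are hypothetical names introduced for illustration, and the package may organize this logic differently.

```go
package main

import (
	"fmt"
	"strings"
)

// executionMode describes where a phase's validations run.
type executionMode struct {
	RunsInJob      bool // true when a Kubernetes Job with cluster access is needed
	SupportsChecks bool // false for the constraint-only readiness phase
}

// modeForPhase maps a phase name to its execution mode.
func modeForPhase(phase string) (executionMode, error) {
	switch strings.ToLower(phase) {
	case "readiness":
		// Constraints are evaluated inline from snapshot data.
		return executionMode{RunsInJob: false, SupportsChecks: false}, nil
	case "deployment", "performance", "conformance":
		// Constraints and checks both run inside Jobs.
		return executionMode{RunsInJob: true, SupportsChecks: true}, nil
	default:
		return executionMode{}, fmt.Errorf("unknown phase %q", phase)
	}
}

func main() {
	for _, p := range []string{"readiness", "deployment", "performance", "conformance"} {
		m, _ := modeForPhase(p)
		fmt.Printf("%-12s job=%t checks=%t\n", p, m.RunsInJob, m.SupportsChecks)
	}
}
```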

Directory Structure
pkg/validator/checks/
├── README.md                    # This file - Complete documentation
├── registry.go                  # Registration infrastructure
├── runner.go                    # Test runner for Job execution
├── generator.go                 # Code generator for new checks/constraints
├── deployment/                  # Deployment phase checks + constraints
│   ├── operator_health_check.go           # Check registration and implementation
│   ├── operator_health_check_test.go      # Integration test (runs in Jobs)
│   ├── operator_health_check_unit_test.go # Unit test (runs locally)
│   ├── gpu_operator_version_constraint.go           # Constraint validator
│   ├── gpu_operator_version_constraint_test.go      # Integration test
│   └── gpu_operator_version_constraint_unit_test.go # Unit test
├── performance/                 # Performance phase checks + constraints
│   ├── nccl_all_reduce_bw_constraint.go           # NCCL all-reduce BW constraint + registration
│   ├── nccl_all_reduce_bw_constraint_test.go      # Integration test (TestNcclAllReduceBw — runs in Jobs)
│   ├── nccl_all_reduce_bw_constraint_unit_test.go # Unit test (runs locally without cluster)
│   ├── trainer_lifecycle.go                       # Kubeflow Trainer install/uninstall lifecycle
│   └── testdata/h100/eks/                         # EKS+H100 TrainingRuntime/TrainJob templates
│       ├── runtime.yaml
│       └── trainjob.yaml
└── conformance/                 # Conformance phase checks + constraints

File Naming Convention

| Type | Files Generated |
|------|-----------------|
| Check | <name>_check.go, <name>_check_test.go, <name>_check_unit_test.go |
| Constraint | <name>_constraint.go, <name>_constraint_test.go, <name>_constraint_unit_test.go |
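
The convention can be sketched as a small helper. `generatedFiles` is a hypothetical function written for illustration, not part of the generator, and it assumes that a kebab-case name maps to snake_case by replacing hyphens with underscores. How the generator derives `<name>` from a dotted constraint name such as Deployment.gpu-operator.version is not covered.

```go
package main

import (
	"fmt"
	"strings"
)

// generatedFiles returns the three file names for a check or constraint.
// kind must be "check" or "constraint"; name may use kebab-case.
func generatedFiles(name, kind string) ([]string, error) {
	if kind != "check" && kind != "constraint" {
		return nil, fmt.Errorf("unknown kind %q", kind)
	}
	base := strings.ReplaceAll(name, "-", "_") + "_" + kind
	return []string{
		base + ".go",           // registration and implementation
		base + "_test.go",      // integration test (runs in Jobs)
		base + "_unit_test.go", // unit test (runs locally)
	}, nil
}

func main() {
	files, err := generatedFiles("operator-health", "check")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```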

Getting Started

Quick Start (5 minutes)

Use the generator to create a new check or constraint with all required files:

1. Generate a check:

make generate-validator ARGS="--check my-check --phase deployment --description 'Verify my component is healthy'"

This creates:

  • my_check_check.go - Registration and validator function
  • my_check_check_test.go - Integration test (runs in Kubernetes Jobs)
  • my_check_check_unit_test.go - Unit test (runs locally)
  • my_check_recipe.yaml - Sample recipe for testing
  • my_check_README.md - Instructions

2. Implement the validator function:

// pkg/validator/checks/deployment/my_check_check.go
func validateMyCheck(ctx *checks.ValidationContext) error {
    pods, err := ctx.Clientset.CoreV1().Pods("my-namespace").List(
        ctx.Context,
        metav1.ListOptions{LabelSelector: "app=my-component"},
    )
    if err != nil {
        return fmt.Errorf("failed to list pods: %w", err)
    }

    if len(pods.Items) == 0 {
        return fmt.Errorf("no pods found")
    }

    for _, pod := range pods.Items {
        if pod.Status.Phase == "Running" {
            return nil
        }
    }
    return fmt.Errorf("no pods running")
}

3. Add unit tests:

// pkg/validator/checks/deployment/my_check_check_unit_test.go
func TestValidateMyCheck(t *testing.T) {
    tests := []struct {
        name    string
        setup   func() *checks.ValidationContext
        wantErr bool
    }{
        {
            name: "pods running",
            setup: func() *checks.ValidationContext {
                return &checks.ValidationContext{
                    Context:   context.Background(),
                    Clientset: fake.NewSimpleClientset(&runningPod),
                }
            },
            wantErr: false,
        },
        {
            name: "no pods found",
            setup: func() *checks.ValidationContext {
                return &checks.ValidationContext{
                    Context:   context.Background(),
                    Clientset: fake.NewSimpleClientset(),
                }
            },
            wantErr: true,
        },
    }
    // ... test execution
}

4. Run unit tests:

go test -short -v ./pkg/validator/checks/deployment/... -run TestValidateMyCheck

5. Use in recipe:

validation:
  deployment:
    checks:
      - my-check

Done! Your check will run inside validation Jobs.

Generate a constraint validator:

make generate-validator ARGS="--constraint Deployment.my-app.version --phase deployment"
Key Principles
  1. Readiness = Constraints Only - Pre-deployment constraints evaluated inline from snapshot data (no checks, no Jobs, no cluster access)
  2. Other Phases = Cluster Access Required - Deployment/Performance/Conformance need live queries
  3. Self-Registration - Checks auto-discover via init()
  4. Job Isolation - Each check runs in its own Job for resource control
  5. Graceful Degradation - Test mode handles missing cluster gracefully
Example Recipe Usage
# expectedResources are declared on componentRefs (used by expected-resources check)
componentRefs:
  - name: gpu-operator
    type: Helm
    expectedResources:
      - kind: Deployment
        name: gpu-operator
        namespace: gpu-operator
      - kind: DaemonSet
        name: nvidia-driver-daemonset
        namespace: gpu-operator

validation:
  deployment:
    constraints:
      # These run INSIDE the Job with cluster access
      - name: Deployment.gpu-operator.version
        value: ">= v25.10.1"
      - name: Deployment.device-plugin.replicas
        value: ">= 1"
    checks:
      # These also run inside the Job
      - operator-health
      - expected-resources  # validates componentRefs[].expectedResources

Registration Pattern

Registering a Check

Checks use Go's init() pattern for self-registration. Use TestName to specify which test function runs in Jobs:

// pkg/validator/checks/deployment/my_check_check.go
package deployment

import "github.com/NVIDIA/aicr/pkg/validator/checks"

func init() {
    checks.RegisterCheck(&checks.Check{
        Name:        "my-check",
        Description: "Verify my component is healthy",
        Phase:       "deployment",
        TestName:    "TestCheckMyCheck",  // Test function name for Job execution
    })
}

// validateMyCheck is the validator function (private for encapsulation)
func validateMyCheck(ctx *checks.ValidationContext) error {
    // Validation logic here
    return nil
}
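
To show why init()-based self-registration works, here is a minimal standard-library sketch of a registry. The `check` type and the `registerCheck` and `getCheck` functions are simplified stand-ins, and panicking on a duplicate name is an assumption rather than documented behavior of checks.RegisterCheck.

```go
package main

import (
	"fmt"
	"sync"
)

// check is a simplified stand-in for checks.Check.
type check struct {
	Name     string
	Phase    string
	TestName string
}

var (
	mu       sync.RWMutex
	registry = map[string]*check{}
)

// registerCheck stores c under its name. It panics on duplicates so that a
// conflicting registration fails at program start (assumed behavior).
func registerCheck(c *check) {
	mu.Lock()
	defer mu.Unlock()
	if _, exists := registry[c.Name]; exists {
		panic(fmt.Sprintf("check %q already registered", c.Name))
	}
	registry[c.Name] = c
}

// getCheck looks up a check by name.
func getCheck(name string) (*check, bool) {
	mu.RLock()
	defer mu.RUnlock()
	c, ok := registry[name]
	return c, ok
}

// init runs before main whenever the package is linked into the binary,
// which is what makes self-registration work.
func init() {
	registerCheck(&check{Name: "my-check", Phase: "deployment", TestName: "TestCheckMyCheck"})
}

func main() {
	if c, ok := getCheck("my-check"); ok {
		fmt.Println(c.Name, c.Phase, c.TestName)
	}
}
```

Because registration happens in init(), a package that is never imported never registers its checks, which is why a blank import is sometimes needed.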
Registering a Constraint Validator

Constraint validators evaluate constraints that need cluster access:

// pkg/validator/checks/deployment/my_constraint_constraint.go
package deployment

import (
    "github.com/NVIDIA/aicr/pkg/recipe"
    "github.com/NVIDIA/aicr/pkg/validator/checks"
)

func init() {
    checks.RegisterConstraintValidator(&checks.ConstraintValidator{
        Name:        "Deployment.my-app.version",
        Description: "Validates my-app deployment version",
        TestName:    "TestMyAppVersion",  // Test function name for Job execution
        Phase:       "deployment",
    })
}

// validateMyAppVersion is the validator function (private for encapsulation)
func validateMyAppVersion(
    ctx *checks.ValidationContext,
    constraint recipe.Constraint,
) (actual string, passed bool, err error) {
    // Query live cluster
    deployment, err := ctx.Clientset.AppsV1().Deployments("my-namespace").Get(
        ctx.Context, "my-app", metav1.GetOptions{})
    if err != nil {
        return "", false, err
    }

    // Extract actual value (e.g., version from image tag)
    actual = extractVersion(deployment.Spec.Template.Spec.Containers[0].Image)

    // Evaluate constraint expression
    passed, err = evaluateVersionConstraint(actual, constraint.Value)

    return actual, passed, err
}
Validation Context

The ValidationContext provides runtime access to:

type ValidationContext struct {
    Context   context.Context          // Cancellation and timeouts
    Snapshot  *snapshotter.Snapshot    // Captured cluster state
    Clientset kubernetes.Interface     // Live Kubernetes API access
    RecipeData map[string]interface{}  // Recipe metadata
}
  • Snapshot: Hardware, OS, and pre-capture cluster state
  • Clientset: Query live cluster (deployments, pods, services, etc.)
  • RecipeData: Access recipe configuration if needed

Test Wrappers for Job Execution

Why Test Wrappers?

Validation checks run inside Kubernetes Jobs via go test. The Jobs execute:

go test -v -json ./pkg/validator/checks/deployment -run '^TestOperatorHealth$'

For go test to discover and run your check, you need a Test* function that:

  1. Loads ValidationContext from the Job environment (snapshot, K8s client)
  2. Executes the registered check by name
  3. Reports results in standard Go test format
Adding a Test Wrapper

Note: When using the generator (make generate-validator), test wrappers are automatically created. The following is for manual creation.

Step 1: Add Test Wrapper to Your Check's Integration Test File

The integration test file (*_check_test.go) contains the test wrapper that runs in Kubernetes Jobs:

// pkg/validator/checks/deployment/operator_health_check_test.go

// TestOperatorHealth is the integration test for operator-health.
// This runs inside validator Jobs and invokes the validator.
func TestOperatorHealth(t *testing.T) {
    if testing.Short() {
        t.Skip("Skipping integration test in short mode")
    }

    runner, err := checks.NewTestRunner(t)
    if err != nil {
        // Skip if not running in Kubernetes (expected during local test runs)
        t.Skipf("Not in Job environment: %v", err)
    }
    defer runner.Cancel()

    runner.RunCheck("operator-health")
}

Step 2: Naming Convention

The test wrapper function name must match the check name pattern:

| Check Name | Test Wrapper Function |
|------------|-----------------------|
| operator-health | TestOperatorHealth |
| nccl-bandwidth | TestNCCLBandwidth |

Pattern: Convert kebab-case to PascalCase and prefix with Test.
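
A mechanical version of this conversion might look like the sketch below. `testWrapperName` is a hypothetical helper and assumes ASCII check names. It does not special-case acronyms, so it produces TestNcclBandwidth for nccl-bandwidth; a name such as TestNCCLBandwidth has to be chosen by hand.

```go
package main

import (
	"fmt"
	"strings"
)

// testWrapperName converts a kebab-case check name into a Go test function
// name: each hyphen-separated part is capitalized and "Test" is prefixed.
func testWrapperName(checkName string) string {
	var b strings.Builder
	b.WriteString("Test")
	for _, part := range strings.Split(checkName, "-") {
		if part == "" {
			continue // tolerate doubled or trailing hyphens
		}
		b.WriteString(strings.ToUpper(part[:1]))
		b.WriteString(part[1:])
	}
	return b.String()
}

func main() {
	fmt.Println(testWrapperName("operator-health"))
	fmt.Println(testWrapperName("nccl-bandwidth"))
}
```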

How the Test Runner Works

The checks.NewTestRunner(t) function:

  1. Creates in-cluster Kubernetes client using rest.InClusterConfig()
  2. Loads snapshot from mounted file at $AICR_SNAPSHOT_PATH (default: /data/snapshot/snapshot.yaml)
  3. Loads recipe data from $AICR_RECIPE_DATA environment variable (optional)
  4. Returns TestRunner with fully initialized ValidationContext

The runner.RunCheck("check-name") method:

  1. Looks up check in registry by name
  2. Executes check function with the loaded ValidationContext
  3. Reports results via t.Fatalf() on failure, or returns on success
Complete Test Wrapper Example
// pkg/validator/checks/performance/nccl_bandwidth.go
package performance

import "github.com/NVIDIA/aicr/pkg/validator/checks"

func init() {
    checks.RegisterCheck(&checks.Check{
        Name:        "nccl-bandwidth",
        Description: "Measure NCCL all-reduce bandwidth",
        Phase:       "performance",
        Func:        CheckNCCLBandwidth,
    })
}

func CheckNCCLBandwidth(ctx *checks.ValidationContext) error {
    // Implementation...
    return nil
}
// pkg/validator/checks/performance/nccl_bandwidth_test.go
package performance

import (
    "testing"
    "github.com/NVIDIA/aicr/pkg/validator/checks"
)

// Test wrapper for Job execution
func TestNCCLBandwidth(t *testing.T) {
    runner, err := checks.NewTestRunner(t)
    if err != nil {
        t.Skipf("Skipping integration test (not in Kubernetes): %v", err)
        return
    }

    runner.RunCheck("nccl-bandwidth")
}

// Unit tests with mocked context
func TestCheckNCCLBandwidth(t *testing.T) {
    tests := []struct {
        name    string
        // ...
    }{
        // Test cases...
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            ctx := &checks.ValidationContext{
                // Mocked context...
            }
            err := CheckNCCLBandwidth(ctx)
            // Assertions...
        })
    }
}
Environment Variables

The validation Job automatically sets these environment variables:

| Variable | Purpose | Example |
|----------|---------|---------|
| AICR_SNAPSHOT_PATH | Path to mounted snapshot file | /data/snapshot/snapshot.yaml |
| AICR_RECIPE_PATH | Path to mounted recipe file | /data/recipe/recipe.yaml |
| AICR_NAMESPACE | Namespace where Job is running | aicr-validation |
| AICR_RESULT_CONFIGMAP | ConfigMap name for results | aicr-validation-deployment-operator-health-result |
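
Reading these variables can be sketched with the standard library alone. `jobEnv`, `envOr`, and `loadJobEnv` are hypothetical names. Only the snapshot path default comes from this document; leaving the other values empty when unset is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// jobEnv holds the values a validation Job passes through the environment.
type jobEnv struct {
	SnapshotPath    string
	RecipePath      string
	Namespace       string
	ResultConfigMap string
}

// lookupFunc matches the signature of os.LookupEnv so tests can substitute a map.
type lookupFunc func(key string) (string, bool)

// envOr returns the value of key, or fallback when it is unset or empty.
func envOr(lookup lookupFunc, key, fallback string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return fallback
}

// loadJobEnv reads the documented variables through lookup.
func loadJobEnv(lookup lookupFunc) jobEnv {
	return jobEnv{
		SnapshotPath:    envOr(lookup, "AICR_SNAPSHOT_PATH", "/data/snapshot/snapshot.yaml"),
		RecipePath:      envOr(lookup, "AICR_RECIPE_PATH", ""),
		Namespace:       envOr(lookup, "AICR_NAMESPACE", ""),
		ResultConfigMap: envOr(lookup, "AICR_RESULT_CONFIGMAP", ""),
	}
}

func main() {
	env := loadJobEnv(os.LookupEnv)
	fmt.Printf("%+v\n", env)
}
```

Passing the lookup function in, instead of calling os.LookupEnv directly, keeps the loader testable without mutating the process environment.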
Local vs Job Execution

Local execution (go test ./pkg/validator/checks/...):

  • Test wrappers skip (no in-cluster config available)
  • Unit tests run (use mocked context)
  • Fast feedback during development

Job execution (go test -run '^TestOperatorHealth$'):

  • Test wrappers run (inside Kubernetes)
  • Unit tests excluded by -run pattern
  • Real validation against live cluster
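
The `-run` flag takes a regular expression that is matched, unanchored, against test function names. The sketch below imitates that selection with the `regexp` package to show which functions a given pattern would pick. `selectTests` is a hypothetical helper and ignores the subtest syntax (slash-separated patterns) that `go test` also supports.

```go
package main

import (
	"fmt"
	"regexp"
)

// selectTests returns the names that an unanchored regular expression matches,
// approximating how `go test -run` selects top-level tests.
func selectTests(pattern string, names []string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var selected []string
	for _, n := range names {
		if re.MatchString(n) {
			selected = append(selected, n)
		}
	}
	return selected, nil
}

func main() {
	names := []string{"TestOperatorHealth", "TestCheckOperatorHealth", "TestOperatorHealthLocal"}
	for _, p := range []string{"operator-health", "TestOperatorHealth", "^TestOperatorHealth$"} {
		got, err := selectTests(p, names)
		if err != nil {
			fmt.Println("bad pattern:", err)
			continue
		}
		fmt.Printf("%-22s -> %v\n", p, got)
	}
}
```

A check name such as operator-health matches no test function, and an unanchored function name also selects any test whose name merely contains it, so anchoring the pattern with `^` and `$` is the safest way to run exactly one wrapper.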

How-To Guide

Adding a Check

Step 1: Create Check File

Create a file in the appropriate phase directory:

  • pkg/validator/checks/deployment/ - For deployment checks
  • pkg/validator/checks/performance/ - For performance checks
  • pkg/validator/checks/conformance/ - For conformance checks

Example: pkg/validator/checks/deployment/operator_health_check.go

Step 2: Implement Check Function

// Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
// [Standard license header...]

package deployment

import (
    "fmt"

    "github.com/NVIDIA/aicr/pkg/validator/checks"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func init() {
    checks.RegisterCheck(&checks.Check{
        Name:        "operator-health",              // ← Used in recipe
        Description: "Verify GPU operator is healthy",
        Phase:       "deployment",                   // ← Must match phase
        Func:        CheckOperatorHealth,
    })
}

// CheckOperatorHealth verifies the GPU operator pods are running.
func CheckOperatorHealth(ctx *checks.ValidationContext) error {
    // Access live cluster via ctx.Clientset
    pods, err := ctx.Clientset.CoreV1().Pods("gpu-operator").List(
        ctx.Context,
        metav1.ListOptions{LabelSelector: "app=gpu-operator"},
    )
    if err != nil {
        return fmt.Errorf("failed to list GPU operator pods: %w", err)
    }

    if len(pods.Items) == 0 {
        return fmt.Errorf("no GPU operator pods found")
    }

    // Verify at least one pod is running
    for _, pod := range pods.Items {
        if pod.Status.Phase == "Running" {
            return nil // Success!
        }
    }

    return fmt.Errorf("no GPU operator pods in Running state")
}

Step 3: Add Test Wrapper

// pkg/validator/checks/deployment/operator_health_check_test.go
package deployment

import (
    "testing"
    "github.com/NVIDIA/aicr/pkg/validator/checks"
)

func TestOperatorHealth(t *testing.T) {
    runner, err := checks.NewTestRunner(t)
    if err != nil {
        t.Skipf("Skipping integration test (not in Kubernetes): %v", err)
        return
    }
    runner.RunCheck("operator-health")
}

Step 4: Use in Recipe

validation:
  deployment:
    checks:
      - operator-health  # ← Must match Check.Name

Step 5: Import Package (if needed)

If the package isn't already imported, add it to trigger init():

// In main.go or test file
import _ "github.com/NVIDIA/aicr/pkg/validator/checks/deployment"
Adding a Constraint Validator

Step 1: Create Constraints File

Create constraints.go in the phase directory:

  • pkg/validator/checks/deployment/constraints.go
  • pkg/validator/checks/performance/constraints.go
  • pkg/validator/checks/conformance/constraints.go

Step 2: Implement Constraint Validator

// Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
// [Standard license header...]

package deployment

import (
    "context"
    "fmt"

    "github.com/NVIDIA/aicr/pkg/recipe"
    "github.com/NVIDIA/aicr/pkg/validator"
    "github.com/NVIDIA/aicr/pkg/validator/checks"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

func init() {
    checks.RegisterConstraintValidator(&checks.ConstraintValidator{
        Name:        "Deployment.device-plugin.replicas",  // ← Constraint name
        Description: "Validates device plugin replica count",
        Func:        ValidateDevicePluginReplicas,
    })
}

// ValidateDevicePluginReplicas checks the device plugin replica count.
// Constraint format: "Deployment.device-plugin.replicas"
// Constraint value: ">= 1", "== 3", etc.
func ValidateDevicePluginReplicas(
    ctx *checks.ValidationContext,
    constraint recipe.Constraint,
) (string, bool, error) {
    // 1. Query cluster to get actual value
    replicas, err := getDevicePluginReplicas(ctx.Context, ctx.Clientset)
    if err != nil {
        return "", false, fmt.Errorf("failed to get replica count: %w", err)
    }

    // 2. Convert to string for comparison
    actualValue := fmt.Sprintf("%d", replicas)

    // 3. Evaluate constraint expression
    passed, err := evaluateConstraint(actualValue, constraint.Value)
    if err != nil {
        return actualValue, false, fmt.Errorf("constraint evaluation failed: %w", err)
    }

    // 4. Return: (actual value, pass/fail, error)
    return actualValue, passed, nil
}

// Helper: Get actual replica count from cluster
func getDevicePluginReplicas(ctx context.Context, clientset kubernetes.Interface) (int, error) {
    deployment, err := clientset.AppsV1().Deployments("gpu-operator").Get(
        ctx,
        "nvidia-device-plugin",
        metav1.GetOptions{},
    )
    if err != nil {
        return 0, err
    }

    if deployment.Spec.Replicas == nil {
        return 0, nil
    }

    return int(*deployment.Spec.Replicas), nil
}

// Helper: Evaluate constraint expression
func evaluateConstraint(actualValue, constraintExpr string) (bool, error) {
    parsed, err := validator.ParseConstraintExpression(constraintExpr)
    if err != nil {
        return false, fmt.Errorf("invalid constraint expression: %w", err)
    }

    passed, err := parsed.Evaluate(actualValue)
    if err != nil {
        return false, fmt.Errorf("evaluation failed: %w", err)
    }

    return passed, nil
}
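
For intuition about what evaluating an expression such as ">= 1" involves, here is a standard-library sketch for integer values. It is an illustration under assumptions and does not reproduce `validator.ParseConstraintExpression`; the function name `evalIntConstraint` and the supported operator set are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// evalIntConstraint evaluates expressions such as ">= 1" or "== 3" against actual.
func evalIntConstraint(actual int, expr string) (bool, error) {
	expr = strings.TrimSpace(expr)
	// Two-character operators come first so ">=" is not read as ">".
	for _, op := range []string{">=", "<=", "==", "!=", ">", "<"} {
		if !strings.HasPrefix(expr, op) {
			continue
		}
		want, err := strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(expr, op)))
		if err != nil {
			return false, fmt.Errorf("invalid operand in %q: %w", expr, err)
		}
		switch op {
		case ">=":
			return actual >= want, nil
		case "<=":
			return actual <= want, nil
		case "==":
			return actual == want, nil
		case "!=":
			return actual != want, nil
		case ">":
			return actual > want, nil
		case "<":
			return actual < want, nil
		}
	}
	return false, fmt.Errorf("no comparison operator in %q", expr)
}

func main() {
	for _, expr := range []string{">= 1", "== 3", "< 2"} {
		passed, err := evalIntConstraint(3, expr)
		fmt.Printf("3 %-5s passed=%t err=%v\n", expr, passed, err)
	}
}
```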

Step 3: Use in Recipe

validation:
  deployment:
    constraints:
      - name: Deployment.device-plugin.replicas  # ← Must match the registered validator Name
        value: ">= 1"                             # ← Constraint expression

Step 4: Import Package (if needed)

Same as checks - ensure the package is imported to trigger init().

Phase-Specific Considerations
Deployment Phase

Typical validations:

  • Operator health and readiness
  • Deployment resource versions
  • Pod counts and statuses
  • ConfigMap/Secret presence

Example constraint names:

  • Deployment.gpu-operator.version
  • Deployment.device-plugin.replicas
  • Deployment.dcgm-exporter.enabled

Access patterns:

// Deployments
deployment, _ := ctx.Clientset.AppsV1().Deployments(ns).Get(ctx.Context, name, metav1.GetOptions{})

// Pods
pods, _ := ctx.Clientset.CoreV1().Pods(ns).List(ctx.Context, metav1.ListOptions{LabelSelector: "app=foo"})

// ConfigMaps
cm, _ := ctx.Clientset.CoreV1().ConfigMaps(ns).Get(ctx.Context, name, metav1.GetOptions{})
Performance Phase

Typical validations:

  • NCCL all-reduce bus bandwidth (east-west (EW) fabric between GPU nodes)
  • Network fabric health
  • GPU-to-GPU communication latency
  • Storage I/O performance

Example constraint names:

  • nccl-all-reduce-bw (implemented — EKS + H100)
  • Performance.network.latency
  • Performance.gpu.peer-access

Implemented constraints:

  • nccl-all-reduce-bw — Runs a Kubeflow Trainer TrainJob with NCCL all_reduce_perf, parses the 16 GB bus bandwidth from launcher logs, and validates it is within 10% of the recipe threshold. Skips gracefully when fewer than 2 GPU nodes are available (requires EKS + H100 to run). Auto-installs Kubeflow Trainer if not already present and tears it down on exit.
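
The tolerance comparison can be sketched as follows. This assumes that "within 10%" means the measured bandwidth may fall at most 10% below the recipe threshold; the function name `withinTolerance` is hypothetical, and the numbers in `main` are arbitrary illustrations, not measurements.

```go
package main

import "fmt"

// withinTolerance reports whether measured is no more than tolerance
// (a fraction such as 0.10) below threshold.
func withinTolerance(measured, threshold, tolerance float64) (bool, error) {
	if threshold <= 0 || tolerance < 0 || tolerance >= 1 {
		return false, fmt.Errorf("invalid threshold %v or tolerance %v", threshold, tolerance)
	}
	return measured >= threshold*(1-tolerance), nil
}

func main() {
	const threshold = 400.0 // illustrative threshold in GB/s
	for _, bw := range []float64{400, 365, 350} {
		ok, err := withinTolerance(bw, threshold, 0.10)
		fmt.Printf("busbw=%.0f pass=%t err=%v\n", bw, ok, err)
	}
}
```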

Access patterns:

// Dynamic client for CRD and TrainJob operations
dynamicClient, _ := dynamic.NewForConfig(ctx.RESTConfig)

// List schedulable GPU nodes
nodes, _ := ctx.Clientset.CoreV1().Nodes().List(ctx.Context, metav1.ListOptions{})

// Watch launcher pod for completion
watcher, _ := ctx.Clientset.CoreV1().Pods(ns).Watch(ctx.Context, metav1.ListOptions{
    FieldSelector: "metadata.name=" + podName,
})
Conformance Phase

Typical validations:

  • Kubernetes API version compatibility
  • RBAC policy conformance
  • CRD schema validation
  • AI workload compatibility

Example constraint names:

  • Conformance.k8s.version
  • Conformance.api.gpu-device
  • Conformance.workload.pytorch

Access patterns:

// API version
version, _ := ctx.Clientset.Discovery().ServerVersion()

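// CRDs: kubernetes.Interface does not expose the apiextensions API, so build
// a dedicated client from the REST config.
// import apiextensionsclientset "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
apiextClient, _ := apiextensionsclientset.NewForConfig(ctx.RESTConfig)
crd, _ := apiextClient.ApiextensionsV1().CustomResourceDefinitions().Get(ctx.Context, "gpus.nvidia.com", metav1.GetOptions{})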

// Run conformance test workloads
job, _ := ctx.Clientset.BatchV1().Jobs(ns).Create(ctx.Context, testJob, metav1.CreateOptions{})
Testing
Unit Test for Check
func TestCheckOperatorHealth(t *testing.T) {
    tests := []struct {
        name    string
        pods    []corev1.Pod
        wantErr bool
    }{
        {
            name: "healthy operator",
            pods: []corev1.Pod{
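                {
                    // Namespace and labels must match what CheckOperatorHealth lists,
                    // otherwise the fake clientset returns no pods.
                    ObjectMeta: metav1.ObjectMeta{
                        Name:      "gpu-operator-abc",
                        Namespace: "gpu-operator",
                        Labels:    map[string]string{"app": "gpu-operator"},
                    },
                    Status: corev1.PodStatus{Phase: corev1.PodRunning},
                },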
            },
            wantErr: false,
        },
        {
            name:    "no pods found",
            pods:    []corev1.Pod{},
            wantErr: true,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // Create fake clientset with test data
            var objects []runtime.Object
            for i := range tt.pods {
                objects = append(objects, &tt.pods[i])
            }
            clientset := fake.NewSimpleClientset(objects...)

            ctx := &checks.ValidationContext{
                Context:   context.Background(),
                Clientset: clientset,
            }

            err := CheckOperatorHealth(ctx)
            if (err != nil) != tt.wantErr {
                t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
            }
        })
    }
}
Unit Test for Constraint Validator
func TestValidateDevicePluginReplicas(t *testing.T) {
    tests := []struct {
        name          string
        deployment    *appsv1.Deployment
        constraint    recipe.Constraint
        wantActual    string
        wantPassed    bool
        wantErr       bool
    }{
        {
            name: "constraint satisfied",
            deployment: &appsv1.Deployment{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      "nvidia-device-plugin",
                    Namespace: "gpu-operator",
                },
                Spec: appsv1.DeploymentSpec{
                    Replicas: ptr.To(int32(3)),
                },
            },
            constraint: recipe.Constraint{
                Name:  "Deployment.device-plugin.replicas",
                Value: ">= 1",
            },
            wantActual: "3",
            wantPassed: true,
            wantErr:    false,
        },
        {
            name: "constraint not satisfied",
            deployment: &appsv1.Deployment{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      "nvidia-device-plugin",
                    Namespace: "gpu-operator",
                },
                Spec: appsv1.DeploymentSpec{
                    Replicas: ptr.To(int32(0)),
                },
            },
            constraint: recipe.Constraint{
                Name:  "Deployment.device-plugin.replicas",
                Value: ">= 1",
            },
            wantActual: "0",
            wantPassed: false,
            wantErr:    false,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            clientset := fake.NewSimpleClientset(tt.deployment)

            ctx := &checks.ValidationContext{
                Context:   context.Background(),
                Clientset: clientset,
            }

            actual, passed, err := ValidateDevicePluginReplicas(ctx, tt.constraint)

            if (err != nil) != tt.wantErr {
                t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
            }
            if actual != tt.wantActual {
                t.Errorf("actual = %v, want %v", actual, tt.wantActual)
            }
            if passed != tt.wantPassed {
                t.Errorf("passed = %v, want %v", passed, tt.wantPassed)
            }
        })
    }
}
Integration Test
func TestConstraintValidatorRegistration(t *testing.T) {
    // Verify the validator is registered
    validator, ok := checks.GetConstraintValidator("Deployment.device-plugin.replicas")
    if !ok {
        t.Fatal("Constraint validator not registered")
    }

    if validator.Pattern != "Deployment.device-plugin.replicas" {
        t.Errorf("Pattern = %v, want Deployment.device-plugin.replicas", validator.Pattern)
    }

    if validator.Func == nil {
        t.Fatal("Func is nil")
    }
}
Testing Checks Locally
func TestOperatorHealthLocal(t *testing.T) {
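    // operator-health lists pods, so the fixture must be a running pod that
    // matches the check's namespace and label selector.
    pod := &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "gpu-operator-abc",
            Namespace: "gpu-operator",
            Labels:    map[string]string{"app": "gpu-operator"},
        },
        Status: corev1.PodStatus{Phase: corev1.PodRunning},
    }
    clientset := fake.NewSimpleClientset(pod)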

    ctx := &checks.ValidationContext{
        Context:   context.Background(),
        Clientset: clientset,
    }

    check, ok := checks.GetCheck("operator-health")
    if !ok {
        t.Fatal("check not registered")
    }

    if err := check.Func(ctx); err != nil {
        t.Errorf("check failed: %v", err)
    }
}
Common Patterns
Pattern 1: Version Constraint Validator
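func ValidateComponentVersion(ctx *checks.ValidationContext, constraint recipe.Constraint) (string, bool, error) {
    // 1. Fetch the deployment ("my-namespace" and "my-app" are placeholders)
    deployment, err := ctx.Clientset.AppsV1().Deployments("my-namespace").Get(
        ctx.Context, "my-app", metav1.GetOptions{})
    if err != nil {
        return "", false, fmt.Errorf("failed to get deployment: %w", err)
    }

    // 2. Get version from deployment label
    version := deployment.Labels["app.kubernetes.io/version"]

    // 3. Fallback: parse from the first container's image tag
    if containers := deployment.Spec.Template.Spec.Containers; version == "" && len(containers) > 0 {
        version = extractVersionFromImage(containers[0].Image)
    }

    // 4. Normalize version (add 'v' prefix if missing)
    version = normalizeVersion(version)

    // 5. Evaluate constraint
    passed, err := evaluateVersionConstraint(version, constraint.Value)

    return version, passed, err
}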
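
Pattern 1 relies on two helpers, extractVersionFromImage and normalizeVersion, whose implementations are not shown in this document. The following standard-library sketch shows one way they might be written. The parsing rules (digest stripping, registry port handling) are assumptions, and the image references in `main` are examples only.

```go
package main

import (
	"fmt"
	"strings"
)

// extractVersionFromImage returns the tag of an image reference, or "" if
// the reference carries no tag.
func extractVersionFromImage(image string) string {
	// Drop a digest suffix such as "@sha256:...".
	if i := strings.Index(image, "@"); i >= 0 {
		image = image[:i]
	}
	// A tag follows the last colon, provided that colon comes after the last
	// slash; otherwise it is a registry port, as in "localhost:5000/app".
	colon := strings.LastIndex(image, ":")
	if colon < 0 || colon < strings.LastIndex(image, "/") {
		return ""
	}
	return image[colon+1:]
}

// normalizeVersion adds a leading "v" when it is missing.
func normalizeVersion(v string) string {
	v = strings.TrimSpace(v)
	if v == "" || strings.HasPrefix(v, "v") {
		return v
	}
	return "v" + v
}

func main() {
	for _, img := range []string{
		"nvcr.io/nvidia/gpu-operator:v24.6.0",
		"localhost:5000/app:1.2.3",
		"localhost:5000/app",
	} {
		tag := extractVersionFromImage(img)
		fmt.Printf("%s -> %q\n", img, normalizeVersion(tag))
	}
}
```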
Pattern 2: Count/Numeric Constraint Validator
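func ValidateResourceCount(ctx *checks.ValidationContext, constraint recipe.Constraint) (string, bool, error) {
    // 1. Query cluster for resources (pods here; namespace and selector are placeholders)
    items, err := ctx.Clientset.CoreV1().Pods("my-namespace").List(
        ctx.Context, metav1.ListOptions{LabelSelector: "app=my-component"})
    if err != nil {
        return "", false, fmt.Errorf("failed to list resources: %w", err)
    }

    // 2. Count items
    count := len(items.Items)
    actualValue := fmt.Sprintf("%d", count)

    // 3. Evaluate numeric constraint
    passed, err := evaluateConstraint(actualValue, constraint.Value)

    return actualValue, passed, err
}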
Pattern 3: Boolean/State Constraint Validator
func ValidateFeatureEnabled(ctx *checks.ValidationContext, constraint recipe.Constraint) (string, bool, error) {
    // 1. Check feature state (ConfigMap, annotation, etc.)
    enabled := checkFeatureState(ctx)
    actualValue := fmt.Sprintf("%t", enabled)

    // 2. Evaluate boolean constraint ("== true", "== false")
    passed, err := evaluateConstraint(actualValue, constraint.Value)

    return actualValue, passed, err
}
Pattern 4: Resource Lookup Across Namespaces
func findResourceAcrossNamespaces(ctx context.Context, clientset kubernetes.Interface,
    namespaces []string, names []string) (*appsv1.Deployment, error) {

    for _, ns := range namespaces {
        for _, name := range names {
            deployment, err := clientset.AppsV1().Deployments(ns).Get(
                ctx, name, metav1.GetOptions{},
            )
            if err == nil {
                return deployment, nil
            }
        }
    }

    return nil, fmt.Errorf("resource not found in any namespace")
}
Pattern 5: Performance Test with Job
func CheckPerformance(ctx *checks.ValidationContext) error {
    // 1. Create test Job
    job := &batchv1.Job{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "perf-test",
            Namespace: "aicr-validation",
        },
        Spec: batchv1.JobSpec{
            Template: corev1.PodTemplateSpec{
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{
                        {
                            Name:  "test",
                            Image: "nvcr.io/nvidia/nccl-tests:latest",
                            Args:  []string{"all_reduce_perf", "-b", "8", "-e", "256M"},
                        },
                    },
                    RestartPolicy: "Never",
                },
            },
        },
    }

    // 2. Create and wait for Job
    _, err := ctx.Clientset.BatchV1().Jobs("aicr-validation").Create(
        ctx.Context, job, metav1.CreateOptions{},
    )
    if err != nil {
        return err
    }

    // 3. Wait for completion
    // 4. Read logs
    // 5. Parse results

    return nil
}
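
Step 3 of this pattern (waiting for completion) is left as a comment above. A generic, standard-library poll loop that could fill that role is sketched below. `pollUntil` and `waitWithTimeout` are hypothetical helpers; in a real check the `done` callback would read the Job's status from the cluster, which is simulated here with a counter.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollUntil calls done immediately and then every interval until it returns
// true, returns an error, or ctx is cancelled.
func pollUntil(ctx context.Context, interval time.Duration, done func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ok, err := done()
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // timeout or cancellation
		case <-ticker.C:
		}
	}
}

// waitWithTimeout is a convenience wrapper that bounds the wait.
func waitWithTimeout(timeout, interval time.Duration, done func() (bool, error)) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return pollUntil(ctx, interval, done)
}

func main() {
	attempts := 0
	err := waitWithTimeout(time.Second, 10*time.Millisecond, func() (bool, error) {
		attempts++ // stand-in for reading the Job's status
		return attempts >= 3, nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Inside a check, passing a context derived from `ctx.Context` to `pollUntil` would let the wait respect the Job's own cancellation and deadline.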

Adding Constraint Validators (New Approach)

For constraint validators, AICR provides an automated code generator that scaffolds all necessary files with proper structure. This ensures consistency and catches registration issues automatically.

Quick Start with Generator

1. Generate validator scaffolding:

make generate-validator ARGS="--constraint Deployment.my-app.version --phase deployment --description 'Validates my-app version'"

This creates three files with TODOs guiding implementation:

pkg/validator/checks/deployment/
├── my_app_version.go                    # Helper functions
├── my_app_version_test.go               # Unit tests
└── my_app_version_integration_test.go   # Integration test with registration

2. Implement helper functions:

Edit my_app_version.go and fill in the TODOs:

// getMyAppVersion queries the cluster to get the actual version
func getMyAppVersion(ctx context.Context, clientset kubernetes.Interface) (string, error) {
    // TODO: Implement version detection
    // Search common namespaces
    namespaces := []string{"my-app", "default", "kube-system"}
    names := []string{"my-app", "myapp"}

    for _, ns := range namespaces {
        for _, name := range names {
            deployment, err := clientset.AppsV1().Deployments(ns).Get(
                ctx, name, metav1.GetOptions{},
            )
            if err == nil {
                // Try version from label
                if version := deployment.Labels["app.kubernetes.io/version"]; version != "" {
                    return normalizeVersion(version), nil
                }
                // Try version from image tag
                if len(deployment.Spec.Template.Spec.Containers) > 0 {
                    return extractVersionFromImage(deployment.Spec.Template.Spec.Containers[0].Image), nil
                }
            }
        }
    }

    return "", fmt.Errorf("my-app not found")
}

// evaluateVersionConstraint evaluates version constraint expressions
func evaluateVersionConstraint(actualValue, constraintValue string) (bool, error) {
    // TODO: Implement constraint evaluation
    // Parse constraint (==, !=, >=, <=, >, <, ~=)
    // Compare versions using semver
    // Return pass/fail
    return false, fmt.Errorf("evaluateVersionConstraint not implemented")
}

3. Add unit test cases:

Edit my_app_version_test.go:

func TestGetMyAppVersion(t *testing.T) {
    tests := []struct {
        name       string
        deployment *appsv1.Deployment
        want       string
        wantErr    bool
    }{
        {
            name: "version from label",
            deployment: &appsv1.Deployment{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      "my-app",
                    Namespace: "default",
                    Labels: map[string]string{
                        "app.kubernetes.io/version": "v1.2.3",
                    },
                },
            },
            want:    "v1.2.3",
            wantErr: false,
        },
        // Add more test cases...
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            clientset := fake.NewSimpleClientset(tt.deployment)
            got, err := getMyAppVersion(context.Background(), clientset)

            if (err != nil) != tt.wantErr {
                t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
            }
            if got != tt.want {
                t.Errorf("got %v, want %v", got, tt.want)
            }
        })
    }
}

4. Integration test is auto-generated with registration:

The generator creates my_app_version_integration_test.go with proper registration:

func init() {
    checks.RegisterConstraintValidator(&checks.ConstraintValidator{
        Name:        "Deployment.my-app.version",
        Description: "Validates my-app version",
        TestName:    "TestMyAppVersion",
        Phase:       "deployment",
    })
}

func TestMyAppVersion(t *testing.T) {
    if testing.Short() {
        t.Skip("Skipping integration test in short mode")
    }

    runner, err := checks.NewTestRunner(t)
    if err != nil {
        t.Skipf("Skipping integration test (not in Kubernetes): %v", err)
    }

    // Get constraint from recipe
    constraint := runner.GetConstraint("deployment", "Deployment.my-app.version")
    if constraint == nil {
        t.Skip("Constraint not defined in recipe")
    }

    // Execute validation logic
    ctx := runner.Context()
    actualValue, err := getMyAppVersion(ctx.Context, ctx.Clientset)
    if err != nil {
        t.Fatalf("Failed to get my-app version: %v", err)
    }

    passed, err := evaluateVersionConstraint(actualValue, constraint.Value)
    if err != nil {
        t.Fatalf("Failed to evaluate constraint: %v", err)
    }

    if !passed {
        t.Errorf("Version constraint not satisfied: actual=%s, expected=%s",
            actualValue, constraint.Value)
    }
}

5. Run tests:

# Unit tests only (fast, no cluster needed)
make test

# Integration test in Job (requires cluster)
aicr validate --recipe recipe.yaml --snapshot snapshot.yaml --phase deployment

6. Submit PR:

The CI pipeline automatically validates:

  • Code compiles
  • Unit tests pass
  • Registration is complete (enforced by pkg/validator/checks/registration_test.go)
  • Coverage meets threshold
How It Works
Recipe → Test Execution Flow
# Recipe
validation:
  deployment:
    constraints:
      - name: Deployment.my-app.version
        value: ">= v1.2.0"

// Registry lookup (buildTestPattern in phases.go)
testName, _ := checks.GetTestNameForConstraint("Deployment.my-app.version")
// Returns: "TestMyAppVersion"

// Pattern building
pattern := "^(TestMyAppVersion)$"

# Job command
go test -v -json ./pkg/validator/checks/deployment -run '^(TestMyAppVersion)$'

// Integration test runs with cluster access
func TestMyAppVersion(t *testing.T) {
    // Queries cluster, evaluates constraint
}
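
The pattern-building step can be sketched as follows. This is a minimal illustration, assuming the pattern is an anchored alternation of test names as in the example above; `buildRunPattern` is a hypothetical name and may not match what buildTestPattern in phases.go does.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// buildRunPattern joins test names into an anchored alternation for `go test -run`.
// It returns "" for an empty list; callers should handle that case explicitly,
// because an empty -run pattern matches every test.
func buildRunPattern(testNames []string) string {
	if len(testNames) == 0 {
		return ""
	}
	quoted := make([]string, 0, len(testNames))
	for _, name := range testNames {
		quoted = append(quoted, regexp.QuoteMeta(name))
	}
	sort.Strings(quoted) // sorted so the generated command is deterministic
	return "^(" + strings.Join(quoted, "|") + ")$"
}

func main() {
	fmt.Println(buildRunPattern([]string{"TestMyAppVersion"}))
	fmt.Println(buildRunPattern([]string{"TestOperatorHealth", "TestMyAppVersion"}))
}
```

Anchoring with `^` and `$` matters because `-run` takes an unanchored regular expression: without the anchors, `TestMyAppVersion` would also select a test named `TestMyAppVersionV2`.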
Architecture Principles

Key Insight: Integration tests ARE the validators. They contain the validation logic directly, not wrapper functions.

File Structure:

  • *_version.go - Helper functions (query cluster, evaluate constraints)
  • *_version_test.go - Unit tests with table-driven cases using fake clientset
  • *_version_integration_test.go - Integration test that runs in Jobs with real cluster access

Separation:

  • Unit tests: Fast, use fake clientset, test helper functions
  • Integration tests: Run in Jobs, use real cluster, test full constraint validation

Test Runner Pattern:

runner, err := checks.NewTestRunner(t)
// Provides:
// - runner.Context() - Kubernetes clientset, context, snapshot, recipe
// - runner.GetConstraint(phase, name) - Lookup constraint from recipe
Enforcement Mechanism

Three layers ensure validators are properly implemented:

1. Automated Registration Tests

pkg/validator/checks/registration_test.go runs in every make test and fails if:

  • Registered constraint has no test implementation
  • Integration test exists without registration
  • Registered check has no test implementation
func TestConstraintRegistrationCompleteness(t *testing.T) {
    constraintTests := checks.ListConstraintTests("")
    existingTests := findTestFunctions(t)  // AST parsing

    var missing []string
    for _, ct := range constraintTests {
        if !existingTests[ct.TestName] {
            missing = append(missing, ct.TestName)
        }
    }

    if len(missing) > 0 {
        t.Errorf("Registered constraints missing test implementations: %v", missing)
    }
}
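
The `findTestFunctions` helper is not shown in this README. A minimal sketch of the AST-parsing idea, using only the standard library and operating on a source string rather than walking the package directory, might look like this. `testFunctionsIn` is a hypothetical name.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// testFunctionsIn parses Go source and returns the names of top-level
// functions (not methods) whose names start with "Test".
func testFunctionsIn(src string) (map[string]bool, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "example_test.go", src, 0)
	if err != nil {
		return nil, err
	}
	found := make(map[string]bool)
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Recv != nil {
			continue // skip non-function declarations and methods
		}
		if strings.HasPrefix(fn.Name.Name, "Test") {
			found[fn.Name.Name] = true
		}
	}
	return found, nil
}

func main() {
	src := `package deployment

import "testing"

func TestMyAppVersion(t *testing.T) {}
func helper() {}
`
	found, err := testFunctionsIn(src)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(found["TestMyAppVersion"], found["helper"])
}
```

Parsing the source instead of relying on reflection lets the check see test functions without executing them.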

2. Code Generator

make generate-validator scaffolds all files correctly:

  • Includes registration automatically in integration test
  • Provides TODOs for implementation
  • Follows naming conventions

3. Documentation

  • Comprehensive development guide (this section)
  • Generated code has inline examples and TODOs
  • Contributing guide integration
What Gets Caught
| Mistake | How It's Caught |
| --- | --- |
| Registered constraint without test | TestConstraintRegistrationCompleteness fails |
| Integration test without registration | TestIntegrationTestsAreRegistered fails |
| Wrong test function name | Pattern matching fails (test not found) |
| Forgot to implement helpers | Compilation fails (undefined functions) |
| Missing test cases | Coverage check fails |
Testing Locally
# Unit tests only (skips integration tests)
go test ./pkg/validator/checks/deployment -short

# Run specific integration test (will skip if not in Kubernetes)
go test ./pkg/validator/checks/deployment -run TestMyAppVersion -v

# All tests including registration validation
make test
Using in Recipe
validation:
  deployment:
    constraints:
      - name: Deployment.my-app.version  # Must match the registered validator Name
        value: ">= v1.2.0"                # Constraint expression
Common Patterns
Multi-Strategy Version Detection
func getComponentVersion(ctx context.Context, clientset kubernetes.Interface) (string, error) {
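    // findDeployment is assumed to return (*appsv1.Deployment, error)
    deployment, err := findDeployment(ctx, clientset)
    if err != nil {
        return "", err
    }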

    // Strategy 1: Label
    if version := deployment.Labels["app.kubernetes.io/version"]; version != "" {
        return normalizeVersion(version), nil
    }

    // Strategy 2: Annotation
    if version := deployment.Annotations["version"]; version != "" {
        return normalizeVersion(version), nil
    }

    // Strategy 3: Image tag
    if len(deployment.Spec.Template.Spec.Containers) > 0 {
        image := deployment.Spec.Template.Spec.Containers[0].Image
        return extractVersionFromImage(image), nil
    }

    return "", fmt.Errorf("version not found")
}
Version Constraint Evaluation
func evaluateVersionConstraint(actualValue, constraintValue string) (bool, error) {
    // Parse operator and expected version
    // Supports: ==, !=, >=, <=, >, <, ~= (compatible)
    op, expected := parseConstraint(constraintValue)

    // Compare using semver
    actual, err := semver.Parse(actualValue)
    if err != nil {
        return false, fmt.Errorf("invalid actual version: %w", err)
    }

    expectedVer, err := semver.Parse(expected)
    if err != nil {
        return false, fmt.Errorf("invalid expected version: %w", err)
    }

    switch op {
    case ">=":
        return actual.GTE(expectedVer), nil
    case "==":
        return actual.Equal(expectedVer), nil
    // ... other operators
    default:
        return false, fmt.Errorf("unsupported operator %q", op)
    }
}
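
The `parseConstraint` helper used above is not shown. A minimal sketch, assuming the operator is a prefix of the expression and that an expression without an operator means equality (that default is an assumption, not confirmed by this README):

```go
package main

import (
	"fmt"
	"strings"
)

// operators lists two-character operators first so ">=" is not read as ">".
var operators = []string{"==", "!=", ">=", "<=", "~=", ">", "<"}

// parseConstraint splits an expression such as ">= v1.2.0" into its operator
// and operand. An expression with no recognized operator is treated as an
// equality check (assumed behavior for this sketch).
func parseConstraint(expr string) (op, expected string) {
	expr = strings.TrimSpace(expr)
	for _, candidate := range operators {
		if strings.HasPrefix(expr, candidate) {
			return candidate, strings.TrimSpace(strings.TrimPrefix(expr, candidate))
		}
	}
	return "==", expr
}

func main() {
	for _, expr := range []string{">= v1.2.0", "<1.0", "ubuntu"} {
		op, expected := parseConstraint(expr)
		fmt.Printf("%q -> op=%q expected=%q\n", expr, op, expected)
	}
}
```

The ordering of the operator list is the important detail: checking ">" before ">=" would leave a stray "=" at the start of the operand.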
Benefits

1. Impossible to Forget Registration - Tests fail locally and in CI if registration is missing

2. Easy to Add New Validators - One command scaffolds everything correctly

3. Consistent Architecture - Generated code follows established patterns

4. Fast Feedback - Catches issues locally before PR

5. Self-Documenting - Generated code has examples and TODOs

6. CI Enforced - Can't merge without complete implementation

Troubleshooting

Test Wrapper Issues
Test Wrapper Not Found by go test

Symptom:

Job logs: testing: warning: no tests to run

Causes:

  1. Test wrapper function doesn't follow naming convention
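  2. Test function name has a lowercase letter right after Test (go test only runs TestXxx functions where Xxx does not start with a lowercase letter)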
  3. Package doesn't compile

Solutions:

Check test function naming:

// Correct
func TestOperatorHealth(t *testing.T) { ... }

// Wrong - "health" is not capitalized, so the name does not match TestOperatorHealth
func TestOperatorhealth(t *testing.T) { ... }

// Wrong - underscore separators do not match the derived name
func Test_operator_health(t *testing.T) { ... }

Naming rule: Convert kebab-case check name to PascalCase:

  • operator-health → TestOperatorHealth
  • nccl-bandwidth → TestNCCLBandwidth
  • expected-resources → TestExpectedResources
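
The naming rule can be sketched in a few lines. This is an illustration only: `testNameFor` and the `acronyms` table are hypothetical, and the way the registry actually derives TestName (including how it decides that "nccl" becomes "NCCL") is not shown in this README.

```go
package main

import (
	"fmt"
	"strings"
)

// acronyms maps lowercase words to their all-caps form (hypothetical table).
var acronyms = map[string]string{"nccl": "NCCL"}

// testNameFor converts a kebab-case check name into a Go test function name.
// It assumes ASCII check names.
func testNameFor(checkName string) string {
	var b strings.Builder
	b.WriteString("Test")
	for _, word := range strings.Split(checkName, "-") {
		if word == "" {
			continue // tolerate doubled or trailing hyphens
		}
		if upper, ok := acronyms[word]; ok {
			b.WriteString(upper)
			continue
		}
		b.WriteString(strings.ToUpper(word[:1]) + word[1:])
	}
	return b.String()
}

func main() {
	for _, name := range []string{"operator-health", "nccl-bandwidth", "expected-resources"} {
		fmt.Println(name, "->", testNameFor(name))
	}
}
```

When a derived name is ambiguous, setting TestName explicitly in the registration avoids depending on the derivation.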

Verify test file compiles:

go test -c ./pkg/validator/checks/deployment/
Test Wrapper Fails: "check not found in registry"

Symptom:

Job logs: Check "operator-health" not found in registry

Causes:

  1. Check not registered in init() function
  2. Package not imported (init() never runs)
  3. Check name mismatch between registration and runner call

Solutions:

Verify check registration:

// Must be in same package as check function
func init() {
    checks.RegisterCheck(&checks.Check{
        Name:  "operator-health",  // ← Must match exactly
        Phase: "deployment",
        Func:  CheckOperatorHealth,
    })
}

Verify test wrapper uses same name:

func TestOperatorHealth(t *testing.T) {
    runner, err := checks.NewTestRunner(t)
    if err != nil {
        t.Skipf("Skipping integration test: %v", err)
        return
    }
    runner.RunCheck("operator-health")  // ← Must match registration
}
Test Wrapper Fails: "failed to load validation context"

Symptom (during local testing):

SKIP: Skipping integration test (not in Kubernetes): failed to create in-cluster config

Expected behavior: Test should skip gracefully when not in Kubernetes.

Verify skip logic:

func TestMyCheck(t *testing.T) {
    runner, err := checks.NewTestRunner(t)
    if err != nil {
        // Should skip, not fail
        t.Skipf("Skipping integration test (not in Kubernetes): %v", err)
        return
    }
    runner.RunCheck("my-check")
}

Symptom (inside Job):

Job logs: Failed to create test runner: failed to load validation context:
          failed to read snapshot file: open /data/snapshot/snapshot.yaml: no such file

Causes:

  1. Snapshot ConfigMap not mounted correctly
  2. Volume mount path mismatch
  3. ConfigMap doesn't exist

Solutions:

Check Job pod volumes:

kubectl get pod <pod-name> -n aicr-validation -o yaml | grep -A 10 volumes

Expected volumes:

volumes:
- name: snapshot
  configMap:
    name: <snapshot-configmap>
- name: recipe
  configMap:
    name: <recipe-configmap>
# volumeMounts appear under spec.containers[], not next to volumes
volumeMounts:
- name: snapshot
  mountPath: /data/snapshot
  readOnly: true

Verify ConfigMap exists:

kubectl get cm -n <namespace> <snapshot-configmap>
kubectl describe cm -n <namespace> <snapshot-configmap>

Check ConfigMap contains snapshot data:

kubectl get cm -n <namespace> <snapshot-configmap> -o jsonpath='{.data.snapshot\.yaml}' | head -20
Job Execution Issues
Job Not Found

Symptom:

Error: failed to wait for Job completion: Job "aicr-validation-deployment" not found

Causes:

  1. Namespace doesn't exist
  2. Job was cleaned up too quickly
  3. Job creation failed silently

Solutions:

Check if namespace exists:

kubectl get namespace aicr-validation

Create namespace if missing:

kubectl create namespace aicr-validation

Check Job status:

kubectl get jobs -n aicr-validation
kubectl describe job aicr-validation-deployment -n aicr-validation
Job Failed to Start

Symptom:

Error: Job failed with status: ImagePullBackOff

Causes:

  1. Image not accessible
  2. Image tag doesn't exist
  3. Registry authentication issues

Solutions:

Check Job events:

kubectl describe job aicr-validation-deployment -n aicr-validation

Check Pod status and events:

kubectl get pods -n aicr-validation
kubectl describe pod <pod-name> -n aicr-validation

Verify image exists:

docker pull ghcr.io/nvidia/aicr-validator:latest
# or
kubectl run test --image=ghcr.io/nvidia/aicr-validator:latest --rm -it --restart=Never -- /bin/sh
Job Pods Crash

Symptom:

Error: Job pod exited with code 1

Solutions:

View pod logs:

# Get pod name
kubectl get pods -n aicr-validation -l job-name=aicr-validation-deployment

# View logs
kubectl logs <pod-name> -n aicr-validation

# View logs of crashed pod
kubectl logs <pod-name> -n aicr-validation --previous

Common causes in logs:

  • panic: runtime error - Code bug in check
  • context deadline exceeded - Timeout
  • permission denied - RBAC issue
  • connection refused - Network/API issue
RBAC Permission Errors
Forbidden: User Cannot Access Resource

Symptom:

Error: failed to list GPU operator pods: pods is forbidden:
User "system:serviceaccount:aicr-validation:aicr-validator" cannot list resource "pods"
in API group "" in the namespace "gpu-operator"

Cause: ServiceAccount lacks necessary RBAC permissions

Solutions:

Check current permissions:

kubectl auth can-i list pods --namespace=gpu-operator \
  --as=system:serviceaccount:aicr-validation:aicr-validator

View current Role/RoleBinding:

kubectl get role aicr-validator -n aicr-validation -o yaml
kubectl get rolebinding aicr-validator -n aicr-validation -o yaml

Fix: Create proper RBAC resources:

# role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aicr-validator
rules:
  # Deployment phase
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments", "daemonsets", "statefulsets"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list"]

  # Performance phase
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get"]

  # Conformance phase
  - apiGroups: ["apiextensions.k8s.io"]
    resources: ["customresourcedefinitions"]
    verbs: ["get", "list"]

Apply RBAC:

kubectl apply -f role.yaml
kubectl create clusterrolebinding aicr-validator \
  --clusterrole=aicr-validator \
  --serviceaccount=aicr-validation:aicr-validator
RBAC for Cross-Namespace Access

Issue: Check needs to access resources in gpu-operator namespace but only has permissions in aicr-validation

Solution: Use ClusterRole instead of Role:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aicr-validator
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list"]
  # Add other rules...

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aicr-validator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: aicr-validator
subjects:
  - kind: ServiceAccount
    name: aicr-validator
    namespace: aicr-validation
Timeout Problems
Context Deadline Exceeded

Symptom:

Error: context deadline exceeded
Check: operator-health
Duration: 2m0s

Causes:

  1. Check takes too long to execute
  2. Kubernetes API is slow
  3. External resource is unresponsive

Solutions:

Increase timeout in validator:

v := validator.New(  // named v to avoid shadowing the validator package
    validator.WithTimeout(10 * time.Minute),  // Increase from default 2m
)

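Set a timeout for a specific check:

func CheckOperatorHealth(ctx *checks.ValidationContext) error {
    // Derive a per-check context. A derived context cannot outlive its parent:
    // if ctx.Context expires in less than 5 minutes, the parent's deadline
    // still applies, so raise the validator timeout as well.
    checkCtx, cancel := context.WithTimeout(ctx.Context, 5*time.Minute)
    defer cancel()

    pods, err := ctx.Clientset.CoreV1().Pods("gpu-operator").List(
        checkCtx,  // Per-check deadline, capped by the parent's deadline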
        metav1.ListOptions{LabelSelector: "app=gpu-operator"},
    )
    // ...
}
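
Whether a per-check timeout takes effect depends on the parent context: `context.WithTimeout` can shorten a deadline but never extend one. The sketch below demonstrates this with the standard library only; `childLimitedByParent` is an illustrative helper, not part of this package.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// childLimitedByParent reports whether a child context requested with
// childTimeout ends up with exactly the parent's deadline.
func childLimitedByParent(parentTimeout, childTimeout time.Duration) bool {
	parent, cancelParent := context.WithTimeout(context.Background(), parentTimeout)
	defer cancelParent()

	child, cancelChild := context.WithTimeout(parent, childTimeout)
	defer cancelChild()

	parentDeadline, _ := parent.Deadline()
	childDeadline, _ := child.Deadline()
	return childDeadline.Equal(parentDeadline)
}

func main() {
	// Parent allows 2 minutes; asking for 5 minutes does not extend it.
	fmt.Println(childLimitedByParent(2*time.Minute, 5*time.Minute))
	// Parent allows 5 minutes; asking for 2 minutes shortens the deadline.
	fmt.Println(childLimitedByParent(5*time.Minute, 2*time.Minute))
}
```

In practice this means a check that needs more time than the validator's default has to be given that time at the validator level, not only inside the check.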

Add context cancellation checks for long operations:

func LongRunningCheck(ctx *checks.ValidationContext) error {
    for i := 0; i < 1000; i++ {
        // Check if context is cancelled
        select {
        case <-ctx.Context.Done():
            return ctx.Context.Err()  // Return context error
        default:
            // Continue processing
        }

        // Do work...
    }
    return nil
}
Job Timeout

Symptom:

Error: Job did not complete within timeout period
Job: aicr-validation-performance
Timeout: 5m0s

Solutions:

Increase Job timeout:

config := agent.Config{
    Timeout: 15 * time.Minute,  // Increase for performance tests
}

Check if Job is actually running:

kubectl get pods -n aicr-validation -l job-name=aicr-validation-performance
kubectl logs <pod-name> -n aicr-validation --follow

Check Job status:

kubectl describe job aicr-validation-performance -n aicr-validation
Check Registration Issues
Check Not Found

Symptom:

Error: check "operator-health" not registered

Causes:

  1. Package not imported
  2. init() not running
  3. Check name mismatch

Solutions:

Verify check is registered:

func TestCheckRegistered(t *testing.T) {
    check, ok := checks.GetCheck("operator-health")
    if !ok {
        t.Fatal("Check not registered")
    }
    assert.Equal(t, "operator-health", check.Name)
}

Ensure package is imported:

// Import with blank identifier to trigger init()
import _ "github.com/NVIDIA/aicr/pkg/validator/checks/deployment"

List all registered checks:

func TestListChecks(t *testing.T) {
    allChecks := checks.ListChecks("")
    t.Logf("Registered checks: %d", len(allChecks))
    for _, check := range allChecks {
        t.Logf("  - %s (%s)", check.Name, check.Phase)
    }
}
Constraint Validator Not Found

Symptom:

Error: no validator found for constraint "Deployment.gpu-operator.version"

Solutions:

Check if validator is registered:

# Run test to list validators
go test -v ./pkg/validator/checks/... -run TestList

Verify import:

import _ "github.com/NVIDIA/aicr/pkg/validator/checks/deployment"

Check pattern match:

func TestValidatorRegistration(t *testing.T) {
    validator, ok := checks.GetConstraintValidator("Deployment.gpu-operator.version")
    if !ok {
        t.Fatal("Validator not registered")
    }
    assert.Equal(t, "Deployment.gpu-operator.version", validator.Name)
}
Duplicate Registration Panic

Symptom:

panic: constraint validator for pattern "Deployment.gpu-operator.version" is already registered

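Cause: The same pattern is registered by two RegisterConstraintValidator calls, for example an init() copied into a second file or package. Importing a package from several places does not cause this, because Go runs each package's init() functions once per binary.

Solution: Search for the duplicated pattern and keep a single registration. Blank imports of check packages typically live in one place, such as main: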

// cmd/aicr/main.go
import (
    _ "github.com/NVIDIA/aicr/pkg/validator/checks/deployment"  // Once here
    _ "github.com/NVIDIA/aicr/pkg/validator/checks/performance"
    _ "github.com/NVIDIA/aicr/pkg/validator/checks/conformance"
)
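
To illustrate where such a panic comes from, here is a minimal sketch of a registry that rejects duplicate names. It is not the code in registry.go; `registry`, `register`, and `tryRegister` are hypothetical names, and the panic message simply mirrors the symptom above.

```go
package main

import (
	"fmt"
	"sync"
)

// registry maps constraint names to test names and rejects duplicates.
type registry struct {
	mu    sync.Mutex
	names map[string]string
}

func newRegistry() *registry {
	return &registry{names: make(map[string]string)}
}

// register panics when the same constraint name is registered twice.
func (r *registry) register(name, testName string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, exists := r.names[name]; exists {
		panic(fmt.Sprintf("constraint validator for pattern %q is already registered", name))
	}
	r.names[name] = testName
}

// tryRegister converts the panic into an error, which is convenient in tests.
func tryRegister(r *registry, name, testName string) (err error) {
	defer func() {
		if rec := recover(); rec != nil {
			err = fmt.Errorf("%v", rec)
		}
	}()
	r.register(name, testName)
	return nil
}

func main() {
	r := newRegistry()
	fmt.Println(tryRegister(r, "Deployment.my-app.version", "TestMyAppVersion"))
	fmt.Println(tryRegister(r, "Deployment.my-app.version", "TestMyAppVersionCopy"))
}
```

Panicking at registration time surfaces the conflict at process start, before any validation runs.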
Constraint Evaluation Errors
Invalid Constraint Expression

Symptom:

Error: invalid constraint expression: cannot parse expected version
Constraint: Deployment.gpu-operator.version
Value: ">= invalid-version"

Solution: Fix constraint value in recipe:

# Wrong
constraints:
  - name: Deployment.gpu-operator.version
    value: ">= invalid-version"

# Correct
constraints:
  - name: Deployment.gpu-operator.version
    value: ">= v24.6.0"
Version Parse Error

Symptom:

Error: cannot parse actual version
Actual: "latest"
Expected: ">= v24.6.0"

Cause: Actual value is not a valid version string

Solution: Fix validator to return valid version:

func getVersion(deployment *appsv1.Deployment) string {
    version := deployment.Labels["app.kubernetes.io/version"]
    containers := deployment.Spec.Template.Spec.Containers
    if version == "latest" && len(containers) > 0 {
        // Don't return "latest" - fall back to the image tag
        version = extractVersionFromImage(containers[0].Image)
    }
    return normalizeVersion(version)
}
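
The `extractVersionFromImage` helper is referenced but not shown. A minimal sketch of one way to pull a usable tag out of an image reference, under the assumption that "latest", a missing tag, and digest-only references should all yield an empty string (`versionFromImage` is a hypothetical name):

```go
package main

import (
	"fmt"
	"strings"
)

// versionFromImage returns the tag of an image reference, or "" when the tag
// is absent, is "latest", or the image is pinned only by digest.
func versionFromImage(image string) string {
	// Drop a digest suffix such as "@sha256:...".
	if i := strings.Index(image, "@"); i >= 0 {
		image = image[:i]
	}
	// The tag separator is a colon after the last slash; a colon before the
	// last slash belongs to a registry port such as "localhost:5000".
	lastSlash := strings.LastIndex(image, "/")
	lastColon := strings.LastIndex(image, ":")
	if lastColon <= lastSlash {
		return ""
	}
	tag := image[lastColon+1:]
	if tag == "latest" {
		return ""
	}
	return tag
}

func main() {
	for _, image := range []string{
		"nvcr.io/nvidia/gpu-operator:v24.6.0",
		"localhost:5000/gpu-operator",
		"gpu-operator:latest",
	} {
		fmt.Printf("%s -> %q\n", image, versionFromImage(image))
	}
}
```

Returning an empty string instead of "latest" lets the caller try another strategy or report a clear "version not found" error.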
Constraint Always Fails

Symptom:

Constraint: OS.distribution
Expected: "ubuntu"
Actual: "Ubuntu"
Status: FAIL

Cause: Case sensitivity in string comparison

Solution: Normalize strings in validator:

func ValidateOSDistribution(ctx *checks.ValidationContext, constraint recipe.Constraint) (string, bool, error) {
    actual := strings.ToLower(getOSDistribution(ctx))  // Normalize to lowercase
    expected := strings.ToLower(constraint.Value)

    passed := actual == expected
    return actual, passed, nil
}
Kubernetes Client Errors
Cannot Connect to Cluster

Symptom:

Error: failed to create Kubernetes client: unable to load kubeconfig

Solutions:

Check kubeconfig:

kubectl cluster-info
echo $KUBECONFIG
ls -la ~/.kube/config

Test connectivity:

kubectl get nodes

Verify in code:

clientset, err := k8sclient.GetKubeClient()
if err != nil {
    log.Fatalf("Failed to create k8s client: %v", err)
}

// Test connection
nodes, err := clientset.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
if err != nil {
    log.Fatalf("Cannot connect to cluster: %v", err)
}
log.Printf("Connected to cluster with %d nodes", len(nodes.Items))
Resource Not Found

Symptom:

Error: deployments.apps "gpu-operator" not found

Causes:

  1. Resource doesn't exist
  2. Wrong namespace
  3. Wrong name

Solutions:

Check if resource exists:

kubectl get deployments -A | grep gpu-operator

Use multi-namespace search in validator:

func findGPUOperator(ctx context.Context, clientset kubernetes.Interface) (*appsv1.Deployment, error) {
    namespaces := []string{"gpu-operator", "nvidia-gpu-operator", "kube-system"}
    names := []string{"gpu-operator", "nvidia-gpu-operator"}

    for _, ns := range namespaces {
        for _, name := range names {
            deployment, err := clientset.AppsV1().Deployments(ns).Get(
                ctx, name, metav1.GetOptions{},
            )
            if err == nil {
                return deployment, nil
            }
        }
    }

    return nil, fmt.Errorf("GPU operator not found in any common namespace")
}
Test Mode vs Production
Tests Pass but Production Fails

Symptom:

  • Unit tests pass with fake clientset
  • Production validation fails with real cluster

Causes:

  1. Fake clientset doesn't match real cluster state
  2. RBAC works in test but not production
  3. Timing issues (context timeout)

Solutions:

Test with real cluster:

# Integration test against real cluster
export USE_REAL_CLUSTER=true
go test -v ./pkg/validator/checks/deployment/... -run Integration

Add integration tests:

func TestOperatorHealthIntegration(t *testing.T) {
    if os.Getenv("USE_REAL_CLUSTER") != "true" {
        t.Skip("Skipping integration test")
    }

    clientset, err := k8sclient.GetKubeClient()
    require.NoError(t, err)

    ctx := &checks.ValidationContext{
        Context:   context.Background(),
        Clientset: clientset,
    }

    err = CheckOperatorHealth(ctx)
    assert.NoError(t, err)
}
Validation Passes in Test Mode

Symptom:

WARN Job deployment failed (likely test mode), returning skeleton check
Check: operator-health
Status: pass
Reason: skipped - Job deployment failed (test mode)

Cause: Test environment doesn't have namespace, so checks are skipped

Solutions:

Create test namespace:

kubectl create namespace aicr-validation

Or run tests with fake clientset:

func TestWithFakeCluster(t *testing.T) {
    deployment := createTestDeployment()
    clientset := fake.NewSimpleClientset(deployment)

    ctx := &checks.ValidationContext{
        Context:   context.Background(),
        Clientset: clientset,
    }

    // Test directly against check function, not Job execution
    err := CheckOperatorHealth(ctx)
    assert.NoError(t, err)
}
Debugging Techniques
Enable Debug Logging
import "log/slog"

func init() {
    slog.SetDefault(slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
        Level: slog.LevelDebug,
    })))
}
View Job Logs in Real-Time
# Watch for new Jobs
kubectl get jobs -n aicr-validation -w

# Stream logs from running Job
POD=$(kubectl get pods -n aicr-validation -l job-name=aicr-validation-deployment -o name | head -1)
kubectl logs -n aicr-validation $POD --follow
Check Job Results ConfigMap
# List result ConfigMaps
kubectl get configmaps -n aicr-validation

# View specific result
kubectl get configmap aicr-validation-deployment-result -n aicr-validation -o yaml
Debug Check Function Directly
func TestDebugCheck(t *testing.T) {
    // Set up test data
    deployment := createGPUOperatorDeployment("gpu-operator", "gpu-operator",
        map[string]string{"app.kubernetes.io/version": "v24.6.0"},
        "nvcr.io/nvidia/gpu-operator:v24.6.0")

    clientset := fake.NewSimpleClientset(deployment)

    ctx := &checks.ValidationContext{
        Context:   context.Background(),
        Clientset: clientset,
    }

    // Call check directly (no Job)
    err := CheckOperatorHealth(ctx)
    if err != nil {
        t.Logf("Check failed: %v", err)
        t.Fail()
    }
}
Trace Constraint Evaluation
func ValidateGPUOperatorVersion(ctx *checks.ValidationContext, constraint recipe.Constraint) (string, bool, error) {
    slog.Debug("Starting version validation",
        "constraint", constraint.Name,
        "expectedValue", constraint.Value)

    version, err := getGPUOperatorVersion(ctx.Context, ctx.Clientset)
    slog.Debug("Detected version", "version", version, "error", err)

    if err != nil {
        return "", false, err
    }

    passed, err := evaluateVersionConstraint(version, constraint.Value)
    slog.Debug("Constraint evaluation result",
        "version", version,
        "constraint", constraint.Value,
        "passed", passed,
        "error", err)

    return version, passed, err
}
Use kubectl debug
# Debug a running Job pod
kubectl debug -n aicr-validation <pod-name> -it --image=busybox

# Check environment and mounts
env | grep AICR
ls -la /data/snapshot
cat /data/snapshot/snapshot.yaml
Collect Diagnostic Information

When reporting issues, include:

# Cluster info
kubectl version
kubectl get nodes

# Validation namespace
kubectl get all -n aicr-validation

# Job details
kubectl describe job <job-name> -n aicr-validation

# Pod logs
kubectl logs <pod-name> -n aicr-validation

# RBAC
kubectl auth can-i --list --as=system:serviceaccount:aicr-validation:aicr-validator
Common kubectl Commands
# List all validation Jobs
kubectl get jobs -n aicr-validation

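# Delete Jobs with no successful pods (Jobs carry no status label by default;
# this selector also matches Jobs that are still running)
kubectl delete job -n aicr-validation --field-selector status.successful=0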

# Clean up validation namespace
kubectl delete namespace aicr-validation

# Re-create validation namespace
kubectl create namespace aicr-validation

# View events
kubectl get events -n aicr-validation --sort-by='.lastTimestamp'

Migration from Inline Validation

Before (inline constraint evaluation):

// phases.go - deployment phase
for _, constraint := range recipe.Validation.Deployment.Constraints {
    result := evaluateConstraint(constraint, snapshot) // Wrong - no cluster access
}

After (Job-based constraint validation):

// deployment/constraints.go
func ValidateDeploymentConstraint(ctx *ValidationContext, constraint recipe.Constraint) {
    // Correct - has cluster access via ctx.Clientset
    deployment := ctx.Clientset.AppsV1().Deployments(...).Get(...)
}

References

Summary
| Task | File Location | Key Function | Registry Call |
| --- | --- | --- | --- |
| Add Check | pkg/validator/checks/<phase>/*.go | func(ctx *ValidationContext) error | RegisterCheck() |
| Add Constraint | pkg/validator/checks/<phase>/constraints.go | func(ctx *ValidationContext, constraint recipe.Constraint) (string, bool, error) | RegisterConstraintValidator() |

Both use init() for self-registration and are discovered automatically at runtime.

Key Files
  • registry.go: Check and constraint validator registration infrastructure
  • runner.go: Test runner for Job execution
  • deployment/operator_health_check.go: Example check implementation
  • deployment/constraints.go: Example constraint validator implementation
  • Constraint Parser: pkg/validator/constraint_expression.go

Documentation

Functions

func GetTestNameForCheck

func GetTestNameForCheck(checkName string) (string, bool)

GetTestNameForCheck looks up which test function validates a check. Returns the test name and true if found, empty string and false otherwise.

func GetTestNameForConstraint

func GetTestNameForConstraint(constraintName string) (string, bool)

GetTestNameForConstraint looks up which test function validates a constraint. Returns the test name and true if found, empty string and false otherwise.

func RegisterCheck

func RegisterCheck(check *Check)

RegisterCheck adds a check to the registry. This should be called from init() functions in check packages. If TestName is empty, it's derived from the Name automatically.

func RegisterConstraintValidator

func RegisterConstraintValidator(validator *ConstraintValidator)

RegisterConstraintValidator adds a constraint validator to the registry. This should be called from init() functions in constraint validator packages. If TestName is empty, it's derived from the Name automatically.

Types

type Artifact added in v0.7.8

type Artifact struct {
	// Label is the human-readable title (e.g., "DRA Driver Pods").
	Label string `json:"label"`

	// Data is the captured content (command output, metric text, YAML, etc.).
	Data string `json:"data"`
}

Artifact represents a captured piece of diagnostic evidence from a conformance check. Each artifact has a human-readable label and a data payload (kubectl output, metric samples, resource YAML, etc.) that is rendered as a fenced code block in evidence markdown.

func DecodeArtifact added in v0.7.8

func DecodeArtifact(encoded string) (*Artifact, error)

DecodeArtifact decodes a base64-encoded JSON artifact string.

func (Artifact) Encode added in v0.7.8

func (a Artifact) Encode() (string, error)

Encode returns a base64-encoded JSON representation of the artifact, suitable for emission via t.Logf("ARTIFACT:%s", encoded).
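
A minimal sketch of the encode/decode round trip described above, using a local copy of the struct. The choice of `base64.StdEncoding` is an assumption; the package may use a different base64 variant.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// artifact is a local stand-in for the package's Artifact type.
type artifact struct {
	Label string `json:"label"`
	Data  string `json:"data"`
}

// encode marshals the artifact to JSON and base64-encodes the result.
func (a artifact) encode() (string, error) {
	raw, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(raw), nil
}

// decodeArtifact reverses encode.
func decodeArtifact(encoded string) (*artifact, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return nil, fmt.Errorf("decode base64: %w", err)
	}
	var a artifact
	if err := json.Unmarshal(raw, &a); err != nil {
		return nil, fmt.Errorf("decode JSON: %w", err)
	}
	return &a, nil
}

func main() {
	encoded, err := artifact{Label: "DRA Driver Pods", Data: "pod-a Running\n"}.encode()
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	decoded, err := decodeArtifact(encoded)
	fmt.Println(decoded.Label, err)
}
```

Base64 keeps multi-line data on a single log line, which is what makes emission through a test log line practical.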

type ArtifactCollector added in v0.7.8

type ArtifactCollector struct {
	// contains filtered or unexported fields
}

ArtifactCollector is a thread-safe accumulator for artifacts within a single check execution. It enforces per-artifact size limits and per-check count limits.

func NewArtifactCollector added in v0.7.8

func NewArtifactCollector() *ArtifactCollector

NewArtifactCollector creates a new empty artifact collector.

func (*ArtifactCollector) Drain added in v0.7.8

func (c *ArtifactCollector) Drain() []Artifact

Drain returns the collected artifacts and resets the internal list. Returns nil if no artifacts were recorded.

func (*ArtifactCollector) Record added in v0.7.8

func (c *ArtifactCollector) Record(label, data string) error

Record adds a labeled artifact. Data exceeding defaults.ArtifactMaxDataSize is truncated. Returns an error if the per-check artifact count limit is reached.
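
The collector's documented behavior (mutex-guarded list, truncation of oversized data, a per-check count limit, and a draining read) can be sketched as follows. The limits here are small hypothetical values chosen for the example; the real ones live in the defaults package and are not shown in this page.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

const (
	maxDataSize  = 16 // hypothetical; stands in for defaults.ArtifactMaxDataSize
	maxArtifacts = 2  // hypothetical per-check count limit
)

type artifact struct{ Label, Data string }

// collector accumulates artifacts and is safe for concurrent use.
type collector struct {
	mu        sync.Mutex
	artifacts []artifact
}

// record stores an artifact, truncating data to maxDataSize bytes, and
// returns an error once maxArtifacts have been recorded.
func (c *collector) record(label, data string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.artifacts) >= maxArtifacts {
		return errors.New("artifact limit reached")
	}
	if len(data) > maxDataSize {
		data = data[:maxDataSize] // byte-based truncation
	}
	c.artifacts = append(c.artifacts, artifact{Label: label, Data: data})
	return nil
}

// drain returns the recorded artifacts and resets the list.
// It returns nil when nothing was recorded.
func (c *collector) drain() []artifact {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.artifacts
	c.artifacts = nil
	return out
}

func main() {
	c := &collector{}
	_ = c.record("DRA Driver Pods", "NAME READY STATUS RESTARTS AGE")
	_ = c.record("Metrics", "gpu_util 0.93")
	fmt.Println(c.record("Extra", "dropped"))
	fmt.Println(len(c.drain()))
}
```

Truncating by bytes can split a multi-byte UTF-8 character; a production implementation may prefer to cut on a rune boundary.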

type Check

type Check struct {
	// Name is the unique identifier for this check (e.g., "operator-health")
	Name string

	// Description explains what this check validates
	Description string

	// Phase indicates which validation phase this check belongs to
	Phase string // "readiness", "deployment", "performance", "conformance"

	// Func is the check implementation
	Func CheckFunc

	// TestName is the Go test function name (e.g., "TestCheckOperatorHealth")
	// If empty, derived from Name automatically
	TestName string

	// RequirementID is the CNCF conformance requirement ID (e.g., "dra_support").
	// Empty for checks that are not CNCF submission requirements.
	RequirementID string

	// EvidenceTitle is the human-readable title for evidence documents (e.g., "DRA Support").
	EvidenceTitle string

	// EvidenceDescription is a one-paragraph description for evidence documents.
	EvidenceDescription string

	// EvidenceFile is the output filename for evidence (e.g., "dra-support.md").
	// Multiple checks can share the same EvidenceFile (combined evidence).
	// Empty means this check produces no evidence file.
	EvidenceFile string

	// SubmissionRequirement indicates this check maps to a CNCF submission requirement.
	// Only checks with this set to true appear in the submission evidence index.
	SubmissionRequirement bool
}

Check represents a registered validation check.

func GetCheck

func GetCheck(name string) (*Check, bool)

GetCheck retrieves a registered check by name.

func GetCheckByTestName added in v0.7.7

func GetCheckByTestName(testName string) (*Check, bool)

GetCheckByTestName does a reverse lookup: Go test name → Check.

func ListChecks

func ListChecks(phase string) []*Check

ListChecks returns all registered checks, optionally filtered by phase.

func ResolveCheck added in v0.7.7

func ResolveCheck(name string) (*Check, bool)

ResolveCheck tries check name first, then test name. This handles the identity mismatch where CheckResult.Name can be either a check registry name (--no-cluster path) or a Go test name (normal cluster runs).
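
The name-first, test-name-second lookup can be sketched like this. The `registry` type and its two maps are assumptions made for the example; the documentation does not say how the package stores registered checks.

```go
package main

import "fmt"

// check is a trimmed-down stand-in for checks.Check.
type check struct {
	Name     string
	TestName string
}

// registry indexes checks by registry name and by Go test name.
// This layout is hypothetical.
type registry struct {
	byName     map[string]*check
	byTestName map[string]*check
}

func newRegistry(cs ...*check) *registry {
	r := &registry{byName: map[string]*check{}, byTestName: map[string]*check{}}
	for _, c := range cs {
		r.byName[c.Name] = c
		if c.TestName != "" {
			r.byTestName[c.TestName] = c
		}
	}
	return r
}

// resolve tries the registry name first and falls back to the Go test name,
// so callers need not know which kind of identifier a result carries.
func (r *registry) resolve(name string) (*check, bool) {
	if c, ok := r.byName[name]; ok {
		return c, true
	}
	c, ok := r.byTestName[name]
	return c, ok
}

func main() {
	r := newRegistry(&check{Name: "operator-health", TestName: "TestCheckOperatorHealth"})
	for _, n := range []string{"operator-health", "TestCheckOperatorHealth", "unknown"} {
		_, ok := r.resolve(n)
		fmt.Println(n, ok)
	}
}
```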

type CheckFunc

type CheckFunc func(ctx *ValidationContext) error

CheckFunc is the function signature for a validation check. It validates a specific aspect of the cluster using the supplied ValidationContext and returns nil on success or an error describing the failure.

type ConstraintValidator

type ConstraintValidator struct {
	// Name is the unique identifier for this constraint (e.g., "Deployment.gpu-operator.version")
	Name string

	// Description explains what constraints this validator handles
	Description string

	// Func is the validator implementation
	Func ConstraintValidatorFunc

	// TestName is the Go test function name (e.g., "TestGPUOperatorVersion")
	// If empty, derived from Name automatically
	TestName string

	// Phase indicates which validation phase (deployment, performance, conformance)
	Phase string
}

ConstraintValidator represents a registered constraint validator.

func GetConstraintValidator

func GetConstraintValidator(constraintName string) (*ConstraintValidator, bool)

GetConstraintValidator retrieves a constraint validator by name.

func ListConstraintTests

func ListConstraintTests(phase string) []*ConstraintValidator

ListConstraintTests returns all registered constraint validators, optionally filtered by phase.

func ListConstraintValidators

func ListConstraintValidators() []*ConstraintValidator

ListConstraintValidators returns all registered constraint validators.

type ConstraintValidatorFunc

type ConstraintValidatorFunc func(ctx *ValidationContext, constraint recipe.Constraint) (actual string, passed bool, err error)

ConstraintValidatorFunc is the function signature for constraint validation. It evaluates whether a constraint is satisfied against the cluster state. Returns the actual value found, whether it passed, and any error.
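
To make the (actual, passed, err) return shape concrete, here is a sketch of a minimum-version validator. It simplifies heavily: the `constraint` type and its `Value` field are hypothetical stand-ins for recipe.Constraint, only the `>=` operator is handled, and the observed version is passed in directly instead of being read from a ValidationContext.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// constraint is a hypothetical stand-in for recipe.Constraint.
type constraint struct {
	Name  string
	Value string // for example ">= v24.6.0"
}

// parseVersion turns "v24.6.0" into []int{24, 6, 0}.
func parseVersion(s string) ([]int, error) {
	parts := strings.Split(strings.TrimPrefix(strings.TrimSpace(s), "v"), ".")
	out := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("invalid version %q: %w", s, err)
		}
		out[i] = n
	}
	return out, nil
}

// atLeast reports whether a >= b, treating missing components as zero.
func atLeast(a, b []int) bool {
	for i := 0; i < len(a) || i < len(b); i++ {
		var x, y int
		if i < len(a) {
			x = a[i]
		}
		if i < len(b) {
			y = b[i]
		}
		if x != y {
			return x > y
		}
	}
	return true
}

// validateMinVersion returns the observed value, whether it satisfies the
// constraint, and an error only when evaluation itself was impossible.
func validateMinVersion(observed string, c constraint) (string, bool, error) {
	expr := strings.TrimSpace(c.Value)
	if !strings.HasPrefix(expr, ">=") {
		return observed, false, fmt.Errorf("constraint %s: unsupported expression %q", c.Name, c.Value)
	}
	want, err := parseVersion(strings.TrimPrefix(expr, ">="))
	if err != nil {
		return observed, false, err
	}
	got, err := parseVersion(observed)
	if err != nil {
		return observed, false, err
	}
	return observed, atLeast(got, want), nil
}

func main() {
	c := constraint{Name: "Deployment.gpu-operator.version", Value: ">= v24.6.0"}
	for _, v := range []string{"v24.9.1", "v24.3.0"} {
		actual, passed, err := validateMinVersion(v, c)
		fmt.Println(actual, passed, err)
	}
}
```

Keeping "constraint not satisfied" (passed is false, err is nil) separate from "could not evaluate" (err is non-nil) lets a report distinguish a failing cluster from a broken validator.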

type TestRunner

type TestRunner struct {
	// contains filtered or unexported fields
}

TestRunner provides infrastructure for running validation checks as Go tests inside Kubernetes Jobs.

The test runner bridges the gap between Go's test framework and the AICR validation system:

  • Loads ValidationContext from Job environment (snapshot, K8s client, recipe)
  • Looks up registered checks by name
  • Executes checks and reports results via testing.T

Example usage in test wrappers:

func TestOperatorHealth(t *testing.T) {
    runner, err := checks.NewTestRunner(t)
    if err != nil {
        t.Skipf("Skipping integration test (not in Kubernetes): %v", err)
        return
    }
    defer runner.Cancel() // Clean up context when test completes
    runner.RunCheck("operator-health")
}

func NewTestRunner

func NewTestRunner(t *testing.T) (*TestRunner, error)

NewTestRunner creates a test runner by loading ValidationContext from the Job environment. Expected environment variables:

  • AICR_SNAPSHOT_PATH: Path to mounted snapshot file (default: /data/snapshot/snapshot.yaml)
  • AICR_RECIPE_DATA: Optional JSON-encoded recipe metadata

IMPORTANT: Callers should call Cancel() when done to release resources.

func (*TestRunner) Cancel

func (r *TestRunner) Cancel()

Cancel releases resources associated with the test runner. Should be called via defer after NewTestRunner succeeds. Drains any collected artifacts and emits them as structured test output before canceling the context.

func (*TestRunner) Context

func (r *TestRunner) Context() *ValidationContext

Context returns the validation context for direct access. Use this when you need the Kubernetes client, snapshot, or other context data.

func (*TestRunner) GetConstraint

func (r *TestRunner) GetConstraint(phase, constraintName string) *recipe.Constraint

GetConstraint retrieves a constraint by name from the recipe for the given phase. Returns nil if the recipe doesn't contain the constraint. Integration tests use this to obtain the constraint values to validate against.

func (*TestRunner) HasCheck

func (r *TestRunner) HasCheck(phase, checkName string) bool

HasCheck reports whether a check is enabled in the recipe for the given phase, that is, whether it is listed among the recipe's checks for that phase.

func (*TestRunner) RunCheck

func (r *TestRunner) RunCheck(checkName string)

RunCheck executes a registered validation check by name. The check must be registered via RegisterCheck() (usually in an init() function).

type ValidationContext

type ValidationContext struct {
	// Context for cancellation and timeouts
	Context context.Context

	// Snapshot contains captured cluster state (hardware, OS, etc.)
	Snapshot *snapshotter.Snapshot

	// Namespace is the namespace where the validation is running
	Namespace string

	// Clientset provides Kubernetes API access for live cluster queries
	Clientset kubernetes.Interface

	// RESTConfig provides Kubernetes API access for cluster queries (used for e.g. remote command execution)
	RESTConfig *rest.Config

	// DynamicClient provides dynamic Kubernetes API access for reading custom resources (CRDs).
	// If nil, checks should create one from RESTConfig. Set this in unit tests for injection.
	DynamicClient dynamic.Interface

	// RecipeData contains recipe metadata that may be needed for validation
	RecipeData map[string]interface{}

	// Recipe contains the full recipe with validation constraints
	// Only available when running inside Jobs (not in unit tests)
	Recipe *recipe.RecipeResult

	// Artifacts collects diagnostic evidence during check execution.
	// Nil when artifact capture is not active (e.g., non-conformance phases).
	// Checks should nil-check before recording.
	Artifacts *ArtifactCollector
}

ValidationContext provides runtime context for checks and constraints.
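
The nil-check on Artifacts noted in the struct comments might look like the following in a check body. The types are local stand-ins so the snippet compiles on its own, and `checkNamespaceSet` is a hypothetical check, not one registered by this package.

```go
package main

import (
	"errors"
	"fmt"
)

// artifactCollector is a minimal stand-in for *ArtifactCollector.
type artifactCollector struct{ labels []string }

func (c *artifactCollector) Record(label, data string) error {
	c.labels = append(c.labels, label)
	return nil
}

// validationContext keeps only the fields this sketch needs.
type validationContext struct {
	Namespace string
	Artifacts *artifactCollector // nil when artifact capture is not active
}

// checkNamespaceSet follows the CheckFunc shape: nil means pass, a non-nil
// error means fail.
func checkNamespaceSet(ctx *validationContext) error {
	if ctx.Artifacts != nil {
		// Evidence is treated as best effort here, so a full collector does
		// not fail the check. That is a design choice of this sketch.
		_ = ctx.Artifacts.Record("namespace", ctx.Namespace)
	}
	if ctx.Namespace == "" {
		return errors.New("validation namespace is empty")
	}
	return nil
}

func main() {
	// Capture inactive: the check must still run without panicking.
	fmt.Println(checkNamespaceSet(&validationContext{Namespace: "aicr"}))

	// Capture active: the check records evidence before returning.
	ctx := &validationContext{Artifacts: &artifactCollector{}}
	fmt.Println(checkNamespaceSet(ctx), ctx.Artifacts.labels)
}
```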

func LoadValidationContext

func LoadValidationContext() (*ValidationContext, context.CancelFunc, error)

LoadValidationContext loads the validation context for running checks. Works both inside Kubernetes Jobs (in-cluster config) and locally (KUBECONFIG).

Kubernetes client discovery: KUBECONFIG env → ~/.kube/config → in-cluster service account. Namespace resolution: service account file → AICR_VALIDATION_NAMESPACE env → "default".

Environment variables:

  • KUBECONFIG: Path to kubeconfig file (for local development)
  • AICR_VALIDATION_NAMESPACE: Validation namespace (for local development)
  • AICR_SNAPSHOT_PATH: Path to snapshot file (default: /data/snapshot/snapshot.yaml)
  • AICR_RECIPE_PATH: Path to recipe file (default: /data/recipe/recipe.yaml)
  • AICR_RECIPE_DATA: Optional JSON-encoded recipe metadata

IMPORTANT: The caller is responsible for calling the returned cancel function when the validation context is no longer needed.
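
The namespace resolution order described above (service account file, then AICR_VALIDATION_NAMESPACE, then "default") can be sketched as below. The function name and its injectable parameters are assumptions for illustration. The file path is the standard location where Kubernetes mounts the pod's namespace, but whether the package reads exactly that path is not stated here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Standard location of the namespace file that Kubernetes mounts into pods.
const serviceAccountNamespaceFile = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

// resolveNamespace applies the documented order. The file path and the env
// lookup are parameters so the function can be exercised off-cluster.
func resolveNamespace(nsFile string, getenv func(string) string) string {
	// 1. In-cluster: the mounted service account namespace file.
	if data, err := os.ReadFile(nsFile); err == nil {
		if ns := strings.TrimSpace(string(data)); ns != "" {
			return ns
		}
	}
	// 2. Local development: explicit override via environment variable.
	if ns := getenv("AICR_VALIDATION_NAMESPACE"); ns != "" {
		return ns
	}
	// 3. Fallback.
	return "default"
}

func main() {
	fmt.Println(resolveNamespace(serviceAccountNamespaceFile, os.Getenv))
}
```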
