validator

package
v0.7.6

Warning: This package is not in the latest version of its module.
Published: Feb 21, 2026 License: Apache-2.0 Imports: 28 Imported by: 0

README

Validator Package

The validator package provides a comprehensive validation framework for GPU-accelerated Kubernetes clusters. It validates cluster state against recipe specifications across multiple phases using a Job-based execution model.

Quick Start

import (
    "context"
    "fmt"
    "log"

    "github.com/NVIDIA/aicr/pkg/recipe"
    "github.com/NVIDIA/aicr/pkg/snapshotter"
    "github.com/NVIDIA/aicr/pkg/validator"
)

// Load the recipe and snapshot
rcp := recipe.Load("recipe.yaml")
snap := snapshotter.Load("snapshot.yaml")

// Create validator
v := validator.New(validator.WithKubeconfig("/path/to/kubeconfig"))

// Validate a specific phase
result, err := v.ValidatePhase(context.Background(), "deployment", rcp, snap)
if err != nil {
    log.Fatal(err)
}

fmt.Printf("Status: %s, Passed: %d, Failed: %d\n",
    result.Status, result.Summary.Passed, result.Summary.Failed)

Architecture

Validation Phases
Phase        Execution                           Data Source              Purpose
-----        ---------                           -----------              -------
Readiness    Constraints inline, checks in Jobs  Snapshot only            Validate prerequisites before deployment
Deployment   All in Jobs                         Snapshot + live cluster  Verify deployed resources
Performance  All in Jobs                         Snapshot + live cluster  Measure system performance
Conformance  All in Jobs                         Snapshot + live cluster  Validate API conformance
Execution Model
Recipe Definition
    ↓
┌─────────────────────────────────────────────────────┐
│ Readiness Phase                                     │
│ • Constraints: Evaluated inline (snapshot)          │
│ • Checks: Run in Jobs (GPU detection, kernel, OS)   │
└─────────────────────────────────────────────────────┘
    ↓ (if passed)
┌─────────────────────────────────────────────────────┐
│ Deployment Phase                                    │
│ • Constraints: Run in Jobs (operator versions)      │
│ • Checks: Run in Jobs (operator health, resources)  │
└─────────────────────────────────────────────────────┘
    ↓ (if passed)
┌─────────────────────────────────────────────────────┐
│ Performance Phase                                   │
│ • Constraints: Run in Jobs (bandwidth thresholds)   │
│ • Checks: Run in Jobs (NCCL tests, fabric health)   │
└─────────────────────────────────────────────────────┘
    ↓ (if passed)
┌─────────────────────────────────────────────────────┐
│ Conformance Phase                                   │
│ • Constraints: Run in Jobs (API versions)           │
│ • Checks: Run in Jobs (API conformance, workloads)  │
└─────────────────────────────────────────────────────┘
    ↓
Validation Results
Job-Based Execution

All checks run inside Kubernetes Jobs for:

  • Isolation: Proper RBAC and resource limits
  • Observability: Jobs visible in kubectl get jobs
  • Reproducibility: Consistent execution environment
  • Flexibility: Node affinity for GPU tests
Validator (CLI/API)
    ↓
Agent Deployer
    ├─► RBAC (ServiceAccount, Role, RoleBinding)
    ├─► ConfigMaps (snapshot.yaml, recipe.yaml, validation-result.yaml)
    └─► Job
         ├─► Executes: go test -json (all tests in phase)
         ├─► Test wrapper loads ValidationContext
         ├─► Check functions run with snapshot + K8s client
         └─► Results output to logs (JSON format)
              └─► Validator parses logs and updates ValidationResult ConfigMap
Validator Image

Validation Jobs require a dedicated image containing the Go toolchain to run tests in-cluster.

Why a Separate Image?

  • Main aicr image (built with Ko): Contains only the compiled binary, no Go toolchain
  • Validator image: Contains Go toolchain + source code to run go test commands

Building the Validator Image:

# Local development (with local registry)
make image-validator IMAGE_REGISTRY=localhost:5001 IMAGE_TAG=latest

# Production release (published to GHCR)
# Automatically built by goreleaser on git tags
docker pull ghcr.io/nvidia/aicr-validator:latest
docker pull ghcr.io/nvidia/aicr-validator:v0.4.0

Image Configuration:

// Default image (overridable)
v := validator.New(
    validator.WithImage("ghcr.io/nvidia/aicr-validator:latest"),
)

# Or via CLI
aicr validate --image localhost:5001/aicr-validator:latest \
  -r recipe.yaml -s snapshot.yaml

# Or via environment variable (for CI)
export AICR_VALIDATOR_IMAGE=localhost:5001/aicr-validator:local
aicr validate -r recipe.yaml -s snapshot.yaml

CI/CD:

  • E2E tests build validator image from current source code
  • Release pipeline publishes to ghcr.io/nvidia/aicr-validator
  • Multi-platform support (linux/amd64, linux/arm64)
  • SLSA attestation for supply chain security

Test Wrapper Infrastructure:

Checks execute via Go's standard test framework:

// Check function (registered in init())
func CheckGPUHardwareDetection(ctx *checks.ValidationContext) error {
    // Access snapshot data and K8s API
    for _, m := range ctx.Snapshot.Measurements {
        if m.Type == measurement.TypeGPU { /* validate */ }
    }
    return nil
}

// Test wrapper (enables Job execution)
func TestGPUHardwareDetection(t *testing.T) {
    runner, err := checks.NewTestRunner(t)  // Loads context from Job env
    if err != nil {
        t.Skipf("Skipping (not in Kubernetes): %v", err)
        return
    }
    runner.RunCheck("gpu-hardware-detection")  // Executes check
}

The test wrapper pattern enables:

  • ✅ Standard Go testing infrastructure (go test)
  • ✅ Automatic context loading (snapshot, K8s client)
  • ✅ Graceful skipping during local development
  • ✅ JSON test output for result parsing

See: checks/README.md for complete guide, examples, and troubleshooting.

Validation Run Management (RunID)

Each validation run is assigned a unique RunID for resource isolation and resumability:

RunID Format: YYYYMMDD-HHMMSS-XXXXXXXXXXXXXXXX (e.g., 20260206-140523-a3f9b2c1e7d04a68)

  • Timestamp: Date and time when validation started
  • Random suffix: 16 hex characters for uniqueness
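An identifier of this shape can be produced with the standard library alone. The `newRunID` helper below is a hypothetical illustration of the documented format, not the package's actual generator:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// newRunID builds an identifier in the documented
// YYYYMMDD-HHMMSS-XXXXXXXXXXXXXXXX form: a UTC timestamp plus
// 8 random bytes rendered as 16 hex characters.
func newRunID() string {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		panic(err) // crypto/rand failure is not recoverable here
	}
	return fmt.Sprintf("%s-%s",
		time.Now().UTC().Format("20060102-150405"),
		hex.EncodeToString(buf))
}

func main() {
	id := newRunID()
	fmt.Println(id, len(id)) // random each run; total length is always 32
}
```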

Resource Naming: All resources created during a validation run include the RunID:

  • Input ConfigMaps: aicr-snapshot-{runID}, aicr-recipe-{runID} (shared by all phases)
  • Output ConfigMap: aicr-validation-result-{runID} (progressively updated)
  • Jobs: aicr-{runID}-readiness, aicr-{runID}-deployment, etc. (one per phase)

Benefits:

  • Concurrent Validations: Multiple validation runs can execute simultaneously without conflicts
  • Resumability: Failed validations can be resumed from the last successful phase (future feature)
  • Traceability: All resources for a run are grouped by RunID label
  • Cleanup: Resources can be cleaned up per-run using RunID labels

CLI Output:

$ aicr validate --phase all --recipe recipe.yaml --snapshot snapshot.yaml
Starting validation run: 20260206-140523-a3f9b2c1e7d04a68
...

Querying Validation Runs:

# List all validation runs
kubectl get configmaps -n aicr-validation \
  -l app.kubernetes.io/component=validation

# List resources for specific run
kubectl get jobs,configmaps -n aicr-validation \
  -l aicr.nvidia.com/run-id=20260206-140523-a3f9b2c1e7d04a68

# View run details
kubectl get configmap -n aicr-validation \
  -l aicr.nvidia.com/run-id=20260206-140523-a3f9b2c1e7d04a68 \
  -o yaml

Cleanup by RunID:

# Cleanup specific validation run
kubectl delete jobs,configmaps -n aicr-validation \
  -l aicr.nvidia.com/run-id=20260206-140523-a3f9b2c1e7d04a68

# Cleanup all validation runs (caution!)
kubectl delete jobs,configmaps -n aicr-validation \
  -l app.kubernetes.io/component=validation
ValidationResult ConfigMap (Resumability)

The validator creates a single ValidationResult ConfigMap per validation run that is progressively updated:

ConfigMap: aicr-validation-result-{runID}

Lifecycle:

  1. Creation: Created at validation start with empty structure
  2. Progressive Updates: Updated after each phase completes with results
  3. Resume: Read by --resume flag to continue from failed phase
  4. Cleanup: Automatically deleted after validation completes

Resume Functionality:

# New validation (auto-generates RunID)
aicr validate --phase all --recipe recipe.yaml --snapshot snapshot.yaml
# Output: Starting validation run: 20260206-140523-a3f9b2c1e7d04a68

# Validation fails at deployment phase (readiness passed)
# Resume from failed phase
aicr validate --phase all --resume 20260206-140523-a3f9b2c1e7d04a68
# Reads existing results, skips readiness (passed), continues from deployment

Query Validation State:

# View current validation progress
kubectl get cm aicr-validation-result-20260206-140523-a3f9b2c1e7d04a68 -o yaml

# Check which phases passed/failed
kubectl get cm aicr-validation-result-20260206-140523-a3f9b2c1e7d04a68 \
  -o jsonpath='{.data.result\.yaml}' | yq '.phases'

Implementation:

  • createValidationResultConfigMap() - Creates empty structure
  • updateValidationResultConfigMap() - Updates after each phase
  • readValidationResultConfigMap() - Reads for resume
  • determineStartPhase() - Finds where to resume from
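The resume logic reduces to walking the canonical phase order and stopping at the first phase that has not already passed. A simplified sketch of that idea (the `startPhase` helper is illustrative, not the package's `determineStartPhase`):

```go
package main

import "fmt"

// startPhase walks the canonical phase order and returns the first
// phase whose recorded status is not "pass" — i.e., where a resumed
// validation run should continue. Returns "" if everything passed.
func startPhase(order []string, status map[string]string) string {
	for _, p := range order {
		if status[p] != "pass" {
			return p
		}
	}
	return ""
}

func main() {
	order := []string{"readiness", "deployment", "performance", "conformance"}
	// Previous run: readiness passed, deployment failed.
	prev := map[string]string{"readiness": "pass", "deployment": "fail"}
	fmt.Println(startPhase(order, prev)) // deployment
}
```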
ConfigMap Management

The validator automatically manages ConfigMaps for snapshot and recipe data:

Lifecycle:

  1. Creation: ConfigMaps are created once per validation run before any phases execute
    • aicr-snapshot-{runID}: Contains the cluster snapshot (YAML)
    • aicr-recipe-{runID}: Contains the recipe configuration (YAML)
  2. Reuse: All phases in a validation run share the same ConfigMaps
    • Readiness phase uses snapshot-{runID} and recipe-{runID}
    • Deployment phase uses snapshot-{runID} and recipe-{runID}
    • Performance phase uses snapshot-{runID} and recipe-{runID}
    • Conformance phase uses snapshot-{runID} and recipe-{runID}
  3. Mounting: Jobs mount these ConfigMaps as volumes at:
    • /data/snapshot/snapshot.yaml
    • /data/recipe/recipe.yaml
  4. Cleanup: ConfigMaps are automatically deleted once after all phases complete

Implementation Details:

  • ConfigMaps are created once per validation run, not per phase (efficient)
  • ConfigMaps are uniquely named per validation run using RunID
  • Each ConfigMap includes labels for querying and cleanup:
    • aicr.nvidia.com/run-id: The validation run identifier
    • aicr.nvidia.com/created-at: Timestamp (format: YYYYMMDD-HHMMSS)
    • aicr.nvidia.com/data-type: snapshot or recipe
  • Cleanup happens in defer blocks to ensure removal even on errors
  • Test wrappers load data from mounted ConfigMaps using LoadValidationContext()

Security Considerations:

  • ConfigMaps may contain sensitive cluster information
  • Access is restricted by Kubernetes RBAC
  • ConfigMaps are namespace-scoped (default: aicr-validation)
  • RunID-based naming prevents conflicts between concurrent validations

Recipe Format

Constraints and Checks

Constraints - Expression-based validations:

validation:
  deployment:
    constraints:
      - name: Deployment.gpu-operator.version
        value: ">= v24.6.0"
      - name: Deployment.device-plugin.replicas
        value: ">= 1"

Checks - Named validation tests:

# expected-resources check requires expectedResources on componentRefs
componentRefs:
  - name: gpu-operator
    type: Helm
    expectedResources:
      - kind: Deployment
        name: gpu-operator
        namespace: gpu-operator

validation:
  deployment:
    checks:
      - operator-health
      - expected-resources
Multi-Phase Recipe Example
# expectedResources are declared on componentRefs (used by expected-resources check)
componentRefs:
  - name: gpu-operator
    type: Helm
    expectedResources:
      - kind: Deployment
        name: gpu-operator
        namespace: gpu-operator
      - kind: DaemonSet
        name: nvidia-driver-daemonset
        namespace: gpu-operator

validation:
  # Phase 1: Readiness (pre-deployment validation)
  readiness:
    constraints:
      - name: GPU.count
        value: ">= 8"
      - name: OS.version
        value: "== ubuntu"
      - name: Kernel.version
        value: ">= 5.15.0"
    checks:
      - gpu-hardware-detection
      - kernel-parameters
      - os-prerequisites

  # Phase 2: Deployment (verify deployed resources)
  deployment:
    constraints:
      - name: Deployment.gpu-operator.version
        value: ">= v24.6.0"
    checks:
      - operator-health
      - expected-resources

  # Phase 3: Performance (measure system performance)
  performance:
    constraints:
      - name: Performance.nccl.bandwidth
        value: ">= 200"  # GB/s
    checks:
      - nccl-bandwidth-test
      - fabric-health

  # Phase 4: Conformance (validate compatibility)
  conformance:
    checks:
      - ai-workload-validation

Result Format

type ValidationResult struct {
    Phase     string              // "readiness", "deployment", etc.
    Status    ValidationStatus    // "pass", "fail", "skipped"
    StartTime time.Time
    EndTime   time.Time
    Duration  time.Duration

    // Constraints evaluated
    Constraints []ConstraintValidation

    // Checks executed
    Checks []CheckResult

    // Summary statistics
    Summary ValidationSummary
}

type ConstraintValidation struct {
    Name     string  // e.g., "Deployment.gpu-operator.version"
    Expected string  // e.g., ">= v24.6.0"
    Actual   string  // e.g., "v24.6.0"
    Passed   bool
    Message  string
}

type CheckResult struct {
    Name     string  // e.g., "operator-health"
    Status   ValidationStatus
    Message  string
    Duration time.Duration
}

CLI Usage

# Validate all phases
aicr validate --phase all \
  --recipe recipe.yaml \
  --snapshot snapshot.yaml

# Validate specific phase
aicr validate --phase deployment \
  --recipe recipe.yaml \
  --snapshot snapshot.yaml

# Output formats
aicr validate --phase all -o json
aicr validate --phase all -o yaml
aicr validate --phase all -o table

Documentation

Core Documentation
Check Development
Phase-Specific Guides

Key Concepts

Checks vs Constraints
Aspect         Check                  Constraint
------         -----                  ----------
Definition     Named validation test  Expression-based validation
Returns        Pass/fail (error)      Actual value + pass/fail
Registration   RegisterCheck()        RegisterConstraintValidator()
Recipe syntax  checks: [name]         constraints: [{name, value}]
Example        operator-health        Deployment.gpu-operator.version: ">= v24.6.0"
ValidationContext

Validation functions receive a context with:

type ValidationContext struct {
    Context    context.Context        // Cancellation and timeouts
    Snapshot   *snapshotter.Snapshot  // Captured cluster state
    Clientset  kubernetes.Interface   // Live Kubernetes API access
    RecipeData map[string]interface{} // Recipe metadata
}

  • Snapshot: Hardware inventory, OS info, pre-capture state
  • Clientset: Query live cluster (deployments, pods, etc.)
  • RecipeData: Access recipe configuration
Phase Dependencies

Phases execute sequentially with early exit:

  1. Readiness must pass before Deployment
  2. Deployment must pass before Performance
  3. Performance must pass before Conformance

If any phase fails, subsequent phases are skipped.
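The early-exit behavior above can be sketched as a simple loop (illustrative only; `runPhases` is not part of the package API):

```go
package main

import "fmt"

// runPhases executes phases in order; once one fails, all remaining
// phases are marked "skipped" rather than executed.
func runPhases(order []string, run func(string) bool) map[string]string {
	status := make(map[string]string)
	failed := false
	for _, phase := range order {
		if failed {
			status[phase] = "skipped"
			continue
		}
		if run(phase) {
			status[phase] = "pass"
		} else {
			status[phase] = "fail"
			failed = true
		}
	}
	return status
}

func main() {
	order := []string{"readiness", "deployment", "performance", "conformance"}
	// In this example run, the deployment phase fails.
	got := runPhases(order, func(p string) bool { return p != "deployment" })
	fmt.Println(got["readiness"], got["deployment"], got["performance"], got["conformance"])
	// pass fail skipped skipped
}
```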

Testing

Unit Testing Validators
import (
    "context"
    "testing"

    "github.com/NVIDIA/aicr/pkg/recipe"
    "github.com/NVIDIA/aicr/pkg/validator/checks"
    "github.com/stretchr/testify/assert"
    "k8s.io/client-go/kubernetes/fake"
)

func TestValidateOperatorVersion(t *testing.T) {
    // Create a fake Kubernetes client
    deployment := createTestDeployment("v24.6.0")
    clientset := fake.NewSimpleClientset(deployment)

    ctx := &checks.ValidationContext{
        Context:   context.Background(),
        Clientset: clientset,
    }

    constraint := recipe.Constraint{
        Name:  "Deployment.gpu-operator.version",
        Value: ">= v24.6.0",
    }

    actual, passed, err := ValidateGPUOperatorVersion(ctx, constraint)
    assert.NoError(t, err)
    assert.True(t, passed)
    assert.Equal(t, "v24.6.0", actual)
}
Integration Testing
# Run all validator tests
go test -v ./pkg/validator/...

# Run with race detector
go test -v -race ./pkg/validator/...

# Run specific phase tests
go test -v ./pkg/validator/checks/deployment/...

Design Decisions

Why Job-Based Execution?
  1. Cluster Context: Checks run with proper RBAC inside the cluster
  2. Resource Control: Jobs can have CPU/memory limits
  3. Node Scheduling: Performance tests can target GPU nodes
  4. Observability: Jobs appear in kubectl get jobs
  5. Isolation: Each check is independent
Why Do Constraint Validators Run in Jobs?

Deployment, performance, and conformance constraints need live cluster access:

  • Query deployed operator versions
  • Measure network bandwidth
  • Check API conformance

Only readiness constraints can be evaluated inline, because they need only snapshot data.

Why ConfigMaps for Results?

Single ValidationResult ConfigMap per validation run:

  • ConfigMap: aicr-validation-result-{runID}
  • Progressively updated as each phase completes
  • Enables resumability (--resume flag)
  • Persists even if CLI crashes or disconnects

Benefits:

  1. Resumability: Continue from failed phase using --resume {runID}
  2. Observability: Query current validation state with kubectl get cm
  3. Persistence: Results survive Job deletion and CLI disconnection
  4. Progressive Updates: Real-time visibility into validation progress
  5. Accessibility: Easy to retrieve and inspect with kubectl

Example:

# Check validation progress
kubectl get cm aicr-validation-result-20260206-140523-a3f9b2c1e7d04a68 -o yaml

# Resume from failure
aicr validate --resume 20260206-140523-a3f9b2c1e7d04a68

Examples

Example 1: Validate GPU Operator Deployment
validation:
  deployment:
    constraints:
      - name: Deployment.gpu-operator.version
        value: ">= v24.6.0"
    checks:
      - operator-health
Example 2: Performance Validation
validation:
  performance:
    constraints:
      - name: Performance.nccl.bandwidth
        value: ">= 200"
    checks:
      - nccl-bandwidth-test
      - fabric-health
Example 3: Full Multi-Phase Validation
validation:
  readiness:
    constraints:
      - name: GPU.count
        value: ">= 8"
    checks:
      - gpu-hardware-detection

  deployment:
    constraints:
      - name: Deployment.gpu-operator.version
        value: ">= v24.6.0"
    checks:
      - operator-health

  performance:
    checks:
      - nccl-bandwidth-test

  conformance:
    checks:
      - ai-workload-validation

API Reference

Main API
// Create validator
validator := validator.New(
    validator.WithKubeconfig(kubeconfigPath),
    validator.WithTimeout(5 * time.Minute),
)

// Validate specific phase
result, err := validator.ValidatePhase(ctx, "deployment", recipe, snapshot)

// Validate all phases
results, err := validator.ValidateAll(ctx, recipe, snapshot)

// Validate with phase filter
results, err := validator.ValidatePhases(ctx,
    []string{"readiness", "deployment"}, recipe, snapshot)
Registry API
// Get registered check
check, ok := checks.GetCheck("operator-health")

// Get registered constraint validator
validator, ok := checks.GetConstraintValidator("Deployment.gpu-operator.version")

// List all checks for a phase
checkList := checks.ListChecks("deployment")

// List all constraint validators
validators := checks.ListConstraintValidators()

Troubleshooting

See Troubleshooting Guide for:

  • Common errors and solutions
  • RBAC permission issues
  • Job timeout debugging
  • How to view Job logs
  • Test mode vs production mode

Contributing

To add new validation checks or constraint validators:

  1. Read How-To Guide for step-by-step instructions
  2. Follow existing patterns in pkg/validator/checks/
  3. Write comprehensive tests
  4. Update documentation

References

Documentation

Overview

Package validator provides recipe constraint validation against system snapshots.

The validator package evaluates recipe constraints against actual system measurements captured in snapshots. It supports version comparison operators and exact string matching to determine if a cluster meets the requirements specified in a recipe.

Constraint Format

Constraints use fully qualified measurement paths in the format: {Type}.{Subtype}.{Key}

Examples:

K8s.server.version         -> Kubernetes server version
OS.release.ID              -> Operating system identifier (e.g., "ubuntu")
OS.release.VERSION_ID      -> OS version (e.g., "24.04")
OS.sysctl./proc/sys/kernel/osrelease -> Kernel version
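Because the key portion may itself contain dots (as in the sysctl example above), a path cannot be split naively on every dot; splitting into at most three segments preserves the key intact. A standalone sketch of this parsing rule, simplified relative to `ParseConstraintPath`:

```go
package main

import (
	"fmt"
	"strings"
)

// parsePath splits "{Type}.{Subtype}.{Key}", where the key may itself
// contain dots. SplitN with n=3 stops after the second dot, so
// everything beyond it stays in the key. (Simplified sketch.)
func parsePath(path string) (typ, subtype, key string, err error) {
	parts := strings.SplitN(path, ".", 3)
	if len(parts) != 3 {
		return "", "", "", fmt.Errorf("invalid constraint path %q", path)
	}
	return parts[0], parts[1], parts[2], nil
}

func main() {
	typ, sub, key, _ := parsePath("OS.sysctl./proc/sys/kernel/osrelease")
	fmt.Println(typ, sub, key) // OS sysctl /proc/sys/kernel/osrelease
}
```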

Supported Operators

The following comparison operators are supported in constraint values:

  • ">=" - Greater than or equal (version comparison)
  • "<=" - Less than or equal (version comparison)
  • ">" - Greater than (version comparison)
  • "<" - Less than (version comparison)
  • "==" - Exact match (string or version)
  • "!=" - Not equal (string or version)
  • (no operator) - Exact string match
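One subtlety of tokenizing these expressions is that two-character operators must be tested before their one-character prefixes (`>=` before `>`, `<=` before `<`). A simplified standalone sketch of operator detection (not the package's `ParseConstraintExpression`, which additionally decides whether the comparison is a version comparison):

```go
package main

import (
	"fmt"
	"strings"
)

// splitConstraint separates a leading comparison operator from its
// value. Two-character operators are checked first so that ">=" is not
// mistaken for ">" followed by "= ...".
func splitConstraint(expr string) (op, value string) {
	for _, candidate := range []string{">=", "<=", "==", "!=", ">", "<"} {
		if strings.HasPrefix(expr, candidate) {
			return candidate, strings.TrimSpace(strings.TrimPrefix(expr, candidate))
		}
	}
	return "", strings.TrimSpace(expr) // no operator: exact string match
}

func main() {
	for _, expr := range []string{">= 1.32.4", "ubuntu", "== 24.04"} {
		op, val := splitConstraint(expr)
		fmt.Printf("%q -> op=%q value=%q\n", expr, op, val)
	}
}
```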

Usage

Basic validation:

v := validator.New()
result, err := v.Validate(ctx, recipe, snapshot)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Status: %s\n", result.Summary.Status)
for _, r := range result.Results {
    fmt.Printf("  %s: expected %q, got %q - %v\n",
        r.Name, r.Expected, r.Actual, r.Status)
}

Result Structure

ValidationResult contains:

  • Summary: Overall pass/fail counts and status
  • Results: Per-constraint validation results with expected/actual values

Error Handling

Constraints that cannot be evaluated (e.g., path not found in snapshot) are marked as "skipped" with appropriate warning messages, allowing partial validation results to be returned.

Constants

const (
	DefaultReadinessTimeout   = defaults.ValidateReadinessTimeout
	DefaultDeploymentTimeout  = defaults.ValidateDeploymentTimeout
	DefaultPerformanceTimeout = defaults.ValidatePerformanceTimeout
	DefaultConformanceTimeout = defaults.ValidateConformanceTimeout
)

Phase timeout aliases — defined in pkg/defaults/timeouts.go.

const (
	// APIVersion is the API version for validation results.
	APIVersion = "aicr.nvidia.com/v1alpha1"
)

Variables

PhaseOrder defines the canonical execution order for validation phases. Readiness and deployment must run before performance or conformance.

Functions

This section is empty.

Types

type CheckResult

type CheckResult struct {
	// Name is the check identifier.
	Name string `json:"name" yaml:"name"`

	// Status is the check outcome.
	Status ValidationStatus `json:"status" yaml:"status"`

	// Reason explains why the check failed or was skipped.
	Reason string `json:"reason,omitempty" yaml:"reason,omitempty"`

	// Remediation provides actionable guidance for fixing failures.
	Remediation string `json:"remediation,omitempty" yaml:"remediation,omitempty"`
}

CheckResult represents the result of a named validation check.

type ConstraintEvalResult

type ConstraintEvalResult struct {
	// Passed indicates if the constraint was satisfied.
	Passed bool

	// Actual is the actual value extracted from the snapshot.
	Actual string

	// Error contains the error if evaluation failed (e.g., value not found).
	Error error
}

ConstraintEvalResult represents the result of evaluating a single constraint.

func EvaluateConstraint

func EvaluateConstraint(constraint recipe.Constraint, snap *snapshotter.Snapshot) ConstraintEvalResult

EvaluateConstraint evaluates a single constraint against a snapshot. This is a standalone function that can be used by other packages without creating a full Validator instance. Used by the recipe package to filter overlays based on constraint evaluation during snapshot-based recipe generation.

type ConstraintPath

type ConstraintPath struct {
	Type    measurement.Type
	Subtype string
	Key     string
}

ConstraintPath represents a parsed fully qualified constraint path. Format: {Type}.{Subtype}.{Key} Example: "K8s.server.version" -> Type="K8s", Subtype="server", Key="version"

func ParseConstraintPath

func ParseConstraintPath(path string) (*ConstraintPath, error)

ParseConstraintPath parses a fully qualified constraint path. The path format is: {Type}.{Subtype}.{Key} The key portion may contain dots (e.g., "/proc/sys/kernel/osrelease").

func (*ConstraintPath) ExtractValue

func (cp *ConstraintPath) ExtractValue(snap *snapshotter.Snapshot) (string, error)

ExtractValue extracts the value at this path from a snapshot. Returns the value as a string, or an error if the path doesn't exist.

func (*ConstraintPath) String

func (cp *ConstraintPath) String() string

String returns the fully qualified path string.

type ConstraintStatus

type ConstraintStatus string

ConstraintStatus represents the outcome of evaluating a single constraint.

const (
	// ConstraintStatusPassed indicates the constraint was satisfied.
	ConstraintStatusPassed ConstraintStatus = "passed"

	// ConstraintStatusFailed indicates the constraint was not satisfied.
	ConstraintStatusFailed ConstraintStatus = "failed"

	// ConstraintStatusSkipped indicates the constraint couldn't be evaluated.
	ConstraintStatusSkipped ConstraintStatus = "skipped"
)

type ConstraintValidation

type ConstraintValidation struct {
	// Name is the fully qualified constraint name (e.g., "K8s.server.version").
	Name string `json:"name" yaml:"name"`

	// Expected is the constraint expression from the recipe (e.g., ">= 1.32.4").
	Expected string `json:"expected" yaml:"expected"`

	// Actual is the value found in the snapshot (e.g., "v1.33.5-eks-3025e55").
	Actual string `json:"actual" yaml:"actual"`

	// Status is the outcome of this constraint evaluation.
	Status ConstraintStatus `json:"status" yaml:"status"`

	// Message provides additional context, especially for failures or skipped constraints.
	Message string `json:"message,omitempty" yaml:"message,omitempty"`
}

ConstraintValidation represents the result of evaluating a single constraint.

type Operator

type Operator string

Operator represents a comparison operator in constraint expressions.

const (
	// OperatorGTE represents ">=" (greater than or equal).
	OperatorGTE Operator = ">="

	// OperatorLTE represents "<=" (less than or equal).
	OperatorLTE Operator = "<="

	// OperatorGT represents ">" (greater than).
	OperatorGT Operator = ">"

	// OperatorLT represents "<" (less than).
	OperatorLT Operator = "<"

	// OperatorEQ represents "==" (exact match).
	OperatorEQ Operator = "=="

	// OperatorNE represents "!=" (not equal).
	OperatorNE Operator = "!="

	// OperatorExact represents no operator (exact string match).
	OperatorExact Operator = ""
)

type Option

type Option func(*Validator)

Option is a functional option for configuring Validator instances.

func WithCleanup

func WithCleanup(cleanup bool) Option

WithCleanup returns an Option that controls cleanup of validation resources. When false, Jobs, ConfigMaps, and RBAC resources are kept for debugging.

func WithImage

func WithImage(image string) Option

WithImage returns an Option that sets the container image for validation Jobs.

func WithImagePullSecrets

func WithImagePullSecrets(secrets []string) Option

WithImagePullSecrets returns an Option that sets image pull secrets for validation Jobs.

func WithNamespace

func WithNamespace(namespace string) Option

WithNamespace returns an Option that sets the namespace for validation jobs.

func WithNoCluster

func WithNoCluster(noCluster bool) Option

WithNoCluster returns an Option that controls cluster access. When set to true, validation runs in dry-run mode without connecting to cluster.

func WithRunID

func WithRunID(runID string) Option

WithRunID returns an Option that sets the RunID for this validation run. Used when resuming a previous validation run.

func WithVersion

func WithVersion(version string) Option

WithVersion returns an Option that sets the Validator version string.

type ParsedConstraint

type ParsedConstraint struct {
	// Operator is the comparison operator (or empty for exact match).
	Operator Operator

	// Value is the expected value after the operator.
	Value string

	// IsVersionComparison indicates if this should be treated as a version comparison.
	IsVersionComparison bool
}

ParsedConstraint represents a parsed constraint expression.

func ParseConstraintExpression

func ParseConstraintExpression(expr string) (*ParsedConstraint, error)

ParseConstraintExpression parses a constraint value expression. Examples:

  • ">= 1.32.4" -> {Operator: ">=", Value: "1.32.4", IsVersionComparison: true}
  • "ubuntu" -> {Operator: "", Value: "ubuntu", IsVersionComparison: false}
  • "== 24.04" -> {Operator: "==", Value: "24.04", IsVersionComparison: false}

func (*ParsedConstraint) Evaluate

func (pc *ParsedConstraint) Evaluate(actual string) (bool, error)

Evaluate evaluates the constraint against an actual value. Returns true if the constraint is satisfied, false otherwise.
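For version comparisons, the core idea is a segment-by-segment numeric comparison of dotted versions. The sketch below is a simplified stand-in for whatever version library the package actually uses; it ignores pre-release suffixes such as `-eks-3025e55`, which a real version library handles:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions compares two dotted numeric versions, ignoring a
// leading "v". Returns -1, 0, or 1. Missing segments compare as 0, so
// "24.6" == "24.6.0". Non-numeric segments (pre-release suffixes) are
// not handled by this simplified sketch.
func compareVersions(a, b string) int {
	pa := strings.Split(strings.TrimPrefix(a, "v"), ".")
	pb := strings.Split(strings.TrimPrefix(b, "v"), ".")
	for i := 0; i < len(pa) || i < len(pb); i++ {
		var na, nb int
		if i < len(pa) {
			na, _ = strconv.Atoi(pa[i])
		}
		if i < len(pb) {
			nb, _ = strconv.Atoi(pb[i])
		}
		if na != nb {
			if na < nb {
				return -1
			}
			return 1
		}
	}
	return 0
}

func main() {
	fmt.Println(compareVersions("v1.33.5", "1.32.4")) // 1
	fmt.Println(compareVersions("24.6.0", "v24.6"))   // 0
}
```

Note that plain string comparison would get "1.2" vs "1.10" wrong, which is why each segment is compared numerically.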

func (*ParsedConstraint) String

func (pc *ParsedConstraint) String() string

String returns a string representation of the parsed constraint.

type PhaseResult

type PhaseResult struct {
	// Status is the overall status of this phase.
	Status ValidationStatus `json:"status" yaml:"status"`

	// Constraints contains per-constraint results for this phase.
	Constraints []ConstraintValidation `json:"constraints,omitempty" yaml:"constraints,omitempty"`

	// Checks contains results of named validation checks.
	Checks []CheckResult `json:"checks,omitempty" yaml:"checks,omitempty"`

	// Reason explains why the phase was skipped or failed.
	Reason string `json:"reason,omitempty" yaml:"reason,omitempty"`

	// Duration is how long this phase took to run.
	Duration time.Duration `json:"duration,omitempty" yaml:"duration,omitempty"`
}

PhaseResult represents the result of a single validation phase.

type ValidationPhaseName

type ValidationPhaseName string

ValidationPhaseName represents the name of a validation phase.

const (
	// PhaseReadiness is the readiness validation phase.
	PhaseReadiness ValidationPhaseName = "readiness"

	// PhaseDeployment is the deployment validation phase.
	PhaseDeployment ValidationPhaseName = "deployment"

	// PhasePerformance is the performance validation phase.
	PhasePerformance ValidationPhaseName = "performance"

	// PhaseConformance is the conformance validation phase.
	PhaseConformance ValidationPhaseName = "conformance"

	// PhaseAll runs all phases sequentially.
	PhaseAll ValidationPhaseName = "all"
)

type ValidationResult

type ValidationResult struct {
	header.Header `json:",inline" yaml:",inline"`

	// RunID is a unique identifier for this validation run.
	// Used for resume functionality and correlating resources.
	// Format: YYYYMMDD-HHMMSS-RANDOM (e.g., "20260206-140523-a3f9b2c1e7d04a68")
	RunID string `json:"runID,omitempty" yaml:"runID,omitempty"`

	// RecipeSource is the path/URI of the recipe that was validated.
	RecipeSource string `json:"recipeSource" yaml:"recipeSource"`

	// SnapshotSource is the path/URI of the snapshot used for validation.
	SnapshotSource string `json:"snapshotSource" yaml:"snapshotSource"`

	// Summary contains aggregate validation statistics.
	Summary ValidationSummary `json:"summary" yaml:"summary"`

	// Results contains per-constraint validation details (legacy, for backward compatibility).
	Results []ConstraintValidation `json:"results,omitempty" yaml:"results,omitempty"`

	// Phases contains per-phase validation results (multi-phase validation).
	Phases map[string]*PhaseResult `json:"phases,omitempty" yaml:"phases,omitempty"`
}

ValidationResult represents the complete validation outcome.

func NewValidationResult

func NewValidationResult() *ValidationResult

NewValidationResult creates a new ValidationResult with initialized slices.

type ValidationStatus

type ValidationStatus string

ValidationStatus represents the overall validation outcome.

const (
	// ValidationStatusPass indicates all constraints passed.
	ValidationStatusPass ValidationStatus = "pass"

	// ValidationStatusFail indicates one or more constraints failed.
	ValidationStatusFail ValidationStatus = "fail"

	// ValidationStatusPartial indicates some constraints couldn't be evaluated.
	ValidationStatusPartial ValidationStatus = "partial"

	// ValidationStatusSkipped indicates a phase was skipped (due to dependency failure).
	ValidationStatusSkipped ValidationStatus = "skipped"

	// ValidationStatusWarning indicates warnings but no hard failures.
	ValidationStatusWarning ValidationStatus = "warning"
)

type ValidationSummary

type ValidationSummary struct {
	// Passed is the count of constraints that were satisfied.
	Passed int `json:"passed" yaml:"passed"`

	// Failed is the count of constraints that were not satisfied.
	Failed int `json:"failed" yaml:"failed"`

	// Skipped is the count of constraints that couldn't be evaluated.
	Skipped int `json:"skipped" yaml:"skipped"`

	// Total is the total number of constraints evaluated.
	Total int `json:"total" yaml:"total"`

	// Status is the overall validation status.
	Status ValidationStatus `json:"status" yaml:"status"`

	// Duration is how long the validation took.
	Duration time.Duration `json:"duration" yaml:"duration"`
}

ValidationSummary contains aggregate statistics about the validation.

type Validator

type Validator struct {
	// Version is the validator version (typically the CLI version).
	Version string

	// Namespace is the Kubernetes namespace where validation jobs will run.
	// Defaults to "aicr-validation" if not specified.
	Namespace string

	// Image is the container image to use for validation Jobs.
	// Must include Go toolchain for running tests.
	// Defaults to "ghcr.io/nvidia/aicr-validator:latest".
	Image string

	// RunID is a unique identifier for this validation run.
	// Used to scope all resources (ConfigMaps, Jobs) and enable resumability.
	// Format: YYYYMMDD-HHMMSS-RANDOM (e.g., "20260206-140523-a3f9b2c1e7d04a68")
	RunID string

	// Cleanup controls whether to delete Jobs, ConfigMaps, and RBAC resources after validation.
	// Defaults to true. Set to false to keep resources for debugging.
	Cleanup bool

	// ImagePullSecrets are secret names for pulling images from private registries.
	ImagePullSecrets []string

	// NoCluster controls whether to skip actual cluster operations (dry-run mode).
	// When true, validation runs without connecting to Kubernetes cluster.
	NoCluster bool
}

Validator evaluates recipe constraints against snapshot measurements.

func New

func New(opts ...Option) *Validator

New creates a new Validator with the provided options.

func (*Validator) Validate

func (v *Validator) Validate(ctx context.Context, recipeResult *recipe.RecipeResult, snap *snapshotter.Snapshot) (*ValidationResult, error)

Validate evaluates all constraints from the recipe against the snapshot. Returns a ValidationResult containing per-constraint results and summary.

func (*Validator) ValidatePhase

func (v *Validator) ValidatePhase(
	ctx context.Context,
	phase ValidationPhaseName,
	recipeResult *recipe.RecipeResult,
	snap *snapshotter.Snapshot,
) (*ValidationResult, error)

ValidatePhase runs validation for a specific phase. This is the main entry point for phase-based validation.

func (*Validator) ValidatePhases

func (v *Validator) ValidatePhases(
	ctx context.Context,
	phases []ValidationPhaseName,
	recipeResult *recipe.RecipeResult,
	snap *snapshotter.Snapshot,
) (*ValidationResult, error)

ValidatePhases runs validation for multiple specified phases. If no phases are specified, defaults to readiness phase. If phases includes "all", runs all phases.

