agent

package
v0.8.16 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 5, 2026 License: Apache-2.0 Imports: 20 Imported by: 0

README

Validation Agent

The validation agent package provides a Kubernetes Job-based executor for running validation checks.

Architecture

The validation agent follows the same pattern as the snapshot agent (pkg/k8s/agent):

┌────────────────┐
│   Validator    │
│   (CLI/API)    │
└───────┬────────┘
        │
        ▼
┌────────────────┐
│ Agent Deployer │  ← Creates K8s resources
└───────┬────────┘
        │
        ├─► RBAC (ServiceAccount, Role, RoleBinding)
        ├─► Input ConfigMaps (snapshot.yaml, recipe.yaml)
        └─► Job (runs go test commands)
            │
            ├─► Mounts snapshot + recipe as volumes
            ├─► Runs: go test -json -run TestName
            └─► Writes results to ConfigMap

Key Components

Deployer

  • Deploy() - Creates the RBAC resources and the Job
  • WaitForCompletion() - Waits for the Job to finish
  • GetResult() - Retrieves validation results from the Job's pod logs
  • Cleanup() - Removes the Job and RBAC resources

Job Execution

The Job container:

  1. Mounts snapshot and recipe from ConfigMaps
  2. Sets environment variables (AICR_SNAPSHOT_PATH, AICR_RECIPE_PATH)
  3. Runs go test -v -json <package> (runs all tests in phase package)
  4. Outputs test results to stdout (JSON format between markers)
  5. Exits with test exit code

The validator reads Job logs and updates the unified ValidationResult ConfigMap.
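
Step 4 places the JSON test output between markers in the Job's stdout. The sketch below shows one way the validator side could pull that payload out of the pod logs. The marker strings and the helper name extractBetweenMarkers are hypothetical; this README does not state the real markers.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Hypothetical marker strings; the real markers are not documented here.
const (
	beginMarker = "---BEGIN TEST OUTPUT---"
	endMarker   = "---END TEST OUTPUT---"
)

// extractBetweenMarkers returns the text between the first begin marker and
// the next end marker in logs, with surrounding whitespace trimmed.
func extractBetweenMarkers(logs, begin, end string) (string, error) {
	start := strings.Index(logs, begin)
	if start < 0 {
		return "", errors.New("begin marker not found")
	}
	rest := logs[start+len(begin):]
	stop := strings.Index(rest, end)
	if stop < 0 {
		return "", errors.New("end marker not found")
	}
	return strings.TrimSpace(rest[:stop]), nil
}

func main() {
	logs := "starting\n" + beginMarker + "\n{\"Action\":\"pass\"}\n" + endMarker + "\ndone\n"
	payload, err := extractBetweenMarkers(logs, beginMarker, endMarker)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(payload)
}
```

Returning an error when either marker is missing lets the caller distinguish a Job that crashed before printing results from a Job that ran tests and reported failures.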

RBAC Permissions

The validation Job needs permissions to:

  • Read/write ConfigMaps (for inputs and results)
  • Read pods, services, deployments (for deployment phase checks)
  • Read nodes (for deployment phase checks)

Usage Example

// Create Kubernetes client
clientset, err := k8sclient.GetKubeClient()
if err != nil {
    return err
}

// Configure validation agent
config := agent.Config{
    Namespace:          "aicr-validation",
    JobName:            "aicr-validation-deployment",
    Image:              "ghcr.io/nvidia/aicr-validator:latest",  // Validator image with Go toolchain
    ServiceAccountName: "aicr-validator",
    SnapshotConfigMap:  "aicr-snapshot",
    RecipeConfigMap:    "aicr-recipe",
    TestPackage:        "./pkg/validator/checks/deployment",
    TestPattern:        "TestOperatorHealth",
    Timeout:            5 * time.Minute,
    Cleanup:            true,
}

// Create deployer
deployer := agent.NewDeployer(clientset, config)

// Deploy and wait
if err := deployer.Deploy(ctx); err != nil {
    return err
}

defer deployer.Cleanup(ctx, agent.CleanupOptions{Enabled: true})

if err := deployer.WaitForCompletion(ctx, config.Timeout); err != nil {
    return err
}

// Get results
result, err := deployer.GetResult(ctx)
if err != nil {
    return err
}

fmt.Printf("Check: %s, Status: %s\n", result.CheckName, result.Status)

Design Decisions

Why Jobs instead of inline execution?

  1. Isolation - Tests run in cluster context with proper RBAC
  2. Resource limits - Jobs can have CPU/memory constraints
  3. Smart scheduling - Jobs prefer CPU nodes via soft affinity (nvidia.com/gpu.present DoesNotExist); checks that need GPUs create their own workload Pods with GPU resource requests
  4. Observability - Jobs show up in kubectl get jobs
  5. Reproducibility - Same execution environment every time

Why ConfigMaps for results?

  1. Persistence - Results survive even if the Job is deleted
  2. Accessibility - Easy to retrieve with kubectl or the API
  3. Size limits - The 1 MiB ConfigMap size limit encourages concise results
  4. Standard pattern - Consistent with the snapshot agent

Why one Job per phase?

  1. Early exit - Can skip subsequent phases on failure
  2. Granular control - Easier to retry specific phases
  3. Clear boundaries - Each phase is an independent unit of work

Integration with Validator

The validator package uses this agent to run checks:

// In pkg/validator/phases.go
// (the error return and error handling are shown for illustration)
func (v *Validator) validateDeployment(ctx context.Context, ...) error {
    // For each check in the recipe
    for _, checkName := range recipe.Validation.Deployment.Checks {
        // Deploy a Job for this check
        deployer := agent.NewDeployer(...)
        if err := deployer.Deploy(ctx); err != nil {
            return err
        }
        if err := deployer.WaitForCompletion(ctx, timeout); err != nil {
            return err
        }
        result, err := deployer.GetResult(ctx)
        if err != nil {
            return err
        }

        // Aggregate results
        phaseResult.Checks = append(phaseResult.Checks, result)
    }
    return nil
}

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CleanupOptions

type CleanupOptions struct {
	// Enabled determines whether to clean up resources
	Enabled bool
}

CleanupOptions controls what resources to remove during cleanup.

type Config

type Config struct {
	// Namespace is where the validation Job will be deployed
	Namespace string

	// JobName is the name of the Kubernetes Job
	JobName string

	// Image is the container image to use (should contain aicr CLI)
	Image string

	// ImagePullSecrets for pulling the image from private registries
	ImagePullSecrets []string

	// ServiceAccountName for the Job pods
	ServiceAccountName string

	// Tolerations for scheduling on tainted nodes
	Tolerations []corev1.Toleration

	// Affinity specifies pod scheduling affinity rules.
	// Used to prefer CPU nodes for non-GPU validation Jobs.
	Affinity *corev1.Affinity

	// SnapshotConfigMap is the ConfigMap containing the snapshot data
	SnapshotConfigMap string

	// RecipeConfigMap is the ConfigMap containing the recipe data
	RecipeConfigMap string

	// TestPackage is the Go package path used to derive the pre-compiled test binary name.
	// filepath.Base(TestPackage) + ".test" gives the binary (e.g. "readiness.test").
	TestPackage string

	// TestPattern is the test name pattern to run (passed to -run flag)
	// Example: "TestGpuHardwareDetection"
	TestPattern string

	// ExpectedTests is the number of tests expected to run.
	// If set and actual tests differ, validation fails.
	ExpectedTests int

	// Timeout for the Job to complete
	Timeout time.Duration

	// Cleanup determines whether to remove Job and RBAC on completion
	Cleanup bool

	// Debug enables debug logging
	Debug bool
}

Config holds the configuration for deploying a validation agent Job.
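
The TestPackage comment describes how the pre-compiled test binary name is derived. The sketch below applies that rule with the standard library. The helper name testBinaryName and the second package path are hypothetical and only serve the illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// testBinaryName applies the rule from the TestPackage comment:
// filepath.Base(TestPackage) + ".test".
func testBinaryName(testPackage string) string {
	return filepath.Base(testPackage) + ".test"
}

func main() {
	// Package path taken from the usage example in the README.
	fmt.Println(testBinaryName("./pkg/validator/checks/deployment"))
	// Hypothetical path chosen to reproduce the "readiness.test" example.
	fmt.Println(testBinaryName("./pkg/validator/checks/readiness"))
}
```

Only the last path element matters, so two packages with the same final directory name would map to the same binary name under this rule.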

type Deployer

type Deployer struct {
	// contains filtered or unexported fields
}

Deployer manages the deployment and lifecycle of validation agent Jobs.

func NewDeployer

func NewDeployer(clientset kubernetes.Interface, config Config) *Deployer

NewDeployer creates a new validation agent Deployer.

func (*Deployer) Cleanup

func (d *Deployer) Cleanup(ctx context.Context, opts CleanupOptions) error

Cleanup removes the validation Job and RBAC resources. For multi-phase validation, prefer CleanupJob() per phase, then CleanupRBAC() once at the end.

func (*Deployer) CleanupJob

func (d *Deployer) CleanupJob(ctx context.Context) error

CleanupJob removes the validation Job. Use this after each phase in multi-phase validation to clean up per-phase Jobs.

func (*Deployer) CleanupRBAC

func (d *Deployer) CleanupRBAC(ctx context.Context) error

CleanupRBAC removes RBAC resources (ServiceAccount, Role, RoleBinding). Use this once at the end of multi-phase validation after all Jobs are done.

func (*Deployer) Deploy

func (d *Deployer) Deploy(ctx context.Context) error

Deploy deploys the validation agent with all required resources (RBAC + Job). For multi-phase validation, prefer using EnsureRBAC() once, then DeployJob() per phase.

func (*Deployer) DeployJob

func (d *Deployer) DeployJob(ctx context.Context) error

DeployJob deploys the validation Job. Assumes RBAC resources already exist (call EnsureRBAC first). This deletes any existing Job with the same name before creating a new one.

func (*Deployer) EnsureRBAC

func (d *Deployer) EnsureRBAC(ctx context.Context) error

EnsureRBAC creates RBAC resources (ServiceAccount, Role, RoleBinding). This is idempotent - safe to call multiple times, reuses existing resources. For multi-phase validation, call this once before running multiple Jobs.
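
The multi-phase call order recommended above (EnsureRBAC once, DeployJob and CleanupJob per phase, CleanupRBAC once at the end) can be sketched against a small interface. The interface phaseRunner, the fake fakeDeployer, the phase names, and the choice to issue both RBAC calls through the first phase's deployer are assumptions made for illustration. A real caller would also invoke WaitForCompletion and GetResult between DeployJob and CleanupJob.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// phaseRunner lists the Deployer methods this sketch calls. The signatures
// match those documented for *agent.Deployer.
type phaseRunner interface {
	EnsureRBAC(ctx context.Context) error
	DeployJob(ctx context.Context) error
	CleanupJob(ctx context.Context) error
	CleanupRBAC(ctx context.Context) error
}

// fakeDeployer is a hypothetical stand-in that records calls instead of
// talking to a cluster.
type fakeDeployer struct {
	name  string
	fail  bool
	calls *[]string
}

func (f *fakeDeployer) note(op string) { *f.calls = append(*f.calls, f.name+":"+op) }

func (f *fakeDeployer) EnsureRBAC(context.Context) error  { f.note("EnsureRBAC"); return nil }
func (f *fakeDeployer) CleanupJob(context.Context) error  { f.note("CleanupJob"); return nil }
func (f *fakeDeployer) CleanupRBAC(context.Context) error { f.note("CleanupRBAC"); return nil }
func (f *fakeDeployer) DeployJob(context.Context) error {
	f.note("DeployJob")
	if f.fail {
		return errors.New(f.name + " phase failed")
	}
	return nil
}

// runPhases ensures RBAC once, runs one Job per phase, stops at the first
// failing phase, and removes RBAC once at the end.
func runPhases(ctx context.Context, phases []phaseRunner) (err error) {
	if len(phases) == 0 {
		return nil
	}
	if err := phases[0].EnsureRBAC(ctx); err != nil {
		return err
	}
	defer func() {
		// Keep the first error; report a cleanup error only if nothing else failed.
		if cerr := phases[0].CleanupRBAC(ctx); err == nil {
			err = cerr
		}
	}()
	for _, p := range phases {
		deployErr := p.DeployJob(ctx)
		cleanupErr := p.CleanupJob(ctx) // per-phase Job cleanup, even on failure
		if deployErr != nil {
			return deployErr
		}
		if cleanupErr != nil {
			return cleanupErr
		}
	}
	return nil
}

// demo runs three hypothetical phases where the second one fails.
func demo() ([]string, error) {
	var calls []string
	phases := []phaseRunner{
		&fakeDeployer{name: "readiness", calls: &calls},
		&fakeDeployer{name: "deployment", fail: true, calls: &calls},
		&fakeDeployer{name: "performance", calls: &calls},
	}
	err := runPhases(context.Background(), phases)
	return calls, err
}

func main() {
	calls, err := demo()
	for _, c := range calls {
		fmt.Println(c)
	}
	fmt.Println("error:", err)
}
```

In this run the third phase is never deployed, which mirrors the early-exit rationale for one Job per phase, and the RBAC cleanup still runs because it is deferred.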

func (*Deployer) GetPodLogs

func (d *Deployer) GetPodLogs(ctx context.Context) (string, error)

GetPodLogs retrieves all pod logs as a string. This is useful for capturing logs when a Job fails for debugging.

func (*Deployer) GetResult

func (d *Deployer) GetResult(ctx context.Context) (*ValidationResult, error)

GetResult retrieves the validation result from the Job's pod logs.

func (*Deployer) StreamLogs

func (d *Deployer) StreamLogs(ctx context.Context) error

StreamLogs streams logs from the validation Job pod to the provided writer.

func (*Deployer) WaitForCompletion

func (d *Deployer) WaitForCompletion(ctx context.Context, timeout time.Duration) error

WaitForCompletion waits for the validation Job to complete successfully.

func (*Deployer) WaitForJobPodTermination added in v0.8.8

func (d *Deployer) WaitForJobPodTermination(ctx context.Context)

WaitForJobPodTermination waits for the Job's pod to reach a terminal state or be deleted. This prevents race conditions where RBAC resources are cleaned up while the pod is still running cleanup operations (e.g., chainsaw namespace deletion). The caller must provide a timeout-bounded context.

func (*Deployer) WaitForPodReady added in v0.8.10

func (d *Deployer) WaitForPodReady(ctx context.Context) error

WaitForPodReady waits for the Job's pod to appear and reach the Running phase. This is used before streaming logs so we have a pod to follow. The caller must provide a timeout-bounded context.
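
Both WaitForJobPodTermination and WaitForPodReady leave the deadline to the caller. The sketch below shows what a timeout-bounded context looks like in practice, using context.WithTimeout around a polling loop. The function waitFor is a hypothetical stand-in for the real wait methods, and the durations are arbitrary.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitFor polls done until it reports true or ctx ends. It is a hypothetical
// stand-in for a blocking call that relies on the caller's context for its
// deadline.
func waitFor(ctx context.Context, interval time.Duration, done func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if done() {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demoNeverReady waits on a condition that never becomes true, so the
// 50ms timeout on the context ends the wait.
func demoNeverReady() error {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	return waitFor(ctx, 5*time.Millisecond, func() bool { return false })
}

// demoReadyOnThirdPoll waits on a condition that becomes true on the third
// poll, well inside the timeout.
func demoReadyOnThirdPoll() error {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	polls := 0
	return waitFor(ctx, 5*time.Millisecond, func() bool {
		polls++
		return polls >= 3
	})
}

func main() {
	fmt.Println("never ready:", demoNeverReady())
	fmt.Println("ready on third poll:", demoReadyOnThirdPoll())
}
```

Without the context.WithTimeout wrapper, the first call would block forever, which is why the documentation asks the caller to bound the context.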

type GoTestEvent

type GoTestEvent struct {
	Time    time.Time
	Action  string
	Package string
	Test    string
	Output  string
	Elapsed float64
}

GoTestEvent represents a single event from go test -json output.
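
As an illustration of how such events can be consumed, the sketch below decodes a go test -json stream line by line and keeps the terminal action for each test. The local type goTestEvent mirrors the fields above, and the helper finalStatuses and the sample stream are hypothetical. The package's own parser may work differently.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// goTestEvent mirrors the GoTestEvent fields. encoding/json matches the keys
// emitted by go test -json to these exported field names.
type goTestEvent struct {
	Time    time.Time
	Action  string
	Package string
	Test    string
	Output  string
	Elapsed float64
}

// Hypothetical sample stream: one passing test, one failing test, and a
// package-level event without a Test name.
const sample = `{"Time":"2026-03-05T10:00:00Z","Action":"run","Package":"example/checks","Test":"TestOperatorHealth"}
{"Time":"2026-03-05T10:00:01Z","Action":"output","Package":"example/checks","Test":"TestOperatorHealth","Output":"--- PASS: TestOperatorHealth (1.00s)\n"}
{"Time":"2026-03-05T10:00:01Z","Action":"pass","Package":"example/checks","Test":"TestOperatorHealth","Elapsed":1}
{"Time":"2026-03-05T10:00:02Z","Action":"fail","Package":"example/checks","Test":"TestGpuHardwareDetection","Elapsed":0.5}
{"Time":"2026-03-05T10:00:02Z","Action":"fail","Package":"example/checks","Elapsed":2}`

// finalStatuses scans a go test -json stream and returns the terminal action
// (pass, fail, or skip) recorded for each named test.
func finalStatuses(stream string) (map[string]string, error) {
	statuses := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(stream))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue
		}
		var ev goTestEvent
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			return nil, fmt.Errorf("decode event: %w", err)
		}
		if ev.Test == "" {
			continue // package-level event, not tied to a test function
		}
		switch ev.Action {
		case "pass", "fail", "skip":
			statuses[ev.Test] = ev.Action
		}
	}
	return statuses, scanner.Err()
}

func main() {
	statuses, err := finalStatuses(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("TestOperatorHealth:", statuses["TestOperatorHealth"])
	fmt.Println("TestGpuHardwareDetection:", statuses["TestGpuHardwareDetection"])
}
```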

type TestResult

type TestResult struct {
	// Name is the test function name (e.g., "TestGpuHardwareDetection")
	Name string

	// Status is the test result (pass/fail/skip)
	Status string

	// Duration is how long the test took
	Duration time.Duration

	// Output contains the test output lines
	Output []string
}

TestResult represents the result of a single test function.

type ValidationResult

type ValidationResult struct {
	// CheckName is the name of the check that was run
	CheckName string

	// Phase is the validation phase
	Phase string

	// Status is the result status (pass/fail/skip)
	Status string

	// Message provides details about the result
	Message string

	// Duration is how long the check took
	Duration time.Duration

	// Details contains structured data about the result
	Details map[string]interface{}

	// Tests contains individual test results when parsing go test JSON output
	// Each entry represents a single test function that was executed
	Tests []TestResult
}

ValidationResult represents the result of running validation checks.
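
One plausible way to derive the overall Status from the Tests slice, together with the ExpectedTests setting from Config, is sketched below. The aggregation rule is an assumption, not the package's documented behavior: any failed test fails the check, a mismatch with a non-zero expected count fails the check, and a run with no passing or failing tests is reported as skip. The type testResult keeps only the TestResult fields used here.

```go
package main

import "fmt"

// testResult keeps only the TestResult fields this sketch needs.
type testResult struct {
	Name   string
	Status string // pass, fail, or skip
}

// overallStatus applies the assumed aggregation rule described above.
// expected is treated as "not set" when it is zero.
func overallStatus(tests []testResult, expected int) (status, message string) {
	if expected > 0 && len(tests) != expected {
		return "fail", fmt.Sprintf("expected %d tests, ran %d", expected, len(tests))
	}
	passed := 0
	for _, t := range tests {
		switch t.Status {
		case "fail":
			return "fail", "test failed: " + t.Name
		case "pass":
			passed++
		}
	}
	if passed == 0 {
		return "skip", "no tests passed or failed"
	}
	return "pass", fmt.Sprintf("%d tests passed", passed)
}

func main() {
	tests := []testResult{
		{Name: "TestOperatorHealth", Status: "pass"},
		{Name: "TestGpuHardwareDetection", Status: "skip"},
	}
	status, message := overallStatus(tests, 2)
	fmt.Println(status, "-", message)

	// A count mismatch fails the check even though nothing failed.
	status, message = overallStatus(tests, 3)
	fmt.Println(status, "-", message)
}
```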
