agent

package

v0.8.16 Published: Mar 5, 2026 License: Apache-2.0 Imports: 20 Imported by: 0

Repository

github.com/NVIDIA/aicr

README ¶

Validation Agent

The validation agent package provides a Kubernetes Job-based executor for running validation checks.

Architecture

The validation agent follows the same pattern as the snapshot agent (pkg/k8s/agent):

┌────────────────┐
│   Validator    │
│   (CLI/API)    │
└───────┬────────┘
        │
        ▼
┌────────────────┐
│ Agent Deployer │  ← Creates K8s resources
└───────┬────────┘
        │
        ├─► RBAC (ServiceAccount, Role, RoleBinding)
        ├─► Input ConfigMaps (snapshot.yaml, recipe.yaml)
        └─► Job (runs go test commands)
            │
            ├─► Mounts snapshot + recipe as volumes
            ├─► Runs: go test -json -run TestName
            └─► Emits JSON results to stdout (validator stores them in a ConfigMap)

Key Components

Deployer

Deploy() - Creates RBAC + Job
WaitForCompletion() - Waits for Job to finish
GetResult() - Retrieves validation results from the Job's pod logs
Cleanup() - Removes Job and RBAC resources

Job Execution

The Job container:

Mounts snapshot and recipe from ConfigMaps
Sets environment variables (AICR_SNAPSHOT_PATH, AICR_RECIPE_PATH)
Runs go test -v -json <package> (runs all tests in the phase package)
Writes test results to stdout as JSON, delimited by markers
Exits with the test exit code

The validator reads Job logs and updates the unified ValidationResult ConfigMap.
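The log-reading step can be sketched as follows. This is a minimal illustration, not the package's parser: the README does not name the markers, so beginMarker and endMarker below are hypothetical, and the event struct keeps only a few of the go test -json fields.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// event holds the subset of go test -json fields this sketch needs.
type event struct {
	Action string
	Test   string
	Output string
}

// Hypothetical markers: the README says the JSON sits "between markers"
// but does not say what they look like.
const (
	beginMarker = "-----BEGIN TEST OUTPUT-----"
	endMarker   = "-----END TEST OUTPUT-----"
)

// extractEvents decodes every JSON line found between the two markers and
// ignores everything outside them.
func extractEvents(logs string) ([]event, error) {
	var events []event
	inside := false
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case line == beginMarker:
			inside = true
		case line == endMarker:
			inside = false
		case inside && strings.HasPrefix(line, "{"):
			var ev event
			if err := json.Unmarshal([]byte(line), &ev); err != nil {
				return nil, fmt.Errorf("decode test event: %w", err)
			}
			events = append(events, ev)
		}
	}
	return events, sc.Err()
}

// sampleLogs imitates pod logs with unrelated lines around the marked region.
const sampleLogs = `starting validation
-----BEGIN TEST OUTPUT-----
{"Action":"run","Test":"TestOperatorHealth"}
{"Action":"output","Test":"TestOperatorHealth","Output":"ok\n"}
{"Action":"pass","Test":"TestOperatorHealth","Elapsed":0.42}
-----END TEST OUTPUT-----
{"Action":"ignored"}
`

func main() {
	events, err := extractEvents(sampleLogs)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, ev := range events {
		fmt.Printf("%s %s\n", ev.Action, ev.Test)
	}
}
```

Delimiting the JSON keeps unrelated log lines (startup messages, warnings) from being mistaken for test events.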

RBAC Permissions

The validation Job needs permissions to:

Read/write ConfigMaps (for inputs and results)
Read pods, services, deployments (for deployment phase checks)
Read nodes (for deployment phase checks)
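As a rough illustration of the permissions listed above, the sketch below models them as rule entries. It uses a local policyRule type that mirrors the APIGroups, Resources and Verbs fields of the Kubernetes rbac/v1 PolicyRule so it runs without client libraries; the exact verbs and grouping are assumptions, not the rules this package creates. Note that nodes are cluster-scoped in Kubernetes, so granting access to them in a real cluster requires a ClusterRole rather than a namespaced Role.

```go
package main

import "fmt"

// policyRule mirrors three fields of the rbac/v1 PolicyRule type.
type policyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// validationRules is an assumed rule set covering the listed permissions.
func validationRules() []policyRule {
	read := []string{"get", "list", "watch"}
	write := []string{"get", "list", "watch", "create", "update", "patch"}
	return []policyRule{
		// ConfigMaps hold inputs and results, so they need write access.
		{APIGroups: []string{""}, Resources: []string{"configmaps"}, Verbs: write},
		// Core-group resources inspected by deployment phase checks.
		{APIGroups: []string{""}, Resources: []string{"pods", "services", "nodes"}, Verbs: read},
		// Deployments live in the "apps" API group.
		{APIGroups: []string{"apps"}, Resources: []string{"deployments"}, Verbs: read},
	}
}

// contains reports whether want is present in items.
func contains(items []string, want string) bool {
	for _, item := range items {
		if item == want {
			return true
		}
	}
	return false
}

// allows reports whether any rule grants verb on resource in group.
func allows(rules []policyRule, group, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, group) && contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	rules := validationRules()
	fmt.Println("update configmaps:", allows(rules, "", "configmaps", "update"))
	fmt.Println("get deployments:", allows(rules, "apps", "deployments", "get"))
	fmt.Println("delete deployments:", allows(rules, "apps", "deployments", "delete"))
}
```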

Usage Example

// Create Kubernetes client
clientset, err := k8sclient.GetKubeClient()
if err != nil {
    return err
}

// Configure validation agent
config := agent.Config{
    Namespace:          "aicr-validation",
    JobName:            "aicr-validation-deployment",
    Image:              "ghcr.io/nvidia/aicr-validator:latest",  // Validator image with Go toolchain
    ServiceAccountName: "aicr-validator",
    SnapshotConfigMap:  "aicr-snapshot",
    RecipeConfigMap:    "aicr-recipe",
    TestPackage:        "./pkg/validator/checks/deployment",
    TestPattern:        "TestOperatorHealth",
    Timeout:            5 * time.Minute,
    Cleanup:            true,
}

// Create deployer
deployer := agent.NewDeployer(clientset, config)

// Deploy and wait
if err := deployer.Deploy(ctx); err != nil {
    return err
}

defer deployer.Cleanup(ctx, agent.CleanupOptions{Enabled: true})

if err := deployer.WaitForCompletion(ctx, config.Timeout); err != nil {
    return err
}

// Get results
result, err := deployer.GetResult(ctx)
if err != nil {
    return err
}

fmt.Printf("Check: %s, Status: %s\n", result.CheckName, result.Status)

Design Decisions

Why Jobs instead of inline execution?

Isolation - Tests run in cluster context with proper RBAC
Resource limits - Jobs can have CPU/memory constraints
Smart scheduling - Jobs prefer CPU nodes via soft affinity (nvidia.com/gpu.present DoesNotExist); checks that need GPUs create their own workload Pods with GPU resource requests
Observability - Jobs show up in kubectl get jobs
Reproducibility - Same execution environment every time
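The soft affinity mentioned above has roughly the following shape. The sketch uses local structs whose JSON tags follow the Kubernetes core/v1 field names, so it runs without client libraries; in real code the value would be a *corev1.Affinity assigned to Config.Affinity. The weight of 100 is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// The types below mirror the relevant parts of core/v1 Affinity.
type matchExpression struct {
	Key      string `json:"key"`
	Operator string `json:"operator"`
}

type nodeSelectorTerm struct {
	MatchExpressions []matchExpression `json:"matchExpressions"`
}

type preferredTerm struct {
	Weight     int              `json:"weight"`
	Preference nodeSelectorTerm `json:"preference"`
}

type nodeAffinity struct {
	Preferred []preferredTerm `json:"preferredDuringSchedulingIgnoredDuringExecution"`
}

type affinity struct {
	NodeAffinity nodeAffinity `json:"nodeAffinity"`
}

// preferCPUNodes builds a soft (preferred, not required) rule that favors
// nodes lacking the nvidia.com/gpu.present label.
func preferCPUNodes(weight int) affinity {
	return affinity{NodeAffinity: nodeAffinity{Preferred: []preferredTerm{{
		Weight: weight,
		Preference: nodeSelectorTerm{MatchExpressions: []matchExpression{
			{Key: "nvidia.com/gpu.present", Operator: "DoesNotExist"},
		}},
	}}}}
}

func main() {
	out, err := json.MarshalIndent(preferCPUNodes(100), "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Because the rule is only preferred, the Job can still be scheduled on a GPU node when no CPU-only node is available.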

Why ConfigMaps for results?

Persistence - Results survive even if the Job is deleted
Accessibility - Easy to retrieve with kubectl or API
Size limits - ConfigMap 1MB limit encourages concise results
Standard pattern - Consistent with snapshot agent
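One way to respect the size limit is to measure the payload before writing it. The sketch below is an assumption about how a caller might guard against oversized results, not code from this package; it counts both keys and values, which is a conservative estimate.

```go
package main

import (
	"fmt"
	"strings"
)

// maxConfigMapBytes is the 1 MiB ceiling on ConfigMap data.
const maxConfigMapBytes = 1 << 20

// dataSize returns a conservative size estimate: bytes of all keys and values.
func dataSize(data map[string]string) int {
	total := 0
	for k, v := range data {
		total += len(k) + len(v)
	}
	return total
}

// fitsConfigMap reports whether data stays within the 1 MiB ceiling.
func fitsConfigMap(data map[string]string) bool {
	return dataSize(data) <= maxConfigMapBytes
}

func main() {
	small := map[string]string{"result.json": `{"status":"pass"}`}
	large := map[string]string{"result.json": strings.Repeat("x", 2<<20)}
	fmt.Println("small fits:", fitsConfigMap(small))
	fmt.Println("large fits:", fitsConfigMap(large))
}
```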

Why one Job per phase?

Early exit - Can skip subsequent phases on failure
Granular control - Easier to retry specific phases
Clear boundaries - Each phase is an independent unit of work

Integration with Validator

The validator package uses this agent to run checks:

// In pkg/validator/phases.go
func (v *Validator) validateDeployment(ctx context.Context, ...) error {
    // For each check in recipe
    for _, checkName := range recipe.Validation.Deployment.Checks {
        // Deploy Job for this check
        deployer := agent.NewDeployer(...)
        if err := deployer.Deploy(ctx); err != nil {
            return err
        }
        if err := deployer.WaitForCompletion(ctx, timeout); err != nil {
            return err
        }

        // GetResult returns (*ValidationResult, error)
        result, err := deployer.GetResult(ctx)
        if err != nil {
            return err
        }

        // Aggregate results
        phaseResult.Checks = append(phaseResult.Checks, result)
    }
    return nil
}

Documentation ¶

Index ¶

type CleanupOptions
type Config
type Deployer
- func NewDeployer(clientset kubernetes.Interface, config Config) *Deployer
type GoTestEvent
type TestResult
type ValidationResult

Types ¶

type CleanupOptions ¶

type CleanupOptions struct {
	// Enabled determines whether to clean up resources
	Enabled bool
}

CleanupOptions controls what resources to remove during cleanup.

type Config ¶

type Config struct {
	// Namespace is where the validation Job will be deployed
	Namespace string

	// JobName is the name of the Kubernetes Job
	JobName string

	// Image is the container image to use (should contain aicr CLI)
	Image string

	// ImagePullSecrets for pulling the image from private registries
	ImagePullSecrets []string

	// ServiceAccountName for the Job pods
	ServiceAccountName string

	// Tolerations for scheduling on tainted nodes
	Tolerations []corev1.Toleration

	// Affinity specifies pod scheduling affinity rules.
	// Used to prefer CPU nodes for non-GPU validation Jobs.
	Affinity *corev1.Affinity

	// SnapshotConfigMap is the ConfigMap containing the snapshot data
	SnapshotConfigMap string

	// RecipeConfigMap is the ConfigMap containing the recipe data
	RecipeConfigMap string

	// TestPackage is the Go package path used to derive the pre-compiled test binary name.
	// filepath.Base(TestPackage) + ".test" gives the binary (e.g. "readiness.test").
	TestPackage string

	// TestPattern is the test name pattern to run (passed to -run flag)
	// Example: "TestGpuHardwareDetection"
	TestPattern string

	// ExpectedTests is the number of tests expected to run.
	// If set and actual tests differ, validation fails.
	ExpectedTests int

	// Timeout for the Job to complete
	Timeout time.Duration

	// Cleanup determines whether to remove Job and RBAC on completion
	Cleanup bool

	// Debug enables debug logging
	Debug bool
}

Config holds the configuration for deploying a validation agent Job.
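The TestPackage comment describes how the test binary name is derived. A small sketch of that rule follows. The testArgs helper is hypothetical: compiled Go test binaries accept -test.-prefixed flags such as -test.run, but whether the agent passes exactly these flags is an assumption.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// testBinaryName applies the rule from the TestPackage comment:
// filepath.Base(TestPackage) + ".test".
func testBinaryName(testPackage string) string {
	return filepath.Base(testPackage) + ".test"
}

// testArgs builds flags for a compiled test binary (hypothetical helper).
// An empty pattern means "run every test in the binary".
func testArgs(pattern string) []string {
	args := []string{"-test.v"}
	if pattern != "" {
		args = append(args, "-test.run", pattern)
	}
	return args
}

func main() {
	fmt.Println(testBinaryName("./pkg/validator/checks/deployment"))
	fmt.Println(testArgs("TestOperatorHealth"))
}
```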

type Deployer ¶

type Deployer struct {
	// contains filtered or unexported fields
}

Deployer manages the deployment and lifecycle of validation agent Jobs.

func NewDeployer ¶

func NewDeployer(clientset kubernetes.Interface, config Config) *Deployer

NewDeployer creates a new validation agent Deployer.

func (*Deployer) Cleanup ¶

func (d *Deployer) Cleanup(ctx context.Context, opts CleanupOptions) error

Cleanup removes the validation Job and RBAC resources. For multi-phase validation, prefer CleanupJob() per phase, then CleanupRBAC() once at the end.

func (*Deployer) CleanupJob ¶

func (d *Deployer) CleanupJob(ctx context.Context) error

CleanupJob removes the validation Job. Use this after each phase in multi-phase validation to clean up per-phase Jobs.

func (*Deployer) CleanupRBAC ¶

func (d *Deployer) CleanupRBAC(ctx context.Context) error

CleanupRBAC removes RBAC resources (ServiceAccount, Role, RoleBinding). Use this once at the end of multi-phase validation after all Jobs are done.

func (*Deployer) Deploy ¶

func (d *Deployer) Deploy(ctx context.Context) error

Deploy deploys the validation agent with all required resources (RBAC + Job). For multi-phase validation, prefer using EnsureRBAC() once, then DeployJob() per phase.

func (*Deployer) DeployJob ¶

func (d *Deployer) DeployJob(ctx context.Context) error

DeployJob deploys the validation Job. Assumes RBAC resources already exist (call EnsureRBAC first). This deletes any existing Job with the same name before creating a new one.

func (*Deployer) EnsureRBAC ¶

func (d *Deployer) EnsureRBAC(ctx context.Context) error

EnsureRBAC creates RBAC resources (ServiceAccount, Role, RoleBinding). It is idempotent: it is safe to call multiple times and reuses existing resources. For multi-phase validation, call this once before running multiple Jobs.
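The multi-phase call order recommended by these method docs (EnsureRBAC once, DeployJob and CleanupJob per phase, CleanupRBAC once at the end) can be sketched against an interface. The method signatures match the documented *Deployer methods; runPhases, fakeDeployer, simulate and the "performance" phase name are hypothetical and exist only to make the sketch runnable without a cluster.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// phaseDeployer lists the Deployer methods this sketch relies on.
type phaseDeployer interface {
	EnsureRBAC(ctx context.Context) error
	DeployJob(ctx context.Context) error
	WaitForCompletion(ctx context.Context, timeout time.Duration) error
	CleanupJob(ctx context.Context) error
	CleanupRBAC(ctx context.Context) error
}

// runPhases creates RBAC once, runs one Job per phase, stops at the first
// failing phase, and removes RBAC once at the end.
func runPhases(ctx context.Context, phases []string, timeout time.Duration,
	newDeployer func(phase string) phaseDeployer) (err error) {
	if len(phases) == 0 {
		return nil
	}
	shared := newDeployer(phases[0])
	if err := shared.EnsureRBAC(ctx); err != nil {
		return fmt.Errorf("ensure RBAC: %w", err)
	}
	defer func() {
		if cerr := shared.CleanupRBAC(ctx); cerr != nil && err == nil {
			err = fmt.Errorf("cleanup RBAC: %w", cerr)
		}
	}()
	for _, phase := range phases {
		d := newDeployer(phase)
		if err := d.DeployJob(ctx); err != nil {
			return fmt.Errorf("phase %s: deploy: %w", phase, err)
		}
		waitErr := d.WaitForCompletion(ctx, timeout)
		// Remove the per-phase Job whether or not it succeeded.
		if cerr := d.CleanupJob(ctx); cerr != nil && waitErr == nil {
			waitErr = cerr
		}
		if waitErr != nil {
			return fmt.Errorf("phase %s: %w", phase, waitErr) // early exit
		}
	}
	return nil
}

// fakeDeployer records calls instead of talking to a cluster.
type fakeDeployer struct {
	phase    string
	failWait bool
	calls    *[]string
}

func (f fakeDeployer) log(op string) { *f.calls = append(*f.calls, op) }

func (f fakeDeployer) EnsureRBAC(context.Context) error  { f.log("EnsureRBAC"); return nil }
func (f fakeDeployer) DeployJob(context.Context) error   { f.log("DeployJob " + f.phase); return nil }
func (f fakeDeployer) CleanupJob(context.Context) error  { f.log("CleanupJob " + f.phase); return nil }
func (f fakeDeployer) CleanupRBAC(context.Context) error { f.log("CleanupRBAC"); return nil }
func (f fakeDeployer) WaitForCompletion(context.Context, time.Duration) error {
	f.log("WaitForCompletion " + f.phase)
	if f.failWait {
		return errors.New("job failed")
	}
	return nil
}

// simulate runs three phases against the fake and returns the recorded calls.
func simulate(failPhase string) ([]string, error) {
	var calls []string
	phases := []string{"readiness", "deployment", "performance"}
	err := runPhases(context.Background(), phases, time.Minute, func(phase string) phaseDeployer {
		return fakeDeployer{phase: phase, failWait: phase == failPhase, calls: &calls}
	})
	return calls, err
}

func main() {
	calls, err := simulate("deployment")
	for _, c := range calls {
		fmt.Println(c)
	}
	fmt.Println("error:", err)
}
```

In this simulation the third phase never starts once the second fails, yet RBAC is still removed exactly once.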

func (*Deployer) GetPodLogs ¶

func (d *Deployer) GetPodLogs(ctx context.Context) (string, error)

GetPodLogs retrieves all pod logs as a string. This is useful for debugging a failed Job.

func (*Deployer) GetResult ¶

func (d *Deployer) GetResult(ctx context.Context) (*ValidationResult, error)

GetResult retrieves the validation result from the Job's pod logs.

func (*Deployer) StreamLogs ¶

func (d *Deployer) StreamLogs(ctx context.Context) error

StreamLogs streams logs from the validation Job pod. The signature takes no writer argument, so the output destination is determined by the Deployer itself.

func (*Deployer) WaitForCompletion ¶

func (d *Deployer) WaitForCompletion(ctx context.Context, timeout time.Duration) error

WaitForCompletion waits for the validation Job to complete successfully.

func (*Deployer) WaitForJobPodTermination ¶ added in v0.8.8

func (d *Deployer) WaitForJobPodTermination(ctx context.Context)

WaitForJobPodTermination waits for the Job's pod to reach a terminal state or be deleted. This prevents race conditions where RBAC resources are cleaned up while the pod is still running cleanup operations (e.g., chainsaw namespace deletion). The caller must provide a timeout-bounded context.

func (*Deployer) WaitForPodReady ¶ added in v0.8.10

func (d *Deployer) WaitForPodReady(ctx context.Context) error

WaitForPodReady waits for the Job's pod to appear and reach the Running phase. This is used before streaming logs so we have a pod to follow. The caller must provide a timeout-bounded context.
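WaitForPodReady and WaitForJobPodTermination both rely on the caller to bound the wait through the context. The sketch below shows that contract with a generic polling loop; waitUntil and simulateWait are hypothetical helpers, not part of this package. With the real Deployer, the caller would wrap the call in context.WithTimeout in the same way.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitUntil polls check every interval until it reports true, returns an
// error, or ctx is done. The only time limit is the one carried by ctx.
func waitUntil(ctx context.Context, interval time.Duration, check func() (bool, error)) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		done, err := check()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// simulateWait pretends a pod becomes ready on the readyOnPoll-th poll and
// bounds the wait with a timeout-carrying context, as the docs require.
func simulateWait(readyOnPoll, timeoutMillis int) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()
	polls := 0
	return waitUntil(ctx, 5*time.Millisecond, func() (bool, error) {
		polls++
		return polls >= readyOnPoll, nil
	})
}

// timedOut reports whether err was caused by the context deadline.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("ready in time:", simulateWait(3, 1000))
	err := simulateWait(1<<30, 30)
	fmt.Println("never ready:", err, "timed out:", timedOut(err))
}
```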

type GoTestEvent ¶

type GoTestEvent struct {
	Time    time.Time
	Action  string
	Package string
	Test    string
	Output  string
	Elapsed float64
}

GoTestEvent represents a single event from go test -json output.

type TestResult ¶

type TestResult struct {
	// Name is the test function name (e.g., "TestGpuHardwareDetection")
	Name string

	// Status is the test result (pass/fail/skip)
	Status string

	// Duration is how long the test took
	Duration time.Duration

	// Output contains the test output lines
	Output []string
}

TestResult represents the result of a single test function.
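A sketch of how go test -json events could be folded into per-test results follows. It is an illustration under assumptions, not the package's implementation: the local goTestEvent and testResult types copy only the fields needed here, and subtests (names such as TestA/sub) would simply appear as separate entries. In go test -json output, Elapsed is reported in seconds on the terminal pass, fail or skip event.

```go
package main

import (
	"fmt"
	"time"
)

// goTestEvent and testResult mirror a subset of the exported types.
type goTestEvent struct {
	Action  string
	Test    string
	Output  string
	Elapsed float64
}

type testResult struct {
	Name     string
	Status   string
	Duration time.Duration
	Output   []string
}

// collectResults groups events by test name, keeping first-seen order.
func collectResults(events []goTestEvent) []testResult {
	index := map[string]int{}
	var results []testResult
	for _, ev := range events {
		if ev.Test == "" {
			continue // package-level event, not tied to a test function
		}
		i, ok := index[ev.Test]
		if !ok {
			i = len(results)
			index[ev.Test] = i
			results = append(results, testResult{Name: ev.Test})
		}
		r := &results[i]
		switch ev.Action {
		case "output":
			r.Output = append(r.Output, ev.Output)
		case "pass", "fail", "skip":
			r.Status = ev.Action
			r.Duration = time.Duration(ev.Elapsed * float64(time.Second))
		}
	}
	return results
}

// sampleEvents imitates a short run with one passing and one skipped test.
func sampleEvents() []goTestEvent {
	return []goTestEvent{
		{Action: "start"},
		{Action: "run", Test: "TestOperatorHealth"},
		{Action: "output", Test: "TestOperatorHealth", Output: "=== RUN   TestOperatorHealth\n"},
		{Action: "pass", Test: "TestOperatorHealth", Elapsed: 0.25},
		{Action: "run", Test: "TestGpuHardwareDetection"},
		{Action: "skip", Test: "TestGpuHardwareDetection"},
		{Action: "pass", Elapsed: 0.3},
	}
}

func main() {
	for _, r := range collectResults(sampleEvents()) {
		fmt.Printf("%s: %s (%s, %d output lines)\n", r.Name, r.Status, r.Duration, len(r.Output))
	}
}
```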

type ValidationResult ¶

type ValidationResult struct {
	// CheckName is the name of the check that was run
	CheckName string

	// Phase is the validation phase
	Phase string

	// Status is the result status (pass/fail/skip)
	Status string

	// Message provides details about the result
	Message string

	// Duration is how long the check took
	Duration time.Duration

	// Details contains structured data about the result
	Details map[string]interface{}

	// Tests contains individual test results when parsing go test JSON output
	// Each entry represents a single test function that was executed
	Tests []TestResult
}

ValidationResult represents the result of running validation checks.
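To illustrate how Tests, Status and Config.ExpectedTests might relate, the sketch below derives an overall status from individual test results. The aggregation rules (any failure fails the check, all skipped means skip, a count mismatch fails when an expected count is set) are assumptions consistent with the field comments, not the package's confirmed logic; summarize is a hypothetical helper.

```go
package main

import "fmt"

// testResult mirrors the Name and Status fields of TestResult.
type testResult struct {
	Name   string
	Status string // pass, fail or skip
}

// summarize derives an overall status and message. expected mirrors
// Config.ExpectedTests: zero means "not set".
func summarize(tests []testResult, expected int) (status, message string) {
	if expected > 0 && len(tests) != expected {
		return "fail", fmt.Sprintf("expected %d tests, ran %d", expected, len(tests))
	}
	if len(tests) == 0 {
		return "skip", "no tests ran"
	}
	failed, skipped := 0, 0
	for _, t := range tests {
		switch t.Status {
		case "fail":
			failed++
		case "skip":
			skipped++
		}
	}
	switch {
	case failed > 0:
		return "fail", fmt.Sprintf("%d of %d tests failed", failed, len(tests))
	case skipped == len(tests):
		return "skip", "all tests skipped"
	default:
		return "pass", fmt.Sprintf("%d tests passed, %d skipped", len(tests)-skipped, skipped)
	}
}

func main() {
	tests := []testResult{
		{Name: "TestOperatorHealth", Status: "pass"},
		{Name: "TestGpuHardwareDetection", Status: "skip"},
	}
	status, message := summarize(tests, 2)
	fmt.Printf("Status: %s, Message: %s\n", status, message)
}
```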
