agent

package
v0.11.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 21, 2026 License: Apache-2.0 Imports: 21 Imported by: 0

Documentation

Overview

Package agent provides Kubernetes Job deployment for automated snapshot capture.

The agent package deploys a Kubernetes Job that runs aicr snapshot on GPU nodes and writes output to ConfigMap storage. It handles RBAC setup, Job lifecycle management, and snapshot retrieval.

Deployment Strategy

RBAC resources (ServiceAccount, Role, RoleBinding, ClusterRole, ClusterRoleBinding) are created idempotently - if they exist, they are reused. The Job is deleted and recreated for each snapshot to ensure clean state.

Usage Example

package main

import (
	"context"
	"time"

	"github.com/NVIDIA/aicr/pkg/k8s/agent"
	"github.com/NVIDIA/aicr/pkg/k8s/client"
)

func main() {
	ctx := context.Background()

	// Get Kubernetes client
	clientset, _, err := client.GetKubeClient()
	if err != nil {
		panic(err)
	}

	// Configure deployer
	config := agent.Config{
		Namespace: "gpu-operator",
		Image:     "ghcr.io/nvidia/aicr-validator:latest",
		Output:    "cm://gpu-operator/aicr-snapshot",
		NodeSelector: map[string]string{
			"nodeGroup": "customer-gpu",
		},
	}

	// Create deployer
	deployer := agent.NewDeployer(clientset, config)

	// Deploy RBAC and Job
	if err := deployer.Deploy(ctx); err != nil {
		panic(err)
	}

	// Wait for completion
	if err := deployer.WaitForCompletion(ctx, 5*time.Minute); err != nil {
		panic(err)
	}

	// Get snapshot
	snapshot, err := deployer.GetSnapshot(ctx)
	if err != nil {
		panic(err)
	}

	// Use snapshot...
}

Reconciliation

The deployer ensures idempotent operation:

  • RBAC resources: Created if missing, reused if exist
  • Job: Deleted and recreated for clean state each run
  • ConfigMap: Created or updated with latest snapshot

Testing

The package is designed for testability with Kubernetes fake clients:

import (
	"testing"
	"k8s.io/client-go/kubernetes/fake"
)

func TestDeployer(t *testing.T) {
	clientset := fake.NewSimpleClientset()
	deployer := agent.NewDeployer(clientset, agent.Config{
		Namespace: "test",
		Image:     "test:latest",
	})
	// Test deployment logic...
}

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CleanupOptions

type CleanupOptions struct {
	Enabled bool // If true, removes Job and all RBAC resources
}

CleanupOptions controls what resources to remove during cleanup.

type Config

type Config struct {
	Namespace          string
	ServiceAccountName string
	JobName            string
	Image              string
	ImagePullSecrets   []string
	NodeSelector       map[string]string
	Tolerations        []corev1.Toleration
	Output             string
	Debug              bool
	Privileged         bool   // If true, run with privileged security context (required for GPU/SystemD collectors)
	RequireGPU         bool   // If true, request nvidia.com/gpu resource (required for CDI environments)
	RuntimeClassName   string // If set, use this runtimeClassName on the pod and inject NVIDIA_VISIBLE_DEVICES=all (alternative to RequireGPU)
	MaxNodesPerEntry   int    // Max node names per topology entry (0 = unlimited)
}

Config holds the configuration for deploying the agent.

type Deployer

type Deployer struct {
	// contains filtered or unexported fields
}

Deployer manages the deployment and lifecycle of the agent Job.

func NewDeployer

func NewDeployer(clientset kubernetes.Interface, config Config) *Deployer

NewDeployer creates a new agent Deployer with the given configuration.

func (*Deployer) CheckPermissions

func (d *Deployer) CheckPermissions(ctx context.Context) ([]permissionCheck, error)

CheckPermissions verifies if the current user has the required permissions to deploy the agent. Returns a list of permission checks and an error if any required permissions are missing.

func (*Deployer) Cleanup

func (d *Deployer) Cleanup(ctx context.Context, opts CleanupOptions) error

Cleanup removes the agent Job and RBAC resources. If opts.Enabled is false, no cleanup is performed (resources are kept for debugging). All resources are attempted for deletion even if some fail, and a combined error is returned.

func (*Deployer) Deploy

func (d *Deployer) Deploy(ctx context.Context) error

Deploy deploys the agent with all required resources (RBAC + Job). This is the main entry point that orchestrates the deployment.

func (*Deployer) GetPodLogs

func (d *Deployer) GetPodLogs(ctx context.Context) (string, error)

GetPodLogs retrieves logs from the Job's Pod.

func (*Deployer) GetSnapshot

func (d *Deployer) GetSnapshot(ctx context.Context) ([]byte, error)

GetSnapshot retrieves the snapshot data from the ConfigMap created by the agent. Returns the snapshot YAML content.

func (*Deployer) StreamLogs

func (d *Deployer) StreamLogs(ctx context.Context, w io.Writer, prefix string) error

StreamLogs streams logs from the Job's Pod to the provided writer. It will follow the logs until the context is canceled. Returns when the context is canceled or an error occurs.

func (*Deployer) WaitForCompletion

func (d *Deployer) WaitForCompletion(ctx context.Context, timeout time.Duration) error

WaitForCompletion waits for the agent Job to complete successfully. Returns error if the Job fails or times out.

func (*Deployer) WaitForPodReady

func (d *Deployer) WaitForPodReady(ctx context.Context, timeout time.Duration) error

WaitForPodReady waits for the Job's Pod to be in Running state. This is useful for streaming logs before Job completes.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL