agent

package

v0.14.0 (latest) Published: Jun 1, 2026 License: Apache-2.0 Imports: 25 Imported by: 0

Repository

github.com/NVIDIA/aicr

Documentation ¶

Overview ¶

Package agent provides Kubernetes Job deployment for automated snapshot capture.

The agent package deploys a Kubernetes Job that runs aicr snapshot on GPU nodes and writes output to ConfigMap storage. It handles RBAC setup, Job lifecycle management, and snapshot retrieval.

Deployment Strategy ¶

RBAC resources are reconciled idempotently. The ServiceAccount is created if missing and reused if it already exists. Mutable resources (Role, RoleBinding, ClusterRole, ClusterRoleBinding) use create-or-update semantics so stale rules from a previous run cannot persist.

The agent Namespace is created with an "app.kubernetes.io/managed-by=aicr" label; if the namespace pre-existed without that label, ensureNamespace patches the label rather than silently dropping intent.

The Job is deleted and recreated for each snapshot to ensure clean state. Job and Pod lifecycle waits use the Kubernetes watch API (not polling) for efficiency.

Usage Example ¶

package main

import (
	"context"
	"time"

	"github.com/NVIDIA/aicr/pkg/k8s/agent"
	"github.com/NVIDIA/aicr/pkg/k8s/client"
)

func main() {
	ctx := context.Background()

	// Get Kubernetes client
	clientset, _, err := client.GetKubeClient()
	if err != nil {
		panic(err)
	}

	// Configure deployer
	config := agent.Config{
		Namespace: "gpu-operator",
		Image:     "ghcr.io/nvidia/aicr-validator:latest",
		Output:    "cm://gpu-operator/aicr-snapshot",
		NodeSelector: map[string]string{
			"nodeGroup": "customer-gpu",
		},
	}

	// Create deployer
	deployer := agent.NewDeployer(clientset, config)

	// Deploy RBAC and Job
	if err := deployer.Deploy(ctx); err != nil {
		panic(err)
	}

	// Wait for completion
	if err := deployer.WaitForCompletion(ctx, 5*time.Minute); err != nil {
		panic(err)
	}

	// Get snapshot
	snapshot, err := deployer.GetSnapshot(ctx)
	if err != nil {
		panic(err)
	}

	// Use snapshot... (the blank assignment keeps the example compiling
	// until real handling is added)
	_ = snapshot
}
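
The Output value in the example above is a cm:// URI. As a minimal sketch, assuming the layout is cm://<namespace>/<name> (inferred from the example value only, not confirmed by the package documentation), such a value could be split with net/url. The parseConfigMapURI helper is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseConfigMapURI splits a cm:// URI into namespace and ConfigMap name.
// The cm://<namespace>/<name> layout is an assumption, not a documented format.
func parseConfigMapURI(raw string) (namespace, name string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	name = strings.TrimPrefix(u.Path, "/")
	if u.Scheme != "cm" || u.Host == "" || name == "" || strings.Contains(name, "/") {
		return "", "", fmt.Errorf("invalid ConfigMap URI %q", raw)
	}
	return u.Host, name, nil
}

func main() {
	ns, name, err := parseConfigMapURI("cm://gpu-operator/aicr-snapshot")
	if err != nil {
		panic(err)
	}
	fmt.Println(ns, name)
}
```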

Reconciliation ¶

The deployer ensures idempotent operation:

Namespace: Created with managed-by label, or patched if pre-existing
Immutable RBAC (ServiceAccount): Created if missing, reused if exists
Mutable RBAC (Role/RoleBinding/ClusterRole/ClusterRoleBinding): create-or-update semantics so stale rules cannot persist
Job: Deleted and recreated for clean state each run; deletion is observed via watch (watch.Deleted event), not polling
ConfigMap: Created or updated with latest snapshot
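
The create-or-update behavior listed above can be sketched against an in-memory store. The store type, its methods, and errAlreadyExists are hypothetical stand-ins for the API server and its "already exists" error; the real package talks to Kubernetes through client-go.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists mimics the error an API server returns on a duplicate create.
var errAlreadyExists = errors.New("already exists")

// store is a hypothetical in-memory stand-in for the API server,
// mapping a resource name to its rules.
type store map[string][]string

func (s store) create(name string, rules []string) error {
	if _, ok := s[name]; ok {
		return errAlreadyExists
	}
	s[name] = rules
	return nil
}

func (s store) update(name string, rules []string) error {
	if _, ok := s[name]; !ok {
		return errors.New("not found")
	}
	s[name] = rules
	return nil
}

// createOrUpdate tries create first and falls back to update when the
// resource already exists, so the stored rules always match the desired set.
func createOrUpdate(s store, name string, rules []string) error {
	err := s.create(name, rules)
	if errors.Is(err, errAlreadyExists) {
		return s.update(name, rules)
	}
	return err
}

func main() {
	// A stale rule left over from a previous run.
	s := store{"aicr": {"get pods", "delete nodes"}}
	if err := createOrUpdate(s, "aicr", []string{"get pods"}); err != nil {
		panic(err)
	}
	fmt.Println(s["aicr"])
}
```

A create-if-missing strategy would have kept the stale rule in this scenario, which is why it is reserved for the ServiceAccount.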

Testing ¶

The package is designed for testability with Kubernetes fake clients:

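package agent_test

import (
	"testing"

	"k8s.io/client-go/kubernetes/fake"

	"github.com/NVIDIA/aicr/pkg/k8s/agent"
)

func TestDeployer(t *testing.T) {
	// The fake clientset serves an in-memory API, so no cluster is needed.
	clientset := fake.NewSimpleClientset()
	deployer := agent.NewDeployer(clientset, agent.Config{
		Namespace: "test",
		Image:     "test:latest",
	})
	if deployer == nil {
		t.Fatal("NewDeployer returned nil")
	}
	// Test deployment logic...
}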

Index ¶

type CleanupOptions
type Config
type Deployer
- func NewDeployer(clientset kubernetes.Interface, config Config) *Deployer

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type CleanupOptions ¶

type CleanupOptions struct {
	Enabled bool // If true, removes Job and all RBAC resources
}

CleanupOptions controls what resources to remove during cleanup.

type Config ¶

type Config struct {
	Namespace          string
	ServiceAccountName string
	JobName            string
	Image              string
	ImagePullSecrets   []string
	NodeSelector       map[string]string
	Tolerations        []corev1.Toleration
	Output             string
	Debug              bool
	Privileged         bool   // If true, run with privileged security context (required for GPU/SystemD collectors)
	RequireGPU         bool   // If true, request nvidia.com/gpu resource (required for CDI environments)
	RuntimeClassName   string // If set, use this runtimeClassName on the pod and inject NVIDIA_VISIBLE_DEVICES=all (alternative to RequireGPU)
	MaxNodesPerEntry   int    // Max node names per topology entry (0 = unlimited)
	OS                 string // Recipe OS criteria value. When set to oskind.Talos, systemd hostPath mounts are skipped and the in-pod agent uses the Talos service backend.

	// Requests overrides the per-resource container requests on the agent pod.
	// When nil, the privileged/restricted defaults in job.go are used. Keys
	// must match standard Kubernetes resource names (cpu, memory,
	// ephemeral-storage); unknown keys are passed through unchanged.
	Requests corev1.ResourceList

	// Limits overrides the per-resource container limits on the agent pod.
	// When nil, the privileged/restricted defaults in job.go are used.
	// RequireGPU adds nvidia.com/gpu=1 to the merged limits ONLY when the
	// caller did not already supply that key, so a caller can request
	// e.g. nvidia.com/gpu=4 alongside RequireGPU and keep their value.
	Limits corev1.ResourceList
}

Config holds the configuration for deploying the agent.
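
The interaction between Limits and RequireGPU described in the field comments can be sketched as follows. This is an illustration under assumptions: resourceList is a simplified stand-in for corev1.ResourceList, mergeLimits is a hypothetical helper, the default values are made up, and the per-key overlay of caller values on defaults is an interpretation of "merged limits" rather than a confirmed detail of job.go.

```go
package main

import "fmt"

// resourceList is a simplified stand-in for corev1.ResourceList.
type resourceList map[string]string

const gpuResource = "nvidia.com/gpu"

// mergeLimits overlays caller-supplied limits on the defaults, then adds a
// single GPU only when requireGPU is set and the key is still absent.
func mergeLimits(defaults, overrides resourceList, requireGPU bool) resourceList {
	merged := resourceList{}
	for k, v := range defaults {
		merged[k] = v
	}
	for k, v := range overrides {
		merged[k] = v
	}
	if requireGPU {
		if _, ok := merged[gpuResource]; !ok {
			merged[gpuResource] = "1"
		}
	}
	return merged
}

func main() {
	defaults := resourceList{"cpu": "1", "memory": "1Gi"} // hypothetical defaults

	// The caller asks for four GPUs; RequireGPU must not overwrite that value.
	fmt.Println(mergeLimits(defaults, resourceList{gpuResource: "4"}, true)[gpuResource])

	// No caller value; RequireGPU fills in a single GPU.
	fmt.Println(mergeLimits(defaults, nil, true)[gpuResource])
}
```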

type Deployer ¶

type Deployer struct {
	// contains filtered or unexported fields
}

Deployer manages the deployment and lifecycle of the agent Job.

func NewDeployer ¶

func NewDeployer(clientset kubernetes.Interface, config Config) *Deployer

NewDeployer creates a new agent Deployer with the given configuration.

func (*Deployer) CheckPermissions ¶

func (d *Deployer) CheckPermissions(ctx context.Context) ([]permissionCheck, error)

CheckPermissions verifies whether the current user has the permissions required to deploy the agent. It returns the list of permission checks, plus an error if any required permission is missing.

func (*Deployer) Cleanup ¶

func (d *Deployer) Cleanup(ctx context.Context, opts CleanupOptions) error

Cleanup removes the agent Job and RBAC resources. If opts.Enabled is false, no cleanup is performed (resources are kept for debugging). Deletion is attempted for every resource even if some deletions fail, and a combined error is returned. Deletions are fanned out concurrently, so with a slow apiserver the total cleanup time is not the sum of the individual deletion latencies.

func (*Deployer) Deploy ¶

func (d *Deployer) Deploy(ctx context.Context) error

Deploy deploys the agent with all required resources (RBAC + Job). This is the main entry point that orchestrates the deployment.

func (*Deployer) GetPodLogs ¶

func (d *Deployer) GetPodLogs(ctx context.Context) (string, error)

GetPodLogs retrieves logs from the Job's Pod.

func (*Deployer) GetSnapshot ¶

func (d *Deployer) GetSnapshot(ctx context.Context) ([]byte, error)

GetSnapshot retrieves the snapshot data from the ConfigMap created by the agent. Returns the snapshot YAML content.

func (*Deployer) StreamLogs ¶

func (d *Deployer) StreamLogs(ctx context.Context, w io.Writer, prefix string) error

StreamLogs streams logs from the Job's Pod to the provided writer. It will follow the logs until the context is canceled. Returns when the context is canceled or an error occurs.

func (*Deployer) WaitForCompletion ¶

func (d *Deployer) WaitForCompletion(ctx context.Context, timeout time.Duration) error

WaitForCompletion waits for the agent Job to complete successfully. It returns an error if the Job fails or the timeout elapses.

func (*Deployer) WaitForPodReady ¶

func (d *Deployer) WaitForPodReady(ctx context.Context, timeout time.Duration) error

WaitForPodReady waits for the Job's Pod to reach the Running state. This is useful for streaming logs before the Job completes.
