Documentation ¶
Overview ¶
Package agent provides Kubernetes Job deployment for automated snapshot capture.
The agent package deploys a Kubernetes Job that runs aicr snapshot on GPU nodes and writes output to ConfigMap storage. It handles RBAC setup, Job lifecycle management, and snapshot retrieval.
Deployment Strategy ¶
RBAC resources (ServiceAccount, Role, RoleBinding, ClusterRole, ClusterRoleBinding) are created idempotently. The ServiceAccount is created if missing and reused if it already exists. Mutable resources (Role, RoleBinding, ClusterRole, ClusterRoleBinding) use create-or-update semantics so stale rules from a previous run cannot persist.
The agent Namespace is created with an "app.kubernetes.io/managed-by=aicr" label; if the namespace pre-existed without that label, ensureNamespace patches the label rather than silently dropping intent.
The Job is deleted and recreated for each snapshot to ensure clean state. Job and Pod lifecycle waits use the Kubernetes watch API (not polling) for efficiency.
Usage Example ¶
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/NVIDIA/aicr/pkg/k8s/agent"
	"github.com/NVIDIA/aicr/pkg/k8s/client"
)

func main() {
	ctx := context.Background()

	// Get Kubernetes client
	clientset, _, err := client.GetKubeClient()
	if err != nil {
		panic(err)
	}

	// Configure deployer
	config := agent.Config{
		Namespace: "gpu-operator",
		Image:     "ghcr.io/nvidia/aicr-validator:latest",
		Output:    "cm://gpu-operator/aicr-snapshot",
		NodeSelector: map[string]string{
			"nodeGroup": "customer-gpu",
		},
	}

	// Create deployer
	deployer := agent.NewDeployer(clientset, config)

	// Deploy RBAC and Job
	if err := deployer.Deploy(ctx); err != nil {
		panic(err)
	}

	// Wait for completion
	if err := deployer.WaitForCompletion(ctx, 5*time.Minute); err != nil {
		panic(err)
	}

	// Get snapshot
	snapshot, err := deployer.GetSnapshot(ctx)
	if err != nil {
		panic(err)
	}

	// Use snapshot (here only its size is reported, so the variable is not left unused)
	fmt.Printf("snapshot: %d bytes\n", len(snapshot))
}
Reconciliation ¶
The deployer ensures idempotent operation:
- Namespace: Created with managed-by label, or patched if pre-existing
- Immutable RBAC (ServiceAccount): Created if missing, reused if exists
- Mutable RBAC (Role/RoleBinding/ClusterRole/ClusterRoleBinding): create-or-update semantics so stale rules cannot persist
- Job: Deleted and recreated for clean state each run; deletion is observed via watch (watch.Deleted event), not polling
- ConfigMap: Created or updated with latest snapshot
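The two reconciliation modes in the list above (reuse versus create-or-update) can be illustrated with a small in-memory sketch. Everything here is hypothetical: store, errAlreadyExists, ensureReused, and ensureUpdated are stand-ins invented for the example and do not reflect the package's internals. Against a real API server the "already exists" check would typically be apierrors.IsAlreadyExists from k8s.io/apimachinery/pkg/api/errors.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a hypothetical stand-in for the API server's
// AlreadyExists error.
var errAlreadyExists = errors.New("already exists")

// store is a hypothetical in-memory stand-in for one resource type's API,
// mapping object name to a string that represents its spec.
type store map[string]string

func (s store) create(name, spec string) error {
	if _, ok := s[name]; ok {
		return errAlreadyExists
	}
	s[name] = spec
	return nil
}

func (s store) update(name, spec string) { s[name] = spec }

// ensureReused creates the object if it is missing and otherwise leaves the
// existing object untouched (the ServiceAccount case).
func ensureReused(s store, name, spec string) error {
	err := s.create(name, spec)
	if errors.Is(err, errAlreadyExists) {
		return nil
	}
	return err
}

// ensureUpdated creates the object if it is missing and overwrites it when it
// already exists, so rules left over from an earlier run are replaced.
func ensureUpdated(s store, name, spec string) error {
	err := s.create(name, spec)
	if errors.Is(err, errAlreadyExists) {
		s.update(name, spec)
		return nil
	}
	return err
}

func main() {
	s := store{"sa": "old", "role": "old-rules"}
	_ = ensureReused(s, "sa", "new")
	_ = ensureUpdated(s, "role", "new-rules")
	fmt.Println(s["sa"], s["role"]) // prints: old new-rules
}
```

Both helpers are safe to call repeatedly, which is what makes a second Deploy against the same cluster idempotent in this sketch.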
Testing ¶
The package is designed for testability with Kubernetes fake clients:
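import (
	"testing"

	"k8s.io/client-go/kubernetes/fake"

	"github.com/NVIDIA/aicr/pkg/k8s/agent"
)

func TestDeployer(t *testing.T) {
	// The fake clientset keeps objects in memory, so no cluster is needed.
	clientset := fake.NewSimpleClientset()
	deployer := agent.NewDeployer(clientset, agent.Config{
		Namespace: "test",
		Image:     "test:latest",
	})
	if deployer == nil {
		t.Fatal("NewDeployer returned nil")
	}
	// Test deployment logic...
}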
Index ¶
- type CleanupOptions
- type Config
- type Deployer
- func (d *Deployer) CheckPermissions(ctx context.Context) ([]permissionCheck, error)
- func (d *Deployer) Cleanup(ctx context.Context, opts CleanupOptions) error
- func (d *Deployer) Deploy(ctx context.Context) error
- func (d *Deployer) GetPodLogs(ctx context.Context) (string, error)
- func (d *Deployer) GetSnapshot(ctx context.Context) ([]byte, error)
- func (d *Deployer) StreamLogs(ctx context.Context, w io.Writer, prefix string) error
- func (d *Deployer) WaitForCompletion(ctx context.Context, timeout time.Duration) error
- func (d *Deployer) WaitForPodReady(ctx context.Context, timeout time.Duration) error
Types ¶
type CleanupOptions ¶
type CleanupOptions struct {
	Enabled bool // If true, removes Job and all RBAC resources
}
CleanupOptions controls what resources to remove during cleanup.
type Config ¶
type Config struct {
	Namespace          string
	ServiceAccountName string
	JobName            string
	Image              string
	ImagePullSecrets   []string
	NodeSelector       map[string]string
	Tolerations        []corev1.Toleration
	Output             string
	Debug              bool
	Privileged         bool   // If true, run with privileged security context (required for GPU/SystemD collectors)
	RequireGPU         bool   // If true, request nvidia.com/gpu resource (required for CDI environments)
	RuntimeClassName   string // If set, use this runtimeClassName on the pod and inject NVIDIA_VISIBLE_DEVICES=all (alternative to RequireGPU)
	MaxNodesPerEntry   int    // Max node names per topology entry (0 = unlimited)
	OS                 string // Recipe OS criteria value. When set to oskind.Talos, systemd hostPath mounts are skipped and the in-pod agent uses the Talos service backend.

	// Requests overrides the per-resource container requests on the agent pod.
	// When nil, the privileged/restricted defaults in job.go are used. Keys
	// must match standard Kubernetes resource names (cpu, memory,
	// ephemeral-storage); unknown keys are passed through unchanged.
	Requests corev1.ResourceList

	// Limits overrides the per-resource container limits on the agent pod.
	// When nil, the privileged/restricted defaults in job.go are used.
	// RequireGPU adds nvidia.com/gpu=1 to the merged limits ONLY when the
	// caller did not already supply that key, so a caller can request
	// e.g. nvidia.com/gpu=4 alongside RequireGPU and keep their value.
	Limits corev1.ResourceList
}
Config holds the configuration for deploying the agent.
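The interaction between Limits and RequireGPU described in the field comments can be sketched as a small merge function. This is an illustration under stated assumptions, not the code in job.go: mergeLimits is a hypothetical helper, plain map[string]string stands in for corev1.ResourceList, the default values are invented, and the per-key overlay of caller values on top of defaults is one reading of "merged limits".

```go
package main

import "fmt"

const gpuResource = "nvidia.com/gpu"

// mergeLimits is a hypothetical helper. It copies the defaults, overlays the
// caller's overrides key by key, and adds nvidia.com/gpu=1 only when
// requireGPU is set and the caller did not supply that key.
func mergeLimits(defaults, overrides map[string]string, requireGPU bool) map[string]string {
	merged := make(map[string]string, len(defaults)+len(overrides)+1)
	for k, v := range defaults {
		merged[k] = v
	}
	for k, v := range overrides { // ranging over a nil map is a no-op
		merged[k] = v
	}
	if requireGPU {
		if _, ok := merged[gpuResource]; !ok {
			merged[gpuResource] = "1"
		}
	}
	return merged
}

func main() {
	// Invented defaults; the real ones live in job.go.
	defaults := map[string]string{"cpu": "2", "memory": "4Gi"}

	fmt.Println(mergeLimits(defaults, nil, true)[gpuResource])                                // prints: 1
	fmt.Println(mergeLimits(defaults, map[string]string{gpuResource: "4"}, true)[gpuResource]) // prints: 4
}
```

The second call shows the case the field comment calls out: a caller who asks for four GPUs keeps that value even though RequireGPU is also set.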
type Deployer ¶
type Deployer struct {
	// contains filtered or unexported fields
}
Deployer manages the deployment and lifecycle of the agent Job.
func NewDeployer ¶
func NewDeployer(clientset kubernetes.Interface, config Config) *Deployer
NewDeployer creates a new agent Deployer with the given configuration.
func (*Deployer) CheckPermissions ¶
func (d *Deployer) CheckPermissions(ctx context.Context) ([]permissionCheck, error)
CheckPermissions verifies whether the current user has the permissions required to deploy the agent. It returns the list of permission checks, plus an error if any required permission is missing.
func (*Deployer) Cleanup ¶
func (d *Deployer) Cleanup(ctx context.Context, opts CleanupOptions) error
Cleanup removes the agent Job and RBAC resources. If opts.Enabled is false, no cleanup is performed (resources are kept for debugging). Deletion is attempted for every resource even if some deletions fail, and a combined error is returned. Deletions are fanned out concurrently, so with a slow API server the total cleanup time is not the sum of the individual deletion latencies.
func (*Deployer) Deploy ¶
func (d *Deployer) Deploy(ctx context.Context) error
Deploy creates the agent together with all resources it requires (RBAC and Job). It is the main entry point and orchestrates the whole deployment.
func (*Deployer) GetPodLogs ¶
func (d *Deployer) GetPodLogs(ctx context.Context) (string, error)
GetPodLogs retrieves logs from the Job's Pod.
func (*Deployer) GetSnapshot ¶
func (d *Deployer) GetSnapshot(ctx context.Context) ([]byte, error)
GetSnapshot retrieves the snapshot data from the ConfigMap created by the agent. Returns the snapshot YAML content.
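The usage example sets Output to "cm://gpu-operator/aicr-snapshot", which suggests a cm://<namespace>/<name> form for ConfigMap destinations. The sketch below shows one way such a value could be split into its parts with net/url. The format is inferred from that single example, and parseConfigMapURI is a hypothetical helper rather than a function of this package, so the package's own parsing may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseConfigMapURI is a hypothetical helper that splits a value of the
// assumed form cm://<namespace>/<name> into namespace and ConfigMap name.
func parseConfigMapURI(raw string) (namespace, name string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", fmt.Errorf("parse %q: %w", raw, err)
	}
	// url.Parse puts the namespace in Host and "/<name>" in Path.
	name = strings.TrimPrefix(u.Path, "/")
	if u.Scheme != "cm" || u.Host == "" || name == "" || strings.Contains(name, "/") {
		return "", "", fmt.Errorf("expected cm://<namespace>/<name>, got %q", raw)
	}
	return u.Host, name, nil
}

func main() {
	ns, name, err := parseConfigMapURI("cm://gpu-operator/aicr-snapshot")
	fmt.Println(ns, name, err) // prints: gpu-operator aicr-snapshot <nil>
}
```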
func (*Deployer) StreamLogs ¶
func (d *Deployer) StreamLogs(ctx context.Context, w io.Writer, prefix string) error
StreamLogs streams logs from the Job's Pod to the provided writer. It follows the logs and returns when the context is canceled or an error occurs.
func (*Deployer) WaitForCompletion ¶
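func (d *Deployer) WaitForCompletion(ctx context.Context, timeout time.Duration) error
WaitForCompletion waits for the agent Job to complete successfully. It returns an error if the Job fails or the timeout elapses.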