Documentation
¶
Overview ¶
Package agent provides Kubernetes Job deployment for automated snapshot capture.
The agent package deploys a Kubernetes Job that runs aicr snapshot on GPU nodes and writes output to ConfigMap storage. It handles RBAC setup, Job lifecycle management, and snapshot retrieval.
Deployment Strategy ¶
RBAC resources (ServiceAccount, Role, RoleBinding, ClusterRole, ClusterRoleBinding) are created idempotently - if they exist, they are reused. The Job is deleted and recreated for each snapshot to ensure clean state.
Usage Example ¶
package main
import (
"context"
"time"
"github.com/NVIDIA/aicr/pkg/k8s/agent"
"github.com/NVIDIA/aicr/pkg/k8s/client"
)
func main() {
ctx := context.Background()
// Get Kubernetes client
clientset, _, err := client.GetKubeClient()
if err != nil {
panic(err)
}
// Configure deployer
config := agent.Config{
Namespace: "gpu-operator",
Image: "ghcr.io/nvidia/aicr-validator:latest",
Output: "cm://gpu-operator/aicr-snapshot",
NodeSelector: map[string]string{
"nodeGroup": "customer-gpu",
},
}
// Create deployer
deployer := agent.NewDeployer(clientset, config)
// Deploy RBAC and Job
if err := deployer.Deploy(ctx); err != nil {
panic(err)
}
// Wait for completion
if err := deployer.WaitForCompletion(ctx, 5*time.Minute); err != nil {
panic(err)
}
// Get snapshot
snapshot, err := deployer.GetSnapshot(ctx)
if err != nil {
panic(err)
}
// Use snapshot...
}
Reconciliation ¶
The deployer ensures idempotent operation:
- RBAC resources: Created if missing, reused if exist
- Job: Deleted and recreated for clean state each run
- ConfigMap: Created or updated with latest snapshot
Testing ¶
The package is designed for testability with Kubernetes fake clients:
import (
"testing"
"k8s.io/client-go/kubernetes/fake"
)
func TestDeployer(t *testing.T) {
clientset := fake.NewSimpleClientset()
deployer := agent.NewDeployer(clientset, agent.Config{
Namespace: "test",
Image: "test:latest",
})
// Test deployment logic...
}
Index ¶
- type CleanupOptions
- type Config
- type Deployer
- func (d *Deployer) CheckPermissions(ctx context.Context) ([]permissionCheck, error)
- func (d *Deployer) Cleanup(ctx context.Context, opts CleanupOptions) error
- func (d *Deployer) Deploy(ctx context.Context) error
- func (d *Deployer) GetPodLogs(ctx context.Context) (string, error)
- func (d *Deployer) GetSnapshot(ctx context.Context) ([]byte, error)
- func (d *Deployer) StreamLogs(ctx context.Context, w io.Writer, prefix string) error
- func (d *Deployer) WaitForCompletion(ctx context.Context, timeout time.Duration) error
- func (d *Deployer) WaitForPodReady(ctx context.Context, timeout time.Duration) error
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type CleanupOptions ¶
type CleanupOptions struct {
Enabled bool // If true, removes Job and all RBAC resources
}
CleanupOptions controls what resources to remove during cleanup.
type Config ¶
type Config struct {
Namespace string
ServiceAccountName string
JobName string
Image string
ImagePullSecrets []string
NodeSelector map[string]string
Tolerations []corev1.Toleration
Output string
Debug bool
Privileged bool // If true, run with privileged security context (required for GPU/SystemD collectors)
RequireGPU bool // If true, request nvidia.com/gpu resource (required for CDI environments)
RuntimeClassName string // If set, use this runtimeClassName on the pod and inject NVIDIA_VISIBLE_DEVICES=all (alternative to RequireGPU)
MaxNodesPerEntry int // Max node names per topology entry (0 = unlimited)
}
Config holds the configuration for deploying the agent.
type Deployer ¶
type Deployer struct {
// contains filtered or unexported fields
}
Deployer manages the deployment and lifecycle of the agent Job.
func NewDeployer ¶
func NewDeployer(clientset kubernetes.Interface, config Config) *Deployer
NewDeployer creates a new agent Deployer with the given configuration.
func (*Deployer) CheckPermissions ¶
CheckPermissions verifies if the current user has the required permissions to deploy the agent. Returns a list of permission checks and an error if any required permissions are missing.
func (*Deployer) Cleanup ¶
func (d *Deployer) Cleanup(ctx context.Context, opts CleanupOptions) error
Cleanup removes the agent Job and RBAC resources. If opts.Enabled is false, no cleanup is performed (resources are kept for debugging). All resources are attempted for deletion even if some fail, and a combined error is returned.
func (*Deployer) Deploy ¶
Deploy deploys the agent with all required resources (RBAC + Job). This is the main entry point that orchestrates the deployment.
func (*Deployer) GetPodLogs ¶
GetPodLogs retrieves logs from the Job's Pod.
func (*Deployer) GetSnapshot ¶
GetSnapshot retrieves the snapshot data from the ConfigMap created by the agent. Returns the snapshot YAML content.
func (*Deployer) StreamLogs ¶
StreamLogs streams logs from the Job's Pod to the provided writer. It will follow the logs until the context is canceled. Returns when the context is canceled or an error occurs.
func (*Deployer) WaitForCompletion ¶
WaitForCompletion waits for the agent Job to complete successfully. Returns error if the Job fails or times out.