Documentation ¶
Overview ¶
Package snapshotter captures comprehensive system configuration snapshots.
The snapshotter package orchestrates parallel collection of system measurements from multiple sources (Kubernetes, GPU, OS, SystemD) and produces structured snapshots that can be serialized for analysis, auditing, or recommendation generation.
Core Types ¶
Snapshotter: Interface for snapshot collection
type Snapshotter interface {
	Measure(ctx context.Context) error
}
NodeSnapshotter: Production implementation that collects from the current node
type NodeSnapshotter struct {
	Version    string                // Snapshotter version
	Factory    collector.Factory     // Collector factory (optional)
	Serializer serializer.Serializer // Output serializer (optional)
}
Snapshot: Captured configuration data
type Snapshot struct {
	header.Header                           // API version, kind, metadata
	Measurements []*measurement.Measurement // Collected data
}
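Because callers only need the Measure method, code that triggers a snapshot can accept the interface and be exercised with a stand-in. The sketch below is illustrative only: it redeclares the interface locally so that it compiles on its own, and fakeSnapshotter, runSnapshot, and demo are hypothetical names that are not part of the package.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Snapshotter mirrors the interface shown above. It is redeclared here
// only so the sketch compiles without the real package.
type Snapshotter interface {
	Measure(ctx context.Context) error
}

// fakeSnapshotter is a hypothetical test double that counts calls and
// returns a preconfigured error.
type fakeSnapshotter struct {
	calls int
	err   error
}

func (f *fakeSnapshotter) Measure(ctx context.Context) error {
	f.calls++
	// Report cancellation first, then the configured error (if any).
	if err := ctx.Err(); err != nil {
		return err
	}
	return f.err
}

// runSnapshot is a hypothetical caller that depends only on the interface.
func runSnapshot(ctx context.Context, s Snapshotter) error {
	if err := s.Measure(ctx); err != nil {
		return fmt.Errorf("snapshot failed: %w", err)
	}
	return nil
}

// demo runs one succeeding and one failing fake through runSnapshot.
func demo() (okErr, failErr error) {
	ctx := context.Background()
	okErr = runSnapshot(ctx, &fakeSnapshotter{})
	failErr = runSnapshot(ctx, &fakeSnapshotter{err: errors.New("collector unavailable")})
	return okErr, failErr
}

func main() {
	okErr, failErr := demo()
	fmt.Println("success case:", okErr)
	fmt.Println("failure case:", failErr)
}
```

Wrapping with %w keeps the underlying error available to errors.Is and errors.As in the caller.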
Usage ¶
Basic snapshot with defaults (stdout YAML):
// Factory and Serializer are left nil, so the defaults are used.
s := &snapshotter.NodeSnapshotter{
	Version: "v1.0.0",
}
ctx := context.Background()
if err := s.Measure(ctx); err != nil {
	log.Fatalf("snapshot failed: %v", err)
}
Custom collector factory:
factory := collector.NewDefaultFactory(
	collector.WithSystemDServices([]string{"containerd.service"}),
)
s := &snapshotter.NodeSnapshotter{
	Version: "v1.0.0",
	Factory: factory,
}
if err := s.Measure(context.Background()); err != nil {
	log.Fatal(err)
}
Custom output serializer:
out, err := serializer.NewFileSerializer("snapshot.json")
if err != nil {
	log.Fatal(err)
}
defer out.Close()
s := &snapshotter.NodeSnapshotter{
	Version:    "v1.0.0",
	Serializer: out,
}
if err := s.Measure(context.Background()); err != nil {
	log.Fatal(err)
}
With timeout:
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
s := &snapshotter.NodeSnapshotter{Version: "v1.0.0"}
if err := s.Measure(ctx); err != nil {
	log.Fatal(err)
}
Snapshot Structure ¶
Snapshots contain a header and measurements:
apiVersion: aicr.nvidia.com/v1alpha1
kind: Snapshot
metadata:
  version: v1.0.0
  source: node-1
  timestamp: 2025-01-15T10:30:00Z
measurements:
  - type: K8s
    subtypes:
      - subtype: server
        data:
          version: 1.33.5
          platform: linux/amd64
      - subtype: node
        data:
          provider: eks
          kernel-version: 6.8.0
      - subtype: image
        data:
          kube-apiserver: v1.33.5
      - subtype: policy
        data:
          driver.version: 570.86.16
      - subtype: helm
        data:
          gpu-operator.chart: gpu-operator
          gpu-operator.version: 25.3.0
      - subtype: argocd
        data:
          gpu-operator.source.chart: gpu-operator
          gpu-operator.syncStatus: Synced
  - type: GPU
    subtypes:
      - subtype: device
        data:
          driver: 570.158.01
          model: H100
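The type / subtype / data nesting above can be walked with two loops. The following is a minimal sketch under stated assumptions: the subtype and measurement structs are hypothetical stand-ins that mirror the YAML layout (the real definitions live in pkg/measurement and may differ), values are simplified to strings, and lookup and sample are illustrative helpers, not package functions.

```go
package main

import "fmt"

// subtype and measurement are hypothetical stand-ins that mirror the YAML
// layout above; they are not the types from pkg/measurement.
type subtype struct {
	Name string
	Data map[string]string
}

type measurement struct {
	Type     string
	Subtypes []subtype
}

// lookup returns the value stored under key in the first subtype that
// matches the given type and subtype names.
func lookup(ms []measurement, typ, sub, key string) (string, bool) {
	for _, m := range ms {
		if m.Type != typ {
			continue
		}
		for _, s := range m.Subtypes {
			if s.Name != sub {
				continue
			}
			v, ok := s.Data[key]
			return v, ok
		}
	}
	return "", false
}

// sample reproduces part of the example snapshot shown above.
func sample() []measurement {
	return []measurement{
		{Type: "K8s", Subtypes: []subtype{
			{Name: "server", Data: map[string]string{"version": "1.33.5", "platform": "linux/amd64"}},
			{Name: "policy", Data: map[string]string{"driver.version": "570.86.16"}},
		}},
		{Type: "GPU", Subtypes: []subtype{
			{Name: "device", Data: map[string]string{"driver": "570.158.01", "model": "H100"}},
		}},
	}
}

func main() {
	driver, ok := lookup(sample(), "GPU", "device", "driver")
	fmt.Println(driver, ok)
}
```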
Parallel Collection ¶
NodeSnapshotter runs all collectors concurrently using errgroup:
- Metadata collection (node name, version)
- Kubernetes resources (cluster config, policies)
- SystemD services (containerd, kubelet)
- OS configuration (grub, sysctl, modules)
- GPU hardware (driver, model, settings)
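Collector failures are handled as described for Measure: a failing collector is logged and skipped, and the snapshot keeps every measurement that was collected successfully.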
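The concurrent fan-out can be sketched with the standard library alone. This is not the package's implementation: it uses sync.WaitGroup instead of errgroup so that it runs without external modules, collectFn and collectAll are hypothetical names, and the failure policy (abort or continue) is deliberately left to the caller, which receives successes and failures separately.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// collectFn is a hypothetical collector signature used only in this sketch.
type collectFn func(ctx context.Context) (string, error)

// collectAll runs every collector in its own goroutine and waits for all of
// them. It returns the output of the collectors that succeeded and the
// errors of those that failed, both keyed by collector name.
func collectAll(ctx context.Context, collectors map[string]collectFn) (map[string]string, map[string]error) {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string)
		failed  = make(map[string]error)
	)
	for name, fn := range collectors {
		wg.Add(1)
		go func(name string, fn collectFn) {
			defer wg.Done()
			out, err := fn(ctx)
			// The maps are shared between goroutines, so guard every write.
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				failed[name] = err
				return
			}
			results[name] = out
		}(name, fn)
	}
	wg.Wait()
	return results, failed
}

// demo wires up two hypothetical collectors, one of which fails.
func demo() (map[string]string, map[string]error) {
	return collectAll(context.Background(), map[string]collectFn{
		"os":  func(context.Context) (string, error) { return "sysctl ok", nil },
		"gpu": func(context.Context) (string, error) { return "", errors.New("nvidia-smi not found") },
	})
}

func main() {
	results, failed := demo()
	fmt.Println(len(results), "collected,", len(failed), "failed")
}
```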
Node Name Detection ¶
Node name is determined with fallback priority:
- NODE_NAME environment variable
- KUBERNETES_NODE_NAME environment variable
- HOSTNAME environment variable
This ensures correct node identification in various deployment scenarios.
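The fallback order above amounts to returning the first non-empty variable. A minimal sketch, assuming an empty string is returned when none of the variables is set (the documentation does not say what happens in that case); resolveNodeName is a hypothetical name, and the lookup function is injected so the logic can be tested without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
)

// nodeNameEnvVars lists the variables in the priority order given above.
var nodeNameEnvVars = []string{"NODE_NAME", "KUBERNETES_NODE_NAME", "HOSTNAME"}

// resolveNodeName returns the first non-empty value among the variables.
// Returning "" when nothing is set is an assumption of this sketch.
func resolveNodeName(getenv func(string) string) string {
	for _, key := range nodeNameEnvVars {
		if v := getenv(key); v != "" {
			return v
		}
	}
	return ""
}

func main() {
	fmt.Println(resolveNodeName(os.Getenv))
}
```

In a pod, NODE_NAME is commonly populated from spec.nodeName through the Kubernetes downward API, while HOSTNAME usually holds the pod name, which is why it works best as the last resort.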
Error Handling ¶
Measure() returns an error when:
- Context is canceled or times out
- Serialization fails
- RequireGPU is set and no GPU is detected
Individual collector failures are logged and skipped, so a snapshot may contain only the measurements that could be collected (see Measure).
Observability ¶
The snapshotter exports Prometheus metrics:
- snapshot_collection_duration_seconds: Total time to collect snapshot
- snapshot_collector_duration_seconds{collector}: Per-collector timing
Structured logs are emitted for:
- Snapshot start
- Collector progress
- Errors and failures
Resource Requirements ¶
Collectors may require:
- Kubernetes API access (in-cluster config or kubeconfig)
- NVIDIA GPU and nvidia-smi binary
- systemd and systemctl binary
- Read access to /proc, /sys, /etc
Failures due to missing resources are reported as errors.
Integration ¶
The snapshotter is invoked by:
- pkg/cli - snapshot command
- Kubernetes Job - aicr-agent deployment
It depends on:
- pkg/collector - Data collection implementations
- pkg/serializer - Output formatting
- pkg/measurement - Data structures
Snapshots are consumed by:
- pkg/recipe - Recipe generation from snapshots
- External analysis tools
- Auditing and compliance systems
Index ¶
- Constants
- func DefaultTolerations() []corev1.Toleration
- func ParseNodeSelectors(selectors []string) (map[string]string, error)
- func ParseTaint(taintStr string) (*corev1.Taint, error)
- func ParseTolerations(tolerations []string) ([]corev1.Toleration, error)
- type AgentConfig
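- type NodeSnapshotter
- func (n *NodeSnapshotter) Measure(ctx context.Context) error
- type Snapshot
- func DeployAndGetSnapshot(ctx context.Context, config *AgentConfig) (*Snapshot, error)
- func NewSnapshot() *Snapshot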
Constants ¶
const (
	// FullAPIVersion is the complete API version string
	FullAPIVersion = apiDomain + "/" + apiVersion
)
Variables ¶
This section is empty.
Functions ¶
func DefaultTolerations ¶
func DefaultTolerations() []corev1.Toleration
DefaultTolerations returns tolerations that accept all taints. This allows the agent Job to be scheduled on any node regardless of taints.
func ParseNodeSelectors ¶
func ParseNodeSelectors(selectors []string) (map[string]string, error)
ParseNodeSelectors parses node selector strings in format "key=value".
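To illustrate the "key=value" format, here is a hypothetical re-implementation (lower-case parseNodeSelectors, not the package function). How the real function treats malformed input, empty values, or duplicate keys is not documented here, so the choices below are assumptions: an entry without "=" or with an empty key is rejected, an empty value is allowed, and later duplicates overwrite earlier ones.

```go
package main

import (
	"fmt"
	"strings"
)

// parseNodeSelectors is a hypothetical sketch of "key=value" parsing.
func parseNodeSelectors(selectors []string) (map[string]string, error) {
	out := make(map[string]string, len(selectors))
	for _, s := range selectors {
		// Split on the first "=" only, so values may themselves contain "=".
		key, value, found := strings.Cut(s, "=")
		if !found || key == "" {
			return nil, fmt.Errorf("invalid node selector %q: expected key=value", s)
		}
		out[key] = value
	}
	return out, nil
}

func main() {
	sel, err := parseNodeSelectors([]string{"nodeGroup=gpu", "kubernetes.io/arch=amd64"})
	fmt.Println(sel, err)
}
```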
func ParseTaint ¶
func ParseTaint(taintStr string) (*corev1.Taint, error)
ParseTaint parses a single taint string in format "key=value:effect" or "key:effect". Returns a corev1.Taint struct.
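The two accepted forms differ only in whether a value is present. The sketch below shows one way to split them; it is a hypothetical re-implementation, with a local taint struct standing in for corev1.Taint so that it runs without the Kubernetes modules. Validating the effect against the three effects Kubernetes defines (NoSchedule, PreferNoSchedule, NoExecute) is an assumption; the real function may be more or less strict.

```go
package main

import (
	"fmt"
	"strings"
)

// taint is a local stand-in for corev1.Taint.
type taint struct {
	Key    string
	Value  string
	Effect string
}

// parseTaint is a hypothetical sketch of "key=value:effect" / "key:effect" parsing.
func parseTaint(s string) (*taint, error) {
	// The effect follows the last ":"; everything before it is key[=value].
	idx := strings.LastIndex(s, ":")
	if idx <= 0 || idx == len(s)-1 {
		return nil, fmt.Errorf("invalid taint %q: expected key=value:effect or key:effect", s)
	}
	keyValue, effect := s[:idx], s[idx+1:]

	switch effect {
	case "NoSchedule", "PreferNoSchedule", "NoExecute":
	default:
		return nil, fmt.Errorf("invalid taint %q: unknown effect %q", s, effect)
	}

	// Value stays empty for the "key:effect" form.
	key, value, _ := strings.Cut(keyValue, "=")
	if key == "" {
		return nil, fmt.Errorf("invalid taint %q: empty key", s)
	}
	return &taint{Key: key, Value: value, Effect: effect}, nil
}

func main() {
	for _, in := range []string{"nvidia.com/gpu=present:NoSchedule", "dedicated:NoExecute", "broken"} {
		t, err := parseTaint(in)
		fmt.Println(in, "->", t, err)
	}
}
```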
func ParseTolerations ¶
func ParseTolerations(tolerations []string) ([]corev1.Toleration, error)
ParseTolerations parses toleration strings in format "key=value:effect" or "key:effect". If no tolerations are provided, returns DefaultTolerations() which accepts all taints.
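The fallback to DefaultTolerations() can be sketched as follows. Everything here is a hypothetical stand-in: the local toleration struct replaces corev1.Toleration, and mapping "key=value" to the Equal operator and a bare "key" to Exists is an assumption about how the strings translate. The accept-all default relies on documented Kubernetes behavior: a toleration with an empty key and operator Exists matches every taint.

```go
package main

import (
	"fmt"
	"strings"
)

// toleration is a local stand-in for corev1.Toleration.
type toleration struct {
	Key      string
	Operator string
	Value    string
	Effect   string
}

// defaultTolerations returns a single toleration that matches every taint:
// empty key plus operator Exists, with no effect restriction.
func defaultTolerations() []toleration {
	return []toleration{{Operator: "Exists"}}
}

// parseTolerations is a hypothetical sketch; the operator mapping is assumed.
func parseTolerations(specs []string) ([]toleration, error) {
	if len(specs) == 0 {
		return defaultTolerations(), nil
	}
	out := make([]toleration, 0, len(specs))
	for _, s := range specs {
		idx := strings.LastIndex(s, ":")
		if idx <= 0 || idx == len(s)-1 {
			return nil, fmt.Errorf("invalid toleration %q: expected key=value:effect or key:effect", s)
		}
		key, value, hasValue := strings.Cut(s[:idx], "=")
		if key == "" {
			return nil, fmt.Errorf("invalid toleration %q: empty key", s)
		}
		op := "Exists"
		if hasValue {
			op = "Equal"
		}
		out = append(out, toleration{Key: key, Operator: op, Value: value, Effect: s[idx+1:]})
	}
	return out, nil
}

func main() {
	all, _ := parseTolerations(nil)
	fmt.Printf("%+v\n", all)
	some, err := parseTolerations([]string{"nvidia.com/gpu=present:NoSchedule", "dedicated:NoExecute"})
	fmt.Printf("%+v %v\n", some, err)
}
```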
Types ¶
type AgentConfig ¶
type AgentConfig struct {
	// Kubeconfig path (optional override)
	Kubeconfig string
	// Namespace for agent deployment
	Namespace string
	// Image for agent container
	Image string
	// ImagePullSecrets for pulling the agent image from private registries
	ImagePullSecrets []string
	// JobName for the agent Job
	JobName string
	// ServiceAccountName for the agent
	ServiceAccountName string
	// NodeSelector for targeting specific nodes
	NodeSelector map[string]string
	// Tolerations for scheduling on tainted nodes
	Tolerations []corev1.Toleration
	// Timeout for waiting for Job completion
	Timeout time.Duration
	// Cleanup determines whether to remove Job and RBAC on completion
	Cleanup bool
	// Output destination for snapshot
	Output string
	// Debug enables debug logging
	Debug bool
	// Privileged enables privileged mode (hostPID, hostNetwork, privileged container).
	// Required for GPU and SystemD collectors. When false, only K8s and OS collectors work.
	Privileged bool
	// RequireGPU requests nvidia.com/gpu resource for the agent pod.
	// Required in CDI environments (e.g., kind with nvkind) where GPU devices
	// are only injected when explicitly requested.
	RequireGPU bool
	// RuntimeClassName sets runtimeClassName on the agent pod and injects
	// NVIDIA_VISIBLE_DEVICES=all. Use instead of RequireGPU when all GPUs
	// are allocated; this gives the agent nvidia-smi access without consuming
	// a GPU from the Device Plugin.
	RuntimeClassName string
	// TemplatePath is the path to a Go template file for custom output formatting.
	// When set, the snapshot output will be processed through this template.
	TemplatePath string
	// MaxNodesPerEntry limits node names per topology entry (0 = unlimited).
	MaxNodesPerEntry int
}
AgentConfig contains configuration for Kubernetes agent deployment.
type NodeSnapshotter ¶
type NodeSnapshotter struct {
	// Version is the snapshotter version.
	Version string
	// Factory is the collector factory to use. If nil, the default factory is used.
	Factory collector.Factory
	// Serializer is the serializer to use for output. If nil, a default stdout JSON serializer is used.
	Serializer serializer.Serializer
	// AgentConfig contains configuration for agent deployment mode. If nil or Enabled=false, runs locally.
	AgentConfig *AgentConfig
	// RequireGPU when true causes the snapshot to fail if no GPU is detected.
	RequireGPU bool
}
NodeSnapshotter collects system configuration measurements from the current node. It coordinates multiple collectors in parallel to gather data about Kubernetes, GPU hardware, OS configuration, and systemd services, then serializes the results. If AgentConfig is provided with Enabled=true, it deploys a Kubernetes Job instead.
func (*NodeSnapshotter) Measure ¶
func (n *NodeSnapshotter) Measure(ctx context.Context) error
Measure collects configuration measurements and serializes the snapshot. When AgentConfig is set, it deploys a Kubernetes Job to capture the snapshot on a GPU node. Otherwise, it runs collectors locally in parallel. Individual collector failures are logged and skipped — the snapshot contains all measurements that could be successfully collected.
type Snapshot ¶
type Snapshot struct {
	header.Header `json:",inline" yaml:",inline"`
	// Measurements contains the collected measurements from various collectors.
	Measurements []*measurement.Measurement `json:"measurements" yaml:"measurements"`
}
Snapshot represents a collected configuration snapshot from a system node. It contains metadata and measurements from various collectors including Kubernetes, GPU, OS configuration, and systemd services.
func DeployAndGetSnapshot ¶
func DeployAndGetSnapshot(ctx context.Context, config *AgentConfig) (*Snapshot, error)
DeployAndGetSnapshot deploys an agent to capture a snapshot and returns the Snapshot struct. This is used by commands that need to capture a snapshot but also process the data (e.g., validate command that needs to run validation on the captured snapshot).
func NewSnapshot ¶
func NewSnapshot() *Snapshot
NewSnapshot creates a new Snapshot instance with an initialized Measurements slice.
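One practical effect of an initialized slice in Go is visible when encoding: encoding/json writes a nil slice as null and an empty, non-nil slice as []. Whether this is the reason NewSnapshot initializes the field is not stated here. The sketch uses a simplified, hypothetical snapshot struct with a []string field in place of []*measurement.Measurement.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// snapshot is a simplified stand-in for the Snapshot type.
type snapshot struct {
	Measurements []string `json:"measurements"`
}

// encode marshals s to JSON and returns "" if marshaling fails.
func encode(s snapshot) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Zero value: the slice is nil.
	fmt.Println(encode(snapshot{})) // {"measurements":null}

	// Initialized slice: empty but non-nil.
	fmt.Println(encode(snapshot{Measurements: make([]string, 0)})) // {"measurements":[]}
}
```

Consumers that iterate over measurements without a null check generally handle the second form more gracefully.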