snapshotter

package
v0.9.0
Warning

This package is not in the latest version of its module.

Published: Mar 9, 2026 License: Apache-2.0 Imports: 23 Imported by: 0

Documentation

Overview

Package snapshotter captures comprehensive system configuration snapshots.

The snapshotter package orchestrates parallel collection of system measurements from multiple sources (Kubernetes, GPU, OS, SystemD) and produces structured snapshots that can be serialized for analysis, auditing, or recommendation generation.

Core Types

Snapshotter: Interface for snapshot collection

type Snapshotter interface {
    Measure(ctx context.Context) error
}

NodeSnapshotter: Production implementation that collects from the current node

type NodeSnapshotter struct {
    Version    string               // Snapshotter version
    Factory    collector.Factory    // Collector factory (optional)
    Serializer serializer.Serializer // Output serializer (optional)
}

Snapshot: Captured configuration data

type Snapshot struct {
    header.Header                           // API version, kind, metadata
    Measurements []*measurement.Measurement // Collected data
}

Usage

Basic snapshot with defaults (serialized to stdout):

// Use a name that does not shadow the snapshotter package.
ns := &snapshotter.NodeSnapshotter{
    Version: "v1.0.0",
}

ctx := context.Background()
if err := ns.Measure(ctx); err != nil {
    log.Fatalf("snapshot failed: %v", err)
}

Custom collector factory:

factory := collector.NewDefaultFactory(
    collector.WithSystemDServices([]string{"containerd.service"}),
)

ns := &snapshotter.NodeSnapshotter{
    Version: "v1.0.0",
    Factory: factory,
}

if err := ns.Measure(context.Background()); err != nil {
    log.Fatal(err)
}

Custom output serializer:

// Use names that do not shadow the serializer and snapshotter packages.
out, err := serializer.NewFileSerializer("snapshot.json")
if err != nil {
    log.Fatal(err)
}
defer out.Close()

ns := &snapshotter.NodeSnapshotter{
    Version:    "v1.0.0",
    Serializer: out,
}

if err := ns.Measure(context.Background()); err != nil {
    log.Fatal(err)
}

With timeout:

ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

ns := &snapshotter.NodeSnapshotter{Version: "v1.0.0"}
if err := ns.Measure(ctx); err != nil {
    log.Fatal(err)
}

Snapshot Structure

Snapshots contain a header and measurements:

apiVersion: aicr.nvidia.com/v1alpha1
kind: Snapshot
metadata:
  version: v1.0.0
  source: node-1
  timestamp: 2025-01-15T10:30:00Z
measurements:
  - type: K8s
    subtypes:
      - subtype: server
        data:
          version: 1.33.5
          platform: linux/amd64
      - subtype: node
        data:
          provider: eks
          kernel-version: 6.8.0
      - subtype: image
        data:
          kube-apiserver: v1.33.5
      - subtype: policy
        data:
          driver.version: 570.86.16
      - subtype: helm
        data:
          gpu-operator.chart: gpu-operator
          gpu-operator.version: 25.3.0
      - subtype: argocd
        data:
          gpu-operator.source.chart: gpu-operator
          gpu-operator.syncStatus: Synced
  - type: GPU
    subtypes:
      - subtype: device
        data:
          driver: 570.158.01
          model: H100

Parallel Collection

NodeSnapshotter runs all collectors concurrently using errgroup:

  1. Metadata collection (node name, version)
  2. Kubernetes resources (cluster config, policies)
  3. SystemD services (containerd, kubelet)
  4. OS configuration (grub, sysctl, modules)
  5. GPU hardware (driver, model, settings)

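Individual collector failures are logged and skipped, as described in the Measure documentation; the snapshot contains every measurement that was collected successfully.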

Node Name Detection

Node name is determined with fallback priority:

  1. NODE_NAME environment variable
  2. KUBERNETES_NODE_NAME environment variable
  3. HOSTNAME environment variable

This ensures correct node identification in various deployment scenarios.
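
The fallback order can be expressed as a short loop. This is a minimal sketch with a hypothetical helper name (resolveNodeName); it assumes an empty value is treated the same as an unset variable, which the documentation does not state.

```go
package main

import (
	"fmt"
	"os"
)

// nodeNameEnvVars lists the environment variables in the documented
// fallback order.
var nodeNameEnvVars = []string{"NODE_NAME", "KUBERNETES_NODE_NAME", "HOSTNAME"}

// resolveNodeName is a hypothetical helper, not the package's API. It
// returns the first non-empty value found via lookup, or "" if none is set.
func resolveNodeName(lookup func(string) string) string {
	for _, key := range nodeNameEnvVars {
		if v := lookup(key); v != "" {
			return v
		}
	}
	return ""
}

func main() {
	fmt.Println("node name:", resolveNodeName(os.Getenv))
}
```

Passing the lookup function as a parameter keeps the helper testable without modifying the process environment.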

Error Handling

Measure() returns an error when:

  • Context is canceled or times out
  • Serialization fails
  • RequireGPU is true and no GPU is detected

Individual collector failures are logged and skipped, so a snapshot may omit measurements from collectors that failed.

Observability

The snapshotter exports Prometheus metrics:

  • snapshot_collection_duration_seconds: Total time to collect snapshot
  • snapshot_collector_duration_seconds{collector}: Per-collector timing

Structured logs are emitted for:

  • Snapshot start
  • Collector progress
  • Errors and failures

Resource Requirements

Collectors may require:

  • Kubernetes API access (in-cluster config or kubeconfig)
  • NVIDIA GPU and nvidia-smi binary
  • systemd and systemctl binary
  • Read access to /proc, /sys, /etc

Failures due to missing resources are reported as errors.

Integration

The snapshotter is invoked by:

  • pkg/cli - snapshot command
  • Kubernetes Job - aicr-agent deployment

It depends on:

  • pkg/collector - Data collection implementations
  • pkg/serializer - Output formatting
  • pkg/measurement - Data structures

Snapshots are consumed by:

  • pkg/recipe - Recipe generation from snapshots
  • External analysis tools
  • Auditing and compliance systems

Constants

const (

	// FullAPIVersion is the complete API version string
	FullAPIVersion = apiDomain + "/" + apiVersion
)

Functions

func DefaultTolerations

func DefaultTolerations() []corev1.Toleration

DefaultTolerations returns tolerations that accept all taints. This allows the agent Job to be scheduled on any node regardless of taints.

func ParseNodeSelectors

func ParseNodeSelectors(selectors []string) (map[string]string, error)

ParseNodeSelectors parses node selector strings in format "key=value".
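
A minimal sketch of parsing the "key=value" format follows. parseSelectors is a hypothetical stand-in, not the package's implementation; rejecting entries without "=" or with an empty key is an assumption, and the label in main is illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSelectors is a hypothetical stand-in for ParseNodeSelectors.
// It splits each entry on the first '=' and rejects malformed entries.
func parseSelectors(selectors []string) (map[string]string, error) {
	out := make(map[string]string, len(selectors))
	for _, s := range selectors {
		key, value, ok := strings.Cut(s, "=")
		if !ok || key == "" {
			return nil, fmt.Errorf("invalid node selector %q: expected key=value", s)
		}
		out[key] = value
	}
	return out, nil
}

func main() {
	// The label below is an illustrative example, not a required value.
	m, err := parseSelectors([]string{"nvidia.com/gpu.present=true"})
	fmt.Println(m, err)
}
```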

func ParseTaint

func ParseTaint(taintStr string) (*corev1.Taint, error)

ParseTaint parses a single taint string in format "key=value:effect" or "key:effect". Returns a corev1.Taint struct.

func ParseTolerations

func ParseTolerations(tolerations []string) ([]corev1.Toleration, error)

ParseTolerations parses toleration strings in format "key=value:effect" or "key:effect". If no tolerations are provided, returns DefaultTolerations() which accepts all taints.
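
The fallback to DefaultTolerations() can be illustrated as follows. In Kubernetes, a toleration with an empty key and the Exists operator tolerates every taint; whether DefaultTolerations() is implemented exactly this way is an assumption. The toleration type and both helpers are hypothetical and mirror only the relevant fields of corev1.Toleration.

```go
package main

import "fmt"

// toleration mirrors the Key, Operator, Value, and Effect fields of
// corev1.Toleration so the sketch needs no Kubernetes imports.
type toleration struct {
	Key, Operator, Value, Effect string
}

// defaultTolerations returns a single toleration with an empty key and the
// Exists operator, which tolerates all taints in Kubernetes. That the real
// DefaultTolerations uses this exact form is an assumption.
func defaultTolerations() []toleration {
	return []toleration{{Operator: "Exists"}}
}

// tolerationsOrDefault returns parsed when it is non-empty and the
// tolerate-everything default otherwise.
func tolerationsOrDefault(parsed []toleration) []toleration {
	if len(parsed) == 0 {
		return defaultTolerations()
	}
	return parsed
}

func main() {
	fmt.Printf("%+v\n", tolerationsOrDefault(nil))
}
```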

Types

type AgentConfig

type AgentConfig struct {
	// Kubeconfig path (optional override)
	Kubeconfig string

	// Namespace for agent deployment
	Namespace string

	// Image for agent container
	Image string

	// ImagePullSecrets for pulling the agent image from private registries
	ImagePullSecrets []string

	// JobName for the agent Job
	JobName string

	// ServiceAccountName for the agent
	ServiceAccountName string

	// NodeSelector for targeting specific nodes
	NodeSelector map[string]string

	// Tolerations for scheduling on tainted nodes
	Tolerations []corev1.Toleration

	// Timeout for waiting for Job completion
	Timeout time.Duration

	// Cleanup determines whether to remove Job and RBAC on completion
	Cleanup bool

	// Output destination for snapshot
	Output string

	// Debug enables debug logging
	Debug bool

	// Privileged enables privileged mode (hostPID, hostNetwork, privileged container).
	// Required for GPU and SystemD collectors. When false, only K8s and OS collectors work.
	Privileged bool

	// RequireGPU requests nvidia.com/gpu resource for the agent pod.
	// Required in CDI environments (e.g., kind with nvkind) where GPU devices
	// are only injected when explicitly requested.
	RequireGPU bool

	// TemplatePath is the path to a Go template file for custom output formatting.
	// When set, the snapshot output will be processed through this template.
	TemplatePath string

	// MaxNodesPerEntry limits node names per topology entry (0 = unlimited).
	MaxNodesPerEntry int
}

AgentConfig contains configuration for Kubernetes agent deployment.

type NodeSnapshotter

type NodeSnapshotter struct {
	// Version is the snapshotter version.
	Version string

	// Factory is the collector factory to use. If nil, the default factory is used.
	Factory collector.Factory

	// Serializer is the serializer to use for output. If nil, a default stdout JSON serializer is used.
	Serializer serializer.Serializer

	// AgentConfig contains configuration for agent deployment mode. If nil, runs locally.
	AgentConfig *AgentConfig

	// RequireGPU when true causes the snapshot to fail if no GPU is detected.
	RequireGPU bool
}

NodeSnapshotter collects system configuration measurements from the current node. It coordinates multiple collectors in parallel to gather data about Kubernetes, GPU hardware, OS configuration, and systemd services, then serializes the results. If AgentConfig is set, it deploys a Kubernetes Job instead.

func (*NodeSnapshotter) Measure

func (n *NodeSnapshotter) Measure(ctx context.Context) error

Measure collects configuration measurements and serializes the snapshot. When AgentConfig is set, it deploys a Kubernetes Job to capture the snapshot on a GPU node. Otherwise, it runs collectors locally in parallel. Individual collector failures are logged and skipped; the snapshot contains all measurements that were collected successfully.

type Snapshot

type Snapshot struct {
	header.Header `json:",inline" yaml:",inline"`

	// Measurements contains the collected measurements from various collectors.
	Measurements []*measurement.Measurement `json:"measurements" yaml:"measurements"`
}

Snapshot represents a collected configuration snapshot from a system node. It contains metadata and measurements from various collectors including Kubernetes, GPU, OS configuration, and systemd services.

func DeployAndGetSnapshot

func DeployAndGetSnapshot(ctx context.Context, config *AgentConfig) (*Snapshot, error)

DeployAndGetSnapshot deploys an agent to capture a snapshot and returns the resulting Snapshot struct. It is intended for commands that must process the captured data as well as collect it (e.g., the validate command, which runs validation on the captured snapshot).

func NewSnapshot

func NewSnapshot() *Snapshot

NewSnapshot creates a new Snapshot instance with an initialized Measurements slice.
