snapshotter

package
v0.14.0 Latest
Published: Jun 1, 2026 | License: Apache-2.0 | Imports: 26 | Imported by: 0

Documentation

Overview

Package snapshotter captures comprehensive system configuration snapshots.

The snapshotter package orchestrates parallel collection of system measurements from multiple sources (Kubernetes, GPU, OS, SystemD) and produces structured snapshots that can be serialized for analysis, auditing, or recommendation generation.

Core Types

NodeSnapshotter: collects from the current node (or, when AgentConfig is set, deploys a Kubernetes Job to capture from a remote GPU node).

type NodeSnapshotter struct {
    Version     string                // Snapshotter version
    Factory     collector.Factory     // Collector factory (optional)
    Serializer  serializer.Serializer // Output serializer (optional)
    AgentConfig *AgentConfig          // Optional remote agent deployment
    RequireGPU  bool                  // Fail snapshot if no GPU detected
}

The exported entry point is the Measure method:

func (n *NodeSnapshotter) Measure(ctx context.Context) error

Snapshot: Captured configuration data

type Snapshot struct {
    header.Header                           // API version, kind, metadata
    Fingerprint  *fingerprint.Fingerprint   // Advisory cluster identity (optional)
    Measurements []*measurement.Measurement // Collected data
}

Usage

Basic snapshot with defaults (serialized to stdout):

// Factory and Serializer are left nil, so the defaults are used.
ns := &snapshotter.NodeSnapshotter{
    Version: "v1.0.0",
}

ctx := context.Background()
if err := ns.Measure(ctx); err != nil {
    log.Fatalf("snapshot failed: %v", err)
}

Custom collector factory:

factory := collector.NewDefaultFactory(
    collector.WithSystemDServices([]string{"containerd.service"}),
)

ns := &snapshotter.NodeSnapshotter{
    Version: "v1.0.0",
    Factory: factory, // replaces the default collector factory
}

if err := ns.Measure(context.Background()); err != nil {
    log.Fatal(err)
}

Custom output serializer:

out, err := serializer.NewFileSerializer("snapshot.json")
if err != nil {
    log.Fatal(err)
}
defer out.Close()

ns := &snapshotter.NodeSnapshotter{
    Version:    "v1.0.0",
    Serializer: out,
}

// Note: log.Fatal exits without running deferred calls such as out.Close().
if err := ns.Measure(context.Background()); err != nil {
    log.Fatal(err)
}

With timeout:

ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

ns := &snapshotter.NodeSnapshotter{Version: "v1.0.0"}
if err := ns.Measure(ctx); err != nil {
    log.Fatal(err)
}

Snapshot Structure

Snapshots contain a header and measurements:

apiVersion: aicr.nvidia.com/v1alpha1
kind: Snapshot
metadata:
  version: v1.0.0
  source: node-1
  timestamp: 2025-01-15T10:30:00Z
measurements:
  - type: K8s
    subtypes:
      - subtype: server
        data:
          version: 1.33.5
          platform: linux/amd64
      - subtype: node
        data:
          provider: eks
          kernel-version: 6.8.0
      - subtype: image
        data:
          kube-apiserver: v1.33.5
      - subtype: policy
        data:
          driver.version: 570.86.16
      - subtype: helm
        data:
          gpu-operator.chart: gpu-operator
          gpu-operator.version: 25.3.0
      - subtype: argocd
        data:
          gpu-operator.source.chart: gpu-operator
          gpu-operator.syncStatus: Synced
  - type: GPU
    subtypes:
      - subtype: device
        data:
          driver: 570.158.01
          model: H100

Parallel Collection

NodeSnapshotter runs all collectors concurrently using errgroup:

  1. Metadata collection (node name, version)
  2. Kubernetes resources (cluster config, policies)
  3. SystemD services (containerd, kubelet)
  4. OS configuration (grub, sysctl, modules)
  5. GPU hardware (driver, model, settings)

Individual collector failures are logged and skipped — the snapshot contains all measurements that could be successfully collected. The overall Measure call only returns an error for setup, context, or serialization failures (and for missing GPU when RequireGPU is set).
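The log-and-skip behavior described above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: it uses sync.WaitGroup from the standard library in place of errgroup so that it runs without external modules, and collectFunc, collectAll, and errUnavailable are hypothetical names introduced only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// errUnavailable simulates a collector whose data source is missing.
// It is a hypothetical sentinel used only in this sketch.
var errUnavailable = errors.New("source unavailable")

// collectFunc is a hypothetical stand-in for a single collector.
type collectFunc func() (string, error)

// collectAll runs every collector in its own goroutine. A failing collector
// is logged and its result omitted, so the returned map may be partial.
func collectAll(collectors map[string]collectFunc) map[string]string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]string)
	)
	for name, collect := range collectors {
		wg.Add(1)
		go func(name string, collect collectFunc) {
			defer wg.Done()
			data, err := collect()
			if err != nil {
				log.Printf("collector %s failed, skipping: %v", name, err)
				return
			}
			mu.Lock()
			results[name] = data
			mu.Unlock()
		}(name, collect)
	}
	wg.Wait()
	return results
}

func main() {
	results := collectAll(map[string]collectFunc{
		"os":  func() (string, error) { return "sysctl data", nil },
		"gpu": func() (string, error) { return "", errUnavailable },
	})
	fmt.Println(len(results), results["os"])
}
```

In this sketch the failed "gpu" collector leaves no entry in the result, which mirrors the partial snapshot a caller should expect on a host without a GPU.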

Node Name Detection

The node name is resolved in the following fallback order:

  1. NODE_NAME environment variable
  2. KUBERNETES_NODE_NAME environment variable
  3. HOSTNAME environment variable

This ensures correct node identification in various deployment scenarios.
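As a rough sketch of that fallback order, assuming that an empty value is treated the same as an unset variable (the documentation does not say either way). resolveNodeName and nodeNameEnvVars are hypothetical names for illustration and are not part of the package API.

```go
package main

import (
	"fmt"
	"os"
)

// nodeNameEnvVars lists the environment variables in the documented priority order.
var nodeNameEnvVars = []string{"NODE_NAME", "KUBERNETES_NODE_NAME", "HOSTNAME"}

// resolveNodeName is a hypothetical helper. It returns the first non-empty
// value reported by getenv, or "" when none of the variables is set.
// Treating empty values as unset is an assumption of this sketch.
func resolveNodeName(getenv func(string) string) string {
	for _, key := range nodeNameEnvVars {
		if v := getenv(key); v != "" {
			return v
		}
	}
	return ""
}

func main() {
	fmt.Println("node name:", resolveNodeName(os.Getenv))
}
```

Passing the lookup function as a parameter keeps the helper easy to exercise without mutating the process environment.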

Error Handling

Measure() returns an error when:

  • Setup fails
  • Context is canceled or times out
  • Serialization fails
  • RequireGPU is set and no GPU was detected

Individual collector errors do not fail the snapshot; they are logged and the affected measurement is omitted, so partial snapshots are the expected outcome on heterogeneous hosts.

Observability

The snapshotter exports Prometheus metrics:

  • snapshot_collection_duration_seconds: Total time to collect snapshot
  • snapshot_collector_duration_seconds{collector}: Per-collector timing

Structured logs are emitted for:

  • Snapshot start
  • Collector progress
  • Errors and failures

Resource Requirements

Collectors may require:

  • Kubernetes API access (in-cluster config or kubeconfig)
  • NVIDIA GPU and nvidia-smi binary
  • systemd and systemctl binary
  • Read access to /proc, /sys, /etc

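A missing resource causes only the affected collector to fail: the failure is logged and that measurement is omitted from the snapshot. The exception is a missing GPU when RequireGPU is set, which fails the Measure call.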

Integration

The snapshotter is invoked by:

  • pkg/cli - snapshot command
  • Kubernetes Job - aicr-agent deployment

It depends on:

  • pkg/collector - Data collection implementations
  • pkg/serializer - Output formatting
  • pkg/measurement - Data structures

Snapshots are consumed by:

  • pkg/recipe - Recipe generation from snapshots
  • External analysis tools
  • Auditing and compliance systems

Constants

const (

	// FullAPIVersion is the complete API version string
	FullAPIVersion = apiDomain + "/" + apiVersion
)

Variables

This section is empty.

Functions

func DefaultTolerations

func DefaultTolerations() []corev1.Toleration

DefaultTolerations returns tolerations that accept all taints. This allows the agent Job to be scheduled on any node regardless of taints.

func ParseNodeSelectors

func ParseNodeSelectors(selectors []string) (map[string]string, error)

ParseNodeSelectors parses node selector strings in format "key=value".
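To make the "key=value" format concrete, here is a minimal stand-alone sketch. parseSelectors is a hypothetical re-implementation for illustration; how the real ParseNodeSelectors handles an empty key or a missing "=" is assumed here, not documented.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSelectors is a hypothetical illustration of the "key=value" format.
// Rejecting entries without "=" or with an empty key is an assumption.
func parseSelectors(selectors []string) (map[string]string, error) {
	out := make(map[string]string, len(selectors))
	for _, s := range selectors {
		key, value, ok := strings.Cut(s, "=")
		if !ok || key == "" {
			return nil, fmt.Errorf("invalid node selector %q: want key=value", s)
		}
		out[key] = value
	}
	return out, nil
}

func main() {
	m, err := parseSelectors([]string{"nvidia.com/gpu.present=true"})
	fmt.Println(m, err)
}
```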

func ParseResourceList added in v0.13.0

func ParseResourceList(spec string) (corev1.ResourceList, error)

ParseResourceList converts a comma-separated "name=quantity" list (e.g. "cpu=500m,memory=1Gi,ephemeral-storage=1Gi") into a corev1.ResourceList for use as a per-container request or limit override. An empty string returns a nil ResourceList so the caller can distinguish "no override supplied" (defaults apply) from "override supplied" (replace per-key); a sentinel error would force every call site to special-case the empty-flag path. Each quantity is parsed via resource.ParseQuantity, so the same suffixes accepted everywhere else in Kubernetes work here (m, Ki, Mi, Gi, Ti, ...).
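The nil-on-empty contract can be sketched as below. This is an illustration under stated assumptions, not the real function: parseResourceSpec is a hypothetical name, quantities are kept as plain strings instead of being parsed with resource.ParseQuantity, and trimming whitespace around each pair is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// parseResourceSpec is a hypothetical, stdlib-only illustration. An empty
// spec yields a nil map and no error, so callers can tell "no override"
// apart from "override supplied".
func parseResourceSpec(spec string) (map[string]string, error) {
	if strings.TrimSpace(spec) == "" {
		return nil, nil
	}
	out := make(map[string]string)
	for _, pair := range strings.Split(spec, ",") {
		name, qty, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok || name == "" || qty == "" {
			return nil, fmt.Errorf("invalid resource %q: want name=quantity", pair)
		}
		out[name] = qty // the real function parses this into a quantity
	}
	return out, nil
}

func main() {
	limits, err := parseResourceSpec("cpu=500m,memory=1Gi,ephemeral-storage=1Gi")
	fmt.Println(limits, err)

	none, err := parseResourceSpec("")
	fmt.Println(none == nil, err)
}
```

A caller can then apply defaults when the result is nil and replace individual keys otherwise, without special-casing an error for the empty flag.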

func ParseTaint

func ParseTaint(taintStr string) (*corev1.Taint, error)

ParseTaint parses a single taint string in the format "key=value:effect" or "key:effect". Returns a pointer to a corev1.Taint.
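The two accepted formats can be illustrated with a small stand-alone sketch. The taint struct is a local stand-in for corev1.Taint so that the example needs no Kubernetes imports, and parseTaint is a hypothetical re-implementation; splitting on the last colon and rejecting an empty key or effect are assumptions about the real behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// taint is a local stand-in for corev1.Taint (hypothetical, for illustration).
type taint struct {
	Key, Value, Effect string
}

// parseTaint is a hypothetical illustration of "key=value:effect" and "key:effect".
func parseTaint(s string) (*taint, error) {
	// The effect is whatever follows the last colon.
	idx := strings.LastIndex(s, ":")
	if idx <= 0 || idx == len(s)-1 {
		return nil, fmt.Errorf("invalid taint %q: want key=value:effect or key:effect", s)
	}
	keyValue, effect := s[:idx], s[idx+1:]

	// Value stays empty when the input uses the "key:effect" form.
	key, value, _ := strings.Cut(keyValue, "=")
	if key == "" {
		return nil, fmt.Errorf("invalid taint %q: empty key", s)
	}
	return &taint{Key: key, Value: value, Effect: effect}, nil
}

func main() {
	for _, s := range []string{"nvidia.com/gpu=present:NoSchedule", "dedicated:NoExecute"} {
		t, err := parseTaint(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%+v\n", *t)
	}
}
```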

func ParseTolerations

func ParseTolerations(tolerations []string) ([]corev1.Toleration, error)

ParseTolerations parses toleration strings in format "key=value:effect" or "key:effect". If no tolerations are provided, returns DefaultTolerations() which accepts all taints.

Types

type AgentConfig

type AgentConfig struct {
	// Kubeconfig path (optional override)
	Kubeconfig string

	// Namespace for agent deployment
	Namespace string

	// Image for agent container
	Image string

	// ImagePullSecrets for pulling the agent image from private registries
	ImagePullSecrets []string

	// JobName for the agent Job
	JobName string

	// ServiceAccountName for the agent
	ServiceAccountName string

	// NodeSelector for targeting specific nodes
	NodeSelector map[string]string

	// Tolerations for scheduling on tainted nodes
	Tolerations []corev1.Toleration

	// Timeout for waiting for Job completion
	Timeout time.Duration

	// Cleanup determines whether to remove Job and RBAC on completion
	Cleanup bool

	// Output destination for snapshot
	Output string

	// Debug enables debug logging
	Debug bool

	// Privileged enables privileged mode (hostPID, hostNetwork, privileged container).
	// Required for GPU and SystemD collectors. When false, only K8s and OS collectors work.
	Privileged bool

	// RequireGPU requests nvidia.com/gpu resource for the agent pod.
	// Required in CDI environments (e.g., kind with nvkind) where GPU devices
	// are only injected when explicitly requested.
	RequireGPU bool

	// RuntimeClassName sets runtimeClassName on the agent pod and injects
	// NVIDIA_VISIBLE_DEVICES=all. Use instead of RequireGPU when all GPUs
	// are allocated — gives the agent nvidia-smi access without consuming
	// a GPU from the Device Plugin.
	RuntimeClassName string

	// TemplatePath is the path to a Go template file for custom output formatting.
	// When set, the snapshot output will be processed through this template.
	TemplatePath string

	// MaxNodesPerEntry limits node names per topology entry (0 = unlimited).
	MaxNodesPerEntry int

	// OS is the recipe OS criteria value (e.g., "ubuntu", "talos"). Drives
	// per-OS pod construction and in-pod collector backend selection. When
	// empty, defaults preserve the systemd-based behavior.
	OS string

	// Requests overrides the agent container's per-resource requests.
	// When nil, the privileged/restricted defaults baked into
	// pkg/k8s/agent are used. Useful for right-sizing the agent on
	// resource-constrained dev clusters (e.g. talosctl Docker
	// provisioner workers).
	Requests corev1.ResourceList

	// Limits overrides the agent container's per-resource limits. When
	// nil, the privileged/restricted defaults are used. RequireGPU
	// defaults nvidia.com/gpu=1 only when the caller has not supplied
	// that key in Limits — e.g. --require-gpu --limits nvidia.com/gpu=4
	// keeps 4, not 1.
	Limits corev1.ResourceList
}

AgentConfig contains configuration for Kubernetes agent deployment.
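The interaction between RequireGPU and Limits described in the field comments can be sketched as follows. This is an illustration only: effectiveLimits is a hypothetical helper, quantities are plain strings rather than corev1.ResourceList values, and the privileged/restricted defaults from pkg/k8s/agent are deliberately left out.

```go
package main

import "fmt"

// gpuResource is the resource name mentioned in the AgentConfig comments.
const gpuResource = "nvidia.com/gpu"

// effectiveLimits is a hypothetical helper. It copies the caller's limits and,
// when requireGPU is set, adds nvidia.com/gpu=1 only if the caller did not
// already supply that key.
func effectiveLimits(limits map[string]string, requireGPU bool) map[string]string {
	out := make(map[string]string, len(limits)+1)
	for name, qty := range limits {
		out[name] = qty
	}
	if requireGPU {
		if _, supplied := out[gpuResource]; !supplied {
			out[gpuResource] = "1"
		}
	}
	return out
}

func main() {
	// Caller-supplied value wins over the RequireGPU default.
	fmt.Println(effectiveLimits(map[string]string{gpuResource: "4"}, true)[gpuResource])
	// No value supplied, so the default of 1 applies.
	fmt.Println(effectiveLimits(nil, true)[gpuResource])
}
```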

type NodeSnapshotter

type NodeSnapshotter struct {
	// Version is the snapshotter version.
	Version string

	// Factory is the collector factory to use. If nil, the default factory is used.
	Factory collector.Factory

	// Serializer is the serializer to use for output. If nil, a default stdout JSON serializer is used.
	Serializer serializer.Serializer

	// AgentConfig contains configuration for agent deployment mode. If nil, runs locally.
	AgentConfig *AgentConfig

	// RequireGPU when true causes the snapshot to fail if no GPU is detected.
	RequireGPU bool
}

NodeSnapshotter collects system configuration measurements from the current node. It coordinates multiple collectors in parallel to gather data about Kubernetes, GPU hardware, OS configuration, and systemd services, then serializes the results. If AgentConfig is set, it deploys a Kubernetes Job instead.

func (*NodeSnapshotter) Measure

func (n *NodeSnapshotter) Measure(ctx context.Context) error

Measure collects configuration measurements and serializes the snapshot. When AgentConfig is set, it deploys a Kubernetes Job to capture the snapshot on a GPU node. Otherwise, it runs collectors locally in parallel. Individual collector failures are logged and skipped — the snapshot contains all measurements that could be successfully collected.

type Snapshot

type Snapshot struct {
	header.Header `json:",inline" yaml:",inline"`

	// Fingerprint is a structured cluster identity derived from the
	// raw measurements: detected service, accelerator, OS,
	// Kubernetes server version, region, and node count. Populated
	// after all collectors finish so it reflects the final
	// measurement set.
	//
	// The embedded Fingerprint is advisory: it is a convenience for
	// humans reading the snapshot file, not an authoritative claim.
	// Consumers of the snapshot that bear trust — notably the
	// ADR-007 bundler when building the predicate body and the
	// evidence verifier when re-checking it — MUST recompute the
	// Fingerprint from Measurements via fingerprint.FromMeasurements
	// rather than read this field. The snapshot YAML is not signed
	// at this layer; an attacker controlling the file could swap
	// the embedded Fingerprint without touching the measurements
	// that back it.
	Fingerprint *fingerprint.Fingerprint `json:"fingerprint,omitempty" yaml:"fingerprint,omitempty"`

	// Measurements contains the collected measurements from various collectors.
	Measurements []*measurement.Measurement `json:"measurements" yaml:"measurements"`
}

Snapshot represents a collected configuration snapshot from a system node. It contains metadata and measurements from various collectors including Kubernetes, GPU, OS configuration, and systemd services.

func DeployAndGetSnapshot

func DeployAndGetSnapshot(ctx context.Context, config *AgentConfig) (*Snapshot, error)

DeployAndGetSnapshot deploys an agent to capture a snapshot and returns the Snapshot struct. This is used by commands that need to capture a snapshot but also process the data (e.g., validate command that needs to run validation on the captured snapshot).

func NewSnapshot

func NewSnapshot() *Snapshot

NewSnapshot creates a new Snapshot instance with an initialized Measurements slice.
