Overview ¶
Package snapshotter captures comprehensive system configuration snapshots.
The snapshotter package orchestrates parallel collection of system measurements from multiple sources (Kubernetes, GPU, OS, SystemD) and produces structured snapshots that can be serialized for analysis, auditing, or recommendation generation.
Core Types ¶
Snapshotter: Interface for snapshot collection
type Snapshotter interface {
	Measure(ctx context.Context) error
}
NodeSnapshotter: Production implementation that collects from the current node
type NodeSnapshotter struct {
	Version    string                // Snapshotter version
	Factory    collector.Factory     // Collector factory (optional)
	Serializer serializer.Serializer // Output serializer (optional)
}
Snapshot: Captured configuration data
type Snapshot struct {
	Header                                  // API version, kind, metadata
	Measurements []*measurement.Measurement // Collected data
}
Usage ¶
Basic snapshot with defaults (stdout YAML):
// The variable is named snap so that it does not shadow the snapshotter package.
snap := &snapshotter.NodeSnapshotter{
	Version: "v1.0.0",
}
ctx := context.Background()
if err := snap.Measure(ctx); err != nil {
	log.Fatalf("snapshot failed: %v", err)
}
Custom collector factory:
factory := collector.NewDefaultFactory(
	collector.WithSystemDServices([]string{"containerd.service"}),
)
snap := &snapshotter.NodeSnapshotter{
	Version: "v1.0.0",
	Factory: factory,
}
if err := snap.Measure(context.Background()); err != nil {
	log.Fatal(err)
}
Custom output serializer:
// The variable is named out so that it does not shadow the serializer package.
out, err := serializer.NewFileSerializer("snapshot.json")
if err != nil {
	log.Fatal(err)
}
defer out.Close()
snap := &snapshotter.NodeSnapshotter{
	Version:    "v1.0.0",
	Serializer: out,
}
if err := snap.Measure(context.Background()); err != nil {
	log.Fatal(err)
}
With timeout:
// Measure receives a context that is canceled after 30 seconds.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
snap := &snapshotter.NodeSnapshotter{Version: "v1.0.0"}
if err := snap.Measure(ctx); err != nil {
	log.Fatal(err)
}
Snapshot Structure ¶
Snapshots contain a header and measurements:
apiVersion: aicr.nvidia.com/v1alpha1
kind: Snapshot
metadata:
  version: v1.0.0
  source: node-1
  timestamp: 2025-01-15T10:30:00Z
measurements:
  - type: K8s
    subtypes:
      - subtype: server
        data:
          version: 1.33.5
          platform: linux/amd64
      - subtype: node
        data:
          provider: eks
          kernel-version: 6.8.0
      - subtype: image
        data:
          kube-apiserver: v1.33.5
      - subtype: policy
        data:
          driver.version: 570.86.16
      - subtype: helm
        data:
          gpu-operator.chart: gpu-operator
          gpu-operator.version: 25.3.0
      - subtype: argocd
        data:
          gpu-operator.source.chart: gpu-operator
          gpu-operator.syncStatus: Synced
  - type: GPU
    subtypes:
      - subtype: device
        data:
          driver: 570.158.01
          model: H100
Parallel Collection ¶
NodeSnapshotter runs all collectors concurrently using errgroup:
- Metadata collection (node name, version)
- Kubernetes resources (cluster config, policies)
- SystemD services (containerd, kubelet)
- OS configuration (grub, sysctl, modules)
- GPU hardware (driver, model, settings)
If any collector fails, all are canceled and an error is returned.
Node Name Detection ¶
Node name is determined with fallback priority:
- NODE_NAME environment variable
- KUBERNETES_NODE_NAME environment variable
- HOSTNAME environment variable
This ensures correct node identification in various deployment scenarios.
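The fallback order above can be sketched as a first-non-empty lookup. This is a minimal illustration, not the package's own code: resolveNodeName and nodeNameEnvVars are hypothetical names, and returning an empty string when nothing is set is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// nodeNameEnvVars lists the environment variables in the documented
// fallback order. The variable name is hypothetical.
var nodeNameEnvVars = []string{"NODE_NAME", "KUBERNETES_NODE_NAME", "HOSTNAME"}

// resolveNodeName is a hypothetical helper. It returns the first non-empty
// value found in the fallback order, or "" when none of the variables is set
// (the empty-string result is an assumption, not documented behavior).
// getenv is injected so the lookup can be exercised without touching the
// real process environment.
func resolveNodeName(getenv func(string) string) string {
	for _, key := range nodeNameEnvVars {
		if v := getenv(key); v != "" {
			return v
		}
	}
	return ""
}

func main() {
	fmt.Println(resolveNodeName(os.Getenv))
}
```

Injecting the lookup function keeps the priority logic testable with a plain map instead of mutating the environment.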
Error Handling ¶
Measure() returns an error when:
- Any collector fails
- Context is canceled or times out
- Serialization fails
Partial data is never returned - snapshots are all-or-nothing.
Observability ¶
The snapshotter exports Prometheus metrics:
- snapshot_collection_duration_seconds: Total time to collect snapshot
- snapshot_collector_duration_seconds{collector}: Per-collector timing
Structured logs are emitted for:
- Snapshot start
- Collector progress
- Errors and failures
Resource Requirements ¶
Collectors may require:
- Kubernetes API access (in-cluster config or kubeconfig)
- NVIDIA GPU and nvidia-smi binary
- systemd and systemctl binary
- Read access to /proc, /sys, /etc
Failures due to missing resources are reported as errors.
Integration ¶
The snapshotter is invoked by:
- pkg/cli - snapshot command
- Kubernetes Job - aicr-agent deployment
It depends on:
- pkg/collector - Data collection implementations
- pkg/serializer - Output formatting
- pkg/measurement - Data structures
Snapshots are consumed by:
- pkg/recipe - Recipe generation from snapshots
- External analysis tools
- Auditing and compliance systems
Index ¶
- Constants
- func DefaultTolerations() []corev1.Toleration
- func ParseNodeSelectors(selectors []string) (map[string]string, error)
- func ParseResourceList(spec string) (corev1.ResourceList, error)
- func ParseTaint(taintStr string) (*corev1.Taint, error)
- func ParseTolerations(tolerations []string) ([]corev1.Toleration, error)
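- type AgentConfig
- type NodeSnapshotter
- func (n *NodeSnapshotter) Measure(ctx context.Context) error
- type Snapshot
- func DeployAndGetSnapshot(ctx context.Context, config *AgentConfig) (*Snapshot, error)
- func NewSnapshot() *Snapshot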
Constants ¶
const (
	// FullAPIVersion is the complete API version string
	FullAPIVersion = apiDomain + "/" + apiVersion
)
Variables ¶
This section is empty.
Functions ¶
func DefaultTolerations ¶
func DefaultTolerations() []corev1.Toleration
DefaultTolerations returns tolerations that accept all taints. This allows the agent Job to be scheduled on any node regardless of taints.
func ParseNodeSelectors ¶
func ParseNodeSelectors(selectors []string) (map[string]string, error)
ParseNodeSelectors parses node selector strings in "key=value" format.
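A minimal sketch of "key=value" parsing, using only the standard library. parseSelectors is a hypothetical stand-in, not the package's implementation; rejecting an empty key and splitting on the first "=" only are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSelectors is a hypothetical stand-in for ParseNodeSelectors.
// It splits each entry on the first "=" and rejects entries that have no
// "=" or an empty key (both rules are assumptions).
func parseSelectors(selectors []string) (map[string]string, error) {
	out := make(map[string]string, len(selectors))
	for _, s := range selectors {
		key, value, ok := strings.Cut(s, "=")
		if !ok || key == "" {
			return nil, fmt.Errorf("invalid node selector %q: expected key=value", s)
		}
		out[key] = value
	}
	return out, nil
}

func main() {
	m, err := parseSelectors([]string{"nvidia.com/gpu.present=true", "kubernetes.io/os=linux"})
	fmt.Println(m, err)
}
```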
func ParseResourceList ¶ added in v0.13.0
func ParseResourceList(spec string) (corev1.ResourceList, error)
ParseResourceList converts a comma-separated "name=quantity" list (e.g. "cpu=500m,memory=1Gi,ephemeral-storage=1Gi") into a corev1.ResourceList for use as a per-container request or limit override. An empty string returns a nil ResourceList so the caller can distinguish "no override supplied" (defaults apply) from "override supplied" (replace per-key); a sentinel error would force every call site to special-case the empty-flag path. Each quantity is parsed via resource.ParseQuantity, so the same suffixes accepted everywhere else in Kubernetes work here (m, Ki, Mi, Gi, Ti, ...).
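The nil-versus-override contract described above can be sketched without the Kubernetes dependency. parseResourceSpec is a hypothetical stand-in that keeps quantities as raw strings instead of calling resource.ParseQuantity, so it does not validate suffixes; trimming whitespace around entries is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// parseResourceSpec is a hypothetical, dependency-free stand-in for
// ParseResourceList. Quantities stay as raw strings here; the real function
// is documented to parse them with resource.ParseQuantity.
func parseResourceSpec(spec string) (map[string]string, error) {
	if strings.TrimSpace(spec) == "" {
		// nil signals "no override supplied", so callers keep their defaults.
		return nil, nil
	}
	out := make(map[string]string)
	for _, pair := range strings.Split(spec, ",") {
		name, qty, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok || name == "" || qty == "" {
			return nil, fmt.Errorf("invalid resource %q: expected name=quantity", pair)
		}
		out[name] = qty
	}
	return out, nil
}

func main() {
	for _, spec := range []string{"", "cpu=500m,memory=1Gi"} {
		rl, err := parseResourceSpec(spec)
		fmt.Printf("spec=%q override=%t values=%v err=%v\n", spec, rl != nil, rl, err)
	}
}
```

Returning nil rather than an error for the empty string lets a caller write a single `if rl != nil` check at each call site.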
func ParseTaint ¶
func ParseTaint(taintStr string) (*corev1.Taint, error)
ParseTaint parses a single taint string in format "key=value:effect" or "key:effect". Returns a corev1.Taint struct.
func ParseTolerations ¶
func ParseTolerations(tolerations []string) ([]corev1.Toleration, error)
ParseTolerations parses toleration strings in format "key=value:effect" or "key:effect". If no tolerations are provided, returns DefaultTolerations() which accepts all taints.
Types ¶
type AgentConfig ¶
type AgentConfig struct {
	// Kubeconfig path (optional override)
	Kubeconfig string
	// Namespace for agent deployment
	Namespace string
	// Image for agent container
	Image string
	// ImagePullSecrets for pulling the agent image from private registries
	ImagePullSecrets []string
	// JobName for the agent Job
	JobName string
	// ServiceAccountName for the agent
	ServiceAccountName string
	// NodeSelector for targeting specific nodes
	NodeSelector map[string]string
	// Tolerations for scheduling on tainted nodes
	Tolerations []corev1.Toleration
	// Timeout for waiting for Job completion
	Timeout time.Duration
	// Cleanup determines whether to remove Job and RBAC on completion
	Cleanup bool
	// Output destination for snapshot
	Output string
	// Debug enables debug logging
	Debug bool
	// Privileged enables privileged mode (hostPID, hostNetwork, privileged container).
	// Required for GPU and SystemD collectors. When false, only K8s and OS collectors work.
	Privileged bool
	// RequireGPU requests nvidia.com/gpu resource for the agent pod.
	// Required in CDI environments (e.g., kind with nvkind) where GPU devices
	// are only injected when explicitly requested.
	RequireGPU bool
	// RuntimeClassName sets runtimeClassName on the agent pod and injects
	// NVIDIA_VISIBLE_DEVICES=all. Use instead of RequireGPU when all GPUs
	// are allocated: it gives the agent nvidia-smi access without consuming
	// a GPU from the Device Plugin.
	RuntimeClassName string
	// TemplatePath is the path to a Go template file for custom output formatting.
	// When set, the snapshot output will be processed through this template.
	TemplatePath string
	// MaxNodesPerEntry limits node names per topology entry (0 = unlimited).
	MaxNodesPerEntry int
	// OS is the recipe OS criteria value (e.g., "ubuntu", "talos"). Drives
	// per-OS pod construction and in-pod collector backend selection. When
	// empty, defaults preserve the systemd-based behavior.
	OS string
	// Requests overrides the agent container's per-resource requests.
	// When nil, the privileged/restricted defaults baked into
	// pkg/k8s/agent are used. Useful for right-sizing the agent on
	// resource-constrained dev clusters (e.g. talosctl Docker
	// provisioner workers).
	Requests corev1.ResourceList
	// Limits overrides the agent container's per-resource limits. When
	// nil, the privileged/restricted defaults are used. RequireGPU
	// defaults nvidia.com/gpu=1 only when the caller has not supplied
	// that key in Limits; e.g. --require-gpu --limits nvidia.com/gpu=4
	// keeps 4, not 1.
	Limits corev1.ResourceList
}
AgentConfig contains configuration for Kubernetes agent deployment.
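The interaction between Limits and RequireGPU described in the field comments can be sketched as a per-key merge. This is an illustration under stated assumptions: effectiveLimits is a hypothetical helper, plain string maps stand in for corev1.ResourceList, and the default values passed in main are made up for the example.

```go
package main

import "fmt"

const gpuResource = "nvidia.com/gpu"

// effectiveLimits is a hypothetical helper. It starts from the defaults,
// replaces individual keys with caller overrides, and adds nvidia.com/gpu=1
// for requireGPU only when the caller did not supply that key.
func effectiveLimits(defaults, overrides map[string]string, requireGPU bool) map[string]string {
	out := make(map[string]string, len(defaults)+len(overrides)+1)
	for k, v := range defaults {
		out[k] = v
	}
	for k, v := range overrides {
		out[k] = v // per-key replacement; untouched keys keep their default
	}
	if _, supplied := overrides[gpuResource]; requireGPU && !supplied {
		out[gpuResource] = "1"
	}
	return out
}

func main() {
	defaults := map[string]string{"cpu": "2", "memory": "4Gi"} // example values only
	fmt.Println(effectiveLimits(defaults, nil, true))
	fmt.Println(effectiveLimits(defaults, map[string]string{gpuResource: "4"}, true))
}
```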
type NodeSnapshotter ¶
type NodeSnapshotter struct {
	// Version is the snapshotter version.
	Version string
	// Factory is the collector factory to use. If nil, the default factory is used.
	Factory collector.Factory
	// Serializer is the serializer to use for output. If nil, a default stdout JSON serializer is used.
	Serializer serializer.Serializer
	// AgentConfig contains configuration for agent deployment mode. If nil or Enabled=false, runs locally.
	AgentConfig *AgentConfig
	// RequireGPU when true causes the snapshot to fail if no GPU is detected.
	RequireGPU bool
}
NodeSnapshotter collects system configuration measurements from the current node. It coordinates multiple collectors in parallel to gather data about Kubernetes, GPU hardware, OS configuration, and systemd services, then serializes the results. If AgentConfig is provided with Enabled=true, it deploys a Kubernetes Job instead.
func (*NodeSnapshotter) Measure ¶
func (n *NodeSnapshotter) Measure(ctx context.Context) error
Measure collects configuration measurements and serializes the snapshot. When AgentConfig is set, it deploys a Kubernetes Job to capture the snapshot on a GPU node. Otherwise, it runs collectors locally in parallel. Individual collector failures are logged and skipped — the snapshot contains all measurements that could be successfully collected.
type Snapshot ¶
type Snapshot struct {
	header.Header `json:",inline" yaml:",inline"`
	// Fingerprint is a structured cluster identity derived from the
	// raw measurements: detected service, accelerator, OS,
	// Kubernetes server version, region, and node count. Populated
	// after all collectors finish so it reflects the final
	// measurement set.
	//
	// The embedded Fingerprint is advisory: it is a convenience for
	// humans reading the snapshot file, not an authoritative claim.
	// Consumers of the snapshot that bear trust (notably the
	// ADR-007 bundler when building the predicate body and the
	// evidence verifier when re-checking it) MUST recompute the
	// Fingerprint from Measurements via fingerprint.FromMeasurements
	// rather than read this field. The snapshot YAML is not signed
	// at this layer; an attacker controlling the file could swap
	// the embedded Fingerprint without touching the measurements
	// that back it.
	Fingerprint *fingerprint.Fingerprint `json:"fingerprint,omitempty" yaml:"fingerprint,omitempty"`
	// Measurements contains the collected measurements from various collectors.
	Measurements []*measurement.Measurement `json:"measurements" yaml:"measurements"`
}
Snapshot represents a collected configuration snapshot from a system node. It contains metadata and measurements from various collectors including Kubernetes, GPU, OS configuration, and systemd services.
func DeployAndGetSnapshot ¶
func DeployAndGetSnapshot(ctx context.Context, config *AgentConfig) (*Snapshot, error)
DeployAndGetSnapshot deploys an agent to capture a snapshot and returns the resulting Snapshot struct. It is used by commands that need to process the captured data rather than only write it out (e.g., the validate command, which runs validation on the captured snapshot).
func NewSnapshot ¶
func NewSnapshot() *Snapshot
NewSnapshot creates a new Snapshot instance with an initialized Measurements slice.
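The documentation does not say why the slice is pre-initialized. One observable effect in Go is that encoding/json writes a nil slice as null and an empty, non-nil slice as []. The sketch below shows that difference with a hypothetical snapshot type that mirrors only the measurements field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// snapshot is a hypothetical, trimmed-down stand-in for Snapshot.
type snapshot struct {
	Measurements []string `json:"measurements"`
}

// newSnapshot mirrors the documented behavior of NewSnapshot: the
// Measurements slice is initialized rather than left nil.
func newSnapshot() *snapshot {
	return &snapshot{Measurements: make([]string, 0)}
}

// encode returns the JSON form of s, or an error string if marshaling fails.
func encode(s *snapshot) string {
	b, err := json.Marshal(s)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(encode(&snapshot{}))   // prints {"measurements":null}
	fmt.Println(encode(newSnapshot())) // prints {"measurements":[]}
}
```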