gpu

package
v0.15.0 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 15, 2026 License: Apache-2.0 Imports: 11 Imported by: 0

Documentation

Overview

Package gpu collects GPU hardware data via driver-free NFD/PCI enumeration.

Detection

The collector uses a single, driver-free detector backed by NFD source packages: it enumerates PCI devices from sysfs, counts NVIDIA GPUs by vendor/class, resolves the accelerator SKU from each device ID (see device_ids.go), and checks the nvidia kernel-module state. It requires neither the NVIDIA driver nor nvidia-smi — only Linux with sysfs mounted — which is what makes day-0 detection on freshly provisioned nodes possible.

(Historically a second "smi" phase shelled out to nvidia-smi for the SKU and extra telemetry; it was removed once the PCI device ID could name the SKU, eliminating the CUDA-image dependency. The accelerator SKU now also comes from the GPU-operator nvidia.com/gpu.product label during fingerprinting.)

Graceful Degradation

Detection degrades to an empty result rather than an error:

  • Detector failure (e.g., no sysfs on macOS): logged as a warning; the GPU measurement is returned with no subtypes.
  • No HardwareDetector configured: the GPU measurement is returned with no subtypes.

Measurement Structure

Measurement{
    Type: "GPU",
    Subtypes: [
        {Name: "hardware", Data: {gpu-present, gpu-count, driver-loaded, detection-source, model}},
    ],
}

The "hardware" subtype keys are defined in pkg/measurement:

  • KeyGPUPresent: bool — true if at least one NVIDIA GPU found via PCI
  • KeyGPUCount: int — number of NVIDIA GPUs detected
  • KeyGPUDriverLoaded: bool — true if nvidia kernel module is loaded
  • KeyGPUDetectionSource: string — detection method (e.g., "nfd")
  • KeyGPUModel: string — accelerator SKU resolved from the PCI device ID (omitted when the device ID is unknown). This is a descriptive discovery vocabulary broader than the recipe accelerator enum.
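
As a rough sketch of how the "hardware" data could be assembled from a detection result, assuming a plain map as the container (the real value type in pkg/measurement is not shown on this page) and locally redeclared key constants. The hardwareData function is hypothetical. Note how the model key is left out when no SKU was resolved.

```go
package main

import "fmt"

// Key strings as listed in the measurement structure above. The constants
// are redeclared locally; the real ones live in pkg/measurement.
const (
	keyGPUPresent         = "gpu-present"
	keyGPUCount           = "gpu-count"
	keyGPUDriverLoaded    = "driver-loaded"
	keyGPUDetectionSource = "detection-source"
	keyGPUModel           = "model"
)

// HardwareInfo is a local copy of the documented struct.
type HardwareInfo struct {
	GPUPresent      bool
	GPUCount        int
	DriverLoaded    bool
	DetectionSource string
	SKU             string
}

// hardwareData maps a detection result onto the "hardware" subtype keys.
// The model key is omitted when the SKU is empty.
func hardwareData(info HardwareInfo) map[string]any {
	data := map[string]any{
		keyGPUPresent:         info.GPUPresent,
		keyGPUCount:           info.GPUCount,
		keyGPUDriverLoaded:    info.DriverLoaded,
		keyGPUDetectionSource: info.DetectionSource,
	}
	if info.SKU != "" {
		data[keyGPUModel] = info.SKU
	}
	return data
}

func main() {
	known := hardwareData(HardwareInfo{GPUPresent: true, GPUCount: 8, DetectionSource: "nfd", SKU: "h100"})
	unknown := hardwareData(HardwareInfo{GPUPresent: true, GPUCount: 1, DetectionSource: "nfd"})

	fmt.Println(len(known), len(unknown)) // prints: 5 4
	_, hasModel := unknown[keyGPUModel]
	fmt.Println(hasModel) // prints: false
}
```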

Usage

The collector is created by the factory with NFD wiring:

collector := gpu.NewCollector(
    gpu.WithHardwareDetector(&gpu.NFDHardwareDetector{}),
)
m, err := collector.Collect(ctx)

Without WithHardwareDetector, Collect returns a GPU measurement with no subtypes.

Context and Timeouts

The collector respects context cancellation and applies a bounded timeout (defaults.CollectorTimeout). NFD detection has its own sub-timeout (defaults.NFDDetectionTimeout).

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Collector

type Collector struct {
	// contains filtered or unexported fields
}

Collector collects GPU information via driver-free NFD/PCI enumeration: presence, count, kernel-module state, and the accelerator SKU resolved from the PCI device ID. It requires neither the NVIDIA driver nor nvidia-smi.

func NewCollector added in v0.12.0

func NewCollector(opts ...CollectorOption) *Collector

NewCollector creates a GPU collector with the given options.

func (*Collector) Collect

func (s *Collector) Collect(ctx context.Context) (*measurement.Measurement, error)

Collect retrieves GPU information from the hardware detector (NFD PCI enumeration). It degrades gracefully: when no detector is configured or detection fails, it returns a GPU measurement with no subtypes rather than an error, so a snapshot on a node without sysfs (e.g. macOS) still succeeds.

type CollectorOption added in v0.12.0

type CollectorOption func(*Collector)

CollectorOption configures a Collector.

func WithHardwareDetector added in v0.12.0

func WithHardwareDetector(d HardwareDetector) CollectorOption

WithHardwareDetector sets the hardware detector used for GPU detection. When not set, Collect returns a GPU measurement with no subtypes.

type HardwareDetector added in v0.12.0

type HardwareDetector interface {
	// Detect discovers GPU hardware and driver module state.
	// Returns HardwareInfo describing what was found, or an error if
	// detection could not be performed (e.g., sysfs not available).
	Detect(ctx context.Context) (*HardwareInfo, error)
}

HardwareDetector abstracts GPU hardware detection for testability. Implementations enumerate PCI devices and kernel module state without requiring GPU drivers to be installed.

type HardwareInfo added in v0.12.0

type HardwareInfo struct {
	// GPUPresent is true if at least one NVIDIA GPU was found via PCI enumeration.
	GPUPresent bool

	// GPUCount is the number of NVIDIA GPUs detected via PCI enumeration.
	GPUCount int

	// DriverLoaded is true if the nvidia kernel module is currently loaded.
	DriverLoaded bool

	// DetectionSource identifies which detection method produced this result
	// (e.g., "nfd", "sysfs").
	DetectionSource string

	// SKU is the AICR accelerator enum value (e.g. "h100", "l40") resolved
	// from the GPU's PCI device ID, or "" when the device ID is unknown or
	// the node carries a heterogeneous mix of GPU SKUs. Lets the fingerprint
	// name the accelerator without nvidia-smi or a GFD node label.
	SKU string
}

HardwareInfo describes the GPU hardware state detected without drivers.
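
The SKU rule described in the field comment (empty when the device ID is unknown or the node mixes SKUs) could look roughly like the sketch below. The lookup table holds two illustrative entries only and is not the content of device_ids.go; resolveSKU is a hypothetical name, and treating any single unknown ID as "unresolved" is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// skuByDeviceID holds illustrative entries only. The package's real table
// lives in device_ids.go and is not reproduced here.
var skuByDeviceID = map[string]string{
	"0x2330": "h100", // H100 SXM5
	"0x26b5": "l40",  // L40
}

// resolveSKU returns the single SKU shared by all device IDs, or "" when
// any ID is unknown, when the IDs map to different SKUs, or when the list
// is empty.
func resolveSKU(deviceIDs []string) string {
	sku := ""
	for _, id := range deviceIDs {
		s, ok := skuByDeviceID[strings.ToLower(id)]
		if !ok {
			return "" // unknown device ID
		}
		if sku != "" && s != sku {
			return "" // heterogeneous mix of SKUs
		}
		sku = s
	}
	return sku
}

func main() {
	fmt.Printf("%q\n", resolveSKU([]string{"0x2330", "0x2330"})) // prints: "h100"
	fmt.Printf("%q\n", resolveSKU([]string{"0x2330", "0x26b5"})) // prints: ""
	fmt.Printf("%q\n", resolveSKU([]string{"0xffff"}))           // prints: ""
}
```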

type NFDHardwareDetector added in v0.12.0

type NFDHardwareDetector struct{}

NFDHardwareDetector uses NFD source packages to detect GPU hardware via PCI enumeration and kernel module state from sysfs/procfs.

NFDHardwareDetector is not safe for concurrent use. NFD source singletons are shared package-level state without synchronization. In AICR's architecture the GPU collector runs once per snapshot, so this is not a practical concern.

func (*NFDHardwareDetector) Detect added in v0.12.0

func (*NFDHardwareDetector) Detect(ctx context.Context) (*HardwareInfo, error)

Detect discovers GPU hardware using NFD PCI and kernel sources. PCI discovery is required; kernel module detection is best-effort.

This method requires Linux with sysfs/procfs mounted. On other platforms (macOS, containers without /sys), PCI discovery will fail and an error is returned. The caller (Collector.Collect) handles this gracefully by logging a warning and returning a GPU measurement with no subtypes.
