Documentation ¶
Overview ¶
Package gpu collects GPU hardware data via driver-free NFD/PCI enumeration.
Detection ¶
The collector uses a single, driver-free detector backed by NFD source packages: it enumerates PCI devices from sysfs, counts NVIDIA GPUs by vendor/class, resolves the accelerator SKU from each device ID (see device_ids.go), and checks the nvidia kernel-module state. It requires neither the NVIDIA driver nor nvidia-smi — only Linux with sysfs mounted — which is what makes day-0 detection on freshly provisioned nodes possible.
(Historically a second "smi" phase shelled out to nvidia-smi for the SKU and extra telemetry; it was removed once the PCI device ID could name the SKU, eliminating the CUDA-image dependency. The accelerator SKU now also comes from the GPU-operator nvidia.com/gpu.product label during fingerprinting.)
Graceful Degradation ¶
Detection degrades to an empty result rather than an error:
- Detector failure (e.g., no sysfs on macOS): logged as a warning; the GPU measurement is returned with no subtypes.
- No HardwareDetector configured: the GPU measurement is returned with no subtypes.
Measurement Structure ¶
Measurement{
    Type: "GPU",
    Subtypes: [
        {Name: "hardware", Data: {gpu-present, gpu-count, driver-loaded, detection-source, model}},
    ],
}
The "hardware" subtype keys are defined in pkg/measurement:
- KeyGPUPresent: bool — true if at least one NVIDIA GPU found via PCI
- KeyGPUCount: int — number of NVIDIA GPUs detected
- KeyGPUDriverLoaded: bool — true if nvidia kernel module is loaded
- KeyGPUDetectionSource: string — detection method (e.g., "nfd")
- KeyGPUModel: string — accelerator SKU resolved from the PCI device ID (omitted when the device ID is unknown). This is a descriptive discovery vocabulary broader than the recipe accelerator enum.
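As a rough sketch of how these keys might be populated, the snippet below builds the "hardware" data from a detection result and leaves the model key out when no SKU was resolved. The constant values are taken from the key names shown in the structure above; the map type, the local HardwareInfo copy, and the hardwareData helper are assumptions, not the pkg/measurement API.

```go
package main

import "fmt"

// Key strings assumed to match the names listed in the measurement structure.
const (
	keyGPUPresent         = "gpu-present"
	keyGPUCount           = "gpu-count"
	keyGPUDriverLoaded    = "driver-loaded"
	keyGPUDetectionSource = "detection-source"
	keyGPUModel           = "model"
)

// HardwareInfo is a local copy of the documented struct.
type HardwareInfo struct {
	GPUPresent      bool
	GPUCount        int
	DriverLoaded    bool
	DetectionSource string
	SKU             string
}

// hardwareData is a hypothetical helper that maps a detection result to
// subtype data, omitting the model key when the SKU is unknown.
func hardwareData(info HardwareInfo) map[string]any {
	data := map[string]any{
		keyGPUPresent:         info.GPUPresent,
		keyGPUCount:           info.GPUCount,
		keyGPUDriverLoaded:    info.DriverLoaded,
		keyGPUDetectionSource: info.DetectionSource,
	}
	if info.SKU != "" {
		data[keyGPUModel] = info.SKU
	}
	return data
}

func main() {
	d := hardwareData(HardwareInfo{GPUPresent: true, GPUCount: 8, DetectionSource: "nfd", SKU: "h100"})
	fmt.Println(d[keyGPUCount], d[keyGPUModel])
}
```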
Usage ¶
The collector is created by the factory with NFD wiring:
collector := gpu.NewCollector(
	gpu.WithHardwareDetector(&gpu.NFDHardwareDetector{}),
)
m, err := collector.Collect(ctx)
Without WithHardwareDetector, Collect returns a GPU measurement with no subtypes.
Context and Timeouts ¶
The collector respects context cancellation and applies a bounded timeout (defaults.CollectorTimeout). NFD detection has its own sub-timeout (defaults.NFDDetectionTimeout).
Types ¶
type Collector ¶
type Collector struct {
	// contains filtered or unexported fields
}
Collector collects GPU information via driver-free NFD/PCI enumeration: presence, count, kernel-module state, and the accelerator SKU resolved from the PCI device ID. It requires neither the NVIDIA driver nor nvidia-smi.
func NewCollector ¶ added in v0.12.0
func NewCollector(opts ...CollectorOption) *Collector
NewCollector creates a GPU collector with the given options.
func (*Collector) Collect ¶
func (s *Collector) Collect(ctx context.Context) (*measurement.Measurement, error)
Collect retrieves GPU information from the hardware detector (NFD PCI enumeration). It degrades gracefully: when no detector is configured or detection fails, it returns a GPU measurement with no subtypes rather than an error, so a snapshot on a node without sysfs (e.g. macOS) still succeeds.
type CollectorOption ¶ added in v0.12.0
type CollectorOption func(*Collector)
CollectorOption configures a Collector.
func WithHardwareDetector ¶ added in v0.12.0
func WithHardwareDetector(d HardwareDetector) CollectorOption
WithHardwareDetector sets the hardware detector used for GPU detection. When not set, Collect returns a GPU measurement with no subtypes.
type HardwareDetector ¶ added in v0.12.0
type HardwareDetector interface {
	// Detect discovers GPU hardware and driver module state.
	// Returns HardwareInfo describing what was found, or an error if
	// detection could not be performed (e.g., sysfs not available).
	Detect(ctx context.Context) (*HardwareInfo, error)
}
HardwareDetector abstracts GPU hardware detection for testability. Implementations enumerate PCI devices and kernel module state without requiring GPU drivers to be installed.
type HardwareInfo ¶ added in v0.12.0
type HardwareInfo struct {
	// GPUPresent is true if at least one NVIDIA GPU was found via PCI enumeration.
	GPUPresent bool
	// GPUCount is the number of NVIDIA GPUs detected via PCI enumeration.
	GPUCount int
	// DriverLoaded is true if the nvidia kernel module is currently loaded.
	DriverLoaded bool
	// DetectionSource identifies which detection method produced this result
	// (e.g., "nfd", "sysfs").
	DetectionSource string
	// SKU is the AICR accelerator enum value (e.g. "h100", "l40") resolved
	// from the GPU's PCI device ID, or "" when the device ID is unknown or
	// the node carries a heterogeneous mix of GPU SKUs. Lets the fingerprint
	// name the accelerator without nvidia-smi or a GFD node label.
	SKU string
}
HardwareInfo describes the GPU hardware state detected without drivers.
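The SKU rule described on the SKU field (empty for an unknown device ID or a heterogeneous mix) can be sketched as a table lookup. The resolveSKU helper is hypothetical, and the device IDs in the table are placeholders rather than real NVIDIA IDs; the authoritative mapping is the one in device_ids.go.

```go
package main

import (
	"fmt"
	"strings"
)

// placeholderTable maps PCI device IDs to SKU names. The IDs are
// placeholders for illustration, not real NVIDIA device IDs.
var placeholderTable = map[string]string{
	"0xaaa1": "h100",
	"0xaaa2": "h100",
	"0xbbb1": "l40",
}

// resolveSKU returns the single SKU shared by all device IDs, or "" when
// the list is empty, any ID is unknown, or the IDs map to different SKUs.
func resolveSKU(deviceIDs []string, table map[string]string) string {
	sku := ""
	for _, id := range deviceIDs {
		s, ok := table[strings.ToLower(id)]
		if !ok {
			return "" // unknown device ID
		}
		if sku != "" && s != sku {
			return "" // heterogeneous mix
		}
		sku = s
	}
	return sku
}

func main() {
	fmt.Println(resolveSKU([]string{"0xAAA1", "0xaaa2"}, placeholderTable))
}
```

Returning "" for a mixed node avoids naming one accelerator for hardware that does not uniformly match it; treating a single unknown ID as "unknown overall" is a design assumption of this sketch.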
type NFDHardwareDetector ¶ added in v0.12.0
type NFDHardwareDetector struct{}
NFDHardwareDetector uses NFD source packages to detect GPU hardware via PCI enumeration and kernel module state from sysfs/procfs.
NFDHardwareDetector is not safe for concurrent use. NFD source singletons are shared package-level state without synchronization. In AICR's architecture the GPU collector runs once per snapshot, so this is not a practical concern.
func (*NFDHardwareDetector) Detect ¶ added in v0.12.0
func (d *NFDHardwareDetector) Detect(ctx context.Context) (*HardwareInfo, error)
Detect discovers GPU hardware using NFD PCI and kernel sources. PCI discovery is required; kernel module detection is best-effort.
This method requires Linux with sysfs/procfs mounted. On other platforms (macOS, containers without /sys), PCI discovery fails and an error is returned. The caller (Collector.Collect) handles this gracefully by logging a warning and returning a GPU measurement with no subtypes.