nvml

package
v0.11.2
Published: Mar 25, 2026 License: Apache-2.0 Imports: 13 Imported by: 0

Documentation

Overview

Package nvml implements the NVIDIA Management Library (NVML) interface. See https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference for more details.

Constants

This section is empty.

Variables

var ErrDeviceGetDevicesInjected = errors.New("error getting device handle for index '0': Unknown Error (injected for testing)")

ErrDeviceGetDevicesInjected is the error returned when NVMLDeviceGetDevicesError is enabled. This simulates the "Unable to determine the device handle for GPU: Unknown Error" scenario.

Functions

func ClockEventsSupportedVersion

func ClockEventsSupportedVersion(major int) bool

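ClockEventsSupportedVersion reports whether clock events are supported for the given driver major version. Clock events are supported in driver versions 535 and above; on older drivers, the CGO call exits with "undefined symbol: nvmlDeviceGetCurrentClocksEventReasons".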
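The version gate described above can be sketched as a simple threshold check. This is a minimal illustration, not the package's implementation: `clockEventsSupported` and `minClockEventsDriverMajor` are hypothetical names, and only the 535 threshold comes from the documentation.

```go
package main

import "fmt"

// minClockEventsDriverMajor is the first driver major version that the
// documentation lists as supporting clock events.
const minClockEventsDriverMajor = 535

// clockEventsSupported is a hypothetical stand-in for
// ClockEventsSupportedVersion; the real function may apply extra checks.
func clockEventsSupported(major int) bool {
	return major >= minClockEventsDriverMajor
}

func main() {
	for _, major := range []int{525, 535, 550} {
		fmt.Printf("driver %d: clock events supported = %t\n", major, clockEventsSupported(major))
	}
}
```

A caller would typically run such a check before invoking any clock-event API, so that older drivers never reach the missing symbol.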

func GetArchFamily

func GetArchFamily(dev device.Device) (string, error)

GetArchFamily returns the GPU architecture family name based on the CUDA compute capability of the given device. ref. https://github.com/NVIDIA/k8s-device-plugin/blob/f666bc3f836a09ae2fda439f3d7a8d8b06b48ac4/internal/lm/resource.go#L283C6-L283C19
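To illustrate how a compute capability can map to a family name, here is a rough sketch. `archFamily` is a hypothetical helper that takes the major and minor capability numbers directly. The table follows publicly documented CUDA compute capabilities, but the exact strings and the set of cases the package returns are assumptions.

```go
package main

import "fmt"

// archFamily maps a CUDA compute capability (major.minor) to an architecture
// family name. Hypothetical helper: the returned strings are assumptions and
// may not match what GetArchFamily returns.
func archFamily(major, minor int) string {
	switch major {
	case 5:
		return "maxwell"
	case 6:
		return "pascal"
	case 7:
		// 7.0 and 7.2 are Volta; 7.5 is Turing.
		if minor < 5 {
			return "volta"
		}
		return "turing"
	case 8:
		// 8.0, 8.6 and 8.7 are Ampere; 8.9 is Ada Lovelace.
		if minor < 9 {
			return "ampere"
		}
		return "ada-lovelace"
	case 9:
		return "hopper"
	case 10:
		return "blackwell"
	}
	return "undefined"
}

func main() {
	fmt.Println(archFamily(8, 0))
	fmt.Println(archFamily(9, 0))
}
```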

func GetBrand

func GetBrand(dev device.Device) (string, error)

func GetCUDAVersion

func GetCUDAVersion() (string, error)

func GetDriverVersion

func GetDriverVersion() (string, error)

func GetProductName

func GetProductName(dev device.Device) (string, error)

func GetSystemDriverVersion

func GetSystemDriverVersion(nvmlLib nvml.Interface) (string, error)

func LoadGPUDeviceName

func LoadGPUDeviceName() (string, error)

LoadGPUDeviceName loads the product name of the NVIDIA GPU device.

func ParseDriverVersion

func ParseDriverVersion(version string) (major, minor, patch int, err error)
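ParseDriverVersion has no doc comment, so the sketch below only illustrates one plausible way to split a driver version string such as "535.161.08" into its numeric parts. `parseDriverVersion` is a hypothetical re-implementation; the accepted formats (two or three dot-separated components) and the error messages are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDriverVersion is a hypothetical sketch of a driver version parser.
// It accepts "major.minor" or "major.minor.patch"; a missing patch is 0.
func parseDriverVersion(version string) (major, minor, patch int, err error) {
	parts := strings.Split(strings.TrimSpace(version), ".")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, 0, 0, fmt.Errorf("unexpected driver version format %q", version)
	}
	nums := make([]int, 3)
	for i, p := range parts {
		n, convErr := strconv.Atoi(p)
		if convErr != nil || n < 0 {
			return 0, 0, 0, fmt.Errorf("invalid component %q in driver version %q", p, version)
		}
		nums[i] = n
	}
	return nums[0], nums[1], nums[2], nil
}

func main() {
	major, minor, patch, err := parseDriverVersion("535.161.08")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(major, minor, patch)
}
```

The major component is what a check like ClockEventsSupportedVersion would consume.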

Types

type FailureInjectorConfig

type FailureInjectorConfig struct {
	GPUUUIDsWithGPULost                           []string
	GPUUUIDsWithGPURequiresReset                  []string
	GPUUUIDsWithFabricStateHealthSummaryUnhealthy []string

	// GPUProductNameOverride overrides the detected GPU product name.
	// This is useful for testing fabric state failure injection on systems where
	// the actual GPU (e.g., H100-PCIe) doesn't support fabric state monitoring.
	// Set this to a product name like "H100-SXM" to simulate a fabric-capable GPU.
	// When set, this affects FabricStateSupported(), FabricManagerSupported(),
	// and memory management capabilities detection.
	GPUProductNameOverride string

	// NVMLDeviceGetDevicesError when true simulates Device().GetDevices() failure.
	// This is useful for testing the "Unable to determine the device handle for GPU: Unknown Error"
	// scenario that occurs when NVML library loads but device enumeration fails (e.g., Xid 79).
	// When enabled, gpud continues running but all nvidia components report unhealthy.
	// ref. https://github.com/leptonai/gpud/pull/1180
	NVMLDeviceGetDevicesError bool
}

FailureInjectorConfig holds the configuration for test failure injection.
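As a rough illustration of how such a configuration might be consulted, the sketch below mirrors a subset of the fields in a local struct and checks whether a GPU UUID is marked for an injected "GPU lost" failure. The local type `failureInjectorConfig`, the helper `isGPULostInjected`, and the UUID strings are hypothetical; in the real package the configuration is passed to NewWithFailureInjector.

```go
package main

import "fmt"

// failureInjectorConfig mirrors a subset of FailureInjectorConfig so that
// this example compiles on its own.
type failureInjectorConfig struct {
	GPUUUIDsWithGPULost          []string
	GPUUUIDsWithGPURequiresReset []string
	NVMLDeviceGetDevicesError    bool
}

// isGPULostInjected is a hypothetical helper that reports whether the given
// UUID is listed for "GPU lost" injection. A nil config injects nothing.
func isGPULostInjected(cfg *failureInjectorConfig, uuid string) bool {
	if cfg == nil {
		return false
	}
	for _, u := range cfg.GPUUUIDsWithGPULost {
		if u == uuid {
			return true
		}
	}
	return false
}

func main() {
	cfg := &failureInjectorConfig{
		// Placeholder UUIDs, not real device identifiers.
		GPUUUIDsWithGPULost: []string{"GPU-example-0"},
	}
	fmt.Println(isGPULostInjected(cfg, "GPU-example-0"))
	fmt.Println(isGPULostInjected(cfg, "GPU-example-1"))
}
```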

type Instance

type Instance interface {
	// NVMLExists returns true if the NVML library is installed.
	NVMLExists() bool

	// Library returns the NVML library.
	Library() nvmllib.Library

	// Devices returns the current devices in the system.
	// The key is the UUID of the GPU device.
	Devices() map[string]device.Device

	// ProductName returns the product name of the GPU.
	// Note that some machines have nvml library but the driver is not installed,
	// returning empty value for the GPU product name.
	ProductName() string

	// Architecture returns the architecture of the GPU.
	// GB200 may return "NVIDIA-Graphics-Device" for the product name
	// but "blackwell" for architecture.
	Architecture() string

	// Brand returns the brand of the GPU.
	Brand() string

	// DriverVersion returns the driver version of the GPU.
	DriverVersion() string

	// DriverMajor returns the major version of the driver.
	DriverMajor() int

	// CUDAVersion returns the CUDA version of the GPU.
	CUDAVersion() string

	// FabricManagerSupported returns true if the fabric manager is supported.
	FabricManagerSupported() bool

	// FabricStateSupported returns true if NVML fabric state telemetry is
	// available for the product (e.g. GB200 via nvmlDeviceGetGpuFabricInfo*).
	FabricStateSupported() bool

	// GetMemoryErrorManagementCapabilities returns the memory error management capabilities of the GPU.
	GetMemoryErrorManagementCapabilities() nvidiaproduct.MemoryErrorManagementCapabilities

	// Shutdown shuts down the NVML library.
	Shutdown() error

	// InitError returns any error that occurred during NVML initialization.
	// If initialization succeeded, this returns nil.
	// Components should check this and report unhealthy if non-nil.
	// This typically occurs when NVML library loads but device enumeration fails,
	// for example: "error getting device handle for index '4': Unknown Error"
	// which corresponds to nvidia-smi showing:
	// "Unable to determine the device handle for GPU0000:XX:00.0: Unknown Error"
	InitError() error
}

Instance is the interface for the NVML library connector.

func New

func New() (Instance, error)

New creates a new instance of the NVML library. If NVML is not installed, it returns a no-op NVML instance.

func NewErrored

func NewErrored(initErr error) Instance

NewErrored creates an Instance that represents a failed NVML initialization. gpud run continues even when NVML fails, but all nvidia accelerator components will report unhealthy with this error. This typically happens when:

- nvidia-smi shows: "Unable to determine the device handle for GPU0000:XX:00.0: Unknown Error"
- NVML returns: "error getting device handle for index 'N': Unknown Error"

func NewNoOp

func NewNoOp() Instance

func NewWithExitOnSuccessfulLoad

func NewWithExitOnSuccessfulLoad(ctx context.Context) (Instance, error)

NewWithExitOnSuccessfulLoad creates a new instance of the NVML library. If NVML is not installed, it returns a no-op NVML instance. It also calls the exit function when NVML is successfully loaded. The exit function is only called when the NVML library is not found. Other errors are returned as is.

func NewWithFailureInjector

func NewWithFailureInjector(failureInjector *FailureInjectorConfig) (Instance, error)

NewWithFailureInjector creates a new instance with failure injection configuration.

type Op

type Op struct {
	// contains filtered or unexported fields
}

type OpOption

type OpOption func(*Op)

func WithHWSlowdownEventBucket

func WithHWSlowdownEventBucket(bucket eventstore.Bucket) OpOption

Directories

Path Synopsis
Package device provides a wrapper around the "github.com/NVIDIA/go-nvlib/pkg/nvlib/device".Device type that adds a PCIBusID and UUID method, with support for test failure injection.
Package lib implements the NVIDIA Management Library (NVML) interface.
