nvml

package
v0.9.2 Latest
Warning

This package is not in the latest version of its module.

Published: Jan 7, 2026 License: Apache-2.0 Imports: 13 Imported by: 0

Documentation

Overview

Package nvml implements the NVIDIA Management Library (NVML) interface. See https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference for more details.

Constants

This section is empty.

Variables

This section is empty.

Functions

func ClockEventsSupportedVersion

func ClockEventsSupportedVersion(major int) bool

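ClockEventsSupportedVersion reports whether clock events are supported for the given driver major version. Clock events are supported in driver versions 535 and above. On older versions, the CGO call exits with "undefined symbol: nvmlDeviceGetCurrentClocksEventReasons".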
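A caller would typically gate any clock-event query on this check. The sketch below is a self-contained stand-in rather than the package's source: `clockEventsSupported` and `minClockEventsDriverMajor` are hypothetical local names that mirror the documented threshold of 535.

```go
package main

import "fmt"

// minClockEventsDriverMajor is the threshold stated in the documentation
// above. The constant name is an assumption made for this sketch.
const minClockEventsDriverMajor = 535

// clockEventsSupported is a hypothetical stand-in for
// ClockEventsSupportedVersion; it only encodes the documented rule.
func clockEventsSupported(major int) bool {
	return major >= minClockEventsDriverMajor
}

func main() {
	// Example driver major versions, chosen for illustration only.
	for _, major := range []int{470, 535, 550} {
		if clockEventsSupported(major) {
			fmt.Printf("driver %d: safe to query clock events\n", major)
		} else {
			fmt.Printf("driver %d: skip clock events\n", major)
		}
	}
}
```

Checking the major version before making the call avoids hitting the undefined-symbol failure on older drivers.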

func GetArchFamily added in v0.5.0

func GetArchFamily(dev device.Device) (string, error)

GetArchFamily returns the GPU architecture family name based on the given device's CUDA compute capability. ref. https://github.com/NVIDIA/k8s-device-plugin/blob/f666bc3f836a09ae2fda439f3d7a8d8b06b48ac4/internal/lm/resource.go#L283C6-L283C19
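To illustrate the idea of deriving a family name from a compute capability, here is a minimal sketch. It is not the package's implementation: `archFamily` is a hypothetical helper, the returned strings are assumptions, and the table only covers the publicly documented capability ranges (for example 7.0 Volta, 7.5 Turing, 8.9 Ada Lovelace, 9.0 Hopper). The table the package actually uses may differ.

```go
package main

import "fmt"

// archFamily is a hypothetical helper that maps a CUDA compute capability
// (major.minor) to an architecture family name. The string values are
// assumptions made for this sketch.
func archFamily(major, minor int) string {
	switch major {
	case 3:
		return "kepler"
	case 5:
		return "maxwell"
	case 6:
		return "pascal"
	case 7:
		// 7.0 and 7.2 are Volta; 7.5 is Turing.
		if minor < 5 {
			return "volta"
		}
		return "turing"
	case 8:
		// 8.0, 8.6 and 8.7 are Ampere; 8.9 is Ada Lovelace.
		if minor < 9 {
			return "ampere"
		}
		return "ada-lovelace"
	case 9:
		return "hopper"
	case 10:
		return "blackwell"
	}
	return "undefined"
}

func main() {
	fmt.Println(archFamily(8, 0))
	fmt.Println(archFamily(9, 0))
}
```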

func GetBrand added in v0.5.0

func GetBrand(dev device.Device) (string, error)

func GetCUDAVersion added in v0.4.5

func GetCUDAVersion() (string, error)

func GetDriverVersion

func GetDriverVersion() (string, error)

func GetProductName added in v0.5.0

func GetProductName(dev device.Device) (string, error)

func GetSystemDriverVersion added in v0.5.0

func GetSystemDriverVersion(nvmlLib nvml.Interface) (string, error)

func LoadGPUDeviceName added in v0.4.5

func LoadGPUDeviceName() (string, error)

LoadGPUDeviceName loads the product name of the NVIDIA GPU device.

func ParseDriverVersion

func ParseDriverVersion(version string) (major, minor, patch int, err error)
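ParseDriverVersion has no doc comment, so the following is only a sketch of what parsing a dotted driver version could look like. `parseDriverVersion` is a hypothetical re-implementation, and the accepted formats ("major.minor.patch" and "major.minor", with a missing patch treated as 0) are assumptions rather than documented behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDriverVersion is a hypothetical sketch, not the package's code.
// It assumes versions look like "535.161.08" or "535.161".
func parseDriverVersion(version string) (major, minor, patch int, err error) {
	parts := strings.Split(strings.TrimSpace(version), ".")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, 0, 0, fmt.Errorf("unexpected driver version format %q", version)
	}

	nums := make([]int, 3) // patch stays 0 when the third component is omitted
	for i, p := range parts {
		n, convErr := strconv.Atoi(p)
		if convErr != nil {
			return 0, 0, 0, fmt.Errorf("invalid component %q in %q: %w", p, version, convErr)
		}
		if n < 0 {
			return 0, 0, 0, fmt.Errorf("negative component %q in %q", p, version)
		}
		nums[i] = n
	}
	return nums[0], nums[1], nums[2], nil
}

func main() {
	major, minor, patch, err := parseDriverVersion("535.161.08")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(major, minor, patch)
}
```

The major component is the value that a check such as ClockEventsSupportedVersion consumes.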

Types

type FailureInjectorConfig added in v0.9.0

type FailureInjectorConfig struct {
	GPUUUIDsWithGPULost                           []string
	GPUUUIDsWithGPURequiresReset                  []string
	GPUUUIDsWithFabricStateHealthSummaryUnhealthy []string

	// GPUProductNameOverride overrides the detected GPU product name.
	// This is useful for testing fabric state failure injection on systems where
	// the actual GPU (e.g., H100-PCIe) doesn't support fabric state monitoring.
	// Set this to a product name like "H100-SXM" to simulate a fabric-capable GPU.
	// When set, this affects FabricStateSupported(), FabricManagerSupported(),
	// and memory management capabilities detection.
	GPUProductNameOverride string
}

FailureInjectorConfig holds configuration for test failure injection.
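As a rough illustration of how a test might consult such a configuration, the sketch below mirrors the exported fields in a local struct and looks up which failures are requested for a given GPU UUID. Everything here is hypothetical: `failureInjectorConfig`, `injectedFailures`, the failure labels, and the placeholder UUIDs are not part of the package.

```go
package main

import "fmt"

// failureInjectorConfig is a local mirror of the exported fields shown
// above, so that this sketch compiles without the real package.
type failureInjectorConfig struct {
	GPUUUIDsWithGPULost                           []string
	GPUUUIDsWithGPURequiresReset                  []string
	GPUUUIDsWithFabricStateHealthSummaryUnhealthy []string
	GPUProductNameOverride                        string
}

// contains reports whether s is present in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// injectedFailures is a hypothetical helper that lists the failures the
// config requests for the given UUID. The labels are made up for this sketch.
func injectedFailures(cfg *failureInjectorConfig, uuid string) []string {
	if cfg == nil {
		return nil
	}
	var out []string
	if contains(cfg.GPUUUIDsWithGPULost, uuid) {
		out = append(out, "gpu-lost")
	}
	if contains(cfg.GPUUUIDsWithGPURequiresReset, uuid) {
		out = append(out, "gpu-requires-reset")
	}
	if contains(cfg.GPUUUIDsWithFabricStateHealthSummaryUnhealthy, uuid) {
		out = append(out, "fabric-unhealthy")
	}
	return out
}

func main() {
	cfg := &failureInjectorConfig{
		GPUUUIDsWithGPULost:    []string{"GPU-placeholder-0"},
		GPUProductNameOverride: "H100-SXM",
	}
	fmt.Println(injectedFailures(cfg, "GPU-placeholder-0"))
	fmt.Println(injectedFailures(cfg, "GPU-placeholder-1"))
}
```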

type Instance

type Instance interface {
	// NVMLExists returns true if the NVML library is installed.
	NVMLExists() bool

	// Library returns the NVML library.
	Library() nvmllib.Library

	// Devices returns the current devices in the system.
	// The key is the UUID of the GPU device.
	Devices() map[string]device.Device

	// ProductName returns the product name of the GPU.
	// Note that some machines have nvml library but the driver is not installed,
	// returning empty value for the GPU product name.
	ProductName() string

	// Architecture returns the architecture of the GPU.
	// GB200 may return "NVIDIA-Graphics-Device" for the product name
	// but "blackwell" for architecture.
	Architecture() string

	// Brand returns the brand of the GPU.
	Brand() string

	// DriverVersion returns the driver version of the GPU.
	DriverVersion() string

	// DriverMajor returns the major version of the driver.
	DriverMajor() int

	// CUDAVersion returns the CUDA version of the GPU.
	CUDAVersion() string

	// FabricManagerSupported returns true if the fabric manager is supported.
	FabricManagerSupported() bool

	// FabricStateSupported returns true if NVML fabric state telemetry is
	// available for the product (e.g. GB200 via nvmlDeviceGetGpuFabricInfo*).
	FabricStateSupported() bool

	// GetMemoryErrorManagementCapabilities returns the memory error management capabilities of the GPU.
	GetMemoryErrorManagementCapabilities() nvidiaproduct.MemoryErrorManagementCapabilities

	// Shutdown shuts down the NVML library.
	Shutdown() error
}

Instance is the interface for the NVML library connector.
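Because Go interfaces are satisfied structurally, calling code can depend on a narrow subset of these methods and be tested with a small fake. The sketch below assumes nothing about the package beyond the method signatures listed above; `gpuSummary`, `fakeInstance`, and `describe` are hypothetical names.

```go
package main

import "fmt"

// gpuSummary is a hypothetical narrow interface containing three of the
// Instance methods listed above. Any value implementing Instance would
// also satisfy it.
type gpuSummary interface {
	NVMLExists() bool
	ProductName() string
	DriverMajor() int
}

// fakeInstance is a test double used only in this sketch.
type fakeInstance struct {
	exists  bool
	product string
	major   int
}

func (f fakeInstance) NVMLExists() bool    { return f.exists }
func (f fakeInstance) ProductName() string { return f.product }
func (f fakeInstance) DriverMajor() int    { return f.major }

// describe guards on NVMLExists before reading other fields and tolerates
// an empty product name, which the interface comments say can happen when
// the driver is not installed.
func describe(g gpuSummary) string {
	if !g.NVMLExists() {
		return "NVML not available"
	}
	name := g.ProductName()
	if name == "" {
		name = "unknown product"
	}
	return fmt.Sprintf("%s (driver major %d)", name, g.DriverMajor())
}

func main() {
	fmt.Println(describe(fakeInstance{}))
	fmt.Println(describe(fakeInstance{exists: true, product: "example-gpu", major: 550}))
}
```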

func New added in v0.5.0

func New() (Instance, error)

New creates a new instance of the NVML library. If NVML is not installed, it returns a no-op NVML instance.

func NewNoOp added in v0.5.0

func NewNoOp() Instance

func NewWithExitOnSuccessfulLoad added in v0.5.0

func NewWithExitOnSuccessfulLoad(ctx context.Context) (Instance, error)

NewWithExitOnSuccessfulLoad creates a new instance of the NVML library. If NVML is not installed, it returns a no-op NVML instance. It also calls the exit function when NVML is successfully loaded; the exit function is called only in the case where the NVML library was initially not found. Other errors are returned as is.
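The constructors above describe a fallback: a missing library yields a no-op instance, while other errors are returned. One way such a fallback could be structured is sketched below. This is an assumption about the shape of the logic, not the package's source; `newInstance`, `errLibraryNotFound`, `errOther`, and the `load` callback are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errLibraryNotFound is a hypothetical sentinel for "NVML is not installed".
var errLibraryNotFound = errors.New("nvml library not found")

// errOther is a hypothetical example of any other load failure.
var errOther = errors.New("driver/library version mismatch")

type instance interface{ NVMLExists() bool }

type realInstance struct{}

func (realInstance) NVMLExists() bool { return true }

type noOpInstance struct{}

func (noOpInstance) NVMLExists() bool { return false }

// newInstance takes a hypothetical loader callback and applies the
// documented rule: not found -> no-op instance, other errors -> returned.
func newInstance(load func() error) (instance, error) {
	err := load()
	switch {
	case err == nil:
		return realInstance{}, nil
	case errors.Is(err, errLibraryNotFound):
		return noOpInstance{}, nil
	default:
		return nil, err
	}
}

func main() {
	inst, err := newInstance(func() error { return errLibraryNotFound })
	fmt.Println(inst.NVMLExists(), err)

	_, err = newInstance(func() error { return errOther })
	fmt.Println(err)
}
```

Returning a no-op value instead of an error lets callers run the same code path on machines without GPUs and branch on NVMLExists only where it matters.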

func NewWithFailureInjector added in v0.9.0

func NewWithFailureInjector(failureInjector *FailureInjectorConfig) (Instance, error)

NewWithFailureInjector creates a new instance with failure injection configuration.

type Op

type Op struct {
	// contains filtered or unexported fields
}

type OpOption

type OpOption func(*Op)

func WithHWSlowdownEventBucket added in v0.4.5

func WithHWSlowdownEventBucket(bucket eventstore.Bucket) OpOption
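`OpOption` follows the functional options pattern: each option is a function that mutates an `Op`. The sketch below shows the general pattern with local stand-in types. The field name `hwSlowdownEventBucket`, the `bucket` interface (standing in for `eventstore.Bucket`), and `applyOpts` are assumptions, since the real `Op` fields are unexported.

```go
package main

import "fmt"

// bucket is a stand-in for eventstore.Bucket; its method is hypothetical.
type bucket interface{ Name() string }

// memBucket is a trivial bucket used only in this sketch.
type memBucket string

func (b memBucket) Name() string { return string(b) }

// op mirrors the idea of Op: a struct with unexported fields that is
// configured only through options.
type op struct {
	hwSlowdownEventBucket bucket
}

// opOption mirrors OpOption: a function that mutates an op.
type opOption func(*op)

// withHWSlowdownEventBucket returns an option that sets the bucket.
func withHWSlowdownEventBucket(b bucket) opOption {
	return func(o *op) { o.hwSlowdownEventBucket = b }
}

// applyOpts builds an op by applying each option in order.
func applyOpts(opts ...opOption) *op {
	o := &op{}
	for _, opt := range opts {
		opt(o)
	}
	return o
}

func main() {
	o := applyOpts(withHWSlowdownEventBucket(memBucket("hw-slowdown")))
	fmt.Println(o.hwSlowdownEventBucket.Name())
}
```

The pattern keeps the configuration struct private while letting new options be added later without changing existing call sites.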

Directories

Path Synopsis
device
Package device provides a wrapper around the "github.com/NVIDIA/go-nvlib/pkg/nvlib/device".Device type that adds PCIBusID and UUID methods, with support for test failure injection.
lib
Package lib implements the NVIDIA Management Library (NVML) interface.
