Documentation ¶
Overview ¶
Package nvml implements the NVIDIA Management Library (NVML) interface. See https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference for more details.
Index ¶
- func ClockEventsSupportedVersion(major int) bool
- func GetArchFamily(dev device.Device) (string, error)
- func GetBrand(dev device.Device) (string, error)
- func GetCUDAVersion() (string, error)
- func GetDriverVersion() (string, error)
- func GetProductName(dev device.Device) (string, error)
- func GetSystemDriverVersion(nvmlLib nvml.Interface) (string, error)
- func LoadGPUDeviceName() (string, error)
- func ParseDriverVersion(version string) (major, minor, patch int, err error)
- type FailureInjectorConfig
- type Instance
- type Op
- type OpOption
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
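func ClockEventsSupportedVersion ¶
func ClockEventsSupportedVersion(major int) bool
ClockEventsSupportedVersion reports whether the given driver major version supports clock events. Clock events are supported in driver versions 535 and above; on older drivers, the CGO call exits with "undefined symbol: nvmlDeviceGetCurrentClocksEventReasons".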
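The version gate described above can be sketched as a simple comparison. This is a minimal illustration, not the package's actual implementation: the name clockEventsSupported and the constant minClockEventsDriverMajor are hypothetical stand-ins, and only the 535 threshold comes from the note above.

```go
package main

import "fmt"

// minClockEventsDriverMajor is the first driver major version that, per the
// note above, exposes nvmlDeviceGetCurrentClocksEventReasons.
// The constant name is hypothetical.
const minClockEventsDriverMajor = 535

// clockEventsSupported is a hypothetical stand-in for
// ClockEventsSupportedVersion; the real implementation may differ.
func clockEventsSupported(major int) bool {
	return major >= minClockEventsDriverMajor
}

func main() {
	for _, major := range []int{470, 525, 535, 550} {
		fmt.Printf("driver %d: clock events supported = %t\n", major, clockEventsSupported(major))
	}
}
```

A caller would typically check this gate before issuing any clock-event query, so that older drivers never reach the missing symbol.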
func GetArchFamily ¶ added in v0.5.0
func GetArchFamily(dev device.Device) (string, error)
GetArchFamily returns the GPU architecture family name based on the given device's CUDA compute capability. ref. https://github.com/NVIDIA/k8s-device-plugin/blob/f666bc3f836a09ae2fda439f3d7a8d8b06b48ac4/internal/lm/resource.go#L283C6-L283C19
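To illustrate how a compute capability can map to an architecture family, here is a minimal sketch. The function archFamily is hypothetical and takes the major and minor numbers directly rather than a device.Device. The table follows NVIDIA's published compute capability ranges; the "blackwell" entries and the exact spelling of the returned names are assumptions and may not match what GetArchFamily returns.

```go
package main

import "fmt"

// archFamily maps a CUDA compute capability to an architecture family name.
// Hypothetical helper: the real GetArchFamily reads the capability from a
// device and may use different names or cover different versions.
func archFamily(computeMajor, computeMinor int) string {
	switch computeMajor {
	case 1:
		return "tesla"
	case 2:
		return "fermi"
	case 3:
		return "kepler"
	case 5:
		return "maxwell"
	case 6:
		return "pascal"
	case 7:
		// 7.0 and 7.2 are Volta; 7.5 is Turing.
		if computeMinor < 5 {
			return "volta"
		}
		return "turing"
	case 8:
		// 8.0, 8.6 and 8.7 are Ampere; 8.9 is Ada Lovelace.
		if computeMinor < 9 {
			return "ampere"
		}
		return "ada-lovelace"
	case 9:
		return "hopper"
	case 10, 12:
		// Assumption: Blackwell parts report major version 10 or 12.
		return "blackwell"
	}
	return "undefined"
}

func main() {
	fmt.Println(archFamily(9, 0))
	fmt.Println(archFamily(8, 9))
}
```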
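func GetCUDAVersion ¶ added in v0.4.5
func GetCUDAVersion() (string, error)
func GetDriverVersion ¶
func GetDriverVersion() (string, error)
func GetSystemDriverVersion ¶ added in v0.5.0
func GetSystemDriverVersion(nvmlLib nvml.Interface) (string, error)
func LoadGPUDeviceName ¶ added in v0.4.5
func LoadGPUDeviceName() (string, error)
LoadGPUDeviceName loads the product name of the NVIDIA GPU device.
func ParseDriverVersion ¶
func ParseDriverVersion(version string) (major, minor, patch int, err error)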
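ParseDriverVersion has no description on this page, so the following is only a sketch of what parsing a dotted driver version string such as "535.161.08" might look like. The function parseDriverVersion is hypothetical; accepting a two-component "major.minor" form and rejecting negative components are assumptions, not documented behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDriverVersion splits a dotted version string into numeric parts.
// Hypothetical sketch: it accepts "major.minor" or "major.minor.patch" and
// reports patch as 0 when it is absent.
func parseDriverVersion(version string) (major, minor, patch int, err error) {
	parts := strings.Split(strings.TrimSpace(version), ".")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, 0, 0, fmt.Errorf("invalid driver version %q: want major.minor[.patch]", version)
	}
	nums := make([]int, 3)
	for i, p := range parts {
		n, convErr := strconv.Atoi(p)
		if convErr != nil {
			return 0, 0, 0, fmt.Errorf("invalid driver version %q: %w", version, convErr)
		}
		if n < 0 {
			return 0, 0, 0, fmt.Errorf("invalid driver version %q: negative component", version)
		}
		nums[i] = n
	}
	return nums[0], nums[1], nums[2], nil
}

func main() {
	major, minor, patch, err := parseDriverVersion("535.161.08")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(major, minor, patch)
}
```

The major component is the one that a gate such as ClockEventsSupportedVersion consumes.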
Types ¶
type FailureInjectorConfig ¶ added in v0.9.0
type FailureInjectorConfig struct {
GPUUUIDsWithGPULost []string
GPUUUIDsWithGPURequiresReset []string
GPUUUIDsWithFabricStateHealthSummaryUnhealthy []string
// GPUProductNameOverride overrides the detected GPU product name.
// This is useful for testing fabric state failure injection on systems where
// the actual GPU (e.g., H100-PCIe) doesn't support fabric state monitoring.
// Set this to a product name like "H100-SXM" to simulate a fabric-capable GPU.
// When set, this affects FabricStateSupported(), FabricManagerSupported(),
// and memory management capabilities detection.
GPUProductNameOverride string
}
FailureInjectorConfig holds configuration for test failure injection.
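As a rough illustration of how such a configuration might be populated and consulted in a test, the sketch below redeclares the struct locally so that it compiles on its own. The helper isGPULost and the UUID values are hypothetical; how the package actually consumes these fields is not shown on this page.

```go
package main

import "fmt"

// FailureInjectorConfig is a local copy of the struct documented above,
// redeclared here only so the sketch is self-contained.
type FailureInjectorConfig struct {
	GPUUUIDsWithGPULost                           []string
	GPUUUIDsWithGPURequiresReset                  []string
	GPUUUIDsWithFabricStateHealthSummaryUnhealthy []string
	GPUProductNameOverride                        string
}

// isGPULost is a hypothetical helper that reports whether the given UUID is
// listed for "GPU lost" injection. A nil config injects nothing.
func isGPULost(cfg *FailureInjectorConfig, uuid string) bool {
	if cfg == nil {
		return false
	}
	for _, u := range cfg.GPUUUIDsWithGPULost {
		if u == uuid {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder UUIDs; real values would come from the devices under test.
	cfg := &FailureInjectorConfig{
		GPUUUIDsWithGPULost:    []string{"GPU-example-0"},
		GPUProductNameOverride: "H100-SXM",
	}
	fmt.Println(isGPULost(cfg, "GPU-example-0"))
	fmt.Println(isGPULost(cfg, "GPU-example-1"))
}
```

In the real package, a config like this would be passed to NewWithFailureInjector rather than inspected by hand.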
type Instance ¶
type Instance interface {
// NVMLExists returns true if the NVML library is installed.
NVMLExists() bool
// Library returns the NVML library.
Library() nvmllib.Library
// Devices returns the current devices in the system.
// The key is the UUID of the GPU device.
Devices() map[string]device.Device
// ProductName returns the product name of the GPU.
// Note that some machines have nvml library but the driver is not installed,
// returning empty value for the GPU product name.
ProductName() string
// Architecture returns the architecture of the GPU.
// GB200 may return "NVIDIA-Graphics-Device" for the product name
// but "blackwell" for architecture.
Architecture() string
// Brand returns the brand of the GPU.
Brand() string
// DriverVersion returns the driver version of the GPU.
DriverVersion() string
// DriverMajor returns the major version of the driver.
DriverMajor() int
// CUDAVersion returns the CUDA version of the GPU.
CUDAVersion() string
// FabricManagerSupported returns true if the fabric manager is supported.
FabricManagerSupported() bool
// FabricStateSupported returns true if NVML fabric state telemetry is
// available for the product (e.g. GB200 via nvmlDeviceGetGpuFabricInfo*).
FabricStateSupported() bool
// GetMemoryErrorManagementCapabilities returns the memory error management capabilities of the GPU.
GetMemoryErrorManagementCapabilities() nvidiaproduct.MemoryErrorManagementCapabilities
// Shutdown shuts down the NVML library.
Shutdown() error
}
Instance is the interface for the NVML library connector.
func New ¶ added in v0.5.0
New creates a new instance of the NVML library. If NVML is not installed, it returns a no-op NVML instance.
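The no-op fallback can be pictured with a trimmed-down interface. The sketch below is an assumption about the general pattern, not the package's code: the interface instance keeps only a few of the methods listed on Instance, and noOpInstance is a hypothetical type whose methods return zero values.

```go
package main

import "fmt"

// instance is a reduced, hypothetical subset of the Instance interface,
// limited to methods that need no external types.
type instance interface {
	NVMLExists() bool
	ProductName() string
	DriverMajor() int
	Shutdown() error
}

// noOpInstance is a hypothetical fallback used when NVML is not installed.
// Every method returns a zero value and performs no work.
type noOpInstance struct{}

func (noOpInstance) NVMLExists() bool    { return false }
func (noOpInstance) ProductName() string { return "" }
func (noOpInstance) DriverMajor() int    { return 0 }
func (noOpInstance) Shutdown() error     { return nil }

func main() {
	var inst instance = noOpInstance{}
	if !inst.NVMLExists() {
		fmt.Println("NVML not installed; skipping GPU checks")
		return
	}
	fmt.Println("GPU:", inst.ProductName())
}
```

Returning a no-op value instead of an error lets callers on machines without GPUs share one code path and branch on NVMLExists only where it matters.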
func NewWithExitOnSuccessfulLoad ¶ added in v0.5.0
NewWithExitOnSuccessfulLoad creates a new instance of the NVML library. If NVML is not installed, it returns a no-op NVML instance. It also calls the exit function when NVML is successfully loaded. The exit function is only called in the case where the NVML library was not found; other errors are returned as is.
func NewWithFailureInjector ¶ added in v0.9.0
func NewWithFailureInjector(failureInjector *FailureInjectorConfig) (Instance, error)
NewWithFailureInjector creates a new instance with failure injection configuration.
type OpOption ¶
type OpOption func(*Op)
func WithHWSlowdownEventBucket ¶ added in v0.4.5
func WithHWSlowdownEventBucket(bucket eventstore.Bucket) OpOption
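OpOption follows Go's functional options pattern: each option is a function that mutates an Op. The sketch below shows the pattern under stated assumptions. The Op field, the applyOpts method, the Bucket interface (standing in for eventstore.Bucket) and memBucket are all hypothetical, since this page does not show the fields of Op.

```go
package main

import "fmt"

// Bucket is a hypothetical stand-in for eventstore.Bucket.
type Bucket interface {
	Name() string
}

// memBucket is a trivial Bucket used only for this example.
type memBucket struct{ name string }

func (b memBucket) Name() string { return b.name }

// Op collects optional settings. The field below is an assumption.
type Op struct {
	hwSlowdownEventBucket Bucket
}

// OpOption mutates an Op, matching the type shown above.
type OpOption func(*Op)

// applyOpts is a hypothetical helper that applies each option in order.
func (op *Op) applyOpts(opts []OpOption) {
	for _, opt := range opts {
		opt(op)
	}
}

// WithHWSlowdownEventBucket returns an option that records the bucket on Op.
func WithHWSlowdownEventBucket(bucket Bucket) OpOption {
	return func(op *Op) {
		op.hwSlowdownEventBucket = bucket
	}
}

func main() {
	op := &Op{}
	op.applyOpts([]OpOption{WithHWSlowdownEventBucket(memBucket{name: "hw-slowdown"})})
	fmt.Println(op.hwSlowdownEventBucket.Name())
}
```

With this pattern, new options can be added later without changing the signature of the function that accepts them.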
Directories ¶
| Path | Synopsis |
|---|---|
| | Package device provides a wrapper around the "github.com/NVIDIA/go-nvlib/pkg/nvlib/device".Device type that adds a PCIBusID and UUID method, with support for test failure injection. |
| | Package lib implements the NVIDIA Management Library (NVML) interface. |