Documentation
Index
- Constants
- Variables
- func CountAllDevicesFromDevDir() (int, error)
- func CountSMINVSwitches(ctx context.Context) ([]string, error)
- func IsErrDeviceHandleUnknownError(err error) bool
- func ListPCIGPUs(ctx context.Context) ([]string, error)
- func ListPCINVSwitches(ctx context.Context) ([]string, error)
- type MemoryErrorManagementCapabilities
  - func SupportedMemoryMgmtCapsByGPUProduct(gpuProductName string) MemoryErrorManagementCapabilities
Constants
const DeviceVendorID = "10de"
DeviceVendorID defines the PCI vendor ID of NVIDIA devices, e.g., lspci -nn | grep -i "10de.*". ref. https://devicehunt.com/view/type/pci/vendor/10DE
Variables
var (
	DefaultNVIDIALibraries = map[string][]string{
		"libnvidia-ml.so": {
			"libnvidia-ml.so.1",
		},
		"libcuda.so": {
			"libcuda.so.1",
		},
	}

	DefaultNVIDIALibrariesSearchDirs = []string{
		"/",
		"/usr/lib64",
		"/usr/lib/x86_64-linux-gnu",
		"/usr/lib/aarch64-linux-gnu",
		"/usr/lib/x86_64-linux-gnu/nvidia/current",
		"/usr/lib/aarch64-linux-gnu/nvidia/current",
		"/lib64",
		"/lib/x86_64-linux-gnu",
		"/lib/aarch64-linux-gnu",
		"/lib/x86_64-linux-gnu/nvidia/current",
		"/lib/aarch64-linux-gnu/nvidia/current",
	}
)
Functions
func CountSMINVSwitches added in v0.5.0
func CountSMINVSwitches(ctx context.Context) ([]string, error)
func IsErrDeviceHandleUnknownError
func IsErrDeviceHandleUnknownError(err error) bool
When "NVIDIA Xid 79: GPU has fallen off the bus" occurs, the device-handle call may fail with "error getting device handle for index '6': Unknown Error" or "Unable to determine the device handle for GPU0000:CB:00.0: Unknown Error".
func ListPCIGPUs added in v0.5.0
func ListPCIGPUs(ctx context.Context) ([]string, error)
ListPCIGPUs returns all "lspci" lines that represent NVIDIA GPU devices.
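Splitting NVIDIA `lspci` lines into GPUs and NVSwitches can be done by PCI device class: GPUs typically appear as "VGA compatible controller" or "3D controller", while NVSwitches appear as "Bridge" devices. A sketch under those assumptions (the helper `classifyPCILine` and the class strings are illustrative, not necessarily what `ListPCIGPUs`/`ListPCINVSwitches` match on):

```go
package main

import (
	"fmt"
	"strings"
)

// classifyPCILine sketches how NVIDIA "lspci" lines might be split
// into GPUs and NVSwitches based on the PCI device class in the line.
// The class strings are assumptions based on typical lspci output.
func classifyPCILine(line string) string {
	lower := strings.ToLower(line)
	if !strings.Contains(lower, "nvidia") {
		return "other"
	}
	switch {
	case strings.Contains(lower, "3d controller"),
		strings.Contains(lower, "vga compatible controller"):
		return "gpu"
	case strings.Contains(lower, "bridge"):
		return "nvswitch"
	default:
		return "other"
	}
}

func main() {
	fmt.Println(classifyPCILine("01:00.0 3D controller: NVIDIA Corporation GA100"))   // prints "gpu"
	fmt.Println(classifyPCILine("c1:00.0 Bridge: NVIDIA Corporation Device 1af1"))    // prints "nvswitch"
}
```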
Types
type MemoryErrorManagementCapabilities
type MemoryErrorManagementCapabilities struct {
// (If supported) GPU can limit the impact of uncorrectable ECC errors to GPU applications.
// Existing/new workloads will run unaffected, both in terms of accuracy and performance.
// Thus, does not require a GPU reset when memory errors occur.
//
// Note that there are some rarer cases where uncorrectable errors are still uncontained,
// thus impacting all other workloads being processed on the GPU.
//
// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#error-containments
ErrorContainment bool `json:"error_containment"`
// (If supported) GPU can dynamically mark the page containing uncorrectable errors
// as unusable, so that neither existing nor new workloads will be allocated this page.
//
// Thus, does not require a GPU reset to recover from most uncorrectable ECC errors.
//
// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#dynamic-page-offlining
DynamicPageOfflining bool `json:"dynamic_page_offlining"`
// (If supported) GPU can replace degrading memory cells with spare ones
// to avoid offlining regions of memory. Row remapping differs from
// dynamic page offlining in that the remapping is fixed at the hardware level.
//
// The row remapping requires a GPU reset to take effect.
//
// Even for "NVIDIA GeForce RTX 4090", NVML returns no error,
// so "NVML.Supported" is not a reliable way to check whether row remapping is supported;
// we track a separate boolean value based on the GPU product name instead.
//
// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping
RowRemapping bool `json:"row_remapping"`
// Message contains the message to the user about the memory error management capabilities.
Message string `json:"message,omitempty"`
}
Contains information about the GPU's memory error management capabilities. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus
func SupportedMemoryMgmtCapsByGPUProduct
func SupportedMemoryMgmtCapsByGPUProduct(gpuProductName string) MemoryErrorManagementCapabilities
SupportedMemoryMgmtCapsByGPUProduct returns the GPU memory error management capabilities based on the GPU product name. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus
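As the struct comments note, support is determined by product name rather than by NVML. A minimal sketch of a product-name lookup (the helper `supportedCapsByProduct` and its substring table are illustrative assumptions; per the linked NVIDIA docs, A100/H100-class GPUs support all three mechanisms while consumer GPUs such as the RTX 4090 do not, but the package's actual table may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// MemoryErrorManagementCapabilities mirrors the exported struct.
type MemoryErrorManagementCapabilities struct {
	ErrorContainment     bool   `json:"error_containment"`
	DynamicPageOfflining bool   `json:"dynamic_page_offlining"`
	RowRemapping         bool   `json:"row_remapping"`
	Message              string `json:"message,omitempty"`
}

// supportedCapsByProduct sketches a product-name based lookup.
// The substring list is an illustrative assumption, not the
// package's actual table.
func supportedCapsByProduct(gpuProductName string) MemoryErrorManagementCapabilities {
	name := strings.ToLower(gpuProductName)
	for _, s := range []string{"a100", "h100"} {
		if strings.Contains(name, s) {
			return MemoryErrorManagementCapabilities{
				ErrorContainment:     true,
				DynamicPageOfflining: true,
				RowRemapping:         true,
			}
		}
	}
	return MemoryErrorManagementCapabilities{
		Message: "GPU product name does not indicate memory error management support",
	}
}

func main() {
	fmt.Printf("%+v\n", supportedCapsByProduct("NVIDIA A100-SXM4-80GB"))
	fmt.Printf("%+v\n", supportedCapsByProduct("NVIDIA GeForce RTX 4090"))
}
```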
Source Files
Directories
| Path | Synopsis |
|---|---|
| nccl | Package nccl contains the implementation of the NCCL (NVIDIA Collective Communications Library) query for NVIDIA GPUs. |
| nvml | Package nvml implements the NVIDIA Management Library (NVML) interface. |
| nvml/lib | Package lib implements the NVIDIA Management Library (NVML) interface. |
| sxid | Package sxid provides the NVIDIA SXID error details. |
| xid | Package xid provides the NVIDIA XID error details. |