Documentation ¶
Index ¶
- Constants
- func CountAllDevicesFromDevDir() (int, error)
- func CountSMINVSwitches(ctx context.Context) ([]string, error)
- func IsErrDeviceHandleUnknownError(err error) bool
- func ListPCIGPUs(ctx context.Context) ([]string, error)
- func ListPCINVSwitches(ctx context.Context) ([]string, error)
- type MemoryErrorManagementCapabilities
Constants ¶
const DeviceVendorID = "10de"
DeviceVendorID defines the PCI vendor ID of NVIDIA devices (e.g., lspci -nn | grep -i "10de.*"). ref. https://devicehunt.com/view/type/pci/vendor/10DE
Variables ¶
This section is empty.
Functions ¶
func CountSMINVSwitches ¶ added in v0.5.0
func IsErrDeviceHandleUnknownError ¶
When "NVIDIA Xid 79: GPU has fallen off the bus" occurs, the device handle lookup may fail with "error getting device handle for index '6': Unknown Error"
or "Unable to determine the device handle for GPU0000:CB:00.0: Unknown Error".
func ListPCIGPUs ¶ added in v0.5.0
ListPCIGPUs returns all "lspci" lines that represent NVIDIA GPU devices.
Types ¶
type MemoryErrorManagementCapabilities ¶
type MemoryErrorManagementCapabilities struct {
    // (If supported) GPU can limit the impact of uncorrectable ECC errors to GPU applications.
    // Existing/new workloads will run unaffected, both in terms of accuracy and performance.
    // Thus, it does not require a GPU reset when memory errors occur.
    //
    // Note that there are some rarer cases where uncorrectable errors are still uncontained,
    // thus impacting all other workloads being processed in the GPU.
    //
    // ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#error-containments
    ErrorContainment bool `json:"error_containment"`

    // (If supported) GPU can dynamically mark the page containing uncorrectable errors
    // as unusable, so that existing and new workloads will not allocate this page.
    //
    // Thus, it does not require a GPU reset to recover from most uncorrectable ECC errors.
    //
    // ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#dynamic-page-offlining
    DynamicPageOfflining bool `json:"dynamic_page_offlining"`

    // (If supported) GPU can replace degrading memory cells with spare ones
    // to avoid offlining regions of memory. Row remapping differs from
    // dynamic page offlining in that it is fixed at the hardware level.
    //
    // The row remapping requires a GPU reset to take effect.
    //
    // Even for "NVIDIA GeForce RTX 4090", NVML returns no error,
    // so "NVML.Supported" is not a reliable way to check whether row remapping is supported.
    // Instead, a separate boolean value is tracked based on the GPU product name.
    //
    // ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping
    RowRemapping bool `json:"row_remapping"`

    // Message contains the message to the user about the memory error management capabilities.
    Message string `json:"message,omitempty"`
}
MemoryErrorManagementCapabilities contains information about the GPU's memory error management capabilities. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus
func SupportedMemoryMgmtCapsByGPUProduct ¶
func SupportedMemoryMgmtCapsByGPUProduct(gpuProductName string) MemoryErrorManagementCapabilities
SupportedMemoryMgmtCapsByGPUProduct returns the GPU memory error management capabilities based on the GPU product name. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus
Source Files ¶
Directories ¶
| Path | Synopsis |
|---|---|
| class | Package class implements the infiniband class sysfs interface. |
| store | Package store stores infiniband states in time-series. |
| | Package nvml implements the NVIDIA Management Library (NVML) interface. |
| device | Package device provides a wrapper around the "github.com/NVIDIA/go-nvlib/pkg/nvlib/device".Device type that adds a PCIBusID method. |
| lib | Package lib implements the NVIDIA Management Library (NVML) interface. |