query

package
v0.0.1-test
Warning

This package is not in the latest version of its module.

Published: Sep 22, 2025 License: Apache-2.0 Imports: 10 Imported by: 0

Documentation


Constants

const DeviceVendorID = "10de"

DeviceVendorID defines the PCI vendor ID of NVIDIA devices. For example, lspci -nn | grep -i "10de.*" lists them. ref. https://devicehunt.com/view/type/pci/vendor/10DE
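
The grep above can be approximated in Go. This is a minimal sketch, not the package's implementation: filterByVendor is a hypothetical helper, and the sample lines are made up to resemble "lspci -nn" output.

```go
package main

import (
	"fmt"
	"strings"
)

// deviceVendorID mirrors the package constant DeviceVendorID.
const deviceVendorID = "10de"

// filterByVendor is a hypothetical helper (not part of the package).
// It keeps the lines whose "[vendor:device]" ID pair starts with vendorID,
// comparing case-insensitively like "grep -i".
func filterByVendor(lines []string, vendorID string) []string {
	needle := "[" + strings.ToLower(vendorID) + ":"
	var matched []string
	for _, line := range lines {
		if strings.Contains(strings.ToLower(line), needle) {
			matched = append(matched, line)
		}
	}
	return matched
}

func main() {
	// Made-up sample lines in the shape "lspci -nn" prints.
	lines := []string{
		"00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:7a06]",
		"01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684] (rev a1)",
		"01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22ba] (rev a1)",
	}
	for _, line := range filterByVendor(lines, deviceVendorID) {
		fmt.Println(line)
	}
}
```

Matching on "[10de:" rather than a bare "10de" avoids false positives when those characters appear in a device ID of another vendor.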

Variables

This section is empty.

Functions

func CountAllDevicesFromDevDir

func CountAllDevicesFromDevDir() (int, error)

func CountSMINVSwitches added in v0.5.0

func CountSMINVSwitches(ctx context.Context) ([]string, error)

func IsErrDeviceHandleUnknownError

func IsErrDeviceHandleUnknownError(err error) bool

When a GPU hits "NVIDIA Xid 79: GPU has fallen off the bus", getting its device handle may fail with "error getting device handle for index '6': Unknown Error" or "Unable to determine the device handle for GPU0000:CB:00.0: Unknown Error".

func ListPCIGPUs added in v0.5.0

func ListPCIGPUs(ctx context.Context) ([]string, error)

ListPCIGPUs returns all "lspci" lines that represent NVIDIA GPU devices.

func ListPCINVSwitches added in v0.5.0

func ListPCINVSwitches(ctx context.Context) ([]string, error)

ListPCINVSwitches returns all "lspci" lines that represent NVIDIA NVSwitch devices.

Types

type MemoryErrorManagementCapabilities

type MemoryErrorManagementCapabilities struct {
	// (If supported) GPU can limit the impact of uncorrectable ECC errors to GPU applications.
	// Existing/new workloads will run unaffected, both in terms of accuracy and performance.
	// Thus, does not require a GPU reset when memory errors occur.
	//
	// Note that in some rarer cases uncorrectable errors are still uncontained,
	// thus impacting all other workloads being processed in the GPU.
	//
	// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#error-containments
	ErrorContainment bool `json:"error_containment"`

	// (If supported) GPU can dynamically mark the page containing uncorrectable errors
	// as unusable, and any existing or new workloads will not be allocating this page.
	//
	// Thus, does not require a GPU reset to recover from most uncorrectable ECC errors.
	//
	// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#dynamic-page-offlining
	DynamicPageOfflining bool `json:"dynamic_page_offlining"`

	// (If supported) GPU can replace degrading memory cells with spare ones
	// to avoid offlining regions of memory. Unlike dynamic page offlining,
	// row remapping is fixed at a hardware level.
	//
	// The row remapping requires a GPU reset to take effect.
	//
	// Even for "NVIDIA GeForce RTX 4090", NVML returns no error, so
	// "NVML.Supported" is not a reliable way to check if row remapping is supported.
	// We therefore track a separate boolean value based on the GPU product name.
	//
	// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping
	RowRemapping bool `json:"row_remapping"`

	// Message contains the message to the user about the memory error management capabilities.
	Message string `json:"message,omitempty"`
}

MemoryErrorManagementCapabilities contains information about the GPU's memory error management capabilities. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus
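
The struct tags determine the JSON shape. The sketch below copies the struct locally (comments omitted) so that it compiles on its own. encodeCaps is a hypothetical helper added for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MemoryErrorManagementCapabilities is a local copy of the type shown above,
// reproduced so that this example is self-contained.
type MemoryErrorManagementCapabilities struct {
	ErrorContainment     bool   `json:"error_containment"`
	DynamicPageOfflining bool   `json:"dynamic_page_offlining"`
	RowRemapping         bool   `json:"row_remapping"`
	Message              string `json:"message,omitempty"`
}

// encodeCaps is a hypothetical helper that returns the JSON form of caps.
func encodeCaps(caps MemoryErrorManagementCapabilities) (string, error) {
	b, err := json.Marshal(caps)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := encodeCaps(MemoryErrorManagementCapabilities{
		ErrorContainment:     true,
		DynamicPageOfflining: true,
		RowRemapping:         true,
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Message is empty, so "omitempty" drops it from the output.
	fmt.Println(s)
	// prints {"error_containment":true,"dynamic_page_offlining":true,"row_remapping":true}
}
```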

func SupportedMemoryMgmtCapsByGPUProduct

func SupportedMemoryMgmtCapsByGPUProduct(gpuProductName string) MemoryErrorManagementCapabilities

SupportedMemoryMgmtCapsByGPUProduct returns the GPU memory error management capabilities based on the GPU product name. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus
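
A product-name lookup of this kind might be structured as below. This is only a sketch: supportedCaps, the matching rule, and the single table entry are assumptions made for illustration, and the real mapping should be taken from the NVIDIA table referenced above.

```go
package main

import (
	"fmt"
	"strings"
)

// MemoryErrorManagementCapabilities is a local copy of the type shown above.
type MemoryErrorManagementCapabilities struct {
	ErrorContainment     bool   `json:"error_containment"`
	DynamicPageOfflining bool   `json:"dynamic_page_offlining"`
	RowRemapping         bool   `json:"row_remapping"`
	Message              string `json:"message,omitempty"`
}

// capsEntry pairs a lowercase product-name fragment with capabilities.
type capsEntry struct {
	fragment string
	caps     MemoryErrorManagementCapabilities
}

// knownProducts is an illustrative placeholder table, NOT the package's data.
// A slice (rather than a map) keeps the match order deterministic.
var knownProducts = []capsEntry{
	{"a100", MemoryErrorManagementCapabilities{
		ErrorContainment: true, DynamicPageOfflining: true, RowRemapping: true,
	}},
}

// supportedCaps is a hypothetical stand-in for
// SupportedMemoryMgmtCapsByGPUProduct. Unknown products get the zero value
// plus an explanatory message.
func supportedCaps(gpuProductName string) MemoryErrorManagementCapabilities {
	name := strings.ToLower(gpuProductName)
	for _, e := range knownProducts {
		if strings.Contains(name, e.fragment) {
			return e.caps
		}
	}
	return MemoryErrorManagementCapabilities{
		Message: fmt.Sprintf("%q is not in the capability table", gpuProductName),
	}
}

func main() {
	for _, name := range []string{"NVIDIA A100-SXM4-80GB", "NVIDIA GeForce RTX 4090"} {
		fmt.Printf("%s: %+v\n", name, supportedCaps(name))
	}
}
```

Keying on the product name sidesteps the problem noted in the RowRemapping comment, where an NVML query succeeds even on GPUs that do not support the feature.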

Directories

Path Synopsis
class
Package class implements the infiniband class sysfs interface.
store
Package store stores infiniband states in time-series.
Package nvml implements the NVIDIA Management Library (NVML) interface.
device
Package device provides a wrapper around the "github.com/NVIDIA/go-nvlib/pkg/nvlib/device".Device type that adds a PCIBusID method.
lib
Package lib implements the NVIDIA Management Library (NVML) interface.
