query

package
v0.6.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Aug 5, 2025 License: Apache-2.0 Imports: 10 Imported by: 0

Documentation

Index

Constants

View Source
const DeviceVendorID = "10de"

DeviceVendorID defines the vendor ID of NVIDIA devices. e.g., lspci -nn | grep -i "10de.*" ref. https://devicehunt.com/view/type/pci/vendor/10DE

Variables

View Source
var (
	DefaultNVIDIALibraries = map[string][]string{

		"libnvidia-ml.so": {

			"libnvidia-ml.so.1",
		},

		"libcuda.so": {
			"libcuda.so.1",
		},
	}

	DefaultNVIDIALibrariesSearchDirs = []string{

		"/",
		"/usr/lib64",
		"/usr/lib/x86_64-linux-gnu",
		"/usr/lib/aarch64-linux-gnu",
		"/usr/lib/x86_64-linux-gnu/nvidia/current",
		"/usr/lib/aarch64-linux-gnu/nvidia/current",
		"/lib64",
		"/lib/x86_64-linux-gnu",
		"/lib/aarch64-linux-gnu",
		"/lib/x86_64-linux-gnu/nvidia/current",
		"/lib/aarch64-linux-gnu/nvidia/current",
	}
)

Functions

func CountAllDevicesFromDevDir

func CountAllDevicesFromDevDir() (int, error)

func CountSMINVSwitches added in v0.5.0

func CountSMINVSwitches(ctx context.Context) ([]string, error)

func IsErrDeviceHandleUnknownError

func IsErrDeviceHandleUnknownError(err error) bool

"NVIDIA Xid 79: GPU has fallen off the bus" may fail this syscall with: "error getting device handle for index '6': Unknown Error"

or "Unable to determine the device handle for GPU0000:CB:00.0: Unknown Error"

func ListPCIGPUs added in v0.5.0

func ListPCIGPUs(ctx context.Context) ([]string, error)

ListPCIGPUs returns all "lspci" lines that represents NVIDIA GPU devices.

func ListPCINVSwitches added in v0.5.0

func ListPCINVSwitches(ctx context.Context) ([]string, error)

ListPCINVSwitches returns all "lspci" lines that represents NVIDIA NVSwitch devices.

Types

type MemoryErrorManagementCapabilities

type MemoryErrorManagementCapabilities struct {
	// (If supported) GPU can limit the impact of uncorrectable ECC errors to GPU applications.
	// Existing/new workloads will run unaffected, both in terms of accuracy and performance.
	// Thus, does not require a GPU reset when memory errors occur.
	//
	// Note thtat there are some rarer cases, where uncorrectable errors are still uncontained
	// thus impacting all other workloads being procssed in the GPU.
	//
	// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#error-containments
	ErrorContainment bool `json:"error_containment"`

	// (If supported) GPU can dynamically mark the page containing uncorrectable errors
	// as unusable, and any existing or new workloads will not be allocating this page.
	//
	// Thus, does not require a GPU reset to recover from most uncorrectable ECC errors.
	//
	// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#dynamic-page-offlining
	DynamicPageOfflining bool `json:"dynamic_page_offlining"`

	// (If supported) GPU can replace degrading memory cells with spare ones
	// to avoid offlining regions of memory. And the row remapping is different
	// from dynamic page offlining which is fixed at a hardware level.
	//
	// The row remapping requires a GPU reset to take effect.
	//
	// even for "NVIDIA GeForce RTX 4090", nvml returns no error
	// thus "NVML.Supported" is not a reliable way to check if row remapping is supported
	// thus we track a separate boolean value based on the GPU product name
	//
	// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping
	RowRemapping bool `json:"row_remapping"`

	// Message contains the message to the user about the memory error management capabilities.
	Message string `json:"message,omitempty"`
}

Contains information about the GPU's memory error management capabilities. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus

func SupportedMemoryMgmtCapsByGPUProduct

func SupportedMemoryMgmtCapsByGPUProduct(gpuProductName string) MemoryErrorManagementCapabilities

SupportedMemoryMgmtCapsByGPUProduct returns the GPU memory error management capabilities based on the GPU product name. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus

Directories

Path Synopsis
class
Package class implements the infiniband class sysfs interface.
Package class implements the infiniband class sysfs interface.
store
Package store stores infiniband states in time-series.
Package store stores infiniband states in time-series.
Package nccl contains the implementation of the NCCL (NVIDIA Collective Communications Library) query for NVIDIA GPUs.
Package nccl contains the implementation of the NCCL (NVIDIA Collective Communications Library) query for NVIDIA GPUs.
Package nvml implements the NVIDIA Management Library (NVML) interface.
Package nvml implements the NVIDIA Management Library (NVML) interface.
device
Package device provides a wrapper around the "github.com/NVIDIA/go-nvlib/pkg/nvlib/device".Device type that adds a PCIBusID method.
Package device provides a wrapper around the "github.com/NVIDIA/go-nvlib/pkg/nvlib/device".Device type that adds a PCIBusID method.
lib
Package lib implements the NVIDIA Management Library (NVML) interface.
Package lib implements the NVIDIA Management Library (NVML) interface.
Package sxid provides the NVIDIA SXID error details.
Package sxid provides the NVIDIA SXID error details.
Package xid provides the NVIDIA XID error details.
Package xid provides the NVIDIA XID error details.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL