Documentation ¶
Overview ¶
Package query implements various NVIDIA-related system queries. All interactions with NVIDIA data sources are implemented under the query packages.
Index ¶
- Constants
- Variables
- func CountAllDevicesFromDevDir() (int, error)
- func CreateGet(opts ...OpOption) query.GetFunc
- func GPUsInstalled(ctx context.Context) (bool, error)
- func Get(ctx context.Context, opts ...OpOption) (output any, err error)
- func GetDefaultPoller() query.Poller
- func GetSuccessOnce() <-chan any
- func IsErrDeviceHandleUnknownError(err error) bool
- func ListNVIDIAPCIs(ctx context.Context) ([]string, error)
- func SetDefaultPoller(opts ...OpOption)
- type MemoryErrorManagementCapabilities
- type Op
- type OpOption
- type Output
Constants ¶
const (
	StateKeyGPUProductName      = "gpu_product_name"
	StateKeyFabricManagerExists = "fabric_manager_exists"
	StateKeyIbstatExists        = "ibstat_exists"
)
Variables ¶
var (
	DefaultNVIDIALibraries = map[string][]string{
		"libnvidia-ml.so": {
			"libnvidia-ml.so.1",
		},
		"libcuda.so": {
			"libcuda.so.1",
		},
	}
	DefaultNVIDIALibrariesSearchDirs = []string{
		"/",
		"/usr/lib64",
		"/usr/lib/x86_64-linux-gnu",
		"/usr/lib/aarch64-linux-gnu",
		"/usr/lib/x86_64-linux-gnu/nvidia/current",
		"/usr/lib/aarch64-linux-gnu/nvidia/current",
		"/lib64",
		"/lib/x86_64-linux-gnu",
		"/lib/aarch64-linux-gnu",
		"/lib/x86_64-linux-gnu/nvidia/current",
		"/lib/aarch64-linux-gnu/nvidia/current",
	}
)
var ErrDefaultPollerNotSet = errors.New("default nvidia poller is not set")
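The two variables above pair each library name with alternative file names and list the directories to search. How the package actually walks these directories is not shown on this page, so the following is only a minimal sketch, assuming a flat (non-recursive) lookup. The helper findLibrary is hypothetical and not part of the package API, and the maps in main are trimmed copies of the documented defaults.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// findLibrary is a hypothetical helper, not part of the package API.
// It returns the first path under searchDirs where name, or one of its
// alternative file names, exists as a regular file (symlinks are followed).
func findLibrary(name string, alternatives []string, searchDirs []string) (string, bool) {
	candidates := append([]string{name}, alternatives...)
	for _, dir := range searchDirs {
		for _, c := range candidates {
			p := filepath.Join(dir, c)
			if info, err := os.Stat(p); err == nil && info.Mode().IsRegular() {
				return p, true
			}
		}
	}
	return "", false
}

func main() {
	// Trimmed copies of the documented defaults, for illustration only.
	libs := map[string][]string{
		"libnvidia-ml.so": {"libnvidia-ml.so.1"},
		"libcuda.so":      {"libcuda.so.1"},
	}
	dirs := []string{
		"/usr/lib64",
		"/usr/lib/x86_64-linux-gnu",
		"/usr/lib/aarch64-linux-gnu",
	}
	for name, alts := range libs {
		if p, ok := findLibrary(name, alts, dirs); ok {
			fmt.Printf("%s found at %s\n", name, p)
		} else {
			fmt.Printf("%s not found\n", name)
		}
	}
}
```

The documented search list starts with "/", which may mean the package searches recursively; this sketch deliberately does not.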
Functions ¶
func GPUsInstalled ¶
func GPUsInstalled(ctx context.Context) (bool, error)
Returns true if the local machine has NVIDIA GPUs installed.
func GetDefaultPoller ¶
func GetDefaultPoller() query.Poller
func GetSuccessOnce ¶
func GetSuccessOnce() <-chan any
func IsErrDeviceHandleUnknownError ¶
func IsErrDeviceHandleUnknownError(err error) bool
When a GPU hits "NVIDIA Xid 79: GPU has fallen off the bus", this call may fail with "error getting device handle for index '6': Unknown Error"
or "Unable to determine the device handle for GPU0000:CB:00.0: Unknown Error".
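The two error strings quoted above share the phrases "device handle" and "Unknown Error". The package's real matching logic is not shown here, so the sketch below only illustrates one way such a check could work, assuming plain substring matching. Both function names are hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// matchesDeviceHandleUnknown is a hypothetical matcher: it reports whether
// a message mentions both "device handle" and "unknown error",
// ignoring letter case.
func matchesDeviceHandleUnknown(msg string) bool {
	m := strings.ToLower(msg)
	return strings.Contains(m, "device handle") && strings.Contains(m, "unknown error")
}

// isDeviceHandleUnknownError is a hypothetical stand-in for
// IsErrDeviceHandleUnknownError. A nil error never matches.
func isDeviceHandleUnknownError(err error) bool {
	if err == nil {
		return false
	}
	return matchesDeviceHandleUnknown(err.Error())
}

func main() {
	samples := []error{
		errors.New("error getting device handle for index '6': Unknown Error"),
		errors.New("Unable to determine the device handle for GPU0000:CB:00.0: Unknown Error"),
		errors.New("context deadline exceeded"),
	}
	for _, err := range samples {
		fmt.Printf("%t: %v\n", isDeviceHandleUnknownError(err), err)
	}
}
```

A caller could use such a check to treat these failures as a sign of a lost GPU rather than a transient query error.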
func ListNVIDIAPCIs ¶
func ListNVIDIAPCIs(ctx context.Context) ([]string, error)
Lists all PCI devices that are compatible with NVIDIA.
func SetDefaultPoller ¶
func SetDefaultPoller(opts ...OpOption)
The default poller is set only once, since it relies on the kube client and a specific port.
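A set-once default is commonly built on sync.Once. Whether the package does so is not stated here, so the following is a sketch under that assumption. The poller type, its port field, and both lowercase functions are hypothetical stand-ins for query.Poller, SetDefaultPoller, and GetDefaultPoller.

```go
package main

import (
	"fmt"
	"sync"
)

// poller is a hypothetical stand-in for query.Poller.
type poller struct {
	port int
}

var (
	defaultPoller     *poller
	defaultPollerOnce sync.Once
)

// setDefaultPoller stores the default poller on the first call only.
// Later calls are no-ops, so the first configuration wins.
func setDefaultPoller(port int) {
	defaultPollerOnce.Do(func() {
		defaultPoller = &poller{port: port}
	})
}

// getDefaultPoller returns the default poller, or nil if it was never set.
func getDefaultPoller() *poller {
	return defaultPoller
}

func main() {
	setDefaultPoller(8080)
	setDefaultPoller(9090) // ignored: the default is already set
	fmt.Println(getDefaultPoller().port)
}
```

With this design, callers that need a different configuration must arrange to call the setter first, because later calls are silently dropped.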
Types ¶
type MemoryErrorManagementCapabilities ¶
type MemoryErrorManagementCapabilities struct {
// (If supported) GPU can limit the impact of uncorrectable ECC errors to GPU applications.
// Existing/new workloads will run unaffected, both in terms of accuracy and performance.
// Thus, does not require a GPU reset when memory errors occur.
//
// Note that there are some rarer cases where uncorrectable errors are still uncontained,
// thus impacting all other workloads being processed on the GPU.
//
// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#error-containments
ErrorContainment bool `json:"error_containment"`
// (If supported) GPU can dynamically mark the page containing uncorrectable errors
// as unusable, and any existing or new workloads will not be allocating this page.
//
// Thus, does not require a GPU reset to recover from most uncorrectable ECC errors.
//
// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#dynamic-page-offlining
DynamicPageOfflining bool `json:"dynamic_page_offlining"`
// (If supported) GPU can replace degrading memory cells with spare ones
// to avoid offlining regions of memory. Row remapping differs from
// dynamic page offlining in that it is fixed at a hardware level.
//
// The row remapping requires a GPU reset to take effect.
//
// Even for "NVIDIA GeForce RTX 4090", NVML returns no error,
// so "NVML.Supported" is not a reliable way to check whether row remapping is supported.
// We therefore track a separate boolean value based on the GPU product name.
//
// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping
RowRemapping bool `json:"row_remapping"`
// Message contains the message to the user about the memory error management capabilities.
Message string `json:"message,omitempty"`
}
Contains information about the GPU's memory error management capabilities. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus
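To make the JSON shape of this type concrete, here is a small sketch that copies the struct definition above (comments omitted) and encodes it with encoding/json. The encode helper and the field values are illustrative only and do not describe any particular GPU.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MemoryErrorManagementCapabilities is a local copy of the documented struct.
type MemoryErrorManagementCapabilities struct {
	ErrorContainment     bool   `json:"error_containment"`
	DynamicPageOfflining bool   `json:"dynamic_page_offlining"`
	RowRemapping         bool   `json:"row_remapping"`
	Message              string `json:"message,omitempty"`
}

// encode is a hypothetical helper that returns the JSON form of c.
func encode(c MemoryErrorManagementCapabilities) string {
	b, err := json.Marshal(c)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Illustrative values only.
	caps := MemoryErrorManagementCapabilities{
		ErrorContainment:     true,
		DynamicPageOfflining: true,
	}
	fmt.Println(encode(caps))
}
```

The three boolean fields are always emitted, even when false, while an empty Message is dropped because of its omitempty tag.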
func SupportedMemoryMgmtCapsByGPUProduct ¶
func SupportedMemoryMgmtCapsByGPUProduct(gpuProductName string) MemoryErrorManagementCapabilities
SupportedMemoryMgmtCapsByGPUProduct returns the GPU memory error management capabilities based on the GPU product name. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus
type OpOption ¶
type OpOption func(*Op)
func WithHWSlowdownEventBucket ¶ added in v0.4.5
func WithHWSlowdownEventBucket(bucket eventstore.Bucket) OpOption
func WithIbstatCommand ¶
Specifies the ibstat binary path to overwrite the default path.
func WithXidEventBucket ¶ added in v0.4.5
func WithXidEventBucket(bucket eventstore.Bucket) OpOption
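OpOption follows the functional options pattern: each option is a function that mutates an Op. The fields of Op are not listed on this page, so the sketch below assumes a single hypothetical field, a hypothetical default value, and a hypothetical lowercase option constructor, purely to show how such options compose.

```go
package main

import "fmt"

// Op is a stand-in for the package's Op type. The ibstatCommand field
// is an assumption made for this sketch.
type Op struct {
	ibstatCommand string
}

// OpOption matches the documented type: a function that mutates an Op.
type OpOption func(*Op)

// withIbstatCommand is a hypothetical option that overrides the ibstat path.
func withIbstatCommand(path string) OpOption {
	return func(op *Op) {
		op.ibstatCommand = path
	}
}

// applyOpts fills in an assumed default, then applies each option in order,
// so later options override earlier ones.
func applyOpts(opts ...OpOption) *Op {
	op := &Op{ibstatCommand: "ibstat"} // assumed default, not from the package
	for _, opt := range opts {
		opt(op)
	}
	return op
}

func main() {
	fmt.Println(applyOpts().ibstatCommand)
	fmt.Println(applyOpts(withIbstatCommand("/usr/sbin/ibstat")).ibstatCommand)
}
```

This shape lets variadic functions such as Get and SetDefaultPoller accept any number of options without changing their signatures when new options are added.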
type Output ¶
type Output struct {
// Time is the time when the query is executed.
Time time.Time `json:"time"`
// GPU device count from the /dev directory.
GPUDeviceCount int `json:"gpu_device_count"`
LsmodPeermem *peermem.LsmodPeermemModuleOutput `json:"lsmod_peermem,omitempty"`
LsmodPeermemErrors []string `json:"lsmod_peermem_errors,omitempty"`
NVML *nvml.Output `json:"nvml,omitempty"`
NVMLErrors []string `json:"nvml_errors,omitempty"`
MemoryErrorManagementCapabilities MemoryErrorManagementCapabilities `json:"memory_error_management_capabilities,omitempty"`
}
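The GPUDeviceCount field is described as the GPU device count from the /dev directory. The exact matching rule the package uses is not shown here. As a minimal sketch, the code below assumes the usual Linux driver naming of /dev/nvidia0, /dev/nvidia1, and so on, and excludes control nodes such as nvidiactl. Both helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// nvidiaDevRE matches per-GPU device node names such as "nvidia0".
var nvidiaDevRE = regexp.MustCompile(`^nvidia[0-9]+$`)

// isGPUDeviceName reports whether name looks like a per-GPU device node.
func isGPUDeviceName(name string) bool {
	return nvidiaDevRE.MatchString(name)
}

// countDevices is a hypothetical helper that counts the entries in dir
// whose names look like per-GPU device nodes.
func countDevices(dir string) (int, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	n := 0
	for _, e := range entries {
		if isGPUDeviceName(e.Name()) {
			n++
		}
	}
	return n, nil
}

func main() {
	n, err := countDevices("/dev")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("GPU device nodes:", n)
}
```

Comparing a count like this against the count reported by NVML is one way to notice a GPU that is present as a device node but no longer reachable through the driver.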
func (*Output) GPUCountFromNVML ¶
func (*Output) GPUProductName ¶
func (*Output) GPUProductNameFromNVML ¶
This is the same product name that appears in nvidia-smi output. ref. https://developer.nvidia.com/management-library-nvml
Directories ¶
| Path | Synopsis |
|---|---|
| metrics | |
| clock | Package clock provides the NVIDIA clock metrics collection and reporting. |
| clock-speed | Package clockspeed provides the NVIDIA clock speed metrics collection and reporting. |
| ecc | Package ecc provides the NVIDIA ECC metrics collection and reporting. |
| gpm | Package gpm provides the NVIDIA GPM metrics collection and reporting. |
| memory | Package memory provides the NVIDIA memory metrics collection and reporting. |
| nvlink | Package nvlink provides the NVIDIA nvlink metrics collection and reporting. |
| power | Package power provides the NVIDIA power usage metrics collection and reporting. |
| processes | Package processes provides the NVIDIA processes metrics collection and reporting. |
| temperature | Package temperature provides the NVIDIA temperature metrics collection and reporting. |
| utilization | Package utilization provides the NVIDIA GPU utilization metrics collection and reporting. |
| | Package nccl contains the implementation of the NCCL (NVIDIA Collective Communications Library) query for NVIDIA GPUs. |
| | Package nvml implements the NVIDIA Management Library (NVML) interface. |
| lib | Package lib implements the NVIDIA Management Library (NVML) interface. |
| | Package peermem contains the implementation of the peermem query for NVIDIA GPUs. |
| | Package sxid provides the NVIDIA SXID error details. |
| | Package xid provides the NVIDIA XID error details. |