query

package
v0.4.9
Warning

This package is not in the latest version of its module.

Published: Apr 21, 2025 License: Apache-2.0 Imports: 29 Imported by: 0

Documentation

Overview

Package query implements various NVIDIA-related system queries. All interactions with NVIDIA data sources are implemented under the query packages.

Constants

const (
	StateKeyGPUProductName      = "gpu_product_name"
	StateKeyFabricManagerExists = "fabric_manager_exists"
	StateKeyIbstatExists        = "ibstat_exists"
)

Variables

var (
	DefaultNVIDIALibraries = map[string][]string{

		"libnvidia-ml.so": {

			"libnvidia-ml.so.1",
		},

		"libcuda.so": {
			"libcuda.so.1",
		},
	}

	DefaultNVIDIALibrariesSearchDirs = []string{

		"/",
		"/usr/lib64",
		"/usr/lib/x86_64-linux-gnu",
		"/usr/lib/aarch64-linux-gnu",
		"/usr/lib/x86_64-linux-gnu/nvidia/current",
		"/usr/lib/aarch64-linux-gnu/nvidia/current",
		"/lib64",
		"/lib/x86_64-linux-gnu",
		"/lib/aarch64-linux-gnu",
		"/lib/x86_64-linux-gnu/nvidia/current",
		"/lib/aarch64-linux-gnu/nvidia/current",
	}
)
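
The two variables above pair library names with search directories. As a rough illustration of how such data could be consumed, the sketch below checks each directory for the library name or one of its listed alternatives. The helpers `findLibrary` and `fileExists` are hypothetical and not part of this package. The sketch also assumes that the map values are alternative file names and that the lookup is flat (no recursive walk); the package itself may behave differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// findLibrary returns the first path under searchDirs at which the library
// name, or one of its alternative names, exists.
// Hypothetical helper: it is NOT part of the query package.
func findLibrary(searchDirs []string, name string, alternatives []string, exists func(string) bool) (string, bool) {
	candidates := append([]string{name}, alternatives...)
	for _, dir := range searchDirs {
		for _, c := range candidates {
			p := filepath.Join(dir, c)
			if exists(p) {
				return p, true
			}
		}
	}
	return "", false
}

// fileExists reports whether path can be stat-ed on the local filesystem.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	// Shortened copies of the defaults listed in the documentation.
	libs := map[string][]string{
		"libnvidia-ml.so": {"libnvidia-ml.so.1"},
		"libcuda.so":      {"libcuda.so.1"},
	}
	dirs := []string{"/usr/lib64", "/usr/lib/x86_64-linux-gnu", "/lib64"}

	for name, alts := range libs {
		if p, ok := findLibrary(dirs, name, alts, fileExists); ok {
			fmt.Printf("%s: found at %s\n", name, p)
		} else {
			fmt.Printf("%s: not found\n", name)
		}
	}
}
```

Passing the existence check as a function keeps the lookup logic testable without touching the real filesystem.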
var ErrDefaultPollerNotSet = errors.New("default nvidia poller is not set")

Functions

func CountAllDevicesFromDevDir

func CountAllDevicesFromDevDir() (int, error)
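
The signature above carries no description; the Output type documents its GPUDeviceCount field as the "GPU device count from the /dev directory". A minimal sketch of one way to produce such a count follows. It assumes that GPU device nodes are named `nvidia0`, `nvidia1`, and so on, and that control nodes such as `nvidiactl` are excluded. The helper names are hypothetical, and the package's actual matching rules may differ.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// gpuDeviceNode matches per-GPU device node names such as "nvidia0".
// Assumption: nodes like "nvidiactl" or "nvidia-uvm" are not counted.
var gpuDeviceNode = regexp.MustCompile(`^nvidia[0-9]+$`)

// countGPUNodes counts the names that look like per-GPU device nodes.
func countGPUNodes(names []string) int {
	n := 0
	for _, name := range names {
		if gpuDeviceNode.MatchString(name) {
			n++
		}
	}
	return n
}

// countGPUNodesInDir lists dir and counts the per-GPU device nodes in it.
// Hypothetical helper: it is NOT the package's CountAllDevicesFromDevDir.
func countGPUNodesInDir(dir string) (int, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return countGPUNodes(names), nil
}

func main() {
	n, err := countGPUNodesInDir("/dev")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("GPU device nodes:", n)
}
```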

func CreateGet

func CreateGet(opts ...OpOption) query.GetFunc

func GPUsInstalled

func GPUsInstalled(ctx context.Context) (bool, error)

Returns true if the local machine has NVIDIA GPUs installed.

func Get

func Get(ctx context.Context, opts ...OpOption) (output any, err error)

Gets all NVIDIA component queries.

func GetDefaultPoller

func GetDefaultPoller() query.Poller

func GetSuccessOnce

func GetSuccessOnce() <-chan any

func IsErrDeviceHandleUnknownError

func IsErrDeviceHandleUnknownError(err error) bool

"NVIDIA Xid 79: GPU has fallen off the bus" may fail this syscall with "error getting device handle for index '6': Unknown Error" or "Unable to determine the device handle for GPU0000:CB:00.0: Unknown Error".
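
To make the two message shapes quoted above concrete, here is a minimal sketch of a message-based check. The functions `matchesDeviceHandleUnknown` and `looksLikeDeviceHandleUnknownError` are hypothetical names, and the case-insensitive substring matching is an assumption; the package's real implementation of IsErrDeviceHandleUnknownError may use different rules.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// matchesDeviceHandleUnknown reports whether msg resembles one of the two
// "device handle ... Unknown Error" messages quoted in the documentation.
// Assumption: case-insensitive substring matching is sufficient.
func matchesDeviceHandleUnknown(msg string) bool {
	m := strings.ToLower(msg)
	if !strings.Contains(m, "unknown error") {
		return false
	}
	return strings.Contains(m, "error getting device handle") ||
		strings.Contains(m, "unable to determine the device handle")
}

// looksLikeDeviceHandleUnknownError applies the check to an error value.
// Hypothetical helper: it is NOT the package's own function.
func looksLikeDeviceHandleUnknownError(err error) bool {
	return err != nil && matchesDeviceHandleUnknown(err.Error())
}

func main() {
	errs := []error{
		errors.New("error getting device handle for index '6': Unknown Error"),
		errors.New("Unable to determine the device handle for GPU0000:CB:00.0: Unknown Error"),
		errors.New("context deadline exceeded"),
	}
	for _, err := range errs {
		fmt.Printf("%t  %v\n", looksLikeDeviceHandleUnknownError(err), err)
	}
}
```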

func ListNVIDIAPCIs

func ListNVIDIAPCIs(ctx context.Context) ([]string, error)

Lists all PCI devices that are compatible with NVIDIA.

func SetDefaultPoller

func SetDefaultPoller(opts ...OpOption)

The default poller is set only once, since it relies on the kube client and a specific port.
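
A set-once default is commonly guarded with `sync.Once`. The sketch below shows that pattern under stated assumptions: the `poller` type, its `port` field, and the lowercase function names are hypothetical stand-ins, and nothing here confirms that the package uses `sync.Once` internally.

```go
package main

import (
	"fmt"
	"sync"
)

// poller is a hypothetical stand-in for the package's poller type.
type poller struct {
	port int
}

var (
	defaultPoller     *poller
	defaultPollerOnce sync.Once
)

// setDefaultPoller installs the default poller on the first call only.
// Later calls are no-ops, so the first configuration wins.
func setDefaultPoller(port int) {
	defaultPollerOnce.Do(func() {
		defaultPoller = &poller{port: port}
	})
}

// getDefaultPoller returns the poller installed by setDefaultPoller,
// or nil if it has not been set yet.
func getDefaultPoller() *poller {
	return defaultPoller
}

func main() {
	setDefaultPoller(8080) // arbitrary port chosen for illustration
	setDefaultPoller(9090) // ignored: the default is already set
	fmt.Println(getDefaultPoller().port)
}
```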

Types

type MemoryErrorManagementCapabilities

type MemoryErrorManagementCapabilities struct {
	// (If supported) GPU can limit the impact of uncorrectable ECC errors to GPU applications.
	// Existing/new workloads will run unaffected, both in terms of accuracy and performance.
	// Thus, does not require a GPU reset when memory errors occur.
	//
	// Note that there are some rarer cases where uncorrectable errors are still uncontained,
	// thus impacting all other workloads being processed in the GPU.
	//
	// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#error-containments
	ErrorContainment bool `json:"error_containment"`

	// (If supported) GPU can dynamically mark the page containing uncorrectable errors
	// as unusable, and any existing or new workloads will not be allocating this page.
	//
	// Thus, does not require a GPU reset to recover from most uncorrectable ECC errors.
	//
	// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#dynamic-page-offlining
	DynamicPageOfflining bool `json:"dynamic_page_offlining"`

	// (If supported) GPU can replace degrading memory cells with spare ones
	// to avoid offlining regions of memory. Row remapping differs from
	// dynamic page offlining in that the fix is applied at the hardware level.
	//
	// Row remapping requires a GPU reset to take effect.
	//
	// Even for "NVIDIA GeForce RTX 4090", NVML returns no error, so
	// "NVML.Supported" is not a reliable way to check whether row remapping
	// is supported. We therefore track a separate boolean value based on
	// the GPU product name.
	//
	// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping
	RowRemapping bool `json:"row_remapping"`

	// Message contains the message to the user about the memory error management capabilities.
	Message string `json:"message,omitempty"`
}

Contains information about the GPU's memory error management capabilities. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus
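
The JSON tags on the struct determine its serialized form. The sketch below uses a local copy of the struct (comments trimmed) so that it runs without importing the package; the `encode` helper is hypothetical. It shows that the three boolean fields are always emitted, while `message` is dropped when empty because of `omitempty`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MemoryErrorManagementCapabilities is a local copy of the documented
// struct, reproduced here so the example is self-contained.
type MemoryErrorManagementCapabilities struct {
	ErrorContainment     bool   `json:"error_containment"`
	DynamicPageOfflining bool   `json:"dynamic_page_offlining"`
	RowRemapping         bool   `json:"row_remapping"`
	Message              string `json:"message,omitempty"`
}

// encode marshals the capabilities to a JSON string.
// Hypothetical helper used only for this illustration.
func encode(c MemoryErrorManagementCapabilities) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := encode(MemoryErrorManagementCapabilities{
		ErrorContainment: true,
		RowRemapping:     true,
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// prints {"error_containment":true,"dynamic_page_offlining":false,"row_remapping":true}
	fmt.Println(s)
}
```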

func SupportedMemoryMgmtCapsByGPUProduct

func SupportedMemoryMgmtCapsByGPUProduct(gpuProductName string) MemoryErrorManagementCapabilities

SupportedMemoryMgmtCapsByGPUProduct returns the GPU memory error management capabilities based on the GPU product name. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus

type Op

type Op struct {
	// contains filtered or unexported fields
}

type OpOption

type OpOption func(*Op)

func WithDebug

func WithDebug(debug bool) OpOption

func WithHWSlowdownEventBucket added in v0.4.5

func WithHWSlowdownEventBucket(bucket eventstore.Bucket) OpOption

func WithIbstatCommand

func WithIbstatCommand(p string) OpOption

Specifies the ibstat binary path to overwrite the default path.

func WithXidEventBucket added in v0.4.5

func WithXidEventBucket(bucket eventstore.Bucket) OpOption
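
OpOption follows the functional-options pattern: each option is a function that mutates an Op. Because the fields of Op are unexported, the sketch below defines local stand-ins. The fields `debug` and `ibstatCommand`, the helper `applyOpts`, and the default value `"ibstat"` are all assumptions made for illustration, not the package's actual internals.

```go
package main

import "fmt"

// Op is a local stand-in; the real Op has unexported fields that the
// documentation does not list. These two fields are assumptions.
type Op struct {
	debug         bool
	ibstatCommand string
}

// OpOption mutates an Op, mirroring the documented type.
type OpOption func(*Op)

// WithDebug mirrors the documented signature.
func WithDebug(debug bool) OpOption {
	return func(op *Op) { op.debug = debug }
}

// WithIbstatCommand mirrors the documented signature: it overrides the
// default ibstat binary path.
func WithIbstatCommand(p string) OpOption {
	return func(op *Op) { op.ibstatCommand = p }
}

// applyOpts starts from defaults and applies each option in order.
// Hypothetical helper; the default "ibstat" is an assumed value.
func applyOpts(opts ...OpOption) *Op {
	op := &Op{ibstatCommand: "ibstat"}
	for _, opt := range opts {
		opt(op)
	}
	return op
}

func main() {
	op := applyOpts(WithDebug(true), WithIbstatCommand("/usr/sbin/ibstat"))
	fmt.Printf("debug=%t ibstat=%s\n", op.debug, op.ibstatCommand)
}
```

Options are applied in order, so a later option overrides an earlier one that sets the same field.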

type Output

type Output struct {
	// Time is the time when the query is executed.
	Time time.Time `json:"time"`

	// GPU device count from the /dev directory.
	GPUDeviceCount int `json:"gpu_device_count"`

	LsmodPeermem       *peermem.LsmodPeermemModuleOutput `json:"lsmod_peermem,omitempty"`
	LsmodPeermemErrors []string                          `json:"lsmod_peermem_errors,omitempty"`

	NVML       *nvml.Output `json:"nvml,omitempty"`
	NVMLErrors []string     `json:"nvml_errors,omitempty"`

	MemoryErrorManagementCapabilities MemoryErrorManagementCapabilities `json:"memory_error_management_capabilities,omitempty"`
}

func (*Output) GPUCount

func (o *Output) GPUCount() int

func (*Output) GPUCountFromNVML

func (o *Output) GPUCountFromNVML() int

func (*Output) GPUProductName

func (o *Output) GPUProductName() string

func (*Output) GPUProductNameFromNVML

func (o *Output) GPUProductNameFromNVML() string

This is the same product name in nvidia-smi outputs. ref. https://developer.nvidia.com/management-library-nvml

func (*Output) PrintInfo

func (o *Output) PrintInfo(opts ...OpOption)

func (*Output) YAML

func (o *Output) YAML() ([]byte, error)

Directories

Path Synopsis
metrics
  clock: Package clock provides the NVIDIA clock metrics collection and reporting.
  clock-speed: Package clockspeed provides the NVIDIA clock speed metrics collection and reporting.
  ecc: Package ecc provides the NVIDIA ECC metrics collection and reporting.
  gpm: Package gpm provides the NVIDIA GPM metrics collection and reporting.
  memory: Package memory provides the NVIDIA memory metrics collection and reporting.
  nvlink: Package nvlink provides the NVIDIA nvlink metrics collection and reporting.
  power: Package power provides the NVIDIA power usage metrics collection and reporting.
  processes: Package processes provides the NVIDIA processes metrics collection and reporting.
  temperature: Package temperature provides the NVIDIA temperature metrics collection and reporting.
  utilization: Package utilization provides the NVIDIA GPU utilization metrics collection and reporting.
Package nccl contains the implementation of the NCCL (NVIDIA Collective Communications Library) query for NVIDIA GPUs.
Package nvml implements the NVIDIA Management Library (NVML) interface.
  lib: Package lib implements the NVIDIA Management Library (NVML) interface.
Package peermem contains the implementation of the peermem query for NVIDIA GPUs.
Package sxid provides the NVIDIA SXID error details.
Package xid provides the NVIDIA XID error details.
