query

package

Version: v0.4.9 (latest) | Published: Apr 21, 2025 | License: Apache-2.0 | Imports: 29 | Imported by: 0

Repository

github.com/leptonai/gpud

Documentation ¶

Overview ¶

Package query implements various NVIDIA-related system queries. All interactions with NVIDIA data sources are implemented under the query packages.

Index ¶

Constants
Variables
func CountAllDevicesFromDevDir() (int, error)
func CreateGet(opts ...OpOption) query.GetFunc
func GPUsInstalled(ctx context.Context) (bool, error)
func Get(ctx context.Context, opts ...OpOption) (output any, err error)
func GetDefaultPoller() query.Poller
func GetSuccessOnce() <-chan any
func IsErrDeviceHandleUnknownError(err error) bool
func ListNVIDIAPCIs(ctx context.Context) ([]string, error)
func SetDefaultPoller(opts ...OpOption)
type MemoryErrorManagementCapabilities
- func SupportedMemoryMgmtCapsByGPUProduct(gpuProductName string) MemoryErrorManagementCapabilities
type Op
type OpOption
type Output

Constants ¶

const (
	StateKeyGPUProductName      = "gpu_product_name"
	StateKeyFabricManagerExists = "fabric_manager_exists"
	StateKeyIbstatExists        = "ibstat_exists"
)

Variables ¶

var (
	DefaultNVIDIALibraries = map[string][]string{

		"libnvidia-ml.so": {

			"libnvidia-ml.so.1",
		},

		"libcuda.so": {
			"libcuda.so.1",
		},
	}

	DefaultNVIDIALibrariesSearchDirs = []string{

		"/",
		"/usr/lib64",
		"/usr/lib/x86_64-linux-gnu",
		"/usr/lib/aarch64-linux-gnu",
		"/usr/lib/x86_64-linux-gnu/nvidia/current",
		"/usr/lib/aarch64-linux-gnu/nvidia/current",
		"/lib64",
		"/lib/x86_64-linux-gnu",
		"/lib/aarch64-linux-gnu",
		"/lib/x86_64-linux-gnu/nvidia/current",
		"/lib/aarch64-linux-gnu/nvidia/current",
	}
)

var ErrDefaultPollerNotSet = errors.New("default nvidia poller is not set")

Functions ¶

func CountAllDevicesFromDevDir ¶

func CountAllDevicesFromDevDir() (int, error)

func CreateGet ¶

func CreateGet(opts ...OpOption) query.GetFunc

func GPUsInstalled ¶

func GPUsInstalled(ctx context.Context) (bool, error)

Returns true if the local machine has NVIDIA GPUs installed.

func Get ¶

func Get(ctx context.Context, opts ...OpOption) (output any, err error)

Get runs all NVIDIA component queries and returns their output.

func GetDefaultPoller ¶

func GetDefaultPoller() query.Poller

func GetSuccessOnce ¶

func GetSuccessOnce() <-chan any

func IsErrDeviceHandleUnknownError ¶

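func IsErrDeviceHandleUnknownError(err error) bool

IsErrDeviceHandleUnknownError reports whether err is the "Unknown Error" returned when a device handle cannot be obtained. For example, "NVIDIA Xid 79: GPU has fallen off the bus" may fail this syscall with "error getting device handle for index '6': Unknown Error" or "Unable to determine the device handle for GPU0000:CB:00.0: Unknown Error".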

func ListNVIDIAPCIs ¶

func ListNVIDIAPCIs(ctx context.Context) ([]string, error)

Lists all PCI devices that are compatible with NVIDIA.

func SetDefaultPoller ¶

func SetDefaultPoller(opts ...OpOption)

SetDefaultPoller sets the default poller. It is set only once, since it relies on the kube client and a specific port.
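
A common way to get set-once behavior in Go is sync.Once. The sketch below illustrates that pattern with hypothetical names (poller, setDefaultPoller, getDefaultPoller); whether this package uses sync.Once internally is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// poller is a hypothetical placeholder for the real poller type.
type poller struct{ name string }

var (
	defaultPollerOnce sync.Once
	defaultPoller     *poller
)

// setDefaultPoller initializes the default poller on the first call only.
// Later calls are ignored.
func setDefaultPoller(name string) {
	defaultPollerOnce.Do(func() {
		defaultPoller = &poller{name: name}
	})
}

// getDefaultPoller returns the default poller, or nil if it was never set.
func getDefaultPoller() *poller { return defaultPoller }

func main() {
	setDefaultPoller("first")
	setDefaultPoller("second") // ignored: the poller is already set
	fmt.Println(getDefaultPoller().name) // prints "first"
}
```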

Types ¶

type MemoryErrorManagementCapabilities ¶

type MemoryErrorManagementCapabilities struct {
	// (If supported) GPU can limit the impact of uncorrectable ECC errors to GPU applications.
	// Existing/new workloads will run unaffected, both in terms of accuracy and performance.
	// Thus, does not require a GPU reset when memory errors occur.
	//
	// Note that there are some rarer cases where uncorrectable errors are still uncontained,
	// thus impacting all other workloads being processed in the GPU.
	//
	// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#error-containments
	ErrorContainment bool `json:"error_containment"`

	// (If supported) GPU can dynamically mark the page containing uncorrectable errors
	// as unusable, and any existing or new workloads will not be allocating this page.
	//
	// Thus, does not require a GPU reset to recover from most uncorrectable ECC errors.
	//
	// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#dynamic-page-offlining
	DynamicPageOfflining bool `json:"dynamic_page_offlining"`

	// (If supported) GPU can replace degrading memory cells with spare ones
	// to avoid offlining regions of memory. And the row remapping is different
	// from dynamic page offlining which is fixed at a hardware level.
	//
	// The row remapping requires a GPU reset to take effect.
	//
	// Even for "NVIDIA GeForce RTX 4090", NVML returns no error,
	// so "NVML.Supported" is not a reliable way to check whether row remapping is supported.
	// We therefore track a separate boolean value based on the GPU product name.
	//
	// ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping
	RowRemapping bool `json:"row_remapping"`

	// Message contains the message to the user about the memory error management capabilities.
	Message string `json:"message,omitempty"`
}

Contains information about the GPU's memory error management capabilities. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus
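
The struct tags above determine how the capabilities appear in JSON. The sketch below uses a local copy of the struct (comments omitted) to show the encoded form; the encode helper is hypothetical and exists only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MemoryErrorManagementCapabilities is a local copy of the documented
// struct so that the sketch compiles on its own.
type MemoryErrorManagementCapabilities struct {
	ErrorContainment     bool   `json:"error_containment"`
	DynamicPageOfflining bool   `json:"dynamic_page_offlining"`
	RowRemapping         bool   `json:"row_remapping"`
	Message              string `json:"message,omitempty"`
}

// encode is a hypothetical helper that returns the JSON form as a string.
func encode(c MemoryErrorManagementCapabilities) (string, error) {
	b, err := json.Marshal(c)
	return string(b), err
}

func main() {
	s, err := encode(MemoryErrorManagementCapabilities{RowRemapping: true})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The boolean fields are always emitted; the empty Message is dropped by omitempty.
	fmt.Println(s)
	// prints {"error_containment":false,"dynamic_page_offlining":false,"row_remapping":true}
}
```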

func SupportedMemoryMgmtCapsByGPUProduct ¶

func SupportedMemoryMgmtCapsByGPUProduct(gpuProductName string) MemoryErrorManagementCapabilities

SupportedMemoryMgmtCapsByGPUProduct returns the GPU memory error management capabilities based on the GPU product name. ref. https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#supported-gpus

type Op ¶

type Op struct {
	// contains filtered or unexported fields
}

type OpOption ¶

type OpOption func(*Op)

func WithDebug ¶

func WithDebug(debug bool) OpOption

func WithHWSlowdownEventBucket ¶ added in v0.4.5

func WithHWSlowdownEventBucket(bucket eventstore.Bucket) OpOption

func WithIbstatCommand ¶

func WithIbstatCommand(p string) OpOption

Specifies the ibstat binary path, overriding the default path.

func WithXidEventBucket ¶ added in v0.4.5

func WithXidEventBucket(bucket eventstore.Bucket) OpOption
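
OpOption follows the functional options pattern: each With... function returns a closure that mutates an Op. Because Op's fields are unexported, the sketch below uses hypothetical lowercase types and field names, and the "ibstat" default is an assumption.

```go
package main

import "fmt"

// op mirrors the shape of Op. The field names are hypothetical, since
// the real fields are unexported.
type op struct {
	debug         bool
	ibstatCommand string
}

// opOption mirrors OpOption: a function that mutates an op.
type opOption func(*op)

func withDebug(debug bool) opOption {
	return func(o *op) { o.debug = debug }
}

func withIbstatCommand(p string) opOption {
	return func(o *op) { o.ibstatCommand = p }
}

// applyOpts starts from defaults and applies each option in order.
func applyOpts(opts ...opOption) *op {
	o := &op{ibstatCommand: "ibstat"} // assumed default
	for _, opt := range opts {
		opt(o)
	}
	return o
}

func main() {
	o := applyOpts(withDebug(true), withIbstatCommand("/usr/sbin/ibstat"))
	fmt.Printf("debug=%t ibstat=%s\n", o.debug, o.ibstatCommand)
}
```

The pattern lets variadic functions such as Get and CreateGet accept any combination of options without changing their signatures when new options are added.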

type Output ¶

type Output struct {
	// Time is the time when the query is executed.
	Time time.Time `json:"time"`

	// GPU device count from the /dev directory.
	GPUDeviceCount int `json:"gpu_device_count"`

	LsmodPeermem       *peermem.LsmodPeermemModuleOutput `json:"lsmod_peermem,omitempty"`
	LsmodPeermemErrors []string                          `json:"lsmod_peermem_errors,omitempty"`

	NVML       *nvml.Output `json:"nvml,omitempty"`
	NVMLErrors []string     `json:"nvml_errors,omitempty"`

	MemoryErrorManagementCapabilities MemoryErrorManagementCapabilities `json:"memory_error_management_capabilities,omitempty"`
}

func (*Output) GPUCount ¶

func (o *Output) GPUCount() int

func (*Output) GPUCountFromNVML ¶

func (o *Output) GPUCountFromNVML() int

func (*Output) GPUProductName ¶

func (o *Output) GPUProductName() string

func (*Output) GPUProductNameFromNVML ¶

func (o *Output) GPUProductNameFromNVML() string

This is the same product name that appears in nvidia-smi output. ref. https://developer.nvidia.com/management-library-nvml

func (*Output) PrintInfo ¶

func (o *Output) PrintInfo(opts ...OpOption)

func (*Output) YAML ¶

func (o *Output) YAML() ([]byte, error)

Directories ¶

Path	Synopsis
infiniband
metrics
clock	Package clock provides the NVIDIA clock metrics collection and reporting.
clock-speed	Package clockspeed provides the NVIDIA clock speed metrics collection and reporting.
ecc	Package ecc provides the NVIDIA ECC metrics collection and reporting.
gpm	Package gpm provides the NVIDIA GPM metrics collection and reporting.
memory	Package memory provides the NVIDIA memory metrics collection and reporting.
nvlink	Package nvlink provides the NVIDIA nvlink metrics collection and reporting.
power	Package power provides the NVIDIA power usage metrics collection and reporting.
processes	Package processes provides the NVIDIA processes metrics collection and reporting.
temperature	Package temperature provides the NVIDIA temperature metrics collection and reporting.
utilization	Package utilization provides the NVIDIA GPU utilization metrics collection and reporting.
nccl	Package nccl contains the implementation of the NCCL (NVIDIA Collective Communications Library) query for NVIDIA GPUs.
nvml	Package nvml implements the NVIDIA Management Library (NVML) interface.
lib	Package lib implements the NVIDIA Management Library (NVML) interface.
lib/mock
testutil
peermem	Package peermem contains the implementation of the peermem query for NVIDIA GPUs.
sxid	Package sxid provides the NVIDIA SXID error details.
xid	Package xid provides the NVIDIA XID error details.