Documentation
Overview ¶
Package processes tracks the processes running on each NVIDIA GPU.
Index ¶
Constants ¶
const Name = "accelerator-nvidia-processes"
Name is the component name for NVIDIA GPU process monitoring.
const SubSystem = "accelerator_nvidia_processes"
SubSystem is the Prometheus subsystem name for process metrics.
Variables ¶
This section is empty.
Functions ¶
func New ¶
func New(gpudInstance *components.GPUdInstance) (components.Component, error)
New returns the NVIDIA processes component.
Types ¶
type Process ¶ added in v0.9.0
type Process struct {
	PID    uint32   `json:"pid"`
	Status []string `json:"status,omitempty"`
	// ZombieStatus is set to true if the process is defunct
	// (terminated but not reaped by its parent).
	ZombieStatus bool `json:"zombie_status,omitempty"`
	// BadEnvVarsForCUDA is a map of environment variables that are known to hurt CUDA
	// and that are set for this specific process.
	// Empty if no bad environment variable is found for this process.
	// This implements the "DCGM_FR_BAD_CUDA_ENV" logic in DCGM.
	BadEnvVarsForCUDA           map[string]string `json:"bad_env_vars_for_cuda,omitempty"`
	CmdArgs                     []string          `json:"cmd_args,omitempty"`
	CreateTime                  metav1.Time       `json:"create_time,omitzero"`
	GPUUsedPercent              uint32            `json:"gpu_used_percent,omitempty"`
	GPUUsedMemoryBytes          uint64            `json:"gpu_used_memory_bytes,omitempty"`
	GPUUsedMemoryBytesHumanized string            `json:"gpu_used_memory_bytes_humanized,omitempty"`
}
Process describes a single GPU-backed process observed through NVML.
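The JSON tags above determine how a Process record appears on the wire. The following is a minimal sketch of consuming such a record, not the package's own code: the `process` type is a trimmed local mirror written for illustration, `decodeProcess` and `needsAttention` are hypothetical helpers, and the payload (including the `EXAMPLE_VAR` name) is invented.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// process is a trimmed, local mirror of the documented Process JSON shape.
// It is an assumption for illustration; the real type also carries
// cmd_args, create_time, and a humanized memory string.
type process struct {
	PID                uint32            `json:"pid"`
	Status             []string          `json:"status,omitempty"`
	ZombieStatus       bool              `json:"zombie_status,omitempty"`
	BadEnvVarsForCUDA  map[string]string `json:"bad_env_vars_for_cuda,omitempty"`
	GPUUsedPercent     uint32            `json:"gpu_used_percent,omitempty"`
	GPUUsedMemoryBytes uint64            `json:"gpu_used_memory_bytes,omitempty"`
}

// decodeProcess parses one JSON-encoded process record.
func decodeProcess(data []byte) (process, error) {
	var p process
	err := json.Unmarshal(data, &p)
	return p, err
}

// needsAttention reports whether the process is defunct or has at least
// one flagged environment variable.
func needsAttention(p process) bool {
	return p.ZombieStatus || len(p.BadEnvVarsForCUDA) > 0
}

func main() {
	// Hypothetical payload; the environment variable name is made up.
	raw := []byte(`{"pid":4242,"bad_env_vars_for_cuda":{"EXAMPLE_VAR":"1"},"gpu_used_memory_bytes":1048576}`)

	p, err := decodeProcess(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("pid=%d needs attention: %t\n", p.PID, needsAttention(p))
}
```

Because most fields use `omitempty`, a missing key decodes to the zero value, so a consumer cannot tell "not reported" apart from "reported as zero" for fields such as `gpu_used_percent`.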
type Processes ¶ added in v0.9.0
type Processes struct {
	// UUID is the GPU UUID.
	UUID string `json:"uuid"`
	// BusID is the GPU bus ID from the nvml API.
	// e.g., "0000:0f:00.0"
	BusID string `json:"bus_id"`
	// RunningProcesses is the list of processes running on this GPU.
	RunningProcesses []Process `json:"running_processes"`
	// GetComputeRunningProcessesSupported is true if the device supports the getComputeRunningProcesses API.
	GetComputeRunningProcessesSupported bool `json:"get_compute_running_processes_supported"`
	// GetProcessUtilizationSupported is true if the device supports the getProcessUtilization API.
	GetProcessUtilizationSupported bool `json:"get_process_utilization_supported"`
}
Processes describes the processes running on a single GPU, identified by its UUID and bus ID, along with flags indicating whether the device supports the NVML process queries.
ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g7e505374454a0d4fc7339b6c885656d6
ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1ga115e41a14b747cb334a0e7b49ae1941
ref. https://docs.nvidia.com/deploy/nvml-api/group__nvmlClocksEventReasons.html#group__nvmlClocksEventReasons