Documentation
Constants
This section is empty.
Variables
var (
	// ErrAlreadyRegistered is the error returned when a component is already registered.
	ErrAlreadyRegistered = errors.New("component already registered")
)
Functions
This section is empty.
Types
type CheckResult (added in v0.5.0)
type CheckResult interface {
// String returns a string representation of the data.
// Describes the data in a human-readable format.
fmt.Stringer
// Summary returns a summary of the check result.
Summary() string
// HealthStateType returns the health state of the last check result.
HealthStateType() apiv1.HealthStateType
// HealthStates returns the health states of the last check result.
HealthStates() apiv1.HealthStates
}
CheckResult is the data type that represents the result of a component health state check.
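A minimal sketch of a type that satisfies this interface may help. It assumes local stand-ins for the `apiv1` types (`HealthStateType`, `HealthState`, `HealthStates`), whose real definitions may differ, and `diskResult` is a hypothetical result type invented for illustration.

```go
package main

import "fmt"

// Local stand-ins for the apiv1 types; the real definitions may differ.
type HealthStateType string

const (
	Healthy   HealthStateType = "Healthy"
	Unhealthy HealthStateType = "Unhealthy"
)

type HealthState struct {
	Health HealthStateType
	Reason string
}

type HealthStates []HealthState

// CheckResult mirrors the interface above, using the stand-in types.
type CheckResult interface {
	fmt.Stringer
	Summary() string
	HealthStateType() HealthStateType
	HealthStates() HealthStates
}

// diskResult is a hypothetical result for a disk usage check.
type diskResult struct {
	usedPercent float64
	limit       float64
}

// String gives the human-readable form of the underlying data.
func (r *diskResult) String() string {
	return fmt.Sprintf("disk used: %.1f%% (limit %.1f%%)", r.usedPercent, r.limit)
}

// Summary gives a one-line verdict.
func (r *diskResult) Summary() string {
	if r.usedPercent > r.limit {
		return "disk usage above limit"
	}
	return "disk usage within limit"
}

// HealthStateType reports the overall state of this result.
func (r *diskResult) HealthStateType() HealthStateType {
	if r.usedPercent > r.limit {
		return Unhealthy
	}
	return Healthy
}

// HealthStates wraps the overall state in a single-element list.
func (r *diskResult) HealthStates() HealthStates {
	return HealthStates{{Health: r.HealthStateType(), Reason: r.Summary()}}
}

// Compile-time check that diskResult satisfies the interface.
var _ CheckResult = (*diskResult)(nil)

func main() {
	r := &diskResult{usedPercent: 91.5, limit: 90}
	fmt.Println(r)
	fmt.Println(r.Summary(), r.HealthStateType())
}
```

Deriving `Summary`, `HealthStateType`, and `HealthStates` from the same fields keeps the three views of one result consistent with each other.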
type CheckResultDebugger (added in v0.5.0)
type CheckResultDebugger interface {
// Debug returns a string representation of the check result.
Debug() string
}
CheckResultDebugger is an optional interface that can be implemented by components to allow debugging the check result.
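Because the interface is optional, a caller would typically discover it with a type assertion. The sketch below assumes the check result value itself implements `Debug`; `plainResult`, `verboseResult`, and `describe` are hypothetical names, not part of the package.

```go
package main

import "fmt"

// CheckResultDebugger mirrors the optional interface above.
type CheckResultDebugger interface {
	Debug() string
}

// plainResult is a hypothetical result with no debug support.
type plainResult struct{}

func (plainResult) String() string { return "ok" }

// verboseResult is a hypothetical result that also implements Debug.
type verboseResult struct{ raw string }

func (verboseResult) String() string  { return "ok" }
func (v verboseResult) Debug() string { return "raw output: " + v.raw }

// describe prefers Debug when the result provides it and falls back to String.
func describe(r fmt.Stringer) string {
	if d, ok := r.(CheckResultDebugger); ok {
		return d.Debug()
	}
	return r.String()
}

func main() {
	fmt.Println(describe(plainResult{}))
	fmt.Println(describe(verboseResult{raw: "exit status 0"}))
}
```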
type Component
type Component interface {
// Name returns the component name,
// which is also used for the HTTP handler registration path.
// It must be globally unique.
Name() string
// Start is called upon server start.
// It implements the component-specific poller start logic.
Start() error
// Check triggers the component check once, and returns the latest health check result.
// This is used for one-time checks, such as "gpud scan".
// It is up to the component to decide the check timeouts.
// Thus, we do not pass the context here.
// The CheckResult should embed the timeout errors if any, via Summary and HealthState.
Check() CheckResult
// LastHealthStates reads the latest health states of the component,
// cached from its periodic checks.
// If no check has been performed, it returns a single health state of apiv1.StateTypeHealthy.
LastHealthStates() apiv1.HealthStates
// Events returns all the events from "since".
Events(ctx context.Context, since time.Time) (apiv1.Events, error)
// Close is called upon server close.
// It implements the component-specific poller cleanup logic.
Close() error
}
Component represents an individual component of the system.
Component checks are independent of one another, but the underlying implementations may share the same data sources to minimize querying overhead (e.g., nvidia-smi calls).
Each component implements its own output format inside the State struct. It is recommended to use a consistent name for its HTTP handler and to define const keys for the State extra information field.
type Deregisterable (added in v0.5.0)
type Deregisterable interface {
// CanDeregister returns true if the custom plugin can be deregistered.
CanDeregister() bool
}
Deregisterable is an interface that allows a custom plugin to be deregistered. By default, regular (built-in) components cannot be deregistered unless they implement this interface.
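The deny-by-default rule can be expressed as a small guard. This is a sketch only: `builtin`, `plugin`, and `canDeregister` are hypothetical, and how the real registry consults the interface is not shown on this page.

```go
package main

import "fmt"

// Deregisterable mirrors the interface above.
type Deregisterable interface {
	CanDeregister() bool
}

// builtin is a hypothetical built-in component; it does not implement the interface.
type builtin struct{}

// plugin is a hypothetical custom plugin that opts in explicitly.
type plugin struct{ removable bool }

func (p plugin) CanDeregister() bool { return p.removable }

// canDeregister allows removal only when the component implements
// Deregisterable and its CanDeregister method returns true.
func canDeregister(c any) bool {
	d, ok := c.(Deregisterable)
	return ok && d.CanDeregister()
}

func main() {
	fmt.Println(canDeregister(builtin{}))
	fmt.Println(canDeregister(plugin{removable: true}))
	fmt.Println(canDeregister(plugin{removable: false}))
}
```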
type GPUdInstance (added in v0.5.0)
type GPUdInstance struct {
RootCtx context.Context
KernelModulesToCheck []string
NVMLInstance nvidianvml.Instance
NVIDIAToolOverwrites nvidiacommon.ToolOverwrites
Annotations map[string]string
DBRO *sql.DB
EventStore eventstore.Store
RebootEventStore pkghost.RebootEventStore
MountPoints []string
MountTargets []string
}
GPUdInstance is the instance of the GPUd dependencies.
type HealthSettable (added in v0.5.0)
type HealthSettable interface {
// SetHealthy sets the health state to healthy.
SetHealthy() error
}
HealthSettable is an optional interface that can be implemented by components to allow setting the health state.
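One plausible use is a component whose unhealthy state persists until it is explicitly cleared. The sketch below assumes that behavior; `stickyComponent`, `markUnhealthy`, `healthy`, and `resetHealth` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// HealthSettable mirrors the optional interface above.
type HealthSettable interface {
	SetHealthy() error
}

// stickyComponent is a hypothetical component whose unhealthy state
// persists until someone explicitly resets it.
type stickyComponent struct {
	mu     sync.Mutex
	reason string // empty means healthy
}

func (s *stickyComponent) markUnhealthy(reason string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reason = reason
}

func (s *stickyComponent) healthy() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.reason == ""
}

// SetHealthy clears the recorded failure.
func (s *stickyComponent) SetHealthy() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reason = ""
	return nil
}

// resetHealth calls SetHealthy when the component supports it.
// The first return value reports whether the interface was implemented.
func resetHealth(c any) (bool, error) {
	hs, ok := c.(HealthSettable)
	if !ok {
		return false, nil
	}
	return true, hs.SetHealthy()
}

func main() {
	c := &stickyComponent{}
	c.markUnhealthy("xid error observed")
	fmt.Println(c.healthy())
	supported, err := resetHealth(c)
	fmt.Println(supported, err, c.healthy())
}
```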
type InitFunc (added in v0.5.0)
type InitFunc func(*GPUdInstance) (Component, error)
InitFunc is the function that initializes a component.
type Registry (added in v0.5.0)
type Registry interface {
// MustRegister registers a component with the given initialization function.
// It panics if the component is already registered.
// It panics if the initialization function returns an error.
MustRegister(initFunc InitFunc)
// Register registers a component with the given initialization function.
// It returns an error if the component is already registered.
// It returns an error if the initialization function returns an error.
Register(initFunc InitFunc) (Component, error)
// All returns all registered components.
All() []Component
// Get returns a component by name.
// It returns nil if the component is not registered.
Get(name string) Component
// Deregister deregisters a component by name, and returns the
// underlying component if it is registered.
// It returns nil if the component is not registered.
// Meaning, it is safe to call it multiple times,
// and it is also safe to call it with a non-registered name.
Deregister(name string) Component
}
Registry is the interface for the registry of components.
func NewRegistry (added in v0.5.0)
func NewRegistry(gpudInstance *GPUdInstance) Registry
NewRegistry creates a new registry.
Directories
| Path | Synopsis |
|---|---|
| | Package accelerator contains the accelerator components and its query interface. |
| nvidia | Package nvidia contains the NVIDIA accelerator components and its query interface. |
| nvidia/bad-envs | Package badenvs tracks any bad environment variables that are globally set for the NVIDIA GPUs. |
| nvidia/clock-speed | Package clockspeed tracks the NVIDIA per-GPU clock speed. |
| nvidia/ecc | Package ecc tracks the NVIDIA per-GPU ECC errors and other ECC related information. |
| nvidia/fabric-manager | Package fabricmanager tracks the NVIDIA fabric manager version and its activeness. |
| nvidia/gpm | Package gpm tracks the NVIDIA per-GPU GPM metrics. |
| nvidia/gsp-firmware-mode | Package gspfirmwaremode tracks the NVIDIA GSP firmware mode. |
| nvidia/hw-slowdown | Package hwslowdown monitors NVIDIA GPU hardware clock events of all GPUs, such as HW Slowdown events. |
| nvidia/infiniband | Package infiniband monitors the infiniband status of the system. |
| nvidia/info | Package info provides relatively static information about the NVIDIA accelerator (e.g., GPU product names). |
| nvidia/memory | Package memory tracks the NVIDIA per-GPU memory usage. |
| nvidia/nccl | Package nccl monitors the NCCL status. |
| nvidia/nvlink | Package nvlink monitors the NVIDIA per-GPU nvlink devices. |
| nvidia/peermem | Package peermem monitors the peermem module status. |
| nvidia/persistence-mode | Package persistencemode tracks the NVIDIA persistence mode. |
| nvidia/power | Package power tracks the NVIDIA per-GPU power usage. |
| nvidia/processes | Package processes tracks the NVIDIA per-GPU processes. |
| nvidia/remapped-rows | Package remappedrows tracks the NVIDIA per-GPU remapped rows. |
| nvidia/sxid | Package sxid tracks the NVIDIA GPU SXid errors scanning the kmsg. |
| nvidia/temperature | Package temperature tracks the NVIDIA per-GPU temperatures. |
| nvidia/utilization | Package utilization tracks the NVIDIA per-GPU utilization. |
| nvidia/xid | Package xid tracks the NVIDIA GPU Xid errors scanning the kmsg. See Xid messages https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html#xid-messages. |
| | Package containerd contains the containerd components and its query interface. |
| pod | Package pod tracks the current pods from the containerd CRI. |
| | Package cpu tracks the combined usage of all CPUs (not per-CPU). |
| | Package disk tracks the disk usage of all the mount points specified in the configuration. |
| | Package docker contains the docker components and its query interface. |
| container | Package container tracks the current containers from the docker runtime. |
| | Package fd tracks the number of file descriptors used on the host. |
| | Package fuse monitors the FUSE (Filesystem in Userspace). |
| | Package info provides static information about the host (e.g., labels, IDs). |
| | Package kernelmodule provides a component that checks the kernel modules in Linux. |
| kubelet | |
| pod | Package pod tracks the current pods from the kubelet read-only port. |
| | Package library provides a component that returns healthy if and only if all the specified libraries exist. |
| | Package memory tracks the memory usage of the host. |
| network | |
| latency | Package latency tracks the global network connectivity statistics. |
| | Package os queries the host OS information (e.g., kernel version). |
| | Package pci tracks the PCI devices and their Access Control Services (ACS) status. |
| | Package tailscale tracks the current tailscale status. |