components

package
v0.5.2-test
Warning

This package is not in the latest version of its module.

Published: Jul 24, 2025 | License: Apache-2.0 | Imports: 12 | Imported by: 0

Documentation

Constants

This section is empty.

Variables

var (
	// ErrAlreadyRegistered is the error returned when a component is already registered.
	ErrAlreadyRegistered = errors.New("component already registered")
)

Functions

This section is empty.

Types

type CheckResult added in v0.5.0

type CheckResult interface {
	// ComponentName returns the name of the component that produced this check result.
	ComponentName() string

	// String returns a human-readable string representation of the data.
	fmt.Stringer

	// Summary returns a summary of the check result.
	Summary() string

	// HealthStateType returns the health state of the last check result.
	HealthStateType() apiv1.HealthStateType
	// HealthStates returns the health states of the last check result.
	HealthStates() apiv1.HealthStates
}

CheckResult is the data type that represents the result of a component health state check.
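
To illustrate the shape of an implementation, here is a minimal sketch of a check result for a hypothetical disk usage component. healthStateType and its two constants are stand-ins for apiv1.HealthStateType (the real type and values may differ), diskCheckResult and its fields are invented for the example, and HealthStates is omitted to keep the sketch short.

```go
package main

import "fmt"

// healthStateType is a stand-in for apiv1.HealthStateType; the real type and
// its constants live in the GPUd API package and may differ.
type healthStateType string

const (
	stateHealthy   healthStateType = "Healthy"
	stateUnhealthy healthStateType = "Unhealthy"
)

// diskCheckResult is a hypothetical check result for a disk usage component.
type diskCheckResult struct {
	usedPercent float64
	threshold   float64
}

func (r *diskCheckResult) ComponentName() string { return "disk" }

// String satisfies fmt.Stringer with a human-readable description of the data.
func (r *diskCheckResult) String() string {
	return fmt.Sprintf("used %.1f%% (threshold %.1f%%)", r.usedPercent, r.threshold)
}

// Summary gives a one-line verdict for the check.
func (r *diskCheckResult) Summary() string {
	if r.usedPercent > r.threshold {
		return "disk usage above threshold"
	}
	return "ok"
}

// HealthStateType maps the measurement to a health state.
func (r *diskCheckResult) HealthStateType() healthStateType {
	if r.usedPercent > r.threshold {
		return stateUnhealthy
	}
	return stateHealthy
}

func main() {
	r := &diskCheckResult{usedPercent: 93.5, threshold: 90}
	fmt.Println(r.ComponentName(), r.HealthStateType(), r.Summary(), r)
}
```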

type CheckResultDebugger added in v0.5.0

type CheckResultDebugger interface {
	// Debug returns a string representation of the check result.
	Debug() string
}

CheckResultDebugger is an optional interface that can be implemented by components to allow debugging the check result.
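
Because the interface is optional, a caller would typically detect it with a type assertion. The sketch below assumes trimmed stand-in interfaces and two invented result types (plainResult, verboseResult); describe is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// checkResult is a trimmed stand-in for the CheckResult interface.
type checkResult interface {
	Summary() string
}

// checkResultDebugger mirrors the optional CheckResultDebugger interface.
type checkResultDebugger interface {
	Debug() string
}

// plainResult implements only the required method.
type plainResult struct{}

func (plainResult) Summary() string { return "ok" }

// verboseResult additionally implements Debug.
type verboseResult struct{ raw string }

func (verboseResult) Summary() string { return "ok" }
func (v verboseResult) Debug() string { return "raw=" + v.raw }

// describe prefers the debug output when the result provides it,
// and falls back to the summary otherwise.
func describe(r checkResult) string {
	if d, ok := r.(checkResultDebugger); ok {
		return d.Debug()
	}
	return r.Summary()
}

func main() {
	fmt.Println(describe(plainResult{}))
	fmt.Println(describe(verboseResult{raw: "temp=41C"}))
}
```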

type Component

type Component interface {
	// Name returns the component name,
	// which is also used as the HTTP handler registration path.
	// Must be globally unique.
	Name() string

	// Tags returns a list of tags that describe the component.
	//
	// The component tags are static, and will not change over time.
	// It is important to keep in mind that the tags only represent
	// the component's functionality, not the component's health state.
	//
	// This is useful to trigger on-demand component checks based on specific tags.
	//
	// e.g.,
	// A GPU-enabled component may return its accelerator manufacturer,
	// but does not report its health state via the tags.
	Tags() []string

	// IsSupported returns true if the component is supported on the current machine.
	// For example, this returns "false" if a component requires NVIDIA GPUs,
	// but the machine does not have NVIDIA GPUs.
	IsSupported() bool

	// Start is called upon server start.
	// Implements component-specific poller start logic.
	Start() error

	// Check triggers the component check once, and returns the latest health check result.
	// This is used for one-time checks, such as "gpud scan".
	// It is up to the component to decide the check timeouts.
	// Thus, we do not pass the context here.
	// The CheckResult should embed the timeout errors if any, via Summary and HealthState.
	Check() CheckResult

	// LastHealthStates reads the latest health states of the component,
	// cached from its periodic checks.
	// If no check has been performed, it returns a single health state of apiv1.StateTypeHealthy.
	LastHealthStates() apiv1.HealthStates

	// Events returns all the events from "since".
	Events(ctx context.Context, since time.Time) (apiv1.Events, error)

	// Close is called upon server close.
	// Implements component-specific poller cleanup logic.
	Close() error
}

Component represents an individual component of the system.

Component checks are independent of one another. However, the underlying implementations may share the same data sources to minimize querying overhead (e.g., nvidia-smi calls).

Each component implements its own output format inside the State struct. It is recommended that each component use a consistent name for its HTTP handler and define const keys for the State extra information field.
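
The relationship between Check and LastHealthStates described in the interface comments can be sketched as a small cache. This is an illustration under stated assumptions only: healthState is a stand-in for an apiv1 health state entry, cachingComponent and its probe field are hypothetical, and Check returns the states directly here instead of a CheckResult.

```go
package main

import (
	"fmt"
	"sync"
)

// healthState is a stand-in for a single apiv1 health state entry.
type healthState struct {
	Health string
	Reason string
}

// cachingComponent is a hypothetical component that caches its last check.
type cachingComponent struct {
	mu    sync.RWMutex
	last  []healthState // nil until the first check completes
	probe func() healthState
}

// Check runs the probe once and stores the result for later reads.
func (c *cachingComponent) Check() []healthState {
	states := []healthState{c.probe()}
	c.mu.Lock()
	c.last = states
	c.mu.Unlock()
	return states
}

// LastHealthStates returns the cached states, or a single healthy state
// if no check has been performed yet.
func (c *cachingComponent) LastHealthStates() []healthState {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.last == nil {
		return []healthState{{Health: "Healthy", Reason: "no data yet"}}
	}
	return c.last
}

func main() {
	c := &cachingComponent{probe: func() healthState {
		return healthState{Health: "Unhealthy", Reason: "probe failed"}
	}}
	fmt.Println(c.LastHealthStates()) // before any check
	c.Check()
	fmt.Println(c.LastHealthStates()) // after one check
}
```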

type Deregisterable added in v0.5.0

type Deregisterable interface {
	// CanDeregister returns true if the custom plugin can be deregistered.
	CanDeregister() bool
}

Deregisterable is an interface that allows a custom plugin to be deregistered. By default, regular/built-in components cannot be deregistered unless they implement this interface.
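
A guard based on this rule might look like the sketch below. The component and deregisterable interfaces are trimmed stand-ins, builtin and plugin are invented types, and canDeregister is a hypothetical helper; how the package actually enforces the rule is not shown in this documentation.

```go
package main

import "fmt"

// component is a trimmed stand-in for the Component interface.
type component interface {
	Name() string
}

// deregisterable mirrors the Deregisterable interface.
type deregisterable interface {
	CanDeregister() bool
}

// builtin does not implement deregisterable.
type builtin struct{}

func (builtin) Name() string { return "cpu" }

// plugin is a hypothetical custom plugin that opts in to deregistration.
type plugin struct{}

func (plugin) Name() string        { return "my-plugin" }
func (plugin) CanDeregister() bool { return true }

// canDeregister reports whether c may be removed: it must implement
// deregisterable and return true from CanDeregister.
func canDeregister(c component) bool {
	d, ok := c.(deregisterable)
	return ok && d.CanDeregister()
}

func main() {
	for _, c := range []component{builtin{}, plugin{}} {
		fmt.Printf("%s: deregisterable=%v\n", c.Name(), canDeregister(c))
	}
}
```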

type GPUdInstance added in v0.5.0

type GPUdInstance struct {
	RootCtx context.Context

	// MachineID is either the machine ID assigned from the control plane
	// or the unique UUID of the machine.
	// For example, it is used to identify itself for the NFS checker.
	MachineID string

	KernelModulesToCheck []string

	NVMLInstance         nvidianvml.Instance
	NVIDIAToolOverwrites nvidiacommon.ToolOverwrites

	DBRW *sql.DB
	DBRO *sql.DB

	EventStore       eventstore.Store
	RebootEventStore pkghost.RebootEventStore

	MountPoints  []string
	MountTargets []string
}

GPUdInstance bundles the GPUd dependencies that an InitFunc receives when initializing a component.

type HealthSettable added in v0.5.0

type HealthSettable interface {
	// SetHealthy sets the health state to healthy.
	SetHealthy() error
}

HealthSettable is an optional interface that can be implemented by components to allow setting the health state.

type InitFunc added in v0.5.0

type InitFunc func(*GPUdInstance) (Component, error)

InitFunc is the function that initializes a component.

type Registry added in v0.5.0

type Registry interface {
	// MustRegister registers a component with the given initialization function.
	// It panics if the component is already registered.
	// It panics if the initialization function returns an error.
	MustRegister(initFunc InitFunc)

	// Register registers a component with the given initialization function.
	// It returns an error if the component is already registered.
	// It returns an error if the initialization function returns an error.
	Register(initFunc InitFunc) (Component, error)

	// All returns all registered components.
	All() []Component

	// Get returns a component by name.
	// It returns nil if the component is not registered.
	Get(name string) Component

	// Deregister deregisters a component by name, and returns the
	// underlying component if it is registered.
	// It returns nil if the component is not registered.
	// Meaning, it is safe to call it multiple times,
	// and it is also safe to call it with a non-registered name.
	Deregister(name string) Component
}

Registry is the interface for the registry of components.

func NewRegistry added in v0.5.0

func NewRegistry(gpudInstance *GPUdInstance) Registry

NewRegistry creates a new registry.

Directories

Path Synopsis
Package accelerator contains the accelerator components and its query interface.
nvidia
Package nvidia contains the NVIDIA accelerator components and its query interface.
nvidia/bad-envs
Package badenvs tracks any bad environment variables that are globally set for the NVIDIA GPUs.
nvidia/clock-speed
Package clockspeed tracks the NVIDIA per-GPU clock speed.
nvidia/ecc
Package ecc tracks the NVIDIA per-GPU ECC errors and other ECC related information.
nvidia/fabric-manager
Package fabricmanager tracks the NVIDIA fabric manager version and its activeness.
nvidia/gpm
Package gpm tracks the NVIDIA per-GPU GPM metrics.
nvidia/gpu-counts
Package gpucounts monitors the GPU count of the system.
nvidia/gsp-firmware-mode
Package gspfirmwaremode tracks the NVIDIA GSP firmware mode.
nvidia/hw-slowdown
Package hwslowdown monitors NVIDIA GPU hardware clock events of all GPUs, such as HW Slowdown events.
nvidia/infiniband
Package infiniband monitors the infiniband status of the system.
nvidia/memory
Package memory tracks the NVIDIA per-GPU memory usage.
nvidia/nccl
Package nccl monitors the NCCL status.
nvidia/nvlink
Package nvlink monitors the NVIDIA per-GPU nvlink devices.
nvidia/peermem
Package peermem monitors the peermem module status.
nvidia/persistence-mode
Package persistencemode tracks the NVIDIA persistence mode.
nvidia/power
Package power tracks the NVIDIA per-GPU power usage.
nvidia/processes
Package processes tracks the NVIDIA per-GPU processes.
nvidia/remapped-rows
Package remappedrows tracks the NVIDIA per-GPU remapped rows.
nvidia/sxid
Package sxid tracks the NVIDIA GPU SXid errors scanning the kmsg.
nvidia/temperature
Package temperature tracks the NVIDIA per-GPU temperatures.
nvidia/utilization
Package utilization tracks the NVIDIA per-GPU utilization.
nvidia/xid
Package xid tracks the NVIDIA GPU Xid errors scanning the kmsg. See Xid messages: https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html#xid-messages
Package all contains all the components.
Package containerd tracks the current containerd status.
Package cpu tracks the combined usage of all CPUs (not per-CPU).
Package disk tracks the disk usage of all the mount points specified in the configuration.
Package docker tracks the current docker status.
Package fuse monitors the FUSE (Filesystem in Userspace).
Package kernelmodule provides a component that checks the kernel modules in Linux.
Package kubelet tracks the current kubelet status.
Package library provides a component that returns healthy if and only if all the specified libraries exist.
Package memory tracks the memory usage of the host.
network
latency
Package latency tracks the global network connectivity statistics.
Package nfs writes to and reads from the specified NFS mount points.
Package os queries the host OS information (e.g., kernel version).
Package pci tracks the PCI devices and their Access Control Services (ACS) status.
Package tailscale tracks the current tailscale status.
