gpud

Version: v0.10.1
Published: Feb 6, 2026 License: Apache-2.0

README

Overview

GPUd is designed to ensure GPU efficiency and reliability by actively monitoring GPUs and effectively managing AI/ML workloads.

Why GPUd

GPUd is built on years of experience operating large-scale GPU clusters at Meta, Alibaba Cloud, Uber, and Lepton AI. It is carefully designed to be self-contained and to integrate seamlessly with other systems such as Docker, containerd, Kubernetes, and NVIDIA ecosystems.

  • First-class GPU support: GPUd is GPU-centric, providing a unified view of critical GPU metrics and issues.
  • Easy to run at scale: GPUd is a self-contained binary that runs on any machine with a low footprint.
  • Production grade: GPUd is used in DGX Cloud Lepton's production infrastructure.

Most importantly, GPUd operates with minimal CPU and memory overhead in a non-critical path and requires only read-only operations. See architecture for more details.

Get Started

The fastest way to see gpud in action is to watch our 40-second demo video below. For more detailed guides, see our Tutorials page.

[Demo video: gpud-2025-06-01-01-install-and-scan]

Installation

To install the official release on a Linux amd64 (x86_64) machine:

curl -fsSL https://pkg.gpud.dev/install.sh | sh

To specify a version:

curl -fsSL https://pkg.gpud.dev/install.sh | sh -s v0.10.0

The install script also supports other architectures (e.g., arm64) and operating systems (e.g., macOS).
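Before piping the installer into a shell, it can help to confirm which platform the host reports. The sketch below maps `uname` output to the architecture names used above (amd64, arm64). The `normalize_arch` helper is hypothetical and is not part of GPUd or its install script.

```shell
#!/bin/sh
# Hypothetical helper: map `uname -m` output to the architecture names used in this README.
normalize_arch() {
  case "$1" in
    x86_64|amd64) echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    *) echo "unknown" ;;
  esac
}

# `uname -s` prints "Linux" on Linux and "Darwin" on macOS; lowercase it for display.
os="$(uname -s | tr '[:upper:]' '[:lower:]')"
arch="$(normalize_arch "$(uname -m)")"
echo "detected platform: ${os}/${arch}"
```

If the result is `unknown`, the host architecture is outside the ones named in this section, and the install script may not have a matching release.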


Run GPUd on a Host

This section covers running gpud directly on a host machine.

Resource Requirements (for Lepton Platform)

If you plan to join the Lepton platform (using the --token flag), your node must meet these minimum requirements:

Minimum:

  • 3 CPU cores (2-core instances will fail to join because kubelet and system pods require at least 3 cores)
  • 4 GiB memory

Recommended:

  • 4+ CPU cores (e.g., AWS c6a.xlarge)
  • 8+ GiB memory

Why these requirements: GPUd periodically reads system files from /sys/class/infiniband/, /proc/, and other paths to collect telemetry data. On nodes with less than 4 GiB memory, the Linux page cache cannot retain these files between polling cycles, causing every read to hit the disk and resulting in excessive I/O (measured at 5+ MB/s on 2 GiB nodes vs. 0 MB/s on larger nodes). The 4 GiB minimum ensures sufficient page cache for GPUd to operate as a lightweight daemon without causing disk I/O pressure.

For complete hardware, software, and network requirements, see the official NVIDIA DGX Cloud Lepton BYOC Requirements.

Note: These requirements apply only when joining the Lepton platform; standalone gpud operation has lower requirements.
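As a rough pre-flight check, the minimums above can be compared against what the host reports. This is a minimal sketch for Linux only. The `meets_minimum` helper and the variable names are hypothetical, and the check is not something GPUd itself performs as far as this README states.

```shell
#!/bin/sh
# Minimums taken from the list above: 3 CPU cores and 4 GiB of memory.
MIN_CORES=3
MIN_MEM_KIB=$((4 * 1024 * 1024))

# Hypothetical helper: succeeds when the given core count ($1) and
# memory in KiB ($2) both meet the minimums.
meets_minimum() {
  [ "$1" -ge "$MIN_CORES" ] && [ "$2" -ge "$MIN_MEM_KIB" ]
}

# Linux-only probes: nproc (coreutils) and MemTotal from /proc/meminfo (in KiB).
# Both fall back to 0 when the probe is unavailable.
cores="$(nproc 2>/dev/null || echo 0)"
mem_kib="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)"
mem_kib="${mem_kib:-0}"

if meets_minimum "$cores" "$mem_kib"; then
  echo "ok: ${cores} cores, ${mem_kib} KiB memory"
else
  echo "below minimum: ${cores} cores, ${mem_kib} KiB memory"
fi
```

MemTotal excludes memory reserved by the kernel and firmware, so an instance sold as 4 GiB may report slightly under the threshold. Treat the output as a hint rather than a definitive verdict.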

Start the service:

sudo gpud up [--token <DGXC_LEPTON_AI_TOKEN>]

Note: The optional --token connects gpud to the Lepton Platform. You can get a token from the Settings > Tokens page on your dashboard.

gpud up \
--token <DGXC_LEPTON_AI_TOKEN> \
--node-group <DGXC_LEPTON_NODE_GROUP>

Stop the service:

sudo gpud down

Uninstall:

sudo rm /usr/local/bin/gpud
sudo rm /etc/systemd/system/gpud.service

Without systemd (e.g., macOS)

Run in the foreground:

gpud run [--token <LEPTON_AI_TOKEN>]

Run in the background:

nohup sudo /usr/local/bin/gpud run [--token <LEPTON_AI_TOKEN>] &>> <your_log_file_path> &
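The `&>>` redirection in the command above is a bash and zsh extension and is not part of POSIX sh. The sketch below shows the portable spelling, `>> file 2>&1`, wrapped in a hypothetical `run_in_background` helper. The log path in the example comment is an assumption for illustration only.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command in the background, appending both
# stdout and stderr to a log file, and print the PID of the background job.
run_in_background() {
  log_file="$1"
  shift
  # `>> file 2>&1` is the POSIX equivalent of bash's `&>> file`.
  nohup "$@" >> "$log_file" 2>&1 &
  echo "$!"
}

# Example (not executed here; the log path is an assumption):
#   run_in_background /var/log/gpud.log sudo /usr/local/bin/gpud run
```

When the command is launched through `sudo`, the printed PID belongs to the `sudo` process rather than to `gpud` itself.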

Uninstall:

sudo rm /usr/local/bin/gpud

Run GPUd with Kubernetes

The recommended way to deploy GPUd on Kubernetes is with our official Helm chart.

Build with Docker

A Dockerfile is provided to build a container image from source. For complete instructions, please see our Docker guide in CONTRIBUTING.md.


Key Features

  • Monitors critical GPU and GPU fabric metrics (power, temperature).
  • Reports GPU and GPU fabric status (nvidia-smi parser, error checking).
  • Detects critical GPU and GPU fabric errors (kmsg, hardware slowdown, NVML Xid event, DCGM).
  • Monitors overall system metrics (CPU, memory, disk).

Check out components for a detailed list of components and their features.

Integration

For users looking to set up a platform to collect and process data from gpud, please refer to INTEGRATION.

FAQs

Does GPUd send data to lepton.ai?

GPUd collects a small anonymous usage signal by default to help the engineering team better understand usage frequencies. The data is strictly anonymized and does not contain any sensitive data. You can disable this behavior by setting GPUD_NO_USAGE_STATS=true. If GPUd is run with systemd (default option for the gpud up command), you can add the line GPUD_NO_USAGE_STATS=true to the /etc/default/gpud environment file and restart the service.
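One way to apply the opt-out on a systemd host without duplicating the line on repeated runs is sketched below. The `ensure_line` helper is hypothetical. The service name `gpud` is inferred from the `gpud.service` unit file mentioned in the uninstall steps.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if that exact line
# is not already present, so repeated runs leave a single copy.
ensure_line() {
  file="$1"
  line="$2"
  touch "$file"
  # -x matches whole lines, -F treats the pattern as a fixed string.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (requires root; not executed here):
#   ensure_line /etc/default/gpud 'GPUD_NO_USAGE_STATS=true'
#   systemctl restart gpud
```

The helper assumes the file ends with a newline. If it does not, the appended line would be joined to the previous one.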

If you opt in to log in to the Lepton AI platform, GPUd periodically sends system runtime information about the host to the platform so that it can report more helpful GPU health states. This information covers system workload and health only and contains no user data. The data is sent over secure channels.

How to update GPUd?

GPUd is still in active development, regularly releasing new versions for critical bug fixes and new features. We strongly recommend always being on the latest version of GPUd.

When GPUd is registered with the Lepton platform, the platform automatically updates GPUd to the latest version. To disable auto-updates when GPUd runs under systemd (the default for the gpud up command), add the line FLAGS="--enable-auto-update=false" to the /etc/default/gpud environment file and restart the service.
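If the environment file already contains a FLAGS line, appending a second one could shadow the first. The sketch below merges a flag into an existing FLAGS value instead. The `add_flag` helper is hypothetical, and it assumes the FLAGS value is wrapped in double quotes as shown above.

```shell
#!/bin/sh
# Hypothetical helper: add a flag to the FLAGS="..." line of an environment
# file, creating the line when it does not exist yet.
add_flag() {
  file="$1"
  flag="$2"
  if grep -q '^FLAGS=' "$file" 2>/dev/null; then
    # Nothing to do when the FLAGS line already carries the flag.
    if grep '^FLAGS=' "$file" | grep -qF -- "$flag"; then
      return 0
    fi
    # Insert the flag before the closing quote; sed keeps a .bak backup copy.
    sed -i.bak "s|^FLAGS=\"\(.*\)\"|FLAGS=\"\1 $flag\"|" "$file"
  else
    printf 'FLAGS="%s"\n' "$flag" >> "$file"
  fi
}

# Example (requires root; not executed here):
#   add_flag /etc/default/gpud '--enable-auto-update=false'
#   systemctl restart gpud
```

An unquoted FLAGS value is left unchanged by the `sed` expression, so check the file after running the helper.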

Learn more

Contributing

Please see CONTRIBUTING.md for guidelines on how to contribute to this project.

Directories

Path Synopsis
api
v1
client
v1
Package v1 provides the gpud v1 client for the server.
cmd
common
Package common provides common utilities for the gpud command.
gpud command
gpud/inject-fault
Package injectfault provides a command to inject faults into the system.
gpud/run
Package run implements the "run" command.
gpud/up
Package up implements the "up" command.
swagger command
accelerator
Package accelerator contains the accelerator components and its query interface.
accelerator/nvidia
Package nvidia contains the NVIDIA accelerator components and its query interface.
accelerator/nvidia/clock-speed
Package clockspeed tracks the NVIDIA per-GPU clock speed.
accelerator/nvidia/ecc
Package ecc tracks the NVIDIA per-GPU ECC errors and other ECC related information.
accelerator/nvidia/fabric-manager
Package fabricmanager tracks NVIDIA fabric manager and fabric health monitoring services.
accelerator/nvidia/gpm
Package gpm tracks the NVIDIA per-GPU GPM metrics.
accelerator/nvidia/gpu-counts
Package gpucounts monitors the GPU count of the system.
accelerator/nvidia/hw-slowdown
Package hwslowdown monitors NVIDIA GPU hardware clock events of all GPUs, such as HW Slowdown events.
accelerator/nvidia/infiniband
Package infiniband monitors the infiniband status of the system.
accelerator/nvidia/infiniband/class
Package class implements the infiniband class sysfs interface.
accelerator/nvidia/infiniband/store
Package store stores infiniband states in time-series.
accelerator/nvidia/infiniband/types
Package types contains shared types for the infiniband package to avoid import cycles.
accelerator/nvidia/memory
Package memory tracks the NVIDIA per-GPU memory usage.
accelerator/nvidia/nccl
Package nccl monitors the NCCL status.
accelerator/nvidia/nvlink
Package nvlink monitors the NVIDIA per-GPU nvlink devices.
accelerator/nvidia/peermem
Package peermem monitors the peermem module status.
accelerator/nvidia/persistence-mode
Package persistencemode tracks the NVIDIA persistence mode.
accelerator/nvidia/power
Package power tracks the NVIDIA per-GPU power usage.
accelerator/nvidia/processes
Package processes tracks the NVIDIA per-GPU processes.
accelerator/nvidia/remapped-rows
Package remappedrows tracks the NVIDIA per-GPU remapped rows.
accelerator/nvidia/sxid
Package sxid tracks the NVIDIA GPU SXid errors scanning the kmsg.
accelerator/nvidia/temperature
Package temperature tracks the NVIDIA per-GPU temperatures.
accelerator/nvidia/utilization
Package utilization tracks the NVIDIA per-GPU utilization.
accelerator/nvidia/xid
Package xid tracks the NVIDIA GPU Xid errors scanning the kmsg. See Xid messages: https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html#xid-messages
all
Package all contains all the components.
containerd
Package containerd tracks the current containerd status.
cpu
Package cpu tracks the combined usage of all CPUs (not per-CPU).
disk
Package disk tracks the disk usage of all the mount points specified in the configuration.
docker
Package docker tracks the current docker status.
fuse
Package fuse monitors the FUSE (Filesystem in Userspace).
kernel-module
Package kernelmodule provides a component that checks the kernel modules in Linux.
kubelet
Package kubelet tracks the current kubelet status.
library
Package library provides a component that returns healthy if and only if all the specified libraries exist.
memory
Package memory tracks the memory usage of the host.
network/latency
Package latency tracks the global network connectivity statistics.
nfs
Package nfs writes to and reads from the specified NFS mount points.
os
Package os queries the host OS information (e.g., kernel version).
pci
Package pci tracks the PCI devices and their Access Control Services (ACS) status.
tailscale
Package tailscale tracks the current tailscale status.
docs
e2e
pkg
Package pkg contains a set of generic Go packages that are useful to gpud and possibly to other projects.
asn
config
Package config provides the gpud configuration data for the server.
custom-plugins
Package customplugins provides a way to register and run custom plugins.
disk
Package disk provides utilities for disk operations.
errdefs
Package errdefs provides common error definitions for gpud.
fault-injector
Package faultinjector provides a way to inject failures into the system.
file
Package file implements file utils.
fuse
Package fuse provides a client for the FUSE (Filesystem in Userspace) protocol.
gpud-manager/systemd
Package systemd provides the systemd artifacts and variables for the gpud server.
host
Package host provides the host information.
httputil
Package httputil provides utilities for HTTP requests.
kmsg/writer
Package writer implements the kmsg writer.
log
Package log provides the logging functionality for gpud.
login
Package login provides login functionality for GPUd.
machine-info
Package machineinfo provides information about the machine.
memory
Package memory provides utilities for memory usage.
metadata
Package metadata provides the persistent storage layer for GPUd metadata.
metrics/recorder
Package recorder records internal GPUd metrics to Prometheus.
metrics/scraper
Package scraper scrapes internal GPUd metrics from Prometheus.
metrics/store
Package store provides the persistent storage layer for the metrics.
metrics/syncer
Package syncer provides a syncer for the metrics.
netutil
Package netutil provides utility functions for network operations.
netutil/latency
Package latency contains logic for egress traffic from each device.
netutil/latency/edge
Package edge provides a client for the Tailscale DERP (Designated Edge Router Protocol) service.
netutil/latency/edge/derpmap
Package derpmap provides the tailscale derp map implementation.
netutil/latency/edge/derpmap/sync command
"sync" syncs the tailscale derp map.
nfs-checker
Package nfschecker checks the health of the NFS mount points.
nvidia-query/nvml
Package nvml implements the NVIDIA Management Library (NVML) interface.
nvidia-query/nvml/device
Package device provides a wrapper around the "github.com/NVIDIA/go-nvlib/pkg/nvlib/device".Device type that adds a PCIBusID and UUID method, with support for test failure injection.
nvidia-query/nvml/lib
Package lib implements the NVIDIA Management Library (NVML) interface.
osutil
Package osutil provides utilities for the operating system.
pci
process
Package process provides the process runner implementation on the host.
providers
Package providers contains machine/cloud providers.
providers/all
Package all provides a list of known providers.
providers/aws
Package aws implements "AWS" provider and helpers.
providers/aws/imds
Package imds provides functions for interacting with the AWS Instance Metadata Service.
providers/azure
Package azure implements "azure" provider and helpers.
providers/azure/imds
Package imds provides functions for interacting with the Azure Instance Metadata Service.
providers/gcp
Package gcp implements Google Cloud Platform (GCP) provider and helpers.
providers/gcp/imds
Package imds provides functions for interacting with the Google Cloud Platform Instance Metadata Service.
pstore
Package pstore provides operations for Linux pstore, mainly to read the pstore log on reboot.
release
Package release provides utilities for releasing new versions of gpud.
release/distsign
Package distsign implements signature and validation of arbitrary distributable files.
session/states
Package states provides tracking of login success and failure events as well as the state of ongoing session loops (token expiration, etc.).
sqlite
Package sqlite provides a SQLite3 database utils.
systemd
Package systemd provides the common systemd helper functions.
update
Package update provides the update functionality for the server.
uptime
Package uptime provides utilities for uptime.
Package version provides the version information for the gpud server.
