fleet-intelligence-agent

module
Version: v1.0.0-rc.2
Published: Mar 5, 2026 License: Apache-2.0

README

NVIDIA Fleet Intelligence Agent

Lightweight Fleet Intelligence monitoring and reporting agent for NVIDIA GPU infrastructure, built on top of leptonai/gpud.

Overview

Prerequisites:

  • NVIDIA DCGM (Data Center GPU Manager) - automatically installed from NVIDIA CUDA repositories
  • See DEB Installation or RPM Installation for CUDA repository setup instructions

What It Monitors:

  • GPU Metrics: Power, temperature, clocks, utilization, memory, Xid events
  • System Metrics: CPU, memory, disk, network usage
  • Infrastructure: NVIDIA drivers, CUDA runtime, InfiniBand, containers

Export Formats:

  • HTTP API Server: Serves data via REST endpoints (JSON) and Prometheus metrics (/metrics)
  • File Export (Offline Mode): Writes data to local files in CSV or JSON format
  • Remote Export: Sends telemetry data to OpenTelemetry-compatible endpoints via OTLP over HTTP
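As an illustration of consuming the Prometheus output, the sketch below parses text in the Prometheus exposition format, which is what a /metrics endpoint serves. It is a minimal example under stated assumptions, not the agent's own code: the metric names in the sample are hypothetical, and in practice the text would come from an HTTP GET against the agent's /metrics endpoint, whose address is not specified here.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMetrics reads Prometheus text exposition format and returns sample
// values keyed by the series string ("name" or "name{labels}").
// Comment lines (# HELP, # TYPE) and blank lines are skipped, and an
// optional trailing timestamp is ignored.
func parseMetrics(text string) (map[string]float64, error) {
	samples := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The series key ends at the closing brace when labels are present,
		// otherwise just before the first whitespace character.
		end := strings.LastIndex(line, "}")
		if end < 0 {
			end = strings.IndexAny(line, " \t") - 1
		}
		if end < 0 || end+1 >= len(line) {
			return nil, fmt.Errorf("malformed sample line: %q", line)
		}
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			return nil, fmt.Errorf("missing value: %q", line)
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		samples[line[:end+1]] = v
	}
	return samples, sc.Err()
}

func main() {
	// Hypothetical sample payload; the real metric names exposed by the
	// agent are not listed in this README.
	payload := "# HELP example_gpu_temperature_celsius Example metric\n" +
		"# TYPE example_gpu_temperature_celsius gauge\n" +
		"example_gpu_temperature_celsius{gpu=\"0\"} 61.5\n" +
		"example_cpu_usage_percent 0.7\n"

	samples, err := parseMetrics(payload)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(samples["example_gpu_temperature_celsius{gpu=\"0\"}"])
	fmt.Println(samples["example_cpu_usage_percent"])
}
```

For production use, a maintained parser such as the Prometheus client libraries is likely a better choice than hand-rolled parsing; the sketch only shows the shape of the data.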

Key Features:

  • Lightweight: <100MB RAM, <1% CPU usage
  • Non-intrusive: Read-only operations, no system modifications
  • Production-ready: 24/7 datacenter operation

Supported Platforms

OS Family      Supported Versions   Architecture
Ubuntu         22.04, 24.04         x86_64, ARM64
RHEL           8, 9, 10             x86_64, ARM64
Rocky Linux    8, 9, 10             x86_64, ARM64
AlmaLinux      8, 9, 10             x86_64, ARM64
Amazon Linux   2023                 x86_64, ARM64
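
The support matrix above can be encoded as data for a pre-install check, for example in a provisioning script. The following is a hedged sketch: the `isSupported` helper and the `supportedVersions` map are hypothetical names introduced for illustration and are not part of the agent.

```go
package main

import "fmt"

// supportedVersions mirrors the Supported Platforms table in this README.
// Hypothetical helper data, not part of the agent itself.
var supportedVersions = map[string][]string{
	"Ubuntu":       {"22.04", "24.04"},
	"RHEL":         {"8", "9", "10"},
	"Rocky Linux":  {"8", "9", "10"},
	"AlmaLinux":    {"8", "9", "10"},
	"Amazon Linux": {"2023"},
}

// supportedArchs lists the architectures the table gives for every OS family.
var supportedArchs = []string{"x86_64", "ARM64"}

// contains reports whether want appears in list.
func contains(list []string, want string) bool {
	for _, s := range list {
		if s == want {
			return true
		}
	}
	return false
}

// isSupported reports whether the OS family, version, and architecture
// combination appears in the table.
func isSupported(family, version, arch string) bool {
	versions, ok := supportedVersions[family]
	if !ok {
		return false
	}
	return contains(versions, version) && contains(supportedArchs, arch)
}

func main() {
	fmt.Println(isSupported("Ubuntu", "24.04", "ARM64"))
	fmt.Println(isSupported("Ubuntu", "20.04", "x86_64"))
}
```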

Documentation

  • Helm Installation - Kubernetes (Helm) installation and troubleshooting
  • DEB Installation - Ubuntu package install, update, and uninstall
  • RPM Installation - RHEL/Rocky/Alma/Amazon package install, update, and uninstall
  • Architecture - Bare metal and Kubernetes architecture, dependencies, and runtime flow
  • Usage - Commands, HTTP API, integration, and troubleshooting
  • Configuration - Environment variables and service configuration
  • Development - Building from source and contributing

Contributing

See CONTRIBUTING.md for development setup and guidelines.

Related: leptonai/gpud (upstream dependency)

License

Apache License 2.0 - see LICENSE for details.

Directories

Path
  Synopsis

cmd
fleetint (command)
internal
attestation
  Package attestation provides functionality for GPU attestation.
enrollment
  Package enrollment provides shared enrollment functionality for the Fleet Intelligence agent.
exporter
  Package healthexporter provides functionality to export health data from local SQLite to a global health endpoint for centralized monitoring and long-term storage using OTLP format.
exporter/collector
  Package collector handles health data collection from various sources.
exporter/converter
  Package converter handles conversion of health data to different formats.
exporter/writer
  Package writer handles writing health data to various outputs.
machineinfo
  Package machineinfo provides a shim layer over gpud's machine-info package to customize version information for Fleet Intelligence.
registry
  Package registry provides component registration and management for fleetint, allowing fine-grained control over which components are enabled.
scan
  Package scan provides system scanning functionality for Fleet Intelligence monitoring.
server
  Package healthserver provides a simplified HTTP server for Fleet Intelligence metrics export.
