Nvidia GPU monitoring with Netdata
Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.)
using the nvidia-smi CLI tool.
Warning: under development, collects fewer metrics than the Python version.
Metrics
All metrics have the "nvidia_smi." prefix.
Labels per scope:
| Metric | Scope | Dimensions | Units | XML | CSV |
|---|---|---|---|---|---|
| gpu_pcie_bandwidth_usage | gpu | rx, tx | B/s | yes | no |
| gpu_fan_speed_perc | gpu | fan_speed | % | yes | yes |
| gpu_utilization | gpu | gpu | % | yes | yes |
| gpu_memory_utilization | gpu | memory | % | yes | yes |
| gpu_decoder_utilization | gpu | decoder | % | yes | no |
| gpu_encoder_utilization | gpu | encoder | % | yes | no |
| gpu_frame_buffer_memory_usage | gpu | free, used, reserved | B | yes | yes |
| gpu_bar1_memory_usage | gpu | free, used | B | yes | no |
| gpu_temperature | gpu | temperature | Celsius | yes | yes |
| gpu_voltage | gpu | voltage | V | yes | no |
| gpu_clock_freq | gpu | graphics, video, sm, mem | MHz | yes | yes |
| gpu_power_draw | gpu | power_draw | Watts | yes | yes |
| gpu_performance_state | gpu | P0-P15 | state | yes | yes |
Configuration
This module supports data collection in CSV and XML formats. The default is CSV.
- XML provides more metrics, but requesting GPU information consumes more CPU, especially if there are multiple GPU
cards in the system.
- CSV provides fewer metrics, but is much lighter than XML in terms of CPU usage.
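To see what the two formats look like, you can call nvidia-smi directly. The sketch below wraps both invocations in a hypothetical helper, `gpu_query`. The CSV field list is only an illustrative assumption and is not necessarily the query the collector issues.

```shell
# Sketch: print GPU data in the requested format ("csv" or "xml").
# gpu_query is a hypothetical helper name; the CSV field list is an
# illustrative assumption, not the collector's actual query.
gpu_query() {
  case "$1" in
    csv) set -- --query-gpu=name,temperature.gpu,utilization.gpu,memory.used \
                --format=csv,noheader,nounits ;;
    xml) set -- -q -x ;;
    *)   echo "usage: gpu_query csv|xml" >&2; return 2 ;;
  esac
  # Fail with a clear message if the NVIDIA driver tools are not installed.
  command -v nvidia-smi >/dev/null 2>&1 || {
    echo "nvidia-smi not found in PATH" >&2
    return 127
  }
  nvidia-smi "$@"
}
```

Running `time gpu_query xml > /dev/null` and `time gpu_query csv > /dev/null` on a machine with an NVIDIA GPU gives a rough feel for the CPU cost of each format.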
The format can be changed in the configuration file.
Edit the go.d/nvidia_smi.conf configuration file using edit-config from the
Netdata config directory, which is typically at /etc/netdata.
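A minimal sketch of that step, assuming the config directory is /etc/netdata unless another path is passed in. The function name `edit_nvidia_smi_conf` is hypothetical.

```shell
# Sketch: open go.d/nvidia_smi.conf with edit-config.
# edit_nvidia_smi_conf is a hypothetical helper; pass your Netdata config
# directory as the first argument if it is not /etc/netdata.
edit_nvidia_smi_conf() {
  config_dir="${1:-/etc/netdata}"
  if [ ! -x "$config_dir/edit-config" ]; then
    echo "edit-config not found in $config_dir" >&2
    return 1
  fi
  # Run in a subshell so the caller's working directory is unchanged.
  ( cd "$config_dir" && sudo ./edit-config go.d/nvidia_smi.conf )
}
```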
jobs:
  - name: nvidia_smi
    use_csv_format: no # set to 'no' to use the XML format.
Troubleshooting
To troubleshoot issues with the nvidia_smi collector, run the go.d.plugin with the debug option enabled. The
output should give you clues as to why the collector isn't working.
- Navigate to the plugins.d directory, usually at /usr/libexec/netdata/plugins.d/. If that's not the case on
  your system, open netdata.conf and look for the plugins setting under [directories].
  cd /usr/libexec/netdata/plugins.d/
- Switch to the netdata user.
  sudo -u netdata -s
- Run the go.d.plugin to debug the collector:
  ./go.d.plugin -d -m nvidia_smi
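If the plugins.d directory is not at the usual path, the plugins setting can be pulled out of netdata.conf with a short awk filter. This is a sketch only: `plugins_dir_setting` is a hypothetical helper, and it assumes the setting appears as a `plugins = ...` line (possibly commented out with `#`) below an unindented `[directories]` header.

```shell
# Sketch: print the value of the "plugins" setting from the [directories]
# section of a netdata.conf file. plugins_dir_setting is a hypothetical
# helper; the file layout it expects is an assumption.
plugins_dir_setting() {
  awk '
    # Track whether the current section is [directories].
    /^\[/ { in_dirs = ($0 ~ /^\[directories\]/) }
    # Match "plugins = ..." with an optional leading "#".
    in_dirs && /^[ \t]*#?[ \t]*plugins[ \t]*=/ {
      sub(/^[^=]*=[ \t]*/, "")   # drop everything up to and including "="
      print
      exit
    }
  ' "$1"
}
```

For example, `plugins_dir_setting /etc/netdata/netdata.conf` prints the configured plugin directories, if the file contains that setting.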