# Nvidia GPU monitoring with Netdata
Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.)
using the `nvidia-smi` CLI tool.

> **Warning**: under development, collects fewer metrics than the Python version.
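Because the collector relies on the `nvidia-smi` CLI, the tool must be installed and reachable on the Netdata host. A minimal sanity check from a shell (the flags below are only an illustration; the collector's exact invocation may differ):

```bash
# Confirm nvidia-smi is on the PATH and can talk to the driver.
which nvidia-smi

# Dump the full device report as XML (assumption: the collector parses similar query output).
nvidia-smi -q -x | head -n 20
```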
## Metrics

All metrics have the "nvidia_smi." prefix.
Labels per scope:
- gpu: product_name, product_brand.
| Metric                         | Scope | Dimensions               | Units   |
|--------------------------------|:-----:|--------------------------|---------|
| gpu_pcie_bandwidth_usage       | gpu   | rx, tx                   | B/s     |
| gpu_fan_speed_perc             | gpu   | fan_speed                | %       |
| gpu_utilization                | gpu   | gpu                      | %       |
| gpu_memory_utilization         | gpu   | memory                   | %       |
| gpu_decoder_utilization        | gpu   | decoder                  | %       |
| gpu_encoder_utilization        | gpu   | encoder                  | %       |
| gpu_frame_buffer_memory_usage  | gpu   | free, used, reserved     | B       |
| gpu_bar1_memory_usage          | gpu   | free, used               | B       |
| gpu_temperature                | gpu   | temperature              | Celsius |
| gpu_clock_freq                 | gpu   | graphics, video, sm, mem | MHz     |
| gpu_power_draw                 | gpu   | power_draw               | Watts   |
| gpu_performance_state          | gpu   | P0-P15                   | state   |
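Once the collector is running, these charts show up on the dashboard under the `nvidia_smi.` prefix. As a rough sketch, you can also read them back through Netdata's REST API; the chart ID below (`nvidia_smi.gpu_utilization`) is an assumed example, so list the charts on your node first:

```bash
# List the chart IDs this Netdata node exposes and filter for the nvidia_smi prefix.
curl -s "http://localhost:19999/api/v1/charts" | grep -o '"nvidia_smi[^"]*"' | sort -u

# Fetch the last few samples for one chart (swap in a chart ID from the list above).
curl -s "http://localhost:19999/api/v1/data?chart=nvidia_smi.gpu_utilization&points=5"
```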
## Configuration

No configuration required.
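If you later want to change optional defaults such as the collection interval, go.d modules are normally adjusted through Netdata's `edit-config` helper; this is an optional sketch that assumes a default install layout and that a `go.d/nvidia_smi.conf` stub ships with your version:

```bash
# Open the optional per-module configuration (paths assume a standard Netdata install).
cd /etc/netdata
sudo ./edit-config go.d/nvidia_smi.conf
```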
## Troubleshooting

To troubleshoot issues with the `nvidia_smi` collector, run the `go.d.plugin` with the debug option enabled. The
output should give you clues as to why the collector isn't working.

- Navigate to the `plugins.d` directory, usually at `/usr/libexec/netdata/plugins.d/`. If that's not the case on
  your system, open `netdata.conf` and look for the `plugins` setting under `[directories]`.

  ```bash
  cd /usr/libexec/netdata/plugins.d/
  ```

- Switch to the `netdata` user.

  ```bash
  sudo -u netdata -s
  ```

- Run the `go.d.plugin` to debug the collector:

  ```bash
  ./go.d.plugin -d -m nvidia_smi
  ```