SSV - Monitoring
The /metrics endpoint exposes metrics from the SSV node to Prometheus.
Prometheus should also call the /health endpoint so that the health check metrics are collected.
Even if Prometheus is not configured, the /health endpoint can be polled by a simple HTTP client
(it does not return metrics itself).
See the configuration of a local Prometheus service.
Metrics
MetricsAPIPort enables Prometheus metrics collection:
Example:
MetricsAPIPort: 15000
Or as an environment variable:
METRICS_API_PORT=15000
| Metric | Description |
|---|---|
| go_* | Metrics by Prometheus |
| ssv:node_status | Health check status of operator node |
| ssv:eth1:node_status | Health check status of eth1 node |
| ssv:beacon:node_status | Health check status of beacon node |
| ssv:network:connected_peers{pubKey} | Count connected peers for a validator |
| ssv:network:ibft_decided_messages_outbound{topic} | Count IBFT decided messages outbound |
| ssv:network:ibft_messages_outbound{topic} | Count IBFT messages outbound |
| ssv:network:net_messages_inbound{topic} | Count incoming network messages |
| ssv:validator:ibft_highest_decided{lambda} | The highest decided sequence number |
| ssv:validator:ibft_round{lambda} | IBFT round |
| ssv:validator:ibft_stage{lambda} | IBFT stage |
| ssv:validator:ibft_current_slot{pubKey} | Current running slot |
| ssv:validator:running_ibfts_count{pubKey} | Count running IBFTs by validator public key |
| ssv:validator:running_ibfts_count_all | Count all running IBFTs |
Grafana
To set up a Grafana dashboard, do the following:
- Enable metrics (MetricsAPIPort)
- Set up Prometheus as described at the beginning of this document and add it as a data source
  - The job name is assumed to be 'ssv'
- Import the SSV Operator dashboard into Grafana
- Align the dashboard variables:
  - instance: container name, used in the 'instance' field for metrics coming from Prometheus.
    In the given dashboard, instance names are ssv-node-v2-<i>; make sure to change them according to your setup
Note: To show the Process Health panels, the following K8s metrics should be exposed:
- kubelet_volume_stats_used_bytes
- container_cpu_usage_seconds_total
- container_memory_working_set_bytes
Health Check
The health check route is available at GET /health.
If the node is healthy, it returns HTTP 200 with an empty response:
$ curl http://localhost:15000/health
If the node is unhealthy, it returns HTTP 500 with the corresponding errors:
$ curl http://localhost:15000/health
{"errors": ["could not sync eth1 events"]}
Profiling
Profiling can be enabled via config:
EnableProfile: true
All the default pprof routes are available via HTTP:
$ curl http://localhost:15000/debug/pprof/goroutine?minutes\=20 --output goroutines.tar.gz
Open it with the Go CLI:
$ go tool pprof goroutines.tar.gz
Or in the web UI:
$ go tool pprof -web goroutines.tar.gz
Alternatively, fetch the profile and visualize it in the web UI directly:
$ go tool pprof -web http://localhost:15000/debug/pprof/heap?minutes=5
