dnms

command module
v0.0.0-...-80b60c5 Latest
Published: May 11, 2026 License: MIT Imports: 23 Imported by: 0

README

Distributed Network Monitoring System

Objective

Black-box test the network from its edges/leaves.

How?

Effectively a distributed ping + traceroute across the whole infrastructure at some interval. This data is then aggregated into a central service that exposes it through an API.

Terms

Peer: another server on the network

NetworkGraph: graph of the entire network
Route: a set of links
NetworkNode: a router in the network, i.e. something that should respond to traceroute
Link: a specific connection between 2 NetworkNodes

Ping: a ping with a specific source port
PingGroup: a group of pings against a specific destination

Traceroute: a traceroute with a specific source port
Traceroute Group: a group of traceroutes against a specific destination
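To make the topology terms concrete, here is a minimal sketch of how they could be modelled in Go. Every type, field, and function name below is hypothetical and chosen only to mirror the terms above; the actual dnms types may look quite different.

```go
package main

import "fmt"

// NOTE: all names here are illustrative assumptions, not dnms's real API.

// Peer is another server on the network.
type Peer struct {
	Name string
	Addr string
}

// NetworkNode is a router that should respond to traceroute.
type NetworkNode struct {
	Addr string
}

// Link is a specific connection between two NetworkNodes,
// identified here by the two node addresses.
type Link struct {
	From, To string
}

// Route is an ordered set of links.
type Route struct {
	Links []Link
}

// NetworkGraph is the graph of the entire network.
type NetworkGraph struct {
	Nodes map[string]NetworkNode
	Links map[Link]struct{}
}

// NewNetworkGraph returns an empty graph with initialised maps.
func NewNetworkGraph() *NetworkGraph {
	return &NetworkGraph{
		Nodes: map[string]NetworkNode{},
		Links: map[Link]struct{}{},
	}
}

// RouteFromHops turns the hop addresses of one traceroute into a Route.
func RouteFromHops(hops []string) Route {
	var r Route
	for i := 0; i+1 < len(hops); i++ {
		r.Links = append(r.Links, Link{From: hops[i], To: hops[i+1]})
	}
	return r
}

// AddRoute records every node and link that a route traverses.
func (g *NetworkGraph) AddRoute(r Route) {
	for _, l := range r.Links {
		g.Nodes[l.From] = NetworkNode{Addr: l.From}
		g.Nodes[l.To] = NetworkNode{Addr: l.To}
		g.Links[l] = struct{}{}
	}
}

func main() {
	g := NewNetworkGraph()
	// Two traceroutes to the same destination that diverge at the middle hop.
	g.AddRoute(RouteFromHops([]string{"10.0.0.1", "10.0.1.1", "10.0.2.1"}))
	g.AddRoute(RouteFromHops([]string{"10.0.0.1", "10.0.1.2", "10.0.2.1"}))
	fmt.Println(len(g.Nodes), "nodes,", len(g.Links), "links")
}
```

Keying links by their endpoint addresses means two traceroutes that share a hop pair collapse onto the same Link, which is what later makes per-link attribution possible.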

How do those fit together?

Any given node knows about the peers in the network. It periodically pings and traceroutes those peers and keeps track of failures. In the event of a failure, we determine which links are at fault for the disruption.

Implementation

The goal here is to build a few layers, each of which is useful on its own.

Base parts:

  • Memberlist: our peers on the network to talk to
  • Mapper: responsible for mapping the network based on who is in the memberlist
  • Pinger: ping all peers in the network-- specifically to hit all routes in the mapper
  • Aggregator: aggregate all the graph info from the members of the memberlist

Running

go build -o dnms .
./dnms -gossipAddr <advertise-ip> -peer <seed-peer:33434>

Flags:

Flag                               Default   Notes
-gossipAddr                        local IP  Address advertised to the memberlist gossip layer.
-peer                              (none)    Seed host:port to join an existing cluster. Empty starts a standalone node.
-aggregator                        false     Run the aggregator HTTP API in addition to the mapper.
-httpAddr                          :12345    Bind address for the HTTP API (graph, routemap, events).
-pingPort                          33435     UDP port for the ping/ack transport. Must be the same on every cluster node.
-mapInterval                       1s        Pause between traceroutes inside a single source-port sweep.
-mapSrcPortStart / -mapSrcPortEnd  auto      Source-port range used to elicit ECMP variation. Defaults to pingPort+1 .. +11; main fatals at startup if the range overlaps -pingPort.
-pingPeerInterval                  100ms     Pause between successive peers in a ping sweep.
-pingRouteInterval                 1s        Pause between successive routes when pinging one peer.
-pingTimeout                       1s        Ack timeout for a single ping.
-metricRingSize                    100       Number of recent ping samples retained per route.
-httpAllowOrigin                   *         Value for Access-Control-Allow-Origin. Lock down by setting to a specific origin.
-httpToken                         (none)    If non-empty, requires Authorization: Bearer <token> on every HTTP API request.
-aggregatorToken                   (none)    Token the aggregator sends when subscribing to peers; defaults to -httpToken.

Note: gossip still uses UDP/TCP 33434 (memberlist), and -pingPort is a separate socket that dnms owns directly — no more piggy-backing pings on the memberlist receive loop.
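The interaction between -pingPort and the source-port range can be sketched as a small validation step. This is an illustration under stated assumptions, not the code in dnms's main: the function name srcPortRange is made up, a zero start and end are taken to mean "auto", and the range is treated as inclusive.

```go
package main

import (
	"errors"
	"fmt"
)

// srcPortRange resolves the traceroute source-port range.
// Assumptions (not confirmed by the README): start == 0 && end == 0 means
// "auto", and both bounds are inclusive.
func srcPortRange(pingPort, start, end int) (int, int, error) {
	if start == 0 && end == 0 {
		// Documented default: pingPort+1 .. pingPort+11.
		start, end = pingPort+1, pingPort+11
	}
	if start > end {
		return 0, 0, errors.New("source-port range start is after end")
	}
	// The ping socket must not fall inside the traceroute source-port range.
	if pingPort >= start && pingPort <= end {
		return 0, 0, fmt.Errorf("source-port range %d-%d overlaps ping port %d", start, end, pingPort)
	}
	return start, end, nil
}

func main() {
	start, end, err := srcPortRange(33435, 0, 0)
	fmt.Println(start, end, err)

	_, _, err = srcPortRange(33435, 33430, 33440)
	fmt.Println(err)
}
```

With the default -pingPort of 33435, the automatic range resolves to 33436 through 33446, while an explicit range of 33430 to 33440 is rejected because it contains the ping port.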

HTTP API

All endpoints return JSON. Useful ones:

Path                         Description
/v1/graph                    Full per-node graph (nodes + edges + routes).
/v1/graph/nodes              Just the nodes the mapper has observed.
/v1/graph/edges              Just the links.
/v1/graph/edges/health       Per-link fault attribution (see below).
/v1/graph/routes             Routes with per-route loss/latency/jitter metrics.
/v1/mapper/peers             Peers the local node knows about.
/v1/mapper/routemap          (src,dst) → route lookup.
/v1/events/graph             Server-Sent Events stream of graph mutations.
/v1/aggregator/graph*        Same shapes as /v1/graph*, but over the cluster-wide merged view.
/v1/aggregator/events/graph  SSE stream of the aggregated graph.
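As a rough illustration of calling these endpoints when -httpToken is set, the sketch below sends a GET with an Authorization: Bearer header. To stay runnable without a dnms node it talks to a stand-in test server: fakeNode, its 401 response for a missing token, and its empty JSON array body are assumptions made for the example and say nothing about the real response shapes.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetch GETs baseURL+path and attaches the bearer token when one is set.
// It returns the HTTP status code and the raw response body.
func fetch(baseURL, path, token string) (int, string, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+path, nil)
	if err != nil {
		return 0, "", err
	}
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

// fakeNode is a hypothetical stand-in for a dnms node started with -httpToken.
// The 401 status and the "[]" body are placeholders, not dnms behaviour.
func fakeNode(token string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer "+token {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, "[]")
	}))
}

func main() {
	srv := fakeNode("s3cret")
	defer srv.Close()

	code, body, err := fetch(srv.URL, "/v1/graph/edges/health", "s3cret")
	fmt.Println(code, body, err)

	code, _, err = fetch(srv.URL, "/v1/graph/edges/health", "")
	fmt.Println(code, err)
}
```

Against a real node, baseURL would be the address given by -httpAddr, for example http://localhost:12345 with the default port.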

Between any two peers there are usually N different paths through the intermediate network. Each route has its own observed loss rate. To figure out which hop is responsible for the loss, dnms correlates per-route loss observations with link membership:

For each link L:
  AvgLossThrough    = sample-weighted mean lossRate of routes containing L
  AvgLossNotThrough = sample-weighted mean lossRate of routes NOT containing L
  Suspicion         = AvgLossThrough − AvgLossNotThrough

The endpoint returns one row per link sorted by suspicion descending. A genuinely bad link surfaces with a high positive score because every route through it shares its loss while routes avoiding it don't; a healthy link on otherwise noisy paths lands near zero or negative.
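The correlation above can be sketched in a few lines of Go. This is an illustrative reimplementation of the formula, not the code behind the endpoint: RouteStats, LinkHealth, and LinkSuspicion are hypothetical names, links are identified by plain strings, and a link with no routes avoiding it is assumed to get an AvgLossNotThrough of 0.

```go
package main

import (
	"fmt"
	"sort"
)

// RouteStats is a hypothetical per-route observation.
type RouteStats struct {
	Links    []string // identifiers of the links the route traverses
	LossRate float64  // observed loss rate in [0, 1]
	Samples  int      // number of ping samples behind LossRate
}

// LinkHealth is one row of the result, mirroring the formula's three values.
type LinkHealth struct {
	Link              string
	AvgLossThrough    float64
	AvgLossNotThrough float64
	Suspicion         float64
}

// mean returns weightedLoss/samples, or 0 when there are no samples (assumption).
func mean(weightedLoss float64, samples int) float64 {
	if samples == 0 {
		return 0
	}
	return weightedLoss / float64(samples)
}

// LinkSuspicion computes, for each link, the sample-weighted mean loss of
// routes through it minus that of routes avoiding it, sorted descending.
func LinkSuspicion(routes []RouteStats) []LinkHealth {
	var totalLoss float64
	var totalSamples int
	throughLoss := map[string]float64{}
	throughSamples := map[string]int{}

	for _, r := range routes {
		w := r.LossRate * float64(r.Samples)
		totalLoss += w
		totalSamples += r.Samples
		seen := map[string]bool{} // count each link once per route
		for _, l := range r.Links {
			if seen[l] {
				continue
			}
			seen[l] = true
			throughLoss[l] += w
			throughSamples[l] += r.Samples
		}
	}

	out := make([]LinkHealth, 0, len(throughLoss))
	for l, loss := range throughLoss {
		n := throughSamples[l]
		through := mean(loss, n)
		// Everything not through the link is the total minus the through part.
		notThrough := mean(totalLoss-loss, totalSamples-n)
		out = append(out, LinkHealth{
			Link:              l,
			AvgLossThrough:    through,
			AvgLossNotThrough: notThrough,
			Suspicion:         through - notThrough,
		})
	}
	// Highest suspicion first; ties broken by link name for stable output.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Suspicion != out[j].Suspicion {
			return out[i].Suspicion > out[j].Suspicion
		}
		return out[i].Link < out[j].Link
	})
	return out
}

func main() {
	rows := LinkSuspicion([]RouteStats{
		{Links: []string{"a-b", "b-d"}, LossRate: 0.5, Samples: 100},
		{Links: []string{"a-c", "c-d"}, LossRate: 0.0, Samples: 100},
	})
	for _, r := range rows {
		fmt.Printf("%s through=%.2f notThrough=%.2f suspicion=%+.2f\n",
			r.Link, r.AvgLossThrough, r.AvgLossNotThrough, r.Suspicion)
	}
}
```

In this toy input both links of the lossy route score +0.50 and both links of the clean route score -0.50. With only two disjoint routes the method cannot tell a-b from b-d, which is why evidence from more routes and more peers narrows the attribution.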

The same correlation runs on the aggregator side at /v1/aggregator/graph/edges/health, where it has access to every peer's routes — much more evidence than any single mapper sees on its own.

Documentation

Overview

TODO: separate package (to avoid namespace collisions)

Directories

Path Synopsis
TODO: better name? network topology?
TODO: better name? network topology?
