herd-hub-leaf

command

v0.12.0 Latest Latest Go to latest Published: May 7, 2026 License: Apache-2.0 Imports: 18 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/danmestas/bones

Links

Open Source Insights

README ¶

herd-hub-leaf — thundering-herd trial harness

A scaled-up companion to examples/hub-leaf-e2e/ (the 3x3 sanity test). Drives the new hub-and-leaf architecture (per-agent libfossil + per-agent SQLite + JetStream tip broadcast) at 16 agents x 30 tasks (= 480 commits) and emits OTLP traces to a configurable endpoint.

What it exercises

16 concurrent leaves, each with its own libfossil repo, SQLite, and worktree under one os.MkdirTemp-based work directory;
one in-process libfossil hub (httptest-backed) that every leaf pushes to and pulls from;
one embedded NATS JetStream server (random loopback port);
the full Open -> Claim -> Commit -> Close path on disjoint files (each agent owns slot-i/), with randomized file count, content size, and think-time per task.

Running

Smoke test (4 x 5 = 20 commits, no OTLP)

go test ./examples/herd-hub-leaf/

Full trial (16 x 30 = 480 commits, OTLP to SigNoz)

OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-vm.tail51604c.ts.net:4318 \
OTEL_SERVICE_NAME=herd-hub-leaf \
  go run ./examples/herd-hub-leaf/

If OTEL_EXPORTER_OTLP_ENDPOINT is unset, the trial still runs but spans go nowhere.

Env knobs

HERD_AGENTS (default 16) — agent count
HERD_TASKS_PER_AGENT (default 30) — tasks each agent runs
HERD_SEED (default 1) — master RNG seed; per-slot seeds are Seed + slotIndex. With the same seed, two re-runs produce identical workloads.

Rate envelope (the hub's ceiling under tight-loop stress)

This harness commits with zero think time — every leaf fires the next commit immediately after the previous one finishes. That's a deliberate stress amplifier; production agents commit on minute timescales during human-paced coding work, not 50ms tight loops.

Under tight-loop stress, the architecture asymptotes around 2 hub events/sec sustained. Above that, libfossil's 100-round Pull- negotiation budget gets eaten and a leaf's pre-flight Pull aborts. Trial #14 (see docs/trials/2026-04-25/trial-report.md) found:

HERD_AGENTS=4 → P50 49ms, runtime 8.7s, 100% completion
HERD_AGENTS=8 → P50 3.6s, runtime 2m4s, 100% completion
HERD_AGENTS=12 → P50 10.5s, runtime 5m38s, 100% completion
HERD_AGENTS=13 → aborts ("100 rounds")
HERD_AGENTS=14 → aborts ("100 rounds")
HERD_AGENTS=16 → aborts ("100 rounds")

Production cadence (1 commit/min/agent) gives 100+ concurrent agents of head-room before the rate envelope tightens. The trial uses tight loops to surface the architectural ceiling, not to model production.

Stdout summary

The harness prints a result block at the end:

herd-hub-leaf trial: agents=16 tasks=30 total=480
  hub commits:        480
  fork retries:       <N>
  fork unrecoverable: 0
  claims won:         480
  claims lost:        0
  broadcasts pulled:  <N>
  broadcasts skipped (idempotent): <N>
  P50/P99 commit ms:  <P50> / <P99>
  total runtime:      <X.Xs>

broadcasts pulled and broadcasts skipped (idempotent) are not counted in-process — they come from coord.SyncOnBroadcast span attributes (pull.success, pull.skipped_idempotent) which the harness emits to OTLP. Inspect them in SigNoz to count broadcast behavior across the trial.

What to look for in SigNoz

Service: herd-hub-leaf (or whatever OTEL_SERVICE_NAME is set to).

Spans of interest:

coord.Commit — one per task. Attributes: commit.fork_retried, commit.fork_retried_succeeded. Sum of fork_retried==true is the harness's "fork retries" line.
coord.SyncOnBroadcast — one per tip.changed message a leaf consumed. Attributes: pull.success, pull.skipped_idempotent, manifest.hash. Count how many slots actually pulled vs. skipped to size broadcast traffic.

A typical 16x30 trial generates ~480 coord.Commit spans plus broadcast deliveries. The sliding-window product of broadcasts and subscribers is bounded by JetStream durable consumers, so coord.SyncOnBroadcast count is usually a multiple of agents.

Caveats

Slots are disjoint by construction (slot-i/), so fork unrecoverable should always be 0 in this scenario; non-zero signals a bug in coord's commit-retry path or a fossil-side cross-contamination via the hub.
libfossil v0.4.0's HandleSync stores blobs but does not crosslink server-side; the harness counts hub commits via a fresh verifier clone (which crosslinks locally) — the verifier.fossil file is ephemeral and lives only for the count.
The trial's stdout claims lost is always 0 in disjoint-slot layout. The metric stays in the report so the same harness can be driven at higher contention later (overlapping slots) without changing the print format.

Documentation ¶

Overview ¶

Package herdhubleaf is a thundering-herd trial harness for the hub-and-leaf architecture (ADR 0023).

The harness brings up:

one coord.Hub (libfossil hub fossil + embedded leaf.Agent NATS mesh + HTTP xfer endpoint);
n disjoint coord.Leaf instances, each with its own libfossil leaf repo + worktree + leaf.Agent that joins the hub mesh as a NATS leaf-node (single-hop subject-interest propagation).

Each agent runs k tasks against its own slot directory. Slots are disjoint by construction (slot-i/) so the no-fork-branches contract holds, but the harness exercises real concurrency at the agent NATS sync path and the fossil push to the hub.

Compared to examples/hub-leaf-e2e (the 3x3 sanity test), this harness scales up (default 16 x 30 = 480 commits) and emits OTLP traces to SigNoz so the user can inspect span timing under load.

Command herd-hub-leaf is the entrypoint for the thundering-herd trial against the new hub-and-leaf architecture.

Usage:

# Default build: telemetry calls are no-ops (no OTel deps in binary).
go run ./examples/herd-hub-leaf/

# OTel build: real OTLP HTTP exporter, opt-in via build tag.
OTEL_EXPORTER_OTLP_ENDPOINT=https://signoz.example/ \
OTEL_SERVICE_NAME=herd-hub-leaf \
  go run -tags=otel ./examples/herd-hub-leaf/

Without -tags=otel the OTEL_* env vars are ignored. Without OTEL_EXPORTER_OTLP_ENDPOINT (under -tags=otel) telemetry is suppressed (no-op exporter) so the trial still runs deterministically locally.

Env knobs (override the defaults in DefaultConfig):

HERD_AGENTS=N           default 16
HERD_TASKS_PER_AGENT=K  default 30
HERD_SEED=S             default 1

Reports to stdout. Returns non-zero on unrecoverable failure.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL