herd-hub-leaf

command
v0.1.1
Published: Apr 29, 2026 License: Apache-2.0 Imports: 18 Imported by: 0

README

herd-hub-leaf — thundering-herd trial harness

A scaled-up companion to examples/hub-leaf-e2e/ (the 3x3 sanity test). Drives the new hub-and-leaf architecture (per-agent libfossil + per-agent SQLite + JetStream tip broadcast) at 16 agents x 30 tasks (= 480 commits) and emits OTLP traces to a configurable endpoint.

What it exercises

  • 16 concurrent leaves, each with its own libfossil repo, SQLite, and worktree under one os.MkdirTemp-based work directory;
  • one in-process libfossil hub (httptest-backed) that every leaf pushes to and pulls from;
  • one embedded NATS JetStream server (random loopback port);
  • the full Open -> Claim -> Commit -> Close path on disjoint files (each agent owns slot-i/), with randomized file count, content size, and think-time per task.

Running

Smoke test (4 x 5 = 20 commits, no OTLP)
go test ./examples/herd-hub-leaf/
Full trial (16 x 30 = 480 commits, OTLP to SigNoz; needs the otel build tag)
OTEL_EXPORTER_OTLP_ENDPOINT=http://signoz-vm.tail51604c.ts.net:4318 \
OTEL_SERVICE_NAME=herd-hub-leaf \
  go run -tags=otel ./examples/herd-hub-leaf/

If OTEL_EXPORTER_OTLP_ENDPOINT is unset, the trial still runs but spans go nowhere.

Env knobs

  • HERD_AGENTS (default 16): agent count
  • HERD_TASKS_PER_AGENT (default 30): tasks each agent runs
  • HERD_SEED (default 1): master RNG seed; each slot's seed is Seed + slotIndex, so two runs with the same seed produce identical workloads.

Rate envelope (the hub's ceiling under tight-loop stress)

This harness commits with zero think time — every leaf fires the next commit immediately after the previous one finishes. That's a deliberate stress amplifier; production agents commit on minute timescales during human-paced coding work, not 50ms tight loops.

Under tight-loop stress, the architecture asymptotes at around 2 hub events/sec sustained. Above that rate, libfossil's 100-round Pull-negotiation budget is exhausted and a leaf's pre-flight Pull aborts. Trial #14 (see docs/trials/2026-04-25/trial-report.md) found:

  • HERD_AGENTS=4 → P50 49ms, runtime 8.7s, 100% completion
  • HERD_AGENTS=8 → P50 3.6s, runtime 2m4s, 100% completion
  • HERD_AGENTS=12 → P50 10.5s, runtime 5m38s, 100% completion
  • HERD_AGENTS=13 → aborts ("100 rounds")
  • HERD_AGENTS=14 → aborts ("100 rounds")
  • HERD_AGENTS=16 → aborts ("100 rounds")

Production cadence (1 commit/min/agent) gives 100+ concurrent agents of head-room before the rate envelope tightens. The trial uses tight loops to surface the architectural ceiling, not to model production.
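The head-room figure follows from simple rate arithmetic, sketched below. It assumes each commit costs exactly one hub event, which is a simplification made for illustration; the function name maxAgents is hypothetical.

```go
package main

import "fmt"

// maxAgents estimates how many agents fit under a sustained hub ceiling,
// assuming one hub event per commit (an illustrative simplification).
// hubEventsPerSec is the ceiling; commitsPerMinPerAgent is each agent's cadence.
func maxAgents(hubEventsPerSec, commitsPerMinPerAgent float64) int {
	// Convert the ceiling to events per minute, then divide by the
	// per-agent commit rate.
	return int(hubEventsPerSec * 60 / commitsPerMinPerAgent)
}

func main() {
	// Ceiling of ~2 hub events/sec, production cadence of 1 commit/min/agent.
	fmt.Println(maxAgents(2, 1)) // prints 120
}
```

At 2 events/sec the hub absorbs 120 events per minute, so roughly 120 agents committing once a minute sit at the ceiling, consistent with the "100+" figure above.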

Stdout summary

The harness prints a result block at the end:

herd-hub-leaf trial: agents=16 tasks=30 total=480
  hub commits:        480
  fork retries:       <N>
  fork unrecoverable: 0
  claims won:         480
  claims lost:        0
  broadcasts pulled:  <N>
  broadcasts skipped (idempotent): <N>
  P50/P99 commit ms:  <P50> / <P99>
  total runtime:      <X.Xs>

The broadcasts pulled and broadcasts skipped (idempotent) values are not counted in-process. They come from coord.SyncOnBroadcast span attributes (pull.success, pull.skipped_idempotent), which the harness emits to OTLP. Count those spans in SigNoz to measure broadcast behavior across the trial.

What to look for in SigNoz

Service: herd-hub-leaf (or whatever OTEL_SERVICE_NAME is set to).

Spans of interest:

  • coord.Commit: one per task. Attributes: commit.fork_retried, commit.fork_retried_succeeded. The number of spans with commit.fork_retried==true equals the harness's "fork retries" line.
  • coord.SyncOnBroadcast: one per tip.changed message a leaf consumed. Attributes: pull.success, pull.skipped_idempotent, manifest.hash. Count how many slots actually pulled versus skipped to size broadcast traffic.
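If the spans are exported somewhere they can be iterated, the counts described above reduce to a simple tally. The sketch below works on a hypothetical, simplified span record (a name plus boolean attributes); it is not the OpenTelemetry SDK's span type, and it assumes the listed attributes are booleans, which the source does not state.

```go
package main

import "fmt"

// span is a hypothetical, simplified span record used only for this sketch.
type span struct {
	Name  string
	Attrs map[string]bool // assumes the attributes of interest are booleans
}

// tally holds the three counts discussed above.
type tally struct {
	ForkRetries int // coord.Commit spans with commit.fork_retried == true
	Pulled      int // coord.SyncOnBroadcast spans with pull.success == true
	Skipped     int // coord.SyncOnBroadcast spans with pull.skipped_idempotent == true
}

// summarize walks the spans once and counts the attributes of interest.
func summarize(spans []span) tally {
	var t tally
	for _, s := range spans {
		switch s.Name {
		case "coord.Commit":
			if s.Attrs["commit.fork_retried"] {
				t.ForkRetries++
			}
		case "coord.SyncOnBroadcast":
			if s.Attrs["pull.success"] {
				t.Pulled++
			}
			if s.Attrs["pull.skipped_idempotent"] {
				t.Skipped++
			}
		}
	}
	return t
}

func main() {
	spans := []span{
		{Name: "coord.Commit", Attrs: map[string]bool{"commit.fork_retried": true}},
		{Name: "coord.Commit", Attrs: map[string]bool{"commit.fork_retried": false}},
		{Name: "coord.SyncOnBroadcast", Attrs: map[string]bool{"pull.success": true}},
		{Name: "coord.SyncOnBroadcast", Attrs: map[string]bool{"pull.skipped_idempotent": true}},
	}
	fmt.Printf("%+v\n", summarize(spans))
}
```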

A typical 16x30 trial generates ~480 coord.Commit spans plus broadcast deliveries. The sliding-window product of broadcasts and subscribers is bounded by JetStream durable consumers, so coord.SyncOnBroadcast count is usually a multiple of agents.

Caveats

  • Slots are disjoint by construction (slot-i/), so fork unrecoverable should always be 0 in this scenario; non-zero signals a bug in coord's commit-retry path or a fossil-side cross-contamination via the hub.
  • libfossil v0.4.0's HandleSync stores blobs but does not crosslink server-side; the harness counts hub commits via a fresh verifier clone (which crosslinks locally) — the verifier.fossil file is ephemeral and lives only for the count.
  • The claims lost line in the trial's stdout is always 0 in the disjoint-slot layout. The metric stays in the report so the same harness can later be driven at higher contention (overlapping slots) without changing the print format.

Documentation

Overview

Package herdhubleaf is a thundering-herd trial harness for the hub-and-leaf architecture (ADR 0018, Phase 2 of hub-leaf-orchestrator).

The harness brings up:

  • one coord.Hub (libfossil hub fossil + embedded leaf.Agent NATS mesh + HTTP xfer endpoint);
  • n disjoint coord.Leaf instances, each with its own libfossil leaf repo + worktree + leaf.Agent that joins the hub mesh as a NATS leaf-node (single-hop subject-interest propagation).

Each agent runs k tasks against its own slot directory. Slots are disjoint by construction (slot-i/) so the no-fork-branches contract holds, but the harness exercises real concurrency at the agent NATS sync path and the fossil push to the hub.

Compared to examples/hub-leaf-e2e (the 3x3 sanity test), this harness scales up (default 16 x 30 = 480 commits) and emits OTLP traces to SigNoz so the user can inspect span timing under load.

Command herd-hub-leaf is the entrypoint for the thundering-herd trial against the new hub-and-leaf architecture.

Usage:

# Default build: telemetry calls are no-ops (no OTel deps in binary).
go run ./examples/herd-hub-leaf/

# OTel build: real OTLP HTTP exporter, opt-in via build tag.
OTEL_EXPORTER_OTLP_ENDPOINT=https://signoz.example/ \
OTEL_SERVICE_NAME=herd-hub-leaf \
  go run -tags=otel ./examples/herd-hub-leaf/

Without -tags=otel the OTEL_* env vars are ignored. Without OTEL_EXPORTER_OTLP_ENDPOINT (under -tags=otel) telemetry is suppressed (no-op exporter) so the trial still runs deterministically locally.

Env knobs (override the defaults in DefaultConfig):

HERD_AGENTS=N           default 16
HERD_TASKS_PER_AGENT=K  default 30
HERD_SEED=S             default 1

Reports to stdout. Returns non-zero on unrecoverable failure.
