swarm

package
v0.16.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 10, 2026 License: Apache-2.0 Imports: 21 Imported by: 0

Documentation

Overview

Package swarm holds the per-slot session record schema and the JetStream-KV-backed Manager that bones swarm verbs use to track active swarm sessions in a workspace.

State lives in a single KV bucket (DefaultBucketName) keyed by slot name. ADR 0028 §"Lifecycle and state" explains why session state went to KV instead of a per-slot state.json on disk: every field in this struct is either already authoritative somewhere else in JetStream KV (task_id in bones-tasks, agent_id in bones-presence, the claim hold in bones-holds) or describes the host-local OS process owning the leaf (host, leaf_pid). Putting them in one bucket gives cross-host visibility and a single source of truth for `bones swarm status` and `bones doctor` without re-deriving state from N substrate buckets per call.

The Manager is intentionally simpler than internal/presence: there is no background heartbeat goroutine. Heartbeats are driven by agent-process calls to `bones swarm commit`, which extends TTL via CAS. A slot whose agent has crashed stops heartbeating and the bucket TTL evicts the entry — exactly the surface `bones doctor` is meant to see.

Index

Constants

View Source
const AgentBranchPrefix = "agent/"

AgentBranchPrefix is the fossil branch prefix used for synthetic slot commits (ADR 0050 §"Branch model"). Each agent invocation commits land on `agent/<full-agent-id>`; the operator decides via `bones apply --slot=agent-<id>` whether to materialize that branch into the git working tree.

View Source
const AgentSlotIDLen = 12

AgentSlotIDLen is the number of agent_id characters baked into the synthetic slot name (`agent-<prefix>`). 12 hex characters is enough that collisions on a single workspace are negligible (~5e-15 for even a million agents) while keeping the slot name short enough to scan in `bones swarm status` output and to encode as a fossil branch suffix (`agent/<prefix>`).

The full agent_id is preserved in the session record's AgentID field; the slot name is the truncation. ADR 0050 §"Slot identity is implicit" treats this as a derivation, not a primary key.

View Source
const AgentSlotPrefix = "agent-"

AgentSlotPrefix is the slot-name prefix used for synthetic slots (ADR 0050). Plan-anchored slots never use this prefix because plan validators reject names starting with `agent-`; synthetic slots always do, so callers can distinguish the two namespaces by name.

View Source
const DefaultBucketName = "bones-swarm-sessions"

DefaultBucketName is the JetStream KV bucket holding swarm sessions. The name is part of the substrate contract (parallel to bones-tasks, bones-holds, bones-presence) and SHOULD NOT change without a substrate-evolution ADR.

View Source
const DefaultCaps = "oih"

DefaultCaps is the fossil capability string granted to a slot user when the caller doesn't override. Matches cli/swarm_join.go.

View Source
const DefaultHubFossilURL = "http://127.0.0.1:8765"

DefaultHubFossilURL is the hub fossil HTTP URL bones writes when `bones up` brings a hub to default ports. Mirrors the constant in cli/swarm_join.go so Acquire callers don't have to plumb the URL through the CLI flag layer.

View Source
const DefaultTTL = 5 * time.Minute

DefaultTTL is the JetStream KV bucket TTL. Sessions are heartbeated via `bones swarm commit`; a slot whose agent has crashed stops renewing and the bucket evicts the entry after this interval. Five minutes mirrors ADR 0028's "stale after 5min no renewal" rule.

View Source
const DefaultWatcherTTL = 5 * time.Minute

DefaultWatcherTTL is the default lease-TTL the watcher uses when the caller passes a zero value in WatcherConfig. Five minutes matches the JetStream KV bucket TTL (DefaultTTL): the JS KV layer auto-evicts the record at the substrate level; the watcher additionally cleans up the on-disk slot directory and emits the `slot_reap` event consumers expect to see.

View Source
const DefaultWatcherTick = 30 * time.Second

DefaultWatcherTick is how often the TTL watcher wakes up to scan for stale sessions. Short enough that operators see synthetic-slot (ADR 0050) cleanup land within ~30s of the agent going silent; long enough to keep the watcher's idle cost negligible (one KV list per tick, no list when the bucket is empty).

View Source
const EventsFile = ".bones/swarm-events.jsonl"

EventsFile is the workspace-relative path of the structured swarm-lifecycle event log. Living next to the workspace leaf log keeps the audit-shaped data in one place; the dedicated filename makes the purpose obvious to operators tailing it.

Path is `<workspace>/.bones/swarm-events.jsonl`. JSONL (one JSON object per line) is the right shape for `tail -f`, `grep`, and streaming-friendly tooling. Each line is small (~150 bytes) so POSIX-atomic appends from concurrent slot processes don't tear.

View Source
const SyntheticTaskTitle = "agent slot (synthetic, ADR 0050)"

SyntheticTaskTitle is the title used for the auto-created task that backs a synthetic slot's claim. The bones-tasks bucket needs an entry for the lease's claim machinery to grip; a stable human-readable title makes the bucket browsable from `bones tasks status` without leaking implementation detail in the name.

Variables

View Source
var ErrAgentIDMissing = errors.New(
	"swarm: workspace has no agent.id — run `bones up` to initialize",
)

ErrAgentIDMissing is returned by JoinAuto when the workspace has no `.bones/agent.id` marker. The caller should run `bones up` (or `bones init`) first; JoinAuto refuses to invent an identity.

View Source
var ErrCASConflict = errors.New("swarm: CAS conflict")

ErrCASConflict reports that a CAS-gated update or delete saw a revision mismatch — another writer raced ours. Mirrors internal/tasks.ErrCASConflict so callers can react identically.

View Source
var ErrCloseRequiresArtifact = errors.New(
	"swarm: close --result=success refused — slot was joined but no work " +
		"committed; either run `bones swarm commit` first, or pass " +
		"--no-artifact to acknowledge an intentional empty close",
)

ErrCloseRequiresArtifact is returned by ResumedLease.Close when the caller asked for CloseTaskOnSuccess but no commit has landed on the slot since join (LastRenewed == StartedAt on the session record). The substrate refuses the silent-bypass shape — closing success without producing an artifact severs the audit trail bones is supposed to preserve. Callers that have a legitimate reason to close success without a commit (e.g. a research subagent that returned findings inline) must opt in via CloseOpts.NoArtifact; per #233 this is a boolean opt-in (not a free-form reason string) so a hallucinated rationale cannot leak into the audit trail.

View Source
var ErrClosed = errors.New("swarm: sessions handle is closed")

ErrClosed reports that a public method was called on a Sessions whose Close has returned. Parallel to internal/presence.ErrClosed.

View Source
var ErrCrossHostOperation = errors.New(
	"swarm: cross-host operation refused — run the verb on the slot's owning host",
)

ErrCrossHostOperation is returned by Resume when the session record's Host field doesn't match this machine's hostname. Cross-host commit/close/status operations would manipulate a leaf process bones can't reach; the right answer is for the operator to run the verb on the owning host.

View Source
var ErrNotFound = errors.New("swarm: session not found")

ErrNotFound reports that the requested slot has no live session record in the bucket. Distinct from a substrate error so callers (swarm status, swarm commit) can distinguish "no session yet" from "NATS unreachable."

View Source
var ErrSessionAlreadyLive = errors.New(
	"swarm: live session already on this slot — pass --force to take over",
)

ErrSessionAlreadyLive is returned by Acquire when a live session record exists for the slot on this host and ForceTakeover is false. Callers that want to overwrite must set AcquireOpts.ForceTakeover.

View Source
var ErrSessionForeignHost = errors.New(
	"swarm: session owned by another host — refusing cross-host takeover",
)

ErrSessionForeignHost is returned by Acquire when the existing session record was written by a different host. Cross-host takeover is refused unconditionally; the operator must run the takeover from the owning host.

View Source
var ErrSessionGone = errors.New(
	"swarm: session record was deleted concurrently",
)

ErrSessionGone is returned by ResumedLease.Commit and ResumedLease.Close when the session record was deleted between Resume and the operation — usually because a concurrent close on another agent ran. Distinct from ErrSessionNotFound so callers can distinguish "never existed" from "deleted under us mid-flight".

View Source
var ErrSessionNotFound = errors.New(
	"swarm: session record not found — run `bones swarm join` first",
)

ErrSessionNotFound is returned by Resume when no session record exists for the slot. Distinct from swarm.ErrNotFound so callers can disambiguate "no record yet" from "Sessions handle closed".

View Source
var ErrStaleClaudeWorktrees = errors.New(
	"bones: refusing to start: legacy .claude/worktrees/agent-*/ dirs " +
		"present (run `bones cleanup --all-worktrees` to migrate; ADR 0050)",
)

ErrStaleClaudeWorktrees is returned by CheckStaleClaudeWorktrees when one or more legacy `.claude/worktrees/agent-*/` directories exist under the workspace root. ADR 0050 §"Migration: refuse-to- start on stale `.claude/worktrees/`": `bones up` and `bones hub start` refuse to proceed until those dirs are cleaned, because pre-ADR-0050 isolation no longer matches the synthetic slot machinery.

Callers errors.Is-test this sentinel to map the failure to a non-zero CLI exit code that's distinct from the generic substrate errors.

View Source
var ErrWorkspaceNotBootstrapped = errors.New(
	"swarm: workspace not bootstrapped — the orchestrator must " +
		"bring up the hub before leaves can join; refusing to " +
		"bootstrap from a leaf context",
)

ErrWorkspaceNotBootstrapped is returned by Acquire when `<workspace>/.bones/hub.fossil` is missing. The role-leak guard from PR #54 lives here: a leaf must NEVER read the error as "run `bones up`" — bootstrap is the orchestrator's job. The error string deliberately omits the `bones up` phrase so a subagent reading stderr can't be misled into bootstrapping its own context.

Functions

func AgentBranchName added in v0.11.0

func AgentBranchName(agentID string) string

AgentBranchName returns the fossil branch name for a synthetic slot (`agent/<full-agent-id>`). Branch suffix uses the FULL agent_id, not the truncated slot prefix, so `bones apply --slot=<name>` can disambiguate when an operator has multiple agents whose 12-char prefixes happen to overlap.

func CheckStaleClaudeWorktrees added in v0.11.0

func CheckStaleClaudeWorktrees(root string) error

CheckStaleClaudeWorktrees globs for `<root>/.claude/worktrees/ agent-*/` and returns ErrStaleClaudeWorktrees with the matching dirs listed in the message when one or more match. Returns nil when no matches exist (empty `.claude/worktrees/` dir is fine — only `agent-*` children are the migration trigger).

Used by `bones up` and `bones hub start`. Other verbs deliberately skip the check: an operator may have legitimate reasons to inspect a stale dir on a workspace they're not actively running, and the loud-refusal point is the SessionStart bring-up, not every read- only verb.

glob errors fall through as nil — a malformed pattern would be a programmer bug, not an operator-visible refusal trigger; the pattern is hardcoded so this can't happen in practice.

func IsSyntheticSlot added in v0.11.0

func IsSyntheticSlot(slot string) bool

IsSyntheticSlot reports whether slot is a synthetic agent slot (i.e. begins with AgentSlotPrefix). Used by callers that need to distinguish plan-anchored slots from agent slots — for example, log filtering and cleanup grouping. Plain prefix check; intentional.

func SlotCleanup added in v0.11.0

func SlotCleanup(
	ctx context.Context, sessions *Sessions, workspaceDir, slot string,
) (bool, error)

SlotCleanup performs the on-disk side of a synthetic-slot reap (ADR 0050): deletes the session record from KV and removes the slot's worktree directory at .bones/swarm/<slot>/wt/. Idempotent — a missing record or missing wt is not an error. Used by `bones cleanup --slot=<name>`; the TTL watcher uses the same shape via reapOne but folds in the `slot_reap` event emission and the hub.log INFO line.

Returns (existed, error): existed is true iff a session record for the slot was present at call time. Callers print a "no-op" message when existed is false and a "reaped" message when true.

func SlotDir

func SlotDir(workspaceDir, slot string) string

SlotDir returns the on-disk directory for the slot's per-leaf state under the workspace root. Layout:

<workspace>/.bones/swarm/<slot>/
├── leaf.fossil          libfossil repo (cloned from hub)
├── leaf.pid             host-local PID tracker
└── wt/                  worktree (Leaf.WT() result)

Pure path derivation; no KV lookup. Callers can compute this without opening a Sessions handle — `bones swarm cwd` exploits exactly that to avoid a NATS round-trip for a path query.

func SlotPidFile

func SlotPidFile(workspaceDir, slot string) string

SlotPidFile returns the slot's host-local PID-tracker path. Used by swarm join to write and swarm close to remove.

func SlotWorktree

func SlotWorktree(workspaceDir, slot string) string

SlotWorktree returns the slot's worktree path. Equivalent to filepath.Join(SlotDir(workspaceDir, slot), "wt"). Pulled out as a distinct helper because `bones swarm cwd` always wants this — the repo file and pid file are not user-facing.

func SyntheticSlotName added in v0.11.0

func SyntheticSlotName(agentID string) string

SyntheticSlotName returns the synthetic slot name for the given agent_id (`agent-<first-AgentSlotIDLen-chars>`). Caller is responsible for ensuring agentID is non-empty; the function panics (via assert) on the empty input rather than returning a bare `agent-` prefix that would collide across workspaces.

Truncation is straight-prefix: if agentID is shorter than AgentSlotIDLen, the whole id is used. The full id always lives in the session record's AgentID field; truncation is only the human-facing slot name.

Types

type AcquireOpts added in v0.1.3

type AcquireOpts struct {
	// HubURL overrides the hub fossil HTTP URL stamped into a fresh
	// session record. Resume reads HubURL from the existing record
	// and ignores this field. Empty → DefaultHubFossilURL.
	HubURL string

	// Hub is an in-process *coord.Hub. When set, the lease's
	// underlying coord.Leaf opens against this Hub directly
	// (LeafConfig.Hub) rather than via HubAddrs. Tests use this to
	// avoid spinning up a separate hub process; the CLI never sets
	// it.
	Hub *coord.Hub

	// Caps overrides fossil capabilities for the slot user on
	// Acquire. Resume ignores this. Empty → DefaultCaps.
	Caps string

	// ForceTakeover lets Acquire CAS-delete an existing same-host
	// session record on the slot. Recovery only — the typical path
	// is an explicit operator decision after `bones doctor`.
	ForceTakeover bool

	// NATSConn is a pre-connected NATS connection. If nil, the lease
	// dials info.NATSURL itself and the resulting connection is
	// closed when the lease is released. When set, the caller owns
	// the connection's lifetime.
	NATSConn *nats.Conn

	// Now is the time source for StartedAt and LastRenewed. nil →
	// time.Now().UTC. Tests inject a fixed clock.
	Now func() time.Time

	// NoAutosync disables the pre-commit hub pull on this lease's
	// underlying coord.Leaf. Default (false) keeps autosync ON,
	// which is the production default per ADR 0023's trunk-linearity
	// promise: every commit pulls from the hub before resolving the
	// trunk tip so the new commit's parent is the hub's-latest-tip,
	// producing one linear chain instead of N parallel forks.
	//
	// Set to true (CLI flag --no-autosync on swarm join / swarm
	// commit) when the caller has an explicit reason to operate in
	// branch-per-slot mode: offline tolerance, single-slot work
	// where no peer commits will race, or testing fan-in semantics.
	// Cost: one less hub HTTP round-trip per commit.
	NoAutosync bool
}

AcquireOpts tunes Acquire and Resume. Zero-value defaults are production-correct: HubURL → DefaultHubFossilURL, Caps → DefaultCaps, NATSConn dialed from info.NATSURL, Now → time.Now().UTC.

type CloseOpts added in v0.1.3

type CloseOpts struct {
	// CloseTaskOnSuccess, when true, calls coord.Leaf.Close on the
	// re-claimed task — closing the task in the bones-tasks bucket
	// in addition to releasing the claim hold. False just releases
	// the claim, leaving the task open for retry by the parent
	// dispatch. swarm close --result=success sets this true; fail
	// and fork leave it false.
	CloseTaskOnSuccess bool

	// NoArtifact, when true, bypasses the "success requires a commit
	// since join" precondition. Per #233 this is a structured boolean
	// opt-in (not a free-form reason string) so a hallucinated
	// rationale cannot leak into the audit trail; the lifecycle event
	// records `no_artifact: true` when set. Ignored when
	// CloseTaskOnSuccess is false (fail and fork results have no
	// artifact requirement).
	NoArtifact bool

	// Reaped, when true, tags the emitted lifecycle event as
	// EventSlotReap rather than EventSlotClose. Set by the
	// `bones swarm reap` verb so post-mortem analysis can tell
	// substrate-driven cleanup ("agent went silent") apart from
	// operator-driven close ("subagent reported done"). No effect
	// on the cleanup mechanics.
	Reaped bool

	// KeepWT, when true, retains the per-slot worktree directory
	// even on a successful close. Default behavior (KeepWT=false
	// + CloseTaskOnSuccess=true) removes the wt so it does not
	// accumulate across cycles. Failed and forked closes
	// (CloseTaskOnSuccess=false) always retain wt regardless of
	// this flag — the operator may need to inspect what the slot
	// left behind.
	KeepWT bool
}

CloseOpts tunes ResumedLease.Close. Zero value is the conservative release-and-cleanup behavior; CloseTaskOnSuccess transitions the underlying task to closed in the bones-tasks bucket.

type CommitResult added in v0.1.3

type CommitResult struct {
	UUID       string
	PushResult *agent.SyncResult
	PushErr    error
	RenewErr   error
}

CommitResult is the outcome of ResumedLease.Commit. UUID is set on every successful local commit. PushResult is set when the post-commit HTTP push to the hub succeeded; PushErr is set when it failed (the local commit lands either way). RenewErr is set when the session-record CAS bump exhausted retries or saw an unrecoverable error. Callers should print warnings for the soft errors but should not roll back the local commit.

type Config

type Config struct {
	// NATSConn is the live NATS connection. Sessions does not dial; the
	// caller owns connect/disconnect lifecycle. Required.
	NATSConn *nats.Conn

	// BucketName overrides DefaultBucketName. Empty string falls back
	// to the package default.
	BucketName string

	// TTL overrides DefaultTTL on bucket creation. Zero falls back to
	// the package default. Ignored if the bucket already exists with
	// a different TTL — JetStream KV is not destructive on Update.
	TTL time.Duration
}

Config configures a Sessions handle. Only NATSConn is required; bucket name and TTL fall back to the package defaults if zero. Defaults match production; tests sometimes pass shorter TTLs to exercise expiry.

func (Config) Validate

func (c Config) Validate() error

Validate returns an error describing the first invalid field, or nil if the config is acceptable. Open calls Validate before any substrate work so callers see config issues with no NATS round-trip.

type Event added in v0.5.0

type Event struct {
	TS         time.Time `json:"ts"`
	Kind       EventKind `json:"event"`
	Slot       string    `json:"slot"`
	TaskID     string    `json:"task_id,omitempty"`
	AgentID    string    `json:"agent_id,omitempty"`
	Host       string    `json:"host,omitempty"`
	Result     string    `json:"result,omitempty"`
	NoArtifact bool      `json:"no_artifact,omitempty"`
	CommitUUID string    `json:"commit_uuid,omitempty"`
}

Event is the on-disk shape written to swarm-events.jsonl. Fields are minimal-but-sufficient for an operator to reconstruct who did what when. Optional fields use omitempty so close events don't carry stale CommitUUIDs and join events don't carry spurious Result strings.

type EventKind added in v0.5.0

type EventKind string

EventKind enumerates the slot-lifecycle event types written to the JSONL log. Each kind corresponds to a distinct verb at the CLI surface; consumers (operator dashboards, replays) can filter by kind without parsing the rest of the line.

const (
	// EventSlotJoin is emitted when Acquire successfully writes a
	// session record. The slot is bound to a task and a leaf is open.
	EventSlotJoin EventKind = "slot_join"

	// EventSlotCommit is emitted when ResumedLease.Commit returns
	// without a leaf-side error. PushErr / RenewErr surface in the
	// returned CommitResult but the structured event records the
	// successful local commit (audit-trail-relevant).
	EventSlotCommit EventKind = "slot_commit"

	// EventSlotClose is emitted when ResumedLease.Close completes
	// the cleanup path. Always carries Result so post-mortem
	// analysis sees success/fail/fork distinction.
	EventSlotClose EventKind = "slot_close"

	// EventSlotReap is emitted by the swarm reap verb instead of
	// EventSlotClose so operators can distinguish operator-driven
	// cleanup ("we shut this slot down") from substrate-driven
	// reaping ("the agent went silent and we cleaned up after it").
	EventSlotReap EventKind = "slot_reap"
)

type FreshLease added in v0.2.0

type FreshLease struct {
	// contains filtered or unexported fields
}

FreshLease is a slot session created by Acquire. It owns an active claim taken at acquisition time. FreshLease.Release is the graceful exit (record persists for later Resume); FreshLease.Abort is the rollback path (record deleted).

FreshLease has neither Commit nor Close — those operate on a resumed lease with the latest record revision. Callers that need to commit after acquiring should Release the FreshLease and Resume in a separate operation.

func Acquire added in v0.2.0

func Acquire(
	ctx context.Context, info workspace.Info,
	slot, taskID string, opts AcquireOpts,
) (*FreshLease, error)

Acquire opens a FreshLease for a slot that has no live session record yet. Used by `bones swarm join`. Steps, in order, with any partial work cleaned up on error:

  1. Verify `<workspace>/.bones/hub.fossil` exists. If not, return ErrWorkspaceNotBootstrapped — the role-leak guard from PR #54.
  2. Create the slot's fossil user on the hub repo if missing.
  3. Open the swarm.Sessions handle (dial NATS or reuse opts.NATSConn).
  4. Read any existing session record. Active on this host without --force → ErrSessionAlreadyLive. Different host → ErrSessionForeignHost. Stale or --force → CAS-delete and continue.
  5. Open coord.Leaf with Autosync=true and claim the task.
  6. Put the session record (CAS create) and write the pid file.

The returned FreshLease holds the claim. Callers MUST call Release (record persists) or Abort (record deleted as rollback) exactly once.

func (*FreshLease) Abort added in v0.2.0

func (l *FreshLease) Abort(ctx context.Context) error

Abort rolls back the fresh acquisition: releases the claim, stops the leaf, CAS-deletes the session record, removes the host-local pid file, and tears down the Sessions handle and NATS connection. Used when join's downstream work fails after Acquire succeeded — the caller wants to leave the slot in the same state Acquire saw. Idempotent.

func (*FreshLease) FossilUser added in v0.2.0

func (b *FreshLease) FossilUser() string

FossilUser returns the slot's fossil user (e.g. "slot-rendering").

func (*FreshLease) HubURL added in v0.2.0

func (b *FreshLease) HubURL() string

HubURL returns the hub fossil HTTP URL the lease's leaf is peered against.

func (*FreshLease) Release added in v0.2.0

func (l *FreshLease) Release(ctx context.Context) error

Release closes the FreshLease without deleting the session record. Releases the claim, stops the leaf, closes the swarm.Sessions handle, and (if the lease owns the NATS connection) closes that too. Idempotent.

`bones swarm join` calls Release after writing the session record so subsequent verbs can Resume against it.

func (*FreshLease) SessionRevision added in v0.2.0

func (b *FreshLease) SessionRevision() uint64

SessionRevision returns the JetStream KV revision the lease last observed. Bumped internally by Commit on each successful CAS. Test-only utility — production verbs go through Commit / Close.

func (*FreshLease) Slot added in v0.2.0

func (b *FreshLease) Slot() string

Slot returns the slot name this lease is bound to.

func (*FreshLease) TaskID added in v0.2.0

func (b *FreshLease) TaskID() string

TaskID returns the task ID stamped into the lease's session record.

func (*FreshLease) WT added in v0.2.0

func (b *FreshLease) WT() string

WT returns the slot's worktree path. Returns the empty string after the leaf has been stopped (e.g. post-Commit).

type JoinAutoResult added in v0.11.0

type JoinAutoResult struct {
	Slot    string
	WT      string
	TaskID  string
	AgentID string
	ReEntry bool
	Lease   *FreshLease
}

JoinAutoResult is the outcome of JoinAuto: either a fresh acquire (FreshLease set, ReEntry=false) or an idempotent re-entry against an existing session record on this host (FreshLease nil, Slot/WT populated, ReEntry=true). Callers that just want the slot dir for printing on stdout should read Slot + WT regardless of which path fired.

The lease is returned only on the fresh path because Acquire is the only function that constructs one with an active claim; re-entry reads the existing record without taking a claim, mirroring how `bones swarm commit` would use Resume rather than re-Acquire.

Callers that took the fresh path MUST call FreshLease.Release (or Abort on a downstream failure) exactly once. Re-entry callers have nothing to release — the existing lease's host-local pid file and session record were written by the prior join.

func JoinAuto added in v0.11.0

func JoinAuto(ctx context.Context, info workspace.Info, opts AcquireOpts) (JoinAutoResult, error)

JoinAuto opens (or rejoins) the synthetic slot for the workspace's agent_id. ADR 0050 §"Slot identity is implicit" + #282.

Steps:

  1. Read .bones/agent.id (info.AgentID is the source of truth; ErrAgentIDMissing if empty).
  2. Derive slot name = `agent-<first-AgentSlotIDLen-chars>`.
  3. Open the swarm.Sessions handle and look up the slot. - Existing record + same agent_id + same host → idempotent re-entry. Bumps LastRenewed (heartbeat-on-rejoin per option (a) in #282) and returns ReEntry=true. - Existing record + different agent_id → ErrSessionAlreadyLive (different agents must use different prefixes, or one collided in the truncation; the operator must clean up). - No record → fall through to fresh acquire below.
  4. Auto-create a synthetic task in bones-tasks (so the claim machinery has something to grip) and call swarm.Acquire with that task ID. The synthetic task's title is SyntheticTaskTitle; it carries a single hold path under .bones/swarm/<slot>/wt/ so the hold-bucket invariant (every claim corresponds to a task with hold paths) holds.

Caller MUST call lease.Release(ctx) when done with the fresh path. Re-entry path: lease is nil; nothing to release.

type ResumedLease added in v0.2.0

type ResumedLease struct {
	// contains filtered or unexported fields
}

ResumedLease is a slot session reconstructed from an existing record by Resume. It does NOT hold a claim; Commit and Close re-acquire as needed. CAS revision on the session record is tracked internally; Commit advances it on every successful LastRenewed bump.

ResumedLease has Commit, Close, and Release. No Abort — fresh records are the only thing that gets rolled back.

func Resume added in v0.1.3

func Resume(
	ctx context.Context, info workspace.Info,
	slot string, opts AcquireOpts,
) (*ResumedLease, error)

Resume opens a ResumedLease for a slot whose session record already exists in KV. Used by every swarm verb other than join.

Resume does NOT re-take the claim — verbs that need a claim (Commit, Close) re-acquire it themselves. This mirrors today's CLI behavior: claim holds have a TTL that outlives a single verb only when bumped, and the record's LastRenewed is the durable liveness signal.

Resume refuses cross-host operations: if the session record's Host field doesn't match this machine, returns ErrCrossHostOperation. Returns ErrSessionNotFound if the slot has no record.

func (*ResumedLease) Close added in v0.2.0

func (l *ResumedLease) Close(ctx context.Context, opts CloseOpts) error

Close terminates the ResumedLease, removes the host-local pid file, and CAS-deletes the session record. When CloseOpts. CloseTaskOnSuccess is true the underlying task is also transitioned to closed in the bones-tasks bucket via Leaf.Close. Idempotent.

Steps, in order:

  1. Re-claim the lease's task (Resume did not claim it).
  2. If CloseTaskOnSuccess: Leaf.Close — closes the task and releases the claim hold in one operation. Otherwise just release the claim.
  3. Stop the leaf.
  4. Remove the host-local pid file.
  5. CAS-delete the session record, retrying on rev advance from a concurrent commit.
  6. Tear down the swarm.Sessions handle + NATS connection.

CAS conflict on step 5 (sibling commit bumped LastRenewed) re-reads the rev and retries — the operator wants to close, regardless of transient renewals. ErrSessionGone surfaces if the record is missing on first read; the close is otherwise idempotent.

func (*ResumedLease) Commit added in v0.2.0

func (l *ResumedLease) Commit(
	ctx context.Context, message string, files []coord.File,
) (CommitResult, error)

Commit re-claims the lease's task, announces holds for the file paths, commits the bytes via the underlying coord.Leaf, releases the claim, stops the leaf, HTTP-pushes the slot's leaf.fossil to the hub via /xfer, and CAS-bumps LastRenewed on the session record.

Returns CommitResult with the local-commit UUID always set on success. PushResult / PushErr report the hub-push outcome; the hub may be unreachable without rolling back the local commit. RenewErr reports the session-record CAS bump; transient CAS conflicts retry up to jskv.MaxRetries times. ErrSessionGone surfaces as RenewErr if the session record was deleted between Resume and now.

After Commit returns, the lease's underlying coord.Leaf has been stopped. The lease can still be Released or Closed; doing other verb work on the lease's leaf after Commit is undefined and would have to re-Resume.

func (*ResumedLease) FossilUser added in v0.2.0

func (b *ResumedLease) FossilUser() string

FossilUser returns the slot's fossil user (e.g. "slot-rendering").

func (*ResumedLease) HubURL added in v0.2.0

func (b *ResumedLease) HubURL() string

HubURL returns the hub fossil HTTP URL the lease's leaf is peered against.

func (*ResumedLease) Release added in v0.2.0

func (l *ResumedLease) Release(ctx context.Context) error

Release closes the ResumedLease without deleting the session record. Stops the leaf, closes the Sessions handle, and (if the lease owns the NATS connection) closes that too. Idempotent.

`bones swarm commit` calls Release after Commit so the record stays live for later verbs.

func (*ResumedLease) SessionRevision added in v0.2.0

func (b *ResumedLease) SessionRevision() uint64

SessionRevision returns the JetStream KV revision the lease last observed. Bumped internally by Commit on each successful CAS. Test-only utility — production verbs go through Commit / Close.

func (*ResumedLease) Slot added in v0.2.0

func (b *ResumedLease) Slot() string

Slot returns the slot name this lease is bound to.

func (*ResumedLease) TaskID added in v0.2.0

func (b *ResumedLease) TaskID() string

TaskID returns the task ID stamped into the lease's session record.

func (*ResumedLease) WT added in v0.2.0

func (b *ResumedLease) WT() string

WT returns the slot's worktree path. Returns the empty string after the leaf has been stopped (e.g. post-Commit).

type Session

type Session struct {
	// Slot is the slot name (matches the plan's `[slot: X]` annotation).
	// Doubles as the KV bucket key — slot is the primary identifier.
	Slot string `json:"slot"`

	// TaskID identifies the task this slot is currently working on.
	// Lookup against bones-tasks for full task state; this field is a
	// denormalized pointer used by `bones swarm status` and the close
	// verb so they don't need a second lookup.
	TaskID string `json:"task_id"`

	// AgentID is "slot-<name>" — the slot's fossil + coord identity.
	// Matches the agent_id used for the holds and presence buckets so
	// claim ownership and liveness join cleanly across substrates.
	AgentID string `json:"agent_id"`

	// Host is os.Hostname() of the machine running the leaf. Different
	// from the workspace host means another machine owns this slot;
	// `bones swarm` verbs abort with a clear cross-host error rather
	// than try to manage a remote leaf process.
	Host string `json:"host"`

	// LeafPID is the host-local PID of the bones CLI process that
	// wrote this record at join time. Informational only — every
	// swarm verb is its own short-lived CLI invocation, so this PID
	// is already-dead by the time any later verb reads it. Kept as a
	// debugging crumb (a pid that was alive at join means the join
	// did happen on this host) but NOT a liveness signal; use
	// LastRenewed for active/stale classification instead.
	LeafPID int `json:"leaf_pid"`

	// HubURL is the hub fossil HTTP URL recorded at join time. Later
	// verbs read this so they can post-commit push the slot's
	// leaf.fossil to the hub via /xfer without re-deriving the URL
	// from a flag. Empty on records written by older clients;
	// callers fall back to the join-flag default.
	HubURL string `json:"hub_url,omitempty"`

	// StartedAt is when this session record was first written.
	// Immutable once set; later renewals only touch LastRenewed.
	StartedAt time.Time `json:"started_at"`

	// LastRenewed is updated on every `swarm commit` (which heartbeats
	// the session) and used by `bones doctor` to flag stale entries.
	// This is the canonical active/stale signal — see LeafPID for the
	// "why we don't use the pid" discussion.
	LastRenewed time.Time `json:"last_renewed"`
}

Session is the per-slot record persisted in the bones-swarm-sessions JetStream KV bucket. One record per active slot in a workspace. Slot names are workspace-scoped because the bucket itself is per- NATS-deployment.

Marshaled as JSON for human-readable debugging via `nats kv get`. All time fields are RFC3339-encoded UTC. New fields must default safely on a missing-key read so older clients can read newer records (additive evolution only).

type Sessions added in v0.2.0

type Sessions struct {
	// contains filtered or unexported fields
}

Sessions owns the JetStream KV bucket holding swarm session records. Reads (Get, List) are public — `bones swarm status`, `bones doctor`, and slot-resolution helpers in the CLI consume them directly. Mutations (put, update, delete) are unexported; the only legal mutator is `swarm.Lease`, which lives in the same package.

The narrow public surface enforces a seam: the lifecycle of a session record is owned end-to-end by Lease, so outside callers cannot bypass Lease's invariants (host match, CAS revision tracking, claim-bound writes).

Every public method is safe to call concurrently. Close is idempotent. Unlike presence.Manager there is no heartbeat goroutine — callers (the bones swarm verbs, via Lease) drive renewal via CAS update.

func Open

func Open(ctx context.Context, cfg Config) (*Sessions, error)

Open creates (or reattaches to) the swarm sessions KV bucket and returns a Sessions handle. Caller must Close at shutdown. Open does not dial NATS: the connection comes pre-wired so reconnect policy stays a single-source concern (same shape as presence.Open).

func (*Sessions) BucketName added in v0.2.0

func (s *Sessions) BucketName() string

BucketName returns the bucket name in use. Useful for diagnostics and for `bones doctor` to mention the right bucket in its messages.

func (*Sessions) Close added in v0.2.0

func (s *Sessions) Close() error

Close marks the Sessions handle closed so subsequent public calls return ErrClosed. Safe to call more than once. Does not delete bucket contents; sessions outlive the process that wrote them (TTL eventually evicts).

func (*Sessions) Get added in v0.2.0

func (s *Sessions) Get(
	ctx context.Context, slot string,
) (Session, uint64, error)

Get reads the session record for slot. Returns ErrNotFound if no record exists. The returned revision is the JetStream KV sequence number suitable for a follow-up CAS update or delete (callable only from inside the swarm package — Sessions's mutators are unexported).

func (*Sessions) List added in v0.2.0

func (s *Sessions) List(ctx context.Context) ([]Session, error)

List returns every live session in the bucket. Order matches the underlying JetStream KV list order (key-name lexicographic in practice; callers that need a specific order should sort).

List is the canonical read-across-slots seam — `bones swarm status`, `bones doctor`, and CLI slot-resolution helpers all consume it. This is one of the read methods that justify keeping Sessions as a public type at all (versus folding into Lease).

type TTLWatcher added in v0.11.0

type TTLWatcher struct {
	// contains filtered or unexported fields
}

TTLWatcher polls the swarm-sessions bucket for stale records and reaps them. One watcher per workspace process; the hub's runForeground starts exactly one in production. Safe to construct directly via NewTTLWatcher; the zero value is not usable.

Lifecycle: caller starts Run in a goroutine, ctx cancellation stops the watcher. Run is NOT safe to call concurrently with itself; the watcher is a single-goroutine consumer.

func NewTTLWatcher added in v0.11.0

func NewTTLWatcher(cfg WatcherConfig) (*TTLWatcher, error)

NewTTLWatcher constructs a watcher with cfg. Zero-valued fields fall back to package defaults (DefaultWatcherTTL, DefaultWatcherTick, noop logger, time.Now().UTC, host-local-only). Returns an error only on cfg validation; substrate errors surface at Run time so transient NATS hiccups during construction don't kill the hub.

func (*TTLWatcher) Run added in v0.11.0

func (w *TTLWatcher) Run(ctx context.Context) error

Run drives the watcher loop until ctx is canceled. Each tick:

  • Lists all sessions in the bucket.
  • For every session whose LastRenewed is older than (now - TTL), drops the session record and removes the slot's wt/ tree.
  • Emits one Infof per reap, naming the slot and the duration the TTL was exceeded by.

Run returns nil when ctx is canceled. Substrate errors during a tick are surfaced via Warnf and the loop continues — a transient NATS hiccup must not kill the watcher.

func (*TTLWatcher) Start added in v0.11.0

func (w *TTLWatcher) Start(parent context.Context) func()

Start launches the watcher in a goroutine bound to ctx. Returns a stop function the caller invokes at shutdown — Stop cancels the watcher's ctx and blocks until Run returns.

type TagSpec added in v0.11.0

type TagSpec = agent.TagSpec

TagSpec is a re-export of agent.TagSpec so swarm callers don't need to depend on the leaf module directly to construct branch-tag pairs. Identical wire shape; carries the libfossil tag primitive (`branch=<name>` and `sym-<name>=*`) attached at commit time.

func AgentBranchTags added in v0.11.0

func AgentBranchTags(agentID string) []TagSpec

AgentBranchTags returns the libfossil branch-tag pair that lands a synthetic-slot commit on `agent/<full-agent-id>` instead of trunk. ADR 0050 §"Branch model" + #288. The pair is:

  • `branch=<name>` — the libfossil branch propagation tag.
  • `sym-<name>=*` — the symbolic-name tag that lets the branch resolve via fossil's name-lookup (`fossil whatis`, `bones apply --slot=<name>`).

Both tags are required; the upstream EdgeSync/libfossil implementation rejects the pair as malformed if either is missing (see leaf v0.0.11 `TestAgent_Commit_BranchTags_LandsOnNamedBranch`).

Caller is responsible for non-empty agentID; AgentBranchName panics on empty input.

type WatcherConfig added in v0.11.0

type WatcherConfig struct {
	// WorkspaceDir is the workspace root. Required — the watcher
	// removes `<workspaceDir>/.bones/swarm/<slot>/wt/` when reaping
	// a stale slot.
	WorkspaceDir string

	// Sessions is the swarm.Sessions handle the watcher reads from
	// and writes to. The watcher does not own the handle's lifetime;
	// the caller closes it when shutting the watcher down.
	Sessions *Sessions

	// TTL is the lease-TTL: any session whose LastRenewed exceeds
	// time.Now() - TTL is reaped. Zero falls back to
	// DefaultWatcherTTL.
	TTL time.Duration

	// Tick is the poll interval. Zero falls back to
	// DefaultWatcherTick.
	Tick time.Duration

	// Logger receives one Infof per reap and one Warnf per
	// substrate error. nil means "silent" (no log spam on the happy
	// path is the hard requirement; a nil logger keeps the watcher
	// from emitting any lines at all).
	Logger WatcherLogger

	// Now is the time source for staleness comparisons. nil →
	// time.Now().UTC. Tests inject a fixed clock.
	Now func() time.Time

	// LocalHostOnly, when true (the default), reaps only sessions
	// whose Host field matches this machine's hostname. Cross-host
	// stale sessions surface in the log but are not reaped — the
	// owning host's PID may still be making local progress, and a
	// remote reap would clobber it. Mirrors `bones swarm reap`'s
	// same-host policy.
	LocalHostOnly bool
}

WatcherConfig configures TTLWatcher. Zero-value defaults are production-correct: TTL → DefaultWatcherTTL, Tick → DefaultWatcherTick, Logger → silent, Now → time.Now().UTC, HostFilter → os.Hostname() match.

type WatcherLogger added in v0.11.0

type WatcherLogger interface {
	Infof(format string, args ...any)
	Warnf(format string, args ...any)
}

WatcherLogger is the minimal logger contract the TTL watcher emits to. The hub package wires its hubLogger here so reaped-slot lines land in .bones/hub.log alongside the rest of the hub lifecycle audit trail (#247). Production callers pass a non-nil logger; tests can pass a buffer-backed implementation to assert reap events landed.

The watcher MUST be silent on the happy path — Infof is called only when at least one slot was reaped. Warnf surfaces transient substrate errors (a single failed list call) without aborting the whole watcher loop.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL