Documentation
¶
Overview ¶
Package swarm holds the per-slot session record schema and the JetStream-KV-backed Manager that bones swarm verbs use to track active swarm sessions in a workspace.
State lives in a single KV bucket (DefaultBucketName) keyed by slot name. ADR 0028 §"Lifecycle and state" explains why session state went to KV instead of a per-slot state.json on disk: every field in this struct is either already authoritative somewhere else in JetStream KV (task_id in bones-tasks, agent_id in bones-presence, the claim hold in bones-holds) or describes the host-local OS process owning the leaf (host, leaf_pid). Putting them in one bucket gives cross-host visibility and a single source of truth for `bones swarm status` and `bones doctor` without re-deriving state from N substrate buckets per call.
The Manager is intentionally simpler than internal/presence: there is no background heartbeat goroutine. Heartbeats are driven by agent-process calls to `bones swarm commit`, which extends TTL via CAS. A slot whose agent has crashed stops heartbeating and the bucket TTL evicts the entry — exactly the surface `bones doctor` is meant to see.
Index ¶
- Constants
- Variables
- func SlotDir(workspaceDir, slot string) string
- func SlotPidFile(workspaceDir, slot string) string
- func SlotWorktree(workspaceDir, slot string) string
- type AcquireOpts
- type CloseOpts
- type CommitResult
- type Config
- type Lease
- func (l *Lease) Close(ctx context.Context, opts CloseOpts) error
- func (l *Lease) Commit(ctx context.Context, message string, files []coord.File) (CommitResult, error)
- func (l *Lease) FossilUser() string
- func (l *Lease) HubURL() string
- func (l *Lease) Release(ctx context.Context) error
- func (l *Lease) SessionRevision() uint64
- func (l *Lease) Slot() string
- func (l *Lease) TaskID() string
- func (l *Lease) WT() string
- type Manager
- func (m *Manager) BucketName() string
- func (m *Manager) Close() error
- func (m *Manager) Delete(ctx context.Context, slot string, expectedRev uint64) error
- func (m *Manager) Get(ctx context.Context, slot string) (Session, uint64, error)
- func (m *Manager) List(ctx context.Context) ([]Session, error)
- func (m *Manager) Put(ctx context.Context, sess Session) error
- func (m *Manager) Update(ctx context.Context, sess Session, expectedRev uint64) error
- type Session
Constants ¶
const DefaultBucketName = "bones-swarm-sessions"
DefaultBucketName is the JetStream KV bucket holding swarm sessions. The name is part of the substrate contract (parallel to bones-tasks, bones-holds, bones-presence) and SHOULD NOT change without a substrate-evolution ADR.
const DefaultCaps = "oih"
DefaultCaps is the fossil capability string granted to a slot user when the caller doesn't override. Matches cli/swarm_join.go.
const DefaultHubFossilURL = "http://127.0.0.1:8765"
DefaultHubFossilURL is the hub fossil HTTP URL bones writes when `bones up` brings a hub to default ports. Mirrors the constant in cli/swarm_join.go so AcquireFresh callers don't have to plumb the URL through the CLI flag layer.
const DefaultTTL = 5 * time.Minute
DefaultTTL is the JetStream KV bucket TTL. Sessions are heartbeated via `bones swarm commit`; a slot whose agent has crashed stops renewing and the bucket evicts the entry after this interval. Five minutes mirrors ADR 0028's "stale after 5min no renewal" rule.
Variables ¶
var ErrCASConflict = errors.New("swarm: CAS conflict")
ErrCASConflict reports that a CAS-gated Update or Delete saw a revision mismatch — another writer raced ours. Mirrors internal/tasks.ErrCASConflict so callers can react identically.
var ErrClosed = errors.New("swarm: manager is closed")
ErrClosed reports that a public method was called on a Manager whose Close has returned. Parallel to internal/presence.ErrClosed.
var ErrCrossHostOperation = errors.New(
"swarm: cross-host operation refused — run the verb on the slot's owning host",
)
ErrCrossHostOperation is returned by Resume when the session record's Host field doesn't match this machine's hostname. Cross-host commit / close / status operations would manipulate a leaf process bones can't reach; the right answer is for the operator to run the verb on the owning host.
var ErrNotFound = errors.New("swarm: session not found")
ErrNotFound reports that the requested slot has no live session record in the bucket. Distinct from a substrate error so callers (swarm status, swarm commit) can distinguish "no session yet" from "NATS unreachable."
var ErrSessionAlreadyLive = errors.New(
"swarm: live session already on this slot — pass --force to take over",
)
ErrSessionAlreadyLive is returned by AcquireFresh when a live session record exists for the slot on this host and ForceTakeover is false. Callers that want to overwrite it must set AcquireOpts.ForceTakeover.
var ErrSessionForeignHost = errors.New(
"swarm: session owned by another host — refusing cross-host takeover",
)
ErrSessionForeignHost is returned by AcquireFresh when the existing session record was written by a different host. Cross- host takeover is refused unconditionally; the operator must run the takeover from the owning host.
var ErrSessionNotFound = errors.New(
"swarm: session record not found — run `bones swarm join` first",
)
ErrSessionNotFound is returned by Resume when no session record exists for the slot. Distinct from swarm.ErrNotFound so callers can disambiguate "no record yet" from "Manager closed".
var ErrWorkspaceNotBootstrapped = errors.New(
"swarm: workspace not bootstrapped — the orchestrator must " +
"bring up the hub before leaves can join; refusing to " +
"bootstrap from a leaf context",
)
ErrWorkspaceNotBootstrapped is returned by AcquireFresh when `<workspace>/.orchestrator/hub.fossil` is missing. The role-leak guard from PR #54 lives here: a leaf must NEVER read the error as "run `bones up`" — bootstrap is the orchestrator's job. The error string deliberately omits the `bones up` phrase so a subagent reading stderr can't be misled into bootstrapping its own context.
Functions ¶
func SlotDir ¶
SlotDir returns the on-disk directory for the slot's per-leaf state under the workspace root. Layout:
<workspace>/.bones/swarm/<slot>/ ├── leaf.fossil libfossil repo (cloned from hub) ├── leaf.pid host-local PID tracker └── wt/ worktree (Leaf.WT() result)
Pure path derivation; no KV lookup. Callers can compute this without opening a Manager — `bones swarm cwd` exploits exactly that to avoid a NATS round-trip for a path query.
func SlotPidFile ¶
SlotPidFile returns the slot's host-local PID-tracker path. Used by swarm join to write and swarm close to remove.
func SlotWorktree ¶
SlotWorktree returns the slot's worktree path. Equivalent to filepath.Join(SlotDir(workspaceDir, slot), "wt"). Pulled out as a distinct helper because `bones swarm cwd` always wants this — the repo file and pid file are not user-facing.
Types ¶
type AcquireOpts ¶ added in v0.1.3
type AcquireOpts struct {
// HubURL overrides the hub fossil HTTP URL stamped into a fresh
// session record. Resume reads HubURL from the existing record
// and ignores this field. Empty → DefaultHubFossilURL.
HubURL string
// Hub is an in-process *coord.Hub. When set, the lease's
// underlying coord.Leaf opens against this Hub directly
// (LeafConfig.Hub) rather than via HubAddrs. Tests use this to
// avoid spinning up a separate hub process; the CLI never sets
// it.
Hub *coord.Hub
// Caps overrides fossil capabilities for the slot user on
// AcquireFresh. Resume ignores this. Empty → DefaultCaps.
Caps string
// ForceTakeover lets AcquireFresh CAS-delete an existing same-
// host session record on the slot. Recovery only — the typical
// path is an explicit operator decision after `bones doctor`.
ForceTakeover bool
// NATSConn is a pre-connected NATS connection. If nil, the
// lease dials info.NATSURL itself and the resulting connection
// is closed when the lease is released. When set, the caller
// owns the connection's lifetime.
NATSConn *nats.Conn
// Now is the time source for StartedAt and LastRenewed. nil →
// time.Now().UTC. Tests inject a fixed clock.
Now func() time.Time
}
AcquireOpts tunes AcquireFresh and Resume. Zero-value defaults are production-correct: HubURL → DefaultHubFossilURL, Caps → DefaultCaps, NATSConn dialed from info.NATSURL, Now → time.Now().UTC.
type CloseOpts ¶ added in v0.1.3
type CloseOpts struct {
// CloseTaskOnSuccess, when true, calls coord.Leaf.Close on the
// re-claimed task — closing the task in the bones-tasks bucket
// in addition to releasing the claim hold. False just releases
// the claim, leaving the task open for retry by the parent
// dispatch. swarm close --result=success sets this true; fail
// and fork leave it false.
CloseTaskOnSuccess bool
}
CloseOpts tunes Lease.Close. Zero value is the conservative release-and-cleanup behavior; CloseTaskOnSuccess transitions the lease's task to closed in the bones-tasks bucket.
type CommitResult ¶ added in v0.1.3
type CommitResult struct {
UUID string
PushResult *libfossil.SyncResult
PushErr error
RenewErr error
}
CommitResult is the outcome of Lease.Commit. UUID is set on every successful local commit. PushResult is set when the post-commit HTTP push to the hub succeeded; PushErr is set when it failed (the local commit lands either way). RenewErr is set when the session-record CAS-bump failed (the commit succeeded but the session may TTL out before the next verb runs). Callers should print warnings for the soft errors but should not roll back the local commit on either of them.
type Config ¶
type Config struct {
// NATSConn is the live NATS connection. Manager does not dial; the
// caller owns connect/disconnect lifecycle. Required.
NATSConn *nats.Conn
// BucketName overrides DefaultBucketName. Empty string falls back
// to the package default.
BucketName string
// TTL overrides DefaultTTL on bucket creation. Zero falls back to
// the package default. Ignored if the bucket already exists with
// a different TTL — JetStream KV is not destructive on Update.
TTL time.Duration
}
Config configures a Manager. Only NATSConn is required; bucket name and TTL fall back to the package defaults if zero. Defaults match production; tests sometimes pass shorter TTLs to exercise expiry.
type Lease ¶ added in v0.1.3
type Lease struct {
// contains filtered or unexported fields
}
Lease is a single CLI invocation's grip on a slot. A Lease owns the per-slot coord.Leaf, an optional claim hold, and the JetStream KV session record for the duration of one CLI verb.
Acquired fresh by `bones swarm join` (which CAS-creates the session record), resumed by every other swarm verb (which read the existing record and reconstruct the leaf). Per-CLI-invocation lifetime — the leaf and claim die with the lease; the persistent state across verbs is the session record in bones-swarm-sessions[slot], not the lease itself.
See ADR 0031 for the design and the rationale for the AcquireFresh / Resume split. See ADR 0030 for why tests against Lease use real NATS + real Fossil rather than mocks.
func AcquireFresh ¶ added in v0.1.3
func AcquireFresh( ctx context.Context, info workspace.Info, slot, taskID string, opts AcquireOpts, ) (*Lease, error)
AcquireFresh opens a Lease for a slot that has no live session record yet. Used by `bones swarm join`. Steps, in order, with any partial work cleaned up on error:
- Verify `<workspace>/.orchestrator/hub.fossil` exists. If not, return ErrWorkspaceNotBootstrapped — the role-leak guard from PR #54.
- Create the slot's fossil user on the hub repo if missing.
- Open the swarm.Manager (dial NATS or reuse opts.NATSConn).
- Read any existing session record. Active on this host without --force → ErrSessionAlreadyLive. Different host → ErrSessionForeignHost. Stale or --force → CAS-delete and continue.
- Open coord.Leaf with Autosync=true and claim the task.
- Put the session record (CAS create) and write the pid file.
Returns a live Lease the caller MUST Release (when the session record should persist for later verbs) or Close (when the slot's work is done and the record should be deleted).
func Resume ¶ added in v0.1.3
func Resume( ctx context.Context, info workspace.Info, slot string, opts AcquireOpts, ) (*Lease, error)
Resume opens a Lease for a slot whose session record already exists in KV. Used by every swarm verb other than join.
Resume does NOT re-take the claim — verbs that need a claim (Lease.Commit / Lease.Close) re-acquire it themselves. This mirrors today's CLI behavior: claim holds have a TTL that outlives a single verb only when bumped, and the record's LastRenewed is the durable liveness signal.
Resume refuses cross-host operations: if the session record's Host field doesn't match this machine, returns ErrCrossHostOperation. The leaf process bones can manipulate lives on the owning host, so the verb has to run there.
Returns ErrSessionNotFound if the slot has no record. Returns the underlying NATS / Fossil errors otherwise.
func (*Lease) Close ¶ added in v0.1.3
Close terminates the lease, removes the host-local pid file, and deletes the session record via a CAS gate against the revision the lease holds. When CloseOpts.CloseTaskOnSuccess is true the underlying task is also transitioned to closed in the bones-tasks bucket via Leaf.Close. Idempotent — a second Close after a successful first is a no-op.
Steps, in order:
- Re-claim the lease's task (Resume did not claim it).
- If CloseTaskOnSuccess: Leaf.Close — closes the task and releases the claim hold in one operation. Otherwise just release the claim.
- Stop the leaf.
- Remove the host-local pid file.
- CAS-delete the session record.
- Tear down the swarm.Manager + NATS connection.
`bones swarm close` calls this. The CAS gate on step 5 ensures we don't delete a record some other process renewed; if it raced us, the caller sees the underlying ErrCASConflict and can decide whether to re-Resume and retry.
func (*Lease) Commit ¶ added in v0.1.3
func (l *Lease) Commit( ctx context.Context, message string, files []coord.File, ) (CommitResult, error)
Commit takes a fresh claim on the lease's task, announces holds for the file paths, commits the bytes via the underlying coord.Leaf, releases the claim, stops the leaf, HTTP-pushes the slot's leaf.fossil to the hub via /xfer, and CAS-bumps LastRenewed on the session record.
Returns CommitResult with the local-commit UUID always set on success. PushResult / PushErr report the hub-push outcome; the hub may be unreachable without rolling back the local commit. RenewErr reports the session-record CAS bump; CAS conflicts are silently treated as success (a sibling renewer raced and bumped the TTL on our behalf).
After Commit returns, the lease's underlying coord.Leaf has been stopped — the push path needs an exclusive libfossil.Repo handle on leaf.fossil. The lease can still be Released or Closed; doing other verb work on the lease's leaf after Commit is undefined and would have to re-Resume.
func (*Lease) FossilUser ¶ added in v0.1.3
FossilUser returns the slot's fossil user (e.g. "slot-rendering").
func (*Lease) HubURL ¶ added in v0.1.3
HubURL returns the hub fossil HTTP URL the lease's leaf is peered against.
func (*Lease) Release ¶ added in v0.1.3
Release closes the lease without deleting the session record. Stops the leaf, releases any held claim, closes the swarm Manager, and (if the lease owns the NATS connection) closes that too. Idempotent.
`bones swarm join` calls Release after writing the session record so subsequent verbs can Resume against it. `bones swarm close` calls Close instead, which also deletes the record.
func (*Lease) SessionRevision ¶ added in v0.1.3
SessionRevision returns the JetStream KV revision the lease captured at acquisition. Test-only utility — callers that want to inspect the underlying record can pair this with a separate swarm.Manager.Get; production verbs go through Commit / Close.
type Manager ¶
type Manager struct {
// contains filtered or unexported fields
}
Manager owns the JetStream KV bucket holding swarm session records. Every public method is safe to call concurrently. Close is idempotent. Unlike presence.Manager, there is no heartbeat goroutine — callers (the bones swarm verbs) drive renewal via Update.
func Open ¶
Open creates (or reattaches to) the swarm sessions KV bucket and returns a Manager. Caller must Close at shutdown. Open does not dial NATS: the connection comes pre-wired so reconnect policy stays a single-source concern (same shape as presence.Open).
func (*Manager) BucketName ¶
BucketName returns the bucket name in use. Useful for diagnostics and for `bones doctor` to mention the right bucket in its messages.
func (*Manager) Close ¶
Close marks the Manager closed so subsequent public calls return ErrClosed. Safe to call more than once. Does not delete bucket contents; sessions outlive the process that wrote them (TTL eventually evicts).
func (*Manager) Delete ¶
Delete removes the session record at slot via a CAS gate against expectedRev. ErrCASConflict on revision mismatch — callers should re-Get and retry only if they still want to delete the (newer) record. ErrNotFound if the key was already missing.
Delete is the close path: `bones swarm close` removes the session after posting the dispatch result and stopping the leaf.
func (*Manager) Get ¶
Get reads the session record for slot. Returns ErrNotFound if no record exists. The returned revision is the JetStream KV sequence number suitable for a follow-up CAS Update or Delete.
func (*Manager) List ¶
List returns every live session in the bucket. Order matches the underlying JetStream KV list order (key-name lexicographic in practice; callers that need a specific order should sort).
func (*Manager) Put ¶
Put writes the session record for sess.Slot as a fresh entry. Fails (with ErrCASConflict) if a record already exists at that slot — callers must Delete first to take over an abandoned slot. Mirrors the create-or-update split in internal/tasks.
func (*Manager) Update ¶
Update overwrites the session record for sess.Slot using a CAS gate against expectedRev. Returns ErrCASConflict if the bucket's current revision differs from expectedRev — another writer raced ours and the caller should re-Get and decide whether to retry.
Update is the heartbeat path: `bones swarm commit` reads the current session, updates LastRenewed, and CAS-writes back.
type Session ¶
type Session struct {
// Slot is the slot name (matches the plan's `[slot: X]` annotation).
// Doubles as the KV bucket key — slot is the primary identifier.
Slot string `json:"slot"`
// TaskID identifies the task this slot is currently working on.
// Lookup against bones-tasks for full task state; this field is a
// denormalized pointer used by `bones swarm status` and the close
// verb so they don't need a second lookup.
TaskID string `json:"task_id"`
// AgentID is "slot-<name>" — the slot's fossil + coord identity.
// Matches the agent_id used for the holds and presence buckets so
// claim ownership and liveness join cleanly across substrates.
AgentID string `json:"agent_id"`
// Host is os.Hostname() of the machine running the leaf. Different
// from the workspace host means another machine owns this slot;
// `bones swarm` verbs abort with a clear cross-host error rather
// than try to manage a remote leaf process.
Host string `json:"host"`
// LeafPID is the host-local PID of the bones CLI process that
// wrote this record at join time. Informational only — every
// swarm verb is its own short-lived CLI invocation, so this PID
// is already-dead by the time any later verb reads it. Kept as a
// debugging crumb (a pid that was alive at join means the join
// did happen on this host) but NOT a liveness signal; use
// LastRenewed for active/stale classification instead.
LeafPID int `json:"leaf_pid"`
// HubURL is the hub fossil HTTP URL recorded at join time. Later
// verbs read this so they can post-commit push the slot's
// leaf.fossil to the hub via /xfer without re-deriving the URL
// from a flag. Empty on records written by older clients;
// callers fall back to the join-flag default.
HubURL string `json:"hub_url,omitempty"`
// StartedAt is when this session record was first written.
// Immutable once set; later renewals only touch LastRenewed.
StartedAt time.Time `json:"started_at"`
// LastRenewed is updated on every `swarm commit` (which heartbeats
// the session) and used by `bones doctor` to flag stale entries.
// This is the canonical active/stale signal — see LeafPID for the
// "why we don't use the pid" discussion.
LastRenewed time.Time `json:"last_renewed"`
}
Session is the per-slot record persisted in the bones-swarm-sessions JetStream KV bucket. One record per active slot in a workspace. Slot names are workspace-scoped because the bucket itself is per- NATS-deployment.
Marshaled as JSON for human-readable debugging via `nats kv get`. All time fields are RFC3339-encoded UTC. New fields must default safely on a missing-key read so older clients can read newer records (additive evolution only).