Documentation
Overview
Package workerheal detects and recovers worker units stuck in systemd's "failed" state. The detector is deliberately cheap: it walks the existing batched unit-state cache shared with the dashboard, so polling adds no extra systemctl calls even on busy installs. The healer is a single primitive: reset-failed followed by start. It never writes .lerd.yaml or rewrites unit files; that belongs to `lerd worker add/remove/start/stop` and `lerd init`.
Functions
func HealUnit
HealUnit clears any failed state and starts the named worker unit. It is the single "fix this" primitive: every surface (CLI, UI, TUI, MCP) goes through it. Crucially, it does not touch .lerd.yaml or rewrite the unit file, because a failed worker is a transient runtime condition, not a change of user intent. The reset-failed step is implicit: on Linux, systemd.DBusStartUnit calls DBusResetFailed first; on macOS, launchd's bootstrap path replaces the job entirely.
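The reset-then-start ordering described above can be sketched as follows. This is only an illustration under stated assumptions, not the package's implementation: `unitManager`, `healUnit`, and `fakeManager` are hypothetical names, and the real HealUnit signature is not shown in this documentation.

```go
package main

import (
	"errors"
	"fmt"
)

// unitManager is a hypothetical abstraction over the service manager
// (systemd on Linux, launchd on macOS). It is not part of workerheal.
type unitManager interface {
	ResetFailed(unit string) error
	Start(unit string) error
}

// healUnit clears the failed state and then starts the unit.
// It deliberately touches no configuration or unit files.
func healUnit(m unitManager, unit string) error {
	if unit == "" {
		return errors.New("empty unit name")
	}
	if err := m.ResetFailed(unit); err != nil {
		return fmt.Errorf("reset-failed %s: %w", unit, err)
	}
	if err := m.Start(unit); err != nil {
		return fmt.Errorf("start %s: %w", unit, err)
	}
	return nil
}

// fakeManager records calls so the ordering can be checked without systemd.
type fakeManager struct{ calls []string }

func (f *fakeManager) ResetFailed(unit string) error {
	f.calls = append(f.calls, "reset-failed "+unit)
	return nil
}

func (f *fakeManager) Start(unit string) error {
	f.calls = append(f.calls, "start "+unit)
	return nil
}

func main() {
	m := &fakeManager{}
	// "example-worker.service" is a made-up unit name.
	if err := healUnit(m, "example-worker.service"); err != nil {
		fmt.Println("heal failed:", err)
		return
	}
	fmt.Println(m.calls)
}
```

Keeping the primitive behind a small interface is one way to let every surface share the same code path while still being testable without a running service manager.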
Types
type Event
type Event struct {
	Phase string `json:"phase"` // "starting" | "healed" | "failed" | "done"
	Site  string `json:"site,omitempty"`
	Unit  string `json:"unit,omitempty"`
	Error string `json:"error,omitempty"`
}
Event is one line in the streaming heal report. Dashboard, MCP, and TUI all consume these so progress is visible without polling.
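As a rough sketch of what "one line in the streaming heal report" could look like on the wire, the example below encodes events as newline-delimited JSON. The Event struct is copied from the documentation above; the newline-delimited framing, the `encodeEvents` helper, and the site and unit names are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Event mirrors the struct documented above.
type Event struct {
	Phase string `json:"phase"`
	Site  string `json:"site,omitempty"`
	Unit  string `json:"unit,omitempty"`
	Error string `json:"error,omitempty"`
}

// encodeEvents is a hypothetical helper that writes one JSON object per line.
// json.Encoder.Encode appends a newline after each value.
func encodeEvents(events []Event) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	for _, ev := range events {
		if err := enc.Encode(ev); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

func main() {
	out, err := encodeEvents([]Event{
		{Phase: "starting", Site: "example", Unit: "example-worker.service"},
		{Phase: "healed", Site: "example", Unit: "example-worker.service"},
		{Phase: "done"},
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Print(out)
}
```

Because Site, Unit, and Error carry `omitempty`, a terminal "done" event serializes to just its phase.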
type Failure
type Failure struct {
	Worker UnhealthyWorker `json:"worker"`
	Err    string          `json:"error"`
}
Failure is one heal attempt that errored.
type Result
type Result struct {
	Healed []UnhealthyWorker `json:"healed"`
	Failed []Failure         `json:"failed"`
}
Result is the aggregate report for non-streaming callers.
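A non-streaming caller might build a Result along these lines. This is a sketch only: the struct definitions are copied from this page, while `healAll`, `failUnit`, and the unit names are hypothetical and stand in for whatever the package actually does internally.

```go
package main

import (
	"errors"
	"fmt"
)

// Types copied from the documentation on this page.
type UnhealthyWorker struct {
	Site      string `json:"site"`
	Worker    string `json:"worker"`
	Unit      string `json:"unit"`
	State     string `json:"state"`
	LastError string `json:"last_error,omitempty"`
}

type Failure struct {
	Worker UnhealthyWorker `json:"worker"`
	Err    string          `json:"error"`
}

type Result struct {
	Healed []UnhealthyWorker `json:"healed"`
	Failed []Failure         `json:"failed"`
}

// healAll is a hypothetical aggregator: it tries every worker and sorts the
// outcomes into Healed and Failed. heal stands in for the real primitive.
func healAll(workers []UnhealthyWorker, heal func(unit string) error) Result {
	// Start from empty (non-nil) slices so JSON output is [] rather than null.
	res := Result{Healed: []UnhealthyWorker{}, Failed: []Failure{}}
	for _, w := range workers {
		if err := heal(w.Unit); err != nil {
			res.Failed = append(res.Failed, Failure{Worker: w, Err: err.Error()})
			continue // one failure does not stop the remaining heals
		}
		res.Healed = append(res.Healed, w)
	}
	return res
}

// failUnit returns a stub heal function that fails only for the given unit.
func failUnit(bad string) func(string) error {
	return func(unit string) error {
		if unit == bad {
			return errors.New("start failed")
		}
		return nil
	}
}

func main() {
	workers := []UnhealthyWorker{
		{Site: "example", Worker: "queue", Unit: "a.service", State: "failed"},
		{Site: "example", Worker: "schedule", Unit: "b.service", State: "failed"},
	}
	res := healAll(workers, failUnit("b.service"))
	fmt.Println(len(res.Healed), "healed,", len(res.Failed), "failed")
}
```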
type UnhealthyWorker
type UnhealthyWorker struct {
	Site      string `json:"site"`
	Worker    string `json:"worker"`
	Unit      string `json:"unit"`
	State     string `json:"state"` // "failed" today; reserved for future "start-limit-hit", "expected-but-stopped"
	LastError string `json:"last_error,omitempty"`
}
UnhealthyWorker is a single failing/stuck worker unit.
func Detect
func Detect() ([]UnhealthyWorker, error)
Detect returns every worker unit systemd considers "failed". Cheap by design: it reads only the existing batched unit-state cache (one systemctl call per 3s, shared with the dashboard's enrichment path) plus sites.yaml. No per-site .lerd.yaml or composer.json reads, no extra subprocess calls. Safe to invoke from a hot endpoint.
The heuristic is kept narrow on purpose: worker units that hit Restart= rate limits or crash repeatedly land in "failed" and stay there until something resets them. "Inactive" is too broad, because users routinely stop individual workers on purpose, and we can't tell intent apart from drift without an explicit per-worker desired-state field.
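The narrow heuristic can be illustrated with a small filter over an in-memory state map. This is a sketch under assumptions: `workerRef`, `detectFailed`, the shape of the cache (unit name to active state), and the unit names are all hypothetical; the real Detect reads the shared unit-state cache and sites.yaml.

```go
package main

import "fmt"

// UnhealthyWorker is copied from the documentation above.
type UnhealthyWorker struct {
	Site      string `json:"site"`
	Worker    string `json:"worker"`
	Unit      string `json:"unit"`
	State     string `json:"state"`
	LastError string `json:"last_error,omitempty"`
}

// workerRef is a hypothetical record tying a site's worker to its unit name.
type workerRef struct {
	Site, Worker, Unit string
}

// detectFailed reports only units whose cached state is exactly "failed".
// "inactive" is skipped on purpose: the user may have stopped the worker.
func detectFailed(states map[string]string, workers []workerRef) []UnhealthyWorker {
	var out []UnhealthyWorker
	for _, w := range workers {
		if states[w.Unit] != "failed" {
			continue
		}
		out = append(out, UnhealthyWorker{
			Site:   w.Site,
			Worker: w.Worker,
			Unit:   w.Unit,
			State:  "failed",
		})
	}
	return out
}

func main() {
	states := map[string]string{
		"a.service": "failed",
		"b.service": "inactive",
		"c.service": "active",
	}
	workers := []workerRef{
		{Site: "example", Worker: "queue", Unit: "a.service"},
		{Site: "example", Worker: "schedule", Unit: "b.service"},
		{Site: "example", Worker: "reverb", Unit: "c.service"},
	}
	for _, w := range detectFailed(states, workers) {
		fmt.Println(w.Site, w.Worker, w.Unit, w.State)
	}
}
```

Since the filter is a pure lookup against already-cached state, it performs no I/O of its own, which is the property that makes a detector like this safe to call from a hot endpoint.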
func Enrich
func Enrich(in []UnhealthyWorker) []UnhealthyWorker
Enrich populates LastError by reading the journal once per unit. It walks the entries in slice order until the per-call budget is hit; any remaining entries keep an empty LastError. It is safe to call with a nil or empty slice. It is intended for the dashboard pre-serialization step, where there are typically 0–3 entries, so the budget is rarely exercised.
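The budgeted walk can be sketched as below. The documentation does not say whether the budget is a time limit or a count, so this example assumes a maximum number of journal reads to keep it deterministic. `enrich`, `maxReads`, and the `lastError` lookup are hypothetical, and whether the real Enrich copies or mutates its input is not stated here.

```go
package main

import "fmt"

// UnhealthyWorker is copied from the documentation above.
type UnhealthyWorker struct {
	Site      string `json:"site"`
	Worker    string `json:"worker"`
	Unit      string `json:"unit"`
	State     string `json:"state"`
	LastError string `json:"last_error,omitempty"`
}

// enrich fills LastError in slice order until maxReads lookups have been made.
// Entries past the budget are returned with LastError left empty.
// lastError stands in for a journal read and is an assumption of this sketch.
func enrich(in []UnhealthyWorker, maxReads int, lastError func(unit string) string) []UnhealthyWorker {
	// Work on a copy so the caller's slice is not modified; a nil input
	// yields an empty result.
	out := make([]UnhealthyWorker, len(in))
	copy(out, in)
	for i := range out {
		if i >= maxReads {
			break // budget exhausted
		}
		out[i].LastError = lastError(out[i].Unit)
	}
	return out
}

func main() {
	in := []UnhealthyWorker{
		{Unit: "a.service", State: "failed"},
		{Unit: "b.service", State: "failed"},
		{Unit: "c.service", State: "failed"},
	}
	lookup := func(unit string) string { return "last log line for " + unit }
	for _, w := range enrich(in, 2, lookup) {
		fmt.Printf("%s: %q\n", w.Unit, w.LastError)
	}
}
```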