Documentation ¶
Overview ¶
Package probe implements readiness probes for cluster services.
Index ¶
- func ContainerRunning(ctx context.Context, client *docker.Client, name docker.ContainerName) error
- func MungeReady(ctx context.Context, client *docker.Client, name docker.ContainerName) error
- func SSHDReady(ctx context.Context, client *docker.Client, name docker.ContainerName) error
- func SlurmctldReady(ctx context.Context, client *docker.Client, name docker.ContainerName) error
- func SlurmdReady(ctx context.Context, client *docker.Client, name docker.ContainerName) error
- func Snapshot(ctx context.Context, client *docker.Client, name docker.ContainerName, ...) (map[Service]bool, error)
- func SystemdReady(ctx context.Context, client *docker.Client, name docker.ContainerName) error
- func UntilReady(ctx context.Context, client *docker.Client, name docker.ContainerName, ...) error
- func UntilReadyWithEvents(ctx context.Context, client *docker.Client, name docker.ContainerName, ...) error
- type Func
- type Probe
- type Service
- type TerminalError
Functions ¶
func ContainerRunning ¶
ContainerRunning verifies that the container is in the "running" state. It returns a TerminalError for states that cannot recover on their own (exited, dead).
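The state handling described above can be illustrated with a minimal sketch. It is not the package's implementation: classifyState is a hypothetical helper, the body of Error is assumed, and treating every state other than "running", "exited", and "dead" as retryable is an assumption of this example.

```go
package main

import "fmt"

// TerminalError mirrors the type documented in this package.
type TerminalError struct {
	Msg string
}

// Error returns Msg. The method body is an assumption; the documentation
// only shows the signature.
func (e *TerminalError) Error() string { return e.Msg }

// classifyState is a hypothetical helper that maps a Docker container
// state string to a probe result.
func classifyState(state string) error {
	switch state {
	case "running":
		return nil
	case "exited", "dead":
		// These states never become "running" on their own.
		return &TerminalError{Msg: fmt.Sprintf("container is %s", state)}
	default:
		// Assumption: any other state (created, restarting, ...) is
		// reported as an ordinary, retryable error.
		return fmt.Errorf("container is %s, not running", state)
	}
}

func main() {
	for _, s := range []string{"running", "restarting", "exited"} {
		err := classifyState(s)
		_, terminal := err.(*TerminalError)
		fmt.Printf("%-10s err=%v terminal=%t\n", s, err, terminal)
	}
}
```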
func MungeReady ¶
MungeReady verifies that the munge authentication service is active.
func SSHDReady ¶
SSHDReady verifies that sshd is accepting connections and responding with the SSH protocol banner on port 22.
func SlurmctldReady ¶
SlurmctldReady verifies that slurmctld is responding to RPC requests.
func SlurmdReady ¶
SlurmdReady verifies that the slurmd service is active.
func Snapshot ¶ added in v0.9.0
func Snapshot(ctx context.Context, client *docker.Client, name docker.ContainerName, role config.Role) (map[Service]bool, error)
Snapshot returns a one-shot readiness snapshot of a running node, fusing the systemd-based checks (munge, sshd, and, on workers, slurmd) into a single docker exec. Controllers additionally run scontrol ping, because "slurmctld is active" is a weaker guarantee than "slurmctld answers RPCs": the unit can be active during startup while RPCs still fail.
Snapshot is intended for status-query call sites such as cluster.GetStatus. Unlike the individual *Ready probes, it does not surface per-probe errors; a failing check simply maps to false. Callers that need retry granularity (e.g. cluster-create readiness polling) should keep using NodeProbes and UntilReady.
The container must be running. Errors other than a non-zero exit status (daemon unreachable, etc.) are propagated; a non-zero exit from systemctl (at least one unit inactive) is expected and parsed normally.
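The parsing step might look like the sketch below. It assumes the fused check is a command such as `systemctl is-active munge sshd slurmd`, which prints one state per line in argument order; the exact command Snapshot runs is not documented here, and parseStates is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// Service mirrors the documented type: the systemd unit name for these checks.
type Service string

// parseStates is a hypothetical helper. It pairs each requested unit with
// the corresponding output line; a missing or non-"active" line maps to false.
func parseStates(units []Service, out string) map[Service]bool {
	lines := strings.Split(strings.TrimSpace(out), "\n")
	res := make(map[Service]bool, len(units))
	for i, u := range units {
		res[u] = i < len(lines) && strings.TrimSpace(lines[i]) == "active"
	}
	return res
}

func main() {
	units := []Service{"munge", "sshd", "slurmd"}
	// Sample output for a worker whose slurmd unit has not started yet.
	fmt.Println(parseStates(units, "active\nactive\ninactive\n"))
}
```

Because a failing check simply becomes false, the caller gets a complete map even when the command exits non-zero.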
func SystemdReady ¶
SystemdReady verifies that systemd has finished booting. Both "running" and "degraded" are considered ready, since degraded means systemd completed startup but some units failed (which is expected when Slurm daemons haven't been configured yet).
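The readiness rule can be expressed as a small predicate over the state string. This sketch assumes the state comes from a command such as `systemctl is-system-running`; systemdReady is a hypothetical name, and the probe's actual mechanism is not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// systemdReady is a hypothetical helper. It treats "running" and "degraded"
// as ready: in both cases systemd has finished booting.
func systemdReady(state string) bool {
	switch strings.TrimSpace(state) {
	case "running", "degraded":
		return true
	default:
		// e.g. "initializing" or "starting": boot is still in progress.
		return false
	}
}

func main() {
	for _, s := range []string{"starting", "degraded\n", "running"} {
		fmt.Printf("%q ready=%t\n", s, systemdReady(s))
	}
}
```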
func UntilReady ¶
func UntilReady(ctx context.Context, client *docker.Client, name docker.ContainerName, probes []Probe, interval time.Duration) error
UntilReady polls the given probes until they all pass or the context expires. The caller controls the deadline via the context. The interval controls the delay between polling attempts. On timeout, the error includes the name and message of the last failing probe.
func UntilReadyWithEvents ¶ added in v0.9.0
func UntilReadyWithEvents(ctx context.Context, client *docker.Client, name docker.ContainerName, probes []Probe, interval time.Duration, events <-chan monitor.Event) error
UntilReadyWithEvents is like UntilReady but also listens for events from a monitor. Events trigger immediate probe re-evaluation instead of waiting for the next poll interval, reducing detection latency for event-backed state transitions. Container die events are treated as terminal errors.
Types ¶
type Probe ¶
Probe is a named readiness check.
func ForService ¶ added in v0.7.0
ForService returns the readiness probe for a Slurm daemon service.
func NodeProbes ¶
NodeProbes returns the probes applicable to a node with the given role.
type Service ¶ added in v0.9.0
type Service string
Service identifies a per-node readiness check. The string value is the systemd unit name for munge/sshd/slurmd and the Slurm RPC endpoint name for slurmctld, so it also doubles as the user-facing label for each check in status output.
type TerminalError ¶ added in v0.7.1
type TerminalError struct {
Msg string
}
TerminalError indicates a probe failure that cannot be recovered by retrying. For example, a container in "exited" or "dead" state will never become "running" on its own.
func (*TerminalError) Error ¶ added in v0.7.1
func (e *TerminalError) Error() string