probe

package
v0.9.0 Latest
Warning

This package is not in the latest version of its module.

Published: Apr 22, 2026 | License: GPL-3.0, LGPL-3.0 | Imports: 9 | Imported by: 0

Documentation

Overview

Package probe implements readiness probes for cluster services.

Constants

This section is empty.

Variables

This section is empty.

Functions

func ContainerRunning

func ContainerRunning(ctx context.Context, client *docker.Client, name docker.ContainerName) error

ContainerRunning verifies that the container is in the "running" state. It returns a TerminalError for states that cannot recover on their own (exited, dead).

func MungeReady

func MungeReady(ctx context.Context, client *docker.Client, name docker.ContainerName) error

MungeReady verifies that the munge authentication service is active.

func SSHDReady

func SSHDReady(ctx context.Context, client *docker.Client, name docker.ContainerName) error

SSHDReady verifies that sshd is accepting connections and responding with the SSH protocol banner on port 22.

func SlurmctldReady

func SlurmctldReady(ctx context.Context, client *docker.Client, name docker.ContainerName) error

SlurmctldReady verifies that slurmctld is responding to RPC requests.

func SlurmdReady

func SlurmdReady(ctx context.Context, client *docker.Client, name docker.ContainerName) error

SlurmdReady verifies that the slurmd service is active.

func Snapshot added in v0.9.0

func Snapshot(ctx context.Context, client *docker.Client, name docker.ContainerName, role config.Role) (map[Service]bool, error)

Snapshot returns a one-shot readiness snapshot of a running node, fusing the systemd-based checks (munge, sshd, and, on workers, slurmd) into a single docker exec. Controllers additionally run scontrol ping because "slurmctld is active" is weaker than "slurmctld answers RPCs": the unit can be active during startup while RPCs still fail.

Snapshot is intended for status-query call sites such as cluster.GetStatus. Unlike the individual *Ready probes, it does not surface per-probe errors; a failing check simply maps to false. Callers that need retry granularity (e.g. cluster-create readiness polling) should keep using NodeProbes and UntilReady.

The container must be running. Non-exit errors (daemon unreachable, etc.) are propagated; a non-zero exit from systemctl (at least one unit inactive) is expected and parsed normally.
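To make the "a failing check simply maps to false" behavior concrete, here is a rough sketch of how fused output could be turned into a map. It assumes the single exec prints one state per line in the order the units were queried, as `systemctl is-active unit1 unit2 ...` does; parseActive is a hypothetical helper, and the real parsing in Snapshot may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Service mirrors the documented type; only three constants are repeated here.
type Service string

const (
	ServiceMunge  Service = "munge"
	ServiceSSHD   Service = "sshd"
	ServiceSlurmd Service = "slurmd"
)

// parseActive pairs each queried unit with the output line at the same
// position. A unit whose line is missing, or is anything other than
// "active", maps to false. Hypothetical helper; not part of package probe.
func parseActive(units []Service, output string) map[Service]bool {
	lines := strings.Split(strings.TrimSpace(output), "\n")
	result := make(map[Service]bool, len(units))
	for i, u := range units {
		result[u] = i < len(lines) && strings.TrimSpace(lines[i]) == "active"
	}
	return result
}

func main() {
	units := []Service{ServiceMunge, ServiceSSHD, ServiceSlurmd}
	snap := parseActive(units, "active\nactive\ninactive\n")
	for _, u := range units {
		fmt.Printf("%s: %t\n", u, snap[u])
	}
}
```

No per-unit error is kept in this sketch, which is the trade-off the documentation describes: good enough for status output, too coarse for retry decisions.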

func SystemdReady

func SystemdReady(ctx context.Context, client *docker.Client, name docker.ContainerName) error

SystemdReady verifies that systemd has finished booting. Both "running" and "degraded" are considered ready, since degraded means systemd completed startup but some units failed (which is expected when Slurm daemons haven't been configured yet).
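The "running or degraded" rule can be sketched as a small classifier. This is an illustration only: systemdBooted is a hypothetical name, and it assumes the state string has the format printed by `systemctl is-system-running`; the documentation above does not say which command the probe runs.

```go
package main

import (
	"fmt"
	"strings"
)

// systemdBooted reports whether a system state string means that boot
// has finished. "degraded" counts as booted because startup completed
// even though some units failed. Hypothetical helper; not part of
// package probe.
func systemdBooted(output string) bool {
	switch strings.TrimSpace(output) {
	case "running", "degraded":
		return true
	default:
		// For example "initializing", "starting", or "stopping".
		return false
	}
}

func main() {
	for _, out := range []string{"running\n", "degraded\n", "starting\n"} {
		fmt.Printf("%q ready=%t\n", out, systemdBooted(out))
	}
}
```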

func UntilReady

func UntilReady(ctx context.Context, client *docker.Client, name docker.ContainerName, probes []Probe, interval time.Duration) error

UntilReady polls the given probes until they all pass or the context expires. The caller controls the deadline via the context. The interval controls the delay between polling attempts. On timeout, the error includes the name and message of the last failing probe.

func UntilReadyWithEvents added in v0.9.0

func UntilReadyWithEvents(ctx context.Context, client *docker.Client, name docker.ContainerName, probes []Probe, interval time.Duration, events <-chan monitor.Event) error

UntilReadyWithEvents is like UntilReady but also listens for events from a monitor. Events trigger immediate probe re-evaluation instead of waiting for the next poll interval, reducing detection latency for event-backed state transitions. Container die events are treated as terminal errors.

Types

type Func

type Func func(ctx context.Context, client *docker.Client, name docker.ContainerName) error

Func is a probe function that checks a single readiness condition.

type Probe

type Probe struct {
	Name  string
	Check Func
}

Probe is a named readiness check.

func ForService added in v0.7.0

func ForService(svc Service) Probe

ForService returns the readiness probe for a Slurm daemon service.

func NodeProbes

func NodeProbes(role config.Role) []Probe

NodeProbes returns the probes applicable to a node with the given role.

type Service added in v0.9.0

type Service string

Service identifies a per-node readiness check. The string value is the systemd unit name for munge/sshd/slurmd and the Slurm RPC endpoint name for slurmctld, so it also doubles as the user-facing label for each check in status output.

const (
	ServiceMunge     Service = "munge"
	ServiceSSHD      Service = "sshd"
	ServiceSlurmctld Service = "slurmctld"
	ServiceSlurmd    Service = "slurmd"
)

Per-node readiness services managed by sind.

func ServiceForRole added in v0.9.0

func ServiceForRole(role config.Role) (Service, bool)

ServiceForRole returns the Slurm readiness-check service associated with a node role. It returns an empty string and false for roles with no Slurm service (e.g. submitter).

type TerminalError added in v0.7.1

type TerminalError struct {
	Msg string
}

TerminalError indicates a probe failure that cannot be recovered by retrying. For example, a container in "exited" or "dead" state will never become "running" on its own.

func (*TerminalError) Error added in v0.7.1

func (e *TerminalError) Error() string
