raftengine

package
v0.0.0-...-d37b0a0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 3, 2026 License: AGPL-3.0 Imports: 6 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

View Source
var (
	// ErrNotLeader indicates the operation was rejected because the
	// local node is not the Raft leader for the target group.
	// Callers that care about leadership (e.g. lease invalidation
	// logic) should match via errors.Is.
	ErrNotLeader = errors.New("raft engine: not leader")
	// ErrLeadershipLost indicates the local node was leader when the
	// operation began but lost leadership before it could complete.
	ErrLeadershipLost = errors.New("raft engine: leadership lost")
	// ErrLeadershipTransferInProgress indicates a leadership transfer
	// is under way and proposals are being held back.
	ErrLeadershipTransferInProgress = errors.New("raft engine: leadership transfer in progress")
)

Shared sentinel errors that both engine implementations should wrap so callers can test with errors.Is across engine backends.

Functions

This section is empty.

Types

type Admin

type Admin interface {
	LeaderView
	StatusReader
	ConfigReader
	AddVoter(ctx context.Context, id string, address string, prevIndex uint64) (uint64, error)
	// AddLearner attaches a non-voting replica that receives MsgApp /
	// MsgSnap and applies log entries locally but does not contribute
	// to the voter quorum. Use this instead of AddVoter when joining
	// a fresh node so the cluster's effective fault tolerance is not
	// reduced during catch-up. Promote with PromoteLearner once the
	// learner has caught up. See
	// docs/design/2026_04_26_proposed_raft_learner.md.
	AddLearner(ctx context.Context, id string, address string, prevIndex uint64) (uint64, error)
	// PromoteLearner promotes an existing learner to voter. The
	// minAppliedIndex precondition is enforced against the leader's
	// Progress[nodeID].Match before the conf change is proposed: if
	// the learner has not yet caught up to that index, the call
	// returns FailedPrecondition.
	//
	// minAppliedIndex == 0 is REJECTED unless skipMinAppliedCheck is
	// also true, so an operator running a copy-pasted script that
	// omits the catch-up check gets a clean FailedPrecondition
	// instead of a silent quorum stall. Set skipMinAppliedCheck only
	// when the catch-up has been confirmed out-of-band.
	PromoteLearner(ctx context.Context, id string, prevIndex uint64, minAppliedIndex uint64, skipMinAppliedCheck bool) (uint64, error)
	RemoveServer(ctx context.Context, id string, prevIndex uint64) (uint64, error)
	TransferLeadership(ctx context.Context) error
	TransferLeadershipToServer(ctx context.Context, id string, address string) error
	// RegisterLeaderAcquiredCallback registers fn to fire every
	// time the local node's Raft state transitions INTO leader
	// (initial election, re-election, transfer target completion).
	// Callbacks fire on the previous!=Leader → status==Leader edge
	// AFTER the engine has published isLeader, so a callback that
	// calls engine.State() observes StateLeader.
	//
	// Use case: per-shard policy hooks that need to audit a
	// freshly-acquired leadership ("am I still allowed to be
	// leader of this group?"). The SQS HT-FIFO leadership-refusal
	// hook (§8 of the split-queue FIFO design) hangs off this to
	// TransferLeadership when the binary lacks the htfifo
	// capability but a partitioned queue is mapped to this Raft
	// group.
	//
	// Same non-blocking + panic-contained contract as
	// LeaseProvider.RegisterLeaderLossCallback. A callback that
	// needs to do real work (enumerate the catalog, call
	// TransferLeadership) MUST offload to a goroutine.
	//
	// The returned function deregisters this specific registration
	// and is safe to call multiple times.
	RegisterLeaderAcquiredCallback(fn func()) (deregister func())
}

type ApplyIndexAware

type ApplyIndexAware interface {
	SetApplyIndex(idx uint64)
}

ApplyIndexAware is an OPTIONAL extension of StateMachine that lets the engine communicate the Raft entry index of the entry being applied. The engine calls SetApplyIndex IMMEDIATELY before each successful Apply (i.e. on the same goroutine that will then call Apply for the same entry), giving the state machine a chance to thread the index into any downstream sinks that need to record it durably alongside the apply's other side-effects.

Motivation: the §9.1 ErrSidecarBehindRaftLog guard compares the encryption sidecar's recorded raft_applied_index against the engine's AppliedIndex on startup. For that comparison to be useful, the sidecar must record an index inside the SAME crash-durable fsync that mutates the keys[] map — which means the encryption applier needs to know the entry index it is applying. The StateMachine.Apply(data) signature does not carry it, so this opt-in interface is the seam that delivers it without forcing every existing implementation to change.

Implementations MUST treat SetApplyIndex as a strictly local hint (not a replicated input). The engine guarantees no concurrent Apply / SetApplyIndex calls — Raft apply is serial at the engine boundary — so plain field assignment is sufficient for the field this hint backs.

type ConfigReader

type ConfigReader interface {
	Configuration(ctx context.Context) (Configuration, error)
}

type Configuration

type Configuration struct {
	Servers []Server
}

type Engine

type Factory

type Factory interface {
	EngineType() string
	Create(cfg FactoryConfig) (*FactoryResult, error)
}

Factory creates raft engine instances. Engine-specific configuration (timeouts, tick intervals, etc.) is provided at factory construction time; the Create method receives only engine-agnostic parameters.

type FactoryConfig

type FactoryConfig struct {
	LocalID      string
	LocalAddress string
	DataDir      string
	Peers        []Server
	Bootstrap    bool
	StateMachine StateMachine
	// JoinAsLearner records the operator-stated expectation that the
	// local node will be added to the cluster via AddLearner, never
	// AddVoter. The flag is purely an alarm: if a post-apply ConfState
	// lists this node as a voter instead of a learner, the engine
	// emits an ERROR-level structured log and increments a metric.
	// The node keeps running -- by the time the conf change has
	// applied, the node already counts toward quorum and an unilateral
	// shutdown would shrink the cluster's effective fault tolerance.
	// See docs/design/2026_04_26_proposed_raft_learner.md §4.5.
	JoinAsLearner bool
}

FactoryConfig holds engine-agnostic parameters for creating a raft engine.

type FactoryResult

type FactoryResult struct {
	Engine            Engine
	RegisterTransport func(grpc.ServiceRegistrar)
	// Close releases engine-specific resources that are not owned by
	// Engine.Close (e.g. raft log stores, transport managers). Callers
	// must call Engine.Close first to ensure the raft instance is fully
	// shut down before the underlying stores and transports are released.
	Close func() error
}

FactoryResult holds the output of Factory.Create.

type HealthReader

type HealthReader interface {
	CheckServing(ctx context.Context) error
}

type LeaderInfo

type LeaderInfo struct {
	ID      string
	Address string
}

type LeaderView

type LeaderView interface {
	State() State
	Leader() LeaderInfo
	VerifyLeader(ctx context.Context) error
	// LinearizableRead blocks until the returned index is safe to read from the
	// local FSM on that node.
	LinearizableRead(ctx context.Context) (uint64, error)
}

type LeaseProvider

type LeaseProvider interface {
	// LeaseDuration returns the time during which a lease holder can serve
	// reads from local state without re-confirming leadership via ReadIndex.
	LeaseDuration() time.Duration
	// AppliedIndex returns the highest log index applied to the local FSM.
	AppliedIndex() uint64
	// LastQuorumAck returns the monotonic-raw instant at which the
	// engine most recently observed majority liveness on the leader
	// -- i.e. the CLOCK_MONOTONIC_RAW reading at which a quorum of
	// follower Progress entries had responded. The engine maintains
	// this in the background from MsgHeartbeatResp / MsgAppResp traffic
	// on the leader, so a fast-path lease read does not need to issue
	// its own ReadIndex to "warm" the lease.
	//
	// Safety: callers must verify the lease against a single
	// `now := monoclock.Now()` sample:
	//   state == raftengine.StateLeader &&
	//   !now.IsZero() && !ack.IsZero() && !ack.After(now) &&
	//   now.Sub(ack) < LeaseDuration()
	//
	// The !now.IsZero() guard fails closed when the caller's
	// clock_gettime read errored (e.g. seccomp denies it) and
	// monoclock.Now() returned the zero Instant; without it, a
	// persistent clock failure could keep a once-warmed lease valid
	// forever. See kv.engineLeaseAckValid.
	//
	// The monotonic-raw clock (CLOCK_MONOTONIC_RAW on Linux / Darwin;
	// runtime-monotonic fallback on FreeBSD / Windows / others, see
	// internal/monoclock) is immune to NTP rate adjustment and
	// wall-clock step events on the raw-clock platforms, so the
	// comparison stays safe even if the system's time daemon slews
	// or steps the wall clock. The !ack.After(now) guard remains as
	// a defensive fail-closed for a zero / bogus ack reading.
	// LeaseDuration is bounded by electionTimeout - safety_margin,
	// guaranteeing no successor leader has accepted writes within
	// that window.
	//
	// Returns the zero Instant when no quorum has been confirmed yet
	// or when the local node is not the leader. Single-node LEADERS
	// may return a recent monoclock.Now() since self is the quorum;
	// non-leader single-node replicas still return the zero Instant.
	LastQuorumAck() monoclock.Instant
	// RegisterLeaderLossCallback registers fn to be invoked whenever the
	// local node leaves the leader role (graceful transfer, partition
	// step-down, or shutdown). Callers use this to invalidate any
	// leader-local lease they hold so the next read takes the slow path.
	// Multiple callbacks can be registered.
	//
	// Callbacks fire synchronously from the engine's status-refresh
	// / shutdown path and MUST be non-blocking -- each should be a
	// lock-free flag flip (e.g. atomic invalidate). A panicking
	// callback is contained so a bug in one holder cannot break
	// others, but a blocking callback would stall the engine's main
	// loop, so the contract is strict. Lease-read fast paths also
	// guard on engine.State() to close the narrow race between a
	// transition and this callback completing.
	//
	// The returned function deregisters this callback and is safe to
	// call multiple times. Callers whose lifetime is shorter than the
	// engine's (ephemeral Coordinators in tests, for example) MUST
	// invoke the returned deregister when they are done so the engine
	// does not accumulate dead callbacks.
	RegisterLeaderLossCallback(fn func()) (deregister func())
}

LeaseProvider is an optional capability implemented by engines that support leader-local lease reads. Callers that want lease-based reads should type-assert to this interface and fall back to LinearizableRead when the underlying engine does not implement it.

type ProposalResult

type ProposalResult struct {
	CommitIndex uint64
	Response    any
}

type Proposer

type Proposer interface {
	Propose(ctx context.Context, data []byte) (*ProposalResult, error)
	ProposeAdmin(ctx context.Context, data []byte) (*ProposalResult, error)
}

Proposer drives a Raft proposal through the engine and returns once it has been committed (or the context/engine cancels first).

Two semantically distinct entry points, differing ONLY in the §7.1 quiescence-barrier check Stage 6E-2d installs on Propose:

  • Propose carries ordinary user-data and control-plane traffic that may be paused during a raft-envelope cutover. 6E-2d will reject these with ErrEnvelopeCutoverInProgress while the barrier is open so the leader cannot admit a fresh entry at `index > raftEnvelopeCutoverIndex` mid-installation.
  • ProposeAdmin carries proposals that MUST remain admissible across the barrier — the EnableRaftEnvelope cutover entry itself (without this exemption the barrier would deadlock on its own cutover proposal) and ConfChange-time RegisterEncryptionWriter proposals (Stage 7c §3.1, so a new member joining mid-barrier can still register its writer-registry entry).

ProposeAdmin is NOT a wrap-exemption: a payload-wrap layer configured above the engine (kv.wrappedProposer) applies its wrap closure to both methods identically. Admin entries that land at `index > raftEnvelopeCutoverIndex` (a leader-restart registration, a post-cutover RotateDEK, etc.) must carry the AEAD envelope the §6.3 strict-`>` apply hook expects; a cleartext admin entry above cutover would halt the apply loop on unwrap-failure. The lone exception is the EnableRaftEnvelope cutover marker (sits at `index == cutover`, strict-`>` leaves it alone), which is proposed via a raw engine reference and never flows through the wrap layer in the first place.

In the current build the two methods are operationally equivalent (the barrier is still 6E-2d work); the distinction at the call site is the migration the future barrier requires — sites still on Propose would fail closed the moment 6E-2d activates the barrier.

type Server

type Server struct {
	ID       string
	Address  string
	Suffrage string
}

type Snapshot

type Snapshot interface {
	WriteTo(w io.Writer) (int64, error)
	Close() error
}

Snapshot is an owned export handle from the state machine. Callers are responsible for closing it after WriteTo completes.

type State

type State string
const (
	StateFollower  State = "follower"
	StateCandidate State = "candidate"
	StateLeader    State = "leader"
	StateShutdown  State = "shutdown"
	StateUnknown   State = "unknown"
)

type StateMachine

type StateMachine interface {
	Apply(data []byte) any
	// Snapshot should capture a stable export handle quickly. Expensive snapshot
	// serialization belongs in Snapshot.WriteTo, which the engine can run off
	// the main raft loop.
	Snapshot() (Snapshot, error)
	Restore(r io.Reader) error
}

StateMachine is the interface that engine-agnostic state machines must implement. Both the hashicorp and etcd backends use this contract.

type Status

type Status struct {
	State             State
	Leader            LeaderInfo
	Term              uint64
	CommitIndex       uint64
	AppliedIndex      uint64
	LastLogIndex      uint64
	LastSnapshotIndex uint64
	FSMPending        uint64
	NumPeers          uint64
	LastContact       time.Duration
	// LeadTransferee is non-zero on the current leader while a leadership
	// transfer is in progress, and zero otherwise (including on followers).
	// Writers should hold new proposals while this is non-zero, since etcd/raft
	// drops proposals during transfer.
	LeadTransferee uint64
}

type StatusReader

type StatusReader interface {
	Status() Status
}

Directories

Path Synopsis
Package raftenginetest provides a shared conformance test suite for raftengine.Engine implementations.
Package raftenginetest provides a shared conformance test suite for raftengine.Engine implementations.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL