Documentation ¶
Overview ¶
Package agent deals with High Availability tasks in a cluster.

Tasks include:

* Marking nodes that have lost quorum as tainted, repelling new Pods.
* Force-deleting Pods and VolumeAttachments on a node with lost quorum, triggering failover.
* Reconfiguring resources to report IO errors instead of suspending IO when Pods should be stopped.
* Stopping Pods that run on force-io-error resources.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type DrbdConnection ¶
type DrbdConnection struct {
	PeerNodeId      int    `json:"peer-node-id"`
	Name            string `json:"name"`
	ConnectionState string `json:"connection-state"`
	Congested       bool   `json:"congested"`
	PeerRole        string `json:"peer-role"`
	ApInFlight      int    `json:"ap-in-flight"`
	RsInFlight      int    `json:"rs-in-flight"`
	PeerDevices     []struct {
		Volume                 int     `json:"volume"`
		ReplicationState       string  `json:"replication-state"`
		PeerDiskState          string  `json:"peer-disk-state"`
		PeerClient             bool    `json:"peer-client"`
		ResyncSuspended        string  `json:"resync-suspended"`
		Received               int     `json:"received"`
		Sent                   int     `json:"sent"`
		OutOfSync              int     `json:"out-of-sync"`
		Pending                int     `json:"pending"`
		Unacked                int     `json:"unacked"`
		HasSyncDetails         bool    `json:"has-sync-details"`
		HasOnlineVerifyDetails bool    `json:"has-online-verify-details"`
		PercentInSync          float64 `json:"percent-in-sync"`
	} `json:"peer_devices"`
}
type DrbdResourceState ¶
type DrbdResourceState struct {
	Name             string `json:"name"`
	NodeId           int    `json:"node-id"`
	Role             string `json:"role"`
	Suspended        bool   `json:"suspended"`
	SuspendedUser    bool   `json:"suspended-user"`
	SuspendedNoData  bool   `json:"suspended-no-data"`
	SuspendedFencing bool   `json:"suspended-fencing"`
	SuspendedQuorum  bool   `json:"suspended-quorum"`
	ForceIoFailures  bool   `json:"force-io-failures"`
	WriteOrdering    string `json:"write-ordering"`
	Devices          []struct {
		Volume       int    `json:"volume"`
		Minor        int    `json:"minor"`
		DiskState    string `json:"disk-state"`
		Client       bool   `json:"client"`
		Quorum       bool   `json:"quorum"`
		Size         int    `json:"size"`
		Read         int    `json:"read"`
		Written      int    `json:"written"`
		AlWrites     int    `json:"al-writes"`
		BmWrites     int    `json:"bm-writes"`
		UpperPending int    `json:"upper-pending"`
		LowerPending int    `json:"lower-pending"`
	} `json:"devices"`
	Connections []DrbdConnection `json:"connections"`
}
DrbdResourceState is the parsed output of "drbdsetup status --json".
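For illustration, one way such a state could be obtained outside this package; the helper name and error handling are assumptions, only the JSON shape comes from the type above:

import (
	"context"
	"encoding/json"
	"fmt"
	"os/exec"
)

// queryDrbdStatus is a hypothetical helper, not part of this package:
// it runs "drbdsetup status --json" and decodes the result into the
// type declared above. drbdsetup prints a JSON array of resources.
func queryDrbdStatus(ctx context.Context) ([]DrbdResourceState, error) {
	out, err := exec.CommandContext(ctx, "drbdsetup", "status", "--json").Output()
	if err != nil {
		return nil, fmt.Errorf("running drbdsetup: %w", err)
	}

	var states []DrbdResourceState
	if err := json.Unmarshal(out, &states); err != nil {
		return nil, fmt.Errorf("parsing drbdsetup output: %w", err)
	}
	return states, nil
}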
func (*DrbdResourceState) HasQuorum ¶
func (d *DrbdResourceState) HasQuorum() bool
HasQuorum returns true if all local devices have quorum.
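A minimal sketch of that rule, assuming it maps directly onto the Quorum field of each entry in Devices (the actual method body may differ):

// hasQuorum mirrors the documented behavior: quorum holds only if
// every local device reports quorum. With no devices this is
// vacuously true.
func hasQuorum(d *DrbdResourceState) bool {
	for _, dev := range d.Devices {
		if !dev.Quorum {
			return false
		}
	}
	return true
}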
func (*DrbdResourceState) MayPromote ¶
func (d *DrbdResourceState) MayPromote() bool
MayPromote returns the best local approximation of the "may promote" flag from "drbdsetup events2".
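The exact rule lives in the implementation; below is a plausible approximation only, assuming promotion requires local quorum and no peer currently in the Primary role (both are assumptions, not documented guarantees):

// mayPromote is an illustrative approximation, not the actual method
// body: DRBD allows at most one Primary (absent dual-primary setups),
// and promotion additionally requires quorum.
func mayPromote(d *DrbdResourceState) bool {
	if !d.HasQuorum() {
		return false
	}
	for _, conn := range d.Connections {
		if conn.PeerRole == "Primary" {
			return false
		}
	}
	return true
}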
func (*DrbdResourceState) Primary ¶
func (d *DrbdResourceState) Primary() bool
Primary returns true if the local resource is primary.
type DrbdResources ¶
type DrbdResources interface {
	// StartUpdates starts the process of updating the current state of DRBD resources.
	StartUpdates(ctx context.Context) error
	// Get returns the resource state at the time the last update was made.
	Get() []DrbdResourceState
}
DrbdResources keeps track of DRBD resources.
func NewDrbdResources ¶
func NewDrbdResources(resync time.Duration) DrbdResources
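A usage sketch; whether StartUpdates blocks until the context is cancelled is not documented, so the sketch assumes it does and runs it in a goroutine (imports of context, log and time assumed, and the resync interval is an illustrative value):

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

// Resync the full DRBD state every 10 seconds (illustrative value).
resources := NewDrbdResources(10 * time.Second)
go func() {
	if err := resources.StartUpdates(ctx); err != nil {
		log.Printf("DRBD status updates stopped: %v", err)
	}
}()

// Later: read the most recently observed state.
for _, state := range resources.Get() {
	log.Printf("resource %s: primary=%t quorum=%t", state.Name, state.Primary(), state.HasQuorum())
}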
type Options ¶
type Options struct {
	// NodeName is the name of the local node, as used by DRBD and Kubernetes.
	NodeName string
	// RestConfig is the config used to connect to Kubernetes.
	RestConfig *rest.Config
	// DeletionGraceSec is the number of seconds to wait for graceful pod termination in eviction/deletion requests.
	DeletionGraceSec int64
	// ReconcileInterval is the maximum interval between reconciliation runs.
	ReconcileInterval time.Duration
	// ResyncInterval is the maximum interval between resyncing internal caches with Kubernetes.
	ResyncInterval time.Duration
	// DrbdStatusInterval is the maximum interval between DRBD state updates.
	DrbdStatusInterval time.Duration
	// OperationTimeout is the timeout used for reconcile operations.
	OperationTimeout time.Duration
	// FailOverTimeout is the minimum wait between noticing quorum loss and starting the fail-over process.
	FailOverTimeout time.Duration
}
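An illustrative configuration; the durations and node name are assumptions, not documented defaults, and cfg stands in for a *rest.Config obtained elsewhere (for example via rest.InClusterConfig):

// All values below are illustrative assumptions, not documented defaults.
opt := &Options{
	NodeName:           "node-1",
	RestConfig:         cfg,
	DeletionGraceSec:   30,
	ReconcileInterval:  10 * time.Second,
	ResyncInterval:     15 * time.Minute,
	DrbdStatusInterval: 2 * time.Second,
	OperationTimeout:   30 * time.Second,
	FailOverTimeout:    45 * time.Second,
}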
type ReconcileRequest ¶
type ReconcileRequest struct {
	RefTime     time.Time
	Resource    *DrbdResourceState
	Volume      *corev1.PersistentVolume
	Pods        []*corev1.Pod
	Attachments []*storagev1.VolumeAttachment
	Nodes       []*corev1.Node
}
type Reconciler ¶
type Reconciler interface {
	RunForResource(ctx context.Context, req *ReconcileRequest, recorder events.EventRecorder) error
}
func NewFailoverReconciler ¶
func NewFailoverReconciler(opt *Options, client kubernetes.Interface) Reconciler
NewFailoverReconciler creates a reconciler that "fails over" pods that are on storage without quorum.
The reconciler recognizes storage without quorum by:

* The local copy being promotable.
* Pods running on the node.
* Pods mounting the volume read-write (otherwise the promotable info is useless).
* A connection to the peer node that is not connected.

If all of these are true, it waits for a short timeout before starting the actual "fail over" process. The process involves (see the sketch after this list):

* Adding a taint on the node, causing new Pods to avoid the node.
* Evicting all Pods using that volume from the failed node, creating new Pods to replace them.
* Deleting the VolumeAttachment, informing Kubernetes that attaching the volume to a new node is fine.
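A hedged sketch of driving the reconciler by hand; in the real agent the ReconcileRequest is presumably assembled from informer caches, and state, pv, pods, attachments, nodes and recorder here are placeholders for objects obtained elsewhere:

// Build a Kubernetes client from the agent options (standard client-go usage).
client, err := kubernetes.NewForConfig(opt.RestConfig)
if err != nil {
	log.Fatal(err)
}

reconciler := NewFailoverReconciler(opt, client)

req := &ReconcileRequest{
	RefTime:     time.Now(),
	Resource:    &state,      // DrbdResourceState, e.g. from DrbdResources.Get()
	Volume:      pv,          // the PersistentVolume backed by this resource
	Pods:        pods,        // Pods using the volume
	Attachments: attachments, // VolumeAttachments for the volume
	Nodes:       nodes,       // known cluster Nodes
}

runCtx, cancel := context.WithTimeout(context.Background(), opt.OperationTimeout)
defer cancel()

if err := reconciler.RunForResource(runCtx, req, recorder); err != nil {
	log.Printf("failover reconcile failed: %v", err)
}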
func NewForceIoErrorReconciler ¶
func NewForceIoErrorReconciler(opt *Options, client kubernetes.Interface) Reconciler
NewForceIoErrorReconciler creates a reconciler that evicts pods if a volume is reporting IO errors.
If DRBD is in "force IO failures" mode, all opener processes will see IO errors. This lasts until all openers have closed the DRBD device, at which point DRBD starts behaving normally again. For all openers to be closed, all local Pods must be forced to stop. This is what this reconciler does:

* Adding a taint on the node, causing new Pods to avoid the node.
* Evicting all Pods using that volume from the failed node, creating new Pods to replace them.
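The trigger condition maps onto the ForceIoFailures field of DrbdResourceState; a minimal dispatch sketch, where the surrounding loop, request assembly, ctx and recorder are illustrative:

reconciler := NewForceIoErrorReconciler(opt, client)
for _, state := range resources.Get() {
	if !state.ForceIoFailures {
		continue
	}
	// Volume, Pods, Attachments and Nodes omitted for brevity.
	req := &ReconcileRequest{RefTime: time.Now(), Resource: &state}
	if err := reconciler.RunForResource(ctx, req, recorder); err != nil {
		log.Printf("force-io-error reconcile failed: %v", err)
	}
}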
func NewSuspendedPodReconciler ¶
func NewSuspendedPodReconciler(opt *Options) Reconciler
NewSuspendedPodReconciler creates a reconciler that gets suspended Pods to resume termination.
While DRBD is suspending IO, all processes (including Pods and filesystems) using the device are stuck. To resume, one can force DRBD to report IO errors instead. The reconciler does just that whenever a local Pod should be stopped while it is suspended by DRBD. This enables a (relatively) clean shutdown of the resource without a node reboot.
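The suspension itself is visible in the Suspended* fields of DrbdResourceState. How exactly the reconciler forces IO errors is not documented here; one assumed possibility, based on DRBD 9.1's forced demotion, is sketched below (the command and flag are assumptions, not confirmed by this package):

// forceIoErrors is an assumed sketch: newer DRBD versions can abort
// suspended IO with errors by forcing the resource into the Secondary
// role. Whether this package uses this exact command is not documented.
func forceIoErrors(ctx context.Context, resource string) error {
	out, err := exec.CommandContext(ctx, "drbdsetup", "secondary", "--force", resource).CombinedOutput()
	if err != nil {
		return fmt.Errorf("forcing IO errors on %s: %w (%s)", resource, err, out)
	}
	return nil
}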