fencer

package
v1.1.0 Latest
Warning

This package is not in the latest version of its module.
Published: Apr 26, 2021 | License: MIT | Imports: 26 | Imported by: 0

README

Fencing Controller

The Fencing Controller can be used to enable fast failover of workloads when a node goes offline. This is particularly useful when the workload is deployed using a StatefulSet.

To protect data integrity, Kubernetes guarantees that there will never be more than one instance of a StatefulSet Pod running at a time. When a node is determined to be offline, Kubernetes assumes that the node may merely be partitioned from the network and still running the workload. Since Kubernetes is unable to verify that the Pod has been stopped, it errs on the side of caution and does not allow a replacement to start on another node.

For this reason, Kubernetes requires manual intervention to initiate a failover of a StatefulSet Pod.

StorageOS is able to determine when a node can no longer access a volume, and it has protection to ensure that a partitioned or formerly partitioned node cannot continue to write data. It can therefore work with Kubernetes to perform safe, fast failovers of Pods, including those running in StatefulSets.

When StorageOS detects that a node has gone offline or become partitioned, it marks the node offline and performs volume failover operations.

The fencing controller watches for node failures and determines if there are any Pods assigned to the node that have the storageos.com/fenced=true label set and PVCs backed by StorageOS volumes.

When a Pod has StorageOS volumes and they are all healthy, the fencing controller deletes the Pod to allow it to be rescheduled on another node. It also deletes the VolumeAttachments for the corresponding volumes so that they can be immediately attached to the new node.

No changes are made to Pods that have unhealthy StorageOS volumes. This is most likely because a volume was configured without any replicas and the node holding the single copy of the data is offline. In this case it is better to wait for the node to recover.

Fencing works with both dynamically provisioned PVCs and PVCs referencing pre-provisioned volumes.

The fencing feature is opt-in and Pods must have the storageos.com/fenced=true label set to enable fast failover.
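The opt-in check can be illustrated with a minimal Go sketch. The label key and value come from this README; the helper name hasFencingEnabled is hypothetical and is not part of the package's API, and the real controller's check may differ.

```go
package main

import "fmt"

// fencedLabel is the opt-in label key named in this README.
const fencedLabel = "storageos.com/fenced"

// hasFencingEnabled reports whether a Pod's labels opt it in to fencing.
// Hypothetical helper for illustration only.
func hasFencingEnabled(labels map[string]string) bool {
	// Indexing a nil or empty map yields "", so unlabeled Pods are not fenced.
	return labels[fencedLabel] == "true"
}

func main() {
	optedIn := map[string]string{"app": "db", fencedLabel: "true"}
	notOptedIn := map[string]string{"app": "db"}

	fmt.Println("opted in:", hasFencingEnabled(optedIn))
	fmt.Println("not opted in:", hasFencingEnabled(notOptedIn))
}
```

Under this reading, a Pod without the label, or with any value other than "true", is left alone.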

Trigger

The controller's reconcile is triggered by any StorageOS node in an unhealthy state. StorageOS nodes are polled every 5s by default; the interval is configurable with the -node-poll-interval flag. This determines how quickly the fencing controller can react to node failures.

Pods assigned to unhealthy nodes are evaluated immediately on a state change and then every 1m, configurable with the -node-expiry-interval flag. This retry allows Pods whose unhealthy volumes have since recovered to eventually fail over, and Pods that were rescheduled onto an unhealthy node to be re-evaluated for fencing.
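As a rough illustration of the two timing settings, the sketch below parses them with the standard flag package. The flag names and the 5s and 1m defaults come from the text above; the intervals type, the parseIntervals helper, and the help strings are assumptions made for this example and may not match how the real binary wires its flags.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// intervals holds the two timing settings described above.
// Hypothetical type for illustration only.
type intervals struct {
	poll   time.Duration // how often StorageOS nodes are polled
	expiry time.Duration // how often Pods on unhealthy nodes are re-evaluated
}

// parseIntervals reads both flags from args, falling back to the
// defaults stated in the README when a flag is not given.
func parseIntervals(args []string) (intervals, error) {
	fs := flag.NewFlagSet("fencer", flag.ContinueOnError)
	poll := fs.Duration("node-poll-interval", 5*time.Second, "StorageOS node poll interval")
	expiry := fs.Duration("node-expiry-interval", time.Minute, "re-evaluation interval for unhealthy nodes")
	if err := fs.Parse(args); err != nil {
		return intervals{}, err
	}
	return intervals{poll: *poll, expiry: *expiry}, nil
}

func main() {
	cfg, err := parseIntervals([]string{"-node-poll-interval=2s"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("poll:", cfg.poll, "expiry:", cfg.expiry)
}
```

A shorter poll interval lets the controller notice failures sooner at the cost of more frequent API calls.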

Reconcile

When a StorageOS node has been detected offline, the fencing controller performs the following actions:

  • List all Pods running on the failed node.

  • For each Pod:

    • Verify that the Pod has the storageos.com/fenced=true label set; otherwise, ignore the Pod.
    • Retrieve the list of StorageOS PVCs for the Pod. Skip Pods that have no StorageOS PVCs.
    • Verify that the StorageOS volume backing each of the Pod's StorageOS PVCs is healthy. If not, skip the Pod.
    • Delete the Pod.
    • Delete the VolumeAttachments for the StorageOS PVCs.
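
The per-Pod decision in the steps above can be sketched in Go with simplified stand-in types. The pod and volume structs and the shouldFence function are hypothetical and exist only for this example; the real controller works against Kubernetes and StorageOS API objects and performs the deletions through the Kubernetes client.

```go
package main

import "fmt"

// volume is a simplified stand-in for a PVC and its backing volume.
type volume struct {
	name      string
	storageOS bool // true if backed by a StorageOS volume
	healthy   bool
}

// pod is a simplified stand-in for a Kubernetes Pod.
type pod struct {
	name    string
	labels  map[string]string
	volumes []volume
}

// shouldFence applies the checks listed above. It returns the StorageOS
// volumes whose attachments would be deleted, and whether to fence at all.
func shouldFence(p pod) ([]string, bool) {
	// The Pod must have opted in via the label.
	if p.labels["storageos.com/fenced"] != "true" {
		return nil, false
	}
	var names []string
	for _, v := range p.volumes {
		if !v.storageOS {
			continue // only StorageOS volumes are considered
		}
		if !v.healthy {
			return nil, false // any unhealthy StorageOS volume skips the Pod
		}
		names = append(names, v.name)
	}
	// Pods without StorageOS volumes are skipped.
	if len(names) == 0 {
		return nil, false
	}
	return names, true
}

func main() {
	pods := []pod{
		{name: "db-0", labels: map[string]string{"storageos.com/fenced": "true"},
			volumes: []volume{{name: "data", storageOS: true, healthy: true}}},
		{name: "web-0", labels: map[string]string{},
			volumes: []volume{{name: "cache", storageOS: true, healthy: true}}},
	}
	for _, p := range pods {
		vols, ok := shouldFence(p)
		if !ok {
			fmt.Printf("skip %s\n", p.name)
			continue
		}
		fmt.Printf("delete pod %s, then volume attachments for %v\n", p.name, vols)
	}
}
```

Keeping the decision separate from the deletions makes the skip conditions easy to test without a cluster.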

Documentation

Constants

const (
	// DriverName is the name of the StorageOS CSI driver.
	DriverName = "csi.storageos.com"
)

Variables

var (
	// ErrVolumeAttachmentNotFound is returned when a volume attachment was
	// expected but not found.
	ErrVolumeAttachmentNotFound = errors.New("volume attachment not found")

	// ErrUnexpectedVolumeAttacher is returned when a specific attacher
	// was expected but different or not specified.
	ErrUnexpectedVolumeAttacher = errors.New("unexpected volume attacher")

	// ErrNodeTypeAssertion is returned when a type assertion to convert a
	// given object into StorageOS Node fails.
	ErrNodeTypeAssertion = errors.New("failed to convert into StorageOS Node by type assertion")
)
var (
	// ErrNodeNotCached is returned if the node was expected in the cache but
	// not found.
	ErrNodeNotCached = errors.New("node not found in cache")
)
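
Sentinel errors like these are typically matched with errors.Is, which also sees through wrapping. The sketch below uses a local stand-in, errNodeNotCached, with the same message as ErrNodeNotCached; the lookupNode and isNotCached helpers are hypothetical and do not reflect how the package itself returns or checks these errors.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotCached is a local stand-in for the package's ErrNodeNotCached.
var errNodeNotCached = errors.New("node not found in cache")

// lookupNode is a hypothetical cache lookup that wraps the sentinel
// error with extra context using %w.
func lookupNode(cache map[string]bool, name string) error {
	if _, ok := cache[name]; !ok {
		return fmt.Errorf("lookup %q: %w", name, errNodeNotCached)
	}
	return nil
}

// isNotCached reports whether err is, or wraps, the sentinel error.
func isNotCached(err error) bool {
	return errors.Is(err, errNodeNotCached)
}

func main() {
	cache := map[string]bool{"node-a": true}
	if err := lookupNode(cache, "node-b"); isNotCached(err) {
		fmt.Println("node missing from cache:", err)
	}
}
```

Comparing with errors.Is rather than == keeps the check working when callers add context to the error.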

Functions

This section is empty.

Types

type Controller

type Controller struct {
	client.Client
	// contains filtered or unexported fields
}

Controller implements the Stateless-Action controller interface, fencing k8s node pods when they are detected to be unhealthy in StorageOS.

func NewController

func NewController(k8s client.Client, cache *cache.Object, scheme *runtime.Scheme, api NodeFencer, log logr.Logger) (*Controller, error)

NewController returns a Controller that implements pod fencing based on StorageOS node health status.

func (Controller) BuildActionManager

func (c Controller) BuildActionManager(o interface{}) (action.Manager, error)

func (Controller) GetObject

func (c Controller) GetObject(ctx context.Context, key client.ObjectKey) (interface{}, error)

func (Controller) RequireAction

func (c Controller) RequireAction(ctx context.Context, o interface{}) (bool, error)

type NodeFencer

type NodeFencer interface {
	ListNodes(ctx context.Context) ([]client.Object, error)
	GetVolume(ctx context.Context, key client.ObjectKey) (storageos.Object, error)
}

NodeFencer provides access to nodes and the volumes running on them.

type Reconciler

type Reconciler struct {
	client.Client

	actionv1.Reconciler
	// contains filtered or unexported fields
}

Reconciler reconciles StorageOS Node object health with running Pods, deleting them if we know that they are unable to use their storage.

func NewReconciler

func NewReconciler(api NodeFencer, apiReset chan<- struct{}, k8s client.Client, pollInterval time.Duration, expiryInterval time.Duration) *Reconciler

NewReconciler returns a new Node label reconciler.

The resyncInterval determines how often the periodic resync operation should be run.

func (*Reconciler) SetupWithManager

func (r *Reconciler) SetupWithManager(ctx context.Context, mgr ctrl.Manager, workers int, retryInterval time.Duration, timeout time.Duration) error

Directories

Path Synopsis
Package mocks is a generated GoMock package.
