health

package
v0.9.0 Latest
Warning

This package is not in the latest version of its module.
Published: Jun 16, 2026 License: MIT Imports: 24 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func GetKubernetesClient

func GetKubernetesClient() (kubernetes.Interface, error)

GetKubernetesClient creates a Kubernetes client using default resolution ($KUBECONFIG → ~/.kube/config → in-cluster). For an explicit path and diagnostics, use BuildKubeClient.

func ProbeConnection added in v0.7.0

func ProbeConnection(ctx context.Context, client kubernetes.Interface) error

ProbeConnection verifies the Kubernetes API is actually reachable (and basic list RBAC is present) with a bounded timeout, so callers can report an unreachable cluster up-front rather than silently degrading every check.

Types

type Decision

type Decision string

Decision represents the overall verdict on whether to proceed with an update.

const (
	DecisionProceed Decision = "PROCEED"
	DecisionWarn    Decision = "WARN"
	DecisionBlock   Decision = "BLOCK"
)
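
A caller will usually branch on the Decision value. The sketch below redeclares the type locally so it compiles on its own; exitCode and its strict flag are hypothetical and not part of this package.

```go
package main

import "fmt"

// Decision and its constants are copied from the declarations above.
type Decision string

const (
	DecisionProceed Decision = "PROCEED"
	DecisionWarn    Decision = "WARN"
	DecisionBlock   Decision = "BLOCK"
)

// exitCode maps a decision to a process exit status. With strict set,
// warnings are treated as failures. Both the function and the flag are
// illustrative assumptions.
func exitCode(d Decision, strict bool) int {
	switch d {
	case DecisionProceed:
		return 0
	case DecisionWarn:
		if strict {
			return 1
		}
		return 0
	default:
		// BLOCK, and any unrecognized value, fails closed.
		return 1
	}
}

func main() {
	fmt.Println(exitCode(DecisionWarn, false), exitCode(DecisionWarn, true), exitCode(DecisionBlock, false)) // prints "0 1 1"
}
```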

type HealthChecker

type HealthChecker struct {
	// contains filtered or unexported fields
}

HealthChecker runs health checks against an EKS cluster.

func NewChecker

func NewChecker(eksClient *eks.Client, k8sClient kubernetes.Interface, cwClient *cloudwatch.Client, asgClient *autoscaling.Client) *HealthChecker

NewChecker creates a new health checker instance

func (*HealthChecker) CheckClusterCapacity

func (hc *HealthChecker) CheckClusterCapacity(ctx context.Context, clusterName string) HealthResult

CheckClusterCapacity validates that the cluster has sufficient capacity for rolling updates

func (*HealthChecker) CheckControlPlaneMetrics added in v0.8.0

func (hc *HealthChecker) CheckControlPlaneMetrics(ctx context.Context, clusterName string) HealthResult

CheckControlPlaneMetrics gates upgrade readiness on the EKS control plane: etcd database usage against the 8 GiB read-only limit, plus the API-server error rate. It reads the free AWS/EKS CloudWatch metrics, so neither Container Insights nor an agent is required. Clusters below 1.28 don't emit these metrics; the check is then reported as Skipped, not failed. This is a readiness gate, not a surface for browsing utilization (REF-78).
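
The etcd half of the gate comes down to a percentage of the 8 GiB limit. The sketch below shows that arithmetic only; etcdStatus is a hypothetical helper, and the warn and fail thresholds are caller-supplied because the package's real thresholds are not documented here.

```go
package main

import "fmt"

// etcdLimitBytes is the 8 GiB read-only limit named in the description above.
const etcdLimitBytes = 8 << 30

// etcdStatus classifies etcd database usage as a share of the limit.
// warnPct and failPct are assumptions supplied by the caller.
func etcdStatus(usedBytes, warnPct, failPct float64) (pct float64, status string) {
	pct = usedBytes / etcdLimitBytes * 100
	switch {
	case pct >= failPct:
		return pct, "FAIL"
	case pct >= warnPct:
		return pct, "WARN"
	default:
		return pct, "PASS"
	}
}

func main() {
	// 6 GiB used, with illustrative thresholds of 70% and 90%.
	pct, status := etcdStatus(6<<30, 70, 90)
	fmt.Printf("%.0f%% %s\n", pct, status) // prints "75% WARN"
}
```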

func (*HealthChecker) CheckCriticalWorkloads

func (hc *HealthChecker) CheckCriticalWorkloads(ctx context.Context) HealthResult

CheckCriticalWorkloads validates that critical system workloads are running

func (*HealthChecker) CheckNodeHealth

func (hc *HealthChecker) CheckNodeHealth(ctx context.Context, clusterName string) HealthResult

CheckNodeHealth validates that all nodes in the cluster are ready

func (*HealthChecker) CheckNodeUtilization added in v0.8.0

func (hc *HealthChecker) CheckNodeUtilization(ctx context.Context, _ string) HealthResult

CheckNodeUtilization reports live cluster CPU and memory headroom from metrics-server (metrics.k8s.io). This is the "can the remaining nodes absorb a drain" signal, and it includes memory, which the CloudWatch EC2 path can't see. The check is advisory (non-blocking): the CloudWatch capacity check remains the blocking gate. When metrics-server isn't installed, the check is skipped rather than failed.

func (*HealthChecker) CheckPodDisruptionBudgets

func (hc *HealthChecker) CheckPodDisruptionBudgets(ctx context.Context) HealthResult

CheckPodDisruptionBudgets validates PDB configuration for user workloads

func (*HealthChecker) CheckResourceBalance

func (hc *HealthChecker) CheckResourceBalance(ctx context.Context, clusterName string) HealthResult

CheckResourceBalance validates resource distribution and utilization patterns

func (*HealthChecker) CheckServiceQuotas added in v0.8.0

func (hc *HealthChecker) CheckServiceQuotas(ctx context.Context, _ string) HealthResult

CheckServiceQuotas reports EC2 On-Demand vCPU usage against the account quota — the headroom for adding nodes during a scale-up or roll. The limit comes from Service Quotas; current usage from the AWS/Usage CloudWatch metric (the quota API returns only the limit, not usage). Advisory (non-blocking); skips when either client is missing or the limit/usage can't be read.
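
Once the limit and the usage are known, the headroom is simple arithmetic. A minimal sketch follows, assuming both values arrive as plain vCPU counts; vcpuHeadroom is hypothetical, and treating a non-positive limit as "unreadable" is this sketch's choice.

```go
package main

import "fmt"

// vcpuHeadroom returns the unused vCPUs and the share of the quota in use.
// ok is false when the limit is not positive, which this sketch treats as
// "could not be read" so the caller can skip the check.
func vcpuHeadroom(limit, used float64) (free, pctUsed float64, ok bool) {
	if limit <= 0 {
		return 0, 0, false
	}
	free = limit - used
	if free < 0 {
		free = 0 // usage above the limit leaves no headroom
	}
	return free, used / limit * 100, true
}

func main() {
	free, pct, ok := vcpuHeadroom(256, 192)
	fmt.Println(free, pct, ok) // prints "64 75 true"
}
```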

func (*HealthChecker) ListPodDisruptionBudgets added in v0.7.0

func (hc *HealthChecker) ListPodDisruptionBudgets(ctx context.Context) ([]PDBInfo, error)

ListPodDisruptionBudgets returns a structured snapshot of every PDB in user namespaces with its current disruption status. Returns (nil, nil) when no Kubernetes client is configured so callers can degrade gracefully. (REF-4)

func (*HealthChecker) NodegroupReadyCounts added in v0.8.0

func (hc *HealthChecker) NodegroupReadyCounts(ctx context.Context) (map[string]int32, bool)

NodegroupReadyCounts lists the cluster's nodes once and returns the number of Ready=True nodes per managed nodegroup, keyed by nodegroup name (the eks.amazonaws.com/nodegroup label). The boolean result (ok) is false when no Kubernetes client is wired or the list fails, so callers fall back to an honest "unknown" rather than the DesiredSize proxy. A nodegroup with nodes present but none Ready appears with a count of 0; a nodegroup absent from the map (no nodes observed) is treated by callers as 0 ready.
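
The counting rule described above can be sketched without a cluster. The node type here is a pared-down stand-in for a Kubernetes Node, and readyCounts is a hypothetical helper, not the method itself.

```go
package main

import "fmt"

// nodegroupLabel is the label named in the description above.
const nodegroupLabel = "eks.amazonaws.com/nodegroup"

// node is a pared-down stand-in for a Kubernetes Node: its labels and
// whether its Ready condition is True.
type node struct {
	Labels map[string]string
	Ready  bool
}

// readyCounts returns the number of Ready nodes per managed nodegroup.
func readyCounts(nodes []node) map[string]int32 {
	counts := make(map[string]int32)
	for _, n := range nodes {
		ng, ok := n.Labels[nodegroupLabel]
		if !ok {
			continue // not part of a managed nodegroup
		}
		if _, seen := counts[ng]; !seen {
			counts[ng] = 0 // record the group even if none of its nodes are Ready
		}
		if n.Ready {
			counts[ng]++
		}
	}
	return counts
}

func main() {
	nodes := []node{
		{Labels: map[string]string{nodegroupLabel: "green"}, Ready: true},
		{Labels: map[string]string{nodegroupLabel: "green"}, Ready: true},
		{Labels: map[string]string{nodegroupLabel: "blue"}, Ready: false},
		{Labels: map[string]string{}, Ready: true}, // no nodegroup label, ignored
	}
	fmt.Println(readyCounts(nodes)) // prints "map[blue:0 green:2]"
}
```

The explicit zero entry is what lets a caller tell "nodes exist but none are Ready" apart from "no nodes observed".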

func (*HealthChecker) RunAllChecks

func (hc *HealthChecker) RunAllChecks(ctx context.Context, clusterName string) HealthSummary

RunAllChecks executes all health checks and returns a summary. The checks are independent, so they run concurrently; capacity and balance share one instance-discovery + CloudWatch fetch via a lazy snapshot.
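
The concurrency pattern described above can be sketched as follows: each check runs in its own goroutine, and a sync.Once-backed snapshot ensures the expensive fetch happens at most once. All names here (snapshot, lazySnapshot, runAll) are hypothetical, and the real method may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// snapshot stands in for the shared instance-discovery + CloudWatch data.
type snapshot struct{ instances int }

// lazySnapshot fetches at most once, on first use, even under concurrency.
type lazySnapshot struct {
	once  sync.Once
	fetch func() snapshot
	value snapshot
}

func (l *lazySnapshot) get() snapshot {
	l.once.Do(func() { l.value = l.fetch() })
	return l.value
}

// runAll runs every check in its own goroutine and returns the results in
// the order the checks were given.
func runAll(checks []func() string) []string {
	results := make([]string, len(checks))
	var wg sync.WaitGroup
	for i, check := range checks {
		wg.Add(1)
		go func(i int, check func() string) {
			defer wg.Done()
			results[i] = check() // each goroutine writes only its own slot
		}(i, check)
	}
	wg.Wait()
	return results
}

func main() {
	fetches := 0
	snap := &lazySnapshot{fetch: func() snapshot {
		fetches++ // runs inside sync.Once, so no extra locking is needed
		return snapshot{instances: 3}
	}}
	results := runAll([]func() string{
		func() string { return fmt.Sprintf("capacity: %d instances", snap.get().instances) },
		func() string { return fmt.Sprintf("balance: %d instances", snap.get().instances) },
		func() string { return "nodes: ok" },
	})
	fmt.Println(results, "fetches:", fetches)
}
```

If neither snapshot-using check runs, the fetch never happens, which is the point of making it lazy.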

func (*HealthChecker) SetNodeMetrics added in v0.8.0

func (hc *HealthChecker) SetNodeMetrics(m NodeMetricsLister)

SetNodeMetrics attaches a metrics-server node-metrics lister, enabling the live utilization check. Without it, CheckNodeUtilization is skipped.

func (*HealthChecker) SetServiceQuotas added in v0.8.0

func (hc *HealthChecker) SetServiceQuotas(sq serviceQuotaAPI)

SetServiceQuotas attaches a Service Quotas client, enabling the vCPU quota headroom check. The check is skipped unless both this client and a CloudWatch client are present.

type HealthResult

type HealthResult struct {
	Name       string       `json:"name"`
	Status     HealthStatus `json:"status"`
	Score      int          `json:"score"` // 0-100
	Message    string       `json:"message"`
	Details    []string     `json:"details,omitempty"`
	IsBlocking bool         `json:"isBlocking"`
	// Skipped marks a check that could not be evaluated (e.g. no Kubernetes
	// client) rather than measured. Skipped checks are excluded from the
	// OverallScore so a missing prerequisite doesn't silently drag the score.
	Skipped bool `json:"skipped,omitempty"`
}

HealthResult represents the result of a single health check
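
The effect of the Skipped flag on the overall score can be sketched as below. The unweighted mean, and the 0 returned when nothing was evaluated, are assumptions of this sketch rather than the package's documented formula; checkResult and overallScore are hypothetical names.

```go
package main

import "fmt"

// checkResult mirrors only the HealthResult fields this sketch needs.
type checkResult struct {
	Name    string
	Score   int // 0-100
	Skipped bool
}

// overallScore averages the scores of evaluated checks and ignores skipped
// ones, so a missing prerequisite does not pull the score down.
func overallScore(results []checkResult) int {
	sum, n := 0, 0
	for _, r := range results {
		if r.Skipped {
			continue
		}
		sum += r.Score
		n++
	}
	if n == 0 {
		return 0 // nothing was evaluated
	}
	return sum / n
}

func main() {
	results := []checkResult{
		{Name: "nodes", Score: 100},
		{Name: "capacity", Score: 80},
		{Name: "utilization", Skipped: true}, // would count as 0 if included
	}
	fmt.Println(overallScore(results)) // prints "90"
}
```

Had the skipped check been averaged in with its zero score, the result would have been 60 instead of 90.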

type HealthStatus

type HealthStatus string

HealthStatus represents the status of a health check

const (
	StatusPass HealthStatus = "PASS"
	StatusWarn HealthStatus = "WARN"
	StatusFail HealthStatus = "FAIL"
)

type HealthSummary

type HealthSummary struct {
	Results      []HealthResult `json:"results"`
	OverallScore int            `json:"overallScore"`
	Decision     Decision       `json:"decision"`
	Warnings     []string       `json:"warnings,omitempty"`
	Errors       []string       `json:"errors,omitempty"`
}

HealthSummary represents the overall health check results

type KubeDiag added in v0.7.0

type KubeDiag struct {
	Source  string // "--kubeconfig", "KUBECONFIG", "default", "in-cluster", "none"
	Path    string
	Context string
}

KubeDiag describes how the Kubernetes client was (or would be) resolved, so callers can emit an actionable message when the API can't be reached.

func BuildKubeClient added in v0.7.0

func BuildKubeClient(kubeconfigPath string) (kubernetes.Interface, KubeDiag, error)

BuildKubeClient builds a Kubernetes client, preferring an explicit kubeconfig path, then $KUBECONFIG, then ~/.kube/config, then in-cluster config. It returns a KubeDiag describing what was tried (for diagnostics) alongside the client. An explicit --kubeconfig path that doesn't exist is a hard error.

func (KubeDiag) String added in v0.7.0

func (d KubeDiag) String() string

String renders the resolution attempt for diagnostics.

type NodeMetrics

type NodeMetrics struct {
	NodeName      string
	CPUPercent    float64
	MemoryPercent float64
}

NodeMetrics represents resource metrics for a single node

type NodeMetricsLister added in v0.8.0

type NodeMetricsLister interface {
	List(ctx context.Context, opts metav1.ListOptions) (*metricsv1beta1.NodeMetricsList, error)
}

NodeMetricsLister is the slice of the metrics.k8s.io client the utilization check needs. The metrics clientset's NodeMetricses() satisfies it; tests pass a fake.

func BuildMetricsClient added in v0.8.0

func BuildMetricsClient(kubeconfigPath string) (NodeMetricsLister, error)

BuildMetricsClient builds a metrics-server (metrics.k8s.io) node-metrics lister from the same kubeconfig resolution as BuildKubeClient. A config error is returned; metrics-server simply not being installed is NOT an error here — that surfaces at List time, so the utilization check can skip gracefully.

type PDBInfo added in v0.7.0

type PDBInfo struct {
	Namespace          string `json:"namespace" yaml:"namespace"`
	Name               string `json:"name" yaml:"name"`
	DisruptionsAllowed int32  `json:"disruptionsAllowed" yaml:"disruptionsAllowed"`
	CurrentHealthy     int32  `json:"currentHealthy" yaml:"currentHealthy"`
	DesiredHealthy     int32  `json:"desiredHealthy" yaml:"desiredHealthy"`
	ExpectedPods       int32  `json:"expectedPods" yaml:"expectedPods"`
}

PDBInfo is a structured snapshot of one PodDisruptionBudget's disruption status, used by `nodegroup scale --dry-run` to show which PDBs would constrain a scale-down. (REF-4)

func (PDBInfo) AtRisk added in v0.7.0

func (p PDBInfo) AtRisk() bool

AtRisk reports whether this PDB currently allows zero voluntary disruptions, meaning a node drain (as happens during a scale-down) would be blocked until the workload recovers.
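
A minimal sketch of how a caller might use this, assuming AtRisk is equivalent to DisruptionsAllowed == 0 as the description suggests. The lowercase pdbInfo type and drainBlockers are hypothetical stand-ins so the example compiles on its own.

```go
package main

import "fmt"

// pdbInfo mirrors the PDBInfo fields this sketch needs.
type pdbInfo struct {
	Namespace          string
	Name               string
	DisruptionsAllowed int32
}

// atRisk is an assumed equivalent of PDBInfo.AtRisk: true when no voluntary
// disruption is currently allowed.
func (p pdbInfo) atRisk() bool { return p.DisruptionsAllowed == 0 }

// drainBlockers returns "namespace/name" for every PDB that would block a drain.
func drainBlockers(pdbs []pdbInfo) []string {
	var out []string
	for _, p := range pdbs {
		if p.atRisk() {
			out = append(out, p.Namespace+"/"+p.Name)
		}
	}
	return out
}

func main() {
	pdbs := []pdbInfo{
		{Namespace: "shop", Name: "api", DisruptionsAllowed: 1},
		{Namespace: "shop", Name: "db", DisruptionsAllowed: 0},
	}
	fmt.Println(drainBlockers(pdbs)) // prints "[shop/db]"
}
```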

type ResourceAnalysis

type ResourceAnalysis struct {
	CPUStdDev    float64
	MemoryStdDev float64
	MaxCPU       float64
	MaxMemory    float64
	MinCPU       float64
	MinMemory    float64
}

ResourceAnalysis contains analysis of resource distribution. CPUStdDev/MemoryStdDev are the population standard deviation of per-node utilization, in percentage points (a spread measure, not statistical variance).
