Documentation ¶
Index ¶
- func GetKubernetesClient() (kubernetes.Interface, error)
- func ProbeConnection(ctx context.Context, client kubernetes.Interface) error
- type Decision
- type HealthChecker
- func (hc *HealthChecker) CheckClusterCapacity(ctx context.Context, clusterName string) HealthResult
- func (hc *HealthChecker) CheckControlPlaneMetrics(ctx context.Context, clusterName string) HealthResult
- func (hc *HealthChecker) CheckCriticalWorkloads(ctx context.Context) HealthResult
- func (hc *HealthChecker) CheckNodeHealth(ctx context.Context, clusterName string) HealthResult
- func (hc *HealthChecker) CheckNodeUtilization(ctx context.Context, _ string) HealthResult
- func (hc *HealthChecker) CheckPodDisruptionBudgets(ctx context.Context) HealthResult
- func (hc *HealthChecker) CheckResourceBalance(ctx context.Context, clusterName string) HealthResult
- func (hc *HealthChecker) CheckServiceQuotas(ctx context.Context, _ string) HealthResult
- func (hc *HealthChecker) ListPodDisruptionBudgets(ctx context.Context) ([]PDBInfo, error)
- func (hc *HealthChecker) NodegroupReadyCounts(ctx context.Context) (map[string]int32, bool)
- func (hc *HealthChecker) RunAllChecks(ctx context.Context, clusterName string) HealthSummary
- func (hc *HealthChecker) SetNodeMetrics(m NodeMetricsLister)
- func (hc *HealthChecker) SetServiceQuotas(sq serviceQuotaAPI)
- type HealthResult
- type HealthStatus
- type HealthSummary
- type KubeDiag
- type NodeMetrics
- type NodeMetricsLister
- type PDBInfo
- type ResourceAnalysis
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func GetKubernetesClient ¶
func GetKubernetesClient() (kubernetes.Interface, error)
GetKubernetesClient creates a Kubernetes client using default resolution ($KUBECONFIG → ~/.kube/config → in-cluster). For an explicit path and diagnostics, use BuildKubeClient.
func ProbeConnection ¶ added in v0.7.0
func ProbeConnection(ctx context.Context, client kubernetes.Interface) error
ProbeConnection verifies the Kubernetes API is actually reachable (and basic list RBAC is present) with a bounded timeout, so callers can report an unreachable cluster up-front rather than silently degrading every check.
Types ¶
type Decision ¶
type Decision string
Decision represents the overall decision on whether to proceed with an update.
type HealthChecker ¶
type HealthChecker struct {
// contains filtered or unexported fields
}
HealthChecker performs various health checks on the EKS cluster
func NewChecker ¶
func NewChecker(eksClient *eks.Client, k8sClient kubernetes.Interface, cwClient *cloudwatch.Client, asgClient *autoscaling.Client) *HealthChecker
NewChecker creates a new health checker instance
func (*HealthChecker) CheckClusterCapacity ¶
func (hc *HealthChecker) CheckClusterCapacity(ctx context.Context, clusterName string) HealthResult
CheckClusterCapacity validates that the cluster has sufficient capacity for rolling updates
func (*HealthChecker) CheckControlPlaneMetrics ¶ added in v0.8.0
func (hc *HealthChecker) CheckControlPlaneMetrics(ctx context.Context, clusterName string) HealthResult
CheckControlPlaneMetrics gates upgrade readiness on the EKS control plane: etcd database usage vs the 8 GiB read-only limit, plus the API-server error rate. It reads the free AWS/EKS CloudWatch metrics (no Container Insights / agent). Clusters below 1.28 don't emit these — the check is then Skipped, not failed. This is a readiness GATE, not a utilization-browse surface (REF-78).
func (*HealthChecker) CheckCriticalWorkloads ¶
func (hc *HealthChecker) CheckCriticalWorkloads(ctx context.Context) HealthResult
CheckCriticalWorkloads validates that critical system workloads are running
func (*HealthChecker) CheckNodeHealth ¶
func (hc *HealthChecker) CheckNodeHealth(ctx context.Context, clusterName string) HealthResult
CheckNodeHealth validates that all nodes in the cluster are ready
func (*HealthChecker) CheckNodeUtilization ¶ added in v0.8.0
func (hc *HealthChecker) CheckNodeUtilization(ctx context.Context, _ string) HealthResult
CheckNodeUtilization reports live cluster CPU+memory headroom from metrics-server (metrics.k8s.io) — the "can the remaining nodes absorb a drain" signal, including memory, which the CloudWatch EC2 path can't see. Advisory (non-blocking): the CloudWatch capacity check remains the blocking gate. Skips cleanly (not fails) when metrics-server isn't installed.
func (*HealthChecker) CheckPodDisruptionBudgets ¶
func (hc *HealthChecker) CheckPodDisruptionBudgets(ctx context.Context) HealthResult
CheckPodDisruptionBudgets validates PDB configuration for user workloads
func (*HealthChecker) CheckResourceBalance ¶
func (hc *HealthChecker) CheckResourceBalance(ctx context.Context, clusterName string) HealthResult
CheckResourceBalance validates resource distribution and utilization patterns
func (*HealthChecker) CheckServiceQuotas ¶ added in v0.8.0
func (hc *HealthChecker) CheckServiceQuotas(ctx context.Context, _ string) HealthResult
CheckServiceQuotas reports EC2 On-Demand vCPU usage against the account quota — the headroom for adding nodes during a scale-up or roll. The limit comes from Service Quotas; current usage from the AWS/Usage CloudWatch metric (the quota API returns only the limit, not usage). Advisory (non-blocking); skips when either client is missing or the limit/usage can't be read.
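The headroom arithmetic is simple enough to sketch. The `headroom` helper and the sample numbers below are hypothetical; how the package actually scores or words the result is not documented here.

```go
package main

import "fmt"

// headroom returns the remaining vCPUs and the percentage of the quota
// in use. ok is false when the inputs can't be evaluated, which mirrors
// the "skip rather than fail" behaviour described above.
func headroom(limit, usage float64) (remaining, usedPct float64, ok bool) {
	if limit <= 0 || usage < 0 {
		return 0, 0, false
	}
	return limit - usage, usage / limit * 100, true
}

func main() {
	remaining, usedPct, ok := headroom(256, 192)
	fmt.Println(remaining, usedPct, ok) // 64 75 true
}
```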
func (*HealthChecker) ListPodDisruptionBudgets ¶ added in v0.7.0
func (hc *HealthChecker) ListPodDisruptionBudgets(ctx context.Context) ([]PDBInfo, error)
ListPodDisruptionBudgets returns a structured snapshot of every PDB in user namespaces with its current disruption status. Returns (nil, nil) when no Kubernetes client is configured so callers can degrade gracefully. (REF-4)
func (*HealthChecker) NodegroupReadyCounts ¶ added in v0.8.0
func (hc *HealthChecker) NodegroupReadyCounts(ctx context.Context) (map[string]int32, bool)
NodegroupReadyCounts lists the cluster's nodes once and returns the number of Ready=True nodes per managed nodegroup, keyed by nodegroup name (the eks.amazonaws.com/nodegroup label). The bool result (ok) is false when no Kubernetes client is wired or the list fails, so callers fall back to an honest "unknown" rather than the DesiredSize proxy. A nodegroup with nodes present but none Ready appears with a count of 0; a nodegroup absent from the map (no nodes observed) is treated by callers as 0 ready.
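The counting rule can be illustrated with local types in place of the Kubernetes node objects. The `node` struct and `readyCounts` helper are hypothetical simplifications; only the label key comes from the description above.

```go
package main

import "fmt"

// nodegroupLabel is the label that names a node's managed nodegroup.
const nodegroupLabel = "eks.amazonaws.com/nodegroup"

// node is a hypothetical, trimmed-down view of a Kubernetes node.
type node struct {
	Labels map[string]string
	Ready  bool
}

// readyCounts returns Ready nodes per nodegroup. A nodegroup whose
// nodes are all NotReady still appears in the map with a count of 0.
func readyCounts(nodes []node) map[string]int32 {
	counts := make(map[string]int32)
	for _, n := range nodes {
		ng, ok := n.Labels[nodegroupLabel]
		if !ok {
			continue // not part of a managed nodegroup
		}
		if _, seen := counts[ng]; !seen {
			counts[ng] = 0 // record the nodegroup even if nothing is Ready
		}
		if n.Ready {
			counts[ng]++
		}
	}
	return counts
}

func main() {
	nodes := []node{
		{Labels: map[string]string{nodegroupLabel: "green"}, Ready: true},
		{Labels: map[string]string{nodegroupLabel: "green"}, Ready: true},
		{Labels: map[string]string{nodegroupLabel: "blue"}, Ready: false},
		{Labels: map[string]string{}, Ready: true},
	}
	fmt.Println(readyCounts(nodes)) // map[blue:0 green:2]
}
```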
func (*HealthChecker) RunAllChecks ¶
func (hc *HealthChecker) RunAllChecks(ctx context.Context, clusterName string) HealthSummary
RunAllChecks executes all health checks and returns a summary. The checks are independent, so they run concurrently; capacity and balance share one instance-discovery + CloudWatch fetch via a lazy snapshot.
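One way to picture the concurrency and the shared lazy snapshot is a sync.WaitGroup fan-out combined with a sync.Once. This is a sketch under assumptions: `result`, `snapshot`, and `runAll` are hypothetical names, and the source does not say which primitives RunAllChecks really uses.

```go
package main

import (
	"fmt"
	"sync"
)

// result is a simplified, hypothetical stand-in for HealthResult.
type result struct {
	Name  string
	Score int
}

// snapshot lazily computes a shared value once, no matter how many
// checks ask for it concurrently.
type snapshot struct {
	once    sync.Once
	fetches int
	value   int
}

func (s *snapshot) get() int {
	s.once.Do(func() {
		s.fetches++ // stands in for instance discovery plus a metrics fetch
		s.value = 70
	})
	return s.value
}

// runAll runs every check in its own goroutine and keeps results in
// the order the checks were supplied.
func runAll(checks []func() result) []result {
	out := make([]result, len(checks))
	var wg sync.WaitGroup
	for i, c := range checks {
		wg.Add(1)
		go func(i int, c func() result) {
			defer wg.Done()
			out[i] = c() // each goroutine writes only its own slot
		}(i, c)
	}
	wg.Wait()
	return out
}

func main() {
	snap := &snapshot{}
	results := runAll([]func() result{
		func() result { return result{Name: "capacity", Score: snap.get()} },
		func() result { return result{Name: "balance", Score: snap.get()} },
	})
	fmt.Println(results, "fetches:", snap.fetches)
}
```

Writing each result into a pre-sized slice slot avoids a mutex and keeps the output order deterministic even though the checks finish in any order.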
func (*HealthChecker) SetNodeMetrics ¶ added in v0.8.0
func (hc *HealthChecker) SetNodeMetrics(m NodeMetricsLister)
SetNodeMetrics attaches a metrics-server node-metrics lister, enabling the live utilization check. Without it, CheckNodeUtilization is skipped.
func (*HealthChecker) SetServiceQuotas ¶ added in v0.8.0
func (hc *HealthChecker) SetServiceQuotas(sq serviceQuotaAPI)
SetServiceQuotas attaches a Service Quotas client, enabling the vCPU quota headroom check. The check needs both this client and a CloudWatch client; if either is missing, it is skipped.
type HealthResult ¶
type HealthResult struct {
Name string `json:"name"`
Status HealthStatus `json:"status"`
Score int `json:"score"` // 0-100
Message string `json:"message"`
Details []string `json:"details,omitempty"`
IsBlocking bool `json:"isBlocking"`
// Skipped marks a check that could not be evaluated (e.g. no Kubernetes
// client) rather than measured. Skipped checks are excluded from the
// OverallScore so a missing prerequisite doesn't silently drag the score.
Skipped bool `json:"skipped,omitempty"`
}
HealthResult represents the result of a single health check
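The effect of excluding skipped checks from the score can be shown with a small sketch. The plain average used here is an assumption made for illustration; the source only states that skipped checks are excluded, not how OverallScore is aggregated. `check` and `overallScore` are hypothetical names.

```go
package main

import "fmt"

// check is a hypothetical subset of HealthResult.
type check struct {
	Score   int
	Skipped bool
}

// overallScore averages the scores of the checks that were actually
// measured. Skipped checks contribute neither a score nor a count.
func overallScore(checks []check) (score, measured int) {
	sum := 0
	for _, c := range checks {
		if c.Skipped {
			continue
		}
		sum += c.Score
		measured++
	}
	if measured == 0 {
		return 0, 0 // nothing could be evaluated
	}
	return sum / measured, measured
}

func main() {
	checks := []check{{Score: 100}, {Score: 80}, {Score: 0, Skipped: true}}
	score, measured := overallScore(checks)
	fmt.Println(score, measured) // 90 2
}
```

If the skipped check were counted with its zero score, the same input would average to 60, which is the silent drag the Skipped field is meant to prevent.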
type HealthStatus ¶
type HealthStatus string
HealthStatus represents the status of a health check
const (
StatusPass HealthStatus = "PASS"
StatusWarn HealthStatus = "WARN"
StatusFail HealthStatus = "FAIL"
)
type HealthSummary ¶
type HealthSummary struct {
Results []HealthResult `json:"results"`
OverallScore int `json:"overallScore"`
Decision Decision `json:"decision"`
Warnings []string `json:"warnings,omitempty"`
Errors []string `json:"errors,omitempty"`
}
HealthSummary represents the overall health check results
type KubeDiag ¶ added in v0.7.0
type KubeDiag struct {
Source string // "--kubeconfig", "KUBECONFIG", "default", "in-cluster", "none"
Path string
Context string
}
KubeDiag describes how the Kubernetes client was (or would be) resolved, so callers can emit an actionable message when the API can't be reached.
func BuildKubeClient ¶ added in v0.7.0
func BuildKubeClient(kubeconfigPath string) (kubernetes.Interface, KubeDiag, error)
BuildKubeClient builds a Kubernetes client, preferring an explicit kubeconfig path, then $KUBECONFIG, then ~/.kube/config, then in-cluster config. It returns a KubeDiag describing what was tried (for diagnostics) alongside the client. An explicit --kubeconfig path that doesn't exist is a hard error.
type NodeMetrics ¶
NodeMetrics represents resource metrics for a single node
type NodeMetricsLister ¶ added in v0.8.0
type NodeMetricsLister interface {
List(ctx context.Context, opts metav1.ListOptions) (*metricsv1beta1.NodeMetricsList, error)
}
NodeMetricsLister is the slice of the metrics.k8s.io client the utilization check needs. The metrics clientset's NodeMetricses() satisfies it; tests pass a fake.
func BuildMetricsClient ¶ added in v0.8.0
func BuildMetricsClient(kubeconfigPath string) (NodeMetricsLister, error)
BuildMetricsClient builds a metrics-server (metrics.k8s.io) node-metrics lister from the same kubeconfig resolution as BuildKubeClient. A config error is returned; metrics-server simply not being installed is NOT an error here — that surfaces at List time, so the utilization check can skip gracefully.
type PDBInfo ¶ added in v0.7.0
type PDBInfo struct {
Namespace string `json:"namespace" yaml:"namespace"`
Name string `json:"name" yaml:"name"`
DisruptionsAllowed int32 `json:"disruptionsAllowed" yaml:"disruptionsAllowed"`
CurrentHealthy int32 `json:"currentHealthy" yaml:"currentHealthy"`
DesiredHealthy int32 `json:"desiredHealthy" yaml:"desiredHealthy"`
ExpectedPods int32 `json:"expectedPods" yaml:"expectedPods"`
}
PDBInfo is a structured snapshot of one PodDisruptionBudget's disruption status, used by `nodegroup scale --dry-run` to show which PDBs would constrain a scale-down. (REF-4)
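As a sketch of how a caller might use this snapshot, the example below filters for PDBs that currently allow no disruptions and prints them as JSON. The struct is copied from the definition above; the `blocking` helper and the sample data are hypothetical, and treating DisruptionsAllowed == 0 as "would constrain a scale-down" is an assumption about the caller's logic.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PDBInfo mirrors the struct documented above.
type PDBInfo struct {
	Namespace          string `json:"namespace" yaml:"namespace"`
	Name               string `json:"name" yaml:"name"`
	DisruptionsAllowed int32  `json:"disruptionsAllowed" yaml:"disruptionsAllowed"`
	CurrentHealthy     int32  `json:"currentHealthy" yaml:"currentHealthy"`
	DesiredHealthy     int32  `json:"desiredHealthy" yaml:"desiredHealthy"`
	ExpectedPods       int32  `json:"expectedPods" yaml:"expectedPods"`
}

// blocking returns the PDBs that allow no voluntary disruptions now.
func blocking(pdbs []PDBInfo) []PDBInfo {
	var out []PDBInfo
	for _, p := range pdbs {
		if p.DisruptionsAllowed == 0 {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pdbs := []PDBInfo{
		{Namespace: "shop", Name: "api", CurrentHealthy: 2, DesiredHealthy: 2, ExpectedPods: 2},
		{Namespace: "shop", Name: "web", DisruptionsAllowed: 1, CurrentHealthy: 3, DesiredHealthy: 2, ExpectedPods: 3},
	}
	b, err := json.Marshal(blocking(pdbs))
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(b))
}
```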
type ResourceAnalysis ¶
type ResourceAnalysis struct {
CPUStdDev float64
MemoryStdDev float64
MaxCPU float64
MaxMemory float64
MinCPU float64
MinMemory float64
}
ResourceAnalysis contains analysis of resource distribution. CPUStdDev/MemoryStdDev are the population standard deviation of per-node utilization, in percentage points (a spread measure, not statistical variance).