helmreboot-operator

Module version: v0.1.0
Published: Oct 20, 2025 License: Apache-2.0

HelmReboot Operator

A Kubernetes operator that automatically restarts failed Flux HelmRelease resources when they encounter timeout errors. This operator monitors HelmRelease objects and triggers reconciliation by adding the fluxcd.io/reconcileAt annotation when specific failure conditions are detected.

Overview

The HelmReboot Operator addresses a common issue in GitOps workflows: Flux HelmRelease resources that fail because of temporary network issues, registry timeouts, or other transient errors. Instead of requiring manual intervention, the operator detects these failures and triggers a retry by adding a reconciliation annotation.

Key Features
  • Automatic Recovery: Detects failed HelmRelease resources and triggers automatic retries
  • Smart Detection: Only restarts releases that failed due to specific timeout errors
  • Monitoring Ready: Includes Prometheus metrics and comprehensive logging
  • Lightweight: Minimal resource footprint with efficient reconciliation loops
  • Secure: Follows Kubernetes RBAC best practices with minimal required permissions
  • Well Tested: Comprehensive unit and end-to-end test coverage

How It Works

The operator continuously monitors all HelmRelease resources in the cluster and:

  1. Watches for HelmRelease objects with failed conditions
  2. Detects specific error patterns (e.g., "context deadline exceeded")
  3. Triggers automatic retry by adding the fluxcd.io/reconcileAt annotation
  4. Logs all restart actions for audit and debugging purposes
Supported Error Patterns

Currently, the operator handles:

  • context deadline exceeded - Network timeouts during chart operations

Additional patterns can be added by extending the controller logic.
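
As an illustration of the matching step, the following Go sketch checks a release's status conditions against a list of recoverable message substrings. It is not the operator's actual code: the Condition type, the isRecoverable function, and the assumption that a failure shows up as a Ready condition with status False are all hypothetical. Only the "context deadline exceeded" pattern comes from this README.

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a simplified, hypothetical stand-in for the status
// conditions reported on a HelmRelease (type, status, message).
type Condition struct {
	Type    string
	Status  string
	Message string
}

// recoverablePatterns lists message substrings treated as transient.
// The single entry is the pattern named in this README; further
// patterns would be appended here.
var recoverablePatterns = []string{
	"context deadline exceeded",
}

// isRecoverable reports whether any condition of type Ready with status
// False (an assumption about how failure is signalled) carries a message
// that contains one of the known patterns.
func isRecoverable(conds []Condition) bool {
	for _, c := range conds {
		if c.Type != "Ready" || c.Status != "False" {
			continue
		}
		for _, p := range recoverablePatterns {
			if strings.Contains(c.Message, p) {
				return true
			}
		}
	}
	return false
}

func main() {
	conds := []Condition{
		{Type: "Ready", Status: "False", Message: "Helm install failed: context deadline exceeded"},
	}
	fmt.Println("restart candidate:", isRecoverable(conds))
}
```

Keeping the patterns in one slice means a new error string can be supported by adding a single entry, without touching the matching loop.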

Installation

Prerequisites
  • Kubernetes cluster (v1.20+)
  • Flux v2 installed and running
  • kubectl configured to access your cluster
Quick Start
  1. Install using kubectl:

    kubectl apply -k "https://github.com/sfotiadis/helmreboot-operator//config/default?ref=main"
    
  2. Or build and deploy from source:

    git clone https://github.com/sfotiadis/helmreboot-operator.git
    cd helmreboot-operator
    make deploy
    
  3. Verify installation:

    kubectl get pods -n helmreboot-operator-system
    
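Helm Installation (Coming Soon)
A Helm chart is planned but not yet published (see Roadmap). Once it is available, installation is expected to look like this:

    helm repo add helmreboot-operator https://sfotiadis.github.io/helmreboot-operator
    helm install helmreboot-operator helmreboot-operator/helmreboot-operator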

Configuration

Environment Variables
Variable                    Description                  Default
METRICS_BIND_ADDRESS        Address for metrics server   :8080
HEALTH_PROBE_BIND_ADDRESS   Address for health probes    :8081
LEADER_ELECT                Enable leader election       false
RBAC Permissions

The operator requires the following permissions:

  • get, list, watch on HelmRelease resources
  • patch, update on HelmRelease resources (for adding annotations)

Monitoring

Prometheus Metrics

The operator exposes metrics on the /metrics endpoint:

  • controller_runtime_reconcile_total - Total number of reconciliations
  • controller_runtime_reconcile_errors_total - Total number of reconciliation errors
  • controller_runtime_reconcile_time_seconds - Time spent in reconciliation
Health Checks

Health endpoints are available:

  • GET /healthz - Liveness probe
  • GET /readyz - Readiness probe

Development

Prerequisites
  • Go 1.21+
  • Docker
  • kubectl
  • Kubebuilder 3.0+
Local Development
  1. Clone the repository:

    git clone https://github.com/sfotiadis/helmreboot-operator.git
    cd helmreboot-operator
    
  2. Install dependencies:

    go mod download
    
  3. Run tests:

    make test
    
  4. Run locally against your cluster:

    make install run
    
Building
# Build the binary
make build

# Build the Docker image
make docker-build

# Build and push Docker image
make docker-build-push

# Run tests with coverage
make test

# Run linting
make lint
Testing

The project includes unit, end-to-end, and integration tests:

# Unit tests
make test

# End-to-end tests
make test-e2e

# Integration tests with coverage
make test-integration

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  HelmRelease    │    │  HelmReboot     │    │  Flux           │
│  (Failed)       │───▶│  Operator       │───▶│  Controller     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                              │
                              ▼
                       ┌─────────────────┐
                       │  Add reconcile  │
                       │  annotation     │
                       └─────────────────┘
Controller Logic
  1. Watch Phase: Monitor all HelmRelease resources for status changes
  2. Analysis Phase: Check if the failure matches known recoverable patterns
  3. Action Phase: Add fluxcd.io/reconcileAt annotation to trigger Flux retry
  4. Monitoring Phase: Log actions and update metrics

Roadmap

  • Helm Chart for easy installation
  • Support for additional error patterns
  • Configurable retry delays and limits
  • Dashboard for monitoring restart actions
  • Integration with popular monitoring systems
  • Multi-cluster support

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
