Version: v0.1.0 (latest) | Published: Oct 20, 2025 | License: Apache-2.0

Repository

github.com/sfotiadis/helmreboot-operator

HelmReboot Operator

A Kubernetes operator that automatically restarts failed Flux HelmRelease resources when they encounter timeout errors. This operator monitors HelmRelease objects and triggers reconciliation by adding the fluxcd.io/reconcileAt annotation when specific failure conditions are detected.

Overview

The HelmReboot Operator addresses a common issue in GitOps workflows: Flux HelmRelease resources that fail because of temporary network issues, registry timeouts, or other transient errors. Instead of requiring manual intervention, the operator detects these failures and triggers a retry by adding a reconciliation annotation.

Key Features

Automatic Recovery: Detects failed HelmRelease resources and triggers automatic retries
Smart Detection: Only restarts releases that failed due to specific timeout errors
Monitoring Ready: Includes Prometheus metrics and comprehensive logging
Lightweight: Minimal resource footprint with efficient reconciliation loops
Secure: Follows Kubernetes RBAC best practices with minimal required permissions
Well Tested: Comprehensive unit and end-to-end test coverage

How It Works

The operator continuously monitors all HelmRelease resources in the cluster and:

Watches for HelmRelease objects with failed conditions
Detects specific error patterns (e.g., "context deadline exceeded")
Triggers automatic retry by adding the fluxcd.io/reconcileAt annotation
Logs all restart actions for audit and debugging purposes

Supported Error Patterns

Currently, the operator handles one pattern:

context deadline exceeded - Network timeouts during chart operations

Additional patterns can be added in the controller logic.
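Pattern detection of this kind can be sketched as a substring check against a list of known messages. The snippet below is an illustration under assumptions, not the operator's actual code: `recoverablePatterns` and `isRecoverable` are hypothetical names, and the sample messages in `main` are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// recoverablePatterns lists message substrings treated as transient.
// Only this one entry comes from the README; further patterns would be
// appended here.
var recoverablePatterns = []string{
	"context deadline exceeded",
}

// isRecoverable reports whether a failure message contains any known pattern.
func isRecoverable(message string) bool {
	for _, p := range recoverablePatterns {
		if strings.Contains(message, p) {
			return true
		}
	}
	return false
}

func main() {
	// Made-up failure messages for illustration.
	fmt.Println(isRecoverable("Helm install failed: context deadline exceeded"))
	fmt.Println(isRecoverable("chart not found"))
}
```

Keeping the patterns in one slice means a new error pattern is a one-line change.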

Installation

Prerequisites

Kubernetes cluster (v1.20+)
Flux v2 installed and running
kubectl configured to access your cluster

Quick Start

Install using kubectl:

kubectl apply -k "github.com/sfotiadis/helmreboot-operator/config/default?ref=main"

Or build and deploy from source:

git clone https://github.com/sfotiadis/helmreboot-operator.git
cd helmreboot-operator
make deploy

Verify installation:

kubectl get pods -n helmreboot-operator-system

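Helm Installation (Coming Soon)

A Helm chart is not yet available; it is listed on the roadmap. Once published, installation is expected to look like this:

helm repo add helmreboot-operator https://sfotiadis.github.io/helmreboot-operator
helm install helmreboot-operator helmreboot-operator/helmreboot-operator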

Configuration

Environment Variables

| Variable | Description | Default |
|---|---|---|
| `METRICS_BIND_ADDRESS` | Address for metrics server | `:8080` |
| `HEALTH_PROBE_BIND_ADDRESS` | Address for health probes | `:8081` |
| `LEADER_ELECT` | Enable leader election | `false` |

RBAC Permissions

The operator requires the following permissions:

get, list, watch on HelmRelease resources
patch, update on HelmRelease resources (for adding annotations)

Monitoring

Prometheus Metrics

The operator exposes metrics on the /metrics endpoint:

controller_runtime_reconcile_total - Total number of reconciliations
controller_runtime_reconcile_errors_total - Total number of reconciliation errors
controller_runtime_reconcile_time_seconds - Time spent in reconciliation
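For a quick look at these counters without a full Prometheus setup, the text served at /metrics can be scanned directly. The sketch below sums all samples of one metric across label sets. `sumMetric` is a hypothetical helper, the sample data in `main` is made up (including the `helmrelease` controller label), and the parser assumes samples carry no trailing timestamp.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up every sample of the named metric in a Prometheus text
// exposition, across all label sets. It assumes samples carry no trailing
// timestamp, so the value is the last whitespace-separated field.
func sumMetric(exposition, name string) (float64, error) {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// Match "name{...} v" or "name v", but not a longer metric name.
		if !strings.HasPrefix(line, name+"{") && !strings.HasPrefix(line, name+" ") {
			continue
		}
		fields := strings.Fields(line)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			return 0, fmt.Errorf("parse %q: %w", line, err)
		}
		total += v
	}
	return total, sc.Err()
}

func main() {
	// Made-up sample data in the Prometheus text format.
	sample := `# TYPE controller_runtime_reconcile_total counter
controller_runtime_reconcile_total{controller="helmrelease",result="success"} 40
controller_runtime_reconcile_total{controller="helmrelease",result="error"} 3
controller_runtime_reconcile_errors_total{controller="helmrelease"} 3
`
	for _, name := range []string{
		"controller_runtime_reconcile_total",
		"controller_runtime_reconcile_errors_total",
	} {
		v, err := sumMetric(sample, name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s = %v\n", name, v)
	}
}
```

Comparing the error total with the reconcile total gives a rough error rate for the controller.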

Health Checks

Health endpoints are available:

GET /healthz - Liveness probe
GET /readyz - Readiness probe

Development

Prerequisites

Go 1.21+
Docker
kubectl
Kubebuilder 3.0+

Local Development

Clone the repository:

git clone https://github.com/sfotiadis/helmreboot-operator.git
cd helmreboot-operator

Install dependencies:
```
go mod download
```
Run tests:
```
make test
```
Run locally against your cluster:
```
make install run
```

Building

# Build the binary
make build

# Build the Docker image
make docker-build

# Build and push Docker image
make docker-build-push

# Run tests with coverage
make test

# Run linting
make lint

Testing

The project provides a separate make target for each test tier:

# Unit tests
make test

# End-to-end tests
make test-e2e

# Integration tests with coverage
make test-integration

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  HelmRelease    │    │  HelmReboot     │    │  Flux           │
│  (Failed)       │───▶│  Operator       │───▶│  Controller     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                              │
                              ▼
                       ┌─────────────────┐
                       │  Add reconcile  │
                       │  annotation     │
                       └─────────────────┘

Controller Logic

Watch Phase: Monitor all HelmRelease resources for status changes
Analysis Phase: Check if the failure matches known recoverable patterns
Action Phase: Add fluxcd.io/reconcileAt annotation to trigger Flux retry
Monitoring Phase: Log actions and update metrics
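The four phases can be sketched as a small, dependency-free Go program. Everything here is a simplified stand-in: `condition` and `release` are hypothetical types rather than the Flux API types, `annotate` is a callback in place of the real API call, and treating a `Ready` condition with status `False` as the failure signal is an assumption about how the controller reads HelmRelease status.

```go
package main

import (
	"fmt"
	"strings"
)

// condition is a simplified stand-in for a Kubernetes status condition.
type condition struct {
	Type    string // e.g. "Ready"
	Status  string // "True", "False" or "Unknown"
	Message string
}

// release is a simplified stand-in for a HelmRelease.
type release struct {
	Name       string
	Conditions []condition
}

// needsRetry is the analysis phase: the release must have a Ready condition
// that is False, and its message must contain the given pattern.
func needsRetry(r release, pattern string) bool {
	for _, c := range r.Conditions {
		if c.Type == "Ready" && c.Status == "False" && strings.Contains(c.Message, pattern) {
			return true
		}
	}
	return false
}

// reconcile runs the analysis, action and monitoring phases for one release.
// annotate is a hypothetical callback standing in for the API call that adds
// the reconcile annotation.
func reconcile(r release, annotate func(name string) error) (bool, error) {
	if !needsRetry(r, "context deadline exceeded") {
		return false, nil // healthy, or failed for a non-recoverable reason
	}
	if err := annotate(r.Name); err != nil {
		return false, fmt.Errorf("annotate %s: %w", r.Name, err)
	}
	fmt.Printf("triggered retry for %s\n", r.Name) // monitoring phase: log the action
	return true, nil
}

func main() {
	// The watch phase is simulated with a fixed slice; a real controller
	// would receive these objects from the API server.
	releases := []release{
		{Name: "web", Conditions: []condition{{Type: "Ready", Status: "False", Message: "context deadline exceeded"}}},
		{Name: "db", Conditions: []condition{{Type: "Ready", Status: "True", Message: "release reconciliation succeeded"}}},
	}
	annotate := func(name string) error { return nil } // no-op stand-in
	for _, r := range releases {
		if _, err := reconcile(r, annotate); err != nil {
			fmt.Println("error:", err)
		}
	}
}
```

Passing the annotation step in as a function keeps the decision logic testable without a cluster.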

Roadmap

Helm Chart for easy installation
Support for additional error patterns
Configurable retry delays and limits
Dashboard for monitoring restart actions
Integration with popular monitoring systems
Multi-cluster support

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

Flux CD for the excellent GitOps toolkit
Kubebuilder for the operator framework
Controller Runtime for the underlying controller libraries

Directories

Path	Synopsis
cmd
internal
controller
test
utils
