reconciler

package

v1.5.0 Latest Latest Go to latest Published: Oct 15, 2025 License: Apache-2.0 Imports: 8 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/cuemby/warren

Links

Open Source Insights

Documentation ¶

Overview ¶

Package reconciler provides failure detection and automatic healing for Warren clusters.

The reconciler continuously monitors the cluster for failures and deviations from desired state, automatically triggering corrective actions to maintain system health. It runs as a background process that detects failed tasks, down nodes, and unhealthy containers, ensuring that services remain available despite infrastructure failures.

Architecture ¶

The reconciler operates on a fixed 10-second interval, monitoring all cluster components and triggering remediation when problems are detected:

┌────────────────────────────────────────────────────────────┐
│                  Reconciliation Loop                       │
│                   (Every 10 seconds)                       │
└────────────────┬───────────────────────────────────────────┘
                 │
    ┌────────────┴────────────┐
    │                         │
    ▼                         ▼
┌─────────────────┐   ┌──────────────────┐
│ Reconcile Nodes │   │ Reconcile Tasks  │
└─────┬───────────┘   └──────┬───────────┘
      │                      │
      ▼                      ▼
  Check                 Check health
  heartbeats            and state
      │                      │
      ▼                      ▼
  Mark down             Replace failed
  nodes                 tasks

Failure Detection ¶

## Node Failure Detection

Nodes send heartbeats every 5 seconds. If no heartbeat is received for 30 seconds, the node is marked as "Down":

Last heartbeat: 2024-01-15 10:30:00
Current time:   2024-01-15 10:30:35  (35 seconds elapsed)
Status: NodeStatusDown

When a node is marked down:

All running tasks on that node are marked as failed
Scheduler creates replacement tasks on healthy nodes
Node becomes eligible for scheduling again after recovery

## Task Failure Detection

The reconciler detects multiple failure scenarios:

1. Task explicitly failed (worker reported failure):

ActualState = TaskStateFailed
Error message available in task.Error
Action: Mark for cleanup (DesiredState = Shutdown)

2. Task health check failed:

HealthStatus.Healthy = false
ConsecutiveFailures > threshold
Action: Mark as failed, scheduler creates replacement

3. Task on down node:

Node.Status = NodeStatusDown
Task.DesiredState = Running
Action: Mark as failed for rescheduling

## Health-Aware Reconciliation

The reconciler integrates with health checks to detect unhealthy containers:

Task running: nginx-1
Health check: HTTP GET /health → 503 Service Unavailable
Consecutive failures: 3 (exceeds threshold)
Action: Mark task as failed
Result: Scheduler creates replacement task

Core Components ¶

Reconciler: The main reconciliation engine that monitors cluster health.

reconciler := NewReconciler(manager)
reconciler.Start()  // Begins 10-second reconciliation loop
defer reconciler.Stop()

Like the scheduler, the reconciler is stateless - it reads cluster state from the manager on each cycle, making decisions based only on current observations.

Reconciliation Strategies ¶

## Node Reconciliation

For each node in the cluster:

Read node.LastHeartbeat timestamp
Calculate time since last heartbeat
If > 30 seconds: • Update node.Status = NodeStatusDown • Persist to storage
If < 30 seconds and currently down: • Update node.Status = NodeStatusReady • Node returns to scheduling pool

## Task Reconciliation

For each task in the cluster:

Check if task is failed but desired to be running: • Mark DesiredState = Shutdown • Scheduler will create replacement
Check if task is running but unhealthy: • Evaluate HealthStatus • If unhealthy, mark ActualState = Failed • Scheduler will create replacement
Check if task is on a down node: • Mark ActualState = Failed • Mark DesiredState = Shutdown • Scheduler will create replacement on healthy node
Garbage collect completed tasks: • If DesiredState = Shutdown and ActualState = Complete • Wait 5 minutes, then delete task

Usage Examples ¶

## Basic Reconciler Setup

import (
	"github.com/cuemby/warren/pkg/reconciler"
	"github.com/cuemby/warren/pkg/manager"
)

// Create manager (provides cluster state access)
mgr := manager.NewManager(store, raftNode)

// Create and start reconciler
rec := reconciler.NewReconciler(mgr)
rec.Start()

// Reconciler runs automatically every 10 seconds
// ...

// Gracefully stop reconciler
rec.Stop()

## Simulating Node Failure

// Simulate node going down (stop heartbeats)
node := &types.Node{
	ID:            "worker-1",
	Status:        types.NodeStatusReady,
	LastHeartbeat: time.Now().Add(-35 * time.Second),  // 35s ago
}

// On next reconciliation cycle (within 10 seconds):
// 1. Reconciler detects: now - LastHeartbeat > 30s
// 2. Sets node.Status = NodeStatusDown
// 3. All tasks on worker-1 marked as failed
// 4. Scheduler creates replacement tasks on healthy nodes

## Health Check Integration

// Service with health check configured
service := &types.Service{
	ID:    "svc-1",
	Name:  "api",
	Image: "myapp:v1",
	HealthCheck: &types.HealthCheck{
		Type:     types.HealthCheckTypeHTTP,
		HTTP:     &types.HTTPHealthCheck{Path: "/health"},
		Interval: 10 * time.Second,
		Timeout:  5 * time.Second,
		Retries:  3,
	},
}

// Task with failed health check
task := &types.Task{
	ID:           "task-1",
	ServiceID:    "svc-1",
	ActualState:  types.TaskStateRunning,
	DesiredState: types.TaskStateRunning,
	HealthStatus: &types.HealthStatus{
		Healthy:             false,
		ConsecutiveFailures: 3,
		Message:            "HTTP 503 Service Unavailable",
	},
}

// Reconciler will:
// 1. Detect task is running but unhealthy
// 2. Set task.ActualState = TaskStateFailed
// 3. Set task.Error = "health check failed: HTTP 503..."
// 4. Scheduler creates new healthy task

Integration Points ¶

## Manager Integration

The reconciler depends on the manager package for cluster state:

ListNodes() - Get all nodes to check heartbeats
UpdateNode(node) - Mark nodes as down/ready
ListTasks() - Get all tasks for health checks
UpdateTask(task) - Mark tasks as failed
DeleteTask(taskID) - Garbage collect old tasks
GetNode(nodeID) - Get node details for task reconciliation

## Scheduler Coordination

The reconciler and scheduler form a feedback loop:

Reconciler: "This task is broken, mark it for cleanup"
Scheduler: "I see a service needs more replicas, create task"
Reconciler: "New task is healthy, no action needed"

This separation ensures:

Reconciler focuses on detection
Scheduler focuses on placement
Clean, testable boundaries

## Health Check Integration

The reconciler reads health status from tasks:

Worker runs health checks (pkg/health)
Worker updates task.HealthStatus via manager
Reconciler reads task.HealthStatus
Reconciler marks unhealthy tasks as failed

## Worker Integration

Workers interact with reconciliation indirectly:

Worker reports task state to manager
Reconciler detects failed state
Reconciler marks task for cleanup
Worker sees DesiredState = Shutdown
Worker stops container and marks task complete

Design Patterns ¶

## Stateless Reconciliation

Like the scheduler, the reconciler maintains no state between cycles:

All decisions based on current cluster state
No memory of previous failures
Crash-safe and simple to reason about

## Level-Triggered Reconciliation

The reconciler uses level-triggered (not edge-triggered) logic:

Edge-triggered: "Node just went down, react"
Level-triggered: "Node is down, ensure proper state"

This means:

Reconciler doesn't need to track state changes
System converges even if reconciliation cycles are missed
More robust to timing issues and race conditions

## Garbage Collection Pattern

Completed tasks are not deleted immediately:

Task finishes (DesiredState = Shutdown, ActualState = Complete)
Wait 5 minutes (grace period for debugging)
Delete task from storage

This provides a debugging window while preventing unbounded storage growth.

Performance Characteristics ¶

## Time Complexity

Per reconciliation cycle (N nodes, T tasks):

List nodes: O(N)
Check heartbeats: O(N)
List tasks: O(T)
Check health: O(T)
Get node per task: O(T * log N) if manager uses index
Overall: O(N + T)

For a typical cluster (10 nodes, 500 tasks):

~100-200ms per reconciliation cycle
Well within 10-second interval

## Memory Usage

Minimal memory footprint:

No state caching
Temporary allocations for node/task lists
~5-10 MB for typical cluster sizes

## Failure Detection Latency

Time from failure to remediation:

Node failure: 30-40 seconds (30s timeout + 10s cycle)
Task failure: 0-10 seconds (detected in next cycle)
Health check failure: (retries * interval) + 10s cycle
Example: 3 retries * 10s + 10s = ~40 seconds

Troubleshooting ¶

## Tasks Not Being Replaced

If failed tasks are not being replaced:

1. Check reconciler is running:

Look for "Reconciler error" in manager logs
Verify reconciler.Start() was called

2. Check scheduler is running:

Reconciler marks tasks failed
Scheduler creates replacements
Both must be running!

3. Verify task state:

Run: warren service ps <service-name>
Check DesiredState vs ActualState
Look for error messages

## False Positive Node Failures

If nodes are incorrectly marked as down:

1. Check heartbeat interval:

Workers should heartbeat every 5s
Network latency < 25s to avoid false positives

2. Check manager clock skew:

Reconciler uses manager's clock
NTP synchronization recommended

3. Verify network connectivity:

Workers must reach manager node
Check firewalls, security groups

## Tasks Flapping (Constant Restarts)

If tasks are constantly being killed and recreated:

1. Check health check configuration:

Retries too low? (minimum 2-3)
Timeout too short? (minimum 5s)
StartPeriod too short for slow apps?

2. Check application logs:

Is app actually healthy?
Look for crash loops

3. Review reconciliation logs:

Look for patterns in failure messages
Check timing of failures

## Slow Failure Recovery

If recovery takes too long:

1. Tune reconciliation interval:

Default 10s balances overhead vs. speed
Can reduce to 5s for faster recovery
Be aware of increased CPU usage

2. Tune node heartbeat timeout:

Default 30s prevents false positives
Can reduce to 15s for faster detection
Risk of false positives on slow networks

3. Tune health check parameters:

Reduce interval for faster detection
Reduce retries for quicker action
Balance against false positives

Monitoring Metrics ¶

The reconciler exports Prometheus metrics:

# Reconciliation cycle metrics
reconciliation_duration_seconds - Time to complete cycle
reconciliation_cycles_total - Total reconciliation cycles

## Additional Observable Metrics

Failed tasks detected (log: "Task X failed")
Unhealthy tasks detected (log: "Task X is unhealthy")
Nodes marked down (log: "Node X is down")
Tasks cleaned up (log: "Task X cleaned up after completion")

Best Practices ¶

1. Reconciliation Interval Tuning

Default 10s provides good balance
For critical workloads, reduce to 5s
For large clusters (>100 nodes), consider 15s
Always < scheduler interval for proper coordination

2. Health Check Configuration

Set Retries = 3 (prevents false positives)
Set Interval = 10-30s (not too aggressive)
Set StartPeriod for slow-starting apps
Monitor health check logs for tuning

3. Node Failure Tolerance

Plan for 1-2 minute recovery time
Ensure sufficient spare capacity for rescheduling
Use persistent volumes for stateful workloads
Test failure scenarios in staging

4. Task Cleanup

Default 5-minute grace period is usually sufficient
For high-churn workloads, reduce to 1 minute
For debugging, increase to 15-30 minutes
Monitor task count growth over time

5. Resource Planning

Account for double resource usage during failover
Ensure (N-1) nodes can handle full load
Monitor node resource usage
Plan for cascading failures

Index ¶

type Reconciler
- func NewReconciler(mgr *manager.Manager) *Reconciler
- func (r *Reconciler) Start()
- func (r *Reconciler) Stop()

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Reconciler ¶

type Reconciler struct {
	// contains filtered or unexported fields
}

Reconciler ensures actual cluster state matches desired state

func NewReconciler ¶

func NewReconciler(mgr *manager.Manager) *Reconciler

NewReconciler creates a new reconciler

func (*Reconciler) Start ¶

func (r *Reconciler) Start()

Start begins the reconciliation loop

func (*Reconciler) Stop ¶

func (r *Reconciler) Stop()

Stop stops the reconciler

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL

Documentation ¶

Overview ¶

Architecture ¶

Failure Detection ¶

Core Components ¶

Reconciliation Strategies ¶

Usage Examples ¶

Integration Points ¶

Design Patterns ¶

Performance Characteristics ¶

Troubleshooting ¶

Monitoring Metrics ¶

Best Practices ¶

See Also ¶

Index ¶

Constants ¶

Variables ¶

Functions ¶

Types ¶

type Reconciler ¶

func NewReconciler ¶

func (*Reconciler) Start ¶

func (*Reconciler) Stop ¶

Source Files ¶