health

package

v1.3.0 Latest Latest Go to latest Published: Oct 14, 2025 License: Apache-2.0 Imports: 7 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/cuemby/warren

Links

Open Source Insights

Documentation ¶

Overview ¶

Package health provides health check mechanisms for monitoring container health in Warren clusters.

This package implements three types of health checks: HTTP, TCP, and Exec. Health checks enable automatic detection of unhealthy containers and trigger automatic replacement via the reconciler, ensuring service availability and reliability without manual intervention.

Architecture ¶

Warren's health check system follows a modular checker design:

┌─────────────────────────────────────────────────────────────┐
│                   Health Check System                       │
└─────┬──────────────────────────────────────────────────────┘
      │
      ▼
┌──────────────────────────────────────────────────────────────┐
│                     Checker Interface                        │
│  • Check(ctx) Result                                         │
│  • Type() CheckType                                          │
└────────┬─────────────────────────────────────────────────────┘
         │
    ┌────┴──────┬──────────┐
    ▼           ▼          ▼
┌────────┐  ┌──────┐  ┌────────┐
│  HTTP  │  │ TCP  │  │  Exec  │
│Checker │  │Checker│ │Checker │
└────────┘  └──────┘  └────────┘
     │          │          │
     ▼          ▼          ▼
  GET /    Connect     Run cmd
  /health    :port      in container

## Health Check Flow

Task starts → Worker creates health checker
Wait for StartPeriod (grace period for slow apps)
Every Interval: Run health check
If check fails: Increment consecutive failures
If failures >= Retries: Mark task unhealthy
Reconciler detects unhealthy task → Replaces it

Health Check Types ¶

## HTTP Health Checks

HTTP checks perform HTTP requests to verify application health:

Check Type: HTTP
Configuration:
├── URL: http://container-ip:8080/health
├── Method: GET, POST, HEAD
├── Headers: Custom HTTP headers
├── Expected Status: 200-399 (configurable)
└── Timeout: 10 seconds

Example responses:

200 OK → Healthy
503 Service Unavailable → Unhealthy
Connection timeout → Unhealthy
Connection refused → Unhealthy

## TCP Health Checks

TCP checks verify that a port is listening and accepting connections:

Check Type: TCP
Configuration:
├── Address: container-ip:6379
├── Timeout: 5 seconds
└── Connection test only (no data sent)

Use cases:

Database health (PostgreSQL, MySQL, Redis)
Message queue health (RabbitMQ, Kafka)
Any service with TCP listener

## Exec Health Checks

Exec checks run commands inside the container and check exit codes:

Check Type: Exec
Configuration:
├── Command: ["pg_isready", "-U", "postgres"]
├── Timeout: 10 seconds
├── Exit code 0 → Healthy
└── Exit code != 0 → Unhealthy

Use cases:

Database-specific checks (pg_isready, mysqladmin ping)
Custom health scripts
File system checks
Process checks

Core Components ¶

## Checker Interface

All health checkers implement this interface:

type Checker interface {
	Check(ctx context.Context) Result
	Type() CheckType
}

This allows polymorphic health checking - workers don't need to know the check type, just call Check() and interpret the Result.

## Result Structure

All checks return a standardized Result:

type Result struct {
	Healthy   bool          // Check passed?
	Message   string        // Human-readable message
	CheckedAt time.Time     // When check ran
	Duration  time.Duration // How long check took
}

## Status Tracking

Status tracks health over time:

type Status struct {
	ConsecutiveFailures  int    // Failure streak
	ConsecutiveSuccesses int    // Success streak
	LastCheck            time.Time
	LastResult           Result
	Healthy              bool   // Current health state
	StartedAt            time.Time
}

The status implements hysteresis - multiple failures required before marking unhealthy, preventing flapping from transient issues.

## Configuration

Health checks are configured per service:

type Config struct {
	Interval    time.Duration  // Time between checks (default: 30s)
	Timeout     time.Duration  // Max check duration (default: 10s)
	Retries     int            // Failures before unhealthy (default: 3)
	StartPeriod time.Duration  // Grace period for slow startup (default: 0)
}

Usage Examples ¶

## HTTP Health Check

import "github.com/cuemby/warren/pkg/health"

// Create HTTP checker
checker := health.NewHTTPChecker("http://192.168.1.10:8080/health")

// Customize (optional)
checker.WithMethod("GET").
	WithHeader("User-Agent", "Warren-Health/1.0").
	WithStatusRange(200, 299).  // Only 2xx is healthy
	WithTimeout(5 * time.Second)

// Perform check
ctx := context.Background()
result := checker.Check(ctx)

if result.Healthy {
	fmt.Printf("✓ Healthy: %s (took %v)\n", result.Message, result.Duration)
} else {
	fmt.Printf("✗ Unhealthy: %s\n", result.Message)
}

// Output:
// ✓ Healthy: HTTP 200 OK (took 12ms)

## TCP Health Check

// Create TCP checker for Redis
checker := health.NewTCPChecker("192.168.1.10:6379")
checker.WithTimeout(3 * time.Second)

// Check if Redis is listening
result := checker.Check(ctx)

if result.Healthy {
	fmt.Println("Redis is accepting connections")
} else {
	fmt.Printf("Redis unreachable: %s\n", result.Message)
}

// Output:
// Redis is accepting connections

## Exec Health Check

// Create exec checker for PostgreSQL
checker := health.NewExecChecker([]string{
	"pg_isready",
	"-U", "postgres",
	"-d", "mydb",
})
checker.WithTimeout(5 * time.Second)
checker.WithContainer("container-abc123")  // Run in this container

// Check database
result := checker.Check(ctx)

if result.Healthy {
	fmt.Println("PostgreSQL is ready")
} else {
	fmt.Printf("PostgreSQL not ready: %s\n", result.Message)
}

## Health Status Tracking

// Create status tracker
status := health.NewStatus()

// Configure health check
config := health.Config{
	Interval:    10 * time.Second,
	Timeout:     5 * time.Second,
	Retries:     3,
	StartPeriod: 30 * time.Second,
}

// Simulate health check loop
checker := health.NewHTTPChecker("http://app:8080/health")

for {
	// Check if in startup grace period
	if status.InStartPeriod(config) {
		fmt.Println("In startup period, skipping health check")
		time.Sleep(config.Interval)
		continue
	}

	// Run health check
	ctx, cancel := context.WithTimeout(context.Background(), config.Timeout)
	result := checker.Check(ctx)
	cancel()

	// Update status
	status.Update(result, config)

	// Check if unhealthy
	if !status.Healthy {
		fmt.Printf("Container unhealthy after %d failures\n",
			status.ConsecutiveFailures)
		// Trigger replacement...
		break
	}

	time.Sleep(config.Interval)
}

## Service with Health Check

// Define service with HTTP health check
service := &types.Service{
	ID:    "svc-api",
	Name:  "api",
	Image: "myapp:v1",
	HealthCheck: &types.HealthCheck{
		Type:     types.HealthCheckTypeHTTP,
		HTTP: &types.HTTPHealthCheck{
			Path:   "/health",
			Port:   8080,
			Scheme: "http",
		},
		Interval:    15 * time.Second,
		Timeout:     5 * time.Second,
		Retries:     3,
		StartPeriod: 60 * time.Second,  // Allow 60s for startup
	},
}

// Warren will:
// 1. Start container
// 2. Wait 60s (StartPeriod)
// 3. Check /health every 15s
// 4. After 3 failures, mark unhealthy
// 5. Reconciler replaces unhealthy task

Integration Points ¶

## Worker Integration

Workers manage health check execution:

Task assigned to worker
Worker starts container
Worker creates appropriate health checker
Worker runs checks on configured interval
Worker updates task.HealthStatus via manager
Reconciler reads HealthStatus, triggers replacement if needed

## Reconciler Integration

The reconciler uses health status to detect failures:

// Check task health
if task.ActualState == types.TaskStateRunning {
	if task.HealthStatus != nil && !task.HealthStatus.Healthy {
		// Mark task as failed
		task.ActualState = types.TaskStateFailed
		task.Error = fmt.Sprintf("health check failed: %s",
			task.HealthStatus.Message)
		manager.UpdateTask(task)
	}
}

## Scheduler Integration

The scheduler considers health when placing tasks:

Unhealthy tasks don't count toward active replicas
Scheduler creates replacement tasks
Load balancer excludes unhealthy backends

## Manager Integration

The manager stores health status:

Task {
	ID:          "task-abc123"
	ActualState: "running"
	HealthStatus: {
		Healthy:             false
		ConsecutiveFailures: 3
		Message:            "HTTP 503 Service Unavailable"
		LastCheck:          "2024-01-15T10:30:00Z"
	}
}

Design Patterns ¶

## Strategy Pattern

Different checkers implement the Checker interface:

Checker (interface)
├── HTTPChecker (HTTP strategy)
├── TCPChecker (TCP strategy)
└── ExecChecker (Exec strategy)

This allows runtime selection of check type without code changes.

## Builder Pattern

Checkers use fluent builders for configuration:

checker := NewHTTPChecker(url).
	WithMethod("POST").
	WithHeader("Auth", "token").
	WithTimeout(5 * time.Second)

This provides clean, readable configuration with optional parameters.

## Hysteresis Pattern

Status tracking implements hysteresis to prevent flapping:

Healthy → 1 failure → Still healthy
Healthy → 2 failures → Still healthy
Healthy → 3 failures → Unhealthy!

Unhealthy → 1 success → Healthy!

This prevents oscillation from transient issues while still responding to persistent problems.

## Context-Based Cancellation

All checks respect context deadlines:

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

result := checker.Check(ctx)  // Respects timeout

This enables proper timeout handling and resource cleanup.

Performance Characteristics ¶

## HTTP Check Performance

HTTP checks are network-bound:

Latency: 1-100ms (depends on network + app)
Memory: ~10KB per check (HTTP client)
CPU: Minimal (mostly waiting for I/O)

For 100 checks/second:

~1% CPU usage
~1MB memory

## TCP Check Performance

TCP checks are very lightweight:

Latency: 1-10ms (just TCP handshake)
Memory: ~1KB per check
CPU: Negligible

TCP checks are ideal for high-frequency monitoring.

## Exec Check Performance

Exec checks are most expensive:

Latency: 10-1000ms (depends on command)
Memory: Command output size
CPU: Command execution

Use exec checks sparingly and increase check interval.

## Recommended Check Intervals

HTTP: 10-30 seconds
TCP: 5-15 seconds
Exec: 30-60 seconds

Troubleshooting ¶

## False Positive Failures

If healthy containers are marked unhealthy:

1. Check timeout settings:

Timeout too short for slow responses?
Network latency accounted for?
Increase timeout to 2x expected duration

2. Check retry count:

Retries = 1 → Very sensitive to transients
Retries = 3 → More tolerant (recommended)
Increase retries for flaky networks

3. Check StartPeriod:

App takes 60s to start but StartPeriod = 10s?
Set StartPeriod > app startup time
Monitor app startup logs

## Health Checks Not Running

If health checks aren't being performed:

1. Verify configuration:

Check service.HealthCheck is set
Verify Interval > 0
Ensure worker is running

2. Check worker logs:

Look for "health check" messages
Check for errors creating checker
Verify container IP/port reachable

3. Check network connectivity:

Can worker reach container IP?
Firewall blocking health check port?
Container actually listening on port?

## Health Checks Too Slow

If health checks impact performance:

1. Optimize check endpoint:

Health check should be lightweight
Don't hit database on every check
Cache health status if expensive to compute

2. Tune check interval:

Reduce check frequency
Balance detection speed vs. overhead
30s interval is usually sufficient

3. Use appropriate check type:

TCP faster than HTTP
HTTP faster than Exec
Choose lightest check that's still reliable

## Container Flapping

If containers restart repeatedly:

1. Check application health:

Is app actually healthy?
Check application logs for errors
Test health endpoint manually

2. Tune health check parameters:

Increase retries (tolerate transients)
Increase interval (reduce check frequency)
Increase timeout (allow slower responses)
Increase StartPeriod (slow startup)

3. Review health check logic:

Is check too strict?
Is check testing right thing?
Consider application-specific checks

Monitoring Metrics ¶

Key health check metrics:

Health checks performed per second
Health check success rate
Health check latency (p50, p95, p99)
Consecutive failures per task
Tasks marked unhealthy per hour

Best Practices ¶

1. Health Check Design

Check critical dependencies (database, cache, etc.)
Return quickly (< 1 second ideal)
Don't overwhelm backend services
Cache expensive computations
Return detailed status in response

2. Configuration Tuning

Set Interval = 10-30s (balance detection vs. overhead)
Set Timeout = 5-10s (2x expected response time)
Set Retries = 3 (tolerate transients)
Set StartPeriod = 2x app startup time

3. Application Integration

Implement /health endpoint in all services
Return 200 when healthy, 503 when not
Include dependency status in health response
Test health endpoint in development

4. HTTP Health Endpoints

Keep checks lightweight
Don't require authentication
Return JSON with status details
Include version and uptime
Test with curl before deploying

5. Progressive Readiness

Use StartPeriod for slow-starting apps
Consider separate readiness vs. liveness checks (future)
Gradual health restoration (don't kill on first failure)

Security Considerations ¶

## HTTP Health Checks

Health endpoints should not require authentication
Don't expose sensitive information in health responses
Use internal networks only (not public internet)
Rate limit health check endpoints

## Exec Health Checks

Validate command arguments (prevent injection)
Run commands as non-root user
Limit command execution time
Monitor for command abuse

Future Enhancements ¶

Planned health check features:

gRPC health checks (gRPC health protocol)
Custom health check scripts
Readiness vs. liveness checks (Kubernetes-style)
Health check metrics export (Prometheus)
Dependency health aggregation
Circuit breaker integration

Index ¶

type CheckType
type Checker
type Config
- func DefaultConfig() Config
type ExecChecker
- func NewExecChecker(command []string) *ExecChecker
type HTTPChecker
- func NewHTTPChecker(url string) *HTTPChecker
type Result
type Status
- func NewStatus() *Status
- func (s *Status) InStartPeriod(config Config) bool
- func (s *Status) Update(result Result, config Config)
type TCPChecker
- func NewTCPChecker(address string) *TCPChecker

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type CheckType ¶

type CheckType string

CheckType represents the type of health check

const (
	CheckTypeHTTP CheckType = "http"
	CheckTypeTCP  CheckType = "tcp"
	CheckTypeExec CheckType = "exec"
)

type Checker ¶

type Checker interface {
	// Check performs the health check and returns the result
	Check(ctx context.Context) Result

	// Type returns the type of health check
	Type() CheckType
}

Checker is the interface that all health checkers must implement

type Config ¶

type Config struct {
	// Interval is the time between health checks
	Interval time.Duration

	// Timeout is the maximum time to wait for a health check to complete
	Timeout time.Duration

	// Retries is the number of consecutive failures before marking as unhealthy
	Retries int

	// StartPeriod is the grace period before starting health checks
	// Used to allow slow-starting containers to initialize
	StartPeriod time.Duration
}

Config contains common configuration for all health checks

func DefaultConfig ¶

func DefaultConfig() Config

DefaultConfig returns a Config with sensible defaults

type ExecChecker ¶

type ExecChecker struct {
	// Command is the command to execute (e.g., ["pg_isready", "-U", "postgres"])
	Command []string

	// Timeout is the command execution timeout (default: 10 seconds)
	Timeout time.Duration

	// ContainerID is the ID of the container to exec into
	// If empty, runs on host (useful for testing)
	ContainerID string
}

ExecChecker performs exec-based health checks by running a command

func NewExecChecker ¶

func NewExecChecker(command []string) *ExecChecker

NewExecChecker creates a new exec health checker

func (*ExecChecker) Check ¶

func (e *ExecChecker) Check(ctx context.Context) Result

Check performs the exec health check

func (*ExecChecker) Type ¶

func (e *ExecChecker) Type() CheckType

Type returns the health check type

func (*ExecChecker) WithContainer ¶

func (e *ExecChecker) WithContainer(containerID string) *ExecChecker

WithContainer sets the container ID for exec

func (*ExecChecker) WithTimeout ¶

func (e *ExecChecker) WithTimeout(timeout time.Duration) *ExecChecker

WithTimeout sets the execution timeout

type HTTPChecker ¶

type HTTPChecker struct {
	// URL is the full HTTP URL to check (e.g., "http://container-ip:8080/health")
	URL string

	// Method is the HTTP method to use (default: GET)
	Method string

	// Headers are custom HTTP headers to include in the request
	Headers map[string]string

	// ExpectedStatusMin is the minimum acceptable HTTP status code (default: 200)
	ExpectedStatusMin int

	// ExpectedStatusMax is the maximum acceptable HTTP status code (default: 399)
	ExpectedStatusMax int

	// Client is the HTTP client to use (allows custom configuration)
	Client *http.Client
}

HTTPChecker performs HTTP-based health checks

func NewHTTPChecker ¶

func NewHTTPChecker(url string) *HTTPChecker

NewHTTPChecker creates a new HTTP health checker

func (*HTTPChecker) Check ¶

func (h *HTTPChecker) Check(ctx context.Context) Result

Check performs the HTTP health check

func (*HTTPChecker) Type ¶

func (h *HTTPChecker) Type() CheckType

Type returns the health check type

func (*HTTPChecker) WithHeader ¶

func (h *HTTPChecker) WithHeader(key, value string) *HTTPChecker

WithHeader adds a custom HTTP header

func (*HTTPChecker) WithMethod ¶

func (h *HTTPChecker) WithMethod(method string) *HTTPChecker

WithMethod sets the HTTP method

func (*HTTPChecker) WithStatusRange ¶

func (h *HTTPChecker) WithStatusRange(min, max int) *HTTPChecker

WithStatusRange sets the expected status code range

func (*HTTPChecker) WithTimeout ¶

func (h *HTTPChecker) WithTimeout(timeout time.Duration) *HTTPChecker

WithTimeout sets the HTTP client timeout

type Result ¶

type Result struct {
	Healthy   bool
	Message   string
	CheckedAt time.Time
	Duration  time.Duration
}

Result represents the outcome of a health check

type Status ¶

type Status struct {
	// ConsecutiveFailures tracks the number of consecutive failed checks
	ConsecutiveFailures int

	// ConsecutiveSuccesses tracks the number of consecutive successful checks
	ConsecutiveSuccesses int

	// LastCheck is the timestamp of the last health check
	LastCheck time.Time

	// LastResult is the result of the last health check
	LastResult Result

	// Healthy indicates if the container is currently considered healthy
	Healthy bool

	// StartedAt is when health monitoring started for this container
	StartedAt time.Time
}

Status tracks the current health status of a container

func NewStatus ¶

func NewStatus() *Status

NewStatus creates a new Status with default values

func (*Status) InStartPeriod ¶

func (s *Status) InStartPeriod(config Config) bool

InStartPeriod returns true if we're still in the startup grace period

func (*Status) Update ¶

func (s *Status) Update(result Result, config Config)

Update updates the status based on a new health check result

type TCPChecker ¶

type TCPChecker struct {
	// Address is the TCP address to connect to (e.g., "container-ip:6379")
	Address string

	// Timeout is the connection timeout (default: 5 seconds)
	Timeout time.Duration
}

TCPChecker performs TCP-based health checks

func NewTCPChecker ¶

func NewTCPChecker(address string) *TCPChecker

NewTCPChecker creates a new TCP health checker

func (*TCPChecker) Check ¶

func (t *TCPChecker) Check(ctx context.Context) Result

Check performs the TCP health check

func (*TCPChecker) Type ¶

func (t *TCPChecker) Type() CheckType

Type returns the health check type

func (*TCPChecker) WithTimeout ¶

func (t *TCPChecker) WithTimeout(timeout time.Duration) *TCPChecker

WithTimeout sets the connection timeout

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL

Documentation ¶

Overview ¶

Architecture ¶

Health Check Types ¶

Core Components ¶

Usage Examples ¶

Integration Points ¶

Design Patterns ¶

Performance Characteristics ¶

Troubleshooting ¶

Monitoring Metrics ¶

Best Practices ¶

Security Considerations ¶

Future Enhancements ¶

See Also ¶

Index ¶

Constants ¶

Variables ¶

Functions ¶

Types ¶

type CheckType ¶

type Checker ¶

type Config ¶

func DefaultConfig ¶

type ExecChecker ¶

func NewExecChecker ¶

func (*ExecChecker) Check ¶

func (*ExecChecker) Type ¶

func (*ExecChecker) WithContainer ¶

func (*ExecChecker) WithTimeout ¶

type HTTPChecker ¶

func NewHTTPChecker ¶

func (*HTTPChecker) Check ¶

func (*HTTPChecker) Type ¶

func (*HTTPChecker) WithHeader ¶

func (*HTTPChecker) WithMethod ¶

func (*HTTPChecker) WithStatusRange ¶

func (*HTTPChecker) WithTimeout ¶

type Result ¶

type Status ¶

func NewStatus ¶

func (*Status) InStartPeriod ¶

func (*Status) Update ¶

type TCPChecker ¶

func NewTCPChecker ¶

func (*TCPChecker) Check ¶

func (*TCPChecker) Type ¶

func (*TCPChecker) WithTimeout ¶

Source Files ¶