# CronJob Guardian
A Kubernetes operator for monitoring CronJobs with SLA tracking, intelligent alerting, and a built-in dashboard.
## Why CronJob Guardian?
CronJobs power critical operations—backups, ETL pipelines, reports, cache warming—but Kubernetes provides no built-in monitoring for them. When jobs fail silently or stop running, you only find out when it's too late.
CronJob Guardian watches your CronJobs and alerts you when something goes wrong:
- Job failures with logs, events, and suggested fixes
- Missed schedules via dead-man's switch detection
- Performance regressions when jobs slow down over time
- SLA breaches when success rates drop below thresholds
## Architecture
```
                              Kubernetes Cluster
┌────────────────────────────────────────────────────────────────────────────┐
│                                                                            │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐         │
│  │ CronJobMonitor  │    │  AlertChannel   │    │    CronJobs     │         │
│  │      (CRD)      │    │      (CRD)      │    │     & Jobs      │         │
│  └────────┬────────┘    └────────┬────────┘    └────────┬────────┘         │
│           │                      │                      │                  │
│           └──────────────────────┼──────────────────────┘                  │
│                                  ▼                                         │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                      CronJob Guardian Operator                       │  │
│  │                                                                      │  │
│  │  ┌────────────────┐   ┌────────────────┐   ┌────────────────┐        │  │
│  │  │  Controllers   │   │   Schedulers   │   │    Alerting    │        │  │
│  │  │                │   │                │   │   Dispatcher   │─────────────┐
│  │  │ • Monitor      │   │ • Dead-man     │   │                │        │  │ │
│  │  │ • Job          │◀──│ • SLA recalc   │──▶│ • Dedup        │        │  │ │
│  │  │ • Channel      │   │ • Prune        │   │ • Rate limit   │        │  │ │
│  │  └───────┬────────┘   └────────────────┘   └────────────────┘        │  │ │
│  │          │                                                           │  │ │
│  │          ▼                                                           │  │ │
│  │  ┌─────────────────────────────────────┐   ┌────────────────┐        │  │ │
│  │  │                Store                │   │   Prometheus   │        │  │ │
│  │  │     SQLite / PostgreSQL / MySQL     │   │    Metrics     │─────────────┤
│  │  │                                     │   │     :8443      │        │  │ │
│  │  │  • Executions   • Logs   • Alerts   │   └────────────────┘        │  │ │
│  │  └──────────────────┬──────────────────┘                             │  │ │
│  │                     │                                                │  │ │
│  │  ┌──────────────────┴──────────────────┐                             │  │ │
│  │  │          Web UI & REST API          │                             │  │ │
│  │  │                :8080                │──────────────────────────────────┤
│  │  └─────────────────────────────────────┘                             │  │ │
│  └──────────────────────────────────────────────────────────────────────┘  │ │
│                                                                            │ │
└────────────────────────────────────────────────────────────────────────────┘ │
                                                                               │
                                       ┌───────────────────────────────────────┘
                                       │
                                       ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                             External Services                              │
│                                                                            │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐ │
│  │   Slack   │  │ PagerDuty │  │  Webhook  │  │   Email   │  │Prometheus │ │
│  └───────────┘  └───────────┘  └───────────┘  └───────────┘  └───────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
```
**How it works:**

- Create `CronJobMonitor` resources to define what to watch (label selectors, SLA thresholds)
- Create `AlertChannel` resources to configure alert destinations (Slack, PagerDuty, etc.)
- The operator watches CronJobs and Jobs, recording executions to the store
- Background schedulers check for missed schedules, SLA breaches, and duration regressions
- When issues are detected, alerts are dispatched with context (logs, events, suggested fixes)
## Screenshots

### Main Dashboard

![CronJob Guardian Dashboard](docs/images/dashboard.png)

### SLA Compliance View

![SLA View](docs/images/sla.png)

### CronJob Details View

![Details View](docs/images/details.png)
## Features

### Monitoring
- Dead-Man's Switch: Alert when CronJobs don't run within expected windows. Auto-detects expected intervals from cron schedules.
- SLA Tracking: Monitor success rates, duration percentiles (P50/P95/P99), and detect performance regressions.
- Execution History: Store and query job execution records with logs and events.
- Prometheus Metrics: Export metrics for integration with existing monitoring infrastructure.
### Alerting
- Multiple Channels: Slack, PagerDuty, generic webhooks, and email
- Rich Context: Alerts include pod logs, Kubernetes events, and suggested fixes
- Deduplication: Configurable suppression windows and alert delays for flaky jobs
- Severity Routing: Route critical and warning alerts to different channels
### Operations
- Maintenance Windows: Suppress alerts during scheduled maintenance
- Built-in Dashboard: Feature-rich web UI for monitoring and analytics
- REST API: Programmatic access to all monitoring data
- Multiple Storage Backends: SQLite (default), PostgreSQL, or MySQL
## Prometheus Metrics

CronJob Guardian exports the following metrics:

| Metric | Type | Description |
|---|---|---|
| `cronjob_guardian_success_rate` | Gauge | Success rate percentage (0-100) per CronJob |
| `cronjob_guardian_duration_seconds` | Histogram | Execution duration with P50/P95/P99 buckets |
| `cronjob_guardian_alerts_total` | Counter | Total alerts sent by type, severity, and channel |
| `cronjob_guardian_executions_total` | Counter | Total executions by status (success/failed) |
| `cronjob_guardian_active_alerts` | Gauge | Currently active alerts per CronJob |

Metrics are available at the `/metrics` endpoint on port 8080.
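These metrics plug into standard Prometheus alerting. As a sketch, an alerting rule on the success-rate gauge might look like the following — the metric name comes from the table above, while the rule group, alert name, threshold, and labels are illustrative choices, not shipped defaults:

```yaml
groups:
  - name: cronjob-guardian-example   # illustrative group name
    rules:
      - alert: CronJobSuccessRateLow
        # cronjob_guardian_success_rate is exported as a 0-100 percentage,
        # so this fires when any monitored CronJob stays below 95% for 15m.
        expr: cronjob_guardian_success_rate < 95
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "A monitored CronJob's success rate has been below 95% for 15 minutes"
```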
## Quick Start

### Prerequisites

- Kubernetes 1.26+
- kubectl configured with cluster access
- Helm 3.8+ (for OCI registry support)

### Installation

#### Helm (Recommended)

CronJob Guardian is distributed as an OCI Helm chart:

```bash
# Install with default configuration (SQLite storage)
helm install cronjob-guardian oci://ghcr.io/illeniumstudios/charts/cronjob-guardian \
  --namespace cronjob-guardian \
  --create-namespace

# Install with custom values
helm install cronjob-guardian oci://ghcr.io/illeniumstudios/charts/cronjob-guardian \
  --namespace cronjob-guardian \
  --create-namespace \
  --values values.yaml
```
#### Quick Start with PostgreSQL

```bash
# Create a secret for database credentials
kubectl create namespace cronjob-guardian
kubectl create secret generic postgres-credentials \
  --namespace cronjob-guardian \
  --from-literal=password=your-secure-password

# Install with PostgreSQL storage
helm install cronjob-guardian oci://ghcr.io/illeniumstudios/charts/cronjob-guardian \
  --namespace cronjob-guardian \
  --set config.storage.type=postgres \
  --set config.storage.postgres.host=postgres.database.svc \
  --set config.storage.postgres.database=guardian \
  --set config.storage.postgres.username=guardian \
  --set config.storage.postgres.existingSecret=postgres-credentials
```
#### High Availability Setup

```bash
helm install cronjob-guardian oci://ghcr.io/illeniumstudios/charts/cronjob-guardian \
  --namespace cronjob-guardian \
  --create-namespace \
  --set replicaCount=2 \
  --set leaderElection.enabled=true \
  --set config.storage.type=postgres \
  --set config.storage.postgres.host=postgres.database.svc \
  --set config.storage.postgres.database=guardian \
  --set config.storage.postgres.username=guardian \
  --set config.storage.postgres.existingSecret=postgres-credentials
```
#### Install from Source

```bash
# Clone the repository
git clone https://github.com/iLLeniumStudios/cronjob-guardian.git
cd cronjob-guardian

# Install using the local chart
helm install cronjob-guardian ./deploy/helm/cronjob-guardian \
  --namespace cronjob-guardian \
  --create-namespace
```
#### kubectl (Alternative)

```bash
# Install CRDs and operator
kubectl apply -f https://raw.githubusercontent.com/iLLeniumStudios/cronjob-guardian/main/dist/install.yaml
```

Or build from source:

```bash
make docker-build docker-push IMG=your-registry/cronjob-guardian:latest
make deploy IMG=your-registry/cronjob-guardian:latest
```
### Helm Configuration
The Helm chart supports extensive configuration for storage backends, high availability, metrics, and more.
See the Helm Chart Documentation for complete configuration reference including:
- Storage backends (SQLite, PostgreSQL, MySQL)
- High availability with leader election
- Ingress and OpenShift Route support for UI access
- Prometheus ServiceMonitor integration
- Resource limits and scheduling
- All available values and their defaults
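As a starting point, the `--set` flags used in the PostgreSQL and high-availability examples above translate into a `values.yaml` like this (only keys that appear elsewhere in this README are shown; consult the chart documentation for the full schema):

```yaml
replicaCount: 2
leaderElection:
  enabled: true
config:
  storage:
    type: postgres
    postgres:
      host: postgres.database.svc
      database: guardian
      username: guardian
      existingSecret: postgres-credentials
```

Pass it to the chart with `helm install ... --values values.yaml`.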
### Basic Setup

1. Create an `AlertChannel` for notifications:

   ```bash
   kubectl apply -f examples/alertchannels/slack.yaml
   ```

2. Create a `CronJobMonitor` to watch your jobs:

   ```bash
   kubectl apply -f examples/monitors/basic.yaml
   ```

See the `examples/` directory for complete configuration examples.
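To give a feel for the shape of these resources, here is a hypothetical monitor/channel pair. The `apiVersion` follows the CRD group used elsewhere in this README (`guardian.illenium.net`, `v1alpha1`), but the `spec` fields are illustrative assumptions, not the published schema — treat the files under `examples/` as the source of truth:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: team-slack              # cluster-scoped, so no namespace
spec:
  # Hypothetical fields; see examples/alertchannels/slack.yaml for the real schema.
  type: slack
  slack:
    webhookSecretRef:
      name: slack-webhook
      key: url
---
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: critical-jobs
  namespace: production
spec:
  # Hypothetical fields; see examples/monitors/basic.yaml for the real schema.
  selector:
    matchLabels:
      tier: critical
```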
## Configuration

### CronJobMonitor
The main resource for configuring what to monitor. Select CronJobs by labels, expressions, names, or namespaces.
| Selector Pattern | Example |
|---|---|
| All in namespace | `selector: {}` |
| By labels | `matchLabels: {tier: critical}` |
| By expressions | `matchExpressions: [{key: tier, operator: In, values: [critical]}]` |
| By names | `matchNames: [daily-backup, weekly-report]` |
| Multiple namespaces | `namespaces: [prod, staging]` |
| Namespace labels | `namespaceSelector: {matchLabels: {env: prod}}` |
| Cluster-wide | `allNamespaces: true` |

See `examples/monitors/` for complete examples of each pattern.
#### Key Features
| Feature | Description |
|---|---|
| Dead-Man's Switch | Alert when jobs don't run within expected window |
| SLA Tracking | Monitor success rates and duration percentiles |
| Maintenance Windows | Suppress alerts during planned maintenance |
| Severity Routing | Route critical/warning alerts to different channels |
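A hedged sketch of how these four features might combine in a single monitor. Every field below `selector` uses a guessed name purely for illustration — none of them are taken from the published CRD; see `examples/monitors/full-featured.yaml` for the real options:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: backups
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  # Field names below are illustrative assumptions, not the real schema:
  deadManSwitch:
    enabled: true               # alert when a job misses its expected window
  sla:
    minSuccessRate: 99          # alert when the success rate drops below 99%
  maintenanceWindows:
    - schedule: "0 2 * * 6"     # suppress alerts during Saturday maintenance
      durationMinutes: 120
  alerting:
    routes:
      critical: [pagerduty-prod]   # severity routing to different channels
      warning: [team-slack]
```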
### AlertChannel

Define where to send alerts. AlertChannel resources are cluster-scoped.

| Type | Description | Example |
|---|---|---|
| Slack | Incoming webhook | slack.yaml |
| PagerDuty | Events API | pagerduty.yaml |
| Webhook | Generic HTTP | webhook.yaml |
| Email | SMTP | email.yaml |
## Native Kubernetes Features

CronJob Guardian focuses on monitoring and alerting, leaving job execution control to native Kubernetes features:

| Feature | Spec Field | Description | Example |
|---|---|---|---|
| Timeout | `activeDeadlineSeconds` | Kill stuck jobs | with-timeout.yaml |
| Retry | `backoffLimit` | Auto-retry failed jobs | with-retry.yaml |
| Timezone | `timeZone` | Schedule in a specific timezone | with-timezone.yaml |
| Concurrency | `concurrencyPolicy` | Prevent overlapping runs | with-concurrency.yaml |
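These are plain `batch/v1` CronJob fields, so no Guardian configuration is involved. A CronJob using all four might look like this (the name, schedule, and image are placeholders):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-backup
spec:
  schedule: "0 3 * * *"
  timeZone: "Etc/UTC"              # spec.timeZone is stable in Kubernetes 1.27+
  concurrencyPolicy: Forbid        # skip a new run while the previous one is active
  jobTemplate:
    spec:
      backoffLimit: 2              # retry a failed pod up to two times
      activeDeadlineSeconds: 3600  # kill the job if it runs longer than one hour
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: registry.example.com/backup:latest  # placeholder image
```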
## Use Cases
Example monitors for common scenarios:
| Use Case | Description | Example |
|---|---|---|
| Database Backups | Critical backups with 100% SLA | database-backups.yaml |
| Data Pipelines | ETL with performance tracking | data-pipeline.yaml |
| Reports | Business reports with maintenance windows | financial-reports.yaml |
| Full Featured | All configuration options | full-featured.yaml |
## Web Dashboard

CronJob Guardian includes a feature-rich web UI that serves both an interactive dashboard and a REST API on port 8080.

### Dashboard Pages
| Page | Description |
|---|---|
| Overview | Summary cards, CronJob table with health status, active alerts panel |
| CronJob Details | Per-job metrics, execution history, duration/success charts, health heatmap |
| Monitors | CronJobMonitor list with aggregate metrics and cronjob counts |
| Channels | AlertChannel management with test functionality |
| Alerts | Alert history with filtering by type, severity, and time range |
| SLA | SLA compliance dashboard with breach tracking |
| Settings | System config, storage stats, data pruning, and Pattern Tester |
### Visualization Features
- Success Rate Charts: Bar charts with 14/30/90 day range selection and week-over-week comparison
- Duration Trend Charts: Line charts showing P50/P95 with regression detection and baseline indicators
- Health Heatmap: GitHub-style calendar view showing daily success rates (30/60/90 days)
- Monitor Aggregate Charts: Cross-CronJob comparison charts, health distribution pie charts
- SLA Dashboard: Summary cards and compliance table with status indicators and trend arrows
### Export Features
- CSV Export: Download execution history or SLA reports as CSV files
- PDF Reports: Generate printable reports with metrics, charts summary, and alert history
### Accessing the Dashboard

```bash
kubectl port-forward -n cronjob-guardian svc/cronjob-guardian-ui 8080:8080
```

Then open http://localhost:8080 in your browser.
For production deployments, you can expose the UI via Ingress or OpenShift Route. See the Helm Chart Documentation for configuration details.
## REST API

The operator exposes a REST API for programmatic access to monitoring data, CronJob management, and alerting.

```bash
# Get all monitored CronJobs
curl http://localhost:8080/api/v1/cronjobs

# Get execution history
curl http://localhost:8080/api/v1/cronjobs/production/daily-backup/executions

# Trigger a job manually
curl -X POST http://localhost:8080/api/v1/cronjobs/production/daily-backup/trigger
```

See the API Reference for complete endpoint documentation.
## Suggested Fixes
CronJob Guardian includes intelligent fix suggestions that analyze failure context (exit codes, reasons, logs, events) and provide actionable guidance in alerts.
### Built-in Patterns

| Pattern | Trigger | Suggestion |
|---|---|---|
| OOMKilled | Reason: `OOMKilled` | Increase `resources.limits.memory` |
| SIGKILL (137) | Exit code 137 | Check for OOM, inspect pod state |
| SIGTERM (143) | Exit code 143 | Check `activeDeadlineSeconds` or eviction |
| ImagePullBackOff | Reason match | Verify image name and `imagePullSecrets` |
| CrashLoopBackOff | Reason match | Check application startup logs |
| ConfigError | Reason: `CreateContainerConfigError` | Verify Secret/ConfigMap references |
| DeadlineExceeded | Reason match | Increase deadline or optimize job |
| BackoffLimitExceeded | Reason match | Check logs from failed attempts |
| Evicted | Reason match | Check node pressure, set pod priority |
| FailedScheduling | Event pattern | Check resources, taints, affinity |
### Custom Patterns

Define custom patterns in your CronJobMonitor to match application-specific failures:

```yaml
alerting:
  suggestedFixPatterns:
    - name: db-connection-failed
      match:
        logPattern: "connection refused.*:5432|ECONNREFUSED"
      suggestion: "PostgreSQL connection failed. Check: kubectl get pods -n {{.Namespace}} -l app=postgres"
      priority: 150 # Higher than built-ins (1-100)
    - name: s3-access-denied
      match:
        logPattern: "AccessDenied|NoCredentialProviders"
      suggestion: "S3 access denied. Verify IAM role and bucket policy."
      priority: 140
```
### Pattern Tester
Test patterns before deploying via the Settings > Pattern Tester page in the UI. Enter match criteria and sample failure data to verify your pattern works correctly.
### Template Variables

Suggestions support Go template variables:

- `{{.Namespace}}` - CronJob namespace
- `{{.Name}}` - CronJob name
- `{{.JobName}}` - Job name (includes timestamp suffix)
- `{{.ExitCode}}` - Container exit code
- `{{.Reason}}` - Termination reason
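For instance, a custom pattern's `suggestion` can interpolate several of these variables at once (the pattern name and log pattern below are illustrative):

```yaml
alerting:
  suggestedFixPatterns:
    - name: disk-full                      # illustrative pattern
      match:
        logPattern: "no space left on device"
      suggestion: >-
        {{.Name}} in {{.Namespace}} exited with code {{.ExitCode}}
        ({{.Reason}}). Inspect the volume used by job {{.JobName}}.
```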
## Storage Backends
CronJob Guardian supports multiple storage backends for execution history:
| Backend | Use Case | HA Support |
|---|---|---|
| SQLite (default) | Single-replica, lightweight | No |
| PostgreSQL | Production, high-availability | Yes |
| MySQL/MariaDB | Enterprise environments | Yes |
Configure via Helm values or the GuardianConfig resource. See Helm Chart Documentation for details.
## Development

### Prerequisites
- Go 1.23+
- Docker
- Kind (for local testing)
- Node.js 20+ or Bun (for UI development)
### Building

```bash
# Build the operator binary
make build

# Build the Docker image
make docker-build IMG=cronjob-guardian:dev

# Build the UI
cd ui && pnpm build

# Generate CRDs and code
make manifests generate

# Run linters
make lint

# Run tests
make test
```
### Running Locally

```bash
# Install CRDs
make install

# Run the operator locally
make run

# Or run in a local Kind cluster
make test-e2e
```
### Updating Helm Documentation

Before releasing, regenerate the Helm chart documentation:

```bash
# Generate values.schema.json and update the README.md Values section
make helm-docs

# Or run individually:
make helm-schema     # Generate values.schema.json only
make helm-readme     # Update the README.md Values section only

# Sync CRDs if API types changed
make helm-sync-crds
```
This uses helm-tool to:

- Generate `values.schema.json` from `values.yaml` comments (enables IDE autocompletion)
- Update the `## Values` section in the chart README with HTML tables organized by section

Documenting `values.yaml`:

- Use `# Description` comments above properties to add descriptions
- Use `# +docs:section=SectionName` to organize values into sections
- Section comments can include additional description text on following lines
Example:

```yaml
# +docs:section=Storage
# Configuration for the storage backend.
config:
  storage:
    # Storage type: sqlite, postgres, or mysql
    type: sqlite
```
## Uninstalling

### Helm

```bash
# Uninstall the release
helm uninstall cronjob-guardian --namespace cronjob-guardian

# Delete CRDs (optional - this removes all CronJobMonitor and AlertChannel data)
kubectl delete crd cronjobmonitors.guardian.illenium.net
kubectl delete crd alertchannels.guardian.illenium.net

# Delete the namespace
kubectl delete namespace cronjob-guardian
```
### kubectl

```bash
# Remove all CronJobMonitor and AlertChannel resources
kubectl delete cronjobmonitors --all-namespaces --all
kubectl delete alertchannels --all-namespaces --all

# Remove the operator
make undeploy

# Remove CRDs
make uninstall
```
## Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
Copyright 2025.
Licensed under the Apache License, Version 2.0. See LICENSE for details.
## Directories

| Path | Synopsis |
|---|---|
| api/v1alpha1 | Package v1alpha1 contains API Schema definitions for the guardian v1alpha1 API group. |
| docs/swagger | Package swagger: code generated by swaggo/swag. |
| internal/testutil | Package testutil provides shared test utilities and mock implementations for use across the cronjob-guardian test suites. |
| test | |