# CronJob Guardian
A Kubernetes operator for monitoring CronJobs with SLA tracking, intelligent alerting, and a built-in dashboard.
## Why CronJob Guardian?
CronJobs power critical operations—backups, ETL pipelines, reports, cache warming—but Kubernetes provides no built-in monitoring for them. When jobs fail silently or stop running, you only find out when it's too late.
CronJob Guardian watches your CronJobs and alerts you when something goes wrong:
- Job failures with logs, events, and suggested fixes
- Missed schedules via dead-man's switch detection
- Performance regressions when jobs slow down over time
- SLA breaches when success rates drop below thresholds
## Architecture
```
                              Kubernetes Cluster
┌────────────────────────────────────────────────────────────────────────────┐
│                                                                            │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐         │
│  │ CronJobMonitor  │    │  AlertChannel   │    │    CronJobs     │         │
│  │      (CRD)      │    │      (CRD)      │    │     & Jobs      │         │
│  └────────┬────────┘    └────────┬────────┘    └────────┬────────┘         │
│           │                      │                      │                  │
│           └──────────────────────┼──────────────────────┘                  │
│                                  ▼                                         │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                      CronJob Guardian Operator                       │  │
│  │                                                                      │  │
│  │  ┌────────────────┐   ┌────────────────┐   ┌────────────────┐        │  │
│  │  │  Controllers   │   │   Schedulers   │   │    Alerting    │        │  │
│  │  │                │   │                │   │   Dispatcher   │─────────────┐
│  │  │ • Monitor      │   │ • Dead-man     │   │                │        │  │ │
│  │  │ • Job          │◀──│ • SLA recalc   │──▶│ • Dedup        │        │  │ │
│  │  │ • Channel      │   │ • Prune        │   │ • Rate limit   │        │  │ │
│  │  └───────┬────────┘   └────────────────┘   └────────────────┘        │  │ │
│  │          │                                                           │  │ │
│  │          ▼                                                           │  │ │
│  │  ┌─────────────────────────────────────┐   ┌────────────────┐        │  │ │
│  │  │                Store                │   │   Prometheus   │        │  │ │
│  │  │     SQLite / PostgreSQL / MySQL     │   │    Metrics     │─────────────┤
│  │  │                                     │   │     :8443      │        │  │ │
│  │  │  • Executions   • Logs   • Alerts   │   └────────────────┘        │  │ │
│  │  └──────────────────┬──────────────────┘                             │  │ │
│  │                     │                                                │  │ │
│  │  ┌──────────────────┴──────────────────┐                             │  │ │
│  │  │          Web UI & REST API          │                             │  │ │
│  │  │                :8080                │──────────────────────────────────┤
│  │  └─────────────────────────────────────┘                             │  │ │
│  └──────────────────────────────────────────────────────────────────────┘  │ │
│                                                                            │ │
└────────────────────────────────────────────────────────────────────────────┘ │
                                                                               │
                                       ┌───────────────────────────────────────┘
                                       │
                                       ▼
┌────────────────────────────────────────────────────────────────────────────┐
│                             External Services                              │
│                                                                            │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐ │
│  │   Slack   │  │ PagerDuty │  │  Webhook  │  │   Email   │  │Prometheus │ │
│  └───────────┘  └───────────┘  └───────────┘  └───────────┘  └───────────┘ │
└────────────────────────────────────────────────────────────────────────────┘
```
**How it works:**

- Create `CronJobMonitor` resources to define what to watch (label selectors, SLA thresholds)
- Create `AlertChannel` resources to configure alert destinations (Slack, PagerDuty, etc.)
- The operator watches CronJobs and Jobs, recording executions to the store
- Background schedulers check for missed schedules, SLA breaches, and duration regressions
- When issues are detected, alerts are dispatched with context (logs, events, suggested fixes)
## Screenshots

### Main Dashboard

![CronJob Guardian Dashboard](docs/images/dashboard.png)

### SLA Compliance View

![SLA View](docs/images/sla.png)

### CronJob Details View

![Details View](docs/images/details.png)
## Features

### Monitoring
- Dead-Man's Switch: Alert when CronJobs don't run within expected windows. Auto-detects expected intervals from cron schedules.
- SLA Tracking: Monitor success rates, duration percentiles (P50/P95/P99), and detect performance regressions.
- Execution History: Store and query job execution records with logs and events.
- Prometheus Metrics: Export metrics for integration with existing monitoring infrastructure.
### Alerting
- Multiple Channels: Slack, PagerDuty, generic webhooks, and email
- Rich Context: Alerts include pod logs, Kubernetes events, and suggested fixes
- Deduplication: Configurable suppression windows and alert delays for flaky jobs
- Severity Routing: Route critical and warning alerts to different channels
### Operations
- Maintenance Windows: Suppress alerts during scheduled maintenance
- Built-in Dashboard: Feature-rich web UI for monitoring and analytics
- REST API: Programmatic access to all monitoring data
- Multiple Storage Backends: SQLite (default), PostgreSQL, or MySQL
## Prometheus Metrics

CronJob Guardian exports the following metrics:

| Metric | Type | Description |
|---|---|---|
| `cronjob_guardian_success_rate` | Gauge | Success rate percentage (0-100) per CronJob |
| `cronjob_guardian_duration_seconds` | Histogram | Execution duration with P50/P95/P99 buckets |
| `cronjob_guardian_alerts_total` | Counter | Total alerts sent by type, severity, and channel |
| `cronjob_guardian_executions_total` | Counter | Total executions by status (success/failed) |
| `cronjob_guardian_active_alerts` | Gauge | Currently active alerts per CronJob |

Metrics are available at the `/metrics` endpoint on port 8080.
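These metrics plug into standard Prometheus alerting. As a sketch, an alerting rule on the success-rate gauge might look like the following — the metric name comes from the table above, while the rule group, alert name, threshold, and labels are illustrative choices, not shipped defaults:

```yaml
groups:
  - name: cronjob-guardian-example   # illustrative group name
    rules:
      - alert: CronJobSuccessRateLow
        # cronjob_guardian_success_rate is exported as a 0-100 percentage,
        # so this fires when any monitored CronJob stays below 95% for 15m.
        expr: cronjob_guardian_success_rate < 95
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "A monitored CronJob's success rate has been below 95% for 15 minutes"
```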
## Quick Start

### Prerequisites

- Kubernetes 1.26+
- kubectl configured with cluster access
- Helm 3.8+ (for OCI registry support)

### Installation

#### Helm (Recommended)

CronJob Guardian is distributed as an OCI Helm chart:

```bash
# Install with default configuration (SQLite storage)
helm install cronjob-guardian oci://ghcr.io/illeniumstudios/charts/cronjob-guardian \
  --namespace cronjob-guardian \
  --create-namespace

# Install with custom values
helm install cronjob-guardian oci://ghcr.io/illeniumstudios/charts/cronjob-guardian \
  --namespace cronjob-guardian \
  --create-namespace \
  --values values.yaml
```
#### Quick Start with PostgreSQL

```bash
# Create a secret for database credentials
kubectl create namespace cronjob-guardian
kubectl create secret generic postgres-credentials \
  --namespace cronjob-guardian \
  --from-literal=password=your-secure-password

# Install with PostgreSQL storage
helm install cronjob-guardian oci://ghcr.io/illeniumstudios/charts/cronjob-guardian \
  --namespace cronjob-guardian \
  --set config.storage.type=postgres \
  --set config.storage.postgres.host=postgres.database.svc \
  --set config.storage.postgres.database=guardian \
  --set config.storage.postgres.username=guardian \
  --set config.storage.postgres.existingSecret=postgres-credentials
```
#### High Availability Setup

```bash
helm install cronjob-guardian oci://ghcr.io/illeniumstudios/charts/cronjob-guardian \
  --namespace cronjob-guardian \
  --create-namespace \
  --set replicaCount=2 \
  --set leaderElection.enabled=true \
  --set config.storage.type=postgres \
  --set config.storage.postgres.host=postgres.database.svc \
  --set config.storage.postgres.database=guardian \
  --set config.storage.postgres.username=guardian \
  --set config.storage.postgres.existingSecret=postgres-credentials
```
#### Install from Source

```bash
# Clone the repository
git clone https://github.com/iLLeniumStudios/cronjob-guardian.git
cd cronjob-guardian

# Install using the local chart
helm install cronjob-guardian ./deploy/helm/cronjob-guardian \
  --namespace cronjob-guardian \
  --create-namespace
```
#### kubectl (Alternative)

```bash
# Install CRDs and operator
kubectl apply -f https://raw.githubusercontent.com/iLLeniumStudios/cronjob-guardian/main/dist/install.yaml
```

Or build from source:

```bash
make docker-build docker-push IMG=your-registry/cronjob-guardian:latest
make deploy IMG=your-registry/cronjob-guardian:latest
```
### Helm Configuration
The Helm chart supports extensive configuration for storage backends, high availability, metrics, and more.
See the Helm Chart Documentation for complete configuration reference including:
- Storage backends (SQLite, PostgreSQL, MySQL)
- High availability with leader election
- Ingress and OpenShift Route support for UI access
- Prometheus ServiceMonitor integration
- Resource limits and scheduling
- All available values and their defaults
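As a starting point, the `--set` flags used in the PostgreSQL and high-availability examples above translate into a `values.yaml` like this (only keys that appear elsewhere in this README are shown; consult the chart documentation for the full schema):

```yaml
replicaCount: 2
leaderElection:
  enabled: true
config:
  storage:
    type: postgres
    postgres:
      host: postgres.database.svc
      database: guardian
      username: guardian
      existingSecret: postgres-credentials
```

Pass it to the chart with `helm install ... --values values.yaml`.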
### Basic Setup

1. Create an `AlertChannel` for notifications:

   ```bash
   kubectl apply -f examples/alertchannels/slack.yaml
   ```

2. Create a `CronJobMonitor` to watch your jobs:

   ```bash
   kubectl apply -f examples/monitors/basic.yaml
   ```

See the `examples/` directory for complete configuration examples.
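To give a feel for the shape of these resources, here is a hypothetical monitor/channel pair. The `apiVersion` follows the CRD group used elsewhere in this README (`guardian.illenium.net`, `v1alpha1`), but the `spec` fields are illustrative assumptions, not the published schema — treat the files under `examples/` as the source of truth:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: AlertChannel
metadata:
  name: team-slack              # cluster-scoped, so no namespace
spec:
  # Hypothetical fields; see examples/alertchannels/slack.yaml for the real schema.
  type: slack
  slack:
    webhookSecretRef:
      name: slack-webhook
      key: url
---
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: critical-jobs
  namespace: production
spec:
  # Hypothetical fields; see examples/monitors/basic.yaml for the real schema.
  selector:
    matchLabels:
      tier: critical
```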
## Configuration

### CronJobMonitor
The main resource for configuring what to monitor. Select CronJobs by labels, expressions, names, or namespaces.
| Selector Pattern | Example |
|---|---|
| All in namespace | `selector: {}` |
| By labels | `matchLabels: {tier: critical}` |
| By expressions | `matchExpressions: [{key: tier, operator: In, values: [critical]}]` |
| By names | `matchNames: [daily-backup, weekly-report]` |
| Multiple namespaces | `namespaces: [prod, staging]` |
| Namespace labels | `namespaceSelector: {matchLabels: {env: prod}}` |
| Cluster-wide | `allNamespaces: true` |

See `examples/monitors/` for complete examples of each pattern.
#### Key Features
| Feature | Description |
|---|---|
| Dead-Man's Switch | Alert when jobs don't run within expected window |
| SLA Tracking | Monitor success rates and duration percentiles |
| Maintenance Windows | Suppress alerts during planned maintenance |
| Severity Routing | Route critical/warning alerts to different channels |
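A hedged sketch of how these four features might combine in a single monitor. Every field below `selector` uses a guessed name purely for illustration — none of them are taken from the published CRD; see `examples/monitors/full-featured.yaml` for the real options:

```yaml
apiVersion: guardian.illenium.net/v1alpha1
kind: CronJobMonitor
metadata:
  name: backups
  namespace: production
spec:
  selector:
    matchLabels:
      tier: critical
  # Field names below are illustrative assumptions, not the real schema:
  deadManSwitch:
    enabled: true               # alert when a job misses its expected window
  sla:
    minSuccessRate: 99          # alert when the success rate drops below 99%
  maintenanceWindows:
    - schedule: "0 2 * * 6"     # suppress alerts during Saturday maintenance
      durationMinutes: 120
  alerting:
    routes:
      critical: [pagerduty-prod]   # severity routing to different channels
      warning: [team-slack]
```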
### AlertChannel

Define where to send alerts. AlertChannel resources are cluster-scoped.

| Type | Description | Example |
|---|---|---|
| Slack | Incoming webhook | slack.yaml |
| PagerDuty | Events API | pagerduty.yaml |
| Webhook | Generic HTTP | webhook.yaml |
| Email | SMTP | email.yaml |
## Native Kubernetes Features

CronJob Guardian focuses on monitoring and alerting, leaving job execution control to native Kubernetes features:

| Feature | Spec Field | Description | Example |
|---|---|---|---|
| Timeout | `activeDeadlineSeconds` | Kill stuck jobs | with-timeout.yaml |
| Retry | `backoffLimit` | Auto-retry failed jobs | with-retry.yaml |
| Timezone | `timeZone` | Schedule in a specific timezone | with-timezone.yaml |
| Concurrency | `concurrencyPolicy` | Prevent overlapping runs | with-concurrency.yaml |
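These are plain `batch/v1` CronJob fields, so no Guardian configuration is involved. A CronJob using all four might look like this (the name, schedule, and image are placeholders):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-backup
spec:
  schedule: "0 3 * * *"
  timeZone: "Etc/UTC"              # spec.timeZone is stable in Kubernetes 1.27+
  concurrencyPolicy: Forbid        # skip a new run while the previous one is active
  jobTemplate:
    spec:
      backoffLimit: 2              # retry a failed pod up to two times
      activeDeadlineSeconds: 3600  # kill the job if it runs longer than one hour
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: registry.example.com/backup:latest  # placeholder image
```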
## Use Cases
Example monitors for common scenarios:
| Use Case | Description | Example |
|---|---|---|
| Database Backups | Critical backups with 100% SLA | database-backups.yaml |
| Data Pipelines | ETL with performance tracking | data-pipeline.yaml |
| Reports | Business reports with maintenance windows | financial-reports.yaml |
| Full Featured | All configuration options | full-featured.yaml |
## Web Dashboard

CronJob Guardian includes a feature-rich web UI that serves both an interactive dashboard and a REST API on port 8080.

### Dashboard Pages
| Page | Description |
|---|---|
| Overview | Summary cards, CronJob table with health status, active alerts panel |
| CronJob Details | Per-job metrics, execution history, duration/success charts, health heatmap |
| Monitors | CronJobMonitor list with aggregate metrics and cronjob counts |
| Channels | AlertChannel management with test functionality |
| Alerts | Alert history with filtering by type, severity, and time range |
| SLA | SLA compliance dashboard with breach tracking |
| Settings | System config, storage stats, data pruning, and Pattern Tester |
### Visualization Features
- Success Rate Charts: Bar charts with 14/30/90 day range selection and week-over-week comparison
- Duration Trend Charts: Line charts showing P50/P95 with regression detection and baseline indicators
- Health Heatmap: GitHub-style calendar view showing daily success rates (30/60/90 days)
- Monitor Aggregate Charts: Cross-CronJob comparison charts, health distribution pie charts
- SLA Dashboard: Summary cards and compliance table with status indicators and trend arrows
### Export Features
- CSV Export: Download execution history or SLA reports as CSV files
- PDF Reports: Generate printable reports with metrics, charts summary, and alert history
### Accessing the Dashboard

```bash
kubectl port-forward -n cronjob-guardian svc/cronjob-guardian-ui 8080:8080
```

Then open http://localhost:8080 in your browser.
For production deployments, you can expose the UI via Ingress or OpenShift Route. See the Helm Chart Documentation for configuration details.
## REST API

The operator exposes a REST API for programmatic access to monitoring data, CronJob management, and alerting.

```bash
# Get all monitored CronJobs
curl http://localhost:8080/api/v1/cronjobs

# Get execution history
curl http://localhost:8080/api/v1/cronjobs/production/daily-backup/executions

# Trigger a job manually
curl -X POST http://localhost:8080/api/v1/cronjobs/production/daily-backup/trigger
```

See the API Reference for complete endpoint documentation.
## Suggested Fixes
CronJob Guardian includes intelligent fix suggestions that analyze failure context (exit codes, reasons, logs, events) and provide actionable guidance in alerts.
### Built-in Patterns

| Pattern | Trigger | Suggestion |
|---|---|---|
| OOMKilled | Reason: `OOMKilled` | Increase `resources.limits.memory` |
| SIGKILL (137) | Exit code 137 | Check for OOM, inspect pod state |
| SIGTERM (143) | Exit code 143 | Check `activeDeadlineSeconds` or eviction |
| ImagePullBackOff | Reason match | Verify image name and `imagePullSecrets` |
| CrashLoopBackOff | Reason match | Check application startup logs |
| ConfigError | Reason: `CreateContainerConfigError` | Verify Secret/ConfigMap references |
| DeadlineExceeded | Reason match | Increase deadline or optimize job |
| BackoffLimitExceeded | Reason match | Check logs from failed attempts |
| Evicted | Reason match | Check node pressure, set pod priority |
| FailedScheduling | Event pattern | Check resources, taints, affinity |
### Custom Patterns

Define custom patterns in your CronJobMonitor to match application-specific failures:

```yaml
alerting:
  suggestedFixPatterns:
    - name: db-connection-failed
      match:
        logPattern: "connection refused.*:5432|ECONNREFUSED"
      suggestion: "PostgreSQL connection failed. Check: kubectl get pods -n {{.Namespace}} -l app=postgres"
      priority: 150 # Higher than built-ins (1-100)
    - name: s3-access-denied
      match:
        logPattern: "AccessDenied|NoCredentialProviders"
      suggestion: "S3 access denied. Verify IAM role and bucket policy."
      priority: 140
```
### Pattern Tester
Test patterns before deploying via the Settings > Pattern Tester page in the UI. Enter match criteria and sample failure data to verify your pattern works correctly.
### Template Variables

Suggestions support Go template variables:

- `{{.Namespace}}` - CronJob namespace
- `{{.Name}}` - CronJob name
- `{{.JobName}}` - Job name (includes timestamp suffix)
- `{{.ExitCode}}` - Container exit code
- `{{.Reason}}` - Termination reason
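For instance, a custom pattern's `suggestion` can interpolate several of these variables at once (the pattern name and log pattern below are illustrative):

```yaml
alerting:
  suggestedFixPatterns:
    - name: disk-full                      # illustrative pattern
      match:
        logPattern: "no space left on device"
      suggestion: >-
        {{.Name}} in {{.Namespace}} exited with code {{.ExitCode}}
        ({{.Reason}}). Inspect the volume used by job {{.JobName}}.
```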
## Storage Backends
CronJob Guardian supports multiple storage backends for execution history:
| Backend | Use Case | HA Support |
|---|---|---|
| SQLite (default) | Single-replica, lightweight | No |
| PostgreSQL | Production, high-availability | Yes |
| MySQL/MariaDB | Enterprise environments | Yes |
Configure via Helm values or the GuardianConfig resource. See Helm Chart Documentation for details.
## Development

### Prerequisites
- Go 1.23+
- Docker
- Kind (for local testing)
- Node.js 20+ or Bun (for UI development)
### Building

```bash
# Build the operator binary
make build

# Build the Docker image
make docker-build IMG=cronjob-guardian:dev

# Build the UI
cd ui && pnpm build

# Generate CRDs and code
make manifests generate

# Run linters
make lint

# Run tests
make test
```
### Running Locally

```bash
# Install CRDs
make install

# Run the operator locally
make run

# Or run in a local Kind cluster
make test-e2e
```
### Updating Helm Documentation

Before releasing, regenerate the Helm chart documentation:

```bash
# Generate values.schema.json and update the README.md Values section
make helm-docs

# Or run individually:
make helm-schema     # Generate values.schema.json only
make helm-readme     # Update the README.md Values section only

# Sync CRDs if API types changed
make helm-sync-crds
```
This uses helm-tool to:

- Generate `values.schema.json` from `values.yaml` comments (enables IDE autocompletion)
- Update the `## Values` section in the chart README with HTML tables organized by section

Documenting `values.yaml`:

- Use `# Description` comments above properties to add descriptions
- Use `# +docs:section=SectionName` to organize values into sections
- Section comments can include additional description text on following lines
Example:

```yaml
# +docs:section=Storage
# Configuration for the storage backend.
config:
  storage:
    # Storage type: sqlite, postgres, or mysql
    type: sqlite
```
## Uninstalling

### Helm

```bash
# Uninstall the release
helm uninstall cronjob-guardian --namespace cronjob-guardian

# Delete CRDs (optional - this removes all CronJobMonitor and AlertChannel data)
kubectl delete crd cronjobmonitors.guardian.illenium.net
kubectl delete crd alertchannels.guardian.illenium.net

# Delete the namespace
kubectl delete namespace cronjob-guardian
```
### kubectl

```bash
# Remove all CronJobMonitor and AlertChannel resources
kubectl delete cronjobmonitors --all-namespaces --all
kubectl delete alertchannels --all-namespaces --all

# Remove the operator
make undeploy

# Remove CRDs
make uninstall
```
## Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
Copyright 2025.
Licensed under the Apache License, Version 2.0. See LICENSE for details.
## Directories

| Path | Synopsis |
|---|---|
| api/v1alpha1 | Package v1alpha1 contains API Schema definitions for the guardian v1alpha1 API group. |
| docs/swagger | Package swagger: code generated by swaggo/swag. |
| internal/testutil | Package testutil provides shared test utilities and mock implementations for use across the cronjob-guardian test suites. |
| test | |