# caesium

Open-source distributed job scheduler with DAG pipelines, multi-runtime support, and an embedded web UI.

Caesium lets you define jobs as declarative YAML DAGs, run them on Docker, Podman, or Kubernetes, and operate them through a REST API, Prometheus metrics, an embedded React UI, and an optional GraphQL endpoint when API-key auth is disabled.
## Local Developer Experience

Caesium is designed so job authors can validate, visualize, and execute pipelines locally before pushing them to a server.
### Validate definitions

```shell
caesium test --path jobs/ --verbose
```

Use `--check-images` to verify local image availability.
### Run executable harness scenarios

```shell
caesium test --scenario ./harness
```

Harness scenario files use the `Harness` kind and let you assert run status, task status, output fragments, schema-violation counts, cache hits, log content, Prometheus metric values, and emitted OpenLineage events against a real local execution.
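As a rough sketch only (the field names below are hypothetical, not Caesium's documented harness schema), a scenario asserting a run's final status, one task's status, and an output fragment might look like:

```yaml
apiVersion: v1
kind: Harness
metadata:
  alias: nightly-etl-smoke
spec:                        # hypothetical field names throughout
  job: jobs/nightly-etl.job.yaml
  expect:
    runStatus: succeeded
    taskStatus:
      extract: succeeded
    outputContains: "extracting"
```

See the `caesium test --scenario` output for the authoritative field names and assertion types.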
### Visualize a DAG

```shell
caesium job preview --path jobs/fanout-join.job.yaml
```
### Run locally

```shell
caesium dev --once --path jobs/nightly-etl.job.yaml
```

`caesium dev` without `--once` watches YAML files and re-runs the DAG on save. The local runner uses an in-memory SQLite database and the same execution engine as the server path.
## Quick Start

### 1. Write a job definition

```yaml
apiVersion: v1
kind: Job
metadata:
  alias: nightly-etl
trigger:
  type: cron
  configuration:
    cron: "0 2 * * *"
    timezone: "UTC"
steps:
  - name: extract
    image: alpine:3.20
    command: ["sh", "-c", "echo extracting"]
  - name: transform
    image: alpine:3.20
    command: ["sh", "-c", "echo transforming"]
  - name: load
    image: alpine:3.20
    command: ["sh", "-c", "echo loading"]
```
### 2. Validate and preview it

```shell
caesium test --path jobs/ --verbose
caesium test --scenario ./harness
caesium job preview --path jobs/nightly-etl.job.yaml
caesium job lint --path jobs/
```
### 3. Run it locally

```shell
caesium dev --once --path jobs/nightly-etl.job.yaml
```
### 4. Start the server and apply definitions

```shell
# Start the server
just run

# Apply definitions
caesium job apply --path jobs/ --server http://localhost:8080
```
## Features
- Declarative YAML job definitions with validation, diffing, schema reporting, and Git sync.
- DAG execution with fan-out, fan-in, retry controls, trigger rules, and run parameters.
- Docker, Podman, and Kubernetes task runtimes.
- Cron and HTTP triggers.
- Distributed execution backed by dqlite, including mixed amd64 and arm64 clusters.
- Embedded operator UI with live run updates, DAG inspection, backfill controls, and log streaming.
- Smart incremental execution: cache task results and skip re-execution when inputs are unchanged.
- OpenLineage event emission.
- Prometheus metrics plus optional in-browser operator tools for server logs and database inspection.
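The incremental-execution feature above boils down to an input-hash check: hash a task's inputs, and skip the task when the hash matches the one recorded on the previous run. A minimal sketch of that idea (an illustration of the concept only, not Caesium's actual cache keys or storage):

```shell
#!/bin/sh
# Skip a task when the combined hash of its input files is unchanged
# since the last recorded run. Conceptual sketch only.
run_if_changed() {
  task="$1"; shift
  mkdir -p .cache
  key=$(cat "$@" | sha256sum | awk '{print $1}')
  stamp=".cache/$task"
  if [ -f "$stamp" ] && [ "$(cat "$stamp")" = "$key" ]; then
    echo "cache hit: skipping $task"
  else
    echo "running $task"
    printf '%s' "$key" > "$stamp"
  fi
}
```

Calling `run_if_changed extract data.csv` twice in a row executes the task once and reports a cache hit on the second call, as long as `data.csv` is unchanged.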
## Server Workflow

### Prerequisites

### Run the server

```shell
just run
```

The API and embedded UI are served from http://localhost:8080.
### Using Podman

Set the following environment variables to use Podman instead of Docker:

```shell
export CAESIUM_PODMAN=true
just run
```

When `CAESIUM_PODMAN=true`, Caesium defaults to the rootless Podman socket at `$XDG_RUNTIME_DIR/podman/podman.sock` (or `/run/user/$UID/podman/podman.sock` when `XDG_RUNTIME_DIR` is unset) and uses the `podman` CLI. Override either if your setup differs:

```shell
export CAESIUM_SOCK=/custom/path/podman.sock
export CAESIUM_CONTAINER_CLI=podman
just run
```
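The socket-defaulting rule above amounts to a one-line fallback; a small sketch of the stated rule (this mirrors the description, not Caesium's actual code):

```shell
# Default rootless Podman socket: $XDG_RUNTIME_DIR/podman/podman.sock,
# falling back to /run/user/$UID when XDG_RUNTIME_DIR is unset or empty.
default_podman_sock() {
  echo "${XDG_RUNTIME_DIR:-/run/user/$(id -u)}/podman/podman.sock"
}

default_podman_sock
```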
| Variable | Default | Description |
|---|---|---|
| `CAESIUM_PODMAN` | `false` | Prefix image references with `localhost/` for Podman's local image store |
| `CAESIUM_CONTAINER_CLI` | `docker`, or `podman` when `CAESIUM_PODMAN=true` | Container CLI used by `just` recipes |
| `CAESIUM_SOCK` | `/var/run/docker.sock`, or `$XDG_RUNTIME_DIR/podman/podman.sock` when `CAESIUM_PODMAN=true` | Host-side container socket to mount into the container |
| `CAESIUM_PORT` | `8080` | Host port to expose the server on |
### Load example jobs

```shell
just hydrate
```

### Trigger a run manually

```shell
curl -X POST http://localhost:8080/v1/jobs/<job-id>/run
```
### Backfill a cron job

```shell
caesium backfill create \
  --job-id <job-id> \
  --start 2026-03-01T00:00:00Z \
  --end 2026-03-03T00:00:00Z \
  --server http://localhost:8080
```
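For the nightly-etl job from the quick start (cron `0 2 * * *`), a window like the one above covers the 02:00 UTC fire times that fall inside it. Assuming a half-open window [start, end), which is an assumption rather than documented behaviour, those fire times can be enumerated with GNU date:

```shell
# List the 02:00 UTC fire times of "0 2 * * *" inside [start, end).
# Assumes GNU date; the half-open-window semantics are an assumption.
start="2026-03-01"
end="2026-03-03"
d="$start"
while [ "$(date -u -d "$d" +%s)" -lt "$(date -u -d "$end" +%s)" ]; do
  echo "${d}T02:00:00Z"
  d="$(date -u -d "$d + 1 day" +%F)"
done
```

which prints `2026-03-01T02:00:00Z` and `2026-03-02T02:00:00Z`: two backfilled runs for this window.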
## Job Definitions

Jobs use the `apiVersion` / `kind` / `metadata` / `trigger` / `steps` schema. For full authoring guidance, see docs/job-definitions.md and the generated reference in docs/job-schema-reference.md.

Useful CLI commands:

```shell
caesium job lint --path ./jobs
caesium job diff --path ./jobs
caesium job apply --path ./jobs --server http://localhost:8080
caesium job schema --doc
caesium run retry-callbacks --job-id <job-id> --run-id <run-id>
```
## Building and Testing

Runtime images are published as multi-arch Docker manifests. `docker pull caesiumcloud/caesium:<tag>` resolves to the native architecture automatically.

| Command | Description |
|---|---|
| `just build` | Build a release image for the host platform |
| `CAESIUM_PLATFORM=linux/arm64 just build` | Cross-build for a specific architecture |
| `just build-cross linux/arm64` | Cross-build a single platform with buildx |
| `just build-multiarch tag=<tag>` | Build and push a multi-arch manifest |
| `just unit-test` | Run Go unit tests with race detector and coverage |
| `just ui-test` | Run UI unit tests and bundle budget checks |
| `just ui-e2e` | Run Playwright against the embedded UI and a real Caesium server |
| `just integration-test` | Run integration tests |
| `just helm-lint` | Validate the Helm chart |
Supported runtime image targets:

The embedded UI exposes a few optional power-user surfaces:

- Server log console: enabled by `CAESIUM_LOG_CONSOLE_ENABLED=true` and backed by `GET /v1/logs/stream`, `GET /v1/logs/level`, and `PUT /v1/logs/level`.
- Database console: enabled by `CAESIUM_DATABASE_CONSOLE_ENABLED=true` and backed by `GET /v1/database/schema` and `POST /v1/database/query`.
- Worker inspection: `GET /v1/nodes/:address/workers`.
- Fleet-level stats: `GET /v1/stats`.
## API Reference

The server exposes REST on port 8080. GraphQL is available at `GET /gql` only when `CAESIUM_AUTH_MODE=none`. When API-key auth is enabled, authentication in this release applies only to the REST API, `/metrics`, and the embedded UI; webhook delivery continues to use per-trigger webhook signature configuration rather than bearer tokens. The UI determines whether login is required through the explicit `GET /auth/status` endpoint rather than by probing protected resources.
| Endpoint | Purpose |
|---|---|
| `GET /health` | Health check |
| `GET /auth/status` | Report whether API-key auth is enabled for the UI |
| `GET /metrics` | Prometheus metrics (viewer auth required when `CAESIUM_AUTH_MODE=api-key`) |
| `GET /gql` | GraphQL endpoint when `CAESIUM_AUTH_MODE=none` |
| `GET /v1/jobs` | List jobs |
| `GET /v1/jobs/:id` | Get one job |
| `GET /v1/jobs/:id/tasks` | List persisted task definitions for a job |
| `GET /v1/jobs/:id/dag` | Retrieve DAG nodes and edges |
| `POST /v1/jobs/:id/run` | Trigger a new run |
| `PUT /v1/jobs/:id/pause` | Pause a job |
| `PUT /v1/jobs/:id/unpause` | Unpause a job |
| `GET /v1/jobs/:id/runs` | List runs for a job |
| `GET /v1/jobs/:id/runs/:run_id` | Get one run |
| `GET /v1/jobs/:id/runs/:run_id/logs?task_id=<task-id>` | Stream or retrieve task logs |
| `POST /v1/jobs/:id/runs/:run_id/callbacks/retry` | Retry failed callbacks |
| `POST /v1/jobs/:id/backfill` | Start a backfill |
| `GET /v1/jobs/:id/backfills` | List backfills |
| `PUT /v1/jobs/:id/backfills/:backfill_id/cancel` | Cancel a backfill |
| `POST /v1/jobdefs/apply` | Apply one or more job definitions |
| `GET /v1/triggers` | List triggers |
| `GET /v1/atoms` | List atoms |
| `GET /v1/events` | Subscribe to lifecycle events over SSE |
| `GET /v1/stats` | Get aggregated job/run statistics |
| `GET /v1/nodes/:address/workers` | Inspect worker state for one node |
The log and database console endpoints are intentionally gated by environment variables because they are operator-facing debugging features rather than default public APIs.
For the auth management CLI, prefer supplying credentials through the `CAESIUM_API_KEY` environment variable; the `--api-key` flag remains available but is visible in process listings.

When `CAESIUM_AUTH_MODE=api-key`, you must also set `CAESIUM_AUTH_KEY_HASH_SECRET` to a long random server-side secret. New and rotated API keys are stored as HMAC-SHA256 hashes derived from that secret. Existing legacy SHA-256 key hashes continue to validate after upgrade so you can roll the change out safely, but you should rotate those keys so the database no longer contains legacy unkeyed hashes.
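The hashing scheme above can be illustrated with openssl; this is a sketch of HMAC-SHA256 key hashing as described, not Caesium's actual storage format:

```shell
# Derive an HMAC-SHA256 hash of an API key under a server-side secret,
# as in the key-storage scheme described above (illustrative only).
hash_api_key() {
  secret="$1"; key="$2"
  printf '%s' "$key" | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}'
}

hash_api_key "long-random-server-side-secret" "example-api-key"
```

Because the hash is keyed, a leaked database alone is not enough to test candidate API keys offline; an attacker would also need the server-side secret.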
## Documentation

## Contributing

See CONTRIBUTING.md for setup, development workflow, and PR guidance.

## License

See LICENSE for details.