cocoon-operator
Kubernetes operator that manages VM-backed pod lifecycles through two CRDs:
- CocoonSet — declarative spec for an agent group (one main agent + N sub-agents + M toolboxes)
- CocoonHibernation — per-pod hibernate / wake request
Both reconcilers are built on controller-runtime and consume the typed CRD shapes shipped from cocoon-common/apis/v1.
The binary entry point is main.go; the reconcilers themselves live in subpackages so each one is independently testable:
cocoon-operator/
├── main.go # manager wiring + flag parsing
├── cocoonset/ # CocoonSet reconciler, pod builders, status diff
├── hibernation/ # CocoonHibernation reconciler
└── epoch/ # SnapshotRegistry interface + epoch HTTP adapter
Architecture
┌──────────────────────────────────────────────────────────────────┐
│ cocoon-operator │
│ │
│ ┌────────────────────────┐ ┌─────────────────────────────┐ │
│ │ cocoonset.Reconciler │ │ hibernation.Reconciler │ │
│ │ - finalizer + GC │ │ - HibernateState patches │ │
│ │ - main → subs → tbs │ │ - epoch.HasManifest probe │ │
│ │ - patch /status │ │ - Conditions │ │
│ └────────┬───────────────┘ └────────────┬────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────┐ ┌──────────────────────┐ │
│ │ controller-runtime │ │ epoch SnapshotRegistry│ │
│ │ Manager │ │ (HTTP via │ │
│ │ - leader election │ │ registryclient) │ │
│ │ - metrics :8080 │ └──────────────────────┘ │
│ │ - probes :8081 │ │
│ └────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
CocoonSet reconcile loop
- Fetch the CocoonSet (return early on NotFound).
- If DeletionTimestamp is set, walk owned pods, delete them, optionally epoch.DeleteManifest each VM (per-pod, gated on meta.ShouldSnapshotVM(spec) so main-only does not issue DeleteManifest against sub-agent / toolbox tags vk-cocoon never pushed), then drop the finalizer.
- Ensure the cocoonset.cocoonstack.io/finalizer is in place.
- List owned pods by cocoonset.cocoonstack.io/name=<cs.Name> and classify by role label.
- Suspend short-circuit: if spec.suspend == true, write meta.HibernateState(true) onto every pod and report Phase=Suspended.
- Un-suspend: if spec.suspend == false and any owned pod still carries the hibernate annotation from a prior suspend, clear it via PatchHibernateState(false) so vk-cocoon wakes the VMs. PatchHibernateState(false) is a no-op on pods whose annotation is already absent, so this is cheap in the common "never suspended" case.
- Ensure the main agent (slot 0). If it is not yet Ready, requeue in 5 seconds and report Phase=Pending.
- Ensure sub-agents [1..Replicas]; delete extras above the requested count.
- Ensure toolboxes by name; skip creation with an error if the toolbox pod name collides with an existing non-toolbox pod (e.g. an agent). Delete extras.
- Re-list and patch /status (with a structural diff so unchanged status patches are no-ops).
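The "ensure sub-agents [1..Replicas], delete extras" step above can be sketched with plain set arithmetic. This is a stdlib-only illustration: the `pod` struct and `scaleSubs` helper are hypothetical stand-ins for the real corev1.Pod handling, not the operator's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, stripped-down stand-in for corev1.Pod: just the
// fields the classification step cares about.
type pod struct {
	name string
	role string // value of the role label: "main", "sub", or "toolbox"
	slot int    // sub-agent slot index
}

// scaleSubs mirrors the "ensure sub-agents [1..Replicas]" step: it returns
// the slot indices that still need a pod created, and the pods above the
// requested count that must be deleted.
func scaleSubs(owned []pod, replicas int) (create []int, remove []pod) {
	seen := map[int]bool{}
	for _, p := range owned {
		if p.role != "sub" {
			continue
		}
		if p.slot >= 1 && p.slot <= replicas {
			seen[p.slot] = true // desired and present: keep
		} else {
			remove = append(remove, p) // extra above Replicas: delete
		}
	}
	for slot := 1; slot <= replicas; slot++ {
		if !seen[slot] {
			create = append(create, slot) // desired but missing: create
		}
	}
	sort.Ints(create)
	return create, remove
}

func main() {
	owned := []pod{
		{name: "demo-main", role: "main", slot: 0},
		{name: "demo-sub-1", role: "sub", slot: 1},
		{name: "demo-sub-3", role: "sub", slot: 3}, // stale pod above Replicas=2
	}
	create, remove := scaleSubs(owned, 2)
	fmt.Println(create, len(remove)) // prints: [2] 1
}
```

The same reconcile makes both moves in one pass, so a scale-down from 3 to 2 and a lost pod at slot 2 converge in a single loop iteration.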
Pods are constructed via meta.FromAgentSpec / meta.FromToolboxSpec factory helpers, so the operator never touches the annotation map directly. These factories propagate the full VMOptions surface (OS, Backend, ConnType, Network, ForcePull, NoDirectIO, ProbePort, Storage, Resources) into the pod annotations that vk-cocoon consumes.

The For watch uses predicate.GenerationChangedPredicate, so reconciles only fire when the spec actually changes; status-only patches the operator makes itself never loop back. The Owns side filters pod events to creation, deletion, and meaningful transitions (phase change, readiness flip, label/annotation mutation) via a podRelevantChange predicate, so pure VK status churn does not trigger reconcile storms.
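The Owns-side filtering described above can be simulated without controller-runtime. In the real reconciler this logic lives behind a predicate over *corev1.Pod update events; the `podSnapshot` type and field names below are assumptions made so the sketch runs on the standard library alone.

```go
package main

import (
	"fmt"
	"reflect"
)

// podSnapshot is a hypothetical reduction of the pod fields the
// podRelevantChange predicate compares between old and new objects.
type podSnapshot struct {
	phase       string
	ready       bool
	labels      map[string]string
	annotations map[string]string
}

// relevantChange returns true only for the update events the Owns watch
// cares about: phase changes, readiness flips, and label/annotation
// mutations. Pure status churn (probe timestamps, restart counts) compares
// equal here and is filtered out before it can trigger a reconcile.
func relevantChange(oldS, newS podSnapshot) bool {
	return oldS.phase != newS.phase ||
		oldS.ready != newS.ready ||
		!reflect.DeepEqual(oldS.labels, newS.labels) ||
		!reflect.DeepEqual(oldS.annotations, newS.annotations)
}

func main() {
	base := podSnapshot{phase: "Running", ready: true,
		annotations: map[string]string{"hibernate": "false"}}

	churn := base // VK bumped some other status field; the snapshot is identical
	woken := base
	woken.annotations = map[string]string{"hibernate": "true"}

	fmt.Println(relevantChange(base, churn), relevantChange(base, woken)) // prints: false true
}
```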
CocoonHibernation reconcile loop
| Spec.Desire | What the reconciler does | Terminal phase |
| --- | --- | --- |
| Hibernate | meta.HibernateState(true).Apply on the target pod, then poll epoch.HasManifest(vmName, meta.HibernateSnapshotTag) until the snapshot lands. A probe error (transport / 5xx / auth) surfaces as a returned error so controller-runtime logs + retries with backoff. | Hibernated |
| Wake | Check whether the container is already Running (skip the annotation patch if so); otherwise clear meta.HibernateState once (skip if already cleared to avoid triggering informer events on every requeue cycle), then wait for the container to be Running and drop the hibernation snapshot tag from epoch. A wake that does not complete within wakeTimeout (5 minutes) is escalated to Phase=Failed with a dated message in the Ready condition. | Active |
There is no cocoon-vm-snapshots ConfigMap bridge — epoch is the single source of truth for hibernation state. Failure paths set Phase=Failed with a one-shot message in the Ready condition instead of looping forever on a bad reference. A Failed wake is recoverable: on re-entry into Waking from a non-Waking phase the reconciler explicitly refreshes the Ready condition's LastTransitionTime so the wake budget resets cleanly (without the override, apimeta.SetStatusCondition would preserve the stale timestamp across the False → False transition and the recovered wake would trip the deadline on the next reconcile).
Configuration
| Variable | Default | Description |
| --- | --- | --- |
| KUBECONFIG | unset | Path to kubeconfig when running outside the cluster |
| OPERATOR_LOG_LEVEL | info | projecteru2/core/log level |
| EPOCH_URL | http://epoch.cocoon-system.svc:8080 | Base URL of the epoch registry |
| EPOCH_TOKEN | unset | Bearer token (read-only is enough) |
| EPOCH_CA_CERT | unset | Path to PEM-encoded CA certificate for TLS verification against epoch |
| METRICS_ADDR | :8080 | Prometheus listener |
| PROBE_ADDR | :8081 | healthz / readyz listener |
| LEADER_ELECT | true | Enable leader election so only one replica reconciles |
CLI flags (--metrics-bind-address, --health-probe-bind-address, --leader-elect) override the corresponding environment variables.
Installation
kubectl apply -k github.com/cocoonstack/cocoon-operator/config/default?ref=main
This installs:
- The cocoon-system namespace
- Both CRDs (imported from cocoon-common via make import-crds)
- A ServiceAccount, ClusterRole, and ClusterRoleBinding
- The operator Deployment (1 replica with leader election on)
To override the image tag or replica count, build a kustomize overlay that imports config/default as a base.
Keeping CRDs in sync with cocoon-common
The CRD YAML lives under config/crd/bases/ and is committed so a clean clone works out of the box. After bumping the cocoon-common dependency, regenerate the bases with:
go get github.com/cocoonstack/cocoon-common@<version>
make import-crds
git add config/crd/bases && git commit
The import-crds target uses go list -m -f '{{.Dir}}' to resolve the cocoon-common module path and copies the YAML straight from there. CI rejects PRs that forget this step.
Development
make all # full pipeline: deps + fmt + lint + test + build
make build # build cocoon-operator binary
make test # vet + race-detected tests
make lint # golangci-lint on linux + darwin
make import-crds # refresh config/crd/bases from cocoon-common
make help # show all targets
The Makefile detects Go workspace mode (go env GOWORK) and skips go mod tidy when active so cross-module references resolve through go.work without forcing a release of cocoon-common.
Related projects
| Project | Role |
| --- | --- |
| cocoon-common | CRD types, annotation contract, shared helpers |
| cocoon-webhook | Admission webhook for sticky scheduling and CocoonSet validation |
| epoch | Snapshot registry; the operator queries it via SnapshotRegistry |
| vk-cocoon | Virtual kubelet provider managing VM lifecycle |
License
MIT