schedule

package
v0.0.6 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 11, 2026 License: Apache-2.0 Imports: 17 Imported by: 0

README

Package schedule — Per-Pool Streaming Claim Scheduler

This package implements the claim scheduler that pairs incoming sandbox-create requests with idle pods in a SandboxPool. Each pool gets one PoolScheduler instance running a dedicated goroutine.


Architecture

Component Overview
graph TD
    API["API Server\n(sandbox_service.go)"]
    SCHED["PoolScheduler\n(one per pool)"]
    RQ["readyQueue\n(FIFO, UID-deduped)"]
    RES["reservations\n(TTL-bounded set)"]
    CAS["doCAS goroutines\n(capped at 128)"]
    K8S["k8s apiserver"]
    CACHE["informer cache\n(controller-runtime)"]
    CTRL["SandboxPool Reconciler"]

    API -->|"Enqueue(req)"| SCHED
    CTRL -->|"NotifyIdle() after\nMarkUpdateCompleted"| SCHED
    SCHED -->|"refreshReady()\nListIdlePodsForPool"| CACHE
    SCHED -->|"popUnreservedAndReserve"| RQ
    RQ -->|"reserves pod name"| RES
    SCHED -->|"go doCAS(req, pod)"| CAS
    CAS -->|"TriggerUpdateWithOptions\n(phase: Idle→Starting)"| K8S
    K8S -->|"watch event\n(~50–200 ms)"| CACHE
    CAS -->|"success → req.ResultCh"| API
    CAS -->|"conflict → requeue + wake"| SCHED
Data Flow: Request Lifecycle
sequenceDiagram
    participant C as API Caller
    participant S as PoolScheduler
    participant Q as readyQueue
    participant G as doCAS goroutine
    participant K as k8s apiserver

    C->>S: Enqueue(req) → reqCh
    S-->>S: triggerCh wakeup
    S->>Q: popUnreservedAndReserve()
    Q-->>S: pod (reserved)
    S->>G: go doCAS(req, pod)
    G->>K: TriggerUpdateWithOptions (Idle→Starting)
    alt success
        K-->>G: updated pod
        G-->>C: ResultCh ← ClaimResult{Pod}
    else conflict / ErrUnexpectedPodPhase
        K-->>G: 409 Conflict
        G->>S: requeue(req) + wake()
    else hard error
        K-->>G: 403/500
        G->>S: reserved.release(pod.Name)
        G-->>C: ResultCh ← ClaimResult{Err}
    end
Idle Pod Notification Flow
sequenceDiagram
    participant REC as Reconciler\n(syncInplaceUpdatePhases)
    participant K as k8s apiserver
    participant INF as informer cache
    participant S as PoolScheduler

    REC->>K: MarkUpdateCompleted (Stopping→Idle)
    Note over REC: errgroup.Wait() for all pods
    REC->>S: NotifyIdleAvailable()
    Note over S: goroutine: sleep 200ms\n(informer propagation window)
    K-->>INF: watch event (~50–200ms)
    Note over INF: cache updated: pod is now Idle
    S->>INF: refreshReady() → ListIdlePodsForPool
    INF-->>S: idle pods (including newly recycled)
    S-->>S: appendFiltered → readyQueue
    S-->>S: tryDispatchPending

The 200 ms delay in NotifyIdle is deliberate: firing refreshReady before the watch event arrives returns a stale cache view where the just-recycled pod still appears as Stopping, causing the scheduler to fall back to the 10-second poll timer.


Key Design Decisions

FIFO readyQueue + UID dedup

Pods enter the queue in the order refreshReady first observes them as Idle. Because NotifyIdle fires on every Stopping→Idle transition, the earliest-recycled pod is always dispatched first — equivalent to "prefer older idle" semantics without a scorer.

Duplicate entries are prevented by a UID set: repeated refreshReady calls (e.g. from both idleCh and pollTimer) do not double-admit the same pod.

TTL-bounded reservations

When a pod is handed to a doCAS goroutine it is atomically reserved (pod name → expiry). Subsequent ListIdlePodsForPool calls from the informer cache may still report that pod as Idle for up to ~300 ms (watch propagation latency). The reservation prevents re-selecting it during that window, eliminating ErrUnexpectedPodPhase conflicts under high QPS.

TTL (reservationTTL = 2 s) is long enough to outlast any realistic informer lag but far shorter than the time a pod takes to legitimately return to Idle again (tens of seconds minimum).

Non-blocking scheduler goroutine

doCAS always runs in a fresh goroutine; the scheduler loop never blocks on apiserver round-trips. This keeps request latency proportional to the number of idle pods, not the number of concurrent requests.

Exponential back-off + immediate wakeup

When no idle pods are available, pollTimer backs off exponentially (10 s → 5 min). Any of three events immediately resets the interval to 10 s and triggers a dispatch attempt:

  • New request arrivestriggerCh
  • Pod becomes IdleidleCh (delayed by idleNotifyDelay)
  • CAS conflicttriggerCh via wake()
Scale-up signal

When requests are pending and the ready queue is empty, triggerScaleUpOnce writes PoolScaleUpPendingAnnotationKey onto the SandboxPool. A singleflight guard deduplicates concurrent calls so a burst of arrivals does not cause an annotation storm. The Reconciler's autoscaler observes this annotation and increases spec.replicas.


Running Stress Tests

Stress tests live in scheduler_stress_test.go and are gated behind the stress build tag so they never run during normal go test ./....

# Baseline run
SCHEDULE_STRESS_SCALE=1.0 go test -race -count=1 -tags=stress ./pkg/lifecycle/schedule/...

# Heavier load (2× requests, pods, and durations)
SCHEDULE_STRESS_SCALE=2.0 go test -v -race -count=3 -tags=stress ./pkg/lifecycle/schedule/...

# Quick smoke run (half load)
SCHEDULE_STRESS_SCALE=0.5 go test -race -count=1 -tags=stress ./pkg/lifecycle/schedule/...
SCHEDULE_STRESS_SCALE

A positive floating-point multiplier (default 1.0) applied to request counts, pod counts, and test durations. Scaling above 1.0 increases the chance of catching rare race conditions; scaling below 1.0 speeds up CI smoke runs.

Test Coverage
Test What it checks
TestScheduler_ExtremeScale 2000 concurrent requests against 500 pods; every pod dispatched exactly once; every request terminates
TestScheduler_ConflictInjectionUnderLoad ~30% of pods inject a retriable Conflict on first attempt; scheduler retries and still claims all pods
TestScheduler_StressStorm Continuous enqueue + NotifyIdle storm for 600 ms; Shutdown() returns within 5 s
TestScheduler_ConcurrentEnqueueAndShutdown Concurrent Enqueue callers while Shutdown is called; neither hangs

The -race flag is the authoritative correctness signal. The stress tests intentionally do not assert on goroutine counts: parallel test processes inflate runtime.NumGoroutine() independent of real leaks.

Documentation

Overview

Package schedule implements the per-pool streaming claim scheduler.

Streaming model (vs. the retired batch flight):

  • A persistent ready-queue holds idle pods observed via informer cache. notifyIdle() triggers an incremental refresh that appends newly-idle pods in arrival order; no global sort is performed on the hot path.

  • ClaimRequest arrivals pop one pod and spawn a goroutine that executes inplaceupdate.TriggerUpdateWithOptions. The scheduler goroutine never blocks on the apiserver round-trip, so new requests are handled as fast as they arrive.

  • A TTL-bounded reservation map (see reservations.go) prevents a pod from being re-selected while its phase transition is still propagating from apiserver → informer cache. This is the main defence against the ErrUnexpectedPodPhase conflict observed under high QPS in the batch implementation.

See /home/ylli/.claude/plans/spicy-pondering-kahn.md for the design note.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ClaimOptions

type ClaimOptions struct {
	ContainerImages map[string]string
	Labels          map[string]string
	Annotations     map[string]string
	// TargetPodPhase defaults to SandboxPhaseRunning when empty.
	TargetPodPhase string
}

ClaimOptions carries the per-request options applied to the target pod when the CAS succeeds. Fields map 1:1 onto inplaceupdate.UpdateOptions.

type ClaimRequest

type ClaimRequest struct {
	Ctx      context.Context
	Opts     ClaimOptions
	Deadline time.Time
	ResultCh chan<- ClaimResult
	// EnqueuedAt is set by Enqueue() and read by doCAS to record the
	// end-to-end dispatch latency via ScheduleDispatchLatencySeconds. Callers
	// may leave it zero; Enqueue will populate it if so.
	EnqueuedAt time.Time
}

ClaimRequest is one pending claim awaiting an idle pod.

type ClaimResult

type ClaimResult struct {
	Pod *corev1.Pod
	Err error
}

ClaimResult is the outcome of a single Enqueue/claim attempt.

type PoolScheduler

type PoolScheduler struct {
	// contains filtered or unexported fields
}

PoolScheduler is a per-pool background goroutine that streams claim requests to idle pods as they become available.

func NewPoolScheduler

func NewPoolScheduler(ns, name, team, user, env string, k8sClient client.Client) *PoolScheduler

NewPoolScheduler allocates a scheduler without starting its goroutine. team/user/env identify the owning pool for metrics labelling — env is the owning SandboxEnv name read from the Pool's LabelEnv (empty string for legacy Pools that pre-date Env adoption). k8sClient may be nil; in that case the background status writer that mirrors the in-memory queue length onto SandboxPool.Status.PendingRequests is skipped (useful for unit tests that exercise only the in-process dispatch path).

func (*PoolScheduler) Enqueue

func (s *PoolScheduler) Enqueue(req *ClaimRequest) bool

Enqueue hands req to the scheduler. Returns false (and does NOT write to ResultCh) when the internal channel is full; the caller must translate that into backpressure.

func (*PoolScheduler) NotifyIdle

func (s *PoolScheduler) NotifyIdle()

NotifyIdle is called when a pod transitions Stopping → Idle for this pool. It schedules a delayed wake of the scheduler so the informer cache has time to reflect the apiserver write before refreshReady() runs its List call. The delay (idleNotifyDelay ≈ 200 ms) covers P99 informer propagation latency; without it refreshReady() fires against a stale cache that still reports the pod as Stopping and misses it, causing the scheduler to fall back to the 10-second pollTimer. Safe to call from any goroutine.

func (*PoolScheduler) Run

func (s *PoolScheduler) Run(ctx context.Context)

Run is the scheduler goroutine; call it exactly once per scheduler.

func (*PoolScheduler) Shutdown

func (s *PoolScheduler) Shutdown()

Shutdown stops the scheduler goroutine and blocks until it has drained pending requests.

func (*PoolScheduler) Snapshot added in v0.0.5

func (s *PoolScheduler) Snapshot() Snapshot

Snapshot returns a point-in-time view of the scheduler's internal counters. See the Snapshot doc comment for field semantics. Safe to call from any goroutine; cost is O(1) atomic / channel-len reads.

type Snapshot added in v0.0.5

type Snapshot struct {
	// IdleReady is the number of idle Pods currently admitted to the ready
	// queue and not reserved — what would be popped if a request arrived now.
	IdleReady int
	// QueueLen is the total number of ClaimRequests this scheduler is
	// holding that have not yet reached a terminal result. Counts both
	// requests still in the producer→consumer channel AND requests
	// parked inside the scheduler goroutine waiting for an idle pod.
	// Sourced from an atomic counter so the value stays stable while a
	// request transiently moves between buffers (the autoscaler relies
	// on this for its reactive demand signal).
	QueueLen int
	// ReservedCount is the number of pods currently reserved (handed off to
	// a CAS goroutine but not yet observed back as Idle/Starting through the
	// informer cache).
	ReservedCount int
	// InflightCAS is the number of TriggerUpdateWithOptions goroutines
	// currently running. Useful for diagnosing apiserver back-pressure.
	InflightCAS int
	// LastDispatchAt is the wall-clock time of the most recent successful
	// dispatch (doCAS OK branch). Zero when no dispatch has ever succeeded.
	LastDispatchAt time.Time
}

Snapshot is a point-in-time view of a PoolScheduler's internal counters. All fields are read with atomic / channel-len semantics; calling Snapshot is safe from any goroutine and does not interact with the apiserver.

EnvScheduler reads Snapshots on the request hot path to rank candidate member pools, so the cost must stay in the tens-of-nanoseconds range.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL