Documentation
¶
Overview ¶
Package coldstart implements the / the design doc rule cascade that decides how an agent resolves a digest when its local cache misses and the DHT lookup did not return enough providers.
The orchestrator is invoked by the mirror miss path after a FindProviders call returns empty. It runs the following pipeline:
1. Compute HRW top-K from the local membership snapshot. 2. Dial all K in parallel with `pull_intent_query`, collecting responses up to a 2 s timeout (the step 4). 3. Apply the 7-rule cascade in priority order (the step 5): 1. failure short-circuit -> return ErrFailureShortCircuit (5xx) 2. cache hit -> return the responder's transfer addr 3. in-flight piggyback -> DHT-poll until provider appears 4. transient cooldown -> return ErrCooldownActive (5xx) 5. all-unreachable expand -> re-run step 2 with top-2K 6. degraded eager expand -> re-run step 2 with top-2K 7. cold-start -> please_pull to lowest-rank reachable, then DHT-poll for the provider 4. While DHT-polling (rules 3 and 7), bound by the per-digest timeout from the design doc (manifest/config 5 s; layer max(10 s, size/50 MB/s) × 3) and the per-kind poll interval (200 ms manifest/config; 1 s layer).
The Resolver is stateless across calls; concurrent Resolve invocations for the same digest are safe and will independently arrive at the same outcome (the inflight map at the puller side dedupes the origin pull itself).
Index ¶
- Variables
- type ChildDigest
- type Discovery
- type MetricsHooks
- type Options
- type Resolution
- type Resolver
- func (r *Resolver) PrefetchChildren(ctx context.Context, children []ChildDigest, registry, repository string) error
- func (r *Resolver) PrefetchLayers(ctx context.Context, digests []digest.Digest, registry, repository string) error
- func (r *Resolver) Resolve(ctx context.Context, d digest.Digest, kind ifaces.OriginRefKind, ...) (*Resolution, error)
Constants ¶
This section is empty.
Variables ¶
var ( // ErrFailureShortCircuit fires rule 1. ErrFailureShortCircuit = errors.New("coldstart: failure short-circuit") // ErrCooldownActive fires rule 4. ErrCooldownActive = errors.New("coldstart: transient cooldown active") // ErrExhausted fires when the cascade reaches its terminal state // without producing a provider (e.g., expanded-2K exhausted, or // please_pull completed but the DHT poll timed out). ErrExhausted = errors.New("coldstart: cascade exhausted") )
Sentinel errors. Mirror layer maps all of these to 5xx; tests distinguish to validate which rule fired.
var ErrPrefetchInvalid = errors.New("coldstart: prefetch invalid arguments")
ErrPrefetchInvalid signals a programmer error in the prefetch call: registry or repository was empty, which would produce a malformed PleasePull RPC.
var ErrPrefetchPartial = errors.New("coldstart: prefetch had per-puller failures")
ErrPrefetchPartial signals at least one puller's PleasePull RPC failed. The other pullers were still asked to pull. Callers can errors.Is against this for retry / logging logic; the prefetch is best-effort so this is informational rather than fatal.
Functions ¶
This section is empty.
Types ¶
type ChildDigest ¶
type ChildDigest struct {
Digest digest.Digest
Kind ifaces.OriginRefKind
}
ChildDigest pairs a child digest with the OCI URL-family kind the puller MUST target. Kind is one of ifaces.KindConfig (the manifest's image-config blob) or ifaces.KindBlob (every layer descriptor); both are pulled from /v2/<repo>/blobs/<digest> at the registry level but are carried separately on the wire so per-kind metrics agree end-to-end across the please_pull boundary. internal/manifest's TypedChildren is the canonical producer of these values.
type Discovery ¶
type Discovery interface {
FindProviders(ctx context.Context, d digest.Digest) ([]ifaces.Provider, error)
Health() float64
}
Discovery is the subset of the libp2p discovery host that the orchestrator needs. Kept narrow for ease of mocking.
type MetricsHooks ¶
type MetricsHooks struct {
// OnRankMismatch fires once per pull_intent response whose
// reported hrw_rank disagrees with the requester's computed
// rank for that responder. kindLabel is "manifest", "config",
// or "layer".
OnRankMismatch func(kindLabel string, responder ifaces.NodeID)
// OnDhtFalseEmpty fires when the orchestrator observes the
// false-empty case: DHT had returned 0 providers, but a
// pull_intent_query reports has_cached=true.
OnDhtFalseEmpty func()
// OnTopKProbeHit fires when any rule before rule 7 (cold-start)
// resolves the request. Used to track how often the probe saves
// an origin pull.
OnTopKProbeHit func()
// OnColdStartDuration is called once per Resolve with the total
// elapsed time and the outcome rule that fired ("rule1".."rule7"
// or "expanded_rule_N").
OnColdStartDuration func(kindLabel, outcome string, d time.Duration)
// OnDesignatedPullerTakeover fires when a pull_intent_query
// responder reports in_flight=true but its started_at is older
// than the per-the design doc stall threshold, so the requester excludes
// it from rule-3 piggyback and routes via the next-ranked node
// (rule 6 / rule 7). kindLabel is "manifest", "config", or
// "layer". Maps to the design doc metric
// `p2p_designated_puller_takeover_total`.
OnDesignatedPullerTakeover func(kindLabel string)
// OnTopKExpansion fires once per expansion pass to top-2K (or
// top-(K × TopKExpansionFactor) when the factor is configured).
// reason is "degraded" (rule-6 DHT-degraded expand) or
// "all_unreachable" (rule-5 expansion). Maps to the design doc metric
// `p2p_topk_expansion_total{reason=}`.
OnTopKExpansion func(reason string)
// OnPrefetchBatch fires once per PrefetchLayers call with the
// number of distinct pullers contacted and the number of layer
// digests grouped into those batches (after self / unreachable
// filtering). Maps to the design doc / the design doc batched-please_pull metrics:
// `p2p_prefetch_batches_total` (count) and
// `p2p_prefetch_digests_batched_total` (sum).
OnPrefetchBatch func(pullers, digests int)
}
MetricsHooks lets the metrics package wire Prometheus counters without coupling the orchestrator to client_golang. All hooks are nil-safe.
type Options ¶
type Options struct {
Members ifaces.Members
Discovery Discovery
Coord ifaces.Coordinator
Inflight *inflight.Map
Logger *slog.Logger
Metrics MetricsHooks
Now func() time.Time
HrwK int // default 3
HrwScope hrw.Scope // default ScopeCluster
SelfZone string // required when HrwScope == ScopeZone
// LocalIntent computes self's PullIntent synchronously, without
// the libp2p coord round-trip. When non-nil, the cold-start
// orchestrator includes self as a first-class participant in the
// the design doc rule cascade - rule 2 (cache hit on self), rule 3 (self
// in-flight), rule 4 (self in cooldown), and rule 7 (self picked
// as designated puller) all behave the same as for any peer.
//
// Without LocalIntent, the resolver excludes self from
// queryTargets and from `reachable`, which means a self-as-HRW-
// rank-0 case routes please_pull to rank 1 - two nodes both
// trying to delegate to each other can each origin-pull the same
// digest, violating the cache-hit "one origin pull per digest"
// invariant. New deployments MUST wire LocalIntent.
LocalIntent ifaces.LocalIntentProvider
// LocalPull starts an origin pull on self without the libp2p
// please_pull RPC. Used when rule 7's lowest-rank-reachable puller
// is self. nil + LocalIntent non-nil + rule 7 picks self ->
// resolver falls back to Coord.PleasePull(self, ...), which will
// either fail to dial or burn a stream slot to no benefit; tests
// should set both together.
LocalPull ifaces.LocalPullStarter
// Tunables (defaults applied if zero).
QueryTimeout time.Duration // default 2s - step 5 wait window
PollManifest time.Duration // default 200ms -
PollLayer time.Duration // default 1s -
TransientCooldownCap time.Duration // default 30s - rule 4
HonorSweepInterval time.Duration // default 1m, negative disables full sweeps
// TopKExpansionFactor is the multiplier applied to HrwK on the
// expansion pass under rule 5 / rule 6 (the step 5; the design doc
// `topk_expansion_factor_degraded`). Defaults to 2 when ≤1.
TopKExpansionFactor int
// TrustedFailureClasses is the set of the design doc origin-error classes
// the requester accepts as a cluster-wide 5xx-immediate signal
// when a top-K responder reports recently_failed in rule 1.
// Classes outside this set (notably `transient`) are honored
// only locally by the reporting puller. Empty defaults to
// {auth, not_found, rate_limited} per design the design doc / config
// `origin_failure_classes_trusted_cluster_wide`.
TrustedFailureClasses []ifaces.FailureClass
}
Options configures a Resolver.
type Resolution ¶
type Resolution struct {
// Providers are transfer endpoints (host:port) the caller should
// fetch from, in priority order. Non-empty on success.
Providers []ifaces.Provider
// Outcome names which rule fired. Useful for tests and metrics.
Outcome string
}
Resolution carries the orchestrator's verdict.
type Resolver ¶
type Resolver struct {
// contains filtered or unexported fields
}
Resolver runs the cascade.
func (*Resolver) PrefetchChildren ¶
func (r *Resolver) PrefetchChildren(ctx context.Context, children []ChildDigest, registry, repository string) error
PrefetchChildren is the kind-preserving sibling of PrefetchLayers. It groups children by (HRW puller, kind) and emits one PleasePull (or StartLocalPull) RPC per group, so the single-repo-per- batch invariant ("all digests in a batch MUST share kind") is honored while still preserving the per-kind metric label all the way through the wire.
A manifest typically yields one KindConfig digest and N KindBlob digests. If all N+1 children HRW to the same puller, PrefetchChildren issues TWO RPCs (one per kind) rather than one mixed RPC - that's the trade for keeping the kind label honest. The CPU/RPC overhead is negligible (one extra round-trip per manifest serve, dwarfed by the layer pull itself), and the alternative (collapsing config into blob on the wire) leaves p2p_origin_pull_total{kind="config"} permanently zero.
func (*Resolver) PrefetchLayers ¶
func (r *Resolver) PrefetchLayers(ctx context.Context, digests []digest.Digest, registry, repository string) error
PrefetchLayers groups digests by their HRW rank-0 reachable designated puller and issues one PleasePull RPC per puller. Digests HRW'ing to self are diverted to the local LocalPullStarter (if configured) and batched as a single StartLocalPull call; if no LocalPullStarter is configured, the self-bucket is skipped (the per-digest Resolve cascade will still recover via rule 7 when containerd actually asks for the layer). Already-cached or otherwise filtered digests should be removed by the caller before invoking.
PrefetchLayers blocks until every per-puller RPC has completed or errored, but each RPC is bounded by QueryTimeout. Callers run it in a goroutine for fire-and-forget semantics. A returned error means at least one puller's RPC failed; callers can log it and otherwise ignore - the next per-digest Resolve call from containerd will fall back to the full the design doc cascade.
registry and repository identify the upstream and OCI repo. They MUST be non-empty (the design doc single-repo-per-batch invariant); an empty value returns ErrPrefetchInvalid without issuing any RPC.
All digests passed via PrefetchLayers are sent as KindBlob; this is a back-compat shim that pre-dates per-kind labelling. New callers should use PrefetchChildren which preserves the config-vs-layer kind end-to-end through the wire so per-kind metrics ("manifest | config | layer") remain honest.
func (*Resolver) Resolve ¶
func (r *Resolver) Resolve(ctx context.Context, d digest.Digest, kind ifaces.OriginRefKind, registry, repository string, expectedSize int64) (*Resolution, error)
Resolve runs the cascade for d. The returned error is one of the sentinel errors above, or a context cancellation, or a transport error. expectedSize is 0 if unknown (e.g., manifest digest before parsing); it is used to compute the per-the design doc stall threshold for rule 3 (in_flight) and the DHT-polling deadline overall.
registry+repository identify the upstream and OCI repo for the please_pull RPC (the design doc single-repo-per-batch invariant). Both must be non-empty for rule 7 to fire; the orchestrator otherwise falls through to ErrExhausted.