wcprof

package
v1.0.0-beta.5 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jun 24, 2026 License: Apache-2.0 Imports: 9 Imported by: 0

README

wcprof: engine wall-clock profiling

wcprof is an experimental, cheap wall-clock profiler for the Dagger engine, built to find latency bottlenecks that CPU/memory profiles can't see (e.g. a 300ms serial tax before every container start). It is intentionally separate from OTel: events are fixed-size structs appended to sharded in-memory buffers with interned strings, and all heavy analysis happens offline.

What it records

Three event types, written by hooks at the engine's blocking choke points:

  • ops: timed intervals with a kind (call, call_exec, lazy, exec, exec_phase, service_start, session_phase, ...), a class (e.g. Container.withExec, exec.setupNetwork), an instance ident (recipe digest, exec ID), a structural parent (from context), and an outcome (hit/executed/joined/...).
  • waits: exact blocked-on intervals — who waited, on which op (or named resource, e.g. a cache-volume lock), why, from when to when. These are recorded at the choke points themselves (dagql singleflight, lazy-eval waiters, service starts, exec completion), not inferred from span nesting.
  • links: non-blocking correlations, most importantly "exec op X hosts nested client Y" so module-function calls back into the API are stitched under the exec that made them.

Current hook points:

  • dagql/cache.go getOrInitCall: one call op per caller (cache hits and do-not-cache calls included — this deliberately records what OTel suppresses), one shared call_exec op per actual resolver execution, wait edges from every caller to it, and a dagql.publishResult op for publication cost.
  • dagql/cache.go evaluateOne: one lazy op per lazy-evaluation run, classed by the call that created the lazy value, with wait edges from the triggering and joining ops.
  • engine/engineutil/executor.go: one exec op per container run, one exec_phase op per setup phase (exec.setupNetwork, exec.setupRootfs, ..., exec.runContainer), plus a split of exec.containerStart (engine overhead) vs exec.processRun (user work).
  • core/container_exec.go: cache-volume lock waits, mount preparation and output-commit phases, and the wait on the executor.
  • core/services.go: service_start ops with singleflight wait edges.
  • engine/server/session.go serveQuery: per-query session_phase ops (attachables wait, workspace load, module load, schema build, query).

Enabling and dumping

Two scopes:

  • Per-session: dagger --profile <anything> (hidden flag) sets ClientMetadata.Profile. Only that session's work is recorded — including its nested module/SDK clients — via a context mark applied by the session server and propagated to derived contexts. Caveat: if a profiled call joins in-flight identical work spawned by an unprofiled session, the join shows up as an unresolved wait (modeled by the analyzer as a fixed delay) since the other session's execution wasn't recorded.

  • Engine-global: record everything. Enable at startup with the engine's --wcprof flag (hidden) or _DAGGER_WCPROF=1, or at runtime by POSTing to the debug endpoint:

    curl -X POST -d on  http://<engine-debug-addr>/debug/wcprof/enabled
    curl -X POST -d off http://<engine-debug-addr>/debug/wcprof/enabled
    curl http://<engine-debug-addr>/debug/wcprof/enabled   # report state
    

    POSTing off stops engine-global recording but keeps buffered events dumpable, lets --profile sessions keep recording, and lets in-flight recorded ops finish; on re-enables on the same recorder (dumps from before and after merge cleanly since the epoch is unchanged).

_DAGGER_WCPROF_MAX_EVENTS=N overrides the event cap (default ~4M), read when the recorder is first created.

  • Dump: GET http://<engine-debug-addr>/debug/wcprof/dump streams a JSON header line (string tables, open ops) followed by one JSON event per line, and flushes the buffer (pass ?flush=0 to keep it).

Typical dev-engine workflow:

./hack/dev   # build + start dagger-engine.dev (publishes debug port 6060)
./hack/with-dev ./bin/dagger --profile call engine-dev container sync
curl -s http://localhost:6060/debug/wcprof/dump > /tmp/wcprof.dump
go run ./cmd/wcprof-analyze /tmp/wcprof.dump

Analysis (cmd/wcprof-analyze, engine/wcprof/wcanalyze)

The analyzer reconstructs the op graph (parents, waits, nested-client stitching) and reports:

  • what-if rankings (the headline): a replay-based discrete-event simulation re-executes the recorded schedule under "class X self-time × f" hypotheses (f ∈ {0, 0.5, 0.9} by default) and ranks classes by how much end-to-end makespan each would actually save. This accounts for critical- path shifts, dedup (singleflighted/lazy work counted once), and dependency chains — unlike naive "total time per class" tables.
  • per-class self-time tables (self = duration − waits − child intervals), with outcome counts and duplicate-execution detection.
  • the end-of-workload blocking chain.
  • dead air: trace gaps where no recorded op was running (= uninstrumented blocking, or client-side stalls).

Simulation assumptions (v1, deliberate): unlimited resources (never CPU-bound), recorded dependency structure is invariant under the hypothesis, waits on named resources (locks) are fixed delays. The simulated baseline makespan is reported against the actual makespan as a drift sanity check.

Status / caveats

This is a prototype for validating the approach:

  • there is one engine-wide recorder and event buffer; per-session enablement scopes what gets recorded, but all enabled sessions share the buffer and dumps (the analyzer can group by client/root, but the dump endpoint has no per-session filter).
  • events are kept until dumped; long sessions can hit the event cap (the dump and the analyzer both surface the drop count loudly).
  • leaf I/O ops (git fetch, image pull, filesync) are not yet instrumented; they currently show up as self-time of their calling op, and the "dead air" / unexplained-self-time reports are the guide for where to add hooks next.

Documentation

Overview

Package wcprof implements cheap wall-clock profiling for the engine.

It records operation intervals, wait (blocked-on) intervals, and link events into an in-memory buffer, for offline analysis of where wall-clock time goes and which operation classes are bottlenecks. It is intentionally not OTel: events are fixed-size structs appended to sharded in-memory buffers, with all strings interned. The expensive work (graph reconstruction, counterfactual simulation) happens offline after dumping the buffer via the engine debug endpoint.

Recording can be enabled two ways:

  • Engine-global: the _DAGGER_WCPROF environment variable, the engine's --wcprof flag, or POSTing "on" to the /debug/wcprof/enabled endpoint. All work on the engine is recorded until "off" is POSTed to the same endpoint.
  • Per-session: a client connecting with ClientMetadata.Profile set (the hidden --profile CLI flag). Only work attributable to that session (including its nested module/SDK clients) is recorded: the session server marks the session's contexts via ContextWithProfiling, and the mark propagates to derived contexts.

When profiling has never been enabled, all recording calls are a single atomic load + nil check.

Index

Constants

View Source
const DumpSchemaVersion = 1

DumpSchemaVersion identifies the dump wire format.

Variables

This section is empty.

Functions

func ContextWithOpID

func ContextWithOpID(ctx context.Context, opID uint64) context.Context

ContextWithOpID returns a context carrying the given op ID as the current op. Useful when handing work to a goroutine that should be attributed to an existing op.

func ContextWithProfiling

func ContextWithProfiling(ctx context.Context) context.Context

ContextWithProfiling marks ctx (and any context derived from it) for recording even when engine-global recording is off. The session server applies this to contexts of sessions that opted into profiling.

func CurrentOpID

func CurrentOpID(ctx context.Context) uint64

CurrentOpID returns the profiling op ID carried by ctx, or 0.

func DisableGlobal

func DisableGlobal()

DisableGlobal turns off engine-global recording. The recorder and its buffered events are retained for dumping, sessions that explicitly enabled profiling keep recording, and already-recording flows (contexts carrying a recorded op) run to completion.

func EnableGlobal

func EnableGlobal()

EnableGlobal turns on recording of all work on the engine.

func Enabled

func Enabled(ctx context.Context) bool

Enabled reports whether work running under ctx should be recorded. Call sites use it to skip computing op metadata (class/ident strings) when profiling is off. When profiling has never been enabled this is a single atomic load + nil check.

func EnsureRecorder

func EnsureRecorder()

EnsureRecorder creates the recorder if it does not exist yet, without turning on engine-global recording. Used when a session opts into profiling.

func GloballyEnabled

func GloballyEnabled() bool

GloballyEnabled reports whether engine-global recording is on.

func Link(ctx context.Context, kind LinkKind, fromOpID, targetOpID uint64, ident string, resultID uint64)

Link records a non-blocking correlation from fromOpID (0 means the current op in ctx) to an op, an interned ident, and/or a dagql result ID.

func NowNS

func NowNS() int64

NowNS returns the current recorder-relative timestamp, or 0 when profiling is disabled.

func ReadDump

func ReadDump(rd io.Reader) (*DumpHeader, []DumpEvent, error)

ReadDump parses a dump produced by WriteDump.

func RecordOp

func RecordOp(ctx context.Context, kind OpKind, class string, opts OpOpts, startNS, endNS int64, outcome Outcome) uint64

RecordOp records an already-completed op interval with explicit timestamps (from NowNS). Useful when an op's boundaries are observed from callbacks where holding an *Op handle would race. Returns the op ID, or 0 when profiling is disabled.

Types

type DumpEvent

type DumpEvent struct {
	Type     string `json:"e"`
	OpKind   string `json:"k,omitempty"`
	WorkType string `json:"w,omitempty"`
	Outcome  string `json:"o,omitempty"`
	Reason   string `json:"rs,omitempty"`
	LinkKind string `json:"lk,omitempty"`

	OpID     uint64 `json:"id,omitempty"`
	ParentID uint64 `json:"p,omitempty"`
	TargetID uint64 `json:"t,omitempty"`
	ResultID uint64 `json:"r,omitempty"`

	ClassID  uint32 `json:"c,omitempty"`
	IdentID  uint32 `json:"i,omitempty"`
	ClientID uint32 `json:"cl,omitempty"`

	StartNS int64 `json:"s"`
	EndNS   int64 `json:"d"`
}

DumpEvent is the JSON form of an Event. Short keys keep dumps compact.

type DumpHeader

type DumpHeader struct {
	SchemaVersion  int      `json:"schema_version"`
	EpochUnixNano  int64    `json:"epoch_unix_nano"`
	DumpedUnixNano int64    `json:"dumped_unix_nano"`
	DroppedEvents  uint64   `json:"dropped_events"`
	EventCount     int      `json:"event_count"`
	Strings        []string `json:"strings"`
	// OpenOps are ops begun but not ended at dump time (e.g. in-flight or
	// hung work).
	OpenOps []DumpOpenOp `json:"open_ops,omitempty"`
}

DumpHeader is the first line of a dump. The remaining lines are one DumpEvent JSON object per line.

type DumpOpenOp

type DumpOpenOp struct {
	OpID     uint64 `json:"op_id"`
	ParentID uint64 `json:"parent_id,omitempty"`
	Kind     string `json:"kind"`
	WorkType string `json:"work_type,omitempty"`
	ClassID  uint32 `json:"class_id,omitempty"`
	IdentID  uint32 `json:"ident_id,omitempty"`
	ClientID uint32 `json:"client_id,omitempty"`
	StartNS  int64  `json:"start_ns"`
}

DumpOpenOp describes an in-progress op at dump time.

type Event

type Event struct {
	Type     EventType
	OpKind   OpKind
	WorkType WorkType
	Outcome  Outcome
	Reason   WaitReason
	LinkKind LinkKind

	OpID     uint64
	ParentID uint64
	TargetID uint64
	ResultID uint64

	ClassID  uint32
	IdentID  uint32
	ClientID uint32

	StartNS int64
	EndNS   int64
}

Event is one fixed-size profiling record. Field meaning varies by Type:

  • Op: OpID/ParentID identify the op and its structural parent. StartNS and EndNS bound the interval. ClassID/IdentID/ClientID are interned strings. ResultID is the dagql shared result ID when known.
  • Wait: ParentID is the waiting op, TargetID the awaited op (or 0 when waiting on a named resource, in which case IdentID names it).
  • Link: ParentID is the from-op, TargetID an op (optional), IdentID an interned identifier (optional), ResultID a dagql result (optional).

type EventType

type EventType uint8

EventType discriminates the union in Event.

const (
	EventTypeInvalid EventType = iota
	// EventTypeOp is a completed operation interval.
	EventTypeOp
	// EventTypeWait is a completed wait interval: an op was blocked on
	// another op (or named resource) from Start to End.
	EventTypeWait
	// EventTypeLink is a non-blocking correlation between an op and another
	// op or an interned identifier.
	EventTypeLink
)

type LinkKind

type LinkKind uint8

LinkKind classifies link events.

const (
	LinkKindInvalid LinkKind = iota
	// LinkKindNestedClient: op (an exec) hosts a nested client whose ID is
	// the link's interned ident. Ops recorded for that client belong under
	// this exec.
	LinkKindNestedClient
	// LinkKindResult: op produced or returned the dagql result with
	// ResultID.
	LinkKindResult
	// LinkKindReusedResult: op was satisfied by reusing the dagql result
	// with ResultID (cache hit).
	LinkKindReusedResult
)

func (LinkKind) String

func (k LinkKind) String() string

type Op

type Op struct {
	// contains filtered or unexported fields
}

Op is a handle for an in-progress operation. A nil *Op is valid and all methods are no-ops, so call sites do not need to check whether profiling is enabled.

func BeginOp

func BeginOp(ctx context.Context, kind OpKind, class string, opts OpOpts) (context.Context, *Op)

BeginOp starts recording an operation. It returns a derived context that carries the new op as the current op (so nested ops and waits parent to it), and a handle to end it. When profiling is disabled it returns ctx unchanged and a nil handle.

func (*Op) End

func (op *Op) End(outcome Outcome)

End records the op's completion.

func (*Op) EndErr

func (op *Op) EndErr(err error)

EndErr records completion with OutcomeOK or OutcomeError based on err.

func (*Op) EndWithResult

func (op *Op) EndWithResult(outcome Outcome, resultID uint64)

EndWithResult records the op's completion, associating it with a dagql shared result ID when non-zero.

func (*Op) ID

func (op *Op) ID() uint64

ID returns the op's ID, or 0 for a nil handle.

func (*Op) OutcomeHint

func (op *Op) OutcomeHint() Outcome

OutcomeHint returns the hint set by SetOutcomeHint, or OutcomeNone.

func (*Op) SetIdent

func (op *Op) SetIdent(ident string)

SetIdent updates the op's instance identity (useful when it only becomes known after the op began, e.g. a recipe digest derived mid-call). Must be called from the goroutine that owns the op, before End.

func (*Op) SetOutcomeHint

func (op *Op) SetOutcomeHint(outcome Outcome)

outcomeHint carries an outcome decided mid-op (e.g. joined vs executed), read back by the code that ends the op.

type OpKind

type OpKind uint8

OpKind classifies operations.

const (
	OpKindInvalid OpKind = iota
	// OpKindCall is one dagql GetOrInitCall invocation (per caller,
	// including cache hits and singleflight joiners).
	OpKindCall
	// OpKindCallExec is the shared execution of a call's resolver function.
	// Singleflighted callers all wait on one of these.
	OpKindCallExec
	// OpKindLazy is one run of a lazy evaluation callback for a result.
	OpKindLazy
	// OpKindExecPhase is a setup/run phase of a container exec
	// (e.g. exec.setupNetwork, exec.runContainer).
	OpKindExecPhase
	// OpKindExec is the overall run of a container by the executor.
	OpKindExec
	// OpKindServiceStart is the start (incl. health check) of a service.
	OpKindServiceStart
	// OpKindSessionPhase is a per-query session serving phase
	// (e.g. session.workspaceLoad, session.query).
	OpKindSessionPhase
	// OpKindIO is a leaf I/O operation (git fetch, image pull, filesync...).
	OpKindIO
	// OpKindInternal is engine-internal background work (gc, persistence).
	OpKindInternal
)

func (OpKind) String

func (k OpKind) String() string

type OpOpts

type OpOpts struct {
	// Ident is an instance identity for the op (e.g. recipe digest, exec ID).
	Ident string
	// ClientID is the dagger client this op serves, when known.
	ClientID string
	// ResultID may be set at Begin time when already known.
	WorkType WorkType
}

OpOpts carries optional per-op metadata.

type Outcome

type Outcome uint8

Outcome describes how an op completed.

const (
	OutcomeNone Outcome = iota
	// OutcomeHit: dagql call satisfied from cache.
	OutcomeHit
	// OutcomeExecuted: dagql call missed cache and this caller spawned the
	// execution.
	OutcomeExecuted
	// OutcomeJoined: dagql call missed cache and joined an in-flight
	// execution started by another caller.
	OutcomeJoined
	// OutcomeDoNotCache: dagql call executed inline without caching.
	OutcomeDoNotCache
	// OutcomeOK: generic success for non-call ops.
	OutcomeOK
	// OutcomeError: op failed.
	OutcomeError
	// OutcomeCanceled: op canceled.
	OutcomeCanceled
)

func (Outcome) String

func (o Outcome) String() string

type Recorder

type Recorder struct {
	// contains filtered or unexported fields
}

Recorder collects profiling events. Safe for concurrent use.

func Active

func Active() *Recorder

Active returns the recorder, or nil if profiling was never enabled. The recorder outlives DisableGlobal so buffered events can still be dumped; use GloballyEnabled/Enabled to ask whether work is being recorded.

func NewRecorder

func NewRecorder(maxEvents int64) *Recorder

NewRecorder returns a recorder with the given event cap (<=0 means default).

func (*Recorder) Intern

func (r *Recorder) Intern(s string) uint32

Intern interns s and returns its table ID.

func (*Recorder) Now

func (r *Recorder) Now() int64

Now returns the current recorder-relative timestamp in nanoseconds.

func (*Recorder) WriteDump

func (r *Recorder) WriteDump(w io.Writer, flush bool) error

WriteDump streams the recorder state to w: one DumpHeader JSON line, then one DumpEvent JSON line per event. When flush is true, recorded events are removed from the buffer (the string table, open ops, and op ID counter are retained so later dumps remain consistent).

type Wait

type Wait struct {
	// contains filtered or unexported fields
}

Wait is a handle for an in-progress wait interval. A nil *Wait is valid and End is a no-op.

func BeginWait

func BeginWait(ctx context.Context, targetOpID uint64, reason WaitReason) *Wait

BeginWait starts recording that the current op (from ctx) is blocked on targetOpID for the given reason. Returns nil when profiling is disabled.

func BeginWaitIdent

func BeginWaitIdent(ctx context.Context, ident string, reason WaitReason) *Wait

BeginWaitIdent is BeginWait for waits on a named resource (e.g. a lock) rather than another op.

func (*Wait) End

func (w *Wait) End()

End records the wait interval.

func (*Wait) SetTarget

func (w *Wait) SetTarget(targetOpID uint64)

SetTarget updates the awaited op (useful when the target becomes known only after the wait began).

type WaitReason

type WaitReason uint8

WaitReason describes why an op was blocked.

const (
	WaitReasonInvalid WaitReason = iota
	// WaitReasonCallExec: blocked waiting for a call's resolver execution
	// (the caller that spawned it).
	WaitReasonCallExec
	// WaitReasonSingleflight: blocked joining another caller's in-flight
	// execution of the same call.
	WaitReasonSingleflight
	// WaitReasonLazy: blocked waiting for a lazy evaluation to finish.
	WaitReasonLazy
	// WaitReasonService: blocked waiting for a service to start/be healthy.
	WaitReasonService
	// WaitReasonLock: blocked acquiring a lock (target is an interned
	// resource name, not an op).
	WaitReasonLock
	// WaitReasonExec: blocked waiting for a container exec to finish.
	WaitReasonExec
	// WaitReasonIO: blocked on external I/O.
	WaitReasonIO
)

func (WaitReason) String

func (r WaitReason) String() string

type WorkType

type WorkType uint8

WorkType is a coarse attribution of an op's self-time, used by analysis to separate actionable engine overhead from user workload time and external I/O time.

const (
	WorkTypeEngine WorkType = iota
	// WorkTypeUser is time running user-controlled work (e.g. the container
	// process itself).
	WorkTypeUser
	// WorkTypeExternal is time bound on external systems (registry pulls,
	// git remotes, host filesync...).
	WorkTypeExternal
)

func (WorkType) String

func (w WorkType) String() string

Directories

Path Synopsis
Package wcanalyze reconstructs an operation graph from a wcprof dump and runs offline wall-clock bottleneck analysis over it: self-time accounting, a replay-based counterfactual simulator, and per-class what-if rankings.
Package wcanalyze reconstructs an operation graph from a wcprof dump and runs offline wall-clock bottleneck analysis over it: self-time accounting, a replay-based counterfactual simulator, and per-class what-if rankings.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL