buildpipe

package
v0.0.0-...-85391f8
Warning

This package is not in the latest version of its module.

Published: May 13, 2026 License: AGPL-3.0 Imports: 30 Imported by: 0

Documentation

Overview

Package buildpipe — cache.go implements the A3 file-level incremental cache (spec v0.2 § 4 Phase 1). Cache key composition, manifest diffing (hit/miss/removed classification) and parser-version derivation live here.

Phase 1 scope: per-file SHA256 + cache key, skip parse on hit, full Pass 2 re-run, full PageRank/Leiden recompute when ANY file is dirty. Phase 2 (reverse-reference invalidation, partial Pass 2) is C1's job.

Package buildpipe — incremental.go drives the A3 file-level cache build path (spec v0.2 § 4 Phase 1). Two entry points:

  • runCold: full rebuild (legacy V0 path, used on --no-cache or unusable cache).
  • runIncremental: parse only dirty files, reload cached node sets from DB, then rerun Pass 2 / cluster / score across the merged graph.

Phase 1 simplifications (per spec, deferred to C1+):

  • PageRank/Leiden recompute on ANY dirt (no <1% change-ratio shortcut).
  • Cross-language Sol↔TS link rebuilt whenever any TS or Sol file is dirty.
  • Reverse-reference index for partial Pass 2 invalidation: NOT implemented (Phase 2, C1's job). Pass 2 Resolve always sees the full per-language node set.

Package buildpipe — language_runners.go contains the per-language Pass 1 + Pass 2 driver functions (runGoPipeline, runTSPipeline, runSolPipeline) and their immediate helpers (stampFilePath, convertABI). Extracted from pipeline.go in G4 to keep the orchestrator file under the soft 400-line cap. Pure file move — no behavior change.

Package buildpipe — lock_propagation.go implements D1 Stage B (W-A, Within-language semantics Phase 5): cross-function lock propagation for the Go concurrency pass. Extends the existing intra-function accessed_under_lock detector (internal/parse/golang/concurrency_underlock.go) by walking the call graph from lock-holding functions into their callees and emitting accessed_under_lock(field, mutex) edges for fields touched inside reachable callee bodies.

Spec reference: docs/design/go-cross-function-lock-propagation.md (decisions resolved 2026-05-11, §5.0 — Stage B DFS depth=5, INFERRED confidence for all cross-function emits, calls+invokes traversal, goroutine bodies forced to INFERRED, opt-in flag, dedup with confidence priority).

Opt-in only: gated by Options.LockPropagation (CLI: --lock-propagation, default false). When the flag is off, the pass is a structural no-op — the existing intra-function B1 Phase 4 emit is unchanged.
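The depth-bounded traversal described above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the `propagate` function, the `lockEdge` struct, and the map-based call graph are hypothetical stand-ins, and only the depth limit of 5 and the INFERRED confidence come from the text. The sketch tracks the shallowest depth at which each callee was reached, so a function first seen at the depth limit is still expanded if a shorter path to it exists.

```go
package main

import (
	"fmt"
	"sort"
)

// maxDepth mirrors the documented Stage B DFS depth of 5 hops.
const maxDepth = 5

// lockEdge is a hypothetical stand-in for an accessed_under_lock(field, mutex) edge.
type lockEdge struct {
	Field, Mutex, Confidence string
}

// propagate walks calls (caller -> callees) from holder, a function that holds
// mutex, and returns one INFERRED edge per distinct field touched by a callee
// reachable within maxDepth hops. touches maps a function to the fields it accesses.
func propagate(calls, touches map[string][]string, holder, mutex string) []lockEdge {
	best := map[string]int{holder: 0} // shallowest depth seen per function
	var visit func(fn string, depth int)
	visit = func(fn string, depth int) {
		if depth == maxDepth {
			return
		}
		for _, callee := range calls[fn] {
			d := depth + 1
			if prev, ok := best[callee]; ok && prev <= d {
				continue // already reached at least as shallowly
			}
			best[callee] = d
			visit(callee, d)
		}
	}
	visit(holder, 0)

	seen := map[string]bool{}
	var edges []lockEdge
	for fn := range best {
		if fn == holder {
			continue // the holder's own body is the intra-function detector's job
		}
		for _, field := range touches[fn] {
			if !seen[field] {
				seen[field] = true
				edges = append(edges, lockEdge{field, mutex, "INFERRED"})
			}
		}
	}
	sort.Slice(edges, func(i, j int) bool { return edges[i].Field < edges[j].Field })
	return edges
}

func main() {
	calls := map[string][]string{"Put": {"store"}, "store": {"bump"}}
	touches := map[string][]string{"store": {"Cache.items"}, "bump": {"Cache.hits"}}
	for _, e := range propagate(calls, touches, "Put", "Cache.mu") {
		fmt.Println(e.Field, e.Mutex, e.Confidence)
	}
}
```

The confidence-priority dedup mentioned in the spec reference is not modelled here; the sketch only dedups by field name.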

Package buildpipe orchestrates the full Pass 1..4 build (spec §4.7): detect → parse → resolve → graph build/validate → cluster → score → persist. Three routing paths: cold rebuild, short-circuit (all-cached), and incremental (partial-hit — reuse cached files, re-parse dirty). See Run for routing logic.

Package buildpipe — staleness.go computes the manifest's staleness fingerprint. Prefers a path-aware git lookup (so unrelated commits don't flip the banner — see internal/server/staleness.go for the symmetrical serve-side comparison); falls back to mtime sum of up to 5 detected files when the source root isn't a git checkout. Extracted from pipeline.go in G4. Pure file move — no behavior change.

Package buildpipe — temporal.go wires CKS G6 Temporal edges (E4) into the cold-rebuild path. Conceptually:

  • Run a single `git log --raw` over the repo root containing srcRoot.
  • Translate repo-rooted paths into srcRoot-relative slash paths so they align with the rel paths the parsers stamped on Node.FilePath.
  • For every distinct commit, append a NodeCommit (one per SHA).
  • For every (file, commit) the log surfaces, emit `changed_in` from EVERY symbol in that file → that commit (file-level heuristic; line-level blame is deferred; see EdgeChangedIn doc-comment).
  • For every file, emit ONE `blame` edge from its File node → its most recent commit (V0 simplification of `file:line → commit`).

Skips silently (no error) when srcRoot isn't inside a git checkout, so non-git source trees still build cleanly without temporal edges.
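The path translation step in the list above (repo-rooted paths to srcRoot-relative slash paths) might look like the sketch below. The function name `toSrcRel` and the `srcPrefix` parameter are assumptions made for illustration; the real helper and its signature are not shown in this documentation. Files outside srcRoot are reported as not matching so the caller can skip them.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// toSrcRel converts a repo-rooted slash path (as git prints it) into a
// srcRoot-relative slash path. srcPrefix is srcRoot expressed relative to the
// repo root, in slash form; "" or "." means srcRoot is the repo root itself.
// The boolean is false when repoPath lies outside srcRoot.
func toSrcRel(repoPath, srcPrefix string) (string, bool) {
	repoPath = path.Clean(repoPath)
	if srcPrefix == "" || srcPrefix == "." {
		return repoPath, true
	}
	// Require a full path-segment match so "services/apix" does not
	// match the prefix "services/api".
	prefix := strings.TrimSuffix(path.Clean(srcPrefix), "/") + "/"
	if !strings.HasPrefix(repoPath, prefix) {
		return "", false
	}
	return strings.TrimPrefix(repoPath, prefix), true
}

func main() {
	rel, ok := toSrcRel("services/api/handlers/user.go", "services/api")
	fmt.Println(rel, ok)
	_, ok = toSrcRel("docs/readme.md", "services/api")
	fmt.Println(ok)
}
```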

Package buildpipe — temporal_hunks.go wires the CKS G6 Hunk-graph H1 stage (schema 1.8) on top of emitTemporalEdges. Per design docs/design/hunk-graph.md (decisions finalised 2026-05-09):

  • Encoding (§11.1): gzip stdlib, ~70% size reduction on diff text.
  • Dedup (§11.2): none in H1; keep chronology of rebased hunks.
  • Reach (§11.3): H1 only collects HEAD-reachable hunks (Confidence='EXTRACTED'). A future PR adds unreachable collection (Confidence='AMBIGUOUS') via reflog/fsck — H3's EvidencePack assembler MUST filter to EXTRACTED so the LLM never sees force-pushed-away code paths.
  • Lang (§11.4): hunk inherits its target file's extension when in {go, ts, sol}; everything else becomes 'git'.
  • Cap (§11.6): 64KB patch cap. Larger patches are stored as first 32KB + truncation marker + last 32KB. Compression is applied AFTER truncation.
  • Manifest (§11.8): Hunk node IDs are NOT recorded in the per-file manifest entries (they live outside file-level cache invalidation; emitTemporalEdges runs them wholesale on every build). isMetaNodeType is the single source of truth that buildFileEntries + computeColdFileEntries + extractBlobs share.
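The Cap and Encoding rules above (truncate to first 32KB + marker + last 32KB, then gzip) can be sketched as follows. The helper names and the marker text are hypothetical; the documentation does not give the actual marker string. Only the 64KB cap, the 32KB halves, and the truncate-then-compress order come from the text.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

const (
	patchCap = 64 * 1024 // patches above this size are truncated
	keepHalf = 32 * 1024 // bytes kept from each end of an oversized patch
)

// truncationMarker is a placeholder; the real marker text is not documented here.
const truncationMarker = "\n... [patch truncated] ...\n"

// truncatePatch returns patch unchanged when it fits the cap, otherwise the
// first and last keepHalf bytes joined by the marker.
func truncatePatch(patch []byte) []byte {
	if len(patch) <= patchCap {
		return patch
	}
	out := make([]byte, 0, 2*keepHalf+len(truncationMarker))
	out = append(out, patch[:keepHalf]...)
	out = append(out, truncationMarker...)
	out = append(out, patch[len(patch)-keepHalf:]...)
	return out
}

// encodePatch truncates first and compresses second, matching the stated order.
func encodePatch(patch []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(truncatePatch(patch)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	big := bytes.Repeat([]byte("+added line\n"), 10000) // 120000 bytes, over the cap
	blob, err := encodePatch(big)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(big), len(truncatePatch(big)), len(blob))
}
```

Truncating before compressing keeps the stored blob bounded by the cap regardless of how well the diff compresses.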

Constants

const SchemaVersion = "1.10"

SchemaVersion is the cache-key contributor for the extraction schema. Bump history:

  • "1.1" → "1.2" by A3: FK ON DELETE CASCADE on edges/blobs/pkg_tree/topic_tree.
  • "1.2" → "1.3" by E3: new node kinds (Endpoint, MessageType) and edge kinds (listens_on, handles_message, rpc_calls) materially change the extraction surface. Pre-1.3 DBs are missing those rows, so incremental invalidation must force a cold rebuild on first run with this binary.
  • "1.3" → "1.4" by E4 (CKS G6 Temporal): NodeCommit + changed_in/blame edges are emitted by the new post-Build temporal pass; pre-1.4 DBs are missing those rows.
  • "1.4" → "1.5" by G6 v3 (pending_refs persistence): Pass 1's unresolved cross-file references are now persisted per-file so the partial-cache rebuild path can reconstruct Pass 2's input without re-parsing cached files. Pre-1.5 DBs are missing the table, so the first 1.5 build is forced cold by ManifestUsable's version check.
  • "1.5" → "1.6" by P2 (CKS G3 control-flow context propagation): timeout_path / cancellation_path self-loop edges are emitted from Go context.With* call sites; pre-1.6 DBs are missing those rows, so the first 1.6 build must run cold.
  • "1.6" → "1.7" by Track C (detector gap fill): the edges row gains an optional `dispatch_kind` TEXT column populated for `invokes` edges (P1b), plus three new emit sites: `uses_type` (P0), `instantiates` (P1c), and the lock-edge fix inside goroutine bodies (P1a). Pre-1.7 DBs are missing the column AND the new edges; opening such a DB triggers an idempotent ALTER ADD COLUMN via Migrate(), and ManifestUsable's version check forces a cold rebuild on first 1.7 run so the new edges land in their natural emission order.
  • "1.7" → "1.8" by Hunk-graph H1 (CKS G6 Temporal extension): new node type NodeHunk + new edges has_hunk / adjacent + gzip-compressed unified-diff blobs persisted under the existing blobs.node_id PK. No schema DDL change (the new rows reuse existing tables); pre-1.8 DBs are missing the rows + the new node/edge type literals, so ManifestUsable's version check forces a cold rebuild on first 1.8 run.
  • "1.8" → "1.9" by W1 of schema-1.9-spec (cross-language interop expansion): TypeScript HTTP server endpoint detection (Express / Fastify / Hono / Next.js App Router). Reuses the existing NodeEndpoint type + `listens_on` edge, with no new enum literals and no new columns. The bump is purely a cache-key contributor so pre-1.9 DBs don't carry forward a missing-Endpoint TS graph view on first 1.9 build. Per §6.1 of the design spec, future W2/W3/W4 stages (HTTP client matching, gRPC, message queue) will stay on 1.9 and append-only.
  • "1.9" → "1.10" by within-language semantics Phase 4 (2026-05-11): slot reservation for W-B (`NodeAwaitPoint` + `EdgeAwaits`, TS async/await suspension flow) and W-C (`EdgeOverrides`, Solidity virtual/override semantics). Detectors land in Phase 5; this commit is slot-only, so pre-1.10 DBs are byte-identical in their existing rows, but the cache key flip forces a cold rebuild on first 1.10 run for symmetry with prior bumps. No new DDL (the new enum literals fit existing nodes.type / edges.type TEXT columns); see docs/DISPATCH-WITHIN-LANG-SEMANTICS.md §2 Phase 4 and docs/design/{ts-async-await-and-interface,solidity-inheritance-and-interface-dispatch}.md.

Kept here (not in pkg/types) because only the cache key needs it; pkg/types schema version bumps already trigger rebuilds via this constant.

Variables

This section is empty.

Functions

func ComputeCacheKey

func ComputeCacheKey(content []byte, ckgVersion, parserVersion string) string

ComputeCacheKey returns the SHA256 of:

	file_content + "|ckg:" + ckgVersion + "|parser:" + parserVersion + "|schema:" + SchemaVersion

Any change in the four contributors invalidates the cache for that file and forces a reparse on next build (spec v0.2 § 4 design).
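As a rough illustration of the key composition above, the following sketch hashes the four contributors in the documented order. The lowercase `computeCacheKey` and `schemaVersion` are local stand-ins, and hex encoding of the digest is an assumption (the signature only says the result is a string).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// schemaVersion mirrors the documented SchemaVersion constant.
const schemaVersion = "1.10"

// computeCacheKey hashes the file content followed by the three tagged
// version strings. Assumption: the digest is returned hex-encoded.
func computeCacheKey(content []byte, ckgVersion, parserVersion string) string {
	h := sha256.New()
	h.Write(content)
	h.Write([]byte("|ckg:" + ckgVersion))
	h.Write([]byte("|parser:" + parserVersion))
	h.Write([]byte("|schema:" + schemaVersion))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	src := []byte("package x\n")
	a := computeCacheKey(src, "0.2.0", "go-1")
	b := computeCacheKey(src, "0.2.0", "go-2") // only the parser version differs
	fmt.Println(len(a), a != b)                // prints "64 true"
}
```

Because every contributor feeds the same digest, bumping any one of them changes the key for every file at once, which is what forces the reparse.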

func ManifestUsable

func ManifestUsable(old *persist.Manifest, ckgVersion string) bool

ManifestUsable reports whether old can be used as the cache base for a build under (ckgVersion, SchemaVersion). Returns false when the global invariants drifted — caller must discard the cache and full-rebuild (silent reuse with stale schema would corrupt the DB).

nil manifest → false. Empty schema/ckg version → false (defensive).
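The rules above amount to a small guard function. The sketch below uses a hypothetical `manifest` struct with `SchemaVersion` and `CKGVersion` fields, since the real persist.Manifest layout is not shown here, and it assumes "empty version" covers both the stored values and the ckgVersion argument.

```go
package main

import "fmt"

// schemaVersion mirrors the documented SchemaVersion constant.
const schemaVersion = "1.10"

// manifest is a hypothetical stand-in for persist.Manifest; the field names
// are assumptions made for this sketch.
type manifest struct {
	SchemaVersion string
	CKGVersion    string
}

// manifestUsable reports whether old may serve as the cache base for a build
// under (ckgVersion, schemaVersion).
func manifestUsable(old *manifest, ckgVersion string) bool {
	if old == nil {
		return false
	}
	// Defensive: an empty version on either side never counts as a match.
	if old.SchemaVersion == "" || old.CKGVersion == "" || ckgVersion == "" {
		return false
	}
	return old.SchemaVersion == schemaVersion && old.CKGVersion == ckgVersion
}

func main() {
	fmt.Println(manifestUsable(nil, "0.2.0"))
	fmt.Println(manifestUsable(&manifest{"1.9", "0.2.0"}, "0.2.0"))
	fmt.Println(manifestUsable(&manifest{"1.10", "0.2.0"}, "0.2.0"))
}
```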

func Run

func Run(opt Options) (persist.Manifest, error)

Run executes the full pipeline. Side effects: writes OutDir/graph.db and OutDir/manifest.json. Returns the persisted Manifest summary so the caller can print stats without re-reading SQLite.

Cache routing (A3 Phase 1):

  • --no-cache OR no prior manifest OR schema/version mismatch → cold rebuild
  • all-cached AND no removals → short-circuit (timestamp refresh only)
  • mixed dirty/cached → incremental (parse only dirty, reuse cached node sets)
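The three routing rules above reduce to a short decision function. This is a simplified sketch, not the body of Run: `chooseRoute` and the `route` type are hypothetical, the RebuildMetrics option is ignored, and treating a build with removals but no dirty files as incremental is an assumption the text does not spell out.

```go
package main

import "fmt"

type route string

const (
	routeCold         route = "cold"
	routeShortCircuit route = "short-circuit"
	routeIncremental  route = "incremental"
)

// chooseRoute picks a build path from the cache state. manifestUsable is
// false when there is no prior manifest or its schema/version drifted.
func chooseRoute(noCache, manifestUsable bool, misses, removed int) route {
	switch {
	case noCache || !manifestUsable:
		return routeCold
	case misses == 0 && removed == 0:
		return routeShortCircuit // nothing changed: timestamp refresh only
	default:
		// Assumption: any dirt or removal with a usable manifest goes incremental.
		return routeIncremental
	}
}

func main() {
	fmt.Println(chooseRoute(true, true, 0, 0))  // cold
	fmt.Println(chooseRoute(false, true, 0, 0)) // short-circuit
	fmt.Println(chooseRoute(false, true, 3, 0)) // incremental
}
```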

func SHA256Hex

func SHA256Hex(content []byte) string

SHA256Hex returns the hex SHA256 of content. Exposed separately because FileEntry.SHA256 stores content-only hash (used for the "mtime changed but content identical" fast/slow path), distinct from the full cache key.

Types

type CacheDecisions

type CacheDecisions struct {
	Decisions []FileDecision
	Hits      int
	Misses    int
	Removed   int
}

CacheDecisions is the sorted, fully-classified result of one diff pass. Sorted by Path for deterministic logging.

func DiffManifest

func DiffManifest(srcRoot string, discovered []DiscoveredFile, old *persist.Manifest, ckgVersion string) (CacheDecisions, error)

DiffManifest classifies every discovered file against the OLD manifest and emits a CacheDecisions in deterministic Path order. Files in old but not in the discovery are emitted as classRemoved.

Fast/slow path (spec § 4 build flow): mtime-equal entries skip the SHA256 recomputation and reuse the old hash; mtime-mismatched entries fall through to a full hash. The resulting cache decision is the same whether or not mtime changed, because it depends only on content and versions; mtime is purely a performance hint.
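A minimal sketch of the fast/slow split, assuming a hypothetical `oldEntry` record holding the previous hash and mtime (the real persist.FileEntry fields are not shown here). The file is only read when the mtime differs from the recorded one.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// oldEntry is a hypothetical stand-in for the previous manifest's record.
type oldEntry struct {
	SHA256 string
	MTime  int64
}

// contentHash returns the content hash for a file. When mtime matches the old
// entry it reuses the stored hash (fast path); otherwise it calls read and
// hashes the bytes (slow path).
func contentHash(old *oldEntry, mtime int64, read func() ([]byte, error)) (string, error) {
	if old != nil && old.MTime == mtime {
		return old.SHA256, nil
	}
	data, err := read()
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	reads := 0
	read := func() ([]byte, error) { reads++; return []byte("package x\n"), nil }

	first, _ := contentHash(nil, 100, read)                        // slow: no old entry
	again, _ := contentHash(&oldEntry{first, 100}, 100, read)      // fast: mtime equal
	touched, _ := contentHash(&oldEntry{first, 100}, 200, read)    // slow: mtime changed
	fmt.Println(reads, first == again, first == touched)           // prints "2 true true"
}
```

In the last call the file was touched but not edited, so the recomputed hash equals the stored one and the file would still classify as cached.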

func (CacheDecisions) CachedPaths

func (cd CacheDecisions) CachedPaths() []string

CachedPaths returns the srcRoot-relative paths of files that were cache hits, in sorted order. The caller uses these to load nodes/edges from the DB.

func (CacheDecisions) DirtyPaths

func (cd CacheDecisions) DirtyPaths() []string

DirtyPaths returns the srcRoot-relative paths of files needing reparse, in the discovery order they were emitted (deterministic).

func (CacheDecisions) FormatLogLine

func (cd CacheDecisions) FormatLogLine() string

FormatLogLine returns a single human-readable summary line. Stable phrasing so operator runbooks can grep for it.

func (CacheDecisions) IsAllCached

func (cd CacheDecisions) IsAllCached() bool

IsAllCached returns true when every decision is classCached and there are no removals. Used by the build pipeline to short-circuit Pass 2 / metrics when nothing actually changed.

func (CacheDecisions) RemovedPaths

func (cd CacheDecisions) RemovedPaths() []string

RemovedPaths returns the srcRoot-relative paths that were in the old manifest but are not in the current discovery. Caller deletes their data.

type DiscoveredFile

type DiscoveredFile struct {
	Path     string
	Language string
}

DiscoveredFile describes one file produced by detect.* — used as input to the diff. Path is srcRoot-relative slash form.

type FileDecision

type FileDecision struct {
	Path     string
	Language string
	Class    fileClass
	// Populated for classDirty/classCached:
	SHA256        string
	CacheKey      string
	MTime         int64
	ParserVersion string
	// Populated for classCached only — the matching entry from the OLD
	// manifest, so the caller can pull NodeIDs out for reload.
	Cached *persist.FileEntry
}

FileDecision is the cache decision for one file in the current discovery. For classRemoved the Path comes from the OLD manifest (file is gone) and Language/SHA256/CacheKey/MTime are zero.
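The three classes can be illustrated with a simplified classifier over path-to-cache-key maps. Everything here is a sketch under stated assumptions: `classify`, `decision`, and the `class` constants are hypothetical names, and treating a hit as "the current cache key equals the key recorded in the old manifest" is an assumption about how DiffManifest compares entries.

```go
package main

import (
	"fmt"
	"sort"
)

type class string

const (
	dirty   class = "dirty"
	cached  class = "cached"
	removed class = "removed"
)

type decision struct {
	Path  string
	Class class
}

// classify compares current (path -> cache key for discovered files) against
// old (path -> cache key from the previous manifest) and returns one decision
// per path, sorted by Path for deterministic output.
func classify(current, old map[string]string) []decision {
	var out []decision
	for p, key := range current {
		if oldKey, ok := old[p]; ok && oldKey == key {
			out = append(out, decision{p, cached})
		} else {
			out = append(out, decision{p, dirty}) // new file or changed key
		}
	}
	for p := range old {
		if _, ok := current[p]; !ok {
			out = append(out, decision{p, removed}) // in old manifest, gone now
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Path < out[j].Path })
	return out
}

func main() {
	old := map[string]string{"a.go": "k1", "b.go": "k2", "c.go": "k3"}
	current := map[string]string{"a.go": "k1", "b.go": "k2-changed", "d.go": "k4"}
	for _, d := range classify(current, old) {
		fmt.Println(d.Path, d.Class)
	}
}
```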

type Options

type Options struct {
	SrcRoot    string
	OutDir     string
	Languages  []string // {"auto"} | subset of {"go","ts","sol"}
	Logger     *slog.Logger
	CKGVersion string
	// NoCache forces a full rebuild — bypasses the A3 incremental cache and
	// wipes graph.db at start. Use when the cache is suspect, or for clean
	// benchmark runs.
	NoCache bool
	// RebuildMetrics forces PageRank/Leiden recompute even when the cache
	// would otherwise reuse them. Phase 1 ALWAYS recomputes when any file
	// is dirty (see Run below) — this flag is the explicit operator escape
	// for the "all-cached but I want fresh metrics" case.
	RebuildMetrics bool
	// DBDSN is an optional PostgreSQL DSN (e.g. "postgres://user:pass@host/db").
	// When set, the build persists to PostgreSQL instead of a local SQLite file.
	// OutDir is still used for manifest.json; --no-cache and incremental work the
	// same way (NodesByFilePath reads from PG with ORDER BY start_line).
	DBDSN string
	// StrictValidate, when true, fails the build on the first dangling edge or
	// schema violation (legacy v0.x behaviour). Default false: dangling edges
	// are dropped with a warning, schema violations still abort. Lenient mode
	// is required for dogfooding self-analysis, where parser bugs would
	// otherwise prevent graph.db from being written and block measurement.
	StrictValidate bool
	// FilesFromPath is the optional path to a JSON include/exclude filter
	// (see internal/filterlist). When set, only files matching the filter
	// reach the parsers. Empty means "use heuristic discovery as before".
	FilesFromPath string
	// LockPropagation enables D1 Stage B cross-function lock propagation
	// (W-A, Within-language semantics Phase 5). When true, the cold build
	// path walks the Go call graph from every lock-holding function up to
	// lockPropagationMaxDepth=5 hops and emits accessed_under_lock(field,
	// mutex) edges for fields touched in reachable callee bodies. Default
	// false (opt-in per W-A §5.0 Q5) so existing builds are byte-identical.
	// Incremental cache path skips propagation regardless of this flag —
	// run with --no-cache when the flag is on to measure full effect.
	// Spec: docs/design/go-cross-function-lock-propagation.md.
	LockPropagation bool
}

Options controls one ckg build invocation.

type TopicTreeForPersist

type TopicTreeForPersist = persist.TopicTreeInput

TopicTreeForPersist re-exposes persist.TopicTreeInput under a buildpipe-local alias so persistIncrementalArtifacts can take it as a typed param without leaking the persist package detail to every caller.
