persist

package

Version: v0.0.0-...-2c871a6 | Published: Jun 30, 2026 | License: AGPL-3.0 | Imports: 18 | Imported by: 0

Repository

github.com/0xmhha/code-knowledge-graph

Documentation ¶

Overview ¶

Package persist: node_attrs.go (W-C W11 V7, 2026-05-19) bridges the per-node JSON-blob `attrs` column to the typed fields on types.Node that have no dedicated SQLite column. Adding a new marker to types.Node requires only a new field in the nodeAttrs struct; the SQLite schema does not change.

Marshalling rule: every nodeAttrs field is annotated `,omitempty`, so a Node with no markers serialises to `{}` (or NULL if the column was never touched) and takes minimal space. Empty attrs strings produced by old (pre-1.9) writers parse back to the zero-valued struct, which makes incremental DB upgrades safe.
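
The marshalling rule can be illustrated with a minimal sketch. The struct below is a hypothetical stand-in for the unexported nodeAttrs type: its field names and the helpers encodeAttrs and decodeAttrs are invented for illustration. Treating an empty string as "no markers" before calling json.Unmarshal is an assumption about how the package handles old rows, since json.Unmarshal itself rejects empty input.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeAttrsSketch is a hypothetical stand-in for the unexported nodeAttrs
// struct. The field names are invented for illustration.
type nodeAttrsSketch struct {
	IsTest     bool   `json:"is_test,omitempty"`
	IsExported bool   `json:"is_exported,omitempty"`
	Receiver   string `json:"receiver,omitempty"`
}

// encodeAttrs serialises the markers. A struct with no markers set yields "{}".
func encodeAttrs(a nodeAttrsSketch) (string, error) {
	b, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// decodeAttrs treats an empty column value as "no markers" (assumed handling)
// and otherwise decodes the JSON blob.
func decodeAttrs(raw string) (nodeAttrsSketch, error) {
	var a nodeAttrsSketch
	if raw == "" {
		return a, nil
	}
	err := json.Unmarshal([]byte(raw), &a)
	return a, err
}

func main() {
	empty, _ := encodeAttrs(nodeAttrsSketch{})
	fmt.Println(empty) // {}

	marked, _ := encodeAttrs(nodeAttrsSketch{IsTest: true})
	fmt.Println(marked) // {"is_test":true}

	old, _ := decodeAttrs("")
	fmt.Println(old == nodeAttrsSketch{}) // true
}
```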

Package persist defines storage interfaces (StoreReader / StoreWriter / Store) and a SQLite implementation. Consumers should depend on the interfaces, not the concrete sqliteStore — this is the foundation for future backends (e.g. PostgreSQL — see docs/spec-ckg-v0.2.md §3, scheduled for B2 in docs/WORK-PLAN.md).

The interfaces are split by role (Interface Segregation Principle):

StoreReader: read-only surface used by serve / mcp / eval / audit.
StoreWriter: write surface used by buildpipe (full lifecycle).
Store: composition of both, for callers that need everything.

A single god interface (~25 methods) was rejected because most consumers only need a subset; ISP keeps test fakes and future backends focused.
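
As a rough illustration of why the role split keeps test fakes small, the sketch below shows a consumer that declares only the method it needs. manifestSketch, manifestGetter, schemaVersionOf and fakeReader are hypothetical names, and manifestSketch is a trimmed stand-in for Manifest so the example stays self-contained.

```go
package main

import "fmt"

// manifestSketch is a trimmed, hypothetical stand-in for persist.Manifest.
type manifestSketch struct {
	SchemaVersion string
}

// manifestGetter is a consumer-defined subset of the read surface.
type manifestGetter interface {
	GetManifest() (manifestSketch, error)
}

// schemaVersionOf needs a single method, so it asks for exactly that.
func schemaVersionOf(r manifestGetter) (string, error) {
	m, err := r.GetManifest()
	if err != nil {
		return "", fmt.Errorf("read manifest: %w", err)
	}
	return m.SchemaVersion, nil
}

// fakeReader is a one-method test double; no other store methods are stubbed.
type fakeReader struct{ version string }

func (f fakeReader) GetManifest() (manifestSketch, error) {
	return manifestSketch{SchemaVersion: f.version}, nil
}

func main() {
	v, err := schemaVersionOf(fakeReader{version: "1.12"})
	fmt.Println(v, err) // 1.12 <nil>
}
```

Any value that satisfies the full StoreReader would also satisfy a narrower interface of this kind, provided the method signatures match.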

Index ¶

Variables
func DSNHost(dsn string) string
func Open(path string) (*sqliteStore, error)
func OpenReadOnly(path string) (*sqliteStore, error)
type ClusterEdge
type FileEntry
type FindSymbolOptions
type HierarchyRow
type Manifest
type PendingRefRow
type PostgresExporter
- func (e *PostgresExporter) Export(ctx context.Context, dsn string, store StoreReader, log *slog.Logger) error
type SearchFTSOptions
type SearchHit
type Store
- func OpenPostgres(dsn string) (Store, error)
- func OpenPostgresCold(dsn string) (Store, error)
type StoreReader
- func OpenPostgresReadOnly(dsn string) (StoreReader, error)
type StoreWriter
type TopicTreeInput

Constants ¶

This section is empty.

Variables ¶

var ErrInvalidMetric = errors.New("invalid metric: want pagerank|usage")

ErrInvalidMetric is returned by StoreReader.TopNodes when the metric argument is not one of the known column names. Sentinel rather than a string-typed error so callers (HTTP handler) can map it to 400.
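
A minimal sketch of how a caller might map the sentinel to an HTTP status with errors.Is. The local errInvalidMetric mirrors the exported variable so the example compiles on its own. metricColumn and statusFor are hypothetical helpers, and wrapping the sentinel with %w is an assumption, not something the package is known to do.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errInvalidMetric mirrors persist.ErrInvalidMetric for a self-contained sketch.
var errInvalidMetric = errors.New("invalid metric: want pagerank|usage")

// metricColumn maps a metric name to the column named in the TopNodes doc
// comment. Wrapping the sentinel is an assumption made for this sketch.
func metricColumn(metric string) (string, error) {
	switch metric {
	case "pagerank":
		return "pagerank", nil
	case "usage":
		return "usage_score", nil
	default:
		return "", fmt.Errorf("top nodes: %w", errInvalidMetric)
	}
}

// statusFor is a hypothetical helper: the sentinel maps to 400, any other
// error to 500.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errInvalidMetric):
		return http.StatusBadRequest
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	_, err := metricColumn("degree")
	fmt.Println(statusFor(err)) // 400

	col, err := metricColumn("usage")
	fmt.Println(col, statusFor(err)) // usage_score 200
}
```

Because errors.Is unwraps the chain, the mapping keeps working even if an intermediate layer adds context to the error.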

Functions ¶

func DSNHost ¶

func DSNHost(dsn string) string

DSNHost extracts the host portion of a PostgreSQL DSN for safe logging (avoids printing credentials in log output). It handles both URL format (postgres://user:pass@host/db) and key=value format (host=localhost dbname=mydb). Returns "<unparseable>" on any parse failure.
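
The described behaviour can be approximated as follows. This is an illustrative re-implementation under stated assumptions, not the package's code: dsnHostSketch is a hypothetical name, the port is kept as part of the host, and quoted key=value values (for example host='my host') are not handled.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

const unparseable = "<unparseable>"

// dsnHostSketch returns only the host part of a DSN so that credentials
// never reach the log output.
func dsnHostSketch(dsn string) string {
	// URL form: postgres://user:pass@host/db
	if strings.HasPrefix(dsn, "postgres://") || strings.HasPrefix(dsn, "postgresql://") {
		u, err := url.Parse(dsn)
		if err != nil || u.Host == "" {
			return unparseable
		}
		return u.Host // userinfo (user:pass) is never part of u.Host
	}
	// key=value form: scan whitespace-separated pairs for host=...
	for _, field := range strings.Fields(dsn) {
		if strings.HasPrefix(field, "host=") && len(field) > len("host=") {
			return strings.TrimPrefix(field, "host=")
		}
	}
	return unparseable
}

func main() {
	fmt.Println(dsnHostSketch("postgres://user:pass@db.internal:5432/ckg")) // db.internal:5432
	fmt.Println(dsnHostSketch("host=localhost dbname=mydb"))               // localhost
	fmt.Println(dsnHostSketch("not a dsn"))                                // <unparseable>
}
```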

func Open ¶

func Open(path string) (*sqliteStore, error)

Open opens (or creates) a SQLite file at path.

PRAGMAs are passed via DSN so modernc.org/sqlite applies them per-connection. This is required because PRAGMA foreign_keys / journal_mode are connection-scoped: setting them once via Migrate() would not propagate to other pooled connections, leaving FK constraints unenforced and WAL inactive on most queries.

func OpenReadOnly ¶

func OpenReadOnly(path string) (*sqliteStore, error)

OpenReadOnly opens a SQLite file in read-only mode (used by serve/mcp). FK pragma is enforced per-connection via DSN; WAL/synchronous are omitted because read-only mode cannot mutate journal state. busy_timeout still helps a reader wait out a concurrent checkpoint instead of erroring.
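
To make the per-connection PRAGMA approach concrete, here is a hedged sketch of how such DSNs could be assembled. modernc.org/sqlite applies each `_pragma=` query parameter to every new connection. The helper name sqliteDSN, the specific pragma values (busy_timeout of 5000 ms, synchronous NORMAL) and the use of `mode=ro` for the read-only case are assumptions; the package's actual DSN may differ. The path is not URL-escaped here.

```go
package main

import (
	"fmt"
	"strings"
)

// sqliteDSN builds a DSN whose _pragma parameters are applied per connection.
// Pragma values are assumptions made for this sketch.
func sqliteDSN(path string, readOnly bool) string {
	params := []string{
		"_pragma=foreign_keys(1)",
		"_pragma=busy_timeout(5000)",
	}
	if readOnly {
		// Read-only connections cannot change journal state, so the
		// WAL and synchronous pragmas are left out.
		params = append(params, "mode=ro")
	} else {
		params = append(params,
			"_pragma=journal_mode(WAL)",
			"_pragma=synchronous(NORMAL)",
		)
	}
	return "file:" + path + "?" + strings.Join(params, "&")
}

func main() {
	fmt.Println(sqliteDSN("graph.db", false))
	fmt.Println(sqliteDSN("graph.db", true))
}
```

Putting the pragmas in the DSN rather than running them once after opening means every pooled connection gets them, which is the point the Open comment makes.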

Types ¶

type ClusterEdge ¶

type ClusterEdge struct {
	ParentID, ChildID string
	Level             int
}

ClusterEdge mirrors cluster.Edge to avoid making persist's exported surface reach across packages. cluster.PersistClusterEdge is a structurally identical type defined in the cluster package; InsertPkgTreeFromCluster bridges them.

type FileEntry ¶

type FileEntry struct {
	Path          string   `json:"path"`           // srcRoot-relative slash form
	Language      string   `json:"language"`       // "go" | "ts" | "sol"
	SHA256        string   `json:"sha256"`         // hex of file content
	CacheKey      string   `json:"cache_key"`      // hex of full key
	MTime         int64    `json:"mtime"`          // unix nanoseconds (fast path)
	NodeIDs       []string `json:"node_ids"`       // IDs this file produced
	EdgeIDs       []int64  `json:"edge_ids"`       // edge row IDs
	ParserVersion string   `json:"parser_version"` // see ComputeCacheKey
}

FileEntry records the cache fingerprint and produced node/edge IDs for one source file. CacheKey covers content + ckg_version + parser_version + schema_version (see internal/buildpipe/cache.go ComputeCacheKey) so any upstream change correctly invalidates the entry.
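
The invalidation property can be sketched as a hash over the four listed inputs. This is not ComputeCacheKey itself: the function name cacheKeySketch, the choice of SHA-256, the input order and the NUL separator are all assumptions, and the version strings in main are made up.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKeySketch hashes the four inputs the FileEntry comment lists.
// Input order and the NUL separator are assumptions.
func cacheKeySketch(content []byte, ckgVersion, parserVersion, schemaVersion string) string {
	h := sha256.New()
	h.Write(content)
	for _, part := range []string{ckgVersion, parserVersion, schemaVersion} {
		h.Write([]byte{0}) // separator so "1"+"12" and "11"+"2" hash differently
		h.Write([]byte(part))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	src := []byte("package demo\n")
	a := cacheKeySketch(src, "0.2.0", "go-1", "1.12")
	b := cacheKeySketch(src, "0.2.0", "go-2", "1.12") // parser bumped
	fmt.Println(len(a), a == b)                       // 64 false
}
```

With a key of this shape, a parser upgrade invalidates every entry even when no file content changed.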

type FindSymbolOptions ¶

type FindSymbolOptions struct {
	// Language pushes a `language = ?` predicate when non-empty.
	// Empty string disables the predicate.
	Language string

	// Kinds restricts results to the named NodeTypes (SQL `type IN (...)`).
	// Empty slice (or nil) disables the predicate — all kinds are returned.
	// Duplicates are tolerated (the SQL planner dedupes); callers don't need
	// to deduplicate.
	Kinds []types.NodeType
}

FindSymbolOptions configures filter push-down for StoreReader.FindSymbol. Zero value means "no filter" — every match on the name (and exactness mode) passes through.

See docs/followups-from-cks-dogfood-2026-05-19.md item CKG-4 for the downstream motivation: cks Stage 2's `arch_explain` intent fetches Function / Type / Interface symbols separately, paying N round-trips for what should be one. With Kinds set, the SQL layer returns a kind-tagged result in a single query so cks can dedupe by Citation key on the way back.
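
A rough sketch of how the two predicates could be pushed into one query. findSymbolWhere is a hypothetical helper, the `name` column and the `%name` LIKE pattern are assumptions, and kinds are plain strings here instead of types.NodeType to keep the example self-contained.

```go
package main

import (
	"fmt"
	"strings"
)

// findSymbolWhere builds a WHERE clause and its positional arguments.
// Only the `language = ?` and `type IN (...)` predicates come from the
// FindSymbolOptions comments; the rest is assumed.
func findSymbolWhere(name string, exact bool, language string, kinds []string) (string, []any) {
	var conds []string
	var args []any
	if exact {
		conds = append(conds, "name = ?")
		args = append(args, name)
	} else {
		conds = append(conds, "name LIKE ?")
		args = append(args, "%"+name) // suffix match
	}
	if language != "" {
		conds = append(conds, "language = ?")
		args = append(args, language)
	}
	if len(kinds) > 0 {
		marks := strings.TrimSuffix(strings.Repeat("?,", len(kinds)), ",")
		conds = append(conds, "type IN ("+marks+")")
		for _, k := range kinds {
			args = append(args, k)
		}
	}
	return strings.Join(conds, " AND "), args
}

func main() {
	where, args := findSymbolWhere("Open", true, "go", []string{"Function", "Type", "Interface"})
	fmt.Println(where)     // name = ? AND language = ? AND type IN (?,?,?)
	fmt.Println(len(args)) // 5
}
```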

type HierarchyRow ¶

type HierarchyRow struct {
	ParentID   string `json:"parent_id"`
	ChildID    string `json:"child_id"`
	Level      int    `json:"level"`
	TopicLabel string `json:"topic_label,omitempty"`
}

HierarchyRow is the wire shape returned by LoadHierarchy. ParentID may be empty for top-level topic communities (resolution=0), so callers must treat "" as a sentinel for "root".

type Manifest ¶

type Manifest struct {
	SchemaVersion       string         `json:"schema_version"`
	CKGVersion          string         `json:"ckg_version"`
	BuildTimestamp      string         `json:"build_timestamp"`
	SrcRoot             string         `json:"src_root"`
	SrcRelPath          string         `json:"src_rel_path,omitempty"` // src_root relative to git repo root; enables path-aware staleness
	SrcCommit           string         `json:"src_commit,omitempty"`
	StalenessMethod     string         `json:"staleness_method"` // "git" | "mtime"
	StalenessFiles      []string       `json:"staleness_files,omitempty"`
	StalenessMTimeSum   int64          `json:"staleness_mtime_sum,omitempty"`
	Languages           map[string]int `json:"languages"`
	Stats               map[string]int `json:"stats"`
	CKGIgnore           []string       `json:"ckgignore,omitempty"`
	ParseErrorsCount    int            `json:"parse_errors_count"`
	UnresolvedRefsCount int            `json:"unresolved_refs_count"`
	ClusteringStatus    string         `json:"clustering_status"` // "ok" | "pkg_only"
	// Files is the per-file incremental-cache record (A3 Phase 1, schema 1.2).
	// Each entry tracks the SHA256 + cache key of one source file plus the
	// node/edge IDs it produced, enabling subsequent builds to skip parsing
	// for unchanged files. omitempty so pre-1.2 manifests reload as nil and
	// trigger a full rebuild on the next ckg build invocation.
	Files []FileEntry `json:"files,omitempty"`
}

Manifest captures build-time metadata. Stored as key/value rows in the manifest table; complex fields are JSON-encoded.

SchemaVersion policy: bumped on BREAKING changes only — i.e. changes that existing readers cannot transparently tolerate (renamed/removed fields, changed semantics, incompatible row layout). Additive optional fields with `omitempty` do NOT bump SchemaVersion: old readers ignore unknown JSON fields and unset optional fields decode as zero values. Example: the SrcRelPath field was added without a bump (1.0 → still 1.0) because empty SrcRelPath triggers the legacy back-compat branch in callers. Resist the urge to over-bump; spurious bumps force unnecessary rebuilds across all existing graph DBs.
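
The additive-field argument can be checked with a small sketch. oldManifest and newManifest are hypothetical, trimmed stand-ins for two generations of Manifest, and a single JSON document is used for brevity even though the real manifest is stored as key/value rows.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// oldManifest is what a reader built before SrcRelPath existed decodes into.
type oldManifest struct {
	SchemaVersion string `json:"schema_version"`
	SrcRoot       string `json:"src_root"`
}

// newManifest adds the optional field without changing SchemaVersion.
type newManifest struct {
	SchemaVersion string `json:"schema_version"`
	SrcRoot       string `json:"src_root"`
	SrcRelPath    string `json:"src_rel_path,omitempty"`
}

func decodeOld(raw string) (oldManifest, error) {
	var m oldManifest
	err := json.Unmarshal([]byte(raw), &m)
	return m, err
}

func decodeNew(raw string) (newManifest, error) {
	var m newManifest
	err := json.Unmarshal([]byte(raw), &m)
	return m, err
}

func main() {
	// New writer, old reader: the unknown field is ignored.
	o, err := decodeOld(`{"schema_version":"1.0","src_root":"/repo/svc","src_rel_path":"svc"}`)
	fmt.Println(o.SrcRoot, err) // /repo/svc <nil>

	// Old writer, new reader: the missing field decodes as "".
	n, err := decodeNew(`{"schema_version":"1.0","src_root":"/repo/svc"}`)
	fmt.Println(n.SrcRelPath == "", err) // true <nil>
}
```

Both directions decode without error, which is why an additive `omitempty` field does not need a version bump under this policy.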

type PendingRefRow ¶

type PendingRefRow struct {
	FilePath     string
	SrcID        string
	TargetQName  string
	EdgeType     string
	Line         int
	HintFile     string
	DispatchKind string
}

PendingRefRow is the storage wire shape for parse.PendingRef. Defined in persist (rather than reusing parse.PendingRef directly) so persist stays import-free of the parse package — buildpipe bridges the two when emitting from cold path or reloading for partial-cache rebuild.

G6 v3 (schema 1.5): persisting pending refs lets the partial path replay Pass 2 over the merged dirty + cached input set without re-parsing cached files. Without this table the cached-side pending refs were silently dropped (the v1/v2 cross-file edge regression).

DispatchKind (Track C P1b, schema 1.7): mirrors the edges table column — preserves the AST-time dispatch classification across the cache boundary. Empty for static `calls`.

type PostgresExporter ¶

type PostgresExporter struct{}

PostgresExporter reads a SQLite graph (via StoreReader) and pushes all nodes, edges and blobs to a PostgreSQL database in a single one-shot transfer. It is intentionally write-only: the target schema is created on first run (IF NOT EXISTS), so re-running against an already-populated database is idempotent at the DDL level but will conflict on primary-key inserts. The exporter does not upsert; callers that need to re-run an export should truncate the target tables first.

func (*PostgresExporter) Export ¶

func (e *PostgresExporter) Export(ctx context.Context, dsn string, store StoreReader, log *slog.Logger) error

Export reads all nodes, edges and blobs from store and inserts them into the PostgreSQL database reachable at dsn. It creates the schema on first call. The operation is not wrapped in a single transaction to keep memory pressure bounded; partial exports leave the target in a consistent (though potentially incomplete) state.

type SearchFTSOptions ¶

type SearchFTSOptions struct {
	// Language pushes a WHERE language = ? predicate into the SQL.
	// Empty string disables the predicate (no language filter).
	Language string

	// Mode selects how multi-token queries combine. The zero value
	// (empty string) preserves the historical OR-broadening behaviour:
	// rewriteFTSQuery joins tokens with FTS5 OR so any one match
	// surfaces a candidate, then BM25 + PageRank + usage rerank.
	//
	// Mode = "and" engages a post-FTS filter that drops hits whose
	// FTS-indexed columns (name + qualified_name + signature +
	// doc_comment) miss any query token. Mirrors the
	// pkg/evidence/BuildPack Mode="and" semantics so external
	// consumers see consistent AND behaviour across the search and
	// evidence surfaces. Implementation over-fetches (limit × 3,
	// floor 30) before filtering to preserve recall.
	//
	// Mode = "or" is accepted as a synonym of the zero value for
	// callers that want to be explicit. Any other value is treated
	// as "or" (forward-compatible — future modes are append-only).
	Mode string

	// NodeKinds restricts the result set to specific node types. The
	// zero value (nil slice) applies the *default symbol-only filter*:
	// search_text returns only the types that types.NodeType.IsSymbol
	// reports true for, which strips statement-level nodes
	// (IfStmt/LoopStmt/CallSite/ReturnStmt/SwitchStmt/AwaitPoint),
	// meta nodes (Commit/Hunk), and path-only nodes (Import/Export)
	// from FTS hits that match purely on the enclosing symbol's qname
	// prefix.
	//
	// To surface every node type the FTS index matched, pass an
	// explicit slice — typically types.AllNodeTypes() — or list the
	// specific kinds you need. An empty (non-nil) slice is treated
	// the same as nil and applies the default symbol filter; callers
	// that mean "match nothing" should not call SearchFTS at all.
	NodeKinds []types.NodeType
}

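SearchFTSOptions configures filter push-down for StoreReader.SearchFTS. The zero value applies no Language filter and uses OR mode, but it is not entirely unfiltered: a nil NodeKinds still applies the default symbol-only filter described on that field.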

Filters that the persistence layer cannot or chooses not to push down (e.g. path globs cheap on the client) are deliberately absent. Adding them later is a non-breaking change because struct fields default to zero on omission.

See docs/followups-from-cks-dogfood-2026-05-19.md item CKG-2 for the downstream motivation: cks currently over-fetches by FilterOverfetchRatio=3 and post-filters client-side on Language, which caps recall when filters drop most of a small page.
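
The Mode rules from the field comments (unknown values fall back to "or"; AND mode over-fetches limit × 3 with a floor of 30, then drops hits that miss a token) can be sketched as below. All function names are hypothetical, candidates are plain strings standing in for the FTS-indexed columns, and case-insensitive substring matching is an assumption about how "miss any query token" is evaluated.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeMode treats "and" as the only non-default mode.
func normalizeMode(mode string) string {
	if mode == "and" {
		return "and"
	}
	return "or"
}

// overfetchLimit applies limit × 3 with a floor of 30.
func overfetchLimit(limit int) int {
	n := limit * 3
	if n < 30 {
		n = 30
	}
	return n
}

// containsAllTokens reports whether every token occurs in the indexed text.
func containsAllTokens(indexed string, tokens []string) bool {
	lower := strings.ToLower(indexed)
	for _, t := range tokens {
		if !strings.Contains(lower, strings.ToLower(t)) {
			return false
		}
	}
	return true
}

// filterAND keeps at most limit candidates that contain every token.
func filterAND(candidates, tokens []string, limit int) []string {
	var out []string
	for _, c := range candidates {
		if len(out) == limit {
			break
		}
		if containsAllTokens(c, tokens) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	fmt.Println(overfetchLimit(5), overfetchLimit(20)) // 30 60
	fmt.Println(normalizeMode("fuzzy"))                // or

	candidates := []string{
		"OpenReadOnly opens a SQLite file",
		"Open opens a SQLite file",
		"OpenPostgresReadOnly opens PostgreSQL",
	}
	fmt.Println(len(filterAND(candidates, []string{"open", "readonly"}, 10))) // 2
}
```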

type SearchHit ¶

type SearchHit struct {
	Node     types.Node
	Score    float64 // normalized to [0, 1], result-set local
	RawScore float64 // backend-native, higher = stronger match
}

SearchHit pairs a node with its full-text search relevance score.

Returned by StoreReader.SearchFTS so downstream rerankers can distinguish "one strong unique-identifier hit" from "five weak common-word hits" — the gap that drove the cks workaround at internal/ckgclient/real.go (1 - i/(N+1) fake score, see docs/followups-from-cks-dogfood-2026-05-19.md item CKG-1).

Two scores are exposed:

Score: result-set min-max normalized to [0, 1]. Comparable within a single SearchFTS call. NOT comparable across calls — different result sets have different min/max windows. Recommended field for downstream rerankers.
RawScore: backend-native score, retained for debugging or advanced rerankers that already know the backend's scale. SQLite: -bm25(nodes_fts), sign-flipped so higher is better. PostgreSQL: ts_rank(search_vector, plainto_tsquery). The two scales differ — do NOT cross-compare RawScore across backends.
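
A minimal sketch of result-set-local min-max normalisation, assuming raw scores are already sign-flipped so that higher is better. normalizeScores is a hypothetical name, and returning 1 for every hit when all raw scores are equal is an assumption; the source does not say how that case is handled.

```go
package main

import "fmt"

// normalizeScores maps raw scores (higher = better) onto [0, 1] within one
// result set. The all-equal case returning 1 is an assumption.
func normalizeScores(raw []float64) []float64 {
	out := make([]float64, len(raw))
	if len(raw) == 0 {
		return out
	}
	lo, hi := raw[0], raw[0]
	for _, r := range raw[1:] {
		if r < lo {
			lo = r
		}
		if r > hi {
			hi = r
		}
	}
	for i, r := range raw {
		if hi == lo {
			out[i] = 1
			continue
		}
		out[i] = (r - lo) / (hi - lo)
	}
	return out
}

func main() {
	// SQLite's bm25() is lower-is-better, so RawScore is its negation.
	bm25 := []float64{-8, -5, -2}
	raw := make([]float64, len(bm25))
	for i, b := range bm25 {
		raw[i] = -b
	}
	fmt.Println(normalizeScores(raw)) // [1 0.5 0]
}
```

The same raw score can normalise to different values in different calls because lo and hi are taken from the current result set, which is why Score is not comparable across calls.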

type Store ¶

type Store interface {
	StoreReader
	StoreWriter
}

Store is the union of the read and write surfaces — for callers (e.g. buildpipe) that need both. Embedded composition keeps the role surfaces reusable in isolation.

func OpenPostgres ¶

func OpenPostgres(dsn string) (Store, error)

OpenPostgres opens PostgreSQL for read/write. Used by incremental builds, serve, and mcp. The pool is configured with pgxpool defaults (MaxConns defaults to the greater of 4 and runtime.NumCPU()).

func OpenPostgresCold ¶

func OpenPostgresCold(dsn string) (Store, error)

OpenPostgresCold wipes all data in FK-safe order via TRUNCATE … CASCADE, then calls Migrate() to ensure the schema is current. Equivalent to os.Remove(graph.db) + Open() for the SQLite cold path.

type StoreReader ¶

type StoreReader interface {
	// Lifecycle
	Close() error

	// Manifest
	GetManifest() (Manifest, error)

	// Hierarchy
	LoadHierarchy(kind string) ([]HierarchyRow, error)

	// Node queries
	// FindSymbol returns nodes matching name (exact or LIKE-suffix per `exact`).
	// See FindSymbolOptions for filter push-down (Language, Kinds).
	FindSymbol(name string, exact bool, opts FindSymbolOptions) ([]types.Node, error)
	// FindByCanonicalID returns the single node whose canonical_id matches
	// exactly. canonical_id is the globally-unique, import-path-qualified
	// identity (ADR-0001), so the match is unambiguous — unlike FindSymbol's
	// short name, it cannot collide across packages. Returns found=false (nil
	// error) when nothing matches or canonicalID is empty.
	FindByCanonicalID(canonicalID string) (types.Node, bool, error)
	NodesByIDs(ids []string) ([]types.Node, error)
	QueryNodes(parent string, limit int) ([]types.Node, error)
	// TopNodes returns the top-N nodes ranked by metric, descending.
	// Designed for the viewer's boot view: a meaningful initial seed where
	// hub functions/methods/types appear naturally so 1-hop expansion shows
	// real call/import structure rather than 37 disconnected packages.
	//
	// metric ∈ {"pagerank", "usage"} — values map to the nodes.pagerank and
	// nodes.usage_score columns respectively. Unknown metric → ErrInvalidMetric.
	// Result is sorted DESC by the chosen column, ties broken by id ASC for
	// determinism. Limit ≤0 is normalised by callers (HTTP layer caps).
	//
	// excludeTypes (variadic) lets callers drop irrelevant node types from
	// the boot seed without re-fetching client-side. The motivating case is
	// the viewer: with 178 git Commit nodes outranking real symbols by
	// pagerank, ~52% of the top-200 boot was Commit nodes, whose only
	// outgoing edge type (`changed_in`) is off by default — so the canvas
	// rendered Commit halos with no visible edges. Pass excludeTypes=
	// []string{"Commit"} to keep boot focused on symbols. No-op when empty.
	TopNodes(metric string, limit int, excludeTypes ...string) ([]types.Node, error)
	DistinctFilePaths(language string) ([]string, error)

	// Edge queries
	QueryEdgesByType(t string) ([]types.Edge, error)
	QueryEdgesForNodes(ids []string) ([]types.Edge, error)
	// EdgeCountsByType returns total edge count per edge type across the
	// entire graph (no node filter). Used by viewer Track D to show G1..G6
	// distribution next to each pill so users can read "G4 has 19 edges
	// total" at a glance — without it, toggling a sparse axis looks dead
	// because the canvas barely changes. Result is `map[edge_type] = count`.
	EdgeCountsByType() (map[string]int, error)

	// Traversal
	NeighborhoodByQname(qname string, depth int, reverse bool, edgeTypes ...string) ([]types.Node, []types.Edge, error)
	SubgraphByQname(qname string, depth int) ([]types.Node, []types.Edge, error)

	// Search
	Search(q string, limit int) ([]types.Node, error)
	// SearchWithOpts is Search with explicit SearchFTSOptions. Adds
	// AND-mode and Language filtering to the routed search path.
	// Options apply on the FTS branch only; the CJK substring fallback
	// ignores them (substring matching has no multi-token semantics).
	// Returns the same []types.Node shape as Search so callers can
	// migrate incrementally without touching their result handling.
	SearchWithOpts(q string, limit int, opts SearchFTSOptions) ([]types.Node, error)
	// SearchFTS returns FTS matches with BM25-derived relevance scores.
	// See SearchHit for the meaning of Score (normalized) vs RawScore.
	// See SearchFTSOptions for filter push-down + Mode.
	SearchFTS(q string, limit int, opts SearchFTSOptions) ([]SearchHit, error)

	// Source bodies
	GetBlob(id string) ([]byte, error)

	// Per-file lookups (A3 incremental cache, schema 1.2). Used by
	// buildpipe to load nodes/edges/blobs for files whose content hash
	// matched the previous manifest — those rows are reused as-is rather
	// than re-parsing.
	NodesByFilePath(path string) ([]types.Node, error)
	EdgesByFilePath(path string) ([]types.Edge, error)
	BlobsByFilePath(path string) (map[string][]byte, error)
	// PendingRefsByFilePath: G6 v3 partial-cache rebuild reads cached files'
	// unresolved cross-file refs back so Pass 2 Resolve sees the cold-equivalent
	// input. Schema 1.5.
	PendingRefsByFilePath(path string) ([]PendingRefRow, error)

	// ReverseDepsForFiles returns every cached file path that has pending_refs
	// targeting a qualified_name defined in any of dirtyPaths. Used by C1
	// (reverse-reference invalidation) to find which cached files need their
	// pending_refs re-resolved when dirty files change their exported symbols.
	// MUST be called BEFORE deleting dirty nodes — the lookup joins
	// pending_refs to nodes still in DB. Returns nil when dirtyPaths is empty.
	ReverseDepsForFiles(dirtyPaths []string) ([]string, error)

	// Static export (chunked JSON, spec §6.6). On StoreReader rather than
	// StoreWriter because ExportChunked only reads from the store and writes
	// JSON to disk — its sole caller (cmd/ckg/export_static.go) opens via
	// OpenReadOnly, which proves it doesn't need write access to the DB.
	ExportChunked(outDir string, nodeChunkSize, edgeChunkSize int) error

	// AmbiguousMetaNodes returns Hunk + Commit nodes whose confidence is
	// AMBIGUOUS — the §11.3 unreachable-history track populated by
	// LoadUnreachableHunks. Powers the viewer's Recovery panel; deliberately
	// scoped to meta-node types so other AMBIGUOUS rows (e.g. multi-candidate
	// TS resolutions on Function nodes) don't pollute the recovery view.
	// Returns nil + nil when no AMBIGUOUS rows exist (fresh graph).
	AmbiguousMetaNodes() ([]types.Node, error)

	// AllNodes / AllEdges return the full graph. Added for `ckg validate`
	// which reconstructs the in-memory graph from a built DB so it can
	// run validators (schema, future LLM) against persisted state. Avoid
	// these on huge graphs in tight loops; they are intentionally
	// streaming-unaware (callers want everything in memory).
	AllNodes() ([]types.Node, error)
	AllEdges() ([]types.Edge, error)

	// GetNodePRs returns every PR breadcrumb recorded against nodeID
	// whose merge timestamp is strictly before cutoff (ckg-NEW-3
	// temporal slicing). Pass time.Time{} (zero value) to disable the
	// cutoff and return the full history.
	//
	// Order: descending by merge timestamp — most recent first, so
	// "show me the last N changes around this symbol" requires no
	// client-side sort. Empty slice (not error) when the node has no
	// recorded PRs or every match was filtered out by the cutoff.
	//
	// See pkg/types.PRRef for the record schema and the build-time
	// derivation (internal/buildpipe.ScanPRHistory).
	GetNodePRs(nodeID string, cutoff time.Time) ([]types.PRRef, error)
}

StoreReader is the read-only surface. serve, mcp, eval and audit all depend on this — none of them write to the graph.
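
The temporal-slicing contract documented on GetNodePRs (strictly before the cutoff, a zero cutoff disables filtering, newest first, empty slice rather than error) can be sketched in memory as follows. prRefSketch is a hypothetical, trimmed stand-in for types.PRRef, and its field names, slicePRs, day, noCutoff and numbers are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// prRefSketch is a hypothetical, trimmed stand-in for types.PRRef.
type prRefSketch struct {
	Number   int
	MergedAt time.Time
}

// noCutoff is the zero time.Time, which disables the cutoff.
var noCutoff time.Time

// day is a small helper for building test timestamps.
func day(d int) time.Time {
	return time.Date(2026, time.May, d, 0, 0, 0, 0, time.UTC)
}

// slicePRs keeps PRs merged strictly before cutoff and returns them newest
// first. The result is empty (not nil) when nothing matches.
func slicePRs(prs []prRefSketch, cutoff time.Time) []prRefSketch {
	out := []prRefSketch{}
	for _, pr := range prs {
		if cutoff.IsZero() || pr.MergedAt.Before(cutoff) {
			out = append(out, pr)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].MergedAt.After(out[j].MergedAt)
	})
	return out
}

// numbers extracts PR numbers for compact printing.
func numbers(prs []prRefSketch) []int {
	ns := make([]int, 0, len(prs))
	for _, pr := range prs {
		ns = append(ns, pr.Number)
	}
	return ns
}

func main() {
	prs := []prRefSketch{{101, day(1)}, {102, day(10)}, {103, day(20)}}
	fmt.Println(numbers(slicePRs(prs, day(10))))  // [101]
	fmt.Println(numbers(slicePRs(prs, noCutoff))) // [103 102 101]
}
```

PR 102 is excluded in the first call because its merge time equals the cutoff and the comparison is strict.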

func OpenPostgresReadOnly ¶

func OpenPostgresReadOnly(dsn string) (StoreReader, error)

OpenPostgresReadOnly opens PostgreSQL in read-only mode. The underlying pool is identical — PostgreSQL doesn't have a read-only pool option — but write methods panic to catch unintended writes at the call site rather than at the server. Callers that only need StoreReader should use this.

type StoreWriter ¶

type StoreWriter interface {
	// Lifecycle
	Close() error
	Migrate() error

	// Bulk insert
	InsertNodes(nodes []types.Node) error
	InsertEdges(edges []types.Edge) error
	InsertBlobs(blobs map[string][]byte) error
	InsertPkgTreeFromCluster(edges []cluster.PersistClusterEdge) error
	InsertTopicTree(t TopicTreeInput) error
	// InsertPendingRefs: G6 v3 — cold path persists every Pass 1 unresolved
	// cross-file ref so the next partial build can replay Pass 2 over a
	// merged dirty + cached input. Schema 1.5.
	InsertPendingRefs(refs []PendingRefRow) error

	// InsertNodePRs writes the PR breadcrumb map (ckg-NEW-2, schema
	// 1.12). Keyed by node ID; the slice value is the full list of
	// PRs whose merge commit overlapped the node's source range.
	// Idempotent — node_prs has PRIMARY KEY (node_id, number), so
	// re-runs with INSERT OR REPLACE rewrite the existing rows.
	InsertNodePRs(byNode map[string][]types.PRRef) error

	// Per-file delete (A3 incremental cache). Drops every node whose
	// file_path matches; FK ON DELETE CASCADE wipes the dependent edges
	// and blobs in the same statement. Caller is responsible for then
	// re-inserting the new parse output.
	DeleteNodesByFilePath(path string) error

	// Per-type edge delete (A3 incremental cache). Used to clear
	// cross-language edges (e.g. binds_to) before they are recomputed —
	// they have no FilePath so the per-file delete cannot reach them.
	DeleteEdgesByType(t string) error

	// Indexing
	RebuildFTS() error

	// Manifest
	SetManifest(m Manifest) error
}

StoreWriter is the write surface used by buildpipe to materialise a graph end-to-end (Migrate → Insert* → RebuildFTS → SetManifest).
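
The Migrate → Insert* → RebuildFTS → SetManifest order can be sketched against a trimmed interface. graphWriter is a hypothetical subset of StoreWriter with simplified argument types (plain strings instead of types.Node, types.Edge and Manifest), and materialise and recordingWriter are invented names; the real buildpipe sequence has more steps.

```go
package main

import "fmt"

// graphWriter is a trimmed, hypothetical subset of StoreWriter.
type graphWriter interface {
	Migrate() error
	InsertNodes(ids []string) error
	InsertEdges(pairs [][2]string) error
	RebuildFTS() error
	SetManifest(schemaVersion string) error
}

// materialise runs the write lifecycle in order and stops at the first error.
func materialise(w graphWriter, nodes []string, edges [][2]string, schemaVersion string) error {
	steps := []struct {
		name string
		run  func() error
	}{
		{"migrate", w.Migrate},
		{"insert nodes", func() error { return w.InsertNodes(nodes) }},
		{"insert edges", func() error { return w.InsertEdges(edges) }},
		{"rebuild fts", w.RebuildFTS},
		{"set manifest", func() error { return w.SetManifest(schemaVersion) }},
	}
	for _, s := range steps {
		if err := s.run(); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

// recordingWriter is a test double that records the call order.
type recordingWriter struct{ calls []string }

func (r *recordingWriter) record(name string) error {
	r.calls = append(r.calls, name)
	return nil
}

func (r *recordingWriter) Migrate() error                { return r.record("Migrate") }
func (r *recordingWriter) InsertNodes([]string) error    { return r.record("InsertNodes") }
func (r *recordingWriter) InsertEdges([][2]string) error { return r.record("InsertEdges") }
func (r *recordingWriter) RebuildFTS() error             { return r.record("RebuildFTS") }
func (r *recordingWriter) SetManifest(string) error      { return r.record("SetManifest") }

func main() {
	w := &recordingWriter{}
	err := materialise(w, []string{"a", "b"}, [][2]string{{"a", "b"}}, "1.12")
	fmt.Println(w.calls, err) // [Migrate InsertNodes InsertEdges RebuildFTS SetManifest] <nil>
}
```

Writing the manifest last means a build that fails partway leaves no manifest claiming the graph is complete, which is one plausible reason for this ordering.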

type TopicTreeInput ¶

type TopicTreeInput interface {
	ResolutionsCount() int
	ResolutionGamma(i int) float64
	ResolutionMembers(i int) map[string][]string // label -> []nodeID
}

TopicTreeInput abstracts the per-resolution view of a topic tree so persist can consume it without importing cluster types directly. *cluster.TopicTree satisfies this interface (see internal/cluster/persist_adapter.go).
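
For illustration, a small in-memory type that satisfies an interface with the same three methods. topicTreeInput mirrors the exported interface locally so the example is self-contained. staticTopicTree, resolutionSketch, demoTree and countAssignments are hypothetical, and the labels and node IDs are made up.

```go
package main

import "fmt"

// topicTreeInput mirrors persist.TopicTreeInput for a self-contained sketch.
type topicTreeInput interface {
	ResolutionsCount() int
	ResolutionGamma(i int) float64
	ResolutionMembers(i int) map[string][]string // label -> []nodeID
}

// resolutionSketch holds one resolution level of the tree.
type resolutionSketch struct {
	gamma   float64
	members map[string][]string
}

// staticTopicTree is a hypothetical in-memory implementation.
type staticTopicTree struct{ levels []resolutionSketch }

func (t staticTopicTree) ResolutionsCount() int         { return len(t.levels) }
func (t staticTopicTree) ResolutionGamma(i int) float64 { return t.levels[i].gamma }
func (t staticTopicTree) ResolutionMembers(i int) map[string][]string {
	return t.levels[i].members
}

// countAssignments walks the interface the way a consumer would, summing
// node assignments across all resolutions.
func countAssignments(t topicTreeInput) int {
	total := 0
	for i := 0; i < t.ResolutionsCount(); i++ {
		for _, ids := range t.ResolutionMembers(i) {
			total += len(ids)
		}
	}
	return total
}

// demoTree builds a two-level tree with invented labels and node IDs.
func demoTree() staticTopicTree {
	return staticTopicTree{levels: []resolutionSketch{
		{gamma: 0.5, members: map[string][]string{"storage": {"n1", "n2", "n3"}}},
		{gamma: 1.0, members: map[string][]string{"sqlite": {"n1", "n2"}, "postgres": {"n3"}}},
	}}
}

func main() {
	tree := demoTree()
	fmt.Println(tree.ResolutionsCount(), tree.ResolutionGamma(0), countAssignments(tree)) // 2 0.5 6
}
```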
