types

package
v0.0.0-...-b8e9622 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 30, 2026 License: AGPL-3.0 Imports: 7 Imported by: 0

Documentation

Overview

Package types holds the cross-package data contracts: Chunk, Hit, Filter, and the Embedder and VectorStore interfaces. Keeping these here (rather than inside internal/) lets future CKS code import them without pulling in indexer/store implementations.

Functions

func ChunkID

func ChunkID(file string, startLine, endLine int, contentSHA256 string) string

ChunkID computes the deterministic chunk identifier:

sha256(file + "\n" + start_line + ":" + end_line + "\n" + content_sha256)

content_sha256 is the SHA-256 of the chunk Text (raw bytes — no whitespace normalization). A rename of the file changes the ID; this is intentional — rename tracking is the caller's responsibility.
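The formula above can be sketched as follows. This is a minimal illustration, not the package's implementation: the name chunkIDSketch is hypothetical, and both the decimal rendering of the line numbers and the hex encoding of the digest are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strconv"
)

// chunkIDSketch follows the documented layout:
//   sha256(file + "\n" + start_line + ":" + end_line + "\n" + content_sha256)
// Decimal line numbers and a hex-encoded result are assumptions of this sketch.
func chunkIDSketch(file string, startLine, endLine int, contentSHA256 string) string {
	payload := file + "\n" +
		strconv.Itoa(startLine) + ":" + strconv.Itoa(endLine) + "\n" +
		contentSHA256
	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}
```

Because the file path is part of the hashed payload, the same span under a new path yields a different identifier, which matches the rename behavior described above.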

func ContentSHA256

func ContentSHA256(text string) string

ContentSHA256 returns the canonical hash used in chunk_id and stored alongside each chunk for stale-detection. Single-source-of-truth helper — every caller (chunker, store loader, eval harness) goes through this so hashing stays consistent.
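As a rough illustration of the stale-detection use, the sketch below recomputes the hash of the current text and compares it with the stored value. The names contentHashSketch and isStaleSketch are hypothetical, and hex encoding of the digest is an assumption.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// contentHashSketch hashes the raw bytes of text with no whitespace
// normalization. Hex encoding is an assumption of this sketch.
func contentHashSketch(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

// isStaleSketch reports whether the current text no longer matches the
// hash that was stored alongside the chunk at index time.
func isStaleSketch(storedSHA256, currentText string) bool {
	return contentHashSketch(currentText) != storedSHA256
}
```

Since nothing is normalized, even a whitespace-only change to the text counts as a content change.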

func IsTestPath

func IsTestPath(path, lang string) bool

IsTestPath classifies a source-relative path as a test file based on the conventional patterns of its language. It is intentionally a pure function so the chunker can call it without depending on language parsers, and so callers can re-classify when reindexing without a schema migration.

Conventions covered:

Go         "*_test.go"
TypeScript "*.test.ts(x)", "*.spec.ts(x)"
JavaScript "*.test.js(x)", "*.spec.js(x)" (for future JS parser)
Solidity   "*.t.sol" (Foundry), any segment named "test" or "tests"

path is forward-slash, repo-relative. lang is the language tag the discover/parse layer assigned ("go", "typescript", "solidity", or "").

Why per-language conventions: testing frameworks pick filename rules that differ between ecosystems. JUnit (Java) would use `Test*.java`; pytest (Python) uses `test_*.py`. To add a language, add one branch here. Keep the function short and explicit (P5: readable beats clever) so the contributor adding the next language can see at a glance what to extend.
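A one-branch-per-language classifier along these lines might look like the sketch below. It is not the package's code: isTestPathSketch and hasAnySuffix are hypothetical names, the "test"/"tests" segment rule is applied to Solidity only (as the table layout suggests), and unknown language tags are assumed to classify as non-test.

```go
package main

import (
	"path"
	"strings"
)

// hasAnySuffix reports whether s ends with one of the given suffixes.
func hasAnySuffix(s string, suffixes ...string) bool {
	for _, suf := range suffixes {
		if strings.HasSuffix(s, suf) {
			return true
		}
	}
	return false
}

// isTestPathSketch mirrors the documented conventions with one branch per
// language. p is forward-slash and repo-relative.
func isTestPathSketch(p, lang string) bool {
	base := path.Base(p)
	switch lang {
	case "go":
		return strings.HasSuffix(base, "_test.go")
	case "typescript":
		return hasAnySuffix(base, ".test.ts", ".test.tsx", ".spec.ts", ".spec.tsx")
	case "javascript":
		return hasAnySuffix(base, ".test.js", ".test.jsx", ".spec.js", ".spec.jsx")
	case "solidity":
		if strings.HasSuffix(base, ".t.sol") {
			return true
		}
		// Assumption: the segment rule applies to Solidity paths only.
		for _, seg := range strings.Split(p, "/") {
			if seg == "test" || seg == "tests" {
				return true
			}
		}
	}
	return false
}
```

Supporting another ecosystem would then mean adding one more case to the switch.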

Types

type Branch

type Branch struct {
	When string `json:"when"`
	Then string `json:"then"`
	At   string `json:"at"`
}

Branch is one conditional edge inside a flow step: under condition When, control goes to Then at code location At. Mapping a symptom (When) to its cause site (Then@At) is the core of flow-based root-cause analysis.

type Chunk

type Chunk struct {
	ID              string                `json:"id"` // see ChunkID
	File            string                `json:"file"`
	StartLine       int                   `json:"start_line"`
	EndLine         int                   `json:"end_line"`
	Language        string                `json:"language"`          // "go" | "typescript" | "solidity" | "markdown"
	IsTest          bool                  `json:"is_test,omitempty"` // _test.go, *.test.ts, *.spec.ts, *.t.sol, test/... — populated by IsTestPath
	SymbolName      string                `json:"symbol_name,omitempty"`
	SymbolKind      SymbolKind            `json:"symbol_kind,omitempty"`
	ChunkKind       ChunkKind             `json:"chunk_kind"`
	CommitHash      string                `json:"commit_hash"`
	ContentSHA256   string                `json:"content_sha256"`
	CKGNodeID       string                `json:"ckg_node_id,omitempty"`      // 1:1 alignment when CKG path is provided
	CanonicalID     string                `json:"canonical_id,omitempty"`     // ckg's import-path-qualified symbol id (ADR-0001), copied verbatim from the aligned ckg node; the stable key cks uses to FindByCanonicalID against ckg
	RecentPRs       []PRRef               `json:"recent_prs,omitempty"`       // PRs that touched this chunk's file
	Category        string                `json:"category,omitempty"`         // policy category: consensus|state|crypto|p2p|... (empty = unclassified)
	Guidance        *ModificationGuidance `json:"guidance,omitempty"`         // attached by policy loader; nil for unclassified
	Invariants      []InvariantRef        `json:"invariants,omitempty"`       // back-pointers to ChunkInvariant chunks extracted from this source
	ConventionStats map[string]any        `json:"convention_stats,omitempty"` // populated on ChunkConvention chunks; empty for source chunks
	FlowStep        *FlowStepMeta         `json:"flow_step,omitempty"`        // populated on ChunkFlowStep chunks (flow_meta column)
	FlowSpine       *FlowSpineMeta        `json:"flow_spine,omitempty"`       // populated on ChunkFlowSpine chunks (flow_meta column)
	Provenance      string                `json:"provenance,omitempty"`       // invariant origin: "auto" (extracted) | "curated" (corpus); empty for non-invariant chunks
	EnforcedAt      []EnforcePoint        `json:"enforced_at,omitempty"`      // populated on curated ChunkInvariant chunks (enforced_at column)
	Text            string                `json:"text"`                       // chunk source (for re-embedding / display)
}

Chunk is the unit CKV embeds and stores. It is the indexable record produced by parse → chunk; the embedder turns Text into a vector and the store persists everything except Text-derived caches.

func (Chunk) Citation

func (c Chunk) Citation() Citation

Citation returns the citation view of this chunk. Always populated for indexed chunks; never returns a zero-value citation for a real chunk.

type ChunkKind

type ChunkKind string

ChunkKind classifies the chunking strategy that produced the chunk. Distinct from SymbolKind because a long function may produce several "function_split" chunks, all of SymbolKind=Function.

const (
	ChunkSymbol        ChunkKind = "symbol"         // whole function/method/type
	ChunkFunctionSplit ChunkKind = "function_split" // sub-chunk of a long function
	ChunkFileHeader    ChunkKind = "file_header"    // import block / top-of-file
	// ChunkDoc covers markdown heading sections (DocSection/ADRSection).
	// Kept distinct from ChunkSymbol so callers can filter the corpus by
	// "code vs documentation" without inspecting SymbolKind. The chunker
	// promotes spans whose SymbolKind is DocSection or ADRSection.
	ChunkDoc ChunkKind = "doc"

	// PR corpus kinds. Additive — existing schema_version 1.0
	// indexes continue working; these appear only in indexes built with
	// --include-pr-history.
	ChunkPRBackground  ChunkKind = "pr_background"
	ChunkPRSolution    ChunkKind = "pr_solution"
	ChunkCommitMessage ChunkKind = "commit_message"

	// ChunkInvariant carries an invariant statement found inside or
	// adjacent to a source chunk. Each invariant chunk is paired (via
	// the source chunk's Invariants []InvariantRef list) with the code
	// it constrains. The agent can query invariants for a file to
	// learn what changes must NOT break.
	ChunkInvariant ChunkKind = "invariant"

	// ChunkConvention is a per-package summary of AST-derived patterns
	// (error handling style, logging library, naming, concurrency).
	// The agent queries these to learn what idioms the package follows
	// before proposing edits — preventing convention drift.
	ChunkConvention ChunkKind = "convention"

	// Flow-corpus kinds. A curated flow corpus (corpus.jsonl, loaded via
	// --flow-corpus) describes "symptom → cause" causal paths through the code so
	// an agent can trace a symptom to its cause. Additive — present only in
	// indexes built with --flow-corpus.
	ChunkFlowStep  ChunkKind = "flow_step"  // one step in a flow (symbol + branches)
	ChunkFlowSpine ChunkKind = "flow_spine" // a flow's entry/summary backbone
)
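The "code vs documentation" filtering that ChunkDoc enables can be illustrated with a small predicate. This is a sketch only: isCodeKindSketch is a hypothetical helper, and treating exactly the three source-derived kinds as "code" is an assumption.

```go
package main

// ChunkKind and the constants below are copied from the declarations above.
type ChunkKind string

const (
	ChunkSymbol        ChunkKind = "symbol"
	ChunkFunctionSplit ChunkKind = "function_split"
	ChunkFileHeader    ChunkKind = "file_header"
	ChunkDoc           ChunkKind = "doc"
)

// isCodeKindSketch reports whether a chunk kind was produced directly from
// source code. Every other kind (doc, PR, invariant, convention, flow) is
// treated as non-code here; that grouping is an assumption.
func isCodeKindSketch(k ChunkKind) bool {
	switch k {
	case ChunkSymbol, ChunkFunctionSplit, ChunkFileHeader:
		return true
	}
	return false
}
```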

type Citation

type Citation struct {
	File       string `json:"file"`
	StartLine  int    `json:"start_line"`
	EndLine    int    `json:"end_line"`
	CommitHash string `json:"commit_hash"`
}

Citation is the {file, start_line, end_line, commit_hash} tuple CKV attaches to every chunk and every search hit. CKG uses the same shape, so hybrid responses can be merged without translation.

type Embedder

type Embedder interface {
	Identity() EmbeddingIdentity
	Name() string
	Dimension() int
	MaxInputTokens() int
	Embed(ctx context.Context, batch []string) ([][]float32, error)
}

Embedder turns text into a fixed-dimension vector. Implementations:

  • internal/embed/mock — deterministic hash-based, for tests
  • internal/embed/bgeonnx — ONNX-backed local embedder; supports a model registry (see model_config.go), currently bge-large-en-v1.5 by default.
  • pkg/embed/ollama — Ollama HTTP API backend.

Embedder interface contract:

  • Identity reports the embedding space (provider/model/dim/pooling/normalization). Every backend implements it from its own model definition, so a new model or provider conforms to the same contract and gets index-compatibility enforcement (query.Open) for free. Name() and Dimension() are kept for convenience and MUST agree with Identity().Model and Identity().Dim.
  • Name returns a stable identifier persisted in the manifest (e.g. "bge-large-en-v1.5"). Mismatch on rebuild → IndexUnavailable.
  • Dimension is the vector length. Used to size the sqlite-vec column.
  • MaxInputTokens is the model's context limit; the chunker truncates overlong text up front (signature stays at the head).
  • Embed is batched. Implementations choose internal batching (CPU≈32, GPU≈256) but the caller MAY pass arbitrary-size slices.
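To make the contract concrete, here is a minimal sketch of a deterministic hash-based embedder in the spirit of the mock backend. It is not the code in internal/embed/mock: hashEmbedderSketch, embedTexts, the model name "hash-sketch" and the 512-token limit are all hypothetical.

```go
package main

import (
	"context"
	"crypto/sha256"
	"encoding/binary"
	"math"
)

// EmbeddingIdentity and Embedder are copied from the declarations in this package.
type EmbeddingIdentity struct {
	Provider, Model    string
	Dim                int
	Pooling, Normalize string
}

type Embedder interface {
	Identity() EmbeddingIdentity
	Name() string
	Dimension() int
	MaxInputTokens() int
	Embed(ctx context.Context, batch []string) ([][]float32, error)
}

// hashEmbedderSketch is a hypothetical embedder: same text in, same vector out.
type hashEmbedderSketch struct{ dim int }

func (e hashEmbedderSketch) Identity() EmbeddingIdentity {
	return EmbeddingIdentity{Provider: "mock", Model: "hash-sketch", Dim: e.dim, Normalize: "l2"}
}

// Name and Dimension are derived from Identity so they cannot disagree with it.
func (e hashEmbedderSketch) Name() string        { return e.Identity().Model }
func (e hashEmbedderSketch) Dimension() int      { return e.Identity().Dim }
func (e hashEmbedderSketch) MaxInputTokens() int { return 512 } // arbitrary value

// Embed accepts a batch of any size and returns one L2-normalized vector per text.
func (e hashEmbedderSketch) Embed(ctx context.Context, batch []string) ([][]float32, error) {
	out := make([][]float32, len(batch))
	for i, text := range batch {
		if err := ctx.Err(); err != nil {
			return nil, err
		}
		vec := make([]float32, e.dim)
		var sumSquares float64
		for j := range vec {
			// Component j comes from sha256(text || j), mapped into [-0.5, 0.5].
			var idx [4]byte
			binary.BigEndian.PutUint32(idx[:], uint32(j))
			sum := sha256.Sum256(append([]byte(text), idx[:]...))
			v := float64(binary.BigEndian.Uint32(sum[:4]))/float64(math.MaxUint32) - 0.5
			vec[j] = float32(v)
			sumSquares += v * v
		}
		if norm := math.Sqrt(sumSquares); norm > 0 {
			for j := range vec {
				vec[j] = float32(float64(vec[j]) / norm)
			}
		}
		out[i] = vec
	}
	return out, nil
}

// Compile-time check that the sketch satisfies the interface.
var _ Embedder = hashEmbedderSketch{}

// embedTexts is a small convenience wrapper used to exercise any Embedder.
func embedTexts(e Embedder, texts ...string) ([][]float32, error) {
	return e.Embed(context.Background(), texts)
}
```

Deriving Name and Dimension from Identity is one simple way to honor the "MUST agree" rule in the contract.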

type EmbeddingIdentity

type EmbeddingIdentity struct {
	Provider  string // backend that produced the vectors, e.g. "ollama", "bgeonnx", "mock"
	Model     string // model name, e.g. "bge-m3"
	Dim       int    // vector dimension
	Pooling   string // "cls" | "mean" | "last_token"; "" when the backend does not expose it
	Normalize string // "l2" | "none"; "" when unknown
}

EmbeddingIdentity describes the vector space an Embedder produces. It is model-agnostic: each embedder fills it from its own configuration (e.g. a model registry), so adding or swapping an embedding model needs no change here — the identity flows from the model definition.

func (EmbeddingIdentity) Checksum

func (id EmbeddingIdentity) Checksum() string

Checksum is a stable identity string for the embedding space. Two embedders that produce comparable vectors yield the same Checksum; any difference (provider, model, dim, pooling, normalization) yields a different one. It is recorded in the manifest at build time and compared on Open so a silently-incompatible index/embedder pair (e.g. Ollama bge-m3 vs ONNX bge-m3) is rejected with a reindex hint instead of returning meaningless similarity scores.
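One plausible way to derive such a string is to hash the five identity fields with an unambiguous separator. The sketch below is an assumption about the approach, not the actual Checksum implementation; checksumSketch and compatibleSketch are hypothetical names, and the dimension values in any usage are illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strconv"
	"strings"
)

// EmbeddingIdentity is copied from the declaration above.
type EmbeddingIdentity struct {
	Provider  string
	Model     string
	Dim       int
	Pooling   string
	Normalize string
}

// checksumSketch hashes all identity fields. The NUL separator keeps
// ("ab", "c") and ("a", "bc") from colliding.
func checksumSketch(id EmbeddingIdentity) string {
	fields := []string{id.Provider, id.Model, strconv.Itoa(id.Dim), id.Pooling, id.Normalize}
	sum := sha256.Sum256([]byte(strings.Join(fields, "\x00")))
	return hex.EncodeToString(sum[:])
}

// compatibleSketch compares the checksum recorded in a manifest at build
// time with the identity of the embedder supplied when the index is opened.
func compatibleSketch(manifestChecksum string, id EmbeddingIdentity) bool {
	return manifestChecksum == checksumSketch(id)
}
```

Under this scheme, two identities that differ only in Provider (for example the same model name served by two backends) produce different checksums, so the pair would be rejected on open.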

type EnforcePoint

type EnforcePoint struct {
	Flow string `json:"flow"`
	Step string `json:"step"`
	Loc  string `json:"loc"`
}

EnforcePoint records where a curated invariant is enforced: a step in a flow at a code location. Serialized into the enforced_at column.

type Filter

type Filter struct {
	Language    string       `json:"language,omitempty"`
	PathGlob    string       `json:"path,omitempty"`
	SymbolKinds []SymbolKind `json:"symbol_kinds,omitempty"`
	CommitHash  string       `json:"commit_hash,omitempty"`
}

Filter narrows a vector search by metadata. All fields are optional; an empty field is treated as "any". Filters are AND-combined.

Filter fields:

  • Language: "go" | "typescript" | "solidity" | "markdown"
  • PathGlob: filepath.Match-style glob (single-star; doublestar planned)
  • SymbolKinds: e.g. {Function, Method}
  • CommitHash: pin to a specific historical commit's chunks

func (Filter) IsZero

func (f Filter) IsZero() bool

IsZero reports whether the filter would match every chunk. Used by store implementations to skip the post-filter step entirely on the hot path.

func (Filter) Matches

func (f Filter) Matches(c Chunk) bool

Matches reports whether c satisfies every set field of f. Implemented here so both the store layer (post-filter) and the query layer (sanity check) share one definition.

NOTE: PathGlob uses filepath.Match semantics (single-star, no "**").
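The AND-combined, empty-means-any semantics can be sketched as below. This is an illustration under stated assumptions rather than the package's code: the types are trimmed to the fields the filter reads, the method names isZeroSketch and matchesSketch are hypothetical, and a malformed glob is assumed to match nothing.

```go
package main

import "path/filepath"

type SymbolKind string

// Chunk is trimmed to the fields the filter inspects.
type Chunk struct {
	File       string
	Language   string
	SymbolKind SymbolKind
	CommitHash string
}

type Filter struct {
	Language    string
	PathGlob    string
	SymbolKinds []SymbolKind
	CommitHash  string
}

// isZeroSketch reports whether no field is set, i.e. every chunk matches.
func (f Filter) isZeroSketch() bool {
	return f.Language == "" && f.PathGlob == "" && len(f.SymbolKinds) == 0 && f.CommitHash == ""
}

// matchesSketch checks every set field; unset fields are skipped.
func (f Filter) matchesSketch(c Chunk) bool {
	if f.Language != "" && f.Language != c.Language {
		return false
	}
	if f.PathGlob != "" {
		// Assumption: a malformed pattern is treated as a non-match.
		ok, err := filepath.Match(f.PathGlob, c.File)
		if err != nil || !ok {
			return false
		}
	}
	if len(f.SymbolKinds) > 0 {
		found := false
		for _, k := range f.SymbolKinds {
			if k == c.SymbolKind {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	if f.CommitHash != "" && f.CommitHash != c.CommitHash {
		return false
	}
	return true
}
```

With single-star semantics, a pattern such as "pkg/*/*.go" matches "pkg/types/chunk.go" but not a file nested one directory deeper, because "*" does not cross a path separator.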

type FlowSpineMeta

type FlowSpineMeta struct {
	FlowID     string   `json:"flow_id"`
	EntryPoint string   `json:"entry_point,omitempty"`
	Trigger    string   `json:"trigger,omitempty"`
	RootSymbol string   `json:"root_symbol,omitempty"`
	Links      []string `json:"links,omitempty"`
	CalledBy   []string `json:"called_by,omitempty"`
}

FlowSpineMeta is the structured metadata for a ChunkFlowSpine chunk: a flow's entry point, what triggers it, and how it links to other flows. Serialized into the flow_meta column (populated in Phase B).

type FlowStepMeta

type FlowStepMeta struct {
	FlowID     string   `json:"flow_id"`
	StepID     string   `json:"step_id"`
	Symbol     string   `json:"symbol,omitempty"`
	Kind       string   `json:"kind,omitempty"`
	Calls      []string `json:"calls,omitempty"`
	Reads      string   `json:"reads,omitempty"`
	Writes     string   `json:"writes,omitempty"`
	Emits      string   `json:"emits,omitempty"`
	Branches   []Branch `json:"branches,omitempty"`
	Invariants []string `json:"invariants,omitempty"`
}

FlowStepMeta is the structured metadata for a ChunkFlowStep chunk: the symbol the step runs at, the symbols it calls, what it reads/writes/emits, its conditional branches, and the invariant ids it must uphold. Serialized into the flow_meta column (populated in Phase B).
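For orientation, the sketch below shows how a step with one branch would serialize under the struct tags above. The field values are invented for illustration, encodeStepSketch is a hypothetical helper, and how the store actually writes the flow_meta column is not shown here.

```go
package main

import "encoding/json"

// Branch and FlowStepMeta are copied from the declarations in this package.
type Branch struct {
	When string `json:"when"`
	Then string `json:"then"`
	At   string `json:"at"`
}

type FlowStepMeta struct {
	FlowID     string   `json:"flow_id"`
	StepID     string   `json:"step_id"`
	Symbol     string   `json:"symbol,omitempty"`
	Kind       string   `json:"kind,omitempty"`
	Calls      []string `json:"calls,omitempty"`
	Reads      string   `json:"reads,omitempty"`
	Writes     string   `json:"writes,omitempty"`
	Emits      string   `json:"emits,omitempty"`
	Branches   []Branch `json:"branches,omitempty"`
	Invariants []string `json:"invariants,omitempty"`
}

// encodeStepSketch returns the JSON form of a step; empty optional fields
// are dropped because of the omitempty tags.
func encodeStepSketch(m FlowStepMeta) (string, error) {
	b, err := json.Marshal(m)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

A step with only FlowID, StepID, Symbol and one Branch set encodes to an object with exactly those four keys, in struct order.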

type Hit

type Hit struct {
	Chunk Chunk    `json:"chunk"`
	Score HitScore `json:"score"`
	// StaleCitation is set by the citation-enforcement step when the
	// chunk's recorded commit_hash differs from the source tree's
	// current git HEAD. The hit is still returned — the file usually
	// still has useful content at a different commit — but downstream
	// consumers can warn the user or downgrade the snippet shape.
	StaleCitation bool `json:"stale_citation,omitempty"`
}

Hit is a single search result. Score values are normalized so callers can compare across backends; raw distance is preserved for RRF input.
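The StaleCitation rule described in the struct comment amounts to a commit comparison per hit. A minimal sketch, with trimmed types and a hypothetical helper name markStaleSketch; how the current HEAD is obtained is left out:

```go
package main

// Chunk and Hit are trimmed to the fields this sketch touches.
type Chunk struct {
	CommitHash string
}

type Hit struct {
	Chunk         Chunk
	StaleCitation bool
}

// markStaleSketch flags every hit whose recorded commit differs from head.
// Hits are kept either way; only the flag changes.
func markStaleSketch(hits []Hit, head string) {
	for i := range hits {
		hits[i].StaleCitation = hits[i].Chunk.CommitHash != head
	}
}
```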

type HitScore

type HitScore struct {
	Normalized     float64 `json:"normalized"`            // 1 - distance/2, in [0,1]
	VectorDistance float64 `json:"vector_distance"`       // raw cosine distance, in [0,2]
	VectorRank     int     `json:"vector_rank"`           // 1-based within this query's vector hits
	BM25Score      float64 `json:"bm25_score,omitempty"`  // candidate-set BM25, 0 when rerank disabled or no token match
	HybridRank     int     `json:"hybrid_rank,omitempty"` // 1-based position after RRF fusion; 0 when rerank disabled
}

HitScore exposes both the normalized score (higher = better, range [0,1]) and the raw cosine distance (lower = better, range [0,2]). The upstream RRF fuser consumes VectorRank; lower-layer query callers display Normalized.

BM25Score and HybridRank are omitempty fields for the optional BM25 rerank pass. They stay zero (and absent from JSON) when Options.EnableBM25Rerank is off, preserving the schema for callers that haven't opted in.
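The score relationships can be sketched in a few lines. The normalization follows the field comment (1 - distance/2); the fusion term is the standard reciprocal rank fusion formula 1/(k + rank), which is an assumption about what the fuser computes. Both function names are hypothetical, and k = 60 is a conventional choice rather than a value taken from this package.

```go
package main

// normalizedFromDistanceSketch maps a cosine distance in [0,2] to a score
// in [0,1], clamping against floating-point drift outside the range.
func normalizedFromDistanceSketch(distance float64) float64 {
	n := 1 - distance/2
	if n < 0 {
		return 0
	}
	if n > 1 {
		return 1
	}
	return n
}

// rrfScoreSketch sums 1/(k + rank) over the 1-based ranks a chunk received
// from each ranker. A rank of 0 means "not ranked" and contributes nothing.
func rrfScoreSketch(k float64, ranks ...int) float64 {
	var score float64
	for _, r := range ranks {
		if r > 0 {
			score += 1 / (k + float64(r))
		}
	}
	return score
}
```

For example, a distance of 1 (orthogonal vectors) normalizes to 0.5, and a chunk ranked first and second by two rankers fuses higher than one ranked third and fourth.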

type InvariantRef

type InvariantRef struct {
	ChunkID string        `json:"chunk_id"`         // ID of the ChunkInvariant chunk
	Tier    InvariantTier `json:"tier"`             // 1, 2, or 3
	Marker  string        `json:"marker,omitempty"` // e.g. "CRITICAL", "panic"
}

InvariantRef is a back-pointer attached to a source Chunk pointing at the ChunkInvariant(s) extracted from inside or near it. Kept small so adding it to every chunk does not balloon storage.

type InvariantTier

type InvariantTier int

InvariantTier classifies how an invariant was detected.

Tier 1 — existing marker (// CRITICAL, // IMPORTANT, // WARNING, // Deprecated:)
Tier 2 — new convention marker (// INVARIANT:, // CONSENSUS:, // SECURITY:)
Tier 3 — heuristic (panic(...) / fmt.Errorf(...) with policy keywords)

Lower tiers carry higher confidence; the agent can filter by tier when noise tolerance is low (e.g. only tier 1+2 during a release).

const (
	InvariantTierExistingMarker InvariantTier = 1
	InvariantTierNewMarker      InvariantTier = 2
	InvariantTierHeuristic      InvariantTier = 3
)
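Filtering by tier, as in the "only tier 1+2 during a release" example, could look like the sketch below. The helper name filterByMaxTierSketch is hypothetical.

```go
package main

// InvariantTier, its constants and InvariantRef are copied from this package.
type InvariantTier int

const (
	InvariantTierExistingMarker InvariantTier = 1
	InvariantTierNewMarker      InvariantTier = 2
	InvariantTierHeuristic      InvariantTier = 3
)

type InvariantRef struct {
	ChunkID string
	Tier    InvariantTier
	Marker  string
}

// filterByMaxTierSketch keeps refs whose tier is at or below maxTier.
// Lower tier numbers mean higher confidence, so a lower maxTier is stricter.
func filterByMaxTierSketch(refs []InvariantRef, maxTier InvariantTier) []InvariantRef {
	var kept []InvariantRef
	for _, r := range refs {
		if r.Tier <= maxTier {
			kept = append(kept, r)
		}
	}
	return kept
}
```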

type ModificationGuidance

type ModificationGuidance struct {
	AlsoReview    []string `json:"also_review,omitempty"`    // other categories/files to inspect together
	RequiredTests []string `json:"required_tests,omitempty"` // test suites the change should exercise
	WatchOut      []string `json:"watch_out,omitempty"`      // pitfalls / hard-fork / byzantine risks
}

ModificationGuidance is project-policy advice attached to a chunk by the policy loader. It surfaces "if you touch this code, here is what else to consider" hints derived from the chunk's path category (e.g. consensus, state, p2p). All fields may be empty.

Guidance is informative, not enforcement. A nil pointer means the chunk's path did not match any policy rule.

type PRRef

type PRRef struct {
	Number      int    `json:"number"`
	Title       string `json:"title"`
	MergedAtUTC string `json:"merged_at_utc,omitempty"`
}

PRRef records a PR that touched a chunk's file or symbol. Stored as JSON in the recent_prs column; the temporal slicing key (MergedAtUTC) lets query-time filtering exclude PRs merged after a cutoff.
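Query-time temporal slicing might be sketched as follows. The timestamp format is not specified above, so RFC 3339 is an assumption, as is the choice to drop entries with a missing or unparsable MergedAtUTC; prsMergedByCutoffSketch is a hypothetical name.

```go
package main

import "time"

// PRRef is copied from the declaration above.
type PRRef struct {
	Number      int    `json:"number"`
	Title       string `json:"title"`
	MergedAtUTC string `json:"merged_at_utc,omitempty"`
}

// prsMergedByCutoffSketch keeps PRs merged at or before cutoffUTC.
// Both timestamps are assumed to be RFC 3339 strings.
func prsMergedByCutoffSketch(prs []PRRef, cutoffUTC string) ([]PRRef, error) {
	cutoff, err := time.Parse(time.RFC3339, cutoffUTC)
	if err != nil {
		return nil, err
	}
	var kept []PRRef
	for _, pr := range prs {
		merged, err := time.Parse(time.RFC3339, pr.MergedAtUTC)
		if err != nil {
			continue // assumption: unknown merge time is excluded
		}
		if !merged.After(cutoff) {
			kept = append(kept, pr)
		}
	}
	return kept, nil
}
```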

type Stats

type Stats struct {
	ChunkCount     int    `json:"chunk_count"`
	EmbeddingModel string `json:"embedding_model"`
	EmbeddingDim   int    `json:"embedding_dim"`
	IndexedHead    string `json:"indexed_head"`
	BuiltAt        string `json:"built_at"`
	SchemaVersion  string `json:"schema_version"`
}

Stats reports index health. Returned by VectorStore.Stats and surfaced via the MCP `cks.ops.health` tool.

type SymbolKind

type SymbolKind string

SymbolKind enumerates the AST node kinds CKV chunks against. Stored as a plain string for forward-compatibility with new languages.

const (
	KindFunction   SymbolKind = "Function"
	KindMethod     SymbolKind = "Method"
	KindType       SymbolKind = "Type"
	KindStruct     SymbolKind = "Struct"
	KindInterface  SymbolKind = "Interface"
	KindContract   SymbolKind = "Contract" // Solidity
	KindEvent      SymbolKind = "Event"    // Solidity (TBD)
	KindModifier   SymbolKind = "Modifier" // Solidity (TBD)
	KindFileHeader SymbolKind = "FileHeader"
	// Markdown indexing kinds.
	// Each heading-bounded section in a *.md / *.markdown file becomes one
	// SymbolSpan; the chunker emits a chunk per span so "why was X decided" style
	// queries can hit a specific decision section.
	KindDocSection SymbolKind = "DocSection" // markdown heading section
	KindADRSection SymbolKind = "ADRSection" // ADR-* / docs/adr/* markdown sections
)

type VectorStore

type VectorStore interface {
	// Upsert inserts or replaces chunks keyed by Chunk.ID. The vector is
	// derived from chunk.Text via the configured Embedder before calling.
	// Note the (chunk, embedding) pairing is positional and equal-length.
	Upsert(ctx context.Context, chunks []Chunk, embeddings [][]float32) error

	// DeleteByFile removes every chunk whose File equals path. Used by
	// the incremental indexer and by the file-rename safety path.
	DeleteByFile(ctx context.Context, path string) error

	// Search returns the top-k nearest chunks under cosine distance,
	// post-filtered by `filter`. k is the desired result count; the
	// implementation may over-fetch (e.g. 3*k) for re-rank head-room.
	Search(ctx context.Context, query []float32, k int, filter Filter) ([]Hit, error)

	// Stats reports indexed counts and the embedding model identity
	// stored at build time. Cheap (single SQL roundtrip).
	Stats(ctx context.Context) (Stats, error)

	// Close releases the backing handle. Idempotent.
	Close() error
}

VectorStore is the persistence + ANN search surface. Implementations:

  • internal/store/sqlitevec — SQLite + vec0 virtual table
  • internal/store/memory — in-RAM map (tests + dev loop)

All methods are safe to call from a single goroutine; concurrent callers must serialize themselves (the indexer pipeline is sequential per file).
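A brute-force in-RAM store gives a feel for the Upsert and Search halves of the interface. The sketch below is not the internal/store/memory implementation: memStoreSketch and searchDemoSketch are hypothetical, the types are trimmed, the Filter argument and the remaining methods are omitted, and the mutex is just one way to provide the serialization that callers are otherwise responsible for.

```go
package main

import (
	"context"
	"errors"
	"math"
	"sort"
	"sync"
)

// Chunk and Hit are trimmed to the fields this sketch needs.
type Chunk struct{ ID, File string }

type Hit struct {
	Chunk          Chunk
	VectorDistance float64 // raw cosine distance, in [0,2]
}

// memStoreSketch keeps chunks and vectors in maps keyed by Chunk.ID.
type memStoreSketch struct {
	mu     sync.Mutex
	chunks map[string]Chunk
	vecs   map[string][]float32
}

func newMemStoreSketch() *memStoreSketch {
	return &memStoreSketch{chunks: map[string]Chunk{}, vecs: map[string][]float32{}}
}

// Upsert pairs chunks[i] with embeddings[i] and replaces entries sharing an ID.
func (s *memStoreSketch) Upsert(ctx context.Context, chunks []Chunk, embeddings [][]float32) error {
	if len(chunks) != len(embeddings) {
		return errors.New("chunks and embeddings must have equal length")
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	for i, c := range chunks {
		s.chunks[c.ID] = c
		s.vecs[c.ID] = embeddings[i]
	}
	return nil
}

// cosineDistance returns 1 - cos(a, b), or the maximum distance 2 when the
// inputs cannot be compared (length mismatch or a zero vector).
func cosineDistance(a, b []float32) float64 {
	if len(a) != len(b) {
		return 2
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 2
	}
	return 1 - dot/math.Sqrt(na*nb)
}

// Search scans every stored vector and returns the k nearest by cosine distance.
func (s *memStoreSketch) Search(ctx context.Context, query []float32, k int) ([]Hit, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	hits := make([]Hit, 0, len(s.chunks))
	for id, c := range s.chunks {
		hits = append(hits, Hit{Chunk: c, VectorDistance: cosineDistance(query, s.vecs[id])})
	}
	sort.Slice(hits, func(i, j int) bool { return hits[i].VectorDistance < hits[j].VectorDistance })
	if k >= 0 && len(hits) > k {
		hits = hits[:k]
	}
	return hits, nil
}

// searchDemoSketch upserts three made-up chunks and returns the two nearest IDs.
func searchDemoSketch() ([]string, error) {
	ctx := context.Background()
	s := newMemStoreSketch()
	chunks := []Chunk{{ID: "a", File: "x.go"}, {ID: "b", File: "y.go"}, {ID: "c", File: "z.go"}}
	if err := s.Upsert(ctx, chunks, [][]float32{{1, 0}, {0, 1}, {-1, 0}}); err != nil {
		return nil, err
	}
	hits, err := s.Search(ctx, []float32{1, 0}, 2)
	if err != nil {
		return nil, err
	}
	ids := make([]string, len(hits))
	for i, h := range hits {
		ids[i] = h.Chunk.ID
	}
	return ids, nil
}
```

A linear scan like this is only reasonable for tests and small corpora; it is meant to show the positional pairing and distance ordering, not an ANN index.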
