codegraph

package
v1.10.0-rc.1
Published: Jun 4, 2026 License: MIT Imports: 35 Imported by: 0

Documentation

Overview

BM25 additive scoring on top of the MinHash codegraph. Layered on the existing Jaccard ranking so both signals are returned per result and the caller (or future rank-fusion code) can reason about them independently.

BM25 addresses SPEC §8.2 Issue #2 (ubiquitous noise) at the scoring level rather than the token-filtering level: common tokens like "get" or "error" get low IDF weight and contribute little to the score even when they're shared between query and symbol. Rare tokens like "postgresql" or "kubernetes" dominate the score when they match.

Unlike the stopwords filter, BM25 is SYMMETRIC by design: a token's IDF is the same whether it shows up in a query or a symbol, so scoring is reciprocal and doesn't need asymmetric filtering logic.

Change impact analysis — maps git diffs to affected symbols and callers.

Given a set of changed files + line ranges from git diff, identifies which symbols were directly modified and which callers are transitively affected. Produces risk-scored, priority-ordered review guidance.

LSH (Locality-Sensitive Hashing) band table for sub-linear semantic search. Implements the persisted band table approach validated in CODEGRAPH_LSH_RESEARCH.md — instead of loading all MinHash signatures and comparing exhaustively (O(N)), precompute band hashes at index time and query them at search time to retrieve a small candidate set (typically 0.1-1% of the corpus) that is then ranked by exact Jaccard similarity.

Band configuration: 64 bands × 2 rows per band from the 128-element MinHash signature. This is the empirically-derived "64×2" config from the research doc — counterintuitive from a classical LSH perspective (it produces large candidate sets) but necessary because code search operates at much lower Jaccard values (0.05-0.20) than document similarity (0.5-0.8). The conventional 16×8 and 32×4 configs produce 0% recall on code search queries.
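The effect of the band shape can be checked with the standard LSH banding estimate: two signatures with Jaccard similarity s collide in at least one of b bands of r rows with probability 1 - (1 - s^r)^b. The sketch below evaluates that estimate at s = 0.10 for the three configurations named above. The function name candidateProbability is hypothetical and not part of this package.

```go
package main

import (
	"fmt"
	"math"
)

// candidateProbability returns the standard LSH banding estimate
// 1 - (1 - s^rows)^bands: the chance that two signatures with Jaccard
// similarity s share at least one identical band.
// Hypothetical helper for illustration only.
func candidateProbability(s float64, bands, rows int) float64 {
	return 1 - math.Pow(1-math.Pow(s, float64(rows)), float64(bands))
}

func main() {
	configs := []struct{ bands, rows int }{{64, 2}, {32, 4}, {16, 8}}
	for _, c := range configs {
		p := candidateProbability(0.10, c.bands, c.rows)
		fmt.Printf("%dx%d at s=0.10: %.6f\n", c.bands, c.rows, p)
	}
}
```

At s = 0.10 the estimate is roughly 0.47 for 64×2 and well under 0.01 for 32×4 and 16×8, which is consistent with the recall behaviour described above.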

At grafana scale (77,420 symbols), this provides a 20x speedup over brute-force (89ms → 4.3ms) by eliminating the full signature load.

Multi-language tree-sitter parser. Supports Python, Rust, TypeScript, JavaScript, Go, Java, C, C++, C#, Ruby, PHP, Scala, and more via the language spec mappings in parser_ts_languages.go.

Each language requires a tree-sitter grammar Go binding. Languages without an available Go binding fall back to the regex GenericParser.

TypeScript parser backed by tree-sitter. Replaces GenericParser for .ts and .tsx files when celeste is built with CGo enabled. The regex-based GenericParser produced symbols that matched identifier shapes but could not resolve call-graph edges through TypeScript's type-aware method dispatch, leaving most TS interfaces with edgeCount=0 in the codegraph (documented in SPEC §8.2 and surfaced by the Task 19 ⚠ zero-edge warning). An AST-based parser sees the real call sites and writes the edges that were previously missing.

Scope for v2.0.0: TypeScript (.ts and .tsx) only. Python and Rust stay on the regex GenericParser for now — they aren't the validation target for this task and they have no zero-edge warnings in the bundled benchmark corpus.

CGo caveat: this file and its dependencies pull in tree-sitter's C runtime and the bundled TypeScript/TSX grammars. The //go:build cgo constraint at the top gates compilation on CGO_ENABLED=1 — when cross-building release binaries from a Linux host for darwin/windows the Go toolchain disables CGo implicitly, and the stub in parser_ts_stub.go takes over. Stub builds fall back to the regex GenericParser for TypeScript files; they still work, just without the tree-sitter edge-resolution improvement. Users who want the full experience must either build from source with CGo enabled or wait for the v2.1.0 release workflow which will cross-compile against a proper C toolchain (zig CC or matrix of native runners).

Multi-language tree-sitter AST node type mappings.

Defines which tree-sitter node types correspond to classes, functions, imports, and calls for each supported language. Used by the tree-sitter parser to extract symbols and edges from source files.

Reference: derived from code-review-graph's parser.py (MIT licensed, github.com/tirth8205/code-review-graph) and adapted for celeste-cli's codegraph symbol model.

Structural rerank layer.

After the Jaccard + BM25 fused ranking produces a preliminary order, this layer applies a scalar rescore that incorporates features the fusion doesn't see: how many query tokens actually matched the symbol's shingle set, how well-connected the symbol is in the call graph, and what KIND of symbol it is. The goal is to bubble the "obviously relevant" results above close-score ties where the fused order alone is ambiguous.

Pure Go, zero dependencies, zero cloud calls. All signals come from fields already present on SearchResult, so this layer has no effect on indexing latency and only a tiny reorder cost at query time.

The Reranker interface below is a deliberate seam: the current StructuralReranker is hand-tuned feature engineering, but a future EmbeddingReranker (local llama.cpp bridge, xAI embeddings endpoint, ONNX sentence-transformers, ...) can drop in at the same call site without touching the search pipeline.

Graph snapshot and diff — tracks what changed between index builds.

A snapshot captures the set of symbol names + kinds and edge pairs at a point in time (keyed by git commit SHA). Diffing two snapshots reveals added/removed symbols and edges, enabling blast-radius analysis for code review.
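As a rough illustration of the diff step, a snapshot can be treated as two sets of keys (symbols and edges) and compared with plain set differences. The key encodings ("name|kind") and the helper name diffKeys are assumptions made for this sketch; the package's actual Snapshot layout is not shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// diffKeys reports which keys appear only in after (added) and only in
// before (removed). Keys are hypothetical encodings such as "name|kind"
// for symbols or "source->target" for edges.
func diffKeys(before, after map[string]bool) (added, removed []string) {
	for k := range after {
		if !before[k] {
			added = append(added, k)
		}
	}
	for k := range before {
		if !after[k] {
			removed = append(removed, k)
		}
	}
	// Sort so the output is deterministic despite map iteration order.
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

func main() {
	before := map[string]bool{"Parse|function": true, "Store|struct": true}
	after := map[string]bool{"Parse|function": true, "Indexer|struct": true}
	added, removed := diffKeys(before, after)
	fmt.Println("added:", added, "removed:", removed)
}
```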

Stopwords runtime integration — embeds stopwords.json at build time and exposes parsed lookup sets for ShinglesForSymbol and SemanticSearch to consume.

The embedded file is produced by celeste-stopwords and licensed under CC BY 4.0 — see stopwords_NOTICE.md for attribution.

Store methods for graph snapshots and change impact analysis.

Index

Constants

const (
	// WarnDemotedTest — result was demoted because its path matches a
	// test directory or test filename suffix.
	WarnDemotedTest = "demoted: test path"

	// WarnDemotedMock — result was demoted because its path is in a
	// mocks/, fixtures/, or stubs/ directory.
	WarnDemotedMock = "demoted: mock path"

	// WarnDemotedDeclaration — result was demoted because the symbol
	// lives in a .d.ts / .d.mts declaration-only file. Useful for TS
	// consumers: declaration files describe API surfaces but have no
	// runtime code, so a matching declaration probably isn't what you
	// want when looking for implementations.
	WarnDemotedDeclaration = "demoted: declaration-only file"

	// WarnDemotedVendored — result was demoted because the symbol is
	// in a vendored third-party directory (vendor/, node_modules/,
	// third_party/).
	WarnDemotedVendored = "demoted: vendored code"

	// WarnDemotedGenerated — result was demoted because the symbol is
	// in a build-output directory (dist/, build/, .next/, target/).
	WarnDemotedGenerated = "demoted: generated code"

	// WarnZeroEdge — the symbol has zero incoming AND zero outgoing
	// edges in the code graph. Two possible interpretations:
	//   1. Genuine dead code. Nothing calls it, it calls nothing.
	//   2. Parser limitation. The regex parser for TS/Python/Rust
	//      cannot resolve many call sites and edges for non-Go
	//      languages are systematically undercounted. An LLM should
	//      NOT conclude "dead code" from this warning alone — verify
	//      by reading the file.
	// SPEC §8.2 Issue #2 documents this ambiguity.
	WarnZeroEdge = "zero edges — may be dead code or parser limitation"

	// WarnLowConfidence — Jaccard similarity is below 0.10. Results
	// at this tier are right at the signal/noise boundary for MinHash
	// with 128 hash functions (pairwise-independent FNV variant). An
	// LLM should treat these as "maybe relevant" not "definitely
	// relevant" and verify by reading the source.
	WarnLowConfidence = "low confidence (jaccard < 0.10)"

	// WarnDeclarationOnlyType — symbol is a pure type/interface with
	// no body and zero edges. Common in TS type declaration files
	// and Go interface-only types. Probably not runtime code the
	// user wants to find.
	WarnDeclarationOnlyType = "type/interface declaration without references"
)

Confidence warning constants. These strings are stable across releases because callers (LLM tool users, UIs, scripts) may match on them directly. Add new ones freely but do NOT rename or remove existing ones without a version bump.

const DefaultNumHashes = 128

DefaultNumHashes is the number of hash functions used for MinHash signatures. 128 provides good accuracy with sub-10ms query time for 50k symbols.

Variables

This section is empty.

Functions

func BytesToSeeds added in v1.9.0

func BytesToSeeds(data []byte) ([]uint64, error)

BytesToSeeds deserializes a byte slice back into a seed slice. Returns an error if the length is not a multiple of 8.

func ComputeBM25Score added in v1.9.0

func ComputeBM25Score(queryTokens []string, docTokens map[string]int, docLength int, idf map[string]float64, avgDocLen float64) float64

ComputeBM25Score computes the BM25 score for a single symbol against a query.

  • queryTokens: deduplicated lowercase tokens from the query
  • docTokens: map[token] = term frequency (TF) for this symbol
  • docLength: total token count for the symbol
  • idf: map[token] = precomputed IDF for each query token
  • avgDocLen: average doc length across the corpus

Pure function, no store access. Callers resolve the inputs from stored data first, then call this repeatedly across the candidate set.
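A minimal sketch of how a function with this signature might compute the score, using the textbook BM25 term formula. The constants k1 = 1.2 and b = 0.75 are common defaults assumed here; the values this package actually uses are not documented in this section, and bm25Score is a hypothetical stand-in rather than the package's implementation.

```go
package main

import "fmt"

// Assumed BM25 parameters (common textbook defaults, not confirmed
// for this package).
const (
	k1 = 1.2
	b  = 0.75
)

// bm25Score mirrors the documented ComputeBM25Score signature and sums
// idf * tf*(k1+1) / (tf + k1*(1 - b + b*docLength/avgDocLen)) over the
// query tokens present in the document.
func bm25Score(queryTokens []string, docTokens map[string]int, docLength int, idf map[string]float64, avgDocLen float64) float64 {
	if avgDocLen <= 0 {
		return 0
	}
	norm := k1 * (1 - b + b*float64(docLength)/avgDocLen)
	score := 0.0
	for _, tok := range queryTokens {
		tf := float64(docTokens[tok])
		if tf == 0 {
			continue // token absent from this symbol
		}
		score += idf[tok] * tf * (k1 + 1) / (tf + norm)
	}
	return score
}

func main() {
	// Illustrative IDF values: a ubiquitous token versus a rare one.
	idf := map[string]float64{"get": 0.1, "postgresql": 6.0}
	doc := map[string]int{"get": 3, "postgresql": 1}
	fmt.Println("common token:", bm25Score([]string{"get"}, doc, 10, idf, 10))
	fmt.Println("rare token:  ", bm25Score([]string{"postgresql"}, doc, 10, idf, 10))
}
```

Even though "get" occurs three times in the example symbol, its low IDF keeps its contribution far below that of a single "postgresql" match.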

func ComputeBandHashes added in v1.10.0

func ComputeBandHashes(sig MinHashSignature) []uint64

ComputeBandHashes splits a 128-element MinHash signature into 64 bands of 2 elements each and hashes each band to a single uint64. The band hash is computed by XORing the two elements with a band-specific salt — cheap and effective for LSH purposes where we only need the property that identical input bands produce identical hashes.

Returns exactly lshNumBands (64) hashes. Panics if len(sig) != lshNumBands * lshBandSize — callers must pass a full signature.
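The banding step can be sketched as follows. The 64×2 split, the XOR combination, and the panic on a wrong-length signature come from the description above; the way the band-specific salt is derived (band index times an odd constant) is an assumption, and bandHashes is a hypothetical name.

```go
package main

import "fmt"

const (
	numBands = 64 // matches the documented lshNumBands
	bandSize = 2  // rows per band
)

// bandHashes XORs the two elements of each band with a per-band salt.
// The salt derivation below is assumed for illustration.
func bandHashes(sig []uint64) []uint64 {
	if len(sig) != numBands*bandSize {
		panic("bandHashes: signature must have numBands*bandSize elements")
	}
	out := make([]uint64, numBands)
	for band := 0; band < numBands; band++ {
		salt := uint64(band+1) * 0x9E3779B97F4A7C15 // assumed salt scheme
		out[band] = sig[band*bandSize] ^ sig[band*bandSize+1] ^ salt
	}
	return out
}

func main() {
	sig := make([]uint64, numBands*bandSize)
	for i := range sig {
		sig[i] = uint64(i) * 2654435761 // arbitrary demo values
	}
	hashes := bandHashes(sig)
	fmt.Println("bands:", len(hashes))
}
```

Identical bands always hash identically, which is the only property the candidate lookup needs; the salt keeps equal element pairs in different bands from colliding with each other.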

func ComputeFusedRanking added in v1.9.0

func ComputeFusedRanking(jaccardRanks, bm25Ranks map[int64]int) []int64

ComputeFusedRanking combines two ranked lists into a single ranking using Reciprocal Rank Fusion: each entry's fused score is the sum of 1/(k + rank) across the lists it appears in. Higher fused score = better overall rank.

jaccardRanks and bm25Ranks each map a symbol ID to its position (1-indexed) in the corresponding list. A symbol absent from a list contributes nothing for that list.

This is deliberately not a method on any type — it's a pure function over data structures so it's trivial to test in isolation.
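A small sketch of Reciprocal Rank Fusion over two rank maps. The constant k = 60 is the value commonly used for RRF and is assumed here; the tie-break by ascending ID is also an assumption added to make the output deterministic. fuseRanks is a hypothetical name.

```go
package main

import (
	"fmt"
	"sort"
)

const rrfK = 60.0 // assumed RRF constant; not confirmed for this package

// fuseRanks sums 1/(k + rank) across both lists and returns IDs ordered
// by descending fused score.
func fuseRanks(jaccardRanks, bm25Ranks map[int64]int) []int64 {
	scores := make(map[int64]float64)
	for id, rank := range jaccardRanks {
		scores[id] += 1 / (rrfK + float64(rank))
	}
	for id, rank := range bm25Ranks {
		scores[id] += 1 / (rrfK + float64(rank))
	}
	ids := make([]int64, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j] // assumed tie-break for determinism
	})
	return ids
}

func main() {
	jaccard := map[int64]int{1: 1, 2: 2, 3: 3}
	bm25 := map[int64]int{2: 1, 3: 2} // symbol 1 is absent from this list
	fmt.Println(fuseRanks(jaccard, bm25))
}
```

In the example, symbol 1 is the top Jaccard hit but appears in only one list, so symbols 2 and 3, which appear in both, rank above it.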

func DefaultIndexPath added in v1.8.3

func DefaultIndexPath(projectRoot string) string

DefaultIndexPath returns the path to the code graph database for a project. It stores the index under ~/.celeste/projects/<hash>/codegraph.db to avoid polluting the project directory.

func DetectLanguage

func DetectLanguage(filename string) string

DetectLanguage returns the language for a file based on its extension. Returns empty string if the language is not recognized.

func DetectProjectLanguage

func DetectProjectLanguage(dir string) string

DetectProjectLanguage determines the primary language of a project by checking for manifest files in the given directory.

func FormatConfidenceLine added in v1.9.0

func FormatConfidenceLine(r SearchResult) string

FormatConfidenceLine returns a human-readable one-line summary of a SearchResult's confidence metadata, suitable for appending to CLI / tool output. Returns an empty string if there's nothing notable.

Example output:

"  ⚠ demoted: mock path; zero edges — may be dead code or parser limitation; edges=0"
"  edges=12"

func IsDemotable added in v1.9.0

func IsDemotable(flags []PathFlag) bool

IsDemotable returns true if the flag set is non-empty — i.e., at least one demotion reason applies. Pure convenience helper.

func IsIndexableFile

func IsIndexableFile(filename string) bool

IsIndexableFile returns true if the file's language has parser support.

func JaccardSimilarity

func JaccardSimilarity(a, b MinHashSignature) float64

JaccardSimilarity estimates the Jaccard similarity between two MinHash signatures. Returns a value between 0.0 (completely different) and 1.0 (identical sets).

IMPORTANT: both signatures must have been computed with the SAME hash seeds for this to produce meaningful results. Comparing signatures from different MinHashers (different seeds) yields noise.
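The usual MinHash estimator is the fraction of signature positions at which the two signatures agree. The sketch below shows that estimator; returning 0 for empty or mismatched-length inputs is an assumption, and estimateJaccard is a hypothetical name.

```go
package main

import "fmt"

// estimateJaccard returns the fraction of positions where a and b hold
// the same minimum hash. Both signatures must come from the same seeds.
func estimateJaccard(a, b []uint64) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return 0 // assumed behaviour for unusable input
	}
	matches := 0
	for i := range a {
		if a[i] == b[i] {
			matches++
		}
	}
	return float64(matches) / float64(len(a))
}

func main() {
	a := []uint64{10, 20, 30, 40}
	b := []uint64{10, 99, 30, 77}
	fmt.Println(estimateJaccard(a, b)) // two of four positions agree
}
```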

func LookupLangSpec added in v1.10.0

func LookupLangSpec(ext string) *langSpec

LookupLangSpec returns the language spec for a file extension. Returns nil if the language is not supported.

func PathFlagStrings added in v1.9.0

func PathFlagStrings(flags []PathFlag) []string

PathFlagStrings converts a []PathFlag to a []string for serialization to JSON / API responses / logs.

func SeedsToBytes added in v1.9.0

func SeedsToBytes(seeds []uint64) []byte

SeedsToBytes serializes a seed slice to bytes for persistence. Layout is little-endian uint64 × N, for a total of 8*N bytes.
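The documented layout (little-endian uint64 × N, with decoding rejecting lengths that are not a multiple of 8) can be sketched with encoding/binary. The lowercase function names are hypothetical stand-ins, not the package's own code.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// seedsToBytes writes each seed as 8 little-endian bytes.
func seedsToBytes(seeds []uint64) []byte {
	buf := make([]byte, 8*len(seeds))
	for i, s := range seeds {
		binary.LittleEndian.PutUint64(buf[i*8:], s)
	}
	return buf
}

// bytesToSeeds reverses seedsToBytes and rejects truncated input.
func bytesToSeeds(data []byte) ([]uint64, error) {
	if len(data)%8 != 0 {
		return nil, errors.New("seed data length is not a multiple of 8")
	}
	seeds := make([]uint64, len(data)/8)
	for i := range seeds {
		seeds[i] = binary.LittleEndian.Uint64(data[i*8:])
	}
	return seeds, nil
}

func main() {
	raw := seedsToBytes([]uint64{1, 0xDEADBEEF})
	back, err := bytesToSeeds(raw)
	fmt.Println(len(raw), back, err)
}
```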

func ShinglesForSymbol

func ShinglesForSymbol(sym Symbol, source []byte, lang string) []string

ShinglesForSymbol generates enriched shingles for a symbol, used as input to MinHash for semantic similarity search. Each shingle is a lowercased token derived from the symbol's name, types, body references, package, and comments.

The final token list is filtered through the embedded stopwords.json (celeste-stopwords v1.0.0, CC BY 4.0) via stopWords.Filter. The filter applies the universal set plus the per-language set identified by lang. Pass "" for lang to apply only the universal set — this is the right choice for callers that don't know the file's language.

func ShouldSkipPath

func ShouldSkipPath(path string) bool

ShouldSkipPath returns true if the path should be excluded from indexing.

func SupportedLanguage added in v1.10.0

func SupportedLanguage(ext string) string

SupportedLanguage returns the language name for a file extension, or empty string if unsupported.

Types

type BM25CorpusStats added in v1.9.0

type BM25CorpusStats struct {
	NumDocs      int
	AvgDocLength float64
}

BM25CorpusStats holds the corpus-wide statistics BM25 needs for scoring: total document count and average document length (in shingle tokens). Computed once at the end of Build() and cached via the meta table so query-time scoring is a single lookup.

type ChangedRange added in v1.10.0

type ChangedRange struct {
	File      string
	StartLine int
	EndLine   int
}

ChangedRange represents a modified line range in a file.

func ParseGitDiffRanges added in v1.10.0

func ParseGitDiffRanges(workspace string, base string) ([]ChangedRange, error)

ParseGitDiffRanges runs git diff and extracts changed line ranges per file.

type CodeSmell added in v1.8.3

type CodeSmell struct {
	Kind      CodeSmellKind `json:"kind"`
	Name      string        `json:"name"`
	File      string        `json:"file"`
	Line      int           `json:"line"`
	FuncKind  string        `json:"func_kind"`
	OutEdges  int           `json:"outgoing_edges"`
	InEdges   int           `json:"incoming_edges"`
	Score     float64       `json:"score"`
	Reason    string        `json:"reason"`
	Signature string        `json:"signature,omitempty"`
	Snippet   string        `json:"snippet,omitempty"`
}

CodeSmell represents a structurally detected code issue.

type CodeSmellKind added in v1.8.3

type CodeSmellKind string

CodeSmellKind categorizes the type of code smell detected.

const (
	SmellLazyRedirect CodeSmellKind = "LAZY_REDIRECT"
	SmellStub         CodeSmellKind = "STUB"
	SmellPlaceholder  CodeSmellKind = "PLACEHOLDER"
	SmellTodoFixme    CodeSmellKind = "TODO_FIXME"
	SmellEmptyHandler CodeSmellKind = "EMPTY_HANDLER"
	SmellHardcoded    CodeSmellKind = "HARDCODED"
)

type DiffSummary added in v1.10.0

type DiffSummary struct {
	SymbolsAdded   int `json:"symbols_added"`
	SymbolsRemoved int `json:"symbols_removed"`
	EdgesAdded     int `json:"edges_added"`
	EdgesRemoved   int `json:"edges_removed"`
}

DiffSummary holds aggregate counts for a graph diff.

type Edge

type Edge struct {
	SourceID int64
	TargetID int64
	Kind     EdgeKind
}

Edge represents a relationship between two symbols.

type EdgeKind

type EdgeKind string

EdgeKind identifies the kind of relationship between symbols.

const (
	EdgeCalls      EdgeKind = "calls"
	EdgeImports    EdgeKind = "imports"
	EdgeImplements EdgeKind = "implements"
	EdgeEmbeds     EdgeKind = "embeds"
	EdgeReferences EdgeKind = "references"
)

type FileEdge added in v1.8.3

type FileEdge struct {
	Source string
	Target string
	Count  int
}

FileEdge represents a connection between two files.

type FileRecord

type FileRecord struct {
	Path        string
	Language    string
	Size        int64
	ContentHash string
	IndexedAt   int64
}

FileRecord tracks indexed files for incremental updates.

type FunctionEdgeInfo added in v1.8.3

type FunctionEdgeInfo struct {
	Name        string
	File        string
	Line        int
	Kind        string
	Signature   string
	Decorators  string // comma-separated decorator names captured at parse time
	BaseClasses string // comma-separated base-class names of the enclosing class
	OutEdges    int
	InEdges     int
}

FunctionEdgeInfo holds a function's identity and edge counts for analysis.

type GenericParser

type GenericParser struct {
	// contains filtered or unexported fields
}

GenericParser extracts symbols from non-Go source files using regex patterns. Covers Python, JavaScript, TypeScript, and Rust. No call graph (would need tree-sitter / CGo). Focuses on declarations: functions, classes, imports.

func NewGenericParser

func NewGenericParser(language string) *GenericParser

NewGenericParser creates a parser for the given language.

func (*GenericParser) ParseFile

func (p *GenericParser) ParseFile(path string) (*ParseResult, error)

ParseFile parses a source file and extracts symbols using regex.

type GitignoreFilter added in v1.8.3

type GitignoreFilter struct {
	// contains filtered or unexported fields
}

GitignoreFilter holds compiled gitignore patterns for matching.

func LoadGitignore added in v1.8.3

func LoadGitignore(projectRoot string) *GitignoreFilter

LoadGitignore reads a .gitignore file and returns a filter. Returns nil (no filter) if the file doesn't exist or can't be read.

func (*GitignoreFilter) ShouldSkip added in v1.8.3

func (f *GitignoreFilter) ShouldSkip(relPath string, isDir bool) bool

ShouldSkip returns true if the given relative path should be ignored. isDir indicates whether the path is a directory.

type GoParser

type GoParser struct{}

GoParser extracts symbols and edges from Go source files using go/ast.

func NewGoParser

func NewGoParser() *GoParser

NewGoParser creates a new Go AST parser.

func (*GoParser) ParseFile

func (p *GoParser) ParseFile(path string) (*ParseResult, error)

ParseFile parses a single Go source file and extracts symbols and edges.

type ImpactResult added in v1.10.0

type ImpactResult struct {
	// DirectlyChanged are symbols whose line ranges overlap the diff.
	DirectlyChanged []ImpactSymbol `json:"directly_changed"`
	// AffectedCallers are symbols that call the directly changed symbols.
	AffectedCallers []ImpactSymbol `json:"affected_callers"`
	// UncoveredByTests are changed symbols with no test edges.
	UncoveredByTests []string `json:"uncovered_by_tests"`
	// Summary statistics
	Summary ImpactSummary `json:"summary"`
}

ImpactResult holds the blast radius analysis for a set of changes.

type ImpactSummary added in v1.10.0

type ImpactSummary struct {
	FilesChanged    int     `json:"files_changed"`
	SymbolsChanged  int     `json:"symbols_changed"`
	CallersAffected int     `json:"callers_affected"`
	TestGaps        int     `json:"test_gaps"`
	MaxRiskScore    float64 `json:"max_risk_score"`
}

ImpactSummary holds aggregate counts.

type ImpactSymbol added in v1.10.0

type ImpactSymbol struct {
	Name      string     `json:"name"`
	Kind      SymbolKind `json:"kind"`
	File      string     `json:"file"`
	Line      int        `json:"line"`
	RiskScore float64    `json:"risk_score"`
}

ImpactSymbol is a symbol with risk metadata.

type Indexer

type Indexer struct {
	// contains filtered or unexported fields
}

Indexer manages the code graph lifecycle: build, update, and query.

func NewIndexer

func NewIndexer(workspace, dbPath string) (*Indexer, error)

NewIndexer creates an indexer for the given workspace, using the specified SQLite database path.

Reloads the MinHasher seeds from the store's meta table if present so stored signatures remain comparable across process invocations. If no seeds are stored (fresh index or pre-v1.9.0 index), generates fresh random seeds that will be persisted on the first Build().

func NewIndexerWithStore added in v1.8.3

func NewIndexerWithStore(store *Store, workspace string) *Indexer

NewIndexerWithStore creates an indexer using an existing store. This is useful for testing where the store is set up manually. Unlike NewIndexer, it does NOT attempt to load seeds from the store; the caller is responsible for passing a store that either has no meta row yet or whose seeds are irrelevant for the test.

func (*Indexer) AnalyzeChanges added in v1.10.0

func (idx *Indexer) AnalyzeChanges(base string) (*ImpactResult, error)

AnalyzeChanges maps changed line ranges to affected symbols and their callers.

func (*Indexer) Build

func (idx *Indexer) Build() error

Build performs a full index of the workspace. Walks the file tree, parses source files, extracts symbols and edges, computes MinHash signatures, and stores everything in SQLite.

Two-pass design (fix for issue #47): all symbols are stored in the first pass so that cross-file call edges can be resolved in the second pass regardless of file processing order. Without the two-pass approach, files processed alphabetically before their callee files (e.g. a.py calling in_databricks defined in util.py) would silently drop their edges because the target symbol didn't exist in the DB yet.
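The two-pass idea can be illustrated in isolation: register every symbol first, then resolve call edges by name. The types parsedFile and edge and the function buildEdges are hypothetical simplifications (names are treated as globally unique, which a real indexer would not assume).

```go
package main

import "fmt"

// parsedFile is a hypothetical, simplified parse result.
type parsedFile struct {
	path    string
	defines []string    // symbol names declared in this file
	calls   [][2]string // {caller name, callee name}
}

type edge struct{ source, target int64 }

// buildEdges assigns IDs to all symbols in pass one, then resolves call
// edges in pass two, so file order no longer matters.
func buildEdges(files []parsedFile) []edge {
	ids := make(map[string]int64)
	var next int64 = 1
	for _, f := range files { // pass 1: store every symbol
		for _, name := range f.defines {
			if _, ok := ids[name]; !ok {
				ids[name] = next
				next++
			}
		}
	}
	var edges []edge
	for _, f := range files { // pass 2: resolve edges against all symbols
		for _, c := range f.calls {
			src, okSrc := ids[c[0]]
			dst, okDst := ids[c[1]]
			if okSrc && okDst {
				edges = append(edges, edge{src, dst})
			}
		}
	}
	return edges
}

func main() {
	files := []parsedFile{
		{path: "a.py", defines: []string{"main"}, calls: [][2]string{{"main", "in_databricks"}}},
		{path: "util.py", defines: []string{"in_databricks"}},
	}
	fmt.Println(buildEdges(files)) // the a.py edge resolves although util.py comes later
}
```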

func (*Indexer) BuildWithContext added in v1.10.0

func (idx *Indexer) BuildWithContext(ctx context.Context) error

BuildWithContext is the cancellable variant of Build. It checks ctx between files so an index build started under a tool deadline can abort instead of walking the whole repo (task 349f1f14 complement).
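A sketch of the "check ctx between files" pattern described above. The function indexFiles and its callback are hypothetical; only the use of ctx.Err() between units of work is the point.

```go
package main

import (
	"context"
	"fmt"
)

// indexFiles processes files in order and stops as soon as ctx is done.
// It returns how many files were processed before stopping.
func indexFiles(ctx context.Context, files []string, indexOne func(string) error) (int, error) {
	done := 0
	for _, f := range files {
		if err := ctx.Err(); err != nil {
			return done, err // cancelled or deadline exceeded
		}
		if err := indexOne(f); err != nil {
			return done, err
		}
		done++
	}
	return done, nil
}

// runDemo cancels the context while the second file is being indexed.
func runDemo() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	files := []string{"a.go", "b.go", "c.go"}
	return indexFiles(ctx, files, func(path string) error {
		if path == "b.go" {
			cancel() // simulate the tool deadline firing mid-build
		}
		return nil
	})
}

func main() {
	n, err := runDemo()
	fmt.Println(n, err)
}
```

Because the check happens between files, the file in progress when cancellation arrives still completes; only the remaining files are skipped.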

func (*Indexer) Close

func (idx *Indexer) Close() error

Close releases the underlying database connection and any native resources held by the tree-sitter TS parser.

func (*Indexer) FindCodeSmells added in v1.8.3

func (idx *Indexer) FindCodeSmells(kinds []CodeSmellKind, maxResults int, includeTests bool) ([]CodeSmell, error)

FindCodeSmells performs a single-pass structural analysis over all functions in the graph, detecting multiple code smell patterns simultaneously. This is more efficient than separate queries and more powerful than grep because it combines graph structure (edges, connectivity) with body analysis.

func (*Indexer) FindLazyRedirects added in v1.8.3

func (idx *Indexer) FindLazyRedirects(maxResults int, includeTests bool) ([]LazyRedirectResult, error)

FindLazyRedirects uses structural analysis to detect functions whose names imply complex behavior but whose graph structure shows they're trivially simple. This goes beyond grep-based detection by measuring the divergence between a function's semantic vocabulary (shingles) and its actual call graph connectivity.

Scoring factors:

  • Name complexity: action verbs in name suggest the function should DO work
  • Edge poverty: fewer outgoing edges = less actual work done
  • Shingle richness: domain-specific vocabulary in body that doesn't connect to edges

Returns results sorted by divergence score (highest = most suspicious).

func (*Indexer) KeywordSearch

func (idx *Indexer) KeywordSearch(query string, limit int) ([]Symbol, error)

KeywordSearch finds symbols matching a keyword query using SQL LIKE.

func (*Indexer) LatestSnapshot added in v1.10.0

func (idx *Indexer) LatestSnapshot() (*Snapshot, error)

LatestSnapshot returns the most recent snapshot, or nil if none exist.

func (*Indexer) LoadSnapshot added in v1.10.0

func (idx *Indexer) LoadSnapshot(commitSHA string) (*Snapshot, error)

LoadSnapshot retrieves a snapshot by commit SHA.

func (*Indexer) PackageGraph added in v1.8.3

func (idx *Indexer) PackageGraph() ([]PackageInfo, []PackageEdge, error)

PackageGraph returns package-level connectivity for visualization.

func (*Indexer) ProjectSummary

func (idx *Indexer) ProjectSummary() string

ProjectSummary returns a brief summary suitable for the system prompt.

func (*Indexer) SaveSnapshot added in v1.10.0

func (idx *Indexer) SaveSnapshot(snap *Snapshot) error

SaveSnapshot persists a snapshot to the store.

func (*Indexer) SemanticSearch

func (idx *Indexer) SemanticSearch(query string, topK int) ([]SearchResult, error)

SemanticSearch finds symbols semantically similar to the query string. The query is split into shingles, MinHashed, then compared against all symbol signatures using brute-force Jaccard similarity.

Applies the path-based post-filter by default: test/mock/generated/vendored/declaration results are partitioned below clean-path results of comparable similarity. Use SemanticSearchWithOptions to disable it.

func (*Indexer) SemanticSearchWithContext added in v1.10.0

func (idx *Indexer) SemanticSearchWithContext(ctx context.Context, query string, opts SemanticSearchOptions) ([]SearchResult, error)

SemanticSearchWithContext is the cancellable variant. It checks ctx before the expensive candidate-scoring loops so a search invoked as a tool can be aborted at the agent's tool deadline instead of spinning the corpus to completion (task 349f1f14 complement — stops the abandoned goroutine's CPU burn).

func (*Indexer) SemanticSearchWithOptions added in v1.9.0

func (idx *Indexer) SemanticSearchWithOptions(query string, opts SemanticSearchOptions) ([]SearchResult, error)

SemanticSearchWithOptions is the full-options variant of SemanticSearch.

func (*Indexer) Stats

func (idx *Indexer) Stats() (*StoreStats, error)

Stats returns aggregate stats for the indexed codebase.

func (*Indexer) Store

func (idx *Indexer) Store() *Store

Store returns the underlying store for direct queries (used by tools).

func (*Indexer) TakeSnapshot added in v1.10.0

func (idx *Indexer) TakeSnapshot() (*Snapshot, error)

TakeSnapshot captures the current graph state from the store.

func (*Indexer) Update

func (idx *Indexer) Update() error

Update performs an incremental update. Only re-indexes files whose content hash has changed since the last index. Removes symbols for deleted files.
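The planning step of such an incremental update can be sketched as a comparison of stored content hashes against the current files. SHA-256 is assumed here as the hash; the package's actual hash function is not stated in this section. planUpdate and contentHash are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// contentHash returns a hex digest of the file content (SHA-256 assumed).
func contentHash(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// planUpdate compares stored hashes (path -> hash) with current file
// contents and returns the paths to re-index and the paths to remove.
func planUpdate(indexed map[string]string, current map[string][]byte) (reindex, removed []string) {
	for path, data := range current {
		if indexed[path] != contentHash(data) {
			reindex = append(reindex, path) // new or changed file
		}
	}
	for path := range indexed {
		if _, ok := current[path]; !ok {
			removed = append(removed, path) // file deleted since last index
		}
	}
	sort.Strings(reindex)
	sort.Strings(removed)
	return reindex, removed
}

func main() {
	indexed := map[string]string{
		"a.go": contentHash([]byte("package a")),
		"b.go": contentHash([]byte("package b")),
	}
	current := map[string][]byte{
		"a.go": []byte("package a"),
		"c.go": []byte("package c"),
	}
	reindex, removed := planUpdate(indexed, current)
	fmt.Println("reindex:", reindex, "removed:", removed)
}
```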

func (*Indexer) UpdateWithContext added in v1.10.0

func (idx *Indexer) UpdateWithContext(ctx context.Context) error

UpdateWithContext is the cancellable variant of Update. It checks ctx between re-indexed files so an incremental update started under a tool deadline can abort instead of re-parsing the whole repo (task 349f1f14 complement).

type LazyRedirectCandidate added in v1.8.3

type LazyRedirectCandidate struct {
	Name      string
	File      string
	Line      int
	Kind      string
	OutEdges  int
	InEdges   int
	Signature string
}

LazyRedirectCandidate represents a function whose name implies complex behavior but whose graph structure shows it's structurally trivial — a potential lazy redirect.

type LazyRedirectResult added in v1.8.3

type LazyRedirectResult struct {
	Name      string  `json:"name"`
	File      string  `json:"file"`
	Line      int     `json:"line"`
	Kind      string  `json:"kind"`
	OutEdges  int     `json:"outgoing_edges"`
	InEdges   int     `json:"incoming_edges"`
	Score     float64 `json:"divergence_score"`
	Reason    string  `json:"reason"`
	Signature string  `json:"signature"`
}

LazyRedirectResult is a scored candidate for lazy redirect detection.

type MinHashEntry

type MinHashEntry struct {
	SymbolID  int64
	Signature MinHashSignature
}

MinHashEntry pairs a symbol ID with its MinHash signature for bulk queries.

type MinHashSignature

type MinHashSignature []uint64

MinHashSignature is a fixed-length slice of hash values used for similarity search.

type MinHasher

type MinHasher struct {
	// contains filtered or unexported fields
}

MinHasher computes MinHash signatures for sets of shingles. Uses FNV-1a with different uint64 seeds to simulate N independent hash functions. Seeds are a fixed []uint64 so the hasher can be persisted to the codegraph store and restored across process invocations — essential for reliable cross-process semantic search.

func NewMinHasher

func NewMinHasher(numHashes int) *MinHasher

NewMinHasher creates a MinHasher with the specified number of hash functions, generating fresh random seeds from crypto/rand. Use NewMinHasherFromSeeds when reloading a persisted hasher from the store.

func NewMinHasherFromSeeds added in v1.9.0

func NewMinHasherFromSeeds(seeds []uint64) *MinHasher

NewMinHasherFromSeeds creates a MinHasher with pre-determined seeds, typically reloaded from the codegraph store's meta table. This is the critical path for cross-process signature stability: a MinHash signature computed with seeds S can only be compared to another signature computed with the SAME seeds S. Persisting the seeds and restoring them on Open is what makes SemanticSearch work across process boundaries.

func (*MinHasher) NumHashes added in v1.9.0

func (m *MinHasher) NumHashes() int

NumHashes returns the signature length.

func (*MinHasher) Seeds added in v1.9.0

func (m *MinHasher) Seeds() []uint64

Seeds returns a copy of the hasher's seeds. Used by the indexer to persist them into the codegraph store's meta table at build time.

func (*MinHasher) Signature

func (m *MinHasher) Signature(shingles []string) MinHashSignature

Signature computes the MinHash signature for a set of shingles. Each element of the returned slice is the minimum hash value across all shingles for that hash function.
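A minimal sketch of the per-hash-function minimum described above, using a hand-written FNV-1a loop. How the package mixes each seed into FNV-1a is not specified here; XORing the seed into the offset basis is an assumption made for illustration, and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

const (
	fnvOffset64 = 14695981039346656037
	fnvPrime64  = 1099511628211
)

// seededFNV1a runs FNV-1a over s, starting from the offset basis XORed
// with seed (assumed seeding scheme).
func seededFNV1a(s string, seed uint64) uint64 {
	h := uint64(fnvOffset64) ^ seed
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= fnvPrime64
	}
	return h
}

// signature keeps, for each seed, the minimum hash over all shingles.
func signature(shingles []string, seeds []uint64) []uint64 {
	sig := make([]uint64, len(seeds))
	for i := range sig {
		sig[i] = math.MaxUint64
	}
	for _, sh := range shingles {
		for i, seed := range seeds {
			if h := seededFNV1a(sh, seed); h < sig[i] {
				sig[i] = h
			}
		}
	}
	return sig
}

func main() {
	seeds := []uint64{11, 22, 33, 44}
	a := signature([]string{"parse", "file", "go"}, seeds)
	b := signature([]string{"go", "parse", "file"}, seeds)
	fmt.Println(len(a), a[0] == b[0]) // shingle order does not affect the result
}
```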

type MultiLangParser added in v1.10.0

type MultiLangParser struct {
	// contains filtered or unexported fields
}

MultiLangParser parses source files across multiple languages using tree-sitter grammars and the langSpec node type mappings.

func NewMultiLangParser added in v1.10.0

func NewMultiLangParser() *MultiLangParser

NewMultiLangParser initializes the parser with all available grammars.

func (*MultiLangParser) Close added in v1.10.0

func (m *MultiLangParser) Close()

Close releases native resources.

func (*MultiLangParser) ParseFile added in v1.10.0

func (m *MultiLangParser) ParseFile(path string) (*ParseResult, error)

ParseFile reads a source file and returns extracted symbols and edges using the tree-sitter AST and language-specific node type mappings.

func (*MultiLangParser) SupportsFile added in v1.10.0

func (m *MultiLangParser) SupportsFile(path string) bool

SupportsFile returns true if this parser can handle the given file.

type PackageEdge added in v1.8.3

type PackageEdge struct {
	Source string
	Target string
	Count  int
}

PackageEdge represents a connection between two packages.

type PackageInfo added in v1.8.3

type PackageInfo struct {
	Name        string
	SymbolCount int
	FileCount   int
}

PackageInfo holds package-level stats for visualization.

type ParseResult

type ParseResult struct {
	Symbols []Symbol
	Edges   []RawEdge
	Source  []byte // raw file content for shingle generation
}

ParseResult holds the symbols and edges extracted from a single file.

type PathFlag added in v1.9.0

type PathFlag string

PathFlag is a machine-readable marker attached to a search result when the symbol's file path matches a known pattern that affects its interpretation — test fixture, mock, type declaration, vendored code, build output, etc.

Flags are computed at query time (not stored in the index), so adding new flag categories does not invalidate existing codegraph databases. Callers can read SearchResult.PathFlags to understand WHY a symbol was demoted from the "clean" ranking tier.

const (
	// FlagTest — symbol lives in a test file or test directory. These
	// are genuine test helpers: TestFoo functions in Go's _test.go files,
	// tests/*.py in Python, *.spec.ts / *.test.ts in TypeScript, and so on.
	FlagTest PathFlag = "test"

	// FlagMock — symbol is in a mocks/, fixtures/, or stubs/ directory.
	// Mock handlers, fake services, test doubles. These pollute queries
	// like "http request handler middleware" because they share
	// discriminative tokens with production middleware without BEING
	// production middleware. Q2 in the grafana A/B test was 100% mock
	// handlers for exactly this reason.
	FlagMock PathFlag = "mock"

	// FlagDeclaration — symbol is in a pure type declaration file
	// (e.g. TypeScript .d.ts). These describe an API surface but have
	// no runtime code. Usually undesirable as a semantic search match
	// because the user is looking for implementations, not declarations.
	// JQueryStatic lives in a .d.ts file and this flag would demote it
	// even without the splitCamelCase fix.
	FlagDeclaration PathFlag = "declaration"

	// FlagVendored — symbol is in a vendored dependency or third-party
	// package directory (vendor/, node_modules/, bower_components/).
	// These are external code the user didn't write. Usually irrelevant.
	FlagVendored PathFlag = "vendored"

	// FlagGenerated — symbol is in a generated-code output directory
	// (dist/, build/, .next/, out/, target/). Post-compile artifacts,
	// transpiled output, build caches. Never what the user wants.
	FlagGenerated PathFlag = "generated"
)

func ClassifyPath added in v1.9.0

func ClassifyPath(path string) []PathFlag

ClassifyPath inspects a file path and returns the set of PathFlags that apply. Empty result means the symbol is in a "clean" path with no demotion warranted.

Deterministic, fast (O(path length)), and order-independent: the same path always produces the same flag set.
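
A path classifier along these lines might look like the sketch below. The directory names and file suffixes are taken from the constant comments above, but the exact matching rules, the fixed output order, and the helper name classifyPath are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"strings"
)

type PathFlag string

const (
	FlagTest        PathFlag = "test"
	FlagMock        PathFlag = "mock"
	FlagDeclaration PathFlag = "declaration"
	FlagVendored    PathFlag = "vendored"
	FlagGenerated   PathFlag = "generated"
)

// dirFlags maps directory names to flags (assumed matching rules).
var dirFlags = map[string]PathFlag{
	"tests": FlagTest,
	"mocks": FlagMock, "fixtures": FlagMock, "stubs": FlagMock,
	"vendor": FlagVendored, "node_modules": FlagVendored, "bower_components": FlagVendored,
	"dist": FlagGenerated, "build": FlagGenerated, ".next": FlagGenerated,
	"out": FlagGenerated, "target": FlagGenerated,
}

// classifyPath is a hypothetical stand-in for ClassifyPath.
func classifyPath(path string) []PathFlag {
	seen := map[PathFlag]bool{}
	parts := strings.Split(strings.ReplaceAll(path, "\\", "/"), "/")
	for _, dir := range parts[:len(parts)-1] {
		if f, ok := dirFlags[dir]; ok {
			seen[f] = true
		}
	}
	base := parts[len(parts)-1]
	switch {
	case strings.HasSuffix(base, ".d.ts"):
		seen[FlagDeclaration] = true
	case strings.HasSuffix(base, "_test.go"),
		strings.HasSuffix(base, ".spec.ts"),
		strings.HasSuffix(base, ".test.ts"):
		seen[FlagTest] = true
	}
	// Emit in a fixed order so the same path always yields the same slice.
	var out []PathFlag
	for _, f := range []PathFlag{FlagTest, FlagMock, FlagDeclaration, FlagVendored, FlagGenerated} {
		if seen[f] {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	fmt.Println(classifyPath("pkg/api/mocks/handler_test.go")) // [test mock]
	fmt.Println(classifyPath("pkg/api/handler.go"))            // []
}
```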

type RawEdge

type RawEdge struct {
	SourceName string
	TargetName string
	Kind       EdgeKind
}

RawEdge is an unresolved edge that uses symbol names instead of IDs. Resolved to Edge (with IDs) when inserted into the store.

type Reranker added in v1.9.0

type Reranker interface {
	Rerank(results []SearchResult, queryTokenCount int) []SearchResult
}

Reranker rescores a set of SearchResult candidates using features beyond the baseline Jaccard + BM25 fusion. Implementations MUST be pure — no network, no file I/O beyond what was passed in. Callers are responsible for cloning the input if they need the original order preserved; Rerank is allowed to mutate the slice in place.

queryTokenCount is the number of distinct query shingles that made it through the stop-word filter. Rerankers that compute a matched-token ratio need this for normalization.
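
To illustrate the contract, here is a sketch of a custom implementation that orders candidates purely by matched-token ratio. The types are local stand-ins trimmed to the fields used, and ratioReranker is a hypothetical name; nothing here is part of the package.

```go
package main

import (
	"fmt"
	"sort"
)

// SearchResult is a local stand-in, trimmed to the fields used here.
type SearchResult struct {
	Name          string // hypothetical; the real struct carries a Symbol
	Similarity    float64
	MatchedTokens []string
}

type Reranker interface {
	Rerank(results []SearchResult, queryTokenCount int) []SearchResult
}

// ratioReranker orders candidates by matched-token ratio and keeps the
// incoming order on ties.
type ratioReranker struct{}

func (ratioReranker) Rerank(results []SearchResult, queryTokenCount int) []SearchResult {
	if queryTokenCount <= 0 {
		return results // nothing to normalize against
	}
	ratio := func(r SearchResult) float64 {
		return float64(len(r.MatchedTokens)) / float64(queryTokenCount)
	}
	// Sorts in place, which the interface contract permits.
	sort.SliceStable(results, func(i, j int) bool {
		return ratio(results[i]) > ratio(results[j])
	})
	return results
}

func main() {
	fused := []SearchResult{
		{Name: "Get", Similarity: 0.18, MatchedTokens: []string{"get"}},
		{Name: "OpenPostgres", Similarity: 0.12, MatchedTokens: []string{"open", "postgresql"}},
	}
	// Clone first if the original order is still needed afterwards.
	original := append([]SearchResult(nil), fused...)

	var rr Reranker = ratioReranker{}
	ranked := rr.Rerank(fused, 2)
	fmt.Println(original[0].Name, "->", ranked[0].Name)
}
```

The implementation touches only its arguments, so it stays pure in the sense required above.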

type SearchResult

type SearchResult struct {
	Symbol     Symbol
	Similarity float64

	// BM25Score is the additive per-symbol BM25 score for this query,
	// computed alongside the Jaccard similarity at search time. Not a
	// replacement for Similarity — both signals are returned so callers
	// (or a downstream re-rank layer) can reason about them independently.
	// Zero when the BM25 corpus stats table is empty (pre-v1.9.0 index).
	BM25Score float64

	// MatchedTokens are the query tokens that appeared in this symbol's
	// filtered shingle set (intersection of query and symbol tokens).
	// Populated only when BM25 scoring is active. Useful reasoning output
	// for LLMs: "this result matched because it contains X, Y, Z".
	MatchedTokens []string

	PathFlags          []string
	EdgeCount          int
	ConfidenceWarnings []string
}

SearchResult pairs a symbol with its similarity score and a set of machine-readable reasoning fields that tell an LLM (or a human) WHY this result was returned and how confident celeste is in it.

PathFlags: markers attached when the symbol's file path triggered the path-based post-filter, e.g. ["test"] or ["mock", "generated"]. Clean-path results have an empty PathFlags slice. SemanticSearch demotes flagged results below clean results by default; see SemanticSearchOptions.ApplyPathFilter to disable.

EdgeCount: total incoming + outgoing edges on this symbol in the code graph. A function that is called from 4 places and calls 2 others has EdgeCount=6. Zero-edge symbols are suspicious — they may be genuine dead code, but they may also be symbols the parser failed to resolve (especially TS/Python/Rust where the regex parser can't follow call sites through type definitions). SPEC §8.2 Issue #2 documents this ambiguity explicitly; LLMs should NOT treat EdgeCount=0 as proof of dead code without corroborating evidence.

ConfidenceWarnings: human-readable strings describing caveats about this result. Derived at query time from PathFlags, EdgeCount, Kind, and Similarity — no schema change, no precomputation. Callers should surface these to whoever consumes the search results so low-quality matches are recognized as such instead of being treated as confident answers.

type SemanticSearchOptions added in v1.9.0

type SemanticSearchOptions struct {
	// TopK is the maximum number of results to return. Required.
	TopK int

	// MinSimilarity is the Jaccard floor below which results are dropped
	// entirely. Zero means use the default (0.05).
	MinSimilarity float64

	// ApplyPathFilter, when true, demotes results whose file path matches
	// a known "noisy" pattern (test/mock/generated/vendored/declaration)
	// below clean-path results. Default when using SemanticSearch is true.
	// Set false for raw unfiltered results.
	ApplyPathFilter bool

	// Reranker, when non-nil, is applied to the candidate list after
	// the Jaccard + BM25 fusion and before the path filter tiering.
	// A pluggable seam — the default (set via SemanticSearch) is
	// StructuralReranker which does pure-Go feature-based rescoring.
	// Future cloud/local embedding rerankers can implement this
	// interface without touching the search pipeline.
	//
	// Pass a zero value (nil) together with
	// DisableRerank=true to get the pre-Task-24 behavior (fusion-only).
	Reranker Reranker

	// DisableRerank bypasses the Reranker even if one is set.
	// Useful for A/B testing and for callers that want the raw
	// fused ordering without any structural adjustments.
	DisableRerank bool
}

SemanticSearchOptions configures SemanticSearch behavior. Existing callers of SemanticSearch(query, topK) get the default behavior — path filter ON, structural rerank ON — without any changes.

type Snapshot added in v1.10.0

type Snapshot struct {
	CommitSHA   string            `json:"commit_sha"`
	Timestamp   time.Time         `json:"timestamp"`
	SymbolCount int               `json:"symbol_count"`
	EdgeCount   int               `json:"edge_count"`
	Symbols     map[string]string `json:"symbols"` // name → kind
	Edges       []string          `json:"edges"`   // "source→target:kind"
}

Snapshot captures the graph state at a point in time.

type SnapshotDiff added in v1.10.0

type SnapshotDiff struct {
	BeforeSHA      string      `json:"before_sha"`
	AfterSHA       string      `json:"after_sha"`
	AddedSymbols   []string    `json:"added_symbols"`
	RemovedSymbols []string    `json:"removed_symbols"`
	AddedEdges     []string    `json:"added_edges"`
	RemovedEdges   []string    `json:"removed_edges"`
	Summary        DiffSummary `json:"summary"`
}

SnapshotDiff describes what changed between two graph states.

func DiffSnapshots added in v1.10.0

func DiffSnapshots(before, after *Snapshot) *SnapshotDiff

DiffSnapshots compares two snapshots and returns what changed.
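
Conceptually, the comparison is a pair of set differences over symbol names and edge keys. The sketch below shows that idea on a trimmed local copy of Snapshot. The helper names are hypothetical, and it ignores details the real function may handle, such as a symbol whose kind changed or the Summary field.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot is a trimmed local copy of the struct documented above.
type Snapshot struct {
	CommitSHA string
	Symbols   map[string]string // name → kind
	Edges     []string          // "source→target:kind"
}

// diffKeys returns keys present only in after (added) and only in before
// (removed), each sorted for deterministic output.
func diffKeys(before, after map[string]bool) (added, removed []string) {
	for k := range after {
		if !before[k] {
			added = append(added, k)
		}
	}
	for k := range before {
		if !after[k] {
			removed = append(removed, k)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

// symbolSet reduces a snapshot to the set of its symbol names.
func symbolSet(s *Snapshot) map[string]bool {
	set := make(map[string]bool, len(s.Symbols))
	for name := range s.Symbols {
		set[name] = true
	}
	return set
}

// edgeSet reduces a snapshot to the set of its edge keys.
func edgeSet(s *Snapshot) map[string]bool {
	set := make(map[string]bool, len(s.Edges))
	for _, e := range s.Edges {
		set[e] = true
	}
	return set
}

func main() {
	before := &Snapshot{CommitSHA: "aaa",
		Symbols: map[string]string{"Open": "function", "Close": "function"},
		Edges:   []string{"Open→Close:calls"}}
	after := &Snapshot{CommitSHA: "bbb",
		Symbols: map[string]string{"Open": "function", "Ping": "function"},
		Edges:   []string{"Open→Ping:calls"}}

	addedSyms, removedSyms := diffKeys(symbolSet(before), symbolSet(after))
	addedEdges, removedEdges := diffKeys(edgeSet(before), edgeSet(after))
	fmt.Println(before.CommitSHA, "->", after.CommitSHA)
	fmt.Println("symbols:", addedSyms, removedSyms)
	fmt.Println("edges:", addedEdges, removedEdges)
}
```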

type StopWords added in v1.9.0

type StopWords struct {
	Version   string
	Universal map[string]bool
	ByLang    map[string]map[string]bool
	Compound  map[string]bool
}

StopWords holds the parsed lookup sets used at shingle-generation time and query-tokenization time. Built once at init.

func (*StopWords) Filter added in v1.9.0

func (s *StopWords) Filter(tokens []string, lang string) []string

Filter removes any tokens in the universal set or in the per-language set for the given lang from the input slice. Empty lang means universal-only filtering. The input slice is NOT mutated.

Preserves the order of surviving tokens. Returns a freshly allocated slice (safe for the caller to hold).
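
The behavior described above (universal plus per-language filtering, no mutation of the input, order preserved) can be sketched like this. StopWords is a trimmed local copy and filter is a hypothetical stand-in for the real method.

```go
package main

import "fmt"

// StopWords is a trimmed local copy of the struct documented above.
type StopWords struct {
	Universal map[string]bool
	ByLang    map[string]map[string]bool
}

// filter returns a new slice holding the tokens that are in neither the
// universal set nor the set for lang. The input slice is left untouched.
func (s *StopWords) filter(tokens []string, lang string) []string {
	out := make([]string, 0, len(tokens))
	// For an empty or unknown lang this is a nil map; lookups return false.
	langSet := s.ByLang[lang]
	for _, t := range tokens {
		if s.Universal[t] || langSet[t] {
			continue
		}
		out = append(out, t)
	}
	return out
}

func main() {
	sw := &StopWords{
		Universal: map[string]bool{"get": true, "error": true},
		ByLang:    map[string]map[string]bool{"go": {"func": true}},
	}
	tokens := []string{"get", "func", "postgresql", "pool"}
	fmt.Println(sw.filter(tokens, "go")) // [postgresql pool]
	fmt.Println(sw.filter(tokens, ""))   // [func postgresql pool]
}
```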

func (*StopWords) IsCompound added in v1.9.0

func (s *StopWords) IsCompound(name string) bool

IsCompound returns true if the lowercased identifier is in the compound_identifiers list. Used by splitIdentifier to keep known compound names (jquery, github, mysql, ...) atomic instead of decomposing them into parts that pollute searches.

The splitCamelCase fix in v1.9.0 (min-3-uppercase rule) already handles most compound-name cases structurally, so this lookup is a belt-and-suspenders layer. It covers snake_case compounds (in mysql_config, the "mysql" part must stay atomic rather than being decomposed further into pieces such as "my" + "sql") and any lowercase-only compounds that splitCamelCase wouldn't touch at all.

func (*StopWords) UniversalSize added in v1.9.0

func (s *StopWords) UniversalSize() int

UniversalSize returns the number of universal stop words. Used by the anchor test to assert the embedded file isn't obviously broken.

type Store

type Store struct {
	// contains filtered or unexported fields
}

Store manages the SQLite database for the code graph.

func NewStore

func NewStore(dbPath string) (*Store, error)

NewStore opens (or creates) a SQLite database at the given path and initializes the schema.

func (*Store) AddEdge

func (s *Store) AddEdge(sourceID, targetID int64, kind EdgeKind) error

AddEdge records a directional relationship between two symbols.

func (*Store) AllEdgeKeys added in v1.10.0

func (s *Store) AllEdgeKeys() ([]string, error)

AllEdgeKeys returns a list of "source→target:kind" strings for all edges.

func (*Store) AllSymbolNamesAndKinds added in v1.10.0

func (s *Store) AllSymbolNamesAndKinds() (map[string]string, error)

AllSymbolNamesAndKinds returns a map of symbol name → kind for all symbols.

func (*Store) CallerCount added in v1.10.0

func (s *Store) CallerCount(targetName string) int

CallerCount returns the number of symbols that call the named symbol.

func (*Store) CallersOf added in v1.10.0

func (s *Store) CallersOf(targetName string) []Symbol

CallersOf returns symbols that have a "calls" edge targeting the named symbol.

func (*Store) Close

func (s *Store) Close() error

Close closes the underlying database connection.

func (*Store) DeleteFile

func (s *Store) DeleteFile(path string) error

DeleteFile removes a file record.

func (*Store) DeleteFileSymbols

func (s *Store) DeleteFileSymbols(file string) error

DeleteFileSymbols removes all symbols (and their edges) for a file.

func (*Store) EdgeCount added in v1.10.0

func (s *Store) EdgeCount(name string) int

EdgeCount returns the total number of edges connected to a symbol.

func (*Store) FindAllFunctionsWithEdges added in v1.8.3

func (s *Store) FindAllFunctionsWithEdges() ([]FunctionEdgeInfo, error)

FindAllFunctionsWithEdges returns all functions/methods with their edge counts. Used by the unified code smell detector for single-pass analysis.

func (*Store) FindLazyRedirectCandidates added in v1.8.3

func (s *Store) FindLazyRedirectCandidates(includeTests bool) ([]LazyRedirectCandidate, error)

FindLazyRedirectCandidates returns functions/methods with low outgoing edges (0-2) that are NOT known leaf patterns (constructors, getters, interface impls). These are candidates for lazy redirect analysis via shingle/edge divergence.

func (*Store) FindStubs added in v1.8.3

func (s *Store) FindStubs(includeTests bool) ([]StubResult, error)

FindStubs returns functions/methods with zero outgoing call edges. These are likely stubs, placeholders, or dead code.

func (*Store) GetAllFiles

func (s *Store) GetAllFiles() ([]FileRecord, error)

GetAllFiles returns all indexed file records.

func (*Store) GetAllMinHashes

func (s *Store) GetAllMinHashes() ([]MinHashEntry, error)

GetAllMinHashes retrieves all symbol IDs and their MinHash signatures for similarity search. Symbols without a signature are skipped.

func (*Store) GetEdgesFrom

func (s *Store) GetEdgesFrom(sourceID int64) ([]Edge, error)

GetEdgesFrom returns all outgoing edges from the given symbol.

func (*Store) GetEdgesTo

func (s *Store) GetEdgesTo(targetID int64) ([]Edge, error)

GetEdgesTo returns all incoming edges to the given symbol.

func (*Store) GetFile

func (s *Store) GetFile(path string) (*FileRecord, error)

GetFile retrieves a file record by path.

func (*Store) GetFileGraph added in v1.8.3

func (s *Store) GetFileGraph() ([]FileEdge, error)

GetFileGraph returns file-level connectivity data for visualization. Works for all languages — shows which files call into other files.

func (*Store) GetIDFs added in v1.9.0

func (s *Store) GetIDFs(tokens []string) (map[string]float64, error)

GetIDFs reads IDF values for a set of tokens in one batched query. Returns a map containing only tokens that exist in token_stats — missing tokens contribute zero to BM25 scores.

func (*Store) GetMeta added in v1.9.0

func (s *Store) GetMeta(key string) ([]byte, error)

GetMeta reads a raw byte value from the meta key/value table. Returns (nil, nil) if the key is not present — callers should treat nil as "not set" and decide whether to generate and persist.

func (*Store) GetMinHash

func (s *Store) GetMinHash(symbolID int64) (MinHashSignature, error)

GetMinHash retrieves the MinHash signature for a symbol.

func (*Store) GetPackageGraph added in v1.8.3

func (s *Store) GetPackageGraph() ([]PackageInfo, []PackageEdge, error)

GetPackageGraph returns package-level connectivity data for visualization.

func (*Store) GetSymbol

func (s *Store) GetSymbol(id int64) (*Symbol, error)

GetSymbol retrieves a symbol by its ID.

func (*Store) GetSymbolIDByName added in v1.8.2

func (s *Store) GetSymbolIDByName(name string) (int64, bool)

GetSymbolIDByName returns the ID of a symbol by exact name match. If multiple symbols share the same name, returns the first found.

func (*Store) GetSymbolTokens added in v1.9.0

func (s *Store) GetSymbolTokens(symbolID int64) (map[string]int, int, error)

GetSymbolTokens reads the stored TF map for a single symbol. Used at query time to compute BM25 scores. Returns an empty map (not nil) if the symbol has no token rows, so callers can treat it as "zero contribution" without nil-checks.
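
As a rough illustration of how the TF map and IDF values combine at query time, the sketch below applies the standard Okapi BM25 formula. The parameters k1 = 1.2 and b = 0.75 are the conventional defaults, not values confirmed for this package. Treating the second return value of GetSymbolTokens as the document length is also an assumption, and bm25Score is a hypothetical helper.

```go
package main

import "fmt"

// Conventional BM25 parameters; the package's actual values are not documented here.
const (
	k1 = 1.2
	b  = 0.75
)

// bm25Score sums the per-token BM25 contributions for one symbol and also
// returns the query tokens that matched.
func bm25Score(query []string, idf map[string]float64, tf map[string]int,
	docLen int, avgDocLen float64) (float64, []string) {
	if avgDocLen <= 0 {
		return 0, nil // empty corpus stats: no BM25 signal
	}
	norm := k1 * (1 - b + b*float64(docLen)/avgDocLen)
	var score float64
	var matched []string
	for _, t := range query {
		f := float64(tf[t])
		w, ok := idf[t]
		if f == 0 || !ok {
			continue // missing tokens contribute zero
		}
		score += w * f * (k1 + 1) / (f + norm)
		matched = append(matched, t)
	}
	return score, matched
}

func main() {
	idf := map[string]float64{"get": 0.1, "postgresql": 2.3}
	tf := map[string]int{"get": 1, "postgresql": 1, "pool": 1}
	score, matched := bm25Score([]string{"get", "postgresql", "kubernetes"}, idf, tf, 3, 3.0)
	fmt.Printf("score=%.3f matched=%v\n", score, matched)
}
```

With these inputs the rare token "postgresql" accounts for almost all of the score, while the common token "get" adds little.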

func (*Store) GetSymbolsByFile

func (s *Store) GetSymbolsByFile(file string) ([]Symbol, error)

GetSymbolsByFile returns all symbols in the given file.

func (*Store) GetSymbolsByPackage

func (s *Store) GetSymbolsByPackage(pkg string) ([]Symbol, error)

GetSymbolsByPackage returns all symbols in the given package.

func (*Store) HasLSHData added in v1.10.0

func (s *Store) HasLSHData() bool

HasLSHData returns true if the lsh_bands table has at least one row. Used to decide whether to use the LSH path or fall back to brute-force at query time — pre-LSH indexes have no band data and must still work.

func (*Store) HasTestCoverage added in v1.10.0

func (s *Store) HasTestCoverage(name string) bool

HasTestCoverage returns true if a symbol has any callers from test files.

func (*Store) LatestSnapshot added in v1.10.0

func (s *Store) LatestSnapshot() ([]byte, error)

LatestSnapshot returns the most recent snapshot data, or nil if none exist.

func (*Store) LoadSnapshot added in v1.10.0

func (s *Store) LoadSnapshot(commitSHA string) ([]byte, error)

LoadSnapshot retrieves a snapshot by commit SHA.

func (*Store) QueryLSHCandidates added in v1.10.0

func (s *Store) QueryLSHCandidates(queryBands []uint64) ([]int64, error)

QueryLSHCandidates retrieves the set of symbol IDs that share at least one band hash with the query. This is the LSH candidate set — typically 0.1-1% of the corpus — which is then ranked by exact Jaccard similarity.

The query is a single SELECT with OR clauses across all 64 bands:

	SELECT DISTINCT symbol_id FROM lsh_bands
	WHERE (band_id = 0 AND band_hash = ?) OR (band_id = 1 AND band_hash = ?) OR ...

The index on (band_id, band_hash) turns each OR branch into an index seek (logarithmic in table size for SQLite's B-tree indexes) instead of a table scan. DISTINCT collapses symbols that match on multiple bands.
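
Assembling such a statement with positional parameters could look like the sketch below. The table and column names come from the query shown above. The helper name buildLSHQuery and the choice to bind each hash as a signed int64 (SQLite integers are signed 64-bit) are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// buildLSHQuery returns the SQL text and positional arguments for one
// candidate lookup, with one OR branch per band.
func buildLSHQuery(queryBands []uint64) (string, []any) {
	if len(queryBands) == 0 {
		return "", nil // no bands, no query
	}
	clauses := make([]string, len(queryBands))
	args := make([]any, len(queryBands))
	for i, h := range queryBands {
		clauses[i] = fmt.Sprintf("(band_id = %d AND band_hash = ?)", i)
		// Reinterpret the hash bits as int64 for binding (assumed storage format).
		args[i] = int64(h)
	}
	sql := "SELECT DISTINCT symbol_id FROM lsh_bands WHERE " + strings.Join(clauses, " OR ")
	return sql, args
}

func main() {
	sql, args := buildLSHQuery([]uint64{7, 9, 11})
	fmt.Println(sql)
	fmt.Println(len(args), "bound parameters")
}
```

The band index is formatted into the SQL text because it is generated locally, while the hash values are always bound as parameters.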

func (*Store) ReadBM25Stats added in v1.9.0

func (s *Store) ReadBM25Stats() (*BM25CorpusStats, error)

ReadBM25Stats reads the cached corpus-wide BM25 stats. Returns (nil, nil) if the meta row is absent (fresh index or pre-BM25 index).

func (*Store) RebuildTokenStats added in v1.9.0

func (s *Store) RebuildTokenStats() (*BM25CorpusStats, error)

RebuildTokenStats walks the entire symbol_tokens table and computes df + idf for every token. Replaces the contents of token_stats atomically (delete-all + insert) so re-runs produce a consistent state. Called at the end of Build() and Update() — cheap compared to the full indexing pass because it's just aggregation over rows we just wrote.

Also computes the corpus-wide NumDocs + AvgDocLength stats and persists them to the meta table so query time can read them in a single lookup instead of COUNT(DISTINCT symbol_id) and AVG() scans.
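
The aggregation itself can be sketched in memory as below. The IDF formula used, ln(1 + (N - df + 0.5) / (df + 0.5)), is the common non-negative BM25 variant and is an assumption; the package may use a different one. The function name and the in-memory input are hypothetical stand-ins for the SQL aggregation over symbol_tokens.

```go
package main

import (
	"fmt"
	"math"
)

type TokenStat struct {
	Token string
	DF    int
	IDF   float64
}

// rebuildTokenStats computes DF and IDF per token plus the corpus-wide
// document count and average document length from per-symbol TF maps.
func rebuildTokenStats(symbolTokens map[int64]map[string]int) (map[string]TokenStat, int, float64) {
	df := map[string]int{}
	totalLen := 0
	for _, tf := range symbolTokens {
		for tok, n := range tf {
			df[tok]++ // each symbol counts once per token
			totalLen += n
		}
	}
	numDocs := len(symbolTokens)
	stats := make(map[string]TokenStat, len(df))
	for tok, d := range df {
		// Assumed IDF variant; always non-negative.
		idf := math.Log(1 + (float64(numDocs)-float64(d)+0.5)/(float64(d)+0.5))
		stats[tok] = TokenStat{Token: tok, DF: d, IDF: idf}
	}
	avg := 0.0
	if numDocs > 0 {
		avg = float64(totalLen) / float64(numDocs)
	}
	return stats, numDocs, avg
}

func main() {
	stats, n, avg := rebuildTokenStats(map[int64]map[string]int{
		1: {"get": 1, "postgresql": 1},
		2: {"get": 1, "error": 1},
		3: {"get": 1},
	})
	fmt.Printf("docs=%d avgLen=%.2f\n", n, avg)
	fmt.Printf("idf(get)=%.3f idf(postgresql)=%.3f\n", stats["get"].IDF, stats["postgresql"].IDF)
}
```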

func (*Store) SaveSnapshot added in v1.10.0

func (s *Store) SaveSnapshot(commitSHA string, ts time.Time, data []byte) error

SaveSnapshot persists a graph snapshot.

func (*Store) SearchSymbolsByName

func (s *Store) SearchSymbolsByName(query string) ([]Symbol, error)

SearchSymbolsByName returns symbols whose name contains the query (case-insensitive).

func (*Store) SetMeta added in v1.9.0

func (s *Store) SetMeta(key string, value []byte) error

SetMeta writes a raw byte value to the meta key/value table. Upserts on conflict so the caller can treat this as idempotent.

func (*Store) Stats

func (s *Store) Stats() (*StoreStats, error)

Stats returns aggregate counts for the indexed codebase.

func (*Store) SymbolsInLineRange added in v1.10.0

func (s *Store) SymbolsInLineRange(file string, startLine, endLine int) []Symbol

SymbolsInLineRange returns symbols in a file whose line range overlaps [startLine, endLine].
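
The overlap test for two inclusive line ranges is a one-line predicate. The sketch below uses a hypothetical span type with explicit start and end lines; how the store derives a symbol's end line is not documented here.

```go
package main

import "fmt"

// span is a hypothetical inclusive line range for a symbol.
type span struct {
	Name       string
	Start, End int
}

// overlaps reports whether [aStart, aEnd] and [bStart, bEnd] share a line.
func overlaps(aStart, aEnd, bStart, bEnd int) bool {
	return aStart <= bEnd && bStart <= aEnd
}

// symbolsInLineRange returns the names of spans that overlap the range.
func symbolsInLineRange(spans []span, startLine, endLine int) []string {
	var out []string
	for _, s := range spans {
		if overlaps(s.Start, s.End, startLine, endLine) {
			out = append(out, s.Name)
		}
	}
	return out
}

func main() {
	spans := []span{{"Open", 10, 25}, {"Close", 27, 30}, {"Ping", 40, 52}}
	// A diff hunk touching lines 24-28 overlaps Open and Close but not Ping.
	fmt.Println(symbolsInLineRange(spans, 24, 28)) // [Open Close]
}
```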

func (*Store) UpdateMinHash

func (s *Store) UpdateMinHash(symbolID int64, sig MinHashSignature) error

UpdateMinHash stores the MinHash signature for a symbol.

func (*Store) UpsertFile

func (s *Store) UpsertFile(f FileRecord) error

UpsertFile inserts or updates a file record.

func (*Store) UpsertLSHBands added in v1.10.0

func (s *Store) UpsertLSHBands(symbolID int64, bands []uint64) error

UpsertLSHBands writes the band hashes for a single symbol. Deletes any existing rows first (re-index case) then bulk-inserts 64 rows.
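
The 64 band hashes come from splitting the 128-element signature into 64 bands of 2 rows each, as described in the overview. One way to derive them is sketched below. Hashing each band's rows with FNV-1a, the uint64 signature element type, and the name bandHashes are assumptions for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

const (
	numBands    = 64
	rowsPerBand = 2
)

// bandHashes splits a 128-element signature into 64 bands of 2 rows and
// hashes each band to a single uint64.
func bandHashes(sig []uint64) ([]uint64, error) {
	if len(sig) != numBands*rowsPerBand {
		return nil, fmt.Errorf("signature has %d elements, want %d", len(sig), numBands*rowsPerBand)
	}
	bands := make([]uint64, numBands)
	var buf [8]byte
	for b := 0; b < numBands; b++ {
		h := fnv.New64a()
		for r := 0; r < rowsPerBand; r++ {
			binary.LittleEndian.PutUint64(buf[:], sig[b*rowsPerBand+r])
			h.Write(buf[:])
		}
		bands[b] = h.Sum64()
	}
	return bands, nil
}

func main() {
	a := make([]uint64, numBands*rowsPerBand)
	c := make([]uint64, numBands*rowsPerBand)
	for i := range a {
		a[i] = uint64(i) * 2654435761
		c[i] = a[i]
	}
	c[0]++ // perturb one row: only band 0 changes
	ba, _ := bandHashes(a)
	bc, _ := bandHashes(c)
	shared := 0
	for i := range ba {
		if ba[i] == bc[i] {
			shared++
		}
	}
	fmt.Println("shared bands:", shared)
}
```

Two symbols become LSH candidates for each other as soon as one band hash matches, which is why narrow 2-row bands suit the low Jaccard values seen in code search.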

func (*Store) UpsertSymbol

func (s *Store) UpsertSymbol(sym Symbol) (int64, error)

UpsertSymbol inserts or updates a symbol. Uniqueness is determined by (name, kind, package, file). Returns the row ID.

func (*Store) UpsertSymbolTokens added in v1.9.0

func (s *Store) UpsertSymbolTokens(symbolID int64, tokens []string) error

UpsertSymbolTokens writes the per-symbol token frequencies for a given symbol. Called from indexFile after the shingles are computed so we have the raw frequencies before deduplication collapses them to 1-per-token.

tokens is passed as a slice (not a set) because we want TF counts: the same token appearing twice in the shingle stream should count 2. Celeste's current shingle pipeline dedupes, so TF is always 1 in practice, but we preserve the more general API for future extractor improvements that might count frequency more accurately.

type StoreStats

type StoreStats struct {
	TotalSymbols  int
	TotalEdges    int
	TotalFiles    int
	SymbolsByKind map[SymbolKind]int
	FilesByLang   map[string]int
}

StoreStats holds aggregate counts for the indexed codebase.

type StructuralReranker added in v1.9.0

type StructuralReranker struct {
	// MatchedTokenWeight scales the matched-token-ratio contribution.
	// Default 1.0 — a full-match symbol gets +1.0 added to its base
	// score, which is significant relative to the typical Jaccard
	// range of 0.1-0.2 but doesn't trivially override BM25.
	MatchedTokenWeight float64

	// EdgeDensityWeight scales the log-normalized edge count
	// contribution. Default 0.3 — mild boost; edge count alone
	// shouldn't overwhelm real textual relevance.
	EdgeDensityWeight float64

	// KindBoostFunction is the additive weight for function / method
	// symbols. Default 0.15 — small but enough to break ties in favor
	// of actual implementations over type aliases.
	KindBoostFunction float64

	// ZeroEdgePenalty is the additive weight (usually negative) for
	// function/method symbols with zero edges. Default -0.25 — pushes
	// likely-dead-code below real matches without entirely removing it.
	ZeroEdgePenalty float64
}

StructuralReranker is the default Reranker shipped in v1.9.0. It scores each candidate using a weighted combination of features that the reciprocal rank fusion (RRF) step can't see:

  • MatchedTokenRatio: fraction of query tokens that appear in the symbol's filtered shingle set. A symbol matching 4/4 query tokens should rank above one matching 1/4 even if the Jaccard estimator happens to put them at similar percentiles.

  • EdgeDensity: log-normalized edge count. Well-connected symbols are more likely to be the "real" implementation of a feature than zero-edge stub interfaces. Capped logarithmically so that a symbol with 200 edges doesn't dwarf one with 20.

  • KindWeight: function and method symbols get a small boost over type/interface declarations for implementation-hunting queries. Tuned on the SPEC §5.1 benchmark queries which all target "find me the code that does X" rather than "find me the type definition for X".

  • ZeroEdgePenalty: a symbol with zero edges and kind in {function, method} is either dead code or a parser limitation. Push it below other candidates that actually have connectivity.

All features are normalized to [0,1]-ish ranges before the weighted sum. Weights are exposed on the struct so callers can A/B tune without recompiling; the zero value uses sensible defaults picked by hand-inspection on the Task 23 content-control benchmark.

func NewStructuralReranker added in v1.9.0

func NewStructuralReranker() *StructuralReranker

NewStructuralReranker returns a reranker with the default weights picked by hand-inspection on the content-control benchmark. Callers that want to experiment can construct StructuralReranker{} directly with custom weights instead.

func (*StructuralReranker) Rerank added in v1.9.0

func (r *StructuralReranker) Rerank(results []SearchResult, queryTokenCount int) []SearchResult

Rerank applies the structural rescore to results and returns a new ordering. The original Similarity / BM25Score fields on each SearchResult are preserved; only the slice order changes. Callers can audit the rerank by comparing the old order to the new one.

Ties (exact equal structural scores) are broken by the incoming order so the rerank is stable relative to the fused ranking. This matters because the fused ranking already encodes meaningful signal — we're enhancing it, not replacing it, and ties should fall back to "trust the upstream signal".
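
A structural rescore with a stable tie-break might be sketched as follows. The default weights are the ones documented on the struct. Using Similarity as the base score and log1p(edges) / log1p(200) as the edge-density normalization are assumptions, and all names are local stand-ins.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// SearchResult is a local stand-in, trimmed to the fields used here.
type SearchResult struct {
	Name          string // hypothetical; the real struct carries a Symbol
	Kind          string
	Similarity    float64
	MatchedTokens []string
	EdgeCount     int
}

type weights struct {
	MatchedToken, EdgeDensity, KindBoostFunction, ZeroEdgePenalty float64
}

// Defaults as documented on StructuralReranker.
var defaults = weights{1.0, 0.3, 0.15, -0.25}

func structuralScore(r SearchResult, queryTokenCount int, w weights) float64 {
	score := r.Similarity // assumed base score
	if queryTokenCount > 0 {
		score += w.MatchedToken * float64(len(r.MatchedTokens)) / float64(queryTokenCount)
	}
	// Log-normalized edge count, saturating at an assumed cap of 200 edges.
	density := math.Min(1, math.Log1p(float64(r.EdgeCount))/math.Log1p(200))
	score += w.EdgeDensity * density
	if r.Kind == "function" || r.Kind == "method" {
		score += w.KindBoostFunction
		if r.EdgeCount == 0 {
			score += w.ZeroEdgePenalty
		}
	}
	return score
}

// rerank returns a new slice ordered by structural score; equal scores keep
// their incoming order because the sort is stable.
func rerank(results []SearchResult, queryTokenCount int) []SearchResult {
	type scored struct {
		r SearchResult
		s float64
	}
	tmp := make([]scored, len(results))
	for i, r := range results {
		tmp[i] = scored{r, structuralScore(r, queryTokenCount, defaults)}
	}
	sort.SliceStable(tmp, func(i, j int) bool { return tmp[i].s > tmp[j].s })
	out := make([]SearchResult, len(tmp))
	for i, t := range tmp {
		out[i] = t.r
	}
	return out
}

func main() {
	in := []SearchResult{
		{Name: "stubHandler", Kind: "function", Similarity: 0.15, MatchedTokens: []string{"http", "handler"}},
		{Name: "ServeHTTP", Kind: "method", Similarity: 0.15, MatchedTokens: []string{"http", "handler"}, EdgeCount: 6},
	}
	for _, r := range rerank(in, 2) {
		fmt.Println(r.Name)
	}
}
```

Scores are kept next to their results while sorting, so the Similarity field on each result is never overwritten.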

type StubResult added in v1.8.3

type StubResult struct {
	Name     string
	File     string
	Line     int
	Kind     string
	OutEdges int
	InEdges  int
}

StubResult represents a function/method with zero outgoing call edges.

type Symbol

type Symbol struct {
	ID          int64
	Name        string
	Kind        SymbolKind
	Package     string
	File        string
	Line        int
	Signature   string
	Decorators  string // comma-separated decorator names captured at parse time
	BaseClasses string // comma-separated base-class names of the enclosing class
}

Symbol represents a code entity (function, type, interface, etc.).

type SymbolKind

type SymbolKind string

SymbolKind identifies the kind of code symbol.

const (
	SymbolFunction  SymbolKind = "function"
	SymbolMethod    SymbolKind = "method"
	SymbolType      SymbolKind = "type"
	SymbolInterface SymbolKind = "interface"
	SymbolConst     SymbolKind = "const"
	SymbolVar       SymbolKind = "var"
	SymbolStruct    SymbolKind = "struct"
	SymbolImport    SymbolKind = "import"
	SymbolClass     SymbolKind = "class"
)

type TSParser added in v1.9.0

type TSParser struct {
	// contains filtered or unexported fields
}

TSParser parses TypeScript and TSX source files using tree-sitter. One parser holds both language pointers — selection is per-file by extension. The underlying tree_sitter.Parser is re-used across files (Parse() resets the internal state) so allocation stays cheap.

func NewTSParser added in v1.9.0

func NewTSParser() *TSParser

NewTSParser initializes the tree-sitter parser with the TypeScript and TSX grammars loaded. The signature has no error return, so a grammar wiring failure is not reported as an error value; such a failure is not expected in practice because the grammars are statically linked.

func (*TSParser) Close added in v1.9.0

func (p *TSParser) Close()

Close releases the tree-sitter parser's native resources.

func (*TSParser) ParseFile added in v1.9.0

func (p *TSParser) ParseFile(path string) (*ParseResult, error)

ParseFile reads a .ts or .tsx file and returns extracted symbols + edges. Uses the TSX grammar for .tsx files and the plain TypeScript grammar for everything else.

type TokenStat added in v1.9.0

type TokenStat struct {
	Token string
	DF    int
	IDF   float64
}

TokenStat is a per-token corpus statistic: document frequency (how many symbols contain this token) and precomputed IDF.
