ingest

package
v0.6.8
Published: Apr 18, 2026 | License: Apache-2.0 | Imports: 31 | Imported by: 0

Documentation

Constants

This section is empty.

Variables

var MaxIngestFileSize int64 = 100 << 20 // 100 MB

MaxIngestFileSize is the largest file that will be read into memory during ingestion or schema inference. Files above this limit are silently skipped. Set it to 0 to disable the size limit. Configurable via --max-file-size.

Binary content is detected separately by the unexported isBinaryFile helper, which returns true if the file appears to contain binary content. It uses the same heuristic as git: if the first 512 bytes contain a null byte, the file is binary. SQLite files (.db) are handled before that check is called.

Functions

func DetectLanguageFromExt added in v0.2.0

func DetectLanguageFromExt(ext string) (langName string, grammar *sitter.Language, ok bool)

DetectLanguageFromExt returns the language name and tree-sitter Language for a given file extension. Returns ok=false for unsupported extensions. Thin wrapper over lang.ForExt for backward compatibility.

func FlattenAST added in v0.1.1

func FlattenAST(root *sitter.Node) []any

FlattenAST walks the tree and returns a list of records for FCA analysis.

func FlattenASTWithLanguage added in v0.2.0

func FlattenASTWithLanguage(root *sitter.Node, langName string) []any

FlattenASTWithLanguage walks the tree and returns records for FCA analysis, using language-specific enrichment if available.

func GetGitHints added in v0.2.0

func GetGitHints() map[string]string

GetGitHints returns the default inference hints for Git repositories.

func GetLanguage added in v0.2.0

func GetLanguage(langName string) *sitter.Language

GetLanguage returns the tree-sitter language for a language name string. Returns nil for unsupported languages.

Deprecated: use lang.ForName(name).Grammar() instead.

func LoadFileIndex added in v0.6.1

func LoadFileIndex(dbPath string) (map[string]FileIndexEntry, error)

LoadFileIndex reads the file_index table from an existing index database. Returns a map of path → (modTime, size) for incremental comparison.

func LoadGitCommits added in v0.2.0

func LoadGitCommits(repoPath string) ([]any, error)

LoadGitCommits loads all commits from a repository using git log.

func LoadGitignore added in v0.6.3

func LoadGitignore(rootDir string) *gitignoreMatcher

LoadGitignore reads .gitignore from rootDir and discovers nested .gitignore files in the tree. Returns nil if no .gitignore exists at all.

func LoadSQLite

func LoadSQLite(dbPath string) ([]any, error)

LoadSQLite opens a SQLite database, reads all records from the results table, parses each JSON record, and returns them as a slice. Kept for backward compatibility with tests; prefer StreamSQLite for large datasets.

func ParseSize added in v0.5.6

func ParseSize(s string) (int64, error)

ParseSize parses a human-readable size string (e.g. "100MB", "1GB", "0"). Returns bytes. Supported suffixes: KB, MB, GB (case-insensitive).
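To make the accepted input shapes concrete, here is a rough sketch of a parser with the documented behavior. It is not the package's implementation: `parseSize` is a hypothetical name, and the sketch assumes binary multiples (1 KB = 1024 bytes, consistent with the default limit being written as `100 << 20`), treats a bare number as bytes, and rejects negative values. Overflow is not checked.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSize converts strings such as "100MB", "1gb", or "0" to a byte count.
// Assumption: KB, MB, and GB are powers of 1024.
func parseSize(s string) (int64, error) {
	u := strings.ToUpper(strings.TrimSpace(s))
	mult := int64(1)
	switch {
	case strings.HasSuffix(u, "KB"):
		mult, u = 1<<10, strings.TrimSuffix(u, "KB")
	case strings.HasSuffix(u, "MB"):
		mult, u = 1<<20, strings.TrimSuffix(u, "MB")
	case strings.HasSuffix(u, "GB"):
		mult, u = 1<<30, strings.TrimSuffix(u, "GB")
	}
	n, err := strconv.ParseInt(strings.TrimSpace(u), 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid size %q: %w", s, err)
	}
	if n < 0 {
		return 0, fmt.Errorf("invalid size %q: negative value", s)
	}
	return n * mult, nil
}

func main() {
	for _, in := range []string{"100MB", "1gb", "0", "abc"} {
		n, err := parseSize(in)
		fmt.Println(in, n, err)
	}
}
```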

func RegisterAddressRefQuery added in v0.6.2

func RegisterAddressRefQuery(langName, scheme, query string)

RegisterAddressRefQuery registers an address-aware ref extraction query for a specific language. The query must capture values as @ref. When matched, captured strings are unquoted (if quoted) and prefixed with scheme + ":" before being emitted as ref tokens.

Multiple queries can be registered per language by calling this function multiple times; entries are appended.
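The token shaping described above (unquote the capture if it is quoted, then prefix it with scheme + ":") might look roughly like this. The helper names `toRefToken` and `dedupe` are hypothetical, and the sketch uses `strconv.Unquote`, which only understands Go-style literals; how the package unquotes strings from other languages is not stated here. The deduplication step reflects the ExtractAddressRefs methods documented on this page, which return deduplicated tokens.

```go
package main

import (
	"fmt"
	"strconv"
)

// toRefToken unquotes captured when it is a Go-style quoted literal,
// then prefixes the result with scheme + ":".
func toRefToken(scheme, captured string) string {
	if unq, err := strconv.Unquote(captured); err == nil {
		captured = unq
	}
	return scheme + ":" + captured
}

// dedupe removes repeated tokens while keeping first-seen order.
func dedupe(tokens []string) []string {
	seen := make(map[string]bool, len(tokens))
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	captures := []string{`"DATABASE_URL"`, "HOME", `"DATABASE_URL"`}
	var tokens []string
	for _, c := range captures {
		tokens = append(tokens, toRefToken("env", c))
	}
	fmt.Println(dedupe(tokens))
}
```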

func RegisterContextQuery added in v0.2.0

func RegisterContextQuery(langName, query string)

RegisterContextQuery registers a context extraction query for a specific language. This should be called during initialization.

func RegisterQualifiedCallQuery added in v0.2.0

func RegisterQualifiedCallQuery(langName, query string)

RegisterQualifiedCallQuery registers a call extraction query that captures both @call (function name) and @pkg (package qualifier) for a language.

func RegisterRefQuery added in v0.2.0

func RegisterRefQuery(langName, query string)

RegisterRefQuery registers a reference extraction query for a specific language. This should be called during initialization.

func RenderTemplate

func RenderTemplate(tmpl string, values map[string]any) (string, error)

RenderTemplate delegates to internal/template.Render. Kept as a public alias for backward compatibility with existing callers.

func RenderTemplateWithFuncs added in v0.6.2

func RenderTemplateWithFuncs(tmpl string, values map[string]any, extraFuncs template.FuncMap, cache *sync.Map) (string, error)

RenderTemplateWithFuncs delegates to internal/template.RenderWithFuncs.

func SchemaUsesTreeSitter added in v0.2.0

func SchemaUsesTreeSitter(schema *api.Topology) bool

SchemaUsesTreeSitter returns true if the schema's selectors are tree-sitter S-expressions rather than JSONPath. S-expressions always start with '('.
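The distinguishing rule is simple enough to show on a single selector string. The sketch below is illustrative only: `looksLikeSExpr` is a hypothetical helper that operates on one selector rather than on an `api.Topology`, and trimming leading whitespace before the check is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeSExpr reports whether selector begins with '(' once leading
// whitespace is ignored, which marks it as a tree-sitter S-expression.
func looksLikeSExpr(selector string) bool {
	return strings.HasPrefix(strings.TrimSpace(selector), "(")
}

func main() {
	fmt.Println(looksLikeSExpr("(function_declaration name: (identifier) @name) @scope"))
	fmt.Println(looksLikeSExpr("$.items[*]"))
}
```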

func ShouldSkipDir added in v0.6.0

func ShouldSkipDir(base string) bool

ShouldSkipDir returns true for hidden dirs and common build artifact directories.

func ShouldSkipFile added in v0.5.6

func ShouldSkipFile(path string, size int64) bool

ShouldSkipFile returns true if the file should not be ingested. Checks extension blocklist, size limit, and binary content.

func StreamSQLite

func StreamSQLite(dbPath string, fn func(recordID string, record any) error) error

StreamSQLite iterates over all records in a SQLite database, calling fn for each one. Only one parsed record is alive at a time, keeping memory usage constant.

func StreamSQLiteRaw

func StreamSQLiteRaw(dbPath string, fn func(id, raw string) error) error

StreamSQLiteRaw iterates over all records yielding raw (id, json) strings without parsing. Used by the parallel ingestion pipeline where workers handle JSON parsing on their own goroutines.
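The division of labor described above, where a producer yields raw (id, json) pairs and workers parse them on their own goroutines, can be sketched without a database. Everything here is an assumption for illustration: `rawRecord` and `parseParallel` are hypothetical names, and an in-memory slice stands in for the callback that StreamSQLiteRaw would drive.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// rawRecord stands in for one (id, raw JSON) pair.
type rawRecord struct{ ID, JSON string }

// parseParallel fans raw records out to worker goroutines that each
// unmarshal JSON, and collects parsed values by record ID.
func parseParallel(raws []rawRecord, workers int) (map[string]any, error) {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan rawRecord)
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
		out      = make(map[string]any, len(raws))
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range jobs {
				var v any
				err := json.Unmarshal([]byte(r.JSON), &v)
				mu.Lock()
				if err != nil {
					if firstErr == nil {
						firstErr = fmt.Errorf("record %s: %w", r.ID, err)
					}
				} else {
					out[r.ID] = v
				}
				mu.Unlock()
			}
		}()
	}
	for _, r := range raws { // producer side: would be the streaming callback
		jobs <- r
	}
	close(jobs)
	wg.Wait()
	return out, firstErr
}

func main() {
	out, err := parseParallel([]rawRecord{{"a", `{"n":1}`}, {"b", `[1,2]`}}, 2)
	fmt.Println(len(out), err)
}
```

Because the channel is unbuffered, the producer never runs far ahead of the workers, which keeps the number of raw strings held in memory small.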

Types

type ASTRoot added in v0.6.7

type ASTRoot struct {
	DB           *sql.DB
	SourceID     string // which source file (key into _source)
	ParentPrefix string // scope queries to children under this prefix
}

ASTRoot is the root context for ASTWalker queries. It scopes queries to a subtree of the AST via its ParentPrefix field.

type ASTWalker added in v0.6.7

type ASTWalker struct {
	// contains filtered or unexported fields
}

ASTWalker implements Walker by querying _ast and nodes tables produced by ley-line's ll-open/ts crate. This eliminates the CGO dependency on tree-sitter Go bindings — the AST was already parsed by Rust and stored in SQLite. Mache reads it via sqlite3_deserialize (zero-copy).

See ADR-014 for the design rationale.

func NewASTWalker added in v0.6.7

func NewASTWalker(db *sql.DB) *ASTWalker

NewASTWalker creates a walker backed by a SQLite database containing ley-line's _ast, _source, and nodes tables.

func (*ASTWalker) Close added in v0.6.7

func (w *ASTWalker) Close()

Close is a no-op — the ASTWalker doesn't own the database connection.

func (*ASTWalker) EnsureIndexes added in v0.6.8

func (w *ASTWalker) EnsureIndexes() error

EnsureIndexes creates compound indexes on the _ast table for query performance. Call once after opening the DB, before concurrent queries. Transforms findNodesByKind from O(N) full table scan to O(K) index lookup. Returns an error if the index cannot be created (e.g., no _ast table, read-only DB, or connection pool exhausted).

func (*ASTWalker) ExtractAddressRefs added in v0.6.7

func (w *ASTWalker) ExtractAddressRefs(sourcePath, langName string) ([]string, error)

ExtractAddressRefs runs all registered address ref queries for the given language by querying the _ast table. Returns deduplicated, scheme-prefixed tokens (e.g., "env:DATABASE_URL"). Mirrors SitterWalker.ExtractAddressRefs but uses SQL instead of CGO tree-sitter.

func (*ASTWalker) Query added in v0.6.7

func (w *ASTWalker) Query(root any, selector string) ([]Match, error)

Query implements Walker. The selector is a tree-sitter S-expression pattern. ASTWalker translates it to SQL queries against the nodes and _ast tables.

Currently supports the common pattern: (node_kind field: (child_kind) @capture) @scope, plus simple #eq? predicates over captured text. #match? requires SitterWalker.

type Engine

type Engine struct {
	Schema           *api.Topology
	Store            IngestionTarget
	RootPath         string // absolute path to the root of the ingestion
	RespectGitignore bool   // when true, skip files matching .gitignore patterns (default: true)
	// contains filtered or unexported fields
}

Engine drives the ingestion process.

func NewEngine

func NewEngine(schema *api.Topology, store IngestionTarget) *Engine

func (*Engine) DiagramFuncMap added in v0.6.2

func (e *Engine) DiagramFuncMap() template.FuncMap

DiagramFuncMap returns a template.FuncMap containing the {{diagram "name"}} function. The returned FuncMap is built once via sync.Once and reused; the closure inside captures the Engine and lazily initializes community data on first call. Safe for concurrent use.

func (*Engine) Gitignore added in v0.6.3

func (e *Engine) Gitignore() GitignoreMatcher

Gitignore returns the gitignore matcher loaded during Ingest, or nil if none was loaded. Pass this to WithGitignore when creating a Watcher so the watcher skips the same directories the engine does.

func (*Engine) Ingest

func (e *Engine) Ingest(path string) error

Ingest processes a file or directory. Safe to call multiple times — internal dedup state is reset on each call.

func (*Engine) IngestRecords added in v0.2.0

func (e *Engine) IngestRecords(records []any) error

IngestRecords processes in-memory records (e.g. from Git).

func (*Engine) PrintRoutingSummary added in v0.2.0

func (e *Engine) PrintRoutingSummary()

PrintRoutingSummary outputs a summary of files routed to _project_files/.

func (*Engine) ReIngestFile added in v0.6.0

func (e *Engine) ReIngestFile(path string) error

ReIngestFile re-ingests a single file, preserving the existing RootPath. Used by the live graph refresher to update stale nodes without a full walk. After re-ingestion, the store's file mtime is updated.

func (*Engine) RenderContentTemplate added in v0.6.2

func (e *Engine) RenderContentTemplate(tmpl string, values map[string]any) (string, error)

RenderContentTemplate renders a content template with the standard mache functions plus the Engine's diagram function. This is the method that processNode and collectNodes should use for file content rendering.

func (*Engine) SetASTWalker added in v0.6.7

func (e *Engine) SetASTWalker(w *ASTWalker)

SetASTWalker configures the engine to use a SQL-backed ASTWalker for tree-sitter schema selectors instead of CGO SitterWalker. When set, source file parsing is skipped — the ASTWalker queries pre-parsed _ast/_source tables from a ley-line .db.

func (*Engine) SetFileIndex added in v0.6.1

func (e *Engine) SetFileIndex(index map[string]FileIndexEntry)

SetFileIndex sets a cached file index for incremental re-ingestion. Files matching (path, mtime, size) will be skipped during ingestion.

type FileIndexEntry added in v0.6.1

type FileIndexEntry struct {
	ModTime time.Time
	Size    int64
}

FileIndexEntry stores cached file metadata for incremental comparison.

type GitignoreMatcher added in v0.6.3

type GitignoreMatcher interface {
	Match(rel string, isDir bool) bool
}

GitignoreMatcher matches paths against .gitignore-style rules.

type IngestionTarget

type IngestionTarget interface {
	graph.Graph
	AddNode(n *graph.Node)
	AddRoot(n *graph.Node)
	AddRef(token, nodeID string) error
	AddDef(token, dirID string) error
	DeleteFileNodes(filePath string)
	AddFileChildren(parent *graph.Node, files []*graph.Node)
}

IngestionTarget combines Graph reading with writing capabilities.

type JsonWalker

type JsonWalker struct{}

JsonWalker implements Walker for JSON-like data.

func NewJsonWalker

func NewJsonWalker() *JsonWalker

func (*JsonWalker) Query

func (w *JsonWalker) Query(root any, selector string) ([]Match, error)

Query implements Walker.

type Match

type Match interface {
	// Values returns the captured values.
	// For Tree-sitter, these are the named captures from the query (e.g., "res.type" -> "aws_s3_bucket").
	// For JSONPath, if the match is an object, its fields are returned as values.
	// If the match is a primitive, it might be returned under a default key (e.g., "value").
	Values() map[string]any

	// Context returns the underlying object/node to be used as the root for child queries.
	// For JSONPath, this is the matched object.
	// For Tree-sitter, this is the node captured as @scope (or similar convention).
	Context() any
}

Match represents a single result from a query. It provides a map of values that can be used to render path templates.
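As an illustration of how a Match's values can feed a path template, the sketch below defines a local interface with the same two methods, a map-backed implementation, and a renderer. The names `match`, `mapMatch`, and `renderPath` are hypothetical, and the use of Go's `text/template` syntax is an assumption; the package's own renderer may add functions or behave differently.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// match is a local copy of the two-method Match interface.
type match interface {
	Values() map[string]any
	Context() any
}

// mapMatch wraps a JSON-like object: its fields are the values,
// and the object itself is the context for child queries.
type mapMatch struct{ obj map[string]any }

func (m mapMatch) Values() map[string]any { return m.obj }
func (m mapMatch) Context() any           { return m.obj }

// renderPath executes tmpl against the match's values and fails on missing keys.
func renderPath(tmpl string, m match) (string, error) {
	t, err := template.New("path").Option("missingkey=error").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, m.Values()); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	m := mapMatch{obj: map[string]any{"type": "aws_s3_bucket", "name": "logs"}}
	p, err := renderPath("{{.type}}/{{.name}}", m)
	fmt.Println(p, err)
}
```

Keys that contain a dot, such as "res.type", cannot be reached with `{{.res.type}}` in `text/template`; `{{index . "res.type"}}` would be needed under this assumption.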

type OriginProvider

type OriginProvider interface {
	CaptureOrigin(name string) (startByte, endByte uint32, ok bool)
}

OriginProvider is an optional interface that Match implementations can satisfy to expose source byte ranges for write-back. Type-asserted in engine, not required by JSON walker.

type SQLiteWriter added in v0.2.0

type SQLiteWriter struct {
	// contains filtered or unexported fields
}

SQLiteWriter implements IngestionTarget for the new high-performance schema.

func NewSQLiteWriter added in v0.2.0

func NewSQLiteWriter(dbPath string) (*SQLiteWriter, error)

NewSQLiteWriter creates a new writer and initializes the schema.

func (*SQLiteWriter) Act added in v0.5.0

func (w *SQLiteWriter) Act(id, action, payload string) (*graph.ActionResult, error)

func (*SQLiteWriter) AddDef added in v0.2.0

func (w *SQLiteWriter) AddDef(token, dirID string) error

func (*SQLiteWriter) AddFileChildren added in v0.6.8

func (w *SQLiteWriter) AddFileChildren(parent *graph.Node, files []*graph.Node)

func (*SQLiteWriter) AddNode added in v0.2.0

func (w *SQLiteWriter) AddNode(n *graph.Node)

AddNode writes a node to the database.

func (*SQLiteWriter) AddRef added in v0.2.0

func (w *SQLiteWriter) AddRef(token, nodeID string) error

func (*SQLiteWriter) AddRoot added in v0.2.0

func (w *SQLiteWriter) AddRoot(n *graph.Node)

func (*SQLiteWriter) Close added in v0.2.0

func (w *SQLiteWriter) Close() error

func (*SQLiteWriter) DeleteFileNodes added in v0.2.0

func (w *SQLiteWriter) DeleteFileNodes(filePath string)

func (*SQLiteWriter) GetCallees added in v0.2.0

func (w *SQLiteWriter) GetCallees(id string) ([]*graph.Node, error)

func (*SQLiteWriter) GetCallers added in v0.2.0

func (w *SQLiteWriter) GetCallers(token string) ([]*graph.Node, error)

func (*SQLiteWriter) GetNode added in v0.2.0

func (w *SQLiteWriter) GetNode(id string) (*graph.Node, error)

func (*SQLiteWriter) Invalidate added in v0.2.0

func (w *SQLiteWriter) Invalidate(id string)

func (*SQLiteWriter) ListChildStats added in v0.6.8

func (w *SQLiteWriter) ListChildStats(id string) ([]graph.NodeStat, error)

func (*SQLiteWriter) ListChildren added in v0.2.0

func (w *SQLiteWriter) ListChildren(id string) ([]string, error)

func (*SQLiteWriter) ReadContent added in v0.2.0

func (w *SQLiteWriter) ReadContent(id string, buf []byte, offset int64) (int, error)

func (*SQLiteWriter) RecordFile added in v0.6.1

func (w *SQLiteWriter) RecordFile(path string, modTime time.Time, size int64)

RecordFile stores file metadata for incremental re-ingestion. On subsequent mounts, files with matching (path, mod_time, size) are skipped.

type SitterRoot

type SitterRoot struct {
	Node     *sitter.Node
	FileRoot *sitter.Node // The top-level file node (for global context)
	Source   []byte
	Lang     *sitter.Language
	LangName string // "go", "python", "hcl", etc.
}

SitterRoot encapsulates the necessary context for querying a Tree-sitter tree. It includes the root node, the source code (for extracting content), and the language (for compiling the query).

type SitterWalker

type SitterWalker struct {
	// contains filtered or unexported fields
}

SitterWalker implements Walker for Tree-sitter parsed code.

func NewSitterWalker

func NewSitterWalker() *SitterWalker

func (*SitterWalker) Close added in v0.6.1

func (w *SitterWalker) Close()

Close releases all cached compiled queries. Call when the SitterWalker is no longer needed (e.g., after ingestion completes).

func (*SitterWalker) ExtractAddressRefs added in v0.6.2

func (w *SitterWalker) ExtractAddressRefs(root *sitter.Node, source []byte, lang *sitter.Language, langName string) ([]string, error)

ExtractAddressRefs runs all registered address ref queries for the given language against the AST node. Returns deduplicated, scheme-prefixed tokens (e.g., "env:DATABASE_URL"). String captures are automatically unquoted.

func (*SitterWalker) ExtractCalls

func (w *SitterWalker) ExtractCalls(root *sitter.Node, source []byte, lang *sitter.Language, langName string) ([]string, error)

ExtractCalls finds all function calls in the given node using a predefined query. The compiled query is cached per language to avoid recompilation on every call.

func (*SitterWalker) ExtractContext added in v0.2.0

func (w *SitterWalker) ExtractContext(root *sitter.Node, source []byte, lang *sitter.Language, langName string) ([]byte, error)

ExtractContext finds package-level context nodes.

func (*SitterWalker) ExtractGoImports added in v0.6.7

func (w *SitterWalker) ExtractGoImports(root *sitter.Node, source []byte, lang *sitter.Language) map[string]string

ExtractGoImports extracts structured import mappings from a Go AST. Returns a map of alias → import path (e.g., "fmt" → "fmt", "mypkg" → "github.com/foo/bar"). For unaliased imports, the alias is the last path segment.
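The alias → import path mapping can be reproduced with the standard library for comparison. Note the swapped-in technique: this sketch uses `go/parser` rather than tree-sitter, and `importAliases` is a hypothetical helper, not part of this package. It follows the documented rule that an unaliased import is keyed by its last path segment.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"path"
	"strconv"
)

// importAliases parses the import block of src and maps each alias to its
// import path. Unaliased imports use the last path segment as the alias.
func importAliases(src string) (map[string]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "src.go", src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	out := make(map[string]string, len(f.Imports))
	for _, spec := range f.Imports {
		p, err := strconv.Unquote(spec.Path.Value) // Path.Value is a quoted literal
		if err != nil {
			return nil, err
		}
		alias := path.Base(p)
		if spec.Name != nil {
			alias = spec.Name.Name
		}
		out[alias] = p
	}
	return out, nil
}

func main() {
	src := `package x

import (
	"fmt"
	mypkg "github.com/foo/bar"
)
`
	m, err := importAliases(src)
	fmt.Println(m, err)
}
```

The last path segment is only a heuristic for the package name; versioned paths ending in a segment such as /v2 are one case where the two differ.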

func (*SitterWalker) ExtractQualifiedCalls added in v0.2.0

func (w *SitterWalker) ExtractQualifiedCalls(root *sitter.Node, source []byte, lang *sitter.Language, langName string) ([]graph.QualifiedCall, error)

ExtractQualifiedCalls finds all function calls with optional package qualifiers. For languages with a registered qualified call query, returns QualifiedCall with both Token and Qualifier. For others, falls back to ExtractCalls (bare tokens).

func (*SitterWalker) Query

func (w *SitterWalker) Query(root any, selector string) ([]Match, error)

Query implements Walker.

type Walker

type Walker interface {
	// Query executes a selector (query) against the given root node and returns a list of matches.
	// The root node can be a *sitter.Node (for code) or a generic Go object (for data).
	Query(root any, selector string) ([]Match, error)
}

Walker abstracts over JSONPath (Data) and Tree-sitter (Code). It provides a unified way to query a tree-like structure and extract values for path templating.

func SelectWalker added in v0.6.7

func SelectWalker(db *sql.DB) (Walker, error)

SelectWalker inspects a SQLite database and returns the best Walker. If the database has an _ast table (produced by ley-line's ll-open/ts), returns an ASTWalker (pure Go, no CGO). Otherwise returns a SitterWalker (requires CGO tree-sitter bindings).

type Watcher added in v0.6.1

type Watcher struct {
	// contains filtered or unexported fields
}

Watcher monitors a directory tree for file changes and invokes callbacks when source files are created/modified or deleted. It debounces rapid changes so that a burst of writes to the same file produces a single callback after a quiet period.

func NewWatcher added in v0.6.1

func NewWatcher(rootDir string, onChange, onDelete func(path string), opts ...WatcherOption) (*Watcher, error)

NewWatcher creates a file watcher on rootDir. onChange is called for created/modified files; onDelete is called for removed files. Both callbacks receive the absolute file path. Hidden files, .git directories, and non-source extensions are ignored.

func (*Watcher) Stop added in v0.6.1

func (w *Watcher) Stop()

Stop shuts down the watcher. Safe to call concurrently and multiple times.

type WatcherOption added in v0.6.1

type WatcherOption func(*Watcher)

WatcherOption configures a Watcher.

func WithDebounce added in v0.6.1

func WithDebounce(d time.Duration) WatcherOption

WithDebounce sets the quiet period before a change callback fires. Defaults to 100ms.

func WithGitignore added in v0.6.3

func WithGitignore(gi GitignoreMatcher) WatcherOption

WithGitignore configures the watcher to skip directories matching gitignore rules. This prevents watching build artifact directories (target/, dist/, node_modules/) that would otherwise consume thousands of kqueue FDs on macOS.
