crawl

package

Version: v0.4.2 (latest) | Published: Jun 19, 2026 | License: MIT | Imports: 25 | Imported by: 0

Repository

github.com/neumachen/gojira

Documentation ¶

Overview ¶

Package crawl is the recursive crawl orchestrator for gojira.

Composition contract ¶

crawl is the only package in the module that knows the end-to-end recursive workflow. It composes the following sibling packages, in this order for each issue:

internal/fetch — retrieve raw issue bytes from Jira Cloud
internal/parse — convert raw bytes to a typed Issue value
internal/extract — discover outbound references from the Issue
internal/render — convert the Issue to Markdown content
internal/output — write Markdown content to the filesystem

Events are emitted to internal/events throughout. Configuration is read from internal/config. classify is used to map classify.Kind constants to the string labels that render.OutboundRef expects.

What crawl does NOT own ¶

- HTTP transport (client owns it)
- JSON parsing (parse owns it)
- ADF traversal (adf and render own it)
- Filesystem layout decisions (output owns them)
- Flag/env parsing (cmd/gojira owns that)
- Event formatting (cmd/gojira's sink owns that)
- Signal handling (cmd/gojira owns that; crawl responds to ctx cancellation)

Sentinel error import deviation ¶

crawl imports client solely to use errors.Is against client's typed sentinel errors (ErrUnauthorized, ErrForbidden, ErrNotFound, ErrRateLimited). fetch propagates these sentinels unwrapped, so errors.Is works correctly. The design doc §5.1 lists only seven allowed imports; this is a documented, minimal deviation. crawl does not use any other symbol from client.

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type ChildDiscoverer ¶

type ChildDiscoverer interface {
	Children(ctx context.Context, issue parse.Issue) ([]string, error)
}

ChildDiscoverer is the interface the crawl orchestrator depends on for discovering hierarchy children of an already-fetched issue via JQL search. It is satisfied by *hierarchy.Discoverer in production; tests may substitute a fake.

Children must return the deduplicated, sorted set of child keys for issue. The crawl orchestrator treats an error as a per-issue non-fatal warning: the issue itself is still rendered, but a KindIssueFailed event is emitted with a "child discovery failed" message.
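A test double that honors the "deduplicated, sorted" contract might look like the sketch below. It is only an illustration: Issue is a local stand-in for parse.Issue (only the key is modelled), and fakeDiscoverer and childrenOf are hypothetical names that do not exist in the package.

```go
package main

import (
	"context"
	"fmt"
	"sort"
)

// Issue is a stand-in for parse.Issue; only the key is modelled here.
type Issue struct{ Key string }

// ChildDiscoverer mirrors the documented interface, with the local Issue
// substituted for parse.Issue so the sketch compiles on its own.
type ChildDiscoverer interface {
	Children(ctx context.Context, issue Issue) ([]string, error)
}

// fakeDiscoverer is a hypothetical test double backed by a static map.
type fakeDiscoverer struct {
	children map[string][]string
	err      error
}

// Children returns the deduplicated, sorted child keys for issue,
// or the configured error if one is set.
func (f fakeDiscoverer) Children(_ context.Context, issue Issue) ([]string, error) {
	if f.err != nil {
		return nil, f.err
	}
	seen := make(map[string]struct{})
	out := []string{}
	for _, k := range f.children[issue.Key] {
		if _, dup := seen[k]; dup {
			continue
		}
		seen[k] = struct{}{}
		out = append(out, k)
	}
	sort.Strings(out)
	return out, nil
}

// childrenOf is a small helper so callers need not build a context.
func childrenOf(d ChildDiscoverer, key string) ([]string, error) {
	return d.Children(context.Background(), Issue{Key: key})
}

func main() {
	d := fakeDiscoverer{children: map[string][]string{
		"EPIC-1": {"PROJ-9", "PROJ-2", "PROJ-9"},
	}}
	keys, err := childrenOf(d, "EPIC-1")
	fmt.Println(keys, err) // prints "[PROJ-2 PROJ-9] <nil>"
}
```

Setting the err field lets a test exercise the non-fatal warning path without any network access.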

type DevStatusEnricher ¶

type DevStatusEnricher interface {
	Enrich(ctx context.Context, issue parse.Issue) (parse.DevStatusData, error)
}

DevStatusEnricher is the interface the crawl orchestrator depends on for discovering the pull-request, branch, commit, repository, and build metadata associated with an already-fetched issue via Jira's Dev Status API. It is satisfied by *devstatus.Enricher in production; tests may substitute a fake.

Enrich must return a deduplicated parse.DevStatusData value for issue, or the zero value when enrichment is disabled or the issue has no associated entities. The production implementation queries every configured (application, dataType) pair unconditionally; there is no per-issue gate based on the customfield_10000 summary blob.

Errors are treated as per-issue non-fatal warnings by the crawl orchestrator: the issue is still rendered (with any partial entities that were collected), and an events.KindDevStatusPartialFailure event is emitted at WARN level with a "dev status enrichment failed" message. The issue is NOT counted as Failed in the crawl summary: only the enrichment was partial, while the issue itself succeeded. This distinction prevents log filtering and alerting from conflating a degraded external enrichment source with a genuine crawl-level failure.

type Summary ¶

type Summary struct {
	// Fetched is the number of issues successfully fetched, rendered,
	// and written to disk.
	Fetched int

	// Skipped is the number of issues that already existed on disk and
	// were not re-fetched (cfg.Refetch == false).
	Skipped int

	// Stubbed is the number of issues that could not be fetched due to
	// a 403 or 404 response; a stub index.md was written for each.
	Stubbed int

	// Failed is the number of issues that encountered an unrecoverable
	// per-issue error (not 401) and were NOT rendered (not even as a
	// stub). Rate-limited issues after retries exhausted fall here.
	Failed int

	// CapLimited is the number of issues that were discovered but not
	// fetched because an issue cap, depth cap, or context cancellation
	// prevented them from being enqueued or processed.
	CapLimited int

	// PRsFound is the count of distinct GitHub PR URLs discovered
	// across the entire crawl.
	PRsFound int

	// FetchedKeys lists the keys of successfully fetched issues,
	// sorted alphabetically.
	FetchedKeys []string

	// StubbedKeys lists the keys of stubbed issues, sorted
	// alphabetically.
	StubbedKeys []string

	// FailedKeys maps each failed issue key to a human-readable reason.
	FailedKeys map[string]string

	// CapLimitedKeys lists the keys of cap-limited issues, sorted
	// alphabetically.
	CapLimitedKeys []string

	// Duration is the wall-clock time elapsed during the crawl.
	Duration time.Duration

	// APICallCounts maps a phase label (fetch, hierarchy_jql, dev_status,
	// parse, render, store) to the number of times that phase ran across the
	// entire crawl. Surfaced for measurement attribution; not a stability
	// contract for downstream consumers.
	APICallCounts map[string]int

	// APITimeByPhase maps the same phase labels to the total wall-clock time
	// spent in that phase across the crawl. Useful for answering "where did
	// the 32s go?".
	APITimeByPhase map[string]time.Duration

	// TotalAPITime is the sum of APITimeByPhase values — total wall-clock
	// time spent in any instrumented phase across all issues, summed.
	TotalAPITime time.Duration
}

Summary is the structured result returned by Crawl after the run completes (successfully or partially). All counts are non-negative. Key slices are sorted alphabetically for determinism.
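As an illustration of how the timing fields relate, the sketch below recomputes the total from a per-phase map and ranks phases by time spent. It works on a plain map[string]time.Duration rather than the real Summary type; totalAPITime, phasesBySlowest, and sampleTimes are hypothetical helpers, and the durations are made-up sample values.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// totalAPITime sums the per-phase durations, matching the documented
// relationship between APITimeByPhase and TotalAPITime.
func totalAPITime(byPhase map[string]time.Duration) time.Duration {
	var total time.Duration
	for _, d := range byPhase {
		total += d
	}
	return total
}

// phasesBySlowest returns phase labels ordered by descending duration,
// breaking ties alphabetically so the result is deterministic.
func phasesBySlowest(byPhase map[string]time.Duration) []string {
	phases := make([]string, 0, len(byPhase))
	for p := range byPhase {
		phases = append(phases, p)
	}
	sort.Slice(phases, func(i, j int) bool {
		di, dj := byPhase[phases[i]], byPhase[phases[j]]
		if di != dj {
			return di > dj
		}
		return phases[i] < phases[j]
	})
	return phases
}

// sampleTimes returns made-up measurements for the demonstration.
func sampleTimes() map[string]time.Duration {
	return map[string]time.Duration{
		"fetch":         20 * time.Second,
		"hierarchy_jql": 6 * time.Second,
		"dev_status":    5 * time.Second,
		"render":        1 * time.Second,
	}
}

func main() {
	byPhase := sampleTimes()
	for _, p := range phasesBySlowest(byPhase) {
		fmt.Printf("%-14s %v\n", p, byPhase[p])
	}
	fmt.Println("total", totalAPITime(byPhase))
}
```

Because the phase labels are documented as not being a stability contract, a helper like this should iterate over whatever keys are present rather than hard-coding them.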

func Crawl ¶

func Crawl(
	ctx context.Context,
	cfg config.Config,
	startKeys []string,
	fetcher fetch.Fetcher,
	sink events.Sink,
) (Summary, error)

Crawl executes a recursive Jira issue crawl starting from startKeys.

It seeds a work queue with startKeys at depth 0, then runs up to cfg.Concurrency workers (minimum 1) that each pull a key from the queue, fetch the issue, parse it, extract outbound references, render Markdown, and write the output to disk.

Error handling policy ¶

401 Unauthorized: the crawl is aborted immediately. All in-flight workers finish their current fetch+render+write, then Crawl returns a partial Summary and an error wrapping client.ErrUnauthorized. The caller should map this to exit code 1.
403 Forbidden / 404 Not Found: a stub index.md is written for the issue. The crawl continues. The issue is counted in Stubbed.
Rate limited (429, retries exhausted): the issue is counted in Failed with reason "Rate limited (429) exhausted". No stub is written. The crawl continues.
Network/transport error (after client retries): a stub is written with reason "Fetch failed: <err>". The crawl continues.
Parse/render/output error: the issue is counted in Failed with the error message. No stub is written — these are bugs, not expected operational failures, and papering over them with a stub would hide the problem.
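The fetch-stage part of this policy can be sketched as a single classification function. This is a minimal illustration, not the package's code: the sentinel errors are local stand-ins for the ones exported by client, and action, fetchErrorAction, wrap, and errTimeout are hypothetical names. The reason strings are the two quoted in the policy; where the policy names no reason, the sketch returns an empty string.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the client package's sentinel errors.
var (
	ErrUnauthorized = errors.New("401 unauthorized")
	ErrForbidden    = errors.New("403 forbidden")
	ErrNotFound     = errors.New("404 not found")
	ErrRateLimited  = errors.New("429 rate limited")

	// errTimeout stands in for an arbitrary transport error.
	errTimeout = errors.New("dial tcp: i/o timeout")
)

// action is a hypothetical label for what the orchestrator does next.
type action string

const (
	actionAbort action = "abort" // stop the whole crawl
	actionStub  action = "stub"  // write a stub index.md and continue
	actionFail  action = "fail"  // count in Failed, no stub, continue
)

// fetchErrorAction maps a non-nil fetch error to an action and, where
// the policy names one, a reason string.
func fetchErrorAction(err error) (action, string) {
	switch {
	case errors.Is(err, ErrUnauthorized):
		return actionAbort, ""
	case errors.Is(err, ErrForbidden), errors.Is(err, ErrNotFound):
		return actionStub, ""
	case errors.Is(err, ErrRateLimited):
		return actionFail, "Rate limited (429) exhausted"
	default:
		return actionStub, "Fetch failed: " + err.Error()
	}
}

// wrap adds context with %w, which keeps errors.Is matching intact.
func wrap(key string, err error) error {
	return fmt.Errorf("fetch %s: %w", key, err)
}

func main() {
	for _, err := range []error{ErrUnauthorized, ErrNotFound, ErrRateLimited, errTimeout} {
		a, reason := fetchErrorAction(wrap("PROJ-1", err))
		fmt.Printf("%-6s %q\n", a, reason)
	}
}
```

Parse, render, and output errors are outside this function: per the policy they are always counted in Failed, regardless of their type.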

Skip-if-exists probe ¶

Before calling fetcher.Fetch, crawl checks whether <outputDir>/<KEY>/index.md already exists. If it does and cfg.Refetch is false, the issue is counted as Skipped without burning an API call. This probe lives in crawl (not output) because output.ErrAlreadyExists fires only after a fetch has already happened; the pre-fetch short-circuit is a crawl-level optimization.
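A pre-fetch probe of this kind could be sketched as follows. The function names shouldSkip and demo are hypothetical, and the sketch assumes the probe only needs a successful os.Stat on the index file; how the real code treats stat errors other than "not exist" is not stated in the documentation.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// shouldSkip reports whether <outputDir>/<key>/index.md already exists
// and refetching is disabled. Any stat error is treated as "not there"
// in this sketch.
func shouldSkip(outputDir, key string, refetch bool) bool {
	if refetch {
		return false
	}
	_, err := os.Stat(filepath.Join(outputDir, key, "index.md"))
	return err == nil
}

// demo exercises the probe against a temporary directory: before the
// file exists, after it exists, and with refetch forced on.
func demo() (before, after, forced bool, err error) {
	dir, err := os.MkdirTemp("", "crawl-probe")
	if err != nil {
		return false, false, false, err
	}
	defer os.RemoveAll(dir)

	before = shouldSkip(dir, "PROJ-1", false)

	issueDir := filepath.Join(dir, "PROJ-1")
	if err := os.MkdirAll(issueDir, 0o755); err != nil {
		return false, false, false, err
	}
	index := filepath.Join(issueDir, "index.md")
	if err := os.WriteFile(index, []byte("# PROJ-1\n"), 0o644); err != nil {
		return false, false, false, err
	}

	after = shouldSkip(dir, "PROJ-1", false)
	forced = shouldSkip(dir, "PROJ-1", true)
	return before, after, forced, nil
}

func main() {
	fmt.Println(demo()) // prints "false true false <nil>"
}
```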

Graceful shutdown ¶

When ctx is cancelled (SIGINT/SIGTERM from the CLI, a 401 abort, or a time cap), workers finish their in-flight fetch+render+write and exit. Items remaining in the queue are drained and counted as CapLimited.
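The drain-on-cancel behavior can be illustrated with a deliberately simplified, single-goroutine sketch. It is not the package's worker loop: runWorker and demo are hypothetical names, and processing is reduced to recording the key.

```go
package main

import (
	"context"
	"fmt"
)

// runWorker processes queued keys until the queue is empty or ctx is
// cancelled. On cancellation, the remaining keys are drained without
// being processed and reported as cap-limited.
func runWorker(ctx context.Context, queue chan string) (processed, capLimited []string) {
	for {
		// Check for cancellation before taking the next item.
		select {
		case <-ctx.Done():
			for {
				select {
				case k := <-queue:
					capLimited = append(capLimited, k)
				default:
					return processed, capLimited
				}
			}
		default:
		}

		select {
		case k := <-queue:
			processed = append(processed, k) // stand-in for fetch+render+write
		default:
			return processed, capLimited
		}
	}
}

// demo seeds three keys and optionally cancels before the worker runs.
func demo(cancelFirst bool) (processed, capLimited int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	queue := make(chan string, 3)
	for _, k := range []string{"PROJ-1", "PROJ-2", "PROJ-3"} {
		queue <- k
	}
	if cancelFirst {
		cancel()
	}
	p, c := runWorker(ctx, queue)
	return len(p), len(c)
}

func main() {
	fmt.Println(demo(false)) // prints "3 0"
	fmt.Println(demo(true))  // prints "0 3"
}
```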

Concurrency model ¶

Workers are plain goroutines managed with sync.WaitGroup. The work queue is a buffered channel. An atomic "pending" counter tracks the number of items that are either in the queue or being processed by a worker. When pending reaches zero, the queue channel is closed (under c.mu), which causes all workers to exit their range loop.

All sends to the queue channel happen inside enqueue, which is always called with c.mu held. The channel is closed inside closeQueueLocked, which is also always called with c.mu held. This guarantees that close(c.queue) never races with c.queue <- item.

A dedicated "closer" goroutine watches for context cancellation and drains + closes the queue when the context is done. The visited map and summary are protected by c.mu.
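The pending-counter pattern described above can be sketched in isolation. This is a simplified model under stated assumptions, not the package's implementation: crawler, item, finish, and crawlAll are hypothetical, the "fetch" is a lookup in a static link map, the context-watching closer goroutine is omitted, and the channel is sized so that a send under the mutex can never block.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

type item struct {
	key   string
	depth int
}

type crawler struct {
	mu      sync.Mutex
	queue   chan item
	closed  bool
	pending atomic.Int64 // items queued or currently being processed
	visited map[string]bool
	done    []string
}

// enqueue must be called with c.mu held.
func (c *crawler) enqueue(it item) {
	if c.closed || c.visited[it.key] {
		return
	}
	c.visited[it.key] = true
	c.pending.Add(1)
	c.queue <- it // never blocks: capacity covers every possible send
}

// closeQueueLocked must be called with c.mu held, so close never races
// with a send in enqueue.
func (c *crawler) closeQueueLocked() {
	if !c.closed {
		c.closed = true
		close(c.queue)
	}
}

// finish records a processed item, enqueues its children, and closes
// the queue once nothing is queued or in flight.
func (c *crawler) finish(it item, children []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.done = append(c.done, it.key)
	for _, k := range children {
		c.enqueue(item{key: k, depth: it.depth + 1})
	}
	if c.pending.Add(-1) == 0 {
		c.closeQueueLocked()
	}
}

// crawlAll visits every key reachable from start and returns them sorted.
func crawlAll(links map[string][]string, start []string, workers int) []string {
	if workers < 1 {
		workers = 1
	}
	capacity := len(start)
	for _, cs := range links {
		capacity += len(cs)
	}
	c := &crawler{queue: make(chan item, capacity), visited: make(map[string]bool)}

	c.mu.Lock()
	for _, k := range start {
		c.enqueue(item{key: k})
	}
	if c.pending.Load() == 0 {
		c.closeQueueLocked()
	}
	c.mu.Unlock()

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for it := range c.queue {
				c.finish(it, links[it.key])
			}
		}()
	}
	wg.Wait()
	sort.Strings(c.done)
	return c.done
}

func main() {
	links := map[string][]string{"A": {"B", "C"}, "B": {"C", "D"}, "D": {"A"}}
	fmt.Println(crawlAll(links, []string{"A"}, 3)) // prints "[A B C D]"
}
```

The counter is decremented only after an item's children have been enqueued, so it cannot reach zero while reachable work is still undiscovered.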

func CrawlGraphWithEnrichers ¶ added in v0.2.0

func CrawlGraphWithEnrichers(
	ctx context.Context,
	cfg config.Config,
	startKeys []string,
	fetcher fetch.Fetcher,
	sink events.Sink,
	hier ChildDiscoverer,
	devStatus DevStatusEnricher,
	store output.Store,
	logger *slog.Logger,
) (Summary, graph.Model, error)

CrawlGraphWithEnrichers is the in-memory graph variant of CrawlWithEnrichers. It runs the same crawl with graph collection FORCED ON (independent of cfg.EmitGraph) and SUPPRESSES the post-loop disk write of graph.json / graph.d2, returning the collected graph.Model alongside the Summary instead.

All other observable behavior — events, summary, exit-code mapping of errors — is byte-identical to CrawlWithEnrichers. This is the entry point the gRPC GetGraph handler uses and is also useful to library callers who want the graph in memory without touching the filesystem.

func CrawlWithDiscoverer ¶

func CrawlWithDiscoverer(
	ctx context.Context,
	cfg config.Config,
	startKeys []string,
	fetcher fetch.Fetcher,
	sink events.Sink,
	hier ChildDiscoverer,
) (Summary, error)

CrawlWithDiscoverer is the extended entry point that additionally accepts a ChildDiscoverer for hierarchy expansion.

When hier is nil OR cfg.IncludeChildren is false, hierarchy discovery is skipped and the crawl behaves exactly like the legacy Crawl function. When both are present, every hierarchy-capable issue (per hierarchy.HierarchyCapable) has its JQL children fetched after the fetch+parse+extract sequence completes and the resulting keys are enqueued at depth+1, subject to the usual issue and depth caps.

This entry point is preserved as a thin wrapper over CrawlWithEnrichers for callers that only need hierarchy expansion. Tests that do not care about hierarchy may continue to call Crawl.

func CrawlWithEnrichers ¶

func CrawlWithEnrichers(
	ctx context.Context,
	cfg config.Config,
	startKeys []string,
	fetcher fetch.Fetcher,
	sink events.Sink,
	hier ChildDiscoverer,
	devStatus DevStatusEnricher,
	store output.Store,
	logger *slog.Logger,
) (Summary, error)

CrawlWithEnrichers is the extended entry point that accepts both a ChildDiscoverer for hierarchy expansion and a DevStatusEnricher for Dev Status pull-request, branch, commit, repository, and build enrichment.

hier and devStatus are independent: passing nil for either disables that enrichment (also independently gated by cfg.IncludeChildren and cfg.IncludeDevStatus respectively). The gojira facade constructs both from the shared *client.Client and supplies them here.

The store parameter is additive over the previous signature: it selects the destination for rendered Markdown. Passing nil defaults to an output.FSStore rooted at cfg.OutputDir, preserving the historical on-disk behavior (skip-if-exists vs. refetch semantics continue to be honored). Alternative output.Store implementations can be injected by callers that want to deliver crawl output to a non-filesystem destination (e.g. an in-memory buffer or a future service front-end).

The logger parameter is additive over the previous signature: it is the orchestrator's *slog.Logger sink for the structured span instrumentation (crawl.start / issue.process.start / phase.* / issue.process.end / crawl.end / crawl.measurement). Passing nil defaults to a no-op logger, preserving the prior behavior of emitting nothing for callers that have not yet adopted the observability instrument.
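The nil-defaulting behavior described for logger can be approximated with the standard library alone. The helper name loggerOrNoop is hypothetical, and building the no-op logger from a text handler that writes to io.Discard is an assumption; the package may construct its no-op logger differently.

```go
package main

import (
	"fmt"
	"io"
	"log/slog"
)

// loggerOrNoop returns l unchanged when it is non-nil, and otherwise a
// logger whose output is discarded.
func loggerOrNoop(l *slog.Logger) *slog.Logger {
	if l != nil {
		return l
	}
	return slog.New(slog.NewTextHandler(io.Discard, nil))
}

func main() {
	l := loggerOrNoop(nil)
	l.Info("crawl.start", "keys", 3) // written to io.Discard, so nothing appears
	fmt.Println(l != nil, loggerOrNoop(l) == l) // prints "true true"
}
```

Defaulting once at the entry point lets the rest of the orchestrator log unconditionally instead of nil-checking at every call site.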

The signature is additive over CrawlWithDiscoverer; existing callers that only need hierarchy expansion are unaffected.

Source Files ¶

crawl.go
