crawler

package
v1.3.3
Published: Feb 21, 2026 License: Apache-2.0 Imports: 29 Imported by: 0

Documentation

Overview

FILE: pkg/crawler/crawler.go

FILE: pkg/crawler/output.go

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler orchestrates the web crawling process for a single configured site.

func NewCrawler

func NewCrawler(
	appCfg *config.AppConfig,
	siteCfg *config.SiteConfig,
	siteKey string,
	baseLogger *logrus.Entry,
	store storage.VisitedStore,
	fetcher fetch.HTTPFetcher,
	rateLimiter *fetch.RateLimiter,
	crawlCtx context.Context,
	cancelCrawl context.CancelFunc,
	resume bool,
) (*Crawler, error)

NewCrawler creates and initializes a new Crawler instance and its components.

func NewCrawlerWithOptions

func NewCrawlerWithOptions(
	appCfg *config.AppConfig,
	siteCfg *config.SiteConfig,
	siteKey string,
	baseLogger *logrus.Entry,
	store storage.VisitedStore,
	fetcher fetch.HTTPFetcher,
	rateLimiter *fetch.RateLimiter,
	crawlCtx context.Context,
	cancelCrawl context.CancelFunc,
	resume bool,
	opts *CrawlerOptions,
) (*Crawler, error)

NewCrawlerWithOptions creates a new Crawler with optional configuration.

func (*Crawler) FoundSitemap

func (c *Crawler) FoundSitemap(sitemapURL string)

FoundSitemap implements fetch.SitemapDiscoverer for the RobotsHandler callback. It's called by RobotsHandler when a sitemap URL is found in robots.txt.

func (*Crawler) GetProgress

func (c *Crawler) GetProgress() CrawlerProgress

GetProgress returns the current progress of the crawler.

func (*Crawler) Run

func (c *Crawler) Run(resume bool) error

Run starts the crawling process for the configured site and blocks until completion or cancellation.

type CrawlerOptions

type CrawlerOptions struct {
	// SharedSemaphore allows sharing a global semaphore across multiple crawlers
	// If nil, the crawler creates its own semaphore based on appCfg.MaxRequests
	SharedSemaphore *semaphore.Weighted
}

CrawlerOptions contains optional parameters for NewCrawlerWithOptions.
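SharedSemaphore lets several crawlers draw from one global concurrency budget instead of each capping requests independently. The real field is a *semaphore.Weighted from golang.org/x/sync; the self-contained sketch below shows the same idea with a stdlib buffered channel, which is an assumption of this example rather than the package's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// sem caps concurrent fetches across every crawler, mirroring what a
// shared *semaphore.Weighted does when passed via CrawlerOptions.
type sem chan struct{}

func (s sem) acquire() { s <- struct{}{} }
func (s sem) release() { <-s }

func main() {
	shared := make(sem, 2) // global cap: at most 2 in-flight requests
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			shared.acquire()
			defer shared.release()
			fmt.Println("fetching", id) // at most 2 goroutines reach here at once
		}(i)
	}
	wg.Wait()
}
```

Passing the same semaphore to every crawler (via opts) keeps total load bounded no matter how many sites run in parallel; passing nil falls back to a per-crawler limit derived from appCfg.MaxRequests.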

type CrawlerProgress

type CrawlerProgress struct {
	SiteKey        string
	PagesProcessed int64
	PagesQueued    int
	IsRunning      bool
}

CrawlerProgress contains progress information for a crawler.
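A caller can poll GetProgress periodically and render the snapshot, e.g. for a status line. The struct below is redeclared locally so the sketch compiles on its own (the real type lives in this package with the same exported fields); formatProgress is a hypothetical helper, not part of the API.

```go
package main

import "fmt"

// CrawlerProgress is redeclared here for illustration; it matches the
// exported fields of the package's type.
type CrawlerProgress struct {
	SiteKey        string
	PagesProcessed int64
	PagesQueued    int
	IsRunning      bool
}

// formatProgress renders one snapshot as a single status line.
func formatProgress(p CrawlerProgress) string {
	state := "done"
	if p.IsRunning {
		state = "running"
	}
	return fmt.Sprintf("[%s] %s: %d processed, %d queued",
		state, p.SiteKey, p.PagesProcessed, p.PagesQueued)
}

func main() {
	fmt.Println(formatProgress(CrawlerProgress{
		SiteKey: "docs", PagesProcessed: 42, PagesQueued: 7, IsRunning: true,
	}))
	// → [running] docs: 42 processed, 7 queued
}
```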

type OutputManager

type OutputManager struct {
	// contains filtered or unexported fields
}

OutputManager owns all output file handles and metadata collection for a crawl.

func NewOutputManager

func NewOutputManager(log *logrus.Entry, resolved *config.ResolvedSiteConfig, siteCfg *config.SiteConfig, enableTokenCounting bool, siteKey, siteOutputDir string) *OutputManager

NewOutputManager creates an OutputManager without opening files. Call OpenFiles after the output directory is ready (e.g. after cleanSiteOutputDir).

func (*OutputManager) Close

func (om *OutputManager) Close() error

Close syncs and closes all output files and writes the YAML metadata file.

func (*OutputManager) OpenFiles

func (om *OutputManager) OpenFiles(resume bool)

OpenFiles opens all configured output files (TSV, JSONL, chunks). Must be called after the output directory exists and has been cleaned if needed.

func (*OutputManager) PagesSaved

func (om *OutputManager) PagesSaved() int

PagesSaved returns the number of pages whose metadata has been collected.

func (*OutputManager) RecordPageOutput

func (om *OutputManager) RecordPageOutput(finalURL, normalizedURL, savedContentPath string, markdownBytes []byte, pageTitle string, currentDepth, imageCount int, taskLog *logrus.Entry)

RecordPageOutput handles all post-save output: TSV write, YAML metadata collection, JSONL write, and chunks write. Called after content is successfully saved to disk. markdownBytes is the already-written markdown content, passed through to avoid re-reading the file.
