crawler

package
v1.3.3
Published: Feb 21, 2026 License: Apache-2.0 Imports: 29 Imported by: 0

Documentation

Overview

FILE: pkg/crawler/crawler.go

FILE: pkg/crawler/output.go

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler orchestrates the web crawling process for a single configured site.

func NewCrawler

func NewCrawler(
	appCfg *config.AppConfig,
	siteCfg *config.SiteConfig,
	siteKey string,
	baseLogger *logrus.Entry,
	store storage.VisitedStore,
	fetcher fetch.HTTPFetcher,
	rateLimiter *fetch.RateLimiter,
	crawlCtx context.Context,
	cancelCrawl context.CancelFunc,
	resume bool,
) (*Crawler, error)

NewCrawler creates and initializes a new Crawler instance and its components.

func NewCrawlerWithOptions

func NewCrawlerWithOptions(
	appCfg *config.AppConfig,
	siteCfg *config.SiteConfig,
	siteKey string,
	baseLogger *logrus.Entry,
	store storage.VisitedStore,
	fetcher fetch.HTTPFetcher,
	rateLimiter *fetch.RateLimiter,
	crawlCtx context.Context,
	cancelCrawl context.CancelFunc,
	resume bool,
	opts *CrawlerOptions,
) (*Crawler, error)

NewCrawlerWithOptions creates a new Crawler with optional configuration.

func (*Crawler) FoundSitemap

func (c *Crawler) FoundSitemap(sitemapURL string)

FoundSitemap implements fetch.SitemapDiscoverer for the RobotsHandler callback. It's called by RobotsHandler when a sitemap URL is found in robots.txt.

func (*Crawler) GetProgress

func (c *Crawler) GetProgress() CrawlerProgress

GetProgress returns the current progress of the crawler.

func (*Crawler) Run

func (c *Crawler) Run(resume bool) error

Run starts the crawling process for the configured site and blocks until completion or cancellation.

type CrawlerOptions

type CrawlerOptions struct {
	// SharedSemaphore allows sharing a global semaphore across multiple crawlers
	// If nil, the crawler creates its own semaphore based on appCfg.MaxRequests
	SharedSemaphore *semaphore.Weighted
}

CrawlerOptions contains optional parameters for NewCrawlerWithOptions.
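SharedSemaphore lets several crawlers draw from one global concurrency budget instead of each capping requests independently. The real field is a *semaphore.Weighted from golang.org/x/sync; the self-contained sketch below shows the same idea with a stdlib buffered channel, which is an assumption of this example rather than the package's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// sem caps concurrent fetches across every crawler, mirroring what a
// shared *semaphore.Weighted does when passed via CrawlerOptions.
type sem chan struct{}

func (s sem) acquire() { s <- struct{}{} }
func (s sem) release() { <-s }

func main() {
	shared := make(sem, 2) // global cap: at most 2 in-flight requests
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			shared.acquire()
			defer shared.release()
			fmt.Println("fetching", id) // at most 2 goroutines reach here at once
		}(i)
	}
	wg.Wait()
}
```

Passing the same semaphore to every crawler (via opts) keeps total load bounded no matter how many sites run in parallel; passing nil falls back to a per-crawler limit derived from appCfg.MaxRequests.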

type CrawlerProgress

type CrawlerProgress struct {
	SiteKey        string
	PagesProcessed int64
	PagesQueued    int
	IsRunning      bool
}

CrawlerProgress contains progress information for a crawler.
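A caller can poll GetProgress periodically and render the snapshot, e.g. for a status line. The struct below is redeclared locally so the sketch compiles on its own (the real type lives in this package with the same exported fields); formatProgress is a hypothetical helper, not part of the API.

```go
package main

import "fmt"

// CrawlerProgress is redeclared here for illustration; it matches the
// exported fields of the package's type.
type CrawlerProgress struct {
	SiteKey        string
	PagesProcessed int64
	PagesQueued    int
	IsRunning      bool
}

// formatProgress renders one snapshot as a single status line.
func formatProgress(p CrawlerProgress) string {
	state := "done"
	if p.IsRunning {
		state = "running"
	}
	return fmt.Sprintf("[%s] %s: %d processed, %d queued",
		state, p.SiteKey, p.PagesProcessed, p.PagesQueued)
}

func main() {
	fmt.Println(formatProgress(CrawlerProgress{
		SiteKey: "docs", PagesProcessed: 42, PagesQueued: 7, IsRunning: true,
	}))
	// → [running] docs: 42 processed, 7 queued
}
```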

type OutputManager

type OutputManager struct {
	// contains filtered or unexported fields
}

OutputManager owns all output file handles and metadata collection for a crawl.

func NewOutputManager

func NewOutputManager(log *logrus.Entry, resolved *config.ResolvedSiteConfig, siteCfg *config.SiteConfig, enableTokenCounting bool, siteKey, siteOutputDir string) *OutputManager

NewOutputManager creates an OutputManager without opening files. Call OpenFiles after the output directory is ready (e.g. after cleanSiteOutputDir).

func (*OutputManager) Close

func (om *OutputManager) Close() error

Close syncs and closes all output files and writes the YAML metadata file.

func (*OutputManager) OpenFiles

func (om *OutputManager) OpenFiles(resume bool)

OpenFiles opens all configured output files (TSV, JSONL, chunks). Must be called after the output directory exists and has been cleaned if needed.

func (*OutputManager) PagesSaved

func (om *OutputManager) PagesSaved() int

PagesSaved returns the number of pages whose metadata has been collected.

func (*OutputManager) RecordPageOutput

func (om *OutputManager) RecordPageOutput(finalURL, normalizedURL, savedContentPath string, markdownBytes []byte, pageTitle string, currentDepth, imageCount int, taskLog *logrus.Entry)

RecordPageOutput handles all post-save output: TSV write, YAML metadata collection, JSONL write, and chunks write. Called after content is successfully saved to disk. markdownBytes is the already-written markdown content, passed through to avoid re-reading the file.
