crawler

package
v1.1.1
Warning

This package is not in the latest version of its module.

Published: Feb 7, 2026 License: Apache-2.0 Imports: 29 Imported by: 0

Documentation

Overview

FILE: pkg/crawler/crawler.go

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler orchestrates the web crawling process for a single configured site.

func NewCrawler

func NewCrawler(
	appCfg config.AppConfig,
	siteCfg config.SiteConfig,
	siteKey string,
	baseLogger *logrus.Logger,
	store storage.VisitedStore,
	fetcher *fetch.Fetcher,
	rateLimiter *fetch.RateLimiter,
	crawlCtx context.Context,
	cancelCrawl context.CancelFunc,
	resume bool,
) (*Crawler, error)

NewCrawler creates and initializes a new Crawler instance and its components.

func NewCrawlerWithOptions

func NewCrawlerWithOptions(
	appCfg config.AppConfig,
	siteCfg config.SiteConfig,
	siteKey string,
	baseLogger *logrus.Logger,
	store storage.VisitedStore,
	fetcher *fetch.Fetcher,
	rateLimiter *fetch.RateLimiter,
	crawlCtx context.Context,
	cancelCrawl context.CancelFunc,
	resume bool,
	opts *CrawlerOptions,
) (*Crawler, error)

NewCrawlerWithOptions creates a new Crawler with optional configuration.

func (*Crawler) FoundSitemap

func (c *Crawler) FoundSitemap(sitemapURL string)

FoundSitemap implements fetch.SitemapDiscoverer for the RobotsHandler callback. It's called by RobotsHandler when a sitemap URL is found in robots.txt.
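The callback relationship described above can be sketched as follows. This is only an illustration: the local SitemapDiscoverer interface assumes that fetch.SitemapDiscoverer has the single method FoundSitemap(sitemapURL string), and the collector type and reportSitemaps function are hypothetical stand-ins, not part of this package or of RobotsHandler.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// SitemapDiscoverer is a local stand-in for fetch.SitemapDiscoverer.
// Its single-method shape is an assumption based on FoundSitemap's signature.
type SitemapDiscoverer interface {
	FoundSitemap(sitemapURL string)
}

// collector is a hypothetical discoverer that records every reported URL.
type collector struct {
	urls []string
}

func (c *collector) FoundSitemap(sitemapURL string) {
	c.urls = append(c.urls, sitemapURL)
}

// reportSitemaps scans robots.txt content and reports each "Sitemap:" entry
// to d. It is a simplified stand-in for what a robots handler might do.
func reportSitemaps(robotsTxt string, d SitemapDiscoverer) {
	sc := bufio.NewScanner(strings.NewReader(robotsTxt))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Split at the first colon only, so the URL's own colons survive.
		key, value, ok := strings.Cut(line, ":")
		if !ok || !strings.EqualFold(strings.TrimSpace(key), "sitemap") {
			continue
		}
		if u := strings.TrimSpace(value); u != "" {
			d.FoundSitemap(u)
		}
	}
}

func main() {
	robots := "User-agent: *\nDisallow: /private\nSitemap: https://example.com/sitemap.xml\n"
	c := &collector{}
	reportSitemaps(robots, c)
	fmt.Println(c.urls) // prints [https://example.com/sitemap.xml]
}
```

Passing the discoverer as an interface keeps the robots.txt parsing code unaware of the crawler's internals; the crawler decides what to do with each reported URL.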

func (*Crawler) GetProgress

func (c *Crawler) GetProgress() CrawlerProgress

GetProgress returns the current progress of the crawler.
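One plausible way to serve such a snapshot safely while worker goroutines are still running is sketched below. The CrawlerProgress struct mirrors the documented type, but the tracker type and its fields are hypothetical; the Crawler's actual unexported state and locking strategy are not shown in this documentation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// CrawlerProgress mirrors the documented struct.
type CrawlerProgress struct {
	SiteKey        string
	PagesProcessed int64
	PagesQueued    int
	IsRunning      bool
}

// tracker is a hypothetical stand-in for the crawler's unexported state.
type tracker struct {
	siteKey   string
	processed atomic.Int64 // incremented by worker goroutines
	mu        sync.Mutex   // guards queued and running
	queued    int
	running   bool
}

// GetProgress returns a point-in-time copy, so callers never hold a
// reference to state that workers are still mutating.
func (t *tracker) GetProgress() CrawlerProgress {
	t.mu.Lock()
	defer t.mu.Unlock()
	return CrawlerProgress{
		SiteKey:        t.siteKey,
		PagesProcessed: t.processed.Load(),
		PagesQueued:    t.queued,
		IsRunning:      t.running,
	}
}

func main() {
	tr := &tracker{siteKey: "docs", queued: 3, running: true}

	// Simulate five workers each finishing one page.
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			tr.processed.Add(1)
		}()
	}
	wg.Wait()

	fmt.Printf("%+v\n", tr.GetProgress())
}
```

Returning the struct by value means a caller can poll from a separate goroutine, for example to drive a status line, without any further synchronization.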

func (*Crawler) Run

func (c *Crawler) Run(resume bool) error

Run starts the crawling process for the configured site and blocks until completion or cancellation.

type CrawlerOptions

type CrawlerOptions struct {
	// SharedSemaphore lets multiple crawlers share one global semaphore.
	// If nil, the crawler creates its own semaphore based on appCfg.MaxRequests.
	SharedSemaphore *semaphore.Weighted
}

CrawlerOptions contains optional parameters for NewCrawlerWithOptions.

type CrawlerProgress

type CrawlerProgress struct {
	SiteKey        string
	PagesProcessed int64
	PagesQueued    int
	IsRunning      bool
}

CrawlerProgress contains progress information for a crawler.
