crawler

package

v1.1.1 (latest) · Published: Feb 7, 2026 · License: Apache-2.0 · Imports: 29 · Imported by: 0

Repository

github.com/piratf/doc-scraper

Documentation ¶

Overview ¶

FILE: pkg/crawler/crawler.go

Index ¶

type Crawler
- func NewCrawler(appCfg config.AppConfig, siteCfg config.SiteConfig, siteKey string, ...) (*Crawler, error)
- func NewCrawlerWithOptions(appCfg config.AppConfig, siteCfg config.SiteConfig, siteKey string, ...) (*Crawler, error)
- func (c *Crawler) FoundSitemap(sitemapURL string)
- func (c *Crawler) GetProgress() CrawlerProgress
- func (c *Crawler) Run(resume bool) error
type CrawlerOptions
type CrawlerProgress

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Crawler ¶

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler orchestrates the web crawling process for a single configured site.

func NewCrawler ¶

func NewCrawler(
	appCfg config.AppConfig,
	siteCfg config.SiteConfig,
	siteKey string,
	baseLogger *logrus.Logger,
	store storage.VisitedStore,
	fetcher *fetch.Fetcher,
	rateLimiter *fetch.RateLimiter,
	crawlCtx context.Context,
	cancelCrawl context.CancelFunc,
	resume bool,
) (*Crawler, error)

NewCrawler creates and initializes a new Crawler instance and its components.

func NewCrawlerWithOptions ¶

func NewCrawlerWithOptions(
	appCfg config.AppConfig,
	siteCfg config.SiteConfig,
	siteKey string,
	baseLogger *logrus.Logger,
	store storage.VisitedStore,
	fetcher *fetch.Fetcher,
	rateLimiter *fetch.RateLimiter,
	crawlCtx context.Context,
	cancelCrawl context.CancelFunc,
	resume bool,
	opts *CrawlerOptions,
) (*Crawler, error)

NewCrawlerWithOptions creates a new Crawler. It takes the same arguments as NewCrawler plus opts, which carries optional configuration (see CrawlerOptions).

func (*Crawler) FoundSitemap ¶

func (c *Crawler) FoundSitemap(sitemapURL string)

FoundSitemap implements fetch.SitemapDiscoverer for the RobotsHandler callback. It's called by RobotsHandler when a sitemap URL is found in robots.txt.
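
The callback relationship described above can be sketched as follows. This is a minimal, self-contained illustration and not the package's code: `sitemapDiscoverer` is a local stand-in for fetch.SitemapDiscoverer (its single-method shape is inferred from the FoundSitemap signature), and `collectingDiscoverer` and `notifySitemaps` are hypothetical names introduced only for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sitemapDiscoverer is a local stand-in for fetch.SitemapDiscoverer.
// Assumption: the real interface has this single method.
type sitemapDiscoverer interface {
	FoundSitemap(sitemapURL string)
}

// collectingDiscoverer is a hypothetical receiver that records every
// sitemap URL it is told about.
type collectingDiscoverer struct {
	urls []string
}

func (d *collectingDiscoverer) FoundSitemap(sitemapURL string) {
	d.urls = append(d.urls, sitemapURL)
}

// notifySitemaps scans robots.txt content and reports each "Sitemap:"
// value to the discoverer. Hypothetical helper, not part of the package.
func notifySitemaps(robotsTxt string, d sitemapDiscoverer) {
	scanner := bufio.NewScanner(strings.NewReader(robotsTxt))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// Split on the first colon only, because the URL contains colons.
		name, value, ok := strings.Cut(line, ":")
		if !ok || !strings.EqualFold(strings.TrimSpace(name), "sitemap") {
			continue
		}
		if u := strings.TrimSpace(value); u != "" {
			d.FoundSitemap(u)
		}
	}
}

func main() {
	robots := "User-agent: *\nDisallow: /private/\nSitemap: https://example.com/sitemap.xml\n"
	d := &collectingDiscoverer{}
	notifySitemaps(robots, d)
	fmt.Println(d.urls)
}
```

In this sketch the receiver only collects URLs; what the real Crawler does with a discovered sitemap is not documented on this page.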

func (*Crawler) GetProgress ¶

func (c *Crawler) GetProgress() CrawlerProgress

GetProgress returns a CrawlerProgress snapshot of the crawler's current state.

func (*Crawler) Run ¶

func (c *Crawler) Run(resume bool) error

Run starts the crawling process for the configured site and blocks until completion or cancellation.
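
Because Run blocks, a caller that needs to stay responsive would typically start it in a goroutine and wait on a channel. The sketch below illustrates that pattern with stand-ins only: `runner` is a local interface matching the Run signature, and `fakeCrawler`, `newFakeCrawler`, and `runInBackground` are hypothetical. Returning the context's error on cancellation is an assumption of this sketch; this page does not say what the real Run returns when cancelled.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runner is a local stand-in for the part of *Crawler used here.
type runner interface {
	Run(resume bool) error
}

// fakeCrawler is a hypothetical stand-in that "crawls" until either its
// work duration elapses or its context is cancelled.
type fakeCrawler struct {
	ctx    context.Context
	cancel context.CancelFunc
	work   time.Duration
}

func newFakeCrawler(work time.Duration) *fakeCrawler {
	ctx, cancel := context.WithCancel(context.Background())
	return &fakeCrawler{ctx: ctx, cancel: cancel, work: work}
}

// Run blocks until completion or cancellation, like the documented method.
// Assumption: cancellation is surfaced as the context's error.
func (f *fakeCrawler) Run(resume bool) error {
	select {
	case <-time.After(f.work):
		return nil
	case <-f.ctx.Done():
		return f.ctx.Err()
	}
}

// runInBackground starts r.Run in a goroutine and returns a channel that
// receives its result exactly once.
func runInBackground(r runner, resume bool) <-chan error {
	done := make(chan error, 1)
	go func() { done <- r.Run(resume) }()
	return done
}

func main() {
	c := newFakeCrawler(time.Hour)
	done := runInBackground(c, false)
	c.cancel() // for example, triggered by a signal handler
	fmt.Println("run finished:", <-done)
}
```

The buffered result channel lets the goroutine exit even if the caller never reads the result.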

type CrawlerOptions ¶

type CrawlerOptions struct {
	// SharedSemaphore allows sharing a global semaphore across multiple crawlers.
	// If nil, the crawler creates its own semaphore based on appCfg.MaxRequests.
	SharedSemaphore *semaphore.Weighted
}

CrawlerOptions contains optional parameters for NewCrawlerWithOptions.
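
The effect of sharing one semaphore across crawlers, and the nil fallback described in the SharedSemaphore field comment, can be illustrated with a small sketch. To stay within the standard library it uses a buffered channel as a counting semaphore in place of *semaphore.Weighted from golang.org/x/sync/semaphore. The names `limiter`, `options`, `pickLimiter`, and `peakConcurrency` are all hypothetical, and the code is not taken from the package.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// limiter is a buffered-channel counting semaphore, standing in for
// *semaphore.Weighted in this sketch.
type limiter chan struct{}

func (l limiter) acquire() { l <- struct{}{} }
func (l limiter) release() { <-l }

// options mirrors the shape of CrawlerOptions with the stand-in type.
type options struct {
	shared limiter
}

// pickLimiter returns the shared limiter when one is supplied, otherwise
// a private one sized by maxRequests (the nil fallback).
func pickLimiter(opts *options, maxRequests int) limiter {
	if opts != nil && opts.shared != nil {
		return opts.shared
	}
	return make(limiter, maxRequests)
}

// peakConcurrency simulates several crawlers that share one limiter and
// returns the highest number of simulated requests in flight at once.
func peakConcurrency(crawlers, pagesEach, limit int) int64 {
	opts := &options{shared: make(limiter, limit)}
	var inFlight, peak int64
	var wg sync.WaitGroup
	for c := 0; c < crawlers; c++ {
		lim := pickLimiter(opts, limit) // every crawler gets the same limiter
		for p := 0; p < pagesEach; p++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				lim.acquire()
				defer lim.release()
				n := atomic.AddInt64(&inFlight, 1)
				for { // record a new peak if n exceeds the old one
					old := atomic.LoadInt64(&peak)
					if n <= old || atomic.CompareAndSwapInt64(&peak, old, n) {
						break
					}
				}
				atomic.AddInt64(&inFlight, -1)
			}()
		}
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak in-flight requests:", peakConcurrency(3, 20, 2))
}
```

With a shared limiter the peak never exceeds the limit regardless of how many crawlers run; with one private limiter per crawler the combined ceiling would instead be the limit multiplied by the number of crawlers.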

type CrawlerProgress ¶

type CrawlerProgress struct {
	SiteKey        string
	PagesProcessed int64
	PagesQueued    int
	IsRunning      bool
}

CrawlerProgress contains progress information for a single crawler, as returned by GetProgress.
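
As a rough illustration of consuming these fields, the sketch below renders a progress value as a single log line. The struct is redeclared locally so the example compiles on its own; real code would use the value returned by GetProgress. `describeProgress` is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// CrawlerProgress mirrors the exported struct shown above so that this
// sketch is self-contained.
type CrawlerProgress struct {
	SiteKey        string
	PagesProcessed int64
	PagesQueued    int
	IsRunning      bool
}

// describeProgress is a hypothetical helper that formats one snapshot.
func describeProgress(p CrawlerProgress) string {
	state := "stopped"
	if p.IsRunning {
		state = "running"
	}
	return fmt.Sprintf("%s: %d processed, %d queued (%s)",
		p.SiteKey, p.PagesProcessed, p.PagesQueued, state)
}

func main() {
	snapshot := CrawlerProgress{SiteKey: "docs", PagesProcessed: 42, PagesQueued: 7, IsRunning: true}
	fmt.Println(describeProgress(snapshot))
}
```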

Source Files ¶

crawler.go
