crawler

package
v0.3.0
Published: Mar 22, 2026 License: MIT Imports: 10 Imported by: 0

Documentation

Overview

Package crawler provides a generic, configurable web crawler built on Colly. It supports concurrent requests, retries with exponential backoff, rate limiting, proxy rotation, and pluggable document evaluation via the Evaluator interface.
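As a sketch of how these pieces fit together (the module's import path is not shown on this page, so `example.com/crawler` and the no-op evaluator below are placeholders):

```go
package main

import (
	"context"
	"log"
	"time"

	"example.com/crawler" // hypothetical import path
	"github.com/PuerkitoBio/goquery"
)

// nopEvaluator satisfies Evaluator but reports no findings.
type nopEvaluator struct{}

func (nopEvaluator) Evaluate(pageID string, doc *goquery.Document) (crawler.Evaluation, error) {
	return crawler.Evaluation{}, nil
}

func main() {
	results := make(chan *crawler.Result, 16)
	done := make(chan struct{})
	go func() { // drain results while Run blocks
		defer close(done)
		for r := range results {
			log.Printf("%s: success=%v, findings=%d", r.PageID, r.Success, len(r.Findings))
		}
	}()

	svc, err := crawler.NewService(crawler.Config{
		AllowedDomains: []string{"example.com"},
		Parallelism:    4, // required: must be > 0
		RetryCount:     2, // two extra attempts per page
		RateLimit:      500 * time.Millisecond,
		Evaluator:      nopEvaluator{}, // required
	}, results)
	if err != nil {
		log.Fatal(err)
	}

	pages := []crawler.Page{{ID: "home", Category: "web", URL: "https://example.com/"}}
	if err := svc.Run(context.Background(), pages); err != nil {
		log.Fatal(err)
	}
	close(results) // Run has returned, so no more sends are expected
	<-done
}
```

Since the caller owns the results channel, this sketch assumes the service stops sending once Run returns.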

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	// AllowedDomains restricts crawling to these hosts.
	AllowedDomains []string

	// Parallelism controls concurrent requests. Required; must be > 0.
	Parallelism int

	// RetryCount sets additional retry attempts per page. 0 = no retries.
	RetryCount int

	// HTTPTimeout caps each HTTP request. 0 = no timeout.
	HTTPTimeout time.Duration

	// RateLimit sets minimum delay between requests to the same domain.
	RateLimit time.Duration

	// MaxDepth limits link-following depth. 0 = no link following.
	MaxDepth int

	// Evaluator processes fetched documents. Required.
	Evaluator Evaluator

	// CookieDomains lists domains for which CookieProvider is called.
	CookieDomains []string

	// CookieProvider returns cookies per domain. Optional.
	CookieProvider CookieProvider

	// Headers customizes outbound requests. Optional.
	Headers HeaderProvider

	// Hook runs before each request. Optional.
	Hook RequestHook

	// Logger receives diagnostic messages. Optional (no-op if nil).
	Logger Logger
}

Config wires the crawler with domain settings, scraping options, and collaborators.

func (Config) Validate

func (c Config) Validate() error

Validate reports an error if a required field (such as Parallelism or Evaluator) is missing or invalid.

type CookieProvider

type CookieProvider func(domain string) []*http.Cookie

CookieProvider returns cookies for a given domain. Optional.

type Evaluation

type Evaluation struct {
	Findings []Finding
}

Evaluation is the output of an Evaluator.

type Evaluator

type Evaluator interface {
	Evaluate(pageID string, document *goquery.Document) (Evaluation, error)
}

Evaluator processes a fetched HTML document and produces findings. Implementations are injected into the crawler at construction time.
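An Evaluator receives the parsed document and returns findings. The sketch below mirrors Evaluation and Finding locally so it compiles on its own; titleEvaluator is a hypothetical implementation that passes when the page has a non-empty `<title>`:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// Local stand-ins for the package's Evaluation and Finding types.
type Finding struct {
	ID          string
	Description string
	Passed      bool
	Message     string
}

type Evaluation struct{ Findings []Finding }

// titleEvaluator is a hypothetical Evaluator implementation.
type titleEvaluator struct{}

func (titleEvaluator) Evaluate(pageID string, doc *goquery.Document) (Evaluation, error) {
	title := strings.TrimSpace(doc.Find("title").First().Text())
	return Evaluation{Findings: []Finding{{
		ID:          pageID + ":title",
		Description: "page has a <title>",
		Passed:      title != "",
		Message:     title,
	}}}, nil
}

func main() {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(
		"<html><head><title>Hello</title></head><body></body></html>"))
	if err != nil {
		panic(err)
	}
	ev, _ := titleEvaluator{}.Evaluate("home", doc)
	fmt.Println(ev.Findings[0].Passed, ev.Findings[0].Message) // true Hello
}
```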

type Finding

type Finding struct {
	ID          string `json:"id,omitempty"`
	Description string `json:"description"`
	Passed      bool   `json:"passed"`
	Message     string `json:"message"`
	Data        string `json:"data,omitempty"` // arbitrary payload (e.g., JSON-encoded extracted data)
}

Finding captures a single evaluation outcome from the Evaluator.

type HeaderProvider

type HeaderProvider interface {
	Apply(request *colly.Request)
}

HeaderProvider decorates outbound HTTP requests. Optional.

type Logger

type Logger interface {
	Debug(format string, args ...interface{})
	Info(format string, args ...interface{})
	Warning(format string, args ...interface{})
	Error(format string, args ...interface{})
}

Logger emits formatted diagnostic messages. Implementations must be safe for concurrent use.

type Page

type Page struct {
	ID       string // Unique identifier for this page
	Category string // Grouping label (e.g., platform, source)
	URL      string // Full URL to fetch
}

Page describes a single URL to crawl.

type RequestHook

type RequestHook interface {
	BeforeRequest(ctx context.Context, page Page) error
}

RequestHook runs before each outbound request. Optional.

type Result

type Result struct {
	PageID         string            `json:"pageId"`
	PageURL        string            `json:"pageUrl"`
	FinalURL       string            `json:"finalUrl,omitempty"`
	Category       string            `json:"category"`
	Title          string            `json:"title,omitempty"`
	Success        bool              `json:"success"`
	ErrorMessage   string            `json:"errorMessage,omitempty"`
	HTTPStatusCode int               `json:"httpStatusCode,omitempty"`
	Findings       []Finding         `json:"findings,omitempty"`
	Document       *goquery.Document `json:"-"` // parsed HTML, not serialized
}

Result represents the outcome of crawling a single page.

type Service

type Service struct {
	// contains filtered or unexported fields
}

Service orchestrates crawling pages and emits results.

func NewService

func NewService(cfg Config, results chan<- *Result) (*Service, error)

NewService constructs a crawler service.

func (*Service) Run

func (svc *Service) Run(ctx context.Context, pages []Page) error

Run visits each page and blocks until all are processed or ctx is cancelled.
