crawler

package
v0.3.0
Published: Mar 22, 2026 License: MIT Imports: 10 Imported by: 0

Documentation

Overview

Package crawler provides a generic, configurable web crawler built on Colly. It supports concurrent requests, retries with exponential backoff, rate limiting, proxy rotation, and pluggable document evaluation via the Evaluator interface.
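As a sketch of how these pieces fit together (the module's import path is not shown on this page, so `example.com/crawler` and the no-op evaluator below are placeholders):

```go
package main

import (
	"context"
	"log"
	"time"

	"example.com/crawler" // hypothetical import path
	"github.com/PuerkitoBio/goquery"
)

// nopEvaluator satisfies Evaluator but reports no findings.
type nopEvaluator struct{}

func (nopEvaluator) Evaluate(pageID string, doc *goquery.Document) (crawler.Evaluation, error) {
	return crawler.Evaluation{}, nil
}

func main() {
	results := make(chan *crawler.Result, 16)
	done := make(chan struct{})
	go func() { // drain results while Run blocks
		defer close(done)
		for r := range results {
			log.Printf("%s: success=%v, findings=%d", r.PageID, r.Success, len(r.Findings))
		}
	}()

	svc, err := crawler.NewService(crawler.Config{
		AllowedDomains: []string{"example.com"},
		Parallelism:    4, // required: must be > 0
		RetryCount:     2, // two extra attempts per page
		RateLimit:      500 * time.Millisecond,
		Evaluator:      nopEvaluator{}, // required
	}, results)
	if err != nil {
		log.Fatal(err)
	}

	pages := []crawler.Page{{ID: "home", Category: "web", URL: "https://example.com/"}}
	if err := svc.Run(context.Background(), pages); err != nil {
		log.Fatal(err)
	}
	close(results) // Run has returned, so no more sends are expected
	<-done
}
```

Since the caller owns the results channel, this sketch assumes the service stops sending once Run returns.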

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	// AllowedDomains restricts crawling to these hosts.
	AllowedDomains []string

	// Parallelism controls concurrent requests. Required; must be > 0.
	Parallelism int

	// RetryCount sets additional retry attempts per page. 0 = no retries.
	RetryCount int

	// HTTPTimeout caps each HTTP request. 0 = no timeout.
	HTTPTimeout time.Duration

	// RateLimit sets minimum delay between requests to the same domain.
	RateLimit time.Duration

	// MaxDepth limits link-following depth. 0 = no link following.
	MaxDepth int

	// Evaluator processes fetched documents. Required.
	Evaluator Evaluator

	// CookieDomains lists domains for which CookieProvider is called.
	CookieDomains []string

	// CookieProvider returns cookies per domain. Optional.
	CookieProvider CookieProvider

	// Headers customizes outbound requests. Optional.
	Headers HeaderProvider

	// Hook runs before each request. Optional.
	Hook RequestHook

	// Logger receives diagnostic messages. Optional (no-op if nil).
	Logger Logger
}

Config wires the crawler with domain settings, scraping options, and collaborators.

func (Config) Validate

func (c Config) Validate() error

Validate reports an error if a required field (such as Parallelism or Evaluator) is missing or invalid.

type CookieProvider

type CookieProvider func(domain string) []*http.Cookie

CookieProvider returns cookies for a given domain. Optional.

type Evaluation

type Evaluation struct {
	Findings []Finding
}

Evaluation is the output of an Evaluator.

type Evaluator

type Evaluator interface {
	Evaluate(pageID string, document *goquery.Document) (Evaluation, error)
}

Evaluator processes a fetched HTML document and produces findings. Implementations are injected into the crawler at construction time.
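An Evaluator receives the parsed document and returns findings. The sketch below mirrors Evaluation and Finding locally so it compiles on its own; titleEvaluator is a hypothetical implementation that passes when the page has a non-empty `<title>`:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// Local stand-ins for the package's Evaluation and Finding types.
type Finding struct {
	ID          string
	Description string
	Passed      bool
	Message     string
}

type Evaluation struct{ Findings []Finding }

// titleEvaluator is a hypothetical Evaluator implementation.
type titleEvaluator struct{}

func (titleEvaluator) Evaluate(pageID string, doc *goquery.Document) (Evaluation, error) {
	title := strings.TrimSpace(doc.Find("title").First().Text())
	return Evaluation{Findings: []Finding{{
		ID:          pageID + ":title",
		Description: "page has a <title>",
		Passed:      title != "",
		Message:     title,
	}}}, nil
}

func main() {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(
		"<html><head><title>Hello</title></head><body></body></html>"))
	if err != nil {
		panic(err)
	}
	ev, _ := titleEvaluator{}.Evaluate("home", doc)
	fmt.Println(ev.Findings[0].Passed, ev.Findings[0].Message) // true Hello
}
```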

type Finding

type Finding struct {
	ID          string `json:"id,omitempty"`
	Description string `json:"description"`
	Passed      bool   `json:"passed"`
	Message     string `json:"message"`
	Data        string `json:"data,omitempty"` // arbitrary payload (e.g., JSON-encoded extracted data)
}

Finding captures a single evaluation outcome from the Evaluator.

type HeaderProvider

type HeaderProvider interface {
	Apply(request *colly.Request)
}

HeaderProvider decorates outbound HTTP requests. Optional.

type Logger

type Logger interface {
	Debug(format string, args ...interface{})
	Info(format string, args ...interface{})
	Warning(format string, args ...interface{})
	Error(format string, args ...interface{})
}

Logger emits formatted diagnostic messages. Implementations must be safe for concurrent use.

type Page

type Page struct {
	ID       string // Unique identifier for this page
	Category string // Grouping label (e.g., platform, source)
	URL      string // Full URL to fetch
}

Page describes a single URL to crawl.

type RequestHook

type RequestHook interface {
	BeforeRequest(ctx context.Context, page Page) error
}

RequestHook runs before each outbound request. Optional.

type Result

type Result struct {
	PageID         string            `json:"pageId"`
	PageURL        string            `json:"pageUrl"`
	FinalURL       string            `json:"finalUrl,omitempty"`
	Category       string            `json:"category"`
	Title          string            `json:"title,omitempty"`
	Success        bool              `json:"success"`
	ErrorMessage   string            `json:"errorMessage,omitempty"`
	HTTPStatusCode int               `json:"httpStatusCode,omitempty"`
	Findings       []Finding         `json:"findings,omitempty"`
	Document       *goquery.Document `json:"-"` // parsed HTML, not serialized
}

Result represents the outcome of crawling a single page.

type Service

type Service struct {
	// contains filtered or unexported fields
}

Service orchestrates crawling pages and emits results.

func NewService

func NewService(cfg Config, results chan<- *Result) (*Service, error)

NewService constructs a crawler service.

func (*Service) Run

func (svc *Service) Run(ctx context.Context, pages []Page) error

Run visits each page and blocks until all are processed or ctx is cancelled.
