Documentation ¶
Overview ¶
Package crawler provides a generic, configurable web crawler built on Colly. It supports concurrent requests, retries with exponential backoff, rate limiting, proxy rotation, and pluggable document evaluation via the Evaluator interface.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Config ¶
type Config struct {
    // AllowedDomains restricts crawling to these hosts.
    AllowedDomains []string
    // Parallelism controls concurrent requests. Required, must be > 0.
    Parallelism int
    // RetryCount sets additional retry attempts per page. 0 = no retries.
    RetryCount int
    // HTTPTimeout caps each HTTP request. 0 = no timeout.
    HTTPTimeout time.Duration
    // RateLimit sets minimum delay between requests to the same domain.
    RateLimit time.Duration
    // MaxDepth limits link-following depth. 0 = no link following.
    MaxDepth int
    // Evaluator processes fetched documents. Required.
    Evaluator Evaluator
    // CookieDomains lists domains for which CookieProvider is called.
    CookieDomains []string
    // CookieProvider returns cookies per domain. Optional.
    CookieProvider CookieProvider
    // Headers customizes outbound requests. Optional.
    Headers HeaderProvider
    // Hook runs before each request. Optional.
    Hook RequestHook
    // Logger receives diagnostic messages. Optional (no-op if nil).
    Logger Logger
}
Config wires the crawler with domain settings, scraping options, and collaborators.
type CookieProvider ¶
CookieProvider returns cookies for a given domain. Optional.
type Evaluation ¶
type Evaluation struct {
    Findings []Finding
}
Evaluation is the output of an Evaluator.
type Evaluator ¶
type Evaluator interface {
    Evaluate(pageID string, document *goquery.Document) (Evaluation, error)
}
Evaluator processes a fetched HTML document and produces findings. Implementations are injected into the crawler at construction time.
type Finding ¶
type Finding struct {
    ID          string `json:"id,omitempty"`
    Description string `json:"description"`
    Passed      bool   `json:"passed"`
    Message     string `json:"message"`
    Data        string `json:"data,omitempty"` // arbitrary payload (e.g., JSON-encoded extracted data)
}
Finding captures a single evaluation outcome from the Evaluator.
type HeaderProvider ¶
HeaderProvider decorates outbound HTTP requests. Optional.
type Logger ¶
type Logger interface {
    Debug(format string, args ...interface{})
    Info(format string, args ...interface{})
    Warning(format string, args ...interface{})
    Error(format string, args ...interface{})
}
Logger emits formatted diagnostic messages via printf-style methods. Implementations must be safe for concurrent use.
type Page ¶
type Page struct {
    ID       string // Unique identifier for this page
    Category string // Grouping label (e.g., platform, source)
    URL      string // Full URL to fetch
}
Page describes a single URL to crawl.
type RequestHook ¶
RequestHook runs before each outbound request. Optional.
type Result ¶
type Result struct {
    PageID         string            `json:"pageId"`
    PageURL        string            `json:"pageUrl"`
    FinalURL       string            `json:"finalUrl,omitempty"`
    Category       string            `json:"category"`
    Title          string            `json:"title,omitempty"`
    Success        bool              `json:"success"`
    ErrorMessage   string            `json:"errorMessage,omitempty"`
    HTTPStatusCode int               `json:"httpStatusCode,omitempty"`
    Findings       []Finding         `json:"findings,omitempty"`
    Document       *goquery.Document `json:"-"` // parsed HTML, not serialized
}
Result represents the outcome of crawling a single page.
type Service ¶
type Service struct {
    // contains filtered or unexported fields
}
Service orchestrates crawling pages and emits results.
func NewService ¶
NewService constructs a crawler service.
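A wiring sketch, assuming NewService accepts a Config and returns (*Service, error); the exact signature is not shown on this page, `myEvaluator` stands in for a real Evaluator implementation, and the field values are illustrative only:

```go
// Hypothetical wiring; check the package source for the real signature.
cfg := crawler.Config{
	AllowedDomains: []string{"example.com"},
	Parallelism:    4,                      // required, must be > 0
	RetryCount:     2,                      // two extra attempts with backoff
	HTTPTimeout:    10 * time.Second,
	RateLimit:      500 * time.Millisecond, // per-domain minimum delay
	MaxDepth:       1,                      // follow links one level deep
	Evaluator:      myEvaluator{},          // required; see Evaluator above
}

svc, err := crawler.NewService(cfg)
if err != nil {
	log.Fatal(err)
}
_ = svc // hand Pages to the service per its (unshown) run method
```

Leaving Logger, CookieProvider, Headers, and Hook unset is safe, since the documentation marks them optional.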