Documentation ¶
Overview ¶
Package crawler provides a generic, configurable web crawling engine built on Colly. It supports concurrent page fetching, retries with exponential backoff, rate limiting, proxy rotation with circuit-breaker health tracking, and pluggable response processing via the ResponseHandler interface.
Simple consumers use an Evaluator with the default response handler. Advanced consumers inject a custom ResponseHandler for full control over how fetched documents are processed and results are emitted.
Index ¶
- Constants
- func DescribeProxyForLog(rawProxyURL string) string
- func ExtractCanonicalURL(doc *goquery.Document) string
- func ExtractTitle(doc *goquery.Document) string
- func GetContextValue(ctx *colly.Context, key, fallback string) string
- func GetRetryAttempt(response *colly.Response) int
- func GetTargetCategoryFromContext(resp *colly.Response) string
- func GetTargetIDFromContext(resp *colly.Response) string
- func GetTargetURLFromContext(resp *colly.Response) string
- func NewContextAwareTransport(base http.RoundTripper, ctxProvider func() context.Context) http.RoundTripper
- func NewHTTPTransport(insecureSkipVerify bool, requestTimeout time.Duration) *http.Transport
- func NewPanicSafeTransport(base http.RoundTripper, logger Logger) http.RoundTripper
- func NewProxyRotator(rawProxies []string, tracker ProxyHealth, logger Logger) (colly.ProxyFunc, error)
- func ParseHTMLResponse(body []byte) (*goquery.Document, error)
- func RecordProxyFailure(tracker ProxyHealth, resp *colly.Response)
- func SanitizeProxyURL(rawProxyURL string) string
- func SetupErrorHandling(collector *colly.Collector, handler ResponseHandler, retryHandler RetryHandler, ...)
- type Config
- type CookieProvider
- type DefaultResponseHandler
- type Evaluation
- type Evaluator
- type FilePersister
- type Finding
- type HeaderProvider
- type Logger
- type Page (deprecated)
- type PlatformConfig
- type PlatformHooks
- type ProxyHealth
- type RequestConfigurator
- type RequestHook
- type ResponseHandler
- type Result
- type RetryDecision
- type RetryExhaustionBehavior
- type RetryHandler
- type RetryOptions
- type RetryPolicy
- type ScraperConfig
- type Service
- type Target
- type TargetOption
Constants ¶
const (
// Context keys for Colly request context.
CtxTargetIDKey = "crawler_target_id"
CtxTargetCategoryKey = "crawler_target_category"
CtxTargetURLKey = "crawler_target_url"
CtxRunContextKey = "crawler_run_context"
CtxHTTPStatusCodeKey = "crawler_http_status"
CtxErrorKey = "crawler_error"
CtxEvaluationKey = "crawler_evaluation"
CtxTitleKey = "crawler_title"
CtxInitialURLKey = "crawler_initial_url"
CtxFinalURLKey = "crawler_final_url"
CtxCanonicalURLKey = "crawler_canonical_url"
CtxRedirectedKey = "crawler_redirected"
CtxNotFoundKey = "crawler_not_found"
RetryCountKey = "crawler_retry_count"
RetriedFlagKey = "crawler_retried"
TitleNotFound = "Title Not Found"
UnknownURL = "UnknownURL"
HTMLExtension = "html"
)
Functions ¶
func DescribeProxyForLog ¶ added in v0.4.0
func DescribeProxyForLog(rawProxyURL string) string
DescribeProxyForLog returns a safe-to-log proxy description.
func ExtractCanonicalURL ¶ added in v0.4.0
func ExtractCanonicalURL(doc *goquery.Document) string
ExtractCanonicalURL extracts the canonical URL from a <link rel="canonical"> tag.
func ExtractTitle ¶ added in v0.4.0
func ExtractTitle(doc *goquery.Document) string
ExtractTitle returns the text content of the first <title> element.
func GetContextValue ¶ added in v0.4.0
func GetContextValue(ctx *colly.Context, key, fallback string) string
GetContextValue returns a value from a Colly context, or the fallback if absent.
func GetRetryAttempt ¶ added in v0.4.0
func GetRetryAttempt(response *colly.Response) int
GetRetryAttempt returns the current retry attempt from the response context.
func GetTargetCategoryFromContext ¶ added in v0.4.0
func GetTargetCategoryFromContext(resp *colly.Response) string
GetTargetCategoryFromContext retrieves the target category from a Colly response context.
func GetTargetIDFromContext ¶ added in v0.4.0
func GetTargetIDFromContext(resp *colly.Response) string
GetTargetIDFromContext retrieves the target ID from a Colly response context.
func GetTargetURLFromContext ¶ added in v0.4.0
func GetTargetURLFromContext(resp *colly.Response) string
GetTargetURLFromContext retrieves the target URL from a Colly response context.
func NewContextAwareTransport ¶ added in v0.4.0
func NewContextAwareTransport(base http.RoundTripper, ctxProvider func() context.Context) http.RoundTripper
NewContextAwareTransport wraps a transport with run-context cancellation.
func NewHTTPTransport ¶ added in v0.4.0
func NewHTTPTransport(insecureSkipVerify bool, requestTimeout time.Duration) *http.Transport
NewHTTPTransport creates an HTTP transport with idle-timeout support.
func NewPanicSafeTransport ¶ added in v0.4.0
func NewPanicSafeTransport(base http.RoundTripper, logger Logger) http.RoundTripper
NewPanicSafeTransport wraps a transport with panic recovery.
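The panic-recovery wrapping described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: panicSafeTransport, roundTripperFunc, the logf field, and demoRecoveredError are hypothetical names, and the choice to convert a panic into an ordinary error is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// roundTripperFunc adapts a plain function to http.RoundTripper.
type roundTripperFunc func(*http.Request) (*http.Response, error)

func (f roundTripperFunc) RoundTrip(req *http.Request) (*http.Response, error) {
	return f(req)
}

// panicSafeTransport is a hypothetical wrapper. It recovers from a panic in
// the wrapped transport and reports it as an error (an assumption about the
// desired behaviour).
type panicSafeTransport struct {
	base http.RoundTripper
	logf func(format string, args ...interface{})
}

func (t *panicSafeTransport) RoundTrip(req *http.Request) (resp *http.Response, err error) {
	defer func() {
		if recovered := recover(); recovered != nil {
			t.logf("transport panic for %s: %v", req.URL, recovered)
			resp = nil
			err = fmt.Errorf("transport panic: %v", recovered)
		}
	}()
	return t.base.RoundTrip(req)
}

// demoRecoveredError sends one request through a transport that always
// panics and returns the error produced by the wrapper.
func demoRecoveredError() error {
	faulty := roundTripperFunc(func(*http.Request) (*http.Response, error) {
		panic("boom")
	})
	safe := &panicSafeTransport{base: faulty, logf: func(string, ...interface{}) {}}
	req, err := http.NewRequest(http.MethodGet, "http://example.invalid/", nil)
	if err != nil {
		return err
	}
	_, err = safe.RoundTrip(req)
	return err
}

func main() {
	fmt.Println(demoRecoveredError()) // transport panic: boom
}
```

Named return values let the deferred function replace the result after the panic has been recovered, so callers see a normal error instead of a crashed goroutine.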
func NewProxyRotator ¶ added in v0.4.0
func NewProxyRotator(rawProxies []string, tracker ProxyHealth, logger Logger) (colly.ProxyFunc, error)
NewProxyRotator creates a round-robin proxy rotation function.
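A round-robin rotation that skips unhealthy proxies might look like the sketch below. It is illustrative only: newRoundRobin, availabilityChecker, and staticHealth are hypothetical names, the returned function merely mirrors the func(*http.Request) (*url.URL, error) shape that colly.ProxyFunc is understood to have, and skipping unavailable entries is an assumption about how the tracker is consulted.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// availabilityChecker is the subset of ProxyHealth that this sketch needs.
type availabilityChecker interface {
	IsAvailable(proxy string) bool
}

// staticHealth is a hypothetical checker: keys mapped to true are "down".
type staticHealth map[string]bool

func (s staticHealth) IsAvailable(proxy string) bool { return !s[proxy] }

// newRoundRobin parses the proxy list once and returns a selector that walks
// it in order, skipping entries the checker reports as unavailable.
func newRoundRobin(rawProxies []string, health availabilityChecker) (func(*http.Request) (*url.URL, error), error) {
	if len(rawProxies) == 0 {
		return nil, errors.New("no proxies configured")
	}
	parsed := make([]*url.URL, 0, len(rawProxies))
	for _, raw := range rawProxies {
		u, err := url.Parse(raw)
		if err != nil || u.Host == "" {
			// The raw value is left out of the error because it may hold credentials.
			return nil, fmt.Errorf("invalid proxy entry at index %d", len(parsed))
		}
		parsed = append(parsed, u)
	}
	var next atomic.Uint64
	return func(*http.Request) (*url.URL, error) {
		// Try each proxy at most once per call, starting at the shared cursor.
		for range parsed {
			idx := (next.Add(1) - 1) % uint64(len(parsed))
			if health.IsAvailable(rawProxies[idx]) {
				return parsed[idx], nil
			}
		}
		return nil, errors.New("no healthy proxy available")
	}, nil
}

func main() {
	proxies := []string{"http://a.example:8080", "http://b.example:8080", "http://c.example:8080"}
	pick, err := newRoundRobin(proxies, staticHealth{"http://b.example:8080": true})
	if err != nil {
		panic(err)
	}
	for i := 0; i < 3; i++ {
		u, _ := pick(nil)
		fmt.Println(u.Host) // a.example:8080, c.example:8080, a.example:8080
	}
}
```

The atomic cursor keeps selection safe when several collector goroutines request a proxy at the same time.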
func ParseHTMLResponse ¶ added in v0.4.0
func ParseHTMLResponse(body []byte) (*goquery.Document, error)
ParseHTMLResponse parses HTML from a response body.
func RecordProxyFailure ¶ added in v0.4.0
func RecordProxyFailure(tracker ProxyHealth, resp *colly.Response)
RecordProxyFailure records a proxy failure on the tracker if applicable.
func SanitizeProxyURL ¶ added in v0.4.0
func SanitizeProxyURL(rawProxyURL string) string
SanitizeProxyURL strips credentials from a proxy URL for safe logging.
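Credential stripping of this kind can be approximated with net/url. The sketch below is an assumption about the approach rather than the package's code; sanitizeProxyURL and the "invalid-proxy-url" placeholder are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// sanitizeProxyURL is a hypothetical stand-in for SanitizeProxyURL. It parses
// the proxy URL and drops the userinfo component before rendering it again.
func sanitizeProxyURL(rawProxyURL string) string {
	parsed, err := url.Parse(rawProxyURL)
	if err != nil || parsed.Host == "" {
		// Assumption: unparseable input is replaced by a fixed placeholder so
		// that credentials inside a malformed string never reach the log.
		return "invalid-proxy-url"
	}
	parsed.User = nil
	return parsed.String()
}

func main() {
	fmt.Println(sanitizeProxyURL("http://user:secret@proxy.example.com:8080"))
	// http://proxy.example.com:8080
}
```

Returning a placeholder instead of echoing malformed input errs on the side of never logging a secret.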
func SetupErrorHandling ¶ added in v0.4.0
func SetupErrorHandling(collector *colly.Collector, handler ResponseHandler, retryHandler RetryHandler, tracker ProxyHealth, logger Logger)
SetupErrorHandling wires OnError with retry and proxy tracking.
Types ¶
type Config ¶
type Config struct {
// Category identifies the target category (e.g., "AMZN", "camps").
Category string
// Scraper controls concurrency, retries, and network behaviour.
Scraper ScraperConfig
// Platform holds domain-specific settings.
Platform PlatformConfig
// ResponseHandler processes HTTP responses. Optional; when nil, a default
// handler is created using the Evaluator.
ResponseHandler ResponseHandler
// Evaluator produces findings for a fetched document. Required when
// ResponseHandler is nil.
Evaluator Evaluator
// PlatformHooks customise title normalisation and retry decisions. Optional.
PlatformHooks PlatformHooks
// CookieProvider returns cookies per domain. Optional.
CookieProvider CookieProvider
// CookieDomains lists domains for which CookieProvider is called.
CookieDomains []string
// FilePersister handles artifact persistence. Optional; a default
// implementation is created when OutputDirectory is set.
FilePersister FilePersister
// OutputDirectory is optional; when set and FilePersister is nil the
// crawler will persist artifacts under this path.
OutputDirectory string
// RunFolder scopes persisted artifacts for a single execution.
RunFolder string
// Headers customises outbound requests. Optional.
Headers HeaderProvider
// Hook runs before each request. Optional.
Hook RequestHook
// Logger receives diagnostic messages. Optional (no-op if nil).
Logger Logger
}
Config wires the crawler service with target metadata, scraping options, and effectful collaborators.
type CookieProvider ¶
CookieProvider returns cookies for a given domain. Optional.
type DefaultResponseHandler ¶ added in v0.4.0
type DefaultResponseHandler struct {
// contains filtered or unexported fields
}
DefaultResponseHandler processes HTTP responses using an Evaluator. It parses HTML, extracts the title, runs the evaluator, and emits Results.
func NewDefaultResponseHandler ¶ added in v0.4.0
func NewDefaultResponseHandler(cfg Config, retryHandler RetryHandler, proxyTracker ProxyHealth, filePersister FilePersister, results chan<- *Result, logger Logger) *DefaultResponseHandler
NewDefaultResponseHandler creates the standard response handler.
func (*DefaultResponseHandler) SendResult ¶ added in v0.4.0
func (h *DefaultResponseHandler) SendResult(resp *colly.Response, success bool, errorMessage string)
func (*DefaultResponseHandler) SetSlotReleaser ¶ added in v0.4.0
func (h *DefaultResponseHandler) SetSlotReleaser(releaser func(*colly.Response))
func (*DefaultResponseHandler) Setup ¶ added in v0.4.0
func (h *DefaultResponseHandler) Setup(collector *colly.Collector)
type Evaluation ¶
type Evaluation struct {
Findings []Finding
}
Evaluation is the output of an Evaluator.
type Evaluator ¶
type Evaluator interface {
Evaluate(targetID string, document *goquery.Document) (Evaluation, error)
}
Evaluator processes a fetched HTML document and produces findings. It is used by the default ResponseHandler and is not needed when a custom ResponseHandler is provided.
type FilePersister ¶ added in v0.4.0
type FilePersister interface {
Save(targetID, fileName string, content []byte) error
Close() error
}
FilePersister persists binary artifacts associated with a target.
func NewBackgroundFilePersister ¶ added in v0.4.0
func NewBackgroundFilePersister(delegate FilePersister, workerCount, bufferSize int, logger Logger) FilePersister
NewBackgroundFilePersister wraps a FilePersister with an asynchronous worker pool.
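One way to picture an asynchronous wrapper around a FilePersister is a buffered channel drained by a fixed number of workers. The sketch below makes several assumptions that the documentation does not confirm: backgroundPersister, memoryPersister, and saveJob are hypothetical, write errors from the delegate are discarded, and Save must not be called after Close.

```go
package main

import (
	"fmt"
	"sync"
)

// FilePersister matches the interface documented above.
type FilePersister interface {
	Save(targetID, fileName string, content []byte) error
	Close() error
}

// memoryPersister is a hypothetical delegate that keeps files in a map.
type memoryPersister struct {
	mu    sync.Mutex
	files map[string][]byte
}

func (m *memoryPersister) Save(targetID, fileName string, content []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.files[targetID+"/"+fileName] = append([]byte(nil), content...)
	return nil
}

func (m *memoryPersister) Close() error { return nil }

type saveJob struct {
	targetID, fileName string
	content            []byte
}

// backgroundPersister queues saves and lets worker goroutines perform them.
type backgroundPersister struct {
	delegate  FilePersister
	jobs      chan saveJob
	workers   sync.WaitGroup
	closeOnce sync.Once
	closeErr  error
}

func newBackgroundPersister(delegate FilePersister, workerCount, bufferSize int) *backgroundPersister {
	b := &backgroundPersister{delegate: delegate, jobs: make(chan saveJob, bufferSize)}
	for i := 0; i < workerCount; i++ {
		b.workers.Add(1)
		go func() {
			defer b.workers.Done()
			for job := range b.jobs {
				// Assumption: errors are dropped here; a real wrapper would log them.
				_ = b.delegate.Save(job.targetID, job.fileName, job.content)
			}
		}()
	}
	return b
}

// Save copies the content and enqueues it; it blocks while the buffer is full.
func (b *backgroundPersister) Save(targetID, fileName string, content []byte) error {
	b.jobs <- saveJob{targetID, fileName, append([]byte(nil), content...)}
	return nil
}

// Close drains the queue, waits for the workers, then closes the delegate.
func (b *backgroundPersister) Close() error {
	b.closeOnce.Do(func() {
		close(b.jobs)
		b.workers.Wait()
		b.closeErr = b.delegate.Close()
	})
	return b.closeErr
}

func main() {
	mem := &memoryPersister{files: map[string][]byte{}}
	bg := newBackgroundPersister(mem, 2, 4)
	_ = bg.Save("42", "page.html", []byte("<html></html>"))
	_ = bg.Close()
	fmt.Println(len(mem.files)) // 1
}
```

Copying the content before enqueueing means the caller can reuse its buffer immediately, at the cost of one extra allocation per save.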
func NewDirectoryFilePersister ¶ added in v0.4.0
func NewDirectoryFilePersister(rootDirectory, category, runFolder string) FilePersister
NewDirectoryFilePersister creates a file persister that writes to disk.
type Finding ¶
type Finding struct {
ID string `json:"id,omitempty"`
Description string `json:"description"`
Passed bool `json:"passed"`
Message string `json:"message"`
Data string `json:"data,omitempty"`
}
Finding captures a single evaluation outcome from the Evaluator.
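The struct tags above determine the JSON shape of a finding. The following sketch copies the documented struct and encodes one value to show the effect of omitempty on ID and Data; encodeFinding is a hypothetical helper added only for the demonstration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Finding is copied from the declaration above.
type Finding struct {
	ID          string `json:"id,omitempty"`
	Description string `json:"description"`
	Passed      bool   `json:"passed"`
	Message     string `json:"message"`
	Data        string `json:"data,omitempty"`
}

// encodeFinding is a hypothetical helper that returns the JSON form of f.
func encodeFinding(f Finding) string {
	encoded, err := json.Marshal(f)
	if err != nil {
		return ""
	}
	return string(encoded)
}

func main() {
	fmt.Println(encodeFinding(Finding{
		Description: "page has a title",
		Passed:      true,
		Message:     "title found",
	}))
	// {"description":"page has a title","passed":true,"message":"title found"}
}
```

Description, Passed, and Message are always present in the output, even when empty or false, because their tags carry no omitempty option.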
type HeaderProvider ¶
HeaderProvider decorates outbound HTTP requests. Optional.
type Logger ¶
type Logger interface {
Debug(format string, args ...interface{})
Info(format string, args ...interface{})
Warning(format string, args ...interface{})
Error(format string, args ...interface{})
}
Logger emits diagnostic messages at four severity levels; each method takes a format string and arguments. Implementations must be safe for concurrent use.
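A Logger can be satisfied by a thin adapter over the standard log package, whose *log.Logger is documented as usable from multiple goroutines. The adapter below is a sketch under assumptions: stdLogger, emit, renderSample, and the level labels are hypothetical, and format strings are assumed to follow fmt conventions.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
)

// stdLogger is a hypothetical adapter that provides the four Logger methods
// by delegating to a *log.Logger.
type stdLogger struct {
	out *log.Logger
}

// emit prefixes the formatted message with a level label.
func (l *stdLogger) emit(level, format string, args ...interface{}) {
	l.out.Printf("%s %s", level, fmt.Sprintf(format, args...))
}

func (l *stdLogger) Debug(format string, args ...interface{})   { l.emit("DEBUG", format, args...) }
func (l *stdLogger) Info(format string, args ...interface{})    { l.emit("INFO", format, args...) }
func (l *stdLogger) Warning(format string, args ...interface{}) { l.emit("WARNING", format, args...) }
func (l *stdLogger) Error(format string, args ...interface{})   { l.emit("ERROR", format, args...) }

// renderSample logs one warning into a buffer and returns the written line.
func renderSample() string {
	var buf bytes.Buffer
	logger := &stdLogger{out: log.New(&buf, "", 0)}
	logger.Warning("proxy rotated after %d failures", 2)
	return buf.String()
}

func main() {
	fmt.Print(renderSample()) // WARNING proxy rotated after 2 failures
}
```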
type PlatformConfig ¶ added in v0.4.0
PlatformConfig restricts the crawler to known domains.
func (PlatformConfig) Validate ¶ added in v0.4.0
func (cfg PlatformConfig) Validate() error
Validate ensures the platform configuration is usable.
type PlatformHooks ¶ added in v0.4.0
type PlatformHooks interface {
NormalizeTitle(title string) string
ShouldRetry(title string, document *goquery.Document) RetryDecision
}
PlatformHooks provides platform-specific normalisation and retry logic.
type ProxyHealth ¶ added in v0.4.0
type ProxyHealth interface {
IsAvailable(proxy string) bool
RecordSuccess(proxy string)
RecordFailure(proxy string)
RecordCriticalFailure(proxy string)
}
ProxyHealth tracks proxy availability for circuit-breaker rotation.
func NewProxyHealthTracker ¶ added in v0.4.0
func NewProxyHealthTracker(values []string, logger Logger) ProxyHealth
NewProxyHealthTracker creates a circuit-breaker health tracker for proxies.
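The circuit-breaker idea behind ProxyHealth can be illustrated with a small tracker that opens the circuit after a run of failures and closes it again after a cooldown. Everything specific here is an assumption: simpleTracker, fakeClock, the failure threshold, the cooldown, and the rule that a critical failure opens the circuit immediately are hypothetical and not taken from the package.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type breakerState struct {
	failures  int
	openUntil time.Time
}

// simpleTracker is a hypothetical implementation of the four ProxyHealth methods.
type simpleTracker struct {
	mu        sync.Mutex
	threshold int
	cooldown  time.Duration
	now       func() time.Time
	states    map[string]*breakerState
}

func newSimpleTracker(threshold int, cooldown time.Duration, now func() time.Time) *simpleTracker {
	return &simpleTracker{threshold: threshold, cooldown: cooldown, now: now, states: map[string]*breakerState{}}
}

// state returns the record for proxy, creating it on first use.
// Callers must hold t.mu.
func (t *simpleTracker) state(proxy string) *breakerState {
	s, ok := t.states[proxy]
	if !ok {
		s = &breakerState{}
		t.states[proxy] = s
	}
	return s
}

// IsAvailable reports whether the cooldown for proxy has elapsed.
func (t *simpleTracker) IsAvailable(proxy string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return !t.now().Before(t.state(proxy).openUntil)
}

func (t *simpleTracker) RecordSuccess(proxy string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	*t.state(proxy) = breakerState{}
}

// RecordFailure opens the circuit once the threshold is reached.
func (t *simpleTracker) RecordFailure(proxy string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	s := t.state(proxy)
	s.failures++
	if s.failures >= t.threshold {
		s.failures = 0
		s.openUntil = t.now().Add(t.cooldown)
	}
}

// RecordCriticalFailure opens the circuit immediately (an assumption).
func (t *simpleTracker) RecordCriticalFailure(proxy string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	s := t.state(proxy)
	s.failures = 0
	s.openUntil = t.now().Add(t.cooldown)
}

// fakeClock makes the demonstration deterministic.
type fakeClock struct{ current time.Time }

func (c *fakeClock) Now() time.Time { return c.current }

// demo walks one proxy through failure, cooldown, and critical failure.
func demo() []bool {
	clock := &fakeClock{current: time.Unix(0, 0)}
	tracker := newSimpleTracker(2, time.Minute, clock.Now)
	var seen []bool
	seen = append(seen, tracker.IsAvailable("p"))
	tracker.RecordFailure("p")
	seen = append(seen, tracker.IsAvailable("p"))
	tracker.RecordFailure("p")
	seen = append(seen, tracker.IsAvailable("p"))
	clock.current = clock.current.Add(2 * time.Minute)
	seen = append(seen, tracker.IsAvailable("p"))
	tracker.RecordCriticalFailure("p")
	seen = append(seen, tracker.IsAvailable("p"))
	return seen
}

func main() {
	fmt.Println(demo()) // [true true false true false]
}
```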
type RequestConfigurator ¶ added in v0.4.0
RequestConfigurator sets up headers and cookies on the collector.
func NewRequestConfigurator ¶ added in v0.4.0
func NewRequestConfigurator(cfg Config, logger Logger) RequestConfigurator
NewRequestConfigurator creates a configurator from the crawler config.
type RequestHook ¶
RequestHook runs before each outbound request. Optional.
type ResponseHandler ¶ added in v0.4.0
type ResponseHandler interface {
Setup(collector *colly.Collector)
SendResult(resp *colly.Response, success bool, errorMessage string)
SetSlotReleaser(releaser func(*colly.Response))
}
ResponseHandler processes HTTP responses and emits results. Implementations are injected into the Service at construction time. The default handler parses HTML, runs an Evaluator, and sends Results. Custom handlers can implement platform-specific logic (image download, discoverability probing, etc.).
type Result ¶
type Result struct {
TargetID string `json:"targetId"`
TargetURL string `json:"targetUrl"`
FinalURL string `json:"finalUrl,omitempty"`
CanonicalURL string `json:"canonicalUrl,omitempty"`
Category string `json:"category"`
Title string `json:"title,omitempty"`
Success bool `json:"success"`
ErrorMessage string `json:"errorMessage,omitempty"`
HTTPStatusCode int `json:"httpStatusCode,omitempty"`
Findings []Finding `json:"findings,omitempty"`
Document *goquery.Document `json:"-"`
ProxyURL string `json:"proxyUrl,omitempty"`
}
Result represents the outcome of crawling a single target.
type RetryDecision ¶ added in v0.4.0
type RetryDecision struct {
ShouldRetry bool
Message string
LogMessage string
Policy RetryPolicy
ExhaustionBehavior RetryExhaustionBehavior
}
RetryDecision captures the outcome of a platform retry check.
func (RetryDecision) ResolvedLogMessage ¶ added in v0.4.0
func (d RetryDecision) ResolvedLogMessage() string
ResolvedLogMessage returns LogMessage, or falls back to Message when LogMessage is empty.
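The fallback can be expressed in a few lines. The sketch below uses a reduced, hypothetical retryDecision type and assumes that "no log message" means the empty string.

```go
package main

import "fmt"

// retryDecision is a reduced, hypothetical copy of RetryDecision that keeps
// only the fields relevant to the fallback.
type retryDecision struct {
	ShouldRetry bool
	Message     string
	LogMessage  string
}

// resolvedLogMessage prefers LogMessage and falls back to Message.
// Assumption: an empty LogMessage counts as "not set".
func (d retryDecision) resolvedLogMessage() string {
	if d.LogMessage != "" {
		return d.LogMessage
	}
	return d.Message
}

func main() {
	brief := retryDecision{ShouldRetry: true, Message: "blocked page detected"}
	detailed := retryDecision{ShouldRetry: true, Message: "blocked page detected", LogMessage: "captcha title matched"}
	fmt.Println(brief.resolvedLogMessage())    // blocked page detected
	fmt.Println(detailed.resolvedLogMessage()) // captcha title matched
}
```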
type RetryExhaustionBehavior ¶ added in v0.4.0
type RetryExhaustionBehavior uint8
RetryExhaustionBehavior controls what happens when retries are exhausted.
const (
RetryExhaustionBehaviorFail RetryExhaustionBehavior = iota
RetryExhaustionBehaviorContinue
)
type RetryHandler ¶ added in v0.4.0
type RetryHandler interface {
Retry(response *colly.Response, options RetryOptions) bool
}
RetryHandler encapsulates retry behaviour for failed responses.
func NewRetryHandler ¶ added in v0.4.0
func NewRetryHandler(scraper ScraperConfig, logger Logger) RetryHandler
NewRetryHandler constructs a retry handler from scraper config.
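The Overview mentions retries with exponential backoff. The documentation does not state the schedule the retry handler uses, so the following is only a generic sketch of capped exponential backoff; backoffDelay and the base and cap values are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns base doubled once per attempt (attempt is zero-based)
// and never more than max. The check against max/2 before doubling also
// guards against integer overflow for large attempt counts.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	delay := base
	for i := 0; i < attempt; i++ {
		if delay > max/2 {
			return max
		}
		delay *= 2
	}
	if delay > max {
		return max
	}
	return delay
}

func main() {
	for attempt := 0; attempt <= 5; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, 500*time.Millisecond, 8*time.Second))
	}
	// 0 500ms
	// 1 1s
	// 2 2s
	// 3 4s
	// 4 8s
	// 5 8s
}
```

Production schedules often add random jitter so that many targets retrying at once do not hit the server in lockstep; that is omitted here for clarity.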
type RetryOptions ¶ added in v0.4.0
RetryOptions controls retry behaviour per-request.
type RetryPolicy ¶ added in v0.4.0
type RetryPolicy uint8
RetryPolicy controls how retries are performed.
const (
RetryPolicyDefault RetryPolicy = iota
RetryPolicyRotateProxy
)
type ScraperConfig ¶ added in v0.4.0
type ScraperConfig struct {
MaxDepth int
Parallelism int
RetryCount int
HTTPTimeout time.Duration
InsecureSkipVerify bool
RateLimit time.Duration
ProxyList []string
ProxyCircuitBreakerEnabled bool
SaveFiles bool
}
ScraperConfig controls concurrency, retries, and network behaviour.
func (ScraperConfig) Validate ¶ added in v0.4.0
func (cfg ScraperConfig) Validate() error
Validate checks that essential numeric fields are positive.
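The documentation does not say which fields Validate treats as essential. The sketch below assumes, purely for illustration, that Parallelism and HTTPTimeout must be positive and that RetryCount must not be negative; scraperConfig and validate are hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// scraperConfig mirrors a subset of the ScraperConfig fields.
type scraperConfig struct {
	Parallelism int
	RetryCount  int
	HTTPTimeout time.Duration
}

// validate rejects configurations that could not drive a crawl.
// Assumption: the set of checked fields is chosen for this sketch only.
func (cfg scraperConfig) validate() error {
	if cfg.Parallelism <= 0 {
		return errors.New("parallelism must be positive")
	}
	if cfg.HTTPTimeout <= 0 {
		return errors.New("http timeout must be positive")
	}
	if cfg.RetryCount < 0 {
		return errors.New("retry count must not be negative")
	}
	return nil
}

func main() {
	valid := scraperConfig{Parallelism: 4, RetryCount: 3, HTTPTimeout: 30 * time.Second}
	fmt.Println(valid.validate())                        // <nil>
	fmt.Println(scraperConfig{RetryCount: 3}.validate()) // parallelism must be positive
}
```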
type Service ¶
type Service struct {
// contains filtered or unexported fields
}
Service orchestrates crawling of targets and emits results.
func NewService ¶
NewService constructs a crawler service. When cfg.ResponseHandler is nil, a DefaultResponseHandler is created using cfg.Evaluator and the results channel. When cfg.ResponseHandler is set, the results channel may be nil.
func (*Service) RetryHandler ¶ added in v0.4.0
func (svc *Service) RetryHandler() RetryHandler
RetryHandler returns the service's retry handler for use by custom ResponseHandlers.
type Target ¶ added in v0.4.0
Target describes a single URL to crawl.
func NewTarget ¶ added in v0.4.0
func NewTarget(id, category, url string, opts ...TargetOption) (Target, error)
NewTarget constructs a Target after validating mandatory fields.
func (Target) MetadataValue ¶ added in v0.4.0
MetadataValue returns a metadata value by key, or empty string if absent.
type TargetOption ¶ added in v0.4.0
type TargetOption func(*Target)
TargetOption mutates optional fields on Target construction.
func WithMetadata ¶ added in v0.4.0
func WithMetadata(key, value string) TargetOption
WithMetadata sets an extensible key-value pair on the target.
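NewTarget, TargetOption, and WithMetadata follow the functional-options pattern. The sketch below reproduces that pattern with hypothetical lowercase stand-ins; the field names and the assumption that id, category, and url are all mandatory are illustrative, since the documentation does not list the Target fields.

```go
package main

import (
	"errors"
	"fmt"
)

// target is a hypothetical stand-in for Target.
type target struct {
	ID       string
	Category string
	URL      string
	metadata map[string]string
}

// targetOption mutates optional fields during construction.
type targetOption func(*target)

// withMetadata stores one key-value pair, allocating the map on first use.
func withMetadata(key, value string) targetOption {
	return func(t *target) {
		if t.metadata == nil {
			t.metadata = map[string]string{}
		}
		t.metadata[key] = value
	}
}

// newTarget validates the mandatory fields, then applies options in order.
func newTarget(id, category, rawURL string, opts ...targetOption) (target, error) {
	if id == "" || category == "" || rawURL == "" {
		return target{}, errors.New("id, category and url are required")
	}
	t := target{ID: id, Category: category, URL: rawURL}
	for _, opt := range opts {
		opt(&t)
	}
	return t, nil
}

// metadataValue returns the value for key, or "" when it is absent.
// Reading from a nil map is safe in Go and yields the zero value.
func (t target) metadataValue(key string) string {
	return t.metadata[key]
}

func main() {
	t, err := newTarget("42", "camps", "https://example.com/item/42", withMetadata("region", "eu"))
	if err != nil {
		panic(err)
	}
	fmt.Println(t.metadataValue("region"))        // eu
	fmt.Println(t.metadataValue("missing") == "") // true
}
```

Because options are plain functions, new optional fields can be added later without changing the constructor's signature.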