crawler

package
v0.4.0 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 23, 2026 License: MIT Imports: 20 Imported by: 0

Documentation

Overview

Package crawler provides a generic, configurable web crawling engine built on Colly. It supports concurrent page fetching, retries with exponential backoff, rate limiting, proxy rotation with circuit-breaker health tracking, and pluggable response processing via the ResponseHandler interface.

Simple consumers use an Evaluator with the default response handler. Advanced consumers inject a custom ResponseHandler for full control over how fetched documents are processed and results are emitted.


Constants

const (
	// Context keys for Colly request context.
	CtxTargetIDKey       = "crawler_target_id"
	CtxTargetCategoryKey = "crawler_target_category"
	CtxTargetURLKey      = "crawler_target_url"
	CtxRunContextKey     = "crawler_run_context"
	CtxHTTPStatusCodeKey = "crawler_http_status"
	CtxErrorKey          = "crawler_error"
	CtxEvaluationKey     = "crawler_evaluation"
	CtxTitleKey          = "crawler_title"
	CtxInitialURLKey     = "crawler_initial_url"
	CtxFinalURLKey       = "crawler_final_url"
	CtxCanonicalURLKey   = "crawler_canonical_url"
	CtxRedirectedKey     = "crawler_redirected"
	CtxNotFoundKey       = "crawler_not_found"

	RetryCountKey  = "crawler_retry_count"
	RetriedFlagKey = "crawler_retried"

	TitleNotFound = "Title Not Found"
	UnknownURL    = "UnknownURL"
	HTMLExtension = "html"
)

Variables

This section is empty.

Functions

func DescribeProxyForLog added in v0.4.0

func DescribeProxyForLog(rawProxyURL string) string

DescribeProxyForLog returns a safe-to-log proxy description.

func ExtractCanonicalURL added in v0.4.0

func ExtractCanonicalURL(doc *goquery.Document) string

ExtractCanonicalURL extracts the canonical URL from a <link rel="canonical"> tag.

func ExtractTitle added in v0.4.0

func ExtractTitle(doc *goquery.Document) string

ExtractTitle returns the text content of the first <title> element.

func GetContextValue added in v0.4.0

func GetContextValue(ctx *colly.Context, key, fallback string) string

GetContextValue returns a value from a Colly context, or the fallback if absent.

func GetRetryAttempt added in v0.4.0

func GetRetryAttempt(response *colly.Response) int

GetRetryAttempt returns the current retry attempt from the response context.

func GetTargetCategoryFromContext added in v0.4.0

func GetTargetCategoryFromContext(resp *colly.Response) string

GetTargetCategoryFromContext retrieves the target category from a Colly response context.

func GetTargetIDFromContext added in v0.4.0

func GetTargetIDFromContext(resp *colly.Response) string

GetTargetIDFromContext retrieves the target ID from a Colly response context.

func GetTargetURLFromContext added in v0.4.0

func GetTargetURLFromContext(resp *colly.Response) string

GetTargetURLFromContext retrieves the target URL from a Colly response context.

func NewContextAwareTransport added in v0.4.0

func NewContextAwareTransport(base http.RoundTripper, ctxProvider func() context.Context) http.RoundTripper

NewContextAwareTransport wraps a transport with run-context cancellation.

func NewHTTPTransport added in v0.4.0

func NewHTTPTransport(insecureSkipVerify bool, requestTimeout time.Duration) *http.Transport

NewHTTPTransport creates an HTTP transport with idle-timeout support.

func NewPanicSafeTransport added in v0.4.0

func NewPanicSafeTransport(base http.RoundTripper, logger Logger) http.RoundTripper

NewPanicSafeTransport wraps a transport with panic recovery.
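
As an illustration of the panic-recovery idea, the following is a minimal sketch of a RoundTripper wrapper that converts a panic in the underlying transport into an ordinary error. The names panicSafeTransport and alwaysPanics are hypothetical and exist only for this example; the package's own wrapper also takes a Logger and may behave differently.

```go
package main

import (
	"fmt"
	"net/http"
)

// panicSafeTransport is a hypothetical wrapper written for this sketch.
// It is not the type returned by NewPanicSafeTransport.
type panicSafeTransport struct {
	base http.RoundTripper
}

// RoundTrip delegates to the base transport. If the base panics, the
// deferred function recovers and reports the panic as an error instead.
func (t panicSafeTransport) RoundTrip(req *http.Request) (resp *http.Response, err error) {
	defer func() {
		if r := recover(); r != nil {
			resp = nil
			err = fmt.Errorf("transport panic recovered: %v", r)
		}
	}()
	return t.base.RoundTrip(req)
}

// alwaysPanics is a stand-in base transport used only to trigger the recovery path.
type alwaysPanics struct{}

func (alwaysPanics) RoundTrip(*http.Request) (*http.Response, error) {
	panic("boom")
}

func main() {
	tr := panicSafeTransport{base: alwaysPanics{}}
	_, err := tr.RoundTrip(nil) // the stub never touches the request
	fmt.Println(err)            // transport panic recovered: boom
}
```

Named return values are what allow the deferred function to replace the result after a panic.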

func NewProxyRotator added in v0.4.0

func NewProxyRotator(rawProxies []string, tracker ProxyHealth, logger Logger) (colly.ProxyFunc, error)

NewProxyRotator creates a round-robin proxy rotation function.
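
Round-robin rotation can be sketched with an atomic counter over a pre-parsed proxy list. The helper newRoundRobin below is hypothetical: it returns a function with the same shape as colly.ProxyFunc but, unlike NewProxyRotator, it takes no health tracker or logger and never skips an unhealthy proxy.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

// newRoundRobin is a hypothetical helper for this sketch. The returned
// function has the signature func(*http.Request) (*url.URL, error).
func newRoundRobin(rawProxies []string) (func(*http.Request) (*url.URL, error), error) {
	if len(rawProxies) == 0 {
		return nil, errors.New("no proxies configured")
	}
	parsed := make([]*url.URL, 0, len(rawProxies))
	for i, raw := range rawProxies {
		u, err := url.Parse(raw)
		if err != nil {
			// Report only the index so credentials in the URL are not leaked.
			return nil, fmt.Errorf("proxy %d: invalid URL", i)
		}
		parsed = append(parsed, u)
	}
	var next atomic.Uint64 // shared counter, safe for concurrent callers
	return func(*http.Request) (*url.URL, error) {
		i := next.Add(1) - 1
		return parsed[i%uint64(len(parsed))], nil
	}, nil
}

func main() {
	rotate, err := newRoundRobin([]string{
		"http://proxy-a.example:8080",
		"http://proxy-b.example:8080",
	})
	if err != nil {
		panic(err)
	}
	for i := 0; i < 3; i++ {
		u, _ := rotate(nil)
		fmt.Println(u.Host)
	}
	// Prints proxy-a.example:8080, proxy-b.example:8080, proxy-a.example:8080.
}
```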

func ParseHTMLResponse added in v0.4.0

func ParseHTMLResponse(body []byte) (*goquery.Document, error)

ParseHTMLResponse parses HTML from a response body.

func RecordProxyFailure added in v0.4.0

func RecordProxyFailure(tracker ProxyHealth, resp *colly.Response)

RecordProxyFailure records a proxy failure on the tracker if applicable.

func SanitizeProxyURL added in v0.4.0

func SanitizeProxyURL(rawProxyURL string) string

SanitizeProxyURL strips credentials from a proxy URL for safe logging.
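
One plausible way to strip credentials uses only net/url. The function sanitizeProxyURL below is a hypothetical stand-in written for this sketch; the placeholder it returns for unparseable input is an assumption, and the package's own function may handle that case differently.

```go
package main

import (
	"fmt"
	"net/url"
)

// sanitizeProxyURL is a hypothetical re-implementation for illustration.
// It drops the userinfo component so the result is safe to log.
func sanitizeProxyURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		// Assumption: never echo input that could not be parsed.
		return "<invalid proxy URL>"
	}
	u.User = nil
	return u.String()
}

func main() {
	fmt.Println(sanitizeProxyURL("http://user:secret@proxy.example:8080"))
	// http://proxy.example:8080
}
```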

func SetupErrorHandling added in v0.4.0

func SetupErrorHandling(collector *colly.Collector, handler ResponseHandler, retryHandler RetryHandler, tracker ProxyHealth, logger Logger)

SetupErrorHandling wires OnError with retry and proxy tracking.

Types

type Config

type Config struct {
	// Category identifies the target category (e.g., "AMZN", "camps").
	Category string

	// Scraper controls concurrency, retries, and network behaviour.
	Scraper ScraperConfig

	// Platform holds domain-specific settings.
	Platform PlatformConfig

	// ResponseHandler processes HTTP responses. Optional; when nil, a default
	// handler is created using the Evaluator.
	ResponseHandler ResponseHandler

	// Evaluator produces findings for a fetched document. Required when
	// ResponseHandler is nil.
	Evaluator Evaluator

	// PlatformHooks customise title normalisation and retry decisions. Optional.
	PlatformHooks PlatformHooks

	// CookieProvider returns cookies per domain. Optional.
	CookieProvider CookieProvider

	// CookieDomains lists domains for which CookieProvider is called.
	CookieDomains []string

	// FilePersister handles artifact persistence. Optional; a default
	// implementation is created when OutputDirectory is set.
	FilePersister FilePersister

	// OutputDirectory is optional; when set and FilePersister is nil the
	// crawler will persist artifacts under this path.
	OutputDirectory string

	// RunFolder scopes persisted artifacts for a single execution.
	RunFolder string

	// Headers customises outbound requests. Optional.
	Headers HeaderProvider

	// Hook runs before each request. Optional.
	Hook RequestHook

	// Logger receives diagnostic messages. Optional (no-op if nil).
	Logger Logger
}

Config wires the crawler service with target metadata, scraping options, and effectful collaborators.

func (Config) Validate

func (cfg Config) Validate() error

Validate ensures required configuration is present and self-consistent.

type CookieProvider

type CookieProvider func(domain string) []*http.Cookie

CookieProvider returns cookies for a given domain. Optional.
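
A CookieProvider can be as simple as a lookup table. The sketch below redeclares the type locally so it compiles on its own; staticCookieProvider and the domain shop.example are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// CookieProvider is redeclared here so the sketch is self-contained.
type CookieProvider func(domain string) []*http.Cookie

// staticCookieProvider is a hypothetical helper: it serves cookies from a
// fixed table keyed by domain, then by cookie name.
func staticCookieProvider(byDomain map[string]map[string]string) CookieProvider {
	return func(domain string) []*http.Cookie {
		values := byDomain[domain]
		cookies := make([]*http.Cookie, 0, len(values))
		for name, value := range values {
			cookies = append(cookies, &http.Cookie{Name: name, Value: value, Domain: domain})
		}
		return cookies // empty for unknown domains
	}
}

func main() {
	provider := staticCookieProvider(map[string]map[string]string{
		"shop.example": {"session": "abc"},
	})
	for _, c := range provider("shop.example") {
		fmt.Println(c.Name, c.Value) // session abc
	}
}
```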

type DefaultResponseHandler added in v0.4.0

type DefaultResponseHandler struct {
	// contains filtered or unexported fields
}

DefaultResponseHandler processes HTTP responses using an Evaluator. It parses HTML, extracts the title, runs the evaluator, and emits Results.

func NewDefaultResponseHandler added in v0.4.0

func NewDefaultResponseHandler(
	cfg Config,
	retryHandler RetryHandler,
	proxyTracker ProxyHealth,
	filePersister FilePersister,
	results chan<- *Result,
	logger Logger,
) *DefaultResponseHandler

NewDefaultResponseHandler creates the standard response handler.

func (*DefaultResponseHandler) SendResult added in v0.4.0

func (h *DefaultResponseHandler) SendResult(resp *colly.Response, success bool, errorMessage string)

func (*DefaultResponseHandler) SetSlotReleaser added in v0.4.0

func (h *DefaultResponseHandler) SetSlotReleaser(releaser func(*colly.Response))

func (*DefaultResponseHandler) Setup added in v0.4.0

func (h *DefaultResponseHandler) Setup(collector *colly.Collector)

type Evaluation

type Evaluation struct {
	Findings []Finding
}

Evaluation is the output of an Evaluator.

type Evaluator

type Evaluator interface {
	Evaluate(targetID string, document *goquery.Document) (Evaluation, error)
}

Evaluator processes a fetched HTML document and produces findings. Used by the default ResponseHandler. Not needed when a custom ResponseHandler is provided.

type FilePersister added in v0.4.0

type FilePersister interface {
	Save(targetID, fileName string, content []byte) error
	Close() error
}

FilePersister persists binary artifacts associated with a target.
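
For tests, the interface can be satisfied by an in-memory implementation. The sketch below redeclares FilePersister locally; memoryPersister is hypothetical, and rejecting Save after Close is an assumption of this example rather than documented behaviour.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// FilePersister is redeclared here so the sketch is self-contained.
type FilePersister interface {
	Save(targetID, fileName string, content []byte) error
	Close() error
}

// memoryPersister is a hypothetical in-memory implementation for tests.
type memoryPersister struct {
	mu     sync.Mutex
	files  map[string][]byte
	closed bool
}

var _ FilePersister = (*memoryPersister)(nil)

func newMemoryPersister() *memoryPersister {
	return &memoryPersister{files: make(map[string][]byte)}
}

// Save stores a copy of content under "targetID/fileName".
func (p *memoryPersister) Save(targetID, fileName string, content []byte) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.closed {
		return errors.New("persister is closed") // assumption of this sketch
	}
	p.files[targetID+"/"+fileName] = append([]byte(nil), content...)
	return nil
}

// Close marks the persister as closed; later Save calls fail.
func (p *memoryPersister) Close() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.closed = true
	return nil
}

func main() {
	p := newMemoryPersister()
	fmt.Println(p.Save("t1", "page.html", []byte("<html>"))) // <nil>
	_ = p.Close()
	fmt.Println(p.Save("t1", "late.html", nil)) // persister is closed
}
```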

func NewBackgroundFilePersister added in v0.4.0

func NewBackgroundFilePersister(delegate FilePersister, workerCount, bufferSize int, logger Logger) FilePersister

NewBackgroundFilePersister wraps a FilePersister with an asynchronous worker pool.

func NewDirectoryFilePersister added in v0.4.0

func NewDirectoryFilePersister(rootDirectory, category, runFolder string) FilePersister

NewDirectoryFilePersister creates a file persister that writes to disk.

type Finding

type Finding struct {
	ID          string `json:"id,omitempty"`
	Description string `json:"description"`
	Passed      bool   `json:"passed"`
	Message     string `json:"message"`
	Data        string `json:"data,omitempty"`
}

Finding captures a single evaluation outcome from the Evaluator.
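
The struct tags determine the JSON shape: ID and Data are omitted when empty, while the other fields are always present. The sketch below copies the struct locally to show the encoding; encodeFinding is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Finding is copied from the declaration above so the sketch is self-contained.
type Finding struct {
	ID          string `json:"id,omitempty"`
	Description string `json:"description"`
	Passed      bool   `json:"passed"`
	Message     string `json:"message"`
	Data        string `json:"data,omitempty"`
}

// encodeFinding is a hypothetical helper that returns the JSON form as a string.
func encodeFinding(f Finding) (string, error) {
	b, err := json.Marshal(f)
	return string(b), err
}

func main() {
	out, err := encodeFinding(Finding{
		Description: "page has a title",
		Passed:      true,
		Message:     "title present",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	// {"description":"page has a title","passed":true,"message":"title present"}
}
```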

type HeaderProvider

type HeaderProvider interface {
	Apply(category string, request *colly.Request)
}

HeaderProvider decorates outbound HTTP requests. Optional.

type Logger

type Logger interface {
	Debug(format string, args ...interface{})
	Info(format string, args ...interface{})
	Warning(format string, args ...interface{})
	Error(format string, args ...interface{})
}

Logger emits structured diagnostic messages. Safe for concurrent use.
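
Any type with these four printf-style methods satisfies the interface. The sketch below shows a hypothetical memoryLogger that records formatted lines behind a mutex, which can be handy when asserting on log output in tests. The interface is redeclared locally and the level prefixes are an arbitrary choice.

```go
package main

import (
	"fmt"
	"sync"
)

// Logger is redeclared here so the sketch is self-contained.
type Logger interface {
	Debug(format string, args ...interface{})
	Info(format string, args ...interface{})
	Warning(format string, args ...interface{})
	Error(format string, args ...interface{})
}

// memoryLogger is a hypothetical Logger that keeps formatted lines in memory.
type memoryLogger struct {
	mu    sync.Mutex // guards lines so the logger is safe for concurrent use
	lines []string
}

var _ Logger = (*memoryLogger)(nil)

func (l *memoryLogger) record(level, format string, args ...interface{}) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.lines = append(l.lines, level+" "+fmt.Sprintf(format, args...))
}

func (l *memoryLogger) Debug(format string, args ...interface{}) {
	l.record("DEBUG", format, args...)
}

func (l *memoryLogger) Info(format string, args ...interface{}) {
	l.record("INFO", format, args...)
}

func (l *memoryLogger) Warning(format string, args ...interface{}) {
	l.record("WARN", format, args...)
}

func (l *memoryLogger) Error(format string, args ...interface{}) {
	l.record("ERROR", format, args...)
}

func main() {
	logger := &memoryLogger{}
	logger.Info("fetched %d targets", 3)
	fmt.Println(logger.lines[0]) // INFO fetched 3 targets
}
```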

type Page deprecated

type Page = Target

Page is a deprecated alias for Target.

Deprecated: Use Target instead.

type PlatformConfig added in v0.4.0

type PlatformConfig struct {
	AllowedDomains []string
	CookieDomains  []string
}

PlatformConfig restricts the crawler to known domains.

func (PlatformConfig) Validate added in v0.4.0

func (cfg PlatformConfig) Validate() error

Validate ensures the platform configuration is usable.

type PlatformHooks added in v0.4.0

type PlatformHooks interface {
	NormalizeTitle(title string) string
	ShouldRetry(title string, document *goquery.Document) RetryDecision
}

PlatformHooks provide platform-specific normalisation and retry logic.

type ProxyHealth added in v0.4.0

type ProxyHealth interface {
	IsAvailable(proxy string) bool
	RecordSuccess(proxy string)
	RecordFailure(proxy string)
	RecordCriticalFailure(proxy string)
}

ProxyHealth tracks proxy availability for circuit-breaker rotation.

func NewProxyHealthTracker added in v0.4.0

func NewProxyHealthTracker(values []string, logger Logger) ProxyHealth

NewProxyHealthTracker creates a circuit-breaker health tracker for proxies.
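
The circuit-breaker idea behind ProxyHealth can be sketched as a counter of consecutive failures per proxy. Everything in this example is an assumption made for illustration: the simpleTracker type, the failure threshold, and the rule that a success closes the circuit again. A production breaker would normally add a cool-down period before retrying an open proxy, which is omitted here.

```go
package main

import (
	"fmt"
	"sync"
)

// ProxyHealth is redeclared here so the sketch is self-contained.
type ProxyHealth interface {
	IsAvailable(proxy string) bool
	RecordSuccess(proxy string)
	RecordFailure(proxy string)
	RecordCriticalFailure(proxy string)
}

// simpleTracker is a hypothetical tracker: a proxy becomes unavailable after
// `threshold` consecutive failures, or immediately after a critical failure.
type simpleTracker struct {
	mu        sync.Mutex
	threshold int
	failures  map[string]int
	open      map[string]bool // true means the circuit is open (proxy skipped)
}

var _ ProxyHealth = (*simpleTracker)(nil)

func newSimpleTracker(threshold int) *simpleTracker {
	return &simpleTracker{
		threshold: threshold,
		failures:  make(map[string]int),
		open:      make(map[string]bool),
	}
}

func (t *simpleTracker) IsAvailable(proxy string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return !t.open[proxy]
}

// RecordSuccess resets the failure count and closes the circuit.
func (t *simpleTracker) RecordSuccess(proxy string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.failures[proxy] = 0
	delete(t.open, proxy)
}

// RecordFailure opens the circuit once the threshold is reached.
func (t *simpleTracker) RecordFailure(proxy string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.failures[proxy]++
	if t.failures[proxy] >= t.threshold {
		t.open[proxy] = true
	}
}

// RecordCriticalFailure opens the circuit immediately.
func (t *simpleTracker) RecordCriticalFailure(proxy string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.open[proxy] = true
}

func main() {
	tracker := newSimpleTracker(2)
	proxy := "http://proxy-a.example:8080"
	tracker.RecordFailure(proxy)
	fmt.Println(tracker.IsAvailable(proxy)) // true
	tracker.RecordFailure(proxy)
	fmt.Println(tracker.IsAvailable(proxy)) // false
}
```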

type RequestConfigurator added in v0.4.0

type RequestConfigurator interface {
	Configure(collector *colly.Collector)
}

RequestConfigurator sets up headers and cookies on the collector.

func NewRequestConfigurator added in v0.4.0

func NewRequestConfigurator(cfg Config, logger Logger) RequestConfigurator

NewRequestConfigurator creates a configurator from the crawler config.

type RequestHook

type RequestHook interface {
	BeforeRequest(ctx context.Context, target Target) error
}

RequestHook runs before each outbound request. Optional.

type ResponseHandler added in v0.4.0

type ResponseHandler interface {
	Setup(collector *colly.Collector)
	SendResult(resp *colly.Response, success bool, errorMessage string)
	SetSlotReleaser(releaser func(*colly.Response))
}

ResponseHandler processes HTTP responses and emits results. Implementations are injected into the Service at construction time. The default handler parses HTML, runs an Evaluator, and sends Results. Custom handlers can implement platform-specific logic (image download, discoverability probing, etc.).

type Result

type Result struct {
	TargetID       string            `json:"targetId"`
	TargetURL      string            `json:"targetUrl"`
	FinalURL       string            `json:"finalUrl,omitempty"`
	CanonicalURL   string            `json:"canonicalUrl,omitempty"`
	Category       string            `json:"category"`
	Title          string            `json:"title,omitempty"`
	Success        bool              `json:"success"`
	ErrorMessage   string            `json:"errorMessage,omitempty"`
	HTTPStatusCode int               `json:"httpStatusCode,omitempty"`
	Findings       []Finding         `json:"findings,omitempty"`
	Document       *goquery.Document `json:"-"`
	ProxyURL       string            `json:"proxyUrl,omitempty"`
}

Result represents the outcome of crawling a single target.

type RetryDecision added in v0.4.0

type RetryDecision struct {
	ShouldRetry        bool
	Message            string
	LogMessage         string
	Policy             RetryPolicy
	ExhaustionBehavior RetryExhaustionBehavior
}

RetryDecision captures the outcome of a platform retry check.

func (RetryDecision) ResolvedLogMessage added in v0.4.0

func (d RetryDecision) ResolvedLogMessage() string

ResolvedLogMessage returns the log message or falls back to the message.
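
The fallback described above can be sketched as follows. The struct is a trimmed local copy of RetryDecision (the Policy and ExhaustionBehavior fields are left out), and treating an empty LogMessage as "not set" is an assumption of this example.

```go
package main

import "fmt"

// RetryDecision is a trimmed local copy used only for this sketch.
type RetryDecision struct {
	ShouldRetry bool
	Message     string
	LogMessage  string
}

// ResolvedLogMessage prefers LogMessage and falls back to Message.
// Assumption: an empty LogMessage means it was not set.
func (d RetryDecision) ResolvedLogMessage() string {
	if d.LogMessage != "" {
		return d.LogMessage
	}
	return d.Message
}

func main() {
	brief := RetryDecision{ShouldRetry: true, Message: "captcha page"}
	detailed := RetryDecision{ShouldRetry: true, Message: "captcha page", LogMessage: "captcha detected, rotating proxy"}
	fmt.Println(brief.ResolvedLogMessage())    // captcha page
	fmt.Println(detailed.ResolvedLogMessage()) // captcha detected, rotating proxy
}
```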

type RetryExhaustionBehavior added in v0.4.0

type RetryExhaustionBehavior uint8

RetryExhaustionBehavior controls what happens when retries are exhausted.

const (
	RetryExhaustionBehaviorFail RetryExhaustionBehavior = iota
	RetryExhaustionBehaviorContinue
)

type RetryHandler added in v0.4.0

type RetryHandler interface {
	Retry(response *colly.Response, options RetryOptions) bool
}

RetryHandler encapsulates retry behaviour for failed responses.

func NewRetryHandler added in v0.4.0

func NewRetryHandler(scraper ScraperConfig, logger Logger) RetryHandler

NewRetryHandler constructs a retry handler from scraper config.
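
The overview mentions retries with exponential backoff. How the package computes its delays is not documented here, so the following is only a generic sketch of the technique: the delay doubles with each attempt and is capped at a limit. The function backoffDelay and the 500ms base and 4s cap are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay is a hypothetical helper: it returns base * 2^attempt,
// capped at limit. Attempt 0 is the first retry.
func backoffDelay(attempt int, base, limit time.Duration) time.Duration {
	delay := base
	// Stop doubling once the cap is reached to avoid needless growth.
	for i := 0; i < attempt && delay < limit; i++ {
		delay *= 2
	}
	if delay > limit {
		delay = limit
	}
	return delay
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, 500*time.Millisecond, 4*time.Second))
	}
	// Prints delays of 500ms, 1s, 2s, 4s, 4s for attempts 0 through 4.
}
```

Real retry loops often add random jitter to the delay so that many clients do not retry in lockstep; that is left out to keep the sketch short.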

type RetryOptions added in v0.4.0

type RetryOptions struct {
	SkipDelay    bool
	LimitRetries bool
	MaxRetries   int
}

RetryOptions controls retry behaviour per-request.

type RetryPolicy added in v0.4.0

type RetryPolicy uint8

RetryPolicy controls how retries are performed.

const (
	RetryPolicyDefault RetryPolicy = iota
	RetryPolicyRotateProxy
)

type ScraperConfig added in v0.4.0

type ScraperConfig struct {
	MaxDepth                   int
	Parallelism                int
	RetryCount                 int
	HTTPTimeout                time.Duration
	InsecureSkipVerify         bool
	RateLimit                  time.Duration
	ProxyList                  []string
	ProxyCircuitBreakerEnabled bool
	SaveFiles                  bool
}

ScraperConfig controls concurrency, retries, and network behaviour.

func (ScraperConfig) Validate added in v0.4.0

func (cfg ScraperConfig) Validate() error

Validate checks that essential numeric fields are positive.

type Service

type Service struct {
	// contains filtered or unexported fields
}

Service orchestrates crawling of targets and emits results.

func NewService

func NewService(cfg Config, results chan<- *Result) (*Service, error)

NewService constructs a crawler service. When cfg.ResponseHandler is nil, a DefaultResponseHandler is created using cfg.Evaluator and the results channel. When cfg.ResponseHandler is set, results can be nil.

func (*Service) RetryHandler added in v0.4.0

func (svc *Service) RetryHandler() RetryHandler

RetryHandler returns the service's retry handler for use by custom ResponseHandlers.

func (*Service) Run

func (svc *Service) Run(ctx context.Context, targets []Target) error

Run visits each target and blocks until all are processed or ctx is cancelled.

type Target added in v0.4.0

type Target struct {
	ID       string
	Category string
	URL      string
	Metadata map[string]string
}

Target describes a single URL to crawl.

func NewTarget added in v0.4.0

func NewTarget(id, category, url string, opts ...TargetOption) (Target, error)

NewTarget constructs a Target after validating mandatory fields.

func (Target) MetadataValue added in v0.4.0

func (t Target) MetadataValue(key string) string

MetadataValue returns a metadata value by key, or empty string if absent.

type TargetOption added in v0.4.0

type TargetOption func(*Target)

TargetOption mutates optional fields on Target construction.

func WithMetadata added in v0.4.0

func WithMetadata(key, value string) TargetOption

WithMetadata sets an extensible key-value pair on the target.
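
Target, TargetOption, NewTarget, WithMetadata and MetadataValue together follow the functional-options pattern. The sketch below re-creates that pattern with lowercase, hypothetical names so that it is not mistaken for the package's own code. Which fields the real NewTarget treats as mandatory is not stated above; requiring id and url here is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// target and targetOption are local stand-ins for Target and TargetOption.
type target struct {
	ID       string
	Category string
	URL      string
	Metadata map[string]string
}

type targetOption func(*target)

// withMetadata returns an option that sets one key-value pair,
// allocating the map on first use.
func withMetadata(key, value string) targetOption {
	return func(t *target) {
		if t.Metadata == nil {
			t.Metadata = make(map[string]string)
		}
		t.Metadata[key] = value
	}
}

// newTarget validates the fields this sketch assumes are mandatory,
// then applies the options in order.
func newTarget(id, category, rawURL string, opts ...targetOption) (target, error) {
	if id == "" || rawURL == "" {
		return target{}, errors.New("id and url are required")
	}
	t := target{ID: id, Category: category, URL: rawURL}
	for _, opt := range opts {
		opt(&t)
	}
	return t, nil
}

// metadataValue returns "" for a missing key; reading a nil map is safe in Go.
func (t target) metadataValue(key string) string {
	return t.Metadata[key]
}

func main() {
	t, err := newTarget("p1", "camps", "https://example.com/p1",
		withMetadata("region", "eu"))
	if err != nil {
		panic(err)
	}
	fmt.Println(t.metadataValue("region"))       // eu
	fmt.Println(t.metadataValue("missing") == "") // true
}
```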
