crawler

package
v0.6.0 Latest
Published: Mar 24, 2026 License: MIT Imports: 20 Imported by: 0

Documentation

Overview

Package crawler provides a reusable crawling service that fetches web pages, applies configurable rules, and emits normalized results. It supports proxy rotation, retry with backoff, rate limiting, platform-specific hooks, and extensible response handling through the ResponseHandler interface.

Constants

This section is empty.

Variables

This section is empty.

Functions

func SetPackageLogger added in v0.5.0

func SetPackageLogger(logger Logger)

SetPackageLogger replaces the package-level logger used by standalone functions.

Types

type Config

type Config struct {
	// PlatformID identifies the target platform (for example "AMZN").
	PlatformID string

	// Scraper controls concurrency, retries, and network behaviour.
	Scraper ScraperConfig

	// Platform holds domain-specific settings such as allowed hosts.
	Platform PlatformConfig

	// OutputDirectory is optional; when supplied and FilePersister is nil the
	// crawler will persist downloaded artifacts under this path.
	OutputDirectory string

	// RunFolder scopes persisted artifacts for a single execution.
	RunFolder string

	// RuleEvaluator produces rule findings for a fetched document. Mandatory.
	RuleEvaluator RuleEvaluator

	// CookieGenerator returns cookies for a given domain. Optional.
	CookieGenerator CookieGenerator

	// FilePersister handles file persistence. Optional; a default implementation
	// is created when OutputDirectory is set.
	FilePersister FilePersister

	// PlatformHooks customise platform-specific behaviour. Optional.
	PlatformHooks PlatformHooks

	// RequestHeaders applies custom headers before each outbound request.
	RequestHeaders RequestHeaderProvider

	// RequestHook runs before each outbound request. Optional.
	RequestHook RequestHook

	// Logger receives debug/info/warning/error logs. Optional; a no-op logger is
	// used when nil.
	Logger Logger
}

Config wires the crawler service with platform metadata, scraping options, and effectful collaborators. All fields are mandatory unless marked as optional.

func (Config) Validate

func (cfg Config) Validate() error

Validate ensures required configuration is present and self-consistent.

type CookieGenerator added in v0.5.0

type CookieGenerator func(domain string) []*http.Cookie

CookieGenerator returns cookies for a specific domain.
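
Because CookieGenerator is a plain function type, a closure over static data is enough to satisfy it. The sketch below redeclares the type locally so it runs without the package; staticCookies and the example domain are hypothetical names introduced only for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// CookieGenerator mirrors the function type documented above.
type CookieGenerator func(domain string) []*http.Cookie

// staticCookies is a hypothetical helper that serves a fixed name/value set
// for each known domain and an empty slice for every other domain.
func staticCookies(byDomain map[string]map[string]string) CookieGenerator {
	return func(domain string) []*http.Cookie {
		values := byDomain[domain]
		cookies := make([]*http.Cookie, 0, len(values))
		for name, value := range values {
			cookies = append(cookies, &http.Cookie{
				Name:   name,
				Value:  value,
				Domain: domain,
				Path:   "/",
			})
		}
		return cookies
	}
}

func main() {
	generator := staticCookies(map[string]map[string]string{
		"example.com": {"session": "abc123"},
	})
	for _, cookie := range generator("example.com") {
		fmt.Println(cookie.Name, cookie.Value, cookie.Domain)
	}
}
```

A value built this way could be assigned to Config.CookieGenerator; how the crawler attaches the returned cookies to requests is not described here.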

type FilePersister added in v0.4.0

type FilePersister interface {
	Save(productID, fileName string, content []byte) error
	Close() error
}

FilePersister persists binary artifacts associated with a product.

type Logger

type Logger interface {
	Debug(format string, args ...interface{})
	Info(format string, args ...interface{})
	Warning(format string, args ...interface{})
	Error(format string, args ...interface{})
}

Logger emits structured diagnostic messages. Implementations should be safe for concurrent use. Methods follow fmt.Sprintf semantics.

func EnsureLogger added in v0.5.1

func EnsureLogger(logger Logger) Logger

EnsureLogger returns the provided logger if non-nil, otherwise a no-op logger.
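
The contract described above is the usual nil-fallback pattern. The following sketch shows one plausible shape for it with local stand-ins (ensureLogger, noopLogger); the package's real implementation may differ.

```go
package main

import "fmt"

// Logger mirrors the interface documented above.
type Logger interface {
	Debug(format string, args ...interface{})
	Info(format string, args ...interface{})
	Warning(format string, args ...interface{})
	Error(format string, args ...interface{})
}

// noopLogger is a hypothetical logger that discards every message.
type noopLogger struct{}

func (noopLogger) Debug(string, ...interface{})   {}
func (noopLogger) Info(string, ...interface{})    {}
func (noopLogger) Warning(string, ...interface{}) {}
func (noopLogger) Error(string, ...interface{})   {}

// ensureLogger is a local stand-in for the documented behaviour: return the
// given logger when it is non-nil, otherwise a logger that does nothing.
func ensureLogger(logger Logger) Logger {
	if logger == nil {
		return noopLogger{}
	}
	return logger
}

func main() {
	logger := ensureLogger(nil)
	// Safe to call even though no logger was configured.
	logger.Error("dropped: %v", "no sink configured")
	fmt.Printf("%T\n", logger)
}
```

With this pattern, call sites can log unconditionally instead of guarding each call with a nil check.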

type NoopResponseHandler added in v0.5.0

type NoopResponseHandler struct{}

NoopResponseHandler provides default no-op implementations of ResponseHandler.

func (NoopResponseHandler) AfterEvaluation added in v0.5.0

func (NoopResponseHandler) AfterEvaluation(*colly.Response, *goquery.Document, *Result)

AfterEvaluation does nothing.

func (NoopResponseHandler) BeforeEvaluation added in v0.5.0

func (NoopResponseHandler) BeforeEvaluation(*colly.Response, *goquery.Document)

BeforeEvaluation does nothing.

func (NoopResponseHandler) HandleBinaryResponse added in v0.5.0

func (NoopResponseHandler) HandleBinaryResponse(*colly.Response, string, string) bool

HandleBinaryResponse returns false, indicating the response was not handled.

type PlatformConfig added in v0.4.0

type PlatformConfig struct {
	AllowedDomains      []string
	CookieDomains       []string
	SkipRulesOnRedirect bool
}

PlatformConfig restricts the crawler to known domains. Its fields name the allowed hosts, the cookie domains, and whether rules are skipped on redirect.

func (PlatformConfig) Validate added in v0.4.0

func (cfg PlatformConfig) Validate() error

Validate ensures the platform configuration is usable.
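
Callers may want to filter product URLs against AllowedDomains before handing them to the crawler. The sketch below is a hypothetical pre-flight check, assuming exact, case-insensitive host matching; hostAllowed is not part of the package, and the crawler's own matching rules are not documented here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// PlatformConfig mirrors the struct documented above.
type PlatformConfig struct {
	AllowedDomains      []string
	CookieDomains       []string
	SkipRulesOnRedirect bool
}

// hostAllowed reports whether rawURL's host exactly matches one of the
// configured domains, ignoring case. Unparseable URLs are rejected.
func hostAllowed(cfg PlatformConfig, rawURL string) bool {
	parsed, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := parsed.Hostname()
	if host == "" {
		return false
	}
	for _, domain := range cfg.AllowedDomains {
		if strings.EqualFold(host, domain) {
			return true
		}
	}
	return false
}

func main() {
	cfg := PlatformConfig{AllowedDomains: []string{"www.example.com"}}
	fmt.Println(hostAllowed(cfg, "https://www.example.com/dp/B01"))
	fmt.Println(hostAllowed(cfg, "https://other.test/dp/B01"))
}
```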

type PlatformHooks added in v0.4.0

type PlatformHooks interface {
	NormalizeTitle(title string) string
	ShouldRetry(title string, document *goquery.Document) RetryDecision
	ExtractDOMTitle(document *goquery.Document) string
	IsContentComplete(document *goquery.Document) bool
	InferRedirect(productID, originalURL, finalURL, canonicalURL string) (redirected bool, redirectedProductID string)
}

PlatformHooks provide platform-specific normalisation, content validation, redirect detection, and retry logic. Implementations encapsulate all platform-specific behaviour so the core crawler remains generic.
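
To make the InferRedirect contract concrete, here is a standalone sketch of the kind of logic an implementation might contain. It assumes a hypothetical URL layout of /dp/<ID> and prefers the canonical URL over the final URL; both choices are illustrative, and a real platform will need its own rules.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// productPath matches the hypothetical /dp/<ID> layout assumed by this sketch.
var productPath = regexp.MustCompile(`/dp/([A-Za-z0-9]+)`)

// extractID returns the product ID embedded in rawURL, or "" if none is found.
func extractID(rawURL string) string {
	match := productPath.FindStringSubmatch(rawURL)
	if match == nil {
		return ""
	}
	return match[1]
}

// inferRedirect follows the InferRedirect signature. The first of canonicalURL
// and finalURL that carries an ID decides the outcome: a differing ID means
// the request was redirected to another product.
func inferRedirect(productID, originalURL, finalURL, canonicalURL string) (bool, string) {
	for _, candidate := range []string{canonicalURL, finalURL} {
		id := extractID(candidate)
		if id == "" {
			continue
		}
		if strings.EqualFold(id, productID) {
			return false, ""
		}
		return true, id
	}
	return false, ""
}

func main() {
	redirected, target := inferRedirect(
		"B01",
		"https://example.com/dp/B01",
		"https://example.com/dp/B02?ref=abc",
		"",
	)
	fmt.Println(redirected, target)
}
```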

type Product added in v0.5.0

type Product struct {
	ID          string
	Platform    string
	URL         string
	OriginalID  string
	OriginalURL string
}

Product describes a single page to crawl.

func NewProduct added in v0.5.0

func NewProduct(id, platform, url string, opts ...ProductOption) (Product, error)

NewProduct constructs a Product after validating mandatory fields.

type ProductOption added in v0.5.0

type ProductOption func(*Product)

ProductOption mutates optional fields on Product construction.

func WithOriginalID added in v0.5.0

func WithOriginalID(originalID string) ProductOption

WithOriginalID sets the original identifier when different from ID.

func WithOriginalURL added in v0.5.0

func WithOriginalURL(originalURL string) ProductOption

WithOriginalURL records the source URL before redirects.
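
NewProduct and its options follow the functional-options pattern. The sketch below reimplements the documented signatures locally so the pattern can be run in isolation; the validation rule (all three mandatory arguments must be non-empty) is an assumption, since the exact checks are not listed above.

```go
package main

import (
	"errors"
	"fmt"
)

// Product mirrors the struct documented above.
type Product struct {
	ID          string
	Platform    string
	URL         string
	OriginalID  string
	OriginalURL string
}

// ProductOption mutates optional fields during construction.
type ProductOption func(*Product)

// WithOriginalID is a local stand-in for the documented option.
func WithOriginalID(originalID string) ProductOption {
	return func(p *Product) { p.OriginalID = originalID }
}

// WithOriginalURL is a local stand-in for the documented option.
func WithOriginalURL(originalURL string) ProductOption {
	return func(p *Product) { p.OriginalURL = originalURL }
}

// NewProduct is a local stand-in; the non-empty check is an assumed rule.
func NewProduct(id, platform, url string, opts ...ProductOption) (Product, error) {
	if id == "" || platform == "" || url == "" {
		return Product{}, errors.New("id, platform, and url are required")
	}
	product := Product{ID: id, Platform: platform, URL: url}
	for _, opt := range opts {
		opt(&product)
	}
	return product, nil
}

func main() {
	product, err := NewProduct(
		"B02", "AMZN", "https://example.com/dp/B02",
		WithOriginalID("B01"),
		WithOriginalURL("https://example.com/dp/B01"),
	)
	fmt.Printf("%+v %v\n", product, err)
}
```

Options keep the three-argument constructor stable while allowing further optional fields to be added later without breaking callers.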

type RequestConfigurator added in v0.4.0

type RequestConfigurator interface {
	Configure(collector *colly.Collector)
}

RequestConfigurator applies cookies and headers to outgoing requests.

type RequestHeaderProvider added in v0.5.0

type RequestHeaderProvider interface {
	Apply(platformID string, request *colly.Request)
}

RequestHeaderProvider decorates outbound collector requests.

type RequestHook

type RequestHook interface {
	BeforeRequest(ctx context.Context, product Product) error
}

RequestHook runs before each outbound request.

type ResponseHandler added in v0.4.0

type ResponseHandler interface {
	// HandleBinaryResponse processes non-HTML responses (e.g. images).
	// Return true to indicate the response was handled and stop further processing.
	HandleBinaryResponse(resp *colly.Response, productID string, fileExtension string) bool

	// BeforeEvaluation is called after HTML parsing and content validation but
	// before rule evaluation. Use for tasks like image retrieval.
	BeforeEvaluation(resp *colly.Response, document *goquery.Document)

	// AfterEvaluation is called once the processor has enough context to build
	// the final result, before that result is emitted. Use for tasks like
	// discoverability probing or file persistence.
	AfterEvaluation(resp *colly.Response, document *goquery.Document, result *Result)
}

ResponseHandler extends the crawling pipeline with domain-specific behaviour. Implementations are called at specific points during response processing.

type ResponseHandlerRuntimeBinder added in v0.5.2

type ResponseHandlerRuntimeBinder interface {
	BindRuntime(collector *colly.Collector, filePersister FilePersister, retryHandler RetryHandler)
}

ResponseHandlerRuntimeBinder fills runtime-managed dependencies on handlers after the crawler service has created them.

type ResponseProcessor added in v0.5.0

type ResponseProcessor interface {
	Setup(collector *colly.Collector)
	SendFinalResult(resp *colly.Response, success bool, errorText string)
	SetResultCallback(callback func(*colly.Response))
	SetResponseHandlers(handlers []ResponseHandler)
}

ResponseProcessor handles incoming responses and emits final results.

type Result

type Result struct {
	ProductID               string       `json:"product_id" csv:"ID"`
	OriginalProductID       string       `json:"original_product_id,omitempty" csv:"OriginalID"`
	OriginalURL             string       `json:"original_url,omitempty" csv:""`
	FinalURL                string       `json:"final_url,omitempty" csv:""`
	CanonicalURL            string       `json:"canonical_url,omitempty" csv:""`
	ProxyURL                string       `json:"proxy_url,omitempty" csv:"ProxyURL"`
	ProductURL              string       `json:"product_url" csv:"URL"`
	ProductTitle            string       `json:"product_title,omitempty" csv:"Title"`
	ProductPlatform         string       `json:"product_platform"`
	Success                 bool         `json:"success"`
	ErrorMessage            string       `json:"error_message,omitempty" csv:"ErrorMessage"`
	HTTPStatusCode          int          `json:"http_status_code,omitempty" csv:"HTTPStatusCode"`
	Progress                int          `json:"progress,omitempty"`
	RuleResults             []RuleResult `json:"results,omitempty"`
	ConfiguredVerifierCount int          `json:"-" csv:"-"`
	ScoreOverride           *int         `json:"-" csv:"-"`
}

Result represents the normalized outcome of crawling a single product page.

func (Result) CalculateScore added in v0.5.0

func (result Result) CalculateScore(configuredVerifierCount int) int

CalculateScore returns the percentage of configured verifiers that passed.
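
As a rough illustration of a "percentage of configured verifiers that passed", the sketch below counts passed VerificationResult entries across all rules and divides by the configured count. Several details are assumptions rather than documented behaviour: that ScoreOverride takes precedence, that a non-positive count yields 0, and that integer division is used.

```go
package main

import "fmt"

// Trimmed copies of the documented types, keeping only the fields used here.
type VerificationResult struct {
	Passed bool
}

type RuleResult struct {
	VerificationResults []VerificationResult
}

type Result struct {
	RuleResults   []RuleResult
	ScoreOverride *int
}

// calculateScore is a hypothetical stand-in for Result.CalculateScore.
func calculateScore(result Result, configuredVerifierCount int) int {
	if result.ScoreOverride != nil {
		return *result.ScoreOverride // assumed precedence
	}
	if configuredVerifierCount <= 0 {
		return 0 // assumed guard against division by zero
	}
	passed := 0
	for _, rule := range result.RuleResults {
		for _, verification := range rule.VerificationResults {
			if verification.Passed {
				passed++
			}
		}
	}
	return passed * 100 / configuredVerifierCount
}

func main() {
	result := Result{RuleResults: []RuleResult{
		{VerificationResults: []VerificationResult{{Passed: true}, {Passed: true}}},
		{VerificationResults: []VerificationResult{{Passed: true}, {Passed: false}}},
	}}
	fmt.Println(calculateScore(result, 4))
}
```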

func (Result) IsNotFound added in v0.5.0

func (result Result) IsNotFound() bool

IsNotFound reports whether the HTTP status code represents a missing page.

func (Result) IsNotRetryable added in v0.5.0

func (result Result) IsNotRetryable() bool

IsNotRetryable reports whether retrying would be pointless.

type RetryDecision added in v0.4.0

type RetryDecision struct {
	ShouldRetry        bool
	Message            string
	LogMessage         string
	Policy             RetryPolicy
	ExhaustionBehavior RetryExhaustionBehavior
}

RetryDecision captures the outcome of a platform retry check.

func (RetryDecision) ResolvedLogMessage added in v0.4.0

func (decision RetryDecision) ResolvedLogMessage() string

ResolvedLogMessage returns the log message or falls back to the general message.
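
The fallback described above can be sketched in a few lines. This version uses a trimmed local RetryDecision and assumes that an empty LogMessage is what triggers the fallback to Message.

```go
package main

import "fmt"

// RetryDecision is a trimmed copy of the struct documented above.
type RetryDecision struct {
	ShouldRetry bool
	Message     string
	LogMessage  string
}

// resolvedLogMessage is a local stand-in for RetryDecision.ResolvedLogMessage.
func resolvedLogMessage(decision RetryDecision) string {
	if decision.LogMessage != "" {
		return decision.LogMessage
	}
	return decision.Message
}

func main() {
	terse := RetryDecision{ShouldRetry: true, Message: "captcha page"}
	detailed := RetryDecision{
		ShouldRetry: true,
		Message:     "captcha page",
		LogMessage:  "captcha detected in title, rotating proxy",
	}
	fmt.Println(resolvedLogMessage(terse))
	fmt.Println(resolvedLogMessage(detailed))
}
```

Keeping the two messages separate lets a hook return a short user-facing reason while logging a more detailed one.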

type RetryExhaustionBehavior added in v0.4.0

type RetryExhaustionBehavior uint8

RetryExhaustionBehavior controls what happens when retries are exhausted.

const (
	RetryExhaustionBehaviorFail RetryExhaustionBehavior = iota
	RetryExhaustionBehaviorContinue
)

type RetryHandler added in v0.4.0

type RetryHandler interface {
	Retry(response *colly.Response, options RetryOptions) bool
}

RetryHandler encapsulates retry behaviour for failed responses.

type RetryOptions added in v0.4.0

type RetryOptions struct {
	SkipDelay    bool
	LimitRetries bool
	MaxRetries   int
}

type RetryPolicy added in v0.4.0

type RetryPolicy uint8

RetryPolicy controls how retries are performed.

const (
	RetryPolicyDefault RetryPolicy = iota
	RetryPolicyRotateProxy
)

type RuleEvaluation added in v0.5.0

type RuleEvaluation struct {
	Passed             bool
	ConfiguredVerifier int
	RuleResults        []RuleResult
}

RuleEvaluation aggregates evaluation output from the injected RuleEvaluator.

type RuleEvaluator added in v0.5.0

type RuleEvaluator interface {
	Evaluate(productID string, document *goquery.Document) (RuleEvaluation, error)
	ConfiguredVerifierCount() int
}

RuleEvaluator produces a RuleEvaluation for a fetched document.

type RuleResult added in v0.5.0

type RuleResult struct {
	ID                  string               `json:"id,omitempty" csv:"-"`
	Description         string               `json:"description" csv:"Description,keyValue"`
	Passed              bool                 `json:"passed" csv:"passed"`
	ReportingOrder      int                  `json:"reporting_order" csv:"-"`
	Message             string               `json:"message" csv:"message"`
	VerificationResults []VerificationResult `json:"verification_results"`
}

RuleResult represents rule-level evaluation outcome.

type ScraperConfig added in v0.4.0

type ScraperConfig struct {
	MaxDepth                   int
	Parallelism                int
	RetryCount                 int
	HTTPTimeout                time.Duration
	InsecureSkipVerify         bool
	RateLimit                  time.Duration
	ProxyList                  []string
	SaveFiles                  bool
	ProxyCircuitBreakerEnabled bool
}

ScraperConfig exposes concurrency and retry knobs for the crawler.

func (ScraperConfig) Validate added in v0.4.0

func (cfg ScraperConfig) Validate() error

Validate checks that essential numeric fields are positive.

type Service

type Service struct {
	// contains filtered or unexported fields
}

Service orchestrates crawling of product pages and emits results.

func NewService

func NewService(cfg Config, results chan<- *Result, options ...ServiceOption) (*Service, error)

NewService constructs a crawler service configured for a platform. ServiceOption values customize the service with response handlers and lifecycle hooks.

func (*Service) Run

func (service *Service) Run(ctx context.Context, products []Product) error

Run visits each product URL once and blocks until completion or context cancellation.

type ServiceHook added in v0.5.0

type ServiceHook interface {
	// AfterInit is called after the collector, transport, and response processor
	// are fully wired. Use for binding domain-specific network configuration.
	AfterInit(collector *colly.Collector, transport http.RoundTripper)

	// BeforeRun is called before the product visit loop starts.
	BeforeRun(ctx context.Context)

	// AfterRun is called after all products have been visited and the collector
	// has finished. Use for cleanup (e.g. stopping image converter workers).
	AfterRun()
}

ServiceHook provides lifecycle callbacks for the crawler service.

type ServiceOption added in v0.5.0

type ServiceOption func(*Service)

ServiceOption configures a Service during construction.

func WithResponseHandlers added in v0.5.0

func WithResponseHandlers(handlers ...ResponseHandler) ServiceOption

WithResponseHandlers registers ResponseHandlers that extend the crawling pipeline.

func WithServiceHook added in v0.5.0

func WithServiceHook(hook ServiceHook) ServiceOption

WithServiceHook registers a lifecycle hook for the crawler service.

type VerificationResult added in v0.5.0

type VerificationResult struct {
	ID             string `json:"id,omitempty" csv:"-"`
	Description    string `json:"description" csv:"Description,keyValue"`
	Passed         bool   `json:"passed" csv:"passed"`
	Message        string `json:"message" csv:"message"`
	Value          string `json:"value" csv:"value"`
	ReportingOrder int    `json:"reporting_order"`
	IncludeValue   bool   `json:"-"`
}

VerificationResult captures the outcome of an individual verifier.
