Documentation ¶
Overview ¶
Package crawler provides a reusable crawling service that fetches web pages, applies configurable rules, and emits normalized results. It supports proxy rotation, retry with backoff, rate limiting, platform-specific hooks, and extensible response handling through the ResponseHandler interface.
Index ¶
- func SetPackageLogger(logger Logger)
- type Config
- type CookieGenerator
- type FilePersister
- type Logger
- type NoopResponseHandler
- type PlatformConfig
- type PlatformHooks
- type Product
- type ProductOption
- type RequestConfigurator
- type RequestHeaderProvider
- type RequestHook
- type ResponseHandler
- type ResponseHandlerRuntimeBinder
- type ResponseProcessor
- type Result
- type RetryDecision
- type RetryExhaustionBehavior
- type RetryHandler
- type RetryOptions
- type RetryPolicy
- type RuleEvaluation
- type RuleEvaluator
- type RuleResult
- type ScraperConfig
- type Service
- type ServiceHook
- type ServiceOption
- type VerificationResult
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func SetPackageLogger ¶ added in v0.5.0
func SetPackageLogger(logger Logger)
SetPackageLogger replaces the package-level logger used by standalone functions.
Types ¶
type Config ¶
type Config struct {
// PlatformID identifies the target platform (for example "AMZN").
PlatformID string
// Scraper controls concurrency, retries, and network behaviour.
Scraper ScraperConfig
// Platform holds domain-specific settings such as allowed hosts.
Platform PlatformConfig
// OutputDirectory is optional; when supplied and FilePersister is nil the
// crawler will persist downloaded artifacts under this path.
OutputDirectory string
// RunFolder scopes persisted artifacts for a single execution.
RunFolder string
// RuleEvaluator produces rule findings for a fetched document. Mandatory.
RuleEvaluator RuleEvaluator
// CookieGenerator returns cookies for a given domain. Optional.
CookieGenerator CookieGenerator
// FilePersister handles file persistence. Optional; a default implementation
// is created when OutputDirectory is set.
FilePersister FilePersister
// PlatformHooks customise platform-specific behaviour. Optional.
PlatformHooks PlatformHooks
// RequestHeaders applies custom headers before each outbound request.
RequestHeaders RequestHeaderProvider
// RequestHook runs before each outbound request. Optional.
RequestHook RequestHook
// Logger receives debug/info/warning/error logs. Optional; a no-op logger is
// used when nil.
Logger Logger
}
Config wires the crawler service with platform metadata, scraping options, and effectful collaborators. All fields are mandatory unless marked as optional.
type CookieGenerator ¶ added in v0.5.0
CookieGenerator returns cookies for a specific domain.
type FilePersister ¶ added in v0.4.0
type FilePersister interface {
Save(productID, fileName string, content []byte) error
Close() error
}
FilePersister persists binary artifacts associated with a product.
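As a rough illustration of the FilePersister contract, the sketch below keeps artifacts in memory, which can be convenient in tests. The memoryPersister type, its constructor, the "productID/fileName" key layout, and the rule that Save fails after Close are all assumptions made for this example; none of them come from the package itself.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// FilePersister mirrors the interface documented above.
type FilePersister interface {
	Save(productID, fileName string, content []byte) error
	Close() error
}

// memoryPersister is a hypothetical in-memory implementation.
type memoryPersister struct {
	mu     sync.Mutex
	closed bool
	files  map[string][]byte // keyed by "productID/fileName" (assumed layout)
}

func newMemoryPersister() *memoryPersister {
	return &memoryPersister{files: make(map[string][]byte)}
}

// Save stores a copy of content so later mutation by the caller cannot
// change what was persisted.
func (p *memoryPersister) Save(productID, fileName string, content []byte) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.closed {
		return errors.New("persister is closed")
	}
	p.files[productID+"/"+fileName] = append([]byte(nil), content...)
	return nil
}

// Close marks the persister as closed; subsequent Save calls fail.
func (p *memoryPersister) Close() error {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.closed = true
	return nil
}

func main() {
	var persister FilePersister = newMemoryPersister()
	if err := persister.Save("B000123", "page.html", []byte("<html></html>")); err != nil {
		fmt.Println("save failed:", err)
	}
	_ = persister.Close()
	fmt.Println(persister.Save("B000123", "late.html", nil)) // persister is closed
}
```

The mutex is there because the crawler may run requests in parallel, so a persister would plausibly be called from several goroutines at once.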
type Logger ¶
type Logger interface {
Debug(format string, args ...interface{})
Info(format string, args ...interface{})
Warning(format string, args ...interface{})
Error(format string, args ...interface{})
}
Logger emits leveled diagnostic messages (debug, info, warning, error). Implementations should be safe for concurrent use. Each method formats its arguments with fmt.Sprintf semantics.
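One possible way to satisfy Logger is a thin adapter over the standard library's log package, sketched below. The stdLogger type, the formatLine helper, and the "LEVEL message" line layout are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// Logger mirrors the interface documented above.
type Logger interface {
	Debug(format string, args ...interface{})
	Info(format string, args ...interface{})
	Warning(format string, args ...interface{})
	Error(format string, args ...interface{})
}

// formatLine prefixes the severity and expands the format string the same
// way fmt.Sprintf does. The layout is an assumption of this sketch.
func formatLine(level, format string, args ...interface{}) string {
	return level + " " + fmt.Sprintf(format, args...)
}

// stdLogger is a hypothetical adapter. A *log.Logger serializes its writes,
// so the adapter can be shared between goroutines.
type stdLogger struct {
	out *log.Logger
}

func (l stdLogger) Debug(format string, args ...interface{}) {
	l.out.Print(formatLine("DEBUG", format, args...))
}

func (l stdLogger) Info(format string, args ...interface{}) {
	l.out.Print(formatLine("INFO", format, args...))
}

func (l stdLogger) Warning(format string, args ...interface{}) {
	l.out.Print(formatLine("WARN", format, args...))
}

func (l stdLogger) Error(format string, args ...interface{}) {
	l.out.Print(formatLine("ERROR", format, args...))
}

func main() {
	var logger Logger = stdLogger{out: log.New(os.Stderr, "crawler: ", log.LstdFlags)}
	logger.Info("fetched %d pages from %s", 3, "example.com")
	logger.Warning("retrying %q", "https://example.com/p/1")
}
```

A value like this could presumably be assigned to Config.Logger or passed to SetPackageLogger.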
func EnsureLogger ¶ added in v0.5.1
EnsureLogger returns the provided logger if non-nil, otherwise a no-op logger.
type NoopResponseHandler ¶ added in v0.5.0
type NoopResponseHandler struct{}
NoopResponseHandler provides default no-op implementations of ResponseHandler.
func (NoopResponseHandler) AfterEvaluation ¶ added in v0.5.0
AfterEvaluation does nothing.
func (NoopResponseHandler) BeforeEvaluation ¶ added in v0.5.0
func (NoopResponseHandler) BeforeEvaluation(*colly.Response, *goquery.Document)
BeforeEvaluation does nothing.
func (NoopResponseHandler) HandleBinaryResponse ¶ added in v0.5.0
HandleBinaryResponse returns false, indicating the response was not handled.
type PlatformConfig ¶ added in v0.4.0
type PlatformConfig struct {
AllowedDomains []string
CookieDomains []string
SkipRulesOnRedirect bool
}
PlatformConfig restricts the crawler to known domains and carries related platform settings: the domains that receive cookies and whether rules are skipped on redirect.
func (PlatformConfig) Validate ¶ added in v0.4.0
func (cfg PlatformConfig) Validate() error
Validate ensures the platform configuration is usable.
type PlatformHooks ¶ added in v0.4.0
type PlatformHooks interface {
NormalizeTitle(title string) string
ShouldRetry(title string, document *goquery.Document) RetryDecision
ExtractDOMTitle(document *goquery.Document) string
IsContentComplete(document *goquery.Document) bool
InferRedirect(productID, originalURL, finalURL, canonicalURL string) (redirected bool, redirectedProductID string)
}
PlatformHooks provide platform-specific normalisation, content validation, redirect detection, and retry logic. Implementations encapsulate all platform-specific behaviour so the core crawler remains generic.
type Product ¶ added in v0.5.0
Product describes a single page to crawl.
func NewProduct ¶ added in v0.5.0
func NewProduct(id, platform, url string, opts ...ProductOption) (Product, error)
NewProduct constructs a Product after validating mandatory fields.
type ProductOption ¶ added in v0.5.0
type ProductOption func(*Product)
ProductOption mutates optional fields on Product construction.
func WithOriginalID ¶ added in v0.5.0
func WithOriginalID(originalID string) ProductOption
WithOriginalID sets the original identifier when different from ID.
func WithOriginalURL ¶ added in v0.5.0
func WithOriginalURL(originalURL string) ProductOption
WithOriginalURL records the source URL before redirects.
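NewProduct and its ProductOption helpers follow the functional-options pattern. The sketch below re-creates that pattern in isolation. The Product fields are hypothetical, because the real struct definition is not shown here, and the validation rule (all three positional arguments must be non-empty) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Product is a hypothetical stand-in; the real struct's fields are not
// shown in this documentation.
type Product struct {
	ID, Platform, URL       string
	OriginalID, OriginalURL string
}

// ProductOption mutates optional fields during construction.
type ProductOption func(*Product)

func WithOriginalID(originalID string) ProductOption {
	return func(p *Product) { p.OriginalID = originalID }
}

func WithOriginalURL(originalURL string) ProductOption {
	return func(p *Product) { p.OriginalURL = originalURL }
}

// NewProduct validates the mandatory fields (assumed rule: non-empty) and
// then applies each option in order.
func NewProduct(id, platform, url string, opts ...ProductOption) (Product, error) {
	if id == "" || platform == "" || url == "" {
		return Product{}, errors.New("id, platform, and url are required")
	}
	p := Product{ID: id, Platform: platform, URL: url}
	for _, opt := range opts {
		opt(&p)
	}
	return p, nil
}

func main() {
	p, err := NewProduct("B000123", "AMZN", "https://example.com/p/B000123",
		WithOriginalID("B000999"))
	fmt.Println(p.OriginalID, err) // B000999 <nil>

	_, err = NewProduct("", "AMZN", "https://example.com/p/B000123")
	fmt.Println(err) // id, platform, and url are required
}
```

Options keep the positional signature stable: new optional fields can be added later without changing existing call sites.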
type RequestConfigurator ¶ added in v0.4.0
RequestConfigurator applies cookies and headers to outgoing requests.
type RequestHeaderProvider ¶ added in v0.5.0
RequestHeaderProvider decorates outbound collector requests.
type RequestHook ¶
type ResponseHandler ¶ added in v0.4.0
type ResponseHandler interface {
// HandleBinaryResponse processes non-HTML responses (e.g. images).
// Return true to indicate the response was handled and stop further processing.
HandleBinaryResponse(resp *colly.Response, productID string, fileExtension string) bool
// BeforeEvaluation is called after HTML parsing and content validation but
// before rule evaluation. Use for tasks like image retrieval.
BeforeEvaluation(resp *colly.Response, document *goquery.Document)
// AfterEvaluation is called once the processor has enough context to build
// the final result, before that result is emitted. Use for tasks like
// discoverability probing or file persistence.
AfterEvaluation(resp *colly.Response, document *goquery.Document, result *Result)
}
ResponseHandler extends the crawling pipeline with domain-specific behaviour. Implementations are called at specific points during response processing.
type ResponseHandlerRuntimeBinder ¶ added in v0.5.2
type ResponseHandlerRuntimeBinder interface {
BindRuntime(collector *colly.Collector, filePersister FilePersister, retryHandler RetryHandler)
}
ResponseHandlerRuntimeBinder fills runtime-managed dependencies on handlers after the crawler service has created them.
type ResponseProcessor ¶ added in v0.5.0
type ResponseProcessor interface {
Setup(collector *colly.Collector)
SendFinalResult(resp *colly.Response, success bool, errorText string)
SetResultCallback(callback func(*colly.Response))
SetResponseHandlers(handlers []ResponseHandler)
}
ResponseProcessor handles incoming responses and emits final results.
type Result ¶
type Result struct {
ProductID string `json:"product_id" csv:"ID"`
OriginalProductID string `json:"original_product_id,omitempty" csv:"OriginalID"`
OriginalURL string `json:"original_url,omitempty" csv:""`
FinalURL string `json:"final_url,omitempty" csv:""`
CanonicalURL string `json:"canonical_url,omitempty" csv:""`
ProxyURL string `json:"proxy_url,omitempty" csv:"ProxyURL"`
ProductURL string `json:"product_url" csv:"URL"`
ProductTitle string `json:"product_title,omitempty" csv:"Title"`
ProductPlatform string `json:"product_platform"`
Success bool `json:"success"`
ErrorMessage string `json:"error_message,omitempty" csv:"ErrorMessage"`
HTTPStatusCode int `json:"http_status_code,omitempty" csv:"HTTPStatusCode"`
Progress int `json:"progress,omitempty"`
RuleResults []RuleResult `json:"results,omitempty"`
ConfiguredVerifierCount int `json:"-" csv:"-"`
ScoreOverride *int `json:"-" csv:"-"`
}
Result represents the normalized outcome of crawling a single product page.
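To show how the json struct tags above shape the serialized form, the sketch below marshals a trimmed copy of Result. Only a subset of fields is reproduced, the csv tags are left out, and the sample values and the encode helper are made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result is a trimmed copy of the documented struct, keeping the json tags.
type Result struct {
	ProductID       string `json:"product_id"`
	ProductURL      string `json:"product_url"`
	ProductTitle    string `json:"product_title,omitempty"`
	ProductPlatform string `json:"product_platform"`
	Success         bool   `json:"success"`
	ErrorMessage    string `json:"error_message,omitempty"`
	HTTPStatusCode  int    `json:"http_status_code,omitempty"`
	ScoreOverride   *int   `json:"-"`
}

// encode is a hypothetical helper that returns the JSON form as a string.
func encode(r Result) (string, error) {
	data, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	override := 80
	out, err := encode(Result{
		ProductID:       "B000123",
		ProductURL:      "https://example.com/p/B000123",
		ProductPlatform: "AMZN",
		Success:         true,
		HTTPStatusCode:  200,
		ScoreOverride:   &override, // tagged "-", so it never appears in the output
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
	// {"product_id":"B000123","product_url":"https://example.com/p/B000123","product_platform":"AMZN","success":true,"http_status_code":200}
}
```

Empty ProductTitle and ErrorMessage are dropped because of omitempty, while success is always written, even when false.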
func (Result) CalculateScore ¶ added in v0.5.0
CalculateScore returns the percentage of configured verifiers that passed.
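The exact scoring rule is not spelled out here. One plausible reading, sketched below purely as an assumption, counts passed verification results across all rules, divides by ConfiguredVerifierCount, and lets a non-nil ScoreOverride take precedence. The calculateScore function and the trimmed structs are illustrative stand-ins, not the package's implementation.

```go
package main

import "fmt"

// Trimmed stand-ins for the documented types.
type VerificationResult struct{ Passed bool }

type RuleResult struct{ VerificationResults []VerificationResult }

type Result struct {
	RuleResults             []RuleResult
	ConfiguredVerifierCount int
	ScoreOverride           *int
}

// calculateScore is a hypothetical reading of CalculateScore: an integer
// percentage of configured verifiers that passed.
func calculateScore(r Result) int {
	if r.ScoreOverride != nil {
		return *r.ScoreOverride // assumed: an override wins over the computed value
	}
	if r.ConfiguredVerifierCount <= 0 {
		return 0 // assumed: avoid dividing by zero when nothing is configured
	}
	passed := 0
	for _, rule := range r.RuleResults {
		for _, v := range rule.VerificationResults {
			if v.Passed {
				passed++
			}
		}
	}
	return passed * 100 / r.ConfiguredVerifierCount
}

func main() {
	r := Result{
		ConfiguredVerifierCount: 4,
		RuleResults: []RuleResult{
			{VerificationResults: []VerificationResult{{Passed: true}, {Passed: true}}},
			{VerificationResults: []VerificationResult{{Passed: true}, {Passed: false}}},
		},
	}
	fmt.Println(calculateScore(r)) // 75
}
```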
func (Result) IsNotFound ¶ added in v0.5.0
IsNotFound reports whether the HTTP status code represents a missing page.
func (Result) IsNotRetryable ¶ added in v0.5.0
IsNotRetryable reports whether retrying would be pointless.
type RetryDecision ¶ added in v0.4.0
type RetryDecision struct {
ShouldRetry bool
Message string
LogMessage string
Policy RetryPolicy
ExhaustionBehavior RetryExhaustionBehavior
}
RetryDecision captures the outcome of a platform retry check.
func (RetryDecision) ResolvedLogMessage ¶ added in v0.4.0
func (decision RetryDecision) ResolvedLogMessage() string
ResolvedLogMessage returns the log message or falls back to the general message.
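The fallback described above can be sketched in a few lines. This version assumes that an empty LogMessage is what triggers the fallback to Message; the trimmed RetryDecision omits the Policy and ExhaustionBehavior fields for brevity.

```go
package main

import "fmt"

// RetryDecision is a trimmed copy of the documented struct.
type RetryDecision struct {
	ShouldRetry bool
	Message     string
	LogMessage  string
}

// ResolvedLogMessage prefers LogMessage and falls back to Message when
// LogMessage is empty (assumed trigger for the fallback).
func (decision RetryDecision) ResolvedLogMessage() string {
	if decision.LogMessage != "" {
		return decision.LogMessage
	}
	return decision.Message
}

func main() {
	short := RetryDecision{ShouldRetry: true, Message: "blocked page"}
	fmt.Println(short.ResolvedLogMessage()) // blocked page

	detailed := RetryDecision{
		ShouldRetry: true,
		Message:     "blocked page",
		LogMessage:  "blocked page: captcha form found in DOM",
	}
	fmt.Println(detailed.ResolvedLogMessage()) // blocked page: captcha form found in DOM
}
```

Keeping two messages lets a hook report a short reason in the result while logging a more detailed one.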
type RetryExhaustionBehavior ¶ added in v0.4.0
type RetryExhaustionBehavior uint8
RetryExhaustionBehavior controls what happens when retries are exhausted.
const (
RetryExhaustionBehaviorFail RetryExhaustionBehavior = iota
RetryExhaustionBehaviorContinue
)
type RetryHandler ¶ added in v0.4.0
type RetryHandler interface {
Retry(response *colly.Response, options RetryOptions) bool
}
RetryHandler encapsulates retry behaviour for failed responses.
type RetryOptions ¶ added in v0.4.0
type RetryPolicy ¶ added in v0.4.0
type RetryPolicy uint8
RetryPolicy controls how retries are performed.
const (
RetryPolicyDefault RetryPolicy = iota
RetryPolicyRotateProxy
)
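Because both enumerations start at iota, a zero-valued RetryDecision carries RetryPolicyDefault and RetryExhaustionBehaviorFail. The sketch below re-declares the types locally to demonstrate this; the rotateProxyDecision helper and its message are hypothetical.

```go
package main

import "fmt"

type RetryPolicy uint8

const (
	RetryPolicyDefault RetryPolicy = iota
	RetryPolicyRotateProxy
)

type RetryExhaustionBehavior uint8

const (
	RetryExhaustionBehaviorFail RetryExhaustionBehavior = iota
	RetryExhaustionBehaviorContinue
)

// RetryDecision mirrors the documented struct.
type RetryDecision struct {
	ShouldRetry        bool
	Message            string
	LogMessage         string
	Policy             RetryPolicy
	ExhaustionBehavior RetryExhaustionBehavior
}

// rotateProxyDecision is a hypothetical helper a platform hook might use
// when it suspects the current proxy has been blocked.
func rotateProxyDecision(message string) RetryDecision {
	return RetryDecision{
		ShouldRetry:        true,
		Message:            message,
		Policy:             RetryPolicyRotateProxy,
		ExhaustionBehavior: RetryExhaustionBehaviorContinue,
	}
}

func main() {
	var zero RetryDecision
	fmt.Println(zero.Policy == RetryPolicyDefault)                  // true
	fmt.Println(zero.ExhaustionBehavior == RetryExhaustionBehaviorFail) // true

	d := rotateProxyDecision("captcha page detected")
	fmt.Println(d.ShouldRetry, d.Policy, d.ExhaustionBehavior) // true 1 1
}
```

A hook that only sets ShouldRetry and Message therefore gets the default policy and the fail-on-exhaustion behaviour without naming either constant.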
type RuleEvaluation ¶ added in v0.5.0
type RuleEvaluation struct {
Passed bool
ConfiguredVerifier int
RuleResults []RuleResult
}
RuleEvaluation aggregates evaluation output from the injected RuleEvaluator.
type RuleEvaluator ¶ added in v0.5.0
type RuleEvaluator interface {
Evaluate(productID string, document *goquery.Document) (RuleEvaluation, error)
ConfiguredVerifierCount() int
}
RuleEvaluator produces a RuleEvaluation for a fetched document.
type RuleResult ¶ added in v0.5.0
type RuleResult struct {
ID string `json:"id,omitempty" csv:"-"`
Description string `json:"description" csv:"Description,keyValue"`
Passed bool `json:"passed" csv:"passed"`
ReportingOrder int `json:"reporting_order" csv:"-"`
Message string `json:"message" csv:"message"`
VerificationResults []VerificationResult `json:"verification_results"`
}
RuleResult represents rule-level evaluation outcome.
type ScraperConfig ¶ added in v0.4.0
type ScraperConfig struct {
MaxDepth int
Parallelism int
RetryCount int
HTTPTimeout time.Duration
InsecureSkipVerify bool
RateLimit time.Duration
ProxyList []string
SaveFiles bool
ProxyCircuitBreakerEnabled bool
}
ScraperConfig exposes concurrency and retry knobs for the crawler.
func (ScraperConfig) Validate ¶ added in v0.4.0
func (cfg ScraperConfig) Validate() error
Validate checks that essential numeric fields are positive.
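Which fields count as essential is not listed here. As a rough sketch, the check below assumes Parallelism and HTTPTimeout must be positive; both that choice and the defaultScraperConfig values are assumptions for illustration, not the package's rules.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ScraperConfig mirrors the documented struct.
type ScraperConfig struct {
	MaxDepth                   int
	Parallelism                int
	RetryCount                 int
	HTTPTimeout                time.Duration
	InsecureSkipVerify         bool
	RateLimit                  time.Duration
	ProxyList                  []string
	SaveFiles                  bool
	ProxyCircuitBreakerEnabled bool
}

// defaultScraperConfig returns illustrative values, not package defaults.
func defaultScraperConfig() ScraperConfig {
	return ScraperConfig{
		MaxDepth:    1,
		Parallelism: 4,
		RetryCount:  3,
		HTTPTimeout: 30 * time.Second,
		RateLimit:   500 * time.Millisecond,
	}
}

// validate is a hypothetical stand-in for ScraperConfig.Validate.
func validate(cfg ScraperConfig) error {
	if cfg.Parallelism <= 0 {
		return errors.New("parallelism must be positive")
	}
	if cfg.HTTPTimeout <= 0 {
		return errors.New("http timeout must be positive")
	}
	return nil
}

func main() {
	cfg := defaultScraperConfig()
	fmt.Println(validate(cfg)) // <nil>

	cfg.Parallelism = 0
	fmt.Println(validate(cfg)) // parallelism must be positive
}
```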
type Service ¶
type Service struct {
// contains filtered or unexported fields
}
Service orchestrates crawling of product pages and emits results.
func NewService ¶
func NewService(cfg Config, results chan<- *Result, options ...ServiceOption) (*Service, error)
NewService constructs a crawler service configured for a platform. ServiceOption values customize the service with response handlers and lifecycle hooks.
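Since NewService takes a send-only results channel, the caller is responsible for reading from it. The sketch below shows one way to drain such a channel. A goroutine stands in for the service, Result is trimmed to three fields, and the assumption that the producer closes the channel when it is done is this example's, not a documented guarantee.

```go
package main

import "fmt"

// Result is a trimmed stand-in for the documented struct.
type Result struct {
	ProductID    string
	Success      bool
	ErrorMessage string
}

// collect drains the channel until it is closed and tallies the outcomes.
func collect(results <-chan *Result) (passed, failed int) {
	for r := range results {
		if r.Success {
			passed++
		} else {
			failed++
		}
	}
	return passed, failed
}

func main() {
	results := make(chan *Result, 8)

	// Stand-in for the crawler service, which would send on this channel.
	go func() {
		defer close(results) // assumed: the producer side closes the channel
		results <- &Result{ProductID: "A", Success: true}
		results <- &Result{ProductID: "B", ErrorMessage: "timeout"}
	}()

	passed, failed := collect(results)
	fmt.Println(passed, failed) // 1 1
}
```

A buffered channel keeps the producer from stalling on every send while the consumer is busy with the previous result.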
type ServiceHook ¶ added in v0.5.0
type ServiceHook interface {
// AfterInit is called after the collector, transport, and response processor
// are fully wired. Use for binding domain-specific network configuration.
AfterInit(collector *colly.Collector, transport http.RoundTripper)
// BeforeRun is called before the product visit loop starts.
BeforeRun(ctx context.Context)
// AfterRun is called after all products have been visited and the collector
// has finished. Use for cleanup (e.g. stopping image converter workers).
AfterRun()
}
ServiceHook provides lifecycle callbacks for the crawler service.
type ServiceOption ¶ added in v0.5.0
type ServiceOption func(*Service)
ServiceOption configures a Service during construction.
func WithResponseHandlers ¶ added in v0.5.0
func WithResponseHandlers(handlers ...ResponseHandler) ServiceOption
WithResponseHandlers registers ResponseHandlers that extend the crawling pipeline.
func WithServiceHook ¶ added in v0.5.0
func WithServiceHook(hook ServiceHook) ServiceOption
WithServiceHook registers a lifecycle hook for the crawler service.
type VerificationResult ¶ added in v0.5.0
type VerificationResult struct {
ID string `json:"id,omitempty" csv:"-"`
Description string `json:"description" csv:"Description,keyValue"`
Passed bool `json:"passed" csv:"passed"`
Message string `json:"message" csv:"message"`
Value string `json:"value" csv:"value"`
ReportingOrder int `json:"reporting_order"`
IncludeValue bool `json:"-"`
}
VerificationResult captures the outcome of an individual verifier.