crawler

package
v0.33.14
Published: Apr 29, 2026 License: MIT Imports: 27 Imported by: 0

Documentation

Index

Constants

const (
	WAFVendorCloudflare = "cloudflare"
	WAFVendorImperva    = "imperva"
	WAFVendorDataDome   = "datadome"
	WAFVendorAkamai     = "akamai"
	WAFVendorGeneric    = "generic"
)

Vendor labels reported in WAFDetection.Vendor.

const MaxBodySampleSize = 50 * 1024

MaxBodySampleSize is the maximum size of the body sample stored for tech detection (50 KB).

const ProbeBodyLimit = 4 * 1024

ProbeBodyLimit caps the body read during a pre-flight probe. 4 KB is enough for the EdgeSuite block page (~373 bytes), the Imperva script preamble, and Cloudflare's interstitial.

const ProbeTimeout = 8 * time.Second

ProbeTimeout bounds how long a probe waits before giving up. Akamai SYN-drop variants will hold the connection — at this point we'd rather fall through to the normal flow and let the mid-job circuit breaker catch a real wall.

Variables

This section is empty.

Functions

func IsPathAllowed

func IsPathAllowed(rules *RobotsRules, path string) bool

Types

type CacheCheckAttempt

type CacheCheckAttempt struct {
	Attempt     int    `json:"attempt"`
	CacheStatus string `json:"cache_status"`
	Delay       int    `json:"delay_ms"`
	// Diagnostics duplicates attempt metadata for backward-compatible probe history.
	Diagnostics *ProbeDiagnostics `json:"diagnostics,omitempty"`
}

CacheCheckAttempt stores the result of a single cache status check.

type CacheMetadata

type CacheMetadata struct {
	HeaderSource     string `json:"header_source,omitempty"`
	RawValue         string `json:"raw_value,omitempty"`
	NormalisedStatus string `json:"normalised_status,omitempty"`
	Age              string `json:"age,omitempty"`
	CacheControl     string `json:"cache_control,omitempty"`
	Vary             string `json:"vary,omitempty"`
	CacheStatus      string `json:"cache_status,omitempty"`
	CFCacheStatus    string `json:"cf_cache_status,omitempty"`
	XCache           string `json:"x_cache,omitempty"`
	XCacheRemote     string `json:"x_cache_remote,omitempty"`
	XVercelCache     string `json:"x_vercel_cache,omitempty"`
	XVarnish         string `json:"x_varnish,omitempty"`
}

CacheMetadata stores cache-related headers and interpretation.

type Config

type Config struct {
	DefaultTimeout time.Duration // Default timeout for requests
	MaxConcurrency int           // Maximum number of concurrent requests
	RateLimit      int           // Determines request delay range: base=1s/RateLimit, range=base to 1s
	UserAgent      string        // User agent string for requests
	RetryAttempts  int           // Number of retry attempts for failed requests
	RetryDelay     time.Duration // Delay between retry attempts
	SkipCachedURLs bool          // Whether to skip URLs that are already cached (HIT)
	Port           string        // Server port
	Env            string        // Environment (development/production)
	LogLevel       string        // Logging level
	DatabaseURL    string        // Database connection URL
	AuthToken      string        // Database authentication token
	SentryDSN      string        // Sentry DSN for error tracking
	FindLinks      bool          // Whether to extract links (e.g. PDFs/docs) from pages
	SkipSSRFCheck  bool          // Skip SSRF protection (for tests only, never enable in production)
}

Config holds the configuration for a crawler instance

func DefaultConfig

func DefaultConfig() *Config

DefaultConfig returns a Config instance with default values

type CrawlOptions

type CrawlOptions struct {
	MaxPages    int  // Maximum pages to crawl
	Concurrency int  // Number of concurrent crawlers
	RateLimit   int  // Maximum requests per second
	Timeout     int  // Request timeout in seconds
	FollowLinks bool // Whether to follow links on crawled pages
}

CrawlOptions defines configuration options for a crawl operation

type CrawlResult

type CrawlResult struct {
	URL                 string              `json:"url"`
	ResponseTime        int64               `json:"response_time"`
	StatusCode          int                 `json:"status_code"`
	Error               string              `json:"error,omitempty"`
	Warning             string              `json:"warning,omitempty"`
	CacheStatus         string              `json:"cache_status"`
	ContentType         string              `json:"content_type"`
	ContentLength       int64               `json:"content_length"`
	Headers             http.Header         `json:"headers"`
	RedirectURL         string              `json:"redirect_url"`
	Performance         PerformanceMetrics  `json:"performance"`
	Timestamp           int64               `json:"timestamp"`
	RetryCount          int                 `json:"retry_count"`
	SkippedCrawl        bool                `json:"skipped_crawl,omitempty"`
	Links               map[string][]string `json:"links,omitempty"`
	SecondResponseTime  int64               `json:"second_response_time,omitempty"`
	SecondCacheStatus   string              `json:"second_cache_status,omitempty"`
	SecondContentLength int64               `json:"second_content_length,omitempty"`
	SecondHeaders       http.Header         `json:"second_headers,omitempty"`
	SecondPerformance   *PerformanceMetrics `json:"second_performance,omitempty"`
	CacheCheckAttempts  []CacheCheckAttempt `json:"cache_check_attempts,omitempty"`
	RequestDiagnostics  *RequestDiagnostics `json:"request_diagnostics,omitempty"`
	BodySample          []byte              `json:"-"` // Truncated body for tech detection (not serialised)
	Body                []byte              `json:"-"` // Full body for storage upload (not serialised)
	WAF                 *WAFDetection       `json:"waf,omitempty"`
}

CrawlResult represents the result of a URL crawl operation

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

func New

func New(config *Config, id ...string) *Crawler

func (*Crawler) CheckCacheStatus

func (c *Crawler) CheckCacheStatus(ctx context.Context, targetURL string) (ProbeDiagnostics, error)

func (*Crawler) Config

func (c *Crawler) Config() *Config

func (*Crawler) CreateHTTPClient

func (c *Crawler) CreateHTTPClient(timeout time.Duration) *http.Client

func (*Crawler) DiscoverSitemaps

func (c *Crawler) DiscoverSitemaps(ctx context.Context, domain string) ([]string, error)

DiscoverSitemaps is a backward-compatible wrapper that only returns sitemaps

func (*Crawler) DiscoverSitemapsAndRobots

func (c *Crawler) DiscoverSitemapsAndRobots(ctx context.Context, domain string) (*SitemapDiscoveryResult, error)

DiscoverSitemapsAndRobots attempts to find sitemaps and parse robots.txt rules for a domain

func (*Crawler) FilterURLs

func (c *Crawler) FilterURLs(urls []string, includePaths, excludePaths []string) []string

FilterURLs filters URLs based on include/exclude patterns

func (*Crawler) GetUserAgent

func (c *Crawler) GetUserAgent() string

func (*Crawler) ParseSitemap

func (c *Crawler) ParseSitemap(ctx context.Context, sitemapURL string) ([]string, error)

ParseSitemap extracts URLs from a sitemap

func (*Crawler) Probe added in v0.33.13

func (c *Crawler) Probe(ctx context.Context, domain string) (WAFDetection, error)

Probe runs a pre-flight WAF detection request against the homepage of a domain using the crawler's configured User-Agent and shared probe client. Reuses probeClient so the connection pool / DNS cache are shared with the rest of the crawl path.

func (*Crawler) WarmURL

func (c *Crawler) WarmURL(ctx context.Context, targetURL string, findLinks bool) (*CrawlResult, error)

type PerformanceMetrics

type PerformanceMetrics struct {
	DNSLookupTime       int64 `json:"dns_lookup_time"`
	TCPConnectionTime   int64 `json:"tcp_connection_time"`
	TLSHandshakeTime    int64 `json:"tls_handshake_time"`
	TTFB                int64 `json:"ttfb"`
	ContentTransferTime int64 `json:"content_transfer_time"`
}

PerformanceMetrics holds detailed timing information for a request.

type ProbeDiagnostics

type ProbeDiagnostics struct {
	Attempt  int               `json:"attempt,omitempty"`
	Request  *RequestMetadata  `json:"request,omitempty"`
	Response *ResponseMetadata `json:"response,omitempty"`
	Cache    *CacheMetadata    `json:"cache,omitempty"`
	DelayMS  int               `json:"delay_ms,omitempty"`
}

ProbeDiagnostics stores diagnostics for a cache probe attempt.

type RequestAttemptDiagnostics

type RequestAttemptDiagnostics struct {
	Request         *RequestMetadata    `json:"request,omitempty"`
	Response        *ResponseMetadata   `json:"response,omitempty"`
	RequestHeaders  http.Header         `json:"request_headers,omitempty"`
	ResponseHeaders http.Header         `json:"response_headers,omitempty"`
	Timing          *PerformanceMetrics `json:"timing,omitempty"`
	Cache           *CacheMetadata      `json:"cache,omitempty"`
}

RequestAttemptDiagnostics stores the diagnostics for a full request attempt.

type RequestDiagnostics

type RequestDiagnostics struct {
	Primary   *RequestAttemptDiagnostics `json:"primary,omitempty"`
	Probes    []ProbeDiagnostics         `json:"probes,omitempty"`
	Secondary *RequestAttemptDiagnostics `json:"secondary,omitempty"`
	Timings   *RequestStageTimings       `json:"timings,omitempty"`
}

RequestDiagnostics stores per-stage diagnostics for a crawl.

type RequestMetadata

type RequestMetadata struct {
	Method     string `json:"method,omitempty"`
	URL        string `json:"url,omitempty"`
	FinalURL   string `json:"final_url,omitempty"`
	Scheme     string `json:"scheme,omitempty"`
	Host       string `json:"host,omitempty"`
	Path       string `json:"path,omitempty"`
	Query      string `json:"query,omitempty"`
	Timestamp  int64  `json:"timestamp,omitempty"`
	Provenance string `json:"provenance,omitempty"`
}

RequestMetadata stores request details for a crawl attempt.

type RequestStageTimings added in v0.32.6

type RequestStageTimings struct {
	PrimaryRequestMS   int64 `json:"primary_request_ms,omitempty"`
	CacheValidationMS  int64 `json:"cache_validation_ms,omitempty"`
	SecondaryRequestMS int64 `json:"secondary_request_ms,omitempty"`
	TotalMS            int64 `json:"total_ms,omitempty"`
}

RequestStageTimings stores aggregate duration for each crawl phase.

type ResponseMetadata

type ResponseMetadata struct {
	StatusCode    int    `json:"status_code,omitempty"`
	ContentType   string `json:"content_type,omitempty"`
	ContentLength int64  `json:"content_length,omitempty"`
	RedirectURL   string `json:"redirect_url,omitempty"`
	Warning       string `json:"warning,omitempty"`
	Error         string `json:"error,omitempty"`
}

ResponseMetadata stores response details for a crawl attempt.

type RobotsRules

type RobotsRules struct {
	CrawlDelay       int // seconds; 0 means unspecified
	Sitemaps         []string
	DisallowPatterns []string
	AllowPatterns    []string // override DisallowPatterns
}

RobotsRules holds the crawl directives parsed from robots.txt: the crawl delay, sitemap locations, and disallow patterns together with the allow patterns that override them.

func ParseRobotsTxt

func ParseRobotsTxt(ctx context.Context, domain string, userAgent string, transport ...http.RoundTripper) (*RobotsRules, error)

ParseRobotsTxt fetches and parses robots.txt for the given domain. Precedence: the Hover-specific section if present, else the wildcard (*). Aggressive SEO crawler sections (AhrefsBot, MJ12bot, ...) are intentionally not matched — they often carry punitive 10s delays meant for them.

type Sitemap

type Sitemap struct {
	XMLName xml.Name `xml:"sitemap"`
	Loc     string   `xml:"loc"`
}

type SitemapDiscoveryResult

type SitemapDiscoveryResult struct {
	Sitemaps    []string
	RobotsRules *RobotsRules
}

SitemapDiscoveryResult contains both sitemaps and robots.txt rules

type SitemapIndex

type SitemapIndex struct {
	XMLName  xml.Name  `xml:"sitemapindex"`
	Sitemaps []Sitemap `xml:"sitemap"`
}

SitemapIndex models a sitemap index document that points to child sitemaps.

type URL

type URL struct {
	XMLName xml.Name `xml:"url"`
	Loc     string   `xml:"loc"`
}

type URLSet

type URLSet struct {
	XMLName xml.Name `xml:"urlset"`
	URLs    []URL    `xml:"url"`
}

type WAFDetection added in v0.33.13

type WAFDetection struct {
	Blocked bool   `json:"blocked"`
	Vendor  string `json:"vendor,omitempty"`
	Reason  string `json:"reason,omitempty"`
}

WAFDetection captures a verdict from the WAF fingerprint detector. Vendor identifies the protection layer ("cloudflare", "imperva", "datadome", "akamai", "generic", or empty when not blocked). Reason is the specific signal that fired, suitable for surfacing in jobs.error_message.

func DetectWAF added in v0.33.13

func DetectWAF(statusCode int, headers http.Header, bodySample []byte) WAFDetection

DetectWAF inspects a response and reports whether it carries a fingerprint of a known bot-protection layer. The function is pure: no I/O, safe for table-driven tests. It is intentionally conservative on 200 responses — only blocking status codes (typically 403 or 202) combined with corroborating fingerprints trigger a verdict, so a healthy site that happens to use Cloudflare for caching does not get flagged.

Fingerprints (issue #365 row 1 + comment 4334238167):

  • Cloudflare: cf-mitigated header set on a non-200 response
  • Imperva: body contains _Incapsula_Resource
  • DataDome: Server header equals DataDome
  • Akamai: Server header AkamaiGHost OR akaalb_/_abck/bm_sz cookie OR Server-Timing ak_p marker, all on a blocking status
  • Generic: tiny body (<500 bytes) on 403 or 202 with no other signal

func Probe added in v0.33.13

func Probe(ctx context.Context, domain string, userAgent string, transport http.RoundTripper) (WAFDetection, error)

Probe issues a GET against the homepage of the given domain and runs the WAF detector against the response. The probe sends the supplied User-Agent so the verdict matches what real crawl tasks will see.

On network or timeout error the probe returns WAFDetection{} with the underlying error; callers should treat a network error as "no verdict" rather than as a block.
