crawler

package
v0.33.14
Published: Apr 29, 2026 License: MIT Imports: 27 Imported by: 0

Documentation

Index

Constants

const (
	WAFVendorCloudflare = "cloudflare"
	WAFVendorImperva    = "imperva"
	WAFVendorDataDome   = "datadome"
	WAFVendorAkamai     = "akamai"
	WAFVendorGeneric    = "generic"
)

Vendor labels reported in WAFDetection.Vendor.

const MaxBodySampleSize = 50 * 1024

MaxBodySampleSize is the maximum size of the body sample stored for tech detection (50 KB).

const ProbeBodyLimit = 4 * 1024

ProbeBodyLimit caps the body read during a pre-flight probe. 4 KB is enough for the EdgeSuite block page (~373 bytes), the Imperva script preamble, and Cloudflare's interstitial.

const ProbeTimeout = 8 * time.Second

ProbeTimeout bounds how long a probe waits before giving up. Akamai SYN-drop variants will hold the connection — at this point we'd rather fall through to the normal flow and let the mid-job circuit breaker catch a real wall.

Variables

This section is empty.

Functions

func IsPathAllowed

func IsPathAllowed(rules *RobotsRules, path string) bool

Types

type CacheCheckAttempt

type CacheCheckAttempt struct {
	Attempt     int    `json:"attempt"`
	CacheStatus string `json:"cache_status"`
	Delay       int    `json:"delay_ms"`
	// Diagnostics duplicates attempt metadata for backward-compatible probe history.
	Diagnostics *ProbeDiagnostics `json:"diagnostics,omitempty"`
}

CacheCheckAttempt stores the result of a single cache status check.

type CacheMetadata

type CacheMetadata struct {
	HeaderSource     string `json:"header_source,omitempty"`
	RawValue         string `json:"raw_value,omitempty"`
	NormalisedStatus string `json:"normalised_status,omitempty"`
	Age              string `json:"age,omitempty"`
	CacheControl     string `json:"cache_control,omitempty"`
	Vary             string `json:"vary,omitempty"`
	CacheStatus      string `json:"cache_status,omitempty"`
	CFCacheStatus    string `json:"cf_cache_status,omitempty"`
	XCache           string `json:"x_cache,omitempty"`
	XCacheRemote     string `json:"x_cache_remote,omitempty"`
	XVercelCache     string `json:"x_vercel_cache,omitempty"`
	XVarnish         string `json:"x_varnish,omitempty"`
}

CacheMetadata stores cache-related headers and interpretation.

type Config

type Config struct {
	DefaultTimeout time.Duration // Default timeout for requests
	MaxConcurrency int           // Maximum number of concurrent requests
	RateLimit      int           // Determines request delay range: base=1s/RateLimit, range=base to 1s
	UserAgent      string        // User agent string for requests
	RetryAttempts  int           // Number of retry attempts for failed requests
	RetryDelay     time.Duration // Delay between retry attempts
	SkipCachedURLs bool          // Whether to skip URLs that are already cached (HIT)
	Port           string        // Server port
	Env            string        // Environment (development/production)
	LogLevel       string        // Logging level
	DatabaseURL    string        // Database connection URL
	AuthToken      string        // Database authentication token
	SentryDSN      string        // Sentry DSN for error tracking
	FindLinks      bool          // Whether to extract links (e.g. PDFs/docs) from pages
	SkipSSRFCheck  bool          // Skip SSRF protection (for tests only, never enable in production)
}

Config holds the configuration for a crawler instance

func DefaultConfig

func DefaultConfig() *Config

DefaultConfig returns a Config instance with default values

type CrawlOptions

type CrawlOptions struct {
	MaxPages    int  // Maximum pages to crawl
	Concurrency int  // Number of concurrent crawlers
	RateLimit   int  // Maximum requests per second
	Timeout     int  // Request timeout in seconds
	FollowLinks bool // Whether to follow links on crawled pages
}

CrawlOptions defines configuration options for a crawl operation

type CrawlResult

type CrawlResult struct {
	URL                 string              `json:"url"`
	ResponseTime        int64               `json:"response_time"`
	StatusCode          int                 `json:"status_code"`
	Error               string              `json:"error,omitempty"`
	Warning             string              `json:"warning,omitempty"`
	CacheStatus         string              `json:"cache_status"`
	ContentType         string              `json:"content_type"`
	ContentLength       int64               `json:"content_length"`
	Headers             http.Header         `json:"headers"`
	RedirectURL         string              `json:"redirect_url"`
	Performance         PerformanceMetrics  `json:"performance"`
	Timestamp           int64               `json:"timestamp"`
	RetryCount          int                 `json:"retry_count"`
	SkippedCrawl        bool                `json:"skipped_crawl,omitempty"`
	Links               map[string][]string `json:"links,omitempty"`
	SecondResponseTime  int64               `json:"second_response_time,omitempty"`
	SecondCacheStatus   string              `json:"second_cache_status,omitempty"`
	SecondContentLength int64               `json:"second_content_length,omitempty"`
	SecondHeaders       http.Header         `json:"second_headers,omitempty"`
	SecondPerformance   *PerformanceMetrics `json:"second_performance,omitempty"`
	CacheCheckAttempts  []CacheCheckAttempt `json:"cache_check_attempts,omitempty"`
	RequestDiagnostics  *RequestDiagnostics `json:"request_diagnostics,omitempty"`
	BodySample          []byte              `json:"-"` // Truncated body for tech detection (not serialised)
	Body                []byte              `json:"-"` // Full body for storage upload (not serialised)
	WAF                 *WAFDetection       `json:"waf,omitempty"`
}

CrawlResult represents the result of a URL crawl operation

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

func New

func New(config *Config, id ...string) *Crawler

func (*Crawler) CheckCacheStatus

func (c *Crawler) CheckCacheStatus(ctx context.Context, targetURL string) (ProbeDiagnostics, error)

func (*Crawler) Config

func (c *Crawler) Config() *Config

func (*Crawler) CreateHTTPClient

func (c *Crawler) CreateHTTPClient(timeout time.Duration) *http.Client

func (*Crawler) DiscoverSitemaps

func (c *Crawler) DiscoverSitemaps(ctx context.Context, domain string) ([]string, error)

DiscoverSitemaps is a backward-compatible wrapper that only returns sitemaps

func (*Crawler) DiscoverSitemapsAndRobots

func (c *Crawler) DiscoverSitemapsAndRobots(ctx context.Context, domain string) (*SitemapDiscoveryResult, error)

DiscoverSitemapsAndRobots attempts to find sitemaps and parse robots.txt rules for a domain

func (*Crawler) FilterURLs

func (c *Crawler) FilterURLs(urls []string, includePaths, excludePaths []string) []string

FilterURLs filters URLs based on include/exclude patterns

func (*Crawler) GetUserAgent

func (c *Crawler) GetUserAgent() string

func (*Crawler) ParseSitemap

func (c *Crawler) ParseSitemap(ctx context.Context, sitemapURL string) ([]string, error)

ParseSitemap extracts URLs from a sitemap

func (*Crawler) Probe added in v0.33.13

func (c *Crawler) Probe(ctx context.Context, domain string) (WAFDetection, error)

Probe runs a pre-flight WAF detection request against the homepage of a domain using the crawler's configured User-Agent and shared probe client. Reuses probeClient so the connection pool / DNS cache are shared with the rest of the crawl path.

func (*Crawler) WarmURL

func (c *Crawler) WarmURL(ctx context.Context, targetURL string, findLinks bool) (*CrawlResult, error)

type PerformanceMetrics

type PerformanceMetrics struct {
	DNSLookupTime       int64 `json:"dns_lookup_time"`
	TCPConnectionTime   int64 `json:"tcp_connection_time"`
	TLSHandshakeTime    int64 `json:"tls_handshake_time"`
	TTFB                int64 `json:"ttfb"`
	ContentTransferTime int64 `json:"content_transfer_time"`
}

PerformanceMetrics holds detailed timing information for a request.

type ProbeDiagnostics

type ProbeDiagnostics struct {
	Attempt  int               `json:"attempt,omitempty"`
	Request  *RequestMetadata  `json:"request,omitempty"`
	Response *ResponseMetadata `json:"response,omitempty"`
	Cache    *CacheMetadata    `json:"cache,omitempty"`
	DelayMS  int               `json:"delay_ms,omitempty"`
}

ProbeDiagnostics stores diagnostics for a cache probe attempt.

type RequestAttemptDiagnostics

type RequestAttemptDiagnostics struct {
	Request         *RequestMetadata    `json:"request,omitempty"`
	Response        *ResponseMetadata   `json:"response,omitempty"`
	RequestHeaders  http.Header         `json:"request_headers,omitempty"`
	ResponseHeaders http.Header         `json:"response_headers,omitempty"`
	Timing          *PerformanceMetrics `json:"timing,omitempty"`
	Cache           *CacheMetadata      `json:"cache,omitempty"`
}

RequestAttemptDiagnostics stores the diagnostics for a full request attempt.

type RequestDiagnostics

type RequestDiagnostics struct {
	Primary   *RequestAttemptDiagnostics `json:"primary,omitempty"`
	Probes    []ProbeDiagnostics         `json:"probes,omitempty"`
	Secondary *RequestAttemptDiagnostics `json:"secondary,omitempty"`
	Timings   *RequestStageTimings       `json:"timings,omitempty"`
}

RequestDiagnostics stores per-stage diagnostics for a crawl.

type RequestMetadata

type RequestMetadata struct {
	Method     string `json:"method,omitempty"`
	URL        string `json:"url,omitempty"`
	FinalURL   string `json:"final_url,omitempty"`
	Scheme     string `json:"scheme,omitempty"`
	Host       string `json:"host,omitempty"`
	Path       string `json:"path,omitempty"`
	Query      string `json:"query,omitempty"`
	Timestamp  int64  `json:"timestamp,omitempty"`
	Provenance string `json:"provenance,omitempty"`
}

RequestMetadata stores request details for a crawl attempt.

type RequestStageTimings added in v0.32.6

type RequestStageTimings struct {
	PrimaryRequestMS   int64 `json:"primary_request_ms,omitempty"`
	CacheValidationMS  int64 `json:"cache_validation_ms,omitempty"`
	SecondaryRequestMS int64 `json:"secondary_request_ms,omitempty"`
	TotalMS            int64 `json:"total_ms,omitempty"`
}

RequestStageTimings stores aggregate duration for each crawl phase.

type ResponseMetadata

type ResponseMetadata struct {
	StatusCode    int    `json:"status_code,omitempty"`
	ContentType   string `json:"content_type,omitempty"`
	ContentLength int64  `json:"content_length,omitempty"`
	RedirectURL   string `json:"redirect_url,omitempty"`
	Warning       string `json:"warning,omitempty"`
	Error         string `json:"error,omitempty"`
}

ResponseMetadata stores response details for a crawl attempt.

type RobotsRules

type RobotsRules struct {
	CrawlDelay       int // seconds; 0 means unspecified
	Sitemaps         []string
	DisallowPatterns []string
	AllowPatterns    []string // override DisallowPatterns
}

RobotsRules holds the crawl directives parsed from robots.txt: the crawl delay, sitemap locations, and disallow patterns together with the allow patterns that override them.

func ParseRobotsTxt

func ParseRobotsTxt(ctx context.Context, domain string, userAgent string, transport ...http.RoundTripper) (*RobotsRules, error)

ParseRobotsTxt fetches and parses robots.txt for the given domain. Precedence: the Hover-specific section if present, else the wildcard (*). Aggressive SEO crawler sections (AhrefsBot, MJ12bot, ...) are intentionally not matched — they often carry punitive 10s delays meant for them.

type Sitemap

type Sitemap struct {
	XMLName xml.Name `xml:"sitemap"`
	Loc     string   `xml:"loc"`
}

type SitemapDiscoveryResult

type SitemapDiscoveryResult struct {
	Sitemaps    []string
	RobotsRules *RobotsRules
}

SitemapDiscoveryResult contains both sitemaps and robots.txt rules

type SitemapIndex

type SitemapIndex struct {
	XMLName  xml.Name  `xml:"sitemapindex"`
	Sitemaps []Sitemap `xml:"sitemap"`
}

SitemapIndex models a sitemap index document that points to child sitemaps.

type URL

type URL struct {
	XMLName xml.Name `xml:"url"`
	Loc     string   `xml:"loc"`
}

type URLSet

type URLSet struct {
	XMLName xml.Name `xml:"urlset"`
	URLs    []URL    `xml:"url"`
}

type WAFDetection added in v0.33.13

type WAFDetection struct {
	Blocked bool   `json:"blocked"`
	Vendor  string `json:"vendor,omitempty"`
	Reason  string `json:"reason,omitempty"`
}

WAFDetection captures a verdict from the WAF fingerprint detector. Vendor identifies the protection layer ("cloudflare", "imperva", "datadome", "akamai", "generic", or empty when not blocked). Reason is the specific signal that fired, suitable for surfacing in jobs.error_message.

func DetectWAF added in v0.33.13

func DetectWAF(statusCode int, headers http.Header, bodySample []byte) WAFDetection

DetectWAF inspects a response and reports whether it carries a fingerprint of a known bot-protection layer. The function is pure: no I/O, safe for table-driven tests. It is intentionally conservative on 200 responses — only blocking status codes (typically 403 or 202) combined with corroborating fingerprints trigger a verdict, so a healthy site that happens to use Cloudflare for caching does not get flagged.

Fingerprints (issue #365 row 1 + comment 4334238167):

  • Cloudflare: cf-mitigated header set on a non-200 response
  • Imperva: body contains _Incapsula_Resource
  • DataDome: Server header equals DataDome
  • Akamai: Server header AkamaiGHost OR akaalb_/_abck/bm_sz cookie OR Server-Timing ak_p marker, all on a blocking status
  • Generic: tiny body (<500 bytes) on 403 or 202 with no other signal

func Probe added in v0.33.13

func Probe(ctx context.Context, domain string, userAgent string, transport http.RoundTripper) (WAFDetection, error)

Probe issues a GET against the homepage of the given domain and runs the WAF detector against the response. The probe sends the supplied User-Agent so the verdict matches what real crawl tasks will see.

On network or timeout error the probe returns WAFDetection{} with the underlying error; callers should treat a network error as "no verdict" rather than as a block.
