fetch

package
v0.0.0-...-eb0ce61
Published: Mar 23, 2026 License: Apache-2.0 Imports: 36 Imported by: 0

Documentation

Overview

Package fetch provides HTTP and headless browser content fetching capabilities. It handles request routing, rate limiting, retry logic, and render profiles. It does NOT handle content extraction or parsing.

This file handles headless browser authentication and session management. It provides login form detection, automated login flows, and cookie/session persistence. It does NOT handle auth profile management (see internal/auth).

This file provides device emulation and screenshot capture for chromedp. It handles viewport configuration, mobile device simulation, and full-page or viewport screenshot generation with configurable formats.

This file handles network request/response interception for API scraping. It provides the networkInterceptor type for capturing network traffic based on configurable URL patterns and resource types. It does NOT handle request execution or browser lifecycle management.

This file provides network idle detection and response tracking for chromedp. It tracks active network requests to determine when page loading is complete and captures HTTP response status codes from document requests.

Package fetch provides browser/tooling availability checks and fetcher lifecycle helpers.

Purpose:

  • Centralize best-effort fetcher cleanup so callers do not leak repo-started browser automation.

Responsibilities:

  • Detect whether a fetcher exposes a Close method.
  • Invoke Close safely for callers that create short-lived fetchers per request or test.

Scope:

  • Fetcher lifecycle cleanup only; concrete fetch behavior lives in sibling files.

Usage:

  • Call CloseFetcher(fetcher) in scrape/crawl teardown paths after constructing a fetch.Fetcher.

Invariants/Assumptions:

  • Cleanup is best-effort and should be safe to call on nil or non-closable fetchers.
  • Close must not panic when the underlying fetcher has already been cleaned up.

Package fetch provides browser/tooling availability checks and fetcher construction helpers.

Purpose:

  • Centralize fetcher factories plus browser and Playwright prerequisite detection.

Responsibilities:

  • Create adaptive fetchers with optional metrics and proxy-pool wiring.
  • Detect Chrome/Chromium availability across supported host platforms.
  • Cache Playwright readiness checks while allowing explicit refresh probes.

Scope:

  • Shared fetcher setup and availability probing only; concrete fetching lives in sibling files.

Usage:

  • Called by runtime initialization, health endpoints, and diagnostic helpers.

Invariants/Assumptions:

  • Availability checks must never launch long-running browser sessions.
  • Fresh diagnostic probes may invalidate cached Playwright readiness state.

This file implements the main entry points for automatic form detection for headless login flows. It analyzes HTML to detect login forms, identify input fields, and generate CSS selectors for automated login without manual configuration.

The detection uses heuristics based on:

  • Input type attributes (password, email)
  • Autocomplete attributes (username, current-password)
  • Name/id patterns (user, login, email, pass)
  • Form structure and field relationships

It does NOT execute JavaScript or handle multi-step flows (MFA/2FA).

This file contains form classification logic for determining the type of form (login, register, password reset, search, contact, newsletter, checkout, survey) based on its fields and structure.

This file contains field finding functions for detecting username, password, and submit button fields within forms.

This file contains form detection heuristics for analyzing form elements and div-based forms (common in modern SPAs).

This file contains CSS selector generation functions for targeting form elements and containers with reliable selectors.

This file contains the type definitions for form detection, including form types, field matches, detected forms, and detection weights configuration.

This file contains utility functions for form detection, including CSS escaping and sorting functions.

This file implements automated form filling and submission for general forms (not just login forms). It uses chromedp for headless browser automation.

The form filler supports:

  • Automatic form detection and field mapping
  • Filling text, email, phone, textarea, select, checkbox, and radio fields
  • Form submission with success/failure detection
  • Multi-step form workflows

It does NOT handle CAPTCHAs or complex JavaScript-dependent forms.

This file provides device emulation for Playwright fetcher. It handles viewport configuration, mobile device simulation, and device profile resolution from requests and render profiles.

This file handles network request/response interception for Playwright-based API scraping. It provides the playwrightInterceptor type for capturing network traffic based on configurable URL patterns and resource types. It does NOT handle request execution or browser lifecycle management.

This file provides screenshot capture for Playwright fetcher. It handles viewport configuration, file generation, and full-page or viewport screenshot capture with configurable formats (PNG/JPEG).

This file provides session/cookie extraction for the Playwright fetcher. It handles extracting cookies from browser contexts and saving them as sessions for later reuse. It does NOT handle session loading or authentication flows.

Purpose:

  • Load and validate persisted proxy-pool configuration.

Responsibilities:

  • Read proxy-pool JSON files.
  • Distinguish optional default absence from explicit user misconfiguration.

Scope:

  • Proxy-pool persistence helpers only.

Usage:

  • LoadProxyPoolFromFile(path) for strict loading.
  • ProxyPoolFromConfig(path, explicit) for optional startup loading.

Invariants/Assumptions:

  • Startup callers may choose silent missing-file handling for non-required pool paths.
  • Explicit proxy-pool paths still surface errors.
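
The split between strict and optional loading can be illustrated with a small standalone sketch. It does not reproduce LoadProxyPoolFromFile or ProxyPoolFromConfig; the names loadPool and loadPoolOptional, the pool type, and its JSON shape are all hypothetical. The only behavior taken from the description above is that a missing file is tolerated for a default path and reported for an explicit one.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// pool is a hypothetical on-disk shape; the real schema is not shown here.
type pool struct {
	Proxies []string `json:"proxies"`
}

// loadPool is the strict variant: every failure is returned to the caller.
func loadPool(path string) (*pool, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var p pool
	if err := json.Unmarshal(data, &p); err != nil {
		return nil, fmt.Errorf("parse %s: %w", path, err)
	}
	return &p, nil
}

// loadPoolOptional swallows "file does not exist" only when the path was
// not explicitly configured. Parse errors always surface.
func loadPoolOptional(path string, explicit bool) (*pool, error) {
	p, err := loadPool(path)
	if err != nil && !explicit && errors.Is(err, fs.ErrNotExist) {
		return nil, nil
	}
	return p, err
}

// demo loads the same missing path once as a default and once as an
// explicit setting.
func demo() (defaultErr, explicitErr error) {
	missing := filepath.Join(os.TempDir(), "no-such-dir-for-sketch", "proxy_pool.json")
	_, defaultErr = loadPoolOptional(missing, false)
	_, explicitErr = loadPoolOptional(missing, true)
	return
}

func main() {
	defaultErr, explicitErr := demo()
	fmt.Println("default path error:", defaultErr)
	fmt.Println("explicit path error:", explicitErr)
}
```

Callers of such a helper need to treat a nil pool with a nil error as "no pool configured".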

Package fetch provides render profile management utilities. This file implements CRUD operations for render profiles stored in DATA_DIR/render_profiles.json.

Responsibilities:

  • Load and save render profiles with strict validation.
  • CRUD operations: List, Get, Upsert, Delete.
  • Atomic file writes to prevent corruption.
  • Validation of profile fields (name uniqueness, host patterns, engine enum).

This file does NOT:

  • Handle runtime profile matching (see render_profiles_store.go).
  • Execute fetches or apply profiles to requests.

Invariants:

  • Profile names must be unique (case-sensitive).
  • Host patterns must be non-empty and pass hostmatch.ValidateHostPatterns.
  • Engine must be one of: http, chromedp, playwright (if set).
  • File writes are atomic (temp file + rename).

This file defines authentication and proxy configuration types.

This file defines device emulation types for mobile/responsive content.

This file defines network interception types for capturing XHR/Fetch API traffic.

Constants

This section is empty.

Variables

var (
	ErrChromeNotFound     = apperrors.ErrChromeNotFound
	ErrPlaywrightNotReady = apperrors.ErrPlaywrightNotReady
)

var ErrCircuitBreakerOpen = apperrors.New(apperrors.KindInternal, "circuit breaker is open")

ErrCircuitBreakerOpen is returned when the circuit breaker is open and requests are blocked. This maps to HTTP 503 Service Unavailable.

Functions

func ApplyAuthQuery

func ApplyAuthQuery(rawURL string, query map[string]string) string

ApplyAuthQuery applies authentication query parameters to a URL. If the query map is empty, the original URL is returned unchanged.
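
For illustration, the behavior described above can be approximated with net/url. This is a standalone sketch under assumptions, not the package's implementation: addQueryParams is a hypothetical name, and leaving an unparseable URL untouched and overwriting existing keys are guesses about edge cases the documentation does not cover.

```go
package main

import (
	"fmt"
	"net/url"
)

// addQueryParams merges query into rawURL. An empty map returns the input
// unchanged, matching the documented contract.
func addQueryParams(rawURL string, query map[string]string) string {
	if len(query) == 0 {
		return rawURL
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return rawURL // assumption: unparseable URLs are left untouched
	}
	q := u.Query()
	for k, v := range query {
		q.Set(k, v) // assumption: auth values replace existing keys
	}
	u.RawQuery = q.Encode() // Encode sorts by key, so output is deterministic
	return u.String()
}

func main() {
	fmt.Println(addQueryParams(
		"https://example.com/api?page=2",
		map[string]string{"api_key": "secret"},
	))
}
```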

func CSSEscape

func CSSEscape(s string) string

CSSEscape escapes a string for use in CSS selectors. This is a simplified version that handles common cases only.
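
As a rough idea of what a simplified escaper can look like, the sketch below backslash-escapes ASCII punctuation and hex-escapes a leading digit. It is a hypothetical cssEscape written for this page, not the package's CSSEscape, and it deliberately ignores control characters and the leading-hyphen rules that a complete CSS serializer handles.

```go
package main

import (
	"fmt"
	"strings"
)

// cssEscape makes an id or name value safe to embed in a simple selector.
func cssEscape(s string) string {
	var b strings.Builder
	for i, r := range s {
		switch {
		case i == 0 && r >= '0' && r <= '9':
			// An identifier cannot start with a digit: emit the hex code
			// point followed by a terminating space.
			fmt.Fprintf(&b, "\\%x ", r)
		case r == '-' || r == '_' || r >= 0x80 ||
			(r >= '0' && r <= '9') ||
			(r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z'):
			b.WriteRune(r)
		default:
			b.WriteByte('\\')
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println("#" + cssEscape("user.name"))
	fmt.Println("#" + cssEscape("1st"))
}
```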

func CalculateBackoff

func CalculateBackoff(cfg RetryConfig, attempt int) time.Duration

CalculateBackoff returns backoff duration based on the configured strategy. This is the main entry point for computing retry delays.

func CheckBrowserAvailability

func CheckBrowserAvailability(usePlaywright bool) error

CheckBrowserAvailability checks if the required browser binaries are available.

func CheckBrowserAvailabilityFresh

func CheckBrowserAvailabilityFresh(usePlaywright bool) error

CheckBrowserAvailabilityFresh forces a new availability probe.

func CloseFetcher

func CloseFetcher(fetcher Fetcher) error

CloseFetcher closes fetchers that expose a Close method.
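
The "close only if closable" pattern can be sketched with a type assertion. The code below is a standalone illustration; closeIfPossible, fetcher, browserFetcher and plainFetcher are hypothetical stand-ins, and the real CloseFetcher may differ in how it treats errors.

```go
package main

import "fmt"

// fetcher stands in for the package's Fetcher interface; the Fetch method
// is omitted because only lifecycle handling matters here.
type fetcher interface{}

// closer is the optional capability probed for at teardown.
type closer interface {
	Close() error
}

// closeIfPossible calls Close when the value implements closer and is a
// no-op for nil or non-closable values.
func closeIfPossible(f fetcher) error {
	if f == nil {
		return nil
	}
	if c, ok := f.(closer); ok {
		return c.Close()
	}
	return nil
}

// browserFetcher simulates a fetcher that owns a browser process.
type browserFetcher struct{ closed bool }

// Close is idempotent: a second call returns nil instead of panicking.
func (b *browserFetcher) Close() error {
	b.closed = true
	return nil
}

// plainFetcher has no Close method at all.
type plainFetcher struct{}

func main() {
	b := &browserFetcher{}
	err := closeIfPossible(b)
	fmt.Println(err, b.closed)
	fmt.Println(closeIfPossible(plainFetcher{}))
	fmt.Println(closeIfPossible(nil))
}
```

In a teardown path this would typically sit behind a defer right after the fetcher is constructed.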

func DefaultRetryableCodes

func DefaultRetryableCodes() map[int]bool

DefaultRetryableCodes returns the default set of HTTP status codes that should trigger a retry.

func DeleteRenderProfile

func DeleteRenderProfile(dataDir, name string) error

DeleteRenderProfile removes a profile by name. Returns apperrors.NotFound if the profile doesn't exist.

func FindChrome

func FindChrome() (string, error)

FindChrome resolves the Chrome/Chromium binary path for diagnostics and runtime checks.

func IsJSHeavy

func IsJSHeavy(js JSHeaviness, threshold float64) bool

IsJSHeavy determines if the page is JS-heavy based on the score and a threshold. Default threshold is usually around 0.5.

func IsStatusCodeRetryable

func IsStatusCodeRetryable(status int, retryableCodes map[int]bool) bool

IsStatusCodeRetryable checks if a status code is in the retryable set.

func ListDevicePresetNames

func ListDevicePresetNames() []string

ListDevicePresetNames returns all available device preset names.

func ListRenderProfileNames

func ListRenderProfileNames(dataDir string) ([]string, error)

ListRenderProfileNames returns a sorted list of all profile names.

func ParseRateLimitPolicyHeader

func ParseRateLimitPolicyHeader(header string) (limit int, window time.Duration)

ParseRateLimitPolicyHeader parses the RateLimit-Policy header defined in the IETF draft "RateLimit header fields for HTTP" (draft-ietf-httpapi-ratelimit-headers). Format: RateLimit-Policy: 100;w=60, where 100 is the limit and w=60 specifies a 60-second window. This can provide the window duration even when the RateLimit header is not present.
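
A minimal parser for the simple "limit;w=seconds" form might look like the sketch below. parsePolicy is a hypothetical function: returning zero values on malformed input and reading only the first policy of a comma-separated list are assumptions, and other syntaxes of the header are not covered.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parsePolicy extracts the limit and window from values such as "100;w=60".
// It returns zero values when the limit is missing or malformed.
func parsePolicy(header string) (limit int, window time.Duration) {
	// Only the first policy of a comma-separated list is considered.
	first, _, _ := strings.Cut(header, ",")
	parts := strings.Split(first, ";")
	n, err := strconv.Atoi(strings.TrimSpace(parts[0]))
	if err != nil || n < 0 {
		return 0, 0
	}
	limit = n
	for _, p := range parts[1:] {
		key, val, ok := strings.Cut(strings.TrimSpace(p), "=")
		if !ok || key != "w" {
			continue // ignore parameters other than the window
		}
		if secs, err := strconv.Atoi(val); err == nil && secs > 0 {
			window = time.Duration(secs) * time.Second
		}
	}
	return limit, window
}

func main() {
	limit, window := parsePolicy("100;w=60")
	fmt.Println(limit, window)
}
```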

func RenderProfilesPath

func RenderProfilesPath(dataDir string) string

RenderProfilesPath returns the path to the render profiles JSON file.

func SaveRenderProfilesFile

func SaveRenderProfilesFile(dataDir string, file RenderProfilesFile) error

SaveRenderProfilesFile saves the render profiles file to disk atomically. Validates before writing. Creates parent directories if needed.
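
The "temp file + rename" technique behind atomic saves can be sketched as follows. writeJSONAtomic and roundTrip are hypothetical helpers; validation is omitted and the payload is a plain map rather than a RenderProfilesFile. The temp file is created in the target directory so that the final rename stays on one filesystem.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// writeJSONAtomic marshals v, writes it to a temp file next to path, and
// renames it into place so readers never see a partial file.
func writeJSONAtomic(path string, v any) error {
	data, err := json.MarshalIndent(v, "", "  ")
	if err != nil {
		return err
	}
	dir := filepath.Dir(path)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	tmp, err := os.CreateTemp(dir, ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// roundTrip writes a value under a fresh temp directory and reads it back.
func roundTrip() (string, error) {
	dir, err := os.MkdirTemp("", "profiles")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "nested", "render_profiles.json")
	if err := writeJSONAtomic(path, map[string]string{"engine": "http"}); err != nil {
		return "", err
	}
	b, err := os.ReadFile(path)
	return string(b), err
}

func main() {
	fmt.Println(roundTrip())
}
```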

func ShouldRetryWithConfig

func ShouldRetryWithConfig(err error, status int, cfg RetryConfig) bool

ShouldRetryWithConfig checks if retry should occur using configurable rules. It first checks the configured status codes, then falls back to default logic.

func SleepWithContext

func SleepWithContext(ctx context.Context, d time.Duration) error

SleepWithContext sleeps for the given duration or until the context is cancelled. Returns ctx.Err() if cancelled, nil otherwise.
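
A context-aware sleep is usually a select over a timer and ctx.Done(). The sketch below shows that pattern with hypothetical names (sleepCtx, demo); it illustrates the documented contract and is not the package's source.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepCtx blocks for d or until ctx is done, whichever comes first.
func sleepCtx(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop() // release the timer if the context wins
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}

// demo runs one uninterrupted sleep and one sleep on a cancelled context.
func demo() (completed, cancelled error) {
	completed = sleepCtx(context.Background(), 10*time.Millisecond)
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	cancelled = sleepCtx(ctx, time.Hour)
	return
}

func main() {
	completed, cancelled := demo()
	fmt.Println(completed)
	fmt.Println(cancelled)
}
```

Retry loops benefit from this shape because a cancelled crawl stops waiting immediately instead of finishing the backoff delay.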

func UpsertRenderProfile

func UpsertRenderProfile(dataDir string, profile RenderProfile) error

UpsertRenderProfile creates or updates a render profile. If a profile with the same name exists, it is replaced in-place (preserving order). If not found, the profile is appended to the end.
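
The replace-in-place-or-append rule is easy to see on a plain slice. The sketch below uses a hypothetical profile type with only two fields and skips the file loading, validation and saving that the real function performs.

```go
package main

import "fmt"

// profile is a reduced stand-in for RenderProfile.
type profile struct {
	Name   string
	Engine string
}

// upsert replaces the entry with the same (case-sensitive) name at its
// current position, or appends p when no such entry exists.
func upsert(list []profile, p profile) []profile {
	for i := range list {
		if list[i].Name == p.Name {
			list[i] = p
			return list
		}
	}
	return append(list, p)
}

func main() {
	list := []profile{{"docs", "http"}, {"shop", "http"}}
	list = upsert(list, profile{"docs", "chromedp"})   // replaced at index 0
	list = upsert(list, profile{"blog", "playwright"}) // appended at the end
	fmt.Println(list)
}
```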

func ValidateRenderProfile

func ValidateRenderProfile(p RenderProfile) error

ValidateRenderProfile validates a single render profile.

func ValidateRenderProfilesFile

func ValidateRenderProfilesFile(file RenderProfilesFile) error

ValidateRenderProfilesFile validates an entire render profiles file.

Types

type AdaptiveConfig

type AdaptiveConfig struct {
	Enabled                bool
	MinQPS                 rate.Limit    // floor (e.g., 0.1 = 1 req per 10s)
	MaxQPS                 rate.Limit    // ceiling (initial QPS)
	AdditiveIncrease       rate.Limit    // QPS to add on success (e.g., 0.5)
	MultiplicativeDecrease float64       // factor to multiply on 429 (e.g., 0.5 = halve)
	SuccessThreshold       int           // consecutive successes before increase
	CooldownPeriod         time.Duration // minimum time between adjustments
}

AdaptiveConfig controls the behavior of adaptive rate limiting. When enabled, the limiter dynamically adjusts QPS per host based on server responses (429 status codes, Retry-After headers) and successful request patterns using an additive increase/multiplicative decrease algorithm.
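
To make the additive increase/multiplicative decrease idea concrete, here is a small single-host sketch. The aimd type and its methods are hypothetical; the cooldown period, Retry-After handling and per-host bookkeeping described above are left out to keep the arithmetic visible.

```go
package main

import (
	"fmt"
	"math"
)

// aimd holds the current rate plus the tuning knobs named after the
// AdaptiveConfig fields.
type aimd struct {
	qps, minQPS, maxQPS float64
	addInc, multDec     float64
	successThreshold    int
	successes           int
}

// onSuccess raises the rate by addInc after successThreshold consecutive
// successes, never exceeding maxQPS.
func (a *aimd) onSuccess() {
	a.successes++
	if a.successes < a.successThreshold {
		return
	}
	a.successes = 0
	a.qps = math.Min(a.qps+a.addInc, a.maxQPS)
}

// onThrottled multiplies the rate by multDec (for example after a 429),
// never dropping below minQPS.
func (a *aimd) onThrottled() {
	a.successes = 0
	a.qps = math.Max(a.qps*a.multDec, a.minQPS)
}

func main() {
	a := &aimd{qps: 4, minQPS: 0.5, maxQPS: 4, addInc: 0.5, multDec: 0.5, successThreshold: 2}
	a.onThrottled()
	fmt.Println("after one 429:", a.qps)
	a.onSuccess()
	a.onSuccess()
	fmt.Println("after two successes:", a.qps)
}
```

Halving on failure and adding a small constant on success makes the limiter back off quickly and recover slowly, which is the usual reason for choosing this scheme.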

type AdaptiveFetcher

type AdaptiveFetcher struct {
	// contains filtered or unexported fields
}

func NewAdaptiveFetcher

func NewAdaptiveFetcher(dataDir string) *AdaptiveFetcher

func (*AdaptiveFetcher) Close

func (f *AdaptiveFetcher) Close() error

func (*AdaptiveFetcher) Fetch

func (f *AdaptiveFetcher) Fetch(ctx context.Context, req Request) (Result, error)

func (*AdaptiveFetcher) SetMetricsCallback

func (f *AdaptiveFetcher) SetMetricsCallback(cb MetricsCallback)

SetMetricsCallback sets the callback function for metrics collection.

func (*AdaptiveFetcher) SetProxyPool

func (f *AdaptiveFetcher) SetProxyPool(pool *ProxyPool)

SetProxyPool sets the proxy pool for all underlying fetchers.

type AuthOptions

type AuthOptions struct {
	Basic               string            `json:"basic,omitempty"`
	Headers             map[string]string `json:"headers,omitempty"`
	Cookies             []string          `json:"cookies,omitempty"`
	Query               map[string]string `json:"query,omitempty"`
	LoginURL            string            `json:"loginUrl,omitempty"`
	LoginUserSelector   string            `json:"loginUserSelector,omitempty"`
	LoginPassSelector   string            `json:"loginPassSelector,omitempty"`
	LoginSubmitSelector string            `json:"loginSubmitSelector,omitempty"`
	LoginUser           string            `json:"loginUser,omitempty"`
	LoginPass           string            `json:"loginPass,omitempty"`
	LoginAutoDetect     bool              `json:"loginAutoDetect,omitempty"`
	Proxy               *ProxyConfig      `json:"proxy,omitempty"`
	// ProxyHints provides hints for proxy selection when using the loaded proxy pool.
	ProxyHints *ProxySelectionHints `json:"proxyHints,omitempty"`
	// OAuth2 contains OAuth 2.0 configuration for automatic token management.
	// When set, the fetcher will use OAuth transport with automatic token refresh.
	OAuth2 *OAuth2AuthConfig `json:"oauth2,omitempty"`
}

AuthOptions contains authentication options for fetch operations.

func (*AuthOptions) NormalizeTransport

func (a *AuthOptions) NormalizeTransport()

NormalizeTransport trims proxy-related transport overrides in place.

func (*AuthOptions) ValidateTransport

func (a *AuthOptions) ValidateTransport() error

ValidateTransport rejects ambiguous or malformed proxy overrides.

type BackoffStrategy

type BackoffStrategy int

BackoffStrategy defines the backoff calculation strategy.

const (
	// BackoffStrategyExponential uses exponential backoff: base * 2^attempt
	BackoffStrategyExponential BackoffStrategy = iota
	// BackoffStrategyExponentialJitter adds random jitter to exponential backoff
	BackoffStrategyExponentialJitter
	// BackoffStrategyLinear uses linear backoff: base * (attempt + 1)
	BackoffStrategyLinear
	// BackoffStrategyFixed uses a fixed delay regardless of attempt
	BackoffStrategyFixed
)
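
The formulas in the comments above can be turned into a small calculator. This is a sketch under assumptions: the strategy constants and the backoff function are hypothetical, the fields of RetryConfig are not shown on this page so base and maxDelay are passed directly, and the jitter variant picks uniformly between 1ns and the exponential delay, which is only one of several common jitter schemes.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

type strategy int

const (
	exponential strategy = iota
	exponentialJitter
	linear
	fixed
)

// backoff returns the delay for a zero-based attempt, capped at maxDelay.
func backoff(s strategy, base, maxDelay time.Duration, attempt int) time.Duration {
	if attempt < 0 {
		attempt = 0
	}
	var d time.Duration
	switch s {
	case exponential, exponentialJitter:
		// base * 2^attempt, doubling step by step so the value cannot
		// overflow before the cap is applied.
		d = base
		for i := 0; i < attempt && d < maxDelay; i++ {
			d *= 2
		}
	case linear:
		d = base * time.Duration(attempt+1)
	default:
		d = base
	}
	if d > maxDelay {
		d = maxDelay
	}
	if s == exponentialJitter && d > 0 {
		d = time.Duration(rand.Int63n(int64(d))) + 1 // uniform in [1, d]
	}
	return d
}

func main() {
	base, maxDelay := 100*time.Millisecond, 10*time.Second
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println(attempt,
			backoff(exponential, base, maxDelay, attempt),
			backoff(linear, base, maxDelay, attempt),
			backoff(fixed, base, maxDelay, attempt))
	}
}
```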

func ParseBackoffStrategy

func ParseBackoffStrategy(s string) BackoffStrategy

ParseBackoffStrategy parses a backoff strategy string.

func (BackoffStrategy) String

func (s BackoffStrategy) String() string

String returns the string representation of the backoff strategy.

type BlockedResourceType

type BlockedResourceType string

const (
	BlockedResourceImage      BlockedResourceType = "image"
	BlockedResourceMedia      BlockedResourceType = "media"
	BlockedResourceFont       BlockedResourceType = "font"
	BlockedResourceStylesheet BlockedResourceType = "stylesheet"
	BlockedResourceOther      BlockedResourceType = "other"
)

type ChromedpFetcher

type ChromedpFetcher struct {
	// contains filtered or unexported fields
}

func (*ChromedpFetcher) Fetch

func (f *ChromedpFetcher) Fetch(ctx context.Context, req Request, prof RenderProfile) (Result, error)

func (*ChromedpFetcher) SetProxyPool

func (f *ChromedpFetcher) SetProxyPool(pool *ProxyPool)

SetProxyPool sets the proxy pool for this fetcher.

type CircuitBreaker

type CircuitBreaker struct {
	// contains filtered or unexported fields
}

CircuitBreaker tracks failure state per host and implements the circuit breaker pattern. It is safe for concurrent use by multiple goroutines.

func NewCircuitBreaker

func NewCircuitBreaker(cfg CircuitBreakerConfig) *CircuitBreaker

NewCircuitBreaker creates a new CircuitBreaker with the given configuration.

func (*CircuitBreaker) Allow

func (cb *CircuitBreaker) Allow(host string) bool

Allow checks if a request to the given host should be allowed. Returns true if the request can proceed, false if it should be blocked.

func (*CircuitBreaker) GetConfig

func (cb *CircuitBreaker) GetConfig() CircuitBreakerConfig

GetConfig returns a copy of the circuit breaker configuration.

func (*CircuitBreaker) GetHostStatus

func (cb *CircuitBreaker) GetHostStatus() []CircuitBreakerHostStatus

GetHostStatus returns circuit breaker status for all known hosts.

func (*CircuitBreaker) GetState

func (cb *CircuitBreaker) GetState(host string) CircuitBreakerState

GetState returns the current circuit breaker state for the given host.

func (*CircuitBreaker) IsEnabled

func (cb *CircuitBreaker) IsEnabled() bool

IsEnabled returns true if the circuit breaker is enabled.

func (*CircuitBreaker) RecordFailure

func (cb *CircuitBreaker) RecordFailure(host string)

RecordFailure records a failed request to the given host. This may transition the circuit breaker from closed to open, or half-open to open.

func (*CircuitBreaker) RecordSuccess

func (cb *CircuitBreaker) RecordSuccess(host string)

RecordSuccess records a successful request to the given host. This may transition the circuit breaker from half-open to closed.

func (*CircuitBreaker) Reset

func (cb *CircuitBreaker) Reset(host string)

Reset resets the circuit breaker state for a specific host, or for all hosts if host is empty.

type CircuitBreakerConfig

type CircuitBreakerConfig struct {
	Enabled             bool          // Whether circuit breaker is enabled
	FailureThreshold    int           // Failures before opening circuit (default: 5)
	SuccessThreshold    int           // Successes in half-open to close (default: 3)
	ResetTimeout        time.Duration // Time before attempting half-open (default: 30s)
	HalfOpenMaxRequests int           // Max requests in half-open state (default: 3)
}

CircuitBreakerConfig configures circuit breaker behavior.

func DefaultCircuitBreakerConfig

func DefaultCircuitBreakerConfig() CircuitBreakerConfig

DefaultCircuitBreakerConfig returns a CircuitBreakerConfig with sensible defaults.

type CircuitBreakerHostStatus

type CircuitBreakerHostStatus struct {
	Host             string
	State            string
	FailureCount     int
	SuccessCount     int
	LastFailureTime  time.Time
	HalfOpenRequests int
}

CircuitBreakerHostStatus represents the current state of a circuit breaker for a host.

func (CircuitBreakerHostStatus) String

func (cbs CircuitBreakerHostStatus) String() string

String returns a human-readable description of the circuit breaker state.

type CircuitBreakerState

type CircuitBreakerState int

CircuitBreakerState represents the state of a circuit breaker.

const (
	// StateClosed is the normal operating state where requests are allowed.
	StateClosed CircuitBreakerState = iota
	// StateOpen means the failure threshold was reached; requests are blocked.
	StateOpen
	// StateHalfOpen is a testing state to check if the service has recovered.
	StateHalfOpen
)

func (CircuitBreakerState) String

func (s CircuitBreakerState) String() string

String returns the string representation of the circuit breaker state.
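
The three states and the transitions described for Allow, RecordFailure and RecordSuccess can be sketched for a single host as below. The breaker type is hypothetical and simplified: it takes the current time as a parameter to stay deterministic, it does not lock, it ignores HalfOpenMaxRequests, and resetting the failure count on success in the closed state is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

func (s state) String() string {
	return [...]string{"closed", "open", "half-open"}[s]
}

type breaker struct {
	failureThreshold, successThreshold int
	resetTimeout                       time.Duration
	state                              state
	failures, successes                int
	openedAt                           time.Time
}

// allow reports whether a request may proceed at time now. An open
// breaker moves to half-open once resetTimeout has elapsed.
func (b *breaker) allow(now time.Time) bool {
	if b.state == open && now.Sub(b.openedAt) >= b.resetTimeout {
		b.state, b.successes = halfOpen, 0
	}
	return b.state != open
}

// recordFailure opens the breaker when the threshold is reached, or
// immediately when a half-open probe fails.
func (b *breaker) recordFailure(now time.Time) {
	b.failures++
	if b.state == halfOpen || b.failures >= b.failureThreshold {
		b.state, b.openedAt, b.failures = open, now, 0
	}
}

// recordSuccess closes a half-open breaker after enough successes.
func (b *breaker) recordSuccess() {
	switch b.state {
	case halfOpen:
		b.successes++
		if b.successes >= b.successThreshold {
			b.state, b.failures = closed, 0
		}
	case closed:
		b.failures = 0 // assumption: failures must be consecutive
	}
}

// demo walks one breaker through open, half-open and closed again.
func demo() []string {
	b := &breaker{failureThreshold: 2, successThreshold: 1, resetTimeout: time.Second}
	t0 := time.Unix(0, 0)
	var trace []string
	b.recordFailure(t0)
	b.recordFailure(t0)
	trace = append(trace, b.state.String())
	b.allow(t0.Add(500 * time.Millisecond))
	trace = append(trace, b.state.String())
	b.allow(t0.Add(2 * time.Second))
	trace = append(trace, b.state.String())
	b.recordSuccess()
	trace = append(trace, b.state.String())
	return trace
}

func main() {
	fmt.Println(demo())
}
```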

type DefaultHealthChecker

type DefaultHealthChecker struct {
	TestURL string
	Timeout time.Duration
}

DefaultHealthChecker makes an HTTP request through the proxy to a test endpoint.

func (*DefaultHealthChecker) Check

func (c *DefaultHealthChecker) Check(ctx context.Context, proxy ProxyEntry) (latencyMs int64, err error)

Check performs a health check on the given proxy.

type DetectedForm

type DetectedForm struct {
	FormIndex    int          `json:"formIndex"`           // Index in document (0 = first form)
	FormSelector string       `json:"formSelector"`        // CSS selector to target this form
	Score        float64      `json:"score"`               // Overall confidence score (0.0-1.0)
	FormType     FormType     `json:"formType"`            // Classified type
	UserField    *FieldMatch  `json:"userField"`           // Detected username field (nil if not found)
	PassField    *FieldMatch  `json:"passField"`           // Detected password field (nil if not found)
	SubmitField  *FieldMatch  `json:"submitField"`         // Detected submit button (nil if not found)
	AllFields    []FieldMatch `json:"allFields,omitempty"` // All detected fields in the form
	HTML         string       `json:"html,omitempty"`      // Form HTML snippet (for debugging)
	Action       string       `json:"action,omitempty"`    // Form action URL
	Method       string       `json:"method,omitempty"`    // Form method (GET/POST)
	Name         string       `json:"name,omitempty"`      // Form name attribute
	ID           string       `json:"id,omitempty"`        // Form ID attribute
}

DetectedForm represents a form with detection metadata.

type DetectedFormFields

type DetectedFormFields struct {
	UserField   FieldMatch `json:"userField"`   // Detected username/email field
	PassField   FieldMatch `json:"passField"`   // Detected password field
	SubmitField FieldMatch `json:"submitField"` // Detected submit button
	FormType    FormType   `json:"formType"`    // Classified type of form
}

DetectedFormFields captures the fields detected within a form.

type DetectionWeights

type DetectionWeights struct {
	PasswordTypeWeight   float64 // input[type=password] - strongest signal
	EmailTypeWeight      float64 // input[type=email]
	AutocompleteUsername float64 // autocomplete="username"
	AutocompletePassword float64 // autocomplete="current-password"
	NamePatternUsername  float64 // name matches user/login/email patterns
	NamePatternPassword  float64 // name matches pass/pwd patterns
	IDPatternUsername    float64 // id matches user/login/email patterns
	SubmitButtonType     float64 // button[type=submit] or input[type=submit]
	SubmitButtonText     float64 // button text contains "login", "sign in", etc.
}

DetectionWeights configures the scoring weights for form detection heuristics. Higher weights indicate stronger signals.

func DefaultDetectionWeights

func DefaultDetectionWeights() DetectionWeights

DefaultDetectionWeights returns sensible default weights for form detection.
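
How weighted signals combine into a field score can be sketched for the password case. Everything in this block is hypothetical: the input and weights types, the passwordScore function, the weight values 0.5/0.25/0.25, and the choice to sum matching signals and clamp at 1.0. The package's actual defaults and combination rule are not shown on this page.

```go
package main

import (
	"fmt"
	"strings"
)

// input carries the attributes of one <input> element.
type input struct {
	Type, Name, Autocomplete string
}

// weights mirrors three of the DetectionWeights fields.
type weights struct {
	PasswordType, AutocompletePassword, NamePatternPassword float64
}

// passwordScore adds the weight of every matching signal and clamps the
// sum to 1.0.
func passwordScore(in input, w weights) float64 {
	score := 0.0
	if strings.EqualFold(in.Type, "password") {
		score += w.PasswordType
	}
	if strings.EqualFold(in.Autocomplete, "current-password") {
		score += w.AutocompletePassword
	}
	name := strings.ToLower(in.Name)
	if strings.Contains(name, "pass") || strings.Contains(name, "pwd") {
		score += w.NamePatternPassword
	}
	if score > 1 {
		score = 1
	}
	return score
}

func main() {
	w := weights{PasswordType: 0.5, AutocompletePassword: 0.25, NamePatternPassword: 0.25}
	fmt.Println(passwordScore(input{Type: "password", Name: "user_pwd"}, w))
	fmt.Println(passwordScore(input{Type: "text", Name: "q"}, w))
}
```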

type DeviceCategory

type DeviceCategory string

DeviceCategory classifies devices by form factor.

const (
	DeviceCategoryMobile  DeviceCategory = "mobile"
	DeviceCategoryTablet  DeviceCategory = "tablet"
	DeviceCategoryDesktop DeviceCategory = "desktop"
)

func GetDeviceCategories

func GetDeviceCategories() []DeviceCategory

GetDeviceCategories returns all available device categories.

type DeviceEmulation

type DeviceEmulation struct {
	Name              string         `json:"name"`              // Device preset name (e.g., "iPhone 14", "Pixel 7")
	ViewportWidth     int            `json:"viewportWidth"`     // Viewport width in pixels
	ViewportHeight    int            `json:"viewportHeight"`    // Viewport height in pixels
	DeviceScaleFactor float64        `json:"deviceScaleFactor"` // Device pixel ratio (e.g., 2.0 for Retina)
	UserAgent         string         `json:"userAgent"`         // User agent string for the device
	IsMobile          bool           `json:"isMobile"`          // Whether to emulate mobile viewport
	HasTouch          bool           `json:"hasTouch"`          // Whether the device has touch capability
	Category          DeviceCategory `json:"category"`          // Device category (mobile, tablet, desktop)
	Orientation       Orientation    `json:"orientation"`       // Default orientation (portrait/landscape)
}

DeviceEmulation defines device emulation settings for mobile/responsive content. Used by headless fetchers to emulate specific devices.

func GetDevicePreset

func GetDevicePreset(name string) *DeviceEmulation

GetDevicePreset returns a device emulation preset by name. Returns nil if the preset name is not recognized.

func GetDevicePresetsByCategory

func GetDevicePresetsByCategory(cat DeviceCategory) []DeviceEmulation

GetDevicePresetsByCategory returns all device presets matching the given category.

func (*DeviceEmulation) ApplyOrientation

func (d *DeviceEmulation) ApplyOrientation(orientation Orientation) *DeviceEmulation

ApplyOrientation applies the specified orientation to a device emulation. For landscape orientation on mobile/tablet devices, it swaps width and height.
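
The width/height swap can be illustrated on a reduced device type. The device struct and withOrientation function are hypothetical; returning a copy and swapping only when the viewport is currently taller than wide are assumptions about details the description leaves open.

```go
package main

import "fmt"

// device is a reduced stand-in for DeviceEmulation.
type device struct {
	Width, Height int
	Category      string // "mobile", "tablet" or "desktop"
	Orientation   string // "portrait" or "landscape"
}

// withOrientation returns a copy of d with the requested orientation.
// Handheld devices switching to landscape get width and height swapped.
func withOrientation(d device, orientation string) device {
	out := d // value copy, so the preset itself is not modified
	out.Orientation = orientation
	handheld := d.Category == "mobile" || d.Category == "tablet"
	if orientation == "landscape" && handheld && out.Width < out.Height {
		out.Width, out.Height = out.Height, out.Width
	}
	return out
}

func main() {
	phone := device{Width: 390, Height: 844, Category: "mobile", Orientation: "portrait"}
	fmt.Println(withOrientation(phone, "landscape"))
	fmt.Println(phone)
}
```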

type Fetcher

type Fetcher interface {
	Fetch(ctx context.Context, req Request) (Result, error)
}

func NewFetcher

func NewFetcher(dataDir string) Fetcher

func NewFetcherWithProxyPool

func NewFetcherWithProxyPool(dataDir string, pool *ProxyPool) Fetcher

NewFetcherWithProxyPool creates a new fetcher with proxy pool support.

type FetcherWithMetrics

type FetcherWithMetrics interface {
	Fetcher
	SetMetricsCallback(cb MetricsCallback)
}

FetcherWithMetrics is a fetcher that supports metrics callbacks.

func NewFetcherWithMetrics

func NewFetcherWithMetrics(dataDir string, callback MetricsCallback) FetcherWithMetrics

NewFetcherWithMetrics creates a new fetcher with metrics callback support.

func NewFetcherWithMetricsAndProxyPool

func NewFetcherWithMetricsAndProxyPool(dataDir string, callback MetricsCallback, pool *ProxyPool) FetcherWithMetrics

NewFetcherWithMetricsAndProxyPool creates a new fetcher with both metrics and proxy pool support.

type FieldMatch

type FieldMatch struct {
	Selector     string    `json:"selector"`               // CSS selector to target this field
	Attribute    string    `json:"attribute"`              // Which attribute matched (type, name, id, etc.)
	MatchValue   string    `json:"matchValue"`             // The value that matched
	Confidence   float64   `json:"confidence"`             // Individual field confidence (0.0-1.0)
	MatchReasons []string  `json:"matchReasons,omitempty"` // Why this field was selected
	FieldType    FieldType `json:"fieldType,omitempty"`    // Semantic field type classification
	FieldName    string    `json:"fieldName,omitempty"`    // Human-readable field name (e.g., "email", "firstName")
	Required     bool      `json:"required,omitempty"`     // Whether the field is required
	Placeholder  string    `json:"placeholder,omitempty"`  // Placeholder text if available
}

FieldMatch represents a detected form field with metadata about how it was identified.

type FieldType

type FieldType string

FieldType classifies form fields by their semantic purpose.

const (
	FieldTypeText     FieldType = "text"
	FieldTypeEmail    FieldType = "email"
	FieldTypePassword FieldType = "password"
	FieldTypePhone    FieldType = "phone"
	FieldTypeAddress  FieldType = "address"
	FieldTypeSearch   FieldType = "search"
	FieldTypeURL      FieldType = "url"
	FieldTypeNumber   FieldType = "number"
	FieldTypeDate     FieldType = "date"
	FieldTypeSelect   FieldType = "select"
	FieldTypeTextarea FieldType = "textarea"
	FieldTypeCheckbox FieldType = "checkbox"
	FieldTypeRadio    FieldType = "radio"
	FieldTypeSubmit   FieldType = "submit"
	FieldTypeHidden   FieldType = "hidden"
	FieldTypeFile     FieldType = "file"
	FieldTypeUnknown  FieldType = "unknown"
)

type FormDetectRequest

type FormDetectRequest struct {
	URL      string `json:"url"`
	FormType string `json:"formType,omitempty"`
	Headless bool   `json:"headless"`
}

FormDetectRequest represents a request to detect forms on a page.

type FormDetectResponse

type FormDetectResponse struct {
	URL           string         `json:"url"`
	Forms         []DetectedForm `json:"forms"`
	FormCount     int            `json:"formCount"`
	DetectedTypes []string       `json:"detectedTypes"`
}

FormDetectResponse represents the response from form detection.

func (FormDetectResponse) MarshalJSON

func (r FormDetectResponse) MarshalJSON() ([]byte, error)

MarshalJSON implements custom JSON marshaling for FormDetectResponse.

type FormDetector

type FormDetector struct {
	Weights DetectionWeights
}

FormDetector analyzes HTML to find and classify forms, including login forms.

func NewFormDetector

func NewFormDetector() *FormDetector

NewFormDetector creates a new form detector with default weights.

func NewFormDetectorWithWeights

func NewFormDetectorWithWeights(weights DetectionWeights) *FormDetector

NewFormDetectorWithWeights creates a form detector with custom weights.

func (*FormDetector) DetectAllForms

func (d *FormDetector) DetectAllForms(html string) ([]DetectedForm, error)

DetectAllForms analyzes HTML and returns all detected forms with full field classification. This is the general-purpose form detection that supports all form types.

func (*FormDetector) DetectFormFields

func (d *FormDetector) DetectFormFields(html string, formSelector string) ([]FieldMatch, error)

DetectFormFields extracts all fields from a specific form.

func (*FormDetector) DetectForms

func (d *FormDetector) DetectForms(html string) ([]DetectedForm, error)

DetectForms analyzes HTML and returns detected forms sorted by confidence (highest first).
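The ordering guarantee (highest confidence first) can be sketched as follows. `detectedForm` is a hypothetical, trimmed stand-in for DetectedForm, whose fields are not listed in this section. Using a stable sort for ties is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// detectedForm is a hypothetical, trimmed stand-in for DetectedForm.
type detectedForm struct {
	Selector   string
	Confidence float64
}

// sortByConfidence orders forms highest-confidence first and keeps the
// original document order for forms with equal confidence.
func sortByConfidence(forms []detectedForm) {
	sort.SliceStable(forms, func(i, j int) bool {
		return forms[i].Confidence > forms[j].Confidence
	})
}

func main() {
	forms := []detectedForm{
		{Selector: "#newsletter", Confidence: 0.35},
		{Selector: "#login", Confidence: 0.92},
		{Selector: "#search", Confidence: 0.35},
	}
	sortByConfidence(forms)
	for _, f := range forms {
		fmt.Println(f.Selector, f.Confidence)
	}
}
```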

func (*FormDetector) DetectFormsByType

func (d *FormDetector) DetectFormsByType(html string, formType FormType) ([]DetectedForm, error)

DetectFormsByType analyzes HTML and returns forms of a specific type.

func (*FormDetector) DetectLoginForm

func (d *FormDetector) DetectLoginForm(html string) (*DetectedForm, error)

DetectLoginForm is a convenience method that returns the highest-confidence login form. Returns nil if no suitable login form is detected.

type FormFillRequest

type FormFillRequest struct {
	URL            string            `json:"url"`                      // URL of the page containing the form
	FormSelector   string            `json:"formSelector,omitempty"`   // CSS selector for the form (auto-detect if empty)
	Fields         map[string]string `json:"fields"`                   // field name/selector -> value
	Submit         bool              `json:"submit"`                   // Whether to submit the form
	WaitFor        string            `json:"waitFor,omitempty"`        // Selector to wait for after submit
	Timeout        time.Duration     `json:"timeout,omitempty"`        // Operation timeout
	Headless       bool              `json:"headless"`                 // Use headless mode
	DetectOnly     bool              `json:"detectOnly,omitempty"`     // Only detect forms, don't fill
	FormTypeFilter string            `json:"formTypeFilter,omitempty"` // Filter by form type (e.g., "contact", "search")
}

FormFillRequest represents a form fill operation.

type FormFillResult

type FormFillResult struct {
	Success       bool           `json:"success"`
	FormSelector  string         `json:"formSelector"`
	FormType      FormType       `json:"formType,omitempty"`
	FilledFields  []string       `json:"filledFields"`
	Errors        []string       `json:"errors,omitempty"`
	PageURL       string         `json:"pageUrl,omitempty"`
	PageHTML      string         `json:"pageHtml,omitempty"`
	DetectedForms []DetectedForm `json:"detectedForms,omitempty"`
}

FormFillResult represents the result of a form fill operation.

func (FormFillResult) MarshalJSON

func (r FormFillResult) MarshalJSON() ([]byte, error)

MarshalJSON implements custom JSON marshaling for FormFillResult.

type FormFiller

type FormFiller struct {
	// contains filtered or unexported fields
}

FormFiller handles automated form filling and submission.

func NewFormFiller

func NewFormFiller(fetcher *ChromedpFetcher) *FormFiller

NewFormFiller creates a new form filler using the provided chromedp fetcher.

func (*FormFiller) Detect

Detect forms on a page and return detailed information.

func (*FormFiller) DetectForms

func (f *FormFiller) DetectForms(ctx context.Context, url string, formTypeFilter string) (*FormFillResult, error)

DetectForms detects all forms on a page and returns their details.

func (*FormFiller) FillForm

func (f *FormFiller) FillForm(ctx context.Context, req FormFillRequest) (*FormFillResult, error)

FillForm fills and optionally submits a form.

type FormType

type FormType string

FormType classifies detected forms by their likely purpose.

const (
	FormTypeLogin         FormType = "login"
	FormTypeRegister      FormType = "register"
	FormTypePasswordReset FormType = "password_reset"
	FormTypeSearch        FormType = "search"
	FormTypeContact       FormType = "contact"
	FormTypeNewsletter    FormType = "newsletter"
	FormTypeCheckout      FormType = "checkout"
	FormTypeSurvey        FormType = "survey"
	FormTypeUnknown       FormType = "unknown"
)

type HTTPFetcher

type HTTPFetcher struct {
	// contains filtered or unexported fields
}

HTTPFetcher implements content fetching using the standard library http.Client. Provides retry logic, rate limiting, authentication, conditional requests, and response size limits. See fetcher.go for the Fetcher interface definition.

func (*HTTPFetcher) Fetch

func (f *HTTPFetcher) Fetch(ctx context.Context, req Request) (Result, error)

Fetch performs a standard HTTP GET request to retrieve the content of a URL. It supports retries, rate limiting, and basic/token authentication.

func (*HTTPFetcher) SetProxyPool

func (f *HTTPFetcher) SetProxyPool(pool *ProxyPool)

SetProxyPool sets the proxy pool for this fetcher.

type HealthCheckConfig

type HealthCheckConfig struct {
	Enabled              bool   `json:"enabled"`
	IntervalSeconds      int    `json:"interval_seconds"`
	TimeoutSeconds       int    `json:"timeout_seconds"`
	MaxConsecutiveFails  int    `json:"max_consecutive_fails"`
	RecoveryAfterSeconds int    `json:"recovery_after_seconds"`
	TestURL              string `json:"test_url,omitempty"`
}

HealthCheckConfig configures proxy health checking.

func DefaultHealthCheckConfig

func DefaultHealthCheckConfig() HealthCheckConfig

DefaultHealthCheckConfig returns sensible defaults for health checking.

type HealthChecker

type HealthChecker interface {
	Check(ctx context.Context, proxy ProxyEntry) (latencyMs int64, err error)
}

HealthChecker defines the interface for proxy health checking.

type HostLimiter

type HostLimiter struct {
	// contains filtered or unexported fields
}

HostLimiter manages per-host rate limiters.

func NewAdaptiveHostLimiter

func NewAdaptiveHostLimiter(qps int, burst int, cfg *AdaptiveConfig) *HostLimiter

NewAdaptiveHostLimiter creates a HostLimiter with adaptive rate limiting enabled. The limiter will dynamically adjust QPS per host based on server responses.

func NewAdaptiveHostLimiterWithCircuitBreaker

func NewAdaptiveHostLimiterWithCircuitBreaker(qps int, burst int, adaptiveCfg *AdaptiveConfig, cb *CircuitBreaker) *HostLimiter

NewAdaptiveHostLimiterWithCircuitBreaker creates a HostLimiter with both adaptive rate limiting and circuit breaker enabled.

func NewHostLimiter

func NewHostLimiter(qps int, burst int) *HostLimiter

func NewHostLimiterWithCircuitBreaker

func NewHostLimiterWithCircuitBreaker(qps int, burst int, cb *CircuitBreaker) *HostLimiter

NewHostLimiterWithCircuitBreaker creates a HostLimiter with circuit breaker enabled. The circuit breaker will isolate failing hosts to prevent cascading failures.

func (*HostLimiter) GetAdaptiveConfig

func (h *HostLimiter) GetAdaptiveConfig() *AdaptiveConfig

GetAdaptiveConfig returns a copy of the adaptive configuration, or nil if not enabled.

func (*HostLimiter) GetBurst

func (h *HostLimiter) GetBurst() int

GetBurst returns the configured burst.

func (*HostLimiter) GetCircuitBreaker

func (h *HostLimiter) GetCircuitBreaker() *CircuitBreaker

GetCircuitBreaker returns the circuit breaker instance, or nil if not enabled.

func (*HostLimiter) GetHostStatus

func (h *HostLimiter) GetHostStatus() []HostStatus

GetHostStatus returns rate limit status for all known hosts.

func (*HostLimiter) GetLimiter

func (h *HostLimiter) GetLimiter(host string) *rate.Limiter

GetLimiter returns the rate limiter for a specific host (for metrics registration).

func (*HostLimiter) GetQPS

func (h *HostLimiter) GetQPS() float64

GetQPS returns the configured QPS.

func (*HostLimiter) IsAdaptiveEnabled

func (h *HostLimiter) IsAdaptiveEnabled() bool

IsAdaptiveEnabled returns true if adaptive rate limiting is enabled.

func (*HostLimiter) IsCircuitBreakerEnabled

func (h *HostLimiter) IsCircuitBreakerEnabled() bool

IsCircuitBreakerEnabled returns true if circuit breaker is enabled.

func (*HostLimiter) RecordRateLimit

func (h *HostLimiter) RecordRateLimit(host string, retryAfter time.Duration)

RecordRateLimit reports a 429 response for the given host with an optional Retry-After duration. When adaptive rate limiting is enabled, this will decrease the QPS for the host and optionally set a cooldown period.

func (*HostLimiter) RecordResult

func (h *HostLimiter) RecordResult(host string, err error, status int)

RecordResult records the result of a request for both adaptive rate limiting and circuit breaker tracking.

func (*HostLimiter) RecordSuccess

func (h *HostLimiter) RecordSuccess(host string)

RecordSuccess reports a successful request (2xx/3xx status) for the given host. When adaptive rate limiting is enabled, this may increase the QPS for the host after a threshold of consecutive successes is reached.
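RecordRateLimit and RecordSuccess together describe an additive-increase, multiplicative-decrease loop. The sketch below shows one way such a loop might look. Every field name and tuning value is an assumption, since the AdaptiveConfig fields are not shown in this section.

```go
package main

import (
	"fmt"
	"math"
)

// aimd holds hypothetical tuning knobs for one host. None of these names
// come from the package; they only illustrate the technique.
type aimd struct {
	MinQPS, MaxQPS   float64
	Increase         float64 // additive step applied after enough successes
	DecreaseFactor   float64 // multiplicative factor applied on a 429
	SuccessThreshold int     // consecutive successes needed before increasing

	current   float64
	successes int
}

// onSuccess counts a success and raises the rate by a fixed step once the
// threshold of consecutive successes is reached.
func (a *aimd) onSuccess() {
	a.successes++
	if a.successes < a.SuccessThreshold {
		return
	}
	a.successes = 0
	a.current = math.Min(a.current+a.Increase, a.MaxQPS)
}

// onRateLimit resets the success streak and scales the rate down.
func (a *aimd) onRateLimit() {
	a.successes = 0
	a.current = math.Max(a.current*a.DecreaseFactor, a.MinQPS)
}

func main() {
	a := &aimd{MinQPS: 1, MaxQPS: 5, Increase: 1, DecreaseFactor: 0.5, SuccessThreshold: 2, current: 4}
	a.onSuccess()
	a.onSuccess()
	fmt.Println("after two successes:", a.current)
	a.onRateLimit()
	fmt.Println("after a 429:", a.current)
}
```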

func (*HostLimiter) UpdateRateLimitInfo

func (h *HostLimiter) UpdateRateLimitInfo(host string, info RateLimitInfo)

UpdateRateLimitInfo updates the limiter based on server-provided rate limit headers. This allows the limiter to respect server-provided rate limits instead of relying solely on adaptive AIMD behavior.

Behavior:

  • If info.Limit > 0, adjusts currentQPS to respect the server's limit
  • If info.Remaining is low (< 10%), enters cooldown until reset time
  • If info.Reset is in the future and remaining is low, respects that reset time

func (*HostLimiter) Wait

func (h *HostLimiter) Wait(ctx context.Context, rawURL string) error

Wait waits for the rate limiter for the given URL using global default rates. For per-host rate configuration, use WaitWithRates instead.

func (*HostLimiter) WaitWithRates

func (h *HostLimiter) WaitWithRates(ctx context.Context, rawURL string, profileQPS int, profileBurst int) error

WaitWithRates waits for the rate limiter for the given URL with optional per-host rates. If profileQPS or profileBurst are 0, the global defaults are used. This method also checks the circuit breaker before allowing the request.

type HostStatus

type HostStatus struct {
	Host        string
	QPS         float64
	Burst       int
	LastRequest time.Time
	// Adaptive rate limiting fields
	CurrentQPS           float64 // actual QPS after adaptation
	AdaptiveEnabled      bool
	ConsecutiveSuccesses int
	Consecutive429s      int
	InCooldown           bool
	CooldownUntil        time.Time
	// Circuit breaker fields
	CircuitBreakerState    string    // closed, open, half-open
	CircuitBreakerFailures int       // Current failure count
	CircuitBreakerLastFail time.Time // Last failure timestamp
}

HostStatus represents the current rate limit state for a single host.

type InterceptedEntry

type InterceptedEntry struct {
	Request  InterceptedRequest   `json:"request"`
	Response *InterceptedResponse `json:"response,omitempty"` // nil if response not received
	Duration time.Duration        `json:"duration"`           // Time between request and response
}

InterceptedEntry combines a request/response pair with timing data.

type InterceptedRequest

type InterceptedRequest struct {
	RequestID    string                  `json:"requestId"`    // Unique identifier
	URL          string                  `json:"url"`          // Request URL
	Method       string                  `json:"method"`       // HTTP method
	Headers      map[string]string       `json:"headers"`      // Request headers
	Body         string                  `json:"body"`         // Request body (base64 if binary)
	BodySize     int64                   `json:"bodySize"`     // Original body size
	Timestamp    time.Time               `json:"timestamp"`    // When request was sent
	ResourceType InterceptedResourceType `json:"resourceType"` // Type of resource
}

InterceptedRequest represents a captured network request.

type InterceptedResourceType

type InterceptedResourceType string

InterceptedResourceType represents the type of network resource.

const (
	ResourceTypeXHR        InterceptedResourceType = "xhr"
	ResourceTypeFetch      InterceptedResourceType = "fetch"
	ResourceTypeDocument   InterceptedResourceType = "document"
	ResourceTypeScript     InterceptedResourceType = "script"
	ResourceTypeStylesheet InterceptedResourceType = "stylesheet"
	ResourceTypeImage      InterceptedResourceType = "image"
	ResourceTypeMedia      InterceptedResourceType = "media"
	ResourceTypeFont       InterceptedResourceType = "font"
	ResourceTypeWebSocket  InterceptedResourceType = "websocket"
	ResourceTypeOther      InterceptedResourceType = "other"
)

type InterceptedResponse

type InterceptedResponse struct {
	RequestID  string            `json:"requestId"`  // Matches request
	Status     int               `json:"status"`     // HTTP status code
	StatusText string            `json:"statusText"` // HTTP status text
	Headers    map[string]string `json:"headers"`    // Response headers
	Body       string            `json:"body"`       // Response body (base64 if binary)
	BodySize   int64             `json:"bodySize"`   // Size of response body
	Timestamp  time.Time         `json:"timestamp"`  // When response received
}

InterceptedResponse represents a captured network response.

type JSHeaviness

type JSHeaviness struct {
	Score   float64  `json:"score"`
	Reasons []string `json:"reasons"`

	ScriptTagCount   int `json:"scriptTagCount"`
	BodyTextLength   int `json:"bodyTextLength"`
	RootDivSignals   int `json:"rootDivSignals"`
	FrameworkSignals int `json:"frameworkSignals"`
}

func DetectJSHeaviness

func DetectJSHeaviness(html string) JSHeaviness

DetectJSHeaviness analyzes HTML content to determine if it requires JavaScript to render meaningful content.

type MetricsCallback

type MetricsCallback func(duration time.Duration, success bool, fetcherType, url string)

MetricsCallback is the function signature for metrics collection callbacks.
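A callback with this signature is typically invoked once per fetch with the elapsed time and outcome. The sketch below shows one plausible wiring. The local `metricsCallback` type copies the documented signature, while `timed`, `countingCallback`, and `errBoom` are hypothetical helpers introduced only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// metricsCallback copies the documented MetricsCallback signature.
type metricsCallback func(duration time.Duration, success bool, fetcherType, url string)

// errBoom is a placeholder failure used by the demo.
var errBoom = errors.New("fetch failed")

// timed runs fn, measures how long it took, and reports the outcome to cb.
// A nil callback is simply skipped.
func timed(cb metricsCallback, fetcherType, url string, fn func() error) error {
	start := time.Now()
	err := fn()
	if cb != nil {
		cb(time.Since(start), err == nil, fetcherType, url)
	}
	return err
}

// countingCallback returns a callback that tallies total and failed calls.
func countingCallback(total, failed *int) metricsCallback {
	return func(_ time.Duration, success bool, _, _ string) {
		*total++
		if !success {
			*failed++
		}
	}
}

func main() {
	var total, failed int
	cb := countingCallback(&total, &failed)
	_ = timed(cb, "http", "https://example.com", func() error { return nil })
	_ = timed(cb, "http", "https://example.com/missing", func() error { return errBoom })
	fmt.Printf("total=%d failed=%d\n", total, failed)
}
```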

type NetworkInterceptConfig

type NetworkInterceptConfig struct {
	Enabled             bool                      `json:"enabled"`             // Toggle interception
	URLPatterns         []string                  `json:"urlPatterns"`         // Glob patterns for URLs to intercept (e.g., "**/api/**", "*.json")
	ResourceTypes       []InterceptedResourceType `json:"resourceTypes"`       // Resource types to capture
	CaptureRequestBody  bool                      `json:"captureRequestBody"`  // Whether to capture request bodies
	CaptureResponseBody bool                      `json:"captureResponseBody"` // Whether to capture response bodies
	MaxBodySize         int64                     `json:"maxBodySize"`         // Max bytes to capture per body (default 1MB)
	MaxEntries          int                       `json:"maxEntries"`          // Max number of entries to capture (default 1000)
}

NetworkInterceptConfig defines configuration for network request/response interception. Used to capture XHR/Fetch API traffic from SPAs for API scraping.

func DefaultNetworkInterceptConfig

func DefaultNetworkInterceptConfig() NetworkInterceptConfig

DefaultNetworkInterceptConfig returns a default configuration with sensible limits.

type OAuth2AuthConfig

type OAuth2AuthConfig struct {
	// ProfileName is the name of the auth profile with OAuth2 configuration
	ProfileName string `json:"profileName,omitempty"`
	// AccessToken is a static access token (optional - if not set, will be loaded from store)
	AccessToken string `json:"accessToken,omitempty"`
	// TokenType is the token type (e.g., "Bearer"). Defaults to "Bearer" if not set.
	TokenType string `json:"tokenType,omitempty"`
}

OAuth2AuthConfig defines OAuth 2.0 authentication configuration for fetch operations.

type Orientation

type Orientation string

Orientation represents the device screen orientation.

const (
	OrientationPortrait  Orientation = "portrait"
	OrientationLandscape Orientation = "landscape"
)

type PlaywrightFetcher

type PlaywrightFetcher struct {
	// contains filtered or unexported fields
}

func (*PlaywrightFetcher) Close

func (f *PlaywrightFetcher) Close() error

func (*PlaywrightFetcher) Fetch

func (f *PlaywrightFetcher) Fetch(ctx context.Context, req Request, prof RenderProfile) (Result, error)

func (*PlaywrightFetcher) SetProxyPool

func (f *PlaywrightFetcher) SetProxyPool(pool *ProxyPool)

SetProxyPool sets the proxy pool for this fetcher.

type ProxyConfig

type ProxyConfig struct {
	URL      string `json:"url,omitempty"`      // Proxy URL (http://, https://, socks5://)
	Username string `json:"username,omitempty"` // Username for proxy authentication
	Password string `json:"password,omitempty"` // Password for proxy authentication
}

ProxyConfig defines proxy settings for fetch operations.

type ProxyEntry

type ProxyEntry struct {
	ID          string   `json:"id"`
	URL         string   `json:"url"`
	Username    string   `json:"username,omitempty"`
	Password    string   `json:"password,omitempty"`
	Region      string   `json:"region,omitempty"`
	Tags        []string `json:"tags,omitempty"`
	Weight      int      `json:"weight,omitempty"`
	MaxRequests int      `json:"max_requests,omitempty"`
}

ProxyEntry represents a single proxy in the pool.

func (ProxyEntry) ToProxyConfig

func (e ProxyEntry) ToProxyConfig() ProxyConfig

ToProxyConfig converts ProxyEntry to ProxyConfig for use with existing fetchers.

type ProxyPool

type ProxyPool struct {
	// contains filtered or unexported fields
}

ProxyPool manages a collection of proxies with rotation.

func LoadProxyPoolFromFile

func LoadProxyPoolFromFile(path string) (*ProxyPool, error)

LoadProxyPoolFromFile loads a proxy pool from a JSON configuration file.

func NewProxyPool

func NewProxyPool(config ProxyPoolConfig) (*ProxyPool, error)

NewProxyPool creates a new proxy pool from configuration.

func ProxyPoolFromConfig

func ProxyPoolFromConfig(path string, explicit bool) (*ProxyPool, error)

ProxyPoolFromConfig creates a proxy pool from configured startup settings. Missing files are silent only when callers mark the path as non-required.

func (*ProxyPool) GetEntries

func (p *ProxyPool) GetEntries() []ProxyEntry

GetEntries returns a copy of all proxy entries.

func (*ProxyPool) GetHealthyProxyCount

func (p *ProxyPool) GetHealthyProxyCount() int

GetHealthyProxyCount returns the number of healthy proxies.

func (*ProxyPool) GetProxyStats

func (p *ProxyPool) GetProxyStats(proxyID string) (ProxyStats, bool)

GetProxyStats returns stats for a specific proxy.

func (*ProxyPool) GetStats

func (p *ProxyPool) GetStats() map[string]ProxyStats

GetStats returns current stats for all proxies.

func (*ProxyPool) GetStrategy

func (p *ProxyPool) GetStrategy() RotationStrategy

GetStrategy returns the current rotation strategy.

func (*ProxyPool) GetTotalProxyCount

func (p *ProxyPool) GetTotalProxyCount() int

GetTotalProxyCount returns the total number of proxies.

func (*ProxyPool) RecordFailure

func (p *ProxyPool) RecordFailure(proxyID string, err error)

RecordFailure updates stats for a failed proxy request.

func (*ProxyPool) RecordSuccess

func (p *ProxyPool) RecordSuccess(proxyID string, latencyMs int64)

RecordSuccess updates stats for a successful proxy request.

func (*ProxyPool) Select

func (p *ProxyPool) Select(hints ProxySelectionHints) (ProxyEntry, error)

Select returns a proxy based on the configured rotation strategy. Returns an error if no healthy proxies are available.
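As an illustration of the simplest strategy, the sketch below implements round-robin selection that skips unhealthy or excluded proxies. The `pool` and `proxy` types and the error value are hypothetical. The real ProxyPool internals are unexported and may differ.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// proxy is a hypothetical, trimmed stand-in for ProxyEntry plus health state.
type proxy struct {
	ID      string
	Healthy bool
}

// pool is a hypothetical round-robin pool, not the package's ProxyPool.
type pool struct {
	mu      sync.Mutex
	proxies []proxy
	next    int // index where the next scan starts
}

var errNoHealthyProxy = errors.New("no healthy proxies available")

// selectRoundRobin scans at most one full cycle starting at p.next and
// returns the first healthy proxy whose ID is not excluded.
func (p *pool) selectRoundRobin(exclude map[string]bool) (proxy, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i := 0; i < len(p.proxies); i++ {
		idx := (p.next + i) % len(p.proxies)
		cand := p.proxies[idx]
		if cand.Healthy && !exclude[cand.ID] {
			p.next = (idx + 1) % len(p.proxies)
			return cand, nil
		}
	}
	return proxy{}, errNoHealthyProxy
}

func main() {
	p := &pool{proxies: []proxy{{ID: "a", Healthy: true}, {ID: "b"}, {ID: "c", Healthy: true}}}
	for i := 0; i < 3; i++ {
		got, err := p.selectRoundRobin(nil)
		fmt.Println(got.ID, err)
	}
}
```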

func (*ProxyPool) SetStrategy

func (p *ProxyPool) SetStrategy(strategy RotationStrategy)

SetStrategy changes the rotation strategy.

func (*ProxyPool) Stop

func (p *ProxyPool) Stop()

Stop stops the proxy pool and its background health checks.

type ProxyPoolConfig

type ProxyPoolConfig struct {
	DefaultStrategy string            `json:"default_strategy"`
	HealthCheck     HealthCheckConfig `json:"health_check,omitempty"`
	Proxies         []ProxyEntry      `json:"proxies"`
}

ProxyPoolConfig is the configuration file format for proxy pools.

type ProxySelectionHints

type ProxySelectionHints struct {
	PreferredRegion string   `json:"preferred_region,omitempty"`
	RequiredTags    []string `json:"required_tags,omitempty"`
	ExcludeProxyIDs []string `json:"exclude_proxy_ids,omitempty"`
}

ProxySelectionHints provides hints for proxy selection.

func NormalizeProxySelectionHints

func NormalizeProxySelectionHints(hints *ProxySelectionHints) *ProxySelectionHints

NormalizeProxySelectionHints trims and deduplicates proxy selection hints.
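Trimming and deduplicating a list of tags or IDs can be sketched as below. The `normalizeStrings` helper is hypothetical, and dropping entries that are empty after trimming is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeStrings trims whitespace, drops empty entries (assumed), and
// removes duplicates while keeping the first occurrence of each value.
func normalizeStrings(values []string) []string {
	seen := make(map[string]struct{}, len(values))
	var out []string
	for _, v := range values {
		v = strings.TrimSpace(v)
		if v == "" {
			continue
		}
		if _, dup := seen[v]; dup {
			continue
		}
		seen[v] = struct{}{}
		out = append(out, v)
	}
	return out
}

func main() {
	fmt.Println(normalizeStrings([]string{" residential ", "residential", "", "eu"}))
}
```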

type ProxyStats

type ProxyStats struct {
	RequestCount     uint64    `json:"request_count"`
	SuccessCount     uint64    `json:"success_count"`
	FailureCount     uint64    `json:"failure_count"`
	LastUsed         time.Time `json:"last_used"`
	LastFailed       time.Time `json:"last_failed,omitempty"`
	AvgLatencyMs     int64     `json:"avg_latency_ms"`
	ConsecutiveFails int       `json:"consecutive_fails"`
	IsHealthy        bool      `json:"is_healthy"`
}

ProxyStats tracks usage and health for a proxy.

func (ProxyStats) SuccessRate

func (s ProxyStats) SuccessRate() float64

SuccessRate returns the success rate as a percentage (0-100).
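A minimal sketch of the percentage calculation, using a local `proxyStats` type with only the two counters involved. Returning 0 when no requests have been recorded is an assumption made to avoid dividing by zero.

```go
package main

import "fmt"

// proxyStats is a trimmed, hypothetical copy of the documented ProxyStats.
type proxyStats struct {
	RequestCount uint64
	SuccessCount uint64
}

// successRate returns successes as a percentage of requests in [0, 100].
// With no recorded requests it returns 0 (an assumption for this sketch).
func (s proxyStats) successRate() float64 {
	if s.RequestCount == 0 {
		return 0
	}
	return float64(s.SuccessCount) / float64(s.RequestCount) * 100
}

func main() {
	s := proxyStats{RequestCount: 4, SuccessCount: 3}
	fmt.Printf("%.1f%%\n", s.successRate())
}
```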

type RateLimitInfo

type RateLimitInfo struct {
	Limit     int           // Maximum requests allowed in the window
	Remaining int           // Requests remaining in current window
	Reset     time.Time     // When the rate limit window resets
	Window    time.Duration // Optional: window duration if known
}

RateLimitInfo holds parsed rate limit data from various header formats. It represents the server's rate limit policy and current state.

func ExtractRateLimitInfo

func ExtractRateLimitInfo(headers http.Header) (RateLimitInfo, bool)

ExtractRateLimitInfo tries all known header formats and returns the best available data. Priority:

  1. RFC 9440 RateLimit header (preferred standard)
  2. X-RateLimit-* headers (common API patterns)
  3. RateLimit-Policy header (for window info only)

Returns (info, true) if any rate limit headers were found, (empty, false) otherwise.

func ParseRateLimitHeader

func ParseRateLimitHeader(header string) (RateLimitInfo, error)

ParseRateLimitHeader parses the RFC 9440 RateLimit header.

Format:

	RateLimit: limit=100, remaining=50, reset=60

The reset value can be:

  • Delta-seconds (integer): seconds until reset
  • Unix timestamp (integer >= 1e9): absolute reset time

See: https://datatracker.ietf.org/doc/html/rfc9440

func ParseXRateLimitHeaders

func ParseXRateLimitHeaders(headers http.Header) (RateLimitInfo, error)

ParseXRateLimitHeaders parses common X-RateLimit-* header variants. Supports GitHub, Twitter, and other common API patterns:

  • X-RateLimit-Limit: maximum requests allowed
  • X-RateLimit-Remaining: requests remaining in current window
  • X-RateLimit-Reset: reset time (Unix timestamp or HTTP date)

Some APIs use different prefixes (e.g., x-ratelimit-* lowercase). This function checks both canonical and lowercase forms.

func (*RateLimitInfo) IsRateLimited

func (r *RateLimitInfo) IsRateLimited() bool

IsRateLimited returns true if the rate limit has been exceeded (Remaining <= 0). Returns false if no rate limit information is available.

func (*RateLimitInfo) TimeUntilReset

func (r *RateLimitInfo) TimeUntilReset() time.Duration

TimeUntilReset returns the duration until the rate limit resets. Returns 0 if reset time is not set or has already passed.

func (*RateLimitInfo) UsagePercent

func (r *RateLimitInfo) UsagePercent() float64

UsagePercent returns the percentage of rate limit used (0-100). Returns -1 if limit information is not available.

type RenderBlockPolicy

type RenderBlockPolicy struct {
	ResourceTypes []BlockedResourceType `json:"resourceTypes,omitempty"`
	URLPatterns   []string              `json:"urlPatterns,omitempty"` // glob-style patterns
}

type RenderEngine

type RenderEngine string

const (
	RenderEngineHTTP       RenderEngine = "http"
	RenderEngineChromedp   RenderEngine = "chromedp"
	RenderEnginePlaywright RenderEngine = "playwright"
)

type RenderProfile

type RenderProfile struct {
	Name         string   `json:"name"`
	HostPatterns []string `json:"hostPatterns"` // match against URL host, glob-style ("example.com", "*.example.com")

	// If set, overrides engine selection entirely.
	ForceEngine RenderEngine `json:"forceEngine,omitempty"`

	// If true, skip HTTP probe and go straight to headless engine selection.
	PreferHeadless bool `json:"preferHeadless,omitempty"`

	// If true, treat every page on this host as JS-heavy (forces escalation if not forced to HTTP).
	AssumeJSHeavy bool `json:"assumeJsHeavy,omitempty"`

	// If true, never escalate (forces HTTP).
	NeverHeadless bool `json:"neverHeadless,omitempty"`

	// Overrides default JS-heavy threshold for this host (0..1). 0 means use global default.
	JSHeavyThreshold float64 `json:"jsHeavyThreshold,omitempty"`

	// Rate limiting configuration for this profile (0 = use global defaults).
	RateLimitQPS   int `json:"rateLimitQPS,omitempty"`
	RateLimitBurst int `json:"rateLimitBurst,omitempty"`

	Block      RenderBlockPolicy   `json:"block,omitempty"`
	Wait       RenderWaitPolicy    `json:"wait,omitempty"`
	Timeouts   RenderTimeoutPolicy `json:"timeouts,omitempty"`
	Screenshot ScreenshotConfig    `json:"screenshot,omitempty"`
	Device     *DeviceEmulation    `json:"device,omitempty"` // Device emulation for this profile

	// CaptchaConfig defines CAPTCHA handling for this profile.
	CaptchaConfig *captcha.CaptchaConfig `json:"captchaConfig,omitempty"`
}

func GetRenderProfile

func GetRenderProfile(dataDir, name string) (RenderProfile, bool, error)

GetRenderProfile retrieves a single profile by name. Returns (profile, true, nil) if found, (zero, false, nil) if not found.

type RenderProfileStore

type RenderProfileStore struct {
	// contains filtered or unexported fields
}

func NewRenderProfileStore

func NewRenderProfileStore(dataDir string) *RenderProfileStore

NewRenderProfileStore initializes a new store. It attempts to load profiles immediately.

func (*RenderProfileStore) GetRateLimitsForURL

func (s *RenderProfileStore) GetRateLimitsForURL(rawURL string) (qps int, burst int)

GetRateLimitsForURL returns the rate limit configuration for a given URL. Returns (0, 0) if no matching profile or if profile has no rate limits set.

func (*RenderProfileStore) MatchURL

func (s *RenderProfileStore) MatchURL(rawURL string) (*RenderProfile, bool, error)

MatchURL returns the highest-precedence matching profile for a given URL. Precedence: first match in file order (user-controlled), deterministic.
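First-match-in-file-order precedence can be sketched as a nested loop over profiles and their host patterns. This example uses the standard library's `path.Match` for the glob step, which is an assumption: the package's own glob matcher is not shown and may treat patterns differently. The `profile` type is a trimmed stand-in for RenderProfile.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// profile is a trimmed, hypothetical copy of RenderProfile.
type profile struct {
	Name         string
	HostPatterns []string
}

// matchProfile returns the first profile, in slice order, that has a host
// pattern matching the URL's host. Matching is case-insensitive.
func matchProfile(profiles []profile, rawURL string) (profile, bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return profile{}, false, err
	}
	host := strings.ToLower(u.Hostname())
	for _, p := range profiles {
		for _, pat := range p.HostPatterns {
			ok, err := path.Match(strings.ToLower(pat), host)
			if err != nil {
				return profile{}, false, fmt.Errorf("profile %q: bad pattern %q: %w", p.Name, pat, err)
			}
			if ok {
				return p, true, nil
			}
		}
	}
	return profile{}, false, nil
}

func main() {
	profiles := []profile{
		{Name: "wildcard", HostPatterns: []string{"*.example.com"}},
		{Name: "exact", HostPatterns: []string{"www.example.com"}},
	}
	p, found, err := matchProfile(profiles, "https://www.example.com/page")
	fmt.Println(p.Name, found, err)
}
```

Because the earlier profile wins, a broad wildcard listed first shadows a more specific pattern listed after it. Ordering the file from most to least specific avoids that.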

func (*RenderProfileStore) Profiles

func (s *RenderProfileStore) Profiles() []RenderProfile

Profiles returns a copy of all loaded profiles.

func (*RenderProfileStore) Reload

func (s *RenderProfileStore) Reload() error

Reload loads profiles from disk. If the file is missing, the profile list becomes empty. Reload is idempotent.

func (*RenderProfileStore) ReloadIfChanged

func (s *RenderProfileStore) ReloadIfChanged() error

ReloadIfChanged checks file modification time and reloads if necessary.

type RenderProfilesFile

type RenderProfilesFile struct {
	Profiles []RenderProfile `json:"profiles"`
}

func LoadRenderProfilesFile

func LoadRenderProfilesFile(dataDir string) (RenderProfilesFile, error)

LoadRenderProfilesFile loads the render profiles file from disk. If the file doesn't exist, returns an empty RenderProfilesFile. Uses strict JSON decoding - unknown fields cause a validation error.
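Strict decoding of this kind is commonly done with `json.Decoder.DisallowUnknownFields` from the standard library. The sketch below shows the technique on a trimmed copy of the file format. Whether the package uses exactly this mechanism is an assumption, and `decodeStrict` is a hypothetical helper.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// renderProfile is a trimmed, hypothetical copy of RenderProfile.
type renderProfile struct {
	Name         string   `json:"name"`
	HostPatterns []string `json:"hostPatterns"`
}

// profilesFile mirrors the documented RenderProfilesFile shape.
type profilesFile struct {
	Profiles []renderProfile `json:"profiles"`
}

// decodeStrict rejects any JSON key that has no matching struct field,
// including keys inside nested profile objects.
func decodeStrict(data []byte) (profilesFile, error) {
	var f profilesFile
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&f); err != nil {
		return profilesFile{}, err
	}
	return f, nil
}

func main() {
	_, err := decodeStrict([]byte(`{"profiles":[{"name":"docs","hostPattern":["example.com"]}]}`))
	fmt.Println("typo rejected:", err != nil)
}
```

The misspelled `hostPattern` key in `main` would be silently ignored by a plain `json.Unmarshal`. Strict decoding turns that kind of typo into a load error.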

type RenderTimeoutPolicy

type RenderTimeoutPolicy struct {
	// Absolute cap for the entire render phase (headless only).
	MaxRenderMs int `json:"maxRenderMs,omitempty"`
	// Cap for in-page script evaluation/wait-for-function loops.
	ScriptEvalMs int `json:"scriptEvalMs,omitempty"`
	// Cap for navigation (goto) only.
	NavigationMs int `json:"navigationMs,omitempty"`
}

type RenderWaitMode

type RenderWaitMode string

const (
	RenderWaitModeDOMReady    RenderWaitMode = "dom_ready"    // DOMContentLoaded + body present
	RenderWaitModeNetworkIdle RenderWaitMode = "network_idle" // inflight==0 for quiet window
	RenderWaitModeStability   RenderWaitMode = "stability"    // body.innerText length stabilizes
	RenderWaitModeSelector    RenderWaitMode = "selector"     // selector appears (and optional stability)
)

type RenderWaitPolicy

type RenderWaitPolicy struct {
	Mode RenderWaitMode `json:"mode,omitempty"`

	// RenderWaitModeSelector
	Selector string `json:"selector,omitempty"`

	// RenderWaitModeNetworkIdle
	NetworkIdleQuietMs int `json:"networkIdleQuietMs,omitempty"`

	// RenderWaitModeStability
	MinTextLength       int `json:"minTextLength,omitempty"`
	StabilityPollMs     int `json:"stabilityPollMs,omitempty"`
	StabilityIterations int `json:"stabilityIterations,omitempty"`

	// Always applied after wait mode completes (final settle).
	ExtraSleepMs int `json:"extraSleepMs,omitempty"`
}

type Request

type Request struct {
	URL              string
	Method           string // HTTP method (GET, POST, PUT, DELETE, PATCH, etc.)
	Body             []byte // Request body for POST/PUT/PATCH
	ContentType      string // Content-Type header for request body
	Timeout          time.Duration
	UserAgent        string
	Headless         bool
	UsePlaywright    bool
	Auth             AuthOptions
	SessionID        string // Reference to persisted session for cookie reuse
	Limiter          *HostLimiter
	MaxRetries       int
	RetryBaseDelay   time.Duration
	MaxResponseBytes int64                   `json:"maxResponseBytes,omitempty"`
	IfNoneMatch      string                  `json:"-"`
	IfModifiedSince  string                  `json:"-"`
	DataDir          string                  `json:"-"`
	PreNavJS         []string                `json:"-"`
	PostNavJS        []string                `json:"-"`
	WaitSelectors    []string                `json:"-"`
	Screenshot       *ScreenshotConfig       `json:"screenshot"`
	Device           *DeviceEmulation        `json:"device,omitempty"`           // Device emulation settings
	NetworkIntercept *NetworkInterceptConfig `json:"networkIntercept,omitempty"` // Network interception config
}

type Result

type Result struct {
	URL             string             `json:"url"`
	Status          int                `json:"status"`
	HTML            string             `json:"html"`
	FetchedAt       time.Time          `json:"fetchedAt"`
	ETag            string             `json:"-"`
	LastModified    string             `json:"-"`
	Engine          RenderEngine       `json:"-"`
	ScreenshotPath  string             `json:"screenshotPath,omitempty"`  // Path to saved screenshot file
	InterceptedData []InterceptedEntry `json:"interceptedData,omitempty"` // Captured network activity
	// RateLimit contains parsed rate limit information from response headers.
	// Populated when the server returns RateLimit (IETF draft-ietf-httpapi-ratelimit-headers) or X-RateLimit-* headers.
	RateLimit *RateLimitInfo `json:"rateLimit,omitempty"`
}

type RetryConfig

type RetryConfig struct {
	MaxRetries      int
	BaseDelay       time.Duration
	MaxDelay        time.Duration   // Cap on delay (default: 60s)
	Strategy        BackoffStrategy // Backoff calculation strategy
	RetryableCodes  map[int]bool    // Status codes that trigger retry (nil = use defaults)
	RetryableErrors []error         // Error types that trigger retry (empty = use defaults)
}

RetryConfig configures retry behavior with per-status-code policies and backoff strategies.

func DefaultRetryConfig

func DefaultRetryConfig() RetryConfig

DefaultRetryConfig returns a RetryConfig with sensible defaults.

type RotationStrategy

type RotationStrategy int

RotationStrategy defines the proxy selection algorithm.

const (
	RotationRoundRobin RotationStrategy = iota
	RotationRandom
	RotationLeastUsed
	RotationWeighted
	RotationLeastLatency
)

func ParseRotationStrategy

func ParseRotationStrategy(s string) RotationStrategy

ParseRotationStrategy parses a rotation strategy from string.

func (RotationStrategy) String

func (r RotationStrategy) String() string

String returns the string representation of the rotation strategy.
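
The sketch below shows one way a String/parse pair for an iota-based enum like this can round-trip. The string names ("round-robin", "least-used", and so on) and the fallback to round-robin for unknown input are assumptions; this section does not state which strings or fallback the package actually uses.

```go
package main

import (
	"fmt"
	"strings"
)

// Local mirror of the declarations above.
type RotationStrategy int

const (
	RotationRoundRobin RotationStrategy = iota
	RotationRandom
	RotationLeastUsed
	RotationWeighted
	RotationLeastLatency
)

// strategyNames holds hypothetical string forms; the real names may differ.
var strategyNames = map[RotationStrategy]string{
	RotationRoundRobin:   "round-robin",
	RotationRandom:       "random",
	RotationLeastUsed:    "least-used",
	RotationWeighted:     "weighted",
	RotationLeastLatency: "least-latency",
}

// String returns the assumed name, or a numeric form for unknown values.
func (r RotationStrategy) String() string {
	if name, ok := strategyNames[r]; ok {
		return name
	}
	return fmt.Sprintf("RotationStrategy(%d)", int(r))
}

// parseRotationStrategy (hypothetical) matches case-insensitively and falls
// back to round-robin, since the documented signature returns no error.
func parseRotationStrategy(s string) RotationStrategy {
	s = strings.ToLower(strings.TrimSpace(s))
	for strategy, name := range strategyNames {
		if name == s {
			return strategy
		}
	}
	return RotationRoundRobin
}

func main() {
	for _, in := range []string{"random", " Least-Used ", "unknown"} {
		fmt.Printf("%q -> %s\n", in, parseRotationStrategy(in))
	}
}
```

Because the parser cannot report an error, callers that need to reject bad input would have to compare the result's String form against the original input.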

type ScreenshotConfig

type ScreenshotConfig struct {
	Enabled  bool             `json:"enabled"`           // Whether to capture screenshot
	FullPage bool             `json:"fullPage"`          // Capture full page or just viewport
	Format   ScreenshotFormat `json:"format"`            // png or jpeg
	Quality  int              `json:"quality,omitempty"` // JPEG quality (1-100), ignored for PNG
	Width    int              `json:"width,omitempty"`   // Viewport width (0 = default)
	Height   int              `json:"height,omitempty"`  // Viewport height (0 = default)
	Device   *DeviceEmulation `json:"device,omitempty"`  // Device emulation settings
}

ScreenshotConfig defines screenshot capture options for headless fetchers. Screenshots apply only to the chromedp and playwright engines, not to the plain HTTP fetcher.
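
The field comments imply a few constraints (JPEG quality in 1-100, quality ignored for PNG, zero meaning the default viewport). A minimal sketch of checking them follows; `validateScreenshot` is a hypothetical helper and not part of this package, and the Device field is left out of the local mirror because DeviceEmulation is not defined in this section.

```go
package main

import "fmt"

// Local mirrors of the declarations above (Device field omitted).
type ScreenshotFormat string

const (
	ScreenshotFormatPNG  ScreenshotFormat = "png"
	ScreenshotFormatJPEG ScreenshotFormat = "jpeg"
)

type ScreenshotConfig struct {
	Enabled  bool             `json:"enabled"`
	FullPage bool             `json:"fullPage"`
	Format   ScreenshotFormat `json:"format"`
	Quality  int              `json:"quality,omitempty"`
	Width    int              `json:"width,omitempty"`
	Height   int              `json:"height,omitempty"`
}

// validateScreenshot (hypothetical helper) checks the constraints stated in
// the field comments. Disabled configs are always accepted.
func validateScreenshot(c ScreenshotConfig) error {
	if !c.Enabled {
		return nil
	}
	if c.Width < 0 || c.Height < 0 {
		return fmt.Errorf("viewport %dx%d must not be negative", c.Width, c.Height)
	}
	switch c.Format {
	case ScreenshotFormatPNG:
		return nil // Quality is ignored for PNG
	case ScreenshotFormatJPEG:
		if c.Quality < 1 || c.Quality > 100 {
			return fmt.Errorf("jpeg quality %d out of range 1-100", c.Quality)
		}
		return nil
	default:
		return fmt.Errorf("unsupported screenshot format %q", c.Format)
	}
}

func main() {
	ok := ScreenshotConfig{Enabled: true, FullPage: true, Format: ScreenshotFormatJPEG, Quality: 80}
	bad := ScreenshotConfig{Enabled: true, Format: ScreenshotFormatJPEG, Quality: 0}
	fmt.Println(validateScreenshot(ok))
	fmt.Println(validateScreenshot(bad))
}
```

Whether the package itself rejects such values or silently substitutes defaults is not stated here, so treat the strict errors as one possible policy.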

type ScreenshotFormat

type ScreenshotFormat string

const (
	ScreenshotFormatPNG  ScreenshotFormat = "png"
	ScreenshotFormatJPEG ScreenshotFormat = "jpeg"
)
