Documentation ¶
Overview ¶
Package fetch provides HTTP and headless browser content fetching capabilities. It handles request routing, rate limiting, retry logic, and render profiles. It does NOT handle content extraction or parsing.
This file handles headless browser authentication and session management. It provides login form detection, automated login flows, and cookie/session persistence. It does NOT handle auth profile management (see internal/auth).
This file provides device emulation and screenshot capture for chromedp. It handles viewport configuration, mobile device simulation, and full-page or viewport screenshot generation with configurable formats.
This file handles network request/response interception for API scraping. It provides the networkInterceptor type for capturing network traffic based on configurable URL patterns and resource types. It does NOT handle request execution or browser lifecycle management.
This file provides network idle detection and response tracking for chromedp. It tracks active network requests to determine when page loading is complete and captures HTTP response status codes from document requests.
Package fetch provides browser/tooling availability checks and fetcher lifecycle helpers.
Purpose:
- Centralize best-effort fetcher cleanup so callers do not leak repo-started browser automation.
Responsibilities:
- Detect whether a fetcher exposes a Close method.
- Invoke Close safely for callers that create short-lived fetchers per request or test.
Scope:
- Fetcher lifecycle cleanup only; concrete fetch behavior lives in sibling files.
Usage:
- Call CloseFetcher(fetcher) in scrape/crawl teardown paths after constructing a fetch.Fetcher.
Invariants/Assumptions:
- Cleanup is best-effort and should be safe to call on nil or non-closable fetchers.
- Close must not panic when the underlying fetcher has already been cleaned up.
Package fetch provides browser/tooling availability checks and fetcher construction helpers.
Purpose:
- Centralize fetcher factories plus browser and Playwright prerequisite detection.
Responsibilities:
- Create adaptive fetchers with optional metrics and proxy-pool wiring.
- Detect Chrome/Chromium availability across supported host platforms.
- Cache Playwright readiness checks while allowing explicit refresh probes.
Scope:
- Shared fetcher setup and availability probing only; concrete fetching lives in sibling files.
Usage:
- Called by runtime initialization, health endpoints, and diagnostic helpers.
Invariants/Assumptions:
- Availability checks must never launch long-running browser sessions.
- Fresh diagnostic probes may invalidate cached Playwright readiness state.
This file implements the main entry points for automatic form detection for headless login flows. It analyzes HTML to detect login forms, identify input fields, and generate CSS selectors for automated login without manual configuration.
The detection uses heuristics based on:
- Input type attributes (password, email)
- Autocomplete attributes (username, current-password)
- Name/id patterns (user, login, email, pass)
- Form structure and field relationships
It does NOT execute JavaScript or handle multi-step flows (MFA/2FA).
This file contains form classification logic for determining the type of form (login, register, password reset, search, contact, newsletter, checkout, survey) based on its fields and structure.
This file contains field finding functions for detecting username, password, and submit button fields within forms.
This file contains form detection heuristics for analyzing form elements and div-based forms (common in modern SPAs).
This file contains CSS selector generation functions for targeting form elements and containers with reliable selectors.
This file contains the type definitions for form detection, including form types, field matches, detected forms, and detection weights configuration.
This file contains utility functions for form detection, including CSS escaping and sorting functions.
This file implements automated form filling and submission for general forms (not just login forms). It uses chromedp for headless browser automation.
The form filler supports:
- Automatic form detection and field mapping
- Filling text, email, phone, textarea, select, checkbox, and radio fields
- Form submission with success/failure detection
- Multi-step form workflows
It does NOT handle CAPTCHAs or complex JavaScript-dependent forms.
This file provides device emulation for Playwright fetcher. It handles viewport configuration, mobile device simulation, and device profile resolution from requests and render profiles.
This file handles network request/response interception for Playwright-based API scraping. It provides the playwrightInterceptor type for capturing network traffic based on configurable URL patterns and resource types. It does NOT handle request execution or browser lifecycle management.
This file provides screenshot capture for Playwright fetcher. It handles viewport configuration, file generation, and full-page or viewport screenshot capture with configurable formats (PNG/JPEG).
This file provides session/cookie extraction for Playwright fetcher. It handles extracting cookies from browser contexts and saving them as sessions for later reuse. It does NOT handle session loading or authentication flows.
Purpose:
- Load and validate persisted proxy-pool configuration.
Responsibilities:
- Read proxy-pool JSON files.
- Distinguish optional default absence from explicit user misconfiguration.
Scope:
- Proxy-pool persistence helpers only.
Usage:
- LoadProxyPoolFromFile(path) for strict loading.
- ProxyPoolFromConfig(path, explicit) for optional startup loading.
Invariants/Assumptions:
- Startup callers may choose silent missing-file handling for non-required pool paths.
- Explicit proxy-pool paths still surface errors.
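The difference between an optional default path and an explicit user-supplied path can be sketched as follows. This is a minimal illustration, not the package's code: loadOptionalFile is a hypothetical helper, and the real ProxyPoolFromConfig presumably also parses and validates the JSON.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// loadOptionalFile is a hypothetical helper. It returns (nil, nil) when the
// file is missing and the path was not set explicitly, and surfaces every
// other error to the caller.
func loadOptionalFile(path string, explicit bool) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		if !explicit && errors.Is(err, fs.ErrNotExist) {
			return nil, nil // default path absent: treat as "no pool configured"
		}
		return nil, fmt.Errorf("load proxy pool %q: %w", path, err)
	}
	return data, nil
}

func main() {
	const missing = "definitely-missing-proxy-pool.json"

	_, err := loadOptionalFile(missing, false)
	fmt.Println("default path, error is nil:", err == nil)

	_, err = loadOptionalFile(missing, true)
	fmt.Println("explicit path, error is nil:", err == nil)
}
```

Other failures, such as permission errors or a directory in place of the file, are returned in both modes, so only a genuinely absent default file is ignored.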
Package fetch provides render profile management utilities. This file implements CRUD operations for render profiles stored in DATA_DIR/render_profiles.json.
Responsibilities:
- Load and save render profiles with strict validation
- CRUD operations: List, Get, Upsert, Delete
- Atomic file writes to prevent corruption
- Validation of profile fields (name uniqueness, host patterns, engine enum)
This file does NOT:
- Handle runtime profile matching (see render_profiles_store.go)
- Execute fetches or apply profiles to requests
Invariants:
- Profile names must be unique (case-sensitive)
- Host patterns must be non-empty and pass hostmatch.ValidateHostPatterns
- Engine must be one of: http, chromedp, playwright (if set)
- File writes are atomic (temp file + rename)
This file defines authentication and proxy configuration types.
This file defines device emulation types for mobile/responsive content.
This file defines network interception types for capturing XHR/Fetch API traffic.
Index ¶
- Variables
- func ApplyAuthQuery(rawURL string, query map[string]string) string
- func CSSEscape(s string) string
- func CalculateBackoff(cfg RetryConfig, attempt int) time.Duration
- func CheckBrowserAvailability(usePlaywright bool) error
- func CheckBrowserAvailabilityFresh(usePlaywright bool) error
- func CloseFetcher(fetcher Fetcher) error
- func DefaultRetryableCodes() map[int]bool
- func DeleteRenderProfile(dataDir, name string) error
- func FindChrome() (string, error)
- func IsJSHeavy(js JSHeaviness, threshold float64) bool
- func IsStatusCodeRetryable(status int, retryableCodes map[int]bool) bool
- func ListDevicePresetNames() []string
- func ListRenderProfileNames(dataDir string) ([]string, error)
- func ParseRateLimitPolicyHeader(header string) (limit int, window time.Duration)
- func RenderProfilesPath(dataDir string) string
- func SaveRenderProfilesFile(dataDir string, file RenderProfilesFile) error
- func ShouldRetryWithConfig(err error, status int, cfg RetryConfig) bool
- func SleepWithContext(ctx context.Context, d time.Duration) error
- func UpsertRenderProfile(dataDir string, profile RenderProfile) error
- func ValidateRenderProfile(p RenderProfile) error
- func ValidateRenderProfilesFile(file RenderProfilesFile) error
- type AdaptiveConfig
- type AdaptiveFetcher
- type AuthOptions
- type BackoffStrategy
- type BlockedResourceType
- type ChromedpFetcher
- type CircuitBreaker
- func (cb *CircuitBreaker) Allow(host string) bool
- func (cb *CircuitBreaker) GetConfig() CircuitBreakerConfig
- func (cb *CircuitBreaker) GetHostStatus() []CircuitBreakerHostStatus
- func (cb *CircuitBreaker) GetState(host string) CircuitBreakerState
- func (cb *CircuitBreaker) IsEnabled() bool
- func (cb *CircuitBreaker) RecordFailure(host string)
- func (cb *CircuitBreaker) RecordSuccess(host string)
- func (cb *CircuitBreaker) Reset(host string)
- type CircuitBreakerConfig
- type CircuitBreakerHostStatus
- type CircuitBreakerState
- type DefaultHealthChecker
- type DetectedForm
- type DetectedFormFields
- type DetectionWeights
- type DeviceCategory
- type DeviceEmulation
- type Fetcher
- type FetcherWithMetrics
- type FieldMatch
- type FieldType
- type FormDetectRequest
- type FormDetectResponse
- type FormDetector
- func (d *FormDetector) DetectAllForms(html string) ([]DetectedForm, error)
- func (d *FormDetector) DetectFormFields(html string, formSelector string) ([]FieldMatch, error)
- func (d *FormDetector) DetectForms(html string) ([]DetectedForm, error)
- func (d *FormDetector) DetectFormsByType(html string, formType FormType) ([]DetectedForm, error)
- func (d *FormDetector) DetectLoginForm(html string) (*DetectedForm, error)
- type FormFillRequest
- type FormFillResult
- type FormFiller
- func (f *FormFiller) Detect(ctx context.Context, req FormDetectRequest) (*FormDetectResponse, error)
- func (f *FormFiller) DetectForms(ctx context.Context, url string, formTypeFilter string) (*FormFillResult, error)
- func (f *FormFiller) FillForm(ctx context.Context, req FormFillRequest) (*FormFillResult, error)
- type FormType
- type HTTPFetcher
- type HealthCheckConfig
- type HealthChecker
- type HostLimiter
- func NewAdaptiveHostLimiter(qps int, burst int, cfg *AdaptiveConfig) *HostLimiter
- func NewAdaptiveHostLimiterWithCircuitBreaker(qps int, burst int, adaptiveCfg *AdaptiveConfig, cb *CircuitBreaker) *HostLimiter
- func NewHostLimiter(qps int, burst int) *HostLimiter
- func NewHostLimiterWithCircuitBreaker(qps int, burst int, cb *CircuitBreaker) *HostLimiter
- func (h *HostLimiter) GetAdaptiveConfig() *AdaptiveConfig
- func (h *HostLimiter) GetBurst() int
- func (h *HostLimiter) GetCircuitBreaker() *CircuitBreaker
- func (h *HostLimiter) GetHostStatus() []HostStatus
- func (h *HostLimiter) GetLimiter(host string) *rate.Limiter
- func (h *HostLimiter) GetQPS() float64
- func (h *HostLimiter) IsAdaptiveEnabled() bool
- func (h *HostLimiter) IsCircuitBreakerEnabled() bool
- func (h *HostLimiter) RecordRateLimit(host string, retryAfter time.Duration)
- func (h *HostLimiter) RecordResult(host string, err error, status int)
- func (h *HostLimiter) RecordSuccess(host string)
- func (h *HostLimiter) UpdateRateLimitInfo(host string, info RateLimitInfo)
- func (h *HostLimiter) Wait(ctx context.Context, rawURL string) error
- func (h *HostLimiter) WaitWithRates(ctx context.Context, rawURL string, profileQPS int, profileBurst int) error
- type HostStatus
- type InterceptedEntry
- type InterceptedRequest
- type InterceptedResourceType
- type InterceptedResponse
- type JSHeaviness
- type MetricsCallback
- type NetworkInterceptConfig
- type OAuth2AuthConfig
- type Orientation
- type PlaywrightFetcher
- type ProxyConfig
- type ProxyEntry
- type ProxyPool
- func (p *ProxyPool) GetEntries() []ProxyEntry
- func (p *ProxyPool) GetHealthyProxyCount() int
- func (p *ProxyPool) GetProxyStats(proxyID string) (ProxyStats, bool)
- func (p *ProxyPool) GetStats() map[string]ProxyStats
- func (p *ProxyPool) GetStrategy() RotationStrategy
- func (p *ProxyPool) GetTotalProxyCount() int
- func (p *ProxyPool) RecordFailure(proxyID string, err error)
- func (p *ProxyPool) RecordSuccess(proxyID string, latencyMs int64)
- func (p *ProxyPool) Select(hints ProxySelectionHints) (ProxyEntry, error)
- func (p *ProxyPool) SetStrategy(strategy RotationStrategy)
- func (p *ProxyPool) Stop()
- type ProxyPoolConfig
- type ProxySelectionHints
- type ProxyStats
- type RateLimitInfo
- type RenderBlockPolicy
- type RenderEngine
- type RenderProfile
- type RenderProfileStore
- func (s *RenderProfileStore) GetRateLimitsForURL(rawURL string) (qps int, burst int)
- func (s *RenderProfileStore) MatchURL(rawURL string) (*RenderProfile, bool, error)
- func (s *RenderProfileStore) Profiles() []RenderProfile
- func (s *RenderProfileStore) Reload() error
- func (s *RenderProfileStore) ReloadIfChanged() error
- type RenderProfilesFile
- type RenderTimeoutPolicy
- type RenderWaitMode
- type RenderWaitPolicy
- type Request
- type Result
- type RetryConfig
- type RotationStrategy
- type ScreenshotConfig
- type ScreenshotFormat
Constants ¶
This section is empty.
Variables ¶
var (
	ErrChromeNotFound     = apperrors.ErrChromeNotFound
	ErrPlaywrightNotReady = apperrors.ErrPlaywrightNotReady
)
var ErrCircuitBreakerOpen = apperrors.New(apperrors.KindInternal, "circuit breaker is open")
ErrCircuitBreakerOpen is returned when the circuit breaker is open and requests are blocked. This maps to HTTP 503 Service Unavailable.
Functions ¶
func ApplyAuthQuery ¶
ApplyAuthQuery applies authentication query parameters to a URL. If the query map is empty, the original URL is returned unchanged.
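The described behavior can be approximated with the standard net/url package. The sketch below is an assumption about how such a helper might work, not the package's implementation; applyQuery is a hypothetical name, and the choice to return unparseable URLs unchanged is a guess.

```go
package main

import (
	"fmt"
	"net/url"
)

// applyQuery is a hypothetical stand-in for ApplyAuthQuery.
func applyQuery(rawURL string, query map[string]string) string {
	if len(query) == 0 {
		return rawURL // nothing to add: return the input unchanged
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return rawURL // assumption: leave unparseable URLs untouched
	}
	q := u.Query()
	for k, v := range query {
		q.Set(k, v) // overrides an existing parameter with the same name
	}
	u.RawQuery = q.Encode() // Encode sorts by key, so the result is deterministic
	return u.String()
}

func main() {
	fmt.Println(applyQuery("https://example.com/api?page=2", map[string]string{"token": "abc"}))
	// prints: https://example.com/api?page=2&token=abc
}
```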
func CSSEscape ¶
CSSEscape escapes a string for use in CSS selectors. This is a simplified version that handles common cases.
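A simplified escaper of this kind might look like the sketch below. The exact set of cases CSSEscape covers is not documented here, so the rules shown (backslash-escape ASCII punctuation and spaces, hex-escape a leading digit) are assumptions, and cssEscape is a hypothetical name. Control characters and a lone "-" are deliberately not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// cssEscape is a hypothetical, simplified identifier escaper.
func cssEscape(s string) string {
	var b strings.Builder
	for i, r := range s {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r == '-', r == '_', r >= 0x80:
			b.WriteRune(r) // safe in identifiers as-is
		case r >= '0' && r <= '9':
			if i == 0 {
				// An identifier cannot start with a digit: emit a hex escape
				// followed by the terminating space.
				fmt.Fprintf(&b, "\\%x ", r)
			} else {
				b.WriteRune(r)
			}
		default:
			b.WriteByte('\\') // escape punctuation, spaces, and similar
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(cssEscape("user.name")) // prints: user\.name
	fmt.Println(cssEscape("1st"))       // prints: \31 st
}
```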
func CalculateBackoff ¶
func CalculateBackoff(cfg RetryConfig, attempt int) time.Duration
CalculateBackoff returns backoff duration based on the configured strategy. This is the main entry point for computing retry delays.
func CheckBrowserAvailability ¶
CheckBrowserAvailability checks if the required browser binaries are available.
func CheckBrowserAvailabilityFresh ¶
CheckBrowserAvailabilityFresh forces a new availability probe.
func CloseFetcher ¶
CloseFetcher closes fetchers that expose a Close method.
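Detecting an optional Close method is commonly done with a type assertion against io.Closer. The sketch below assumes that approach; closeIfCloser, plainFetcher, and browserFetcher are hypothetical names, and the package may use a different interface internally.

```go
package main

import (
	"fmt"
	"io"
)

// plainFetcher and browserFetcher are hypothetical stand-ins for fetchers
// without and with a Close method.
type plainFetcher struct{}

type browserFetcher struct{ closed bool }

// Close is idempotent here: closing twice is harmless.
func (b *browserFetcher) Close() error {
	b.closed = true
	return nil
}

// closeIfCloser closes v when it implements io.Closer and is a no-op for a
// nil interface value or a non-closable value.
func closeIfCloser(v any) error {
	if c, ok := v.(io.Closer); ok {
		return c.Close()
	}
	return nil
}

func main() {
	b := &browserFetcher{}
	fmt.Println(closeIfCloser(b), b.closed)       // prints: <nil> true
	fmt.Println(closeIfCloser(plainFetcher{}))    // prints: <nil>
	fmt.Println(closeIfCloser(nil))               // prints: <nil>
}
```

A typed nil pointer still satisfies the assertion, so a Close method that dereferences its receiver would need its own nil check.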
func DefaultRetryableCodes ¶
DefaultRetryableCodes returns the default set of HTTP status codes that should trigger a retry.
func DeleteRenderProfile ¶
DeleteRenderProfile removes a profile by name. Returns apperrors.NotFound if the profile doesn't exist.
func FindChrome ¶
FindChrome resolves the Chrome/Chromium binary path for diagnostics and runtime checks.
func IsJSHeavy ¶
func IsJSHeavy(js JSHeaviness, threshold float64) bool
IsJSHeavy determines if the page is JS-heavy based on the score and a threshold. Default threshold is usually around 0.5.
func IsStatusCodeRetryable ¶
IsStatusCodeRetryable checks if a status code is in the retryable set.
func ListDevicePresetNames ¶
func ListDevicePresetNames() []string
ListDevicePresetNames returns all available device preset names.
func ListRenderProfileNames ¶
ListRenderProfileNames returns a sorted list of all profile names.
func ParseRateLimitPolicyHeader ¶
ParseRateLimitPolicyHeader parses the RateLimit-Policy header described in the IETF "RateLimit header fields for HTTP" draft (draft-ietf-httpapi-ratelimit-headers). Format: RateLimit-Policy: 100;w=60, where 100 is the limit and w=60 specifies a 60-second window. This can provide the window duration even when the RateLimit header is not present.
func RenderProfilesPath ¶
RenderProfilesPath returns the path to the render profiles JSON file.
func SaveRenderProfilesFile ¶
func SaveRenderProfilesFile(dataDir string, file RenderProfilesFile) error
SaveRenderProfilesFile saves the render profiles file to disk atomically. Validates before writing. Creates parent directories if needed.
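The "temp file + rename" pattern behind an atomic save can be sketched with the standard library. This is not the package's code: writeFileAtomic and roundTrip are hypothetical helpers, validation is omitted, and the file contents are a placeholder. Rename is atomic on POSIX systems when the temporary file lives on the same filesystem as the target, which is why it is created in the target directory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file next to path and then
// renames it into place, so readers never observe a partially written file.
func writeFileAtomic(path string, data []byte) error {
	dir := filepath.Dir(path)
	if err := os.MkdirAll(dir, 0o755); err != nil { // create parent directories
		return err
	}
	tmp, err := os.CreateTemp(dir, ".render_profiles-*.tmp")
	if err != nil {
		return err
	}
	// Cleans up on failure; after a successful rename this fails harmlessly.
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// roundTrip writes data under a fresh temporary directory and reads it back.
func roundTrip(data string) (string, error) {
	root, err := os.MkdirTemp("", "fetch-sketch-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)
	path := filepath.Join(root, "data", "render_profiles.json")
	if err := writeFileAtomic(path, []byte(data)); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := roundTrip("{}")
	fmt.Println(got, err)
}
```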
func ShouldRetryWithConfig ¶
func ShouldRetryWithConfig(err error, status int, cfg RetryConfig) bool
ShouldRetryWithConfig checks if retry should occur using configurable rules. It first checks the configured status codes, then falls back to default logic.
func SleepWithContext ¶
SleepWithContext sleeps for the given duration or until the context is cancelled. Returns ctx.Err() if cancelled, nil otherwise.
func UpsertRenderProfile ¶
func UpsertRenderProfile(dataDir string, profile RenderProfile) error
UpsertRenderProfile creates or updates a render profile. If a profile with the same name exists, it is replaced in-place (preserving order). If not found, the profile is appended to the end.
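The replace-in-place-or-append rule can be illustrated on a plain slice. The profile type and upsert function below are hypothetical simplifications; the real function also loads, validates, and saves the profiles file.

```go
package main

import "fmt"

// profile is a hypothetical, reduced stand-in for RenderProfile.
type profile struct {
	Name   string
	Engine string
}

// upsert replaces the entry with the same name, keeping its position, or
// appends the profile when no entry matches.
func upsert(list []profile, p profile) []profile {
	for i := range list {
		if list[i].Name == p.Name { // names are compared case-sensitively
			list[i] = p
			return list
		}
	}
	return append(list, p)
}

func main() {
	list := []profile{{Name: "a", Engine: "http"}, {Name: "b", Engine: "http"}}
	list = upsert(list, profile{Name: "a", Engine: "chromedp"})
	list = upsert(list, profile{Name: "c", Engine: "playwright"})
	fmt.Println(list) // prints: [{a chromedp} {b http} {c playwright}]
}
```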
func ValidateRenderProfile ¶
func ValidateRenderProfile(p RenderProfile) error
ValidateRenderProfile validates a single render profile.
func ValidateRenderProfilesFile ¶
func ValidateRenderProfilesFile(file RenderProfilesFile) error
ValidateRenderProfilesFile validates an entire render profiles file.
Types ¶
type AdaptiveConfig ¶
type AdaptiveConfig struct {
Enabled bool
MinQPS rate.Limit // floor (e.g., 0.1 = 1 req per 10s)
MaxQPS rate.Limit // ceiling (initial QPS)
AdditiveIncrease rate.Limit // QPS to add on success (e.g., 0.5)
MultiplicativeDecrease float64 // factor to multiply on 429 (e.g., 0.5 = halve)
SuccessThreshold int // consecutive successes before increase
CooldownPeriod time.Duration // minimum time between adjustments
}
AdaptiveConfig controls the behavior of adaptive rate limiting. When enabled, the limiter dynamically adjusts QPS per host based on server responses (429 status codes, Retry-After headers) and successful request patterns using an additive increase/multiplicative decrease algorithm.
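The additive increase/multiplicative decrease rule can be sketched for a single host as follows. This is an illustration of the algorithm only: the aimd type and its methods are hypothetical, plain float64 values stand in for rate.Limit, and CooldownPeriod and Retry-After handling are omitted.

```go
package main

import (
	"fmt"
	"math"
)

// aimd is a hypothetical single-host controller mirroring the AdaptiveConfig
// fields MinQPS, MaxQPS, AdditiveIncrease, MultiplicativeDecrease, and
// SuccessThreshold.
type aimd struct {
	qps, minQPS, maxQPS float64
	increase            float64 // QPS added after enough successes
	decrease            float64 // factor applied on a 429
	successThreshold    int
	successes           int
}

// onSuccess raises the rate additively after successThreshold consecutive
// successes, never exceeding maxQPS.
func (a *aimd) onSuccess() {
	a.successes++
	if a.successes < a.successThreshold {
		return
	}
	a.successes = 0
	a.qps = math.Min(a.qps+a.increase, a.maxQPS)
}

// onRateLimited cuts the rate multiplicatively, never dropping below minQPS.
func (a *aimd) onRateLimited() {
	a.successes = 0
	a.qps = math.Max(a.qps*a.decrease, a.minQPS)
}

func main() {
	a := &aimd{qps: 4, minQPS: 0.5, maxQPS: 4, increase: 0.5, decrease: 0.5, successThreshold: 2}
	a.onRateLimited()
	fmt.Println(a.qps) // prints: 2
	a.onSuccess()
	a.onSuccess()
	fmt.Println(a.qps) // prints: 2.5
}
```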
type AdaptiveFetcher ¶
type AdaptiveFetcher struct {
// contains filtered or unexported fields
}
func NewAdaptiveFetcher ¶
func NewAdaptiveFetcher(dataDir string) *AdaptiveFetcher
func (*AdaptiveFetcher) Close ¶
func (f *AdaptiveFetcher) Close() error
func (*AdaptiveFetcher) SetMetricsCallback ¶
func (f *AdaptiveFetcher) SetMetricsCallback(cb MetricsCallback)
SetMetricsCallback sets the callback function for metrics collection.
func (*AdaptiveFetcher) SetProxyPool ¶
func (f *AdaptiveFetcher) SetProxyPool(pool *ProxyPool)
SetProxyPool sets the proxy pool for all underlying fetchers.
type AuthOptions ¶
type AuthOptions struct {
Basic string `json:"basic,omitempty"`
Headers map[string]string `json:"headers,omitempty"`
Cookies []string `json:"cookies,omitempty"`
Query map[string]string `json:"query,omitempty"`
LoginURL string `json:"loginUrl,omitempty"`
LoginUserSelector string `json:"loginUserSelector,omitempty"`
LoginPassSelector string `json:"loginPassSelector,omitempty"`
LoginSubmitSelector string `json:"loginSubmitSelector,omitempty"`
LoginUser string `json:"loginUser,omitempty"`
LoginPass string `json:"loginPass,omitempty"`
LoginAutoDetect bool `json:"loginAutoDetect,omitempty"`
Proxy *ProxyConfig `json:"proxy,omitempty"`
// ProxyHints provides hints for proxy selection when using the loaded proxy pool.
ProxyHints *ProxySelectionHints `json:"proxyHints,omitempty"`
// OAuth2 contains OAuth 2.0 configuration for automatic token management.
// When set, the fetcher will use OAuth transport with automatic token refresh.
OAuth2 *OAuth2AuthConfig `json:"oauth2,omitempty"`
}
AuthOptions contains authentication options for fetch operations.
func (*AuthOptions) NormalizeTransport ¶
func (a *AuthOptions) NormalizeTransport()
NormalizeTransport trims proxy-related transport overrides in place.
func (*AuthOptions) ValidateTransport ¶
func (a *AuthOptions) ValidateTransport() error
ValidateTransport rejects ambiguous or malformed proxy overrides.
type BackoffStrategy ¶
type BackoffStrategy int
BackoffStrategy defines the backoff calculation strategy.
const (
	// BackoffStrategyExponential uses exponential backoff: base * 2^attempt
	BackoffStrategyExponential BackoffStrategy = iota
	// BackoffStrategyExponentialJitter adds random jitter to exponential backoff
	BackoffStrategyExponentialJitter
	// BackoffStrategyLinear uses linear backoff: base * (attempt + 1)
	BackoffStrategyLinear
	// BackoffStrategyFixed uses a fixed delay regardless of attempt
	BackoffStrategyFixed
)
func ParseBackoffStrategy ¶
func ParseBackoffStrategy(s string) BackoffStrategy
ParseBackoffStrategy parses a backoff strategy string.
func (BackoffStrategy) String ¶
func (s BackoffStrategy) String() string
String returns the string representation of the backoff strategy.
type BlockedResourceType ¶
type BlockedResourceType string
const (
	BlockedResourceImage      BlockedResourceType = "image"
	BlockedResourceMedia      BlockedResourceType = "media"
	BlockedResourceFont       BlockedResourceType = "font"
	BlockedResourceStylesheet BlockedResourceType = "stylesheet"
	BlockedResourceOther      BlockedResourceType = "other"
)
type ChromedpFetcher ¶
type ChromedpFetcher struct {
// contains filtered or unexported fields
}
func (*ChromedpFetcher) Fetch ¶
func (f *ChromedpFetcher) Fetch(ctx context.Context, req Request, prof RenderProfile) (Result, error)
func (*ChromedpFetcher) SetProxyPool ¶
func (f *ChromedpFetcher) SetProxyPool(pool *ProxyPool)
SetProxyPool sets the proxy pool for this fetcher.
type CircuitBreaker ¶
type CircuitBreaker struct {
// contains filtered or unexported fields
}
CircuitBreaker tracks failure state per host and implements the circuit breaker pattern. It is safe for concurrent use by multiple goroutines.
func NewCircuitBreaker ¶
func NewCircuitBreaker(cfg CircuitBreakerConfig) *CircuitBreaker
NewCircuitBreaker creates a new CircuitBreaker with the given configuration.
func (*CircuitBreaker) Allow ¶
func (cb *CircuitBreaker) Allow(host string) bool
Allow checks if a request to the given host should be allowed. Returns true if the request can proceed, false if it should be blocked.
func (*CircuitBreaker) GetConfig ¶
func (cb *CircuitBreaker) GetConfig() CircuitBreakerConfig
GetConfig returns a copy of the circuit breaker configuration.
func (*CircuitBreaker) GetHostStatus ¶
func (cb *CircuitBreaker) GetHostStatus() []CircuitBreakerHostStatus
GetHostStatus returns circuit breaker status for all known hosts.
func (*CircuitBreaker) GetState ¶
func (cb *CircuitBreaker) GetState(host string) CircuitBreakerState
GetState returns the current circuit breaker state for the given host.
func (*CircuitBreaker) IsEnabled ¶
func (cb *CircuitBreaker) IsEnabled() bool
IsEnabled returns true if the circuit breaker is enabled.
func (*CircuitBreaker) RecordFailure ¶
func (cb *CircuitBreaker) RecordFailure(host string)
RecordFailure records a failed request to the given host. This may transition the circuit breaker from closed to open, or half-open to open.
func (*CircuitBreaker) RecordSuccess ¶
func (cb *CircuitBreaker) RecordSuccess(host string)
RecordSuccess records a successful request to the given host. This may transition the circuit breaker from half-open to closed.
func (*CircuitBreaker) Reset ¶
func (cb *CircuitBreaker) Reset(host string)
Reset resets the circuit breaker state for a specific host, or for all hosts if host is empty.
type CircuitBreakerConfig ¶
type CircuitBreakerConfig struct {
Enabled bool // Whether circuit breaker is enabled
FailureThreshold int // Failures before opening circuit (default: 5)
SuccessThreshold int // Successes in half-open to close (default: 3)
ResetTimeout time.Duration // Time before attempting half-open (default: 30s)
HalfOpenMaxRequests int // Max requests in half-open state (default: 3)
}
CircuitBreakerConfig configures circuit breaker behavior.
func DefaultCircuitBreakerConfig ¶
func DefaultCircuitBreakerConfig() CircuitBreakerConfig
DefaultCircuitBreakerConfig returns a CircuitBreakerConfig with sensible defaults.
type CircuitBreakerHostStatus ¶
type CircuitBreakerHostStatus struct {
Host string
State string
FailureCount int
SuccessCount int
LastFailureTime time.Time
HalfOpenRequests int
}
CircuitBreakerHostStatus represents the current state of a circuit breaker for a host.
func (CircuitBreakerHostStatus) String ¶
func (cbs CircuitBreakerHostStatus) String() string
String returns a human-readable description of the circuit breaker state.
type CircuitBreakerState ¶
type CircuitBreakerState int
CircuitBreakerState represents the state of a circuit breaker.
const (
	// StateClosed is the normal operating state where requests are allowed.
	StateClosed CircuitBreakerState = iota
	// StateOpen means the failure threshold was reached; requests are blocked.
	StateOpen
	// StateHalfOpen is a testing state to check if the service has recovered.
	StateHalfOpen
)
func (CircuitBreakerState) String ¶
func (s CircuitBreakerState) String() string
String returns the string representation of the circuit breaker state.
type DefaultHealthChecker ¶
DefaultHealthChecker makes an HTTP request through the proxy to a test endpoint.
func (*DefaultHealthChecker) Check ¶
func (c *DefaultHealthChecker) Check(ctx context.Context, proxy ProxyEntry) (latencyMs int64, err error)
Check performs a health check on the given proxy.
type DetectedForm ¶
type DetectedForm struct {
FormIndex int `json:"formIndex"` // Index in document (0 = first form)
FormSelector string `json:"formSelector"` // CSS selector to target this form
Score float64 `json:"score"` // Overall confidence score (0.0-1.0)
FormType FormType `json:"formType"` // Classified type
UserField *FieldMatch `json:"userField"` // Detected username field (nil if not found)
PassField *FieldMatch `json:"passField"` // Detected password field (nil if not found)
SubmitField *FieldMatch `json:"submitField"` // Detected submit button (nil if not found)
AllFields []FieldMatch `json:"allFields,omitempty"` // All detected fields in the form
HTML string `json:"html,omitempty"` // Form HTML snippet (for debugging)
Action string `json:"action,omitempty"` // Form action URL
Method string `json:"method,omitempty"` // Form method (GET/POST)
Name string `json:"name,omitempty"` // Form name attribute
ID string `json:"id,omitempty"` // Form ID attribute
}
DetectedForm represents a form with detection metadata.
type DetectedFormFields ¶
type DetectedFormFields struct {
UserField FieldMatch `json:"userField"` // Detected username/email field
PassField FieldMatch `json:"passField"` // Detected password field
SubmitField FieldMatch `json:"submitField"` // Detected submit button
FormType FormType `json:"formType"` // Classified type of form
}
DetectedFormFields captures the fields detected within a form.
type DetectionWeights ¶
type DetectionWeights struct {
PasswordTypeWeight float64 // input[type=password] - strongest signal
EmailTypeWeight float64 // input[type=email]
AutocompleteUsername float64 // autocomplete="username"
AutocompletePassword float64 // autocomplete="current-password"
NamePatternUsername float64 // name matches user/login/email patterns
NamePatternPassword float64 // name matches pass/pwd patterns
IDPatternUsername float64 // id matches user/login/email patterns
SubmitButtonType float64 // button[type=submit] or input[type=submit]
SubmitButtonText float64 // button text contains "login", "sign in", etc.
}
DetectionWeights configures the scoring weights for form detection heuristics. Higher weights indicate stronger signals.
func DefaultDetectionWeights ¶
func DefaultDetectionWeights() DetectionWeights
DefaultDetectionWeights returns sensible default weights for form detection.
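How such weights might combine into a field score can be sketched as a clamped sum of matching signals. Everything here is an assumption made for illustration: the input and weights types, the passwordScore function, the weight values, and the additive scoring rule are hypothetical, and the package may combine signals differently.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// input is a hypothetical, reduced view of an HTML input element.
type input struct {
	Type         string
	Name         string
	Autocomplete string
}

// weights is a hypothetical subset of DetectionWeights.
type weights struct {
	PasswordType         float64
	AutocompletePassword float64
	NamePatternPassword  float64
}

// passwordScore adds the weight of every matching signal and clamps the
// result to 1.0.
func passwordScore(in input, w weights) float64 {
	score := 0.0
	if strings.EqualFold(in.Type, "password") {
		score += w.PasswordType
	}
	if in.Autocomplete == "current-password" {
		score += w.AutocompletePassword
	}
	name := strings.ToLower(in.Name)
	if strings.Contains(name, "pass") || strings.Contains(name, "pwd") {
		score += w.NamePatternPassword
	}
	return math.Min(score, 1.0)
}

func main() {
	w := weights{PasswordType: 0.5, AutocompletePassword: 0.25, NamePatternPassword: 0.125}
	fmt.Println(passwordScore(input{Type: "password", Name: "user_pwd"}, w)) // prints: 0.625
	fmt.Println(passwordScore(input{Type: "text", Name: "email"}, w))        // prints: 0
}
```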
type DeviceCategory ¶
type DeviceCategory string
DeviceCategory classifies devices by form factor.
const (
DeviceCategoryMobile DeviceCategory = "mobile"
DeviceCategoryTablet DeviceCategory = "tablet"
DeviceCategoryDesktop DeviceCategory = "desktop"
)
func GetDeviceCategories ¶
func GetDeviceCategories() []DeviceCategory
GetDeviceCategories returns all available device categories.
type DeviceEmulation ¶
type DeviceEmulation struct {
Name string `json:"name"` // Device preset name (e.g., "iPhone 14", "Pixel 7")
ViewportWidth int `json:"viewportWidth"` // Viewport width in pixels
ViewportHeight int `json:"viewportHeight"` // Viewport height in pixels
DeviceScaleFactor float64 `json:"deviceScaleFactor"` // Device pixel ratio (e.g., 2.0 for Retina)
UserAgent string `json:"userAgent"` // User agent string for the device
IsMobile bool `json:"isMobile"` // Whether to emulate mobile viewport
HasTouch bool `json:"hasTouch"` // Whether the device has touch capability
Category DeviceCategory `json:"category"` // Device category (mobile, tablet, desktop)
Orientation Orientation `json:"orientation"` // Default orientation (portrait/landscape)
}
DeviceEmulation defines device emulation settings for mobile/responsive content. Used by headless fetchers to emulate specific devices.
func GetDevicePreset ¶
func GetDevicePreset(name string) *DeviceEmulation
GetDevicePreset returns a device emulation preset by name. Returns nil if the preset name is not recognized.
func GetDevicePresetsByCategory ¶
func GetDevicePresetsByCategory(cat DeviceCategory) []DeviceEmulation
GetDevicePresetsByCategory returns all device presets matching the given category.
func (*DeviceEmulation) ApplyOrientation ¶
func (d *DeviceEmulation) ApplyOrientation(orientation Orientation) *DeviceEmulation
ApplyOrientation applies the specified orientation to a device emulation. For landscape orientation on mobile/tablet devices, it swaps width and height.
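The width/height swap described above can be sketched as follows. The local `device` type, the `applyOrientation` function, and the choice to return a copy and to skip the swap when the viewport already fits the requested orientation are assumptions for illustration; the real method may behave differently in those edge cases.

```go
package main

import "fmt"

type orientation string

const (
	portrait  orientation = "portrait"
	landscape orientation = "landscape"
)

// device mirrors the DeviceEmulation fields that matter for orientation.
type device struct {
	ViewportWidth  int
	ViewportHeight int
	Category       string // "mobile", "tablet", or "desktop"
	Orientation    orientation
}

// applyOrientation returns a copy of d in the requested orientation.
// Assumption: dimensions are swapped only for mobile/tablet devices, and only
// when the current viewport does not already fit the requested orientation.
func applyOrientation(d device, o orientation) device {
	out := d
	out.Orientation = o
	if d.Category != "mobile" && d.Category != "tablet" {
		return out
	}
	isLandscape := d.ViewportWidth > d.ViewportHeight
	if (o == landscape) != isLandscape {
		out.ViewportWidth, out.ViewportHeight = d.ViewportHeight, d.ViewportWidth
	}
	return out
}

func main() {
	phone := device{ViewportWidth: 390, ViewportHeight: 844, Category: "mobile", Orientation: portrait}
	rotated := applyOrientation(phone, landscape)
	fmt.Println(rotated.ViewportWidth, rotated.ViewportHeight, rotated.Orientation)
}
```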
type Fetcher ¶
func NewFetcher ¶
func NewFetcherWithProxyPool ¶
NewFetcherWithProxyPool creates a new fetcher with proxy pool support.
type FetcherWithMetrics ¶
type FetcherWithMetrics interface {
Fetcher
SetMetricsCallback(cb MetricsCallback)
}
FetcherWithMetrics is a fetcher that supports metrics callbacks.
func NewFetcherWithMetrics ¶
func NewFetcherWithMetrics(dataDir string, callback MetricsCallback) FetcherWithMetrics
NewFetcherWithMetrics creates a new fetcher with metrics callback support.
func NewFetcherWithMetricsAndProxyPool ¶
func NewFetcherWithMetricsAndProxyPool(dataDir string, callback MetricsCallback, pool *ProxyPool) FetcherWithMetrics
NewFetcherWithMetricsAndProxyPool creates a new fetcher with both metrics and proxy pool support.
type FieldMatch ¶
type FieldMatch struct {
Selector string `json:"selector"` // CSS selector to target this field
Attribute string `json:"attribute"` // Which attribute matched (type, name, id, etc.)
MatchValue string `json:"matchValue"` // The value that matched
Confidence float64 `json:"confidence"` // Individual field confidence (0.0-1.0)
MatchReasons []string `json:"matchReasons,omitempty"` // Why this field was selected
FieldType FieldType `json:"fieldType,omitempty"` // Semantic field type classification
FieldName string `json:"fieldName,omitempty"` // Human-readable field name (e.g., "email", "firstName")
Required bool `json:"required,omitempty"` // Whether the field is required
Placeholder string `json:"placeholder,omitempty"` // Placeholder text if available
}
FieldMatch represents a detected form field with metadata about how it was identified.
type FieldType ¶
type FieldType string
FieldType classifies form fields by their semantic purpose.
const (
FieldTypeText FieldType = "text"
FieldTypeEmail FieldType = "email"
FieldTypePassword FieldType = "password"
FieldTypePhone FieldType = "phone"
FieldTypeAddress FieldType = "address"
FieldTypeSearch FieldType = "search"
FieldTypeURL FieldType = "url"
FieldTypeNumber FieldType = "number"
FieldTypeDate FieldType = "date"
FieldTypeSelect FieldType = "select"
FieldTypeTextarea FieldType = "textarea"
FieldTypeCheckbox FieldType = "checkbox"
FieldTypeRadio FieldType = "radio"
FieldTypeSubmit FieldType = "submit"
FieldTypeHidden FieldType = "hidden"
FieldTypeFile FieldType = "file"
FieldTypeUnknown FieldType = "unknown"
)
type FormDetectRequest ¶
type FormDetectRequest struct {
URL string `json:"url"`
FormType string `json:"formType,omitempty"`
Headless bool `json:"headless"`
}
FormDetectRequest represents a request to detect forms on a page.
type FormDetectResponse ¶
type FormDetectResponse struct {
URL string `json:"url"`
Forms []DetectedForm `json:"forms"`
FormCount int `json:"formCount"`
DetectedTypes []string `json:"detectedTypes"`
}
FormDetectResponse represents the response from form detection.
func (FormDetectResponse) MarshalJSON ¶
func (r FormDetectResponse) MarshalJSON() ([]byte, error)
MarshalJSON implements custom JSON marshaling for FormDetectResponse.
type FormDetector ¶
type FormDetector struct {
Weights DetectionWeights
}
FormDetector analyzes HTML to find and classify login forms.
func NewFormDetector ¶
func NewFormDetector() *FormDetector
NewFormDetector creates a new form detector with default weights.
func NewFormDetectorWithWeights ¶
func NewFormDetectorWithWeights(weights DetectionWeights) *FormDetector
NewFormDetectorWithWeights creates a form detector with custom weights.
func (*FormDetector) DetectAllForms ¶
func (d *FormDetector) DetectAllForms(html string) ([]DetectedForm, error)
DetectAllForms analyzes HTML and returns all detected forms with full field classification. This is the general-purpose form detection that supports all form types.
func (*FormDetector) DetectFormFields ¶
func (d *FormDetector) DetectFormFields(html string, formSelector string) ([]FieldMatch, error)
DetectFormFields extracts all fields from a specific form.
func (*FormDetector) DetectForms ¶
func (d *FormDetector) DetectForms(html string) ([]DetectedForm, error)
DetectForms analyzes HTML and returns detected forms sorted by confidence (highest first).
func (*FormDetector) DetectFormsByType ¶
func (d *FormDetector) DetectFormsByType(html string, formType FormType) ([]DetectedForm, error)
DetectFormsByType analyzes HTML and returns forms of a specific type.
func (*FormDetector) DetectLoginForm ¶
func (d *FormDetector) DetectLoginForm(html string) (*DetectedForm, error)
DetectLoginForm is a convenience method that returns the highest-confidence login form. Returns nil if no suitable login form is detected.
type FormFillRequest ¶
type FormFillRequest struct {
URL string `json:"url"` // URL of the page containing the form
FormSelector string `json:"formSelector,omitempty"` // CSS selector for the form (auto-detect if empty)
Fields map[string]string `json:"fields"` // field name/selector -> value
Submit bool `json:"submit"` // Whether to submit the form
WaitFor string `json:"waitFor,omitempty"` // Selector to wait for after submit
Timeout time.Duration `json:"timeout,omitempty"` // Operation timeout
Headless bool `json:"headless"` // Use headless mode
DetectOnly bool `json:"detectOnly,omitempty"` // Only detect forms, don't fill
FormTypeFilter string `json:"formTypeFilter,omitempty"` // Filter by form type (e.g., "contact", "search")
}
FormFillRequest represents a form fill operation.
type FormFillResult ¶
type FormFillResult struct {
Success bool `json:"success"`
FormSelector string `json:"formSelector"`
FormType FormType `json:"formType,omitempty"`
FilledFields []string `json:"filledFields"`
Errors []string `json:"errors,omitempty"`
PageURL string `json:"pageUrl,omitempty"`
PageHTML string `json:"pageHtml,omitempty"`
DetectedForms []DetectedForm `json:"detectedForms,omitempty"`
}
FormFillResult represents the result of a form fill operation.
func (FormFillResult) MarshalJSON ¶
func (r FormFillResult) MarshalJSON() ([]byte, error)
MarshalJSON implements custom JSON marshaling for FormFillResult.
type FormFiller ¶
type FormFiller struct {
// contains filtered or unexported fields
}
FormFiller handles automated form filling and submission.
func NewFormFiller ¶
func NewFormFiller(fetcher *ChromedpFetcher) *FormFiller
NewFormFiller creates a new form filler using the provided chromedp fetcher.
func (*FormFiller) Detect ¶
func (f *FormFiller) Detect(ctx context.Context, req FormDetectRequest) (*FormDetectResponse, error)
Detect detects forms on a page and returns detailed information about each one.
func (*FormFiller) DetectForms ¶
func (f *FormFiller) DetectForms(ctx context.Context, url string, formTypeFilter string) (*FormFillResult, error)
DetectForms detects all forms on a page and returns their details.
func (*FormFiller) FillForm ¶
func (f *FormFiller) FillForm(ctx context.Context, req FormFillRequest) (*FormFillResult, error)
FillForm fills and optionally submits a form.
type FormType ¶
type FormType string
FormType classifies detected forms by their likely purpose.
const (
FormTypeLogin FormType = "login"
FormTypeRegister FormType = "register"
FormTypePasswordReset FormType = "password_reset"
FormTypeSearch FormType = "search"
FormTypeContact FormType = "contact"
FormTypeNewsletter FormType = "newsletter"
FormTypeCheckout FormType = "checkout"
FormTypeSurvey FormType = "survey"
FormTypeUnknown FormType = "unknown"
)
type HTTPFetcher ¶
type HTTPFetcher struct {
// contains filtered or unexported fields
}
HTTPFetcher implements content fetching using the standard library http.Client. Provides retry logic, rate limiting, authentication, conditional requests, and response size limits. See fetcher.go for the Fetcher interface definition.
func (*HTTPFetcher) Fetch ¶
Fetch performs a standard HTTP GET request to retrieve the content of a URL. It supports retries, rate limiting, and basic/token authentication.
func (*HTTPFetcher) SetProxyPool ¶
func (f *HTTPFetcher) SetProxyPool(pool *ProxyPool)
SetProxyPool sets the proxy pool for this fetcher.
type HealthCheckConfig ¶
type HealthCheckConfig struct {
Enabled bool `json:"enabled"`
IntervalSeconds int `json:"interval_seconds"`
TimeoutSeconds int `json:"timeout_seconds"`
MaxConsecutiveFails int `json:"max_consecutive_fails"`
RecoveryAfterSeconds int `json:"recovery_after_seconds"`
TestURL string `json:"test_url,omitempty"`
}
HealthCheckConfig configures proxy health checking.
func DefaultHealthCheckConfig ¶
func DefaultHealthCheckConfig() HealthCheckConfig
DefaultHealthCheckConfig returns sensible defaults for health checking.
type HealthChecker ¶
type HealthChecker interface {
Check(ctx context.Context, proxy ProxyEntry) (latencyMs int64, err error)
}
HealthChecker defines the interface for proxy health checking.
type HostLimiter ¶
type HostLimiter struct {
// contains filtered or unexported fields
}
HostLimiter manages per-host rate limiters.
func NewAdaptiveHostLimiter ¶
func NewAdaptiveHostLimiter(qps int, burst int, cfg *AdaptiveConfig) *HostLimiter
NewAdaptiveHostLimiter creates a HostLimiter with adaptive rate limiting enabled. The limiter will dynamically adjust QPS per host based on server responses.
func NewAdaptiveHostLimiterWithCircuitBreaker ¶
func NewAdaptiveHostLimiterWithCircuitBreaker(qps int, burst int, adaptiveCfg *AdaptiveConfig, cb *CircuitBreaker) *HostLimiter
NewAdaptiveHostLimiterWithCircuitBreaker creates a HostLimiter with both adaptive rate limiting and circuit breaker enabled.
func NewHostLimiter ¶
func NewHostLimiter(qps int, burst int) *HostLimiter
func NewHostLimiterWithCircuitBreaker ¶
func NewHostLimiterWithCircuitBreaker(qps int, burst int, cb *CircuitBreaker) *HostLimiter
NewHostLimiterWithCircuitBreaker creates a HostLimiter with circuit breaker enabled. The circuit breaker will isolate failing hosts to prevent cascading failures.
func (*HostLimiter) GetAdaptiveConfig ¶
func (h *HostLimiter) GetAdaptiveConfig() *AdaptiveConfig
GetAdaptiveConfig returns a copy of the adaptive configuration, or nil if not enabled.
func (*HostLimiter) GetBurst ¶
func (h *HostLimiter) GetBurst() int
GetBurst returns the configured burst.
func (*HostLimiter) GetCircuitBreaker ¶
func (h *HostLimiter) GetCircuitBreaker() *CircuitBreaker
GetCircuitBreaker returns the circuit breaker instance, or nil if not enabled.
func (*HostLimiter) GetHostStatus ¶
func (h *HostLimiter) GetHostStatus() []HostStatus
GetHostStatus returns the rate limit status for all known hosts.
func (*HostLimiter) GetLimiter ¶
func (h *HostLimiter) GetLimiter(host string) *rate.Limiter
GetLimiter returns the rate limiter for a specific host, for use in metrics registration.
func (*HostLimiter) GetQPS ¶
func (h *HostLimiter) GetQPS() float64
GetQPS returns the configured QPS.
func (*HostLimiter) IsAdaptiveEnabled ¶
func (h *HostLimiter) IsAdaptiveEnabled() bool
IsAdaptiveEnabled returns true if adaptive rate limiting is enabled.
func (*HostLimiter) IsCircuitBreakerEnabled ¶
func (h *HostLimiter) IsCircuitBreakerEnabled() bool
IsCircuitBreakerEnabled returns true if circuit breaker is enabled.
func (*HostLimiter) RecordRateLimit ¶
func (h *HostLimiter) RecordRateLimit(host string, retryAfter time.Duration)
RecordRateLimit reports a 429 response for the given host with an optional Retry-After duration. When adaptive rate limiting is enabled, this will decrease the QPS for the host and optionally set a cooldown period.
func (*HostLimiter) RecordResult ¶
func (h *HostLimiter) RecordResult(host string, err error, status int)
RecordResult records the result of a request for both adaptive rate limiting and circuit breaker tracking.
func (*HostLimiter) RecordSuccess ¶
func (h *HostLimiter) RecordSuccess(host string)
RecordSuccess reports a successful request (2xx/3xx status) for the given host. When adaptive rate limiting is enabled, this may increase the QPS for the host after a threshold of consecutive successes is reached.
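The adaptive behavior described for RecordRateLimit and RecordSuccess follows an additive-increase/multiplicative-decrease (AIMD) pattern. The sketch below shows that pattern in isolation. The `aimdState` and `aimdParams` types and every field in them are hypothetical; the fields of AdaptiveConfig are not shown in this section, so the real tuning knobs may be named and shaped differently.

```go
package main

import (
	"fmt"
	"math"
)

// aimdParams holds hypothetical tuning knobs for the sketch.
type aimdParams struct {
	MinQPS, MaxQPS   float64
	Increase         float64 // additive step applied after enough successes
	DecreaseFactor   float64 // multiplicative factor applied on a 429
	SuccessThreshold int     // consecutive successes required before increasing
}

// aimdState is hypothetical per-host state.
type aimdState struct {
	CurrentQPS           float64
	ConsecutiveSuccesses int
}

// onSuccess raises the QPS by a fixed step once the success threshold is hit.
func (s *aimdState) onSuccess(p aimdParams) {
	s.ConsecutiveSuccesses++
	if s.ConsecutiveSuccesses >= p.SuccessThreshold {
		s.CurrentQPS = math.Min(s.CurrentQPS+p.Increase, p.MaxQPS)
		s.ConsecutiveSuccesses = 0
	}
}

// onRateLimit cuts the QPS multiplicatively and resets the success streak.
func (s *aimdState) onRateLimit(p aimdParams) {
	s.CurrentQPS = math.Max(s.CurrentQPS*p.DecreaseFactor, p.MinQPS)
	s.ConsecutiveSuccesses = 0
}

func main() {
	p := aimdParams{MinQPS: 1, MaxQPS: 10, Increase: 1, DecreaseFactor: 0.5, SuccessThreshold: 2}
	s := aimdState{CurrentQPS: 8}
	s.onRateLimit(p)
	s.onSuccess(p)
	s.onSuccess(p)
	fmt.Println(s.CurrentQPS)
}
```

The asymmetry is deliberate in AIMD schemes: backing off quickly protects the remote host, while recovering slowly avoids oscillating around the server's limit.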
func (*HostLimiter) UpdateRateLimitInfo ¶
func (h *HostLimiter) UpdateRateLimitInfo(host string, info RateLimitInfo)
UpdateRateLimitInfo updates the limiter based on server-provided rate limit headers. This allows the limiter to respect server-provided rate limits instead of relying solely on adaptive AIMD behavior.
Behavior:
- If info.Limit > 0, adjusts currentQPS to respect the server's limit
- If info.Remaining is low (< 10%), enters cooldown until reset time
- If info.Reset is in the future and remaining is low, respects that reset time
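The three rules above can be sketched as a single update function. This is an illustration under stated assumptions, not the package's implementation: the local `rateLimitInfo` and `hostState` types are stand-ins, the server limit is converted to QPS as Limit divided by Window (and skipped when Window is unknown), and the QPS is only ever lowered.

```go
package main

import (
	"fmt"
	"time"
)

// rateLimitInfo mirrors the RateLimitInfo fields used by the sketch.
type rateLimitInfo struct {
	Limit     int
	Remaining int
	Reset     time.Time
	Window    time.Duration
}

// hostState is a hypothetical slice of per-host limiter state.
type hostState struct {
	CurrentQPS    float64
	CooldownUntil time.Time
}

// applyServerLimits lowers the host QPS to the server's advertised rate and
// enters cooldown when fewer than 10% of the requests remain.
func applyServerLimits(s *hostState, info rateLimitInfo, now time.Time) {
	if info.Limit <= 0 {
		return // no usable server limit
	}
	if info.Window > 0 {
		serverQPS := float64(info.Limit) / info.Window.Seconds()
		if serverQPS < s.CurrentQPS {
			s.CurrentQPS = serverQPS
		}
	}
	low := float64(info.Remaining) < 0.1*float64(info.Limit)
	if low && info.Reset.After(now) {
		s.CooldownUntil = info.Reset
	}
}

func main() {
	now := time.Now()
	s := hostState{CurrentQPS: 10}
	info := rateLimitInfo{Limit: 120, Remaining: 5, Reset: now.Add(30 * time.Second), Window: time.Minute}
	applyServerLimits(&s, info, now)
	fmt.Println(s.CurrentQPS, s.CooldownUntil.After(now))
}
```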
func (*HostLimiter) Wait ¶
func (h *HostLimiter) Wait(ctx context.Context, rawURL string) error
Wait waits for the rate limiter for the given URL using global default rates. For per-host rate configuration, use WaitWithRates instead.
func (*HostLimiter) WaitWithRates ¶
func (h *HostLimiter) WaitWithRates(ctx context.Context, rawURL string, profileQPS int, profileBurst int) error
WaitWithRates waits for the rate limiter for the given URL with optional per-host rates. If profileQPS or profileBurst are 0, the global defaults are used. This method also checks the circuit breaker before allowing the request.
type HostStatus ¶
type HostStatus struct {
Host string
QPS float64
Burst int
LastRequest time.Time
// Adaptive rate limiting fields
CurrentQPS float64 // actual QPS after adaptation
AdaptiveEnabled bool
ConsecutiveSuccesses int
Consecutive429s int
InCooldown bool
CooldownUntil time.Time
// Circuit breaker fields
CircuitBreakerState string // closed, open, half-open
CircuitBreakerFailures int // Current failure count
CircuitBreakerLastFail time.Time // Last failure timestamp
}
HostStatus represents the current rate limit state for a single host.
type InterceptedEntry ¶
type InterceptedEntry struct {
Request InterceptedRequest `json:"request"`
Response *InterceptedResponse `json:"response,omitempty"` // nil if response not received
Duration time.Duration `json:"duration"` // Time between request and response
}
InterceptedEntry combines a request/response pair with timing data.
type InterceptedRequest ¶
type InterceptedRequest struct {
RequestID string `json:"requestId"` // Unique identifier
URL string `json:"url"` // Request URL
Method string `json:"method"` // HTTP method
Headers map[string]string `json:"headers"` // Request headers
Body string `json:"body"` // Request body (base64 if binary)
BodySize int64 `json:"bodySize"` // Original body size
Timestamp time.Time `json:"timestamp"` // When request was sent
ResourceType InterceptedResourceType `json:"resourceType"` // Type of resource
}
InterceptedRequest represents a captured network request.
type InterceptedResourceType ¶
type InterceptedResourceType string
InterceptedResourceType represents the type of network resource.
const (
ResourceTypeXHR InterceptedResourceType = "xhr"
ResourceTypeFetch InterceptedResourceType = "fetch"
ResourceTypeDocument InterceptedResourceType = "document"
ResourceTypeScript InterceptedResourceType = "script"
ResourceTypeStylesheet InterceptedResourceType = "stylesheet"
ResourceTypeImage InterceptedResourceType = "image"
ResourceTypeMedia InterceptedResourceType = "media"
ResourceTypeFont InterceptedResourceType = "font"
ResourceTypeWebSocket InterceptedResourceType = "websocket"
ResourceTypeOther InterceptedResourceType = "other"
)
type InterceptedResponse ¶
type InterceptedResponse struct {
RequestID string `json:"requestId"` // Matches request
Status int `json:"status"` // HTTP status code
StatusText string `json:"statusText"` // HTTP status text
Headers map[string]string `json:"headers"` // Response headers
Body string `json:"body"` // Response body (base64 if binary)
BodySize int64 `json:"bodySize"` // Size of response body
Timestamp time.Time `json:"timestamp"` // When response received
}
InterceptedResponse represents a captured network response.
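The Body and BodySize fields on captured requests and responses suggest a capture step that truncates to a size cap and base64-encodes binary payloads. A minimal sketch of such a step follows. The `captureBody` helper is hypothetical, and treating "binary" as "not valid UTF-8" is an assumption; the interceptor may instead decide based on Content-Type or on what the browser reports.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"unicode/utf8"
)

// captureBody returns the body to store and the original size in bytes.
// The body is cut to maxBytes (when maxBytes > 0) and base64-encoded if the
// retained bytes are not valid UTF-8.
func captureBody(raw []byte, maxBytes int64) (body string, size int64) {
	size = int64(len(raw))
	if maxBytes > 0 && size > maxBytes {
		raw = raw[:maxBytes]
	}
	if utf8.Valid(raw) {
		return string(raw), size
	}
	return base64.StdEncoding.EncodeToString(raw), size
}

func main() {
	body, size := captureBody([]byte("hello world"), 5)
	fmt.Println(body, size)
}
```

One caveat of this simple rule: truncating in the middle of a multi-byte character makes otherwise textual data fail the UTF-8 check, so it would be stored as base64.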
type JSHeaviness ¶
type JSHeaviness struct {
Score float64 `json:"score"`
Reasons []string `json:"reasons"`
ScriptTagCount int `json:"scriptTagCount"`
BodyTextLength int `json:"bodyTextLength"`
RootDivSignals int `json:"rootDivSignals"`
FrameworkSignals int `json:"frameworkSignals"`
}
func DetectJSHeaviness ¶
func DetectJSHeaviness(html string) JSHeaviness
DetectJSHeaviness analyzes HTML content to determine if it requires JavaScript to render meaningful content.
type MetricsCallback ¶
MetricsCallback is the function signature for metrics collection callbacks.
type NetworkInterceptConfig ¶
type NetworkInterceptConfig struct {
Enabled bool `json:"enabled"` // Toggle interception
URLPatterns []string `json:"urlPatterns"` // Glob patterns for URLs to intercept (e.g., "**/api/**", "*.json")
ResourceTypes []InterceptedResourceType `json:"resourceTypes"` // Resource types to capture
CaptureRequestBody bool `json:"captureRequestBody"` // Whether to capture request bodies
CaptureResponseBody bool `json:"captureResponseBody"` // Whether to capture response bodies
MaxBodySize int64 `json:"maxBodySize"` // Max bytes to capture per body (default 1MB)
MaxEntries int `json:"maxEntries"` // Max number of entries to capture (default 1000)
}
NetworkInterceptConfig defines configuration for network request/response interception. Used to capture XHR/Fetch API traffic from SPAs for API scraping.
func DefaultNetworkInterceptConfig ¶
func DefaultNetworkInterceptConfig() NetworkInterceptConfig
DefaultNetworkInterceptConfig returns a default configuration with sensible limits.
type OAuth2AuthConfig ¶
type OAuth2AuthConfig struct {
// ProfileName is the name of the auth profile with OAuth2 configuration
ProfileName string `json:"profileName,omitempty"`
// AccessToken is a static access token (optional - if not set, will be loaded from store)
AccessToken string `json:"accessToken,omitempty"`
// TokenType is the token type (e.g., "Bearer"). Defaults to "Bearer" if not set.
TokenType string `json:"tokenType,omitempty"`
}
OAuth2AuthConfig defines OAuth 2.0 authentication configuration for fetch operations.
type Orientation ¶
type Orientation string
Orientation represents the device screen orientation.
const (
OrientationPortrait Orientation = "portrait"
OrientationLandscape Orientation = "landscape"
)
type PlaywrightFetcher ¶
type PlaywrightFetcher struct {
// contains filtered or unexported fields
}
func (*PlaywrightFetcher) Close ¶
func (f *PlaywrightFetcher) Close() error
func (*PlaywrightFetcher) Fetch ¶
func (f *PlaywrightFetcher) Fetch(ctx context.Context, req Request, prof RenderProfile) (Result, error)
func (*PlaywrightFetcher) SetProxyPool ¶
func (f *PlaywrightFetcher) SetProxyPool(pool *ProxyPool)
SetProxyPool sets the proxy pool for this fetcher.
type ProxyConfig ¶
type ProxyConfig struct {
URL string `json:"url,omitempty"` // Proxy URL (http://, https://, socks5://)
Username string `json:"username,omitempty"` // Username for proxy authentication
Password string `json:"password,omitempty"` // Password for proxy authentication
}
ProxyConfig defines proxy settings for fetch operations.
type ProxyEntry ¶
type ProxyEntry struct {
ID string `json:"id"`
URL string `json:"url"`
Username string `json:"username,omitempty"`
Password string `json:"password,omitempty"`
Region string `json:"region,omitempty"`
Tags []string `json:"tags,omitempty"`
Weight int `json:"weight,omitempty"`
MaxRequests int `json:"max_requests,omitempty"`
}
ProxyEntry represents a single proxy in the pool.
func (ProxyEntry) ToProxyConfig ¶
func (e ProxyEntry) ToProxyConfig() ProxyConfig
ToProxyConfig converts ProxyEntry to ProxyConfig for use with existing fetchers.
type ProxyPool ¶
type ProxyPool struct {
// contains filtered or unexported fields
}
ProxyPool manages a collection of proxies with rotation.
func LoadProxyPoolFromFile ¶
LoadProxyPoolFromFile loads a proxy pool from a JSON configuration file.
func NewProxyPool ¶
func NewProxyPool(config ProxyPoolConfig) (*ProxyPool, error)
NewProxyPool creates a new proxy pool from configuration.
func ProxyPoolFromConfig ¶
ProxyPoolFromConfig creates a proxy pool from configured startup settings. A missing file is ignored silently only when the caller marks the path as non-required.
func (*ProxyPool) GetEntries ¶
func (p *ProxyPool) GetEntries() []ProxyEntry
GetEntries returns a copy of all proxy entries.
func (*ProxyPool) GetHealthyProxyCount ¶
GetHealthyProxyCount returns the number of healthy proxies.
func (*ProxyPool) GetProxyStats ¶
func (p *ProxyPool) GetProxyStats(proxyID string) (ProxyStats, bool)
GetProxyStats returns stats for a specific proxy.
func (*ProxyPool) GetStats ¶
func (p *ProxyPool) GetStats() map[string]ProxyStats
GetStats returns current stats for all proxies.
func (*ProxyPool) GetStrategy ¶
func (p *ProxyPool) GetStrategy() RotationStrategy
GetStrategy returns the current rotation strategy.
func (*ProxyPool) GetTotalProxyCount ¶
GetTotalProxyCount returns the total number of proxies.
func (*ProxyPool) RecordFailure ¶
RecordFailure updates stats for a failed proxy request.
func (*ProxyPool) RecordSuccess ¶
RecordSuccess updates stats for a successful proxy request.
func (*ProxyPool) Select ¶
func (p *ProxyPool) Select(hints ProxySelectionHints) (ProxyEntry, error)
Select returns a proxy based on the configured rotation strategy. Returns an error if no healthy proxies are available.
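For orientation, here is a minimal round-robin selection that skips unhealthy and excluded proxies and fails when nothing is eligible. It corresponds to only one of the rotation strategies, and the `proxy` and `roundRobin` types, the `selectProxy` method, and the error value are hypothetical stand-ins rather than the pool's internals. The sketch is also not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// proxy is a reduced stand-in for a pool entry plus its health flag.
type proxy struct {
	ID      string
	Healthy bool
}

// roundRobin keeps the index of the next candidate to try.
type roundRobin struct {
	proxies []proxy
	next    int
}

var errNoHealthyProxy = errors.New("no healthy proxies available")

// selectProxy scans at most one full cycle starting at r.next and returns the
// first healthy proxy whose ID is not excluded.
func (r *roundRobin) selectProxy(exclude map[string]bool) (proxy, error) {
	for i := 0; i < len(r.proxies); i++ {
		idx := (r.next + i) % len(r.proxies)
		p := r.proxies[idx]
		if p.Healthy && !exclude[p.ID] {
			r.next = (idx + 1) % len(r.proxies)
			return p, nil
		}
	}
	return proxy{}, errNoHealthyProxy
}

func main() {
	r := &roundRobin{proxies: []proxy{{"a", true}, {"b", false}, {"c", true}}}
	for i := 0; i < 3; i++ {
		p, err := r.selectProxy(nil)
		fmt.Println(p.ID, err)
	}
}
```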
func (*ProxyPool) SetStrategy ¶
func (p *ProxyPool) SetStrategy(strategy RotationStrategy)
SetStrategy changes the rotation strategy.
type ProxyPoolConfig ¶
type ProxyPoolConfig struct {
DefaultStrategy string `json:"default_strategy"`
HealthCheck HealthCheckConfig `json:"health_check,omitempty"`
Proxies []ProxyEntry `json:"proxies"`
}
ProxyPoolConfig is the configuration file format for proxy pools.
type ProxySelectionHints ¶
type ProxySelectionHints struct {
PreferredRegion string `json:"preferred_region,omitempty"`
RequiredTags []string `json:"required_tags,omitempty"`
ExcludeProxyIDs []string `json:"exclude_proxy_ids,omitempty"`
}
ProxySelectionHints provides hints for proxy selection.
func NormalizeProxySelectionHints ¶
func NormalizeProxySelectionHints(hints *ProxySelectionHints) *ProxySelectionHints
NormalizeProxySelectionHints trims and deduplicates proxy selection hints.
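Trimming and deduplicating a hint list such as RequiredTags or ExcludeProxyIDs could look like the sketch below. The `normalize` helper is hypothetical, and dropping empty entries, preserving first-occurrence order, and comparing case-sensitively are assumptions about details the description does not spell out.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize trims whitespace, drops empty entries, and removes duplicates
// while keeping the first occurrence of each value in its original position.
func normalize(values []string) []string {
	seen := make(map[string]bool, len(values))
	out := make([]string, 0, len(values))
	for _, v := range values {
		v = strings.TrimSpace(v)
		if v == "" || seen[v] {
			continue
		}
		seen[v] = true
		out = append(out, v)
	}
	return out
}

func main() {
	fmt.Println(normalize([]string{" us ", "eu", "us", ""}))
}
```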
type ProxyStats ¶
type ProxyStats struct {
RequestCount uint64 `json:"request_count"`
SuccessCount uint64 `json:"success_count"`
FailureCount uint64 `json:"failure_count"`
LastUsed time.Time `json:"last_used"`
LastFailed time.Time `json:"last_failed,omitempty"`
AvgLatencyMs int64 `json:"avg_latency_ms"`
ConsecutiveFails int `json:"consecutive_fails"`
IsHealthy bool `json:"is_healthy"`
}
ProxyStats tracks usage and health for a proxy.
func (ProxyStats) SuccessRate ¶
func (s ProxyStats) SuccessRate() float64
SuccessRate returns the success rate as a percentage (0-100).
type RateLimitInfo ¶
type RateLimitInfo struct {
Limit int // Maximum requests allowed in the window
Remaining int // Requests remaining in current window
Reset time.Time // When the rate limit window resets
Window time.Duration // Optional: window duration if known
}
RateLimitInfo holds parsed rate limit data from various header formats. It represents the server's rate limit policy and current state.
func ExtractRateLimitInfo ¶
func ExtractRateLimitInfo(headers http.Header) (RateLimitInfo, bool)
ExtractRateLimitInfo tries all known header formats and returns the best available data. Priority:
1. RateLimit header (IETF draft-ietf-httpapi-ratelimit-headers, preferred standard)
2. X-RateLimit-* headers (common API patterns)
3. RateLimit-Policy header (for window info only)
Returns (info, true) if any rate limit headers were found, (empty, false) otherwise.
func ParseRateLimitHeader ¶
func ParseRateLimitHeader(header string) (RateLimitInfo, error)
ParseRateLimitHeader parses the RateLimit header (IETF draft-ietf-httpapi-ratelimit-headers). Format:
RateLimit: limit=100, remaining=50, reset=60
The reset value can be:
- Delta-seconds (integer): seconds until reset
- Unix timestamp (integer >= 1e9): absolute reset time
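A minimal parser for the format and the two reset interpretations above might look like this. The `parseRateLimit` function and the local `rateLimitInfo` type are illustrative stand-ins; passing `now` explicitly and silently skipping unknown parameters are choices made for the sketch, not documented behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// rateLimitInfo mirrors the RateLimitInfo fields filled by the sketch.
type rateLimitInfo struct {
	Limit     int
	Remaining int
	Reset     time.Time
}

// parseRateLimit parses a value such as "limit=100, remaining=50, reset=60".
// A reset below 1e9 is treated as delta-seconds from now; otherwise it is
// treated as an absolute Unix timestamp.
func parseRateLimit(header string, now time.Time) (rateLimitInfo, error) {
	var info rateLimitInfo
	for _, part := range strings.Split(header, ",") {
		key, val, ok := strings.Cut(strings.TrimSpace(part), "=")
		if !ok {
			return info, fmt.Errorf("malformed parameter %q", part)
		}
		key = strings.ToLower(strings.TrimSpace(key))
		if key != "limit" && key != "remaining" && key != "reset" {
			continue // ignore parameters the sketch does not know
		}
		n, err := strconv.ParseInt(strings.TrimSpace(val), 10, 64)
		if err != nil {
			return info, fmt.Errorf("parameter %q: %w", key, err)
		}
		switch key {
		case "limit":
			info.Limit = int(n)
		case "remaining":
			info.Remaining = int(n)
		case "reset":
			if n >= 1e9 {
				info.Reset = time.Unix(n, 0)
			} else {
				info.Reset = now.Add(time.Duration(n) * time.Second)
			}
		}
	}
	return info, nil
}

func main() {
	info, err := parseRateLimit("limit=100, remaining=50, reset=60", time.Now())
	fmt.Println(info.Limit, info.Remaining, info.Reset, err)
}
```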
func ParseXRateLimitHeaders ¶
func ParseXRateLimitHeaders(headers http.Header) (RateLimitInfo, error)
ParseXRateLimitHeaders parses common X-RateLimit-* header variants. Supports GitHub, Twitter, and other common API patterns:
- X-RateLimit-Limit: maximum requests allowed
- X-RateLimit-Remaining: requests remaining in current window
- X-RateLimit-Reset: reset time (Unix timestamp or HTTP date)
Some APIs use different prefixes (e.g., x-ratelimit-* lowercase). This function checks both canonical and lowercase forms.
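The canonical-versus-lowercase lookup matters because `http.Header.Get` canonicalizes the key it is given, so a header map populated with raw lowercase keys is not found through `Get` alone. The sketch below illustrates that lookup together with the two reset formats. The `headerValue`, `parseReset`, and `parseXRateLimit` helpers are hypothetical, and requiring both Limit and Remaining to be present is an assumption of the example.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// headerValue tries the canonical form via Get, then a raw lowercase key.
func headerValue(h http.Header, name string) string {
	if v := h.Get(name); v != "" {
		return v
	}
	if vals := h[strings.ToLower(name)]; len(vals) > 0 {
		return vals[0]
	}
	return ""
}

// parseReset accepts either a Unix timestamp or an HTTP date.
func parseReset(v string) (time.Time, error) {
	if secs, err := strconv.ParseInt(v, 10, 64); err == nil {
		return time.Unix(secs, 0), nil
	}
	return http.ParseTime(v)
}

// parseXRateLimit reads the three X-RateLimit-* headers.
func parseXRateLimit(h http.Header) (limit, remaining int, reset time.Time, err error) {
	limit, err = strconv.Atoi(headerValue(h, "X-RateLimit-Limit"))
	if err != nil {
		return 0, 0, time.Time{}, fmt.Errorf("limit: %w", err)
	}
	remaining, err = strconv.Atoi(headerValue(h, "X-RateLimit-Remaining"))
	if err != nil {
		return 0, 0, time.Time{}, fmt.Errorf("remaining: %w", err)
	}
	if v := headerValue(h, "X-RateLimit-Reset"); v != "" {
		reset, err = parseReset(v)
		if err != nil {
			return 0, 0, time.Time{}, fmt.Errorf("reset: %w", err)
		}
	}
	return limit, remaining, reset, nil
}

func main() {
	h := http.Header{}
	h.Set("X-RateLimit-Limit", "60")
	h.Set("X-RateLimit-Remaining", "59")
	h.Set("X-RateLimit-Reset", "1700000000")
	fmt.Println(parseXRateLimit(h))
}
```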
func (*RateLimitInfo) IsRateLimited ¶
func (r *RateLimitInfo) IsRateLimited() bool
IsRateLimited returns true if the rate limit has been exceeded (Remaining <= 0). Returns false if no rate limit information is available.
func (*RateLimitInfo) TimeUntilReset ¶
func (r *RateLimitInfo) TimeUntilReset() time.Duration
TimeUntilReset returns the duration until the rate limit resets. Returns 0 if reset time is not set or has already passed.
func (*RateLimitInfo) UsagePercent ¶
func (r *RateLimitInfo) UsagePercent() float64
UsagePercent returns the percentage of rate limit used (0-100). Returns -1 if limit information is not available.
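The arithmetic behind IsRateLimited and UsagePercent is small enough to show directly. In this sketch the local `rateLimitInfo` type and its methods are stand-ins, "no rate limit information" is interpreted as Limit <= 0, and the used count is clamped to the range 0..Limit; those three points are assumptions.

```go
package main

import "fmt"

// rateLimitInfo mirrors the two RateLimitInfo fields the helpers need.
type rateLimitInfo struct {
	Limit     int
	Remaining int
}

// isRateLimited reports whether the window is exhausted. It returns false
// when no limit is known.
func (r rateLimitInfo) isRateLimited() bool {
	return r.Limit > 0 && r.Remaining <= 0
}

// usagePercent returns the share of the limit already consumed (0-100),
// or -1 when no limit is known.
func (r rateLimitInfo) usagePercent() float64 {
	if r.Limit <= 0 {
		return -1
	}
	used := r.Limit - r.Remaining
	if used < 0 {
		used = 0
	}
	if used > r.Limit {
		used = r.Limit
	}
	return float64(used) / float64(r.Limit) * 100
}

func main() {
	info := rateLimitInfo{Limit: 100, Remaining: 25}
	fmt.Println(info.usagePercent(), info.isRateLimited())
}
```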
type RenderBlockPolicy ¶
type RenderBlockPolicy struct {
ResourceTypes []BlockedResourceType `json:"resourceTypes,omitempty"`
URLPatterns []string `json:"urlPatterns,omitempty"` // glob-style patterns
}
type RenderEngine ¶
type RenderEngine string
const (
RenderEngineHTTP RenderEngine = "http"
RenderEngineChromedp RenderEngine = "chromedp"
RenderEnginePlaywright RenderEngine = "playwright"
)
type RenderProfile ¶
type RenderProfile struct {
Name string `json:"name"`
HostPatterns []string `json:"hostPatterns"` // match against URL host, glob-style ("example.com", "*.example.com")
// If set, overrides engine selection entirely.
ForceEngine RenderEngine `json:"forceEngine,omitempty"`
// If true, skip HTTP probe and go straight to headless engine selection.
PreferHeadless bool `json:"preferHeadless,omitempty"`
// If true, treat every page on this host as JS-heavy (forces escalation if not forced to HTTP).
AssumeJSHeavy bool `json:"assumeJsHeavy,omitempty"`
// If true, never escalate (forces HTTP).
NeverHeadless bool `json:"neverHeadless,omitempty"`
// Overrides default JS-heavy threshold for this host (0..1). 0 means use global default.
JSHeavyThreshold float64 `json:"jsHeavyThreshold,omitempty"`
// Rate limiting configuration for this profile (0 = use global defaults).
RateLimitQPS int `json:"rateLimitQPS,omitempty"`
RateLimitBurst int `json:"rateLimitBurst,omitempty"`
Block RenderBlockPolicy `json:"block,omitempty"`
Wait RenderWaitPolicy `json:"wait,omitempty"`
Timeouts RenderTimeoutPolicy `json:"timeouts,omitempty"`
Screenshot ScreenshotConfig `json:"screenshot,omitempty"`
Device *DeviceEmulation `json:"device,omitempty"` // Device emulation for this profile
// CaptchaConfig defines CAPTCHA handling for this profile.
CaptchaConfig *captcha.CaptchaConfig `json:"captchaConfig,omitempty"`
}
func GetRenderProfile ¶
func GetRenderProfile(dataDir, name string) (RenderProfile, bool, error)
GetRenderProfile retrieves a single profile by name. Returns (profile, true, nil) if found, (zero, false, nil) if not found.
type RenderProfileStore ¶
type RenderProfileStore struct {
// contains filtered or unexported fields
}
func NewRenderProfileStore ¶
func NewRenderProfileStore(dataDir string) *RenderProfileStore
NewRenderProfileStore initializes a new store. It attempts to load profiles immediately.
func (*RenderProfileStore) GetRateLimitsForURL ¶
func (s *RenderProfileStore) GetRateLimitsForURL(rawURL string) (qps int, burst int)
GetRateLimitsForURL returns the rate limit configuration for a given URL. Returns (0, 0) if no matching profile or if profile has no rate limits set.
func (*RenderProfileStore) MatchURL ¶
func (s *RenderProfileStore) MatchURL(rawURL string) (*RenderProfile, bool, error)
MatchURL returns the highest-precedence matching profile for a given URL. Precedence: first match in file order (user-controlled), deterministic.
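First-match-in-file-order precedence can be sketched as a nested loop over profiles and their host patterns. The example uses the standard library's `path.Match` for the glob step, which is an assumption: the store's actual glob semantics (for example whether "*.example.com" also matches "example.com") may differ. The `profile` type and `matchURL` function are local stand-ins.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// profile mirrors the two RenderProfile fields needed for matching.
type profile struct {
	Name         string
	HostPatterns []string
}

// matchURL returns the first profile, in slice order, with a host pattern
// that matches the URL's host.
func matchURL(profiles []profile, rawURL string) (*profile, bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, false, err
	}
	host := strings.ToLower(u.Hostname())
	for i := range profiles {
		for _, pat := range profiles[i].HostPatterns {
			ok, err := path.Match(strings.ToLower(pat), host)
			if err != nil {
				return nil, false, err
			}
			if ok {
				return &profiles[i], true, nil
			}
		}
	}
	return nil, false, nil
}

func main() {
	profiles := []profile{
		{Name: "docs", HostPatterns: []string{"docs.example.com"}},
		{Name: "wildcard", HostPatterns: []string{"*.example.com"}},
	}
	p, ok, err := matchURL(profiles, "https://docs.example.com/guide")
	fmt.Println(p.Name, ok, err)
}
```

Because the earliest entry wins, a specific host should be listed before a wildcard that would also cover it.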
func (*RenderProfileStore) Profiles ¶
func (s *RenderProfileStore) Profiles() []RenderProfile
Profiles returns a copy of all loaded profiles.
func (*RenderProfileStore) Reload ¶
func (s *RenderProfileStore) Reload() error
Reload loads profiles from disk. If the file is missing, the profile list becomes empty. Idempotent.
func (*RenderProfileStore) ReloadIfChanged ¶
func (s *RenderProfileStore) ReloadIfChanged() error
ReloadIfChanged checks file modification time and reloads if necessary.
type RenderProfilesFile ¶
type RenderProfilesFile struct {
Profiles []RenderProfile `json:"profiles"`
}
func LoadRenderProfilesFile ¶
func LoadRenderProfilesFile(dataDir string) (RenderProfilesFile, error)
LoadRenderProfilesFile loads the render profiles file from disk. If the file doesn't exist, returns an empty RenderProfilesFile. Uses strict JSON decoding - unknown fields cause a validation error.
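Strict decoding of this kind is typically done with `json.Decoder.DisallowUnknownFields` from the standard library. The sketch below shows the effect on a reduced file shape. The `profileEntry` and `profilesFile` types and the `decodeStrict` helper are stand-ins, and whether the loader uses exactly this mechanism is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// profileEntry and profilesFile are a reduced version of the file format.
type profileEntry struct {
	Name string `json:"name"`
}

type profilesFile struct {
	Profiles []profileEntry `json:"profiles"`
}

// decodeStrict rejects any JSON object key that has no matching struct field.
func decodeStrict(data string) (profilesFile, error) {
	var f profilesFile
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields()
	err := dec.Decode(&f)
	return f, err
}

func main() {
	_, err := decodeStrict(`{"profiles":[{"name":"docs","forceEngin":"http"}]}`)
	fmt.Println(err)
}
```

With strict decoding, a misspelled key such as the one in `main` surfaces as an error instead of being dropped silently.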
type RenderTimeoutPolicy ¶
type RenderTimeoutPolicy struct {
// Absolute cap for the entire render phase (headless only).
MaxRenderMs int `json:"maxRenderMs,omitempty"`
// Cap for in-page script evaluation/wait-for-function loops.
ScriptEvalMs int `json:"scriptEvalMs,omitempty"`
NavigationMs int `json:"navigationMs,omitempty"`
}
type RenderWaitMode ¶
type RenderWaitMode string
const (
RenderWaitModeDOMReady RenderWaitMode = "dom_ready" // DOMContentLoaded + body present
RenderWaitModeNetworkIdle RenderWaitMode = "network_idle" // inflight==0 for quiet window
RenderWaitModeStability RenderWaitMode = "stability" // body.innerText length stabilizes
RenderWaitModeSelector RenderWaitMode = "selector" // selector appears (and optional stability)
)
type RenderWaitPolicy ¶
type RenderWaitPolicy struct {
Mode RenderWaitMode `json:"mode,omitempty"`
// RenderWaitModeSelector
Selector string `json:"selector,omitempty"`
// RenderWaitModeNetworkIdle
NetworkIdleQuietMs int `json:"networkIdleQuietMs,omitempty"`
// RenderWaitModeStability
MinTextLength int `json:"minTextLength,omitempty"`
StabilityPollMs int `json:"stabilityPollMs,omitempty"`
StabilityIterations int `json:"stabilityIterations,omitempty"`
// Always applied after wait mode completes (final settle).
ExtraSleepMs int `json:"extraSleepMs,omitempty"`
}
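To make the stability fields concrete, the following sketch decides when a sequence of polled body.innerText lengths counts as stable. The rule used here (the length is unchanged for StabilityIterations consecutive polls and is at least MinTextLength) is an assumption inferred from the field names, and `stableAfter` is a hypothetical helper that works on pre-collected samples instead of polling a browser.

```go
package main

import "fmt"

// stableAfter returns the index of the poll at which the text length has been
// unchanged for `iterations` consecutive polls while being at least minLen.
// It returns -1 if the samples never stabilize.
func stableAfter(samples []int, minLen, iterations int) int {
	run := 0 // consecutive polls with no change in length
	for i, n := range samples {
		if i > 0 && n == samples[i-1] && n >= minLen {
			run++
		} else {
			run = 0
		}
		if run >= iterations {
			return i
		}
	}
	return -1
}

func main() {
	// One sample per StabilityPollMs tick.
	samples := []int{0, 120, 480, 480, 480}
	fmt.Println(stableAfter(samples, 100, 2))
}
```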
type Request ¶
type Request struct {
URL string
Method string // HTTP method (GET, POST, PUT, DELETE, PATCH, etc.)
Body []byte // Request body for POST/PUT/PATCH
ContentType string // Content-Type header for request body
Timeout time.Duration
UserAgent string
Headless bool
UsePlaywright bool
Auth AuthOptions
SessionID string // Reference to persisted session for cookie reuse
Limiter *HostLimiter
MaxRetries int
RetryBaseDelay time.Duration
MaxResponseBytes int64 `json:"maxResponseBytes,omitempty"`
IfNoneMatch string `json:"-"`
IfModifiedSince string `json:"-"`
DataDir string `json:"-"`
WaitSelectors []string `json:"-"`
Screenshot *ScreenshotConfig `json:"screenshot"`
Device *DeviceEmulation `json:"device,omitempty"` // Device emulation settings
NetworkIntercept *NetworkInterceptConfig `json:"networkIntercept,omitempty"` // Network interception config
}
type Result ¶
type Result struct {
URL string `json:"url"`
Status int `json:"status"`
HTML string `json:"html"`
FetchedAt time.Time `json:"fetchedAt"`
ETag string `json:"-"`
LastModified string `json:"-"`
Engine RenderEngine `json:"-"`
ScreenshotPath string `json:"screenshotPath,omitempty"` // Path to saved screenshot file
InterceptedData []InterceptedEntry `json:"interceptedData,omitempty"` // Captured network activity
// RateLimit contains parsed rate limit information from response headers.
// Populated when the server returns RateLimit (IETF draft-ietf-httpapi-ratelimit-headers) or X-RateLimit-* headers.
RateLimit *RateLimitInfo `json:"rateLimit,omitempty"`
}
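To illustrate how the RateLimit field could be populated, here is a minimal sketch that reads the common `X-RateLimit-Limit` and `X-RateLimit-Remaining` headers. The `rateLimitInfo` type, `headerOf`, and `parseRateLimit` are hypothetical. The real RateLimitInfo fields and parsing rules are not shown in this section.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
)

// rateLimitInfo is a hypothetical stand-in for the package's RateLimitInfo.
type rateLimitInfo struct {
	Limit     int
	Remaining int
}

// headerOf builds an http.Header from key/value pairs (illustration helper).
func headerOf(kv ...string) http.Header {
	h := http.Header{}
	for i := 0; i+1 < len(kv); i += 2 {
		h.Set(kv[i], kv[i+1])
	}
	return h
}

// parseRateLimit returns nil when either header is missing or not an integer,
// so callers can treat a nil pointer as "no rate limit information".
func parseRateLimit(h http.Header) *rateLimitInfo {
	limit, errLimit := strconv.Atoi(h.Get("X-RateLimit-Limit"))
	remaining, errRemaining := strconv.Atoi(h.Get("X-RateLimit-Remaining"))
	if errLimit != nil || errRemaining != nil {
		return nil
	}
	return &rateLimitInfo{Limit: limit, Remaining: remaining}
}

func main() {
	info := parseRateLimit(headerOf("X-RateLimit-Limit", "100", "X-RateLimit-Remaining", "42"))
	fmt.Println(info.Limit, info.Remaining)
	fmt.Println(parseRateLimit(headerOf()) == nil)
}
```

Returning a pointer that is nil when the headers are absent matches the `omitempty` tag on Result.RateLimit, which drops the field from JSON output in that case.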
type RetryConfig ¶
type RetryConfig struct {
MaxRetries int
BaseDelay time.Duration
MaxDelay time.Duration // Cap on delay (default: 60s)
Strategy BackoffStrategy // Backoff calculation strategy
RetryableCodes map[int]bool // Status codes that trigger retry (nil = use defaults)
RetryableErrors []error // Error types that trigger retry (empty = use defaults)
}
RetryConfig configures retry behavior with per-status-code policies and backoff strategies.
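As a rough illustration of how these fields interact, the sketch below computes an exponential delay capped at MaxDelay and checks a status code against RetryableCodes. It is an assumption-laden sketch, not the package's logic: `retryConfig`, `exampleConfig`, `backoffDelay`, and `shouldRetry` are hypothetical, the fallback status codes are assumed, and the BackoffStrategy type is omitted because it is not defined in this section.

```go
package main

import (
	"fmt"
	"time"
)

// retryConfig mirrors a subset of the documented RetryConfig fields.
type retryConfig struct {
	MaxRetries     int
	BaseDelay      time.Duration
	MaxDelay       time.Duration
	RetryableCodes map[int]bool
}

// exampleConfig returns illustrative values; only the 60s cap comes from the docs.
func exampleConfig() retryConfig {
	return retryConfig{MaxRetries: 3, BaseDelay: 500 * time.Millisecond, MaxDelay: 60 * time.Second}
}

// backoffDelay returns BaseDelay doubled once per attempt, capped at MaxDelay.
func backoffDelay(cfg retryConfig, attempt int) time.Duration {
	d := cfg.BaseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= cfg.MaxDelay {
			return cfg.MaxDelay
		}
	}
	if d > cfg.MaxDelay {
		return cfg.MaxDelay
	}
	return d
}

// shouldRetry reports whether a response status warrants another attempt.
func shouldRetry(cfg retryConfig, status, attempt int) bool {
	if attempt >= cfg.MaxRetries {
		return false
	}
	codes := cfg.RetryableCodes
	if codes == nil {
		// Assumed defaults for illustration; the package's defaults may differ.
		codes = map[int]bool{429: true, 502: true, 503: true, 504: true}
	}
	return codes[status]
}

func main() {
	cfg := exampleConfig()
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println(attempt, backoffDelay(cfg, attempt), shouldRetry(cfg, 503, attempt))
	}
}
```

Returning early once the cap is reached also keeps the doubling from overflowing time.Duration at large attempt counts.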
func DefaultRetryConfig ¶
func DefaultRetryConfig() RetryConfig
DefaultRetryConfig returns a RetryConfig with sensible defaults.
type RotationStrategy ¶
type RotationStrategy int
RotationStrategy defines the proxy selection algorithm.
const (
RotationRoundRobin RotationStrategy = iota
RotationRandom
RotationLeastUsed
RotationWeighted
RotationLeastLatency
)
func ParseRotationStrategy ¶
func ParseRotationStrategy(s string) RotationStrategy
ParseRotationStrategy parses a rotation strategy from a string.
func (RotationStrategy) String ¶
func (r RotationStrategy) String() string
String returns the string representation of the rotation strategy.
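The parse/String pair for an iota-based enum is commonly built on a single lookup table. The sketch below shows that pattern only. The string names (such as "round_robin") and the fallback to round robin for unrecognized input are assumptions, since the section does not state either.

```go
package main

import (
	"fmt"
	"strings"
)

type rotationStrategy int

const (
	rotationRoundRobin rotationStrategy = iota
	rotationRandom
	rotationLeastUsed
	rotationWeighted
	rotationLeastLatency
)

// strategyNames holds assumed string forms; the package's real names may differ.
var strategyNames = map[rotationStrategy]string{
	rotationRoundRobin:   "round_robin",
	rotationRandom:       "random",
	rotationLeastUsed:    "least_used",
	rotationWeighted:     "weighted",
	rotationLeastLatency: "least_latency",
}

// String returns the name for r, or "unknown" for out-of-range values.
func (r rotationStrategy) String() string {
	if name, ok := strategyNames[r]; ok {
		return name
	}
	return "unknown"
}

// parseRotationStrategy matches case-insensitively and falls back to
// round robin for unrecognized input (an assumed default).
func parseRotationStrategy(s string) rotationStrategy {
	want := strings.ToLower(strings.TrimSpace(s))
	for strategy, name := range strategyNames {
		if name == want {
			return strategy
		}
	}
	return rotationRoundRobin
}

func main() {
	fmt.Println(parseRotationStrategy("Least_Latency"))
	fmt.Println(parseRotationStrategy("bogus"))
}
```

Because the signature returns no error, some fallback value is unavoidable for bad input. Callers that need to detect typos would have to compare the round-tripped String() against what they passed in.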
type ScreenshotConfig ¶
type ScreenshotConfig struct {
Enabled bool `json:"enabled"` // Whether to capture screenshot
FullPage bool `json:"fullPage"` // Capture full page or just viewport
Format ScreenshotFormat `json:"format"` // png or jpeg
Quality int `json:"quality,omitempty"` // JPEG quality (1-100), ignored for PNG
Width int `json:"width,omitempty"` // Viewport width (0 = default)
Height int `json:"height,omitempty"` // Viewport height (0 = default)
Device *DeviceEmulation `json:"device,omitempty"` // Device emulation settings
}
ScreenshotConfig defines screenshot capture options for headless fetchers. Screenshots apply only to the chromedp and playwright engines, not to the HTTP fetcher.
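The struct tags determine which fields appear in serialized configuration. The sketch below mirrors the documented tags on a local copy of the struct to show what `omitempty` drops. The `deviceEmulation` type and `encodeScreenshot` helper are hypothetical stand-ins.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type screenshotFormat string

const (
	screenshotFormatPNG  screenshotFormat = "png"
	screenshotFormatJPEG screenshotFormat = "jpeg"
)

// deviceEmulation is a hypothetical stand-in for DeviceEmulation.
type deviceEmulation struct {
	Name string `json:"name"`
}

// screenshotConfig copies the documented field names and JSON tags.
type screenshotConfig struct {
	Enabled  bool             `json:"enabled"`
	FullPage bool             `json:"fullPage"`
	Format   screenshotFormat `json:"format"`
	Quality  int              `json:"quality,omitempty"`
	Width    int              `json:"width,omitempty"`
	Height   int              `json:"height,omitempty"`
	Device   *deviceEmulation `json:"device,omitempty"`
}

// encodeScreenshot marshals cfg to a JSON string.
func encodeScreenshot(cfg screenshotConfig) (string, error) {
	out, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	s, err := encodeScreenshot(screenshotConfig{
		Enabled:  true,
		FullPage: true,
		Format:   screenshotFormatJPEG,
		Quality:  80,
	})
	if err != nil {
		panic(err)
	}
	// Zero Width/Height and the nil Device are omitted from the output.
	fmt.Println(s)
}
```

Enabled, FullPage, and Format carry no `omitempty`, so they are always emitted, while a zero Width or Height (meaning "use the default") and a nil Device leave no trace in the JSON.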
type ScreenshotFormat ¶
type ScreenshotFormat string
const (
ScreenshotFormatPNG ScreenshotFormat = "png"
ScreenshotFormatJPEG ScreenshotFormat = "jpeg"
)
Source Files ¶
- adaptive_fetcher.go
- chromedp_auth.go
- chromedp_device.go
- chromedp_fetcher.go
- chromedp_intercept.go
- chromedp_network.go
- circuit_breaker.go
- close.go
- detect.go
- fetcher.go
- form_detect.go
- form_detect_classify.go
- form_detect_fields.go
- form_detect_heuristics.go
- form_detect_selectors.go
- form_detect_types.go
- form_detect_utils.go
- form_filler.go
- http_fetcher.go
- limiter.go
- playwright_device.go
- playwright_fetcher.go
- playwright_intercept.go
- playwright_screenshot.go
- playwright_session.go
- proxy_pool.go
- proxy_pool_health.go
- proxy_pool_persist.go
- proxy_pool_select.go
- proxy_pool_types.go
- ratelimit.go
- render_profiles.go
- render_profiles_mgmt.go
- render_profiles_store.go
- retry.go
- types.go
- types_auth.go
- types_device.go
- types_intercept.go