Documentation ¶
Index ¶
- Variables
- func BuildHttpClient(dialer *fastdialer.Dialer, options *types.Options, ...) (*retryablehttp.Client, *fastdialer.Dialer, error)
- type CrawlSession
- type DoRequestFunc
- type RedirectCallback
- type Shared
- func (s *Shared) Do(crawlSession *CrawlSession, doRequest DoRequestFunc) error
- func (s *Shared) Enqueue(queue *queue.Queue, navigationRequests ...*navigation.Request)
- func (s *Shared) NewCrawlSessionWithURL(URL string) (*CrawlSession, error)
- func (s *Shared) Output(navigationRequest *navigation.Request, navigationResponse *navigation.Response, ...)
- func (s *Shared) ValidateScope(URL string, root string) bool
Constants ¶
This section is empty.
Variables ¶
var ErrOutOfScope = errors.New("out of scope")
Functions ¶
func BuildHttpClient ¶ added in v1.0.0
func BuildHttpClient(dialer *fastdialer.Dialer, options *types.Options, redirectCallback RedirectCallback) (*retryablehttp.Client, *fastdialer.Dialer, error)
BuildHttpClient builds an HTTP client based on the provided options profile, using the given dialer; redirects followed by the client are reported through redirectCallback.
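As a usage sketch (the katana-style import paths, the zero-value options, and the nil redirect callback are all assumptions; these docs do not show the RedirectCallback signature):

package main

import (
	"log"

	"github.com/projectdiscovery/fastdialer/fastdialer"
	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
	"github.com/projectdiscovery/katana/pkg/types"         // assumed import path
)

func main() {
	// A fastdialer with library defaults; real crawls would tune these.
	dialer, err := fastdialer.NewDialer(fastdialer.DefaultOptions)
	if err != nil {
		log.Fatal(err)
	}
	defer dialer.Close()

	// Passing a nil redirect callback is an assumption; the callback's
	// signature is not documented above.
	httpClient, _, err := common.BuildHttpClient(dialer, &types.Options{}, nil)
	if err != nil {
		log.Fatal(err)
	}
	_ = httpClient
}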
Types ¶
type CrawlSession ¶ added in v1.0.1
type CrawlSession struct {
	Ctx        context.Context
	CancelFunc context.CancelFunc
	URL        *url.URL
	Hostname   string
	Queue      *queue.Queue
	HttpClient *retryablehttp.Client
	Browser    *rod.Browser
}
CrawlSession represents an active crawling session for a specific target URL. It maintains the session context, cancellation function, parsed URL information, the request queue, and HTTP/browser clients needed for the crawl operation.
type DoRequestFunc ¶ added in v1.0.1
type DoRequestFunc func(crawlSession *CrawlSession, req *navigation.Request) (*navigation.Response, error)
DoRequestFunc is a function type for executing navigation requests. Implementations should perform the actual HTTP request or browser navigation and return the response or an error. This allows different crawling strategies (standard HTTP vs. headless browser) to provide their own request logic.
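Because DoRequestFunc is an ordinary function type, implementations compose. A minimal sketch, assuming katana-style import paths, that wraps any existing DoRequestFunc with error logging while leaving the request strategy to the wrapped function:

package crawl

import (
	"log"

	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
	"github.com/projectdiscovery/katana/pkg/navigation"    // assumed import path
)

// withLogging decorates an existing DoRequestFunc with basic error logging.
func withLogging(next common.DoRequestFunc) common.DoRequestFunc {
	return func(crawlSession *common.CrawlSession, req *navigation.Request) (*navigation.Response, error) {
		resp, err := next(crawlSession, req)
		if err != nil {
			log.Printf("navigation request failed: %v", err)
		}
		return resp, err
	}
}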
type RedirectCallback ¶ added in v1.0.0
RedirectCallback is the callback type accepted by BuildHttpClient, invoked as the built client follows HTTP redirects.
type Shared ¶ added in v1.0.1
type Shared struct {
	// contains filtered or unexported fields
}
Shared represents the shared state and configuration used across all crawl sessions. It maintains common resources like HTTP headers, cookie jars, known files database, and crawler options that are reused for efficiency across multiple crawl operations.
func NewShared ¶ added in v1.0.1
func NewShared(options *types.CrawlerOptions) (*Shared, error)
NewShared creates a new Shared instance with the provided crawler options. It initializes the HTTP headers, known files database (if configured), and an empty cookie jar. Returns an error if the HTTP client or cookie jar creation fails.
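A sketch of the setup, assuming katana-style import paths and that a types.NewCrawlerOptions constructor builds CrawlerOptions from the Options values referenced throughout these docs (the constructor name and the MaxDepth and Concurrency field names are assumptions):

package crawl

import (
	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
	"github.com/projectdiscovery/katana/pkg/types"         // assumed import path
)

func newShared() (*common.Shared, error) {
	// Field and constructor names are assumptions based on the settings
	// mentioned in these docs.
	opts := &types.Options{
		MaxDepth:    2,
		Concurrency: 10,
	}
	crawlerOpts, err := types.NewCrawlerOptions(opts)
	if err != nil {
		return nil, err
	}
	return common.NewShared(crawlerOpts)
}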
func (*Shared) Do ¶ added in v1.0.1
func (s *Shared) Do(crawlSession *CrawlSession, doRequest DoRequestFunc) error
Do executes the main crawling loop for the given crawl session. It processes items from the queue concurrently (respecting the Concurrency limit), validates each request (URL format, path filters, scope), applies rate limiting and delays, executes the request using the provided doRequest function, writes results to output, and enqueues any newly discovered URLs from responses.
The method returns when the queue is empty or the session context is cancelled (due to timeout or manual cancellation). Returns an error if the context is cancelled.
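An end-to-end sketch under the same assumed import paths: create a session, ensure its context is cancelled, and drive the loop with a caller-supplied DoRequestFunc:

package crawl

import (
	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
)

func runCrawl(shared *common.Shared, target string, doRequest common.DoRequestFunc) error {
	session, err := shared.NewCrawlSessionWithURL(target)
	if err != nil {
		return err
	}
	// Release the session context once the queue is drained or cancelled.
	defer session.CancelFunc()

	// Do blocks until the queue is empty or the session context ends.
	return shared.Do(session, doRequest)
}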
func (*Shared) Enqueue ¶ added in v1.0.1
func (s *Shared) Enqueue(queue *queue.Queue, navigationRequests ...*navigation.Request)
Enqueue adds one or more navigation requests to the crawl queue after applying validation checks. The method performs the following checks in order:
- URL format validation
- Query parameter handling (if IgnoreQueryParams is enabled)
- Depth filtering - URLs exceeding MaxDepth are skipped before the uniqueness check, so they are not cached as seen and can still be processed if discovered later at a valid depth via a different path
- Uniqueness filtering - prevents duplicate URL crawling
- Cycle detection - identifies URLs stuck in redirect loops
- Scope validation - ensures URLs belong to the allowed crawl scope
For in-scope URLs, the method also handles path climbing when enabled, extracting and enqueuing parent directory paths. Out-of-scope URLs are sent to output if DisplayOutScope is enabled.
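As an illustration of seeding extra requests into an existing session (the navigation.Request field names Method, URL, and Depth are assumptions not shown in these docs):

package crawl

import (
	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
	"github.com/projectdiscovery/katana/pkg/navigation"    // assumed import path
)

func seedSitemap(shared *common.Shared, session *common.CrawlSession) {
	// Hypothetical literal: these field names are assumptions.
	seed := &navigation.Request{
		Method: "GET",
		URL:    "https://example.com/sitemap.xml",
		Depth:  0,
	}
	shared.Enqueue(session.Queue, seed)
}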
func (*Shared) NewCrawlSessionWithURL ¶ added in v1.0.1
func (s *Shared) NewCrawlSessionWithURL(URL string) (*CrawlSession, error)
NewCrawlSessionWithURL creates and initializes a new crawl session for the specified URL. It performs the following initialization steps:
- Creates a context with optional timeout based on CrawlDuration setting
- Parses the target URL and extracts the hostname
- Initializes the request queue with the configured strategy
- Enqueues the initial URL and any known files for the target
- Sets up the HTTP client with response parsing callbacks
Returns the initialized CrawlSession or an error if initialization fails.
func (*Shared) Output ¶ added in v1.0.1
func (s *Shared) Output(navigationRequest *navigation.Request, navigationResponse *navigation.Response, err error)
Output writes a crawl result to the configured output writer. It creates a Result object containing the navigation request, response (if any), and error information (if any), then writes it to the output writer. If an OnResult callback is configured and output writing succeeds, the callback is invoked.
func (*Shared) ValidateScope ¶ added in v1.0.1
func (s *Shared) ValidateScope(URL string, root string) bool
ValidateScope checks whether a given URL is within the allowed crawling scope based on the configured scope rules and the root hostname. Returns true if the URL passes scope validation, false otherwise.
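For example (values illustrative only; the result depends on the configured scope rules):

package crawl

import (
	"fmt"

	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
)

func checkScope(shared *common.Shared) {
	if shared.ValidateScope("https://example.com/admin/", "example.com") {
		fmt.Println("in scope")
	}
}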