common

package v1.3.0
Published: Dec 1, 2025 License: MIT Imports: 27 Imported by: 3

Documentation

Index

Constants

This section is empty.

Variables

var ErrOutOfScope = errors.New("out of scope")

Functions

func BuildHttpClient added in v1.0.0

func BuildHttpClient(dialer *fastdialer.Dialer, options *types.Options, redirectCallback RedirectCallback) (*retryablehttp.Client, *fastdialer.Dialer, error)

BuildHttpClient builds an HTTP client based on a profile.
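
As a rough usage sketch (the import paths, the fastdialer.NewDialer/DefaultOptions constructor, and the empty types.Options literal are assumptions, not something documented on this page):

package main

import (
	"log"
	"net/http"

	"github.com/projectdiscovery/fastdialer/fastdialer"    // assumed import path
	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path for this package
	"github.com/projectdiscovery/katana/pkg/types"         // assumed import path
)

func main() {
	// Assumption: fastdialer exposes NewDialer and DefaultOptions.
	dialer, err := fastdialer.NewDialer(fastdialer.DefaultOptions)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder options; populate according to your configuration.
	options := &types.Options{}

	// Redirect callback with the documented signature.
	onRedirect := func(resp *http.Response, depth int) {
		log.Printf("redirect at depth %d: %s", depth, resp.Request.URL)
	}

	client, dialer, err := common.BuildHttpClient(dialer, options, onRedirect)
	if err != nil {
		log.Fatal(err)
	}
	_ = client
	_ = dialer
}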

Types

type CrawlSession added in v1.0.1

type CrawlSession struct {
	Ctx        context.Context
	CancelFunc context.CancelFunc
	URL        *url.URL
	Hostname   string
	Queue      *queue.Queue
	HttpClient *retryablehttp.Client
	Browser    *rod.Browser
}

CrawlSession represents an active crawling session for a specific target URL. It maintains the session context, cancellation function, parsed URL information, the request queue, and HTTP/browser clients needed for the crawl operation.

type DoRequestFunc added in v1.0.1

type DoRequestFunc func(crawlSession *CrawlSession, req *navigation.Request) (*navigation.Response, error)

DoRequestFunc is a function type for executing navigation requests. Implementations should perform the actual HTTP request or browser navigation and return the response or an error. This allows different crawling strategies (standard HTTP vs. headless browser) to provide their own request logic.
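
A skeleton with the required shape (import paths are assumed; the body is deliberately left as a placeholder, since the internals of navigation.Request and navigation.Response are not documented on this page):

package crawler

import (
	"errors"

	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
	"github.com/projectdiscovery/katana/pkg/navigation"    // assumed import path
)

// standardDoRequest satisfies DoRequestFunc. A real implementation would issue
// the request with crawlSession.HttpClient (standard crawling) or drive
// crawlSession.Browser (headless crawling) and build a navigation.Response.
var standardDoRequest common.DoRequestFunc = func(crawlSession *common.CrawlSession, req *navigation.Request) (*navigation.Response, error) {
	return nil, errors.New("not implemented")
}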

type RedirectCallback added in v1.0.0

type RedirectCallback func(resp *http.Response, depth int)

type Shared added in v1.0.1

type Shared struct {
	Headers    map[string]string
	KnownFiles *files.KnownFiles
	Options    *types.CrawlerOptions
	Jar        *httputil.CookieJar
}

Shared represents the shared state and configuration used across all crawl sessions. It maintains common resources like HTTP headers, cookie jars, known files database, and crawler options that are reused for efficiency across multiple crawl operations.

func NewShared added in v1.0.1

func NewShared(options *types.CrawlerOptions) (*Shared, error)

NewShared creates a new Shared instance with the provided crawler options. It initializes the HTTP headers, known files database (if configured), and an empty cookie jar. Returns an error if the HTTP client or cookie jar creation fails.
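
A minimal construction sketch (import paths and the empty CrawlerOptions literal are assumptions; a fully populated CrawlerOptions is normally produced by the crawler's own option handling):

package main

import (
	"log"

	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
	"github.com/projectdiscovery/katana/pkg/types"         // assumed import path
)

func main() {
	crawlerOptions := &types.CrawlerOptions{} // placeholder; populate before real use

	shared, err := common.NewShared(crawlerOptions)
	if err != nil {
		log.Fatal(err)
	}
	_ = shared
}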

func (*Shared) Do added in v1.0.1

func (s *Shared) Do(crawlSession *CrawlSession, doRequest DoRequestFunc) error

Do executes the main crawling loop for the given crawl session. It processes items from the queue concurrently (respecting the Concurrency limit), validates each request (URL format, path filters, scope), applies rate limiting and delays, executes the request using the provided doRequest function, writes results to output, and enqueues any newly discovered URLs from responses.

The method returns when the queue is empty or the session context is cancelled (due to timeout or manual cancellation). Returns an error if the context is cancelled.
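
For manual cancellation, a small sketch: cancelling the session's context from elsewhere makes a running Do return (everything here beyond CrawlSession.CancelFunc is illustrative):

package crawler

import (
	"time"

	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
)

// stopAfter cancels the session's context after d; a Do loop running on the
// same session will observe the cancellation and return with an error.
func stopAfter(session *common.CrawlSession, d time.Duration) *time.Timer {
	return time.AfterFunc(d, session.CancelFunc)
}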

func (*Shared) Enqueue added in v1.0.1

func (s *Shared) Enqueue(queue *queue.Queue, navigationRequests ...*navigation.Request)

Enqueue adds one or more navigation requests to the crawl queue after applying validation checks. The method performs the following checks in order:

  1. URL format validation
  2. Query parameter handling (if IgnoreQueryParams is enabled)
  3. Depth filtering - URLs exceeding MaxDepth are skipped before the uniqueness check, so they are not cached as already seen and can still be crawled if discovered later at a valid depth via a different path
  4. Uniqueness filtering - prevents duplicate URL crawling
  5. Cycle detection - identifies URLs stuck in redirect loops
  6. Scope validation - ensures URLs belong to the allowed crawl scope

For in-scope URLs, the method also handles path climbing when enabled, extracting and enqueuing parent directory paths. Out-of-scope URLs are sent to output if DisplayOutScope is enabled.
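
As a sketch of the call shape (import paths are assumptions; the navigation requests themselves would come from parsing a response, for example inside a DoRequestFunc):

package crawler

import (
	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
	"github.com/projectdiscovery/katana/pkg/navigation"    // assumed import path
)

// enqueueDiscovered hands newly discovered requests back to the session queue;
// Enqueue applies the validation steps listed above before anything is queued.
func enqueueDiscovered(s *common.Shared, session *common.CrawlSession, discovered []*navigation.Request) {
	s.Enqueue(session.Queue, discovered...)
}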

func (*Shared) NewCrawlSessionWithURL added in v1.0.1

func (s *Shared) NewCrawlSessionWithURL(URL string) (*CrawlSession, error)

NewCrawlSessionWithURL creates and initializes a new crawl session for the specified URL. It performs the following initialization steps:

  1. Creates a context with optional timeout based on CrawlDuration setting
  2. Parses the target URL and extracts the hostname
  3. Initializes the request queue with the configured strategy
  4. Enqueues the initial URL and any known files for the target
  5. Sets up the HTTP client with response parsing callbacks

Returns the initialized CrawlSession or an error if initialization fails.
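
Putting the session lifecycle together, a hedged sketch of one full crawl of a single target (only the identifiers documented on this page are taken from the package; the surrounding function and import path are illustrative):

package crawler

import (
	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
)

// crawlOne runs a single crawl session against target with the given request
// strategy, making sure the session context is always released.
func crawlOne(shared *common.Shared, target string, doRequest common.DoRequestFunc) error {
	session, err := shared.NewCrawlSessionWithURL(target)
	if err != nil {
		return err
	}
	defer session.CancelFunc()

	return shared.Do(session, doRequest)
}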

func (*Shared) Output added in v1.0.1

func (s *Shared) Output(navigationRequest *navigation.Request, navigationResponse *navigation.Response, err error)

Output writes a crawl result to the configured output writer. It creates a Result object containing the navigation request, response (if any), and error information (if any), then writes it to the output writer. If an OnResult callback is configured and output writing succeeds, the callback is invoked.
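
A sketch of how code outside of Do might report a single result (Do already does this internally for the requests it processes; import paths and the surrounding function are assumptions):

package crawler

import (
	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
	"github.com/projectdiscovery/katana/pkg/navigation"    // assumed import path
)

// executeAndReport runs one navigation request and always writes the outcome,
// successful or not, to the configured output writer.
func executeAndReport(s *common.Shared, session *common.CrawlSession, doRequest common.DoRequestFunc, req *navigation.Request) {
	resp, err := doRequest(session, req)
	s.Output(req, resp, err)
}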

func (*Shared) ValidateScope added in v1.0.1

func (s *Shared) ValidateScope(URL string, root string) bool

ValidateScope checks whether a given URL is within the allowed crawling scope based on the configured scope rules and the root hostname. Returns true if the URL passes scope validation, false otherwise.
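
For example (a sketch; the helper and import path are illustrative, and using the session hostname as the scope root follows the "root hostname" wording above):

package crawler

import (
	"github.com/projectdiscovery/katana/pkg/engine/common" // assumed import path
)

// inScope reports whether candidate should be crawled for the session's target,
// using the session hostname as the scope root.
func inScope(s *common.Shared, session *common.CrawlSession, candidate string) bool {
	return s.ValidateScope(candidate, session.Hostname)
}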
