crawl

package
v1.0.0
Published: Nov 26, 2025 License: MIT Imports: 7 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler is the internal crawl engine.

It is constructed by the aether.Client and not exposed directly to end users. Public APIs will wrap this engine via aether/crawl.go.

func NewCrawler

func NewCrawler(fetcher *httpclient.Client, opts Options) (*Crawler, error)

NewCrawler constructs a new Crawler using the provided internal HTTP client and options.

func (*Crawler) Run

func (c *Crawler) Run(ctx context.Context, startURL string) error

Run executes the crawl starting from startURL.

The crawl stops when:

  • the frontier is empty, or
  • MaxPages (if > 0) is reached, or
  • the context is canceled, or
  • a fatal error is returned by the Visitor or fetcher.

The current implementation is single-threaded (one worker), but all underlying components are safe for future multi-worker expansion.
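
As a usage sketch only: this package is internal and is normally driven by aether.Client, so the import paths, the crawlexample package name, and the already-constructed *httpclient.Client below are placeholders rather than documented public API.

package crawlexample

import (
	"context"
	"log"

	"example.com/aether/internal/crawl"      // placeholder import path
	"example.com/aether/internal/httpclient" // placeholder import path
)

// runCrawl wires a Crawler to a VisitorFunc and crawls up to 100 pages
// on the starting host.
func runCrawl(ctx context.Context, fetcher *httpclient.Client) error {
	opts := crawl.Options{
		MaxDepth:     2,    // root, its links, and their links
		MaxPages:     100,  // stop after 100 fetched pages
		SameHostOnly: true, // stay on the starting host
		Visitor: crawl.VisitorFunc(func(ctx context.Context, page *crawl.Page) error {
			log.Printf("visited %s (depth %d, status %d)", page.URL, page.Depth, page.StatusCode)
			return nil
		}),
	}

	c, err := crawl.NewCrawler(fetcher, opts)
	if err != nil {
		return err
	}
	return c.Run(ctx, "https://example.com/")
}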

type DepthLimit

type DepthLimit struct {
	MaxDepth int // maximum allowed depth (0-based). If MaxDepth < 0, unlimited.
}

DepthLimit is a simple struct used for validating depth transitions.

func NewDepthLimit

func NewDepthLimit(maxDepth int) DepthLimit

NewDepthLimit constructs a new DepthLimit.

If maxDepth < 0, the crawler treats depth as unlimited. Depth 0 = root URL. Depth 1 = root's outgoing links. Depth 2 = links from depth 1 pages, etc.

func (DepthLimit) Allowed

func (d DepthLimit) Allowed(depth int) bool

Allowed reports whether a page at `depth` is allowed to be visited according to the configured max depth.

If MaxDepth < 0, depth is unlimited.

func (DepthLimit) Exceeded

func (d DepthLimit) Exceeded(parentDepth int) bool

Exceeded reports whether a child at depth parentDepth + 1 would exceed MaxDepth.

Useful when deciding whether to enqueue outgoing links.

func (DepthLimit) Next

func (d DepthLimit) Next(parentDepth int) int

Next returns the next depth for child links.

Parents at depth N produce children at depth N+1.
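
A sketch of the enqueue decision these three methods support. The helper names are hypothetical, the import path is a placeholder, and the expected values in the comments are inferred from the documented semantics with MaxDepth = 2.

package crawlexample

import "example.com/aether/internal/crawl" // placeholder import path

// childDepthFor reports the depth at which a parent's outgoing links should
// be enqueued, or false if enqueuing them would exceed the configured limit.
func childDepthFor(dl crawl.DepthLimit, parentDepth int) (int, bool) {
	if dl.Exceeded(parentDepth) {
		return 0, false // children at parentDepth+1 would exceed MaxDepth
	}
	return dl.Next(parentDepth), true // parentDepth + 1
}

func depthLimitDemo() {
	dl := crawl.NewDepthLimit(2)   // 0 = root, 1 = root's links, 2 = their links
	_ = dl.Allowed(2)              // true: depth 2 is within the limit
	_ = dl.Allowed(3)              // false: beyond MaxDepth
	_, _ = childDepthFor(dl, 1)    // (2, true)
	_, _ = childDepthFor(dl, 2)    // (0, false): grandchildren would sit at depth 3
}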

type FrontierItem

type FrontierItem struct {
	URL   string
	Depth int
}

FrontierItem represents a single entry in the crawl frontier. Depth is measured from the starting URL (depth 0).

type FrontierQueue

type FrontierQueue struct {
	// contains filtered or unexported fields
}

FrontierQueue is a thread-safe FIFO queue of FrontierItem values.

The queue is used by the crawler to schedule which URLs to visit next. It does not perform any URL normalization or filtering; those concerns are handled by higher-level components (rules, visit map, etc.).

func NewFrontierQueue

func NewFrontierQueue() *FrontierQueue

NewFrontierQueue constructs an empty frontier queue.

func (*FrontierQueue) Dequeue

func (q *FrontierQueue) Dequeue() (FrontierItem, bool)

Dequeue removes and returns the oldest item from the queue. The boolean return value is false if the queue is empty.

func (*FrontierQueue) Empty

func (q *FrontierQueue) Empty() bool

Empty reports whether the queue is currently empty.

func (*FrontierQueue) Enqueue

func (q *FrontierQueue) Enqueue(item FrontierItem)

Enqueue adds a new item to the end of the queue. It is safe to call from multiple goroutines.

func (*FrontierQueue) Len

func (q *FrontierQueue) Len() int

Len returns the current number of items in the queue.

This is primarily useful for monitoring and unit tests; the crawler uses Dequeue's boolean return to detect emptiness.
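
A sketch of the FIFO drain loop the queue is built for; the function name and import path are placeholders.

package crawlexample

import (
	"fmt"

	"example.com/aether/internal/crawl" // placeholder import path
)

// drainFrontier seeds the queue with the root URL and drains it in FIFO order.
func drainFrontier() {
	q := crawl.NewFrontierQueue()
	q.Enqueue(crawl.FrontierItem{URL: "https://example.com/", Depth: 0})

	for !q.Empty() {
		item, ok := q.Dequeue()
		if !ok {
			break // another goroutine emptied the queue between Empty and Dequeue
		}
		fmt.Println(item.URL, item.Depth)
		// In the crawler, links discovered on item.URL would be enqueued
		// here at item.Depth+1 (subject to rules and the visit map).
	}
}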

type Options

type Options struct {
	// MaxDepth is the maximum depth to crawl, starting at 0 for the root URL.
	// If MaxDepth < 0, depth is unlimited.
	MaxDepth int

	// MaxPages limits how many pages will be fetched. If MaxPages <= 0,
	// there is no explicit page limit and the crawl stops only when the
	// frontier becomes empty or the context is canceled.
	MaxPages int

	// SameHostOnly restricts all crawled URLs to the same host as the
	// starting URL.
	SameHostOnly bool

	// AllowedDomains, if non-empty, restricts crawling to these hostnames.
	// Hostnames are matched in their lowercase form.
	AllowedDomains []string

	// DisallowedDomains, if non-empty, blocks crawling for these hostnames.
	DisallowedDomains []string

	// FetchDelay is a soft politeness delay enforced between successive
	// requests to the same host. A value of zero disables per-host delay.
	FetchDelay time.Duration

	// Concurrency is reserved for a future multi-worker version of the
	// crawler. The current implementation uses a single worker, but keeps
	// this field for API stability.
	Concurrency int

	// Visitor is invoked for each fetched page. It must not be nil.
	Visitor Visitor
}

Options configures the behavior of the crawler.
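
A sketch of an Options value scoped to a few domains with a politeness delay; the hostnames, limits, and import path are illustrative placeholders.

package crawlexample

import (
	"context"
	"time"

	"example.com/aether/internal/crawl" // placeholder import path
)

// domainScopedOptions restricts the crawl to two hosts, blocks a third, and
// adds a soft per-host politeness delay. The Visitor must not be nil, so a
// no-op VisitorFunc is supplied.
func domainScopedOptions() crawl.Options {
	return crawl.Options{
		MaxDepth:          -1,  // unlimited depth
		MaxPages:          500, // hard cap on fetched pages
		AllowedDomains:    []string{"example.com", "docs.example.com"},
		DisallowedDomains: []string{"tracker.example.com"},
		FetchDelay:        500 * time.Millisecond, // soft per-host delay
		Visitor: crawl.VisitorFunc(func(ctx context.Context, page *crawl.Page) error {
			return nil
		}),
	}
}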

type Page

type Page struct {
	URL        string
	Depth      int
	StatusCode int

	// Content is the raw response body interpreted as text. For non-text
	// content types, this will still be a string representation of bytes.
	Content string

	// Links contains the child URLs discovered on the page that were
	// accepted by host/domain/visited rules and enqueued for crawling.
	Links []string

	// Metadata holds additional simple metadata such as content type.
	Metadata map[string]string
}

Page represents a single crawled page, as seen by the visitor callback.
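
A sketch of how a visitor might read a Page. The exact Metadata keys are not documented here, so the loop below simply prints whatever is present; the function name and import path are placeholders.

package crawlexample

import (
	"fmt"

	"example.com/aether/internal/crawl" // placeholder import path
)

// describePage prints the fields a visitor typically inspects.
func describePage(page *crawl.Page) {
	fmt.Printf("%s (depth %d) returned status %d\n", page.URL, page.Depth, page.StatusCode)
	fmt.Printf("%d bytes of content, %d accepted links\n", len(page.Content), len(page.Links))
	for k, v := range page.Metadata {
		fmt.Printf("  %s: %s\n", k, v) // e.g. a content-type entry; exact keys are not documented
	}
}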

type PerHostThrottle

type PerHostThrottle struct {
	// contains filtered or unexported fields
}

PerHostThrottle enforces a minimum delay between requests to the same host.

The crawler uses this to avoid overwhelming web servers even when multiple worker goroutines are active. This complements robots.txt compliance and the HTTP client's concurrency controls.

func NewPerHostThrottle

func NewPerHostThrottle(minDelay time.Duration) *PerHostThrottle

NewPerHostThrottle constructs a new throttle enforcer.

minDelay = 0 means "no throttling".

func (*PerHostThrottle) Wait

func (p *PerHostThrottle) Wait(rawURL string)

Wait respects the per-host delay before allowing another request to proceed.

The caller should invoke Wait() *immediately before* performing a network fetch. This method blocks only the worker hitting this specific host. Workers hitting other hosts proceed unhindered.
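
A sketch of the documented call pattern: construct the throttle once, then call Wait keyed by the request URL immediately before each fetch. http.Get stands in for the internal fetcher, and the import path is a placeholder.

package crawlexample

import (
	"net/http"
	"time"

	"example.com/aether/internal/crawl" // placeholder import path
)

// politeFetch waits out the per-host delay immediately before each request.
func politeFetch(throttle *crawl.PerHostThrottle, rawURL string) (*http.Response, error) {
	throttle.Wait(rawURL) // blocks only callers targeting rawURL's host
	return http.Get(rawURL)
}

func newThrottle() *crawl.PerHostThrottle {
	return crawl.NewPerHostThrottle(500 * time.Millisecond) // 0 disables throttling
}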

type VisitMap

type VisitMap struct {
	// contains filtered or unexported fields
}

VisitMap is a concurrency-safe visited URL registry.

URLs stored here must already be normalized by the crawler subsystem. Typically, this includes:

  • scheme normalization
  • host lowercasing
  • path cleaning
  • removal of URL fragments (#section)
  • resolution of relative URLs

VisitMap does not perform normalization on its own, by design.

func NewVisitMap

func NewVisitMap() *VisitMap

NewVisitMap constructs an empty VisitMap.

func (*VisitMap) Count

func (v *VisitMap) Count() int

Count returns the number of visited URLs so far.

func (*VisitMap) IsVisited

func (v *VisitMap) IsVisited(url string) bool

IsVisited reports whether the URL has already been seen.

func (*VisitMap) MarkVisited

func (v *VisitMap) MarkVisited(url string) bool

MarkVisited records a URL as visited, whether or not it has been seen before.

It returns true if the URL was newly added, false if it was already present.
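
A sketch of the mark-then-fetch pattern this return value enables; visitOnce is a hypothetical helper, the URL is assumed to be normalized upstream, and the import path is a placeholder.

package crawlexample

import "example.com/aether/internal/crawl" // placeholder import path

// visitOnce marks the URL and reports whether the caller should fetch it.
// The URL must already be normalized; the map does no normalization itself.
func visitOnce(seen *crawl.VisitMap, normalizedURL string) bool {
	if !seen.MarkVisited(normalizedURL) {
		return false // already visited; skip the fetch
	}
	// Only the caller that newly added the URL reaches this point, so each
	// URL is fetched at most once even with concurrent callers.
	return true
}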

type Visitor

type Visitor interface {
	VisitPage(ctx context.Context, page *Page) error
}

Visitor is invoked for each successfully fetched page.

type VisitorFunc

type VisitorFunc func(ctx context.Context, page *Page) error

VisitorFunc is a functional adapter to allow the use of ordinary functions as Visitors.

func (VisitorFunc) VisitPage

func (f VisitorFunc) VisitPage(ctx context.Context, page *Page) error

VisitPage calls f(ctx, page).
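
A sketch of the adapter in use: an ordinary function with the VisitPage signature becomes a Visitor via a VisitorFunc conversion. logPage is hypothetical and the import path is a placeholder.

package crawlexample

import (
	"context"
	"log"

	"example.com/aether/internal/crawl" // placeholder import path
)

// logPage is an ordinary function with the VisitPage signature.
func logPage(ctx context.Context, page *crawl.Page) error {
	log.Printf("visited %s (%d links)", page.URL, len(page.Links))
	return nil
}

// The conversion crawl.VisitorFunc(logPage) satisfies the Visitor interface,
// so logPage can be assigned to Options.Visitor.
var _ crawl.Visitor = crawl.VisitorFunc(logPage)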
