crawl

package
v1.0.0
Published: Nov 26, 2025 License: MIT Imports: 7 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler is the internal crawl engine.

It is constructed by the aether.Client and not exposed directly to end users. Public APIs will wrap this engine via aether/crawl.go.

func NewCrawler

func NewCrawler(fetcher *httpclient.Client, opts Options) (*Crawler, error)

NewCrawler constructs a new Crawler using the provided internal HTTP client and options.

func (*Crawler) Run

func (c *Crawler) Run(ctx context.Context, startURL string) error

Run executes the crawl starting from startURL.

The crawl stops when:

  • the frontier is empty, or
  • MaxPages (if > 0) is reached, or
  • the context is canceled, or
  • a fatal error is returned by the Visitor or fetcher.

The current implementation is single-threaded (one worker), but all underlying components are safe for future multi-worker expansion.
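
As a usage sketch only: this package is internal and is normally driven by aether.Client, so the import paths, the crawlexample package name, and the already-constructed *httpclient.Client below are placeholders rather than documented public API.

package crawlexample

import (
	"context"
	"log"

	"example.com/aether/internal/crawl"      // placeholder import path
	"example.com/aether/internal/httpclient" // placeholder import path
)

// runCrawl wires a Crawler to a VisitorFunc and crawls up to 100 pages
// on the starting host.
func runCrawl(ctx context.Context, fetcher *httpclient.Client) error {
	opts := crawl.Options{
		MaxDepth:     2,    // root, its links, and their links
		MaxPages:     100,  // stop after 100 fetched pages
		SameHostOnly: true, // stay on the starting host
		Visitor: crawl.VisitorFunc(func(ctx context.Context, page *crawl.Page) error {
			log.Printf("visited %s (depth %d, status %d)", page.URL, page.Depth, page.StatusCode)
			return nil
		}),
	}

	c, err := crawl.NewCrawler(fetcher, opts)
	if err != nil {
		return err
	}
	return c.Run(ctx, "https://example.com/")
}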

type DepthLimit

type DepthLimit struct {
	MaxDepth int // maximum allowed depth (0-based). If MaxDepth < 0, unlimited.
}

DepthLimit is a simple struct used for validating depth transitions.

func NewDepthLimit

func NewDepthLimit(maxDepth int) DepthLimit

NewDepthLimit constructs a new DepthLimit.

If maxDepth < 0, the crawler treats depth as unlimited. Depth 0 = root URL. Depth 1 = root's outgoing links. Depth 2 = links from depth 1 pages, etc.

func (DepthLimit) Allowed

func (d DepthLimit) Allowed(depth int) bool

Allowed reports whether a page at `depth` is allowed to be visited according to the configured max depth.

If MaxDepth < 0, depth is unlimited.

func (DepthLimit) Exceeded

func (d DepthLimit) Exceeded(parentDepth int) bool

Exceeded reports whether a child at depth parentDepth + 1 would exceed MaxDepth.

Useful when deciding whether to enqueue outgoing links.

func (DepthLimit) Next

func (d DepthLimit) Next(parentDepth int) int

Next returns the next depth for child links.

Parents at depth N produce children at depth N+1.
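
A sketch of the enqueue decision these three methods support. The helper names are hypothetical, the import path is a placeholder, and the expected values in the comments are inferred from the documented semantics with MaxDepth = 2.

package crawlexample

import "example.com/aether/internal/crawl" // placeholder import path

// childDepthFor reports the depth at which a parent's outgoing links should
// be enqueued, or false if enqueuing them would exceed the configured limit.
func childDepthFor(dl crawl.DepthLimit, parentDepth int) (int, bool) {
	if dl.Exceeded(parentDepth) {
		return 0, false // children at parentDepth+1 would exceed MaxDepth
	}
	return dl.Next(parentDepth), true // parentDepth + 1
}

func depthLimitDemo() {
	dl := crawl.NewDepthLimit(2)   // 0 = root, 1 = root's links, 2 = their links
	_ = dl.Allowed(2)              // true: depth 2 is within the limit
	_ = dl.Allowed(3)              // false: beyond MaxDepth
	_, _ = childDepthFor(dl, 1)    // (2, true)
	_, _ = childDepthFor(dl, 2)    // (0, false): grandchildren would sit at depth 3
}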

type FrontierItem

type FrontierItem struct {
	URL   string
	Depth int
}

FrontierItem represents a single entry in the crawl frontier. Depth is measured from the starting URL (depth 0).

type FrontierQueue

type FrontierQueue struct {
	// contains filtered or unexported fields
}

FrontierQueue is a thread-safe FIFO queue of FrontierItem values.

The queue is used by the crawler to schedule which URLs to visit next. It does not perform any URL normalization or filtering; those concerns are handled by higher-level components (rules, visit map, etc.).

func NewFrontierQueue

func NewFrontierQueue() *FrontierQueue

NewFrontierQueue constructs an empty frontier queue.

func (*FrontierQueue) Dequeue

func (q *FrontierQueue) Dequeue() (FrontierItem, bool)

Dequeue removes and returns the oldest item from the queue. The boolean return value is false if the queue is empty.

func (*FrontierQueue) Empty

func (q *FrontierQueue) Empty() bool

Empty reports whether the queue is currently empty.

func (*FrontierQueue) Enqueue

func (q *FrontierQueue) Enqueue(item FrontierItem)

Enqueue adds a new item to the end of the queue. It is safe to call from multiple goroutines.

func (*FrontierQueue) Len

func (q *FrontierQueue) Len() int

Len returns the current number of items in the queue.

This is primarily useful for monitoring and unit tests; the crawler uses Dequeue's boolean return to detect emptiness.
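
A sketch of the FIFO drain loop the queue is built for; the function name and import path are placeholders.

package crawlexample

import (
	"fmt"

	"example.com/aether/internal/crawl" // placeholder import path
)

// drainFrontier seeds the queue with the root URL and drains it in FIFO order.
func drainFrontier() {
	q := crawl.NewFrontierQueue()
	q.Enqueue(crawl.FrontierItem{URL: "https://example.com/", Depth: 0})

	for !q.Empty() {
		item, ok := q.Dequeue()
		if !ok {
			break // another goroutine emptied the queue between Empty and Dequeue
		}
		fmt.Println(item.URL, item.Depth)
		// In the crawler, links discovered on item.URL would be enqueued
		// here at item.Depth+1 (subject to rules and the visit map).
	}
}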

type Options

type Options struct {
	// MaxDepth is the maximum depth to crawl, starting at 0 for the root URL.
	// If MaxDepth < 0, depth is unlimited.
	MaxDepth int

	// MaxPages limits how many pages will be fetched. If MaxPages <= 0,
	// there is no explicit page limit and the crawl stops only when the
	// frontier becomes empty or the context is canceled.
	MaxPages int

	// SameHostOnly restricts all crawled URLs to the same host as the
	// starting URL.
	SameHostOnly bool

	// AllowedDomains, if non-empty, restricts crawling to these hostnames.
	// Hostnames are matched in their lowercase form.
	AllowedDomains []string

	// DisallowedDomains, if non-empty, blocks crawling for these hostnames.
	DisallowedDomains []string

	// FetchDelay is a soft politeness delay enforced between successive
	// requests to the same host. A value of zero disables per-host delay.
	FetchDelay time.Duration

	// Concurrency is reserved for a future multi-worker version of the
	// crawler. The current implementation uses a single worker, but keeps
	// this field for API stability.
	Concurrency int

	// Visitor is invoked for each fetched page. It must not be nil.
	Visitor Visitor
}

Options configures the behavior of the crawler.
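
A sketch of an Options value scoped to a few domains with a politeness delay; the hostnames, limits, and import path are illustrative placeholders.

package crawlexample

import (
	"context"
	"time"

	"example.com/aether/internal/crawl" // placeholder import path
)

// domainScopedOptions restricts the crawl to two hosts, blocks a third, and
// adds a soft per-host politeness delay. The Visitor must not be nil, so a
// no-op VisitorFunc is supplied.
func domainScopedOptions() crawl.Options {
	return crawl.Options{
		MaxDepth:          -1,  // unlimited depth
		MaxPages:          500, // hard cap on fetched pages
		AllowedDomains:    []string{"example.com", "docs.example.com"},
		DisallowedDomains: []string{"tracker.example.com"},
		FetchDelay:        500 * time.Millisecond, // soft per-host delay
		Visitor: crawl.VisitorFunc(func(ctx context.Context, page *crawl.Page) error {
			return nil
		}),
	}
}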

type Page

type Page struct {
	URL        string
	Depth      int
	StatusCode int

	// Content is the raw response body interpreted as text. For non-text
	// content types, this will still be a string representation of bytes.
	Content string

	// Links contains the child URLs discovered on the page that were
	// accepted by host/domain/visited rules and enqueued for crawling.
	Links []string

	// Metadata holds additional simple metadata such as content type.
	Metadata map[string]string
}

Page represents a single crawled page, as seen by the visitor callback.
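
A sketch of how a visitor might read a Page. The exact Metadata keys are not documented here, so the loop below simply prints whatever is present; the function name and import path are placeholders.

package crawlexample

import (
	"fmt"

	"example.com/aether/internal/crawl" // placeholder import path
)

// describePage prints the fields a visitor typically inspects.
func describePage(page *crawl.Page) {
	fmt.Printf("%s (depth %d) returned status %d\n", page.URL, page.Depth, page.StatusCode)
	fmt.Printf("%d bytes of content, %d accepted links\n", len(page.Content), len(page.Links))
	for k, v := range page.Metadata {
		fmt.Printf("  %s: %s\n", k, v) // e.g. a content-type entry; exact keys are not documented
	}
}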

type PerHostThrottle

type PerHostThrottle struct {
	// contains filtered or unexported fields
}

PerHostThrottle enforces a minimum delay between requests to the same host.

The crawler uses this to avoid overwhelming web servers even when multiple worker goroutines are active. This complements robots.txt compliance and the HTTP client's concurrency controls.

func NewPerHostThrottle

func NewPerHostThrottle(minDelay time.Duration) *PerHostThrottle

NewPerHostThrottle constructs a new throttle enforcer.

minDelay = 0 means "no throttling".

func (*PerHostThrottle) Wait

func (p *PerHostThrottle) Wait(rawURL string)

Wait respects the per-host delay before allowing another request to proceed.

The caller should invoke Wait() *immediately before* performing a network fetch. This method blocks only the worker hitting this specific host. Workers hitting other hosts proceed unhindered.
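
A sketch of the documented call pattern: construct the throttle once, then call Wait keyed by the request URL immediately before each fetch. http.Get stands in for the internal fetcher, and the import path is a placeholder.

package crawlexample

import (
	"net/http"
	"time"

	"example.com/aether/internal/crawl" // placeholder import path
)

// politeFetch waits out the per-host delay immediately before each request.
func politeFetch(throttle *crawl.PerHostThrottle, rawURL string) (*http.Response, error) {
	throttle.Wait(rawURL) // blocks only callers targeting rawURL's host
	return http.Get(rawURL)
}

func newThrottle() *crawl.PerHostThrottle {
	return crawl.NewPerHostThrottle(500 * time.Millisecond) // 0 disables throttling
}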

type VisitMap

type VisitMap struct {
	// contains filtered or unexported fields
}

VisitMap is a concurrency-safe visited URL registry.

URLs stored here must already be normalized by the crawler subsystem. Typically, this includes:

  • scheme normalization
  • host lowercasing
  • path cleaning
  • removal of URL fragments (#section)
  • resolution of relative URLs

VisitMap does not perform normalization on its own, by design.

func NewVisitMap

func NewVisitMap() *VisitMap

NewVisitMap constructs an empty VisitMap.

func (*VisitMap) Count

func (v *VisitMap) Count() int

Count returns the number of visited URLs so far.

func (*VisitMap) IsVisited

func (v *VisitMap) IsVisited(url string) bool

IsVisited reports whether the URL has already been seen.

func (*VisitMap) MarkVisited

func (v *VisitMap) MarkVisited(url string) bool

MarkVisited records a URL as visited, whether or not it has been seen before.

It returns true if the URL was newly added, false if it was already present.
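
A sketch of the mark-then-fetch pattern this return value enables; visitOnce is a hypothetical helper, the URL is assumed to be normalized upstream, and the import path is a placeholder.

package crawlexample

import "example.com/aether/internal/crawl" // placeholder import path

// visitOnce marks the URL and reports whether the caller should fetch it.
// The URL must already be normalized; the map does no normalization itself.
func visitOnce(seen *crawl.VisitMap, normalizedURL string) bool {
	if !seen.MarkVisited(normalizedURL) {
		return false // already visited; skip the fetch
	}
	// Only the caller that newly added the URL reaches this point, so each
	// URL is fetched at most once even with concurrent callers.
	return true
}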

type Visitor

type Visitor interface {
	VisitPage(ctx context.Context, page *Page) error
}

Visitor is invoked for each successfully fetched page.

type VisitorFunc

type VisitorFunc func(ctx context.Context, page *Page) error

VisitorFunc is a functional adapter to allow the use of ordinary functions as Visitors.

func (VisitorFunc) VisitPage

func (f VisitorFunc) VisitPage(ctx context.Context, page *Page) error

VisitPage calls f(ctx, page).
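
A sketch of the adapter in use: an ordinary function with the VisitPage signature becomes a Visitor via a VisitorFunc conversion. logPage is hypothetical and the import path is a placeholder.

package crawlexample

import (
	"context"
	"log"

	"example.com/aether/internal/crawl" // placeholder import path
)

// logPage is an ordinary function with the VisitPage signature.
func logPage(ctx context.Context, page *crawl.Page) error {
	log.Printf("visited %s (%d links)", page.URL, len(page.Links))
	return nil
}

// The conversion crawl.VisitorFunc(logPage) satisfies the Visitor interface,
// so logPage can be assigned to Options.Visitor.
var _ crawl.Visitor = crawl.VisitorFunc(logPage)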
