Documentation ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Crawler ¶
type Crawler struct {
// contains filtered or unexported fields
}
Crawler is the internal crawl engine.
It is constructed by aether.Client and is not exposed directly to end users. Public APIs will wrap this engine via aether/crawl.go.
func NewCrawler ¶
func NewCrawler(fetcher *httpclient.Client, opts Options) (*Crawler, error)
NewCrawler constructs a new Crawler using the provided internal HTTP client and options.
func (*Crawler) Run ¶
Run executes the crawl starting from startURL.
The crawl stops when:
- the frontier is empty, or
- MaxPages (if > 0) is reached, or
- the context is canceled, or
- a fatal error is returned by the Visitor or fetcher.
The current implementation is single-threaded (one worker), but all underlying components are safe for future multi-worker expansion.
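A minimal end-to-end sketch of constructing and running the crawler. The constructor for the internal HTTP client and the exact parameters of Run are not listed on this page, so httpclient.New and the Run(ctx, startURL) shape below are assumptions for illustration only; myVisitor is the hypothetical callback sketched under Page.

// Sketch only. httpclient.New is a placeholder for however the internal
// HTTP client is built, and Run's (ctx, startURL) parameters are inferred
// from the description above, not documented.
fetcher := httpclient.New()

c, err := NewCrawler(fetcher, Options{
    MaxDepth:     2,
    MaxPages:     100,
    SameHostOnly: true,
    FetchDelay:   500 * time.Millisecond,
    Visitor:      myVisitor, // hypothetical; see the sketch under Page
})
if err != nil {
    log.Fatal(err)
}

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()

if err := c.Run(ctx, "https://example.com/"); err != nil {
    log.Fatal(err)
}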
type DepthLimit ¶
type DepthLimit struct {
MaxDepth int // maximum allowed depth (0-based). If MaxDepth < 0, unlimited.
}
DepthLimit validates crawl depth transitions against a configured maximum depth.
func NewDepthLimit ¶
func NewDepthLimit(maxDepth int) DepthLimit
NewDepthLimit constructs a new DepthLimit.
If maxDepth < 0, the crawler treats depth as unlimited. Depth 0 = root URL. Depth 1 = root's outgoing links. Depth 2 = links from depth 1 pages, etc.
func (DepthLimit) Allowed ¶
func (d DepthLimit) Allowed(depth int) bool
Allowed reports whether a page at the given depth may be visited under the configured maximum depth.
If MaxDepth < 0, depth is unlimited.
func (DepthLimit) Exceeded ¶
func (d DepthLimit) Exceeded(parentDepth int) bool
Exceeded reports whether a child at depth parentDepth + 1 would exceed MaxDepth.
Useful when deciding whether to enqueue outgoing links.
func (DepthLimit) Next ¶
func (d DepthLimit) Next(parentDepth int) int
Next returns the next depth for child links.
Parents at depth N produce children at depth N+1.
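A short sketch showing how Allowed, Exceeded and Next fit together when deciding whether to enqueue a child link; the depth values are illustrative.

d := NewDepthLimit(2) // depth 0 = root, 1 = root's links, 2 = their links

parentDepth := 1
if !d.Exceeded(parentDepth) {
    childDepth := d.Next(parentDepth)  // 2
    fmt.Println(d.Allowed(childDepth)) // true: depth 2 is still within MaxDepth
}

// A parent already at MaxDepth produces children beyond the limit.
fmt.Println(d.Exceeded(2)) // true: a child would be at depth 3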
type FrontierItem ¶
FrontierItem represents a single entry in the crawl frontier. Depth is measured from the starting URL (depth 0).
type FrontierQueue ¶
type FrontierQueue struct {
// contains filtered or unexported fields
}
FrontierQueue is a thread-safe FIFO queue of FrontierItem values.
The queue is used by the crawler to schedule which URLs to visit next. It does not perform any URL normalization or filtering; those concerns are handled by higher-level components (rules, visit map, etc.).
func NewFrontierQueue ¶
func NewFrontierQueue() *FrontierQueue
NewFrontierQueue constructs an empty frontier queue.
func (*FrontierQueue) Dequeue ¶
func (q *FrontierQueue) Dequeue() (FrontierItem, bool)
Dequeue removes and returns the oldest item from the queue. The boolean return value is false if the queue is empty.
func (*FrontierQueue) Empty ¶
func (q *FrontierQueue) Empty() bool
Empty reports whether the queue is currently empty.
func (*FrontierQueue) Enqueue ¶
func (q *FrontierQueue) Enqueue(item FrontierItem)
Enqueue adds a new item to the end of the queue. It is safe to call from multiple goroutines.
func (*FrontierQueue) Len ¶
func (q *FrontierQueue) Len() int
Len returns the current number of items in the queue.
This is primarily useful for monitoring and unit tests; the crawler uses Dequeue's boolean return to detect emptiness.
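A small usage sketch. FrontierItem's fields are not listed on this page, so the URL and Depth field names below are assumptions; note that the loop trusts Dequeue's boolean rather than Empty, since another goroutine could drain the queue in between.

q := NewFrontierQueue()

// URL and Depth are assumed field names; FrontierItem's fields are not
// shown in this documentation.
q.Enqueue(FrontierItem{URL: "https://example.com/", Depth: 0})
q.Enqueue(FrontierItem{URL: "https://example.com/about", Depth: 1})

fmt.Println(q.Len()) // 2

for {
    item, ok := q.Dequeue()
    if !ok {
        break // queue is empty
    }
    fmt.Println(item.URL, item.Depth)
}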
type Options ¶
type Options struct {
// MaxDepth is the maximum depth to crawl, starting at 0 for the root URL.
// If MaxDepth < 0, depth is unlimited.
MaxDepth int
// MaxPages limits how many pages will be fetched. If MaxPages <= 0,
// there is no explicit page limit and the crawl stops only when the
// frontier becomes empty or the context is canceled.
MaxPages int
// SameHostOnly restricts all crawled URLs to the same host as the
// starting URL.
SameHostOnly bool
// AllowedDomains, if non-empty, restricts crawling to these hostnames.
// Hostnames are matched in their lowercase form.
AllowedDomains []string
// DisallowedDomains, if non-empty, blocks crawling for these hostnames.
DisallowedDomains []string
// FetchDelay is a soft politeness delay enforced between successive
// requests to the same host. A value of zero disables per-host delay.
FetchDelay time.Duration
// Concurrency is reserved for a future multi-worker version of the
// crawler. The current implementation uses a single worker, but keeps
// this field for API stability.
Concurrency int
// Visitor is invoked for each fetched page. It must not be nil.
Visitor Visitor
}
Options configures the behavior of the crawler.
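An illustrative Options value; the domains, limits and delay are arbitrary, and myVisitor is the hypothetical callback sketched under Page.

opts := Options{
    MaxDepth:          3,                      // root plus three levels of links
    MaxPages:          500,                    // stop after 500 fetched pages
    SameHostOnly:      false,
    AllowedDomains:    []string{"example.com", "docs.example.com"},
    DisallowedDomains: []string{"ads.example.com"},
    FetchDelay:        250 * time.Millisecond, // politeness delay per host
    Concurrency:       1,                      // reserved; the crawler currently runs one worker
    Visitor:           myVisitor,              // must not be nil; see the sketch under Page
}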
type Page ¶
type Page struct {
URL string
Depth int
StatusCode int
// Content is the raw response body interpreted as text. For non-text
// content types, this will still be a string representation of bytes.
Content string
// Links contains the child URLs discovered on the page that were
// accepted by host/domain/visited rules and enqueued for crawling.
Links []string
// Metadata holds additional simple metadata such as content type.
Metadata map[string]string
}
Page represents a single crawled page, as seen by the visitor callback.
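The Visitor type itself is not documented on this page. The sketch below assumes it can be satisfied by a function that receives a *Page and returns an error, and that the content type is stored under a "Content-Type" metadata key; both are assumptions for illustration.

// myVisitor is a hypothetical callback; the real Visitor type and the
// exact Metadata keys are not shown in this documentation.
func myVisitor(p *Page) error {
    if p.StatusCode != http.StatusOK {
        log.Printf("skipping %s (status %d at depth %d)", p.URL, p.StatusCode, p.Depth)
        return nil
    }
    log.Printf("visited %s: %d bytes, %d enqueued links, content type %q",
        p.URL, len(p.Content), len(p.Links), p.Metadata["Content-Type"])
    return nil // per the Run docs, a fatal error returned here stops the crawl
}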
type PerHostThrottle ¶
type PerHostThrottle struct {
// contains filtered or unexported fields
}
PerHostThrottle enforces a minimum delay between requests to the same host.
The crawler uses this to avoid overwhelming web servers even when multiple worker goroutines are active. This complements robots.txt compliance and the HTTP client's concurrency controls.
func NewPerHostThrottle ¶
func NewPerHostThrottle(minDelay time.Duration) *PerHostThrottle
NewPerHostThrottle constructs a new throttle enforcer.
minDelay = 0 means "no throttling".
func (*PerHostThrottle) Wait ¶
func (p *PerHostThrottle) Wait(rawURL string)
Wait blocks until the per-host minimum delay has elapsed for rawURL's host, then returns so the request can proceed.
The caller should invoke Wait immediately before performing a network fetch. Wait blocks only the worker hitting this specific host; workers hitting other hosts proceed unhindered.
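A small sketch of the intended call pattern; the URLs are illustrative and the fetch is left as a comment.

t := NewPerHostThrottle(500 * time.Millisecond)

urls := []string{
    "https://example.com/a",
    "https://example.com/b",      // same host: this Wait blocks ~500ms after /a
    "https://other.example.org/", // different host: no added delay
}
for _, u := range urls {
    t.Wait(u) // call immediately before the network fetch
    // fetch(u) // hypothetical network call
}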
type VisitMap ¶
type VisitMap struct {
// contains filtered or unexported fields
}
VisitMap is a concurrency-safe visited URL registry.
URLs stored here must already be normalized by the crawler subsystem. Typically, this includes:
- scheme normalization
- host lowercasing
- path cleaning
- removal of URL fragments (#section)
- resolution of relative URLs
VisitMap does not perform normalization on its own, by design.
func (*VisitMap) MarkVisited ¶
MarkVisited records a URL as visited, regardless of prior existence.
It returns true if the URL was newly added, false if it was already present.
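MarkVisited's parameters are not listed on this page; the sketch below assumes it takes the already-normalized URL string and returns the boolean described. The normalization shown (which is the caller's responsibility, not VisitMap's) uses only the standard library and illustrates the steps listed under VisitMap.

// markIfNew is a hypothetical helper: it normalizes rawURL roughly as
// described under VisitMap, then records it. MarkVisited's signature
// (string in, bool out) is assumed, not documented above.
func markIfNew(visited *VisitMap, rawURL string) (bool, error) {
    u, err := url.Parse(rawURL)
    if err != nil {
        return false, err
    }
    u.Host = strings.ToLower(u.Host) // Parse lowercases the scheme, not the host
    u.Fragment = ""                  // drop #fragment sections
    if u.Path != "" {
        u.Path = path.Clean(u.Path) // e.g. /docs/../index.html -> /index.html
    }
    return visited.MarkVisited(u.String()), nil
}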