package frontier

v0.10.0 Latest
Warning

This package is not in the latest version of its module.
Published: Mar 7, 2026 License: AGPL-3.0 Imports: 5 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CrawlURL

type CrawlURL struct {
	URL      string
	Priority int // lower = higher priority
	Depth    int
	FoundOn  string
	Attempt  int // retry attempt number (0 = first try)
	// contains filtered or unexported fields
}

CrawlURL represents a URL to be crawled with priority and metadata.
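The "lower = higher priority" ordering can be sketched with a min-heap from container/heap. This is only an illustration: crawlItem, minQueue, and drainOrder are hypothetical names that mirror the exported fields of CrawlURL, and the package's actual queue implementation is not documented here.

```go
package main

import (
	"container/heap"
	"fmt"
)

// crawlItem mirrors the exported fields of CrawlURL (illustration only).
type crawlItem struct {
	URL      string
	Priority int // lower = higher priority
	Depth    int
	FoundOn  string
	Attempt  int
}

// minQueue is a hypothetical min-heap ordered by Priority.
type minQueue []crawlItem

func (q minQueue) Len() int           { return len(q) }
func (q minQueue) Less(i, j int) bool { return q[i].Priority < q[j].Priority }
func (q minQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *minQueue) Push(x any)        { *q = append(*q, x.(crawlItem)) }
func (q *minQueue) Pop() any {
	old := *q
	n := len(old)
	item := old[n-1]
	*q = old[:n-1]
	return item
}

// drainOrder pushes all items and returns their URLs in dequeue order.
func drainOrder(items []crawlItem) []string {
	q := &minQueue{}
	for _, it := range items {
		heap.Push(q, it)
	}
	var out []string
	for q.Len() > 0 {
		out = append(out, heap.Pop(q).(crawlItem).URL)
	}
	return out
}

func main() {
	order := drainOrder([]crawlItem{
		{URL: "https://example.com/c", Priority: 5},
		{URL: "https://example.com/a", Priority: 0},
		{URL: "https://example.com/b", Priority: 2},
	})
	fmt.Println(order) // the Priority 0 item comes out first
}
```

Items with equal Priority come out in no guaranteed order in this sketch; a tie-breaker such as Depth would have to be added to Less if a stable order matters.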

type Frontier

type Frontier struct {
	// contains filtered or unexported fields
}

Frontier manages the URL queue with priority, dedup, and per-host politeness.

func New

func New(delay time.Duration, maxSize int) *Frontier

New creates a new Frontier. delay is the initial per-host politeness delay (see Delay and SetDelay). maxSize limits the priority queue size (0 = unlimited).

func (*Frontier) Add

func (f *Frontier) Add(crawlURL CrawlURL) bool

Add enqueues crawlURL if its URL has not been seen before and reports whether it was added. If the URL was already seen, Add does not enqueue it again but still updates the minimum recorded depth, so that dequeued URLs carry their true shortest-path depth.

func (*Frontier) Close

func (f *Frontier) Close()

Close closes the frontier, preventing new URLs from being added.

func (*Frontier) Delay added in v0.9.0

func (f *Frontier) Delay() time.Duration

Delay returns the current per-host politeness delay.

func (*Frontier) Len

func (f *Frontier) Len() int

Len returns the number of URLs in the queue.

func (*Frontier) MarkSeen

func (f *Frontier) MarkSeen(url string)

MarkSeen adds a URL to the dedup database without adding it to the queue.

func (*Frontier) Next

func (f *Frontier) Next() *CrawlURL

Next returns the next URL that is ready to be fetched, respecting the per-host delay. It returns nil if no URL is ready yet or if the frontier is empty; Len can be used to tell the two cases apart.
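Because a nil result from Next can mean either "nothing is ready yet" or "the queue is empty", a caller typically needs a small polling loop. The sketch below shows one way to write it against a hypothetical source interface with the same method shapes as Next and Len; item, source, drain, and fakeSource are illustrative names, not part of this package.

```go
package main

import (
	"fmt"
	"time"
)

// item mirrors the URL field of CrawlURL (illustration only).
type item struct{ URL string }

// source is a hypothetical interface with the shape of Frontier's Next and Len.
type source interface {
	Next() *item
	Len() int
}

// drain fetches until the queue is empty, calling pause whenever URLs
// remain queued but none is ready yet.
func drain(src source, pause func(), fetch func(*item)) {
	for {
		if it := src.Next(); it != nil {
			fetch(it)
			continue
		}
		if src.Len() == 0 {
			return // queue is empty: nothing left to wait for
		}
		pause() // URLs remain, but their hosts are still cooling down
	}
}

// fakeSource returns nil for the first `blocked` calls to simulate a host delay.
type fakeSource struct {
	urls    []string
	blocked int
}

func (f *fakeSource) Len() int { return len(f.urls) }

func (f *fakeSource) Next() *item {
	if len(f.urls) == 0 {
		return nil
	}
	if f.blocked > 0 {
		f.blocked--
		return nil
	}
	it := &item{URL: f.urls[0]}
	f.urls = f.urls[1:]
	return it
}

func main() {
	src := &fakeSource{
		urls:    []string{"https://example.com/a", "https://example.com/b"},
		blocked: 2,
	}
	drain(src,
		func() { time.Sleep(10 * time.Millisecond) },
		func(it *item) { fmt.Println("fetch", it.URL) },
	)
}
```

In a real crawler, fetching a page usually adds new URLs, and several workers may share one frontier, so an empty queue alone may not be a sufficient stop condition.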

func (*Frontier) SeenCount

func (f *Frontier) SeenCount() int

SeenCount returns the total number of unique URLs seen.

func (*Frontier) SetDelay added in v0.9.0

func (f *Frontier) SetDelay(delay time.Duration)

SetDelay updates the per-host politeness delay.

type HostQueue

type HostQueue struct {
	// contains filtered or unexported fields
}

HostQueue manages per-host politeness delays.

func NewHostQueue

func NewHostQueue(delay time.Duration) *HostQueue

NewHostQueue creates a new HostQueue with the given default delay.

func (*HostQueue) CanFetch

func (hq *HostQueue) CanFetch(host string) bool

CanFetch returns true if enough time has passed since the last fetch to this host.

func (*HostQueue) RecordFetch

func (hq *HostQueue) RecordFetch(host string)

RecordFetch records that a fetch was made to this host.

func (*HostQueue) SetDelay

func (hq *HostQueue) SetDelay(delay time.Duration)

SetDelay updates the delay for a specific host (e.g., from robots.txt crawl-delay).
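As an illustration of where such a delay might come from, the sketch below parses a single robots.txt line carrying the nonstandard Crawl-delay directive into a time.Duration that could then be passed to SetDelay. parseCrawlDelay and the one-day upper bound are assumptions made for this example; the package does not document a robots.txt parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// maxDelaySeconds is an arbitrary sanity cap chosen for this sketch.
const maxDelaySeconds = 86400.0

// parseCrawlDelay is a hypothetical helper. It returns the delay from a
// line such as "Crawl-delay: 10" and reports whether the line was a
// valid Crawl-delay directive.
func parseCrawlDelay(line string) (time.Duration, bool) {
	// Drop a trailing "# comment", which robots.txt allows.
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i]
	}
	key, val, ok := strings.Cut(line, ":")
	if !ok || !strings.EqualFold(strings.TrimSpace(key), "crawl-delay") {
		return 0, false
	}
	secs, err := strconv.ParseFloat(strings.TrimSpace(val), 64)
	// The range check also rejects NaN and infinities.
	if err != nil || !(secs >= 0 && secs <= maxDelaySeconds) {
		return 0, false
	}
	return time.Duration(secs * float64(time.Second)), true
}

func main() {
	if d, ok := parseCrawlDelay("Crawl-delay: 10"); ok {
		fmt.Println("delay:", d)
		// With a HostQueue value hq, this could be applied as hq.SetDelay(d).
	}
}
```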

func (*HostQueue) TimeUntilReady

func (hq *HostQueue) TimeUntilReady(host string) time.Duration

TimeUntilReady returns how long to wait before the host can be fetched again.
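The relationship between RecordFetch, CanFetch, and TimeUntilReady can be sketched with a map of last-fetch timestamps. politeness, fakeClock, and demoDelay are hypothetical names; the clock is injected so the example is deterministic, the sketch is not safe for concurrent use, and HostQueue's real internals may differ.

```go
package main

import (
	"fmt"
	"time"
)

// demoDelay is the politeness delay used in this sketch.
const demoDelay = 2 * time.Second

// fakeClock is a manually advanced clock for deterministic examples.
type fakeClock struct{ t time.Time }

func (c *fakeClock) now() time.Time          { return c.t }
func (c *fakeClock) advance(d time.Duration) { c.t = c.t.Add(d) }

// politeness is a hypothetical per-host delay tracker.
type politeness struct {
	delay time.Duration
	last  map[string]time.Time // time of the most recent fetch per host
	now   func() time.Time
}

func newPoliteness(delay time.Duration, now func() time.Time) *politeness {
	return &politeness{delay: delay, last: make(map[string]time.Time), now: now}
}

// recordFetch notes that host was fetched just now.
func (p *politeness) recordFetch(host string) { p.last[host] = p.now() }

// timeUntilReady returns the remaining wait for host, or 0 if it is ready.
func (p *politeness) timeUntilReady(host string) time.Duration {
	last, ok := p.last[host]
	if !ok {
		return 0 // never fetched: ready immediately
	}
	if wait := p.delay - p.now().Sub(last); wait > 0 {
		return wait
	}
	return 0
}

// canFetch reports whether the delay for host has fully elapsed.
func (p *politeness) canFetch(host string) bool {
	return p.timeUntilReady(host) == 0
}

func main() {
	clock := &fakeClock{t: time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)}
	p := newPoliteness(demoDelay, clock.now)

	p.recordFetch("example.com")
	fmt.Println(p.canFetch("example.com"), p.timeUntilReady("example.com"))

	clock.advance(demoDelay)
	fmt.Println(p.canFetch("example.com"), p.timeUntilReady("example.com"))
}
```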

type URLDb

type URLDb struct {
	// contains filtered or unexported fields
}

URLDb tracks seen URLs using FNV-1a hashes for memory efficiency.
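Storing a fixed-size hash instead of the full URL string is what saves memory. A minimal sketch of the idea using the standard hash/fnv package is shown below; hashSet, hashURL, and the choice of the 64-bit FNV-1a variant are assumptions for illustration, since the hash width used by URLDb is not stated here.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashSet is a hypothetical seen-set that stores only 64-bit FNV-1a hashes.
type hashSet struct {
	seen map[uint64]struct{}
}

func newHashSet() *hashSet {
	return &hashSet{seen: make(map[uint64]struct{})}
}

// hashURL returns the 64-bit FNV-1a hash of url.
func hashURL(url string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(url)) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

// add marks url as seen and reports whether it was new.
func (s *hashSet) add(url string) bool {
	k := hashURL(url)
	if _, ok := s.seen[k]; ok {
		return false
	}
	s.seen[k] = struct{}{}
	return true
}

// has reports whether url (or a colliding URL) has been seen.
func (s *hashSet) has(url string) bool {
	_, ok := s.seen[hashURL(url)]
	return ok
}

// size returns the number of stored hashes.
func (s *hashSet) size() int { return len(s.seen) }

func main() {
	s := newHashSet()
	fmt.Println(s.add("https://example.com/a")) // true
	fmt.Println(s.add("https://example.com/a")) // false
	fmt.Println(s.has("https://example.com/b")) // false
	fmt.Println(s.size())                       // 1
}
```

The trade-off of any hash-only set is that two different URLs with the same hash are treated as one, so a rare collision would cause a URL to be skipped.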

func NewURLDb

func NewURLDb() *URLDb

NewURLDb creates a new URL deduplication database.

func (*URLDb) Add

func (db *URLDb) Add(url string) bool

Add marks a URL as seen. Returns true if the URL was new.

func (*URLDb) Has

func (db *URLDb) Has(url string) bool

Has checks if a URL has been seen.

func (*URLDb) Len

func (db *URLDb) Len() int

Len returns the number of seen URLs.
