Documentation
Overview
Package crawler implements a concurrent website crawler with rate limiting, depth control, URL deduplication, and robots.txt compliance.
Constants
This section is empty.
Variables
This section is empty.
Functions
func FetchSitemapURLs
FetchSitemapURLs fetches and parses the sitemaps at the given URLs. It supports both sitemap index files and plain URL sets.
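The signature of FetchSitemapURLs is not shown here, so the following is only a minimal sketch of how the two sitemap shapes could be told apart with encoding/xml. The names sitemapDoc and parseSitemap are hypothetical and not part of this package; the element names come from the sitemaps.org protocol.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sitemapDoc (hypothetical) captures both document shapes:
// a <sitemapindex> lists child sitemaps, a <urlset> lists page URLs.
type sitemapDoc struct {
	XMLName  xml.Name
	Sitemaps []string `xml:"sitemap>loc"`
	URLs     []string `xml:"url>loc"`
}

// parseSitemap (hypothetical) returns page URLs for a URL set, or child
// sitemap URLs for an index, based on the root element name.
func parseSitemap(data []byte) ([]string, []string, error) {
	var doc sitemapDoc
	if err := xml.Unmarshal(data, &doc); err != nil {
		return nil, nil, err
	}
	switch doc.XMLName.Local {
	case "sitemapindex":
		return nil, doc.Sitemaps, nil
	case "urlset":
		return doc.URLs, nil, nil
	}
	return nil, nil, fmt.Errorf("unexpected root element %q", doc.XMLName.Local)
}

func main() {
	index := []byte(`<sitemapindex><sitemap><loc>https://example.com/s1.xml</loc></sitemap></sitemapindex>`)
	pages, children, err := parseSitemap(index)
	fmt.Println(pages, children, err)
}
```

A caller handling an index would fetch each child sitemap in turn and parse it the same way; whether FetchSitemapURLs recurses like this is an assumption.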
Types
type Config
type Config struct {
	MaxDepth        int
	Concurrency     int
	Timeout         time.Duration
	PageTimeout     time.Duration
	RateLimit       int
	RetryAttempts   int
	RetryDelay      time.Duration
	UserAgent       string
	FollowRedirects int
	RespectRobots   bool
	Exclude         []string
	AuthHeader      string
	AuthValue       string
	CookieJar       http.CookieJar
}
Config controls crawler behavior.
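As a rough illustration, a conservative configuration might be filled in as below. The struct is copied locally so the sketch compiles on its own. The helper politeConfig is hypothetical, and the units and meanings noted in the comments (requests per second, maximum redirect count, exclusion pattern format) are assumptions, since this page does not document them.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"time"
)

// Config is a local copy of the documented struct, for illustration only.
type Config struct {
	MaxDepth        int
	Concurrency     int
	Timeout         time.Duration
	PageTimeout     time.Duration
	RateLimit       int
	RetryAttempts   int
	RetryDelay      time.Duration
	UserAgent       string
	FollowRedirects int
	RespectRobots   bool
	Exclude         []string
	AuthHeader      string
	AuthValue       string
	CookieJar       http.CookieJar
}

// politeConfig (hypothetical) returns conservative example settings.
func politeConfig(jar http.CookieJar) Config {
	return Config{
		MaxDepth:        3,
		Concurrency:     4,
		Timeout:         10 * time.Minute, // assumed: budget for the whole crawl
		PageTimeout:     15 * time.Second, // assumed: budget per page
		RateLimit:       2,                // assumed: requests per second
		RetryAttempts:   2,
		RetryDelay:      time.Second,
		UserAgent:       "example-crawler/0.1",
		FollowRedirects: 5, // assumed: maximum redirects to follow
		RespectRobots:   true,
		Exclude:         []string{"/logout"}, // assumed: pattern format is undocumented
		CookieJar:       jar,
	}
}

func main() {
	jar, err := cookiejar.New(nil)
	if err != nil {
		panic(err)
	}
	cfg := politeConfig(jar)
	fmt.Println(cfg.UserAgent, cfg.MaxDepth, cfg.Concurrency, cfg.RespectRobots)
}
```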
type Crawler
type Crawler struct {
	// contains filtered or unexported fields
}
Crawler performs concurrent crawling with rate limiting.
type Link
type Link struct {
	Href     string
	Text     string
	Rel      string
	External bool
	Anchor   bool
	Resource bool   // true for non-anchor resource URLs (img, script, iframe, etc.)
	Tag      string // source element tag (e.g., "img", "script", "iframe")
}
Link represents a hyperlink found on a page.
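One plausible use of the External and Resource flags is to pick out the links worth following on the same site. The sketch below uses a local copy of Link, and the helper crawlable is hypothetical; it is not claimed to be how Crawler itself selects links.

```go
package main

import "fmt"

// Link is a local copy of the documented struct, for illustration only.
type Link struct {
	Href     string
	Text     string
	Rel      string
	External bool
	Anchor   bool
	Resource bool
	Tag      string
}

// crawlable (hypothetical) keeps links that stay on-site and are not
// embedded resources such as images or scripts.
func crawlable(links []Link) []Link {
	var out []Link
	for _, l := range links {
		if l.External || l.Resource {
			continue
		}
		out = append(out, l)
	}
	return out
}

func main() {
	links := []Link{
		{Href: "/about", Tag: "a"},
		{Href: "https://other.example/", Tag: "a", External: true},
		{Href: "/logo.png", Tag: "img", Resource: true},
	}
	for _, l := range crawlable(links) {
		fmt.Println(l.Href)
	}
}
```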
type Page
type Page struct {
	URL        string
	StatusCode int
	Headers    http.Header
	Body       []byte
	Links      []Link
	Forms      []Form
	Depth      int
	ParentURL  string
	Duration   time.Duration
	Error      error
}
Page represents a single crawled page with its metadata.
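A caller will often want to separate failed pages from successful ones. The following sketch works on a trimmed local copy of Page (Headers, Body, Links, Forms, and Duration are omitted) and treats a non-nil Error or a status of 400 and above as a failure. That rule and the helper failed are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Page is a trimmed local copy of the documented struct, for illustration only.
type Page struct {
	URL        string
	StatusCode int
	Depth      int
	ParentURL  string
	Error      error
}

// errTimeout is a stand-in error used by the example data.
var errTimeout = errors.New("timeout")

// failed (hypothetical) returns pages that carry an Error or whose HTTP
// status is 400 or higher.
func failed(pages []Page) []Page {
	var out []Page
	for _, p := range pages {
		if p.Error != nil || p.StatusCode >= 400 {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pages := []Page{
		{URL: "https://example.com/", StatusCode: 200},
		{URL: "https://example.com/missing", StatusCode: 404, Depth: 1, ParentURL: "https://example.com/"},
		{URL: "https://example.com/slow", Depth: 1, ParentURL: "https://example.com/", Error: errTimeout},
	}
	for _, p := range failed(pages) {
		fmt.Printf("%s (linked from %q): status=%d err=%v\n", p.URL, p.ParentURL, p.StatusCode, p.Error)
	}
}
```

ParentURL is useful in such a report because it points at the page that contains the broken link.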
type RobotsCache
type RobotsCache struct {
	// contains filtered or unexported fields
}
RobotsCache caches parsed robots.txt rules per host.
func NewRobotsCache
func NewRobotsCache() *RobotsCache
NewRobotsCache creates an empty robots.txt cache.
func (*RobotsCache) Allowed
func (rc *RobotsCache) Allowed(rawURL, userAgent string) bool
Allowed reports whether rawURL is permitted for userAgent by the robots.txt rules of its host.
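To make the idea of a robots.txt check concrete, here is a simplified, standalone sketch of longest-prefix matching over Allow and Disallow rules, in the spirit of RFC 9309. It ignores wildcards and user-agent groups, and it is not the implementation behind Allowed. The names rule and allowed are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// rule (hypothetical) is one Allow or Disallow line from a robots.txt group.
type rule struct {
	allow  bool
	prefix string
}

// allowed (hypothetical) applies the longest matching prefix; on a tie the
// Allow rule wins, and a path with no matching rule is permitted.
func allowed(path string, rules []rule) bool {
	best := -1
	ok := true
	for _, r := range rules {
		// An empty pattern matches nothing; skip it.
		if r.prefix == "" || !strings.HasPrefix(path, r.prefix) {
			continue
		}
		n := len(r.prefix)
		if n > best || (n == best && r.allow) {
			best, ok = n, r.allow
		}
	}
	return ok
}

func main() {
	rules := []rule{
		{allow: false, prefix: "/private/"},
		{allow: true, prefix: "/private/public/"},
	}
	for _, p := range []string{"/index.html", "/private/x", "/private/public/page"} {
		fmt.Println(p, allowed(p, rules))
	}
}
```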
func (*RobotsCache) Sitemaps
func (rc *RobotsCache) Sitemaps(origin string) []string
Sitemaps returns the sitemap URLs declared in the robots.txt for origin.
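Sitemap declarations in robots.txt are plain "Sitemap: <url>" lines, so extracting them can be sketched as below. The helper sitemapLines is hypothetical and works on a robots.txt body already held in a string; how RobotsCache fetches and stores that body is not shown on this page.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sitemapLines (hypothetical) returns the values of all "Sitemap:" lines.
// The key is matched case-insensitively. Anything after '#' is dropped as a
// comment, which would also truncate a URL containing a fragment.
func sitemapLines(robots string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		// Cut at the first colon so the "https:" in the value is preserved.
		key, value, found := strings.Cut(line, ":")
		if !found || !strings.EqualFold(strings.TrimSpace(key), "sitemap") {
			continue
		}
		if v := strings.TrimSpace(value); v != "" {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	robots := "User-agent: *\nDisallow: /tmp/\nSitemap: https://example.com/sitemap.xml\n"
	fmt.Println(sitemapLines(robots))
}
```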