Documentation ¶
Overview ¶
Package crawler implements a concurrent website crawler with rate limiting, depth control, URL deduplication, and robots.txt compliance.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func FetchSitemapURLs ¶
FetchSitemapURLs fetches and parses sitemap(s) from the given URLs. Supports both sitemap index files and direct URL sets.
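A minimal usage sketch follows. The function's signature is not reproduced above, so the parameters and return values shown here (a slice of sitemap URLs in, the discovered page URLs and an error out) are assumptions:

sitemaps := []string{"https://example.com/sitemap.xml"}
urls, err := crawler.FetchSitemapURLs(sitemaps) // assumed signature: ([]string) ([]string, error)
if err != nil {
	log.Fatal(err)
}
for _, u := range urls {
	fmt.Println(u)
}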
Types ¶
type Config ¶
type Config struct {
	MaxDepth        int
	Concurrency     int
	Timeout         time.Duration
	PageTimeout     time.Duration
	RateLimit       int
	RetryAttempts   int
	RetryDelay      time.Duration
	UserAgent       string
	FollowRedirects int
	RespectRobots   bool
	Exclude         []string
	AuthHeader      string
	AuthValue       string
	CookieJar       http.CookieJar
	AllowPrivateIPs bool // When true, skip SSRF protection for private IPs
}
Config controls crawler behavior.
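For example, a conservative configuration might look like the following. The values are illustrative only, and the meanings noted in the comments (total versus per-page timeout, RateLimit in requests per second) are assumptions inferred from the field names rather than documented behavior:

cfg := crawler.Config{
	MaxDepth:      3,
	Concurrency:   8,
	Timeout:       2 * time.Minute,  // assumed: total crawl deadline
	PageTimeout:   15 * time.Second, // assumed: per-page fetch deadline
	RateLimit:     5,                // assumed: requests per second
	RetryAttempts: 2,
	RetryDelay:    time.Second,
	UserAgent:     "examplebot/1.0",
	RespectRobots: true,
	Exclude:       []string{"/logout", "/admin"},
}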
type Crawler ¶
type Crawler struct {
	// contains filtered or unexported fields
}
Crawler performs concurrent crawling with rate limiting.
type Link ¶
type Link struct {
	Href     string
	Text     string
	Rel      string
	External bool
	Anchor   bool
	Resource bool   // true for non-anchor resource URLs (img, script, iframe, etc.)
	Tag      string // source element tag (e.g., "img", "script", "iframe")
}
Link represents a hyperlink found on a page.
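As an illustration, the boolean fields can be used to separate outbound hyperlinks from embedded resources (links here stands for a []Link taken from a crawled Page or a ParseResult):

var outbound []string
for _, l := range links {
	// External marks off-site links; Resource marks non-anchor resources
	// such as images, scripts, and iframes.
	if l.External && !l.Resource {
		outbound = append(outbound, l.Href)
	}
}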
type Page ¶
type Page struct {
	URL          string
	StatusCode   int
	Headers      http.Header
	Body         []byte
	Links        []Link
	Forms        []Form
	Depth        int
	ParentURL    string
	Duration     time.Duration
	Error        error
	AuthRequired bool // true when server returned 401/403
}
Page represents a single crawled page with its metadata.
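A sketch of processing crawled pages; how pages are delivered by the Crawler is not shown above, so the results channel used here is an assumption:

for p := range pages { // assumed: pages is a <-chan crawler.Page
	if p.Error != nil {
		log.Printf("fetch %s (from %s): %v", p.URL, p.ParentURL, p.Error)
		continue
	}
	if p.AuthRequired {
		log.Printf("%s needs credentials (status %d)", p.URL, p.StatusCode)
		continue
	}
	log.Printf("%s: status %d, %d links, depth %d, took %s",
		p.URL, p.StatusCode, len(p.Links), p.Depth, p.Duration)
}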
type ParseResult ¶ added in v0.5.0
type ParseResult struct {
	Links    []Link
	Forms    []Form
	ParseErr error // non-nil if HTML parsing encountered an error (partial results still returned)
}
ParseResult holds the extracted links and forms along with any parse error.
func ParseHTML ¶ added in v0.5.0
func ParseHTML(pageURL string, body []byte) ParseResult
ParseHTML extracts links and forms, returning partial results even on parse error.
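For example (the import path of the package is an assumption):

package main

import (
	"fmt"
	"log"

	"example.com/crawler" // assumed import path
)

func main() {
	body := []byte(`<html><body><a href="https://example.org/">example</a></body></html>`)
	res := crawler.ParseHTML("https://example.com/page", body)
	if res.ParseErr != nil {
		// Partial results are still usable when parsing fails midway.
		log.Printf("parse warning: %v", res.ParseErr)
	}
	for _, l := range res.Links {
		fmt.Println(l.Href, l.Text, l.External)
	}
	fmt.Println("forms found:", len(res.Forms))
}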
type RobotsCache ¶
type RobotsCache struct {
	// contains filtered or unexported fields
}
RobotsCache caches parsed robots.txt rules per host.
func NewRobotsCache ¶
func NewRobotsCache() *RobotsCache
NewRobotsCache creates an empty robots.txt cache.
func (*RobotsCache) Allowed ¶
func (rc *RobotsCache) Allowed(rawURL, userAgent string) bool
Allowed checks if a URL is permitted by robots.txt rules. Per the standard, if both Allow and Disallow match a path, the longest matching rule wins. If they are the same length, Allow takes precedence.
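For example, a single cache can gate fetches across many URLs (robots.txt is assumed to be fetched and cached on first use per host):

rc := crawler.NewRobotsCache()
target := "https://example.com/private/report.html"
if !rc.Allowed(target, "examplebot/1.0") {
	log.Printf("skipping %s: disallowed by robots.txt", target)
	return
}
// safe to fetch target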
func (*RobotsCache) CrawlDelay ¶ added in v0.5.0
func (rc *RobotsCache) CrawlDelay(origin string) time.Duration
CrawlDelay returns the crawl-delay directive for the given origin, or 0 if not set.
func (*RobotsCache) Sitemaps ¶
func (rc *RobotsCache) Sitemaps(origin string) []string
Sitemaps returns the sitemap URLs declared in robots.txt for the given origin.
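Together with CrawlDelay above, this allows seeding a crawl politely. A short sketch, where rc is the *RobotsCache from the previous example and the "scheme://host" origin format is an assumption:

origin := "https://example.com"
if d := rc.CrawlDelay(origin); d > 0 {
	time.Sleep(d) // honor the Crawl-delay directive between requests
}
for _, sm := range rc.Sitemaps(origin) {
	fmt.Println("declared sitemap:", sm)
}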