Documentation ¶
Overview ¶
Package crawler implements a concurrent web crawler with BFS traversal, robots.txt compliance, sitemap parsing, and HTML link extraction.
Index ¶
- Variables
- func DeleteCheckpoint(path string) error
- func ParseHTML(base *url.URL, body []byte) (links []string, assets []string)
- func ParseSitemap(ctx context.Context, sitemapURL string) []string
- func SaveCheckpoint(path string, seedURL string, frontier *Frontier, pageURLs []string) error
- type Checkpoint
- type CrawlCache
- type CrawlCacheEntry
- type Crawler
- type Fetcher
- type Frontier
- type FrontierTask
- type HTTPFetcher
- type RobotsChecker
Constants ¶
This section is empty.
Variables ¶
var ErrRedirectLoop = errors.New("redirect loop detected")
ErrRedirectLoop is returned when a redirect cycle is detected.
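As a rough sketch, callers can distinguish redirect loops from other fetch failures with errors.Is. Where the error originates (an HTTPFetcher call, for instance) is left open here, and the import path is a placeholder.

package crawlexample

import (
	"errors"
	"log"

	"example.com/project/crawler" // placeholder import path
)

// reportFetchErr logs redirect loops separately from other fetch failures.
// The source of err (e.g. an HTTPFetcher call) is assumed, not shown above.
func reportFetchErr(pageURL string, err error) {
	switch {
	case errors.Is(err, crawler.ErrRedirectLoop):
		log.Printf("skipping %s: redirect loop", pageURL)
	case err != nil:
		log.Printf("fetch %s failed: %v", pageURL, err)
	}
}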
Functions ¶
func DeleteCheckpoint ¶
func DeleteCheckpoint(path string) error
DeleteCheckpoint removes a checkpoint file.
func ParseSitemap ¶
func ParseSitemap(ctx context.Context, sitemapURL string) []string
ParseSitemap fetches and parses a sitemap.xml, returning discovered URLs. Gzip-compressed sitemaps (.xml.gz) are transparently decompressed.
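A minimal sketch of seeding a crawl from a sitemap; only ParseSitemap is taken from this package, and the import path is a placeholder.

package main

import (
	"context"
	"fmt"
	"time"

	"example.com/project/crawler" // placeholder import path
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Works for plain sitemap.xml as well as gzip-compressed .xml.gz sitemaps.
	for _, u := range crawler.ParseSitemap(ctx, "https://example.com/sitemap.xml") {
		fmt.Println(u)
	}
}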
Types ¶
type Checkpoint ¶
type Checkpoint struct {
SeedURL string `json:"seed_url"`
Seen map[string]bool `json:"seen"`
Queue []FrontierTask `json:"queue"`
PageURLs []string `json:"page_urls"`
}
Checkpoint represents a saved crawl state that can be resumed.
func LoadCheckpoint ¶
func LoadCheckpoint(path string) (*Checkpoint, error)
LoadCheckpoint reads a checkpoint file and returns the saved state.
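A sketch of a resume-or-start flow using the checkpoint helpers; the checkpoint path, scope, and page cap are placeholders, and the crawl loop itself is elided.

package main

import (
	"log"

	"example.com/project/crawler" // placeholder import path
)

func main() {
	const cpPath = "crawl-checkpoint.json" // placeholder path

	seed := "https://example.com/"
	frontier := crawler.NewFrontier("example.com", 500, nil, nil)
	var pageURLs []string

	// Resume from a previous run when a checkpoint exists.
	if cp, err := crawler.LoadCheckpoint(cpPath); err == nil {
		frontier = crawler.RestoreFrontier(cp, "example.com", 500, nil, nil)
		pageURLs = cp.PageURLs
		seed = cp.SeedURL
	}

	// ... crawl, appending fetched pages to pageURLs ...

	// Persist progress so an interrupted crawl can be resumed later.
	if err := crawler.SaveCheckpoint(cpPath, seed, frontier, pageURLs); err != nil {
		log.Printf("checkpoint save failed: %v", err)
	}

	// After a successful, complete crawl the checkpoint can be discarded.
	if err := crawler.DeleteCheckpoint(cpPath); err != nil {
		log.Printf("checkpoint delete failed: %v", err)
	}
}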
type CrawlCache ¶
type CrawlCache struct {
Entries map[string]CrawlCacheEntry `json:"entries"`
}
CrawlCache maps URLs to their cached metadata for incremental crawling.
func LoadCrawlCache ¶
func LoadCrawlCache(path string) (*CrawlCache, error)
LoadCrawlCache reads a cache file from disk.
func (*CrawlCache) Get ¶
func (cc *CrawlCache) Get(url string) (CrawlCacheEntry, bool)
Get returns the cache entry for a URL, or empty if not cached.
func (*CrawlCache) Save ¶
func (cc *CrawlCache) Save(path string) error
Save writes the cache to disk.
func (*CrawlCache) Set ¶
func (cc *CrawlCache) Set(url string, entry CrawlCacheEntry)
Set stores a cache entry for a URL.
type CrawlCacheEntry ¶
type CrawlCacheEntry struct {
LastModified string `json:"last_modified,omitempty"`
ETag string `json:"etag,omitempty"`
ContentHash string `json:"content_hash,omitempty"`
}
CrawlCacheEntry stores metadata for a previously crawled URL.
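A sketch of how the cache might drive conditional refetching; the file path, URL, and header handling are placeholders, and only the CrawlCache and CrawlCacheEntry APIs shown above are taken from this package.

package main

import (
	"log"
	"net/http"

	"example.com/project/crawler" // placeholder import path
)

func main() {
	cache, err := crawler.LoadCrawlCache("crawl-cache.json") // placeholder path
	if err != nil {
		// No cache yet: start with an empty one.
		cache = &crawler.CrawlCache{Entries: map[string]crawler.CrawlCacheEntry{}}
	}

	pageURL := "https://example.com/docs/"
	req, err := http.NewRequest(http.MethodGet, pageURL, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Turn the stored validators into a conditional request.
	if entry, ok := cache.Get(pageURL); ok {
		if entry.ETag != "" {
			req.Header.Set("If-None-Match", entry.ETag)
		}
		if entry.LastModified != "" {
			req.Header.Set("If-Modified-Since", entry.LastModified)
		}
	}

	// ... perform the request; on a 200 OK record the fresh validators ...
	cache.Set(pageURL, crawler.CrawlCacheEntry{
		ETag:         `"abc123"`,                       // from the ETag response header
		LastModified: "Mon, 02 Jan 2006 15:04:05 GMT", // from the Last-Modified header
	})

	if err := cache.Save("crawl-cache.json"); err != nil {
		log.Printf("cache save failed: %v", err)
	}
}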
type Crawler ¶
type Crawler struct {
// contains filtered or unexported fields
}
Crawler orchestrates concurrent web crawling.
func NewCrawler ¶
NewCrawler creates a new Crawler with the given configuration and fetcher.
type Frontier ¶
type Frontier struct {
// contains filtered or unexported fields
}
Frontier is a thread-safe BFS queue with deduplication and scope enforcement.
func NewFrontier ¶
func NewFrontier(seedHost string, maxPages int, includePatterns, excludePatterns []string) *Frontier
NewFrontier creates a new Frontier scoped to the given seed host.
func RestoreFrontier ¶
func RestoreFrontier(cp *Checkpoint, seedHost string, maxPages int, include, exclude []string) *Frontier
RestoreFrontier creates a Frontier pre-populated from a Checkpoint.
func (*Frontier) Add ¶
Add enqueues a URL at the given depth if it is in scope, not a duplicate, and under the cap.
func (*Frontier) Dequeue ¶
func (f *Frontier) Dequeue() (FrontierTask, bool)
Dequeue removes and returns the next task from the front of the queue.
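A sketch of driving the frontier directly; the exact Add signature (URL plus depth) is an assumption based on its doc comment, and the fetch/extract step is elided.

package main

import (
	"fmt"

	"example.com/project/crawler" // placeholder import path
)

func main() {
	// BFS over a single host, capped at 50 pages, with no include/exclude patterns.
	f := crawler.NewFrontier("example.com", 50, nil, nil)

	// Assumed signature: Add(rawURL string, depth int); the docs describe it as
	// enqueueing a URL at a given depth but do not show the signature here.
	f.Add("https://example.com/", 0)

	for {
		task, ok := f.Dequeue()
		if !ok {
			break // frontier drained
		}
		// ... fetch the task's URL, parse links, and Add them at depth+1 ...
		fmt.Printf("%+v\n", task)
	}
}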
type FrontierTask ¶
FrontierTask represents a URL to be crawled along with its depth from the seed.
type HTTPFetcher ¶
type HTTPFetcher struct {
// contains filtered or unexported fields
}
HTTPFetcher implements Fetcher using net/http.
func NewHTTPFetcher ¶
NewHTTPFetcher creates a new HTTPFetcher with the given user agent and per-request timeout.
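A sketch of constructing the fetcher; the parameter order (user agent, then timeout) is an assumption based on the doc comment, and the import path is a placeholder.

package main

import (
	"time"

	"example.com/project/crawler" // placeholder import path
)

func main() {
	// Assumed signature: NewHTTPFetcher(userAgent string, timeout time.Duration).
	fetcher := crawler.NewHTTPFetcher("docs-crawler/1.0 (+https://example.com/bot)", 15*time.Second)
	_ = fetcher // typically passed to NewCrawler together with a crawl configuration
}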
type RobotsChecker ¶
type RobotsChecker struct {
// contains filtered or unexported fields
}
RobotsChecker fetches, parses, and caches robots.txt files per host.
func NewRobotsChecker ¶
func NewRobotsChecker(userAgent string, l logger.Logger) *RobotsChecker
NewRobotsChecker creates a new RobotsChecker that matches against the given user agent.