Documentation
Index
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type Crawler
type Crawler struct {
	Name string // Name of crawler for easy identification
	*CrawlerConfig
}
Crawler crawls URLs fetched from Queue and saves the contents to Models. Crawler quits after IdleTimeout when the queue is empty.
func NNewCrawlers
func NNewCrawlers(n int, namePrefix string, cfg *CrawlerConfig) ([]*Crawler, error)
NNewCrawlers returns n new Crawlers configured with cfg. Crawlers are named using namePrefix.
func NewCrawler
func NewCrawler(name string, cfg *CrawlerConfig) (*Crawler, error)
NewCrawler returns a pointer to a new Crawler.
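As a usage sketch of the two constructors above: the import path, the module name, and the CrawlerConfig field values are assumptions for illustration; only the signatures shown on this page are taken as given. Queue, Models, and the other dependencies are elided and would need to be set up for a real crawl.

```go
package main

import (
	"log"
	"net/url"
	"time"

	crawler "example.com/yourmodule/crawler" // hypothetical import path
)

func main() {
	base, err := url.Parse("https://example.com")
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder configuration; Queue, Models, etc. are omitted here.
	cfg := &crawler.CrawlerConfig{
		BaseURL:     base,
		UserAgent:   "examplebot/1.0",
		IdleTimeout: 10 * time.Second,
	}

	// A single named crawler.
	c, err := crawler.NewCrawler("crawler-0", cfg)
	if err != nil {
		log.Fatal(err)
	}
	_ = c

	// Five crawlers sharing the same config, named from the
	// prefix "worker" (the exact naming scheme is an assumption).
	crawlers, err := crawler.NNewCrawlers(5, "worker", cfg)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("created %d crawlers", len(crawlers))
}
```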
type CrawlerConfig
type CrawlerConfig struct {
	Queue            *queue.UniqueQueue // global queue
	Models           *models.Models     // models to use
	BaseURL          *url.URL           // base URL to crawl
	UserAgent        string             // user-agent to use while crawling
	MarkedURLs       []string           // marked URLs to save to models
	IgnorePatterns   []string           // URL patterns to ignore
	RequestDelay     time.Duration      // delay between subsequent requests
	IdleTimeout      time.Duration      // timeout after which crawler quits when queue is empty
	Log              *log.Logger        // logger to use
	RetryTimes       int                // number of times to retry a failed request
	FailedRequests   map[string]int     // map to store failed-request stats
	KnownInvalidURLs *InvalidURLCache   // known map of invalid URLs
	Ctx              context.Context    // context to quit on SIGINT/SIGTERM
	// contains filtered or unexported fields
}
CrawlerConfig configures a Crawler.
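A sketch of populating the struct above as a config fragment: every field value is an assumption chosen for illustration, and Queue and Models (which must be constructed with their respective packages) are elided. Field names and types come from the listing above.

```go
base, _ := url.Parse("https://example.com")

cfg := &CrawlerConfig{
	// Queue and Models must be constructed elsewhere with the
	// queue and models packages; elided in this fragment.
	BaseURL:        base,
	UserAgent:      "examplebot/1.0",
	MarkedURLs:     []string{"/docs/", "/blog/"},
	IgnorePatterns: []string{`\.pdf$`},
	RequestDelay:   500 * time.Millisecond,
	IdleTimeout:    10 * time.Second,
	Log:            log.New(os.Stderr, "crawler ", log.LstdFlags),
	RetryTimes:     3,
	FailedRequests: map[string]int{},
	Ctx:            context.Background(),
}
```

A short RequestDelay with a nonzero RetryTimes is a reasonable starting point; IdleTimeout controls how long crawlers linger on an empty queue before quitting.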
type InvalidURLCache
type InvalidURLCache struct {
	// contains filtered or unexported fields
}
InvalidURLCache is the cache of known-invalid URLs.