Documentation ¶
Overview ¶
Package goatscrape is a web crawling and scraping framework. Its aim is to provide a robust, powerful crawling framework out of the box, packaging much of the default behaviour into plugins. It has the following advantages:
- It is easy to use with the default plugins, but can be extended by those who need extra power or control;
- It is performant, using concurrency natively and allowing spider tasks to be compiled into a single binary;
- It can be tailored to a range of use cases, from basic screen scraping to a bespoke tool that pulls tasks off a work queue and publishes its findings to a database.
It was originally written by Steven Holdway and is released under the MIT License for ease of static linking.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type LinkStore ¶
type LinkStore interface {
// GetLinks should return a string slice of links to crawl, up to
// the number given in the amount parameter.
GetLinks(amount int) []string
// AddToCrawl should add the link parameter to the to crawl list.
AddToCrawl(link string)
// MoveToCrawled should delete the link in the to crawl list and
// place it in the crawled list.
MoveToCrawled(link string)
// MoreToCrawl returns true if there are still links in the
// to crawl list.
MoreToCrawl() bool
}
LinkStore defines the interface that any object that looks to store, and manage, the toCrawl and the crawled lists should implement. All the methods in this interface are expected to be thread safe.
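A minimal in-memory implementation might look like the following sketch. The `memoryLinkStore` type and its constructor are hypothetical (they are not part of the package); the interface is copied from above so the example is self-contained, and a mutex guards every method since the interface requires thread safety.

```go
package main

import (
	"fmt"
	"sync"
)

// LinkStore is copied from the interface above so this sketch is
// self-contained.
type LinkStore interface {
	GetLinks(amount int) []string
	AddToCrawl(link string)
	MoveToCrawled(link string)
	MoreToCrawl() bool
}

// memoryLinkStore is a hypothetical in-memory LinkStore. A mutex
// guards all state, as every method is expected to be thread safe.
type memoryLinkStore struct {
	mu      sync.Mutex
	toCrawl []string
	crawled map[string]bool
}

func newMemoryLinkStore(seeds ...string) *memoryLinkStore {
	return &memoryLinkStore{toCrawl: seeds, crawled: make(map[string]bool)}
}

// GetLinks returns up to amount links from the to crawl list.
func (m *memoryLinkStore) GetLinks(amount int) []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	if amount > len(m.toCrawl) {
		amount = len(m.toCrawl)
	}
	return append([]string(nil), m.toCrawl[:amount]...)
}

// AddToCrawl appends a link unless it has already been crawled.
func (m *memoryLinkStore) AddToCrawl(link string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.crawled[link] {
		m.toCrawl = append(m.toCrawl, link)
	}
}

// MoveToCrawled removes a link from the to crawl list and records it
// as crawled.
func (m *memoryLinkStore) MoveToCrawled(link string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for i, l := range m.toCrawl {
		if l == link {
			m.toCrawl = append(m.toCrawl[:i], m.toCrawl[i+1:]...)
			break
		}
	}
	m.crawled[link] = true
}

// MoreToCrawl reports whether any links remain in the to crawl list.
func (m *memoryLinkStore) MoreToCrawl() bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return len(m.toCrawl) > 0
}

func main() {
	var store LinkStore = newMemoryLinkStore("https://example.com/")
	store.AddToCrawl("https://example.com/about")
	fmt.Println(store.GetLinks(2))
	store.MoveToCrawled("https://example.com/")
	store.MoveToCrawled("https://example.com/about")
	fmt.Println(store.MoreToCrawl()) // false: nothing left to crawl
}
```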
type ParseFunc ¶
ParseFunc defines a function that takes an HTTP response and returns a string slice of further URLs to crawl.
type PreRequestFunc ¶
PreRequestFunc is a function that modifies an existing http.Request object before the request is made to a web server. It can be used, for example, to modify the User-Agent header before each request.
type RequestFunc ¶
RequestFunc should take a pre-constructed http.Request object and return an http.Response object or an error. This is used for custom getters within the spider object.
type Spider ¶
type Spider struct {
// Name is the unique scan name. It is currently used for logging purposes.
Name string
// StartingURLs is a string slice of all the URLs that will be loaded into
// the spider first. These should be used to seed the scanner.
StartingURLs []string
// AllowedDomains is a string slice of all the allowed domains. An empty
// slice causes the spider to treat every domain as allowed.
AllowedDomains []string
// DisallowedPages is a slice of regular expressions. Each expression is evaluated on all links
// returned from the Parse() function. If the expression matches then the link is not added to the
// to crawl list.
DisallowedPages []regexp.Regexp
// MaxPages is the maximum number of pages to crawl before the scanner returns. A value of zero or less
// causes the spider to assume there is no page limit.
MaxPages int
// MaxConcurrentRequests is the maximum number of requests to run in parallel.
MaxConcurrentRequests int
// The Parse function should emit a list of URLs to be added to the crawl.
Parse ParseFunc
// PreRequestMiddleware is a slice of functions that implement PreRequestFunc. Each of these functions
// is called on the http.Request object before it is executed by the http.Client.
PreRequestMiddleware []PreRequestFunc
// The function that gets a web page. It should take an http.Request and return an http.Response.
Getter RequestFunc
// Verbose will cause more diagnostic information to be output if it is set to true.
Verbose bool
// Quiet will suppress all output to stdout and stderr.
Quiet bool
// Links is the LinkStore object that is used by this spider. LinkStores are responsible for storing,
// and managing the crawled and the to crawl lists used by the spider during its operation.
Links LinkStore
// contains filtered or unexported fields
}
Spider defines a single scrape job. Clients should create a new Spider instance and customise it before running the Start() method.
func (*Spider) AddPreRequestMiddleware ¶
func (s *Spider) AddPreRequestMiddleware(funcs ...PreRequestFunc)
AddPreRequestMiddleware takes a variadic number of PreRequestFunc arguments and ensures that each of the added functions is called on the http.Request object before a request is made.
Directories ¶

| Path | Synopsis |
|---|---|
| examples | |
| examples/simple (command) | This example starts with a few seed urls, and includes a parse function that always returns the same URL. |
| examples/spider (command) | This example loads five pages from XKCD, one page at a time. |
| plugins | Package plugins is a set of default pieces of functionality to use with the stevie-holdway/goatscrape package. |