goatscrape

Published: Oct 6, 2014 License: MIT Imports: 6 Imported by: 0

README

goatscrape

goatscrape is a web crawling/scraping framework for the Go language, loosely inspired by Scrapy for Python. It favours composability, and has the majority of its functionality separated into plugins, making it easy to compose behaviour from the default plugins or write your own. goatscrape was written for a few reasons:

  • To automate tasks where large amounts of HTTP content need to be downloaded and processed;
  • To allow developers to produce a single statically linked Go binary for crawling tasks;
  • To define a spidering task in terms of configuring a struct, rather than writing code for every single step involved in crawling;
  • ... but mainly because I was bored ;)

See the examples directory for some runnable code examples.

goatscrape was originally called 'goscrape', but the name was changed when a few other projects with that name turned up. Despite popular belief, it only scrapes goats if there is some kind of goat-oriented website to crawl.

Current Status

goatscrape is currently very alpha. There's a lot that I still want to do with it, but for my very small scale tests it seems to work as expected. That said, be warned that it's probably very buggy still.

Any contributions gratefully received :)

TODO List

  • A walkthrough guide and tutorial with some examples. Will do this when the API looks like it is pretty stable.
  • Unit tests (at some point)
  • More sophisticated code examples
  • Add a cookie store
    • Made the http.Client used by the spider public, so theoretically you could just use the http.Client's API to do this.
      • Added getter plugins. Will add a getter plugin that takes a user-supplied http.Client in the future.
  • Add some more ready-to-go middleware functions
  • Add some ready-to-go parse functions (such as get all hrefs from a page, for example)
    • Already added one of these, maybe a few others will be added when I think of them
  • Add a clear redirect policy. Currently 301 redirects are automatically allowed, but this may not suit all scenarios. This will require syncing up the http.Client with the getAndVerifyHead function.
  • Sprinkle a little more awesome to take it from a toy to deployable tool

Documentation

Overview

Package goatscrape is a web crawling and scraping framework. Its aim is to provide a robust, powerful crawling framework out of the box that packages a lot of default behaviour into plugins. It has the following advantages:

  • It is easy to use with the default plugins, but can be extended by those who need extra power or control;
  • It is performant, natively using concurrency and allowing spider tasks to be compiled to a single binary;
  • It fits a range of use cases, from basic screen scraping to a bespoke tool that pulls tasks off a work queue and publishes its findings to a database.

It was originally written by Steven Holdway and is released under the MIT License for ease of static linking.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type LinkStore

type LinkStore interface {
	// GetLinks should return a string slice of links to crawl, up to
	// the number given in the amount parameter.
	GetLinks(amount int) []string
	// AddToCrawl should add the link parameter to the to crawl list.
	AddToCrawl(link string)
	// MoveToCrawled should delete the link in the to crawl list and
	// place it in the crawled list.
	MoveToCrawled(link string)
	// MoreToCrawl reports whether there are still links in the
	// to crawl list.
	MoreToCrawl() bool
}

LinkStore defines the interface that any object that stores and manages the to crawl and crawled lists should implement. All the methods in this interface are expected to be thread safe.
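A minimal in-memory implementation might look like the following sketch. The interface is reproduced from above; memoryStore and its fields are illustrative names, not part of the package.

```go
package main

import (
	"fmt"
	"sync"
)

// LinkStore mirrors the interface defined by goatscrape.
type LinkStore interface {
	GetLinks(amount int) []string
	AddToCrawl(link string)
	MoveToCrawled(link string)
	MoreToCrawl() bool
}

// memoryStore is a mutex-guarded in-memory LinkStore.
type memoryStore struct {
	mu      sync.Mutex
	toCrawl []string
	crawled map[string]bool
}

func newMemoryStore() *memoryStore {
	return &memoryStore{crawled: make(map[string]bool)}
}

// GetLinks returns up to amount links without removing them;
// MoveToCrawled is responsible for taking links off the list.
func (m *memoryStore) GetLinks(amount int) []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	if amount > len(m.toCrawl) {
		amount = len(m.toCrawl)
	}
	return append([]string(nil), m.toCrawl[:amount]...)
}

// AddToCrawl queues a link, skipping anything already crawled or queued.
func (m *memoryStore) AddToCrawl(link string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.crawled[link] {
		return
	}
	for _, l := range m.toCrawl {
		if l == link {
			return
		}
	}
	m.toCrawl = append(m.toCrawl, link)
}

// MoveToCrawled removes a link from the to crawl list and records it as done.
func (m *memoryStore) MoveToCrawled(link string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for i, l := range m.toCrawl {
		if l == link {
			m.toCrawl = append(m.toCrawl[:i], m.toCrawl[i+1:]...)
			break
		}
	}
	m.crawled[link] = true
}

func (m *memoryStore) MoreToCrawl() bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	return len(m.toCrawl) > 0
}

func main() {
	var s LinkStore = newMemoryStore()
	s.AddToCrawl("http://example.com/")
	fmt.Println(s.MoreToCrawl(), s.GetLinks(10))
	s.MoveToCrawled("http://example.com/")
	fmt.Println(s.MoreToCrawl())
}
```

A production store would likely back this with a database or work queue, which is exactly the kind of swap the interface is designed to allow.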

type ParseFunc

type ParseFunc func(*http.Response) []string

ParseFunc defines a function that takes an HTTP response and returns a string slice of further URLs to crawl.

type PreRequestFunc

type PreRequestFunc func(*http.Request)

PreRequestFunc is a function that modifies an existing http.Request object before it is sent to a web server. It can be used, for example, to modify the user agent header before each request.

type RequestFunc

type RequestFunc func(*http.Request) (*http.Response, error)

RequestFunc should take a pre-constructed http.Request object and return an http.Response object or an error. This is used for custom getters within the spider object.

type Spider

type Spider struct {
	// Name is the unique scan name. It is currently used for logging purposes.
	Name string
	// StartingURLs is a string slice of all the URLs that will be loaded into
	// the spider first. These should be used to seed the scanner.
	StartingURLs []string
	// AllowedDomains is a string slice with all the allowed domains. An empty
	// slice causes the spider to treat every domain as allowed.
	AllowedDomains []string
	// DisallowedPages is a slice of regular expressions. Each expression is evaluated on all links
	// returned from the Parse() function. If the expression matches then the link is not added to the
	// to crawl list.
	DisallowedPages []regexp.Regexp
	// MaxPages is the maximum number of pages to crawl before the scanner returns. A value of zero or less
	// causes the spider to apply no page limit.
	MaxPages int
	// MaxConcurrentRequests is the maximum number of requests to run in parallel.
	MaxConcurrentRequests int

	// The Parse function should emit a list of urls that should be added to the crawl.
	Parse ParseFunc
	// PreRequestMiddleware is a slice of functions that implement PreRequestFunc. Each of these functions
	// is called on the http.Request object before it is executed by the http.Client.
	PreRequestMiddleware []PreRequestFunc
	// Getter is the function that fetches a web page. It should take an http.Request and return an http.Response.
	Getter RequestFunc

	// Verbose will cause more diagnostic information to be output if it's set to true.
	Verbose bool
	// Quiet will suppress all output to stdout or stderr.
	Quiet bool

	// Links is the LinkStore object that is used by this spider. LinkStores are responsible for storing
	// and managing the crawled and the to crawl lists used by the spider during its operation.
	Links LinkStore
	// contains filtered or unexported fields
}

Spider defines a single scrape job. Clients should create a new Spider instance and customise it before running the Start() method.

func (*Spider) AddPreRequestMiddleware

func (s *Spider) AddPreRequestMiddleware(funcs ...PreRequestFunc)

AddPreRequestMiddleware takes a variadic number of PreRequestFunc arguments and makes sure each of the added functions is called on the http.Request object before a request is made.

func (*Spider) Start

func (s *Spider) Start() (err error)

Start begins the job with the settings defined in the spider structure's configuration.

Directories

Path	Synopsis
examples
    simple (command)	This example starts with a few seed urls, and includes a parse function that always returns the same URL.
    spider (command)	This example loads five pages from XKCD, one page at a time.
plugins	Package plugins is a set of default pieces of functionality to use with the stevie-holdway/goatscrape package.
