spider

package module

Version: v0.1.0 (latest) | Published: Oct 11, 2015 | License: MIT | Imports: 11 | Imported by: 2

Repository

github.com/celrenheit/spider

README ¶

Spider

This package provides a simple yet extensible way to scrape HTML and JSON pages. It uses spiders, scheduled at configurable intervals, to fetch data from the web. It is written in Go and is MIT licensed.

Installation

$ go get -u github.com/celrenheit/spider

Documentation

The documentation is hosted on GoDoc.

Usage

To create your own spiders, implement the spider.Spider interface. It has two methods, Setup and Spin.

Setup receives a Context and returns a new Context, along with an error if something went wrong. This is usually where you create the HTTP client and the HTTP request.

Spin receives a Context, does its work (performs the request, handles the response, parses HTML or JSON, etc.), and returns an error if anything went wrong.

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/celrenheit/spider"
	"github.com/celrenheit/spider/schedulers"
	"github.com/celrenheit/spider/spiderutils"
)

func main() {
	wikiSpider := &WikipediaHTMLSpider{"Albert Einstein"}

	// Create a new scheduler
	scheduler := schedulers.NewBasicScheduler()

	// Register the spider to be scheduled every 45 seconds
	scheduler.Handle(wikiSpider).Every(45 * time.Second)

	// Start the scheduler
	log.Fatal(scheduler.Start())
}

type WikipediaHTMLSpider struct {
	Title string
}

func (w *WikipediaHTMLSpider) Setup(ctx *spider.Context) (*spider.Context, error) {
	// Define the url of the wikipedia page
	url := fmt.Sprintf("https://en.wikipedia.org/wiki/%s", w.Title)
	// Create a context with an http.Client and http.Request
	return spiderutils.NewHTTPContext("GET", url, nil)
}

func (w *WikipediaHTMLSpider) Spin(ctx *spider.Context) error {
	// Execute the request
	if _, err := ctx.DoRequest(); err != nil {
		return err
	}

	// Get goquery's html parser
	htmlparser, err := ctx.HTMLParser()
	if err != nil {
		return err
	}
	// Get the first paragraph of the wikipedia page
	summary := htmlparser.Find("#mw-content-text p").First().Text()

	fmt.Println(summary)
	return nil
}

Examples

$ cd $GOPATH/src/github.com/celrenheit/spider/examples
$ go run wiki.go

Contributing

Contributions are welcome! Feel free to submit a pull request. Improving the documentation and examples is a good place to start. You can also contribute spiders and better schedulers.

If you have developed your own spiders or schedulers, I will be pleased to review your code and eventually merge it into the project.

License

MIT License

Documentation ¶

Overview ¶

Installation:

go get -u github.com/celrenheit/spider

This package revolves around spiders and the contexts passed to them.

ctx, err := spider.Setup(nil)
err = spider.Spin(ctx)

If you have many spiders, you can use a scheduler. This package provides a basic scheduler.

scheduler := schedulers.NewBasicScheduler()

scheduler.Handle(spider1).Every(20 * time.Second)

scheduler.Handle(spider2).Every(10 * time.Second).Duplicate(3).After(500*time.Millisecond)

scheduler.Start()

This launches two spiders: the first runs every 20 seconds and the second every 10 seconds. The second is also duplicated 3 times, in three separate goroutines, with a delay of 500 milliseconds between them.
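The staggering described above can be pictured with a small standard-library sketch. It does not use the package's scheduler: staggerOffsets is a hypothetical helper, and it assumes that duplicate i starts i times the delay after the first one, which is one reading of the paragraph above rather than confirmed behavior.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// staggerOffsets is a hypothetical helper (not part of the spider package).
// Assumption: duplicate i starts i*delay after the first duplicate.
func staggerOffsets(n int, delay time.Duration) []time.Duration {
	offsets := make([]time.Duration, n)
	for i := range offsets {
		offsets[i] = time.Duration(i) * delay
	}
	return offsets
}

func main() {
	var wg sync.WaitGroup
	// A short delay is used here so the demo finishes quickly.
	for i, off := range staggerOffsets(3, 50*time.Millisecond) {
		wg.Add(1)
		go func(id int, wait time.Duration) {
			defer wg.Done()
			time.Sleep(wait) // stagger the start of each duplicate
			fmt.Printf("duplicate %d started after %v\n", id, wait)
		}(i, off)
	}
	wg.Wait()
}
```

With Duplicate(3).After(500*time.Millisecond), the same assumption would give start offsets of 0s, 500ms and 1s.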

You can create your own spider by implementing the Spider interface:

package main

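import (
	"fmt"
	"log"

	"github.com/celrenheit/spider"
	"github.com/celrenheit/spider/spiderutils"
)

func main() {
	wikiSpider := &WikipediaHTMLSpider{
		Title: "Albert Einstein",
	}
	// Setup builds the context (HTTP client and request) used by Spin.
	ctx, err := wikiSpider.Setup(nil)
	if err != nil {
		log.Fatal(err)
	}
	// Spin performs the request and prints the summary.
	if err := wikiSpider.Spin(ctx); err != nil {
		log.Fatal(err)
	}
}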

type WikipediaHTMLSpider struct {
	Title string
}

func (w *WikipediaHTMLSpider) Setup(ctx *spider.Context) (*spider.Context, error) {
	url := fmt.Sprintf("https://en.wikipedia.org/wiki/%s", w.Title)
	return spiderutils.NewHTTPContext("GET", url, nil)
}

func (w *WikipediaHTMLSpider) Spin(ctx *spider.Context) error {
	if _, err := ctx.DoRequest(); err != nil {
		return err
	}

	// Get goquery's HTML parser and propagate any parsing error
	html, err := ctx.HTMLParser()
	if err != nil {
		return err
	}
	summary := html.Find("#mw-content-text p").First().Text()

	fmt.Println(summary)
	return nil
}

Index ¶

Variables
func NewKVStore() *store
type BackoffCondition
- func ErrorIfStatusCodeIsNot(status int) BackoffCondition
type BaseSpiderScheduler
type Context
- func NewContext() *Context
type EveryFunc
type Scheduler
type Spider
type SpiderScheduler
type SpinnerFunc
- func (s SpinnerFunc) Spin(ctx *Context) error

Constants ¶

This section is empty.

Variables ¶

var (
	ErrNoClient  = errors.New("No request has been set")
	ErrNoRequest = errors.New("No request has been set")
)

Functions ¶

func NewKVStore ¶

func NewKVStore() *store

NewKVStore returns a new store.

Types ¶

type BackoffCondition ¶

type BackoffCondition func(*http.Response) error

func ErrorIfStatusCodeIsNot ¶

func ErrorIfStatusCodeIsNot(status int) BackoffCondition

type BaseSpiderScheduler ¶

type BaseSpiderScheduler interface {
	NextSpin() (time.Duration, bool)
	NextSpinChan() (<-chan struct{}, <-chan struct{})
}

BaseSpiderScheduler is an interface that represents the core methods needed for a spider scheduler.

type Context ¶

type Context struct {
	Client *http.Client

	Parent   *Context
	Children []*Context
	// contains filtered or unexported fields
}

Context is the element that can be shared across different spiders. It contains an HTTP Client and an HTTP Request. Context can execute an HTTP Request.

func NewContext ¶

func NewContext() *Context

NewContext returns a new Context.

func (*Context) Cookies ¶

func (c *Context) Cookies() []*http.Cookie

Cookies returns the list of cookies for the given request URL.

func (*Context) DoRequest ¶

func (c *Context) DoRequest() (*http.Response, error)

DoRequest makes an http request using the http.Client and http.Request associated with this context.

This stores the response in this context. To access the response, call:

ctx.Response() // to get the http.Response

func (*Context) DoRequestWithExponentialBackOff ¶

func (c *Context) DoRequestWithExponentialBackOff(condition BackoffCondition, b backoff.BackOff) (*http.Response, error)

DoRequestWithExponentialBackOff makes an http request using the http.Client and http.Request associated with this context. You can pass a condition and a BackOff configuration. See https://github.com/cenkalti/backoff to learn more about backoff. If no BackOff is provided, it uses the default exponential BackOff configuration. See also the ErrorIfStatusCodeIsNot function, which provides a basic condition based on the status code.
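Because BackoffCondition is declared as func(*http.Response) error, a custom condition is just a function of that shape. The sketch below uses only the standard library: it redeclares the type locally so that it compiles on its own, and errorIfServerError and check are hypothetical names that are not part of the package. It also assumes that a non-nil error from the condition is what causes a retry.

```go
package main

import (
	"fmt"
	"net/http"
)

// BackoffCondition mirrors the type declared by the package.
// It is redeclared here only so the sketch is self-contained.
type BackoffCondition func(*http.Response) error

// errorIfServerError is a hypothetical condition: it reports an error
// for 5xx responses and accepts everything else.
func errorIfServerError() BackoffCondition {
	return func(res *http.Response) error {
		if res.StatusCode >= 500 {
			return fmt.Errorf("server error: status %d", res.StatusCode)
		}
		return nil
	}
}

// check applies the condition to a bare response with the given status.
func check(status int) error {
	return errorIfServerError()(&http.Response{StatusCode: status})
}

func main() {
	fmt.Println(check(200)) // <nil>
	fmt.Println(check(503)) // server error: status 503
}
```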

func (*Context) ExtendWithRequest ¶

func (c *Context) ExtendWithRequest(ctx Context, r *http.Request) *Context

ExtendWithRequest returns a new Context that is a child of the provided context and is associated with the provided http.Request.

func (*Context) Get ¶

func (c *Context) Get(key string) interface{}

Get returns a value from this context.

func (*Context) HTMLParser ¶

func (c *Context) HTMLParser() (*goquery.Document, error)

HTMLParser returns an HTML parser.

It uses PuerkitoBio's awesome goquery package, which can be found at this URL: https://github.com/PuerkitoBio/goquery.

func (*Context) JSONParser ¶

func (c *Context) JSONParser() (*simplejson.Json, error)

JSONParser returns a JSON parser.

It uses Bitly's go-simplejson package, which can be found at: https://github.com/bitly/go-simplejson

func (*Context) NewClient ¶

func (c *Context) NewClient() (*http.Client, error)

NewClient creates a new http.Client.

func (*Context) NewCookieJar ¶

func (c *Context) NewCookieJar() (*cookiejar.Jar, error)

NewCookieJar creates a new *cookiejar.Jar.

func (*Context) RAWContent ¶

func (c *Context) RAWContent() ([]byte, error)

RAWContent returns the raw data of the response's body.

func (*Context) Request ¶

func (c *Context) Request() *http.Request

Request returns the http.Request associated with this context.

func (*Context) ResetClient ¶

func (c *Context) ResetClient() (*http.Client, error)

ResetClient creates a new http.Client and replaces the existing one, if any.

func (*Context) ResetCookies ¶

func (c *Context) ResetCookies() error

ResetCookies creates a new cookie jar.

Note: All previously stored cookies will be deleted.

func (*Context) Response ¶

func (c *Context) Response() *http.Response

Response returns an http.Response

func (*Context) Set ¶

func (c *Context) Set(key string, value interface{})

Set stores a value in this context.

func (*Context) SetParent ¶

func (c *Context) SetParent(parent *Context)

SetParent sets a parent context for the current context. It also adds the current context to the parent's list of children.
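The two-way link that SetParent describes can be sketched with the standard library alone. Here node is a hypothetical stand-in for Context that keeps only the exported Parent and Children fields, and setParent and depth are illustrative helpers, not the package's implementation.

```go
package main

import "fmt"

// node is a hypothetical stand-in for Context, keeping only the
// Parent and Children fields plus a name for printing.
type node struct {
	Name     string
	Parent   *node
	Children []*node
}

// setParent links n to parent in both directions: it records the parent
// on n and appends n to the parent's children.
func (n *node) setParent(parent *node) {
	n.Parent = parent
	parent.Children = append(parent.Children, n)
}

// depth counts how many ancestors n has.
func depth(n *node) int {
	d := 0
	for p := n.Parent; p != nil; p = p.Parent {
		d++
	}
	return d
}

func main() {
	root := &node{Name: "login"}
	child := &node{Name: "profile"}
	child.setParent(root)
	fmt.Println(child.Parent.Name, len(root.Children), depth(child)) // login 1 1
}
```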

func (*Context) SetRequest ¶

func (c *Context) SetRequest(req *http.Request)

SetRequest sets the http.Request.

func (*Context) SetResponse ¶

func (c *Context) SetResponse(res *http.Response)

SetResponse sets the http.Response.

type EveryFunc ¶

type EveryFunc func() time.Duration
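As an illustration of the EveryFunc shape, the sketch below builds a function that returns a growing interval on each call. It redeclares the type locally so that it compiles on its own, and doubling is a hypothetical constructor, not part of the package. It assumes that a scheduler would call the function each time it needs the next interval, which the documentation does not state.

```go
package main

import (
	"fmt"
	"time"
)

// EveryFunc mirrors the type declared by the package.
// It is redeclared here only so the sketch is self-contained.
type EveryFunc func() time.Duration

// doubling is a hypothetical EveryFunc constructor: each call returns
// the current interval, then doubles it for the next call, capped at max.
func doubling(start, max time.Duration) EveryFunc {
	next := start
	return func() time.Duration {
		cur := next
		next *= 2
		if next > max {
			next = max
		}
		return cur
	}
}

func main() {
	every := doubling(1*time.Second, 5*time.Second)
	for i := 0; i < 4; i++ {
		fmt.Println(every()) // 1s, 2s, 4s, 5s
	}
}
```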

type Scheduler ¶

type Scheduler interface {
	Handle(Spider) SpiderScheduler
	Start() error
}

Scheduler is the interface that schedulers implement. To define your own scheduler, implement this interface.

type Spider ¶

type Spider interface {
	Setup(*Context) (*Context, error)
	Spin(*Context) error
}

Spider is an interface with two methods. It is the primary element of the package.

type SpiderScheduler ¶

type SpiderScheduler interface {
	// Definition
	Every(time.Duration) SpiderScheduler
	EveryFunc(EveryFunc) SpiderScheduler
	EveryRandom(time.Duration, time.Duration, time.Duration) SpiderScheduler
	From(time.Time) SpiderScheduler
	To(time.Time) SpiderScheduler
	After(time.Duration) SpiderScheduler
	Delay() time.Duration
	Duplicate(int64) SpiderScheduler
	NumGoroutine() int64

	// Base
	BaseSpiderScheduler
}

SpiderScheduler is an interface that lets you specify a schedule for a spider added to the scheduler.

type SpinnerFunc ¶

type SpinnerFunc func(ctx *Context) error

func (SpinnerFunc) Spin ¶

func (s SpinnerFunc) Spin(ctx *Context) error
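SpinnerFunc follows the function-adapter pattern familiar from http.HandlerFunc: a plain function gains a Spin method. The sketch below is self-contained and assumes that Spin simply calls the underlying function, since the method body is not shown here. Context is an empty stand-in for the package's type, and spinner and run are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Context is a hypothetical, empty stand-in for the package's Context.
type Context struct{}

// spinner is a hypothetical one-method interface used for illustration.
type spinner interface {
	Spin(ctx *Context) error
}

// SpinnerFunc mirrors the declaration above.
type SpinnerFunc func(ctx *Context) error

// Spin calls the function itself (assumed behavior).
func (s SpinnerFunc) Spin(ctx *Context) error { return s(ctx) }

// run accepts anything that has a Spin method.
func run(s spinner) error { return s.Spin(&Context{}) }

func main() {
	ok := SpinnerFunc(func(ctx *Context) error { return nil })
	bad := SpinnerFunc(func(ctx *Context) error { return errors.New("spin failed") })
	fmt.Println(run(ok))  // <nil>
	fmt.Println(run(bad)) // spin failed
}
```

Note that SpinnerFunc provides only Spin, so on its own it does not satisfy the two-method Spider interface.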

Directories ¶

Path	Synopsis
examples
schedulers
spiders
spiderutils
