Documentation ¶
Overview ¶
Installation:
go get -u github.com/celrenheit/spider
Using this package revolves around spiders and the contexts passed to them.
ctx, err := spider.Setup(nil)
err = spider.Spin(ctx)
If you have many spiders, you can make use of a scheduler. This package provides a basic scheduler.
scheduler := schedulers.NewBasicScheduler()
scheduler.Handle(spider1).Every(20 * time.Second)
scheduler.Handle(spider2).Every(10 * time.Second).Duplicate(3).After(500*time.Millisecond)
scheduler.Start()
This launches two spiders: the first runs every 20 seconds and the second every 10 seconds. The second is also duplicated 3 times, with each duplicate running in its own goroutine and a delay of 500 milliseconds between them.
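To make the timing of the duplicates concrete, here is a minimal standard-library sketch. It assumes that duplicate i starts i times the delay after the first one; the scheduler's actual internals are not shown in this document, and startOffsets is a hypothetical helper, not part of the spider package.

```go
package main

import (
	"fmt"
	"time"
)

// startOffsets is a hypothetical helper (not part of the spider package).
// It returns the assumed start offset of each duplicate: duplicate i
// starts i*delay after the first one.
func startOffsets(duplicates int, delay time.Duration) []time.Duration {
	if duplicates <= 0 {
		return nil
	}
	offsets := make([]time.Duration, 0, duplicates)
	for i := 0; i < duplicates; i++ {
		offsets = append(offsets, time.Duration(i)*delay)
	}
	return offsets
}

func main() {
	// Mirrors Duplicate(3).After(500*time.Millisecond) from the example above.
	for i, off := range startOffsets(3, 500*time.Millisecond) {
		fmt.Printf("duplicate %d starts after %v\n", i+1, off)
	}
}
```

Under this assumption, the three duplicates of the second spider would not all hit the target at the same instant, which is presumably the point of combining Duplicate with After.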
You can create your own spider by implementing the Spider interface:
package main

import (
	"fmt"

	"github.com/celrenheit/spider"
	"github.com/celrenheit/spider/spiderutils"
)

func main() {
	wikiSpider := &WikipediaHTMLSpider{
		Title: "Albert Einstein",
	}
	// Setup builds the context; Spin performs the actual work.
	ctx, err := wikiSpider.Setup(nil)
	if err != nil {
		fmt.Println(err)
		return
	}
	if err := wikiSpider.Spin(ctx); err != nil {
		fmt.Println(err)
	}
}

type WikipediaHTMLSpider struct {
	Title string
}

// Setup creates an HTTP context for the article's URL.
func (w *WikipediaHTMLSpider) Setup(ctx *spider.Context) (*spider.Context, error) {
	url := fmt.Sprintf("https://en.wikipedia.org/wiki/%s", w.Title)
	return spiderutils.NewHTTPContext("GET", url, nil)
}

// Spin fetches the page and prints the first paragraph of the article.
func (w *WikipediaHTMLSpider) Spin(ctx *spider.Context) error {
	if _, err := ctx.DoRequest(); err != nil {
		return err
	}
	html, err := ctx.HTMLParser()
	if err != nil {
		return err
	}
	summary := html.Find("#mw-content-text p").First().Text()
	fmt.Println(summary)
	return nil
}
Index ¶
- Variables
- func NewKVStore() *store
- type BackoffCondition
- type BaseSpiderScheduler
- type Context
- func (c *Context) Cookies() []*http.Cookie
- func (c *Context) DoRequest() (*http.Response, error)
- func (c *Context) DoRequestWithExponentialBackOff(condition BackoffCondition, b backoff.BackOff) (*http.Response, error)
- func (c *Context) ExtendWithRequest(ctx Context, r *http.Request) *Context
- func (c *Context) Get(key string) interface{}
- func (c *Context) HTMLParser() (*goquery.Document, error)
- func (c *Context) JSONParser() (*simplejson.Json, error)
- func (c *Context) NewClient() (*http.Client, error)
- func (c *Context) NewCookieJar() (*cookiejar.Jar, error)
- func (c *Context) RAWContent() ([]byte, error)
- func (c *Context) Request() *http.Request
- func (c *Context) ResetClient() (*http.Client, error)
- func (c *Context) ResetCookies() error
- func (c *Context) Response() *http.Response
- func (c *Context) Set(key string, value interface{})
- func (c *Context) SetParent(parent *Context)
- func (c *Context) SetRequest(req *http.Request)
- func (c *Context) SetResponse(res *http.Response)
- type EveryFunc
- type Scheduler
- type Spider
- type SpiderScheduler
- type SpinnerFunc
Constants ¶
This section is empty.
Variables ¶
var (
	ErrNoClient  = errors.New("No request has been set")
	ErrNoRequest = errors.New("No request has been set")
)
Functions ¶
Types ¶
type BackoffCondition ¶
func ErrorIfStatusCodeIsNot ¶
func ErrorIfStatusCodeIsNot(status int) BackoffCondition
type BaseSpiderScheduler ¶
type BaseSpiderScheduler interface {
	NextSpin() (time.Duration, bool)
	NextSpinChan() (<-chan struct{}, <-chan struct{})
}
BaseSpiderScheduler is an interface that represents the core methods needed for a spider scheduler.
type Context ¶
type Context struct {
	Client   *http.Client
	Parent   *Context
	Children []*Context
	// contains filtered or unexported fields
}
Context is the element that can be shared across different spiders. It contains an HTTP Client and an HTTP Request. Context can execute an HTTP Request.
func (*Context) DoRequest ¶
DoRequest makes an http request using the http.Client and http.Request associated with this context.
The response is stored in this context. To access it, call:
ctx.Response() // to get the http.Response
func (*Context) DoRequestWithExponentialBackOff ¶
func (c *Context) DoRequestWithExponentialBackOff(condition BackoffCondition, b backoff.BackOff) (*http.Response, error)
DoRequestWithExponentialBackOff makes an http request using the http.Client and http.Request associated with this context. You can pass a condition and a BackOff configuration. See https://github.com/cenkalti/backoff to learn more about backoff. If no BackOff is provided, the default exponential BackOff configuration is used. See also the ErrorIfStatusCodeIsNot function, which provides a basic condition based on the status code.
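The retry-on-condition idea can be sketched with the standard library alone. This is not the package's implementation and does not use cenkalti/backoff: the condition signature, errorIfStatusIsNot, retry, and flaky are all hypothetical stand-ins, and the definition of BackoffCondition is assumed because it is not shown in this document.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// condition mirrors the idea of BackoffCondition: it returns an error when
// the response should be retried. The real type's definition is not shown
// in this document, so this signature is an assumption.
type condition func(*http.Response) error

// errorIfStatusIsNot is a stand-in for ErrorIfStatusCodeIsNot.
func errorIfStatusIsNot(status int) condition {
	return func(res *http.Response) error {
		if res.StatusCode != status {
			return fmt.Errorf("unexpected status code %d", res.StatusCode)
		}
		return nil
	}
}

// retry calls do until cond accepts the response or maxTries is reached,
// doubling the wait between attempts.
func retry(do func() (*http.Response, error), cond condition, initial time.Duration, maxTries int) (*http.Response, error) {
	if maxTries < 1 {
		return nil, errors.New("maxTries must be at least 1")
	}
	wait := initial
	var lastErr error
	for attempt := 1; attempt <= maxTries; attempt++ {
		res, err := do()
		if err == nil {
			err = cond(res)
		}
		if err == nil {
			return res, nil
		}
		lastErr = err
		if attempt < maxTries {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return nil, lastErr
}

// flaky simulates a request that answers 503 for the first `failures`
// calls and 200 afterwards, so the sketch needs no network access.
func flaky(failures int) func() (*http.Response, error) {
	calls := 0
	return func() (*http.Response, error) {
		calls++
		if calls <= failures {
			return &http.Response{StatusCode: http.StatusServiceUnavailable}, nil
		}
		return &http.Response{StatusCode: http.StatusOK}, nil
	}
}

func main() {
	res, err := retry(flaky(2), errorIfStatusIsNot(http.StatusOK), 10*time.Millisecond, 5)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("status:", res.StatusCode)
}
```

A production backoff policy would normally also add jitter and a maximum elapsed time, which is what a BackOff configuration from the linked library provides.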
func (*Context) ExtendWithRequest ¶
ExtendWithRequest returns a new Context that is a child of the provided context and is associated with the provided http.Request.
func (*Context) HTMLParser ¶
HTMLParser returns an HTML parser.
It uses PuerkitoBio's awesome goquery package, which can be found at this URL: https://github.com/PuerkitoBio/goquery.
func (*Context) JSONParser ¶
func (c *Context) JSONParser() (*simplejson.Json, error)
JSONParser returns a JSON parser.
It uses Bitly's go-simplejson package, which can be found at: https://github.com/bitly/go-simplejson
func (*Context) NewCookieJar ¶
NewCookieJar creates a new *cookiejar.Jar
func (*Context) RAWContent ¶
RAWContent returns the raw data of the response's body
func (*Context) ResetClient ¶
ResetClient creates a new http.Client and replaces the existing one if there is one.
func (*Context) ResetCookies ¶
ResetCookies creates a new cookie jar.
Note: All previously stored cookies will be deleted.
func (*Context) SetParent ¶
SetParent sets a parent context for the current context. It also adds the current context to the parent's list of children.
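The two-way link described here can be illustrated with a small stand-in type. This is only a sketch: node and setParent are hypothetical and keep just the Parent and Children fields of Context. Whether the real method guards against a nil parent or detaches the context from a previous parent is not stated in this document.

```go
package main

import "fmt"

// node is a hypothetical stand-in for Context, keeping only the two
// exported fields relevant to the parent/child relationship.
type node struct {
	Name     string
	Parent   *node
	Children []*node
}

// setParent links n to parent in both directions: it records the parent
// on n and appends n to the parent's children.
func (n *node) setParent(parent *node) {
	n.Parent = parent
	if parent == nil {
		// Nil guard added for the sketch; the real behavior is unknown.
		return
	}
	parent.Children = append(parent.Children, n)
}

func main() {
	root := &node{Name: "root"}
	child := &node{Name: "child"}
	child.setParent(root)
	fmt.Println(child.Parent.Name, len(root.Children))
}
```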
func (*Context) SetRequest ¶
SetRequest sets an http.Request
func (*Context) SetResponse ¶
SetResponse sets an http.Response
type Scheduler ¶
type Scheduler interface {
	Handle(Spider) SpiderScheduler
	Start() error
}
Scheduler is the interface that a scheduler must satisfy. To define your own scheduler, implement this interface.
type SpiderScheduler ¶
type SpiderScheduler interface {
	// Definition
	Every(time.Duration) SpiderScheduler
	EveryFunc(EveryFunc) SpiderScheduler
	EveryRandom(time.Duration, time.Duration, time.Duration) SpiderScheduler
	From(time.Time) SpiderScheduler
	To(time.Time) SpiderScheduler
	After(time.Duration) SpiderScheduler
	Delay() time.Duration
	Duplicate(int64) SpiderScheduler
	NumGoroutine() int64
	// Base
	BaseSpiderScheduler
}
SpiderScheduler is an interface that lets you specify a schedule for the spider that was added to the scheduler.
type SpinnerFunc ¶
func (SpinnerFunc) Spin ¶
func (s SpinnerFunc) Spin(ctx *Context) error
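A Spin method on a function type suggests the familiar adapter pattern (as with http.HandlerFunc), where a plain function is used wherever a value with a Spin method is expected. The sketch below shows that pattern with local stand-ins. It assumes SpinnerFunc is defined as a function taking a *Context and returning an error, which this document does not show; fakeContext, spinner, spinnerFunc, and run are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeContext is a placeholder for spider.Context.
type fakeContext struct{ URL string }

// spinner is the part of the Spider interface this sketch needs.
type spinner interface {
	Spin(ctx *fakeContext) error
}

// spinnerFunc adapts a plain function to the spinner interface.
// Assumption: the real SpinnerFunc has an equivalent underlying type.
type spinnerFunc func(ctx *fakeContext) error

// Spin calls the underlying function.
func (s spinnerFunc) Spin(ctx *fakeContext) error { return s(ctx) }

// run accepts anything that can Spin, including a wrapped function.
func run(s spinner, ctx *fakeContext) error { return s.Spin(ctx) }

func main() {
	printURL := spinnerFunc(func(ctx *fakeContext) error {
		if ctx == nil {
			return errors.New("nil context")
		}
		fmt.Println("spinning on", ctx.URL)
		return nil
	})
	if err := run(printURL, &fakeContext{URL: "https://example.com"}); err != nil {
		fmt.Println("spin failed:", err)
	}
}
```

The benefit of this design is that a one-off spin step does not need its own struct type.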