spider

v0.1.0
Published: Oct 11, 2015 License: MIT Imports: 11 Imported by: 2

README

Spider

This package provides a simple yet extensible way to scrape HTML and JSON pages. Spiders are scheduled at configurable intervals to fetch data from the web. It is written in Go and is MIT licensed.

Installation

$ go get -u github.com/celrenheit/spider

Documentation

The documentation is hosted on GoDoc.

Usage

To create your own spiders, implement the spider.Spider interface. It has two methods, Setup and Spin.

Setup receives a Context and returns a new Context, along with an error if something went wrong. This is usually where you create the HTTP client and the HTTP request.

Spin receives a Context, does its work (performing the request, handling the response, parsing HTML or JSON, and so on), and returns an error if something went wrong.

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/celrenheit/spider"
	"github.com/celrenheit/spider/schedulers"
	"github.com/celrenheit/spider/spiderutils"
)

func main() {
	wikiSpider := &WikipediaHTMLSpider{"Albert Einstein"}

	// Create a new scheduler
	scheduler := schedulers.NewBasicScheduler()

	// Register the spider to be scheduled every 45 seconds
	scheduler.Handle(wikiSpider).Every(45 * time.Second)

	// Start the scheduler
	log.Fatal(scheduler.Start())
}

type WikipediaHTMLSpider struct {
	Title string
}

func (w *WikipediaHTMLSpider) Setup(ctx *spider.Context) (*spider.Context, error) {
	// Define the url of the wikipedia page
	url := fmt.Sprintf("https://en.wikipedia.org/wiki/%s", w.Title)
	// Create a context with an http.Client and http.Request
	return spiderutils.NewHTTPContext("GET", url, nil)
}

func (w *WikipediaHTMLSpider) Spin(ctx *spider.Context) error {
	// Execute the request
	if _, err := ctx.DoRequest(); err != nil {
		return err
	}

	// Get goquery's html parser
	htmlparser, err := ctx.HTMLParser()
	if err != nil {
		return err
	}
	// Get the first paragraph of the wikipedia page
	summary := htmlparser.Find("#mw-content-text p").First().Text()

	fmt.Println(summary)
	return nil
}

Examples

$ cd $GOPATH/src/github.com/celrenheit/spider/examples
$ go run wiki.go

Contributing

Contributions are welcome! Feel free to submit a pull request. Improving the documentation and examples is a good place to start. You can also contribute spiders and better schedulers.

If you have developed your own spiders or schedulers, I will be pleased to review your code and possibly merge it into the project.

License

MIT License

Documentation

Overview

Installation:

go get -u github.com/celrenheit/spider

Usage of this package revolves around spiders and the contexts passed to them.

ctx, err := spider.Setup(nil)
err = spider.Spin(ctx)

If you have many spiders, you can use a scheduler. This package provides a basic scheduler.

scheduler := schedulers.NewBasicScheduler()

scheduler.Handle(spider1).Every(20 * time.Second)

scheduler.Handle(spider2).Every(10 * time.Second).Duplicate(3).After(500*time.Millisecond)

scheduler.Start()

This launches two spiders: the first runs every 20 seconds and the second every 10 seconds. The second is also duplicated 3 times, in three separate goroutines, with a delay of 500 milliseconds between them.
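
To make the timing concrete, here is a minimal standard-library sketch of the staggering described above. It does not use this package: startOffsets is a hypothetical helper, and it assumes that the first duplicate starts immediately and that each later one starts one delay after the previous one. The scheduler's actual behavior may differ.

```go
package main

import (
	"fmt"
	"time"
)

// startOffsets is a hypothetical helper (not part of the spider package).
// It returns when each of n duplicated goroutines would start, assuming
// duplicate i starts i*delay after the first one.
func startOffsets(n int, delay time.Duration) []time.Duration {
	offsets := make([]time.Duration, n)
	for i := range offsets {
		offsets[i] = time.Duration(i) * delay
	}
	return offsets
}

func main() {
	// Mirrors Duplicate(3).After(500 * time.Millisecond) from the text.
	for i, off := range startOffsets(3, 500*time.Millisecond) {
		fmt.Printf("duplicate %d starts after %v\n", i+1, off)
	}
}
```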

You can create your own spider by implementing the Spider interface:

package main

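import (
	"fmt"
	"log"

	"github.com/celrenheit/spider"
	"github.com/celrenheit/spider/spiderutils"
)

func main() {
	wikiSpider := &WikipediaHTMLSpider{
		Title: "Albert Einstein",
	}
	// Setup builds the context; stop if it fails.
	ctx, err := wikiSpider.Setup(nil)
	if err != nil {
		log.Fatal(err)
	}
	// Spin performs the request and prints the summary.
	if err := wikiSpider.Spin(ctx); err != nil {
		log.Fatal(err)
	}
}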

type WikipediaHTMLSpider struct {
	Title string
}

func (w *WikipediaHTMLSpider) Setup(ctx *spider.Context) (*spider.Context, error) {
	url := fmt.Sprintf("https://en.wikipedia.org/wiki/%s", w.Title)
	return spiderutils.NewHTTPContext("GET", url, nil)
}

func (w *WikipediaHTMLSpider) Spin(ctx *spider.Context) error {
	if _, err := ctx.DoRequest(); err != nil {
		return err
	}

	// Get goquery's HTML parser and propagate any parsing error
	html, err := ctx.HTMLParser()
	if err != nil {
		return err
	}
	summary := html.Find("#mw-content-text p").First().Text()

	fmt.Println(summary)
	return nil
}

Constants

This section is empty.

Variables

var (
	ErrNoClient  = errors.New("No request has been set")
	ErrNoRequest = errors.New("No request has been set")
)

Functions

func NewKVStore

func NewKVStore() *store

NewKVStore returns a new store.

Types

type BackoffCondition

type BackoffCondition func(*http.Response) error

func ErrorIfStatusCodeIsNot

func ErrorIfStatusCodeIsNot(status int) BackoffCondition
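
The documentation does not describe ErrorIfStatusCodeIsNot further. As a rough illustration of what a condition with the BackoffCondition signature could look like, here is a standard-library sketch. The names backoffCondition, errorIfStatusIsNot, and check are hypothetical stand-ins declared locally; the package's real implementation is not shown here and may differ.

```go
package main

import (
	"fmt"
	"net/http"
)

// backoffCondition mirrors the documented BackoffCondition signature.
// It is redeclared locally so this sketch compiles without the package.
type backoffCondition func(*http.Response) error

// errorIfStatusIsNot is a hypothetical stand-in for ErrorIfStatusCodeIsNot.
// It returns a condition that rejects any response whose status code
// differs from the expected one.
func errorIfStatusIsNot(status int) backoffCondition {
	return func(res *http.Response) error {
		if res.StatusCode != status {
			return fmt.Errorf("unexpected status code %d, want %d", res.StatusCode, status)
		}
		return nil
	}
}

// check applies a condition to a response that carries only a status code.
func check(cond backoffCondition, status int) error {
	return cond(&http.Response{StatusCode: status})
}

func main() {
	cond := errorIfStatusIsNot(http.StatusOK)
	fmt.Println(check(cond, http.StatusOK))
	fmt.Println(check(cond, http.StatusServiceUnavailable))
}
```

A non-nil error from the condition is what would signal that the request should be retried.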

type BaseSpiderScheduler

type BaseSpiderScheduler interface {
	NextSpin() (time.Duration, bool)
	NextSpinChan() (<-chan struct{}, <-chan struct{})
}

BaseSpiderScheduler is an interface that represents the core methods needed for a spider scheduler.

type Context

type Context struct {
	Client *http.Client

	Parent   *Context
	Children []*Context
	// contains filtered or unexported fields
}

Context is the element that can be shared across different spiders. It contains an HTTP Client and an HTTP Request. Context can execute an HTTP Request.

func NewContext

func NewContext() *Context

NewContext returns a new Context.

func (*Context) Cookies

func (c *Context) Cookies() []*http.Cookie

Cookies returns a list of cookies for the given request URL.

func (*Context) DoRequest

func (c *Context) DoRequest() (*http.Response, error)

DoRequest makes an http request using the http.Client and http.Request associated with this context.

This will store the response in this context. To access the response you should do:

ctx.Response() // to get the http.Response

func (*Context) DoRequestWithExponentialBackOff

func (c *Context) DoRequestWithExponentialBackOff(condition BackoffCondition, b backoff.BackOff) (*http.Response, error)

DoRequestWithExponentialBackOff makes an http request using the http.Client and http.Request associated with this context. You can pass a condition and a BackOff configuration. See https://github.com/cenkalti/backoff to learn more about backoff. If no BackOff is provided, it uses the default exponential BackOff configuration. See also the ErrorIfStatusCodeIsNot function, which provides a basic condition based on the status code.
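
For readers unfamiliar with the technique, the following is a minimal standard-library sketch of retrying with exponentially growing waits. It is not the package's implementation and does not use cenkalti/backoff: retryWithBackoff, requireOK, and flakyRequest are hypothetical helpers, the request is simulated rather than sent over the network, and the wait simply doubles after each rejected attempt.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// retryWithBackoff is a hypothetical helper. It calls do until cond accepts
// the response or the attempts run out, doubling the wait between attempts.
func retryWithBackoff(do func() (*http.Response, error), cond func(*http.Response) error, attempts int, wait time.Duration) (*http.Response, error) {
	lastErr := errors.New("no attempts made")
	for i := 0; i < attempts; i++ {
		res, err := do()
		if err == nil {
			if err = cond(res); err == nil {
				return res, nil
			}
			res.Body.Close() // discard the rejected response
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return nil, lastErr
}

// requireOK rejects every response whose status is not 200.
func requireOK(res *http.Response) error {
	if res.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status code %d", res.StatusCode)
	}
	return nil
}

// flakyRequest simulates a request that yields 503 for the first `failures`
// calls and 200 afterwards. It also returns a pointer to the call counter.
func flakyRequest(failures int) (func() (*http.Response, error), *int) {
	calls := 0
	do := func() (*http.Response, error) {
		calls++
		status := http.StatusOK
		if calls <= failures {
			status = http.StatusServiceUnavailable
		}
		return &http.Response{StatusCode: status, Body: http.NoBody}, nil
	}
	return do, &calls
}

func main() {
	do, calls := flakyRequest(2)
	res, err := retryWithBackoff(do, requireOK, 5, 10*time.Millisecond)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println(res.StatusCode, "after", *calls, "calls")
}
```

Production backoff libraries usually add jitter and a maximum elapsed time, which this sketch leaves out for brevity.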

func (*Context) ExtendWithRequest

func (c *Context) ExtendWithRequest(ctx Context, r *http.Request) *Context

ExtendWithRequest returns a new Context that is a child of the provided context and is associated with the provided http.Request.

func (*Context) Get

func (c *Context) Get(key string) interface{}

Get returns a value from this context.

func (*Context) HTMLParser

func (c *Context) HTMLParser() (*goquery.Document, error)

HTMLParser returns an HTML parser.

It uses PuerkitoBio's awesome goquery package, which can be found at this URL: https://github.com/PuerkitoBio/goquery.

func (*Context) JSONParser

func (c *Context) JSONParser() (*simplejson.Json, error)

JSONParser returns a JSON parser.

It uses Bitly's go-simplejson package, which can be found at: https://github.com/bitly/go-simplejson

func (*Context) NewClient

func (c *Context) NewClient() (*http.Client, error)

NewClient creates a new http.Client.

func (*Context) NewCookieJar

func (c *Context) NewCookieJar() (*cookiejar.Jar, error)

NewCookieJar creates a new *cookiejar.Jar.

func (*Context) RAWContent

func (c *Context) RAWContent() ([]byte, error)

RAWContent returns the raw data of the response's body.

func (*Context) Request

func (c *Context) Request() *http.Request

Request returns an http.Request

func (*Context) ResetClient

func (c *Context) ResetClient() (*http.Client, error)

ResetClient creates a new http.Client and replaces the existing one if there is one.

func (*Context) ResetCookies

func (c *Context) ResetCookies() error

ResetCookies creates a new cookie jar.

Note: All previously stored cookies will be deleted.

func (*Context) Response

func (c *Context) Response() *http.Response

Response returns an http.Response

func (*Context) Set

func (c *Context) Set(key string, value interface{})

Set stores a value in this context.

func (*Context) SetParent

func (c *Context) SetParent(parent *Context)

SetParent sets a parent context for the current context. It also adds the current context to the parent context's list of children.
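
The two-way linking described above can be sketched with a small stand-in type. This is only an illustration based on the exported Parent and Children fields: node and setParent are hypothetical names, and the real Context carries additional unexported state.

```go
package main

import "fmt"

// node is a hypothetical stand-in for Context that keeps only the two
// exported linking fields shown in the type definition, plus a name.
type node struct {
	Name     string
	Parent   *node
	Children []*node
}

// setParent links n to parent in both directions: it records the parent
// on n and appends n to the parent's list of children.
func (n *node) setParent(parent *node) {
	n.Parent = parent
	parent.Children = append(parent.Children, n)
}

func main() {
	root := &node{Name: "login"}
	child := &node{Name: "profile"}
	child.setParent(root)
	fmt.Println(child.Parent.Name, len(root.Children))
}
```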

func (*Context) SetRequest

func (c *Context) SetRequest(req *http.Request)

SetRequest sets an http.Request

func (*Context) SetResponse

func (c *Context) SetResponse(res *http.Response)

SetResponse sets an http.Response

type EveryFunc

type EveryFunc func() time.Duration

type Scheduler

type Scheduler interface {
	Handle(Spider) SpiderScheduler
	Start() error
}

Scheduler is the interface that a scheduler must satisfy. To define your own scheduler, implement this interface.

type Spider

type Spider interface {
	Setup(*Context) (*Context, error)
	Spin(*Context) error
}

Spider is an interface with two methods. It is the primary element of the package.

type SpiderScheduler

type SpiderScheduler interface {
	// Definition
	Every(time.Duration) SpiderScheduler
	EveryFunc(EveryFunc) SpiderScheduler
	EveryRandom(time.Duration, time.Duration, time.Duration) SpiderScheduler
	From(time.Time) SpiderScheduler
	To(time.Time) SpiderScheduler
	After(time.Duration) SpiderScheduler
	Delay() time.Duration
	Duplicate(int64) SpiderScheduler
	NumGoroutine() int64

	// Base
	BaseSpiderScheduler
}

SpiderScheduler is an interface that allows you to specify a schedule for the current spider added to the scheduler.

type SpinnerFunc

type SpinnerFunc func(ctx *Context) error

func (SpinnerFunc) Spin

func (s SpinnerFunc) Spin(ctx *Context) error
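
SpinnerFunc has the shape of a function adapter, similar to http.HandlerFunc in the standard library: a plain function gains a Spin method and can then be used wherever that method is required. The sketch below illustrates the pattern with local stand-ins. The names fakeContext, spinner, spinnerFunc, and run are hypothetical, and it is an assumption that the real Spin method simply calls the underlying function.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeContext is a hypothetical stand-in for the package's Context type.
type fakeContext struct{}

// spinner is the part of the Spider interface that the adapter satisfies.
type spinner interface {
	Spin(*fakeContext) error
}

// spinnerFunc mirrors the documented SpinnerFunc signature.
type spinnerFunc func(ctx *fakeContext) error

// Spin calls the underlying function (assumed behavior of the real method).
func (s spinnerFunc) Spin(ctx *fakeContext) error { return s(ctx) }

// errSpinFailed is a sample error used to show error propagation.
var errSpinFailed = errors.New("spin failed")

// run accepts anything with a Spin method and invokes it once.
func run(s spinner) error {
	return s.Spin(&fakeContext{})
}

func main() {
	ok := spinnerFunc(func(ctx *fakeContext) error { return nil })
	bad := spinnerFunc(func(ctx *fakeContext) error { return errSpinFailed })
	fmt.Println(run(ok))
	fmt.Println(run(bad))
}
```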
