scrape

package
v0.0.0-...-7d74a43
Published: Aug 6, 2018 License: BSD-3-Clause Imports: 27 Imported by: 0

Documentation

Overview

Package scrape of the Dataflow kit extracts structured data from web pages. It covers the whole process, from processing the JSON payload to encoding the scraped data in one of the output formats: JSON, CSV, or XML.

Constants

This section is empty.

Variables

This section is empty.

Functions

func EncodeToFile

func EncodeToFile(e *encoder, ext string, payloadMD5 string, blockMap ...*map[int][]int) ([]byte, error)

EncodeToFile saves parsed data to the specified file.
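The signature takes a payloadMD5 string, and the Payload type notes that the MD5 of the payload is used to generate the name of the stored file. The following is a minimal sketch of that idea only. It assumes the name is the hex digest followed by the extension; the helper fileNameFor is hypothetical and not part of this package, and the real naming scheme may differ.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// fileNameFor is a hypothetical helper (not part of package scrape).
// It hashes the raw payload with MD5 and appends the output extension.
func fileNameFor(payload []byte, ext string) string {
	sum := md5.Sum(payload)
	return hex.EncodeToString(sum[:]) + "." + ext
}

func main() {
	// The same payload always maps to the same file name.
	fmt.Println(fileNameFor([]byte(`{"name":"example"}`), "json"))
}
```

Deriving the name from the payload hash means that identical payloads resolve to the same file.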

Types

type CSVEncoder

type CSVEncoder struct {
	// contains filtered or unexported fields
}

CSVEncoder transforms parsed data to CSV format.
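CSVEncoder's fields are unexported, so its internals are not visible here. As a rough illustration of what turning parsed blocks into CSV involves, the sketch below uses only the standard encoding/csv package. The function toCSV and its explicit column list are assumptions made for the example, not this package's API.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// toCSV is a hypothetical helper. It writes one header row followed by one
// row per block. The columns slice fixes the column order, because Go maps
// are unordered. Missing values become empty cells.
func toCSV(columns []string, blocks []map[string]interface{}) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write(columns); err != nil {
		return "", err
	}
	for _, block := range blocks {
		row := make([]string, len(columns))
		for i, col := range columns {
			if v, ok := block[col]; ok {
				row[i] = fmt.Sprint(v)
			}
		}
		if err := w.Write(row); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	out, err := toCSV(
		[]string{"title", "price"},
		[]map[string]interface{}{
			{"title": "First", "price": 10},
			{"title": "Second"}, // no price: empty cell
		},
	)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```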

type DividePageFunc

type DividePageFunc func(*goquery.Selection) []*goquery.Selection

The DividePageFunc type is used to extract a page's blocks during a scrape. For more information, see the documentation on the DividePage field of the Scraper type.

func DividePageByIntersection

func DividePageByIntersection(selectors []string) DividePageFunc

DividePageByIntersection returns a DividePageFunc that determines the common ancestor of the specified selectors.
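To make "common ancestor" concrete, here is a small sketch on a toy tree. It is not this package's implementation and does not use goquery: the node type and commonAncestor function are stand-ins invented for the example.

```go
package main

import "fmt"

// node is a stand-in for an element in a parsed HTML tree.
type node struct {
	name   string
	parent *node
}

// depth counts the steps from n up to the root.
func depth(n *node) int {
	d := 0
	for ; n.parent != nil; n = n.parent {
		d++
	}
	return d
}

// commonAncestor returns the deepest node that contains both a and b.
// It returns nil if the two nodes belong to different trees.
func commonAncestor(a, b *node) *node {
	da, db := depth(a), depth(b)
	for ; da > db; da-- { // lift the deeper node to the same level
		a = a.parent
	}
	for ; db > da; db-- {
		b = b.parent
	}
	for a != b { // climb together until the paths meet
		a, b = a.parent, b.parent
	}
	return a
}

func main() {
	body := &node{name: "body"}
	item := &node{name: "div.item", parent: body}
	title := &node{name: "h2.title", parent: item}
	price := &node{name: "span.price", parent: item}

	// The element matched by both selectors' nearest shared container
	// is what would be treated as one block.
	fmt.Println(commonAncestor(title, price).name)
}
```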

type Extractor

type Extractor struct {
	Types []string `json:"types"`
	// Params are unique for each type
	Params  map[string]interface{} `json:"params"`
	Filters []string               `json:"filters"`
}

Extractor represents the extractor types available for scraping. The following extractor types are currently supported: text, html, outerHtml, attr, link, image, regex, const, count. See docs/extractors.md for up-to-date information.

type Field

type Field struct {
	// Name is the name of the field. It is required and is used to aggregate results.
	Name string `json:"name"`
	// Selector is a CSS selector within the given block to process. Pass in "." to use the root block's selector.
	Selector string `json:"selector"`
	// Extractor contains the logic for extracting results from the selector provided to this Field.
	Extractor Extractor `json:"extractor"`
	// Details is an optional field used only with the Link extractor type. It guides the scraper to follow the links and parse the additional pages according to the set of fields specified inside "details".
	Details *details `json:"details"`
}

A Field corresponds to a given chunk of data to be extracted from every block in each page of a scrape.
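The JSON tags on Field and Extractor suggest how a field definition looks inside a payload. The sketch below decodes such a definition with local stand-in types that copy those tags (Details is left out because its type is unexported). The filter name "trim" and the "attr" parameter key are assumptions chosen for illustration; docs/extractors.md is the authority on accepted values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractor and field are local stand-ins mirroring the documented JSON tags.
type extractor struct {
	Types   []string               `json:"types"`
	Params  map[string]interface{} `json:"params"`
	Filters []string               `json:"filters"`
}

type field struct {
	Name      string    `json:"name"`
	Selector  string    `json:"selector"`
	Extractor extractor `json:"extractor"`
}

// sample is an illustrative payload fragment. The "trim" filter and the
// "attr" parameter key are assumed values, not confirmed by this page.
const sample = `[
  {"name": "title", "selector": ".product h2",
   "extractor": {"types": ["text"], "filters": ["trim"]}},
  {"name": "link", "selector": ".product a",
   "extractor": {"types": ["attr"], "params": {"attr": "href"}}}
]`

// parseFields decodes a JSON array of field definitions.
func parseFields(raw string) ([]field, error) {
	var fields []field
	err := json.Unmarshal([]byte(raw), &fields)
	return fields, err
}

func main() {
	fields, err := parseFields(sample)
	if err != nil {
		panic(err)
	}
	for _, f := range fields {
		fmt.Printf("%s: selector=%q types=%v\n", f.Name, f.Selector, f.Extractor.Types)
	}
}
```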

type JSONEncoder

type JSONEncoder struct {
}

JSONEncoder transforms parsed data to JSON format.

type Part

type Part struct {
	// The name of this part.  Required, and will be used to aggregate results.
	Name string

	// A sub-selector within the given block to process.  Pass in "." to use
	// the root block's selector with no modification.
	Selector string

	// Extractor contains the logic on how to extract some results from the
	// selector that is provided to this Part.
	Extractor extract.Extractor
	// Details is an optional field used only with the Link extractor type. It guides the scraper to follow the links and parse the additional pages according to the set of fields specified inside "details".
	Details Scraper
}

A Part represents a given chunk of data that is to be extracted from every block in each page of a scrape.

type Payload

type Payload struct {
	// Name - Collection name.
	Name string `json:"name"`
	// Request represents the HTTP request to be sent to a server. It combines the parameters that the Fetch endpoint needs to download HTML pages.
	// Request.URL is required. All other fields, including Params, Cookies, and Func, are optional.
	Request fetch.Request `json:"request"`
	//Fields is a set of fields used to extract data from a web page.
	Fields []Field `json:"fields"`
	// PayloadMD5 is the MD5 hash of the payload content. It is used to generate the name of the file to be stored.
	PayloadMD5 string
	// FetcherType represents the fetcher used to download the document.
	// Set it to either `base` or `chrome`.
	// If FetcherType is omitted, the FETCHER_TYPE value of the parse.d service is used by default.
	//FetcherType string `json:"fetcherType"`
	//Format represents output format (CSV, JSON, XML)
	Format string `json:"format"`
	//Paginator is used to scrape multiple pages.
	//If Paginator is nil, then no pagination is performed and it is assumed that the initial URL is the only page.
	Paginator *paginator `json:"paginator"`
	// PaginateResults returns results grouped by page if true.
	// The default value is false.
	// A single list of combined results from every block on all pages is returned by default.
	//
	// Paginated results are applicable for JSON and XML output formats.
	//
	// Combined list of results is always returned for CSV format.
	PaginateResults *bool `json:"paginateResults"`
	// FetchDelay lets a scraper throttle its crawling speed to avoid hitting web servers too frequently.
	// FetchDelay specifies the sleep time between multiple requests to the same domain. The actual delay is FetchDelay multiplied by a random value between 500 and 1500 msec.
	FetchDelay *time.Duration
	// Some websites track statistically significant similarities in the time between requests. RandomizeFetchDelay decreases the chance of a crawler being blocked by such sites: a random delay ranging from 0.5 * FetchDelay to 1.5 * FetchDelay is used between consecutive requests to the same domain. If FetchDelay is zero (default), this option has no effect.
	RandomizeFetchDelay *bool
	// RetryTimes is the maximum number of times to retry, in addition to the first download.
	// RETRY_HTTP_CODES
	// Default: [500, 502, 503, 504, 408]
	// Failed pages should be rescheduled for download at the end, once the spider has finished crawling all other (non-failed) pages.
	RetryTimes int `json:"retryTimes"`
	// IsPath means that one of the fields is just a path, so all other fields (if present)
	// that are not a path must be ignored.
	IsPath bool `json:"path"`
}

Payload contains the information and rules to be passed to a scraper. See docs/payload.md for up-to-date information.
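The Payload comments describe a randomized delay between consecutive requests to the same domain, ranging from 0.5 to 1.5 times the configured delay. A minimal sketch of that calculation follows. The function randomizedDelay is hypothetical, and how the package actually draws its random value is not shown on this page.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomizedDelay is a hypothetical helper (not part of package scrape).
// Given a base delay and a random value r in [0, 1), it returns a delay in
// [0.5*base, 1.5*base). A zero base always yields zero, so randomization
// has no effect when no delay is configured.
func randomizedDelay(base time.Duration, r float64) time.Duration {
	return time.Duration(float64(base) * (0.5 + r))
}

func main() {
	base := 2 * time.Second
	for i := 0; i < 3; i++ {
		// rand.Float64 returns a value in [0.0, 1.0).
		fmt.Println(randomizedDelay(base, rand.Float64()))
	}
}
```

Passing the random value in as an argument keeps the calculation deterministic and easy to check in isolation.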

type Results

type Results struct {

	// Output represents combined results after parsing from each Part of each page.  Essentially, the top-level array
	// is for each page, the second-level array is for each block in a page, and
	// the final map[string]interface{} is the mapping of Part.Name to results.
	Output [][]map[string]interface{}
}

Results describes the results of a scrape. It contains a list of all pages (URLs) visited during the process, along with all results generated from each Part in each page.

func (*Results) AllBlocks

func (r *Results) AllBlocks() []map[string]interface{}

AllBlocks returns a single list of results from every block on all pages. This function will always return a list, even if no blocks were found.

func (*Results) First

func (r *Results) First() map[string]interface{}

First returns the first set of results - i.e. the results from the first block on the first page. This function can return nil if there were no blocks found on the first page of the scrape.
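The behaviour described for AllBlocks and First can be sketched on a local stand-in for Results. The lowercase results type and its methods below are written for illustration from the descriptions above; they are not the package's source.

```go
package main

import "fmt"

// results is a local stand-in for the documented Results type:
// pages at the top level, blocks inside each page.
type results struct {
	Output [][]map[string]interface{}
}

// allBlocks flattens every page into a single list. It starts from an
// empty, non-nil slice so a list is returned even when nothing was found.
func (r *results) allBlocks() []map[string]interface{} {
	all := []map[string]interface{}{}
	for _, page := range r.Output {
		all = append(all, page...)
	}
	return all
}

// first returns the first block of the first page, or nil if the first
// page is missing or has no blocks.
func (r *results) first() map[string]interface{} {
	if len(r.Output) == 0 || len(r.Output[0]) == 0 {
		return nil
	}
	return r.Output[0][0]
}

func main() {
	r := &results{Output: [][]map[string]interface{}{
		{{"title": "a"}, {"title": "b"}}, // page 1
		{{"title": "c"}},                 // page 2
	}}
	fmt.Println(len(r.allBlocks()), r.first()["title"])

	empty := &results{}
	fmt.Println(len(empty.allBlocks()), empty.first() == nil)
}
```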

type Scraper

type Scraper struct {
	Request fetch.Request
	// Paginator is the Paginator to use for this current scrape.
	//
	// If Paginator is nil, then no pagination is performed and it is assumed that
	// the initial URL is the only page.
	Paginator paginate.Paginator

	// DividePage splits a page into individual 'blocks'.  When scraping, we treat
	// each page as if it contains some number of 'blocks', each of which can be
	// further subdivided into what actually needs to be extracted.
	//
	// If the DividePage function is nil, then no division is performed and the
	// page is assumed to contain a single block containing the entire <body>
	// tag.
	DividePage DividePageFunc

	// Parts contains the list of data that is extracted for each block.  For
	// every block that is the result of the DividePage function (above), all of
	// the Parts entries receive the selector representing the block, and can
	// return a result.  If the returned result is nil, then the Part is
	// considered not to exist in this block, and is not included.
	//
	// Note: if a Part's Extractor returns an error, it results in the scrape
	// being aborted - this can be useful if you need to ensure that a given Part
	// is required, for example.
	Parts []Part
	//Opts contains options that are used during the progress of a
	// scrape.
	//Opts ScrapeOptions
	IsPath bool
}

Scraper consolidates the settings for a scraping task.
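The field comments on DividePage and Parts describe a simple control flow: divide the page into blocks, run every part on every block, skip nil results, and abort on an error. The sketch below mirrors that flow with plain strings in place of goquery selections. All names in it (block, part, scrapePage, byLine, text, flagged) are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// block stands in for *goquery.Selection; here it is just text.
type block string

// part mirrors the idea of Part: a name plus an extraction function.
type part struct {
	name    string
	extract func(block) (interface{}, error)
}

var errEmpty = errors.New("empty block")

// byLine is a toy divide function: one block per line.
func byLine(page string) []block {
	var out []block
	for _, line := range strings.Split(page, "\n") {
		out = append(out, block(line))
	}
	return out
}

// text fails on an empty block, which makes this part required.
func text(b block) (interface{}, error) {
	if b == "" {
		return nil, errEmpty
	}
	return string(b), nil
}

// flagged returns nil for blocks without "!", so the part is skipped there.
func flagged(b block) (interface{}, error) {
	if !strings.Contains(string(b), "!") {
		return nil, nil
	}
	return true, nil
}

// scrapePage divides a page into blocks and runs every part on each block.
func scrapePage(page string, divide func(string) []block, parts []part) ([]map[string]interface{}, error) {
	blocks := []block{block(page)} // nil divide: the whole page is one block
	if divide != nil {
		blocks = divide(page)
	}
	rows := make([]map[string]interface{}, 0, len(blocks))
	for _, b := range blocks {
		row := map[string]interface{}{}
		for _, p := range parts {
			v, err := p.extract(b)
			if err != nil {
				return nil, fmt.Errorf("part %q: %w", p.name, err) // abort the scrape
			}
			if v != nil { // a nil result means the part is absent in this block
				row[p.name] = v
			}
		}
		rows = append(rows, row)
	}
	return rows, nil
}

func main() {
	parts := []part{{name: "text", extract: text}, {name: "flag", extract: flagged}}
	rows, err := scrapePage("plain\nurgent!", byLine, parts)
	fmt.Println(rows, err)
}
```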

type Task

type Task struct {
	ID      string
	Payload Payload
	//Scrapers []*Scraper
	// Visited contains a map[url]error for this scrape.
	// It always contains at least one element: the initial URL.
	// If one of the statuses [500, 502, 503, 504, 408] is returned during a scrape, the failed pages should be rescheduled for download at the end,
	// once the spider has finished crawling all other (non-failed) pages.
	Errors []error
	//TaskQueue chan *Scraper
	Robots map[string]*robotstxt.RobotsData
	//Results
	Parsed bool
	// Block counter
	BlockCounter []int
	// contains filtered or unexported fields
}

Task keeps the Results of a task generated from a Payload, along with other auxiliary information.
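The comments in Task and Payload mention rescheduling failed pages at the end of a crawl when one of the statuses 500, 502, 503, 504, or 408 is returned. A rough sketch of that policy follows. The crawl function and its fetch callback are hypothetical; the page does not show how the package implements retries.

```go
package main

import "fmt"

// retryCodes lists the statuses named in the comments above.
var retryCodes = map[int]bool{500: true, 502: true, 503: true, 504: true, 408: true}

// crawl is a hypothetical helper. It visits every URL once, then revisits
// only the failed ones after all other pages are done, for up to retryTimes
// extra rounds. fetch returns the HTTP status for a URL. The URLs that
// still fail at the end are returned.
func crawl(urls []string, retryTimes int, fetch func(string) int) []string {
	pending := urls
	for round := 0; round <= retryTimes && len(pending) > 0; round++ {
		var failed []string
		for _, u := range pending {
			if retryCodes[fetch(u)] {
				failed = append(failed, u) // reschedule for the next round
			}
		}
		pending = failed
	}
	return pending
}

func main() {
	attempts := map[string]int{}
	// A toy fetcher: "/flaky" fails once with 503, then succeeds.
	fetch := func(u string) int {
		attempts[u]++
		if u == "/flaky" && attempts[u] == 1 {
			return 503
		}
		return 200
	}
	failed := crawl([]string{"/ok", "/flaky"}, 2, fetch)
	fmt.Println(len(failed), attempts["/ok"], attempts["/flaky"])
}
```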

func NewTask

func NewTask(p Payload) *Task

NewTask creates a new task that parses a fetched page according to the rules in the Payload.

func (*Task) Parse

func (task *Task) Parse() (io.ReadCloser, error)

Parse processes the task, parsing the fetched page.

type XMLEncoder

type XMLEncoder struct {
}

XMLEncoder transforms parsed data to XML format.
