extract

package
v0.0.0-...-7d74a43 Latest Latest
Published: Aug 6, 2018 License: BSD-3-Clause Imports: 10 Imported by: 0

Documentation

Overview

Package extract of the Dataflow kit provides the extractors used to retrieve structured data from HTML web pages.

Extractor types

- Text Extractor returns the combined text contents of the given selection.

- HTML Extractor returns the HTML from inside each part of the given selection, as a string.

Note that this results in what is effectively the innerHTML of the element - i.e. if our selection consists of

["<p><b>ONE</b></p>", "<p><i>TWO</i></p>"]

then the output will be:

"<b>ONE</b><i>TWO</i>".

The return type is a string of all the inner HTML joined together.

- OuterHTML Extractor returns the HTML of each part of the given selection, as a string.

For example, if our selection consists of

["<div><b>ONE</b></div>", "<p><i>TWO</i></p>"]

then the output will be:

"<div><b>ONE</b></div><p><i>TWO</i></p>".

The return type is a string of all the outer HTML joined together.

- Attr extracts the value of a given HTML attribute from each part in the selection, and returns them as a list.

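The return type of the extractor is a list of attribute values (i.e. []string). By default, a single extracted value is returned on its own rather than in a list; see AlwaysReturnList.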

- Regex runs the given regex over the contents of each part in the given selection, and, for each match, extracts the given subexpression.

The return type of the extractor is a list of string matches (i.e. []string). By default, a single match is returned on its own rather than in a list; see AlwaysReturnList.

Filters

Filters are used to manipulate text data when extracting.

The following filters are available:

- upperCase makes all of the letters in the Extractor's text/Attr uppercase.

- lowerCase makes all of the letters in the Extractor's text/Attr lowercase.

- capitalize capitalizes the first letter of each word in the Extractor's text/Attr.

- trim returns a copy of the Extractor's text/Attr, with all leading and trailing white space removed.

Filters are available for Text, Link and Image extractor types.

The specified filters are applied to the Image alt attribute, the Link text, and the Text contents.
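The four filters described above can be sketched with the standard library alone. This is an illustration, not the package's implementation: the helper names applyFilters and capitalize are hypothetical, and both the left-to-right application order and the error on an unknown filter name are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// capitalize upper-cases the first character of each whitespace-separated
// word and leaves the remaining characters untouched.
func capitalize(s string) string {
	runes := []rune(s)
	startOfWord := true
	for i, r := range runes {
		if unicode.IsSpace(r) {
			startOfWord = true
			continue
		}
		if startOfWord {
			runes[i] = unicode.ToUpper(r)
			startOfWord = false
		}
	}
	return string(runes)
}

// applyFilters is a hypothetical helper that applies the named filters in
// the order given. Rejecting unknown names is an assumption of this sketch.
func applyFilters(text string, filters []string) (string, error) {
	for _, f := range filters {
		switch f {
		case "upperCase":
			text = strings.ToUpper(text)
		case "lowerCase":
			text = strings.ToLower(text)
		case "capitalize":
			text = capitalize(text)
		case "trim":
			text = strings.TrimSpace(text)
		default:
			return "", fmt.Errorf("unknown filter %q", f)
		}
	}
	return text, nil
}

func main() {
	out, err := applyFilters("  hello world  ", []string{"trim", "capitalize"})
	fmt.Printf("%q %v\n", out, err)
}
```

Because the filters are applied in sequence, their order can matter: for instance, lowerCase followed by capitalize gives a different result than capitalize followed by lowerCase.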

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Attr

type Attr struct {
	// The HTML attribute to extract from each part.
	Attr string
	// BaseURL specifies the base URL to use for all relative URLs contained within a document.
	BaseURL string
	// By default, if there is only a single attribute extracted, Attr
	// will return the match itself (as opposed to an array containing the single
	// match). Set AlwaysReturnList to true to disable this behaviour, ensuring
	// that the Extract function always returns an array.
	AlwaysReturnList bool

	// If no parts with this attribute are found, then return the empty list from
	// Extract, instead of 'nil'. This signals that the result of this
	// Part should be included in the results, as opposed to omitting
	// the empty list.
	IncludeIfEmpty bool
	// Filters are used to manipulate the HTML attribute value when extracting.
	// Currently the following filters are available:
	//   - upperCase makes all of the letters in the Attr uppercase.
	//   - lowerCase makes all of the letters in the Attr lowercase.
	//   - capitalize capitalizes the first letter of each word in the Attr.
	//   - trim returns a copy of the Attr, with all leading and trailing white space removed.
	Filters []string
}

Attr extracts the value of a given HTML attribute from each part in the selection, and returns them as a list. The return type of the extractor is a list of attribute values (i.e. []string).
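The interaction of AlwaysReturnList and IncludeIfEmpty, as described in the field comments, can be sketched as follows. The helper shapeResult is hypothetical and written only from those comments; the package's real code may order these checks differently.

```go
package main

import "fmt"

// shapeResult is a hypothetical helper that mirrors the documented field
// semantics: nil (omit the part) when nothing was found, unless
// includeIfEmpty asks for an empty list; a bare string for a single value,
// unless alwaysReturnList asks for a list.
func shapeResult(values []string, alwaysReturnList, includeIfEmpty bool) interface{} {
	if len(values) == 0 {
		if includeIfEmpty {
			return []string{}
		}
		return nil
	}
	if len(values) == 1 && !alwaysReturnList {
		return values[0]
	}
	return values
}

func main() {
	fmt.Printf("%#v\n", shapeResult(nil, false, false))
	fmt.Printf("%#v\n", shapeResult(nil, false, true))
	fmt.Printf("%#v\n", shapeResult([]string{"/a.html"}, false, false))
	fmt.Printf("%#v\n", shapeResult([]string{"/a.html"}, true, false))
}
```

A caller that wants a stable JSON shape would typically set both flags, so that the field is always present and always an array.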

func (Attr) Extract

func (e Attr) Extract(sel *goquery.Selection) (interface{}, error)

Extract returns the Attr value from the specified selection. For href and src attributes, relative URLs are returned as absolute URLs.
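How a relative href or src value becomes absolute against BaseURL can be illustrated with the standard net/url package. The helper resolve and the example.com URLs are hypothetical; the package's own resolution code is not shown on this page, so treat this as a sketch of ordinary RFC 3986 reference resolution rather than the actual implementation.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve is a hypothetical helper: it resolves ref against baseURL.
// References that are already absolute come back unchanged.
func resolve(baseURL, ref string) (string, error) {
	base, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(u).String(), nil
}

func main() {
	for _, ref := range []string{"item/1.html", "/img/a.png", "https://other.org/x"} {
		abs, err := resolve("http://example.com/catalog/", ref)
		fmt.Println(abs, err)
	}
}
```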

type Const

type Const struct {
	// The value to return when the Extract() function is called.
	Val interface{}
}

Const is an Extractor that returns a constant value.

func (Const) Extract

func (e Const) Extract(sel *goquery.Selection) (interface{}, error)

Extract returns the constant value Val, regardless of the selection.

type Count

type Count struct {
	// If no parts are matched, then return a number from Extract, instead
	// of 'nil'. This signals that the result of this Part should be
	// included in the results, as opposed to being omitted.
	IncludeIfEmpty bool
}

Count extracts the count of parts that are matched and returns it.

func (Count) Extract

func (e Count) Extract(sel *goquery.Selection) (interface{}, error)

Extract returns the number of elements in the selection.

type Extractor

type Extractor interface {
	// Extract some data from the given Selection and return it.  The returned
	// data should be encodable - i.e. passing it to json.Marshal should succeed.
	// If the returned data is nil, then the output from this part will not be
	// included.
	//
	// If this function returns an error, then the scrape is aborted.
	Extract(*goquery.Selection) (interface{}, error)
}

The Extractor interface represents something that can extract data from a selection.
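The contract in the interface comment (nil results are omitted, an error aborts the scrape, and the output must be JSON-encodable) can be sketched without goquery. Everything here is hypothetical: partFunc stands in for an Extractor already bound to its selection, and part, collect and errBroken exist only for this illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// partFunc is a hypothetical stand-in for an Extractor bound to a selection.
type partFunc func() (interface{}, error)

// part pairs an output field name with the function that produces its value.
type part struct {
	name    string
	extract partFunc
}

// errBroken is an example error used to show how a scrape is aborted.
var errBroken = errors.New("broken markup")

// collect runs every part. A nil value omits the field; an error stops
// the whole run, mirroring the documented Extractor contract.
func collect(parts []part) (map[string]interface{}, error) {
	out := make(map[string]interface{})
	for _, p := range parts {
		v, err := p.extract()
		if err != nil {
			return nil, fmt.Errorf("part %q: %w", p.name, err)
		}
		if v == nil {
			continue
		}
		out[p.name] = v
	}
	return out, nil
}

func main() {
	res, err := collect([]part{
		{name: "title", extract: func() (interface{}, error) { return "Hello", nil }},
		{name: "subtitle", extract: func() (interface{}, error) { return nil, nil }},
		{name: "count", extract: func() (interface{}, error) { return 3, nil }},
	})
	if err != nil {
		fmt.Println("scrape aborted:", err)
		return
	}
	b, err := json.Marshal(res)
	fmt.Println(string(b), err)

	_, err = collect([]part{
		{name: "bad", extract: func() (interface{}, error) { return nil, errBroken }},
	})
	fmt.Println("second run:", err)
}
```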

type OuterHtml

type OuterHtml struct{}

OuterHtml extracts and returns the HTML of each part of the given selection, as a string.

To illustrate, if our selection consists of ["<div><b>ONE</b></div>", "<p><i>TWO</i></p>"] then the output will be: "<div><b>ONE</b></div><p><i>TWO</i></p>".

The return type is a string of all the outer HTML joined together.

func (OuterHtml) Extract

func (e OuterHtml) Extract(sel *goquery.Selection) (interface{}, error)

Extract returns the outer HTML of the specified selection.

type Regex

type Regex struct {
	// The regular expression to match.  This regular expression must define
	// exactly one parenthesized subexpression (sometimes known as a "capturing
	// group"), which will be extracted.
	Regex *regexp.Regexp
	// The subexpression of the regex to extract.  If this value is not set, and if
	// the given regex has more than one subexpression, an error is returned.
	Subexpression int

	// When OnlyText is true, only run the given regex over the text contents of
	// each part in the selection, as opposed to the HTML contents.
	OnlyText bool

	// By default, if there is only a single match, Regex will return
	// the match itself (as opposed to an array containing the single match).
	// Set AlwaysReturnList to true to disable this behaviour, ensuring that the
	// Extract function always returns an array.
	AlwaysReturnList bool

	// If no matches of the provided regex could be extracted, then return the empty list
	// from Extract, instead of 'nil'. This signals that the result of
	// this Part should be included in the results, as opposed to
	// omitting the empty list.
	IncludeIfEmpty bool
}

Regex runs the given regex over the contents of each part in the given selection, and, for each match, extracts the given subexpression. The return type of the extractor is a list of string matches (i.e. []string).

func (Regex) Extract

func (e Regex) Extract(sel *goquery.Selection) (interface{}, error)

Extract returns the values matched by the regex in the specified selection.
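The subexpression selection described in the Regex field comments can be sketched with the standard regexp package. The helpers pickSubexp and extractAll and the pricePattern variable are hypothetical, and the contents are passed as plain strings instead of a goquery selection. The rules for an unset Subexpression follow the comments above; rejecting a regex with no subexpressions and an out-of-range index are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// pricePattern is an example regex with two capturing groups: dollars and cents.
var pricePattern = regexp.MustCompile(`\$(\d+)\.(\d+)`)

// pickSubexp decides which capturing group to extract. A zero value means
// "not set": that is only accepted when the regex has exactly one group.
func pickSubexp(re *regexp.Regexp, subexp int) (int, error) {
	n := re.NumSubexp()
	switch {
	case n == 0:
		return 0, errors.New("regex has no subexpressions")
	case subexp == 0 && n > 1:
		return 0, fmt.Errorf("regex has %d subexpressions, but none was selected", n)
	case subexp == 0:
		return 1, nil
	case subexp < 0 || subexp > n:
		return 0, fmt.Errorf("subexpression %d out of range 1..%d", subexp, n)
	}
	return subexp, nil
}

// extractAll runs re over each content string and collects the chosen
// subexpression from every match.
func extractAll(re *regexp.Regexp, subexp int, contents []string) ([]string, error) {
	idx, err := pickSubexp(re, subexp)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, c := range contents {
		for _, m := range re.FindAllStringSubmatch(c, -1) {
			out = append(out, m[idx])
		}
	}
	return out, nil
}

func main() {
	contents := []string{"<b>$10.50</b> and $3.99", "no price here"}
	dollars, err := extractAll(pricePattern, 1, contents)
	fmt.Println(dollars, err)
	_, err = extractAll(pricePattern, 0, contents)
	fmt.Println(err)
}
```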

type Text

type Text struct {
	// If text is empty in the selection, then return the empty string from Extract,
	// instead of 'nil'. This signals that the result of this Part
	// should be included in the results, as opposed to omitting the
	// empty string.
	IncludeIfEmpty bool
	// Filters are used to manipulate Text data when extracting.
	// Currently the following filters are available:
	//   - upperCase makes all of the letters in the selected text uppercase.
	//   - lowerCase makes all of the letters in the selected text lowercase.
	//   - capitalize capitalizes the first letter of each word in the selected text.
	//   - trim returns a copy of the text, with all leading and trailing white space removed.
	Filters []string
}

Text is an Extractor that returns the combined text contents of the given selection.

func (Text) Extract

func (e Text) Extract(sel *goquery.Selection) (interface{}, error)

Extract returns the combined text of the specified selection.
