Documentation ¶
Overview ¶
Package extract of the Dataflow kit provides extractors that retrieve structured data from HTML web pages.
Extractor types ¶
- Text Extractor returns the combined text contents of the given selection.
- HTML Extractor returns the HTML from inside each part of the given selection, as a string.
Note that this is effectively the innerHTML of each element. For example, if our selection consists of
["<p><b>ONE</b></p>", "<p><i>TWO</i></p>"]
then the output will be:
"<b>ONE</b><i>TWO</i>".
The return type is a string of all the inner HTML joined together.
- OuterHTML Extractor returns the HTML of each part of the given selection, as a string.
For example, if our selection consists of
["<div><b>ONE</b></div>", "<p><i>TWO</i></p>"]
then the output will be:
"<div><b>ONE</b></div><p><i>TWO</i></p>".
The return type is a string of all the outer HTML joined together.
- Attr extracts the value of a given HTML attribute from each part in the selection, and returns them as a list.
The return type of the extractor is a list of attribute values (i.e. []string).
- Regex runs the given regex over the contents of each part in the given selection, and, for each match, extracts the given subexpression.
The return type of the extractor is a list of string matches (i.e. []string).
Filters ¶
Filters are used to manipulate text data when extracting.
The following filters are available:
- upperCase makes all of the letters in the Extractor's text/Attr uppercase.
- lowerCase makes all of the letters in the Extractor's text/Attr lowercase.
- capitalize capitalizes the first letter of each word in the Extractor's text/Attr.
- trim returns a copy of the Extractor's text/Attr, with all leading and trailing white space removed.
Filters are available for the Text, Link and Image extractor types.
The specified filters are applied to the Text contents, the Link text and the Image alt attribute.
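The filter names above map naturally onto functions from the standard strings package. The following is a minimal sketch of that idea, not the package's actual implementation: applyFilters and capitalize are hypothetical helpers, and returning an error for an unknown filter name is an assumption made for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// capitalize upper-cases the first letter of each whitespace-separated word
// and leaves every other character untouched.
func capitalize(s string) string {
	var b strings.Builder
	startOfWord := true
	for _, r := range s {
		if startOfWord && unicode.IsLetter(r) {
			r = unicode.ToUpper(r)
		}
		startOfWord = unicode.IsSpace(r)
		b.WriteRune(r)
	}
	return b.String()
}

// applyFilters is a hypothetical helper: it applies the named filters in the
// order given and reports an error for a name it does not recognise.
func applyFilters(s string, filters []string) (string, error) {
	for _, name := range filters {
		switch name {
		case "upperCase":
			s = strings.ToUpper(s)
		case "lowerCase":
			s = strings.ToLower(s)
		case "capitalize":
			s = capitalize(s)
		case "trim":
			s = strings.TrimSpace(s)
		default:
			return "", fmt.Errorf("unknown filter %q", name)
		}
	}
	return s, nil
}

func main() {
	out, err := applyFilters("  hello WORLD  ", []string{"trim", "lowerCase", "capitalize"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", out) // "Hello World"
}
```

Because the filters run in sequence, their order matters in this sketch: lowerCase followed by capitalize gives title-cased words, while the reverse order gives all-lowercase text.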
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Attr ¶
type Attr struct {
// The HTML attribute to extract from each part.
Attr string
// BaseURL specifies the base URL to use for all relative URLs contained within a document.
BaseURL string
// By default, if there is only a single attribute extracted, Attr
// will return the match itself (as opposed to an array containing the single
// match). Set AlwaysReturnList to true to disable this behaviour, ensuring
// that the Extract function always returns an array.
AlwaysReturnList bool
// If no parts with this attribute are found, then return the empty list from
// Extract, instead of 'nil'. This signals that the result of this
// Part should be included in the results, as opposed to omitting
// the empty list.
IncludeIfEmpty bool
// Filters are used to manipulate the HTML attribute value when extracting.
// Currently the following filters are available:
// upperCase makes all of the letters in the Attr uppercase.
// lowerCase makes all of the letters in the Attr lowercase.
// capitalize capitalizes the first letter of each word in the Attr.
// trim returns a copy of the Attr, with all leading and trailing white space removed.
Filters []string
}
Attr extracts the value of a given HTML attribute from each part in the selection, and returns them as a list. The return type of the extractor is a list of attribute values (i.e. []string).
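The AlwaysReturnList and IncludeIfEmpty comments describe how the extracted values are shaped before they are returned. The sketch below restates those rules in code; shapeResult is a hypothetical helper written for this illustration, and how the real extractor combines the two flags in edge cases is an assumption.

```go
package main

import "fmt"

// shapeResult is a hypothetical helper that mirrors the documented options:
//   - no values:  nil, or an empty list when includeIfEmpty is set
//   - one value:  the value itself, unless alwaysReturnList is set
//   - otherwise:  the full list
func shapeResult(values []string, alwaysReturnList, includeIfEmpty bool) interface{} {
	switch {
	case len(values) == 0 && includeIfEmpty:
		return []string{}
	case len(values) == 0:
		return nil
	case len(values) == 1 && !alwaysReturnList:
		return values[0]
	default:
		return values
	}
}

func main() {
	one := []string{"/about"}
	fmt.Printf("%#v\n", shapeResult(one, false, false)) // "/about"
	fmt.Printf("%#v\n", shapeResult(one, true, false))  // []string{"/about"}
	fmt.Printf("%#v\n", shapeResult(nil, false, false)) // <nil>
	fmt.Printf("%#v\n", shapeResult(nil, false, true))  // []string{}
}
```

Callers that always want a []string can set AlwaysReturnList and IncludeIfEmpty together, which avoids a type switch on the result.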
type Const ¶
type Const struct {
// The value to return when the Extract() function is called.
Val interface{}
}
Const is an Extractor that returns a constant value.
type Count ¶
type Count struct {
// If no parts are matched, then return a number from
// Extract, instead of 'nil'. This signals that the result of this
// Part should be included in the results, as opposed to being
// omitted.
IncludeIfEmpty bool
}
Count extracts the count of parts that are matched and returns it.
type Extractor ¶
type Extractor interface {
// Extract some data from the given Selection and return it. The returned
// data should be encodable - i.e. passing it to json.Marshal should succeed.
// If the returned data is nil, then the output from this part will not be
// included.
//
// If this function returns an error, then the scrape is aborted.
Extract(*goquery.Selection) (interface{}, error)
}
The Extractor interface represents something that can extract data from a selection.
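The contract in the Extract comment has two consequences for whatever assembles the output: parts that return nil are left out, and everything else must survive json.Marshal. The following sketch shows both under stated assumptions: collect is a hypothetical helper, and the map of named parts stands in for the real scraper's bookkeeping, which is not described here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// collect is a hypothetical helper: it copies the per-part results into the
// output map and skips every part whose extractor returned nil.
func collect(parts map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(parts))
	for name, value := range parts {
		if value == nil {
			continue // nil means "omit this part from the output"
		}
		out[name] = value
	}
	return out
}

func main() {
	results := collect(map[string]interface{}{
		"title": "Hello",
		"links": nil,        // extractor found nothing and returned nil
		"tags":  []string{}, // empty list, as returned with IncludeIfEmpty
	})
	encoded, err := json.Marshal(results)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(encoded)) // {"tags":[],"title":"Hello"}
}
```

The empty list is kept while the nil result disappears, which is the distinction the IncludeIfEmpty fields on the extractor types are meant to control.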
type OuterHtml ¶
type OuterHtml struct{}
OuterHtml extracts and returns the HTML of each part of the given selection, as a string.
To illustrate, if our selection consists of ["<div><b>ONE</b></div>", "<p><i>TWO</i></p>"] then the output will be: "<div><b>ONE</b></div><p><i>TWO</i></p>".
The return type is a string of all the outer HTML joined together.
type Regex ¶
type Regex struct {
// The regular expression to match. This regular expression must define
// exactly one parenthesized subexpression (sometimes known as a "capturing
// group"), which will be extracted.
Regex *regexp.Regexp
// The subexpression of the regex to match. If this value is not set, and if
// the given regex has more than one subexpression, an error will be returned.
Subexpression int
// When OnlyText is true, only run the given regex over the text contents of
// each part in the selection, as opposed to the HTML contents.
OnlyText bool
// By default, if there is only a single match, Regex will return
// the match itself (as opposed to an array containing the single match).
// Set AlwaysReturnList to true to disable this behaviour, ensuring that the
// Extract function always returns an array.
AlwaysReturnList bool
// If no matches of the provided regex could be extracted, then return the empty list
// from Extract, instead of 'nil'. This signals that the result of
// this Part should be included in the results, as opposed to
// omitting the empty list.
IncludeIfEmpty bool
}
Regex runs the given regex over the contents of each part in the given selection, and, for each match, extracts the given subexpression. The return type of the extractor is a list of string matches (i.e. []string).
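Extracting one subexpression from every match can be done with the standard regexp package. The sketch below is an illustration under assumptions, not the package's code: extractSubexp, priceRe and rangeRe are hypothetical names, the contents are passed in as plain strings rather than taken from a selection, and the error conditions follow the Subexpression comment as read here.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// Example patterns (hypothetical): one and two capturing groups respectively.
var (
	priceRe = regexp.MustCompile(`price: (\d+)`)
	rangeRe = regexp.MustCompile(`(\d+)-(\d+)`)
)

// extractSubexp is a hypothetical helper. It runs re over each string and
// collects the chosen subexpression of every match. With a single capturing
// group that group is used; with several, subexp must select one of them.
func extractSubexp(re *regexp.Regexp, subexp int, contents []string) ([]string, error) {
	n := re.NumSubexp()
	switch {
	case n == 0:
		return nil, errors.New("regex has no subexpressions")
	case n == 1:
		subexp = 1
	case subexp < 1 || subexp > n:
		return nil, errors.New("regex has several subexpressions; select one")
	}
	var matches []string
	for _, content := range contents {
		for _, m := range re.FindAllStringSubmatch(content, -1) {
			matches = append(matches, m[subexp]) // m[0] is the whole match
		}
	}
	return matches, nil
}

func main() {
	parts := []string{"<li>price: 10</li><li>price: 25</li>", "<li>price: 7</li>"}
	matches, err := extractSubexp(priceRe, 0, parts)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(matches) // [10 25 7]
}
```

Calling extractSubexp(rangeRe, 0, ...) returns an error in this sketch because the pattern has two groups and none was selected, while passing 2 picks the second group.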
type Text ¶
type Text struct {
// If text is empty in the selection, then return the empty string from Extract,
// instead of 'nil'. This signals that the result of this Part
// should be included in the results, as opposed to omitting the
// empty string.
IncludeIfEmpty bool
// Filters are used to manipulate Text data when extracting.
// Currently the following filters are available:
// upperCase makes all of the letters in the selected text uppercase.
// lowerCase makes all of the letters in the selected text lowercase.
// capitalize capitalizes the first letter of each word in the selected text.
// trim returns a copy of the text, with all leading and trailing white space removed.
Filters []string
}
Text is an Extractor that returns the combined text contents of the given selection.