scrape

package

v0.3.0 (latest) Published: Sep 19, 2024 License: MIT Imports: 7 Imported by: 0

Repository

github.com/branow/htmlscraper

Documentation ¶

Overview ¶

Package scrape implements scraping functionality for extracting useful data from HTML text using jQuery-like selectors. It provides the struct Scraper, whose method Scraper.Scrape is built on the github.com/PuerkitoBio/goquery library.

Here is a simple example that scrapes a Product struct from an HTML document.

 htmlData := `<div class="product">
 	<img src="https://via.placeholder.com/200" alt="Product 1">
 	<h2>Product 1</h2>
 	<p>Great product for your needs.</p>
 	<p class="price">$29.99</p>
 </div>`
 r := bytes.NewBufferString(htmlData)
 doc, _ := goquery.NewDocumentFromReader(r)

 scraper := scrape.Scraper{}

 // scraping
 type Product struct {
 	Name        string `select:"h2" extract:"text"`
 	Description string `select:"p" extract:"text"`
 	Price       string `select:".price" extract:"text"`
 	Image       string `select:"img" extract:"@src"`
 }
 var product Product
 err := scraper.Scrape(doc, &product, ".product", "")

 // get output
 fmt.Println("Got Error:", err)
 fmt.Println("Got Output:")
 fmt.Println(product)

Index ¶

Constants
func ExtractAttribute(node *html.Node, attr string) (string, error)
func ExtractDeepText(node *html.Node) string
func ExtractText(node *html.Node) string
func GetAttributeNotFoundErr(attr string) error
func GetExtractErr(extract string) error
func GetExtractorMap() map[*Match]Extractor
func GetKindErr(typeName, expKind, actKind any) error
func GetMultiKindErr(typeName any, expKinds []any, actKind any) error
func GetNilErr(name string) error
func GetNotFoundErr(selector string) error
func WrapExtractErr(selector string, err error) error
func WrapScrapeErr(err error) error
type Extractor
type Match
- func GetEqualMatch(expected string) Match
- func GetPrefixMatch(prefix string) Match
type Scraper
- func (scraper Scraper) Scrape(doc *goquery.Document, o any, selector string, extract string) error

Constants ¶

const (
	TextExtractTag     = "text"     // get the text of the node's child text nodes
	DeepTextExtractTag = "deeptext" // get the text of all descendant text nodes
	AttrExtractTag     = "@"        // get the value of an attribute ("@href", "@src")
)

Extractor tags to specify extract operations.

const (
	SelectorTag  = "select"  // jQuery-like selector to find the node
	ExtractorTag = "extract" // extract operation to get useful data from the node
)

Struct field tags that let you specify where the valuable data is and how to get it from the html.Node.
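
To make the role of these two tag keys concrete, here is a minimal sketch that reads them from a struct with the standard reflect package. It does not use this package at all; fieldTags is a hypothetical helper written for illustration, and how Scraper.Scrape actually walks struct fields is not shown in this documentation.

```go
package main

import (
	"fmt"
	"reflect"
)

// Product reuses two fields from the overview example.
type Product struct {
	Name  string `select:"h2" extract:"text"`
	Image string `select:"img" extract:"@src"`
}

// fieldTags is a hypothetical helper (not part of the package).
// For each field of the struct value v it returns the field name,
// the "select" tag value, and the "extract" tag value, in
// declaration order.
func fieldTags(v any) [][3]string {
	t := reflect.TypeOf(v)
	out := make([][3]string, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		out = append(out, [3]string{f.Name, f.Tag.Get("select"), f.Tag.Get("extract")})
	}
	return out
}

func main() {
	for _, ft := range fieldTags(Product{}) {
		fmt.Printf("%s: select=%q extract=%q\n", ft[0], ft[1], ft[2])
	}
}
```

A field without one of the tags would simply yield an empty string for it, since reflect.StructTag.Get returns "" for a missing key.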

Variables ¶

This section is empty.

Functions ¶

func ExtractAttribute ¶ added in v0.2.0

func ExtractAttribute(node *html.Node, attr string) (string, error)

ExtractAttribute returns the value of the given attribute. If the attribute is absent it returns an error.

func ExtractDeepText ¶ added in v0.2.0

func ExtractDeepText(node *html.Node) string

ExtractDeepText returns the text of all descendants' text nodes.

func ExtractText ¶ added in v0.2.0

func ExtractText(node *html.Node) string

ExtractText returns the text of all children's text nodes.
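
The difference between children's and descendants' text can be illustrated with a toy tree. The sketch below is an assumption-laden illustration: node is a hypothetical, simplified stand-in for html.Node, and childText and deepText are not the package's functions. Whether the real extractors trim or normalize whitespace is not stated in this documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical, simplified stand-in for html.Node.
type node struct {
	IsText   bool    // true for a text node
	Text     string  // content of a text node
	Children []*node // child nodes of an element
}

// childText concatenates only the direct child text nodes.
func childText(n *node) string {
	var sb strings.Builder
	for _, c := range n.Children {
		if c.IsText {
			sb.WriteString(c.Text)
		}
	}
	return sb.String()
}

// deepText concatenates every text node below n, in document order.
func deepText(n *node) string {
	var sb strings.Builder
	for _, c := range n.Children {
		if c.IsText {
			sb.WriteString(c.Text)
		} else {
			sb.WriteString(deepText(c))
		}
	}
	return sb.String()
}

func main() {
	// Models: <p>Price: <b>$29.99</b></p>
	p := &node{Children: []*node{
		{IsText: true, Text: "Price: "},
		{Children: []*node{{IsText: true, Text: "$29.99"}}},
	}}
	fmt.Printf("%q\n", childText(p))
	fmt.Printf("%q\n", deepText(p))
}
```

In this toy model the text inside the nested element is only picked up by the deep variant.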

func GetAttributeNotFoundErr ¶

func GetAttributeNotFoundErr(attr string) error

func GetExtractErr ¶

func GetExtractErr(extract string) error

func GetExtractorMap ¶ added in v0.2.0

func GetExtractorMap() map[*Match]Extractor

GetExtractorMap returns the default map that associates extract tag matchers with their extracting functions (extractors).

func GetKindErr ¶

func GetKindErr(typeName, expKind, actKind any) error

func GetMultiKindErr ¶

func GetMultiKindErr(typeName any, expKinds []any, actKind any) error

func GetNilErr ¶

func GetNilErr(name string) error

func GetNotFoundErr ¶

func GetNotFoundErr(selector string) error

func WrapExtractErr ¶

func WrapExtractErr(selector string, err error) error

func WrapScrapeErr ¶

func WrapScrapeErr(err error) error

Types ¶

type Extractor ¶

type Extractor func(node *html.Node, extract string) (string, error)

Extractor is a function that processes the given node and returns the valuable data in string format.

type Match ¶ added in v0.2.0

type Match func(extract string) (string, bool)

Match wraps the logic that decides whether the value of an extract tag belongs to a given extracting function (extractor). Along with the boolean result, Match returns the processed value of the extract tag (for example, "@href" -> "href").

func GetEqualMatch ¶ added in v0.2.0

func GetEqualMatch(expected string) Match

GetEqualMatch creates a Match function that compares the given value with the value of the extracting tag.

func GetPrefixMatch ¶ added in v0.2.0

func GetPrefixMatch(prefix string) Match

GetPrefixMatch creates a Match function that checks whether the extract tag value has the given prefix and returns a boolean result together with the extract tag value. When the prefix matches, it cuts the prefix from the returned value (for example, "@href" -> "href").
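
The two matcher constructors described above can be sketched with the standard library alone. This is an illustration of the documented behavior, not the package's source: equalMatch and prefixMatch are hypothetical stand-ins for GetEqualMatch and GetPrefixMatch, and the value returned when a match fails is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Match has the same shape as the package's Match type.
type Match func(extract string) (string, bool)

// equalMatch is a hypothetical stand-in for GetEqualMatch: it
// reports whether the extract tag value equals expected.
func equalMatch(expected string) Match {
	return func(extract string) (string, bool) {
		return extract, extract == expected
	}
}

// prefixMatch is a hypothetical stand-in for GetPrefixMatch: on a
// match it returns the value with the prefix removed. Returning the
// unchanged value on a mismatch is an assumption of this sketch.
func prefixMatch(prefix string) Match {
	return func(extract string) (string, bool) {
		if strings.HasPrefix(extract, prefix) {
			return strings.TrimPrefix(extract, prefix), true
		}
		return extract, false
	}
}

func main() {
	fmt.Println(equalMatch("text")("text"))
	fmt.Println(prefixMatch("@")("@href"))
	fmt.Println(prefixMatch("@")("deeptext"))
}
```

Keeping the processed value in the return lets an extractor receive "href" directly instead of re-parsing "@href".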

type Scraper ¶

type Scraper struct {

	// If the Strict flag is true, the [Scraper.Scrape] method returns
	// an error when the sought HTML node is not found; otherwise it
	// leaves a zero value of the target type. Slices are the exception:
	// the flag has no effect on them, and when no nodes are found the
	// result is an empty slice of the specified type with a capacity
	// of 10.
	Strict bool

	// Extractors is a map that matches custom user extractors to extract tags.
	// Do not use reserved extractor tag names and patterns ([TextExtractTag],
	// [AttrExtractTag], and others), otherwise, the default implementation is executed.
	Extractors map[*Match]Extractor
}

Scraper is a struct that contains a method to scrape data from an HTML document (goquery.Document).

func (Scraper) Scrape ¶

func (scraper Scraper) Scrape(doc *goquery.Document, o any, selector string, extract string) error

Scrape scrapes the given doc and writes the useful information into o.

o must be a pointer to a string, slice, or struct; otherwise Scrape returns an error. Slices and structs may contain pointers, strings, slices, and structs, but every leaf value must be a string.

selector is a jQuery-like selector that specifies a path to the nodes (it is passed to goquery.Selection.Find). If selector is empty, the selection of the whole document (goquery.Document.Selection) is used by default.

extract is a value that specifies how to get useful data from the node. It is required only if o is a pointer to a string or slice; in all other cases you can leave it empty.
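
The constraint on o can be pictured as a kind check done with reflection. The sketch below is an assumption about what such a check might look like: checkTarget is a hypothetical helper, and the package's own validation and error messages (see GetKindErr, GetMultiKindErr, and GetNilErr) may differ.

```go
package main

import (
	"fmt"
	"reflect"
)

// checkTarget is a hypothetical helper (not part of the package).
// It accepts only a non-nil pointer to a string, slice, or struct.
func checkTarget(o any) error {
	v := reflect.ValueOf(o)
	if v.Kind() != reflect.Ptr || v.IsNil() {
		return fmt.Errorf("target must be a non-nil pointer, got %T", o)
	}
	switch k := v.Elem().Kind(); k {
	case reflect.String, reflect.Slice, reflect.Struct:
		return nil
	default:
		return fmt.Errorf("target must point to a string, slice, or struct, got pointer to %s", k)
	}
}

func main() {
	var title string
	var prices []string
	var count int

	fmt.Println(checkTarget(&title))  // accepted
	fmt.Println(checkTarget(&prices)) // accepted
	fmt.Println(checkTarget(&count))  // rejected: pointer to int
	fmt.Println(checkTarget(title))   // rejected: not a pointer
}
```

This sketch only inspects the top-level kind; checking that nested fields and elements end in strings would need a recursive walk.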
