scrape

package
v0.3.0 Latest
Warning

This package is not in the latest version of its module.
Published: Sep 19, 2024 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

Package scrape implements scraping functionality for extracting useful data from HTML documents using jQuery-like selectors. It provides the Scraper struct, whose Scraper.Scrape method is built on the github.com/PuerkitoBio/goquery library.

Here is a simple example that scrapes a Product struct from an HTML document.

 htmlData := `<div class="product">
 	<img src="https://via.placeholder.com/200" alt="Product 1">
 	<h2>Product 1</h2>
 	<p>Great product for your needs.</p>
 	<p class="price">$29.99</p>
 </div>`
 r := bytes.NewBufferString(htmlData)
 doc, _ := goquery.NewDocumentFromReader(r)

 scraper := scrape.Scraper{}

 // scraping
 type Product struct {
 	Name        string `select:"h2" extract:"text"`
 	Description string `select:"p" extract:"text"`
 	Price       string `select:".price" extract:"text"`
 	Image       string `select:"img" extract:"@src"`
 }
 var product Product
 err := scraper.Scrape(doc, &product, ".product", "")

 // get output
 fmt.Println("Got Error:", err)
 fmt.Println("Got Output:")
 fmt.Println(product)

Constants

const (
	TextExtractTag     = "text"     // get the text of the node's direct child text nodes
	DeepTextExtractTag = "deeptext" // get the text of all descendant text nodes
	AttrExtractTag     = "@"        // get the value of an attribute ("@href", "@src")
)

Extractor tags to specify extract operations.

const (
	SelectorTag  = "select"  // jQuery-like selector to find the node
	ExtractorTag = "extract" // extract operation to get useful data from the node
)

Struct tags that let you specify where the valuable data is and how to get it from the html.Node.
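The two struct tags can be read with the standard reflect package. The sketch below shows only how tag values such as those on the Product struct are looked up; fieldTags is a hypothetical helper written for illustration and is not part of this package, whose internal tag handling may differ.

```go
package main

import (
	"fmt"
	"reflect"
)

// Product mirrors part of the struct from the overview example.
type Product struct {
	Name  string `select:"h2" extract:"text"`
	Price string `select:".price" extract:"text"`
	Image string `select:"img" extract:"@src"`
}

// fieldTags returns the "select" and "extract" tag values of the named
// field. It is a hypothetical helper, not a function of the scrape package.
func fieldTags(v any, field string) (sel, ext string, ok bool) {
	t := reflect.TypeOf(v)
	if t == nil {
		return "", "", false
	}
	if t.Kind() == reflect.Pointer {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return "", "", false
	}
	f, found := t.FieldByName(field)
	if !found {
		return "", "", false
	}
	return f.Tag.Get("select"), f.Tag.Get("extract"), true
}

func main() {
	sel, ext, _ := fieldTags(Product{}, "Image")
	fmt.Println(sel, ext) // prints "img @src"
}
```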

Variables

This section is empty.

Functions

func ExtractAttribute added in v0.2.0

func ExtractAttribute(node *html.Node, attr string) (string, error)

ExtractAttribute returns the value of the given attribute. If the attribute is absent it returns an error.

func ExtractDeepText added in v0.2.0

func ExtractDeepText(node *html.Node) string

ExtractDeepText returns the text of all descendants' text nodes.

func ExtractText added in v0.2.0

func ExtractText(node *html.Node) string

ExtractText returns the text of all children's text nodes.
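The difference between children's and descendants' text nodes can be illustrated on a tiny tree. The sketch below uses a hypothetical node type as a stand-in for *html.Node, and directText and deepText are illustrative names; it does not reproduce this package's implementation, which may, for instance, treat whitespace differently.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for *html.Node.
type node struct {
	IsText   bool
	Text     string
	Children []*node
}

// directText concatenates only the text nodes that are direct children of n.
func directText(n *node) string {
	var b strings.Builder
	for _, c := range n.Children {
		if c.IsText {
			b.WriteString(c.Text)
		}
	}
	return b.String()
}

// deepText concatenates the text nodes of all descendants of n, in document order.
func deepText(n *node) string {
	var b strings.Builder
	for _, c := range n.Children {
		if c.IsText {
			b.WriteString(c.Text)
		} else {
			b.WriteString(deepText(c))
		}
	}
	return b.String()
}

// sample models <p>Price: <b>$29.99</b></p>.
func sample() *node {
	bold := &node{Children: []*node{{IsText: true, Text: "$29.99"}}}
	return &node{Children: []*node{{IsText: true, Text: "Price: "}, bold}}
}

func main() {
	p := sample()
	fmt.Printf("%q %q\n", directText(p), deepText(p)) // prints "Price: " "Price: $29.99"
}
```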

func GetAttributeNotFoundErr

func GetAttributeNotFoundErr(attr string) error

func GetExtractErr

func GetExtractErr(extract string) error

func GetExtractorMap added in v0.2.0

func GetExtractorMap() map[*Match]Extractor

GetExtractorMap returns the default map to match extracting tags and extracting functions (or extractors).
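A map from matchers to extractors can be used for dispatch roughly as sketched below. Everything here is an assumption made for illustration: the types are redeclared locally, the node is simplified to a map of attributes instead of *html.Node, and newExtractorMap and dispatch are hypothetical names. One detail is certain: Go function values are not comparable, so they cannot serve as map keys, while pointers to them can, which is presumably why the key type is *Match.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Match and Extractor are local redeclarations for this sketch.
// Extractor takes an attribute map as a simplified stand-in for *html.Node.
type Match func(extract string) (string, bool)
type Extractor func(attrs map[string]string, extract string) (string, error)

// newExtractorMap builds a one-entry map that handles "@attr" tags.
func newExtractorMap() map[*Match]Extractor {
	attr := Match(func(extract string) (string, bool) {
		if strings.HasPrefix(extract, "@") {
			return strings.TrimPrefix(extract, "@"), true
		}
		return extract, false
	})
	return map[*Match]Extractor{
		&attr: func(attrs map[string]string, name string) (string, error) {
			v, ok := attrs[name]
			if !ok {
				return "", fmt.Errorf("attribute %q not found", name)
			}
			return v, nil
		},
	}
}

// dispatch runs the extractor of the first matcher that accepts the tag.
func dispatch(m map[*Match]Extractor, attrs map[string]string, extract string) (string, error) {
	for match, ex := range m {
		if arg, ok := (*match)(extract); ok {
			return ex(attrs, arg)
		}
	}
	return "", errors.New("no extractor for tag " + extract)
}

func main() {
	node := map[string]string{"href": "/products/1"}
	v, err := dispatch(newExtractorMap(), node, "@href")
	fmt.Println(v, err) // prints "/products/1 <nil>"
}
```

Because Go map iteration order is unspecified, a design like this only behaves predictably when at most one matcher accepts any given tag value.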

func GetKindErr

func GetKindErr(typeName, expKind, actKind any) error

func GetMultiKindErr

func GetMultiKindErr(typeName any, expKinds []any, actKind any) error

func GetNilErr

func GetNilErr(name string) error

func GetNotFoundErr

func GetNotFoundErr(selector string) error

func WrapExtractErr

func WrapExtractErr(selector string, err error) error

func WrapScrapeErr

func WrapScrapeErr(err error) error

Types

type Extractor

type Extractor func(node *html.Node, extract string) (string, error)

Extractor is a function that processes the given node and returns the valuable data in string format.

type Match added in v0.2.0

type Match func(extract string) (string, bool)

Match wraps the boolean logic that matches the value of an extract tag to an extracting function (an extractor). Match also returns the processed value of the extract tag (for example, "@href" -> "href").

func GetEqualMatch added in v0.2.0

func GetEqualMatch(expected string) Match

GetEqualMatch creates a Match function that compares the given value with the value of the extracting tag.

func GetPrefixMatch added in v0.2.0

func GetPrefixMatch(prefix string) Match

GetPrefixMatch creates a Match function that checks whether the extract tag value has the given prefix and returns a boolean result together with the tag value. On a match, it strips the matched prefix from the returned value (for example, "@href" -> "href").
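The behaviour described for the two constructors can be sketched as follows. This is a local re-implementation under assumptions, not the package's source: equalMatch and prefixMatch are hypothetical names, and returning the unchanged tag value on a failed match is a choice made here for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Match has the same shape as the package's Match type, redeclared locally.
type Match func(extract string) (string, bool)

// equalMatch reports whether the tag value equals expected.
func equalMatch(expected string) Match {
	return func(extract string) (string, bool) {
		return extract, extract == expected
	}
}

// prefixMatch reports whether the tag value starts with prefix and,
// on a match, returns the value with the prefix removed.
func prefixMatch(prefix string) Match {
	return func(extract string) (string, bool) {
		if strings.HasPrefix(extract, prefix) {
			return strings.TrimPrefix(extract, prefix), true
		}
		return extract, false
	}
}

func main() {
	fmt.Println(prefixMatch("@")("@href"))      // prints "href true"
	fmt.Println(equalMatch("text")("deeptext")) // prints "deeptext false"
}
```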

type Scraper

type Scraper struct {

	// If the Strict flag is true, the [Scraper.Scrape] method returns an
	// error when the sought HTML node is not found; otherwise it writes
	// a zero value of the target type. The exception is a slice type,
	// for which the flag has no effect: even if no nodes are found,
	// Scrape returns an empty slice of the specified type with a
	// capacity of 10.
	Strict bool

	// Extractors is a map that matches custom user extractors to extract tags.
	// Do not use reserved extractor tag names and patterns ([TextExtractTag],
	// [AttrExtractTag], and others), otherwise, the default implementation is executed.
	Extractors map[*Match]Extractor
}

Scraper is a struct that contains a method to scrape data from an HTML document (goquery.Document).

func (Scraper) Scrape

func (scraper Scraper) Scrape(doc *goquery.Document, o any, selector string, extract string) error

Scrape scrapes the given doc and writes the useful information into o.

o must be a pointer to a string, slice, or struct; otherwise Scrape returns an error. Slices and structs can both contain pointers, strings, slices, and structs, but the end value must be a string.

selector is a jQuery-like selector that specifies a path to nodes (it is passed to goquery.Selection.Find). If selector is empty, the document's root selection (goquery.Document.Selection) is used by default.

extract is a value that specifies how to get useful data from the node. extract is required only if o is a pointer to a string or slice; in all other cases you can leave it empty.
