scrape

package
v0.1.0 Latest
Warning

This package is not in the latest version of its module.

Published: Sep 15, 2024 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

Package scrape implements scraping functionality for extracting useful data from HTML text using jQuery-like selectors. It provides the Scraper struct, whose Scraper.Scrape method is built on the github.com/PuerkitoBio/goquery library.

Here is a simple example that scrapes a Product struct from an HTML document.

 htmlData := `<div class="product">
 	<img src="https://via.placeholder.com/200" alt="Product 1">
 	<h2>Product 1</h2>
 	<p>Great product for your needs.</p>
 	<p class="price">$29.99</p>
 </div>`
 r := bytes.NewBufferString(htmlData)
 doc, _ := goquery.NewDocumentFromReader(r)

 scraper := scrape.Scraper{}

 // scraping
 type Product struct {
 	Name        string `select:"h2" extract:"text"`
 	Description string `select:"p" extract:"text"`
 	Price       string `select:".price" extract:"text"`
 	Image       string `select:"img" extract:"@src"`
 }
 var product Product
 err := scraper.Scrape(doc, &product, ".product", "")

 // get output
 fmt.Println("Got Error:", err)
 fmt.Println("Got Output:")
 fmt.Println(product)

Constants

const (
	SelectorTag  = "select"  // jQuery-like selector to find the node
	ExtractorTag = "extract" // extract operation to get useful data from the node
)

Struct tags that let you specify where the valuable data is and how to get it from the html.Node.
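As an illustration of how these tag keys attach to struct fields, the sketch below reads them back with the standard reflect package. It is not the package's implementation: fieldTags is a hypothetical helper, the two constants are redeclared locally so the snippet is self-contained, and Product is a shortened copy of the struct from the overview.

```go
package main

import (
	"fmt"
	"reflect"
)

// Tag keys, redeclared here with the same values as the package constants.
const (
	SelectorTag  = "select"
	ExtractorTag = "extract"
)

// Product is a shortened copy of the struct from the overview example.
type Product struct {
	Name  string `select:"h2" extract:"text"`
	Image string `select:"img" extract:"@src"`
}

// fieldTags is a hypothetical helper (not part of the package). It returns
// the select and extract tag values of the named field of the struct value v.
// ok is false if v is not a struct or has no such field.
func fieldTags(v any, field string) (selector, extract string, ok bool) {
	t := reflect.TypeOf(v)
	if t == nil || t.Kind() != reflect.Struct {
		return "", "", false
	}
	f, found := t.FieldByName(field)
	if !found {
		return "", "", false
	}
	return f.Tag.Get(SelectorTag), f.Tag.Get(ExtractorTag), true
}

func main() {
	for _, name := range []string{"Name", "Image"} {
		sel, ext, _ := fieldTags(Product{}, name)
		fmt.Printf("%s: select=%q extract=%q\n", name, sel, ext)
	}
}
```

A field without one of the tags yields an empty string for that value, because reflect.StructTag.Get returns "" for a missing key.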

const (
	TextExtractTag = "text" // get an inner text of the node
	AttrExtractTag = "@"    // get a value of an attribute ("@href", "@src")
)

Extractor tags to specify extract operations.
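The two extractor tags can be read as a small grammar: the exact value "text", or "@" followed by an attribute name. The sketch below classifies an extract value along those lines. parseExtract is a hypothetical helper and the "custom" branch is an assumption (that any other value is meant for a user-supplied extractor); the package's own parsing may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Extractor tags, redeclared here with the same values as the package constants.
const (
	TextExtractTag = "text"
	AttrExtractTag = "@"
)

// parseExtract is a hypothetical helper (not part of the package). It reports
// whether an extract value asks for the inner text ("text"), for an attribute
// ("attr", with the attribute name in attr), or for neither ("custom").
func parseExtract(tag string) (kind, attr string) {
	switch {
	case tag == TextExtractTag:
		return "text", ""
	case strings.HasPrefix(tag, AttrExtractTag) && len(tag) > len(AttrExtractTag):
		return "attr", strings.TrimPrefix(tag, AttrExtractTag)
	default:
		// Assumption: everything else would be resolved as a custom extractor.
		return "custom", ""
	}
}

func main() {
	for _, tag := range []string{"text", "@href", "@src", "trimmed"} {
		kind, attr := parseExtract(tag)
		fmt.Printf("%q -> kind=%s attr=%q\n", tag, kind, attr)
	}
}
```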

Variables

This section is empty.

Functions

func GetAttributeNotFoundErr

func GetAttributeNotFoundErr(attr string) error

func GetExtractErr

func GetExtractErr(extract string) error

func GetKindErr

func GetKindErr(typeName, expKind, actKind any) error

func GetMultiKindErr

func GetMultiKindErr(typeName any, expKinds []any, actKind any) error

func GetNilErr

func GetNilErr(name string) error

func GetNotFoundErr

func GetNotFoundErr(selector string) error

func WrapExtractErr

func WrapExtractErr(selector string, err error) error

func WrapScrapeErr

func WrapScrapeErr(err error) error

Types

type Extractor

type Extractor func(node *html.Node) (string, error)

Extractor is a function that processes the given node and returns the valuable data in string format.

type Scraper

type Scraper struct {

	// If Strict is true, the [Scraper.Scrape] method returns an error
	// when the HTML node it is looking for is not found; otherwise it
	// returns the zero value of the target type. Slices are the
	// exception: the flag has no effect on them, and when no nodes are
	// found the result is an empty slice of the specified type with a
	// capacity of 10.
	Strict bool

	// Extractors maps extract tag values to custom, user-defined
	// extractors. Do not use the reserved extract tag names and
	// patterns ([TextExtractTag], [AttrExtractTag]) as keys; for
	// those, the default implementation is executed instead.
	Extractors map[string]Extractor
}

Scraper is a struct that contains a method to scrape data from an HTML document (goquery.Document).

func (Scraper) Scrape

func (scraper Scraper) Scrape(doc *goquery.Document, o any, selector string, extract string) error

Scrape scrapes the given doc and writes the useful information into o.

o must be a pointer to a string, slice, or struct; otherwise Scrape returns an error. Slices and structs can contain strings, slices, and structs, but every leaf value must be a string.

  • valid - struct{Name string; Nicknames []string}
  • valid - []struct{Name string; Nicknames []string}
  • invalid - struct{Name Stringer; Nicknames []string}
  • invalid - []struct{Name string; Nicknames []int}
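The rule illustrated by the list above can be sketched as a recursive check over reflect kinds. This is only a model of the documented constraint, not the package's code: supported and validTarget are hypothetical names, and the sketch does not guard against self-referential struct types.

```go
package main

import (
	"fmt"
	"reflect"
)

// supported reports whether t is a string, or a slice or struct whose element
// and field types are themselves supported. Hypothetical helper; it would
// recurse forever on a self-referential struct type.
func supported(t reflect.Type) bool {
	switch t.Kind() {
	case reflect.String:
		return true
	case reflect.Slice:
		return supported(t.Elem())
	case reflect.Struct:
		for i := 0; i < t.NumField(); i++ {
			if !supported(t.Field(i).Type) {
				return false
			}
		}
		return true
	default:
		return false
	}
}

// validTarget reports whether o is a non-nil pointer to a supported type.
func validTarget(o any) bool {
	v := reflect.ValueOf(o)
	if v.Kind() != reflect.Ptr || v.IsNil() {
		return false
	}
	return supported(v.Type().Elem())
}

func main() {
	var a struct {
		Name      string
		Nicknames []string
	}
	var b []struct {
		Name      string
		Nicknames []int
	}
	var c struct {
		Name      fmt.Stringer
		Nicknames []string
	}
	fmt.Println(validTarget(&a)) // true
	fmt.Println(validTarget(&b)) // false: []int leaf
	fmt.Println(validTarget(&c)) // false: interface field
	fmt.Println(validTarget(a))  // false: not a pointer
}
```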

selector is a jQuery-like selector that specifies the path to the nodes (it is used in goquery.Selection.Find). If selector is empty, the document's own selection (goquery.Document.Selection) is used by default.

extract specifies how to get useful data from the node. It is required only if o is a pointer to a string or a slice; in all other cases you can leave it empty.
