scrape

package

v0.1.0 (Latest) Published: Sep 15, 2024 License: MIT Imports: 7 Imported by: 0

Repository

github.com/branow/htmlscraper

Documentation ¶

Overview ¶

Package scrape implements scraping functionality for extracting useful data from HTML text using jQuery-like selectors. It provides the Scraper struct, whose Scraper.Scrape method is built on the github.com/PuerkitoBio/goquery library.

Here is a simple example that scrapes a Product struct from an HTML document.

 htmlData := `<div class="product">
 	<img src="https://via.placeholder.com/200" alt="Product 1">
 	<h2>Product 1</h2>
 	<p>Great product for your needs.</p>
 	<p class="price">$29.99</p>
 </div>`
 r := bytes.NewBufferString(htmlData)
 doc, _ := goquery.NewDocumentFromReader(r)

 scraper := scrape.Scraper{}

 // scraping
 type Product struct {
 	Name        string `select:"h2" extract:"text"`
 	Description string `select:"p" extract:"text"`
 	Price       string `select:".price" extract:"text"`
 	Image       string `select:"img" extract:"@src"`
 }
 var product Product
 err := scraper.Scrape(doc, &product, ".product", "")

 // get output
 fmt.Println("Got Error:", err)
 fmt.Println("Got Output:")
 fmt.Println(product)

Index ¶

Constants
func GetAttributeNotFoundErr(attr string) error
func GetExtractErr(extract string) error
func GetKindErr(typeName, expKind, actKind any) error
func GetMultiKindErr(typeName any, expKinds []any, actKind any) error
func GetNilErr(name string) error
func GetNotFoundErr(selector string) error
func WrapExtractErr(selector string, err error) error
func WrapScrapeErr(err error) error
type Extractor
type Scraper
- func (scraper Scraper) Scrape(doc *goquery.Document, o any, selector string, extract string) error

Constants ¶

const (
	SelectorTag  = "select"  // jQuery-like selector to find the node
	ExtractorTag = "extract" // extract operation to get useful data from the node
)

These struct tags specify where the valuable data is located and how to extract it from the html.Node.
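To make the role of these two tag keys concrete, the following is a minimal sketch of how struct tags of this shape can be read with the standard reflect package. It is not the package's implementation: the lowercase constants mirror the documented values, and fieldRule and readRules are hypothetical names introduced only for illustration.

```go
package main

import (
	"fmt"
	"reflect"
)

// Local copies of the documented tag keys (assumed values: "select", "extract").
const (
	selectorTag  = "select"
	extractorTag = "extract"
)

// Product is a shortened version of the struct from the overview example.
type Product struct {
	Name  string `select:"h2" extract:"text"`
	Image string `select:"img" extract:"@src"`
}

// fieldRule is a hypothetical type holding the tag values of one field.
type fieldRule struct {
	Field, Selector, Extract string
}

// readRules is a hypothetical helper. It lists the select and extract tag
// values of every field of v, which must be a struct value.
func readRules(v any) []fieldRule {
	t := reflect.TypeOf(v)
	rules := make([]fieldRule, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		rules = append(rules, fieldRule{
			Field:    f.Name,
			Selector: f.Tag.Get(selectorTag),
			Extract:  f.Tag.Get(extractorTag),
		})
	}
	return rules
}

func main() {
	for _, r := range readRules(Product{}) {
		fmt.Printf("%s: select=%q extract=%q\n", r.Field, r.Selector, r.Extract)
	}
}
```

A field without one of the tags simply yields an empty string from Tag.Get; how the package itself treats a missing tag is not stated in this documentation.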

const (
	TextExtractTag = "text" // get the inner text of the node
	AttrExtractTag = "@"    // get the value of an attribute ("@href", "@src")
)

Extractor tags to specify extract operations.
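As an illustration of the two built-in forms, here is a small sketch that classifies an extract value as the text operation, an attribute operation with its attribute name, or something else that would have to be a custom name. The function classifyExtract and its return labels are assumptions made for this example; the package's own parsing and its handling of unknown values may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Local copies of the documented extractor tags (assumed values: "text", "@").
const (
	textExtractTag = "text"
	attrExtractTag = "@"
)

// classifyExtract is a hypothetical helper. It reports the kind of operation
// an extract value names and, for attribute operations, the attribute name.
func classifyExtract(extract string) (kind, attr string) {
	switch {
	case extract == textExtractTag:
		return "text", ""
	case strings.HasPrefix(extract, attrExtractTag) && len(extract) > len(attrExtractTag):
		// "@href" names the attribute "href".
		return "attribute", strings.TrimPrefix(extract, attrExtractTag)
	default:
		// Anything else is neither of the two reserved forms.
		return "custom", ""
	}
}

func main() {
	for _, e := range []string{"text", "@src", "upper"} {
		kind, attr := classifyExtract(e)
		fmt.Printf("%q: kind=%s attr=%q\n", e, kind, attr)
	}
}
```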

Variables ¶

This section is empty.

Functions ¶

func GetAttributeNotFoundErr ¶

func GetAttributeNotFoundErr(attr string) error

func GetExtractErr ¶

func GetExtractErr(extract string) error

func GetKindErr ¶

func GetKindErr(typeName, expKind, actKind any) error

func GetMultiKindErr ¶

func GetMultiKindErr(typeName any, expKinds []any, actKind any) error

func GetNilErr ¶

func GetNilErr(name string) error

func GetNotFoundErr ¶

func GetNotFoundErr(selector string) error

func WrapExtractErr ¶

func WrapExtractErr(selector string, err error) error

func WrapScrapeErr ¶

func WrapScrapeErr(err error) error

Types ¶

type Extractor ¶

type Extractor func(node *html.Node) (string, error)

Extractor is a function that processes the given node and returns the valuable data in string format.

type Scraper ¶

type Scraper struct {

	// If Strict is true, the [Scraper.Scrape] method returns an error
	// when the HTML node being sought is not found; otherwise it yields
	// the zero value for the type. The exception is a slice type, for
	// which the flag has no effect: even if no nodes are found, it
	// returns an empty slice of the specified type with a capacity
	// of 10.
	Strict bool

	// Extractors maps extract tag values to custom user-defined
	// extractors. Do not use the reserved extract tag names and
	// patterns ([TextExtractTag], [AttrExtractTag]); for those, the
	// default implementation is executed instead.
	Extractors map[string]Extractor
}

Scraper is a struct that contains a method to scrape data from an HTML document (goquery.Document).

func (Scraper) Scrape ¶

func (scraper Scraper) Scrape(doc *goquery.Document, o any, selector string, extract string) error

Scrape scrapes the given doc and writes the useful information into o.

o must be a pointer to a string, slice, or struct; otherwise Scrape returns an error. Slices and structs may contain strings, slices, and structs, but every end value must be a string.

valid - struct{Name string; Nicknames []string}
valid - []struct{Name string; Nicknames []string}
invalid - struct{Name Stringer; Nicknames []string}
invalid - []struct{Name string; Nicknames []int}
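The rule illustrated by these examples can be sketched as a recursive check over the target's type using the standard reflect package. This is only an illustration under stated assumptions: validTarget and endsInString are hypothetical names, the sketch does not guard against self-referential types, and the package's real validation (and the errors it returns) may differ.

```go
package main

import (
	"fmt"
	"reflect"
)

// person matches the "valid" examples; badPerson matches an "invalid" one.
type person struct {
	Name      string
	Nicknames []string
}

type badPerson struct {
	Name string
	Ages []int
}

// endsInString reports whether t is a string, or a slice or struct whose
// elements and fields all eventually end in strings.
// Recursive types are not handled in this sketch.
func endsInString(t reflect.Type) bool {
	switch t.Kind() {
	case reflect.String:
		return true
	case reflect.Slice:
		return endsInString(t.Elem())
	case reflect.Struct:
		for i := 0; i < t.NumField(); i++ {
			if !endsInString(t.Field(i).Type) {
				return false
			}
		}
		return true
	default:
		// Interfaces, ints, maps, and so on are not supported end values.
		return false
	}
}

// validTarget is a hypothetical check that o is a pointer to a supported type.
func validTarget(o any) bool {
	if o == nil {
		return false
	}
	t := reflect.TypeOf(o)
	if t.Kind() != reflect.Pointer {
		return false
	}
	return endsInString(t.Elem())
}

func main() {
	var s string
	fmt.Println(validTarget(&s))           // pointer to string
	fmt.Println(validTarget(&[]person{}))  // pointer to slice of structs
	fmt.Println(validTarget(&badPerson{})) // contains []int
	fmt.Println(validTarget(person{}))     // not a pointer
}
```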

selector is a jQuery-like selector that specifies a path to the nodes; it is passed to goquery.Selection.Find. If selector is empty, the document's root selection (goquery.Document.Selection) is used by default.

extract is a value that specifies how to get useful data from the node. It is required only if o is a pointer to a string or slice; in all other cases you can leave it empty.
