crawler

package
v0.0.0-...-7179273
Warning

This package is not in the latest version of its module.

Published: Aug 3, 2025 License: Apache-2.0 Imports: 8 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Crawler

type Crawler struct {
	Charset string
	// contains filtered or unexported fields
}

Crawler can fetch the target HTML page

func NewCrawler

func NewCrawler(config goose.Configuration) Crawler

NewCrawler returns a Crawler initialised with the given configuration. The URL and the optional raw HTML body are passed to Crawl.

func (Crawler) Crawl

func (c Crawler) Crawl(RawHTML string, url string) (*goose.Article, error)

Crawl takes the raw HTML body (RawHTML) and the page URL, fetches the page if needed (see Preprocess), and returns an Article

func (Crawler) GetCharset

func (c Crawler) GetCharset(document *goquery.Document) string

GetCharset returns a normalised charset string extracted from the meta tags
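
As an illustration of what extracting and normalising a charset from meta tags can involve, here is a minimal standard-library sketch. It is not this package's implementation: `charsetFromMeta` and `metaCharsetRe` are hypothetical names, the sketch works on a raw HTML string rather than a `*goquery.Document`, and "normalised" is assumed to mean trimmed and lowercased.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// metaCharsetRe matches charset=VALUE inside a <meta> tag. It covers both
// <meta charset="..."> and
// <meta http-equiv="Content-Type" content="...; charset=...">.
// Hypothetical: the real package queries a parsed document instead.
var metaCharsetRe = regexp.MustCompile(`(?i)<meta[^>]*charset\s*=\s*["']?\s*([a-z0-9_:.-]+)`)

// charsetFromMeta returns the lowercased charset declared in a meta tag,
// or an empty string when no declaration is found.
func charsetFromMeta(html string) string {
	m := metaCharsetRe.FindStringSubmatch(html)
	if m == nil {
		return ""
	}
	return strings.ToLower(strings.TrimSpace(m[1]))
}

func main() {
	fmt.Println(charsetFromMeta(`<meta charset="UTF-8">`))
	fmt.Println(charsetFromMeta(`<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">`))
}
```

A regular expression is enough for a sketch, but a parsed document (as the real signature suggests) is more robust against unusual attribute ordering and quoting.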

func (Crawler) GetContentType

func (c Crawler) GetContentType(document *goquery.Document) string

GetContentType returns the Content-Type string extracted from the meta tags
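
A Content-Type string such as the one returned here typically carries both a media type and an optional charset parameter. The sketch below shows one way to split such a value with the standard library's `mime.ParseMediaType`. The helper name `splitContentType` is hypothetical, and whether this package parses the value the same way is not stated in the documentation.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// splitContentType is a hypothetical helper that separates a Content-Type
// value into its media type and its (lowercased) charset parameter.
// The charset is empty when the value carries no charset parameter.
func splitContentType(contentType string) (string, string, error) {
	mediaType, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", "", err
	}
	return mediaType, strings.ToLower(params["charset"]), nil
}

func main() {
	mediaType, charset, err := splitContentType("text/html; charset=ISO-8859-1")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(mediaType, charset)
}
```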

func (*Crawler) Preprocess

func (c *Crawler) Preprocess(RawHTML string) (*goquery.Document, error)

Preprocess fetches the HTML page if needed, converts it to UTF-8 and applies some text normalisation to guarantee better results when extracting the content
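
To make the "convert to UTF-8, then normalise" idea concrete, the following sketch decodes ISO-8859-1 bytes into a UTF-8 Go string and collapses runs of whitespace. Both helpers (`latin1ToUTF8`, `collapseWhitespace`) are hypothetical, and the choice of whitespace collapsing as the normalisation step is an assumption: the documentation does not say which charsets or normalisations Preprocess actually handles.

```go
package main

import (
	"fmt"
	"strings"
)

// latin1ToUTF8 decodes ISO-8859-1 bytes into a UTF-8 string. Every
// ISO-8859-1 byte maps directly to the Unicode code point of the same value.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

// collapseWhitespace replaces each run of whitespace with a single space
// and trims leading and trailing whitespace. This is an assumed example
// of "text normalisation", not the package's documented behaviour.
func collapseWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	raw := []byte{'c', 'a', 'f', 0xE9, ' ', '\n', '\t', 'o', 'k'} // "café \n\tok" in ISO-8859-1
	fmt.Println(collapseWhitespace(latin1ToUTF8(raw)))
}
```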

func (*Crawler) SetCharset

func (c *Crawler) SetCharset(cs string)

SetCharset can be used to force a charset (e.g. when read from the HTTP headers) rather than relying on the detection from the HTML meta tags
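
Note that SetCharset has a pointer receiver while NewCrawler returns a value, so the result must be stored in a variable before SetCharset can be called on it. The sketch below illustrates this with a hypothetical stand-in type, `crawlerSketch`, that mirrors only the receiver kinds from the signatures above. It assumes SetCharset stores its argument in the exported Charset field, which the documentation does not confirm.

```go
package main

import "fmt"

// crawlerSketch is a hypothetical stand-in for Crawler. It copies only the
// exported Charset field and the receiver kinds shown in the signatures.
type crawlerSketch struct {
	Charset string
}

// newCrawlerSketch returns a value, as NewCrawler does.
func newCrawlerSketch() crawlerSketch { return crawlerSketch{} }

// SetCharset has a pointer receiver, so it can modify the caller's value.
// Assumption: the real method records the charset in a similar way.
func (c *crawlerSketch) SetCharset(cs string) { c.Charset = cs }

// charsetSeen has a value receiver, like Crawl and GetCharset: it works on
// a copy, which still reflects earlier SetCharset calls on the original.
func (c crawlerSketch) charsetSeen() string { return c.Charset }

func main() {
	c := newCrawlerSketch() // c is addressable, so Go rewrites the call as (&c).SetCharset
	c.SetCharset("iso-8859-1")
	fmt.Println(c.charsetSeen())

	// newCrawlerSketch().SetCharset("utf-8") would not compile:
	// a function's return value is not addressable.
}
```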

type CrawlerShort

type CrawlerShort struct {
	Charset string
	// contains filtered or unexported fields
}

CrawlerShort can fetch the target HTML page

func NewCrawlerShort

func NewCrawlerShort(config goose.Configuration) CrawlerShort

NewCrawlerShort returns a CrawlerShort initialised with the given configuration. The URL and the optional raw HTML body are passed to Crawl.

func (CrawlerShort) Crawl

func (c CrawlerShort) Crawl(RawHTML, url string) (*goose.Article, error)

Crawl takes the raw HTML body (RawHTML) and the page URL, fetches the page if needed (see Preprocess), and returns an Article

func (CrawlerShort) GetCharset

func (c CrawlerShort) GetCharset(document *goquery.Document) string

GetCharset returns a normalised charset string extracted from the meta tags

func (CrawlerShort) GetContentType

func (c CrawlerShort) GetContentType(document *goquery.Document) string

GetContentType returns the Content-Type string extracted from the meta tags

func (*CrawlerShort) Preprocess

func (c *CrawlerShort) Preprocess(RawHTML string) (*goquery.Document, error)

Preprocess fetches the HTML page if needed, converts it to UTF-8 and applies some text normalisation to guarantee better results when extracting the content

func (*CrawlerShort) SetCharset

func (c *CrawlerShort) SetCharset(cs string)

SetCharset can be used to force a charset (e.g. when read from the HTTP headers) rather than relying on the detection from the HTML meta tags
