Documentation ¶
Index ¶
- type Crawler
- func NewCrawler(config goose.Configuration) Crawler
- func (c Crawler) Crawl(RawHTML string, url string) (*goose.Article, error)
- func (c Crawler) GetCharset(document *goquery.Document) string
- func (c Crawler) GetContentType(document *goquery.Document) string
- func (c *Crawler) Preprocess(RawHTML string) (*goquery.Document, error)
- func (c *Crawler) SetCharset(cs string)
- type CrawlerShort
- func NewCrawlerShort(config goose.Configuration) CrawlerShort
- func (c CrawlerShort) Crawl(RawHTML, url string) (*goose.Article, error)
- func (c CrawlerShort) GetCharset(document *goquery.Document) string
- func (c CrawlerShort) GetContentType(document *goquery.Document) string
- func (c *CrawlerShort) Preprocess(RawHTML string) (*goquery.Document, error)
- func (c *CrawlerShort) SetCharset(cs string)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Crawler ¶
type Crawler struct {
Charset string
// contains filtered or unexported fields
}
Crawler can fetch the target HTML page
func NewCrawler ¶
func NewCrawler(config goose.Configuration) Crawler
NewCrawler returns a Crawler initialised with the given configuration; the URL and the (optional) raw HTML body are passed to Crawl
func (Crawler) Crawl ¶
func (c Crawler) Crawl(RawHTML string, url string) (*goose.Article, error)
Crawl fetches the HTML body and returns an Article
func (Crawler) GetCharset ¶
func (c Crawler) GetCharset(document *goquery.Document) string
GetCharset returns a normalised charset string extracted from the meta tags
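To make the idea of a charset "extracted from the meta tags" concrete, here is a minimal standard-library sketch. It is not the package's implementation: `detectMetaCharset` is a hypothetical helper, the regular expression is a simplistic stand-in for real HTML parsing, and treating "normalised" as "trimmed and lower-cased" is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// metaCharsetRe matches <meta charset="..."> as well as the charset parameter
// inside <meta http-equiv="Content-Type" content="text/html; charset=...">.
// A regular expression is only a rough approximation of an HTML parser.
var metaCharsetRe = regexp.MustCompile(`(?i)<meta[^>]*charset\s*=\s*["']?([a-z0-9_:.-]+)`)

// detectMetaCharset is a hypothetical helper (not part of the package).
// It returns the declared charset in lower case, or "" if none is declared.
func detectMetaCharset(rawHTML string) string {
	m := metaCharsetRe.FindStringSubmatch(rawHTML)
	if m == nil {
		return ""
	}
	return strings.ToLower(strings.TrimSpace(m[1]))
}

func main() {
	page := `<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"></head></html>`
	fmt.Println(detectMetaCharset(page)) // prints "iso-8859-1"
}
```

The package itself takes a `*goquery.Document`, so it presumably selects the meta elements through goquery rather than matching raw text.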
func (Crawler) GetContentType ¶
func (c Crawler) GetContentType(document *goquery.Document) string
GetContentType returns the Content-Type string extracted from the meta tags
func (*Crawler) Preprocess ¶
func (c *Crawler) Preprocess(RawHTML string) (*goquery.Document, error)
Preprocess fetches the HTML page if needed, converts it to UTF-8 and applies some text normalisation to improve the results of content extraction
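The conversion-to-UTF-8 step can be illustrated for one simple case. The sketch below assumes the source encoding is ISO-8859-1, where every byte maps directly to the Unicode code point of the same value; `latin1ToUTF8` is a hypothetical helper, and which encodings the package actually handles is not stated here.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// latin1ToUTF8 is a hypothetical helper (not part of the package).
// It converts ISO-8859-1 bytes to a UTF-8 string: each Latin-1 byte value
// is also its Unicode code point, so it can be written out as a rune.
func latin1ToUTF8(b []byte) string {
	var sb strings.Builder
	sb.Grow(len(b))
	for _, c := range b {
		sb.WriteRune(rune(c))
	}
	return sb.String()
}

func main() {
	raw := []byte{'c', 'a', 'f', 0xE9} // "café" encoded as ISO-8859-1
	s := latin1ToUTF8(raw)
	fmt.Println(s, utf8.ValidString(s)) // prints "café true"
}
```

Production code would more likely rely on a general-purpose decoder such as the `golang.org/x/text` encoding packages than on a per-encoding helper like this one.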
func (*Crawler) SetCharset ¶
func (c *Crawler) SetCharset(cs string)
SetCharset forces a charset (e.g. one read from the HTTP headers) instead of relying on detection from the HTML meta tags
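As an illustration of the HTTP-header case, the following sketch pulls the charset parameter out of a `Content-Type` header value using the standard library's `mime.ParseMediaType`. `charsetFromContentType` is a hypothetical helper, and lower-casing the result is an assumption about what SetCharset expects.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType is a hypothetical helper (not part of the package).
// It returns the lower-cased charset parameter of a Content-Type header
// value, or "" if the header is malformed or declares no charset.
func charsetFromContentType(header string) string {
	_, params, err := mime.ParseMediaType(header)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	cs := charsetFromContentType("text/html; charset=ISO-8859-1")
	fmt.Println(cs) // prints "iso-8859-1"
}
```

A non-empty result could then be handed to the crawler with `SetCharset(cs)` before calling Crawl, so that the header takes precedence over whatever the page's meta tags declare.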
type CrawlerShort ¶
type CrawlerShort struct {
Charset string
// contains filtered or unexported fields
}
CrawlerShort can fetch the target HTML page
func NewCrawlerShort ¶
func NewCrawlerShort(config goose.Configuration) CrawlerShort
NewCrawlerShort returns a CrawlerShort initialised with the given configuration; the URL and the (optional) raw HTML body are passed to Crawl
func (CrawlerShort) Crawl ¶
func (c CrawlerShort) Crawl(RawHTML, url string) (*goose.Article, error)
Crawl fetches the HTML body and returns an Article
func (CrawlerShort) GetCharset ¶
func (c CrawlerShort) GetCharset(document *goquery.Document) string
GetCharset returns a normalised charset string extracted from the meta tags
func (CrawlerShort) GetContentType ¶
func (c CrawlerShort) GetContentType(document *goquery.Document) string
GetContentType returns the Content-Type string extracted from the meta tags
func (*CrawlerShort) Preprocess ¶
func (c *CrawlerShort) Preprocess(RawHTML string) (*goquery.Document, error)
Preprocess fetches the HTML page if needed, converts it to UTF-8 and applies some text normalisation to improve the results of content extraction
func (*CrawlerShort) SetCharset ¶
func (c *CrawlerShort) SetCharset(cs string)
SetCharset forces a charset (e.g. one read from the HTTP headers) instead of relying on detection from the HTML meta tags