Documentation ¶
Overview ¶
Package pagser is a simple, extensible, configurable parser that maps HTML pages to Go structs, built on goquery and struct tags. It is the parser library from scrago.
The project source code: https://github.com/foolin/pagser
Features ¶
* Simple - Uses Go struct tag syntax.
* Easy - Easy to use in a spider, crawler, or colly application.
* Extensible - Supports extension functions.
* Struct tag grammar - The grammar is simple, for example `pagser:"a->attr(href)"`.
* Nested structure - Supports nested structs for child nodes.
* Configurable - Supports configuration.
* GoQuery/Colly - Works with any goquery-based project, such as go-colly.
More info: https://github.com/foolin/pagser
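The tag grammar in the feature list pairs a selector with a function call, joined by `->`. The sketch below shows one way such a tag value could be split into its parts using only the standard library. It is an illustrative guess at the grammar, not pagser's actual parser; `tagExpr` and `parseTag` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// tagExpr is a hypothetical decomposition of a tag value such as "a->attr(href)".
type tagExpr struct {
	Selector string
	FuncName string
	Args     []string
}

// parseTag splits a tag value at funcSymbol, then splits the call into a
// function name and comma-separated arguments. This is an assumption about
// the grammar, not the library's implementation.
func parseTag(tag, funcSymbol string) tagExpr {
	selector, call, found := strings.Cut(tag, funcSymbol)
	expr := tagExpr{Selector: strings.TrimSpace(selector)}
	if !found {
		// No function named: the caller would fall back to its default.
		return expr
	}
	name, rest, hasArgs := strings.Cut(strings.TrimSpace(call), "(")
	expr.FuncName = strings.TrimSpace(name)
	if !hasArgs {
		return expr
	}
	rest = strings.TrimSpace(strings.TrimSuffix(strings.TrimSpace(rest), ")"))
	if rest == "" {
		return expr
	}
	for _, arg := range strings.Split(rest, ",") {
		expr.Args = append(expr.Args, strings.TrimSpace(arg))
	}
	return expr
}

func main() {
	fmt.Printf("%+v\n", parseTag("a->attr(href)", "->"))
	fmt.Printf("%+v\n", parseTag("h1", "->"))
}
```

Passing the function symbol as a parameter keeps the sketch compatible with a configurable separator.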
Index ¶
- type CallFunc
- type Config
- type Pagser
- func (p *Pagser) Parse(v interface{}, document string) (err error)
- func (p *Pagser) ParseDocument(v interface{}, document *goquery.Document) (err error)
- func (p *Pagser) ParseReader(v interface{}, reader io.Reader) (err error)
- func (p *Pagser) ParseSelection(v interface{}, selection *goquery.Selection) (err error)
- func (p *Pagser) RegisterFunc(name string, fn CallFunc) error
- type Tager
Examples ¶
Types ¶
type CallFunc ¶
CallFunc is the function interface for writing custom functions.
Builtin Functions ¶
- text() gets the element text and returns a string. This is the default when the struct tag names no function.
- eachText() gets the text of each element and returns []string.
- html() gets the element's inner HTML and returns a string.
- eachHtml() gets the inner HTML of each element and returns []string.
- outerHtml() gets the element's outer HTML and returns a string.
- eachOutHtml() gets the outer HTML of each element and returns []string.
- attr(name) gets the value of the named attribute and returns a string.
- eachAttr() gets the attribute value of each element and returns []string.
- attrInt(name, defaultValue) gets the value of the named attribute, converts it to an int, and returns the int.
- attrSplit(name, sep) gets the value of the named attribute, splits it by the separator, and returns []string.
- value() gets the value of the `value` attribute and returns a string; for example, <input value='xxxx' /> returns "xxxx".
- split(sep) gets the element text, splits it by the separator, and returns []string.
- eachJoin(sep) gets the text of each element, joins the pieces with the separator, and returns a string.
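As a rough illustration of how the built-in functions above appear in struct tags, the sketch below declares a model and reads its tags back with `reflect`, so it runs without the library. The `ProductPage` type, its field names, and all selectors are hypothetical; only the function names come from the list above.

```go
package main

import (
	"fmt"
	"reflect"
)

// ProductPage is a hypothetical model; the selectors are invented for illustration.
type ProductPage struct {
	Title    string   `pagser:"h1"`                            // no function: text() is the default
	Link     string   `pagser:"a->attr(href)"`                 // single attribute value
	Tags     []string `pagser:".tag->eachText()"`              // text of every match
	Body     string   `pagser:".content->html()"`              // inner HTML
	Stock    int      `pagser:".stock->attrInt(data-count, 0)"` // attribute converted to int
	Keywords string   `pagser:".keyword->eachJoin(|)"`         // texts joined with "|"
}

// tagsOf returns each field's `pagser` tag value, keyed by field name.
func tagsOf(v interface{}) map[string]string {
	out := map[string]string{}
	t := reflect.TypeOf(v)
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		out[f.Name] = f.Tag.Get("pagser")
	}
	return out
}

func main() {
	for name, tag := range tagsOf(ProductPage{}) {
		fmt.Printf("%-8s %s\n", name, tag)
	}
}
```

The field type should match the documented return type of the function in its tag, for example []string for eachText() and int for attrInt().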
Define Global Function ¶
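// MyFunc is a custom function matching the CallFunc signature.
func MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	// TODO: compute a value from node and args
	return "Hello", nil
}

// Register the function on a parser instance (RegisterFunc is a method of *Pagser)
p := pagser.New()
p.RegisterFunc("MyFunc", MyFunc)

// Use the function in a struct tag
type PageData struct {
	Text string `pagser:"h1->MyFunc()"`
}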
Define Struct Function ¶
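// Use the function in a struct tag
type PageData struct {
	Text string `pagser:"h1->MyFunc()"`
}

// MyFunc is defined as a method on the struct that uses it.
func (pd PageData) MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	// TODO: compute a value from node and args
	return "Hello", nil
}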
Define your own function interface
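To make the register-then-call flow concrete, here is a minimal sketch of a name-to-function table. It is not pagser's implementation: `Node` is a hypothetical stand-in for *goquery.Selection so the example needs no third-party imports, and `registry`, `register`, `call`, and `upper` are invented names. Rejecting duplicate names is also an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Node is a hypothetical stand-in for *goquery.Selection; here it is just
// the element's text.
type Node string

// CallFunc mirrors the shape of the custom-function signature, with Node
// substituted for the goquery type.
type CallFunc func(node Node, args ...string) (out interface{}, err error)

// registry is a hypothetical table mapping tag function names to functions.
type registry map[string]CallFunc

// register adds fn under name; this sketch rejects nil and duplicate entries.
func (r registry) register(name string, fn CallFunc) error {
	if fn == nil {
		return errors.New("nil function")
	}
	if _, exists := r[name]; exists {
		return fmt.Errorf("function %q already registered", name)
	}
	r[name] = fn
	return nil
}

// call looks up name and invokes it with the node and tag arguments.
func (r registry) call(name string, node Node, args ...string) (interface{}, error) {
	fn, ok := r[name]
	if !ok {
		return nil, fmt.Errorf("function %q not found", name)
	}
	return fn(node, args...)
}

// upper is an example custom function that upper-cases the node text.
func upper(node Node, args ...string) (interface{}, error) {
	return strings.ToUpper(string(node)), nil
}

func main() {
	r := registry{}
	if err := r.register("upper", upper); err != nil {
		fmt.Println("register:", err)
		return
	}
	out, err := r.call("upper", Node("hello"))
	fmt.Println(out, err)
}
```

Returning interface{} lets one signature serve fields of different types, at the cost of a type check when the result is assigned to the struct field.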
type Config ¶
type Config struct {
	TagerName    string // struct tag name, default is `pagser`
	FuncSymbol   string // function symbol, default is `->`
	IgnoreSymbol string // ignore symbol, default is `-`
	Debug        bool   // debug mode prints some log output, default is `false`
}
Config holds the parser configuration.
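The sketch below illustrates how the tag name and ignore symbol might be consulted when walking a struct. It uses a local copy of the Config fields so it compiles on its own. The `Article` type, the `query` tag name, the `=>` symbol, and the `taggedFields` helper are hypothetical, and skipping untagged fields is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"reflect"
)

// Config is a local copy of the documented fields so the sketch is self-contained.
type Config struct {
	TagerName    string
	FuncSymbol   string
	IgnoreSymbol string
	Debug        bool
}

// Article is a hypothetical model that uses a custom tag name, "query".
type Article struct {
	Title   string `query:"h1"`
	Summary string `query:".summary=>html()"`
	Cached  bool   `query:"-"` // marked with the ignore symbol
	Notes   string // no tag at all
}

// taggedFields returns the names of fields tagged under cfg.TagerName whose
// tag value is not the ignore symbol.
func taggedFields(v interface{}, cfg Config) []string {
	var names []string
	t := reflect.TypeOf(v)
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		tag, ok := f.Tag.Lookup(cfg.TagerName)
		if !ok || tag == cfg.IgnoreSymbol {
			continue
		}
		names = append(names, f.Name)
	}
	return names
}

func main() {
	cfg := Config{TagerName: "query", FuncSymbol: "=>", IgnoreSymbol: "-"}
	fmt.Println(taggedFields(Article{}, cfg))
}
```

A custom tag name can help when the default name would collide with tags used by another library on the same struct.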
type Pagser ¶
type Pagser struct {
// contains filtered or unexported fields
}
Pagser is the page parser.
func NewWithConfig ¶
NewWithConfig creates a parser with the given config and returns it along with an error.
Example ¶
cfg := Config{
	TagerName:    "pagser",
	FuncSymbol:   "->",
	IgnoreSymbol: "-",
	Debug:        false,
}
p, err := NewWithConfig(cfg)
if err != nil {
	log.Fatal(err)
}

// data parser model
var page ExampPage

// parse html data
err = p.Parse(&page, rawPageHtml)
if err != nil {
	log.Fatal(err)
}
func (*Pagser) Parse ¶
Parse parses an HTML string into a struct.
Example ¶
// new parser with the default config
p := New()

// data parser model
var page ExampPage

// parse html data
err := p.Parse(&page, rawPageHtml)
if err != nil {
	log.Fatal(err)
}

log.Printf("%v", page)
func (*Pagser) ParseDocument ¶
ParseDocument parses a goquery document into a struct.
func (*Pagser) ParseReader ¶ added in v0.0.3
ParseReader parses HTML read from an io.Reader into a struct.
Example ¶
resp, err := http.Get("https://raw.githubusercontent.com/foolin/pagser/master/_examples/pages/demo.html")
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()

// new parser with the default config
p := New()

// data parser model
var page ExampPage

// parse html data from the response body
err = p.ParseReader(&page, resp.Body)
if err != nil {
	log.Fatal(err)
}

log.Printf("%v", page)
func (*Pagser) ParseSelection ¶
ParseSelection parses a goquery selection into a struct.
