pagser

package module
v0.0.7
Published: Apr 22, 2020 License: MIT Imports: 11 Imported by: 12

README

Pagser

The name Pagser is inspired by "page parser".

Pagser is a simple, extensible, configurable library that parses HTML into Go structs using goquery and struct tags. It is the parser library from scrago.

Features

  • Simple - Uses Go struct tag syntax.
  • Easy - Easy to use in your spider/crawler/colly application.
  • Extensible - Supports extension functions.
  • Struct tag grammar - The grammar is simple, for example `pagser:"a->attr(href)"`.
  • Nested structure - Supports nested structs for nested nodes.
  • Configurable - Supports custom configuration.
  • GoQuery/Colly - Works with any goquery-based project, such as go-colly.

Install

go get -u github.com/foolin/pagser

Docs

See the Pagser documentation: https://pkg.go.dev/github.com/foolin/pagser

Usage


package main

import (
	"encoding/json"
	"github.com/foolin/pagser"
	"log"
)

const rawPageHtml = `
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Pagser Title</title>
</head>

<body>
	<h1>H1 Pagser Example</h1>
	<div class="navlink">
		<div class="container">
			<ul class="clearfix">
				<li id=''><a href="/">Index</a></li>
				<li id='2'><a href="/list/web" title="web site">Web page</a></li>
				<li id='3'><a href="/list/pc" title="pc page">Pc Page</a></li>
				<li id='4'><a href="/list/mobile" title="mobile page">Mobile Page</a></li>
			</ul>
		</div>
	</div>
</body>
</html>
`

type PageData struct {
	Title string `pagser:"title"`
	H1    string `pagser:"h1"`
	Navs  []struct {
		ID   int    `pagser:"->attrInt(id, -1)"`
		Name string `pagser:"a"`
		Url  string `pagser:"a->attr(href)"`
	} `pagser:".navlink li"`
}

func main() {
	//New default config
	p := pagser.New()

	//data parser model
	var data PageData
	//parse html data
	err := p.Parse(&data, rawPageHtml)
	//check error
	if err != nil {
		log.Fatal(err)
	}

	//print data
	log.Printf("Page data json: \n-------------\n%v\n-------------\n", toJson(data))
}

func toJson(v interface{}) string {
	data, _ := json.MarshalIndent(v, "", "\t")
	return string(data)
}


Run output:


Page data json: 
-------------
{
	"Title": "Pagser Title",
	"H1": "H1 Pagser Example",
	"Navs": [
		{
			"ID": -1,
			"Name": "Index",
			"Url": "/"
		},
		{
			"ID": 2,
			"Name": "Web page",
			"Url": "/list/web"
		},
		{
			"ID": 3,
			"Name": "Pc Page",
			"Url": "/list/pc"
		},
		{
			"ID": 4,
			"Name": "Mobile Page",
			"Url": "/list/mobile"
		}
	]
}
-------------

Configuration


type Config struct {
	TagerName    string //struct tag name, default is `pagser`
	FuncSymbol   string //Function symbol, default is `->`
	IgnoreSymbol string //Ignore symbol, default is `-`
	Debug        bool   //Debug mode, debug will print some log, default is `false`
}
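
As a rough illustration of what TagerName and IgnoreSymbol control, the sketch below reads struct tags under a configurable tag name with the standard reflect package and skips fields marked with the ignore symbol. It does not use pagser itself; collectTags and the Article struct are hypothetical names chosen for this example, and pagser's real handling may differ.

```go
package main

import (
	"fmt"
	"reflect"
)

// Article is a hypothetical model used only for this sketch.
type Article struct {
	Title   string `pagser:"h1"`
	Link    string `pagser:"a->attr(href)"`
	Skipped string `pagser:"-"`
	NoTag   string
}

// collectTags returns field name -> tag value for every field that carries
// a non-empty tag under tagName and is not marked with ignoreSymbol.
func collectTags(v interface{}, tagName, ignoreSymbol string) map[string]string {
	out := make(map[string]string)
	t := reflect.TypeOf(v)
	if t == nil {
		return out
	}
	if t.Kind() == reflect.Ptr {
		t = t.Elem()
	}
	if t.Kind() != reflect.Struct {
		return out
	}
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		tag := f.Tag.Get(tagName)
		if tag == "" || tag == ignoreSymbol {
			continue // untagged or explicitly ignored field
		}
		out[f.Name] = tag
	}
	return out
}

func main() {
	tags := collectTags(Article{}, "pagser", "-")
	fmt.Println(len(tags), tags["Title"], tags["Link"])
}
```

Passing a different tag name (for example "query") to collectTags would make the same struct yield no tags, which is the effect a non-default TagerName would be expected to have.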

Struct tag grammar

[goquery selector]->[function]

Example:


type ExamData struct {
	Href string `pagser:".navLink li a->attr(href)"`
}

1. Struct tag name: pagser
2. goquery selector: .navLink li a
3. Function symbol: ->
4. Function name: attr
5. Function arguments: href
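
The breakdown above can be sketched as a small tag splitter. This is an illustrative parser written with the standard library only, not pagser's own implementation; tagInfo and parseTag are hypothetical names, and edge cases such as commas inside arguments are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// tagInfo holds the parts of a tag value; the type is hypothetical.
type tagInfo struct {
	Selector string
	FuncName string
	Args     []string
}

// parseTag splits "selector->func(arg1, arg2)" on funcSymbol.
// A tag without funcSymbol is treated as a bare selector.
func parseTag(tag, funcSymbol string) tagInfo {
	info := tagInfo{}
	parts := strings.SplitN(tag, funcSymbol, 2)
	info.Selector = strings.TrimSpace(parts[0])
	if len(parts) < 2 {
		return info
	}
	call := strings.TrimSpace(parts[1])
	open := strings.Index(call, "(")
	if open < 0 || !strings.HasSuffix(call, ")") {
		info.FuncName = call // function written without parentheses
		return info
	}
	info.FuncName = strings.TrimSpace(call[:open])
	rawArgs := strings.TrimSpace(call[open+1 : len(call)-1])
	if rawArgs == "" {
		return info
	}
	for _, a := range strings.Split(rawArgs, ",") {
		info.Args = append(info.Args, strings.TrimSpace(a))
	}
	return info
}

func main() {
	fmt.Printf("%+v\n", parseTag(".navLink li a->attr(href)", "->"))
	fmt.Printf("%+v\n", parseTag("->attrInt(id, -1)", "->"))
	fmt.Printf("%+v\n", parseTag("h1", "->"))
}
```

In this sketch an empty selector (as in "->attrInt(id, -1)") means the function applies to the current node, and a tag with no function symbol leaves FuncName empty so a caller could fall back to a default such as text().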

Functions

Builtin functions
  • text() gets the element text and returns a string. It is the default when the struct tag names no function.
  • eachText() gets the text of each element and returns []string.
  • html() gets the element's inner HTML and returns a string.
  • eachHtml() gets the inner HTML of each element and returns []string.
  • outHtml() gets the element's outer HTML and returns a string.
  • eachOutHtml() gets the outer HTML of each element and returns []string.
  • attr(name) gets the value of the named attribute and returns a string.
  • eachAttr() gets the attribute value of each element and returns []string.
  • attrInt(name, defaultValue) gets the value of the named attribute, converts it to int, and returns an int.
  • attrSplit(name, sep) gets the value of the named attribute and splits it by the separator into []string.
  • value() gets the value of the element's `value` attribute and returns a string; for example, an element with value="xxx" returns "xxx".
  • split(sep) gets the element text, splits it by the separator, and returns []string.
  • eachJoin(sep) gets the text of each element, joins the texts with the separator, and returns a string.
  • ...

For more builtin functions, see the docs: https://pkg.go.dev/github.com/foolin/pagser
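
The usage example earlier shows attrInt(id, -1) yielding -1 for the item whose id attribute is empty. A minimal sketch of that fallback behaviour, assuming the default is returned whenever the conversion fails, might look like the following. attrIntOrDefault is a hypothetical helper built on strconv, not the library's code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// attrIntOrDefault converts raw to an int and returns def when raw is empty
// or not a valid integer. Hypothetical helper for illustration only.
func attrIntOrDefault(raw string, def int) int {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return def
	}
	return n
}

func main() {
	// The id values of the four <li> elements in the usage example.
	for _, id := range []string{"", "2", "3", "4"} {
		fmt.Println(attrIntOrDefault(id, -1))
	}
}
```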

Extension functions
  • Markdown() converts HTML to Markdown.
  • UgcHtml() sanitizes HTML.

Extension functions must be registered before use, for example:

import "github.com/foolin/pagser/extensions/markdown"

p := pagser.New()

//Register Markdown
markdown.Register(p)

Write your own function

Function interface


type CallFunc func(node *goquery.Selection, args ...string) (out interface{}, err error)

1. Write a global function:


// MyGlobalFunc is a global function. Register it with
// p.RegisterFunc("MyGlob", MyGlobalFunc) before parsing.
func MyGlobalFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	return "Global-" + node.Text(), nil
}

type PageData struct{
  MyGlobalValue string    `pagser:"->MyGlob()"`
}

func main(){

    p := pagser.New()

    //Register global function `MyGlob`
    p.RegisterFunc("MyGlob", MyGlobalFunc)

    //Todo

    //data parser model
    var data PageData
    //parse html data
    err := p.Parse(&data, rawPageHtml)

    //...
}

2. Write a struct function:


type PageData struct{
  MyFuncValue string    `pagser:"->MyFunc()"`
}

// MyFunc is a struct method; it is called automatically and needs no registration.
func (d PageData) MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	return "Struct-" + node.Text(), nil
}


func main(){

    p := pagser.New()

    //Todo

    //data parser model
    var data PageData
    //parse html data
    err := p.Parse(&data, rawPageHtml)

    //...
}

Lookup function priority order

struct method -> parent method -> ... -> global
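
The priority order above can be sketched with reflection: look for a method on the innermost struct, then on its parents, and only then in the global registry. This is an illustration under simplifying assumptions, not pagser's implementation. callFunc takes the node text instead of a *goquery.Selection so the example needs no dependencies, and lookup and Page are hypothetical names.

```go
package main

import (
	"fmt"
	"reflect"
)

// callFunc is a simplified stand-in for pagser.CallFunc.
type callFunc func(text string, args ...string) (interface{}, error)

// Page is a hypothetical model with its own MyFunc method.
type Page struct{}

// MyFunc shadows any global function registered under the same name.
func (Page) MyFunc(text string, args ...string) (interface{}, error) {
	return "Struct-" + text, nil
}

// lookup tries each owner in order (innermost struct first, then parents)
// and falls back to the global registry.
func lookup(name string, globals map[string]callFunc, owners ...interface{}) (callFunc, bool) {
	for _, owner := range owners {
		v := reflect.ValueOf(owner)
		if !v.IsValid() {
			continue // nil owner
		}
		m := v.MethodByName(name)
		if !m.IsValid() {
			continue // no such method on this owner
		}
		if fn, ok := m.Interface().(func(string, ...string) (interface{}, error)); ok {
			return fn, true
		}
	}
	fn, ok := globals[name]
	return fn, ok
}

func main() {
	global := func(text string, args ...string) (interface{}, error) {
		return "Global-" + text, nil
	}
	globals := map[string]callFunc{"MyFunc": global, "MyGlob": global}

	for _, name := range []string{"MyFunc", "MyGlob"} {
		if fn, ok := lookup(name, globals, Page{}); ok {
			out, _ := fn("x")
			fmt.Println(name, "->", out)
		}
	}
}
```

Here MyFunc resolves to the struct method even though a global of the same name exists, while MyGlob is only found in the registry.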

See the advanced example: https://github.com/foolin/pagser/tree/master/_examples/advance

Colly Example

Work with colly:


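p := pagser.New()

// collector is assumed to be an existing *colly.Collector.
// For every element matching "body", parse its DOM selection into PageData.
collector.OnHTML("body", func(e *colly.HTMLElement) {
	//data parser model
	var data PageData
	//parse html data
	if err := p.ParseSelection(&data, e.DOM); err != nil {
		log.Println(err)
		return
	}
})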

Dependencies

  • github.com/PuerkitoBio/goquery
  • github.com/spf13/cast

Extensions:

  • github.com/mattn/godown
  • github.com/microcosm-cc/bluemonday

Documentation

Overview

Package pagser is a simple, extensible, configurable library that parses HTML into Go structs using goquery and struct tags. It is the parser library from scrago.

The project source code: https://github.com/foolin/pagser

Types

type BuiltinFunctions added in v0.0.7

type BuiltinFunctions struct {
}

Builtin functions are registered with a lowercase initial, eg: Text -> text()
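
One way to picture this naming rule is to enumerate a type's exported methods with reflection and lowercase the first letter of each name. The sketch below does only that; builtins, lowerFirst, and registeredNames are hypothetical names, and the way pagser actually registers its builtins may differ.

```go
package main

import (
	"fmt"
	"reflect"
	"unicode"
	"unicode/utf8"
)

// builtins is a hypothetical stand-in for a type whose exported methods
// become tag functions.
type builtins struct{}

func (builtins) Text() string     { return "text" }
func (builtins) EachText() string { return "each text" }
func (builtins) OutHtml() string  { return "outer html" }

// lowerFirst lowercases only the first rune of s.
func lowerFirst(s string) string {
	r, size := utf8.DecodeRuneInString(s)
	if r == utf8.RuneError {
		return s // empty or invalid input: leave unchanged
	}
	return string(unicode.ToLower(r)) + s[size:]
}

// registeredNames derives a tag-function name from each exported method of v.
func registeredNames(v interface{}) []string {
	t := reflect.TypeOf(v)
	names := make([]string, 0, t.NumMethod())
	for i := 0; i < t.NumMethod(); i++ {
		names = append(names, lowerFirst(t.Method(i).Name))
	}
	return names
}

func main() {
	fmt.Println(registeredNames(builtins{}))
}
```

Under this rule the method OutHtml maps to the tag function outHtml(), and EachText maps to eachText().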

func (BuiltinFunctions) Attr added in v0.0.7

func (builtin BuiltinFunctions) Attr(node *goquery.Selection, args ...string) (out interface{}, err error)

attr(name) gets the value of the named attribute and returns a string.

func (BuiltinFunctions) AttrInt added in v0.0.7

func (builtin BuiltinFunctions) AttrInt(node *goquery.Selection, args ...string) (out interface{}, err error)

attrInt(name, defaultValue) gets the value of the named attribute, converts it to int, and returns an int.

func (BuiltinFunctions) AttrSplit added in v0.0.7

func (builtin BuiltinFunctions) AttrSplit(node *goquery.Selection, args ...string) (out interface{}, err error)

attrSplit(name, sep) gets the value of the named attribute and splits it by the separator into []string.

func (BuiltinFunctions) EachAttr added in v0.0.7

func (builtin BuiltinFunctions) EachAttr(node *goquery.Selection, args ...string) (out interface{}, err error)

eachAttr() gets the attribute value of each element and returns []string.

func (BuiltinFunctions) EachHtml added in v0.0.7

func (builtin BuiltinFunctions) EachHtml(node *goquery.Selection, args ...string) (out interface{}, err error)

eachHtml() gets the inner HTML of each element and returns []string.

func (BuiltinFunctions) EachJoin added in v0.0.7

func (builtin BuiltinFunctions) EachJoin(node *goquery.Selection, args ...string) (out interface{}, err error)

eachJoin(sep) gets the text of each element, joins the texts with the separator, and returns a string.

func (BuiltinFunctions) EachOutHtml added in v0.0.7

func (builtin BuiltinFunctions) EachOutHtml(node *goquery.Selection, args ...string) (out interface{}, err error)

eachOutHtml() gets the outer HTML of each element and returns []string.

func (BuiltinFunctions) EachText added in v0.0.7

func (builtin BuiltinFunctions) EachText(node *goquery.Selection, args ...string) (out interface{}, err error)

eachText() gets the text of each element and returns []string.

func (BuiltinFunctions) Eq added in v0.0.7

func (builtin BuiltinFunctions) Eq(node *goquery.Selection, args ...string) (out interface{}, err error)

eq(index) reduces the set of matched elements to the one at the specified index and returns a string.

func (BuiltinFunctions) EqAndAttr added in v0.0.7

func (builtin BuiltinFunctions) EqAndAttr(node *goquery.Selection, args ...string) (out interface{}, err error)

eqAndAttr(index, name) reduces the set of matched elements to the one at the specified index and returns the result of attr(name) as a string.

func (BuiltinFunctions) EqAndHtml added in v0.0.7

func (builtin BuiltinFunctions) EqAndHtml(node *goquery.Selection, args ...string) (out interface{}, err error)

eqAndHtml(index) reduces the set of matched elements to the one at the specified index and returns the result of html() as a string.

func (BuiltinFunctions) EqAndOutHtml added in v0.0.7

func (builtin BuiltinFunctions) EqAndOutHtml(node *goquery.Selection, args ...string) (out interface{}, err error)

eqAndOutHtml(index) reduces the set of matched elements to the one at the specified index and returns the result of outHtml() as a string.

func (BuiltinFunctions) Html added in v0.0.7

func (builtin BuiltinFunctions) Html(node *goquery.Selection, args ...string) (out interface{}, err error)

html() gets the element's inner HTML and returns a string.

func (BuiltinFunctions) OutHtml added in v0.0.7

func (builtin BuiltinFunctions) OutHtml(node *goquery.Selection, args ...string) (out interface{}, err error)

outHtml() gets the element's outer HTML and returns a string.

func (BuiltinFunctions) Split added in v0.0.7

func (builtin BuiltinFunctions) Split(node *goquery.Selection, args ...string) (out interface{}, err error)

split(sep) gets the element text, splits it by the separator, and returns []string.

func (BuiltinFunctions) Text added in v0.0.7

func (builtin BuiltinFunctions) Text(node *goquery.Selection, args ...string) (out interface{}, err error)

text() gets the element text and returns a string. It is the default when the struct tag names no function.

func (BuiltinFunctions) Value added in v0.0.7

func (builtin BuiltinFunctions) Value(node *goquery.Selection, args ...string) (out interface{}, err error)

value() gets the value of the element's `value` attribute and returns a string.

type CallFunc

type CallFunc func(node *goquery.Selection, args ...string) (out interface{}, err error)

Define Global Function

func MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	//Todo
	return "Hello", nil
}

//Register function on a Pagser instance
p := pagser.New()
p.RegisterFunc("MyFunc", MyFunc)

//Use function
type PageData struct{
     Text string `pagser:"h1->MyFunc()"`
}

Define Struct Function

//Use function
type PageData struct{
     Text string `pagser:"h1->MyFunc()"`
}

func (pd PageData) MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	//Todo
	return "Hello", nil
}

Define your own function interface

type Config

type Config struct {
	TagerName    string //struct tag name, default is `pagser`
	FuncSymbol   string //Function symbol, default is `->`
	IgnoreSymbol string //Ignore symbol, default is `-`
	Debug        bool   //Debug mode, debug will print some log, default is `false`
}

Config holds the parser configuration.

func DefaultConfig

func DefaultConfig() Config

DefaultConfig returns the default config.

type Pagser

type Pagser struct {
	// contains filtered or unexported fields
}

Pagser is the page parser.

func New

func New() *Pagser

New creates a Pagser with the default config.

func NewWithConfig

func NewWithConfig(cfg Config) (*Pagser, error)

NewWithConfig creates a Pagser with the given config and returns it together with an error.

Example
cfg := Config{
	TagerName:    "pagser",
	FuncSymbol:   "->",
	IgnoreSymbol: "-",
	Debug:        false,
}
p, err := NewWithConfig(cfg)
if err != nil {
	log.Fatal(err)
}

//data parser model
var page ExampPage
//parse html data
err = p.Parse(&page, rawPageHtml)
//check error
if err != nil {
	log.Fatal(err)
}

func (*Pagser) Parse

func (p *Pagser) Parse(v interface{}, document string) (err error)

Parse parses an HTML string into a struct.

Example
//New default config
p := New()
//data parser model
var page ExampPage
//parse html data
err := p.Parse(&page, rawPageHtml)
//check error
if err != nil {
	log.Fatal(err)
}

log.Printf("%v", page)

func (*Pagser) ParseDocument

func (p *Pagser) ParseDocument(v interface{}, document *goquery.Document) (err error)

ParseDocument parses a goquery document into a struct.

func (*Pagser) ParseReader added in v0.0.3

func (p *Pagser) ParseReader(v interface{}, reader io.Reader) (err error)

ParseReader parses HTML read from an io.Reader into a struct.

Example
resp, err := http.Get("https://raw.githubusercontent.com/foolin/pagser/master/_examples/pages/demo.html")
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()

//New default config
p := New()
//data parser model
var page ExampPage
//parse html data
err = p.ParseReader(&page, resp.Body)
//check error
if err != nil {
	panic(err)
}

log.Printf("%v", page)

func (*Pagser) ParseSelection

func (p *Pagser) ParseSelection(v interface{}, selection *goquery.Selection) (err error)

func (*Pagser) RegisterFunc

func (p *Pagser) RegisterFunc(name string, fn CallFunc) error

RegisterFunc registers a function for use in struct tags during parsing.

Example
p := New()
p.RegisterFunc("MyFunc", MyFunc)

type Tager

type Tager struct {
	Selector   string
	FuncName   string
	FuncParams []string
}

Tager holds the parsed struct tag info.

Directories

Path                Synopsis
_examples/advance   command
_examples/basic     command
_examples/http      command
extensions
