pagser

Version: v0.1.0

This package is not in the latest version of its module.

Published: Apr 25, 2020 License: MIT Imports: 10 Imported by: 12

README

Pagser

Pagser is inspired by "page parser".

Pagser is a simple, easy, extensible, and configurable parser that maps HTML to Go structs, based on goquery and struct tags. It is the parser library from scrago.

Install

go get -u github.com/foolin/pagser

Features

  • Simple - Uses Go struct tag syntax.
  • Easy - Easy to use in your spider/crawler/colly application.
  • Extensible - Supports extension functions.
  • Struct tag grammar - The grammar is simple, like `pagser:"a->attr(href)"`.
  • Nested structure - Supports nested structs for nodes.
  • Configurable - Supports configuration.
  • Implicit type conversion - String results are automatically converted to the field type (int, int64, float64, ...).
  • GoQuery/Colly - Works with any goquery-based project, such as go-colly.

Docs

See the Pagser package documentation: https://pkg.go.dev/github.com/foolin/pagser

Usage


package main

import (
	"encoding/json"
	"github.com/foolin/pagser"
	"log"
)

const rawPageHtml = `
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Pagser Title</title>
	<meta name="keywords" content="golang,pagser,goquery,html,page,parser,colly">
</head>

<body>
	<h1>H1 Pagser Example</h1>
	<div class="navlink">
		<div class="container">
			<ul class="clearfix">
				<li id=''><a href="/">Index</a></li>
				<li id='2'><a href="/list/web" title="web site">Web page</a></li>
				<li id='3'><a href="/list/pc" title="pc page">Pc Page</a></li>
				<li id='4'><a href="/list/mobile" title="mobile page">Mobile Page</a></li>
			</ul>
		</div>
	</div>
</body>
</html>
`

type PageData struct {
	Title    string   `pagser:"title"`
	Keywords []string `pagser:"meta[name='keywords']->attrSplit(content)"`
	H1       string   `pagser:"h1"`
	Navs     []struct {
		ID   int    `pagser:"->attrEmpty(id, -1)"`
		Name string `pagser:"a->text()"`
		Url  string `pagser:"a->attr(href)"`
	} `pagser:".navlink li"`
}

func main() {
	//New default config
	p := pagser.New()

	//data parser model
	var data PageData
	//parse html data
	err := p.Parse(&data, rawPageHtml)
	//check error
	if err != nil {
		log.Fatal(err)
	}

	//print data
	log.Printf("Page data json: \n-------------\n%v\n-------------\n", toJson(data))
}

func toJson(v interface{}) string {
	data, _ := json.MarshalIndent(v, "", "\t")
	return string(data)
}

Run output:


Page data json: 
-------------
{
	"Title": "Pagser Title",
	"Keywords": [
		"golang",
		"pagser",
		"goquery",
		"html",
		"page",
		"parser",
		"colly"
	],
	"H1": "H1 Pagser Example",
	"Navs": [
		{
			"ID": -1,
			"Name": "Index",
			"Url": "/"
		},
		{
			"ID": 2,
			"Name": "Web page",
			"Url": "/list/web"
		},
		{
			"ID": 3,
			"Name": "Pc Page",
			"Url": "/list/pc"
		},
		{
			"ID": 4,
			"Name": "Mobile Page",
			"Url": "/list/mobile"
		}
	]
}
-------------

Configuration


type Config struct {
	TagName    string //struct tag name, default is `pagser`
	FuncSymbol string //Function symbol, default is `->`
	CastError  bool   //Returns an error when the type cannot be converted, default is `false`
	Debug      bool   //Debug mode, debug will print some log, default is `false`
}
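
To make the role of the tag name concrete, here is a minimal sketch of how a configurable tag name can be read from a struct using only the standard library. The `page` type and the `tagsOf` helper are hypothetical names for this illustration; this is not pagser's own code, only the general reflection technique that a struct-tag parser relies on.

```go
package main

import (
	"fmt"
	"reflect"
)

// page is a hypothetical model; Skip has no tag and is ignored.
type page struct {
	Title string `pagser:"title"`
	Link  string `pagser:"a->attr(href)"`
	Skip  string
}

// tagsOf returns field name -> tag value for every field of the struct v
// that carries the given tag name.
func tagsOf(v interface{}, tagName string) map[string]string {
	out := make(map[string]string)
	t := reflect.TypeOf(v)
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if tag, ok := f.Tag.Lookup(tagName); ok {
			out[f.Name] = tag
		}
	}
	return out
}

func main() {
	tags := tagsOf(page{}, "pagser")
	fmt.Println(tags["Title"])               // title
	fmt.Println(tags["Link"])                // a->attr(href)
	fmt.Println(len(tagsOf(page{}, "json"))) // 0
}
```

Looking up a different tag name, such as `json` above, finds nothing, which is why changing the tag name in the configuration means the struct tags have to use the same name.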

Struct Tag Grammar

[goquery selector]->[function]

Example:


type ExamData struct {
	Href string `pagser:".navLink li a->attr(href)"`
}

1. Struct tag name: pagser
2. goquery selector: .navLink li a
3. Function symbol: ->
4. Function name: attr
5. Function arguments: href
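
The breakdown above can be sketched as a small splitter. This is a simplified illustration using only the standard library, not pagser's actual parser: `tagParts` and `splitTag` are hypothetical names, and commas inside quotes and escape characters are deliberately not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// tagParts holds the pieces of a tag value. The type and its field
// names are hypothetical, chosen for this illustration only.
type tagParts struct {
	Selector string
	FuncName string
	Args     []string
}

// splitTag splits "selector->fn(arg1, arg2)" on the function symbol.
// Simplified sketch: quoted commas and escapes are not handled.
func splitTag(tag, symbol string) tagParts {
	selector, call, found := strings.Cut(tag, symbol)
	parts := tagParts{Selector: strings.TrimSpace(selector)}
	if !found {
		// No function symbol: text() is documented as the default function.
		parts.FuncName = "text"
		return parts
	}
	call = strings.TrimSpace(call)
	open := strings.Index(call, "(")
	closing := strings.LastIndex(call, ")")
	if open < 0 || closing < open {
		parts.FuncName = call
		return parts
	}
	parts.FuncName = call[:open]
	inner := strings.TrimSpace(call[open+1 : closing])
	if inner == "" {
		return parts
	}
	for _, arg := range strings.Split(inner, ",") {
		// Single quotes around an argument are optional, so drop them.
		parts.Args = append(parts.Args, strings.Trim(strings.TrimSpace(arg), "'"))
	}
	return parts
}

func main() {
	p := splitTag(".navLink li a->attr(href)", "->")
	fmt.Printf("selector=%q func=%q args=%q\n", p.Selector, p.FuncName, p.Args)
	// selector=".navLink li a" func="attr" args=["href"]
}
```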

Functions

Builtin functions
  • text() gets the element text, returns string. This is the default function when the struct tag does not name one.
  • eachText() gets the text of each element, returns []string.
  • html() gets the element's inner HTML, returns string.
  • eachHtml() gets the inner HTML of each element, returns []string.
  • outerHtml() gets the element's outer HTML, returns string.
  • eachOutHtml() gets the outer HTML of each element, returns []string.
  • attr(name) gets the element's attribute value, returns string.
  • eachAttr(name) gets the attribute value of each element, returns []string.
  • attrSplit(name, sep) gets the attribute value and splits it by the separator, returns []string.
  • value() gets the element's `value` attribute, returns string, e.g. <input name="pagser" value="xxx" /> returns "xxx".
  • split(sep) gets the element text and splits it by the separator, returns []string.
  • eachJoin(sep) gets the text of each element and joins it into one string, returns string.
  • ...

For more builtin functions, see the docs: https://pkg.go.dev/github.com/foolin/pagser

Extension functions
  • Markdown() //convert html to markdown format.
  • UgcHtml() //sanitize html

Extension functions must be registered before use, for example:

import "github.com/foolin/pagser/extensions/markdown"

p := pagser.New()

//Register Markdown
markdown.Register(p)

Custom function
Function interface

type CallFunc func(node *goquery.Selection, args ...string) (out interface{}, err error)

Define global function

// MyGlobalFunc is a global function. Register it with
// p.RegisterFunc("MyGlob", MyGlobalFunc) before parsing.
func MyGlobalFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	return "Global-" + node.Text(), nil
}

type PageData struct{
  MyGlobalValue string    `pagser:"->MyGlob()"`
}

func main(){

    p := pagser.New()

    //Register global function `MyGlob`
    p.RegisterFunc("MyGlob", MyGlobalFunc)

    //Todo

    //data parser model
    var data PageData
    //parse html data
    err := p.Parse(&data, rawPageHtml)

    //...
}

Define struct function

type PageData struct{
  MyFuncValue string    `pagser:"->MyFunc()"`
}

// Struct methods are looked up automatically; no registration is needed.
func (d PageData) MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	return "Struct-" + node.Text(), nil
}


func main(){

    p := pagser.New()

    //Todo

    //data parser model
    var data PageData
    //parse html data
    err := p.Parse(&data, rawPageHtml)

    //...
}

Call Syntax

Note: all function arguments are strings; single quotes are optional.

  1. Function call with no arguments

->fn()

  2. Function call with one argument (single quotes are optional)

->fn(one)

->fn('one')

  3. Function call with several arguments

->fn(one, two, three, ...)

->fn('one', 'two', 'three', ...)

  4. Function call with single quotes and an escape character

->fn('it\'s ok', 'two,xxx', 'three', ...)
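
The call syntax above (optional single quotes, commas inside quotes, backslash escapes) can be illustrated with a small argument splitter. This is a sketch under assumptions, not pagser's own parser: `parseArgs` is a hypothetical helper that receives only the text between the parentheses, and it trims surrounding whitespace from every argument, including quoted ones.

```go
package main

import (
	"fmt"
	"strings"
)

// parseArgs splits the text between the parentheses of a call into
// string arguments. Hypothetical helper, not pagser's own parser.
func parseArgs(s string) []string {
	var args []string
	var cur strings.Builder
	inQuote := false
	escaped := false
	for _, r := range s {
		switch {
		case escaped:
			cur.WriteRune(r) // keep the escaped character literally
			escaped = false
		case r == '\\':
			escaped = true
		case r == '\'':
			inQuote = !inQuote // quotes delimit a value, they are not part of it
		case r == ',' && !inQuote:
			args = append(args, strings.TrimSpace(cur.String()))
			cur.Reset()
		default:
			cur.WriteRune(r)
		}
	}
	// Emit the final argument unless the input was empty.
	if last := strings.TrimSpace(cur.String()); last != "" || len(args) > 0 {
		args = append(args, last)
	}
	return args
}

func main() {
	fmt.Printf("%q\n", parseArgs(`one, two, three`))
	// ["one" "two" "three"]
	fmt.Printf("%q\n", parseArgs(`'it\'s ok', 'two,xxx', 'three'`))
	// ["it's ok" "two,xxx" "three"]
}
```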

Priority Order

Lookup function priority order:

struct method -> parent method -> ... -> global
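
One way to picture this lookup order is the following sketch. It is an illustration only and makes several assumptions: `callFunc` is a simplified stand-in for pagser's CallFunc (the goquery node is replaced by a plain string so the example needs no third-party imports), and `child`, `parent`, and `lookup` are hypothetical names.

```go
package main

import (
	"fmt"
	"reflect"
)

// callFunc is a simplified stand-in for pagser's CallFunc.
type callFunc func(text string, args ...string) (interface{}, error)

// child and parent are hypothetical nested model structs.
type child struct{}

func (child) Shout(text string, args ...string) (interface{}, error) {
	return "child:" + text, nil
}

type parent struct{}

func (parent) Shout(text string, args ...string) (interface{}, error) {
	return "parent:" + text, nil
}

func (parent) Whisper(text string, args ...string) (interface{}, error) {
	return "parent:" + text, nil
}

// lookup walks owners from the innermost struct outwards and
// falls back to the global registry when no method matches.
func lookup(name string, owners []interface{}, globals map[string]callFunc) (callFunc, bool) {
	for _, o := range owners {
		m := reflect.ValueOf(o).MethodByName(name)
		if !m.IsValid() {
			continue
		}
		if fn, ok := m.Interface().(func(string, ...string) (interface{}, error)); ok {
			return fn, true
		}
	}
	fn, ok := globals[name]
	return fn, ok
}

func main() {
	globals := map[string]callFunc{
		"Trim": func(text string, args ...string) (interface{}, error) {
			return "global:" + text, nil
		},
	}
	owners := []interface{}{child{}, parent{}} // innermost struct first
	for _, name := range []string{"Shout", "Whisper", "Trim", "Missing"} {
		fn, ok := lookup(name, owners, globals)
		if !ok {
			fmt.Println(name, "-> not found")
			continue
		}
		out, _ := fn("x")
		fmt.Println(name, "->", out)
	}
}
```

Here `Shout` resolves to the child's method even though the parent defines one too, `Whisper` falls through to the parent, and `Trim` is only found in the global registry.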

More Examples

See the advanced example: https://github.com/foolin/pagser/tree/master/_examples/advance

Implicit type conversion

String results are automatically converted to the field type (int, int64, float64, ...).

Supported types:

  • bool
  • float32
  • float64
  • int
  • int32
  • int64
  • string
  • []bool
  • []float32
  • []float64
  • []int
  • []int32
  • []int64
  • []string
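
As a rough illustration of what converting a string result to the field type involves, the sketch below handles the scalar types from the list with `strconv` and `reflect`. It is not how pagser itself does it (the Dependencies section lists github.com/spf13/cast for that purpose); `setFromString` is a hypothetical helper, and slice types are left out for brevity.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// setFromString stores raw into the value dst points to, converting it
// according to the destination kind. Hypothetical helper for illustration.
func setFromString(dst interface{}, raw string) error {
	v := reflect.ValueOf(dst)
	if v.Kind() != reflect.Ptr || v.IsNil() {
		return fmt.Errorf("dst must be a non-nil pointer")
	}
	v = v.Elem()
	switch v.Kind() {
	case reflect.String:
		v.SetString(raw)
	case reflect.Bool:
		b, err := strconv.ParseBool(raw)
		if err != nil {
			return err
		}
		v.SetBool(b)
	case reflect.Int, reflect.Int32, reflect.Int64:
		n, err := strconv.ParseInt(raw, 10, v.Type().Bits())
		if err != nil {
			return err
		}
		v.SetInt(n)
	case reflect.Float32, reflect.Float64:
		f, err := strconv.ParseFloat(raw, v.Type().Bits())
		if err != nil {
			return err
		}
		v.SetFloat(f)
	default:
		return fmt.Errorf("unsupported kind %s", v.Kind())
	}
	return nil
}

func main() {
	var id int
	var price float64
	if err := setFromString(&id, "2"); err != nil {
		fmt.Println("error:", err)
	}
	if err := setFromString(&price, "3.5"); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(id, price) // 2 3.5
}
```

A conversion can fail, for example when the text is "abc" and the field is an int, so a caller has to decide whether to report the error or keep the zero value.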

Examples

Crawl page example

package main

import (
	"encoding/json"
	"github.com/foolin/pagser"
	"log"
	"net/http"
)

type PageData struct {
	Title    string `pagser:"title"`
	RepoList []struct {
		Names       []string `pagser:"h1->split('/', true)"`
		Description string   `pagser:"h1 + p"`
		Stars       string   `pagser:"a.muted-link->eq(0)"`
		Repo        string   `pagser:"h1 a->concatAttr('href', 'https://github.com', $value, '?from=pagser')"`
	} `pagser:"article.Box-row"`
}

func main() {
	resp, err := http.Get("https://github.com/trending")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	//New default config
	p := pagser.New()

	//data parser model
	var data PageData
	//parse html data
	err = p.ParseReader(&data, resp.Body)
	//check error
	if err != nil {
		log.Fatal(err)
	}

	//print data
	log.Printf("Page data json: \n-------------\n%v\n-------------\n", toJson(data))
}

func toJson(v interface{}) string {
	data, _ := json.MarshalIndent(v, "", "\t")
	return string(data)
}


Run output:


2020/04/25 12:26:04 Page data json: 
-------------
{
	"Title": "Trending  repositories on GitHub today · GitHub",
	"RepoList": [
		{
			"Names": [
				"pcottle",
				"learnGitBranching"
			],
			"Description": "An interactive git visualization to challenge and educate!",
			"Stars": "16,010",
			"Repo": "https://github.com/pcottle/learnGitBranching?from=pagser"
		},
		{
			"Names": [
				"jackfrued",
				"Python-100-Days"
			],
			"Description": "Python - 100天从新手到大师",
			"Stars": "83,484",
			"Repo": "https://github.com/jackfrued/Python-100-Days?from=pagser"
		},
		{
			"Names": [
				"brave",
				"brave-browser"
			],
			"Description": "Next generation Brave browser for macOS, Windows, Linux, Android.",
			"Stars": "5,963",
			"Repo": "https://github.com/brave/brave-browser?from=pagser"
		},
		{
			"Names": [
				"MicrosoftDocs",
				"azure-docs"
			],
			"Description": "Open source documentation of Microsoft Azure",
			"Stars": "3,798",
			"Repo": "https://github.com/MicrosoftDocs/azure-docs?from=pagser"
		},
		{
			"Names": [
				"ahmetb",
				"kubectx"
			],
			"Description": "Faster way to switch between clusters and namespaces in kubectl",
			"Stars": "6,979",
			"Repo": "https://github.com/ahmetb/kubectx?from=pagser"
		},

        //...        

		{
			"Names": [
				"serverless",
				"serverless"
			],
			"Description": "Serverless Framework – Build web, mobile and IoT applications with serverless architectures using AWS Lambda, Azure Functions, Google CloudFunctions \u0026 more! –",
			"Stars": "35,502",
			"Repo": "https://github.com/serverless/serverless?from=pagser"
		},
		{
			"Names": [
				"vuejs",
				"vite"
			],
			"Description": "Experimental no-bundle dev server for Vue SFCs",
			"Stars": "1,573",
			"Repo": "https://github.com/vuejs/vite?from=pagser"
		}
	]
}
-------------

Colly Example

Work with colly:


p := pagser.New()


// Call the callback for every element matching "body"
collector.OnHTML("body", func(e *colly.HTMLElement) {
	//data parser model
	var data PageData
	//parse html data (colly exposes the goquery selection as e.DOM)
	err := p.ParseSelection(&data, e.DOM)
	//check error
	if err != nil {
		log.Println(err)
		return
	}
})

Dependencies

  • github.com/PuerkitoBio/goquery
  • github.com/spf13/cast

Extensions:

  • github.com/mattn/godown
  • github.com/microcosm-cc/bluemonday

Documentation

Overview

Package pagser is a simple, easy, extensible, and configurable parser that maps HTML to Go structs, based on goquery and struct tags. It is the parser library from scrago.

The project source code: https://github.com/foolin/pagser

Features

* Simple - Uses Go struct tag syntax.

* Easy - Easy to use in your spider/crawler/colly application.

* Extensible - Supports extension functions.

* Struct tag grammar - The grammar is simple, like `pagser:"a->attr(href)"`.

* Nested structure - Supports nested structs for nodes.

* Configurable - Supports configuration.

* GoQuery/Colly - Works with any goquery-based project, such as go-colly.

More info: https://github.com/foolin/pagser

Types

type BuiltinFunctions added in v0.0.7

type BuiltinFunctions struct {
}

Builtin functions are registered with a lowercase initial, eg: Text -> text()

func (BuiltinFunctions) Attr added in v0.0.7

func (builtin BuiltinFunctions) Attr(node *goquery.Selection, args ...string) (out interface{}, err error)

attr(name, defaultValue='') gets the element's attribute value, returns string.

//<a href="https://github.com/foolin/pagser">Pagser</a>
struct {
	Example string `pagser:".selector->attr(href)"`
}

func (BuiltinFunctions) AttrEmpty added in v0.1.0

func (builtin BuiltinFunctions) AttrEmpty(node *goquery.Selection, args ...string) (out interface{}, err error)

attrEmpty(name, defaultValue) gets the element's attribute value, or defaultValue if the attribute is empty; returns string.

//<a href="https://github.com/foolin/pagser">Pagser</a>
struct {
	Example string `pagser:".selector->attrEmpty(href, '#')"`
}

func (BuiltinFunctions) AttrSplit added in v0.0.7

func (builtin BuiltinFunctions) AttrSplit(node *goquery.Selection, args ...string) (out interface{}, err error)

attrSplit(name, sep=',', trim='true') gets the attribute value and splits it by the separator, returns []string.

//<meta name="keywords" content="golang,pagser,goquery">
struct {
	Examples []string `pagser:".selector->attrSplit('content', ',')"`
}

func (BuiltinFunctions) Concat added in v0.1.0

func (builtin BuiltinFunctions) Concat(node *goquery.Selection, args ...string) (out interface{}, err error)

concat(text1, $value, [ text2, ... text_n ]) joins the strings `text1, text2, ... text_n` together; `$value` is a placeholder for the element text. Returns string.

struct {
	Example string `pagser:".selector->concat('Result:', '<', $value, '>')"`
}
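
The placeholder behaviour can be sketched in plain Go. This is an illustration, not pagser's implementation: `concat` here is a hypothetical standalone function that takes the element text as its first parameter, and it assumes the pieces are joined with no separator, which matches the `Repo` values in the crawl example output in the README.

```go
package main

import (
	"fmt"
	"strings"
)

// concat joins args with no separator, substituting value for every
// "$value" placeholder. Hypothetical helper for illustration.
func concat(value string, args ...string) string {
	var b strings.Builder
	for _, a := range args {
		if a == "$value" {
			b.WriteString(value)
		} else {
			b.WriteString(a)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(concat("Pagser", "Result:", "<", "$value", ">"))
	// Result:<Pagser>
	fmt.Println(concat("/foolin/pagser", "https://github.com", "$value", "?from=pagser"))
	// https://github.com/foolin/pagser?from=pagser
}
```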

func (BuiltinFunctions) ConcatAttr added in v0.1.0

func (builtin BuiltinFunctions) ConcatAttr(node *goquery.Selection, args ...string) (out interface{}, err error)

concatAttr(name, text1, $value, [ text2, ... text_n ]) joins the strings `text1, text2, ... text_n` together; `name` is the attribute to read, and `$value` is a placeholder for that attribute's value. Returns string.

struct {
	Example string `pagser:".selector->concatAttr('href', 'Result:', '<', $value, '>')"`
}

func (BuiltinFunctions) EachAttr added in v0.0.7

func (builtin BuiltinFunctions) EachAttr(node *goquery.Selection, args ...string) (out interface{}, err error)

eachAttr(name) gets the attribute value of each element, returns []string.

//<a href="https://github.com/foolin/pagser">Pagser</a>
struct {
	Examples []string `pagser:".selector->eachAttr(href)"`
}

func (BuiltinFunctions) EachAttrEmpty added in v0.1.0

func (builtin BuiltinFunctions) EachAttrEmpty(node *goquery.Selection, args ...string) (out interface{}, err error)

eachAttrEmpty(name, defaultValue) gets the attribute value of each element, using defaultValue where it is empty; returns []string.

//<a href="https://github.com/foolin/pagser">Pagser</a>
struct {
	Examples []string `pagser:".selector->eachAttrEmpty(href, '#')"`
}

func (BuiltinFunctions) EachHtml added in v0.0.7

func (builtin BuiltinFunctions) EachHtml(node *goquery.Selection, args ...string) (out interface{}, err error)

eachHtml() gets the inner HTML of each element, returns []string.

struct {
	Examples []string `pagser:".selector->eachHtml()"`
}

func (BuiltinFunctions) EachJoin added in v0.0.7

func (builtin BuiltinFunctions) EachJoin(node *goquery.Selection, args ...string) (out interface{}, err error)

eachJoin(sep) get each element text and join to string, return string.

struct {
	Example string `pagser:".selector->eachJoin(',')"`
}

func (BuiltinFunctions) EachOutHtml added in v0.0.7

func (builtin BuiltinFunctions) EachOutHtml(node *goquery.Selection, args ...string) (out interface{}, err error)

eachOutHtml() get each element outer html, return []string.

struct {
	Examples []string `pagser:".selector->eachOutHtml()"`
}

func (BuiltinFunctions) EachText added in v0.0.7

func (builtin BuiltinFunctions) EachText(node *goquery.Selection, args ...string) (out interface{}, err error)

eachText() get each element text, return []string.

struct {
	Examples []string `pagser:".selector->eachText()"`
}

func (BuiltinFunctions) EachTextEmpty added in v0.1.0

func (builtin BuiltinFunctions) EachTextEmpty(node *goquery.Selection, args ...string) (out interface{}, err error)

eachTextEmpty(defaultValue) get each element text, return []string.

struct {
	Examples []string `pagser:".selector->eachTextEmpty('')"`
}

func (BuiltinFunctions) Eq added in v0.0.7

func (builtin BuiltinFunctions) Eq(node *goquery.Selection, args ...string) (out interface{}, err error)

eq(index) reduces the set of matched elements to the one at the specified index, return string.

struct {
	Example string `pagser:".selector->eq(0)"`
}

func (BuiltinFunctions) EqAndAttr added in v0.0.7

func (builtin BuiltinFunctions) EqAndAttr(node *goquery.Selection, args ...string) (out interface{}, err error)

eqAndAttr(index, name) reduces the set of matched elements to the one at the specified index, and attr() return string.

struct {
	Example string `pagser:".selector->eqAndAttr(0, href)"`
}

func (BuiltinFunctions) EqAndHtml added in v0.0.7

func (builtin BuiltinFunctions) EqAndHtml(node *goquery.Selection, args ...string) (out interface{}, err error)

eqAndHtml(index) reduces the set of matched elements to the one at the specified index, and html() return string.

struct {
	Example string `pagser:".selector->eqAndHtml(0)"`
}

func (BuiltinFunctions) EqAndOutHtml added in v0.0.7

func (builtin BuiltinFunctions) EqAndOutHtml(node *goquery.Selection, args ...string) (out interface{}, err error)

eqAndOutHtml(index) reduces the set of matched elements to the one at the specified index, and outHtml() return string.

struct {
	Example string `pagser:".selector->eqAndOutHtml(0)"`
}

func (BuiltinFunctions) Html added in v0.0.7

func (builtin BuiltinFunctions) Html(node *goquery.Selection, args ...string) (out interface{}, err error)

html() get element inner html, return string.

struct {
	Example string `pagser:".selector->html()"`
}

func (BuiltinFunctions) OutHtml added in v0.0.7

func (builtin BuiltinFunctions) OutHtml(node *goquery.Selection, args ...string) (out interface{}, err error)

outerHtml() get element outer html, return string.

struct {
	Example string `pagser:".selector->outerHtml()"`
}

func (BuiltinFunctions) Split added in v0.0.7

func (builtin BuiltinFunctions) Split(node *goquery.Selection, args ...string) (out interface{}, err error)

split(sep=',', trim='true') get element text and split by separator to array string, return []string.

struct {
	Examples []string `pagser:".selector->split('|')"`
}
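
What split-with-trim does to a piece of text can be shown with the standard library. This is a sketch rather than pagser's code: `split` is a hypothetical standalone function, and the input string is an assumed heading text similar to the repository names in the crawl example in the README.

```go
package main

import (
	"fmt"
	"strings"
)

// split cuts text at every sep and, when trim is true, removes the
// whitespace around each piece. Hypothetical helper for illustration.
func split(text, sep string, trim bool) []string {
	parts := strings.Split(text, sep)
	if trim {
		for i, p := range parts {
			parts[i] = strings.TrimSpace(p)
		}
	}
	return parts
}

func main() {
	fmt.Printf("%q\n", split("pcottle / learnGitBranching", "/", true))
	// ["pcottle" "learnGitBranching"]
	fmt.Printf("%q\n", split("a | b", "|", false))
	// ["a " " b"]
}
```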

func (BuiltinFunctions) Text added in v0.0.7

func (builtin BuiltinFunctions) Text(node *goquery.Selection, args ...string) (out interface{}, err error)

text() gets the element text, returns string. This is the default function when the struct tag does not name one.

struct {
	Example string `pagser:".selector->text()"`
}

func (BuiltinFunctions) TextEmpty added in v0.1.0

func (builtin BuiltinFunctions) TextEmpty(node *goquery.Selection, args ...string) (out interface{}, err error)

textEmpty(defaultValue) gets the element text, or defaultValue if the text is empty; returns string.

struct {
	Example string `pagser:".selector->textEmpty('')"`
}

func (BuiltinFunctions) Value added in v0.0.7

func (builtin BuiltinFunctions) Value(node *goquery.Selection, args ...string) (out interface{}, err error)

value() gets the element's `value` attribute, returns string.

//<input name="pagser" value="xxx" />
struct {
	Example string `pagser:".selector->value()"`
}

Output: xxx

type CallFunc

type CallFunc func(node *goquery.Selection, args ...string) (out interface{}, err error)

Define Global Function

func MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	//Todo
	return "Hello", nil
}

//Register function on a Pagser instance
p := pagser.New()
p.RegisterFunc("MyFunc", MyFunc)

//Use function
type PageData struct{
     Text string `pagser:"h1->MyFunc()"`
}

Define Struct Function

//Use function
type PageData struct{
     Text string `pagser:"h1->MyFunc()"`
}

func (pd PageData) MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	//Todo
	return "Hello", nil
}

Lookup function priority order

struct method -> parent method -> ... -> global

Implicit type conversion

String results are automatically converted to the field type (int, int64, float64, ...).

CallFunc defines the function interface.

type Config

type Config struct {
	TagName    string //struct tag name, default is `pagser`
	FuncSymbol string //Function symbol, default is `->`
	CastError  bool   //Returns an error when the type cannot be converted, default is `false`
	Debug      bool   //Debug mode, debug will print some log, default is `false`
}

Config is the pagser configuration.

func DefaultConfig

func DefaultConfig() Config

DefaultConfig returns the default Config:

Config{
	TagName:    "pagser",
	FuncSymbol: "->",
	CastError:  false,
	Debug:      false,
}

type Pagser

type Pagser struct {
	Config Config
	// contains filtered or unexported fields
}

Pagser is the page parser.

func New

func New() *Pagser

New creates a pagser client with the default config.

func NewWithConfig

func NewWithConfig(cfg Config) (*Pagser, error)

NewWithConfig creates a pagser client with the given Config, returning an error if creation fails.

Example
cfg := Config{
	TagName:    "pagser",
	FuncSymbol: "->",
	CastError:  false,
	Debug:      false,
}
p, err := NewWithConfig(cfg)
if err != nil {
	log.Fatal(err)
}

//data parser model
var page ExamplePage
//parse html data
err = p.Parse(&page, rawExampleHtml)
//check error
if err != nil {
	log.Fatal(err)
}

func (*Pagser) Parse

func (p *Pagser) Parse(v interface{}, document string) (err error)

Parse parses an HTML string into a struct.

Example
//New default Config
p := New()

//data parser model
var page ExamplePage
//parse html data
err := p.Parse(&page, rawExampleHtml)
//check error
if err != nil {
	log.Fatal(err)
}

//print result
log.Printf("%v", page)

func (*Pagser) ParseDocument

func (p *Pagser) ParseDocument(v interface{}, document *goquery.Document) (err error)

ParseDocument parses a goquery document into a struct.

Example
//New default Config
p := New()

//data parser model
var data ExamplePage
doc, err := goquery.NewDocumentFromReader(strings.NewReader(rawExampleHtml))
if err != nil {
	log.Fatal(err)
}

//parse document
err = p.ParseDocument(&data, doc)
//check error
if err != nil {
	log.Fatal(err)
}

//print result
log.Printf("%v", data)

func (*Pagser) ParseReader added in v0.0.3

func (p *Pagser) ParseReader(v interface{}, reader io.Reader) (err error)

ParseReader parses HTML read from an io.Reader into a struct.

Example
resp, err := http.Get("https://raw.githubusercontent.com/foolin/pagser/master/_examples/pages/demo.html")
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()

//New default Config
p := New()
//data parser model
var page ExamplePage
//parse html data
err = p.ParseReader(&page, resp.Body)
//check error
if err != nil {
	panic(err)
}

log.Printf("%v", page)

func (*Pagser) ParseSelection

func (p *Pagser) ParseSelection(v interface{}, selection *goquery.Selection) (err error)

ParseSelection parses a goquery selection into a struct.

Example
//New default Config
p := New()

//data parser model
var data ExamplePage
doc, err := goquery.NewDocumentFromReader(strings.NewReader(rawExampleHtml))
if err != nil {
	log.Fatal(err)
}

//parse document
err = p.ParseSelection(&data, doc.Selection)
//check error
if err != nil {
	log.Fatal(err)
}

//print result
log.Printf("%v", data)

func (*Pagser) RegisterFunc

func (p *Pagser) RegisterFunc(name string, fn CallFunc)

RegisterFunc registers a function that can be called from struct tags.

p.RegisterFunc("MyFunc", func(node *goquery.Selection, args ...string) (out interface{}, err error) {
	//Todo
	return "Hello", nil
})

Example
p := New()

p.RegisterFunc("MyFunc", func(node *goquery.Selection, args ...string) (out interface{}, err error) {
	//Todo
	return "Hello", nil
})

Directories

_examples/advance (command)
_examples/basic (command)
_examples/config (command)
_examples/http (command)
extensions
