html2article

package

Version: v1.0.2 (latest) | Published: Jul 8, 2019 | License: GPL-3.0 | Imports: 21 | Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/mlogclub/mlog

README ¶

A text-density-based html2article implementation [golang]

Install

go get -u -v github.com/sundy-li/html2article

Performance

Accuracy: >= 98%
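QPS: about 20,000/s, 0.06 ms/op
go test -bench=.
BenchmarkExtract-4 20000 66341 ns/op
Note: compared with other open-source implementations, this may be the fastest html2article implementation currently available. Our test dataset contains about 30 million articles from WeChat official accounts and the archives of major Chinese tech media, and accuracy on it currently exceeds 98%.
Apart from the necessary DOM parsing and time parsing, the implementation avoids excessive regular-expression matching for efficiency.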

Examples

See from_url.go in the examples directory.

package main

import (
	"github.com/sundy-li/html2article"
)

func main() {
	urlStr := "https://www.leiphone.com/news/201602/DsiQtR6c1jCu7iwA.html"
	ext, err := html2article.NewFromUrl(urlStr)
	if err != nil {
		panic(err)
	}
	article, err := ext.ToArticle()
	if err != nil {
		panic(err)
	}
	println("article title is =>", article.Title)
	println("article publishtime is =>", article.Publishtime) // in the UTC timezone
	println("article content is =>", article.Content)

	// convert the article into a readable form
	article.Readable(urlStr)
	println("read=>", article.ReadContent)
}

Options

	ext.SetOption(&html2article.Option{
		AccurateTitle: true,  // extract the accurate title instead of using the title tag
		RemoveNoise:   false, // remove noise nodes such as footers
	})

Algorithm

Documentation ¶

Overview ¶

COPYRIGHT https://github.com/golang/tools/blob/master/cmd/html2article/conv.go

Index ¶

Variables
func Compress(str string) string
func CompressHtml(str string) string
func NewFromHtml(htmlStr string) (ext *extractor, err error)
func NewFromNode(doc *html.Node) (ext *extractor, err error)
func NewFromReader(reader io.Reader) (ext *extractor, err error)
func NewFromUrl(urlStr string) (ext *extractor, err error)
func NewFromUrlByHttplib(urlStr string, headers ...map[string]string) (ext *extractor, err error)
type Article
type Info
- func NewInfo() *Info
- func (info *Info) CalScore(sn_sum, swn_sum float64)
type Option
type Style

Constants ¶

This section is empty.

Variables ¶

var (
	ERROR_NOTFOUND = errors.New("Content not found")
	DEFAULT_OPTION = &Option{
		RemoveNoise: true,
	}
)

Functions ¶

func Compress ¶

func Compress(str string) string

Compress compresses a string, collapsing each run of multiple space characters into a single space.
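
As a rough illustration of this kind of whitespace collapsing, here is a minimal sketch that avoids regular expressions. The helper name collapseSpaces is hypothetical and this is not the package's Compress; treating every Unicode whitespace character (not only the space) as collapsible is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// collapseSpaces is a hypothetical helper, not the package's Compress.
// It replaces every run of whitespace with a single ASCII space.
func collapseSpaces(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	inSpace := false // true while we are inside a run of whitespace
	for _, r := range s {
		if unicode.IsSpace(r) {
			if !inSpace {
				b.WriteByte(' ') // emit one space for the whole run
			}
			inSpace = true
			continue
		}
		b.WriteRune(r)
		inSpace = false
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", collapseSpaces("a  b \t\n c")) // prints "a b c"
}
```

A single pass with a strings.Builder keeps allocations low, which fits the stated goal of avoiding regular expressions where possible.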

func CompressHtml ¶

func CompressHtml(str string) string

Not used for now, because code tags are still hard to recognize.

func NewFromHtml ¶

func NewFromHtml(htmlStr string) (ext *extractor, err error)

func NewFromNode ¶

func NewFromNode(doc *html.Node) (ext *extractor, err error)

func NewFromReader ¶

func NewFromReader(reader io.Reader) (ext *extractor, err error)

func NewFromUrl ¶

func NewFromUrl(urlStr string) (ext *extractor, err error)

func NewFromUrlByHttplib ¶

func NewFromUrlByHttplib(urlStr string, headers ...map[string]string) (ext *extractor, err error)

Types ¶

type Article ¶

type Article struct {
	// Basic
	Html        string `json:"content_html"`
	Content     string `json:"content"`
	Title       string `json:"title"`
	Publishtime int64  `json:"publish_time"`

	// Others
	Images      []string `json:"images"`
	ReadContent string   `json:"read_content"`
	// contains filtered or unexported fields
}
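
Publishtime is an int64, and the README example notes that it uses the UTC timezone. The sketch below shows one way to turn such a value into a readable timestamp. It assumes the value is a Unix timestamp in seconds, which this page does not state; formatPublishTime is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// formatPublishTime assumes ts is a Unix timestamp in seconds (an assumption,
// not confirmed by the package documentation) and formats it in UTC.
func formatPublishTime(ts int64) string {
	return time.Unix(ts, 0).UTC().Format(time.RFC3339)
}

func main() {
	fmt.Println(formatPublishTime(0)) // prints 1970-01-01T00:00:00Z
}
```

If the field turns out to hold milliseconds instead, time.UnixMilli would be the call to use.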

func (*Article) GetContentNode ¶

func (a *Article) GetContentNode() *html.Node

func (*Article) Paragraphs ¶

func (a *Article) Paragraphs() []string

func (*Article) ParseImage ¶

func (a *Article) ParseImage(urlStr string)

ParseImage converts each image src to an absolute path.
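
To show what resolving an image src against the page URL can look like, here is a minimal sketch using only the standard net/url package. The helper resolveSrc and the example.com URLs are hypothetical; how ParseImage does this internally is not documented here.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveSrc is a hypothetical helper, not part of html2article.
// It resolves a possibly relative image src against the page URL.
func resolveSrc(pageURL, src string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", err
	}
	// ResolveReference applies the RFC 3986 rules for relative references.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	abs, err := resolveSrc("https://example.com/news/post.html", "../img/a.png")
	if err != nil {
		panic(err)
	}
	fmt.Println(abs) // prints https://example.com/img/a.png
}
```

The same call also handles protocol-relative sources such as //cdn.example.com/a.png, which inherit the scheme of the page URL.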

func (*Article) ParseReadContent ¶

func (a *Article) ParseReadContent()

ParseReadContent converts ReadContent into a readable form.

func (*Article) Readable ¶

func (a *Article) Readable(urlStr string)

type Info ¶

type Info struct {
	TextCount     int
	LinkTextCount int
	TagCount      int
	LinkTagCount  int
	LeafList      []int
	Density       float64
	DensitySum    float64
	Pcount        int
	InputCount    int
	ImageCount    int

	Data string
	// contains filtered or unexported fields
}
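
The counters in Info suggest a text-density score per DOM node. The sketch below shows one common way to define such a density: non-link text per non-link tag. The formula is an assumption for illustration and is not taken from the package's CalScore; nodeCounts and textDensity are hypothetical names, and the sample numbers are made up.

```go
package main

import "fmt"

// nodeCounts mirrors a few of the counters in Info; the type is hypothetical.
type nodeCounts struct {
	TextCount     int // characters of text under the node
	LinkTextCount int // characters of text inside links
	TagCount      int // tags under the node
	LinkTagCount  int // link tags under the node
}

// textDensity returns non-link text per non-link tag.
// This formula is an assumption, not the package's actual scoring.
func textDensity(c nodeCounts) float64 {
	text := c.TextCount - c.LinkTextCount
	tags := c.TagCount - c.LinkTagCount
	if tags <= 0 {
		tags = 1 // avoid dividing by zero for leaf-like nodes
	}
	return float64(text) / float64(tags)
}

func main() {
	// Illustrative counts: an article body versus a navigation bar.
	body := nodeCounts{TextCount: 1200, LinkTextCount: 40, TagCount: 14, LinkTagCount: 2}
	nav := nodeCounts{TextCount: 90, LinkTextCount: 85, TagCount: 24, LinkTagCount: 12}
	fmt.Printf("body %.2f, nav %.2f\n", textDensity(body), textDensity(nav))
}
```

Under this definition, a long article body with few tags scores far higher than a link-heavy navigation block, which is the intuition behind density-based content extraction.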

func NewInfo ¶

func NewInfo() *Info

func (*Info) CalScore ¶

func (info *Info) CalScore(sn_sum, swn_sum float64)

type Option ¶

type Option struct {
	RemoveNoise   bool // remove noise node
	AccurateTitle bool // find the accurate title node
	UserAgent     string
}

type Style ¶

type Style string

Directories ¶

Path	Synopsis
examples