html2article

Version: v1.0.2
Published: Jul 8, 2019 | License: GPL-3.0 | Imports: 21 | Imported by: 0

README

A text-density-based html2article implementation in Go.

Install

go get -u -v github.com/sundy-li/html2article

Performance

  • Accuracy: >= 98%

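  • QPS: about 20,000/s (0.06 ms/op). Output of go test -bench=.: BenchmarkExtract-4 20000 66341 ns/op

  • Note: compared with other open-source implementations, this may be the fastest html2article implementation currently available. Our test dataset is about 30 million historical articles from WeChat official accounts and major Chinese technology media, and accuracy on it currently exceeds 98%.

  • Apart from the necessary DOM parsing and time parsing, heavy use of regular-expression matching is avoided for efficiency.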
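The ns/op figure in the benchmark line can be related to a request rate with simple arithmetic. The sketch below is illustrative only: opsPerSecond is a hypothetical helper, not part of the package, and it assumes sequential calls on a single goroutine. In Go benchmark output the second column (20000 here) is the iteration count, not a rate.

```go
package main

import (
	"fmt"
	"time"
)

// opsPerSecond converts a benchmark's ns/op figure into operations per
// second, assuming calls run back to back on one goroutine.
func opsPerSecond(nsPerOp float64) float64 {
	return float64(time.Second.Nanoseconds()) / nsPerOp
}

func main() {
	const nsPerOp = 66341 // taken from the BenchmarkExtract-4 line
	fmt.Printf("%.3f ms/op, about %.0f ops/s per goroutine\n",
		nsPerOp/1e6, opsPerSecond(nsPerOp))
}
```

On this figure a single goroutine would sustain roughly 15,000 extractions per second; a higher aggregate rate would depend on running extractions concurrently.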

Examples

See from_url.go in the examples directory.

package main

import (
	"github.com/sundy-li/html2article"
)

func main() {
	urlStr := "https://www.leiphone.com/news/201602/DsiQtR6c1jCu7iwA.html"
	ext, err := html2article.NewFromUrl(urlStr)
	if err != nil {
		panic(err)
	}
	article, err := ext.ToArticle()
	if err != nil {
		panic(err)
	}
	println("article title is =>", article.Title)
	println("article publishtime is =>", article.Publishtime) // timestamp is in UTC
	println("article content is =>", article.Content)

	// Convert the article into a readable form; the result is stored in article.ReadContent
	article.Readable(urlStr)
	println("read=>", article.ReadContent)
}
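Publishtime is declared as an int64. Assuming it holds Unix seconds (the documentation does not state the unit explicitly), it could be turned into a human-readable UTC timestamp as sketched below. formatPublishTime is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"time"
)

// formatPublishTime interprets ts as Unix seconds (an assumption about the
// Publishtime field) and renders it as an RFC 3339 timestamp in UTC.
func formatPublishTime(ts int64) string {
	return time.Unix(ts, 0).UTC().Format(time.RFC3339)
}

func main() {
	fmt.Println(formatPublishTime(1562544000)) // prints 2019-07-08T00:00:00Z
}
```

If the field turns out to hold milliseconds instead, time.UnixMilli would be the matching constructor.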

Options

	ext.SetOption(&html2article.Option{
		AccurateTitle: true,  // find the accurate title node instead of using the title tag
		RemoveNoise:   false, // remove noise nodes such as footers
	})

Algorithm

Documentation

Overview

COPYRIGHT https://github.com/golang/tools/blob/master/cmd/html2article/conv.go


Variables

var (
	ERROR_NOTFOUND = errors.New("Content not found")
	DEFAULT_OPTION = &Option{
		RemoveNoise: true,
	}
)

Functions

func Compress

func Compress(str string) string

Compress compresses a string by collapsing runs of multiple space characters into a single space.
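The behavior described for Compress can be sketched as follows. This is not the package's code: collapseSpaces is a hypothetical name, and treating tabs and newlines as spaces (via unicode.IsSpace) is an assumption about what counts as a space character.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// collapseSpaces replaces every run of whitespace with a single ASCII space.
// Leading and trailing runs are collapsed too, not trimmed.
func collapseSpaces(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	inRun := false // true while we are inside a run of whitespace
	for _, r := range s {
		if unicode.IsSpace(r) {
			if !inRun {
				b.WriteByte(' ')
				inRun = true
			}
			continue
		}
		b.WriteRune(r)
		inRun = false
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", collapseSpaces("text   density \t\n extraction"))
	// prints "text density extraction"
}
```

A single pass with a strings.Builder avoids regular expressions, in keeping with the README's note about limiting regex use.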

func CompressHtml

func CompressHtml(str string) string

CompressHtml is not used for now, because code tags cannot yet be identified reliably.

func NewFromHtml

func NewFromHtml(htmlStr string) (ext *extractor, err error)

func NewFromNode

func NewFromNode(doc *html.Node) (ext *extractor, err error)

func NewFromReader

func NewFromReader(reader io.Reader) (ext *extractor, err error)

func NewFromUrl

func NewFromUrl(urlStr string) (ext *extractor, err error)

func NewFromUrlByHttplib

func NewFromUrlByHttplib(urlStr string, headers ...map[string]string) (ext *extractor, err error)

Types

type Article

type Article struct {
	// Basic
	Html        string `json:"content_html"`
	Content     string `json:"content"`
	Title       string `json:"title"`
	Publishtime int64  `json:"publish_time"`

	// Others
	Images      []string `json:"images"`
	ReadContent string   `json:"read_content"`
	// contains filtered or unexported fields
}
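The struct tags on Article determine the keys used when it is serialized with encoding/json. The sketch below uses a local stand-in type, articleFields, that copies only the exported fields and tags listed in the struct; it is not the package's type, and encodeArticle is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// articleFields mirrors the exported fields and JSON tags of Article.
// It is a local stand-in so the example runs without the package.
type articleFields struct {
	Html        string   `json:"content_html"`
	Content     string   `json:"content"`
	Title       string   `json:"title"`
	Publishtime int64    `json:"publish_time"`
	Images      []string `json:"images"`
	ReadContent string   `json:"read_content"`
}

// encodeArticle returns the JSON encoding of a as a string.
func encodeArticle(a articleFields) (string, error) {
	b, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeArticle(articleFields{
		Title:   "Example",
		Content: "body text",
		Images:  []string{},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Note that the Html field is emitted under the key content_html, and a nil Images slice would be encoded as null, not as an empty array.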

func (*Article) GetContentNode

func (a *Article) GetContentNode() *html.Node

func (*Article) Paragraphs

func (a *Article) Paragraphs() []string

func (*Article) ParseImage

func (a *Article) ParseImage(urlStr string)

ParseImage converts each image src to an absolute path.
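Resolving a relative src against the page URL can be done with the standard net/url package. The sketch below shows the general technique only; absoluteImageURL is a hypothetical helper and the URLs are placeholders, so it should not be read as how ParseImage is implemented.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteImageURL resolves src against pageURL following RFC 3986
// reference resolution, as implemented by URL.ResolveReference.
func absoluteImageURL(pageURL, src string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://example.com/news/201602/post.html"
	// Root-relative, document-relative, and protocol-relative forms.
	for _, src := range []string{"/img/a.png", "pic/b.jpg", "//cdn.example.com/c.png"} {
		abs, err := absoluteImageURL(page, src)
		if err != nil {
			panic(err)
		}
		fmt.Println(src, "=>", abs)
	}
}
```

An src that is already absolute passes through unchanged, so the same helper can be applied to every image on a page.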

func (*Article) ParseReadContent

func (a *Article) ParseReadContent()

ParseReadContent converts ReadContent into a readable form.

func (*Article) Readable

func (a *Article) Readable(urlStr string)

type Info

type Info struct {
	TextCount     int
	LinkTextCount int
	TagCount      int
	LinkTagCount  int
	LeafList      []int
	Density       float64
	DensitySum    float64
	Pcount        int
	InputCount    int
	ImageCount    int

	Data string
	// contains filtered or unexported fields
}

func NewInfo

func NewInfo() *Info

func (*Info) CalScore

func (info *Info) CalScore(sn_sum, swn_sum float64)
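The README describes the extractor as text-density based, and Info carries text, link, and tag counters. The formula that CalScore actually applies is not documented here. As an assumption for illustration, the sketch below uses one common definition of text density: non-link characters divided by non-link tags. nodeStats and textDensity are hypothetical names, and the numbers are made up.

```go
package main

import "fmt"

// nodeStats is a local stand-in holding four of the counters that also
// appear in Info. It is not the package's type.
type nodeStats struct {
	TextCount     int // characters of text under the node
	LinkTextCount int // characters of text inside links
	TagCount      int // tags under the node
	LinkTagCount  int // link tags under the node
}

// textDensity returns non-link characters per non-link tag. This is an
// assumed formula, not necessarily the one used by CalScore.
func textDensity(s nodeStats) float64 {
	text := s.TextCount - s.LinkTextCount
	tags := s.TagCount - s.LinkTagCount
	if tags <= 0 {
		tags = 1 // avoid division by zero for leaf or link-only nodes
	}
	return float64(text) / float64(tags)
}

func main() {
	body := nodeStats{TextCount: 1200, LinkTextCount: 40, TagCount: 12, LinkTagCount: 2}
	nav := nodeStats{TextCount: 90, LinkTextCount: 80, TagCount: 22, LinkTagCount: 20}
	fmt.Println(textDensity(body), textDensity(nav))
}
```

With these made-up counts the article-like node scores 116 and the navigation-like node scores 5, which is the kind of gap a density-based extractor relies on to pick the content node.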

type Option

type Option struct {
	RemoveNoise   bool // remove noise node
	AccurateTitle bool // find the accurate title node
	UserAgent     string
}

type Style

type Style string
