extractor

package
v0.0.0-...-7179273
Warning

This package is not in the latest version of its module.

Published: Aug 3, 2025 License: Apache-2.0 Imports: 18 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func OpenGraphResolver

func OpenGraphResolver(doc *goquery.Document) string

OpenGraphResolver returns OpenGraph properties

func WebPageImageResolver

func WebPageImageResolver(doc *goquery.Document) ([]candidate, int)

WebPageImageResolver fetches all candidate images from the HTML page

func WebPageResolver

func WebPageResolver(article *goose.Article) string

WebPageResolver fetches the main image from the HTML page

Types

type Cleaner

type Cleaner struct {
	// contains filtered or unexported fields
}

Cleaner removes menus, ads, sidebars, etc. and leaves the main content

func NewCleaner

func NewCleaner(config goose.Configuration) Cleaner

NewCleaner returns a new instance of a Cleaner

func (*Cleaner) Clean

func (c *Cleaner) Clean(docToClean *goquery.Document) *goquery.Document

Clean removes HTML elements around the main content and prepares the document for parsing

type ContentExtractor

type ContentExtractor struct {
	// contains filtered or unexported fields
}

ContentExtractor can parse the HTML and fetch various properties

func NewExtractor

func NewExtractor(config goose.Configuration) ContentExtractor

NewExtractor returns a configured HTML parser

func (*ContentExtractor) CalculateBestNode

func (extr *ContentExtractor) CalculateBestNode(document *goquery.Document) *goquery.Selection

CalculateBestNode checks for the HTML node most likely to contain the main content. It starts by looking for clusters of paragraphs. Each cluster is scored on the number of stopwords it contains and on the number of consecutive paragraphs it groups together, which should form the body of text that the node surrounds. The score also depends on how high up the page the paragraphs are: comments are usually at the bottom and should get a lower score.
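The scoring idea described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the `block` type, the stopword list, the threshold of two stopwords, the bonus of 5 per consecutive paragraph, and the linear position weight are all assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// stopwords is a tiny illustrative list (an assumption); a real extractor
// would use a much larger, language-specific list.
var stopwords = map[string]bool{
	"the": true, "a": true, "an": true, "and": true, "of": true,
	"to": true, "in": true, "is": true, "that": true, "it": true,
}

// countStopwords returns how many words of text appear in the stopword list.
func countStopwords(text string) int {
	n := 0
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if stopwords[strings.Trim(w, ".,;:!?\"'()")] {
			n++
		}
	}
	return n
}

// block is a hypothetical stand-in for a container node and its paragraphs,
// listed in document order.
type block struct {
	name       string
	paragraphs []string
}

// scoreBlocks scores each block by stopword count, adds a bonus for runs of
// consecutive text-like paragraphs, and scales the result down for blocks
// that sit lower on the page.
func scoreBlocks(blocks []block) []float64 {
	scores := make([]float64, len(blocks))
	for i, b := range blocks {
		raw := 0.0
		run := 0 // length of the current run of text-like paragraphs
		for _, p := range b.paragraphs {
			sw := countStopwords(p)
			if sw < 2 { // too few stopwords: treat as non-prose, break the run
				run = 0
				continue
			}
			run++
			raw += float64(sw) + float64(run-1)*5
		}
		// Position weight: 1.0 for the first block, 0.5 for the last.
		weight := 1.0
		if len(blocks) > 1 {
			weight = 1.0 - 0.5*float64(i)/float64(len(blocks)-1)
		}
		scores[i] = raw * weight
	}
	return scores
}

// bestBlock returns the index of the highest-scoring block, or -1 if
// blocks is empty.
func bestBlock(blocks []block) int {
	best := -1
	bestScore := -1.0
	for i, s := range scoreBlocks(blocks) {
		if s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	page := []block{
		{name: "nav", paragraphs: []string{"Home", "About"}},
		{name: "article", paragraphs: []string{
			"The cat sat in the garden and it is happy.",
			"It is the end of a long day in the city.",
		}},
		{name: "comments", paragraphs: []string{"that is the best of it"}},
	}
	fmt.Println(page[bestBlock(page)].name)
}
```

Navigation blocks score near zero because short link labels contain few stopwords, while the comment block is penalised by its position even when it reads like prose.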

func (*ContentExtractor) GetCanonicalLink

func (extr *ContentExtractor) GetCanonicalLink(document *goquery.Document) string

GetCanonicalLink returns the meta canonical link set in the source

func (*ContentExtractor) GetCleanTextAndLinks

func (extr *ContentExtractor) GetCleanTextAndLinks(topNode *goquery.Selection, lang string) (string, []string)

GetCleanTextAndLinks parses the main HTML node for text and links

func (*ContentExtractor) GetDomain

func (extr *ContentExtractor) GetDomain(canonicalLink string) string

GetDomain extracts the domain from a link
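As a rough illustration of extracting a domain from a link, the sketch below uses `net/url` from the standard library. The helper name `domainOf` is hypothetical, and whether GetDomain itself strips ports or handles malformed links the same way is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// domainOf returns the host name of link without any port, or "" when the
// link cannot be parsed or has no host (for example a relative path).
func domainOf(link string) string {
	u, err := url.Parse(link)
	if err != nil {
		return ""
	}
	return u.Hostname()
}

func main() {
	fmt.Println(domainOf("https://example.com/news/story.html"))
	fmt.Println(domainOf("https://example.com:8443/feed"))
}
```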

func (*ContentExtractor) GetFavicon

func (extr *ContentExtractor) GetFavicon(document *goquery.Document) string

GetFavicon returns the favicon set in the source, if the article has one

func (*ContentExtractor) GetMetaAuthor

func (extr *ContentExtractor) GetMetaAuthor(document *goquery.Document) string

GetMetaAuthor returns the meta author set in the source, if the article has one

func (*ContentExtractor) GetMetaContent

func (extr *ContentExtractor) GetMetaContent(document *goquery.Document, metaName string) string

GetMetaContent returns the content attribute of the meta tag with the given property name
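The lookup can be pictured with the sketch below. The package works on a `*goquery.Document`; to keep the example free of third-party imports it swaps in the standard library's `encoding/xml` decoder in non-strict mode, which tolerates simple HTML. The helper name `metaContent`, and the choice to match either the `name` or the `property` attribute, are assumptions.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// metaContent scans page for a <meta> tag whose name or property attribute
// equals metaName (case-insensitively) and returns its content attribute.
// It returns "" when no such tag is found.
func metaContent(page, metaName string) string {
	d := xml.NewDecoder(strings.NewReader(page))
	d.Strict = false                 // tolerate HTML-style markup
	d.AutoClose = xml.HTMLAutoClose  // treat <meta>, <br>, ... as self-closing
	d.Entity = xml.HTMLEntity        // accept HTML entities such as &nbsp;
	for {
		tok, err := d.Token()
		if err != nil {
			return "" // end of input or unrecoverable markup
		}
		se, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(se.Name.Local, "meta") {
			continue
		}
		var key, content string
		for _, a := range se.Attr {
			switch strings.ToLower(a.Name.Local) {
			case "name", "property":
				key = a.Value
			case "content":
				content = a.Value
			}
		}
		if strings.EqualFold(key, metaName) {
			return content
		}
	}
}

func main() {
	page := `<html><head>` +
		`<meta name="description" content="A short summary">` +
		`<meta property="og:title" content="Hello">` +
		`</head><body></body></html>`
	fmt.Println(metaContent(page, "description"))
	fmt.Println(metaContent(page, "og:title"))
}
```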

func (*ContentExtractor) GetMetaContentLocation

func (extr *ContentExtractor) GetMetaContentLocation(document *goquery.Document) string

GetMetaContentLocation returns the meta content location set in the source, if the article has one

func (*ContentExtractor) GetMetaContentWithSelector

func (extr *ContentExtractor) GetMetaContentWithSelector(document *goquery.Document, selector string) string

GetMetaContentWithSelector returns the content attribute of the meta tag matching the selector

func (*ContentExtractor) GetMetaContents

func (extr *ContentExtractor) GetMetaContents(document *goquery.Document, metaNames *set.Set) map[string]string

GetMetaContents returns all the meta tags as name->content pairs

func (*ContentExtractor) GetMetaDescription

func (extr *ContentExtractor) GetMetaDescription(document *goquery.Document) string

GetMetaDescription returns the meta description set in the source, if the article has one

func (*ContentExtractor) GetMetaKeywords

func (extr *ContentExtractor) GetMetaKeywords(document *goquery.Document) string

GetMetaKeywords returns the meta keywords set in the source, if the article has them

func (*ContentExtractor) GetMetaLanguage

func (extr *ContentExtractor) GetMetaLanguage(document *goquery.Document) string

GetMetaLanguage returns the meta language set in the source, if the article has one

func (*ContentExtractor) GetPublishDate

func (extr *ContentExtractor) GetPublishDate(document *goquery.Document) *time.Time

GetPublishDate returns the publication date, if one can be located.

func (*ContentExtractor) GetTags

func (extr *ContentExtractor) GetTags(document *goquery.Document) *set.Set

GetTags returns the tags set in the source, if the article has them

func (*ContentExtractor) GetTitle

func (extr *ContentExtractor) GetTitle(document *goquery.Document) string

GetTitle returns the title set in the source, if the article has one

func (*ContentExtractor) GetTitleFromUnmodifiedTitle

func (extr *ContentExtractor) GetTitleFromUnmodifiedTitle(title string) string

GetTitleFromUnmodifiedTitle returns the title from the unmodified one
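The documentation does not say how the cleaned title is derived. One plausible reading, shown here purely as an assumption, is that site-name decorations such as "Story | Site" are stripped from the raw title. The helper name `cleanTitle`, the separator list, and the "keep the longest piece" rule are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanTitle splits raw on the first separator that occurs in it and keeps
// the longest piece, on the assumption that the article title is longer than
// the site name. Titles without a separator are returned trimmed.
func cleanTitle(raw string) string {
	raw = strings.TrimSpace(raw)
	for _, sep := range []string{" | ", " - ", " » ", " :: "} {
		parts := strings.Split(raw, sep)
		if len(parts) < 2 {
			continue
		}
		longest := ""
		for _, p := range parts {
			p = strings.TrimSpace(p)
			if len(p) > len(longest) {
				longest = p
			}
		}
		return longest
	}
	return raw
}

func main() {
	fmt.Println(cleanTitle("Go 1.23 Released | Example News"))
	fmt.Println(cleanTitle("Plain title"))
}
```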

func (*ContentExtractor) PostCleanup

func (extr *ContentExtractor) PostCleanup(targetNode *goquery.Selection) *goquery.Selection

PostCleanup removes any divs that look like non-content, clusters of links, or paragraphs with no gusto

type VideoExtractor

type VideoExtractor struct {
	// contains filtered or unexported fields
}

VideoExtractor can extract the main video from an HTML page

func NewVideoExtractor

func NewVideoExtractor() VideoExtractor

NewVideoExtractor returns a new instance of an HTML video extractor

func (*VideoExtractor) GetVideos

func (ve *VideoExtractor) GetVideos(doc *goquery.Document) *set.Set

GetVideos returns the video tags embedded in the article
