extractor

package

v0.0.0-...-7179273 Published: Aug 3, 2025 License: Apache-2.0 Imports: 18 Imported by: 0

Repository

github.com/advancedlogic/GoOse

Documentation ¶

Index ¶

func OpenGraphResolver(doc *goquery.Document) string
func WebPageImageResolver(doc *goquery.Document) ([]candidate, int)
func WebPageResolver(article *goose.Article) string
type Cleaner
- func NewCleaner(config goose.Configuration) Cleaner
- func (c *Cleaner) Clean(docToClean *goquery.Document) *goquery.Document
type ContentExtractor
- func NewExtractor(config goose.Configuration) ContentExtractor
type VideoExtractor
- func NewVideoExtractor() VideoExtractor
- func (ve *VideoExtractor) GetVideos(doc *goquery.Document) *set.Set

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func OpenGraphResolver ¶

func OpenGraphResolver(doc *goquery.Document) string

OpenGraphResolver returns OpenGraph properties

func WebPageImageResolver ¶

func WebPageImageResolver(doc *goquery.Document) ([]candidate, int)

WebPageImageResolver fetches all candidate images from the HTML page

func WebPageResolver ¶

func WebPageResolver(article *goose.Article) string

WebPageResolver fetches the main image from the HTML page

Types ¶

type Cleaner ¶

type Cleaner struct {
	// contains filtered or unexported fields
}

Cleaner removes menus, ads, sidebars, etc. and leaves the main content

func NewCleaner ¶

func NewCleaner(config goose.Configuration) Cleaner

NewCleaner returns a new instance of a Cleaner

func (*Cleaner) Clean ¶

func (c *Cleaner) Clean(docToClean *goquery.Document) *goquery.Document

Clean removes HTML elements around the main content and prepares the document for parsing

type ContentExtractor ¶

type ContentExtractor struct {
	// contains filtered or unexported fields
}

ContentExtractor can parse the HTML and fetch various properties

func NewExtractor ¶

func NewExtractor(config goose.Configuration) ContentExtractor

NewExtractor returns a configured HTML parser

func (*ContentExtractor) CalculateBestNode ¶

func (extr *ContentExtractor) CalculateBestNode(document *goquery.Document) *goquery.Selection

CalculateBestNode checks for the HTML node most likely to contain the main content. It looks for clusters of paragraphs and scores each cluster by the number of stopwords it contains and by the number of consecutive paragraphs it groups together; the best-scoring cluster should be the body of text that the chosen node wraps. It also scores paragraphs by how high up the page they appear: comments are usually at the bottom and should get a lower score.
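
The scoring idea described above can be illustrated with a small standard-library sketch. This is not the package's implementation: the stopword list, the function names (countStopwords, scoreParagraphs), the prose threshold, the bonus of 5 and the bottom-quarter penalty are all assumptions chosen for illustration, and paragraphs are plain strings rather than goquery selections.

```go
package main

import (
	"fmt"
	"strings"
)

// stopwords is a tiny illustrative list (assumption); a real extractor
// would load a language-specific list.
var stopwords = map[string]bool{
	"a": true, "and": true, "in": true, "is": true,
	"of": true, "the": true, "to": true,
}

// countStopwords returns how many whitespace-separated words in text
// are in the stopword list, ignoring case and surrounding punctuation.
func countStopwords(text string) int {
	n := 0
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if stopwords[strings.Trim(w, ".,;:!?\"'()")] {
			n++
		}
	}
	return n
}

// scoreParagraphs scores each paragraph by its stopword count, adds a
// bonus when the previous paragraph also looked like prose, and halves
// the score of paragraphs in the last quarter of the page.
// All constants here are hypothetical.
func scoreParagraphs(paragraphs []string) []float64 {
	const minStopwords = 2 // below this a paragraph is not treated as prose
	scores := make([]float64, len(paragraphs))
	prevWasProse := false
	for i, p := range paragraphs {
		sw := countStopwords(p)
		score := float64(sw)
		isProse := sw >= minStopwords
		if isProse && prevWasProse {
			score += 5 // consecutive-paragraph bonus
		}
		if float64(i) >= 0.75*float64(len(paragraphs)) {
			score *= 0.5 // low on the page: likely comments or footer
		}
		scores[i] = score
		prevWasProse = isProse
	}
	return scores
}

func main() {
	page := []string{
		"Home | About | Contact",
		"The city council voted to approve the budget in a late session.",
		"Most of the funding is set aside for the repair of roads and bridges.",
		"Great article! Thanks.",
	}
	for i, s := range scoreParagraphs(page) {
		fmt.Printf("paragraph %d: %.1f\n", i, s)
	}
}
```

In this sketch the navigation line and the trailing comment score zero, while the two adjacent body paragraphs score highest, which is the behaviour the description aims for.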

func (*ContentExtractor) GetCanonicalLink ¶

func (extr *ContentExtractor) GetCanonicalLink(document *goquery.Document) string

GetCanonicalLink returns the meta canonical link set in the source

func (*ContentExtractor) GetCleanTextAndLinks ¶

func (extr *ContentExtractor) GetCleanTextAndLinks(topNode *goquery.Selection, lang string) (string, []string)

GetCleanTextAndLinks parses the main HTML node for text and links

func (*ContentExtractor) GetDomain ¶

func (extr *ContentExtractor) GetDomain(canonicalLink string) string

GetDomain extracts the domain from a link
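
As a rough illustration of extracting a domain from a link, the sketch below uses only net/url. The helper name domainOf is hypothetical, and whether GetDomain itself strips the port or returns an empty string on failure is an assumption, not something this page states.

```go
package main

import (
	"fmt"
	"net/url"
)

// domainOf returns the host part of link without any port, or "" if
// the link cannot be parsed or has no host (for example a relative path).
func domainOf(link string) string {
	u, err := url.Parse(link)
	if err != nil {
		return ""
	}
	return u.Hostname()
}

func main() {
	fmt.Println(domainOf("https://www.example.com:8080/news/story.html"))
	fmt.Println(domainOf("/relative/path") == "")
}
```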

func (*ContentExtractor) GetFavicon ¶

func (extr *ContentExtractor) GetFavicon(document *goquery.Document) string

GetFavicon returns the favicon set in the source, if the article has one

func (*ContentExtractor) GetMetaAuthor ¶

func (extr *ContentExtractor) GetMetaAuthor(document *goquery.Document) string

GetMetaAuthor returns the meta author set in the source, if the article has one

func (*ContentExtractor) GetMetaContent ¶

func (extr *ContentExtractor) GetMetaContent(document *goquery.Document, metaName string) string

GetMetaContent returns the content attribute of the meta tag with the given property name

func (*ContentExtractor) GetMetaContentLocation ¶

func (extr *ContentExtractor) GetMetaContentLocation(document *goquery.Document) string

GetMetaContentLocation returns the meta content location set in the source, if the article has one

func (*ContentExtractor) GetMetaContentWithSelector ¶

func (extr *ContentExtractor) GetMetaContentWithSelector(document *goquery.Document, selector string) string

GetMetaContentWithSelector returns the content attribute of the meta tag matching the selector

func (*ContentExtractor) GetMetaContents ¶

func (extr *ContentExtractor) GetMetaContents(document *goquery.Document, metaNames *set.Set) map[string]string

GetMetaContents returns all the meta tags as name->content pairs

func (*ContentExtractor) GetMetaDescription ¶

func (extr *ContentExtractor) GetMetaDescription(document *goquery.Document) string

GetMetaDescription returns the meta description set in the source, if the article has one

func (*ContentExtractor) GetMetaKeywords ¶

func (extr *ContentExtractor) GetMetaKeywords(document *goquery.Document) string

GetMetaKeywords returns the meta keywords set in the source, if the article has them

func (*ContentExtractor) GetMetaLanguage ¶

func (extr *ContentExtractor) GetMetaLanguage(document *goquery.Document) string

GetMetaLanguage returns the meta language set in the source, if the article has one

func (*ContentExtractor) GetPublishDate ¶

func (extr *ContentExtractor) GetPublishDate(document *goquery.Document) *time.Time

GetPublishDate returns the publication date, if one can be located.
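
Because the return type is *time.Time, callers should presumably expect nil when no date is found. The sketch below shows that pattern with the standard library only; the helper parsePublishDate and its list of layouts are hypothetical, and this page does not say which sources or formats GetPublishDate actually inspects.

```go
package main

import (
	"fmt"
	"time"
)

// layouts lists the date formats this sketch tries, in order (assumption).
var layouts = []string{
	time.RFC3339,
	"2006-01-02 15:04:05",
	"2006-01-02",
}

// parsePublishDate returns a pointer to the first successful parse of
// raw, or nil if no layout matches.
func parsePublishDate(raw string) *time.Time {
	for _, layout := range layouts {
		if parsed, err := time.Parse(layout, raw); err == nil {
			return &parsed
		}
	}
	return nil
}

func main() {
	if d := parsePublishDate("2025-08-03T10:30:00Z"); d != nil {
		fmt.Println(d.Year(), d.Month(), d.Day())
	}
	// An unparseable value yields nil, so always check before dereferencing.
	fmt.Println(parsePublishDate("last Tuesday") == nil)
}
```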

func (*ContentExtractor) GetTags ¶

func (extr *ContentExtractor) GetTags(document *goquery.Document) *set.Set

GetTags returns the tags set in the source, if the article has them

func (*ContentExtractor) GetTitle ¶

func (extr *ContentExtractor) GetTitle(document *goquery.Document) string

GetTitle returns the title set in the source, if the article has one

func (*ContentExtractor) GetTitleFromUnmodifiedTitle ¶

func (extr *ContentExtractor) GetTitleFromUnmodifiedTitle(title string) string

GetTitleFromUnmodifiedTitle returns the title from the unmodified one
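
One common way to derive a headline from a raw page title is to split it on a separator and keep the longest piece, which drops a trailing site name. The sketch below illustrates that heuristic only; the helper cleanTitle and its delimiter list are assumptions, and this page does not document the rules GetTitleFromUnmodifiedTitle applies.

```go
package main

import (
	"fmt"
	"strings"
)

// delimiters are separators often found between a headline and a site
// name (assumed list).
var delimiters = []string{" | ", " - ", " » "}

// cleanTitle splits title on the first delimiter it contains and
// returns the longest piece, trimmed. Titles without a delimiter are
// returned trimmed but otherwise unchanged.
func cleanTitle(title string) string {
	for _, d := range delimiters {
		if strings.Contains(title, d) {
			longest := ""
			for _, part := range strings.Split(title, d) {
				if len(part) > len(longest) {
					longest = part
				}
			}
			return strings.TrimSpace(longest)
		}
	}
	return strings.TrimSpace(title)
}

func main() {
	fmt.Println(cleanTitle("Council Approves New Budget | Example News"))
}
```

Keeping the longest piece is a heuristic: it fails when the site name is longer than the headline, so a production version would likely need extra checks.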

func (*ContentExtractor) PostCleanup ¶

func (extr *ContentExtractor) PostCleanup(targetNode *goquery.Selection) *goquery.Selection

PostCleanup removes any divs that look like non-content, clusters of links, or paragraphs with no gusto

type VideoExtractor ¶

type VideoExtractor struct {
	// contains filtered or unexported fields
}

VideoExtractor can extract the main video from an HTML page

func NewVideoExtractor ¶

func NewVideoExtractor() VideoExtractor

NewVideoExtractor returns a new instance of an HTML video extractor

func (*VideoExtractor) GetVideos ¶

func (ve *VideoExtractor) GetVideos(doc *goquery.Document) *set.Set

GetVideos returns the video tags embedded in the article
