Documentation
Overview ¶
Example ¶
// URL of the page to extract content from (title, description, images, ...).
url := "https://en.wikipedia.org/wiki/Lego"
// Start from the default options.
opt := readability.NewOption()
// Modify individual option values if needed.
opt.ImageRequestTimeout = 3000 // ms
content, err := readability.Extract(url, opt)
if err != nil {
	log.Fatal(err)
}
log.Println(content.Title)
log.Println(content.Description)
log.Println(content.Images)
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Content ¶
Content contains the primary readable content of a webpage.
func ExtractFromDocument ¶
ExtractFromDocument returns the extracted Content on success, or an error otherwise. reqURL is required to convert relative image paths to absolute URLs.
If you already have a *goquery.Document from an earlier HTTP request, use this function; otherwise use Extract(reqURL, opt).
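To show why reqURL is needed, here is a minimal sketch of resolving an image path against the request URL with the standard net/url package. The function absoluteImageURL is a hypothetical helper written for this illustration; the package may resolve paths differently.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// absoluteImageURL resolves src (as found in an img tag) against reqURL.
// Hypothetical helper; it only illustrates the role of the request URL.
func absoluteImageURL(reqURL, src string) (string, error) {
	base, err := url.Parse(reqURL)
	if err != nil {
		return "", fmt.Errorf("parse request URL: %w", err)
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", fmt.Errorf("parse image src: %w", err)
	}
	// ResolveReference leaves absolute URLs unchanged and resolves
	// relative ones against the base.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	reqURL := "https://en.wikipedia.org/wiki/Lego"
	sources := []string{
		"/static/logo.png",              // root-relative path
		"images/brick.jpg",              // path relative to the page
		"//upload.example.org/a.png",    // scheme-relative URL
		"https://cdn.example.org/b.png", // already absolute
	}
	for _, src := range sources {
		abs, err := absoluteImageURL(reqURL, src)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(abs)
	}
}
```

Without the request URL, the first three sources could not be turned into fetchable addresses.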
type Option ¶
type Option struct {
	// RetryLength is the minimum length for a page description.
	// If the extracted description is shorter than this value,
	// extraction is retried with more liberal rules.
	RetryLength int
	// MinTextLength is the minimum length of the inner text of a tag.
	// If a tag's inner text is shorter than MinTextLength,
	// the text is discarded from the page description candidates.
	MinTextLength int
	// RemoveUnlikelyCandidates is a flag whether to remove some tags
	// if they are considered relatively unimportant.
	RemoveUnlikelyCandidates bool
	// WeightClasses is a flag whether to give more/less weight to some tags
	// if they contain some positive/negative words in their id/class value.
	WeightClasses bool
	// CleanConditionally is a flag whether to remove some tags
	// using various rules in conditionalCleanReason().
	CleanConditionally bool
	// RemoveEmptyNodes is a flag whether to remove tags which have empty inner text.
	RemoveEmptyNodes bool
	// MinImageWidth is the minimum width (in pixels) for choosing images.
	MinImageWidth uint32
	// MinImageHeight is the minimum height (in pixels) for choosing images.
	MinImageHeight uint32
	// MaxImageCount is the maximum number of images for a web page.
	MaxImageCount int
	// CheckImageSize is a flag whether to check image sizes.
	CheckImageSize bool
	// CheckImageLoopCount is the number of images requested in parallel to fetch their sizes.
	// For example, if this value is set to 10,
	// the first 10 image sources found in img tags will be requested.
	CheckImageLoopCount uint
	// ImageRequestTimeout is the timeout (ms) for a single image request.
	ImageRequestTimeout uint
	// IgnoreImageFormat is a slice of strings for ignoring some images.
	// If an image URL contains at least one of the strings in this slice, the image is ignored.
	IgnoreImageFormat []string
	// DescriptionAsPlainText is a flag whether to strip all tags from the description value.
	DescriptionAsPlainText bool
	// DescriptionExtractionTimeout is the timeout (ms) for extracting the description of a page.
	DescriptionExtractionTimeout uint
}
Option contains a variety of options for extracting page content and images.
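The image-related fields describe a selection rule: skip URLs containing an ignored substring, skip images below the minimum dimensions, and stop at the maximum count. The sketch below illustrates that rule under stated assumptions. The image and filter types and the apply method are hypothetical and written only for this example; the package's actual selection order and logic may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// image is a hypothetical record of an image URL and its pixel size.
type image struct {
	URL           string
	Width, Height uint32
}

// filter mirrors the intent of MinImageWidth, MinImageHeight, MaxImageCount
// and IgnoreImageFormat; it is not the package's own type.
type filter struct {
	MinWidth, MinHeight uint32
	MaxCount            int
	Ignore              []string
}

// ignored reports whether u contains at least one of the given substrings.
func ignored(u string, patterns []string) bool {
	for _, p := range patterns {
		if strings.Contains(u, p) {
			return true
		}
	}
	return false
}

// apply returns at most MaxCount images that pass the ignore and size checks,
// preserving their original order.
func (f filter) apply(imgs []image) []image {
	var kept []image
	for _, img := range imgs {
		if len(kept) >= f.MaxCount {
			break
		}
		if ignored(img.URL, f.Ignore) {
			continue
		}
		if img.Width < f.MinWidth || img.Height < f.MinHeight {
			continue
		}
		kept = append(kept, img)
	}
	return kept
}

func main() {
	f := filter{MinWidth: 100, MinHeight: 100, MaxCount: 2, Ignore: []string{".gif"}}
	imgs := []image{
		{"https://example.org/a.jpg", 800, 600},
		{"https://example.org/sprite.gif", 800, 600}, // ignored substring
		{"https://example.org/icon.jpg", 16, 16},     // too small
		{"https://example.org/b.jpg", 640, 480},
		{"https://example.org/c.jpg", 1024, 768}, // beyond MaxCount
	}
	for _, img := range f.apply(imgs) {
		fmt.Println(img.URL)
	}
}
```

In this sketch a MaxCount of 0 selects no images; whether the package treats zero the same way is not stated in the documentation above.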
