Documentation
Overview
Package defuddle extracts main content from web pages as clean HTML or Markdown.
It runs the Defuddle (https://github.com/kepano/defuddle) JavaScript library inside a sandboxed QuickJS runtime (via WebAssembly), with Markdown conversion handled natively in Go via html-to-markdown.
Basic usage:
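	parser, err := defuddle.NewParser()
	if err != nil {
		log.Fatal(err)
	}
	defer parser.Close()

	// html holds the page source; the second argument is the page URL.
	result, err := parser.Parse(html, "https://example.com/page", nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(result.Title)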
Types
type MetaTag
type MetaTag struct {
	Name     *string `json:"name"`
	Property *string `json:"property"`
	Content  string  `json:"content"`
}
MetaTag represents a single HTML meta tag.
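Because Name and Property are pointers, either one may be nil, so callers need a nil check before dereferencing. The following is a minimal sketch of a lookup helper. The struct is copied locally so the example compiles on its own, and metaContent and strPtr are hypothetical names, not part of the package.

```go
package main

import "fmt"

// MetaTag mirrors the struct documented above so the sketch is self-contained.
type MetaTag struct {
	Name     *string `json:"name"`
	Property *string `json:"property"`
	Content  string  `json:"content"`
}

// metaContent is a hypothetical helper. It returns the content of the first
// tag whose name or property equals key, checking for nil before each
// dereference.
func metaContent(tags []MetaTag, key string) (string, bool) {
	for _, t := range tags {
		if t.Name != nil && *t.Name == key {
			return t.Content, true
		}
		if t.Property != nil && *t.Property == key {
			return t.Content, true
		}
	}
	return "", false
}

// strPtr is a hypothetical helper used only to build sample data.
func strPtr(s string) *string { return &s }

func main() {
	tags := []MetaTag{
		{Name: strPtr("description"), Content: "An example page"},
		{Property: strPtr("og:title"), Content: "Example"},
	}
	if v, ok := metaContent(tags, "og:title"); ok {
		fmt.Println(v) // prints "Example"
	}
}
```

With real data, the slice would come from Result.MetaTags.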
type Options
type Options struct {
	// Markdown converts the extracted HTML content to Markdown (Go-side).
	Markdown bool `json:"-"`
	// RemoveSmallImages toggles removal of small/tracking images.
	RemoveSmallImages *bool `json:"removeSmallImages,omitempty"`
	// RemoveHiddenElements toggles removal of hidden DOM elements.
	RemoveHiddenElements *bool `json:"removeHiddenElements,omitempty"`
	// RemoveLowScoring toggles removal of low-scoring content blocks.
	RemoveLowScoring *bool `json:"removeLowScoring,omitempty"`
	// RemoveExactSelectors toggles removal via exact CSS selectors.
	RemoveExactSelectors *bool `json:"removeExactSelectors,omitempty"`
	// RemovePartialSelectors toggles removal via partial class/id matching.
	RemovePartialSelectors *bool `json:"removePartialSelectors,omitempty"`
	// RemoveContentPatterns toggles content-pattern-based removal.
	RemoveContentPatterns *bool `json:"removeContentPatterns,omitempty"`
	// Standardize toggles HTML normalization (headings, code blocks, etc.).
	Standardize *bool `json:"standardize,omitempty"`
	// Debug enables debug output from the defuddle pipeline.
	Debug bool `json:"debug,omitempty"`
}
Options controls parsing behavior.
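The struct tags give each *bool toggle three states: nil (left out of the JSON entirely), explicit true, and explicit false. The sketch below shows that behavior with encoding/json on a trimmed local copy of the struct. That the package actually marshals Options to JSON for the JavaScript side is an assumption inferred from the tags, and boolPtr and encode are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Options is a trimmed local copy of the struct documented above, kept here
// only so the sketch compiles without the package.
type Options struct {
	Markdown          bool  `json:"-"`
	RemoveSmallImages *bool `json:"removeSmallImages,omitempty"`
	Standardize       *bool `json:"standardize,omitempty"`
	Debug             bool  `json:"debug,omitempty"`
}

// boolPtr is a hypothetical helper for filling the *bool fields.
func boolPtr(b bool) *bool { return &b }

// encode returns the JSON form of o as a string.
func encode(o Options) string {
	b, err := json.Marshal(o)
	if err != nil {
		panic(err) // not expected for a struct of bools and *bools
	}
	return string(b)
}

func main() {
	// Nil toggles and a false Debug are omitted, so nothing is overridden.
	fmt.Println(encode(Options{})) // prints {}

	// Markdown is tagged json:"-" and never serialized; the explicit false
	// survives because omitempty only drops nil pointers.
	fmt.Println(encode(Options{Markdown: true, RemoveSmallImages: boolPtr(false)}))
	// prints {"removeSmallImages":false}
}
```

A pointer is what lets an explicit false be distinguished from "not set", which a plain bool with omitempty cannot express.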
type Parser
type Parser struct {
	// contains filtered or unexported fields
}
Parser wraps a QuickJS runtime with the defuddle bundle pre-loaded.
A Parser is safe for sequential use but NOT for concurrent use from multiple goroutines. For concurrent workloads, create one Parser per goroutine or use a sync.Pool.
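One way to follow the one-Parser-per-goroutine rule is a fixed set of workers, each owning its own instance and pulling page indexes from a shared channel. The sketch below uses a stand-in extractor interface instead of the real Parser so that it runs without the package. The interface, its method signatures, upperExtractor, and extractAll are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// extractor is a hypothetical stand-in for the parts of a Parser used here.
type extractor interface {
	Extract(html string) (string, error)
	Close()
}

// result pairs the output for one page with any error from processing it.
type result struct {
	Text string
	Err  error
}

// extractAll processes pages with the given number of workers. Each worker
// owns exactly one extractor, so no instance is used concurrently.
func extractAll(pages []string, workers int, newExtractor func() (extractor, error)) ([]result, error) {
	if workers < 1 {
		workers = 1
	}
	// Build every extractor up front so a constructor failure cannot leave
	// the producer blocked on a channel that nobody reads.
	exs := make([]extractor, 0, workers)
	for i := 0; i < workers; i++ {
		ex, err := newExtractor()
		if err != nil {
			for _, e := range exs {
				e.Close()
			}
			return nil, err
		}
		exs = append(exs, ex)
	}

	out := make([]result, len(pages))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for _, ex := range exs {
		wg.Add(1)
		go func(ex extractor) {
			defer wg.Done()
			defer ex.Close()
			for i := range jobs {
				text, err := ex.Extract(pages[i])
				out[i] = result{Text: text, Err: err} // each index is written once
			}
		}(ex)
	}
	for i := range pages {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return out, nil
}

// upperExtractor is a toy implementation used only to make the sketch run.
type upperExtractor struct{}

func (upperExtractor) Extract(html string) (string, error) { return strings.ToUpper(html), nil }
func (upperExtractor) Close()                              {}

func main() {
	res, err := extractAll([]string{"a", "b", "c"}, 2, func() (extractor, error) {
		return upperExtractor{}, nil
	})
	if err != nil {
		panic(err)
	}
	for _, r := range res {
		fmt.Println(r.Text)
	}
}
```

Compared with a sync.Pool, a fixed worker set keeps the number of runtimes bounded and guarantees that every instance is closed, since a pool may drop idle items without calling Close.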
func NewParser
NewParser creates a new Parser instance. This loads the QuickJS WebAssembly runtime and evaluates the defuddle JS bundle (~450ms cold start). Reuse the parser across multiple Parse calls to amortize this cost.
type Result
type Result struct {
	// Content is the extracted main content as clean HTML.
	Content string `json:"content"`
	// Title is the page title.
	Title string `json:"title"`
	// Description is the meta description.
	Description string `json:"description"`
	// Domain is the hostname (e.g. "example.com").
	Domain string `json:"domain"`
	// Favicon is the favicon URL.
	Favicon string `json:"favicon"`
	// Image is the Open Graph or lead image URL.
	Image string `json:"image"`
	// Language is the content language (e.g. "en").
	Language string `json:"language"`
	// Published is the publish date (ISO 8601 when available).
	Published string `json:"published"`
	// Author is the author name.
	Author string `json:"author"`
	// Site is the site name.
	Site string `json:"site"`
	// WordCount is the word count of extracted content.
	WordCount int `json:"wordCount"`
	// ParseTime is the JS-side parse time in milliseconds.
	ParseTime int `json:"parseTime"`
	// MetaTags contains all meta tags from <head>.
	MetaTags []MetaTag `json:"metaTags,omitempty"`
	// SchemaOrgData contains parsed JSON-LD schema.org data.
	SchemaOrgData json.RawMessage `json:"schemaOrgData,omitempty"`
	// Markdown is the content converted to Markdown.
	// Only populated when Options.Markdown is true.
	Markdown string `json:"markdown,omitempty"`
}
Result holds the parsed output from defuddle.