defuddle

package module
v0.1.1
Published: Apr 12, 2026 License: MIT Imports: 7 Imported by: 0

README

go-defuddle

Go port of Defuddle — extract main content from web pages as clean HTML or Markdown.

Runs the real Defuddle JavaScript library inside a sandboxed QuickJS (WebAssembly) runtime. Zero CGO. Pure Go. Single binary.

Install

As a library
go get github.com/vaayne/go-defuddle
As a CLI
go install github.com/vaayne/go-defuddle/cmd/defuddle@latest

CLI usage

# Extract as markdown
defuddle -m https://example.com/article

# Output as JSON with metadata
defuddle -j https://example.com/article

# Extract a specific property
defuddle -p title https://example.com/article

# Parse a local HTML file
defuddle -m page.html

# Save to file
defuddle -m -o output.md https://example.com/article
Flags
-m, -markdown     Convert content to markdown format
-j, -json         Output as JSON with metadata and content
-p, -property     Extract a specific property (title, author, domain, etc.)
-o, -output       Output file path (default: stdout)
    -debug        Enable debug mode
-v, -version      Print version

Library usage

package main

import (
	"fmt"
	"log"

	defuddle "github.com/vaayne/go-defuddle"
)

func main() {
	parser, err := defuddle.NewParser()
	if err != nil {
		log.Fatal(err)
	}
	defer parser.Close()

	result, err := parser.Parse(
		`<html>
		<head><title>My Article</title></head>
		<body>
			<article>
				<h1>My Article</h1>
				<p>This is the main content.</p>
			</article>
			<footer>Copyright 2025</footer>
		</body>
		</html>`,
		"https://example.com/my-article",
		&defuddle.Options{Markdown: true},
	)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("Title:", result.Title)
	fmt.Println(result.Markdown)
}

API

NewParser() (*Parser, error)

Creates a parser instance. Loads the QuickJS WASM runtime and evaluates the JS bundle (~450ms cold start). Reuse across calls.

Parser.Parse(html, url string, opts *Options) (*Result, error)

Extracts main content from raw HTML.

  • html: full HTML source of the page
  • url: page URL, used to resolve relative links and to select site-specific extractors (pass an empty string if unknown)
  • opts: parsing options (pass nil for defaults)
Parser.Close()

Releases the QuickJS runtime.

Types
type Result struct {
	Content       string          // Clean HTML
	Title         string          // Page title
	Description   string          // Meta description
	Domain        string          // Hostname
	Favicon       string          // Favicon URL
	Image         string          // Lead image URL
	Language      string          // Content language
	Published     string          // Publish date
	Author        string          // Author name
	Site          string          // Site name
	WordCount     int             // Word count
	ParseTime     int             // JS parse time (ms)
	MetaTags      []MetaTag       // Meta tags from <head>
	SchemaOrgData json.RawMessage // JSON-LD schema.org data
	Markdown      string          // Markdown (when Options.Markdown is true)
}

type Options struct {
	Markdown               bool  // Convert to Markdown (Go-side)
	RemoveSmallImages      *bool // Toggle small image removal
	RemoveHiddenElements   *bool // Toggle hidden element removal
	RemoveLowScoring       *bool // Toggle low-scoring block removal
	RemoveExactSelectors   *bool // Toggle exact CSS selector removal
	RemovePartialSelectors *bool // Toggle partial class/id removal
	RemoveContentPatterns  *bool // Toggle content-pattern removal
	Standardize            *bool // Toggle HTML normalization
	Debug                  bool  // Enable debug output
}
Concurrency

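A Parser is not safe for concurrent use. For concurrent workloads, give each goroutine exclusive access to its own Parser, for example by borrowing from a fixed-size pool:

pool := make(chan *defuddle.Parser, numWorkers)
for range numWorkers { // range over an int needs Go 1.22+
	p, err := defuddle.NewParser()
	if err != nil {
		log.Fatal(err)
	}
	pool <- p
}

// Per goroutine: borrow a parser and return it when done.
p := <-pool
defer func() { pool <- p }()
result, err := p.Parse(html, url, nil)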
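
The borrowing pattern can be tried end to end without the library. The following is a minimal sketch: fakeParser and processAll are hypothetical names, and fakeParser only stands in for *defuddle.Parser (its Parse returns the input length). Only the channel-based pool mirrors the snippet above.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeParser is a hypothetical stand-in for *defuddle.Parser so the sketch
// runs without the library. Like the real Parser, it is not goroutine-safe.
type fakeParser struct{ calls int }

// Parse pretends to extract content; here it only returns the input length.
func (p *fakeParser) Parse(html string) int {
	p.calls++ // unsynchronized on purpose: the pool gives exclusive access
	return len(html)
}

// processAll parses every page with at most numWorkers parsers in use at once.
func processAll(pages []string, numWorkers int) []int {
	pool := make(chan *fakeParser, numWorkers)
	for i := 0; i < numWorkers; i++ {
		pool <- &fakeParser{}
	}

	results := make([]int, len(pages))
	var wg sync.WaitGroup
	for i, page := range pages {
		wg.Add(1)
		go func(i int, page string) {
			defer wg.Done()
			p := <-pool                  // borrow: blocks until a parser is free
			defer func() { pool <- p }() // hand it back for the next goroutine
			results[i] = p.Parse(page)   // each goroutine writes its own slot
		}(i, page)
	}
	wg.Wait()
	return results
}

func main() {
	pages := []string{"<p>a</p>", "<p>bb</p>", "<p>ccc</p>"}
	fmt.Println(processAll(pages, 2)) // prints [8 9 10]
}
```

The buffered channel doubles as a semaphore: with numWorkers parsers in it, no more than numWorkers Parse calls can run at the same time, however many goroutines are started.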

How it works

┌──────────────┐       ┌──────────────────────────┐       ┌────────────────────┐
│   Go app     │──────▶│   QuickJS (Wazero WASM)  │──────▶│  html-to-markdown  │
│  .Parse()    │ HTML  │   defuddle + linkedom     │ JSON  │  HTML → Markdown   │
└──────────────┘       └──────────────────────────┘       └────────────────────┘
  1. Content extraction runs in JavaScript. Defuddle and linkedom are bundled into a single ~430KB JS file executed in QuickJS via Wazero (WebAssembly). No Node.js, no browser, no CGO.
  2. Markdown conversion runs in Go via html-to-markdown, which uses goldmark internally.
Performance
Metric              Time
Init (cold start)   ~450ms
Parse + Markdown    ~95ms

Init is one-time per Parser instance.
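
A rough back-of-the-envelope sketch of why reuse matters, using only the two approximate figures above. The function names are illustrative, and real timings will vary with page size and hardware.

```go
package main

import "fmt"

const (
	initMs  = 450 // approximate cold start per NewParser call
	parseMs = 95  // approximate Parse + Markdown time per page
)

// reusedMs estimates the total time when one parser handles all n pages.
func reusedMs(n int) int { return initMs + n*parseMs }

// freshMs estimates the total time when every page gets a new parser.
func freshMs(n int) int { return n * (initMs + parseMs) }

func main() {
	for _, n := range []int{1, 10, 100} {
		fmt.Printf("pages=%d reused=%dms fresh=%dms\n", n, reusedMs(n), freshMs(n))
	}
}
```

For 10 pages this gives about 1400ms with a reused parser versus about 5450ms with a fresh parser per page.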

Syncing with upstream Defuddle

Defuddle is included as a git submodule. The JS bundle is a custom webpack build rather than one taken from Defuddle's dist/, because Defuddle's shipped bundles expect either a browser DOM or Node.js require(), neither of which exists in QuickJS.

Our custom bundle (internal/js/bundle-entry.js):

  • Inlines linkedom directly (no runtime require())
  • Imports Defuddle from source (defuddle/src/defuddle.ts)
  • Patches the DOM (styleSheets, getComputedStyle)
  • Skips Turndown (Go handles Markdown)
  • Uses math.core.ts (no temml/mathml-to-latex, saves ~450KB)
To sync
# With mise (recommended)
mise run sync

# Or manually
cd defuddle && git pull origin main && cd ..
npm install && npx webpack
go test ./...
What can break
Upstream change                                     Fix
New browser/Node API used                           Add polyfill to internal/js/polyfills.js
Defuddle constructor or parse() signature changes   Update internal/js/bundle-entry.js
parse() return type changes                         Update Result struct in defuddle.go
New npm dep with native bindings                    Check for pure-JS alternative
math.core.ts path changes                           Update webpack alias in webpack.config.js

QuickJS polyfills

QuickJS is ES2023 compliant but has no Web/Node APIs. internal/js/polyfills.js provides:

Polyfill            Reason
self                UMD bundle expects self on globalThis
Buffer.from()       htmlparser2 entity decoder uses Buffer for base64
URL                 Defuddle uses new URL() for domain extraction, link resolution
atob()              Base64 fallback for htmlparser2
performance.now()   Defuddle profiling; shimmed to Date.now()

Development

This project uses mise for tool versions and task running.

# Setup
mise install              # Install node + go
mise run install          # Install npm deps

# Common tasks
mise run bundle           # Rebuild JS bundle
mise run bundle-check     # Verify committed bundle is up to date
mise run build-cli        # Build CLI to bin/defuddle
mise run test             # Run Go tests
mise run lint             # Run go vet
mise run ci               # Full CI pipeline (bundle-check + lint + test)
mise run sync             # Update defuddle submodule + rebuild
mise run clean            # Remove build artifacts
CI
  • CI workflow (.github/workflows/ci.yml): runs on push/PR to main — verifies the committed bundle is up to date, runs go vet and go test.
  • Release workflow (.github/workflows/release.yml): triggered by v* tags — cross-compiles for linux/darwin/windows (amd64/arm64) and creates a GitHub Release with binaries.

To release:

git tag v0.1.0
git push origin v0.1.0

Project structure

go-defuddle/
├── defuddle.go              # Go library (Parser, Result, Options)
├── defuddle_test.go         # Go tests
├── defuddle/                # git submodule → github.com/kepano/defuddle
├── cmd/defuddle/main.go     # CLI
├── internal/js/
│   ├── bundle-entry.js      # Webpack entry (wires linkedom + defuddle)
│   ├── polyfills.js         # QuickJS polyfills (Buffer, URL, atob, etc.)
│   └── defuddle-bundle.js   # Built bundle (~430KB, embedded via go:embed)
├── .github/workflows/
│   ├── ci.yml               # CI: bundle-check + vet + test
│   └── release.yml          # Release: cross-compile + GitHub Release
├── mise.toml                # Tool versions + task definitions
├── webpack.config.js        # Webpack config
├── tsconfig.json            # TypeScript config for webpack
├── package.json             # npm deps (linkedom, webpack, ts-loader)
└── go.mod

Dependencies

Go
Package            Purpose
fastschema/qjs     QuickJS via Wazero (WASM, no CGO)
html-to-markdown   HTML → Markdown (uses goldmark)
JavaScript (bundled into defuddle-bundle.js)
Package       Purpose
defuddle      Content extraction pipeline
linkedom      DOM implementation
htmlparser2   HTML parser
cssom         CSS parsing

Limitations

  • No real getComputedStyle: linkedom doesn't compute CSS, so hidden-element removal relies on inline styles and class heuristics.
  • No canvas: Image dimensions use HTML attributes only.
  • URL polyfill is minimal: Covers common cases. Edge cases with IPv6 or exotic schemes may not parse.
  • Single-threaded per Parser: Create multiple instances for concurrency.
  • ~450ms cold start: Each NewParser() call loads WASM + JS. Parse calls on an existing Parser take ~95ms.

Credits

License

MIT

Documentation

Overview

Package defuddle extracts main content from web pages as clean HTML or Markdown.

It runs the Defuddle (https://github.com/kepano/defuddle) JavaScript library inside a sandboxed QuickJS runtime (via WebAssembly), with Markdown conversion handled natively in Go via html-to-markdown.

Basic usage:

parser, err := defuddle.NewParser()
if err != nil {
    log.Fatal(err)
}
defer parser.Close()

result, err := parser.Parse(html, "https://example.com/page", nil)

Types

type MetaTag

type MetaTag struct {
	Name     *string `json:"name"`
	Property *string `json:"property"`
	Content  string  `json:"content"`
}

MetaTag represents a single HTML meta tag.
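
Name and Property are pointers, presumably because a given meta tag may carry only one of the two attributes, so callers need nil checks before dereferencing. A minimal sketch of a lookup follows. metaContent is a hypothetical helper and not part of the package, and MetaTag is redeclared locally only so the example compiles on its own.

```go
package main

import "fmt"

// MetaTag mirrors the documented type so the sketch is self-contained.
type MetaTag struct {
	Name     *string `json:"name"`
	Property *string `json:"property"`
	Content  string  `json:"content"`
}

// metaContent is a hypothetical helper (not part of the package). It returns
// the content of the first tag whose name or property equals key.
func metaContent(tags []MetaTag, key string) (string, bool) {
	for _, t := range tags {
		if t.Name != nil && *t.Name == key {
			return t.Content, true
		}
		if t.Property != nil && *t.Property == key {
			return t.Content, true
		}
	}
	return "", false
}

func main() {
	name, prop := "author", "og:title"
	tags := []MetaTag{
		{Name: &name, Content: "Jane Doe"},
		{Property: &prop, Content: "My Article"},
	}
	if v, ok := metaContent(tags, "og:title"); ok {
		fmt.Println(v) // prints My Article
	}
}
```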

type Options

type Options struct {
	// Markdown converts the extracted HTML content to Markdown (Go-side).
	Markdown bool `json:"-"`

	// RemoveSmallImages toggles removal of small/tracking images.
	RemoveSmallImages *bool `json:"removeSmallImages,omitempty"`
	// RemoveHiddenElements toggles removal of hidden DOM elements.
	RemoveHiddenElements *bool `json:"removeHiddenElements,omitempty"`
	// RemoveLowScoring toggles removal of low-scoring content blocks.
	RemoveLowScoring *bool `json:"removeLowScoring,omitempty"`
	// RemoveExactSelectors toggles removal via exact CSS selectors.
	RemoveExactSelectors *bool `json:"removeExactSelectors,omitempty"`
	// RemovePartialSelectors toggles removal via partial class/id matching.
	RemovePartialSelectors *bool `json:"removePartialSelectors,omitempty"`
	// RemoveContentPatterns toggles content-pattern-based removal.
	RemoveContentPatterns *bool `json:"removeContentPatterns,omitempty"`
	// Standardize toggles HTML normalization (headings, code blocks, etc.).
	Standardize *bool `json:"standardize,omitempty"`
	// Debug enables debug output from the defuddle pipeline.
	Debug bool `json:"debug,omitempty"`
}

Options controls parsing behavior.
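
The *bool fields combined with omitempty suggest a three-state setting: nil leaves the field out of the JSON entirely, while a pointer to false is sent explicitly. How the package forwards options to the JavaScript side is not documented here, so the sketch below only shows what encoding/json does with the documented tags. It redeclares a subset of Options locally, and ptr and encode are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Options mirrors a subset of the documented struct and its JSON tags.
type Options struct {
	Markdown             bool  `json:"-"`
	RemoveSmallImages    *bool `json:"removeSmallImages,omitempty"`
	RemoveHiddenElements *bool `json:"removeHiddenElements,omitempty"`
	Debug                bool  `json:"debug,omitempty"`
}

// ptr is a hypothetical helper that returns a pointer to a copy of v.
func ptr[T any](v T) *T { return &v }

// encode returns the JSON form of opts.
func encode(opts Options) string {
	b, err := json.Marshal(opts)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Markdown is tagged "-", nil pointers and a false Debug are omitted.
	fmt.Println(encode(Options{Markdown: true})) // prints {}

	// A non-nil pointer is kept even when it points to false.
	fmt.Println(encode(Options{RemoveSmallImages: ptr(false)})) // prints {"removeSmallImages":false}
}
```

With the real package, the equivalent call would set the field on defuddle.Options in the same way, using any pointer-to-bool value.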

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser wraps a QuickJS runtime with the defuddle bundle pre-loaded.

A Parser is safe for sequential use but NOT for concurrent use from multiple goroutines. For concurrent workloads, create one Parser per goroutine or use a sync.Pool.

func NewParser

func NewParser() (*Parser, error)

NewParser creates a new Parser instance. This loads the QuickJS WebAssembly runtime and evaluates the defuddle JS bundle (~450ms cold start). Reuse the parser across multiple Parse calls to amortize this cost.

func (*Parser) Close

func (p *Parser) Close()

Close releases the underlying QuickJS runtime. Always defer this after NewParser.

func (*Parser) Parse

func (p *Parser) Parse(html, url string, opts *Options) (*Result, error)

Parse extracts main content from a raw HTML string.

The url parameter is used for resolving relative links and matching site-specific extractors. Pass an empty string if unknown.
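
To make "resolving relative links" concrete, here is a small sketch using Go's net/url. The package performs this step inside the JavaScript bundle, so this shows the general idea only and is not the package's implementation. resolve is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve returns href made absolute against pageURL.
func resolve(pageURL, href string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	abs, err := resolve("https://example.com/blog/post", "../images/a.png")
	if err != nil {
		panic(err)
	}
	fmt.Println(abs) // prints https://example.com/images/a.png
}
```

Without a page URL there is no base to resolve against, which is why relative links in the output are only reliable when url is supplied.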

type Result

type Result struct {
	// Content is the extracted main content as clean HTML.
	Content string `json:"content"`
	// Title is the page title.
	Title string `json:"title"`
	// Description is the meta description.
	Description string `json:"description"`
	// Domain is the hostname (e.g. "example.com").
	Domain string `json:"domain"`
	// Favicon is the favicon URL.
	Favicon string `json:"favicon"`
	// Image is the Open Graph or lead image URL.
	Image string `json:"image"`
	// Language is the content language (e.g. "en").
	Language string `json:"language"`
	// Published is the publish date (ISO 8601 when available).
	Published string `json:"published"`
	// Author is the author name.
	Author string `json:"author"`
	// Site is the site name.
	Site string `json:"site"`
	// WordCount is the word count of extracted content.
	WordCount int `json:"wordCount"`
	// ParseTime is the JS-side parse time in milliseconds.
	ParseTime int `json:"parseTime"`
	// MetaTags contains all meta tags from <head>.
	MetaTags []MetaTag `json:"metaTags,omitempty"`
	// SchemaOrgData contains parsed JSON-LD schema.org data.
	SchemaOrgData json.RawMessage `json:"schemaOrgData,omitempty"`
	// Markdown is the content converted to Markdown.
	// Only populated when Options.Markdown is true.
	Markdown string `json:"markdown,omitempty"`
}

Result holds the parsed output from defuddle.

Directories

Path           Synopsis
cmd/defuddle   command
