defuddle

package module

Version: v1.2.0 (latest) | Published: Sep 11, 2025 | License: MIT | Imports: 19 | Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/kaptinlin/defuddle-go

README ¶

Defuddle Go

A Go implementation of the Defuddle TypeScript library for intelligent web content extraction. Defuddle Go extracts clean, readable content from HTML documents using advanced algorithms to remove clutter while preserving meaningful content.

Available as both a Go library and a command-line tool compatible with the original Defuddle CLI.

Features

🧠 Intelligent Content Extraction: Advanced algorithms to identify and extract main content
🎯 Site-Specific Extractors: Built-in support for popular platforms (ChatGPT, Grok, Hacker News, Reddit, etc.)
🧹 Clutter Removal: Automatically removes ads, navigation, sidebars, and other non-content elements
📱 Mobile-First: Applies mobile styles for better content detection
🔍 Metadata Extraction: Extracts titles, descriptions, authors, images, and more
🏷️ Schema.org Support: Parses structured data using JSON-LD processing
📝 Markdown Conversion: High-quality HTML to Markdown conversion
🔧 Element Processing: Advanced processing for code blocks, images, math formulas, and more
🐛 Debug Mode: Detailed processing information for troubleshooting
⚡ High Performance: Optimized for Go with efficient DOM processing
🖥️ CLI Tool: Powerful command-line interface for extracting content

Installation

CLI Tool

Download Pre-built Binaries

Download the latest binary for your platform from the releases page.

Install with Go

go install github.com/kaptinlin/defuddle-go/cmd/defuddle@latest

Install from Source

git clone https://github.com/kaptinlin/defuddle-go.git
cd defuddle-go
make build-cli
sudo make install-cli

Go Library

go get github.com/kaptinlin/defuddle-go

CLI Usage

The defuddle command-line tool provides a simple interface for extracting content from web pages and HTML files, with full compatibility with the original TypeScript CLI.

Basic Usage

# Extract content from a URL
defuddle parse https://example.com/article

# Extract from local HTML file
defuddle parse article.html

# Convert to Markdown
defuddle parse https://example.com/article --markdown

# Get JSON output with metadata
defuddle parse https://example.com/article --json

# Extract specific properties
defuddle parse https://example.com/article --property title
defuddle parse https://example.com/article --property author
defuddle parse https://example.com/article --property description

# Save output to file
defuddle parse https://example.com/article --markdown --output article.md

# Add custom headers
defuddle parse https://example.com/article --header "Authorization: Bearer token123"

# Use proxy for requests
defuddle parse https://example.com/article --proxy http://localhost:8080

# Use custom timeout and user agent
defuddle parse https://example.com/article --timeout 60s --user-agent "MyBot/1.0"

CLI Options

Option	Short	Description
`--output`	`-o`	Output file path (default: stdout)
`--markdown`	`-m`	Convert content to markdown format
`--md`		Alias for --markdown
`--json`	`-j`	Output as JSON with metadata and content
`--property`	`-p`	Extract a specific property
`--debug`		Enable debug mode
`--proxy`		Proxy URL (e.g., http://localhost:8080, socks5://localhost:1080)
`--user-agent`		Custom user agent string
`--timeout`		Request timeout (default: 30s)
`--header`	`-H`	Custom headers in format 'Key: Value' (can be used multiple times)
`--help`	`-h`	Show help message
`--version`	`-v`	Show version information
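
The `--header` option takes values in 'Key: Value' form and can be repeated. As a rough illustration of that format only, the sketch below turns such strings into an `http.Header`. `parseHeaders` is a hypothetical helper written for this example and is not taken from the CLI's source.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// parseHeaders is a hypothetical helper: it converts repeated
// "Key: Value" strings into an http.Header.
func parseHeaders(raw []string) (http.Header, error) {
	h := make(http.Header)
	for _, line := range raw {
		// Split on the first colon only, so values may contain colons.
		key, value, ok := strings.Cut(line, ":")
		key = strings.TrimSpace(key)
		if !ok || key == "" {
			return nil, fmt.Errorf("invalid header %q: want 'Key: Value'", line)
		}
		h.Add(key, strings.TrimSpace(value))
	}
	return h, nil
}

func main() {
	h, err := parseHeaders([]string{
		"Authorization: Bearer token123",
		"Accept-Language: en",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(h.Get("Authorization")) // Bearer token123
}
```

Splitting on the first colon keeps values such as URLs intact, and `Header.Add` allows the same key to appear more than once.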

CLI Examples

# Extract Reddit post title
defuddle parse https://www.reddit.com/r/golang/comments/xyz/... --property title

# Get full JSON metadata
defuddle parse https://news.ycombinator.com/item?id=123456 --json

# Convert article to Markdown and save
defuddle parse https://blog.example.com/post --markdown --output post.md

# Debug parsing process
defuddle parse https://example.com/article --debug

# Access site behind authentication
defuddle parse https://secured.example.com/article --header "Authorization: Bearer your-token"

# Handle slow connections
defuddle parse https://slow-site.example.com/article --timeout 120s

Library Usage

Basic Content Extraction

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/kaptinlin/defuddle-go"
)

func main() {
    html := `
    <!DOCTYPE html>
    <html>
    <head>
        <title>Sample Article</title>
        <meta name="description" content="This is a sample article">
        <meta name="author" content="John Doe">
    </head>
    <body>
        <header>Navigation</header>
        <main>
            <article>
                <h1>Sample Article</h1>
                <p>This is the main content of the article.</p>
                <p>It contains multiple paragraphs of text.</p>
            </article>
        </main>
        <aside>Sidebar content</aside>
        <footer>Footer content</footer>
    </body>
    </html>
    `

    // Create Defuddle instance
    defuddleInstance, err := defuddle.NewDefuddle(html, nil)
    if err != nil {
        log.Fatal(err)
    }

    // Parse the content
    result, err := defuddleInstance.Parse(context.Background())
    if err != nil {
        log.Fatal(err)
    }

    // Output results
    fmt.Printf("Title: %s\n", result.Title)
    fmt.Printf("Author: %s\n", result.Author)
    fmt.Printf("Description: %s\n", result.Description)
    fmt.Printf("Word Count: %d\n", result.WordCount)
    fmt.Printf("Parse Time: %dms\n", result.ParseTime)
    fmt.Printf("Content: %s\n", result.Content)
}

Parsing from URL

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/kaptinlin/defuddle-go"
)

func main() {
    options := &defuddle.Options{
        Debug: true,
        URL:   "https://example.com/article",
    }

    result, err := defuddle.ParseFromURL(context.Background(), "https://example.com/article", options)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Title: %s\n", result.Title)
    fmt.Printf("Content length: %d\n", len(result.Content))
}

Advanced Usage with All Options

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/kaptinlin/defuddle-go"
)

func main() {
    html := `<html>...</html>` // Your HTML content here

    options := &defuddle.Options{
        Debug:                  true,  // Enable debug mode
        Markdown:               true,  // Convert content to Markdown
        SeparateMarkdown:       true,  // Keep both HTML and Markdown
        URL:                    "https://example.com/article",
        ProcessCode:            true,  // Process code blocks
        ProcessImages:          true,  // Filter and optimize images
        ProcessHeadings:        true,  // Standardize headings
        ProcessMath:            true,  // Handle mathematical formulas
        ProcessFootnotes:       true,  // Extract footnotes
        ProcessRoles:           true,  // Convert ARIA roles to semantic HTML
        RemoveExactSelectors:   true,  // Remove exact clutter selectors
        RemovePartialSelectors: true,  // Remove partial clutter selectors
    }

    defuddleInstance, err := defuddle.NewDefuddle(html, options)
    if err != nil {
        log.Fatal(err)
    }

    result, err := defuddleInstance.Parse(context.Background())
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Title: %s\n", result.Title)
    if result.ContentMarkdown != nil {
        fmt.Printf("Markdown content: %s\n", *result.ContentMarkdown)
    }

    if result.DebugInfo != nil {
        fmt.Printf("Processing steps: %d\n", len(result.DebugInfo.ProcessingSteps))
        fmt.Printf("Original elements: %d\n", result.DebugInfo.Statistics.OriginalElementCount)
        fmt.Printf("Final elements: %d\n", result.DebugInfo.Statistics.FinalElementCount)
    }
}

API Reference

Result Structure

The Result object contains the following fields:

Field	Type	Description
`Title`	string	Article title
`Author`	string	Article author
`Description`	string	Article description or summary
`Domain`	string	Website domain
`Favicon`	string	Website favicon URL
`Image`	string	Main image URL
`Published`	string	Publication date
`Site`	string	Website name
`Content`	string	Cleaned HTML content
`ContentMarkdown`	*string	Markdown version (if enabled)
`WordCount`	int	Word count in extracted content
`ParseTime`	int64	Parse time in milliseconds
`SchemaOrgData`	interface{}	Schema.org structured data
`MetaTags`	[]MetaTag	Document meta tags
`ExtractorType`	*string	Extractor type used
`DebugInfo`	*debug.Info	Debug information (if enabled)

Configuration Options

Option	Type	Default	Description
`Debug`	bool	false	Enable debug logging
`URL`	string	""	Source URL for the content
`Markdown`	bool	false	Convert content to Markdown
`SeparateMarkdown`	bool	false	Keep both HTML and Markdown
`RemoveExactSelectors`	bool	true	Remove exact clutter matches
`RemovePartialSelectors`	bool	true	Remove partial clutter matches
`RemoveImages`	bool	false	Remove all images from extracted content
`ProcessCode`	bool	false	Process code blocks
`ProcessImages`	bool	false	Process and optimize images
`ProcessHeadings`	bool	false	Standardize heading structure
`ProcessMath`	bool	false	Process mathematical formulas
`ProcessFootnotes`	bool	false	Extract and format footnotes
`ProcessRoles`	bool	false	Convert ARIA roles to semantic HTML

Core Functions

`NewDefuddle(html string, options *Options) (*Defuddle, error)`

Creates a new Defuddle instance from HTML content.

`ParseFromURL(ctx context.Context, url string, options *Options) (*Result, error)`

Fetches content from a URL and parses it directly.

`(d *Defuddle) Parse(ctx context.Context) (*Result, error)`

Parses the HTML content and returns extracted results.

Content Processing

Processing Pipeline

Defuddle Go processes content through these stages:

Schema.org Extraction - Extracts structured data using JSON-LD
Site-Specific Detection - Uses specialized extractors when available
Main Content Detection - Identifies primary content areas
Clutter Removal - Removes navigation, ads, and decorative elements
Content Standardization - Normalizes HTML structure
Element Processing - Processes code, math, images, and footnotes
Markdown Conversion - Converts to Markdown if requested

HTML Standardization

Headings

Duplicate H1/H2 headings matching the title are removed
Heading hierarchy is normalized
Navigation links within headings are removed

Code Blocks

Code blocks are standardized with preserved language information:

<pre><code data-lang="javascript" class="language-javascript">
console.log("Hello, World!");
</code></pre>
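
To make the target shape concrete, here is a minimal sketch that renders a snippet into the same `<pre><code data-lang=... class="language-...">` form. `standardCodeBlock` is a hypothetical helper for illustration; how the library itself detects the language and rewrites existing markup is not shown here.

```go
package main

import (
	"fmt"
	"html"
)

// standardCodeBlock is a hypothetical helper that renders source code in
// the standardized <pre><code> shape, escaping both the language name
// and the code so the result is valid HTML.
func standardCodeBlock(lang, code string) string {
	l := html.EscapeString(lang)
	return `<pre><code data-lang="` + l + `" class="language-` + l + `">` +
		"\n" + html.EscapeString(code) + "\n</code></pre>"
}

func main() {
	fmt.Println(standardCodeBlock("go", "if a < b { return }"))
}
```

Note that `html.EscapeString` also escapes quotes, so quoted strings in the code come out as character references rather than raw quotes.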

Footnotes

Footnotes are converted to a standard format with proper linking:

Text with footnote<sup id="fnref:1"><a href="#fn:1">1</a></sup>.

<div id="footnotes">
  <ol>
    <li class="footnote" id="fn:1">
      <p>Footnote content <a href="#fnref:1" class="footnote-backref">↩</a></p>
    </li>
  </ol>
</div>
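
The linking scheme above pairs each reference id `fnref:N` with a footnote id `fn:N`. The following sketch generates that markup from a list of footnote texts; both functions are hypothetical helpers for illustration and do not reflect the library's internal implementation.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// footnoteRef renders the inline reference for footnote n.
func footnoteRef(n int) string {
	return fmt.Sprintf(`<sup id="fnref:%d"><a href="#fn:%d">%d</a></sup>`, n, n, n)
}

// footnoteList renders footnote bodies in the standardized list form.
// Footnotes are numbered from 1 in slice order; text is escaped as plain text.
func footnoteList(notes []string) string {
	var b strings.Builder
	b.WriteString("<div id=\"footnotes\">\n  <ol>\n")
	for i, note := range notes {
		n := i + 1
		fmt.Fprintf(&b, "    <li class=\"footnote\" id=\"fn:%d\">\n", n)
		fmt.Fprintf(&b, "      <p>%s <a href=\"#fnref:%d\" class=\"footnote-backref\">↩</a></p>\n",
			html.EscapeString(note), n)
		b.WriteString("    </li>\n")
	}
	b.WriteString("  </ol>\n</div>")
	return b.String()
}

func main() {
	fmt.Println("Text with footnote" + footnoteRef(1) + ".")
	fmt.Println(footnoteList([]string{"Footnote content"}))
}
```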

Site-Specific Extractors

Built-in extractors automatically activate for supported platforms:

ChatGPT - Extracts conversation content and metadata
Grok - Extracts AI conversation content
Hacker News - Extracts posts and comments with proper threading

Custom extractors can be implemented using the BaseExtractor interface.

Examples

The examples/ directory contains ready-to-run examples:

Basic - Simple content extraction
Advanced - Full feature demonstration
Markdown - HTML to Markdown conversion
Extractors - Site-specific extraction
Custom Extractor - Building custom extractors

Run examples with:

cd examples/basic && go run main.go
cd examples/advanced && go run main.go
cd examples/markdown && go run main.go
cd examples/extractors && go run main.go
cd examples/custom_extractor && go run custom_extractor.go

Performance

Typical performance characteristics:

Processing Speed: 5-15ms for standard web pages
Memory Usage: Optimized with object pooling and efficient DOM processing
Concurrent Safe: Can process multiple documents simultaneously
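
One way to process several documents at once is to give each document its own goroutine and its own result slot. The sketch below shows that pattern with a stand-in `countWords` function so it runs without the library; in real use each goroutine would presumably create its own parser instance (for example via `ParseFromString`), which is an assumption about usage rather than something this page specifies.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// countWords stands in for a per-document parse call so that the
// example runs without the library.
func countWords(doc string) int {
	return len(strings.Fields(doc))
}

// processAll runs parse on every document concurrently and returns
// the results in input order.
func processAll(docs []string, parse func(string) int) []int {
	results := make([]int, len(docs))
	var wg sync.WaitGroup
	for i, doc := range docs {
		wg.Add(1)
		go func(i int, doc string) {
			defer wg.Done()
			results[i] = parse(doc) // each goroutine writes only its own slot
		}(i, doc)
	}
	wg.Wait()
	return results
}

func main() {
	docs := []string{"one two three", "four five", ""}
	fmt.Println(processAll(docs, countWords)) // [3 2 0]
}
```

Because every goroutine writes to a distinct index, no mutex is needed around the results slice.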

Dependencies

goquery - DOM manipulation and traversal
requests - HTTP client for URL fetching
html-to-markdown - HTML to Markdown conversion
json-gold - JSON-LD processing

Contributing

Contributions are welcome. Please open an issue or submit a pull request.

License

MIT License - see LICENSE file for details.

Acknowledgments

Original Defuddle TypeScript library by Steph Ango (@kepano)
Original Defuddle CLI by Steph Ango (@kepano)
Inspired by Mozilla's Readability algorithm

Documentation ¶

Overview ¶

Package defuddle provides web content extraction and demuddling capabilities.

Index ¶

type Defuddle
- func NewDefuddle(html string, options *Options) (*Defuddle, error)
- func (d *Defuddle) Parse(ctx context.Context) (*Result, error)
type ExtractedContent
type ExtractorVariables
type MetaTag
type Metadata
type Options
type Result
- func ParseFromString(ctx context.Context, html string, options *Options) (*Result, error)
- func ParseFromURL(ctx context.Context, url string, options *Options) (*Result, error)
type StyleChange

Types ¶

type Defuddle ¶

type Defuddle struct {
	// contains filtered or unexported fields
}

Defuddle represents a document parser instance

func NewDefuddle ¶

func NewDefuddle(html string, options *Options) (*Defuddle, error)

NewDefuddle creates a new Defuddle instance from HTML content. JavaScript original code:

constructor(document: Document, options: DefuddleOptions = {}) {
  this.doc = document;
  this.options = options;
}

func (*Defuddle) Parse ¶

func (d *Defuddle) Parse(ctx context.Context) (*Result, error)

Parse extracts the main content from the document. JavaScript original code:

parse(): DefuddleResponse {
  // Try first with default settings
  const result = this.parseInternal();

  // If result has very little content, try again without clutter removal
  if (result.wordCount < 200) {
    console.log('Initial parse returned very little content, trying again');
    const retryResult = this.parseInternal({
      removePartialSelectors: false
    });

    // Return the result with more content
    if (retryResult.wordCount > result.wordCount) {
      this._log('Retry produced more content');
      return retryResult;
    }
  }

  return result;
}
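
The retry rule in the original code can be expressed in Go roughly as follows. This is a sketch of the logic only: `result`, `parseWithRetry`, and the `parseInternal` callback are hypothetical names for this example, and the Go port's actual internals may differ.

```go
package main

import "fmt"

// result holds only the fields the retry logic needs.
type result struct {
	WordCount int
	Content   string
}

// parseWithRetry mirrors the TypeScript logic: parse once with
// partial-selector removal on; if fewer than 200 words come back,
// parse again with it off and keep whichever result has more words.
// parseInternal is a hypothetical stand-in for the real extraction step.
func parseWithRetry(parseInternal func(removePartialSelectors bool) result) result {
	first := parseInternal(true)
	if first.WordCount < 200 {
		if retry := parseInternal(false); retry.WordCount > first.WordCount {
			return retry
		}
	}
	return first
}

func main() {
	// Fake parser: aggressive clutter removal loses most of the text.
	fake := func(removePartial bool) result {
		if removePartial {
			return result{WordCount: 50, Content: "short"}
		}
		return result{WordCount: 400, Content: "long"}
	}
	fmt.Println(parseWithRetry(fake).Content) // long
}
```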

type ExtractedContent ¶

type ExtractedContent struct {
	Title       *string             `json:"title,omitempty"`
	Author      *string             `json:"author,omitempty"`
	Published   *string             `json:"published,omitempty"`
	Content     *string             `json:"content,omitempty"`
	ContentHTML *string             `json:"contentHtml,omitempty"`
	Variables   *ExtractorVariables `json:"variables,omitempty"`
}

ExtractedContent represents content extracted by site-specific extractors. JavaScript original code:

export interface ExtractedContent {
  title?: string;
  author?: string;
  published?: string;
  content?: string;
  contentHtml?: string;
  variables?: ExtractorVariables;
}
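
The optional TypeScript fields become `*string` in Go, so callers need a nil check before dereferencing. A small sketch of one way to handle that, using a local mirror of two fields and a hypothetical `stringOr` helper (neither is part of the package):

```go
package main

import "fmt"

// extractedContent mirrors a subset of the ExtractedContent fields.
type extractedContent struct {
	Title  *string
	Author *string
}

// stringOr returns *s, or fallback when s is nil.
func stringOr(s *string, fallback string) string {
	if s == nil {
		return fallback
	}
	return *s
}

func main() {
	title := "Sample Article"
	c := extractedContent{Title: &title} // Author is left unset (nil)

	fmt.Println(stringOr(c.Title, "(untitled)")) // Sample Article
	fmt.Println(stringOr(c.Author, "(unknown)")) // (unknown)
}
```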

type ExtractorVariables ¶

type ExtractorVariables map[string]string

ExtractorVariables represents variables extracted by site-specific extractors. JavaScript original code:

export interface ExtractorVariables {
  [key: string]: string;
}

type MetaTag ¶

type MetaTag = metadata.MetaTag

MetaTag represents a meta tag item from HTML. This is an alias to the internal metadata.MetaTag type.

type Metadata ¶

type Metadata = metadata.Metadata

Metadata represents extracted metadata from a document. This is an alias to the internal metadata.Metadata type.

type Options ¶

type Options struct {
	// Enable debug logging
	Debug bool `json:"debug,omitempty"`

	// URL of the page being parsed
	URL string `json:"url,omitempty"`

	// Convert output to Markdown
	Markdown bool `json:"markdown,omitempty"`

	// Include Markdown in the response
	SeparateMarkdown bool `json:"separateMarkdown,omitempty"`

	// Whether to remove elements matching exact selectors like ads, social buttons, etc.
	// Defaults to true.
	RemoveExactSelectors bool `json:"removeExactSelectors,omitempty"`

	// Whether to remove elements matching partial selectors like ads, social buttons, etc.
	// Defaults to true.
	RemovePartialSelectors bool `json:"removePartialSelectors,omitempty"`

	// Remove images from the extracted content
	// Defaults to false.
	RemoveImages bool `json:"removeImages,omitempty"`

	// Element processing options
	ProcessCode      bool                                 `json:"processCode,omitempty"`
	ProcessImages    bool                                 `json:"processImages,omitempty"`
	ProcessHeadings  bool                                 `json:"processHeadings,omitempty"`
	ProcessMath      bool                                 `json:"processMath,omitempty"`
	ProcessFootnotes bool                                 `json:"processFootnotes,omitempty"`
	ProcessRoles     bool                                 `json:"processRoles,omitempty"`
	CodeOptions      *elements.CodeBlockProcessingOptions `json:"codeOptions,omitempty"`
	ImageOptions     *elements.ImageProcessingOptions     `json:"imageOptions,omitempty"`
	HeadingOptions   *elements.HeadingProcessingOptions   `json:"headingOptions,omitempty"`
	MathOptions      *elements.MathProcessingOptions      `json:"mathOptions,omitempty"`
	FootnoteOptions  *elements.FootnoteProcessingOptions  `json:"footnoteOptions,omitempty"`
	RoleOptions      *elements.RoleProcessingOptions      `json:"roleOptions,omitempty"`
}

Options represents configuration options for Defuddle parsing. JavaScript original code:

export interface DefuddleOptions {
  debug?: boolean;
  url?: string;
  markdown?: boolean;
  separateMarkdown?: boolean;
  removeExactSelectors?: boolean;
  removePartialSelectors?: boolean;
}
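
Every boolean in Options carries `omitempty`, which affects how the struct serializes. The sketch below uses a local mirror of four fields with the same JSON tags (so it runs without the library) to show that `false` values are dropped from the output; `options` and `encode` are names made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// options mirrors a few fields of the Options struct, with the same
// JSON tags, so the example runs without importing the library.
type options struct {
	Debug        bool   `json:"debug,omitempty"`
	URL          string `json:"url,omitempty"`
	Markdown     bool   `json:"markdown,omitempty"`
	RemoveImages bool   `json:"removeImages,omitempty"`
}

// encode returns the JSON form of o.
func encode(o options) string {
	b, err := json.Marshal(o)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(encode(options{URL: "https://example.com/article", Markdown: true}))
	// {"url":"https://example.com/article","markdown":true}

	// An explicit false is dropped as well, so it looks the same as "unset".
	fmt.Println(encode(options{RemoveImages: false})) // {}
}
```

This matters for the fields documented as defaulting to true: in serialized form, an explicit `false` and an unset field are indistinguishable.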

type Result ¶

type Result struct {
	Metadata
	Content         string      `json:"content"`
	ContentMarkdown *string     `json:"contentMarkdown,omitempty"`
	ExtractorType   *string     `json:"extractorType,omitempty"`
	MetaTags        []MetaTag   `json:"metaTags,omitempty"`
	DebugInfo       *debug.Info `json:"debugInfo,omitempty"`
}

Result represents the complete response from Defuddle parsing. JavaScript original code:

export interface DefuddleResponse extends DefuddleMetadata {
  content: string;
  contentMarkdown?: string;
  extractorType?: string;
  metaTags?: MetaTagItem[];
}
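
Because Result embeds Metadata, the metadata fields are promoted: they can be read directly on the result, and `encoding/json` flattens them into the same object. The sketch below demonstrates this with local stand-in types; the JSON tags on the stand-in `Metadata` are assumptions, since the real tags live in the internal metadata package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata is a local stand-in with two fields; the JSON tags are assumed.
type Metadata struct {
	Title  string `json:"title"`
	Author string `json:"author"`
}

// Result embeds Metadata, as the package's Result type does.
type Result struct {
	Metadata
	Content string `json:"content"`
}

// encode returns the JSON form of r.
func encode(r Result) string {
	b, err := json.Marshal(r)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	r := Result{
		Metadata: Metadata{Title: "Sample Article", Author: "John Doe"},
		Content:  "Hello",
	}
	fmt.Println(r.Title) // promoted field: Sample Article
	fmt.Println(encode(r))
	// {"title":"Sample Article","author":"John Doe","content":"Hello"}
}
```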

func ParseFromString ¶ added in v0.2.0

func ParseFromString(ctx context.Context, html string, options *Options) (*Result, error)

ParseFromString parses HTML content directly from a string. This is useful when you already have the HTML content (e.g., from browser automation).

func ParseFromURL ¶

func ParseFromURL(ctx context.Context, url string, options *Options) (*Result, error)

ParseFromURL fetches content from a URL and parses it. JavaScript original code:

// This corresponds to Node.js usage: Defuddle(htmlOrDom, url?, options?)

type StyleChange ¶

type StyleChange struct {
	Selector string
	Styles   string
}

StyleChange represents a CSS style change for mobile

Directories ¶

Path	Synopsis
cmd/defuddle (command)	Package main provides the defuddle CLI application.
examples/advanced (command)	Package main demonstrates advanced defuddle usage.
examples/basic (command)	Package main demonstrates basic defuddle usage.
examples/custom_extractor (command)	Package main demonstrates custom extractor usage.
examples/extractors (command)	Package main demonstrates extractors usage.
examples/markdown (command)	Package main demonstrates markdown conversion.
extractors	Package extractors provides site-specific content extraction functionality.
internal/constants	Package constants provides configuration constants and selectors for the defuddle content extraction system.
internal/debug	Package debug provides debugging functionality for the defuddle content extraction system.
internal/elements	Package elements provides enhanced element processing functionality. This module handles code block processing including syntax highlighting, language detection, and code formatting.
internal/markdown	Package markdown provides HTML to Markdown conversion functionality.
internal/metadata	Package metadata provides functionality for extracting and processing document metadata.
internal/pool	Package pool provides memory pooling utilities for the defuddle content extraction system.
internal/scoring	Package scoring provides content scoring functionality for the defuddle content extraction system.
internal/standardize	Package standardize provides content standardization functionality for the defuddle content extraction system.