defuddle

Version: v1.2.0
Published: Sep 11, 2025 License: MIT Imports: 19 Imported by: 0

README

Defuddle Go

A Go implementation of the Defuddle TypeScript library for intelligent web content extraction. Defuddle Go extracts clean, readable content from HTML documents using advanced algorithms to remove clutter while preserving meaningful content.

Available as both a Go library and a command-line tool compatible with the original Defuddle CLI.

Features

  • 🧠 Intelligent Content Extraction: Advanced algorithms to identify and extract main content
  • 🎯 Site-Specific Extractors: Built-in support for popular platforms (ChatGPT, Grok, Hacker News, Reddit, etc.)
  • 🧹 Clutter Removal: Automatically removes ads, navigation, sidebars, and other non-content elements
  • 📱 Mobile-First: Applies mobile styles for better content detection
  • 🔍 Metadata Extraction: Extracts titles, descriptions, authors, images, and more
  • 🏷️ Schema.org Support: Parses structured data using JSON-LD processing
  • 📝 Markdown Conversion: High-quality HTML to Markdown conversion
  • 🔧 Element Processing: Advanced processing for code blocks, images, math formulas, and more
  • 🐛 Debug Mode: Detailed processing information for troubleshooting
  • ⚡ High Performance: Optimized for Go with efficient DOM processing
  • 🖥️ CLI Tool: Powerful command-line interface for extracting content

Installation

CLI Tool
Download Pre-built Binaries

Download the latest binary for your platform from the releases page.

Install with Go
go install github.com/kaptinlin/defuddle-go/cmd/defuddle@latest
Install from Source
git clone https://github.com/kaptinlin/defuddle-go.git
cd defuddle-go
make build-cli
sudo make install-cli
Go Library
go get github.com/kaptinlin/defuddle-go

CLI Usage

The defuddle command-line tool provides a simple interface for extracting content from web pages and HTML files, with full compatibility with the original TypeScript CLI.

Basic Usage
# Extract content from a URL
defuddle parse https://example.com/article

# Extract from local HTML file
defuddle parse article.html

# Convert to Markdown
defuddle parse https://example.com/article --markdown

# Get JSON output with metadata
defuddle parse https://example.com/article --json

# Extract specific properties
defuddle parse https://example.com/article --property title
defuddle parse https://example.com/article --property author
defuddle parse https://example.com/article --property description

# Save output to file
defuddle parse https://example.com/article --markdown --output article.md

# Add custom headers
defuddle parse https://example.com/article --header "Authorization: Bearer token123"

# Use proxy for requests
defuddle parse https://example.com/article --proxy http://localhost:8080

# Use custom timeout and user agent
defuddle parse https://example.com/article --timeout 60s --user-agent "MyBot/1.0"
CLI Options
Option        Short  Description
--output      -o     Output file path (default: stdout)
--markdown    -m     Convert content to Markdown format
--md                 Alias for --markdown
--json        -j     Output as JSON with metadata and content
--property    -p     Extract a specific property
--debug              Enable debug mode
--proxy              Proxy URL (e.g., http://localhost:8080, socks5://localhost:1080)
--user-agent         Custom user agent string
--timeout            Request timeout (default: 30s)
--header      -H     Custom headers in format 'Key: Value' (can be used multiple times)
--help        -h     Show help message
--version     -v     Show version information
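
The 'Key: Value' header format and the duration-style timeout can be handled with the standard library alone. The sketch below is illustrative and is not the CLI's own parser: parseHeader and buildHeaders are hypothetical helpers, and the sample header values are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// parseHeader splits one "Key: Value" string, the format documented for
// --header, into its key and value. Hypothetical helper for illustration.
func parseHeader(raw string) (key, value string, err error) {
	k, v, ok := strings.Cut(raw, ":")
	key = strings.TrimSpace(k)
	if !ok || key == "" {
		return "", "", fmt.Errorf("invalid header %q: want 'Key: Value'", raw)
	}
	return key, strings.TrimSpace(v), nil
}

// buildHeaders folds repeated header values into a single http.Header,
// keeping every value when the same key appears more than once.
func buildHeaders(raws []string) (http.Header, error) {
	h := http.Header{}
	for _, raw := range raws {
		k, v, err := parseHeader(raw)
		if err != nil {
			return nil, err
		}
		h.Add(k, v)
	}
	return h, nil
}

func main() {
	headers, err := buildHeaders([]string{
		"Authorization: Bearer token123",
		"Accept: text/html",
	})
	if err != nil {
		panic(err)
	}
	// Values such as "60s" or "120s" are valid Go duration strings.
	timeout, err := time.ParseDuration("60s")
	if err != nil {
		panic(err)
	}
	fmt.Println(headers.Get("Authorization"), timeout)
}
```

Splitting on the first colon only keeps values that themselves contain colons, such as URLs, intact.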
CLI Examples
# Extract Reddit post title
defuddle parse https://www.reddit.com/r/golang/comments/xyz/... --property title

# Get full JSON metadata
defuddle parse https://news.ycombinator.com/item?id=123456 --json

# Convert article to Markdown and save
defuddle parse https://blog.example.com/post --markdown --output post.md

# Debug parsing process
defuddle parse https://example.com/article --debug

# Access site behind authentication
defuddle parse https://secured.example.com/article --header "Authorization: Bearer your-token"

# Handle slow connections
defuddle parse https://slow-site.example.com/article --timeout 120s

Library Usage

Basic Content Extraction
package main

import (
    "context"
    "fmt"
    "log"

    "github.com/kaptinlin/defuddle-go"
)

func main() {
    html := `
    <!DOCTYPE html>
    <html>
    <head>
        <title>Sample Article</title>
        <meta name="description" content="This is a sample article">
        <meta name="author" content="John Doe">
    </head>
    <body>
        <header>Navigation</header>
        <main>
            <article>
                <h1>Sample Article</h1>
                <p>This is the main content of the article.</p>
                <p>It contains multiple paragraphs of text.</p>
            </article>
        </main>
        <aside>Sidebar content</aside>
        <footer>Footer content</footer>
    </body>
    </html>
    `

    // Create Defuddle instance
    defuddleInstance, err := defuddle.NewDefuddle(html, nil)
    if err != nil {
        log.Fatal(err)
    }

    // Parse the content
    result, err := defuddleInstance.Parse(context.Background())
    if err != nil {
        log.Fatal(err)
    }

    // Output results
    fmt.Printf("Title: %s\n", result.Title)
    fmt.Printf("Author: %s\n", result.Author)
    fmt.Printf("Description: %s\n", result.Description)
    fmt.Printf("Word Count: %d\n", result.WordCount)
    fmt.Printf("Parse Time: %dms\n", result.ParseTime)
    fmt.Printf("Content: %s\n", result.Content)
}
Parsing from URL
package main

import (
    "context"
    "fmt"
    "log"

    "github.com/kaptinlin/defuddle-go"
)

func main() {
    options := &defuddle.Options{
        Debug: true,
        URL:   "https://example.com/article",
    }

    result, err := defuddle.ParseFromURL(context.Background(), "https://example.com/article", options)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Title: %s\n", result.Title)
    fmt.Printf("Content length: %d\n", len(result.Content))
}
Advanced Usage with All Options
package main

import (
    "context"
    "fmt"
    "log"

    "github.com/kaptinlin/defuddle-go"
)

func main() {
    html := `<html>...</html>` // Your HTML content here

    options := &defuddle.Options{
        Debug:                  true,  // Enable debug mode
        Markdown:               true,  // Convert content to Markdown
        SeparateMarkdown:       true,  // Keep both HTML and Markdown
        URL:                    "https://example.com/article",
        ProcessCode:            true,  // Process code blocks
        ProcessImages:          true,  // Filter and optimize images
        ProcessHeadings:        true,  // Standardize headings
        ProcessMath:            true,  // Handle mathematical formulas
        ProcessFootnotes:       true,  // Extract footnotes
        ProcessRoles:           true,  // Convert ARIA roles to semantic HTML
        RemoveExactSelectors:   true,  // Remove exact clutter selectors
        RemovePartialSelectors: true,  // Remove partial clutter selectors
    }

    defuddleInstance, err := defuddle.NewDefuddle(html, options)
    if err != nil {
        log.Fatal(err)
    }

    result, err := defuddleInstance.Parse(context.Background())
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Title: %s\n", result.Title)
    if result.ContentMarkdown != nil {
        fmt.Printf("Markdown content: %s\n", *result.ContentMarkdown)
    }

    if result.DebugInfo != nil {
        fmt.Printf("Processing steps: %d\n", len(result.DebugInfo.ProcessingSteps))
        fmt.Printf("Original elements: %d\n", result.DebugInfo.Statistics.OriginalElementCount)
        fmt.Printf("Final elements: %d\n", result.DebugInfo.Statistics.FinalElementCount)
    }
}

API Reference

Result Structure

The Result object contains the following fields:

Field            Type         Description
Title            string       Article title
Author           string       Article author
Description      string       Article description or summary
Domain           string       Website domain
Favicon          string       Website favicon URL
Image            string       Main image URL
Published        string       Publication date
Site             string       Website name
Content          string       Cleaned HTML content
ContentMarkdown  *string      Markdown version (if enabled)
WordCount        int          Word count in extracted content
ParseTime        int64        Parse time in milliseconds
SchemaOrgData    interface{}  Schema.org structured data
MetaTags         []MetaTag    Document meta tags
ExtractorType    *string      Extractor type used
DebugInfo        *debug.Info  Debug information (if enabled)
Configuration Options
Option                  Type    Default  Description
Debug                   bool    false    Enable debug logging
URL                     string  ""       Source URL for the content
Markdown                bool    false    Convert content to Markdown
SeparateMarkdown        bool    false    Keep both HTML and Markdown
RemoveExactSelectors    bool    true     Remove exact clutter matches
RemovePartialSelectors  bool    true     Remove partial clutter matches
RemoveImages            bool    false    Remove all images from extracted content
ProcessCode             bool    false    Process code blocks
ProcessImages           bool    false    Process and optimize images
ProcessHeadings         bool    false    Standardize heading structure
ProcessMath             bool    false    Process mathematical formulas
ProcessFootnotes        bool    false    Extract and format footnotes
ProcessRoles            bool    false    Convert ARIA roles to semantic HTML
Core Functions
NewDefuddle(html string, options *Options) (*Defuddle, error)

Creates a new Defuddle instance from HTML content.

ParseFromURL(ctx context.Context, url string, options *Options) (*Result, error)

Fetches content from a URL and parses it directly.

Parse(ctx context.Context) (*Result, error)

Parses the HTML content and returns extracted results.

Content Processing

Processing Pipeline

Defuddle Go processes content through these stages:

  1. Schema.org Extraction - Extracts structured data using JSON-LD
  2. Site-Specific Detection - Uses specialized extractors when available
  3. Main Content Detection - Identifies primary content areas
  4. Clutter Removal - Removes navigation, ads, and decorative elements
  5. Content Standardization - Normalizes HTML structure
  6. Element Processing - Processes code, math, images, and footnotes
  7. Markdown Conversion - Converts to Markdown if requested
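
One way to picture the stages above is as an ordered list of functions, some of which are switched on by options. The following is a minimal sketch under that assumption; the doc and stage types, runPipeline, and the string-based stand-ins for DOM work are all hypothetical and do not reflect the library's internals.

```go
package main

import (
	"fmt"
	"strings"
)

// doc is a stand-in for a parsed document; the real library works on a DOM.
type doc struct {
	body  string
	steps []string // names of the stages that actually ran
}

// stage is one step of the pipeline. Every name here is hypothetical.
type stage struct {
	name    string
	enabled bool
	run     func(d *doc)
}

// runPipeline applies the enabled stages in order and records which ones ran.
func runPipeline(d *doc, stages []stage) {
	for _, s := range stages {
		if !s.enabled {
			continue
		}
		s.run(d)
		d.steps = append(d.steps, s.name)
	}
}

func main() {
	markdown := false // plays the role of an option such as Markdown
	d := &doc{body: "<nav>menu</nav><p>Hello</p>"}
	runPipeline(d, []stage{
		{"clutter removal", true, func(d *doc) {
			d.body = strings.ReplaceAll(d.body, "<nav>menu</nav>", "")
		}},
		{"markdown conversion", markdown, func(d *doc) {
			d.body = strings.NewReplacer("<p>", "", "</p>", "\n").Replace(d.body)
		}},
	})
	fmt.Println(d.body, d.steps)
}
```

Recording the stage names as they run is similar in spirit to the processing steps reported in debug mode, although the real debug structure may differ.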
HTML Standardization
Headings
  • Duplicate H1/H2 headings matching the title are removed
  • Heading hierarchy is normalized
  • Navigation links within headings are removed
Code Blocks

Code blocks are standardized with preserved language information:

<pre><code data-lang="javascript" class="language-javascript">
console.log("Hello, World!");
</code></pre>
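
As a rough illustration of the target shape, the sketch below renders a code block with the language recorded in both data-lang and a language-* class. renderCodeBlock is a hypothetical helper written for this example, not a function of the library.

```go
package main

import (
	"fmt"
	"html"
)

// renderCodeBlock emits the standardized shape shown above. The source is
// HTML-escaped so that characters such as < and & survive as text.
func renderCodeBlock(lang, source string) string {
	l := html.EscapeString(lang)
	return fmt.Sprintf(
		"<pre><code data-lang=\"%s\" class=\"language-%s\">\n%s\n</code></pre>",
		l, l, html.EscapeString(source),
	)
}

func main() {
	fmt.Println(renderCodeBlock("javascript", `console.log("Hello, World!");`))
}
```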
Footnotes

Footnotes are converted to a standard format with proper linking:

Text with footnote<sup id="fnref:1"><a href="#fn:1">1</a></sup>.

<div id="footnotes">
  <ol>
    <li class="footnote" id="fn:1">
      <p>Footnote content <a href="#fnref:1" class="footnote-backref">↩</a></p>
    </li>
  </ol>
</div>
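
The linking scheme pairs each reference id fnref:N with a footnote id fn:N, numbered from 1. A minimal sketch that generates this markup follows; footnoteRef and footnoteList are hypothetical helpers and only mirror the format shown above.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// footnoteRef renders the inline reference for footnote n.
func footnoteRef(n int) string {
	return fmt.Sprintf(`<sup id="fnref:%d"><a href="#fn:%d">%d</a></sup>`, n, n, n)
}

// footnoteList renders the footnote container. Each item links back to its
// inline reference through the matching fnref:N id.
func footnoteList(notes []string) string {
	var b strings.Builder
	b.WriteString("<div id=\"footnotes\">\n  <ol>\n")
	for i, note := range notes {
		n := i + 1 // footnotes are numbered from 1
		fmt.Fprintf(&b, "    <li class=\"footnote\" id=\"fn:%d\">\n", n)
		fmt.Fprintf(&b, "      <p>%s <a href=\"#fnref:%d\" class=\"footnote-backref\">↩</a></p>\n",
			html.EscapeString(note), n)
		b.WriteString("    </li>\n")
	}
	b.WriteString("  </ol>\n</div>")
	return b.String()
}

func main() {
	fmt.Println("Text with footnote" + footnoteRef(1) + ".")
	fmt.Println(footnoteList([]string{"Footnote content"}))
}
```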

Site-Specific Extractors

Built-in extractors automatically activate for supported platforms:

  • ChatGPT - Extracts conversation content and metadata
  • Grok - Extracts AI conversation content
  • Hacker News - Extracts posts and comments with proper threading

Custom extractors can be implemented using the BaseExtractor interface.

Examples

The examples/ directory contains ready-to-run examples. Run them with:

cd examples/basic && go run main.go
cd examples/advanced && go run main.go
cd examples/markdown && go run main.go
cd examples/extractors && go run main.go
cd examples/custom_extractor && go run custom_extractor.go

Performance

Typical performance characteristics:

  • Processing Speed: 5-15ms for standard web pages
  • Memory Usage: Optimized with object pooling and efficient DOM processing
  • Concurrent Safe: Can process multiple documents simultaneously
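
Processing several documents at once can be sketched with one goroutine per document. The example below assumes each call works on its own input and shares no state; extractFunc is a placeholder for whatever per-document extraction call you use, and the word counter in main is only a stand-in.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// extractFunc stands in for a per-document extraction call. Placeholder type.
type extractFunc func(html string) (wordCount int, err error)

// processAll runs extract over every document on its own goroutine and
// returns the results in input order. Each goroutine writes only to its own
// slice index, so no extra locking is needed.
func processAll(docs []string, extract extractFunc) ([]int, []error) {
	counts := make([]int, len(docs))
	errs := make([]error, len(docs))
	var wg sync.WaitGroup
	for i, d := range docs {
		wg.Add(1)
		go func(i int, d string) {
			defer wg.Done()
			counts[i], errs[i] = extract(d)
		}(i, d)
	}
	wg.Wait()
	return counts, errs
}

func main() {
	countWords := func(html string) (int, error) {
		return len(strings.Fields(html)), nil
	}
	counts, _ := processAll([]string{"one two", "three four five"}, countWords)
	fmt.Println(counts)
}
```

For large batches, a bounded worker pool would limit memory use better than one goroutine per document.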

Contributing

Contributions are welcome. Please open an issue or submit a pull request.

License

MIT License - see LICENSE file for details.

Documentation

Overview

Package defuddle provides web content extraction and demuddling capabilities.

Types

type Defuddle

type Defuddle struct {
	// contains filtered or unexported fields
}

Defuddle represents a document parser instance.

func NewDefuddle

func NewDefuddle(html string, options *Options) (*Defuddle, error)

NewDefuddle creates a new Defuddle instance from HTML content. Original TypeScript code:

constructor(document: Document, options: DefuddleOptions = {}) {
  this.doc = document;
  this.options = options;
}

func (*Defuddle) Parse

func (d *Defuddle) Parse(ctx context.Context) (*Result, error)

Parse extracts the main content from the document. Original TypeScript code:

parse(): DefuddleResponse {
  // Try first with default settings
  const result = this.parseInternal();

  // If result has very little content, try again without clutter removal
  if (result.wordCount < 200) {
    console.log('Initial parse returned very little content, trying again');
    const retryResult = this.parseInternal({
      removePartialSelectors: false
    });

    // Return the result with more content
    if (retryResult.wordCount > result.wordCount) {
      this._log('Retry produced more content');
      return retryResult;
    }
  }

  return result;
}
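
The retry behaviour in the original code can be expressed in Go roughly as follows. This is a sketch of the logic only, not the package's implementation: result, parseOptions, parseFunc, and parseWithRetry are hypothetical names, and the first pass is assumed to run with partial-selector removal enabled, matching the documented default.

```go
package main

import "fmt"

// result and parseOptions mirror only the fields the retry logic needs.
type result struct {
	Content   string
	WordCount int
}

type parseOptions struct {
	RemovePartialSelectors bool
}

// parseFunc stands in for a single internal parse pass. Hypothetical.
type parseFunc func(opts parseOptions) result

// minWordCount is the threshold used in the original code.
const minWordCount = 200

// parseWithRetry parses once with default settings. If the result is short,
// it retries with partial-selector removal disabled and keeps whichever
// result has more words.
func parseWithRetry(parse parseFunc) result {
	first := parse(parseOptions{RemovePartialSelectors: true})
	if first.WordCount >= minWordCount {
		return first
	}
	retry := parse(parseOptions{RemovePartialSelectors: false})
	if retry.WordCount > first.WordCount {
		return retry
	}
	return first
}

func main() {
	// A fake parser that loses most content when partial selectors are removed.
	fake := func(opts parseOptions) result {
		if opts.RemovePartialSelectors {
			return result{Content: "short", WordCount: 50}
		}
		return result{Content: "longer", WordCount: 300}
	}
	fmt.Println(parseWithRetry(fake).WordCount)
}
```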

type ExtractedContent

type ExtractedContent struct {
	Title       *string             `json:"title,omitempty"`
	Author      *string             `json:"author,omitempty"`
	Published   *string             `json:"published,omitempty"`
	Content     *string             `json:"content,omitempty"`
	ContentHTML *string             `json:"contentHtml,omitempty"`
	Variables   *ExtractorVariables `json:"variables,omitempty"`
}

ExtractedContent represents content extracted by site-specific extractors. Original TypeScript code:

export interface ExtractedContent {
  title?: string;
  author?: string;
  published?: string;
  content?: string;
  contentHtml?: string;
  variables?: ExtractorVariables;
}

type ExtractorVariables

type ExtractorVariables map[string]string

ExtractorVariables represents variables extracted by site-specific extractors. Original TypeScript code:

export interface ExtractorVariables {
  [key: string]: string;
}
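
The pointer fields and omitempty tags let the Go struct model the optional properties of the TypeScript interface: a nil pointer is left out of the JSON, while a pointer to an empty string is kept. The sketch below uses local copies of the two documented types so that it needs no dependency; strPtr, encode, and the sample values are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented types, for illustration only.
type ExtractorVariables map[string]string

type ExtractedContent struct {
	Title       *string             `json:"title,omitempty"`
	Author      *string             `json:"author,omitempty"`
	Published   *string             `json:"published,omitempty"`
	Content     *string             `json:"content,omitempty"`
	ContentHTML *string             `json:"contentHtml,omitempty"`
	Variables   *ExtractorVariables `json:"variables,omitempty"`
}

// strPtr returns a pointer to s, which is handy for optional string fields.
func strPtr(s string) *string { return &s }

// encode marshals c to a JSON string.
func encode(c ExtractedContent) (string, error) {
	b, err := json.Marshal(c)
	return string(b), err
}

func main() {
	out, err := encode(ExtractedContent{
		Title:     strPtr("Sample thread"),       // placeholder value
		Variables: &ExtractorVariables{"k": "v"}, // placeholder value
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```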

type MetaTag

type MetaTag = metadata.MetaTag

MetaTag represents a meta tag item from HTML. This is an alias to the internal metadata.MetaTag type.

type Metadata

type Metadata = metadata.Metadata

Metadata represents extracted metadata from a document. This is an alias to the internal metadata.Metadata type.

type Options

type Options struct {
	// Enable debug logging
	Debug bool `json:"debug,omitempty"`

	// URL of the page being parsed
	URL string `json:"url,omitempty"`

	// Convert output to Markdown
	Markdown bool `json:"markdown,omitempty"`

	// Include Markdown in the response
	SeparateMarkdown bool `json:"separateMarkdown,omitempty"`

	// Whether to remove elements matching exact selectors like ads, social buttons, etc.
	// Defaults to true.
	RemoveExactSelectors bool `json:"removeExactSelectors,omitempty"`

	// Whether to remove elements matching partial selectors like ads, social buttons, etc.
	// Defaults to true.
	RemovePartialSelectors bool `json:"removePartialSelectors,omitempty"`

	// Remove images from the extracted content
	// Defaults to false.
	RemoveImages bool `json:"removeImages,omitempty"`

	// Element processing options
	ProcessCode      bool                                 `json:"processCode,omitempty"`
	ProcessImages    bool                                 `json:"processImages,omitempty"`
	ProcessHeadings  bool                                 `json:"processHeadings,omitempty"`
	ProcessMath      bool                                 `json:"processMath,omitempty"`
	ProcessFootnotes bool                                 `json:"processFootnotes,omitempty"`
	ProcessRoles     bool                                 `json:"processRoles,omitempty"`
	CodeOptions      *elements.CodeBlockProcessingOptions `json:"codeOptions,omitempty"`
	ImageOptions     *elements.ImageProcessingOptions     `json:"imageOptions,omitempty"`
	HeadingOptions   *elements.HeadingProcessingOptions   `json:"headingOptions,omitempty"`
	MathOptions      *elements.MathProcessingOptions      `json:"mathOptions,omitempty"`
	FootnoteOptions  *elements.FootnoteProcessingOptions  `json:"footnoteOptions,omitempty"`
	RoleOptions      *elements.RoleProcessingOptions      `json:"roleOptions,omitempty"`
}

Options represents configuration options for Defuddle parsing. Original TypeScript code:

export interface DefuddleOptions {
  debug?: boolean;
  url?: string;
  markdown?: boolean;
  separateMarkdown?: boolean;
  removeExactSelectors?: boolean;
  removePartialSelectors?: boolean;
}
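
Because the JSON tags match the property names of the TypeScript interface, options written for the original library can in principle be decoded into the Go struct. The sketch below uses a local stand-in with a few of the documented fields; loadOptions is a hypothetical helper. Note that keys absent from the JSON decode to Go zero values (false, ""); how the library applies its documented defaults on top of that is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// options copies a few documented fields and their JSON tags from the
// Options struct. It is a local stand-in so the example has no dependency.
type options struct {
	Debug                  bool   `json:"debug,omitempty"`
	URL                    string `json:"url,omitempty"`
	Markdown               bool   `json:"markdown,omitempty"`
	RemovePartialSelectors bool   `json:"removePartialSelectors,omitempty"`
}

// loadOptions decodes a JSON document into options. Keys that are absent
// leave the corresponding field at its Go zero value.
func loadOptions(raw string) (options, error) {
	var o options
	if err := json.Unmarshal([]byte(raw), &o); err != nil {
		return options{}, fmt.Errorf("decode options: %w", err)
	}
	return o, nil
}

func main() {
	o, err := loadOptions(`{"markdown": true, "url": "https://example.com/article"}`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", o)
}
```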

type Result

type Result struct {
	Metadata
	Content         string      `json:"content"`
	ContentMarkdown *string     `json:"contentMarkdown,omitempty"`
	ExtractorType   *string     `json:"extractorType,omitempty"`
	MetaTags        []MetaTag   `json:"metaTags,omitempty"`
	DebugInfo       *debug.Info `json:"debugInfo,omitempty"`
}

Result represents the complete response from Defuddle parsing. Original TypeScript code:

export interface DefuddleResponse extends DefuddleMetadata {
  content: string;
  contentMarkdown?: string;
  extractorType?: string;
  metaTags?: MetaTagItem[];
}

func ParseFromString (added in v0.2.0)

func ParseFromString(ctx context.Context, html string, options *Options) (*Result, error)

ParseFromString parses HTML content directly from a string. This is useful when you already have the HTML content (e.g., from browser automation).

func ParseFromURL

func ParseFromURL(ctx context.Context, url string, options *Options) (*Result, error)

ParseFromURL fetches content from a URL and parses it. It corresponds to the Node.js usage of the original library: Defuddle(htmlOrDom, url?, options?)

type StyleChange

type StyleChange struct {
	Selector string
	Styles   string
}

StyleChange represents a CSS style change for mobile.

Directories

Path                       Synopsis
cmd/defuddle               Package main provides the defuddle CLI application.
examples/advanced          Package main demonstrates advanced defuddle usage.
examples/basic             Package main demonstrates basic defuddle usage.
examples/custom_extractor  Package main demonstrates custom extractor usage.
examples/extractors        Package main demonstrates extractors usage.
examples/markdown          Package main demonstrates markdown conversion.
extractors                 Package extractors provides site-specific content extraction functionality.
internal/constants         Package constants provides configuration constants and selectors for the defuddle content extraction system.
internal/debug             Package debug provides debugging functionality for the defuddle content extraction system.
internal/elements          Package elements provides enhanced element processing functionality. This module handles code block processing including syntax highlighting, language detection, and code formatting.
internal/markdown          Package markdown provides HTML to Markdown conversion functionality.
internal/metadata          Package metadata provides functionality for extracting and processing document metadata.
internal/pool              Package pool provides memory pooling utilities for the defuddle content extraction system.
internal/scoring           Package scoring provides content scoring functionality for the defuddle content extraction system.
internal/standardize       Package standardize provides content standardization functionality for the defuddle content extraction system.
