hermes

package module
v1.0.6
Published: Aug 31, 2025 License: MIT Imports: 7 Imported by: 0

README

Hermes

A high-performance Go web content extraction library inspired by the Postlight Parser. Hermes transforms web pages into clean, structured text. It stays highly compatible with the original JavaScript version while aiming for better performance.

Features

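  • Fast Content Extraction: roughly on par with the JavaScript version for single URLs, and faster when many URLs are processed through one shared client (see Performance)
  • Memory Efficient: about half the memory usage of the JavaScript version in the project's benchmarks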
  • 150+ Custom Extractors: Site-specific parsers for major publications
  • Multiple Output Formats: HTML, Markdown, plain text, and JSON
  • Pagination Aware: Detects next_page_url for manual multi-page handling
  • CLI Tool: Command-line interface for single and batch parsing

Installation

As a Go Module
go get github.com/BumpyClock/hermes@latest
CLI Tool
go install github.com/BumpyClock/hermes/cmd/hermes@latest
Build from Source
git clone https://github.com/BumpyClock/hermes
cd hermes
make build

Usage

Command Line
# Parse a URL and output JSON
hermes parse https://example.com/article

# Output as markdown
hermes parse -f markdown https://example.com/article

# Save to file
hermes parse -o article.md -f markdown https://example.com/article

# Multiple URLs with timing
hermes parse --timing https://example.com/article1 https://example.com/article2
Go Library
Basic Usage
package main

import (
    "context"
    "fmt"
    "log"
    "time"
    
    "github.com/BumpyClock/hermes"
)

func main() {
    // Create a client with options
    client := hermes.New(
        hermes.WithTimeout(30*time.Second),
        hermes.WithContentType("html"), // "html", "markdown", or "text"
        hermes.WithUserAgent("MyApp/1.0"),
    )
    
    // Parse a URL with context
    ctx := context.Background()
    result, err := client.Parse(ctx, "https://example.com/article")
    if err != nil {
        log.Fatal(err)
    }
    
    fmt.Printf("Title: %s\n", result.Title)
    fmt.Printf("Author: %s\n", result.Author)
    fmt.Printf("Content: %s\n", result.Content)
    fmt.Printf("Word Count: %d\n", result.WordCount)
}
Advanced Usage with Custom HTTP Client
package main

import (
    "context"
    "crypto/tls"
    "fmt"
    "log" // needed for log.Fatal below
    "net/http"
    "time"

    "github.com/BumpyClock/hermes"
)

func main() {
    // Create custom HTTP client with proxy, custom transport, etc.
    customClient := &http.Client{
        Timeout: 60 * time.Second,
        Transport: &http.Transport{
            MaxIdleConns:        100,
            MaxIdleConnsPerHost: 10,
            IdleConnTimeout:     90 * time.Second,
            TLSClientConfig: &tls.Config{
                InsecureSkipVerify: false,
            },
        },
    }
    
    // Create Hermes client with custom HTTP client
    client := hermes.New(
        hermes.WithHTTPClient(customClient),
        hermes.WithContentType("markdown"),
        hermes.WithAllowPrivateNetworks(false), // SSRF protection
    )
    
    // Parse with timeout context
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    
    result, err := client.Parse(ctx, "https://example.com/article")
    if err != nil {
        if parseErr, ok := err.(*hermes.ParseError); ok {
            fmt.Printf("Parse error [%s]: %v\n", parseErr.Code, parseErr.Err)
        } else {
            log.Fatal(err)
        }
        return
    }
    
    fmt.Printf("Title: %s\n", result.Title)
    fmt.Printf("Content: %s\n", result.Content)
}
Parse Pre-fetched HTML
package main

import (
    "context"
    "fmt"
    "log"
    
    "github.com/BumpyClock/hermes"
)

func main() {
    client := hermes.New(hermes.WithContentType("text"))
    
    html := `<html><head><title>Test</title></head><body><p>Hello World</p></body></html>`
    
    result, err := client.ParseHTML(context.Background(), html, "https://example.com/test")
    if err != nil {
        log.Fatal(err)
    }
    
    fmt.Printf("Title: %s\n", result.Title)
    fmt.Printf("Content: %s\n", result.Content)
}

Migration from v0.x to v1.0

If you're upgrading from the old internal API, here are the key changes:

Old API (v0.x)
import "github.com/BumpyClock/hermes/pkg/parser"

p := parser.New()
result, err := p.Parse(url, &parser.ParserOptions{...})
New API (v1.0+)
import "github.com/BumpyClock/hermes"

client := hermes.New(hermes.WithTimeout(...))
result, err := client.Parse(ctx, url)
Key Changes
  1. Package Import: Use root package instead of /pkg/parser
  2. Context Required: All methods now require a context.Context as the first parameter
  3. Functional Options: Use hermes.WithXxx() options instead of struct fields
  4. Error Types: New *hermes.ParseError type with error codes
  5. HTTP Client: Client manages its own HTTP client, configurable via options
  6. Content Types: Set via WithContentType() option, affects parser extraction
Options Mapping
Old API                                          New API
parser.ParserOptions{ContentType: "markdown"}    hermes.WithContentType("markdown")
parser.ParserOptions{FetchAllPages: true}        Use result.NextPageURL for manual pagination
Custom headers in options                        Use hermes.WithHTTPClient() with custom transport
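
The last row of the mapping, custom headers via a custom transport, can be sketched with a small http.RoundTripper wrapper. This is an illustrative sketch only: headerTransport, stubTransport, and sentHeader are hypothetical names that are not part of Hermes, and the stub transport exists only so the example runs without network access.

```go
package main

import (
	"fmt"
	"net/http"
)

// headerTransport is a hypothetical helper (not part of Hermes). It sets
// fixed headers on every request and then delegates to Base.
type headerTransport struct {
	Base    http.RoundTripper
	Headers map[string]string
}

func (t *headerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// A RoundTripper must not mutate the caller's request, so clone it first.
	clone := req.Clone(req.Context())
	for name, value := range t.Headers {
		clone.Header.Set(name, value)
	}
	base := t.Base
	if base == nil {
		base = http.DefaultTransport
	}
	return base.RoundTrip(clone)
}

// stubTransport answers every request locally so the sketch runs offline.
type stubTransport struct{ seen http.Header }

func (s *stubTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	s.seen = req.Header.Clone()
	return &http.Response{
		StatusCode: http.StatusOK,
		Header:     make(http.Header),
		Body:       http.NoBody,
		Request:    req,
	}, nil
}

// sentHeader reports the value of one header as the base transport sees it.
func sentHeader(headers map[string]string, name string) (string, error) {
	stub := &stubTransport{}
	httpClient := &http.Client{Transport: &headerTransport{Base: stub, Headers: headers}}
	resp, err := httpClient.Get("http://example.invalid/article")
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return stub.seen.Get(name), nil
}

func main() {
	got, err := sentHeader(map[string]string{"Accept-Language": "en-US"}, "Accept-Language")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

In an application, Base would be left nil (or set to a real transport) and the resulting *http.Client passed to hermes.New(hermes.WithHTTPClient(httpClient)).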

Error Handling

The new API provides structured error handling:

result, err := client.Parse(ctx, url)
if err != nil {
    if parseErr, ok := err.(*hermes.ParseError); ok {
        switch parseErr.Code {
        case hermes.ErrInvalidURL:
            // Handle invalid URL
        case hermes.ErrFetch:
            // Handle fetch error
        case hermes.ErrTimeout:
            // Handle timeout
        case hermes.ErrExtract:
            // Handle extraction error
        default:
            // Handle other errors
        }
    }
}

Development

Prerequisites
  • Go 1.24.6 or later
  • Make (optional)
Setup
# Clone and setup
git clone https://github.com/BumpyClock/hermes
cd hermes
make dev-setup

# Run tests
make test

# Run with fixtures
make run-fixtures

# Lint code
make lint

# Build binary
make build

Key Dependencies

Hermes relies on the following Go dependencies, chosen for performance and maintainability:

  • goquery: jQuery-like DOM manipulation (industry standard)
  • html-to-markdown: HTML to Markdown conversion (v1.6.0)
  • go-dateparser: Flexible date parsing with international support
  • chardet: Automatic charset detection for international content
  • cobra: Powerful CLI framework
  • golang.org/x/text: Official Go text encoding support
Testing

The project includes comprehensive unit tests. Compatibility tests with the JavaScript version are planned. The make test-compatibility target currently references a non-existent package and will be enabled once the compatibility suite is added.

# Run all tests
go test ./...

# Test with coverage
go test -cover ./...

# Benchmark tests
make benchmark

Architecture

Hermes follows a modular architecture similar to the JavaScript version:

  • Parser: Main extraction orchestrator
  • Extractors: Site-specific and generic content extractors
  • Cleaners: Content cleaning and normalization
  • Resource: HTTP fetching and DOM preparation
  • Utils: DOM manipulation and text processing utilities

Custom Extractors

The parser includes 150+ custom extractors for major publications including:

  • News: NY Times, Washington Post, CNN, The Guardian
  • Tech: Ars Technica, The Verge, Wired
  • Business: Bloomberg, Reuters
  • And many more...

Performance

Performance varies by site and output format. See benchmark details in benchmark/README.md.

Latest benchmark (5 URLs from benchmark/testurls.txt):

  • JSON output: JS avg 627ms, Go avg 629ms (parity)
  • Markdown output: JS avg 173ms, Go avg 652ms (JS faster on this set)

Run the comparison yourself via benchmark/test-comparison.js (see docs in benchmark/README.md).

When the benchmark runs one URL at a time, the JavaScript version is slightly faster than Go but uses about twice the memory. In API scenarios that process multiple URLs at once, Go pulls ahead at approximately 20ms per request and around 60MB of memory, because reusing the same HTTP client and goroutines starts to pay off.

Compatibility

Hermes aims for high compatibility with the JavaScript version:

  • Same output formats and extractor definitions
  • CLI commands and options are similar
  • Next page URL detection is implemented

Note: Use the next_page_url field for manual pagination handling when needed.
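
Manual pagination amounts to following next-page links until none remains. The sketch below shows one way to structure that loop. It assumes a hypothetical fetch hook (fetchFunc) that returns a page's content together with its next-page URL; the page type, the hook, and collectPages are illustrative names, not Hermes APIs.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// page is a hypothetical stand-in for one parsed page: its content and the
// next-page URL reported for it ("" when there is none).
type page struct {
	Content string
	NextURL string
}

// fetchFunc is a hypothetical hook; in practice it would wrap a parse call.
type fetchFunc func(url string) (page, error)

// collectPages follows next-page links from start. It stops at the last page
// or after maxPages pages, and returns an error if a URL repeats.
func collectPages(start string, maxPages int, fetch fetchFunc) (string, error) {
	var parts []string
	seen := make(map[string]bool)
	for url := start; url != "" && len(parts) < maxPages; {
		if seen[url] {
			return "", errors.New("pagination loop detected at " + url)
		}
		seen[url] = true
		p, err := fetch(url)
		if err != nil {
			return "", fmt.Errorf("fetch %s: %w", url, err)
		}
		parts = append(parts, p.Content)
		url = p.NextURL
	}
	return strings.Join(parts, "\n\n"), nil
}

func main() {
	// A two-page fake site, invented for illustration.
	site := map[string]page{
		"https://example.com/a?page=1": {Content: "part one", NextURL: "https://example.com/a?page=2"},
		"https://example.com/a?page=2": {Content: "part two"},
	}
	fetch := func(url string) (page, error) {
		p, ok := site[url]
		if !ok {
			return page{}, errors.New("not found")
		}
		return p, nil
	}
	text, err := collectPages("https://example.com/a?page=1", 10, fetch)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(text)
}
```

The page cap and the repeated-URL check guard against sites whose next-page links never terminate.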

TODOs

Multi-page Article Collection

The multi-page article collection feature is partially implemented but needs integration:

  • Integration: Connect collect_all_pages.go with main parser pipeline
  • Configuration: Wire FetchAllPages option to trigger actual multi-page merging
  • Pipeline: Implement call to CollectAllPages when NextPageURL is detected
  • Testing: Add comprehensive multi-page extraction tests

Files requiring work:

  • pkg/parser/parser.go - Uncomment and implement collectAllPages method
  • pkg/extractors/collect_all_pages.go - Already implemented, needs integration
  • pkg/parser/extract_all_fields.go - Add multi-page logic to extraction pipeline

Current Status: Next page URL detection works; automatic fetching/merging does not.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Acknowledgments

  • Original Postlight Parser team
  • goquery for jQuery-like DOM manipulation
  • All contributors to the custom extractors

Documentation

Overview

Package hermes provides a high-performance web content extraction library that transforms web pages into clean, structured data.

Hermes extracts article content, titles, authors, dates, images, and more from any URL using site-specific custom parsers and generic fallback extraction.

Basic Usage

Create a client and parse a URL:

client := hermes.New()
result, err := client.Parse(context.Background(), "https://example.com/article")
if err != nil {
    log.Fatal(err)
}
fmt.Println(result.Title)
fmt.Println(result.Content)

Configuration

The client can be configured with various options:

client := hermes.New(
    hermes.WithTimeout(30 * time.Second),
    hermes.WithUserAgent("MyApp/1.0"),
    hermes.WithAllowPrivateNetworks(false),
)

Custom HTTP Client

You can provide your own HTTP client for custom transport settings:

httpClient := &http.Client{
    Transport: &http.Transport{
        Proxy: http.ProxyFromEnvironment,
        MaxIdleConns: 100,
    },
}
client := hermes.New(hermes.WithHTTPClient(httpClient))

Parsing Pre-fetched HTML

If you already have the HTML content, you can parse it directly:

html := "<html>...</html>"
result, err := client.ParseHTML(context.Background(), html, "https://example.com")

Error Handling

Errors are typed for programmatic handling:

result, err := client.Parse(ctx, url)
if err != nil {
    var parseErr *hermes.ParseError
    if errors.As(err, &parseErr) {
        switch parseErr.Code {
        case hermes.ErrFetch:
            // Handle fetch error
        case hermes.ErrTimeout:
            // Handle timeout
        case hermes.ErrSSRF:
            // Handle SSRF protection
        }
    }
}
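
One practical use of typed errors is deciding whether a failed request is worth retrying. The sketch below is self-contained, so ErrorCode and ParseError are local stand-ins that mirror the documented shapes; shouldRetry and wrap are hypothetical helpers, and the retry policy itself is an assumption rather than Hermes behavior. It also shows why errors.As is preferable to a plain type assertion: it still finds the ParseError after a caller has wrapped it.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins mirroring the documented ErrorCode and ParseError shapes so
// the sketch compiles on its own; real code would use the hermes types.
type ErrorCode int

const (
	ErrInvalidURL ErrorCode = iota
	ErrFetch
	ErrTimeout
	ErrSSRF
	ErrExtract
	ErrContext
)

type ParseError struct {
	Code ErrorCode
	URL  string
	Op   string
	Err  error
}

func (e *ParseError) Error() string { return fmt.Sprintf("%s %s: %v", e.Op, e.URL, e.Err) }
func (e *ParseError) Unwrap() error { return e.Err }

// shouldRetry is a hypothetical policy: retry only failures that may be
// transient (fetch errors and timeouts).
func shouldRetry(err error) bool {
	var pe *ParseError
	if !errors.As(err, &pe) {
		return false
	}
	switch pe.Code {
	case ErrFetch, ErrTimeout:
		return true
	default:
		return false
	}
}

// wrap adds caller context the way application code often does.
func wrap(err error) error { return fmt.Errorf("worker: %w", err) }

func main() {
	timeout := &ParseError{Code: ErrTimeout, URL: "https://example.com/article", Op: "Parse", Err: errors.New("deadline exceeded")}
	blocked := &ParseError{Code: ErrSSRF, URL: "http://10.0.0.1/", Op: "Parse", Err: errors.New("private address")}

	fmt.Println(shouldRetry(wrap(timeout))) // wrapped, still recognized
	fmt.Println(shouldRetry(blocked))
}
```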

Thread Safety

The Client is thread-safe and should be reused across goroutines. Create one client and share it throughout your application.

Concurrency

The library parses one URL at a time. For concurrent parsing, implement your own worker pool:

var wg sync.WaitGroup
sem := make(chan struct{}, 10) // Limit concurrency

for _, url := range urls {
    wg.Add(1)
    sem <- struct{}{}

    go func(u string) {
        defer wg.Done()
        defer func() { <-sem }()

        result, err := client.Parse(ctx, u)
        if err != nil {
            log.Printf("parse %s: %v", u, err)
            return
        }
        fmt.Println(result.Title)
    }(url)
}
wg.Wait()
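
When callers need the results back, the same semaphore pattern can write each outcome into a slot that matches the input order. The following is a minimal self-contained sketch: parseFunc is a hypothetical stand-in for a call such as client.Parse, and outcome and parseAll are illustrative names.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// parseFunc is a hypothetical stand-in for a call such as client.Parse.
type parseFunc func(url string) (string, error)

// outcome pairs one URL with its parsed title or its error.
type outcome struct {
	URL   string
	Title string
	Err   error
}

// parseAll runs parse over urls with at most limit calls in flight and
// returns the outcomes in input order.
func parseAll(urls []string, limit int, parse parseFunc) []outcome {
	if limit < 1 {
		limit = 1 // an unbuffered semaphore would block forever
	}
	out := make([]outcome, len(urls))
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit goroutines are running
		go func(i int, u string) {
			defer wg.Done()
			defer func() { <-sem }()
			title, err := parse(u)
			// Each goroutine writes only its own slot, so no mutex is needed.
			out[i] = outcome{URL: u, Title: title, Err: err}
		}(i, u)
	}
	wg.Wait()
	return out
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b"}
	// A fake parser, used here so the sketch runs without network access.
	fake := func(u string) (string, error) { return strings.ToUpper(u), nil }
	for _, o := range parseAll(urls, 2, fake) {
		fmt.Println(o.URL, o.Title, o.Err)
	}
}
```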
Example (Basic)

Example_basic demonstrates basic usage of the Hermes library

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	// Create a client with basic configuration
	client := hermes.New(
		hermes.WithTimeout(10*time.Second),
		hermes.WithUserAgent("Example/1.0"),
	)

	// Parse a URL
	ctx := context.Background()
	result, err := client.Parse(ctx, "https://httpbin.org/html")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("Domain: %s\n", result.Domain)
	fmt.Printf("Has content: %v\n", len(result.Content) > 0)

}
Output:

Title: Herman Melville - Moby-Dick
Domain: httpbin.org
Has content: true
Example (Concurrent)

Example_concurrent demonstrates that the client is thread-safe

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	client := hermes.New(
		hermes.WithTimeout(10*time.Second),
		hermes.WithUserAgent("ConcurrentExample/1.0"),
	)

	// Channel to collect results
	results := make(chan string, 2)

	// Launch two concurrent parsing operations
	go func() {
		ctx := context.Background()
		result, err := client.Parse(ctx, "https://httpbin.org/html")
		if err != nil {
			results <- "Error"
		} else {
			results <- fmt.Sprintf("Success: %s", result.Domain)
		}
	}()

	go func() {
		ctx := context.Background()
		result, err := client.Parse(ctx, "https://httpbin.org/html")
		if err != nil {
			results <- "Error"
		} else {
			results <- fmt.Sprintf("Success: %s", result.Domain)
		}
	}()

	// Collect results
	result1 := <-results
	result2 := <-results

	fmt.Printf("Concurrent operation 1: %s\n", result1)
	fmt.Printf("Concurrent operation 2: %s\n", result2)
	fmt.Printf("Client is thread-safe: true\n")

}
Output:

Concurrent operation 1: Success: httpbin.org
Concurrent operation 2: Success: httpbin.org
Client is thread-safe: true
Example (ContentTypes)

Example_contentTypes demonstrates different content type extractions

package main

import (
	"context"
	"fmt"

	"github.com/BumpyClock/hermes"
)

func main() {
	testURL := "https://httpbin.org/html"
	ctx := context.Background()

	// Test HTML extraction
	htmlClient := hermes.New(hermes.WithContentType("html"))
	htmlResult, err := htmlClient.Parse(ctx, testURL)
	if err != nil {
		fmt.Printf("HTML Error: %v\n", err)
		return
	}

	// Test Text extraction
	textClient := hermes.New(hermes.WithContentType("text"))
	textResult, err := textClient.Parse(ctx, testURL)
	if err != nil {
		fmt.Printf("Text Error: %v\n", err)
		return
	}

	fmt.Printf("HTML content has tags: %v\n", len(htmlResult.Content) > len(textResult.Content))
	fmt.Printf("Text content is shorter: %v\n", len(textResult.Content) < len(htmlResult.Content))
	fmt.Printf("Both have same title: %v\n", htmlResult.Title == textResult.Title)

}
Output:

HTML content has tags: true
Text content is shorter: true
Both have same title: true
Example (ContextCancellation)

Example_contextCancellation demonstrates context cancellation behavior

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	client := hermes.New(hermes.WithTimeout(30 * time.Second))

	// Create a context that will be cancelled quickly
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Millisecond)
	defer cancel()

	// Try to parse - should be cancelled due to short timeout
	_, err := client.Parse(ctx, "https://httpbin.org/delay/5")

	if err != nil {
		if parseErr, ok := err.(*hermes.ParseError); ok {
			fmt.Printf("Request was cancelled: %v\n", parseErr.Code == hermes.ErrTimeout)
			fmt.Printf("Error type: %s\n", parseErr.Code)
		}
	}

}
Output:

Request was cancelled: true
Error type: timeout
Example (CustomHTTPClient)

Example_customHTTPClient demonstrates using a custom HTTP client

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"net/http"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	// Create custom HTTP client with specific settings
	httpClient := &http.Client{
		Timeout: 15 * time.Second,
		Transport: &http.Transport{
			MaxIdleConns:        10,
			MaxIdleConnsPerHost: 2,
			IdleConnTimeout:     30 * time.Second,
			TLSClientConfig: &tls.Config{
				InsecureSkipVerify: false,
			},
		},
	}

	// Create Hermes client with custom HTTP client
	client := hermes.New(
		hermes.WithHTTPClient(httpClient),
		hermes.WithUserAgent("CustomHTTPClient/1.0"),
	)

	ctx := context.Background()
	result, err := client.Parse(ctx, "https://httpbin.org/user-agent")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Custom HTTP client used: true\n")
	fmt.Printf("User agent configured: %v\n", len(result.Content) > 0)

}
Output:

Custom HTTP client used: true
User agent configured: true
Example (ErrorHandling)

Example_errorHandling demonstrates error handling patterns

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	client := hermes.New(hermes.WithTimeout(5 * time.Second))

	// Try to parse an invalid URL
	ctx := context.Background()
	_, err := client.Parse(ctx, "not-a-valid-url")

	if err != nil {
		// Check if it's a ParseError
		if parseErr, ok := err.(*hermes.ParseError); ok {
			fmt.Printf("Parse error occurred\n")
			fmt.Printf("Error code: %s\n", parseErr.Code)
			fmt.Printf("Operation: %s\n", parseErr.Op)
			fmt.Printf("Is invalid URL error: %v\n", parseErr.Code == hermes.ErrInvalidURL)
		}
	}

}
Output:

Parse error occurred
Error code: fetch error
Operation: Parse
Is invalid URL error: false
Example (ParseHTML)

Example_parseHTML demonstrates parsing pre-fetched HTML content

package main

import (
	"context"
	"fmt"

	"github.com/BumpyClock/hermes"
)

func main() {
	client := hermes.New(hermes.WithContentType("text"))

	// HTML content to parse
	html := `<!DOCTYPE html>
<html>
<head>
    <title>Test Article</title>
    <meta name="author" content="John Doe">
</head>
<body>
    <h1>Sample Article</h1>
    <p>This is a test article with some content.</p>
    <p>It has multiple paragraphs for demonstration.</p>
</body>
</html>`

	// Parse the HTML directly
	ctx := context.Background()
	result, err := client.ParseHTML(ctx, html, "https://example.com/test")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("Author: %s\n", result.Author)
	fmt.Printf("Domain: %s\n", result.Domain)
	fmt.Printf("Word count: %d\n", result.WordCount)

}
Output:

Title: Test Article
Author: John Doe
Domain: example.com
Word count: 12
Example (WithOptions)

Example_withOptions demonstrates using various client options

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	// Create client with multiple options
	client := hermes.New(
		hermes.WithTimeout(30*time.Second),
		hermes.WithUserAgent("MyApp/2.0"),
		hermes.WithContentType("markdown"),
		hermes.WithAllowPrivateNetworks(false),
	)

	// Parse with context timeout
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	result, err := client.Parse(ctx, "https://httpbin.org/html")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Content type used: markdown\n")
	fmt.Printf("Word count: %d\n", result.WordCount)
	fmt.Printf("Has markdown content: %v\n", len(result.Content) > 0)

}
Output:

Content type used: markdown
Word count: 601
Has markdown content: true


Types

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client is a thread-safe, reusable parser client for extracting content from web pages. It manages its own HTTP client for connection pooling and can be shared across goroutines.

func New

func New(opts ...Option) *Client

New creates a new Hermes client with the provided options. The client is thread-safe and should be reused across requests.

Example:

client := hermes.New(
    hermes.WithTimeout(30*time.Second),
    hermes.WithUserAgent("MyApp/1.0"),
)

func (*Client) Parse

func (c *Client) Parse(ctx context.Context, url string) (*Result, error)

Parse extracts content from the given URL. The context can be used to cancel the request or set a deadline.

Example:

ctx := context.Background()
result, err := client.Parse(ctx, "https://example.com/article")
if err != nil {
    // Handle error
}
fmt.Println(result.Title)

func (*Client) ParseHTML

func (c *Client) ParseHTML(ctx context.Context, html, url string) (*Result, error)

ParseHTML extracts content from pre-fetched HTML. This is useful when you already have the HTML content and want to avoid an additional HTTP request.

Example:

html := "<html>...</html>"
result, err := client.ParseHTML(ctx, html, "https://example.com/article")

type ErrorCode

type ErrorCode int

ErrorCode represents the type of error that occurred during parsing

const (
	// ErrInvalidURL indicates the provided URL is malformed or empty
	ErrInvalidURL ErrorCode = iota

	// ErrFetch indicates a failure to fetch the content from the URL
	ErrFetch

	// ErrTimeout indicates the operation timed out
	ErrTimeout

	// ErrSSRF indicates the URL was blocked by SSRF protection
	ErrSSRF

	// ErrExtract indicates a failure during content extraction
	ErrExtract

	// ErrContext indicates the context was cancelled
	ErrContext
)

func (ErrorCode) String

func (e ErrorCode) String() string

String returns a human-readable string for the error code

type Option

type Option func(*Client)

Option is a functional option for configuring the Client
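
For readers new to the pattern, the sketch below shows how functional options of this shape are typically applied. It is not Hermes's implementation: config, option, withTimeout, withUserAgent, newConfig, and the default values are all hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// config is a hypothetical stand-in for the settings a client might hold.
type config struct {
	timeout   time.Duration
	userAgent string
}

// option has the same shape as the documented Option type: a function that
// mutates the value under construction.
type option func(*config)

func withTimeout(d time.Duration) option { return func(c *config) { c.timeout = d } }
func withUserAgent(ua string) option      { return func(c *config) { c.userAgent = ua } }

// newConfig starts from defaults (assumed values) and applies the options in
// order, so a later option overrides an earlier one.
func newConfig(opts ...option) *config {
	c := &config{timeout: 30 * time.Second, userAgent: "sketch/0.1"}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	c := newConfig(withTimeout(5*time.Second), withUserAgent("MyApp/1.0"))
	fmt.Println(c.timeout, c.userAgent)
}
```

The appeal of this design is that new settings can be added later without changing the constructor's signature or breaking existing callers.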

func WithAllowPrivateNetworks

func WithAllowPrivateNetworks(allow bool) Option

WithAllowPrivateNetworks allows or disallows parsing of private network URLs. By default, private networks are blocked for security (SSRF protection). Set to true only in trusted environments where you need to parse internal URLs.

Private networks include:

  • 10.0.0.0/8
  • 172.16.0.0/12
  • 192.168.0.0/16
  • 127.0.0.0/8 (localhost)
  • ::1 (IPv6 localhost)
  • fc00::/7 (IPv6 private)

Example:

// For internal tools that need to parse intranet content
client := hermes.New(hermes.WithAllowPrivateNetworks(true))
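
To make the listed ranges concrete, the sketch below checks an IP address against them with the standard net/netip package. It illustrates the ranges only and is not Hermes's actual SSRF check; privatePrefixes and isPrivateAddr are hypothetical names.

```go
package main

import (
	"fmt"
	"net/netip"
)

// privatePrefixes mirrors the ranges listed in the documentation above.
var privatePrefixes = []netip.Prefix{
	netip.MustParsePrefix("10.0.0.0/8"),
	netip.MustParsePrefix("172.16.0.0/12"),
	netip.MustParsePrefix("192.168.0.0/16"),
	netip.MustParsePrefix("127.0.0.0/8"),
	netip.MustParsePrefix("::1/128"),
	netip.MustParsePrefix("fc00::/7"),
}

// isPrivateAddr reports whether the textual IP address s falls inside one of
// the listed ranges.
func isPrivateAddr(s string) (bool, error) {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return false, err
	}
	// Prefix.Contains does not match IPv4-mapped IPv6 addresses against IPv4
	// prefixes, so convert ::ffff:a.b.c.d to a.b.c.d first.
	addr = addr.Unmap()
	for _, p := range privatePrefixes {
		if p.Contains(addr) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	for _, s := range []string{"10.1.2.3", "172.32.0.1", "::1", "8.8.8.8"} {
		private, err := isPrivateAddr(s)
		if err != nil {
			fmt.Println(s, "error:", err)
			continue
		}
		fmt.Println(s, private)
	}
}
```

A production guard would also have to resolve hostnames and re-check the address after each redirect, which this sketch leaves out.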

func WithContentType

func WithContentType(contentType string) Option

WithContentType sets the output content type for parsing. Valid options are "html", "markdown", and "text". By default, content is returned as HTML.

Example:

// Get content as markdown
client := hermes.New(hermes.WithContentType("markdown"))

func WithHTTPClient

func WithHTTPClient(httpClient *http.Client) Option

WithHTTPClient sets a custom HTTP client for the parser. This allows you to configure connection pooling, timeouts, proxies, etc.

Example:

httpClient := &http.Client{
    Timeout: 60 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns: 200,
    },
}
client := hermes.New(hermes.WithHTTPClient(httpClient))

func WithTimeout

func WithTimeout(timeout time.Duration) Option

WithTimeout sets the timeout for HTTP requests. This timeout applies to the entire request, including connection time, redirects, and reading the response body.

Example:

client := hermes.New(hermes.WithTimeout(30 * time.Second))

func WithTransport

func WithTransport(transport http.RoundTripper) Option

WithTransport sets a custom HTTP transport for the parser. This is useful for configuring proxies, TLS settings, connection pooling, etc. If both WithHTTPClient and WithTransport are used, WithHTTPClient takes precedence.

Example:

transport := &http.Transport{
    Proxy: http.ProxyFromEnvironment,
    MaxIdleConns: 100,
    IdleConnTimeout: 90 * time.Second,
}
client := hermes.New(hermes.WithTransport(transport))

func WithUserAgent

func WithUserAgent(userAgent string) Option

WithUserAgent sets the User-Agent header for HTTP requests. This is useful for identifying your application to web servers.

Example:

client := hermes.New(hermes.WithUserAgent("MyApp/1.0"))

type ParseError

type ParseError struct {
	// Code indicates the type of error
	Code ErrorCode

	// URL is the URL that was being parsed when the error occurred
	URL string

	// Op is the operation that failed (e.g., "Parse", "ParseHTML")
	Op string

	// Err is the underlying error
	Err error
}

ParseError represents an error that occurred during parsing. It includes the error code, URL, operation, and underlying error.

func (*ParseError) Error

func (e *ParseError) Error() string

Error implements the error interface

func (*ParseError) Is

func (e *ParseError) Is(target error) bool

Is reports whether the target error is equal to this error

func (*ParseError) IsContext

func (e *ParseError) IsContext() bool

IsContext returns true if the error was caused by context cancellation

func (*ParseError) IsExtract

func (e *ParseError) IsExtract() bool

IsExtract returns true if the error occurred during content extraction

func (*ParseError) IsFetch

func (e *ParseError) IsFetch() bool

IsFetch returns true if the error occurred during content fetching

func (*ParseError) IsInvalidURL

func (e *ParseError) IsInvalidURL() bool

IsInvalidURL returns true if the error was caused by an invalid URL

func (*ParseError) IsSSRF

func (e *ParseError) IsSSRF() bool

IsSSRF returns true if the error was caused by SSRF protection

func (*ParseError) IsTimeout

func (e *ParseError) IsTimeout() bool

IsTimeout returns true if the error was caused by a timeout

func (*ParseError) Unwrap

func (e *ParseError) Unwrap() error

Unwrap returns the underlying error

type Parser

type Parser interface {
	// Parse extracts content from the given URL.
	// The context can be used to cancel the request or set a deadline.
	Parse(ctx context.Context, url string) (*Result, error)

	// ParseHTML extracts content from pre-fetched HTML.
	// This is useful when you already have the HTML content.
	ParseHTML(ctx context.Context, html, url string) (*Result, error)
}

Parser is the interface for content extraction. Implement this interface to create mock parsers for testing.
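
A mock along these lines lets application code be tested without network access. To keep the sketch self-contained, Result and Parser below are trimmed local copies of the documented types; a real test would implement hermes.Parser and return *hermes.Result. mockParser, headline, and demo are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Result and Parser are trimmed local copies of the documented types so the
// sketch compiles alone; real tests would use hermes.Result and hermes.Parser.
type Result struct {
	URL   string
	Title string
}

type Parser interface {
	Parse(ctx context.Context, url string) (*Result, error)
	ParseHTML(ctx context.Context, html, url string) (*Result, error)
}

// mockParser returns canned results keyed by URL and never touches the network.
type mockParser struct {
	pages map[string]*Result
}

func (m *mockParser) Parse(ctx context.Context, url string) (*Result, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // honor cancellation like a real parser would
	}
	if r, ok := m.pages[url]; ok {
		return r, nil
	}
	return nil, errors.New("mock: no canned result for " + url)
}

func (m *mockParser) ParseHTML(ctx context.Context, html, url string) (*Result, error) {
	return m.Parse(ctx, url) // the mock ignores the HTML and looks up by URL
}

// headline is example application code that depends only on the interface.
func headline(ctx context.Context, p Parser, url string) (string, error) {
	r, err := p.Parse(ctx, url)
	if err != nil {
		return "", err
	}
	return r.Title, nil
}

// demo wires headline to a mock holding one canned page.
func demo(url string) (string, error) {
	mock := &mockParser{pages: map[string]*Result{
		"https://example.com/article": {URL: "https://example.com/article", Title: "Canned Title"},
	}}
	return headline(context.Background(), mock, url)
}

func main() {
	title, err := demo("https://example.com/article")
	fmt.Println(title, err)
}
```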

type Result

type Result struct {
	// Core content fields
	URL           string     `json:"url"`
	Title         string     `json:"title"`
	Content       string     `json:"content"`
	Author        string     `json:"author,omitempty"`
	DatePublished *time.Time `json:"date_published,omitempty"`

	// Media and metadata
	LeadImageURL string `json:"lead_image_url,omitempty"`
	Dek          string `json:"dek,omitempty"`
	Domain       string `json:"domain"`
	Excerpt      string `json:"excerpt,omitempty"`

	// Content metrics
	WordCount     int    `json:"word_count"`
	Direction     string `json:"direction,omitempty"`
	TotalPages    int    `json:"total_pages,omitempty"`
	RenderedPages int    `json:"rendered_pages,omitempty"`

	// Site information
	SiteName    string `json:"site_name,omitempty"`
	Description string `json:"description,omitempty"`
	Language    string `json:"language,omitempty"`
	ThemeColor  string `json:"theme_color,omitempty"`
	Favicon     string `json:"favicon,omitempty"`

	// Video metadata
	VideoURL      string                 `json:"video_url,omitempty"`
	VideoMetadata map[string]interface{} `json:"video_metadata,omitempty"`
}

Result contains the extracted content from a web page. All fields are read-only and represent the parsed article data.
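
The JSON tags above determine how a Result serializes. As a sketch of the consuming side, the code below decodes such JSON into a local subset struct that reuses the same tags. It assumes the JSON was produced by marshalling a Result with encoding/json; article and decodeArticle are hypothetical names, and the sample payload is invented.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// article is a local subset of the documented Result fields, reusing the
// same JSON tags.
type article struct {
	URL           string     `json:"url"`
	Title         string     `json:"title"`
	Author        string     `json:"author,omitempty"`
	DatePublished *time.Time `json:"date_published,omitempty"`
	WordCount     int        `json:"word_count"`
}

// decodeArticle parses one JSON document into an article.
func decodeArticle(data []byte) (*article, error) {
	var a article
	if err := json.Unmarshal(data, &a); err != nil {
		return nil, err
	}
	return &a, nil
}

func main() {
	// Sample payload invented for illustration; it has no date_published key.
	raw := []byte(`{"url":"https://example.com/article","title":"Hello","word_count":12}`)
	a, err := decodeArticle(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(a.Title, a.WordCount)
	// Optional fields tagged omitempty may be absent, which leaves the
	// pointer nil, so check before dereferencing.
	if a.DatePublished == nil {
		fmt.Println("no publication date")
	}
}
```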

func (*Result) FormatMarkdown

func (r *Result) FormatMarkdown() string

FormatMarkdown formats the result as Markdown with metadata header. This is useful for saving the content in a human-readable format.

Example output:

# Article Title

## Metadata
**Author:** John Doe
**Date:** 2024-01-01
**URL:** https://example.com/article

## Content
Article content here...

func (*Result) HasAuthor

func (r *Result) HasAuthor() bool

HasAuthor returns true if author information is available

func (*Result) HasDate

func (r *Result) HasDate() bool

HasDate returns true if publication date is available

func (*Result) HasImage

func (r *Result) HasImage() bool

HasImage returns true if a lead image is available

func (*Result) IsEmpty

func (r *Result) IsEmpty() bool

IsEmpty returns true if the result contains no meaningful content

Directories

Path Synopsis
cmd
checks/registry command
hermes command
examples
api-server command
Package main demonstrates how to build an HTTP API server using Hermes.
basic command
Package main demonstrates basic usage of the Hermes web content extraction library.
concurrent command
Package main demonstrates concurrent processing with the Hermes library.
custom-client command
Package main demonstrates advanced HTTP client configuration with Hermes.
internal
cache
ABOUTME: Helper functions for optimized DOM operations using the existing cache system.
extractors
ABOUTME: Advanced extractor loader with LRU caching and dynamic loading Reduces startup memory by 90% through lazy loading and automatic cache management
extractors/validation
Package validation provides a comprehensive field validation framework for extracted fields and extended field support.
pools
ABOUTME: This file implements sync.Pool for reusing expensive objects like goquery documents and HTTP response bodies.
utils/dom
ABOUTME: Cleans H1 tags from article content based on count threshold analysis.
utils/text
ABOUTME: Implements article base URL extraction by removing pagination parameters ABOUTME: Faithful port of JavaScript article-base-url.js with identical logic and behavior
register command
verify command
