hermes

Version: v1.0.6 (latest) | Published: Aug 31, 2025 | License: MIT | Imports: 7 | Imported by: 0

Repository

github.com/BumpyClock/hermes

README ¶

Hermes

A high-performance Go web content extraction library inspired by the Postlight Parser. Hermes transforms web pages into clean, structured text with high compatibility with the original JavaScript version while providing significant performance improvements.

Features

Fast Content Extraction: 2-3x faster than the JavaScript version
Memory Efficient: 50% less memory usage
150+ Custom Extractors: Site-specific parsers for major publications
Multiple Output Formats: HTML, Markdown, plain text, and JSON
Pagination Aware: Detects next_page_url for manual multi-page handling
CLI Tool: Command-line interface for single and batch parsing

Installation

As a Go Module

go get github.com/BumpyClock/hermes@latest

CLI Tool

go install github.com/BumpyClock/hermes/cmd/hermes@latest

Build from Source

git clone https://github.com/BumpyClock/hermes
cd hermes
make build

Usage

Command Line

# Parse a URL and output JSON
hermes parse https://example.com/article

# Output as markdown
hermes parse -f markdown https://example.com/article

# Save to file
hermes parse -o article.md -f markdown https://example.com/article

# Multiple URLs with timing
hermes parse --timing https://example.com/article1 https://example.com/article2

Go Library

Basic Usage

package main

import (
    "context"
    "fmt"
    "log"
    "time"
    
    "github.com/BumpyClock/hermes"
)

func main() {
    // Create a client with options
    client := hermes.New(
        hermes.WithTimeout(30*time.Second),
        hermes.WithContentType("html"), // "html", "markdown", or "text"
        hermes.WithUserAgent("MyApp/1.0"),
    )
    
    // Parse a URL with context
    ctx := context.Background()
    result, err := client.Parse(ctx, "https://example.com/article")
    if err != nil {
        log.Fatal(err)
    }
    
    fmt.Printf("Title: %s\n", result.Title)
    fmt.Printf("Author: %s\n", result.Author)
    fmt.Printf("Content: %s\n", result.Content)
    fmt.Printf("Word Count: %d\n", result.WordCount)
}

Advanced Usage with Custom HTTP Client

package main

import (
    "context"
    "crypto/tls"
    "fmt"
    "log"
    "net/http"
    "time"
    
    "github.com/BumpyClock/hermes"
)

func main() {
    // Create custom HTTP client with proxy, custom transport, etc.
    customClient := &http.Client{
        Timeout: 60 * time.Second,
        Transport: &http.Transport{
            MaxIdleConns:        100,
            MaxIdleConnsPerHost: 10,
            IdleConnTimeout:     90 * time.Second,
            TLSClientConfig: &tls.Config{
                InsecureSkipVerify: false,
            },
        },
    }
    
    // Create Hermes client with custom HTTP client
    client := hermes.New(
        hermes.WithHTTPClient(customClient),
        hermes.WithContentType("markdown"),
        hermes.WithAllowPrivateNetworks(false), // SSRF protection
    )
    
    // Parse with timeout context
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    
    result, err := client.Parse(ctx, "https://example.com/article")
    if err != nil {
        if parseErr, ok := err.(*hermes.ParseError); ok {
            fmt.Printf("Parse error [%s]: %v\n", parseErr.Code, parseErr.Err)
        } else {
            log.Fatal(err)
        }
        return
    }
    
    fmt.Printf("Title: %s\n", result.Title)
    fmt.Printf("Content: %s\n", result.Content)
}

Parse Pre-fetched HTML

package main

import (
    "context"
    "fmt"
    "log"
    
    "github.com/BumpyClock/hermes"
)

func main() {
    client := hermes.New(hermes.WithContentType("text"))
    
    html := `<html><head><title>Test</title></head><body><p>Hello World</p></body></html>`
    
    result, err := client.ParseHTML(context.Background(), html, "https://example.com/test")
    if err != nil {
        log.Fatal(err)
    }
    
    fmt.Printf("Title: %s\n", result.Title)
    fmt.Printf("Content: %s\n", result.Content)
}

Migration from v0.x to v1.0

If you're upgrading from the old internal API, here are the key changes:

Old API (v0.x)

import "github.com/BumpyClock/hermes/pkg/parser"

p := parser.New()
result, err := p.Parse(url, &parser.ParserOptions{...})

New API (v1.0+)

import "github.com/BumpyClock/hermes"

client := hermes.New(hermes.WithTimeout(...))
result, err := client.Parse(ctx, url)

Key Changes

Package Import: Use root package instead of /pkg/parser
Context Required: All methods now take a context.Context as their first parameter
Functional Options: Use hermes.WithXxx() options instead of struct fields
Error Types: New *hermes.ParseError type with error codes
HTTP Client: Client manages its own HTTP client, configurable via options
Content Types: Set via WithContentType() option, affects parser extraction

Options Mapping

Old API	New API
`parser.ParserOptions{ContentType: "markdown"}`	`hermes.WithContentType("markdown")`
`parser.ParserOptions{FetchAllPages: true}`	Use `result.NextPageURL` for manual pagination
Custom headers in options	Use `hermes.WithHTTPClient()` with custom transport
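
The last row of the table can be made concrete with a small wrapper around http.RoundTripper. The following is a minimal sketch using only the standard library; headerTransport and echoedHeader are hypothetical names introduced here, not part of Hermes, and the local test server stands in for a real site.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// headerTransport is a hypothetical http.RoundTripper that adds fixed headers
// to every outgoing request before delegating to Base.
type headerTransport struct {
	Base    http.RoundTripper
	Headers map[string]string
}

// RoundTrip clones the request (a RoundTripper must not modify the original),
// sets the extra headers on the clone, and forwards it.
func (t *headerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	base := t.Base
	if base == nil {
		base = http.DefaultTransport
	}
	clone := req.Clone(req.Context())
	for k, v := range t.Headers {
		clone.Header.Set(k, v)
	}
	return base.RoundTrip(clone)
}

// echoedHeader starts a local test server, sends one request through the
// transport, and returns the header value as the server received it.
func echoedHeader(name, value string) (string, error) {
	seen := make(chan string, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(_ http.ResponseWriter, r *http.Request) {
		seen <- r.Header.Get(name)
	}))
	defer srv.Close()

	client := &http.Client{
		Transport: &headerTransport{Headers: map[string]string{name: value}},
	}
	resp, err := client.Get(srv.URL)
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return <-seen, nil
}

func main() {
	got, err := echoedHeader("X-Api-Key", "demo")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("server saw X-Api-Key:", got)
}
```

With Hermes, the same transport would be wrapped in an http.Client and passed to hermes.WithHTTPClient, or handed directly to hermes.WithTransport.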

Error Handling

The new API provides structured error handling:

result, err := client.Parse(ctx, url)
if err != nil {
    if parseErr, ok := err.(*hermes.ParseError); ok {
        switch parseErr.Code {
        case hermes.ErrInvalidURL:
            // Handle invalid URL
        case hermes.ErrFetch:
            // Handle fetch error
        case hermes.ErrTimeout:
            // Handle timeout
        case hermes.ErrExtract:
            // Handle extraction error
        default:
            // Handle other errors
        }
    }
}

Development

Prerequisites

Go 1.24.6 or later
Make (optional)

Setup

# Clone and setup
git clone https://github.com/BumpyClock/hermes
cd hermes
make dev-setup

# Run tests
make test

# Run with fixtures
make run-fixtures

# Lint code
make lint

# Build binary
make build

Key Dependencies

Hermes builds on the following Go libraries:

goquery: jQuery-like DOM manipulation (industry standard)
html-to-markdown: HTML to Markdown conversion (v1.6.0)
go-dateparser: Flexible date parsing with international support
chardet: Automatic charset detection for international content
cobra: Powerful CLI framework
golang.org/x/text: Official Go text encoding support

Testing

The project includes comprehensive unit tests. Compatibility tests with the JavaScript version are planned. The make test-compatibility target currently references a non-existent package and will be enabled once the compatibility suite is added.

# Run all tests
go test ./...

# Test with coverage
go test -cover ./...

# Benchmark tests
make benchmark

Architecture

Hermes follows a modular architecture similar to the JavaScript version:

Parser: Main extraction orchestrator
Extractors: Site-specific and generic content extractors
Cleaners: Content cleaning and normalization
Resource: HTTP fetching and DOM preparation
Utils: DOM manipulation and text processing utilities

Custom Extractors

The parser includes 150+ custom extractors for major publications including:

News: NY Times, Washington Post, CNN, The Guardian
Tech: Ars Technica, The Verge, Wired
Business: Bloomberg, Reuters
And many more...

Performance

Performance varies by site and output format. See benchmark details in benchmark/README.md.

Latest benchmark (5 URLs from benchmark/testurls.txt):

JSON output: JS avg 627ms, Go avg 629ms (parity)
Markdown output: JS avg 173ms, Go avg 652ms (JS faster on this set)

Run the comparison yourself via benchmark/test-comparison.js (see docs in benchmark/README.md).

When the benchmark runs one URL at a time, the JavaScript version comes out slightly faster than Go but uses about twice the memory. In API scenarios that process multiple URLs at once, Go pulls ahead at approximately 20 ms per request and around 60 MB of memory, as the efficiency gains from reusing the same HTTP client and from goroutines begin to show.

Compatibility

Hermes aims for high compatibility with the JavaScript version:

Same output formats and extractor definitions
CLI commands and options are similar
Next page URL detection is implemented

Note: Use the next_page_url field for manual pagination handling when needed.
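
Manual pagination usually means following the next-page link until it is empty, with guards against loops and runaway page counts. Below is a minimal sketch under stated assumptions: page, collectPages, and the in-memory site map are hypothetical stand-ins so the example runs offline. With Hermes, the parse callback would wrap client.Parse and copy the content and next-page URL from the result; the README calls that field next_page_url / result.NextPageURL, so check the name against the Result type of your installed version.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// page is a hypothetical stand-in for one parsed page: its content plus the
// detected next-page URL (empty when there is none).
type page struct {
	Content     string
	NextPageURL string
}

// collectPages follows next-page links from start and joins the page contents.
// It stops at the last page, after maxPages pages, or when a URL repeats.
func collectPages(start string, maxPages int, parse func(url string) (page, error)) (string, error) {
	var parts []string
	seen := map[string]bool{}
	for url := start; url != "" && len(parts) < maxPages; {
		if seen[url] {
			break // the site links back to a page we already fetched
		}
		seen[url] = true
		p, err := parse(url)
		if err != nil {
			return strings.Join(parts, "\n"), fmt.Errorf("page %d (%s): %w", len(parts)+1, url, err)
		}
		parts = append(parts, p.Content)
		url = p.NextPageURL
	}
	return strings.Join(parts, "\n"), nil
}

// demo runs collectPages against a two-page fake site whose second page
// links back to the first.
func demo() (string, error) {
	site := map[string]page{
		"https://example.com/a":     {Content: "part one", NextPageURL: "https://example.com/a?p=2"},
		"https://example.com/a?p=2": {Content: "part two", NextPageURL: "https://example.com/a"},
	}
	return collectPages("https://example.com/a", 10, func(u string) (page, error) {
		p, ok := site[u]
		if !ok {
			return page{}, errors.New("not found")
		}
		return p, nil
	})
}

func main() {
	content, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(content)
}
```

On error, the sketch returns the content gathered so far along with the error, so a caller can decide whether a partial article is acceptable.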

TODOs

Multi-page Article Collection

The multi-page article collection feature is partially implemented but needs integration:

Integration: Connect collect_all_pages.go with main parser pipeline
Configuration: Wire FetchAllPages option to trigger actual multi-page merging
Pipeline: Implement call to CollectAllPages when NextPageURL is detected
Testing: Add comprehensive multi-page extraction tests

Files requiring work:

pkg/parser/parser.go - Uncomment and implement collectAllPages method
pkg/extractors/collect_all_pages.go - Already implemented, needs integration
pkg/parser/extract_all_fields.go - Add multi-page logic to extraction pipeline

Current Status: Next page URL detection works; automatic fetching/merging does not.

Contributing

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure all tests pass
Submit a pull request

License

MIT License - see LICENSE file for details.

Acknowledgments

Original Postlight Parser team
goquery for jQuery-like DOM manipulation
All contributors to the custom extractors

Documentation ¶

Overview ¶

Package hermes provides a high-performance web content extraction library that transforms web pages into clean, structured data.

Hermes extracts article content, titles, authors, dates, images, and more from any URL using site-specific custom parsers and generic fallback extraction.

Basic Usage ¶

Create a client and parse a URL:

client := hermes.New()
result, err := client.Parse(context.Background(), "https://example.com/article")
if err != nil {
    log.Fatal(err)
}
fmt.Println(result.Title)
fmt.Println(result.Content)

Configuration ¶

The client can be configured with various options:

client := hermes.New(
    hermes.WithTimeout(30 * time.Second),
    hermes.WithUserAgent("MyApp/1.0"),
    hermes.WithAllowPrivateNetworks(false),
)

Custom HTTP Client ¶

You can provide your own HTTP client for custom transport settings:

httpClient := &http.Client{
    Transport: &http.Transport{
        Proxy: http.ProxyFromEnvironment,
        MaxIdleConns: 100,
    },
}
client := hermes.New(hermes.WithHTTPClient(httpClient))

Parsing Pre-fetched HTML ¶

If you already have the HTML content, you can parse it directly:

html := "<html>...</html>"
result, err := client.ParseHTML(context.Background(), html, "https://example.com")

Error Handling ¶

Errors are typed for programmatic handling:

result, err := client.Parse(ctx, url)
if err != nil {
    var parseErr *hermes.ParseError
    if errors.As(err, &parseErr) {
        switch parseErr.Code {
        case hermes.ErrFetch:
            // Handle fetch error
        case hermes.ErrTimeout:
            // Handle timeout
        case hermes.ErrSSRF:
            // Handle SSRF protection
        }
    }
}
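
Typed error codes also make it straightforward to decide which failures are worth retrying. The sketch below is one possible policy, not something Hermes prescribes: it retries fetch failures and timeouts and gives up immediately on everything else. ErrorCode and ParseError are local stand-ins that mirror the documented hermes types so the example runs without the library; retryable and parseWithRetry are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrorCode and ParseError are local stand-ins mirroring the documented
// hermes.ErrorCode and hermes.ParseError so this sketch runs offline.
type ErrorCode int

const (
	ErrInvalidURL ErrorCode = iota
	ErrFetch
	ErrTimeout
	ErrSSRF
	ErrExtract
	ErrContext
)

type ParseError struct {
	Code ErrorCode
	URL  string
	Op   string
	Err  error
}

func (e *ParseError) Error() string { return fmt.Sprintf("%s %s: %v", e.Op, e.URL, e.Err) }
func (e *ParseError) Unwrap() error { return e.Err }

// retryable reports whether a failed parse is worth attempting again.
// The policy (fetch failures and timeouts only) is an assumption.
func retryable(err error) bool {
	var pe *ParseError
	if !errors.As(err, &pe) {
		return false
	}
	return pe.Code == ErrFetch || pe.Code == ErrTimeout
}

// parseWithRetry calls parse up to attempts times, stopping early on success
// or on an error that retryable rejects.
func parseWithRetry(attempts int, parse func() (string, error)) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		title, err := parse()
		if err == nil {
			return title, nil
		}
		lastErr = err
		if !retryable(err) {
			break
		}
	}
	return "", lastErr
}

func main() {
	calls := 0
	title, err := parseWithRetry(3, func() (string, error) {
		calls++
		if calls < 3 {
			// errors.As still finds the ParseError through the wrapping.
			return "", fmt.Errorf("attempt %d: %w", calls,
				&ParseError{Code: ErrTimeout, Op: "Parse", Err: errors.New("deadline exceeded")})
		}
		return "Example Title", nil
	})
	fmt.Println(title, err, calls)
}
```

A production version would normally add a delay between attempts and stop when the context is cancelled; both are omitted here to keep the sketch short.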

Thread Safety ¶

The Client is thread-safe and should be reused across goroutines. Create one client and share it throughout your application.

Concurrency ¶

Each Parse call handles a single URL. To parse several URLs concurrently, run your own worker pool around a shared client:

var wg sync.WaitGroup
sem := make(chan struct{}, 10) // Limit concurrency

for _, url := range urls {
    wg.Add(1)
    sem <- struct{}{}

    go func(u string) {
        defer wg.Done()
        defer func() { <-sem }()

        result, err := client.Parse(ctx, u)
        if err != nil {
            log.Printf("parse %s: %v", u, err)
            return
        }
        fmt.Println(result.Title) // handle the result
    }(url)
}
wg.Wait()
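
The worker pool above leaves result handling to the caller. One way to collect outcomes in input order is sketched below. Result and Parser are local stand-ins that mirror the documented Parse signature so the example runs without network access; outcome, parseAll, and fakeParser are hypothetical names. In real code, a *hermes.Client would take the place of fakeParser.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Result and Parser are local stand-ins mirroring hermes.Result (one field)
// and the documented Parse method, so the sketch runs offline.
type Result struct{ Title string }

type Parser interface {
	Parse(ctx context.Context, url string) (*Result, error)
}

// outcome pairs a URL with either its result or its error.
type outcome struct {
	URL    string
	Result *Result
	Err    error
}

// parseAll parses urls with at most limit calls in flight and returns the
// outcomes in the same order as the input.
func parseAll(ctx context.Context, p Parser, urls []string, limit int) []outcome {
	if limit < 1 {
		limit = 1 // a zero-capacity semaphore would block forever
	}
	out := make([]outcome, len(urls))
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit goroutines are running
		go func(i int, u string) {
			defer wg.Done()
			defer func() { <-sem }()
			res, err := p.Parse(ctx, u)
			out[i] = outcome{URL: u, Result: res, Err: err} // each goroutine owns slot i
		}(i, u)
	}
	wg.Wait()
	return out
}

// fakeParser is a hypothetical Parser used only for this demonstration.
type fakeParser struct{}

func (fakeParser) Parse(_ context.Context, url string) (*Result, error) {
	if url == "" {
		return nil, errors.New("empty URL")
	}
	return &Result{Title: "title of " + url}, nil
}

// demo parses three URLs, one of them invalid, with two workers.
func demo() []outcome {
	urls := []string{"https://example.com/a", "", "https://example.com/b"}
	return parseAll(context.Background(), fakeParser{}, urls, 2)
}

func main() {
	for _, o := range demo() {
		if o.Err != nil {
			fmt.Printf("%q failed: %v\n", o.URL, o.Err)
			continue
		}
		fmt.Printf("%q -> %s\n", o.URL, o.Result.Title)
	}
}
```

Writing each outcome into its own slice slot avoids a mutex and keeps failures alongside successes, so one bad URL does not abort the batch.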

Example (Basic) ¶

Example_basic demonstrates basic usage of the Hermes library

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	// Create a client with basic configuration
	client := hermes.New(
		hermes.WithTimeout(10*time.Second),
		hermes.WithUserAgent("Example/1.0"),
	)

	// Parse a URL
	ctx := context.Background()
	result, err := client.Parse(ctx, "https://httpbin.org/html")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("Domain: %s\n", result.Domain)
	fmt.Printf("Has content: %v\n", len(result.Content) > 0)

}

Output:

Title: Herman Melville - Moby-Dick
Domain: httpbin.org
Has content: true

Example (Concurrent) ¶

Example_concurrent demonstrates that the client is thread-safe

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	client := hermes.New(
		hermes.WithTimeout(10*time.Second),
		hermes.WithUserAgent("ConcurrentExample/1.0"),
	)

	// Channel to collect results
	results := make(chan string, 2)

	// Launch two concurrent parsing operations
	go func() {
		ctx := context.Background()
		result, err := client.Parse(ctx, "https://httpbin.org/html")
		if err != nil {
			results <- "Error"
		} else {
			results <- fmt.Sprintf("Success: %s", result.Domain)
		}
	}()

	go func() {
		ctx := context.Background()
		result, err := client.Parse(ctx, "https://httpbin.org/html")
		if err != nil {
			results <- "Error"
		} else {
			results <- fmt.Sprintf("Success: %s", result.Domain)
		}
	}()

	// Collect results
	result1 := <-results
	result2 := <-results

	fmt.Printf("Concurrent operation 1: %s\n", result1)
	fmt.Printf("Concurrent operation 2: %s\n", result2)
	fmt.Printf("Client is thread-safe: true\n")

}

Output:

Concurrent operation 1: Success: httpbin.org
Concurrent operation 2: Success: httpbin.org
Client is thread-safe: true

Example (ContentTypes) ¶

Example_contentTypes demonstrates different content type extractions

package main

import (
	"context"
	"fmt"

	"github.com/BumpyClock/hermes"
)

func main() {
	testURL := "https://httpbin.org/html"
	ctx := context.Background()

	// Test HTML extraction
	htmlClient := hermes.New(hermes.WithContentType("html"))
	htmlResult, err := htmlClient.Parse(ctx, testURL)
	if err != nil {
		fmt.Printf("HTML Error: %v\n", err)
		return
	}

	// Test Text extraction
	textClient := hermes.New(hermes.WithContentType("text"))
	textResult, err := textClient.Parse(ctx, testURL)
	if err != nil {
		fmt.Printf("Text Error: %v\n", err)
		return
	}

	fmt.Printf("HTML content has tags: %v\n", len(htmlResult.Content) > len(textResult.Content))
	fmt.Printf("Text content is shorter: %v\n", len(textResult.Content) < len(htmlResult.Content))
	fmt.Printf("Both have same title: %v\n", htmlResult.Title == textResult.Title)

}

Output:

HTML content has tags: true
Text content is shorter: true
Both have same title: true

Example (ContextCancellation) ¶

Example_contextCancellation demonstrates context cancellation behavior

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	client := hermes.New(hermes.WithTimeout(30 * time.Second))

	// Create a context that will be cancelled quickly
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Millisecond)
	defer cancel()

	// Try to parse - should be cancelled due to short timeout
	_, err := client.Parse(ctx, "https://httpbin.org/delay/5")

	if err != nil {
		if parseErr, ok := err.(*hermes.ParseError); ok {
			fmt.Printf("Request was cancelled: %v\n", parseErr.Code == hermes.ErrTimeout)
			fmt.Printf("Error type: %s\n", parseErr.Code)
		}
	}

}

Output:

Request was cancelled: true
Error type: timeout

Example (CustomHTTPClient) ¶

Example_customHTTPClient demonstrates using a custom HTTP client

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"net/http"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	// Create custom HTTP client with specific settings
	httpClient := &http.Client{
		Timeout: 15 * time.Second,
		Transport: &http.Transport{
			MaxIdleConns:        10,
			MaxIdleConnsPerHost: 2,
			IdleConnTimeout:     30 * time.Second,
			TLSClientConfig: &tls.Config{
				InsecureSkipVerify: false,
			},
		},
	}

	// Create Hermes client with custom HTTP client
	client := hermes.New(
		hermes.WithHTTPClient(httpClient),
		hermes.WithUserAgent("CustomHTTPClient/1.0"),
	)

	ctx := context.Background()
	result, err := client.Parse(ctx, "https://httpbin.org/user-agent")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Custom HTTP client used: true\n")
	fmt.Printf("User agent configured: %v\n", len(result.Content) > 0)

}

Output:

Custom HTTP client used: true
User agent configured: true

Example (ErrorHandling) ¶

Example_errorHandling demonstrates error handling patterns

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	client := hermes.New(hermes.WithTimeout(5 * time.Second))

	// Try to parse an invalid URL
	ctx := context.Background()
	_, err := client.Parse(ctx, "not-a-valid-url")

	if err != nil {
		// Check if it's a ParseError
		if parseErr, ok := err.(*hermes.ParseError); ok {
			fmt.Printf("Parse error occurred\n")
			fmt.Printf("Error code: %s\n", parseErr.Code)
			fmt.Printf("Operation: %s\n", parseErr.Op)
			fmt.Printf("Is invalid URL error: %v\n", parseErr.Code == hermes.ErrInvalidURL)
		}
	}

}

Output:

Parse error occurred
Error code: fetch error
Operation: Parse
Is invalid URL error: false

Example (ParseHTML) ¶

Example_parseHTML demonstrates parsing pre-fetched HTML content

package main

import (
	"context"
	"fmt"

	"github.com/BumpyClock/hermes"
)

func main() {
	client := hermes.New(hermes.WithContentType("text"))

	// HTML content to parse
	html := `<!DOCTYPE html>
<html>
<head>
    <title>Test Article</title>
    <meta name="author" content="John Doe">
</head>
<body>
    <h1>Sample Article</h1>
    <p>This is a test article with some content.</p>
    <p>It has multiple paragraphs for demonstration.</p>
</body>
</html>`

	// Parse the HTML directly
	ctx := context.Background()
	result, err := client.ParseHTML(ctx, html, "https://example.com/test")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("Author: %s\n", result.Author)
	fmt.Printf("Domain: %s\n", result.Domain)
	fmt.Printf("Word count: %d\n", result.WordCount)

}

Output:

Title: Test Article
Author: John Doe
Domain: example.com
Word count: 12

Example (WithOptions) ¶

Example_withOptions demonstrates using various client options

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	// Create client with multiple options
	client := hermes.New(
		hermes.WithTimeout(30*time.Second),
		hermes.WithUserAgent("MyApp/2.0"),
		hermes.WithContentType("markdown"),
		hermes.WithAllowPrivateNetworks(false),
	)

	// Parse with context timeout
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	result, err := client.Parse(ctx, "https://httpbin.org/html")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Content type used: markdown\n")
	fmt.Printf("Word count: %d\n", result.WordCount)
	fmt.Printf("Has markdown content: %v\n", len(result.Content) > 0)

}

Output:

Content type used: markdown
Word count: 601
Has markdown content: true

Index ¶

type Client
- func New(opts ...Option) *Client
- func (c *Client) Parse(ctx context.Context, url string) (*Result, error)
- func (c *Client) ParseHTML(ctx context.Context, html, url string) (*Result, error)
type ErrorCode
- func (e ErrorCode) String() string
type Option
type ParseError
type Parser
type Result

Types ¶

type Client ¶

type Client struct {
	// contains filtered or unexported fields
}

Client is a thread-safe, reusable parser client for extracting content from web pages. It manages its own HTTP client for connection pooling and can be shared across goroutines.

func New ¶

func New(opts ...Option) *Client

New creates a new Hermes client with the provided options. The client is thread-safe and should be reused across requests.

Example:

client := hermes.New(
    hermes.WithTimeout(30*time.Second),
    hermes.WithUserAgent("MyApp/1.0"),
)

func (*Client) Parse ¶

func (c *Client) Parse(ctx context.Context, url string) (*Result, error)

Parse extracts content from the given URL. The context can be used to cancel the request or set a deadline.

Example:

ctx := context.Background()
result, err := client.Parse(ctx, "https://example.com/article")
if err != nil {
    // Handle error
}
fmt.Println(result.Title)

func (*Client) ParseHTML ¶

func (c *Client) ParseHTML(ctx context.Context, html, url string) (*Result, error)

ParseHTML extracts content from pre-fetched HTML. This is useful when you already have the HTML content and want to avoid an additional HTTP request.

Example:

html := "<html>...</html>"
result, err := client.ParseHTML(ctx, html, "https://example.com/article")

type ErrorCode ¶

type ErrorCode int

ErrorCode represents the type of error that occurred during parsing

const (
	// ErrInvalidURL indicates the provided URL is malformed or empty
	ErrInvalidURL ErrorCode = iota

	// ErrFetch indicates a failure to fetch the content from the URL
	ErrFetch

	// ErrTimeout indicates the operation timed out
	ErrTimeout

	// ErrSSRF indicates the URL was blocked by SSRF protection
	ErrSSRF

	// ErrExtract indicates a failure during content extraction
	ErrExtract

	// ErrContext indicates the context was cancelled
	ErrContext
)

func (ErrorCode) String ¶

func (e ErrorCode) String() string

String returns a human-readable string for the error code

type Option ¶

type Option func(*Client)

Option is a functional option for configuring the Client

func WithAllowPrivateNetworks ¶

func WithAllowPrivateNetworks(allow bool) Option

WithAllowPrivateNetworks allows or disallows parsing of private network URLs. By default, private networks are blocked for security (SSRF protection). Set to true only in trusted environments where you need to parse internal URLs.

Private networks include:

10.0.0.0/8
172.16.0.0/12
192.168.0.0/16
127.0.0.0/8 (localhost)
::1 (IPv6 localhost)
fc00::/7 (IPv6 private)

Example:

// For internal tools that need to parse intranet content
client := hermes.New(hermes.WithAllowPrivateNetworks(true))
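
To illustrate what a check against these ranges can look like, here is a minimal sketch built on the standard net/netip package. It is not Hermes's implementation; privateRanges and isPrivate are hypothetical names, and treating unparseable input as private is a design assumption chosen so that callers fail closed.

```go
package main

import (
	"fmt"
	"net/netip"
)

// privateRanges lists the ranges named in the WithAllowPrivateNetworks docs.
var privateRanges = []netip.Prefix{
	netip.MustParsePrefix("10.0.0.0/8"),
	netip.MustParsePrefix("172.16.0.0/12"),
	netip.MustParsePrefix("192.168.0.0/16"),
	netip.MustParsePrefix("127.0.0.0/8"),
	netip.MustParsePrefix("::1/128"),
	netip.MustParsePrefix("fc00::/7"),
}

// isPrivate reports whether ip falls in one of privateRanges.
// Unparseable input is treated as private so callers fail closed.
func isPrivate(ip string) bool {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return true
	}
	// Prefix.Contains does not match IPv4-mapped IPv6 addresses against IPv4
	// prefixes, so ::ffff:10.0.0.1 is first unmapped to 10.0.0.1.
	addr = addr.Unmap()
	for _, p := range privateRanges {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	for _, ip := range []string{"10.1.2.3", "172.32.0.1", "8.8.8.8", "::1", "fd00::1"} {
		fmt.Printf("%-12s private=%v\n", ip, isPrivate(ip))
	}
}
```

A check like this is only meaningful when applied to the address a hostname actually resolves to, including after redirects; validating the URL string alone is not enough.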

func WithContentType ¶

func WithContentType(contentType string) Option

WithContentType sets the output content type for parsing. Valid options are "html", "markdown", and "text". By default, content is returned as HTML.

Example:

// Get content as markdown
client := hermes.New(hermes.WithContentType("markdown"))

func WithHTTPClient ¶

func WithHTTPClient(httpClient *http.Client) Option

WithHTTPClient sets a custom HTTP client for the parser. This allows you to configure connection pooling, timeouts, proxies, etc.

Example:

httpClient := &http.Client{
    Timeout: 60 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns: 200,
    },
}
client := hermes.New(hermes.WithHTTPClient(httpClient))

func WithTimeout ¶

func WithTimeout(timeout time.Duration) Option

WithTimeout sets the timeout for HTTP requests. This timeout applies to the entire request, including connection time, redirects, and reading the response body.

Example:

client := hermes.New(hermes.WithTimeout(30 * time.Second))

func WithTransport ¶

func WithTransport(transport http.RoundTripper) Option

WithTransport sets a custom HTTP transport for the parser. This is useful for configuring proxies, TLS settings, connection pooling, etc. If both WithHTTPClient and WithTransport are used, WithHTTPClient takes precedence.

Example:

transport := &http.Transport{
    Proxy: http.ProxyFromEnvironment,
    MaxIdleConns: 100,
    IdleConnTimeout: 90 * time.Second,
}
client := hermes.New(hermes.WithTransport(transport))

func WithUserAgent ¶

func WithUserAgent(userAgent string) Option

WithUserAgent sets the User-Agent header for HTTP requests. This is useful for identifying your application to web servers.

Example:

client := hermes.New(hermes.WithUserAgent("MyApp/1.0"))

type ParseError ¶

type ParseError struct {
	// Code indicates the type of error
	Code ErrorCode

	// URL is the URL that was being parsed when the error occurred
	URL string

	// Op is the operation that failed (e.g., "Parse", "ParseHTML")
	Op string

	// Err is the underlying error
	Err error
}

ParseError represents an error that occurred during parsing. It includes the error code, URL, operation, and underlying error.

func (*ParseError) Error ¶

func (e *ParseError) Error() string

Error implements the error interface

func (*ParseError) Is ¶

func (e *ParseError) Is(target error) bool

Is reports whether the target error is equal to this error

func (*ParseError) IsContext ¶

func (e *ParseError) IsContext() bool

IsContext returns true if the error was caused by context cancellation

func (*ParseError) IsExtract ¶

func (e *ParseError) IsExtract() bool

IsExtract returns true if the error occurred during content extraction

func (*ParseError) IsFetch ¶

func (e *ParseError) IsFetch() bool

IsFetch returns true if the error occurred during content fetching

func (*ParseError) IsInvalidURL ¶

func (e *ParseError) IsInvalidURL() bool

IsInvalidURL returns true if the error was caused by an invalid URL

func (*ParseError) IsSSRF ¶

func (e *ParseError) IsSSRF() bool

IsSSRF returns true if the error was caused by SSRF protection

func (*ParseError) IsTimeout ¶

func (e *ParseError) IsTimeout() bool

IsTimeout returns true if the error was caused by a timeout

func (*ParseError) Unwrap ¶

func (e *ParseError) Unwrap() error

Unwrap returns the underlying error

type Parser ¶

type Parser interface {
	// Parse extracts content from the given URL.
	// The context can be used to cancel the request or set a deadline.
	Parse(ctx context.Context, url string) (*Result, error)

	// ParseHTML extracts content from pre-fetched HTML.
	// This is useful when you already have the HTML content.
	ParseHTML(ctx context.Context, html, url string) (*Result, error)
}

Parser is the interface for content extraction. Implement this interface to create mock parsers for testing.
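
A mock for testing might look like the sketch below. To stay runnable without the library, it declares local Result and Parser stand-ins that copy the documented method set; in real code, replace them with hermes.Result and hermes.Parser. mockParser, headline, and demo are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Result and Parser are local stand-ins mirroring the documented
// hermes.Result (two fields) and hermes.Parser.
type Result struct {
	Title   string
	Content string
}

type Parser interface {
	Parse(ctx context.Context, url string) (*Result, error)
	ParseHTML(ctx context.Context, html, url string) (*Result, error)
}

// mockParser returns canned results keyed by URL and records every call.
// It is not safe for concurrent use.
type mockParser struct {
	results map[string]*Result
	calls   []string
}

func (m *mockParser) Parse(_ context.Context, url string) (*Result, error) {
	m.calls = append(m.calls, url)
	if r, ok := m.results[url]; ok {
		return r, nil
	}
	return nil, errors.New("mock: no canned result for " + url)
}

func (m *mockParser) ParseHTML(ctx context.Context, _, url string) (*Result, error) {
	return m.Parse(ctx, url)
}

// Compile-time check that mockParser satisfies the interface.
var _ Parser = (*mockParser)(nil)

// headline is example application code that depends only on the interface,
// which is what makes it testable with the mock.
func headline(ctx context.Context, p Parser, url string) string {
	r, err := p.Parse(ctx, url)
	if err != nil {
		return "(unavailable)"
	}
	return r.Title
}

// demo exercises headline with one known and one unknown URL.
func demo() (string, string, int) {
	m := &mockParser{results: map[string]*Result{
		"https://example.com/a": {Title: "Hello"},
	}}
	ctx := context.Background()
	found := headline(ctx, m, "https://example.com/a")
	missing := headline(ctx, m, "https://example.com/missing")
	return found, missing, len(m.calls)
}

func main() {
	found, missing, calls := demo()
	fmt.Println(found, missing, calls)
}
```

Application code that accepts the interface rather than *Client can be tested this way without any network traffic.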

type Result ¶

type Result struct {
	// Core content fields
	URL           string     `json:"url"`
	Title         string     `json:"title"`
	Content       string     `json:"content"`
	Author        string     `json:"author,omitempty"`
	DatePublished *time.Time `json:"date_published,omitempty"`

	// Media and metadata
	LeadImageURL string `json:"lead_image_url,omitempty"`
	Dek          string `json:"dek,omitempty"`
	Domain       string `json:"domain"`
	Excerpt      string `json:"excerpt,omitempty"`

	// Content metrics
	WordCount     int    `json:"word_count"`
	Direction     string `json:"direction,omitempty"`
	TotalPages    int    `json:"total_pages,omitempty"`
	RenderedPages int    `json:"rendered_pages,omitempty"`

	// Site information
	SiteName    string `json:"site_name,omitempty"`
	Description string `json:"description,omitempty"`
	Language    string `json:"language,omitempty"`
	ThemeColor  string `json:"theme_color,omitempty"`
	Favicon     string `json:"favicon,omitempty"`

	// Video metadata
	VideoURL      string                 `json:"video_url,omitempty"`
	VideoMetadata map[string]interface{} `json:"video_metadata,omitempty"`
}

Result contains the extracted content from a web page. All fields are read-only and represent the parsed article data.
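
The struct tags determine how a Result serializes to JSON: fields tagged omitempty disappear when empty, while the others are always present. The sketch below copies a handful of fields and tags from the struct above into a local type (result and encode are hypothetical names) so that the effect can be seen without the library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// result copies a few fields and JSON tags from the documented Result struct.
type result struct {
	URL       string `json:"url"`
	Title     string `json:"title"`
	Content   string `json:"content"`
	Author    string `json:"author,omitempty"`
	Domain    string `json:"domain"`
	WordCount int    `json:"word_count"`
}

// encode marshals r to a compact JSON string.
func encode(r result) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Author is left empty, so the "author" key is omitted from the output,
	// while the empty Content and zero WordCount are still emitted.
	s, err := encode(result{URL: "https://example.com/a", Title: "Hello", Domain: "example.com"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

Consumers of the JSON output should therefore treat keys such as author and date_published as optional.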

func (*Result) FormatMarkdown ¶

func (r *Result) FormatMarkdown() string

FormatMarkdown formats the result as Markdown with metadata header. This is useful for saving the content in a human-readable format.

Example output:

# Article Title

## Metadata
**Author:** John Doe
**Date:** 2024-01-01
**URL:** https://example.com/article

## Content
Article content here...

func (*Result) HasAuthor ¶

func (r *Result) HasAuthor() bool

HasAuthor returns true if author information is available

func (*Result) HasDate ¶

func (r *Result) HasDate() bool

HasDate returns true if publication date is available

func (*Result) HasImage ¶

func (r *Result) HasImage() bool

HasImage returns true if a lead image is available

func (*Result) IsEmpty ¶

func (r *Result) IsEmpty() bool

IsEmpty returns true if the result contains no meaningful content

Directories ¶

Path	Synopsis
cmd
checks/concurrency command
checks/production command
checks/realworld command
checks/registry command
hermes command
examples
api-server command	Package main demonstrates how to build an HTTP API server using Hermes.
basic command	Package main demonstrates basic usage of the Hermes web content extraction library.
concurrent command	Package main demonstrates concurrent processing with the Hermes library.
custom-client command	Package main demonstrates advanced HTTP client configuration with Hermes.
internal
cache	Helper functions for optimized DOM operations using the existing cache system.
cleaners
extractors	Advanced extractor loader with LRU caching and dynamic loading. Reduces startup memory by 90% through lazy loading and automatic cache management.
extractors/custom
extractors/fields
extractors/generic
extractors/validation	Package validation provides a comprehensive field validation framework for extracted fields and extended field support.
parser
pools	This file implements sync.Pool for reusing expensive objects like goquery documents and HTTP response bodies.
resource
utils
utils/dom	Cleans H1 tags from article content based on count threshold analysis.
utils/security
utils/text	Implements article base URL extraction by removing pagination parameters. Faithful port of JavaScript article-base-url.js with identical logic and behavior.
validation
scripts
tools
register command
verify command
