Documentation ¶
Overview ¶
Package hermes provides a high-performance web content extraction library that transforms web pages into clean, structured data.
Hermes extracts article content, titles, authors, dates, images, and more from any URL using site-specific custom parsers and generic fallback extraction.
Basic Usage ¶
Create a client and parse a URL:
client := hermes.New()
result, err := client.Parse(context.Background(), "https://example.com/article")
if err != nil {
	log.Fatal(err)
}
fmt.Println(result.Title)
fmt.Println(result.Content)
Configuration ¶
The client can be configured with various options:
client := hermes.New(
	hermes.WithTimeout(30 * time.Second),
	hermes.WithUserAgent("MyApp/1.0"),
	hermes.WithAllowPrivateNetworks(false),
)
Custom HTTP Client ¶
You can provide your own HTTP client for custom transport settings:
httpClient := &http.Client{
	Transport: &http.Transport{
		Proxy:        http.ProxyFromEnvironment,
		MaxIdleConns: 100,
	},
}
client := hermes.New(hermes.WithHTTPClient(httpClient))
Parsing Pre-fetched HTML ¶
If you already have the HTML content, you can parse it directly:
html := "<html>...</html>"
result, err := client.ParseHTML(context.Background(), html, "https://example.com")
Error Handling ¶
Errors are typed for programmatic handling:
result, err := client.Parse(ctx, url)
if err != nil {
	var parseErr *hermes.ParseError
	if errors.As(err, &parseErr) {
		switch parseErr.Code {
		case hermes.ErrFetch:
			// Handle fetch error
		case hermes.ErrTimeout:
			// Handle timeout
		case hermes.ErrSSRF:
			// Handle SSRF protection
		}
	}
}
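The errors.As form used above keeps working when a caller wraps the error before returning it, which a plain type assertion does not. The following is a minimal sketch of that difference. ErrorCode and ParseError here are local stand-ins that copy the documented field names so the code runs without the library, and the wrap helper and the retry/report/reject policy in classify are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrorCode and ParseError are local stand-ins that mirror the shapes
// documented for hermes; they are NOT the library's own types.
type ErrorCode int

const (
	ErrInvalidURL ErrorCode = iota
	ErrFetch
	ErrTimeout
)

type ParseError struct {
	Code ErrorCode
	URL  string
	Op   string
	Err  error
}

func (e *ParseError) Error() string {
	return fmt.Sprintf("%s %s: %v", e.Op, e.URL, e.Err)
}

func (e *ParseError) Unwrap() error { return e.Err }

// wrap adds caller context to err while keeping it reachable via %w.
func wrap(msg string, err error) error {
	return fmt.Errorf("%s: %w", msg, err)
}

// classify reports how a caller might react to err (a hypothetical policy).
// errors.As walks the wrap chain, so it still finds a *ParseError that
// was wrapped with %w.
func classify(err error) string {
	var parseErr *ParseError
	if !errors.As(err, &parseErr) {
		return "unknown"
	}
	switch parseErr.Code {
	case ErrTimeout:
		return "retry"
	case ErrFetch:
		return "report"
	default:
		return "reject"
	}
}

func main() {
	base := &ParseError{
		Code: ErrTimeout,
		URL:  "https://example.com",
		Op:   "Parse",
		Err:  errors.New("deadline exceeded"),
	}
	wrapped := wrap("refreshing feed", base)

	// A plain type assertion misses the wrapped error; errors.As does not.
	_, ok := wrapped.(*ParseError)
	fmt.Println(ok, classify(wrapped)) // prints "false retry"
}
```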
Thread Safety ¶
The Client is thread-safe and should be reused across goroutines. Create one client and share it throughout your application.
Concurrency ¶
Each call to Parse handles a single URL. To parse many URLs concurrently, share one client across your own worker pool:
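var wg sync.WaitGroup
sem := make(chan struct{}, 10) // Limit concurrency to 10 in-flight parses
for _, url := range urls {
	wg.Add(1)
	sem <- struct{}{} // Blocks while 10 parses are already running
	go func(u string) {
		defer wg.Done()
		defer func() { <-sem }()
		result, err := client.Parse(ctx, u)
		if err != nil {
			log.Printf("parse %s: %v", u, err)
			return
		}
		fmt.Println(result.Title) // Handle result
	}(url)
}
wg.Wait()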
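A common follow-up to the pattern above is collecting each result next to the URL that produced it. The sketch below is one way to do that under stated assumptions: parseFunc, outcome, parseAll, fakeParse, and demo are hypothetical names, and fakeParse stands in for client.Parse so the code runs without network access.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// outcome pairs one input URL with what the parse call returned.
type outcome struct {
	URL   string
	Title string
	Err   error
}

// parseFunc stands in for a call such as client.Parse; it is a
// hypothetical signature chosen so the sketch runs offline.
type parseFunc func(ctx context.Context, url string) (string, error)

// parseAll runs parse over urls with at most limit calls in flight and
// returns the outcomes in input order.
func parseAll(ctx context.Context, urls []string, limit int, parse parseFunc) []outcome {
	if limit < 1 {
		limit = 1 // a zero-capacity semaphore would block forever
	}
	out := make([]outcome, len(urls))
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit goroutines are running
		go func(i int, u string) {
			defer wg.Done()
			defer func() { <-sem }()
			title, err := parse(ctx, u)
			// Each goroutine writes only its own slot, so no mutex is needed.
			out[i] = outcome{URL: u, Title: title, Err: err}
		}(i, u)
	}
	wg.Wait()
	return out
}

// fakeParse is a stand-in that fails on an empty URL.
func fakeParse(_ context.Context, url string) (string, error) {
	if url == "" {
		return "", errors.New("empty URL")
	}
	return "title of " + url, nil
}

// demo runs three inputs through parseAll with two workers.
func demo() []outcome {
	return parseAll(context.Background(), []string{"a", "", "c"}, 2, fakeParse)
}

func main() {
	for _, o := range demo() {
		fmt.Printf("%q title=%q err=%v\n", o.URL, o.Title, o.Err)
	}
}
```

Writing into a pre-sized slice by index keeps results in input order without a results channel. With the real client, the parse argument could be a small closure around client.Parse that returns result.Title.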
Example (Basic) ¶
Example_basic demonstrates basic usage of the Hermes library
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	// Create a client with basic configuration
	client := hermes.New(
		hermes.WithTimeout(10*time.Second),
		hermes.WithUserAgent("Example/1.0"),
	)

	// Parse a URL
	ctx := context.Background()
	result, err := client.Parse(ctx, "https://httpbin.org/html")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("Domain: %s\n", result.Domain)
	fmt.Printf("Has content: %v\n", len(result.Content) > 0)
}

Output:

Title: Herman Melville - Moby-Dick
Domain: httpbin.org
Has content: true
Example (Concurrent) ¶
Example_concurrent demonstrates that the client is thread-safe
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	client := hermes.New(
		hermes.WithTimeout(10*time.Second),
		hermes.WithUserAgent("ConcurrentExample/1.0"),
	)

	// Channel to collect results
	results := make(chan string, 2)

	// Launch two concurrent parsing operations
	go func() {
		ctx := context.Background()
		result, err := client.Parse(ctx, "https://httpbin.org/html")
		if err != nil {
			results <- "Error"
		} else {
			results <- fmt.Sprintf("Success: %s", result.Domain)
		}
	}()
	go func() {
		ctx := context.Background()
		result, err := client.Parse(ctx, "https://httpbin.org/html")
		if err != nil {
			results <- "Error"
		} else {
			results <- fmt.Sprintf("Success: %s", result.Domain)
		}
	}()

	// Collect results
	result1 := <-results
	result2 := <-results
	fmt.Printf("Concurrent operation 1: %s\n", result1)
	fmt.Printf("Concurrent operation 2: %s\n", result2)
	fmt.Printf("Client is thread-safe: true\n")
}

Output:

Concurrent operation 1: Success: httpbin.org
Concurrent operation 2: Success: httpbin.org
Client is thread-safe: true
Example (ContentTypes) ¶
Example_contentTypes demonstrates different content type extractions
package main

import (
	"context"
	"fmt"

	"github.com/BumpyClock/hermes"
)

func main() {
	testURL := "https://httpbin.org/html"
	ctx := context.Background()

	// Test HTML extraction
	htmlClient := hermes.New(hermes.WithContentType("html"))
	htmlResult, err := htmlClient.Parse(ctx, testURL)
	if err != nil {
		fmt.Printf("HTML Error: %v\n", err)
		return
	}

	// Test Text extraction
	textClient := hermes.New(hermes.WithContentType("text"))
	textResult, err := textClient.Parse(ctx, testURL)
	if err != nil {
		fmt.Printf("Text Error: %v\n", err)
		return
	}

	fmt.Printf("HTML content has tags: %v\n", len(htmlResult.Content) > len(textResult.Content))
	fmt.Printf("Text content is shorter: %v\n", len(textResult.Content) < len(htmlResult.Content))
	fmt.Printf("Both have same title: %v\n", htmlResult.Title == textResult.Title)
}

Output:

HTML content has tags: true
Text content is shorter: true
Both have same title: true
Example (ContextCancellation) ¶
Example_contextCancellation demonstrates context cancellation behavior
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	client := hermes.New(hermes.WithTimeout(30 * time.Second))

	// Create a context that will be cancelled quickly
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Millisecond)
	defer cancel()

	// Try to parse - should be cancelled due to short timeout
	_, err := client.Parse(ctx, "https://httpbin.org/delay/5")
	if err != nil {
		if parseErr, ok := err.(*hermes.ParseError); ok {
			fmt.Printf("Request was cancelled: %v\n", parseErr.Code == hermes.ErrTimeout)
			fmt.Printf("Error type: %s\n", parseErr.Code)
		}
	}
}

Output:

Request was cancelled: true
Error type: timeout
Example (CustomHTTPClient) ¶
Example_customHTTPClient demonstrates using a custom HTTP client
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"net/http"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	// Create custom HTTP client with specific settings
	httpClient := &http.Client{
		Timeout: 15 * time.Second,
		Transport: &http.Transport{
			MaxIdleConns:        10,
			MaxIdleConnsPerHost: 2,
			IdleConnTimeout:     30 * time.Second,
			TLSClientConfig: &tls.Config{
				InsecureSkipVerify: false,
			},
		},
	}

	// Create Hermes client with custom HTTP client
	client := hermes.New(
		hermes.WithHTTPClient(httpClient),
		hermes.WithUserAgent("CustomHTTPClient/1.0"),
	)

	ctx := context.Background()
	result, err := client.Parse(ctx, "https://httpbin.org/user-agent")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Custom HTTP client used: true\n")
	fmt.Printf("User agent configured: %v\n", len(result.Content) > 0)
}

Output:

Custom HTTP client used: true
User agent configured: true
Example (ErrorHandling) ¶
Example_errorHandling demonstrates error handling patterns
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	client := hermes.New(hermes.WithTimeout(5 * time.Second))

	// Try to parse an invalid URL
	ctx := context.Background()
	_, err := client.Parse(ctx, "not-a-valid-url")
	if err != nil {
		// Check if it's a ParseError
		if parseErr, ok := err.(*hermes.ParseError); ok {
			fmt.Printf("Parse error occurred\n")
			fmt.Printf("Error code: %s\n", parseErr.Code)
			fmt.Printf("Operation: %s\n", parseErr.Op)
			fmt.Printf("Is invalid URL error: %v\n", parseErr.Code == hermes.ErrInvalidURL)
		}
	}
}

Output:

Parse error occurred
Error code: fetch error
Operation: Parse
Is invalid URL error: false
Example (ParseHTML) ¶
Example_parseHTML demonstrates parsing pre-fetched HTML content
package main

import (
	"context"
	"fmt"

	"github.com/BumpyClock/hermes"
)

func main() {
	client := hermes.New(hermes.WithContentType("text"))

	// HTML content to parse
	html := `<!DOCTYPE html>
<html>
<head>
<title>Test Article</title>
<meta name="author" content="John Doe">
</head>
<body>
<h1>Sample Article</h1>
<p>This is a test article with some content.</p>
<p>It has multiple paragraphs for demonstration.</p>
</body>
</html>`

	// Parse the HTML directly
	ctx := context.Background()
	result, err := client.ParseHTML(ctx, html, "https://example.com/test")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Title: %s\n", result.Title)
	fmt.Printf("Author: %s\n", result.Author)
	fmt.Printf("Domain: %s\n", result.Domain)
	fmt.Printf("Word count: %d\n", result.WordCount)
}

Output:

Title: Test Article
Author: John Doe
Domain: example.com
Word count: 12
Example (WithOptions) ¶
Example_withOptions demonstrates using various client options
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/BumpyClock/hermes"
)

func main() {
	// Create client with multiple options
	client := hermes.New(
		hermes.WithTimeout(30*time.Second),
		hermes.WithUserAgent("MyApp/2.0"),
		hermes.WithContentType("markdown"),
		hermes.WithAllowPrivateNetworks(false),
	)

	// Parse with context timeout
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	result, err := client.Parse(ctx, "https://httpbin.org/html")
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	fmt.Printf("Content type used: markdown\n")
	fmt.Printf("Word count: %d\n", result.WordCount)
	fmt.Printf("Has markdown content: %v\n", len(result.Content) > 0)
}

Output:

Content type used: markdown
Word count: 601
Has markdown content: true
Index ¶
- type Client
- type ErrorCode
- type Option
- type ParseError
- func (e *ParseError) Error() string
- func (e *ParseError) Is(target error) bool
- func (e *ParseError) IsContext() bool
- func (e *ParseError) IsExtract() bool
- func (e *ParseError) IsFetch() bool
- func (e *ParseError) IsInvalidURL() bool
- func (e *ParseError) IsSSRF() bool
- func (e *ParseError) IsTimeout() bool
- func (e *ParseError) Unwrap() error
- type Parser
- type Result
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Client ¶
type Client struct {
	// contains filtered or unexported fields
}
Client is a thread-safe, reusable parser client for extracting content from web pages. It manages its own HTTP client for connection pooling and can be shared across goroutines.
func New ¶
New creates a new Hermes client with the provided options. The client is thread-safe and should be reused across requests.
Example:
client := hermes.New(
	hermes.WithTimeout(30*time.Second),
	hermes.WithUserAgent("MyApp/1.0"),
)
func (*Client) Parse ¶
Parse extracts content from the given URL. The context can be used to cancel the request or set a deadline.
Example:
ctx := context.Background()
result, err := client.Parse(ctx, "https://example.com/article")
if err != nil {
	// Handle error
}
fmt.Println(result.Title)
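The deadline behaviour described above rests on standard context semantics. The sketch below shows only that standard-library side, using a hypothetical slowFetch function in place of a network-bound call such as Parse; timedOut is likewise an illustrative helper. The ContextCancellation example shows hermes reporting this case as a ParseError with code ErrTimeout rather than a bare context error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowFetch is a hypothetical stand-in for a network-bound call: it
// finishes after d unless ctx ends first.
func slowFetch(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// timedOut gives slowFetch at most limitMs milliseconds to complete
// workMs milliseconds of work and reports whether the deadline cut it off.
func timedOut(workMs, limitMs int) bool {
	limit := time.Duration(limitMs) * time.Millisecond
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel() // always release the timer held by the context

	err := slowFetch(ctx, time.Duration(workMs)*time.Millisecond)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(timedOut(2000, 10)) // work outlasts the deadline: true
	fmt.Println(timedOut(1, 2000))  // work finishes well inside it: false
}
```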
type ErrorCode ¶
type ErrorCode int
ErrorCode represents the type of error that occurred during parsing
const (
	// ErrInvalidURL indicates the provided URL is malformed or empty
	ErrInvalidURL ErrorCode = iota
	// ErrFetch indicates a failure to fetch the content from the URL
	ErrFetch
	// ErrTimeout indicates the operation timed out
	ErrTimeout
	// ErrSSRF indicates the URL was blocked by SSRF protection
	ErrSSRF
	// ErrExtract indicates a failure during content extraction
	ErrExtract
	// ErrContext indicates the context was cancelled
	ErrContext
)
type Option ¶
type Option func(*Client)
Option is a functional option for configuring the Client
func WithAllowPrivateNetworks ¶
WithAllowPrivateNetworks allows or disallows parsing of private network URLs. By default, private networks are blocked for security (SSRF protection). Set to true only in trusted environments where you need to parse internal URLs.
Private networks include:
- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16
- 127.0.0.0/8 (localhost)
- ::1 (IPv6 localhost)
- fc00::/7 (IPv6 private)
Example:
// For internal tools that need to parse intranet content
client := hermes.New(hermes.WithAllowPrivateNetworks(true))
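To make the listed ranges concrete, here is a minimal sketch that tests an IP address against them with the standard net/netip package. It illustrates the ranges only; how hermes performs its own check is not shown in this documentation, and privateRanges and isPrivate are hypothetical names.

```go
package main

import (
	"fmt"
	"net/netip"
)

// privateRanges holds the ranges named in the list above.
var privateRanges = []netip.Prefix{
	netip.MustParsePrefix("10.0.0.0/8"),
	netip.MustParsePrefix("172.16.0.0/12"),
	netip.MustParsePrefix("192.168.0.0/16"),
	netip.MustParsePrefix("127.0.0.0/8"),
	netip.MustParsePrefix("::1/128"),
	netip.MustParsePrefix("fc00::/7"),
}

// isPrivate reports whether ip falls inside one of privateRanges.
// It returns an error for input that is not an IP address, so a caller
// can choose to reject such input instead of treating it as public.
func isPrivate(ip string) (bool, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false, err
	}
	// Prefix.Contains does not match IPv4-mapped IPv6 addresses against
	// IPv4 prefixes, so unmap ::ffff:10.0.0.1 to 10.0.0.1 first.
	addr = addr.Unmap()
	for _, p := range privateRanges {
		if p.Contains(addr) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	for _, ip := range []string{"10.1.2.3", "172.32.0.1", "::1", "fd00::1", "8.8.8.8"} {
		private, err := isPrivate(ip)
		fmt.Println(ip, private, err)
	}
}
```

A check like this is only meaningful when applied to the address a hostname actually resolves to, since a public-looking hostname can point at a private address.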
func WithContentType ¶
WithContentType sets the output content type for parsing. Valid options are "html", "markdown", and "text". By default, content is returned as HTML.
Example:
// Get content as markdown
client := hermes.New(hermes.WithContentType("markdown"))
func WithHTTPClient ¶
WithHTTPClient sets a custom HTTP client for the parser. This allows you to configure connection pooling, timeouts, proxies, etc.
Example:
httpClient := &http.Client{
	Timeout: 60 * time.Second,
	Transport: &http.Transport{
		MaxIdleConns: 200,
	},
}
client := hermes.New(hermes.WithHTTPClient(httpClient))
func WithTimeout ¶
WithTimeout sets the timeout for HTTP requests. This timeout applies to the entire request, including connection time, redirects, and reading the response body.
Example:
client := hermes.New(hermes.WithTimeout(30 * time.Second))
func WithTransport ¶
func WithTransport(transport http.RoundTripper) Option
WithTransport sets a custom HTTP transport for the parser. This is useful for configuring proxies, TLS settings, connection pooling, etc. If both WithHTTPClient and WithTransport are used, WithHTTPClient takes precedence.
Example:
transport := &http.Transport{
	Proxy:           http.ProxyFromEnvironment,
	MaxIdleConns:    100,
	IdleConnTimeout: 90 * time.Second,
}
client := hermes.New(hermes.WithTransport(transport))
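Because WithTransport accepts any http.RoundTripper, a transport can also be a thin wrapper around another one. The sketch below shows that wrapping technique using only the standard library; headerTransport, roundTripFunc, and seenHeader are hypothetical names, and the base transport is faked so the example needs no network.

```go
package main

import (
	"fmt"
	"net/http"
)

// headerTransport is a hypothetical wrapper: it sets one header on every
// outgoing request and delegates the actual round trip to Base.
type headerTransport struct {
	Base  http.RoundTripper
	Key   string
	Value string
}

func (t *headerTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// A RoundTripper should not modify the caller's request, so work on a clone.
	clone := req.Clone(req.Context())
	clone.Header.Set(t.Key, t.Value)
	return t.Base.RoundTrip(clone)
}

// roundTripFunc adapts a function to http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// seenHeader sends one request through the wrapper and returns the
// header value that reached the (fake) base transport.
func seenHeader(key, value string) (string, error) {
	var seen string
	base := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.Header.Get(key)
		return &http.Response{
			StatusCode: http.StatusOK,
			Body:       http.NoBody,
			Request:    r,
		}, nil
	})

	client := &http.Client{Transport: &headerTransport{Base: base, Key: key, Value: value}}
	resp, err := client.Get("http://example.invalid/")
	if err != nil {
		return "", err
	}
	resp.Body.Close()
	return seen, nil
}

func main() {
	v, err := seenHeader("X-Request-Source", "docs-sketch")
	fmt.Println(v, err) // prints "docs-sketch <nil>"
}
```

In real use, Base would typically be http.DefaultTransport or a configured *http.Transport, and the wrapper would be the value passed to hermes.WithTransport.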
func WithUserAgent ¶
WithUserAgent sets the User-Agent header for HTTP requests. This is useful for identifying your application to web servers.
Example:
client := hermes.New(hermes.WithUserAgent("MyApp/1.0"))
type ParseError ¶
type ParseError struct {
	// Code indicates the type of error
	Code ErrorCode
	// URL is the URL that was being parsed when the error occurred
	URL string
	// Op is the operation that failed (e.g., "Parse", "ParseHTML")
	Op string
	// Err is the underlying error
	Err error
}
ParseError represents an error that occurred during parsing. It includes the error code, URL, operation, and underlying error.
func (*ParseError) Is ¶
func (e *ParseError) Is(target error) bool
Is reports whether the target error is equal to this error
func (*ParseError) IsContext ¶
func (e *ParseError) IsContext() bool
IsContext returns true if the error was caused by context cancellation
func (*ParseError) IsExtract ¶
func (e *ParseError) IsExtract() bool
IsExtract returns true if the error occurred during content extraction
func (*ParseError) IsFetch ¶
func (e *ParseError) IsFetch() bool
IsFetch returns true if the error occurred during content fetching
func (*ParseError) IsInvalidURL ¶
func (e *ParseError) IsInvalidURL() bool
IsInvalidURL returns true if the error was caused by an invalid URL
func (*ParseError) IsSSRF ¶
func (e *ParseError) IsSSRF() bool
IsSSRF returns true if the error was caused by SSRF protection
func (*ParseError) IsTimeout ¶
func (e *ParseError) IsTimeout() bool
IsTimeout returns true if the error was caused by a timeout
type Parser ¶
type Parser interface {
	// Parse extracts content from the given URL.
	// The context can be used to cancel the request or set a deadline.
	Parse(ctx context.Context, url string) (*Result, error)
	// ParseHTML extracts content from pre-fetched HTML.
	// This is useful when you already have the HTML content.
	ParseHTML(ctx context.Context, html, url string) (*Result, error)
}
Parser is the interface for content extraction. Implement this interface to create mock parsers for testing.
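A mock along those lines might look like the sketch below. To keep it runnable on its own, Result and Parser are trimmed local copies of the documented shapes; real code would use the types from the hermes package instead. mockParser, titleOf, and runDemo are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Result and Parser are trimmed local copies of the documented shapes so
// this sketch compiles without the library.
type Result struct {
	URL   string
	Title string
}

type Parser interface {
	Parse(ctx context.Context, url string) (*Result, error)
	ParseHTML(ctx context.Context, html, url string) (*Result, error)
}

// mockParser returns canned results keyed by URL and records each call.
// It is not safe for concurrent use; a shared mock would need a mutex.
type mockParser struct {
	results map[string]*Result
	calls   []string
}

func (m *mockParser) Parse(_ context.Context, url string) (*Result, error) {
	m.calls = append(m.calls, url)
	if r, ok := m.results[url]; ok {
		return r, nil
	}
	return nil, errors.New("mock: no canned result for " + url)
}

func (m *mockParser) ParseHTML(ctx context.Context, _ string, url string) (*Result, error) {
	return m.Parse(ctx, url)
}

// titleOf is the hypothetical code under test: it depends on the
// interface, not on a concrete client.
func titleOf(ctx context.Context, p Parser, url string) (string, error) {
	r, err := p.Parse(ctx, url)
	if err != nil {
		return "", err
	}
	return r.Title, nil
}

// runDemo exercises titleOf against a mock holding one canned result and
// returns the title, the number of recorded calls, and any error.
func runDemo(url string) (string, int, error) {
	mock := &mockParser{results: map[string]*Result{
		"https://example.com/a": {URL: "https://example.com/a", Title: "Canned"},
	}}
	title, err := titleOf(context.Background(), mock, url)
	return title, len(mock.calls), err
}

func main() {
	fmt.Println(runDemo("https://example.com/a")) // prints "Canned 1 <nil>"
}
```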
type Result ¶
type Result struct {
	// Core content fields
	URL           string     `json:"url"`
	Title         string     `json:"title"`
	Content       string     `json:"content"`
	Author        string     `json:"author,omitempty"`
	DatePublished *time.Time `json:"date_published,omitempty"`

	// Media and metadata
	LeadImageURL string `json:"lead_image_url,omitempty"`
	Dek          string `json:"dek,omitempty"`
	Domain       string `json:"domain"`
	Excerpt      string `json:"excerpt,omitempty"`

	// Content metrics
	WordCount     int    `json:"word_count"`
	Direction     string `json:"direction,omitempty"`
	TotalPages    int    `json:"total_pages,omitempty"`
	RenderedPages int    `json:"rendered_pages,omitempty"`

	// Site information
	SiteName    string `json:"site_name,omitempty"`
	Description string `json:"description,omitempty"`
	Language    string `json:"language,omitempty"`
	ThemeColor  string `json:"theme_color,omitempty"`
	Favicon     string `json:"favicon,omitempty"`

	// Video metadata
	VideoURL      string                 `json:"video_url,omitempty"`
	VideoMetadata map[string]interface{} `json:"video_metadata,omitempty"`
}
Result contains the extracted content from a web page. All fields are read-only and represent the parsed article data.
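The struct tags determine how a Result serializes: fields tagged omitempty disappear from the JSON when empty, while the others are always present. The sketch below demonstrates this with a trimmed local copy of a few documented fields; the lowercase result type and the encode helper are illustrative and are not part of the library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// result is a trimmed local copy of a few documented Result fields and
// their JSON tags; it is not the library type.
type result struct {
	URL       string `json:"url"`
	Title     string `json:"title"`
	Author    string `json:"author,omitempty"`
	Domain    string `json:"domain"`
	WordCount int    `json:"word_count"`
}

// encode marshals r to a compact JSON string.
func encode(r result) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

func main() {
	// Author is empty, so the omitempty tag drops it; WordCount has no
	// omitempty, so its zero value is still emitted.
	s, err := encode(result{URL: "https://example.com/a", Title: "T", Domain: "example.com"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
	// prints {"url":"https://example.com/a","title":"T","domain":"example.com","word_count":0}
}
```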
func (*Result) FormatMarkdown ¶
FormatMarkdown formats the result as Markdown with metadata header. This is useful for saving the content in a human-readable format.
Example output:
    # Article Title
    ## Metadata
    **Author:** John Doe
    **Date:** 2024-01-01
    **URL:** https://example.com/article
    ## Content
    Article content here...
Directories ¶
| Path | Synopsis |
|---|---|
| cmd | |
| checks/concurrency (command) | |
| checks/production (command) | |
| checks/realworld (command) | |
| checks/registry (command) | |
| hermes (command) | |
| examples | |
| api-server (command) | Package main demonstrates how to build an HTTP API server using Hermes. |
| basic (command) | Package main demonstrates basic usage of the Hermes web content extraction library. |
| concurrent (command) | Package main demonstrates concurrent processing with the Hermes library. |
| custom-client (command) | Package main demonstrates advanced HTTP client configuration with Hermes. |
| internal | |
| cache | ABOUTME: Helper functions for optimized DOM operations using the existing cache system. |
| extractors | ABOUTME: Advanced extractor loader with LRU caching and dynamic loading Reduces startup memory by 90% through lazy loading and automatic cache management |
| extractors/validation | Package validation provides a comprehensive field validation framework for extracted fields and extended field support. |
| pools | ABOUTME: This file implements sync.Pool for reusing expensive objects like goquery documents and HTTP response bodies. |
| utils/dom | ABOUTME: Cleans H1 tags from article content based on count threshold analysis. |
| utils/text | ABOUTME: Implements article base URL extraction by removing pagination parameters ABOUTME: Faithful port of JavaScript article-base-url.js with identical logic and behavior |
| register (command) | |
| verify (command) | |