webfetch

package
v0.9.0
Warning

This package is not in the latest version of its module.

Published: Apr 3, 2026 License: Apache-2.0 Imports: 12 Imported by: 0

Documentation

Overview

Package webfetch provides URL content fetching with HTML-to-Markdown conversion. It supports configurable timeouts, size limits, robots.txt/agents.txt compliance, and optional enhanced fetchers (e.g. Jina Reader) for JS-heavy pages.

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ChainFetcher

type ChainFetcher struct {
	// contains filtered or unexported fields
}

ChainFetcher tries the primary fetcher first. If the result content looks empty (common with JS-rendered pages), it falls back to an enhanced fetcher.

func NewChainFetcher

func NewChainFetcher(primary, fallback Fetcher, logger *slog.Logger) *ChainFetcher

NewChainFetcher creates a fetcher that chains primary → fallback. If fallback is nil, it behaves identically to the primary fetcher.

func (*ChainFetcher) Fetch

func (c *ChainFetcher) Fetch(ctx context.Context, url string) (*FetchResult, error)

Fetch tries the primary fetcher; falls back to the enhanced fetcher if the primary result appears empty or too short (likely a JS-rendered page).

func (*ChainFetcher) Name

func (c *ChainFetcher) Name() string

Name returns the fetcher identifier.

type DefaultFetcher

type DefaultFetcher struct {
	// contains filtered or unexported fields
}

DefaultFetcher fetches URLs via HTTP and converts HTML to Markdown.

func NewDefaultFetcher

func NewDefaultFetcher(opts Options) *DefaultFetcher

NewDefaultFetcher creates a fetcher with the given options.

func (*DefaultFetcher) Fetch

func (f *DefaultFetcher) Fetch(ctx context.Context, rawURL string) (*FetchResult, error)

Fetch retrieves rawURL over HTTP and returns its content converted to Markdown.

func (*DefaultFetcher) Name

func (f *DefaultFetcher) Name() string

Name returns the fetcher identifier.

type FetchResult

type FetchResult struct {
	URL          string `json:"url"`
	Title        string `json:"title"`
	Content      string `json:"content"` // Markdown
	ContentType  string `json:"content_type"`
	BytesFetched int    `json:"bytes_fetched"`
}

FetchResult holds the fetched and converted content from a URL.
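As an illustration of the JSON field names declared by the struct tags, the sketch below marshals a local copy of FetchResult with the standard library. The sample values and the `encodeResult` helper are made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FetchResult is a local copy of the documented struct, including its JSON tags.
type FetchResult struct {
	URL          string `json:"url"`
	Title        string `json:"title"`
	Content      string `json:"content"` // Markdown
	ContentType  string `json:"content_type"`
	BytesFetched int    `json:"bytes_fetched"`
}

// encodeResult is a hypothetical helper that serializes a result to a JSON string.
func encodeResult(r FetchResult) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeResult(FetchResult{
		URL:          "https://example.com",
		Title:        "Example Domain",
		Content:      "# Example Domain",
		ContentType:  "text/html",
		BytesFetched: 1256,
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```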

type Fetcher

type Fetcher interface {
	Fetch(ctx context.Context, url string) (*FetchResult, error)
	Name() string
}

Fetcher retrieves content from a URL and returns it as Markdown.

type JinaFetcher

type JinaFetcher struct {
	// contains filtered or unexported fields
}

JinaFetcher uses the Jina Reader API (r.jina.ai) to fetch URLs and convert them to Markdown. Jina handles JavaScript rendering, making it suitable for JS-heavy pages that the DefaultFetcher cannot process.

func NewJinaFetcher

func NewJinaFetcher(timeout time.Duration, logger *slog.Logger) *JinaFetcher

NewJinaFetcher creates a Jina Reader fetcher.

func (*JinaFetcher) Fetch

func (j *JinaFetcher) Fetch(ctx context.Context, rawURL string) (*FetchResult, error)

Fetch retrieves a URL via Jina Reader and returns Markdown content.

func (*JinaFetcher) Name

func (j *JinaFetcher) Name() string

Name returns the fetcher identifier.

type Options

type Options struct {
	Timeout          time.Duration
	MaxSizeBytes     int64
	UserAgent        string
	RespectRobotsTxt bool
	RespectAgentsTxt bool
	Logger           *slog.Logger
}

Options configures a DefaultFetcher.
