fetcher

package
v1.0.1 Latest
Warning

This package is not in the latest version of its module.

Published: Mar 23, 2026 License: MIT Imports: 12 Imported by: 0

Documentation

Constants

const MinConvertedLines = 10

MinConvertedLines is the minimum number of non-blank lines for converted content to be considered meaningful (vs. a nav-heavy marketing page).
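As a rough illustration of how such a threshold might be applied, the sketch below counts non-blank lines and compares the count against the constant. The helpers `countNonBlankLines` and `isMeaningful` are hypothetical and not part of this package; the constant is redeclared locally so the snippet compiles on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// minConvertedLines mirrors fetcher.MinConvertedLines for this standalone sketch.
const minConvertedLines = 10

// countNonBlankLines returns how many lines contain non-whitespace characters.
func countNonBlankLines(s string) int {
	n := 0
	for _, line := range strings.Split(s, "\n") {
		if strings.TrimSpace(line) != "" {
			n++
		}
	}
	return n
}

// isMeaningful is a hypothetical helper: it reports whether converted
// content reaches the minimum number of non-blank lines.
func isMeaningful(markdown string) bool {
	return countNonBlankLines(markdown) >= minConvertedLines
}

func main() {
	short := "# Title\n\nSign up today!\n"
	long := strings.Repeat("A line of documentation.\n\n", 12)
	fmt.Println(isMeaningful(short), isMeaningful(long))
}
```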

Variables

var DefaultContentSelectors = []string{
	"article",
	".markdown",
	".md-content",
	".theme-doc-markdown",
	".document",
	"[role='main']",
	"main",
	".content",
	"#content",
	".main-content",
}

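DefaultContentSelectors are tried in order to find the main content element; the first match wins. They cover the major doc platforms (for example, `.md-content` is used by Material for MkDocs and `.theme-doc-markdown` by Docusaurus).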

var DefaultRemoveSelectors = []string{
	"nav",
	".navbar",
	".sidebar",
	".md-sidebar",
	".md-header",
	".md-footer",
	".md-source-file",
	".md-content__button",
	".md-edit",
	".headerlink",
	".docSidebarContainer",
	".pagination-nav",
	".theme-doc-footer",
	".sphinxsidebarwrapper",
	".advertisement",
	"footer",
	"script",
	"style",
}

DefaultRemoveSelectors lists selectors for elements that are stripped because they add noise to converted markdown.

Functions

func CleanHTML

func CleanHTML(html string, contentSelectors, removeSelectors []string) (string, error)

CleanHTML extracts the main content from an HTML page using CSS selectors, removes navigation/chrome elements, and returns the cleaned HTML ready for markdown conversion. Converting only the extracted content yields markedly cleaner markdown than converting the full page.

If contentSelectors is nil, DefaultContentSelectors is used. If removeSelectors is nil, DefaultRemoveSelectors is used.
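The nil-means-default rule can be sketched as follows. This is an illustration only: `resolveSelectors` is a hypothetical helper, the two default lists are shortened local stand-ins for the package variables, and the real function may resolve its arguments differently.

```go
package main

import "fmt"

// Shortened local stand-ins for fetcher.DefaultContentSelectors and
// fetcher.DefaultRemoveSelectors.
var (
	defaultContentSelectors = []string{"article", "main", "#content"}
	defaultRemoveSelectors  = []string{"nav", "footer", "script", "style"}
)

// resolveSelectors is a hypothetical helper showing the documented rule:
// a nil slice selects the corresponding default list. Under this reading,
// an empty but non-nil slice would not trigger the default.
func resolveSelectors(content, remove []string) ([]string, []string) {
	if content == nil {
		content = defaultContentSelectors
	}
	if remove == nil {
		remove = defaultRemoveSelectors
	}
	return content, remove
}

func main() {
	// Put a site-specific selector first so it is tried before the defaults.
	custom := append([]string{".api-reference"}, defaultContentSelectors...)
	content, remove := resolveSelectors(custom, nil)
	fmt.Println(content[0], len(remove))
}
```

Because the first match wins, prepending a site-specific selector to the default list is one way to customize extraction without losing the fallbacks.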

func CleanMDX

func CleanMDX(content string) string

CleanMDX strips MDX/JSX component tags and export statements from markdown content. These are framework artifacts (Mintlify, Nextra, etc.) that add noise without contributing documentation value.
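To make the idea concrete, here is a minimal regex-based sketch of this kind of cleanup. It is not the package's implementation: `stripMDX` and its patterns are assumptions, it only handles single-line `export` statements, it treats any tag whose name starts with an uppercase letter as a component, and it does not skip fenced code blocks.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Single-line export statements such as `export const meta = {...}`.
	exportLine = regexp.MustCompile(`(?m)^export\s.*$`)
	// Opening, closing, and self-closing tags whose name starts uppercase,
	// e.g. <Note>, </Note>, <Card title="x" />, <Tabs.Tab>.
	componentTag = regexp.MustCompile(`</?[A-Z][A-Za-z0-9.]*(\s[^>]*)?/?>`)
	// Runs of three or more newlines left behind by the removals.
	blankRuns = regexp.MustCompile(`\n{3,}`)
)

// stripMDX is a hypothetical stand-in for CleanMDX. It removes the tags
// but keeps the text between them.
func stripMDX(content string) string {
	content = exportLine.ReplaceAllString(content, "")
	content = componentTag.ReplaceAllString(content, "")
	content = blankRuns.ReplaceAllString(content, "\n\n")
	return strings.TrimSpace(content)
}

func main() {
	in := "export const meta = { title: \"Intro\" }\n\n# Intro\n\n" +
		"<Note>\nRead this first.\n</Note>\n\n<Card title=\"Next\" href=\"/next\" />\n"
	fmt.Println(stripMDX(in))
}
```

Lowercase HTML tags such as `<b>` are left untouched by this sketch, since only capitalized tag names are treated as components.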

func ConvertHTML

func ConvertHTML(html string, selectors ...[]string) (string, error)

ConvertHTML converts HTML content to markdown. It first cleans the HTML by extracting the main content area (stripping nav, footer, sidebar), then converts to markdown with whitespace normalization.

The optional selectors arguments carry contentSelectors and removeSelectors; omit them or pass nil to use the defaults. ConvertHTML is intended for cases where a URL that should serve text/markdown returns HTML instead.

func IsHTML

func IsHTML(contentType string, body []byte) bool

IsHTML returns true if the response looks like an HTML page rather than text/markdown content. It checks the Content-Type header and also sniffs the body.
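A header-plus-sniffing check of this kind could look roughly like the sketch below, built only on the standard library. `looksLikeHTML` is a hypothetical stand-in; the source does not say which media types or sniffing rules IsHTML actually uses.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
	"strings"
)

// looksLikeHTML is a hypothetical stand-in for IsHTML.
func looksLikeHTML(contentType string, body []byte) bool {
	// 1. Trust an explicit HTML media type in the Content-Type header.
	if mt, _, err := mime.ParseMediaType(contentType); err == nil {
		if mt == "text/html" || mt == "application/xhtml+xml" {
			return true
		}
	}
	// 2. Otherwise sniff the body. http.DetectContentType inspects at most
	// the first 512 bytes and reports "text/html; charset=utf-8" for bodies
	// that start with markers such as <!DOCTYPE html or <html.
	return strings.HasPrefix(http.DetectContentType(body), "text/html")
}

func main() {
	page := []byte("<!DOCTYPE html><html><body>Not found</body></html>")
	md := []byte("# Title\n\nBody text.\n")
	fmt.Println(looksLikeHTML("text/markdown", page), looksLikeHTML("text/markdown", md))
}
```

Sniffing matters for the case described under ConvertHTML: a server that labels a response text/markdown (or sends no type at all) while the body is an HTML page.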

func IsJSHeavy

func IsJSHeavy(html string) bool

IsJSHeavy detects whether an HTML page is a JavaScript SPA shell that would need browser rendering to extract meaningful content.

It uses a two-tier strategy:

  1. Short-circuit for obvious SPA shells (mount div + module hint + tiny body text).
  2. Score indicators: returns true if at least two match (conservative).
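The two tiers above can be sketched as follows. Everything specific here is an assumption made for illustration: the mount-div IDs, the 200-character text threshold, the `<noscript>` and script-count indicators, and the name `looksJSHeavy` are not taken from the package.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	mountDiv    = regexp.MustCompile(`(?i)<div[^>]+id=["'](root|app|__next)["']`)
	moduleHint  = regexp.MustCompile(`(?i)<script[^>]+type=["']module["']`)
	scriptStyle = regexp.MustCompile(`(?is)<(script|style)\b.*?</(script|style)>`)
	anyTag      = regexp.MustCompile(`<[^>]*>`)
)

// visibleTextLen approximates how much readable text the page carries by
// dropping script/style blocks and tags, then collapsing whitespace.
func visibleTextLen(html string) int {
	text := scriptStyle.ReplaceAllString(html, " ")
	text = anyTag.ReplaceAllString(text, " ")
	return len(strings.Join(strings.Fields(text), " "))
}

// looksJSHeavy is a hypothetical stand-in for IsJSHeavy.
func looksJSHeavy(html string) bool {
	lower := strings.ToLower(html)
	hasMount := mountDiv.MatchString(html)
	hasModule := moduleHint.MatchString(html)
	tinyText := visibleTextLen(html) < 200 // assumed threshold

	// Tier 1: obvious SPA shell.
	if hasMount && hasModule && tinyText {
		return true
	}
	// Tier 2: require at least two independent indicators.
	score := 0
	for _, hit := range []bool{
		hasMount,
		tinyText,
		strings.Contains(lower, "<noscript"),
		strings.Count(lower, "<script") >= 10, // assumed threshold
	} {
		if hit {
			score++
		}
	}
	return score >= 2
}

// samplePages returns an SPA shell and a static documentation page.
func samplePages() (spa, static string) {
	spa = `<html><head><script type="module" src="/main.js"></script></head>` +
		`<body><div id="root"></div></body></html>`
	static = "<html><body><article><h1>Guide</h1>" +
		strings.Repeat("<p>Install the tool, then run it against your docs.</p>", 10) +
		"</article></body></html>"
	return spa, static
}

func main() {
	spa, static := samplePages()
	fmt.Println(looksJSHeavy(spa), looksJSHeavy(static))
}
```

Requiring two indicators keeps a small but fully static page (which only trips the tiny-text check) from being sent to a browser renderer unnecessarily.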

Types

type Fetcher

type Fetcher struct {
	// contains filtered or unexported fields
}

Fetcher handles all HTTP requests with per-domain rate limiting.

func New

func New(opts Options) *Fetcher

New creates a Fetcher with the given options.

func (*Fetcher) Fetch

func (f *Fetcher) Fetch(ctx context.Context, url string) (*Response, error)

Fetch retrieves the content at the given URL. For 404s it returns a nil Response and a nil error, so callers can treat missing content as expected rather than as a failure.
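Because a missing page is reported as a nil Response with a nil error, callers need to check both values. The sketch below shows that calling pattern against a local test server; `fetch`, `describe`, and the trimmed `response` type are hypothetical stand-ins written with net/http, not the package's code.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// response is a trimmed local stand-in for fetcher.Response.
type response struct {
	StatusCode int
	Body       []byte
	URL        string
}

// fetch is a hypothetical stand-in for (*Fetcher).Fetch: 404 yields (nil, nil).
func fetch(ctx context.Context, url string) (*response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNotFound {
		return nil, nil
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return &response{StatusCode: resp.StatusCode, Body: body, URL: url}, nil
}

// describe shows the caller side: check the error first, then the nil response.
func describe(ctx context.Context, url string) string {
	resp, err := fetch(ctx, url)
	switch {
	case err != nil:
		return "error: " + err.Error()
	case resp == nil:
		return "missing"
	default:
		return fmt.Sprintf("%d bytes", len(resp.Body))
	}
}

// demo serves one page from a local test server and fetches two URLs.
func demo() (found, missing string) {
	mux := http.NewServeMux()
	mux.HandleFunc("/docs.md", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "# Docs\n")
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()
	ctx := context.Background()
	return describe(ctx, srv.URL+"/docs.md"), describe(ctx, srv.URL+"/nope.md")
}

func main() {
	fmt.Println(demo())
}
```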

func (*Fetcher) FetchConditional

func (f *Fetcher) FetchConditional(ctx context.Context, url, etag, lastModified string) (*Response, error)

FetchConditional retrieves content only if it has changed, using ETag and Last-Modified headers for cache validation. Returns nil, nil on 304 Not Modified (same pattern as 404).
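A conditional request of this kind is usually built from the standard `If-None-Match` and `If-Modified-Since` request headers. The sketch below shows one way to do that with net/http and a local test server; `fetchConditional` and the trimmed `response` type are hypothetical stand-ins, and the package's own header handling may differ.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// response is a trimmed local stand-in for fetcher.Response.
type response struct {
	StatusCode   int
	Body         []byte
	ETag         string
	LastModified string
}

// fetchConditional is a hypothetical stand-in for (*Fetcher).FetchConditional.
func fetchConditional(ctx context.Context, url, etag, lastModified string) (*response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// Send only the validators the caller actually has.
	if etag != "" {
		req.Header.Set("If-None-Match", etag)
	}
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// 304 (unchanged) and 404 (missing) both map to (nil, nil).
	if resp.StatusCode == http.StatusNotModified || resp.StatusCode == http.StatusNotFound {
		return nil, nil
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return &response{
		StatusCode:   resp.StatusCode,
		Body:         body,
		ETag:         resp.Header.Get("ETag"),
		LastModified: resp.Header.Get("Last-Modified"),
	}, nil
}

// demo fetches once, then revalidates with the returned ETag.
func demo() (fresh, unchanged bool, err error) {
	const tag = `"v1"`
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("If-None-Match") == tag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Header().Set("ETag", tag)
		fmt.Fprint(w, "# Docs\n")
	}))
	defer srv.Close()

	ctx := context.Background()
	first, err := fetchConditional(ctx, srv.URL, "", "")
	if err != nil || first == nil {
		return false, false, err
	}
	second, err := fetchConditional(ctx, srv.URL, first.ETag, first.LastModified)
	return true, second == nil, err
}

func main() {
	fmt.Println(demo())
}
```

Callers would typically store the ETag and LastModified values from a Response and pass them back on the next call for the same URL.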

type HTTPFetcher

type HTTPFetcher interface {
	Fetch(ctx context.Context, url string) (*Response, error)
	FetchConditional(ctx context.Context, url, etag, lastModified string) (*Response, error)
}

HTTPFetcher is the interface for HTTP content retrieval. The default implementation provides per-domain rate limiting and conditional requests. Alternative implementations can add browser rendering (Playwright), authentication, or custom transport logic.
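As one example of an alternative implementation, the sketch below satisfies the interface with an in-memory map, which can be handy in tests. `mapFetcher` is hypothetical, and the interface and Response type are copied locally so the snippet compiles on its own; the ETag scheme (a CRC32 of the body) is an arbitrary choice for illustration.

```go
package main

import (
	"context"
	"fmt"
	"hash/crc32"
)

// Response and HTTPFetcher are local copies of the package's declarations.
type Response struct {
	StatusCode   int
	Body         []byte
	ETag         string
	LastModified string
	URL          string
	ContentType  string
}

type HTTPFetcher interface {
	Fetch(ctx context.Context, url string) (*Response, error)
	FetchConditional(ctx context.Context, url, etag, lastModified string) (*Response, error)
}

// mapFetcher is a hypothetical in-memory implementation keyed by URL.
type mapFetcher map[string]string

func (m mapFetcher) Fetch(ctx context.Context, url string) (*Response, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	body, ok := m[url]
	if !ok {
		return nil, nil // mirrors the documented 404 convention
	}
	return &Response{
		StatusCode:  200,
		Body:        []byte(body),
		ETag:        fmt.Sprintf(`"%08x"`, crc32.ChecksumIEEE([]byte(body))),
		URL:         url,
		ContentType: "text/markdown",
	}, nil
}

func (m mapFetcher) FetchConditional(ctx context.Context, url, etag, lastModified string) (*Response, error) {
	resp, err := m.Fetch(ctx, url)
	if err != nil || resp == nil {
		return nil, err
	}
	if etag != "" && etag == resp.ETag {
		return nil, nil // mirrors the documented 304 convention
	}
	return resp, nil
}

// Compile-time check that mapFetcher satisfies the interface.
var _ HTTPFetcher = mapFetcher(nil)

// demo reports: page found, missing page is nil, unchanged page is nil.
func demo() (found, missingNil, unchangedNil bool) {
	var f HTTPFetcher = mapFetcher{"https://docs.example.com/a.md": "# A\n"}
	ctx := context.Background()
	page, _ := f.Fetch(ctx, "https://docs.example.com/a.md")
	missing, _ := f.Fetch(ctx, "https://docs.example.com/b.md")
	if page == nil {
		return false, missing == nil, false
	}
	again, _ := f.FetchConditional(ctx, page.URL, page.ETag, "")
	return true, missing == nil, again == nil
}

func main() {
	fmt.Println(demo())
}
```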

type Options

type Options struct {
	UserAgent    string
	RatePerHost  int
	BurstPerHost int
	Timeout      time.Duration
}

Options configures the Fetcher.
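The RatePerHost and BurstPerHost fields suggest a token bucket kept per host, although the source does not state the algorithm or the units. The sketch below shows one such limiter under the assumption that the rate is in requests per second; `hostLimiter` and its methods are hypothetical and not the package's implementation.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
	"time"
)

// bucket tracks the available tokens for one host.
type bucket struct {
	tokens float64
	last   time.Time
}

// hostLimiter is a hypothetical per-host token bucket.
type hostLimiter struct {
	mu      sync.Mutex
	rate    float64 // tokens added per second (assumed unit)
	burst   float64 // bucket capacity
	buckets map[string]*bucket
}

func newHostLimiter(ratePerHost, burstPerHost int) *hostLimiter {
	return &hostLimiter{
		rate:    float64(ratePerHost),
		burst:   float64(burstPerHost),
		buckets: make(map[string]*bucket),
	}
}

// allow reports whether a request to rawURL may proceed at time now.
func (l *hostLimiter) allow(rawURL string, now time.Time) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	l.mu.Lock()
	defer l.mu.Unlock()

	b, ok := l.buckets[u.Host]
	if !ok {
		// A new host starts with a full bucket.
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[u.Host] = b
	}
	// Refill according to elapsed time, capped at the burst size.
	b.tokens += now.Sub(b.last).Seconds() * l.rate
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = now

	if b.tokens < 1 {
		return false, nil
	}
	b.tokens--
	return true, nil
}

// demo uses a fixed clock: three requests to host A, one to host B,
// then one more to host A a second later.
func demo() []bool {
	l := newHostLimiter(1, 2)
	t0 := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	calls := []struct {
		url string
		at  time.Time
	}{
		{"https://a.example.com/1", t0},
		{"https://a.example.com/2", t0},
		{"https://a.example.com/3", t0},
		{"https://b.example.com/1", t0},
		{"https://a.example.com/4", t0.Add(time.Second)},
	}
	out := make([]bool, 0, len(calls))
	for _, c := range calls {
		ok, _ := l.allow(c.url, c.at)
		out = append(out, ok)
	}
	return out
}

func main() {
	fmt.Println(demo())
}
```

Keying the buckets by host means a burst against one documentation site does not delay requests to another.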

type Response

type Response struct {
	StatusCode   int
	Body         []byte
	ETag         string
	LastModified string
	URL          string
	ContentType  string
}

Response holds the result of a fetch.
