fetcher

package

Version: v1.0.1 (latest) | Published: Mar 23, 2026 | License: MIT | Imports: 12 | Imported by: 0

Repository

github.com/dmoose/doctrove

Documentation ¶

Index ¶

Constants
Variables
func CleanHTML(html string, contentSelectors, removeSelectors []string) (string, error)
func CleanMDX(content string) string
func ConvertHTML(html string, selectors ...[]string) (string, error)
func IsHTML(contentType string, body []byte) bool
func IsJSHeavy(html string) bool
type Fetcher
- func New(opts Options) *Fetcher
- func (f *Fetcher) Fetch(ctx context.Context, url string) (*Response, error)
- func (f *Fetcher) FetchConditional(ctx context.Context, url, etag, lastModified string) (*Response, error)
type HTTPFetcher
type Options
type Response

Constants ¶

const MinConvertedLines = 10

MinConvertedLines is the minimum number of non-blank lines for converted content to be considered meaningful (vs. a nav-heavy marketing page).
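As an illustration of how such a threshold might be applied, the sketch below counts non-blank lines in converted markdown and compares the count against the documented value of 10. The helper names `countNonBlankLines` and `looksMeaningful` are hypothetical, and the use of `>=` for the comparison is an assumption; the package's own check may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// minConvertedLines mirrors the documented value of MinConvertedLines.
const minConvertedLines = 10

// countNonBlankLines is a hypothetical helper: it counts lines that contain
// at least one non-whitespace character.
func countNonBlankLines(markdown string) int {
	n := 0
	for _, line := range strings.Split(markdown, "\n") {
		if strings.TrimSpace(line) != "" {
			n++
		}
	}
	return n
}

// looksMeaningful reports whether the markdown reaches the threshold.
// Treating the threshold as inclusive is an assumption.
func looksMeaningful(markdown string) bool {
	return countNonBlankLines(markdown) >= minConvertedLines
}

func main() {
	short := "# Title\n\nOne line of text.\n"
	fmt.Println(countNonBlankLines(short), looksMeaningful(short)) // prints "2 false"
}
```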

Variables ¶

var DefaultContentSelectors = []string{
	"article",
	".markdown",
	".md-content",
	".theme-doc-markdown",
	".document",
	"[role='main']",
	"main",
	".content",
	"#content",
	".main-content",
}

DefaultContentSelectors are tried in order to find the main content element; the first match wins. They cover the major documentation platforms.

var DefaultRemoveSelectors = []string{
	"nav",
	".navbar",
	".sidebar",
	".md-sidebar",
	".md-header",
	".md-footer",
	".md-source-file",
	".md-content__button",
	".md-edit",
	".headerlink",
	".docSidebarContainer",
	".pagination-nav",
	".theme-doc-footer",
	".sphinxsidebarwrapper",
	".advertisement",
	"footer",
	"script",
	"style",
}

DefaultRemoveSelectors lists the elements that are stripped because they add noise to converted markdown.

Functions ¶

func CleanHTML ¶

func CleanHTML(html string, contentSelectors, removeSelectors []string) (string, error)

CleanHTML extracts the main content from an HTML page using CSS selectors, removes navigation/chrome elements, and returns the cleaned HTML ready for markdown conversion. Converting only the extracted content yields cleaner markdown than converting the full page.

If contentSelectors is nil, DefaultContentSelectors is used. If removeSelectors is nil, DefaultRemoveSelectors is used.
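The two rules described here (nil falls back to the defaults, and content selectors are tried in order with the first match winning) can be sketched without an HTML parser. Everything below is hypothetical: `resolve` and `firstMatch` are illustrative names, the `matches` callback stands in for a real CSS query against a parsed page, and the distinction between a nil and an empty slice is an assumption.

```go
package main

import "fmt"

// resolve returns defaults only when given is nil. Whether an empty,
// non-nil slice also triggers the defaults is not documented; this sketch
// assumes it does not.
func resolve(given, defaults []string) []string {
	if given == nil {
		return defaults
	}
	return given
}

// firstMatch walks selectors in order and returns the first one for which
// matches reports true. matches is a stand-in for a real selector query.
func firstMatch(selectors []string, matches func(string) bool) (string, bool) {
	for _, s := range selectors {
		if matches(s) {
			return s, true
		}
	}
	return "", false
}

func main() {
	defaults := []string{"article", "main", ".content"} // shortened stand-in list
	selectors := resolve(nil, defaults)

	// Pretend the page contains <main> and .content but no <article>.
	present := map[string]bool{"main": true, ".content": true}
	sel, ok := firstMatch(selectors, func(s string) bool { return present[s] })
	fmt.Println(sel, ok) // prints "main true"
}
```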

func CleanMDX ¶

func CleanMDX(content string) string

CleanMDX strips MDX/JSX component tags and export statements from markdown content. These are framework artifacts (Mintlify, Nextra, etc.) that add noise without contributing documentation value.
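To make the kind of cleanup described here concrete, the following is a rough regexp-based sketch, not the package's implementation. `stripMDX` is a hypothetical name. It assumes that component tags start with an uppercase letter, that export statements fit on one line, and that attribute values contain no `>`; it would also strip matching text inside fenced code blocks.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Single-line export statements such as "export const meta = {...}".
	exportLine = regexp.MustCompile(`(?m)^export\s.*\n?`)
	// Opening, closing, and self-closing tags whose name starts uppercase,
	// for example <Note>, </Note>, <Card title="x" />.
	componentTag = regexp.MustCompile(`</?[A-Z][A-Za-z0-9.]*(\s[^>]*)?/?>`)
)

// stripMDX removes export lines and component tags but keeps the text
// that the tags wrap.
func stripMDX(content string) string {
	content = exportLine.ReplaceAllString(content, "")
	return componentTag.ReplaceAllString(content, "")
}

func main() {
	in := "export const meta = {}\n# Title\n<Note>Be careful.</Note>\n"
	fmt.Printf("%q\n", stripMDX(in)) // prints "# Title\nBe careful.\n"
}
```

Lowercase HTML tags such as `<div>` are left alone by this pattern, which keeps ordinary inline HTML in markdown intact.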

func ConvertHTML ¶

func ConvertHTML(html string, selectors ...[]string) (string, error)

ConvertHTML converts HTML content to markdown. It first cleans the HTML by extracting the main content area (stripping nav, footer, sidebar), then converts to markdown with whitespace normalization.

contentSelectors and removeSelectors are optional and are supplied through the variadic selectors parameter; pass nil for defaults. ConvertHTML is used when a URL that should serve text/markdown returns HTML instead.
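The whitespace normalization step is not specified further. One plausible shape for it, shown purely as an assumption, is to unify line endings, drop trailing spaces and tabs, and collapse runs of blank lines. `normalizeWhitespace` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	trailingSpace = regexp.MustCompile(`(?m)[ \t]+$`) // spaces/tabs at line end
	blankRuns     = regexp.MustCompile(`\n{3,}`)      // two or more blank lines
)

// normalizeWhitespace tidies converted markdown: CRLF becomes LF, trailing
// whitespace is removed, blank-line runs collapse to one blank line, and the
// result ends with exactly one newline.
func normalizeWhitespace(md string) string {
	md = strings.ReplaceAll(md, "\r\n", "\n")
	md = trailingSpace.ReplaceAllString(md, "")
	md = blankRuns.ReplaceAllString(md, "\n\n")
	return strings.TrimSpace(md) + "\n"
}

func main() {
	in := "# Title  \r\n\r\n\r\n\r\nText\t\n\n"
	fmt.Printf("%q\n", normalizeWhitespace(in)) // prints "# Title\n\nText\n"
}
```

Note that removing trailing spaces also removes markdown hard line breaks written as two trailing spaces, a trade-off a real implementation may or may not accept.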

func IsHTML ¶

func IsHTML(contentType string, body []byte) bool

IsHTML returns true if the response looks like an HTML page rather than text/markdown content. Checks both Content-Type header and body sniffing.
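A check of this kind can be approximated with the standard library alone. The sketch below is an assumption about the general approach, not the package's code: `looksLikeHTML` is a hypothetical name, and treating either signal (header or body) as sufficient is a guess at how the two checks combine.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
	"strings"
)

// looksLikeHTML reports true when the Content-Type header names an HTML
// media type, or when the body sniffs as HTML regardless of the header.
func looksLikeHTML(contentType string, body []byte) bool {
	if mt, _, err := mime.ParseMediaType(contentType); err == nil {
		if mt == "text/html" || mt == "application/xhtml+xml" {
			return true
		}
	}
	// http.DetectContentType inspects at most the first 512 bytes.
	return strings.HasPrefix(http.DetectContentType(body), "text/html")
}

func main() {
	page := []byte("<!DOCTYPE html><html><body>hi</body></html>")
	fmt.Println(looksLikeHTML("text/html; charset=utf-8", nil))
	fmt.Println(looksLikeHTML("text/markdown", page))
	fmt.Println(looksLikeHTML("text/markdown", []byte("# Title\n\nSome text.\n")))
}
```

The second call is the case the surrounding documentation cares about: a server labels the response as markdown but actually sends an HTML page.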

func IsJSHeavy ¶

func IsJSHeavy(html string) bool

IsJSHeavy detects whether an HTML page is a JavaScript SPA shell that would need browser rendering to extract meaningful content.

Uses a two-tier strategy:

1. Short-circuit for obvious SPA shells (mount div + module hint + tiny body text).
2. Score indicators: returns true if at least two match (conservative).
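The two tiers can be sketched as follows. Only the overall structure and the threshold of two come from the description above. The concrete indicators (a `<noscript>` tag, five or more script tags, under 200 characters of visible text), the mount-div ids, and the name `jsHeavySketch` are all hypothetical choices made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	scriptOrStyle = regexp.MustCompile(`(?is)<(script|style)\b.*?</(script|style)>`)
	anyTag        = regexp.MustCompile(`<[^>]*>`)
)

// visibleTextLen roughly measures readable text by dropping script/style
// blocks and all remaining tags, then collapsing whitespace.
func visibleTextLen(html string) int {
	s := scriptOrStyle.ReplaceAllString(html, "")
	s = anyTag.ReplaceAllString(s, " ")
	return len(strings.Join(strings.Fields(s), " "))
}

// jsHeavySketch applies a short-circuit tier, then a scoring tier.
func jsHeavySketch(html string) bool {
	lower := strings.ToLower(html)
	tiny := visibleTextLen(html) < 200 // hypothetical cutoff

	// Tier 1: mount div + module script + almost no text.
	hasMount := strings.Contains(lower, `<div id="root"`) ||
		strings.Contains(lower, `<div id="app"`)
	hasModule := strings.Contains(lower, `type="module"`)
	if hasMount && hasModule && tiny {
		return true
	}

	// Tier 2: count weaker indicators; at least two must match.
	score := 0
	for _, hit := range []bool{
		tiny,
		strings.Contains(lower, "<noscript"),
		strings.Count(lower, "<script") >= 5,
	} {
		if hit {
			score++
		}
	}
	return score >= 2
}

func main() {
	shell := `<html><body><div id="root"></div>` +
		`<script type="module" src="/main.js"></script></body></html>`
	docs := "<html><body>" + strings.Repeat("<p>Readable documentation text.</p>", 30) + "</body></html>"
	fmt.Println(jsHeavySketch(shell), jsHeavySketch(docs)) // prints "true false"
}
```

Requiring two indicators in the second tier keeps a single incidental signal, such as a lone `<noscript>` fallback on a text-rich page, from triggering a false positive.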

Types ¶

type Fetcher ¶

type Fetcher struct {
	// contains filtered or unexported fields
}

Fetcher handles all HTTP requests with per-domain rate limiting.
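How the Fetcher implements per-domain rate limiting is not shown in this reference. As a minimal sketch of the general idea, the code below keeps one token bucket per URL host. The names `hostLimiter` and `allow` are hypothetical, the clock is passed in as seconds so the behavior is deterministic, and interpreting the rate as tokens per second is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"sync"
)

// bucket is a token bucket for a single host.
type bucket struct {
	tokens float64
	last   float64 // time of last refill, in seconds
}

// hostLimiter tracks one bucket per host. rate is tokens added per second;
// burst is the bucket capacity.
type hostLimiter struct {
	mu      sync.Mutex
	rate    float64
	burst   float64
	buckets map[string]*bucket
}

func newHostLimiter(ratePerSec, burst int) *hostLimiter {
	return &hostLimiter{
		rate:    float64(ratePerSec),
		burst:   float64(burst),
		buckets: make(map[string]*bucket),
	}
}

// allow reports whether a request to rawURL may proceed at time nowSec.
func (l *hostLimiter) allow(rawURL string, nowSec float64) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	l.mu.Lock()
	defer l.mu.Unlock()

	b, ok := l.buckets[u.Host]
	if !ok {
		b = &bucket{tokens: l.burst, last: nowSec} // new hosts start full
		l.buckets[u.Host] = b
	}
	// Refill for the elapsed time, capped at the burst size.
	b.tokens += (nowSec - b.last) * l.rate
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = nowSec

	if b.tokens < 1 {
		return false, nil
	}
	b.tokens--
	return true, nil
}

func main() {
	l := newHostLimiter(1, 2) // 1 token per second, burst of 2
	for _, t := range []float64{0, 0, 0, 1.5} {
		ok, err := l.allow("https://docs.example.com/guide", t)
		fmt.Println(t, ok, err)
	}
}
```

Because buckets are keyed by host, a burst of requests to one documentation site does not delay requests to another.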

func New ¶

func New(opts Options) *Fetcher

New creates a Fetcher with the given options.

func (*Fetcher) Fetch ¶

func (f *Fetcher) Fetch(ctx context.Context, url string) (*Response, error)

Fetch retrieves the content at the given URL. It returns a nil Response and a nil error for 404s, so callers can treat missing content as an expected outcome rather than a failure.
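Because a 404 yields a nil Response without an error, call sites need three branches rather than the usual two. The sketch below shows that pattern with a local stand-in `response` type and a hypothetical `summarize` helper; neither is part of the package.

```go
package main

import (
	"errors"
	"fmt"
)

// response is a local stand-in for the package's Response type, reduced to
// the fields this sketch needs.
type response struct {
	StatusCode int
	Body       []byte
}

// summarize handles the three outcomes a caller of Fetch should expect:
// an error, a nil response (missing content), or a usable response.
func summarize(resp *response, err error) string {
	switch {
	case err != nil:
		return "failed: " + err.Error()
	case resp == nil:
		return "missing (404), skipping"
	default:
		return fmt.Sprintf("ok: %d bytes", len(resp.Body))
	}
}

func main() {
	fmt.Println(summarize(&response{StatusCode: 200, Body: []byte("# Docs")}, nil))
	fmt.Println(summarize(nil, nil))
	fmt.Println(summarize(nil, errors.New("timeout")))
}
```

Checking `err` before `resp` matters: dereferencing the response without the nil check would panic on every missing page.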

func (*Fetcher) FetchConditional ¶

func (f *Fetcher) FetchConditional(ctx context.Context, url, etag, lastModified string) (*Response, error)

FetchConditional retrieves content only if it has changed, using ETag and Last-Modified headers for cache validation. Returns nil, nil on 304 Not Modified (same pattern as 404).
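For readers unfamiliar with HTTP cache validation, here is a minimal standard-library sketch of the mechanism, not the package's implementation. It sends `If-None-Match` and `If-Modified-Since` and maps 304 (and, mirroring Fetch, 404) to `nil, nil`. The `response` type, `fetchConditional`, and `demo` are hypothetical names, and how the real Fetcher treats other non-2xx statuses is not covered here.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// response is a local stand-in for the package's Response type.
type response struct {
	StatusCode   int
	Body         []byte
	ETag         string
	LastModified string
}

// fetchConditional issues a GET with cache validators when they are known.
func fetchConditional(ctx context.Context, c *http.Client, url, etag, lastModified string) (*response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if etag != "" {
		req.Header.Set("If-None-Match", etag)
	}
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusNotModified, http.StatusNotFound:
		return nil, nil // unchanged or missing: nothing to return
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return &response{
		StatusCode:   resp.StatusCode,
		Body:         body,
		ETag:         resp.Header.Get("ETag"),
		LastModified: resp.Header.Get("Last-Modified"),
	}, nil
}

// demo fetches twice from an in-process test server that honors ETags.
func demo() (first, second *response, err error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		const tag = `"v1"`
		if r.Header.Get("If-None-Match") == tag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Header().Set("ETag", tag)
		fmt.Fprint(w, "# Docs\n")
	}))
	defer srv.Close()

	ctx := context.Background()
	first, err = fetchConditional(ctx, srv.Client(), srv.URL, "", "")
	if err != nil || first == nil {
		return first, nil, err
	}
	second, err = fetchConditional(ctx, srv.Client(), srv.URL, first.ETag, first.LastModified)
	return first, second, err
}

func main() {
	first, second, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("first: %d %q\n", first.StatusCode, first.Body)
	fmt.Println("second is nil:", second == nil)
}
```

The second request carries the ETag from the first, so the test server answers 304 and the caller receives `nil, nil`.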

type HTTPFetcher ¶

type HTTPFetcher interface {
	Fetch(ctx context.Context, url string) (*Response, error)
	FetchConditional(ctx context.Context, url, etag, lastModified string) (*Response, error)
}

HTTPFetcher is the interface for HTTP content retrieval. The default implementation provides per-domain rate limiting and conditional requests. Alternative implementations can add browser rendering (Playwright), authentication, or custom transport logic.
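As a small illustration of an alternative implementation, the sketch below satisfies the interface with an in-memory map, which can be handy in tests. `Response` and `HTTPFetcher` are re-declared locally as copies of the documented definitions so the example compiles on its own; `staticFetcher` and `fetchBody` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
)

// Local copies of the documented types, so this sketch is self-contained.
type Response struct {
	StatusCode   int
	Body         []byte
	ETag         string
	LastModified string
	URL          string
	ContentType  string
}

type HTTPFetcher interface {
	Fetch(ctx context.Context, url string) (*Response, error)
	FetchConditional(ctx context.Context, url, etag, lastModified string) (*Response, error)
}

// staticFetcher serves pages from a map instead of the network.
type staticFetcher struct {
	pages map[string]string
}

// Compile-time check that staticFetcher satisfies the interface.
var _ HTTPFetcher = (*staticFetcher)(nil)

func (s *staticFetcher) Fetch(ctx context.Context, url string) (*Response, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	body, ok := s.pages[url]
	if !ok {
		return nil, nil // mirror the documented 404 behavior
	}
	return &Response{StatusCode: 200, Body: []byte(body), URL: url, ContentType: "text/markdown"}, nil
}

// FetchConditional ignores the validators and always does a full fetch.
func (s *staticFetcher) FetchConditional(ctx context.Context, url, etag, lastModified string) (*Response, error) {
	return s.Fetch(ctx, url)
}

// fetchBody depends only on the interface, so any implementation works.
func fetchBody(f HTTPFetcher, url string) (body string, found bool, err error) {
	resp, err := f.Fetch(context.Background(), url)
	if err != nil || resp == nil {
		return "", false, err
	}
	return string(resp.Body), true, nil
}

func main() {
	f := &staticFetcher{pages: map[string]string{"https://docs.example.com/llms.txt": "# Docs\n"}}
	body, found, err := fetchBody(f, "https://docs.example.com/llms.txt")
	fmt.Printf("%q %v %v\n", body, found, err)
	_, found, _ = fetchBody(f, "https://docs.example.com/missing")
	fmt.Println(found)
}
```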

type Options ¶

type Options struct {
	UserAgent    string
	RatePerHost  int
	BurstPerHost int
	Timeout      time.Duration
}

Options configures the Fetcher.

type Response ¶

type Response struct {
	StatusCode   int
	Body         []byte
	ETag         string
	LastModified string
	URL          string
	ContentType  string
}

Response holds the result of a fetch.
