Documentation ¶
Index ¶
- Constants
- Variables
- func CleanHTML(html string, contentSelectors, removeSelectors []string) (string, error)
- func CleanMDX(content string) string
- func ConvertHTML(html string, selectors ...[]string) (string, error)
- func IsHTML(contentType string, body []byte) bool
- func IsJSHeavy(html string) bool
- type Fetcher
- type HTTPFetcher
- type Options
- type Response
Constants ¶
const MinConvertedLines = 10
MinConvertedLines is the minimum number of non-blank lines for converted content to be considered meaningful (vs. a nav-heavy marketing page).
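A minimal sketch of how such a threshold could be applied, assuming non-blank lines are counted after trimming whitespace and that meeting the minimum exactly counts as meaningful. The names nonBlankLines and isMeaningful are hypothetical and are not part of this package.

```go
package main

import (
	"fmt"
	"strings"
)

// minConvertedLines mirrors the documented MinConvertedLines constant.
const minConvertedLines = 10

// nonBlankLines counts lines that contain something other than whitespace.
func nonBlankLines(markdown string) int {
	n := 0
	for _, line := range strings.Split(markdown, "\n") {
		if strings.TrimSpace(line) != "" {
			n++
		}
	}
	return n
}

// isMeaningful is a hypothetical helper; the package may apply the
// threshold differently.
func isMeaningful(markdown string) bool {
	return nonBlankLines(markdown) >= minConvertedLines
}

func main() {
	// A nav-heavy page often converts to a handful of links and little else.
	navOnly := "# Acme\n\n[Docs](/docs)\n\n[Pricing](/pricing)\n"
	fmt.Println(nonBlankLines(navOnly), isMeaningful(navOnly)) // 3 false
}
```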
Variables ¶
var DefaultContentSelectors = []string{
"article",
".markdown",
".md-content",
".theme-doc-markdown",
".document",
"[role='main']",
"main",
".content",
"#content",
".main-content",
}
DefaultContentSelectors are tried in order to find the main content element; the first match wins. They cover the major documentation platforms.
var DefaultRemoveSelectors = []string{
"nav",
".navbar",
".sidebar",
".md-sidebar",
".md-header",
".md-footer",
".md-source-file",
".md-content__button",
".md-edit",
".headerlink",
".docSidebarContainer",
".pagination-nav",
".theme-doc-footer",
".sphinxsidebarwrapper",
".advertisement",
"footer",
"script",
"style",
}
DefaultRemoveSelectors lists the elements that are stripped because they add noise to converted markdown.
Functions ¶
func CleanHTML ¶
CleanHTML extracts the main content from an HTML page using CSS selectors, removes navigation and other page chrome, and returns the cleaned HTML ready for markdown conversion. Converting only the extracted content yields much cleaner markdown than converting the full page.
If contentSelectors is nil, DefaultContentSelectors is used. If removeSelectors is nil, DefaultRemoveSelectors is used.
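The nil fallback and the first-match-wins order can be sketched without an HTML parser. In the sketch below, pickSelector and the matches callback are hypothetical: matches stands in for whatever CSS query the real implementation runs against the parsed page, and the default list is abbreviated.

```go
package main

import "fmt"

// Abbreviated stand-in for DefaultContentSelectors.
var defaultContentSelectors = []string{"article", "main", ".content"}

// pickSelector returns the first selector, in order, that matches the page.
// A nil list falls back to the defaults, as documented for CleanHTML.
func pickSelector(selectors []string, matches func(string) bool) (string, bool) {
	if selectors == nil {
		selectors = defaultContentSelectors
	}
	for _, s := range selectors {
		if matches(s) {
			return s, true
		}
	}
	return "", false
}

func main() {
	// Pretend the page contains both a <main> element and a .content element.
	present := map[string]bool{"main": true, ".content": true}
	matches := func(s string) bool { return present[s] }

	sel, ok := pickSelector(nil, matches)
	fmt.Println(sel, ok) // main true

	sel, ok = pickSelector([]string{".content", "main"}, matches)
	fmt.Println(sel, ok) // .content true
}
```

Because order decides the winner, a caller-supplied list should put the most specific selector first.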
func CleanMDX ¶
CleanMDX strips MDX/JSX component tags and export statements from markdown content. These are framework artifacts (Mintlify, Nextra, etc.) that add noise without contributing documentation value.
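As a rough illustration of this kind of cleanup, the sketch below uses two regular expressions. It is not the package's implementation: stripMDX and both patterns are assumptions, it only handles single-line export statements, and it treats any tag whose name starts with an uppercase letter as a component.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// A whole line that starts with "export", including its newline.
	exportLine = regexp.MustCompile(`(?m)^export[ \t].*\n?`)
	// Opening, closing, or self-closing tags whose name starts with an
	// uppercase letter, which is how JSX tells components from HTML elements.
	componentTag = regexp.MustCompile(`</?[A-Z][A-Za-z0-9.]*(?:\s[^>]*)?/?>`)
)

// stripMDX removes export lines and component tags but keeps the text
// between the tags. Attributes containing ">" are not handled.
func stripMDX(content string) string {
	content = exportLine.ReplaceAllString(content, "")
	content = componentTag.ReplaceAllString(content, "")
	return strings.TrimSpace(content)
}

func main() {
	in := "export const meta = { title: \"Intro\" }\n\n# Intro\n\n<Note>Read this first.</Note>\n\nInstall the CLI.\n"
	fmt.Println(stripMDX(in))
}
```

Lowercase tags such as details are left alone, since they may be legitimate inline HTML in markdown.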
func ConvertHTML ¶
ConvertHTML converts HTML content to markdown. It first cleans the HTML by extracting the main content area (stripping nav, footer, sidebar), then converts to markdown with whitespace normalization.
contentSelectors and removeSelectors are optional and are supplied through the variadic selectors parameter; pass nil for defaults. ConvertHTML is used when a URL that should serve text/markdown returns HTML instead.
func IsHTML ¶
IsHTML returns true if the response looks like an HTML page rather than text/markdown content. It checks the Content-Type header and also sniffs the body.
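A minimal sketch of combining the two signals, assuming either one is enough to classify the response as HTML. The function looksLikeHTML, the 512-byte sniff window, and the prefixes it checks are illustrative choices, not the package's actual rules.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// looksLikeHTML reports whether the header or the start of the body
// indicates an HTML document.
func looksLikeHTML(contentType string, body []byte) bool {
	ct := strings.ToLower(contentType)
	if strings.Contains(ct, "text/html") || strings.Contains(ct, "application/xhtml+xml") {
		return true
	}
	// Sniff only the start of the body: markdown may contain inline HTML
	// further down, but it does not open with a doctype or <html> tag.
	head := body
	if len(head) > 512 {
		head = head[:512]
	}
	head = bytes.ToLower(bytes.TrimSpace(head))
	return bytes.HasPrefix(head, []byte("<!doctype html")) ||
		bytes.HasPrefix(head, []byte("<html"))
}

func main() {
	fmt.Println(looksLikeHTML("text/markdown", []byte("<!DOCTYPE html><html></html>")))
	fmt.Println(looksLikeHTML("text/plain", []byte("# Title\n\nSome <b>bold</b> text")))
}
```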
func IsJSHeavy ¶
IsJSHeavy detects whether an HTML page is a JavaScript SPA shell that would need browser rendering to extract meaningful content.
It uses a two-tier strategy:
- Short-circuit for obvious SPA shells (mount div + module hint + tiny body text).
- Score indicators: return true if at least 2 match (conservative).
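The two tiers can be sketched as follows. Everything specific here is an assumption made for illustration: the mount-point ids, the 200-character text threshold, the tier-two indicators, and the function names are hypothetical and may not match what IsJSHeavy actually checks.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	scriptOrStyle = regexp.MustCompile(`(?is)<(?:script|style)\b.*?</(?:script|style)>`)
	anyTag        = regexp.MustCompile(`<[^>]+>`)
)

// visibleTextLen approximates how much text is readable without JavaScript.
func visibleTextLen(html string) int {
	text := scriptOrStyle.ReplaceAllString(html, " ")
	text = anyTag.ReplaceAllString(text, " ")
	return len(strings.Join(strings.Fields(text), " "))
}

func looksJSHeavy(html string) bool {
	lower := strings.ToLower(html)
	tiny := visibleTextLen(html) < 200 // hypothetical threshold

	// Tier 1: an obvious shell has a mount div, a module script, and
	// almost no text of its own.
	mount := strings.Contains(lower, `id="root"`) ||
		strings.Contains(lower, `id="app"`) ||
		strings.Contains(lower, `id="__next"`)
	module := strings.Contains(lower, `type="module"`)
	if mount && module && tiny {
		return true
	}

	// Tier 2: count weaker indicators and require at least two, so a
	// single coincidental hit does not flag an ordinary page.
	indicators := []bool{
		tiny,
		strings.Contains(lower, "<noscript"),
		strings.Count(lower, "<script") >= 5,
		strings.Contains(lower, "data-reactroot") || strings.Contains(lower, "ng-app"),
	}
	score := 0
	for _, hit := range indicators {
		if hit {
			score++
		}
	}
	return score >= 2
}

func main() {
	shell := `<html><head><script type="module" src="/assets/index.js"></script></head><body><div id="root"></div></body></html>`
	short := `<html><body><p>Moved.</p></body></html>`
	fmt.Println(looksJSHeavy(shell), looksJSHeavy(short))
}
```

Requiring two hits in the second tier is what keeps the check conservative: a short but static page trips only the text-length indicator and is not flagged.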
Types ¶
type Fetcher ¶
type Fetcher struct {
// contains filtered or unexported fields
}
Fetcher handles all HTTP requests with per-domain rate limiting.
func (*Fetcher) Fetch ¶
Fetch retrieves the content at the given URL. It returns a nil Response (not an error) for 404s, so callers can treat missing content as expected.
func (*Fetcher) FetchConditional ¶
func (f *Fetcher) FetchConditional(ctx context.Context, url, etag, lastModified string) (*Response, error)
FetchConditional retrieves content only if it has changed, using ETag and Last-Modified headers for cache validation. It returns nil, nil on 304 Not Modified, following the same pattern Fetch uses for 404s.
type HTTPFetcher ¶
type HTTPFetcher interface {
Fetch(ctx context.Context, url string) (*Response, error)
FetchConditional(ctx context.Context, url, etag, lastModified string) (*Response, error)
}
HTTPFetcher is the interface for HTTP content retrieval. The default implementation provides per-domain rate limiting and conditional requests. Alternative implementations can add browser rendering (Playwright), authentication, or custom transport logic.