Documentation
Overview ¶
Package html provides an HTML parser plugin for transforming HTML content into clean Markdown suitable for LLM consumption and RAG applications.
The parser handles:
- Boilerplate removal (nav, footer, scripts, ads)
- Link normalization (relative URLs to absolute URLs)
- Metadata extraction (author, date, title from Open Graph, Schema.org, Dublin Core)
- HTML to Markdown conversion (preserving structure)
This is particularly useful for RSS feeds and other web content, where HTML must be cleaned and structured before it is embedded in a vector database.
Index ¶
- type HTMLMetadata
- type HTMLParser
- func NewHTMLParser(opts ...Option) *HTMLParser
- func (p *HTMLParser) CanHandle(path string, info fs.FileInfo) bool
- func (p *HTMLParser) Chunk(content string, path string, opts *schema.CodeChunkingOptions) ([]schema.CodeChunk, error)
- func (p *HTMLParser) Extensions() []string
- func (p *HTMLParser) ExtractMetadata(content string, path string) (schema.FileMetadata, error)
- func (p *HTMLParser) ExtractUsedSymbols(content string) []string
- func (p *HTMLParser) IsGenerated(content string, path string) bool
- func (p *HTMLParser) Name() string
- type Option
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type HTMLMetadata ¶
type HTMLMetadata struct {
	Author        string    `json:"author"`
	PublishedDate time.Time `json:"published_date"`
	Title         string    `json:"title"`
	Description   string    `json:"description"`
	CanonicalURL  string    `json:"canonical_url"`
	Keywords      []string  `json:"keywords"`
}
HTMLMetadata represents metadata extracted from HTML documents. It supports multiple metadata formats including Open Graph, Schema.org, and Dublin Core.
type HTMLParser ¶
type HTMLParser struct {
// contains filtered or unexported fields
}
HTMLParser implements the ParserPlugin interface for HTML content. It transforms HTML into clean Markdown while preserving semantic structure and extracting metadata for governance annotations.
func NewHTMLParser ¶
func NewHTMLParser(opts ...Option) *HTMLParser
NewHTMLParser creates a new HTML parser with the given options.
Example:
	parser := html.NewHTMLParser(
		html.WithBaseURL("https://example.com"),
		html.WithBoilerplateRemoval(true),
		html.WithMarkdownConversion(true),
	)
func (*HTMLParser) CanHandle ¶
func (p *HTMLParser) CanHandle(path string, info fs.FileInfo) bool
CanHandle determines if this parser can handle the given file.
func (*HTMLParser) Chunk ¶
func (p *HTMLParser) Chunk(content string, path string, opts *schema.CodeChunkingOptions) ([]schema.CodeChunk, error)
Chunk parses HTML content and returns code chunks with metadata. It performs the following steps in order:
- Parses HTML using goquery
- Extracts metadata (author, date, title); this step must run before boilerplate removal
- Removes boilerplate (nav, footer, scripts)
- Normalizes links (relative → absolute)
- Converts to Markdown
- Returns chunks with governance annotations
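The ordering constraint in the steps above can be illustrated with a small sketch. One plausible reason for it (an assumption, not stated by the package) is that Schema.org metadata is commonly embedded as JSON-LD inside a `<script>` element, which script removal would delete. The sketch uses naive regular expressions as a stand-in for the goquery-based implementation; `extractAuthor` and `stripBoilerplate` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// ldAuthor is a deliberately naive pattern for an "author" string
	// inside a JSON-LD script element. A real parser would decode the JSON.
	ldAuthor = regexp.MustCompile(`(?is)<script[^>]*application/ld\+json[^>]*>.*?"author"\s*:\s*"([^"]*)"`)

	// boilerplate matches whole script, nav, and footer elements.
	boilerplate = regexp.MustCompile(`(?is)<script\b.*?</script>|<nav\b.*?</nav>|<footer\b.*?</footer>`)
)

// extractAuthor returns the JSON-LD author, or "" if none is found.
func extractAuthor(doc string) string {
	if m := ldAuthor.FindStringSubmatch(doc); m != nil {
		return m[1]
	}
	return ""
}

// stripBoilerplate removes script, nav, and footer elements.
func stripBoilerplate(doc string) string {
	return boilerplate.ReplaceAllString(doc, "")
}

func main() {
	doc := `<nav>Home</nav><script type="application/ld+json">{"author": "Jane Doe"}</script><p>Body</p><footer>(c)</footer>`

	// Correct order: read metadata first, then clean the document.
	fmt.Printf("author before removal: %q\n", extractAuthor(doc))

	cleaned := stripBoilerplate(doc)
	fmt.Printf("cleaned document:      %q\n", cleaned)

	// Wrong order: the metadata is gone once scripts are stripped.
	fmt.Printf("author after removal:  %q\n", extractAuthor(cleaned))
}
```

Running the extraction on the cleaned document finds no author, which is the failure mode the step ordering avoids.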
func (*HTMLParser) Extensions ¶
func (p *HTMLParser) Extensions() []string
Extensions returns the file extensions this parser handles.
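A parser plugin's extension list typically drives its CanHandle check. The sketch below shows one way the two could fit together. The extension list (`.html`, `.htm`) and the case-insensitive match are assumptions, since the documentation does not say which extensions the parser reports, and the `fs.FileInfo` argument is left out for brevity.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extensions returns an assumed extension list. The real values
// returned by HTMLParser.Extensions are not documented here.
func extensions() []string {
	return []string{".html", ".htm"}
}

// canHandle reports whether path ends in one of the assumed
// extensions, ignoring case.
func canHandle(path string) bool {
	ext := strings.ToLower(filepath.Ext(path))
	for _, e := range extensions() {
		if ext == e {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"index.html", "Page.HTM", "notes.md"} {
		fmt.Println(p, canHandle(p))
	}
}
```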
func (*HTMLParser) ExtractMetadata ¶
func (p *HTMLParser) ExtractMetadata(content string, path string) (schema.FileMetadata, error)
ExtractMetadata extracts file-level metadata from HTML content. Returns metadata about the HTML document including author, title, and other governance info.
func (*HTMLParser) ExtractUsedSymbols ¶
func (p *HTMLParser) ExtractUsedSymbols(content string) []string
ExtractUsedSymbols returns nil as HTML doesn't have symbol references.
func (*HTMLParser) IsGenerated ¶
func (p *HTMLParser) IsGenerated(content string, path string) bool
IsGenerated returns false as HTML files are typically not auto-generated code.
func (*HTMLParser) Name ¶
func (p *HTMLParser) Name() string
Name returns the parser name identifier.
type Option ¶
type Option func(*HTMLParser)
Option configures the HTMLParser.
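Option follows Go's functional options pattern: each option is a function that mutates the parser while it is being constructed. The following standalone sketch shows how such options are usually applied. The `parser` type, its fields, and the lowercase function names are hypothetical stand-ins, because the fields of HTMLParser are unexported; only the default of `true` for boilerplate removal is taken from the documentation.

```go
package main

import "fmt"

// parser stands in for HTMLParser. Its fields are hypothetical.
type parser struct {
	baseURL           string
	removeBoilerplate bool
}

// option mirrors the shape of the package's Option type.
type option func(*parser)

// withBaseURL returns an option that sets the base URL.
func withBaseURL(u string) option {
	return func(p *parser) { p.baseURL = u }
}

// withBoilerplateRemoval returns an option that toggles boilerplate removal.
func withBoilerplateRemoval(enabled bool) option {
	return func(p *parser) { p.removeBoilerplate = enabled }
}

// newParser starts from defaults, then applies each option in order,
// so later options override earlier ones.
func newParser(opts ...option) *parser {
	p := &parser{removeBoilerplate: true}
	for _, opt := range opts {
		opt(p)
	}
	return p
}

func main() {
	p := newParser(
		withBaseURL("https://example.com"),
		withBoilerplateRemoval(false),
	)
	fmt.Printf("%+v\n", *p)
}
```

The pattern lets callers set only the options they care about and lets new options be added without changing the constructor's signature.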
func WithBaseURL ¶
WithBaseURL sets the base URL for resolving relative links. This enables conversion of relative URLs to absolute URLs.
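Resolving a relative link against a base URL can be done with the standard library's net/url package. The sketch below shows the general technique. Whether the parser uses this exact approach internally is an assumption, and `resolveLink` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink resolves href against base using RFC 3986 reference
// resolution. It is a hypothetical helper, not part of the package API.
func resolveLink(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base URL: %w", err)
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", fmt.Errorf("parse link: %w", err)
	}
	return b.ResolveReference(ref).String(), nil
}

func main() {
	base := "https://example.com/blog/"
	for _, href := range []string{"/about", "post.html", "https://other.example/x"} {
		abs, err := resolveLink(base, href)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(href, "->", abs)
	}
}
```

Links that are already absolute pass through unchanged, so the same helper can be applied to every link in a document.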
func WithBoilerplateRemoval ¶
WithBoilerplateRemoval enables or disables removal of non-content elements. When true (default), removes nav, footer, scripts, ads, etc.
func WithMarkdownConversion ¶
WithMarkdownConversion enables conversion of HTML to Markdown. When true (default), preserves semantic structure as Markdown.
func WithMetadataExtraction ¶
WithMetadataExtraction enables extraction of author, date, and other metadata. Extracts from Open Graph, Schema.org, Dublin Core, and standard meta tags.
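When several metadata formats are present, an extractor needs a precedence rule. The sketch below shows one possible fallback chain over already-collected meta tag values. The tag names are the conventional Open Graph and Dublin Core names, but the precedence order and the `firstNonEmpty` helper are assumptions; the package does not document its actual order.

```go
package main

import "fmt"

// firstNonEmpty returns the value of the first key in keys that has a
// non-empty entry in tags, or "" if none does.
func firstNonEmpty(tags map[string]string, keys ...string) string {
	for _, k := range keys {
		if v := tags[k]; v != "" {
			return v
		}
	}
	return ""
}

func main() {
	// tags holds meta tag name/property to content pairs, as they
	// might be collected from a document's <head>.
	tags := map[string]string{
		"og:title":   "Example Post",
		"DC.creator": "Jane Doe",
		"keywords":   "go, html",
	}

	// Assumed precedence: Open Graph, then Dublin Core, then plain meta tags.
	title := firstNonEmpty(tags, "og:title", "DC.title", "title")
	author := firstNonEmpty(tags, "article:author", "DC.creator", "author")

	fmt.Println("title:", title)
	fmt.Println("author:", author)
}
```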
func WithStructurePreservation ¶
WithStructurePreservation enables preservation of semantic structure. When true (default), maintains headings, lists, and code blocks as Markdown.
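As a rough illustration of what structure preservation means for headings, the sketch below maps `<h1>` through `<h6>` elements to Markdown ATX headings. It is a naive regular-expression stand-in, not the package's DOM-based conversion, and `headingsToMarkdown` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// heading matches an h1..h6 element and captures its level and inner text.
// The sketch assumes well-formed input and no nested tags in the heading.
var heading = regexp.MustCompile(`(?is)<h([1-6])[^>]*>(.*?)</h[1-6]>`)

// headingsToMarkdown rewrites HTML headings as Markdown ATX headings,
// using one '#' per heading level.
func headingsToMarkdown(s string) string {
	return heading.ReplaceAllStringFunc(s, func(m string) string {
		sub := heading.FindStringSubmatch(m)
		level := int(sub[1][0] - '0')
		return strings.Repeat("#", level) + " " + strings.TrimSpace(sub[2]) + "\n"
	})
}

func main() {
	in := `<h1>Guide</h1><h2 class="sub"> Setup </h2>`
	fmt.Print(headingsToMarkdown(in))
}
```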