html

package

Version: v0.35.3 (latest). Published: Mar 11, 2026. License: MIT. Imports: 8. Imported by: 0.

Repository

github.com/sevigo/goframe

Documentation ¶

Overview ¶

Package html provides an HTML parser plugin for transforming HTML content into clean Markdown suitable for LLM consumption and RAG applications.

The parser handles:

Boilerplate removal (nav, footer, scripts, ads)
Link normalization (relative URLs to absolute URLs)
Metadata extraction (author, date, title from Open Graph, Schema.org, Dublin Core)
HTML to Markdown conversion (preserving structure)

This is particularly useful for RSS feeds and web content where HTML content needs to be cleaned and structured before being embedded in vector databases.

Index ¶

type HTMLMetadata
type HTMLParser
- func NewHTMLParser(opts ...Option) *HTMLParser
type Option

Types ¶

type HTMLMetadata ¶

type HTMLMetadata struct {
	Author        string    `json:"author"`
	PublishedDate time.Time `json:"published_date"`
	Title         string    `json:"title"`
	Description   string    `json:"description"`
	CanonicalURL  string    `json:"canonical_url"`
	Keywords      []string  `json:"keywords"`
}

HTMLMetadata represents metadata extracted from HTML documents. It supports multiple metadata formats including Open Graph, Schema.org, and Dublin Core.

type HTMLParser ¶

type HTMLParser struct {
	// contains filtered or unexported fields
}

HTMLParser implements the ParserPlugin interface for HTML content. It transforms HTML into clean Markdown while preserving semantic structure and extracting metadata for governance annotations.

func NewHTMLParser ¶

func NewHTMLParser(opts ...Option) *HTMLParser

NewHTMLParser creates a new HTML parser with the given options.

Example:

parser := html.NewHTMLParser(
    html.WithBaseURL("https://example.com"),
    html.WithBoilerplateRemoval(true),
    html.WithMarkdownConversion(true),
)

func (*HTMLParser) CanHandle ¶

func (p *HTMLParser) CanHandle(path string, info fs.FileInfo) bool

CanHandle determines if this parser can handle the given file.

func (*HTMLParser) Chunk ¶

func (p *HTMLParser) Chunk(content string, path string, opts *schema.CodeChunkingOptions) ([]schema.CodeChunk, error)

Chunk parses HTML content and returns code chunks with metadata. It performs the following steps in order:

1. Parses the HTML using goquery.
2. Extracts metadata (author, date, title). This must happen before boilerplate removal.
3. Removes boilerplate (nav, footer, scripts).
4. Normalizes links (relative → absolute).
5. Converts the result to Markdown.
6. Returns chunks with governance annotations.
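One plausible reason metadata extraction has to precede boilerplate removal is that Schema.org data is often embedded as JSON-LD inside a `<script>` element, which a script-stripping pass would delete. The toy sketch below illustrates that ordering hazard. It uses regular expressions from the standard library rather than goquery, and the helper names, the sample page, and the assumption that the author lives in JSON-LD are all illustrative rather than taken from the package.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Toy patterns for illustration only; real HTML should be handled by a
// proper parser rather than regular expressions.
var (
	jsonLDRe = regexp.MustCompile(`(?is)<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>`)
	scriptRe = regexp.MustCompile(`(?is)<script\b[^>]*>.*?</script>`)
)

// authorFromJSONLD (hypothetical) returns the author name found in the
// first JSON-LD block, or "" if there is none or it cannot be decoded.
func authorFromJSONLD(html string) string {
	m := jsonLDRe.FindStringSubmatch(html)
	if m == nil {
		return ""
	}
	var doc struct {
		Author struct {
			Name string `json:"name"`
		} `json:"author"`
	}
	if err := json.Unmarshal([]byte(m[1]), &doc); err != nil {
		return ""
	}
	return doc.Author.Name
}

// stripScripts (hypothetical) removes every <script> element, as a
// boilerplate-removal pass might.
func stripScripts(html string) string {
	return scriptRe.ReplaceAllString(html, "")
}

// page is an invented sample document.
const page = `<html><head><script type="application/ld+json">{"@type":"Article","author":{"name":"Jane Doe"}}</script></head><body><p>Hello</p></body></html>`

func main() {
	// Extracting first finds the author; stripping first loses it.
	fmt.Printf("extract, then strip: %q\n", authorFromJSONLD(page))
	fmt.Printf("strip, then extract: %q\n", authorFromJSONLD(stripScripts(page)))
}
```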

func (*HTMLParser) Extensions ¶

func (p *HTMLParser) Extensions() []string

Extensions returns the file extensions this parser handles.
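The documentation does not say which extensions are returned or how CanHandle decides. As a rough sketch of how an extension-based check with this signature could work, the example below assumes a list of `.html` and `.htm` and a case-insensitive match; both the list and the `canHandle` function are assumptions, not the package's implementation.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"strings"
)

// htmlExtensions is an assumed list; the actual values returned by
// Extensions are not documented here.
var htmlExtensions = []string{".html", ".htm"}

// canHandle (hypothetical) reports whether path looks like an HTML file.
// Directories are rejected when file info is available.
func canHandle(path string, info fs.FileInfo) bool {
	if info != nil && info.IsDir() {
		return false
	}
	ext := strings.ToLower(filepath.Ext(path))
	for _, e := range htmlExtensions {
		if ext == e {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"index.html", "PAGE.HTM", "notes.md"} {
		fmt.Println(p, canHandle(p, nil))
	}
}
```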

func (*HTMLParser) ExtractMetadata ¶

func (p *HTMLParser) ExtractMetadata(content string, path string) (schema.FileMetadata, error)

ExtractMetadata extracts file-level metadata from HTML content. Returns metadata about the HTML document including author, title, and other governance info.

func (*HTMLParser) ExtractUsedSymbols ¶

func (p *HTMLParser) ExtractUsedSymbols(content string) []string

ExtractUsedSymbols returns nil as HTML doesn't have symbol references.

func (*HTMLParser) IsGenerated ¶

func (p *HTMLParser) IsGenerated(content string, path string) bool

IsGenerated returns false as HTML files are typically not auto-generated code.

func (*HTMLParser) Name ¶

func (p *HTMLParser) Name() string

Name returns the parser name identifier.

type Option ¶

type Option func(*HTMLParser)

Option configures the HTMLParser.
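`Option` follows Go's functional options pattern: each option is a function that mutates the parser, and the constructor applies them in order over its defaults. Since the fields of HTMLParser are unexported, the sketch below uses a stand-in `parser` type with hypothetical field names; the defaults of `true` are taken from the "When true (default)" wording in the option descriptions.

```go
package main

import "fmt"

// parser is a stand-in for HTMLParser. Its field names are hypothetical,
// because the real struct's fields are unexported.
type parser struct {
	baseURL           string
	removeBoilerplate bool
	convertMarkdown   bool
}

// option mirrors the documented shape: type Option func(*HTMLParser).
type option func(*parser)

func withBaseURL(u string) option {
	return func(p *parser) { p.baseURL = u }
}

func withBoilerplateRemoval(remove bool) option {
	return func(p *parser) { p.removeBoilerplate = remove }
}

func withMarkdownConversion(convert bool) option {
	return func(p *parser) { p.convertMarkdown = convert }
}

// newParser starts from the assumed defaults and applies each option in
// the order given, so later options override earlier ones.
func newParser(opts ...option) *parser {
	p := &parser{removeBoilerplate: true, convertMarkdown: true}
	for _, opt := range opts {
		opt(p)
	}
	return p
}

func main() {
	p := newParser(
		withBaseURL("https://example.com"),
		withBoilerplateRemoval(false),
	)
	fmt.Printf("%+v\n", *p)
}
```

A benefit of this design is that callers only name the settings they want to change, and new options can be added later without breaking existing call sites.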

func WithBaseURL ¶

func WithBaseURL(baseURL string) Option

WithBaseURL sets the base URL for resolving relative links. This enables conversion of relative URLs to absolute URLs.
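Resolving a relative link against a base URL can be done with the standard library's `net/url` package, which implements RFC 3986 reference resolution. The sketch below shows the general technique; `resolveLink` is a hypothetical helper, and whether the package resolves links this way internally is not stated in the documentation.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink (hypothetical) resolves href against base using
// url.ResolveReference. Absolute hrefs are returned unchanged.
func resolveLink(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	base := "https://example.com/blog/post.html"
	hrefs := []string{"/about", "../img/logo.png", "https://other.example/x", "#top"}
	for _, href := range hrefs {
		abs, err := resolveLink(base, href)
		if err != nil {
			fmt.Println(href, "-> error:", err)
			continue
		}
		fmt.Println(href, "->", abs)
	}
}
```

Note that the base URL's path matters: relative paths such as `../img/logo.png` are resolved against the directory of the base document, not against the site root.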

func WithBoilerplateRemoval ¶

func WithBoilerplateRemoval(remove bool) Option

WithBoilerplateRemoval enables or disables the removal of non-content elements. When true (default), the parser removes nav, footer, scripts, ads, etc.

func WithMarkdownConversion ¶

func WithMarkdownConversion(convert bool) Option

WithMarkdownConversion enables conversion of HTML to Markdown. When true (default), preserves semantic structure as Markdown.

func WithMetadataExtraction ¶

func WithMetadataExtraction(extract bool) Option

WithMetadataExtraction enables extraction of author, date, and other metadata. Extracts from Open Graph, Schema.org, Dublin Core, and standard meta tags.

func WithStructurePreservation ¶

func WithStructurePreservation(preserve bool) Option

WithStructurePreservation enables preservation of semantic structure. When true (default), maintains headers, lists, code blocks as Markdown.
