html

package
v0.35.3
Published: Mar 11, 2026 | License: MIT | Imports: 8 | Imported by: 0

Documentation

Overview

Package html provides an HTML parser plugin for transforming HTML content into clean Markdown suitable for LLM consumption and RAG applications.

The parser handles:

  • Boilerplate removal (nav, footer, scripts, ads)
  • Link normalization (relative URLs to absolute URLs)
  • Metadata extraction (author, date, title from Open Graph, Schema.org, Dublin Core)
  • HTML to Markdown conversion (preserving structure)

This is particularly useful for RSS feeds and web content where HTML content needs to be cleaned and structured before being embedded in vector databases.

Types

type HTMLMetadata

type HTMLMetadata struct {
	Author        string    `json:"author"`
	PublishedDate time.Time `json:"published_date"`
	Title         string    `json:"title"`
	Description   string    `json:"description"`
	CanonicalURL  string    `json:"canonical_url"`
	Keywords      []string  `json:"keywords"`
}

HTMLMetadata represents metadata extracted from HTML documents. It supports multiple metadata formats including Open Graph, Schema.org, and Dublin Core.

type HTMLParser

type HTMLParser struct {
	// contains filtered or unexported fields
}

HTMLParser implements the ParserPlugin interface for HTML content. It transforms HTML into clean Markdown while preserving semantic structure and extracting metadata for governance annotations.

func NewHTMLParser

func NewHTMLParser(opts ...Option) *HTMLParser

NewHTMLParser creates a new HTML parser with the given options.

Example:

parser := html.NewHTMLParser(
    html.WithBaseURL("https://example.com"),
    html.WithBoilerplateRemoval(true),
    html.WithMarkdownConversion(true),
)

func (*HTMLParser) CanHandle

func (p *HTMLParser) CanHandle(path string, info fs.FileInfo) bool

CanHandle determines if this parser can handle the given file.

func (*HTMLParser) Chunk

func (p *HTMLParser) Chunk(content string, path string, opts *schema.CodeChunkingOptions) ([]schema.CodeChunk, error)

Chunk parses HTML content and returns code chunks with metadata. This method:

  1. Parses HTML using goquery
  2. Extracts metadata (author, date, title). This MUST happen before boilerplate removal, which may strip the elements that carry it
  3. Removes boilerplate (nav, footer, scripts)
  4. Normalizes links (relative → absolute)
  5. Converts to Markdown
  6. Returns chunks with governance annotations
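
The ordering constraint in step 2 can be seen in a small sketch. It assumes the author is carried in a Schema.org JSON-LD script block, which script removal would delete. The package itself parses with goquery; this sketch uses simplified regular expressions only to stay dependency-free, and extractAuthor and removeScripts are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// scriptRe matches whole <script>...</script> blocks, JSON-LD included.
	scriptRe = regexp.MustCompile(`(?is)<script\b[^>]*>.*?</script>`)
	// authorRe matches a flat "author": "..." pair; real JSON-LD can nest objects.
	authorRe = regexp.MustCompile(`"author"\s*:\s*"([^"]*)"`)
)

// extractAuthor returns the first author string found, or "" if none.
func extractAuthor(doc string) string {
	m := authorRe.FindStringSubmatch(doc)
	if m == nil {
		return ""
	}
	return m[1]
}

// removeScripts strips every script block, as boilerplate removal might.
func removeScripts(doc string) string {
	return scriptRe.ReplaceAllString(doc, "")
}

func main() {
	doc := `<script type="application/ld+json">{"author": "Ada"}</script><p>Hi</p>`

	// Correct order: read metadata first, then clean the document.
	author := extractAuthor(doc)
	cleaned := removeScripts(doc)
	fmt.Printf("author=%q cleaned=%q\n", author, cleaned)

	// Wrong order: the metadata is gone once scripts are removed.
	fmt.Printf("author after cleaning=%q\n", extractAuthor(cleaned))
}
```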

func (*HTMLParser) Extensions

func (p *HTMLParser) Extensions() []string

Extensions returns the file extensions this parser handles.
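
One plausible way a parser could combine an extension list with a path check is sketched below. The extension list (.html, .htm) and the helper hasHTMLExtension are assumptions; the documentation does not state what Extensions returns or how CanHandle decides.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// htmlExtensions is an assumed list, not the package's documented value.
var htmlExtensions = []string{".html", ".htm"}

// hasHTMLExtension (hypothetical) reports whether path ends in one of the
// assumed extensions, ignoring case.
func hasHTMLExtension(path string) bool {
	ext := strings.ToLower(filepath.Ext(path))
	for _, e := range htmlExtensions {
		if ext == e {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"docs/Index.HTML", "page.htm", "main.go"} {
		fmt.Println(p, hasHTMLExtension(p))
	}
}
```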

func (*HTMLParser) ExtractMetadata

func (p *HTMLParser) ExtractMetadata(content string, path string) (schema.FileMetadata, error)

ExtractMetadata extracts file-level metadata from HTML content. It returns metadata about the HTML document, including author, title, and other governance information.

func (*HTMLParser) ExtractUsedSymbols

func (p *HTMLParser) ExtractUsedSymbols(content string) []string

ExtractUsedSymbols returns nil as HTML doesn't have symbol references.

func (*HTMLParser) IsGenerated

func (p *HTMLParser) IsGenerated(content string, path string) bool

IsGenerated returns false as HTML files are typically not auto-generated code.

func (*HTMLParser) Name

func (p *HTMLParser) Name() string

Name returns the parser name identifier.

type Option

type Option func(*HTMLParser)

Option configures the HTMLParser.
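
Option follows the functional options pattern. The sketch below shows the general shape of that pattern with a hypothetical stand-in type; HTMLParser's real fields are unexported, so the names parser, newParser, and the lowercase option helpers are assumptions for illustration only.

```go
package main

import "fmt"

// parser is a hypothetical stand-in for HTMLParser.
type parser struct {
	baseURL           string
	removeBoilerplate bool
	convertMarkdown   bool
}

// option mirrors the documented shape: a function that mutates the parser.
type option func(*parser)

// withBaseURL sets the base URL used to resolve relative links.
func withBaseURL(u string) option {
	return func(p *parser) { p.baseURL = u }
}

// withBoilerplateRemoval turns boilerplate removal on or off.
func withBoilerplateRemoval(remove bool) option {
	return func(p *parser) { p.removeBoilerplate = remove }
}

// newParser starts from defaults (true, as the option docs describe) and
// then applies each option in the order given.
func newParser(opts ...option) *parser {
	p := &parser{removeBoilerplate: true, convertMarkdown: true}
	for _, opt := range opts {
		opt(p)
	}
	return p
}

func main() {
	p := newParser(
		withBaseURL("https://example.com"),
		withBoilerplateRemoval(false),
	)
	fmt.Printf("%+v\n", *p)
}
```

Applying options in order means a later option overrides an earlier one that touches the same field, and callers who pass nothing get the defaults.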

func WithBaseURL

func WithBaseURL(baseURL string) Option

WithBaseURL sets the base URL for resolving relative links. This enables conversion of relative URLs to absolute URLs.
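
Relative-to-absolute resolution of this kind can be done with the standard library's net/url package. The sketch below is one way to do it, not necessarily how this package does; resolveLink is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink (hypothetical) resolves href against base following RFC 3986.
func resolveLink(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	// ResolveReference leaves absolute hrefs unchanged and resolves
	// relative ones against the base URL.
	return b.ResolveReference(r).String(), nil
}

func main() {
	base := "https://example.com/blog/post.html"
	for _, href := range []string{"images/a.png", "../about", "/contact", "https://other.org/x"} {
		abs, err := resolveLink(base, href)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(href, "->", abs)
	}
}
```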

func WithBoilerplateRemoval

func WithBoilerplateRemoval(remove bool) Option

WithBoilerplateRemoval enables or disables removal of non-content elements. When true (default), the parser removes nav, footer, scripts, ads, etc.

func WithMarkdownConversion

func WithMarkdownConversion(convert bool) Option

WithMarkdownConversion enables or disables conversion of HTML to Markdown. When true (default), the parser preserves semantic structure as Markdown.

func WithMetadataExtraction

func WithMetadataExtraction(extract bool) Option

WithMetadataExtraction enables or disables extraction of author, date, and other metadata. Metadata is extracted from Open Graph, Schema.org, Dublin Core, and standard meta tags.
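
When several metadata formats describe the same field, some precedence has to be chosen. The sketch below shows a simple first-match fallback over already-collected meta tag values. The key names and their order are assumptions; the documentation does not say which source wins in this package.

```go
package main

import "fmt"

// firstNonEmpty (hypothetical) returns the value of the first key, in the
// order given, that is present in tags with a non-empty value.
func firstNonEmpty(tags map[string]string, keys ...string) string {
	for _, k := range keys {
		if v := tags[k]; v != "" {
			return v
		}
	}
	return ""
}

func main() {
	// Values as they might be collected from <meta> tags; illustrative only.
	tags := map[string]string{
		"dc.creator": "Jane Doe",
		"author":     "jdoe",
		"og:title":   "Example Post",
	}

	// Assumed precedence: Open Graph, then Dublin Core, then standard meta.
	author := firstNonEmpty(tags, "article:author", "dc.creator", "author")
	title := firstNonEmpty(tags, "og:title", "dc.title", "title")
	fmt.Println(author, "|", title)
}
```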

func WithStructurePreservation

func WithStructurePreservation(preserve bool) Option

WithStructurePreservation enables or disables preservation of semantic structure. When true (default), the parser maintains headers, lists, and code blocks as Markdown.
