Documentation ¶
Overview ¶
Package html provides a Normaliser implementation for HTML documents. It extracts readable text from HTML by stripping tags, scripts, and styles and by decoding entities, producing clean, searchable content.
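The extraction described above can be illustrated with a minimal, standard-library-only sketch. It is not the package's implementation: `extractText` and the regular expressions are hypothetical, and a regexp-based approach is only an approximation of what a real HTML parser would do.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var (
	// Script and style elements are dropped together with their contents.
	scriptRe = regexp.MustCompile(`(?is)<script\b[^>]*>.*?</script\s*>`)
	styleRe  = regexp.MustCompile(`(?is)<style\b[^>]*>.*?</style\s*>`)
	// Comments and any remaining tags are replaced by a space.
	commentRe = regexp.MustCompile(`(?s)<!--.*?-->`)
	tagRe     = regexp.MustCompile(`(?s)<[^>]*>`)
)

// extractText is a hypothetical helper (an assumption for illustration,
// not part of package html). It strips scripts, styles, comments, and tags,
// decodes entities, and collapses runs of whitespace.
func extractText(src string) string {
	s := scriptRe.ReplaceAllString(src, " ")
	s = styleRe.ReplaceAllString(s, " ")
	s = commentRe.ReplaceAllString(s, " ")
	s = tagRe.ReplaceAllString(s, " ")
	// Decode entities only after tags are gone, so that text such as
	// "&lt;b&gt;" survives as the literal characters "<b>".
	s = html.UnescapeString(s)
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	page := `<html><head><style>p{color:red}</style></head>
<body><h1>Fish &amp; Chips</h1><script>alert("x")</script><p>Served &quot;hot&quot;.</p></body></html>`
	fmt.Println(extractText(page)) // prints: Fish & Chips Served "hot".
}
```

Decoding entities as the last step before whitespace collapsing keeps escaped markup from being mistaken for real tags.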
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Normaliser ¶
type Normaliser struct{}
Normaliser handles HTML documents.
func (*Normaliser) Normalise ¶
func (n *Normaliser) Normalise(_ context.Context, raw *domain.RawDocument) (*driven.NormaliseResult, error)
Normalise converts an HTML document to a normalised document. The Content field contains the text with HTML tags stripped. Chunking is handled by the PostProcessor pipeline.
func (*Normaliser) Priority ¶
func (n *Normaliser) Priority() int
Priority returns the selection priority.
func (*Normaliser) SupportedConnectorTypes ¶
func (n *Normaliser) SupportedConnectorTypes() []string
SupportedConnectorTypes returns connector types for specialised handling.
func (*Normaliser) SupportedMIMETypes ¶
func (n *Normaliser) SupportedMIMETypes() []string
SupportedMIMETypes returns the MIME types this normaliser handles.
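As a rough illustration of how a caller might combine SupportedMIMETypes and Priority to choose a normaliser, consider the sketch below. Everything in it is an assumption: the `normaliser` interface, the `pick` function, the concrete types, the MIME type lists, the priority values, and the rule that the higher priority wins are all hypothetical and are not taken from this package.

```go
package main

import "fmt"

// normaliser is a hypothetical interface covering two of the selector
// methods documented above.
type normaliser interface {
	SupportedMIMETypes() []string
	Priority() int
}

// htmlNormaliser stands in for an HTML-specific normaliser.
// Its MIME types and priority are assumed values.
type htmlNormaliser struct{}

func (htmlNormaliser) SupportedMIMETypes() []string {
	return []string{"text/html", "application/xhtml+xml"}
}
func (htmlNormaliser) Priority() int { return 50 }

// fallbackNormaliser stands in for a generic text normaliser.
type fallbackNormaliser struct{}

func (fallbackNormaliser) SupportedMIMETypes() []string {
	return []string{"text/html", "text/plain"}
}
func (fallbackNormaliser) Priority() int { return 10 }

// pick returns the candidate that supports mime and has the highest
// priority, or nil if no candidate supports it. "Higher wins" is an
// assumption made for this sketch.
func pick(mime string, candidates []normaliser) normaliser {
	var best normaliser
	for _, c := range candidates {
		for _, m := range c.SupportedMIMETypes() {
			if m == mime && (best == nil || c.Priority() > best.Priority()) {
				best = c
			}
		}
	}
	return best
}

func main() {
	all := []normaliser{fallbackNormaliser{}, htmlNormaliser{}}
	fmt.Printf("%T\n", pick("text/html", all))
	fmt.Printf("%T\n", pick("text/plain", all))
	fmt.Println(pick("image/png", all) == nil)
}
```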