html

package

v0.2.3 Published: Dec 17, 2025 License: Apache-2.0 Imports: 9 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Repository

github.com/custodia-labs/sercha-cli

Documentation

Overview

Package html provides a Normaliser implementation for HTML documents. It extracts readable text from HTML by stripping tags, scripts, and styles and by decoding entities, producing clean, searchable content.
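The extraction described above can be illustrated with a minimal, standard-library-only sketch. It is not the package's implementation: `extractText` and `asciiLower` are hypothetical helpers, and the naive tag scanner assumes reasonably well-formed markup (a stray `<` in text or a `>` inside an attribute value would confuse it). A production normaliser would more likely rely on a real HTML parser.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// asciiLower lowercases ASCII letters byte by byte, so byte offsets in the
// result line up exactly with offsets in the input.
func asciiLower(s string) string {
	b := []byte(s)
	for i, c := range b {
		if 'A' <= c && c <= 'Z' {
			b[i] = c + 'a' - 'A'
		}
	}
	return string(b)
}

// extractText is a hypothetical helper (not part of the documented package).
// It drops tags, skips the bodies of <script> and <style> elements, decodes
// entities, and collapses whitespace runs into single spaces.
func extractText(src string) string {
	lower := asciiLower(src)
	var b strings.Builder
	for i := 0; i < len(src); {
		if src[i] != '<' {
			b.WriteByte(src[i])
			i++
			continue
		}
		end := strings.IndexByte(src[i:], '>')
		if end < 0 {
			break // unterminated tag: discard the remainder
		}
		next := i + end + 1 // first byte after the tag
		for _, name := range []string{"script", "style"} {
			if !strings.HasPrefix(lower[i+1:], name) {
				continue
			}
			// Skip ahead to the end of the matching closing tag.
			closeAt := strings.Index(lower[next:], "</"+name)
			if closeAt < 0 {
				next = len(src)
				break
			}
			gt := strings.IndexByte(src[next+closeAt:], '>')
			if gt < 0 {
				next = len(src)
			} else {
				next += closeAt + gt + 1
			}
			break
		}
		b.WriteByte(' ') // treat every tag as a word separator
		i = next
	}
	// Decode entities only after tags are gone, then normalise whitespace.
	return strings.Join(strings.Fields(html.UnescapeString(b.String())), " ")
}

func main() {
	doc := `<html><head><style>p{color:red}</style><script>alert("x")</script></head>` +
		`<body><h1>Hello &amp; welcome</h1><p>Tom&#39;s   page</p></body></html>`
	fmt.Println(extractText(doc)) // prints: Hello & welcome Tom's page
}
```

Decoding entities after the tags are removed means an escaped sequence such as `&lt;b&gt;` survives as literal text rather than being mistaken for markup.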

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Normaliser

type Normaliser struct{}

Normaliser handles HTML documents.

func New

func New() *Normaliser

New creates a new HTML normaliser.

func (*Normaliser) Normalise

func (n *Normaliser) Normalise(_ context.Context, raw *domain.RawDocument) (*driven.NormaliseResult, error)

Normalise converts an HTML document to a normalised document. The Content field contains the text with HTML tags stripped. Chunking is handled by the PostProcessor pipeline.

func (*Normaliser) Priority

func (n *Normaliser) Priority() int

Priority returns the selection priority.

func (*Normaliser) SupportedConnectorTypes

func (n *Normaliser) SupportedConnectorTypes() []string

SupportedConnectorTypes returns connector types for specialised handling.

func (*Normaliser) SupportedMIMETypes

func (n *Normaliser) SupportedMIMETypes() []string

SupportedMIMETypes returns the MIME types this normaliser handles.
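The Priority, SupportedConnectorTypes, and SupportedMIMETypes methods suggest that a caller chooses among several normalisers for each document. The sketch below shows one way such a selection might work. Everything in it is an assumption made for illustration: the local `normaliser` interface only mirrors the three method signatures listed above, `stub` and `pick` are hypothetical, the MIME type strings and priority values are invented, and the rule that a higher number wins is not confirmed by this documentation.

```go
package main

import "fmt"

// normaliser mirrors the selection-related method signatures documented
// above. It is a local stand-in, not the project's real interface.
type normaliser interface {
	Priority() int
	SupportedMIMETypes() []string
	SupportedConnectorTypes() []string
}

// stub is a hypothetical implementation used only for this illustration.
type stub struct {
	name     string
	priority int
	mimes    []string
}

func (s stub) Priority() int                     { return s.priority }
func (s stub) SupportedMIMETypes() []string      { return s.mimes }
func (s stub) SupportedConnectorTypes() []string { return nil }

// pick returns the normaliser with the highest Priority that lists mimeType,
// or nil when none does. "Higher wins" is an assumption of this sketch.
func pick(all []normaliser, mimeType string) normaliser {
	var best normaliser
	for _, n := range all {
		for _, m := range n.SupportedMIMETypes() {
			if m == mimeType && (best == nil || n.Priority() > best.Priority()) {
				best = n
			}
		}
	}
	return best
}

func main() {
	all := []normaliser{
		stub{name: "plain-fallback", priority: 1, mimes: []string{"text/plain", "text/html"}},
		stub{name: "html", priority: 5, mimes: []string{"text/html"}},
	}
	if n := pick(all, "text/html"); n != nil {
		fmt.Println(n.(stub).name) // prints: html
	}
}
```

SupportedConnectorTypes is left unused here because the documentation does not say how connector-specific handling interacts with MIME type matching.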
