htmlprocessor

package
v0.0.0-...-75bc046
Warning

This package is not in the latest version of its module.

Published: Mar 24, 2026 License: Apache-2.0 Imports: 12 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Document

type Document interface {
	// Title extracts the page title from the <title> tag.
	// Returns empty string if not found.
	// Truncates to 200 characters (runes, not bytes).
	Title() string

	// IndexationStatus determines page indexability. Checks apply in priority order:
	// non-200 > blocked by meta > non-canonical > indexable
	IndexationStatus(statusCode int, finalURL string) types.IndexStatus

	// CleanScripts removes executable script elements.
	// Returns true if any were removed.
	CleanScripts() bool

	// GoQueryDoc returns the underlying goquery Document for advanced queries.
	GoQueryDoc() *goquery.Document

	// HTML returns current HTML as bytes (re-serialized from DOM).
	HTML() []byte

	// ExtractPageSEO extracts comprehensive SEO metadata from the document.
	// statusCode and pageURL are needed for IndexationStatus calculation.
	ExtractPageSEO(statusCode int, pageURL string) *types.PageSEO
}

Document provides methods for processing HTML documents.

func ParseWithDOM

func ParseWithDOM(htmlBytes []byte) (Document, error)

ParseWithDOM parses htmlBytes into a DOM and returns it as a Document.
