standardize

package
v1.2.0 Latest
Warning

This package is not in the latest version of its module.

Published: Sep 11, 2025 License: MIT Imports: 8 Imported by: 0

Documentation

Overview

Package standardize provides content standardization functionality for the defuddle content extraction system. It converts non-semantic HTML elements to semantic ones and applies standardization rules.

Constants

This section is empty.

Variables

This section is empty.

Functions

func Content added in v0.1.4

func Content(element *goquery.Selection, metadata *metadata.Metadata, doc *goquery.Document, debug bool)

Content standardizes and cleans up the main content element. Original JavaScript code:

export function standardizeContent(element: Element, metadata: DefuddleMetadata, doc: Document, debug: boolean = false): void {
	standardizeSpaces(element);

	// Remove HTML comments
	removeHTMLComments(element);

	// Handle H1 elements - remove first one and convert others to H2
	standardizeHeadings(element, metadata.title, doc);

	// Standardize footnotes and citations
	standardizeFootnotes(element);

	// Convert embedded content to standard formats
	standardizeElements(element, doc);

	// If not debug mode, do the full cleanup
	if (!debug) {
		// First pass of div flattening
		flattenWrapperElements(element, doc);

		// Strip unwanted attributes
		stripUnwantedAttributes(element, debug);

		// Remove empty elements
		removeEmptyElements(element);

		// Remove trailing headings
		removeTrailingHeadings(element);

		// Final pass of div flattening after cleanup operations
		flattenWrapperElements(element, doc);

		// Standardize consecutive br elements
		stripExtraBrElements(element);

		// Clean up empty lines
		removeEmptyLines(element, doc);
	} else {
		// In debug mode, still do basic cleanup but preserve structure
		stripUnwantedAttributes(element, debug);
		removeTrailingHeadings(element);
		stripExtraBrElements(element);
		logDebug('Debug mode: Skipping div flattening to preserve structure');
	}
}
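
The order of passes matters here: wrapper flattening runs twice in normal mode (before and after the cleanup passes) and is skipped entirely in debug mode. The following is a minimal sketch of that ordering only, not the package's implementation. The `step` type and `contentSteps` function are hypothetical, and the step names are taken from the JavaScript helpers above rather than from any Go identifiers this package exports.

```go
package main

import "fmt"

// step is a hypothetical stand-in for one standardization pass.
// The real passes operate on a *goquery.Selection; here each pass is
// represented only by its name so the ordering can be inspected.
type step string

// contentSteps returns the passes in the order the JavaScript original
// runs them. The names mirror the JavaScript helpers shown above.
func contentSteps(debug bool) []step {
	steps := []step{
		"standardizeSpaces",
		"removeHTMLComments",
		"standardizeHeadings",
		"standardizeFootnotes",
		"standardizeElements",
	}
	if debug {
		// Debug mode skips flattening and empty-element removal
		// so the original structure stays inspectable.
		return append(steps,
			"stripUnwantedAttributes",
			"removeTrailingHeadings",
			"stripExtraBrElements",
		)
	}
	return append(steps,
		"flattenWrapperElements", // first flattening pass
		"stripUnwantedAttributes",
		"removeEmptyElements",
		"removeTrailingHeadings",
		"flattenWrapperElements", // second pass, after cleanup
		"stripExtraBrElements",
		"removeEmptyLines",
	)
}

func main() {
	for _, debug := range []bool{false, true} {
		passes := contentSteps(debug)
		fmt.Printf("debug=%v: %d passes\n", debug, len(passes))
		for i, s := range passes {
			fmt.Printf("  %2d. %s\n", i+1, s)
		}
	}
}
```

Listing the passes as data makes it easy to see that the second flattening pass exists to collapse wrappers left empty or redundant by the attribute and empty-element cleanup that precedes it.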

Types

type StandardizationRule

type StandardizationRule struct {
	Selector  string
	Element   string
	Transform func(el *goquery.Selection, doc *goquery.Document) *goquery.Selection
}

StandardizationRule represents an element standardization rule. Original JavaScript code:

interface StandardizationRule {
	selector: string;
	element: string;
	transform?: (el: Element, doc: Document) => Element;
}
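
To illustrate how a rule of this shape might be applied, here is a minimal, standard-library-only sketch. Everything in it is hypothetical: `node` stands in for `*goquery.Selection`, `rule.Match` replaces the CSS `Selector` string, and `Transform` drops the document argument. It also assumes that a nil `Transform` (the Go counterpart of the optional `transform?` field) means "rename the element to `Element`"; the package itself may handle that case differently.

```go
package main

import "fmt"

// node is a hypothetical, minimal stand-in for *goquery.Selection.
type node struct {
	Tag   string
	Attrs map[string]string
	Text  string
}

// rule mirrors the shape of StandardizationRule with two simplifications:
// Match replaces the CSS Selector string, and Transform takes no document.
type rule struct {
	Match     func(n *node) bool
	Element   string
	Transform func(n *node) *node // optional; may be nil
}

// apply returns the standardized form of n under r.
// Assumption: a nil Transform means "rename the tag to r.Element".
func apply(r rule, n *node) *node {
	if !r.Match(n) {
		return n // rule does not apply; leave the node untouched
	}
	if r.Transform != nil {
		return r.Transform(n)
	}
	return &node{Tag: r.Element, Attrs: n.Attrs, Text: n.Text}
}

func main() {
	// A non-semantic <div role="paragraph"> becomes a semantic <p>.
	paragraphRule := rule{
		Match: func(n *node) bool {
			return n.Tag == "div" && n.Attrs["role"] == "paragraph"
		},
		Element: "p",
	}

	in := &node{Tag: "div", Attrs: map[string]string{"role": "paragraph"}, Text: "Hello"}
	out := apply(paragraphRule, in)
	fmt.Printf("<%s>%s</%s>\n", out.Tag, out.Text, out.Tag) // prints "<p>Hello</p>"
}
```

Keeping `Transform` optional lets simple tag renames be declared with just a selector and a target element, while rules that need to restructure children can supply a function.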
