Documentation ¶
Overview ¶
Package standardize provides content standardization functionality for the defuddle content extraction system. It converts non-semantic HTML elements to semantic ones and applies standardization rules.
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Content ¶ added in v0.1.4
func Content(element *goquery.Selection, metadata *metadata.Metadata, doc *goquery.Document, debug bool)
Content standardizes and cleans up the main content element. JavaScript original code:
export function standardizeContent(element: Element, metadata: DefuddleMetadata, doc: Document, debug: boolean = false): void {
	standardizeSpaces(element);

	// Remove HTML comments
	removeHTMLComments(element);

	// Handle H1 elements - remove first one and convert others to H2
	standardizeHeadings(element, metadata.title, doc);

	// Standardize footnotes and citations
	standardizeFootnotes(element);

	// Convert embedded content to standard formats
	standardizeElements(element, doc);

	// If not debug mode, do the full cleanup
	if (!debug) {
		// First pass of div flattening
		flattenWrapperElements(element, doc);

		// Strip unwanted attributes
		stripUnwantedAttributes(element, debug);

		// Remove empty elements
		removeEmptyElements(element);

		// Remove trailing headings
		removeTrailingHeadings(element);

		// Final pass of div flattening after cleanup operations
		flattenWrapperElements(element, doc);

		// Standardize consecutive br elements
		stripExtraBrElements(element);

		// Clean up empty lines
		removeEmptyLines(element, doc);
	} else {
		// In debug mode, still do basic cleanup but preserve structure
		stripUnwantedAttributes(element, debug);
		removeTrailingHeadings(element);
		stripExtraBrElements(element);
		logDebug('Debug mode: Skipping div flattening to preserve structure');
	}
}
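The order of passes in the original code above can be summarized with a small, self-contained Go sketch. It does not call this package or goquery: `standardizeSteps` is a hypothetical helper that only returns the pass names in the order the JavaScript original runs them, so the difference between normal and debug mode is easy to see.

```go
package main

import "fmt"

// standardizeSteps is a hypothetical helper, not part of the package.
// It lists the pass names in the order the JavaScript original runs them.
// The real passes mutate the content element instead of returning names.
func standardizeSteps(debug bool) []string {
	// Passes that run in both modes.
	steps := []string{
		"standardizeSpaces",
		"removeHTMLComments",
		"standardizeHeadings",
		"standardizeFootnotes",
		"standardizeElements",
	}
	if debug {
		// Debug mode: basic cleanup only, wrapper structure is preserved.
		return append(steps,
			"stripUnwantedAttributes",
			"removeTrailingHeadings",
			"stripExtraBrElements",
		)
	}
	// Normal mode: full cleanup, with wrapper flattening before and after
	// the removal passes.
	return append(steps,
		"flattenWrapperElements",
		"stripUnwantedAttributes",
		"removeEmptyElements",
		"removeTrailingHeadings",
		"flattenWrapperElements",
		"stripExtraBrElements",
		"removeEmptyLines",
	)
}

func main() {
	for _, debug := range []bool{false, true} {
		fmt.Printf("debug=%v\n", debug)
		for i, name := range standardizeSteps(debug) {
			fmt.Printf("  %2d. %s\n", i+1, name)
		}
	}
}
```

Note that flattening runs twice in normal mode: removing empty elements and trailing headings can leave behind new single-child wrappers, which the second pass collapses. Debug mode skips flattening, empty-element removal, and empty-line cleanup entirely.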
Types ¶
type StandardizationRule ¶
type StandardizationRule struct {
	Selector  string
	Element   string
	Transform func(el *goquery.Selection, doc *goquery.Document) *goquery.Selection
}
StandardizationRule represents an element standardization rule. JavaScript original code:
interface StandardizationRule {
	selector: string;
	element: string;
	transform?: (el: Element, doc: Document) => Element;
}
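To illustrate the shape of such a rule without depending on goquery, here is a minimal sketch using stand-in types. `Node`, `Rule`, `apply`, and `paragraphRule` are all hypothetical names, and the two example rules are illustrative rather than taken from the package. It also assumes that a rule without a transform simply renames the matched element to the target element; the source does not state how the package handles that case.

```go
package main

import "fmt"

// Node is a hypothetical, minimal stand-in for *goquery.Selection.
type Node struct {
	Tag   string
	Attrs map[string]string
	Text  string
}

// Rule mirrors the fields of StandardizationRule, with Node in place of
// the goquery types. A nil Transform plays the role of the optional
// `transform?` field in the original interface.
type Rule struct {
	Selector  string
	Element   string
	Transform func(n *Node) *Node
}

// apply assumes n has already been matched against r.Selector.
// Assumption: with no Transform, the node is renamed to r.Element.
func apply(r Rule, n *Node) *Node {
	if r.Transform != nil {
		return r.Transform(n)
	}
	return &Node{Tag: r.Element, Attrs: n.Attrs, Text: n.Text}
}

// paragraphRule is an illustrative rule that turns a div acting as a
// paragraph into a real <p>, keeping the text and dropping attributes.
func paragraphRule() Rule {
	return Rule{
		Selector: `div[role="paragraph"]`,
		Element:  "p",
		Transform: func(n *Node) *Node {
			return &Node{Tag: "p", Text: n.Text}
		},
	}
}

func main() {
	in := &Node{
		Tag:   "div",
		Attrs: map[string]string{"role": "paragraph"},
		Text:  "Hello",
	}
	out := apply(paragraphRule(), in)
	fmt.Printf("<%s>%s</%s>\n", out.Tag, out.Text, out.Tag)

	// A rule with no Transform falls back to a plain rename.
	bold := apply(Rule{Selector: "b", Element: "strong"}, &Node{Tag: "b", Text: "note"})
	fmt.Printf("<%s>%s</%s>\n", bold.Tag, bold.Text, bold.Tag)
}
```

Keeping the selector and target element as plain strings lets rules be declared as data, while the function field covers cases where a rename alone is not enough.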