normalizer

package
v1.5.0
Published: Mar 10, 2026 License: MIT Imports: 10 Imported by: 0

Documentation

Constants

This section is empty.

Variables

var DefaultDOMTransformations = []string{
	"style, script, path",
	"input[type='hidden']",
	"meta[content]",
	"link[rel='stylesheet']",
	"svg",
	"grammarly-desktop-integration",
	"div[class*='ad'], div[id*='ad'], div[class*='banner'], div[id*='banner'], div[class*='pixel'], div[id*='pixel']",
	"input[name*='csrf'], input[name*='token']",
}

DefaultDOMTransformations is the default list of CSS selectors for elements to remove from the DOM.

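var DefaultTextPatterns = []string{
	// Email addresses.
	`\b(?i)[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`,
	// IPv4 addresses.
	`\b(?:25[0-5]|2[0-4]\d|1?\d?\d)(?:\.(?:25[0-5]|2[0-4]\d|1?\d?\d)){3}\b`,
	// UUIDs.
	`\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b`,
	// Relative times such as "3 days ago" or "2 weeks from now".
	`\b(?:[0-9]{1,2}\s(?:days?|weeks?|months?|years?)\s(?:ago|from\s+now))\b`,
	// Currency amounts such as "$19.99" or "€ 5".
	`[\$€£¥]\s*\d+(?:\.\d{1,2})?\b`,
	// Phone-like digit runs (7 to 15 digits, optional leading "+").
	`\b\+?\d{7,15}\b`,
	// Numbers in the ddd-dd-dddd format (for example, US Social Security numbers).
	`\b\d{3}-\d{2}-\d{4}\b`,
	// Timestamps such as "2024-01-31 12:00:00" or "01/31/2024 12:00:00".
	`\b(?:(?:[0-9]{4}-[0-9]{2}-[0-9]{2})|(?:(?:[0-9]{2}\/){2}[0-9]{4}))\s(?:[0-9]{2}:[0-9]{2}:[0-9]{2})\b`,
}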

DefaultTextPatterns is the default list of regular expressions used by the text normalizer.
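To see what a few of these expressions match, the following sketch runs three of them against a sample string using only the standard regexp package. The patterns are copied verbatim from DefaultTextPatterns; the labels, the labeledPattern type, and the findMatches helper are hypothetical names introduced for illustration and are not part of this package.

```go
package main

import (
	"fmt"
	"regexp"
)

// labeledPattern pairs an expression with a descriptive label.
// The labels are guesses at intent, not names used by the package.
type labeledPattern struct {
	label string
	expr  string
}

// samples holds three expressions copied from DefaultTextPatterns.
var samples = []labeledPattern{
	{"email", `\b(?i)[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b`},
	{"uuid", `\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b`},
	{"relative time", `\b(?:[0-9]{1,2}\s(?:days?|weeks?|months?|years?)\s(?:ago|from\s+now))\b`},
}

// findMatches returns every substring of text matched by expr.
// It panics if expr does not compile.
func findMatches(expr, text string) []string {
	return regexp.MustCompile(expr).FindAllString(text, -1)
}

func main() {
	text := "Posted 3 days ago by alice@example.com (request 123e4567-e89b-12d3-a456-426614174000)"
	for _, p := range samples {
		fmt.Printf("%-13s %q\n", p.label, findMatches(p.expr, text))
	}
}
```

Each pattern targets a volatile value (an address, an identifier, a relative date) that tends to differ between otherwise identical pages.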

var NoChildrenDomTransformations = []string{
	"div",
	"span",
	"form",
	"iframe",
}

NoChildrenDomTransformations lists the elements that are removed from the DOM when they have no children.

Functions

This section is empty.

Types

type DOMNormalizer

type DOMNormalizer struct {
	// contains filtered or unexported fields
}

DOMNormalizer is a normalizer for DOM content

func NewDOMNormalizer

func NewDOMNormalizer() *DOMNormalizer

NewDOMNormalizer returns a new DOMNormalizer

transformations is a list of CSS selectors for elements to remove from the DOM.

func (*DOMNormalizer) Apply

func (d *DOMNormalizer) Apply(content string) (string, error)

Apply applies the normalizers to the given content

type Normalizer

type Normalizer struct {
	// contains filtered or unexported fields
}

func New

func New() (*Normalizer, error)

New returns a new Normalizer

func (*Normalizer) Apply

func (n *Normalizer) Apply(text string) (string, error)

Apply applies the normalizers to the given content.

It normalizes the given content by:
- Applying the DOM normalizer
- Applying the text normalizer
- Denormalizing it

type TextNormalizer

type TextNormalizer struct {
	// contains filtered or unexported fields
}

TextNormalizer is a normalizer for text

func NewTextNormalizer

func NewTextNormalizer() (*TextNormalizer, error)

NewTextNormalizer returns a new TextNormalizer

patterns is a list of regex patterns for the text normalizer. DefaultTextPatterns is used if patterns is nil. See DefaultTextPatterns for more info.

func (*TextNormalizer) Apply

func (n *TextNormalizer) Apply(text string) string

Apply applies the patterns to the text and returns the normalized text
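As a rough illustration of applying a pattern list to text, the sketch below compiles two of the default expressions and applies them in order with the standard regexp package. It is not this package's implementation: compilePatterns and applyPatterns are hypothetical helpers, and replacing each match with an empty string is an assumption, since the documentation does not say what a match is replaced with.

```go
package main

import (
	"fmt"
	"regexp"
)

// compilePatterns compiles each expression and stops at the first failure.
func compilePatterns(exprs []string) ([]*regexp.Regexp, error) {
	compiled := make([]*regexp.Regexp, 0, len(exprs))
	for _, e := range exprs {
		re, err := regexp.Compile(e)
		if err != nil {
			return nil, fmt.Errorf("compile %q: %w", e, err)
		}
		compiled = append(compiled, re)
	}
	return compiled, nil
}

// applyPatterns deletes every match of every pattern, in list order.
// Assumption: the real TextNormalizer may substitute something else.
func applyPatterns(patterns []*regexp.Regexp, text string) string {
	for _, re := range patterns {
		text = re.ReplaceAllString(text, "")
	}
	return text
}

func main() {
	// Two expressions copied from DefaultTextPatterns.
	patterns, err := compilePatterns([]string{
		`\b(?:25[0-5]|2[0-4]\d|1?\d?\d)(?:\.(?:25[0-5]|2[0-4]\d|1?\d?\d)){3}\b`,
		`\b\d{3}-\d{2}-\d{4}\b`,
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", applyPatterns(patterns, "client 10.0.0.1 sent id 123-45-6789"))
}
```

Compiling up front is one plausible reason a constructor such as NewTextNormalizer returns an error while Apply does not, though the documentation does not confirm this.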

Directories

Path Synopsis
Package simhash implements SimHash algorithm for near-duplicate detection.
