htmlpolicy

v0.4.0
Warning

This package is not in the latest version of its module.

Published: Jun 7, 2026 License: AGPL-3.0 Imports: 18 Imported by: 0

README

htmlpolicy


A policy-driven HTML, CSS, and SVG sanitizer for Go.

Most sanitizers hardcode what's "safe" in library code. htmlpolicy takes a different approach: you write a declarative policy that says exactly what to strip, allow, defang, demote, or unwrap. Anything not matched by a rule passes through unchanged. Policies are plain text files, not Go code, so they can be reviewed, versioned, composed, and swapped without recompiling.

API reference and policy language documentation (godoc)

Quick Start

import "gitlab.com/grepular/htmlpolicy"

// Parse a policy and apply it to an HTML fragment.
policy, err := htmlpolicy.Parse(`
    strip script,style,noscript
    strip iframe,object,embed
    strip-attr * on*
    strip-attr a[href^=javascript:] href
`, nil)
if err != nil {
    log.Fatal(err)
}

output, modified, err := policy.ApplyHTML(fragmentContent)

For full HTML documents, use ApplyDocument instead:

output, modified, err := policy.ApplyDocument(documentContent)

For standalone CSS sanitization:

policy, _ := htmlpolicy.Parse("css-strip *\ncss-allow color,font-size", nil)
out, modified, err := policy.ApplyCSS([]byte(".foo { color: red; font-size: 14px; margin: 10px; }"))
// out = ".foo {\ncolor: red;\nfont-size: 14px;\n}"

Both HTML and CSS expect UTF-8 input. Use ConvertToUTF8 first if needed:

utf8Content, _ := htmlpolicy.ConvertToUTF8(rawContent, contentType)

Features

  • HTML sanitization — strip, allow, defang, demote, comment-out, placeholder, or unwrap elements and attributes using CSS selectors
  • CSS sanitization — filter properties, values, at-rules, url() schemes, and selector pseudo-classes/elements in style attributes, <style> elements, and standalone stylesheets
  • URL scheme filtering — allow/strip/defang URLs by scheme (html and css independently) with per-URL evaluation for multi-URL attributes (srcset)
  • Content-type filtering — control which MIME types are allowed in data URIs with wildcard patterns
  • Recursive sanitization — data:text/html, data:image/svg+xml, srcdoc, and meta refresh URLs are recursively sanitized with depth limiting
  • SVG/MathML namespace validation — prevents mutation XSS from foreign-content elements in the wrong namespace
  • SMIL animation sanitization — applies attribute and scheme rules to animated values, preventing runtime bypasses
  • Parser-differential prevention — always re-serializes through Go's HTML5 parser by default
  • URL rewriting — per-call callback hook (via WithApplyURLRewriter) for rewriting URLs in the final output (HTML attributes, CSS url() values, srcset, SMIL); scoped to a single Apply call so it can safely close over per-call state without concurrency concerns. A companion WithApplyURLPrefetcher hook delivers the full URL set up front (same set, same order) so callers can fetch resources in parallel before the rewriter runs
  • Policy composition — include directives for combining policies
  • Concurrency safe — Policy is safe for concurrent use after creation

Security Guidance

This library is policy-driven: it does not hardcode any element or attribute as "safe" or "dangerous". You must write policies that account for the HTML features your application needs to defend against. The following policy is a starting point:

strip-comments
strip script,noscript,style,base
strip iframe,fencedframe,object,embed,applet
strip template,portal
strip form,input,textarea,select,button
strip meta[http-equiv=refresh]
strip-attr * on*
strip-attr * style
strip-attr * is
strip-attr * nonce
strip-attr template shadowrootmode
strip-scheme * href javascript,vbscript
strip-scheme * src javascript,vbscript,data
strip-content-type * * *+xml,*/xml

Key threats to consider

  • Declarative Shadow DOM: <template shadowrootmode="open"> causes template content to become live DOM. Strip <template> or the attribute.
  • <iframe srcdoc>: Contains raw inline HTML. Strip iframes or the srcdoc attribute.
  • <fencedframe>: Chrome-specific iframe-like element. Treat like iframe.
  • is attribute: Triggers custom element constructors, has mXSS edge cases. Strip in security contexts.
  • nonce attribute: CSP bypass if an attacker controls it. Always strip.
  • SVG data URIs: Recursively sanitized, but XML namespace prefixes are a known limitation. Use strip-content-type * * image/svg+xml where not needed, or restrict to <img> context (where browsers disable scripting).
  • unwrap * as an allowlist catch-all: Unwrap removes a wrapper element but keeps its children in flow, mirroring how browsers treat unknown tags (e.g. <dov>hello</dov> renders as hello). It is useful for letting unrecognized structural tags pass through transparently, but it MUST be paired with explicit strip rules for any element whose presence indicates active content or whose textual content should not be visible. At minimum: strip script,style,iframe,object,embed,form,input,textarea,select,button,link,base and strip meta[http-equiv=refresh]. Without those, raw <script>/<style> text would leak into the output as visible text once the wrapper is removed, and unknown-named elements that carry active content would be exposed by name.

CLI Tool

go install gitlab.com/grepular/htmlpolicy/cmd/htmlpolicy@latest
htmlpolicy [flags] <policy-arg>... < input.html > output.html

Each argument is either a path to a policy file or inline policy text. Multiple arguments are concatenated in order (last match wins):

htmlpolicy policy.txt < input.html > output.html
htmlpolicy 'strip script' < input.html > output.html
htmlpolicy base.policy 'strip style' 'allow-scheme * href https'
htmlpolicy base.policy overrides.policy 'strip img'

Flag              Description
-fragment         Parse input as an HTML fragment instead of a full document
-detect-charset   Detect and convert input charset to UTF-8
-content-type     Content-Type header for charset detection (only used with -detect-charset)
-prefix           Override the prefix for defang/comment-out actions (default: htmlpolicy)
-verbose          Log each sanitization action to stderr

Verbose Logging

The -verbose flag (or WithVerboseLog in the library) logs each action to stderr:

strip <script> (line 1: strip script)
strip-attr <p> onclick (line 3: strip-attr * on*)
strip-scheme <a> href javascript:alert(1) (line 5: strip-scheme * href javascript)
css-strip position (line 7: css-strip position)
css-strip-at @import (line 8: css-strip-at import)

Requirements

Go 1.26.1 or later.

Testing

make test

100% test coverage is required. Tests fail if coverage drops below 100%.

License

See LICENSE.

Acknowledgements

This project was developed by Mike Cardwell, with the assistance of Claude Code, Anthropic's AI coding tool.


Documentation

Overview

Package htmlpolicy implements a policy-driven HTML, CSS, and SVG sanitizer.

This library was built entirely through vibe coding with Claude Code (https://claude.ai/claude-code).

Unlike sanitizers that hardcode safety rules in library code, htmlpolicy uses declarative policies: plain text files where each line starts with an action verb followed by CSS selectors and optional arguments. Rules are evaluated top-to-bottom with last-match-wins semantics. Anything not matched by a rule passes through unchanged.

Policies can be reviewed, versioned, composed via includes, and swapped without recompiling. This makes htmlpolicy suitable for email rendering pipelines, CMS content filtering, user-generated content sanitization, and any HTML-to-HTML transformation where the rules need to be configurable.

Entry Points

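There are five entry points for applying a policy, including Policy.ApplyHTML (HTML fragments), Policy.ApplyDocument (full HTML documents), Policy.ApplyCSS (stylesheets), and Policy.ApplyInlineCSS (declaration lists).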

All entry points expect UTF-8 input and return UTF-8 output. A Policy is safe for concurrent use by multiple goroutines.

Selectors

Selectors use standard CSS syntax compiled via github.com/andybalholm/cascadia:

tag                     match all <tag> elements
*                       match all elements
tag[attr=value]         exact attribute match
tag[attr^=prefix]       prefix match
tag[attr$=suffix]       suffix match
tag[attr*=substring]    contains match
tag[attr~=word]         space-separated word match
tag[attr=value i]       case-insensitive match (CSS4)
div.ads > a             child combinator (quote when spaces present)
svg animate             descendant: <animate> inside <svg>
#tracking               ID selector
:not([href^=https])     negation pseudo-class

Comma-separated selector groups (e.g. "script,style,iframe") compile into a single rule. When a selector contains spaces, quote it so the parser distinguishes it from arguments: strip "svg animate".

Tag matching is case-insensitive. Attribute value matching normalizes control characters and case internally for security, without modifying the output. This prevents bypasses like " javascript:" evading a [href^=javascript:] prefix check, or JaVaScRiPt: evading case-sensitive matching.
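
The idea of matching on a normalized copy while leaving the output untouched can be pictured with a small standard-library sketch. This is not htmlpolicy's implementation: the helper names normalizeForMatch and hasPrefixNormalized are hypothetical, and the exact characters removed (leading C0 controls and spaces, tabs and newlines anywhere) are an assumption modelled on how browsers pre-process URLs.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeForMatch returns a copy of an attribute value prepared for
// selector matching only; the caller keeps the original value for output.
// Assumption: leading C0 controls and spaces are trimmed, tabs and newlines
// are removed everywhere, and case is folded.
func normalizeForMatch(v string) string {
	v = strings.TrimLeftFunc(v, func(r rune) bool { return r <= 0x20 })
	v = strings.Map(func(r rune) rune {
		if r == '\t' || r == '\n' || r == '\r' {
			return -1 // a negative result drops the rune
		}
		return r
	}, v)
	return strings.ToLower(v)
}

// hasPrefixNormalized mimics a [attr^=prefix] check on the normalized value.
func hasPrefixNormalized(value, prefix string) bool {
	return strings.HasPrefix(normalizeForMatch(value), strings.ToLower(prefix))
}

func main() {
	values := []string{" javascript:alert(1)", "JaVa\tScRiPt:alert(1)", "https://example.com/"}
	for _, v := range values {
		fmt.Printf("%q matches javascript: prefix: %v\n", v, hasPrefixNormalized(v, "javascript:"))
	}
}
```

The original string is never modified, so a value that does not trigger a rule is emitted exactly as it was parsed.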

HTML Tag Verbs

Tag verbs determine what happens to matched elements:

strip SELECTOR[,...]                remove element and all content
comment-out SELECTOR[,...]          wrap in HTML comment
placeholder SELECTOR[,...]          replace with [removed: tag] label
demote SELECTOR[,...] [ATTR,...]    convert to <div>/<span>, keep listed attrs
allow SELECTOR[,...] [ATTR,...]     keep element, keep only listed attrs
unwrap SELECTOR[,...]               remove the element's tags but keep its children

Inline elements demote to <span>, all others to <div>. The namespace is cleared, so foreign elements (SVG, MathML) become plain HTML.

Unwrap removes the wrapper but leaves the children in flow (mirroring how browsers treat unknown elements). The wrapper's attributes are discarded; void/empty elements with no children are stripped. As a catch-all in an allowlist policy, "unwrap *" MUST be paired with explicit strip rules for any element whose presence indicates active content or whose text content should not be visible — at a minimum: script, style, iframe, object, embed, form, input, textarea, select, button, link, base, meta. Without those, raw <script>/<style> text would leak into the output as visible text once the wrapper is removed.

The allow and demote shorthand attribute list is equivalent to separate allow-attr rules. If the selector has a condition, it propagates:

allow a[href^=https:] href,class
# is equivalent to:
allow a[href^=https:]
allow-attr a[href^=https:] href
allow-attr a[href^=https:] class

HTML Attribute Verbs

Attribute verbs filter attributes on matched elements. Attribute name patterns support a trailing * glob: on* matches onclick, onload, etc.

strip-attr SELECTOR ATTR[,...]      strip named attrs
allow-attr SELECTOR ATTR[,...]      allow only named attrs
defang-attr SELECTOR ATTR[,...]     rename to PREFIX-defanged-ATTR
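
As a rough illustration of how a trailing * glob, last-match-wins evaluation, and the PREFIX-defanged-ATTR rename fit together, here is a standard-library sketch. It is a simplified model rather than the library's code: the attrRule type and the decide, matchAttr, and defangName helpers are hypothetical, and element selectors are ignored entirely.

```go
package main

import (
	"fmt"
	"strings"
)

type action int

const (
	pass   action = iota // no rule matched: attribute passes through
	allow                // listed for completeness; unused in this sketch
	strip
	defang
)

// attrRule is a hypothetical, simplified rule that carries only an
// attribute name pattern (no element selector).
type attrRule struct {
	pattern string // e.g. "onclick" or "on*"
	act     action
}

// matchAttr supports exact names and a trailing "*" glob.
func matchAttr(pattern, name string) bool {
	if p, ok := strings.CutSuffix(pattern, "*"); ok {
		return strings.HasPrefix(name, p)
	}
	return pattern == name
}

// decide walks the rules top to bottom; the last matching rule wins.
func decide(rules []attrRule, name string) action {
	result := pass
	for _, r := range rules {
		if matchAttr(r.pattern, strings.ToLower(name)) {
			result = r.act
		}
	}
	return result
}

// defangName builds the PREFIX-defanged-ATTR form.
func defangName(prefix, attr string) string {
	return prefix + "-defanged-" + attr
}

func main() {
	// Roughly: "strip-attr * on*" followed by "defang-attr * onclick".
	rules := []attrRule{{"on*", strip}, {"onclick", defang}}
	for _, name := range []string{"onclick", "onload", "href"} {
		switch decide(rules, name) {
		case strip:
			fmt.Println(name, "-> removed")
		case defang:
			fmt.Println(name, "->", defangName("htmlpolicy", name))
		default:
			fmt.Println(name, "-> unchanged")
		}
	}
}
```

Because the later, more specific rule overrides the earlier glob, ordering a policy from general to specific is the natural way to carve out exceptions.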

URL Scheme Verbs

Scheme verbs filter URL attributes by the URL's scheme. Use :relative to match schemeless URLs, * to match any scheme:

allow-scheme SELECTOR ATTR SCHEME[,...]     allow URLs with listed schemes
strip-scheme SELECTOR ATTR SCHEME[,...]     strip URLs with listed schemes
defang-scheme SELECTOR ATTR SCHEME[,...]    rename attr when scheme matches

For allowlist behavior, pair with a strip-scheme baseline:

strip-scheme * href *
allow-scheme * href https,mailto,:relative

Each scheme must be a valid URL scheme (an ASCII letter followed by letters, digits, "+", "-", or "."), the wildcard *, or :relative. Comma-separated lists reject a stray comma between entries (e.g. "http,,https") to catch typos.
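
The scheme grammar and the strip-then-allow pattern above can be sketched with the standard library alone. The names schemeOf, schemeRule, and keepURL are hypothetical, the sketch skips the control-character normalization the library performs, and it assumes that * also covers schemeless URLs (which the allowlist example suggests but this page does not state outright).

```go
package main

import (
	"fmt"
	"strings"
)

// schemeOf returns the lowercase scheme of a URL, or ":relative" when the
// text before the first colon is not a valid scheme (an ASCII letter
// followed by letters, digits, "+", "-", or ".").
func schemeOf(u string) string {
	i := strings.IndexByte(u, ':')
	if i <= 0 {
		return ":relative"
	}
	for j := 0; j < i; j++ {
		c := u[j]
		isLetter := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
		isOther := (c >= '0' && c <= '9') || c == '+' || c == '-' || c == '.'
		if !isLetter && !(j > 0 && isOther) {
			return ":relative"
		}
	}
	return strings.ToLower(u[:i])
}

// schemeRule is a hypothetical stand-in for one strip-scheme or
// allow-scheme line, with the selector and attribute omitted.
type schemeRule struct {
	schemes []string // a scheme name, "*", or ":relative"
	strip   bool     // true for strip-scheme, false for allow-scheme
}

// keepURL applies the rules top to bottom with last-match-wins semantics.
func keepURL(rules []schemeRule, u string) bool {
	s := schemeOf(u)
	ok := true // URLs not matched by any rule pass through
	for _, r := range rules {
		for _, want := range r.schemes {
			if want == "*" || want == s {
				ok = !r.strip
				break
			}
		}
	}
	return ok
}

func main() {
	// strip-scheme * href *
	// allow-scheme * href https,mailto,:relative
	rules := []schemeRule{
		{schemes: []string{"*"}, strip: true},
		{schemes: []string{"https", "mailto", ":relative"}, strip: false},
	}
	for _, u := range []string{"https://example.com/", "javascript:alert(1)", "/img/a.png", "HTTP://example.com/"} {
		fmt.Printf("%-24s scheme=%-10s keep=%v\n", u, schemeOf(u), keepURL(rules, u))
	}
}
```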

Multi-URL attributes (srcset, imagesrcset, ping, archive) get per-URL evaluation — each URL is independently matched against the rule chain.

Content-Type Verbs

Content-type verbs control what happens to data URIs based on their MIME type. MIME patterns support a single * wildcard: image/*, *+xml, */xml.

allow-content-type SELECTOR ATTR TYPE[,...]     allow data URIs with listed types
strip-content-type SELECTOR ATTR TYPE[,...]     strip data URIs with listed types
defang-content-type SELECTOR ATTR TYPE[,...]    rename attr when type matches

When no content-type rules are present, default behavior applies (text/html, application/xhtml+xml, and image/svg+xml are recursively sanitized, others pass through). When any content-type rule is present, rules take full control.
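
To make the single-wildcard MIME patterns concrete, the following standard-library sketch extracts the media type from a data URI and tests it against patterns such as image/*, *+xml, and */xml. The helper names mimeOf and matchMIME are hypothetical, and the sketch makes no claim about how htmlpolicy parses data URIs internally; the text/plain fallback follows the data URL default when the media type is omitted.

```go
package main

import (
	"fmt"
	"strings"
)

// mimeOf extracts the lowercase media type from a data URI, for example
// "data:image/svg+xml;base64,AAAA" yields "image/svg+xml".
func mimeOf(u string) (string, bool) {
	if len(u) < 5 || !strings.EqualFold(u[:5], "data:") {
		return "", false
	}
	header, _, found := strings.Cut(u[5:], ",")
	if !found {
		return "", false // a data URI needs a comma before its payload
	}
	mime, _, _ := strings.Cut(header, ";") // drop parameters such as ;base64
	mime = strings.ToLower(strings.TrimSpace(mime))
	if mime == "" {
		mime = "text/plain" // default when the media type is omitted
	}
	return mime, true
}

// matchMIME supports at most one "*" wildcard in the pattern.
func matchMIME(pattern, mime string) bool {
	prefix, suffix, wild := strings.Cut(pattern, "*")
	if !wild {
		return pattern == mime
	}
	return len(mime) >= len(prefix)+len(suffix) &&
		strings.HasPrefix(mime, prefix) && strings.HasSuffix(mime, suffix)
}

func main() {
	mime, _ := mimeOf("data:image/svg+xml;base64,AAAA")
	for _, p := range []string{"image/*", "*+xml", "*/xml", "text/html"} {
		fmt.Printf("%-10s matches %s: %v\n", p, mime, matchMIME(p, mime))
	}
}
```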

CSS Verbs

CSS verbs process style attributes, <style> element contents, data:text/css URIs, and SMIL animated style values. They follow the same last-match-wins semantics as HTML rules. Without CSS rules, CSS content passes through unchanged. HTML scheme rules (strip-scheme etc.) do NOT apply to CSS — use css-*-scheme rules instead.

css-strip PROP[,...]                    strip CSS properties (* = all, trailing * glob ok)
css-allow PROP[,...]                    allow CSS properties
css-defang PROP[,...]                   defang CSS properties (prefix name)
css-strip-value PATTERN                 strip properties matching value pattern
css-defang-value PATTERN                defang properties matching value pattern
css-strip-at NAME[,...]                 strip CSS at-rules (e.g. import, * = all)
css-allow-at NAME[,...]                 allow CSS at-rules
css-defang-at NAME[,...]                defang CSS at-rules (prefix name)
css-strip-scheme TARGET SCHEME[,...]    strip CSS url() values by scheme
css-allow-scheme TARGET SCHEME[,...]    allow CSS url() values by scheme
css-defang-scheme TARGET SCHEME[,...]   defang CSS url() values by scheme
css-strip-pseudo NAME[,...]             strip selectors using a pseudo-class/element
css-allow-pseudo NAME[,...]             allow pseudo-classes/elements

The css-*-scheme TARGET is a comma-separated list of CSS property name patterns or @import. Use * to match all properties and @import. Scheme lists support :relative and * as with HTML scheme rules.

The css-*-pseudo verbs filter CSS selectors (in <style> elements, data:text/css, and standalone Policy.ApplyCSS) by pseudo-class or pseudo-element name. Each comma-separated NAME may carry an optional leading "::" to match pseudo-elements only or ":" to match pseudo-classes only; a bare name (or *) matches either kind. Names support a trailing * glob. The four legacy single-colon pseudo-elements (:before, :after, :first-line, :first-letter) are always treated as pseudo-elements.

When a pseudo with action strip matches anywhere within a complex (comma-separated) selector — including nested inside functional pseudo-classes such as :has(), :is(), :not(), :where() — that entire complex selector is dropped; if every complex selector in a ruleset is dropped, the whole ruleset is removed. This only ever narrows which elements a style applies to. As with other CSS verbs, evaluation is last-match-wins per pseudo and unmatched pseudos pass through, so an allowlist is written as "css-strip-pseudo *" then "css-allow-pseudo hover,focus". Pseudo filtering does not apply to inline style attributes (which have no selectors) or to at-rule preludes such as @page :first or @keyframes stops.
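
The NAME syntax for css-*-pseudo rules (optional leading ":" or "::", trailing * glob, and the four legacy single-colon pseudo-elements) can be modelled in a few lines. This is an illustrative sketch only: parsePattern, classify, and matches are hypothetical helpers, and arguments of functional pseudo-classes such as :not(...) are assumed to have been removed before matching.

```go
package main

import (
	"fmt"
	"strings"
)

type pseudoKind int

const (
	kindAny pseudoKind = iota
	kindClass
	kindElement
)

// legacyElements lists the pseudo-elements that may be written with one colon.
var legacyElements = map[string]bool{
	"before": true, "after": true, "first-line": true, "first-letter": true,
}

// parsePattern splits a policy NAME such as "::before", ":hover", or "hover".
func parsePattern(s string) (pseudoKind, string) {
	s = strings.ToLower(s)
	switch {
	case strings.HasPrefix(s, "::"):
		return kindElement, s[2:]
	case strings.HasPrefix(s, ":"):
		return kindClass, s[1:]
	}
	return kindAny, s
}

// classify takes a pseudo as written in a selector (":hover", "::selection")
// and reports its bare name and whether it is a pseudo-element.
func classify(written string) (string, bool) {
	written = strings.ToLower(written)
	if n, ok := strings.CutPrefix(written, "::"); ok {
		return n, true
	}
	n := strings.TrimPrefix(written, ":")
	return n, legacyElements[n]
}

// matches reports whether a policy NAME matches a pseudo from a selector.
func matches(pattern, written string) bool {
	kind, pat := parsePattern(pattern)
	name, isElement := classify(written)
	if (kind == kindElement && !isElement) || (kind == kindClass && isElement) {
		return false
	}
	if p, ok := strings.CutSuffix(pat, "*"); ok {
		return strings.HasPrefix(name, p)
	}
	return pat == name
}

func main() {
	fmt.Println(matches("hover", ":hover"))          // bare name, either kind
	fmt.Println(matches("::before", ":before"))      // legacy pseudo-element
	fmt.Println(matches(":before", ":before"))       // class-only pattern
	fmt.Println(matches("-webkit-*", "::-webkit-scrollbar"))
}
```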

CSS allowlist example:

css-strip *
css-allow color,background-color,font-size,font-family,font-weight
css-allow text-align,text-decoration,margin,padding,border
css-strip-value expression(*)
css-strip-value url(*)
css-strip-at *

CSS blocklist example:

css-strip -moz-binding,behavior
css-strip-value expression(*)
css-strip-at import
css-strip-scheme * javascript,vbscript,data,blob

Standalone CSS sanitization is available via Policy.ApplyCSS (stylesheets) and Policy.ApplyInlineCSS (declaration lists).

Other Verbs

placeholder-label LABEL      customize the label for placeholder (default "removed")
strip-comments               remove HTML comments
include NAME-OR-PATH         inline another policy (loaded via [Resolver])

Lines starting with # are comments. Only full-line comments are supported.

URL Rewriting

WithApplyURLRewriter is an ApplyOption that sets a callback that receives every URL in the sanitized output and can replace it. The rewriter is scoped to a single Apply call, so it may safely close over per-call state (caches, counters, accumulators) without concurrency concerns. The rewriter runs after policy evaluation and base URL resolution. It covers HTML URL attributes, CSS url() values, srcset entries, SVG url() attributes, meta refresh URLs, and SMIL animation values. URLs inside recursively-sanitized data:text/html content are also rewritten. Stripped and defanged URLs are excluded. Fragment-only references, empty URLs, and data: URIs themselves are excluded (but URLs inside data URI content are recursed into). The option also works with Policy.ApplyCSS and Policy.ApplyInlineCSS.

WithApplyURLPrefetcher is a companion ApplyOption that receives the full set of URLs the rewriter would be called with — same set, same resolved form, same dedup, in document order — once per call, before the rewrite pass. It lets a caller warm a cache (e.g. fetch image and font resources in parallel) that the subsequent synchronous rewriter reads from. Each URLRef carries a live URLContext (including GetAttr) so the caller can inspect sibling attributes when deciding what to fetch.

Output Normalization

By default, Policy.ApplyHTML and Policy.ApplyDocument always re-serialize through Go's HTML5 parser, even when no rules modified the content. This prevents parser-differential attacks. Use WithPreserveOriginal to return the original bytes when no rules match.

Embedded Content Sanitization

HTML embedded inside attribute values is recursively sanitized: data:text/html URIs, data:application/xhtml+xml URIs, data:image/svg+xml URIs, srcdoc attributes, and meta refresh URLs. Recursion depth is limited to 16 levels; at the limit, content is stripped entirely. Use content-type rules to control which MIME types are allowed.

Namespace Validation (mXSS Prevention)

After applying policy rules, namespace consistency is validated. Elements whose namespace is invalid for their DOM position are stripped. This prevents mutation XSS attacks where foreign-content elements end up in the wrong namespace after policy actions change the tree structure.

SVG SMIL Animation Sanitization

SMIL animation elements (<animate>, <set>, etc.) are sanitized by applying the same attribute and scheme policy rules to animated values. This prevents runtime bypasses via attributeName="href" values="javascript:alert(1)". CSS in animated style values is sanitized through the CSS engine.

Limitations

Apart from the pseudo-class and pseudo-element filtering provided by css-*-pseudo rules, CSS selectors in <style> elements are not filtered. Attribute selectors combined with url() values can exfiltrate data. The url() scheme filtering mitigates this, but full protection requires stripping <style> elements.

CSS var()/env()/attr() substitution happens at browser computed-value time, not parse time. Known attack vectors are handled (url() in custom properties, @import var(), var() fallback values, var()/env()/attr() in URL-accepting functions). When css-*-scheme rules are active, var()/env()/attr() inside URL-accepting functions (image-set, -webkit-image-set, cross-fade, image, src) is stripped as a precaution. Bare string arguments inside these functions are evaluated as URLs and subjected to css-*-scheme rules. CSS escape sequences in the url() (or attr()) function name itself (e.g. \75rl(...), \61ttr(...)) bypass the lexer's URLToken recognition, so when css-*-scheme rules are active any FunctionToken whose escape-decoded name is "url" causes the declaration to be stripped. attr() with a url-typed substitution — "attr(name url)" or "attr(name type(<url>))" — also causes the declaration to be stripped, since the value of the named HTML attribute would be loaded as a URL at computed time.

Limits: data URI recursion depth 16, HTML nesting depth 512, and CSS nesting depth 128 are hardcoded. Output size defaults to 10x the input size (configurable via WithMaxOutputFactor, with a minimum of 32KB).
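
The output-size budget can be worked through with a short calculation. The sketch below is an assumption about how the factor and the 32KB floor combine (the larger of the two wins, with 32KB taken as 32*1024 bytes, and a factor of 0 disabling the limit as documented for WithMaxOutputFactor); maxOutputBytes is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// minLimit is the documented floor, assumed here to mean 32*1024 bytes.
const minLimit = 32 * 1024

// maxOutputBytes returns the output budget in bytes for a given input
// length and factor, or -1 when the limit is disabled (factor 0).
func maxOutputBytes(inputLen int, factor float64) int {
	if factor == 0 {
		return -1
	}
	limit := int(factor * float64(inputLen))
	if limit < minLimit {
		limit = minLimit // small inputs always get the floor
	}
	return limit
}

func main() {
	fmt.Println(maxOutputBytes(12, 2.0))       // tiny input: floor applies
	fmt.Println(maxOutputBytes(100_000, 10.0)) // default factor on 100 KB input
	fmt.Println(maxOutputBytes(100_000, 0))    // limit disabled
}
```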


Constants

const DefaultMaxIncludeDepth = 64

DefaultMaxIncludeDepth is the default maximum nesting depth for includes.


Functions

func ConvertToUTF8

func ConvertToUTF8(content []byte, contentType string) ([]byte, bool)

ConvertToUTF8 converts content from the charset specified in contentType to UTF-8. The contentType should be a MIME type with optional charset parameter (e.g. "text/html; charset=iso-8859-1"). If no charset is specified, encoding is determined by inspecting the content.

Returns the UTF-8 content and a boolean indicating whether conversion was performed. When the content is already UTF-8 (or the detected encoding matches UTF-8), the original slice is returned with false.

Example
package main

import (
	"fmt"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	content := []byte("<p>caf\xe9</p>")
	utf8, converted := htmlpolicy.ConvertToUTF8(content, "text/html; charset=iso-8859-1")

	fmt.Println("converted:", converted)
	fmt.Println(string(utf8))
}
Output:
converted: true
<p>café</p>

Types

type Action

type Action int

Action determines what happens to a matched element or attribute. These types are exported for documentation purposes but are not part of the stable API — Policy fields are unexported and there is no public function that accepts or returns rule types.

const (
	// Strip removes the element and all of its content from the output.
	Strip Action = iota
	// CommentOut wraps the element and its content in an HTML comment
	// (<!-- ... -->), making it invisible to browsers while preserving
	// the content in the source for inspection.
	CommentOut
	// Placeholder replaces the element with a text label such as
	// "[removed: script]". The label word is configurable via the
	// placeholder-label policy directive.
	Placeholder
	// Demote converts the element to a safe generic container: inline
	// elements become <span>, all others become <div>. Allowed
	// attributes are preserved; the namespace is cleared.
	Demote
	// Allow keeps the element in the output unchanged.
	Allow
	// Defang renames the attribute by inserting a prefix (e.g.
	// "htmlpolicy-defanged-onclick"), making it inert while preserving
	// the value for inspection.
	Defang
	// Unwrap removes the element's start and end tags but keeps its
	// children in place at the position the element occupied. The
	// element's attributes are discarded along with the wrapper. This
	// mirrors how browsers treat unknown elements (HTMLUnknownElement
	// is a transparent wrapper: the tag has no effect and children
	// render in flow). Void/empty elements with no children behave as
	// [Strip] — there is nothing to keep.
	//
	// Unwrap differs from [Demote]: Demote renames the wrapper to a
	// safe generic container (<div>/<span>), preserving the element
	// boundary, while Unwrap removes the wrapper entirely.
	//
	// Unwrap is appropriate as a catch-all for unknown structural
	// elements in an allowlist policy. It MUST be paired with explicit
	// strip rules for any element whose presence indicates active
	// content or whose text content should not be visible. At a
	// minimum, pair "unwrap *" with strip rules for script, style,
	// iframe, object, embed, form, input, textarea, select, button,
	// link, base, and meta. Without those, raw <script>/<style> text
	// would leak into the output as visible text after the wrapper
	// is removed.
	Unwrap
)

type ApplyOption

type ApplyOption func(*applyConfig)

ApplyOption configures a single call to Policy.ApplyHTML, Policy.ApplyDocument, Policy.ApplyCSS, or Policy.ApplyInlineCSS. Unlike ParseOption (which is baked into the compiled Policy and shared across all calls), ApplyOption values are scoped to one call. This allows per-call state — for example, a URL rewriter that closes over per-message caches or budgets — without recompiling the policy.

func WithApplyURLPrefetcher added in v0.3.0

func WithApplyURLPrefetcher(fn func([]URLRef)) ApplyOption

WithApplyURLPrefetcher registers a callback invoked once per Apply call, after parsing, policy evaluation, and base URL resolution, but before the URL rewrite pass. It receives every URL that a URLRewriter set via WithApplyURLRewriter would be invoked with in this same call — the same set, the same resolved form, the same deduplication semantics, and the same order (document order for top-level URLs; URLs inside recursively sanitized data: content appear at the point their containing attribute is processed). The callback returns nothing; it exists so the caller can warm a cache — for example, fetching image and font resources in parallel — that the subsequent synchronous URLRewriter then reads from.

Each [URLRef.Context].GetAttr is live during the callback and returns the element's attributes, so the caller can inspect sibling attributes (width, height, style, type, rel, …) to decide whether to fetch.

The prefetcher and rewriter are independent: either, both, or neither may be set. When both are set the prefetcher runs first, to completion, and then the normal rewrite walk runs. When no rewriter is set the prefetcher is still invoked (with the set the rewriter would have seen). The callback is always invoked exactly once per Apply call, even when no URLs are found (with an empty slice).

Implementation note: enabling a prefetcher runs the deterministic sanitization pipeline a second time — a recording-only pass that collects the URL set without mutating output — so the prefetch set is guaranteed to match the rewrite pass exactly, including URLs at every recursion depth. This roughly doubles the parse/sanitize CPU for the call (the network fetches it enables run once, in parallel). Callers that do not set a prefetcher pay nothing. If WithVerboseLog is also enabled, rule-match log lines are emitted for both passes.

Like WithApplyURLRewriter, the callback is scoped to a single Apply call and may safely close over per-call state without concurrency concerns. It is invoked synchronously from within Apply; htmlpolicy itself introduces no goroutines (the caller may fan out internally).

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`allow *`, nil)
	if err != nil {
		log.Fatal(err)
	}

	// The prefetcher receives every URL up front, so a caller can fetch them
	// in parallel (with its own concurrency limits) and warm a cache that the
	// synchronous rewriter then reads from.
	cache := map[string]string{}
	prefetch := func(refs []htmlpolicy.URLRef) {
		for i, r := range refs {
			// A real implementation would fetch r.URL here, in parallel.
			cache[r.URL] = fmt.Sprintf("cid:image%03d", i+1)
		}
	}
	rewriter := func(ctx htmlpolicy.URLContext, u string) string {
		if cid, ok := cache[u]; ok {
			return cid
		}
		return u
	}

	input := []byte(`<img src="https://example.com/a.jpg"/><img src="https://example.com/b.jpg"/>`)
	output, _, err := policy.ApplyHTML(input,
		htmlpolicy.WithApplyURLPrefetcher(prefetch),
		htmlpolicy.WithApplyURLRewriter(rewriter))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}
Output:
<img src="cid:image001"/><img src="cid:image002"/>

func WithApplyURLRewriter

func WithApplyURLRewriter(fn URLRewriter) ApplyOption

WithApplyURLRewriter sets a function that rewrites URLs in the sanitized output for this single Apply call. The rewriter runs after policy evaluation and base URL resolution with each URL's final resolved form. It receives a URLContext describing where the URL was found and returns a replacement URL (or the original to keep it unchanged).

The rewriter covers HTML URL attributes (href, src, action, etc.), CSS url() values in style attributes and <style> elements, @import URLs, srcset entries, SVG url() attributes, meta refresh URLs, and SMIL animation values. URLs inside recursively-sanitized data:text/html content are also rewritten. Stripped and defanged URLs are excluded.

Fragment-only references (#id), empty URLs, and data: URIs are excluded from the callback. However, URLs inside data URI content are recursed into — e.g. url(img.png) inside a data:text/css URI is presented to the rewriter even though the data URI itself is not.

Because the rewriter is scoped to a single Apply call, it may safely close over per-call state (caches, counters, accumulators) without concurrency concerns. Two goroutines calling Apply on the same Policy with different rewriters do not share state.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip script,style
		strip-attr * on*
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	rewriter := func(ctx htmlpolicy.URLContext, u string) string {
		// Replace external URLs with CID references (for email embedding).
		if ctx.Element == "img" && ctx.Attr == "src" {
			return "cid:image001"
		}
		return u
	}

	input := []byte(`<p>Hello</p><img src="https://example.com/photo.jpg" alt="photo"/>`)
	output, _, err := policy.ApplyHTML(input, htmlpolicy.WithApplyURLRewriter(rewriter))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}
Output:
<p>Hello</p><img src="cid:image001" alt="photo"/>

type AttrRule

type AttrRule struct {
	Selector string           // CSS selector text
	Matcher  cascadia.Matcher // compiled selector
	AttrName string           // attribute name pattern (may have trailing "*" glob)
	Action   Action           // Allow, Strip, or Defang
	Line     int              // source line number
}

AttrRule defines filtering for a specific attribute on matching elements.

type CSSAtRule

type CSSAtRule struct {
	Name   string // e.g. "import", "font-face"
	Action Action // Allow, Strip, or Defang
	Line   int    // source line number
}

CSSAtRule filters CSS at-rules by name.

type CSSPropertyRule

type CSSPropertyRule struct {
	Properties []string // property names (trailing "*" glob ok)
	Action     Action   // Allow, Strip, or Defang
	Line       int      // source line number
}

CSSPropertyRule filters CSS properties by name.

type CSSPseudoRule added in v0.4.0

type CSSPseudoRule struct {
	Kind    pseudoKind // any / class / element
	Pattern string     // lowercase name, optional trailing "*" glob
	Action  Action     // Allow or Strip
	Line    int        // source line number
}

CSSPseudoRule filters CSS selectors by pseudo-class / pseudo-element name. When a rule with Action Strip matches a pseudo anywhere within a complex (comma-separated) selector — including pseudos nested inside functional pseudo-classes such as :has(), :is(), :not(), :where() — that entire complex selector is dropped. If every complex selector in a ruleset's prelude is dropped, the whole ruleset is removed.

type CSSSchemeRule

type CSSSchemeRule struct {
	Properties []string // CSS property name patterns (trailing "*" glob ok, "*" = all, "@import" = import URLs)
	Schemes    []string // e.g. ["javascript", "data", ":relative", "*"]
	Action     Action   // Allow, Strip, or Defang
	Line       int      // source line number
}

CSSSchemeRule filters URLs in CSS url() values by scheme.

type CSSValueRule

type CSSValueRule struct {
	Pattern string // e.g. "url(*)", "expression(*)"
	Action  Action // Strip or Defang
	Line    int    // source line number
}

CSSValueRule filters CSS properties whose values match a pattern. Patterns ending in "(*)" match any value containing a call to that CSS function (e.g. "expression(*)" matches values containing "expression(...)"). Other patterns match the full value literally (case-insensitive).
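
The two pattern forms can be illustrated with a deliberately naive matcher. It works on plain substrings, so it would also flag a function whose name merely ends in the pattern's name and it does not decode CSS escapes; a real implementation would operate on CSS tokens. The matchValue helper is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// matchValue reports whether a CSS value matches a value pattern.
// Patterns ending in "(*)" match any value containing a call to that
// function; all other patterns must equal the whole value, ignoring case.
func matchValue(pattern, value string) bool {
	p := strings.ToLower(pattern)
	v := strings.ToLower(strings.TrimSpace(value))
	if fn, ok := strings.CutSuffix(p, "(*)"); ok {
		return strings.Contains(v, fn+"(") // naive substring check
	}
	return p == v
}

func main() {
	fmt.Println(matchValue("expression(*)", "expression(alert(1))"))
	fmt.Println(matchValue("url(*)", "URL(bg.png) no-repeat"))
	fmt.Println(matchValue("url(*)", "red"))
	fmt.Println(matchValue("none", "NONE"))
}
```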

type ContentTypeRule

type ContentTypeRule struct {
	Selector string           // CSS selector text
	Matcher  cascadia.Matcher // compiled selector
	Attr     string           // attribute name pattern (e.g. "src", "*", "href")
	Types    []string         // MIME type patterns (lowercase), e.g. "image/*", "text/html"
	Action   Action           // Allow, Strip, or Defang
	Line     int
}

ContentTypeRule restricts data URI MIME types for an attribute on matching elements. The Action field determines what happens to data URIs whose MIME type matches the rule's pattern list: Allow recursively sanitizes them, Strip removes the URI value, and Defang renames the attribute to make it inert.

type ParseOption

type ParseOption func(*parseConfig)

ParseOption configures the behavior of Parse.

func WithMaxIncludeDepth

func WithMaxIncludeDepth(n int) ParseOption

WithMaxIncludeDepth sets the maximum nesting depth for include directives. The default is DefaultMaxIncludeDepth (64).

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip script,style
	`, nil, htmlpolicy.WithMaxIncludeDepth(4))
	if err != nil {
		log.Fatal(err)
	}

	output, _, err := policy.ApplyHTML([]byte("<p>Hello</p><script>evil</script>"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<p>Hello</p>

func WithMaxOutputFactor

func WithMaxOutputFactor(factor float64) ParseOption

WithMaxOutputFactor sets the maximum allowed output size as a multiplier of the input size. For example, a factor of 10.0 means the output may be at most 10x the input size (with a minimum of 32KB). This guards against amplification attacks such as a long <base href> resolved into many short relative URLs. The default is 10.0. Set to 0 to disable the limit.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip script
	`, nil, htmlpolicy.WithMaxOutputFactor(2.0))
	if err != nil {
		log.Fatal(err)
	}

	// Small input is always allowed (32KB minimum).
	output, _, err := policy.ApplyHTML([]byte("<p>Hello</p>"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<p>Hello</p>
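
The size arithmetic described for WithMaxOutputFactor can be worked through with a small sketch. outputLimit is a hypothetical helper, and reading "32KB" as 32*1024 bytes is an assumption; the library's exact rounding is not documented here.

```go
package main

import "fmt"

// minOutputLimit is the documented 32KB floor. Interpreting it as
// 32*1024 bytes is an assumption made for this sketch.
const minOutputLimit = 32 * 1024

// outputLimit is a hypothetical helper showing how the limit could be
// derived from the input size. A factor of 0 disables the limit, which
// this sketch reports as -1.
func outputLimit(inputSize int, factor float64) int {
	if factor == 0 {
		return -1
	}
	limit := int(float64(inputSize) * factor)
	if limit < minOutputLimit {
		limit = minOutputLimit
	}
	return limit
}

func main() {
	fmt.Println(outputLimit(12, 2.0))      // 32768: the floor applies to small inputs
	fmt.Println(outputLimit(1000000, 2.0)) // 2000000: twice the input size
	fmt.Println(outputLimit(1000000, 0))   // -1: limit disabled
}
```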

func WithPreserveOriginal

func WithPreserveOriginal() ParseOption

WithPreserveOriginal configures Policy.ApplyHTML to return the original input bytes when no policy rules modify the content, instead of re-serializing through the HTML parser.

By default, ApplyHTML always re-serializes through Go's HTML5 parser, which normalizes malformed HTML. This is the safer default because it ensures the browser sees exactly the same structure the sanitizer saw.

Warning: enabling this option means that when no rules match, the original bytes are returned unmodified. If the input contains malformed HTML that Go's parser and the browser parse differently, this creates a parser-differential attack surface. Only enable this option if you trust the input to be well-formed HTML, or if you need byte-for-byte preservation of unmodified content (e.g., to avoid altering whitespace or attribute quoting in content that was not changed by policy rules).

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip script
	`, nil, htmlpolicy.WithPreserveOriginal())
	if err != nil {
		log.Fatal(err)
	}

	// Unmodified content is returned as-is (byte-for-byte).
	input := []byte("<p>Hello</p>")
	output, modified, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}
Output:
modified: false
<p>Hello</p>

func WithVerboseLog

func WithVerboseLog(w io.Writer) ParseOption

WithVerboseLog enables verbose logging of policy rule matches. When set, every sanitization action (strip, defang, demote, comment-out, placeholder) writes a one-line description to w. Elements and attributes that pass through unchanged are not logged.

The writer must be safe for concurrent use if the Policy is used concurrently.
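
One way to satisfy the concurrency requirement is to wrap an ordinary writer in a mutex. The sketch below is an assumption-laden illustration: lockedWriter and logConcurrently are hypothetical names, and the log line written here is made up rather than the library's format.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// lockedWriter is a hypothetical wrapper, not part of htmlpolicy. It
// serializes Write calls so any io.Writer can be shared across goroutines.
type lockedWriter struct {
	mu sync.Mutex
	w  io.Writer
}

func (l *lockedWriter) Write(p []byte) (int, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.w.Write(p)
}

// logConcurrently simulates several goroutines logging through one
// lockedWriter and returns the number of complete lines written.
func logConcurrently(workers int) int {
	var buf bytes.Buffer
	lw := &lockedWriter{w: &buf}

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Made-up log line; the real format is up to the library.
			fmt.Fprintf(lw, "worker %d: strip <script>\n", id)
		}(i)
	}
	wg.Wait()
	return bytes.Count(buf.Bytes(), []byte("\n"))
}

func main() {
	fmt.Println(logConcurrently(4)) // 4
}
```

A wrapper like this could then be passed to Parse as htmlpolicy.WithVerboseLog(lw).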

type Policy

type Policy struct {
	// contains filtered or unexported fields
}

Policy is a compiled set of rules for sanitizing HTML, CSS, and SVG content.

A Policy is created by Parse and should not be modified afterward. The only exception is Policy.SetPrefix, which may be called before the first Policy.ApplyHTML call.

A Policy is safe for concurrent use by multiple goroutines. Policy.ApplyHTML does not mutate the Policy.

Security note: this library is policy-driven and does not hardcode any element or attribute as safe or dangerous. Policy authors must account for modern HTML features including Declarative Shadow DOM (<template shadowrootmode>), iframe srcdoc, fencedframe, custom elements (the is attribute), and CSP nonce attributes. See the project README for recommended baselines and security guidance.

func Parse

func Parse(input string, resolver Resolver, opts ...ParseOption) (*Policy, error)

Parse parses policy text into a Policy. The resolver is used to load included policies (by preset name or file path); it may be nil if the policy contains no include directives.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip script,style
		strip-attr * on*
		strip-attr a[href^=javascript:] href
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p onclick="track()">Hello</p><script>alert(1)</script>`)
	output, modified, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println("output:", string(output))
}
Output:
modified: true
output: <p>Hello</p>
Example (Allowlist)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip *
		strip-attr * *
		allow p,div,b,i,em,strong
		allow a href
		allow-attr * class,id
		strip-attr * on*
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<div><p class="text"><a href="https://example.com" onclick="x">Link</a> and <b>bold</b></p><script>evil</script></div>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}
Output:
<div><p class="text"><a href="https://example.com">Link</a> and <b>bold</b></p></div>
Example (CommentOut)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		comment-out script
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p>safe</p><script>evil()</script>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<p>safe</p><!--htmlpolicy-commented-out <script>evil()</script>-->
Example (ContentTypeFiltering)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip-content-type * src *
		allow-content-type * src image/*
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<img src="data:image/png;base64,iVBOR"/><img src="data:text/html,<script>evil</script>"/>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<img src="data:image/png;base64,iVBOR"/><img/>
Example (CssAtRules)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		css-strip-at import
		css-defang-at media
		css-allow-at keyframes
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte("@import \"evil.css\";\n@media screen { .x { color: red; } }\n@keyframes fade { from { opacity: 0; } }")
	output, _, err := policy.ApplyCSS(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
@htmlpolicy-defanged-media screen {
.x {
color: red;
}
}
@keyframes fade {
from {
opacity: 0;
}
}
Example (DefangContentType)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		defang-content-type * src image/svg+xml
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<img src="data:image/svg+xml,<svg/>"/><img src="data:image/png;base64,iVBOR"/>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<img htmlpolicy-defanged-src="data:image/svg+xml,&lt;svg/&gt;"/><img src="data:image/png;base64,iVBOR"/>
Example (DefangScheme)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		defang-scheme * href *
		allow-scheme * href https,:relative
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<a href="javascript:alert(1)">evil</a><a href="https://example.com">safe</a>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<a htmlpolicy-defanged-href="javascript:alert(1)">evil</a><a href="https://example.com">safe</a>
Example (Demote)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		demote form class
		demote marquee
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<form class="contact"><input type="text"/></form><marquee>hello</marquee>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<div class="contact"><input type="text"/></div><div>hello</div>
Example (Error)
package main

import (
	"fmt"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	_, err := htmlpolicy.Parse(`badverb script`, nil)
	fmt.Println(err)
}
Output:
line 1: unknown verb "badverb"
Example (Placeholder)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		placeholder script,iframe
		placeholder-label blocked
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p>safe</p><script>evil</script><iframe src="x">frame</iframe>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<p>safe</p><strong title="&lt;script&gt;evil&lt;/script&gt;">[blocked: script]</strong><strong title="&lt;iframe src=&#34;x&#34;&gt;frame&lt;/iframe&gt;">[blocked: iframe]</strong>
Example (SchemeFiltering)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip-scheme * href *
		allow-scheme * href https,mailto,:relative
		strip-scheme * src *
		allow-scheme * src https,:relative
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<a href="https://example.com">safe</a><a href="javascript:alert(1)">evil</a><img src="photo.jpg"/>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}
Output:
<a href="https://example.com">safe</a><a>evil</a><img src="photo.jpg"/>

func (*Policy) ApplyCSS

func (p *Policy) ApplyCSS(content []byte, opts ...ApplyOption) ([]byte, bool, error)

ApplyCSS sanitizes a standalone CSS stylesheet using the policy's CSS rules. It applies property, value, and at-rule filtering, and filters url() schemes. Use this to sanitize raw CSS content that is not embedded in HTML.

If the policy has no CSS rules and no URL rewriter is supplied via WithApplyURLRewriter, the content is returned unchanged. When a rewriter is supplied, url() values are rewritten even without CSS sanitization rules.

ApplyCSS is safe for concurrent use on the same Policy.

Returns a non-nil error only when WithMaxOutputFactor is set and the sanitized output exceeds the configured size limit.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		css-strip *
		css-allow color,font-size
		css-strip-value expression(*)
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`.header { color: red; margin: 10px; font-size: 14px; }`)
	output, modified, err := policy.ApplyCSS(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}
Output:
modified: true
.header {
color: red;
font-size: 14px;
}
Example (Pseudo)

ExamplePolicy_ApplyCSS_pseudo shows stripping selectors that use a pseudo-class. The whole complex selector is dropped (never just the pseudo), so an interaction-gated style cannot become an always-on one.

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`css-strip-pseudo hover`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`.menu, .item:hover > .sub { color: red; } a:hover { color: blue; }`)
	output, _, err := policy.ApplyCSS(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}
Output:
.menu {
color: red;
}

func (*Policy) ApplyDocument

func (p *Policy) ApplyDocument(content []byte, opts ...ApplyOption) ([]byte, bool, error)

ApplyDocument applies the policy to a full HTML document. It returns the sanitized document, whether it was modified, and any error.

Unlike Policy.ApplyHTML, which parses input as a fragment (suitable for user content embedded in a page), ApplyDocument parses input as a complete document, preserving the <!DOCTYPE>, <html>, <head>, and <body> structure. If the input is missing these elements, the HTML5 parser adds them.

Policy rules apply to all elements in the document, including those in <head> (e.g., <title>, <meta>, <link>, <style>, <script>).

Use ApplyHTML for user-generated content fragments (comments, emails, forum posts). Use ApplyDocument for sanitizing complete HTML pages.

ApplyDocument is safe for concurrent use on the same Policy.

Any <base> elements in the input are always stripped, and relative URLs are resolved against the first base href before policy rules run. This prevents <base> injection attacks and ensures scheme rules see resolved URLs.

Returns an error if the output size exceeds the configured limit (see WithMaxOutputFactor).

Pass WithApplyURLRewriter to rewrite URLs in the sanitized output for this call only — see WithApplyURLRewriter for details.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse("strip script", nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<!doctype html><html><head><title>Page</title></head><body><p>Hello</p><script>evil</script></body></html>`)
	output, modified, err := policy.ApplyDocument(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}
Output:
modified: true
<!DOCTYPE html><html><head><title>Page</title></head><body><p>Hello</p></body></html>
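
To make the base href resolution step concrete, here is a sketch using Go's net/url. resolveAgainstBase is a hypothetical helper and the URLs are invented; whether the library resolves URLs this way internally is not stated, and browsers follow the WHATWG URL algorithm, which can differ from net/url on edge cases.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveAgainstBase is a hypothetical helper that resolves ref against
// base with net/url. It only illustrates the idea of base resolution.
func resolveAgainstBase(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	base := "https://cdn.example.com/assets/"
	for _, ref := range []string{"img/a.png", "/logo.svg", "../x.css"} {
		abs, err := resolveAgainstBase(base, ref)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(abs)
	}
	// Prints:
	// https://cdn.example.com/assets/img/a.png
	// https://cdn.example.com/logo.svg
	// https://cdn.example.com/x.css
}
```

Because every short relative URL grows to include the base, a long base href can inflate the output, which is the amplification case WithMaxOutputFactor guards against.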

func (*Policy) ApplyHTML

func (p *Policy) ApplyHTML(content []byte, opts ...ApplyOption) ([]byte, bool, error)

ApplyHTML applies the policy to an HTML fragment. It returns the sanitized content, whether it was modified, and any error.

Input must be UTF-8. Use ConvertToUTF8 first if the charset is unknown. Output is always UTF-8.

Content is parsed as a fragment in a body context — it is never wrapped in <html>/<head>/<body> tags. If the input contains those tags, they are discarded and their content is kept. Use Policy.ApplyDocument to sanitize complete HTML documents with preserved document structure.

By default, output is always re-serialized through Go's HTML5 parser, which normalizes malformed HTML. This ensures the browser sees exactly the same structure the sanitizer saw, preventing parser-differential attacks. Use WithPreserveOriginal to return the original bytes when no policy rules modify the content.

ApplyHTML is safe for concurrent use on the same Policy.

Any <base> elements in the input are always stripped, and relative URLs are resolved against the first base href before policy rules run. This prevents <base> injection attacks and ensures scheme rules see resolved URLs.

Returns an error if the output size exceeds the configured limit (see WithMaxOutputFactor).

Pass WithApplyURLRewriter to rewrite URLs in the sanitized output for this call only — see WithApplyURLRewriter for details.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse("strip script,style", nil)
	if err != nil {
		log.Fatal(err)
	}

	output, modified, err := policy.ApplyHTML([]byte("<p>Hello</p><script>alert(1)</script>"))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}
Output:
modified: true
<p>Hello</p>

func (*Policy) ApplyInlineCSS

func (p *Policy) ApplyInlineCSS(content []byte, opts ...ApplyOption) ([]byte, bool, error)

ApplyInlineCSS sanitizes a standalone inline CSS declaration list (the content of a style attribute, without surrounding HTML). It applies property, value, and URL scheme filtering.

If the policy has no CSS rules and no URL rewriter is supplied via WithApplyURLRewriter, the content is returned unchanged. When a rewriter is supplied, url() values are rewritten even without CSS sanitization rules.

ApplyInlineCSS is safe for concurrent use on the same Policy.

Returns a non-nil error only when WithMaxOutputFactor is set and the sanitized output exceeds the configured size limit.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		css-strip *
		css-allow color,font-size
	`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`color: red; margin: 10px; font-size: 14px`)
	output, modified, err := policy.ApplyInlineCSS(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}
Output:
modified: true
color: red; font-size: 14px

func (*Policy) SetPrefix

func (p *Policy) SetPrefix(prefix string) error

SetPrefix sets the base prefix used when generating names for the CommentOut action (e.g. "PREFIX-commented-out ...") and every defang action: the HTML attribute/scheme/content-type defangs (defang-attr, defang-scheme, defang-content-type, e.g. "PREFIX-defanged-onclick"), the CSS defang verbs (css-defang, css-defang-at, css-defang-value, e.g. "PREFIX-defanged-color"), and SVG animated-value defang. The default is "htmlpolicy".

The prefix must be non-empty and contain only ASCII letters, digits, or hyphens. This prevents injection of arbitrary content into HTML attribute names and comment bodies.

SetPrefix is safe to call concurrently with Policy.ApplyHTML.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		defang-attr * onclick
		comment-out script
	`, nil)
	if err != nil {
		log.Fatal(err)
	}
	if err := policy.SetPrefix("myapp"); err != nil {
		log.Fatal(err)
	}

	output, _, err := policy.ApplyHTML([]byte(`<p onclick="track()">Hello</p><script>evil</script>`))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<p myapp-defanged-onclick="track()">Hello</p><!--myapp-commented-out <script>evil</script>-->
Example (Error)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`defang-attr * on*`, nil)
	if err != nil {
		log.Fatal(err)
	}

	err = policy.SetPrefix("")
	fmt.Println(err)
}
Output:
prefix must not be empty
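
The prefix constraint described for SetPrefix (non-empty, ASCII letters, digits, or hyphens only) can be checked ahead of time with a sketch like the one below. validPrefix is a hypothetical helper; the library may apply further checks that are not documented here.

```go
package main

import "fmt"

// validPrefix is a hypothetical pre-check, not part of htmlpolicy. It
// accepts only non-empty strings made of ASCII letters, digits, or hyphens.
func validPrefix(prefix string) bool {
	if prefix == "" {
		return false
	}
	for i := 0; i < len(prefix); i++ {
		c := prefix[i]
		switch {
		case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z', c >= '0' && c <= '9', c == '-':
			// allowed byte
		default:
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(validPrefix("myapp"))         // true
	fmt.Println(validPrefix("my-app2"))       // true
	fmt.Println(validPrefix(""))              // false
	fmt.Println(validPrefix(`x" onload="y`)) // false: quotes, space, and "=" are rejected
}
```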

func (*Policy) String

func (p *Policy) String() string

String returns the policy as flattened policy text. The output is valid policy syntax that can be parsed back to produce an equivalent policy. Includes are fully resolved (inlined).

type Resolver

type Resolver interface {
	Resolve(from, name string) (canonical, text string, err error)
}

Resolver loads policy text for include directives.

name is whatever appears after "include " in the policy file. The Resolver implementation decides how to interpret it (file path, database key, embedded resource, etc.).

from identifies the policy that contains the include directive, allowing the resolver to interpret name relative to the including policy's location. It is the canonical name returned by an earlier Resolve call (the include that loaded the parent), or the empty string for include directives in the top-level policy text passed to Parse. A file-based resolver typically treats from as a file path and joins relative name values against filepath.Dir(from); resolvers without a notion of location can ignore from entirely.

Resolve returns the canonical name of the loaded resource along with its text. The parser forwards canonical as from on nested include calls and uses it as the key for circular-include detection. Implementations should normalize the canonical form (e.g. resolve relative paths to absolute, strip "./" prefixes, normalize separators) so the same underlying resource always yields the same canonical string; otherwise a cycle through e.g. "foo.policy" and "./foo.policy" will only be caught by the depth limit (with a misleading "exceeds maximum depth" error rather than "circular include"). Resolvers without a notion of canonicalization may return name unchanged.
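
A minimal in-memory resolver that follows the canonicalization advice above might look like this. mapResolver and the policy names are hypothetical; the sketch assumes slash-separated names and uses the standard path package, so it needs no filesystem access.

```go
package main

import (
	"fmt"
	"path"
)

// mapResolver is a hypothetical in-memory resolver keyed by slash-separated
// paths. Its method set matches the Resolver interface shown above.
type mapResolver map[string]string

func (m mapResolver) Resolve(from, name string) (canonical, text string, err error) {
	// Interpret relative names against the directory of the including
	// policy. For top-level includes from is "", and path.Dir("") is ".".
	canonical = name
	if !path.IsAbs(name) {
		canonical = path.Join(path.Dir(from), name)
	}
	// Normalize so "foo.policy" and "./foo.policy" share one canonical name.
	canonical = path.Clean(canonical)

	text, ok := m[canonical]
	if !ok {
		return "", "", fmt.Errorf("policy %q not found (included from %q)", canonical, from)
	}
	return canonical, text, nil
}

func main() {
	r := mapResolver{
		"base.policy":          "strip script",
		"presets/email.policy": "include ../base.policy\nstrip-attr * on*",
	}

	c1, _, _ := r.Resolve("", "./presets/email.policy")
	c2, _, _ := r.Resolve(c1, "../base.policy")
	c3, _, _ := r.Resolve("", "base.policy")
	fmt.Println(c1)       // presets/email.policy
	fmt.Println(c2)       // base.policy
	fmt.Println(c2 == c3) // true: both spellings yield the same canonical name
}
```

A value such as r could then be passed as the second argument to Parse. Because both spellings of base.policy normalize to the same canonical string, a cycle through them would be reported as a circular include rather than a depth overflow.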

type SchemeRule

type SchemeRule struct {
	Selector string           // CSS selector text
	Matcher  cascadia.Matcher // compiled selector
	Attr     string           // attribute name pattern (e.g. "href", "*", "on*")
	Schemes  []string         // schemes (lowercase), ":relative" for schemeless URLs, or "*" for any scheme
	Action   Action           // Allow, Strip, or Defang
	Line     int
}

SchemeRule restricts URL schemes for an attribute on matching elements. The Action field determines what happens to URLs with matching schemes: Allow passes them through, Strip removes them, and Defang renames the attribute to make it inert.

type TagRule

type TagRule struct {
	Selector string           // CSS selector text (for String())
	Matcher  cascadia.Matcher // compiled selector
	Action   Action           // what to do with matching elements
	Line     int              // source line number (for diagnostics)
}

TagRule defines an action for matching elements.

type URLContext

type URLContext struct {
	// Element is the lowercase HTML element name (e.g., "img", "a", "style").
	// Empty for standalone CSS sanitization ([Policy.ApplyCSS], [Policy.ApplyInlineCSS]).
	Element string

	// Attr is the HTML attribute name (e.g., "href", "src", "style").
	// Empty for URLs inside <style> elements or standalone CSS.
	Attr string

	// CSSProperty is the CSS property name containing the url() value
	// (e.g., "background-image", "background"), or "@import" for @import URLs.
	// Empty for non-CSS URLs (HTML attributes).
	CSSProperty string

	// Parent is the lowercase name of the URL's element's parent in the
	// HTML tree (e.g., "picture", "audio", "video", "head", "body"). Empty
	// when the element has no element parent — top-level fragment nodes,
	// document-root children whose parent is the document node, and
	// standalone CSS sanitization.
	//
	// Useful for elements whose semantics depend on ancestry, e.g.
	// <source src> is an image candidate inside <picture> but a media file
	// inside <audio>/<video>.
	Parent string

	// GetAttr returns the value of the named attribute on the URL's
	// element, or "" if the attribute is absent. Lookups are
	// case-insensitive. GetAttr is always non-nil; for orphan contexts
	// (standalone ApplyCSS / ApplyInlineCSS) it returns "" for any name.
	//
	// Useful for elements whose semantics depend on a sibling attribute on
	// the same element, e.g. <input type=image>, <link rel=stylesheet>,
	// <object type=...>. GetAttr reflects the live attribute slice on the
	// element at the time of the call, so mutations applied by earlier
	// rewriter calls on the same element are visible.
	//
	// GetAttr is not considered part of the deduplication key — the
	// cached rewriter result for a given {url, element, attr, parent,
	// cssProperty} tuple comes from the first call, so if you discriminate
	// on GetAttr across same-key elements only the first element's attrs
	// influence the cached value.
	GetAttr func(name string) string
}

URLContext describes where a URL was found in the sanitized output. It is passed to URLRewriter callbacks (configured per call via WithApplyURLRewriter) to provide context about each URL.

type URLRef added in v0.3.0

type URLRef struct {
	// Context describes where the URL was found: the element, attribute, CSS
	// property, parent element name, and a live GetAttr accessor. It is the
	// exact same URLContext value the [URLRewriter] would receive for this
	// URL. Context.GetAttr is valid for the duration of the prefetch callback
	// (the underlying nodes are still alive); do not retain it past the call.
	Context URLContext

	// URL is the discovered URL in its final resolved form — identical to the
	// value that would be passed to a [URLRewriter] for this same context.
	URL string
}

URLRef pairs a URL discovered during sanitization with the URLContext describing where it was found. A slice of URLRef is passed to the callback registered by WithApplyURLPrefetcher, allowing the caller to warm a cache (e.g. fetch resources in parallel) before the synchronous URLRewriter runs.
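
The cache-warming idea can be sketched independently of the library. warmCache and the stub fetcher below are hypothetical, and the URLs are invented; in real use the input strings would come from the URL field of each URLRef, and the callback signature expected by WithApplyURLPrefetcher should be taken from its own documentation.

```go
package main

import (
	"fmt"
	"sync"
)

// warmCache fetches every distinct URL concurrently and returns the results
// keyed by URL. fetch is supplied by the caller. This is a hypothetical
// sketch, not htmlpolicy API.
func warmCache(urls []string, fetch func(string) string) map[string]string {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		cache = make(map[string]string)
	)
	started := make(map[string]bool)
	for _, u := range urls {
		if started[u] {
			continue // already being fetched
		}
		started[u] = true
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			body := fetch(u)
			mu.Lock()
			cache[u] = body
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return cache
}

func main() {
	urls := []string{
		"https://example.com/a.png",
		"https://example.com/b.css",
		"https://example.com/a.png",
	}
	// Stub fetcher so the sketch runs without network access.
	cache := warmCache(urls, func(u string) string { return "fetched " + u })
	fmt.Println(len(cache))                         // 2
	fmt.Println(cache["https://example.com/b.css"]) // fetched https://example.com/b.css
}
```

A synchronous rewriter can then answer from the warmed map instead of blocking on I/O for each URL.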

type URLRewriter

type URLRewriter func(ctx URLContext, url string) string

URLRewriter is a callback that receives each URL in the sanitized output and returns a replacement. Return the original url unchanged to keep it. Supply a rewriter to one Apply call via WithApplyURLRewriter. Because the rewriter is scoped to that single call, it does not need to be safe for concurrent use; two goroutines calling Apply on the same Policy pass independent rewriters and never share state.

URLs are presented after policy evaluation and base URL resolution. Fragment-only references (#id), empty URLs, and data: URIs are excluded. URLs inside data URI content are still presented, because the sanitizer recurses into that content: for example, url(img.png) inside a data:text/css URI reaches the callback even though the data URI itself does not.

Calls are deduplicated within a single top-level applyHTMLAt / applyDocumentAt / applyCSS / applyInlineCSS invocation: each unique combination of {url, element, attr, parent, cssProperty} is presented to the callback exactly once for that invocation. The same URL in different contexts (e.g. <img src> vs <video poster>, or the same URL on a <source> inside <picture> vs inside <video>) produces separate calls. Recursive invocations triggered by embedded data URI content each have their own deduplication cache.
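
The deduplication rule can be illustrated with a small cache keyed on the documented tuple. rewriteKey and countRewriterCalls are local stand-ins written for this sketch, not types or functions exported by htmlpolicy, and the proxy URL is invented.

```go
package main

import "fmt"

// rewriteKey mirrors the documented deduplication tuple. It is a local
// stand-in for illustration, not a type exported by htmlpolicy.
type rewriteKey struct {
	URL, Element, Attr, Parent, CSSProperty string
}

// countRewriterCalls replays a sequence of URL sightings through a
// per-invocation cache and reports how often the callback would run.
func countRewriterCalls(seen []rewriteKey) int {
	cache := make(map[rewriteKey]string)
	calls := 0
	rewrite := func(k rewriteKey) string {
		calls++
		return "https://proxy.example/?u=" + k.URL // hypothetical rewrite
	}
	for _, k := range seen {
		if _, ok := cache[k]; ok {
			continue // same tuple: reuse the cached result
		}
		cache[k] = rewrite(k)
	}
	return calls
}

func main() {
	seen := []rewriteKey{
		{URL: "a.png", Element: "img", Attr: "src"},
		{URL: "a.png", Element: "img", Attr: "src"},      // duplicate tuple
		{URL: "a.png", Element: "video", Attr: "poster"}, // new context
		{URL: "a.png", Element: "source", Attr: "src", Parent: "picture"},
		{URL: "a.png", Element: "source", Attr: "src", Parent: "video"},
	}
	fmt.Println(countRewriterCalls(seen)) // 4
}
```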

Directories

Path Synopsis
cmd
htmlpolicy command
Command htmlpolicy applies an HTML sanitization policy to HTML content.
