htmlpolicy

package module
v0.9.0 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 10, 2026 License: AGPL-3.0 Imports: 21 Imported by: 0

README

htmlpolicy

⚠️ Pre-v1 stability warning: until this library reaches v1.0.0, the API and policy language are liable to experience breaking changes between releases.

A policy-driven HTML, CSS, and SVG sanitizer for Go.

Most sanitizers hardcode what's "safe" in library code. htmlpolicy takes a different approach: you write a declarative policy that says exactly what to strip, allow, defang, demote, or unwrap. Anything not matched by a rule passes through unchanged. Policies are plain text files, not Go code, so they can be reviewed, versioned, composed, and swapped without recompiling.

API reference and policy language documentation (godoc)

Quick Start

import "gitlab.com/grepular/htmlpolicy"

// Parse a policy and apply it to an HTML fragment.
policy, err := htmlpolicy.Parse(`
    strip script,style,noscript
    strip iframe,object,embed
    strip-attr * on*
    strip-attr a[href^=javascript:] href
`)
if err != nil {
    log.Fatal(err)
}

output, modified, err := policy.ApplyHTML(fragmentContent)

For full HTML documents, use ApplyDocument instead:

output, modified, err := policy.ApplyDocument(documentContent)

For standalone CSS sanitization:

policy, _ := htmlpolicy.Parse("css-strip *\ncss-allow color,font-size")
out, modified, err := policy.ApplyCSS([]byte(".foo { color: red; font-size: 14px; margin: 10px; }"))
// out = ".foo {\ncolor: red;\nfont-size: 14px;\n}"

Both HTML and CSS expect UTF-8 input. Use ConvertToUTF8 first if needed:

utf8Content, _ := htmlpolicy.ConvertToUTF8(rawContent, contentType)

Features

  • HTML sanitization — strip, allow, defang, demote, comment-out, placeholder, or unwrap elements and attributes using CSS selectors
  • CSS sanitization — filter properties, values, at-rules, url() schemes, url() data-URI content types, and selector pseudo-classes/elements in style attributes, <style> elements, and standalone stylesheets
  • URL scheme filtering — allow/strip/defang URLs by scheme (html and css independently) with per-URL evaluation for multi-URL attributes (srcset)
  • Content-type filtering — control which MIME types are allowed in data URIs with wildcard patterns
  • Recursive sanitization — data:text/html, srcdoc, and meta refresh URLs are recursively sanitized with depth limiting (SVG/XHTML/XML data URIs are stripped by default — opt in per context with allow-content-type)
  • SVG/MathML namespace validation — prevents mutation XSS from foreign-content elements in the wrong namespace
  • SMIL animation sanitization — applies attribute and scheme rules to animated values, preventing runtime bypasses
  • Parser-differential prevention — always re-serializes through Go's HTML5 parser by default
  • URL rewriting — per-call callback hook (via WithApplyURLRewriter) for rewriting URLs in the final output (HTML attributes, CSS url() values, srcset, SMIL); scoped to a single Apply call so it can safely close over per-call state without concurrency concerns. A companion WithApplyURLPrefetcher hook delivers the full URL set up front (same set, same order) so callers can fetch resources in parallel before the rewriter runs
  • Policy composition — include directives for combining policies
  • Policy linting: Policy.Lint() (and htmlpolicy -lint) flags likely mistakes, namely URL attributes reopened with no scheme baseline, rules made dead by a later rule, and unsafe unwrap * catch-alls
  • Concurrency safe — Policy is safe for concurrent use after creation

Security Guidance

This library is policy-driven — it does not hardcode any element or attribute as "safe" or "dangerous". You must write policies that account for the HTML features your application needs to defend against.

Rather than copy a baseline into your codebase (where it bit-rots as browsers add features), include a maintained preset that ships with the library and picks up improvements whenever you upgrade with go get -u. The include builtin:NAME directive works in any policy with no resolver configured, and later rules override the preset:

// Blocklist: strip active content, dangerous schemes, XML/SVG data URIs.
policy, _ := htmlpolicy.Parse("include builtin:blocklist")

// Allowlist: start fail-closed, then re-allow exactly what you need.
policy, _ := htmlpolicy.Parse(`
    include builtin:allowlist-base
    allow p,div,span,b,i,em,strong,a,ul,ol,li,br
    allow-attr * class,id
    allow a href
    include builtin:url-safe
`)

Preset                   Purpose
builtin:blocklist        Strong blocklist baseline: strips active-content
                         elements/attributes, dangerous URL schemes, and
                         XML/SVG data URIs; everything else passes through
builtin:allowlist-base   Fail-closed skeleton: strips all elements,
                         attributes, schemes, content types, and CSS, so
                         anything you don't explicitly re-allow is dropped
builtin:url-safe         Restrict every URL attribute to http, https,
                         mailto, and relative URLs; pair with rules that
                         reopen URL-bearing attributes

Presets are ordinary policy text: BuiltinPolicy(name) returns the source, BuiltinPolicyNames() lists them, and Policy.String() shows them inlined.

Allowlist safety: when you reopen a URL-bearing attribute (e.g. allow a href), also constrain its schemes (include builtin:url-safe, or your own allow-scheme rules). Reopening an attribute without a scheme baseline lets javascript: URLs through.

strip-comments
strip script,noscript,style,base
strip iframe,fencedframe,object,embed,applet
strip template,portal
strip form,input,textarea,select,button
strip meta[http-equiv=refresh]
strip-attr * on*
strip-attr * style
strip-attr * is
strip-attr * nonce
strip-attr template shadowrootmode
strip-scheme * href javascript,vbscript
strip-scheme * src javascript,vbscript,data
strip-content-type * * *+xml,*/xml

Key threats to consider

  • Declarative Shadow DOM: <template shadowrootmode="open"> causes template content to become live DOM. Strip <template> or the attribute.
  • <iframe srcdoc>: Contains raw inline HTML. Strip iframes or the srcdoc attribute.
  • <fencedframe>: Chrome-specific iframe-like element. Treat like iframe.
  • is attribute: Triggers custom element constructors, has mXSS edge cases. Strip in security contexts.
  • nonce attribute: CSP bypass if an attacker controls it. Always strip.
  • SVG/XHTML/XML data URIs: Browsers parse these with an XML parser, but this library uses an HTML5 parser, so they cannot be fully sanitized. They are now stripped by default — no policy needed. If you opt a context back into recursion (allow-content-type img src image/svg+xml), sanitization is best-effort only: namespace-prefixed elements (<a:script>) are stripped, but XML custom entity expansion (<!ENTITY x "<script>..."> then &x;), CDATA sections, and processing instructions are not. Only opt in for trusted sources or scripting-disabled contexts like <img>.
  • Weird data-URI charsets: a data URI declaring a charset the library can't reproduce (UTF-7, UTF-16, …) is stripped, since the browser would decode different bytes than the sanitizer saw (e.g. data:text/html;charset=utf-7,+ADw-script+AD4-). UTF-8, US-ASCII, and no charset are recursed normally.
  • unwrap * as an allowlist catch-all: Unwrap removes a wrapper element but keeps its children in flow, mirroring how browsers treat unknown tags (e.g. <dov>hello</dov> renders as hello). It is useful for letting unrecognized structural tags pass through transparently, but it MUST be paired with explicit strip rules for any element whose presence indicates active content or whose textual content should not be visible. At minimum: strip script,style,iframe,object,embed,form,input,textarea,select,button,link,base and strip meta[http-equiv=refresh]. Without those, raw <script>/<style> text would leak into the output as visible text once the wrapper is removed, and unknown-named elements that carry active content would be exposed by name.

CLI Tool

go install gitlab.com/grepular/htmlpolicy/cmd/htmlpolicy@latest
htmlpolicy [flags] <policy-arg>... < input.html > output.html

Each argument is either a path to a policy file or inline policy text. Multiple arguments are concatenated in order (last match wins):

htmlpolicy policy.txt < input.html > output.html
htmlpolicy 'strip script' < input.html > output.html
htmlpolicy base.policy 'strip style' 'allow-scheme * href https'
htmlpolicy base.policy overrides.policy 'strip img'

Flag              Description
-fragment         Parse input as an HTML fragment instead of a full document
-detect-charset   Detect and convert input charset to UTF-8
-content-type     Content-Type header for charset detection (only used with -detect-charset)
-prefix           Override the prefix for defang/comment-out actions (default: htmlpolicy)
-verbose          Log each sanitization action to stderr
-lint             Check the policy for likely mistakes and exit (does not read stdin)

Verbose Logging

The -verbose flag (or WithVerboseLog in the library) logs each action to stderr:

strip <script> (line 1: strip script)
strip-attr <p> onclick (line 3: strip-attr * on*)
strip-scheme <a> href javascript:alert(1) (line 5: strip-scheme * href javascript)
css-strip position (line 7: css-strip position)
css-strip-at @import (line 8: css-strip-at import)

Requirements

Go 1.26.1 or later.

Testing

make test

100% test coverage is required. Tests fail if coverage drops below 100%.

License

See LICENSE.

Acknowledgements

This project was developed by Mike Cardwell, with the assistance of Claude Code, Anthropic's AI coding tool.

Documentation

Overview

Package htmlpolicy implements a policy-driven HTML, CSS, and SVG sanitizer.

Warning: until this library reaches v1.0.0, the API and policy language are liable to experience breaking changes between releases.

This library was built entirely through vibe coding with Claude Code (https://claude.ai/claude-code).

Unlike sanitizers that hardcode safety rules in library code, htmlpolicy uses declarative policies: plain text files where each line starts with an action verb followed by CSS selectors and optional arguments. Rules are evaluated top-to-bottom with last-match-wins semantics. Anything not matched by a rule passes through unchanged.
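
The last-match-wins evaluation described above can be pictured with a small standalone sketch. It does not use htmlpolicy itself: the rule type and decide function are hypothetical, and matching is reduced to plain tag names instead of CSS selectors.

```go
package main

import "fmt"

// rule is a hypothetical, simplified policy rule: an action verb plus the
// tag name it applies to ("*" matches every tag).
type rule struct {
	action string
	tag    string
}

// decide walks the rules top to bottom and keeps the action of the last
// rule that matches. An empty result means no rule matched, so the element
// would pass through unchanged.
func decide(rules []rule, tag string) string {
	action := ""
	for _, r := range rules {
		if r.tag == "*" || r.tag == tag {
			action = r.action
		}
	}
	return action
}

func main() {
	rules := []rule{
		{"strip", "*"},
		{"allow", "p"},
		{"strip", "script"},
	}
	for _, tag := range []string{"p", "script", "div"} {
		fmt.Printf("%s -> %q\n", tag, decide(rules, tag))
	}
}
```

Because later rules override earlier ones, a broad baseline goes first and the exceptions follow it.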

Policies can be reviewed, versioned, composed via includes, and swapped without recompiling. This makes htmlpolicy suitable for email rendering pipelines, CMS content filtering, user-generated content sanitization, and any HTML-to-HTML transformation where the rules need to be configurable.

Entry Points

There are five entry points for applying a policy:

All entry points expect UTF-8 input and return UTF-8 output. A Policy is safe for concurrent use by multiple goroutines.

Selectors

Selectors use standard CSS syntax compiled via github.com/andybalholm/cascadia:

tag                     match all <tag> elements
*                       match all elements
tag[attr=value]         exact attribute match
tag[attr^=prefix]       prefix match
tag[attr$=suffix]       suffix match
tag[attr*=substring]    contains match
tag[attr~=word]         space-separated word match
tag[attr=value i]       case-insensitive match (CSS4)
div.ads > a             child combinator (quote when spaces present)
svg animate             descendant: <animate> inside <svg>
#tracking               ID selector
:not([href^=https])     negation pseudo-class

Comma-separated selector groups (e.g. "script,style,iframe") compile into a single rule. When a selector contains spaces, quote it so the parser distinguishes it from arguments: strip "svg animate".

Tag matching is case-insensitive. Attribute value matching normalizes control characters and case internally for security, without modifying the output. This prevents bypasses like " javascript:" evading a [href^=javascript:] prefix check, or JaVaScRiPt: evading case-sensitive matching.
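
As a rough illustration of matching on a normalized copy while leaving the original value untouched, here is a minimal sketch. It is not the library's code: the function names are hypothetical, and the exact set of characters dropped (everything at or below U+0020, plus DEL) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeForMatch returns a copy of v with control characters and spaces
// removed and letters lowercased. Assumption: the real normalization may
// differ in which characters it drops.
func normalizeForMatch(v string) string {
	var b strings.Builder
	for _, r := range v {
		if r <= 0x20 || r == 0x7f {
			continue
		}
		b.WriteRune(r)
	}
	return strings.ToLower(b.String())
}

// hasPrefixNormalized compares normalized copies only; the caller's value
// is never modified, so the output bytes stay as they were.
func hasPrefixNormalized(value, prefix string) bool {
	return strings.HasPrefix(normalizeForMatch(value), normalizeForMatch(prefix))
}

func main() {
	for _, v := range []string{" javascript:alert(1)", "JaVaScRiPt:alert(1)", "https://example.com/"} {
		fmt.Printf("%q matches javascript: prefix: %v\n", v, hasPrefixNormalized(v, "javascript:"))
	}
}
```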

HTML Tag Verbs

Tag verbs determine what happens to matched elements:

strip SELECTOR[,...]                remove element and all content
comment-out SELECTOR[,...]          wrap in HTML comment
placeholder SELECTOR[,...]          replace with [removed: tag] label
demote SELECTOR[,...] [ATTR,...]    convert to <div>/<span>, keep listed attrs
allow SELECTOR[,...] [ATTR,...]     keep element, keep only listed attrs
unwrap SELECTOR[,...]               remove the element's tags but keep its children

Inline elements demote to <span>, all others to <div>. The namespace is cleared, so foreign elements (SVG, MathML) become plain HTML.

Unwrap removes the wrapper but leaves the children in flow (mirroring how browsers treat unknown elements). The wrapper's attributes are discarded; void/empty elements with no children are stripped. As a catch-all in an allowlist policy, "unwrap *" MUST be paired with explicit strip rules for any element whose presence indicates active content or whose text content should not be visible — at a minimum: script, style, iframe, object, embed, form, input, textarea, select, button, link, base, meta. Without those, raw <script>/<style> text would leak into the output as visible text once the wrapper is removed.

The allow and demote shorthand attribute list is equivalent to separate allow-attr rules. If the selector has a condition, it propagates:

allow a[href^=https:] href,class
# is equivalent to:
allow a[href^=https:]
allow-attr a[href^=https:] href
allow-attr a[href^=https:] class
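
The expansion shown above could be sketched as a simple text rewrite. This is an illustration only, with a hypothetical expandShorthand helper that assumes an unquoted selector containing no spaces; it says nothing about how the library compiles rules internally.

```go
package main

import (
	"fmt"
	"strings"
)

// expandShorthand rewrites "allow SELECTOR ATTR[,...]" (or the demote
// equivalent) into the bare verb line followed by one allow-attr line per
// attribute. Lines that are not in that three-field shape are returned as-is.
func expandShorthand(line string) []string {
	fields := strings.Fields(line)
	if len(fields) != 3 || (fields[0] != "allow" && fields[0] != "demote") {
		return []string{line}
	}
	verb, selector := fields[0], fields[1]
	out := []string{verb + " " + selector}
	for _, attr := range strings.Split(fields[2], ",") {
		// The selector, including any attribute condition, is carried over.
		out = append(out, "allow-attr "+selector+" "+attr)
	}
	return out
}

func main() {
	for _, l := range expandShorthand("allow a[href^=https:] href,class") {
		fmt.Println(l)
	}
}
```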

HTML Attribute Verbs

Attribute verbs filter attributes on matched elements. Attribute name patterns support a trailing * glob: on* matches onclick, onload, etc.

strip-attr SELECTOR ATTR[,...]      strip named attrs
allow-attr SELECTOR ATTR[,...]      allow only named attrs
defang-attr SELECTOR ATTR[,...]     rename to PREFIX-defanged-ATTR
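
A trailing-glob match on attribute names can be sketched in a few lines. The matchAttr helper below is hypothetical, and comparing names case-insensitively is an assumption based on HTML attribute names being case-insensitive.

```go
package main

import (
	"fmt"
	"strings"
)

// matchAttr reports whether an attribute name matches a pattern that may
// end in a single "*" glob (for example "on*").
func matchAttr(pattern, name string) bool {
	pattern = strings.ToLower(pattern)
	name = strings.ToLower(name)
	if strings.HasSuffix(pattern, "*") {
		return strings.HasPrefix(name, strings.TrimSuffix(pattern, "*"))
	}
	return pattern == name
}

func main() {
	for _, name := range []string{"onclick", "onload", "href", "style"} {
		fmt.Printf("on* matches %s: %v\n", name, matchAttr("on*", name))
	}
}
```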

URL Scheme Verbs

Scheme verbs filter URL attributes by the URL's scheme. Use :relative to match schemeless URLs, * to match any scheme:

allow-scheme SELECTOR ATTR SCHEME[,...]     allow URLs with listed schemes
strip-scheme SELECTOR ATTR SCHEME[,...]     strip URLs with listed schemes
defang-scheme SELECTOR ATTR SCHEME[,...]    rename attr when scheme matches

For allowlist behavior, pair with a strip-scheme baseline:

strip-scheme * href *
allow-scheme * href https,mailto,:relative
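
To see why the baseline-then-allow ordering works, here is a minimal sketch of scheme extraction and last-match-wins evaluation. It is not the library's implementation: schemeRule, schemeOf, and decideScheme are hypothetical names, and treating * as also covering schemeless URLs is an assumption suggested by the example above.

```go
package main

import (
	"fmt"
	"strings"
)

// schemeOf returns the lowercased scheme of rawURL, or ":relative" when the
// text before the first colon is not a syntactically valid scheme.
func schemeOf(rawURL string) string {
	i := strings.IndexByte(rawURL, ':')
	if i <= 0 {
		return ":relative"
	}
	for j := 0; j < i; j++ {
		c := rawURL[j]
		isLetter := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
		isOther := (c >= '0' && c <= '9') || c == '+' || c == '-' || c == '.'
		if !isLetter && (j == 0 || !isOther) {
			return ":relative"
		}
	}
	return strings.ToLower(rawURL[:i])
}

// schemeRule is a hypothetical scheme rule: an action and its scheme list.
type schemeRule struct {
	action  string
	schemes []string
}

// decideScheme keeps the action of the last rule whose list contains the
// URL's scheme or "*". An empty result means no rule matched.
func decideScheme(rules []schemeRule, rawURL string) string {
	scheme := schemeOf(rawURL)
	action := ""
	for _, r := range rules {
		for _, s := range r.schemes {
			if s == "*" || s == scheme {
				action = r.action
				break
			}
		}
	}
	return action
}

func main() {
	rules := []schemeRule{
		{"strip", []string{"*"}},
		{"allow", []string{"https", "mailto", ":relative"}},
	}
	for _, u := range []string{"https://example.com/", "JavaScript:alert(1)", "page.html"} {
		fmt.Printf("%s -> %s\n", u, decideScheme(rules, u))
	}
}
```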

Each scheme must be a valid URL scheme (an ASCII letter followed by letters, digits, "+", "-", or "."), the wildcard *, or :relative. Comma-separated lists reject a stray comma between entries (e.g. "http,,https") to catch typos.
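
The validation rules in the paragraph above translate fairly directly into code. The sketch below is an assumption-laden illustration with hypothetical helper names; it rejects every empty entry, which covers the stray-comma case but may be stricter than the library about leading or trailing commas.

```go
package main

import (
	"fmt"
	"strings"
)

// isScheme reports whether s is an ASCII letter followed by letters,
// digits, "+", "-", or ".".
func isScheme(s string) bool {
	for i := 0; i < len(s); i++ {
		c := s[i]
		letter := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
		other := (c >= '0' && c <= '9') || c == '+' || c == '-' || c == '.'
		if !letter && (i == 0 || !other) {
			return false
		}
	}
	return s != ""
}

// validateSchemeList checks a comma-separated scheme list. Each entry must
// be a valid scheme, the wildcard "*", or ":relative"; empty entries such
// as the one in "http,,https" are rejected.
func validateSchemeList(list string) error {
	for _, s := range strings.Split(list, ",") {
		switch {
		case s == "":
			return fmt.Errorf("empty entry in scheme list %q", list)
		case s == "*" || s == ":relative":
			continue
		case !isScheme(s):
			return fmt.Errorf("invalid scheme %q", s)
		}
	}
	return nil
}

func main() {
	for _, list := range []string{"https,mailto,:relative", "http,,https", "1http"} {
		fmt.Printf("%q -> %v\n", list, validateSchemeList(list))
	}
}
```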

Multi-URL attributes (srcset, imagesrcset, ping, archive) get per-URL evaluation — each URL is independently matched against the rule chain.

Content-Type Verbs

Content-type verbs control what happens to data URIs based on their MIME type. MIME patterns support a single * wildcard: image/*, *+xml, */xml.

allow-content-type SELECTOR ATTR TYPE[,...]     allow data URIs with listed types
strip-content-type SELECTOR ATTR TYPE[,...]     strip data URIs with listed types
defang-content-type SELECTOR ATTR TYPE[,...]    rename attr when type matches
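
A single-wildcard MIME pattern can be matched by splitting the pattern at the * and checking prefix and suffix. The matchMIME helper here is a hypothetical sketch, and case-insensitive comparison is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// matchMIME reports whether mimeType matches pattern, where pattern may
// contain at most one "*" wildcard (for example "image/*" or "*+xml").
func matchMIME(pattern, mimeType string) bool {
	pattern = strings.ToLower(pattern)
	mimeType = strings.ToLower(mimeType)
	i := strings.IndexByte(pattern, '*')
	if i < 0 {
		return pattern == mimeType
	}
	prefix, suffix := pattern[:i], pattern[i+1:]
	// The length check stops prefix and suffix from overlapping.
	return len(mimeType) >= len(prefix)+len(suffix) &&
		strings.HasPrefix(mimeType, prefix) &&
		strings.HasSuffix(mimeType, suffix)
}

func main() {
	fmt.Println(matchMIME("image/*", "image/png"))
	fmt.Println(matchMIME("*+xml", "image/svg+xml"))
	fmt.Println(matchMIME("*/xml", "application/xhtml+xml"))
}
```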

By default, text/html data URIs are recursively sanitized (the HTML5 parser matches the browser), while SVG/XHTML/XML data URIs (image/svg+xml, application/xhtml+xml, */xml, *+xml) are stripped — the HTML5 parser cannot match a browser's XML parser, so they are removed rather than passed through with best-effort sanitization. To recurse an XML/SVG data URI anyway (best-effort, accepting the limitations below), opt in with an allow-content-type rule, e.g.:

allow-content-type img src image/svg+xml

scopes the opt-in to <img src> (a scripting-disabled context in browsers). Other data URIs pass through unchanged unless a content-type rule matches.

Independently, any data URI declaring a charset the library cannot reproduce faithfully (anything other than UTF-8, US-ASCII, or none — e.g. UTF-7 or UTF-16) is stripped, because the content is decoded to raw bytes and handed to the UTF-8/ASCII HTML5 parser; a browser honoring the declared charset would decode different markup (a charset differential).
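
The charset check described above might look roughly like the following sketch, which uses only the standard library. The charsetIsReproducible name is hypothetical, and failing closed on unparseable headers is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetIsReproducible reports whether a data URI declares no charset,
// UTF-8, or US-ASCII. Anything else (UTF-7, UTF-16, ...) returns false,
// as does a header that cannot be parsed.
func charsetIsReproducible(dataURI string) bool {
	if len(dataURI) < 5 || !strings.EqualFold(dataURI[:5], "data:") {
		return true // not a data URI: this check does not apply
	}
	header := dataURI[5:]
	if i := strings.IndexByte(header, ','); i >= 0 {
		header = header[:i]
	}
	// Drop a trailing ";base64" flag, which is not a key=value parameter.
	if n := len(header); n >= 7 && strings.EqualFold(header[n-7:], ";base64") {
		header = header[:n-7]
	}
	// An omitted media type defaults to text/plain.
	if header == "" || header[0] == ';' {
		header = "text/plain" + header
	}
	_, params, err := mime.ParseMediaType(header)
	if err != nil {
		return false
	}
	switch strings.ToLower(params["charset"]) {
	case "", "utf-8", "us-ascii":
		return true
	}
	return false
}

func main() {
	fmt.Println(charsetIsReproducible("data:text/html;charset=utf-7,+ADw-script+AD4-"))
	fmt.Println(charsetIsReproducible("data:text/html;charset=UTF-8,<p>hi</p>"))
}
```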

CSS Verbs

CSS verbs process style attributes, <style> element contents, data:text/css URIs, and SMIL animated style values. They follow the same last-match-wins semantics as HTML rules. Without CSS rules, CSS content passes through unchanged. HTML scheme rules (strip-scheme etc.) do NOT apply to CSS — use css-*-scheme rules instead.

css-strip PROP[,...]                    strip CSS properties (bare name matches vendor-prefixed variants)
css-allow PROP[,...]                    allow CSS properties
css-defang PROP[,...]                   defang CSS properties (prefix name)
css-strip-value PATTERN                 strip properties matching value pattern
css-defang-value PATTERN                defang properties matching value pattern
css-strip-at NAME[,...]                 strip CSS at-rules (e.g. import, * = all)
css-allow-at NAME[,...]                 allow CSS at-rules
css-defang-at NAME[,...]                defang CSS at-rules (prefix name)
css-strip-scheme TARGET SCHEME[,...]         strip CSS url() values by scheme
css-allow-scheme TARGET SCHEME[,...]         allow CSS url() values by scheme
css-defang-scheme TARGET SCHEME[,...]        defang CSS url() values by scheme
css-strip-content-type TARGET TYPE[,...]     strip CSS url() data: URIs by MIME type
css-allow-content-type TARGET TYPE[,...]     allow CSS url() data: URIs by MIME type
css-defang-content-type TARGET TYPE[,...]    defang CSS url() data: URIs by MIME type
css-strip-pseudo NAME[,...]                  strip selectors using a pseudo-class/element
css-allow-pseudo NAME[,...]                  allow pseudo-classes/elements

The css-*-scheme and css-*-content-type TARGET is a comma-separated list of CSS property name patterns or @import. Use * to match all properties and @import. Scheme lists support :relative and * as with HTML scheme rules; content-type TYPE lists use the same MIME patterns as the HTML content-type verbs (a single * wildcard, e.g. image/*, *+xml).

css-*-content-type filters data: URIs found in CSS url() values and @import by their MIME type — the CSS counterpart of the HTML content-type verbs (HTML content-type rules do not apply to CSS). It composes with css-*-scheme as AND (the most restrictive of the two decisions wins). When no css-*-content-type rule matches a url() data: URI, the default applies: text/css is recursively sanitized and SVG/XHTML data: URIs are best-effort recursed (a CSS url() loads in the browser's secure mode, with no scripting). It applies to url() and @import data: URIs, not to bare-string arguments of image-set()/cross-fade()/etc.

The css-*-pseudo verbs filter CSS selectors (in <style> elements, data:text/css, and standalone Policy.ApplyCSS) by pseudo-class or pseudo-element name. Each comma-separated NAME may carry an optional leading "::" to match pseudo-elements only or ":" to match pseudo-classes only; a bare name (or *) matches either kind. Names support a trailing * glob. The four legacy single-colon pseudo-elements (:before, :after, :first-line, :first-letter) are always treated as pseudo-elements.

When a pseudo with action strip matches anywhere within a complex (comma-separated) selector — including nested inside functional pseudo-classes such as :has(), :is(), :not(), :where() — that entire complex selector is dropped; if every complex selector in a ruleset is dropped, the whole ruleset is removed. This only ever narrows which elements a style applies to. As with other CSS verbs, evaluation is last-match-wins per pseudo and unmatched pseudos pass through, so an allowlist is written as "css-strip-pseudo *" then "css-allow-pseudo hover,focus". Pseudo filtering does not apply to inline style attributes (which have no selectors) or to at-rule preludes such as @page :first or @keyframes stops.

Property (css-strip/css-allow/css-defang), at-rule (css-*-at), scheme TARGET (css-*-scheme), and pseudo (css-*-pseudo) NAME matching is vendor-prefix-insensitive: a bare pattern matches both the canonical name and every vendor-prefixed variant, because browsers alias -webkit-foo, -moz-foo, -ms-foo, etc. to foo. So css-strip transform also strips -webkit-transform, css-strip-at keyframes also strips @-webkit-keyframes, and css-strip-pseudo scrollbar also drops ::-webkit-scrollbar. An explicitly hyphen-prefixed pattern (e.g. css-strip -webkit-transform) matches only that single variant — the escape hatch for targeting one vendor. Matching is name-only and never rewrites the surviving bytes; custom property names (--x) are not treated as vendor-prefixed.
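
The vendor-prefix rule can be illustrated with a short sketch. The helpers below are hypothetical, ignore the trailing * glob, and compare names exactly as given; they only model the three cases named above (bare pattern, explicit vendor pattern, custom property).

```go
package main

import (
	"fmt"
	"strings"
)

// stripVendorPrefix turns "-webkit-transform" into "transform". Names that
// do not start with a single hyphen, including custom properties such as
// "--x", are returned unchanged.
func stripVendorPrefix(name string) string {
	if !strings.HasPrefix(name, "-") || strings.HasPrefix(name, "--") {
		return name
	}
	if i := strings.IndexByte(name[1:], '-'); i > 0 && i+2 < len(name) {
		return name[i+2:]
	}
	return name
}

// matchName applies the rule: a bare pattern matches the canonical name and
// every vendor-prefixed variant, while a hyphen-prefixed pattern matches
// only that exact name.
func matchName(pattern, name string) bool {
	if strings.HasPrefix(pattern, "-") {
		return pattern == name
	}
	return pattern == name || pattern == stripVendorPrefix(name)
}

func main() {
	fmt.Println(matchName("transform", "-webkit-transform"))
	fmt.Println(matchName("-webkit-transform", "-moz-transform"))
	fmt.Println(matchName("x", "--x"))
}
```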

CSS allowlist example:

css-strip *
css-allow color,background-color,font-size,font-family,font-weight
css-allow text-align,text-decoration,margin,padding,border
css-strip-value expression(*)
css-strip-value url(*)
css-strip-at *

CSS blocklist example:

css-strip -moz-binding,behavior
css-strip-value expression(*)
css-strip-at import
css-strip-scheme * javascript,vbscript,data,blob

Standalone CSS sanitization is available via Policy.ApplyCSS (stylesheets) and Policy.ApplyInlineCSS (declaration lists).

Other Verbs

prefix NAME                  set the prefix for defang/comment-out names (default "htmlpolicy")
placeholder-label LABEL      customize the label for placeholder (default "removed")
strip-comments               remove HTML comments
include NAME-OR-PATH         inline another policy (loaded via [Resolver])
include builtin:NAME         inline a built-in preset (no [Resolver] needed)

Lines starting with # are comments. Only full-line comments are supported.

Linting

Policy.Lint statically analyzes a compiled policy for likely mistakes (without applying it) and returns advisory Warning values: URL attributes reopened with no scheme baseline, rules made dead by a later same-family rule (last match wins by order, so an allow placed before a universal strip never fires), and "unwrap *" used as a catch-all without strip rules for active-content elements. The CLI exposes this as "htmlpolicy -lint".

Built-in Presets

The library ships a small set of maintained policy presets, includable as "include builtin:NAME" from any policy with no Resolver configured. They are ordinary policy text — fully inlined by Policy.String, and overridable by any later rule under last-match-wins — so they are a safe, reviewable starting point rather than hidden engine behavior. Upgrading the library picks up preset improvements (e.g. coverage for newly recognized HTML features). Available presets:

builtin:blocklist        strong blocklist baseline (strips active-content
                         elements/attributes, dangerous URL schemes, and
                         XML/SVG data URIs; everything else passes through)
builtin:allowlist-base   fail-closed skeleton (strips all elements,
                         attributes, schemes, content types, and CSS;
                         follow it with your own allow rules)
builtin:url-safe         restrict every URL attribute to http, https,
                         mailto, and relative URLs (pair with rules that
                         reopen URL-bearing attributes)

Use BuiltinPolicy to fetch a preset's text in Go and BuiltinPolicyNames to list them. Because allowlist-base reopens nothing on its own, building an allowlist on top of it fails safe: forgetting to re-allow a rule family drops content rather than passing it through.

URL Rewriting

WithApplyURLRewriter is an ApplyOption that sets a callback that receives every URL in the sanitized output and can replace it. The rewriter is scoped to a single Apply call, so it may safely close over per-call state (caches, counters, accumulators) without concurrency concerns. The rewriter runs after policy evaluation and base URL resolution. It covers HTML URL attributes, CSS url() values, srcset entries, SVG url() attributes, meta refresh URLs, and SMIL animation values. URLs inside recursively-sanitized data:text/html content are also rewritten. Stripped and defanged URLs are excluded. Fragment-only references, empty URLs, and data: URIs themselves are excluded (but URLs inside data URI content are recursed into). The option also works with Policy.ApplyCSS and Policy.ApplyInlineCSS.

WithApplyURLPrefetcher is a companion ApplyOption that receives the full set of URLs the rewriter would be called with — same set, same resolved form, same dedup, in document order — once per call, before the rewrite pass. It lets a caller warm a cache (e.g. fetch image and font resources in parallel) that the subsequent synchronous rewriter reads from. Each URLRef carries a live URLContext (including GetAttr) so the caller can inspect sibling attributes when deciding what to fetch.
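
The prefetch-then-rewrite pattern can be sketched without the library. The callback shapes below are hypothetical (this section does not show the real URLRef or rewriter signatures), and fetch stands in for real network I/O.

```go
package main

import (
	"fmt"
	"sync"
)

// prefetch warms a cache for every URL concurrently and returns once all
// fetches have finished.
func prefetch(urls []string, fetch func(string) string) map[string]string {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		cache = make(map[string]string, len(urls))
	)
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			v := fetch(u)
			mu.Lock()
			cache[u] = v
			mu.Unlock()
		}(u)
	}
	wg.Wait()
	return cache
}

// rewriterFor returns a synchronous rewrite function that only reads the
// warmed cache; URLs missing from the cache are returned unchanged.
func rewriterFor(cache map[string]string) func(string) string {
	return func(u string) string {
		if v, ok := cache[u]; ok {
			return v
		}
		return u
	}
}

func main() {
	urls := []string{"https://example.com/a.png", "https://example.com/b.woff2"}
	cache := prefetch(urls, func(u string) string { return "/proxy?u=" + u })
	rewrite := rewriterFor(cache)
	for _, u := range urls {
		fmt.Println(rewrite(u))
	}
}
```

Keeping the cache local to one call mirrors the per-call scoping described above: nothing is shared between concurrent sanitization calls.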

Output Normalization

By default, Policy.ApplyHTML and Policy.ApplyDocument always re-serialize through Go's HTML5 parser, even when no rules modified the content. This prevents parser-differential attacks. Use WithPreserveOriginal to return the original bytes when no rules match.

Embedded Content Sanitization

HTML embedded inside attribute values is recursively sanitized: data:text/html URIs, srcdoc attributes, and meta refresh URLs. Recursion depth is limited to 16 levels; at the limit, content is stripped entirely.
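
A depth-limited recursion of this kind can be sketched as follows. The doc type is a hypothetical stand-in for HTML that embeds another document, and counting the top-level document as the first of the 16 levels is an assumption.

```go
package main

import "fmt"

const maxDepth = 16

// doc is a hypothetical document that may embed another document, for
// example through srcdoc or a data:text/html URI.
type doc struct {
	text     string
	embedded *doc
}

// sanitize returns a copy of d, recursing into embedded content. Once the
// depth limit is reached, the embedded content is dropped entirely instead
// of being passed through unsanitized.
func sanitize(d *doc, depth int) *doc {
	if d == nil || depth >= maxDepth {
		return nil
	}
	return &doc{text: d.text, embedded: sanitize(d.embedded, depth+1)}
}

// nest builds a chain of the given number of nested documents.
func nest(levels int) *doc {
	var d *doc
	for i := 0; i < levels; i++ {
		d = &doc{text: "level", embedded: d}
	}
	return d
}

// depthOf counts how many documents are in the chain.
func depthOf(d *doc) int {
	n := 0
	for ; d != nil; d = d.embedded {
		n++
	}
	return n
}

func main() {
	fmt.Println(depthOf(sanitize(nest(20), 0)))
	fmt.Println(depthOf(sanitize(nest(3), 0)))
}
```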

SVG/XHTML/XML data: URIs are stripped by default rather than recursed, because the HTML5 parser cannot match a browser's XML parser. An allow-content-type rule (e.g. "allow-content-type img src image/svg+xml") opts a given context back into best-effort recursion, accepting the limitation below. When recursing, namespace-prefixed elements (e.g. <a:script xmlns:a="...">) are stripped, but XML custom entity declarations (<!ENTITY x "<script>...">), CDATA sections, and processing instructions cannot be safely sanitized this way — so opt in only for content you trust or contexts where browsers disable scripting (e.g. <img>). See Limitations.

Namespace Validation (mXSS Prevention)

After applying policy rules, namespace consistency is validated. Elements whose namespace is invalid for their DOM position are stripped. This prevents mutation XSS attacks where foreign-content elements end up in the wrong namespace after policy actions change the tree structure.

SVG SMIL Animation Sanitization

SMIL animation elements (<animate>, <set>, etc.) are sanitized by applying the same attribute and scheme policy rules to animated values. This prevents runtime bypasses via attributeName="href" values="javascript:alert(1)". CSS in animated style values is sanitized through the CSS engine.

Limitations

Apart from the pseudo-class/element filtering provided by the css-*-pseudo verbs, CSS selectors in <style> elements are not filtered. Attribute selectors combined with url() values can exfiltrate data. The url() scheme filtering mitigates this, but full protection requires stripping <style> elements.

CSS var()/env()/attr() substitution happens at browser computed-value time, not parse time. Known attack vectors are handled (url() in custom properties, @import var(), var() fallback values, var()/env()/attr() in URL-accepting functions). When css-*-scheme rules are active, var()/env()/attr() inside URL-accepting functions (image-set, -webkit-image-set, cross-fade, image, src) is stripped as a precaution. Bare string arguments inside these functions are evaluated as URLs and subjected to css-*-scheme rules. CSS escape sequences in the url() (or attr()) function name itself (e.g. \75rl(...), \61ttr(...)) bypass the lexer's URLToken recognition, so when css-*-scheme rules are active any FunctionToken whose escape-decoded name is "url" causes the declaration to be stripped. attr() with a url-typed substitution — "attr(name url)" or "attr(name type(<url>))" — also causes the declaration to be stripped, since the value of the named HTML attribute would be loaded as a URL at computed time.

SVG/XHTML/XML data: URIs cannot be fully sanitized, so they are stripped by default. Browsers parse them with an XML parser; this library uses an HTML5 parser. If you opt a context back into recursion with an allow-content-type rule, sanitization is best-effort only: namespace-prefixed elements are stripped as defense-in-depth, but XML custom entity expansion (<!ENTITY x "<script>...">, then &x;), CDATA sections, and processing instructions are XML-only constructs the HTML5 parser treats as inert text or comments — so an XML-parsing browser may execute content the sanitizer considered safe. Only opt in for trusted content or scripting-disabled contexts (e.g. <img>). The CSS url() path is an exception: SVG/XML data URIs there are left as best-effort recursion because a CSS url() loads in the browser's secure mode (no scripting).

Hardcoded limits: data URI recursion depth 16, HTML nesting depth 512, CSS nesting depth 128, output size 10x input (configurable via WithMaxOutputFactor, minimum 32KB).

Constants

const DefaultMaxIncludeDepth = 64

DefaultMaxIncludeDepth is the default maximum nesting depth for includes.

Functions

func BuiltinPolicy added in v0.9.0

func BuiltinPolicy(name string) (text string, ok bool)

BuiltinPolicy returns the raw policy text of the named built-in preset, or ok=false if no such preset exists. The name is the bare preset name without the "builtin:" prefix (e.g. "blocklist", "allowlist-base", "url-safe").

Built-in presets are ordinary policy text maintained with the library. They can be applied directly via an include directive — "include builtin:NAME", which works in any policy without configuring a Resolver — or fetched with this function to inspect or compose them in Go. Use BuiltinPolicyNames to list the available presets.

Because presets are plain policy text, later rules in a policy override them under last-match-wins semantics, and the fully resolved policy (with presets inlined) is visible via Policy.String.

func BuiltinPolicyNames added in v0.9.0

func BuiltinPolicyNames() []string

BuiltinPolicyNames returns the names of all built-in presets in sorted order, each usable as "include builtin:NAME" or passed to BuiltinPolicy.

func ConvertToUTF8

func ConvertToUTF8(content []byte, contentType string) ([]byte, bool)

ConvertToUTF8 converts content from the charset specified in contentType to UTF-8. The contentType should be a MIME type with optional charset parameter (e.g. "text/html; charset=iso-8859-1"). If no charset is specified, encoding is determined by inspecting the content.

Returns the UTF-8 content and a boolean indicating whether conversion was performed. When the content is already UTF-8 (or the detected encoding matches UTF-8), the original slice is returned with false.

Limitation: UTF-7 is not supported. The underlying charset detector has no UTF-7 decoder, so content declaring charset=utf-7 is returned unchanged (with false). This matches modern browsers, which no longer sniff or decode UTF-7; the classic "+ADw-script+AD4-" obfuscation is therefore not auto-decoded here. Callers that must handle attacker-controlled UTF-7 should reject or pre-decode such input before sanitizing.

Example
package main

import (
	"fmt"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	content := []byte("<p>caf\xe9</p>")
	utf8, converted := htmlpolicy.ConvertToUTF8(content, "text/html; charset=iso-8859-1")

	fmt.Println("converted:", converted)
	fmt.Println(string(utf8))
}
Output:
converted: true
<p>café</p>

Types

type Action

type Action int

Action determines what happens to a matched element or attribute.

Action is part of the stable API: Policy.URLSchemeAction returns it. The rule types that carry an Action field (TagRule, AttrRule, etc.) are exported for documentation only — Policy fields are unexported and no public function accepts or returns those rule types. Policy.URLSchemeAction returns only Allow, Strip, or Defang; the remaining constants arise solely from element/attribute verbs applied during sanitization.

const (
	// Strip removes the element and all of its content from the output.
	Strip Action = iota
	// CommentOut wraps the element and its content in an HTML comment
	// (<!-- ... -->), making it invisible to browsers while preserving
	// the content in the source for inspection.
	CommentOut
	// Placeholder replaces the element with a text label such as
	// "[removed: script]". The label word is configurable via the
	// placeholder-label policy directive.
	Placeholder
	// Demote converts the element to a safe generic container: inline
	// elements become <span>, all others become <div>. Allowed
	// attributes are preserved; the namespace is cleared.
	Demote
	// Allow keeps the element in the output unchanged.
	Allow
	// Defang renames the attribute by inserting a prefix (e.g.
	// "htmlpolicy-defanged-onclick"), making it inert while preserving
	// the value for inspection.
	Defang
	// Unwrap removes the element's start and end tags but keeps its
	// children in place at the position the element occupied. The
	// element's attributes are discarded along with the wrapper. This
	// mirrors how browsers treat unknown elements (HTMLUnknownElement
	// is a transparent wrapper: the tag has no effect and children
	// render in flow). Void/empty elements with no children behave as
	// [Strip] — there is nothing to keep.
	//
	// Unwrap differs from [Demote]: Demote renames the wrapper to a
	// safe generic container (<div>/<span>), preserving the element
	// boundary, while Unwrap removes the wrapper entirely.
	//
	// Unwrap is appropriate as a catch-all for unknown structural
	// elements in an allowlist policy. It MUST be paired with explicit
	// strip rules for any element whose presence indicates active
	// content or whose text content should not be visible. At a
	// minimum, pair "unwrap *" with strip rules for script, style,
	// iframe, object, embed, form, input, textarea, select, button,
	// link, base, and meta. Without those, raw <script>/<style> text
	// would leak into the output as visible text after the wrapper
	// is removed.
	Unwrap
)
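As a sketch of the pairing described in the Unwrap comment, the following builds policy text with the catch-all first and the strip rule after it, so that the strip rule wins under the last-match-wins ordering the Parse examples describe. The helper name unwrapBaseline is hypothetical; the resulting string is meant to be passed to Parse, typically followed by the caller's own allow rules.

```go
package main

import (
	"fmt"
	"strings"
)

// mustStrip lists the elements the Unwrap documentation says to strip
// whenever "unwrap *" is used as a catch-all.
var mustStrip = []string{
	"script", "style", "iframe", "object", "embed", "form", "input",
	"textarea", "select", "button", "link", "base", "meta",
}

// unwrapBaseline returns policy text with the catch-all first and the strip
// rule last. Hypothetical helper; it is not part of htmlpolicy.
func unwrapBaseline() string {
	return "unwrap *\nstrip " + strings.Join(mustStrip, ",") + "\n"
}

func main() {
	fmt.Print(unwrapBaseline())
}
```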

type ApplyOption

type ApplyOption func(*applyConfig)

ApplyOption configures a single call to Policy.ApplyHTML, Policy.ApplyDocument, Policy.ApplyCSS, or Policy.ApplyInlineCSS. Unlike ParseOption (which is baked into the compiled Policy and shared across all calls), ApplyOption values are scoped to one call. This allows per-call state — for example, a URL rewriter that closes over per-message caches or budgets, or a verbose log writer for a single request — without recompiling the policy.

func WithApplyURLPrefetcher added in v0.3.0

func WithApplyURLPrefetcher(fn func([]URLRef)) ApplyOption

WithApplyURLPrefetcher registers a callback invoked once per Apply call, after parsing, policy evaluation, and base URL resolution, but before the URL rewrite pass. It receives every URL that a URLRewriter set via WithApplyURLRewriter would be invoked with in this same call — the same set, the same resolved form, the same deduplication semantics, and the same order (document order for top-level URLs; URLs inside recursively sanitized data: content appear at the point their containing attribute is processed). The callback returns nothing; it exists so the caller can warm a cache — for example, fetching image and font resources in parallel — that the subsequent synchronous URLRewriter then reads from.

Each URLRef.Context.GetAttr is live during the callback and returns the element's attributes, so the caller can inspect sibling attributes (width, height, style, type, rel, …) to decide whether to fetch.

The prefetcher and rewriter are independent: either, both, or neither may be set. When both are set the prefetcher runs first, to completion, and then the normal rewrite walk runs. When no rewriter is set the prefetcher is still invoked (with the set the rewriter would have seen). The callback is always invoked exactly once per Apply call, even when no URLs are found (with an empty slice).

Implementation note: enabling a prefetcher runs the deterministic sanitization pipeline a second time — a recording-only pass that collects the URL set without mutating output — so the prefetch set is guaranteed to match the rewrite pass exactly, including URLs at every recursion depth. This roughly doubles the parse/sanitize CPU for the call (the network fetches it enables run once, in parallel). Callers that do not set a prefetcher pay nothing. The recording pass is silent: it runs with no verbose log writer, so WithVerboseLog never produces duplicate lines.

Like WithApplyURLRewriter, the callback is scoped to a single Apply call and may safely close over per-call state without concurrency concerns. It is invoked synchronously from within Apply; htmlpolicy itself introduces no goroutines (the caller may fan out internally).
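The fan-out mentioned above is left to the caller. A minimal sketch of a bounded parallel cache warmer, using only the standard library, might look like the following. warmCache and its fetch parameter are hypothetical names; fetch stands in for whatever network client the caller uses.

```go
package main

import (
	"fmt"
	"sync"
)

// warmCache calls fetch once per distinct URL with at most limit calls in
// flight, and returns the results keyed by URL. Hypothetical helper; it is
// not part of htmlpolicy.
func warmCache(urls []string, limit int, fetch func(string) string) map[string]string {
	if limit < 1 {
		limit = 1
	}
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		sem   = make(chan struct{}, limit)
		cache = make(map[string]string, len(urls))
		seen  = make(map[string]bool, len(urls))
	)
	for _, u := range urls {
		if seen[u] {
			continue
		}
		seen[u] = true
		wg.Add(1)
		sem <- struct{}{} // blocks while limit fetches are already running
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }()
			v := fetch(u)
			mu.Lock()
			cache[u] = v
			mu.Unlock()
		}(u)
	}
	wg.Wait() // return only after every fetch has finished
	return cache
}

func main() {
	urls := []string{"https://example.com/a.jpg", "https://example.com/b.jpg"}
	cache := warmCache(urls, 4, func(u string) string { return "fetched:" + u })
	fmt.Println(len(cache), "entries warmed")
}
```

Because warmCache returns only after all fetches complete, a prefetcher callback built on it still finishes before the synchronous rewrite pass begins.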

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`allow *`)
	if err != nil {
		log.Fatal(err)
	}

	// The prefetcher receives every URL up front, so a caller can fetch them
	// in parallel (with its own concurrency limits) and warm a cache that the
	// synchronous rewriter then reads from.
	cache := map[string]string{}
	prefetch := func(refs []htmlpolicy.URLRef) {
		for i, r := range refs {
			// A real implementation would fetch r.URL here, in parallel.
			cache[r.URL] = fmt.Sprintf("cid:image%03d", i+1)
		}
	}
	rewriter := func(ctx htmlpolicy.URLContext, u string) string {
		if cid, ok := cache[u]; ok {
			return cid
		}
		return u
	}

	input := []byte(`<img src="https://example.com/a.jpg"/><img src="https://example.com/b.jpg"/>`)
	output, _, err := policy.ApplyHTML(input,
		htmlpolicy.WithApplyURLPrefetcher(prefetch),
		htmlpolicy.WithApplyURLRewriter(rewriter))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}
Output:
<img src="cid:image001"/><img src="cid:image002"/>

func WithApplyURLRewriter

func WithApplyURLRewriter(fn URLRewriter) ApplyOption

WithApplyURLRewriter sets a function that rewrites URLs in the sanitized output for this single Apply call. The rewriter runs after policy evaluation and base URL resolution with each URL's final resolved form. It receives a URLContext describing where the URL was found and returns a replacement URL (or the original to keep it unchanged).

The rewriter covers HTML URL attributes (href, src, action, etc.), CSS url() values in style attributes and <style> elements, @import URLs, srcset entries, SVG url() attributes, meta refresh URLs, and SMIL animation values. URLs inside recursively-sanitized data:text/html content are also rewritten. Stripped and defanged URLs are excluded.

Fragment-only references (#id), empty URLs, and data: URIs are excluded from the callback. However, URLs inside data URI content are recursed into — e.g. url(img.png) inside a data:text/css URI is presented to the rewriter even though the data URI itself is not.

The rewriter is per-URL: its output is written verbatim into the slot the URL came from and is not re-sanitized. To prevent one replacement from expanding into several entries, a replacement containing the structural separator of its slot is rejected and the original URL is kept. The separators are ';' for SMIL animation values lists, ASCII whitespace for srcset candidates and space-separated URL lists (ping/archive), and parentheses, quotes, or whitespace for SVG url() functional attributes. (HTML attribute values are always quote-escaped on output, so a replacement cannot inject markup regardless.)
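The separator check can be pictured roughly as follows. This is a sketch under stated assumptions, not the library's implementation: slotKind, its constants, and acceptReplacement are illustrative names that do not correspond to htmlpolicy identifiers.

```go
package main

import (
	"fmt"
	"strings"
)

// slotKind names the kind of location a URL was found in (illustrative only).
type slotKind int

const (
	slotSingle     slotKind = iota // one URL per attribute, e.g. href or src
	slotSMILValues                 // ';'-separated SMIL animation values list
	slotSpaceList                  // srcset candidate or ping/archive URL list
	slotSVGURLFunc                 // url(...) inside an SVG functional attribute
)

const asciiSpace = " \t\n\f\r"

// acceptReplacement returns replacement unless it contains a separator that
// would split its slot into several entries, in which case original is kept.
func acceptReplacement(kind slotKind, original, replacement string) string {
	var forbidden string
	switch kind {
	case slotSMILValues:
		forbidden = ";"
	case slotSpaceList:
		forbidden = asciiSpace
	case slotSVGURLFunc:
		forbidden = "()\"'" + asciiSpace
	}
	if forbidden != "" && strings.ContainsAny(replacement, forbidden) {
		return original
	}
	return replacement
}

func main() {
	fmt.Println(acceptReplacement(slotSpaceList, "a.png", "cid:one cid:two"))
	fmt.Println(acceptReplacement(slotSpaceList, "a.png", "cid:one"))
}
```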

Because the rewriter is scoped to a single Apply call, it may safely close over per-call state (caches, counters, accumulators) without concurrency concerns. Two goroutines calling Apply on the same Policy with different rewriters do not share state.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip script,style
		strip-attr * on*
	`)
	if err != nil {
		log.Fatal(err)
	}

	rewriter := func(ctx htmlpolicy.URLContext, u string) string {
		// Replace external URLs with CID references (for email embedding).
		if ctx.Element == "img" && ctx.Attr == "src" {
			return "cid:image001"
		}
		return u
	}

	input := []byte(`<p>Hello</p><img src="https://example.com/photo.jpg" alt="photo"/>`)
	output, _, err := policy.ApplyHTML(input, htmlpolicy.WithApplyURLRewriter(rewriter))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}
Output:
<p>Hello</p><img src="cid:image001" alt="photo"/>

func WithMaxOutputFactor

func WithMaxOutputFactor(factor float64) ApplyOption

WithMaxOutputFactor sets the maximum allowed output size for this Apply call as a multiplier of the input size. For example, a factor of 10.0 means the output may be at most 10x the input size. The limit never drops below 32KB, so small inputs are not rejected for modest growth. This guards against amplification attacks such as a long <base href> resolved into many short relative URLs. The default is 10.0. Set to 0 to disable the limit.
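The arithmetic can be sketched as below. This is an illustration, not the library's code: outputLimit is a hypothetical function, 32KB is assumed to mean 32*1024 bytes, and -1 is an arbitrary sentinel chosen here for "no limit".

```go
package main

import "fmt"

// minOutputLimit is the assumed 32KB floor, taken here as 32*1024 bytes.
const minOutputLimit = 32 * 1024

// outputLimit returns the largest permitted output size in bytes for the
// given input length and factor, or -1 when the limit is disabled.
// Hypothetical helper; it is not part of htmlpolicy.
func outputLimit(inputLen int, factor float64) int {
	if factor == 0 {
		return -1 // a factor of 0 disables the limit
	}
	limit := int(factor * float64(inputLen))
	if limit < minOutputLimit {
		limit = minOutputLimit // small inputs fall back to the floor
	}
	return limit
}

func main() {
	fmt.Println(outputLimit(12, 2.0))       // 12-byte input: the floor applies
	fmt.Println(outputLimit(1_000_000, 10)) // large input: factor applies
	fmt.Println(outputLimit(12, 0))         // limit disabled
}
```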

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip script
	`)
	if err != nil {
		log.Fatal(err)
	}

	// Small input is always allowed (32KB minimum).
	output, _, err := policy.ApplyHTML([]byte("<p>Hello</p>"), htmlpolicy.WithMaxOutputFactor(2.0))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<p>Hello</p>

func WithPreserveOriginal

func WithPreserveOriginal() ApplyOption

WithPreserveOriginal configures this Apply call to return the original input bytes when no policy rules modify the content, instead of re-serializing through the HTML parser.

By default, ApplyHTML/ApplyDocument always re-serialize through Go's HTML5 parser, which normalizes malformed HTML. This is the safer default because it ensures the browser sees exactly the same structure the sanitizer saw.

Warning: enabling this option means that when no rules match, the original bytes are returned unmodified. If the input contains malformed HTML that Go's parser and the browser parse differently, this creates a parser-differential attack surface. Only enable this option if you trust the input to be well-formed HTML, or if you need byte-for-byte preservation of unmodified content (e.g., to avoid altering whitespace or attribute quoting in content that was not changed by policy rules).

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip script
	`)
	if err != nil {
		log.Fatal(err)
	}

	// Unmodified content is returned as-is (byte-for-byte).
	input := []byte("<p>Hello</p>")
	output, modified, err := policy.ApplyHTML(input, htmlpolicy.WithPreserveOriginal())
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}
Output:
modified: false
<p>Hello</p>

func WithVerboseLog

func WithVerboseLog(w io.Writer) ApplyOption

WithVerboseLog enables verbose logging of policy rule matches for this single Apply call. When set, every sanitization action (strip, defang, demote, comment-out, placeholder) writes a one-line description to w. Elements and attributes that pass through unchanged are not logged.

Because the option is scoped to one call, the writer need not be safe for concurrent use unless the same writer is passed to concurrent Apply calls.
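When one writer does have to be shared across concurrent Apply calls, a small mutex wrapper is one option. The sketch below uses only the standard library; lockedWriter and demo are hypothetical names, and the goroutines stand in for concurrent Apply calls that were each given the wrapper as their log writer.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// lockedWriter serializes Write calls to the wrapped writer.
type lockedWriter struct {
	mu sync.Mutex
	w  io.Writer
}

func (l *lockedWriter) Write(p []byte) (int, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.w.Write(p)
}

// demo writes one line from each of calls goroutines through a shared
// lockedWriter and returns everything that was written.
func demo(calls int) string {
	var buf bytes.Buffer
	lw := &lockedWriter{w: &buf}
	var wg sync.WaitGroup
	for i := 0; i < calls; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Stands in for an Apply call that logs to lw.
			fmt.Fprintf(lw, "call %d: example log line\n", i)
		}(i)
	}
	wg.Wait()
	return buf.String()
}

func main() {
	fmt.Print(demo(3))
}
```

The wrapper only guarantees that individual Write calls do not interleave; lines from different calls can still appear in any order. Giving each call its own bytes.Buffer avoids the question entirely.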

type AttrRule

type AttrRule struct {
	Selector string           // CSS selector text
	Matcher  cascadia.Matcher // compiled selector
	AttrName string           // attribute name pattern (may have trailing "*" glob)
	Action   Action           // Allow, Strip, or Defang
	Line     int              // source line number
}

AttrRule defines filtering for a specific attribute on matching elements.

type CSSAtRule

type CSSAtRule struct {
	Name   string // e.g. "import", "font-face"
	Action Action // Allow, Strip, or Defang
	Line   int    // source line number
}

CSSAtRule filters CSS at-rules by name.

type CSSContentTypeRule added in v0.9.0

type CSSContentTypeRule struct {
	Properties []string // CSS property name patterns (trailing "*" glob ok, "*" = all, "@import" = import URLs)
	Types      []string // MIME type patterns (lowercase), e.g. "image/*", "image/svg+xml"
	Action     Action   // Allow, Strip, or Defang
	Line       int      // source line number
}

CSSContentTypeRule filters data: URIs in CSS url() values (and @import) by MIME type. It mirrors CSSSchemeRule but matches the data: URI's media type instead of the URL scheme. The Properties target is a CSS property name pattern list (or "@import"); the Types are MIME patterns (a single "*" wildcard allowed, e.g. "image/*", "*+xml").
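The single-wildcard MIME patterns can be matched with a prefix and suffix comparison. The following is a minimal sketch of that idea; matchMIME is a hypothetical helper and the library's own matcher may differ in details such as parameter handling.

```go
package main

import (
	"fmt"
	"strings"
)

// matchMIME reports whether mimeType matches pattern, where pattern may
// contain at most one "*" wildcard (for example "image/*" or "*+xml").
// Hypothetical helper; it is not part of htmlpolicy.
func matchMIME(pattern, mimeType string) bool {
	pattern = strings.ToLower(pattern)
	mimeType = strings.ToLower(mimeType)
	prefix, suffix, wildcard := strings.Cut(pattern, "*")
	if !wildcard {
		return pattern == mimeType
	}
	// The length check stops prefix and suffix from overlapping.
	return len(mimeType) >= len(prefix)+len(suffix) &&
		strings.HasPrefix(mimeType, prefix) &&
		strings.HasSuffix(mimeType, suffix)
}

func main() {
	fmt.Println(matchMIME("image/*", "image/png"))
	fmt.Println(matchMIME("*+xml", "image/svg+xml"))
	fmt.Println(matchMIME("image/*", "text/html"))
}
```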

type CSSPropertyRule

type CSSPropertyRule struct {
	Properties []string // property names (trailing "*" glob ok)
	Action     Action   // Allow, Strip, or Defang
	Line       int      // source line number
}

CSSPropertyRule filters CSS properties by name.

type CSSPseudoRule added in v0.4.0

type CSSPseudoRule struct {
	Kind    pseudoKind // any / class / element
	Pattern string     // lowercase name, optional trailing "*" glob
	Action  Action     // Allow or Strip
	Line    int        // source line number
}

CSSPseudoRule filters CSS selectors by pseudo-class / pseudo-element name. When a rule with Action Strip matches a pseudo anywhere within a complex (comma-separated) selector — including pseudos nested inside functional pseudo-classes such as :has(), :is(), :not(), :where() — that entire complex selector is dropped. If every complex selector in a ruleset's prelude is dropped, the whole ruleset is removed.

type CSSSchemeRule

type CSSSchemeRule struct {
	Properties []string // CSS property name patterns (trailing "*" glob ok, "*" = all, "@import" = import URLs)
	Schemes    []string // e.g. ["javascript", "data", ":relative", "*"]
	Action     Action   // Allow, Strip, or Defang
	Line       int      // source line number
}

CSSSchemeRule filters URLs in CSS url() values by scheme.

type CSSValueRule

type CSSValueRule struct {
	Pattern string // e.g. "url(*)", "expression(*)"
	Action  Action // Strip or Defang
	Line    int    // source line number
}

CSSValueRule filters CSS properties whose values match a pattern. Patterns ending in "(*)" match any value containing a call to that CSS function (e.g. "expression(*)" matches values containing "expression(...)"). Other patterns match the full value literally (case-insensitive).
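The two matching modes can be sketched as follows. This is a deliberately naive illustration: matchValue is a hypothetical helper, it trims surrounding whitespace before the literal comparison (an assumption), and it uses a substring test where a real implementation would tokenize the value.

```go
package main

import (
	"fmt"
	"strings"
)

// matchValue reports whether a CSS value matches a pattern. Patterns ending
// in "(*)" match any value containing a call to that function; all other
// patterns must equal the whole value, ignoring case.
// Hypothetical helper; it is not part of htmlpolicy.
func matchValue(pattern, value string) bool {
	p := strings.ToLower(pattern)
	v := strings.ToLower(strings.TrimSpace(value))
	if strings.HasSuffix(p, "(*)") {
		// "expression(*)" becomes "expression(", searched for anywhere in
		// the value. A substring test also matches names that merely end in
		// the function name, which a tokenizer would avoid.
		return strings.Contains(v, strings.TrimSuffix(p, "*)"))
	}
	return v == p
}

func main() {
	fmt.Println(matchValue("expression(*)", "EXPRESSION(alert(1))"))
	fmt.Println(matchValue("url(*)", "red"))
	fmt.Println(matchValue("none", " None "))
}
```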

type ContentTypeRule

type ContentTypeRule struct {
	Selector string           // CSS selector text
	Matcher  cascadia.Matcher // compiled selector
	Attr     string           // attribute name pattern (e.g. "src", "*", "href")
	Types    []string         // MIME type patterns (lowercase), e.g. "image/*", "text/html"
	Action   Action           // Allow, Strip, or Defang
	Line     int
}

ContentTypeRule restricts data URI MIME types for an attribute on matching elements. The Action field determines what happens to data URIs whose MIME type matches the rule's pattern list: Allow recursively sanitizes them, Strip removes the URI value, and Defang renames the attribute to make it inert.

type FileResolver added in v0.9.0

type FileResolver struct {
	// BaseDir is the directory that top-level relative include names resolve
	// against. An empty BaseDir resolves relative names against the process's
	// current working directory.
	BaseDir string
}

FileResolver is a Resolver that loads include directives by reading policy files from disk. Supply it via WithResolver:

policy, err := htmlpolicy.Parse(text, htmlpolicy.WithResolver(
	htmlpolicy.FileResolver{BaseDir: "/etc/policies"}))

Relative include names resolve against the directory of the including policy file (filepath.Dir(from)) when from is an absolute path. For top-level includes (from == "") or when from is not absolute, relative names resolve against BaseDir. The resolver returns the absolute path of the loaded file as the canonical name, so nested includes continue to resolve against the right directory and circular includes are detected reliably.
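The resolution order described above can be sketched with path/filepath. resolveInclude is a hypothetical function written for illustration; the handling of absolute include names is an assumption, since the text above only covers relative names, and the file names are made up. The expected paths assume a Unix-style filesystem.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveInclude returns the path an include name would be read from, given
// the resolver's base directory and the canonical name of the including
// policy (empty for top-level includes).
// Hypothetical helper; it is not part of htmlpolicy.
func resolveInclude(baseDir, from, name string) string {
	switch {
	case filepath.IsAbs(name):
		// Assumption: absolute names are used as-is.
		return filepath.Clean(name)
	case from != "" && filepath.IsAbs(from):
		// Nested include: relative to the including file's directory.
		return filepath.Join(filepath.Dir(from), name)
	default:
		// Top-level include, or a non-absolute from: relative to BaseDir.
		return filepath.Join(baseDir, name)
	}
}

func main() {
	fmt.Println(resolveInclude("/etc/policies", "", "base.policy"))
	fmt.Println(resolveInclude("/etc/policies", "/etc/policies/mail/main.policy", "common.policy"))
}
```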

func (FileResolver) Resolve added in v0.9.0

func (r FileResolver) Resolve(from, name string) (canonical, text string, err error)

Resolve implements Resolver.

type ParseOption

type ParseOption func(*parseConfig)

ParseOption configures the behavior of Parse.

func WithMaxIncludeDepth

func WithMaxIncludeDepth(n int) ParseOption

WithMaxIncludeDepth sets the maximum nesting depth for include directives. The default is DefaultMaxIncludeDepth (64).

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip script,style
	`, htmlpolicy.WithMaxIncludeDepth(4))
	if err != nil {
		log.Fatal(err)
	}

	output, _, err := policy.ApplyHTML([]byte("<p>Hello</p><script>evil</script>"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<p>Hello</p>

func WithPrefix added in v0.9.0

func WithPrefix(prefix string) ParseOption

WithPrefix sets the base prefix used when generating names for the CommentOut action (e.g. "PREFIX-commented-out ...") and every defang action (defang-attr/defang-scheme/defang-content-type, the CSS defang verbs, and SVG animated-value defang, e.g. "PREFIX-defanged-onclick"). The default is "htmlpolicy".

The prefix must be non-empty and contain only ASCII letters, digits, or hyphens; an invalid prefix is reported as an error from Parse. A "prefix" directive in the policy text sets the same value and, being applied during parsing, takes precedence over WithPrefix.
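Callers that accept a prefix from configuration may want to check it before calling Parse. A minimal sketch of the stated rule follows; validatePrefix is a hypothetical helper, and only the empty-prefix message mirrors the error shown in the Example output, while the other message is invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// validatePrefix checks that prefix is non-empty and contains only ASCII
// letters, digits, or hyphens.
// Hypothetical helper; it is not part of htmlpolicy.
func validatePrefix(prefix string) error {
	if prefix == "" {
		return errors.New("prefix must not be empty")
	}
	for i := 0; i < len(prefix); i++ {
		c := prefix[i]
		switch {
		case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z', c >= '0' && c <= '9', c == '-':
			// allowed character
		default:
			// Illustrative wording; the library's message may differ.
			return fmt.Errorf("prefix contains invalid character %q", c)
		}
	}
	return nil
}

func main() {
	fmt.Println(validatePrefix("myapp"))
	fmt.Println(validatePrefix(""))
	fmt.Println(validatePrefix("my app"))
}
```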

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		defang-attr * onclick
		comment-out script
	`, htmlpolicy.WithPrefix("myapp"))
	if err != nil {
		log.Fatal(err)
	}

	output, _, err := policy.ApplyHTML([]byte(`<p onclick="track()">Hello</p><script>evil</script>`))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<p myapp-defanged-onclick="track()">Hello</p><!--myapp-commented-out <script>evil</script>-->
Example (Error)
package main

import (
	"fmt"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	_, err := htmlpolicy.Parse(`defang-attr * on*`, htmlpolicy.WithPrefix(""))
	fmt.Println(err)
}
Output:
prefix must not be empty

func WithResolver added in v0.9.0

func WithResolver(r Resolver) ParseOption

WithResolver sets the Resolver used to load policies referenced by include directives (by name or path). It is only needed when the policy contains include directives that are not built-in presets ("include builtin:NAME" works without a resolver).
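A resolver does not have to read from disk. The sketch below is an in-memory implementation with the same method signature as FileResolver.Resolve; mapResolver is a hypothetical type, the policy names are made up, and it is only an assumption that this single method is all the Resolver interface requires.

```go
package main

import "fmt"

// mapResolver serves include text from an in-memory map keyed by name.
// Hypothetical type; it is not part of htmlpolicy.
type mapResolver map[string]string

// Resolve returns the name itself as the canonical name, plus the stored
// policy text, or an error when the name is unknown.
func (m mapResolver) Resolve(from, name string) (canonical, text string, err error) {
	t, ok := m[name]
	if !ok {
		return "", "", fmt.Errorf("include %q (from %q): not found", name, from)
	}
	return name, t, nil
}

func main() {
	r := mapResolver{"common": "strip script,style"}
	canonical, text, err := r.Resolve("", "common")
	fmt.Println(canonical, text, err)
	_, _, err = r.Resolve("common", "missing")
	fmt.Println(err)
}
```

Such a value could then be passed as htmlpolicy.WithResolver(r), for example in tests that should not touch the filesystem.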

type Policy

type Policy struct {
	// contains filtered or unexported fields
}

Policy is a compiled set of rules for sanitizing HTML, CSS, and SVG content.

A Policy is created by Parse and is immutable afterward.

A Policy is safe for concurrent use by multiple goroutines. Policy.ApplyHTML does not mutate the Policy.

Security note: this library is policy-driven and does not hardcode any element or attribute as safe or dangerous. Policy authors must account for modern HTML features including Declarative Shadow DOM (<template shadowrootmode>), iframe srcdoc, fencedframe, custom elements (the is attribute), and CSP nonce attributes. See the project README for recommended baselines and security guidance.

func Parse

func Parse(input string, opts ...ParseOption) (*Policy, error)

Parse parses policy text into a Policy. Include directives that are not built-in presets require a Resolver supplied via WithResolver.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip script,style
		strip-attr * on*
		strip-attr a[href^=javascript:] href
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p onclick="track()">Hello</p><script>alert(1)</script>`)
	output, modified, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println("output:", string(output))
}
Output:
modified: true
output: <p>Hello</p>
Example (Allowlist)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	// A safe allowlist needs a baseline in every rule family it reopens.
	// Reopening an attribute with "allow a href" also needs a scheme baseline,
	// or javascript: URLs pass through. Here strip-scheme/allow-scheme drop the
	// javascript: href while keeping the https one.
	policy, err := htmlpolicy.Parse(`
		strip *
		strip-attr * *
		allow p,div,b,i,em,strong
		allow a href
		allow-attr * class,id
		strip-attr * on*
		strip-scheme * href *
		allow-scheme * href https,mailto
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<div><p class="text"><a href="https://example.com" onclick="x">Link</a> <a href="javascript:alert(1)">evil</a> and <b>bold</b></p><script>evil</script></div>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}
Output:
<div><p class="text"><a href="https://example.com">Link</a> <a>evil</a> and <b>bold</b></p></div>
Example (BuiltinPreset)

ExampleParse_builtinPreset shows applying a maintained built-in preset via an include directive. "include builtin:NAME" works in any policy with no resolver configured; later rules override the preset under last-match-wins.

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		include builtin:blocklist
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p>Hi</p><a href="javascript:alert(1)">x</a><script>evil</script>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}
Output:
<p>Hi</p><a>x</a>
Example (CommentOut)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		comment-out script
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p>safe</p><script>evil()</script>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<p>safe</p><!--htmlpolicy-commented-out <script>evil()</script>-->
Example (ContentTypeFiltering)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip-content-type * src *
		allow-content-type * src image/*
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<img src="data:image/png;base64,iVBOR"/><img src="data:text/html,<script>evil</script>"/>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<img src="data:image/png;base64,iVBOR"/><img/>
Example (CssAtRules)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		css-strip-at import
		css-defang-at media
		css-allow-at keyframes
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte("@import \"evil.css\";\n@media screen { .x { color: red; } }\n@keyframes fade { from { opacity: 0; } }")
	output, _, err := policy.ApplyCSS(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
@htmlpolicy-defanged-media screen {
.x {
color: red;
}
}
@keyframes fade {
from {
opacity: 0;
}
}
Example (DefangContentType)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		defang-content-type * src image/svg+xml
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<img src="data:image/svg+xml,<svg/>"/><img src="data:image/png;base64,iVBOR"/>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<img htmlpolicy-defanged-src="data:image/svg+xml,&lt;svg/&gt;"/><img src="data:image/png;base64,iVBOR"/>
Example (DefangScheme)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		defang-scheme * href *
		allow-scheme * href https,:relative
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<a href="javascript:alert(1)">evil</a><a href="https://example.com">safe</a>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<a htmlpolicy-defanged-href="javascript:alert(1)">evil</a><a href="https://example.com">safe</a>
Example (Demote)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		demote form class
		demote marquee
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<form class="contact"><input type="text"/></form><marquee>hello</marquee>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<div class="contact"><input type="text"/></div><div>hello</div>
Example (Error)
package main

import (
	"fmt"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	_, err := htmlpolicy.Parse(`badverb script`)
	fmt.Println(err)
}
Output:
line 1: unknown verb "badverb"
Example (Placeholder)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		placeholder script,iframe
		placeholder-label blocked
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p>safe</p><script>evil</script><iframe src="x">frame</iframe>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}
Output:
<p>safe</p><strong title="&lt;script&gt;evil&lt;/script&gt;">[blocked: script]</strong><strong title="&lt;iframe src=&#34;x&#34;&gt;frame&lt;/iframe&gt;">[blocked: iframe]</strong>
Example (SchemeFiltering)
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip-scheme * href *
		allow-scheme * href https,mailto,:relative
		strip-scheme * src *
		allow-scheme * src https,:relative
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<a href="https://example.com">safe</a><a href="javascript:alert(1)">evil</a><img src="photo.jpg"/>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}
Output:
<a href="https://example.com">safe</a><a>evil</a><img src="photo.jpg"/>

func (*Policy) ApplyCSS

func (p *Policy) ApplyCSS(content []byte, opts ...ApplyOption) ([]byte, bool, error)

ApplyCSS sanitizes a standalone CSS stylesheet using the policy's CSS rules. It applies property, value, and at-rule filtering, and filters url() schemes. Use this to sanitize raw CSS content that is not embedded in HTML.

The second return value reports whether CSS rules or the rewriter changed the content; unlike Policy.ApplyHTML, when it is false the input bytes are returned unchanged (CSS is not re-serialized when nothing matched).

If the policy has no CSS rules and no URL rewriter is supplied via WithApplyURLRewriter, the content is returned unchanged. When a rewriter is supplied, url() values are rewritten even without CSS sanitization rules.

ApplyCSS is safe for concurrent use on the same Policy.

Returns a non-nil error only when WithMaxOutputFactor is set and the sanitized output exceeds the configured size limit.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		css-strip *
		css-allow color,font-size
		css-strip-value expression(*)
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`.header { color: red; margin: 10px; font-size: 14px; }`)
	output, modified, err := policy.ApplyCSS(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}
Output:
modified: true
.header {
color: red;
font-size: 14px;
}
Example (Pseudo)

ExamplePolicy_ApplyCSS_pseudo shows stripping selectors that use a pseudo-class. The whole complex selector is dropped (never just the pseudo), so an interaction-gated style cannot become an always-on one.

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`css-strip-pseudo hover`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`.menu, .item:hover > .sub { color: red; } a:hover { color: blue; }`)
	output, _, err := policy.ApplyCSS(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}
Output:
.menu {
color: red;
}

func (*Policy) ApplyDocument

func (p *Policy) ApplyDocument(content []byte, opts ...ApplyOption) ([]byte, bool, error)

ApplyDocument applies the policy to a full HTML document. It returns the sanitized document, whether it was modified, and any error. The second return value reports whether policy rules changed the content, not whether the output bytes differ from the input — see Policy.ApplyHTML.

Unlike Policy.ApplyHTML, which parses input as a fragment (suitable for user content embedded in a page), ApplyDocument parses input as a complete document, preserving the <!DOCTYPE>, <html>, <head>, and <body> structure. If the input is missing these elements, the HTML5 parser adds them.

Policy rules apply to all elements in the document, including those in <head> (e.g., <title>, <meta>, <link>, <style>, <script>).

Use ApplyHTML for user-generated content fragments (comments, emails, forum posts). Use ApplyDocument for sanitizing complete HTML pages.

ApplyDocument is safe for concurrent use on the same Policy.

Any <base> elements in the input are always stripped, and relative URLs are resolved against the first base href before policy rules run. This prevents <base> injection attacks and ensures scheme rules see resolved URLs.
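To see why resolution matters for scheme rules, the sketch below resolves a few references against a base href with the standard library's net/url. It illustrates URL resolution in general, not the library's internal code; resolveAgainstBase is a hypothetical helper and the URLs are made up.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveAgainstBase resolves ref against baseHref using RFC 3986 rules.
// Hypothetical helper; it is not part of htmlpolicy.
func resolveAgainstBase(baseHref, ref string) (string, error) {
	base, err := url.Parse(baseHref)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(r).String(), nil
}

func main() {
	base := "https://example.com/mail/2026/"
	for _, ref := range []string{"img/a.png", "../b.png", "https://other.example/c.png"} {
		got, err := resolveAgainstBase(base, ref)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(ref, "->", got)
	}
}
```

After resolution, a reference such as img/a.png carries the base's https scheme, so a scheme rule is evaluated against the URL a browser would actually request.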

Returns an error if the output size exceeds the configured limit (see WithMaxOutputFactor).

Pass WithApplyURLRewriter to rewrite URLs in the sanitized output for this call only — see WithApplyURLRewriter for details.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse("strip script")
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<!doctype html><html><head><title>Page</title></head><body><p>Hello</p><script>evil</script></body></html>`)
	output, modified, err := policy.ApplyDocument(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}
Output:
modified: true
<!DOCTYPE html><html><head><title>Page</title></head><body><p>Hello</p></body></html>

func (*Policy) ApplyHTML

func (p *Policy) ApplyHTML(content []byte, opts ...ApplyOption) ([]byte, bool, error)

ApplyHTML applies the policy to an HTML fragment. It returns the sanitized content, whether it was modified, and any error.

The second return value reports whether policy rules changed the content (an element or attribute was stripped, defanged, demoted, etc.) — not whether the output bytes differ from the input. Because the default behavior re-serializes through Go's HTML5 parser, the output bytes can differ from the input (whitespace, attribute quoting, tag normalization) even when this is false. Use WithPreserveOriginal to return the original bytes verbatim when no rules matched.

Input must be UTF-8. Use ConvertToUTF8 first if the charset is unknown. Output is always UTF-8.

Content is parsed as a fragment in a body context — it is never wrapped in <html>/<head>/<body> tags. If the input contains those tags, they are discarded and their content is kept. Use Policy.ApplyDocument to sanitize complete HTML documents with preserved document structure.

By default, output is always re-serialized through Go's HTML5 parser, which normalizes malformed HTML. This ensures the browser sees exactly the same structure the sanitizer saw, preventing parser-differential attacks. Use WithPreserveOriginal to return the original bytes when no policy rules modify the content.

ApplyHTML is safe for concurrent use on the same Policy.

Any <base> elements in the input are always stripped, and relative URLs are resolved against the first base href before policy rules run. This prevents <base> injection attacks and ensures scheme rules see resolved URLs.

Returns an error if the output size exceeds the configured limit (see WithMaxOutputFactor).

Pass WithApplyURLRewriter to rewrite URLs in the sanitized output for this call only — see WithApplyURLRewriter for details.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse("strip script,style")
	if err != nil {
		log.Fatal(err)
	}

	output, modified, err := policy.ApplyHTML([]byte("<p>Hello</p><script>alert(1)</script>"))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}
Output:
modified: true
<p>Hello</p>

func (*Policy) ApplyInlineCSS

func (p *Policy) ApplyInlineCSS(content []byte, opts ...ApplyOption) ([]byte, bool, error)

ApplyInlineCSS sanitizes a standalone inline CSS declaration list (the content of a style attribute, without surrounding HTML). It applies property, value, and URL scheme filtering.

The second return value reports whether CSS rules or the rewriter changed the content; when it is false the input bytes are returned unchanged.

If the policy has no CSS rules and no URL rewriter is supplied via WithApplyURLRewriter, the content is returned unchanged. When a rewriter is supplied, url() values are rewritten even without CSS sanitization rules.

ApplyInlineCSS is safe for concurrent use on the same Policy.

Returns a non-nil error only when WithMaxOutputFactor is set and the sanitized output exceeds the configured size limit.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		css-strip *
		css-allow color,font-size
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`color: red; margin: 10px; font-size: 14px`)
	output, modified, err := policy.ApplyInlineCSS(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}
Output:
modified: true
color: red; font-size: 14px

func (*Policy) Lint added in v0.9.0

func (p *Policy) Lint() []Warning

Lint reports likely mistakes in the policy without applying it. It is a static analysis of the compiled rules and never reads HTML input. An empty result means no issues were found; a non-empty result is advisory only (the policy is still valid and applied normally).

The checks are conservative — they aim for few false positives — and cover:

  • URL-bearing attributes that are explicitly allowed but left with no scheme rule constraining them, so javascript:/vbscript:/data: URLs may pass through (the most common fail-open mistake).
  • Rules made dead by a later rule in the same family: an identical rule, or a universal ("*") rule that overrides everything earlier. Because the engine is last-match-wins by order (not by specificity), an allow placed before a universal strip never takes effect.
  • "unwrap *" used as an allowlist catch-all without strip rules for active-content elements (script, style, iframe, …), which would leak their text content into the output.

Warnings are returned sorted by line.
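
The last-match-wins ordering behind the dead-rule check can be illustrated with a small, self-contained sketch. The rule type and decide function below are hypothetical stand-ins written for this illustration and are not part of the htmlpolicy API; real rules match CSS selectors, whereas this sketch matches tag names only.

```go
package main

import "fmt"

// rule is a hypothetical, simplified stand-in for a policy rule: it matches
// either one tag name or every tag ("*"). It is not a type from htmlpolicy.
type rule struct {
	action string // "allow" or "strip"
	tag    string // tag name, or "*" for any element
}

// decide walks the rules in order and keeps the action of the last rule that
// matches, mirroring last-match-wins evaluation. Tags matched by no rule
// get "pass" (left unchanged).
func decide(rules []rule, tag string) string {
	result := "pass"
	for _, r := range rules {
		if r.tag == "*" || r.tag == tag {
			result = r.action
		}
	}
	return result
}

func main() {
	// The allow comes first, so the later universal strip overrides it.
	dead := []rule{{"allow", "p"}, {"strip", "*"}}
	// The allow comes last, so it takes effect.
	live := []rule{{"strip", "*"}, {"allow", "p"}}

	fmt.Println(decide(dead, "p")) // strip
	fmt.Println(decide(live, "p")) // allow
}
```

In the first ordering the allow is exactly the kind of dead rule the second check describes: no input can ever reach it, because the universal rule that follows always matches last.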

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	// This allowlist reopens <a href> but never constrains its scheme, so
	// javascript: URLs would pass through. Lint catches it statically.
	policy, err := htmlpolicy.Parse(`
		strip *
		strip-attr * *
		allow a href
	`)
	if err != nil {
		log.Fatal(err)
	}

	for _, w := range policy.Lint() {
		fmt.Println(w)
	}
}
Output:
line 4: URL attribute "href" is allowed but no scheme rule constrains it; javascript:/vbscript:/data: URLs may pass through (add a strip-scheme/allow-scheme baseline or `include builtin:url-safe`)

func (*Policy) String

func (p *Policy) String() string

String returns the policy as flattened policy text. The output is valid policy syntax that can be parsed back to produce an equivalent policy. Includes are fully resolved (inlined).

func (*Policy) URLSchemeAction added in v0.7.0

func (p *Policy) URLSchemeAction(element, attr, rawURL string) Action

URLSchemeAction reports the action the policy's scheme rules (allow-scheme, strip-scheme, defang-scheme) would take for a URL used in the named attribute of the named element. It lets a caller validate a URL out of band — without building and re-parsing HTML — answering the single question "is this URL's scheme permitted here?".

The return value is Allow if the scheme is permitted, or Strip / Defang if a scheme rule would neutralize it. These are the only actions scheme rules produce. element and attr are matched case-insensitively (e.g. "a", "href"). Before its scheme is extracted, rawURL is normalized exactly as Policy.ApplyHTML normalizes a single-URL attribute value (control characters trimmed and, for URL attributes, zero-width characters stripped), so the result matches what document sanitization would decide for that scheme. Scheme rules are evaluated last-match-wins, and the empty (schemeless) scheme is matched by a rule listing ":relative" or "*".

Scope, and what Allow does NOT mean:

  • URLSchemeAction evaluates scheme rules ONLY. It does not consider whether the element or attribute would itself be removed by other verbs (e.g. "strip a" or "strip-attr a href"), nor content-type rules that filter data: URIs by MIME type. A return of Allow therefore means "no scheme rule objects to this scheme" — not "this attribute survives sanitization". When the policy has no scheme rules at all, the result is always Allow.
  • Selectors are evaluated against a detached element carrying only the named attribute. Scheme rules whose selectors depend on document position, ancestors, siblings, or other attributes may not match the way full-document Apply would; for such policies URLSchemeAction may report Allow for a URL that the document sanitizer would strip. For the common scheme policies (selectors of the form "*", a tag name, or an attribute selector on the URL attribute itself) the result is exact.

Example
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	// URLSchemeAction validates a URL's scheme against the policy without
	// building or re-parsing HTML — useful when an application has a URL in
	// hand and wants to decide whether to keep, drop, or defang it.
	policy, err := htmlpolicy.Parse(`
		strip-scheme * href *
		allow-scheme a href https,mailto,:relative
		defang-scheme a href http
	`)
	if err != nil {
		log.Fatal(err)
	}

	for _, u := range []string{
		"https://example.com",
		"http://example.com",
		"javascript:alert(1)",
		"/relative/path",
	} {
		switch policy.URLSchemeAction("a", "href", u) {
		case htmlpolicy.Allow:
			fmt.Printf("allow:  %s\n", u)
		case htmlpolicy.Defang:
			fmt.Printf("defang: %s\n", u)
		default:
			fmt.Printf("strip:  %s\n", u)
		}
	}
}
Output:
allow:  https://example.com
defang: http://example.com
strip:  javascript:alert(1)
allow:  /relative/path

type Resolver

type Resolver interface {
	Resolve(from, name string) (canonical, text string, err error)
}

Resolver loads policy text for include directives.

name is whatever appears after "include " in the policy file. The Resolver implementation decides how to interpret it (file path, database key, embedded resource, etc.).

from identifies the policy that contains the include directive, allowing the resolver to interpret name relative to the including policy's location. It is the canonical name returned by an earlier Resolve call (the include that loaded the parent), or the empty string for include directives in the top-level policy text passed to Parse. A file-based resolver typically treats from as a file path and joins relative name values against filepath.Dir(from); resolvers without a notion of location can ignore from entirely.

Resolve returns the canonical name of the loaded resource along with its text. The parser forwards canonical as from on nested include calls and uses it as the key for circular-include detection. Implementations should normalize the canonical form (e.g. resolve relative paths to absolute, strip "./" prefixes, normalize separators) so the same underlying resource always yields the same canonical string; otherwise a cycle through e.g. "foo.policy" and "./foo.policy" will only be caught by the depth limit (with a misleading "exceeds maximum depth" error rather than "circular include"). Resolvers without a notion of canonicalization may return name unchanged.

type SchemeRule

type SchemeRule struct {
	Selector string           // CSS selector text
	Matcher  cascadia.Matcher // compiled selector
	Attr     string           // attribute name pattern (e.g. "href", "*", "on*")
	Schemes  []string         // schemes (lowercase), ":relative" for schemeless URLs, or "*" for any scheme
	Action   Action           // Allow, Strip, or Defang
	Line     int
}

SchemeRule restricts URL schemes for an attribute on matching elements. The Action field determines what happens to URLs with matching schemes: Allow passes them through, Strip removes them, and Defang renames the attribute to make it inert.

type TagRule

type TagRule struct {
	Selector string           // CSS selector text (for String())
	Matcher  cascadia.Matcher // compiled selector
	Action   Action           // what to do with matching elements
	Line     int              // source line number (for diagnostics)
}

TagRule defines an action for matching elements.

type URLContext

type URLContext struct {
	// Element is the lowercase HTML element name (e.g., "img", "a", "style").
	// Empty for standalone CSS sanitization ([Policy.ApplyCSS], [Policy.ApplyInlineCSS]).
	Element string

	// Attr is the HTML attribute name (e.g., "href", "src", "style").
	// Empty for URLs inside <style> elements or standalone CSS.
	Attr string

	// CSSProperty is the CSS property name containing the url() value
	// (e.g., "background-image", "background"), or "@import" for @import URLs.
	// Empty for non-CSS URLs (HTML attributes).
	CSSProperty string

	// Parent is the lowercase name of the URL's element's parent in the
	// HTML tree (e.g., "picture", "audio", "video", "head", "body"). Empty
	// when the element has no element parent — top-level fragment nodes,
	// document-root children whose parent is the document node, and
	// standalone CSS sanitization.
	//
	// Useful for elements whose semantics depend on ancestry, e.g.
	// <source src> is an image candidate inside <picture> but a media file
	// inside <audio>/<video>.
	Parent string

	// GetAttr returns the value of the named attribute on the URL's
	// element, or "" if the attribute is absent. Lookups are
	// case-insensitive. GetAttr is always non-nil; for orphan contexts
	// (standalone ApplyCSS / ApplyInlineCSS) it returns "" for any name.
	//
	// Useful for elements whose semantics depend on a sibling attribute on
	// the same element, e.g. <input type=image>, <link rel=stylesheet>,
	// <object type=...>. GetAttr reflects the live attribute slice on the
	// element at the time of the call, so mutations applied by earlier
	// rewriter calls on the same element are visible.
	//
	// GetAttr is not considered part of the deduplication key — the
	// cached rewriter result for a given {url, element, attr, parent,
	// cssProperty} tuple comes from the first call, so if you discriminate
	// on GetAttr across same-key elements only the first element's attrs
	// influence the cached value.
	GetAttr func(name string) string
}

URLContext describes where a URL was found in the sanitized output. It is passed to URLRewriter callbacks (configured per call via WithApplyURLRewriter) to provide context about each URL.

type URLRef added in v0.3.0

type URLRef struct {
	// Context describes where the URL was found: the element, attribute, CSS
	// property, parent element name, and a live GetAttr accessor. It is the
	// exact same URLContext value the [URLRewriter] would receive for this
	// URL. Context.GetAttr is valid for the duration of the prefetch callback
	// (the underlying nodes are still alive); do not retain it past the call.
	Context URLContext

	// URL is the discovered URL in its final resolved form — identical to the
	// value that would be passed to a [URLRewriter] for this same context.
	URL string
}

URLRef pairs a URL discovered during sanitization with the URLContext describing where it was found. A slice of URLRef is passed to the callback registered by WithApplyURLPrefetcher, allowing the caller to warm a cache (e.g. fetch resources in parallel) before the synchronous URLRewriter runs.
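
The warm-then-rewrite pattern described above might look roughly like the following. This is a sketch under stated assumptions: URLRef here is a local stand-in carrying only the URL field, and the cache type, its warm and rewrite methods, and the lookup function are hypothetical names, not part of htmlpolicy. The exact callback signature expected by WithApplyURLPrefetcher is not shown on this page, so the sketch does not attempt to wire the two together.

```go
package main

import (
	"fmt"
	"sync"
)

// URLRef is a local stand-in with only the URL field, so the sketch compiles
// on its own. Real code would use htmlpolicy.URLRef.
type URLRef struct {
	URL string
}

// cache maps an original URL to its replacement. It is a hypothetical helper.
type cache struct {
	mu sync.Mutex
	m  map[string]string
}

// warm resolves every discovered URL concurrently and stores the results.
// lookup stands in for a slow operation such as a network fetch.
func (c *cache) warm(refs []URLRef, lookup func(string) string) {
	var wg sync.WaitGroup
	for _, ref := range refs {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			v := lookup(u)
			c.mu.Lock()
			c.m[u] = v
			c.mu.Unlock()
		}(ref.URL)
	}
	wg.Wait()
}

// rewrite is the synchronous step: it only reads the warmed cache and falls
// back to the original URL when nothing was cached.
func (c *cache) rewrite(u string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.m[u]; ok {
		return v
	}
	return u
}

func main() {
	c := &cache{m: make(map[string]string)}
	refs := []URLRef{
		{URL: "https://example.com/a.png"},
		{URL: "https://example.com/b.png"},
	}
	c.warm(refs, func(u string) string { return "cached:" + u })

	fmt.Println(c.rewrite("https://example.com/a.png")) // cached:https://example.com/a.png
	fmt.Println(c.rewrite("https://example.com/c.png")) // https://example.com/c.png
}
```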

type URLRewriter

type URLRewriter func(ctx URLContext, url string) string

URLRewriter is a callback that receives each URL in the sanitized output and returns a replacement. Return the original url unchanged to keep it. Supply a rewriter to one Apply call via WithApplyURLRewriter. Because the rewriter is scoped to that single call, it does not need to be safe for concurrent use; two goroutines calling Apply on the same Policy pass independent rewriters and never share state.

URLs are presented after policy evaluation and base URL resolution. Fragment-only references (#id), empty URLs, and data: URIs are excluded. URLs inside data URI content are recursed into (e.g. url(img.png) inside data:text/css is presented even though the data URI is not).
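
A rewriter typically branches on the context to decide which URLs to touch. The sketch below routes image URLs through a proxy and leaves everything else alone. It is illustrative only: URLContext is redeclared locally with a subset of the documented fields so the code runs without the library, and proxyImages and the proxy.example host are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// URLContext is a local stand-in with a subset of the documented fields.
// Real code would use htmlpolicy.URLContext.
type URLContext struct {
	Element     string
	Attr        string
	CSSProperty string
	Parent      string
}

// proxyImages has the shape of a URLRewriter. It rewrites <img src> URLs and
// <source> URLs inside <picture>, and returns every other URL unchanged.
func proxyImages(ctx URLContext, u string) string {
	isImg := ctx.Element == "img" && ctx.Attr == "src"
	isPictureSource := ctx.Element == "source" && ctx.Parent == "picture"
	if isImg || isPictureSource {
		// proxy.example is a placeholder host for this illustration.
		return "https://proxy.example/fetch?u=" + url.QueryEscape(u)
	}
	return u
}

func main() {
	img := URLContext{Element: "img", Attr: "src"}
	link := URLContext{Element: "a", Attr: "href"}

	fmt.Println(proxyImages(img, "https://example.com/a.png"))
	fmt.Println(proxyImages(link, "https://example.com/page")) // https://example.com/page
}
```

Because fragment-only references, empty URLs, and data: URIs are never presented, a rewriter like this does not need special cases for them.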

Calls are deduplicated within a single top-level ApplyHTML / ApplyDocument / ApplyCSS / ApplyInlineCSS call: each unique combination of {url, element, attr, parent, cssProperty} is presented to the callback exactly once for that call. The same URL in different contexts (e.g. <img src> vs <video poster>, or the same URL on a <source> inside <picture> vs inside <video>) produces separate calls. Recursive invocations triggered by embedded data URI content each have their own deduplication cache.

type Warning added in v0.9.0

type Warning struct {
	Line    int
	Message string
}

Warning describes a likely policy mistake found by Policy.Lint. Warnings are advisory: a policy that produces warnings is still valid and applied normally. Line is the 1-based source line the warning relates to, or 0 when it is not tied to a specific line.

func (Warning) String added in v0.9.0

func (w Warning) String() string

String renders the warning as "line N: message" (or just the message when Line is 0).

Directories

Path Synopsis
cmd
htmlpolicy command
Command htmlpolicy applies an HTML sanitization policy to HTML content.
