htmlpolicy

package module

Version: v0.21.0 (latest) | Published: Jul 21, 2026 | License: AGPL-3.0 | Imports: 21 | Imported by: 0

Repository

gitlab.com/grepular/htmlpolicy

README ¶

htmlpolicy

⚠️ Pre-v1 stability warning: until this library reaches v1.0.0, the API and policy language are liable to experience breaking changes between releases.

License & contributing. htmlpolicy is free software licensed under the AGPL-3.0. The author reserves the right to also distribute it under a separate commercial license. Because of this, any contribution must be offered under terms that let the author relicense it under any license, including a proprietary/commercial one — see CONTRIBUTING.md. (Want a non-AGPL commercial license? Open an issue or get in touch.)

A policy-driven HTML, CSS, and SVG sanitizer for Go.

Most sanitizers hardcode what's "safe" in library code. htmlpolicy takes a different approach: you write a declarative policy that says exactly what to strip, allow, defang, demote, or unwrap. Anything not matched by a rule passes through unchanged. Policies are plain text files, not Go code, so they can be reviewed, versioned, composed, and swapped without recompiling.

API reference and policy language documentation (godoc)

Playground

Try policies interactively at https://grepular.gitlab.io/htmlpolicy/ — the library compiled to WebAssembly, running entirely client-side. Edit a policy and HTML/CSS side by side and watch the sanitized output, lint warnings, verbose action log, and compiled policy update live. The policy, input, and output panes are syntax-highlighted, and the policy editor autocompletes verbs and builtin: presets (Ctrl-Space to trigger). No content leaves your browser, and shareable links encode everything in the URL fragment.

To run it locally: make playground and open http://localhost:8080.

Quick Start

import "gitlab.com/grepular/htmlpolicy"

// Parse a policy and apply it to an HTML fragment.
policy, err := htmlpolicy.Parse(`
    strip-tag script,style,noscript
    strip-tag iframe,object,embed
    strip-attr * on*
    strip-attr a[href^=javascript:] href
`)
if err != nil {
    log.Fatal(err)
}

output, modified, err := policy.ApplyHTML(fragmentContent)

For full HTML documents, use ApplyDocument instead:

output, modified, err := policy.ApplyDocument(documentContent)

For standalone CSS sanitization:

policy, _ := htmlpolicy.Parse("css-strip *\ncss-allow color,font-size")
out, modified, err := policy.ApplyCSS([]byte(".foo { color: red; font-size: 14px; margin: 10px; }"))
// out = ".foo {\ncolor: red;\nfont-size: 14px;\n}"

Both HTML and CSS expect UTF-8 input. Use ConvertToUTF8 first if needed:

utf8Content, _ := htmlpolicy.ConvertToUTF8(rawContent, contentType)

Features

HTML sanitization — strip, allow, defang, demote, comment-out, placeholder, or unwrap elements and attributes using CSS selectors
CSS sanitization — filter properties, values, at-rules, url() schemes, url() data-URI content types, and selector pseudo-classes/elements in style attributes, <style> elements, and standalone stylesheets
URL scheme filtering — allow/strip/defang URLs by scheme (html and css independently) with per-URL evaluation for multi-URL attributes (srcset). Defanging an <a>/<area> href prefixes the scheme in place (href="htmlpolicy-defanged:bitcoin:...") so non-allowlisted app-protocol links stay hoverable/copyable but inert; other attributes are renamed
Content-type filtering — control which MIME types are allowed in data URIs with wildcard patterns
Recursive sanitization — data:text/html, srcdoc, and meta refresh URLs are recursively sanitized with depth limiting (SVG/XHTML/XML data URIs are stripped by default — opt in per context with allow-content-type)
SVG/MathML namespace validation — prevents mutation XSS from foreign-content elements in the wrong namespace
SMIL animation sanitization — applies attribute and scheme rules to animated values, preventing runtime bypasses
Parser-differential prevention — always re-serializes through Go's HTML5 parser by default
URL rewriting — per-call callback hook (via WithApplyURLRewriter) for rewriting URLs in the final output (HTML attributes, CSS url() values, srcset, SMIL); scoped to a single Apply call so it can safely close over per-call state without concurrency concerns. A companion WithApplyURLPrefetcher hook delivers the full URL set up front (same set, same order) so callers can fetch resources in parallel before the rewriter runs
Original-URL preservation — WithOriginalURLAttr records the pre-rewrite value of every attribute the URL rewriter changed as a companion {prefix}-original-{attr} attribute (e.g. htmlpolicy-original-href), so rewritten URLs stay recoverable from the output; attacker-supplied attributes matching the reserved pattern are stripped first, so a preserved value can't be spoofed (WithKeepOriginalURLAttrs opts out for re-processing trusted prior output)
Policy composition — include directives for combining policies
Policy linting — Policy.Lint() (and htmlpolicy -lint) flags likely mistakes: URL attributes reopened with no scheme baseline (or with a dangerous scheme re-allowed after an earlier strip — for data:, only when no content-type rule governs the attribute), dangerous non-URL attributes (on*/nonce/is) re-allowed after a strip, allow-* rules with no effect, rules made dead by a later rule, and unsafe unwrap-tag * catch-alls
Concurrency safe — Policy is safe for concurrent use after creation

Security Guidance

This library is policy-driven — it does not hardcode any element or attribute as "safe" or "dangerous". You must write policies that account for the HTML features your application needs to defend against.

Built-in presets (recommended starting points)

Rather than copy a baseline into your codebase (where it bit-rots as browsers add features), include a maintained preset that ships with the library and is improved each time you upgrade with go get -u. include builtin:NAME works in any policy with no resolver configured, and later rules override the preset:

// Blocklist: strip active content, dangerous schemes, XML/SVG data URIs.
blocklist, _ := htmlpolicy.Parse("include builtin:blocklist")

// Allowlist: start fail-closed, then re-allow exactly what you need.
// (Distinct variable names so both snippets compile in the same scope.)
allowlist, _ := htmlpolicy.Parse(`
    include builtin:allowlist-base
    allow-tag p,div,span,b,i,em,strong,a,ul,ol,li,br
    allow-attr * class,id
    allow-attr a href
    include builtin:url-safe
`)

Preset	Purpose
`builtin:blocklist`	Strong blocklist baseline — strips active-content elements/attributes, dangerous URL schemes, and XML/SVG data URIs; everything else passes through
`builtin:allowlist-base`	Fail-closed skeleton — strips all elements, attributes, schemes, content types, and CSS, so anything you don't explicitly re-allow is dropped
`builtin:url-safe`	Restrict every URL attribute to `http`, `https`, `mailto`, and relative URLs — pair with rules that reopen URL-bearing attributes
`builtin:strip-dark-mode`	Remove dark-mode signals — `prefers-color-scheme: dark` `@media` overrides, the `color-scheme` property and `<meta>`, and Apple Mail's legacy `supported-color-schemes` property and `<meta>`. Strips the dark-mode mechanism, not author-chosen colors, so it does not guarantee a light render for a page that defaults to dark

Presets are ordinary policy text: BuiltinPolicy(name) returns the source, BuiltinPolicyNames() lists them, and Policy.String() shows them inlined.

Allowlist safety: when you reopen a URL-bearing attribute (e.g. allow-attr a href), also constrain its schemes (include builtin:url-safe, or your own allow-scheme rules). Reopening an attribute without a scheme baseline lets javascript: URLs through.

Recommended blocklist baseline (equivalent to `builtin:blocklist`)

strip-comments
strip-tag script,noscript,style,base,link
strip-tag iframe,fencedframe,object,embed,applet
strip-tag template,portal
strip-tag form,input,textarea,select,button
strip-tag meta[http-equiv=refresh]
strip-attr * on*
strip-attr * style
strip-attr * is
strip-attr * nonce
strip-attr template shadowrootmode
strip-scheme * href javascript,vbscript
strip-scheme * src javascript,vbscript,data
strip-content-type * * *+xml,*/xml

Key threats to consider

Declarative Shadow DOM: <template shadowrootmode="open"> causes template content to become live DOM. Strip <template> or the attribute.
<iframe srcdoc>: Contains raw inline HTML. Strip iframes or the srcdoc attribute.
<fencedframe>: Chrome-specific iframe-like element. Treat like iframe.
is attribute: Triggers custom element constructors, has mXSS edge cases. Strip in security contexts.
nonce attribute: CSP bypass if an attacker controls it. Always strip.
SVG/XHTML/XML data URIs: Browsers parse these with an XML parser, but this library uses an HTML5 parser, so they cannot be fully sanitized. They are now stripped by default — no policy needed. If you opt a context back into recursion (allow-content-type img src image/svg+xml), sanitization is best-effort only: namespace-prefixed elements (<a:script>) are stripped, but XML custom entity expansion (<!ENTITY x "<script>..."> then &x;), CDATA sections, and processing instructions are not. Only opt in for trusted sources or scripting-disabled contexts like <img>.
Weird data-URI charsets: a data URI declaring a charset the library can't reproduce (UTF-7, UTF-16, …) is stripped, since the browser would decode different bytes than the sanitizer saw (e.g. data:text/html;charset=utf-7,+ADw-script+AD4-). UTF-8, US-ASCII, and no charset are recursed normally.
unwrap-tag * as an allowlist catch-all: Unwrap removes a wrapper element but keeps its children in flow, mirroring how browsers treat unknown tags (e.g. <dov>hello</dov> renders as hello). It is useful for letting unrecognized structural tags pass through transparently, but it MUST be paired with explicit strip-tag rules for any element whose presence indicates active content or whose textual content should not be visible. At minimum: strip-tag script,style,iframe,object,embed,form,input,textarea,select,button,link,base and strip-tag meta[http-equiv=refresh]. Without those, raw <script>/<style> text would leak into the output as visible text once the wrapper is removed, and unknown-named elements that carry active content would be exposed by name.

CLI Tool

go install gitlab.com/grepular/htmlpolicy/cmd/htmlpolicy@latest
htmlpolicy [flags] <policy-arg>... < input.html > output.html

Each argument is either a path to a policy file or inline policy text. Multiple arguments are concatenated in order (last match wins):

htmlpolicy policy.txt < input.html > output.html
htmlpolicy 'strip-tag script' < input.html > output.html
htmlpolicy base.policy 'strip-tag style' 'allow-scheme * href https'
htmlpolicy base.policy overrides.policy 'strip-tag img'

Flag	Description
`-fragment`	Parse input as an HTML fragment instead of a full document
`-detect-charset`	Detect and convert input charset to UTF-8
`-content-type`	Content-Type header for charset detection (only used with `-detect-charset`)
`-prefix`	Override the prefix for defang/comment-out actions (default: `htmlpolicy`)
`-verbose`	Log each sanitization action to stderr
`-lint`	Check the policy for likely mistakes and exit (does not read stdin)

Verbose Logging

The -verbose flag (or WithVerboseLog in the library) logs each action to stderr:

strip-tag <script> (line 1: strip-tag script)
strip-attr <p> onclick (line 3: strip-attr * on*)
strip-scheme <a> href javascript:alert(1) (line 5: strip-scheme * href javascript)
css-strip position (line 7: css-strip position)
css-strip-at @import (line 8: css-strip-at import)

Requirements

Go 1.26.4 or later (see the go directive in go.mod).

Testing

make test

100% test coverage is required. Tests fail if coverage drops below 100%.

License

See LICENSE. htmlpolicy is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

A separate commercial license is available for those who cannot or do not wish to comply with the AGPL — open an issue or get in touch to arrange one. This is possible because the author holds the copyright to all original code; contributions are therefore accepted only under terms that grant the author the right to relicense them under any license, including a commercial one (see CONTRIBUTING.md). The commercial license covers htmlpolicy's own code only; third-party dependencies remain under their respective (permissive) licenses.

Acknowledgements

This project was developed by Mike Cardwell, with the assistance of Claude Code, Anthropic's AI coding tool.

Support/Appreciate my work

Bitcoin: 1PQLtWnjUi1itHLG6QCQeHM3Nxua8pRsq1
Paypal
Patreon

Documentation ¶

Overview ¶

Package htmlpolicy implements a policy-driven HTML, CSS, and SVG sanitizer.

Warning: until this library reaches v1.0.0, the API and policy language are liable to experience breaking changes between releases.

This library was built entirely through vibe coding with Claude Code (https://claude.ai/claude-code).

Unlike sanitizers that hardcode safety rules in library code, htmlpolicy uses declarative policies: plain text files where each line starts with an action verb followed by CSS selectors and optional arguments. Rules are evaluated top-to-bottom with last-match-wins semantics. Anything not matched by a rule passes through unchanged.
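
The evaluation order described above can be illustrated with a minimal sketch. It is not the library's implementation: the rule type and decide function are hypothetical, and selectors are reduced to a bare tag name or * for brevity.

```go
package main

import "fmt"

// rule is a hypothetical, simplified policy rule: an action applied to a
// tag name. Real policies use full CSS selectors.
type rule struct {
	action string // e.g. "strip" or "allow"
	tag    string // tag name, or "*" for any element
}

// decide walks the rules top to bottom and keeps the action of the last
// rule that matches. An empty result means no rule matched, so the
// element would pass through unchanged.
func decide(rules []rule, tag string) string {
	result := ""
	for _, r := range rules {
		if r.tag == "*" || r.tag == tag {
			result = r.action
		}
	}
	return result
}

func main() {
	rules := []rule{
		{"strip", "*"},
		{"allow", "p"},
		{"allow", "a"},
		{"strip", "a"}, // later rule overrides the earlier allow for <a>
	}
	for _, tag := range []string{"p", "a", "script"} {
		fmt.Printf("%s -> %s\n", tag, decide(rules, tag))
	}
	fmt.Printf("no rules -> %q\n", decide(nil, "p"))
}
```

Because only the last match counts, a broad rule placed first acts as a baseline and narrower rules after it act as exceptions.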

Policies can be reviewed, versioned, composed via includes, and swapped without recompiling. This makes htmlpolicy suitable for email rendering pipelines, CMS content filtering, user-generated content sanitization, and any HTML-to-HTML transformation where the rules need to be configurable.

Entry Points ¶

There are four entry points for applying a policy, plus a charset-conversion helper:

Policy.ApplyHTML sanitizes an HTML fragment (user content to embed in a page).
Policy.ApplyDocument sanitizes a full HTML document (preserves DOCTYPE/html/head/body).
Policy.ApplyCSS sanitizes a standalone CSS stylesheet.
Policy.ApplyInlineCSS sanitizes an inline CSS declaration list (style attribute content).
ConvertToUTF8 converts content from a detected charset to UTF-8 (call before the others if needed).

The Apply entry points expect UTF-8 input and return UTF-8 output; run ConvertToUTF8 first on content in any other charset. A Policy is safe for concurrent use by multiple goroutines.

Selectors ¶

Selectors use standard CSS syntax compiled via github.com/andybalholm/cascadia:

tag                     match all <tag> elements
*                       match all elements
tag[attr=value]         exact attribute match
tag[attr^=prefix]       prefix match
tag[attr$=suffix]       suffix match
tag[attr*=substring]    contains match
tag[attr~=word]         space-separated word match
tag[attr=value i]       case-insensitive match (CSS4)
div.ads > a             child combinator (quote when spaces present)
svg animate             descendant: <animate> inside <svg>
#tracking               ID selector
:not([href^=https])     negation pseudo-class

Comma-separated selector groups (e.g. "script,style,iframe") compile into a single rule. When a selector contains spaces, quote it so the parser distinguishes it from arguments: strip-tag "svg animate".

Tag matching is case-insensitive. Attribute value matching normalizes control characters and case internally for security, without modifying the output. This prevents bypasses like " javascript:" evading a [href^=javascript:] prefix check, or JaVaScRiPt: evading case-sensitive matching.
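
As a rough illustration of match-time normalization, the sketch below compares a normalized copy of the value against the selector's prefix and leaves the original value alone. The function names are hypothetical, and the exact normalization (dropping every ASCII control character and space, then lowercasing) is an assumption rather than the library's documented algorithm.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeForMatch returns a copy of an attribute value used only for
// matching: ASCII control characters and spaces are dropped and letters
// are lowercased. The caller's original value is not modified.
// Assumption: the library's real normalization may differ in detail.
func normalizeForMatch(value string) string {
	var b strings.Builder
	for _, r := range value {
		if r <= 0x20 || r == 0x7f {
			continue
		}
		b.WriteRune(r)
	}
	return strings.ToLower(b.String())
}

// hasPrefixNormalized mimics a [attr^=prefix] check on the normalized value.
func hasPrefixNormalized(value, prefix string) bool {
	return strings.HasPrefix(normalizeForMatch(value), strings.ToLower(prefix))
}

func main() {
	values := []string{" javascript:alert(1)", "JaVaScRiPt:alert(1)", "https://example.com/"}
	for _, v := range values {
		fmt.Printf("%q has javascript: prefix: %v\n", v, hasPrefixNormalized(v, "javascript:"))
	}
}
```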

HTML Tag Verbs ¶

Tag verbs determine what happens to matched elements. Each takes a single selector and nothing else — attributes are handled only by the -attr verbs below:

strip-tag SELECTOR[,...]         remove element and all content
comment-out-tag SELECTOR[,...]   wrap in HTML comment
placeholder-tag SELECTOR[,...]   replace with [removed: tag] label
demote-tag SELECTOR[,...]        convert to <div>/<span>
allow-tag SELECTOR[,...]         keep element (attributes still filtered separately)
unwrap-tag SELECTOR[,...]        remove the element's tags but keep its children

Inline elements demote to <span>, all others to <div>. The namespace is cleared, so foreign elements (SVG, MathML) become plain HTML.

Unwrap removes the wrapper but leaves the children in flow (mirroring how browsers treat unknown elements). The wrapper's attributes are discarded; void/empty elements with no children are stripped. As a catch-all in an allowlist policy, "unwrap-tag *" MUST be paired with explicit strip rules for any element whose presence indicates active content or whose text content should not be visible — at a minimum: script, style, iframe, object, embed, form, input, textarea, select, button, link, base, meta. Without those, raw <script>/<style> text would leak into the output as visible text once the wrapper is removed.

A tag verb takes exactly one selector token. A bare second word is an error, because it is almost always a mistake — either an attribute name (use an -attr verb) or an unquoted descendant combinator. To match a descendant combinator, quote the whole selector so the space is unambiguous:

allow-tag a            # keep <a>
allow-attr a href      # ...and keep its href attribute
strip-tag "td a"       # strip <a> elements inside <td> (descendant selector)
strip-tag td a         # ERROR: unexpected "a" after selector "td"

HTML Attribute Verbs ¶

Attribute verbs filter attributes on matched elements. Attribute name patterns support a trailing * glob: on* matches onclick, onload, etc.

strip-attr SELECTOR ATTR[,...]      strip named attrs
allow-attr SELECTOR ATTR[,...]      allow only named attrs
defang-attr SELECTOR ATTR[,...]     rename to PREFIX-defanged-ATTR
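
The trailing-glob matching and the defang rename can be sketched as follows. Both helpers are hypothetical names for illustration, and treating attribute names case-insensitively is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// matchAttrPattern reports whether an attribute name matches a pattern
// that may end in a trailing * glob (for example "on*").
func matchAttrPattern(pattern, name string) bool {
	pattern = strings.ToLower(pattern)
	name = strings.ToLower(name)
	if prefix, ok := strings.CutSuffix(pattern, "*"); ok {
		return strings.HasPrefix(name, prefix)
	}
	return pattern == name
}

// defangedName builds the PREFIX-defanged-ATTR name used when an
// attribute is renamed instead of removed.
func defangedName(prefix, attr string) string {
	return prefix + "-defanged-" + attr
}

func main() {
	for _, name := range []string{"onclick", "onload", "href"} {
		fmt.Printf("on* matches %s: %v\n", name, matchAttrPattern("on*", name))
	}
	fmt.Println(defangedName("htmlpolicy", "src"))
}
```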

URL Scheme Verbs ¶

Scheme verbs filter URL attributes by the URL's scheme. Use :relative to match schemeless URLs, * to match any scheme:

allow-scheme SELECTOR ATTR SCHEME[,...]     allow URLs with listed schemes
strip-scheme SELECTOR ATTR SCHEME[,...]     strip URLs with listed schemes
defang-scheme SELECTOR ATTR SCHEME[,...]    neutralize URLs when scheme matches

For allowlist behavior, pair with a strip-scheme baseline:

strip-scheme * href *
allow-scheme * href https,mailto,:relative

Each scheme must be a valid URL scheme (an ASCII letter followed by letters, digits, "+", "-", or "."), the wildcard *, or :relative. Comma-separated lists reject a stray comma between entries (e.g. "http,,https") to catch typos.
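
The validation rules in the paragraph above can be expressed in a few lines. This is a sketch with hypothetical function names, not the library's parser; it additionally rejects leading and trailing commas, which is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// validScheme accepts the wildcard *, the :relative keyword, or an ASCII
// letter followed by letters, digits, "+", "-", or ".".
func validScheme(s string) bool {
	if s == "*" || s == ":relative" {
		return true
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		isLetter := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
		isOther := (c >= '0' && c <= '9') || c == '+' || c == '-' || c == '.'
		if !isLetter && (i == 0 || !isOther) {
			return false
		}
	}
	return len(s) > 0
}

// parseSchemeList splits a comma-separated scheme list and rejects empty
// entries, so a typo such as "http,,https" is reported instead of ignored.
func parseSchemeList(list string) ([]string, error) {
	var out []string
	for _, entry := range strings.Split(list, ",") {
		if entry == "" {
			return nil, fmt.Errorf("empty entry in scheme list %q", list)
		}
		if !validScheme(entry) {
			return nil, fmt.Errorf("invalid scheme %q", entry)
		}
		out = append(out, strings.ToLower(entry))
	}
	return out, nil
}

func main() {
	for _, list := range []string{"https,mailto,:relative", "http,,https", "1http"} {
		schemes, err := parseSchemeList(list)
		fmt.Printf("%q -> %v, err=%v\n", list, schemes, err)
	}
}
```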

defang-scheme is attribute-aware. For href on <a> and <area> (including an SVG <a>'s xlink:href) it prefixes the URL's scheme in place with PREFIX-defanged — for example href="bitcoin:..." becomes href="htmlpolicy-defanged:bitcoin:...". The PREFIX-defanged scheme is registered nowhere, so the link is inert, but the URL stays in href and therefore remains hoverable and copyable. This lets a reader copy or inspect app-protocol links (bitcoin:, tel:, slack:, ...) that the policy does not allowlist. The transform is fail-safe (prefixing only ever breaks a URL, never activates one) and idempotent (a value that already carries the PREFIX-defanged scheme passes through unchanged on re-apply). For every other attribute, defang-scheme renames the attribute to PREFIX-defanged-ATTR (so an auto-loaded resource such as <img src> never fires a request) just like defang-attr.
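
The in-place prefixing and its idempotence can be shown with a short sketch. The defangHref helper is hypothetical, uses the default prefix, and assumes the input URL already carries a scheme.

```go
package main

import (
	"fmt"
	"strings"
)

// prefix is the default policy prefix named in the text above.
const prefix = "htmlpolicy"

// defangHref prepends the PREFIX-defanged scheme to a URL. A value that
// already starts with that scheme is returned unchanged, so applying the
// transform twice gives the same result as applying it once.
func defangHref(url string) string {
	marker := prefix + "-defanged:"
	if strings.HasPrefix(strings.ToLower(url), marker) {
		return url
	}
	return marker + url
}

func main() {
	once := defangHref("bitcoin:1PQLtWnjUi1itHLG6QCQeHM3Nxua8pRsq1")
	twice := defangHref(once)
	fmt.Println(once)
	fmt.Println(once == twice)
}
```

The original URL stays readable after the marker, which is what keeps the link copyable while no handler is registered for the new scheme.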

Multi-URL attributes (srcset, imagesrcset, ping, archive) get per-URL evaluation — each URL is independently matched against the rule chain — and always use the attribute-rename form (they are auto-loaded, not hovered).

Content-Type Verbs ¶

Content-type verbs control what happens to data URIs based on their MIME type. MIME patterns support a single * wildcard: image/*, *+xml, */xml.

allow-content-type SELECTOR ATTR TYPE[,...]     allow data URIs with listed types
strip-content-type SELECTOR ATTR TYPE[,...]     strip data URIs with listed types
defang-content-type SELECTOR ATTR TYPE[,...]    rename attr when type matches
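
A single-wildcard MIME pattern amounts to a prefix check plus a suffix check. The sketch below uses a hypothetical matchMIME helper and assumes case-insensitive comparison.

```go
package main

import (
	"fmt"
	"strings"
)

// matchMIME reports whether mimeType matches a pattern containing at most
// one * wildcard, such as image/*, *+xml, or */xml.
func matchMIME(pattern, mimeType string) bool {
	pattern = strings.ToLower(pattern)
	mimeType = strings.ToLower(mimeType)
	before, after, found := strings.Cut(pattern, "*")
	if !found {
		return pattern == mimeType
	}
	// The length check stops the prefix and suffix from overlapping.
	return len(mimeType) >= len(before)+len(after) &&
		strings.HasPrefix(mimeType, before) &&
		strings.HasSuffix(mimeType, after)
}

func main() {
	cases := [][2]string{
		{"image/*", "image/png"},
		{"*+xml", "image/svg+xml"},
		{"*/xml", "text/xml"},
		{"image/*", "text/html"},
	}
	for _, c := range cases {
		fmt.Printf("%s vs %s: %v\n", c[0], c[1], matchMIME(c[0], c[1]))
	}
}
```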

By default, text/html data URIs are recursively sanitized (the HTML5 parser matches the browser), while SVG/XHTML/XML data URIs (image/svg+xml, application/xhtml+xml, */xml, *+xml) are stripped — the HTML5 parser cannot match a browser's XML parser, so they are removed rather than passed through with best-effort sanitization. To recurse an XML/SVG data URI anyway (best-effort, accepting the limitations below), opt in with an allow-content-type rule, e.g.:

allow-content-type img src image/svg+xml

This rule scopes the opt-in to <img src> (a scripting-disabled context in browsers). Other data URIs pass through unchanged unless a content-type rule matches.

Independently, any data URI declaring a charset the library cannot reproduce faithfully (anything other than UTF-8, US-ASCII, or none — e.g. UTF-7 or UTF-16) is stripped, because the content is decoded to raw bytes and handed to the UTF-8/ASCII HTML5 parser; a browser honoring the declared charset would decode different markup (a charset differential).
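
The charset check can be approximated as below. This is a sketch with hypothetical helper names and hand-rolled header parsing; it is conservative in that charset aliases (for example "utf8") are treated as not reproducible, which is an assumption about how strict the real check is.

```go
package main

import (
	"fmt"
	"strings"
)

// declaredCharset extracts the charset parameter from a data URI header,
// lowercased. It returns "" when no charset is declared.
func declaredCharset(uri string) string {
	if len(uri) < 5 || !strings.EqualFold(uri[:5], "data:") {
		return ""
	}
	header, _, _ := strings.Cut(uri[5:], ",")
	// The first segment is the MIME type; the rest are parameters.
	for _, param := range strings.Split(header, ";")[1:] {
		key, value, _ := strings.Cut(strings.TrimSpace(param), "=")
		if strings.EqualFold(key, "charset") {
			return strings.ToLower(strings.Trim(value, `"`))
		}
	}
	return ""
}

// reproducibleCharset reports whether the sanitizer and a browser would
// decode the same bytes: only UTF-8, US-ASCII, or no declared charset.
func reproducibleCharset(uri string) bool {
	switch declaredCharset(uri) {
	case "", "utf-8", "us-ascii":
		return true
	}
	return false
}

func main() {
	for _, uri := range []string{
		"data:text/html,<p>hi</p>",
		"data:text/html;charset=UTF-8;base64,PHA+aGk8L3A+",
		"data:text/html;charset=utf-7,+ADw-script+AD4-",
	} {
		fmt.Printf("%q reproducible: %v\n", uri, reproducibleCharset(uri))
	}
}
```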

CSS Verbs ¶

CSS verbs process style attributes, <style> element contents, data:text/css URIs, and SMIL animated style values. They follow the same last-match-wins semantics as HTML rules. Without CSS rules, CSS content passes through unchanged. HTML scheme rules (strip-scheme etc.) do NOT apply to CSS — use css-*-scheme rules instead.

css-strip PROP[,...]                    strip CSS properties (bare name matches vendor-prefixed variants)
css-allow PROP[,...]                    allow CSS properties
css-defang PROP[,...]                   defang CSS properties (prefix name)
css-strip-value PATTERN                 strip properties matching value pattern
css-defang-value PATTERN                defang properties matching value pattern
css-strip-at NAME[,...]                 strip CSS at-rules (e.g. import, * = all)
css-allow-at NAME[,...]                 allow CSS at-rules
css-defang-at NAME[,...]                defang CSS at-rules (prefix name)
css-strip-scheme TARGET SCHEME[,...]         strip CSS url() values by scheme
css-allow-scheme TARGET SCHEME[,...]         allow CSS url() values by scheme
css-defang-scheme TARGET SCHEME[,...]        defang CSS url() values by scheme
css-strip-content-type TARGET TYPE[,...]     strip CSS url() data: URIs by MIME type
css-allow-content-type TARGET TYPE[,...]     allow CSS url() data: URIs by MIME type
css-defang-content-type TARGET TYPE[,...]    defang CSS url() data: URIs by MIME type
css-strip-pseudo NAME[,...]                  strip selectors using a pseudo-class/element
css-allow-pseudo NAME[,...]                  allow pseudo-classes/elements
css-strip-media FEATURE[:VALUE][,...]        strip @media blocks by media feature
css-allow-media FEATURE[:VALUE][,...]        allow @media blocks by media feature

The css-*-scheme and css-*-content-type TARGET is a comma-separated list of CSS property name patterns or @import. Use * to match all properties and @import. Scheme lists support :relative and * as with HTML scheme rules; content-type TYPE lists use the same MIME patterns as the HTML content-type verbs (a single * wildcard, e.g. image/*, *+xml).

css-*-content-type filters data: URIs found in CSS url() values and @import by their MIME type — the CSS counterpart of the HTML content-type verbs (HTML content-type rules do not apply to CSS). It composes with css-*-scheme as AND (the most restrictive of the two decisions wins). When no css-*-content-type rule matches a url() data: URI, the default applies: text/css is recursively sanitized and SVG/XHTML data: URIs are best-effort recursed (a CSS url() loads in the browser's secure mode, with no scripting). It applies to url() and @import data: URIs and equally to bare-string URL arguments of image-set()/cross-fade()/image()/src(), which browsers load exactly like url() tokens.

The css-*-pseudo verbs filter CSS selectors (in <style> elements, data:text/css, and standalone Policy.ApplyCSS) by pseudo-class or pseudo-element name. Each comma-separated NAME may carry an optional leading "::" to match pseudo-elements only or ":" to match pseudo-classes only; a bare name (or *) matches either kind. Names support a trailing * glob. The four legacy single-colon pseudo-elements (:before, :after, :first-line, :first-letter) are always treated as pseudo-elements.
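
The NAME matching rules above can be sketched as follows. The helpers are hypothetical, they expect the pseudo name without any functional arguments, and they omit the vendor-prefix handling described later; applying the legacy single-colon rule to the pseudo found in the selector is an assumption about where that rule takes effect.

```go
package main

import (
	"fmt"
	"strings"
)

// legacyElements are the four pseudo-elements that may be written with a
// single colon.
var legacyElements = map[string]bool{
	"before": true, "after": true, "first-line": true, "first-letter": true,
}

// splitPseudo splits ":hover" or "::before" into its lowercase name and
// whether it denotes a pseudo-element.
func splitPseudo(p string) (name string, isElement bool) {
	p = strings.ToLower(p)
	if rest, ok := strings.CutPrefix(p, "::"); ok {
		return rest, true
	}
	name = strings.TrimPrefix(p, ":")
	return name, legacyElements[name]
}

// matchPseudo reports whether a policy NAME pattern matches a pseudo as
// written in a selector. A leading "::" restricts the pattern to
// pseudo-elements, a leading ":" to pseudo-classes, and a bare name
// matches either. A trailing * glob is supported.
func matchPseudo(pattern, pseudo string) bool {
	name, isElement := splitPseudo(pseudo)
	pattern = strings.ToLower(pattern)
	switch {
	case strings.HasPrefix(pattern, "::"):
		if !isElement {
			return false
		}
		pattern = pattern[2:]
	case strings.HasPrefix(pattern, ":"):
		if isElement {
			return false
		}
		pattern = pattern[1:]
	}
	if prefix, ok := strings.CutSuffix(pattern, "*"); ok {
		return strings.HasPrefix(name, prefix)
	}
	return pattern == name
}

func main() {
	fmt.Println(matchPseudo("hover", ":hover"))
	fmt.Println(matchPseudo("::before", ":before"))
	fmt.Println(matchPseudo("::hover", ":hover"))
	fmt.Println(matchPseudo("first-*", "::first-line"))
}
```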

When a pseudo with action strip matches anywhere within a complex (comma-separated) selector — including nested inside functional pseudo-classes such as :has(), :is(), :not(), :where() — that entire complex selector is dropped; if every complex selector in a ruleset is dropped, the whole ruleset is removed. This only ever narrows which elements a style applies to. As with other CSS verbs, evaluation is last-match-wins per pseudo and unmatched pseudos pass through, so an allowlist is written as "css-strip-pseudo *" then "css-allow-pseudo hover,focus". Pseudo filtering does not apply to inline style attributes (which have no selectors) or to at-rule preludes such as @page :first or @keyframes stops.

The css-*-media verbs filter CSS @media query blocks (and @import media conditions) by media feature. Each comma-separated entry is a media feature or media type with an optional ":VALUE", e.g. "prefers-color-scheme:dark", "prefers-*", or "print"; both FEATURE and VALUE support a trailing * glob, omitting ":VALUE" matches the feature regardless of value, and a bare * matches any @media block. When a strip rule matches, the whole @media block (or @import) is removed. Matching is feature-presence based: the prelude is scanned through "not"/"only"/"and"/"or" combinators, so "not (prefers-color-scheme: dark)" is matched on presence, not semantically inverted, and modern range syntax such as "(width >= 600px)" is matched only by feature name, not value. css-*-media is more specific than the @media keyword, so a matching media rule overrides any css-*-at decision for that block (e.g. "css-strip-media *" then "css-allow-media screen" keeps only screen queries). There is no defang variant — like css-*-pseudo, media queries gate block applicability rather than carrying a value to neutralize. Like css-*-pseudo, css-*-media does not apply to inline style attributes (a declaration list has no at-rules) — only to <style> elements, stylesheets (Policy.ApplyCSS), and data:text/css. A bare "css-strip-media *" strips every @media block, and every @import that carries a media condition, but leaves a plain @import (no media condition) to the css-*-at rules.
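
Presence-based matching can be pictured with the simplified scanner below. It is a sketch under stated assumptions, not the library's parser: the feature type and both functions are hypothetical, only ASCII preludes are considered, and range syntax is reduced to the feature name with no value.

```go
package main

import (
	"fmt"
	"strings"
)

// feature is one media type or media feature found in an @media prelude.
type feature struct {
	name  string
	value string // empty for media types, boolean features, and range syntax
}

func isIdent(c byte) bool {
	return c == '-' || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
}

// scanPrelude collects every media type and feature present in the
// prelude, skipping the not/only/and/or combinators.
func scanPrelude(prelude string) []feature {
	s := strings.ToLower(prelude)
	var out []feature
	i := 0
	for i < len(s) {
		switch {
		case s[i] == '(':
			i++
			for i < len(s) && s[i] == ' ' {
				i++
			}
			start := i
			for i < len(s) && isIdent(s[i]) {
				i++
			}
			name := s[start:i]
			if name == "" || (name[0] >= '0' && name[0] <= '9') {
				continue // nested group or leading value: keep scanning
			}
			end := strings.IndexByte(s[i:], ')')
			if end < 0 {
				end = len(s) - i
			}
			rest := strings.TrimSpace(s[i : i+end])
			value := ""
			if strings.HasPrefix(rest, ":") {
				value = strings.TrimSpace(rest[1:])
			}
			out = append(out, feature{name, value})
			i += end
		case isIdent(s[i]):
			start := i
			for i < len(s) && isIdent(s[i]) {
				i++
			}
			switch word := s[start:i]; word {
			case "not", "only", "and", "or":
			default:
				out = append(out, feature{name: word})
			}
		default:
			i++
		}
	}
	return out
}

// globMatch supports an optional trailing * in the pattern.
func globMatch(pattern, s string) bool {
	if prefix, ok := strings.CutSuffix(pattern, "*"); ok {
		return strings.HasPrefix(s, prefix)
	}
	return pattern == s
}

// entryMatches reports whether a FEATURE[:VALUE] entry matches a prelude
// by presence alone; "not (...)" is not semantically inverted.
func entryMatches(entry, prelude string) bool {
	entry = strings.ToLower(entry)
	if entry == "*" {
		return true
	}
	name, value, hasValue := strings.Cut(entry, ":")
	for _, f := range scanPrelude(prelude) {
		if globMatch(name, f.name) && (!hasValue || globMatch(value, f.value)) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(entryMatches("prefers-color-scheme:dark", "not (prefers-color-scheme: dark)"))
	fmt.Println(entryMatches("width", "(width >= 600px)"))
	fmt.Println(entryMatches("screen", "print and (color)"))
}
```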

A common use is forcing light mode by stripping dark-mode overrides ("css-strip-media prefers-color-scheme:dark"); see the strip-dark-mode built-in preset.

Property (css-strip/css-allow/css-defang), at-rule (css-*-at), scheme TARGET (css-*-scheme), and pseudo (css-*-pseudo) NAME matching is vendor-prefix-insensitive: a bare pattern matches both the canonical name and every vendor-prefixed variant, because browsers alias -webkit-foo, -moz-foo, -ms-foo, etc. to foo. So css-strip transform also strips -webkit-transform, css-strip-at keyframes also strips @-webkit-keyframes, and css-strip-pseudo scrollbar also drops ::-webkit-scrollbar. An explicitly hyphen-prefixed pattern (e.g. css-strip -webkit-transform) matches only that single variant — the escape hatch for targeting one vendor. Matching is name-only and never rewrites the surviving bytes; custom property names (--x) are not treated as vendor-prefixed.
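
The vendor-prefix rule can be sketched as a comparison against the name with one vendor prefix removed. The helper names are hypothetical, and wildcard patterns are deliberately left out of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalName strips one vendor prefix (-webkit-, -moz-, -ms-, ...) from
// a CSS name. Custom properties (--x) are returned unchanged.
func canonicalName(name string) string {
	if !strings.HasPrefix(name, "-") || strings.HasPrefix(name, "--") {
		return name
	}
	if i := strings.IndexByte(name[1:], '-'); i > 0 {
		return name[i+2:]
	}
	return name
}

// matchName reports whether a policy pattern matches a CSS name. A bare
// pattern matches the canonical name and every vendor-prefixed variant;
// a pattern that itself starts with a hyphen matches only that variant.
func matchName(pattern, name string) bool {
	pattern, name = strings.ToLower(pattern), strings.ToLower(name)
	if strings.HasPrefix(pattern, "-") {
		return pattern == name
	}
	return pattern == name || pattern == canonicalName(name)
}

func main() {
	fmt.Println(matchName("transform", "-webkit-transform"))
	fmt.Println(matchName("-webkit-transform", "-moz-transform"))
	fmt.Println(matchName("x", "--x"))
}
```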

CSS allowlist example:

css-strip *
css-allow color,background-color,font-size,font-family,font-weight
css-allow text-align,text-decoration,margin,padding,border
css-strip-value expression(*)
css-strip-value url(*)
css-strip-at *

CSS blocklist example:

css-strip -moz-binding,behavior
css-strip-value expression(*)
css-strip-at import
css-strip-scheme * javascript,vbscript,data,blob

Standalone CSS sanitization is available via Policy.ApplyCSS (stylesheets) and Policy.ApplyInlineCSS (declaration lists).

Other Verbs ¶

prefix NAME                  set the prefix for defang/comment-out names (default "htmlpolicy")
placeholder-label LABEL      customize the label for placeholder (default "removed")
strip-comments               remove HTML comments
include NAME-OR-PATH         inline another policy (loaded via [Resolver])
include builtin:NAME         inline a built-in preset (no [Resolver] needed)

Lines starting with # are comments. Only full-line comments are supported.

Linting ¶

Policy.Lint statically analyzes a compiled policy for likely mistakes (without applying it) and returns advisory Warning values. It flags: URL attributes reopened with no scheme baseline; allow-scheme rules that reopen a dangerous scheme that an earlier scheme rule stripped (javascript/vbscript, or data: when no content-type rule governs the attribute); rules made dead by a later same-family rule (last match wins by order, so an allow placed before a universal strip never fires); and "unwrap-tag *" used as a catch-all without strip rules for active-content elements. The CLI exposes this as "htmlpolicy -lint".

Built-in Presets ¶

The library ships a small set of maintained policy presets, includable as "include builtin:NAME" from any policy with no Resolver configured. They are ordinary policy text — fully inlined by Policy.String, and overridable by any later rule under last-match-wins — so they are a safe, reviewable starting point rather than hidden engine behavior. Upgrading the library picks up preset improvements (e.g. coverage for newly recognized HTML features). Available presets:

builtin:blocklist        strong blocklist baseline (strips active-content
                         elements/attributes, dangerous URL schemes, and
                         XML/SVG data URIs; everything else passes through)
builtin:allowlist-base   fail-closed skeleton (strips all elements,
                         attributes, schemes, content types, and CSS;
                         follow it with your own allow rules)
builtin:url-safe         restrict every URL attribute to http, https,
                         mailto, and relative URLs (pair with rules that
                         reopen URL-bearing attributes)
builtin:strip-dark-mode  remove dark-mode signals: prefers-color-scheme
                         dark @media overrides, the color-scheme property
                         and <meta>, and Apple Mail's legacy
                         supported-color-schemes property and <meta>

builtin:strip-dark-mode strips the dark-mode mechanism, not author-chosen colors: a page that defaults to light and adds a dark @media override (the common case) renders light afterward, but a page hardcoded dark with no media query is not repainted. It does not guarantee a light result.

Use BuiltinPolicy to fetch a preset's text in Go and BuiltinPolicyNames to list them. Because allowlist-base reopens nothing on its own, building an allowlist on top of it fails safe: forgetting to re-allow a rule family drops content rather than passing it through.

URL Rewriting ¶

WithApplyURLRewriter is an ApplyOption that sets a callback that receives every URL in the sanitized output and can replace it. The rewriter is scoped to a single Apply call, so it may safely close over per-call state (caches, counters, accumulators) without concurrency concerns. The rewriter runs after policy evaluation and base URL resolution. It covers HTML URL attributes, CSS url() values, srcset entries, SVG url() attributes, meta refresh URLs, and SMIL animation values. URLs inside recursively-sanitized data:text/html content are also rewritten. Stripped and defanged URLs are excluded. Fragment-only references, empty URLs, and data: URIs themselves are excluded (but URLs inside data URI content are recursed into). The option also works with Policy.ApplyCSS and Policy.ApplyInlineCSS.

WithOriginalURLAttr preserves the pre-rewrite value of every HTML attribute the rewriter changed as a companion "{prefix}-original-{attrname}" attribute on the same element, so the original URL can be recovered from the output. Pre-existing attributes matching that reserved pattern are stripped from the input first (spoof prevention); WithKeepOriginalURLAttrs opts out of the stripping for trusted re-processing of this library's own prior output.

WithApplyURLPrefetcher is a companion ApplyOption that receives the full set of URLs the rewriter would be called with — same set, same resolved form, same dedup, in document order — once per call, before the rewrite pass. It lets a caller warm a cache (e.g. fetch image and font resources in parallel) that the subsequent synchronous rewriter reads from. Each URLRef carries a live URLContext (including GetAttr) so the caller can inspect sibling attributes when deciding what to fetch.

Output Normalization ¶

By default, Policy.ApplyHTML and Policy.ApplyDocument re-serialize their output through Go's HTML5 parser, even when no rules modified the content. This prevents parser-differential attacks. Use WithPreserveOriginal to return the original bytes when no rules modify the content.

Embedded Content Sanitization ¶

HTML embedded inside attribute values is recursively sanitized: data:text/html URIs, srcdoc attributes, and meta refresh URLs. Recursion depth is limited to 16 levels; at the limit, content is stripped entirely.

SVG/XHTML/XML data: URIs are stripped by default rather than recursed, because the HTML5 parser cannot match a browser's XML parser. An allow-content-type rule (e.g. "allow-content-type img src image/svg+xml") opts a given context back into best-effort recursion, accepting the limitation below. When recursing, namespace-prefixed elements (e.g. <a:script xmlns:a="...">) are stripped, but XML custom entity declarations (<!ENTITY x "<script>...">), CDATA sections, and processing instructions cannot be safely sanitized this way — so opt in only for content you trust or contexts where browsers disable scripting (e.g. <img>). See Limitations.

Namespace Validation (mXSS Prevention) ¶

After applying policy rules, namespace consistency is validated. Elements whose namespace is invalid for their DOM position are stripped. This prevents mutation XSS attacks where foreign-content elements end up in the wrong namespace after policy actions change the tree structure.

SVG SMIL Animation Sanitization ¶

SMIL animation elements (<animate>, <set>, etc.) are sanitized by applying the same attribute and scheme policy rules to animated values. This prevents runtime bypasses via attributeName="href" values="javascript:alert(1)". CSS in animated style values is sanitized through the CSS engine.

Limitations ¶

CSS selectors in <style> elements are not filtered. Attribute selectors combined with url() values can exfiltrate data. The url() scheme filtering mitigates this, but full protection requires stripping <style> elements.

CSS var()/env()/attr() substitution happens at browser computed-value time, not parse time. Known attack vectors are handled (url() in custom properties, @import var(), var() fallback values, var()/env()/attr() in URL-accepting functions). When css-*-scheme rules are active, var()/env()/attr() inside URL-accepting functions (image-set, -webkit-image-set, cross-fade, image, src) is stripped as a precaution. Bare string arguments inside these functions are evaluated as URLs and run through the full url()-token path: css-*-scheme rules, css-*-content-type rules, and recursive sanitization of data: URI content. CSS escape sequences in the url() (or attr()) function name itself (e.g. \75rl(...), \61ttr(...)) bypass the lexer's URLToken recognition, so when css-*-scheme rules are active any FunctionToken whose escape-decoded name is "url" causes the declaration to be stripped. attr() with a url-typed substitution — "attr(name url)" or "attr(name type(<url>))" — also causes the declaration to be stripped, since the value of the named HTML attribute would be loaded as a URL at computed time.
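The escaped-function-name case above can be illustrated with a small standalone decoder. This is a simplified sketch, not the library's lexer: decodeCSSIdent and isURLFunction are hypothetical helper names, and the decoder handles only the two basic CSS escape forms (a backslash followed by one to six hex digits and an optional single whitespace character, or a backslash followed by a literal character).

```go
package main

import (
	"fmt"
	"strings"
)

func isHex(c rune) bool {
	return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')
}

func hexVal(c rune) rune {
	switch {
	case c >= '0' && c <= '9':
		return c - '0'
	case c >= 'a' && c <= 'f':
		return c - 'a' + 10
	default:
		return c - 'A' + 10
	}
}

// decodeCSSIdent resolves CSS escape sequences in an identifier.
// Hypothetical helper for illustration; not part of htmlpolicy's API.
func decodeCSSIdent(s string) string {
	var b strings.Builder
	r := []rune(s)
	for i := 0; i < len(r); i++ {
		if r[i] != '\\' || i+1 >= len(r) {
			b.WriteRune(r[i])
			continue
		}
		// Collect up to six hex digits after the backslash.
		j, cp := i+1, rune(0)
		for j < len(r) && j-i <= 6 && isHex(r[j]) {
			cp = cp*16 + hexVal(r[j])
			j++
		}
		if j == i+1 {
			// Not a hex escape: the next character is taken literally.
			b.WriteRune(r[j])
			i = j
			continue
		}
		if cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) {
			cp = 0xFFFD // replacement character for invalid code points
		}
		b.WriteRune(cp)
		// One whitespace character after a hex escape belongs to the escape.
		if j < len(r) && (r[j] == ' ' || r[j] == '\t' || r[j] == '\n') {
			j++
		}
		i = j - 1
	}
	return b.String()
}

// isURLFunction reports whether a function name decodes to "url".
func isURLFunction(name string) bool {
	return strings.EqualFold(decodeCSSIdent(name), "url")
}

func main() {
	for _, name := range []string{`url`, `\75rl`, `\61ttr`, `rgb`} {
		fmt.Printf("%-8s decoded=%q isURL=%v\n", name, decodeCSSIdent(name), isURLFunction(name))
	}
}
```

Comparing the decoded name rather than the raw token text is what lets a check like this catch \75rl(...) even though a lexer would classify it as an ordinary function token.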

SVG/XHTML/XML data: URIs cannot be fully sanitized, so they are stripped by default. Browsers parse them with an XML parser; this library uses an HTML5 parser. If you opt a context back into recursion with an allow-content-type rule, sanitization is best-effort only: namespace-prefixed elements are stripped as defense-in-depth, but XML custom entity expansion (<!ENTITY x "<script>...">, then &x;), CDATA sections, and processing instructions are XML-only constructs the HTML5 parser treats as inert text or comments — so an XML-parsing browser may execute content the sanitizer considered safe. Only opt in for trusted content or scripting-disabled contexts (e.g. <img>). The CSS url() path is an exception: SVG/XML data URIs there are left as best-effort recursion because a CSS url() loads in the browser's secure mode (no scripting).

Limits: data URI recursion depth 16, HTML nesting depth 512, and CSS nesting depth 128 are hardcoded; output size defaults to 10x the input size (configurable via WithMaxOutputFactor, minimum 32KB).

Index ¶

Constants
func BuiltinPolicy(name string) (text string, ok bool)
func BuiltinPolicyNames() []string
func ConvertToUTF8(content []byte, contentType string) ([]byte, bool)
type Action
type ApplyOption
- func WithApplyURLPrefetcher(fn func([]URLRef)) ApplyOption
- func WithApplyURLRewriter(fn URLRewriter) ApplyOption
- func WithKeepOriginalURLAttrs() ApplyOption
- func WithMaxOutputFactor(factor float64) ApplyOption
- func WithOriginalURLAttr(filter func(URLContext) bool) ApplyOption
- func WithPreserveOriginal() ApplyOption
- func WithVerboseLog(w io.Writer) ApplyOption
type AttrRule
type CSSAtRule
type CSSContentTypeRule
type CSSMediaRule
type CSSPropertyRule
type CSSPseudoRule
type CSSSchemeRule
type CSSValueRule
type ContentTypeRule
type FileResolver
- func (r FileResolver) Resolve(from, name string) (canonical, text string, err error)
type ParseOption
- func WithMaxIncludeDepth(n int) ParseOption
- func WithPrefix(prefix string) ParseOption
- func WithResolver(r Resolver) ParseOption
type Policy
- func Parse(input string, opts ...ParseOption) (*Policy, error)
- func (p *Policy) ApplyCSS(content []byte, opts ...ApplyOption) ([]byte, bool, error)
- func (p *Policy) ApplyDocument(content []byte, opts ...ApplyOption) ([]byte, bool, error)
- func (p *Policy) ApplyHTML(content []byte, opts ...ApplyOption) ([]byte, bool, error)
- func (p *Policy) ApplyInlineCSS(content []byte, opts ...ApplyOption) ([]byte, bool, error)
- func (p *Policy) Lint() []Warning
- func (p *Policy) String() string
- func (p *Policy) URLSchemeAction(element, attr, rawURL string) Action
type Resolver
type SchemeRule
type TagRule
type URLContext
type URLRef
type URLRewriter
type Warning
- func (w Warning) String() string

Constants ¶

const DefaultMaxIncludeDepth = 64

DefaultMaxIncludeDepth is the default maximum nesting depth for includes.

Functions ¶

func BuiltinPolicy ¶ added in v0.9.0

func BuiltinPolicy(name string) (text string, ok bool)

BuiltinPolicy returns the raw policy text of the named built-in preset, or ok=false if no such preset exists. The name is the bare preset name without the "builtin:" prefix (e.g. "blocklist", "allowlist-base", "url-safe").

Built-in presets are ordinary policy text maintained with the library. They can be applied directly via an include directive — "include builtin:NAME", which works in any policy without configuring a Resolver — or fetched with this function to inspect or compose them in Go. Use BuiltinPolicyNames to list the available presets.

Because presets are plain policy text, later rules in a policy override them under last-match-wins semantics, and the fully resolved policy (with presets inlined) is visible via Policy.String.
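As a rough illustration of last-match-wins, the sketch below evaluates a list of name rules in order and lets the final matching rule decide. The rule and action types, the matches and resolve helpers, and the sample rules are all hypothetical stand-ins, not the library's internals; the "allow" default reflects the README's statement that anything not matched by a rule passes through unchanged.

```go
package main

import (
	"fmt"
	"strings"
)

// action and rule are simplified stand-ins for illustration only.
type action string

type rule struct {
	pattern string // name, optionally with a trailing "*" glob
	act     action
}

// matches handles an exact name or a trailing-"*" prefix glob.
func matches(pattern, name string) bool {
	if strings.HasSuffix(pattern, "*") {
		return strings.HasPrefix(name, strings.TrimSuffix(pattern, "*"))
	}
	return pattern == name
}

// resolve walks every rule in order; the last matching rule wins.
func resolve(rules []rule, name string) action {
	result := action("allow") // unmatched names pass through
	for _, r := range rules {
		if matches(r.pattern, name) {
			result = r.act
		}
	}
	return result
}

func main() {
	rules := []rule{
		{"on*", "strip"},      // e.g. contributed by an included preset
		{"onclick", "defang"}, // later rule in the including policy
	}
	for _, n := range []string{"onclick", "onload", "title"} {
		fmt.Println(n, "->", resolve(rules, n))
	}
}
```

Because the loop never exits early, a rule placed after an include can override a preset for one name without disturbing the preset's other matches.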

func BuiltinPolicyNames ¶ added in v0.9.0

func BuiltinPolicyNames() []string

BuiltinPolicyNames returns the names of all built-in presets in sorted order, each usable as "include builtin:NAME" or passed to BuiltinPolicy.

func ConvertToUTF8 ¶

func ConvertToUTF8(content []byte, contentType string) ([]byte, bool)

ConvertToUTF8 converts content from the charset specified in contentType to UTF-8. The contentType should be a MIME type with optional charset parameter (e.g. "text/html; charset=iso-8859-1"). If no charset is specified, encoding is determined by inspecting the content.

Returns the UTF-8 content and a boolean indicating whether conversion was performed. When the content is already UTF-8 (or the detected encoding matches UTF-8), the original slice is returned with false.

Limitation: UTF-7 is not supported. The underlying charset detector has no UTF-7 decoder, so content declaring charset=utf-7 is returned unchanged (with false). This matches modern browsers, which no longer sniff or decode UTF-7; the classic "+ADw-script+AD4-" obfuscation is therefore not auto-decoded here. Callers that must handle attacker-controlled UTF-7 should reject or pre-decode such input before sanitizing.

Example ¶

package main

import (
	"fmt"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	content := []byte("<p>caf\xe9</p>")
	utf8, converted := htmlpolicy.ConvertToUTF8(content, "text/html; charset=iso-8859-1")

	fmt.Println("converted:", converted)
	fmt.Println(string(utf8))
}

Output:
converted: true
<p>café</p>

Types ¶

type Action ¶

type Action int

Action determines what happens to a matched element or attribute.

Action is part of the stable API: Policy.URLSchemeAction returns it. The rule types that carry an Action field (TagRule, AttrRule, etc.) are exported for documentation only — Policy fields are unexported and no public function accepts or returns those rule types. Policy.URLSchemeAction returns only Allow, Strip, or Defang; the remaining constants arise solely from element/attribute verbs applied during sanitization.

const (
	// Strip removes the element and all of its content from the output.
	Strip Action = iota
	// CommentOut wraps the element and its content in an HTML comment
	// (<!-- ... -->), making it invisible to browsers while preserving
	// the content in the source for inspection.
	CommentOut
	// Placeholder replaces the element with a text label such as
	// "[removed: script]". The label word is configurable via the
	// placeholder-label policy directive.
	Placeholder
	// Demote converts the element to a safe generic container: inline
	// elements become <span>, all others become <div>. Allowed
	// attributes are preserved; the namespace is cleared.
	Demote
	// Allow keeps the element in the output unchanged.
	Allow
	// Defang neutralizes content while preserving the value for
	// inspection. It usually renames the attribute by inserting a prefix
	// (e.g. "htmlpolicy-defanged-onclick"). As a scheme rule on href of
	// <a>/<area> it instead prefixes the URL scheme in place with
	// "{prefix}-defanged" (e.g. "htmlpolicy-defanged:bitcoin:...") so the
	// link stays hoverable/copyable but inert — see the defang-scheme
	// documentation in the package overview.
	Defang
	// Unwrap removes the element's start and end tags but keeps its
	// children in place at the position the element occupied. The
	// element's attributes are discarded along with the wrapper. This
	// mirrors how browsers treat unknown elements (HTMLUnknownElement
	// is a transparent wrapper: the tag has no effect and children
	// render in flow). Void/empty elements with no children behave as
	// [Strip] — there is nothing to keep.
	//
	// Unwrap differs from [Demote]: Demote renames the wrapper to a
	// safe generic container (<div>/<span>), preserving the element
	// boundary, while Unwrap removes the wrapper entirely.
	//
	// Unwrap is appropriate as a catch-all for unknown structural
	// elements in an allowlist policy. It MUST be paired with explicit
	// strip rules for any element whose presence indicates active
	// content or whose text content should not be visible. At a
	// minimum, pair "unwrap-tag *" with strip rules for script, style,
	// iframe, object, embed, form, input, textarea, select, button,
	// link, base, and meta. Without those, raw <script>/<style> text
	// would leak into the output as visible text after the wrapper
	// is removed.
	Unwrap
)

type ApplyOption ¶

type ApplyOption func(*applyConfig)

ApplyOption configures a single call to Policy.ApplyHTML, Policy.ApplyDocument, Policy.ApplyCSS, or Policy.ApplyInlineCSS. Unlike ParseOption (which is baked into the compiled Policy and shared across all calls), ApplyOption values are scoped to one call. This allows per-call state — for example, a URL rewriter that closes over per-message caches or budgets, or a verbose log writer for a single request — without recompiling the policy.

func WithApplyURLPrefetcher ¶ added in v0.3.0

func WithApplyURLPrefetcher(fn func([]URLRef)) ApplyOption

WithApplyURLPrefetcher registers a callback invoked once per Apply call, after parsing, policy evaluation, and base URL resolution, but before the URL rewrite pass. It receives every URL that a URLRewriter set via WithApplyURLRewriter would be invoked with in this same call — the same set, the same resolved form, the same deduplication semantics, and the same order (document order for top-level URLs; URLs inside recursively sanitized data: content appear at the point their containing attribute is processed). The callback returns nothing; it exists so the caller can warm a cache — for example, fetching image and font resources in parallel — that the subsequent synchronous URLRewriter then reads from.

Each URLRef.Context.GetAttr is live during the callback and returns the element's attributes, so the caller can inspect sibling attributes (width, height, style, type, rel, …) to decide whether to fetch.

The prefetcher and rewriter are independent: either, both, or neither may be set. When both are set the prefetcher runs first, to completion, and then the normal rewrite walk runs. When no rewriter is set the prefetcher is still invoked (with the set the rewriter would have seen). The callback is always invoked exactly once per Apply call, even when no URLs are found (with an empty slice).

Implementation note: enabling a prefetcher runs the deterministic sanitization pipeline a second time — a recording-only pass that collects the URL set without mutating output — so the prefetch set is guaranteed to match the rewrite pass exactly, including URLs at every recursion depth. This roughly doubles the parse/sanitize CPU for the call (the network fetches it enables run once, in parallel). Callers that do not set a prefetcher pay nothing. The recording pass is silent: it runs with no verbose log writer, so WithVerboseLog never produces duplicate lines.

Like WithApplyURLRewriter, the callback is scoped to a single Apply call and may safely close over per-call state without concurrency concerns. It is invoked synchronously from within Apply; htmlpolicy itself introduces no goroutines (the caller may fan out internally).

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`allow-tag *`)
	if err != nil {
		log.Fatal(err)
	}

	// The prefetcher receives every URL up front, so a caller can fetch them
	// in parallel (with its own concurrency limits) and warm a cache that the
	// synchronous rewriter then reads from.
	cache := map[string]string{}
	prefetch := func(refs []htmlpolicy.URLRef) {
		for i, r := range refs {
			// A real implementation would fetch r.URL here, in parallel.
			cache[r.URL] = fmt.Sprintf("cid:image%03d", i+1)
		}
	}
	rewriter := func(ctx htmlpolicy.URLContext, u string) string {
		if cid, ok := cache[u]; ok {
			return cid
		}
		return u
	}

	input := []byte(`<img src="https://example.com/a.jpg"/><img src="https://example.com/b.jpg"/>`)
	output, _, err := policy.ApplyHTML(input,
		htmlpolicy.WithApplyURLPrefetcher(prefetch),
		htmlpolicy.WithApplyURLRewriter(rewriter))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}

Output:
<img src="cid:image001"/><img src="cid:image002"/>

func WithApplyURLRewriter ¶

func WithApplyURLRewriter(fn URLRewriter) ApplyOption

WithApplyURLRewriter sets a function that rewrites URLs in the sanitized output for this single Apply call. The rewriter runs after policy evaluation and base URL resolution with each URL's final resolved form. It receives a URLContext describing where the URL was found and returns a replacement URL (or the original to keep it unchanged).

The rewriter covers HTML URL attributes (href, src, action, etc.), CSS url() values in style attributes and <style> elements (including bare-string URL arguments of image-set()/cross-fade()/image()/src(), which browsers load exactly like url() tokens), @import URLs, srcset entries, SVG url() attributes, meta refresh URLs, and SMIL animation values. URLs inside recursively-sanitized data:text/html content are also rewritten. Stripped and defanged URLs are excluded.

Fragment-only references (#id), empty URLs, and data: URIs are excluded from the callback. However, URLs inside data URI content are recursed into — e.g. url(img.png) inside a data:text/css URI is presented to the rewriter even though the data URI itself is not.

The rewriter is per-URL: its output is written verbatim into the slot the URL came from and is not re-sanitized. To prevent one replacement from expanding into several entries, a replacement containing a structural separator of its slot is rejected and the original URL is kept. The separators are ';' for SMIL animation values lists; ASCII whitespace for srcset candidates and space-separated URL lists (ping/archive); and parentheses, quotes, or whitespace for SVG url() functional attributes. (HTML attribute values are always quote-escaped on output, so a replacement cannot inject markup regardless.)
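The separator rule can be pictured with a short standalone check. This is only a sketch under stated assumptions: slotKind, its constants, forbidden, and acceptReplacement are hypothetical names invented here, and the exact character sets (in particular which quote characters count for SVG url() attributes) are a guess at the behaviour described, not the library's code.

```go
package main

import (
	"fmt"
	"strings"
)

// slotKind labels the kind of position a URL was read from.
// These constants are hypothetical and exist only for this sketch.
type slotKind int

const (
	slotPlain     slotKind = iota // single-URL attribute such as href or src
	slotSMILList                  // SMIL values list, entries separated by ';'
	slotSpaceList                 // srcset candidate or ping/archive list
	slotSVGFunc                   // SVG url() functional attribute
)

// forbidden returns the characters a replacement must not contain.
func forbidden(k slotKind) string {
	const ws = " \t\n\f\r" // ASCII whitespace
	switch k {
	case slotSMILList:
		return ";"
	case slotSpaceList:
		return ws
	case slotSVGFunc:
		return "()\"'" + ws // assumed set of parentheses and quotes
	default:
		return ""
	}
}

// acceptReplacement returns the replacement, or the original URL when the
// replacement contains a separator that could split it into several entries.
func acceptReplacement(k slotKind, original, replacement string) string {
	if strings.ContainsAny(replacement, forbidden(k)) {
		return original
	}
	return replacement
}

func main() {
	fmt.Println(acceptReplacement(slotSMILList, "a.png", "b.png;c.png"))
	fmt.Println(acceptReplacement(slotSpaceList, "a.png", "https://cdn.example/a.png"))
	fmt.Println(acceptReplacement(slotSVGFunc, "#clip", "x) url(y"))
}
```

Falling back to the original URL, rather than stripping the slot, keeps a buggy rewriter from silently removing content that the policy had already approved.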

Because the rewriter is scoped to a single Apply call, it may safely close over per-call state (caches, counters, accumulators) without concurrency concerns. Two goroutines calling Apply on the same Policy with different rewriters do not share state.

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip-tag script,style
		strip-attr * on*
	`)
	if err != nil {
		log.Fatal(err)
	}

	rewriter := func(ctx htmlpolicy.URLContext, u string) string {
		// Replace external URLs with CID references (for email embedding).
		if ctx.Element == "img" && ctx.Attr == "src" {
			return "cid:image001"
		}
		return u
	}

	input := []byte(`<p>Hello</p><img src="https://example.com/photo.jpg" alt="photo"/>`)
	output, _, err := policy.ApplyHTML(input, htmlpolicy.WithApplyURLRewriter(rewriter))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}

Output:
<p>Hello</p><img src="cid:image001" alt="photo"/>

func WithKeepOriginalURLAttrs ¶ added in v0.20.0

func WithKeepOriginalURLAttrs() ApplyOption

WithKeepOriginalURLAttrs disables the spoof-prevention stripping performed by WithOriginalURLAttr: pre-existing "{prefix}-original-*" attributes in the input are kept, and an attribute that already has a companion annotation is never re-annotated — the earliest preserved value wins across repeated Apply passes.

Set this only when re-processing this library's own prior output (or input that is otherwise trusted): it is an explicit trust assertion, and with it set an attacker-supplied {prefix}-original-href in the input survives as if it were a preserved value.

It has no effect unless WithOriginalURLAttr is also set.

func WithMaxOutputFactor ¶

func WithMaxOutputFactor(factor float64) ApplyOption

WithMaxOutputFactor sets the maximum allowed output size for this Apply call as a multiplier of the input size. For example, a factor of 10.0 means the output may be at most 10x the input size (with a minimum of 32KB). This guards against amplification attacks such as a long <base href> resolved into many short relative URLs. The default is 10.0. Set to 0 to disable the limit.
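The arithmetic behind the limit can be sketched as follows. This is an assumption-laden illustration: maxOutputBytes and minOutputLimit are hypothetical names, 32KB is taken to mean 32*1024 bytes, and the floor is assumed to apply whenever the multiplied size falls below it.

```go
package main

import "fmt"

// minOutputLimit is the 32KB floor; 32*1024 bytes is assumed here.
const minOutputLimit = 32 * 1024

// maxOutputBytes returns the output size limit for an input of inputLen
// bytes. limited is false when factor is 0, which disables the check.
// Hypothetical helper; not part of htmlpolicy's API.
func maxOutputBytes(inputLen int, factor float64) (limit int, limited bool) {
	if factor == 0 {
		return 0, false
	}
	limit = int(float64(inputLen) * factor)
	if limit < minOutputLimit {
		limit = minOutputLimit // small inputs still get the floor
	}
	return limit, true
}

func main() {
	for _, n := range []int{12, 100000} {
		limit, limited := maxOutputBytes(n, 10.0)
		fmt.Printf("input=%d limit=%d limited=%v\n", n, limit, limited)
	}
}
```

Under these assumptions a 12-byte input is bounded by the floor rather than by 120 bytes, which is why small inputs are effectively always accepted.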

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip-tag script
	`)
	if err != nil {
		log.Fatal(err)
	}

	// Small input is always allowed (32KB minimum).
	output, _, err := policy.ApplyHTML([]byte("<p>Hello</p>"), htmlpolicy.WithMaxOutputFactor(2.0))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<p>Hello</p>

func WithOriginalURLAttr ¶ added in v0.20.0

func WithOriginalURLAttr(filter func(URLContext) bool) ApplyOption

WithOriginalURLAttr preserves, for this single Apply call, the pre-rewrite value of every HTML element attribute that a URLRewriter (set via WithApplyURLRewriter) changed. After the rewrite pass, each changed attribute gains a companion attribute on the same element:

{prefix}-original-{attrname}="{pre-rewrite value}"

For example, with the default prefix, rewriting <a href="https://x/page"> to a proxy URL produces:

<a href="https://proxy/abc" htmlpolicy-original-href="https://x/page">

so the original URL can be recovered from the output. The annotation is per node: two identical elements are annotated independently, even though rewriter calls for identical URL+context pairs are deduplicated.

filter, if non-nil, restricts which attributes are annotated. It receives a URLContext with Element, Attr, Parent, and GetAttr populated and CSSProperty always empty — annotation is attribute-granular, not URL-granular. Return true to annotate. A nil filter annotates every changed attribute.

Granularity and coverage:

- The whole pre-rewrite attribute value is preserved. For multi-URL attributes (srcset, ping) that is the entire original list; for style attributes whose CSS url() values were rewritten it is the entire original declaration list.
- The preserved value is the post-sanitization, post-base-resolution value (exactly what the rewrite pass saw, never the raw input), so annotations cannot resurrect content the policy stripped or defanged.
- URLs with no attribute to annotate are not preserved: <style> element contents, standalone Policy.ApplyCSS / Policy.ApplyInlineCSS (where this option is a no-op), and URLs inside data:text/css content (though the containing attribute, whose data: URI value changed, is itself annotated).
- Annotation runs at all recursion depths, so elements inside recursively sanitized data:text/html content are annotated within that content.
- A namespaced attribute contributes its namespace to the companion name: an SVG xlink:href is preserved as {prefix}-original-xlink-href.

Spoof prevention: whenever this option is set, attributes whose name matches the reserved "{prefix}-original-" pattern (case-insensitive) are stripped from the input before annotation, so an attacker-supplied {prefix}-original-href can never masquerade as a value this library preserved. See WithKeepOriginalURLAttrs to opt out when re-processing trusted prior output. Without this option, such attributes are ordinary attributes governed by normal policy rules.
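The naming scheme and the reserved-pattern check can be sketched in a few lines. Both originalAttrName and isReserved are hypothetical helpers written for this illustration; they follow the naming described above but are not the library's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// originalAttrName builds the companion attribute name. namespace is the
// attribute's namespace prefix ("" for none), e.g. "xlink" for xlink:href.
// Hypothetical helper; not part of htmlpolicy's API.
func originalAttrName(prefix, namespace, attr string) string {
	name := prefix + "-original-"
	if namespace != "" {
		name += namespace + "-"
	}
	return name + attr
}

// isReserved reports whether an input attribute name falls under the
// reserved "{prefix}-original-" pattern, compared case-insensitively.
func isReserved(prefix, attrName string) bool {
	return strings.HasPrefix(strings.ToLower(attrName), strings.ToLower(prefix)+"-original-")
}

func main() {
	fmt.Println(originalAttrName("htmlpolicy", "", "href"))
	fmt.Println(originalAttrName("htmlpolicy", "xlink", "href"))
	fmt.Println(isReserved("htmlpolicy", "HTMLPolicy-Original-Href"))
	fmt.Println(isReserved("htmlpolicy", "data-original-href"))
}
```

The case-insensitive comparison matters because HTML attribute names are case-insensitive, so a mixed-case spoof must be treated the same as the lowercase form.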

Annotations count toward the output size limit (WithMaxOutputFactor).

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`strip-tag script`)
	if err != nil {
		log.Fatal(err)
	}

	// Route link targets through a redirector, but keep the original URL
	// recoverable from the output.
	rewriter := func(ctx htmlpolicy.URLContext, u string) string {
		return "https://redirect.example/?1"
	}
	// Only annotate anchor hrefs; other rewritten attributes are not preserved.
	filter := func(ctx htmlpolicy.URLContext) bool {
		return ctx.Element == "a" && ctx.Attr == "href"
	}

	input := []byte(`<a href="https://example.com/page">link</a>`)
	output, _, err := policy.ApplyHTML(input,
		htmlpolicy.WithApplyURLRewriter(rewriter),
		htmlpolicy.WithOriginalURLAttr(filter))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}

Output:
<a href="https://redirect.example/?1" htmlpolicy-original-href="https://example.com/page">link</a>

func WithPreserveOriginal ¶

func WithPreserveOriginal() ApplyOption

WithPreserveOriginal configures this Apply call to return the original input bytes when no policy rules modify the content, instead of re-serializing through the HTML parser.

By default, ApplyHTML/ApplyDocument always re-serialize through Go's HTML5 parser, which normalizes malformed HTML. This is the safer default because it ensures the browser sees exactly the same structure the sanitizer saw.

Warning: enabling this option means that when no rules match, the original bytes are returned unmodified. If the input contains malformed HTML that Go's parser and the browser parse differently, this creates a parser-differential attack surface. Only enable this option if you trust the input to be well-formed HTML, or if you need byte-for-byte preservation of unmodified content (e.g., to avoid altering whitespace or attribute quoting in content that was not changed by policy rules).

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip-tag script
	`)
	if err != nil {
		log.Fatal(err)
	}

	// Unmodified content is returned as-is (byte-for-byte).
	input := []byte("<p>Hello</p>")
	output, modified, err := policy.ApplyHTML(input, htmlpolicy.WithPreserveOriginal())
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}

Output:
modified: false
<p>Hello</p>

func WithVerboseLog ¶

func WithVerboseLog(w io.Writer) ApplyOption

WithVerboseLog enables verbose logging of policy rule matches for this single Apply call. When set, every sanitization action (strip, defang, demote, comment-out, placeholder) writes a one-line description to w. Elements and attributes that pass through unchanged are not logged.

Because the option is scoped to one call, the writer need not be safe for concurrent use unless the same writer is passed to concurrent Apply calls.

type AttrRule ¶

type AttrRule struct {
	Selector string           // CSS selector text
	Matcher  cascadia.Matcher // compiled selector
	AttrName string           // attribute name pattern (may have trailing "*" glob)
	Action   Action           // Allow, Strip, or Defang
	Line     int              // source line number
}

AttrRule defines filtering for a specific attribute on matching elements.

type CSSAtRule ¶

type CSSAtRule struct {
	Name   string // e.g. "import", "font-face"
	Action Action // Allow, Strip, or Defang
	Line   int    // source line number
}

CSSAtRule filters CSS at-rules by name.

type CSSContentTypeRule ¶ added in v0.9.0

type CSSContentTypeRule struct {
	Properties []string // CSS property name patterns (trailing "*" glob ok, "*" = all, "@import" = import URLs)
	Types      []string // MIME type patterns (lowercase), e.g. "image/*", "image/svg+xml"
	Action     Action   // Allow, Strip, or Defang
	Line       int      // source line number
}

CSSContentTypeRule filters data: URIs in CSS url() values (and @import) by MIME type. It mirrors CSSSchemeRule but matches the data: URI's media type instead of the URL scheme. The Properties target is a CSS property name pattern list (or "@import"); the Types are MIME patterns (a single "*" wildcard allowed, e.g. "image/*", "*+xml").
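A single-wildcard MIME pattern of this kind can be matched by splitting the pattern at the "*" and comparing prefix and suffix. The matchMIME function below is a hypothetical sketch of that technique, not the library's matcher; it lowercases both sides on the assumption that media types compare case-insensitively.

```go
package main

import (
	"fmt"
	"strings"
)

// matchMIME reports whether mime matches pattern, where pattern may
// contain at most one "*" wildcard (e.g. "image/*", "*+xml", "*").
// Hypothetical helper; not part of htmlpolicy's API.
func matchMIME(pattern, mime string) bool {
	pattern = strings.ToLower(pattern)
	mime = strings.ToLower(mime)
	star := strings.IndexByte(pattern, '*')
	if star < 0 {
		return pattern == mime // no wildcard: exact match
	}
	prefix, suffix := pattern[:star], pattern[star+1:]
	// The length check stops prefix and suffix from overlapping in mime.
	return len(mime) >= len(prefix)+len(suffix) &&
		strings.HasPrefix(mime, prefix) &&
		strings.HasSuffix(mime, suffix)
}

func main() {
	fmt.Println(matchMIME("image/*", "image/png"))
	fmt.Println(matchMIME("*+xml", "image/svg+xml"))
	fmt.Println(matchMIME("image/*", "text/html"))
}
```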

type CSSMediaRule ¶ added in v0.14.0

type CSSMediaRule struct {
	Feature string // lowercase glob, e.g. "prefers-color-scheme", "prefers-*", "print", "*"
	Value   string // lowercase glob, e.g. "dark"; "" matches any value
	Action  Action // Allow or Strip
	Line    int    // source line number
}

CSSMediaRule filters CSS @media query blocks (and @import media conditions) by media feature. A rule matches an @media block when any media feature or media type in its prelude matches Feature (and, if Value is non-empty, the feature's value). When a rule with Action Strip matches, the whole @media block (or @import) is removed. Matching is feature-presence based: it scans through "not"/"only"/"and"/"or" wrappers, so "not (prefers-color-scheme: dark)" is matched on presence, not semantically inverted. Like CSSPseudoRule, there is no Defang variant — media queries gate block applicability, they do not carry a value to neutralize.

type CSSPropertyRule ¶

type CSSPropertyRule struct {
	Properties []string // property names (trailing "*" glob ok)
	Action     Action   // Allow, Strip, or Defang
	Line       int      // source line number
}

CSSPropertyRule filters CSS properties by name.

type CSSPseudoRule ¶ added in v0.4.0

type CSSPseudoRule struct {
	Kind    pseudoKind // any / class / element
	Pattern string     // lowercase name, optional trailing "*" glob
	Action  Action     // Allow or Strip
	Line    int        // source line number
}

CSSPseudoRule filters CSS selectors by pseudo-class / pseudo-element name. When a rule with Action Strip matches a pseudo anywhere within a complex (comma-separated) selector — including pseudos nested inside functional pseudo-classes such as :has(), :is(), :not(), :where() — that entire complex selector is dropped. If every complex selector in a ruleset's prelude is dropped, the whole ruleset is removed.

type CSSSchemeRule ¶

type CSSSchemeRule struct {
	Properties []string // CSS property name patterns (trailing "*" glob ok, "*" = all, "@import" = import URLs)
	Schemes    []string // e.g. ["javascript", "data", ":relative", "*"]
	Action     Action   // Allow, Strip, or Defang
	Line       int      // source line number
}

CSSSchemeRule filters URLs in CSS url() values by scheme.

type CSSValueRule ¶

type CSSValueRule struct {
	Pattern string // e.g. "url(*)", "expression(*)"
	Action  Action // Strip or Defang
	Line    int    // source line number
}

CSSValueRule filters CSS properties whose values match a pattern. Patterns ending in "(*)" match any value containing a call to that CSS function (e.g. "expression(*)" matches values containing "expression(...)"). Other patterns match the full value literally (case-insensitive).
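The two pattern forms can be approximated with plain string operations. This sketch is deliberately naive and rests on assumptions: matchValue is a hypothetical helper, and the substring search for "name(" stands in for whatever token-level matching the library actually performs (a substring search would also match a longer function name ending in the same text).

```go
package main

import (
	"fmt"
	"strings"
)

// matchValue reports whether a CSS value matches a value-rule pattern.
// Hypothetical, string-based approximation for illustration only.
func matchValue(pattern, value string) bool {
	p := strings.ToLower(pattern)
	v := strings.ToLower(value)
	if strings.HasSuffix(p, "(*)") {
		// Function pattern: "expression(*)" becomes a search for "expression(".
		return strings.Contains(v, strings.TrimSuffix(p, "*)"))
	}
	// Any other pattern must equal the full value, ignoring case.
	return p == v
}

func main() {
	fmt.Println(matchValue("expression(*)", "EXPRESSION(alert(1))"))
	fmt.Println(matchValue("url(*)", "red"))
	fmt.Println(matchValue("inherit", "INHERIT"))
}
```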

type ContentTypeRule ¶

type ContentTypeRule struct {
	Selector string           // CSS selector text
	Matcher  cascadia.Matcher // compiled selector
	Attr     string           // attribute name pattern (e.g. "src", "*", "href")
	Types    []string         // MIME type patterns (lowercase), e.g. "image/*", "text/html"
	Action   Action           // Allow, Strip, or Defang
	Line     int              // source line number
}

ContentTypeRule restricts data URI MIME types for an attribute on matching elements. The Action field determines what happens to data URIs whose MIME type matches the rule's pattern list: Allow recursively sanitizes them, Strip removes the URI value, and Defang renames the attribute to make it inert.

type FileResolver ¶ added in v0.9.0

type FileResolver struct {
	// BaseDir is the directory that top-level relative include names resolve
	// against. An empty BaseDir resolves relative names against the process's
	// current working directory.
	BaseDir string
}

FileResolver is a Resolver that loads include directives by reading policy files from disk. Supply it via WithResolver:

policy, err := htmlpolicy.Parse(text, htmlpolicy.WithResolver(
	htmlpolicy.FileResolver{BaseDir: "/etc/policies"}))

Relative include names resolve against the directory of the including policy file (filepath.Dir(from)) when from is an absolute path. For top-level includes (from == "") or when from is not absolute, relative names resolve against BaseDir. The resolver returns the absolute path of the loaded file as the canonical name, so nested includes continue to resolve against the right directory and circular includes are detected reliably.
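The resolution order described above might look roughly like this. resolvePath is a hypothetical helper, not FileResolver's actual code, and its handling of an absolute include name (used as given) is an assumption the text does not confirm. The sample paths assume a Unix-style filesystem.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolvePath picks the directory a relative include name resolves against.
// Hypothetical helper; not part of htmlpolicy's API.
func resolvePath(baseDir, from, name string) (string, error) {
	switch {
	case filepath.IsAbs(name):
		// Assumption: an absolute include name is used as given.
		return filepath.Clean(name), nil
	case filepath.IsAbs(from):
		// Nested include: resolve next to the including policy file.
		return filepath.Join(filepath.Dir(from), name), nil
	default:
		// Top-level include (or non-absolute from): resolve against BaseDir.
		// An empty baseDir falls back to the current working directory.
		return filepath.Abs(filepath.Join(baseDir, name))
	}
}

func main() {
	top, err := resolvePath("/etc/policies", "", "mail/main.policy")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	nested, err := resolvePath("/etc/policies", top, "common.policy")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(top)
	fmt.Println(nested)
}
```

Returning an absolute path from every branch is what allows the result of one call to be passed back as from for the next level of includes.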

func (FileResolver) Resolve ¶ added in v0.9.0

func (r FileResolver) Resolve(from, name string) (canonical, text string, err error)

Resolve implements Resolver.

type ParseOption ¶

type ParseOption func(*parseConfig)

ParseOption configures the behavior of Parse.

func WithMaxIncludeDepth ¶

func WithMaxIncludeDepth(n int) ParseOption

WithMaxIncludeDepth sets the maximum nesting depth for include directives. The default is DefaultMaxIncludeDepth (64).

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip-tag script,style
	`, htmlpolicy.WithMaxIncludeDepth(4))
	if err != nil {
		log.Fatal(err)
	}

	output, _, err := policy.ApplyHTML([]byte("<p>Hello</p><script>evil</script>"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<p>Hello</p>

func WithPrefix ¶ added in v0.9.0

func WithPrefix(prefix string) ParseOption

WithPrefix sets the base prefix used when generating names for the CommentOut action (e.g. "PREFIX-commented-out ...") and every defang action (defang-attr/defang-scheme/defang-content-type, the CSS defang verbs, and SVG animated-value defang, e.g. "PREFIX-defanged-onclick"). The default is "htmlpolicy".

The prefix must be non-empty and contain only ASCII letters, digits, or hyphens; an invalid prefix is reported as an error from Parse. A "prefix" directive in the policy text sets the same value and, being applied during parsing, takes precedence over WithPrefix.
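When the prefix comes from configuration or user input, it can be convenient to check it before calling Parse. The sketch below applies the character rule stated above. checkPrefix is a hypothetical helper, and only the empty-prefix message is taken from this documentation; the other error text is invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// checkPrefix accepts a non-empty string of ASCII letters, digits, and
// hyphens, and rejects everything else.
func checkPrefix(prefix string) error {
	if prefix == "" {
		return errors.New("prefix must not be empty")
	}
	for _, r := range prefix {
		isLetter := (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z')
		isDigit := r >= '0' && r <= '9'
		if !isLetter && !isDigit && r != '-' {
			// Hypothetical message; the library's wording may differ.
			return fmt.Errorf("prefix contains invalid character %q", r)
		}
	}
	return nil
}

func main() {
	for _, p := range []string{"myapp", "my-app2", "", "my_app"} {
		fmt.Printf("%q: %v\n", p, checkPrefix(p))
	}
}
```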

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		defang-attr * onclick
		comment-out-tag script
	`, htmlpolicy.WithPrefix("myapp"))
	if err != nil {
		log.Fatal(err)
	}

	output, _, err := policy.ApplyHTML([]byte(`<p onclick="track()">Hello</p><script>evil</script>`))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<p myapp-defanged-onclick="track()">Hello</p><!--myapp-commented-out <script>evil</script>-->

Example (Error) ¶

package main

import (
	"fmt"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	_, err := htmlpolicy.Parse(`defang-attr * on*`, htmlpolicy.WithPrefix(""))
	fmt.Println(err)
}

Output:
prefix must not be empty

func WithResolver ¶ added in v0.9.0

func WithResolver(r Resolver) ParseOption

WithResolver sets the Resolver used to load policies referenced by include directives (by name or path). It is only needed when the policy contains include directives that are not built-in presets ("include builtin:NAME" works without a resolver).

type Policy ¶

type Policy struct {
	// contains filtered or unexported fields
}

Policy is a compiled set of rules for sanitizing HTML, CSS, and SVG content.

A Policy is created by Parse and is immutable afterward.

A Policy is safe for concurrent use by multiple goroutines. Policy.ApplyHTML does not mutate the Policy.

Security note: this library is policy-driven and does not hardcode any element or attribute as safe or dangerous. Policy authors must account for modern HTML features including Declarative Shadow DOM (<template shadowrootmode>), iframe srcdoc, fencedframe, custom elements (the is attribute), and CSP nonce attributes. See the project README for recommended baselines and security guidance.

func Parse ¶

func Parse(input string, opts ...ParseOption) (*Policy, error)

Parse parses policy text into a Policy. Include directives that are not built-in presets require a Resolver supplied via WithResolver.

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip-tag script,style
		strip-attr * on*
		strip-attr a[href^=javascript:] href
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p onclick="track()">Hello</p><script>alert(1)</script>`)
	output, modified, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println("output:", string(output))
}

Output:
modified: true
output: <p>Hello</p>

Example (Allowlist) ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	// A safe allowlist needs a baseline in every rule family it reopens.
	// Reopening an attribute with "allow-attr a href" also needs a scheme baseline,
	// or javascript: URLs pass through. Here strip-scheme/allow-scheme drop the
	// javascript: href while keeping the https one.
	policy, err := htmlpolicy.Parse(`
		strip-tag *
		strip-attr * *
		allow-tag p,div,b,i,em,strong
		allow-tag a
		allow-attr a href
		allow-attr * class,id
		strip-attr * on*
		strip-scheme * href *
		allow-scheme * href https,mailto
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<div><p class="text"><a href="https://example.com" onclick="x">Link</a> <a href="javascript:alert(1)">evil</a> and <b>bold</b></p><script>evil</script></div>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}

Output:
<div><p class="text"><a href="https://example.com">Link</a> <a>evil</a> and <b>bold</b></p></div>

Example (BuiltinPreset) ¶

ExampleParse_builtinPreset shows applying a maintained built-in preset via an include directive. "include builtin:NAME" works in any policy with no resolver configured; later rules override the preset under last-match-wins.

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		include builtin:blocklist
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p>Hi</p><a href="javascript:alert(1)">x</a><script>evil</script>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}

Output:
<p>Hi</p><a>x</a>

Example (CommentOut) ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		comment-out-tag script
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p>safe</p><script>evil()</script>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<p>safe</p><!--htmlpolicy-commented-out <script>evil()</script>-->

Example (ContentTypeFiltering) ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip-content-type * src *
		allow-content-type * src image/*
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<img src="data:image/png;base64,iVBOR"/><img src="data:text/html,<script>evil</script>"/>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<img src="data:image/png;base64,iVBOR"/><img/>

Example (CssAtRules) ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		css-strip-at import
		css-defang-at media
		css-allow-at keyframes
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte("@import \"evil.css\";\n@media screen { .x { color: red; } }\n@keyframes fade { from { opacity: 0; } }")
	output, _, err := policy.ApplyCSS(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
@htmlpolicy-defanged-media screen {
.x {
color: red;
}
}
@keyframes fade {
from {
opacity: 0;
}
}

Example (CssMedia) ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	// Force light mode by stripping prefers-color-scheme:dark @media overrides.
	// See also the builtin:strip-dark-mode preset.
	policy, err := htmlpolicy.Parse(`css-strip-media prefers-color-scheme:dark`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte("body { background: white; }\n@media (prefers-color-scheme: dark) { body { background: black; } }")
	output, _, err := policy.ApplyCSS(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
body {
background: white;
}

Example (DefangContentType) ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		defang-content-type * src image/svg+xml
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<img src="data:image/svg+xml,<svg/>"/><img src="data:image/png;base64,iVBOR"/>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<img htmlpolicy-defanged-src="data:image/svg+xml,&lt;svg/&gt;"/><img src="data:image/png;base64,iVBOR"/>

Example (DefangScheme) ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		defang-scheme * href *
		allow-scheme * href https,:relative
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<a href="javascript:alert(1)">evil</a><a href="https://example.com">safe</a>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
	// A navigable anchor href is defanged in place: the scheme is prefixed
	// with htmlpolicy-defanged (inert) so the link stays visible/copyable.
}

Output:

Example (Demote) ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		demote-tag form
		allow-attr form class
		demote-tag marquee
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<form class="contact"><input type="text"/></form><marquee>hello</marquee>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<div class="contact"><input type="text"/></div><div>hello</div>

Example (Error) ¶

package main

import (
	"fmt"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	_, err := htmlpolicy.Parse(`badverb script`)
	fmt.Println(err)
}

Output:
line 1: unknown verb "badverb"

Example (Placeholder) ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		placeholder-tag script,iframe
		placeholder-label blocked
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p>safe</p><script>evil</script><iframe src="x">frame</iframe>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<p>safe</p><strong title="&lt;script&gt;evil&lt;/script&gt;">[blocked: script]</strong><strong title="&lt;iframe src=&#34;x&#34;&gt;frame&lt;/iframe&gt;">[blocked: iframe]</strong>

Example (SchemeFiltering) ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		strip-scheme * href *
		allow-scheme * href https,mailto,:relative
		strip-scheme * src *
		allow-scheme * src https,:relative
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<a href="https://example.com">safe</a><a href="javascript:alert(1)">evil</a><img src="photo.jpg"/>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}

Output:
<a href="https://example.com">safe</a><a>evil</a><img src="photo.jpg"/>

func (*Policy) ApplyCSS ¶

func (p *Policy) ApplyCSS(content []byte, opts ...ApplyOption) ([]byte, bool, error)

ApplyCSS sanitizes a standalone CSS stylesheet using the policy's CSS rules. It applies property, value, and at-rule filtering, and filters url() schemes. Use this to sanitize raw CSS content that is not embedded in HTML.

The second return value reports whether CSS rules or the rewriter changed the content; unlike Policy.ApplyHTML, when it is false the input bytes are returned unchanged (CSS is not re-serialized when nothing matched).

If the policy has no CSS rules and no URL rewriter is supplied via WithApplyURLRewriter, the content is returned unchanged. When a rewriter is supplied, url() values are rewritten even without CSS sanitization rules.

ApplyCSS is safe for concurrent use on the same Policy.

Returns an error if the output size exceeds the configured limit (see WithMaxOutputFactor).

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		css-strip *
		css-allow color,font-size
		css-strip-value expression(*)
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`.header { color: red; margin: 10px; font-size: 14px; }`)
	output, modified, err := policy.ApplyCSS(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}

Output:
modified: true
.header {
color: red;
font-size: 14px;
}

Example (Pseudo) ¶

ExamplePolicy_ApplyCSS_pseudo shows stripping selectors that use a pseudo-class. The whole complex selector is dropped (never just the pseudo), so an interaction-gated style cannot become an always-on one.

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`css-strip-pseudo hover`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`.menu, .item:hover > .sub { color: red; } a:hover { color: blue; }`)
	output, _, err := policy.ApplyCSS(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(string(output))
}

Output:
.menu {
color: red;
}

func (*Policy) ApplyDocument ¶

func (p *Policy) ApplyDocument(content []byte, opts ...ApplyOption) ([]byte, bool, error)

ApplyDocument applies the policy to a full HTML document. It returns the sanitized document, whether it was modified, and any error. The second return value reports whether policy rules changed the content, not whether the output bytes differ from the input — see Policy.ApplyHTML.

Unlike Policy.ApplyHTML, which parses input as a fragment (suitable for user content embedded in a page), ApplyDocument parses input as a complete document, preserving the <!DOCTYPE>, <html>, <head>, and <body> structure. If the input is missing these elements, the HTML5 parser adds them.

Policy rules apply to all elements in the document, including those in <head> (e.g., <title>, <meta>, <link>, <style>, <script>).

Use ApplyHTML for user-generated content fragments (comments, emails, forum posts). Use ApplyDocument for sanitizing complete HTML pages.

ApplyDocument is safe for concurrent use on the same Policy.

Any <base> elements in the input are always stripped, and relative URLs are resolved against the first base href before policy rules run. This prevents <base> injection attacks and ensures scheme rules see resolved URLs.

Returns an error if the output size exceeds the configured limit (see WithMaxOutputFactor).

Pass WithApplyURLRewriter to rewrite URLs in the sanitized output for this call only — see WithApplyURLRewriter for details.

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse("strip-tag script")
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<!doctype html><html><head><title>Page</title></head><body><p>Hello</p><script>evil</script></body></html>`)
	output, modified, err := policy.ApplyDocument(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}

Output:
modified: true
<!DOCTYPE html><html><head><title>Page</title></head><body><p>Hello</p></body></html>

func (*Policy) ApplyHTML ¶

func (p *Policy) ApplyHTML(content []byte, opts ...ApplyOption) ([]byte, bool, error)

ApplyHTML applies the policy to an HTML fragment. It returns the sanitized content, whether it was modified, and any error.

The second return value reports whether policy rules changed the content (an element or attribute was stripped, defanged, demoted, etc.) — not whether the output bytes differ from the input. Because the default behavior re-serializes through Go's HTML5 parser, the output bytes can differ from the input (whitespace, attribute quoting, tag normalization) even when this is false. Use WithPreserveOriginal to return the original bytes verbatim when no rules matched.

Input must be UTF-8. Use ConvertToUTF8 first if the charset is unknown. Output is always UTF-8.

Content is parsed as a fragment in a body context — it is never wrapped in <html>/<head>/<body> tags. If the input contains those tags, they are discarded and their content is kept. Use Policy.ApplyDocument to sanitize complete HTML documents with preserved document structure.

By default, output is always re-serialized through Go's HTML5 parser, which normalizes malformed HTML. This ensures the browser sees exactly the same structure the sanitizer saw, preventing parser-differential attacks. Use WithPreserveOriginal to return the original bytes when no policy rules modify the content.

ApplyHTML is safe for concurrent use on the same Policy.

Any <base> elements in the input are always stripped, and relative URLs are resolved against the first base href before policy rules run. This prevents <base> injection attacks and ensures scheme rules see resolved URLs.
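To see why resolution has to happen before scheme rules run, consider how a relative URL inherits the scheme of the base it is resolved against. The sketch below uses net/url to show the effect; resolveAgainstBase is a hypothetical helper for this example and is not how the library is known to implement the step.

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// resolveAgainstBase resolves ref against baseHref following the usual
// relative-reference rules.
func resolveAgainstBase(baseHref, ref string) (string, error) {
	base, err := url.Parse(baseHref)
	if err != nil {
		return "", err
	}
	u, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(u).String(), nil
}

func main() {
	// With this base, the innocent-looking "photo.jpg" becomes an http URL,
	// so a policy that only allows https should see (and reject) it.
	for _, ref := range []string{"photo.jpg", "/img/logo.png", "https://example.org/x"} {
		resolved, err := resolveAgainstBase("http://cdn.example.com/assets/", ref)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(ref, "->", resolved)
	}
}
```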

Returns an error if the output size exceeds the configured limit (see WithMaxOutputFactor).

Pass WithApplyURLRewriter to rewrite URLs in the sanitized output for this call only — see WithApplyURLRewriter for details.

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse("strip-tag script,style")
	if err != nil {
		log.Fatal(err)
	}

	output, modified, err := policy.ApplyHTML([]byte("<p>Hello</p><script>alert(1)</script>"))
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}

Output:
modified: true
<p>Hello</p>

func (*Policy) ApplyInlineCSS ¶

func (p *Policy) ApplyInlineCSS(content []byte, opts ...ApplyOption) ([]byte, bool, error)

ApplyInlineCSS sanitizes a standalone inline CSS declaration list (the content of a style attribute, without surrounding HTML). It applies property, value, and URL scheme filtering.

The second return value reports whether CSS rules or the rewriter changed the content; when it is false the input bytes are returned unchanged.

If the policy has no CSS rules and no URL rewriter is supplied via WithApplyURLRewriter, the content is returned unchanged. When a rewriter is supplied, url() values are rewritten even without CSS sanitization rules.

ApplyInlineCSS is safe for concurrent use on the same Policy.

Returns an error if the output size exceeds the configured limit (see WithMaxOutputFactor).

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
		css-strip *
		css-allow color,font-size
	`)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`color: red; margin: 10px; font-size: 14px`)
	output, modified, err := policy.ApplyInlineCSS(input)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}

Output:
modified: true
color: red; font-size: 14px

func (*Policy) Lint ¶ added in v0.9.0

func (p *Policy) Lint() []Warning

Lint reports likely mistakes in the policy without applying it. It is a static analysis of the compiled rules and never reads HTML input. An empty result means no issues were found; a non-empty result is advisory only (the policy is still valid and applied normally).

The checks are conservative — they aim for few false positives — and cover:

URL-bearing attributes that are explicitly allowed but left with no scheme rule constraining them, so javascript:/vbscript:/data: URLs may pass through (the most common fail-open mistake).
allow-scheme rules that reopen a dangerous scheme (javascript, vbscript, data) on an explicitly allowed URL attribute after an earlier scheme rule stripped or defanged it (last match wins by order). This check compares rules by attribute name only, not selector, so a strip and an allow on disjoint selectors can still warn. For data: only, the warning is suppressed when a content-type rule targets the attribute — an allowlist that permits inline data: images must take the strip-then-allow shape, and content-type rules are how an author governs what an allowed data: URI may carry.
allow-attr rules that reopen a dangerous non-URL attribute (an on* event handler, nonce, or is) after an earlier strip-attr removed it (last match wins by order) — the attribute analogue of the reopened-scheme check. A targeted allow such as `allow-attr img *` re-permits onerror even though it is not a universal rule the shadow check would catch. Compared by attribute name only, not selector.
allow-* rules with no effect: an allow whose target nothing earlier strips or defangs only re-affirms the default (unmatched content already passes through), so the rule is dead. Comparison is by name/target and is deliberately over-broad about overlap, so a "dead" verdict is definite (an allowlist with a strip baseline before the allow is never flagged).
Rules made dead by a later rule in the same family: an identical rule, or a universal ("*") rule that overrides everything earlier. Because the engine is last-match-wins by order (not by specificity), an allow placed before a universal strip never takes effect.
"unwrap-tag *" used as an allowlist catch-all without strip rules for active-content elements (script, style, iframe, …), which would leak their text content into the output.
unwrap/demote of a raw-text wrapper (noscript/noframes/noembed) without stripping <script>, which exposes the wrapper's inert raw-text content as live HTML.

Warnings are returned sorted by line.
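The "dead rule" check follows directly from last-match-wins ordering, which a small model makes concrete. The rule type and deadRules function below are hypothetical and heavily simplified (one target string per rule, no selector overlap analysis); they illustrate the idea rather than reproduce Lint.

```go
package main

import "fmt"

// rule is a simplified stand-in for a compiled policy rule.
type rule struct {
	Line   int
	Family string // e.g. "tag" or "attr"
	Target string // selector or name text; "*" is universal
	Action string // e.g. "allow" or "strip"
}

// deadRules returns the lines of rules that a later rule in the same family
// overrides: either an identical rule or a universal ("*") one.
func deadRules(rules []rule) []int {
	var dead []int
	for i, r := range rules {
		for _, later := range rules[i+1:] {
			if later.Family != r.Family {
				continue
			}
			identical := later.Target == r.Target && later.Action == r.Action
			if later.Target == "*" || identical {
				dead = append(dead, r.Line)
				break
			}
		}
	}
	return dead
}

func main() {
	// allow-tag p placed before strip-tag * never takes effect.
	wrongOrder := []rule{
		{Line: 1, Family: "tag", Target: "p", Action: "allow"},
		{Line: 2, Family: "tag", Target: "*", Action: "strip"},
	}
	// The strip baseline first, then the allow: nothing is dead.
	rightOrder := []rule{
		{Line: 1, Family: "tag", Target: "*", Action: "strip"},
		{Line: 2, Family: "tag", Target: "p", Action: "allow"},
	}
	fmt.Println(deadRules(wrongOrder), deadRules(rightOrder))
}
```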

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	// This allowlist reopens <a href> but never constrains its scheme, so
	// javascript: URLs would pass through. Lint catches it statically.
	policy, err := htmlpolicy.Parse(`
		strip-tag *
		strip-attr * *
		allow-tag a
		allow-attr a href
	`)
	if err != nil {
		log.Fatal(err)
	}

	for _, w := range policy.Lint() {
		fmt.Println(w)
	}
}

Output:
line 5: URL attribute "href" is allowed but no scheme rule constrains it; javascript:/vbscript:/data: URLs may pass through (add a strip-scheme/allow-scheme baseline or `include builtin:url-safe`)

func (*Policy) String ¶

func (p *Policy) String() string

String returns the policy as flattened policy text. The output is valid policy syntax that can be parsed back to produce an equivalent policy. Includes are fully resolved (inlined).

func (*Policy) URLSchemeAction ¶ added in v0.7.0

func (p *Policy) URLSchemeAction(element, attr, rawURL string) Action

URLSchemeAction reports the action the policy's scheme rules (allow-scheme, strip-scheme, defang-scheme) would take for a URL used in the named attribute of the named element. It lets a caller validate a URL out of band — without building and re-parsing HTML — answering the single question "is this URL's scheme permitted here?".

The return value is Allow if the scheme is permitted, or Strip / Defang if a scheme rule would neutralize it. These are the only actions scheme rules produce. element and attr are matched case-insensitively (e.g. "a", "href"). rawURL is normalized exactly as Policy.ApplyHTML normalizes a single-URL attribute value (control characters trimmed and, for URL attributes, zero-width characters stripped) before its scheme is extracted, so the result matches what document sanitization would decide for that scheme. Rules are evaluated last-match-wins, and the empty (schemeless) scheme is matched by a rule listing ":relative" or "*".

Scope, and what Allow does NOT mean:

URLSchemeAction evaluates scheme rules ONLY. It does not consider whether the element or attribute would itself be removed by other verbs (e.g. "strip-tag a" or "strip-attr a href"), nor content-type rules that filter data: URIs by MIME type. A return of Allow therefore means "no scheme rule objects to this scheme", not "this attribute survives sanitization". When the policy has no scheme rules at all, the result is always Allow.
Selectors are evaluated against a detached element carrying only the named attribute. Scheme rules whose selectors depend on document position, ancestors, siblings, or other attributes may not match the way full-document Apply would; for such policies URLSchemeAction may report Allow for a URL that the document sanitizer would strip. For the common scheme policies (selectors of the form "*", a tag name, or an attribute selector on the URL attribute itself) the result is exact.
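The last-match-wins evaluation described above can be modeled in a few lines. This is a simplified sketch, not the library's implementation: it ignores selectors and URL normalization, and the schemeRule type, schemeOf, and decide are hypothetical names introduced for the example.

```go
package main

import (
	"fmt"
	"strings"
)

type action string

const (
	allow  action = "allow"
	strip  action = "strip"
	defang action = "defang"
)

// schemeRule lists lowercase schemes, ":relative" for schemeless URLs,
// or "*" for any scheme.
type schemeRule struct {
	Schemes []string
	Action  action
}

// schemeOf returns the lowercase scheme of rawURL, or "" when a '/', '?',
// or '#' appears before any ':' (a schemeless, relative URL).
func schemeOf(rawURL string) string {
	for i, c := range rawURL {
		switch c {
		case ':':
			return strings.ToLower(rawURL[:i])
		case '/', '?', '#':
			return ""
		}
	}
	return ""
}

// decide walks the rules in order and keeps the action of the last rule
// that matches. With no matching rule the URL passes through (allow).
func decide(rules []schemeRule, rawURL string) action {
	scheme := schemeOf(rawURL)
	result := allow
	for _, r := range rules {
		for _, s := range r.Schemes {
			if s == "*" || (scheme == "" && s == ":relative") || (scheme != "" && s == scheme) {
				result = r.Action
				break
			}
		}
	}
	return result
}

func main() {
	rules := []schemeRule{
		{Schemes: []string{"*"}, Action: strip},
		{Schemes: []string{"https", "mailto", ":relative"}, Action: allow},
		{Schemes: []string{"http"}, Action: defang},
	}
	for _, u := range []string{"https://example.com", "http://example.com", "javascript:alert(1)", "/relative/path"} {
		fmt.Println(decide(rules, u), u)
	}
}
```

Because the universal strip comes first, every later rule narrows it; reversing the order would make the strip override everything.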

Example ¶

package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	// URLSchemeAction validates a URL's scheme against the policy without
	// building or re-parsing HTML — useful when an application has a URL in
	// hand and wants to decide whether to keep, drop, or defang it.
	policy, err := htmlpolicy.Parse(`
		strip-scheme * href *
		allow-scheme a href https,mailto,:relative
		defang-scheme a href http
	`)
	if err != nil {
		log.Fatal(err)
	}

	for _, u := range []string{
		"https://example.com",
		"http://example.com",
		"javascript:alert(1)",
		"/relative/path",
	} {
		switch policy.URLSchemeAction("a", "href", u) {
		case htmlpolicy.Allow:
			fmt.Printf("allow:  %s\n", u)
		case htmlpolicy.Defang:
			fmt.Printf("defang: %s\n", u)
		default:
			fmt.Printf("strip:  %s\n", u)
		}
	}
}

Output:
allow:  https://example.com
defang: http://example.com
strip:  javascript:alert(1)
allow:  /relative/path

type Resolver ¶

type Resolver interface {
	Resolve(from, name string) (canonical, text string, err error)
}

Resolver loads policy text for include directives.

name is whatever appears after "include " in the policy file. The Resolver implementation decides how to interpret it (file path, database key, embedded resource, etc.).

from identifies the policy that contains the include directive, allowing the resolver to interpret name relative to the including policy's location. It is the canonical name returned by an earlier Resolve call (the include that loaded the parent), or the empty string for include directives in the top-level policy text passed to Parse. A file-based resolver typically treats from as a file path and joins relative name values against filepath.Dir(from); resolvers without a notion of location can ignore from entirely.

Resolve returns the canonical name of the loaded resource along with its text. The parser forwards canonical as from on nested include calls and uses it as the key for circular-include detection. Implementations should normalize the canonical form (e.g. resolve relative paths to absolute, strip "./" prefixes, normalize separators) so the same underlying resource always yields the same canonical string; otherwise a cycle through e.g. "foo.policy" and "./foo.policy" will only be caught by the depth limit (with a misleading "exceeds maximum depth" error rather than "circular include"). Resolvers without a notion of canonicalization may return name unchanged.
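A minimal in-memory resolver shows the canonicalization advice in practice. mapResolver is a hypothetical type for this example; its Resolve method has the signature of the Resolver interface shown above, and it uses slash-separated keys with path.Clean so that "foo.policy" and "./foo.policy" yield the same canonical name.

```go
package main

import (
	"fmt"
	"path"
)

// mapResolver serves policy text from memory, keyed by cleaned path.
type mapResolver map[string]string

// Resolve interprets a relative name against the directory of the including
// policy (from) and returns a normalized canonical name with the text.
func (m mapResolver) Resolve(from, name string) (canonical, text string, err error) {
	if !path.IsAbs(name) && from != "" {
		name = path.Join(path.Dir(from), name)
	}
	canonical = path.Clean(name)
	body, ok := m[canonical]
	if !ok {
		return "", "", fmt.Errorf("policy %q not found", canonical)
	}
	return canonical, body, nil
}

func main() {
	r := mapResolver{
		"foo.policy":        "strip-tag script",
		"shared/base.policy": "strip-attr * on*",
	}
	for _, name := range []string{"foo.policy", "./foo.policy"} {
		canonical, _, err := r.Resolve("", name)
		fmt.Println(name, "->", canonical, err)
	}
	// A nested include resolves relative to the including policy.
	canonical, _, err := r.Resolve("shared/main.policy", "base.policy")
	fmt.Println(canonical, err)
}
```

With both spellings mapping to one canonical string, a cycle through them is reported as a circular include rather than surfacing only at the depth limit.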

type SchemeRule ¶

type SchemeRule struct {
	Selector string           // CSS selector text
	Matcher  cascadia.Matcher // compiled selector
	Attr     string           // attribute name pattern (e.g. "href", "*", "on*")
	Schemes  []string         // schemes (lowercase), ":relative" for schemeless URLs, or "*" for any scheme
	Action   Action           // Allow, Strip, or Defang
	Line     int
}

SchemeRule restricts URL schemes for an attribute on matching elements. The Action field determines what happens to URLs with matching schemes: Allow passes them through, Strip removes them, and Defang renames the attribute to make it inert.

type TagRule ¶

type TagRule struct {
	Selector string           // CSS selector text (for String())
	Matcher  cascadia.Matcher // compiled selector
	Action   Action           // what to do with matching elements
	Line     int              // source line number (for diagnostics)
}

TagRule defines an action for matching elements.

type URLContext ¶

type URLContext struct {
	// Element is the lowercase HTML element name (e.g., "img", "a", "style").
	// Empty for standalone CSS sanitization ([Policy.ApplyCSS], [Policy.ApplyInlineCSS]).
	Element string

	// Attr is the HTML attribute name (e.g., "href", "src", "style").
	// Empty for URLs inside <style> elements or standalone CSS.
	Attr string

	// CSSProperty is the CSS property name containing the url() value
	// (e.g., "background-image", "background"), or "@import" for @import URLs.
	// Empty for non-CSS URLs (HTML attributes).
	CSSProperty string

	// Parent is the lowercase name of the URL's element's parent in the
	// HTML tree (e.g., "picture", "audio", "video", "head", "body"). Empty
	// when the element has no element parent — top-level fragment nodes,
	// document-root children whose parent is the document node, and
	// standalone CSS sanitization.
	//
	// Useful for elements whose semantics depend on ancestry, e.g.
	// <source src> is an image candidate inside <picture> but a media file
	// inside <audio>/<video>.
	Parent string

	// GetAttr returns the value of the named attribute on the URL's
	// element, or "" if the attribute is absent. Lookups are
	// case-insensitive. GetAttr is always non-nil; for orphan contexts
	// (standalone ApplyCSS / ApplyInlineCSS) it returns "" for any name.
	//
	// Useful for elements whose semantics depend on a sibling attribute on
	// the same element, e.g. <input type=image>, <link rel=stylesheet>,
	// <object type=...>. GetAttr reflects the live attribute slice on the
	// element at the time of the call, so mutations applied by earlier
	// rewriter calls on the same element are visible.
	//
	// GetAttr is not considered part of the deduplication key — the
	// cached rewriter result for a given {url, element, attr, parent,
	// cssProperty} tuple comes from the first call, so if you discriminate
	// on GetAttr across same-key elements only the first element's attrs
	// influence the cached value.
	GetAttr func(name string) string
}

URLContext describes where a URL was found in the sanitized output. It is passed to URLRewriter callbacks (configured per call via WithApplyURLRewriter) to provide context about each URL.

type URLRef ¶ added in v0.3.0

type URLRef struct {
	// Context describes where the URL was found: the element, attribute, CSS
	// property, parent element name, and a live GetAttr accessor. It is the
	// exact same URLContext value the [URLRewriter] would receive for this
	// URL. Context.GetAttr is valid for the duration of the prefetch callback
	// (the underlying nodes are still alive); do not retain it past the call.
	Context URLContext

	// URL is the discovered URL in its final resolved form — identical to the
	// value that would be passed to a [URLRewriter] for this same context.
	URL string
}

URLRef pairs a URL discovered during sanitization with the URLContext describing where it was found. A slice of URLRef is passed to the callback registered by WithApplyURLPrefetcher, allowing the caller to warm a cache (e.g. fetch resources in parallel) before the synchronous URLRewriter runs.

type URLRewriter ¶

type URLRewriter func(ctx URLContext, url string) string

URLRewriter is a callback that receives each URL in the sanitized output and returns a replacement. Return the original url unchanged to keep it. Supply a rewriter to one Apply call via WithApplyURLRewriter. Because the rewriter is scoped to that single call, it does not need to be safe for concurrent use; two goroutines calling Apply on the same Policy pass independent rewriters and never share state.

URLs are presented after policy evaluation and base URL resolution. Fragment-only references (#id), empty URLs, and data: URIs are excluded. URLs inside data URI content are recursed into (e.g. url(img.png) inside data:text/css is presented even though the data URI is not).

Calls are deduplicated within a single top-level applyHTMLAt / applyDocumentAt / applyCSS / applyInlineCSS invocation: each unique combination of {url, element, attr, parent, cssProperty} is presented to the callback exactly once for that invocation. The same URL in different contexts (e.g. <img src> vs <video poster>, or the same URL on a <source> inside <picture> vs inside <video>) produces separate calls. Recursive invocations triggered by embedded data URI content each have their own deduplication cache.
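The deduplication rule amounts to memoizing the callback on a five-field key. The sketch below models that with a plain map; dedupKey and dedupRewriter are hypothetical types for this example and say nothing about how the library stores its cache.

```go
package main

import "fmt"

// dedupKey holds the fields the documentation lists as the deduplication key.
type dedupKey struct {
	URL, Element, Attr, Parent, CSSProperty string
}

// dedupRewriter calls rewrite once per distinct key and reuses the result.
type dedupRewriter struct {
	cache   map[dedupKey]string
	calls   int
	rewrite func(k dedupKey) string
}

func (d *dedupRewriter) apply(k dedupKey) string {
	if v, ok := d.cache[k]; ok {
		return v
	}
	d.calls++
	v := d.rewrite(k)
	d.cache[k] = v
	return v
}

func main() {
	d := &dedupRewriter{
		cache:   make(map[dedupKey]string),
		rewrite: func(k dedupKey) string { return "/proxy/" + k.Element + "?u=" + k.URL },
	}
	occurrences := []dedupKey{
		{URL: "https://a.example/x.png", Element: "img", Attr: "src"},
		{URL: "https://a.example/x.png", Element: "img", Attr: "src"},      // same key: cached
		{URL: "https://a.example/x.png", Element: "video", Attr: "poster"}, // new context: new call
	}
	for _, k := range occurrences {
		fmt.Println(d.apply(k))
	}
	fmt.Println("callback calls:", d.calls)
}
```

Anything outside the key, such as a sibling attribute read through GetAttr, cannot distinguish two occurrences that share a key, which is the caveat noted on URLContext.GetAttr.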

type Warning ¶ added in v0.9.0

type Warning struct {
	Line    int
	Message string
}

Warning describes a likely policy mistake found by Policy.Lint. Warnings are advisory: a policy that produces warnings is still valid and applied normally. Line is the 1-based source line the warning relates to, or 0 when it is not tied to a specific line.

func (Warning) String ¶ added in v0.9.0

func (w Warning) String() string

String renders the warning as "line N: message" (or just the message when Line is 0).
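The documented format can be reproduced with a local copy of the type. This is a re-implementation written from the description above, not the library's source.

```go
package main

import "fmt"

// Warning is a local copy of the documented struct.
type Warning struct {
	Line    int
	Message string
}

// String follows the documented format: "line N: message", or just
// the message when Line is 0.
func (w Warning) String() string {
	if w.Line == 0 {
		return w.Message
	}
	return fmt.Sprintf("line %d: %s", w.Line, w.Message)
}

func main() {
	fmt.Println(Warning{Line: 12, Message: "rule is never matched"})
	// line 12: rule is never matched
	fmt.Println(Warning{Message: "policy has no rules"})
	// policy has no rules
}
```

Because Warning implements fmt.Stringer on a value receiver, passing a Warning directly to fmt.Println uses this format.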


Directories ¶

Path	Synopsis
cmd
htmlpolicy (command)	Command htmlpolicy applies an HTML sanitization policy to HTML content.
playground (command)	Command playground is the WebAssembly entry point for the browser playground.
engine	Package engine is the platform-independent core of the WebAssembly playground.


Recommended blocklist baseline (equivalent to `builtin:blocklist`)