Documentation ¶
Overview ¶
Package htmlpolicy implements a policy-driven HTML, CSS, and SVG sanitizer.
This library was built entirely through vibe coding with Claude Code (https://claude.ai/claude-code).
Unlike sanitizers that hardcode safety rules in library code, htmlpolicy uses declarative policies: plain text files where each line starts with an action verb followed by CSS selectors and optional arguments. Rules are evaluated top-to-bottom with last-match-wins semantics. Anything not matched by a rule passes through unchanged.
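The top-to-bottom, last-match-wins evaluation described above can be pictured with a small sketch. The `rule` type and `decide` function below are hypothetical and match on bare tag names only; they are not htmlpolicy's internals, which compile full CSS selectors.

```go
package main

import "fmt"

// rule is a hypothetical, simplified stand-in for one policy line: a verb
// plus the tag names it applies to ("*" matches every tag).
type rule struct {
	verb string
	tags []string
}

// matches reports whether the rule applies to the given tag name.
func (r rule) matches(tag string) bool {
	for _, t := range r.tags {
		if t == "*" || t == tag {
			return true
		}
	}
	return false
}

// decide walks the rules top to bottom and keeps the verb of the last rule
// that matched. An empty result means no rule matched, so the element would
// pass through unchanged.
func decide(rules []rule, tag string) string {
	verb := ""
	for _, r := range rules {
		if r.matches(tag) {
			verb = r.verb
		}
	}
	return verb
}

func main() {
	rules := []rule{
		{verb: "strip", tags: []string{"*"}},
		{verb: "allow", tags: []string{"p", "b"}},
		{verb: "strip", tags: []string{"b"}},
	}
	for _, tag := range []string{"p", "b", "script"} {
		fmt.Printf("%s -> %s\n", tag, decide(rules, tag))
	}
}
```

Here `p` ends up allowed, while `b` and `script` end up stripped: the later `strip b` line overrides the earlier `allow p,b` line for `b` only.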
Policies can be reviewed, versioned, composed via includes, and swapped without recompiling. This makes htmlpolicy suitable for email rendering pipelines, CMS content filtering, user-generated content sanitization, and any HTML-to-HTML transformation where the rules need to be configurable.
Entry Points ¶
There are five entry points: four apply a policy, and ConvertToUTF8 prepares non-UTF-8 input for them:
- Policy.ApplyHTML sanitizes an HTML fragment (user content to embed in a page).
- Policy.ApplyDocument sanitizes a full HTML document (preserves DOCTYPE/html/head/body).
- Policy.ApplyCSS sanitizes a standalone CSS stylesheet.
- Policy.ApplyInlineCSS sanitizes an inline CSS declaration list (style attribute content).
- ConvertToUTF8 converts content from a detected charset to UTF-8 (call before the others if needed).
The four Policy.Apply* entry points expect UTF-8 input and return UTF-8 output. A Policy is safe for concurrent use by multiple goroutines.
Selectors ¶
Selectors use standard CSS syntax compiled via github.com/andybalholm/cascadia:
    tag                    match all <tag> elements
    *                      match all elements
    tag[attr=value]        exact attribute match
    tag[attr^=prefix]      prefix match
    tag[attr$=suffix]      suffix match
    tag[attr*=substring]   contains match
    tag[attr~=word]        space-separated word match
    tag[attr=value i]      case-insensitive match (CSS4)
    div.ads > a            child combinator (quote when spaces present)
    svg animate            descendant: <animate> inside <svg>
    #tracking              ID selector
    :not([href^=https])    negation pseudo-class
Comma-separated selector groups (e.g. "script,style,iframe") compile into a single rule. When a selector contains spaces, quote it so the parser distinguishes it from arguments: strip "svg animate".
Tag matching is case-insensitive. Attribute value matching normalizes control characters and case internally for security, without modifying the output. This prevents bypasses like " javascript:" evading a [href^=javascript:] prefix check, or JaVaScRiPt: evading case-sensitive matching.
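As a rough illustration of matching on a normalized copy while leaving the original value alone, the sketch below drops control characters, trims surrounding whitespace, and lowercases before a prefix comparison. The helper names and the exact normalization steps are assumptions made for this example; the library's real normalization may differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeForMatch returns a copy of v with control characters removed,
// letters lowercased, and surrounding whitespace trimmed. The caller's
// original string is never modified. (Hypothetical helper.)
func normalizeForMatch(v string) string {
	var b strings.Builder
	for _, r := range v {
		if unicode.IsControl(r) {
			continue // drops tabs, newlines, NUL, DEL, and similar
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return strings.TrimSpace(b.String())
}

// hasPrefixNormalized emulates a [attr^=prefix] check against the
// normalized form of the attribute value.
func hasPrefixNormalized(value, prefix string) bool {
	return strings.HasPrefix(normalizeForMatch(value), normalizeForMatch(prefix))
}

func main() {
	for _, v := range []string{" JaVa\tScRiPt:alert(1)", "https://example.com/"} {
		fmt.Printf("%q matches javascript: prefix: %v\n", v, hasPrefixNormalized(v, "javascript:"))
	}
}
```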
HTML Tag Verbs ¶
Tag verbs determine what happens to matched elements:
    strip SELECTOR[,...]                remove element and all content
    comment-out SELECTOR[,...]          wrap in HTML comment
    placeholder SELECTOR[,...]          replace with [removed: tag] label
    demote SELECTOR[,...] [ATTR,...]    convert to <div>/<span>, keep listed attrs
    allow SELECTOR[,...] [ATTR,...]     keep element, keep only listed attrs
    unwrap SELECTOR[,...]               remove the element's tags but keep its children
Inline elements demote to <span>, all others to <div>. The namespace is cleared, so foreign elements (SVG, MathML) become plain HTML.
Unwrap removes the wrapper but leaves the children in flow (mirroring how browsers treat unknown elements). The wrapper's attributes are discarded; void/empty elements with no children are stripped. As a catch-all in an allowlist policy, "unwrap *" MUST be paired with explicit strip rules for any element whose presence indicates active content or whose text content should not be visible — at a minimum: script, style, iframe, object, embed, form, input, textarea, select, button, link, base, meta. Without those, raw <script>/<style> text would leak into the output as visible text once the wrapper is removed.
The allow and demote shorthand attribute list is equivalent to separate allow-attr rules. If the selector has a condition, it propagates:
    allow a[href^=https:] href,class
    # is equivalent to:
    allow a[href^=https:]
    allow-attr a[href^=https:] href
    allow-attr a[href^=https:] class
HTML Attribute Verbs ¶
Attribute verbs filter attributes on matched elements. Attribute name patterns support a trailing * glob: on* matches onclick, onload, etc.
    strip-attr SELECTOR ATTR[,...]     strip named attrs
    allow-attr SELECTOR ATTR[,...]     allow only named attrs
    defang-attr SELECTOR ATTR[,...]    rename to PREFIX-defanged-ATTR
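The trailing `*` glob on attribute name patterns amounts to a prefix test. A minimal sketch follows; `matchAttrPattern` is a hypothetical helper, and comparing names case-insensitively is an assumption based on HTML attribute names being case-insensitive.

```go
package main

import (
	"fmt"
	"strings"
)

// matchAttrPattern reports whether an attribute name matches a pattern that
// may end in a single "*" glob. A bare "*" matches every attribute name.
func matchAttrPattern(pattern, name string) bool {
	pattern = strings.ToLower(pattern)
	name = strings.ToLower(name)
	if strings.HasSuffix(pattern, "*") {
		return strings.HasPrefix(name, strings.TrimSuffix(pattern, "*"))
	}
	return pattern == name
}

func main() {
	fmt.Println(matchAttrPattern("on*", "onclick"))   // glob: prefix match
	fmt.Println(matchAttrPattern("href", "hreflang")) // no glob: exact match only
	fmt.Println(matchAttrPattern("*", "data-id"))     // bare glob matches anything
}
```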
URL Scheme Verbs ¶
Scheme verbs filter URL attributes by the URL's scheme. Use :relative to match schemeless URLs, * to match any scheme:
    allow-scheme SELECTOR ATTR SCHEME[,...]     allow URLs with listed schemes
    strip-scheme SELECTOR ATTR SCHEME[,...]     strip URLs with listed schemes
    defang-scheme SELECTOR ATTR SCHEME[,...]    rename attr when scheme matches
For allowlist behavior, pair with a strip-scheme baseline:
    strip-scheme * href *
    allow-scheme * href https,mailto,:relative
Each scheme must be a valid URL scheme (an ASCII letter followed by letters, digits, "+", "-", or "."), the wildcard *, or :relative. Comma-separated lists reject a stray comma between entries (e.g. "http,,https") to catch typos.
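The scheme-list validation described above could look roughly like the following. `validScheme` and `parseSchemeList` are hypothetical names, the error messages are invented for illustration, and lowercasing the accepted entries is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// validScheme accepts "*", ":relative", or an ASCII letter followed by
// letters, digits, "+", "-", or ".".
func validScheme(s string) bool {
	if s == "*" || s == ":relative" {
		return true
	}
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		isLetter := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
		isDigit := c >= '0' && c <= '9'
		switch {
		case isLetter:
			// letters are valid in every position
		case i > 0 && (isDigit || c == '+' || c == '-' || c == '.'):
			// digits and punctuation are valid after the first character
		default:
			return false
		}
	}
	return true
}

// parseSchemeList splits a comma-separated list and rejects empty entries,
// so a typo such as "http,,https" is reported instead of silently ignored.
func parseSchemeList(list string) ([]string, error) {
	parts := strings.Split(list, ",")
	out := make([]string, 0, len(parts))
	for _, p := range parts {
		if p == "" {
			return nil, fmt.Errorf("empty entry in scheme list %q", list)
		}
		if !validScheme(p) {
			return nil, fmt.Errorf("invalid scheme %q", p)
		}
		out = append(out, strings.ToLower(p))
	}
	return out, nil
}

func main() {
	for _, in := range []string{"https,mailto,:relative", "http,,https", "1http"} {
		schemes, err := parseSchemeList(in)
		fmt.Printf("%q -> %v, err=%v\n", in, schemes, err)
	}
}
```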
Multi-URL attributes (srcset, imagesrcset, ping, archive) get per-URL evaluation — each URL is independently matched against the rule chain.
Content-Type Verbs ¶
Content-type verbs control what happens to data URIs based on their MIME type. MIME patterns support a single * wildcard: image/*, *+xml, */xml.
    allow-content-type SELECTOR ATTR TYPE[,...]     allow data URIs with listed types
    strip-content-type SELECTOR ATTR TYPE[,...]     strip data URIs with listed types
    defang-content-type SELECTOR ATTR TYPE[,...]    rename attr when type matches
When no content-type rules are present, default behavior applies (text/html, application/xhtml+xml, and image/svg+xml are recursively sanitized, others pass through). When any content-type rule is present, rules take full control.
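A MIME pattern with a single `*` wildcard can be checked with one prefix test and one suffix test. The sketch below is a hypothetical helper written for illustration; it assumes case-insensitive comparison and lets the wildcard span the `/` separator, which patterns such as `*+xml` require.

```go
package main

import (
	"fmt"
	"strings"
)

// matchMIME reports whether mime matches a pattern containing at most one
// "*" wildcard, for example "image/*", "*+xml", or "*/xml".
func matchMIME(pattern, mime string) bool {
	pattern = strings.ToLower(pattern)
	mime = strings.ToLower(mime)
	i := strings.IndexByte(pattern, '*')
	if i < 0 {
		return pattern == mime
	}
	prefix, suffix := pattern[:i], pattern[i+1:]
	// The length check stops prefix and suffix from overlapping in mime.
	return len(mime) >= len(prefix)+len(suffix) &&
		strings.HasPrefix(mime, prefix) &&
		strings.HasSuffix(mime, suffix)
}

func main() {
	fmt.Println(matchMIME("image/*", "image/png"))   // matches on prefix
	fmt.Println(matchMIME("*+xml", "image/svg+xml")) // matches on suffix
	fmt.Println(matchMIME("*/xml", "text/html"))     // suffix differs
}
```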
CSS Verbs ¶
CSS verbs process style attributes, <style> element contents, data:text/css URIs, and SMIL animated style values. They follow the same last-match-wins semantics as HTML rules. Without CSS rules, CSS content passes through unchanged. HTML scheme rules (strip-scheme etc.) do NOT apply to CSS — use css-*-scheme rules instead.
    css-strip PROP[,...]                     strip CSS properties (* = all, trailing * glob ok)
    css-allow PROP[,...]                     allow CSS properties
    css-defang PROP[,...]                    defang CSS properties (prefix name)
    css-strip-value PATTERN                  strip properties matching value pattern
    css-defang-value PATTERN                 defang properties matching value pattern
    css-strip-at NAME[,...]                  strip CSS at-rules (e.g. import, * = all)
    css-allow-at NAME[,...]                  allow CSS at-rules
    css-defang-at NAME[,...]                 defang CSS at-rules (prefix name)
    css-strip-scheme TARGET SCHEME[,...]     strip CSS url() values by scheme
    css-allow-scheme TARGET SCHEME[,...]     allow CSS url() values by scheme
    css-defang-scheme TARGET SCHEME[,...]    defang CSS url() values by scheme
The css-*-scheme TARGET is a comma-separated list of CSS property name patterns or @import. Use * to match all properties and @import. Scheme lists support :relative and * as with HTML scheme rules.
CSS allowlist example:
    css-strip *
    css-allow color,background-color,font-size,font-family,font-weight
    css-allow text-align,text-decoration,margin,padding,border
    css-strip-value expression(*)
    css-strip-value url(*)
    css-strip-at *
CSS blocklist example:
    css-strip -moz-binding,behavior
    css-strip-value expression(*)
    css-strip-at import
    css-strip-scheme * javascript,vbscript,data,blob
Standalone CSS sanitization is available via Policy.ApplyCSS (stylesheets) and Policy.ApplyInlineCSS (declaration lists).
Other Verbs ¶
    placeholder-label LABEL    customize the label for placeholder (default "removed")
    strip-comments             remove HTML comments
    include NAME-OR-PATH       inline another policy (loaded via [Resolver])
Lines starting with # are comments. Only full-line comments are supported.
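To make the line format concrete (a verb, then selectors, then optional arguments, with double quotes protecting selectors that contain spaces), here is a deliberately simplified tokenizer. `splitPolicyLine` is a hypothetical function; it treats every double quote as a delimiter, so it does not cope with quotes inside attribute selectors and should not be read as the library's parser.

```go
package main

import (
	"fmt"
	"strings"
)

// splitPolicyLine splits one policy line into fields on spaces and tabs,
// keeping double-quoted runs together. Blank lines and full-line comments
// (lines starting with "#") yield no fields.
func splitPolicyLine(line string) (fields []string, err error) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return nil, nil
	}
	var cur strings.Builder
	inQuote, started := false, false
	for _, r := range line {
		switch {
		case r == '"':
			inQuote = !inQuote
			started = true
		case !inQuote && (r == ' ' || r == '\t'):
			if started {
				fields = append(fields, cur.String())
				cur.Reset()
				started = false
			}
		default:
			cur.WriteRune(r)
			started = true
		}
	}
	if inQuote {
		return nil, fmt.Errorf("unterminated quote in %q", line)
	}
	if started {
		fields = append(fields, cur.String())
	}
	return fields, nil
}

func main() {
	for _, l := range []string{
		`strip "svg animate"`,
		`allow a[href^=https:] href,class`,
		`# a full-line comment`,
	} {
		f, err := splitPolicyLine(l)
		fmt.Printf("%q -> %q, err=%v\n", l, f, err)
	}
}
```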
URL Rewriting ¶
WithApplyURLRewriter is an ApplyOption that sets a callback that receives every URL in the sanitized output and can replace it. The rewriter is scoped to a single Apply call, so it may safely close over per-call state (caches, counters, accumulators) without concurrency concerns. The rewriter runs after policy evaluation and base URL resolution. It covers HTML URL attributes, CSS url() values, srcset entries, SVG url() attributes, meta refresh URLs, and SMIL animation values. URLs inside recursively-sanitized data:text/html content are also rewritten. Stripped and defanged URLs are excluded. Fragment-only references, empty URLs, and data: URIs themselves are excluded (but URLs inside data URI content are recursed into). The option also works with Policy.ApplyCSS and Policy.ApplyInlineCSS.
WithApplyURLPrefetcher is a companion ApplyOption that receives the full set of URLs the rewriter would be called with — same set, same resolved form, same dedup, in document order — once per call, before the rewrite pass. It lets a caller warm a cache (e.g. fetch image and font resources in parallel) that the subsequent synchronous rewriter reads from. Each URLRef carries a live URLContext (including GetAttr) so the caller can inspect sibling attributes when deciding what to fetch.
Output Normalization ¶
By default, Policy.ApplyHTML and Policy.ApplyDocument always re-serialize through Go's HTML5 parser, even when no rules modified the content. This prevents parser-differential attacks. Use WithPreserveOriginal to return the original bytes when no rules match.
Embedded Content Sanitization ¶
HTML embedded inside attribute values is recursively sanitized: data:text/html URIs, data:application/xhtml+xml URIs, data:image/svg+xml URIs, srcdoc attributes, and meta refresh URLs. Recursion depth is limited to 16 levels; at the limit, content is stripped entirely. Use content-type rules to control which MIME types are allowed.
Namespace Validation (mXSS Prevention) ¶
After applying policy rules, namespace consistency is validated. Elements whose namespace is invalid for their DOM position are stripped. This prevents mutation XSS attacks where foreign-content elements end up in the wrong namespace after policy actions change the tree structure.
SVG SMIL Animation Sanitization ¶
SMIL animation elements (<animate>, <set>, etc.) are sanitized by applying the same attribute and scheme policy rules to animated values. This prevents runtime bypasses via attributeName="href" values="javascript:alert(1)". CSS in animated style values is sanitized through the CSS engine.
Limitations ¶
CSS selectors in <style> elements are not filtered. Attribute selectors combined with url() values can exfiltrate data. The url() scheme filtering mitigates this, but full protection requires stripping <style> elements.
CSS var()/env()/attr() substitution happens at browser computed-value time, not parse time. Known attack vectors are handled (url() in custom properties, @import var(), var() fallback values, var()/env()/attr() in URL-accepting functions). When css-*-scheme rules are active, var()/env()/attr() inside URL-accepting functions (image-set, -webkit-image-set, cross-fade, image, src) is stripped as a precaution. Bare string arguments inside these functions are evaluated as URLs and subjected to css-*-scheme rules. CSS escape sequences in the url() (or attr()) function name itself (e.g. \75rl(...), \61ttr(...)) bypass the lexer's URLToken recognition, so when css-*-scheme rules are active any FunctionToken whose escape-decoded name is "url" causes the declaration to be stripped. attr() with a url-typed substitution — "attr(name url)" or "attr(name type(<url>))" — also causes the declaration to be stripped, since the value of the named HTML attribute would be loaded as a URL at computed time.
Limits: data URI recursion depth 16, HTML nesting depth 512, and CSS nesting depth 128 are hardcoded. Output size defaults to 10x the input, with a 32KB minimum, and is configurable via WithMaxOutputFactor.
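The output size bound can be read as "the larger of 32KB and factor times the input length". The sketch below spells out that arithmetic; `maxOutputSize` and `minOutputLimit` are hypothetical names, the treatment of a zero factor follows the WithMaxOutputFactor documentation, and how the library rounds or enforces the limit internally is not specified here.

```go
package main

import "fmt"

// minOutputLimit is the documented 32KB floor for the output size limit.
const minOutputLimit = 32 * 1024

// maxOutputSize returns the byte limit for a given input length and factor.
// A factor of 0 disables the limit, reported via limited == false.
func maxOutputSize(inputLen int, factor float64) (limit int, limited bool) {
	if factor == 0 {
		return 0, false
	}
	limit = int(float64(inputLen) * factor)
	if limit < minOutputLimit {
		limit = minOutputLimit
	}
	return limit, true
}

func main() {
	for _, n := range []int{12, 100000} {
		limit, limited := maxOutputSize(n, 10.0)
		fmt.Printf("input=%d bytes -> limit=%d bytes (limited=%v)\n", n, limit, limited)
	}
}
```

A 12-byte input is therefore bounded by the 32768-byte floor, while a 100000-byte input with the default factor is bounded at 1000000 bytes.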
Index ¶
- Constants
- func ConvertToUTF8(content []byte, contentType string) ([]byte, bool)
- type Action
- type ApplyOption
- type AttrRule
- type CSSAtRule
- type CSSPropertyRule
- type CSSSchemeRule
- type CSSValueRule
- type ContentTypeRule
- type ParseOption
- type Policy
- func (p *Policy) ApplyCSS(content []byte, opts ...ApplyOption) ([]byte, bool, error)
- func (p *Policy) ApplyDocument(content []byte, opts ...ApplyOption) ([]byte, bool, error)
- func (p *Policy) ApplyHTML(content []byte, opts ...ApplyOption) ([]byte, bool, error)
- func (p *Policy) ApplyInlineCSS(content []byte, opts ...ApplyOption) ([]byte, bool, error)
- func (p *Policy) SetPrefix(prefix string) error
- func (p *Policy) String() string
- type Resolver
- type SchemeRule
- type TagRule
- type URLContext
- type URLRef
- type URLRewriter
Examples ¶
- ConvertToUTF8
- Parse
- Parse (Allowlist)
- Parse (CommentOut)
- Parse (ContentTypeFiltering)
- Parse (CssAtRules)
- Parse (DefangContentType)
- Parse (DefangScheme)
- Parse (Demote)
- Parse (Error)
- Parse (Placeholder)
- Parse (SchemeFiltering)
- Policy.ApplyCSS
- Policy.ApplyDocument
- Policy.ApplyHTML
- Policy.ApplyInlineCSS
- Policy.SetPrefix
- Policy.SetPrefix (Error)
- WithApplyURLPrefetcher
- WithApplyURLRewriter
- WithMaxIncludeDepth
- WithMaxOutputFactor
- WithPreserveOriginal
Constants ¶
const DefaultMaxIncludeDepth = 64
DefaultMaxIncludeDepth is the default maximum nesting depth for includes.
Variables ¶
This section is empty.
Functions ¶
func ConvertToUTF8 ¶
func ConvertToUTF8(content []byte, contentType string) ([]byte, bool)
ConvertToUTF8 converts content from the charset specified in contentType to UTF-8. The contentType should be a MIME type with optional charset parameter (e.g. "text/html; charset=iso-8859-1"). If no charset is specified, encoding is determined by inspecting the content.
Returns the UTF-8 content and a boolean indicating whether conversion was performed. When the content is already UTF-8 (or the detected encoding matches UTF-8), the original slice is returned with false.
Example ¶
package main

import (
	"fmt"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	content := []byte("<p>caf\xe9</p>")
	utf8, converted := htmlpolicy.ConvertToUTF8(content, "text/html; charset=iso-8859-1")
	fmt.Println("converted:", converted)
	fmt.Println(string(utf8))
}

Output:
converted: true
<p>café</p>
Types ¶
type Action ¶
type Action int
Action determines what happens to a matched element or attribute. These types are exported for documentation purposes but are not part of the stable API — Policy fields are unexported and there is no public function that accepts or returns rule types.
const (
	// Strip removes the element and all of its content from the output.
	Strip Action = iota

	// CommentOut wraps the element and its content in an HTML comment
	// (<!-- ... -->), making it invisible to browsers while preserving
	// the content in the source for inspection.
	CommentOut

	// Placeholder replaces the element with a text label such as
	// "[removed: script]". The label word is configurable via the
	// placeholder-label policy directive.
	Placeholder

	// Demote converts the element to a safe generic container: inline
	// elements become <span>, all others become <div>. Allowed
	// attributes are preserved; the namespace is cleared.
	Demote

	// Allow keeps the element in the output unchanged.
	Allow

	// Defang renames the attribute by inserting a prefix (e.g.
	// "htmlpolicy-defanged-onclick"), making it inert while preserving
	// the value for inspection.
	Defang

	// Unwrap removes the element's start and end tags but keeps its
	// children in place at the position the element occupied. The
	// element's attributes are discarded along with the wrapper. This
	// mirrors how browsers treat unknown elements (HTMLUnknownElement
	// is a transparent wrapper: the tag has no effect and children
	// render in flow). Void/empty elements with no children behave as
	// [Strip]: there is nothing to keep.
	//
	// Unwrap differs from [Demote]: Demote renames the wrapper to a
	// safe generic container (<div>/<span>), preserving the element
	// boundary, while Unwrap removes the wrapper entirely.
	//
	// Unwrap is appropriate as a catch-all for unknown structural
	// elements in an allowlist policy. It MUST be paired with explicit
	// strip rules for any element whose presence indicates active
	// content or whose text content should not be visible. At a
	// minimum, pair "unwrap *" with strip rules for script, style,
	// iframe, object, embed, form, input, textarea, select, button,
	// link, base, and meta. Without those, raw <script>/<style> text
	// would leak into the output as visible text after the wrapper
	// is removed.
	Unwrap
)
type ApplyOption ¶
type ApplyOption func(*applyConfig)
ApplyOption configures a single call to Policy.ApplyHTML, Policy.ApplyDocument, Policy.ApplyCSS, or Policy.ApplyInlineCSS. Unlike ParseOption (which is baked into the compiled Policy and shared across all calls), ApplyOption values are scoped to one call. This allows per-call state — for example, a URL rewriter that closes over per-message caches or budgets — without recompiling the policy.
func WithApplyURLPrefetcher ¶ added in v0.3.0
func WithApplyURLPrefetcher(fn func([]URLRef)) ApplyOption
WithApplyURLPrefetcher registers a callback invoked once per Apply call, after parsing, policy evaluation, and base URL resolution, but before the URL rewrite pass. It receives every URL that a URLRewriter set via WithApplyURLRewriter would be invoked with in this same call — the same set, the same resolved form, the same deduplication semantics, and the same order (document order for top-level URLs; URLs inside recursively sanitized data: content appear at the point their containing attribute is processed). The callback returns nothing; it exists so the caller can warm a cache — for example, fetching image and font resources in parallel — that the subsequent synchronous URLRewriter then reads from.
Each [URLRef.Context].GetAttr is live during the callback and returns the element's attributes, so the caller can inspect sibling attributes (width, height, style, type, rel, …) to decide whether to fetch.
The prefetcher and rewriter are independent: either, both, or neither may be set. When both are set the prefetcher runs first, to completion, and then the normal rewrite walk runs. When no rewriter is set the prefetcher is still invoked (with the set the rewriter would have seen). The callback is always invoked exactly once per Apply call, even when no URLs are found (with an empty slice).
Implementation note: enabling a prefetcher runs the deterministic sanitization pipeline a second time — a recording-only pass that collects the URL set without mutating output — so the prefetch set is guaranteed to match the rewrite pass exactly, including URLs at every recursion depth. This roughly doubles the parse/sanitize CPU for the call (the network fetches it enables run once, in parallel). Callers that do not set a prefetcher pay nothing. If WithVerboseLog is also enabled, rule-match log lines are emitted for both passes.
Like WithApplyURLRewriter, the callback is scoped to a single Apply call and may safely close over per-call state without concurrency concerns. It is invoked synchronously from within Apply; htmlpolicy itself introduces no goroutines (the caller may fan out internally).
Example ¶
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`allow *`, nil)
	if err != nil {
		log.Fatal(err)
	}

	// The prefetcher receives every URL up front, so a caller can fetch them
	// in parallel (with its own concurrency limits) and warm a cache that the
	// synchronous rewriter then reads from.
	cache := map[string]string{}
	prefetch := func(refs []htmlpolicy.URLRef) {
		for i, r := range refs {
			// A real implementation would fetch r.URL here, in parallel.
			cache[r.URL] = fmt.Sprintf("cid:image%03d", i+1)
		}
	}
	rewriter := func(ctx htmlpolicy.URLContext, u string) string {
		if cid, ok := cache[u]; ok {
			return cid
		}
		return u
	}

	input := []byte(`<img src="https://example.com/a.jpg"/><img src="https://example.com/b.jpg"/>`)
	output, _, err := policy.ApplyHTML(input,
		htmlpolicy.WithApplyURLPrefetcher(prefetch),
		htmlpolicy.WithApplyURLRewriter(rewriter))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<img src="cid:image001"/><img src="cid:image002"/>
func WithApplyURLRewriter ¶
func WithApplyURLRewriter(fn URLRewriter) ApplyOption
WithApplyURLRewriter sets a function that rewrites URLs in the sanitized output for this single Apply call. The rewriter runs after policy evaluation and base URL resolution with each URL's final resolved form. It receives a URLContext describing where the URL was found and returns a replacement URL (or the original to keep it unchanged).
The rewriter covers HTML URL attributes (href, src, action, etc.), CSS url() values in style attributes and <style> elements, @import URLs, srcset entries, SVG url() attributes, meta refresh URLs, and SMIL animation values. URLs inside recursively-sanitized data:text/html content are also rewritten. Stripped and defanged URLs are excluded.
Fragment-only references (#id), empty URLs, and data: URIs are excluded from the callback. However, URLs inside data URI content are recursed into — e.g. url(img.png) inside a data:text/css URI is presented to the rewriter even though the data URI itself is not.
Because the rewriter is scoped to a single Apply call, it may safely close over per-call state (caches, counters, accumulators) without concurrency concerns. Two goroutines calling Apply on the same Policy with different rewriters do not share state.
Example ¶
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
strip script,style
strip-attr * on*
`, nil)
	if err != nil {
		log.Fatal(err)
	}

	rewriter := func(ctx htmlpolicy.URLContext, u string) string {
		// Replace external URLs with CID references (for email embedding).
		if ctx.Element == "img" && ctx.Attr == "src" {
			return "cid:image001"
		}
		return u
	}

	input := []byte(`<p>Hello</p><img src="https://example.com/photo.jpg" alt="photo"/>`)
	output, _, err := policy.ApplyHTML(input, htmlpolicy.WithApplyURLRewriter(rewriter))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<p>Hello</p><img src="cid:image001" alt="photo"/>
type AttrRule ¶
type AttrRule struct {
Selector string // CSS selector text
Matcher cascadia.Matcher // compiled selector
AttrName string // attribute name pattern (may have trailing "*" glob)
Action Action // Allow, Strip, or Defang
Line int // source line number
}
AttrRule defines filtering for a specific attribute on matching elements.
type CSSAtRule ¶
type CSSAtRule struct {
Name string // e.g. "import", "font-face"
Action Action // Allow, Strip, or Defang
Line int // source line number
}
CSSAtRule filters CSS at-rules by name.
type CSSPropertyRule ¶
type CSSPropertyRule struct {
Properties []string // property names (trailing "*" glob ok)
Action Action // Allow, Strip, or Defang
Line int // source line number
}
CSSPropertyRule filters CSS properties by name.
type CSSSchemeRule ¶
type CSSSchemeRule struct {
Properties []string // CSS property name patterns (trailing "*" glob ok, "*" = all, "@import" = import URLs)
Schemes []string // e.g. ["javascript", "data", ":relative", "*"]
Action Action // Allow, Strip, or Defang
Line int // source line number
}
CSSSchemeRule filters URLs in CSS url() values by scheme.
type CSSValueRule ¶
type CSSValueRule struct {
Pattern string // e.g. "url(*)", "expression(*)"
Action Action // Strip or Defang
Line int // source line number
}
CSSValueRule filters CSS properties whose values match a pattern. Patterns ending in "(*)" match any value containing a call to that CSS function (e.g. "expression(*)" matches values containing "expression(...)"). Other patterns match the full value literally (case-insensitive).
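The two pattern forms can be sketched as follows. `matchValuePattern` is a hypothetical helper that uses a plain substring search for the function-call form, so it would also flag a value containing `curl(` for the pattern `url(*)`. Real CSS matching would work on tokens and handle escape sequences, and applying case-insensitive matching to the function form is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// matchValuePattern reports whether a CSS value matches a value pattern.
// Patterns ending in "(*)" match any value containing a call to that
// function; all other patterns must equal the whole value, ignoring case.
func matchValuePattern(pattern, value string) bool {
	p := strings.ToLower(strings.TrimSpace(pattern))
	v := strings.ToLower(strings.TrimSpace(value))
	if strings.HasSuffix(p, "(*)") {
		fn := strings.TrimSuffix(p, "(*)")
		return strings.Contains(v, fn+"(")
	}
	return p == v
}

func main() {
	fmt.Println(matchValuePattern("expression(*)", "Expression(alert(1))"))
	fmt.Println(matchValuePattern("url(*)", "red"))
	fmt.Println(matchValuePattern("none", "NONE"))
}
```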
type ContentTypeRule ¶
type ContentTypeRule struct {
Selector string // CSS selector text
Matcher cascadia.Matcher // compiled selector
Attr string // attribute name pattern (e.g. "src", "*", "href")
Types []string // MIME type patterns (lowercase), e.g. "image/*", "text/html"
Action Action // Allow, Strip, or Defang
Line int
}
ContentTypeRule restricts data URI MIME types for an attribute on matching elements. The Action field determines what happens to data URIs whose MIME type matches the rule's pattern list: Allow recursively sanitizes them, Strip removes the URI value, and Defang renames the attribute to make it inert.
type ParseOption ¶
type ParseOption func(*parseConfig)
ParseOption configures the behavior of Parse.
func WithMaxIncludeDepth ¶
func WithMaxIncludeDepth(n int) ParseOption
WithMaxIncludeDepth sets the maximum nesting depth for include directives. The default is DefaultMaxIncludeDepth (64).
Example ¶
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
strip script,style
`, nil, htmlpolicy.WithMaxIncludeDepth(4))
	if err != nil {
		log.Fatal(err)
	}

	output, _, err := policy.ApplyHTML([]byte("<p>Hello</p><script>evil</script>"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<p>Hello</p>
func WithMaxOutputFactor ¶
func WithMaxOutputFactor(factor float64) ParseOption
WithMaxOutputFactor sets the maximum allowed output size as a multiplier of the input size. For example, a factor of 10.0 means the output may be at most 10x the input size (with a minimum of 32KB). This guards against amplification attacks such as a long <base href> resolved into many short relative URLs. The default is 10.0. Set to 0 to disable the limit.
Example ¶
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
strip script
`, nil, htmlpolicy.WithMaxOutputFactor(2.0))
	if err != nil {
		log.Fatal(err)
	}

	// Small input is always allowed (32KB minimum).
	output, _, err := policy.ApplyHTML([]byte("<p>Hello</p>"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<p>Hello</p>
func WithPreserveOriginal ¶
func WithPreserveOriginal() ParseOption
WithPreserveOriginal configures Policy.ApplyHTML to return the original input bytes when no policy rules modify the content, instead of re-serializing through the HTML parser.
By default, ApplyHTML always re-serializes through Go's HTML5 parser, which normalizes malformed HTML. This is the safer default because it ensures the browser sees exactly the same structure the sanitizer saw.
Warning: enabling this option means that when no rules match, the original bytes are returned unmodified. If the input contains malformed HTML that Go's parser and the browser parse differently, this creates a parser-differential attack surface. Only enable this option if you trust the input to be well-formed HTML, or if you need byte-for-byte preservation of unmodified content (e.g., to avoid altering whitespace or attribute quoting in content that was not changed by policy rules).
Example ¶
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
strip script
`, nil, htmlpolicy.WithPreserveOriginal())
	if err != nil {
		log.Fatal(err)
	}

	// Unmodified content is returned as-is (byte-for-byte).
	input := []byte("<p>Hello</p>")
	output, modified, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("modified:", modified)
	fmt.Println(string(output))
}

Output:
modified: false
<p>Hello</p>
func WithVerboseLog ¶
func WithVerboseLog(w io.Writer) ParseOption
WithVerboseLog enables verbose logging of policy rule matches. When set, every sanitization action (strip, defang, demote, comment-out, placeholder) writes a one-line description to w. Elements and attributes that pass through unchanged are not logged.
The writer must be safe for concurrent use if the Policy is used concurrently.
type Policy ¶
type Policy struct {
// contains filtered or unexported fields
}
Policy is a compiled set of rules for sanitizing HTML, CSS, and SVG content.
A Policy is created by Parse and should not be modified afterward. The only exception is Policy.SetPrefix, which may be called before the first Policy.ApplyHTML call.
A Policy is safe for concurrent use by multiple goroutines. Policy.ApplyHTML does not mutate the Policy.
Security note: this library is policy-driven and does not hardcode any element or attribute as safe or dangerous. Policy authors must account for modern HTML features including Declarative Shadow DOM (<template shadowrootmode>), iframe srcdoc, fencedframe, custom elements (the is attribute), and CSP nonce attributes. See the project README for recommended baselines and security guidance.
func Parse ¶
func Parse(input string, resolver Resolver, opts ...ParseOption) (*Policy, error)
Parse parses policy text into a Policy. The resolver is used to load included policies (by preset name or file path); it may be nil if the policy contains no include directives.
Example ¶
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
strip script,style
strip-attr * on*
strip-attr a[href^=javascript:] href
`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p onclick="track()">Hello</p><script>alert(1)</script>`)
	output, modified, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("modified:", modified)
	fmt.Println("output:", string(output))
}

Output:
modified: true
output: <p>Hello</p>
Example (Allowlist) ¶
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
strip *
strip-attr * *
allow p,div,b,i,em,strong
allow a href
allow-attr * class,id
strip-attr * on*
`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<div><p class="text"><a href="https://example.com" onclick="x">Link</a> and <b>bold</b></p><script>evil</script></div>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<div><p class="text"><a href="https://example.com">Link</a> and <b>bold</b></p></div>
Example (CommentOut) ¶
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
comment-out script
`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<p>safe</p><script>evil()</script>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<p>safe</p><!--htmlpolicy-commented-out <script>evil()</script>-->
Example (ContentTypeFiltering) ¶
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
strip-content-type * src *
allow-content-type * src image/*
`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<img src="data:image/png;base64,iVBOR"/><img src="data:text/html,<script>evil</script>"/>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<img src="data:image/png;base64,iVBOR"/><img/>
Example (CssAtRules) ¶
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
css-strip-at import
css-defang-at media
css-allow-at keyframes
`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte("@import \"evil.css\";\n@media screen { .x { color: red; } }\n@keyframes fade { from { opacity: 0; } }")
	output, _, err := policy.ApplyCSS(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
@htmlpolicy-defanged-media screen { .x { color: red; } }
@keyframes fade { from { opacity: 0; } }
Example (DefangContentType) ¶
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
defang-content-type * src image/svg+xml
`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<img src="data:image/svg+xml,<svg/>"/><img src="data:image/png;base64,iVBOR"/>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<img htmlpolicy-defanged-src="data:image/svg+xml,<svg/>"/><img src="data:image/png;base64,iVBOR"/>
Example (DefangScheme) ¶
package main

import (
	"fmt"
	"log"

	"gitlab.com/grepular/htmlpolicy"
)

func main() {
	policy, err := htmlpolicy.Parse(`
defang-scheme * href *
allow-scheme * href https,:relative
`, nil)
	if err != nil {
		log.Fatal(err)
	}

	input := []byte(`<a href="javascript:alert(1)">evil</a><a href="https://example.com">safe</a>`)
	output, _, err := policy.ApplyHTML(input)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(output))
}

Output:
<a htmlpolicy-defanged-href="javascript:alert(1)">evil</a><a href="https://example.com">safe</a>
Example (Demote) ¶
package main
import (
"fmt"
"log"
"gitlab.com/grepular/htmlpolicy"
)
func main() {
policy, err := htmlpolicy.Parse(`
demote form class
demote marquee
`, nil)
if err != nil {
log.Fatal(err)
}
input := []byte(`<form class="contact"><input type="text"/></form><marquee>hello</marquee>`)
output, _, err := policy.ApplyHTML(input)
if err != nil {
log.Fatal(err)
}
fmt.Println(string(output))
}
Output: <div class="contact"><input type="text"/></div><div>hello</div>
Example (Error) ¶
package main
import (
"fmt"
"gitlab.com/grepular/htmlpolicy"
)
func main() {
_, err := htmlpolicy.Parse(`badverb script`, nil)
fmt.Println(err)
}
Output: line 1: unknown verb "badverb"
Example (Placeholder) ¶
package main
import (
"fmt"
"log"
"gitlab.com/grepular/htmlpolicy"
)
func main() {
policy, err := htmlpolicy.Parse(`
placeholder script,iframe
placeholder-label blocked
`, nil)
if err != nil {
log.Fatal(err)
}
input := []byte(`<p>safe</p><script>evil</script><iframe src="x">frame</iframe>`)
output, _, err := policy.ApplyHTML(input)
if err != nil {
log.Fatal(err)
}
fmt.Println(string(output))
}
Output: <p>safe</p><strong title="<script>evil</script>">[blocked: script]</strong><strong title="<iframe src="x">frame</iframe>">[blocked: iframe]</strong>
Example (SchemeFiltering) ¶
package main
import (
"fmt"
"log"
"gitlab.com/grepular/htmlpolicy"
)
func main() {
policy, err := htmlpolicy.Parse(`
strip-scheme * href *
allow-scheme * href https,mailto,:relative
strip-scheme * src *
allow-scheme * src https,:relative
`, nil)
if err != nil {
log.Fatal(err)
}
input := []byte(`<a href="https://example.com">safe</a><a href="javascript:alert(1)">evil</a><img src="photo.jpg"/>`)
output, _, err := policy.ApplyHTML(input)
if err != nil {
log.Fatal(err)
}
fmt.Println(string(output))
}
Output: <a href="https://example.com">safe</a><a>evil</a><img src="photo.jpg"/>
func (*Policy) ApplyCSS ¶
ApplyCSS sanitizes a standalone CSS stylesheet using the policy's CSS rules. It applies property, value, and at-rule filtering, and filters url() schemes. Use this to sanitize raw CSS content that is not embedded in HTML.
If the policy has no CSS rules and no URL rewriter is supplied via WithApplyURLRewriter, the content is returned unchanged. When a rewriter is supplied, url() values are rewritten even without CSS sanitization rules.
ApplyCSS is safe for concurrent use on the same Policy.
Returns a non-nil error only when WithMaxOutputFactor is set and the sanitized output exceeds the configured size limit.
Example ¶
package main
import (
"fmt"
"log"
"gitlab.com/grepular/htmlpolicy"
)
func main() {
policy, err := htmlpolicy.Parse(`
css-strip *
css-allow color,font-size
css-strip-value expression(*)
`, nil)
if err != nil {
log.Fatal(err)
}
input := []byte(`.header { color: red; margin: 10px; font-size: 14px; }`)
output, modified, err := policy.ApplyCSS(input)
if err != nil {
log.Fatal(err)
}
fmt.Println("modified:", modified)
fmt.Println(string(output))
}
Output:
modified: true
.header { color: red; font-size: 14px; }
func (*Policy) ApplyDocument ¶
ApplyDocument applies the policy to a full HTML document. It returns the sanitized document, whether it was modified, and any error.
Unlike Policy.ApplyHTML, which parses input as a fragment (suitable for user content embedded in a page), ApplyDocument parses input as a complete document, preserving the <!DOCTYPE>, <html>, <head>, and <body> structure. If the input is missing these elements, the HTML5 parser adds them.
Policy rules apply to all elements in the document, including those in <head> (e.g., <title>, <meta>, <link>, <style>, <script>).
Use ApplyHTML for user-generated content fragments (comments, emails, forum posts). Use ApplyDocument for sanitizing complete HTML pages.
ApplyDocument is safe for concurrent use on the same Policy.
Any <base> elements in the input are always stripped, and relative URLs are resolved against the first base href before policy rules run. This prevents <base> injection attacks and ensures scheme rules see resolved URLs.
Returns an error if the output size exceeds the configured limit (see WithMaxOutputFactor).
Pass WithApplyURLRewriter to rewrite URLs in the sanitized output for this call only — see WithApplyURLRewriter for details.
Example ¶
package main
import (
"fmt"
"log"
"gitlab.com/grepular/htmlpolicy"
)
func main() {
policy, err := htmlpolicy.Parse("strip script", nil)
if err != nil {
log.Fatal(err)
}
input := []byte(`<!doctype html><html><head><title>Page</title></head><body><p>Hello</p><script>evil</script></body></html>`)
output, modified, err := policy.ApplyDocument(input)
if err != nil {
log.Fatal(err)
}
fmt.Println("modified:", modified)
fmt.Println(string(output))
}
Output:
modified: true
<!DOCTYPE html><html><head><title>Page</title></head><body><p>Hello</p></body></html>
func (*Policy) ApplyHTML ¶
ApplyHTML applies the policy to an HTML fragment. It returns the sanitized content, whether it was modified, and any error.
Input must be UTF-8. Use ConvertToUTF8 first if the charset is unknown. Output is always UTF-8.
Content is parsed as a fragment in a body context — it is never wrapped in <html>/<head>/<body> tags. If the input contains those tags, they are discarded and their content is kept. Use Policy.ApplyDocument to sanitize complete HTML documents with preserved document structure.
By default, output is always re-serialized through Go's HTML5 parser, which normalizes malformed HTML. This ensures the browser sees exactly the same structure the sanitizer saw, preventing parser-differential attacks. Use WithPreserveOriginal to return the original bytes when no policy rules modify the content.
ApplyHTML is safe for concurrent use on the same Policy.
Any <base> elements in the input are always stripped, and relative URLs are resolved against the first base href before policy rules run. This prevents <base> injection attacks and ensures scheme rules see resolved URLs.
Returns an error if the output size exceeds the configured limit (see WithMaxOutputFactor).
Pass WithApplyURLRewriter to rewrite URLs in the sanitized output for this call only — see WithApplyURLRewriter for details.
Example ¶
package main
import (
"fmt"
"log"
"gitlab.com/grepular/htmlpolicy"
)
func main() {
policy, err := htmlpolicy.Parse("strip script,style", nil)
if err != nil {
log.Fatal(err)
}
output, modified, err := policy.ApplyHTML([]byte("<p>Hello</p><script>alert(1)</script>"))
if err != nil {
log.Fatal(err)
}
fmt.Println("modified:", modified)
fmt.Println(string(output))
}
Output:
modified: true
<p>Hello</p>
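The <base> handling described for ApplyHTML and ApplyDocument resolves relative URLs before scheme rules run. As a rough illustration of what that resolution means, the sketch below uses only net/url from the standard library; the resolve helper and the example hosts are assumptions made for this sketch and say nothing about how the library performs the step internally.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve returns ref resolved against base. If either value fails to
// parse, ref is returned unchanged.
func resolve(base, ref string) string {
	b, err := url.Parse(base)
	if err != nil {
		return ref
	}
	r, err := url.Parse(ref)
	if err != nil {
		return ref
	}
	return b.ResolveReference(r).String()
}

func main() {
	base := "https://cdn.example/assets/" // hypothetical <base href>

	// A relative reference inherits the scheme and host of the base.
	fmt.Println(resolve(base, "logo.png")) // https://cdn.example/assets/logo.png

	// An absolute reference is unaffected by the base.
	fmt.Println(resolve(base, "https://example.com/a.png")) // https://example.com/a.png
}
```

Once resolved, a URL that looked schemeless in the source carries the scheme of the base, which is the form a scheme rule would then evaluate according to the text above.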
func (*Policy) ApplyInlineCSS ¶
ApplyInlineCSS sanitizes a standalone inline CSS declaration list (the content of a style attribute, without surrounding HTML). It applies property, value, and URL scheme filtering.
If the policy has no CSS rules and no URL rewriter is supplied via WithApplyURLRewriter, the content is returned unchanged. When a rewriter is supplied, url() values are rewritten even without CSS sanitization rules.
ApplyInlineCSS is safe for concurrent use on the same Policy.
Returns a non-nil error only when WithMaxOutputFactor is set and the sanitized output exceeds the configured size limit.
Example ¶
package main
import (
"fmt"
"log"
"gitlab.com/grepular/htmlpolicy"
)
func main() {
policy, err := htmlpolicy.Parse(`
css-strip *
css-allow color,font-size
`, nil)
if err != nil {
log.Fatal(err)
}
input := []byte(`color: red; margin: 10px; font-size: 14px`)
output, modified, err := policy.ApplyInlineCSS(input)
if err != nil {
log.Fatal(err)
}
fmt.Println("modified:", modified)
fmt.Println(string(output))
}
Output:
modified: true
color: red; font-size: 14px
func (*Policy) SetPrefix ¶
SetPrefix sets the base prefix used when generating names for the CommentOut action (e.g. "PREFIX-commented-out ...") and every defang action: the HTML attribute/scheme/content-type defangs (defang-attr, defang-scheme, defang-content-type, e.g. "PREFIX-defanged-onclick"), the CSS defang verbs (css-defang, css-defang-at, css-defang-value, e.g. "PREFIX-defanged-color"), and SVG animated-value defang. The default is "htmlpolicy".
The prefix must be non-empty and contain only ASCII letters, digits, or hyphens. This prevents injection of arbitrary content into HTML attribute names and comment bodies.
SetPrefix is safe to call concurrently with Policy.ApplyHTML.
Example ¶
package main
import (
"fmt"
"log"
"gitlab.com/grepular/htmlpolicy"
)
func main() {
policy, err := htmlpolicy.Parse(`
defang-attr * onclick
comment-out script
`, nil)
if err != nil {
log.Fatal(err)
}
if err := policy.SetPrefix("myapp"); err != nil {
log.Fatal(err)
}
output, _, err := policy.ApplyHTML([]byte(`<p onclick="track()">Hello</p><script>evil</script>`))
if err != nil {
log.Fatal(err)
}
fmt.Println(string(output))
}
Output: <p myapp-defanged-onclick="track()">Hello</p><!--myapp-commented-out <script>evil</script>-->
Example (Error) ¶
package main
import (
"fmt"
"log"
"gitlab.com/grepular/htmlpolicy"
)
func main() {
policy, err := htmlpolicy.Parse(`defang-attr * on*`, nil)
if err != nil {
log.Fatal(err)
}
err = policy.SetPrefix("")
fmt.Println(err)
}
Output: prefix must not be empty
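The prefix constraint stated for SetPrefix (non-empty, ASCII letters, digits, or hyphens only) and the generated names it feeds into can be sketched with the standard library alone. validatePrefix and defangedName below are hypothetical helpers written for this illustration; the empty-prefix message matches the example output above, while the invalid-character message is invented.

```go
package main

import (
	"errors"
	"fmt"
)

// validatePrefix enforces the documented rule: non-empty, and only ASCII
// letters, digits, or hyphens.
func validatePrefix(p string) error {
	if p == "" {
		return errors.New("prefix must not be empty")
	}
	for i := 0; i < len(p); i++ {
		c := p[i]
		isLetter := (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
		isDigit := c >= '0' && c <= '9'
		if !isLetter && !isDigit && c != '-' {
			// Hypothetical wording; the library's message may differ.
			return fmt.Errorf("invalid character %q in prefix", c)
		}
	}
	return nil
}

// defangedName builds an inert attribute name in the documented
// "PREFIX-defanged-NAME" shape.
func defangedName(prefix, attr string) string {
	return prefix + "-defanged-" + attr
}

func main() {
	fmt.Println(validatePrefix("myapp"))          // <nil>
	fmt.Println(validatePrefix(""))               // prefix must not be empty
	fmt.Println(validatePrefix("x onload=") != nil) // true
	fmt.Println(defangedName("myapp", "onclick")) // myapp-defanged-onclick
}
```

Restricting the alphabet this tightly means a prefix can never contain a space, quote, or "=", so it cannot terminate the attribute name or comment body it is spliced into.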
type Resolver ¶
Resolver loads policy text for include directives.
name is whatever appears after "include " in the policy file. The Resolver implementation decides how to interpret it (file path, database key, embedded resource, etc.).
from identifies the policy that contains the include directive, allowing the resolver to interpret name relative to the including policy's location. It is the canonical name returned by an earlier Resolve call (the include that loaded the parent), or the empty string for include directives in the top-level policy text passed to Parse. A file-based resolver typically treats from as a file path and joins relative name values against filepath.Dir(from); resolvers without a notion of location can ignore from entirely.
Resolve returns the canonical name of the loaded resource along with its text. The parser forwards canonical as from on nested include calls and uses it as the key for circular-include detection. Implementations should normalize the canonical form (e.g. resolve relative paths to absolute, strip "./" prefixes, normalize separators) so that the same underlying resource always yields the same canonical string. Without normalization, a cycle through e.g. "foo.policy" and "./foo.policy" is caught only by the depth limit, which reports a misleading "exceeds maximum depth" error rather than "circular include". Resolvers without a notion of canonicalization may return name unchanged.
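For a file-based resolver, the normalization described above might look like the sketch below. It covers only the path handling: the exact Resolve signature is not shown in this section, so canonicalize is a hypothetical helper and its parameter order (name, from) is an assumption. It also stops at filepath.Clean rather than converting to an absolute path.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// canonicalize interprets name relative to the directory of from (when from
// is non-empty and name is not absolute) and cleans the result so that
// equivalent spellings collapse to one string.
func canonicalize(name, from string) string {
	if from != "" && !filepath.IsAbs(name) {
		name = filepath.Join(filepath.Dir(from), name)
	}
	return filepath.Clean(name)
}

func main() {
	// Both spellings yield the same key, so a visited-set keyed on the
	// canonical name would recognize them as the same resource.
	a := canonicalize("foo.policy", "")
	b := canonicalize("./foo.policy", "")
	fmt.Println(a == b) // true

	// A nested include is resolved next to the policy that includes it.
	fmt.Println(canonicalize("../base.policy", "policies/email/main.policy"))
}
```

A resolver that can be invoked from different working directories would probably want filepath.Abs as a further step, as the text suggests.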
type SchemeRule ¶
type SchemeRule struct {
Selector string // CSS selector text
Matcher cascadia.Matcher // compiled selector
Attr string // attribute name pattern (e.g. "href", "*", "on*")
Schemes []string // schemes (lowercase), ":relative" for schemeless URLs, or "*" for any scheme
Action Action // Allow, Strip, or Defang
Line int
}
SchemeRule restricts URL schemes for an attribute on matching elements. The Action field determines what happens to URLs with matching schemes: Allow passes them through, Strip removes them, and Defang renames the attribute to make it inert.
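To make the interaction between Schemes, ":relative", "*", and last-match-wins ordering concrete, here is a small standalone model. It is a sketch under assumptions, not the library's evaluator: schemeRule, schemeOf, and decide are hypothetical, selector and attribute matching are omitted, and treating "*" as also covering schemeless URLs is a guess.

```go
package main

import (
	"fmt"
	"strings"
)

// schemeRule is a simplified stand-in for SchemeRule without selectors.
type schemeRule struct {
	action  string   // "allow", "strip", or "defang"
	schemes []string // lowercase schemes, ":relative", or "*"
}

// schemeOf returns the lowercase scheme of raw, or ":relative" when raw has
// no scheme. A real sanitizer must also cope with leading whitespace and
// control characters, which this sketch ignores.
func schemeOf(raw string) string {
	i := strings.IndexAny(raw, ":/?#")
	if i <= 0 || raw[i] != ':' {
		return ":relative"
	}
	return strings.ToLower(raw[:i])
}

// decide applies the rules top to bottom and returns the action of the last
// rule that matches; unmatched URLs pass through as "allow".
func decide(rules []schemeRule, raw string) string {
	s := schemeOf(raw)
	action := "allow"
	for _, r := range rules {
		for _, want := range r.schemes {
			if want == "*" || want == s { // assumption: "*" also matches :relative
				action = r.action
				break
			}
		}
	}
	return action
}

func main() {
	rules := []schemeRule{
		{"strip", []string{"*"}},
		{"allow", []string{"https", "mailto", ":relative"}},
	}
	fmt.Println(decide(rules, "https://example.com")) // allow
	fmt.Println(decide(rules, "javascript:alert(1)")) // strip
	fmt.Println(decide(rules, "photo.jpg"))           // allow
}
```

The three results line up with the SchemeFiltering example: the https link and the relative image survive, and the javascript: link loses its href.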
type TagRule ¶
type TagRule struct {
Selector string // CSS selector text (for String())
Matcher cascadia.Matcher // compiled selector
Action Action // what to do with matching elements
Line int // source line number (for diagnostics)
}
TagRule defines an action for matching elements.
type URLContext ¶
type URLContext struct {
// Element is the lowercase HTML element name (e.g., "img", "a", "style").
// Empty for standalone CSS sanitization ([Policy.ApplyCSS], [Policy.ApplyInlineCSS]).
Element string
// Attr is the HTML attribute name (e.g., "href", "src", "style").
// Empty for URLs inside <style> elements or standalone CSS.
Attr string
// CSSProperty is the CSS property name containing the url() value
// (e.g., "background-image", "background"), or "@import" for @import URLs.
// Empty for non-CSS URLs (HTML attributes).
CSSProperty string
// Parent is the lowercase name of the URL's element's parent in the
// HTML tree (e.g., "picture", "audio", "video", "head", "body"). Empty
// when the element has no element parent — top-level fragment nodes,
// document-root children whose parent is the document node, and
// standalone CSS sanitization.
//
// Useful for elements whose semantics depend on ancestry, e.g.
// <source src> is an image candidate inside <picture> but a media file
// inside <audio>/<video>.
Parent string
// GetAttr returns the value of the named attribute on the URL's
// element, or "" if the attribute is absent. Lookups are
// case-insensitive. GetAttr is always non-nil; for orphan contexts
// (standalone ApplyCSS / ApplyInlineCSS) it returns "" for any name.
//
// Useful for elements whose semantics depend on a sibling attribute on
// the same element, e.g. <input type=image>, <link rel=stylesheet>,
// <object type=...>. GetAttr reflects the live attribute slice on the
// element at the time of the call, so mutations applied by earlier
// rewriter calls on the same element are visible.
//
// GetAttr is not considered part of the deduplication key — the
// cached rewriter result for a given {url, element, attr, parent,
// cssProperty} tuple comes from the first call, so if you discriminate
// on GetAttr across same-key elements only the first element's attrs
// influence the cached value.
GetAttr func(name string) string
}
URLContext describes where a URL was found in the sanitized output. It is passed to URLRewriter callbacks (configured per call via WithApplyURLRewriter) to provide context about each URL.
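A rewriter can branch on these fields to route URLs differently depending on where they occur. The sketch below uses a local urlContext type that mirrors the documented fields so that it compiles without the package; the /proxy/... paths are hypothetical, and a real callback would take htmlpolicy.URLContext instead.

```go
package main

import (
	"fmt"
	"net/url"
)

// urlContext is a local stand-in mirroring the documented URLContext fields.
type urlContext struct {
	Element     string
	Attr        string
	CSSProperty string
	Parent      string
	GetAttr     func(name string) string
}

// rewrite routes a URL to a hypothetical proxy endpoint based on context.
func rewrite(ctx urlContext, rawURL string) string {
	q := "?u=" + url.QueryEscape(rawURL)
	switch {
	case ctx.CSSProperty != "":
		return "/proxy/css" + q
	case ctx.Element == "source" && ctx.Parent == "picture":
		return "/proxy/image" + q // image candidate
	case ctx.Element == "source" && (ctx.Parent == "audio" || ctx.Parent == "video"):
		return "/proxy/media" + q // media file
	case ctx.Element == "input" && ctx.GetAttr("type") == "image":
		return "/proxy/image" + q // sibling attribute decides
	}
	return rawURL // keep everything else unchanged
}

func main() {
	none := func(string) string { return "" }
	pic := urlContext{Element: "source", Attr: "src", Parent: "picture", GetAttr: none}
	vid := urlContext{Element: "source", Attr: "src", Parent: "video", GetAttr: none}
	fmt.Println(rewrite(pic, "a.webp")) // /proxy/image?u=a.webp
	fmt.Println(rewrite(vid, "a.webm")) // /proxy/media?u=a.webm
}
```

Note the caveat in the GetAttr comment above: because GetAttr is not part of the deduplication key, a branch such as the input type check sees only the first element for a given key.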
type URLRef ¶ added in v0.3.0
type URLRef struct {
// Context describes where the URL was found: the element, attribute, CSS
// property, parent element name, and a live GetAttr accessor. It is the
// exact same URLContext value the [URLRewriter] would receive for this
// URL. Context.GetAttr is valid for the duration of the prefetch callback
// (the underlying nodes are still alive); do not retain it past the call.
Context URLContext
// URL is the discovered URL in its final resolved form — identical to the
// value that would be passed to a [URLRewriter] for this same context.
URL string
}
URLRef pairs a URL discovered during sanitization with the URLContext describing where it was found. A slice of URLRef is passed to the callback registered by WithApplyURLPrefetcher, allowing the caller to warm a cache (e.g. fetch resources in parallel) before the synchronous URLRewriter runs.
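The cache-warming pattern mentioned above can be sketched as follows. This is an illustration only: the callback signature expected by WithApplyURLPrefetcher is not shown in this section, so urlRef is a reduced local stand-in (URL only) and warm and fetch are hypothetical names. The point is the shape of the work: fetch distinct URLs in parallel, then let a synchronous rewriter read the results.

```go
package main

import (
	"fmt"
	"sync"
)

// urlRef is a reduced stand-in for URLRef; the real type also carries Context.
type urlRef struct {
	URL string
}

// warm calls fetch once per distinct URL, concurrently, and returns the
// results keyed by URL. It blocks until every fetch has finished.
func warm(refs []urlRef, fetch func(string) string) map[string]string {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		cache = make(map[string]string)
		seen  = make(map[string]bool)
	)
	for _, ref := range refs {
		if seen[ref.URL] {
			continue
		}
		seen[ref.URL] = true
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			v := fetch(u) // e.g. download and store the resource
			mu.Lock()
			cache[u] = v
			mu.Unlock()
		}(ref.URL)
	}
	wg.Wait()
	return cache
}

func main() {
	refs := []urlRef{{"https://a.example/1.png"}, {"https://a.example/2.png"}, {"https://a.example/1.png"}}
	cache := warm(refs, func(u string) string { return "/cached/" + u })

	// A synchronous rewriter can now answer from the warmed cache.
	rewrite := func(u string) string {
		if v, ok := cache[u]; ok {
			return v
		}
		return u
	}
	fmt.Println(len(cache)) // 2
	fmt.Println(rewrite("https://a.example/2.png"))
}
```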
type URLRewriter ¶
type URLRewriter func(ctx URLContext, url string) string
URLRewriter is a callback that receives each URL in the sanitized output and returns a replacement. Return the original url unchanged to keep it. Supply a rewriter to one Apply call via WithApplyURLRewriter. Because the rewriter is scoped to that single call, it does not need to be safe for concurrent use; two goroutines calling Apply on the same Policy pass independent rewriters and never share state.
URLs are presented after policy evaluation and base URL resolution. Fragment-only references (#id), empty URLs, and data: URIs are excluded. URLs inside data URI content are recursed into (e.g. url(img.png) inside data:text/css is presented even though the data URI is not).
Calls are deduplicated within a single top-level applyHTMLAt / applyDocumentAt / applyCSS / applyInlineCSS invocation: each unique combination of {url, element, attr, parent, cssProperty} is presented to the callback exactly once for that invocation. The same URL in different contexts (e.g. <img src> vs <video poster>, or the same URL on a <source> inside <picture> vs inside <video>) produces separate calls. Recursive invocations triggered by embedded data URI content each have their own deduplication cache.
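The deduplication behavior described above amounts to memoizing the callback on the five-field key. The sketch below models that with local stand-in types; urlContext, dedupKey, and dedup are hypothetical names, and the library's internal cache may be organized differently.

```go
package main

import "fmt"

// urlContext is a local stand-in for the documented URLContext fields that
// participate in the deduplication key.
type urlContext struct {
	Element, Attr, CSSProperty, Parent string
}

// dedupKey is the documented {url, element, attr, parent, cssProperty} tuple.
type dedupKey struct {
	url, element, attr, parent, cssProperty string
}

// dedup wraps next so that each distinct key reaches it exactly once; later
// calls with the same key reuse the first result. The wrapper is not safe
// for concurrent use, which matches a rewriter scoped to a single call.
func dedup(next func(urlContext, string) string) func(urlContext, string) string {
	cache := make(map[dedupKey]string)
	return func(ctx urlContext, rawURL string) string {
		k := dedupKey{rawURL, ctx.Element, ctx.Attr, ctx.Parent, ctx.CSSProperty}
		if v, ok := cache[k]; ok {
			return v
		}
		v := next(ctx, rawURL)
		cache[k] = v
		return v
	}
}

func main() {
	calls := 0
	rw := dedup(func(ctx urlContext, u string) string {
		calls++
		return "/proxy/" + u
	})
	img := urlContext{Element: "img", Attr: "src", Parent: "p"}
	poster := urlContext{Element: "video", Attr: "poster", Parent: "div"}

	rw(img, "a.png")
	rw(img, "a.png")    // same key: served from the cache
	rw(poster, "a.png") // same URL, different context: separate call
	fmt.Println(calls)  // 2
}
```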
Directories ¶
| Path | Synopsis |
|---|---|
| cmd/htmlpolicy (command) | Command htmlpolicy applies an HTML sanitization policy to HTML content. |