sanitize

package
v0.4.11 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 19, 2026 License: AGPL-3.0 Imports: 3 Imported by: 0

Documentation

Overview

Package sanitize provides write-path content hygiene for memini: stripping unambiguous corruption (always-on) and detecting "script-salad" garble (opt-in). It exists because ingestion stores whatever a harness sends, and an upstream model/harness glitch can hand memini a garbled digest that then surfaces in recall verbatim.
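As a rough illustration of the write path described above, the sketch below wires an always-on cleaning step and an opt-in garble check into a single ingest function. Everything here is hypothetical: the `pipeline` and `record` types, the `ingest` method, and the toy stand-in functions in `main` are invented for this example, and the package's import path is not shown on this page. In real use the two function fields would presumably be set to `sanitize.Clean` and `sanitize.Garbled`.

```go
package main

import (
	"fmt"
	"strings"
)

// record is a hypothetical stored item; it is not a type from the package.
type record struct {
	Text     string
	Rejected bool // cleaned to empty, so nothing worth storing
	Flagged  bool // garble heuristic fired; downrank, do not delete
}

// pipeline holds the two hygiene steps as plain functions so this sketch
// stays self-contained. garbled may be nil, which models "off by default".
type pipeline struct {
	clean   func(string) string
	garbled func(string) bool
}

// ingest always cleans, rejects input that cleans to nothing, and only
// flags (never drops) text when the optional garble check is enabled.
func (p pipeline) ingest(raw string) record {
	text := p.clean(raw)
	if strings.TrimSpace(text) == "" {
		return record{Rejected: true}
	}
	rec := record{Text: text}
	if p.garbled != nil && p.garbled(text) {
		rec.Flagged = true
	}
	return rec
}

// Toy stand-ins, NOT the package's logic: toyClean only drops invalid
// UTF-8 bytes, and toyGarbled just looks for a marker word.
func toyClean(s string) string { return strings.ToValidUTF8(s, "") }
func toyGarbled(s string) bool { return strings.Contains(s, "GARBLE") }

func main() {
	alwaysOn := pipeline{clean: toyClean}
	withCheck := pipeline{clean: toyClean, garbled: toyGarbled}

	fmt.Printf("%+v\n", alwaysOn.ingest("\xff\xfe"))
	fmt.Printf("%+v\n", alwaysOn.ingest("note GARBLE"))
	fmt.Printf("%+v\n", withCheck.ingest("note GARBLE"))
}
```

The point of the shape is that a flagged record still carries its text: a positive from the heuristic changes ranking or visibility, not whether the content is kept.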

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Clean

func Clean(s string) string

Clean strips unambiguous corruption from text before it is embedded or persisted: invalid UTF-8 byte sequences, the U+FFFD replacement character, C0/C1 control codes (except tab, newline, carriage return), and Unicode non-character code points. It deliberately does NOT touch valid printable text in any language — legitimate Chinese, Japanese, Arabic, or emoji content passes through untouched. A string that is pure binary garbage cleans to (or near) empty, which the caller can then reject.
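The filtering rules listed for Clean can be approximated with the standard library alone. The following is a minimal sketch written from the description above, not the package's source: `cleanSketch` and `isNoncharacter` are hypothetical names, and details the documentation does not specify (for example, whether DEL, U+007F, is stripped) are assumptions marked in the comments.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// isNoncharacter reports whether r is a Unicode noncharacter:
// U+FDD0..U+FDEF, or the last two code points of any plane (U+xxFFFE, U+xxFFFF).
func isNoncharacter(r rune) bool {
	return (r >= 0xFDD0 && r <= 0xFDEF) || r&0xFFFE == 0xFFFE
}

// cleanSketch is a hypothetical approximation of the documented behavior.
func cleanSketch(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		i += size
		switch {
		case r == utf8.RuneError:
			// Either an invalid byte (size 1) or a literal U+FFFD (size 3);
			// both are dropped.
		case r == '\t' || r == '\n' || r == '\r':
			b.WriteRune(r) // the three whitespace controls that are kept
		case r < 0x20 || (r >= 0x80 && r <= 0x9F):
			// C0 and C1 control codes. Assumption: DEL (U+007F) is left
			// alone here because the documentation names only C0/C1.
		case isNoncharacter(r):
			// dropped
		default:
			b.WriteRune(r) // valid printable text in any script passes through
		}
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", cleanSketch("ok\x00\xff\ufffd\tdone"))
	fmt.Printf("%q\n", cleanSketch("使用React框架 🎉"))
	fmt.Printf("%q\n", cleanSketch("\xff\xfe\x00"))
}
```

Because the last input consists only of invalid bytes and a NUL, it cleans to the empty string, which is the case a caller would reject.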

func Garbled

func Garbled(s string) bool

Garbled reports whether text looks like script-salad — Latin glued to CJK glued to Cyrillic with no separators, the signature of garbled multilingual model output (e.g. `I'm这个家制品 with在上世纪`). It is a heuristic, not a proof: it CANNOT tell semantically-random mixing from a rare legitimate case, so callers must treat a positive as "downrank/flag", never "delete". It is off by default for exactly this reason.

Only *glued* transitions between two different real scripts count — a space or punctuation break resets adjacency, so ordinary code-switching ("the 这个 thing") scores zero. Han, kana, and hangul collapse into one CJK bucket so legitimate Japanese (Han+kana) and CJK-with-embedded-Latin tech terms ("使用React框架") stay well under the threshold.
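To make the "glued transition" idea concrete, here is a small sketch that counts adjacent script changes and compares the count to a cutoff. It is an illustration under assumptions, not the package's algorithm: the names `classify`, `gluedTransitions`, and `garbledSketch` are hypothetical, the cutoff `minTransitions = 4` is invented (the real threshold and scoring are not documented on this page), and only Latin, CJK, and Cyrillic are classified. In this sketch, any rune outside those scripts (spaces, punctuation, digits) resets adjacency.

```go
package main

import (
	"fmt"
	"unicode"
)

type script int

const (
	none     script = iota // separators and anything unclassified
	latin
	cjk // Han, kana, and hangul share one bucket
	cyrillic
)

// classify maps a rune to a coarse script bucket.
func classify(r rune) script {
	switch {
	case unicode.Is(unicode.Latin, r):
		return latin
	case unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana, unicode.Hangul):
		return cjk
	case unicode.Is(unicode.Cyrillic, r):
		return cyrillic
	}
	return none
}

// gluedTransitions counts places where two different scripts touch with
// no separator between them.
func gluedTransitions(s string) int {
	n := 0
	prev := none
	for _, r := range s {
		cur := classify(r)
		if cur != none && prev != none && cur != prev {
			n++
		}
		prev = cur // a separator sets prev to none, resetting adjacency
	}
	return n
}

// minTransitions is an invented cutoff for this sketch only.
const minTransitions = 4

func garbledSketch(s string) bool {
	return gluedTransitions(s) >= minTransitions
}

func main() {
	for _, s := range []string{
		"the 这个 thing",
		"使用React框架",
		"日本語とカタカナ",
		"the这个thing在上世纪привет框架",
	} {
		fmt.Printf("%q transitions=%d garbled=%v\n", s, gluedTransitions(s), garbledSketch(s))
	}
}
```

A raw count like this is cruder than whatever the package actually does, so a positive from it should be treated the same way the documentation advises: as a reason to flag or downrank, never to delete.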

Types

This section is empty.
