Documentation ¶
Overview ¶
Package sanitize provides write-path content hygiene for memini: stripping unambiguous corruption (always-on) and detecting "script-salad" garble (opt-in). It exists because ingestion stores whatever a harness sends, and an upstream model/harness glitch can hand memini a garbled digest that then surfaces in recall verbatim.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Clean ¶
Clean strips unambiguous corruption from text before it is embedded or persisted: invalid UTF-8 byte sequences, the U+FFFD replacement character, C0/C1 control codes (except tab, newline, and carriage return), and Unicode noncharacter code points. It deliberately does NOT touch valid printable text in any language: legitimate Chinese, Japanese, Arabic, or emoji content passes through unchanged. A string that is pure binary garbage cleans to an empty or near-empty string, which the caller can then reject.
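The filtering rules above can be approximated with the standard library alone. The following is a minimal sketch, not the package's implementation: `cleanSketch` is a hypothetical name, and treating DEL (U+007F) as a control code to drop is an assumption this page does not confirm.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// cleanSketch is a hypothetical stand-in for the documented behaviour of
// Clean; it is not the package's actual code.
func cleanSketch(s string) string {
	var b strings.Builder
	b.Grow(len(s))
	// Ranging over a string decodes one rune at a time; every invalid
	// byte is reported as utf8.RuneError (U+FFFD).
	for _, r := range s {
		switch {
		case r == utf8.RuneError:
			// Drops both invalid bytes and literal U+FFFD characters.
		case r == '\t' || r == '\n' || r == '\r':
			b.WriteRune(r) // the three permitted control codes
		case r < 0x20 || (r >= 0x7F && r <= 0x9F):
			// C0 and C1 control codes; dropping DEL is an assumption.
		case (r >= 0xFDD0 && r <= 0xFDEF) || r&0xFFFE == 0xFFFE:
			// Noncharacters: U+FDD0..U+FDEF and U+xxFFFE/U+xxFFFF.
		default:
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", cleanSketch("caf\xffé\u0000 ok\uFFFD\n")) // "café ok\n"
	fmt.Printf("%q\n", cleanSketch("\x00\x01\xfe\xff"))           // ""
}
```

A caller following the advice above would compare the cleaned length against the input and reject content that collapses to nothing; where that cutoff sits is left to the caller.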
func Garbled ¶
Garbled reports whether text looks like script-salad — Latin glued to CJK glued to Cyrillic with no separators, the signature of garbled multilingual model output (e.g. `I'm这个家制品 with在上世纪`). It is a heuristic, not a proof: it CANNOT tell semantically-random mixing from a rare legitimate case, so callers must treat a positive as "downrank/flag", never "delete". It is off by default for exactly this reason.
Only *glued* transitions between two different real scripts count — a space or punctuation break resets adjacency, so ordinary code-switching ("the 这个 thing") scores zero. Han, kana, and hangul collapse into one CJK bucket so legitimate Japanese (Han+kana) and CJK-with-embedded-Latin tech terms ("使用React框架") stay well under the threshold.
Types ¶
This section is empty.