Documentation
¶
Overview ¶
Package tokenstrip is a streaming, token-aware compaction stage for session raw.jsonl streams. It sits downstream of tokenopt in the session pipeline and reduces token count — not bytes — via a small set of intentionally conservative transforms.
Why a separate package ¶
tokenopt produces a byte-reduced stream (ANSI strip, image elision, tool-result dedup, etc.). Those transforms save bytes but rarely save tokens in proportion. tokenstrip attacks the tokenizer directly: NFC-normalize, eliminate zero-width characters, canonicalize whitespace, and — strictly inside assistant <thinking> blocks — drop stop words and optionally substitute high-token phrases with shorter synonyms.
Safety model — precise contract by transform ¶
Some transforms here are lossy. The package is therefore OFF by default in upstream callers and gated behind explicit opt-in.
Fields NEVER mutated, regardless of transform or config:
- header entries (session metadata)
- user turns, in their entirety (intent signal is sacred)
- tool_name, tool_input, tool_mark.brief (summarizer scaffolding)
For assistant entries, the applicability depends on whether a transform is lossless or lossy:
Lossless transforms (apply to assistant content globally): - NFC Unicode normalization — round-trippable canonical form - Zero-width + unusual whitespace strip — information-free glyphs - Whitespace canonicalization — multiple spaces/newlines → one Lossy transforms (apply ONLY to text inside <thinking>…</thinking>): - Stop-word removal - Synonym substitution (opt-in even when tokenstrip is enabled)
This means assistant prose OUTSIDE <thinking> may see its whitespace canonicalized and zero-width chars removed (lossless-safe), but its words will never be dropped or rewritten (preserves the answer to the user verbatim). Assistant prose INSIDE <thinking> may additionally lose stop words / have synonyms substituted (lossy but scoped to reasoning).
Streaming ¶
Compress is single-pass over r, bounded memory, tolerant of oversized entries (>64KB). Unknown top-level JSON fields on each entry round-trip via map[string]json.RawMessage so downstream consumers keep whatever schema extensions upstream added.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func DefaultSynonymTable ¶
DefaultSynonymTable returns the baseline high-token phrase → shorter form mapping. Kept short and conservative; callers wanting more aggressive shortening should pass their own table.
Types ¶
type Options ¶
type Options struct {
// EnableSynonymSub turns on phrase→synonym substitution inside assistant
// <thinking> blocks. Off by default even when tokenstrip itself is on,
// because the table is opinionated and can produce awkward reasoning
// text; callers should opt in explicitly.
EnableSynonymSub bool
// SynonymTable overrides the default substitution table. Keys are
// matched case-insensitively as whole words. A nil map falls back to
// DefaultSynonymTable().
SynonymTable map[string]string
// StopWordLanguage is an ISO-639-1 language code (e.g. "en", "fr").
// Empty string defaults to "en".
StopWordLanguage string
}
Options configures a Compress run. Zero value is a reasonable default (English stop words, synonym substitution OFF).
type Stats ¶
type Stats struct {
EntriesIn int
EntriesOut int
BytesIn int64
BytesOut int64
NFCNormalized int // entries where NFC normalization changed content
ZeroWidthStripped int // entries where zero-width / unusual whitespace was removed
WhitespaceCanonicalized int // entries where whitespace collapse changed content
StopWordsRemoved int // <thinking> blocks where stop words were removed
SynonymsSubstituted int // <thinking> blocks where synonym substitution fired
// Token estimates use a ~4 chars/token heuristic (Anthropic's rule of
// thumb). They exist so callers can log a rough token-reduction number
// without pulling in a BPE-heavy tokenizer. When a real tokenizer is
// wired in later, swap the estimator in transforms.go and these fields
// will reflect actual counts.
TokensInEstimate int64
TokensOutEstimate int64
}
Stats reports what Compress did. Zero values mean no matches.
func Compress ¶
Compress reads raw.jsonl entries from r, applies token-aware transforms, and writes the transformed stream to w. Equivalent to CompressWith with a zero Options value.
func CompressWith ¶
CompressWith is Compress with tunable options.
Guarantees:
- Single pass over r, bounded memory.
- Entry order preserved; nothing is dropped.
- User turns and header entries are byte-identical on output.
- Unknown top-level JSON fields survive round-trip.
func (Stats) LogValue ¶
LogValue implements slog.LogValuer. Enables single-line key=value telemetry:
slog.Info("tokenstrip", "stats", stats)
func (Stats) TokenReduction ¶
TokenReduction returns estimated tokens saved and percentage.