tokenstrip

package

v0.7.1 Latest Latest Go to latest Published: May 3, 2026 License: MIT Imports: 10 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/sageox/ox

Links

Open Source Insights

Documentation ¶

Overview ¶

Package tokenstrip is a streaming, token-aware compaction stage for session raw.jsonl streams. It sits downstream of tokenopt in the session pipeline and reduces token count — not bytes — via a small set of intentionally conservative transforms.

Why a separate package ¶

tokenopt produces a byte-reduced stream (ANSI strip, image elision, tool-result dedup, etc.). Those transforms save bytes but rarely save tokens in proportion. tokenstrip attacks the tokenizer directly: NFC-normalize, eliminate zero-width characters, canonicalize whitespace, and — strictly inside assistant <thinking> blocks — drop stop words and optionally substitute high-token phrases with shorter synonyms.

Safety model — precise contract by transform ¶

Some transforms here are lossy. The package is therefore OFF by default in upstream callers and gated behind explicit opt-in.

Fields NEVER mutated, regardless of transform or config:

header entries (session metadata)
user turns, in their entirety (intent signal is sacred)
tool_name, tool_input, tool_mark.brief (summarizer scaffolding)

For assistant entries, the applicability depends on whether a transform is lossless or lossy:

Lossless transforms (apply to assistant content globally):
  - NFC Unicode normalization — round-trippable canonical form
  - Zero-width + unusual whitespace strip — information-free glyphs
  - Whitespace canonicalization — multiple spaces/newlines → one

Lossy transforms (apply ONLY to text inside <thinking>…</thinking>):
  - Stop-word removal
  - Synonym substitution (opt-in even when tokenstrip is enabled)

This means assistant prose OUTSIDE <thinking> may see its whitespace canonicalized and zero-width chars removed (lossless-safe), but its words will never be dropped or rewritten (preserves the answer to the user verbatim). Assistant prose INSIDE <thinking> may additionally lose stop words / have synonyms substituted (lossy but scoped to reasoning).

Streaming ¶

Compress is single-pass over r, bounded memory, tolerant of oversized entries (>64KB). Unknown top-level JSON fields on each entry round-trip via map[string]json.RawMessage so downstream consumers keep whatever schema extensions upstream added.

Index ¶

func DefaultSynonymTable() map[string]string
type Options
type Stats
- func Compress(r io.Reader, w io.Writer) (Stats, error)
- func CompressWith(r io.Reader, w io.Writer, opts Options) (Stats, error)

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func DefaultSynonymTable ¶

func DefaultSynonymTable() map[string]string

DefaultSynonymTable returns the baseline high-token phrase → shorter form mapping. Kept short and conservative; callers wanting more aggressive shortening should pass their own table.

Types ¶

type Options ¶

type Options struct {
	// EnableSynonymSub turns on phrase→synonym substitution inside assistant
	// <thinking> blocks. Off by default even when tokenstrip itself is on,
	// because the table is opinionated and can produce awkward reasoning
	// text; callers should opt in explicitly.
	EnableSynonymSub bool

	// SynonymTable overrides the default substitution table. Keys are
	// matched case-insensitively as whole words. A nil map falls back to
	// DefaultSynonymTable().
	SynonymTable map[string]string

	// StopWordLanguage is an ISO-639-1 language code (e.g. "en", "fr").
	// Empty string defaults to "en".
	StopWordLanguage string
}

Options configures a Compress run. Zero value is a reasonable default (English stop words, synonym substitution OFF).

type Stats ¶

type Stats struct {
	EntriesIn  int
	EntriesOut int
	BytesIn    int64
	BytesOut   int64

	NFCNormalized           int // entries where NFC normalization changed content
	ZeroWidthStripped       int // entries where zero-width / unusual whitespace was removed
	WhitespaceCanonicalized int // entries where whitespace collapse changed content
	StopWordsRemoved        int // <thinking> blocks where stop words were removed
	SynonymsSubstituted     int // <thinking> blocks where synonym substitution fired

	// Token estimates use a ~4 chars/token heuristic (Anthropic's rule of
	// thumb). They exist so callers can log a rough token-reduction number
	// without pulling in a BPE-heavy tokenizer. When a real tokenizer is
	// wired in later, swap the estimator in transforms.go and these fields
	// will reflect actual counts.
	TokensInEstimate  int64
	TokensOutEstimate int64
}

Stats reports what Compress did. Zero values mean no matches.

func Compress ¶

func Compress(r io.Reader, w io.Writer) (Stats, error)

Compress reads raw.jsonl entries from r, applies token-aware transforms, and writes the transformed stream to w. Equivalent to CompressWith with a zero Options value.

func CompressWith ¶

func CompressWith(r io.Reader, w io.Writer, opts Options) (Stats, error)

CompressWith is Compress with tunable options.

Guarantees:

Single pass over r, bounded memory.
Entry order preserved; nothing is dropped.
User turns and header entries are byte-identical on output.
Unknown top-level JSON fields survive round-trip.

func (*Stats) Add ¶

func (s *Stats) Add(other Stats)

Add accumulates other into s. Useful for aggregating across many sessions.

func (Stats) LogValue ¶

func (s Stats) LogValue() slog.Value

LogValue implements slog.LogValuer. Enables single-line key=value telemetry:

slog.Info("tokenstrip", "stats", stats)

func (Stats) Reduction ¶

func (s Stats) Reduction() (saved int64, pct float64)

Reduction returns bytes saved and percentage. Safe when BytesIn is zero.

func (Stats) TokenReduction ¶

func (s Stats) TokenReduction() (saved int64, pct float64)

TokenReduction returns estimated tokens saved and percentage.

Source Files ¶

View all Source files

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL