tokenstrip

package v0.7.0
Published: May 1, 2026 License: MIT Imports: 10 Imported by: 0

Documentation

Overview

Package tokenstrip is a streaming, token-aware compaction stage for session raw.jsonl streams. It sits downstream of tokenopt in the session pipeline and reduces token count — not bytes — via a small set of intentionally conservative transforms.

Why a separate package

tokenopt produces a byte-reduced stream (ANSI strip, image elision, tool-result dedup, etc.). Those transforms save bytes but rarely save tokens in proportion. tokenstrip attacks the tokenizer directly: NFC-normalize, eliminate zero-width characters, canonicalize whitespace, and — strictly inside assistant <thinking> blocks — drop stop words and optionally substitute high-token phrases with shorter synonyms.

Safety model — precise contract by transform

Some transforms here are lossy. The package is therefore OFF by default in upstream callers and gated behind explicit opt-in.

Fields NEVER mutated, regardless of transform or config:

  • header entries (session metadata)
  • user turns, in their entirety (intent signal is sacred)
  • tool_name, tool_input, tool_mark.brief (summarizer scaffolding)

For assistant entries, the scope of each transform depends on whether it is lossless or lossy:

Lossless transforms (apply to assistant content globally):
  - NFC Unicode normalization — round-trippable canonical form
  - Zero-width + unusual whitespace strip — information-free glyphs
  - Whitespace canonicalization — multiple spaces/newlines → one

Lossy transforms (apply ONLY to text inside <thinking>…</thinking>):
  - Stop-word removal
  - Synonym substitution (opt-in even when tokenstrip is enabled)

In practice: assistant prose OUTSIDE <thinking> may have its whitespace canonicalized and zero-width characters removed (lossless-safe), but its words are never dropped or rewritten, so the wording of the answer shown to the user is preserved. Assistant prose INSIDE <thinking> may additionally lose stop words or have synonyms substituted (lossy, but scoped to reasoning).

Streaming

Compress makes a single pass over r with bounded memory and tolerates oversized entries (>64KB). Unknown top-level JSON fields on each entry round-trip via map[string]json.RawMessage, so downstream consumers keep whatever schema extensions upstream added.
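
Because Compress never buffers the whole stream, it can be chained directly behind an upstream stage through an io.Pipe. In the sketch below, upstreamStage is a placeholder for whatever writes the tokenopt-processed raw.jsonl to an io.Writer; it is not part of this package.

// pipeThrough chains tokenstrip behind an upstream writer stage without
// buffering the whole session in memory. upstreamStage is hypothetical.
func pipeThrough(src io.Reader, dst io.Writer) (tokenstrip.Stats, error) {
	pr, pw := io.Pipe()
	go func() {
		// Propagate the upstream error (or nil) to the reader side.
		pw.CloseWithError(upstreamStage(src, pw))
	}()
	return tokenstrip.Compress(pr, dst)
}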

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func DefaultSynonymTable

func DefaultSynonymTable() map[string]string

DefaultSynonymTable returns the baseline high-token phrase → shorter form mapping. Kept short and conservative; callers wanting more aggressive shortening should pass their own table.
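
For example, a caller might copy the defaults and layer extra entries on top before passing the result through Options.SynonymTable. The two added phrases below are invented for illustration:

// Start from the conservative defaults and add project-specific phrases.
// Keys are matched case-insensitively as whole words.
table := tokenstrip.DefaultSynonymTable()
table["in order to"] = "to"
table["at this point in time"] = "now"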

Types

type Options

type Options struct {
	// EnableSynonymSub turns on phrase→synonym substitution inside assistant
	// <thinking> blocks. Off by default even when tokenstrip itself is on,
	// because the table is opinionated and can produce awkward reasoning
	// text; callers should opt in explicitly.
	EnableSynonymSub bool

	// SynonymTable overrides the default substitution table. Keys are
	// matched case-insensitively as whole words. A nil map falls back to
	// DefaultSynonymTable().
	SynonymTable map[string]string

	// StopWordLanguage is an ISO-639-1 language code (e.g. "en", "fr").
	// Empty string defaults to "en".
	StopWordLanguage string
}

Options configures a Compress run. Zero value is a reasonable default (English stop words, synonym substitution OFF).
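
The two common configurations look roughly like this (the French stop-word choice is illustrative):

// Zero value: English stop words, synonym substitution off.
var defaults tokenstrip.Options

// Explicit opt-in: substitute synonyms inside <thinking> blocks and use
// the French stop-word list. A nil SynonymTable falls back to the default.
opts := tokenstrip.Options{
	EnableSynonymSub: true,
	StopWordLanguage: "fr",
}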

type Stats

type Stats struct {
	EntriesIn  int
	EntriesOut int
	BytesIn    int64
	BytesOut   int64

	NFCNormalized           int // entries where NFC normalization changed content
	ZeroWidthStripped       int // entries where zero-width / unusual whitespace was removed
	WhitespaceCanonicalized int // entries where whitespace collapse changed content
	StopWordsRemoved        int // <thinking> blocks where stop words were removed
	SynonymsSubstituted     int // <thinking> blocks where synonym substitution fired

	// Token estimates use a ~4 chars/token heuristic (Anthropic's rule of
	// thumb). They exist so callers can log a rough token-reduction number
	// without pulling in a BPE-heavy tokenizer. When a real tokenizer is
	// wired in later, swap the estimator in transforms.go and these fields
	// will reflect actual counts.
	TokensInEstimate  int64
	TokensOutEstimate int64
}

Stats reports what Compress did. Zero values mean no matches.

func Compress

func Compress(r io.Reader, w io.Writer) (Stats, error)

Compress reads raw.jsonl entries from r, applies token-aware transforms, and writes the transformed stream to w. Equivalent to CompressWith with a zero Options value.
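
A minimal file-to-file invocation might look like the following; the file names and import path are illustrative:

package main

import (
	"log"
	"os"

	"example.test/session/tokenstrip" // illustrative import path
)

func main() {
	in, err := os.Open("raw.jsonl")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	out, err := os.Create("raw.stripped.jsonl")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	stats, err := tokenstrip.Compress(in, out)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("entries=%d bytes %d -> %d", stats.EntriesOut, stats.BytesIn, stats.BytesOut)
}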

func CompressWith

func CompressWith(r io.Reader, w io.Writer, opts Options) (Stats, error)

CompressWith is Compress with tunable options.

Guarantees:

  • Single pass over r, bounded memory.
  • Entry order preserved; nothing is dropped.
  • User turns and header entries are byte-identical on output.
  • Unknown top-level JSON fields survive round-trip.
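
A test-style sanity check of the "nothing is dropped" guarantee, sketched against an in-memory buffer (raw holds one session's raw.jsonl bytes, opts is any Options value):

var out bytes.Buffer
stats, err := tokenstrip.CompressWith(bytes.NewReader(raw), &out, opts)
if err != nil {
	t.Fatal(err)
}
// Every input entry appears in the output, in order.
if stats.EntriesIn != stats.EntriesOut {
	t.Fatalf("entries in=%d out=%d", stats.EntriesIn, stats.EntriesOut)
}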

func (*Stats) Add

func (s *Stats) Add(other Stats)

Add accumulates other into s. Useful for aggregating across many sessions.
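
For example, a caller iterating over many sessions might accumulate a single Stats and emit it as one log line (sessions, sess.r, sess.w, and opts are placeholders):

var total tokenstrip.Stats
for _, sess := range sessions {
	stats, err := tokenstrip.CompressWith(sess.r, sess.w, opts)
	if err != nil {
		return err
	}
	total.Add(stats)
}
slog.Info("tokenstrip", "stats", total)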

func (Stats) LogValue

func (s Stats) LogValue() slog.Value

LogValue implements slog.LogValuer. Enables single-line key=value telemetry:

slog.Info("tokenstrip", "stats", stats)

func (Stats) Reduction

func (s Stats) Reduction() (saved int64, pct float64)

Reduction returns bytes saved and percentage. Safe when BytesIn is zero.

func (Stats) TokenReduction

func (s Stats) TokenReduction() (saved int64, pct float64)

TokenReduction returns estimated tokens saved and percentage.
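
The two reduction helpers pair naturally in a summary line after a run, e.g.:

bSaved, bPct := stats.Reduction()
tSaved, tPct := stats.TokenReduction()
fmt.Printf("bytes saved: %d (%.1f%%), est. tokens saved: %d (%.1f%%)\n",
	bSaved, bPct, tSaved, tPct)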
