Documentation
Overview ¶
Package tok provides high-performance text compression for LLM context windows.
Usage:

	compressed, stats := tok.Compress(text, tok.Aggressive)
	compressed, stats := tok.Compress(text, tok.WithBudget(4000))
	c := tok.NewCompressor(tok.Adaptive)
	compressed, stats := c.Compress(text)
Index ¶
- Constants
- func CompressJSON(text string, maxItems int) string
- func CompressLog(text string) string
- func DetectLanguageByExtension(path string) string
- func EstimateTokens(text string) int
- func EstimateTokensPrecise(text string) int
- func GetLanguagePatterns(lang string) []string
- func RegisterChunker(ext string, fn ChunkerFunc)
- func RegisterLanguagePatterns(lang string, patterns []string)
- func WarmupTokenizer()
- type ChunkOptions
- type ChunkerFunc
- type CodeChunk
- type Compressor
- type LayerStat
- type Mode
- type Option
- type SeparatorKeep
- type Stats
- type StreamCompressor
- type Tier
Constants ¶
const DefaultMinChunkSize = 250
DefaultMinChunkSize is the default minimum chunk size in tokens.
Variables ¶
This section is empty.
Functions ¶
func CompressJSON ¶

func CompressJSON(text string, maxItems int) string

CompressJSON samples large JSON arrays, keeping error/failure items, the first 2 and last 2 items, and a random sample of the middle. If maxItems <= 0, it defaults to 20. Non-array input is returned unchanged.
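The sampling strategy above can be sketched as follows. This is a standalone illustration working on already-split items, not the package's implementation (which operates on raw JSON text), and it omits the random middle sample; `sampleItems` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// sampleItems keeps items that look like errors or failures, plus the
// first 2 and last 2 elements, mirroring the documented strategy.
// (Illustrative sketch; the random middle sample is omitted.)
func sampleItems(items []string, maxItems int) []string {
	if maxItems <= 0 {
		maxItems = 20 // documented default
	}
	if len(items) <= maxItems {
		return items
	}
	keep := make([]string, 0, maxItems)
	// Always keep items that look like errors or failures.
	for _, it := range items {
		lower := strings.ToLower(it)
		if strings.Contains(lower, "error") || strings.Contains(lower, "fail") {
			keep = append(keep, it)
		}
	}
	// First 2 and last 2 items.
	keep = append(keep, items[:2]...)
	keep = append(keep, items[len(items)-2:]...)
	if len(keep) > maxItems {
		keep = keep[:maxItems]
	}
	return keep
}

func main() {
	items := []string{"ok-1", "ok-2", "error: disk full", "ok-4", "ok-5", "ok-6"}
	fmt.Println(sampleItems(items, 5))
}
```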
func CompressLog ¶

func CompressLog(text string) string

CompressLog preserves ERROR/WARN/FATAL lines and stack traces, collapsing runs of 3+ similar INFO/DEBUG lines into a summary.
func DetectLanguageByExtension ¶

func DetectLanguageByExtension(path string) string

DetectLanguageByExtension returns the programming language name for a file path based on its extension. Returns "" for unknown extensions.
func EstimateTokens ¶

func EstimateTokens(text string) int

EstimateTokens returns the estimated token count for the given text.
func EstimateTokensPrecise ¶

func EstimateTokensPrecise(text string) int

EstimateTokensPrecise returns the token count using BPE tokenization; it is slower but more accurate than EstimateTokens.
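A common rough heuristic for fast estimators is about 4 characters per token of English text under BPE tokenizers. The sketch below shows what such a heuristic might look like; this is an assumption for illustration, as the package does not document EstimateTokens' actual formula.

```go
package main

import "fmt"

// estimateTokens approximates the token count as ceil(len/4), a widely
// used rule of thumb for English text under BPE tokenizers. Illustrative
// assumption only; not the package's documented formula.
func estimateTokens(text string) int {
	return (len(text) + 3) / 4
}

func main() {
	fmt.Println(estimateTokens("high-performance text compression"))
}
```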
func GetLanguagePatterns ¶

func GetLanguagePatterns(lang string) []string

GetLanguagePatterns returns the custom patterns registered for lang, or the built-in pattern strings if none are registered.
func RegisterChunker ¶
func RegisterChunker(ext string, fn ChunkerFunc)
RegisterChunker registers a custom chunker for a file extension (e.g. ".go").
func RegisterLanguagePatterns ¶

func RegisterLanguagePatterns(lang string, patterns []string)

RegisterLanguagePatterns registers custom boundary patterns for a language. These override the built-in patterns when looking up boundaries.
func WarmupTokenizer ¶
func WarmupTokenizer()
WarmupTokenizer pre-initializes the BPE tokenizer in the background. Call at application startup to avoid latency on the first Compress call.
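A background warmup like this is commonly built from sync.Once plus a goroutine; the sketch below illustrates the pattern under that assumption (the real initialization presumably loads BPE vocabulary and merge tables, which is elided here).

```go
package main

import (
	"fmt"
	"sync"
)

var (
	tokenizerOnce sync.Once
	ready         = make(chan struct{})
)

// warmupTokenizer kicks off expensive initialization on a goroutine so
// the first real call does not pay the cost. Calling it more than once
// is safe: sync.Once runs the body exactly once.
func warmupTokenizer() {
	tokenizerOnce.Do(func() {
		go func() {
			// ... load BPE vocabulary, build merge tables ... (elided)
			close(ready)
		}()
	})
}

func main() {
	warmupTokenizer() // call at application startup
	<-ready           // later callers block only until init completes
	fmt.Println("tokenizer ready")
}
```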
Types ¶
type ChunkOptions ¶
type ChunkOptions struct {
MaxTokens int
MinTokens int
MinChunkSize int // hard minimum; chunks below this get heavy DP penalty (default: DefaultMinChunkSize)
Language string
Overlap int // number of tokens worth of content to repeat from previous chunk
KeepSeparator SeparatorKeep // controls boundary line placement (default: SepLeft)
}
ChunkOptions configures the code chunking behavior.
func DefaultChunkOptions ¶
func DefaultChunkOptions() ChunkOptions
DefaultChunkOptions returns sensible defaults for code chunking.
type ChunkerFunc ¶
ChunkerFunc is a custom chunker that takes a file path and content, returning the detected language and code chunks.
type CodeChunk ¶
type CodeChunk struct {
Content string `json:"content"`
StartLine int `json:"start_line"`
EndLine int `json:"end_line"`
Symbol string `json:"symbol,omitempty"`
Tokens int `json:"tokens"`
}
CodeChunk represents a semantically meaningful chunk of source code.
func ChunkCode ¶
func ChunkCode(source string, opts ChunkOptions) []CodeChunk
ChunkCode splits source code into semantically meaningful chunks based on language-aware boundary detection (function/class/method definitions). If a custom chunker is registered for the file extension in opts.Language, it is used instead of the default pipeline.
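Boundary-based chunking can be sketched with a toy splitter that cuts Go source at top-level `func ` declarations. This is only an illustration of the idea: the real pipeline scores candidate boundaries (including class/method definitions in other languages), applies token budgets, and uses DP penalties for undersized chunks.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkBySignature splits Go source at top-level "func " boundaries.
// A toy stand-in for language-aware boundary detection.
func chunkBySignature(source string) []string {
	var chunks []string
	var cur []string
	for _, line := range strings.Split(source, "\n") {
		// Start a new chunk at each top-level func declaration.
		if strings.HasPrefix(line, "func ") && len(cur) > 0 {
			chunks = append(chunks, strings.Join(cur, "\n"))
			cur = cur[:0]
		}
		cur = append(cur, line)
	}
	if len(cur) > 0 {
		chunks = append(chunks, strings.Join(cur, "\n"))
	}
	return chunks
}

func main() {
	src := "func A() {}\n\nfunc B() {}"
	for i, c := range chunkBySignature(src) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```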
func ChunkCodePath ¶
func ChunkCodePath(path, source string, opts ChunkOptions) []CodeChunk
ChunkCodePath is like ChunkCode but accepts a file path for registry lookup.
type Compressor ¶
type Compressor struct {
// contains filtered or unexported fields
}
Compressor is a reusable compression instance. Reuses internal caches across calls for better performance.
func NewCompressor ¶
func NewCompressor(opts ...Option) *Compressor
NewCompressor creates a reusable compressor.
type Option ¶
type Option interface {
// contains filtered or unexported methods
}
Option configures compression behavior.
var (
	Minimal    Option = WithMode(ModeMinimal)
	Aggressive Option = WithMode(ModeAggressive)
	Surface    Option = WithTier(TierSurface)
	Adaptive   Option = WithTier(TierAdaptive)
	Code       Option = WithTier(TierCode)
	Log        Option = WithTier(TierLog)
)

Pre-built option presets (use directly as options).
func WithBudget ¶
WithBudget sets a hard token limit for the output.
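An Option interface with unexported methods is Go's functional-options idiom. The sketch below shows how such options typically compose; all internals here (the config struct, its fields, and lowercase constructor names) are hypothetical, since tok's real fields are unexported.

```go
package main

import "fmt"

// config mirrors what a compressor's internal settings might look like.
// Hypothetical names; tok's real configuration is unexported.
type config struct {
	budget int
	mode   string
}

// option is the functional-options pattern: each option is a value that
// mutates the config when applied.
type option func(*config)

func withBudget(n int) option  { return func(c *config) { c.budget = n } }
func withMode(m string) option { return func(c *config) { c.mode = m } }

// newConfig applies options over defaults, in order.
func newConfig(opts ...option) *config {
	c := &config{budget: 0, mode: "minimal"} // defaults
	for _, o := range opts {
		o(c)
	}
	return c
}

func main() {
	c := newConfig(withBudget(4000), withMode("aggressive"))
	fmt.Println(c.budget, c.mode)
}
```

Presets like Aggressive or Adaptive then fall out naturally: they are just pre-constructed option values.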
type SeparatorKeep ¶
type SeparatorKeep int
SeparatorKeep controls what happens to boundary separators during splitting.
const (
	SepLeft    SeparatorKeep = iota // separator stays with preceding chunk (default)
	SepRight                        // separator stays with following chunk
	SepDiscard                      // separator is removed from output
)
type Stats ¶
type Stats struct {
OriginalTokens int
FinalTokens int
TokensSaved int
ReductionPercent float64
Layers map[string]LayerStat
}
Stats contains compression statistics.
type StreamCompressor ¶
type StreamCompressor struct {
// contains filtered or unexported fields
}
StreamCompressor maintains a background-compressed version of accumulating content. As new content is appended, it re-compresses in the background so a compressed snapshot is always available without blocking.
func NewStreamCompressor ¶
func NewStreamCompressor(threshold int, opts ...Option) *StreamCompressor
NewStreamCompressor creates a background compressor that keeps compressed output ready at all times. Threshold is the token count that triggers background re-compression. If threshold <= 0, it defaults to 2000 tokens.
func (*StreamCompressor) Append ¶
func (sc *StreamCompressor) Append(content string)
Append adds new content. If accumulated tokens exceed the threshold, triggers background re-compression.
func (*StreamCompressor) Close ¶
func (sc *StreamCompressor) Close()
Close shuts down the background compressor and waits for any in-progress compression to finish.
func (*StreamCompressor) Raw ¶
func (sc *StreamCompressor) Raw() string
Raw returns all accumulated raw content.
func (*StreamCompressor) Snapshot ¶
func (sc *StreamCompressor) Snapshot() (string, Stats)
Snapshot returns the current compressed output without blocking. If compression hasn't run yet, returns the raw content joined.
func (*StreamCompressor) TokenCount ¶
func (sc *StreamCompressor) TokenCount() int
TokenCount returns estimated token count of raw content.
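The StreamCompressor design described above can be sketched as follows. This is a minimal standalone illustration: the "compression" step is a trivial stand-in, the token estimate is a crude length heuristic, and Close/Raw are omitted.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// streamCompressor accumulates raw content; past a threshold, Append
// re-"compresses" on a background goroutine so Snapshot never blocks.
type streamCompressor struct {
	mu        sync.Mutex
	raw       []string
	snapshot  string
	threshold int
	wg        sync.WaitGroup
}

func newStreamCompressor(threshold int) *streamCompressor {
	if threshold <= 0 {
		threshold = 2000 // documented default
	}
	return &streamCompressor{threshold: threshold}
}

func (sc *streamCompressor) Append(content string) {
	sc.mu.Lock()
	sc.raw = append(sc.raw, content)
	text := strings.Join(sc.raw, "\n")
	sc.mu.Unlock()
	if len(text)/4 > sc.threshold { // rough token estimate
		sc.wg.Add(1)
		go func() {
			defer sc.wg.Done()
			compressed := strings.TrimSpace(text) // stand-in for real compression
			sc.mu.Lock()
			sc.snapshot = compressed
			sc.mu.Unlock()
		}()
	}
}

// Snapshot returns the latest compressed output without blocking, or the
// raw content joined if compression has not run yet.
func (sc *streamCompressor) Snapshot() string {
	sc.mu.Lock()
	defer sc.mu.Unlock()
	if sc.snapshot != "" {
		return sc.snapshot
	}
	return strings.Join(sc.raw, "\n")
}

func main() {
	sc := newStreamCompressor(10)
	sc.Append("line one")
	fmt.Println(sc.Snapshot())
}
```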
type Tier ¶
type Tier string
Tier selects a pre-built pipeline profile.
const (
	TierSurface  Tier = "surface"  // 4 layers, fast
	TierTrim     Tier = "trim"     // 8 layers, balanced
	TierExtract  Tier = "extract"  // 20 layers, max compression
	TierCore     Tier = "core"     // 20 layers, quality-first
	TierCode     Tier = "code"     // code-aware
	TierLog      Tier = "log"      // log-aware
	TierThread   Tier = "thread"   // conversation-aware
	TierAdaptive Tier = "adaptive" // auto-detect content type
)
Source Files ¶

Directories ¶
| Path | Synopsis |
|---|---|
| internal | |
| cache | Package cache provides persistent query caching for tok. |
| config | Package config provides configuration management for tok. |
| core | Package core provides core interfaces and utilities for tok. |
| filter | Package filter provides LRU caching using the unified cache package. |
| simd | Package simd provides performance-optimized string operations using manual loop unrolling (processing 16 bytes per iteration). |