tokenizer

package v1.3.10

Published: Mar 8, 2026 License: Apache-2.0 Imports: 4 Imported by: 0

Documentation

Overview

Package tokenizer provides token counting functionality for LLM context management.

Token counting is essential for managing context windows and ensuring prompts fit within model limits. This package provides:

  • TokenCounter interface for pluggable implementations
  • HeuristicTokenCounter with model-aware word-to-token ratios
  • MessageTokenCounter for counting tokens across multimodal messages
  • Support for different model families (GPT, Claude, Gemini, etc.)
  • Content-aware ratio adjustment (code, CJK text, mixed content)

The heuristic approach is suitable for context truncation decisions where approximate counts are sufficient. For exact token counts (billing, etc.), use provider-specific CostInfo from API responses.
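A minimal usage sketch, assuming the package is imported under the name `tokenizer` (the module path is not shown on this page, so substitute your module's actual import path):

```go
// Pick a counter matched to the target model, then budget a prompt
// against that model's context window before sending it.
counter := tokenizer.NewTokenCounterForModel("gpt-4")
prompt := "Summarize the following document..."
if counter.CountTokens(prompt) > 8000 { // example context budget
	// truncate, summarize, or drop older messages
}
```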

Index

Constants

This section is empty.

Variables

DefaultTokenCounter is a package-level counter using the default model family. Use this when you don't need model-specific tokenization.

Functions

func CountMessageTokensDefault added in v1.3.10

func CountMessageTokensDefault(messages []types.Message) int

CountMessageTokensDefault is a convenience function that counts message tokens using the default token counter.

func CountTokens

func CountTokens(text string) int

CountTokens is a convenience function using the default token counter.

func DetectContentType added in v1.3.10

func DetectContentType(text string) float64

DetectContentType analyzes text content and returns an adjusted token ratio multiplier based on content characteristics (code, CJK text, etc.).
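The kind of adjustment this implies can be sketched as follows. The threshold and multiplier below are illustrative assumptions, not the package's actual values; CJK scripts tend toward roughly one token per character, so CJK-heavy text warrants a higher multiplier than whitespace-delimited prose:

```go
package main

import (
	"fmt"
	"unicode"
)

// contentMultiplier is an illustrative stand-in for DetectContentType:
// it measures the fraction of CJK runes and bumps the ratio multiplier
// when CJK content dominates.
func contentMultiplier(text string) float64 {
	cjk, total := 0, 0
	for _, r := range text {
		total++
		if unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana, unicode.Hangul) {
			cjk++
		}
	}
	if total > 0 && float64(cjk)/float64(total) > 0.3 {
		return 2.0 // assumed multiplier for CJK-dominant text
	}
	return 1.0
}

func main() {
	fmt.Println(contentMultiplier("plain English prose"))
	fmt.Println(contentMultiplier("こんにちは世界"))
}
```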

Types

type HeuristicTokenCounter

type HeuristicTokenCounter struct {
	// contains filtered or unexported fields
}

HeuristicTokenCounter estimates token counts using word-based heuristics. This is fast and suitable for context management decisions where exact counts are not required. For accurate counts, use a tokenizer library like tiktoken-go.
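The word-based estimate can be illustrated with a self-contained sketch. It mirrors the documented idea (tokens ≈ words × tokens-per-word ratio), but the package's actual fields and rounding are unexported, so treat the details as assumptions:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// estimateTokens sketches the heuristic: count whitespace-separated
// words and scale by a tokens-per-word ratio (e.g. 1.3 for GPT-family
// English text). Rounding up keeps the estimate conservative for
// context-budget decisions.
func estimateTokens(text string, ratio float64) int {
	if text == "" {
		return 0
	}
	words := len(strings.Fields(text))
	return int(math.Ceil(float64(words) * ratio))
}

func main() {
	// 9 words * 1.3 = 11.7, rounded up to 12
	fmt.Println(estimateTokens("The quick brown fox jumps over the lazy dog", 1.3))
}
```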

func NewHeuristicTokenCounter

func NewHeuristicTokenCounter(family ModelFamily) *HeuristicTokenCounter

NewHeuristicTokenCounter creates a token counter for the specified model family.

func NewHeuristicTokenCounterWithRatio

func NewHeuristicTokenCounterWithRatio(ratio float64) *HeuristicTokenCounter

NewHeuristicTokenCounterWithRatio creates a token counter with a custom ratio. Use this when you have measured the actual token ratio for your specific use case.

func (*HeuristicTokenCounter) CountMessageTokens added in v1.3.10

func (h *HeuristicTokenCounter) CountMessageTokens(messages []types.Message) int

CountMessageTokens estimates the total token count for a slice of messages. It handles multimodal content by estimating image tokens based on detail level and counting text tokens with content-aware heuristics.

func (*HeuristicTokenCounter) CountMultiple

func (h *HeuristicTokenCounter) CountMultiple(texts []string) int

CountMultiple returns the total token count for multiple text segments.

func (*HeuristicTokenCounter) CountTokens

func (h *HeuristicTokenCounter) CountTokens(text string) int

CountTokens estimates token count for the given text. Returns 0 for empty text.

func (*HeuristicTokenCounter) CountTokensContentAware added in v1.3.10

func (h *HeuristicTokenCounter) CountTokensContentAware(text string) int

CountTokensContentAware estimates token count with content-type awareness. It adjusts the base ratio based on whether the text appears to be code, CJK text, or regular prose.

func (*HeuristicTokenCounter) Ratio

func (h *HeuristicTokenCounter) Ratio() float64

Ratio returns the current token ratio. Thread-safe.

func (*HeuristicTokenCounter) SetRatio

func (h *HeuristicTokenCounter) SetRatio(ratio float64)

SetRatio updates the token ratio. Thread-safe.
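One way to obtain a measured ratio to feed SetRatio (or NewHeuristicTokenCounterWithRatio) is to divide an exact token count, such as one reported in a provider's usage data, by the word count of the same text. The helper below is hypothetical, not part of the package:

```go
package main

import (
	"fmt"
	"strings"
)

// calibrateRatio derives a tokens-per-word ratio from a sample text and
// the exact token count a provider reported for it.
func calibrateRatio(sample string, actualTokens int) float64 {
	words := len(strings.Fields(sample))
	if words == 0 {
		return 0
	}
	return float64(actualTokens) / float64(words)
}

func main() {
	// A provider reported 6 tokens for this 4-word sample: ratio 1.5.
	fmt.Println(calibrateRatio("one two three four", 6))
}
```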

type ModelFamily

type ModelFamily string

ModelFamily represents a family of LLM models with similar tokenization.

const (
	// ModelFamilyGPT covers OpenAI GPT models (GPT-3.5, GPT-4, etc.)
	// Uses cl100k_base tokenizer - approximately 1.3 tokens per word for English.
	ModelFamilyGPT ModelFamily = "gpt"

	// ModelFamilyClaude covers Anthropic Claude models.
	// Similar to GPT tokenization - approximately 1.3 tokens per word.
	ModelFamilyClaude ModelFamily = "claude"

	// ModelFamilyGemini covers Google Gemini models.
	// Uses SentencePiece tokenizer - approximately 1.4 tokens per word.
	ModelFamilyGemini ModelFamily = "gemini"

	// ModelFamilyLlama covers Meta Llama models.
	// Uses SentencePiece tokenizer - approximately 1.4 tokens per word.
	ModelFamilyLlama ModelFamily = "llama"

	// ModelFamilyDefault is used when the model family is unknown.
	// Uses a conservative estimate of 1.35 tokens per word.
	ModelFamilyDefault ModelFamily = "default"
)

func GetModelFamily

func GetModelFamily(modelName string) ModelFamily

GetModelFamily returns the appropriate ModelFamily for a model name. This performs prefix matching to categorize models.
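Prefix matching of this kind can be sketched as below; the exact prefixes the package recognizes are not listed on this page, so the set here is an assumption drawn from the documented families:

```go
package main

import (
	"fmt"
	"strings"
)

// familyFor illustrates prefix-based categorization: lowercase the model
// name, test known family prefixes, and fall back to the default family.
func familyFor(modelName string) string {
	name := strings.ToLower(modelName)
	for _, prefix := range []string{"gpt", "claude", "gemini", "llama"} {
		if strings.HasPrefix(name, prefix) {
			return prefix
		}
	}
	return "default"
}

func main() {
	fmt.Println(familyFor("GPT-4o"))
	fmt.Println(familyFor("mistral-7b"))
}
```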

type TokenCounter

type TokenCounter interface {
	// CountTokens returns the estimated or actual token count for the given text.
	CountTokens(text string) int

	// CountMultiple returns the total token count for multiple text segments.
	CountMultiple(texts []string) int
}

TokenCounter provides token counting functionality. Implementations may use heuristics or actual tokenization.
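Because the interface is small, plugging in an alternative strategy is straightforward. Below is a sketch of a character-based implementation; the interface is restated locally so the example compiles standalone, and the ~4-characters-per-token rule of thumb is an assumption, not part of the package:

```go
package main

import "fmt"

// TokenCounter restates the documented interface so this sketch is
// self-contained.
type TokenCounter interface {
	CountTokens(text string) int
	CountMultiple(texts []string) int
}

// CharCounter estimates tokens from byte length (~4 characters per
// token, rounded up), a common rule of thumb for English text.
type CharCounter struct{}

func (CharCounter) CountTokens(text string) int { return (len(text) + 3) / 4 }

func (c CharCounter) CountMultiple(texts []string) int {
	total := 0
	for _, t := range texts {
		total += c.CountTokens(t)
	}
	return total
}

var _ TokenCounter = CharCounter{} // compile-time interface check

func main() {
	// "abcdefgh" -> 2 tokens, "hello" -> 2 tokens
	fmt.Println(CharCounter{}.CountMultiple([]string{"abcdefgh", "hello"}))
}
```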

func NewTokenCounterForModel

func NewTokenCounterForModel(modelName string) TokenCounter

NewTokenCounterForModel creates a token counter appropriate for the given model.
