tokenizer

package
v1.1.9
Warning

This package is not in the latest version of its module.

Published: Jan 24, 2026 | License: Apache-2.0 | Imports: 2 | Imported by: 0

Documentation

Overview

Package tokenizer provides token counting functionality for LLM context management.

Token counting is essential for managing context windows and ensuring prompts fit within model limits. This package provides:

  • TokenCounter interface for pluggable implementations
  • HeuristicTokenCounter with model-aware word-to-token ratios
  • Support for different model families (GPT, Claude, Gemini, etc.)

The heuristic approach is suitable for context truncation decisions where approximate counts are sufficient. For exact token counts (billing, etc.), use provider-specific CostInfo from API responses.


Constants

This section is empty.

Variables

DefaultTokenCounter is a package-level counter using the default model family. Use this when you don't need model-specific tokenization.

Functions

func CountTokens

func CountTokens(text string) int

CountTokens is a convenience function using the default token counter.

Types

type HeuristicTokenCounter

type HeuristicTokenCounter struct {
	// contains filtered or unexported fields
}

HeuristicTokenCounter estimates token counts using word-based heuristics. The estimate is fast to compute and suited to context management decisions where exact counts are not required. For exact counts, use a tokenizer library such as tiktoken-go.

func NewHeuristicTokenCounter

func NewHeuristicTokenCounter(family ModelFamily) *HeuristicTokenCounter

NewHeuristicTokenCounter creates a token counter for the specified model family.

func NewHeuristicTokenCounterWithRatio

func NewHeuristicTokenCounterWithRatio(ratio float64) *HeuristicTokenCounter

NewHeuristicTokenCounterWithRatio creates a token counter with a custom ratio. Use this when you have measured the actual token ratio for your specific use case.

func (*HeuristicTokenCounter) CountMultiple

func (h *HeuristicTokenCounter) CountMultiple(texts []string) int

CountMultiple returns the total token count for multiple text segments.

func (*HeuristicTokenCounter) CountTokens

func (h *HeuristicTokenCounter) CountTokens(text string) int

CountTokens estimates the token count for the given text. It returns 0 for empty text.

func (*HeuristicTokenCounter) Ratio

func (h *HeuristicTokenCounter) Ratio() float64

Ratio returns the current token ratio. It is safe for concurrent use.

func (*HeuristicTokenCounter) SetRatio

func (h *HeuristicTokenCounter) SetRatio(ratio float64)

SetRatio updates the token ratio. It is safe for concurrent use.

type ModelFamily

type ModelFamily string

ModelFamily represents a family of LLM models with similar tokenization.

const (
	// ModelFamilyGPT covers OpenAI GPT models (GPT-3.5, GPT-4, etc.)
	// Uses cl100k_base tokenizer - approximately 1.3 tokens per word for English.
	ModelFamilyGPT ModelFamily = "gpt"

	// ModelFamilyClaude covers Anthropic Claude models.
	// Similar to GPT tokenization - approximately 1.3 tokens per word.
	ModelFamilyClaude ModelFamily = "claude"

	// ModelFamilyGemini covers Google Gemini models.
	// Uses SentencePiece tokenizer - approximately 1.4 tokens per word.
	ModelFamilyGemini ModelFamily = "gemini"

	// ModelFamilyLlama covers Meta Llama models.
	// Uses SentencePiece tokenizer - approximately 1.4 tokens per word.
	ModelFamilyLlama ModelFamily = "llama"

	// ModelFamilyDefault is used when the model family is unknown.
	// Uses a conservative estimate of 1.35 tokens per word.
	ModelFamilyDefault ModelFamily = "default"
)

func GetModelFamily

func GetModelFamily(modelName string) ModelFamily

GetModelFamily returns the appropriate ModelFamily for a model name. This performs prefix matching to categorize models.
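The prefix matching described here might look something like the following sketch. The prefix list, the lowercasing step, and the name `familyFor` are assumptions; the documentation does not list which prefixes the package actually recognizes.

```go
package main

import (
	"fmt"
	"strings"
)

// prefixes maps hypothetical model-name prefixes to family labels.
// The real package's prefix list is not shown in its documentation.
var prefixes = []struct{ prefix, family string }{
	{"gpt-", "gpt"},
	{"claude", "claude"},
	{"gemini", "gemini"},
	{"llama", "llama"},
}

// familyFor returns the family of the first matching prefix, or
// "default" when nothing matches. Case-insensitive matching is an
// assumption made for this example.
func familyFor(modelName string) string {
	name := strings.ToLower(strings.TrimSpace(modelName))
	for _, p := range prefixes {
		if strings.HasPrefix(name, p.prefix) {
			return p.family
		}
	}
	return "default"
}

func main() {
	for _, m := range []string{"gpt-4", "Claude-3", "mistral-7b"} {
		fmt.Printf("%s -> %s\n", m, familyFor(m))
	}
}
```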

type TokenCounter

type TokenCounter interface {
	// CountTokens returns the estimated or actual token count for the given text.
	CountTokens(text string) int

	// CountMultiple returns the total token count for multiple text segments.
	CountMultiple(texts []string) int
}

TokenCounter provides token counting functionality. Implementations may use heuristics or actual tokenization.

func NewTokenCounterForModel

func NewTokenCounterForModel(modelName string) TokenCounter

NewTokenCounterForModel creates a token counter appropriate for the given model.
