tokenizer

package v1.3.10

Published: Mar 8, 2026 License: Apache-2.0 Imports: 4 Imported by: 0

Documentation

Overview

Package tokenizer provides token counting functionality for LLM context management.

Token counting is essential for managing context windows and ensuring prompts fit within model limits. This package provides:

  • TokenCounter interface for pluggable implementations
  • HeuristicTokenCounter with model-aware word-to-token ratios
  • MessageTokenCounter for counting tokens across multimodal messages
  • Support for different model families (GPT, Claude, Gemini, etc.)
  • Content-aware ratio adjustment (code, CJK text, mixed content)

The heuristic approach is suitable for context truncation decisions where approximate counts are sufficient. For exact token counts (billing, etc.), use provider-specific CostInfo from API responses.
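A minimal usage sketch, assuming the package is imported under the name `tokenizer` (the module path is not shown on this page, so substitute your module's actual import path):

```go
// Pick a counter matched to the target model, then budget a prompt
// against that model's context window before sending it.
counter := tokenizer.NewTokenCounterForModel("gpt-4")
prompt := "Summarize the following document..."
if counter.CountTokens(prompt) > 8000 { // example context budget
	// truncate, summarize, or drop older messages
}
```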

Index

Constants

This section is empty.

Variables

DefaultTokenCounter is a package-level counter using the default model family. Use this when you don't need model-specific tokenization.

Functions

func CountMessageTokensDefault added in v1.3.10

func CountMessageTokensDefault(messages []types.Message) int

CountMessageTokensDefault is a convenience function that counts message tokens using the default token counter.

func CountTokens

func CountTokens(text string) int

CountTokens is a convenience function using the default token counter.

func DetectContentType added in v1.3.10

func DetectContentType(text string) float64

DetectContentType analyzes text content and returns an adjusted token ratio multiplier based on content characteristics (code, CJK text, etc.).
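The kind of adjustment this implies can be sketched as follows. The threshold and multiplier below are illustrative assumptions, not the package's actual values; CJK scripts tend toward roughly one token per character, so CJK-heavy text warrants a higher multiplier than whitespace-delimited prose:

```go
package main

import (
	"fmt"
	"unicode"
)

// contentMultiplier is an illustrative stand-in for DetectContentType:
// it measures the fraction of CJK runes and bumps the ratio multiplier
// when CJK content dominates.
func contentMultiplier(text string) float64 {
	cjk, total := 0, 0
	for _, r := range text {
		total++
		if unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana, unicode.Hangul) {
			cjk++
		}
	}
	if total > 0 && float64(cjk)/float64(total) > 0.3 {
		return 2.0 // assumed multiplier for CJK-dominant text
	}
	return 1.0
}

func main() {
	fmt.Println(contentMultiplier("plain English prose"))
	fmt.Println(contentMultiplier("こんにちは世界"))
}
```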

Types

type HeuristicTokenCounter

type HeuristicTokenCounter struct {
	// contains filtered or unexported fields
}

HeuristicTokenCounter estimates token counts using word-based heuristics. This is fast and suitable for context management decisions where exact counts are not required. For accurate counts, use a tokenizer library like tiktoken-go.
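The word-based estimate can be illustrated with a self-contained sketch. It mirrors the documented idea (tokens ≈ words × tokens-per-word ratio), but the package's actual fields and rounding are unexported, so treat the details as assumptions:

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// estimateTokens sketches the heuristic: count whitespace-separated
// words and scale by a tokens-per-word ratio (e.g. 1.3 for GPT-family
// English text). Rounding up keeps the estimate conservative for
// context-budget decisions.
func estimateTokens(text string, ratio float64) int {
	if text == "" {
		return 0
	}
	words := len(strings.Fields(text))
	return int(math.Ceil(float64(words) * ratio))
}

func main() {
	// 9 words * 1.3 = 11.7, rounded up to 12
	fmt.Println(estimateTokens("The quick brown fox jumps over the lazy dog", 1.3))
}
```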

func NewHeuristicTokenCounter

func NewHeuristicTokenCounter(family ModelFamily) *HeuristicTokenCounter

NewHeuristicTokenCounter creates a token counter for the specified model family.

func NewHeuristicTokenCounterWithRatio

func NewHeuristicTokenCounterWithRatio(ratio float64) *HeuristicTokenCounter

NewHeuristicTokenCounterWithRatio creates a token counter with a custom ratio. Use this when you have measured the actual token ratio for your specific use case.

func (*HeuristicTokenCounter) CountMessageTokens added in v1.3.10

func (h *HeuristicTokenCounter) CountMessageTokens(messages []types.Message) int

CountMessageTokens estimates the total token count for a slice of messages. It handles multimodal content by estimating image tokens based on detail level and counting text tokens with content-aware heuristics.

func (*HeuristicTokenCounter) CountMultiple

func (h *HeuristicTokenCounter) CountMultiple(texts []string) int

CountMultiple returns the total token count for multiple text segments.

func (*HeuristicTokenCounter) CountTokens

func (h *HeuristicTokenCounter) CountTokens(text string) int

CountTokens estimates token count for the given text. Returns 0 for empty text.

func (*HeuristicTokenCounter) CountTokensContentAware added in v1.3.10

func (h *HeuristicTokenCounter) CountTokensContentAware(text string) int

CountTokensContentAware estimates token count with content-type awareness. It adjusts the base ratio based on whether the text appears to be code, CJK text, or regular prose.

func (*HeuristicTokenCounter) Ratio

func (h *HeuristicTokenCounter) Ratio() float64

Ratio returns the current token ratio. Thread-safe.

func (*HeuristicTokenCounter) SetRatio

func (h *HeuristicTokenCounter) SetRatio(ratio float64)

SetRatio updates the token ratio. Thread-safe.
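One way to obtain a measured ratio to feed SetRatio (or NewHeuristicTokenCounterWithRatio) is to divide an exact token count, such as one reported in a provider's usage data, by the word count of the same text. The helper below is hypothetical, not part of the package:

```go
package main

import (
	"fmt"
	"strings"
)

// calibrateRatio derives a tokens-per-word ratio from a sample text and
// the exact token count a provider reported for it.
func calibrateRatio(sample string, actualTokens int) float64 {
	words := len(strings.Fields(sample))
	if words == 0 {
		return 0
	}
	return float64(actualTokens) / float64(words)
}

func main() {
	// A provider reported 6 tokens for this 4-word sample: ratio 1.5.
	fmt.Println(calibrateRatio("one two three four", 6))
}
```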

type ModelFamily

type ModelFamily string

ModelFamily represents a family of LLM models with similar tokenization.

const (
	// ModelFamilyGPT covers OpenAI GPT models (GPT-3.5, GPT-4, etc.)
	// Uses cl100k_base tokenizer - approximately 1.3 tokens per word for English.
	ModelFamilyGPT ModelFamily = "gpt"

	// ModelFamilyClaude covers Anthropic Claude models.
	// Similar to GPT tokenization - approximately 1.3 tokens per word.
	ModelFamilyClaude ModelFamily = "claude"

	// ModelFamilyGemini covers Google Gemini models.
	// Uses SentencePiece tokenizer - approximately 1.4 tokens per word.
	ModelFamilyGemini ModelFamily = "gemini"

	// ModelFamilyLlama covers Meta Llama models.
	// Uses SentencePiece tokenizer - approximately 1.4 tokens per word.
	ModelFamilyLlama ModelFamily = "llama"

	// ModelFamilyDefault is used when the model family is unknown.
	// Uses a conservative estimate of 1.35 tokens per word.
	ModelFamilyDefault ModelFamily = "default"
)

func GetModelFamily

func GetModelFamily(modelName string) ModelFamily

GetModelFamily returns the appropriate ModelFamily for a model name. This performs prefix matching to categorize models.
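Prefix matching of this kind can be sketched as below; the exact prefixes the package recognizes are not listed on this page, so the set here is an assumption drawn from the documented families:

```go
package main

import (
	"fmt"
	"strings"
)

// familyFor illustrates prefix-based categorization: lowercase the model
// name, test known family prefixes, and fall back to the default family.
func familyFor(modelName string) string {
	name := strings.ToLower(modelName)
	for _, prefix := range []string{"gpt", "claude", "gemini", "llama"} {
		if strings.HasPrefix(name, prefix) {
			return prefix
		}
	}
	return "default"
}

func main() {
	fmt.Println(familyFor("GPT-4o"))
	fmt.Println(familyFor("mistral-7b"))
}
```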

type TokenCounter

type TokenCounter interface {
	// CountTokens returns the estimated or actual token count for the given text.
	CountTokens(text string) int

	// CountMultiple returns the total token count for multiple text segments.
	CountMultiple(texts []string) int
}

TokenCounter provides token counting functionality. Implementations may use heuristics or actual tokenization.
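Because the interface is small, plugging in an alternative strategy is straightforward. Below is a sketch of a character-based implementation; the interface is restated locally so the example compiles standalone, and the ~4-characters-per-token rule of thumb is an assumption, not part of the package:

```go
package main

import "fmt"

// TokenCounter restates the documented interface so this sketch is
// self-contained.
type TokenCounter interface {
	CountTokens(text string) int
	CountMultiple(texts []string) int
}

// CharCounter estimates tokens from byte length (~4 characters per
// token, rounded up), a common rule of thumb for English text.
type CharCounter struct{}

func (CharCounter) CountTokens(text string) int { return (len(text) + 3) / 4 }

func (c CharCounter) CountMultiple(texts []string) int {
	total := 0
	for _, t := range texts {
		total += c.CountTokens(t)
	}
	return total
}

var _ TokenCounter = CharCounter{} // compile-time interface check

func main() {
	// "abcdefgh" -> 2 tokens, "hello" -> 2 tokens
	fmt.Println(CharCounter{}.CountMultiple([]string{"abcdefgh", "hello"}))
}
```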

func NewTokenCounterForModel

func NewTokenCounterForModel(modelName string) TokenCounter

NewTokenCounterForModel creates a token counter appropriate for the given model.
