tokenizer

package

v1.1.9 Latest Latest Go to latest Published: Jan 24, 2026 License: Apache-2.0 Imports: 2 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/AltairaLabs/PromptKit

Links

Open Source Insights

Documentation ¶

Overview ¶

Package tokenizer provides token counting functionality for LLM context management.

Token counting is essential for managing context windows and ensuring prompts fit within model limits. This package provides:

TokenCounter interface for pluggable implementations
HeuristicTokenCounter with model-aware word-to-token ratios
Support for different model families (GPT, Claude, Gemini, etc.)

The heuristic approach is suitable for context truncation decisions where approximate counts are sufficient. For exact token counts (billing, etc.), use provider-specific CostInfo from API responses.

Index ¶

Variables
func CountTokens(text string) int
type HeuristicTokenCounter
- func NewHeuristicTokenCounter(family ModelFamily) *HeuristicTokenCounter
- func NewHeuristicTokenCounterWithRatio(ratio float64) *HeuristicTokenCounter
type ModelFamily
- func GetModelFamily(modelName string) ModelFamily
type TokenCounter
- func NewTokenCounterForModel(modelName string) TokenCounter

Constants ¶

This section is empty.

Variables ¶

View Source

var DefaultTokenCounter = NewHeuristicTokenCounter(ModelFamilyDefault)

DefaultTokenCounter is a package-level counter using the default model family. Use this when you don't need model-specific tokenization.

Functions ¶

func CountTokens ¶

func CountTokens(text string) int

CountTokens is a convenience function using the default token counter.

Types ¶

type HeuristicTokenCounter ¶

type HeuristicTokenCounter struct {
	// contains filtered or unexported fields
}

HeuristicTokenCounter estimates token counts using word-based heuristics. This is fast and suitable for context management decisions where exact counts are not required. For accurate counts, use a tokenizer library like tiktoken-go.

func NewHeuristicTokenCounter ¶

func NewHeuristicTokenCounter(family ModelFamily) *HeuristicTokenCounter

NewHeuristicTokenCounter creates a token counter for the specified model family.

func NewHeuristicTokenCounterWithRatio ¶

func NewHeuristicTokenCounterWithRatio(ratio float64) *HeuristicTokenCounter

NewHeuristicTokenCounterWithRatio creates a token counter with a custom ratio. Use this when you have measured the actual token ratio for your specific use case.

func (*HeuristicTokenCounter) CountMultiple ¶

func (h *HeuristicTokenCounter) CountMultiple(texts []string) int

CountMultiple returns the total token count for multiple text segments.

func (*HeuristicTokenCounter) CountTokens ¶

func (h *HeuristicTokenCounter) CountTokens(text string) int

CountTokens estimates token count for the given text. Returns 0 for empty text.

func (*HeuristicTokenCounter) Ratio ¶

func (h *HeuristicTokenCounter) Ratio() float64

Ratio returns the current token ratio. Thread-safe.

func (*HeuristicTokenCounter) SetRatio ¶

func (h *HeuristicTokenCounter) SetRatio(ratio float64)

SetRatio updates the token ratio. Thread-safe.

type ModelFamily ¶

type ModelFamily string

ModelFamily represents a family of LLM models with similar tokenization.

const (
	// ModelFamilyGPT covers OpenAI GPT models (GPT-3.5, GPT-4, etc.)
	// Uses cl100k_base tokenizer - approximately 1.3 tokens per word for English.
	ModelFamilyGPT ModelFamily = "gpt"

	// ModelFamilyClaude covers Anthropic Claude models.
	// Similar to GPT tokenization - approximately 1.3 tokens per word.
	ModelFamilyClaude ModelFamily = "claude"

	// ModelFamilyGemini covers Google Gemini models.
	// Uses SentencePiece tokenizer - approximately 1.4 tokens per word.
	ModelFamilyGemini ModelFamily = "gemini"

	// ModelFamilyLlama covers Meta Llama models.
	// Uses SentencePiece tokenizer - approximately 1.4 tokens per word.
	ModelFamilyLlama ModelFamily = "llama"

	// ModelFamilyDefault is used when the model family is unknown.
	// Uses a conservative estimate of 1.35 tokens per word.
	ModelFamilyDefault ModelFamily = "default"
)

func GetModelFamily ¶

func GetModelFamily(modelName string) ModelFamily

GetModelFamily returns the appropriate ModelFamily for a model name. This performs prefix matching to categorize models.

type TokenCounter ¶

type TokenCounter interface {
	// CountTokens returns the estimated or actual token count for the given text.
	CountTokens(text string) int

	// CountMultiple returns the total token count for multiple text segments.
	CountMultiple(texts []string) int
}

TokenCounter provides token counting functionality. Implementations may use heuristics or actual tokenization.

func NewTokenCounterForModel ¶

func NewTokenCounterForModel(modelName string) TokenCounter

NewTokenCounterForModel creates a token counter appropriate for the given model.

Source Files ¶

View all Source files

tokenizer.go

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL