tokenizer

package
v1.5.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 2, 2026 License: MIT Imports: 4 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

View Source
var ModelToEncoding = map[string]Encoding{

	"gpt-4o":                 O200kBase,
	"gpt-4o-mini":            O200kBase,
	"gpt-4o-2024-05-13":      O200kBase,
	"gpt-4o-mini-2024-07-18": O200kBase,

	"gpt-4":               Cl100kBase,
	"gpt-4-turbo":         Cl100kBase,
	"gpt-4-turbo-preview": Cl100kBase,
	"gpt-4-0125-preview":  Cl100kBase,
	"gpt-4-1106-preview":  Cl100kBase,
	"gpt-4-0613":          Cl100kBase,
	"gpt-4-0314":          Cl100kBase,

	"gpt-3.5-turbo":      Cl100kBase,
	"gpt-3.5-turbo-0125": Cl100kBase,
	"gpt-3.5-turbo-1106": Cl100kBase,
	"gpt-3.5-turbo-0613": Cl100kBase,
	"gpt-3.5-turbo-0301": Cl100kBase,

	"text-embedding-ada-002": Cl100kBase,
	"text-embedding-3-small": Cl100kBase,
	"text-embedding-3-large": Cl100kBase,

	"davinci": P50kBase,
	"curie":   P50kBase,
	"babbage": P50kBase,
	"ada":     P50kBase,

	"claude-3-opus":     Cl100kBase,
	"claude-3-sonnet":   Cl100kBase,
	"claude-3-haiku":    Cl100kBase,
	"claude-3.5-sonnet": Cl100kBase,
	"claude-3.5-haiku":  Cl100kBase,
}

ModelToEncoding maps model names to their encodings.

Functions

func CompareCounts

func CompareCounts(text string) (heuristic, actual int, diff float64)

CompareCounts compares heuristic vs actual token count.

func EstimateTokens

func EstimateTokens(text string) int

EstimateTokens provides a quick heuristic token count. Delegates to core.EstimateTokens for single source of truth (T22).

Types

type CountStats

type CountStats struct {
	TotalTokens int
	TotalChars  int
	TotalLines  int
	FilesCount  int
	Encoding    Encoding
}

CountStats holds statistics about token counting.

func (*CountStats) Summary

func (s *CountStats) Summary() string

Summary returns a formatted summary of the stats.

type Encoding

type Encoding string

Encoding represents a tokenizer encoding type.

const (
	// Cl100kBase is the encoding for GPT-4, GPT-3.5-turbo, text-embedding-ada-002.
	Cl100kBase Encoding = "cl100k_base"
	// O200kBase is the encoding for GPT-4o, GPT-4o-mini.
	O200kBase Encoding = "o200k_base"
	// P50kBase is the encoding for GPT-3 (davinci, curie, babbage, ada).
	P50kBase Encoding = "p50k_base"
	// R50kBase is the encoding for GPT-3 (davinci, curie, babbage, ada) without regex splitting.
	R50kBase Encoding = "r50k_base"
)

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer wraps the tiktoken tokenizer.

func New

func New(enc Encoding) (*Tokenizer, error)

New creates a new Tokenizer with the specified encoding.

func NewForModel

func NewForModel(model string) (*Tokenizer, error)

NewForModel creates a Tokenizer for a specific model.

func (*Tokenizer) Count

func (t *Tokenizer) Count(text string) int

Count returns the number of tokens in the given text.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL