Documentation ¶
Index ¶
- Variables
- func CompareCounts(text string) (heuristic, actual int, diff float64)
- func EstimateTokens(text string) int
- func FormatCount(count int, model string) string
- type CountStats
- type Encoding
- type Message
- type Tokenizer
- func (t *Tokenizer) Count(text string) int
- func (t *Tokenizer) CountFile(path string) (int, error)
- func (t *Tokenizer) CountMessages(messages []Message) int
- func (t *Tokenizer) CountReader(r io.Reader) (int, error)
- func (t *Tokenizer) CountWithDetails(text string) (int, []string)
- func (t *Tokenizer) EncodingName() Encoding
Constants ¶
This section is empty.
Variables ¶
var ModelToEncoding = map[string]Encoding{
	"gpt-4o":                 O200kBase,
	"gpt-4o-mini":            O200kBase,
	"gpt-4o-2024-05-13":      O200kBase,
	"gpt-4o-mini-2024-07-18": O200kBase,
	"gpt-4":                  Cl100kBase,
	"gpt-4-turbo":            Cl100kBase,
	"gpt-4-turbo-preview":    Cl100kBase,
	"gpt-4-0125-preview":     Cl100kBase,
	"gpt-4-1106-preview":     Cl100kBase,
	"gpt-4-0613":             Cl100kBase,
	"gpt-4-0314":             Cl100kBase,
	"gpt-3.5-turbo":          Cl100kBase,
	"gpt-3.5-turbo-0125":     Cl100kBase,
	"gpt-3.5-turbo-1106":     Cl100kBase,
	"gpt-3.5-turbo-0613":     Cl100kBase,
	"gpt-3.5-turbo-0301":     Cl100kBase,
	"text-embedding-ada-002": Cl100kBase,
	"text-embedding-3-small": Cl100kBase,
	"text-embedding-3-large": Cl100kBase,
	"davinci":                P50kBase,
	"curie":                  P50kBase,
	"babbage":                P50kBase,
	"ada":                    P50kBase,
	"claude-3-opus":          Cl100kBase,
	"claude-3-sonnet":        Cl100kBase,
	"claude-3-haiku":         Cl100kBase,
	"claude-3.5-sonnet":      Cl100kBase,
	"claude-3.5-haiku":       Cl100kBase,
}
ModelToEncoding maps model names to their encodings.
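As a sketch of how this map might be consulted, the snippet below reproduces the Encoding type and a few entries from the table above; the fallback to Cl100kBase for unknown models is an assumption for illustration, not documented package behavior:

```go
package main

import "fmt"

// Encoding mirrors the package's Encoding string type.
type Encoding string

const (
	Cl100kBase Encoding = "cl100k_base"
	O200kBase  Encoding = "o200k_base"
)

// A few entries reproduced from ModelToEncoding above.
var modelToEncoding = map[string]Encoding{
	"gpt-4o":        O200kBase,
	"gpt-4":         Cl100kBase,
	"gpt-3.5-turbo": Cl100kBase,
}

// encodingFor looks up a model's encoding, defaulting to Cl100kBase
// for unknown models (the fallback choice is an assumption).
func encodingFor(model string) Encoding {
	if enc, ok := modelToEncoding[model]; ok {
		return enc
	}
	return Cl100kBase
}

func main() {
	fmt.Println(encodingFor("gpt-4o"))      // o200k_base
	fmt.Println(encodingFor("unknown-llm")) // cl100k_base
}
```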
Functions ¶
func CompareCounts ¶
func CompareCounts(text string) (heuristic, actual int, diff float64)
CompareCounts compares the heuristic token estimate against the actual tiktoken count and reports the difference.
func EstimateTokens ¶
func EstimateTokens(text string) int
EstimateTokens provides a quick heuristic token count using the formula ceil(len(text) / 4.0). It is a fast fallback for when an exact tiktoken count is not needed.
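The documented formula can be written out directly; the snippet below is a standalone re-implementation of the ceil(len/4) heuristic, not the package's source:

```go
package main

import (
	"fmt"
	"math"
)

// estimateTokens mirrors the documented formula:
// ceil(len(text) / 4.0), i.e. roughly four characters per token.
func estimateTokens(text string) int {
	return int(math.Ceil(float64(len(text)) / 4.0))
}

func main() {
	fmt.Println(estimateTokens("abcdefgh")) // 8 chars -> 2
	fmt.Println(estimateTokens("hello"))    // 5 chars -> 2
}
```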
func FormatCount ¶ added in v1.2.0
func FormatCount(count int, model string) string
FormatCount formats a token count with model context limits.
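One plausible shape for such formatting is sketched below. The context-limit table and the output string are assumptions for illustration (GPT-4o's 128,000-token window is a widely published figure); the package's actual output format may differ:

```go
package main

import "fmt"

// contextLimits is an illustrative table of model context windows.
var contextLimits = map[string]int{
	"gpt-4o":        128000,
	"gpt-4":         8192,
	"gpt-3.5-turbo": 16385,
}

// formatCount is a hypothetical stand-in for FormatCount: it reports
// the count against the model's context limit when the model is known.
func formatCount(count int, model string) string {
	if limit, ok := contextLimits[model]; ok {
		pct := float64(count) / float64(limit) * 100
		return fmt.Sprintf("%d tokens (%.1f%% of %s's %d-token context)",
			count, pct, model, limit)
	}
	return fmt.Sprintf("%d tokens", count)
}

func main() {
	fmt.Println(formatCount(6400, "gpt-4o"))
	fmt.Println(formatCount(6400, "unknown-llm"))
}
```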
Types ¶
type CountStats ¶
type CountStats struct {
TotalTokens int
TotalChars int
TotalLines int
FilesCount int
Encoding Encoding
}
CountStats holds statistics about token counting.
func (*CountStats) Summary ¶
func (s *CountStats) Summary() string
Summary returns a formatted summary of the stats.
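Since the documentation does not show Summary's exact output, the rendering below is a hypothetical one built from the exported fields of CountStats reproduced from the definition above:

```go
package main

import "fmt"

type Encoding string

// CountStats reproduced from the type definition above.
type CountStats struct {
	TotalTokens int
	TotalChars  int
	TotalLines  int
	FilesCount  int
	Encoding    Encoding
}

// summary is a hypothetical rendering of the stats; the real Summary
// method's exact format is not shown in this documentation.
func summary(s *CountStats) string {
	return fmt.Sprintf("%d tokens, %d chars, %d lines across %d file(s) [%s]",
		s.TotalTokens, s.TotalChars, s.TotalLines, s.FilesCount, s.Encoding)
}

func main() {
	s := &CountStats{
		TotalTokens: 1200,
		TotalChars:  4800,
		TotalLines:  90,
		FilesCount:  3,
		Encoding:    "cl100k_base",
	}
	fmt.Println(summary(s))
}
```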
type Encoding ¶
type Encoding string
Encoding represents a tokenizer encoding type.
const (
	// Cl100kBase is the encoding for GPT-4, GPT-3.5-turbo, text-embedding-ada-002.
	Cl100kBase Encoding = "cl100k_base"
	// O200kBase is the encoding for GPT-4o, GPT-4o-mini.
	O200kBase Encoding = "o200k_base"
	// P50kBase is the encoding for GPT-3 (davinci, curie, babbage, ada).
	P50kBase Encoding = "p50k_base"
	// R50kBase is the encoding for GPT-3 (davinci, curie, babbage, ada) without regex splitting.
	R50kBase Encoding = "r50k_base"
)
type Message ¶ added in v1.2.0
type Message struct {
Role string `json:"role"`
Content string `json:"content"`
Name string `json:"name,omitempty"`
}
Message represents a chat message.
type Tokenizer ¶
type Tokenizer struct {
// contains filtered or unexported fields
}
Tokenizer wraps the tiktoken tokenizer.
func NewForModel ¶
NewForModel creates a Tokenizer for a specific model.
func (*Tokenizer) CountMessages ¶ added in v1.2.0
func (t *Tokenizer) CountMessages(messages []Message) int
CountMessages counts tokens in the OpenAI chat message format.
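Chat-format counting typically adds per-message framing overhead on top of the content tokens. The sketch below uses the package's documented ceil(len/4) heuristic as a stand-in for real tiktoken encoding; the 4-token-per-message and 3-token reply-primer constants follow the commonly cited cl100k_base chat convention and are assumptions, not values confirmed by this package:

```go
package main

import (
	"fmt"
	"math"
)

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
	Name    string `json:"name,omitempty"`
}

// estimate mirrors the package's documented ceil(len/4) heuristic.
func estimate(text string) int {
	return int(math.Ceil(float64(len(text)) / 4.0))
}

// countMessages sketches chat-format counting: each message carries a
// small fixed framing overhead (4 tokens here, an assumption) plus a
// 3-token primer for the assistant's reply.
func countMessages(messages []Message) int {
	const perMessage, primer = 4, 3
	total := primer
	for _, m := range messages {
		total += perMessage + estimate(m.Role) + estimate(m.Content)
		if m.Name != "" {
			total += estimate(m.Name)
		}
	}
	return total
}

func main() {
	msgs := []Message{
		{Role: "system", Content: "Be brief."},
		{Role: "user", Content: "Hi!"},
	}
	fmt.Println(countMessages(msgs)) // 3 + (4+2+3) + (4+1+1) = 18
}
```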
func (*Tokenizer) CountReader ¶ added in v1.2.0
func (t *Tokenizer) CountReader(r io.Reader) (int, error)
CountReader counts tokens from an io.Reader.
func (*Tokenizer) CountWithDetails ¶ added in v1.2.0
func (t *Tokenizer) CountWithDetails(text string) (int, []string)
CountWithDetails returns the token count together with the individual token strings.
func (*Tokenizer) EncodingName ¶ added in v1.2.0
func (t *Tokenizer) EncodingName() Encoding
EncodingName returns the encoding name.