Documentation ¶
Index ¶
- Variables
- func CompareCounts(text string) (heuristic, actual int, diff float64)
- func EstimateTokens(text string) int
- func FormatCount(count int, model string) string
- type CountStats
- type Encoding
- type Message
- type Tokenizer
- func (t *Tokenizer) Count(text string) int
- func (t *Tokenizer) CountFile(path string) (int, error)
- func (t *Tokenizer) CountMessages(messages []Message) int
- func (t *Tokenizer) CountReader(r io.Reader) (int, error)
- func (t *Tokenizer) CountWithDetails(text string) (int, []string)
- func (t *Tokenizer) EncodingName() Encoding
Constants ¶
This section is empty.
Variables ¶
var ModelToEncoding = map[string]Encoding{
	"gpt-4o":                 O200kBase,
	"gpt-4o-mini":            O200kBase,
	"gpt-4o-2024-05-13":      O200kBase,
	"gpt-4o-mini-2024-07-18": O200kBase,
	"gpt-4":                  Cl100kBase,
	"gpt-4-turbo":            Cl100kBase,
	"gpt-4-turbo-preview":    Cl100kBase,
	"gpt-4-0125-preview":     Cl100kBase,
	"gpt-4-1106-preview":     Cl100kBase,
	"gpt-4-0613":             Cl100kBase,
	"gpt-4-0314":             Cl100kBase,
	"gpt-3.5-turbo":          Cl100kBase,
	"gpt-3.5-turbo-0125":     Cl100kBase,
	"gpt-3.5-turbo-1106":     Cl100kBase,
	"gpt-3.5-turbo-0613":     Cl100kBase,
	"gpt-3.5-turbo-0301":     Cl100kBase,
	"text-embedding-ada-002": Cl100kBase,
	"text-embedding-3-small": Cl100kBase,
	"text-embedding-3-large": Cl100kBase,
	"davinci":                P50kBase,
	"curie":                  P50kBase,
	"babbage":                P50kBase,
	"ada":                    P50kBase,
	"claude-3-opus":          Cl100kBase,
	"claude-3-sonnet":        Cl100kBase,
	"claude-3-haiku":         Cl100kBase,
	"claude-3.5-sonnet":      Cl100kBase,
	"claude-3.5-haiku":       Cl100kBase,
}
ModelToEncoding maps model names to their encodings.
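As a sketch of how this map might be consulted, the snippet below reproduces the Encoding type and a few entries from the table above; the fallback to Cl100kBase for unknown models is an assumption for illustration, not documented package behavior:

```go
package main

import "fmt"

// Encoding mirrors the package's Encoding string type.
type Encoding string

const (
	Cl100kBase Encoding = "cl100k_base"
	O200kBase  Encoding = "o200k_base"
)

// A few entries reproduced from ModelToEncoding above.
var modelToEncoding = map[string]Encoding{
	"gpt-4o":        O200kBase,
	"gpt-4":         Cl100kBase,
	"gpt-3.5-turbo": Cl100kBase,
}

// encodingFor looks up a model's encoding, defaulting to Cl100kBase
// for unknown models (the fallback choice is an assumption).
func encodingFor(model string) Encoding {
	if enc, ok := modelToEncoding[model]; ok {
		return enc
	}
	return Cl100kBase
}

func main() {
	fmt.Println(encodingFor("gpt-4o"))      // o200k_base
	fmt.Println(encodingFor("unknown-llm")) // cl100k_base
}
```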
Functions ¶
func CompareCounts ¶
func CompareCounts(text string) (heuristic, actual int, diff float64)
CompareCounts compares the heuristic token estimate against the actual tiktoken count and reports the difference.
func EstimateTokens ¶
func EstimateTokens(text string) int
EstimateTokens provides a quick heuristic token count using the formula ceil(len(text) / 4.0). It is a fast fallback for when an exact tiktoken count is not needed.
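The documented formula can be written out directly; the snippet below is a standalone re-implementation of the ceil(len/4) heuristic, not the package's source:

```go
package main

import (
	"fmt"
	"math"
)

// estimateTokens mirrors the documented formula:
// ceil(len(text) / 4.0), i.e. roughly four characters per token.
func estimateTokens(text string) int {
	return int(math.Ceil(float64(len(text)) / 4.0))
}

func main() {
	fmt.Println(estimateTokens("abcdefgh")) // 8 chars -> 2
	fmt.Println(estimateTokens("hello"))    // 5 chars -> 2
}
```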
func FormatCount ¶ added in v1.2.0
func FormatCount(count int, model string) string
FormatCount formats a token count with model context limits.
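One plausible shape for such formatting is sketched below. The context-limit table and the output string are assumptions for illustration (GPT-4o's 128,000-token window is a widely published figure); the package's actual output format may differ:

```go
package main

import "fmt"

// contextLimits is an illustrative table of model context windows.
var contextLimits = map[string]int{
	"gpt-4o":        128000,
	"gpt-4":         8192,
	"gpt-3.5-turbo": 16385,
}

// formatCount is a hypothetical stand-in for FormatCount: it reports
// the count against the model's context limit when the model is known.
func formatCount(count int, model string) string {
	if limit, ok := contextLimits[model]; ok {
		pct := float64(count) / float64(limit) * 100
		return fmt.Sprintf("%d tokens (%.1f%% of %s's %d-token context)",
			count, pct, model, limit)
	}
	return fmt.Sprintf("%d tokens", count)
}

func main() {
	fmt.Println(formatCount(6400, "gpt-4o"))
	fmt.Println(formatCount(6400, "unknown-llm"))
}
```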
Types ¶
type CountStats ¶
type CountStats struct {
TotalTokens int
TotalChars int
TotalLines int
FilesCount int
Encoding Encoding
}
CountStats holds statistics about token counting.
func (*CountStats) Summary ¶
func (s *CountStats) Summary() string
Summary returns a formatted summary of the stats.
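Since the documentation does not show Summary's exact output, the rendering below is a hypothetical one built from the exported fields of CountStats reproduced from the definition above:

```go
package main

import "fmt"

type Encoding string

// CountStats reproduced from the type definition above.
type CountStats struct {
	TotalTokens int
	TotalChars  int
	TotalLines  int
	FilesCount  int
	Encoding    Encoding
}

// summary is a hypothetical rendering of the stats; the real Summary
// method's exact format is not shown in this documentation.
func summary(s *CountStats) string {
	return fmt.Sprintf("%d tokens, %d chars, %d lines across %d file(s) [%s]",
		s.TotalTokens, s.TotalChars, s.TotalLines, s.FilesCount, s.Encoding)
}

func main() {
	s := &CountStats{
		TotalTokens: 1200,
		TotalChars:  4800,
		TotalLines:  90,
		FilesCount:  3,
		Encoding:    "cl100k_base",
	}
	fmt.Println(summary(s))
}
```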
type Encoding ¶
type Encoding string
Encoding represents a tokenizer encoding type.
const (
	// Cl100kBase is the encoding for GPT-4, GPT-3.5-turbo, text-embedding-ada-002.
	Cl100kBase Encoding = "cl100k_base"
	// O200kBase is the encoding for GPT-4o, GPT-4o-mini.
	O200kBase Encoding = "o200k_base"
	// P50kBase is the encoding for GPT-3 (davinci, curie, babbage, ada).
	P50kBase Encoding = "p50k_base"
	// R50kBase is the encoding for GPT-3 (davinci, curie, babbage, ada) without regex splitting.
	R50kBase Encoding = "r50k_base"
)
type Message ¶ added in v1.2.0
type Message struct {
Role string `json:"role"`
Content string `json:"content"`
Name string `json:"name,omitempty"`
}
Message represents a chat message.
type Tokenizer ¶
type Tokenizer struct {
// contains filtered or unexported fields
}
Tokenizer wraps the tiktoken tokenizer.
func NewForModel ¶
NewForModel creates a Tokenizer for a specific model.
func (*Tokenizer) CountMessages ¶ added in v1.2.0
func (t *Tokenizer) CountMessages(messages []Message) int
CountMessages counts tokens in the OpenAI chat message format.
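Chat-format counting typically adds per-message framing overhead on top of the content tokens. The sketch below uses the package's documented ceil(len/4) heuristic as a stand-in for real tiktoken encoding; the 4-token-per-message and 3-token reply-primer constants follow the commonly cited cl100k_base chat convention and are assumptions, not values confirmed by this package:

```go
package main

import (
	"fmt"
	"math"
)

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
	Name    string `json:"name,omitempty"`
}

// estimate mirrors the package's documented ceil(len/4) heuristic.
func estimate(text string) int {
	return int(math.Ceil(float64(len(text)) / 4.0))
}

// countMessages sketches chat-format counting: each message carries a
// small fixed framing overhead (4 tokens here, an assumption) plus a
// 3-token primer for the assistant's reply.
func countMessages(messages []Message) int {
	const perMessage, primer = 4, 3
	total := primer
	for _, m := range messages {
		total += perMessage + estimate(m.Role) + estimate(m.Content)
		if m.Name != "" {
			total += estimate(m.Name)
		}
	}
	return total
}

func main() {
	msgs := []Message{
		{Role: "system", Content: "Be brief."},
		{Role: "user", Content: "Hi!"},
	}
	fmt.Println(countMessages(msgs)) // 3 + (4+2+3) + (4+1+1) = 18
}
```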
func (*Tokenizer) CountReader ¶ added in v1.2.0
func (t *Tokenizer) CountReader(r io.Reader) (int, error)
CountReader counts tokens from an io.Reader.
func (*Tokenizer) CountWithDetails ¶ added in v1.2.0
func (t *Tokenizer) CountWithDetails(text string) (int, []string)
CountWithDetails returns the token count together with the individual token strings.
func (*Tokenizer) EncodingName ¶ added in v1.2.0
func (t *Tokenizer) EncodingName() Encoding
EncodingName returns the encoding name.