Documentation ¶
Overview ¶
Package tokenizer provides text tokenization for LLM inference in Born ML.
It wraps the internal tokenizer implementations and exposes a clean public API for tokenization tasks.
Supported tokenizers:
- TikToken: OpenAI BPE tokenizers (GPT-3, GPT-4)
- BPE: Byte-Pair Encoding from HuggingFace
- Chat Templates: Format conversational messages
Example usage:
    import "github.com/born-ml/born/tokenizer"

    // Load tiktoken
    tok, err := tokenizer.NewTikToken("cl100k_base")
    if err != nil {
        log.Fatal(err)
    }

    // Encode text
    tokens, err := tok.Encode("Hello, world!")
    if err != nil {
        log.Fatal(err)
    }

    // Decode tokens
    text, err := tok.Decode(tokens)
    if err != nil {
        log.Fatal(err)
    }

    // Apply chat template
    messages := []tokenizer.ChatMessage{
        {Role: "system", Content: "You are helpful."},
        {Role: "user", Content: "Hi!"},
    }
    template := tokenizer.NewChatMLTemplate()
    prompt := template.Apply(messages)
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type ChatMessage ¶
type ChatMessage = tokenizer.ChatMessage
ChatMessage represents a single message in a conversation.
type ChatTemplate ¶
type ChatTemplate = tokenizer.ChatTemplate
ChatTemplate formats messages for conversational models.
func GetChatTemplate ¶
func GetChatTemplate(name string) (ChatTemplate, error)
GetChatTemplate returns a chat template by name.
Supported names: "chatml", "llama", "mistral".
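A name-to-template lookup like the one GetChatTemplate describes can be sketched as a map with an error for unknown names. The `chatTemplate` struct and `getChatTemplate` function below are illustrative stand-ins, not the package's implementation; only the three supported names are taken from the documentation.

```go
package main

import "fmt"

// chatTemplate is a minimal stand-in for the package's ChatTemplate type.
type chatTemplate struct{ name string }

// templates maps the documented names to templates (illustrative registry).
var templates = map[string]chatTemplate{
	"chatml":  {name: "chatml"},
	"llama":   {name: "llama"},
	"mistral": {name: "mistral"},
}

// getChatTemplate sketches the lookup-or-error behavior of GetChatTemplate.
func getChatTemplate(name string) (chatTemplate, error) {
	t, ok := templates[name]
	if !ok {
		return chatTemplate{}, fmt.Errorf("unknown chat template %q", name)
	}
	return t, nil
}

func main() {
	t, err := getChatTemplate("chatml")
	if err != nil {
		panic(err)
	}
	fmt.Println(t.name) // chatml
}
```

Returning an error rather than a zero value lets callers distinguish a typo in the template name from a template that legitimately produces an empty prompt.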
func NewChatMLTemplate ¶
func NewChatMLTemplate() ChatTemplate
NewChatMLTemplate creates a ChatML template (OpenAI, DeepSeek format).
Format: <|im_start|>role\ncontent<|im_end|>.
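The per-message layout above can be made concrete with a small self-contained sketch. `msg` and `renderChatML` below are illustrative stand-ins for tokenizer.ChatMessage and the template's Apply method; they only demonstrate the documented `<|im_start|>role\ncontent<|im_end|>` layout, not the package's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// msg mirrors the Role/Content shape of tokenizer.ChatMessage (illustrative).
type msg struct {
	Role    string
	Content string
}

// renderChatML writes each message in the documented ChatML layout:
// <|im_start|>role\ncontent<|im_end|>, one message per block.
func renderChatML(msgs []msg) string {
	var b strings.Builder
	for _, m := range msgs {
		fmt.Fprintf(&b, "<|im_start|>%s\n%s<|im_end|>\n", m.Role, m.Content)
	}
	return b.String()
}

func main() {
	out := renderChatML([]msg{
		{Role: "system", Content: "You are helpful."},
		{Role: "user", Content: "Hi!"},
	})
	fmt.Print(out)
}
```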
func NewLLaMATemplate ¶
func NewLLaMATemplate() ChatTemplate
NewLLaMATemplate creates a LLaMA chat template.
Format: [INST] user message [/INST] assistant response.
func NewMistralTemplate ¶
func NewMistralTemplate() ChatTemplate
NewMistralTemplate creates a Mistral chat template.
type Tokenizer ¶
Tokenizer is the core interface for text tokenization.
All tokenizer implementations must implement this interface.
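The exact method set is not shown here, but the Encode/Decode calls in the package example imply a shape like the sketch below. The interface definition and the toy one-token-per-rune implementation are assumptions for illustration; the real interface may carry additional methods.

```go
package main

import "fmt"

// Tokenizer sketches the interface implied by the package example's
// Encode/Decode calls; the real method set may differ.
type Tokenizer interface {
	Encode(text string) ([]int, error)
	Decode(tokens []int) (string, error)
}

// runeTokenizer is a toy implementation: one token ID per rune.
type runeTokenizer struct{}

func (runeTokenizer) Encode(text string) ([]int, error) {
	var toks []int
	for _, r := range text {
		toks = append(toks, int(r))
	}
	return toks, nil
}

func (runeTokenizer) Decode(tokens []int) (string, error) {
	rs := make([]rune, len(tokens))
	for i, t := range tokens {
		rs[i] = rune(t)
	}
	return string(rs), nil
}

func main() {
	var tok Tokenizer = runeTokenizer{}
	ids, _ := tok.Encode("Hi!")
	text, _ := tok.Decode(ids)
	fmt.Println(len(ids), text) // 3 Hi!
}
```

Any implementation satisfying the interface can be swapped in wherever a Tokenizer is expected, which is what lets helpers like ExampleBPE return a stand-in for tests.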
func AutoLoad ¶
AutoLoad attempts to automatically load the correct tokenizer.
It tries multiple strategies:
- Load from HuggingFace model directory (tokenizer.json)
- Load tiktoken by model name
- Load tiktoken by encoding name
func ExampleBPE ¶
func ExampleBPE() Tokenizer
ExampleBPE creates a minimal BPE tokenizer for testing and examples.
func LoadFromHuggingFace ¶
LoadFromHuggingFace loads a tokenizer from a HuggingFace model directory.
The directory should contain tokenizer.json.
func NewTikToken ¶
NewTikToken creates a new TikToken tokenizer with the specified encoding.
Supported encodings: "cl100k_base" (GPT-4), "p50k_base" (GPT-3).
func NewTikTokenForModel ¶
NewTikTokenForModel creates a TikToken tokenizer for a specific model.
Example models: "gpt-4", "gpt-3.5-turbo", "text-embedding-ada-002".
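Internally this implies a model-name-to-encoding lookup. The sketch below is an assumption about that mapping, not the package's source; the three entries follow OpenAI's tiktoken conventions, where all of the listed example models use the cl100k_base encoding.

```go
package main

import "fmt"

// modelEncodings sketches the lookup NewTikTokenForModel implies; entries
// follow tiktoken conventions and are assumptions, not the package's table.
var modelEncodings = map[string]string{
	"gpt-4":                  "cl100k_base",
	"gpt-3.5-turbo":          "cl100k_base",
	"text-embedding-ada-002": "cl100k_base",
}

// encodingForModel resolves a model name to its tiktoken encoding name.
func encodingForModel(model string) (string, error) {
	enc, ok := modelEncodings[model]
	if !ok {
		return "", fmt.Errorf("no tiktoken encoding known for model %q", model)
	}
	return enc, nil
}

func main() {
	enc, err := encodingForModel("gpt-4")
	if err != nil {
		panic(err)
	}
	fmt.Println(enc) // cl100k_base
}
```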