Documentation ¶
Overview ¶
Package tokenizer provides text tokenization for LLM inference in Born ML.
It wraps the internal tokenizer implementations and exposes a clean public API for tokenization tasks.
Supported tokenizers:
- TikToken: OpenAI BPE tokenizers (GPT-3, GPT-4)
- BPE: Byte-Pair Encoding from HuggingFace
- Chat Templates: Format conversational messages
Example usage:
    import "github.com/born-ml/born/tokenizer"

    // Load tiktoken
    tok, err := tokenizer.NewTikToken("cl100k_base")
    if err != nil {
        log.Fatal(err)
    }

    // Encode text
    tokens, err := tok.Encode("Hello, world!")
    if err != nil {
        log.Fatal(err)
    }

    // Decode tokens
    text, err := tok.Decode(tokens)
    if err != nil {
        log.Fatal(err)
    }

    // Apply chat template
    messages := []tokenizer.ChatMessage{
        {Role: "system", Content: "You are helpful."},
        {Role: "user", Content: "Hi!"},
    }
    template := tokenizer.NewChatMLTemplate()
    prompt := template.Apply(messages)
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type ChatMessage ¶
type ChatMessage = tokenizer.ChatMessage
ChatMessage represents a single message in a conversation.
type ChatTemplate ¶
type ChatTemplate = tokenizer.ChatTemplate
ChatTemplate formats messages for conversational models.
func GetChatTemplate ¶
func GetChatTemplate(name string) (ChatTemplate, error)
GetChatTemplate returns a chat template by name.
Supported names: "chatml", "llama", "mistral".
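A name-to-template lookup like the one GetChatTemplate describes can be sketched as a map with an error for unknown names. The `chatTemplate` struct and `getChatTemplate` function below are illustrative stand-ins, not the package's implementation; only the three supported names are taken from the documentation.

```go
package main

import "fmt"

// chatTemplate is a minimal stand-in for the package's ChatTemplate type.
type chatTemplate struct{ name string }

// templates maps the documented names to templates (illustrative registry).
var templates = map[string]chatTemplate{
	"chatml":  {name: "chatml"},
	"llama":   {name: "llama"},
	"mistral": {name: "mistral"},
}

// getChatTemplate sketches the lookup-or-error behavior of GetChatTemplate.
func getChatTemplate(name string) (chatTemplate, error) {
	t, ok := templates[name]
	if !ok {
		return chatTemplate{}, fmt.Errorf("unknown chat template %q", name)
	}
	return t, nil
}

func main() {
	t, err := getChatTemplate("chatml")
	if err != nil {
		panic(err)
	}
	fmt.Println(t.name) // chatml
}
```

Returning an error rather than a zero value lets callers distinguish a typo in the template name from a template that legitimately produces an empty prompt.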
func NewChatMLTemplate ¶
func NewChatMLTemplate() ChatTemplate
NewChatMLTemplate creates a ChatML template (OpenAI, DeepSeek format).
Format: <|im_start|>role\ncontent<|im_end|>.
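The per-message layout above can be made concrete with a small self-contained sketch. `msg` and `renderChatML` below are illustrative stand-ins for tokenizer.ChatMessage and the template's Apply method; they only demonstrate the documented `<|im_start|>role\ncontent<|im_end|>` layout, not the package's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// msg mirrors the Role/Content shape of tokenizer.ChatMessage (illustrative).
type msg struct {
	Role    string
	Content string
}

// renderChatML writes each message in the documented ChatML layout:
// <|im_start|>role\ncontent<|im_end|>, one message per block.
func renderChatML(msgs []msg) string {
	var b strings.Builder
	for _, m := range msgs {
		fmt.Fprintf(&b, "<|im_start|>%s\n%s<|im_end|>\n", m.Role, m.Content)
	}
	return b.String()
}

func main() {
	out := renderChatML([]msg{
		{Role: "system", Content: "You are helpful."},
		{Role: "user", Content: "Hi!"},
	})
	fmt.Print(out)
}
```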
func NewLLaMATemplate ¶
func NewLLaMATemplate() ChatTemplate
NewLLaMATemplate creates a LLaMA chat template.
Format: [INST] user message [/INST] assistant response.
func NewMistralTemplate ¶
func NewMistralTemplate() ChatTemplate
NewMistralTemplate creates a Mistral chat template.
type Tokenizer ¶
Tokenizer is the core interface for text tokenization.
All tokenizer implementations must implement this interface.
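The exact method set is not shown here, but the Encode/Decode calls in the package example imply a shape like the sketch below. The interface definition and the toy one-token-per-rune implementation are assumptions for illustration; the real interface may carry additional methods.

```go
package main

import "fmt"

// Tokenizer sketches the interface implied by the package example's
// Encode/Decode calls; the real method set may differ.
type Tokenizer interface {
	Encode(text string) ([]int, error)
	Decode(tokens []int) (string, error)
}

// runeTokenizer is a toy implementation: one token ID per rune.
type runeTokenizer struct{}

func (runeTokenizer) Encode(text string) ([]int, error) {
	var toks []int
	for _, r := range text {
		toks = append(toks, int(r))
	}
	return toks, nil
}

func (runeTokenizer) Decode(tokens []int) (string, error) {
	rs := make([]rune, len(tokens))
	for i, t := range tokens {
		rs[i] = rune(t)
	}
	return string(rs), nil
}

func main() {
	var tok Tokenizer = runeTokenizer{}
	ids, _ := tok.Encode("Hi!")
	text, _ := tok.Decode(ids)
	fmt.Println(len(ids), text) // 3 Hi!
}
```

Any implementation satisfying the interface can be swapped in wherever a Tokenizer is expected, which is what lets helpers like ExampleBPE return a stand-in for tests.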
func AutoLoad ¶
AutoLoad attempts to automatically load the correct tokenizer.
It tries multiple strategies:
- Load from HuggingFace model directory (tokenizer.json)
- Load tiktoken by model name
- Load tiktoken by encoding name
func ExampleBPE ¶
func ExampleBPE() Tokenizer
ExampleBPE creates a minimal BPE tokenizer for testing and examples.
func LoadFromHuggingFace ¶
LoadFromHuggingFace loads a tokenizer from a HuggingFace model directory.
The directory should contain tokenizer.json.
func NewTikToken ¶
NewTikToken creates a new TikToken tokenizer with the specified encoding.
Supported encodings: "cl100k_base" (GPT-4), "p50k_base" (GPT-3).
func NewTikTokenForModel ¶
NewTikTokenForModel creates a TikToken tokenizer for a specific model.
Example models: "gpt-4", "gpt-3.5-turbo", "text-embedding-ada-002".
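Internally this implies a model-name-to-encoding lookup. The sketch below is an assumption about that mapping, not the package's source; the three entries follow OpenAI's tiktoken conventions, where all of the listed example models use the cl100k_base encoding.

```go
package main

import "fmt"

// modelEncodings sketches the lookup NewTikTokenForModel implies; entries
// follow tiktoken conventions and are assumptions, not the package's table.
var modelEncodings = map[string]string{
	"gpt-4":                  "cl100k_base",
	"gpt-3.5-turbo":          "cl100k_base",
	"text-embedding-ada-002": "cl100k_base",
}

// encodingForModel resolves a model name to its tiktoken encoding name.
func encodingForModel(model string) (string, error) {
	enc, ok := modelEncodings[model]
	if !ok {
		return "", fmt.Errorf("no tiktoken encoding known for model %q", model)
	}
	return enc, nil
}

func main() {
	enc, err := encodingForModel("gpt-4")
	if err != nil {
		panic(err)
	}
	fmt.Println(enc) // cl100k_base
}
```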