tokenizer

package
v0.7.7
Published: Jan 6, 2026 License: Apache-2.0 Imports: 1 Imported by: 0

Documentation

Overview

Package tokenizer provides text tokenization for LLM inference in Born ML.

This package wraps the internal tokenizer implementations and provides a clean public API for tokenization tasks.

Supported tokenizers:

  • TikToken: OpenAI BPE tokenizers (GPT-3, GPT-4)
  • BPE: Byte-Pair Encoding from HuggingFace
  • Chat Templates: Format conversational messages

Example usage:

import (
    "log"

    "github.com/born-ml/born/tokenizer"
)

// Load tiktoken
tok, err := tokenizer.NewTikToken("cl100k_base")
if err != nil {
    log.Fatal(err)
}

// Encode text
tokens, err := tok.Encode("Hello, world!")
if err != nil {
    log.Fatal(err)
}

// Decode tokens
text, err := tok.Decode(tokens)
if err != nil {
    log.Fatal(err)
}

// Apply chat template
messages := []tokenizer.ChatMessage{
    {Role: "system", Content: "You are helpful."},
    {Role: "user", Content: "Hi!"},
}
template := tokenizer.NewChatMLTemplate()
prompt := template.Apply(messages)

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type ChatMessage

type ChatMessage = tokenizer.ChatMessage

ChatMessage represents a single message in a conversation.

type ChatTemplate

type ChatTemplate = tokenizer.ChatTemplate

ChatTemplate formats messages for conversational models.

func GetChatTemplate

func GetChatTemplate(name string) (ChatTemplate, error)

GetChatTemplate returns a chat template by name.

Supported names: "chatml", "llama", "mistral".

func NewChatMLTemplate

func NewChatMLTemplate() ChatTemplate

NewChatMLTemplate creates a ChatML template (OpenAI, DeepSeek format).

Format: <|im_start|>role\ncontent<|im_end|>.

func NewLLaMATemplate

func NewLLaMATemplate() ChatTemplate

NewLLaMATemplate creates a LLaMA chat template.

Format: [INST] user message [/INST] assistant response.

func NewMistralTemplate

func NewMistralTemplate() ChatTemplate

NewMistralTemplate creates a Mistral chat template.

type Tokenizer

type Tokenizer = tokenizer.Tokenizer

Tokenizer is the core interface for text tokenization.

All tokenizer implementations must implement this interface.

func AutoLoad

func AutoLoad(pathOrName string) (Tokenizer, error)

AutoLoad attempts to automatically load the correct tokenizer.

It tries multiple strategies:

  1. Load from HuggingFace model directory (tokenizer.json)
  2. Load tiktoken by model name
  3. Load tiktoken by encoding name

func ExampleBPE

func ExampleBPE() Tokenizer

ExampleBPE creates a minimal BPE tokenizer for testing and examples.

func LoadFromHuggingFace

func LoadFromHuggingFace(modelPath string) (Tokenizer, error)

LoadFromHuggingFace loads a tokenizer from a HuggingFace model directory.

The directory should contain tokenizer.json.

func NewTikToken

func NewTikToken(encodingName string) (Tokenizer, error)

NewTikToken creates a new TikToken tokenizer with the specified encoding.

Supported encodings: "cl100k_base" (GPT-4), "p50k_base" (GPT-3).

func NewTikTokenForModel

func NewTikTokenForModel(modelName string) (Tokenizer, error)

NewTikTokenForModel creates a TikToken tokenizer for a specific model.

Example models: "gpt-4", "gpt-3.5-turbo", "text-embedding-ada-002".
