tokenizer

package
v0.0.7
Published: Jan 2, 2026 License: Apache-2.0 Imports: 6 Imported by: 0

README

LOOM Tokenizer

A pure Go implementation of a BPE (Byte Pair Encoding) tokenizer, compatible with the HuggingFace tokenizer.json format.

Features

  • ✅ Pure Go - no native dependencies
  • ✅ Loads HuggingFace tokenizer.json format
  • ✅ BPE encoding algorithm
  • ✅ Special tokens support
  • ✅ Byte fallback for unknown characters
  • ✅ Compatible with Qwen, Llama, GPT-2, and other BPE-based models

Usage

Load from File
import (
    "fmt"
    "log"

    "github.com/openfluke/loom/tokenizer"
)

// Load tokenizer from file
tk, err := tokenizer.LoadFromFile("path/to/tokenizer.json")
if err != nil {
    log.Fatal(err)
}

// Encode text to token IDs
text := "Hello, world!"
tokens := tk.Encode(text, false)
fmt.Printf("Tokens: %v\n", tokens)

// Decode token IDs back to text
decoded := tk.Decode(tokens, false)
fmt.Printf("Decoded: %s\n", decoded)

Load from Bytes
import (
    "fmt"
    "log"
    "os"

    "github.com/openfluke/loom/tokenizer"
)

// Read tokenizer data into memory
data, err := os.ReadFile("path/to/tokenizer.json")
if err != nil {
    log.Fatal(err)
}

// Load tokenizer from bytes
tk, err := tokenizer.LoadFromBytes(data)
if err != nil {
    log.Fatal(err)
}

// Use tokenizer
tokens := tk.Encode("Hello, world!", false)
decoded := tk.Decode(tokens, false)

// Get vocabulary size
fmt.Printf("Vocab size: %d\n", tk.VocabSize())

Why use LoadFromBytes?

  • Works with embedded data via go:embed (see the sketch after this list)
  • Load from network streams
  • Custom storage backends (databases, cloud storage)
  • Better for testing with mock data
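
A minimal sketch of the embedded-data case using Go's go:embed directive; the file name tokenizer.json is an assumption and must sit next to the source file:

package main

import (
    _ "embed"
    "fmt"
    "log"

    "github.com/openfluke/loom/tokenizer"
)

//go:embed tokenizer.json
var tokenizerJSON []byte

func main() {
    // Load straight from the embedded bytes; no filesystem access at runtime.
    tk, err := tokenizer.LoadFromBytes(tokenizerJSON)
    if err != nil {
        log.Fatal(err)
    }

    tokens := tk.Encode("Hello, world!", false)
    fmt.Printf("Tokens: %v (vocab size %d)\n", tokens, tk.VocabSize())
}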

Supported Models

  • Qwen / Qwen2.5 (BPE)
  • Llama / Llama2 (BPE)
  • GPT-2 / GPT-3 (BPE)
  • Mistral (BPE)
  • Most HuggingFace models using BPE tokenization

Architecture

BPE Algorithm
  1. Pre-tokenization: Split text into words using regex patterns
  2. Character splitting: Break words into individual characters
  3. Merge application: Apply BPE merges in rank order
  4. Vocabulary lookup: Convert final tokens to IDs
  5. Byte fallback: Handle unknown tokens as raw bytes
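
To make the merge step concrete, here is a generic, self-contained sketch of rank-ordered BPE merging (an illustration of the algorithm, not the package's internal code):

// applyMerges greedily applies BPE merges in rank order to a word that has
// already been split into single-character tokens.
func applyMerges(tokens []string, ranks map[[2]string]int) []string {
    for {
        // Find the adjacent pair with the lowest (best) merge rank.
        best, bestRank := -1, int(^uint(0)>>1)
        for i := 0; i+1 < len(tokens); i++ {
            if r, ok := ranks[[2]string{tokens[i], tokens[i+1]}]; ok && r < bestRank {
                best, bestRank = i, r
            }
        }
        if best < 0 {
            return tokens // no applicable merges remain
        }
        // Replace the winning pair with its merged token.
        merged := tokens[best] + tokens[best+1]
        tokens = append(tokens[:best], append([]string{merged}, tokens[best+2:]...)...)
    }
}

Real implementations typically cache pair lookups and only re-scan around the merged position; the sketch favors clarity over speed.
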
File Format

Compatible with HuggingFace's tokenizer.json:

{
  "model": {
    "type": "BPE",
    "vocab": { "token": id, ... },
    "merges": ["first second", ...]
  },
  "added_tokens": [
    { "id": 151643, "content": "<|endoftext|>", "special": true }
  ]
}

Extending

To add support for new tokenizer types:

  1. Implement the tokenization algorithm in a new file (e.g., wordpiece.go, unigram.go)
  2. Add a type field to detect the tokenizer type
  3. Create a factory function in bpe.go to route to the correct implementation

Example:

func LoadFromFile(path string) (*Tokenizer, error) {
    // Parse JSON
    var tokJSON TokenizerJSON
    // ...

    switch tokJSON.Model.Type {
    case "BPE":
        return loadBPE(tokJSON)
    case "WordPiece":
        return loadWordPiece(tokJSON)
    case "Unigram":
        return loadUnigram(tokJSON)
    default:
        return nil, fmt.Errorf("unsupported tokenizer type: %s", tokJSON.Model.Type)
    }
}

Performance

  • Encoding: ~1-2ms for 100 tokens
  • Decoding: <1ms for 100 tokens
  • Memory: ~100 bytes per vocabulary entry

Testing

cd tokenizer
go test -v
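
A minimal benchmark sketch for checking the encoding figures above on your own hardware; the testdata/tokenizer.json fixture path is hypothetical:

package tokenizer_test

import (
    "testing"

    "github.com/openfluke/loom/tokenizer"
)

func BenchmarkEncode(b *testing.B) {
    tk, err := tokenizer.LoadFromFile("testdata/tokenizer.json") // hypothetical fixture
    if err != nil {
        b.Skipf("no tokenizer fixture: %v", err)
    }
    text := "The quick brown fox jumps over the lazy dog."
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = tk.Encode(text, false)
    }
}

Run it with go test -bench=Encode -benchmem.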

Future Improvements

  • WordPiece tokenizer (BERT)
  • Unigram tokenizer (SentencePiece)
  • Caching for frequently used merges
  • Parallel encoding for long texts
  • Character offset tracking
  • Normalization (lowercase, unicode, etc.)

Documentation

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type MergePair

type MergePair struct {
	First  string
	Second string
	Rank   int
}

MergePair represents a BPE merge rule

type PairWithIndex

type PairWithIndex struct {
	First  string
	Second string
	Index  int
}

PairWithIndex tracks position of a pair

type PreTokenizer

type PreTokenizer struct {
	Pattern *regexp.Regexp
}

PreTokenizer handles text splitting before BPE

func (*PreTokenizer) Split

func (pt *PreTokenizer) Split(text string) []string

Split splits text using the pre-tokenizer pattern. It preserves special tokens by finding them first, before regex splitting.

func (*PreTokenizer) SplitWithSpecialTokens added in v0.0.7

func (pt *PreTokenizer) SplitWithSpecialTokens(text string, specialTokens map[string]int) []string

SplitWithSpecialTokens splits text while preserving special tokens
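
A brief usage sketch; the regex pattern below is a placeholder, not the package's actual pre-tokenization pattern:

import (
	"fmt"
	"regexp"

	"github.com/openfluke/loom/tokenizer"
)

pt := &tokenizer.PreTokenizer{Pattern: regexp.MustCompile(`\S+|\s+`)} // placeholder pattern
special := map[string]int{"<|endoftext|>": 151643}

pieces := pt.SplitWithSpecialTokens("Hello world<|endoftext|>", special)
// "<|endoftext|>" comes back as a single piece instead of being split by the regex.
fmt.Println(pieces)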

type Tokenizer

type Tokenizer struct {
	Vocab         map[string]int // token -> id
	ReverseVocab  map[int]string // id -> token
	Merges        []MergePair    // BPE merge rules
	SpecialTokens map[string]int // special tokens
	AddedTokens   map[string]int // added tokens
	PreTokenizer  *PreTokenizer  // pre-tokenization rules
	ByteFallback  bool           // use byte fallback for unknown chars
}

Tokenizer represents a BPE tokenizer

func LoadFromBytes

func LoadFromBytes(data []byte) (*Tokenizer, error)

LoadFromBytes loads a tokenizer from HuggingFace tokenizer.json data

Example

Example of loading a tokenizer from bytes

package main

import (
	"github.com/openfluke/loom/tokenizer"
)

func main() {
	// In production, you might get this data from:
	// - Embedded files (go:embed)
	// - Network request
	// - Database
	// - Custom storage backend

	data := []byte(`{
		"model": {
			"type": "BPE",
			"vocab": {
				"hello": 0,
				"world": 1,
				" ": 2
			},
			"merges": []
		},
		"added_tokens": []
	}`)

	tk, err := tokenizer.LoadFromBytes(data)
	if err != nil {
		panic(err)
	}

	// Use the tokenizer
	tokens := tk.Encode("hello world", false)
	_ = tokens
}

func LoadFromFile

func LoadFromFile(path string) (*Tokenizer, error)

LoadFromFile loads a tokenizer from a HuggingFace tokenizer.json file

func (*Tokenizer) Decode

func (t *Tokenizer) Decode(ids []uint32, skipSpecialTokens bool) string

Decode converts token IDs to text

func (*Tokenizer) Encode

func (t *Tokenizer) Encode(text string, addSpecialTokens bool) []uint32

Encode converts text to token IDs

func (*Tokenizer) EncodeWithOffsets

func (t *Tokenizer) EncodeWithOffsets(text string) ([]uint32, [][2]int)

EncodeWithOffsets returns tokens with their character offsets
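
A short usage sketch, assuming tk is a loaded *Tokenizer; each [2]int pairs a token with its character offsets in the input:

ids, offsets := tk.EncodeWithOffsets("Hello, world!")
for i, id := range ids {
	// offsets[i] holds the start and end positions of token i in the input text.
	fmt.Printf("token %d -> offsets %v\n", id, offsets[i])
}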

func (*Tokenizer) IDToToken

func (t *Tokenizer) IDToToken(id int) (string, bool)

IDToToken converts a token ID to its string

func (*Tokenizer) TokenToID

func (t *Tokenizer) TokenToID(token string) (int, bool)

TokenToID converts a token string to its ID
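
Both lookups return an ok flag, so missing tokens or IDs can be detected; a quick round-trip sketch (the token string is illustrative):

if id, ok := tk.TokenToID("hello"); ok {
	if tok, ok := tk.IDToToken(id); ok {
		fmt.Println(id, tok) // prints the ID followed by "hello"
	}
}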

func (*Tokenizer) VocabSize

func (t *Tokenizer) VocabSize() int

VocabSize returns the size of the vocabulary

type TokenizerJSON

type TokenizerJSON struct {
	Model struct {
		Type         string         `json:"type"`
		Vocab        map[string]int `json:"vocab"`
		Merges       []string       `json:"merges"`
		ByteFallback bool           `json:"byte_fallback,omitempty"`
	} `json:"model"`
	AddedTokens []struct {
		ID      int    `json:"id"`
		Content string `json:"content"`
		Special bool   `json:"special"`
	} `json:"added_tokens"`
	PreTokenizer struct {
		Type          string `json:"type"`
		Pretokenizers []struct {
			Type    string `json:"type"`
			Pattern struct {
				String string `json:"String"`
			} `json:"pattern,omitempty"`
		} `json:"pretokenizers,omitempty"`
	} `json:"pre_tokenizer"`
}

TokenizerJSON represents the HuggingFace tokenizer.json format
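
Because the struct mirrors tokenizer.json with JSON tags, it can also be unmarshaled directly to inspect a file before building a Tokenizer; a small sketch:

import (
	"encoding/json"
	"fmt"
	"log"
	"os"

	"github.com/openfluke/loom/tokenizer"
)

raw, err := os.ReadFile("path/to/tokenizer.json")
if err != nil {
	log.Fatal(err)
}

var tj tokenizer.TokenizerJSON
if err := json.Unmarshal(raw, &tj); err != nil {
	log.Fatal(err)
}
fmt.Printf("type=%s vocab=%d merges=%d added=%d\n",
	tj.Model.Type, len(tj.Model.Vocab), len(tj.Model.Merges), len(tj.AddedTokens))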
