tokenizer

package
v0.0.7
Published: Jan 2, 2026 License: Apache-2.0 Imports: 6 Imported by: 0

README

LOOM Tokenizer

A pure Go implementation of a BPE (Byte Pair Encoding) tokenizer, compatible with the HuggingFace tokenizer.json format.

Features

  • ✅ Pure Go - no native dependencies
  • ✅ Loads HuggingFace tokenizer.json format
  • ✅ BPE encoding algorithm
  • ✅ Special tokens support
  • ✅ Byte fallback for unknown characters
  • ✅ Compatible with Qwen, Llama, GPT-2, and other BPE-based models

Usage

Load from File
import (
    "fmt"
    "log"

    "github.com/openfluke/loom/tokenizer"
)

// Load tokenizer from file
tk, err := tokenizer.LoadFromFile("path/to/tokenizer.json")
if err != nil {
    log.Fatal(err)
}

// Encode text to token IDs
text := "Hello, world!"
tokens := tk.Encode(text, false)
fmt.Printf("Tokens: %v\n", tokens)

// Decode token IDs back to text
decoded := tk.Decode(tokens, false)
fmt.Printf("Decoded: %s\n", decoded)

Load from Bytes
import (
    "fmt"
    "log"
    "os"

    "github.com/openfluke/loom/tokenizer"
)

// Read tokenizer data into memory
data, err := os.ReadFile("path/to/tokenizer.json")
if err != nil {
    log.Fatal(err)
}

// Load tokenizer from bytes
tk, err := tokenizer.LoadFromBytes(data)
if err != nil {
    log.Fatal(err)
}

// Use tokenizer
tokens := tk.Encode("Hello, world!", false)
decoded := tk.Decode(tokens, false)

// Get vocabulary size
fmt.Printf("Vocab size: %d\n", tk.VocabSize())

Why use LoadFromBytes?

  • Works with embedded data via go:embed (see the sketch after this list)
  • Load from network streams
  • Custom storage backends (databases, cloud storage)
  • Better for testing with mock data
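
A minimal sketch of the embedded-data case using Go's go:embed directive; the file name tokenizer.json is an assumption and must sit next to the source file:

package main

import (
    _ "embed"
    "fmt"
    "log"

    "github.com/openfluke/loom/tokenizer"
)

//go:embed tokenizer.json
var tokenizerJSON []byte

func main() {
    // Load straight from the embedded bytes; no filesystem access at runtime.
    tk, err := tokenizer.LoadFromBytes(tokenizerJSON)
    if err != nil {
        log.Fatal(err)
    }

    tokens := tk.Encode("Hello, world!", false)
    fmt.Printf("Tokens: %v (vocab size %d)\n", tokens, tk.VocabSize())
}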

Supported Models

  • Qwen / Qwen2.5 (BPE)
  • Llama / Llama2 (BPE)
  • GPT-2 / GPT-3 (BPE)
  • Mistral (BPE)
  • Most HuggingFace models using BPE tokenization

Architecture

BPE Algorithm
  1. Pre-tokenization: Split text into words using regex patterns
  2. Character splitting: Break words into individual characters
  3. Merge application: Apply BPE merges in rank order
  4. Vocabulary lookup: Convert final tokens to IDs
  5. Byte fallback: Handle unknown tokens as raw bytes
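
To make the merge step concrete, here is a generic, self-contained sketch of rank-ordered BPE merging (an illustration of the algorithm, not the package's internal code):

// applyMerges greedily applies BPE merges in rank order to a word that has
// already been split into single-character tokens.
func applyMerges(tokens []string, ranks map[[2]string]int) []string {
    for {
        // Find the adjacent pair with the lowest (best) merge rank.
        best, bestRank := -1, int(^uint(0)>>1)
        for i := 0; i+1 < len(tokens); i++ {
            if r, ok := ranks[[2]string{tokens[i], tokens[i+1]}]; ok && r < bestRank {
                best, bestRank = i, r
            }
        }
        if best < 0 {
            return tokens // no applicable merges remain
        }
        // Replace the winning pair with its merged token.
        merged := tokens[best] + tokens[best+1]
        tokens = append(tokens[:best], append([]string{merged}, tokens[best+2:]...)...)
    }
}

Real implementations typically cache pair lookups and only re-scan around the merged position; the sketch favors clarity over speed.
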
File Format

Compatible with HuggingFace's tokenizer.json:

{
  "model": {
    "type": "BPE",
    "vocab": { "token": id, ... },
    "merges": ["first second", ...]
  },
  "added_tokens": [
    { "id": 151643, "content": "<|endoftext|>", "special": true }
  ]
}

Extending

To add support for new tokenizer types:

  1. Implement the tokenization algorithm in a new file (e.g., wordpiece.go, unigram.go)
  2. Add a type field to detect the tokenizer type
  3. Create a factory function in bpe.go to route to the correct implementation

Example:

func LoadFromFile(path string) (*Tokenizer, error) {
    // Parse JSON
    var tokJSON TokenizerJSON
    // ...

    switch tokJSON.Model.Type {
    case "BPE":
        return loadBPE(tokJSON)
    case "WordPiece":
        return loadWordPiece(tokJSON)
    case "Unigram":
        return loadUnigram(tokJSON)
    default:
        return nil, fmt.Errorf("unsupported tokenizer type: %s", tokJSON.Model.Type)
    }
}

Performance

  • Encoding: ~1-2ms for 100 tokens
  • Decoding: <1ms for 100 tokens
  • Memory: ~100 bytes per vocabulary entry

Testing

cd tokenizer
go test -v
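
A minimal benchmark sketch for checking the encoding figures above on your own hardware; the testdata/tokenizer.json fixture path is hypothetical:

package tokenizer_test

import (
    "testing"

    "github.com/openfluke/loom/tokenizer"
)

func BenchmarkEncode(b *testing.B) {
    tk, err := tokenizer.LoadFromFile("testdata/tokenizer.json") // hypothetical fixture
    if err != nil {
        b.Skipf("no tokenizer fixture: %v", err)
    }
    text := "The quick brown fox jumps over the lazy dog."
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = tk.Encode(text, false)
    }
}

Run it with go test -bench=Encode -benchmem.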

Future Improvements

  • WordPiece tokenizer (BERT)
  • Unigram tokenizer (SentencePiece)
  • Caching for frequently used merges
  • Parallel encoding for long texts
  • Character offset tracking
  • Normalization (lowercase, unicode, etc.)

Documentation

Index

Examples

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type MergePair

type MergePair struct {
	First  string
	Second string
	Rank   int
}

MergePair represents a BPE merge rule

type PairWithIndex

type PairWithIndex struct {
	First  string
	Second string
	Index  int
}

PairWithIndex tracks position of a pair

type PreTokenizer

type PreTokenizer struct {
	Pattern *regexp.Regexp
}

PreTokenizer handles text splitting before BPE

func (*PreTokenizer) Split

func (pt *PreTokenizer) Split(text string) []string

Split splits text using the pre-tokenizer pattern. It preserves special tokens by finding them first, before regex splitting.

func (*PreTokenizer) SplitWithSpecialTokens added in v0.0.7

func (pt *PreTokenizer) SplitWithSpecialTokens(text string, specialTokens map[string]int) []string

SplitWithSpecialTokens splits text while preserving special tokens
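
A brief usage sketch; the regex pattern below is a placeholder, not the package's actual pre-tokenization pattern:

import (
	"fmt"
	"regexp"

	"github.com/openfluke/loom/tokenizer"
)

pt := &tokenizer.PreTokenizer{Pattern: regexp.MustCompile(`\S+|\s+`)} // placeholder pattern
special := map[string]int{"<|endoftext|>": 151643}

pieces := pt.SplitWithSpecialTokens("Hello world<|endoftext|>", special)
// "<|endoftext|>" comes back as a single piece instead of being split by the regex.
fmt.Println(pieces)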

type Tokenizer

type Tokenizer struct {
	Vocab         map[string]int // token -> id
	ReverseVocab  map[int]string // id -> token
	Merges        []MergePair    // BPE merge rules
	SpecialTokens map[string]int // special tokens
	AddedTokens   map[string]int // added tokens
	PreTokenizer  *PreTokenizer  // pre-tokenization rules
	ByteFallback  bool           // use byte fallback for unknown chars
}

Tokenizer represents a BPE tokenizer

func LoadFromBytes

func LoadFromBytes(data []byte) (*Tokenizer, error)

LoadFromBytes loads a tokenizer from HuggingFace tokenizer.json data

Example

Example of loading a tokenizer from bytes

package main

import (
	"github.com/openfluke/loom/tokenizer"
)

func main() {
	// In production, you might get this data from:
	// - Embedded files (go:embed)
	// - Network request
	// - Database
	// - Custom storage backend

	data := []byte(`{
		"model": {
			"type": "BPE",
			"vocab": {
				"hello": 0,
				"world": 1,
				" ": 2
			},
			"merges": []
		},
		"added_tokens": []
	}`)

	tk, err := tokenizer.LoadFromBytes(data)
	if err != nil {
		panic(err)
	}

	// Use the tokenizer
	tokens := tk.Encode("hello world", false)
	_ = tokens
}

func LoadFromFile

func LoadFromFile(path string) (*Tokenizer, error)

LoadFromFile loads a tokenizer from a HuggingFace tokenizer.json file

func (*Tokenizer) Decode

func (t *Tokenizer) Decode(ids []uint32, skipSpecialTokens bool) string

Decode converts token IDs to text

func (*Tokenizer) Encode

func (t *Tokenizer) Encode(text string, addSpecialTokens bool) []uint32

Encode converts text to token IDs

func (*Tokenizer) EncodeWithOffsets

func (t *Tokenizer) EncodeWithOffsets(text string) ([]uint32, [][2]int)

EncodeWithOffsets returns tokens with their character offsets
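
A short usage sketch, assuming tk is a loaded *Tokenizer; each [2]int pairs a token with its character offsets in the input:

ids, offsets := tk.EncodeWithOffsets("Hello, world!")
for i, id := range ids {
	// offsets[i] holds the start and end positions of token i in the input text.
	fmt.Printf("token %d -> offsets %v\n", id, offsets[i])
}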

func (*Tokenizer) IDToToken

func (t *Tokenizer) IDToToken(id int) (string, bool)

IDToToken converts a token ID to its string

func (*Tokenizer) TokenToID

func (t *Tokenizer) TokenToID(token string) (int, bool)

TokenToID converts a token string to its ID
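
Both lookups return an ok flag, so missing tokens or IDs can be detected; a quick round-trip sketch (the token string is illustrative):

if id, ok := tk.TokenToID("hello"); ok {
	if tok, ok := tk.IDToToken(id); ok {
		fmt.Println(id, tok) // prints the ID followed by "hello"
	}
}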

func (*Tokenizer) VocabSize

func (t *Tokenizer) VocabSize() int

VocabSize returns the size of the vocabulary

type TokenizerJSON

type TokenizerJSON struct {
	Model struct {
		Type         string         `json:"type"`
		Vocab        map[string]int `json:"vocab"`
		Merges       []string       `json:"merges"`
		ByteFallback bool           `json:"byte_fallback,omitempty"`
	} `json:"model"`
	AddedTokens []struct {
		ID      int    `json:"id"`
		Content string `json:"content"`
		Special bool   `json:"special"`
	} `json:"added_tokens"`
	PreTokenizer struct {
		Type          string `json:"type"`
		Pretokenizers []struct {
			Type    string `json:"type"`
			Pattern struct {
				String string `json:"String"`
			} `json:"pattern,omitempty"`
		} `json:"pretokenizers,omitempty"`
	} `json:"pre_tokenizer"`
}

TokenizerJSON represents the HuggingFace tokenizer.json format
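
Because the struct mirrors tokenizer.json with JSON tags, it can also be unmarshaled directly to inspect a file before building a Tokenizer; a small sketch:

import (
	"encoding/json"
	"fmt"
	"log"
	"os"

	"github.com/openfluke/loom/tokenizer"
)

raw, err := os.ReadFile("path/to/tokenizer.json")
if err != nil {
	log.Fatal(err)
}

var tj tokenizer.TokenizerJSON
if err := json.Unmarshal(raw, &tj); err != nil {
	log.Fatal(err)
}
fmt.Printf("type=%s vocab=%d merges=%d added=%d\n",
	tj.Model.Type, len(tj.Model.Vocab), len(tj.Model.Merges), len(tj.AddedTokens))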
