tokenizer

package
v0.18.2
Published: Mar 18, 2026 License: MIT Imports: 11 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer handles BPE and SentencePiece tokenization.

func LoadFromBytes

func LoadFromBytes(data []byte) (*Tokenizer, error)

LoadFromBytes loads a tokenizer from tokenizer.json bytes. This is useful when loading from blob storage where the file content is already in memory. Note: This won't load special token config from companion files. Use LoadFromBytesWithConfig to provide tokenizer_config.json data for proper PAD/EOS token loading.

func LoadFromBytesWithConfig

func LoadFromBytesWithConfig(data []byte, config *TokenizerConfig) (*Tokenizer, error)

LoadFromBytesWithConfig loads a tokenizer from tokenizer.json bytes with additional config files. This is useful when loading from blob storage where companion config files are also blobs.

func (*Tokenizer) AddBOS added in v0.18.2

func (t *Tokenizer) AddBOS() bool

AddBOS returns whether a BOS token should be prepended during encoding.

func (*Tokenizer) BOS

func (t *Tokenizer) BOS() int32

BOS returns the beginning-of-sequence token ID.

func (*Tokenizer) Decode

func (t *Tokenizer) Decode(ids []int32) string

Decode converts token IDs back to text.

func (*Tokenizer) EOS

func (t *Tokenizer) EOS() int32

EOS returns the first end-of-sequence token ID (for backwards compatibility).

func (*Tokenizer) EOSTokens

func (t *Tokenizer) EOSTokens() []int32

EOSTokens returns all end-of-sequence token IDs.

func (*Tokenizer) Encode

func (t *Tokenizer) Encode(s string, addBOS bool) []int32

Encode tokenizes text to token IDs. Parallel encoding is used only for very large inputs with enough chunks per worker.

func (*Tokenizer) GetSpecialToken

func (t *Tokenizer) GetSpecialToken(name string) (int32, bool)

GetSpecialToken returns the token ID for a special token string.

func (*Tokenizer) IsEOS

func (t *Tokenizer) IsEOS(id int32) bool

IsEOS reports whether the token ID is an end-of-sequence token.

func (*Tokenizer) PAD

func (t *Tokenizer) PAD() int32

PAD returns the padding token ID, or -1 if not set.

func (*Tokenizer) VocabSize

func (t *Tokenizer) VocabSize() int

VocabSize returns the vocabulary size.

type TokenizerConfig

type TokenizerConfig struct {
	TokenizerConfigJSON  []byte // tokenizer_config.json content
	GenerationConfigJSON []byte // generation_config.json content
	SpecialTokensMapJSON []byte // special_tokens_map.json content
	ConfigJSON           []byte // config.json content
}

TokenizerConfig holds optional configuration data that can be passed to LoadFromBytesWithConfig.

type TokenizerType

type TokenizerType int

TokenizerType identifies the tokenization algorithm.

const (
	TokenizerBPE           TokenizerType = iota // GPT-2 style byte-level BPE
	TokenizerSentencePiece                      // SentencePiece with ▁ for spaces
)

type Vocabulary

type Vocabulary struct {
	Values  []string
	Reverse map[string]int32
	Merges  map[string]int

	BOS    int32
	EOS    []int32 // Multiple EOS tokens supported (e.g., Gemma has <eos> and <end_of_turn>)
	PAD    int32   // Padding token (often <|endoftext|> or <pad>)
	AddBOS bool
	AddEOS bool
	// contains filtered or unexported fields
}

Vocabulary holds the tokenizer vocabulary and merges.
