Documentation
Index ¶
- type Tokenizer
- func (t *Tokenizer) AddBOS() bool
- func (t *Tokenizer) BOS() int32
- func (t *Tokenizer) Decode(ids []int32) string
- func (t *Tokenizer) EOS() int32
- func (t *Tokenizer) EOSTokens() []int32
- func (t *Tokenizer) Encode(s string, addBOS bool) []int32
- func (t *Tokenizer) GetSpecialToken(name string) (int32, bool)
- func (t *Tokenizer) IsEOS(id int32) bool
- func (t *Tokenizer) PAD() int32
- func (t *Tokenizer) VocabSize() int
- type TokenizerConfig
- type TokenizerType
- type Vocabulary
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Tokenizer ¶
type Tokenizer struct {
	// contains filtered or unexported fields
}
Tokenizer handles BPE and SentencePiece tokenization.
func LoadFromBytes ¶
LoadFromBytes loads a tokenizer from tokenizer.json bytes. This is useful when loading from blob storage where the file content is already in memory. Note: this won't load special-token configuration from companion files; use LoadFromBytesWithConfig to provide tokenizer_config.json data so PAD/EOS tokens load correctly.
func LoadFromBytesWithConfig ¶
func LoadFromBytesWithConfig(data []byte, config *TokenizerConfig) (*Tokenizer, error)
LoadFromBytesWithConfig loads a tokenizer from tokenizer.json bytes with additional config files. This is useful when loading from blob storage where companion config files are also blobs.
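A usage sketch of the blob-storage loading path (the `tokenizer` import name and the file names read here are assumptions for illustration; the signatures match the index above):

```go
// Hypothetical usage sketch: load tokenizer.json plus a companion
// config from memory, then round-trip a string through the tokenizer.
data, _ := os.ReadFile("tokenizer.json")
cfgJSON, _ := os.ReadFile("tokenizer_config.json")

tok, err := tokenizer.LoadFromBytesWithConfig(data, &tokenizer.TokenizerConfig{
	TokenizerConfigJSON: cfgJSON, // enables proper PAD/EOS loading
})
if err != nil {
	log.Fatal(err)
}

ids := tok.Encode("hello world", tok.AddBOS())
fmt.Println(tok.Decode(ids))
```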
func (*Tokenizer) AddBOS ¶ added in v0.18.2
func (t *Tokenizer) AddBOS() bool
AddBOS reports whether a BOS token should be prepended during encoding.
func (*Tokenizer) EOS ¶
func (t *Tokenizer) EOS() int32
EOS returns the first end-of-sequence token ID (kept for backwards compatibility).
func (*Tokenizer) Encode ¶
func (t *Tokenizer) Encode(s string, addBOS bool) []int32
Encode tokenizes text into token IDs. Parallel encoding is used only for very large inputs with enough chunks per worker.
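To illustrate the BPE half of encoding, here is a minimal sketch of the greedy merge loop that byte-level BPE tokenizers apply, using a `"left right"` pair-to-rank map like the `Merges` field of `Vocabulary` below. This is a toy assumption-laden illustration, not the package's actual implementation:

```go
package main

import "fmt"

// bpeMerge repeatedly merges the adjacent token pair with the lowest
// merge rank until no mergeable pair remains. Keys in merges are
// space-separated pairs, mirroring the Vocabulary.Merges layout.
func bpeMerge(tokens []string, merges map[string]int) []string {
	for {
		best, bestRank := -1, -1
		for i := 0; i < len(tokens)-1; i++ {
			if r, ok := merges[tokens[i]+" "+tokens[i+1]]; ok && (bestRank == -1 || r < bestRank) {
				best, bestRank = i, r
			}
		}
		if best == -1 {
			return tokens // no pair left to merge
		}
		merged := tokens[best] + tokens[best+1]
		// Rebuild the slice with the pair replaced by its merge.
		tokens = append(tokens[:best], append([]string{merged}, tokens[best+2:]...)...)
	}
}

func main() {
	merges := map[string]int{"l o": 0, "lo w": 1}
	fmt.Println(bpeMerge([]string{"l", "o", "w"}, merges)) // [low]
}
```

A real encoder would first map input bytes to the byte-level alphabet and then look each merged token up in `Reverse` to produce IDs.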
func (*Tokenizer) GetSpecialToken ¶
func (t *Tokenizer) GetSpecialToken(name string) (int32, bool)
GetSpecialToken returns the token ID for a special token string; the second result reports whether the token was found.
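The two-value return lets callers distinguish "token ID 0" from "not present", following Go's comma-ok idiom. A self-contained sketch of that lookup pattern (the map contents are made up, not the real vocabulary):

```go
package main

import "fmt"

// specialTokens is a stand-in for the tokenizer's special-token table.
var specialTokens = map[string]int32{
	"<bos>": 1,
	"<eos>": 2,
}

// getSpecialToken mirrors the (id, ok) shape of GetSpecialToken:
// ok is false when the name is not a known special token.
func getSpecialToken(name string) (int32, bool) {
	id, ok := specialTokens[name]
	return id, ok
}

func main() {
	if id, ok := getSpecialToken("<eos>"); ok {
		fmt.Println(id) // 2
	}
}
```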
type TokenizerConfig ¶
type TokenizerConfig struct {
	TokenizerConfigJSON  []byte // tokenizer_config.json content
	GenerationConfigJSON []byte // generation_config.json content
	SpecialTokensMapJSON []byte // special_tokens_map.json content
	ConfigJSON           []byte // config.json content
}
TokenizerConfig holds optional configuration data that can be passed to LoadFromBytesWithConfig.
type TokenizerType ¶
type TokenizerType int
TokenizerType identifies the tokenization algorithm.
const (
	TokenizerBPE TokenizerType = iota // GPT-2 style byte-level BPE
	TokenizerSentencePiece // SentencePiece with ▁ for spaces
)
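The constants follow Go's `iota` enum pattern. A self-contained sketch, with a `String` method added purely for illustration (it is not part of the documented API):

```go
package main

import "fmt"

// TokenizerType identifies the tokenization algorithm.
type TokenizerType int

const (
	TokenizerBPE           TokenizerType = iota // GPT-2 style byte-level BPE
	TokenizerSentencePiece                      // SentencePiece with ▁ for spaces
)

// String is an illustrative helper so the type prints readably.
func (t TokenizerType) String() string {
	switch t {
	case TokenizerBPE:
		return "BPE"
	case TokenizerSentencePiece:
		return "SentencePiece"
	}
	return "unknown"
}

func main() {
	fmt.Println(TokenizerSentencePiece) // SentencePiece
}
```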
type Vocabulary ¶
type Vocabulary struct {
	Values  []string
	Reverse map[string]int32
	Merges  map[string]int
	BOS     int32
	EOS     []int32 // Multiple EOS tokens supported (e.g., Gemma has <eos> and <end_of_turn>)
	PAD     int32   // Padding token (often <|endoftext|> or <pad>)
	AddBOS  bool
	AddEOS  bool
	// contains filtered or unexported fields
}
Vocabulary holds the tokenizer vocabulary and merges.
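Because `EOS` is a slice, an end-of-sequence check must scan every entry, which is presumably what `IsEOS` does. A minimal self-contained sketch of that logic (the struct is a local stand-in and the token IDs are invented for illustration):

```go
package main

import "fmt"

// vocab is a toy stand-in for the EOS-related part of Vocabulary.
type vocab struct {
	EOS []int32 // multiple EOS tokens, e.g. <eos> and <end_of_turn>
}

// isEOS reports whether id matches any configured EOS token.
func (v *vocab) isEOS(id int32) bool {
	for _, eos := range v.EOS {
		if id == eos {
			return true
		}
	}
	return false
}

func main() {
	v := &vocab{EOS: []int32{1, 107}} // made-up IDs
	fmt.Println(v.isEOS(107), v.isEOS(5)) // true false
}
```

This matches the documented behavior that `EOS()` returns only the first entry for backwards compatibility, while `EOSTokens()` and `IsEOS` cover the full set.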