Documentation ¶
Index ¶
- type MergePair
- type PairWithIndex
- type PreTokenizer
- func (pt *PreTokenizer) Split(text string) []string
- func (pt *PreTokenizer) SplitWithSpecialTokens(text string, specialTokens map[string]int) []string
- type Tokenizer
- func (t *Tokenizer) Decode(ids []uint32, skipSpecialTokens bool) string
- func (t *Tokenizer) Encode(text string, addSpecialTokens bool) []uint32
- func (t *Tokenizer) EncodeWithOffsets(text string) ([]uint32, [][2]int)
- func (t *Tokenizer) IDToToken(id int) (string, bool)
- func (t *Tokenizer) TokenToID(token string) (int, bool)
- func (t *Tokenizer) VocabSize() int
- type TokenizerJSON
Examples ¶
- LoadFromBytes
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type PairWithIndex ¶
PairWithIndex tracks the position of a pair.
type PreTokenizer ¶
PreTokenizer handles text splitting before BPE
func (*PreTokenizer) Split ¶
func (pt *PreTokenizer) Split(text string) []string
Split splits text using the pre-tokenizer pattern. It preserves special tokens by finding them first, before regex splitting.
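A minimal sketch of calling Split through a loaded tokenizer's PreTokenizer field. This assumes LoadFromFile takes a file path and returns (*Tokenizer, error) by analogy with LoadFromBytes; the path is hypothetical, and the split boundaries depend entirely on the pattern defined in the loaded tokenizer.json:

package main

import (
	"fmt"

	"github.com/openfluke/loom/tokenizer"
)

func main() {
	// Hypothetical path; point this at a real HuggingFace tokenizer.json
	// whose pre_tokenizer section defines a split pattern.
	tk, err := tokenizer.LoadFromFile("tokenizer.json")
	if err != nil {
		panic(err)
	}
	// The resulting pieces depend on the configured pattern.
	pieces := tk.PreTokenizer.Split("Hello, world!")
	fmt.Println(pieces)
}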
func (*PreTokenizer) SplitWithSpecialTokens ¶ added in v0.0.7
func (pt *PreTokenizer) SplitWithSpecialTokens(text string, specialTokens map[string]int) []string
SplitWithSpecialTokens splits text while preserving special tokens
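A similar sketch for the special-token-aware variant, passing the tokenizer's own SpecialTokens map. Same LoadFromFile assumptions as above, and it further assumes the loaded file declares <s> and </s> as special tokens:

package main

import (
	"fmt"

	"github.com/openfluke/loom/tokenizer"
)

func main() {
	tk, err := tokenizer.LoadFromFile("tokenizer.json") // hypothetical path
	if err != nil {
		panic(err)
	}
	// Special tokens are matched verbatim and kept as single pieces;
	// the text between them goes through the regex split.
	pieces := tk.PreTokenizer.SplitWithSpecialTokens("<s>Hello world</s>", tk.SpecialTokens)
	fmt.Println(pieces)
}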
type Tokenizer ¶
type Tokenizer struct {
	Vocab         map[string]int // token -> id
	ReverseVocab  map[int]string // id -> token
	Merges        []MergePair    // BPE merge rules
	SpecialTokens map[string]int // special tokens
	AddedTokens   map[string]int // added tokens
	PreTokenizer  *PreTokenizer  // pre-tokenization rules
	ByteFallback  bool           // use byte fallback for unknown chars
}
Tokenizer represents a BPE tokenizer
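The methods listed in the index compose into a simple round trip. A sketch using the same inline JSON as the LoadFromBytes example below; the exact IDs produced depend on the BPE merges:

package main

import (
	"fmt"

	"github.com/openfluke/loom/tokenizer"
)

func main() {
	// Tiny inline vocabulary, as in the LoadFromBytes example below.
	data := []byte(`{
		"model": {
			"type": "BPE",
			"vocab": {"hello": 0, "world": 1, " ": 2},
			"merges": []
		},
		"added_tokens": []
	}`)
	tk, err := tokenizer.LoadFromBytes(data)
	if err != nil {
		panic(err)
	}

	// Encode without special tokens, then decode the IDs back to text.
	ids := tk.Encode("hello world", false)
	fmt.Println(ids, tk.Decode(ids, false))

	// Vocabulary lookups in both directions.
	if id, ok := tk.TokenToID("hello"); ok {
		tok, _ := tk.IDToToken(id)
		fmt.Println(id, tok)
	}
	fmt.Println(tk.VocabSize())
}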
func LoadFromBytes ¶
func LoadFromBytes(data []byte) (*Tokenizer, error)
LoadFromBytes loads a tokenizer from HuggingFace tokenizer.json data.
Example ¶
Example of loading a tokenizer from bytes
package main

import (
	"github.com/openfluke/loom/tokenizer"
)

func main() {
	// In production, you might get this data from:
	// - Embedded files (go:embed)
	// - Network request
	// - Database
	// - Custom storage backend
	data := []byte(`{
		"model": {
			"type": "BPE",
			"vocab": {
				"hello": 0,
				"world": 1,
				" ": 2
			},
			"merges": []
		},
		"added_tokens": []
	}`)

	tk, err := tokenizer.LoadFromBytes(data)
	if err != nil {
		panic(err)
	}

	// Use the tokenizer
	tokens := tk.Encode("hello world", false)
	_ = tokens
}
func LoadFromFile ¶
LoadFromFile loads a tokenizer from a HuggingFace tokenizer.json file
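A minimal sketch; the path is hypothetical, and the signature is assumed to mirror LoadFromBytes (a path in, *Tokenizer and error out):

package main

import (
	"github.com/openfluke/loom/tokenizer"
)

func main() {
	// Hypothetical path to a HuggingFace-format tokenizer.json.
	tk, err := tokenizer.LoadFromFile("path/to/tokenizer.json")
	if err != nil {
		panic(err)
	}
	_ = tk.Encode("hello world", false)
}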
func (*Tokenizer) EncodeWithOffsets ¶
func (t *Tokenizer) EncodeWithOffsets(text string) ([]uint32, [][2]int)
EncodeWithOffsets returns tokens with their character offsets.
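A sketch pairing each returned ID with the span of input it covers. Whether the [2]int offsets index bytes or runes is not stated here, so the slicing below assumes byte offsets (and the same hypothetical LoadFromFile path as above):

package main

import (
	"fmt"

	"github.com/openfluke/loom/tokenizer"
)

func main() {
	tk, err := tokenizer.LoadFromFile("tokenizer.json") // hypothetical path
	if err != nil {
		panic(err)
	}

	text := "hello world"
	ids, offsets := tk.EncodeWithOffsets(text)
	for i, id := range ids {
		start, end := offsets[i][0], offsets[i][1]
		// Assumes byte offsets; adjust if the package reports rune offsets.
		fmt.Printf("id=%d span=[%d,%d) %q\n", id, start, end, text[start:end])
	}
}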
type TokenizerJSON ¶
type TokenizerJSON struct {
	Model struct {
		Type         string         `json:"type"`
		Vocab        map[string]int `json:"vocab"`
		Merges       []string       `json:"merges"`
		ByteFallback bool           `json:"byte_fallback,omitempty"`
	} `json:"model"`
	AddedTokens []struct {
		ID      int    `json:"id"`
		Content string `json:"content"`
		Special bool   `json:"special"`
	} `json:"added_tokens"`
	PreTokenizer struct {
		Type          string `json:"type"`
		Pretokenizers []struct {
			Type    string `json:"type"`
			Pattern struct {
				String string `json:"String"`
			} `json:"pattern,omitempty"`
		} `json:"pretokenizers,omitempty"`
	} `json:"pre_tokenizer"`
}
TokenizerJSON represents the HuggingFace tokenizer.json format
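Because every field carries a json tag, the struct decodes straight from tokenizer.json bytes with the standard library; a minimal sketch:

package main

import (
	"encoding/json"
	"fmt"

	"github.com/openfluke/loom/tokenizer"
)

func main() {
	data := []byte(`{
		"model": {"type": "BPE", "vocab": {"hello": 0}, "merges": []},
		"added_tokens": [{"id": 1, "content": "<s>", "special": true}]
	}`)

	var tj tokenizer.TokenizerJSON
	if err := json.Unmarshal(data, &tj); err != nil {
		panic(err)
	}
	fmt.Println(tj.Model.Type, len(tj.Model.Vocab), len(tj.AddedTokens))
}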