Documentation
Overview
Package api defines the Tokenizer API. It's just a hack to break the cyclic dependency and allow users to import `tokenizers` and get the default implementations.
Index
Constants
This section is empty.
Variables
This section is empty.
Functions
This section is empty.
Types
type Config
type Config struct {
	ConfigFile string
	TokenizerClass string `json:"tokenizer_class"`
	ChatTemplate string `json:"chat_template"`
	UseDefaultSystemPrompt bool `json:"use_default_system_prompt"`
	ModelMaxLength float64 `json:"model_max_length"`
	MaxLength float64 `json:"max_length"`
	SpModelKwargs map[string]any `json:"sp_model_kwargs"`
	ClsToken string `json:"cls_token"`
	UnkToken string `json:"unk_token"`
	SepToken string `json:"sep_token"`
	MaskToken string `json:"mask_token"`
	BosToken string `json:"bos_token"`
	EosToken string `json:"eos_token"`
	PadToken string `json:"pad_token"`
	AddBosToken bool `json:"add_bos_token"`
	AddEosToken bool `json:"add_eos_token"`
	AddedTokensDecoder map[int]TokensDecoder `json:"added_tokens_decoder"`
	AdditionalSpecialTokens []string `json:"additional_special_tokens"`
	DoLowerCase bool `json:"do_lower_case"`
	CleanUpTokenizationSpaces bool `json:"clean_up_tokenization_spaces"`
	SpacesBetweenSpecialTokens bool `json:"spaces_between_special_tokens"`
	TokenizeChineseChars bool `json:"tokenize_chinese_chars"`
	StripAccents any `json:"strip_accents"`
	NameOrPath string `json:"name_or_path"`
	DoBasicTokenize bool `json:"do_basic_tokenize"`
	NeverSplit any `json:"never_split"`
	Stride int `json:"stride"`
	TruncationSide string `json:"truncation_side"`
	TruncationStrategy string `json:"truncation_strategy"`
}
Config struct to hold HuggingFace's tokenizer_config.json contents. There is no formal schema for this file, but these are some common fields that may be of use. Specific tokenizer classes are free to implement additional features as they see fit.
The extra field ConfigFile holds the path to the file with the full config.
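As a sketch of how the struct tags line up with a typical tokenizer_config.json, the snippet below unmarshals a small hand-written fragment with the standard encoding/json package. The import path used here is an assumption and may differ in your module; the JSON content is made up for illustration.

package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/gomlx/go-huggingface/tokenizers/api" // assumed import path
)

func main() {
	// A minimal, hand-written tokenizer_config.json fragment (illustrative only).
	content := []byte(`{
		"tokenizer_class": "LlamaTokenizer",
		"bos_token": "<s>",
		"eos_token": "</s>",
		"unk_token": "<unk>",
		"add_bos_token": true,
		"model_max_length": 4096
	}`)

	var config api.Config
	if err := json.Unmarshal(content, &config); err != nil {
		log.Fatal(err)
	}
	// Fields without a matching key keep their zero value.
	fmt.Println(config.TokenizerClass, config.BosToken, config.ModelMaxLength)
}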
func ParseConfigContent
ParseConfigContent parses the given JSON content (of a tokenizer_config.json file) into a Config structure.
func ParseConfigFile
ParseConfigFile parses the given file (holding tokenizer_config.json contents) into a Config structure.
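A hedged usage sketch for these two constructors: the exact signatures are not shown on this page, so the example assumes the common Go convention of returning (*Config, error) and taking a file path (ParseConfigFile) or raw bytes (ParseConfigContent); verify against the package source. The import path is also an assumption.

package main

import (
	"fmt"
	"log"

	"github.com/gomlx/go-huggingface/tokenizers/api" // assumed import path
)

func main() {
	// Assumes ParseConfigFile(path string) (*api.Config, error).
	config, err := api.ParseConfigFile("tokenizer_config.json")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("class=%s bos=%q eos=%q\n", config.TokenizerClass, config.BosToken, config.EosToken)
}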
type SpecialToken
type SpecialToken int
SpecialToken is an enum of commonly used special tokens.
const (
	TokBeginningOfSentence SpecialToken = iota
	TokEndOfSentence
	TokUnknown
	TokPad
	TokMask
	TokClassification
	TokSpecialTokensCount
)
type Tokenizer
type Tokenizer interface {
	Encode(text string) []int
	Decode([]int) string
	// SpecialTokenID returns ID for given special token if registered, or an error if not.
	SpecialTokenID(token SpecialToken) (int, error)
}
Tokenizer interface allows one to convert text to "tokens" (integer ids) and back.
It also allows mapping of special tokens: tokens with a common semantic (like padding) that may map to different ids (int) in different tokenizers.
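To illustrate the contract, here is a hypothetical, minimal in-memory implementation of the interface. The whitespace splitting, the vocabulary, and the toyTokenizer type are made up for this sketch and are not part of the package; the import path is assumed.

package main

import (
	"fmt"
	"strings"

	"github.com/gomlx/go-huggingface/tokenizers/api" // assumed import path
)

// toyTokenizer is a hypothetical api.Tokenizer that splits on whitespace
// and assigns ids from a fixed vocabulary.
type toyTokenizer struct {
	vocab   map[string]int
	reverse map[int]string
	special map[api.SpecialToken]int
}

func (t *toyTokenizer) Encode(text string) []int {
	var ids []int
	for _, word := range strings.Fields(text) {
		if id, ok := t.vocab[word]; ok {
			ids = append(ids, id)
		} else if unk, err := t.SpecialTokenID(api.TokUnknown); err == nil {
			// Unknown words fall back to the unknown-token id.
			ids = append(ids, unk)
		}
	}
	return ids
}

func (t *toyTokenizer) Decode(ids []int) string {
	words := make([]string, 0, len(ids))
	for _, id := range ids {
		words = append(words, t.reverse[id])
	}
	return strings.Join(words, " ")
}

func (t *toyTokenizer) SpecialTokenID(token api.SpecialToken) (int, error) {
	id, ok := t.special[token]
	if !ok {
		return 0, fmt.Errorf("special token %d not registered", token)
	}
	return id, nil
}

func main() {
	var tok api.Tokenizer = &toyTokenizer{
		vocab:   map[string]int{"hello": 2, "world": 3},
		reverse: map[int]string{1: "<unk>", 2: "hello", 3: "world"},
		special: map[api.SpecialToken]int{api.TokUnknown: 1, api.TokPad: 0},
	}
	ids := tok.Encode("hello strange world")
	fmt.Println(ids, tok.Decode(ids)) // [2 1 3] hello <unk> world
}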