api

package
v0.1.0
Published: Nov 9, 2024 · License: Apache-2.0 · Imports: 3 · Imported by: 2

Documentation

Overview

Package api defines the Tokenizer API. It's just a hack to break the cyclic dependency and allow users to import `tokenizers` and get the default implementations.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	ConfigFile     string
	TokenizerClass string `json:"tokenizer_class"`

	ChatTemplate           string `json:"chat_template"`
	UseDefaultSystemPrompt bool   `json:"use_default_system_prompt"`

	ModelMaxLength float64        `json:"model_max_length"`
	MaxLength      float64        `json:"max_length"`
	SpModelKwargs  map[string]any `json:"sp_model_kwargs"`

	ClsToken  string `json:"cls_token"`
	UnkToken  string `json:"unk_token"`
	SepToken  string `json:"sep_token"`
	MaskToken string `json:"mask_token"`
	BosToken  string `json:"bos_token"`
	EosToken  string `json:"eos_token"`
	PadToken  string `json:"pad_token"`

	AddBosToken             bool                  `json:"add_bos_token"`
	AddEosToken             bool                  `json:"add_eos_token"`
	AddedTokensDecoder      map[int]TokensDecoder `json:"added_tokens_decoder"`
	AdditionalSpecialTokens []string              `json:"additional_special_tokens"`

	DoLowerCase                bool `json:"do_lower_case"`
	CleanUpTokenizationSpaces  bool `json:"clean_up_tokenization_spaces"`
	SpacesBetweenSpecialTokens bool `json:"spaces_between_special_tokens"`

	TokenizeChineseChars bool   `json:"tokenize_chinese_chars"`
	StripAccents         any    `json:"strip_accents"`
	NameOrPath           string `json:"name_or_path"`
	DoBasicTokenize      bool   `json:"do_basic_tokenize"`
	NeverSplit           any    `json:"never_split"`

	Stride             int    `json:"stride"`
	TruncationSide     string `json:"truncation_side"`
	TruncationStrategy string `json:"truncation_strategy"`
}

Config struct to hold HuggingFace's tokenizer_config.json contents. There is no formal schema for this file, but these are some common fields that may be of use. Specific tokenizer classes are free to implement additional features as they see fit.

The extra field ConfigFile holds the path to the file with the full config.

func ParseConfigContent

func ParseConfigContent(jsonContent []byte) (*Config, error)

ParseConfigContent parses the given JSON content (of a tokenizer_config.json file) into a Config structure.
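
A minimal sketch of how ParseConfigContent might be used on in-memory JSON. The import path and the field values below are placeholders for illustration, not taken from any real model:

package main

import (
	"fmt"
	"log"

	"example.com/tokenizers/api" // placeholder import path
)

func main() {
	content := []byte(`{
		"tokenizer_class": "BertTokenizer",
		"model_max_length": 512,
		"unk_token": "[UNK]",
		"pad_token": "[PAD]"
	}`)
	cfg, err := api.ParseConfigContent(content)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(cfg.TokenizerClass, cfg.ModelMaxLength, cfg.UnkToken)
}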

func ParseConfigFile

func ParseConfigFile(filePath string) (*Config, error)

ParseConfigFile parses the given tokenizer_config.json file into a Config structure.
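
And the file-based variant, as a sketch; the file path and import path are placeholders. Per the Config doc above, the ConfigFile field holds the path to the file with the full config (presumably filled in by this call):

package main

import (
	"fmt"
	"log"

	"example.com/tokenizers/api" // placeholder import path
)

func main() {
	// Illustrative path; point this at a real tokenizer_config.json.
	cfg, err := api.ParseConfigFile("model/tokenizer_config.json")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(cfg.TokenizerClass, cfg.ConfigFile)
}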

type SpecialToken

type SpecialToken int

SpecialToken is an enum of commonly used special tokens.

const (
	TokBeginningOfSentence SpecialToken = iota
	TokEndOfSentence
	TokUnknown
	TokPad
	TokMask
	TokClassification
	TokSpecialTokensCount
)
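
Because the values are consecutive and TokSpecialTokensCount acts as a sentinel, the enum is convenient for sizing or iterating over lookup tables indexed by SpecialToken. A minimal sketch, with a placeholder import path and made-up ids:

package main

import (
	"fmt"

	"example.com/tokenizers/api" // placeholder import path
)

func main() {
	// A table indexed by SpecialToken, sized by the sentinel value.
	var ids [api.TokSpecialTokensCount]int
	ids[api.TokPad] = 0        // hypothetical ids, for illustration only
	ids[api.TokUnknown] = 100
	for tok := api.SpecialToken(0); tok < api.TokSpecialTokensCount; tok++ {
		fmt.Printf("special token %d -> id %d\n", tok, ids[tok])
	}
}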

type Tokenizer

type Tokenizer interface {
	Encode(text string) []int
	Decode([]int) string

	// SpecialTokenID returns the ID for the given special token if registered, or an error if not.
	SpecialTokenID(token SpecialToken) (int, error)
}

Tokenizer interface allows one to convert text to "tokens" (integer ids) and back.

It also allows mapping of special tokens: tokens with a common semantic (like padding) that may map to different ids (int) for different tokenizers.
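
The interface is small enough that a toy implementation fits in a few lines. Below is a minimal sketch, assuming a placeholder import path; toyTokenizer is a hypothetical whitespace tokenizer invented for illustration, not the package's real implementation (the defaults live in `tokenizers`):

package toy

import (
	"fmt"
	"strings"

	"example.com/tokenizers/api" // placeholder import path
)

// toyTokenizer is a hypothetical whitespace tokenizer satisfying api.Tokenizer.
type toyTokenizer struct {
	vocab   map[string]int // word -> id
	words   map[int]string // id -> word
	special map[api.SpecialToken]int
}

// Encode maps each whitespace-separated word to its id, falling back to the
// unknown token when a word is not in the vocabulary.
func (t *toyTokenizer) Encode(text string) []int {
	var ids []int
	for _, w := range strings.Fields(text) {
		if id, ok := t.vocab[w]; ok {
			ids = append(ids, id)
		} else if unk, err := t.SpecialTokenID(api.TokUnknown); err == nil {
			ids = append(ids, unk)
		}
	}
	return ids
}

// Decode maps ids back to words and joins them with spaces.
func (t *toyTokenizer) Decode(ids []int) string {
	parts := make([]string, 0, len(ids))
	for _, id := range ids {
		parts = append(parts, t.words[id])
	}
	return strings.Join(parts, " ")
}

// SpecialTokenID returns the registered id for token, or an error if the
// tokenizer doesn't define it.
func (t *toyTokenizer) SpecialTokenID(token api.SpecialToken) (int, error) {
	if id, ok := t.special[token]; ok {
		return id, nil
	}
	return 0, fmt.Errorf("special token %d not registered", token)
}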

type TokensDecoder

type TokensDecoder struct {
	Content    string `json:"content"`
	Lstrip     bool   `json:"lstrip"`
	Normalized bool   `json:"normalized"`
	Rstrip     bool   `json:"rstrip"`
	SingleWord bool   `json:"single_word"`
	Special    bool   `json:"special"`
}
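
TokensDecoder models one entry of the added_tokens_decoder map in Config. A minimal sketch of how such an entry round-trips through ParseConfigContent, using an illustrative HuggingFace-style fragment and a placeholder import path:

package main

import (
	"fmt"
	"log"

	"example.com/tokenizers/api" // placeholder import path
)

func main() {
	// Illustrative added_tokens_decoder fragment; ids and tokens are made up.
	content := []byte(`{
		"added_tokens_decoder": {
			"0": {"content": "[PAD]", "special": true},
			"100": {"content": "[UNK]", "special": true}
		}
	}`)
	cfg, err := api.ParseConfigContent(content)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(cfg.AddedTokensDecoder[0].Content)   // [PAD]
	fmt.Println(cfg.AddedTokensDecoder[100].Special) // true
}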
