api

package
v0.1.0
Published: Nov 9, 2024 · License: Apache-2.0 · Imports: 3 · Imported by: 2

Documentation

Overview

Package api defines the Tokenizer API. It's just a hack to break the cyclic dependency and allow users to import `tokenizers` and get the default implementations.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	ConfigFile     string
	TokenizerClass string `json:"tokenizer_class"`

	ChatTemplate           string `json:"chat_template"`
	UseDefaultSystemPrompt bool   `json:"use_default_system_prompt"`

	ModelMaxLength float64        `json:"model_max_length"`
	MaxLength      float64        `json:"max_length"`
	SpModelKwargs  map[string]any `json:"sp_model_kwargs"`

	ClsToken  string `json:"cls_token"`
	UnkToken  string `json:"unk_token"`
	SepToken  string `json:"sep_token"`
	MaskToken string `json:"mask_token"`
	BosToken  string `json:"bos_token"`
	EosToken  string `json:"eos_token"`
	PadToken  string `json:"pad_token"`

	AddBosToken             bool                  `json:"add_bos_token"`
	AddEosToken             bool                  `json:"add_eos_token"`
	AddedTokensDecoder      map[int]TokensDecoder `json:"added_tokens_decoder"`
	AdditionalSpecialTokens []string              `json:"additional_special_tokens"`

	DoLowerCase                bool `json:"do_lower_case"`
	CleanUpTokenizationSpaces  bool `json:"clean_up_tokenization_spaces"`
	SpacesBetweenSpecialTokens bool `json:"spaces_between_special_tokens"`

	TokenizeChineseChars bool   `json:"tokenize_chinese_chars"`
	StripAccents         any    `json:"strip_accents"`
	NameOrPath           string `json:"name_or_path"`
	DoBasicTokenize      bool   `json:"do_basic_tokenize"`
	NeverSplit           any    `json:"never_split"`

	Stride             int    `json:"stride"`
	TruncationSide     string `json:"truncation_side"`
	TruncationStrategy string `json:"truncation_strategy"`
}

Config struct to hold HuggingFace's tokenizer_config.json contents. There is no formal schema for this file, but these are some common fields that may be of use. Specific tokenizer classes are free to implement additional features as they see fit.

The extra field ConfigFile holds the path to the file with the full config.

func ParseConfigContent

func ParseConfigContent(jsonContent []byte) (*Config, error)

ParseConfigContent parses the given JSON content (of a tokenizer_config.json file) into a Config structure.
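
A minimal sketch of how ParseConfigContent might be used on in-memory JSON. The import path and the field values below are placeholders for illustration, not taken from any real model:

package main

import (
	"fmt"
	"log"

	"example.com/tokenizers/api" // placeholder import path
)

func main() {
	content := []byte(`{
		"tokenizer_class": "BertTokenizer",
		"model_max_length": 512,
		"unk_token": "[UNK]",
		"pad_token": "[PAD]"
	}`)
	cfg, err := api.ParseConfigContent(content)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(cfg.TokenizerClass, cfg.ModelMaxLength, cfg.UnkToken)
}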

func ParseConfigFile

func ParseConfigFile(filePath string) (*Config, error)

ParseConfigFile parses the given tokenizer_config.json file into a Config structure.
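
And the file-based variant, as a sketch; the file path and import path are placeholders. Per the Config doc above, the ConfigFile field holds the path to the file with the full config (presumably filled in by this call):

package main

import (
	"fmt"
	"log"

	"example.com/tokenizers/api" // placeholder import path
)

func main() {
	// Illustrative path; point this at a real tokenizer_config.json.
	cfg, err := api.ParseConfigFile("model/tokenizer_config.json")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(cfg.TokenizerClass, cfg.ConfigFile)
}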

type SpecialToken

type SpecialToken int

SpecialToken is an enum of commonly used special tokens.

const (
	TokBeginningOfSentence SpecialToken = iota
	TokEndOfSentence
	TokUnknown
	TokPad
	TokMask
	TokClassification
	TokSpecialTokensCount
)
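
Because the values are consecutive and TokSpecialTokensCount acts as a sentinel, the enum is convenient for sizing or iterating over lookup tables indexed by SpecialToken. A minimal sketch, with a placeholder import path and made-up ids:

package main

import (
	"fmt"

	"example.com/tokenizers/api" // placeholder import path
)

func main() {
	// A table indexed by SpecialToken, sized by the sentinel value.
	var ids [api.TokSpecialTokensCount]int
	ids[api.TokPad] = 0        // hypothetical ids, for illustration only
	ids[api.TokUnknown] = 100
	for tok := api.SpecialToken(0); tok < api.TokSpecialTokensCount; tok++ {
		fmt.Printf("special token %d -> id %d\n", tok, ids[tok])
	}
}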

type Tokenizer

type Tokenizer interface {
	Encode(text string) []int
	Decode([]int) string

	// SpecialTokenID returns the ID for the given special token if registered, or an error if not.
	SpecialTokenID(token SpecialToken) (int, error)
}

Tokenizer interface allows one to convert text to "tokens" (integer ids) and back.

It also allows mapping of special tokens: tokens with a common semantic (like padding) that may map to different ids (int) for different tokenizers.
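
The interface is small enough that a toy implementation fits in a few lines. Below is a minimal sketch, assuming a placeholder import path; toyTokenizer is a hypothetical whitespace tokenizer invented for illustration, not the package's real implementation (the defaults live in `tokenizers`):

package toy

import (
	"fmt"
	"strings"

	"example.com/tokenizers/api" // placeholder import path
)

// toyTokenizer is a hypothetical whitespace tokenizer satisfying api.Tokenizer.
type toyTokenizer struct {
	vocab   map[string]int // word -> id
	words   map[int]string // id -> word
	special map[api.SpecialToken]int
}

// Encode maps each whitespace-separated word to its id, falling back to the
// unknown token when a word is not in the vocabulary.
func (t *toyTokenizer) Encode(text string) []int {
	var ids []int
	for _, w := range strings.Fields(text) {
		if id, ok := t.vocab[w]; ok {
			ids = append(ids, id)
		} else if unk, err := t.SpecialTokenID(api.TokUnknown); err == nil {
			ids = append(ids, unk)
		}
	}
	return ids
}

// Decode maps ids back to words and joins them with spaces.
func (t *toyTokenizer) Decode(ids []int) string {
	parts := make([]string, 0, len(ids))
	for _, id := range ids {
		parts = append(parts, t.words[id])
	}
	return strings.Join(parts, " ")
}

// SpecialTokenID returns the registered id for token, or an error if the
// tokenizer doesn't define it.
func (t *toyTokenizer) SpecialTokenID(token api.SpecialToken) (int, error) {
	if id, ok := t.special[token]; ok {
		return id, nil
	}
	return 0, fmt.Errorf("special token %d not registered", token)
}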

type TokensDecoder

type TokensDecoder struct {
	Content    string `json:"content"`
	Lstrip     bool   `json:"lstrip"`
	Normalized bool   `json:"normalized"`
	Rstrip     bool   `json:"rstrip"`
	SingleWord bool   `json:"single_word"`
	Special    bool   `json:"special"`
}
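
TokensDecoder models one entry of the added_tokens_decoder map in Config. A minimal sketch of how such an entry round-trips through ParseConfigContent, using an illustrative HuggingFace-style fragment and a placeholder import path:

package main

import (
	"fmt"
	"log"

	"example.com/tokenizers/api" // placeholder import path
)

func main() {
	// Illustrative added_tokens_decoder fragment; ids and tokens are made up.
	content := []byte(`{
		"added_tokens_decoder": {
			"0": {"content": "[PAD]", "special": true},
			"100": {"content": "[UNK]", "special": true}
		}
	}`)
	cfg, err := api.ParseConfigContent(content)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(cfg.AddedTokensDecoder[0].Content)   // [PAD]
	fmt.Println(cfg.AddedTokensDecoder[100].Special) // true
}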
