Documentation
Overview ¶
Package tokenizers creates tokenizers from HuggingFace models.
Given a HuggingFace repository (see hub.New to create one), tokenizers will use its "tokenizer_config.json" and "tokenizer.json" to instantiate a Tokenizer.
Index ¶
Constants ¶
const (
	TokBeginningOfSentence = api.TokBeginningOfSentence
	TokEndOfSentence       = api.TokEndOfSentence
	TokUnknown             = api.TokUnknown
	TokPad                 = api.TokPad
	TokMask                = api.TokMask
	TokClassification      = api.TokClassification
	TokSpecialTokensCount  = api.TokSpecialTokensCount
)
Variables ¶
This section is empty.
Functions ¶
func RegisterTokenizerClass ¶
func RegisterTokenizerClass(name string, constructor TokenizerConstructor)
RegisterTokenizerClass is used by Tokenizer implementations to register a constructor under the given class name.
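The registration pattern described above can be sketched as a map from class name to constructor. This is a minimal, self-contained illustration, not the package's implementation: the names Constructor, Register, NewByClass and byteTokenizer are hypothetical, and the real TokenizerConstructor signature is not shown in this documentation.

```go
package main

import "fmt"

// Tokenizer is a simplified stand-in for the package's Tokenizer interface.
type Tokenizer interface {
	Encode(text string) []int
}

// Constructor is a hypothetical stand-in for TokenizerConstructor; the real
// type likely takes arguments (for example the parsed config).
type Constructor func() (Tokenizer, error)

// registry maps a tokenizer class name to its constructor.
var registry = map[string]Constructor{}

// Register associates a class name with a constructor. Implementations would
// typically call this from an init function.
func Register(name string, c Constructor) {
	registry[name] = c
}

// NewByClass looks up the constructor registered for name and calls it.
func NewByClass(name string) (Tokenizer, error) {
	c, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown tokenizer class %q", name)
	}
	return c()
}

// byteTokenizer is a toy implementation that emits one token id per byte.
type byteTokenizer struct{}

func (byteTokenizer) Encode(text string) []int {
	ids := make([]int, len(text))
	for i := 0; i < len(text); i++ {
		ids[i] = int(text[i])
	}
	return ids
}

func init() {
	Register("ByteTokenizer", func() (Tokenizer, error) {
		return byteTokenizer{}, nil
	})
}

func main() {
	tok, err := NewByClass("ByteTokenizer")
	if err != nil {
		panic(err)
	}
	fmt.Println(tok.Encode("hi")) // prints [104 105]

	// An unregistered class name yields an error instead of a tokenizer.
	_, err = NewByClass("Missing")
	fmt.Println(err)
}
```

Registering from init keeps the lookup code unaware of concrete implementations: importing an implementation package is enough to make its class name available.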
Types ¶
type Config ¶
Config holds the contents of HuggingFace's tokenizer_config.json. There is no formal schema for this file, but the struct includes some common fields that may be of use. Specific tokenizer classes are free to implement additional features as they see fit.
The extra field ConfigFile holds the path to the file with the full config.
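As a rough illustration of how such a schema-less file can be read, the sketch below decodes a small subset of fields with encoding/json and records the source path separately. The struct tokenizerConfig and the helper parseConfig are hypothetical and are not the package's Config type; "tokenizer_class" and "clean_up_tokenization_spaces" are keys commonly seen in tokenizer_config.json files, and any other keys are simply ignored.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tokenizerConfig is a hypothetical subset of tokenizer_config.json.
// Keys not listed here are ignored by encoding/json.
type tokenizerConfig struct {
	TokenizerClass string `json:"tokenizer_class"`
	CleanUpSpaces  bool   `json:"clean_up_tokenization_spaces"`

	// ConfigFile is not read from the JSON; it records where the data came from.
	ConfigFile string `json:"-"`
}

// parseConfig decodes data and remembers the path it was loaded from.
func parseConfig(path string, data []byte) (*tokenizerConfig, error) {
	cfg := &tokenizerConfig{ConfigFile: path}
	if err := json.Unmarshal(data, cfg); err != nil {
		return nil, fmt.Errorf("parsing %q: %w", path, err)
	}
	return cfg, nil
}

func main() {
	data := []byte(`{"tokenizer_class": "LlamaTokenizer", "some_other_key": 123}`)
	cfg, err := parseConfig("tokenizer_config.json", data)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.TokenizerClass, cfg.ConfigFile)
}
```

Keeping the file path alongside the parsed fields lets a specific tokenizer class reopen the full file when it needs keys that the shared struct does not cover.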
type SpecialToken ¶
SpecialToken is an enum of commonly used special tokens.
type Tokenizer ¶
The Tokenizer interface allows one to convert text to "tokens" (integer ids) and back.
It also allows mapping of special tokens: tokens with a common semantic (like padding) that may map to different ids (int) in different tokenizers.
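To make the idea of special tokens concrete, here is a toy, self-contained sketch in which two tokenizers share the same "padding" semantic but assign it different ids. Everything in it is an assumption for illustration: the lowercase specialToken constants, the method names Encode, Decode and SpecialTokenID, and the whitespace-based wordTokenizer are local stand-ins, not the package's definitions.

```go
package main

import (
	"fmt"
	"strings"
)

// specialToken is a local stand-in for the package's SpecialToken enum.
type specialToken int

const (
	tokUnknown specialToken = iota
	tokPad
	tokMask
)

// tokenizer mirrors the described capabilities; method names are assumed.
type tokenizer interface {
	Encode(text string) []int
	Decode(ids []int) string
	SpecialTokenID(tok specialToken) (int, error)
}

// wordTokenizer is a toy whitespace tokenizer over a fixed vocabulary.
type wordTokenizer struct {
	vocab   []string             // id -> word
	ids     map[string]int       // word -> id
	special map[specialToken]int // special token -> id
}

func newWordTokenizer(vocab []string, special map[specialToken]int) *wordTokenizer {
	ids := make(map[string]int, len(vocab))
	for i, w := range vocab {
		ids[w] = i
	}
	return &wordTokenizer{vocab: vocab, ids: ids, special: special}
}

// Encode maps each whitespace-separated word to its id, falling back to the
// id registered for tokUnknown when the word is not in the vocabulary.
func (t *wordTokenizer) Encode(text string) []int {
	words := strings.Fields(text)
	out := make([]int, 0, len(words))
	for _, w := range words {
		id, ok := t.ids[w]
		if !ok {
			id = t.special[tokUnknown]
		}
		out = append(out, id)
	}
	return out
}

// Decode joins the words for the given ids, skipping out-of-range ids.
func (t *wordTokenizer) Decode(ids []int) string {
	words := make([]string, 0, len(ids))
	for _, id := range ids {
		if id >= 0 && id < len(t.vocab) {
			words = append(words, t.vocab[id])
		}
	}
	return strings.Join(words, " ")
}

// SpecialTokenID returns the id this tokenizer uses for a special token.
func (t *wordTokenizer) SpecialTokenID(tok specialToken) (int, error) {
	id, ok := t.special[tok]
	if !ok {
		return 0, fmt.Errorf("special token %d not defined", tok)
	}
	return id, nil
}

func main() {
	var a tokenizer = newWordTokenizer(
		[]string{"<pad>", "<unk>", "hello", "world"},
		map[specialToken]int{tokPad: 0, tokUnknown: 1})
	var b tokenizer = newWordTokenizer(
		[]string{"<unk>", "hello", "world", "<pad>"},
		map[specialToken]int{tokUnknown: 0, tokPad: 3})

	fmt.Println(a.Encode("hello there world")) // prints [2 1 3]
	fmt.Println(a.Decode([]int{2, 3}))         // prints hello world

	padA, _ := a.SpecialTokenID(tokPad)
	padB, _ := b.SpecialTokenID(tokPad)
	fmt.Println(padA, padB) // prints 0 3
}
```

Code that pads batches can ask each tokenizer for its padding id through the shared enum rather than hard-coding an integer that is only valid for one model.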
func New ¶
New creates a new tokenizer from the given HuggingFace repo (see hub.New).
Currently, it only supports "SentencePiece" encoders, and it attempts to download details from the repo files "tokenizer_config.json" and "tokenizer.json".
If it fails to load those files or to create a tokenizer, it returns an error.
Directories
Path | Synopsis |
---|---|
api | Package api defines the Tokenizer API. |
sentencepiece | Package sentencepiece implements a tokenizers.Tokenizer based on the SentencePiece tokenizer. |
private/protos | Package protos has the Protocol Buffer code for the sentencepiece_model.proto file, downloaded from https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto. |