tokenizers

package
v0.2.2
Published: Jun 27, 2025 License: Apache-2.0 Imports: 4 Imported by: 0

Documentation

Overview

Package tokenizers creates tokenizers from HuggingFace models.

Given a HuggingFace repository (see hub.New to create one), tokenizers will use its "tokenizer_config.json" and "tokenizer.json" to instantiate a Tokenizer.

Index

Constants

View Source
const (
	TokBeginningOfSentence = api.TokBeginningOfSentence
	TokEndOfSentence       = api.TokEndOfSentence
	TokUnknown             = api.TokUnknown
	TokPad                 = api.TokPad
	TokMask                = api.TokMask
	TokClassification      = api.TokClassification
	TokSpecialTokensCount  = api.TokSpecialTokensCount
)

Variables

This section is empty.

Functions

func GetConfig

func GetConfig(repo *hub.Repo) (*api.Config, error)

GetConfig returns the parsed "tokenizer_config.json" Config object for the repo.

func RegisterTokenizerClass

func RegisterTokenizerClass(name string, constructor TokenizerConstructor)

RegisterTokenizerClass is used by Tokenizer implementations to register their constructor under a tokenizer class name.

Types

type Config

type Config = api.Config

Config is a struct that holds the contents of HuggingFace's tokenizer_config.json. There is no formal schema for this file, but these are some common fields that may be of use. Specific tokenizer classes are free to implement additional features as they see fit.

The extra field ConfigFile holds the path to the file with the full config.

type SpecialToken

type SpecialToken = api.SpecialToken

SpecialToken is an enum of commonly used special tokens.

type Tokenizer

type Tokenizer = api.Tokenizer

The Tokenizer interface allows one to convert text to "tokens" (integer ids) and back.

It also allows mapping of special tokens: tokens with a common semantic (like padding) that may nonetheless map to different ids (int) in different tokenizers.
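The idea behind the interface can be illustrated with a toy whitespace tokenizer: a vocabulary maps words to ids and back, and a special token (padding here) gets a vocabulary-specific id. Everything below is a hypothetical sketch; the method set of the real api.Tokenizer may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// toyTokenizer is an illustrative, whitespace-based tokenizer.
// It is not the package's implementation, just a sketch of the concept.
type toyTokenizer struct {
	toID   map[string]int
	toWord []string
	padID  int // id of the "<pad>" special token in this vocabulary
}

func newToyTokenizer(words []string) *toyTokenizer {
	t := &toyTokenizer{toID: map[string]int{}}
	add := func(w string) int {
		id := len(t.toWord)
		t.toID[w] = id
		t.toWord = append(t.toWord, w)
		return id
	}
	// The padding token gets whatever id this vocabulary assigns it;
	// another tokenizer could assign a different id to the same concept.
	t.padID = add("<pad>")
	for _, w := range words {
		add(w)
	}
	return t
}

// Encode converts whitespace-separated text to token ids.
func (t *toyTokenizer) Encode(text string) []int {
	var ids []int
	for _, w := range strings.Fields(text) {
		ids = append(ids, t.toID[w])
	}
	return ids
}

// Decode converts token ids back to text.
func (t *toyTokenizer) Decode(ids []int) string {
	words := make([]string, len(ids))
	for i, id := range ids {
		words[i] = t.toWord[id]
	}
	return strings.Join(words, " ")
}

func main() {
	tok := newToyTokenizer([]string{"hello", "world"})
	ids := tok.Encode("hello world")
	fmt.Println(ids, tok.Decode(ids)) // ids round-trip back to the text
}
```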

func New

func New(repo *hub.Repo) (Tokenizer, error)

New creates a new tokenizer from the given HuggingFace repo (see hub.New).

Currently, it only supports "SentencePiece" encoders, and it attempts to download details from the repo files "tokenizer_config.json" and "tokenizer.json".

If it fails to load those files, or create a tokenizer, it returns an error.

type TokenizerConstructor

type TokenizerConstructor func(config *api.Config, repo *hub.Repo) (api.Tokenizer, error)

TokenizerConstructor is used by Tokenizer implementations to provide constructors for different tokenizer classes.
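Together, RegisterTokenizerClass and TokenizerConstructor suggest a registry pattern: an implementation registers a constructor under its HuggingFace "tokenizer_class" name, and New can then dispatch on the class found in "tokenizer_config.json". The sketch below reproduces that pattern with simplified stand-in types (the `config`, `repo` and `tokenizer` types here are hypothetical substitutes for api.Config, hub.Repo and api.Tokenizer).

```go
package main

import "fmt"

// Simplified stand-ins for api.Config, hub.Repo and api.Tokenizer.
type config struct{ TokenizerClass string }
type repo struct{ ID string }
type tokenizer interface{ Encode(text string) []int }

// constructor mimics the TokenizerConstructor signature.
type constructor func(cfg *config, r *repo) (tokenizer, error)

var registry = map[string]constructor{}

// register mimics RegisterTokenizerClass: it stores a constructor
// under a tokenizer class name.
func register(name string, c constructor) { registry[name] = c }

// newTokenizer mimics the dispatch New would perform after reading
// the tokenizer class from tokenizer_config.json.
func newTokenizer(cfg *config, r *repo) (tokenizer, error) {
	c, ok := registry[cfg.TokenizerClass]
	if !ok {
		return nil, fmt.Errorf("no tokenizer registered for class %q", cfg.TokenizerClass)
	}
	return c(cfg, r)
}

// fixedTokenizer is a trivial implementation used only for illustration:
// it "encodes" text as a single id equal to the text's length.
type fixedTokenizer struct{}

func (fixedTokenizer) Encode(text string) []int { return []int{len(text)} }

func main() {
	register("ToyTokenizer", func(cfg *config, r *repo) (tokenizer, error) {
		return fixedTokenizer{}, nil
	})
	tok, err := newTokenizer(&config{TokenizerClass: "ToyTokenizer"}, &repo{ID: "example/repo"})
	if err != nil {
		panic(err)
	}
	fmt.Println(tok.Encode("abc"))
}
```

The registry keeps the core package decoupled from concrete tokenizer classes: each implementation registers itself (typically from an init function), and unknown classes surface as an error rather than a compile-time dependency.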

Directories

Path Synopsis
api: Package api defines the Tokenizer API.
sentencepiece: Package sentencepiece implements a tokenizers.Tokenizer based on the SentencePiece tokenizer.
private/protos: Package protos has the Protocol Buffer code for the sentencepiece_model.proto file, downloaded from https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto.
