Documentation

Overview
Package sentencepiece implements a tokenizers.Tokenizer based on the SentencePiece tokenizer.
Index

Constants

This section is empty.

Variables

This section is empty.

Functions

Types
type Tokenizer
type Tokenizer struct {
*esentencepiece.Processor
Info *esentencepiece.ModelInfo
}
Tokenizer implements the tokenizers.Tokenizer interface, based on Google's SentencePiece tokenizer.
func (*Tokenizer) Decode
Decode returns the text from a sequence of ids. It implements sampler.Vocabulary.
func (*Tokenizer) Encode
Encode returns the text encoded into a sequence of ids. It implements sampler.Vocabulary.
func (*Tokenizer) SpecialTokenID
func (p *Tokenizer) SpecialTokenID(token api.SpecialToken) (int, error)
SpecialTokenID returns the token ID for the given symbol, or an error if it is not known.
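To illustrate the Encode/Decode round-trip contract that Tokenizer satisfies, here is a minimal sketch using a toy whitespace vocabulary in place of a real SentencePiece model. The toyTokenizer type, its word-level splitting, and the fixed vocabulary are all hypothetical stand-ins; the real Tokenizer wraps an esentencepiece.Processor and a trained model file instead.

```go
package main

import (
	"fmt"
	"strings"
)

// toyTokenizer is a hypothetical stand-in for Tokenizer: it maps whole
// words to ids instead of SentencePiece subword pieces, but exposes the
// same Encode/Decode shape described above.
type toyTokenizer struct {
	idToTok []string
	tokToID map[string]int
}

func newToyTokenizer(vocab []string) *toyTokenizer {
	t := &toyTokenizer{idToTok: vocab, tokToID: map[string]int{}}
	for id, tok := range vocab {
		t.tokToID[tok] = id
	}
	return t
}

// Encode returns the text encoded into a sequence of ids.
func (t *toyTokenizer) Encode(text string) []int {
	var ids []int
	for _, tok := range strings.Fields(text) {
		ids = append(ids, t.tokToID[tok])
	}
	return ids
}

// Decode returns the text from a sequence of ids.
func (t *toyTokenizer) Decode(ids []int) string {
	toks := make([]string, len(ids))
	for i, id := range ids {
		toks[i] = t.idToTok[id]
	}
	return strings.Join(toks, " ")
}

func main() {
	tok := newToyTokenizer([]string{"hello", "world"})
	ids := tok.Encode("hello world")
	fmt.Println(ids)             // ids in vocabulary order
	fmt.Println(tok.Decode(ids)) // round-trips back to the input
}
```

A sampler that only needs Encode and Decode (as in sampler.Vocabulary) can be written against this shape without caring whether ids come from a toy table or a SentencePiece model.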
Directories

| Path | Synopsis |
|---|---|
| private |  |
| protos | Package protos contains the Protocol Buffer code for the sentencepiece_model.proto file, downloaded from https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto. |