sentencepiece

package
v0.1.0
Published: Nov 9, 2024 License: Apache-2.0 Imports: 4 Imported by: 1

Documentation

Overview

Package sentencepiece implements a tokenizers.Tokenizer based on the SentencePiece tokenizer.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func New

func New(config *api.Config, repo *hub.Repo) (api.Tokenizer, error)

New creates a tokenizer based on the tokenizer_config.json and tokenizer.json (vocabulary) files.
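A hedged usage sketch follows. The import paths, the hub.New constructor, the repository name, and the loadTokenizerConfig helper are illustrative assumptions rather than documented API; only the sentencepiece.New call itself matches the signature above.

// Assumed imports (paths are a guess at this module's layout):
//   "log"
//   "github.com/gomlx/go-huggingface/hub"
//   "github.com/gomlx/go-huggingface/tokenizers/api"
//   "github.com/gomlx/go-huggingface/tokenizers/sentencepiece"

// Sketch only: hub.New, the repository name, and loadTokenizerConfig are
// assumptions; adapt them to however the surrounding module exposes a
// *hub.Repo and a parsed *api.Config.
repo := hub.New("google/gemma-2-2b-it")     // hypothetical model repository
config, err := loadTokenizerConfig(repo)    // hypothetical helper parsing tokenizer_config.json
if err != nil {
	log.Fatal(err)
}
tok, err := sentencepiece.New(config, repo) // the constructor documented here
if err != nil {
	log.Fatal(err)
}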

Types

type Tokenizer

type Tokenizer struct {
	*esentencepiece.Processor
	Info *esentencepiece.ModelInfo
}

Tokenizer implements the tokenizers.Tokenizer interface based on Google's SentencePiece tokenizer.
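Since New returns the api.Tokenizer interface, reaching the embedded Processor or the Info field takes a type assertion; a brief sketch, continuing from the tok value created in the New example above:

if sp, ok := tok.(*sentencepiece.Tokenizer); ok {
	// Methods of the embedded *esentencepiece.Processor are promoted onto
	// Tokenizer; the parsed model information is available as sp.Info.
	_ = sp.Info
}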

func (*Tokenizer) Decode

func (p *Tokenizer) Decode(ids []int) string

Decode returns the text from a sequence of ids. It implements sampler.Vocabulary.

func (*Tokenizer) Encode

func (p *Tokenizer) Encode(text string) []int

Encode returns the text encoded into a sequence of ids. It implements sampler.Vocabulary.
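Encode and Decode together give a round trip through the vocabulary. A short sketch, assuming sp is the *Tokenizer obtained in the examples above:

ids := sp.Encode("Hello, world!") // text -> token ids
fmt.Println(ids)                  // the ids themselves are model dependent
text := sp.Decode(ids)            // token ids -> text
fmt.Println(text)                 // usually reproduces the input, modulo normalization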

func (*Tokenizer) SpecialTokenID

func (p *Tokenizer) SpecialTokenID(token api.SpecialToken) (int, error)

SpecialTokenID returns the id of the given special token, or an error if it is not known.
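A sketch of a special-token lookup; the specific api.SpecialToken constant used below is an assumption about the api package, not something documented here:

bosID, err := sp.SpecialTokenID(api.TokBeginningOfSentence) // constant name is an assumption
if err != nil {
	log.Fatalf("special token not known: %v", err)
}
fmt.Println("BOS token id:", bosID)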
