pretokenizer

package
v0.1.3
Warning

This package is not in the latest version of its module.

Published: Oct 16, 2020 License: Apache-2.0 Imports: 4 Imported by: 27

Documentation

Constants

This section is empty.

Variables

var BytesChar map[uint8]string = GenerateBytesChar()
var CharBytes map[string]uint8 = func() map[string]uint8 {
	var bc = GenerateBytesChar()
	var cb map[string]uint8 = make(map[string]uint8)
	for b, c := range bc {
		cb[c] = b
	}
	return cb
}()

Functions

func GenerateBytesChar

func GenerateBytesChar() map[uint8]string

GenerateBytesChar builds the BytesChar table, which maps each byte value (0-255) to one of the first 0-255 `char` in Unicode.

Ref. https://en.wikipedia.org/wiki/List_of_Unicode_characters
Ref. https://rosettacode.org/wiki/UTF-8_encode_and_decode
See example: https://play.golang.org/p/_1W0ni2uZWm
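The description above does not spell out the exact mapping. As a rough illustration, the sketch below builds a byte-to-character table using the scheme popularized by GPT-2's byte-level BPE (printable bytes map to themselves, all other bytes map to code points starting at U+0100). That scheme is an assumption here, not something this page confirms, and `byteToChar` and `invert` are hypothetical names; `invert` mirrors how CharBytes is derived from BytesChar.

```go
package main

import "fmt"

// byteToChar is a hypothetical stand-in for GenerateBytesChar. It follows the
// GPT-2 scheme, which is an assumption about this package's behavior.
func byteToChar() map[uint8]string {
	table := make(map[uint8]string, 256)
	next := 0 // number of bytes that have needed a substitute so far
	for i := 0; i < 256; i++ {
		printable := (i >= '!' && i <= '~') ||
			(i >= 0xA1 && i <= 0xAC) ||
			(i >= 0xAE && i <= 0xFF)
		if printable {
			// Printable bytes keep their own code point.
			table[uint8(i)] = string(rune(i))
		} else {
			// Everything else is shifted to U+0100 and above.
			table[uint8(i)] = string(rune(256 + next))
			next++
		}
	}
	return table
}

// invert builds the reverse lookup, the same way CharBytes is built from
// BytesChar.
func invert(bc map[uint8]string) map[string]uint8 {
	cb := make(map[string]uint8, len(bc))
	for b, c := range bc {
		cb[c] = b
	}
	return cb
}

func main() {
	bc := byteToChar()
	cb := invert(bc)
	fmt.Println(len(bc), len(cb))           // 256 256
	fmt.Printf("%q %q\n", bc[' '], bc['A']) // "Ġ" "A"
	fmt.Println(cb["Ġ"])                    // 32
}
```

Under this scheme every byte gets a visible, distinct character, so the mapping can be inverted without loss.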

func ProcessOffsets

func ProcessOffsets(encoding *tokenizer.Encoding, addPrefixSpace bool) *tokenizer.Encoding

Types

type BertPreTokenizer

type BertPreTokenizer struct{}

func NewBertPreTokenizer

func NewBertPreTokenizer() *BertPreTokenizer

func (*BertPreTokenizer) PreTokenize

PreTokenize implements the PreTokenizer interface for BertPreTokenizer.

type ByteLevel

type ByteLevel struct {
	// Whether to add a leading space to the first word.
	// This allows the leading word to be treated just like any other word.
	AddPrefixSpace bool

	// Whether the post-processing step should trim offsets
	// to avoid including whitespace.
	TrimOffsets bool
}

ByteLevel provides all the necessary steps to handle BPE tokenization at the byte level. It takes care of all the processing required to transform a UTF-8 string as needed before and after the BPE model does its job.

func NewByteLevel

func NewByteLevel() *ByteLevel

NewByteLevel returns a default ByteLevel with both AddPrefixSpace and TrimOffsets set to true.

func (*ByteLevel) AddedToken

func (bl *ByteLevel) AddedToken(isPair bool) int

func (*ByteLevel) Alphabet

func (bl *ByteLevel) Alphabet() map[string]struct{}

Alphabet returns the set of the first 256 Unicode `char`.

func (*ByteLevel) Decode

func (bl *ByteLevel) Decode(tokens []string) string

Decode converts any byte-level characters to their Unicode counterpart before merging everything back into a single string.
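A minimal sketch of this decoding step, using only the standard library. It assumes the GPT-2 byte-to-character scheme (an assumption, since the table itself is not shown on this page); `charToByte` and `decode` are hypothetical names, not the package's functions.

```go
package main

import (
	"fmt"
	"strings"
)

// charToByte builds a hypothetical character -> byte table using the GPT-2
// scheme. CharBytes in this package may differ.
func charToByte() map[rune]byte {
	table := make(map[rune]byte, 256)
	next := rune(256) // first substitute code point, U+0100
	for i := 0; i < 256; i++ {
		if (i >= '!' && i <= '~') || (i >= 0xA1 && i <= 0xAC) || (i >= 0xAE && i <= 0xFF) {
			table[rune(i)] = byte(i)
		} else {
			table[next] = byte(i)
			next++
		}
	}
	return table
}

// decode joins the tokens, maps every character back to its byte, and reads
// the resulting bytes as UTF-8. Characters outside the table are skipped.
func decode(tokens []string) string {
	table := charToByte()
	var out []byte
	for _, r := range strings.Join(tokens, "") {
		if b, ok := table[r]; ok {
			out = append(out, b)
		}
	}
	return string(out)
}

func main() {
	fmt.Printf("%q\n", decode([]string{"Hello", "Ġw", "orld"})) // "Hello world"
	fmt.Printf("%q\n", decode([]string{"cafÃ©"}))               // "café"
}
```

The second call shows why decoding works on bytes rather than characters: `Ã` and `©` stand for the two bytes 0xC3 0xA9, which together form the UTF-8 encoding of `é`.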

func (*ByteLevel) PreTokenize

func (bl *ByteLevel) PreTokenize(pretokenized *tokenizer.PreTokenizedString) (*tokenizer.PreTokenizedString, error)

PreTokenize: as a `PreTokenizer`, `ByteLevel` is in charge of transforming all the Unicode characters into their byte-level counterparts. It also splits the input according to the configured regex.

func (*ByteLevel) Process

func (bl *ByteLevel) Process(encoding, pairEncoding *tokenizer.Encoding, addSpecialTokens bool) *tokenizer.Encoding

func (*ByteLevel) SetAddPrefixSpace

func (bl *ByteLevel) SetAddPrefixSpace(v bool)

SetAddPrefixSpace sets the `AddPrefixSpace` property.

func (*ByteLevel) SetTrimOffsets

func (bl *ByteLevel) SetTrimOffsets(v bool)

SetTrimOffsets sets the `TrimOffsets` property.
