pretokenizer

package

v0.1.3 Published: Oct 16, 2020 License: Apache-2.0 Imports: 4 Imported by: 27

Repository

github.com/sugarme/tokenizer

Documentation ¶

Index ¶

Variables
func GenerateBytesChar() map[uint8]string
func ProcessOffsets(encoding *tokenizer.Encoding, addPrefixSpace bool) *tokenizer.Encoding
type BertPreTokenizer
- func NewBertPreTokenizer() *BertPreTokenizer
- func (bt *BertPreTokenizer) PreTokenize(pretokenized *tokenizer.PreTokenizedString) (*tokenizer.PreTokenizedString, error)
type ByteLevel
- func NewByteLevel() *ByteLevel

Constants ¶

This section is empty.

Variables ¶

var BytesChar map[uint8]string = GenerateBytesChar()

var CharBytes map[string]uint8 = func() map[string]uint8 {
	var bc = GenerateBytesChar()
	var cb map[string]uint8 = make(map[string]uint8)
	for b, c := range bc {
		cb[c] = b
	}
	return cb
}()

Functions ¶

func GenerateBytesChar ¶

func GenerateBytesChar() map[uint8]string

BytesChar maps each byte value (0-255) to one of the first 256 `char`s in unicode.

Ref. https://en.wikipedia.org/wiki/List_of_Unicode_characters
Ref. https://rosettacode.org/wiki/UTF-8_encode_and_decode
See example: https://play.golang.org/p/_1W0ni2uZWm
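
The function body is not shown on this page, so the following is only a sketch of how such a byte-to-char table and its inverse could be built. It assumes the GPT-2 style convention often used for byte-level BPE, in which printable bytes map to themselves and the remaining bytes are shifted to code points from 256 upward; the package's actual table may differ. `buildBytesChar` and `invert` are hypothetical names, not part of this package.

```go
package main

import "fmt"

// buildBytesChar is a hypothetical stand-in for GenerateBytesChar.
// Assumption: printable bytes keep their own code point, and every other
// byte is assigned the next free code point starting at 256.
func buildBytesChar() map[uint8]string {
	printable := make(map[int]bool)
	for _, r := range [][2]int{{33, 126}, {161, 172}, {174, 255}} {
		for b := r[0]; b <= r[1]; b++ {
			printable[b] = true
		}
	}

	table := make(map[uint8]string, 256)
	next := 256 // first code point handed out to non-printable bytes
	for b := 0; b <= 255; b++ {
		if printable[b] {
			table[uint8(b)] = string(rune(b))
			continue
		}
		table[uint8(b)] = string(rune(next))
		next++
	}
	return table
}

// invert builds the reverse lookup, in the same way CharBytes is derived
// from BytesChar in the Variables section.
func invert(bc map[uint8]string) map[string]uint8 {
	cb := make(map[string]uint8, len(bc))
	for b, c := range bc {
		cb[c] = b
	}
	return cb
}

func main() {
	bc := buildBytesChar()
	cb := invert(bc)
	fmt.Println(len(bc), len(cb))           // 256 256
	fmt.Printf("%q %q\n", bc['A'], bc[' ']) // "A" "Ġ"
}
```

Because every byte gets a distinct char, the table can be inverted without losing entries, which is what makes decoding possible.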

func ProcessOffsets ¶

func ProcessOffsets(encoding *tokenizer.Encoding, addPrefixSpace bool) *tokenizer.Encoding

Types ¶

type BertPreTokenizer ¶

type BertPreTokenizer struct{}

func NewBertPreTokenizer ¶

func NewBertPreTokenizer() *BertPreTokenizer

func (*BertPreTokenizer) PreTokenize ¶

func (bt *BertPreTokenizer) PreTokenize(pretokenized *tokenizer.PreTokenizedString) (*tokenizer.PreTokenizedString, error)

PreTokenize implements the PreTokenizer interface for BertPreTokenizer.

type ByteLevel ¶

type ByteLevel struct {
	// Whether to add a leading space to the first word.
	// This allows the leading word to be treated just like any other word.
	AddPrefixSpace bool

	// Whether the post processing step should trim offsets
	// to avoid including whitespaces.
	TrimOffsets bool
}

ByteLevel provides all the necessary steps to handle BPE tokenization at the byte level. It takes care of all the processing steps required to transform a UTF-8 string before and after the BPE model does its job.
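
The `TrimOffsets` field describes trimming offsets so that they do not include whitespace. A minimal sketch of that idea follows. `trimOffsets` is a hypothetical helper and not the package's `ProcessOffsets`; it also assumes offsets are half-open byte ranges into the original text, which this page does not confirm.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// trimOffsets is a hypothetical helper. Given byte offsets [start, end)
// into text, it shrinks the span so that leading and trailing whitespace
// is excluded.
func trimOffsets(text string, start, end int) (int, int) {
	// Advance start past leading whitespace runes.
	for start < end {
		r, size := utf8.DecodeRuneInString(text[start:end])
		if !unicode.IsSpace(r) {
			break
		}
		start += size
	}
	// Pull end back over trailing whitespace runes.
	for end > start {
		r, size := utf8.DecodeLastRuneInString(text[start:end])
		if !unicode.IsSpace(r) {
			break
		}
		end -= size
	}
	return start, end
}

func main() {
	text := "Hello world"
	// A byte-level token such as " world" spans bytes 5..11, space included.
	s, e := trimOffsets(text, 5, 11)
	fmt.Println(s, e, text[s:e]) // 6 11 world
}
```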

func NewByteLevel ¶

func NewByteLevel() *ByteLevel

NewByteLevel returns a default ByteLevel with both AddPrefixSpace and TrimOffsets set to true.

func (*ByteLevel) AddedToken ¶

func (bl *ByteLevel) AddedToken(isPair bool) int

func (*ByteLevel) Alphabet ¶

func (bl *ByteLevel) Alphabet() map[string]struct{}

Alphabet returns the set of the first 256 unicode `char`s.

func (*ByteLevel) Decode ¶

func (bl *ByteLevel) Decode(tokens []string) string

Decode converts any byte-level characters to their unicode counterparts before merging everything back into a single string.
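
As an illustration of the decoding step described above, here is a minimal sketch. `charBytes` and `decode` are hypothetical names, the table assumes the GPT-2 style convention (printable bytes map to themselves, others to code points from 256 upward), and passing unknown characters through unchanged is a choice made for this example, not documented behaviour of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// charBytes is a hypothetical stand-in for the CharBytes table, built
// under the assumed GPT-2 style convention.
func charBytes() map[rune]byte {
	table := make(map[rune]byte, 256)
	next := rune(256)
	for b := 0; b < 256; b++ {
		printable := (b >= 33 && b <= 126) || (b >= 161 && b <= 172) || (b >= 174 && b <= 255)
		if printable {
			table[rune(b)] = byte(b)
		} else {
			table[next] = byte(b)
			next++
		}
	}
	return table
}

// decode joins the tokens, maps each byte-level char back to its byte,
// and reinterprets the resulting bytes as a UTF-8 string.
func decode(tokens []string) string {
	table := charBytes()
	joined := strings.Join(tokens, "")
	buf := make([]byte, 0, len(joined))
	for _, r := range joined {
		if b, ok := table[r]; ok {
			buf = append(buf, b)
		} else {
			// Assumption: chars outside the table are kept as-is.
			buf = append(buf, string(r)...)
		}
	}
	return string(buf)
}

func main() {
	fmt.Printf("%q\n", decode([]string{"ĠHello", "Ġworld", "!"})) // " Hello world!"
	fmt.Printf("%q\n", decode([]string{"caf", "Ã©"}))             // "café"
}
```

The second call shows why the bytes must be merged before conversion: "é" is two bytes in UTF-8, so its byte-level spelling is two chars that only become one character again once both bytes are back together.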

func (*ByteLevel) PreTokenize ¶

func (bl *ByteLevel) PreTokenize(pretokenized *tokenizer.PreTokenizedString) (*tokenizer.PreTokenizedString, error)

As a `PreTokenizer`, `ByteLevel` is in charge of transforming all the unicode characters into their byte-level counterparts. It also splits the input according to the configured regex.
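
The two steps described above (regex split, then byte-level mapping) can be sketched as follows. Everything here is an assumption made for illustration: `byteLevelSplit` and `bytesChar` are hypothetical names, the table follows the GPT-2 style convention, and the pattern is a simplified GPT-2 style split without the lookahead that Go's `regexp` package does not support. The regex actually configured in this package is not shown on this page.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// splitRE is a simplified, assumed split pattern: contractions, optional
// space plus letters, digits, or other symbols, then runs of whitespace.
var splitRE = regexp.MustCompile(`'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+`)

// bytesChar is a hypothetical byte-to-char table (assumed GPT-2 style).
func bytesChar() [256]string {
	var table [256]string
	next := 256
	for b := 0; b < 256; b++ {
		printable := (b >= 33 && b <= 126) || (b >= 161 && b <= 172) || (b >= 174 && b <= 255)
		if printable {
			table[b] = string(rune(b))
		} else {
			table[b] = string(rune(next))
			next++
		}
	}
	return table
}

// byteLevelSplit optionally prepends a space, splits the text with
// splitRE, and rewrites every UTF-8 byte of each piece as its char.
func byteLevelSplit(text string, addPrefixSpace bool) []string {
	if addPrefixSpace && !strings.HasPrefix(text, " ") {
		text = " " + text
	}
	table := bytesChar()
	var out []string
	for _, piece := range splitRE.FindAllString(text, -1) {
		var sb strings.Builder
		for i := 0; i < len(piece); i++ { // iterate bytes, not runes
			sb.WriteString(table[piece[i]])
		}
		out = append(out, sb.String())
	}
	return out
}

func main() {
	fmt.Println(byteLevelSplit("Hello café!", true)) // [ĠHello ĠcafÃ© !]
}
```

With the prefix space added, the first word carries the same leading "Ġ" marker as every later word, which is the effect the `AddPrefixSpace` comment describes.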

func (*ByteLevel) Process ¶

func (bl *ByteLevel) Process(encoding, pairEncoding *tokenizer.Encoding, addSpecialTokens bool) *tokenizer.Encoding

func (*ByteLevel) SetAddPrefixSpace ¶

func (bl *ByteLevel) SetAddPrefixSpace(v bool)

SetAddPrefixSpace sets the `AddPrefixSpace` property.

func (*ByteLevel) SetTrimOffsets ¶

func (bl *ByteLevel) SetTrimOffsets(v bool)

SetTrimOffsets sets the `TrimOffsets` property.
