Documentation ¶
Index ¶
- Variables
- func GenerateBytesChar() map[uint8]string
- func ProcessOffsets(encoding *tokenizer.Encoding, addPrefixSpace bool) *tokenizer.Encoding
- type BertPreTokenizer
- type ByteLevel
- func (bl *ByteLevel) AddedToken(isPair bool) int
- func (bl *ByteLevel) Alphabet() map[string]struct{}
- func (bl *ByteLevel) Decode(tokens []string) string
- func (bl *ByteLevel) PreTokenize(pretokenized *tokenizer.PreTokenizedString) (*tokenizer.PreTokenizedString, error)
- func (bl *ByteLevel) Process(encoding, pairEncoding *tokenizer.Encoding, addSpecialTokens bool) *tokenizer.Encoding
- func (bl *ByteLevel) SetAddPrefixSpace(v bool)
- func (bl *ByteLevel) SetTrimOffsets(v bool)
Constants ¶
This section is empty.
Variables ¶
var BytesChar map[uint8]string = GenerateBytesChar()
Functions ¶
func GenerateBytesChar ¶
func GenerateBytesChar() map[uint8]string
GenerateBytesChar builds the BytesChar table, which maps each byte value (0-255) to a Unicode character, stored as a string.
Ref. https://en.wikipedia.org/wiki/List_of_Unicode_characters
Ref. https://rosettacode.org/wiki/UTF-8_encode_and_decode
See example: https://play.golang.org/p/_1W0ni2uZWm
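A byte-to-character table of this kind can be sketched as follows. This is a minimal illustration that assumes the GPT-2 style layout commonly used for byte-level BPE, where printable bytes keep their own code point and the remaining bytes are assigned code points starting at 256. Whether GenerateBytesChar builds exactly this table is an assumption, and `bytesToChars` is a hypothetical name, not part of this package.

```go
package main

import "fmt"

// bytesToChars builds a 256-entry table that gives every byte value a
// printable, single-rune string. Hypothetical helper using the GPT-2 style
// layout (an assumption, not this package's confirmed implementation).
func bytesToChars() map[uint8]string {
	table := make(map[uint8]string, 256)

	// Bytes in these ranges are already printable and map to themselves.
	printable := func(b int) bool {
		return (b >= 33 && b <= 126) || (b >= 161 && b <= 172) || (b >= 174 && b <= 255)
	}

	next := 256 // first code point handed out to a non-printable byte
	for b := 0; b <= 255; b++ {
		if printable(b) {
			table[uint8(b)] = string(rune(b))
			continue
		}
		table[uint8(b)] = string(rune(next))
		next++
	}
	return table
}

func main() {
	table := bytesToChars()
	fmt.Println(len(table)) // 256
	// 'A' maps to itself; the space byte maps to a visible marker rune.
	fmt.Printf("%q %q\n", table['A'], table[' '])
}
```

Under this layout every byte has a distinct, visible character, so a BPE vocabulary never has to store raw whitespace or control bytes.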
Types ¶
type BertPreTokenizer ¶
type BertPreTokenizer struct{}
func NewBertPreTokenizer ¶
func NewBertPreTokenizer() *BertPreTokenizer
func (*BertPreTokenizer) PreTokenize ¶
func (bt *BertPreTokenizer) PreTokenize(pretokenized *tokenizer.PreTokenizedString) (*tokenizer.PreTokenizedString, error)
PreTokenize implements the PreTokenizer interface for BertPreTokenizer.
type ByteLevel ¶
type ByteLevel struct {
// Whether to add a leading space to the first word.
// This lets the leading word be treated like any other word.
AddPrefixSpace bool
// Whether the post-processing step should trim offsets
// to avoid including whitespace.
TrimOffsets bool
}
ByteLevel provides all the necessary steps to handle BPE tokenization at the byte level. It takes care of the processing required to transform a UTF-8 string before and after the BPE model does its job.
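The effect of trimming offsets can be illustrated with a small sketch. It assumes that a leading or trailing space inside a token is represented by the marker rune `Ġ` (the GPT-2 style convention) and that each marker covers one byte of the original text. The helper `trimOffsets` is hypothetical, and the package's own offset logic may differ.

```go
package main

import "fmt"

// spaceMarker is the byte-level stand-in for a space (assumed GPT-2 layout).
const spaceMarker = 'Ġ'

// trimOffsets narrows the [start, end) span of a token so that it no longer
// covers leading or trailing space markers. Hypothetical helper.
func trimOffsets(token string, start, end int) (int, int) {
	runes := []rune(token)

	leading := 0
	for leading < len(runes) && runes[leading] == spaceMarker {
		leading++
	}
	trailing := 0
	for trailing < len(runes)-leading && runes[len(runes)-1-trailing] == spaceMarker {
		trailing++
	}

	// Clamp so the span never inverts, even for an all-space token.
	start += leading
	if start > end {
		start = end
	}
	end -= trailing
	if end < start {
		end = start
	}
	return start, end
}

func main() {
	// "Ġworld" covers bytes 5..11 of "Hello world"; trimming drops the space.
	fmt.Println(trimOffsets("Ġworld", 5, 11)) // 6 11
}
```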
func NewByteLevel ¶
func NewByteLevel() *ByteLevel
NewByteLevel returns a default ByteLevel with both AddPrefixSpace and TrimOffsets set to true.
func (*ByteLevel) AddedToken ¶
func (bl *ByteLevel) AddedToken(isPair bool) int
func (*ByteLevel) Decode ¶
func (bl *ByteLevel) Decode(tokens []string) string
Decode converts any byte-level characters to their Unicode counterparts before merging everything back into a single string.
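The decoding direction can be sketched as the inverse of the byte-to-character table: join the tokens, map each character back to its byte, and read the bytes as UTF-8. The sketch assumes the GPT-2 style table layout. `charsToBytes` and `decodeTokens` are hypothetical names and do not show how Decode is implemented in this package.

```go
package main

import (
	"fmt"
	"strings"
)

// charsToBytes builds the inverse table: byte-level rune -> original byte.
// Assumes the GPT-2 style layout (printable bytes map to themselves, the
// rest are numbered from code point 256 upward).
func charsToBytes() map[rune]byte {
	inv := make(map[rune]byte, 256)
	next := rune(256)
	for b := 0; b <= 255; b++ {
		if (b >= 33 && b <= 126) || (b >= 161 && b <= 172) || (b >= 174 && b <= 255) {
			inv[rune(b)] = byte(b)
			continue
		}
		inv[next] = byte(b)
		next++
	}
	return inv
}

// decodeTokens merges the tokens and converts byte-level characters back
// into raw bytes. Runes that are not in the table are skipped in this sketch.
func decodeTokens(tokens []string) string {
	inv := charsToBytes()
	var raw []byte
	for _, r := range strings.Join(tokens, "") {
		if b, ok := inv[r]; ok {
			raw = append(raw, b)
		}
	}
	return string(raw)
}

func main() {
	fmt.Printf("%q\n", decodeTokens([]string{"Hello", "Ġworld"})) // "Hello world"
}
```

Because the table works on bytes, a multi-byte character such as "é" appears as two byte-level characters in the tokens and is reassembled only once the bytes are read back as UTF-8.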
func (*ByteLevel) PreTokenize ¶
func (bl *ByteLevel) PreTokenize(pretokenized *tokenizer.PreTokenizedString) (*tokenizer.PreTokenizedString, error)
PreTokenize implements the `PreTokenizer` interface. As a `PreTokenizer`, `ByteLevel` is in charge of transforming all Unicode characters into their byte-level counterparts. It also splits the input according to the configured regex.
func (*ByteLevel) SetAddPrefixSpace ¶
func (bl *ByteLevel) SetAddPrefixSpace(v bool)
SetAddPrefixSpace sets the `AddPrefixSpace` property.
func (*ByteLevel) SetTrimOffsets ¶
func (bl *ByteLevel) SetTrimOffsets(v bool)
SetTrimOffsets sets the `TrimOffsets` property.