pretrained

package
v0.1.12 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 19, 2020 License: Apache-2.0 Imports: 9 Imported by: 46

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func BertBaseUncased

func BertBaseUncased() *tokenizer.Tokenizer

BertBaseUncase loads pretrained BERT tokenizer.

Special tokens: - unknown token: "[UNK]" - sep token: "[SEP]" - cls token: "[CLS]" - mask token: "[MASK]" Its normalizer configued with: clean text, lower-case, handle Chinese characters and strip accents.

Source: "https://cdn.huggingface.co/bert-base-uncased-vocab.txt"

func BertLargeCasedWholeWordMaskingSquad added in v0.1.11

func BertLargeCasedWholeWordMaskingSquad() *tokenizer.Tokenizer

BertLargeCasedWholeWordMaskingSquad loads pretrained BERT large case whole-word masking tokenizer finetuned on SQuAD dataset.

Source: https://cdn.huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt

func GPT2 added in v0.1.11

func GPT2(addPrefixSpace bool, trimOffsets bool) *tokenizer.Tokenizer

GPT2 loads GPT2 (small) tokenizer from vocab and merges files.

Params:

  • addPrefixSpace: set whether to add a leading space to the first word. It allows to treat the leading word just as any other words.
  • trimOffsets: set Whether the post processing step should trim offsets to avoid including whitespaces.

Special tokens: - cls-token: "<s>" - sep token: "</s>" - pad token: "<pad>" - space token: "Ġ"

Source: "https://cdn.huggingface.co/gpt2-merges.txt" "https://cdn.huggingface.co/gpt2-vocab.json"

func RobertaBase added in v0.1.11

func RobertaBase(addPrefixSpace, trimOffsets bool) *tokenizer.Tokenizer

RobertaBase loads pretrained RoBERTa tokenizer.

Params:

  • addPrefixSpace: set whether to add a leading space to the first word. It allows to treat the leading word just as any other words.
  • trimOffsets: set Whether the post processing step should trim offsets to avoid including whitespaces.

Special tokens: - cls-token: "<s>" - sep token: "</s>" - pad token: "<pad>" - space token: "Ġ"

Source: - vocab: "https://cdn.huggingface.co/roberta-base-vocab.json", - merges: "https://cdn.huggingface.co/roberta-base-merges.txt",

func RobertaBaseSquad2 added in v0.1.11

func RobertaBaseSquad2(addPrefixSpace, trimOffsets bool) *tokenizer.Tokenizer

RobertaBaseSquad2 loads pretrained RoBERTa fine-tuned SQuAD Question Answering tokenizer.

Params:

  • addPrefixSpace: set whether to add a leading space to the first word. It allows to treat the leading word just as any other words.
  • trimOffsets: set Whether the post processing step should trim offsets to avoid including whitespaces.

Special tokens: - cls-token: "<s>" - sep token: "</s>" - pad token: "<pad>" - space token: "Ġ"

Source: - vocab: "https://cdn.huggingface.co/deepset/roberta-base-squad2/vocab.json", - merges: "https://cdn.huggingface.co/deepset/roberta-base-squad2/merges.txt",

Types

This section is empty.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL