squad

package
v0.1.3 Latest
Published: Nov 9, 2020 License: Apache-2.0 Imports: 11 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

var MultiSepTokensTokenizers []string = []string{
	"roberta",
	"camembert",
	"bart",
}

Functions

This section is empty.

Types

type Answer

type Answer struct {
	Text        string `json:"text"`
	AnswerStart int    `json:"answer_start"`
}

type Example

type Example struct {
	QAsId         string   // example unique identification
	QuestionText  string   // question string
	ContextText   string   // context string
	AnswerText    string   // answer string
	Title         string   // Title of the example
	Answers       []Answer // Default = nil. Holds answers as well as their start positions
	IsImposible   bool     // Default = false. Set to true if the example has no possible answer.
	StartPosition int      // index of start rune ("character") of the answer
}

Example is a single training/test example for the Squad dataset, as loaded from disk.

func LoadV2

func LoadV2(datasetNameOpt ...string) []Example

LoadV2 loads SQuAD v2.0 data from file.

Param - datasetNameOpt: specify either the "train" or "dev" dataset. Default="train"

func NewExample

func NewExample(qasId string, question, context, answer string, startPositionChar int, title string, isImpossibleOpt ...bool) *Example

NewExample creates an Example.

Params:

  • qasId: unique Id of QAs in SQuAD dataset
  • question: question text
  • context: paragraph text
  • answer: answer text
  • startPositionChar: first rune position of the answer string
  • title: title of the document (article) in SQuAD dataset
  • isImpossibleOpt: optional param setting whether the example has no possible answer. Default=false

type Feature

type Feature struct {
	QAsId             string // example unique identification
	InputIds          []int  // Indices of input sequence tokens in the vocabulary.
	AttentionMask     []int  // Mask to avoid performing attention on padding token indices.
	TokenTypeIds      []int  // Segment token indices to indicate first and second portions of the input.
	ClsIndex          int    // Index of the CLS token.
	PMask             []int  // Mask identifying tokens that can be answers versus tokens that cannot (1 not in the answer, 0 in the answer).
	ExampleIndex      int    // Index of the example.
	UniqueId          int    // The unique feature identifier.
	TokenIsMaxContext []bool // A bool slice identifying which tokens have their maximum context in this feature.
	// NOTE. If a token does not have its maximum context in this feature, another feature has more
	// information related to that token and should be prioritized over this feature for that token.
	Tokens        []string // Slice of tokens corresponding to the input Ids.
	StartPosition int      // Index of the first answer token. Value=0 if there's no answer
	EndPosition   int      // Index of the last answer token. Value=0 if there's no answer
	IsImposible   bool     // Whether the feature has no possible answer. True means the feature has no possible answer.
}

Feature is a SINGLE SQuAD example feature to be fed to the model.

This feature is model-specific and can be crafted with the function `ConvertExampleToFeatures`.

func ConvertExampleToFeatures

func ConvertExampleToFeatures(tk *tokenizer.Tokenizer, sepToken string, clsIndex int, example Example, answerStart, answerEnd int) []Feature

ConvertExampleToFeatures converts a single example to features.

Params:

  • tk: tokenizer to use
  • sepToken: separator token
  • clsIndex: index of the cls token
  • example: the Example to convert
  • answerStart: start offset of the answer on the context sequence
  • answerEnd: end offset of the answer on the context sequence

func ConvertExamplesToFeatures

func ConvertExamplesToFeatures(examples []Example, tk *tokenizer.Tokenizer, tkName string, maxSeqLen, docStride, maxQueryLen int, sepToken, padToken string, clsIndex int, isTraining bool, returnTensorDataset bool) ([]Feature, *ts.Tensor)

ConvertExamplesToFeatures converts a list of examples into a list of features that can be directly fed into a model. It is model-dependent and takes advantage of many of the tokenizer's features to create the model's inputs.

Params:

  • examples: Slice of Examples to convert
  • tk: corresponding Tokenizer to be used with the model
  • tkName: name of tokenizer (whether it uses multiple sep tokens: ["roberta", "bart", "camembert"]).
  • maxSeqLen: maximal length of the input sequence to feed into the model (count in number of tokens)
  • maxQueryLen: maximal length of the question (count in number of tokens)
  • docStride: the stride (step size) the tokenizer will use to split overflowing encodings if they occur. E.g. with 20 overflowing tokens, maxSeqLen=10 and docStride=5, there will be 4 encodings of 10 tokens.
  • sepToken: sep token used in tokenizer (e.g. BERT tokenizer uses "[SEP]")
  • padToken: pad token used in tokenizer (e.g. BERT tokenizer uses "[PAD]")
  • clsIndex: index position of the cls token in encoded input after tokenizing (e.g. BERT tokenizer [CLS] index = 0)
  • isTraining: whether to configure features for training (answer positions are added)
  • returnTensorDataset: whether to stack Feature fields to a tensor.

NOTE. The returned tensor depends on the input params:

  • returnTensorDataset=false: returns tensor.None
  • isTraining=false: returns a tensor of size [6, numOfFeatures, maxSeqLen]:

  • inputIds
  • attentionMasks
  • tokenTypeIds
  • featureIndexes
  • clsIndex (repeated values to make size=maxSeqLen)
  • pMasks

  • isTraining=true: returns a tensor of size [8, numOfFeatures, maxSeqLen]:

  • inputIds
  • attentionMasks
  • tokenTypeIds
  • startPosition (repeated values to make size=maxSeqLen)
  • endPosition (repeated values to make size=maxSeqLen)
  • clsIndex (repeated values to make size=maxSeqLen)
  • pMasks
  • isImpossible (repeated values to make size=maxSeqLen)

func NewFeature

func NewFeature(inputIds []int, attentionMask []int, tokenTypeIds []int, clsIndex int, pMask []int, exampleIndex int, uniqueId int, tokenIsMaxContext []bool, tokens []string, startPosition, endPosition int, isImposible bool, qasId string) *Feature

NewFeature creates a new SQuAD Feature.

type Paragraph

type Paragraph struct {
	QAs     []QA   `json:"qas"`
	Context string `json:"context"`
}

type QA

type QA struct {
	Question         string   `json:"question"`
	Id               string   `json:"id"`
	Answers          []Answer `json:"answers"`
	IsImposible      bool     `json:"is_impossible"`
	PlausibleAnswers []Answer `json:"plausible_answers"`
}

type Result

type Result struct {
	UniqueId       string     // The unique identifier corresponding to that example.
	StartLogits    *ts.Tensor // The logits corresponding to the start of the answer.
	EndLogits      *ts.Tensor // The logits corresponding to the end of the answer.
	StartStopIndex int
	EndStopIndex   int
	ClsLogits      *ts.Tensor
}

Result is a SQuAD result that can be used to evaluate a model's output on the SQuAD dataset.

func NewResult

func NewResult(uniqueId string, start, end *ts.Tensor, startStopIndex, endStopIndex int, clsLogits *ts.Tensor) *Result

type Squad2

type Squad2 struct {
	Version string        `json:"version"`
	Data    []SquadV2Data `json:"data"`
}

type SquadV2Data

type SquadV2Data struct {
	Title      string      `json:"title"`
	Paragraphs []Paragraph `json:"paragraphs"`
}
