squad

package
v0.1.3 Latest
Published: Nov 9, 2020 License: Apache-2.0 Imports: 11 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

var MultiSepTokensTokenizers []string = []string{
	"roberta",
	"camembert",
	"bart",
}

Functions

This section is empty.

Types

type Answer

type Answer struct {
	Text        string `json:"text"`
	AnswerStart int    `json:"answer_start"`
}

type Example

type Example struct {
	QAsId         string   // example unique identification
	QuestionText  string   // question string
	ContextText   string   // context string
	AnswerText    string   // answer string
	Title         string   // Title of the example
	Answers       []Answer // Default = nil. Holds answers as well as their start positions
	IsImposible   bool     // Default = false. Set to true if the example has no possible answer.
	StartPosition int      // index of start rune ("character") of the answer
}

Example is a single training/test example for the Squad dataset, as loaded from disk.

func LoadV2

func LoadV2(datasetNameOpt ...string) []Example

LoadV2 loads SQuAD v2.0 data from file.

Param - datasetNameOpt: specify either the "train" or "dev" dataset. Default="train"

func NewExample

func NewExample(qasId string, question, context, answer string, startPositionChar int, title string, isImpossibleOpt ...bool) *Example

NewExample creates an Example.

Params:

  • qasId: unique Id of QAs in SQuAD dataset
  • question: question text
  • context: paragraph text
  • answer: answer text
  • startPositionChar: first rune position of the answer string
  • title: title of the document (article) in SQuAD dataset
  • isImpossibleOpt: optional param setting whether the example has no possible answer. Default=false

type Feature

type Feature struct {
	QAsId             string // example unique identification
	InputIds          []int  // Indices of input sequence tokens in the vocabulary.
	AttentionMask     []int  // Mask to avoid performing attention on padding token indices.
	TokenTypeIds      []int  // Segment token indices to indicate first and second portions of the input.
	ClsIndex          int    // Index of the CLS token.
	PMask             []int  // Mask identifying tokens that can be answers versus tokens that cannot (1 not in the answer, 0 in the answer).
	ExampleIndex      int    // Index of the example.
	UniqueId          int    // The unique feature identifier.
	TokenIsMaxContext []bool // A bool slice identifying which tokens have their maximum context in this feature.
	// NOTE. If a token does not have its maximum context in this feature, another feature has more
	// information related to that token and should be prioritized over this feature for that token.
	Tokens        []string // Slice of tokens corresponding to the input Ids.
	StartPosition int      // Index of the first answer token. Value=0 if there's no answer
	EndPosition   int      // Index of the last answer token. Value=0 if there's no answer
	IsImposible   bool     // Whether the feature has no possible answer. True means the feature has no possible answer.
}

Feature is a SINGLE SQuAD example feature to be fed to the model.

This feature is model-specific and can be crafted with the function `ConvertExampleToFeatures`.

func ConvertExampleToFeatures

func ConvertExampleToFeatures(tk *tokenizer.Tokenizer, sepToken string, clsIndex int, example Example, answerStart, answerEnd int) []Feature

ConvertExampleToFeatures converts a single example to features.

Params:

  • tk: tokenizer to use
  • sepToken: separator token
  • clsIndex: index of the cls token
  • example: the Example to convert
  • answerStart: start offset of the answer on the context sequence
  • answerEnd: end offset of the answer on the context sequence

func ConvertExamplesToFeatures

func ConvertExamplesToFeatures(examples []Example, tk *tokenizer.Tokenizer, tkName string, maxSeqLen, docStride, maxQueryLen int, sepToken, padToken string, clsIndex int, isTraining bool, returnTensorDataset bool) ([]Feature, *ts.Tensor)

ConvertExamplesToFeatures converts a list of examples into a list of features that can be directly fed into a model. It is model-dependent and takes advantage of many of the tokenizer's features to create the model's inputs.

Params:

  • examples: Slice of Examples to convert
  • tk: corresponding Tokenizer to be used with the model
  • tkName: name of tokenizer (whether it uses multiple sep tokens: ["roberta", "bart", "camembert"]).
  • maxSeqLen: maximal length of the input sequence to feed into the model (count in number of tokens)
  • maxQueryLen: maximal length of the question (count in number of tokens)
  • docStride: the stride (step size) the tokenizer will use to split overflowing encodings if they occur. E.g. with 20 overflowing tokens, maxSeqLen=10 and docStride=5, there will be 4 encodings of 10 tokens.
  • sepToken: sep token used in tokenizer (e.g. BERT tokenizer uses "[SEP]")
  • padToken: pad token used in tokenizer (e.g. BERT tokenizer uses "[PAD]")
  • clsIndex: index position of the cls token in encoded input after tokenizing (e.g. BERT tokenizer [CLS] index = 0)
  • isTraining: whether to configure features for training (answer positions are added)
  • returnTensorDataset: whether to stack Feature fields to a tensor.

NOTE. The returned tensor depends on the input params:

  • returnTensorDataset=false: returns tensor.None
  • isTraining=false: returns a tensor of size [6, numOfFeatures, maxSeqLen]:

  • inputIds
  • attentionMasks
  • tokenTypeIds
  • featureIndexes
  • clsIndex (repeated values to make size=maxSeqLen)
  • pMasks

  • isTraining=true: returns a tensor of size [8, numOfFeatures, maxSeqLen]:

  • inputIds
  • attentionMasks
  • tokenTypeIds
  • startPosition (repeated values to make size=maxSeqLen)
  • endPosition (repeated values to make size=maxSeqLen)
  • clsIndex (repeated values to make size=maxSeqLen)
  • pMasks
  • isImpossible (repeated values to make size=maxSeqLen)

func NewFeature

func NewFeature(inputIds []int, attentionMask []int, tokenTypeIds []int, clsIndex int, pMask []int, exampleIndex int, uniqueId int, tokenIsMaxContext []bool, tokens []string, startPosition, endPosition int, isImposible bool, qasId string) *Feature

NewFeature creates a new SQuAD Feature.

type Paragraph

type Paragraph struct {
	QAs     []QA   `json:"qas"`
	Context string `json:"context"`
}

type QA

type QA struct {
	Question         string   `json:"question"`
	Id               string   `json:"id"`
	Answers          []Answer `json:"answers"`
	IsImposible      bool     `json:"is_impossible"`
	PlausibleAnswers []Answer `json:"plausible_answers"`
}

type Result

type Result struct {
	UniqueId       string     // The unique identifier corresponding to that example.
	StartLogits    *ts.Tensor // The logits corresponding to the start of the answer.
	EndLogits      *ts.Tensor // The logits corresponding to the end of the answer.
	StartStopIndex int
	EndStopIndex   int
	ClsLogits      *ts.Tensor
}

Result is a SQuAD result that can be used to evaluate a model's output on the SQuAD dataset.

func NewResult

func NewResult(uniqueId string, start, end *ts.Tensor, startStopIndex, endStopIndex int, clsLogits *ts.Tensor) *Result

type Squad2

type Squad2 struct {
	Version string        `json:"version"`
	Data    []SquadV2Data `json:"data"`
}

type SquadV2Data

type SquadV2Data struct {
	Title      string      `json:"title"`
	Paragraphs []Paragraph `json:"paragraphs"`
}
