Documentation ¶
Index ¶
- Variables
- type Answer
- type Example
- type Feature
- func ConvertExampleToFeatures(tk *tokenizer.Tokenizer, sepToken string, clsIndex int, example Example, ...) []Feature
- func ConvertExamplesToFeatures(examples []Example, tk *tokenizer.Tokenizer, tkName string, ...) ([]Feature, *ts.Tensor)
- func NewFeature(inputIds []int, attentionMask []int, tokenTypeIds []int, clsIndex int, ...) *Feature
- type Paragraph
- type QA
- type Result
- type Squad2
- type SquadV2Data
Constants ¶
This section is empty.
Variables ¶
var MultiSepTokensTokenizers []string = []string{
"roberta",
"camembert",
"bart",
}
Functions ¶
This section is empty.
Types ¶
type Example ¶
type Example struct {
QAsId string // example unique identification
QuestionText string // question string
ContextText string // context string
AnswerText string // answer string
Title string // Title of the example
Answers []Answer // Default = nil. Holds answers as well as their start positions
IsImposible bool // Default = false. Set to true if the example has no possible answer.
StartPosition int // index of start rune ("character") of the answer
}
Example is a single training/test example for the SQuAD dataset, as loaded from disk.
func LoadV2 ¶
LoadV2 loads SQuAD v2.0 data from file.
Param - datasetNameOpt: specify either the "train" or "dev" dataset. Default="train"
func NewExample ¶
func NewExample(qasId string, question, context, answer string, startPositionChar int, title string, isImpossibleOpt ...bool) *Example
NewExample creates an Example.
Params:
- qasId: unique Id of the QAs in the SQuAD dataset
- question: question text
- context: paragraph text
- answer: answer text
- startPositionChar: first rune position of the answer string
- title: title of the document (article) in the SQuAD dataset
- isImpossibleOpt: optional param setting whether the example has no possible answer. Default=false
type Feature ¶
type Feature struct {
QAsId string // example unique identification
InputIds []int // Indices of input sequence tokens in the vocabulary.
AttentionMask []int // Mask to avoid performing attention on padding token indices.
TokenTypeIds []int // Segment token indices to indicate first and second portions of the input.
ClsIndex int // Index of the CLS token.
PMask []int // Mask identifying tokens that can be answers versus tokens that cannot (1 not in the answer, 0 in the answer).
ExampleIndex int // Index of the example.
UniqueId int // The unique feature identifier.
TokenIsMaxContext []bool // A bool slice identifying which tokens have their maximum context in this feature.
// NOTE. If a token does not have its maximum context in this feature, another feature holds more
// information about that token and should be prioritized over this feature for that token.
Tokens []string // Slice of tokens corresponding to the input Ids.
StartPosition int // Index of the first answer token. Value=0 if there's no answer
EndPosition int // Index of the last answer token. Value=0 if there's no answer
IsImposible bool // Whether the feature has no possible answer. True means the feature has no possible answer.
}
Feature is a single SQuAD example feature to be fed to the model.
A feature is model-specific and can be created using the method `SquadExample.ConvertExampleToFeatures`.
func ConvertExampleToFeatures ¶
func ConvertExampleToFeatures(tk *tokenizer.Tokenizer, sepToken string, clsIndex int, example Example, answerStart, answerEnd int) []Feature
ConvertExampleToFeatures converts a single example to features.
Params:
- tk: tokenizer to use
- sepToken: separator token
- clsIndex: index of the cls token
- example: the Example to convert
- answerStart: start offset of the answer on the context sequence
- answerEnd: end offset of the answer on the context sequence
func ConvertExamplesToFeatures ¶
func ConvertExamplesToFeatures(examples []Example, tk *tokenizer.Tokenizer, tkName string, maxSeqLen, docStride, maxQueryLen int, sepToken, padToken string, clsIndex int, isTraining bool, returnTensorDataset bool) ([]Feature, *ts.Tensor)
ConvertExamplesToFeatures converts a list of examples into a list of features that can be directly fed into a model. It is model-dependant and takes advantage of many of the tokenizer's features to create the model's inputs.
Params:
- examples: Slice of Examples to convert
- tk: corresponding Tokenizer to be used with the model
- tkName: name of the tokenizer (whether it uses multiple sep tokens: ["roberta", "bart", "camembert"]).
- maxSeqLen: maximal length of the input sequence to feed into the model (count in number of tokens)
- maxQueryLen: maximal length of the question (count in number of tokens)
- docStride: the stride (step size) the tokenizer uses to split the encoding when it overflows. E.g. with 20 overflowing tokens, maxSeqLen=10 and docStride=5, there will be 4 encodings of 10 tokens each.
- sepToken: sep token used in tokenizer (e.g. BERT tokenizer uses "[SEP]")
- padToken: pad token used in tokenizer (e.g. BERT tokenizer uses "[PAD]")
- clsIndex: index position of the cls token in encoded input after tokenizing (e.g. BERT tokenizer [CLS] index = 0)
- isTraining: whether to config features for training (added Answer)
- returnTensorDataset: whether to stack Feature fields to a tensor.
NOTE. The returned tensor depends on the input params:
- returnTensorDataset=false: return ts.None
- isTraining=false: return a tensor of size [6, numOfFeatures, maxSeqLen]:
- inputIds
- attentionMasks
- tokenTypeIds
- featureIndexes
- clsIndex (repeated values to make size=maxSeqLen)
- pMasks
- isTraining=true: return a tensor of size [8, numOfFeatures, maxSeqLen]:
- inputIds
- attentionMasks
- tokenTypeIds
- startPosition (repeated values to make size=maxSeqLen)
- endPosition (repeated values to make size=maxSeqLen)
- clsIndex (repeated values to make size=maxSeqLen)
- pMasks
- isImpossible (repeated values to make size=maxSeqLen)
func NewFeature ¶
func NewFeature(inputIds []int, attentionMask []int, tokenTypeIds []int, clsIndex int, pMask []int, exampleIndex int, uniqueId int, tokenIsMaxContext []bool, tokens []string, startPosition, endPosition int, isImposible bool, qasId string) *Feature
NewFeature creates a new SQuAD Feature.
type Result ¶
type Result struct {
UniqueId string // The unique identifier corresponding to that example.
StartLogits *ts.Tensor // The logits corresponding to the start of the answer.
EndLogits *ts.Tensor // The logits corresponding to the end of the answer.
StartStopIndex int
EndStopIndex int
ClsLogits *ts.Tensor
}
Result holds a SQuAD prediction result and can be used to evaluate a model's output on the SQuAD dataset.
type Squad2 ¶
type Squad2 struct {
Version string `json:"version"`
Data []SquadV2Data `json:"data"`
}