Documentation ¶
Overview ¶
Package line is used to define lexicon line formats for parsing input and printing output.
Interfaces:
* Format - simple line format definition (field names and indices)
* Parser - a more complex parser that contains a Format definition and also allows specific code for parsing that cannot be handled by the Format specs alone (multi-value fields, etc).
THE WIKISPEECH FILE FORMAT ¶
The Wikispeech lexicon file format is defined in ws.go. Lexicon files are tab-separated, UTF-8 encoded text files, and should contain the fields listed below. Empty fields are allowed in most positions.
Any lexicon files you want to import into the lexicon database must be in this file format.
Orth The word's orthography
Pos The part of speech tag
Morph Morphological features (gender, number, etc)
WordParts Compound parts, if any, separated by a plus sign (+)
Lemma The word's lemma form
Paradigm The name of the paradigm used for inflections
Lang The word's language label
Trans1 The first transcription (default for TTS)
Translang1 The language of Trans1
Trans2 Alternative transcription
Translang2 The language of Trans2
Trans3 Alternative transcription
Translang3 The language of Trans3
Trans4 Alternative transcription
Translang4 The language of Trans4
StatusName Status of the lexicon entry
StatusSource Source of the status
Preferred Takes values 1/0, and is used to define which reading for a specific orthography should be the standard one (in case of homographs)
Tag A tag (string) that can be used to disambiguate between homographs if needed (default: empty)
Comments One or more comments, each containing a label (category), a comment (text), and a source (user or other source).
Comments are defined in the following format (separated by §§§):
[label: comment text] (source) §§§ [anotherlabel: another comment] (anothersource_or_user)
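The comment format above can be read with a small regular expression. The following is a minimal sketch, not the package's own implementation: the Comment type, the commentRe pattern and the parseComments helper are hypothetical names introduced here for illustration, and the sketch assumes that labels contain no colon and that comment texts contain no closing bracket.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Comment is a hypothetical stand-in type for illustration;
// it is not the type used by the package.
type Comment struct {
	Label, Text, Source string
}

// commentRe matches one comment of the form "[label: comment text] (source)".
var commentRe = regexp.MustCompile(`^\[([^:\]]+):\s*([^\]]*)\]\s*\(([^)]*)\)$`)

// parseComments splits a Comments field on "§§§" and parses each part.
// An empty field yields no comments and no error.
func parseComments(field string) ([]Comment, error) {
	var res []Comment
	if strings.TrimSpace(field) == "" {
		return res, nil
	}
	for _, part := range strings.Split(field, "§§§") {
		part = strings.TrimSpace(part)
		m := commentRe.FindStringSubmatch(part)
		if m == nil {
			return nil, fmt.Errorf("invalid comment: %q", part)
		}
		res = append(res, Comment{
			Label:  strings.TrimSpace(m[1]),
			Text:   strings.TrimSpace(m[2]),
			Source: strings.TrimSpace(m[3]),
		})
	}
	return res, nil
}

func main() {
	cs, err := parseComments("[assign_to: john] (jane) §§§ [nolabel: typo] (hanna)")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range cs {
		fmt.Printf("label=%s text=%s source=%s\n", c.Label, c.Text, c.Source)
	}
}
```

Rejecting a malformed part with an error, rather than skipping it silently, makes import problems visible early. Whether the package itself behaves this way is not stated in this documentation.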
Sample line:
finalspelet NN SIN|DEF|NOM|NEU final+spelet finalspel s7n-övriga ex träd sv-se f I . "" n A: l . % s p e: . l e t sv-se imported nst false dummytag [assign_to: john] (jane) §§§ [nolabel: typo] (hanna)
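To make the tab-separated layout concrete, here is a minimal sketch that splits such a line into named fields. It rests on assumptions: the field order is taken from the list above (the authoritative indices are in ws.go and may differ), the tab positions in sampleLine are one reading of the sample line, and wsFields and splitLine are hypothetical names that are not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// wsFields lists the field names in the order of the description above.
// Assumption: the real indices are defined in ws.go and may differ.
var wsFields = []string{
	"Orth", "Pos", "Morph", "WordParts", "Lemma", "Paradigm", "Lang",
	"Trans1", "Translang1", "Trans2", "Translang2",
	"Trans3", "Translang3", "Trans4", "Translang4",
	"StatusName", "StatusSource", "Preferred", "Tag", "Comments",
}

// sampleLine rebuilds the sample line with explicit tabs.
// Assumption: the field boundaries are inferred from the field list.
var sampleLine = strings.Join([]string{
	"finalspelet", "NN", "SIN|DEF|NOM|NEU", "final+spelet", "finalspel",
	"s7n-övriga ex träd", "sv-se",
	`f I . "" n A: l . % s p e: . l e t`, "sv-se",
	"", "", "", "", "", "", // Trans2 through Translang4 are left empty
	"imported", "nst", "false", "dummytag",
	"[assign_to: john] (jane) §§§ [nolabel: typo] (hanna)",
}, "\t")

// splitLine splits a tab-separated line and maps each value to its field name.
// It fails if the number of columns does not match the field list.
func splitLine(line string) (map[string]string, error) {
	values := strings.Split(line, "\t")
	if len(values) != len(wsFields) {
		return nil, fmt.Errorf("expected %d fields, found %d", len(wsFields), len(values))
	}
	res := make(map[string]string, len(values))
	for i, name := range wsFields {
		res[name] = strings.TrimSpace(values[i])
	}
	return res, nil
}

func main() {
	fields, err := splitLine(sampleLine)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("Orth:", fields["Orth"])
	fmt.Println("Trans1:", fields["Trans1"])
	fmt.Println("StatusName:", fields["StatusName"])
}
```

Empty fields still occupy a column, so the column count stays fixed even when the alternative transcriptions are missing.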
Index ¶
- func MapTranscriptions(m mapper.Mapper, e *lex.Entry) error
- type Braxen
- func (brax Braxen) Entry2String(e lex.Entry) (string, error)
- func (brax Braxen) Format() Format
- func (brax Braxen) Header() string
- func (brax Braxen) Parse(line string) (map[Field]string, error)
- func (brax Braxen) ParseToEntry(line string) (lex.Entry, string, error)
- func (brax Braxen) String(fields map[Field]string) (string, error)
- type Field
- type FileWriter
- type Format
- type FormatTest
- type NST
- func (nst NST) Entry2String(e lex.Entry) (string, error)
- func (nst NST) Format() Format
- func (nst NST) Header() string
- func (nst NST) Parse(line string) (map[Field]string, error)
- func (nst NST) ParseToEntry(line string) (lex.Entry, string, error)
- func (nst NST) String(fields map[Field]string) (string, error)
- type Parser
- type WS
Functions ¶
Types ¶
type Braxen ¶ added in v0.4.1
type Braxen struct {
// contains filtered or unexported fields
}
Braxen contains the line format used for Braxen lexicon data. The struct's fields are package-private; to create a new Braxen instance, use NewBraxen.
func (Braxen) Entry2String ¶ added in v0.4.1
Entry2String is used to generate an output line from a lex.Entry (calls underlying Format.String)
func (Braxen) Format ¶ added in v0.4.1
Format is the line.Format instance used for line parsing inside of this parser
func (Braxen) Parse ¶ added in v0.4.1
Parse is used for parsing input lines (calls underlying Format.Parse)
func (Braxen) ParseToEntry ¶ added in v0.4.1
ParseToEntry is used for parsing input lines (calls underlying Format.Parse). The orthography is lower-cased; the second return value is the input orthography in its original case.
type Field ¶
type Field int
Field is a simple const for line field definition types
const (
	// Orth orthography
	Orth Field = iota
	// Pos part-of-speech (noun, verb, NN, VB, etc)
	Pos
	// Morph morphological tags (case, gender, tense, etc)
	Morph
	// WordParts decompounded orthography field (for compounds)
	WordParts
	// Lang the word's language
	Lang
	// Trans1 the primary transcription
	Trans1
	// Translang1 the language of the primary transcription
	Translang1
	// Trans2 transcription variant
	Trans2
	// Translang2 language for Trans2
	Translang2
	// Trans3 transcription variant
	Trans3
	// Translang3 language for Trans3
	Translang3
	// Trans4 transcription variant
	Trans4
	// Translang4 language for Trans4
	Translang4
	// Trans5 transcription variant
	Trans5
	// Translang5 language for Trans5
	Translang5
	// Trans6 transcription variant
	Trans6
	// Translang6 language for Trans6
	Translang6
	// Lemma the lemma form. Typically orthographic lemma + some kind of (disambiguation) identifier, e.g., wind_01.
	Lemma
	// Paradigm rule reference (id) for generating inflected forms from lemma
	Paradigm
	// StatusName refers to a status category of the entry, such as 'ok', 'skip' or similar
	StatusName
	// StatusSource refers to the source of a status (user id, reference data id, etc)
	StatusSource
	// Preferred is used to label certain entries as preferred over other ones with the same orthography; 1 = preferred, 0 = not preferred. Schema triggers ensure only one preferred entry per orthographic word
	Preferred
	// Tag is an optional disambiguation tag
	Tag
	// Comments is an optional field
	Comments
)
type FileWriter ¶
FileWriter is used for writing entries to file (using an io.Writer)
func (FileWriter) Size ¶
func (w FileWriter) Size() int
Size returns the size of the FileWriter content
type Format ¶
Format is used to define a lexicon's line format. The struct's fields are package-private; to create a new Format instance, use NewFormat.
func NewFormat ¶
func NewFormat(name string, fieldSep string, fields map[Field]int, nFields int, tests []FormatTest) (Format, error)
NewFormat is a public constructor for Format with built-in error checks and tests
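The idea behind a format definition (a field separator, a map from fields to column indices, and an expected column count) can be sketched without the package itself. The following is a simplified illustration under stated assumptions: miniFormat, newMiniFormat and the reduced Field set are hypothetical stand-ins, and the checks shown are plausible examples rather than the checks NewFormat actually performs.

```go
package main

import (
	"fmt"
	"strings"
)

// Field mirrors the idea of line.Field; this reduced set is hypothetical.
type Field int

const (
	Orth Field = iota
	Pos
	Trans1
)

// miniFormat is a simplified, hypothetical stand-in for line.Format.
type miniFormat struct {
	name     string
	fieldSep string
	fields   map[Field]int // field -> column index
	nFields  int           // expected number of columns per line
}

// newMiniFormat checks that every index is in range and used only once.
func newMiniFormat(name, fieldSep string, fields map[Field]int, nFields int) (miniFormat, error) {
	seen := make(map[int]bool)
	for f, i := range fields {
		if i < 0 || i >= nFields {
			return miniFormat{}, fmt.Errorf("field %d: index %d out of range", f, i)
		}
		if seen[i] {
			return miniFormat{}, fmt.Errorf("index %d used by more than one field", i)
		}
		seen[i] = true
	}
	return miniFormat{name: name, fieldSep: fieldSep, fields: fields, nFields: nFields}, nil
}

// parse splits a line and picks out the configured columns.
func (f miniFormat) parse(line string) (map[Field]string, error) {
	cols := strings.Split(line, f.fieldSep)
	if len(cols) != f.nFields {
		return nil, fmt.Errorf("%s: expected %d fields, found %d", f.name, f.nFields, len(cols))
	}
	res := make(map[Field]string, len(f.fields))
	for field, i := range f.fields {
		res[field] = cols[i]
	}
	return res, nil
}

func main() {
	// Column 2 is not mapped to any field and is therefore ignored.
	f, err := newMiniFormat("demo", "\t", map[Field]int{Orth: 0, Pos: 1, Trans1: 3}, 4)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fields, err := f.parse("hund\tNN\tignored\th u n d")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(fields[Orth], fields[Pos], fields[Trans1])
}
```

Validating the index map once at construction time means that parse only has to check the column count of each line.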
type FormatTest ¶
FormatTest defines a test to run upon initialization of Format (using NewFormat)
type NST ¶
type NST struct {
// contains filtered or unexported fields
}
NST contains the line format used for NST lexicon data. The struct's fields are package-private; to create a new NST instance, use NewNST.
func (NST) Entry2String ¶
Entry2String is used to generate an output line from a lex.Entry (calls underlying Format.String)
func (NST) ParseToEntry ¶
ParseToEntry is used for parsing input lines (calls underlying Format.Parse). The orthography is lower-cased; the second return value is the input orthography in its original case.
type Parser ¶
type Parser interface {
// Format is the line.Format instance used for line parsing inside of this parser
Format() Format
// Parse is used for parsing input lines
Parse(string) (map[Field]string, error)
// String is used to generate an output line from a set of fields
String(map[Field]string) (string, error)
// Entry2String is used to generate an output line from an input entry
Entry2String(e lex.Entry) (string, error)
}
Parser is used to define a lexicon's line parser. To implement your own parser, implement the methods listed above; Parse(string) and String(map[Field]string) are the ones that do the format-specific work.
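As an illustration of what a custom implementation might look like, the sketch below defines a tiny two-column parser. To stay self-contained it does not import the package: Field, Format and Entry are reduced, hypothetical stand-ins for line.Field, line.Format and lex.Entry, and twoCol is an invented example type.

```go
package main

import (
	"fmt"
	"strings"
)

// Reduced stand-ins for line.Field, line.Format and lex.Entry (illustration only).
type Field int

const (
	Orth Field = iota
	Trans1
)

type Format struct{ FieldSep string }

type Entry struct {
	Orth    string
	Transes []string
}

// Parser repeats the interface from the documentation, using the stand-in types.
type Parser interface {
	Format() Format
	Parse(string) (map[Field]string, error)
	String(map[Field]string) (string, error)
	Entry2String(e Entry) (string, error)
}

// twoCol is a hypothetical parser for lines of the form "orth<sep>transcription".
type twoCol struct{ format Format }

// Format returns the format definition used by this parser.
func (p twoCol) Format() Format { return p.format }

// Parse splits a line into exactly two fields.
func (p twoCol) Parse(line string) (map[Field]string, error) {
	cols := strings.Split(line, p.format.FieldSep)
	if len(cols) != 2 {
		return nil, fmt.Errorf("expected 2 fields, found %d", len(cols))
	}
	return map[Field]string{Orth: cols[0], Trans1: cols[1]}, nil
}

// String joins the two fields back into an output line.
func (p twoCol) String(fields map[Field]string) (string, error) {
	return fields[Orth] + p.format.FieldSep + fields[Trans1], nil
}

// Entry2String writes an entry using its first transcription only.
func (p twoCol) Entry2String(e Entry) (string, error) {
	if len(e.Transes) == 0 {
		return "", fmt.Errorf("entry %q has no transcription", e.Orth)
	}
	return p.String(map[Field]string{Orth: e.Orth, Trans1: e.Transes[0]})
}

// Compile-time check that twoCol satisfies Parser.
var _ Parser = twoCol{}

func main() {
	p := twoCol{format: Format{FieldSep: "\t"}}
	fields, err := p.Parse("hund\th u n d")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	out, _ := p.String(fields)
	fmt.Printf("%q\n", out)
}
```

The compile-time assertion `var _ Parser = twoCol{}` is a common Go idiom for catching a missing method before the parser is used anywhere.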
type WS ¶
type WS struct {
// contains filtered or unexported fields
}
WS implements the line.Parser interface
func (WS) Entry2String ¶
Entry2String is used to generate an output line from a lex.Entry (calls underlying Format.String)
func (WS) ParseToEntry ¶
ParseToEntry is used for parsing input lines (calls underlying Format.Parse)