line

package
v0.4.1
Published: Feb 6, 2026 License: GPL-3.0 Imports: 7 Imported by: 7

README

File formats

Wikispeech lexicon file format

Description of the Wikispeech lexicon file format

NST lexicon file format

This format is used for converting NST lexicon files to the Wikispeech lexicon file format

Documentation

Overview

Package line is used to define lexicon line formats for parsing input and printing output.

Interfaces:

* Format - simple line format definition (field names and indices)

* Parser - a more complex parser that contains a Format definition and also allows custom parsing code for cases the Format specs alone cannot handle (multi-value fields, etc.).

THE WIKISPEECH FILE FORMAT

The Wikispeech lexicon file format is defined in ws.go. Lexicon files are tab-separated, UTF-8 encoded text files, and should contain the fields listed below. Empty fields are allowed in most positions.

Any lexicon files you want to import into the lexicon database must be in this file format.

Orth           The word's orthography
Pos            The part of speech tag
Morph          Morphological features (gender, number, etc)
WordParts      Compound parts, if any, separated by a plus sign (+)
Lemma          The word's lemma form
Paradigm       The name of the paradigm used for inflections
Lang           The word's language label
Trans1         The first transcription (default for TTS)
Translang1     The language of Trans1
Trans2         Alternative transcription
Translang2     The language of Trans2
Trans3         Alternative transcription
Translang3     The language of Trans3
Trans4         Alternative transcription
Translang4     The language of Trans4
StatusName     Status of the lexicon entry
StatusSource   Source of the status
Preferred      Takes values 1/0, and is used to define which reading for a specific
               orthography should be the standard one (in case of homographs)
Tag            A tag (string) that can be used to disambiguate between homographs if needed (default: empty)
Comments       One or more comments, each containing a label (category), a comment (text), and a source (user or other source).
               Comments are defined in the following format (separated by §§§):
                 [label: comment text] (source) §§§ [anotherlabel: another comment] (anothersource_or_user)
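
The Comments format above can be split into its parts with a few lines of standard-library Go. This is a minimal sketch, not the package's own code: the `comment` type, `commentRe`, and `parseComments` are hypothetical names introduced here, and the exact whitespace tolerance is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// comment is a hypothetical type for this sketch; it is not part of the package API.
type comment struct {
	Label, Text, Source string
}

// commentRe matches a single "[label: comment text] (source)" item.
var commentRe = regexp.MustCompile(`^\[([^:\]]+):\s*([^\]]*)\]\s*\(([^)]*)\)$`)

// parseComments splits a Comments field on "§§§" and parses each item.
// An empty field yields an empty result (assumption: comments are optional).
func parseComments(field string) ([]comment, error) {
	var res []comment
	if strings.TrimSpace(field) == "" {
		return res, nil
	}
	for _, item := range strings.Split(field, "§§§") {
		item = strings.TrimSpace(item)
		m := commentRe.FindStringSubmatch(item)
		if m == nil {
			return nil, fmt.Errorf("invalid comment: %q", item)
		}
		res = append(res, comment{Label: m[1], Text: strings.TrimSpace(m[2]), Source: m[3]})
	}
	return res, nil
}

func main() {
	cs, err := parseComments("[assign_to: john] (jane) §§§ [nolabel: typo] (hanna)")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range cs {
		fmt.Printf("%s | %s | %s\n", c.Label, c.Text, c.Source)
	}
}
```

Run as is, the sketch prints `assign_to | john | jane` followed by `nolabel | typo | hanna`.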

Sample line:

finalspelet	NN	SIN|DEF|NOM|NEU	final+spelet	finalspel	s7n-övriga ex träd	sv-se	f I . "" n A: l . % s p e: . l e t	sv-se							imported	nst	false	dummytag	[assign_to: john] (jane) §§§ [nolabel: typo] (hanna)
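
To make the column layout concrete, the following sketch splits the sample line on tabs and maps each value to the field names in the order listed above. It uses only the standard library and does not call this package; `wsFields` and `splitWSLine` are hypothetical names, and the strict 20-column check is an assumption made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// wsFields lists the field names in the order given in the format description.
var wsFields = []string{
	"Orth", "Pos", "Morph", "WordParts", "Lemma", "Paradigm", "Lang",
	"Trans1", "Translang1", "Trans2", "Translang2", "Trans3", "Translang3",
	"Trans4", "Translang4", "StatusName", "StatusSource", "Preferred", "Tag", "Comments",
}

// splitWSLine splits a tab-separated line and maps each value to its field name.
// Requiring exactly len(wsFields) columns is an assumption of this sketch.
func splitWSLine(line string) (map[string]string, error) {
	values := strings.Split(line, "\t")
	if len(values) != len(wsFields) {
		return nil, fmt.Errorf("expected %d fields, found %d", len(wsFields), len(values))
	}
	res := make(map[string]string, len(values))
	for i, name := range wsFields {
		res[name] = values[i]
	}
	return res, nil
}

func main() {
	// The sample line from above, written with explicit \t escapes.
	line := "finalspelet\tNN\tSIN|DEF|NOM|NEU\tfinal+spelet\tfinalspel\ts7n-övriga ex träd\tsv-se\t" +
		"f I . \"\" n A: l . % s p e: . l e t\tsv-se\t\t\t\t\t\t\timported\tnst\tfalse\tdummytag\t" +
		"[assign_to: john] (jane) §§§ [nolabel: typo] (hanna)"
	fields, err := splitWSLine(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(fields["Orth"], fields["Lemma"], fields["StatusName"])
}
```

The six empty columns between the first Translang1 value and `imported` correspond to the unused Trans2 through Translang4 fields.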

Constants

This section is empty.

Variables

This section is empty.

Functions

func MapTranscriptions added in v0.4.1

func MapTranscriptions(m mapper.Mapper, e *lex.Entry) error

MapTranscriptions maps the input entry's transcriptions (in-place)

Types

type Braxen added in v0.4.1

type Braxen struct {
	// contains filtered or unexported fields
}

Braxen contains the line format used for Braxen lexicon data. The struct is intended for package-private use. To create a new Braxen instance, use NewBraxen.

func NewBraxen added in v0.4.1

func NewBraxen() (Braxen, error)

NewBraxen is used to create an instance of the Braxen parser

func (Braxen) Entry2String added in v0.4.1

func (brax Braxen) Entry2String(e lex.Entry) (string, error)

Entry2String is used to generate an output line from a lex.Entry (calls underlying Format.String)

func (Braxen) Format added in v0.4.1

func (brax Braxen) Format() Format

Format is the line.Format instance used for line parsing inside of this parser

func (Braxen) Header added in v0.4.1

func (brax Braxen) Header() string

func (Braxen) Parse added in v0.4.1

func (brax Braxen) Parse(line string) (map[Field]string, error)

Parse is used for parsing input lines (calls underlying Format.Parse)

func (Braxen) ParseToEntry added in v0.4.1

func (brax Braxen) ParseToEntry(line string) (lex.Entry, string, error)

ParseToEntry is used for parsing input lines (calls underlying Format.Parse). The entry's orthography is lower-cased; the second return value is the input orthography in its original case.

func (Braxen) String added in v0.4.1

func (brax Braxen) String(fields map[Field]string) (string, error)

String is used to generate an output line from a set of fields (calls underlying Format.String)

type Field

type Field int

Field is a simple const for line field definition types

const (
	// Orth orthography
	Orth Field = iota

	// Pos part-of-speech (noun, verb, NN, VB, etc)
	Pos

	// Morph morphological tags (case, gender, tense, etc)
	Morph

	// WordParts decompounded orthography field (for compounds)
	WordParts

	// Lang the word's language
	Lang

	// Trans1 the primary transcription
	Trans1

	// Translang1 the language of the primary transcription
	Translang1

	// Trans2 transcription variant
	Trans2

	// Translang2 language for Trans2
	Translang2

	// Trans3 transcription variant
	Trans3

	// Translang3 language for Trans3
	Translang3

	// Trans4 transcription variant
	Trans4

	// Translang4 language for Trans4
	Translang4

	// Trans5 transcription variant
	Trans5

	// Translang5 language for Trans5
	Translang5

	// Trans6 transcription variant
	Trans6

	// Translang6 language for Trans6
	Translang6

	// Lemma the lemma form. Typically orthographic lemma + some kind of (disambiguation) identifier, e.g., wind_01.
	Lemma

	// Paradigm rule reference (id) for generating inflected forms from lemma
	Paradigm

	// StatusName refers to a status category of the entry, such as 'ok', 'skip' or similar
	StatusName

	// StatusSource refers to the source of a status (user id, reference data id, etc)
	StatusSource

	// Preferred field used to label certain entries as preferred over other ones with the same orthography; 1 = preferred, 0 = not preferred; schema triggers ensure only one preferred entry per orthographic word
	Preferred

	// Tag is an optional disambiguation tag
	Tag

	// Comments is an optional field
	Comments
)

func (Field) String

func (i Field) String() string
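
For readers unfamiliar with the pattern, the sketch below shows how an iota-based constant type can provide a String method. It is an illustration only: `field`, `fieldNames`, and the three constants are hypothetical, shortened stand-ins, and the package's actual String implementation may differ (it could, for instance, be generated by a tool such as stringer, which is a guess).

```go
package main

import "fmt"

// field is a hypothetical, shortened stand-in for an iota-based field type.
type field int

const (
	orth  field = iota // 0
	pos                // 1
	morph              // 2
)

// fieldNames holds one name per constant, in declaration order.
var fieldNames = [...]string{"Orth", "Pos", "Morph"}

// String returns the constant's name, or a numeric fallback for unknown values.
func (f field) String() string {
	if f < 0 || int(f) >= len(fieldNames) {
		return fmt.Sprintf("field(%d)", int(f))
	}
	return fieldNames[f]
}

func main() {
	// fmt uses the String method because field satisfies fmt.Stringer.
	fmt.Println(orth, morph, field(7)) // Orth Morph field(7)
}
```

Because the constants start at 0 and increase by one, they can double as indices into a name table, and the same values can serve as map keys such as those in `map[Field]string`.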

type FileWriter

type FileWriter struct {
	Parser Parser
	Writer io.Writer
	// contains filtered or unexported fields
}

FileWriter is used for writing entries to file (using an io.Writer)

func (FileWriter) Size

func (w FileWriter) Size() int

Size returns the size of the FileWriter content

func (FileWriter) Write

func (w FileWriter) Write(e lex.Entry) error

Write is used to write one lex.Entry at a time to a file (using an io.Writer)

type Format

type Format struct {
	Name     string
	FieldSep string
	Fields   map[Field]int
	NFields  int
}

Format is used to define a lexicon's line. This is a struct intended for package-private use. To create a new Format instance, use NewFormat.

func NewFormat

func NewFormat(name string, fieldSep string, fields map[Field]int, nFields int, tests []FormatTest) (Format, error)

NewFormat is a public constructor for Format with built-in error checks and tests

func (Format) Equals

func (f Format) Equals(other Format) bool

Equals compares two line.Format instances

func (Format) Header added in v0.4.1

func (f Format) Header() string

func (Format) Parse

func (f Format) Parse(line string) (map[Field]string, error)

Parse is used for parsing input lines

func (Format) String

func (f Format) String(fields map[Field]string) (string, error)

String is used to generate an output line from a set of fields

type FormatTest

type FormatTest struct {
	InputLine  string
	Fields     map[Field]string
	OutputLine string
}

FormatTest defines a test to run upon initialization of Format (using NewFormat)
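
The idea of a constructor that runs self-tests can be sketched with the standard library alone. The code below is a simplified illustration under stated assumptions, not the package's implementation: `miniFormat`, `roundTrip`, and `newMiniFormat` are hypothetical names, fields are keyed by string rather than by Field, and the specific checks (index range, parsed fields, re-generated output line) are guesses at what such tests might verify.

```go
package main

import (
	"fmt"
	"strings"
)

// miniFormat is a hypothetical, reduced stand-in for a line format definition.
type miniFormat struct {
	sep     string
	fields  map[string]int // field name -> column index
	nFields int
}

// roundTrip is one self-test: input line, expected fields, expected output line.
type roundTrip struct {
	in     string
	fields map[string]string
	out    string
}

// parse splits a line and returns the value of each defined field.
func (f miniFormat) parse(line string) (map[string]string, error) {
	cols := strings.Split(line, f.sep)
	if len(cols) != f.nFields {
		return nil, fmt.Errorf("expected %d columns, found %d", f.nFields, len(cols))
	}
	res := make(map[string]string, len(f.fields))
	for name, i := range f.fields {
		res[name] = cols[i]
	}
	return res, nil
}

// format builds an output line; columns without a field definition stay empty.
func (f miniFormat) format(fields map[string]string) string {
	cols := make([]string, f.nFields)
	for name, i := range f.fields {
		cols[i] = fields[name]
	}
	return strings.Join(cols, f.sep)
}

// newMiniFormat validates the definition and runs every self-test before returning.
func newMiniFormat(sep string, fields map[string]int, nFields int, tests []roundTrip) (miniFormat, error) {
	f := miniFormat{sep: sep, fields: fields, nFields: nFields}
	for name, i := range fields {
		if i < 0 || i >= nFields {
			return miniFormat{}, fmt.Errorf("field %s: index %d out of range", name, i)
		}
	}
	for _, t := range tests {
		got, err := f.parse(t.in)
		if err != nil {
			return miniFormat{}, err
		}
		for name, want := range t.fields {
			if got[name] != want {
				return miniFormat{}, fmt.Errorf("field %s: expected %q, found %q", name, want, got[name])
			}
		}
		if out := f.format(got); out != t.out {
			return miniFormat{}, fmt.Errorf("expected output %q, found %q", t.out, out)
		}
	}
	return f, nil
}

func main() {
	fields := map[string]int{"Orth": 0, "Pos": 1, "Trans1": 2}
	tests := []roundTrip{{
		in:     "hund\tNN\th u n d",
		fields: map[string]string{"Orth": "hund", "Pos": "NN", "Trans1": "h u n d"},
		out:    "hund\tNN\th u n d",
	}}
	if _, err := newMiniFormat("\t", fields, 3, tests); err != nil {
		fmt.Println("invalid format:", err)
		return
	}
	fmt.Println("format ok")
}
```

Running the tests inside the constructor means a mismatched field index surfaces as an error at initialization rather than as silently misaligned columns later.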

type NST

type NST struct {
	// contains filtered or unexported fields
}

NST contains the line format used for NST lexicon data. The struct is intended for package-private use. To create a new NST instance, use NewNST.

func NewNST

func NewNST() (NST, error)

NewNST is used to create an instance of the NST parser

func (NST) Entry2String

func (nst NST) Entry2String(e lex.Entry) (string, error)

Entry2String is used to generate an output line from a lex.Entry (calls underlying Format.String)

func (NST) Format

func (nst NST) Format() Format

Format is the line.Format instance used for line parsing inside of this parser

func (NST) Header added in v0.4.1

func (nst NST) Header() string

func (NST) Parse

func (nst NST) Parse(line string) (map[Field]string, error)

Parse is used for parsing input lines (calls underlying Format.Parse)

func (NST) ParseToEntry

func (nst NST) ParseToEntry(line string) (lex.Entry, string, error)

ParseToEntry is used for parsing input lines (calls underlying Format.Parse). The entry's orthography is lower-cased; the second return value is the input orthography in its original case.

func (NST) String

func (nst NST) String(fields map[Field]string) (string, error)

String is used to generate an output line from a set of fields (calls underlying Format.String)

type Parser

type Parser interface {

	// Format is the line.Format instance used for line parsing inside of this parser
	Format() Format

	// Parse is used for parsing input lines
	Parse(string) (map[Field]string, error)

	// String is used to generate an output line from a set of fields
	String(map[Field]string) (string, error)

	// Entry2String is used to generate an output line from an input entry
	Entry2String(e lex.Entry) (string, error)
}

Parser is used to define a lexicon's line parser. To implement your own parser, implement every method of the interface: Format(), Parse(string), String(map[Field]string), and Entry2String(lex.Entry).

type WS

type WS struct {
	// contains filtered or unexported fields
}

WS implements the line.Parser interface

func NewWS

func NewWS() (WS, error)

NewWS is used to create a new instance of the WS parser

func (WS) Entry2String

func (ws WS) Entry2String(e lex.Entry) (string, error)

Entry2String is used to generate an output line from a lex.Entry (calls underlying Format.String)

func (WS) Format

func (ws WS) Format() Format

Format is the line.Format instance used for line parsing inside of this parser

func (WS) Header added in v0.4.1

func (ws WS) Header() string

func (WS) Parse

func (ws WS) Parse(line string) (map[Field]string, error)

Parse is used for parsing input lines (calls underlying Format.Parse)

func (WS) ParseToEntry

func (ws WS) ParseToEntry(line string) (lex.Entry, error)

ParseToEntry is used for parsing input lines (calls underlying Format.Parse)

func (WS) String

func (ws WS) String(fields map[Field]string) (string, error)

String is used to generate an output line from a set of fields (calls underlying Format.String)
