token

package
v0.1.1
Warning

This package is not in the latest version of its module.

Published: Apr 15, 2021 License: MIT Imports: 3 Imported by: 5

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CalculateProperties added in v0.1.1

func CalculateProperties(raw, cleaned []rune, p *Properties)

CalculateProperties takes the raw and cleaned values of a token, computes properties of these values, and saves them into the given Properties object.
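The implementation is not shown on this page, but the documented behavior can be sketched in a self-contained form. The `calcProperties` helper below is hypothetical, not the package's own code, and covers only a subset of the Properties fields listed further down:

```go
package main

import (
	"fmt"
	"unicode"
)

// Properties mirrors a subset of the fields documented in this package.
type Properties struct {
	HasStartParens bool
	HasEndParens   bool
	HasDigits      bool
	HasLetters     bool
}

// calcProperties is a hypothetical sketch of the kind of work
// CalculateProperties performs: inspect the token's runes and fill p.
// The real function also receives the cleaned value; this sketch
// ignores it for brevity.
func calcProperties(raw, cleaned []rune, p *Properties) {
	if len(raw) > 0 {
		p.HasStartParens = raw[0] == '('
		p.HasEndParens = raw[len(raw)-1] == ')'
	}
	for _, ch := range raw {
		switch {
		case unicode.IsDigit(ch):
			p.HasDigits = true
		case unicode.IsLetter(ch):
			p.HasLetters = true
		}
	}
}

func main() {
	var p Properties
	calcProperties([]rune("(Pardosa)"), []rune("Pardosa"), &p)
	fmt.Printf("%+v\n", p)
	// {HasStartParens:true HasEndParens:true HasDigits:false HasLetters:true}
}
```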

Types

type Properties

type Properties struct {
	// HasStartParens token starts with '('.
	HasStartParens bool

	// HasEndParens token ends with ')'.
	HasEndParens bool

	// HasStartSqParens token starts with '['.
	HasStartSqParens bool

	// HasEndSqParens token ends with ']'.
	HasEndSqParens bool

	// HasEndDot token ends with '.'
	HasEndDot bool

	// HasEndComma token ends with ','
	HasEndComma bool

	// HasDigits token includes at least one '0-9'.
	HasDigits bool

	// HasLetters token includes at least one character for which
	// unicode.IsLetter(ch) is true.
	HasLetters bool

	// HasDash token includes '-'.
	HasDash bool

	// HasSpecialChars internal part of a token includes non-letters, non-digits.
	HasSpecialChars bool

	// IsCapitalized is true if the first letter of a token is capitalized.
	// The first letter does not have to be the first character.
	IsCapitalized bool

	// IsNumber internal part of a token has only numbers.
	IsNumber bool

	// IsWord internal part of a token includes only letters.
	IsWord bool
}

Properties is a fixed set of general properties determined during the text traversal.
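The IsCapitalized rule is worth emphasizing: it tests the first letter of the token, which does not have to be the first character. A self-contained sketch of that rule (`firstLetterUpper` is a hypothetical helper, not part of the package):

```go
package main

import (
	"fmt"
	"unicode"
)

// firstLetterUpper illustrates the documented IsCapitalized rule:
// find the first *letter* in the token and test whether it is
// upper case. Leading non-letter characters are skipped.
func firstLetterUpper(token []rune) bool {
	for _, ch := range token {
		if unicode.IsLetter(ch) {
			return unicode.IsUpper(ch)
		}
	}
	return false // no letters at all
}

func main() {
	fmt.Println(firstLetterUpper([]rune("(Pardosa)"))) // true: 'P' is the first letter
	fmt.Println(firstLetterUpper([]rune("123abc")))    // false: 'a' is the first letter
}
```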

type TokenJSON

type TokenJSON struct {
	Line    int    `json:"lineNumber"`
	Raw     string `json:"raw"`
	Cleaned string `json:"cleaned"`
	Start   int    `json:"start"`
	End     int    `json:"end"`
}

TokenJSON provides a presentation view for a Token.

type TokenNER added in v0.1.1

type TokenNER interface {
	// Raw is a verbatim presentation of a token as it appears in a text.
	Raw() []rune

	// Start is the index of the first rune of a token. The first rune
	// does not have to be alpha-numeric.
	Start() int

	// End is the index of the last rune of a token. The last rune does not
	// have to be alpha-numeric.
	End() int

	// Line returns the line number of the token in the text.
	Line() int

	// SetLine sets the line number.
	SetLine(int)

	// Cleaned is a presentation of a token after normalization.
	Cleaned() string

	// SetCleaned substitutes existing cleaned text with a new one.
	SetCleaned(string)

	// Properties is a fixed set of general properties that we determine during
	// the text traversal.
	Properties() *Properties

	// SetProperties substitutes existing properties with new ones.
	SetProperties(*Properties)

	// ProcessRaw computes a clean version of a name as well as properties
	// of the token.
	ProcessRaw()

	// ToJSON converts the TokenNER object into its JSON representation.
	ToJSON() ([]byte, error)
}

TokenNER represents a word separated by spaces in a text. Words split by new lines are concatenated.

func Tokenize

func Tokenize(text []rune, wrapToken func(TokenNER) TokenNER) []TokenNER

Tokenize creates a slice containing tokens for every word in the document.
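The actual splitting rules, including the concatenation of words broken across new lines, live inside the package. The core idea of space-separated tokenization with recorded positions can be sketched as follows (`tokenizeSpans` and `span` are hypothetical; note that `end` here is exclusive for slicing convenience, whereas the package's End() is documented as the index of the last rune):

```go
package main

import (
	"fmt"
	"unicode"
)

// span is a hypothetical stand-in for a token: the rune indices
// delimiting one word in the text.
type span struct{ start, end int }

// tokenizeSpans sketches the core of Tokenize: walk the runes and
// record the start and (exclusive) end index of every maximal run
// of non-space characters.
func tokenizeSpans(text []rune) []span {
	var spans []span
	start := -1 // -1 means we are between words
	for i, ch := range text {
		if unicode.IsSpace(ch) {
			if start >= 0 {
				spans = append(spans, span{start, i})
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 { // text ended mid-word
		spans = append(spans, span{start, len(text)})
	}
	return spans
}

func main() {
	text := []rune("one  two\nthree")
	for _, s := range tokenizeSpans(text) {
		fmt.Println(string(text[s.start:s.end]))
	}
	// one
	// two
	// three
}
```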
