token

package
v0.1.1
Warning

This package is not in the latest version of its module.

Published: Apr 15, 2021 License: MIT Imports: 3 Imported by: 5

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CalculateProperties added in v0.1.1

func CalculateProperties(raw, cleaned []rune, p *Properties)

CalculateProperties takes the raw and cleaned values of a token, computes properties of these values, and saves them into the given Properties object.
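The implementation is not shown on this page, but the documented behavior can be sketched in a self-contained form. The `calcProperties` helper below is hypothetical, not the package's own code, and covers only a subset of the Properties fields listed further down:

```go
package main

import (
	"fmt"
	"unicode"
)

// Properties mirrors a subset of the fields documented in this package.
type Properties struct {
	HasStartParens bool
	HasEndParens   bool
	HasDigits      bool
	HasLetters     bool
}

// calcProperties is a hypothetical sketch of the kind of work
// CalculateProperties performs: inspect the token's runes and fill p.
// The real function also receives the cleaned value; this sketch
// ignores it for brevity.
func calcProperties(raw, cleaned []rune, p *Properties) {
	if len(raw) > 0 {
		p.HasStartParens = raw[0] == '('
		p.HasEndParens = raw[len(raw)-1] == ')'
	}
	for _, ch := range raw {
		switch {
		case unicode.IsDigit(ch):
			p.HasDigits = true
		case unicode.IsLetter(ch):
			p.HasLetters = true
		}
	}
}

func main() {
	var p Properties
	calcProperties([]rune("(Pardosa)"), []rune("Pardosa"), &p)
	fmt.Printf("%+v\n", p)
	// {HasStartParens:true HasEndParens:true HasDigits:false HasLetters:true}
}
```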

Types

type Properties

type Properties struct {
	// HasStartParens token starts with '('.
	HasStartParens bool

	// HasEndParens token ends with ')'.
	HasEndParens bool

	// HasStartSqParens token starts with '['.
	HasStartSqParens bool

	// HasEndSqParens token ends with ']'.
	HasEndSqParens bool

	// HasEndDot token ends with '.'
	HasEndDot bool

	// HasEndComma token ends with ','
	HasEndComma bool

	// HasDigits token includes at least one '0-9'.
	HasDigits bool

	// HasLetters token includes at least one character for which
	// unicode.IsLetter(ch) is true.
	HasLetters bool

	// HasDash token includes '-'.
	HasDash bool

	// HasSpecialChars internal part of a token includes non-letters, non-digits.
	HasSpecialChars bool

	// IsCapitalized is true if the first letter of a token is capitalized.
	// The first letter does not have to be the first character.
	IsCapitalized bool

	// IsNumber internal part of a token has only numbers.
	IsNumber bool

	// IsWord internal part of a token includes only letters.
	IsWord bool
}

Properties is a fixed set of general properties determined during the text traversal.
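The IsCapitalized rule is worth emphasizing: it tests the first letter of the token, which does not have to be the first character. A self-contained sketch of that rule (`firstLetterUpper` is a hypothetical helper, not part of the package):

```go
package main

import (
	"fmt"
	"unicode"
)

// firstLetterUpper illustrates the documented IsCapitalized rule:
// find the first *letter* in the token and test whether it is
// upper case. Leading non-letter characters are skipped.
func firstLetterUpper(token []rune) bool {
	for _, ch := range token {
		if unicode.IsLetter(ch) {
			return unicode.IsUpper(ch)
		}
	}
	return false // no letters at all
}

func main() {
	fmt.Println(firstLetterUpper([]rune("(Pardosa)"))) // true: 'P' is the first letter
	fmt.Println(firstLetterUpper([]rune("123abc")))    // false: 'a' is the first letter
}
```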

type TokenJSON

type TokenJSON struct {
	Line    int    `json:"lineNumber"`
	Raw     string `json:"raw"`
	Cleaned string `json:"cleaned"`
	Start   int    `json:"start"`
	End     int    `json:"end"`
}

TokenJSON provides a presentation view for a Token.

type TokenNER added in v0.1.1

type TokenNER interface {
	// Raw is a verbatim presentation of a token as it appears in a text.
	Raw() []rune

	// Start is the index of the first rune of a token. The first rune
	// does not have to be alpha-numeric.
	Start() int

	// End is the index of the last rune of a token. The last rune does not
	// have to be alpha-numeric.
	End() int

	// Line returns the line number of the token in the text.
	Line() int

	// SetLine sets the line number.
	SetLine(int)

	// Cleaned is a presentation of a token after normalization.
	Cleaned() string

	// SetCleaned substitutes existing cleaned text with a new one.
	SetCleaned(string)

	// Properties is a fixed set of general properties that we determine during
	// the text traversal.
	Properties() *Properties

	// SetProperties substitutes existing properties with new ones.
	SetProperties(*Properties)

	// ProcessRaw computes a clean version of a name as well as properties
	// of the token.
	ProcessRaw()

	// ToJSON converts the TokenNER object into its JSON representation.
	ToJSON() ([]byte, error)
}

TokenNER represents a word separated by spaces in a text. Words split by new lines are concatenated.

func Tokenize

func Tokenize(text []rune, wrapToken func(TokenNER) TokenNER) []TokenNER

Tokenize creates a slice containing tokens for every word in the document.
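The actual splitting rules, including the concatenation of words broken across new lines, live inside the package. The core idea of space-separated tokenization with recorded positions can be sketched as follows (`tokenizeSpans` and `span` are hypothetical; note that `end` here is exclusive for slicing convenience, whereas the package's End() is documented as the index of the last rune):

```go
package main

import (
	"fmt"
	"unicode"
)

// span is a hypothetical stand-in for a token: the rune indices
// delimiting one word in the text.
type span struct{ start, end int }

// tokenizeSpans sketches the core of Tokenize: walk the runes and
// record the start and (exclusive) end index of every maximal run
// of non-space characters.
func tokenizeSpans(text []rune) []span {
	var spans []span
	start := -1 // -1 means we are between words
	for i, ch := range text {
		if unicode.IsSpace(ch) {
			if start >= 0 {
				spans = append(spans, span{start, i})
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 { // text ended mid-word
		spans = append(spans, span{start, len(text)})
	}
	return spans
}

func main() {
	text := []rune("one  two\nthree")
	for _, s := range tokenizeSpans(text) {
		fmt.Println(string(text[s.start:s.end]))
	}
	// one
	// two
	// three
}
```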
