Documentation ¶
Overview ¶
Package segmenter implements Unicode rules used to segment a paragraph of text according to several criteria. In particular, it provides a way of delimiting line break opportunities.
The API of the package follows the iterator pattern proposed in github.com/npillmayer/uax, but uses a somewhat simpler internal implementation, inspired by Pango.
The reference documentation is at https://unicode.org/reports/tr14 and https://unicode.org/reports/tr29.
Index ¶
- type Grapheme
- type GraphemeIterator
- type Line
- type LineIterator
- type Segmenter
- func (sg *Segmenter) GraphemeIterator() *GraphemeIterator
- func (seg *Segmenter) Init(paragraph []rune)
- func (seg *Segmenter) InitWithBytes(paragraph []byte) (err error)
- func (seg *Segmenter) InitWithString(paragraph string) (err error)
- func (sg *Segmenter) LineIterator() *LineIterator
- func (sg *Segmenter) WordIterator() *WordIterator
- type Word
- type WordIterator
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Grapheme ¶
type Grapheme struct {
	// Text is a subslice of the original input slice, containing the delimited grapheme
	Text []rune
	// Offset is the start of the grapheme in the input rune slice
	Offset int
	// OffsetInBytes is the start of the grapheme in the input, in UTF-8 bytes
	OffsetInBytes int
	// LengthInBytes is the length of the grapheme in the input, in UTF-8 bytes
	LengthInBytes int
}
Grapheme is the content of a grapheme delimited by the segmenter.
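The relationship between the rune-based Offset and the byte-based OffsetInBytes and LengthInBytes fields can be illustrated with the standard library alone. The sketch below is not the package's implementation; byteSpan is a hypothetical helper that derives the byte offset and byte length of a rune range, assuming every rune is valid.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteSpan is a hypothetical helper, not part of the package. It returns the
// UTF-8 byte offset and byte length of text[start:end], where start and end
// are rune indices. It assumes every rune is valid: utf8.RuneLen returns -1
// for an invalid rune, which would corrupt the sums.
func byteSpan(text []rune, start, end int) (offset, length int) {
	for _, r := range text[:start] {
		offset += utf8.RuneLen(r)
	}
	for _, r := range text[start:end] {
		length += utf8.RuneLen(r)
	}
	return offset, length
}

func main() {
	text := []rune("h\u00e9llo w\u00f6rld")
	// The second word starts at rune index 6 and spans 5 runes.
	offset, length := byteSpan(text, 6, 11)
	fmt.Println(offset, length) // 7 6
}
```

The byte values exceed the rune values here because the two accented letters each take two bytes in UTF-8.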
type GraphemeIterator ¶
type GraphemeIterator struct {
	// contains filtered or unexported fields
}
GraphemeIterator provides a convenient way of iterating over the graphemes delimited by a `Segmenter`.
func (*GraphemeIterator) Grapheme ¶
func (gr *GraphemeIterator) Grapheme() Grapheme
Grapheme returns the current `Grapheme`.
func (*GraphemeIterator) Next ¶
func (gr *GraphemeIterator) Next() bool
Next returns true if there is still a grapheme to process, and advances the iterator; otherwise it returns false.
type Line ¶
type Line struct {
	// Text is a subslice of the original input slice, containing the delimited line
	Text []rune
	// Offset is the start of the line in the input rune slice
	Offset int
	// OffsetInBytes is the start of the line in the input, in UTF-8 bytes
	OffsetInBytes int
	// LengthInBytes is the length of the line in the input, in UTF-8 bytes
	LengthInBytes int
	// IsMandatoryBreak is true if breaking (at the end of the line)
	// is mandatory
	IsMandatoryBreak bool
}
Line is the content of a line delimited by the segmenter.
type LineIterator ¶
type LineIterator struct {
	// contains filtered or unexported fields
}
LineIterator provides a convenient way of iterating over the lines delimited by a `Segmenter`.
func (*LineIterator) Next ¶
func (li *LineIterator) Next() bool
Next returns true if there is still a line to process, and advances the iterator; otherwise it returns false.
type Segmenter ¶
type Segmenter struct {
	// contains filtered or unexported fields
}
Segmenter is the entry point of the package.
Usage:
var seg Segmenter
seg.Init(...)
iter := seg.LineIterator()
for iter.Next() {
	... // do something with iter.Line()
}
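The Next/accessor loop shown in the usage can be tried out without the package through a toy stand-in. The sketch below is not the package's code: toyLine and toyLineIterator are hypothetical types, and the iterator only breaks after '\n', whereas the real segmenter applies the Unicode line breaking rules.

```go
package main

import "fmt"

// toyLine mirrors a subset of the documented Line fields.
type toyLine struct {
	Text   []rune // subslice of the input
	Offset int    // start of the line, in runes
}

// toyLineIterator is a hypothetical stand-in for LineIterator.
// It only breaks after '\n'.
type toyLineIterator struct {
	src  []rune
	pos  int // start of the next line, in runes
	line toyLine
}

// Next advances to the next line and reports whether one was found.
func (it *toyLineIterator) Next() bool {
	if it.pos >= len(it.src) {
		return false
	}
	start, end := it.pos, it.pos
	for end < len(it.src) {
		end++
		if it.src[end-1] == '\n' {
			break
		}
	}
	it.line = toyLine{Text: it.src[start:end], Offset: start}
	it.pos = end
	return true
}

// Line returns the line found by the last successful call to Next.
func (it *toyLineIterator) Line() toyLine { return it.line }

func main() {
	it := &toyLineIterator{src: []rune("one\ntwo")}
	for it.Next() {
		l := it.Line()
		fmt.Printf("%q offset=%d\n", string(l.Text), l.Offset)
	}
}
```

As in the documented types, Text is a subslice of the input rather than a copy, so it stays valid only as long as the input slice is not modified.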
func (*Segmenter) GraphemeIterator ¶
func (sg *Segmenter) GraphemeIterator() *GraphemeIterator
GraphemeIterator returns an iterator over the graphemes delimited in [Init].
func (*Segmenter) Init ¶
func (seg *Segmenter) Init(paragraph []rune)
Init resets the segmenter storage with the given input, and computes the attributes required to segment the text.
If paragraph includes an invalid rune, such as an out-of-range value, some outputs, such as [Line.OffsetInBytes] and [Line.LengthInBytes], are undefined.
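Callers who cannot guarantee valid input may want to check the rune slice before calling Init. The following is a minimal sketch using only the standard library; firstInvalidRune is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidRune is a hypothetical helper. It returns the index of the
// first rune that cannot be encoded as UTF-8 (out of range or a surrogate
// half), or -1 if every rune is valid.
func firstInvalidRune(paragraph []rune) int {
	for i, r := range paragraph {
		if !utf8.ValidRune(r) {
			return i
		}
	}
	return -1
}

func main() {
	ok := []rune("abc")
	bad := []rune{'a', 0x110000, 'b'} // 0x110000 is above unicode.MaxRune
	fmt.Println(firstInvalidRune(ok), firstInvalidRune(bad)) // -1 1
}
```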
func (*Segmenter) InitWithBytes ¶
func (seg *Segmenter) InitWithBytes(paragraph []byte) (err error)
InitWithBytes resets the segmenter storage with the given byte slice input, and computes the attributes required to segment the text.
InitWithBytes returns an error if paragraph includes an invalid UTF-8 sequence.
InitWithBytes is more efficient than [Init] if the input is a byte slice. No allocation for the text is made if the segmenter's internal buffer capacity is already large enough.
func (*Segmenter) InitWithString ¶
func (seg *Segmenter) InitWithString(paragraph string) (err error)
InitWithString resets the segmenter storage with the given string input, and computes the attributes required to segment the text.
InitWithString returns an error if paragraph includes an invalid UTF-8 sequence.
InitWithString is more efficient than [Init] if the input is a string. No allocation for the text is made if the segmenter's internal buffer capacity is already large enough.
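To report where an invalid UTF-8 sequence sits before handing input to InitWithBytes or InitWithString, the input can be scanned with the standard library. This is a sketch under that assumption only; firstInvalidByte is a hypothetical helper and says nothing about how the package itself detects the error.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidByte is a hypothetical helper. It returns the byte offset of
// the first invalid UTF-8 sequence in p, or -1 if p is valid UTF-8.
func firstInvalidByte(p []byte) int {
	for i := 0; i < len(p); {
		r, size := utf8.DecodeRune(p[i:])
		// DecodeRune signals an invalid sequence with (RuneError, 1).
		// A correctly encoded U+FFFD has size 3 and is not an error.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalidByte([]byte("h\u00e9llo")))  // -1
	fmt.Println(firstInvalidByte([]byte("ab\xffcd"))) // 2
}
```

For a string input, the same check applies after converting with []byte(s), or with utf8.ValidString when only a yes/no answer is needed.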
func (*Segmenter) LineIterator ¶
func (sg *Segmenter) LineIterator() *LineIterator
LineIterator returns an iterator over the lines delimited in [Init].
func (*Segmenter) WordIterator ¶
func (sg *Segmenter) WordIterator() *WordIterator
WordIterator returns an iterator over the words delimited in [Init].
type Word ¶
type Word struct {
	// Text is a subslice of the original input slice, containing the delimited word
	Text []rune
	// Offset is the start of the word in the input rune slice
	Offset int
	// OffsetInBytes is the start of the word in the input, in UTF-8 bytes
	OffsetInBytes int
	// LengthInBytes is the length of the word in the input, in UTF-8 bytes
	LengthInBytes int
}
Word is the content of a word delimited by the segmenter.
More precisely, a word is formed by runes with the [Alphabetic] property, or with a General_Category of Number, delimited by the Word Boundary Unicode Property.
See also https://unicode.org/reports/tr29/#Word_Boundary_Rules, http://unicode.org/reports/tr44/#Alphabetic and http://unicode.org/reports/tr44/#General_Category_Values
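The character test in this definition can be approximated with the tables of the standard unicode package. The sketch below is an assumption about how one might express it, not the package's code: isWordRune is a hypothetical helper, and it builds the Alphabetic property from its UAX #44 derivation (letters, letter numbers, and the Other_Alphabetic, Other_Lowercase and Other_Uppercase contributions). It does not apply the word boundary rules themselves.

```go
package main

import (
	"fmt"
	"unicode"
)

// isWordRune is a hypothetical helper. It reports whether r has the
// Alphabetic property (approximated from the standard library tables)
// or a General_Category of Number.
func isWordRune(r rune) bool {
	return unicode.In(r,
		unicode.L,                // letters: Lu, Ll, Lt, Lm, Lo
		unicode.Nl,               // letter numbers
		unicode.Other_Alphabetic, // remaining Alphabetic contributions
		unicode.Other_Lowercase,
		unicode.Other_Uppercase,
		unicode.N, // General_Category Number: Nd, Nl, No
	)
}

func main() {
	fmt.Println(isWordRune('a'), isWordRune('7'), isWordRune(' ')) // true true false
}
```

The result depends on the Unicode version bundled with the Go toolchain, which may differ from the version the segmenter's tables were generated from.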
type WordIterator ¶
type WordIterator struct {
	// contains filtered or unexported fields
}
func (*WordIterator) Next ¶
func (gr *WordIterator) Next() bool
Next returns true if there is still a word to process, and advances the iterator; otherwise it returns false.