package segmenter

v1.0.1 Published: Apr 26, 2026 License: BSD-3-Clause, Unlicense Imports: 3 Imported by: 0

Repository

github.com/nanorele/typesetting

Documentation ¶

Overview ¶

Package segmenter implements Unicode rules used to segment a paragraph of text according to several criteria. In particular, it provides a way of delimiting line break opportunities.

The API of the package follows the iterator pattern proposed in github.com/npillmayer/uax, but uses a somewhat simpler internal implementation, inspired by Pango.

The reference documentation is at https://unicode.org/reports/tr14 and https://unicode.org/reports/tr29.

Index ¶

type Grapheme
type GraphemeIterator
- func (gr *GraphemeIterator) Grapheme() Grapheme
- func (gr *GraphemeIterator) Next() bool
type Line
type LineIterator
- func (li *LineIterator) Line() Line
- func (li *LineIterator) Next() bool
type Segmenter
type Word
type WordIterator
- func (gr *WordIterator) Next() bool
- func (gr *WordIterator) Word() Word

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type Grapheme ¶

type Grapheme struct {
	// Text is a subslice of the original input slice, containing the delimited grapheme
	Text []rune
	// Offset is the start of the grapheme in the input rune slice
	Offset int
	// OffsetInBytes is the start of the grapheme in the input, in UTF-8 bytes
	OffsetInBytes int
	// LengthInBytes is the length of the grapheme in the input, in UTF-8 bytes
	LengthInBytes int
}

Grapheme is the content of a grapheme delimited by the segmenter.
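
To illustrate how the rune-based Offset relates to OffsetInBytes and LengthInBytes, here is a minimal sketch using only the standard library. The helper byteSpan is hypothetical and not part of this package; it simply sums the UTF-8 length of each rune, which is one plausible way such values could be derived.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteSpan is a hypothetical helper (not part of package segmenter).
// It returns the UTF-8 byte offset and byte length of text[start:end].
// utf8.RuneLen returns -1 for invalid runes, so the result is only
// meaningful when every rune in text is valid.
func byteSpan(text []rune, start, end int) (offset, length int) {
	for _, r := range text[:start] {
		offset += utf8.RuneLen(r)
	}
	for _, r := range text[start:end] {
		length += utf8.RuneLen(r)
	}
	return offset, length
}

func main() {
	text := []rune("h\u00e9llo") // 'é' takes 2 bytes in UTF-8
	// The rune at index 2 starts after 'h' (1 byte) and 'é' (2 bytes).
	offset, length := byteSpan(text, 2, 3)
	fmt.Println(offset, length) // prints "3 1"
}
```

The rune offset (2) and the byte offset (3) differ as soon as the input contains a multi-byte rune, which is why both are exposed.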

type GraphemeIterator ¶

type GraphemeIterator struct {
	// contains filtered or unexported fields
}

GraphemeIterator provides a convenient way of iterating over the graphemes delimited by a `Segmenter`.

func (*GraphemeIterator) Grapheme ¶

func (gr *GraphemeIterator) Grapheme() Grapheme

Grapheme returns the current `Grapheme`.

func (*GraphemeIterator) Next ¶

func (gr *GraphemeIterator) Next() bool

Next returns true if there is still a grapheme to process, and advances the iterator; otherwise it returns false.

type Line ¶

type Line struct {
	// Text is a subslice of the original input slice, containing the delimited line
	Text []rune
	// Offset is the start of the line in the input rune slice
	Offset int
	// OffsetInBytes is the start of the line in the input, in UTF-8 bytes
	OffsetInBytes int
	// LengthInBytes is the length of the line in the input, in UTF-8 bytes
	LengthInBytes int
	// IsMandatoryBreak is true if breaking (at the end of the line)
	// is mandatory
	IsMandatoryBreak bool
}

Line is the content of a line delimited by the segmenter.

type LineIterator ¶

type LineIterator struct {
	// contains filtered or unexported fields
}

LineIterator provides a convenient way of iterating over the lines delimited by a `Segmenter`.

func (*LineIterator) Line ¶

func (li *LineIterator) Line() Line

Line returns the current `Line`.

func (*LineIterator) Next ¶

func (li *LineIterator) Next() bool

Next returns true if there is still a line to process, and advances the iterator; otherwise it returns false.

type Segmenter ¶

type Segmenter struct {
	// contains filtered or unexported fields
}

Segmenter is the entry point of the package.

Usage:

var seg Segmenter
seg.Init(paragraph) // paragraph is the []rune input
iter := seg.LineIterator()
for iter.Next() {
	line := iter.Line()
	_ = line // do something with line
}
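
The Next/value loop above can be illustrated with a self-contained toy. This is only a sketch of the iterator pattern: toyLine and toyIterator are hypothetical types, and the toy splits at '\n' only, whereas the real LineIterator follows the Unicode line breaking rules and reports many more break opportunities.

```go
package main

import "fmt"

// toyLine mirrors two fields of Line. Hypothetical type, for illustration.
type toyLine struct {
	Text   []rune // subslice of the input
	Offset int    // start of the line in the input rune slice
}

// toyIterator is a hypothetical iterator that breaks only after '\n'.
type toyIterator struct {
	src  []rune
	pos  int
	line toyLine
}

// Next advances to the next line and reports whether one was found.
func (it *toyIterator) Next() bool {
	if it.pos >= len(it.src) {
		return false
	}
	start, end := it.pos, it.pos
	for end < len(it.src) {
		end++
		if it.src[end-1] == '\n' {
			break
		}
	}
	it.line = toyLine{Text: it.src[start:end], Offset: start}
	it.pos = end
	return true
}

// Line returns the line found by the last successful call to Next.
func (it *toyIterator) Line() toyLine { return it.line }

func main() {
	it := &toyIterator{src: []rune("ab\ncd")}
	for it.Next() {
		l := it.Line()
		fmt.Printf("%q offset=%d\n", string(l.Text), l.Offset)
	}
}
```

As in the package, the value accessor is only meaningful after Next has returned true, and the returned Text shares memory with the input.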

func (*Segmenter) GraphemeIterator ¶

func (sg *Segmenter) GraphemeIterator() *GraphemeIterator

GraphemeIterator returns an iterator over the graphemes delimited in [Init].

func (*Segmenter) Init ¶

func (seg *Segmenter) Init(paragraph []rune)

Init resets the segmenter storage with the given input, and computes the attributes required to segment the text.

If paragraph contains an invalid rune (for example, one outside the valid Unicode range), some outputs, such as [Line.OffsetInBytes] and [Line.LengthInBytes], are undefined.
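
Callers who cannot guarantee valid input may want to check the rune slice before calling Init. The following is a minimal sketch using the standard library; firstInvalidRune is a hypothetical helper, and it assumes that "invalid" here means a rune that cannot be encoded as UTF-8 (out of range or a surrogate half).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidRune is a hypothetical helper (not part of package segmenter).
// It returns the index of the first rune that cannot be encoded as UTF-8,
// or -1 if every rune is valid.
func firstInvalidRune(text []rune) int {
	for i, r := range text {
		if !utf8.ValidRune(r) {
			return i
		}
	}
	return -1
}

func main() {
	ok := []rune("abc")
	bad := []rune{'a', 0x110000, 0xD800} // out of range, then a surrogate half
	fmt.Println(firstInvalidRune(ok), firstInvalidRune(bad)) // prints "-1 1"
}
```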

func (*Segmenter) InitWithBytes ¶

func (seg *Segmenter) InitWithBytes(paragraph []byte) (err error)

InitWithBytes resets the segmenter storage with the given byte slice input, and computes the attributes required to segment the text.

InitWithBytes returns an error if paragraph includes an invalid UTF-8 sequence.

InitWithBytes is more efficient than [Init] if the input is a byte slice. No allocation is made for the text if the segmenter's internal buffer capacity is already large enough.
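
When InitWithBytes reports an error, it can help to know where the invalid sequence is. The sketch below locates the first invalid UTF-8 byte with the standard library; firstInvalidByte is a hypothetical helper, and the content of the error returned by InitWithBytes itself is not specified here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidByte is a hypothetical helper (not part of package segmenter).
// It returns the index of the first byte that starts an invalid UTF-8
// sequence, or -1 if p is entirely valid.
func firstInvalidByte(p []byte) int {
	for i := 0; i < len(p); {
		r, size := utf8.DecodeRune(p[i:])
		// DecodeRune signals an invalid encoding with (RuneError, 1);
		// a correctly encoded U+FFFD has size 3 and is accepted.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalidByte([]byte("ok")))            // prints "-1"
	fmt.Println(firstInvalidByte([]byte{'a', 0xff, 'b'})) // prints "1"
}
```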

func (*Segmenter) InitWithString ¶

func (seg *Segmenter) InitWithString(paragraph string) (err error)

InitWithString resets the segmenter storage with the given string input, and computes the attributes required to segment the text.

InitWithString returns an error if paragraph includes an invalid UTF-8 sequence.

InitWithString is more efficient than [Init] if the input is a string. No allocation is made for the text if the segmenter's internal buffer capacity is already large enough.
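
The allocation remark can be pictured with a small sketch of buffer reuse. This is an assumption about the general technique, not the package's actual code: decodeInto is a hypothetical helper that decodes a string into a caller-owned rune buffer and only allocates when the buffer's capacity is too small. Unlike InitWithString, it does not report invalid UTF-8; ranging over a string silently yields U+FFFD for invalid bytes.

```go
package main

import "fmt"

// decodeInto is a hypothetical helper (not part of package segmenter).
// It decodes s into buf, reusing buf's backing array when its capacity
// is large enough, and returns the resulting slice.
func decodeInto(buf []rune, s string) []rune {
	buf = buf[:0] // keep the capacity, drop the previous content
	for _, r := range s {
		buf = append(buf, r)
	}
	return buf
}

func main() {
	buf := make([]rune, 0, 64)
	buf = decodeInto(buf, "first paragraph")
	buf = decodeInto(buf, "second") // fits in the existing capacity
	fmt.Println(string(buf), cap(buf)) // prints "second 64"
}
```

Reusing one Segmenter across many paragraphs would benefit from the same effect, since the buffer grows to the largest input seen and is then kept.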

func (*Segmenter) LineIterator ¶

func (sg *Segmenter) LineIterator() *LineIterator

LineIterator returns an iterator over the lines delimited in [Init].

func (*Segmenter) WordIterator ¶

func (sg *Segmenter) WordIterator() *WordIterator

WordIterator returns an iterator over the words delimited in [Init].

type Word ¶

type Word struct {
	// Text is a subslice of the original input slice, containing the delimited word
	Text []rune
	// Offset is the start of the word in the input rune slice
	Offset int
	// OffsetInBytes is the start of the word in the input, in UTF-8 bytes
	OffsetInBytes int
	// LengthInBytes is the length of the word in the input, in UTF-8 bytes
	LengthInBytes int
}

Word is the content of a word delimited by the segmenter.

More precisely, a word is formed by runes with the [Alphabetic] property, or with a General_Category of Number, delimited by the Word Boundary Unicode Property.
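
The rune class described above can be approximated with Go's standard unicode tables. This is a sketch under assumptions: isWordRune is a hypothetical helper, it rebuilds the Alphabetic property from the letter categories, Nl, and the Other_Alphabetic, Other_Lowercase and Other_Uppercase properties, and the Unicode version of the standard library may differ from the one used by this package. It does not apply the word boundary rules themselves.

```go
package main

import (
	"fmt"
	"unicode"
)

// isWordRune is a hypothetical helper (not part of package segmenter).
// It reports whether r is Alphabetic or has a General_Category of Number,
// as approximated by the standard library tables.
func isWordRune(r rune) bool {
	// IsLetter covers the L categories; IsNumber covers Nd, Nl and No.
	return unicode.IsLetter(r) || unicode.IsNumber(r) ||
		unicode.In(r, unicode.Other_Alphabetic, unicode.Other_Lowercase, unicode.Other_Uppercase)
}

func main() {
	for _, r := range []rune{'a', '7', ' ', '-'} {
		fmt.Printf("%q %v\n", r, isWordRune(r))
	}
}
```

Under this approximation, spaces and punctuation are not word runes, so a run such as ", " between two words would not itself be reported as a word.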

type WordIterator ¶

type WordIterator struct {
	// contains filtered or unexported fields
}

func (*WordIterator) Next ¶

func (gr *WordIterator) Next() bool

Next returns true if there is still a word to process, and advances the iterator; otherwise it returns false.

func (*WordIterator) Word ¶

func (gr *WordIterator) Word() Word

Word returns the current `Word`.
