segmenter

package
v1.0.1 Latest
Warning

This package is not in the latest version of its module.

Published: Apr 26, 2026 License: BSD-3-Clause, Unlicense Imports: 3 Imported by: 0

Documentation

Overview

Package segmenter implements Unicode rules used to segment a paragraph of text according to several criteria. In particular, it provides a way of delimiting line break opportunities.

The API of the package follows the iterator pattern proposed in github.com/npillmayer/uax, but uses a somewhat simpler internal implementation, inspired by Pango.

The reference documentation is at https://unicode.org/reports/tr14 and https://unicode.org/reports/tr29.

Types

type Grapheme

type Grapheme struct {
	// Text is a subslice of the original input slice, containing the delimited grapheme
	Text []rune
	// Offset is the start of the grapheme in the input rune slice
	Offset int
	// OffsetInBytes is the start of the grapheme in the input, in UTF-8 bytes
	OffsetInBytes int
	// LengthInBytes is the length of the grapheme in the input, in UTF-8 bytes
	LengthInBytes int
}

Grapheme is the content of a grapheme delimited by the segmenter.
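
The relationship between the rune-based Offset and the byte-based OffsetInBytes and LengthInBytes fields can be sketched with the standard library alone. The helper byteSpan below is hypothetical and not part of this package; it assumes every rune in the input is valid.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteSpan is a hypothetical helper (not part of the segmenter package).
// Given the input rune slice and a segment's rune offset and rune length,
// it returns the segment's start and length in UTF-8 bytes.
// It assumes every rune is valid; utf8.RuneLen returns -1 otherwise.
func byteSpan(input []rune, offset, length int) (offsetInBytes, lengthInBytes int) {
	for _, r := range input[:offset] {
		offsetInBytes += utf8.RuneLen(r)
	}
	for _, r := range input[offset : offset+length] {
		lengthInBytes += utf8.RuneLen(r)
	}
	return offsetInBytes, lengthInBytes
}

func main() {
	input := []rune("héllo")
	// The segment "llo" starts at rune index 2; "h" is 1 byte and "é" is 2 bytes.
	off, n := byteSpan(input, 2, 3)
	fmt.Println(off, n) // prints "3 3"
}
```

The byte fields are convenient when the original text is held as a string or byte slice, since they can be used to slice it directly.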

type GraphemeIterator

type GraphemeIterator struct {
	// contains filtered or unexported fields
}

GraphemeIterator provides a convenient way of iterating over the graphemes delimited by a `Segmenter`.

func (*GraphemeIterator) Grapheme

func (gr *GraphemeIterator) Grapheme() Grapheme

Grapheme returns the current `Grapheme`.

func (*GraphemeIterator) Next

func (gr *GraphemeIterator) Next() bool

Next reports whether there is still a grapheme to process, advancing the iterator if so; otherwise it returns false.

type Line

type Line struct {
	// Text is a subslice of the original input slice, containing the delimited line
	Text []rune
	// Offset is the start of the line in the input rune slice
	Offset int
	// OffsetInBytes is the start of the line in the input, in UTF-8 bytes
	OffsetInBytes int
	// LengthInBytes is the length of the line in the input, in UTF-8 bytes
	LengthInBytes int
	// IsMandatoryBreak is true if breaking (at the end of the line)
	// is mandatory
	IsMandatoryBreak bool
}

Line is the content of a line delimited by the segmenter.

type LineIterator

type LineIterator struct {
	// contains filtered or unexported fields
}

LineIterator provides a convenient way of iterating over the lines delimited by a `Segmenter`.

func (*LineIterator) Line

func (li *LineIterator) Line() Line

Line returns the current `Line`.

func (*LineIterator) Next

func (li *LineIterator) Next() bool

Next reports whether there is still a line to process, advancing the iterator if so; otherwise it returns false.

type Segmenter

type Segmenter struct {
	// contains filtered or unexported fields
}

Segmenter is the entry point of the package.

Usage:

var seg Segmenter
seg.Init(...)
iter := seg.LineIterator()
for iter.Next() {
  ... // do something with iter.Line()
}
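
To make the Next/Line calling convention concrete, here is a minimal, self-contained sketch of the same iterator shape. The types toyLine and toyLineIterator are hypothetical stand-ins, not the package's types, and the toy breaks only after '\n' rather than applying the Unicode line breaking rules.

```go
package main

import "fmt"

// toyLine mirrors two fields of Line; it is a hypothetical stand-in.
type toyLine struct {
	Text   []rune // subslice of the input
	Offset int    // start of the line in the input rune slice
}

// toyLineIterator is a hypothetical stand-in that only breaks after '\n'.
// The real segmenter applies the UAX #14 rules instead.
type toyLineIterator struct {
	src  []rune
	pos  int // start of the next line to return
	line toyLine
}

// Next advances to the next line and reports whether one was found.
func (it *toyLineIterator) Next() bool {
	if it.pos >= len(it.src) {
		return false
	}
	start, end := it.pos, it.pos
	for end < len(it.src) {
		end++
		if it.src[end-1] == '\n' {
			break
		}
	}
	it.line = toyLine{Text: it.src[start:end], Offset: start}
	it.pos = end
	return true
}

// Line returns the line found by the last successful call to Next.
func (it *toyLineIterator) Line() toyLine { return it.line }

func main() {
	it := &toyLineIterator{src: []rune("ab\ncd")}
	for it.Next() {
		l := it.Line()
		fmt.Printf("%q offset=%d\n", string(l.Text), l.Offset)
	}
}
```

The loop body only reads the current value after Next has returned true, which is the same contract the usage snippet above relies on.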

func (*Segmenter) GraphemeIterator

func (sg *Segmenter) GraphemeIterator() *GraphemeIterator

GraphemeIterator returns an iterator over the graphemes delimited in [Init].

func (*Segmenter) Init

func (seg *Segmenter) Init(paragraph []rune)

Init resets the segmenter storage with the given input, and computes the attributes required to segment the text.

If paragraph includes an invalid rune (for example, one that is out of range), some outputs, such as [Line.OffsetInBytes] and [Line.LengthInBytes], are undefined.
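
Callers who cannot guarantee valid input may want to check the rune slice before calling Init. A minimal sketch using only the standard library follows; firstInvalidRune is a hypothetical helper, not part of this package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidRune is a hypothetical helper. It returns the index of the
// first rune that cannot be encoded as UTF-8 (out of range or a surrogate
// half), or -1 if every rune is valid.
func firstInvalidRune(paragraph []rune) int {
	for i, r := range paragraph {
		if !utf8.ValidRune(r) {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(firstInvalidRune([]rune("ok")))               // prints -1
	fmt.Println(firstInvalidRune([]rune{'a', 0x110000, 'b'})) // prints 1
}
```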

func (*Segmenter) InitWithBytes

func (seg *Segmenter) InitWithBytes(paragraph []byte) (err error)

InitWithBytes resets the segmenter storage with the given byte slice input, and computes the attributes required to segment the text.

InitWithBytes returns an error if paragraph includes an invalid UTF-8 sequence.

InitWithBytes is more efficient than [Init] if the input is a byte slice. No allocation for the text is made if its internal buffer capacity is already large enough.
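
When InitWithBytes returns an error, it can be useful to know where the input is malformed. The sketch below locates the first invalid UTF-8 sequence with the standard library; firstInvalidByte is a hypothetical helper, and nothing here describes how the package itself performs its check.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalidByte is a hypothetical helper. It returns the byte offset of
// the first invalid UTF-8 sequence in b, or -1 if b is valid UTF-8.
func firstInvalidByte(b []byte) int {
	for i := 0; i < len(b); {
		r, size := utf8.DecodeRune(b[i:])
		// (RuneError, 1) signals a bad encoding; a correctly encoded
		// U+FFFD is returned with size 3 and is not an error.
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(firstInvalidByte([]byte("héllo")))  // prints -1
	fmt.Println(firstInvalidByte([]byte("h\xffi"))) // prints 1
}
```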

func (*Segmenter) InitWithString

func (seg *Segmenter) InitWithString(paragraph string) (err error)

InitWithString resets the segmenter storage with the given string input, and computes the attributes required to segment the text.

InitWithString returns an error if paragraph includes an invalid UTF-8 sequence.

InitWithString is more efficient than [Init] if the input is a string. No allocation for the text is made if its internal buffer capacity is already large enough.
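
The "no allocation if the buffer is already large enough" behaviour can be illustrated with a generic buffer-reuse sketch. The runeBuffer type and its load method are hypothetical; they show one common way to get this property and are not taken from the package's source.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// runeBuffer is a hypothetical type that keeps its rune storage between calls.
type runeBuffer struct {
	text []rune
}

// load validates s, then decodes it into the existing storage.
// Truncating with [:0] keeps the capacity, so append only allocates
// when s has more runes than the buffer can already hold.
func (rb *runeBuffer) load(s string) error {
	if !utf8.ValidString(s) {
		return errors.New("invalid UTF-8 sequence")
	}
	rb.text = rb.text[:0]
	for _, r := range s {
		rb.text = append(rb.text, r)
	}
	return nil
}

func main() {
	var rb runeBuffer
	_ = rb.load("a longer paragraph")
	before := cap(rb.text)
	_ = rb.load("short")
	fmt.Println(cap(rb.text) == before, string(rb.text)) // prints "true short"
}
```

Validating first matters because ranging over a string silently replaces invalid bytes with U+FFFD instead of reporting them.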

func (*Segmenter) LineIterator

func (sg *Segmenter) LineIterator() *LineIterator

LineIterator returns an iterator over the lines delimited in [Init].

func (*Segmenter) WordIterator

func (sg *Segmenter) WordIterator() *WordIterator

WordIterator returns an iterator over the words delimited in [Init].

type Word

type Word struct {
	// Text is a subslice of the original input slice, containing the delimited word
	Text []rune
	// Offset is the start of the word in the input rune slice
	Offset int
	// OffsetInBytes is the start of the word in the input, in UTF-8 bytes
	OffsetInBytes int
	// LengthInBytes is the length of the word in the input, in UTF-8 bytes
	LengthInBytes int
}

Word is the content of a word delimited by the segmenter.

More precisely, a word is formed by runes with the [Alphabetic] property, or with a General_Category of Number, delimited by the Word Boundary Unicode Property.

See also https://unicode.org/reports/tr29/#Word_Boundary_Rules, http://unicode.org/reports/tr44/#Alphabetic and http://unicode.org/reports/tr44/#General_Category_Values
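
As a rough illustration of the rune classes mentioned above, the following sketch approximates "Alphabetic or General_Category Number" with the standard unicode package. isWordRune is a hypothetical helper; the package may use its own tables, and Go's Unicode version can differ from the one the package targets, so results are not guaranteed to match.

```go
package main

import (
	"fmt"
	"unicode"
)

// isWordRune is a hypothetical approximation of the rune classes that can
// form a word: the derived Alphabetic property, or General_Category Number.
func isWordRune(r rune) bool {
	// Alphabetic is derived from the letter categories, Nl and Other_Alphabetic.
	alphabetic := unicode.IsLetter(r) ||
		unicode.Is(unicode.Nl, r) ||
		unicode.Is(unicode.Other_Alphabetic, r)
	// IsNumber covers the Number categories Nd, Nl and No.
	return alphabetic || unicode.IsNumber(r)
}

func main() {
	for _, r := range "a9 ,é" {
		fmt.Printf("%q %v\n", r, isWordRune(r))
	}
}
```

This predicate only classifies individual runes; where words start and end is still decided by the word boundary rules of UAX #29.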

type WordIterator

type WordIterator struct {
	// contains filtered or unexported fields
}

func (*WordIterator) Next

func (gr *WordIterator) Next() bool

Next reports whether there is still a word to process, advancing the iterator if so; otherwise it returns false.

func (*WordIterator) Word

func (gr *WordIterator) Word() Word

Word returns the current `Word`.
