blingfire

package
v0.0.13 Latest
Warning

This package is not in the latest version of its module.

Published: May 28, 2026 License: MIT Imports: 3 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func TextToSentences

func TextToSentences(text string) []string

TextToSentences splits text into sentences using a regular expression.
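
To illustrate the general idea of regex-based sentence splitting, here is a minimal sketch. The function splitSentences and the pattern it uses are hypothetical stand-ins written for this example; the pattern that TextToSentences actually uses is not shown on this page and may differ, for instance in how it treats abbreviations such as "Dr.".

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sentenceRe matches a run of non-terminator characters followed by
// any sentence terminators (., !, ?) that close it.
// Assumption: this pattern is illustrative, not the package's own.
var sentenceRe = regexp.MustCompile(`[^.!?]+[.!?]*`)

// splitSentences is a hypothetical stand-in for TextToSentences.
// It trims surrounding whitespace and drops empty pieces.
func splitSentences(text string) []string {
	var out []string
	for _, m := range sentenceRe.FindAllString(text, -1) {
		if s := strings.TrimSpace(m); s != "" {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	for _, s := range splitSentences("Hello world. How are you? Fine!") {
		fmt.Printf("%q\n", s)
	}
}
```

A pattern this simple splits on every terminator, so "Dr. Smith arrived." would come back as two pieces. A production splitter would need extra rules for such cases.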

func TextToWords

func TextToWords(text string) []string

TextToWords splits text into words.
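
As a rough picture of what word splitting can look like, the sketch below extracts runs of letters and digits and keeps in-word apostrophes. The function splitWords and its pattern are assumptions made for this example; this page does not say whether TextToWords keeps or discards punctuation, so the real output may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordRe matches a run of Unicode letters or digits, optionally followed
// by apostrophe-joined letter runs (so "don't" stays one word).
// Assumption: illustrative pattern, not the package's own.
var wordRe = regexp.MustCompile(`[\p{L}\p{N}]+(?:'[\p{L}]+)*`)

// splitWords is a hypothetical stand-in for TextToWords.
func splitWords(text string) []string {
	return wordRe.FindAllString(text, -1)
}

func main() {
	for _, w := range splitWords("Don't panic, it's 42!") {
		fmt.Printf("%q\n", w)
	}
}
```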

Types

type Offset

type Offset struct {
	Start int
	End   int
}

Offset represents the start and end positions of a token within the original string.

func TextToSentencesWithOffsets

func TextToSentencesWithOffsets(text string) ([]string, []Offset)

TextToSentencesWithOffsets splits text into sentences and returns their offsets.
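
The sketch below shows one way sentences and their offsets can be produced together, so that each offset can be used to slice the sentence back out of the input. It assumes byte offsets with an exclusive End, which this page does not confirm. The function sentencesWithOffsets and its pattern are hypothetical; only the Offset struct mirrors the type documented above.

```go
package main

import (
	"fmt"
	"regexp"
)

// Offset mirrors the struct documented above.
// Assumption: Start is inclusive, End is exclusive, both in bytes.
type Offset struct {
	Start int
	End   int
}

// Illustrative pattern, not the package's own.
var sentenceRe = regexp.MustCompile(`[^.!?]+[.!?]*`)

func isSpace(b byte) bool {
	return b == ' ' || b == '\t' || b == '\n' || b == '\r'
}

// sentencesWithOffsets is a hypothetical stand-in for
// TextToSentencesWithOffsets.
func sentencesWithOffsets(text string) ([]string, []Offset) {
	var sents []string
	var offs []Offset
	for _, loc := range sentenceRe.FindAllStringIndex(text, -1) {
		start, end := loc[0], loc[1]
		// Shrink the span so it excludes surrounding whitespace.
		for start < end && isSpace(text[start]) {
			start++
		}
		for end > start && isSpace(text[end-1]) {
			end--
		}
		if start == end {
			continue
		}
		sents = append(sents, text[start:end])
		offs = append(offs, Offset{Start: start, End: end})
	}
	return sents, offs
}

func main() {
	text := "Hi there. Bye!"
	sents, offs := sentencesWithOffsets(text)
	for i, s := range sents {
		fmt.Printf("%q at [%d:%d]\n", s, offs[i].Start, offs[i].End)
	}
}
```

Under these assumptions, text[o.Start:o.End] equals the corresponding sentence for every returned pair, which is a handy invariant to check when mapping tokens back to the source text.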

func TextToWordsWithOffsets

func TextToWordsWithOffsets(text string) ([]string, []Offset)

TextToWordsWithOffsets splits text into words and returns their offsets.
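
Whether offsets count bytes or runes matters as soon as the text contains non-ASCII characters. The sketch below assumes byte offsets, which is what Go string slicing and the regexp package use natively; this page does not state which unit TextToWordsWithOffsets reports. The function wordsWithOffsets is a hypothetical stand-in.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// Offset mirrors the struct documented above.
// Assumption: byte positions, End exclusive.
type Offset struct {
	Start int
	End   int
}

// Illustrative pattern: runs of Unicode letters or digits.
var wordRe = regexp.MustCompile(`[\p{L}\p{N}]+`)

// wordsWithOffsets is a hypothetical stand-in for TextToWordsWithOffsets.
func wordsWithOffsets(text string) ([]string, []Offset) {
	var words []string
	var offs []Offset
	for _, loc := range wordRe.FindAllStringIndex(text, -1) {
		words = append(words, text[loc[0]:loc[1]])
		offs = append(offs, Offset{Start: loc[0], End: loc[1]})
	}
	return words, offs
}

func main() {
	// Two five-character and four-character words, each with one
	// two-byte character, so byte spans are longer than rune counts.
	text := "na\u00efve caf\u00e9"
	words, offs := wordsWithOffsets(text)
	for i, w := range words {
		fmt.Printf("%s bytes[%d:%d] runes=%d\n",
			w, offs[i].Start, offs[i].End, utf8.RuneCountInString(w))
	}
}
```

If a caller needs rune positions instead (for example, to drive a UI cursor), the byte offsets can be converted with utf8.RuneCountInString(text[:o.Start]).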

type SentenceTokenizer

type SentenceTokenizer struct {
	Language         string
	MinSentenceLen   int
	StreamContextLen int
}

func NewSentenceTokenizer

func NewSentenceTokenizer(language string, minSentenceLen, streamContextLen int) *SentenceTokenizer

func (*SentenceTokenizer) Stream

func (t *SentenceTokenizer) Stream(language string) tokenize.SentenceStream

func (*SentenceTokenizer) Tokenize

func (t *SentenceTokenizer) Tokenize(text string, language string) []string

type WordTokenizer

type WordTokenizer struct {
	Language string
}

func NewWordTokenizer

func NewWordTokenizer(language string) *WordTokenizer

func (*WordTokenizer) Stream

func (t *WordTokenizer) Stream(language string) tokenize.WordStream

func (*WordTokenizer) Tokenize

func (t *WordTokenizer) Tokenize(text string, language string) []string
