tokenizer

package
v1.33.13
Published: Jan 22, 2026 License: BSD-3-Clause Imports: 14 Imported by: 0

Documentation

Constants

This section is empty.

Variables

var (
	UseGse          = false // Load Japanese dictionary and prepare tokenizer
	UseGseCh        = false // Load Chinese dictionary and prepare tokenizer
	KagomeKrEnabled = false // Load Korean dictionary and prepare tokenizer
	KagomeJaEnabled = false // Load Japanese dictionary and prepare tokenizer
	// The Tokenizer Libraries can consume a lot of memory, so we limit the number of parallel tokenizers
	ApacTokenizerThrottle = chan struct{}(nil) // Throttle for tokenizers

)

Optional tokenizers are enabled through environment variables of the form 'ENABLE_TOKENIZER_XXX', for example 'ENABLE_TOKENIZER_GSE', 'ENABLE_TOKENIZER_KAGOME_KR', or 'ENABLE_TOKENIZER_KAGOME_JA'.
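
The two mechanisms described above, environment-variable flags and a channel-based throttle, can be sketched as follows. This is a minimal illustration and not the package's own code: the helper names envEnabled, newThrottle, and withThrottle are hypothetical, and both the set of truthy values and the limit of 2 are assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"sync"
)

// envEnabled reports whether the named environment variable holds a truthy
// value. The accepted spellings are an assumption of this sketch.
func envEnabled(name string) bool {
	switch strings.ToLower(strings.TrimSpace(os.Getenv(name))) {
	case "1", "true", "on", "enabled":
		return true
	}
	return false
}

// newThrottle returns a buffered channel used as a counting semaphore:
// at most limit goroutines can hold a slot at the same time.
func newThrottle(limit int) chan struct{} {
	return make(chan struct{}, limit)
}

// withThrottle runs fn while holding one slot of throttle.
// A nil throttle is treated as "no limit" (hypothetical convention).
func withThrottle(throttle chan struct{}, fn func()) {
	if throttle != nil {
		throttle <- struct{}{}        // acquire a slot, blocking when full
		defer func() { <-throttle }() // release the slot when fn returns
	}
	fn()
}

func main() {
	// Set the flags explicitly so the example is deterministic.
	os.Setenv("ENABLE_TOKENIZER_GSE", "true")
	os.Unsetenv("ENABLE_TOKENIZER_KAGOME_KR")
	fmt.Println(envEnabled("ENABLE_TOKENIZER_GSE"))       // true
	fmt.Println(envEnabled("ENABLE_TOKENIZER_KAGOME_KR")) // false

	throttle := newThrottle(2)
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			withThrottle(throttle, func() { /* memory-heavy tokenization would run here */ })
		}()
	}
	wg.Wait()
	fmt.Println(len(throttle)) // 0: every slot has been released
}
```

A buffered channel is a common way to cap concurrency in Go because acquiring a slot blocks naturally once the buffer is full, with no extra locking.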

Functions

func Tokenize

func Tokenize(tokenization string, in string) []string
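
The signature suggests that the first argument selects a tokenization strategy by name and the second is the text to split. The sketch below mirrors that shape with a hypothetical stand-in, tokenizeSketch. The strategy names "whitespace", "lowercase", "word", and "field" and their behavior are assumptions made for illustration; this page does not list the values that Tokenize accepts.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenizeSketch is a hypothetical stand-in with the same signature as
// Tokenize. The strategies it handles are assumptions of this sketch.
func tokenizeSketch(tokenization string, in string) []string {
	switch tokenization {
	case "whitespace":
		// Split on whitespace, keep case and punctuation.
		return strings.Fields(in)
	case "lowercase":
		// Lowercase, then split on whitespace.
		return strings.Fields(strings.ToLower(in))
	case "word":
		// Lowercase, then split on anything that is not a letter or number.
		return strings.FieldsFunc(strings.ToLower(in), func(r rune) bool {
			return !unicode.IsLetter(r) && !unicode.IsNumber(r)
		})
	case "field":
		// Treat the whole trimmed input as a single token.
		return []string{strings.TrimSpace(in)}
	default:
		// Unknown strategy: return no tokens.
		return []string{}
	}
}

func main() {
	fmt.Printf("%q\n", tokenizeSketch("word", "Hello, World!"))       // ["hello" "world"]
	fmt.Printf("%q\n", tokenizeSketch("whitespace", "Hello, World!")) // ["Hello," "World!"]
	fmt.Printf("%q\n", tokenizeSketch("field", "  Hello, World!  "))  // ["Hello, World!"]
}
```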

func TokenizeAndCountDuplicates

func TokenizeAndCountDuplicates(tokenization string, in string) ([]string, []int)
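
The two return values are presumably parallel slices: the distinct tokens and how often each one occurred. The page does not document ordering, so the sketch below assumes first-seen order. countDuplicates is a hypothetical helper, and strings.Fields stands in for the real tokenizer.

```go
package main

import (
	"fmt"
	"strings"
)

// countDuplicates collapses repeated tokens and returns two parallel slices:
// unique[i] occurred counts[i] times. Tokens keep their first-seen order,
// which is an assumption of this sketch.
func countDuplicates(tokens []string) ([]string, []int) {
	index := make(map[string]int, len(tokens)) // token -> position in unique
	unique := make([]string, 0, len(tokens))
	counts := make([]int, 0, len(tokens))
	for _, tok := range tokens {
		if i, seen := index[tok]; seen {
			counts[i]++
			continue
		}
		index[tok] = len(unique)
		unique = append(unique, tok)
		counts = append(counts, 1)
	}
	return unique, counts
}

func main() {
	terms, counts := countDuplicates(strings.Fields("to be or not to be"))
	fmt.Println(terms)  // [to be or not]
	fmt.Println(counts) // [2 2 1 1]
}
```

Returning counts alongside the unique terms lets a caller compute term frequencies without scanning the token list a second time.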

func TokenizeWithWildcards

func TokenizeWithWildcards(tokenization string, in string) []string
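
The name suggests a variant that keeps wildcard characters inside tokens instead of treating them as separators, which matters when the tokens feed a pattern query. The sketch below assumes the wildcards are '?' and '*' and that every other non-alphanumeric character splits tokens; neither detail is confirmed by this page, and splitKeepingWildcards is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitKeepingWildcards lowercases the input and splits it on every rune
// that is not a letter, a number, or one of the assumed wildcards '?' and '*'.
func splitKeepingWildcards(in string) []string {
	return strings.FieldsFunc(strings.ToLower(in), func(r rune) bool {
		if r == '?' || r == '*' {
			return false // wildcards stay part of the token
		}
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

func main() {
	fmt.Printf("%q\n", splitKeepingWildcards("Car*, bi?e!")) // ["car*" "bi?e"]
}
```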

Types

type KagomeTokenizers

type KagomeTokenizers struct {
	Korean   *kagomeTokenizer.Tokenizer
	Japanese *kagomeTokenizer.Tokenizer
}
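
A struct like this is typically filled lazily, since loading a dictionary is expensive and each language is optional. The sketch below shows one way to do that with a mutex-guarded holder. It is not the package's implementation: stubTokenizer stands in for *kagomeTokenizer.Tokenizer, and registry, its japanese method, and the loads counter are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// stubTokenizer is a hypothetical stand-in for *kagomeTokenizer.Tokenizer.
type stubTokenizer struct{ lang string }

// tokenizers mirrors the shape of KagomeTokenizers with stub types.
type tokenizers struct {
	Korean   *stubTokenizer
	Japanese *stubTokenizer
}

// registry guards lazy initialization of the tokenizers with a mutex.
type registry struct {
	mu    sync.Mutex
	toks  tokenizers
	loads int // number of (simulated) dictionary loads
}

// japanese returns the Japanese tokenizer, building it on first use.
// It returns nil when the tokenizer is not enabled.
func (r *registry) japanese(enabled bool) *stubTokenizer {
	if !enabled {
		return nil
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.toks.Japanese == nil {
		// A real implementation would load the dictionary here.
		r.toks.Japanese = &stubTokenizer{lang: "ja"}
		r.loads++
	}
	return r.toks.Japanese
}

func main() {
	r := &registry{}
	fmt.Println(r.japanese(false) == nil) // true: disabled, nothing loaded
	a := r.japanese(true)
	b := r.japanese(true)
	fmt.Println(a == b, r.loads) // true 1: built once, then reused
}
```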
