tokenizer

package
v1.35.2 Latest
Warning

This package is not in the latest version of its module.

Published: Dec 22, 2025 | License: BSD-3-Clause | Imports: 17 | Imported by: 0

Documentation

Constants

This section is empty.

Variables

var (
	UseGse   = false // Load Japanese dictionary and prepare tokenizer
	UseGseCh = false // Load Chinese dictionary and prepare tokenizer
	// The Tokenizer Libraries can consume a lot of memory, so we limit the number of parallel tokenizers
	ApacTokenizerThrottle = chan struct{}(nil) // Throttle for tokenizers
)
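
ApacTokenizerThrottle is declared as a nil chan struct{}, and its comment says it limits the number of parallel tokenizers. How the package initializes and uses the channel is not shown on this page. The sketch below only illustrates the general Go pattern of using a buffered chan struct{} as a counting semaphore. The function runThrottled and its local throttle channel are hypothetical and are not part of the package.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runThrottled starts `jobs` goroutines but lets at most `limit` of them do
// work at the same time. It returns the highest concurrency it observed.
// The throttle channel is a local stand-in, not ApacTokenizerThrottle itself.
func runThrottled(jobs, limit int) int32 {
	throttle := make(chan struct{}, limit) // capacity = maximum parallelism
	var wg sync.WaitGroup
	var active, peak int32

	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			throttle <- struct{}{}        // acquire a slot; blocks while the channel is full
			defer func() { <-throttle }() // release the slot when the work is done

			n := atomic.AddInt32(&active, 1)
			// Record the highest number of simultaneously active workers.
			for {
				p := atomic.LoadInt32(&peak)
				if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
					break
				}
			}
			// A memory-heavy tokenizer call would run here.
			atomic.AddInt32(&active, -1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	// The observed peak never exceeds the configured limit.
	fmt.Println(runThrottled(20, 3) <= 3) // prints "true"
}
```

A buffered channel keeps the limit in one place (its capacity) and needs no extra locking, which is probably why this shape suits a memory-bound workload.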

Optional tokenizers are enabled through environment variables named 'ENABLE_TOKENIZER_XXX', for example 'ENABLE_TOKENIZER_GSE', 'ENABLE_TOKENIZER_KAGOME_KR', or 'ENABLE_TOKENIZER_KAGOME_JA'.

Functions

func AddCustomDict added in v1.34.1

func AddCustomDict(className string, configs []*models.TokenizerUserDictConfig) error

func InitOptionalTokenizers added in v1.34.1

func InitOptionalTokenizers()

func NewUserDictFromModel added in v1.34.1

func NewUserDictFromModel(config *models.TokenizerUserDictConfig) (*dict.UserDict, error)

func Tokenize

func Tokenize(tokenization string, in string) []string

func TokenizeAndCountDuplicatesForClass added in v1.34.1

func TokenizeAndCountDuplicatesForClass(tokenization string, in string, class string) ([]string, []int)
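
The signature returns two parallel slices, a []string and a []int. One plausible reading, going by the function name, is that each distinct token appears once in the first slice and its number of occurrences sits at the same index in the second. The sketch below shows that shape with a plain whitespace split. The function countDuplicates is a hypothetical stand-in and makes no claim about the package's actual tokenization rules or ordering.

```go
package main

import (
	"fmt"
	"strings"
)

// countDuplicates is a hypothetical stand-in. It splits the input on
// whitespace and returns each distinct token once, in first-seen order,
// together with a parallel slice holding how often each token occurred.
func countDuplicates(in string) ([]string, []int) {
	index := make(map[string]int) // token -> position in terms/counts
	var terms []string
	var counts []int
	for _, tok := range strings.Fields(in) {
		if i, seen := index[tok]; seen {
			counts[i]++
			continue
		}
		index[tok] = len(terms)
		terms = append(terms, tok)
		counts = append(counts, 1)
	}
	return terms, counts
}

func main() {
	terms, counts := countDuplicates("a b a c b a")
	fmt.Println(terms, counts) // [a b c] [3 2 1]
}
```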

func TokenizeForClass added in v1.34.1

func TokenizeForClass(tokenization string, in string, class string) []string

func TokenizeWithWildcardsForClass added in v1.34.1

func TokenizeWithWildcardsForClass(tokenization string, in string, class string) []string

Types

type KagomeTokenizers

type KagomeTokenizers struct {
	Korean   *kagomeTokenizer.Tokenizer
	Japanese *kagomeTokenizer.Tokenizer
}
