tokenizer

package

v1.35.2 Published: Dec 22, 2025 License: BSD-3-Clause Imports: 17 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/weaviate/weaviate

Documentation ¶

Index ¶

Variables
func AddCustomDict(className string, configs []*models.TokenizerUserDictConfig) error
func InitOptionalTokenizers()
func NewUserDictFromModel(config *models.TokenizerUserDictConfig) (*dict.UserDict, error)
func Tokenize(tokenization string, in string) []string
func TokenizeAndCountDuplicatesForClass(tokenization string, in string, class string) ([]string, []int)
func TokenizeForClass(tokenization string, in string, class string) []string
func TokenizeWithWildcardsForClass(tokenization string, in string, class string) []string
type KagomeTokenizers

Constants ¶

This section is empty.

Variables ¶

var (
	UseGse   = false // Load Japanese dictionary and prepare tokenizer
	UseGseCh = false // Load Chinese dictionary and prepare tokenizer
	// The Tokenizer Libraries can consume a lot of memory, so we limit the number of parallel tokenizers
	ApacTokenizerThrottle = chan struct{}(nil) // Throttle for tokenizers

)
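The comment on ApacTokenizerThrottle describes limiting how many tokenizers run in parallel. One common way to do that with a `chan struct{}` is to treat a buffered channel as a counting semaphore. The sketch below illustrates that pattern only; the names `throttle`, `withThrottle` and `maxConcurrent` are hypothetical, and the nil check and acquire/release order are assumptions, not a copy of the package's implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// throttle is a hypothetical stand-in for a channel such as ApacTokenizerThrottle.
// A nil channel is taken to mean "no limit" (an assumption).
var throttle chan struct{}

// withThrottle runs fn while holding one slot of the throttle, if one is configured.
func withThrottle(fn func()) {
	if throttle != nil {
		throttle <- struct{}{}        // acquire: blocks while all slots are taken
		defer func() { <-throttle }() // release the slot when fn returns
	}
	fn()
}

// maxConcurrent pushes n jobs through withThrottle and reports the highest
// number of jobs that were inside the throttled section at the same time.
func maxConcurrent(n int) int64 {
	var running, peak int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			withThrottle(func() {
				cur := atomic.AddInt64(&running, 1)
				for {
					old := atomic.LoadInt64(&peak)
					if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
						break
					}
				}
				atomic.AddInt64(&running, -1)
			})
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	throttle = make(chan struct{}, 2) // allow at most two jobs in flight
	fmt.Println("peak concurrency:", maxConcurrent(16))
}
```

The channel capacity is the concurrency limit, so memory-heavy work never runs in more goroutines than the buffer has slots.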

var Tokenizations []string = []string{
	models.PropertyTokenizationWord,
	models.PropertyTokenizationLowercase,
	models.PropertyTokenizationWhitespace,
	models.PropertyTokenizationField,
	models.PropertyTokenizationTrigram,
}

Optional tokenizers are enabled through environment variables of the form 'ENABLE_TOKENIZER_XXX', for example 'ENABLE_TOKENIZER_GSE', 'ENABLE_TOKENIZER_KAGOME_KR' or 'ENABLE_TOKENIZER_KAGOME_JA'.
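As a rough illustration of the 'ENABLE_TOKENIZER_XXX' convention, the sketch below checks whether such a variable is set to a truthy value. The helper `tokenizerEnabled` is hypothetical, and the set of accepted values ("true", "1", "on", "enabled") is an assumption; the package may parse these variables differently.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// tokenizerEnabled reports whether ENABLE_TOKENIZER_<NAME> holds a truthy value.
// getenv is injected so the lookup can be exercised without touching the real
// environment. The accepted values are an assumption, not taken from the package.
func tokenizerEnabled(name string, getenv func(string) string) bool {
	key := "ENABLE_TOKENIZER_" + strings.ToUpper(name)
	switch strings.ToLower(strings.TrimSpace(getenv(key))) {
	case "true", "1", "on", "enabled":
		return true
	}
	return false
}

func main() {
	for _, name := range []string{"GSE", "KAGOME_KR", "KAGOME_JA"} {
		fmt.Printf("%s enabled: %v\n", name, tokenizerEnabled(name, os.Getenv))
	}
}
```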

Functions ¶

func AddCustomDict ¶ added in v1.34.1

func AddCustomDict(className string, configs []*models.TokenizerUserDictConfig) error

func InitOptionalTokenizers ¶ added in v1.34.1

func InitOptionalTokenizers()

func NewUserDictFromModel ¶ added in v1.34.1

func NewUserDictFromModel(config *models.TokenizerUserDictConfig) (*dict.UserDict, error)

func Tokenize ¶

func Tokenize(tokenization string, in string) []string
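The page gives no description of what each tokenization does, so the following is only a sketch of how a dispatcher of this shape might behave, written against the standard library. It assumes the tokenization names are the strings "word", "lowercase", "whitespace", "field" and "trigram", and that they behave as their names suggest; `tokenizeSketch` is a hypothetical function, and the trigram handling in particular is a guess rather than the package's actual algorithm.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isAlnum reports whether r is a letter or a number.
func isAlnum(r rune) bool { return unicode.IsLetter(r) || unicode.IsNumber(r) }

// tokenizeSketch is a hypothetical stand-in for a Tokenize-style dispatcher.
// The behavior of each branch is an assumption.
func tokenizeSketch(tokenization, in string) []string {
	switch tokenization {
	case "word":
		// Lowercase, then split on every non-alphanumeric rune.
		return strings.FieldsFunc(strings.ToLower(in), func(r rune) bool { return !isAlnum(r) })
	case "lowercase":
		// Lowercase, then split on whitespace only.
		return strings.Fields(strings.ToLower(in))
	case "whitespace":
		// Split on whitespace and keep case.
		return strings.Fields(in)
	case "field":
		// Treat the whole trimmed input as a single token.
		return []string{strings.TrimSpace(in)}
	case "trigram":
		// Lowercase, keep alphanumerics, then slide a window of three runes.
		var runes []rune
		for _, r := range strings.ToLower(in) {
			if isAlnum(r) {
				runes = append(runes, r)
			}
		}
		var out []string
		for i := 0; i+3 <= len(runes); i++ {
			out = append(out, string(runes[i:i+3]))
		}
		return out
	default:
		return nil // unknown tokenization
	}
}

func main() {
	for _, t := range []string{"word", "lowercase", "whitespace", "field", "trigram"} {
		fmt.Printf("%-10s %q\n", t, tokenizeSketch(t, "Hello, World!"))
	}
}
```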

func TokenizeAndCountDuplicatesForClass ¶ added in v1.34.1

func TokenizeAndCountDuplicatesForClass(tokenization string, in string, class string) ([]string, []int)
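The two return values suggest a list of distinct terms paired with how often each occurs. The sketch below shows one way to build such a pair from an already tokenized input. `countDuplicates` is a hypothetical helper, and the first-seen ordering of the terms is an assumption; the package may return them in a different order.

```go
package main

import (
	"fmt"
	"strings"
)

// countDuplicates collapses tokens into distinct terms and parallel counts.
// terms[i] occurs counts[i] times in tokens. Terms keep first-seen order,
// which is an assumption made for this sketch.
func countDuplicates(tokens []string) (terms []string, counts []int) {
	index := make(map[string]int, len(tokens)) // term -> position in terms
	for _, t := range tokens {
		if i, ok := index[t]; ok {
			counts[i]++
			continue
		}
		index[t] = len(terms)
		terms = append(terms, t)
		counts = append(counts, 1)
	}
	return terms, counts
}

func main() {
	terms, counts := countDuplicates(strings.Fields("to be or not to be"))
	for i := range terms {
		fmt.Printf("%s: %d\n", terms[i], counts[i])
	}
}
```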

func TokenizeForClass ¶ added in v1.34.1

func TokenizeForClass(tokenization string, in string, class string) []string

func TokenizeWithWildcardsForClass ¶ added in v1.34.1

func TokenizeWithWildcardsForClass(tokenization string, in string, class string) []string

Types ¶

type KagomeTokenizers ¶

type KagomeTokenizers struct {
	Korean   *kagomeTokenizer.Tokenizer
	Japanese *kagomeTokenizer.Tokenizer
}
