tokenizer

package
v1.37.0
Published: Apr 15, 2026 License: BSD-3-Clause Imports: 20 Imported by: 0

Documentation

Overview

Package tokenizer provides text tokenization and accent folding for Weaviate's inverted index.

Accent folding

FoldASCII removes diacritical marks from Latin characters while preserving the base letters. It uses a three-phase approach:

  1. Table-driven replacement for characters that Unicode NFD normalization does not decompose (ø→o, æ→ae, ß→ss, ð→d, þ→th, ł→l, đ→d, ħ→h, ŧ→t, etc.).
  2. NFD decomposition + stripping of combining marks (category Mn). Only Mn marks are stripped so that vowel signs in other scripts are not affected.
  3. NFC recomposition for clean storage.

Characters that have table-driven replacements but no NFD decomposition:

Character              | Replacement | Language / script
-----------------------|-------------|------------------------------------------
ł (U+0142) — L-stroke | l           | Polish
Ł (U+0141)            | L           | Polish
ø (U+00F8) — O-stroke | o           | Danish, Norwegian
Ø (U+00D8)            | O           | Danish, Norwegian
æ (U+00E6)            | ae          | Danish, Norwegian, Icelandic, Old English
Æ (U+00C6)            | AE          |
œ (U+0153)            | oe          | French
Œ (U+0152)            | OE          |
ß (U+00DF) — Eszett   | ss          | German
ẞ (U+1E9E)            | SS          | German (capital)
ð (U+00F0) — Eth      | d           | Icelandic, Old English
Ð (U+00D0)            | D           |
þ (U+00FE) — Thorn    | th          | Icelandic, Old English
Þ (U+00DE)            | Th          |
đ (U+0111) — D-stroke | d           | Croatian, Vietnamese
Đ (U+0110)            | D           |
ħ (U+0127) — H-stroke | h           | Maltese
Ħ (U+0126)            | H           |
ŧ (U+0167) — T-stroke | t           | Northern Sami
Ŧ (U+0166)            | T           |
ı (U+0131) — dotless i| i           | Turkish

Additional entries cover hooked, tailed, and other modified Latin letters that NFD does not decompose (ɓ→b, ƈ→c, ɗ→d, etc.).

The fold table is intentionally limited to Latin-script characters. CJK, Cyrillic, Arabic, Devanagari, and other scripts are passed through unchanged. Within the Latin block, characters that already have a clean NFD decomposition (e.g. é → e + combining acute) are handled entirely by Phases 2 and 3; the table covers only the remaining stroked letters, ligatures, special letters, and hooked/tailed letters.

Index

Constants

This section is empty.

Variables

View Source
var (
	UseGse   = false // Load Japanese dictionary and prepare tokenizer
	UseGseCh = false // Load Chinese dictionary and prepare tokenizer
	// The tokenizer libraries can consume a lot of memory, so we limit
	// the number of parallel tokenizers.
	ApacTokenizerThrottle = (chan struct{})(nil) // Throttle for tokenizers
)

Optional tokenizers can be enabled with environment variables of the form ENABLE_TOKENIZER_XXX, e.g. ENABLE_TOKENIZER_GSE, ENABLE_TOKENIZER_KAGOME_KR, ENABLE_TOKENIZER_KAGOME_JA.
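The throttle itself is the standard buffered-channel semaphore pattern. A minimal sketch, where the function name, worker count, and capacity are illustrative rather than the package's actual values:

```go
package main

import (
	"fmt"
	"sync"
)

// runThrottled starts n workers but lets at most maxParallel run at
// once, using a buffered channel as a semaphore (the pattern behind
// ApacTokenizerThrottle). It returns the peak observed parallelism.
func runThrottled(n, maxParallel int) (peak int) {
	throttle := make(chan struct{}, maxParallel)
	var wg sync.WaitGroup
	var mu sync.Mutex
	cur := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			throttle <- struct{}{}        // acquire a slot (blocks when full)
			defer func() { <-throttle }() // release the slot
			mu.Lock()
			cur++
			if cur > peak {
				peak = cur
			}
			mu.Unlock()
			// ... run a memory-hungry tokenizer here ...
			mu.Lock()
			cur--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println(runThrottled(8, 2) <= 2) // true
}
```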

Functions

func AddCustomDict added in v1.34.1

func AddCustomDict(className string, configs []*models.TokenizerUserDictConfig) error

func AnalyzeAndCountDuplicates added in v1.37.0

func AnalyzeAndCountDuplicates(
	text string,
	tokenization string,
	className string,
	prepared *PreparedAnalyzer,
	stopwords StopwordDetector,
) (terms []string, boosts []int)

AnalyzeAndCountDuplicates is like Analyze but also deduplicates tokens and returns per-token counts (boost factors). Used by BM25 scoring.
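The dedup-and-count step can be sketched as follows; countDuplicates is a hypothetical helper, not the package's API:

```go
package main

import "fmt"

// countDuplicates collapses repeated tokens into unique terms plus a
// per-term occurrence count, preserving first-seen order (a sketch of
// the terms/boosts shape AnalyzeAndCountDuplicates feeds to BM25).
func countDuplicates(tokens []string) (terms []string, boosts []int) {
	index := map[string]int{} // term -> position in terms
	for _, t := range tokens {
		if i, ok := index[t]; ok {
			boosts[i]++
			continue
		}
		index[t] = len(terms)
		terms = append(terms, t)
		boosts = append(boosts, 1)
	}
	return terms, boosts
}

func main() {
	terms, boosts := countDuplicates([]string{"fox", "the", "fox", "fox"})
	fmt.Println(terms, boosts) // [fox the] [3 1]
}
```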

func FoldASCII added in v1.37.0

func FoldASCII(s string, ignore *IgnoreSet) string

FoldASCII removes diacritical marks from Latin characters.

Phase 1: table-driven replacement for characters NFD doesn't decompose (ł→l, ø→o, æ→ae, ß→ss, ð→d, þ→th, etc.).

Phase 2: NFD decompose + strip combining marks (Mn category). When ignore is non-nil, base characters whose NFD suffix matches a precomputed pattern are preserved along with their combining marks.

Phase 3: NFC recompose to clean up any remaining sequences.

If ignore is non-nil, characters present in the set are preserved without folding.

func FoldASCIISlice added in v1.37.0

func FoldASCIISlice(terms []string, ignore *IgnoreSet) []string

FoldASCIISlice applies accent folding to each element of a string slice in-place.

func InitOptionalTokenizers added in v1.34.1

func InitOptionalTokenizers()

func NewUserDictFromModel added in v1.34.1

func NewUserDictFromModel(config *models.TokenizerUserDictConfig) (*dict.UserDict, error)

func Tokenize

func Tokenize(tokenization string, in string) []string

func TokenizeAndCountDuplicatesForClass added in v1.34.1

func TokenizeAndCountDuplicatesForClass(tokenization string, in string, class string) ([]string, []int)

func TokenizeForClass added in v1.34.1

func TokenizeForClass(tokenization string, in string, class string) []string

func TokenizeWithWildcardsForClass added in v1.34.1

func TokenizeWithWildcardsForClass(tokenization string, in string, class string) []string

Types

type AnalyzeResult added in v1.37.0

type AnalyzeResult struct {
	Indexed []string
	Query   []string
}

AnalyzeResult holds the output of Analyze: the indexed tokens and the query tokens (indexed minus stopwords).

func Analyze added in v1.37.0

func Analyze(
	text string,
	tokenization string,
	className string,
	prepared *PreparedAnalyzer,
	stopwords StopwordDetector,
) AnalyzeResult

Analyze runs the full text-analysis pipeline: ASCII-fold → tokenize → stopword removal (query only).

The PreparedAnalyzer may be nil (no folding). Create one via NewPreparedAnalyzer to reuse across multiple calls with the same config.

type IgnoreSet added in v1.37.0

type IgnoreSet struct {
	// contains filtered or unexported fields
}

IgnoreSet holds the precomputed data for the asciiFoldIgnore feature. Create one via BuildIgnoreSet.

func BuildIgnoreSet added in v1.37.0

func BuildIgnoreSet(chars []string) *IgnoreSet

BuildIgnoreSet converts a slice of strings (each typically a single character) into an IgnoreSet for use with FoldASCII. It pre-decomposes the ignored characters into base + combining-mark suffixes so that Phase 2 can match by simple string comparison instead of calling norm.NFC.String().

type KagomeTokenizers

type KagomeTokenizers struct {
	Korean   *kagomeTokenizer.Tokenizer
	Japanese *kagomeTokenizer.Tokenizer
}

type PreparedAnalyzer added in v1.37.0

type PreparedAnalyzer struct {
	// contains filtered or unexported fields
}

PreparedAnalyzer caches the ignore set built from a TextAnalyzerConfig so that repeated Analyze calls (e.g. over a text array) don't rebuild it each time. Create one via NewPreparedAnalyzer; when the schema is updated a new PreparedAnalyzer should be created.

func NewPreparedAnalyzer added in v1.37.0

func NewPreparedAnalyzer(cfg *models.TextAnalyzerConfig) *PreparedAnalyzer

NewPreparedAnalyzer pre-builds the ignore set from the given config. Returns nil when no folding is configured, which Analyze handles as a no-op.

type StopwordDetector added in v1.37.0

type StopwordDetector interface {
	IsStopword(word string) bool
}

StopwordDetector is satisfied by stopwords.Detector and test fakes.
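A map-backed test fake satisfies the interface with a one-line method, e.g.:

```go
package main

import "fmt"

// StopwordDetector matches the package's interface.
type StopwordDetector interface {
	IsStopword(word string) bool
}

// fakeDetector is a test fake: any word present in the map is a stopword.
type fakeDetector map[string]bool

func (f fakeDetector) IsStopword(word string) bool { return f[word] }

func main() {
	var d StopwordDetector = fakeDetector{"the": true, "a": true}
	fmt.Println(d.IsStopword("the"), d.IsStopword("fox")) // true false
}
```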
