Documentation ¶
Overview ¶
Package tokenizer provides text tokenization and accent folding for Weaviate's inverted index.
Accent folding ¶
FoldASCII removes diacritical marks from Latin characters while preserving the base letters. It uses a three-phase approach:
- Table-driven replacement for characters that Unicode NFD normalization does not decompose (ø→o, æ→ae, ß→ss, ð→d, þ→th, ł→l, đ→d, ħ→h, ŧ→t, etc.).
- NFD decomposition + stripping of combining marks (category Mn). Only Mn marks are stripped so that vowel signs in other scripts are not affected.
- NFC recomposition for clean storage.
Characters that have table-driven replacements but no NFD decomposition:
Character              | Replacement | Language / script
-----------------------|-------------|------------------------------------------
ł (U+0142) — L-stroke  | l           | Polish
Ł (U+0141)             | L           | Polish
ø (U+00F8) — O-stroke  | o           | Danish, Norwegian
Ø (U+00D8)             | O           | Danish, Norwegian
æ (U+00E6)             | ae          | Danish, Norwegian, Icelandic, Old English
Æ (U+00C6)             | AE          |
œ (U+0153)             | oe          | French
Œ (U+0152)             | OE          |
ß (U+00DF) — Eszett    | ss          | German
ẞ (U+1E9E)             | SS          | German (capital)
ð (U+00F0) — Eth       | d           | Icelandic, Old English
Ð (U+00D0)             | D           |
þ (U+00FE) — Thorn     | th          | Icelandic, Old English
Þ (U+00DE)             | Th          |
đ (U+0111) — D-stroke  | d           | Croatian, Vietnamese
Đ (U+0110)             | D           |
ħ (U+0127) — H-stroke  | h           | Maltese
Ħ (U+0126)             | H           |
ŧ (U+0167) — T-stroke  | t           | Northern Sami
Ŧ (U+0166)             | T           |
ı (U+0131) — dotless i | i           | Turkish
Additional entries cover hooked, tailed, and other modified Latin letters that NFD does not decompose (ɓ→b, ƈ→c, ɗ→d, etc.).
The fold table is intentionally limited to Latin-script characters. CJK, Cyrillic, Arabic, Devanagari, and other scripts are passed through unchanged. Within the Latin block, characters that already have a clean NFD decomposition (e.g. é → e + combining acute) are handled entirely by the decomposition phases; the table covers only the remaining stroked letters, ligatures, special letters, and hooked/tailed letters.
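The table-driven first phase can be sketched as follows. This is a minimal illustration, not the package's implementation: foldTable here is a hypothetical excerpt containing only a few of the entries listed above.

```go
package main

import (
	"fmt"
	"strings"
)

// foldTable is a hypothetical excerpt of the phase-1 replacements for
// characters that NFD normalization does not decompose.
var foldTable = map[rune]string{
	'ł': "l", 'Ł': "L",
	'ø': "o", 'Ø': "O",
	'æ': "ae", 'Æ': "AE",
	'ß': "ss",
	'ð': "d", 'þ': "th",
}

// foldPhase1 replaces table entries and passes all other runes through
// unchanged; accented characters like ó are left for the NFD phase.
func foldPhase1(s string) string {
	var b strings.Builder
	for _, r := range s {
		if rep, ok := foldTable[r]; ok {
			b.WriteString(rep)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(foldPhase1("straße")) // strasse
	fmt.Println(foldPhase1("Łódź"))   // Ł is folded by the table; ó and ź await phase 2
}
```

Non-Latin scripts simply fall through the map lookup, which is how CJK, Cyrillic, and other text passes through unchanged.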
Index ¶
- Variables
- func AddCustomDict(className string, configs []*models.TokenizerUserDictConfig) error
- func AnalyzeAndCountDuplicates(text string, tokenization string, className string, prepared *PreparedAnalyzer, ...) (terms []string, boosts []int)
- func FoldASCII(s string, ignore *IgnoreSet) string
- func FoldASCIISlice(terms []string, ignore *IgnoreSet) []string
- func InitOptionalTokenizers()
- func NewUserDictFromModel(config *models.TokenizerUserDictConfig) (*dict.UserDict, error)
- func Tokenize(tokenization string, in string) []string
- func TokenizeAndCountDuplicatesForClass(tokenization string, in string, class string) ([]string, []int)
- func TokenizeForClass(tokenization string, in string, class string) []string
- func TokenizeWithWildcardsForClass(tokenization string, in string, class string) []string
- type AnalyzeResult
- type IgnoreSet
- type KagomeTokenizers
- type PreparedAnalyzer
- type StopwordDetector
Constants ¶
This section is empty.
Variables ¶
var (
	UseGse   = false // Load Japanese dictionary and prepare tokenizer
	UseGseCh = false // Load Chinese dictionary and prepare tokenizer
	// The tokenizer libraries can consume a lot of memory, so we limit the number of parallel tokenizers
	ApacTokenizerThrottle = (chan struct{})(nil) // Throttle for tokenizers
)
var Tokenizations []string = []string{
	models.PropertyTokenizationWord,
	models.PropertyTokenizationLowercase,
	models.PropertyTokenizationWhitespace,
	models.PropertyTokenizationField,
	models.PropertyTokenizationTrigram,
}
Optional tokenizers can be enabled with an environment variable of the form ENABLE_TOKENIZER_XXX, e.g. ENABLE_TOKENIZER_GSE, ENABLE_TOKENIZER_KAGOME_KR, ENABLE_TOKENIZER_KAGOME_JA.
Functions ¶
func AddCustomDict ¶ added in v1.34.1
func AddCustomDict(className string, configs []*models.TokenizerUserDictConfig) error
func AnalyzeAndCountDuplicates ¶ added in v1.37.0
func AnalyzeAndCountDuplicates(
	text string,
	tokenization string,
	className string,
	prepared *PreparedAnalyzer,
	stopwords StopwordDetector,
) (terms []string, boosts []int)
AnalyzeAndCountDuplicates is like Analyze but also deduplicates tokens and returns per-token counts (boost factors). Used by BM25 scoring.
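The dedup-and-count step can be sketched like this. It is a simplified stand-in, not the package's implementation: repeated tokens are collapsed into unique terms with a parallel slice of occurrence counts.

```go
package main

import "fmt"

// countDuplicates collapses repeated tokens into unique terms plus a
// parallel slice of counts (boost factors), preserving first-seen order.
func countDuplicates(tokens []string) (terms []string, boosts []int) {
	index := make(map[string]int, len(tokens))
	for _, t := range tokens {
		if i, seen := index[t]; seen {
			boosts[i]++
			continue
		}
		index[t] = len(terms)
		terms = append(terms, t)
		boosts = append(boosts, 1)
	}
	return terms, boosts
}

func main() {
	terms, boosts := countDuplicates([]string{"fast", "car", "fast", "fast"})
	fmt.Println(terms)  // [fast car]
	fmt.Println(boosts) // [3 1]
}
```

A BM25 scorer can then weight each unique term once by its count instead of scoring duplicates separately.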
func FoldASCII ¶ added in v1.37.0
func FoldASCII(s string, ignore *IgnoreSet) string
FoldASCII removes diacritical marks from Latin characters.
Phase 1: table-driven replacement for characters NFD doesn't decompose (ł→l, ø→o, æ→ae, ß→ss, ð→d, þ→th, etc.).
Phase 2: NFD decompose + strip combining marks (Mn category). When ignore is non-nil, base characters whose NFD suffix matches a precomputed pattern are preserved along with their combining marks.
Phase 3: NFC recompose to clean up any remaining sequences.
If ignore is non-nil, characters present in the set are preserved without folding.
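The mark-stripping in phase 2 can be sketched with the standard library alone. This is an illustration under an assumption: the real package performs the NFD decomposition with golang.org/x/text/unicode/norm, whereas here the input is written out in decomposed form by hand.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripCombiningMarks removes Unicode combining marks (category Mn) from
// an already NFD-decomposed string. Only Mn is stripped, so vowel signs
// in other scripts (which are not Mn base letters here) are untouched.
func stripCombiningMarks(decomposed string) string {
	var b strings.Builder
	for _, r := range decomposed {
		if unicode.Is(unicode.Mn, r) {
			continue // drop the mark, keep the base letter
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	// "café" with the é written as base 'e' + U+0301 combining acute accent
	fmt.Println(stripCombiningMarks("cafe\u0301")) // cafe
}
```

The ignore-set check described above slots in before the strip: if the base rune plus its trailing marks match a precomputed pattern, the whole sequence is emitted unchanged.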
func FoldASCIISlice ¶ added in v1.37.0
func FoldASCIISlice(terms []string, ignore *IgnoreSet) []string
FoldASCIISlice applies accent folding to each element of a string slice in place.
func InitOptionalTokenizers ¶ added in v1.34.1
func InitOptionalTokenizers()
func NewUserDictFromModel ¶ added in v1.34.1
func NewUserDictFromModel(config *models.TokenizerUserDictConfig) (*dict.UserDict, error)
func TokenizeAndCountDuplicatesForClass ¶ added in v1.34.1
func TokenizeAndCountDuplicatesForClass(tokenization string, in string, class string) ([]string, []int)
func TokenizeForClass ¶ added in v1.34.1
func TokenizeForClass(tokenization string, in string, class string) []string
Types ¶
type AnalyzeResult ¶ added in v1.37.0
AnalyzeResult holds the output of Analyze: the indexed tokens and the query tokens (indexed minus stopwords).
func Analyze ¶ added in v1.37.0
func Analyze(
	text string,
	tokenization string,
	className string,
	prepared *PreparedAnalyzer,
	stopwords StopwordDetector,
) AnalyzeResult
Analyze runs the full text-analysis pipeline: ASCII-fold → tokenize → stopword removal (query only).
The PreparedAnalyzer may be nil (no folding). Create one via NewPreparedAnalyzer to reuse across multiple calls with the same config.
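The pipeline shape can be sketched as below. This is a simplified stand-in: lowercasing substitutes for ASCII folding, whitespace splitting substitutes for the configured tokenization, and the stopword predicate stands in for the StopwordDetector interface.

```go
package main

import (
	"fmt"
	"strings"
)

// analyze sketches the pipeline: fold → tokenize → stopword removal,
// where stopwords are removed from the query tokens only, so the
// indexed tokens keep them.
func analyze(text string, isStopword func(string) bool) (indexed, query []string) {
	folded := strings.ToLower(text)  // stand-in for ASCII folding
	indexed = strings.Fields(folded) // stand-in for tokenization
	for _, t := range indexed {
		if !isStopword(t) {
			query = append(query, t)
		}
	}
	return indexed, query
}

func main() {
	stop := map[string]bool{"the": true, "a": true}
	indexed, query := analyze("The quick fox", func(t string) bool { return stop[t] })
	fmt.Println(indexed) // [the quick fox]
	fmt.Println(query)   // [quick fox]
}
```

Keeping stopwords in the indexed tokens while dropping them from the query tokens is what lets the index support phrase-style lookups while queries skip noise terms.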
type IgnoreSet ¶ added in v1.37.0
type IgnoreSet struct {
// contains filtered or unexported fields
}
IgnoreSet holds the precomputed data for the asciiFoldIgnore feature. Create one via BuildIgnoreSet.
func BuildIgnoreSet ¶ added in v1.37.0
BuildIgnoreSet converts a slice of strings (each typically a single character) into an IgnoreSet for use with FoldASCII. It pre-decomposes the ignored characters into base + combining-mark suffixes so that Phase 2 can match by simple string comparison instead of calling norm.NFC.String().
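The pre-decomposition idea can be sketched as follows. This is an illustration under an assumption: the real package computes the NFD decompositions with golang.org/x/text/unicode/norm, whereas here they are written by hand, and the type and function names are hypothetical.

```go
package main

import "fmt"

// ignoreSet maps a base rune to the combining-mark suffixes that, when
// immediately following that base in NFD form, mark the character as
// ignored. Phase 2 can then preserve matches with a plain string
// comparison instead of renormalizing.
type ignoreSet map[rune][]string

// buildIgnoreSet splits each hand-written NFD decomposition into its
// base rune and combining-mark suffix.
func buildIgnoreSet(decomposed map[rune]string) ignoreSet {
	set := make(ignoreSet)
	for _, d := range decomposed {
		runes := []rune(d)
		set[runes[0]] = append(set[runes[0]], string(runes[1:]))
	}
	return set
}

func main() {
	// ignore "é" (NFD: 'e' + U+0301) so folding leaves it intact
	set := buildIgnoreSet(map[rune]string{'é': "e\u0301"})
	suffixes := set['e']
	fmt.Println(len(suffixes) == 1 && suffixes[0] == "\u0301") // true
}
```

Trading the up-front decomposition for a per-rune string comparison is what keeps the ignore check cheap on the hot folding path.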
type KagomeTokenizers ¶
type KagomeTokenizers struct {
Korean *kagomeTokenizer.Tokenizer
Japanese *kagomeTokenizer.Tokenizer
}
type PreparedAnalyzer ¶ added in v1.37.0
type PreparedAnalyzer struct {
// contains filtered or unexported fields
}
PreparedAnalyzer caches the ignore set built from a TextAnalyzerConfig so that repeated Analyze calls (e.g. over a text array) don't rebuild it each time. Create one via NewPreparedAnalyzer; when the schema is updated a new PreparedAnalyzer should be created.
func NewPreparedAnalyzer ¶ added in v1.37.0
func NewPreparedAnalyzer(cfg *models.TextAnalyzerConfig) *PreparedAnalyzer
NewPreparedAnalyzer pre-builds the ignore set from the given config. Returns nil when no folding is configured, which Analyze handles as a no-op.
type StopwordDetector ¶ added in v1.37.0
StopwordDetector is satisfied by stopwords.Detector and test fakes.