tokenizer

package
v1.37.0
Published: Apr 15, 2026 License: BSD-3-Clause Imports: 20 Imported by: 0

Documentation

Overview

Package tokenizer provides text tokenization and accent folding for Weaviate's inverted index.

Accent folding

FoldASCII removes diacritical marks from Latin characters while preserving the base letters. It uses a three-phase approach:

  1. Table-driven replacement for characters that Unicode NFD normalization does not decompose (ø→o, æ→ae, ß→ss, ð→d, þ→th, ł→l, đ→d, ħ→h, ŧ→t, etc.).
  2. NFD decomposition + stripping of combining marks (category Mn). Only Mn marks are stripped so that vowel signs in other scripts are not affected.
  3. NFC recomposition for clean storage.

Characters that have table-driven replacements but no NFD decomposition:

Character              | Replacement | Language / script
-----------------------|-------------|------------------------------------------
ł (U+0142) — L-stroke | l           | Polish
Ł (U+0141)            | L           | Polish
ø (U+00F8) — O-stroke | o           | Danish, Norwegian
Ø (U+00D8)            | O           | Danish, Norwegian
æ (U+00E6)            | ae          | Danish, Norwegian, Icelandic, Old English
Æ (U+00C6)            | AE          |
œ (U+0153)            | oe          | French
Œ (U+0152)            | OE          |
ß (U+00DF) — Eszett   | ss          | German
ẞ (U+1E9E)            | SS          | German (capital)
ð (U+00F0) — Eth      | d           | Icelandic, Old English
Ð (U+00D0)            | D           |
þ (U+00FE) — Thorn    | th          | Icelandic, Old English
Þ (U+00DE)            | Th          |
đ (U+0111) — D-stroke | d           | Croatian, Vietnamese
Đ (U+0110)            | D           |
ħ (U+0127) — H-stroke | h           | Maltese
Ħ (U+0126)            | H           |
ŧ (U+0167) — T-stroke | t           | Northern Sami
Ŧ (U+0166)            | T           |
ı (U+0131) — dotless i| i           | Turkish

Additional entries cover hooked, tailed, and other modified Latin letters that NFD does not decompose (ɓ→b, ƈ→c, ɗ→d, etc.).

The fold table is intentionally limited to Latin-script characters. CJK, Cyrillic, Arabic, Devanagari, and other scripts are passed through unchanged. Within the Latin block, characters that already have a clean NFD decomposition (e.g. é → e + combining acute) are handled entirely by Phases 2 and 3; the table covers only the remaining stroked letters, ligatures, special letters, and hooked/tailed letters.

Index

Constants

This section is empty.

Variables

View Source
var (
	UseGse   = false // Load Japanese dictionary and prepare tokenizer
	UseGseCh = false // Load Chinese dictionary and prepare tokenizer
	// The tokenizer libraries can consume a lot of memory, so we limit
	// the number of parallel tokenizers.
	ApacTokenizerThrottle = (chan struct{})(nil) // Throttle for tokenizers
)

Optional tokenizers can be enabled with environment variables of the form ENABLE_TOKENIZER_XXX, e.g. ENABLE_TOKENIZER_GSE, ENABLE_TOKENIZER_KAGOME_KR, ENABLE_TOKENIZER_KAGOME_JA.
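The throttle itself is the standard buffered-channel semaphore pattern. A minimal sketch, where the function name, worker count, and capacity are illustrative rather than the package's actual values:

```go
package main

import (
	"fmt"
	"sync"
)

// runThrottled starts n workers but lets at most maxParallel run at
// once, using a buffered channel as a semaphore (the pattern behind
// ApacTokenizerThrottle). It returns the peak observed parallelism.
func runThrottled(n, maxParallel int) (peak int) {
	throttle := make(chan struct{}, maxParallel)
	var wg sync.WaitGroup
	var mu sync.Mutex
	cur := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			throttle <- struct{}{}        // acquire a slot (blocks when full)
			defer func() { <-throttle }() // release the slot
			mu.Lock()
			cur++
			if cur > peak {
				peak = cur
			}
			mu.Unlock()
			// ... run a memory-hungry tokenizer here ...
			mu.Lock()
			cur--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println(runThrottled(8, 2) <= 2) // true
}
```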

Functions

func AddCustomDict added in v1.34.1

func AddCustomDict(className string, configs []*models.TokenizerUserDictConfig) error

func AnalyzeAndCountDuplicates added in v1.37.0

func AnalyzeAndCountDuplicates(
	text string,
	tokenization string,
	className string,
	prepared *PreparedAnalyzer,
	stopwords StopwordDetector,
) (terms []string, boosts []int)

AnalyzeAndCountDuplicates is like Analyze but also deduplicates tokens and returns per-token counts (boost factors). Used by BM25 scoring.
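The dedup-and-count step can be sketched as follows; countDuplicates is a hypothetical helper, not the package's API:

```go
package main

import "fmt"

// countDuplicates collapses repeated tokens into unique terms plus a
// per-term occurrence count, preserving first-seen order (a sketch of
// the terms/boosts shape AnalyzeAndCountDuplicates feeds to BM25).
func countDuplicates(tokens []string) (terms []string, boosts []int) {
	index := map[string]int{} // term -> position in terms
	for _, t := range tokens {
		if i, ok := index[t]; ok {
			boosts[i]++
			continue
		}
		index[t] = len(terms)
		terms = append(terms, t)
		boosts = append(boosts, 1)
	}
	return terms, boosts
}

func main() {
	terms, boosts := countDuplicates([]string{"fox", "the", "fox", "fox"})
	fmt.Println(terms, boosts) // [fox the] [3 1]
}
```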

func FoldASCII added in v1.37.0

func FoldASCII(s string, ignore *IgnoreSet) string

FoldASCII removes diacritical marks from Latin characters.

Phase 1: table-driven replacement for characters NFD doesn't decompose (ł→l, ø→o, æ→ae, ß→ss, ð→d, þ→th, etc.).

Phase 2: NFD decompose + strip combining marks (Mn category). When ignore is non-nil, base characters whose NFD suffix matches a precomputed pattern are preserved along with their combining marks.

Phase 3: NFC recompose to clean up any remaining sequences.

If ignore is non-nil, characters present in the set are preserved without folding.

func FoldASCIISlice added in v1.37.0

func FoldASCIISlice(terms []string, ignore *IgnoreSet) []string

FoldASCIISlice applies accent folding to each element of a string slice in-place.

func InitOptionalTokenizers added in v1.34.1

func InitOptionalTokenizers()

func NewUserDictFromModel added in v1.34.1

func NewUserDictFromModel(config *models.TokenizerUserDictConfig) (*dict.UserDict, error)

func Tokenize

func Tokenize(tokenization string, in string) []string

func TokenizeAndCountDuplicatesForClass added in v1.34.1

func TokenizeAndCountDuplicatesForClass(tokenization string, in string, class string) ([]string, []int)

func TokenizeForClass added in v1.34.1

func TokenizeForClass(tokenization string, in string, class string) []string

func TokenizeWithWildcardsForClass added in v1.34.1

func TokenizeWithWildcardsForClass(tokenization string, in string, class string) []string

Types

type AnalyzeResult added in v1.37.0

type AnalyzeResult struct {
	Indexed []string
	Query   []string
}

AnalyzeResult holds the output of Analyze: the indexed tokens and the query tokens (indexed minus stopwords).

func Analyze added in v1.37.0

func Analyze(
	text string,
	tokenization string,
	className string,
	prepared *PreparedAnalyzer,
	stopwords StopwordDetector,
) AnalyzeResult

Analyze runs the full text-analysis pipeline: ASCII-fold → tokenize → stopword removal (query only).

The PreparedAnalyzer may be nil (no folding). Create one via NewPreparedAnalyzer to reuse across multiple calls with the same config.

type IgnoreSet added in v1.37.0

type IgnoreSet struct {
	// contains filtered or unexported fields
}

IgnoreSet holds the precomputed data for the asciiFoldIgnore feature. Create one via BuildIgnoreSet.

func BuildIgnoreSet added in v1.37.0

func BuildIgnoreSet(chars []string) *IgnoreSet

BuildIgnoreSet converts a slice of strings (each typically a single character) into an IgnoreSet for use with FoldASCII. It pre-decomposes the ignored characters into base + combining-mark suffixes so that Phase 2 can match by simple string comparison instead of calling norm.NFC.String().

type KagomeTokenizers

type KagomeTokenizers struct {
	Korean   *kagomeTokenizer.Tokenizer
	Japanese *kagomeTokenizer.Tokenizer
}

type PreparedAnalyzer added in v1.37.0

type PreparedAnalyzer struct {
	// contains filtered or unexported fields
}

PreparedAnalyzer caches the ignore set built from a TextAnalyzerConfig so that repeated Analyze calls (e.g. over a text array) don't rebuild it each time. Create one via NewPreparedAnalyzer; when the schema is updated a new PreparedAnalyzer should be created.

func NewPreparedAnalyzer added in v1.37.0

func NewPreparedAnalyzer(cfg *models.TextAnalyzerConfig) *PreparedAnalyzer

NewPreparedAnalyzer pre-builds the ignore set from the given config. Returns nil when no folding is configured, which Analyze handles as a no-op.

type StopwordDetector added in v1.37.0

type StopwordDetector interface {
	IsStopword(word string) bool
}

StopwordDetector is satisfied by stopwords.Detector and test fakes.
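A map-backed test fake satisfies the interface with a one-line method, e.g.:

```go
package main

import "fmt"

// StopwordDetector matches the package's interface.
type StopwordDetector interface {
	IsStopword(word string) bool
}

// fakeDetector is a test fake: any word present in the map is a stopword.
type fakeDetector map[string]bool

func (f fakeDetector) IsStopword(word string) bool { return f[word] }

func main() {
	var d StopwordDetector = fakeDetector{"the": true, "a": true}
	fmt.Println(d.IsStopword("the"), d.IsStopword("fox")) // true false
}
```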
