tokenizer

package

Version: v0.8.2 (latest) | Published: May 24, 2026 | License: Apache-2.0 | Imports: 5 | Imported by: 0

Repository

gitcode.com/dscli/dscli

Documentation ¶

Overview ¶

Package tokenizer is the cornerstone of Chinese full-text search in dscli.

Without it, Chinese queries against FTS5 would largely fail. FTS5's default tokenizer (unicode61) splits text on whitespace and punctuation, but Chinese is written without spaces between words, so "全文搜索" is indexed as a single indivisible token and a search for a sub-word such as "全文" or "搜索" cannot match it.
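
The effect can be reproduced without SQLite. The sketch below is not FTS5's code; it only approximates a separator-based tokenizer by ending a token at every rune that is neither a letter nor a digit. The function name splitOnSeparators is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitOnSeparators approximates a separator-based tokenizer: every rune that
// is not a letter or digit ends the current token. Han characters count as
// letters, so a run of Chinese text is never split.
func splitOnSeparators(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

func main() {
	fmt.Printf("%q\n", splitOnSeparators("fix auth bug")) // three tokens
	fmt.Printf("%q\n", splitOnSeparators("全文搜索"))         // one token
}
```

English text splits into three tokens, while the Chinese phrase stays a single token, which is why a dictionary-based segmenter is needed before indexing.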

Tokenization (Tokenize) ¶

Tokenize uses gse (github.com/go-ego/gse) in search-engine mode (CutSearch), which emits both compound words and their sub-words:

"Go单元测试"  →  "go 单元 测试 单元测试"
"轻量级数据库" →  "轻量 量级 轻量级 数据 据库 数据库"

This dual-output strategy ensures high recall: a search for "单元" matches "Go单元测试", and "轻量级数据库" matches even if the user types "轻量" alone.

Query Sanitization (SanitizeFTS) ¶

FTS5 MATCH queries must mirror the indexing tokenization. SanitizeFTS tokenizes the user query with the same gse pipeline, filters stopwords, then wraps each token in double-quotes:

"全文搜索"  →  `"全文" "搜索"`
"fix auth"  →  `"fix" "auth"`

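The quoting step can be sketched as follows. This is an illustration, not the package's implementation: quoteForMatch is a hypothetical helper that takes tokens which have already been segmented and filtered. It relies on the FTS5 rule that a double quote inside a quoted string is escaped by doubling it.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteForMatch wraps each token in double quotes so FTS5 reads it as a
// string literal rather than query syntax. An embedded double quote is
// escaped by doubling it; empty tokens are dropped.
func quoteForMatch(tokens []string) string {
	quoted := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if t == "" {
			continue
		}
		quoted = append(quoted, `"`+strings.ReplaceAll(t, `"`, `""`)+`"`)
	}
	return strings.Join(quoted, " ")
}

func main() {
	fmt.Println(quoteForMatch([]string{"全文", "搜索"}))
	fmt.Println(quoteForMatch([]string{"fix", "auth"}))
}
```

Quoting keeps user input such as AND, OR, or NOT from being interpreted as FTS5 operators. Space-separated quoted terms are combined with an implicit AND.
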
Stopword Filtering ¶

Three embedded stopword lists (cn_stopwords, hit_stopwords, scu_stopwords) filter high-frequency function words (的/了/也/吧/吗/呢/是/一个) that carry little semantic meaning. A CJK-character gate (containsCJK) drops every list entry that contains no CJK character (hello, go, the, ———, 123), so English content words are never filtered by mistake.

The popular baidu_stopwords list was deliberately excluded: 39% of its 1,396 entries are English words (hello, go, the, unit, test…), real content words whose removal would destroy search recall.
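
A minimal sketch of such a gate is shown below. It assumes that "CJK" means Han characters (unicode.Han) and that lists hold one entry per line; the real containsCJK may cover more ranges. The names containsCJKSketch and buildStopSet are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"unicode"
)

// containsCJKSketch reports whether s has at least one Han character.
// Assumption: the real gate may accept additional CJK ranges.
func containsCJKSketch(s string) bool {
	for _, r := range s {
		if unicode.Is(unicode.Han, r) {
			return true
		}
	}
	return false
}

// buildStopSet reads one entry per line and keeps only entries that pass
// the CJK gate, so pure ASCII or numeric entries never become stopwords.
func buildStopSet(list string) map[string]struct{} {
	set := make(map[string]struct{})
	sc := bufio.NewScanner(strings.NewReader(list))
	for sc.Scan() {
		w := strings.TrimSpace(sc.Text())
		if w != "" && containsCJKSketch(w) {
			set[w] = struct{}{}
		}
	}
	return set
}

func main() {
	set := buildStopSet("的\n了\nhello\ngo\nthe\n123\n一个\n")
	_, goIsStop := set["go"]
	fmt.Println(len(set), goIsStop)
}
```

Only 的, 了, and 一个 survive the gate in this sample list, so a query for "go" is left intact.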

Design ¶

Lazy init: gse dictionary (~150k words) and stopword maps are loaded once on first use (sync.Once), not at package import.
Zero config: dictionaries and stopwords are embedded via go:embed, requiring no external files or environment setup.
Thread-safe: all shared state is initialized behind sync.Once.
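
The lazy-init pattern described above can be sketched like this. It is an assumption about the general shape, not the package's source: the stopword data is inlined as a constant where the real package would use a go:embed directive, and loadCount exists only to show that loading happens once.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// rawStopwords stands in for an embedded file; it is inlined here so the
// sketch stays self-contained.
const rawStopwords = "的\n了\n也\n"

var (
	loadOnce  sync.Once
	stopSet   map[string]struct{}
	loadCount int // instrumentation for this example only
)

// ensureLoaded builds the stopword set on first use and is a no-op afterwards.
func ensureLoaded() {
	loadOnce.Do(func() {
		loadCount++
		stopSet = make(map[string]struct{})
		for _, w := range strings.Fields(rawStopwords) {
			stopSet[w] = struct{}{}
		}
	})
}

// isStopSketch is safe to call from many goroutines: sync.Once serializes
// the one-time load, and the map is only read after that.
func isStopSketch(word string) bool {
	ensureLoaded()
	_, ok := stopSet[word]
	return ok
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			isStopSketch("的")
		}()
	}
	wg.Wait()
	fmt.Println(loadCount, isStopSketch("的"), isStopSketch("go"))
}
```

Importing the package costs nothing up front; the first caller pays the load cost, and concurrent callers block until that load finishes.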

Package tokenizer provides Chinese+English word segmentation for FTS5 full-text search indexing and querying.

It uses gse (github.com/go-ego/gse) in search-engine mode (CutSearch), which produces both compound words and their sub-words for better recall. Stopwords (high-frequency, low-meaning words like 的/了/也) are filtered out.

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func IsStop ¶

func IsStop(word string) bool

IsStop reports whether word is a stopword — a high-frequency, low-meaning word that should be excluded from full-text search indexing and queries.

It uses 3 popular Chinese stopword lists (embedded):

cn_stopwords.txt (Chinese function words, highest quality)
hit_stopwords.txt (HIT Information Retrieval Lab)
scu_stopwords.txt (Sichuan University, phrase-heavy)

Only entries containing CJK characters are kept; pure ASCII/symbol entries are skipped to avoid filtering English content words. This is why baidu_stopwords.txt was rejected: 39% of its entries are English words like "hello", "go", "the" — real content words that must not be filtered.

func SanitizeFTS ¶

func SanitizeFTS(query string) string

SanitizeFTS prepares a user query for FTS5 MATCH.

It tokenizes the query using gse search-engine mode (same as indexing), filters stopwords, then wraps each token in double-quotes. This mirrors how content is indexed so both sides share the same tokenization.

"全文搜索"       → `"全文" "搜索"`
"Go单元测试"     → `"go" "单元" "测试" "单元测试"`
"fix auth bug"  → `"fix" "auth" "bug"`

func Tokenize ¶

func Tokenize(s string) string

Tokenize returns space-joined Chinese+English words for FTS5 indexing/search.

It uses gse search-engine mode (CutSearch) for Chinese word segmentation, which produces both compound words and their sub-words for better recall. Whitespace, empty tokens, and stopwords are filtered.

Examples:

"Go单元测试"        → "go 单元 测试 单元测试"
"SQLite轻量级数据库"  → "sqlite 轻量 量级 轻量级 数据 据库 数据库"

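The filtering and joining step might look like the sketch below. It is a guess at the post-processing only, with segmentation left out: joinTokens is a hypothetical helper, and the lowercasing is an assumption inferred from the examples above, where "Go" and "SQLite" appear as "go" and "sqlite".

```go
package main

import (
	"fmt"
	"strings"
)

// joinTokens lowercases and trims each raw token, drops empty tokens and
// stopwords, and joins the rest with single spaces.
func joinTokens(raw []string, isStop func(string) bool) string {
	kept := make([]string, 0, len(raw))
	for _, t := range raw {
		t = strings.ToLower(strings.TrimSpace(t))
		if t == "" || isStop(t) {
			continue
		}
		kept = append(kept, t)
	}
	return strings.Join(kept, " ")
}

func main() {
	stop := map[string]bool{"的": true}
	isStop := func(w string) bool { return stop[w] }
	raw := []string{"SQLite", " ", "的", "轻量", "量级", "轻量级"}
	fmt.Println(joinTokens(raw, isStop))
}
```
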
Types ¶

This section is empty.

Source Files ¶

doc.go
stopwords.go
tokenizer.go
