tokenizer

package
v0.8.2
Published: May 24, 2026 License: Apache-2.0 Imports: 5 Imported by: 0

Documentation

Overview

Package tokenizer is the cornerstone of Chinese full-text search in dscli.

Without it, Chinese queries against FTS5 would fail completely. FTS5's default tokenizer (unicode61) splits on whitespace and punctuation, but Chinese text has no separators between words, so "全文搜索" is indexed as a single indivisible token and a query for one of its parts ("全文" or "搜索") cannot match it.

Tokenization (Tokenize)

Tokenize uses gse (github.com/go-ego/gse) in search-engine mode (CutSearch), which emits both compound words and their sub-words:

"Go单元测试"  →  "go 单元 测试 单元测试"
"轻量级数据库" →  "轻量 量级 轻量级 数据 据库 数据库"

This dual-output strategy ensures high recall: a search for "单元" matches "Go单元测试", and "轻量级数据库" matches even if the user types "轻量" alone.
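
The recall argument can be illustrated with a toy inverted index. This is only a sketch: buildIndex is a hypothetical helper, not part of this package, and the token strings are copied from the examples above rather than produced by gse.

```go
package main

import (
	"fmt"
	"strings"
)

// buildIndex maps each token to the IDs of the documents containing it.
// The values of docs are assumed to be already-segmented, space-joined
// token strings (hypothetical input, copied from the examples above).
func buildIndex(docs map[int]string) map[string][]int {
	index := make(map[string][]int)
	for id, tokens := range docs {
		for _, tok := range strings.Fields(tokens) {
			index[tok] = append(index[tok], id)
		}
	}
	return index
}

func main() {
	index := buildIndex(map[int]string{
		1: "go 单元 测试 单元测试",
		2: "轻量 量级 轻量级 数据 据库 数据库",
	})
	fmt.Println(index["单元"]) // [1]: the sub-word alone finds document 1
	fmt.Println(index["轻量"]) // [2]
	fmt.Println(index["全文"]) // []: never indexed, so no match
}
```

Had only the compound "单元测试" been indexed, the lookup for "单元" would have come back empty.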

Query Sanitization (SanitizeFTS)

FTS5 MATCH queries must mirror the indexing tokenization. SanitizeFTS tokenizes the user query with the same gse pipeline, filters stopwords, then wraps each token in double-quotes:

"全文搜索"  →  `"全文" "搜索"`
"fix auth"  →  `"fix" "auth"`

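The quoting step might look roughly like the sketch below. quoteTokens is a hypothetical name, and doubling embedded double quotes is an assumption about what a careful implementation would do (FTS5 accepts that SQL-style escape inside strings); the source does not say whether SanitizeFTS does this.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteTokens wraps each token in double quotes so that FTS5 treats it as
// a string literal instead of query syntax (AND, OR, NOT, *, and so on).
// Embedded double quotes are doubled. Empty tokens are dropped.
func quoteTokens(tokens []string) string {
	quoted := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		if tok == "" {
			continue
		}
		quoted = append(quoted, `"`+strings.ReplaceAll(tok, `"`, `""`)+`"`)
	}
	return strings.Join(quoted, " ")
}

func main() {
	fmt.Println(quoteTokens([]string{"全文", "搜索"})) // "全文" "搜索"
	fmt.Println(quoteTokens([]string{"fix", "auth"})) // "fix" "auth"
	fmt.Println(quoteTokens([]string{"AND", "a*"}))   // "AND" "a*"
}
```

Quoting every token means operator-like input such as AND reaches FTS5 as an ordinary search term.
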
Stopword Filtering

Three embedded stopword lists (cn_stopwords, hit_stopwords, scu_stopwords) filter high-frequency function words (的/了/也/吧/吗/呢/是/一个) that carry no semantic meaning. A CJK-character gate (containsCJK) auto-rejects non-Chinese entries (hello, go, the, ———, 123), preventing English content words from being mistakenly filtered.

The popular baidu_stopwords list was deliberately excluded: 39% of its 1,396 entries are English words (hello, go, the, unit, test…) — real content words that would destroy search recall if filtered.
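
A minimal sketch of such a gate follows. The name containsCJK comes from the description above, but this body is an assumption: it tests only for Han characters, and the real function may cover more ranges. loadStopwords is a hypothetical loader.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"unicode"
)

// containsCJK reports whether s has at least one Han character.
// Assumption: the package's own check may be broader than unicode.Han.
func containsCJK(s string) bool {
	for _, r := range s {
		if unicode.Is(unicode.Han, r) {
			return true
		}
	}
	return false
}

// loadStopwords reads one entry per line and keeps only entries that
// pass the CJK gate, so ASCII words and digits never become stopwords.
func loadStopwords(list string) map[string]struct{} {
	stop := make(map[string]struct{})
	sc := bufio.NewScanner(strings.NewReader(list))
	for sc.Scan() {
		w := strings.TrimSpace(sc.Text())
		if w != "" && containsCJK(w) {
			stop[w] = struct{}{}
		}
	}
	return stop
}

func main() {
	stop := loadStopwords("的\n了\nhello\ngo\n123\n一个\n")
	_, de := stop["的"]
	_, hello := stop["hello"]
	fmt.Println(len(stop), de, hello) // 3 true false
}
```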

Design

  • Lazy init: gse dictionary (~150k words) and stopword maps are loaded once on first use (sync.Once), not at package import.
  • Zero config: dictionaries and stopwords are embedded via go:embed, requiring no external files or environment setup.
  • Thread-safe: all shared state is behind sync.Once.
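
The lazy-initialization pattern can be sketched as below. All names here are hypothetical, and a string constant stands in for the go:embed data so the example runs without external files. The loads counter exists only to show that initialization runs once.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// rawStopwords stands in for a file that would be embedded via go:embed.
const rawStopwords = "的\n了\n也\n"

var (
	once      sync.Once
	stopwords map[string]struct{}
	loads     int // number of times the loader body ran (demo only)
)

// initStopwords builds the map on first call; later calls are no-ops.
func initStopwords() {
	once.Do(func() {
		loads++
		stopwords = make(map[string]struct{})
		for _, w := range strings.Fields(rawStopwords) {
			stopwords[w] = struct{}{}
		}
	})
}

// isStop triggers lazy loading, then does a read-only map lookup.
func isStop(word string) bool {
	initStopwords()
	_, ok := stopwords[word]
	return ok
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			isStop("的")
		}()
	}
	wg.Wait()
	fmt.Println(isStop("的"), isStop("go"), loads) // true false 1
}
```

Because the map is written only inside once.Do and is read-only afterwards, concurrent callers need no additional locking.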

Package tokenizer provides Chinese+English word segmentation for FTS5 full-text search indexing and querying.

It uses gse (github.com/go-ego/gse) search-engine mode (CutSearch) which produces both compound words and their sub-words for better recall. Stopwords (high-frequency, low-meaning words like 的/了/也) are filtered out.

Functions

func IsStop

func IsStop(word string) bool

IsStop reports whether word is a stopword — a high-frequency, low-meaning word that should be excluded from full-text search indexing and queries.

It uses 3 popular Chinese stopword lists (embedded):

  • cn_stopwords.txt (Chinese function words, highest quality)
  • hit_stopwords.txt (HIT Information Retrieval Lab)
  • scu_stopwords.txt (Sichuan University, phrase-heavy)

Only entries containing CJK characters are kept; pure ASCII/symbol entries are skipped to avoid filtering English content words. This is why baidu_stopwords.txt was rejected: 39% of its entries are English words like "hello", "go", "the" — real content words that must not be filtered.

func SanitizeFTS

func SanitizeFTS(query string) string

SanitizeFTS prepares a user query for FTS5 MATCH.

It tokenizes the query using gse search-engine mode (same as indexing), filters stopwords, then wraps each token in double-quotes. This mirrors how content is indexed so both sides share the same tokenization.

"全文搜索"       → `"全文" "搜索"`
"Go单元测试"     → `"go" "单元" "测试" "单元测试"`
"fix auth bug"  → `"fix" "auth" "bug"`

func Tokenize

func Tokenize(s string) string

Tokenize returns space-joined Chinese+English words for FTS5 indexing/search.

It uses gse search-engine mode (CutSearch) for Chinese word segmentation, which produces both compound words and their sub-words for better recall. Whitespace, empty tokens, and stopwords are filtered.

Examples:

"Go单元测试"        → "go 单元 测试 单元测试"
"SQLite轻量级数据库"  → "sqlite 轻量 量级 轻量级 数据 据库 数据库"

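The filtering described above could be sketched as a post-processing pass over the segmenter's output. joinSegments is a hypothetical helper, and the input slice is hand-written rather than produced by gse. Lowercasing at this step is an assumption inferred from the examples ("Go" becomes "go"); it may instead happen inside the segmenter.

```go
package main

import (
	"fmt"
	"strings"
)

// joinSegments lowercases and trims each segment, drops whitespace-only
// segments and stopwords, and joins the rest with single spaces.
func joinSegments(segments []string, isStop func(string) bool) string {
	kept := make([]string, 0, len(segments))
	for _, seg := range segments {
		seg = strings.ToLower(strings.TrimSpace(seg))
		if seg == "" || isStop(seg) {
			continue
		}
		kept = append(kept, seg)
	}
	return strings.Join(kept, " ")
}

func main() {
	stop := map[string]bool{"的": true}
	isStop := func(w string) bool { return stop[w] }

	// Hand-written stand-in for segmenter output.
	segments := []string{"Go", " ", "的", "单元", "测试", "单元测试"}
	fmt.Println(joinSegments(segments, isStop)) // go 单元 测试 单元测试
}
```
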
Source Files

  • doc.go
  • stopwords.go
  • tokenizer.go
