microfts2

v0.2.0 · Published: Apr 12, 2026 · License: MIT · Imports: 22 · Imported by: 0

Repository

github.com/zot/microfts2

README ¶

microfts2

A dynamic trigram index backed by LMDB, written in Go. Usable as a CLI tool or as a library.

microfts2 indexes files into raw byte trigrams (24-bit, 16M possible) organized by chunks, then maintains an inverted index for fast substring search. The index is maintained incrementally — every add/remove updates it immediately.

Install

go install github.com/zot/microfts2/cmd/microfts@latest

Or clone and build:

git clone https://github.com/zot/microfts2.git
cd microfts2
go build ./cmd/microfts

Quick Start

# Create a database
microfts init -db myindex

# Register a chunking strategy (any command that outputs range\tcontent lines)
microfts strategy add -db myindex -name lines -cmd "microfts chunk-lines"

# Add files
microfts add -db myindex -strategy lines src/*.go

# Search
microfts search -db myindex "func Open"
# output: src/db.go:198-260

# Check for stale files
microfts stale -db myindex

# Refresh stale files and search in one step
microfts search -r -db myindex "func Open"

How It Works

Every byte is its own value — no character set mapping. Three consecutive bytes form a 24-bit trigram. UTF-8 multibyte characters produce cross-boundary byte trigrams; character-internal trigrams are skipped.
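The extraction rule above can be sketched as follows. This is a minimal illustration, not the library's code: the function name `extractTrigrams`, the packing order (first byte in the high bits), and the reading of "character-internal" as "all three bytes belong to one UTF-8 character" are assumptions. Case folding, aliases, and whitespace handling are left out.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// extractTrigrams packs every 3-byte window of data into a 24-bit value and
// skips windows that lie entirely inside one multibyte UTF-8 character.
// Assumption: the first byte occupies the high bits.
func extractTrigrams(data []byte) []uint32 {
	// charStart[i] is the index of the first byte of the character holding byte i.
	charStart := make([]int, len(data))
	for i := 0; i < len(data); {
		_, size := utf8.DecodeRune(data[i:]) // invalid bytes decode with size 1
		for j := 0; j < size; j++ {
			charStart[i+j] = i
		}
		i += size
	}
	var out []uint32
	for i := 0; i+2 < len(data); i++ {
		if charStart[i] == charStart[i+2] {
			continue // all three bytes belong to the same character
		}
		out = append(out, uint32(data[i])<<16|uint32(data[i+1])<<8|uint32(data[i+2]))
	}
	return out
}

func main() {
	fmt.Printf("%06x\n", extractTrigrams([]byte("abcd")))
	fmt.Printf("%06x\n", extractTrigrams([]byte("a€b")))
}
```

In the second call the three bytes of the euro sign never form a trigram on their own, but the windows that straddle its boundaries are kept.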

Search computes the query's trigrams, optionally filters them via a caller-supplied TrigramFilter, intersects posting lists from the inverted index, and returns matching file/chunk locations.
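A rough sketch of the posting-list intersection step, assuming each trigram maps to a sorted, duplicate-free list of chunk IDs. The `candidates` and `intersect` helpers and the in-memory map are illustrative stand-ins; the real index lives in LMDB.

```go
package main

import "fmt"

// intersect returns the IDs present in both sorted, duplicate-free lists.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// candidates intersects the posting lists of all query trigrams.
// A trigram with no posting list means no chunk can match.
func candidates(index map[uint32][]uint64, query []uint32) []uint64 {
	if len(query) == 0 {
		return nil
	}
	result := index[query[0]]
	for _, t := range query[1:] {
		result = intersect(result, index[t])
		if len(result) == 0 {
			return nil
		}
	}
	return result
}

func main() {
	index := map[uint32][]uint64{
		0x66756e: {1, 2, 5, 9}, // "fun"
		0x756e63: {2, 5, 7},    // "unc"
	}
	fmt.Println(candidates(index, []uint32{0x66756e, 0x756e63}))
}
```

Intersection only narrows the candidate set; chunks that contain all trigrams may still not contain the query as a contiguous substring, which is presumably why the CLI offers `-verify`.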

Two-tree design: Content and index live in separate LMDB subdatabases. The content DB stores trigram frequency counts (sparse C records), file metadata, and settings. The index DB stores the inverted trigram-to-chunk mapping, maintained incrementally on every add/remove.

Dynamic trigram filtering: Query trigram selection is handled at search time via TrigramFilter functions. Stock filters include FilterAll (use all trigrams), FilterByRatio (skip high-frequency trigrams), and FilterBestN (keep N most selective). Callers can supply custom filters.
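To make the filter idea concrete, here is a hedged sketch of two filters in the spirit of FilterBestN and FilterByRatio. The local types `trigramCount` and `trigramFilter` are hypothetical mirrors; the field names and the meaning of the second argument (total chunk count) are assumptions, not the package's definitions.

```go
package main

import (
	"fmt"
	"sort"
)

// trigramCount pairs a trigram with the number of chunks containing it (assumed shape).
type trigramCount struct {
	Trigram uint32
	Count   int
}

// trigramFilter selects which query trigrams to use for a search.
type trigramFilter func(trigrams []trigramCount, totalChunks int) []trigramCount

// bestN keeps the n rarest trigrams, which are the most selective ones.
func bestN(n int) trigramFilter {
	if n < 0 {
		n = 0
	}
	return func(trigrams []trigramCount, _ int) []trigramCount {
		sorted := append([]trigramCount(nil), trigrams...) // leave the input untouched
		sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Count < sorted[j].Count })
		if len(sorted) > n {
			sorted = sorted[:n]
		}
		return sorted
	}
}

// byRatio drops trigrams that occur in more than maxRatio of all chunks.
func byRatio(maxRatio float64) trigramFilter {
	return func(trigrams []trigramCount, totalChunks int) []trigramCount {
		if totalChunks <= 0 {
			return trigrams
		}
		var kept []trigramCount
		for _, tc := range trigrams {
			if float64(tc.Count)/float64(totalChunks) <= maxRatio {
				kept = append(kept, tc)
			}
		}
		return kept
	}
}

func main() {
	query := []trigramCount{{0x746865, 900}, {0x717578, 3}, {0x6f7065, 40}}
	fmt.Println(bestN(2)(query, 1000))
	fmt.Println(byRatio(0.5)(query, 1000))
}
```

Both filters trade recall for speed: dropping common trigrams shortens the posting lists that have to be intersected.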

Staleness detection: Each indexed file records its modification time and SHA-256 hash. Checking mod time first avoids hashing unchanged files. The -r flag refreshes stale files before any command.

CLI Reference

All commands require -db <path>. Optional: -content-db, -index-db for custom subdatabase names.

| Command | Description |
|---------|-------------|
| `init` | Create a new database. `-case-insensitive`, `-aliases` |
| `add` | Add files. `-strategy <name>` |
| `search` | Search for text. `-regex`, `-score coverage\|density`, `-verify` |
| `delete` | Remove files from the database |
| `reindex` | Re-chunk files with a different strategy. `-strategy <name>` |
| `score` | Score named files against a query. `-score coverage\|density` |
| `stale` | List stale and missing files |
| `strategy add\|remove\|list` | Manage chunking strategies |
| `chunk-lines` | Built-in chunker: one chunk per line |
| `chunk-lines-overlap` | Built-in chunker: overlapping line windows. `-lines`, `-overlap` |
| `chunk-words-overlap` | Built-in chunker: overlapping word windows. `-words`, `-overlap`, `-pattern` |

Global flag: -r refreshes all stale files before running the command. Usable standalone (microfts -r -db path) or combined (microfts search -r -db path query).

Library API

import microfts "github.com/zot/microfts2"

// Lifecycle
db, err := microfts.Create(path, microfts.Options{})
db, err := microfts.Open(path, microfts.Options{})
db.Close()

// Content
fileid, err := db.AddFile(filepath, strategyName)
db.RemoveFile(filepath)
fileid, err := db.Reindex(filepath, strategyName)

// Search
results, err := db.Search("query", microfts.WithTrigramFilter(microfts.FilterAll))
results, err := db.SearchRegex("pattern")

// Scoring
chunks, err := db.ScoreFile("query", filepath, microfts.CoverageScore)

// Staleness
status, err := db.CheckFile(filepath)    // FileStatus{Path, Status, FileID, Strategy}
statuses, err := db.StaleFiles()         // all indexed files
refreshed, err := db.RefreshStale("")    // reindex stale files

// Strategies
db.AddStrategy(name, command)
db.AddStrategyFunc(name, fn)
db.RemoveStrategy(name)

Chunking Strategies

A chunking strategy is either an external command or a Go function that takes a file and produces chunks. Each chunk has an opaque range label and text content to index.

External chunker example using awk (prints one range\tcontent line per input line):

microfts strategy add -db myindex -name awk-lines \
  -cmd "awk '{printf \"%d-%d\\t%s\\n\", NR, NR, \$0}'"

License

MIT

Documentation ¶

Index ¶

Constants
Variables
func DecodeFilename(keys [][]byte) string
func DecodeTrigram(v uint32) string
func FinalKey(filename string) []byte
func LineChunkFunc(_ string, content []byte, yield func(Chunk) bool) error
func MarkdownChunkFunc(_ string, content []byte, yield func(Chunk) bool) error
func PairGet(pairs []Pair, key string) ([]byte, bool)
func ScoreOverlap(queryTrigrams []uint32, chunkCounts map[uint32]int, _ int) float64
func TrigramValue(a, b, c byte) uint32
func UnmarshalTValue(data []byte) ([]uint64, error)
func UnmarshalWValue(data []byte) ([]uint64, error)
func ValidateAliases(aliases map[byte]byte) error
type AppendOption
- func WithAppendChunkCallback(fn ChunkCallback) AppendOption
- func WithBaseLine(n int) AppendOption
- func WithContentHash(hash string) AppendOption
- func WithFileLength(n int64) AppendOption
- func WithModTime(t int64) AppendOption
type Bitset
- func (b *Bitset) Bytes() []byte
- func (b *Bitset) Count() int
- func (b *Bitset) ForEach(fn func(uint32))
- func (b *Bitset) FromBytes(data []byte)
- func (b *Bitset) Set(trigram uint32)
- func (b *Bitset) Test(trigram uint32) bool
type BracketGroup
type BracketLang
- func LangByName(name string) (BracketLang, bool)
type CRecord
- func UnmarshalCValue(data []byte) (CRecord, error)
- func (c *CRecord) DB() *DB
- func (c *CRecord) FileRecord(fileid uint64) (FRecord, error)
- func (c *CRecord) MarshalValue() []byte
- func (c *CRecord) Txn() *lmdb.Txn
type Chunk
type ChunkCache
- func (cc *ChunkCache) ChunkText(fpath, rangeLabel string) ([]byte, bool)
- func (cc *ChunkCache) GetChunks(fpath, targetRange string, before, after int) ([]ChunkResult, error)
type ChunkCallback
type ChunkFilter
type ChunkFunc
- func RunChunkerFunc(cmd string) ChunkFunc
type ChunkResult
type Chunker
- func BracketChunker(lang BracketLang) Chunker
- func IndentChunker(lang BracketLang, tabWidth int) Chunker
type DB
- func Create(path string, opts Options) (*DB, error)
- func Open(path string, opts Options) (*DB, error)
- func (db *DB) AddChunker(name string, c Chunker) error
- func (db *DB) AddFile(fpath, strategy string, opts ...IndexOption) (uint64, error)
- func (db *DB) AddFileWithContent(fpath, strategy string, opts ...IndexOption) (uint64, []byte, error)
- func (db *DB) AddStrategy(name, cmd string) error
- func (db *DB) AddStrategyFunc(name string, fn ChunkFunc) error
- func (db *DB) AddTmpFile(path, strategy string, content []byte, opts ...IndexOption) (uint64, error)
- func (db *DB) AppendChunks(fileid uint64, content []byte, strategy string, opts ...AppendOption) error
- func (db *DB) AppendTmpFile(path, strategy string, content []byte, opts ...AppendOption) (uint64, error)
- func (db *DB) BM25Func(queryTrigrams []uint32) (ScoreFunc, error)
- func (db *DB) CheckFile(fpath string) (FileStatus, error)
- func (db *DB) Close() error
- func (db *DB) Copy() *DB
- func (db *DB) Env() *lmdb.Env
- func (db *DB) FileIDPaths() (map[uint64]string, error)
- func (db *DB) FileInfoByID(fileid uint64) (FRecord, error)
- func (db *DB) GetChunks(fpath, targetRange string, before, after int) ([]ChunkResult, error)
- func (db *DB) HasTmp() bool
- func (db *DB) InvalidateCaches()
- func (db *DB) NewChunkCache() *ChunkCache
- func (db *DB) NewSearchCache() func()
- func (db *DB) QueryTrigramCounts(query string) ([]TrigramCount, error)
- func (db *DB) RecordCounts() (map[byte]RecordStats, error)
- func (db *DB) RefreshStale(strategy string, opts ...IndexOption) ([]FileStatus, error)
- func (db *DB) Reindex(fpath, strategy string, opts ...IndexOption) (uint64, error)
- func (db *DB) ReindexWithContent(fpath, strategy string, opts ...IndexOption) (uint64, []byte, error)
- func (db *DB) RemoveFile(fpath string) error
- func (db *DB) RemoveStrategy(name string) error
- func (db *DB) RemoveTmpFile(path string) error
- func (db *DB) ScoreFile(query, fpath string, fn ScoreFunc, opts ...SearchOption) ([]ScoredChunk, error)
- func (db *DB) Search(query string, opts ...SearchOption) (*SearchResults, error)
- func (db *DB) SearchFuzzy(query string, k int, opts ...SearchOption) (*SearchResults, error)
- func (db *DB) SearchMulti(query string, strategies map[string]ScoreFunc, k int, opts ...SearchOption) ([]MultiSearchResult, error)
- func (db *DB) SearchRegex(pattern string, opts ...SearchOption) (*SearchResults, error)
- func (db *DB) Settings() Settings
- func (db *DB) StaleFiles() ([]FileStatus, error)
- func (db *DB) TmpContent(path string) (*bytes.Reader, error)
- func (db *DB) TmpFileIDs() map[uint64]struct{}
- func (db *DB) UpdateTmpFile(path, strategy string, content []byte, opts ...IndexOption) error
- func (db *DB) Version() (string, error)
type FRecord
- func UnmarshalFHeader(data []byte) (FRecord, error)
- func UnmarshalFValue(data []byte) (FRecord, error)
- func (f *FRecord) MarshalValue() []byte
type FileChunkEntry
type FileStatus
type FuncChunker
- func (fc FuncChunker) ChunkText(path string, content []byte, rangeLabel string) ([]byte, bool)
- func (fc FuncChunker) Chunks(path string, content []byte, yield func(Chunk) bool) error
type HRecord
type IndexOption
- func WithChunkCallback(fn ChunkCallback) IndexOption
type IndexStatus
type KeyPair
- func EncodeFilename(filename string) []KeyPair
type MultiSearchResult
type Options
type Pair
- func CopyPairs(src []Pair) []Pair
type RecordStats
type ScoreFunc
- func ScoreBM25(idf map[uint32]float64, avgTokenCount float64) ScoreFunc
type ScoredChunk
type SearchOption
- func WithAfter(t time.Time) SearchOption
- func WithBefore(t time.Time) SearchOption
- func WithChunkCache(cc *ChunkCache) SearchOption
- func WithChunkFilter(fn ChunkFilter) SearchOption
- func WithCoverage() SearchOption
- func WithDensity() SearchOption
- func WithExcept(ids map[uint64]struct{}) SearchOption
- func WithExceptRegex(patterns ...string) SearchOption
- func WithLoose() SearchOption
- func WithNoTmp() SearchOption
- func WithOnly(ids map[uint64]struct{}) SearchOption
- func WithOverlap() SearchOption
- func WithProximityRerank(topN int) SearchOption
- func WithRegexFilter(patterns ...string) SearchOption
- func WithScoring(fn ScoreFunc) SearchOption
- func WithTrigramFilter(fn TrigramFilter) SearchOption
- func WithVerify() SearchOption
type SearchResult
type SearchResults
type Settings
type StringDelim
type TRecord
- func (t *TRecord) MarshalValue() []byte
type TokenEntry
type TrigramCount
- func FilterAll(trigrams []TrigramCount, _ int) []TrigramCount
type TrigramEntry
type TrigramFilter
- func FilterBestN(n int) TrigramFilter
- func FilterByRatio(maxRatio float64) TrigramFilter
type Trigrams
- func NewTrigrams(caseInsensitive bool, aliases map[byte]byte) *Trigrams
- func (t *Trigrams) EncodeTrigram(s string) (uint32, bool)
- func (t *Trigrams) ExtractTrigrams(data []byte) []uint32
- func (t *Trigrams) TrigramCounts(data []byte) map[uint32]int
type TxnHolder
type WRecord
- func (w *WRecord) MarshalValue() []byte

Constants ¶

const BitsetSize = 2097152 // 2^21 bytes = 2^24 bits

Variables ¶

var ErrAlreadyIndexed = errors.New("file already indexed")

ErrAlreadyIndexed is returned when AddFile is called for a path that already has F records in the database. Use Reindex or AppendChunks instead. R215

var ErrNoChunks = errors.New("chunker produced no chunks")

ErrNoChunks is returned when a chunker produces zero chunks for a file.

var LangC = BracketLang{
	LineComments:  []string{"//"},
	BlockComments: [][2]string{{"/*", "*/"}},
	StringDelims: []StringDelim{
		{Open: `"`, Close: `"`, Escape: `\`},
		{Open: "'", Close: "'", Escape: `\`},
	},
	Brackets: []BracketGroup{
		{Open: []string{"{"}, Close: []string{"}"}},
		{Open: []string{"("}, Close: []string{")"}},
		{Open: []string{"["}, Close: []string{"]"}},
	},
}

LangC is the bracket language config for C/C++.

var LangGo = BracketLang{
	LineComments:  []string{"//"},
	BlockComments: [][2]string{{"/*", "*/"}},
	StringDelims: []StringDelim{
		{Open: `"`, Close: `"`, Escape: `\`},
		{Open: "`", Close: "`"},
		{Open: "'", Close: "'", Escape: `\`},
	},
	Brackets: []BracketGroup{
		{Open: []string{"{"}, Close: []string{"}"}},
		{Open: []string{"("}, Close: []string{")"}},
		{Open: []string{"["}, Close: []string{"]"}},
	},
}

LangGo is the bracket language config for Go.

var LangJS = BracketLang{
	LineComments:  []string{"//"},
	BlockComments: [][2]string{{"/*", "*/"}},
	StringDelims: []StringDelim{
		{Open: `"`, Close: `"`, Escape: `\`},
		{Open: "'", Close: "'", Escape: `\`},
		{Open: "`", Close: "`", Escape: `\`},
	},
	Brackets: []BracketGroup{
		{Open: []string{"{"}, Close: []string{"}"}},
		{Open: []string{"("}, Close: []string{")"}},
		{Open: []string{"["}, Close: []string{"]"}},
	},
}

LangJS is the bracket language config for JavaScript.

var LangJava = LangC

LangJava is the bracket language config for Java.

var LangLisp = BracketLang{
	LineComments: []string{";"},
	StringDelims: []StringDelim{
		{Open: `"`, Close: `"`, Escape: `\`},
	},
	Brackets: []BracketGroup{
		{Open: []string{"("}, Close: []string{")"}},
		{Open: []string{"["}, Close: []string{"]"}},
	},
}

LangLisp is the bracket language config for Lisp/Scheme/Clojure.

var LangNginx = BracketLang{
	LineComments: []string{"#"},
	StringDelims: []StringDelim{
		{Open: `"`, Close: `"`, Escape: `\`},
		{Open: "'", Close: "'"},
	},
	Brackets: []BracketGroup{
		{Open: []string{"{"}, Close: []string{"}"}},
	},
}

LangNginx is the bracket language config for nginx.

var LangPascal = BracketLang{
	BlockComments: [][2]string{{"{", "}"}, {"(*", "*)"}},
	StringDelims: []StringDelim{
		{Open: "'", Close: "'"},
	},
	Brackets: []BracketGroup{
		{
			Open:       []string{"begin", "record", "class"},
			Separators: []string{},
			Close:      []string{"end"},
		},
		{
			Open:       []string{"if"},
			Separators: []string{"then", "else"},
			Close:      []string{"end"},
		},
		{Open: []string{"("}, Close: []string{")"}},
		{Open: []string{"["}, Close: []string{"]"}},
	},
}

LangPascal is the bracket language config for Pascal.

var LangShell = BracketLang{
	LineComments: []string{"#"},
	StringDelims: []StringDelim{
		{Open: `"`, Close: `"`, Escape: `\`},
		{Open: "'", Close: "'"},
	},
	Brackets: []BracketGroup{
		{
			Open:       []string{"if"},
			Separators: []string{"then", "elif", "else"},
			Close:      []string{"fi"},
		},
		{
			Open:       []string{"while", "for"},
			Separators: []string{"do"},
			Close:      []string{"done"},
		},
		{
			Open:  []string{"case"},
			Close: []string{"esac"},
		},
		{Open: []string{"{"}, Close: []string{"}"}},
		{Open: []string{"("}, Close: []string{")"}},
	},
}

LangShell is the bracket language config for Bourne shell / bash.

Functions ¶

func DecodeFilename ¶

func DecodeFilename(keys [][]byte) string

DecodeFilename reconstructs a filename from chained N record keys. Keys must be in order (part 0, 1, ..., 255).

func DecodeTrigram ¶

func DecodeTrigram(v uint32) string

DecodeTrigram converts a 24-bit trigram value back to a 3-byte string. Bytes that are 0 (whitespace-encoded) are shown as spaces.

func FinalKey ¶

func FinalKey(filename string) []byte

FinalKey returns the final N record key for direct fileid lookup.

func LineChunkFunc ¶

func LineChunkFunc(_ string, content []byte, yield func(Chunk) bool) error

LineChunkFunc is a built-in ChunkFunc that yields one chunk per line. Range is "N-N" (1-based line number).

func MarkdownChunkFunc ¶

func MarkdownChunkFunc(_ string, content []byte, yield func(Chunk) bool) error

MarkdownChunkFunc splits markdown content into paragraph-based chunks. Heading lines start new chunks; a heading and its following paragraph (up to the next blank line or heading) form one chunk. Blank lines are boundaries only and are not included in any chunk's content. Fenced code blocks (``` or ~~~) suppress blank-line splitting: all lines from the opening fence through the matching close belong to the current chunk. R465, R466, R467, R468

func PairGet ¶

func PairGet(pairs []Pair, key string) ([]byte, bool)

PairGet returns the value of the first Pair matching key and true, or nil and false if no Pair matches.

func ScoreOverlap ¶

func ScoreOverlap(queryTrigrams []uint32, chunkCounts map[uint32]int, _ int) float64

ScoreOverlap returns the count of query trigrams that appear in the chunk, with no normalization (OR semantics).
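A minimal sketch of an overlap score with the same signature shape. Whether the real function counts duplicate query trigrams once or several times is not stated, so this version simply counts each entry of the query slice.

```go
package main

import "fmt"

// overlap counts the query trigrams that occur at least once in the chunk.
// The third parameter is ignored, mirroring the documented signature.
func overlap(queryTrigrams []uint32, chunkCounts map[uint32]int, _ int) float64 {
	matched := 0
	for _, t := range queryTrigrams {
		if chunkCounts[t] > 0 {
			matched++
		}
	}
	return float64(matched)
}

func main() {
	query := []uint32{0x66756e, 0x756e63, 0x6e6374} // "fun", "unc", "nct"
	chunk := map[uint32]int{0x66756e: 2, 0x756e63: 1}
	fmt.Println(overlap(query, chunk, 0))
}
```

Because a chunk scores for every trigram it shares with the query, a partially matching chunk still ranks above an unrelated one, which is what OR semantics means here.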

func TrigramValue ¶

func TrigramValue(a, b, c byte) uint32

TrigramValue computes the 24-bit trigram from three byte values.
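One plausible packing, shown together with the reverse mapping described for DecodeTrigram. The byte order (first byte in the high bits) is an assumption; the helper names `pack` and `unpack` are illustrative.

```go
package main

import "fmt"

// pack combines three bytes into a 24-bit value, first byte highest (assumed order).
func pack(a, b, c byte) uint32 {
	return uint32(a)<<16 | uint32(b)<<8 | uint32(c)
}

// unpack reverses pack and renders zero bytes as spaces.
func unpack(v uint32) string {
	out := []byte{byte(v >> 16), byte(v >> 8), byte(v)}
	for i := range out {
		if out[i] == 0 {
			out[i] = ' '
		}
	}
	return string(out)
}

func main() {
	v := pack('f', 'u', 'n')
	fmt.Printf("%06x %q\n", v, unpack(v))
}
```

Three 8-bit values give 2^24 = 16,777,216 possible trigrams, which matches the size of the Bitset type.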

func UnmarshalTValue ¶

func UnmarshalTValue(data []byte) ([]uint64, error)

UnmarshalTValue decodes a TRecord value. Trigram must be set separately (from key).

func UnmarshalWValue ¶

func UnmarshalWValue(data []byte) ([]uint64, error)

UnmarshalWValue decodes a WRecord value. Same format as TRecord.

func ValidateAliases ¶

func ValidateAliases(aliases map[byte]byte) error

ValidateAliases returns an error if any alias source or target byte is non-ASCII (≥ 0x80). Aliasing UTF-8 continuation or leading bytes would corrupt multibyte characters and break character-internal trigram skipping.
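The check itself is small. A sketch under the stated rule (both bytes must be below 0x80), with a hypothetical error message:

```go
package main

import "fmt"

// validateAliases rejects any mapping whose source or target byte is non-ASCII.
func validateAliases(aliases map[byte]byte) error {
	for from, to := range aliases {
		if from >= 0x80 || to >= 0x80 {
			return fmt.Errorf("alias 0x%02x -> 0x%02x: both bytes must be ASCII (< 0x80)", from, to)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateAliases(map[byte]byte{'\t': ' ', '_': ' '}))
	fmt.Println(validateAliases(map[byte]byte{0xC3: 'a'}))
}
```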

Types ¶

type AppendOption ¶

type AppendOption func(*appendConfig)

AppendOption configures AppendChunks behavior. R158

func WithAppendChunkCallback ¶

func WithAppendChunkCallback(fn ChunkCallback) AppendOption

WithAppendChunkCallback supplies a chunk callback for append methods. CRC: crc-DB.md | R471

func WithBaseLine ¶

func WithBaseLine(n int) AppendOption

WithBaseLine sets the 1-based line number offset for line-based chunker ranges. When non-zero, "start-end" ranges are adjusted by adding this offset. R162

func WithContentHash ¶

func WithContentHash(hash string) AppendOption

WithContentHash sets the full-file SHA-256 hash (caller pre-computed). R159

func WithFileLength ¶

func WithFileLength(n int64) AppendOption

WithFileLength sets the full file size after append. R161

func WithModTime ¶

func WithModTime(t int64) AppendOption

WithModTime sets the file modification time (Unix nanoseconds). R160
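The AppendOption helpers follow Go's functional-options pattern. The sketch below shows the pattern with a single option and how a base-line offset could shift "start-end" labels produced by chunking only the appended bytes. The lowercase names, the config fields, and `shiftRange` are hypothetical; the unexported appendConfig is not documented here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// appendConfig collects the settings that options can modify (assumed fields).
type appendConfig struct {
	baseLine int
}

type appendOption func(*appendConfig)

func withBaseLine(n int) appendOption {
	return func(c *appendConfig) { c.baseLine = n }
}

// shiftRange adds offset to both ends of a numeric "start-end" label.
// Labels in any other form are returned unchanged.
func shiftRange(label string, offset int) string {
	if offset == 0 {
		return label
	}
	s, e, ok := strings.Cut(label, "-")
	if !ok {
		return label
	}
	start, err1 := strconv.Atoi(s)
	end, err2 := strconv.Atoi(e)
	if err1 != nil || err2 != nil {
		return label
	}
	return fmt.Sprintf("%d-%d", start+offset, end+offset)
}

// appendRanges applies the options, then adjusts every label.
func appendRanges(labels []string, opts ...appendOption) []string {
	var cfg appendConfig
	for _, opt := range opts {
		opt(&cfg)
	}
	out := make([]string, len(labels))
	for i, l := range labels {
		out[i] = shiftRange(l, cfg.baseLine)
	}
	return out
}

func main() {
	// The appended text was chunked on its own, so its labels start at line 1.
	fmt.Println(appendRanges([]string{"1-3", "4-4"}, withBaseLine(10)))
}
```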

type Bitset ¶

type Bitset [BitsetSize]byte

Bitset is a fixed-size bitset for 16,777,216 trigrams (2^24).

func (*Bitset) Bytes ¶

func (b *Bitset) Bytes() []byte

Bytes returns the bitset as a byte slice for storage.

func (*Bitset) Count ¶

func (b *Bitset) Count() int

Count returns the number of set bits.

func (*Bitset) ForEach ¶

func (b *Bitset) ForEach(fn func(uint32))

ForEach calls fn for each set bit in ascending order.

func (*Bitset) FromBytes ¶

func (b *Bitset) FromBytes(data []byte)

FromBytes loads the bitset from stored bytes.

func (*Bitset) Set ¶

func (b *Bitset) Set(trigram uint32)

Set sets the bit for the given trigram.

func (*Bitset) Test ¶

func (b *Bitset) Test(trigram uint32) bool

Test returns whether the bit for the given trigram is set.
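A compact sketch of how such a bitset can work. The bit order within each byte (least significant bit first) is an assumption, as is masking out-of-range values to 24 bits.

```go
package main

import (
	"fmt"
	"math/bits"
)

const bitsetSize = 1 << 21 // 2^21 bytes = 2^24 bits

type bitset [bitsetSize]byte

// set marks the trigram as present. Values are masked to 24 bits.
func (b *bitset) set(t uint32) {
	t &= 1<<24 - 1
	b[t>>3] |= 1 << (t & 7)
}

// test reports whether the trigram is marked.
func (b *bitset) test(t uint32) bool {
	t &= 1<<24 - 1
	return b[t>>3]&(1<<(t&7)) != 0
}

// count returns the number of marked trigrams.
func (b *bitset) count() int {
	n := 0
	for _, x := range b {
		n += bits.OnesCount8(x)
	}
	return n
}

func main() {
	b := new(bitset) // 2 MiB, allocated once
	b.set(0x66756e)
	b.set(0x756e63)
	fmt.Println(b.test(0x66756e), b.test(0x000001), b.count())
}
```

A fixed 2 MiB array gives constant-time membership tests for every possible trigram without any hashing.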

type BracketGroup ¶

type BracketGroup struct {
	Open       []string // openers: e.g. ["{"], ["if","while","for"]
	Separators []string // optional: e.g. ["else","elif","then"]
	Close      []string // closers: e.g. ["}"], ["end","done","fi"]
}

BracketGroup defines one set of matching brackets. Separators are mid-group markers (e.g. "else" between "if"/"end"). R309

type BracketLang ¶

type BracketLang struct {
	LineComments  []string       // e.g. "//", "#", "--"
	BlockComments [][2]string    // e.g. {{"/*", "*/"}, {"<!--", "-->"}}
	StringDelims  []StringDelim  // e.g. {`"`, `"`, `\`}
	Brackets      []BracketGroup // open/separator/close sets
}

BracketLang defines the lexical rules for one language. R307

func LangByName ¶

func LangByName(name string) (BracketLang, bool)

LangByName returns the BracketLang config registered under name. The boolean is false if no config has that name.

type CRecord ¶

type CRecord struct {
	ChunkID  uint64
	Hash     [32]byte
	Trigrams []TrigramEntry
	Tokens   []TokenEntry
	Attrs    []Pair
	FileIDs  []uint64
	// contains filtered or unexported fields
}

CRecord is the per-chunk record. Self-describing: everything needed for search, scoring, filtering, and removal. Carries unexported db/txn — the chunk is tied to the transaction that read it.

func UnmarshalCValue ¶

func UnmarshalCValue(data []byte) (CRecord, error)

UnmarshalCValue decodes a CRecord value. ChunkID must be set separately (from key). v2 format: hash + trigrams + tokens + attrs + fileids

func (*CRecord) DB ¶

func (c *CRecord) DB() *DB

DB returns the database this record belongs to.

func (*CRecord) FileRecord ¶

func (c *CRecord) FileRecord(fileid uint64) (FRecord, error)

FileRecord navigates to an F record within the same transaction.

func (*CRecord) MarshalValue ¶

func (c *CRecord) MarshalValue() []byte

MarshalValue encodes the CRecord value (everything except the key prefix and chunkid). v2 format: hash + trigrams + tokens + attrs + fileids

func (*CRecord) Txn ¶

func (c *CRecord) Txn() *lmdb.Txn

Txn returns the transaction this record was read from. Implements TxnHolder.

type Chunk ¶

type Chunk struct {
	Range   []byte
	Content []byte
	Attrs   []Pair
}

Chunk is a single chunk yielded by a Chunker. Range is an opaque label (e.g. "1-10" for lines); Content is the chunk text. Range and Content may be reused between yields — caller must copy if retaining. Attrs is optional per-chunk metadata (nil means no attrs).
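The buffer-reuse rule matters in practice. In the sketch below a line chunker reuses one Range buffer across yields, and the collector copies what it wants to keep. The local `chunk` type mirrors only two of the documented fields, and `lineChunks` is an illustrative stand-in rather than the package's LineChunkFunc.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

type chunk struct {
	Range   []byte
	Content []byte
}

// lineChunks yields one chunk per line. The Range buffer is reused between
// yields, and Content aliases the input, so retained values must be copied.
func lineChunks(content []byte, yield func(chunk) bool) {
	var rangeBuf []byte
	for n := 1; len(content) > 0; n++ {
		line, rest, _ := bytes.Cut(content, []byte("\n"))
		content = rest
		rangeBuf = strconv.AppendInt(rangeBuf[:0], int64(n), 10)
		rangeBuf = append(rangeBuf, '-')
		rangeBuf = strconv.AppendInt(rangeBuf, int64(n), 10)
		if !yield(chunk{Range: rangeBuf, Content: line}) {
			return // consumer asked to stop early
		}
	}
}

// collect retains every chunk safely by copying both slices.
func collect(content []byte) []chunk {
	var out []chunk
	lineChunks(content, func(c chunk) bool {
		out = append(out, chunk{
			Range:   append([]byte(nil), c.Range...),
			Content: append([]byte(nil), c.Content...),
		})
		return true
	})
	return out
}

func main() {
	for _, c := range collect([]byte("alpha\nbeta\ngamma\n")) {
		fmt.Printf("%s\t%s\n", c.Range, c.Content)
	}
}
```

Without the copies in `collect`, every retained Range would point at the same buffer and show only the last label written.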

type ChunkCache ¶

type ChunkCache struct {
	// contains filtered or unexported fields
}

ChunkCache is a per-query cache for file content and chunked data. Avoids redundant file reads and re-chunking when processing search results.

func (*ChunkCache) ChunkText ¶

func (cc *ChunkCache) ChunkText(fpath, rangeLabel string) ([]byte, bool)

ChunkText returns a single chunk's content by range label. Uses lazy chunking — stops as soon as the target is found.

func (*ChunkCache) GetChunks ¶

func (cc *ChunkCache) GetChunks(fpath, targetRange string, before, after int) ([]ChunkResult, error)

GetChunks retrieves the target chunk and up to before/after positional neighbors. Same contract as DB.GetChunks but cached.

type ChunkCallback ¶

type ChunkCallback func(chunkText string)

ChunkCallback receives clean chunk text during indexing. Called once per chunk, in chunk order. The string is a copy, safe to retain. CRC: crc-DB.md | R469

type ChunkFilter ¶

type ChunkFilter func(chunk CRecord) bool

ChunkFilter receives a CRecord during candidate evaluation. Return true to keep the chunk, false to reject it. The CRecord carries transaction context — use Txn() and DB() for lookups.

type ChunkFunc ¶

type ChunkFunc func(path string, content []byte, yield func(Chunk) bool) error

ChunkFunc is a generator that yields chunks for a file. Convenience type — wrap with FuncChunker to get a full Chunker.

func RunChunkerFunc ¶

func RunChunkerFunc(cmd string) ChunkFunc

RunChunkerFunc returns a ChunkFunc that executes an external command. The command receives the filepath as an argument and outputs one chunk per line on stdout as "range\tcontent".

type ChunkResult ¶

type ChunkResult struct {
	Path    string `json:"path"`
	Range   string `json:"range"`
	Content string `json:"content"`
	Index   int    `json:"index"` // 0-based position in the file's chunk list
	Attrs   []Pair `json:"attrs,omitempty"`
}

ChunkResult holds a single chunk with its content and position. R201

type Chunker ¶

type Chunker interface {
	Chunks(path string, content []byte, yield func(Chunk) bool) error
	ChunkText(path string, content []byte, rangeLabel string) ([]byte, bool)
}

Chunker is the interface for chunking strategies. Chunks produces chunks for indexing; ChunkText retrieves a single chunk's content.

func BracketChunker ¶

func BracketChunker(lang BracketLang) Chunker

BracketChunker returns a Chunker for the given language config. R320

func IndentChunker ¶

func IndentChunker(lang BracketLang, tabWidth int) Chunker

IndentChunker returns a Chunker for indentation-scoped languages. tabWidth controls how tabs count for column calculation (0 = one column per tab). R333

type DB ¶

type DB struct {
	// contains filtered or unexported fields
}

func Create ¶

func Create(path string, opts Options) (*DB, error)

Seq: seq-init.md

func Open ¶

func Open(path string, opts Options) (*DB, error)

func (*DB) AddChunker ¶

func (db *DB) AddChunker(name string, c Chunker) error

CRC: crc-DB.md | R293

func (*DB) AddFile ¶

func (db *DB) AddFile(fpath, strategy string, opts ...IndexOption) (uint64, error)

Seq: seq-add.md | R477

func (*DB) AddFileWithContent ¶

func (db *DB) AddFileWithContent(fpath, strategy string, opts ...IndexOption) (uint64, []byte, error)

CRC: crc-DB.md | R120, R478

func (*DB) AddStrategy ¶

func (db *DB) AddStrategy(name, cmd string) error

func (*DB) AddStrategyFunc ¶

func (db *DB) AddStrategyFunc(name string, fn ChunkFunc) error

CRC: crc-DB.md | R294

func (*DB) AddTmpFile ¶

func (db *DB) AddTmpFile(path, strategy string, content []byte, opts ...IndexOption) (uint64, error)

AddTmpFile indexes a tmp:// document in the in-memory overlay. CRC: crc-DB.md | Seq: seq-tmp-add.md | R358, R359, R360, R480

func (*DB) AppendChunks ¶

func (db *DB) AppendChunks(fileid uint64, content []byte, strategy string, opts ...AppendOption) error

AppendChunks adds chunks to an existing file without a full reindex. content holds only the appended bytes, not the full file. CRC: crc-DB.md | Seq: seq-append.md | R150, R151, R152, R153, R154, R155, R156, R157, R163, R164, R165, R166, R167, R168

func (*DB) AppendTmpFile ¶

func (db *DB) AppendTmpFile(path, strategy string, content []byte, opts ...AppendOption) (uint64, error)

AppendTmpFile appends content to an existing tmp:// document, creating it if it doesn't exist. New chunks are indexed from the appended content without touching existing chunks. CRC: crc-DB.md | R428-R442, R483

func (*DB) BM25Func ¶

func (db *DB) BM25Func(queryTrigrams []uint32) (ScoreFunc, error)

BM25Func reads T records for per-trigram document frequencies and I record counters for corpus statistics, then returns a BM25 ScoreFunc closure. CRC: crc-DB.md | R274, R277, R278
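For orientation, here is a textbook BM25 closure over trigram counts. It is a sketch, not the package's implementation: the `scoreFunc` shape is inferred from ScoreOverlap's signature, the IDF formula is the common smoothed variant, and k1 = 1.2 and b = 0.75 are conventional defaults rather than values taken from the source.

```go
package main

import (
	"fmt"
	"math"
)

// scoreFunc mirrors the apparent ScoreFunc shape (assumption).
type scoreFunc func(queryTrigrams []uint32, chunkCounts map[uint32]int, tokenCount int) float64

// idf computes a smoothed inverse document frequency from the number of
// chunks containing a trigram (df) and the total number of chunks (n).
func idf(df, n int) float64 {
	return math.Log(1 + (float64(n-df)+0.5)/(float64(df)+0.5))
}

// bm25 returns a scorer that closes over precomputed IDF values and the
// average chunk length.
func bm25(idfs map[uint32]float64, avgTokenCount float64) scoreFunc {
	const k1, b = 1.2, 0.75 // conventional defaults, assumed here
	return func(query []uint32, counts map[uint32]int, tokenCount int) float64 {
		norm := 1.0
		if avgTokenCount > 0 {
			norm = 1 - b + b*float64(tokenCount)/avgTokenCount
		}
		score := 0.0
		for _, t := range query {
			tf := float64(counts[t])
			if tf == 0 {
				continue
			}
			score += idfs[t] * tf * (k1 + 1) / (tf + k1*norm)
		}
		return score
	}
}

func main() {
	idfs := map[uint32]float64{0x717578: idf(1, 100), 0x746865: idf(90, 100)}
	score := bm25(idfs, 10)
	query := []uint32{0x717578, 0x746865}
	fmt.Printf("%.3f\n", score(query, map[uint32]int{0x717578: 1}, 10))
	fmt.Printf("%.3f\n", score(query, map[uint32]int{0x746865: 1}, 10))
}
```

Computing the IDF table once and returning a closure keeps the per-chunk scoring loop free of database reads.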

func (*DB) CheckFile ¶

func (db *DB) CheckFile(fpath string) (FileStatus, error)

CheckFile checks whether an indexed file is fresh, stale, or missing on disk.

func (*DB) Close ¶

func (db *DB) Close() error

func (*DB) Copy ¶

func (db *DB) Copy() *DB

Copy returns a shallow copy of the DB sharing the LMDB env, overlay, and chunker registry. Caches are nil; the copy lazy-loads from committed LMDB state. Intended for short-lived write transactions in a separate goroutine. CRC: crc-DB.md | R459, R460, R461, R462

func (*DB) Env ¶

func (db *DB) Env() *lmdb.Env

Env returns the underlying LMDB environment for sharing with other libraries.

func (*DB) FileIDPaths ¶

func (db *DB) FileIDPaths() (map[uint64]string, error)

CRC: crc-DB.md | R448, R449, R450, R454

func (*DB) FileInfoByID ¶

func (db *DB) FileInfoByID(fileid uint64) (FRecord, error)

FileInfoByID resolves a fileid to its FRecord.

func (*DB) GetChunks ¶

func (db *DB) GetChunks(fpath, targetRange string, before, after int) ([]ChunkResult, error)

GetChunks retrieves the target chunk (identified by range label) and up to before/after positional neighbors. Returns chunks in positional order. Seq: seq-chunks.md | R197, R198, R199, R200, R201, R202, R203
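The neighbor selection amounts to clamping a window around the target's position. A sketch over a plain list of range labels; the `chunkResult` type here keeps only two fields, and the error text is hypothetical.

```go
package main

import "fmt"

type chunkResult struct {
	Range string
	Index int // 0-based position in the file's chunk list
}

// neighbors returns the target chunk plus up to before/after positional
// neighbors, in positional order, clamped to the bounds of the list.
func neighbors(ranges []string, target string, before, after int) ([]chunkResult, error) {
	idx := -1
	for i, r := range ranges {
		if r == target {
			idx = i
			break
		}
	}
	if idx < 0 {
		return nil, fmt.Errorf("range %q not found", target)
	}
	if before < 0 {
		before = 0
	}
	if after < 0 {
		after = 0
	}
	lo, hi := idx-before, idx+after
	if lo < 0 {
		lo = 0
	}
	if hi > len(ranges)-1 {
		hi = len(ranges) - 1
	}
	out := make([]chunkResult, 0, hi-lo+1)
	for i := lo; i <= hi; i++ {
		out = append(out, chunkResult{Range: ranges[i], Index: i})
	}
	return out, nil
}

func main() {
	ranges := []string{"1-5", "6-9", "10-14", "15-20"}
	fmt.Println(neighbors(ranges, "6-9", 2, 1))
}
```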

func (*DB) HasTmp ¶

func (db *DB) HasTmp() bool

HasTmp reports whether any tmp:// documents exist in the overlay. CRC: crc-DB.md | R377

func (*DB) InvalidateCaches ¶

func (db *DB) InvalidateCaches()

InvalidateCaches nils the path and FRecord caches, forcing lazy reload on next access. Does not reset overlayOnce. CRC: crc-DB.md | R463, R464

func (*DB) NewChunkCache ¶

func (db *DB) NewChunkCache() *ChunkCache

NewChunkCache creates a per-query chunk cache.

func (*DB) NewSearchCache ¶

func (db *DB) NewSearchCache() func()

CRC: crc-DB.md | R456

func (*DB) QueryTrigramCounts ¶

func (db *DB) QueryTrigramCounts(query string) ([]TrigramCount, error)

QueryTrigramCounts extracts trigrams from a query string and returns their corpus document frequencies. For diagnostic/inspection use.

func (*DB) RecordCounts ¶

func (db *DB) RecordCounts() (map[byte]RecordStats, error)

CRC: crc-DB.md | R443, R444, R445

func (*DB) RefreshStale ¶

func (db *DB) RefreshStale(strategy string, opts ...IndexOption) ([]FileStatus, error)

RefreshStale reindexes all stale files. If strategy is empty, each file's existing strategy is used. Returns the list of stale/missing files. CRC: crc-DB.md | R479

func (*DB) Reindex ¶

func (db *DB) Reindex(fpath, strategy string, opts ...IndexOption) (uint64, error)

func (*DB) ReindexWithContent ¶

func (db *DB) ReindexWithContent(fpath, strategy string, opts ...IndexOption) (uint64, []byte, error)

CRC: crc-DB.md | R121

func (*DB) RemoveFile ¶

func (db *DB) RemoveFile(fpath string) error

func (*DB) RemoveStrategy ¶

func (db *DB) RemoveStrategy(name string) error

func (*DB) RemoveTmpFile ¶

func (db *DB) RemoveTmpFile(path string) error

RemoveTmpFile removes a tmp:// document from the overlay. CRC: crc-DB.md | Seq: seq-tmp-add.md | R364, R365

func (*DB) ScoreFile ¶

func (db *DB) ScoreFile(query, fpath string, fn ScoreFunc, opts ...SearchOption) ([]ScoredChunk, error)

ScoreFile returns per-chunk scores for a single file using the given scoring function. Seq: seq-score.md | R178, R179, R180

func (*DB) Search ¶

func (db *DB) Search(query string, opts ...SearchOption) (*SearchResults, error)

Seq: seq-search.md | R178, R179, R180, R181, R182

func (*DB) SearchFuzzy ¶

func (db *DB) SearchFuzzy(query string, k int, opts ...SearchOption) (*SearchResults, error)

SearchFuzzy performs fast typo-tolerant search in two phases:

Phase 1: trigram OR-union tally from T record posting lists (select top-k).
Phase 2: C record re-score with ScoreCoverage for the top-k winners.

CRC: crc-DB.md | Seq: seq-fuzzy-trigram.md | R418, R419, R420, R421, R422, R423, R425, R427
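Phase 1 can be pictured as a vote count. The sketch below tallies how many query trigrams hit each chunk and keeps the k best. The `tally` type, the tie-break by chunk ID, and the in-memory posting map are illustrative assumptions; phase 2 (re-scoring the winners) is not shown.

```go
package main

import (
	"fmt"
	"sort"
)

type tally struct {
	ChunkID uint64
	Hits    int
}

// topK counts, for every chunk, how many query trigrams list it in their
// posting list (OR-union), then returns the k chunks with the most hits.
func topK(postings map[uint32][]uint64, query []uint32, k int) []tally {
	hits := make(map[uint64]int)
	for _, t := range query {
		for _, id := range postings[t] {
			hits[id]++
		}
	}
	out := make([]tally, 0, len(hits))
	for id, n := range hits {
		out = append(out, tally{ChunkID: id, Hits: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Hits != out[j].Hits {
			return out[i].Hits > out[j].Hits
		}
		return out[i].ChunkID < out[j].ChunkID // deterministic tie-break
	})
	if k >= 0 && len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	postings := map[uint32][]uint64{
		0x66756e: {10, 20},
		0x756e63: {20, 30},
		0x6e6374: {10, 20},
	}
	// A misspelled query still shares most trigrams with the right chunk.
	fmt.Println(topK(postings, []uint32{0x66756e, 0x756e63, 0x6e6374, 0x787878}, 2))
}
```

Unlike the strict intersection used for exact search, a trigram missing from a chunk only lowers that chunk's tally, which is what makes the search tolerant of typos.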

func (*DB) SearchMulti ¶

func (db *DB) SearchMulti(query string, strategies map[string]ScoreFunc, k int, opts ...SearchOption) ([]MultiSearchResult, error)

CRC: crc-DB.md | Seq: seq-search-multi.md | R283, R284, R285, R287, R288, R289, R290

func (*DB) SearchRegex ¶

func (db *DB) SearchRegex(pattern string, opts ...SearchOption) (*SearchResults, error)

SearchRegex searches using a regex pattern against the full trigram index. Seq: seq-search.md

func (*DB) Settings ¶

func (db *DB) Settings() Settings

Settings returns the current database settings.

func (*DB) StaleFiles ¶

func (db *DB) StaleFiles() ([]FileStatus, error)

StaleFiles returns the status of every indexed file.

func (*DB) TmpContent ¶

func (db *DB) TmpContent(path string) (*bytes.Reader, error)

TmpContent returns a reader over the raw stored content of a tmp:// document. CRC: crc-DB.md | R378

func (*DB) TmpFileIDs ¶

func (db *DB) TmpFileIDs() map[uint64]struct{}

TmpFileIDs returns the set of all current tmp:// fileids. CRC: crc-DB.md | R369

func (*DB) UpdateTmpFile ¶

func (db *DB) UpdateTmpFile(path, strategy string, content []byte, opts ...IndexOption) error

UpdateTmpFile replaces the content of an existing tmp:// document. CRC: crc-DB.md | Seq: seq-tmp-add.md | R361, R362, R363, R481

func (*DB) Version ¶

func (db *DB) Version() (string, error)

CRC: crc-DB.md

type FRecord ¶

type FRecord struct {
	FileID      uint64
	ModTime     int64
	ContentHash [32]byte
	FileLength  int64
	Strategy    string
	Names       []string
	Chunks      []FileChunkEntry
	Tokens      []TokenEntry
}

FRecord is the per-file record: metadata, ordered chunks, and a file-level token bag.

func UnmarshalFHeader ¶

func UnmarshalFHeader(data []byte) (FRecord, error)

R451, R452: UnmarshalFHeader decodes only the header fields of an F record value: ModTime, ContentHash, FileLength, Strategy, and Names. Skips Chunks and Tokens.

func UnmarshalFValue ¶

func UnmarshalFValue(data []byte) (FRecord, error)

UnmarshalFValue decodes an FRecord value. FileID must be set separately (from key).

func (*FRecord) MarshalValue ¶

func (f *FRecord) MarshalValue() []byte

MarshalValue encodes the FRecord value (everything except the key prefix and fileid).

type FileChunkEntry ¶

type FileChunkEntry struct {
	ChunkID  uint64
	Location string
}

FileChunkEntry pairs a chunkid with its location label (opaque range string from chunker).

type FileStatus ¶

type FileStatus struct {
	Path     string
	Status   string // "fresh", "stale", "missing"
	FileID   uint64
	Strategy string
}

FileStatus is returned by CheckFile and StaleFiles.

type FuncChunker ¶

type FuncChunker struct {
	Fn ChunkFunc
}

FuncChunker wraps a bare ChunkFunc into a Chunker. ChunkText re-runs the function and returns the first chunk matching the range label.

func (FuncChunker) ChunkText ¶

func (fc FuncChunker) ChunkText(path string, content []byte, rangeLabel string) ([]byte, bool)

func (FuncChunker) Chunks ¶

func (fc FuncChunker) Chunks(path string, content []byte, yield func(Chunk) bool) error

type HRecord ¶

type HRecord struct {
	Hash    [32]byte
	ChunkID uint64
}

HRecord maps content hash to chunkid.

type IndexOption ¶

type IndexOption func(*indexConfig)

IndexOption configures indexing methods (AddFile, AddFileWithContent, RefreshStale, AddTmpFile, UpdateTmpFile). CRC: crc-DB.md | R472

func WithChunkCallback ¶

func WithChunkCallback(fn ChunkCallback) IndexOption

WithChunkCallback supplies a chunk callback for indexing methods. CRC: crc-DB.md | R470

type IndexStatus ¶

type IndexStatus struct {
	Built bool
}

IndexStatus reports the state of the index.

type KeyPair ¶

type KeyPair struct {
	Key   []byte
	Value []byte // nil for non-final parts; caller sets fileid on final part
}

KeyPair is an N record key/value pair for filename key chains.

func EncodeFilename ¶

func EncodeFilename(filename string) []KeyPair

EncodeFilename returns N record key/value pairs for a filename. Short filenames (≤509 bytes) produce a single final key. Longer filenames are split across chained keys.

type MultiSearchResult ¶

type MultiSearchResult struct {
	Strategy string
	Results  []SearchResult
}

CRC: crc-DB.md | R286

MultiSearchResult holds one strategy's results from SearchMulti.

type Options ¶

type Options struct {
	CaseInsensitive bool
	Aliases         map[byte]byte // maps input bytes to replacement bytes before trigram extraction
	DBName          string        // subdatabase name, default "fts"
	MaxDBs          int           // LMDB max named databases, default 2
	MapSize         int64         // bytes, default 1GB
}

Options configures database creation and opening.

type Pair ¶

type Pair struct {
	Key   []byte
	Value []byte
}

Pair is an opaque key-value pair for per-chunk metadata. Allows duplicate keys. Mirrors the DB wire format.

func CopyPairs ¶

func CopyPairs(src []Pair) []Pair

CopyPairs deep-copies a slice of Pair.
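What "deep copy" means for a slice of Pair can be shown with a short sketch. copyPairsSketch is a hypothetical re-implementation written for illustration, not the library's source; it assumes that a deep copy duplicates the Key and Value byte slices so the copy shares no backing memory with the original.

```go
package main

import "fmt"

// Pair mirrors the documented type.
type Pair struct {
	Key   []byte
	Value []byte
}

// copyPairsSketch is a hypothetical stand-in for CopyPairs. It copies the
// slice and every Key and Value byte slice, so later writes to src do not
// show through in the result.
func copyPairsSketch(src []Pair) []Pair {
	if src == nil {
		return nil
	}
	dst := make([]Pair, len(src))
	for i, p := range src {
		dst[i] = Pair{
			Key:   append([]byte(nil), p.Key...),
			Value: append([]byte(nil), p.Value...),
		}
	}
	return dst
}

func main() {
	src := []Pair{{Key: []byte("lang"), Value: []byte("go")}}
	dst := copyPairsSketch(src)
	src[0].Value[0] = 'G' // mutate the original only
	fmt.Printf("%s %s\n", src[0].Value, dst[0].Value) // Go go
}
```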

type RecordStats ¶

type RecordStats struct {
	Count      int64
	KeyBytes   int64
	ValueBytes int64
}

CRC: crc-DB.md | R445

type ScoreFunc ¶

type ScoreFunc func(queryTrigrams []uint32, chunkCounts map[uint32]int, chunkTokenCount int) float64

ScoreFunc computes a relevance score for a chunk.

queryTrigrams: active query trigrams.
chunkCounts: trigram -> occurrence count in the chunk.
chunkTokenCount: number of tokens (words) in the chunk.

var ScoreCoverage ScoreFunc = scoreCoverage

ScoreCoverage is the coverage scoring function: fraction of active query trigrams present in chunk.
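As an illustration of the ScoreFunc contract, here is a minimal sketch of coverage scoring as described above: the fraction of query trigrams that occur at least once in the chunk. coverageSketch is a hypothetical name and the empty-query behavior (returning 0) is an assumption; the unexported scoreCoverage may differ in detail.

```go
package main

import "fmt"

// ScoreFunc mirrors the documented signature.
type ScoreFunc func(queryTrigrams []uint32, chunkCounts map[uint32]int, chunkTokenCount int) float64

// coverageSketch returns matched / total query trigrams. The token count
// is unused by this scoring rule. Returning 0 for an empty query is an
// assumption made to avoid dividing by zero.
func coverageSketch(queryTrigrams []uint32, chunkCounts map[uint32]int, _ int) float64 {
	if len(queryTrigrams) == 0 {
		return 0
	}
	matched := 0
	for _, tri := range queryTrigrams {
		if chunkCounts[tri] > 0 {
			matched++
		}
	}
	return float64(matched) / float64(len(queryTrigrams))
}

func main() {
	var score ScoreFunc = coverageSketch
	// Two of the four query trigrams occur in the chunk.
	fmt.Println(score([]uint32{1, 2, 3, 4}, map[uint32]int{1: 5, 3: 1}, 12)) // 0.5
}
```

Any function with this signature can be passed to WithScoring, so a sketch like this is also a starting point for a custom scorer.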

var ScoreDensityFunc ScoreFunc = scoreDensity

ScoreDensityFunc is the density scoring function for direct use with ScoreFile.

func ScoreBM25 ¶

func ScoreBM25(idf map[uint32]float64, avgTokenCount float64) ScoreFunc

CRC: crc-DB.md | R272, R273

ScoreBM25 returns a ScoreFunc closure implementing BM25 ranking. idf maps trigram codes to their inverse document frequency. avgTokenCount is the average chunk token count across the corpus.
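To show how a closure over idf and avgTokenCount can satisfy ScoreFunc, the sketch below implements the textbook BM25 formula. It is an assumption-laden illustration, not the library's code: bm25Sketch is a hypothetical name, and the constants k1 = 1.2 and b = 0.75 are the conventional defaults, which this documentation does not confirm for ScoreBM25.

```go
package main

import "fmt"

// ScoreFunc mirrors the documented signature.
type ScoreFunc func(queryTrigrams []uint32, chunkCounts map[uint32]int, chunkTokenCount int) float64

// Conventional BM25 constants (assumed; the library's values are not documented here).
const (
	bm25K1 = 1.2
	bm25B  = 0.75
)

// bm25Sketch captures idf and avgTokenCount and returns a scorer that sums,
// over the query trigrams, idf * tf*(k1+1) / (tf + k1*norm), where norm
// scales with the chunk's length relative to the corpus average.
func bm25Sketch(idf map[uint32]float64, avgTokenCount float64) ScoreFunc {
	return func(queryTrigrams []uint32, chunkCounts map[uint32]int, chunkTokenCount int) float64 {
		norm := 1.0
		if avgTokenCount > 0 {
			norm = 1 - bm25B + bm25B*float64(chunkTokenCount)/avgTokenCount
		}
		var score float64
		for _, tri := range queryTrigrams {
			tf := float64(chunkCounts[tri])
			if tf == 0 {
				continue // trigram absent from the chunk contributes nothing
			}
			score += idf[tri] * tf * (bm25K1 + 1) / (tf + bm25K1*norm)
		}
		return score
	}
}

func main() {
	score := bm25Sketch(map[uint32]float64{1: 2.0, 2: 0.5}, 10)
	// An average-length chunk containing trigram 1 once and trigram 2 not at all.
	fmt.Printf("%.3f\n", score([]uint32{1, 2}, map[uint32]int{1: 1}, 10))
}
```

Chunks longer than average get a larger norm and therefore a lower score for the same trigram counts, which is the length normalization BM25 is known for.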

type ScoredChunk ¶

type ScoredChunk struct {
	Range string
	Score float64
}

ScoredChunk is a per-chunk trigram match score from ScoreFile.

type SearchOption ¶

type SearchOption func(*searchConfig)

SearchOption configures search behavior.
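SearchOption follows Go's functional-options pattern: each With... constructor returns a function that mutates an unexported config. The sketch below shows the mechanics under stated assumptions. The fields of searchConfig are not documented, so the two fields here, the lowercase constructors, and buildConfig are all hypothetical.

```go
package main

import "fmt"

// searchConfig is hypothetical; the real struct is unexported and its
// fields are not shown in this documentation.
type searchConfig struct {
	verify   bool
	patterns []string
}

// SearchOption mirrors the documented type.
type SearchOption func(*searchConfig)

// withVerify is a hypothetical option that flips a flag.
func withVerify() SearchOption {
	return func(c *searchConfig) { c.verify = true }
}

// withRegexFilter is a hypothetical option that accumulates patterns
// across repeated calls.
func withRegexFilter(patterns ...string) SearchOption {
	return func(c *searchConfig) { c.patterns = append(c.patterns, patterns...) }
}

// buildConfig applies options in order, as a search entry point might.
func buildConfig(opts ...SearchOption) searchConfig {
	var c searchConfig
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	c := buildConfig(withVerify(), withRegexFilter("func"), withRegexFilter("Open", "error"))
	fmt.Println(c.verify, c.patterns) // true [func Open error]
}
```

Because options are applied in order, accumulating options (filters, patterns) compose naturally, while a later scoring option would simply overwrite an earlier one in a design like this.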

func WithAfter ¶

func WithAfter(t time.Time) SearchOption

WithAfter keeps chunks with timestamp >= t. Checks "timestamp" attr first (parsed as Unix nanos); falls back to file mod time from F record. CRC: crc-DB.md | R258

func WithBefore ¶

func WithBefore(t time.Time) SearchOption

WithBefore keeps chunks with timestamp < t. Same fallback as WithAfter. CRC: crc-DB.md | R259
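Taken together, WithAfter (timestamp >= t) and WithBefore (timestamp < t) describe a half-open window. The sketch below shows that boundary behavior on Unix-nanosecond timestamps; inWindow is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"time"
)

// inWindow is a hypothetical helper. It reports whether ts falls in the
// half-open interval [after, before), all expressed as Unix nanoseconds:
// the lower bound is inclusive and the upper bound is exclusive.
func inWindow(ts, after, before int64) bool {
	return ts >= after && ts < before
}

func main() {
	after := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	before := time.Date(2026, 2, 1, 0, 0, 0, 0, time.UTC)
	for _, ts := range []time.Time{after, before.Add(-time.Nanosecond), before} {
		fmt.Println(inWindow(ts.UnixNano(), after.UnixNano(), before.UnixNano()))
	}
	// Prints true, true, false: the window includes its start but not its end.
}
```

Half-open windows tile without overlap, so consecutive searches over [t0, t1) and [t1, t2) never return the same chunk twice.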

func WithChunkCache ¶

func WithChunkCache(cc *ChunkCache) SearchOption

WithChunkCache threads an external ChunkCache through post-filters (verify, regex, except-regex). When present, post-filters use the cache instead of re-reading files. R486

func WithChunkFilter ¶

func WithChunkFilter(fn ChunkFilter) SearchOption

WithChunkFilter adds a chunk filter. Multiple calls accumulate (AND semantics).

func WithCoverage ¶

func WithCoverage() SearchOption

WithCoverage uses coverage scoring (default): matching / total active query trigrams.

func WithDensity ¶

func WithDensity() SearchOption

WithDensity uses token-density scoring for long queries.

func WithExcept ¶

func WithExcept(ids map[uint64]struct{}) SearchOption

WithExcept excludes chunks from the given file IDs.

func WithExceptRegex ¶

func WithExceptRegex(patterns ...string) SearchOption

WithExceptRegex adds subtract post-filters: any match rejects the chunk. Multiple calls accumulate patterns. R184, R185

func WithLoose ¶

func WithLoose() SearchOption

CRC: crc-DB.md | Seq: seq-fuzzy-search.md | R336

WithLoose enables OR semantics at the term level: a chunk matches if it contains any query term's trigrams. Default scoring: terms matched / total terms.

func WithNoTmp ¶

func WithNoTmp() SearchOption

CRC: crc-DB.md | R376

WithNoTmp excludes tmp:// overlay documents from search results.

func WithOnly ¶

func WithOnly(ids map[uint64]struct{}) SearchOption

WithOnly restricts search to chunks from the given file IDs.

func WithOverlap ¶

func WithOverlap() SearchOption

CRC: crc-DB.md | R271

WithOverlap uses overlap scoring: matching trigram count, no normalization.

func WithProximityRerank ¶

func WithProximityRerank(topN int) SearchOption

CRC: crc-DB.md | R279

WithProximityRerank reranks the top-N results by query term proximity in chunk text.

func WithRegexFilter ¶

func WithRegexFilter(patterns ...string) SearchOption

WithRegexFilter adds AND post-filters: every pattern must match chunk content. Multiple calls accumulate patterns. R183, R185
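The two regex post-filters combine as "every required pattern matches, and no excluded pattern matches". The sketch below shows that logic with the standard regexp package. passesPostFilters is a hypothetical helper; how the library compiles, caches, or orders its patterns is not documented here.

```go
package main

import (
	"fmt"
	"regexp"
)

// passesPostFilters is a hypothetical helper. A chunk passes when every
// pattern in must matches it (AND semantics, as with WithRegexFilter) and
// no pattern in except matches it (subtract semantics, as with
// WithExceptRegex). A pattern that fails to compile is returned as an error.
func passesPostFilters(chunk string, must, except []string) (bool, error) {
	for _, p := range must {
		re, err := regexp.Compile(p)
		if err != nil {
			return false, err
		}
		if !re.MatchString(chunk) {
			return false, nil
		}
	}
	for _, p := range except {
		re, err := regexp.Compile(p)
		if err != nil {
			return false, err
		}
		if re.MatchString(chunk) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	chunk := "func Open(path string) error"
	ok, err := passesPostFilters(chunk, []string{`func \w+\(`, `error`}, []string{`TODO`})
	fmt.Println(ok, err) // true <nil>
}
```

A real implementation would likely compile each pattern once rather than per chunk; the per-call compile here only keeps the sketch short.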

func WithScoring ¶

func WithScoring(fn ScoreFunc) SearchOption

WithScoring uses a custom scoring function.

func WithTrigramFilter ¶

func WithTrigramFilter(fn TrigramFilter) SearchOption

WithTrigramFilter supplies a caller-defined trigram selection function.

func WithVerify ¶

func WithVerify() SearchOption

WithVerify enables post-filter verification: after trigram intersection, read chunk text from disk and verify each query term appears as a case-insensitive substring. Eliminates trigram false positives. R124, R125
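The verification step can be sketched as a plain case-insensitive substring check per query term. verifyChunk is a hypothetical helper, and splitting the query on whitespace is an assumption; the library's own term tokenization is not described here.

```go
package main

import (
	"fmt"
	"strings"
)

// verifyChunk is a hypothetical helper. It reports whether every
// whitespace-separated query term (an assumed tokenization) appears in
// chunkText as a case-insensitive substring.
func verifyChunk(chunkText, query string) bool {
	lower := strings.ToLower(chunkText)
	for _, term := range strings.Fields(strings.ToLower(query)) {
		if !strings.Contains(lower, term) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(verifyChunk("func Open(path string) error", "FUNC open")) // true
	fmt.Println(verifyChunk("func Close() error", "func open"))           // false
}
```

The check is needed because trigram intersection only proves that a chunk contains all of a term's trigrams somewhere, not that they occur contiguously and in order.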

type SearchResult ¶

type SearchResult struct {
	Path  string
	Range string
	Score float64
	// contains filtered or unexported fields
}

SearchResult is a single match from Search. R99, R490, R491

type SearchResults ¶

type SearchResults struct {
	Results []SearchResult
	Status  IndexStatus
}

SearchResults wraps search matches with index health status.

type Settings ¶

type Settings struct {
	CaseInsensitive    bool
	Aliases            map[byte]byte     // byte→byte alias mapping
	ChunkingStrategies map[string]string // name→cmd (empty cmd = func strategy)
}

Settings holds the in-memory representation of I records.

type StringDelim ¶

type StringDelim struct {
	Open   string // opening delimiter
	Close  string // closing delimiter (same as Open for symmetric quotes)
	Escape string // escape character (empty = no escaping)
}

StringDelim defines a string delimiter and its escape character. R308

type TRecord ¶

type TRecord struct {
	Trigram  uint32
	ChunkIDs []uint64
}

TRecord is the trigram inverted index entry.

func (*TRecord) MarshalValue ¶

func (t *TRecord) MarshalValue() []byte

MarshalValue encodes the TRecord value (packed chunkid list).

type TokenEntry ¶

type TokenEntry struct {
	Token string
	Count int
}

TokenEntry pairs a token string with its occurrence count.

type TrigramCount ¶

type TrigramCount struct {
	Trigram uint32
	Count   int
}

TrigramCount pairs a trigram code with its corpus document frequency.

func FilterAll ¶

func FilterAll(trigrams []TrigramCount, _ int) []TrigramCount

FilterAll uses every query trigram. No filtering.

type TrigramEntry ¶

type TrigramEntry struct {
	Trigram uint32
	Count   int
}

TrigramEntry pairs a trigram code with its per-chunk occurrence count.

type TrigramFilter ¶

type TrigramFilter func(trigrams []TrigramCount, totalChunks int) []TrigramCount

TrigramFilter decides which trigrams to use for a given query. It receives the query's trigrams with their corpus-wide document frequencies, and the total number of indexed chunks. It returns the subset to search with.

func FilterBestN ¶

func FilterBestN(n int) TrigramFilter

FilterBestN returns a TrigramFilter that keeps the N trigrams with the lowest document frequency.
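A filter of this shape can be sketched in a few lines. bestNSketch is a hypothetical re-implementation for illustration only; whether the real FilterBestN preserves input order, or how it breaks ties between equal frequencies, is not documented, so the stable sort here is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// TrigramCount and TrigramFilter mirror the documented types.
type TrigramCount struct {
	Trigram uint32
	Count   int
}

type TrigramFilter func(trigrams []TrigramCount, totalChunks int) []TrigramCount

// bestNSketch returns a filter keeping the n trigrams with the lowest
// document frequency. It sorts a copy, so the caller's slice is untouched;
// ties keep their input order (an assumption).
func bestNSketch(n int) TrigramFilter {
	if n < 0 {
		n = 0
	}
	return func(trigrams []TrigramCount, _ int) []TrigramCount {
		sorted := append([]TrigramCount(nil), trigrams...)
		sort.SliceStable(sorted, func(i, j int) bool {
			return sorted[i].Count < sorted[j].Count
		})
		if n < len(sorted) {
			sorted = sorted[:n]
		}
		return sorted
	}
}

func main() {
	query := []TrigramCount{{Trigram: 1, Count: 50}, {Trigram: 2, Count: 3}, {Trigram: 3, Count: 10}}
	fmt.Println(bestNSketch(2)(query, 100)) // [{2 3} {3 10}]
}
```

Rare trigrams have short posting lists, so intersecting only the n rarest keeps the search cheap while still narrowing candidates sharply.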

func FilterByRatio ¶

func FilterByRatio(maxRatio float64) TrigramFilter

FilterByRatio returns a TrigramFilter that skips trigrams appearing in more than maxRatio of total chunks. E.g., 0.50 skips trigrams in >50% of chunks.
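The ratio rule can be sketched the same way. byRatioSketch is a hypothetical re-implementation; it keeps a trigram whose frequency equals maxRatio exactly, following the "more than maxRatio" wording above, and passing everything through when totalChunks is not positive is an assumption.

```go
package main

import "fmt"

// TrigramCount and TrigramFilter mirror the documented types.
type TrigramCount struct {
	Trigram uint32
	Count   int
}

type TrigramFilter func(trigrams []TrigramCount, totalChunks int) []TrigramCount

// byRatioSketch returns a filter that drops trigrams appearing in more than
// maxRatio of all chunks. With no chunks indexed there is no ratio to
// compute, so the input is returned unchanged (an assumption).
func byRatioSketch(maxRatio float64) TrigramFilter {
	return func(trigrams []TrigramCount, totalChunks int) []TrigramCount {
		if totalChunks <= 0 {
			return trigrams
		}
		kept := make([]TrigramCount, 0, len(trigrams))
		for _, tc := range trigrams {
			if float64(tc.Count)/float64(totalChunks) <= maxRatio {
				kept = append(kept, tc)
			}
		}
		return kept
	}
}

func main() {
	query := []TrigramCount{{Trigram: 1, Count: 60}, {Trigram: 2, Count: 50}, {Trigram: 3, Count: 5}}
	// With 100 chunks and maxRatio 0.50, only the 60% trigram is skipped.
	fmt.Println(byRatioSketch(0.50)(query, 100)) // [{2 50} {3 5}]
}
```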

type Trigrams ¶

type Trigrams struct {
	// contains filtered or unexported fields
}

Trigrams extracts raw byte trigrams from text. Every byte is its own value, with no character set mapping. Whitespace bytes act as boundaries, and runs of whitespace collapse. Case insensitivity is implemented via bytes.ToLower(), and byte aliases are applied before extraction.

func NewTrigrams ¶

func NewTrigrams(caseInsensitive bool, aliases map[byte]byte) *Trigrams

NewTrigrams creates a trigram extractor.

func (*Trigrams) EncodeTrigram ¶

func (t *Trigrams) EncodeTrigram(s string) (uint32, bool)

EncodeTrigram converts a 3-byte string to a 24-bit trigram using the same encoding as ExtractTrigrams: case folding, aliases, whitespace→0. Returns 0, false if the trigram cannot appear in the index (e.g. all whitespace, or consecutive whitespace which encode() collapses away).

func (*Trigrams) ExtractTrigrams ¶

func (t *Trigrams) ExtractTrigrams(data []byte) []uint32

ExtractTrigrams extracts all trigrams from data. Character-internal trigrams (windows entirely within a multibyte UTF-8 char) are skipped.
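The sliding-window extraction and the character-internal skip can be illustrated with a simplified sketch. It is not the library's algorithm: extractSketch and packTrigram are hypothetical names, the byte order inside the 24-bit value (first byte in the high bits) is an assumption, and whitespace collapsing, case folding, and aliases are deliberately left out.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// packTrigram packs three bytes into a 24-bit value. Putting the first
// byte in the high bits is an assumption about the encoding.
func packTrigram(a, b, c byte) uint32 {
	return uint32(a)<<16 | uint32(b)<<8 | uint32(c)
}

// extractSketch slides a 3-byte window over data and skips any window that
// lies entirely inside one multibyte UTF-8 character. Whitespace handling,
// case folding, and aliases are omitted from this sketch.
func extractSketch(data []byte) []uint32 {
	// owner[i] is the start offset of the character that byte i belongs to.
	owner := make([]int, len(data))
	for i := 0; i < len(data); {
		_, size := utf8.DecodeRune(data[i:])
		for j := i; j < i+size; j++ {
			owner[j] = i
		}
		i += size
	}
	var out []uint32
	for i := 0; i+2 < len(data); i++ {
		if owner[i] == owner[i+2] {
			continue // all three bytes belong to the same character
		}
		out = append(out, packTrigram(data[i], data[i+1], data[i+2]))
	}
	return out
}

func main() {
	fmt.Printf("%06x\n", extractSketch([]byte("abcd")))
	// "a€b" is 5 bytes; the window covering only the 3 bytes of "€" is skipped.
	fmt.Println(len(extractSketch([]byte("a€b"))))
}
```

The windows that straddle a character boundary are kept, which is how multibyte text remains searchable by substring even though no character set mapping is applied.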

func (*Trigrams) TrigramCounts ¶

func (t *Trigrams) TrigramCounts(data []byte) map[uint32]int

TrigramCounts extracts trigrams with occurrence counts. Character-internal trigrams (windows entirely within a multibyte UTF-8 char) are skipped.

type TxnHolder ¶

type TxnHolder interface {
	Txn() *lmdb.Txn
}

TxnHolder is anything that carries an LMDB transaction. CRecord implements it; txnWrap wraps raw transactions from View/Update blocks.

type WRecord ¶

type WRecord struct {
	TokenHash uint32
	ChunkIDs  []uint64
}

WRecord is the token inverted index entry.

func (*WRecord) MarshalValue ¶

func (w *WRecord) MarshalValue() []byte

MarshalValue encodes the WRecord value (packed chunkid list, same as TRecord).

Directories ¶

Path	Synopsis
cmd/microfts (command)
