microfts2

package module
v0.2.0
Published: Apr 12, 2026 License: MIT Imports: 22 Imported by: 0

README

microfts2

A dynamic trigram index backed by LMDB, written in Go. Usable as a CLI tool or as a library.

microfts2 indexes files into raw byte trigrams (24-bit, 16M possible) organized by chunks, then maintains an inverted index for fast substring search. The index is maintained incrementally — every add/remove updates it immediately.

Install

go install github.com/zot/microfts2/cmd/microfts@latest

Or clone and build:

git clone https://github.com/zot/microfts2.git
cd microfts2
go build ./cmd/microfts

Quick Start

# Create a database
microfts init -db myindex

# Register a chunking strategy (any command that outputs range\tcontent lines)
microfts strategy add -db myindex -name lines -cmd "microfts chunk-lines"

# Add files
microfts add -db myindex -strategy lines src/*.go

# Search
microfts search -db myindex "func Open"
# output: src/db.go:198-260

# Check for stale files
microfts stale -db myindex

# Refresh stale files and search in one step
microfts search -r -db myindex "func Open"

How It Works

Every byte is its own value — no character set mapping. Three consecutive bytes form a 24-bit trigram. UTF-8 multibyte characters produce cross-boundary byte trigrams; character-internal trigrams are skipped.
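
The byte-level extraction described above can be sketched as follows. This is a minimal illustration, not the library's code: the helper names pack and extractTrigrams are hypothetical, the big-endian packing order is an assumption, and "character-internal" is interpreted here as a trigram whose three bytes all belong to one multibyte UTF-8 character.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// pack combines three bytes into one 24-bit value.
// The byte order (first byte in the high bits) is an assumption.
func pack(a, b, c byte) uint32 {
	return uint32(a)<<16 | uint32(b)<<8 | uint32(c)
}

// extractTrigrams is a hypothetical helper that returns every 3-byte window
// of data, skipping windows that lie entirely inside one UTF-8 character.
func extractTrigrams(data []byte) []uint32 {
	// owner[i] is the index of the first byte of the character holding byte i.
	owner := make([]int, len(data))
	for i := 0; i < len(data); {
		_, size := utf8.DecodeRune(data[i:])
		for j := 0; j < size; j++ {
			owner[i+j] = i
		}
		i += size
	}
	var out []uint32
	for i := 0; i+2 < len(data); i++ {
		if owner[i] == owner[i+2] {
			continue // all three bytes sit inside one multibyte character
		}
		out = append(out, pack(data[i], data[i+1], data[i+2]))
	}
	return out
}

func main() {
	for _, t := range extractTrigrams([]byte("abcd")) {
		fmt.Printf("%06x\n", t)
	}
}
```

For the input "abcd" the sketch yields two trigrams, one for "abc" and one for "bcd".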

Search computes the query's trigrams, optionally filters them via a caller-supplied TrigramFilter, intersects posting lists from the inverted index, and returns matching file/chunk locations.
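
The intersection step can be pictured with a small in-memory model. The sketch below assumes posting lists are sorted, duplicate-free chunk ID slices; the map-based index and the function names intersect and candidates are hypothetical stand-ins, not the LMDB-backed implementation.

```go
package main

import "fmt"

// intersect returns the IDs present in both sorted, duplicate-free lists.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

// candidates intersects the posting lists of every query trigram.
// index is a hypothetical in-memory stand-in for the inverted index.
func candidates(index map[uint32][]uint64, query []uint32) []uint64 {
	if len(query) == 0 {
		return nil
	}
	result := index[query[0]]
	for _, t := range query[1:] {
		result = intersect(result, index[t])
	}
	return result
}

func main() {
	index := map[uint32][]uint64{
		0x616263: {1, 2, 5},
		0x626364: {2, 3, 5},
	}
	fmt.Println(candidates(index, []uint32{0x616263, 0x626364}))
}
```

Only chunks that appear in every posting list survive, which is why a trigram missing from the index empties the result.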

Two-tree design: Content and index live in separate LMDB subdatabases. The content DB stores trigram frequency counts (sparse C records), file metadata, and settings. The index DB stores the inverted trigram-to-chunk mapping, maintained incrementally on every add/remove.

Dynamic trigram filtering: Query trigram selection is handled at search time via TrigramFilter functions. Stock filters include FilterAll (use all trigrams), FilterByRatio (skip high-frequency trigrams), and FilterBestN (keep N most selective). Callers can supply custom filters.
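
A custom filter might look like the sketch below. It copies the shape of the documented FilterAll signature; the local TrigramCount type mirrors the documented struct, keepRarest is a hypothetical name, and the second int argument is ignored because its meaning is not stated in this documentation.

```go
package main

import (
	"fmt"
	"sort"
)

// TrigramCount mirrors the documented type: a trigram code and its
// corpus document frequency.
type TrigramCount struct {
	Trigram uint32
	Count   int
}

// keepRarest returns a filter that keeps the n trigrams with the lowest
// document frequency. Hypothetical helper, not part of the library.
func keepRarest(n int) func([]TrigramCount, int) []TrigramCount {
	if n < 0 {
		n = 0
	}
	return func(trigrams []TrigramCount, _ int) []TrigramCount {
		// Copy first so the caller's slice is not reordered.
		sorted := append([]TrigramCount(nil), trigrams...)
		sort.SliceStable(sorted, func(i, j int) bool {
			return sorted[i].Count < sorted[j].Count
		})
		if len(sorted) > n {
			sorted = sorted[:n]
		}
		return sorted
	}
}

func main() {
	filter := keepRarest(2)
	in := []TrigramCount{{Trigram: 1, Count: 50}, {Trigram: 2, Count: 3}, {Trigram: 3, Count: 10}}
	fmt.Println(filter(in, 0))
}
```

Rare trigrams have short posting lists, so keeping only the rarest ones reduces intersection work at the cost of more false positives.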

Staleness detection: Each indexed file records its modification time and SHA-256 hash. Checking mod time first avoids hashing unchanged files. The -r flag refreshes stale files before any command.

CLI Reference

All commands require -db <path>. Optional: -content-db, -index-db for custom subdatabase names.

Command                   Description
init                      Create a new database. -case-insensitive, -aliases
add                       Add files. -strategy <name>
search                    Search for text. -regex, -score coverage|density, -verify
delete                    Remove files from the database
reindex                   Re-chunk files with a different strategy. -strategy <name>
score                     Score named files against a query. -score coverage|density
stale                     List stale and missing files
strategy add|remove|list  Manage chunking strategies
chunk-lines               Built-in chunker: one chunk per line
chunk-lines-overlap       Built-in chunker: overlapping line windows. -lines, -overlap
chunk-words-overlap       Built-in chunker: overlapping word windows. -words, -overlap, -pattern

Global flag: -r refreshes all stale files before running the command. Usable standalone (microfts -r -db path) or combined (microfts search -r -db path query).

Library API

import microfts "github.com/zot/microfts2"

// Lifecycle
db, err := microfts.Create(path, microfts.Options{})
db, err := microfts.Open(path, microfts.Options{})
db.Close()

// Content
fileid, err := db.AddFile(filepath, strategyName)
db.RemoveFile(filepath)
fileid, err := db.Reindex(filepath, strategyName)

// Search
results, err := db.Search("query", microfts.WithTrigramFilter(microfts.FilterAll))
results, err := db.SearchRegex("pattern")

// Scoring
chunks, err := db.ScoreFile("query", filepath, microfts.ScoreCoverage)

// Staleness
status, err := db.CheckFile(filepath)    // FileStatus{Path, Status, FileID, Strategy}
statuses, err := db.StaleFiles()         // all indexed files
refreshed, err := db.RefreshStale("")    // reindex stale files

// Strategies
db.AddStrategy(name, command)
db.AddStrategyFunc(name, fn)
db.RemoveStrategy(name)

Chunking Strategies

A chunking strategy is either an external command or a Go function that takes a file and produces chunks. Each chunk has an opaque range label and text content to index.

External chunker example using awk (emits one range\tcontent line per input line):

microfts strategy add -db myindex -name awk-lines \
  -cmd "awk '{printf \"%d-%d\\t%s\\n\", NR, NR, \$0}'"

License

MIT

Documentation

Constants

const BitsetSize = 2097152 // 2^21 bytes = 2^24 bits

Variables

var ErrAlreadyIndexed = errors.New("file already indexed")

ErrAlreadyIndexed is returned when AddFile is called for a path that already has F records in the database. Use Reindex or AppendChunks instead. R215

var ErrNoChunks = errors.New("chunker produced no chunks")

ErrNoChunks is returned when a chunker produces zero chunks for a file.

var LangC = BracketLang{
	LineComments:  []string{"//"},
	BlockComments: [][2]string{{"/*", "*/"}},
	StringDelims: []StringDelim{
		{Open: `"`, Close: `"`, Escape: `\`},
		{Open: "'", Close: "'", Escape: `\`},
	},
	Brackets: []BracketGroup{
		{Open: []string{"{"}, Close: []string{"}"}},
		{Open: []string{"("}, Close: []string{")"}},
		{Open: []string{"["}, Close: []string{"]"}},
	},
}

LangC is the bracket language config for C/C++.

var LangGo = BracketLang{
	LineComments:  []string{"//"},
	BlockComments: [][2]string{{"/*", "*/"}},
	StringDelims: []StringDelim{
		{Open: `"`, Close: `"`, Escape: `\`},
		{Open: "`", Close: "`"},
		{Open: "'", Close: "'", Escape: `\`},
	},
	Brackets: []BracketGroup{
		{Open: []string{"{"}, Close: []string{"}"}},
		{Open: []string{"("}, Close: []string{")"}},
		{Open: []string{"["}, Close: []string{"]"}},
	},
}

LangGo is the bracket language config for Go.

var LangJS = BracketLang{
	LineComments:  []string{"//"},
	BlockComments: [][2]string{{"/*", "*/"}},
	StringDelims: []StringDelim{
		{Open: `"`, Close: `"`, Escape: `\`},
		{Open: "'", Close: "'", Escape: `\`},
		{Open: "`", Close: "`", Escape: `\`},
	},
	Brackets: []BracketGroup{
		{Open: []string{"{"}, Close: []string{"}"}},
		{Open: []string{"("}, Close: []string{")"}},
		{Open: []string{"["}, Close: []string{"]"}},
	},
}

LangJS is the bracket language config for JavaScript.

var LangJava = LangC

LangJava is the bracket language config for Java.

var LangLisp = BracketLang{
	LineComments: []string{";"},
	StringDelims: []StringDelim{
		{Open: `"`, Close: `"`, Escape: `\`},
	},
	Brackets: []BracketGroup{
		{Open: []string{"("}, Close: []string{")"}},
		{Open: []string{"["}, Close: []string{"]"}},
	},
}

LangLisp is the bracket language config for Lisp/Scheme/Clojure.

var LangNginx = BracketLang{
	LineComments: []string{"#"},
	StringDelims: []StringDelim{
		{Open: `"`, Close: `"`, Escape: `\`},
		{Open: "'", Close: "'"},
	},
	Brackets: []BracketGroup{
		{Open: []string{"{"}, Close: []string{"}"}},
	},
}

LangNginx is the bracket language config for nginx.

var LangPascal = BracketLang{
	BlockComments: [][2]string{{"{", "}"}, {"(*", "*)"}},
	StringDelims: []StringDelim{
		{Open: "'", Close: "'"},
	},
	Brackets: []BracketGroup{
		{
			Open:       []string{"begin", "record", "class"},
			Separators: []string{},
			Close:      []string{"end"},
		},
		{
			Open:       []string{"if"},
			Separators: []string{"then", "else"},
			Close:      []string{"end"},
		},
		{Open: []string{"("}, Close: []string{")"}},
		{Open: []string{"["}, Close: []string{"]"}},
	},
}

LangPascal is the bracket language config for Pascal.

var LangShell = BracketLang{
	LineComments: []string{"#"},
	StringDelims: []StringDelim{
		{Open: `"`, Close: `"`, Escape: `\`},
		{Open: "'", Close: "'"},
	},
	Brackets: []BracketGroup{
		{
			Open:       []string{"if"},
			Separators: []string{"then", "elif", "else"},
			Close:      []string{"fi"},
		},
		{
			Open:       []string{"while", "for"},
			Separators: []string{"do"},
			Close:      []string{"done"},
		},
		{
			Open:  []string{"case"},
			Close: []string{"esac"},
		},
		{Open: []string{"{"}, Close: []string{"}"}},
		{Open: []string{"("}, Close: []string{")"}},
	},
}

LangShell is the bracket language config for Bourne shell / bash.

Functions

func DecodeFilename

func DecodeFilename(keys [][]byte) string

DecodeFilename reconstructs a filename from chained N record keys. Keys must be in order (part 0, 1, ..., 255).

func DecodeTrigram

func DecodeTrigram(v uint32) string

DecodeTrigram converts a 24-bit trigram value back to a 3-byte string. Bytes that are 0 (whitespace-encoded) are shown as spaces.
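
The decoding rule can be illustrated with a short sketch. The function decode is a hypothetical re-implementation of the documented behaviour, and it assumes the first byte of the trigram occupies the high 8 bits of the 24-bit value.

```go
package main

import "fmt"

// decode unpacks a 24-bit trigram into three bytes and shows zero bytes
// as spaces. The byte order is an assumption, not confirmed by the docs.
func decode(v uint32) string {
	b := [3]byte{byte(v >> 16), byte(v >> 8), byte(v)}
	for i, c := range b {
		if c == 0 {
			b[i] = ' ' // 0 is described as the whitespace encoding
		}
	}
	return string(b[:])
}

func main() {
	fmt.Printf("%q\n", decode(0x616263))
	fmt.Printf("%q\n", decode(0x610062))
}
```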

func FinalKey

func FinalKey(filename string) []byte

FinalKey returns the final N record key for direct fileid lookup.

func LineChunkFunc

func LineChunkFunc(_ string, content []byte, yield func(Chunk) bool) error

LineChunkFunc is a built-in ChunkFunc that yields one chunk per line. Range is "N-N" (1-based line number).

func MarkdownChunkFunc

func MarkdownChunkFunc(_ string, content []byte, yield func(Chunk) bool) error

MarkdownChunkFunc splits markdown content into paragraph-based chunks. Heading lines start new chunks; a heading and its following paragraph (up to the next blank line or heading) form one chunk. Blank lines are boundaries only and are not included in any chunk's content. Fenced code blocks (``` or ~~~) suppress blank-line splitting: all lines from the opening fence through the matching close belong to the current chunk. R465, R466, R467, R468

func PairGet

func PairGet(pairs []Pair, key string) ([]byte, bool)

PairGet returns the value for the first Pair matching key and true, or nil and false if no Pair matches.

func ScoreOverlap

func ScoreOverlap(queryTrigrams []uint32, chunkCounts map[uint32]int, _ int) float64

ScoreOverlap: count of matching query trigrams, no normalization (OR semantics).

func TrigramValue

func TrigramValue(a, b, c byte) uint32

TrigramValue computes the 24-bit trigram from three byte values.

func UnmarshalTValue

func UnmarshalTValue(data []byte) ([]uint64, error)

UnmarshalTValue decodes a TRecord value. Trigram must be set separately (from key).

func UnmarshalWValue

func UnmarshalWValue(data []byte) ([]uint64, error)

UnmarshalWValue decodes a WRecord value. Same format as TRecord.

func ValidateAliases

func ValidateAliases(aliases map[byte]byte) error

ValidateAliases returns an error if any alias source or target byte is non-ASCII (≥ 0x80). Aliasing UTF-8 continuation or leading bytes would corrupt multibyte characters and break character-internal trigram skipping.

Types

type AppendOption

type AppendOption func(*appendConfig)

AppendOption configures AppendChunks behavior. R158

func WithAppendChunkCallback

func WithAppendChunkCallback(fn ChunkCallback) AppendOption

WithAppendChunkCallback supplies a chunk callback for append methods. CRC: crc-DB.md | R471

func WithBaseLine

func WithBaseLine(n int) AppendOption

WithBaseLine sets the 1-based line number offset for line-based chunker ranges. When non-zero, "start-end" ranges are adjusted by adding this offset. R162

func WithContentHash

func WithContentHash(hash string) AppendOption

WithContentHash sets the full-file SHA-256 hash (caller pre-computed). R159

func WithFileLength

func WithFileLength(n int64) AppendOption

WithFileLength sets the full file size after append. R161

func WithModTime

func WithModTime(t int64) AppendOption

WithModTime sets the file modification time (Unix nanoseconds). R160

type Bitset

type Bitset [BitsetSize]byte

Bitset is a fixed-size bitset for 16,777,216 trigrams (2^24).

func (*Bitset) Bytes

func (b *Bitset) Bytes() []byte

Bytes returns the bitset as a byte slice for storage.

func (*Bitset) Count

func (b *Bitset) Count() int

Count returns the number of set bits.

func (*Bitset) ForEach

func (b *Bitset) ForEach(fn func(uint32))

ForEach calls fn for each set bit in ascending order.

func (*Bitset) FromBytes

func (b *Bitset) FromBytes(data []byte)

FromBytes loads the bitset from stored bytes.

func (*Bitset) Set

func (b *Bitset) Set(trigram uint32)

Set sets the bit for the given trigram.

func (*Bitset) Test

func (b *Bitset) Test(trigram uint32) bool

Test returns whether the bit for the given trigram is set.
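
A bitset of this size maps each 24-bit trigram to one bit in a 2^21-byte array. The sketch below shows one way to do that; the lowercase type and method names are hypothetical, and the bit numbering within each byte (least significant bit first) is an assumption about the layout.

```go
package main

import (
	"fmt"
	"math/bits"
)

const bitsetSize = 1 << 21 // 2^21 bytes = 2^24 bits, same value as BitsetSize

// bitset holds one bit per possible 24-bit trigram.
type bitset [bitsetSize]byte

// set marks the trigram as present; values are masked to 24 bits.
func (b *bitset) set(t uint32) {
	t &= 0xFFFFFF
	b[t>>3] |= 1 << (t & 7)
}

// test reports whether the trigram's bit is set.
func (b *bitset) test(t uint32) bool {
	t &= 0xFFFFFF
	return b[t>>3]&(1<<(t&7)) != 0
}

// count returns the number of set bits.
func (b *bitset) count() int {
	n := 0
	for _, x := range b[:] {
		n += bits.OnesCount8(x)
	}
	return n
}

func main() {
	b := new(bitset) // 2 MiB, allocated once
	b.set(0x616263)
	b.set(0x626364)
	fmt.Println(b.test(0x616263), b.test(0x000001), b.count())
}
```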

type BracketGroup

type BracketGroup struct {
	Open       []string // openers: e.g. ["{"], ["if","while","for"]
	Separators []string // optional: e.g. ["else","elif","then"]
	Close      []string // closers: e.g. ["}"], ["end","done","fi"]
}

BracketGroup defines one set of matching brackets. Separators are mid-group markers (e.g. "else" between "if"/"end"). R309

type BracketLang

type BracketLang struct {
	LineComments  []string       // e.g. "//", "#", "--"
	BlockComments [][2]string    // e.g. {{"/*", "*/"}, {"<!--", "-->"}}
	StringDelims  []StringDelim  // e.g. {`"`, `"`, `\`}
	Brackets      []BracketGroup // open/separator/close sets
}

BracketLang defines the lexical rules for one language. R307

func LangByName

func LangByName(name string) (BracketLang, bool)

LangByName returns a BracketLang config by name, or false if not found.

type CRecord

type CRecord struct {
	ChunkID  uint64
	Hash     [32]byte
	Trigrams []TrigramEntry
	Tokens   []TokenEntry
	Attrs    []Pair
	FileIDs  []uint64
	// contains filtered or unexported fields
}

CRecord is the per-chunk record. Self-describing: everything needed for search, scoring, filtering, and removal. Carries unexported db/txn — the chunk is tied to the transaction that read it.

func UnmarshalCValue

func UnmarshalCValue(data []byte) (CRecord, error)

UnmarshalCValue decodes a CRecord value. ChunkID must be set separately (from key). v2 format: hash + trigrams + tokens + attrs + fileids

func (*CRecord) DB

func (c *CRecord) DB() *DB

DB returns the database this record belongs to.

func (*CRecord) FileRecord

func (c *CRecord) FileRecord(fileid uint64) (FRecord, error)

FileRecord navigates to an F record within the same transaction.

func (*CRecord) MarshalValue

func (c *CRecord) MarshalValue() []byte

MarshalValue encodes the CRecord value (everything except the key prefix and chunkid). v2 format: hash + trigrams + tokens + attrs + fileids

func (*CRecord) Txn

func (c *CRecord) Txn() *lmdb.Txn

Txn returns the transaction this record was read from. Implements TxnHolder.

type Chunk

type Chunk struct {
	Range   []byte
	Content []byte
	Attrs   []Pair
}

Chunk is a single chunk yielded by a Chunker. Range is an opaque label (e.g. "1-10" for lines); Content is the chunk text. Range and Content may be reused between yields — caller must copy if retaining. Attrs is optional per-chunk metadata (nil means no attrs).

type ChunkCache

type ChunkCache struct {
	// contains filtered or unexported fields
}

ChunkCache is a per-query cache for file content and chunked data. Avoids redundant file reads and re-chunking when processing search results.

func (*ChunkCache) ChunkText

func (cc *ChunkCache) ChunkText(fpath, rangeLabel string) ([]byte, bool)

ChunkText returns a single chunk's content by range label. Uses lazy chunking — stops as soon as the target is found.

func (*ChunkCache) GetChunks

func (cc *ChunkCache) GetChunks(fpath, targetRange string, before, after int) ([]ChunkResult, error)

GetChunks retrieves the target chunk and up to before/after positional neighbors. Same contract as DB.GetChunks but cached.

type ChunkCallback

type ChunkCallback func(chunkText string)

ChunkCallback receives clean chunk text during indexing. Called once per chunk, in chunk order. The string is a copy, safe to retain. CRC: crc-DB.md | R469

type ChunkFilter

type ChunkFilter func(chunk CRecord) bool

ChunkFilter receives a CRecord during candidate evaluation. Return true to keep the chunk, false to reject it. The CRecord carries transaction context — use Txn() and DB() for lookups.

type ChunkFunc

type ChunkFunc func(path string, content []byte, yield func(Chunk) bool) error

ChunkFunc is a generator that yields chunks for a file. Convenience type — wrap with FuncChunker to get a full Chunker.
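
A function with this shape could be written as in the sketch below, which yields one chunk per blank-line-separated paragraph. The local Chunk type mirrors the documented struct without Attrs, paragraphChunks is a hypothetical name, and the "start-end" line-number range label is just one possible choice since ranges are opaque.

```go
package main

import (
	"bytes"
	"fmt"
)

// Chunk mirrors the documented struct, minus Attrs for brevity.
type Chunk struct {
	Range   []byte
	Content []byte
}

// paragraphChunks yields one chunk per paragraph, labelled with the
// 1-based first and last line numbers of that paragraph.
func paragraphChunks(_ string, content []byte, yield func(Chunk) bool) error {
	lines := bytes.Split(content, []byte("\n"))
	start := -1 // first line index of the open paragraph, -1 when none is open
	flush := func(end int) bool {
		if start < 0 {
			return true
		}
		c := Chunk{
			Range:   []byte(fmt.Sprintf("%d-%d", start+1, end)),
			Content: bytes.Join(lines[start:end], []byte("\n")),
		}
		start = -1
		return yield(c)
	}
	for i, line := range lines {
		if len(bytes.TrimSpace(line)) == 0 {
			if !flush(i) {
				return nil // consumer asked to stop early
			}
			continue
		}
		if start < 0 {
			start = i
		}
	}
	flush(len(lines))
	return nil
}

func main() {
	text := []byte("a\nb\n\nc\n")
	paragraphChunks("demo.txt", text, func(c Chunk) bool {
		fmt.Printf("%s\t%q\n", c.Range, c.Content)
		return true
	})
}
```

Returning false from yield stops the generator, which is what makes lazy lookups of a single chunk possible.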

func RunChunkerFunc

func RunChunkerFunc(cmd string) ChunkFunc

RunChunkerFunc returns a ChunkFunc that executes an external command. The command receives the filepath as an argument and outputs one chunk per line on stdout as "range\tcontent".

type ChunkResult

type ChunkResult struct {
	Path    string `json:"path"`
	Range   string `json:"range"`
	Content string `json:"content"`
	Index   int    `json:"index"` // 0-based position in the file's chunk list
	Attrs   []Pair `json:"attrs,omitempty"`
}

ChunkResult holds a single chunk with its content and position. R201

type Chunker

type Chunker interface {
	Chunks(path string, content []byte, yield func(Chunk) bool) error
	ChunkText(path string, content []byte, rangeLabel string) ([]byte, bool)
}

Chunker is the interface for chunking strategies. Chunks produces chunks for indexing; ChunkText retrieves a single chunk's content.

func BracketChunker

func BracketChunker(lang BracketLang) Chunker

BracketChunker returns a Chunker for the given language config. R320

func IndentChunker

func IndentChunker(lang BracketLang, tabWidth int) Chunker

IndentChunker returns a Chunker for indentation-scoped languages. tabWidth controls how tabs count for column calculation (0 = one column per tab). R333

type DB

type DB struct {
	// contains filtered or unexported fields
}

func Create

func Create(path string, opts Options) (*DB, error)

Seq: seq-init.md

func Open

func Open(path string, opts Options) (*DB, error)

func (*DB) AddChunker

func (db *DB) AddChunker(name string, c Chunker) error

CRC: crc-DB.md | R293

func (*DB) AddFile

func (db *DB) AddFile(fpath, strategy string, opts ...IndexOption) (uint64, error)

Seq: seq-add.md | R477

func (*DB) AddFileWithContent

func (db *DB) AddFileWithContent(fpath, strategy string, opts ...IndexOption) (uint64, []byte, error)

CRC: crc-DB.md | R120, R478

func (*DB) AddStrategy

func (db *DB) AddStrategy(name, cmd string) error

func (*DB) AddStrategyFunc

func (db *DB) AddStrategyFunc(name string, fn ChunkFunc) error

CRC: crc-DB.md | R294

func (*DB) AddTmpFile

func (db *DB) AddTmpFile(path, strategy string, content []byte, opts ...IndexOption) (uint64, error)

AddTmpFile indexes a tmp:// document in the in-memory overlay. CRC: crc-DB.md | Seq: seq-tmp-add.md | R358, R359, R360, R480

func (*DB) AppendChunks

func (db *DB) AppendChunks(fileid uint64, content []byte, strategy string, opts ...AppendOption) error

AppendChunks adds chunks to an existing file without full reindex. content is only the appended bytes, not the full file. CRC: crc-DB.md | Seq: seq-append.md R150, R151, R152, R153, R154, R155, R156, R157, R163, R164, R165, R166, R167, R168

func (*DB) AppendTmpFile

func (db *DB) AppendTmpFile(path, strategy string, content []byte, opts ...AppendOption) (uint64, error)

AppendTmpFile appends content to an existing tmp:// document, creating it if it doesn't exist. New chunks are indexed from the appended content without touching existing chunks. CRC: crc-DB.md | R428-R442, R483

func (*DB) BM25Func

func (db *DB) BM25Func(queryTrigrams []uint32) (ScoreFunc, error)

BM25Func reads T records for per-trigram document frequencies and I record counters for corpus statistics, then returns a BM25 ScoreFunc closure. CRC: crc-DB.md | R274, R277, R278

func (*DB) CheckFile

func (db *DB) CheckFile(fpath string) (FileStatus, error)

CheckFile checks whether an indexed file is fresh, stale, or missing on disk.

func (*DB) Close

func (db *DB) Close() error

func (*DB) Copy

func (db *DB) Copy() *DB

Copy returns a shallow copy of the DB sharing the LMDB env, overlay, and chunker registry. Caches are nil; the copy lazy-loads from committed LMDB state. Intended for short-lived write transactions in a separate goroutine. CRC: crc-DB.md | R459, R460, R461, R462

func (*DB) Env

func (db *DB) Env() *lmdb.Env

Env returns the underlying LMDB environment for sharing with other libraries.

func (*DB) FileIDPaths

func (db *DB) FileIDPaths() (map[uint64]string, error)

CRC: crc-DB.md | R448, R449, R450, R454

func (*DB) FileInfoByID

func (db *DB) FileInfoByID(fileid uint64) (FRecord, error)

FileInfoByID resolves a fileid to its FRecord.

func (*DB) GetChunks

func (db *DB) GetChunks(fpath, targetRange string, before, after int) ([]ChunkResult, error)

GetChunks retrieves the target chunk (identified by range label) and up to before/after positional neighbors. Returns chunks in positional order. Seq: seq-chunks.md | R197, R198, R199, R200, R201, R202, R203

func (*DB) HasTmp

func (db *DB) HasTmp() bool

HasTmp reports whether any tmp:// documents exist in the overlay. CRC: crc-DB.md | R377

func (*DB) InvalidateCaches

func (db *DB) InvalidateCaches()

InvalidateCaches nils the path and FRecord caches, forcing lazy reload on next access. Does not reset overlayOnce. CRC: crc-DB.md | R463, R464

func (*DB) NewChunkCache

func (db *DB) NewChunkCache() *ChunkCache

NewChunkCache creates a per-query chunk cache.

func (*DB) NewSearchCache

func (db *DB) NewSearchCache() func()

CRC: crc-DB.md | R456

func (*DB) QueryTrigramCounts

func (db *DB) QueryTrigramCounts(query string) ([]TrigramCount, error)

QueryTrigramCounts extracts trigrams from a query string and returns their corpus document frequencies. For diagnostic/inspection use.

func (*DB) RecordCounts

func (db *DB) RecordCounts() (map[byte]RecordStats, error)

CRC: crc-DB.md | R443, R444, R445

func (*DB) RefreshStale

func (db *DB) RefreshStale(strategy string, opts ...IndexOption) ([]FileStatus, error)

RefreshStale reindexes all stale files. If strategy is empty, each file's existing strategy is used. Returns the list of stale/missing files. CRC: crc-DB.md | R479

func (*DB) Reindex

func (db *DB) Reindex(fpath, strategy string, opts ...IndexOption) (uint64, error)

func (*DB) ReindexWithContent

func (db *DB) ReindexWithContent(fpath, strategy string, opts ...IndexOption) (uint64, []byte, error)

CRC: crc-DB.md | R121

func (*DB) RemoveFile

func (db *DB) RemoveFile(fpath string) error

func (*DB) RemoveStrategy

func (db *DB) RemoveStrategy(name string) error

func (*DB) RemoveTmpFile

func (db *DB) RemoveTmpFile(path string) error

RemoveTmpFile removes a tmp:// document from the overlay. CRC: crc-DB.md | Seq: seq-tmp-add.md | R364, R365

func (*DB) ScoreFile

func (db *DB) ScoreFile(query, fpath string, fn ScoreFunc, opts ...SearchOption) ([]ScoredChunk, error)

ScoreFile returns per-chunk scores for a single file using the given scoring function. Seq: seq-score.md | R178, R179, R180

func (*DB) Search

func (db *DB) Search(query string, opts ...SearchOption) (*SearchResults, error)

Seq: seq-search.md | R178, R179, R180, R181, R182

func (*DB) SearchFuzzy

func (db *DB) SearchFuzzy(query string, k int, opts ...SearchOption) (*SearchResults, error)

SearchFuzzy performs fast typo-tolerant search in two phases:

Phase 1: trigram OR-union tally from T record posting lists (select top-k).
Phase 2: C record re-score with ScoreCoverage for the top-k winners.

CRC: crc-DB.md | Seq: seq-fuzzy-trigram.md | R418, R419, R420, R421, R422, R423, R425, R427
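
The first phase (OR-union tally, then top-k selection) can be sketched over an in-memory map. The map-based index and the topK function are hypothetical, and the tie-break by ascending chunk ID is an assumption added to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// topK counts, for every chunk, how many query trigrams list it, and
// returns the k chunk IDs with the highest counts.
func topK(index map[uint32][]uint64, query []uint32, k int) []uint64 {
	tally := make(map[uint64]int)
	for _, t := range query {
		for _, id := range index[t] {
			tally[id]++ // OR semantics: any matching trigram contributes
		}
	}
	ids := make([]uint64, 0, len(tally))
	for id := range tally {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if tally[ids[i]] != tally[ids[j]] {
			return tally[ids[i]] > tally[ids[j]]
		}
		return ids[i] < ids[j] // deterministic tie-break (assumed)
	})
	if k >= 0 && len(ids) > k {
		ids = ids[:k]
	}
	return ids
}

func main() {
	index := map[uint32][]uint64{
		1: {10, 20},
		2: {20, 30},
		3: {10, 20},
	}
	fmt.Println(topK(index, []uint32{1, 2, 3}, 2))
}
```

Because a chunk only needs to share some trigrams with the query, a typo that breaks a few trigrams lowers the tally instead of eliminating the chunk.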

func (*DB) SearchMulti

func (db *DB) SearchMulti(query string, strategies map[string]ScoreFunc, k int, opts ...SearchOption) ([]MultiSearchResult, error)

CRC: crc-DB.md | Seq: seq-search-multi.md | R283, R284, R285, R287, R288, R289, R290

func (*DB) SearchRegex

func (db *DB) SearchRegex(pattern string, opts ...SearchOption) (*SearchResults, error)

SearchRegex searches using a regex pattern against the full trigram index. Seq: seq-search.md

func (*DB) Settings

func (db *DB) Settings() Settings

Settings returns the current database settings.

func (*DB) StaleFiles

func (db *DB) StaleFiles() ([]FileStatus, error)

StaleFiles returns the status of every indexed file.

func (*DB) TmpContent

func (db *DB) TmpContent(path string) (*bytes.Reader, error)

TmpContent returns a reader over the raw stored content of a tmp:// document. CRC: crc-DB.md | R378

func (*DB) TmpFileIDs

func (db *DB) TmpFileIDs() map[uint64]struct{}

TmpFileIDs returns the set of all current tmp:// fileids. CRC: crc-DB.md | R369

func (*DB) UpdateTmpFile

func (db *DB) UpdateTmpFile(path, strategy string, content []byte, opts ...IndexOption) error

UpdateTmpFile replaces the content of an existing tmp:// document. CRC: crc-DB.md | Seq: seq-tmp-add.md | R361, R362, R363, R481

func (*DB) Version

func (db *DB) Version() (string, error)

CRC: crc-DB.md

type FRecord

type FRecord struct {
	FileID      uint64
	ModTime     int64
	ContentHash [32]byte
	FileLength  int64
	Strategy    string
	Names       []string
	Chunks      []FileChunkEntry
	Tokens      []TokenEntry
}

FRecord is the per-file record. Metadata, ordered chunks, file-level token bag.

func UnmarshalFHeader

func UnmarshalFHeader(data []byte) (FRecord, error)

UnmarshalFHeader decodes only the header fields of an F record value: ModTime, ContentHash, FileLength, Strategy, and Names. Skips Chunks and Tokens. R451, R452

func UnmarshalFValue

func UnmarshalFValue(data []byte) (FRecord, error)

UnmarshalFValue decodes an FRecord value. FileID must be set separately (from key).

func (*FRecord) MarshalValue

func (f *FRecord) MarshalValue() []byte

MarshalValue encodes the FRecord value (everything except the key prefix and fileid).

type FileChunkEntry

type FileChunkEntry struct {
	ChunkID  uint64
	Location string
}

FileChunkEntry pairs a chunkid with its location label (opaque range string from chunker).

type FileStatus

type FileStatus struct {
	Path     string
	Status   string // "fresh", "stale", "missing"
	FileID   uint64
	Strategy string
}

FileStatus is returned by CheckFile and StaleFiles.

type FuncChunker

type FuncChunker struct {
	Fn ChunkFunc
}

FuncChunker wraps a bare ChunkFunc into a Chunker. ChunkText re-runs the function and returns the first chunk matching the range label.

func (FuncChunker) ChunkText

func (fc FuncChunker) ChunkText(path string, content []byte, rangeLabel string) ([]byte, bool)

func (FuncChunker) Chunks

func (fc FuncChunker) Chunks(path string, content []byte, yield func(Chunk) bool) error

type HRecord

type HRecord struct {
	Hash    [32]byte
	ChunkID uint64
}

HRecord maps content hash to chunkid.

type IndexOption

type IndexOption func(*indexConfig)

IndexOption configures indexing methods (AddFile, AddFileWithContent, RefreshStale, AddTmpFile, UpdateTmpFile). CRC: crc-DB.md | R472

func WithChunkCallback

func WithChunkCallback(fn ChunkCallback) IndexOption

WithChunkCallback supplies a chunk callback for indexing methods. CRC: crc-DB.md | R470

type IndexStatus

type IndexStatus struct {
	Built bool
}

IndexStatus reports the state of the index.

type KeyPair

type KeyPair struct {
	Key   []byte
	Value []byte // nil for non-final parts; caller sets fileid on final part
}

KeyPair is an N record key/value pair for filename key chains.

func EncodeFilename

func EncodeFilename(filename string) []KeyPair

EncodeFilename returns N record key/value pairs for a filename. Short filenames (≤509 bytes) produce a single final key. Longer filenames are split across chained keys.

type MultiSearchResult

type MultiSearchResult struct {
	Strategy string
	Results  []SearchResult
}

MultiSearchResult holds one strategy's results from SearchMulti. CRC: crc-DB.md | R286

type Options

type Options struct {
	CaseInsensitive bool
	Aliases         map[byte]byte // maps input bytes to replacement bytes before trigram extraction
	DBName          string        // subdatabase name, default "fts"
	MaxDBs          int           // LMDB max named databases, default 2
	MapSize         int64         // bytes, default 1GB
}

Options configures database creation and opening.

type Pair

type Pair struct {
	Key   []byte
	Value []byte
}

Pair is an opaque key-value pair for per-chunk metadata. Allows duplicate keys. Mirrors the DB wire format.

func CopyPairs

func CopyPairs(src []Pair) []Pair

CopyPairs deep-copies a slice of Pair.

type RecordStats

type RecordStats struct {
	Count      int64
	KeyBytes   int64
	ValueBytes int64
}

CRC: crc-DB.md | R445

type ScoreFunc

type ScoreFunc func(queryTrigrams []uint32, chunkCounts map[uint32]int, chunkTokenCount int) float64

ScoreFunc computes a relevance score for a chunk.

queryTrigrams: active query trigrams.
chunkCounts: trigram -> occurrence count in the chunk.
chunkTokenCount: number of tokens (words) in the chunk.

var ScoreCoverage ScoreFunc = scoreCoverage

ScoreCoverage is the coverage scoring function: fraction of active query trigrams present in chunk.
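
Coverage as described here is a simple ratio. The sketch below follows the documented ScoreFunc parameter list; the function name coverage is hypothetical, and returning 0 for an empty query is an assumption.

```go
package main

import "fmt"

// coverage returns the fraction of query trigrams that occur at least
// once in the chunk. The token count is unused by this scoring rule.
func coverage(queryTrigrams []uint32, chunkCounts map[uint32]int, _ int) float64 {
	if len(queryTrigrams) == 0 {
		return 0 // assumed behaviour for an empty query
	}
	matched := 0
	for _, t := range queryTrigrams {
		if chunkCounts[t] > 0 {
			matched++
		}
	}
	return float64(matched) / float64(len(queryTrigrams))
}

func main() {
	query := []uint32{1, 2, 3, 4}
	counts := map[uint32]int{1: 2, 3: 1}
	fmt.Println(coverage(query, counts, 0))
}
```

Dropping the final division gives the unnormalized match count that the ScoreOverlap description refers to.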

var ScoreDensityFunc ScoreFunc = scoreDensity

ScoreDensityFunc is the density scoring function for direct use with ScoreFile.

func ScoreBM25

func ScoreBM25(idf map[uint32]float64, avgTokenCount float64) ScoreFunc

ScoreBM25 returns a ScoreFunc closure implementing BM25 ranking. idf maps trigram codes to their inverse document frequency. avgTokenCount is the average chunk token count across the corpus. CRC: crc-DB.md | R272, R273
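
A closure of this kind could be built as in the sketch below, which applies the textbook BM25 term formula to trigram counts. The constants k1 = 1.2 and b = 0.75 are conventional defaults assumed for illustration; the values the library uses are not stated here, and the function name bm25 is hypothetical.

```go
package main

import "fmt"

// ScoreFunc mirrors the documented function type.
type ScoreFunc func(queryTrigrams []uint32, chunkCounts map[uint32]int, chunkTokenCount int) float64

// bm25 returns a scoring closure over precomputed IDF values and the
// corpus-wide average chunk token count.
func bm25(idf map[uint32]float64, avgTokenCount float64) ScoreFunc {
	const k1, b = 1.2, 0.75 // assumed defaults
	return func(query []uint32, counts map[uint32]int, tokens int) float64 {
		// Length normalization: longer-than-average chunks are penalized.
		norm := 1.0
		if avgTokenCount > 0 {
			norm = 1 - b + b*float64(tokens)/avgTokenCount
		}
		score := 0.0
		for _, t := range query {
			tf := float64(counts[t])
			if tf == 0 {
				continue
			}
			score += idf[t] * tf * (k1 + 1) / (tf + k1*norm)
		}
		return score
	}
}

func main() {
	score := bm25(map[uint32]float64{1: 2.0, 2: 0.5}, 10)
	fmt.Printf("%.3f\n", score([]uint32{1, 2}, map[uint32]int{1: 1}, 10))
}
```

With a term frequency of 1 and a chunk of exactly average length, the fraction reduces to 1 and the term contributes its IDF unchanged.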

type ScoredChunk

type ScoredChunk struct {
	Range string
	Score float64
}

ScoredChunk is a per-chunk trigram match score from ScoreFile.

type SearchOption

type SearchOption func(*searchConfig)

SearchOption configures search behavior.

func WithAfter

func WithAfter(t time.Time) SearchOption

WithAfter keeps chunks with timestamp >= t. Checks "timestamp" attr first (parsed as Unix nanos); falls back to file mod time from F record. CRC: crc-DB.md | R258

func WithBefore

func WithBefore(t time.Time) SearchOption

WithBefore keeps chunks with timestamp < t. Same fallback as WithAfter. CRC: crc-DB.md | R259

func WithChunkCache

func WithChunkCache(cc *ChunkCache) SearchOption

WithChunkCache threads an external ChunkCache through post-filters (verify, regex, except-regex). When present, post-filters use the cache instead of re-reading files. R486

func WithChunkFilter

func WithChunkFilter(fn ChunkFilter) SearchOption

WithChunkFilter adds a chunk filter. Multiple calls accumulate (AND semantics).

func WithCoverage

func WithCoverage() SearchOption

WithCoverage uses coverage scoring (default): matching / total active query trigrams.

func WithDensity

func WithDensity() SearchOption

WithDensity uses token-density scoring for long queries.

func WithExcept

func WithExcept(ids map[uint64]struct{}) SearchOption

WithExcept excludes chunks from the given file IDs.

func WithExceptRegex

func WithExceptRegex(patterns ...string) SearchOption

WithExceptRegex adds subtract post-filters: any match rejects the chunk. Multiple calls accumulate patterns. R184, R185

func WithLoose

func WithLoose() SearchOption

WithLoose enables OR semantics at the term level: a chunk matches if it contains any query term's trigrams. Default scoring: terms matched / total terms. CRC: crc-DB.md | Seq: seq-fuzzy-search.md | R336

func WithNoTmp

func WithNoTmp() SearchOption

WithNoTmp excludes tmp:// overlay documents from search results. CRC: crc-DB.md | R376

func WithOnly

func WithOnly(ids map[uint64]struct{}) SearchOption

WithOnly restricts search to chunks from the given file IDs.

func WithOverlap

func WithOverlap() SearchOption

WithOverlap uses overlap scoring: matching trigram count, no normalization. CRC: crc-DB.md | R271

func WithProximityRerank

func WithProximityRerank(topN int) SearchOption

WithProximityRerank reranks the top-N results by query term proximity in chunk text. CRC: crc-DB.md | R279

func WithRegexFilter

func WithRegexFilter(patterns ...string) SearchOption

WithRegexFilter adds AND post-filters: every pattern must match chunk content. Multiple calls accumulate patterns. R183, R185

func WithScoring

func WithScoring(fn ScoreFunc) SearchOption

WithScoring uses a custom scoring function.

func WithTrigramFilter

func WithTrigramFilter(fn TrigramFilter) SearchOption

WithTrigramFilter supplies a caller-defined trigram selection function.

func WithVerify

func WithVerify() SearchOption

WithVerify enables post-filter verification: after trigram intersection, read chunk text from disk and verify each query term appears as a case-insensitive substring. Eliminates trigram false positives. R124, R125
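The verification step can be pictured as a case-insensitive substring check per query term. The sketch below is a minimal illustration, assuming the query has already been split into terms. The helper verify is hypothetical, and the package's own term splitting and chunk reading are not shown.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// verify is a hypothetical helper: it reports whether every term occurs in
// the chunk text as a case-insensitive substring.
func verify(chunk []byte, terms []string) bool {
	lower := bytes.ToLower(chunk)
	for _, term := range terms {
		if !bytes.Contains(lower, []byte(strings.ToLower(term))) {
			return false // trigrams matched, but the term itself is absent
		}
	}
	return true
}

func main() {
	chunk := []byte("func Open(path string) (*DB, error)")
	fmt.Println(verify(chunk, []string{"open", "ERROR"})) // true
	fmt.Println(verify(chunk, []string{"open", "close"})) // false
}
```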

type SearchResult

type SearchResult struct {
	Path  string
	Range string
	Score float64
	// contains filtered or unexported fields
}

SearchResult is a single match from Search. R99, R490, R491

type SearchResults

type SearchResults struct {
	Results []SearchResult
	Status  IndexStatus
}

SearchResults wraps search matches with index health status.

type Settings

type Settings struct {
	CaseInsensitive    bool
	Aliases            map[byte]byte     // byte→byte alias mapping
	ChunkingStrategies map[string]string // name→cmd (empty cmd = func strategy)
}

Settings holds the in-memory representation of I records.

type StringDelim

type StringDelim struct {
	Open   string // opening delimiter
	Close  string // closing delimiter (same as Open for symmetric quotes)
	Escape string // escape character (empty = no escaping)
}

StringDelim defines a string delimiter and its escape character. R308

type TRecord

type TRecord struct {
	Trigram  uint32
	ChunkIDs []uint64
}

TRecord is the trigram inverted index entry.

func (*TRecord) MarshalValue

func (t *TRecord) MarshalValue() []byte

MarshalValue encodes the TRecord value (packed chunkid list).

type TokenEntry

type TokenEntry struct {
	Token string
	Count int
}

TokenEntry pairs a token string with its occurrence count.

type TrigramCount

type TrigramCount struct {
	Trigram uint32
	Count   int
}

TrigramCount pairs a trigram code with its corpus document frequency.

func FilterAll

func FilterAll(trigrams []TrigramCount, _ int) []TrigramCount

FilterAll uses every query trigram. No filtering.

type TrigramEntry

type TrigramEntry struct {
	Trigram uint32
	Count   int
}

TrigramEntry pairs a trigram code with its per-chunk occurrence count.

type TrigramFilter

type TrigramFilter func(trigrams []TrigramCount, totalChunks int) []TrigramCount

TrigramFilter decides which trigrams to use for a given query. It receives the query's trigrams with their corpus-wide document frequencies, and the total number of indexed chunks. It returns the subset to search with.
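As an illustration of a caller-defined filter, the sketch below keeps only trigrams under an absolute frequency cap and falls back to the full list when nothing qualifies. To stay self-contained it declares local stand-ins for TrigramCount and TrigramFilter that mirror the signatures above. The filter rareOrAll and its fallback policy are hypothetical.

```go
package main

import "fmt"

// Local stand-ins that mirror the documented microfts2 types, so this
// sketch compiles without importing the package.
type TrigramCount struct {
	Trigram uint32
	Count   int
}

type TrigramFilter func(trigrams []TrigramCount, totalChunks int) []TrigramCount

// rareOrAll is a hypothetical custom filter: it keeps trigrams whose document
// frequency is at most maxCount, and returns the input unchanged if that
// would leave nothing to search with.
func rareOrAll(maxCount int) TrigramFilter {
	return func(trigrams []TrigramCount, _ int) []TrigramCount {
		var kept []TrigramCount
		for _, tc := range trigrams {
			if tc.Count <= maxCount {
				kept = append(kept, tc)
			}
		}
		if len(kept) == 0 {
			return trigrams // fallback: never return an empty selection
		}
		return kept
	}
}

func main() {
	query := []TrigramCount{{0x66756e, 900}, {0x4f7065, 12}, {0x70656e, 40}}
	fmt.Println(len(rareOrAll(50)(query, 1000))) // 2
	fmt.Println(len(rareOrAll(5)(query, 1000)))  // 3
}
```

With the real package, a function of this shape would be supplied through WithTrigramFilter.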

func FilterBestN

func FilterBestN(n int) TrigramFilter

FilterBestN returns a TrigramFilter that keeps the N trigrams with the lowest document frequency.
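The selection FilterBestN describes can be sketched as a sort by document frequency followed by a cut. This is an illustration of the described behavior, not the package's code: bestN and the local TrigramCount stand-in are assumptions, and tie-breaking order in the real filter is unknown.

```go
package main

import (
	"fmt"
	"sort"
)

// TrigramCount is a local stand-in mirroring the documented type.
type TrigramCount struct {
	Trigram uint32
	Count   int
}

// bestN is a hypothetical re-creation of the described behavior: keep the
// n trigrams with the lowest document frequency.
func bestN(n int) func([]TrigramCount, int) []TrigramCount {
	return func(trigrams []TrigramCount, _ int) []TrigramCount {
		sorted := append([]TrigramCount(nil), trigrams...) // copy, leave input untouched
		sort.SliceStable(sorted, func(i, j int) bool {
			return sorted[i].Count < sorted[j].Count
		})
		if n >= 0 && n < len(sorted) {
			sorted = sorted[:n]
		}
		return sorted
	}
}

func main() {
	query := []TrigramCount{{0x66756e, 900}, {0x4f7065, 12}, {0x70656e, 40}}
	for _, tc := range bestN(2)(query, 1000) {
		fmt.Printf("%#x %d\n", tc.Trigram, tc.Count)
	}
	// 0x4f7065 12
	// 0x70656e 40
}
```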

func FilterByRatio

func FilterByRatio(maxRatio float64) TrigramFilter

FilterByRatio returns a TrigramFilter that skips trigrams appearing in more than maxRatio of total chunks. E.g., 0.50 skips trigrams in >50% of chunks.
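A worked example of the ratio rule: with 1000 chunks and maxRatio 0.50, a trigram found in 900 chunks is skipped, while one found in exactly 500 is kept because it is not in more than 50%. The helper byRatio and the local TrigramCount stand-in below are assumptions that restate this rule, not the package's implementation.

```go
package main

import "fmt"

// TrigramCount is a local stand-in mirroring the documented type.
type TrigramCount struct {
	Trigram uint32
	Count   int
}

// byRatio is a hypothetical re-creation of the described behavior: drop
// trigrams that appear in more than maxRatio of all chunks.
func byRatio(maxRatio float64) func([]TrigramCount, int) []TrigramCount {
	return func(trigrams []TrigramCount, totalChunks int) []TrigramCount {
		limit := maxRatio * float64(totalChunks)
		var kept []TrigramCount
		for _, tc := range trigrams {
			if float64(tc.Count) <= limit {
				kept = append(kept, tc)
			}
		}
		return kept
	}
}

func main() {
	query := []TrigramCount{{0x66756e, 900}, {0x4f7065, 500}, {0x70656e, 40}}
	for _, tc := range byRatio(0.50)(query, 1000) {
		fmt.Println(tc.Count)
	}
	// 500
	// 40
}
```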

type Trigrams

type Trigrams struct {
	// contains filtered or unexported fields
}

Trigrams extracts raw byte trigrams from text. Every byte is its own value; there is no character set mapping. Whitespace bytes act as boundaries, and runs of whitespace collapse into one. Case insensitivity is implemented with bytes.ToLower(), and byte aliases are applied before extraction.

func NewTrigrams

func NewTrigrams(caseInsensitive bool, aliases map[byte]byte) *Trigrams

NewTrigrams creates a trigram extractor.

func (*Trigrams) EncodeTrigram

func (t *Trigrams) EncodeTrigram(s string) (uint32, bool)

EncodeTrigram converts a 3-byte string to a 24-bit trigram using the same encoding as ExtractTrigrams: case folding, aliases, whitespace→0. Returns 0, false if the trigram cannot appear in the index (e.g. all whitespace, or consecutive whitespace which encode() collapses away).
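The encoding rules above can be sketched as follows. Several details are assumptions rather than documented facts: the byte order within the 24-bit value (first byte in the high bits), the exact set of whitespace bytes, ASCII-only case folding, and the absence of aliases. The helper pack is hypothetical.

```go
package main

import "fmt"

// isSpace lists the bytes this sketch treats as whitespace (an assumption).
func isSpace(b byte) bool {
	return b == ' ' || b == '\t' || b == '\n' || b == '\r'
}

// pack is a hypothetical helper: it folds ASCII case, maps whitespace to 0,
// and packs three bytes into the low 24 bits of a uint32.
func pack(s string) (uint32, bool) {
	if len(s) != 3 {
		return 0, false
	}
	var b [3]byte
	var ws [3]bool
	for i := 0; i < 3; i++ {
		c := s[i]
		switch {
		case isSpace(c):
			c, ws[i] = 0, true
		case c >= 'A' && c <= 'Z':
			c += 'a' - 'A'
		}
		b[i] = c
	}
	// Adjacent whitespace would have been collapsed during extraction, so
	// such a trigram could never appear in the index.
	if (ws[0] && ws[1]) || (ws[1] && ws[2]) {
		return 0, false
	}
	return uint32(b[0])<<16 | uint32(b[1])<<8 | uint32(b[2]), true
}

func main() {
	code, ok := pack("Fun")
	fmt.Printf("%#x %v\n", code, ok) // 0x66756e true
	code, ok = pack("a b")
	fmt.Printf("%#x %v\n", code, ok) // 0x610062 true
	_, ok = pack("a  ")
	fmt.Println(ok) // false
}
```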

func (*Trigrams) ExtractTrigrams

func (t *Trigrams) ExtractTrigrams(data []byte) []uint32

ExtractTrigrams extracts all trigrams from data. Character-internal trigrams (windows entirely within a multibyte UTF-8 char) are skipped.
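To show what "character-internal" means in practice, the sketch below slides a 3-byte window over the input and drops any window whose bytes all belong to one multibyte UTF-8 character. It ignores whitespace handling, case folding, and aliases. The helper crossBoundaryWindows is hypothetical and is not the package's extractor.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// crossBoundaryWindows is a hypothetical helper: it returns every 3-byte
// window of data except those lying entirely inside one UTF-8 character.
func crossBoundaryWindows(data []byte) [][3]byte {
	// start[i] is the offset of the first byte of the rune containing byte i.
	start := make([]int, len(data))
	for i := 0; i < len(data); {
		_, size := utf8.DecodeRune(data[i:])
		for j := i; j < i+size; j++ {
			start[j] = i
		}
		i += size
	}
	var out [][3]byte
	for i := 0; i+3 <= len(data); i++ {
		if start[i] == start[i+2] {
			continue // all three bytes belong to a single character
		}
		out = append(out, [3]byte{data[i], data[i+1], data[i+2]})
	}
	return out
}

func main() {
	// "€" is the three bytes e2 82 ac, so "a€b" is 61 e2 82 ac 62.
	for _, w := range crossBoundaryWindows([]byte("a€b")) {
		fmt.Printf("% x\n", w[:])
	}
	// 61 e2 82
	// 82 ac 62
}
```

The window e2 82 ac is skipped because it lies wholly inside the euro sign, while the two windows that straddle a character boundary are kept.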

func (*Trigrams) TrigramCounts

func (t *Trigrams) TrigramCounts(data []byte) map[uint32]int

TrigramCounts extracts trigrams with occurrence counts. Character-internal trigrams (windows entirely within a multibyte UTF-8 char) are skipped.

type TxnHolder

type TxnHolder interface {
	Txn() *lmdb.Txn
}

TxnHolder is anything that carries an LMDB transaction. CRecord implements it; txnWrap wraps raw transactions from View/Update blocks.

type WRecord

type WRecord struct {
	TokenHash uint32
	ChunkIDs  []uint64
}

WRecord is the token inverted index entry.

func (*WRecord) MarshalValue

func (w *WRecord) MarshalValue() []byte

MarshalValue encodes the WRecord value (packed chunkid list, same as TRecord).

Directories

Path Synopsis
cmd
microfts command
