package search · v3.1.0
Published: Mar 3, 2026 · License: MIT
Go In-Memory Search Engine

This project is a robust, extensible, in-memory search engine written in Go. It supports a rich text-based query language that combines boolean keyword search with structured metadata filtering. The engine is designed to be self-contained with minimal dependencies, featuring a "self-healing" parser that gracefully handles common user syntax errors.

Features

  • Rich Query Syntax:
    • Boolean Logic: AND, OR, NOT operators. AND is the default operator between terms (e.g., cat dog is the same as cat AND dog).
    • Operator Precedence: NOT (tightest) > AND > OR (loosest). a OR b AND c parses as a OR (b AND c). a OR b NOT path:vendor parses as a OR (b AND NOT path:vendor). Use parentheses to override: (a OR b) NOT path:vendor.
    • Grouping: Use parentheses () for controlling order of operations.
    • Phrase Search: Exact phrases using double quotes (e.g., "lazy fox").
    • Regex Search: Pattern matching using /.../ syntax (e.g., /[cb]at/).
  • Structured Metadata Filtering:
    • Filter on metadata fields using operators: =, !=, >=, <=.
    • Multi-value filtering with commas: lang=go,python.
    • Example: complexity>=5 AND lang=go.
  • Semantic Aliases:
    • Define user-friendly aliases that are transformed into concrete queries.
    • Example: complexity=high is automatically rewritten to complexity>=8.
  • Robust "Self-Healing" Parser:
    • Automatically corrects common syntax errors and informs the user.
    • Handles dangling operators (cat AND).
    • Handles mismatched parentheses ((cat OR dog or cat)).
  • Optimized Execution:
    • A simplified query planner reorders clauses to execute the most restrictive filters first, ensuring better performance.

How to Use

This is a library package. Import it and use the SearchEngine API:

import (
    "fmt"
    "log"

    "github.com/boyter/cs/v3/pkg/search"
)

// Create documents to search over.
docs := []*search.Document{
    {
        Path:       "/src/main.go",
        Filename:   "main.go",
        Language:   "Go",
        Extension:  "go",
        Content:    []byte("package main\nfunc main() {}"),
        Complexity: 2,
    },
}

// Create a search engine and run a query.
engine := search.NewSearchEngine(docs)
result, err := engine.Search("main AND lang=go", false) // false = case-insensitive
if err != nil {
    log.Fatal(err)
}

for _, doc := range result.Documents {
    fmt.Println(doc.Path)
}
// result.Notices contains any parser warnings/corrections.
// result.TermsToHighlight contains terms for highlighting in output.

Architecture and Flow

The search engine processes a query through a multi-stage pipeline. This decoupled architecture makes the system easier to understand, maintain, and extend.

Architectural Diagram
+---------------+      +-------------+      +-------------------+      +-------------+      +------------------+      +------------------+      +------------------+
|  Query String |----->|    Lexer    |----->|      Parser       |----->| Transformer |----->|     Planner      |----->|     Executor     |----->|    Extractor     |
| "cat AND > 5" |      | (Tokens)    |      | (Initial AST)     |      | (Final AST) |      | (Optimized AST)  |      | (Filtering Data) |      | (Highlight Terms)|
+---------------+      +-------------+      +-------------------+      +-------------+      +------------------+      +------------------+      +------------------+
                                                                                                                                                       |
                                                                                                                                                       V
                                                                                                                                              +--------------------+
                                                                                                                                              |   Search Results   |
                                                                                                                                              +--------------------+
Breakdown of Stages
  1. Lexer (Tokenizer) - lexer.go

    • Responsibility: Scans the raw query string and breaks it down into a sequence of "tokens" (e.g., KEYWORD, AND, OPERATOR, NUMBER, COMMA). It has no understanding of grammar; it only identifies the pieces.
  2. Parser - parser.go

    • Responsibility: Takes the stream of tokens from the Lexer and builds an Abstract Syntax Tree (AST) based on a defined grammar. The AST is a tree structure that represents the logical meaning of the query.
    • Key Feature: This is where the self-healing logic resides. If the parser encounters a syntax error (like a missing parenthesis), it attempts to correct it and adds a Notice to the search result.
  3. Transformer - transformer.go

    • Responsibility: Walks the initial AST and applies semantic transformations. Its job is to translate user-friendly aliases into concrete, executable filter logic.
    • Example: It finds a FilterNode for complexity=high and replaces it with a new FilterNode for complexity>=8.
  4. Planner - planner.go

    • Responsibility: Performs a simplified query optimization step. It analyzes the AST, specifically looking for AND clauses, and reorders the nodes to be more efficient.
    • Logic: It reorders clauses so that low-cost operations (like metadata filters) are executed before high-cost operations (like regex searches). This dramatically reduces the amount of data that needs to be processed in later stages.
  5. Executor - executor.go

    • Responsibility: This is the engine's workhorse. It walks the final, optimized AST and executes the search logic against the in-memory slice of Document structs.
    • Core Search Logic: The evaluate function contains the switch statement that handles each node type (e.g., AndNode, KeywordNode, FilterNode). The actual text matching (strings.Contains, regexp.Match) happens here.
  6. Extractor - extractor.go

    • Responsibility: Walks the AST to extract positive search terms for highlighting in results. It skips terms inside NOT subtrees so that negated terms are not highlighted.
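
The planner's reordering (stage 4) can be sketched as a simple cost sort. The clause kinds and cost values below are illustrative assumptions, not the package's actual planner internals:

```go
package main

import (
	"fmt"
	"sort"
)

// estimatedCost assigns an illustrative cost to each clause kind:
// metadata filters are cheap, keyword scans moderate, regex scans
// expensive. The kinds and numbers here are assumptions for the
// sketch, not the package's real planner values.
func estimatedCost(kind string) int {
	switch kind {
	case "filter":
		return 1
	case "keyword":
		return 10
	case "regex":
		return 100
	}
	return 50
}

// planAnd reorders the clauses of an AND node so that cheap,
// restrictive clauses execute first.
func planAnd(clauses []string) []string {
	out := append([]string(nil), clauses...)
	sort.SliceStable(out, func(i, j int) bool {
		return estimatedCost(out[i]) < estimatedCost(out[j])
	})
	return out
}

func main() {
	fmt.Println(planAnd([]string{"regex", "keyword", "filter"})) // [filter keyword regex]
}
```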

Built-in Filters

The engine ships with the following metadata filters, registered in executor.go:

Filter Name        Aliases            Type     Operators    Multi-value  Description
complexity         (none)             Numeric  = != >= <=   No           Matches Document.Complexity
lang / language    Each is an alias   String   = !=         Yes          Matches Document.Language (case-insensitive)
file / filename    Each is an alias   String   = !=         Yes          Matches Document.Filename (case-insensitive substring, or glob when *, ?, [ present)
path / filepath    Each is an alias   String   = !=         No           Matches Document.Path (case-insensitive substring, or glob when *, ?, [ present)
ext / extension    Each is an alias   String   = !=         Yes          Matches Document.Extension (case-insensitive)

Semantic alias: complexity=high is rewritten to complexity>=8.

Glob pattern support: The file and path filters support glob patterns when the value contains *, ?, or [ characters. Without these characters, the existing substring matching is used for backward compatibility.

Pattern Meaning
* Matches any sequence of non-separator characters
? Matches any single non-separator character
[abc] Matches any character in the set

Examples:

  • file:*.go — files ending in .go
  • file:*_test.go — Go test files only
  • file:file?.py — matches file1.py, file2.py, etc.
  • path:*/search/* — files in any search directory
  • path:src/main/* — files directly under src/main/
  • NOT path:vendor/*/* — exclude vendor directory

For path: globs, a sliding-window approach matches pattern segments against contiguous segments of the file path. ** (recursive glob) is not supported; use substring matching without glob characters for broad directory matching (e.g., path:vendor).
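
The sliding-window matching described above can be sketched with the standard library's path.Match (a simplified illustration; the package's actual implementation may differ in details):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchPathGlob slides the pattern's segments over contiguous segments
// of the file path, matching each segment with path.Match. A simplified
// sketch of the sliding-window approach, not the package's exact code.
func matchPathGlob(pattern, filePath string) bool {
	pat := strings.Split(strings.Trim(pattern, "/"), "/")
	segs := strings.Split(strings.Trim(filePath, "/"), "/")
	for start := 0; start+len(pat) <= len(segs); start++ {
		ok := true
		for i, p := range pat {
			if matched, err := path.Match(p, segs[start+i]); err != nil || !matched {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(matchPathGlob("*/search/*", "/src/pkg/search/lexer.go")) // true
	fmt.Println(matchPathGlob("src/main/*", "/src/main/app.go"))         // true
	fmt.Println(matchPathGlob("vendor/*/*", "/src/main/app.go"))         // false
}
```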

Multi-value syntax: Filters that support multi-value accept comma-separated lists (e.g., lang=go,python, ext=ts,tsx). This works like an IN clause — the document matches if its value is any of the listed values.
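
The IN-clause behavior can be illustrated with a small self-contained sketch (matchesAny is a hypothetical helper, not part of the package's API):

```go
package main

import (
	"fmt"
	"strings"
)

// matchesAny shows the IN-clause semantics: with "=", the document
// matches if its value is any of the listed values; with "!=", it
// matches only if its value is none of them. Comparison is
// case-insensitive, mirroring the lang/ext filters described above.
func matchesAny(op, docValue string, values []string) bool {
	isEquality := op == "="
	for _, v := range values {
		if strings.EqualFold(v, docValue) {
			return isEquality
		}
	}
	return !isEquality
}

func main() {
	fmt.Println(matchesAny("=", "Go", []string{"go", "python"}))    // true
	fmt.Println(matchesAny("!=", "Go", []string{"go", "python"}))   // false
	fmt.Println(matchesAny("!=", "Rust", []string{"go", "python"})) // true
}
```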

How to Extend the Engine

The engine was designed to be easily extended. Here are guides for the most common modifications.

Scenario 1: Adding a New Metadata Filter (e.g., path prefix)

Let's add the ability to filter on a path prefix, so users can write pathprefix="/src".

Step 1: Implement the Filter Handler

In executor.go, create a new function that matches the FilterHandler signature. Each handler receives a single document and returns whether it matches.

// in executor.go

// handlePathPrefixFilter filters documents by their path prefix.
func handlePathPrefixFilter(op string, val interface{}, doc *Document) bool {
	prefix, ok := val.(string)
	if !ok {
		return false
	}
	prefix = strings.ToLower(prefix)

	switch op {
	case "=":
		return strings.HasPrefix(strings.ToLower(doc.Path), prefix)
	case "!=":
		return !strings.HasPrefix(strings.ToLower(doc.Path), prefix)
	}
	return false
}

Step 2: Register the New Handler

In executor.go, inside the registerFilterHandlers function, map the field name to your new handler function.

// in executor.go

func (se *SearchEngine) registerFilterHandlers() {
	se.filterHandlers = make(map[string]FilterHandler)
	se.filterHandlers["complexity"] = handleComplexityFilter
	se.filterHandlers["lang"] = handleLanguageFilter
	se.filterHandlers["language"] = handleLanguageFilter
	se.filterHandlers["file"] = handleFilenameFilter
	se.filterHandlers["filename"] = handleFilenameFilter
	se.filterHandlers["ext"] = handleExtensionFilter
	se.filterHandlers["extension"] = handleExtensionFilter
	se.filterHandlers["pathprefix"] = handlePathPrefixFilter // <-- ADD THIS LINE
}

That's it! The parser is already generic enough to understand pathprefix="/src". Now the executor knows how to handle it. Add tests in search_test.go to verify.
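
As a quick sanity check, the handler's logic can be exercised in isolation with a stand-in Document type (a self-contained sketch, not a test against the real package):

```go
package main

import (
	"fmt"
	"strings"
)

// Document is a stand-in for the package's type, for isolated testing.
type Document struct{ Path string }

// handlePathPrefixFilter is the same logic as Step 1 above.
func handlePathPrefixFilter(op string, val interface{}, doc *Document) bool {
	prefix, ok := val.(string)
	if !ok {
		return false
	}
	prefix = strings.ToLower(prefix)
	switch op {
	case "=":
		return strings.HasPrefix(strings.ToLower(doc.Path), prefix)
	case "!=":
		return !strings.HasPrefix(strings.ToLower(doc.Path), prefix)
	}
	return false
}

func main() {
	doc := &Document{Path: "/src/main.go"}
	fmt.Println(handlePathPrefixFilter("=", "/src", doc))  // true
	fmt.Println(handlePathPrefixFilter("!=", "/src", doc)) // false
}
```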

Scenario 2: Adding a New Semantic Alias (e.g., complexity=low)

Currently only complexity=high is implemented (rewritten to complexity>=8). Let's add complexity=low.

Step 1: Implement the Transformation Logic

In transformer.go, find the transformFilterNode function and add a new case to the if block.

// in transformer.go

func (t *Transformer) transformFilterNode(node *FilterNode) Node {
	if node.Field == "complexity" && node.Operator == "=" {
		if val, ok := node.Value.(string); ok {
			valLower := strings.ToLower(val)

			if valLower == "high" {
				// Existing code: 'high' is defined as 8 or more
				newNode := &FilterNode{
					Field:    "complexity",
					Operator: ">=",
					Value:    8,
				}
				notice := fmt.Sprintf("Notice: '%s=%s' was interpreted as 'complexity >= 8'.", node.Field, val)
				t.notices = append(t.notices, notice)
				return newNode
			}

			// v-- ADD THIS LOGIC --v
			if valLower == "low" {
				newNode := &FilterNode{
					Field:    "complexity",
					Operator: "<=",
					Value:    3, // 'low' is defined as 3 or less
				}
				notice := fmt.Sprintf("Notice: '%s=%s' was interpreted as 'complexity <= 3'.", node.Field, val)
				t.notices = append(t.notices, notice)
				return newNode
			}
			// ^-- END OF NEW LOGIC --^
		}
	}
	return node
}
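
The new rewrite can be checked in isolation with a stand-in FilterNode and a hypothetical helper that mirrors the transformation above (omitting the notice bookkeeping):

```go
package main

import (
	"fmt"
	"strings"
)

// FilterNode is a stand-in for the package's type, for isolated testing.
type FilterNode struct {
	Field    string
	Operator string
	Value    interface{}
}

// rewriteComplexityAlias mirrors the transform above: complexity=high
// becomes complexity>=8, complexity=low becomes complexity<=3, and any
// other node is returned unchanged.
func rewriteComplexityAlias(node *FilterNode) *FilterNode {
	if node.Field != "complexity" || node.Operator != "=" {
		return node
	}
	val, ok := node.Value.(string)
	if !ok {
		return node
	}
	switch strings.ToLower(val) {
	case "high":
		return &FilterNode{Field: "complexity", Operator: ">=", Value: 8}
	case "low":
		return &FilterNode{Field: "complexity", Operator: "<=", Value: 3}
	}
	return node
}

func main() {
	n := rewriteComplexityAlias(&FilterNode{Field: "complexity", Operator: "=", Value: "low"})
	fmt.Println(n.Operator, n.Value) // <= 3
}
```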
Scenario 3: Fuzzy Matching

Fuzzy matching is fully implemented using Levenshtein distance. The FuzzyNode AST type supports distance-based matching with ~1 (edit distance 1) and ~2 (edit distance 2) syntax. The executor slides a window over file content and finds near-matches within the specified edit distance.
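
The approach can be illustrated with a self-contained sketch: a textbook Levenshtein distance plus a fixed-size sliding window (a simplification of the executor's actual logic):

```go
package main

import "fmt"

// levenshtein computes the edit distance between two strings using the
// standard two-row dynamic-programming formulation.
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min3(curr[j-1]+1, prev[j]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// fuzzyMatch slides a fixed-size window over the content and reports
// whether any window is within maxDist edits of the term. A sketch in
// the spirit of the executor's approach, not its exact code.
func fuzzyMatch(content, term string, maxDist int) bool {
	if len(content) < len(term) {
		return levenshtein(content, term) <= maxDist
	}
	for i := 0; i+len(term) <= len(content); i++ {
		if levenshtein(content[i:i+len(term)], term) <= maxDist {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(fuzzyMatch("func maim() {}", "main", 1)) // true: "maim" is one edit away
	fmt.Println(fuzzyMatch("hello", "main", 1))          // false
}
```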

API Reference

Types
// Document represents a single file to be searched.
type Document struct {
    Path       string
    Filename   string
    Language   string
    Extension  string
    Content    []byte
    Complexity int64
}

// SearchResult holds the outcome of a search operation.
type SearchResult struct {
    Documents        []*Document   // Matching documents
    Notices          []string      // Parser warnings and transformation notices
    TermsToHighlight []string      // Positive search terms for highlighting
}

// FilterHandler is a function that checks whether a single document matches a filter.
type FilterHandler func(op string, val interface{}, doc *Document) bool
Functions
// NewSearchEngine creates a new engine with built-in filters registered.
func NewSearchEngine(docs []*Document) *SearchEngine

// Search runs a query against the engine's documents.
// caseSensitive controls whether keyword/phrase matching is case-sensitive.
func (se *SearchEngine) Search(query string, caseSensitive bool) (*SearchResult, error)

// EvaluateFile evaluates a parsed AST against a single file's content.
// Returns whether the file matches and a map of term → match locations.
func EvaluateFile(node Node, content []byte, filename string, location string, caseSensitive bool) (bool, map[string][][]int)

// ExtractTerms traverses the AST and returns terms for highlighting,
// excluding terms inside NOT subtrees.
func ExtractTerms(node Node) []string

Project Structure

.
├── ast.go              # Defines the Abstract Syntax Tree node types (And, Or, Not, Keyword, Phrase, Regex, Filter, Fuzzy)
├── document.go         # Defines Document and SearchResult structs
├── executor.go         # The main engine; executes the AST against data, contains filter handlers
├── executor_test.go    # Unit tests for executor
├── extractor.go        # Walks AST to extract positive search terms for highlighting
├── extractor_test.go   # Unit tests for extractor
├── lexer.go            # The tokenizer; turns strings into tokens
├── parser.go           # The parser; turns tokens into an AST, handles self-healing
├── planner.go          # The query optimizer; reorders AND clauses by cost
├── README.md           # This file
├── search_fuzz_test.go # Fuzz tests for the search engine
├── search_test.go      # Unit tests for all components
└── transformer.go      # Transforms the AST for semantic aliases

How Multi-Value Filtering Works

The engine supports multi-value IN-style filtering with comma syntax (e.g., lang=go,python). This feature spans three pipeline stages:

Lexer

The lexer recognizes , as a COMMA token (see lexer.go). This allows comma-separated values to be tokenized individually.

Parser

In parser.go, the parseFilterExpression function collects values in a loop. After consuming the operator, it reads values separated by COMMA tokens:

  • If only one value is found, FilterNode.Value is stored directly (e.g., "go" or 5).
  • If multiple values are found, FilterNode.Value is stored as a []interface{} slice (e.g., ["go", "python"]).
Executor

Filter handlers use a type switch on the val parameter to handle both cases:

func handleLanguageFilter(op string, val interface{}, doc *Document) bool {
    isEquality := (op == "=")

    switch v := val.(type) {
    case string: // Single value: lang=go
        lang := strings.ToLower(v)
        return (strings.ToLower(doc.Language) == lang) == isEquality
    case []interface{}: // Multiple values: lang=go,python
        valueSet := make(map[string]bool)
        for _, item := range v {
            if strItem, ok := item.(string); ok {
                valueSet[strings.ToLower(strItem)] = true
            }
        }
        _, exists := valueSet[strings.ToLower(doc.Language)]
        return exists == isEquality
    }
    return false
}

With =, the document matches if its value is any of the listed values. With !=, the document matches if its value is none of the listed values.

Documentation

Index

Constants

This section is empty.

Variables

var ErrInvalidQuery = errors.New("invalid or unparseable query")

ErrInvalidQuery is returned when a query is syntactically incorrect and cannot be healed.

Functions

func CountAllTerms added in v3.1.0

func CountAllTerms(node Node) int

CountAllTerms counts all unique leaf-node values in the AST, including terms inside NOT and filter values. This is used for query complexity validation, unlike ExtractTerms which only returns positive terms for highlighting.

func EvaluateFile

func EvaluateFile(node Node, content []byte, filename string, location string, caseSensitive bool) (bool, map[string][][]int)

EvaluateFile evaluates a parsed AST against a single file's content. It returns whether the file matches and a map of term → match locations. File/extension filters are evaluated against the filename; other filters (lang, complexity) pass through as true since metadata is not available.

func ExtractTerms

func ExtractTerms(node Node) []string

ExtractTerms is the public entry point that traverses the AST and returns a slice of strings that should be highlighted.

func PostEvalMetadataFilters

func PostEvalMetadataFilters(node Node, lang string, complexity int64) bool

PostEvalMetadataFilters checks metadata-dependent filters (lang, complexity) that could not be evaluated during per-file AST evaluation because the metadata was not yet available. Returns false if any metadata filter rejects the file.

Types

type AndNode

type AndNode struct {
	Left  Node
	Right Node
}

AndNode represents a logical AND operation between two other nodes.

func (*AndNode) String

func (n *AndNode) String() string

type Document

type Document struct {
	Path       string
	Filename   string
	Language   string
	Extension  string
	Content    []byte
	Complexity int64
}

Document represents a single file to be searched.

type FilterHandler

type FilterHandler func(op string, val interface{}, doc *Document) bool

FilterHandler is a function that checks whether a single document matches a filter.

type FilterNode

type FilterNode struct {
	Field    string
	Operator string
	Value    interface{} // Can be a string, number, etc.
}

FilterNode represents a generic metadata filter.

func (*FilterNode) String

func (n *FilterNode) String() string

type FuzzyNode

type FuzzyNode struct {
	Value    string
	Distance int
}

FuzzyNode represents a fuzzy search term with a maximum edit distance.

func (*FuzzyNode) String

func (n *FuzzyNode) String() string

type KeywordNode

type KeywordNode struct {
	Value string
}

KeywordNode represents a simple keyword search term.

func (*KeywordNode) String

func (n *KeywordNode) String() string

type Lexer

type Lexer struct {
	// contains filtered or unexported fields
}

Lexer scans the input string and produces tokens.

func NewLexer

func NewLexer(r io.Reader) *Lexer

NewLexer creates a new Lexer.

type Node

type Node interface {
	// String is used for debugging and visualizing the AST.
	String() string
}

Node is the interface that all nodes in the AST must implement.

func PlanAST

func PlanAST(node Node) Node

PlanAST optimizes the AST for execution.

type NotNode

type NotNode struct {
	Expr Node
}

NotNode represents a logical NOT operation on a single node.

func (*NotNode) String

func (n *NotNode) String() string

type OrNode

type OrNode struct {
	Left  Node
	Right Node
}

OrNode represents a logical OR operation between two other nodes.

func (*OrNode) String

func (n *OrNode) String() string

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser creates an AST from a stream of tokens.

func NewParser

func NewParser(l *Lexer) *Parser

NewParser creates a new Parser.

func (*Parser) ParseQuery

func (p *Parser) ParseQuery() (Node, []string)

ParseQuery is the entry point for parsing.

type PhraseNode

type PhraseNode struct {
	Value string
}

PhraseNode represents a quoted phrase search term.

func (*PhraseNode) String

func (n *PhraseNode) String() string

type RegexNode

type RegexNode struct {
	Pattern string
	// contains filtered or unexported fields
}

RegexNode represents a regular expression search term.

func (*RegexNode) Regexp added in v3.1.0

func (n *RegexNode) Regexp() *regexp.Regexp

Regexp returns the compiled regexp, compiling once on first call. Returns nil if the pattern is invalid.
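
The compile-once behavior can be sketched with sync.Once (lazyRegex below is an illustrative stand-in, not the package's actual type):

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// lazyRegex compiles its pattern on first use; an invalid pattern
// yields nil, matching the behavior documented for RegexNode.Regexp.
type lazyRegex struct {
	Pattern string
	once    sync.Once
	re      *regexp.Regexp
}

func (l *lazyRegex) Regexp() *regexp.Regexp {
	l.once.Do(func() {
		// Compile returns an error for invalid patterns; l.re stays nil.
		l.re, _ = regexp.Compile(l.Pattern)
	})
	return l.re
}

func main() {
	good := &lazyRegex{Pattern: "[cb]at"}
	bad := &lazyRegex{Pattern: "[cb"}
	fmt.Println(good.Regexp().MatchString("cat")) // true
	fmt.Println(bad.Regexp() == nil)              // true
}
```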

func (*RegexNode) String

func (n *RegexNode) String() string

type SearchEngine

type SearchEngine struct {
	// contains filtered or unexported fields
}

SearchEngine holds the data and configuration for searching.

func NewSearchEngine

func NewSearchEngine(docs []*Document) *SearchEngine

NewSearchEngine creates a new engine and initializes it.

func (*SearchEngine) Search

func (se *SearchEngine) Search(query string, caseSensitive bool) (*SearchResult, error)

Search is the main public method to run a query.

type SearchResult

type SearchResult struct {
	Documents        []*Document
	Notices          []string
	TermsToHighlight []string
}

SearchResult holds the outcome of a search operation. It includes the matching documents and any informational notices generated during the query parsing and transformation phases.

type Token

type Token struct {
	Type    TokenType
	Literal string
}

Token represents a single token from the query string.

type TokenType

type TokenType int

TokenType represents the type of a token.

const (
	ILLEGAL TokenType = iota
	EOF
	WS // Whitespace

	PHRASE       // "hello world"
	REGEX        // /[cb]at/
	OPERATOR     // >=, <=, =, !=
	LPAREN       // (
	RPAREN       // )
	AND          // AND, and
	OR           // OR, or
	NOT          // NOT, not
	IDENTIFIER   // complexity, author, #define
	NUMBER       // 5, 10
	STRING_ALIAS // high, low
	COMMA
	FUZZY // term~1, term~2
)

type Transformer

type Transformer struct {
	// contains filtered or unexported fields
}

Transformer walks the AST and applies modifications.

func (*Transformer) TransformAST

func (t *Transformer) TransformAST(node Node) (Node, []string)

TransformAST is the entry point for the transformation process.
