Blaze

Built for hackers, not hyperscalers.
A tiny, hackable full-text search engine you can actually fit in your head. Features inverted indexing, boolean queries, phrase search, proximity queries, and BM25 ranking, powered by a flexible query engine, roaring bitmaps, and skip lists.

[!NOTE] This focuses on keyword-based full-text search. For semantic search with embeddings, see Comet (docs).


Overview

Blaze is a search engine library for Go that provides fast full-text search through an inverted index. It's designed for applications that need to search text documents efficiently without relying on an external search engine.

[!TIP] Blaze handles keyword-based search (BM25, phrase matching, boolean queries). If you need vector embeddings or hybrid retrieval, Comet (docs) implements HNSW, IVF, and quantization-based indexes with metadata filtering. It's a hybrid vector store written from scratch in Go, purpose built for hackers, not hyperscalers.

Key Highlights:

  • Inverted Index: Maps terms to document positions for instant lookups
  • Skip Lists: Probabilistic data structure providing O(log n) operations
  • Query Builder: Type-safe, fluent API for boolean queries with roaring bitmaps
  • Advanced Search: Phrase search, BM25 ranking, proximity ranking, and boolean queries
  • BM25 Algorithm: Industry-standard relevance scoring with IDF and length normalization
  • Text Analysis: Tokenization, stemming, stopword filtering, and case normalization
  • Thread-Safe: Concurrent indexing with mutex protection
  • Serialization: Efficient binary format for persistence

Features

Search Capabilities
  • Term Search: Find documents containing specific terms
  • Phrase Search: Exact multi-word matching ("quick brown fox")
  • Boolean Queries: Type-safe AND, OR, NOT operations with query builder
  • BM25 Ranking: Industry-standard relevance scoring (used by Elasticsearch, Solr)
  • Proximity Ranking: Score results by term proximity
  • Position Tracking: Track exact word positions within documents
  • Roaring Bitmaps: Compressed bitmap operations for fast boolean queries
Text Processing
  • Tokenization: Unicode-aware text splitting
  • Stemming: Snowball (Porter2) stemmer for English
  • Stopword Filtering: Remove common words (the, a, is, etc.)
  • Case Normalization: Case-insensitive search
  • Configurable Pipeline: Customize analysis behavior
Data Structures
  • Skip Lists: O(log n) search, insert, and delete operations
  • Inverted Index: Efficient term-to-position mapping
  • Binary Serialization: Compact storage format

Not for Everyone

[!CAUTION] Blaze is an educational implementation. For production use, see Bleve - a mature, battle-tested full-text search library.

Blaze focuses on keyword-based full-text search with inverted indexes. If you need semantic search with vector embeddings, Comet (docs) implements various vector indexes (Flat, HNSW, IVF, PQ, IVFPQ) along with hybrid retrieval combining BM25 and vector similarity.

Blaze is purpose-built to be hackable: small enough to understand completely. If you've ever wondered how inverted indexes are structured, how BM25 scoring works, or how boolean queries execute, Blaze provides a readable implementation to learn from.

Installation

go get github.com/wizenheimer/blaze

Quick Start

package main

import (
    "fmt"
    "github.com/wizenheimer/blaze"
)

func main() {
    // Create a new inverted index
    idx := blaze.NewInvertedIndex()

    // Index some documents
    idx.Index(1, "The quick brown fox jumps over the lazy dog")
    idx.Index(2, "A quick brown dog runs fast")
    idx.Index(3, "The lazy cat sleeps all day")

    // Search for documents containing "quick" and "brown"
    matches := idx.RankProximity("quick brown", 10)

    // Print results
    for _, match := range matches {
        fmt.Printf("Document %d (score: %.2f)\n",
            int(match.Offsets[0].DocumentID),
            match.Score)
    }
}

Output:

Document 2 (score: 1.00)
Document 1 (score: 0.50)

Core Concepts

Inverted Index

An inverted index is like the index at the back of a book. Instead of scanning every document to find a word, the index tells you exactly where each word appears.

Example:

Given these documents:

Doc 1: "the quick brown fox"
        Pos:0    1     2     3

Doc 2: "the lazy dog"
        Pos:0   1    2

Doc 3: "quick brown dogs"
        Pos:0    1     2

The inverted index looks like:

┌─────────┬────────────────────────────────────┐
│  Token  │         Posting List               │
├─────────┼────────────────────────────────────┤
│ "quick" │ → [Doc1:Pos1] → [Doc3:Pos0]        │
│ "brown" │ → [Doc1:Pos2] → [Doc3:Pos1]        │
│ "fox"   │ → [Doc1:Pos3]                      │
│ "lazy"  │ → [Doc2:Pos1]                      │
│ "dog"   │ → [Doc2:Pos2]                      │
│ "dogs"  │ → [Doc3:Pos2]                      │
└─────────┴────────────────────────────────────┘

Visual Representation:

                    Inverted Index
                    ┌──────────┐
                    │ Map      │
                    │ [string] │
                    │ SkipList │
                    └────┬─────┘
                         │
        ┌────────────────┼────────────────┐
        │                │                │
        ▼                ▼                ▼
   "quick"          "brown"           "fox"
   SkipList         SkipList         SkipList
   ┌──────┐        ┌──────┐         ┌──────┐
   │ HEAD │        │ HEAD │         │ HEAD │
   └──┬───┘        └──┬───┘         └──┬───┘
      │               │                 │
      ▼               ▼                 ▼
   ┌──────┐        ┌──────┐         ┌──────┐
   │Doc1:1│        │Doc1:2│         │Doc1:3│
   └──┬───┘        └──┬───┘         └──────┘
      │               │
      ▼               ▼
   ┌──────┐        ┌──────┐
   │Doc3:0│        │Doc3:1│
   └──────┘        └──────┘

Benefits:

  • Instant term lookups (no document scanning)
  • Phrase search via position checking
  • Proximity ranking by measuring distances
  • Efficient boolean queries (AND, OR, NOT)
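
Because each posting list is an ordered skip list, you can walk a term's occurrences with the First/Next cursor API documented below. A minimal sketch, reusing the index from the Quick Start:

// Walk every occurrence of "quick" across all documents.
pos, err := idx.First("quick")
for err == nil && !pos.IsEnd() {
    fmt.Printf("quick -> Doc %d, Pos %d\n", int(pos.DocumentID), int(pos.Offset))
    pos, err = idx.Next("quick", pos)
}
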
Skip Lists

A skip list is a probabilistic data structure that maintains sorted data with O(log n) average time complexity for search, insertion, and deletion.

Visual Representation:

Skip List with Multiple Levels (Express Lanes)
═══════════════════════════════════════════════════════════════

Level 3: HEAD ────────────────────────────────────────────────────────────> [30] ────────> NULL
              ↓                                                                ↓
Level 2: HEAD ─────────────────────────────> [15] ────────────────────────> [30] ────────> NULL
              ↓                                ↓                               ↓
Level 1: HEAD ─────────────> [10] ─────────> [15] ────────> [20] ─────────> [30] ────────> NULL
              ↓                ↓                ↓              ↓                ↓
Level 0: HEAD ──> [5] ──> [10] ──> [15] ──> [20] ──> [25] ──> [30] ──> [35] ──> NULL
         (ALL NODES AT LEVEL 0)

         ┌───────┐
         │ Node  │  Each node has a "tower" of forward pointers
         ├───────┤
         │ Key   │  Example: Node [15]
         ├───────┤
         │ Lvl 2 │ ──> [30]      (skip far ahead)
         │ Lvl 1 │ ──> [20]      (skip a little)
         │ Lvl 0 │ ──> [20]      (next node)
         └───────┘

How Heights are Assigned (Probabilistic):

Coin Flip Algorithm:
┌─────────┬─────────────┬─────────────┐
│ Height  │ Probability │ Visual      │
├─────────┼─────────────┼─────────────┤
│    1    │    50%      │ ▓▓▓▓▓       │
│    2    │    25%      │ ▓▓▓         │
│    3    │   12.5%     │ ▓▓          │
│    4    │   6.25%     │ ▓           │
└─────────┴─────────────┴─────────────┘

For 1000 nodes, expected distribution:
Level 0: ~1000 nodes (all)    ████████████████████████████████████████
Level 1: ~500 nodes           ████████████████████
Level 2: ~250 nodes           ██████████
Level 3: ~125 nodes           █████
Level 4: ~62 nodes            ██
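
In code, the height assignment is just a coin-flip loop. A sketch of the idea (assuming math/rand; not necessarily Blaze's exact internals):

// Each extra level requires another "heads", so a tower of height h
// appears with probability (1/2)^h, matching the table above.
func randomLevel(maxLevel int) int {
    level := 1
    for rand.Float64() < 0.5 && level < maxLevel {
        level++ // heads: grow the tower one more level
    }
    return level
}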

Search Algorithm (finding 20):

Step-by-Step Search for Key = 20:

Level 3: [HEAD] ───────────────────────────────> [30]        (30 > 20, drop down)
           ↓
Level 2: [HEAD] ──────────────> [15] ─────────> [30]        (15 < 20, advance)
                                   ↓
Level 2:                         [15] ─────────> [30]        (30 > 20, drop down)
                                   ↓
Level 1:                         [15] ──> [20]               (20 = 20, FOUND!)
                                          ^^^^

Journey Recorded:
┌───────────┬─────────────────┐
│ Level 3   │ HEAD            │  Predecessor at each level
│ Level 2   │ [15]            │  Used for insertions/deletions
│ Level 1   │ [15]            │
│ Level 0   │ [15]            │
└───────────┴─────────────────┘
  1. Start at HEAD, Level 3
  2. Level 3: Move to 30? No (30 > 20), drop to Level 2
  3. Level 2: Move to 15? Yes (15 < 20), advance to 15
  4. Level 2: Move to 30? No (30 > 20), drop to Level 1
  5. Level 1: Move to 20? Yes! Found it!

Time Complexity: O(log n) on average
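
The descent is equally short in code. A sketch under an assumed node layout (Blaze's actual types may differ):

type node struct {
    key     int
    forward []*node // forward[i] points to the next node at level i
}

// search starts at the highest level, advances while the next key is
// still smaller, and drops down a level when it would overshoot.
func search(head *node, key int) *node {
    n := head
    for level := len(head.forward) - 1; level >= 0; level-- {
        for n.forward[level] != nil && n.forward[level].key < key {
            n = n.forward[level] // advance along the express lane
        }
        // next key >= key: drop down one level
    }
    if next := n.forward[0]; next != nil && next.key == key {
        return next
    }
    return nil // not present
}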

Why Skip Lists?

  • O(log n) operations without complex balancing
  • Simpler than AVL or Red-Black trees
  • Better cache locality than trees
  • Easier to make lock-free for concurrency
  • Used in Redis, LevelDB, and other databases
Text Analysis Pipeline

Blaze transforms raw text into searchable tokens through a multi-stage pipeline:

Pipeline Stages:

┌─────────────────────────────────────────────────────────────────────┐
│                     Text Analysis Pipeline                          │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
         ┌────────────────────────────────────────┐
         │  1. Tokenization                       │
         │  Split on non-alphanumeric chars       │
         └────────────────────┬───────────────────┘
                              ▼
         ┌────────────────────────────────────────┐
         │  2. Lowercasing                        │
         │  Normalize case ("Quick" → "quick")    │
         └────────────────────┬───────────────────┘
                              ▼
         ┌────────────────────────────────────────┐
         │  3. Stopword Filtering                 │
         │  Remove common words (the, a, is)      │
         └────────────────────┬───────────────────┘
                              ▼
         ┌────────────────────────────────────────┐
         │  4. Length Filtering                   │
         │  Remove tokens < 2 chars               │
         └────────────────────┬───────────────────┘
                              ▼
         ┌────────────────────────────────────────┐
         │  5. Stemming (Snowball/Porter2)        │
         │  Reduce to root ("running" → "run")    │
         └────────────────────┬───────────────────┘
                              ▼
                        Final Tokens

Example Transformation:

Input:  "The Quick Brown Fox Jumps!"
        │
        ├─ Step 1: Tokenization
        │  └─> ["The", "Quick", "Brown", "Fox", "Jumps"]
        │
        ├─ Step 2: Lowercasing
        │  └─> ["the", "quick", "brown", "fox", "jumps"]
        │
        ├─ Step 3: Stopword Filtering (remove "the")
        │  └─> ["quick", "brown", "fox", "jumps"]
        │
        ├─ Step 4: Length Filtering (all pass >= 2 chars)
        │  └─> ["quick", "brown", "fox", "jumps"]
        │
        └─ Step 5: Stemming ("jumps" → "jump")
           └─> ["quick", "brown", "fox", "jump"]

Configuration:

// Use default configuration
tokens := blaze.Analyze("The quick brown fox")

// Custom configuration
config := blaze.AnalyzerConfig{
    MinTokenLength:  3,      // Only keep tokens >= 3 chars
    EnableStemming:  false,  // Disable stemming
    EnableStopwords: true,   // Keep stopword filtering
}
tokens := blaze.AnalyzeWithConfig("The quick brown fox", config)
Search Operations

Find all occurrences of a single term:

idx := blaze.NewInvertedIndex()
idx.Index(1, "the quick brown fox")
idx.Index(2, "quick brown dogs")

// Find first occurrence of "quick"
pos, err := idx.First("quick")
if err == nil {
    fmt.Printf("Found at Doc %d, Pos %d\n",
        int(pos.DocumentID), int(pos.Offset))
}

// Find next occurrence
nextPos, _ := idx.Next("quick", pos)

Find exact sequences of words:

// Find documents containing "quick brown fox" as a phrase
matches := idx.FindAllPhrases("quick brown fox", blaze.BOFDocument)

for _, match := range matches {
    start, end := match[0], match[1]
    fmt.Printf("Found in Doc %d from Pos %d to %d\n",
        int(start.DocumentID), int(start.Offset), int(end.Offset))
}

Algorithm:

Searching for phrase: "brown fox"

Document: "the quick brown dog jumped over the brown fox"
Positions: 0     1     2    3     4      5    6     7    8

Phase 1: Find END (last word "fox")
┌──────────────────────────────────────────────────────────┐
│ Find "brown" → Doc:Pos2                                  │
│ Find "fox" after Pos2 → Doc:Pos8   ← END position        │
└──────────────────────────────────────────────────────────┘

Phase 2: Walk BACKWARDS from END to find START
┌──────────────────────────────────────────────────────────┐
│ From END (Pos8), find previous "brown" → Doc:Pos7 ← START│
└──────────────────────────────────────────────────────────┘

Phase 3: Validate
┌──────────────────────────────────────────────────────────┐
│ Start: Pos7, End: Pos8                                   │
│ Distance: 8 - 7 = 1                                      │
│ Expected: 2 words - 1 = 1 → MATCH!                       │
│                                                          │
│      "brown"  "fox"                                      │
│        ▲       ▲                                         │
│       Pos7    Pos8    (consecutive positions)            │
└──────────────────────────────────────────────────────────┘
  1. Find END: Locate the last word of the phrase
  2. Walk BACKWARDS: Find previous occurrences of earlier words
  3. Validate: Check if positions are consecutive
  4. Recurse: Continue searching for more matches

Find documents containing all terms (not necessarily consecutive):

// Find documents with both "quick" and "fox"
cover := idx.NextCover([]string{"quick", "fox"}, blaze.BOFDocument)
start, end := cover[0], cover[1]

// Calculate proximity score (the 1/(coverEnd - coverStart + 1) formula
// used by RankProximity)
distance := end.Offset - start.Offset
score := 1.0 / (distance + 1) // closer terms = higher score

Cover Algorithm:

Searching for: ["quick", "fox"] (any order, not necessarily consecutive)

Document: "the quick brown dog jumped over the lazy fox"
Positions: 0     1     2    3     4      5    6    7    8

Phase 1: Find COVER END (furthest term)
┌──────────────────────────────────────────────────────────────┐
│ Find "quick" after BOF → Doc:Pos1                            │
│ Find "fox" after BOF → Doc:Pos8   ← FURTHEST (cover end)     │
└──────────────────────────────────────────────────────────────┘

Phase 2: Find COVER START (earliest term before end)
┌──────────────────────────────────────────────────────────────┐
│ Find "quick" before Pos9 → Doc:Pos1 ← EARLIEST (cover start) │
│ Find "fox" before Pos9 → Doc:Pos8                            │
└──────────────────────────────────────────────────────────────┘

Phase 3: Validate & Return
┌──────────────────────────────────────────────────────────────┐
│ Cover: [Pos1, Pos8]                                          │
│ Same document? Yes                                           │
│ All terms present? Yes                                       │
│                                                              │
│ "quick" ... ... ... ... ... ... ... "fox"                    │
│    ▲                                   ▲                     │
│   Pos1                                Pos8                   │
│   └────────── Cover Range ──────────────┘                    │
│                                                              │
│ Proximity Score: 1 / (8 - 1 + 1) = 1/8 = 0.125               │
└──────────────────────────────────────────────────────────────┘
  1. Find FURTHEST occurrence of any term (cover end)
  2. Find EARLIEST occurrence of each term before end (cover start)
  3. Validate all terms are in the same document
  4. Return [start, end] positions
BM25 Ranking

BM25 (Best Matching 25) is a probabilistic ranking function used by search engines to estimate the relevance of documents to a given search query. It's the industry standard used by Elasticsearch, Solr, and Lucene.

// Search and rank using BM25
results := idx.RankBM25("machine learning", 10)

for _, match := range results {
    fmt.Printf("Doc %d: Score %.2f\n",
        match.DocID,
        match.Score)
}

What BM25 Considers:

+------------------+-------------------------------------------------------+
| Factor           | Description                                           |
+------------------+-------------------------------------------------------+
| Term Frequency   | How often does the term appear?                       |
|                  | More occurrences = higher relevance                   |
+------------------+-------------------------------------------------------+
| TF Saturation    | Diminishing returns                                   |
|                  | 3->10 occurrences matters less than 0->3             |
+------------------+-------------------------------------------------------+
| Document Length  | Normalize by document size                            |
|                  | Prevents long docs from dominating results            |
+------------------+-------------------------------------------------------+
| Term Rarity      | Rare terms are more important than common ones        |
|                  | "quantum" > "the" in importance                       |
+------------------+-------------------------------------------------------+

Complete BM25 Formula:

                    IDF(q_i) × TF(q_i, D) × (k1 + 1)
BM25(D, Q) = SUM  ─────────────────────────────────────────
             q_i  TF(q_i, D) + k1 × (1 - b + b × |D|/avgdl)
            in Q

Where:
    D       = Document being scored
    Q       = Query (set of terms q_1, q_2, ..., q_n)
    q_i     = Individual query term

Component Breakdown:

+-------------------+-----------------------------------------------------+
|    Component      |                   Definition                        |
+-------------------+-----------------------------------------------------+
| IDF(q_i)          | Inverse Document Frequency                          |
|                   |                                                     |
|                   |          N - df(q_i) + 0.5                          |
|                   | log( ─────────────────────── + 1 )                  |
|                   |            df(q_i) + 0.5                            |
|                   |                                                     |
|                   | N  = Total documents in corpus                      |
|                   | df = Documents containing term q_i                  |
|                   |                                                     |
|                   | Effect: Rare terms get higher weights              |
+-------------------+-----------------------------------------------------+
| TF(q_i, D)        | Term Frequency                                      |
|                   | = Number of times q_i appears in document D         |
|                   |                                                     |
|                   | Effect: More occurrences = higher relevance         |
+-------------------+-----------------------------------------------------+
| k1                | Term Frequency Saturation Parameter                 |
|                   | = 1.5 (default)                                     |
|                   | Range: [1.2, 2.0]                                   |
|                   |                                                     |
|                   | Effect: Controls diminishing returns                |
|                   |         Higher k1 = less saturation                 |
+-------------------+-----------------------------------------------------+
| b                 | Length Normalization Parameter                      |
|                   | = 0.75 (default)                                    |
|                   | Range: [0, 1]                                       |
|                   |                                                     |
|                   | Effect: Controls length penalty                     |
|                   |         b=1  = full normalization                   |
|                   |         b=0  = no normalization                     |
+-------------------+-----------------------------------------------------+
| |D|               | Document Length                                     |
|                   | = Number of terms in document D                     |
+-------------------+-----------------------------------------------------+
| avgdl             | Average Document Length                             |
|                   | = Total terms / Total documents                     |
+-------------------+-----------------------------------------------------+

Visual Example - Term Frequency Saturation:

Score Contribution (with k1=1.5, b=0.75)
    ^
    |                            /---------------  (saturation)
    |                          /
 3  |                       /
    |                     /
 2  |                  /
    |               /
 1  |            /
    |         /
 0  |______/
    +---+---+---+---+---+---+---+---+---+---+---> Term Frequency
    0   1   2   3   4   5   6   7   8   9   10

Key Insight: Going from 0->3 occurrences adds more to the score
             than going from 7->10 occurrences (diminishing returns)

Visual Example - Document Length Normalization:

Scenario: Same term frequency, different document lengths

Document A: 100 words, "learning" appears 3 times
Document B: 1000 words, "learning" appears 3 times

Raw TF:  Both have TF = 3
Density: Doc A = 3/100  = 3.0%    <- Higher density
         Doc B = 3/1000 = 0.3%    <- Lower density

BM25 adjusts: Doc A gets HIGHER score (term is more prominent)
              Doc B gets LOWER score (term is less prominent)

Length Penalty Formula:

    Penalty = k1 × (1 - b + b × docLen/avgDocLen)

    If docLen > avgDocLen: Penalty increases (score decreases)
    If docLen < avgDocLen: Penalty decreases (score increases)
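
Plugging the two documents above into this penalty (assuming, purely for illustration, avgDocLen = 550, with k1 = 1.5 and b = 0.75):

    Doc A (100 words):  Penalty = 1.5 × (1 - 0.75 + 0.75 × 100/550)  ≈ 0.58
    Doc B (1000 words): Penalty = 1.5 × (1 - 0.75 + 0.75 × 1000/550) ≈ 2.42

    Normalized TF = TF × (k1 + 1) / (TF + Penalty)

    Doc A: 3 × 2.5 / (3 + 0.58) ≈ 2.09   <- higher contribution
    Doc B: 3 × 2.5 / (3 + 2.42) ≈ 1.38   <- lower contribution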

Step-by-Step Scoring Example:

SETUP:
------
Query:  "machine learning"
Corpus: 1000 documents, average length 150 words
Target: Document 1 (200 words)
        - "machine" appears 3 times (df=100 docs have "machine")
        - "learning" appears 2 times (df=50 docs have "learning")

Parameters: k1=1.5, b=0.75


STEP 1: Calculate IDF for each term
----------------------------------------

IDF(machine):
    N = 1000, df = 100

    IDF = log((1000 - 100 + 0.5) / (100 + 0.5) + 1)
        = log(900.5 / 100.5 + 1)
        = log(8.96 + 1)
        = log(9.96)
        ≈ 2.30

IDF(learning):
    N = 1000, df = 50

    IDF = log((1000 - 50 + 0.5) / (50 + 0.5) + 1)
        = log(950.5 / 50.5 + 1)
        = log(18.82 + 1)
        = log(19.82)
        ≈ 2.99

    Note: "learning" is rarer (df=50) than "machine" (df=100)
          so it gets a higher IDF weight


STEP 2: Calculate normalized TF for "machine"
----------------------------------------------

TF = 3 (appears 3 times)
docLen = 200
avgdl = 150

Numerator   = TF × (k1 + 1)
            = 3 × (1.5 + 1)
            = 3 × 2.5
            = 7.5

Denominator = TF + k1 × (1 - b + b × (docLen / avgdl))
            = 3 + 1.5 × (1 - 0.75 + 0.75 × (200/150))
            = 3 + 1.5 × (0.25 + 0.75 × 1.333)
            = 3 + 1.5 × (0.25 + 1.0)
            = 3 + 1.5 × 1.25
            = 3 + 1.875
            = 4.875

Normalized TF = 7.5 / 4.875 ≈ 1.54

Contribution = IDF × Normalized TF
             = 2.30 × 1.54
             ≈ 3.54


STEP 3: Calculate normalized TF for "learning"
-----------------------------------------------

TF = 2 (appears 2 times)
docLen = 200
avgdl = 150

Numerator   = 2 × 2.5 = 5.0

Denominator = 2 + 1.5 × (1 - 0.75 + 0.75 × (200/150))
            = 2 + 1.875
            = 3.875

Normalized TF = 5.0 / 3.875 ≈ 1.29

Contribution = IDF × Normalized TF
             = 2.99 × 1.29
             ≈ 3.86


STEP 4: Calculate final BM25 score
-----------------------------------

BM25(Document 1, "machine learning") = 3.54 + 3.86 = 7.40

                    +----------+----------+
                    | Term     | Score    |
                    +----------+----------+
                    | machine  | 3.54     |
                    | learning | 3.86     |
                    +----------+----------+
                    | TOTAL    | 7.40     |
                    +----------+----------+
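
The whole calculation fits in a few lines of Go. This standalone sketch (independent of Blaze's internals) reproduces the numbers above:

package main

import (
    "fmt"
    "math"
)

// bm25Term scores one query term against one document.
func bm25Term(tf, df, nDocs, docLen, avgdl float64) float64 {
    k1, b := 1.5, 0.75
    idf := math.Log((nDocs-df+0.5)/(df+0.5) + 1)
    norm := tf * (k1 + 1) / (tf + k1*(1-b+b*docLen/avgdl))
    return idf * norm
}

func main() {
    // Document 1: 200 words; corpus: 1000 docs, average length 150.
    machine := bm25Term(3, 100, 1000, 200, 150)   // ≈ 3.54
    learning := bm25Term(2, 50, 1000, 200, 150)   // ≈ 3.86
    fmt.Printf("BM25 = %.2f\n", machine+learning) // ≈ 7.40
}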

Why BM25 Works:

+------------------------+------------------------------------------------+
| Advantage              | Explanation                                    |
+------------------------+------------------------------------------------+
| Industry Standard      | Used by Elasticsearch, Solr, Lucene           |
|                        | Battle-tested in production systems            |
+------------------------+------------------------------------------------+
| Probabilistic          | Based on probability ranking principle         |
|                        | Solid theoretical foundation                   |
+------------------------+------------------------------------------------+
| Term Rarity (IDF)      | Rare terms contribute more to score            |
|                        | "quantum" > "the" in importance                |
+------------------------+------------------------------------------------+
| Saturation             | Diminishing returns for repeated terms         |
|                        | 0->3 occurrences: HIGH impact                  |
|                        | 7->10 occurrences: LOW impact                  |
+------------------------+------------------------------------------------+
| Length Normalization   | Prevents long documents from dominating        |
|                        | Adjusts for document size bias                 |
+------------------------+------------------------------------------------+
| Tunable                | Adjust k1 and b for domain-specific needs     |
|                        | Customize behavior without changing algorithm  |
+------------------------+------------------------------------------------+

Comparison with Simple TF-IDF:

Simple TF-IDF:
    Score = TF × IDF
    Problem: Linear relationship with TF
             10 occurrences = 10x score of 1 occurrence

    TF-IDF Score
        ^
        |                                        /
     10 |                                      /
        |                                    /
      5 |                                  /
        |                                /
      0 |______________________________/
        +---+---+---+---+---+---+---+---+---> Term Frequency
        0   2   4   6   8   10  12  14  16

BM25:
    Score = IDF × (TF × (k1 + 1)) / (TF + k1 × length_norm)
    Benefit: Sublinear relationship with TF
             Saturation prevents spam

    BM25 Score
        ^
        |                    /----------------  (plateau)
      4 |                  /
        |                /
      2 |             /
        |          /
      0 |________/
        +---+---+---+---+---+---+---+---+---> Term Frequency
        0   2   4   6   8   10  12  14  16

    Key: BM25 saturates, preventing keyword stuffing exploits
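
A quick loop makes the difference concrete (a standalone sketch with idf = 2.0 and the length term held at 1):

idf, k1 := 2.0, 1.5
for tf := 1.0; tf <= 16; tf *= 2 {
    tfidf := tf * idf                       // grows without bound
    bm25 := idf * tf * (k1 + 1) / (tf + k1) // approaches idf × (k1+1) = 5.0
    fmt.Printf("tf=%2.0f  tf-idf=%4.0f  bm25=%4.2f\n", tf, tfidf, bm25)
}
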
Proximity Ranking

Score and rank documents by term proximity:

// Search and rank results
matches := idx.RankProximity("machine learning", 10)

for _, match := range matches {
    fmt.Printf("Doc %d: Score %.2f\n",
        int(match.Offsets[0].DocumentID),
        match.Score)
}

Scoring Formula:

For each cover in a document:
    score += 1 / (coverEnd - coverStart + 1)

┌────────────────────────────────────────────────────────────────┐
│ Proximity Scoring Examples                                     │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│ Doc 1: "machine learning is machine learning"                  │
│         Pos:0      1      2  3       4                         │
│                                                                │
│   Cover 1: [Pos 0-1]  → score += 1/(1-0+1) = 1/2 = 0.500      │
│   Cover 2: [Pos 3-4]  → score += 1/(4-3+1) = 1/2 = 0.500      │
│                         ─────────────────────────────          │
│   Total Score: 1.000                                           │
│                                                                │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│ Doc 2: "learning about machine and learning"                   │
│         Pos:0       1     2       3   4                        │
│                                                                │
│   Cover 1: [Pos 0-2]  → score += 1/(2-0+1) = 1/3 = 0.333      │
│   Cover 2: [Pos 2-4]  → score += 1/(4-2+1) = 1/3 = 0.333      │
│                         ─────────────────────────────          │
│   Total Score: 0.666                                           │
│                                                                │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│ Doc 3: "machine ... ... ... ... learning"                      │
│         Pos:0    1   2   3   4   5                             │
│                                                                │
│   Cover 1: [Pos 0-5]  → score += 1/(5-0+1) = 1/6 = 0.167      │
│                         ─────────────────────────────          │
│   Total Score: 0.167                                           │
│                                                                │
└────────────────────────────────────────────────────────────────┘

Ranking: Doc 1 (1.000) > Doc 2 (0.666) > Doc 3 (0.167)
          ▲               ▲               ▲
      Terms closest   Terms medium   Terms far apart

Why This Works:

  • Smaller distances → larger scores (inverse relationship)
  • Multiple occurrences → higher scores (additive)
  • Documents with terms close together rank higher
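
The scoring loop can be sketched with the documented NextCover primitive: pull successive covers and add the reciprocal of each cover's width. A simplified sketch that ignores per-document bookkeeping (and assumes Offset is numeric, as the int casts in the examples above suggest):

terms := []string{"machine", "learning"}
score := 0.0
cover := idx.NextCover(terms, blaze.BOFDocument)
for !cover[0].IsEnd() {
    w := float64(cover[1].Offset - cover[0].Offset + 1) // cover width
    score += 1.0 / w                                    // 1/width per cover
    cover = idx.NextCover(terms, cover[0])              // next cover after this start
}
fmt.Printf("proximity score: %.3f\n", score)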

Query Builder API

The Query Builder provides a type-safe, fluent API for constructing complex boolean queries with roaring bitmaps. No string parsing, no syntax errors - just clean, composable code.

Why Use Builder Pattern

String Parsing Approach:

// Error-prone, runtime failures
results, err := index.ExecuteQuery("(machine AND learning) OR python")
if err != nil {
    // Handle parsing errors
}

Builder Pattern Approach:

// Type-safe, compile-time checks, IDE autocomplete!
results := blaze.NewQueryBuilder(index).
    Group(func(q *blaze.QueryBuilder) {
        q.Term("machine").And().Term("learning")
    }).
    Or().
    Term("python").
    Execute()
Query Builder Quick Start
Single Term Query
// Find all documents containing "machine"
results := blaze.NewQueryBuilder(idx).
    Term("machine").
    Execute()

fmt.Printf("Found %d documents\n", results.GetCardinality())
AND Query
// Find documents with BOTH "machine" AND "learning"
results := blaze.NewQueryBuilder(idx).
    Term("machine").
    And().
    Term("learning").
    Execute()
OR Query
// Find documents with "python" OR "javascript"
results := blaze.NewQueryBuilder(idx).
    Term("python").
    Or().
    Term("javascript").
    Execute()
NOT Query
// Find documents with "python" but NOT "snake"
results := blaze.NewQueryBuilder(idx).
    Term("python").
    And().Not().
    Term("snake").
    Execute()
Complex Grouped Query
// (machine OR deep) AND learning
results := blaze.NewQueryBuilder(idx).
    Group(func(q *blaze.QueryBuilder) {
        q.Term("machine").Or().Term("deep")
    }).
    And().
    Term("learning").
    Execute()
Phrase Query
// Find exact phrase "machine learning"
results := blaze.NewQueryBuilder(idx).
    Phrase("machine learning").
    Execute()
BM25 Ranked Results
// Get top 10 results ranked by relevance
matches := blaze.NewQueryBuilder(idx).
    Term("machine").
    And().
    Term("learning").
    ExecuteWithBM25(10)

for _, match := range matches {
    fmt.Printf("Doc %d: score=%.2f\n", match.DocID, match.Score)
}
Query Builder Core Methods
NewQueryBuilder(index *InvertedIndex) *QueryBuilder

Creates a new query builder instance.

qb := blaze.NewQueryBuilder(idx)
Term(term string) *QueryBuilder

Adds a single term to the query. Uses roaring bitmaps for O(1) document lookup.

qb.Term("machine")
Phrase(phrase string) *QueryBuilder

Adds an exact phrase match. Combines bitmap efficiency with skip list position checking.

qb.Phrase("machine learning")
And() *QueryBuilder

Combines results with intersection (both must match). Uses bitmap AND operation.

qb.Term("machine").And().Term("learning")
Or() *QueryBuilder

Combines results with union (either can match). Uses bitmap OR operation.

qb.Term("cat").Or().Term("dog")
Not() *QueryBuilder

Negates the next term (exclude from results). Uses bitmap difference operation.

qb.Term("python").And().Not().Term("snake")
Group(fn func(*QueryBuilder)) *QueryBuilder

Creates a sub-query with its own scope for precedence control.

qb.Group(func(q *blaze.QueryBuilder) {
    q.Term("machine").Or().Term("deep")
}).And().Term("learning")
Execute() *roaring.Bitmap

Executes the query and returns a bitmap of matching document IDs.

results := qb.Execute()
docCount := results.GetCardinality()
ExecuteWithBM25(maxResults int) []Match

Executes the query with BM25 ranking and returns top results.

matches := qb.ExecuteWithBM25(10)  // Top 10 results
Boolean Operations

The Query Builder provides convenient shorthand functions for common boolean operations:

AllOf(index *InvertedIndex, terms ...string) *roaring.Bitmap

Shorthand for documents containing ALL terms (AND operation).

// Find documents with "machine" AND "learning" AND "python"
results := blaze.AllOf(idx, "machine", "learning", "python")

// Equivalent to:
results := blaze.NewQueryBuilder(idx).
    Term("machine").And().Term("learning").And().Term("python").
    Execute()
AnyOf(index *InvertedIndex, terms ...string) *roaring.Bitmap

Shorthand for documents containing ANY term (OR operation).

// Find documents with "cat" OR "dog" OR "bird"
results := blaze.AnyOf(idx, "cat", "dog", "bird")

// Equivalent to:
results := blaze.NewQueryBuilder(idx).
    Term("cat").Or().Term("dog").Or().Term("bird").
    Execute()
TermExcluding(index *InvertedIndex, include string, exclude string) *roaring.Bitmap

Shorthand for term with exclusion (AND NOT operation).

// Find documents with "python" but NOT "snake"
results := blaze.TermExcluding(idx, "python", "snake")

// Equivalent to:
results := blaze.NewQueryBuilder(idx).
    Term("python").And().Not().Term("snake").
    Execute()
Query Patterns
Pattern 1: Broad to Narrow

Start with a broad category, then filter down with specific criteria.

// Find programming content about Python or JavaScript, excluding beginner material
results := blaze.NewQueryBuilder(idx).
    Term("programming").
    And().
    Group(func(q *blaze.QueryBuilder) {
        q.Term("python").Or().Term("javascript")
    }).
    And().Not().
    Term("beginner").
    ExecuteWithBM25(10)
Pattern 2: Multi-Criteria Matching

Match documents that satisfy multiple independent criteria.

// Find documents about (machine learning OR deep learning) AND (python OR tensorflow)
results := blaze.NewQueryBuilder(idx).
    Group(func(q *blaze.QueryBuilder) {
        q.Phrase("machine learning").Or().Phrase("deep learning")
    }).
    And().
    Group(func(q *blaze.QueryBuilder) {
        q.Term("python").Or().Term("tensorflow")
    }).
    ExecuteWithBM25(20)
Pattern 3: Exclusion Filtering

Find relevant content while filtering out noise or unwanted categories.

// Find "apple" content but exclude fruit/food related content
results := blaze.NewQueryBuilder(idx).
    Term("apple").
    And().Not().
    Group(func(q *blaze.QueryBuilder) {
        q.Term("fruit").Or().Term("food").Or().Term("cooking")
    }).
    Execute()  // Finds "Apple Inc." not the fruit

Pattern 4: Category Filtering

Search within specific categories or tags.

func SearchWithCategory(idx *blaze.InvertedIndex, query string, categories []string) []blaze.Match {
    qb := blaze.NewQueryBuilder(idx)

    // Add main query
    qb.Term(query)

    // Add category filter if provided
    if len(categories) > 0 {
        qb.And().Group(func(q *blaze.QueryBuilder) {
            q.Term(categories[0])
            for i := 1; i < len(categories); i++ {
                q.Or().Term(categories[i])
            }
        })
    }

    return qb.ExecuteWithBM25(20)
}
Query Builder Performance

The Query Builder leverages roaring bitmaps for exceptional performance on boolean operations.

Benchmarks (Apple M2)
BenchmarkQueryBuilder_Simple-8       440,616 iterations    2,511 ns/op      896 B/op    39 allocs/op
BenchmarkQueryBuilder_Complex-8      222,024 iterations    5,333 ns/op    2,240 B/op    98 allocs/op
BenchmarkQueryBuilder_WithBM25-8     411,124 iterations    2,955 ns/op    1,416 B/op    46 allocs/op
Performance Benefits
+-------------+----------------+-------------------------------+
| Operation   | Complexity     | Why It's Fast                 |
+-------------+----------------+-------------------------------+
| AND         | O(1) per chunk | Roaring bitmap intersection   |
| OR          | O(1) per chunk | Roaring bitmap union          |
| NOT         | O(1) per chunk | Roaring bitmap difference     |
| Term Lookup | O(1)           | Direct hash map access        |
+-------------+----------------+-------------------------------+
Compression Benefits

For a term appearing in 500,000 documents:

  • Skip list positions: ~24 MB (500k nodes × 48 bytes)
  • Roaring bitmap: ~60 KB (400x compression!)
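
The github.com/RoaringBitmap/roaring library that backs the query builder makes this easy to check yourself (a standalone sketch; exact sizes vary with the docID distribution):

package main

import (
    "fmt"

    "github.com/RoaringBitmap/roaring"
)

func main() {
    // Mark 500,000 consecutive docIDs as containing a term.
    bm := roaring.New()
    for docID := uint32(0); docID < 500_000; docID++ {
        bm.Add(docID)
    }
    fmt.Printf("documents: %d\n", bm.GetCardinality())
    fmt.Printf("serialized: ~%d KB\n", bm.GetSerializedSizeInBytes()/1024)
}
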
Query Builder Best Practices
1. Use Groups for Complex Logic
// Good: Clear precedence with groups
qb.Group(func(q *blaze.QueryBuilder) {
    q.Term("a").Or().Term("b")
}).And().Term("c")

// Bad: Ambiguous without groups
qb.Term("a").Or().Term("b").And().Term("c")  // Is this (a OR b) AND c or a OR (b AND c)?
2. Leverage Convenience Functions for Simple Cases
// Good: Clean and readable
results := blaze.AllOf(idx, "python", "django", "web")

// Bad: Verbose for simple case
results := blaze.NewQueryBuilder(idx).
    Term("python").And().Term("django").And().Term("web").
    Execute()
3. Use BM25 for User-Facing Searches
// Good: Ranked results for users
matches := qb.ExecuteWithBM25(10)

// Bad: Unranked - harder for users to find relevant docs
bitmap := qb.Execute()
4. Combine Phrases and Terms Strategically
// Good: Exact phrase + related term
qb.Phrase("machine learning").And().Term("python")

// Bad: Overly restrictive
qb.Phrase("machine learning python")  // Requires exact phrase
5. Build Queries Programmatically
func BuildDynamicQuery(idx *blaze.InvertedIndex, required []string, optional []string, excluded []string) *roaring.Bitmap {
    qb := blaze.NewQueryBuilder(idx)

    // Add required terms (AND)
    if len(required) > 0 {
        qb.Term(required[0])
        for i := 1; i < len(required); i++ {
            qb.And().Term(required[i])
        }
    }

    // Add optional terms (OR)
    if len(optional) > 0 {
        if len(required) > 0 {
            qb.And()
        }
        qb.Group(func(q *blaze.QueryBuilder) {
            q.Term(optional[0])
            for i := 1; i < len(optional); i++ {
                q.Or().Term(optional[i])
            }
        })
    }

    // Exclude terms (NOT)
    for _, term := range excluded {
        qb.And().Not().Term(term)
    }

    return qb.Execute()
}
Real-World Query Builder Examples
Example 1: E-commerce Search with Filters
func SearchProducts(idx *blaze.InvertedIndex, searchTerm string, category string, excludeOutOfStock bool) []blaze.Match {
    qb := blaze.NewQueryBuilder(idx).Term(searchTerm)

    // Add category filter
    if category != "" {
        qb.And().Term(category)
    }

    // Exclude out of stock items
    if excludeOutOfStock {
        qb.And().Not().Term("outofstock")
    }

    return qb.ExecuteWithBM25(20)
}
Example 2: Multi-Category Search
func SearchInCategories(idx *blaze.InvertedIndex, query string, categories []string) []blaze.Match {
    qb := blaze.NewQueryBuilder(idx).Term(query)

    if len(categories) > 0 {
        qb.And().Group(func(q *blaze.QueryBuilder) {
            q.Term(categories[0])
            for i := 1; i < len(categories); i++ {
                q.Or().Term(categories[i])
            }
        })
    }

    return qb.ExecuteWithBM25(50)
}
Example 3: Content Filtering with Blocklist
func FilterContent(idx *blaze.InvertedIndex, searchTerm string, blocklist []string) *roaring.Bitmap {
    qb := blaze.NewQueryBuilder(idx).Term(searchTerm)

    for _, blocked := range blocklist {
        qb.And().Not().Term(blocked)
    }

    return qb.Execute()
}
Example 4: Advanced Search with Multiple Phrases
func AdvancedSearch(idx *blaze.InvertedIndex, phrases []string, requiredTerms []string) []blaze.Match {
    qb := blaze.NewQueryBuilder(idx)

    // Match any of the phrases (OR)
    qb.Group(func(q *blaze.QueryBuilder) {
        q.Phrase(phrases[0])
        for i := 1; i < len(phrases); i++ {
            q.Or().Phrase(phrases[i])
        }
    })

    // AND with required terms
    for _, term := range requiredTerms {
        qb.And().Term(term)
    }

    return qb.ExecuteWithBM25(10)
}

// Usage:
results := AdvancedSearch(idx,
    []string{"machine learning", "deep learning"},
    []string{"python", "tensorflow"})
Example 5: HTTP API Integration
func SearchHandler(w http.ResponseWriter, r *http.Request) {
    query := r.URL.Query().Get("q")
    category := r.URL.Query().Get("category")
    exclude := r.URL.Query().Get("exclude")

    qb := blaze.NewQueryBuilder(index).Term(query)

    if category != "" {
        qb.And().Term(category)
    }

    if exclude != "" {
        qb.And().Not().Term(exclude)
    }

    results := qb.ExecuteWithBM25(20)
    json.NewEncoder(w).Encode(results)
}
Example 6: Concept Search with Related Terms
func SemanticSearch(idx *blaze.InvertedIndex, concept string, relatedTerms []string) []blaze.Match {
    qb := blaze.NewQueryBuilder(idx)

    // Main concept OR any related terms
    qb.Term(concept)
    for _, related := range relatedTerms {
        qb.Or().Term(related)
    }

    return qb.ExecuteWithBM25(50)
}

// Usage:
results := SemanticSearch(idx, "automobile",
    []string{"car", "vehicle", "transportation", "automotive"})

API Reference

InvertedIndex
NewInvertedIndex
func NewInvertedIndex() *InvertedIndex

Creates a new empty inverted index.

Example:

idx := blaze.NewInvertedIndex()
Index
func (idx *InvertedIndex) Index(docID int, document string)

Adds a document to the inverted index. Thread-safe.

Parameters:

  • docID: Unique document identifier
  • document: Text content to index

Example:

idx.Index(1, "The quick brown fox jumps over the lazy dog")
idx.Index(2, "A fast brown dog")

What Happens:

  1. Text is analyzed (tokenized, stemmed, etc.)
  2. Each token is recorded with its position
  3. Positions are stored in skip lists for fast lookup
First
func (idx *InvertedIndex) First(token string) (Position, error)

Returns the first occurrence of a token in the index.

Example:

pos, err := idx.First("quick")
if err != nil {
    // Token not found
}
fmt.Printf("Doc %d, Pos %d\n", int(pos.DocumentID), int(pos.Offset))

Returns:

  • Position: Location of first occurrence
  • error: ErrNoPostingList if token doesn't exist
Last
func (idx *InvertedIndex) Last(token string) (Position, error)

Returns the last occurrence of a token in the index.

Example:

pos, err := idx.Last("quick")
Next
func (idx *InvertedIndex) Next(token string, currentPos Position) (Position, error)

Finds the next occurrence of a token after the given position.

Example:

// Iterate through all occurrences
pos := blaze.BOFDocument
var err error
for {
    pos, err = idx.Next("quick", pos)
    if pos.IsEnd() || err != nil {
        break
    }
    fmt.Printf("Found at Doc %d, Pos %d\n",
        int(pos.DocumentID), int(pos.Offset))
}
Previous
func (idx *InvertedIndex) Previous(token string, currentPos Position) (Position, error)

Finds the previous occurrence of a token before the given position.

NextPhrase
func (idx *InvertedIndex) NextPhrase(query string, startPos Position) []Position

Finds the next occurrence of a phrase (exact word sequence).

Parameters:

  • query: Space-separated phrase (e.g., "quick brown fox")
  • startPos: Position to start searching from

Returns:

  • []Position: Array with two elements [phraseStart, phraseEnd]
  • Returns [EOFDocument, EOFDocument] if no match found

Example:

matches := idx.NextPhrase("quick brown fox", blaze.BOFDocument)
if !matches[0].IsEnd() {
    fmt.Printf("Phrase found in Doc %d from Pos %d to %d\n",
        int(matches[0].DocumentID),
        int(matches[0].Offset),
        int(matches[1].Offset))
}
FindAllPhrases
func (idx *InvertedIndex) FindAllPhrases(query string, startPos Position) [][]Position

Finds all occurrences of a phrase in the entire index.

Example:

allMatches := idx.FindAllPhrases("brown fox", blaze.BOFDocument)
for _, match := range allMatches {
    fmt.Printf("Doc %d: Pos %d-%d\n",
        int(match[0].DocumentID),
        int(match[0].Offset),
        int(match[1].Offset))
}
NextCover
func (idx *InvertedIndex) NextCover(tokens []string, startPos Position) []Position

Finds the next "cover" - a range containing all given tokens.

Parameters:

  • tokens: Array of search terms
  • startPos: Position to start searching from

Returns:

  • []Position: Array with [coverStart, coverEnd]

Example:

cover := idx.NextCover([]string{"quick", "fox", "brown"}, blaze.BOFDocument)
fmt.Printf("Cover: Doc %d, Pos %d-%d\n",
    int(cover[0].DocumentID),
    int(cover[0].Offset),
    int(cover[1].Offset))
RankBM25
func (idx *InvertedIndex) RankBM25(query string, maxResults int) []Match

Performs BM25 ranking of search results. This is the recommended search function for most use cases.

Parameters:

  • query: Search query (e.g., "machine learning")
  • maxResults: Maximum number of results to return

Returns:

  • []Match: Sorted array of matches with BM25 scores

Example:

results := idx.RankBM25("machine learning", 10)
for i, match := range results {
    fmt.Printf("%d. Doc %d (score: %.2f)\n",
        i+1,
        match.DocID,
        match.Score)
}

Match Structure:

type Match struct {
    DocID   int        // Document identifier
    Offsets []Position // Where terms appear in the document
    Score   float64    // BM25 relevance score
}
RankProximity
func (idx *InvertedIndex) RankProximity(query string, maxResults int) []Match

Performs proximity-based ranking of search results. Alternative to BM25, ranks by term proximity.

Parameters:

  • query: Search query (e.g., "machine learning")
  • maxResults: Maximum number of results to return

Returns:

  • []Match: Sorted array of matches with proximity scores

Example:

results := idx.RankProximity("quick brown", 5)
for i, match := range results {
    fmt.Printf("%d. Doc %d (score: %.2f)\n",
        i+1,
        int(match.Offsets[0].DocumentID),
        match.Score)
}

BM25 vs Proximity Ranking:

+----------------------+---------------------------+------------------------------+
| Feature              | BM25                      | Proximity                    |
+----------------------+---------------------------+------------------------------+
| Term Rarity          | Yes (IDF)                 | No (all terms equal)         |
| Length Normalization | Yes (built-in)            | No                           |
| Term Frequency       | Yes (with saturation)     | No                           |
| Term Distance        | No                        | Yes (main factor)            |
| Use Case             | General search            | Finding close co-occurrences |
| Industry Standard    | Yes (Elasticsearch, Solr) | No (custom algorithm)        |
+----------------------+---------------------------+------------------------------+
Encode
func (idx *InvertedIndex) Encode() ([]byte, error)

Serializes the inverted index to binary format.

Example:

data, err := idx.Encode()
if err != nil {
    log.Fatal(err)
}

// Save to file
err = os.WriteFile("index.bin", data, 0644)
Decode
func (idx *InvertedIndex) Decode(data []byte) error

Deserializes binary data back into an inverted index.

Example:

data, err := os.ReadFile("index.bin")
if err != nil {
    log.Fatal(err)
}

idx := blaze.NewInvertedIndex()
err = idx.Decode(data)
Text Analysis
Analyze
func Analyze(text string) []string

Transforms raw text into searchable tokens using the default pipeline.

Example:

tokens := blaze.Analyze("The Quick Brown Fox Jumps!")
// Returns: ["quick", "brown", "fox", "jump"]
AnalyzeWithConfig
func AnalyzeWithConfig(text string, config AnalyzerConfig) []string

Transforms text using a custom configuration.

Example:

config := blaze.AnalyzerConfig{
    MinTokenLength:  3,
    EnableStemming:  false,
    EnableStopwords: true,
}
tokens := blaze.AnalyzeWithConfig("The quick brown fox", config)
Position
Position Methods
func (p *Position) GetDocumentID() int
func (p *Position) GetOffset() int
func (p *Position) IsBeginning() bool
func (p *Position) IsEnd() bool
func (p *Position) IsBefore(other Position) bool
func (p *Position) IsAfter(other Position) bool
func (p *Position) Equals(other Position) bool

Example:

pos1 := blaze.Position{DocumentID: 1, Offset: 5}
pos2 := blaze.Position{DocumentID: 1, Offset: 10}

if pos1.IsBefore(pos2) {
    fmt.Println("pos1 comes before pos2")
}
Skip List
NewSkipList
func NewSkipList() *SkipList

Creates a new empty skip list.

Insert
func (sl *SkipList) Insert(key Position)

Adds or updates a position in the skip list. Average O(log n).

Find
func (sl *SkipList) Find(key Position) (Position, error)

Searches for an exact position. Average O(log n).

Delete
func (sl *SkipList) Delete(key Position) bool

Removes a position from the skip list. Average O(log n).

FindLessThan
func (sl *SkipList) FindLessThan(key Position) (Position, error)

Finds the largest position less than the given position.

FindGreaterThan
func (sl *SkipList) FindGreaterThan(key Position) (Position, error)

Finds the smallest position greater than the given position.
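
A short usage sketch tying these operations together (Position literals as shown in the Position section above):

sl := blaze.NewSkipList()
sl.Insert(blaze.Position{DocumentID: 1, Offset: 5})
sl.Insert(blaze.Position{DocumentID: 1, Offset: 9})

// Exact lookup
if pos, err := sl.Find(blaze.Position{DocumentID: 1, Offset: 5}); err == nil {
    fmt.Printf("found Doc %d, Pos %d\n", int(pos.DocumentID), int(pos.Offset))
}

// Neighbor queries
next, _ := sl.FindGreaterThan(blaze.Position{DocumentID: 1, Offset: 5}) // -> {1, 9}
prev, _ := sl.FindLessThan(blaze.Position{DocumentID: 1, Offset: 9})    // -> {1, 5}
fmt.Println(int(next.Offset), int(prev.Offset))

sl.Delete(blaze.Position{DocumentID: 1, Offset: 5}) // returns true if removed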

Examples

Example 1: Basic Document Search with BM25
package main

import (
    "fmt"
    "github.com/wizenheimer/blaze"
)

func main() {
    // Create index
    idx := blaze.NewInvertedIndex()

    // Index documents
    idx.Index(1, "Go is a programming language designed at Google")
    idx.Index(2, "Python is a high-level programming language")
    idx.Index(3, "Go is fast and efficient for system programming")

    // Search for "programming language" using BM25
    results := idx.RankBM25("programming language", 10)

    fmt.Println("Search results for 'programming language':")
    for i, match := range results {
        fmt.Printf("%d. Document %d (score: %.3f)\n", i+1, match.DocID, match.Score)
    }
}

Output:

Search results for 'programming language':
1. Document 1 (score: 4.521)
2. Document 2 (score: 4.521)
3. Document 3 (score: 2.156)

Note: BM25 scores are absolute values (not normalized to 0-1), reflecting relevance based on term frequency, document length, and term rarity.

Example 2: Phrase Search
package main

import (
    "fmt"
    "github.com/wizenheimer/blaze"
)

func main() {
    idx := blaze.NewInvertedIndex()

    idx.Index(1, "the quick brown fox jumps over the lazy dog")
    idx.Index(2, "a quick brown dog runs fast")
    idx.Index(3, "the lazy brown fox sleeps")

    // Find exact phrase "brown fox"
    matches := idx.FindAllPhrases("brown fox", blaze.BOFDocument)

    fmt.Println("Documents containing 'brown fox' as a phrase:")
    for _, match := range matches {
        docID := int(match[0].DocumentID)
        start := int(match[0].Offset)
        end := int(match[1].Offset)
        fmt.Printf("Document %d: positions %d-%d\n", docID, start, end)
    }
}

Output:

Documents containing 'brown fox' as a phrase:
Document 1: positions 1-2
Document 3: positions 1-2
Example 3: Iterating Through Positions
package main

import (
    "fmt"
    "github.com/wizenheimer/blaze"
)

func main() {
    idx := blaze.NewInvertedIndex()

    idx.Index(1, "quick test quick test quick")
    idx.Index(2, "another quick test here")

    // Find all occurrences of "quick"
    fmt.Println("All occurrences of 'quick':")

    pos := blaze.BOFDocument
    for {
        next, err := idx.Next("quick", pos)
        if err != nil || next.IsEnd() {
            break
        }
        fmt.Printf("  Doc %d, Pos %d\n",
            int(next.DocumentID),
            int(next.Offset))
        pos = next // advance, so the loop moves to the next occurrence
    }
}

Output:

All occurrences of 'quick':
  Doc 1, Pos 0
  Doc 1, Pos 2
  Doc 1, Pos 4
  Doc 2, Pos 1
Example 4: Persistence with Serialization
package main

import (
    "fmt"
    "os"
    "github.com/wizenheimer/blaze"
)

func main() {
    // Build and save index
    idx := blaze.NewInvertedIndex()
    idx.Index(1, "machine learning algorithms")
    idx.Index(2, "deep learning neural networks")
    idx.Index(3, "natural language processing")

    // Serialize to binary
    data, err := idx.Encode()
    if err != nil {
        panic(err)
    }

    // Save to file
    err = os.WriteFile("search_index.bin", data, 0644)
    if err != nil {
        panic(err)
    }
    fmt.Println("Index saved to search_index.bin")

    // Load index from file
    loadedData, err := os.ReadFile("search_index.bin")
    if err != nil {
        panic(err)
    }

    loadedIdx := blaze.NewInvertedIndex()
    err = loadedIdx.Decode(loadedData)
    if err != nil {
        panic(err)
    }

    // Use loaded index
    results := loadedIdx.RankProximity("learning", 5)
    fmt.Printf("Found %d documents\n", len(results))
}
Example 5: Custom Analyzer Configuration
package main

import (
    "fmt"
    "github.com/wizenheimer/blaze"
)

func main() {
    // Create custom analyzer config (no stemming, longer min length)
    config := blaze.AnalyzerConfig{
        MinTokenLength:  3,      // Minimum 3 characters
        EnableStemming:  false,  // Keep original word forms
        EnableStopwords: true,   // Still remove stopwords
    }

    text := "The running dogs are running fast"

    // Compare default vs custom analysis
    defaultTokens := blaze.Analyze(text)
    customTokens := blaze.AnalyzeWithConfig(text, config)

    fmt.Println("Default tokens:", defaultTokens)
    fmt.Println("Custom tokens:", customTokens)
}

Output:

Default tokens: [run dog run fast]
Custom tokens: [running dogs running fast]
Example 6: Comparing BM25 and Proximity Ranking
package main

import (
    "fmt"
    "github.com/wizenheimer/blaze"
)

func main() {
    idx := blaze.NewInvertedIndex()

    // Index documents
    idx.Index(1, "machine learning algorithms")
    idx.Index(2, "machine learning machine learning")  // High term frequency
    idx.Index(3, "machine and algorithms and learning") // Terms far apart

    query := "machine learning"

    // BM25 Ranking
    fmt.Println("BM25 Rankings:")
    bm25Results := idx.RankBM25(query, 10)
    for i, match := range bm25Results {
        fmt.Printf("%d. Doc %d (score: %.3f)\n", i+1, match.DocID, match.Score)
    }

    // Proximity Ranking
    fmt.Println("\nProximity Rankings:")
    proxResults := idx.RankProximity(query, 10)
    for i, match := range proxResults {
        docID := int(match.Offsets[0].DocumentID)
        fmt.Printf("%d. Doc %d (score: %.3f)\n", i+1, docID, match.Score)
    }
}

Output:

BM25 Rankings:
1. Doc 2 (score: 5.234)  ← High term frequency
2. Doc 1 (score: 3.156)
3. Doc 3 (score: 2.891)

Proximity Rankings:
1. Doc 1 (score: 1.000)  ← Terms adjacent
2. Doc 2 (score: 1.000)
3. Doc 3 (score: 0.200)  ← Terms far apart

Key Differences:

  • BM25 favors Doc 2 (repeated terms = high relevance)
  • Proximity favors Doc 1 and Doc 2 equally (both have adjacent terms)
  • Doc 3 ranks low in both (terms spread out)
Example 7: Building a Simple Search Engine
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
    "github.com/wizenheimer/blaze"
)

func main() {
    // Create index
    idx := blaze.NewInvertedIndex()

    // Index some documents
    docs := map[int]string{
        1: "Go is an open source programming language that makes it easy to build simple, reliable, and efficient software",
        2: "Python is a programming language that lets you work quickly and integrate systems more effectively",
        3: "JavaScript is a programming language that conforms to the ECMAScript specification",
        4: "Rust is a multi-paradigm programming language focused on performance and safety",
        5: "Java is a class-based, object-oriented programming language designed for portability",
    }

    for id, doc := range docs {
        idx.Index(id, doc)
    }

    // Interactive search
    scanner := bufio.NewScanner(os.Stdin)

    for {
        fmt.Print("\nSearch query (or 'quit' to exit): ")
        if !scanner.Scan() {
            break
        }

        query := strings.TrimSpace(scanner.Text())
        if query == "quit" {
            break
        }

        if query == "" {
            continue
        }

        // Perform search using BM25
        results := idx.RankBM25(query, 5)

        if len(results) == 0 {
            fmt.Println("No results found")
            continue
        }

        // Display results
        fmt.Printf("\nFound %d result(s):\n", len(results))
        for i, match := range results {
            fmt.Printf("\n%d. Document %d (Score: %.3f)\n", i+1, match.DocID, match.Score)
            fmt.Printf("   %s\n", docs[match.DocID])
        }
    }
}

Performance Characteristics

Time Complexity
Operation             Average        Worst Case   Notes
Index (per document)  O(n × log m)   O(n × m)     n = tokens, m = total positions
Term lookup           O(log m)       O(m)         m = positions for term
Phrase search         O(k × log m)   O(k × m)     k = phrase length
BM25 ranking          O(t × d)       O(t × d)     t = query terms, d = candidates
Proximity ranking     O(t × m)       O(t × m)     t = query terms
Skip list insert      O(log n)       O(n)         n = elements in list
Skip list search      O(log n)       O(n)         Worst case is probabilistically rare
Space Complexity
Component         Space   Notes
Inverted index    O(n)    n = total unique positions
Skip list nodes   O(n)    Expected ~2 pointers per node
Analyzer          O(1)    In-place processing
Serialized index  O(n)    Compact binary format
Benchmarks

Performance on Apple M2 (8 cores), Go 1.24:

BenchmarkIndex-8                     50000    35421 ns/op    18234 B/op    245 allocs/op
BenchmarkTermSearch-8              300000     4123 ns/op      128 B/op      3 allocs/op
BenchmarkPhraseSearch-8            100000    12456 ns/op      512 B/op     12 allocs/op
BenchmarkRankBM25-8                  60000    24567 ns/op     1856 B/op     38 allocs/op
BenchmarkProximityRanking-8         50000    28934 ns/op     2048 B/op     45 allocs/op
BenchmarkCalculateIDF-8           5000000      234 ns/op       16 B/op      1 allocs/op
BenchmarkCalculateBM25Score-8     2000000      567 ns/op       64 B/op      2 allocs/op
BenchmarkSkipListInsert-8         3000000      413 ns/op      255 B/op      6 allocs/op
BenchmarkSkipListSearch-8         5000000      203 ns/op       23 B/op      1 allocs/op
BenchmarkAnalyze-8                1000000     1234 ns/op      512 B/op      8 allocs/op
BenchmarkEncode-8                   10000   156789 ns/op    65536 B/op    234 allocs/op
BenchmarkDecode-8                   15000   123456 ns/op    49152 B/op    189 allocs/op
Scalability

Index Size vs Performance:

Documents  Terms  Index Time  Search Time  Memory
1K         10K    50ms        0.5ms        2 MB
10K        100K   500ms       1ms          20 MB
100K       1M     5s          2ms          200 MB
1M         10M    50s         5ms          2 GB

Notes:

  • Search time remains relatively constant due to O(log n) operations
  • Memory scales linearly with unique positions
  • Serialization reduces storage by ~40% compared to in-memory size

Configuration

BM25 Parameters

Customize BM25 ranking behavior:

type BM25Parameters struct {
    K1 float64 // Term frequency saturation (default: 1.5)
    B  float64 // Length normalization (default: 0.75)
}

Tuning BM25:

idx := blaze.NewInvertedIndex()

// Adjust BM25 parameters before indexing
idx.BM25Params.K1 = 2.0  // Higher = less saturation (more weight to TF)
idx.BM25Params.B = 0.5   // Lower = less length penalty

// Now index and search
idx.Index(1, "document content")
results := idx.RankBM25("query", 10)

Parameter Effects:

Parameter  Range      Effect                   When to Adjust
K1         1.2 - 2.0  Controls TF saturation   Higher for domains where term frequency matters more
B          0 - 1      Controls length penalty  Lower for domains with naturally longer docs

Examples:

// Academic papers (long documents, repeated terms important)
idx.BM25Params.K1 = 2.0
idx.BM25Params.B = 0.5

// Short messages (length less important)
idx.BM25Params.K1 = 1.2
idx.BM25Params.B = 0.3

// Default (works well for most cases)
idx.BM25Params.K1 = 1.5
idx.BM25Params.B = 0.75

BM25 Statistics:

During indexing, Blaze automatically tracks:

type DocumentStats struct {
    DocID     int            // Document identifier
    Length    int            // Number of terms
    TermFreqs map[string]int // Term frequencies
}

// Corpus-level statistics
idx.TotalDocs  // Total documents indexed
idx.TotalTerms // Total terms across all documents
idx.DocStats   // Per-document statistics

These statistics are:

  • Automatically computed during indexing
  • Serialized with the index
  • Used for BM25 score calculation (see the sketch below)
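
For intuition, here is a self-contained sketch of how these statistics combine into a per-term score. It mirrors the BM25 formula described in the Query Execution Flow section later in this document; it is an illustration, not Blaze's internal code:

package main

import (
    "fmt"
    "math"
)

// bm25Term scores one term for one document (illustrative only).
// tf: occurrences of the term in the doc, df: docs containing the term,
// n: total docs, docLen: terms in the doc, avgLen: average doc length.
func bm25Term(tf, df, n, docLen int, avgLen, k1, b float64) float64 {
    idf := math.Log((float64(n) - float64(df) + 0.5) / (float64(df) + 0.5))
    norm := 1 - b + b*float64(docLen)/avgLen
    return idf * (float64(tf) * (k1 + 1)) / (float64(tf) + k1*norm)
}

func main() {
    // 1000 docs, term appears in 50 of them, 3 times in this 120-term doc.
    fmt.Printf("%.3f\n", bm25Term(3, 50, 1000, 120, 150.0, 1.5, 0.75))
}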
Analyzer Configuration

Customize the text analysis pipeline:

type AnalyzerConfig struct {
    MinTokenLength  int  // Minimum token length (default: 2)
    EnableStemming  bool // Apply stemming (default: true)
    EnableStopwords bool // Remove stopwords (default: true)
}

Configuration Examples:

// Exact matching (no stemming, keep all words)
exactConfig := blaze.AnalyzerConfig{
    MinTokenLength:  1,
    EnableStemming:  false,
    EnableStopwords: false,
}

// Fuzzy matching (aggressive stemming)
fuzzyConfig := blaze.AnalyzerConfig{
    MinTokenLength:  2,
    EnableStemming:  true,
    EnableStopwords: true,
}

// Code search (no stemming, no stopwords, longer tokens)
codeConfig := blaze.AnalyzerConfig{
    MinTokenLength:  3,
    EnableStemming:  false,
    EnableStopwords: false,
}
Tuning Recommendations

MinTokenLength:

  • 1: Very permissive, large index
  • 2: Balanced (default), filters single chars
  • 3: Strict, smaller index, misses short words

EnableStemming:

  • true: Better recall, finds related words ("run" matches "running")
  • false: Exact matching, preserves original word forms

EnableStopwords:

  • true: Smaller index, faster search, standard behavior
  • false: Complete indexing, useful for phrase search
Skip List Parameters
const MaxHeight = 32  // Maximum tower height

Tower Height Probability:

  • Height 1: 50%
  • Height 2: 25%
  • Height 3: 12.5%
  • Height 4: 6.25%

This geometric distribution ensures O(log n) average performance.
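
The heights come from repeated fair coin flips. A minimal sketch of that scheme (a hypothetical helper for illustration, not necessarily Blaze's internal code):

import "math/rand"

// randomHeight draws a tower height with P(height = h) = 2^-h,
// capped at MaxHeight. Each additional level requires winning
// another fair coin flip.
func randomHeight() int {
    h := 1
    for h < MaxHeight && rand.Intn(2) == 0 {
        h++
    }
    return h
}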

Use Cases

[!NOTE] Blaze handles keyword-based full-text search (BM25, phrases, boolean queries).
Comet (docs) handles semantic search with embeddings (vector similarity, hybrid retrieval).
Both are educational implementations optimized for readability.

1. Document Search Systems

Build a search engine for documents:

type Document struct {
    ID      int
    Title   string
    Content string
}

func IndexDocuments(docs []Document) *blaze.InvertedIndex {
    idx := blaze.NewInvertedIndex()

    for _, doc := range docs {
        // Combine title and content
        text := doc.Title + " " + doc.Content
        idx.Index(doc.ID, text)
    }

    return idx
}

func SearchDocuments(idx *blaze.InvertedIndex, query string) []int {
    // Use BM25 for general relevance ranking (recommended)
    matches := idx.RankBM25(query, 20)

    docIDs := make([]int, len(matches))
    for i, match := range matches {
        docIDs[i] = match.DocID
    }

    return docIDs
}

// Alternative: Use proximity ranking to find documents with close term matches
func SearchDocumentsByProximity(idx *blaze.InvertedIndex, query string) []int {
    matches := idx.RankProximity(query, 20)

    docIDs := make([]int, len(matches))
    for i, match := range matches {
        docIDs[i] = int(match.Offsets[0].DocumentID)
    }

    return docIDs
}
2. Log Analysis

Search through log files:

func IndexLogs(logFile string) (*blaze.InvertedIndex, error) {
    idx := blaze.NewInvertedIndex()

    file, err := os.Open(logFile)
    if err != nil {
        return nil, err
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    lineNumber := 1

    for scanner.Scan() {
        idx.Index(lineNumber, scanner.Text())
        lineNumber++
    }

    return idx, scanner.Err()
}

// Find all ERROR log lines using BM25 (considers frequency and rarity)
errorLogs := idx.RankBM25("ERROR", 100)

// Alternative: Use proximity for finding error patterns
// e.g., "connection timeout" appearing close together
patternMatches := idx.RankProximity("connection timeout", 50)

3. Code Search

Search through source code:

func IndexCodebase(rootDir string) (*blaze.InvertedIndex, error) {
    idx := blaze.NewInvertedIndex()
    fileID := 1

    // Custom config for code (no stemming, keep all words)
    config := blaze.AnalyzerConfig{
        MinTokenLength:  2,
        EnableStemming:  false,
        EnableStopwords: false,
    }

    err := filepath.Walk(rootDir, func(path string, info os.FileInfo, err error) error {
        if err != nil || info.IsDir() {
            return err
        }

        // Only index Go files
        if !strings.HasSuffix(path, ".go") {
            return nil
        }

        content, err := os.ReadFile(path)
        if err != nil {
            return err
        }

        // Use custom analyzer
        tokens := blaze.AnalyzeWithConfig(string(content), config)
        // ... index tokens ...

        fileID++
        return nil
    })

    return idx, err
}

// BM25: Find files with frequent mentions of a function/variable
bm25Results := idx.RankBM25("http.Handler", 20)

// Proximity: Find exact API patterns (e.g., function calls with parameters)
// Better for finding "http.HandleFunc" as a specific pattern
proximityResults := idx.RankProximity("http HandleFunc", 20)

4. E-Commerce Product Search

Search a product catalog:

type Product struct {
    ID          int
    Name        string
    Description string
    Category    string
    Tags        []string
}

func IndexProducts(products []Product) *blaze.InvertedIndex {
    idx := blaze.NewInvertedIndex()

    for _, product := range products {
        // Combine all searchable fields
        searchText := fmt.Sprintf("%s %s %s %s",
            product.Name,
            product.Description,
            product.Category,
            strings.Join(product.Tags, " "))

        idx.Index(product.ID, searchText)
    }

    return idx
}

// BM25: Best for general product search (considers all factors)
results := idx.RankBM25("wireless headphones", 10)

// Proximity: Good for finding exact product name matches
// (e.g., "Sony WH-1000XM4" as an exact phrase proximity)
exactMatches := idx.RankProximity("wireless headphones", 10)

5. Email Search

Index and search email messages:

type Email struct {
    ID      int
    From    string
    Subject string
    Body    string
}

func IndexEmails(emails []Email) *blaze.InvertedIndex {
    idx := blaze.NewInvertedIndex()

    for _, email := range emails {
        searchText := fmt.Sprintf("%s %s %s",
            email.From,
            email.Subject,
            email.Body)

        idx.Index(email.ID, searchText)
    }

    return idx
}

// BM25: Find emails where terms appear frequently (general search)
matches := idx.RankBM25("project deadline", 50)

// Proximity: Find emails where "project" and "deadline" appear close together
// (more precise, better for finding specific mentions)
closeMatches := idx.RankProximity("project deadline", 50)

Testing

Running Tests
# Run all tests
make test

# Run tests with coverage
make test-coverage

# Run benchmarks
make bench

# Run all checks (format, vet, lint, test)
make check
Test Coverage

The library has comprehensive test coverage:

$ make test-coverage
Running tests...
ok      github.com/wizenheimer/blaze    2.456s  coverage: 98.5% of statements
Generating coverage report...
Coverage report: coverage.html

Coverage by Component:

  • Inverted Index: 100%
  • Skip Lists: 100%
  • Text Analysis: 100%
  • Search Operations: 100%
  • BM25 Ranking: 100%
  • Serialization: 100%
Writing Tests

Example test:

func TestSearchFunctionality(t *testing.T) {
    idx := blaze.NewInvertedIndex()

    // Index test documents
    idx.Index(1, "the quick brown fox")
    idx.Index(2, "the lazy brown dog")

    // Test phrase search
    matches := idx.FindAllPhrases("brown fox", blaze.BOFDocument)

    if len(matches) != 1 {
        t.Errorf("Expected 1 match, got %d", len(matches))
    }

    if int(matches[0][0].DocumentID) != 1 {
        t.Errorf("Expected document 1, got %d", int(matches[0][0].DocumentID))
    }
}

Architecture

Component Overview
blaze/
├── index.go          # Inverted index implementation with hybrid storage
├── query.go          # Query builder with roaring bitmaps
├── skiplist.go       # Skip list data structure for positions
├── search.go         # Search algorithms (phrase, proximity, BM25)
├── analyzer.go       # Text analysis pipeline
├── serialization.go  # Binary encoding/decoding (skip lists + bitmaps)
├── *_test.go         # Comprehensive test suite
├── Makefile          # Development commands
└── public/           # Documentation website
    └── index.html
Query Processor Architecture

The query processor uses a hybrid storage approach combining roaring bitmaps for document-level operations and skip lists for position-level operations.

┌───────────────────────────────────────────────────────────────────────────┐
│                      QUERY PROCESSOR ARCHITECTURE                         │
└───────────────────────────────────────────────────────────────────────────┘

                              User Query
                          "machine AND learning"
                                  │
                                  ▼
                    ┌─────────────────────────────┐
                    │    Text Analyzer            │
                    │  (tokenize, stem, etc.)     │
                    └──────────────┬──────────────┘
                                   │
                    ["machine", "learning"]
                                   │
                                   ▼
                    ┌─────────────────────────────┐
                    │     Query Builder           │
                    │  (constructs query tree)    │
                    └──────────────┬──────────────┘
                                   │
                    Query Tree: AND(machine, learning)
                                   │
            ┌──────────────────────┼──────────────────────┐
            │                      │                      │
            ▼                      ▼                      ▼
    ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
    │  Bitmap Ops   │     │  Skip List    │     │  BM25 Scorer  │
    │  (fast AND/OR)│     │  (positions)  │     │  (ranking)    │
    └───────┬───────┘     └───────┬───────┘     └───────┬───────┘
            │                     │                     │
            └─────────────────────┼─────────────────────┘
                                  │
                                  ▼
                          ┌───────────────┐
                          │    Results    │
                          │  (ranked docs)│
                          └───────────────┘
Hybrid Storage Architecture

Blaze uses a sophisticated hybrid storage model:

┌───────────────────────────────────────────────────────────────────────────┐
│                        HYBRID STORAGE MODEL                               │
└───────────────────────────────────────────────────────────────────────────┘

For each term "machine":

┌───────────────────────────────────────────────────────────────────────────┐
│  DOCUMENT LEVEL (Roaring Bitmap)                                          │
│  ─────────────────────────────────────────────────────────────────────── │
│                                                                           │
│  DocBitmaps["machine"] = {1, 2, 4, 5, 100, 500, 1000, ...}                │
│                                                                           │
│  Compressed representation of ALL documents containing "machine"          │
│  Use: Fast boolean operations (AND, OR, NOT)                              │
│  Size: ~60 KB for 500k documents (400x compression!)                      │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘
                                  │
                                  │ Links to
                                  ▼
┌───────────────────────────────────────────────────────────────────────────┐
│  POSITION LEVEL (Skip List)                                               │
│  ─────────────────────────────────────────────────────────────────────── │
│                                                                           │
│  PostingsList["machine"] = SkipList:                                      │
│                                                                           │
│    Level 2: [Doc1:Pos5] ────────────────────────> [Doc100:Pos12]          │
│                 │                                       │                 │
│    Level 1: [Doc1:Pos5] ──> [Doc2:Pos3] ───────────> [Doc100:Pos12]       │
│                 │              │                         │                │
│    Level 0: [Doc1:Pos5] -> [Doc2:Pos3] -> [Doc4:Pos1] -> [Doc5:Pos7]      │
│             -> [Doc100:Pos12] -> [Doc500:Pos2] -> ...                     │
│                                                                           │
│  Detailed position information for EVERY occurrence                       │
│  Use: Phrase search, proximity ranking, snippets                          │
│  Size: ~24 MB for 500k positions                                          │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘

WHY HYBRID?
───────────
1. Bitmaps: Lightning-fast document filtering (AND, OR, NOT in microseconds)
2. Skip Lists: Precise position tracking for phrases and proximity
3. Best of both worlds: Speed + Precision
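
A hedged sketch of the two layers working together, assuming idx is an already-populated *blaze.InvertedIndex: the bitmap layer narrows candidates via the package-level query helpers documented below, and the skip-list layer confirms exact phrase positions.

// Fast document filtering via bitmaps (AND of two terms).
candidates := blaze.AllOf(idx, "machine", "learning")
fmt.Printf("%d candidate docs\n", candidates.GetCardinality())

// Precise position verification via skip lists.
matches := idx.FindAllPhrases("machine learning", blaze.BOFDocument)
fmt.Printf("%d exact phrase matches\n", len(matches))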
Query Execution Flow

Here's how a complex query executes step-by-step:

QUERY: (machine OR deep) AND learning AND NOT neural

┌───────────────────────────────────────────────────────────────────────────┐
│  STEP 1: BITMAP PHASE (Fast Document Filtering)                           │
└───────────────────────────────────────────────────────────────────────────┘

Term Lookups (O(1) hash map):
    DocBitmaps["machine"] = {1, 2, 4, 5, 7, 8, 9, 10}
    DocBitmaps["deep"]    = {2, 3, 5, 6, 8, 9}
    DocBitmaps["learning"]= {1, 2, 4, 5, 6, 7, 8, 9, 10}
    DocBitmaps["neural"]  = {3, 6, 8, 9}

Boolean Operations (O(1) per chunk):
    Step 1: machine OR deep
            {1, 2, 4, 5, 7, 8, 9, 10} ∪ {2, 3, 5, 6, 8, 9}
          = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

    Step 2: (machine OR deep) AND learning
            {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} ∩ {1, 2, 4, 5, 6, 7, 8, 9, 10}
          = {1, 2, 4, 5, 6, 7, 8, 9, 10}

    Step 3: Result AND NOT neural
            {1, 2, 4, 5, 6, 7, 8, 9, 10} \ {3, 6, 8, 9}
          = {1, 2, 4, 5, 7, 10}  ← CANDIDATE DOCUMENTS

    Time: ~10 microseconds for 1M documents!

┌───────────────────────────────────────────────────────────────────────────┐
│  STEP 2: POSITION PHASE (Optional - for phrases/proximity)                │
└───────────────────────────────────────────────────────────────────────────┘

IF phrase search needed:
    For each candidate doc {1, 2, 4, 5, 7, 10}:
        Use skip lists to verify exact positions
        Check consecutive positions for phrases
        Extract position data for snippets

    Time: O(log n) per position lookup

┌───────────────────────────────────────────────────────────────────────────┐
│  STEP 3: RANKING PHASE (BM25 Scoring)                                     │
└───────────────────────────────────────────────────────────────────────────┘

For each candidate document:
    1. Calculate IDF (Inverse Document Frequency):
       - Uses bitmap cardinality for instant document counts
       - IDF("machine") = log((N - df + 0.5) / (df + 0.5))
       - df = DocBitmaps["machine"].GetCardinality()

    2. Calculate TF (Term Frequency):
       - Retrieves from pre-computed DocStats
       - TF("machine", Doc1) = termFreqs["machine"]

    3. Apply BM25 formula:
       - Combines IDF, TF, and length normalization
       - Score = IDF × (TF × (k1 + 1)) / (TF + k1 × length_norm)

    4. Sum scores for all query terms

Results sorted by score:
    Doc 5: 8.45
    Doc 2: 7.23
    Doc 1: 6.91
    ...

    Time: O(candidates × terms)
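
Expressed with the fluent query builder shown in the next section, the query above might look like the sketch below. The Group callback's parameter type is an assumption here, and the Not() chaining follows the TermExcluding example in the API reference:

result := blaze.NewQueryBuilder(idx).
    Group(func(q *blaze.QueryBuilder) {
        q.Term("machine").Or().Term("deep")
    }).
    And().Term("learning").
    And().Not().Term("neural").
    Execute() // final candidate bitmap: {1, 2, 4, 5, 7, 10}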
Data Structure Memory Layout
┌───────────────────────────────────────────────────────────────────────────┐
│                    INVERTED INDEX STRUCTURE                               │
└───────────────────────────────────────────────────────────────────────────┘

InvertedIndex {
    ┌──────────────────────────────────────────────────────────────────┐
    │  DocBitmaps: map[string]*roaring.Bitmap                          │
    │  ────────────────────────────────────────────────────────────────│
    │  "machine"  → [Compressed Bitmap: 512 bytes]                     │
    │  "learning" → [Compressed Bitmap: 448 bytes]                     │
    │  "deep"     → [Compressed Bitmap: 256 bytes]                     │
    │  ...                                                             │
    │                                                                  │
    │  Memory: ~100 bytes per term (compressed)                        │
    └──────────────────────────────────────────────────────────────────┘
                              │
                              │ Parallel Storage
                              ▼
    ┌──────────────────────────────────────────────────────────────────┐
    │  PostingsList: map[string]SkipList                               │
    │  ────────────────────────────────────────────────────────────────│
    │  "machine"  → SkipList with 10,000 position nodes                │
    │  "learning" → SkipList with 8,000 position nodes                 │
    │  "deep"     → SkipList with 5,000 position nodes                 │
    │  ...                                                             │
    │                                                                  │
    │  Memory: ~48 bytes per position (node overhead)                  │
    └──────────────────────────────────────────────────────────────────┘
                              │
                              │ Statistics
                              ▼
    ┌──────────────────────────────────────────────────────────────────┐
    │  DocStats: map[int]DocumentStats                                 │
    │  ────────────────────────────────────────────────────────────────│
    │  Doc1 → {Length: 150, TermFreqs: {"machine": 3, ...}}            │
    │  Doc2 → {Length: 200, TermFreqs: {"learning": 5, ...}}           │
    │  ...                                                             │
    │                                                                  │
    │  Memory: ~16 bytes per term per document                         │
    └──────────────────────────────────────────────────────────────────┘
                              │
                              │ Metadata
                              ▼
    ┌──────────────────────────────────────────────────────────────────┐
    │  Global Statistics                                               │
    │  ────────────────────────────────────────────────────────────────│
    │  TotalDocs:   1,000,000                                          │
    │  TotalTerms:  150,000,000                                        │
    │  AvgDocLen:   150.0                                              │
    │  BM25Params:  {K1: 1.5, B: 0.75}                                 │
    └──────────────────────────────────────────────────────────────────┘

    Mutex for thread safety (sync.RWMutex)
}

MEMORY BREAKDOWN (for 1M documents, 10M unique positions):
────────────────────────────────────────────────────────────
DocBitmaps:     ~10 MB  (compressed bitmaps)
PostingsList:   ~480 MB (skip list nodes)
DocStats:       ~500 MB (per-doc statistics)
Overhead:       ~10 MB  (maps, pointers, etc.)
────────────────────────────────────────────────────────────
TOTAL:          ~1 GB
Roaring Bitmap Internals
┌───────────────────────────────────────────────────────────────────────────┐
│                    ROARING BITMAP STRUCTURE                               │
└───────────────────────────────────────────────────────────────────────────┘

Document IDs: {1, 2, 3, 100, 101, 102, 500000, 500001, 999999}

Traditional Bitmap (naive):
    [1,1,1,0,0...0,1,1,1,0...0,1,1,0...0,1]
    Size: 1,000,000 bits = 125 KB (wasteful for sparse data)

Roaring Bitmap (smart):

    Split into 65,536 chunks (high 16 bits = chunk ID):

    Chunk 0 (docs 0-65535):      [1,2,3,100,101,102]
    Chunk 7 (docs 458752-524287): [500000, 500001]
    Chunk 15 (docs 983040-1048575): [999999]

    Storage per chunk (adaptive):
    ┌─────────────────────────────────────────────────────┐
    │ If cardinality < 4096:                              │
    │   → Use Array Container                             │
    │   → Store sorted uint16 values directly             │
    │   → Size: 2 bytes × cardinality                     │
    │                                                     │
    │ If cardinality > 4096:                              │
    │   → Use Bitmap Container                            │
    │   → Store 65536-bit bitmap (8 KB)                   │
    │   → Size: 8 KB fixed                                │
    │                                                     │
    │ If cardinality = 65536 (all docs):                  │
    │   → Use Run Container                               │
    │   → Store: [0-65535]                                │
    │   → Size: 4 bytes                                   │
    └─────────────────────────────────────────────────────┘

    Total Size: ~60 bytes (vs 125 KB!)

    Operations:

    AND: Container-by-container intersection
         Skip non-matching chunks (O(1))
         Intersect matching chunks (O(min(n,m)))

    OR:  Container-by-container union
         Merge all chunks (O(n+m))

    NOT: Complement within document space
         Flip all bits in each chunk
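
Blaze's document level rides on the github.com/RoaringBitmap/roaring package. A standalone sketch of the operations described above, using that library directly:

package main

import (
    "fmt"

    "github.com/RoaringBitmap/roaring"
)

func main() {
    machine := roaring.BitmapOf(1, 2, 4, 5, 100)
    deep := roaring.BitmapOf(2, 3, 5, 6)

    union := roaring.Or(machine, deep)  // machine OR deep
    inter := roaring.And(machine, deep) // machine AND deep

    fmt.Println(union.ToArray()) // [1 2 3 4 5 6 100]
    fmt.Println(inter.ToArray()) // [2 5]
}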
Query Builder Execution Model
┌───────────────────────────────────────────────────────────────────────────┐
│                   QUERY BUILDER EXECUTION MODEL                           │
└───────────────────────────────────────────────────────────────────────────┘

Query: NewQueryBuilder(idx).
         Group(func(q) { q.Term("machine").Or().Term("deep") }).
         And().
         Term("learning").
         Execute()

INTERNAL REPRESENTATION:
────────────────────────

QueryBuilder {
    stack: []*roaring.Bitmap       // Operand stack
    ops:   []QueryOp               // Operator stack
    terms: []string                // Track for BM25
}

EXECUTION TRACE:
────────────────

Step 1: Group(func(q) { ... })
    ┌───────────────────────────────────────┐
    │ Create sub-builder                    │
    │ Execute sub-query                     │
    │ Push result bitmap to parent stack    │
    └───────────────────────────────────────┘

    Sub-query execution:
      1.1: Term("machine")
           → Lookup: DocBitmaps["machine"]
           → Push: {1,2,4,5,7,8,9,10}

      1.2: Or()
           → Push operator: OR

      1.3: Term("deep")
           → Lookup: DocBitmaps["deep"]
           → Push: {2,3,5,6,8,9}

      1.4: Apply OR
           → Pop: {2,3,5,6,8,9}
           → Pop: {1,2,4,5,7,8,9,10}
           → Union: {1,2,3,4,5,6,7,8,9,10}
           → Push result

    Result: {1,2,3,4,5,6,7,8,9,10}

Step 2: And()
    → Push operator: AND

Step 3: Term("learning")
    → Lookup: DocBitmaps["learning"]
    → Push: {1,2,4,5,6,7,8,9,10}

Step 4: Execute()
    → Pop: {1,2,4,5,6,7,8,9,10}
    → Pop: {1,2,3,4,5,6,7,8,9,10}
    → Intersect: {1,2,4,5,6,7,8,9,10}
    → Return final bitmap

OPERATION COSTS:
────────────────
Bitmap Lookup:     O(1)          ~100 ns
Bitmap Union:      O(n+m)        ~1 µs for 10k docs
Bitmap Intersect:  O(min(n,m))   ~800 ns for 10k docs
Bitmap Difference: O(n)          ~900 ns for 10k docs

Total Query Time: ~10 µs for typical query!
Data Flow
┌───────────────────────────────────────────────────────────────────────┐
│                         Complete Data Flow                            │
└───────────────────────────────────────────────────────────────────────┘

                              User Input
                       "The Quick Brown Fox!"
                                │
                                ▼
            ┌───────────────────────────────────────────┐
            │      Text Analysis Pipeline               │
            │  ┌─────────────────────────────────────┐  │
            │  │ 1. Tokenization                     │  │
            │  │    ["The", "Quick", "Brown", "Fox"] │  │
            │  └────────────┬────────────────────────┘  │
            │               ▼                           │
            │  ┌─────────────────────────────────────┐  │
            │  │ 2. Lowercasing                      │  │
            │  │    ["the", "quick", "brown", "fox"] │  │
            │  └────────────┬────────────────────────┘  │
            │               ▼                           │
            │  ┌─────────────────────────────────────┐  │
            │  │ 3. Stopword Filtering               │  │
            │  │    ["quick", "brown", "fox"]        │  │
            │  └────────────┬────────────────────────┘  │
            │               ▼                           │
            │  ┌─────────────────────────────────────┐  │
            │  │ 4. Length Filtering                 │  │
            │  │    ["quick", "brown", "fox"]        │  │
            │  └────────────┬────────────────────────┘  │
            │               ▼                           │
            │  ┌─────────────────────────────────────┐  │
            │  │ 5. Stemming                         │  │
            │  │    ["quick", "brown", "fox"]        │  │
            │  └────────────┬────────────────────────┘  │
            └───────────────┼───────────────────────────┘
                            ▼
                    ["quick", "brown", "fox"]
                            │
                            ▼
            ┌───────────────────────────────────────────┐
            │       Inverted Index (Indexing)           │
            │                                           │
            │  ┌─────────┬────────────────────────┐     │
            │  │ "quick" │ → SkipList             │     │
            │  │         │    └─> [Doc1:Pos0]     │     │
            │  ├─────────┼────────────────────────┤     │
            │  │ "brown" │ → SkipList             │     │
            │  │         │    └─> [Doc1:Pos1]     │     │
            │  ├─────────┼────────────────────────┤     │
            │  │ "fox"   │ → SkipList             │     │
            │  │         │    └─> [Doc1:Pos2]     │     │
            │  └─────────┴────────────────────────┘     │
            └─────────────┬─────────────────────────────┘
                          │
          ┌───────────────┴───────────────────┐
          │        Search Operations          │
          ▼                                   ▼
    ┌──────────┐                      ┌────────────┐
    │  Term    │                      │  Phrase    │
    │  Search  │                      │  Search    │
    └────┬─────┘                      └─────┬──────┘
         │                                  │
         └──────────┬───────────────────────┘
                    ▼
            ┌───────────────┐
            │   Proximity   │
            │   Ranking     │
            └───────┬───────┘
                    │
                    ▼
            ┌───────────────────────┐
            │  Ranked Results       │
            │  ┌─────────────────┐  │
            │  │ Doc 1: Score 1.0│  │
            │  │ Doc 2: Score 0.5│  │
            │  │ Doc 3: Score 0.3│  │
            │  └─────────────────┘  │
            └───────────────────────┘
Key Design Decisions

1. Skip Lists over Balanced Trees

Rationale:

  • Simpler implementation (no rotation logic)
  • Better cache locality
  • Easier to make concurrent
  • Comparable performance (O(log n))
  • Used in production systems (Redis, LevelDB)

2. Position-Based Indexing

Instead of just tracking document IDs, Blaze tracks exact word positions:

Traditional Index (Document IDs only):
┌─────────┬──────────────────┐
│ "quick" │ [Doc1, Doc3]     │  Cannot do phrase search
└─────────┴──────────────────┘  Cannot rank by proximity

Position-Based Index (Document + Offset):
┌─────────┬────────────────────────────────────┐
│ "quick" │ [Doc1:Pos1, Doc3:Pos0]             │  Enables phrase search
│ "brown" │ [Doc1:Pos2, Doc3:Pos2]             │  Enables proximity ranking
│ "fox"   │ [Doc1:Pos3]                        │  Enables snippet generation
└─────────┴────────────────────────────────────┘  Enables precise results

Can verify: "quick brown" is a phrase in Doc1 (Pos1→Pos2)
            but NOT in Doc3 (Pos0 and Pos2 are not consecutive)

Benefits:

  • Enables phrase search (check consecutive positions; see the sketch below)
  • Enables proximity ranking (measure distances)
  • Enables snippet generation (extract relevant parts)
  • More precise search results

Trade-offs:

  • Larger index size (~2-3x more data)
  • More complex algorithms (but still O(log n))
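
A tiny sketch of the adjacency check this enables, using the Position API from the reference above:

quick := blaze.Position{DocumentID: 1, Offset: 1}
brown := blaze.Position{DocumentID: 1, Offset: 2}

// "quick brown" is a phrase when "brown" sits exactly one offset
// after "quick" within the same document.
isPhrase := quick.GetDocumentID() == brown.GetDocumentID() &&
    brown.GetOffset() == quick.GetOffset()+1
fmt.Println(isPhrase) // true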

3. Binary Serialization

Custom binary format instead of JSON:

Advantages:

  • 60% smaller file size
  • 3x faster parsing
  • Preserves skip list structure
  • Suitable for large indexes

4. Configurable Text Analysis

Pluggable analyzer configuration:

Benefits:

  • Adapt to different use cases
  • Trade-off precision vs recall
  • Support multiple languages (future)
  • Domain-specific customization

Best Practices

1. Choose Appropriate Document IDs

Use stable, unique identifiers:

// Good: Use database primary keys
idx.Index(dbRecord.ID, dbRecord.Content)

// Bad: Use array indices (changes when reordering)
for i, doc := range docs {
    idx.Index(i, doc.Content)  // Don't do this
}
2. Batch Indexing for Large Datasets
func IndexLargeDataset(docs []Document) *blaze.InvertedIndex {
    idx := blaze.NewInvertedIndex()

    // Process in batches
    batchSize := 1000
    for i := 0; i < len(docs); i += batchSize {
        end := min(i+batchSize, len(docs))
        batch := docs[i:end]

        for _, doc := range batch {
            idx.Index(doc.ID, doc.Content)
        }

        // Optional: periodic serialization for checkpoints
        if i%10000 == 0 {
            data, _ := idx.Encode()
            os.WriteFile(fmt.Sprintf("checkpoint_%d.bin", i), data, 0644)
        }
    }

    return idx
}
3. Use Appropriate Analyzer Config

Match configuration to your use case:

// Natural language text (books, articles)
naturalLanguageConfig := blaze.AnalyzerConfig{
    MinTokenLength:  2,
    EnableStemming:  true,   // Find related words
    EnableStopwords: true,   // Remove common words
}

// Technical documentation (code, APIs)
technicalConfig := blaze.AnalyzerConfig{
    MinTokenLength:  2,
    EnableStemming:  false,  // Keep exact terms
    EnableStopwords: false,  // Keep all words
}

// Product names (e-commerce)
productConfig := blaze.AnalyzerConfig{
    MinTokenLength:  1,      // Include single chars (e.g., "X")
    EnableStemming:  false,  // Exact product names
    EnableStopwords: false,  // Keep all words
}
4. Persist Index for Large Datasets

Don't rebuild the index every time:

const indexFile = "search_index.bin"

func LoadOrBuildIndex(docs []Document) (*blaze.InvertedIndex, error) {
    // Try to load existing index
    if data, err := os.ReadFile(indexFile); err == nil {
        idx := blaze.NewInvertedIndex()
        if err := idx.Decode(data); err == nil {
            return idx, nil
        }
    }

    // Build new index
    idx := blaze.NewInvertedIndex()
    for _, doc := range docs {
        idx.Index(doc.ID, doc.Content)
    }

    // Save for next time
    if data, err := idx.Encode(); err == nil {
        os.WriteFile(indexFile, data, 0644)
    }

    return idx, nil
}
5. Handle Concurrent Access

The Index method is thread-safe, but for read-heavy workloads:

type SearchService struct {
    idx *blaze.InvertedIndex
    mu  sync.RWMutex
}

func (s *SearchService) Index(docID int, content string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.idx.Index(docID, content)
}

func (s *SearchService) Search(query string) []blaze.Match {
    s.mu.RLock()
    defer s.mu.RUnlock()
    return s.idx.RankProximity(query, 20)
}
6. Monitor Index Size

Track index growth:

func IndexStats(idx *blaze.InvertedIndex) map[string]interface{} {
    stats := make(map[string]interface{})

    stats["unique_terms"] = len(idx.PostingsList)

    totalPositions := 0
    for _, skipList := range idx.PostingsList {
        // Count positions in this skip list
        iter := skipList.Iterator()
        for iter.HasNext() {
            iter.Next()
            totalPositions++
        }
    }

    stats["total_positions"] = totalPositions
    stats["avg_positions_per_term"] = float64(totalPositions) / float64(len(idx.PostingsList))

    return stats
}
7. Choose the Right Ranking Algorithm

Use BM25 when:

  • You need industry-standard relevance ranking
  • Term frequency matters (documents with more occurrences rank higher)
  • You want automatic length normalization
  • Rare terms should be weighted more heavily
  • Recommended for most use cases

Use Proximity when:

  • You want to find terms close together
  • Term distance is more important than frequency
  • You're searching for specific co-occurrences
  • You need snippet generation (using position data)

Practical Examples:

// E-commerce: General product search
// BM25 considers term frequency and rarity
bm25Results := idx.RankBM25("wireless bluetooth headphones", 20)
// Returns products with any/all terms, ranked by relevance

// E-commerce: Exact product name
// Proximity finds terms appearing together
proxResults := idx.RankProximity("Sony WH-1000XM4", 20)
// Returns products where terms appear close together

// Document search: Research papers
// BM25 for broad topic search
papers := idx.RankBM25("neural networks deep learning", 50)

// Document search: Finding specific phrase mentions
// Proximity for finding "neural networks" as a concept
mentions := idx.RankProximity("neural networks", 50)

// Best practice: Use both for different purposes!
generalResults := idx.RankBM25(query, 100)    // Cast wide net
preciseResults := idx.RankProximity(query, 20) // Refine results
8. Limit Result Set Size

Always specify a reasonable max results:

// Good: Limit results
results := idx.RankBM25("search query", 100)

// Bad: Could return millions of results
results := idx.RankBM25("search query", math.MaxInt32)
9. Pre-process Queries

Normalize queries before searching:

func NormalizeQuery(query string) string {
    // Remove extra whitespace
    query = strings.TrimSpace(query)
    query = strings.Join(strings.Fields(query), " ")

    // Convert to lowercase
    query = strings.ToLower(query)

    // Remove special characters (optional)
    query = regexp.MustCompile(`[^\w\s]`).ReplaceAllString(query, "")

    return query
}

// Use normalized query
normalizedQuery := NormalizeQuery(userInput)
results := idx.RankBM25(normalizedQuery, 20)
10. Monitor BM25 Statistics

Track corpus statistics for insights:

// After indexing
fmt.Printf("Total documents: %d\n", idx.TotalDocs)
fmt.Printf("Total terms: %d\n", idx.TotalTerms)
fmt.Printf("Average doc length: %.2f\n",
    float64(idx.TotalTerms) / float64(idx.TotalDocs))

// Per-document analysis
for docID, stats := range idx.DocStats {
    fmt.Printf("Doc %d: %d terms\n", docID, stats.Length)

    // Find most frequent terms
    for term, freq := range stats.TermFreqs {
        if freq > 5 {
            fmt.Printf("  %s: %d occurrences\n", term, freq)
        }
    }
}

Contributing

Contributions are welcome! Please follow these guidelines:

Development Setup
# Clone repository
git clone https://github.com/wizenheimer/blaze.git
cd blaze

# Install dependencies
make deps

# Run tests
make test

# Run linter
make lint
Code Style
  • Follow Go conventions (gofmt, golint)
  • Write comprehensive comments
  • Include examples in documentation
  • Add tests for new features
  • Keep functions focused and small
Commit Messages

Use descriptive commit messages:

Good:
- "feat: Add proximity ranking algorithm"
- "feat: Handle empty query in RankProximity"
- "fix: Reduce allocations in skip list insert"

Bad:
- "Update code"
- "Fix bug uwu"
- "WIP"
Pull Request Process
  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests
  5. Run make check to verify
  6. Commit your changes
  7. Push to your fork
  8. Open a Pull Request

Comet

A companion project implementing vector search indexes in Go. While Blaze handles keyword search via inverted indexes, Comet implements semantic search using vector embeddings:

  • Vector Indexes: Flat, HNSW (graph-based), IVF (clustering), PQ (quantization), IVFPQ (hybrid)
  • Hybrid Search: Combines vector similarity, BM25 text search, and metadata filtering
  • Fusion Strategies: Reciprocal Rank Fusion, autocut, threshold-based results, multi-KNN
  • Metadata Filtering: Roaring Bitmaps and Bit-Sliced Indexes for fast pre-filtering
  • Advanced Features: Quantization, reranking, soft deletes, index rebuilds, multi-query operations

Use Case Comparison:

Blaze                    Comet
Full-text search         Vector + Hybrid search
BM25 ranking             BM25 + semantic similarity
Keyword matching         Semantic understanding
Inverted index           Multiple vector indexes
Best for text documents  Best for embeddings + text

A vector DB you can read, not just run - perfect for learning search internals from the inside out.

License

MIT License

Acknowledgments

  • Skip Lists: Original paper by William Pugh (1990)
  • Snowball Stemmer: Martin Porter's stemming algorithm
  • Inspiration: Elasticsearch, Lucene, Mettis, Redis, LevelDB

Documentation

Index

Constants

const MaxHeight = 32 // Maximum tower height (supports billions of elements)

Variables

var (
	ErrNoPostingList = errors.New("no posting list exists for token")
	ErrNoNextElement = errors.New("no next element found")
	ErrNoPrevElement = errors.New("no previous element found")
)

═══════════════════════════════════════════════════════════════════════════════
ERROR DEFINITIONS
═══════════════════════════════════════════════════════════════════════════════
We define errors as package-level variables so they can be compared with ==.
This is a Go best practice for error handling.

var (
	EOF = math.MaxInt // End Of File: maximum integer value (larger than any real position)
	BOF = math.MinInt // Beginning Of File: minimum integer value (smaller than any real position)
)

═══════════════════════════════════════════════════════════════════════════════
SENTINEL VALUES
═══════════════════════════════════════════════════════════════════════════════
We use MaxInt and MinInt as boundary markers.

WHY USE MAX/MIN INT?
  • Makes comparisons cleaner (no special cases for "empty")
  • Always guarantees: BOF < any_position < EOF
  • Simplifies edge cases in search algorithms

Example: Searching from the "beginning"

Without sentinels: Need to check "is this the first call?"
With sentinels: Just use BOF as the starting position!
var (
	ErrKeyNotFound    = errors.New("key not found")
	ErrNoElementFound = errors.New("no element found")
)

var (
	BOFDocument = Position{DocumentID: BOF, Offset: BOF} // Before all documents
	EOFDocument = Position{DocumentID: EOF, Offset: EOF} // After all documents
)

Sentinel positions for convenience

Functions ΒΆ

func AllOf ΒΆ

func AllOf(index *InvertedIndex, terms ...string) *roaring.Bitmap

AllOf finds documents containing ALL of the given terms (AND)

EXAMPLE: --------

results := AllOf(index, "machine", "learning", "python")
// Same as: Term("machine").And().Term("learning").And().Term("python")

func Analyze ΒΆ

func Analyze(text string) []string

Analyze transforms raw text into searchable tokens using the default pipeline

This is the main entry point for text analysis. It applies all filters in sequence:

  1. Tokenization
  2. Lowercasing
  3. Stopword filtering
  4. Length filtering
  5. Stemming

Example:

tokens := Analyze("The quick brown fox jumps over the lazy dog")
// Returns: ["quick", "brown", "fox", "jump", "lazi", "dog"]

func AnalyzeWithConfig ΒΆ

func AnalyzeWithConfig(text string, config AnalyzerConfig) []string

AnalyzeWithConfig transforms text using a custom configuration

This allows fine-grained control over the analysis pipeline.

Example:

config := AnalyzerConfig{MinTokenLength: 3, EnableStemming: false}
tokens := AnalyzeWithConfig("The quick brown fox", config)

func AnyOf ΒΆ

func AnyOf(index *InvertedIndex, terms ...string) *roaring.Bitmap

AnyOf finds documents containing ANY of the given terms (OR)

EXAMPLE: --------

results := AnyOf(index, "cat", "dog", "bird")
// Same as: Term("cat").Or().Term("dog").Or().Term("bird")

func TermExcluding ΒΆ

func TermExcluding(index *InvertedIndex, include, exclude string) *roaring.Bitmap

TermExcluding finds documents with a term but excluding another

EXAMPLE: --------

results := TermExcluding(index, "python", "snake")
// Same as: Term("python").And().Not().Term("snake")

Types ΒΆ

type AnalyzerConfig ΒΆ

type AnalyzerConfig struct {
	MinTokenLength  int  // Minimum token length to keep (default: 2)
	EnableStemming  bool // Whether to apply stemming (default: true)
	EnableStopwords bool // Whether to remove stopwords (default: true)
}

AnalyzerConfig holds configuration options for text analysis

This allows customization of the analysis pipeline without modifying code. Future enhancements could add language support, custom stopwords, etc.

func DefaultConfig ΒΆ

func DefaultConfig() AnalyzerConfig

DefaultConfig returns the standard analyzer configuration
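
For instance, start from the defaults and flip one knob (a minimal sketch; the exact output depends on the default stopword list):

config := DefaultConfig()
config.EnableStemming = false // keep whole words like "running"

tokens := AnalyzeWithConfig("Running quickly through fields", config)
// Likely ["running", "quickly", "fields"] with the default stopword list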

type BM25Parameters ΒΆ

type BM25Parameters struct {
	K1 float64 // Term frequency saturation (typical: 1.2-2.0)
	B  float64 // Length normalization (typical: 0.75)
}

BM25Parameters holds the tuning parameters for BM25 algorithm

func DefaultBM25Parameters ΒΆ

func DefaultBM25Parameters() BM25Parameters

DefaultBM25Parameters returns the standard BM25 parameters
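
Tuning is a matter of overwriting the exported field on the index (a minimal sketch using the BM25Params field of InvertedIndex, documented below):

idx := NewInvertedIndex()
idx.BM25Params = BM25Parameters{
	K1: 1.5, // let term frequency saturate later than the default
	B:  0.5, // normalize document length less aggressively
}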

type DocumentStats ΒΆ

type DocumentStats struct {
	DocID     int            // Document identifier
	Length    int            // Number of terms in the document
	TermFreqs map[string]int // How many times each term appears
}

DocumentStats stores statistics about a single document

type InvertedIndex ΒΆ

type InvertedIndex struct {

	// DOCUMENT-LEVEL STORAGE (for fast document lookups and boolean queries)
	// DOCUMENT-LEVEL STORAGE (for fast document lookups and boolean queries)
	DocBitmaps map[string]*roaring.Bitmap // Term → Bitmap of document IDs

	// POSITION-LEVEL STORAGE (for phrase search, proximity)
	PostingsList map[string]SkipList // Term → Positions

	// ===============================
	// BM25 INDEXING DATA STRUCTURES
	// ===============================
	DocStats   map[int]DocumentStats // DocID → statistics
	TotalDocs  int                   // Total number of indexed documents
	TotalTerms int64                 // Total number of terms across all docs
	BM25Params BM25Parameters        // BM25 tuning parameters
	// contains filtered or unexported fields
}

CORE DATA STRUCTURE: InvertedIndex with HYBRID STORAGE

The InvertedIndex uses a hybrid approach for maximum efficiency:

Architecture:

InvertedIndex
├── DocBitmaps: map[string]*roaring.Bitmap  (DOCUMENT-LEVEL)
│   ├── "quick" → Bitmap of document IDs [1, 3, 5, ...]
│   ├── "brown" → Bitmap of document IDs [1, 2, 7, ...]
│   └── "fox"   → Bitmap of document IDs [3, 5, ...]
├── PostingsList: map[string]SkipList       (POSITION-LEVEL)
│   ├── "quick" → SkipList of exact positions
│   ├── "brown" → SkipList of exact positions
│   └── "fox"   → SkipList of exact positions
└── mu: mutex for thread safety

Why Hybrid Storage?

  • Roaring Bitmaps: Lightning-fast for document-level operations (AND, OR, NOT), with 10-100x memory compression and fast bitwise boolean operations
  • Skip Lists: Essential for position-based queries (phrases, proximity)

This gives us the best of both worlds!
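
A minimal sketch of both layers in action, using Index, AllOf, and FindAllPhrases from this package:

idx := NewInvertedIndex()
idx.Index(1, "The quick brown fox")
idx.Index(2, "A quick red fox")

// Document level: roaring bitmaps answer "which documents?" instantly
docs := AllOf(idx, "quick", "fox") // bitmap containing docs 1 and 2

// Position level: skip lists answer "where exactly?" for phrase queries
hits := idx.FindAllPhrases("quick brown fox", BOFDocument) // only doc 1 matches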

func NewInvertedIndex ΒΆ

func NewInvertedIndex() *InvertedIndex

NewInvertedIndex creates a new empty inverted index with hybrid storage and BM25 support

func (*InvertedIndex) Decode ΒΆ

func (idx *InvertedIndex) Decode(data []byte) error

Decode deserializes binary data back into an inverted index

PROCESS: --------

  1. Create a decoder to track our position in the byte array
  2. Repeatedly decode terms until we reach the end
  3. Reconstruct the PostingsList map

EXAMPLE: --------

Input:  [5]['quick'][16][1,1,3,0][4][2][2][0]...
Output: PostingsList["quick"] = SkipList{...}

Decode restores the hybrid storage (bitmaps and postings) along with the BM25 statistics.

func (*InvertedIndex) Encode ΒΆ

func (idx *InvertedIndex) Encode() ([]byte, error)

Encode serializes the inverted index to binary format

COMPLETE EXAMPLE: ----------------- Index contains:

"quick" β†’ SkipList with nodes at [Doc1:Pos1, Doc3:Pos0]
"brown" β†’ SkipList with nodes at [Doc1:Pos2]

Encoded format:

[5]['q','u','i','c','k']  ← Term name
[16][1,1,3,0]              ← Node positions (2 positions × 8 bytes each)
[4][2][2][0]               ← Tower structure (node1β†’node2, node2β†’nil)
[5]['b','r','o','w','n']  ← Next term
[8][1,2]                   ← Node position
[2][0]                     ← Tower structure (only one node, no next)

The encoder object keeps track of our position in the output buffer. The serialized form includes the hybrid storage (bitmaps and postings) together with the BM25 statistics.

BINARY FORMAT: -------------- [Header]

  • TotalDocs: uint32
  • TotalTerms: uint64
  • BM25.K1: float64
  • BM25.B: float64
  • NumDocStats: uint32

[Document Statistics] (for each document)

  • DocID: uint32
  • Length: uint32
  • NumTerms: uint32
  • For each term:
      • TermLength: uint32
      • Term: bytes
      • Frequency: uint32

[Roaring Bitmaps] (NEW - for fast document lookups)

  • NumBitmaps: uint32
  • For each term:
      • TermLength: uint32
      • Term: bytes
      • BitmapLength: uint32
      • Bitmap: bytes (roaring's native serialization)

[Posting Lists] (position data for phrase search)

  • For each term...
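
Persistence is a simple round trip, since Encode and Decode are symmetric (a minimal sketch; log.Fatal stands in for real error handling):

data, err := idx.Encode()
if err != nil {
	log.Fatal(err)
}

restored := NewInvertedIndex()
if err := restored.Decode(data); err != nil {
	log.Fatal(err)
}
// restored now holds the same bitmaps, postings, and BM25 statistics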

func (*InvertedIndex) FindAllPhrases ΒΆ

func (idx *InvertedIndex) FindAllPhrases(query string, startPos Position) [][]Position

FindAllPhrases finds ALL occurrences of a phrase in the entire index

ALGORITHM: ---------- This is just a loop that repeatedly calls NextPhrase until we reach EOF.

Example: Finding all occurrences of "brown fox"

Iteration 1:
  - Search from BOF → Found at Doc2:Pos[3-4]
  - Add to results
  - Continue from Doc2:Pos3

Iteration 2:
  - Search from Doc2:Pos3 → Found at Doc5:Pos[1-2]
  - Add to results
  - Continue from Doc5:Pos1

Iteration 3:
  - Search from Doc5:Pos1 → Returns EOF
  - Stop searching

Result: [[Doc2:Pos3-4], [Doc5:Pos1-2]]

func (*InvertedIndex) First ΒΆ

func (idx *InvertedIndex) First(token string) (Position, error)

First returns the first occurrence of a token in the index

EXAMPLE: --------

Given: "quick" appears at [Doc1:Pos1, Doc3:Pos0, Doc5:Pos2]
First("quick") returns Doc1:Pos1 (the earliest occurrence, since positions are ordered by document, then offset)

Use case: Start searching for a token from the beginning

func (*InvertedIndex) Index ΒΆ

func (idx *InvertedIndex) Index(docID int, document string)

Index adds a document to the inverted index

STEP-BY-STEP EXAMPLE: ---------------------- Input: docID=1, document="The quick brown fox"

Step 1: Tokenization

analyzer.Analyze() converts to: ["quick", "brown", "fox"]
(Note: "The" is removed as a stop word, and words are lowercased)

Step 2: For each token, record its position

Token "quick" at position 0 in document 1
Token "brown" at position 1 in document 1
Token "fox"   at position 2 in document 1

Step 3: Update the index

PostingsList["quick"] ← add Position{DocID:1, Offset:0}
PostingsList["brown"] ← add Position{DocID:1, Offset:1}
PostingsList["fox"]   ← add Position{DocID:1, Offset:2}

Why record positions and not just document IDs?

  • Positions let us do phrase search ("brown fox" requires consecutive positions)
  • Positions let us rank by proximity (closer words = more relevant)

Thread Safety Note:

  • We lock the entire indexing operation to prevent race conditions
  • If we didn't lock, two goroutines could corrupt the data structure

BM25 INDEXING: Index also enriches the index with BM25 statistics.

WHAT'S DIFFERENT WITH BM25: In addition to building the inverted index, we now track:

  1. Document length (number of terms)
  2. Term frequencies per document (how many times each term appears)
  3. Total number of documents (for IDF calculation)
  4. Total number of terms (for average document length)

This metadata enables BM25 scoring later during search.

func (*InvertedIndex) Last ΒΆ

func (idx *InvertedIndex) Last(token string) (Position, error)

Last returns the last occurrence of a token in the index

EXAMPLE: --------

Given: "quick" appears at [Doc1:Pos1, Doc3:Pos0, Doc5:Pos2]
Last("quick") returns Doc5:Pos2 (the latest occurrence)

Use case: Search backwards from the end

func (*InvertedIndex) Next ΒΆ

func (idx *InvertedIndex) Next(token string, currentPos Position) (Position, error)

Next finds the next occurrence of a token after the given position

EXAMPLE: --------

Given: "brown" appears at [Doc1:Pos2, Doc3:Pos1, Doc3:Pos5, Doc5:Pos0]

Next("brown", Doc3:Pos1) returns Doc3:Pos5
Next("brown", Doc3:Pos5) returns Doc5:Pos0
Next("brown", Doc5:Pos0) returns EOF (no more occurrences)

Special cases:

  • If currentPos is BOF (beginning of file), return First
  • If currentPos is already EOF (end of file), stay at EOF

Use case: Iterate through all occurrences of a word
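
Combined with First, this gives a simple scan over every occurrence (a minimal sketch relying on the EOF sentinel described above):

pos, err := idx.First("brown")
for err == nil && !pos.IsEnd() {
	fmt.Printf("doc %d, offset %d\n", pos.GetDocumentID(), pos.GetOffset())
	pos, err = idx.Next("brown", pos)
}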

func (*InvertedIndex) NextCover ΒΆ

func (idx *InvertedIndex) NextCover(tokens []string, startPos Position) []Position

NextCover finds the next "cover" - a range containing all given tokens

ALGORITHM WALKTHROUGH: ---------------------- Query: ["quick", "fox", "brown"] StartPos: Beginning of file

PHASE 1: Find the cover END (furthest position)

  • Find "quick" after startPos β†’ maybe Doc2:Pos1
  • Find "fox" after startPos β†’ maybe Doc2:Pos8 ← furthest
  • Find "brown" after startPos β†’ maybe Doc2:Pos2
  • coverEnd = Doc2:Pos8

PHASE 2: Find the cover START (earliest position before end)

  • From Doc2:Pos9, find previous "quick" → Doc2:Pos1 ← earliest
  • From Doc2:Pos9, find previous "fox" → Doc2:Pos8
  • From Doc2:Pos9, find previous "brown" → Doc2:Pos2
  • coverStart = Doc2:Pos1

PHASE 3: Validate the cover

  • Same document? Yes (all in Doc2) ✓
  • Return [Doc2:Pos1, Doc2:Pos8]

If not same document, recurse from coverStart to find the next cover.

Why this algorithm?

  • Greedy approach: We find the furthest occurrence first
  • Efficient: Uses index jumps instead of scanning
  • Minimal covers: Always finds the smallest valid range
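
A minimal sketch of one call (assuming, per the walkthrough above, that the returned slice holds the cover's start and end):

cover := idx.NextCover([]string{"quick", "fox", "brown"}, BOFDocument)
// cover[0] = start of the smallest range containing all three terms
// cover[1] = end of that range (both positions in the same document)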

func (*InvertedIndex) NextPhrase ΒΆ

func (idx *InvertedIndex) NextPhrase(query string, startPos Position) []Position

NextPhrase finds the next occurrence of a phrase (sequence of words) in the index

ALGORITHM WALKTHROUGH: ---------------------- Query: "quick brown fox" StartPos: Beginning of file

Step 1: Find the END of a potential phrase

  • Find "quick" after startPos β†’ maybe Doc2:Pos3
  • Find "brown" after that β†’ maybe Doc2:Pos4
  • Find "fox" after that β†’ maybe Doc2:Pos5
  • endPos = Doc2:Pos5

Step 2: Walk BACKWARDS to find the START

  • From endPos, find previous "brown" → Doc2:Pos4
  • From there, find previous "quick" → Doc2:Pos3
  • phraseStart = Doc2:Pos3

Step 3: Validate it's a real phrase

  • Same document? Yes (both Doc2)
  • Consecutive positions? Yes (3, 4, 5)
  • Distance = 5 - 3 = 2 (which equals 3 words - 1) ✓

Step 4: If not valid, recurse from phraseStart

  • This handles cases where words appear multiple times

Why this algorithm?

  • It's efficient: We use the index to jump between occurrences
  • It handles multiple occurrences: Recursion keeps searching
  • It validates correctness: We check for consecutive positions
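
A minimal sketch of one call (assuming a miss is signaled with EOF positions, consistent with the FindAllPhrases loop above):

match := idx.NextPhrase("quick brown fox", BOFDocument)
if len(match) > 0 && !match[0].IsEnd() {
	fmt.Printf("phrase found in doc %d at offset %d\n",
		match[0].GetDocumentID(), match[0].GetOffset())
}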

func (*InvertedIndex) Previous ΒΆ

func (idx *InvertedIndex) Previous(token string, currentPos Position) (Position, error)

Previous finds the previous occurrence of a token before the given position

EXAMPLE: --------

Given: "brown" appears at [Doc1:Pos2, Doc3:Pos1, Doc3:Pos5, Doc5:Pos0]

Previous("brown", Doc5:Pos0) returns Doc3:Pos5
Previous("brown", Doc3:Pos5) returns Doc3:Pos1
Previous("brown", Doc1:Pos2) returns BOF (no earlier occurrences)

Use case: Search backwards through occurrences

func (*InvertedIndex) RankBM25 ΒΆ

func (idx *InvertedIndex) RankBM25(query string, maxResults int) []Match

RankBM25 performs BM25 ranking of search results

ALGORITHM: ----------

  1. Tokenize query
  2. Find all documents containing at least one query term
  3. Calculate BM25 score for each document
  4. Sort by score (descending)
  5. Return top K results

EXAMPLE: -------- Query: "machine learning algorithms"

Step 1: Tokenize → ["machine", "learning", "algorithms"]

Step 2: Find candidate documents:

"machine" appears in: [Doc1, Doc3, Doc5, Doc7]
"learning" appears in: [Doc1, Doc2, Doc5]
"algorithms" appears in: [Doc2, Doc5, Doc8]
Candidates: [Doc1, Doc2, Doc3, Doc5, Doc7, Doc8]

Step 3: Calculate BM25 scores:

Doc1: 12.5 (has "machine" and "learning")
Doc2: 8.3 (has "learning" and "algorithms")
Doc3: 3.2 (only has "machine")
Doc5: 15.7 (has all three terms!)
Doc7: 2.1 (only has "machine")
Doc8: 4.5 (only has "algorithms")

Step 4: Sort: [Doc5, Doc1, Doc2, Doc8, Doc3, Doc7]

Step 5: Return top 3: [Doc5, Doc1, Doc2]
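
In code, usage is a single call (a minimal sketch):

matches := idx.RankBM25("machine learning algorithms", 3)
for _, m := range matches {
	fmt.Printf("doc %d scored %.2f\n", m.DocID, m.Score)
}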

func (*InvertedIndex) RankProximity ΒΆ

func (idx *InvertedIndex) RankProximity(query string, maxResults int) []Match

RankProximity performs proximity-based ranking of search results

THIS IS THE MAIN SEARCH FUNCTION!

COMPLETE EXAMPLE: ----------------- Query: "machine learning" MaxResults: 10

Step 1: Tokenize query

"machine learning" β†’ ["machine", "learning"]

Step 2: Find all covers (ranges containing both words)

Doc1: Cover[0-1], Cover[5-6]    → score = 0.5 + 0.5 = 1.0
Doc2: Cover[0-5]                → score = 0.167
Doc3: Cover[2-3], Cover[10-11]  → score = 0.5 + 0.5 = 1.0
Doc4: Cover[1-1]                → Wait, both words at same position? Impossible!
                                   (This means one word appears twice)

Step 3: Return top 10 results

Result: [Doc1, Doc3, Doc2] (sorted by score, limited to 10)

ALGORITHM WALKTHROUGH: ---------------------- We iterate through ALL covers in the index, accumulating scores per document.

Iteration 1: Find first cover → Doc1:Pos[0-1]

  • New document Doc1 detected
  • Calculate score: 1/(1-0+1) = 0.5
  • Current document: Doc1, current score: 0.5

Iteration 2: Find next cover → Doc1:Pos[5-6]

  • Still in Doc1 (not a new document)
  • Add to score: 0.5 + 1/(6-5+1) = 1.0
  • Current document: Doc1, current score: 1.0

Iteration 3: Find next cover → Doc2:Pos[0-5]

  • New document Doc2 detected!
  • Save previous: Match{Doc1, score=1.0}
  • Start new: Doc2, score = 1/(5-0+1) = 0.167

... continue until EOF ...

Final step: Return top K results
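
Usage mirrors RankBM25 (a minimal sketch):

matches := idx.RankProximity("machine learning", 10)
for _, m := range matches {
	fmt.Printf("doc %d (score %.3f), cover %v\n", m.DocID, m.Score, m.Offsets)
}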

type Iterator ΒΆ

type Iterator struct {
	// contains filtered or unexported fields
}

Iterator provides sequential access to skip list elements

IMPLEMENTATION NOTE: -------------------- We only traverse level 0 (the bottom level) which contains all elements in sorted order. Higher levels are just shortcuts for searching.

func (*Iterator) HasNext ΒΆ

func (it *Iterator) HasNext() bool

HasNext checks if there are more elements to iterate

LOGIC: ------ There are more elements if:

  • current is not nil (we haven't fallen off the end), AND
  • current.Tower[0] is not nil (there's a next element)

Example states:

  • HasNext() == true:  current -> [next] -> ...
  • HasNext() == false: current -> nil (at the last element)

func (*Iterator) Next ΒΆ

func (it *Iterator) Next() Position

Next advances to and returns the next position

PROCESS: --------

  1. Move to the next node
  2. If we've reached the end, return EOF
  3. Otherwise, return the current position

IMPORTANT: ---------- Always check HasNext() before calling Next() to avoid returning EOF unexpectedly!

EXAMPLE USAGE: -------------- iter := skipList.Iterator()

for iter.HasNext() {
    pos := iter.Next()
    fmt.Printf("Doc %d, Pos %d\n", pos.GetDocumentID(), pos.GetOffset())
}

type Match ΒΆ

type Match struct {
	DocID   int        // Document identifier
	Offsets []Position // Where the match was found [start, end]
	Score   float64    // How relevant is this match?
}

Match represents a search result with its positions and relevance score

STRUCTURE: ----------

Offsets: The [start, end] positions of a cover in a document
Score:   The relevance score (higher = more relevant)

Example Match:

Offsets: [Doc3:Pos1, Doc3:Pos5]  ← This document matches from Pos1 to Pos5
Score: 2.7                         ← Relevance score

func (*Match) GetKey ΒΆ

func (m *Match) GetKey() (string, error)

GetKey generates a unique identifier for the match

type Node ΒΆ

type Node struct {
	Key   Position         // The position stored in this node
	Tower [MaxHeight]*Node // Array of forward pointers (one per level)
}

NODE: A Skip List Node

Each node stores:

  1. A Key (Position): The data we're storing
  2. A Tower: Array of pointers to next nodes at each level

TOWER VISUALIZATION: -------------------- For a node with height 3:

Tower[2] -----> (points to a node far ahead)
Tower[1] -----> (points to a node ahead)
Tower[0] -----> (points to the very next node)

The higher the level, the further ahead we skip!

type Position ΒΆ

type Position struct {
	DocumentID int // Which document?
	Offset     int // Which word in the document? (0-indexed)
}

POSITION: A Location in a Document

Position identifies a specific word in a specific document.

EXAMPLE: --------

Document 5: "The quick brown fox jumps"
Position{DocumentID: 5, Offset: 2} refers to "brown"

WHY USE INT?

  • Represents actual integer document IDs and offsets
  • Supports sentinel values (BOF = MinInt, EOF = MaxInt)
  • No casting needed - simpler and more efficient

ORDERING: --------- Positions are ordered first by DocumentID, then by Offset:

Doc1:Pos5 < Doc1:Pos10 < Doc2:Pos0 < Doc2:Pos3


func (*Position) Equals ΒΆ

func (p *Position) Equals(other Position) bool

Equals checks if two positions are identical

Example:

Doc1:Pos5 == Doc1:Pos5 → true
Doc1:Pos5 == Doc1:Pos6 → false

func (*Position) GetDocumentID ΒΆ

func (p *Position) GetDocumentID() int

GetDocumentID returns the document ID (Convenience method for consistent API)

func (*Position) GetOffset ΒΆ

func (p *Position) GetOffset() int

GetOffset returns the offset (Convenience method for consistent API)

func (*Position) IsAfter ΒΆ

func (p *Position) IsAfter(other Position) bool

IsAfter checks if this position comes after another position

This is the opposite of IsBefore (with equality handled separately)

func (*Position) IsBefore ΒΆ

func (p *Position) IsBefore(other Position) bool

IsBefore checks if this position comes before another position

ORDERING RULES: --------------- Position A < Position B if:

  1. A.DocumentID < B.DocumentID, OR
  2. Same document AND A.Offset < B.Offset

EXAMPLES: ---------

Doc1:Pos5 < Doc1:Pos10 → true  (same doc, 5 < 10)
Doc1:Pos5 < Doc2:Pos0  → true  (doc 1 < doc 2)
Doc2:Pos0 < Doc1:Pos5  → false (doc 2 > doc 1)

func (*Position) IsBeginning ΒΆ

func (p *Position) IsBeginning() bool

IsBeginning checks if this is the BOF sentinel

Example usage:

if pos.IsBeginning() {
    // We're at the start, no previous element exists
}

func (*Position) IsEnd ΒΆ

func (p *Position) IsEnd() bool

IsEnd checks if this is the EOF sentinel

Example usage:

if pos.IsEnd() {
    // We've reached the end, stop searching
}

type QueryBuilder ΒΆ

type QueryBuilder struct {
	// contains filtered or unexported fields
}

QueryBuilder provides a fluent interface for building boolean queries

func NewQueryBuilder ΒΆ

func NewQueryBuilder(index *InvertedIndex) *QueryBuilder

NewQueryBuilder creates a new query builder

EXAMPLE: --------

qb := NewQueryBuilder(index)
results := qb.Term("machine").And().Term("learning").Execute()

func (*QueryBuilder) And ΒΆ

func (qb *QueryBuilder) And() *QueryBuilder

And adds an AND operation

EXAMPLE: --------

qb.Term("machine").And().Term("learning")
// Returns docs with BOTH "machine" AND "learning"

PERFORMANCE: ------------ Roaring bitmap intersection: O(1) for compressed chunks

func (*QueryBuilder) Execute ΒΆ

func (qb *QueryBuilder) Execute() *roaring.Bitmap

Execute runs the query and returns matching document IDs as a bitmap

ALGORITHM: ----------

  1. Process all terms and operations in order
  2. Apply AND/OR operations using roaring bitmap operations
  3. Return final bitmap of matching documents

EXAMPLE: --------

qb := NewQueryBuilder(index)
results := qb.Term("machine").And().Term("learning").Execute()
// results is a roaring.Bitmap with doc IDs

PERFORMANCE: ------------ All operations use optimized roaring bitmap operations:

  • AND: bitmap intersection (fast!)
  • OR: bitmap union (fast!)
  • NOT: bitmap difference (fast!)

func (*QueryBuilder) ExecuteWithBM25 ΒΆ

func (qb *QueryBuilder) ExecuteWithBM25(maxResults int) []Match

ExecuteWithBM25 runs the query and returns ranked results using BM25

ALGORITHM: ----------

  1. Execute boolean query → Get bitmap of matching docs
  2. Extract terms from the query
  3. Calculate BM25 score for each matching document
  4. Sort by score and return top K

EXAMPLE: --------

qb := NewQueryBuilder(index)
matches := qb.Term("machine").And().Term("learning").
    ExecuteWithBM25(10)
// Returns top 10 matches sorted by BM25 score

func (*QueryBuilder) Group ΒΆ

func (qb *QueryBuilder) Group(fn func(*QueryBuilder)) *QueryBuilder

Group creates a sub-query with its own scope

EXAMPLE: --------

qb.Group(func(q *QueryBuilder) {
    q.Term("cat").Or().Term("dog")
}).And().Term("pet")
// Returns: (cat OR dog) AND pet

USE CASE: Control operator precedence

func (*QueryBuilder) Not ΒΆ

func (qb *QueryBuilder) Not() *QueryBuilder

Not negates the next term

EXAMPLE: --------

qb.Term("python").And().Not().Term("snake")
// Returns docs with "python" but NOT "snake"

PERFORMANCE: ------------ Roaring bitmap difference: O(1) for compressed chunks

func (*QueryBuilder) Or ΒΆ

func (qb *QueryBuilder) Or() *QueryBuilder

Or adds an OR operation

EXAMPLE: --------

qb.Term("cat").Or().Term("dog")
// Returns docs with "cat" OR "dog" (or both)

PERFORMANCE: ------------ Roaring bitmap union: O(1) for compressed chunks

func (*QueryBuilder) Phrase ΒΆ

func (qb *QueryBuilder) Phrase(phrase string) *QueryBuilder

Phrase adds a phrase query (exact sequence of words)

WHAT IT DOES: -------------

  1. Analyzes the phrase (just like during indexing)
  2. Uses skip lists to find exact phrase matches
  3. Converts results to a bitmap for boolean operations

EXAMPLE: --------

qb.Phrase("machine learning")  // Find exact phrase

NOTE: Phrase queries need position information, so we use skip lists

func (*QueryBuilder) Term ΒΆ

func (qb *QueryBuilder) Term(term string) *QueryBuilder

Term adds a term to the query

WHAT IT DOES: -------------

  1. Gets the roaring bitmap for the term (instant document lookup)
  2. Applies any pending NOT operation
  3. Combines with previous results using AND/OR

EXAMPLE: --------

qb.Term("machine")  // Find all docs with "machine"

PERFORMANCE: ------------ O(1) bitmap lookup - no skip list traversal needed!

type QueryOp ΒΆ

type QueryOp int

QueryOp represents a pending boolean operation

const (
	OpNone QueryOp = iota
	OpAnd
	OpOr
)

type SkipList ΒΆ

type SkipList struct {
	Head   *Node // Sentinel head node (doesn't contain real data)
	Height int   // Current height of the tallest tower
}

SKIP LIST: The Main Data Structure

func NewSkipList ΒΆ

func NewSkipList() *SkipList

NewSkipList creates an empty skip list

INITIAL STATE: -------------- HEAD (empty node) with no forward pointers Height = 1 (even empty lists have level 0)

func (*SkipList) Delete ΒΆ

func (sl *SkipList) Delete(key Position) bool

Delete removes a key from the skip list

ALGORITHM: ----------

  1. Search for the key
  2. If not found, return false
  3. If found:
     a. Unlink it from all levels
     b. Shrink the skip list height if needed

EXAMPLE: -------- Deleting 15:

Before:
  Level 1: HEAD -------> [10] -> [15] ----> [20]
  Level 0: HEAD -> [5] -> [10] -> [15] -> [20]

After:
  Level 1: HEAD -------> [10] ------------> [20]
  Level 0: HEAD -> [5] -> [10] ------------> [20]

(15 removed)
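
A minimal sketch of the same deletion in code:

sl := NewSkipList()
p := Position{DocumentID: 1, Offset: 15}
sl.Insert(p)

ok := sl.Delete(p) // true: the key was found and unlinked
ok = sl.Delete(p)  // false: it is already gone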

func (*SkipList) Find ΒΆ

func (sl *SkipList) Find(key Position) (Position, error)

Find searches for an exact key match

This is a simple wrapper around Search that only returns the key.

Example:

Find(Doc1:Pos5) returns Doc1:Pos5 if it exists, else error

func (*SkipList) FindGreaterThan ΒΆ

func (sl *SkipList) FindGreaterThan(key Position) (Position, error)

FindGreaterThan finds the smallest key greater than the given key

TWO CASES: ----------

  1. Key exists: Return the next node after it
  2. Key doesn't exist: Return the next node after where it would be

EXAMPLE: --------

Skip list: [5] -> [10] -> [15] -> [20]

FindGreaterThan(10) returns 15 (next after 10)
FindGreaterThan(12) returns 15 (next after where 12 would be)
FindGreaterThan(20) returns EOF (nothing after 20)

USE CASE: --------- In search: "Find the next occurrence of 'quick' after position X"

func (*SkipList) FindLessThan ΒΆ

func (sl *SkipList) FindLessThan(key Position) (Position, error)

FindLessThan finds the largest key less than the given key

HOW IT WORKS: ------------- The journey from Search already gives us this answer! journey[0] is the largest node < key at the bottom level

EXAMPLE: --------

Skip list: [5] -> [10] -> [15] -> [20]

FindLessThan(17) returns 15
FindLessThan(15) returns 10
FindLessThan(5) returns BOF (nothing before 5)

USE CASE: --------- In search: "Find the previous occurrence of 'quick' before position X"

func (*SkipList) Insert ΒΆ

func (sl *SkipList) Insert(key Position)

Insert adds a new key to the skip list (or updates if it exists)

ALGORITHM: ----------

  1. Search for the key (get the journey/path)
  2. If found, update the existing node
  3. If not found:
     a. Generate a random height for the new node
     b. Create the new node
     c. Link it into the list at each level
     d. Update the skip list's height if needed

EXAMPLE WALKTHROUGH: -------------------- Inserting Doc2:Pos5 into skip list: [Doc1:Pos3, Doc2:Pos10]

Step 1: Search(Doc2:Pos5)

  • Not found
  • journey[0] = Node{Doc1:Pos3} (predecessor at level 0)

Step 2: Generate height = 2 (random)

Step 3: Create Node{Doc2:Pos5}

Step 4: Link at level 0 and level 1:

  • Level 0: Doc1:Pos3 -> Doc2:Pos5 -> Doc2:Pos10
  • Level 1: HEAD -> Doc2:Pos5 -> ...
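
A minimal sketch tying Insert to the ordered lookups it enables:

sl := NewSkipList()
sl.Insert(Position{DocumentID: 2, Offset: 10})
sl.Insert(Position{DocumentID: 1, Offset: 3})
sl.Insert(Position{DocumentID: 2, Offset: 5}) // insertion order doesn't matter

next, _ := sl.FindGreaterThan(Position{DocumentID: 1, Offset: 3})
// next == Position{DocumentID: 2, Offset: 5}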

func (*SkipList) Iterator ΒΆ

func (sl *SkipList) Iterator() *Iterator

Iterator creates a new iterator starting at the first element

INITIALIZATION: --------------- We start at the first real element (sl.Head.Tower[0]) NOT at the Head itself (which is just a sentinel)

func (*SkipList) Last ΒΆ

func (sl *SkipList) Last() Position

Last returns the last position in the skip list

HOW IT WORKS: ------------- Simply traverse level 0 until we reach the end

Example: Skip list: [5] -> [10] -> [15] -> [20] -> nil
Last() returns 20

func (*SkipList) Search ΒΆ

func (sl *SkipList) Search(key Position) (*Node, [MaxHeight]*Node)

Search finds a key in the skip list and returns the path taken

RETURN VALUES: --------------

  1. *Node: The node with the exact key (nil if not found)
  2. [MaxHeight]*Node: Journey array - the predecessor at each level

EXAMPLE: -------- Skip list: [5] -> [10] -> [15] -> [20] Search(15) returns:

  • found: Node{15}
  • journey[0]: Node{10} (predecessor at level 0)
  • journey[1]: Node{10} (predecessor at level 1)
  • ...
