gin v0.1.0
Published: Apr 13, 2026 License: MIT Imports: 29 Imported by: 0


GIN Index


See CONTRIBUTING.md for local contributor workflows and SECURITY.md for disclosure guidance.

A Generalized Inverted Index (GIN) for JSON data, designed for row-group pruning in columnar storage formats like Parquet.

Features

  • String indexing - Exact match and IN queries on string fields
  • Numeric indexing - Range queries (GT, GTE, LT, LTE) with per-row-group min/max stats
  • Field transformers - Convert values (e.g., date strings to epoch) for efficient range queries
  • Trigram indexing - Full-text CONTAINS queries using n-gram matching
  • Regex support - Pattern matching with trigram-based candidate selection
  • Null tracking - IS NULL / IS NOT NULL predicates
  • Bloom filter - Fast-path rejection for non-existent values
  • HyperLogLog - Efficient cardinality estimation
  • Compression - zstd-compressed binary serialization
  • Parquet integration - Build from Parquet, embed in metadata, sidecar files, S3 support
  • CLI tool - Command-line interface for build, query, info, and extract operations

Why GIN Index?

A serverless pruning index for data lakes - the GIN index is a compact, immutable index designed to answer one question: "Which row groups might contain my data?"

The Problem

Querying large data lakes is expensive. When you search for trace_id=abc123 across millions of Parquet files, the traditional options are:

  • Full scan - Read every row group (terabytes of data, high latency, high cost)
  • Database approach - Run a PostgreSQL/Elasticsearch cluster (millisecond latency, but operational burden)
  • Parquet stats - Use built-in min/max (useless for high-cardinality strings)

The Solution
┌─────────────────────────────────────────────────────────────────────┐
│                    Serverless Row-Group Pruning                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   1. Cache index anywhere         2. Prune locally     3. Read only  │
│      (<1MB for millions of files)    (~1µs)              matching RGs│
│                                                                      │
│   ┌──────────────┐               ┌──────────────┐    ┌────────────┐ │
│   │  memcached   │  ─────────▶   │  GIN Index   │ ─▶ │ S3/GCS     │ │
│   │  nginx       │    decode     │  Evaluate()  │    │ [RG 5, 23] │ │
│   │  CDN edge    │               │              │    │            │ │
│   │  localStorage│               │ Result: 3    │    │ Skip 99%   │ │
│   └──────────────┘               │ row groups   │    │ of data    │ │
│                                  └──────────────┘    └────────────┘ │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
Key Advantages
| Challenge | PostgreSQL GIN | Elasticsearch | This GIN Index |
|---|---|---|---|
| Deployment | Database cluster | Search cluster | Just bytes - cache anywhere |
| Query latency | ~1ms | ~5-10ms | ~1µs - client-side |
| High cardinality | Index bloat | Shard overhead | Bloom filter fast-path |
| Index size | MB-GB | GB | ~30KB per 1K row groups |
| Arbitrary JSON | Schema required | Mapping required | Auto-discovered paths |
Designed For
  • Log/observability platforms - Query by trace_id, request_id, arbitrary labels
  • Vector databases - Pre-filter segments before expensive ANN search
  • Data lake query engines - Pruning index for DuckDB, Trino, Spark
  • Edge/serverless - Cache index at CDN edge, query without backend

The index decouples pruning (which row groups to read) from execution (DuckDB, Trino, Spark). Your query engine handles the actual data reading - this index just tells it where to look.

Installation

go get github.com/amikos-tech/ami-gin

Quick Start

package main

import (
    "fmt"
    gin "github.com/amikos-tech/ami-gin"
)

func main() {
    // Create builder for 3 row groups
    builder := gin.NewBuilder(gin.DefaultConfig(), 3)

    // Add documents to row groups
    builder.AddDocument(0, []byte(`{"name": "alice", "age": 30}`))
    builder.AddDocument(1, []byte(`{"name": "bob", "age": 25}`))
    builder.AddDocument(2, []byte(`{"name": "alice", "age": 40}`))

    // Build index
    idx := builder.Finalize()

    // Query: find row groups where name = "alice"
    result := idx.Evaluate([]gin.Predicate{
        gin.EQ("$.name", "alice"),
    })
    fmt.Println(result.ToSlice()) // [0, 2]
}

Known limitations

GIN Index v0.1.0 intentionally focuses on the proven single-index predicate surface described above.

  • OR/AND composites are not part of the v0.1.0 query API yet.
  • Index merge across multiple index files is intentionally deferred beyond v0.1.0.
  • Query-time transformers are not supported in v0.1.0; transformations must happen at index-build time.

Query Types

Equality
gin.EQ("$.status", "active")
gin.NE("$.status", "deleted")
gin.IN("$.status", "active", "pending", "review")
gin.NIN("$.status", "deleted", "archived")  // NOT IN
Numeric Range
gin.GT("$.price", 100.0)    // price > 100
gin.GTE("$.price", 100.0)   // price >= 100
gin.LT("$.price", 500.0)    // price < 500
gin.LTE("$.price", 500.0)   // price <= 500

// Combined range
idx.Evaluate([]gin.Predicate{
    gin.GTE("$.price", 100.0),
    gin.LTE("$.price", 500.0),
})
Date Range Queries with Field Transformers

Transform date strings into numeric epoch milliseconds for efficient range queries:

// Configure transformers for date fields
config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.created_at", gin.ISODateToEpochMs),  // RFC3339
    gin.WithFieldTransformer("$.birth_date", gin.DateToEpochMs),     // YYYY-MM-DD
    gin.WithFieldTransformer("$.custom_ts", gin.CustomDateToEpochMs("2006/01/02 15:04")),
)
builder, _ := gin.NewBuilder(config, numRGs)

// Add documents - dates are automatically transformed to epoch ms
builder.AddDocument(0, []byte(`{"created_at": "2024-01-15T10:30:00Z", "birth_date": "1990-05-20"}`))
builder.AddDocument(1, []byte(`{"created_at": "2024-06-15T14:00:00Z", "birth_date": "1985-03-10"}`))

idx := builder.Finalize()

// Query with epoch milliseconds
july2024 := float64(time.Date(2024, 7, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
result := idx.Evaluate([]gin.Predicate{gin.GT("$.created_at", july2024)})

// Date range: Q1 2024
jan := float64(time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
apr := float64(time.Date(2024, 4, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
result = idx.Evaluate([]gin.Predicate{
    gin.GTE("$.created_at", jan),
    gin.LT("$.created_at", apr),
})

Built-in date transformers:

  • ISODateToEpochMs - RFC3339/ISO8601 (2024-01-15T10:30:00Z)
  • DateToEpochMs - Date only (2024-01-15)
  • CustomDateToEpochMs(layout) - Custom Go time layout

Built-in string transformers:

  • ToLower - Lowercase normalization for case-insensitive queries
  • EmailDomain - Extract and lowercase domain from email (alice@Example.COM → example.com)
  • URLHost - Extract and lowercase host from URL (https://API.Example.COM/v1 → api.example.com)
  • RegexExtract(pattern, group) - Extract substring via regex capture group
  • RegexExtractInt(pattern, group) - Extract and convert to numeric

Built-in numeric transformers:

  • IPv4ToInt - IPv4 address to uint32 for range queries (192.168.1.1 → 3232235777)
  • SemVerToInt - Semantic version to integer (2.1.3 → 2001003)
  • DurationToMs - Go duration string to milliseconds (1h30m → 5400000)
  • NumericBucket(size) - Bucket values for histograms (150 with size 100 → 100)
  • BoolNormalize - Normalize boolean-like values ("yes", "1", "on" → true)

IP subnet helpers (for use with IPv4ToInt):

  • CIDRToRange(cidr) - Parse CIDR notation, returns (start, end float64, err)
  • InSubnet(path, cidr) - Returns []Predicate for subnet membership check

Custom transformers:

// Create your own transformer: return the indexed value and ok=true,
// or ok=false to skip the value
myTransformer := func(v any) (any, bool) {
    s, ok := v.(string)
    if !ok {
        return nil, false
    }
    // Your transformation logic, e.g. index a normalized form
    return strings.ToLower(strings.TrimSpace(s)), true
}
config, _ := gin.NewConfig(gin.WithFieldTransformer("$.my_field", myTransformer))

Example: IP subnet queries (network/security logs)

config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.client_ip", gin.IPv4ToInt),
)
// "192.168.1.1" indexed as 3232235777

// Query: Find IPs in 192.168.1.0/24 subnet using InSubnet helper
result := idx.Evaluate(gin.InSubnet("$.client_ip", "192.168.1.0/24"))

// Or use CIDRToRange for manual control
start, end, _ := gin.CIDRToRange("10.0.0.0/8")
result = idx.Evaluate([]gin.Predicate{
    gin.GTE("$.client_ip", start),
    gin.LTE("$.client_ip", end),
})

Example: Version range queries (software metadata)

config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.version", gin.SemVerToInt),
)
// "v2.1.3" indexed as 2001003

// Query: Find versions >= 2.0.0
result := idx.Evaluate([]gin.Predicate{
    gin.GTE("$.version", float64(2000000)),
})

Example: Case-insensitive email queries

config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.email", gin.ToLower),
)
// "Alice@Example.COM" indexed as "alice@example.com"
result := idx.Evaluate([]gin.Predicate{
    gin.EQ("$.email", "alice@example.com"),
})

Example: Extract error codes from log messages

config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.message", gin.RegexExtract(`ERROR\[(\w+)\]:`, 1)),
)
// "ERROR[E1234]: Connection failed" indexed as "E1234"
result := idx.Evaluate([]gin.Predicate{
    gin.EQ("$.message", "E1234"),
})
Full-Text Search (CONTAINS)
// Uses trigram index for substring matching
gin.Contains("$.description", "hello")
gin.Contains("$.title", "database")  // matches "database", "databases", etc.
Regex Matching
// Uses trigram index for regex candidate selection
gin.Regex("$.message", "ERROR|WARNING")        // Alternation
gin.Regex("$.brand", "Toyota|Tesla|Ford")      // Multiple literals
gin.Regex("$.log", "error.*timeout")           // Prefix + wildcard + suffix
gin.Regex("$.code", "[A-Z]{3}_[0-9]+")         // Pattern with literals

The Regex operator extracts literal strings from regex patterns and uses the trigram index for candidate row-group selection. This enables efficient pruning before actual regex matching.

How it works:

  1. Parse the regex pattern and extract literal substrings
  2. For alternations like (error|warn)_message, extract the combined literals: ["error_message", "warn_message"]
  3. Query the trigram index for each literal
  4. Union the results (OR semantics for alternation)
  5. Prune row groups that contain none of the literals
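The trigram lookups in steps 3-5 operate on fixed three-character windows of each literal. A stdlib-only sketch of that windowing (an illustration of the idea, not the library's code):

```go
package main

import "fmt"

// trigrams returns every 3-character window of s; each extracted literal
// is broken into these windows and looked up in the trigram index.
func trigrams(s string) []string {
	if len(s) < 3 {
		return nil // shorter than the trigram length: cannot prune
	}
	out := make([]string, 0, len(s)-2)
	for i := 0; i+3 <= len(s); i++ {
		out = append(out, s[i:i+3])
	}
	return out
}

func main() {
	fmt.Println(trigrams("error_message"))
	// [err rro ror or_ r_m _me mes ess ssa sag age]
}
```

A row group is a candidate only if its trigram posting lists contain every window of at least one literal, which is why short literals (see Limitations) cannot prune.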

Limitations:

  • Requires trigram index enabled (EnableTrigrams: true)
  • Literals shorter than trigram length (default: 3) cannot prune
  • Pure wildcard patterns (.*) return all row groups
  • This is candidate selection, not regex execution - actual matching happens at query time
Null Handling
gin.IsNull("$.optional_field")
gin.IsNotNull("$.required_field")
Nested Fields and Arrays
// Nested objects
gin.EQ("$.user.address.city", "New York")

// Array elements (wildcard)
gin.EQ("$.tags[*]", "important")
gin.IN("$.roles[*]", "admin", "editor")

JSONPath Support

Supported path syntax:

  • $ - root
  • $.field - dot notation
  • $['field'] - bracket notation
  • $.items[*] - array wildcard

Not supported (will error):

  • $.items[0] - array indices
  • $..field - recursive descent
  • $.items[0:5] - slices
  • $[?(@.price > 10)] - filters

Validate paths before use:

if err := gin.ValidateJSONPath("$.user.name"); err != nil {
    log.Fatal(err)
}

Serialization

// Encode to bytes (zstd compressed)
data, err := gin.Encode(idx)

// Save to file
os.WriteFile("index.gin", data, 0644)

// Load and decode
data, _ := os.ReadFile("index.gin")
idx, err := gin.Decode(data)

Parquet Integration

The GIN index integrates directly with Parquet files, supporting three storage strategies:

  1. Sidecar file - Index stored as data.parquet.gin alongside the Parquet file
  2. Embedded metadata - Index stored in Parquet file's key-value metadata
  3. Build-time embedding - Index built and embedded during Parquet file creation
Build Index from Parquet
// Build index from a Parquet file's JSON column
idx, err := gin.BuildFromParquet("data.parquet", "attributes", gin.DefaultConfig())
Sidecar Workflow
// Write index as sidecar file (data.parquet.gin)
err := gin.WriteSidecar("data.parquet", idx)

// Read sidecar
idx, err := gin.ReadSidecar("data.parquet")

// Check if sidecar exists
if gin.HasSidecar("data.parquet") {
    // ...
}
Embedded Metadata Workflow
cfg := gin.DefaultParquetConfig() // MetadataKey: "gin.index"

// Rebuild existing Parquet file with embedded index
err := gin.RebuildWithIndex("data.parquet", idx, cfg)

// Check if Parquet has embedded index
hasIdx, err := gin.HasGINIndex("data.parquet", cfg)

// Read embedded index
idx, err := gin.ReadFromParquetMetadata("data.parquet", cfg)
Auto-Loading (Embedded First, Then Sidecar)
// Tries embedded metadata first, falls back to sidecar
idx, err := gin.LoadIndex("data.parquet", gin.DefaultParquetConfig())
Encode for Parquet Metadata (Build-Time Embedding)

When creating a new Parquet file, you can embed the index during creation:

// Get key-value pair for Parquet metadata
key, value, err := gin.EncodeToMetadata(idx, gin.DefaultParquetConfig())
// key = "gin.index", value = base64-encoded compressed index

// Use with parquet-go writer
writer := parquet.NewGenericWriter[Record](f,
    parquet.KeyValueMetadata(key, value),
)
Batch Processing (Programmatic)

Helper functions for working with multiple files:

// Local filesystem
if gin.IsDirectory("./data") {
    // List all .parquet files in directory
    parquetFiles, err := gin.ListParquetFiles("./data")

    // List all .gin files in directory
    ginFiles, err := gin.ListGINFiles("./data")

    // Process each file
    for _, f := range parquetFiles {
        idx, _ := gin.BuildFromParquet(f, "attributes", gin.DefaultConfig())
        gin.WriteSidecar(f, idx)
    }
}

// S3
s3Client, _ := gin.NewS3ClientFromEnv()

// List all .parquet files under prefix
parquetKeys, err := s3Client.ListParquetFiles("bucket", "data/")

// List all .gin files under prefix
ginKeys, err := s3Client.ListGINFiles("bucket", "data/")
S3 Support

All operations support S3 paths via AWS SDK v2:

// Configure from environment variables:
// AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
// AWS_ENDPOINT_URL (for MinIO, LocalStack), AWS_S3_PATH_STYLE=true
s3Client, err := gin.NewS3ClientFromEnv()

// Build from S3
idx, err := s3Client.BuildFromParquet("bucket", "path/to/data.parquet", "attributes", gin.DefaultConfig())

// Write sidecar to S3
err := s3Client.WriteSidecar("bucket", "path/to/data.parquet", idx)

// Read sidecar from S3
idx, err := s3Client.ReadSidecar("bucket", "path/to/data.parquet")

// Load index (tries embedded, then sidecar)
idx, err := s3Client.LoadIndex("bucket", "path/to/data.parquet", gin.DefaultParquetConfig())
CLI Tool

A command-line tool is provided for common operations:

# Install
go install github.com/amikos-tech/ami-gin/cmd/gin-index@latest

# Build sidecar index
gin-index build -c attributes data.parquet
gin-index build -c attributes -o custom.gin data.parquet

# Build and embed into Parquet file
gin-index build -c attributes -embed data.parquet

# Query index
gin-index query data.parquet.gin '$.status = "error"'
gin-index query data.parquet.gin '$.count > 100'
gin-index query data.parquet.gin '$.name IN ("alice", "bob")'

# Show index info
gin-index info data.parquet.gin

# Extract embedded index to sidecar
gin-index extract -o data.parquet.gin data.parquet

# S3 paths (uses AWS env vars)
gin-index build -c attributes s3://bucket/data.parquet
gin-index query s3://bucket/data.parquet.gin '$.status = "ok"'

Batch Processing (Directory/S3 Prefix):

Process multiple files at once by passing a directory or S3 prefix:

# Build index for all .parquet files in a directory
gin-index build -c attributes ./data/
gin-index build -c attributes -embed ./data/

# Query all .gin files in a directory
gin-index query ./data/ '$.status = "error"'

# Show info for all .gin files
gin-index info ./data/

# S3 prefix - processes all .parquet files under the prefix
gin-index build -c attributes s3://bucket/data/
gin-index query s3://bucket/data/ '$.status = "error"'
gin-index info s3://bucket/data/

# Glob patterns work too
gin-index build -c attributes './data/*.parquet'
gin-index query './data/*.gin' '$.level = "error"'

CLI Query Syntax:

  • Equality: $.field = "value", $.field != "value"
  • Numeric: $.field > 100, $.field >= 100, $.field < 100, $.field <= 100
  • IN/NOT IN: $.field IN ("a", "b"), $.field NOT IN (1, 2, 3)
  • Null: $.field IS NULL, $.field IS NOT NULL
  • Contains: $.field CONTAINS "substring"
  • Regex: $.field REGEX "pattern" (e.g., $.brand REGEX "Toyota|Tesla")

Configuration

config := gin.GINConfig{
    CardinalityThreshold: 10000,  // Use bloom-only for high-cardinality paths
    BloomFilterSize:      65536,
    BloomFilterHashes:    5,
    EnableTrigrams:       true,   // Enable CONTAINS queries
    TrigramMinLength:     3,
    HLLPrecision:         12,     // HyperLogLog precision (4-16)
    PrefixBlockSize:      16,
}

builder := gin.NewBuilder(config, numRowGroups)

Examples

See the examples directory:

go run ./examples/basic/main.go        # Equality queries
go run ./examples/range/main.go        # Numeric ranges
go run ./examples/transformers/main.go # Date field transformers
go run ./examples/transformers-advanced/main.go # IP, SemVer, email, regex transformers
go run ./examples/fulltext/main.go     # CONTAINS queries
go run ./examples/regex/main.go        # Regex pattern matching
go run ./examples/null/main.go         # NULL handling
go run ./examples/nested/main.go       # Nested JSON and arrays
go run ./examples/serialize/main.go    # Persistence
go run ./examples/full/main.go         # All types and operators
go run ./examples/parquet/main.go      # Parquet integration (sidecar, embedded, queries)

Benchmarks

Run benchmarks with:

go test -bench=. -benchmem -benchtime=1s
Performance Summary (Apple M3 Max)
| Operation | Latency | Notes |
|---|---|---|
| EQ query | ~1µs | Bloom filter + sorted term lookup |
| Range query (GT/LT) | 4-24µs | Min/max stats scan |
| IN query (10 values) | ~8µs | Union of EQ results |
| CONTAINS query | 2-17µs | Trigram intersection |
| IsNull/IsNotNull | 2-4µs | Bitmap lookup |
| Bloom lookup | ~100ns | Fast-path rejection |
| AddDocument | ~43µs | JSON parsing + indexing |
| Encode (1K RGs) | ~4ms | zstd compression |
| Decode (1K RGs) | ~2ms | zstd decompression |
Index Size
| Row Groups | Encoded Size | Per RG |
|---|---|---|
| 100 | 6.7 KB | 67 bytes |
| 500 | 18 KB | 36 bytes |
| 1,000 | 30 KB | 30 bytes |
| 2,000 | 51 KB | 26 bytes |
Scaling Characteristics

Query time scales well with index size:

  • 10 RGs: ~340ns
  • 100 RGs: ~530ns
  • 1,000 RGs: ~680ns
  • 5,000 RGs: ~800ns

Build time is linear with document count and complexity:

  • 100 docs (7 fields): ~1.2ms
  • 1,000 docs: ~6.7ms
  • High cardinality (10K unique values): ~3.3ms per 1K docs
Component Performance
| Component | Operation | Latency |
|---|---|---|
| Bloom Filter | Add | ~100ns |
| Bloom Filter | Lookup | ~100ns |
| RGSet (10K) | Intersect | ~12µs |
| RGSet (10K) | Union | ~10µs |
| Trigram | Add (50 chars) | ~16µs |
| Trigram | Search | 1-6µs |
| HyperLogLog | Add | ~70ns |
| HyperLogLog | Estimate | 7-410µs (precision dependent) |
| Prefix | Compress 1K terms | ~60µs |
Real-World Scenario: 1M Docs / 50K Row Groups

Simulating a log storage scenario:

  • 1M documents across 50K row groups (~20 docs/RG)
  • 10 labels: 2 integers (status_code, duration_ms) + 8 strings
  • Mix of cardinalities: trace_id (high), service (low), host (medium)
  • Trigrams disabled (no FTS)
| Metric | Value |
|---|---|
| Index Size | 289 KB (0.28 MB) |
| Bytes per RG | 5.9 bytes |
| Bytes per doc | 0.3 bytes |
| Build time | 464ms |
| Encode | 41ms |
| Decode | 41ms |

Query Performance:

| Query | Latency | Notes |
|---|---|---|
| trace_id=X (high cardinality) | 950ns | Bloom filter fast-path |
| service=api (low cardinality) | 6.5µs | ~10K RGs match |
| trace_id=X AND level=error | 6µs | High card + low card |
| duration_ms > 5000 | 244µs | Range scan over 50K RGs |
| service=api AND env=prod AND status>=400 | 285µs | 3 predicates combined |

Key takeaway: High-cardinality lookups (trace ID, request ID) are sub-microsecond. The entire index for 1M documents fits in 289 KB - easily cacheable in memory, localStorage, or CDN edge.

Benchmark Categories

The benchmark suite (benchmark_test.go) covers:

  1. Builder Performance - Document ingestion, batch loading, finalization
  2. Query Performance - All operators, parallel queries, multiple predicates
  3. Serialization - Encode/decode latency, compression ratios
  4. Components - Bloom filter, RGSet, trigram, HLL, prefix compression
  5. Scaling - Row group count, document size, cardinality, nesting depth

Comparison with Other Solutions

vs PostgreSQL GIN/JSONB
| Aspect | This GIN Index | PostgreSQL GIN |
|---|---|---|
| Query Latency | ~1µs (EQ) | ~0.7-1.2ms per predicate |
| Deployment | Embedded bytes, no server | Requires PostgreSQL server |
| Cacheability | Cache anywhere (nginx, memcached, CDN) | Tied to database buffer cache |
| Index Size | 26-67 bytes/row-group | Larger, includes posting lists |
| Range Queries | Native min/max stats | Poor (GIN doesn't support ranges) |
| Full-Text | Trigram-based CONTAINS | Full-featured tsvector/tsquery |
| ACID | No (read-only after build) | Full transaction support |

PostgreSQL GIN uses Bitmap Index Scans which cost ~0.7-1.2ms each when cached. This index achieves ~1µs queries by being purpose-built for row-group pruning with simpler data structures.

vs Parquet Built-in Statistics
| Aspect | This GIN Index | Parquet Min/Max Stats | Parquet Bloom Filters |
|---|---|---|---|
| String Equality | Exact term → RG bitmap | Only min/max (poor for strings) | Yes, but per-column only |
| CONTAINS/FTS | Trigram index | No | No |
| Multi-path Queries | Single index file | Scattered in column chunks | Scattered in column chunks |
| Cardinality | HyperLogLog estimates | No | No |
| Null Tracking | Explicit null/present bitmaps | Null count only | No |
| Index Location | Footer or sidecar file | Column chunk metadata | Column chunk metadata |

Parquet's built-in bloom filters are effective for single-column equality but require reading multiple column chunks for multi-field queries. This GIN index consolidates all paths into one structure.

vs Delta Lake / Iceberg Data Skipping
| Aspect | This GIN Index | Delta Lake | Apache Iceberg |
|---|---|---|---|
| Statistics | Per-path term index + min/max | First 32 columns min/max | Partition-level + column stats |
| High Cardinality | Bloom filter fallback | Requires Z-ordering | Requires sorting |
| JSON Support | Native path extraction | Requires schema | Requires schema |
| Query Planning | Client-side, cacheable | Spark/engine dependent | Engine dependent |
| Deployment | Standalone bytes | Delta transaction log | Metadata tables |

Delta Lake's data skipping relies on Z-ordering for effectiveness with high-cardinality columns. This GIN index handles high cardinality natively via bloom filters.

vs Elasticsearch
| Aspect | This GIN Index | Elasticsearch |
|---|---|---|
| Query Latency | ~1µs | ~1-10ms (network + processing) |
| Deployment | Embedded, no server | Cluster required |
| Index Size | ~30KB for 1K row-groups | GB+ for equivalent data |
| Use Case | Row-group pruning | Full search engine |
| Updates | Rebuild required | Near real-time |

Elasticsearch provides millisecond-level latency for searches but requires cluster infrastructure. This index is designed for embedding in data lake metadata.

Key Advantage: High-Cardinality Arbitrary JSON

This index was born from log storage needs - indexing arbitrary attributes/labels where:

  • High cardinality is the norm - trace IDs, request IDs, user IDs, session tokens
  • Schema is unknown - arbitrary key-value labels attached at runtime
  • Queries are selective - "find logs where trace_id=abc123" should be instant

Traditional solutions struggle here:

| Challenge | PostgreSQL GIN | Parquet Stats | This GIN Index |
|---|---|---|---|
| trace_id (millions unique) | Index bloat, slow writes | Min/max useless | Bloom filter fast-path |
| user.email (arbitrary path) | Requires schema | Column must exist | Auto-discovered paths |
| labels["env"] (dynamic keys) | JSONB @> operator (~1ms) | Not supported | Native path indexing (~1µs) |
| Mixed types per path | Type coercion issues | Single type per column | Tracks observed types |

Log/observability example:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "trace_id": "abc123def456",
  "user": {"id": "user_98765", "email": "alice@example.com"},
  "labels": {"env": "prod", "region": "us-east-1", "version": "2.1.0"},
  "message": "Connection timeout to downstream service"
}

Query: trace_id=abc123def456 AND labels.env=prod AND level=error

  • Bloom filter rejects non-matching row groups instantly
  • High-cardinality trace_id doesn't degrade performance
  • Arbitrary labels.* paths indexed automatically

Vector database metadata filtering:

Vector databases need efficient pre-filtering before similarity search. Without good metadata indexing, the options are:

  1. Scan all vectors, then filter (slow)
  2. Filter first with a poor index (still slow)
  3. Build separate metadata infrastructure (complex)

{
  "id": "doc_12345",
  "embedding": [0.1, 0.2, ...],
  "metadata": {
    "source": "arxiv",
    "year": 2024,
    "authors": ["Alice", "Bob"],
    "topics": ["machine-learning", "transformers"],
    "cited_by": 142,
    "full_text": "We present a novel approach to..."
  }
}

Query: Find similar vectors WHERE metadata.source=arxiv AND metadata.year>=2023 AND metadata.topics[*]=transformers

┌─────────────────────────────────────────────────────────────────┐
│                   Vector DB Hybrid Search                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. Metadata Filter (GIN Index)         2. Vector Search       │
│   ┌─────────────────────────────┐       ┌──────────────────┐   │
│   │ source=arxiv                │       │                  │   │
│   │ year>=2023        ──────────┼──────▶│  ANN Search      │   │
│   │ topics[*]=transformers      │       │  (only on        │   │
│   │                             │       │   segments 2,5)  │   │
│   │ Result: segments [2, 5]     │       │                  │   │
│   └─────────────────────────────┘       └──────────────────┘   │
│           ~1µs                              search scope        │
│                                             reduced 80%         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

This enables:

  • Pre-filtering - Prune segments before expensive ANN search
  • Flexible schemas - Each document can have different metadata fields
  • High cardinality - Filter by doc_id, user_id, session_id
  • Range + equality - year>=2023 AND source=arxiv
  • Array membership - topics[*]=machine-learning
  • Full-text on metadata - CONTAINS(full_text, "transformer")
Cacheable Pruning Index

The second differentiator is deployment flexibility:

┌─────────────────────────────────────────────────────────────────┐
│                     Data Lake Query Flow                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Client                                                        │
│     │                                                           │
│     │ 1. Fetch GIN index (cached)                              │
│     ▼                                                           │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │  nginx/memcached/CDN/local cache                        │  │
│   │  ┌─────────────┐                                        │  │
│   │  │ index.gin   │  ← 30KB, serves in <1ms               │  │
│   │  │ (cached)    │                                        │  │
│   │  └─────────────┘                                        │  │
│   └─────────────────────────────────────────────────────────┘  │
│     │                                                           │
│     │ 2. Evaluate predicates locally (~1µs)                    │
│     │    Result: [RG 5, RG 23, RG 47]                          │
│     │                                                           │
│     │ 3. Read only matching row groups from object storage     │
│     ▼                                                           │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │  S3 / GCS / Azure Blob                                  │  │
│   │  ┌─────────────────────────────────────────────────┐    │  │
│   │  │ data.parquet                                    │    │  │
│   │  │  RG 0 ──────── skipped                         │    │  │
│   │  │  RG 5 ◀─────── read                            │    │  │
│   │  │  RG 10 ─────── skipped                         │    │  │
│   │  │  RG 23 ◀─────── read                           │    │  │
│   │  │  RG 47 ◀─────── read                           │    │  │
│   │  │  ...                                            │    │  │
│   │  └─────────────────────────────────────────────────┘    │  │
│   └─────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Benefits:

  • No database server - Index is just bytes, evaluate anywhere
  • Cache at edge - nginx, memcached, CDN, browser localStorage
  • Cross-language - Binary format works in any language
  • Offline capable - Cache index locally for disconnected queries
  • Cost efficient - Avoid scanning TB of Parquet data

This architecture is ideal for:

  • Log/observability platforms - Index arbitrary labels, query by trace ID
  • Vector databases - Pre-filter segments before ANN search
  • Serverless query engines - No database to manage
  • Browser-based data explorers - Cache index in localStorage
  • Edge computing / IoT analytics - Offline-capable querying
  • Cost-sensitive data lake queries - Minimize S3/GCS egress

Architecture

Index Structure
┌─────────────────────────────────────────────────────────────────────────────┐
│                              GINIndex                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Header                                                               │   │
│  │  • Version, NumRowGroups, NumDocs, NumPaths                         │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Path Directory                                                       │   │
│  │  pathID → { PathName, ObservedTypes, Cardinality, Flags }           │   │
│  │                                                                      │   │
│  │  Example:                                                            │   │
│  │    0 → { "$.name",   String,  150,   0x00 }                         │   │
│  │    1 → { "$.age",    Int,     80,    0x00 }                         │   │
│  │    2 → { "$.tags[*]", String, 50000, FlagBloomOnly }                │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Global Bloom Filter                                                  │   │
│  │  Fast rejection for path=value pairs                                 │   │
│  │  Contains: "$.name=alice", "$.name=bob", "$.age=30", ...            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌───────────────────────┐  ┌───────────────────────┐                      │
│  │ StringIndex           │  │ NumericIndex          │                      │
│  │ (per pathID)          │  │ (per pathID)          │                      │
│  │                       │  │                       │                      │
│  │ pathID: 0 ($.name)    │  │ pathID: 1 ($.age)     │                      │
│  │ ┌─────────┬────────┐  │  │ ┌─────┬──────┬──────┐│                      │
│  │ │  Term   │ RGSet  │  │  │ │ RG  │ Min  │ Max  ││                      │
│  │ ├─────────┼────────┤  │  │ ├─────┼──────┼──────┤│                      │
│  │ │ "alice" │ {0,2}  │  │  │ │  0  │  25  │  35  ││                      │
│  │ │ "bob"   │ {1}    │  │  │ │  1  │  20  │  45  ││                      │
│  │ │ "carol" │ {2,3}  │  │  │ │  2  │  30  │  30  ││                      │
│  │ └─────────┴────────┘  │  │ └─────┴──────┴──────┘│                      │
│  └───────────────────────┘  └───────────────────────┘                      │
│                                                                             │
│  ┌───────────────────────┐  ┌───────────────────────┐                      │
│  │ NullIndex             │  │ TrigramIndex          │                      │
│  │ (per pathID)          │  │ (per pathID)          │                      │
│  │                       │  │                       │                      │
│  │ pathID: 1 ($.age)     │  │ pathID: 3 ($.desc)    │                      │
│  │ ┌──────────┬────────┐ │  │ ┌─────────┬────────┐ │                      │
│  │ │ NullRGs  │ {4,7}  │ │  │ │ Trigram │ RGSet  │ │                      │
│  │ │ Present  │ {0-9}  │ │  │ ├─────────┼────────┤ │                      │
│  │ └──────────┴────────┘ │  │ │ "hel"   │ {0,2}  │ │                      │
│  └───────────────────────┘  │ │ "ell"   │ {0,2}  │ │                      │
│                             │ │ "llo"   │ {0,2,5}│ │                      │
│  ┌────────────────────────┐ │ │ "wor"   │ {1,3}  │ │                      │
│  │ DocID Mapping          │ │ └─────────┴────────┘ │                      │
│  │ (optional)             │ └───────────────────────┘                      │
│  │                        │                                                 │
│  │ pos → DocID            │  ┌───────────────────────┐                     │
│  │  0  → 1000             │  │ PathCardinality (HLL) │                     │
│  │  1  → 1001             │  │ (per pathID)          │                     │
│  │  2  → 1020             │  │                       │                     │
│  │  3  → 1021             │  │ Estimates unique vals │                     │
│  └────────────────────────┘  └───────────────────────┘                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Note: All RGSet bitmaps use Roaring Bitmaps for efficient compression
Data Flow
                                BUILD PHASE
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│   JSON Documents                                                            │
│        │                                                                    │
│        ▼                                                                    │
│   ┌─────────┐     ┌──────────────────────────────────────────────────┐     │
│   │ DocID 0 │────▶│                  GINBuilder                       │     │
│   │ RG: 0   │     │                                                   │     │
│   └─────────┘     │  AddDocument(docID, json)                        │     │
│   ┌─────────┐     │       │                                          │     │
│   │ DocID 1 │────▶│       ▼                                          │     │
│   │ RG: 0   │     │  ┌─────────────┐                                 │     │
│   └─────────┘     │  │ Walk JSON   │                                 │     │
│   ┌─────────┐     │  │ Extract:    │                                 │     │
│   │ DocID 2 │────▶│  │  • paths    │                                 │     │
│   │ RG: 1   │     │  │  • values   │                                 │     │
│   └─────────┘     │  │  • types    │                                 │     │
│        ⋮          │  └──────┬──────┘                                 │     │
│                   │         │                                         │     │
│                   │         ▼                                         │     │
│                   │  ┌─────────────────────────────────────────┐     │     │
│                   │  │ Update per-path structures:             │     │     │
│                   │  │  • stringTerms[term] → RGSet.Set(pos)   │     │     │
│                   │  │  • numericStats[pos].Min/Max            │     │     │
│                   │  │  • nullRGs.Set(pos) if null             │     │     │
│                   │  │  • trigrams.Add(term, pos)              │     │     │
│                   │  │  • bloom.Add(path=value)                │     │     │
│                   │  │  • hll.Add(value)                       │     │     │
│                   │  └─────────────────────────────────────────┘     │     │
│                   │                                                   │     │
│                   └───────────────────────┬──────────────────────────┘     │
│                                           │                                 │
│                                           ▼                                 │
│                                    Finalize()                               │
│                                           │                                 │
│                                           ▼                                 │
│                                    ┌─────────────┐                          │
│                                    │  GINIndex   │                          │
│                                    └─────────────┘                          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

                                QUERY PHASE
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│   Predicates: [EQ("$.name", "alice"), GT("$.age", 25)]                     │
│        │                                                                    │
│        ▼                                                                    │
│   ┌─────────────────────────────────────────────────────────────────┐      │
│   │                     idx.Evaluate(predicates)                     │      │
│   └─────────────────────────────────────────────────────────────────┘      │
│        │                                                                    │
│        ├──────────────────────┬──────────────────────┐                     │
│        ▼                      ▼                      ▼                     │
│   ┌──────────┐          ┌──────────┐          ┌──────────┐                 │
│   │ Predicate│          │ Predicate│          │   ...    │                 │
│   │    1     │          │    2     │          │          │                 │
│   └────┬─────┘          └────┬─────┘          └──────────┘                 │
│        │                     │                                              │
│        ▼                     ▼                                              │
│   ┌──────────────┐     ┌──────────────┐                                    │
│   │ Bloom Check  │     │ Bloom Check  │   ◀── Fast rejection path          │
│   │ path=value?  │     │   (skip for  │                                    │
│   └──────┬───────┘     │    ranges)   │                                    │
│          │             └──────┬───────┘                                    │
│          ▼                    ▼                                             │
│   ┌──────────────┐     ┌──────────────┐                                    │
│   │ StringIndex  │     │ NumericIndex │                                    │
│   │ lookup term  │     │ scan min/max │                                    │
│   │ → RGSet      │     │ → RGSet      │                                    │
│   └──────┬───────┘     └──────┬───────┘                                    │
│          │                    │                                             │
│          │    RGSet{0,2}      │    RGSet{0,1,2}                            │
│          │                    │                                             │
│          └─────────┬──────────┘                                             │
│                    │                                                        │
│                    ▼                                                        │
│             ┌─────────────┐                                                 │
│             │  Intersect  │                                                 │
│             │  (AND all)  │                                                 │
│             └──────┬──────┘                                                 │
│                    │                                                        │
│                    ▼                                                        │
│             ┌─────────────┐                                                 │
│             │ RGSet{0,2}  │  ◀── Matching row groups                       │
│             └──────┬──────┘                                                 │
│                    │                                                        │
│                    ▼                                                        │
│             ┌─────────────┐                                                 │
│             │ ToSlice()   │  → [0, 2]                                      │
│             │     or      │                                                 │
│             │ MatchingDocIDs() → [DocID...]                                │
│             └─────────────┘                                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
DocID Codec (Optional)

For composite document identifiers (e.g., file + row group):

// Encode file index and row group into a single DocID
codec := gin.NewRowGroupCodec(20)  // 20 RGs per file
builder, err := gin.NewBuilder(config, totalRGs, gin.WithCodec(codec))

docID := codec.Encode(fileIndex, rgIndex)  // e.g., file=3, rg=15 → DocID=75
builder.AddDocument(docID, jsonDoc)

// Query and decode results
result := idx.Evaluate(predicates)
for _, docID := range idx.MatchingDocIDs(result) {
    decoded := codec.Decode(docID)  // [3, 15]
    fileIdx, rgIdx := decoded[0], decoded[1]
}
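The codec's documented layout (DocID = fileIndex * rowGroupsPerFile + rgIndex) is plain integer arithmetic. A standalone sketch of the round trip (the helper names encodeDocID/decodeDocID are illustrative, not part of the package API):

```go
package main

import "fmt"

// encodeDocID packs a file index and row-group index into one DocID,
// using the layout documented for RowGroupCodec:
// DocID = fileIndex*rowGroupsPerFile + rgIndex.
func encodeDocID(fileIndex, rgIndex, rowGroupsPerFile int) uint64 {
	return uint64(fileIndex*rowGroupsPerFile + rgIndex)
}

// decodeDocID reverses the packing via division and modulus.
func decodeDocID(docID uint64, rowGroupsPerFile int) (fileIndex, rgIndex int) {
	return int(docID) / rowGroupsPerFile, int(docID) % rowGroupsPerFile
}

func main() {
	docID := encodeDocID(3, 15, 20) // file=3, rg=15, 20 RGs per file
	fileIdx, rgIdx := decodeDocID(docID, 20)
	fmt.Println(docID, fileIdx, rgIdx) // 75 3 15
}
```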

How It Works

The GIN index maintains several data structures:

  1. Path Directory - Maps JSON paths to their metadata (types, cardinality)
  2. String Index - For each path, maps terms to row-group bitmaps (Roaring)
  3. Numeric Index - Per-row-group min/max values for range pruning
  4. Null Index - Bitmaps tracking which row groups have null/present values
  5. Trigram Index - Maps 3-character sequences to row-group bitmaps
  6. Global Bloom Filter - Fast rejection of non-existent path=value pairs
  7. DocID Mapping - Optional external DocID to internal position mapping

Query evaluation intersects the matching row-group bitmaps from each predicate.
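The intersection step reduces to an AND over the per-predicate bitmaps. A toy version using plain Go sets (the real index uses Roaring bitmaps), reproducing the RGSet{0,2} ∩ RGSet{0,1,2} example from the query-phase diagram:

```go
package main

import (
	"fmt"
	"sort"
)

// intersect returns the row groups present in every input set,
// mirroring the AND-all step of predicate evaluation.
func intersect(sets ...map[int]bool) []int {
	if len(sets) == 0 {
		return nil
	}
	var out []int
	for rg := range sets[0] {
		inAll := true
		for _, s := range sets[1:] {
			if !s[rg] {
				inAll = false
				break
			}
		}
		if inAll {
			out = append(out, rg)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	eqName := map[int]bool{0: true, 2: true}         // EQ("$.name", "alice") → {0,2}
	gtAge := map[int]bool{0: true, 1: true, 2: true} // GT("$.age", 25) → {0,1,2}
	fmt.Println(intersect(eqName, gtAge))            // [0 2]
}
```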

Design Notes

Why numRGs Must Be Known Upfront

The NewBuilder(config, numRGs) requires the total number of row groups at construction time. This is intentional:

  1. Complement operations require universe size - Operations like AllRGs() and Invert() need to know the total number of row groups to compute complements. When a query cannot prune (e.g., unknown path, graceful degradation), the index returns "all row groups" - which requires knowing what "all" means.

  2. Parquet metadata provides this - The index is designed for Parquet row-group pruning. In this context, the number of row groups is always available from Parquet file metadata before indexing begins.

  3. Bounds checking - The builder validates that document positions don't exceed the declared row group count, catching configuration errors early.
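The role of the universe size is easiest to see in a standalone complement: without numRGs there is no way to invert a match set or to answer "all row groups" when a predicate cannot prune (the invert helper below is illustrative, not the package's RGSet.Invert):

```go
package main

import "fmt"

// invert returns the complement of set within a universe of numRGs
// row groups — the operation that requires knowing numRGs upfront.
func invert(set map[int]bool, numRGs int) []int {
	var out []int
	for rg := 0; rg < numRGs; rg++ {
		if !set[rg] {
			out = append(out, rg)
		}
	}
	return out
}

func main() {
	matched := map[int]bool{1: true, 3: true}
	fmt.Println(invert(matched, 5)) // [0 2 4]
}
```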

License

MIT

Documentation

Index

Constants

View Source
const (
	MagicBytes = "GIN\x01"
	Version    = uint16(3)
)
View Source
const (
	TypeString uint8 = 1 << iota
	TypeInt
	TypeFloat
	TypeBool
	TypeNull
)
View Source
const (
	FlagBloomOnly    uint8 = 1 << iota
	FlagTrigramIndex       // path has trigram index for CONTAINS queries
)
View Source
const DefaultMetadataKey = "gin.index"
View Source
const (
	FlagHasDocIDMap uint16 = 1 << iota
)

Variables

View Source
var (
	// ErrVersionMismatch is returned by Decode when the binary format version
	// does not match the expected version (Version constant).
	ErrVersionMismatch = errors.New("version mismatch")

	// ErrInvalidFormat is returned by Decode when the binary data is structurally
	// invalid: unrecognized magic bytes, oversized allocations, or corrupt fields.
	ErrInvalidFormat = errors.New("invalid format")
)

Functions

func BoolNormalize

func BoolNormalize(v any) (any, bool)

BoolNormalize normalizes various boolean-like values to actual booleans. Handles: bool, "true"/"false"/"yes"/"no"/"1"/"0"/"on"/"off", float64 (0 = false).

func CIDRToRange

func CIDRToRange(cidr string) (start, end float64, err error)

CIDRToRange parses a CIDR notation string and returns the start and end IP addresses as float64 values suitable for use with GTE/LTE predicates on IPv4ToInt-transformed fields. Example: CIDRToRange("192.168.1.0/24") returns (3232235776, 3232236031, nil)
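A standalone sketch of the same computation using the standard library (the cidrToRange helper is illustrative; the package returns float64 rather than uint32):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// cidrToRange computes the first and last IPv4 address of a CIDR block
// as uint32 values — the range that feeds GTE/LTE predicates.
func cidrToRange(cidr string) (start, end uint32, err error) {
	_, ipNet, err := net.ParseCIDR(cidr)
	if err != nil {
		return 0, 0, err
	}
	start = binary.BigEndian.Uint32(ipNet.IP.To4())
	mask := binary.BigEndian.Uint32(ipNet.Mask)
	end = start | ^mask // set all host bits
	return start, end, nil
}

func main() {
	start, end, _ := cidrToRange("192.168.1.0/24")
	fmt.Println(start, end) // 3232235776 3232236031
}
```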

func CompressionStats

func CompressionStats(terms []string) (compressed, original int, ratio float64)

CompressionStats returns the compressed size, original size, and compression ratio for a set of terms.

func DateToEpochMs

func DateToEpochMs(v any) (any, bool)

DateToEpochMs parses "2006-01-02" format to Unix milliseconds (midnight UTC).
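The transformation can be sketched with the standard time package (illustrative helper, not the package's implementation; DateToEpochMs itself takes and returns `any`):

```go
package main

import (
	"fmt"
	"time"
)

// dateToEpochMs parses a "2006-01-02" date as midnight UTC and
// returns Unix milliseconds.
func dateToEpochMs(s string) (int64, bool) {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		return 0, false
	}
	return t.UnixMilli(), true
}

func main() {
	ms, ok := dateToEpochMs("2024-01-01")
	fmt.Println(ms, ok) // 1704067200000 true
}
```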

func DurationToMs

func DurationToMs(v any) (any, bool)

DurationToMs parses Go duration strings (e.g., "1h30m", "500ms") to milliseconds.

func EmailDomain

func EmailDomain(v any) (any, bool)

EmailDomain extracts and lowercases the domain from an email address.

func Encode

func Encode(idx *GINIndex) ([]byte, error)

Encode serializes the index using zstd-15 compression (recommended default).

func EncodeToMetadata

func EncodeToMetadata(idx *GINIndex, cfg ParquetConfig) (key string, value string, err error)

func EncodeWithLevel

func EncodeWithLevel(idx *GINIndex, level CompressionLevel) ([]byte, error)

EncodeWithLevel serializes the index with the specified compression level. Use CompressionNone (0) for no compression, or 1-19 for zstd compression levels.

func ExtractLiterals

func ExtractLiterals(pattern string) ([]string, error)

ExtractLiterals extracts literal strings from a regex pattern that can be used for trigram-based candidate selection. Returns a slice of literal alternatives. For patterns like "foo|bar", returns ["foo", "bar"]. For patterns like "(error|warn)_msg", returns ["error_msg", "warn_msg"] (combined).

func ExtractTrigrams

func ExtractTrigrams(s string) []string

func GenerateBigrams

func GenerateBigrams(text string) []string

func GenerateNGrams

func GenerateNGrams(text string, n int, opts ...NGramOption) ([]string, error)

func GenerateTrigrams

func GenerateTrigrams(text string) []string
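A minimal sliding-window sketch of trigram generation, matching the "hel"/"ell"/"llo" entries in the index-structure diagram (byte-based and unpadded; the library's behavior for padding and Unicode may differ — see GenerateNGrams and WithPadding):

```go
package main

import "fmt"

// trigrams slides a 3-byte window over s — the 3-character sequences
// that the trigram index maps to row-group bitmaps.
func trigrams(s string) []string {
	if len(s) < 3 {
		return nil
	}
	out := make([]string, 0, len(s)-2)
	for i := 0; i+3 <= len(s); i++ {
		out = append(out, s[i:i+3])
	}
	return out
}

func main() {
	fmt.Println(trigrams("hello")) // [hel ell llo]
}
```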

func HasGINIndex

func HasGINIndex(parquetFile string, cfg ParquetConfig) (bool, error)

func HasGINIndexReader

func HasGINIndexReader(parquetFile string, cfg ParquetConfig, reader io.ReaderAt, size int64) (bool, error)

func HasSidecar

func HasSidecar(parquetFile string) bool

func IPv4ToInt

func IPv4ToInt(v any) (any, bool)

IPv4ToInt converts IPv4 address strings to uint32 (as float64) for range queries.

func ISODateToEpochMs

func ISODateToEpochMs(v any) (any, bool)

ISODateToEpochMs parses RFC3339/ISO8601 strings to Unix milliseconds.

func IsDirectory

func IsDirectory(path string) bool

func IsS3Path

func IsS3Path(path string) bool

func IsValidJSONPath

func IsValidJSONPath(path string) bool

func ListGINFiles

func ListGINFiles(dir string) ([]string, error)

func ListParquetFiles

func ListParquetFiles(dir string) ([]string, error)

func MustValidateJSONPath

func MustValidateJSONPath(path string) string

func NormalizePath

func NormalizePath(path string) string

NormalizePath converts a JSONPath to a canonical dot-notation form.

func ParseJSONPath

func ParseJSONPath(path string) (jp.Expr, error)

ParseJSONPath parses and validates a JSONPath, returning the parsed expression.

func ParseS3Path

func ParseS3Path(path string) (bucket, key string, err error)

func RebuildWithIndex

func RebuildWithIndex(parquetFile string, idx *GINIndex, cfg ParquetConfig) error

func SemVerToInt

func SemVerToInt(v any) (any, bool)

SemVerToInt encodes semantic versions as integers: major*1000000 + minor*1000 + patch. Supports formats: "1.2.3", "v1.2.3", "1.2", "v1.2", "1.2.3-beta" (pre-release suffix ignored).
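A standalone sketch of the documented encoding (the semVerToInt helper is illustrative; the package function takes and returns `any`):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// semVerToInt encodes "major.minor.patch" as
// major*1000000 + minor*1000 + patch. A "v" prefix and any
// pre-release/build suffix are stripped; a missing patch is 0.
func semVerToInt(v string) (int, bool) {
	v = strings.TrimPrefix(v, "v")
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i] // drop pre-release/build suffix
	}
	parts := strings.Split(v, ".")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, false
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return 0, false
		}
		nums[i] = n
	}
	return nums[0]*1000000 + nums[1]*1000 + nums[2], true
}

func main() {
	n, _ := semVerToInt("v1.2.3-beta")
	fmt.Println(n) // 1002003
}
```

Because the encoding is order-preserving, version comparisons become ordinary numeric range predicates.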

func SidecarPath

func SidecarPath(parquetFile string) string

func ToLower

func ToLower(v any) (any, bool)

ToLower normalizes strings to lowercase for case-insensitive queries.

func URLHost

func URLHost(v any) (any, bool)

URLHost extracts and lowercases the host from a URL.

func ValidateJSONPath

func ValidateJSONPath(path string) error

ValidateJSONPath validates a JSONPath expression and ensures it only uses features supported by the GIN index (dot notation, wildcards). Unsupported: array indices ([0]), filters ([?()]), recursive descent (..), and scripts.

func WriteCompressedTerms

func WriteCompressedTerms(w io.Writer, blocks []CompressedTermBlock) error

func WriteSidecar

func WriteSidecar(parquetFile string, idx *GINIndex) error

Types

type BloomFilter

type BloomFilter struct {
	// contains filtered or unexported fields
}

func BloomFilterFromBits

func BloomFilterFromBits(bits []uint64, numBits uint32, numHashes uint8) *BloomFilter

func MustNewBloomFilter

func MustNewBloomFilter(numBits uint32, numHashes uint8, opts ...BloomFilterOption) *BloomFilter

func NewBloomFilter

func NewBloomFilter(numBits uint32, numHashes uint8, opts ...BloomFilterOption) (*BloomFilter, error)

func (*BloomFilter) Add

func (bf *BloomFilter) Add(data []byte)

func (*BloomFilter) AddString

func (bf *BloomFilter) AddString(s string)

func (*BloomFilter) Bits

func (bf *BloomFilter) Bits() []uint64

func (*BloomFilter) MayContain

func (bf *BloomFilter) MayContain(data []byte) bool

func (*BloomFilter) MayContainString

func (bf *BloomFilter) MayContainString(s string) bool

func (*BloomFilter) NumBits

func (bf *BloomFilter) NumBits() uint32

func (*BloomFilter) NumHashes

func (bf *BloomFilter) NumHashes() uint8

type BloomFilterOption

type BloomFilterOption func(*BloomFilter) error

type BuilderOption

type BuilderOption func(*GINBuilder) error

func WithCodec

func WithCodec(codec DocIDCodec) BuilderOption

type CompressedTermBlock

type CompressedTermBlock struct {
	FirstTerm string
	Entries   []PrefixEntry
}

func ReadCompressedTerms

func ReadCompressedTerms(r io.Reader) ([]CompressedTermBlock, error)

type CompressionLevel

type CompressionLevel int

CompressionLevel specifies the compression level for index serialization.

const (
	CompressionNone     CompressionLevel = 0  // No compression
	CompressionFastest  CompressionLevel = 1  // zstd level 1
	CompressionBalanced CompressionLevel = 3  // zstd level 3
	CompressionBetter   CompressionLevel = 9  // zstd level 9
	CompressionBest     CompressionLevel = 15 // zstd level 15 (recommended)
	CompressionMax      CompressionLevel = 19 // zstd level 19 (slow)
)

type ConfigOption

type ConfigOption func(*GINConfig) error

func WithBoolNormalizeTransformer

func WithBoolNormalizeTransformer(path string) ConfigOption

func WithCustomDateTransformer

func WithCustomDateTransformer(path, layout string) ConfigOption

func WithDateTransformer

func WithDateTransformer(path string) ConfigOption

func WithDurationTransformer

func WithDurationTransformer(path string) ConfigOption

func WithEmailDomainTransformer

func WithEmailDomainTransformer(path string) ConfigOption

func WithFTSPaths

func WithFTSPaths(paths ...string) ConfigOption

func WithFieldTransformer

func WithFieldTransformer(path string, fn FieldTransformer) ConfigOption

func WithIPv4Transformer

func WithIPv4Transformer(path string) ConfigOption

func WithISODateTransformer

func WithISODateTransformer(path string) ConfigOption

func WithNumericBucketTransformer

func WithNumericBucketTransformer(path string, size float64) ConfigOption

func WithRegexExtractIntTransformer

func WithRegexExtractIntTransformer(path, pattern string, group int) ConfigOption

func WithRegexExtractTransformer

func WithRegexExtractTransformer(path, pattern string, group int) ConfigOption

func WithRegisteredTransformer

func WithRegisteredTransformer(path string, id TransformerID, params []byte) ConfigOption

func WithSemVerTransformer

func WithSemVerTransformer(path string) ConfigOption

func WithToLowerTransformer

func WithToLowerTransformer(path string) ConfigOption

func WithURLHostTransformer

func WithURLHostTransformer(path string) ConfigOption

type CustomDateParams

type CustomDateParams struct {
	Layout string `json:"layout"`
}

type DocID

type DocID uint64

DocID represents an external document identifier.

type DocIDCodec

type DocIDCodec interface {
	Encode(indices ...int) DocID
	Decode(docID DocID) []int
	Name() string
}

DocIDCodec encodes/decodes composite information into a single DocID.

type FieldTransformer

type FieldTransformer func(value any) (any, bool)

FieldTransformer transforms a value before indexing. It returns (transformedValue, ok); if ok is false, the original value is indexed unchanged.

func CustomDateToEpochMs

func CustomDateToEpochMs(layout string) FieldTransformer

CustomDateToEpochMs returns a transformer for custom date formats.

func NumericBucket

func NumericBucket(size float64) FieldTransformer

NumericBucket returns a transformer that buckets numeric values by size. Example: NumericBucket(100) transforms 150 -> 100, 250 -> 200.
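The bucketing is a floor to the nearest multiple of size; a one-line sketch (assuming floor semantics, consistent with the documented 150 -> 100, 250 -> 200 examples):

```go
package main

import (
	"fmt"
	"math"
)

// bucket floors a value to the nearest multiple of size at or below it.
func bucket(v, size float64) float64 {
	return math.Floor(v/size) * size
}

func main() {
	fmt.Println(bucket(150, 100), bucket(250, 100)) // 100 200
}
```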

func ReconstructTransformer

func ReconstructTransformer(id TransformerID, params json.RawMessage) (FieldTransformer, error)

func RegexExtract

func RegexExtract(pattern string, group int) FieldTransformer

RegexExtract returns a transformer that extracts a substring via regex capture group. Pattern is compiled once at config time. Group 0 = full match, group 1+ = capture groups.

func RegexExtractInt

func RegexExtractInt(pattern string, group int) FieldTransformer

RegexExtractInt extracts a substring via regex and converts it to float64.

type GINBuilder

type GINBuilder struct {
	// contains filtered or unexported fields
}

func NewBuilder

func NewBuilder(config GINConfig, numRGs int, opts ...BuilderOption) (*GINBuilder, error)

func (*GINBuilder) AddDocument

func (b *GINBuilder) AddDocument(docID DocID, jsonDoc []byte) error

func (*GINBuilder) Finalize

func (b *GINBuilder) Finalize() *GINIndex

type GINConfig

type GINConfig struct {
	CardinalityThreshold uint32
	BloomFilterSize      uint32
	BloomFilterHashes    uint8
	EnableTrigrams       bool
	TrigramMinLength     int
	HLLPrecision         uint8
	PrefixBlockSize      int
	// contains filtered or unexported fields
}

func DefaultConfig

func DefaultConfig() GINConfig

func NewConfig

func NewConfig(opts ...ConfigOption) (GINConfig, error)

type GINIndex

type GINIndex struct {
	Header              Header
	PathDirectory       []PathEntry
	GlobalBloom         *BloomFilter
	StringIndexes       map[uint16]*StringIndex
	NumericIndexes      map[uint16]*NumericIndex
	NullIndexes         map[uint16]*NullIndex
	TrigramIndexes      map[uint16]*TrigramIndex
	StringLengthIndexes map[uint16]*StringLengthIndex
	PathCardinality     map[uint16]*HyperLogLog
	DocIDMapping        []DocID
	Config              *GINConfig
}

func BuildFromParquet

func BuildFromParquet(parquetFile string, jsonColumn string, config GINConfig) (*GINIndex, error)

func BuildFromParquetReader

func BuildFromParquetReader(parquetFile string, jsonColumn string, config GINConfig, reader io.ReaderAt, size int64) (*GINIndex, error)

func Decode

func Decode(data []byte) (*GINIndex, error)

func DecodeFromMetadata

func DecodeFromMetadata(value string) (*GINIndex, error)

func LoadIndex

func LoadIndex(parquetFile string, cfg ParquetConfig) (*GINIndex, error)

func LoadIndexReader

func LoadIndexReader(parquetFile string, cfg ParquetConfig, reader io.ReaderAt, size int64) (*GINIndex, error)

func NewGINIndex

func NewGINIndex() *GINIndex

func ReadFromParquetMetadata

func ReadFromParquetMetadata(parquetFile string, cfg ParquetConfig) (*GINIndex, error)

func ReadFromParquetMetadataReader

func ReadFromParquetMetadataReader(parquetFile string, cfg ParquetConfig, reader io.ReaderAt, size int64) (*GINIndex, error)

func ReadSidecar

func ReadSidecar(parquetFile string) (*GINIndex, error)

func (*GINIndex) Evaluate

func (idx *GINIndex) Evaluate(predicates []Predicate) *RGSet

func (*GINIndex) MatchingDocIDs

func (idx *GINIndex) MatchingDocIDs(rgSet *RGSet) []DocID
type Header

type Header struct {
	Magic             [4]byte
	Version           uint16
	Flags             uint16
	NumRowGroups      uint32
	NumDocs           uint64
	NumPaths          uint32
	CardinalityThresh uint32
}

type HyperLogLog

type HyperLogLog struct {
	// contains filtered or unexported fields
}

HyperLogLog implements the HyperLogLog algorithm for cardinality estimation. It uses 2^precision registers to estimate the number of distinct elements.

func HyperLogLogFromRegisters

func HyperLogLogFromRegisters(registers []uint8, precision uint8) *HyperLogLog

func MustNewHyperLogLog

func MustNewHyperLogLog(precision uint8, opts ...HyperLogLogOption) *HyperLogLog

func NewHyperLogLog

func NewHyperLogLog(precision uint8, opts ...HyperLogLogOption) (*HyperLogLog, error)

NewHyperLogLog creates a new HyperLogLog with the given precision. Precision must be between 4 and 16. Higher precision = more accuracy but more memory. Memory usage: 2^precision bytes. Standard error: 1.04 / sqrt(m) where m = 2^precision
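The documented memory/accuracy trade-off can be computed directly (hllStats is an illustrative helper, not part of the package): precision 12 gives 2^12 = 4096 one-byte registers and a standard error of about 1.04/64 ≈ 1.6%.

```go
package main

import (
	"fmt"
	"math"
)

// hllStats returns register count, memory in bytes, and standard error
// for a given precision, per the formulas in the NewHyperLogLog docs:
// m = 2^precision registers (one byte each), error = 1.04 / sqrt(m).
func hllStats(precision uint8) (registers, memBytes int, stdErr float64) {
	m := 1 << precision
	return m, m, 1.04 / math.Sqrt(float64(m))
}

func main() {
	m, mem, se := hllStats(12)
	fmt.Printf("%d registers, %d bytes, ~%.1f%% error\n", m, mem, se*100)
}
```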

func (*HyperLogLog) Add

func (hll *HyperLogLog) Add(data []byte)

func (*HyperLogLog) AddString

func (hll *HyperLogLog) AddString(s string)

func (*HyperLogLog) Clear

func (hll *HyperLogLog) Clear()

func (*HyperLogLog) Clone

func (hll *HyperLogLog) Clone() *HyperLogLog

func (*HyperLogLog) Estimate

func (hll *HyperLogLog) Estimate() uint64

func (*HyperLogLog) Merge

func (hll *HyperLogLog) Merge(other *HyperLogLog)

func (*HyperLogLog) Precision

func (hll *HyperLogLog) Precision() uint8

func (*HyperLogLog) Registers

func (hll *HyperLogLog) Registers() []uint8

type HyperLogLogOption

type HyperLogLogOption func(*HyperLogLog) error

type IdentityCodec

type IdentityCodec struct{}

IdentityCodec treats the position as the DocID (1:1 mapping).

func NewIdentityCodec

func NewIdentityCodec() *IdentityCodec

func (*IdentityCodec) Decode

func (c *IdentityCodec) Decode(docID DocID) []int

func (*IdentityCodec) Encode

func (c *IdentityCodec) Encode(indices ...int) DocID

func (*IdentityCodec) Name

func (c *IdentityCodec) Name() string

type JSONPathError

type JSONPathError struct {
	Path    string
	Message string
}

func (*JSONPathError) Error

func (e *JSONPathError) Error() string

type NGramConfig

type NGramConfig struct {
	N       int
	Padding string
}

type NGramOption

type NGramOption func(*NGramConfig) error

func WithN

func WithN(n int) NGramOption

func WithPadding

func WithPadding(pad string) NGramOption

type NullIndex

type NullIndex struct {
	NullRGBitmap    *RGSet
	PresentRGBitmap *RGSet
}

type NumericBucketParams

type NumericBucketParams struct {
	Size float64 `json:"size"`
}

type NumericIndex

type NumericIndex struct {
	ValueType uint8
	GlobalMin float64
	GlobalMax float64
	RGStats   []RGNumericStat
}

type Operator

type Operator uint8
const (
	OpEQ Operator = iota
	OpNE
	OpGT
	OpLT
	OpGTE
	OpLTE
	OpIN
	OpNIN
	OpIsNull
	OpIsNotNull
	OpContains
	OpRegex
)

func (Operator) String

func (o Operator) String() string

type ParquetConfig

type ParquetConfig struct {
	MetadataKey string
}

func DefaultParquetConfig

func DefaultParquetConfig() ParquetConfig

type ParquetIndexWriter

type ParquetIndexWriter struct {
	// contains filtered or unexported fields
}

func NewParquetIndexWriter

func NewParquetIndexWriter(w io.Writer, schema *parquet.Schema, jsonColumn string, numRowGroups int, ginConfig GINConfig, pqConfig ParquetConfig) (*ParquetIndexWriter, error)

type PathEntry

type PathEntry struct {
	PathID        uint16
	PathName      string
	ObservedTypes uint8
	Cardinality   uint32
	Flags         uint8
}

type Predicate

type Predicate struct {
	Path     string
	Operator Operator
	Value    any
}

func Contains

func Contains(path string, pattern string) Predicate

func EQ

func EQ(path string, value any) Predicate

func GT

func GT(path string, value any) Predicate

func GTE

func GTE(path string, value any) Predicate

func IN

func IN(path string, values ...any) Predicate

func InSubnet

func InSubnet(path, cidr string) []Predicate

InSubnet creates predicates that check whether an IP field (transformed with IPv4ToInt) falls within a CIDR subnet range. Example: InSubnet("$.client_ip", "192.168.1.0/24") returns predicates covering 192.168.1.0–192.168.1.255. Panics if the CIDR is invalid; use CIDRToRange for error handling.

func IsNotNull

func IsNotNull(path string) Predicate

func IsNull

func IsNull(path string) Predicate

func LT

func LT(path string, value any) Predicate

func LTE

func LTE(path string, value any) Predicate

func NE

func NE(path string, value any) Predicate

func NIN

func NIN(path string, values ...any) Predicate

func Regex

func Regex(path string, pattern string) Predicate

func (Predicate) String

func (p Predicate) String() string

type PrefixCompressor

type PrefixCompressor struct {
	// contains filtered or unexported fields
}

PrefixCompressor implements front-coding compression for sorted string lists. Each string is stored as: shared prefix length + suffix. This works well for sorted terms that share common prefixes.
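Front coding is easiest to see on a small sorted list; a standalone sketch (the frontCode/frontDecode helpers and prefixEntry struct are illustrative, mirroring PrefixEntry's prefix-length + suffix layout but ignoring block boundaries):

```go
package main

import "fmt"

type prefixEntry struct {
	prefixLen int
	suffix    string
}

// frontCode compresses a sorted string list: each entry stores how many
// leading bytes it shares with the previous term, plus the remainder.
func frontCode(terms []string) []prefixEntry {
	out := make([]prefixEntry, 0, len(terms))
	prev := ""
	for _, t := range terms {
		n := 0
		for n < len(prev) && n < len(t) && prev[n] == t[n] {
			n++
		}
		out = append(out, prefixEntry{n, t[n:]})
		prev = t
	}
	return out
}

// frontDecode reverses frontCode by rebuilding each term from the
// previous one.
func frontDecode(entries []prefixEntry) []string {
	out := make([]string, 0, len(entries))
	prev := ""
	for _, e := range entries {
		t := prev[:e.prefixLen] + e.suffix
		out = append(out, t)
		prev = t
	}
	return out
}

func main() {
	terms := []string{"alice", "alien", "bob"}
	enc := frontCode(terms)
	fmt.Println(enc)              // [{0 alice} {3 en} {0 bob}]
	fmt.Println(frontDecode(enc)) // [alice alien bob]
}
```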

func MustNewPrefixCompressor

func MustNewPrefixCompressor(blockSize int, opts ...PrefixCompressorOption) *PrefixCompressor

func NewPrefixCompressor

func NewPrefixCompressor(blockSize int, opts ...PrefixCompressorOption) (*PrefixCompressor, error)

func (*PrefixCompressor) BlockSize

func (pc *PrefixCompressor) BlockSize() int

func (*PrefixCompressor) Compress

func (pc *PrefixCompressor) Compress(terms []string) []CompressedTermBlock

func (*PrefixCompressor) Decompress

func (pc *PrefixCompressor) Decompress(blocks []CompressedTermBlock) []string

type PrefixCompressorOption

type PrefixCompressorOption func(*PrefixCompressor) error

type PrefixEntry

type PrefixEntry struct {
	PrefixLen uint16
	Suffix    string
}

type RGNumericStat

type RGNumericStat struct {
	Min      float64
	Max      float64
	HasValue bool
}

type RGSet

type RGSet struct {
	NumRGs int
	// contains filtered or unexported fields
}

func AllRGs

func AllRGs(numRGs int) *RGSet

func MustNewRGSet

func MustNewRGSet(numRGs int, opts ...RGSetOption) *RGSet

func NewRGSet

func NewRGSet(numRGs int, opts ...RGSetOption) (*RGSet, error)

func NoRGs

func NoRGs(numRGs int) *RGSet

func RGSetFromRoaring

func RGSetFromRoaring(bitmap *roaring.Bitmap, numRGs int) *RGSet

func (*RGSet) All

func (rs *RGSet) All() *RGSet

func (*RGSet) Clear

func (rs *RGSet) Clear(rgID int)

func (*RGSet) Clone

func (rs *RGSet) Clone() *RGSet

func (*RGSet) Count

func (rs *RGSet) Count() int

func (*RGSet) Intersect

func (rs *RGSet) Intersect(other *RGSet) *RGSet

func (*RGSet) Invert

func (rs *RGSet) Invert() *RGSet

func (*RGSet) IsEmpty

func (rs *RGSet) IsEmpty() bool

func (*RGSet) IsSet

func (rs *RGSet) IsSet(rgID int) bool

func (*RGSet) Roaring

func (rs *RGSet) Roaring() *roaring.Bitmap

func (*RGSet) Set

func (rs *RGSet) Set(rgID int)

func (*RGSet) ToSlice

func (rs *RGSet) ToSlice() []int

func (*RGSet) Union

func (rs *RGSet) Union(other *RGSet) *RGSet
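The Intersect/Union/Invert methods above are how per-predicate results combine during pruning: AND intersects, OR unions, NOT inverts. A minimal sketch of the idea using a plain word-array bitset follows; it is illustrative only (the package's RGSet interoperates with roaring bitmaps, per RGSetFromRoaring), and rgSet and its methods are hypothetical names.

```go
package main

import "fmt"

// rgSet is a minimal bitset over row-group IDs, illustrating how
// per-predicate row-group sets combine: AND -> intersect, OR -> union.
type rgSet struct {
	bits   []uint64
	numRGs int
}

func newRGSet(numRGs int) *rgSet {
	return &rgSet{bits: make([]uint64, (numRGs+63)/64), numRGs: numRGs}
}

func (s *rgSet) set(rg int)        { s.bits[rg/64] |= 1 << (rg % 64) }
func (s *rgSet) isSet(rg int) bool { return s.bits[rg/64]&(1<<(rg%64)) != 0 }

// intersect keeps only row groups present in both sets (AND semantics).
func (s *rgSet) intersect(o *rgSet) *rgSet {
	out := newRGSet(s.numRGs)
	for i := range out.bits {
		out.bits[i] = s.bits[i] & o.bits[i]
	}
	return out
}

// toSlice lists the set row-group IDs in ascending order.
func (s *rgSet) toSlice() []int {
	var out []int
	for rg := 0; rg < s.numRGs; rg++ {
		if s.isSet(rg) {
			out = append(out, rg)
		}
	}
	return out
}

func main() {
	a, b := newRGSet(8), newRGSet(8)
	a.set(1)
	a.set(3) // predicate A matches row groups {1, 3}
	b.set(3)
	b.set(5) // predicate B matches row groups {3, 5}
	fmt.Println(a.intersect(b).toSlice()) // [3]
}
```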

type RGSetOption

type RGSetOption func(*RGSet) error

type RGStringLengthStat

type RGStringLengthStat struct {
	Min      uint32
	Max      uint32
	HasValue bool
}

type RegexLiteralInfo

type RegexLiteralInfo struct {
	Literals    []string // Extracted literal strings
	HasWildcard bool     // Pattern contains unbounded wildcards
	MinLength   int      // Minimum length of any literal
}

RegexLiteralInfo contains extracted information from a regex pattern.

func AnalyzeRegex

func AnalyzeRegex(pattern string) (*RegexLiteralInfo, error)

AnalyzeRegex extracts literals and metadata from a regex pattern.

type RegexParams

type RegexParams struct {
	Pattern string `json:"pattern"`
	Group   int    `json:"group"`
}

type RowGroupCodec

type RowGroupCodec struct {
	// contains filtered or unexported fields
}

RowGroupCodec encodes a file index and row group index into a DocID. Layout: DocID = fileIndex * rowGroupsPerFile + rgIndex.

func NewRowGroupCodec

func NewRowGroupCodec(rowGroupsPerFile int) *RowGroupCodec

func (*RowGroupCodec) Decode

func (c *RowGroupCodec) Decode(docID DocID) []int

func (*RowGroupCodec) Encode

func (c *RowGroupCodec) Encode(indices ...int) DocID

func (*RowGroupCodec) Name

func (c *RowGroupCodec) Name() string

func (*RowGroupCodec) RowGroupsPerFile

func (c *RowGroupCodec) RowGroupsPerFile() int
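The documented layout, DocID = fileIndex * rowGroupsPerFile + rgIndex, can be sketched directly; the encode/decode helpers and the illustrative rowGroupsPerFile value below are assumptions, not the package's API.

```go
package main

import "fmt"

// rowGroupsPerFile is an illustrative value; the real codec takes it
// as a constructor argument.
const rowGroupsPerFile = 128

// encode packs (fileIndex, rgIndex) into a single DocID,
// following: DocID = fileIndex * rowGroupsPerFile + rgIndex.
func encode(fileIndex, rgIndex int) uint64 {
	return uint64(fileIndex*rowGroupsPerFile + rgIndex)
}

// decode reverses encode via division and remainder.
func decode(docID uint64) (fileIndex, rgIndex int) {
	return int(docID) / rowGroupsPerFile, int(docID) % rowGroupsPerFile
}

func main() {
	id := encode(3, 17) // row group 17 of file 3
	f, rg := decode(id)
	fmt.Println(id, f, rg) // 401 3 17
}
```

The layout keeps DocIDs dense and sortable by file, so posting lists over them compress well.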

type S3Client

type S3Client struct {
	// contains filtered or unexported fields
}

func NewS3Client

func NewS3Client(cfg S3Config) (*S3Client, error)

func NewS3ClientFromEnv

func NewS3ClientFromEnv() (*S3Client, error)

func (*S3Client) BuildFromParquet

func (c *S3Client) BuildFromParquet(bucket, key, jsonColumn string, ginCfg GINConfig) (*GINIndex, error)

func (*S3Client) Exists

func (c *S3Client) Exists(bucket, key string) (bool, error)

func (*S3Client) GetObjectSize

func (c *S3Client) GetObjectSize(bucket, key string) (int64, error)

func (*S3Client) HasGINIndex

func (c *S3Client) HasGINIndex(bucket, key string, cfg ParquetConfig) (bool, error)

func (*S3Client) HasSidecar

func (c *S3Client) HasSidecar(bucket, parquetKey string) (bool, error)

func (*S3Client) ListGINFiles

func (c *S3Client) ListGINFiles(bucket, prefix string) ([]string, error)

func (*S3Client) ListParquetFiles

func (c *S3Client) ListParquetFiles(bucket, prefix string) ([]string, error)

func (*S3Client) LoadIndex

func (c *S3Client) LoadIndex(bucket, parquetKey string, cfg ParquetConfig) (*GINIndex, error)

func (*S3Client) OpenParquet

func (c *S3Client) OpenParquet(bucket, key string) (*parquet.File, io.ReaderAt, int64, error)

func (*S3Client) ReadFile

func (c *S3Client) ReadFile(bucket, key string) ([]byte, error)

func (*S3Client) ReadFromParquetMetadata

func (c *S3Client) ReadFromParquetMetadata(bucket, key string, cfg ParquetConfig) (*GINIndex, error)

func (*S3Client) ReadSidecar

func (c *S3Client) ReadSidecar(bucket, parquetKey string) (*GINIndex, error)

func (*S3Client) WriteFile

func (c *S3Client) WriteFile(bucket, key string, data []byte) error

func (*S3Client) WriteSidecar

func (c *S3Client) WriteSidecar(bucket, parquetKey string, idx *GINIndex) error

type S3Config

type S3Config struct {
	Endpoint  string
	Region    string
	AccessKey string
	SecretKey string
	PathStyle bool
}

func S3ConfigFromEnv

func S3ConfigFromEnv() S3Config

type SerializedConfig

type SerializedConfig struct {
	BloomFilterSize   uint32            `json:"bloom_filter_size"`
	BloomFilterHashes uint8             `json:"bloom_filter_hashes"`
	EnableTrigrams    bool              `json:"enable_trigrams"`
	TrigramMinLength  int               `json:"trigram_min_length"`
	HLLPrecision      uint8             `json:"hll_precision"`
	PrefixBlockSize   int               `json:"prefix_block_size"`
	FTSPaths          []string          `json:"fts_paths,omitempty"`
	Transformers      []TransformerSpec `json:"transformers,omitempty"`
}

type StringIndex

type StringIndex struct {
	Terms     []string
	RGBitmaps []*RGSet
}

type StringLengthIndex

type StringLengthIndex struct {
	GlobalMin uint32
	GlobalMax uint32
	RGStats   []RGStringLengthStat
}

type TransformerID

type TransformerID uint8

const (
	TransformerUnknown TransformerID = iota
	TransformerISODateToEpochMs
	TransformerDateToEpochMs
	TransformerCustomDateToEpochMs
	TransformerToLower
	TransformerIPv4ToInt
	TransformerSemVerToInt
	TransformerRegexExtract
	TransformerRegexExtractInt
	TransformerDurationToMs
	TransformerEmailDomain
	TransformerURLHost
	TransformerNumericBucket
	TransformerBoolNormalize
)

type TransformerSpec

type TransformerSpec struct {
	Path   string          `json:"path"`
	ID     TransformerID   `json:"id"`
	Name   string          `json:"name"`
	Params json.RawMessage `json:"params,omitempty"`
}

func NewTransformerSpec

func NewTransformerSpec(path string, id TransformerID, params json.RawMessage) TransformerSpec

type TrigramIndex

type TrigramIndex struct {
	Trigrams  map[string]*RGSet
	NumRGs    int
	N         int
	Padding   string
	MinLength int
}

func NewTrigramIndex

func NewTrigramIndex(numRGs int, opts ...NGramOption) (*TrigramIndex, error)

func (*TrigramIndex) Add

func (ti *TrigramIndex) Add(value string, rgID int)

func (*TrigramIndex) Search

func (ti *TrigramIndex) Search(pattern string) *RGSet

func (*TrigramIndex) TrigramCount

func (ti *TrigramIndex) TrigramCount() int
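The Add/Search pair above follows the standard n-gram pruning idea: a row group can contain a substring only if it contains every trigram of that substring. The sketch below illustrates this with plain maps and n = 3, no padding; it is not the package's implementation (which uses RGSet bitmaps and configurable Padding/MinLength), and trigrams/search are hypothetical names.

```go
package main

import "fmt"

// trigrams splits s into overlapping 3-grams (n = 3, no padding).
func trigrams(s string) []string {
	if len(s) < 3 {
		return nil
	}
	out := make([]string, 0, len(s)-2)
	for i := 0; i+3 <= len(s); i++ {
		out = append(out, s[i:i+3])
	}
	return out
}

// search intersects the posting sets of each query trigram; index maps a
// trigram to the set of row-group IDs where it was observed. The result is
// a superset of the true matches (candidates, possibly false positives).
func search(index map[string]map[int]bool, pattern string) []int {
	var cand map[int]bool
	for _, tg := range trigrams(pattern) {
		post := index[tg]
		if cand == nil {
			cand = map[int]bool{}
			for rg := range post {
				cand[rg] = true
			}
			continue
		}
		for rg := range cand {
			if !post[rg] {
				delete(cand, rg)
			}
		}
	}
	out := []int{}
	for rg := range cand {
		out = append(out, rg)
	}
	return out
}

func main() {
	index := map[string]map[int]bool{}
	add := func(value string, rg int) {
		for _, tg := range trigrams(value) {
			if index[tg] == nil {
				index[tg] = map[int]bool{}
			}
			index[tg][rg] = true
		}
	}
	add("connection timeout", 0)
	add("request ok", 1)
	fmt.Println(search(index, "timeout")) // [0]
}
```

Because trigram co-occurrence does not imply adjacency, surviving candidates still need verification against the actual data; the index only prunes row groups that provably cannot match.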

Directories

Path Synopsis
cmd
gin-index command
examples
basic command
Example: Basic GIN index usage with equality queries
full command
Example: Comprehensive GIN index usage demonstrating all index types and query operators
fulltext command
Example: Full-text search with trigram index (CONTAINS queries)
nested command
Example: Nested JSON objects and arrays
null command
Example: NULL handling queries
parquet command
range command
Example: Numeric range queries with GIN index
regex command
Example: Regex pattern matching with trigram-based candidate selection
serialize command
Example: Serializing and deserializing GIN index
transformers command
Example: Field transformers for date indexing
transformers-advanced command
Example: Advanced field transformers for IP ranges, semantic versions, emails, and regex extraction
