gin v0.1.0
Published: Apr 13, 2026 License: MIT Imports: 29 Imported by: 0


GIN Index


See CONTRIBUTING.md for local contributor workflows and SECURITY.md for disclosure guidance.

A Generalized Inverted Index (GIN) for JSON data, designed for row-group pruning in columnar storage formats like Parquet.

Features

  • String indexing - Exact match and IN queries on string fields
  • Numeric indexing - Range queries (GT, GTE, LT, LTE) with per-row-group min/max stats
  • Field transformers - Convert values (e.g., date strings to epoch) for efficient range queries
  • Trigram indexing - Full-text CONTAINS queries using n-gram matching
  • Regex support - Pattern matching with trigram-based candidate selection
  • Null tracking - IS NULL / IS NOT NULL predicates
  • Bloom filter - Fast-path rejection for non-existent values
  • HyperLogLog - Efficient cardinality estimation
  • Compression - zstd-compressed binary serialization
  • Parquet integration - Build from Parquet, embed in metadata, sidecar files, S3 support
  • CLI tool - Command-line interface for build, query, info, and extract operations

Why GIN Index?

A serverless pruning index for data lakes - the GIN index is a compact, immutable index designed to answer one question: "Which row groups might contain my data?"

The Problem

Querying large data lakes is expensive. When you search for trace_id=abc123 across millions of Parquet files, the traditional options are:

  • Full scan - Read every row group (terabytes of data, high latency, high cost)
  • Database approach - Run a PostgreSQL/Elasticsearch cluster (millisecond latency, but operational burden)
  • Parquet stats - Use built-in min/max (useless for high-cardinality strings)

The Solution
┌─────────────────────────────────────────────────────────────────────┐
│                    Serverless Row-Group Pruning                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   1. Cache index anywhere         2. Prune locally     3. Read only  │
│      (<1MB for millions of files)    (~1µs)              matching RGs│
│                                                                      │
│   ┌──────────────┐               ┌──────────────┐    ┌────────────┐ │
│   │  memcached   │  ─────────▶   │  GIN Index   │ ─▶ │ S3/GCS     │ │
│   │  nginx       │    decode     │  Evaluate()  │    │ [RG 5, 23] │ │
│   │  CDN edge    │               │              │    │            │ │
│   │  localStorage│               │ Result: 3    │    │ Skip 99%   │ │
│   └──────────────┘               │ row groups   │    │ of data    │ │
│                                  └──────────────┘    └────────────┘ │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
Key Advantages
| Challenge | PostgreSQL GIN | Elasticsearch | This GIN Index |
|---|---|---|---|
| Deployment | Database cluster | Search cluster | Just bytes - cache anywhere |
| Query latency | ~1ms | ~5-10ms | ~1µs - client-side |
| High cardinality | Index bloat | Shard overhead | Bloom filter fast-path |
| Index size | MB-GB | GB | ~30KB per 1K row groups |
| Arbitrary JSON | Schema required | Mapping required | Auto-discovered paths |
Designed For
  • Log/observability platforms - Query by trace_id, request_id, arbitrary labels
  • Vector databases - Pre-filter segments before expensive ANN search
  • Data lake query engines - Pruning index for DuckDB, Trino, Spark
  • Edge/serverless - Cache index at CDN edge, query without backend

The index decouples pruning (which row groups to read) from execution (DuckDB, Trino, Spark). Your query engine handles the actual data reading - this index just tells it where to look.

Installation

go get github.com/amikos-tech/ami-gin

Quick Start

package main

import (
    "fmt"
    gin "github.com/amikos-tech/ami-gin"
)

func main() {
    // Create builder for 3 row groups
    builder := gin.NewBuilder(gin.DefaultConfig(), 3)

    // Add documents to row groups
    builder.AddDocument(0, []byte(`{"name": "alice", "age": 30}`))
    builder.AddDocument(1, []byte(`{"name": "bob", "age": 25}`))
    builder.AddDocument(2, []byte(`{"name": "alice", "age": 40}`))

    // Build index
    idx := builder.Finalize()

    // Query: find row groups where name = "alice"
    result := idx.Evaluate([]gin.Predicate{
        gin.EQ("$.name", "alice"),
    })
    fmt.Println(result.ToSlice()) // [0, 2]
}

Known limitations

GIN Index v0.1.0 intentionally focuses on the proven single-index predicate surface described above.

  • OR/AND composites are not part of the v0.1.0 query API yet.
  • Index merge across multiple index files is intentionally deferred beyond v0.1.0.
  • Query-time transformers are not supported in v0.1.0; transformations must happen at index-build time.

Query Types

Equality
gin.EQ("$.status", "active")
gin.NE("$.status", "deleted")
gin.IN("$.status", "active", "pending", "review")
gin.NIN("$.status", "deleted", "archived")  // NOT IN
Numeric Range
gin.GT("$.price", 100.0)    // price > 100
gin.GTE("$.price", 100.0)   // price >= 100
gin.LT("$.price", 500.0)    // price < 500
gin.LTE("$.price", 500.0)   // price <= 500

// Combined range
idx.Evaluate([]gin.Predicate{
    gin.GTE("$.price", 100.0),
    gin.LTE("$.price", 500.0),
})
Date Range Queries with Field Transformers

Transform date strings into numeric epoch milliseconds for efficient range queries:

// Configure transformers for date fields
config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.created_at", gin.ISODateToEpochMs),  // RFC3339
    gin.WithFieldTransformer("$.birth_date", gin.DateToEpochMs),     // YYYY-MM-DD
    gin.WithFieldTransformer("$.custom_ts", gin.CustomDateToEpochMs("2006/01/02 15:04")),
)
builder, _ := gin.NewBuilder(config, numRGs)

// Add documents - dates are automatically transformed to epoch ms
builder.AddDocument(0, []byte(`{"created_at": "2024-01-15T10:30:00Z", "birth_date": "1990-05-20"}`))
builder.AddDocument(1, []byte(`{"created_at": "2024-06-15T14:00:00Z", "birth_date": "1985-03-10"}`))

idx := builder.Finalize()

// Query with epoch milliseconds
july2024 := float64(time.Date(2024, 7, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
result := idx.Evaluate([]gin.Predicate{gin.GT("$.created_at", july2024)})

// Date range: Q1 2024
jan := float64(time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
apr := float64(time.Date(2024, 4, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
result = idx.Evaluate([]gin.Predicate{
    gin.GTE("$.created_at", jan),
    gin.LT("$.created_at", apr),
})

Built-in date transformers:

  • ISODateToEpochMs - RFC3339/ISO8601 (2024-01-15T10:30:00Z)
  • DateToEpochMs - Date only (2024-01-15)
  • CustomDateToEpochMs(layout) - Custom Go time layout

Built-in string transformers:

  • ToLower - Lowercase normalization for case-insensitive queries
  • EmailDomain - Extract and lowercase domain from email (alice@Example.COM → example.com)
  • URLHost - Extract and lowercase host from URL (https://API.Example.COM/v1 → api.example.com)
  • RegexExtract(pattern, group) - Extract substring via regex capture group
  • RegexExtractInt(pattern, group) - Extract and convert to numeric

Built-in numeric transformers:

  • IPv4ToInt - IPv4 address to uint32 for range queries (192.168.1.1 → 3232235777)
  • SemVerToInt - Semantic version to integer (2.1.3 → 2001003)
  • DurationToMs - Go duration string to milliseconds (1h30m → 5400000)
  • NumericBucket(size) - Bucket values for histograms (150 with size 100 → 100)
  • BoolNormalize - Normalize boolean-like values ("yes", "1", "on" → true)

IP subnet helpers (for use with IPv4ToInt):

  • CIDRToRange(cidr) - Parse CIDR notation, returns (start, end float64, err)
  • InSubnet(path, cidr) - Returns []Predicate for subnet membership check

Custom transformers:

// Create your own transformer: return the indexed value and ok=true,
// or ok=false to skip the value
myTransformer := func(v any) (any, bool) {
    s, ok := v.(string)
    if !ok {
        return nil, false
    }
    // Your transformation logic, e.g. index a normalized form
    return strings.ToLower(strings.TrimSpace(s)), true
}
config, _ := gin.NewConfig(gin.WithFieldTransformer("$.my_field", myTransformer))

Example: IP subnet queries (network/security logs)

config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.client_ip", gin.IPv4ToInt),
)
// "192.168.1.1" indexed as 3232235777

// Query: Find IPs in 192.168.1.0/24 subnet using InSubnet helper
result := idx.Evaluate(gin.InSubnet("$.client_ip", "192.168.1.0/24"))

// Or use CIDRToRange for manual control
start, end, _ := gin.CIDRToRange("10.0.0.0/8")
result = idx.Evaluate([]gin.Predicate{
    gin.GTE("$.client_ip", start),
    gin.LTE("$.client_ip", end),
})

Example: Version range queries (software metadata)

config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.version", gin.SemVerToInt),
)
// "v2.1.3" indexed as 2001003

// Query: Find versions >= 2.0.0
result := idx.Evaluate([]gin.Predicate{
    gin.GTE("$.version", float64(2000000)),
})

Example: Case-insensitive email queries

config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.email", gin.ToLower),
)
// "Alice@Example.COM" indexed as "alice@example.com"
result := idx.Evaluate([]gin.Predicate{
    gin.EQ("$.email", "alice@example.com"),
})

Example: Extract error codes from log messages

config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.message", gin.RegexExtract(`ERROR\[(\w+)\]:`, 1)),
)
// "ERROR[E1234]: Connection failed" indexed as "E1234"
result := idx.Evaluate([]gin.Predicate{
    gin.EQ("$.message", "E1234"),
})
Full-Text Search (CONTAINS)
// Uses trigram index for substring matching
gin.Contains("$.description", "hello")
gin.Contains("$.title", "database")  // matches "database", "databases", etc.
Regex Matching
// Uses trigram index for regex candidate selection
gin.Regex("$.message", "ERROR|WARNING")        // Alternation
gin.Regex("$.brand", "Toyota|Tesla|Ford")      // Multiple literals
gin.Regex("$.log", "error.*timeout")           // Prefix + wildcard + suffix
gin.Regex("$.code", "[A-Z]{3}_[0-9]+")         // Pattern with literals

The Regex operator extracts literal strings from regex patterns and uses the trigram index for candidate row-group selection. This enables efficient pruning before actual regex matching.

How it works:

  1. Parse the regex pattern and extract literal substrings
  2. For alternations like (error|warn)_message, extract the combined literals: ["error_message", "warn_message"]
  3. Query the trigram index for each literal
  4. Union the results (OR semantics for alternation)
  5. Prune row groups that contain none of the literals
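The trigram lookups in steps 3-5 operate on fixed three-character windows of each literal. A stdlib-only sketch of that windowing (an illustration of the idea, not the library's code):

```go
package main

import "fmt"

// trigrams returns every 3-character window of s; each extracted literal
// is broken into these windows and looked up in the trigram index.
func trigrams(s string) []string {
	if len(s) < 3 {
		return nil // shorter than the trigram length: cannot prune
	}
	out := make([]string, 0, len(s)-2)
	for i := 0; i+3 <= len(s); i++ {
		out = append(out, s[i:i+3])
	}
	return out
}

func main() {
	fmt.Println(trigrams("error_message"))
	// [err rro ror or_ r_m _me mes ess ssa sag age]
}
```

A row group is a candidate only if its trigram posting lists contain every window of at least one literal, which is why short literals (see Limitations) cannot prune.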

Limitations:

  • Requires trigram index enabled (EnableTrigrams: true)
  • Literals shorter than trigram length (default: 3) cannot prune
  • Pure wildcard patterns (.*) return all row groups
  • This is candidate selection, not regex execution - actual matching happens at query time
Null Handling
gin.IsNull("$.optional_field")
gin.IsNotNull("$.required_field")
Nested Fields and Arrays
// Nested objects
gin.EQ("$.user.address.city", "New York")

// Array elements (wildcard)
gin.EQ("$.tags[*]", "important")
gin.IN("$.roles[*]", "admin", "editor")

JSONPath Support

Supported path syntax:

  • $ - root
  • $.field - dot notation
  • $['field'] - bracket notation
  • $.items[*] - array wildcard

Not supported (will error):

  • $.items[0] - array indices
  • $..field - recursive descent
  • $.items[0:5] - slices
  • $[?(@.price > 10)] - filters

Validate paths before use:

if err := gin.ValidateJSONPath("$.user.name"); err != nil {
    log.Fatal(err)
}

Serialization

// Encode to bytes (zstd compressed)
data, err := gin.Encode(idx)

// Save to file
os.WriteFile("index.gin", data, 0644)

// Load and decode
data, _ := os.ReadFile("index.gin")
idx, err := gin.Decode(data)

Parquet Integration

The GIN index integrates directly with Parquet files, supporting three storage strategies:

  1. Sidecar file - Index stored as data.parquet.gin alongside the Parquet file
  2. Embedded metadata - Index stored in Parquet file's key-value metadata
  3. Build-time embedding - Index built and embedded during Parquet file creation
Build Index from Parquet
// Build index from a Parquet file's JSON column
idx, err := gin.BuildFromParquet("data.parquet", "attributes", gin.DefaultConfig())
Sidecar Workflow
// Write index as sidecar file (data.parquet.gin)
err := gin.WriteSidecar("data.parquet", idx)

// Read sidecar
idx, err := gin.ReadSidecar("data.parquet")

// Check if sidecar exists
if gin.HasSidecar("data.parquet") {
    // ...
}
Embedded Metadata Workflow
cfg := gin.DefaultParquetConfig() // MetadataKey: "gin.index"

// Rebuild existing Parquet file with embedded index
err := gin.RebuildWithIndex("data.parquet", idx, cfg)

// Check if Parquet has embedded index
hasIdx, err := gin.HasGINIndex("data.parquet", cfg)

// Read embedded index
idx, err := gin.ReadFromParquetMetadata("data.parquet", cfg)
Auto-Loading (Embedded First, Then Sidecar)
// Tries embedded metadata first, falls back to sidecar
idx, err := gin.LoadIndex("data.parquet", gin.DefaultParquetConfig())
Encode for Parquet Metadata (Build-Time Embedding)

When creating a new Parquet file, you can embed the index during creation:

// Get key-value pair for Parquet metadata
key, value, err := gin.EncodeToMetadata(idx, gin.DefaultParquetConfig())
// key = "gin.index", value = base64-encoded compressed index

// Use with parquet-go writer
writer := parquet.NewGenericWriter[Record](f,
    parquet.KeyValueMetadata(key, value),
)
Batch Processing (Programmatic)

Helper functions for working with multiple files:

// Local filesystem
if gin.IsDirectory("./data") {
    // List all .parquet files in directory
    parquetFiles, err := gin.ListParquetFiles("./data")

    // List all .gin files in directory
    ginFiles, err := gin.ListGINFiles("./data")

    // Process each file
    for _, f := range parquetFiles {
        idx, _ := gin.BuildFromParquet(f, "attributes", gin.DefaultConfig())
        gin.WriteSidecar(f, idx)
    }
}

// S3
s3Client, _ := gin.NewS3ClientFromEnv()

// List all .parquet files under prefix
parquetKeys, err := s3Client.ListParquetFiles("bucket", "data/")

// List all .gin files under prefix
ginKeys, err := s3Client.ListGINFiles("bucket", "data/")
S3 Support

All operations support S3 paths via AWS SDK v2:

// Configure from environment variables:
// AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
// AWS_ENDPOINT_URL (for MinIO, LocalStack), AWS_S3_PATH_STYLE=true
s3Client, err := gin.NewS3ClientFromEnv()

// Build from S3
idx, err := s3Client.BuildFromParquet("bucket", "path/to/data.parquet", "attributes", gin.DefaultConfig())

// Write sidecar to S3
err := s3Client.WriteSidecar("bucket", "path/to/data.parquet", idx)

// Read sidecar from S3
idx, err := s3Client.ReadSidecar("bucket", "path/to/data.parquet")

// Load index (tries embedded, then sidecar)
idx, err := s3Client.LoadIndex("bucket", "path/to/data.parquet", gin.DefaultParquetConfig())
CLI Tool

A command-line tool is provided for common operations:

# Install
go install github.com/amikos-tech/ami-gin/cmd/gin-index@latest

# Build sidecar index
gin-index build -c attributes data.parquet
gin-index build -c attributes -o custom.gin data.parquet

# Build and embed into Parquet file
gin-index build -c attributes -embed data.parquet

# Query index
gin-index query data.parquet.gin '$.status = "error"'
gin-index query data.parquet.gin '$.count > 100'
gin-index query data.parquet.gin '$.name IN ("alice", "bob")'

# Show index info
gin-index info data.parquet.gin

# Extract embedded index to sidecar
gin-index extract -o data.parquet.gin data.parquet

# S3 paths (uses AWS env vars)
gin-index build -c attributes s3://bucket/data.parquet
gin-index query s3://bucket/data.parquet.gin '$.status = "ok"'

Batch Processing (Directory/S3 Prefix):

Process multiple files at once by passing a directory or S3 prefix:

# Build index for all .parquet files in a directory
gin-index build -c attributes ./data/
gin-index build -c attributes -embed ./data/

# Query all .gin files in a directory
gin-index query ./data/ '$.status = "error"'

# Show info for all .gin files
gin-index info ./data/

# S3 prefix - processes all .parquet files under the prefix
gin-index build -c attributes s3://bucket/data/
gin-index query s3://bucket/data/ '$.status = "error"'
gin-index info s3://bucket/data/

# Glob patterns work too
gin-index build -c attributes './data/*.parquet'
gin-index query './data/*.gin' '$.level = "error"'

CLI Query Syntax:

  • Equality: $.field = "value", $.field != "value"
  • Numeric: $.field > 100, $.field >= 100, $.field < 100, $.field <= 100
  • IN/NOT IN: $.field IN ("a", "b"), $.field NOT IN (1, 2, 3)
  • Null: $.field IS NULL, $.field IS NOT NULL
  • Contains: $.field CONTAINS "substring"
  • Regex: $.field REGEX "pattern" (e.g., $.brand REGEX "Toyota|Tesla")

Configuration

config := gin.GINConfig{
    CardinalityThreshold: 10000,  // Use bloom-only for high-cardinality paths
    BloomFilterSize:      65536,
    BloomFilterHashes:    5,
    EnableTrigrams:       true,   // Enable CONTAINS queries
    TrigramMinLength:     3,
    HLLPrecision:         12,     // HyperLogLog precision (4-16)
    PrefixBlockSize:      16,
}

builder := gin.NewBuilder(config, numRowGroups)

Examples

See the examples directory:

go run ./examples/basic/main.go        # Equality queries
go run ./examples/range/main.go        # Numeric ranges
go run ./examples/transformers/main.go # Date field transformers
go run ./examples/transformers-advanced/main.go # IP, SemVer, email, regex transformers
go run ./examples/fulltext/main.go     # CONTAINS queries
go run ./examples/regex/main.go        # Regex pattern matching
go run ./examples/null/main.go         # NULL handling
go run ./examples/nested/main.go       # Nested JSON and arrays
go run ./examples/serialize/main.go    # Persistence
go run ./examples/full/main.go         # All types and operators
go run ./examples/parquet/main.go      # Parquet integration (sidecar, embedded, queries)

Benchmarks

Run benchmarks with:

go test -bench=. -benchmem -benchtime=1s
Performance Summary (Apple M3 Max)
| Operation | Latency | Notes |
|---|---|---|
| EQ query | ~1µs | Bloom filter + sorted term lookup |
| Range query (GT/LT) | 4-24µs | Min/max stats scan |
| IN query (10 values) | ~8µs | Union of EQ results |
| CONTAINS query | 2-17µs | Trigram intersection |
| IsNull/IsNotNull | 2-4µs | Bitmap lookup |
| Bloom lookup | ~100ns | Fast-path rejection |
| AddDocument | ~43µs | JSON parsing + indexing |
| Encode (1K RGs) | ~4ms | zstd compression |
| Decode (1K RGs) | ~2ms | zstd decompression |
Index Size
| Row Groups | Encoded Size | Per RG |
|---|---|---|
| 100 | 6.7 KB | 67 bytes |
| 500 | 18 KB | 36 bytes |
| 1,000 | 30 KB | 30 bytes |
| 2,000 | 51 KB | 26 bytes |
Scaling Characteristics

Query time scales well with index size:

  • 10 RGs: ~340ns
  • 100 RGs: ~530ns
  • 1,000 RGs: ~680ns
  • 5,000 RGs: ~800ns

Build time is linear with document count and complexity:

  • 100 docs (7 fields): ~1.2ms
  • 1,000 docs: ~6.7ms
  • High cardinality (10K unique values): ~3.3ms per 1K docs
Component Performance
| Component | Operation | Latency |
|---|---|---|
| Bloom Filter | Add | ~100ns |
| Bloom Filter | Lookup | ~100ns |
| RGSet (10K) | Intersect | ~12µs |
| RGSet (10K) | Union | ~10µs |
| Trigram | Add (50 chars) | ~16µs |
| Trigram | Search | 1-6µs |
| HyperLogLog | Add | ~70ns |
| HyperLogLog | Estimate | 7-410µs (precision dependent) |
| Prefix | Compress 1K terms | ~60µs |
Real-World Scenario: 1M Docs / 50K Row Groups

Simulating a log storage scenario:

  • 1M documents across 50K row groups (~20 docs/RG)
  • 10 labels: 2 integers (status_code, duration_ms) + 8 strings
  • Mix of cardinalities: trace_id (high), service (low), host (medium)
  • Trigrams disabled (no FTS)
| Metric | Value |
|---|---|
| Index Size | 289 KB (0.28 MB) |
| Bytes per RG | 5.9 bytes |
| Bytes per doc | 0.3 bytes |
| Build time | 464ms |
| Encode | 41ms |
| Decode | 41ms |

Query Performance:

| Query | Latency | Notes |
|---|---|---|
| trace_id=X (high cardinality) | 950ns | Bloom filter fast-path |
| service=api (low cardinality) | 6.5µs | ~10K RGs match |
| trace_id=X AND level=error | 6µs | High card + low card |
| duration_ms > 5000 | 244µs | Range scan over 50K RGs |
| service=api AND env=prod AND status>=400 | 285µs | 3 predicates combined |

Key takeaway: High-cardinality lookups (trace ID, request ID) are sub-microsecond. The entire index for 1M documents fits in 289 KB - easily cacheable in memory, localStorage, or CDN edge.

Benchmark Categories

The benchmark suite (benchmark_test.go) covers:

  1. Builder Performance - Document ingestion, batch loading, finalization
  2. Query Performance - All operators, parallel queries, multiple predicates
  3. Serialization - Encode/decode latency, compression ratios
  4. Components - Bloom filter, RGSet, trigram, HLL, prefix compression
  5. Scaling - Row group count, document size, cardinality, nesting depth

Comparison with Other Solutions

vs PostgreSQL GIN/JSONB
| Aspect | This GIN Index | PostgreSQL GIN |
|---|---|---|
| Query Latency | ~1µs (EQ) | ~0.7-1.2ms per predicate |
| Deployment | Embedded bytes, no server | Requires PostgreSQL server |
| Cacheability | Cache anywhere (nginx, memcached, CDN) | Tied to database buffer cache |
| Index Size | 26-67 bytes/row-group | Larger, includes posting lists |
| Range Queries | Native min/max stats | Poor (GIN doesn't support ranges) |
| Full-Text | Trigram-based CONTAINS | Full-featured tsvector/tsquery |
| ACID | No (read-only after build) | Full transaction support |

PostgreSQL GIN uses Bitmap Index Scans which cost ~0.7-1.2ms each when cached. This index achieves ~1µs queries by being purpose-built for row-group pruning with simpler data structures.

vs Parquet Built-in Statistics
| Aspect | This GIN Index | Parquet Min/Max Stats | Parquet Bloom Filters |
|---|---|---|---|
| String Equality | Exact term → RG bitmap | Only min/max (poor for strings) | Yes, but per-column only |
| CONTAINS/FTS | Trigram index | No | No |
| Multi-path Queries | Single index file | Scattered in column chunks | Scattered in column chunks |
| Cardinality | HyperLogLog estimates | No | No |
| Null Tracking | Explicit null/present bitmaps | Null count only | No |
| Index Location | Footer or sidecar file | Column chunk metadata | Column chunk metadata |

Parquet's built-in bloom filters are effective for single-column equality but require reading multiple column chunks for multi-field queries. This GIN index consolidates all paths into one structure.

vs Delta Lake / Iceberg Data Skipping
| Aspect | This GIN Index | Delta Lake | Apache Iceberg |
|---|---|---|---|
| Statistics | Per-path term index + min/max | First 32 columns min/max | Partition-level + column stats |
| High Cardinality | Bloom filter fallback | Requires Z-ordering | Requires sorting |
| JSON Support | Native path extraction | Requires schema | Requires schema |
| Query Planning | Client-side, cacheable | Spark/engine dependent | Engine dependent |
| Deployment | Standalone bytes | Delta transaction log | Metadata tables |

Delta Lake's data skipping relies on Z-ordering for effectiveness with high-cardinality columns. This GIN index handles high cardinality natively via bloom filters.

vs Elasticsearch
| Aspect | This GIN Index | Elasticsearch |
|---|---|---|
| Query Latency | ~1µs | ~1-10ms (network + processing) |
| Deployment | Embedded, no server | Cluster required |
| Index Size | ~30KB for 1K row-groups | GB+ for equivalent data |
| Use Case | Row-group pruning | Full search engine |
| Updates | Rebuild required | Near real-time |

Elasticsearch provides millisecond-level latency for searches but requires cluster infrastructure. This index is designed for embedding in data lake metadata.

Key Advantage: High-Cardinality Arbitrary JSON

This index was born from log storage needs - indexing arbitrary attributes/labels where:

  • High cardinality is the norm - trace IDs, request IDs, user IDs, session tokens
  • Schema is unknown - arbitrary key-value labels attached at runtime
  • Queries are selective - "find logs where trace_id=abc123" should be instant

Traditional solutions struggle here:

| Challenge | PostgreSQL GIN | Parquet Stats | This GIN Index |
|---|---|---|---|
| trace_id (millions unique) | Index bloat, slow writes | Min/max useless | Bloom filter fast-path |
| user.email (arbitrary path) | Requires schema | Column must exist | Auto-discovered paths |
| labels["env"] (dynamic keys) | JSONB @> operator (~1ms) | Not supported | Native path indexing (~1µs) |
| Mixed types per path | Type coercion issues | Single type per column | Tracks observed types |

Log/observability example:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "trace_id": "abc123def456",
  "user": {"id": "user_98765", "email": "alice@example.com"},
  "labels": {"env": "prod", "region": "us-east-1", "version": "2.1.0"},
  "message": "Connection timeout to downstream service"
}

Query: trace_id=abc123def456 AND labels.env=prod AND level=error

  • Bloom filter rejects non-matching row groups instantly
  • High-cardinality trace_id doesn't degrade performance
  • Arbitrary labels.* paths indexed automatically

Vector database metadata filtering:

Vector databases need efficient pre-filtering before similarity search. Without good metadata indexing, the options are:

  1. Scan all vectors, then filter (slow)
  2. Filter first with a poor index (still slow)
  3. Build separate metadata infrastructure (complex)

{
  "id": "doc_12345",
  "embedding": [0.1, 0.2, ...],
  "metadata": {
    "source": "arxiv",
    "year": 2024,
    "authors": ["Alice", "Bob"],
    "topics": ["machine-learning", "transformers"],
    "cited_by": 142,
    "full_text": "We present a novel approach to..."
  }
}

Query: Find similar vectors WHERE metadata.source=arxiv AND metadata.year>=2023 AND metadata.topics[*]=transformers

┌─────────────────────────────────────────────────────────────────┐
│                   Vector DB Hybrid Search                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. Metadata Filter (GIN Index)         2. Vector Search       │
│   ┌─────────────────────────────┐       ┌──────────────────┐   │
│   │ source=arxiv                │       │                  │   │
│   │ year>=2023        ──────────┼──────▶│  ANN Search      │   │
│   │ topics[*]=transformers      │       │  (only on        │   │
│   │                             │       │   segments 2,5)  │   │
│   │ Result: segments [2, 5]     │       │                  │   │
│   └─────────────────────────────┘       └──────────────────┘   │
│           ~1µs                              search scope        │
│                                             reduced 80%         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

This enables:

  • Pre-filtering - Prune segments before expensive ANN search
  • Flexible schemas - Each document can have different metadata fields
  • High cardinality - Filter by doc_id, user_id, session_id
  • Range + equality - year>=2023 AND source=arxiv
  • Array membership - topics[*]=machine-learning
  • Full-text on metadata - CONTAINS(full_text, "transformer")
Cacheable Pruning Index

The second differentiator is deployment flexibility:

┌─────────────────────────────────────────────────────────────────┐
│                     Data Lake Query Flow                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Client                                                        │
│     │                                                           │
│     │ 1. Fetch GIN index (cached)                              │
│     ▼                                                           │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │  nginx/memcached/CDN/local cache                        │  │
│   │  ┌─────────────┐                                        │  │
│   │  │ index.gin   │  ← 30KB, serves in <1ms               │  │
│   │  │ (cached)    │                                        │  │
│   │  └─────────────┘                                        │  │
│   └─────────────────────────────────────────────────────────┘  │
│     │                                                           │
│     │ 2. Evaluate predicates locally (~1µs)                    │
│     │    Result: [RG 5, RG 23, RG 47]                          │
│     │                                                           │
│     │ 3. Read only matching row groups from object storage     │
│     ▼                                                           │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │  S3 / GCS / Azure Blob                                  │  │
│   │  ┌─────────────────────────────────────────────────┐    │  │
│   │  │ data.parquet                                    │    │  │
│   │  │  RG 0 ──────── skipped                         │    │  │
│   │  │  RG 5 ◀─────── read                            │    │  │
│   │  │  RG 10 ─────── skipped                         │    │  │
│   │  │  RG 23 ◀─────── read                           │    │  │
│   │  │  RG 47 ◀─────── read                           │    │  │
│   │  │  ...                                            │    │  │
│   │  └─────────────────────────────────────────────────┘    │  │
│   └─────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Benefits:

  • No database server - Index is just bytes, evaluate anywhere
  • Cache at edge - nginx, memcached, CDN, browser localStorage
  • Cross-language - Binary format works in any language
  • Offline capable - Cache index locally for disconnected queries
  • Cost efficient - Avoid scanning TB of Parquet data

This architecture is ideal for:

  • Log/observability platforms - Index arbitrary labels, query by trace ID
  • Vector databases - Pre-filter segments before ANN search
  • Serverless query engines - No database to manage
  • Browser-based data explorers - Cache index in localStorage
  • Edge computing / IoT analytics - Offline-capable querying
  • Cost-sensitive data lake queries - Minimize S3/GCS egress

Architecture

Index Structure
┌─────────────────────────────────────────────────────────────────────────────┐
│                              GINIndex                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Header                                                               │   │
│  │  • Version, NumRowGroups, NumDocs, NumPaths                         │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Path Directory                                                       │   │
│  │  pathID → { PathName, ObservedTypes, Cardinality, Flags }           │   │
│  │                                                                      │   │
│  │  Example:                                                            │   │
│  │    0 → { "$.name",   String,  150,   0x00 }                         │   │
│  │    1 → { "$.age",    Int,     80,    0x00 }                         │   │
│  │    2 → { "$.tags[*]", String, 50000, FlagBloomOnly }                │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Global Bloom Filter                                                  │   │
│  │  Fast rejection for path=value pairs                                 │   │
│  │  Contains: "$.name=alice", "$.name=bob", "$.age=30", ...            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌───────────────────────┐  ┌───────────────────────┐                      │
│  │ StringIndex           │  │ NumericIndex          │                      │
│  │ (per pathID)          │  │ (per pathID)          │                      │
│  │                       │  │                       │                      │
│  │ pathID: 0 ($.name)    │  │ pathID: 1 ($.age)     │                      │
│  │ ┌─────────┬────────┐  │  │ ┌─────┬──────┬──────┐│                      │
│  │ │  Term   │ RGSet  │  │  │ │ RG  │ Min  │ Max  ││                      │
│  │ ├─────────┼────────┤  │  │ ├─────┼──────┼──────┤│                      │
│  │ │ "alice" │ {0,2}  │  │  │ │  0  │  25  │  35  ││                      │
│  │ │ "bob"   │ {1}    │  │  │ │  1  │  20  │  45  ││                      │
│  │ │ "carol" │ {2,3}  │  │  │ │  2  │  30  │  30  ││                      │
│  │ └─────────┴────────┘  │  │ └─────┴──────┴──────┘│                      │
│  └───────────────────────┘  └───────────────────────┘                      │
│                                                                             │
│  ┌───────────────────────┐  ┌───────────────────────┐                      │
│  │ NullIndex             │  │ TrigramIndex          │                      │
│  │ (per pathID)          │  │ (per pathID)          │                      │
│  │                       │  │                       │                      │
│  │ pathID: 1 ($.age)     │  │ pathID: 3 ($.desc)    │                      │
│  │ ┌──────────┬────────┐ │  │ ┌─────────┬────────┐ │                      │
│  │ │ NullRGs  │ {4,7}  │ │  │ │ Trigram │ RGSet  │ │                      │
│  │ │ Present  │ {0-9}  │ │  │ ├─────────┼────────┤ │                      │
│  │ └──────────┴────────┘ │  │ │ "hel"   │ {0,2}  │ │                      │
│  └───────────────────────┘  │ │ "ell"   │ {0,2}  │ │                      │
│                             │ │ "llo"   │ {0,2,5}│ │                      │
│  ┌────────────────────────┐ │ │ "wor"   │ {1,3}  │ │                      │
│  │ DocID Mapping          │ │ └─────────┴────────┘ │                      │
│  │ (optional)             │ └───────────────────────┘                      │
│  │                        │                                                 │
│  │ pos → DocID            │  ┌───────────────────────┐                     │
│  │  0  → 1000             │  │ PathCardinality (HLL) │                     │
│  │  1  → 1001             │  │ (per pathID)          │                     │
│  │  2  → 1020             │  │                       │                     │
│  │  3  → 1021             │  │ Estimates unique vals │                     │
│  └────────────────────────┘  └───────────────────────┘                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Note: All RGSet bitmaps use Roaring Bitmaps for efficient compression
Data Flow
                                BUILD PHASE
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│   JSON Documents                                                            │
│        │                                                                    │
│        ▼                                                                    │
│   ┌─────────┐     ┌──────────────────────────────────────────────────┐     │
│   │ DocID 0 │────▶│                  GINBuilder                       │     │
│   │ RG: 0   │     │                                                   │     │
│   └─────────┘     │  AddDocument(docID, json)                        │     │
│   ┌─────────┐     │       │                                          │     │
│   │ DocID 1 │────▶│       ▼                                          │     │
│   │ RG: 0   │     │  ┌─────────────┐                                 │     │
│   └─────────┘     │  │ Walk JSON   │                                 │     │
│   ┌─────────┐     │  │ Extract:    │                                 │     │
│   │ DocID 2 │────▶│  │  • paths    │                                 │     │
│   │ RG: 1   │     │  │  • values   │                                 │     │
│   └─────────┘     │  │  • types    │                                 │     │
│        ⋮          │  └──────┬──────┘                                 │     │
│                   │         │                                         │     │
│                   │         ▼                                         │     │
│                   │  ┌─────────────────────────────────────────┐     │     │
│                   │  │ Update per-path structures:             │     │     │
│                   │  │  • stringTerms[term] → RGSet.Set(pos)   │     │     │
│                   │  │  • numericStats[pos].Min/Max            │     │     │
│                   │  │  • nullRGs.Set(pos) if null             │     │     │
│                   │  │  • trigrams.Add(term, pos)              │     │     │
│                   │  │  • bloom.Add(path=value)                │     │     │
│                   │  │  • hll.Add(value)                       │     │     │
│                   │  └─────────────────────────────────────────┘     │     │
│                   │                                                   │     │
│                   └───────────────────────┬──────────────────────────┘     │
│                                           │                                 │
│                                           ▼                                 │
│                                    Finalize()                               │
│                                           │                                 │
│                                           ▼                                 │
│                                    ┌─────────────┐                          │
│                                    │  GINIndex   │                          │
│                                    └─────────────┘                          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

                                QUERY PHASE
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│   Predicates: [EQ("$.name", "alice"), GT("$.age", 25)]                     │
│        │                                                                    │
│        ▼                                                                    │
│   ┌─────────────────────────────────────────────────────────────────┐      │
│   │                     idx.Evaluate(predicates)                     │      │
│   └─────────────────────────────────────────────────────────────────┘      │
│        │                                                                    │
│        ├──────────────────────┬──────────────────────┐                     │
│        ▼                      ▼                      ▼                     │
│   ┌──────────┐          ┌──────────┐          ┌──────────┐                 │
│   │ Predicate│          │ Predicate│          │   ...    │                 │
│   │    1     │          │    2     │          │          │                 │
│   └────┬─────┘          └────┬─────┘          └──────────┘                 │
│        │                     │                                              │
│        ▼                     ▼                                              │
│   ┌──────────────┐     ┌──────────────┐                                    │
│   │ Bloom Check  │     │ Bloom Check  │   ◀── Fast rejection path          │
│   │ path=value?  │     │   (skip for  │                                    │
│   └──────┬───────┘     │    ranges)   │                                    │
│          │             └──────┬───────┘                                    │
│          ▼                    ▼                                             │
│   ┌──────────────┐     ┌──────────────┐                                    │
│   │ StringIndex  │     │ NumericIndex │                                    │
│   │ lookup term  │     │ scan min/max │                                    │
│   │ → RGSet      │     │ → RGSet      │                                    │
│   └──────┬───────┘     └──────┬───────┘                                    │
│          │                    │                                             │
│          │    RGSet{0,2}      │    RGSet{0,1,2}                            │
│          │                    │                                             │
│          └─────────┬──────────┘                                             │
│                    │                                                        │
│                    ▼                                                        │
│             ┌─────────────┐                                                 │
│             │  Intersect  │                                                 │
│             │  (AND all)  │                                                 │
│             └──────┬──────┘                                                 │
│                    │                                                        │
│                    ▼                                                        │
│             ┌─────────────┐                                                 │
│             │ RGSet{0,2}  │  ◀── Matching row groups                       │
│             └──────┬──────┘                                                 │
│                    │                                                        │
│                    ▼                                                        │
│             ┌─────────────┐                                                 │
│             │ ToSlice()   │  → [0, 2]                                      │
│             │     or      │                                                 │
│             │ MatchingDocIDs() → [DocID...]                                │
│             └─────────────┘                                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
DocID Codec (Optional)

For composite document identifiers (e.g., file + row group):

// Encode file index and row group into a single DocID
codec := gin.NewRowGroupCodec(20)  // 20 RGs per file
builder, err := gin.NewBuilder(config, totalRGs, gin.WithCodec(codec))

docID := codec.Encode(fileIndex, rgIndex)  // e.g., file=3, rg=15 → DocID=75
builder.AddDocument(docID, jsonDoc)

// Query and decode results
result := idx.Evaluate(predicates)
for _, docID := range idx.MatchingDocIDs(result) {
    decoded := codec.Decode(docID)  // [3, 15]
    fileIdx, rgIdx := decoded[0], decoded[1]
}
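The codec's documented layout (DocID = fileIndex * rowGroupsPerFile + rgIndex) is plain integer arithmetic. A standalone sketch of the round trip (the helper names encodeDocID/decodeDocID are illustrative, not part of the package API):

```go
package main

import "fmt"

// encodeDocID packs a file index and row-group index into one DocID,
// using the layout documented for RowGroupCodec:
// DocID = fileIndex*rowGroupsPerFile + rgIndex.
func encodeDocID(fileIndex, rgIndex, rowGroupsPerFile int) uint64 {
	return uint64(fileIndex*rowGroupsPerFile + rgIndex)
}

// decodeDocID reverses the packing via division and modulus.
func decodeDocID(docID uint64, rowGroupsPerFile int) (fileIndex, rgIndex int) {
	return int(docID) / rowGroupsPerFile, int(docID) % rowGroupsPerFile
}

func main() {
	docID := encodeDocID(3, 15, 20) // file=3, rg=15, 20 RGs per file
	fileIdx, rgIdx := decodeDocID(docID, 20)
	fmt.Println(docID, fileIdx, rgIdx) // 75 3 15
}
```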

How It Works

The GIN index maintains several data structures:

  1. Path Directory - Maps JSON paths to their metadata (types, cardinality)
  2. String Index - For each path, maps terms to row-group bitmaps (Roaring)
  3. Numeric Index - Per-row-group min/max values for range pruning
  4. Null Index - Bitmaps tracking which row groups have null/present values
  5. Trigram Index - Maps 3-character sequences to row-group bitmaps
  6. Global Bloom Filter - Fast rejection of non-existent path=value pairs
  7. DocID Mapping - Optional external DocID to internal position mapping

Query evaluation intersects the matching row-group bitmaps from each predicate.
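The intersection step reduces to an AND over the per-predicate bitmaps. A toy version using plain Go sets (the real index uses Roaring bitmaps), reproducing the RGSet{0,2} ∩ RGSet{0,1,2} example from the query-phase diagram:

```go
package main

import (
	"fmt"
	"sort"
)

// intersect returns the row groups present in every input set,
// mirroring the AND-all step of predicate evaluation.
func intersect(sets ...map[int]bool) []int {
	if len(sets) == 0 {
		return nil
	}
	var out []int
	for rg := range sets[0] {
		inAll := true
		for _, s := range sets[1:] {
			if !s[rg] {
				inAll = false
				break
			}
		}
		if inAll {
			out = append(out, rg)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	eqName := map[int]bool{0: true, 2: true}         // EQ("$.name", "alice") → {0,2}
	gtAge := map[int]bool{0: true, 1: true, 2: true} // GT("$.age", 25) → {0,1,2}
	fmt.Println(intersect(eqName, gtAge))            // [0 2]
}
```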

Design Notes

Why numRGs Must Be Known Upfront

The NewBuilder(config, numRGs) requires the total number of row groups at construction time. This is intentional:

  1. Complement operations require universe size - Operations like AllRGs() and Invert() need to know the total number of row groups to compute complements. When a query cannot prune (e.g., unknown path, graceful degradation), the index returns "all row groups" - which requires knowing what "all" means.

  2. Parquet metadata provides this - The index is designed for Parquet row-group pruning. In this context, the number of row groups is always available from Parquet file metadata before indexing begins.

  3. Bounds checking - The builder validates that document positions don't exceed the declared row group count, catching configuration errors early.
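The role of the universe size is easiest to see in a standalone complement: without numRGs there is no way to invert a match set or to answer "all row groups" when a predicate cannot prune (the invert helper below is illustrative, not the package's RGSet.Invert):

```go
package main

import "fmt"

// invert returns the complement of set within a universe of numRGs
// row groups — the operation that requires knowing numRGs upfront.
func invert(set map[int]bool, numRGs int) []int {
	var out []int
	for rg := 0; rg < numRGs; rg++ {
		if !set[rg] {
			out = append(out, rg)
		}
	}
	return out
}

func main() {
	matched := map[int]bool{1: true, 3: true}
	fmt.Println(invert(matched, 5)) // [0 2 4]
}
```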

License

MIT

Documentation

Index

Constants

View Source
const (
	MagicBytes = "GIN\x01"
	Version    = uint16(3)
)
View Source
const (
	TypeString uint8 = 1 << iota
	TypeInt
	TypeFloat
	TypeBool
	TypeNull
)
View Source
const (
	FlagBloomOnly    uint8 = 1 << iota
	FlagTrigramIndex       // path has trigram index for CONTAINS queries
)
View Source
const DefaultMetadataKey = "gin.index"
View Source
const (
	FlagHasDocIDMap uint16 = 1 << iota
)

Variables

View Source
var (
	// ErrVersionMismatch is returned by Decode when the binary format version
	// does not match the expected version (Version constant).
	ErrVersionMismatch = errors.New("version mismatch")

	// ErrInvalidFormat is returned by Decode when the binary data is structurally
	// invalid: unrecognized magic bytes, oversized allocations, or corrupt fields.
	ErrInvalidFormat = errors.New("invalid format")
)

Functions

func BoolNormalize

func BoolNormalize(v any) (any, bool)

BoolNormalize normalizes various boolean-like values to actual booleans. Handles: bool, "true"/"false"/"yes"/"no"/"1"/"0"/"on"/"off", float64 (0 = false).

func CIDRToRange

func CIDRToRange(cidr string) (start, end float64, err error)

CIDRToRange parses a CIDR notation string and returns the start and end IP addresses as float64 values suitable for use with GTE/LTE predicates on IPv4ToInt-transformed fields. Example: CIDRToRange("192.168.1.0/24") returns (3232235776, 3232236031, nil)
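A standalone sketch of the same computation using the standard library (the cidrToRange helper is illustrative; the package returns float64 rather than uint32):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// cidrToRange computes the first and last IPv4 address of a CIDR block
// as uint32 values — the range that feeds GTE/LTE predicates.
func cidrToRange(cidr string) (start, end uint32, err error) {
	_, ipNet, err := net.ParseCIDR(cidr)
	if err != nil {
		return 0, 0, err
	}
	start = binary.BigEndian.Uint32(ipNet.IP.To4())
	mask := binary.BigEndian.Uint32(ipNet.Mask)
	end = start | ^mask // set all host bits
	return start, end, nil
}

func main() {
	start, end, _ := cidrToRange("192.168.1.0/24")
	fmt.Println(start, end) // 3232235776 3232236031
}
```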

func CompressionStats

func CompressionStats(terms []string) (compressed, original int, ratio float64)

CompressionStats returns the compressed size, original size, and compression ratio for a set of terms.

func DateToEpochMs

func DateToEpochMs(v any) (any, bool)

DateToEpochMs parses "2006-01-02" format to Unix milliseconds (midnight UTC).
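The transformation can be sketched with the standard time package (illustrative helper, not the package's implementation; DateToEpochMs itself takes and returns `any`):

```go
package main

import (
	"fmt"
	"time"
)

// dateToEpochMs parses a "2006-01-02" date as midnight UTC and
// returns Unix milliseconds.
func dateToEpochMs(s string) (int64, bool) {
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		return 0, false
	}
	return t.UnixMilli(), true
}

func main() {
	ms, ok := dateToEpochMs("2024-01-01")
	fmt.Println(ms, ok) // 1704067200000 true
}
```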

func DurationToMs

func DurationToMs(v any) (any, bool)

DurationToMs parses Go duration strings (e.g., "1h30m", "500ms") to milliseconds.

func EmailDomain

func EmailDomain(v any) (any, bool)

EmailDomain extracts and lowercases the domain from an email address.

func Encode

func Encode(idx *GINIndex) ([]byte, error)

Encode serializes the index using zstd-15 compression (recommended default).

func EncodeToMetadata

func EncodeToMetadata(idx *GINIndex, cfg ParquetConfig) (key string, value string, err error)

func EncodeWithLevel

func EncodeWithLevel(idx *GINIndex, level CompressionLevel) ([]byte, error)

EncodeWithLevel serializes the index with the specified compression level. Use CompressionNone (0) for no compression, or 1-19 for zstd compression levels.

func ExtractLiterals

func ExtractLiterals(pattern string) ([]string, error)

ExtractLiterals extracts literal strings from a regex pattern that can be used for trigram-based candidate selection. Returns a slice of literal alternatives. For patterns like "foo|bar", returns ["foo", "bar"]. For patterns like "(error|warn)_msg", returns ["error_msg", "warn_msg"] (combined).

func ExtractTrigrams

func ExtractTrigrams(s string) []string

func GenerateBigrams

func GenerateBigrams(text string) []string

func GenerateNGrams

func GenerateNGrams(text string, n int, opts ...NGramOption) ([]string, error)

func GenerateTrigrams

func GenerateTrigrams(text string) []string
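A minimal sliding-window sketch of trigram generation, matching the "hel"/"ell"/"llo" entries in the index-structure diagram (byte-based and unpadded; the library's behavior for padding and Unicode may differ — see GenerateNGrams and WithPadding):

```go
package main

import "fmt"

// trigrams slides a 3-byte window over s — the 3-character sequences
// that the trigram index maps to row-group bitmaps.
func trigrams(s string) []string {
	if len(s) < 3 {
		return nil
	}
	out := make([]string, 0, len(s)-2)
	for i := 0; i+3 <= len(s); i++ {
		out = append(out, s[i:i+3])
	}
	return out
}

func main() {
	fmt.Println(trigrams("hello")) // [hel ell llo]
}
```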

func HasGINIndex

func HasGINIndex(parquetFile string, cfg ParquetConfig) (bool, error)

func HasGINIndexReader

func HasGINIndexReader(parquetFile string, cfg ParquetConfig, reader io.ReaderAt, size int64) (bool, error)

func HasSidecar

func HasSidecar(parquetFile string) bool

func IPv4ToInt

func IPv4ToInt(v any) (any, bool)

IPv4ToInt converts IPv4 address strings to uint32 (as float64) for range queries.

func ISODateToEpochMs

func ISODateToEpochMs(v any) (any, bool)

ISODateToEpochMs parses RFC3339/ISO8601 strings to Unix milliseconds.

func IsDirectory

func IsDirectory(path string) bool

func IsS3Path

func IsS3Path(path string) bool

func IsValidJSONPath

func IsValidJSONPath(path string) bool

func ListGINFiles

func ListGINFiles(dir string) ([]string, error)

func ListParquetFiles

func ListParquetFiles(dir string) ([]string, error)

func MustValidateJSONPath

func MustValidateJSONPath(path string) string

func NormalizePath

func NormalizePath(path string) string

NormalizePath converts a JSONPath to a canonical dot-notation form.

func ParseJSONPath

func ParseJSONPath(path string) (jp.Expr, error)

ParseJSONPath parses and validates a JSONPath, returning the parsed expression.

func ParseS3Path

func ParseS3Path(path string) (bucket, key string, err error)

func RebuildWithIndex

func RebuildWithIndex(parquetFile string, idx *GINIndex, cfg ParquetConfig) error

func SemVerToInt

func SemVerToInt(v any) (any, bool)

SemVerToInt encodes semantic versions as integers: major*1000000 + minor*1000 + patch. Supports formats: "1.2.3", "v1.2.3", "1.2", "v1.2", "1.2.3-beta" (pre-release suffix ignored).
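A standalone sketch of the documented encoding (the semVerToInt helper is illustrative; the package function takes and returns `any`):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// semVerToInt encodes "major.minor.patch" as
// major*1000000 + minor*1000 + patch. A "v" prefix and any
// pre-release/build suffix are stripped; a missing patch is 0.
func semVerToInt(v string) (int, bool) {
	v = strings.TrimPrefix(v, "v")
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i] // drop pre-release/build suffix
	}
	parts := strings.Split(v, ".")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, false
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return 0, false
		}
		nums[i] = n
	}
	return nums[0]*1000000 + nums[1]*1000 + nums[2], true
}

func main() {
	n, _ := semVerToInt("v1.2.3-beta")
	fmt.Println(n) // 1002003
}
```

Because the encoding is order-preserving, version comparisons become ordinary numeric range predicates.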

func SidecarPath

func SidecarPath(parquetFile string) string

func ToLower

func ToLower(v any) (any, bool)

ToLower normalizes strings to lowercase for case-insensitive queries.

func URLHost

func URLHost(v any) (any, bool)

URLHost extracts and lowercases the host from a URL.

func ValidateJSONPath

func ValidateJSONPath(path string) error

ValidateJSONPath validates a JSONPath expression and ensures it only uses features supported by the GIN index (dot notation, wildcards). Unsupported: array indices ([0]), filters ([?()]), recursive descent (..), and scripts.

func WriteCompressedTerms

func WriteCompressedTerms(w io.Writer, blocks []CompressedTermBlock) error

func WriteSidecar

func WriteSidecar(parquetFile string, idx *GINIndex) error

Types

type BloomFilter

type BloomFilter struct {
	// contains filtered or unexported fields
}

func BloomFilterFromBits

func BloomFilterFromBits(bits []uint64, numBits uint32, numHashes uint8) *BloomFilter

func MustNewBloomFilter

func MustNewBloomFilter(numBits uint32, numHashes uint8, opts ...BloomFilterOption) *BloomFilter

func NewBloomFilter

func NewBloomFilter(numBits uint32, numHashes uint8, opts ...BloomFilterOption) (*BloomFilter, error)

func (*BloomFilter) Add

func (bf *BloomFilter) Add(data []byte)

func (*BloomFilter) AddString

func (bf *BloomFilter) AddString(s string)

func (*BloomFilter) Bits

func (bf *BloomFilter) Bits() []uint64

func (*BloomFilter) MayContain

func (bf *BloomFilter) MayContain(data []byte) bool

func (*BloomFilter) MayContainString

func (bf *BloomFilter) MayContainString(s string) bool

func (*BloomFilter) NumBits

func (bf *BloomFilter) NumBits() uint32

func (*BloomFilter) NumHashes

func (bf *BloomFilter) NumHashes() uint8

type BloomFilterOption

type BloomFilterOption func(*BloomFilter) error

type BuilderOption

type BuilderOption func(*GINBuilder) error

func WithCodec

func WithCodec(codec DocIDCodec) BuilderOption

type CompressedTermBlock

type CompressedTermBlock struct {
	FirstTerm string
	Entries   []PrefixEntry
}

func ReadCompressedTerms

func ReadCompressedTerms(r io.Reader) ([]CompressedTermBlock, error)

type CompressionLevel

type CompressionLevel int

CompressionLevel specifies the compression level for index serialization.

const (
	CompressionNone     CompressionLevel = 0  // No compression
	CompressionFastest  CompressionLevel = 1  // zstd level 1
	CompressionBalanced CompressionLevel = 3  // zstd level 3
	CompressionBetter   CompressionLevel = 9  // zstd level 9
	CompressionBest     CompressionLevel = 15 // zstd level 15 (recommended)
	CompressionMax      CompressionLevel = 19 // zstd level 19 (slow)
)

type ConfigOption

type ConfigOption func(*GINConfig) error

func WithBoolNormalizeTransformer

func WithBoolNormalizeTransformer(path string) ConfigOption

func WithCustomDateTransformer

func WithCustomDateTransformer(path, layout string) ConfigOption

func WithDateTransformer

func WithDateTransformer(path string) ConfigOption

func WithDurationTransformer

func WithDurationTransformer(path string) ConfigOption

func WithEmailDomainTransformer

func WithEmailDomainTransformer(path string) ConfigOption

func WithFTSPaths

func WithFTSPaths(paths ...string) ConfigOption

func WithFieldTransformer

func WithFieldTransformer(path string, fn FieldTransformer) ConfigOption

func WithIPv4Transformer

func WithIPv4Transformer(path string) ConfigOption

func WithISODateTransformer

func WithISODateTransformer(path string) ConfigOption

func WithNumericBucketTransformer

func WithNumericBucketTransformer(path string, size float64) ConfigOption

func WithRegexExtractIntTransformer

func WithRegexExtractIntTransformer(path, pattern string, group int) ConfigOption

func WithRegexExtractTransformer

func WithRegexExtractTransformer(path, pattern string, group int) ConfigOption

func WithRegisteredTransformer

func WithRegisteredTransformer(path string, id TransformerID, params []byte) ConfigOption

func WithSemVerTransformer

func WithSemVerTransformer(path string) ConfigOption

func WithToLowerTransformer

func WithToLowerTransformer(path string) ConfigOption

func WithURLHostTransformer

func WithURLHostTransformer(path string) ConfigOption

type CustomDateParams

type CustomDateParams struct {
	Layout string `json:"layout"`
}

type DocID

type DocID uint64

DocID represents an external document identifier.

type DocIDCodec

type DocIDCodec interface {
	Encode(indices ...int) DocID
	Decode(docID DocID) []int
	Name() string
}

DocIDCodec encodes/decodes composite information into a single DocID.

type FieldTransformer

type FieldTransformer func(value any) (any, bool)

FieldTransformer transforms a value before indexing. It returns (transformedValue, ok); if ok is false, the original value is indexed unchanged.

func CustomDateToEpochMs

func CustomDateToEpochMs(layout string) FieldTransformer

CustomDateToEpochMs returns a transformer for custom date formats.

func NumericBucket

func NumericBucket(size float64) FieldTransformer

NumericBucket returns a transformer that buckets numeric values by size. Example: NumericBucket(100) transforms 150 -> 100, 250 -> 200.
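The bucketing is a floor to the nearest multiple of size; a one-line sketch (assuming floor semantics, consistent with the documented 150 -> 100, 250 -> 200 examples):

```go
package main

import (
	"fmt"
	"math"
)

// bucket floors a value to the nearest multiple of size at or below it.
func bucket(v, size float64) float64 {
	return math.Floor(v/size) * size
}

func main() {
	fmt.Println(bucket(150, 100), bucket(250, 100)) // 100 200
}
```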

func ReconstructTransformer

func ReconstructTransformer(id TransformerID, params json.RawMessage) (FieldTransformer, error)

func RegexExtract

func RegexExtract(pattern string, group int) FieldTransformer

RegexExtract returns a transformer that extracts a substring via regex capture group. Pattern is compiled once at config time. Group 0 = full match, group 1+ = capture groups.

func RegexExtractInt

func RegexExtractInt(pattern string, group int) FieldTransformer

RegexExtractInt extracts a substring via regex and converts it to float64.

type GINBuilder

type GINBuilder struct {
	// contains filtered or unexported fields
}

func NewBuilder

func NewBuilder(config GINConfig, numRGs int, opts ...BuilderOption) (*GINBuilder, error)

func (*GINBuilder) AddDocument

func (b *GINBuilder) AddDocument(docID DocID, jsonDoc []byte) error

func (*GINBuilder) Finalize

func (b *GINBuilder) Finalize() *GINIndex

type GINConfig

type GINConfig struct {
	CardinalityThreshold uint32
	BloomFilterSize      uint32
	BloomFilterHashes    uint8
	EnableTrigrams       bool
	TrigramMinLength     int
	HLLPrecision         uint8
	PrefixBlockSize      int
	// contains filtered or unexported fields
}

func DefaultConfig

func DefaultConfig() GINConfig

func NewConfig

func NewConfig(opts ...ConfigOption) (GINConfig, error)

type GINIndex

type GINIndex struct {
	Header              Header
	PathDirectory       []PathEntry
	GlobalBloom         *BloomFilter
	StringIndexes       map[uint16]*StringIndex
	NumericIndexes      map[uint16]*NumericIndex
	NullIndexes         map[uint16]*NullIndex
	TrigramIndexes      map[uint16]*TrigramIndex
	StringLengthIndexes map[uint16]*StringLengthIndex
	PathCardinality     map[uint16]*HyperLogLog
	DocIDMapping        []DocID
	Config              *GINConfig
}

func BuildFromParquet

func BuildFromParquet(parquetFile string, jsonColumn string, config GINConfig) (*GINIndex, error)

func BuildFromParquetReader

func BuildFromParquetReader(parquetFile string, jsonColumn string, config GINConfig, reader io.ReaderAt, size int64) (*GINIndex, error)

func Decode

func Decode(data []byte) (*GINIndex, error)

func DecodeFromMetadata

func DecodeFromMetadata(value string) (*GINIndex, error)

func LoadIndex

func LoadIndex(parquetFile string, cfg ParquetConfig) (*GINIndex, error)

func LoadIndexReader

func LoadIndexReader(parquetFile string, cfg ParquetConfig, reader io.ReaderAt, size int64) (*GINIndex, error)

func NewGINIndex

func NewGINIndex() *GINIndex

func ReadFromParquetMetadata

func ReadFromParquetMetadata(parquetFile string, cfg ParquetConfig) (*GINIndex, error)

func ReadFromParquetMetadataReader

func ReadFromParquetMetadataReader(parquetFile string, cfg ParquetConfig, reader io.ReaderAt, size int64) (*GINIndex, error)

func ReadSidecar

func ReadSidecar(parquetFile string) (*GINIndex, error)

func (*GINIndex) Evaluate

func (idx *GINIndex) Evaluate(predicates []Predicate) *RGSet

func (*GINIndex) MatchingDocIDs

func (idx *GINIndex) MatchingDocIDs(rgSet *RGSet) []DocID
type Header

type Header struct {
	Magic             [4]byte
	Version           uint16
	Flags             uint16
	NumRowGroups      uint32
	NumDocs           uint64
	NumPaths          uint32
	CardinalityThresh uint32
}

type HyperLogLog

type HyperLogLog struct {
	// contains filtered or unexported fields
}

HyperLogLog implements the HyperLogLog algorithm for cardinality estimation. It uses 2^precision registers to estimate the number of distinct elements.

func HyperLogLogFromRegisters

func HyperLogLogFromRegisters(registers []uint8, precision uint8) *HyperLogLog

func MustNewHyperLogLog

func MustNewHyperLogLog(precision uint8, opts ...HyperLogLogOption) *HyperLogLog

func NewHyperLogLog

func NewHyperLogLog(precision uint8, opts ...HyperLogLogOption) (*HyperLogLog, error)

NewHyperLogLog creates a new HyperLogLog with the given precision. Precision must be between 4 and 16. Higher precision = more accuracy but more memory. Memory usage: 2^precision bytes. Standard error: 1.04 / sqrt(m) where m = 2^precision
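The documented memory/accuracy trade-off can be computed directly (hllStats is an illustrative helper, not part of the package): precision 12 gives 2^12 = 4096 one-byte registers and a standard error of about 1.04/64 ≈ 1.6%.

```go
package main

import (
	"fmt"
	"math"
)

// hllStats returns register count, memory in bytes, and standard error
// for a given precision, per the formulas in the NewHyperLogLog docs:
// m = 2^precision registers (one byte each), error = 1.04 / sqrt(m).
func hllStats(precision uint8) (registers, memBytes int, stdErr float64) {
	m := 1 << precision
	return m, m, 1.04 / math.Sqrt(float64(m))
}

func main() {
	m, mem, se := hllStats(12)
	fmt.Printf("%d registers, %d bytes, ~%.1f%% error\n", m, mem, se*100)
}
```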

func (*HyperLogLog) Add

func (hll *HyperLogLog) Add(data []byte)

func (*HyperLogLog) AddString

func (hll *HyperLogLog) AddString(s string)

func (*HyperLogLog) Clear

func (hll *HyperLogLog) Clear()

func (*HyperLogLog) Clone

func (hll *HyperLogLog) Clone() *HyperLogLog

func (*HyperLogLog) Estimate

func (hll *HyperLogLog) Estimate() uint64

func (*HyperLogLog) Merge

func (hll *HyperLogLog) Merge(other *HyperLogLog)

func (*HyperLogLog) Precision

func (hll *HyperLogLog) Precision() uint8

func (*HyperLogLog) Registers

func (hll *HyperLogLog) Registers() []uint8

type HyperLogLogOption

type HyperLogLogOption func(*HyperLogLog) error

type IdentityCodec

type IdentityCodec struct{}

IdentityCodec treats the position as the DocID (1:1 mapping).

func NewIdentityCodec

func NewIdentityCodec() *IdentityCodec

func (*IdentityCodec) Decode

func (c *IdentityCodec) Decode(docID DocID) []int

func (*IdentityCodec) Encode

func (c *IdentityCodec) Encode(indices ...int) DocID

func (*IdentityCodec) Name

func (c *IdentityCodec) Name() string

type JSONPathError

type JSONPathError struct {
	Path    string
	Message string
}

func (*JSONPathError) Error

func (e *JSONPathError) Error() string

type NGramConfig

type NGramConfig struct {
	N       int
	Padding string
}

type NGramOption

type NGramOption func(*NGramConfig) error

func WithN

func WithN(n int) NGramOption

func WithPadding

func WithPadding(pad string) NGramOption

type NullIndex

type NullIndex struct {
	NullRGBitmap    *RGSet
	PresentRGBitmap *RGSet
}

type NumericBucketParams

type NumericBucketParams struct {
	Size float64 `json:"size"`
}

type NumericIndex

type NumericIndex struct {
	ValueType uint8
	GlobalMin float64
	GlobalMax float64
	RGStats   []RGNumericStat
}

type Operator

type Operator uint8
const (
	OpEQ Operator = iota
	OpNE
	OpGT
	OpLT
	OpGTE
	OpLTE
	OpIN
	OpNIN
	OpIsNull
	OpIsNotNull
	OpContains
	OpRegex
)

func (Operator) String

func (o Operator) String() string

type ParquetConfig

type ParquetConfig struct {
	MetadataKey string
}

func DefaultParquetConfig

func DefaultParquetConfig() ParquetConfig

type ParquetIndexWriter

type ParquetIndexWriter struct {
	// contains filtered or unexported fields
}

func NewParquetIndexWriter

func NewParquetIndexWriter(w io.Writer, schema *parquet.Schema, jsonColumn string, numRowGroups int, ginConfig GINConfig, pqConfig ParquetConfig) (*ParquetIndexWriter, error)

type PathEntry

type PathEntry struct {
	PathID        uint16
	PathName      string
	ObservedTypes uint8
	Cardinality   uint32
	Flags         uint8
}

type Predicate

type Predicate struct {
	Path     string
	Operator Operator
	Value    any
}

func Contains

func Contains(path string, pattern string) Predicate

func EQ

func EQ(path string, value any) Predicate

func GT

func GT(path string, value any) Predicate

func GTE

func GTE(path string, value any) Predicate

func IN

func IN(path string, values ...any) Predicate

func InSubnet

func InSubnet(path, cidr string) []Predicate

InSubnet creates predicates that check whether an IP field (transformed with IPv4ToInt) falls within a CIDR subnet range. Example: InSubnet("$.client_ip", "192.168.1.0/24") returns predicates covering 192.168.1.0–192.168.1.255. Panics if the CIDR is invalid; use CIDRToRange for error handling.

func IsNotNull

func IsNotNull(path string) Predicate

func IsNull

func IsNull(path string) Predicate

func LT

func LT(path string, value any) Predicate

func LTE

func LTE(path string, value any) Predicate

func NE

func NE(path string, value any) Predicate

func NIN

func NIN(path string, values ...any) Predicate

func Regex

func Regex(path string, pattern string) Predicate

func (Predicate) String

func (p Predicate) String() string

type PrefixCompressor

type PrefixCompressor struct {
	// contains filtered or unexported fields
}

PrefixCompressor implements front-coding compression for sorted string lists. Each string is stored as: shared prefix length + suffix. This works well for sorted terms that share common prefixes.
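Front coding is easiest to see on a small sorted list; a standalone sketch (the frontCode/frontDecode helpers and prefixEntry struct are illustrative, mirroring PrefixEntry's prefix-length + suffix layout but ignoring block boundaries):

```go
package main

import "fmt"

type prefixEntry struct {
	prefixLen int
	suffix    string
}

// frontCode compresses a sorted string list: each entry stores how many
// leading bytes it shares with the previous term, plus the remainder.
func frontCode(terms []string) []prefixEntry {
	out := make([]prefixEntry, 0, len(terms))
	prev := ""
	for _, t := range terms {
		n := 0
		for n < len(prev) && n < len(t) && prev[n] == t[n] {
			n++
		}
		out = append(out, prefixEntry{n, t[n:]})
		prev = t
	}
	return out
}

// frontDecode reverses frontCode by rebuilding each term from the
// previous one.
func frontDecode(entries []prefixEntry) []string {
	out := make([]string, 0, len(entries))
	prev := ""
	for _, e := range entries {
		t := prev[:e.prefixLen] + e.suffix
		out = append(out, t)
		prev = t
	}
	return out
}

func main() {
	terms := []string{"alice", "alien", "bob"}
	enc := frontCode(terms)
	fmt.Println(enc)              // [{0 alice} {3 en} {0 bob}]
	fmt.Println(frontDecode(enc)) // [alice alien bob]
}
```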

func MustNewPrefixCompressor

func MustNewPrefixCompressor(blockSize int, opts ...PrefixCompressorOption) *PrefixCompressor

func NewPrefixCompressor

func NewPrefixCompressor(blockSize int, opts ...PrefixCompressorOption) (*PrefixCompressor, error)

func (*PrefixCompressor) BlockSize

func (pc *PrefixCompressor) BlockSize() int

func (*PrefixCompressor) Compress

func (pc *PrefixCompressor) Compress(terms []string) []CompressedTermBlock

func (*PrefixCompressor) Decompress

func (pc *PrefixCompressor) Decompress(blocks []CompressedTermBlock) []string

type PrefixCompressorOption

type PrefixCompressorOption func(*PrefixCompressor) error

type PrefixEntry

type PrefixEntry struct {
	PrefixLen uint16
	Suffix    string
}

type RGNumericStat

type RGNumericStat struct {
	Min      float64
	Max      float64
	HasValue bool
}

type RGSet

type RGSet struct {
	NumRGs int
	// contains filtered or unexported fields
}

func AllRGs

func AllRGs(numRGs int) *RGSet

func MustNewRGSet

func MustNewRGSet(numRGs int, opts ...RGSetOption) *RGSet

func NewRGSet

func NewRGSet(numRGs int, opts ...RGSetOption) (*RGSet, error)

func NoRGs

func NoRGs(numRGs int) *RGSet

func RGSetFromRoaring

func RGSetFromRoaring(bitmap *roaring.Bitmap, numRGs int) *RGSet

func (*RGSet) All

func (rs *RGSet) All() *RGSet

func (*RGSet) Clear

func (rs *RGSet) Clear(rgID int)

func (*RGSet) Clone

func (rs *RGSet) Clone() *RGSet

func (*RGSet) Count

func (rs *RGSet) Count() int

func (*RGSet) Intersect

func (rs *RGSet) Intersect(other *RGSet) *RGSet

func (*RGSet) Invert

func (rs *RGSet) Invert() *RGSet

func (*RGSet) IsEmpty

func (rs *RGSet) IsEmpty() bool

func (*RGSet) IsSet

func (rs *RGSet) IsSet(rgID int) bool

func (*RGSet) Roaring

func (rs *RGSet) Roaring() *roaring.Bitmap

func (*RGSet) Set

func (rs *RGSet) Set(rgID int)

func (*RGSet) ToSlice

func (rs *RGSet) ToSlice() []int

func (*RGSet) Union

func (rs *RGSet) Union(other *RGSet) *RGSet
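The Intersect/Union/Invert methods above are how per-predicate results combine during pruning: AND intersects, OR unions, NOT inverts. A minimal sketch of the idea using a plain word-array bitset follows; it is illustrative only (the package's RGSet interoperates with roaring bitmaps, per RGSetFromRoaring), and rgSet and its methods are hypothetical names.

```go
package main

import "fmt"

// rgSet is a minimal bitset over row-group IDs, illustrating how
// per-predicate row-group sets combine: AND -> intersect, OR -> union.
type rgSet struct {
	bits   []uint64
	numRGs int
}

func newRGSet(numRGs int) *rgSet {
	return &rgSet{bits: make([]uint64, (numRGs+63)/64), numRGs: numRGs}
}

func (s *rgSet) set(rg int)        { s.bits[rg/64] |= 1 << (rg % 64) }
func (s *rgSet) isSet(rg int) bool { return s.bits[rg/64]&(1<<(rg%64)) != 0 }

// intersect keeps only row groups present in both sets (AND semantics).
func (s *rgSet) intersect(o *rgSet) *rgSet {
	out := newRGSet(s.numRGs)
	for i := range out.bits {
		out.bits[i] = s.bits[i] & o.bits[i]
	}
	return out
}

// toSlice lists the set row-group IDs in ascending order.
func (s *rgSet) toSlice() []int {
	var out []int
	for rg := 0; rg < s.numRGs; rg++ {
		if s.isSet(rg) {
			out = append(out, rg)
		}
	}
	return out
}

func main() {
	a, b := newRGSet(8), newRGSet(8)
	a.set(1)
	a.set(3) // predicate A matches row groups {1, 3}
	b.set(3)
	b.set(5) // predicate B matches row groups {3, 5}
	fmt.Println(a.intersect(b).toSlice()) // [3]
}
```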

type RGSetOption

type RGSetOption func(*RGSet) error

type RGStringLengthStat

type RGStringLengthStat struct {
	Min      uint32
	Max      uint32
	HasValue bool
}

type RegexLiteralInfo

type RegexLiteralInfo struct {
	Literals    []string // Extracted literal strings
	HasWildcard bool     // Pattern contains unbounded wildcards
	MinLength   int      // Minimum length of any literal
}

RegexLiteralInfo contains extracted information from a regex pattern.

func AnalyzeRegex

func AnalyzeRegex(pattern string) (*RegexLiteralInfo, error)

AnalyzeRegex extracts literals and metadata from a regex pattern.

type RegexParams

type RegexParams struct {
	Pattern string `json:"pattern"`
	Group   int    `json:"group"`
}

type RowGroupCodec

type RowGroupCodec struct {
	// contains filtered or unexported fields
}

RowGroupCodec encodes a file index and row group index into a DocID. Layout: DocID = fileIndex * rowGroupsPerFile + rgIndex.

func NewRowGroupCodec

func NewRowGroupCodec(rowGroupsPerFile int) *RowGroupCodec

func (*RowGroupCodec) Decode

func (c *RowGroupCodec) Decode(docID DocID) []int

func (*RowGroupCodec) Encode

func (c *RowGroupCodec) Encode(indices ...int) DocID

func (*RowGroupCodec) Name

func (c *RowGroupCodec) Name() string

func (*RowGroupCodec) RowGroupsPerFile

func (c *RowGroupCodec) RowGroupsPerFile() int
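The documented layout, DocID = fileIndex * rowGroupsPerFile + rgIndex, can be sketched directly; the encode/decode helpers and the illustrative rowGroupsPerFile value below are assumptions, not the package's API.

```go
package main

import "fmt"

// rowGroupsPerFile is an illustrative value; the real codec takes it
// as a constructor argument.
const rowGroupsPerFile = 128

// encode packs (fileIndex, rgIndex) into a single DocID,
// following: DocID = fileIndex * rowGroupsPerFile + rgIndex.
func encode(fileIndex, rgIndex int) uint64 {
	return uint64(fileIndex*rowGroupsPerFile + rgIndex)
}

// decode reverses encode via division and remainder.
func decode(docID uint64) (fileIndex, rgIndex int) {
	return int(docID) / rowGroupsPerFile, int(docID) % rowGroupsPerFile
}

func main() {
	id := encode(3, 17) // row group 17 of file 3
	f, rg := decode(id)
	fmt.Println(id, f, rg) // 401 3 17
}
```

The layout keeps DocIDs dense and sortable by file, so posting lists over them compress well.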

type S3Client

type S3Client struct {
	// contains filtered or unexported fields
}

func NewS3Client

func NewS3Client(cfg S3Config) (*S3Client, error)

func NewS3ClientFromEnv

func NewS3ClientFromEnv() (*S3Client, error)

func (*S3Client) BuildFromParquet

func (c *S3Client) BuildFromParquet(bucket, key, jsonColumn string, ginCfg GINConfig) (*GINIndex, error)

func (*S3Client) Exists

func (c *S3Client) Exists(bucket, key string) (bool, error)

func (*S3Client) GetObjectSize

func (c *S3Client) GetObjectSize(bucket, key string) (int64, error)

func (*S3Client) HasGINIndex

func (c *S3Client) HasGINIndex(bucket, key string, cfg ParquetConfig) (bool, error)

func (*S3Client) HasSidecar

func (c *S3Client) HasSidecar(bucket, parquetKey string) (bool, error)

func (*S3Client) ListGINFiles

func (c *S3Client) ListGINFiles(bucket, prefix string) ([]string, error)

func (*S3Client) ListParquetFiles

func (c *S3Client) ListParquetFiles(bucket, prefix string) ([]string, error)

func (*S3Client) LoadIndex

func (c *S3Client) LoadIndex(bucket, parquetKey string, cfg ParquetConfig) (*GINIndex, error)

func (*S3Client) OpenParquet

func (c *S3Client) OpenParquet(bucket, key string) (*parquet.File, io.ReaderAt, int64, error)

func (*S3Client) ReadFile

func (c *S3Client) ReadFile(bucket, key string) ([]byte, error)

func (*S3Client) ReadFromParquetMetadata

func (c *S3Client) ReadFromParquetMetadata(bucket, key string, cfg ParquetConfig) (*GINIndex, error)

func (*S3Client) ReadSidecar

func (c *S3Client) ReadSidecar(bucket, parquetKey string) (*GINIndex, error)

func (*S3Client) WriteFile

func (c *S3Client) WriteFile(bucket, key string, data []byte) error

func (*S3Client) WriteSidecar

func (c *S3Client) WriteSidecar(bucket, parquetKey string, idx *GINIndex) error

type S3Config

type S3Config struct {
	Endpoint  string
	Region    string
	AccessKey string
	SecretKey string
	PathStyle bool
}

func S3ConfigFromEnv

func S3ConfigFromEnv() S3Config

type SerializedConfig

type SerializedConfig struct {
	BloomFilterSize   uint32            `json:"bloom_filter_size"`
	BloomFilterHashes uint8             `json:"bloom_filter_hashes"`
	EnableTrigrams    bool              `json:"enable_trigrams"`
	TrigramMinLength  int               `json:"trigram_min_length"`
	HLLPrecision      uint8             `json:"hll_precision"`
	PrefixBlockSize   int               `json:"prefix_block_size"`
	FTSPaths          []string          `json:"fts_paths,omitempty"`
	Transformers      []TransformerSpec `json:"transformers,omitempty"`
}

type StringIndex

type StringIndex struct {
	Terms     []string
	RGBitmaps []*RGSet
}

type StringLengthIndex

type StringLengthIndex struct {
	GlobalMin uint32
	GlobalMax uint32
	RGStats   []RGStringLengthStat
}

type TransformerID

type TransformerID uint8

const (
	TransformerUnknown TransformerID = iota
	TransformerISODateToEpochMs
	TransformerDateToEpochMs
	TransformerCustomDateToEpochMs
	TransformerToLower
	TransformerIPv4ToInt
	TransformerSemVerToInt
	TransformerRegexExtract
	TransformerRegexExtractInt
	TransformerDurationToMs
	TransformerEmailDomain
	TransformerURLHost
	TransformerNumericBucket
	TransformerBoolNormalize
)

type TransformerSpec

type TransformerSpec struct {
	Path   string          `json:"path"`
	ID     TransformerID   `json:"id"`
	Name   string          `json:"name"`
	Params json.RawMessage `json:"params,omitempty"`
}

func NewTransformerSpec

func NewTransformerSpec(path string, id TransformerID, params json.RawMessage) TransformerSpec

type TrigramIndex

type TrigramIndex struct {
	Trigrams  map[string]*RGSet
	NumRGs    int
	N         int
	Padding   string
	MinLength int
}

func NewTrigramIndex

func NewTrigramIndex(numRGs int, opts ...NGramOption) (*TrigramIndex, error)

func (*TrigramIndex) Add

func (ti *TrigramIndex) Add(value string, rgID int)

func (*TrigramIndex) Search

func (ti *TrigramIndex) Search(pattern string) *RGSet

func (*TrigramIndex) TrigramCount

func (ti *TrigramIndex) TrigramCount() int
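The Add/Search pair above follows the standard n-gram pruning idea: a row group can contain a substring only if it contains every trigram of that substring. The sketch below illustrates this with plain maps and n = 3, no padding; it is not the package's implementation (which uses RGSet bitmaps and configurable Padding/MinLength), and trigrams/search are hypothetical names.

```go
package main

import "fmt"

// trigrams splits s into overlapping 3-grams (n = 3, no padding).
func trigrams(s string) []string {
	if len(s) < 3 {
		return nil
	}
	out := make([]string, 0, len(s)-2)
	for i := 0; i+3 <= len(s); i++ {
		out = append(out, s[i:i+3])
	}
	return out
}

// search intersects the posting sets of each query trigram; index maps a
// trigram to the set of row-group IDs where it was observed. The result is
// a superset of the true matches (candidates, possibly false positives).
func search(index map[string]map[int]bool, pattern string) []int {
	var cand map[int]bool
	for _, tg := range trigrams(pattern) {
		post := index[tg]
		if cand == nil {
			cand = map[int]bool{}
			for rg := range post {
				cand[rg] = true
			}
			continue
		}
		for rg := range cand {
			if !post[rg] {
				delete(cand, rg)
			}
		}
	}
	out := []int{}
	for rg := range cand {
		out = append(out, rg)
	}
	return out
}

func main() {
	index := map[string]map[int]bool{}
	add := func(value string, rg int) {
		for _, tg := range trigrams(value) {
			if index[tg] == nil {
				index[tg] = map[int]bool{}
			}
			index[tg][rg] = true
		}
	}
	add("connection timeout", 0)
	add("request ok", 1)
	fmt.Println(search(index, "timeout")) // [0]
}
```

Because trigram co-occurrence does not imply adjacency, surviving candidates still need verification against the actual data; the index only prunes row groups that provably cannot match.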

Directories

Path Synopsis
cmd
gin-index command
examples
basic command
Example: Basic GIN index usage with equality queries
full command
Example: Comprehensive GIN index usage demonstrating all index types and query operators
fulltext command
Example: Full-text search with trigram index (CONTAINS queries)
nested command
Example: Nested JSON objects and arrays
null command
Example: NULL handling queries
parquet command
range command
Example: Numeric range queries with GIN index
regex command
Example: Regex pattern matching with trigram-based candidate selection
serialize command
Example: Serializing and deserializing GIN index
transformers command
Example: Field transformers for date indexing
transformers-advanced command
Example: Advanced field transformers for IP ranges, semantic versions, emails, and regex extraction
