gin

v0.2.0
Published: Apr 17, 2026 License: MIT Imports: 31 Imported by: 0

README

GIN Index


See CONTRIBUTING.md for local contributor workflows and SECURITY.md for disclosure guidance.

A Generalized Inverted Index (GIN) for JSON data, designed for row-group pruning in columnar storage formats like Parquet.

Features

  • String indexing - Exact match and IN queries on string fields
  • Numeric indexing - Range queries (GT, GTE, LT, LTE) with per-row-group min/max stats
  • Field transformers - Convert values (e.g., date strings to epoch) for efficient range queries
  • Trigram indexing - Full-text CONTAINS queries using n-gram matching
  • Regex support - Pattern matching with trigram-based candidate selection
  • Null tracking - IS NULL / IS NOT NULL predicates
  • Bloom filter - Fast-path rejection for non-existent values
  • HyperLogLog - Efficient cardinality estimation
  • Compression - zstd-compressed binary serialization
  • Parquet integration - Build from Parquet, embed in metadata, sidecar files, S3 support
  • CLI tool - Command-line interface for build, query, info, and extract operations

Why GIN Index?

A serverless pruning index for data lakes - the GIN index is a compact, immutable index designed to answer one question: "Which row groups might contain my data?"

The Problem

Querying large data lakes is expensive. When you search for trace_id=abc123 across millions of Parquet files, the traditional options all fall short:

  • Full scan - Read every row group (~TB of data, high latency, high cost)
  • Database approach - Run PostgreSQL/Elasticsearch cluster (~ms latency, operational burden)
  • Parquet stats - Use built-in min/max (useless for high-cardinality strings)

The Solution

┌─────────────────────────────────────────────────────────────────────┐
│                    Serverless Row-Group Pruning                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   1. Cache index anywhere         2. Prune locally     3. Read only  │
│      (<1MB for millions of files)    (~1µs)              matching RGs│
│                                                                      │
│   ┌──────────────┐               ┌──────────────┐    ┌────────────┐ │
│   │  memcached   │  ─────────▶   │  GIN Index   │ ─▶ │ S3/GCS     │ │
│   │  nginx       │    decode     │  Evaluate()  │    │ [RG 5, 23] │ │
│   │  CDN edge    │               │              │    │            │ │
│   │  localStorage│               │ Result: 3    │    │ Skip 99%   │ │
│   └──────────────┘               │ row groups   │    │ of data    │ │
│                                  └──────────────┘    └────────────┘ │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Key Advantages

| Challenge | PostgreSQL GIN | Elasticsearch | This GIN Index |
|---|---|---|---|
| Deployment | Database cluster | Search cluster | Just bytes - cache anywhere |
| Query latency | ~1ms | ~5-10ms | ~1µs - client-side |
| High cardinality | Index bloat | Shard overhead | adaptive-hybrid hot-value pruning |
| Index size | MB-GB | GB | ~30KB per 1K row groups |
| Arbitrary JSON | Schema required | Mapping required | Auto-discovered paths |

Designed For

  • Log/observability platforms - Query by trace_id, request_id, arbitrary labels
  • Vector databases - Pre-filter segments before expensive ANN search
  • Data lake query engines - Pruning index for DuckDB, Trino, Spark
  • Edge/serverless - Cache index at CDN edge, query without backend

The index decouples pruning (which row groups to read) from execution (DuckDB, Trino, Spark). Your query engine handles the actual data reading - this index just tells it where to look.

Installation

go get github.com/amikos-tech/ami-gin

Quick Start

package main

import (
    "fmt"
    gin "github.com/amikos-tech/ami-gin"
)

func main() {
    // Create builder for 3 row groups
    builder, err := gin.NewBuilder(gin.DefaultConfig(), 3)
    if err != nil {
        panic(err)
    }

    // Add documents to row groups
    builder.AddDocument(0, []byte(`{"name": "alice", "age": 30}`))
    builder.AddDocument(1, []byte(`{"name": "bob", "age": 25}`))
    builder.AddDocument(2, []byte(`{"name": "alice", "age": 40}`))

    // Build index
    idx := builder.Finalize()

    // Query: find row groups where name = "alice"
    result := idx.Evaluate([]gin.Predicate{
        gin.EQ("$.name", "alice"),
    })
    fmt.Println(result.ToSlice()) // [0, 2]
}

Known limitations

GIN Index v0.2.0 expands the original predicate surface with adaptive high-cardinality pruning and derived representations, but it still intentionally excludes a few deferred capabilities.

  • OR/AND composites are not part of the v0.2.0 query API yet.
  • Index merge across multiple index files is intentionally deferred beyond v0.2.0.
  • Query-time transformers are not supported in v0.2.0; transformations must happen at index-build time.

Serialized index compatibility remains strict: Decode() rejects older payload versions. Indexes built with v0.1.0 (wire format v3) must be rebuilt with v0.2.0 (wire format v8).

Query Types

Equality

gin.EQ("$.status", "active")
gin.NE("$.status", "deleted")
gin.IN("$.status", "active", "pending", "review")
gin.NIN("$.status", "deleted", "archived")  // NOT IN

Numeric Range

gin.GT("$.price", 100.0)    // price > 100
gin.GTE("$.price", 100.0)   // price >= 100
gin.LT("$.price", 500.0)    // price < 500
gin.LTE("$.price", 500.0)   // price <= 500

// Combined range
idx.Evaluate([]gin.Predicate{
    gin.GTE("$.price", 100.0),
    gin.LTE("$.price", 500.0),
})
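
As a rough illustration of how per-row-group min/max stats can answer a combined range like the one above, the sketch below keeps a row group only when its [min, max] interval overlaps the query interval. The rgStats type and pruneRange function are hypothetical names used for illustration; they are not the library's internals.

```go
package main

import "fmt"

// rgStats is a hypothetical per-row-group summary for one numeric path.
type rgStats struct {
    Min, Max float64
}

// pruneRange returns the IDs of row groups whose [Min, Max] interval
// overlaps the closed query interval [lo, hi]. Everything else can be
// skipped without reading it.
func pruneRange(stats []rgStats, lo, hi float64) []int {
    var keep []int
    for id, s := range stats {
        if s.Max >= lo && s.Min <= hi {
            keep = append(keep, id)
        }
    }
    return keep
}

func main() {
    stats := []rgStats{{10, 90}, {120, 480}, {450, 900}, {600, 700}}
    // Only row groups 1 and 2 overlap the range 100..500.
    fmt.Println(pruneRange(stats, 100, 500))
}
```

A surviving row group is only a candidate: its interval overlaps the range, but it may still contain no matching row.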

Derived Representation Queries

Derived representations add companion indexes without dropping the raw source value. Raw-path queries stay raw by default; query a companion explicitly with gin.As(alias, value). Hidden internal target paths are not part of the public query contract.

config, _ := gin.NewConfig(
    gin.WithISODateTransformer("$.created_at", "epoch_ms"),
    gin.WithToLowerTransformer("$.email", "lower"),
    gin.WithEmailDomainTransformer("$.email", "domain"),
    gin.WithRegexExtractTransformer("$.message", "error_code", `ERROR\[(\w+)\]:`, 1),
)
builder, _ := gin.NewBuilder(config, numRGs)

builder.AddDocument(0, []byte(`{
    "created_at": "2024-07-10T09:00:00Z",
    "email": "Alice@Example.COM",
    "message": "ERROR[E1001]: Connection timeout"
}`))

idx := builder.Finalize()

// Raw source-path queries still use the original value.
raw := idx.Evaluate([]gin.Predicate{
    gin.EQ("$.created_at", "2024-07-10T09:00:00Z"),
})

// Alias queries opt into the derived companion explicitly.
july2024 := float64(time.Date(2024, 7, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
dateResult := idx.Evaluate([]gin.Predicate{
    gin.GTE("$.created_at", gin.As("epoch_ms", july2024)),
})
lowerResult := idx.Evaluate([]gin.Predicate{
    gin.EQ("$.email", gin.As("lower", "alice@example.com")),
})
domainResult := idx.Evaluate([]gin.Predicate{
    gin.EQ("$.email", gin.As("domain", "example.com")),
})
errorResult := idx.Evaluate([]gin.Predicate{
    gin.EQ("$.message", gin.As("error_code", "E1001")),
})

Built-in additive helpers:

  • Date/time: WithISODateTransformer(path, alias), WithDateTransformer(path, alias), WithCustomDateTransformer(path, alias, layout)
  • String normalization: WithToLowerTransformer(path, alias), WithEmailDomainTransformer(path, alias), WithURLHostTransformer(path, alias)
  • Extracted subfields: WithRegexExtractTransformer(path, alias, pattern, group), WithRegexExtractIntTransformer(path, alias, pattern, group)
  • Numeric companions: WithIPv4Transformer(path, alias), WithSemVerTransformer(path, alias), WithDurationTransformer(path, alias), WithNumericBucketTransformer(path, alias, size), WithBoolNormalizeTransformer(path, alias)

Custom companions:

myTransformer := func(v any) (any, bool) {
    s, ok := v.(string)
    if !ok {
        return nil, false
    }
    return strings.ToUpper(s), true
}

config, _ := gin.NewConfig(
    gin.WithCustomTransformer("$.my_field", "upper", myTransformer),
)

WithCustomTransformer(...) works for in-memory indexes, but opaque custom companions are not serializable. Encode() rejects them because the function cannot be reconstructed on Decode().
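
For illustration, here is a second function with the same func(v any) (any, bool) shape as myTransformer, this time converting an RFC 3339 timestamp to epoch milliseconds. It is a sketch of what such a conversion could look like, not the implementation behind WithISODateTransformer; returning float64 is an assumption based on the float64 values used in the alias queries above.

```go
package main

import (
    "fmt"
    "time"
)

// isoToEpochMillis reports false for values it cannot convert, so the
// caller can skip the companion value for that document.
func isoToEpochMillis(v any) (any, bool) {
    s, ok := v.(string)
    if !ok {
        return nil, false
    }
    t, err := time.Parse(time.RFC3339, s)
    if err != nil {
        return nil, false
    }
    // ASSUMPTION: numeric companions are represented as float64.
    return float64(t.UnixMilli()), true
}

func main() {
    out, ok := isoToEpochMillis("2024-07-10T09:00:00Z")
    fmt.Println(out, ok)

    _, ok = isoToEpochMillis(42) // not a string, so no companion value
    fmt.Println(ok)
}
```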

Example: IP subnet queries

config, _ := gin.NewConfig(
    gin.WithIPv4Transformer("$.client_ip", "ipv4_int"),
)

start, end, _ := gin.CIDRToRange("192.168.1.0/24")
result := idx.Evaluate([]gin.Predicate{
    gin.GTE("$.client_ip", gin.As("ipv4_int", start)),
    gin.LTE("$.client_ip", gin.As("ipv4_int", end)),
})
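
To make the subnet example concrete, the sketch below shows one way a CIDR block can be turned into the numeric bounds used by the two range predicates. cidrBounds is a hypothetical stand-in written with the standard library; it is not gin.CIDRToRange, and the float64 return type is an assumption.

```go
package main

import (
    "encoding/binary"
    "fmt"
    "net"
)

// cidrBounds returns the first and last IPv4 address of a CIDR block
// as numbers, so a subnet test becomes start <= ip <= end.
func cidrBounds(cidr string) (float64, float64, error) {
    _, ipnet, err := net.ParseCIDR(cidr)
    if err != nil {
        return 0, 0, err
    }
    ip4 := ipnet.IP.To4()
    if ip4 == nil || len(ipnet.Mask) != net.IPv4len {
        return 0, 0, fmt.Errorf("not an IPv4 CIDR: %s", cidr)
    }
    base := binary.BigEndian.Uint32(ip4)
    mask := binary.BigEndian.Uint32(ipnet.Mask)
    // Setting every host bit gives the last address in the block.
    return float64(base), float64(base | ^mask), nil
}

func main() {
    start, end, err := cidrBounds("192.168.1.0/24")
    if err != nil {
        panic(err)
    }
    fmt.Printf("%.0f %.0f\n", start, end)
}
```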

Example: Version range queries

config, _ := gin.NewConfig(
    gin.WithSemVerTransformer("$.version", "semver_int"),
)

result := idx.Evaluate([]gin.Predicate{
    gin.GTE("$.version", gin.As("semver_int", float64(2000000))),
})
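
The 2000000 literal above suggests that version 2.0.0 maps to major*1,000,000. The sketch below spells out one encoding consistent with that value (major*1e6 + minor*1e3 + patch). This layout is an assumption inferred from the example, and semverToNumber is a hypothetical helper, so check the transformer's actual encoding before relying on specific numbers.

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// semverToNumber encodes "major.minor.patch" so that numeric order
// matches version order.
// ASSUMPTION: each component is limited to 0..999 and weighted by
// 1e6 / 1e3 / 1; the library may use a different layout.
func semverToNumber(v string) (float64, bool) {
    parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
    if len(parts) != 3 {
        return 0, false
    }
    weights := []int{1_000_000, 1_000, 1}
    total := 0
    for i, p := range parts {
        n, err := strconv.Atoi(p)
        if err != nil || n < 0 || n > 999 {
            return 0, false
        }
        total += n * weights[i]
    }
    return float64(total), true
}

func main() {
    for _, v := range []string{"1.9.12", "2.0.0", "2.1.0"} {
        n, ok := semverToNumber(v)
        fmt.Println(v, n, ok, n >= 2000000)
    }
}
```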

Example: Case-insensitive email queries

config, _ := gin.NewConfig(
    gin.WithToLowerTransformer("$.email", "lower"),
)

result := idx.Evaluate([]gin.Predicate{
    gin.EQ("$.email", gin.As("lower", "alice@example.com")),
})

Example: Extract error codes from log messages

config, _ := gin.NewConfig(
    gin.WithRegexExtractTransformer("$.message", "error_code", `ERROR\[(\w+)\]:`, 1),
)

result := idx.Evaluate([]gin.Predicate{
    gin.EQ("$.message", gin.As("error_code", "E1234")),
})

Full-Text Search (CONTAINS)

// Uses trigram index for substring matching
gin.Contains("$.description", "hello")
gin.Contains("$.title", "database")  // matches "database", "databases", etc.
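
The idea behind trigram-based substring matching can be sketched as follows: every indexed string contributes its 3-character windows, and a CONTAINS query keeps only the row groups that hold all trigrams of the search term. The trigramIndex type below is a hypothetical, map-based illustration (byte-based and case-sensitive by assumption), not the library's data structure.

```go
package main

import "fmt"

// trigrams returns the distinct 3-byte substrings of s.
func trigrams(s string) []string {
    seen := map[string]bool{}
    var out []string
    for i := 0; i+3 <= len(s); i++ {
        g := s[i : i+3]
        if !seen[g] {
            seen[g] = true
            out = append(out, g)
        }
    }
    return out
}

// trigramIndex maps each trigram to the row groups that contain it.
type trigramIndex map[string]map[int]bool

func (ix trigramIndex) add(rg int, text string) {
    for _, g := range trigrams(text) {
        if ix[g] == nil {
            ix[g] = map[int]bool{}
        }
        ix[g][rg] = true
    }
}

// candidates returns the row groups containing every trigram of needle.
// ok is false when needle is shorter than 3 bytes and cannot prune.
func (ix trigramIndex) candidates(needle string, numRGs int) ([]int, bool) {
    grams := trigrams(needle)
    if len(grams) == 0 {
        return nil, false
    }
    var rgs []int
    for rg := 0; rg < numRGs; rg++ {
        all := true
        for _, g := range grams {
            if !ix[g][rg] {
                all = false
                break
            }
        }
        if all {
            rgs = append(rgs, rg)
        }
    }
    return rgs, true
}

func main() {
    ix := trigramIndex{}
    ix.add(0, "relational database")
    ix.add(1, "key-value store")
    ix.add(2, "databases at scale")
    fmt.Println(ix.candidates("database", 3))
}
```

Candidates can be false positives, because the trigrams may occur in a row group without forming the full substring.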

Regex Matching

// Uses trigram index for regex candidate selection
gin.Regex("$.message", "ERROR|WARNING")        // Alternation
gin.Regex("$.brand", "Toyota|Tesla|Ford")      // Multiple literals
gin.Regex("$.log", "error.*timeout")           // Prefix + wildcard + suffix
gin.Regex("$.code", "[A-Z]{3}_[0-9]+")         // Pattern with literals

The Regex operator extracts literal strings from regex patterns and uses the trigram index for candidate row-group selection. This enables efficient pruning before actual regex matching.

How it works:

  1. Parse regex pattern and extract literal substrings
  2. For alternations like (error|warn)_message, extract the combined literals: ["error_message", "warn_message"]
  3. Query trigram index for each literal
  4. Union results (OR semantics for alternation)
  5. Row groups not containing any literal are pruned
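
Steps 1 and 2 can be sketched with the standard regexp/syntax package. The code below is a simplified illustration, not the library's extractor: it only handles patterns built from literals, groups, concatenation and alternation, and gives up (ok == false) on anything else, whereas a full extractor would also pick literal fragments out of patterns such as error.*timeout. The names expandLiterals and literalsOf are hypothetical.

```go
package main

import (
    "fmt"
    "regexp/syntax"
)

// expandLiterals returns every literal string the node can match, or
// false when the node contains wildcards, classes or repeats.
func expandLiterals(re *syntax.Regexp) ([]string, bool) {
    switch re.Op {
    case syntax.OpLiteral:
        return []string{string(re.Rune)}, true
    case syntax.OpCapture:
        return expandLiterals(re.Sub[0])
    case syntax.OpAlternate:
        var out []string
        for _, sub := range re.Sub {
            lits, ok := expandLiterals(sub)
            if !ok {
                return nil, false
            }
            out = append(out, lits...)
        }
        return out, true
    case syntax.OpConcat:
        // Cross product: each prefix built so far is extended by every
        // literal of the next sub-expression.
        out := []string{""}
        for _, sub := range re.Sub {
            lits, ok := expandLiterals(sub)
            if !ok {
                return nil, false
            }
            var next []string
            for _, prefix := range out {
                for _, l := range lits {
                    next = append(next, prefix+l)
                }
            }
            out = next
        }
        return out, true
    default:
        return nil, false
    }
}

func literalsOf(pattern string) ([]string, bool) {
    re, err := syntax.Parse(pattern, syntax.Perl)
    if err != nil {
        return nil, false
    }
    return expandLiterals(re)
}

func main() {
    fmt.Println(literalsOf(`(error|warn)_message`))
    fmt.Println(literalsOf(`.*`)) // pure wildcard: nothing to prune with
}
```

Each extracted literal would then be looked up in the trigram index and the per-literal results unioned, as in steps 3 and 4.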

Limitations:

  • Requires trigram index enabled (EnableTrigrams: true)
  • Literals shorter than trigram length (default: 3) cannot prune
  • Pure wildcard patterns (.*) return all row groups
  • This is candidate selection, not regex execution - the regex itself still has to be run against the rows read from the candidate row groups

Null Handling

gin.IsNull("$.optional_field")
gin.IsNotNull("$.required_field")

Nested Fields and Arrays

// Nested objects
gin.EQ("$.user.address.city", "New York")

// Array elements (wildcard)
gin.EQ("$.tags[*]", "important")
gin.IN("$.roles[*]", "admin", "editor")
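
To show how nested objects and arrays can map onto the paths used above, here is a small sketch that walks a decoded JSON document and collects its leaf paths, collapsing every array position into a single [*] segment. collectPaths and pathsOf are hypothetical helpers built on encoding/json; the library's own path discovery may differ in detail.

```go
package main

import (
    "encoding/json"
    "fmt"
    "sort"
)

// collectPaths records the path of every scalar leaf under v.
func collectPaths(prefix string, v any, out map[string]bool) {
    switch t := v.(type) {
    case map[string]any:
        for k, child := range t {
            collectPaths(prefix+"."+k, child, out)
        }
    case []any:
        // All elements share one wildcard path, regardless of position.
        for _, child := range t {
            collectPaths(prefix+"[*]", child, out)
        }
    default:
        out[prefix] = true
    }
}

// pathsOf returns the sorted, de-duplicated leaf paths of a document.
func pathsOf(doc []byte) ([]string, error) {
    var v any
    if err := json.Unmarshal(doc, &v); err != nil {
        return nil, err
    }
    set := map[string]bool{}
    collectPaths("$", v, set)
    paths := make([]string, 0, len(set))
    for p := range set {
        paths = append(paths, p)
    }
    sort.Strings(paths)
    return paths, nil
}

func main() {
    doc := []byte(`{"user":{"address":{"city":"New York"}},"tags":["important","draft"]}`)
    paths, err := pathsOf(doc)
    if err != nil {
        panic(err)
    }
    fmt.Println(paths)
}
```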

JSONPath Support

Supported path syntax:

  • $ - root
  • $.field - dot notation
  • $['field'] - bracket notation
  • $.items[*] - array wildcard

Not supported (will error):

  • $.items[0] - array indices
  • $..field - recursive descent
  • $.items[0:5] - slices
  • $[?(@.price > 10)] - filters

Validate paths before use:

if err := gin.ValidateJSONPath("$.user.name"); err != nil {
    log.Fatal(err)
}
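
The supported subset is small enough to approximate with a single regular expression. The sketch below is an illustration of the rules listed above, not a reimplementation of gin.ValidateJSONPath; the allowed characters in dot-notation field names are an assumption.

```go
package main

import (
    "fmt"
    "regexp"
)

// supportedPath accepts "$" followed by any number of .field,
// ['field'] or [*] segments.
// ASSUMPTION: dot-notation names use letters, digits, "_" and "-".
var supportedPath = regexp.MustCompile(`^\$(\.[A-Za-z_][A-Za-z0-9_-]*|\['[^']+'\]|\[\*\])*$`)

func isSupportedPath(p string) bool {
    return supportedPath.MatchString(p)
}

func main() {
    for _, p := range []string{
        "$", "$.user.name", "$['field']", "$.items[*]", // supported
        "$.items[0]", "$..field", "$.items[0:5]", "$[?(@.price > 10)]", // not supported
    } {
        fmt.Printf("%-22s %v\n", p, isSupportedPath(p))
    }
}
```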

Serialization

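// Encode to bytes (zstd compressed)
data, err := gin.Encode(idx)
if err != nil {
    log.Fatal(err)
}

// Save to file
if err := os.WriteFile("index.gin", data, 0644); err != nil {
    log.Fatal(err)
}

// Load and decode
loaded, err := os.ReadFile("index.gin")
if err != nil {
    log.Fatal(err)
}
restored, err := gin.Decode(loaded)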

Parquet Integration

The GIN index integrates directly with Parquet files, supporting three storage strategies:

  1. Sidecar file - Index stored as data.parquet.gin alongside the Parquet file
  2. Embedded metadata - Index stored in Parquet file's key-value metadata
  3. Build-time embedding - Index built and embedded during Parquet file creation

Build Index from Parquet

// Build index from a Parquet file's JSON column
idx, err := gin.BuildFromParquet("data.parquet", "attributes", gin.DefaultConfig())

Sidecar Workflow

// Write index as sidecar file (data.parquet.gin)
err := gin.WriteSidecar("data.parquet", idx)

// Read sidecar
idx, err := gin.ReadSidecar("data.parquet")

// Check if sidecar exists
if gin.HasSidecar("data.parquet") {
    // ...
}

Embedded Metadata Workflow

cfg := gin.DefaultParquetConfig() // MetadataKey: "gin.index"

// Rebuild existing Parquet file with embedded index
err := gin.RebuildWithIndex("data.parquet", idx, cfg)

// Check if Parquet has embedded index
hasIdx, err := gin.HasGINIndex("data.parquet", cfg)

// Read embedded index
idx, err := gin.ReadFromParquetMetadata("data.parquet", cfg)

Auto-Loading (Embedded First, Then Sidecar)

// Tries embedded metadata first, falls back to sidecar
idx, err := gin.LoadIndex("data.parquet", gin.DefaultParquetConfig())

Encode for Parquet Metadata (Build-Time Embedding)

When creating a new Parquet file, you can embed the index during creation:

// Get key-value pair for Parquet metadata
key, value, err := gin.EncodeToMetadata(idx, gin.DefaultParquetConfig())
// key = "gin.index", value = base64-encoded compressed index

// Use with parquet-go writer
writer := parquet.NewGenericWriter[Record](f,
    parquet.KeyValueMetadata(key, value),
)
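
Parquet key-value metadata stores string values, which is why the binary index is base64-encoded before embedding. The round trip can be sketched with the standard library alone; the helper names are hypothetical, and the use of the standard base64 alphabet is an assumption about the encoding.

```go
package main

import (
    "encoding/base64"
    "fmt"
)

// toMetadataValue turns an encoded (already compressed) index into a
// string that is safe to store as a key-value metadata value.
// ASSUMPTION: standard base64 alphabet with padding.
func toMetadataValue(encoded []byte) string {
    return base64.StdEncoding.EncodeToString(encoded)
}

// fromMetadataValue reverses toMetadataValue.
func fromMetadataValue(value string) ([]byte, error) {
    return base64.StdEncoding.DecodeString(value)
}

func main() {
    payload := []byte{0x28, 0xb5, 0x2f, 0xfd, 0x00, 0x01} // placeholder bytes
    value := toMetadataValue(payload)
    back, err := fromMetadataValue(value)
    if err != nil {
        panic(err)
    }
    fmt.Println(value, len(back))
}
```

Base64 inflates the payload by roughly a third, which is worth keeping in mind when choosing between embedded metadata and a sidecar file.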

Batch Processing (Programmatic)

Helper functions for working with multiple files:

// Local filesystem
if gin.IsDirectory("./data") {
    // List all .parquet files in directory
    parquetFiles, err := gin.ListParquetFiles("./data")

    // List all .gin files in directory
    ginFiles, err := gin.ListGINFiles("./data")

    // Process each file
    for _, f := range parquetFiles {
        idx, _ := gin.BuildFromParquet(f, "attributes", gin.DefaultConfig())
        gin.WriteSidecar(f, idx)
    }
}

// S3
s3Client, _ := gin.NewS3ClientFromEnv()

// List all .parquet files under prefix
parquetKeys, err := s3Client.ListParquetFiles("bucket", "data/")

// List all .gin files under prefix
ginKeys, err := s3Client.ListGINFiles("bucket", "data/")

S3 Support

All operations support S3 paths via AWS SDK v2:

// Configure from environment variables:
// AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
// AWS_ENDPOINT_URL (for MinIO, LocalStack), AWS_S3_PATH_STYLE=true
s3Client, err := gin.NewS3ClientFromEnv()

// Build from S3
idx, err := s3Client.BuildFromParquet("bucket", "path/to/data.parquet", "attributes", gin.DefaultConfig())

// Write sidecar to S3
err := s3Client.WriteSidecar("bucket", "path/to/data.parquet", idx)

// Read sidecar from S3
idx, err := s3Client.ReadSidecar("bucket", "path/to/data.parquet")

// Load index (tries embedded, then sidecar)
idx, err := s3Client.LoadIndex("bucket", "path/to/data.parquet", gin.DefaultParquetConfig())

CLI Tool

A command-line tool is provided for common operations:

# Install
go install github.com/amikos-tech/ami-gin/cmd/gin-index@latest

# Build sidecar index
gin-index build -c attributes data.parquet
gin-index build -c attributes -o custom.gin data.parquet

# Build and embed into Parquet file
gin-index build -c attributes -embed data.parquet

# Query index
gin-index query data.parquet.gin '$.status = "error"'
gin-index query data.parquet.gin '$.count > 100'
gin-index query data.parquet.gin '$.name IN ("alice", "bob")'

# Show index info
gin-index info data.parquet.gin

# Extract embedded index to sidecar
gin-index extract -o data.parquet.gin data.parquet

# S3 paths (uses AWS env vars)
gin-index build -c attributes s3://bucket/data.parquet
gin-index query s3://bucket/data.parquet.gin '$.status = "ok"'

Batch Processing (Directory/S3 Prefix):

Process multiple files at once by passing a directory or S3 prefix:

# Build index for all .parquet files in a directory
gin-index build -c attributes ./data/
gin-index build -c attributes -embed ./data/

# Query all .gin files in a directory
gin-index query ./data/ '$.status = "error"'

# Show info for all .gin files
gin-index info ./data/

# S3 prefix - processes all .parquet files under the prefix
gin-index build -c attributes s3://bucket/data/
gin-index query s3://bucket/data/ '$.status = "error"'
gin-index info s3://bucket/data/

# Glob patterns work too
gin-index build -c attributes './data/*.parquet'
gin-index query './data/*.gin' '$.level = "error"'

CLI Query Syntax:

  • Equality: $.field = "value", $.field != "value"
  • Numeric: $.field > 100, $.field >= 100, $.field < 100, $.field <= 100
  • IN/NOT IN: $.field IN ("a", "b"), $.field NOT IN (1, 2, 3)
  • Null: $.field IS NULL, $.field IS NOT NULL
  • Contains: $.field CONTAINS "substring"
  • Regex: $.field REGEX "pattern" (e.g., $.brand REGEX "Toyota|Tesla")

Configuration

config := gin.GINConfig{
    CardinalityThreshold:    10000, // Exact below threshold, adaptive above it
    BloomFilterSize:         65536,
    BloomFilterHashes:       5,
    EnableTrigrams:          true,  // Enable CONTAINS queries
    TrigramMinLength:        3,
    HLLPrecision:            12,    // HyperLogLog precision (4-16)
    PrefixBlockSize:         16,
    AdaptiveMinRGCoverage:   2,     // Promote values seen in at least 2 row groups
    AdaptivePromotedTermCap: 64,    // Keep at most 64 exact hot terms per path
    AdaptiveCoverageCeiling: 0.80,  // Skip terms that cover more than 80% of row groups
    AdaptiveBucketCount:     128,   // Fixed bucket count for long-tail fallback
}

builder, err := gin.NewBuilder(config, numRowGroups)
if err != nil {
    panic(err)
}

High-Cardinality String Modes

GIN Index uses three string-path modes:

  • exact - path cardinality stays under CardinalityThreshold, so every observed value keeps an exact row-group bitmap.
  • adaptive-hybrid - path exceeds CardinalityThreshold, but hot values still retain exact row-group pruning while the long tail falls back to fixed hash buckets.
  • bloom-only - adaptive promotion is disabled, so high-cardinality paths keep only the bloom filter fast-path.

The additive adaptive knobs above control when a hot value is promoted (AdaptiveMinRGCoverage), how many promoted values are kept (AdaptivePromotedTermCap), how broad a promoted value is allowed to be (AdaptiveCoverageCeiling), and how much compact fallback space is reserved for the long tail (AdaptiveBucketCount). This means hot values on a high-cardinality path can still retain exact row-group pruning instead of degrading immediately to bloom-only behavior.
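
One possible reading of how these four knobs interact is sketched below. It is an interpretation for illustration only: the adaptiveConfig type, the plan function, the ranking of candidates by coverage and the FNV-based bucket assignment are all assumptions, not the library's algorithm.

```go
package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

// adaptiveConfig mirrors the four Adaptive* settings (hypothetical type).
type adaptiveConfig struct {
    MinRGCoverage   int
    PromotedTermCap int
    CoverageCeiling float64
    BucketCount     int
}

// plan splits the terms of one high-cardinality path into promoted hot
// terms (exact row-group sets) and tail terms mapped to hash buckets.
// coverage maps each term to the number of row groups it appears in.
func plan(coverage map[string]int, numRGs int, cfg adaptiveConfig) ([]string, map[string]int) {
    var hot []string
    for term, n := range coverage {
        frac := float64(n) / float64(numRGs)
        if n >= cfg.MinRGCoverage && frac <= cfg.CoverageCeiling {
            hot = append(hot, term)
        }
    }
    // ASSUMPTION: prefer wider coverage, break ties by term.
    sort.Slice(hot, func(i, j int) bool {
        if coverage[hot[i]] != coverage[hot[j]] {
            return coverage[hot[i]] > coverage[hot[j]]
        }
        return hot[i] < hot[j]
    })
    if len(hot) > cfg.PromotedTermCap {
        hot = hot[:cfg.PromotedTermCap]
    }
    promoted := map[string]bool{}
    for _, t := range hot {
        promoted[t] = true
    }
    tail := map[string]int{}
    for term := range coverage {
        if !promoted[term] {
            h := fnv.New32a()
            h.Write([]byte(term))
            tail[term] = int(h.Sum32() % uint32(cfg.BucketCount))
        }
    }
    return hot, tail
}

func main() {
    coverage := map[string]int{"a": 5, "b": 1, "c": 9, "d": 3}
    cfg := adaptiveConfig{MinRGCoverage: 2, PromotedTermCap: 2, CoverageCeiling: 0.80, BucketCount: 128}
    hot, tail := plan(coverage, 10, cfg)
    fmt.Println(hot, len(tail))
}
```

In this example "b" is too rare to promote and "c" is too broad (90% of row groups), so both fall through to the bucketed tail.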

Examples

See the examples directory:

go run ./examples/basic/main.go        # Equality queries
go run ./examples/range/main.go        # Numeric ranges
go run ./examples/transformers/main.go # Date field transformers
go run ./examples/transformers-advanced/main.go # IP, SemVer, email, regex transformers
go run ./examples/fulltext/main.go     # CONTAINS queries
go run ./examples/regex/main.go        # Regex pattern matching
go run ./examples/null/main.go         # NULL handling
go run ./examples/nested/main.go       # Nested JSON and arrays
go run ./examples/serialize/main.go    # Persistence
go run ./examples/full/main.go         # All types and operators
go run ./examples/parquet/main.go      # Parquet integration (sidecar, embedded, queries)

Benchmarks

Run benchmarks with:

go test -bench=. -benchmem -benchtime=1s

Performance Summary (Apple M3 Max)

| Operation | Latency | Notes |
|---|---|---|
| EQ query | ~1µs | Bloom filter + sorted term lookup |
| Range query (GT/LT) | 4-24µs | Min/max stats scan |
| IN query (10 values) | ~8µs | Union of EQ results |
| CONTAINS query | 2-17µs | Trigram intersection |
| IsNull/IsNotNull | 2-4µs | Bitmap lookup |
| Bloom lookup | ~100ns | Fast path rejection |
| AddDocument | ~43µs | JSON parsing + indexing |
| Encode (1K RGs) | ~4ms | zstd compression |
| Decode (1K RGs) | ~2ms | zstd decompression |

Index Size

| Row Groups | Encoded Size | Per RG |
|---|---|---|
| 100 | 6.7 KB | 67 bytes |
| 500 | 18 KB | 36 bytes |
| 1,000 | 30 KB | 30 bytes |
| 2,000 | 51 KB | 26 bytes |

Scaling Characteristics

Query latency grows sub-linearly with row-group count (roughly 2.4x slower for 500x more row groups):

  • 10 RGs: ~340ns
  • 100 RGs: ~530ns
  • 1,000 RGs: ~680ns
  • 5,000 RGs: ~800ns

Build time is linear with document count and complexity:

  • 100 docs (7 fields): ~1.2ms
  • 1,000 docs: ~6.7ms
  • High cardinality (10K unique values): ~3.3ms per 1K docs

Component Performance

| Component | Operation | Latency |
|---|---|---|
| Bloom Filter | Add | ~100ns |
| Bloom Filter | Lookup | ~100ns |
| RGSet (10K) | Intersect | ~12µs |
| RGSet (10K) | Union | ~10µs |
| Trigram | Add (50 chars) | ~16µs |
| Trigram | Search | 1-6µs |
| HyperLogLog | Add | ~70ns |
| HyperLogLog | Estimate | 7-410µs (precision dependent) |
| Prefix Compress | 1K terms | ~60µs |

Real-World Scenario: 1M Docs / 50K Row Groups

Simulating a log storage scenario:

  • 1M documents across 50K row groups (~20 docs/RG)
  • 10 labels: 2 integers (status_code, duration_ms) + 8 strings
  • Mix of cardinalities: trace_id (high), service (low), host (medium)
  • Trigrams disabled (no FTS)

| Metric | Value |
|---|---|
| Index Size | 289 KB (0.28 MB) |
| Bytes per RG | 5.9 bytes |
| Bytes per doc | 0.3 bytes |
| Build time | 464ms |
| Encode | 41ms |
| Decode | 41ms |

Query Performance:

| Query | Latency | Notes |
|---|---|---|
| trace_id=X (high cardinality) | 950ns | adaptive-hybrid hot-term prune or compact tail fallback |
| service=api (low cardinality) | 6.5µs | ~10K RGs match |
| trace_id=X AND level=error | 6µs | High card + low card |
| duration_ms > 5000 | 244µs | Range scan over 50K RGs |
| service=api AND env=prod AND status>=400 | 285µs | 3 predicates combined |

Key takeaway: High-cardinality lookups (trace ID, request ID) are sub-microsecond. The entire index for 1M documents fits in 289 KB - easily cacheable in memory, localStorage, or CDN edge.

Benchmark Categories

The benchmark suite (benchmark_test.go) covers:

  1. Builder Performance - Document ingestion, batch loading, finalization
  2. Query Performance - All operators, parallel queries, multiple predicates
  3. Serialization - Encode/decode latency, compression ratios
  4. Components - Bloom filter, RGSet, trigram, HLL, prefix compression
  5. Scaling - Row group count, document size, cardinality, nesting depth

Comparison with Other Solutions

vs PostgreSQL GIN/JSONB

| Aspect | This GIN Index | PostgreSQL GIN |
|---|---|---|
| Query Latency | ~1µs (EQ) | ~0.7-1.2ms per predicate |
| Deployment | Embedded bytes, no server | Requires PostgreSQL server |
| Cacheability | Cache anywhere (nginx, memcached, CDN) | Tied to database buffer cache |
| Index Size | 26-67 bytes/row-group | Larger, includes posting lists |
| Range Queries | Native min/max stats | Poor (GIN doesn't support ranges) |
| Full-Text | Trigram-based CONTAINS | Full-featured tsvector/tsquery |
| ACID | No (read-only after build) | Full transaction support |

PostgreSQL GIN uses Bitmap Index Scans which cost ~0.7-1.2ms each when cached. This index achieves ~1µs queries by being purpose-built for row-group pruning with simpler data structures.

vs Parquet Built-in Statistics

| Aspect | This GIN Index | Parquet Min/Max Stats | Parquet Bloom Filters |
|---|---|---|---|
| String Equality | Exact term → RG bitmap | Only min/max (poor for strings) | Yes, but per-column only |
| CONTAINS/FTS | Trigram index | No | No |
| Multi-path Queries | Single index file | Scattered in column chunks | Scattered in column chunks |
| Cardinality | HyperLogLog estimates | No | No |
| Null Tracking | Explicit null/present bitmaps | Null count only | No |
| Index Location | Footer or sidecar file | Column chunk metadata | Column chunk metadata |

Parquet's built-in bloom filters are effective for single-column equality but require reading multiple column chunks for multi-field queries. This GIN index consolidates all paths into one structure.

vs Delta Lake / Iceberg Data Skipping

| Aspect | This GIN Index | Delta Lake | Apache Iceberg |
|---|---|---|---|
| Statistics | Per-path term index + min/max | First 32 columns min/max | Partition-level + column stats |
| High Cardinality | adaptive-hybrid + bloom tail fallback | Requires Z-ordering | Requires sorting |
| JSON Support | Native path extraction | Requires schema | Requires schema |
| Query Planning | Client-side, cacheable | Spark/engine dependent | Engine dependent |
| Deployment | Standalone bytes | Delta transaction log | Metadata tables |

Delta Lake's data skipping relies on Z-ordering for effectiveness with high-cardinality columns. This GIN index handles high-cardinality paths natively with adaptive-hybrid hot-value recovery plus compact fallback for the tail.

vs Elasticsearch

| Aspect | This GIN Index | Elasticsearch |
|---|---|---|
| Query Latency | ~1µs | ~1-10ms (network + processing) |
| Deployment | Embedded, no server | Cluster required |
| Index Size | ~30KB for 1K row-groups | GB+ for equivalent data |
| Use Case | Row-group pruning | Full search engine |
| Updates | Rebuild required | Near real-time |

Elasticsearch provides millisecond-level latency for searches but requires cluster infrastructure. This index is designed for embedding in data lake metadata.

Key Advantage: High-Cardinality Arbitrary JSON

This index was born from log storage needs - indexing arbitrary attributes/labels where:

  • High cardinality is the norm - trace IDs, request IDs, user IDs, session tokens
  • Schema is unknown - arbitrary key-value labels attached at runtime
  • Queries are selective - "find logs where trace_id=abc123" should be instant

Traditional solutions struggle here:

| Challenge | PostgreSQL GIN | Parquet Stats | This GIN Index |
|---|---|---|---|
| trace_id (millions unique) | Index bloat, slow writes | Min/max useless | adaptive-hybrid exact hot values + compact tail fallback |
| user.email (arbitrary path) | Requires schema | Column must exist | Auto-discovered paths |
| labels["env"] (dynamic keys) | JSONB @> operator (~1ms) | Not supported | Native path indexing (~1µs) |
| Mixed types per path | Type coercion issues | Single type per column | Tracks observed types |

Log/observability example:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "trace_id": "abc123def456",
  "user": {"id": "user_98765", "email": "alice@example.com"},
  "labels": {"env": "prod", "region": "us-east-1", "version": "2.1.0"},
  "message": "Connection timeout to downstream service"
}

Query: trace_id=abc123def456 AND labels.env=prod AND level=error

  • Bloom filter rejects non-matching row groups instantly
  • High-cardinality trace_id can retain exact row-group pruning for hot values via adaptive-hybrid
  • Arbitrary labels.* paths indexed automatically
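
The bloom-filter fast path can be illustrated with a tiny filter keyed by "path=value" strings, the key shape the index uses for its global bloom filter. This is a minimal sketch using FNV double hashing; the bloom type and its hashing scheme are assumptions for illustration, and only the sizes (65536 bits, 5 hashes) are taken from the configuration shown earlier.

```go
package main

import (
    "fmt"
    "hash/fnv"
)

// bloom is a minimal bit-array bloom filter.
type bloom struct {
    bits   []uint64
    nbits  uint64
    hashes uint64
}

func newBloom(nbits, hashes uint64) *bloom {
    return &bloom{bits: make([]uint64, (nbits+63)/64), nbits: nbits, hashes: hashes}
}

// positions derives the bit positions for key from two FNV hashes.
func (b *bloom) positions(key string) []uint64 {
    h1 := fnv.New64a()
    h1.Write([]byte(key))
    h2 := fnv.New64()
    h2.Write([]byte(key))
    base, step := h1.Sum64(), h2.Sum64()|1
    pos := make([]uint64, b.hashes)
    for i := uint64(0); i < b.hashes; i++ {
        pos[i] = (base + i*step) % b.nbits
    }
    return pos
}

func (b *bloom) add(path, value string) {
    for _, p := range b.positions(path + "=" + value) {
        b.bits[p/64] |= uint64(1) << (p % 64)
    }
}

// mayContain returns false only when the pair was definitely never added.
func (b *bloom) mayContain(path, value string) bool {
    for _, p := range b.positions(path + "=" + value) {
        if b.bits[p/64]&(uint64(1)<<(p%64)) == 0 {
            return false
        }
    }
    return true
}

func main() {
    f := newBloom(65536, 5)
    f.add("$.trace_id", "abc123def456")
    fmt.Println(f.mayContain("$.trace_id", "abc123def456"))
    fmt.Println(f.mayContain("$.trace_id", "zzz"))
}
```

A negative answer is exact, so the query can stop immediately; a positive answer may be a false positive and still needs the term lookup.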

Vector database metadata filtering:

Vector databases need efficient pre-filtering before similarity search. Without good metadata indexing, you either:

  1. Scan all vectors then filter (slow)
  2. Filter first with poor index (still slow)
  3. Build separate metadata infrastructure (complex)
{
  "id": "doc_12345",
  "embedding": [0.1, 0.2, ...],
  "metadata": {
    "source": "arxiv",
    "year": 2024,
    "authors": ["Alice", "Bob"],
    "topics": ["machine-learning", "transformers"],
    "cited_by": 142,
    "full_text": "We present a novel approach to..."
  }
}

Query: Find similar vectors WHERE metadata.source=arxiv AND metadata.year>=2023 AND metadata.topics[*]=transformers

┌─────────────────────────────────────────────────────────────────┐
│                   Vector DB Hybrid Search                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. Metadata Filter (GIN Index)         2. Vector Search       │
│   ┌─────────────────────────────┐       ┌──────────────────┐   │
│   │ source=arxiv                │       │                  │   │
│   │ year>=2023        ──────────┼──────▶│  ANN Search      │   │
│   │ topics[*]=transformers      │       │  (only on        │   │
│   │                             │       │   segments 2,5)  │   │
│   │ Result: segments [2, 5]     │       │                  │   │
│   └─────────────────────────────┘       └──────────────────┘   │
│           ~1µs                              search scope        │
│                                             reduced 80%         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

This enables:

  • Pre-filtering - Prune segments before expensive ANN search
  • Flexible schemas - Each document can have different metadata fields
  • High cardinality - Filter by doc_id, user_id, session_id
  • Range + equality - year>=2023 AND source=arxiv
  • Array membership - topics[*]=machine-learning
  • Full-text on metadata - CONTAINS(full_text, "transformer")
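
Combining predicates amounts to intersecting the candidate sets produced by each one. The sketch below does this with plain bitsets; rgSet is a hypothetical type for illustration, and the per-predicate sets are made-up inputs chosen so that the result matches the segments [2, 5] shown in the diagram.

```go
package main

import (
    "fmt"
    "math/bits"
)

// rgSet is a bitset with one bit per row group (or segment).
type rgSet []uint64

func newRGSet(n int) rgSet { return make(rgSet, (n+63)/64) }

// set marks the given IDs and returns the set for chaining.
func (s rgSet) set(ids ...int) rgSet {
    for _, id := range ids {
        s[id/64] |= uint64(1) << (uint(id) % 64)
    }
    return s
}

// intersect ANDs the sets together; all sets must have the same length.
func intersect(sets ...rgSet) rgSet {
    if len(sets) == 0 {
        return nil
    }
    out := append(rgSet(nil), sets[0]...)
    for _, s := range sets[1:] {
        for i := range out {
            out[i] &= s[i]
        }
    }
    return out
}

// toSlice lists the IDs whose bits are set, in ascending order.
func (s rgSet) toSlice() []int {
    var ids []int
    for w, word := range s {
        for word != 0 {
            ids = append(ids, w*64+bits.TrailingZeros64(word))
            word &= word - 1 // clear the lowest set bit
        }
    }
    return ids
}

func main() {
    source := newRGSet(8).set(1, 2, 5, 7) // source=arxiv
    year := newRGSet(8).set(2, 3, 5)      // year>=2023
    topics := newRGSet(8).set(0, 2, 5, 6) // topics[*]=transformers
    fmt.Println(intersect(source, year, topics).toSlice())
}
```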

Cacheable Pruning Index

The second differentiator is deployment flexibility:

┌─────────────────────────────────────────────────────────────────┐
│                     Data Lake Query Flow                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Client                                                        │
│     │                                                           │
│     │ 1. Fetch GIN index (cached)                              │
│     ▼                                                           │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │  nginx/memcached/CDN/local cache                        │  │
│   │  ┌─────────────┐                                        │  │
│   │  │ index.gin   │  ← 30KB, serves in <1ms               │  │
│   │  │ (cached)    │                                        │  │
│   │  └─────────────┘                                        │  │
│   └─────────────────────────────────────────────────────────┘  │
│     │                                                           │
│     │ 2. Evaluate predicates locally (~1µs)                    │
│     │    Result: [RG 5, RG 23, RG 47]                          │
│     │                                                           │
│     │ 3. Read only matching row groups from object storage     │
│     ▼                                                           │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │  S3 / GCS / Azure Blob                                  │  │
│   │  ┌─────────────────────────────────────────────────┐    │  │
│   │  │ data.parquet                                    │    │  │
│   │  │  RG 0 ──────── skipped                         │    │  │
│   │  │  RG 5 ◀─────── read                            │    │  │
│   │  │  RG 10 ─────── skipped                         │    │  │
│   │  │  RG 23 ◀─────── read                           │    │  │
│   │  │  RG 47 ◀─────── read                           │    │  │
│   │  │  ...                                            │    │  │
│   │  └─────────────────────────────────────────────────┘    │  │
│   └─────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Benefits:

  • No database server - Index is just bytes, evaluate anywhere
  • Cache at edge - nginx, memcached, CDN, browser localStorage
  • Cross-language - Binary format works in any language
  • Offline capable - Cache index locally for disconnected queries
  • Cost efficient - Avoid scanning TB of Parquet data

This architecture is ideal for:

  • Log/observability platforms - Index arbitrary labels, query by trace ID
  • Vector databases - Pre-filter segments before ANN search
  • Serverless query engines - No database to manage
  • Browser-based data explorers - Cache index in localStorage
  • Edge computing / IoT analytics - Offline-capable querying
  • Cost-sensitive data lake queries - Minimize S3/GCS egress

Architecture

Index Structure

┌─────────────────────────────────────────────────────────────────────────────┐
│                              GINIndex                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Header                                                               │   │
│  │  • Version, NumRowGroups, NumDocs, NumPaths                         │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Path Directory                                                       │   │
│  │  pathID → { PathName, ObservedTypes, Cardinality, Flags }           │   │
│  │                                                                      │   │
│  │  Example:                                                            │   │
│  │    0 → { "$.name",   String,  150,   0x00 }                         │   │
│  │    1 → { "$.age",    Int,     80,    0x00 }                         │   │
│  │    2 → { "$.tags[*]", String, 50000, FlagBloomOnly }                │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ Global Bloom Filter                                                  │   │
│  │  Fast rejection for path=value pairs                                 │   │
│  │  Contains: "$.name=alice", "$.name=bob", "$.age=30", ...            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
│  ┌───────────────────────┐  ┌───────────────────────┐                      │
│  │ StringIndex           │  │ NumericIndex          │                      │
│  │ (per pathID)          │  │ (per pathID)          │                      │
│  │                       │  │                       │                      │
│  │ pathID: 0 ($.name)    │  │ pathID: 1 ($.age)     │                      │
│  │ ┌─────────┬────────┐  │  │ ┌─────┬──────┬──────┐│                      │
│  │ │  Term   │ RGSet  │  │  │ │ RG  │ Min  │ Max  ││                      │
│  │ ├─────────┼────────┤  │  │ ├─────┼──────┼──────┤│                      │
│  │ │ "alice" │ {0,2}  │  │  │ │  0  │  25  │  35  ││                      │
│  │ │ "bob"   │ {1}    │  │  │ │  1  │  20  │  45  ││                      │
│  │ │ "carol" │ {2,3}  │  │  │ │  2  │  30  │  30  ││                      │
│  │ └─────────┴────────┘  │  │ └─────┴──────┴──────┘│                      │
│  └───────────────────────┘  └───────────────────────┘                      │
│                                                                             │
│  ┌───────────────────────┐  ┌───────────────────────┐                      │
│  │ NullIndex             │  │ TrigramIndex          │                      │
│  │ (per pathID)          │  │ (per pathID)          │                      │
│  │                       │  │                       │                      │
│  │ pathID: 1 ($.age)     │  │ pathID: 3 ($.desc)    │                      │
│  │ ┌──────────┬────────┐ │  │ ┌─────────┬────────┐ │                      │
│  │ │ NullRGs  │ {4,7}  │ │  │ │ Trigram │ RGSet  │ │                      │
│  │ │ Present  │ {0-9}  │ │  │ ├─────────┼────────┤ │                      │
│  │ └──────────┴────────┘ │  │ │ "hel"   │ {0,2}  │ │                      │
│  └───────────────────────┘  │ │ "ell"   │ {0,2}  │ │                      │
│                             │ │ "llo"   │ {0,2,5}│ │                      │
│  ┌────────────────────────┐ │ │ "wor"   │ {1,3}  │ │                      │
│  │ DocID Mapping          │ │ └─────────┴────────┘ │                      │
│  │ (optional)             │ └───────────────────────┘                      │
│  │                        │                                                 │
│  │ pos → DocID            │  ┌───────────────────────┐                     │
│  │  0  → 1000             │  │ PathCardinality (HLL) │                     │
│  │  1  → 1001             │  │ (per pathID)          │                     │
│  │  2  → 1020             │  │                       │                     │
│  │  3  → 1021             │  │ Estimates unique vals │                     │
│  └────────────────────────┘  └───────────────────────┘                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Note: All RGSet bitmaps use Roaring Bitmaps for efficient compression
Data Flow
                                BUILD PHASE
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│   JSON Documents                                                            │
│        │                                                                    │
│        ▼                                                                    │
│   ┌─────────┐     ┌──────────────────────────────────────────────────┐     │
│   │ DocID 0 │────▶│                  GINBuilder                       │     │
│   │ RG: 0   │     │                                                   │     │
│   └─────────┘     │  AddDocument(docID, json)                        │     │
│   ┌─────────┐     │       │                                          │     │
│   │ DocID 1 │────▶│       ▼                                          │     │
│   │ RG: 0   │     │  ┌─────────────┐                                 │     │
│   └─────────┘     │  │ Walk JSON   │                                 │     │
│   ┌─────────┐     │  │ Extract:    │                                 │     │
│   │ DocID 2 │────▶│  │  • paths    │                                 │     │
│   │ RG: 1   │     │  │  • values   │                                 │     │
│   └─────────┘     │  │  • types    │                                 │     │
│        ⋮          │  └──────┬──────┘                                 │     │
│                   │         │                                         │     │
│                   │         ▼                                         │     │
│                   │  ┌─────────────────────────────────────────┐     │     │
│                   │  │ Update per-path structures:             │     │     │
│                   │  │  • stringTerms[term] → RGSet.Set(pos)   │     │     │
│                   │  │  • numericStats[pos].Min/Max            │     │     │
│                   │  │  • nullRGs.Set(pos) if null             │     │     │
│                   │  │  • trigrams.Add(term, pos)              │     │     │
│                   │  │  • bloom.Add(path=value)                │     │     │
│                   │  │  • hll.Add(value)                       │     │     │
│                   │  └─────────────────────────────────────────┘     │     │
│                   │                                                   │     │
│                   └───────────────────────┬──────────────────────────┘     │
│                                           │                                 │
│                                           ▼                                 │
│                                    Finalize()                               │
│                                           │                                 │
│                                           ▼                                 │
│                                    ┌─────────────┐                          │
│                                    │  GINIndex   │                          │
│                                    └─────────────┘                          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

                                QUERY PHASE
┌─────────────────────────────────────────────────────────────────────────────┐
│                                                                             │
│   Predicates: [EQ("$.name", "alice"), GT("$.age", 25)]                     │
│        │                                                                    │
│        ▼                                                                    │
│   ┌─────────────────────────────────────────────────────────────────┐      │
│   │                     idx.Evaluate(predicates)                     │      │
│   └─────────────────────────────────────────────────────────────────┘      │
│        │                                                                    │
│        ├──────────────────────┬──────────────────────┐                     │
│        ▼                      ▼                      ▼                     │
│   ┌──────────┐          ┌──────────┐          ┌──────────┐                 │
│   │ Predicate│          │ Predicate│          │   ...    │                 │
│   │    1     │          │    2     │          │          │                 │
│   └────┬─────┘          └────┬─────┘          └──────────┘                 │
│        │                     │                                              │
│        ▼                     ▼                                              │
│   ┌──────────────┐     ┌──────────────┐                                    │
│   │ Bloom Check  │     │ Bloom Check  │   ◀── Fast rejection path          │
│   │ path=value?  │     │   (skip for  │                                    │
│   └──────┬───────┘     │    ranges)   │                                    │
│          │             └──────┬───────┘                                    │
│          ▼                    ▼                                             │
│   ┌──────────────┐     ┌──────────────┐                                    │
│   │ StringIndex  │     │ NumericIndex │                                    │
│   │ lookup term  │     │ scan min/max │                                    │
│   │ → RGSet      │     │ → RGSet      │                                    │
│   └──────┬───────┘     └──────┬───────┘                                    │
│          │                    │                                             │
│          │    RGSet{0,2}      │    RGSet{0,1,2}                            │
│          │                    │                                             │
│          └─────────┬──────────┘                                             │
│                    │                                                        │
│                    ▼                                                        │
│             ┌─────────────┐                                                 │
│             │  Intersect  │                                                 │
│             │  (AND all)  │                                                 │
│             └──────┬──────┘                                                 │
│                    │                                                        │
│                    ▼                                                        │
│             ┌─────────────┐                                                 │
│             │ RGSet{0,2}  │  ◀── Matching row groups                       │
│             └──────┬──────┘                                                 │
│                    │                                                        │
│                    ▼                                                        │
│             ┌─────────────┐                                                 │
│             │ ToSlice()   │  → [0, 2]                                      │
│             │     or      │                                                 │
│             │ MatchingDocIDs() → [DocID...]                                │
│             └─────────────┘                                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
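The NumericIndex step in the query diagram can be sketched as a scan over per-row-group min/max pairs. This is a minimal illustration, not the package's implementation: rgStats and candidatesGT are hypothetical names, and the values are copied from the $.age table in the Index Structure diagram.

```go
package main

import "fmt"

// rgStats holds the lowest and highest value seen for one path in one row group.
type rgStats struct {
	lo, hi float64
}

// candidatesGT returns the row groups that may contain a value greater than v.
// A row group is skipped only when its maximum rules out every possible match.
func candidatesGT(stats []rgStats, v float64) []int {
	var out []int
	for rg, s := range stats {
		if s.hi > v {
			out = append(out, rg)
		}
	}
	return out
}

func main() {
	age := []rgStats{{25, 35}, {20, 45}, {30, 30}}
	fmt.Println(candidatesGT(age, 25)) // [0 1 2]
	fmt.Println(candidatesGT(age, 35)) // [1]
}
```

Min/max pruning can only say "maybe" or "definitely not", which is why GT("$.age", 25) keeps all three row groups in the diagram.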
DocID Codec (Optional)

For composite document identifiers (e.g., file + row group):

// Encode the file index and row group into a single DocID.
codec := gin.NewRowGroupCodec(20)  // 20 RGs per file
builder, err := gin.NewBuilder(config, totalRGs, gin.WithCodec(codec))
if err != nil {
    panic(err)
}

docID := codec.Encode(fileIndex, rgIndex)  // e.g., file=3, rg=15 → DocID=75
if err := builder.AddDocument(docID, jsonDoc); err != nil {
    panic(err)
}
idx := builder.Finalize()

// Query and decode results.
result := idx.Evaluate(predicates)
for _, docID := range idx.MatchingDocIDs(result) {
    decoded := codec.Decode(docID)  // e.g., [3, 15]
    fileIdx, rgIdx := decoded[0], decoded[1]
    fmt.Println(fileIdx, rgIdx)  // use the pair to locate the row group
}
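The arithmetic behind the file=3, rg=15 → 75 comment can be reproduced without the library. The sketch below is an assumption about the encoding (DocID = fileIndex*rgsPerFile + rgIndex), chosen because it matches that example. rowGroupCodec and its methods are hypothetical stand-ins, not the package's codec.

```go
package main

import "fmt"

// rowGroupCodec is a hypothetical fixed-width composite codec.
// Assumption: every file holds at most rgsPerFile row groups.
type rowGroupCodec struct {
	rgsPerFile int
}

// encode packs (fileIndex, rgIndex) into one identifier.
func (c rowGroupCodec) encode(fileIndex, rgIndex int) uint64 {
	return uint64(fileIndex*c.rgsPerFile + rgIndex)
}

// decode reverses encode and returns (fileIndex, rgIndex).
func (c rowGroupCodec) decode(docID uint64) (int, int) {
	n := int(docID)
	return n / c.rgsPerFile, n % c.rgsPerFile
}

func main() {
	codec := rowGroupCodec{rgsPerFile: 20}
	id := codec.encode(3, 15)
	file, rg := codec.decode(id)
	fmt.Println(id, file, rg) // 75 3 15
}
```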

How It Works

The GIN index maintains several data structures:

  1. Path Directory - Maps JSON paths to their metadata (types, cardinality)
  2. String Index - For each path, maps terms to row-group bitmaps (Roaring)
  3. Numeric Index - Per-row-group min/max values for range pruning
  4. Null Index - Bitmaps tracking which row groups have null/present values
  5. Trigram Index - Maps 3-character sequences to row-group bitmaps
  6. Global Bloom Filter - Fast rejection of non-existent path=value pairs
  7. DocID Mapping - Optional external DocID to internal position mapping

Query evaluation intersects the matching row-group bitmaps from each predicate.
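A toy version of that intersection, using a 64-bit mask in place of a Roaring bitmap, might look like the following. rgSet, newRGSet, and intersectAll are hypothetical names, and the 64-row-group limit is an artifact of the sketch only.

```go
package main

import "fmt"

// rgSet is a toy row-group set: bit i is set when row group i may match.
// Assumption: at most 64 row groups; the real index uses Roaring bitmaps.
type rgSet uint64

// newRGSet builds a set from explicit row-group numbers.
func newRGSet(rgs ...int) rgSet {
	var s rgSet
	for _, rg := range rgs {
		s |= 1 << uint(rg)
	}
	return s
}

// toSlice lists the members in ascending order.
func (s rgSet) toSlice() []int {
	var out []int
	for i := 0; i < 64; i++ {
		if s&(1<<uint(i)) != 0 {
			out = append(out, i)
		}
	}
	return out
}

// intersectAll starts from "all row groups" and ANDs in each predicate's candidates.
func intersectAll(numRGs int, sets ...rgSet) rgSet {
	result := rgSet(1)<<uint(numRGs) - 1 // universe: bits 0..numRGs-1
	for _, s := range sets {
		result &= s
	}
	return result
}

func main() {
	byName := newRGSet(0, 2)   // EQ("$.name", "alice")
	byAge := newRGSet(0, 1, 2) // GT("$.age", 25)
	fmt.Println(intersectAll(4, byName, byAge).toSlice()) // [0 2]
}
```

With no predicates the result is the whole universe, which corresponds to "cannot prune, read everything".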

Design Notes

Why numRGs Must Be Known Upfront

NewBuilder(config, numRGs) requires the total number of row groups at construction time. This is intentional:

  1. Complement operations require universe size - Operations like AllRGs() and Invert() need to know the total number of row groups to compute complements. When a query cannot prune (e.g., unknown path, graceful degradation), the index returns "all row groups" - which requires knowing what "all" means.

  2. Parquet metadata provides this - The index is designed for Parquet row-group pruning. In this context, the number of row groups is always available from Parquet file metadata before indexing begins.

  3. Bounds checking - The builder validates that document positions don't exceed the declared row group count, catching configuration errors early.
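The first point is easy to see in isolation: a complement is undefined until the size of the universe is fixed. The helper below is a hypothetical illustration using a plain map, not the package's Invert().

```go
package main

import "fmt"

// invert returns every row group in [0, numRGs) that is NOT in set.
// Without numRGs there is no way to know where the complement ends.
func invert(set map[int]bool, numRGs int) []int {
	var out []int
	for rg := 0; rg < numRGs; rg++ {
		if !set[rg] {
			out = append(out, rg)
		}
	}
	return out
}

func main() {
	nullRGs := map[int]bool{4: true, 7: true}
	fmt.Println(invert(nullRGs, 10)) // [0 1 2 3 5 6 8 9]
}
```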

License

MIT

Documentation

Index

Constants

View Source
const (
	MagicBytes = "GIN\x01"
	// Version is the binary format version. Decode rejects mismatches with
	// ErrVersionMismatch; the only migration path is to rebuild the index
	// with the target binary. Version history:
	//   v8: explicit companion transformer failure modes in serialized config
	//       and representation metadata (strict by default, soft-fail opt-in)
	//   v7: explicit representation metadata for derived alias routing
	//       (phase 09 derived representations)
	//   v6: PathEntry.Mode byte + FlagTrigramIndex bit reassignment
	//       (phase 08 adaptive high-cardinality indexing)
	//   v5: never released; payloads are always rejected. Was an in-tree
	//       iteration of the adaptive string index section before the wire
	//       format was finalised in v6.
	//   v4: earlier pre-OSS format
	Version = uint16(8)
)
View Source
const (
	TypeString uint8 = 1 << iota
	TypeInt
	TypeFloat
	TypeBool
	TypeNull
)
View Source
const DefaultMetadataKey = "gin.index"
View Source
const (
	FlagHasDocIDMap uint16 = 1 << iota
)
View Source
const (
	FlagTrigramIndex uint8 = 1 << iota // path has trigram index for CONTAINS queries
)

Variables

View Source
var (
	// ErrVersionMismatch is returned by Decode when the binary format version
	// does not match the expected version (Version constant).
	ErrVersionMismatch = errors.New("version mismatch")

	// ErrInvalidFormat is returned by Decode when the binary data is structurally
	// invalid: unrecognized magic bytes, oversized allocations, or corrupt fields.
	ErrInvalidFormat = errors.New("invalid format")
)

Functions

func BoolNormalize

func BoolNormalize(v any) (any, bool)

BoolNormalize normalizes various boolean-like values to actual booleans. Handles: bool, "true"/"false"/"yes"/"no"/"1"/"0"/"on"/"off", float64 (0 = false).

func CIDRToRange

func CIDRToRange(cidr string) (start, end float64, err error)

CIDRToRange parses a CIDR notation string and returns the start and end IP addresses as float64 values suitable for use with GTE/LTE predicates on IPv4ToInt-transformed fields. Example: CIDRToRange("192.168.1.0/24") returns (3232235776, 3232236031, nil)
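For readers who want to check the example numbers, the same range can be derived with the standard library. This is a sketch under the assumption that only IPv4 prefixes are of interest. cidrRange is a hypothetical helper and does not claim to mirror the package's error handling.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// cidrRange returns the first and last address of an IPv4 prefix as float64.
func cidrRange(cidr string) (start, end float64, err error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return 0, 0, err
	}
	if !prefix.Addr().Is4() {
		return 0, 0, fmt.Errorf("not an IPv4 prefix: %s", cidr)
	}
	base := prefix.Masked().Addr().As4() // network address with host bits cleared
	lo := binary.BigEndian.Uint32(base[:])
	hostBits := 32 - prefix.Bits()
	hi := lo | uint32(uint64(1)<<uint(hostBits)-1) // set all host bits
	return float64(lo), float64(hi), nil
}

func main() {
	start, end, err := cidrRange("192.168.1.0/24")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.0f %.0f\n", start, end) // 3232235776 3232236031
}
```

The two values can then be used as the bounds of a GTE/LTE pair on a field indexed through IPv4ToInt.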

func CompressionStats

func CompressionStats(terms []string) (compressed, original int, ratio float64)

CompressionStats returns the compression ratio for a set of terms. Returns (compressed size, original size, ratio).

func DateToEpochMs

func DateToEpochMs(v any) (any, bool)

DateToEpochMs parses "2006-01-02" format to Unix milliseconds (midnight UTC).

func DurationToMs

func DurationToMs(v any) (any, bool)

DurationToMs parses Go duration strings (e.g., "1h30m", "500ms") to milliseconds.
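Both DateToEpochMs and DurationToMs map onto standard-library parsing. The sketch below shows the conversions with concrete types. The function names are hypothetical, and the numeric type the real transformers return inside `any` is not stated here, so int64 is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// dateToEpochMs parses a "2006-01-02" date as midnight UTC.
func dateToEpochMs(s string) (int64, bool) {
	t, err := time.Parse("2006-01-02", s) // no zone in layout, so UTC
	if err != nil {
		return 0, false
	}
	return t.UnixMilli(), true
}

// durationToMs parses a Go duration string such as "1h30m".
func durationToMs(s string) (int64, bool) {
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, false
	}
	return d.Milliseconds(), true
}

func main() {
	fmt.Println(dateToEpochMs("2024-01-01")) // 1704067200000 true
	fmt.Println(durationToMs("1h30m"))       // 5400000 true
}
```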

func EmailDomain

func EmailDomain(v any) (any, bool)

EmailDomain extracts and lowercases the domain from an email address.

func Encode

func Encode(idx *GINIndex) ([]byte, error)

Encode serializes the index using zstd-15 compression (recommended default).

func EncodeToMetadata

func EncodeToMetadata(idx *GINIndex, cfg ParquetConfig) (key string, value string, err error)

func EncodeWithLevel

func EncodeWithLevel(idx *GINIndex, level CompressionLevel) ([]byte, error)

EncodeWithLevel serializes the index with the specified compression level. Use CompressionNone (0) for no compression, or 1-19 for zstd compression levels.

func ExtractLiterals

func ExtractLiterals(pattern string) ([]string, error)

ExtractLiterals extracts literal strings from a regex pattern that can be used for trigram-based candidate selection. Returns a slice of literal alternatives. For patterns like "foo|bar", returns ["foo", "bar"]. For patterns like "(error|warn)_msg", returns ["error_msg", "warn_msg"] (combined).

func ExtractTrigrams

func ExtractTrigrams(s string) []string

func GenerateBigrams

func GenerateBigrams(text string) []string

func GenerateNGrams

func GenerateNGrams(text string, n int, opts ...NGramOption) ([]string, error)

func GenerateTrigrams

func GenerateTrigrams(text string) []string
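The signatures above do not say how n-grams are cut. A plain sliding window over runes reproduces the "hel", "ell", "llo" entries shown in the TrigramIndex diagram. The sketch assumes no padding and no case folding, which may differ from what the package does. slidingNGrams is a hypothetical name.

```go
package main

import "fmt"

// slidingNGrams returns every window of n consecutive runes in text.
// Strings shorter than n yield no n-grams.
func slidingNGrams(text string, n int) []string {
	runes := []rune(text)
	if n <= 0 || len(runes) < n {
		return nil
	}
	out := make([]string, 0, len(runes)-n+1)
	for i := 0; i+n <= len(runes); i++ {
		out = append(out, string(runes[i:i+n]))
	}
	return out
}

func main() {
	fmt.Println(slidingNGrams("hello", 3)) // [hel ell llo]
}
```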

func HasGINIndex

func HasGINIndex(parquetFile string, cfg ParquetConfig) (bool, error)

func HasGINIndexReader

func HasGINIndexReader(parquetFile string, cfg ParquetConfig, reader io.ReaderAt, size int64) (bool, error)

func HasSidecar

func HasSidecar(parquetFile string) bool

func IPv4ToInt

func IPv4ToInt(v any) (any, bool)

IPv4ToInt converts IPv4 address strings to uint32 (as float64) for range queries.

func ISODateToEpochMs

func ISODateToEpochMs(v any) (any, bool)

ISODateToEpochMs parses RFC3339/ISO8601 strings to Unix milliseconds.

func IsDirectory

func IsDirectory(path string) bool

func IsS3Path

func IsS3Path(path string) bool

func IsValidJSONPath

func IsValidJSONPath(path string) bool

func ListGINFiles

func ListGINFiles(dir string) ([]string, error)

func ListParquetFiles

func ListParquetFiles(dir string) ([]string, error)

func MustValidateJSONPath

func MustValidateJSONPath(path string) string

func NormalizePath

func NormalizePath(path string) string

NormalizePath converts a JSONPath to a canonical dot-notation form without validating that the path uses only GIN-supported JSONPath features. Callers handling untrusted input should use ValidateJSONPath or canonicalizeSupportedPath first.

func ParseJSONPath

func ParseJSONPath(path string) (jp.Expr, error)

ParseJSONPath parses and validates a JSONPath, returning the parsed expression.

func ParseS3Path

func ParseS3Path(path string) (bucket, key string, err error)

func RebuildWithIndex

func RebuildWithIndex(parquetFile string, idx *GINIndex, cfg ParquetConfig) error

func SemVerToInt

func SemVerToInt(v any) (any, bool)

SemVerToInt encodes semantic versions as integers: major*1000000 + minor*1000 + patch. Supports formats: "1.2.3", "v1.2.3", "1.2", "v1.2", "1.2.3-beta" (pre-release suffix ignored).
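A worked version of the formula, as a standalone sketch: semVerToInt is a hypothetical helper that accepts the formats listed above. How the package treats build metadata or malformed input is not modeled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// semVerToInt encodes "major.minor[.patch]" as major*1000000 + minor*1000 + patch.
func semVerToInt(s string) (int, bool) {
	s = strings.TrimPrefix(s, "v")
	s, _, _ = strings.Cut(s, "-") // drop any pre-release suffix
	parts := strings.Split(s, ".")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, false
	}
	var nums [3]int // a missing patch component stays 0
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return 0, false
		}
		nums[i] = n
	}
	return nums[0]*1000000 + nums[1]*1000 + nums[2], true
}

func main() {
	fmt.Println(semVerToInt("v1.2.3-beta")) // 1002003 true
	fmt.Println(semVerToInt("1.2"))         // 1002000 true
}
```

One consequence of the formula is that a minor or patch component of 1000 or more overflows into the next field, so ordering is only preserved for components below 1000.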

func SetAdaptiveInvariantLogger added in v0.2.0

func SetAdaptiveInvariantLogger(l *log.Logger)

SetAdaptiveInvariantLogger installs a logger that surfaces adaptive index invariant violations (e.g. a path flagged PathModeAdaptiveHybrid with no matching AdaptiveStringIndexes section). The default is nil (silent); pass log.Default() or your own *log.Logger to opt in. Safe for concurrent use.

func SidecarPath

func SidecarPath(parquetFile string) string

func ToLower

func ToLower(v any) (any, bool)

ToLower normalizes strings to lowercase for case-insensitive queries.

func URLHost

func URLHost(v any) (any, bool)

URLHost extracts and lowercases the host from a URL.

func ValidateJSONPath

func ValidateJSONPath(path string) error

ValidateJSONPath validates a JSONPath expression and ensures it only uses features supported by the GIN index (dot notation, wildcards). Unsupported: array indices [0], filters [?()], recursive descent .., scripts


func WriteCompressedTerms

func WriteCompressedTerms(w io.Writer, blocks []CompressedTermBlock) error

func WriteSidecar

func WriteSidecar(parquetFile string, idx *GINIndex) error

Types

type AdaptiveStringIndex added in v0.2.0

type AdaptiveStringIndex struct {
	// Terms holds the promoted exact-match values in sorted order.
	Terms []string
	// RGBitmaps[i] lists the row groups that contain Terms[i].
	RGBitmaps []*RGSet
	// BucketRGBitmaps partitions the long-tail terms by xxhash; len must be a
	// non-zero power of two. A bucket hit is a superset match (may include
	// row groups that do not actually contain the queried term).
	BucketRGBitmaps []*RGSet
}

AdaptiveStringIndex stores promoted exact terms plus lossy tail buckets. Terms must be sorted lexically; RGBitmaps is parallel to Terms. Values that are not promoted fall into one of len(BucketRGBitmaps) hash buckets, which may return false-positive row groups.
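The two-tier lookup this implies can be sketched with plain slices: a binary search over the sorted promoted terms, then a fall back to a hash bucket. This is an illustration only. adaptiveIndex and lookup are hypothetical names, row groups are plain int slices rather than RGSet, and FNV-1a from the standard library stands in for xxhash.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// adaptiveIndex mirrors the documented shape with plain slices.
type adaptiveIndex struct {
	terms     []string // promoted terms, sorted
	termRGs   [][]int  // termRGs[i] holds the row groups for terms[i]
	bucketRGs [][]int  // lossy tail buckets; len is a power of two
}

// lookup returns candidate row groups and whether the answer is exact.
func (a *adaptiveIndex) lookup(term string) (rgs []int, exact bool) {
	i := sort.SearchStrings(a.terms, term)
	if i < len(a.terms) && a.terms[i] == term {
		return a.termRGs[i], true
	}
	h := fnv.New64a()
	h.Write([]byte(term)) // stand-in hash, not xxhash
	// A power-of-two bucket count lets a mask replace the modulo.
	bucket := h.Sum64() & uint64(len(a.bucketRGs)-1)
	return a.bucketRGs[bucket], false
}

func main() {
	idx := &adaptiveIndex{
		terms:     []string{"alice", "carol"},
		termRGs:   [][]int{{0, 2}, {2, 3}},
		bucketRGs: [][]int{{1, 4}}, // a single bucket keeps the demo deterministic
	}
	fmt.Println(idx.lookup("alice")) // [0 2] true
	fmt.Println(idx.lookup("zed"))   // [1 4] false
}
```

A bucket hit is a superset, so callers still read the returned row groups and filter there. It never drops a row group that holds the term.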

func NewAdaptiveStringIndex added in v0.2.0

func NewAdaptiveStringIndex(terms []string, rgBitmaps []*RGSet, bucketBitmaps []*RGSet) (*AdaptiveStringIndex, error)

NewAdaptiveStringIndex validates and constructs an adaptive string index.

type BloomFilter

type BloomFilter struct {
	// contains filtered or unexported fields
}

func BloomFilterFromBits

func BloomFilterFromBits(bits []uint64, numBits uint32, numHashes uint8) *BloomFilter

func MustNewBloomFilter

func MustNewBloomFilter(numBits uint32, numHashes uint8, opts ...BloomFilterOption) *BloomFilter

func NewBloomFilter

func NewBloomFilter(numBits uint32, numHashes uint8, opts ...BloomFilterOption) (*BloomFilter, error)

func (*BloomFilter) Add

func (bf *BloomFilter) Add(data []byte)

func (*BloomFilter) AddString

func (bf *BloomFilter) AddString(s string)

func (*BloomFilter) Bits

func (bf *BloomFilter) Bits() []uint64

func (*BloomFilter) MayContain

func (bf *BloomFilter) MayContain(data []byte) bool

func (*BloomFilter) MayContainString

func (bf *BloomFilter) MayContainString(s string) bool

func (*BloomFilter) NumBits

func (bf *BloomFilter) NumBits() uint32

func (*BloomFilter) NumHashes

func (bf *BloomFilter) NumHashes() uint8

type BloomFilterOption

type BloomFilterOption func(*BloomFilter) error

type BuilderOption

type BuilderOption func(*GINBuilder) error

func WithCodec

func WithCodec(codec DocIDCodec) BuilderOption

type CompressedTermBlock

type CompressedTermBlock struct {
	FirstTerm string
	Entries   []PrefixEntry
}

func ReadCompressedTerms

func ReadCompressedTerms(r io.Reader) ([]CompressedTermBlock, error)

type CompressionLevel

type CompressionLevel int

CompressionLevel specifies the compression level for index serialization.

const (
	CompressionNone     CompressionLevel = 0  // No compression
	CompressionFastest  CompressionLevel = 1  // zstd level 1
	CompressionBalanced CompressionLevel = 3  // zstd level 3
	CompressionBetter   CompressionLevel = 9  // zstd level 9
	CompressionBest     CompressionLevel = 15 // zstd level 15 (recommended)
	CompressionMax      CompressionLevel = 19 // zstd level 19 (slow)
)

type ConfigOption

type ConfigOption func(*GINConfig) error

func WithAdaptiveBucketCount added in v0.2.0

func WithAdaptiveBucketCount(bucketCount int) ConfigOption

WithAdaptiveBucketCount sets the fan-out of the long-tail bucket layer. Must be a positive power of two. To disable adaptive mode, omit this option (and WithAdaptivePromotedTermCap) or build a GINConfig literal with AdaptiveBucketCount/AdaptivePromotedTermCap set to 0; this option rejects 0 to keep the builder path explicit.

func WithAdaptiveCoverageCeiling added in v0.2.0

func WithAdaptiveCoverageCeiling(ceiling float64) ConfigOption

WithAdaptiveCoverageCeiling sets the maximum fraction of row groups a term may cover and still be eligible for promotion. Terms above the ceiling are treated as too-ubiquitous and fall through to the bucket layer. Must be in the open interval (0, 1).

func WithAdaptiveMinRGCoverage added in v0.2.0

func WithAdaptiveMinRGCoverage(minCoverage int) ConfigOption

WithAdaptiveMinRGCoverage sets the minimum number of row groups a term must cover to be eligible for promotion to the exact adaptive index. Terms below this threshold fall into the bucket layer.

func WithAdaptivePromotedTermCap added in v0.2.0

func WithAdaptivePromotedTermCap(cap int) ConfigOption

WithAdaptivePromotedTermCap caps the number of terms promoted to the exact adaptive index per high-cardinality path. Zero disables adaptive mode.
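Read together, the coverage options describe a window of eligibility for promotion. The predicate below is a hedged reading of that prose, with hypothetical names. It does not model how the promoted-term cap chooses among eligible terms, which is not described here.

```go
package main

import "fmt"

// eligibleForPromotion reports whether a term covering `coverage` of `numRGs`
// row groups falls inside the promotion window.
// Assumption: the floor is inclusive and terms strictly above the ceiling are excluded.
func eligibleForPromotion(coverage, numRGs, minCoverage int, ceiling float64) bool {
	if numRGs <= 0 || coverage < minCoverage {
		return false // too rare: goes to the bucket layer
	}
	fraction := float64(coverage) / float64(numRGs)
	return fraction <= ceiling // too ubiquitous terms also go to buckets
}

func main() {
	fmt.Println(eligibleForPromotion(5, 100, 2, 0.5))  // true
	fmt.Println(eligibleForPromotion(1, 100, 2, 0.5))  // false
	fmt.Println(eligibleForPromotion(80, 100, 2, 0.5)) // false
}
```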

func WithBoolNormalizeTransformer

func WithBoolNormalizeTransformer(path, alias string, opts ...TransformerOption) ConfigOption

func WithCustomDateTransformer

func WithCustomDateTransformer(path, alias, layout string, opts ...TransformerOption) ConfigOption

func WithCustomTransformer added in v0.2.0

func WithCustomTransformer(path, alias string, fn FieldTransformer, opts ...TransformerOption) ConfigOption

func WithDateTransformer

func WithDateTransformer(path, alias string, opts ...TransformerOption) ConfigOption

func WithDurationTransformer

func WithDurationTransformer(path, alias string, opts ...TransformerOption) ConfigOption

func WithEmailDomainTransformer

func WithEmailDomainTransformer(path, alias string, opts ...TransformerOption) ConfigOption

func WithFTSPaths

func WithFTSPaths(paths ...string) ConfigOption

func WithFieldTransformer

func WithFieldTransformer(path string, fn FieldTransformer) ConfigOption

func WithIPv4Transformer

func WithIPv4Transformer(path, alias string, opts ...TransformerOption) ConfigOption

func WithISODateTransformer

func WithISODateTransformer(path, alias string, opts ...TransformerOption) ConfigOption

func WithNumericBucketTransformer

func WithNumericBucketTransformer(path, alias string, size float64, opts ...TransformerOption) ConfigOption

func WithRegexExtractIntTransformer

func WithRegexExtractIntTransformer(path, alias, pattern string, group int, opts ...TransformerOption) ConfigOption

func WithRegexExtractTransformer

func WithRegexExtractTransformer(path, alias, pattern string, group int, opts ...TransformerOption) ConfigOption

func WithRegisteredTransformer

func WithRegisteredTransformer(path, alias string, id TransformerID, params []byte, opts ...TransformerOption) ConfigOption

func WithSemVerTransformer

func WithSemVerTransformer(path, alias string, opts ...TransformerOption) ConfigOption

func WithToLowerTransformer

func WithToLowerTransformer(path, alias string, opts ...TransformerOption) ConfigOption

func WithURLHostTransformer

func WithURLHostTransformer(path, alias string, opts ...TransformerOption) ConfigOption

type CustomDateParams

type CustomDateParams struct {
	Layout string `json:"layout"`
}

type DocID

type DocID uint64

DocID represents an external document identifier.

type DocIDCodec

type DocIDCodec interface {
	Encode(indices ...int) DocID
	Decode(docID DocID) []int
	Name() string
}

DocIDCodec encodes/decodes composite information into a single DocID.

type FieldTransformer

type FieldTransformer func(value any) (any, bool)

FieldTransformer transforms a value before indexing. Returns (transformedValue, ok). If ok=false, the companion representation follows the registration's configured failure mode. Strict is the default.

func CustomDateToEpochMs

func CustomDateToEpochMs(layout string) FieldTransformer

CustomDateToEpochMs returns a transformer for custom date formats.

func NumericBucket

func NumericBucket(size float64) FieldTransformer

NumericBucket returns a transformer that buckets numeric values by size. Example: NumericBucket(100) transforms 150 -> 100, 250 -> 200.
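The documented examples are consistent with rounding down to the nearest multiple of size. The sketch below assumes floor semantics, which matters only for negative inputs. bucketBy is a hypothetical name and works on float64 directly rather than `any`.

```go
package main

import (
	"fmt"
	"math"
)

// bucketBy returns a function that rounds v down to a multiple of size.
func bucketBy(size float64) func(float64) float64 {
	return func(v float64) float64 {
		return math.Floor(v/size) * size
	}
}

func main() {
	b := bucketBy(100)
	fmt.Println(b(150), b(250)) // 100 200
}
```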

func ReconstructTransformer

func ReconstructTransformer(id TransformerID, params json.RawMessage) (FieldTransformer, error)

func RegexExtract

func RegexExtract(pattern string, group int) FieldTransformer

RegexExtract returns a transformer that extracts a substring via regex capture group. Pattern is compiled once at config time. Group 0 = full match, group 1+ = capture groups.

func RegexExtractInt

func RegexExtractInt(pattern string, group int) FieldTransformer

RegexExtractInt extracts a substring via regex and converts it to float64.

type GINBuilder

type GINBuilder struct {
	// contains filtered or unexported fields
}

func NewBuilder

func NewBuilder(config GINConfig, numRGs int, opts ...BuilderOption) (*GINBuilder, error)

func (*GINBuilder) AddDocument

func (b *GINBuilder) AddDocument(docID DocID, jsonDoc []byte) error

func (*GINBuilder) Finalize

func (b *GINBuilder) Finalize() *GINIndex

type GINConfig

type GINConfig struct {
	CardinalityThreshold    uint32
	BloomFilterSize         uint32
	BloomFilterHashes       uint8
	EnableTrigrams          bool
	TrigramMinLength        int
	HLLPrecision            uint8
	PrefixBlockSize         int
	AdaptiveMinRGCoverage   int
	AdaptivePromotedTermCap int
	AdaptiveCoverageCeiling float64
	AdaptiveBucketCount     int
	// contains filtered or unexported fields
}

func DefaultConfig

func DefaultConfig() GINConfig

func NewConfig

func NewConfig(opts ...ConfigOption) (GINConfig, error)

func (GINConfig) AdaptiveEnabled added in v0.2.0

func (c GINConfig) AdaptiveEnabled() bool

AdaptiveEnabled reports whether adaptive high-cardinality indexing is enabled.

type GINIndex

type GINIndex struct {
	// GINIndex is immutable after `Finalize()` or `Decode()`; pathLookup is
	// derived, non-serialized state rebuilt once and then treated as read-only.
	Header                Header
	PathDirectory         []PathEntry
	GlobalBloom           *BloomFilter
	StringIndexes         map[uint16]*StringIndex
	AdaptiveStringIndexes map[uint16]*AdaptiveStringIndex
	NumericIndexes        map[uint16]*NumericIndex
	NullIndexes           map[uint16]*NullIndex
	TrigramIndexes        map[uint16]*TrigramIndex
	StringLengthIndexes   map[uint16]*StringLengthIndex
	PathCardinality       map[uint16]*HyperLogLog
	DocIDMapping          []DocID
	Config                *GINConfig
	// contains filtered or unexported fields
}

func BuildFromParquet

func BuildFromParquet(parquetFile string, jsonColumn string, config GINConfig) (*GINIndex, error)

func BuildFromParquetReader

func BuildFromParquetReader(parquetFile string, jsonColumn string, config GINConfig, reader io.ReaderAt, size int64) (*GINIndex, error)

func Decode

func Decode(data []byte) (*GINIndex, error)

Decode deserializes an index, validates cross-structure path references, and canonicalizes supported JSONPath spellings in PathDirectory while rebuilding derived lookup state.

func DecodeFromMetadata

func DecodeFromMetadata(value string) (*GINIndex, error)

func LoadIndex

func LoadIndex(parquetFile string, cfg ParquetConfig) (*GINIndex, error)

func LoadIndexReader

func LoadIndexReader(parquetFile string, cfg ParquetConfig, reader io.ReaderAt, size int64) (*GINIndex, error)

func NewGINIndex

func NewGINIndex() *GINIndex

func ReadFromParquetMetadata

func ReadFromParquetMetadata(parquetFile string, cfg ParquetConfig) (*GINIndex, error)

func ReadFromParquetMetadataReader

func ReadFromParquetMetadataReader(parquetFile string, cfg ParquetConfig, reader io.ReaderAt, size int64) (*GINIndex, error)

func ReadSidecar

func ReadSidecar(parquetFile string) (*GINIndex, error)

func (*GINIndex) Evaluate

func (idx *GINIndex) Evaluate(predicates []Predicate) *RGSet

func (*GINIndex) MatchingDocIDs

func (idx *GINIndex) MatchingDocIDs(rgSet *RGSet) []DocID

func (*GINIndex) Representations added in v0.2.0

func (idx *GINIndex) Representations(path string) []RepresentationInfo

type Header

type Header struct {
	Magic             [4]byte
	Version           uint16
	Flags             uint16
	NumRowGroups      uint32
	NumDocs           uint64
	NumPaths          uint32
	CardinalityThresh uint32
}

type HyperLogLog

type HyperLogLog struct {
	// contains filtered or unexported fields
}

HyperLogLog implements the HyperLogLog algorithm for cardinality estimation. It uses 2^precision registers to estimate the number of distinct elements.

func HyperLogLogFromRegisters

func HyperLogLogFromRegisters(registers []uint8, precision uint8) *HyperLogLog

func MustNewHyperLogLog

func MustNewHyperLogLog(precision uint8, opts ...HyperLogLogOption) *HyperLogLog

func NewHyperLogLog

func NewHyperLogLog(precision uint8, opts ...HyperLogLogOption) (*HyperLogLog, error)

NewHyperLogLog creates a new HyperLogLog with the given precision, which must be between 4 and 16. Higher precision gives better accuracy at the cost of more memory: the sketch uses 2^precision bytes, and the standard error is 1.04 / sqrt(m), where m = 2^precision.

func (*HyperLogLog) Add

func (hll *HyperLogLog) Add(data []byte)

func (*HyperLogLog) AddString

func (hll *HyperLogLog) AddString(s string)

func (*HyperLogLog) Clear

func (hll *HyperLogLog) Clear()

func (*HyperLogLog) Clone

func (hll *HyperLogLog) Clone() *HyperLogLog

func (*HyperLogLog) Estimate

func (hll *HyperLogLog) Estimate() uint64

func (*HyperLogLog) Merge

func (hll *HyperLogLog) Merge(other *HyperLogLog)

func (*HyperLogLog) Precision

func (hll *HyperLogLog) Precision() uint8

func (*HyperLogLog) Registers

func (hll *HyperLogLog) Registers() []uint8

type HyperLogLogOption

type HyperLogLogOption func(*HyperLogLog) error

type IdentityCodec

type IdentityCodec struct{}

IdentityCodec treats the position as the DocID (1:1 mapping).

func NewIdentityCodec

func NewIdentityCodec() *IdentityCodec

func (*IdentityCodec) Decode

func (c *IdentityCodec) Decode(docID DocID) []int

func (*IdentityCodec) Encode

func (c *IdentityCodec) Encode(indices ...int) DocID

func (*IdentityCodec) Name

func (c *IdentityCodec) Name() string

type JSONPathError

type JSONPathError struct {
	Path    string
	Message string
}

func (*JSONPathError) Error

func (e *JSONPathError) Error() string

type NGramConfig

type NGramConfig struct {
	N       int
	Padding string
}

type NGramOption

type NGramOption func(*NGramConfig) error

func WithN

func WithN(n int) NGramOption

func WithPadding

func WithPadding(pad string) NGramOption

type NullIndex

type NullIndex struct {
	NullRGBitmap    *RGSet
	PresentRGBitmap *RGSet
}

type NumericBucketParams

type NumericBucketParams struct {
	Size float64 `json:"size"`
}

type NumericIndex

type NumericIndex struct {
	// ValueType is the numeric storage mode: int-only or float/mixed.
	ValueType    NumericValueType
	IntGlobalMin int64
	IntGlobalMax int64
	GlobalMin    float64
	GlobalMax    float64
	RGStats      []RGNumericStat
}

type NumericValueType added in v0.2.0

type NumericValueType uint8

const (
	NumericValueTypeIntOnly NumericValueType = iota
	NumericValueTypeFloatMixed
)

type Operator

type Operator uint8

const (
	OpEQ Operator = iota
	OpNE
	OpGT
	OpLT
	OpGTE
	OpLTE
	OpIN
	OpNIN
	OpIsNull
	OpIsNotNull
	OpContains
	OpRegex
)

func (Operator) String

func (o Operator) String() string

type ParquetConfig

type ParquetConfig struct {
	MetadataKey string
}

func DefaultParquetConfig

func DefaultParquetConfig() ParquetConfig

type ParquetIndexWriter

type ParquetIndexWriter struct {
	// contains filtered or unexported fields
}

func NewParquetIndexWriter

func NewParquetIndexWriter(w io.Writer, schema *parquet.Schema, jsonColumn string, numRowGroups int, ginConfig GINConfig, pqConfig ParquetConfig) (*ParquetIndexWriter, error)

type PathEntry

type PathEntry struct {
	PathID        uint16
	PathName      string
	ObservedTypes uint8
	Cardinality   uint32
	// Mode is the exclusive string-evaluation mode for this path.
	Mode  PathMode
	Flags uint8
	// AdaptivePromotedTerms and AdaptiveBucketCount are derived metadata
	// populated from the adaptive section at decode time. They are not
	// persisted in the path directory; encoders must not rely on them.
	AdaptivePromotedTerms uint16
	AdaptiveBucketCount   uint16
}

type PathMode added in v0.2.0

type PathMode uint8

PathMode is the exclusive storage mode for a path entry. The zero value is the classic exact mode.

const (
	// PathModeClassic keeps the full exact string index for a path.
	// Its user-facing string label remains "exact" because that describes the
	// query semantics more clearly than the internal mode name.
	PathModeClassic PathMode = iota
	// PathModeBloomOnly stores no exact term index and answers via bloom-filter fallback.
	PathModeBloomOnly
	// PathModeAdaptiveHybrid stores promoted exact terms plus lossy tail buckets.
	PathModeAdaptiveHybrid
)

func (PathMode) IsValid added in v0.2.0

func (m PathMode) IsValid() bool

IsValid reports whether m is one of the declared PathMode constants. Decoders should call this on every byte read from disk before trusting the value; values outside the declared range indicate a corrupt payload.
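
The validate-before-trust pattern described above might look like the following. This is a minimal sketch: pathMode and its constants are local stand-ins that mirror the three declared PathMode values so the example compiles on its own, and decodePathMode is a hypothetical helper, not the package's decoder.

```go
package main

import "fmt"

// pathMode is a local stand-in mirroring the three PathMode constants.
type pathMode uint8

const (
	modeClassic pathMode = iota
	modeBloomOnly
	modeAdaptiveHybrid
)

// isValid reports whether m is one of the declared constants.
func (m pathMode) isValid() bool { return m <= modeAdaptiveHybrid }

// decodePathMode checks a raw byte read from disk before converting it,
// so an out-of-range value surfaces as an error instead of a bogus mode.
func decodePathMode(b byte) (pathMode, error) {
	m := pathMode(b)
	if !m.isValid() {
		return 0, fmt.Errorf("corrupt payload: unknown path mode %d", b)
	}
	return m, nil
}

func main() {
	for _, b := range []byte{0, 2, 7} {
		m, err := decodePathMode(b)
		fmt.Println(b, m, err)
	}
}
```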

func (PathMode) String added in v0.2.0

func (m PathMode) String() string

String returns the user-facing label used in CLI output and diagnostics.

type Predicate

type Predicate struct {
	Path     string
	Operator Operator
	Value    any
}

func Contains

func Contains(path string, pattern string) Predicate

func EQ

func EQ(path string, value any) Predicate

func GT

func GT(path string, value any) Predicate

func GTE

func GTE(path string, value any) Predicate

func IN

func IN(path string, values ...any) Predicate

func InSubnet

func InSubnet(path, cidr string) []Predicate

InSubnet creates predicates using the conventional IPv4 companion alias "ipv4_int". Use InSubnetAs when a path is configured with a different alias.

func InSubnetAs added in v0.2.0

func InSubnetAs(path, alias, cidr string) []Predicate

InSubnetAs creates predicates that check whether an IP field (transformed with IPv4ToInt under the provided alias) falls within a CIDR subnet range. Example: InSubnetAs("$.client_ip", "ipv4_int", "192.168.1.0/24") returns predicates covering 192.168.1.0 through 192.168.1.255. It panics if the CIDR is invalid; use CIDRToRange when you need error handling.
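
The integer range behind such a subnet check can be computed with the standard library alone. The sketch below is an illustration under assumptions: ipv4Range is a hypothetical helper (the package's own CIDRToRange may have a different signature), and the resulting bounds would presumably feed a GTE/LTE pair on the aliased integer path.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// ipv4Range returns the first and last address of an IPv4 CIDR block as
// big-endian unsigned integers. Hypothetical helper for illustration.
func ipv4Range(cidr string) (lo, hi uint32, err error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return 0, 0, err
	}
	if !p.Addr().Is4() {
		return 0, 0, fmt.Errorf("%s is not an IPv4 prefix", cidr)
	}
	a := p.Masked().Addr().As4() // network address with host bits cleared
	lo = binary.BigEndian.Uint32(a[:])
	hostMask := uint32(0xFFFFFFFF) >> p.Bits() // ones in the host portion
	return lo, lo | hostMask, nil
}

func main() {
	lo, hi, err := ipv4Range("192.168.1.0/24")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(lo, hi)
}
```

Returning an error here, rather than panicking, matches the advice above to prefer an error-returning path when the CIDR comes from untrusted input.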

func IsNotNull

func IsNotNull(path string) Predicate

func IsNull

func IsNull(path string) Predicate

func LT

func LT(path string, value any) Predicate

func LTE

func LTE(path string, value any) Predicate

func NE

func NE(path string, value any) Predicate

func NIN

func NIN(path string, values ...any) Predicate

func Regex

func Regex(path string, pattern string) Predicate

func (Predicate) String

func (p Predicate) String() string

type PrefixCompressor

type PrefixCompressor struct {
	// contains filtered or unexported fields
}

PrefixCompressor implements front-coding compression for sorted string lists. Each string is stored as: shared prefix length + suffix. This works well for sorted terms that share common prefixes.
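
Front coding itself is easy to sketch without the package. The example below is a simplified illustration, not the library's implementation: it ignores block boundaries (the role of blockSize is not shown here), and entry, frontEncode, and frontDecode are hypothetical names.

```go
package main

import "fmt"

// entry records how many leading bytes a term shares with the previous
// term, plus the remaining suffix (compare PrefixEntry in this package).
type entry struct {
	prefixLen int
	suffix    string
}

// frontEncode front-codes a list of terms that is already sorted.
func frontEncode(sorted []string) []entry {
	out := make([]entry, 0, len(sorted))
	prev := ""
	for _, s := range sorted {
		n := 0
		for n < len(prev) && n < len(s) && prev[n] == s[n] {
			n++
		}
		out = append(out, entry{prefixLen: n, suffix: s[n:]})
		prev = s
	}
	return out
}

// frontDecode rebuilds the original terms from front-coded entries.
func frontDecode(entries []entry) []string {
	out := make([]string, 0, len(entries))
	prev := ""
	for _, e := range entries {
		s := prev[:e.prefixLen] + e.suffix
		out = append(out, s)
		prev = s
	}
	return out
}

func main() {
	terms := []string{"user", "user_id", "user_name", "uuid"}
	enc := frontEncode(terms)
	fmt.Println(enc)
	fmt.Println(frontDecode(enc))
}
```

Because every entry depends on the one before it, decoding is sequential; splitting the list into fixed-size blocks that each start from a full term is the usual way to bound that cost.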

func MustNewPrefixCompressor

func MustNewPrefixCompressor(blockSize int, opts ...PrefixCompressorOption) *PrefixCompressor

func NewPrefixCompressor

func NewPrefixCompressor(blockSize int, opts ...PrefixCompressorOption) (*PrefixCompressor, error)

func (*PrefixCompressor) BlockSize

func (pc *PrefixCompressor) BlockSize() int

func (*PrefixCompressor) Compress

func (pc *PrefixCompressor) Compress(terms []string) []CompressedTermBlock

func (*PrefixCompressor) Decompress

func (pc *PrefixCompressor) Decompress(blocks []CompressedTermBlock) []string

type PrefixCompressorOption

type PrefixCompressorOption func(*PrefixCompressor) error

type PrefixEntry

type PrefixEntry struct {
	PrefixLen uint16
	Suffix    string
}

type RGNumericStat

type RGNumericStat struct {
	IntMin   int64
	IntMax   int64
	Min      float64
	Max      float64
	HasValue bool
}

type RGSet

type RGSet struct {
	NumRGs int
	// contains filtered or unexported fields
}

func AllRGs

func AllRGs(numRGs int) *RGSet

func MustNewRGSet

func MustNewRGSet(numRGs int, opts ...RGSetOption) *RGSet

func NewRGSet

func NewRGSet(numRGs int, opts ...RGSetOption) (*RGSet, error)

func NoRGs

func NoRGs(numRGs int) *RGSet

func RGSetFromRoaring

func RGSetFromRoaring(bitmap *roaring.Bitmap, numRGs int) *RGSet

func (*RGSet) All

func (rs *RGSet) All() *RGSet

func (*RGSet) Clear

func (rs *RGSet) Clear(rgID int)

func (*RGSet) Clone

func (rs *RGSet) Clone() *RGSet

func (*RGSet) Count

func (rs *RGSet) Count() int

func (*RGSet) Intersect

func (rs *RGSet) Intersect(other *RGSet) *RGSet

func (*RGSet) Invert

func (rs *RGSet) Invert() *RGSet

func (*RGSet) IsEmpty

func (rs *RGSet) IsEmpty() bool

func (*RGSet) IsSet

func (rs *RGSet) IsSet(rgID int) bool

func (*RGSet) Roaring

func (rs *RGSet) Roaring() *roaring.Bitmap

func (*RGSet) Set

func (rs *RGSet) Set(rgID int)

func (*RGSet) ToSlice

func (rs *RGSet) ToSlice() []int

func (*RGSet) Union

func (rs *RGSet) Union(other *RGSet) *RGSet

func (*RGSet) UnionWith added in v0.2.0

func (rs *RGSet) UnionWith(other *RGSet)

UnionWith merges other into rs in place, avoiding the per-call clone of Union. Use this when the receiver is exclusively owned and the result does not need to preserve the prior state.
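
The difference between the cloning and in-place variants can be illustrated with a plain bitset standing in for RGSet. This is a sketch under assumptions: bitset and its methods are hypothetical, both operands are assumed to have the same size, and the real RGSet may be implemented differently.

```go
package main

import "fmt"

// bitset is a minimal stand-in for RGSet, used only for illustration.
type bitset struct{ words []uint64 }

func newBitset(n int) *bitset { return &bitset{words: make([]uint64, (n+63)/64)} }

func (b *bitset) set(i int) { b.words[i/64] |= uint64(1) << uint(i%64) }

func (b *bitset) isSet(i int) bool { return b.words[i/64]&(uint64(1)<<uint(i%64)) != 0 }

// union copies the receiver, merges o into the copy, and returns it.
// Both inputs are left untouched, at the cost of one allocation per call.
func (b *bitset) union(o *bitset) *bitset {
	c := &bitset{words: append([]uint64(nil), b.words...)}
	c.unionWith(o)
	return c
}

// unionWith merges o into the receiver in place, without allocating.
func (b *bitset) unionWith(o *bitset) {
	for i, w := range o.words {
		b.words[i] |= w
	}
}

func main() {
	acc := newBitset(128) // exclusively owned accumulator
	for _, rg := range []int{3, 70, 100} {
		part := newBitset(128)
		part.set(rg)
		acc.unionWith(part) // no per-iteration clone
	}
	fmt.Println(acc.isSet(3), acc.isSet(70), acc.isSet(4))
}
```

The in-place form suits accumulator loops like the one in main; the cloning form is the safer choice whenever another caller may still hold the receiver.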

type RGSetOption

type RGSetOption func(*RGSet) error

type RGStringLengthStat

type RGStringLengthStat struct {
	Min      uint32
	Max      uint32
	HasValue bool
}

type RegexLiteralInfo

type RegexLiteralInfo struct {
	Literals    []string // Extracted literal strings
	HasWildcard bool     // Pattern contains unbounded wildcards
	MinLength   int      // Minimum length of any literal
}

RegexLiteralInfo contains information extracted from a regex pattern.

func AnalyzeRegex

func AnalyzeRegex(pattern string) (*RegexLiteralInfo, error)

AnalyzeRegex extracts literals and metadata from a regex pattern.
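
How AnalyzeRegex extracts its literals is not shown on this page. As a rough illustration of the general idea, the sketch below walks the parse tree from the standard regexp/syntax package and collects every literal run it finds. The literals function is hypothetical, and it deliberately ignores whether a literal is required (for example, literals inside an alternation or an optional group are collected too), which a real pruning index would have to account for.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// literals returns every literal run found in the pattern's parse tree,
// in left-to-right order. Hypothetical helper for illustration only.
func literals(pattern string) ([]string, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return nil, err
	}
	var out []string
	var walk func(r *syntax.Regexp)
	walk = func(r *syntax.Regexp) {
		if r.Op == syntax.OpLiteral {
			out = append(out, string(r.Rune))
			return
		}
		for _, sub := range r.Sub {
			walk(sub)
		}
	}
	walk(re)
	return out, nil
}

func main() {
	lits, err := literals("error.*timeout")
	fmt.Println(lits, err)
}
```

Literals long enough to yield at least one trigram can then serve as candidates for a trigram lookup, with the full regex applied afterwards to the surviving row groups.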

type RegexParams

type RegexParams struct {
	Pattern string `json:"pattern"`
	Group   int    `json:"group"`
}

type RepresentationInfo added in v0.2.0

type RepresentationInfo struct {
	SourcePath  string
	Alias       string
	Transformer string
}

type RepresentationSpec added in v0.2.0

type RepresentationSpec struct {
	SourcePath   string          `json:"source_path"`
	Alias        string          `json:"alias"`
	TargetPath   string          `json:"target_path"`
	Transformer  TransformerSpec `json:"transformer"`
	Serializable bool            `json:"serializable"`
}

type RepresentationValue added in v0.2.0

type RepresentationValue struct {
	Alias string
	Value any
}

func As added in v0.2.0

func As(alias string, value any) RepresentationValue

type RowGroupCodec

type RowGroupCodec struct {
	// contains filtered or unexported fields
}

RowGroupCodec encodes a file index and a row-group index into a single DocID. Layout: DocID = fileIndex * rowGroupsPerFile + rgIndex.
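
The layout above is plain mixed-radix arithmetic, sketched below without the package. The encode and decode functions are hypothetical, and uint64 is assumed for the ID type because the underlying type of DocID is not shown here. The round trip only holds when rgIndex is smaller than rowGroupsPerFile.

```go
package main

import "fmt"

// encode packs (fileIndex, rgIndex) into one ID using the documented
// layout: fileIndex * rowGroupsPerFile + rgIndex.
func encode(fileIndex, rgIndex, rowGroupsPerFile int) uint64 {
	return uint64(fileIndex*rowGroupsPerFile + rgIndex)
}

// decode reverses encode with integer division and remainder.
func decode(id uint64, rowGroupsPerFile int) (fileIndex, rgIndex int) {
	n := uint64(rowGroupsPerFile)
	return int(id / n), int(id % n)
}

func main() {
	id := encode(3, 7, 20) // file 3, row group 7, 20 row groups per file
	f, rg := decode(id, 20)
	fmt.Println(id, f, rg)
}
```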

func NewRowGroupCodec

func NewRowGroupCodec(rowGroupsPerFile int) *RowGroupCodec

func (*RowGroupCodec) Decode

func (c *RowGroupCodec) Decode(docID DocID) []int

func (*RowGroupCodec) Encode

func (c *RowGroupCodec) Encode(indices ...int) DocID

func (*RowGroupCodec) Name

func (c *RowGroupCodec) Name() string

func (*RowGroupCodec) RowGroupsPerFile

func (c *RowGroupCodec) RowGroupsPerFile() int

type S3Client

type S3Client struct {
	// contains filtered or unexported fields
}

func NewS3Client

func NewS3Client(cfg S3Config) (*S3Client, error)

func NewS3ClientFromEnv

func NewS3ClientFromEnv() (*S3Client, error)

func (*S3Client) BuildFromParquet

func (c *S3Client) BuildFromParquet(bucket, key, jsonColumn string, ginCfg GINConfig) (*GINIndex, error)

func (*S3Client) Exists

func (c *S3Client) Exists(bucket, key string) (bool, error)

func (*S3Client) GetObjectSize

func (c *S3Client) GetObjectSize(bucket, key string) (int64, error)

func (*S3Client) HasGINIndex

func (c *S3Client) HasGINIndex(bucket, key string, cfg ParquetConfig) (bool, error)

func (*S3Client) HasSidecar

func (c *S3Client) HasSidecar(bucket, parquetKey string) (bool, error)

func (*S3Client) ListGINFiles

func (c *S3Client) ListGINFiles(bucket, prefix string) ([]string, error)

func (*S3Client) ListParquetFiles

func (c *S3Client) ListParquetFiles(bucket, prefix string) ([]string, error)

func (*S3Client) LoadIndex

func (c *S3Client) LoadIndex(bucket, parquetKey string, cfg ParquetConfig) (*GINIndex, error)

func (*S3Client) OpenParquet

func (c *S3Client) OpenParquet(bucket, key string) (*parquet.File, io.ReaderAt, int64, error)

func (*S3Client) ReadFile

func (c *S3Client) ReadFile(bucket, key string) ([]byte, error)

func (*S3Client) ReadFromParquetMetadata

func (c *S3Client) ReadFromParquetMetadata(bucket, key string, cfg ParquetConfig) (*GINIndex, error)

func (*S3Client) ReadSidecar

func (c *S3Client) ReadSidecar(bucket, parquetKey string) (*GINIndex, error)

func (*S3Client) WriteFile

func (c *S3Client) WriteFile(bucket, key string, data []byte) error

func (*S3Client) WriteSidecar

func (c *S3Client) WriteSidecar(bucket, parquetKey string, idx *GINIndex) error

type S3Config

type S3Config struct {
	Endpoint  string
	Region    string
	AccessKey string
	SecretKey string
	PathStyle bool
}

func S3ConfigFromEnv

func S3ConfigFromEnv() S3Config

type SerializedConfig

type SerializedConfig struct {
	BloomFilterSize         uint32            `json:"bloom_filter_size"`
	BloomFilterHashes       uint8             `json:"bloom_filter_hashes"`
	EnableTrigrams          bool              `json:"enable_trigrams"`
	TrigramMinLength        int               `json:"trigram_min_length"`
	HLLPrecision            uint8             `json:"hll_precision"`
	PrefixBlockSize         int               `json:"prefix_block_size"`
	AdaptiveMinRGCoverage   int               `json:"adaptive_min_rg_coverage"`
	AdaptivePromotedTermCap int               `json:"adaptive_promoted_term_cap"`
	AdaptiveCoverageCeiling float64           `json:"adaptive_coverage_ceiling"`
	AdaptiveBucketCount     int               `json:"adaptive_bucket_count"`
	FTSPaths                []string          `json:"fts_paths,omitempty"`
	Transformers            []TransformerSpec `json:"transformers,omitempty"`
}

type StringIndex

type StringIndex struct {
	Terms     []string
	RGBitmaps []*RGSet
}

type StringLengthIndex

type StringLengthIndex struct {
	GlobalMin uint32
	GlobalMax uint32
	RGStats   []RGStringLengthStat
}

type TransformerFailureMode added in v0.2.0

type TransformerFailureMode string

const (
	TransformerFailureStrict TransformerFailureMode = "strict"
	TransformerFailureSoft   TransformerFailureMode = "soft_fail"
)

type TransformerID

type TransformerID uint8

const (
	TransformerUnknown TransformerID = iota
	TransformerISODateToEpochMs
	TransformerDateToEpochMs
	TransformerCustomDateToEpochMs
	TransformerToLower
	TransformerIPv4ToInt
	TransformerSemVerToInt
	TransformerRegexExtract
	TransformerRegexExtractInt
	TransformerDurationToMs
	TransformerEmailDomain
	TransformerURLHost
	TransformerNumericBucket
	TransformerBoolNormalize
)

type TransformerOption added in v0.2.0

type TransformerOption func(*transformerRegistrationOptions) error

func WithTransformerFailureMode added in v0.2.0

func WithTransformerFailureMode(mode TransformerFailureMode) TransformerOption

type TransformerSpec

type TransformerSpec struct {
	Path        string                 `json:"path"`
	Alias       string                 `json:"alias,omitempty"`
	TargetPath  string                 `json:"target_path,omitempty"`
	FailureMode TransformerFailureMode `json:"failure_mode,omitempty"`
	ID          TransformerID          `json:"id"`
	Name        string                 `json:"name"`
	Params      json.RawMessage        `json:"params,omitempty"`
}

func NewTransformerSpec

func NewTransformerSpec(path string, id TransformerID, params json.RawMessage) TransformerSpec

type TrigramIndex

type TrigramIndex struct {
	Trigrams  map[string]*RGSet
	NumRGs    int
	N         int
	Padding   string
	MinLength int
}

func MustNewTrigramIndex added in v0.2.0

func MustNewTrigramIndex(numRGs int, opts ...NGramOption) *TrigramIndex

func NewTrigramIndex

func NewTrigramIndex(numRGs int, opts ...NGramOption) (*TrigramIndex, error)

func (*TrigramIndex) Add

func (ti *TrigramIndex) Add(value string, rgID int)

func (*TrigramIndex) Search

func (ti *TrigramIndex) Search(pattern string) *RGSet

func (*TrigramIndex) TrigramCount

func (ti *TrigramIndex) TrigramCount() int

Directories

Path Synopsis
cmd
gin-index command
examples
basic command
Example: Basic GIN index usage with equality queries
full command
Example: Comprehensive GIN index usage demonstrating all index types and query operators
fulltext command
Example: Full-text search with trigram index (CONTAINS queries)
nested command
Example: Nested JSON objects and arrays
null command
Example: NULL handling queries
parquet command
range command
Example: Numeric range queries with GIN index
regex command
Example: Regex pattern matching with trigram-based candidate selection
serialize command
Example: Serializing and deserializing GIN index
transformers command
Example: Field transformers for date indexing
transformers-advanced command
Example: Advanced field transformers for IP ranges, semantic versions, emails, and regex extraction
