lsh/v2
Go library for Locality-Sensitive Hashing with two built-in use cases:
- String deduplication — MinHash signatures + Jaccard similarity
- Vector behavioural ID — random hyperplane signatures + cosine similarity
Both share a common core (banding, pooling, sharding) and a pluggable storage layer.
Architecture
```
lsh (root)           Core toolkit: banding, signature pools, lock sharding, prefix hashing
├── dedup/           String dedup use case (MinHash + Jaccard)
├── vector/          Vector behavioural ID use case (Hyperplane + Cosine)
└── repositories/    Storage interface + implementations
    ├── memory/      In-memory (for tests)
    └── aerospike/   Aerospike (for production)
```
The core package does not depend on repositories — it provides pure functions only.
Each use case owns its config, hasher, similarity functions, and record serialization.
Installation
```sh
go get github.com/FrogoAI/lsh/v2
```
Usage
String deduplication
Detects near-duplicate text. The flow: text -> shingles -> MinHash signature -> LSH bands -> representative lookup -> Jaccard verification.
```go
import (
	"context"

	"github.com/FrogoAI/lsh/v2/dedup"
	"github.com/FrogoAI/lsh/v2/repositories/memory"
)

repo := memory.NewRepository()
cfg := &dedup.Config{
	Bands:              40,
	Rows:               5,
	ShingleSize:        3,
	JaccardThreshold:   0.6,
	MaxTotalCandidates: 100,
	Seed:               13374269,
}
svc := dedup.NewService(repo, cfg)

id1, _ := svc.Upsert(context.Background(), "group1", "the quick brown fox")
id2, _ := svc.Upsert(context.Background(), "group1", "the quick brown fox jumps")
// id1 == id2 if Jaccard similarity >= threshold
```
Upsert returns the ID of the matching existing record if a near-duplicate is found, or a new deterministic ID if the input is novel.
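To make the flow concrete, here is a rough sketch of the first and last steps (character shingling and exact Jaccard verification). The library's actual shingling and hashing details may differ; this only illustrates the idea:

```go
package main

import "fmt"

// shingles splits text into overlapping character n-grams
// (ShingleSize=3 in the config above) and deduplicates them.
func shingles(text string, n int) map[string]struct{} {
	set := map[string]struct{}{}
	runes := []rune(text)
	for i := 0; i+n <= len(runes); i++ {
		set[string(runes[i:i+n])] = struct{}{}
	}
	return set
}

// jaccard is the exact similarity used for final verification:
// |A ∩ B| / |A ∪ B| over the shingle sets.
func jaccard(a, b map[string]struct{}) float64 {
	inter := 0
	for s := range a {
		if _, ok := b[s]; ok {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	a := shingles("the quick brown fox", 3)
	b := shingles("the quick brown fox jumps", 3)
	fmt.Printf("%.2f\n", jaccard(a, b)) // ≈ 0.74, above the 0.6 JaccardThreshold
}
```

This is why the two Upsert calls in the example resolve to the same ID: the exact Jaccard similarity of the two strings clears the 0.6 threshold.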
Vector behavioural ID
Assigns a behavioural cluster ID to a float64 feature vector using cosine similarity.
```go
import (
	"context"

	"github.com/FrogoAI/lsh/v2/repositories/memory"
	"github.com/FrogoAI/lsh/v2/vector"
)

repo := memory.NewRepository()
cfg := &vector.Config{
	Bands:              40,
	Rows:               5,
	VectorDimensions:   20,
	CosineThreshold:    0.5,
	MaxTotalCandidates: 100,
	Seed:               13374269,
}
svc := vector.NewService(repo, cfg)

// Two similar behavioural vectors get the same ID
id1, _ := svc.Upsert(context.Background(), "org1", bonusHunter1Vec)
id2, _ := svc.Upsert(context.Background(), "org1", bonusHunter2Vec)
// id1 == id2 if cosine similarity >= threshold
```
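The hyperplane signature itself is easy to sketch. The following is an illustrative stand-in (Gaussian hyperplanes drawn with math/rand), not the library's actual hasher:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// randomPlanes draws n hyperplanes with Gaussian components,
// seeded for reproducibility (cf. Config.Seed).
func randomPlanes(n, dims int, seed int64) [][]float64 {
	r := rand.New(rand.NewSource(seed))
	planes := make([][]float64, n)
	for i := range planes {
		planes[i] = make([]float64, dims)
		for j := range planes[i] {
			planes[i][j] = r.NormFloat64()
		}
	}
	return planes
}

// signature records, per hyperplane, which side of it the vector falls on.
func signature(v []float64, planes [][]float64) []uint8 {
	sig := make([]uint8, len(planes))
	for i, p := range planes {
		var dot float64
		for j := range v {
			dot += v[j] * p[j]
		}
		if dot >= 0 {
			sig[i] = 1
		}
	}
	return sig
}

// cosine is the exact similarity used for final verification.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	planes := randomPlanes(200, 20, 13374269) // Bands*Rows = 40*5 = 200
	v := make([]float64, 20)
	v[0], v[1] = 1, 0.5
	w := make([]float64, 20)
	w[0], w[1] = 1, 0.6 // nearly identical direction
	sv, sw := signature(v, planes), signature(w, planes)
	same := 0
	for i := range sv {
		if sv[i] == sw[i] {
			same++
		}
	}
	fmt.Printf("cosine=%.3f, matching signature bits=%d/200\n", cosine(v, w), same)
}
```

Vectors pointing in nearly the same direction land on the same side of almost every hyperplane, so their signatures agree bit-for-bit in most positions, which is what makes band collisions likely.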
Configuration
Both use cases read config from environment variables via caarlos0/env.
dedup (prefix LSH)
| Variable | Default | Description |
|---|---|---|
| LSH_BANDS | 40 | Number of LSH bands |
| LSH_ROWS | 5 | Rows per band |
| LSH_SHINGLE_SIZE | 3 | Character n-gram size |
| LSH_JAC_THRESHOLD | 0.6 | Jaccard similarity threshold |
| LSH_MAX_TOTAL_CANDIDATES | 100 | Max candidates to evaluate |
| LSH_SEED | 13374269 | RNG seed for hash functions |
```go
cfg, err := dedup.GetConfigFromEnv()
```
vector (prefix VLSH)
| Variable | Default | Description |
|---|---|---|
| VLSH_BANDS | 40 | Number of LSH bands |
| VLSH_ROWS | 5 | Rows per band |
| VLSH_VECTOR_DIMENSIONS | 20 | Expected vector dimensionality |
| VLSH_COS_THRESHOLD | 0.5 | Cosine similarity threshold |
| VLSH_MAX_TOTAL_CANDIDATES | 100 | Max candidates to evaluate |
| VLSH_SEED | 13374269 | RNG seed for hyperplanes |
```go
cfg, err := vector.GetConfigFromEnv()
```
Storage backends
Implement the repositories.Storage interface to plug in any backend:
```go
type Storage interface {
	SetRepresentative(bucketKey, memberID string, metadata int64) error
	BatchSetRepresentative(bucketKeys []string, memberID string, metadata int64) error
	GetRepresentatives(bucketKey string) ([]Representative, error)
	BatchGetRepresentatives(bucketKeys []string) (map[string][]Representative, error)
	SaveRecord(key string, bins map[string]any) error
	GetRecords(keys []string) ([]Record, error)
	PutValue(key, value string) error
	GetValue(key string) (string, error)
	Close()
}
```
Built-in implementations:
- repositories/memory — in-memory (tests and prototyping)
- repositories/aerospike — Aerospike (production)
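A custom backend only needs to satisfy the interface. The sketch below shows the set-semantics portion with a plain map; the Representative shape is assumed here (check the repositories package for the real definition), and the remaining methods (SaveRecord, GetRecords, BatchSetRepresentative, ...) follow the same pattern:

```go
package main

import (
	"fmt"
	"sync"
)

// Representative is an assumed shape; the real type lives in repositories.
type Representative struct {
	MemberID string
	Metadata int64
}

// mapStorage is a toy, partial Storage backend for illustration only;
// the built-in memory repository is the real implementation.
type mapStorage struct {
	mu      sync.Mutex
	buckets map[string]map[string]int64 // bucketKey -> memberID -> metadata
	kv      map[string]string           // for PutValue/GetValue (L2 cache)
}

func newMapStorage() *mapStorage {
	return &mapStorage{buckets: map[string]map[string]int64{}, kv: map[string]string{}}
}

// SetRepresentative upserts with set semantics: keyed by memberID, idempotent.
func (s *mapStorage) SetRepresentative(bucketKey, memberID string, metadata int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.buckets[bucketKey] == nil {
		s.buckets[bucketKey] = map[string]int64{}
	}
	s.buckets[bucketKey][memberID] = metadata
	return nil
}

func (s *mapStorage) GetRepresentatives(bucketKey string) ([]Representative, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	reps := make([]Representative, 0, len(s.buckets[bucketKey]))
	for id, meta := range s.buckets[bucketKey] {
		reps = append(reps, Representative{MemberID: id, Metadata: meta})
	}
	return reps, nil
}

func (s *mapStorage) PutValue(key, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.kv[key] = value
	return nil
}

func (s *mapStorage) GetValue(key string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.kv[key], nil
}

func main() {
	st := newMapStorage()
	_ = st.SetRepresentative("band0:slot7", "memberA", 1)
	_ = st.SetRepresentative("band0:slot7", "memberA", 2) // idempotent upsert, no duplicate
	reps, _ := st.GetRepresentatives("band0:slot7")
	fmt.Println(len(reps)) // 1
}
```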
How it works
Overview
```
Input (text or vector)
        |
1. Deterministic ID (SHA256)
        |
2. L1 cache: exact record lookup --> HIT: return ID
        |
3. L2 cache: resolved ID lookup --> HIT: return resolved ID
        |
4. Compute LSH signature (MinHash or Hyperplane)
        |
5. Split into B bands of R rows --> B bucket keys
        |
6. Fetch representatives from matching buckets (bounded, small)
        |
7. Compare input vs representatives:
   - Estimated similarity (fast, signature-based)
   - Exact similarity (precise, on full data)
        |
8a. MATCH: cache resolution, return existing ID
8b. NO MATCH: save as new record + add as representative to buckets
```
Step by step
1. Deterministic ID — The input is hashed (SHA256, truncated to 128 bits, base64url-encoded) to produce a memberID. The same input always produces the same ID.
2. L1 cache (record lookup) — If a record with this ID already exists in storage, return it immediately. This handles exact duplicate inputs with zero LSH overhead.
3. L2 cache (resolved lookup) — If this ID was previously resolved to another ID (i.e., it matched a representative), return the resolved ID. The L2 cache is persisted to storage (PutValue), surviving pod restarts.
4. Signature computation — The input is converted to a fixed-size signature:
   - Dedup: text -> character shingles -> MinHash signature (universal hashing with odd coefficients for implicit 2^64 overflow permutation)
   - Vector: float64 vector -> random hyperplane projections -> binary signature (0/1 per hyperplane)
5. Banding — The signature (B*R elements) is split into B bands of R rows each. Each band is hashed (xxhash) into a bucket key. This is the core LSH step: similar items collide in at least one band with high probability.
6. Representative lookup — Each bucket stores a bounded set of representatives (cluster leaders), not all members. This is the key design choice — see Representative-based storage below.
7. Similarity verification — Candidates are first filtered by estimated similarity (cheap signature comparison), then verified with exact similarity (Jaccard or cosine). This two-stage filter reduces expensive comparisons.
8. Resolution — If a match is found, the input's ID is mapped to the representative's ID and cached. If no match, the input becomes a new representative in all its buckets.
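Step 1 can be sketched directly from the description above. The encoding variant (RawURLEncoding, i.e. unpadded base64url) is an assumption; it is what yields the 22-character keys mentioned in the scaling analysis:

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// memberID hashes the input, keeps the first 128 bits (16 bytes) of the
// SHA256 digest, and base64url-encodes them without padding.
func memberID(input []byte) string {
	sum := sha256.Sum256(input)
	return base64.RawURLEncoding.EncodeToString(sum[:16])
}

func main() {
	id := memberID([]byte("the quick brown fox"))
	fmt.Println(id, len(id)) // same input -> same ID; length is 22
}
```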
Tuning
The probability of two items colliding in at least one band:
```
Jaccard (MinHash):    P = 1 - (1 - s^R)^B
Cosine (Hyperplane):  P = 1 - (1 - (1 - arccos(s)/pi)^R)^B
```
- Higher recall (catch more similar items) -> increase Bands
- Higher precision (fewer false positives) -> increase Rows
Use Config.CalculateApproximateThreshold() to see the similarity at which collision probability is ~50% for your settings.
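The two formulas can be evaluated directly. This sketch shows how the default Bands=40, Rows=5 behave at the example thresholds (illustrative helper functions, not library API):

```go
package main

import (
	"fmt"
	"math"
)

// pJaccard: probability that two items with Jaccard similarity s collide
// in at least one of B bands of R rows (MinHash).
func pJaccard(s float64, R, B int) float64 {
	return 1 - math.Pow(1-math.Pow(s, float64(R)), float64(B))
}

// pCosine: the hyperplane analogue, where a single hyperplane agrees
// with probability 1 - arccos(s)/pi.
func pCosine(s float64, R, B int) float64 {
	p := 1 - math.Acos(s)/math.Pi
	return 1 - math.Pow(1-math.Pow(p, float64(R)), float64(B))
}

func main() {
	// Defaults from the configuration tables: Bands=40, Rows=5.
	fmt.Printf("Jaccard s=0.6: P=%.3f\n", pJaccard(0.6, 5, 40)) // ~0.96
	fmt.Printf("Cosine  s=0.5: P=%.3f\n", pCosine(0.5, 5, 40))  // ~0.996
}
```

At the configured thresholds, a true match collides in at least one band with >95% probability, which is why verification (not band scanning) is the recall bottleneck.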
Representative-based storage
Problem: hot buckets
The traditional LSH approach stores all members in each bucket. This creates three problems:

1. Unbounded growth — With N users and only 2^R possible bucket values per band (e.g., 2^5 = 32 for binary hyperplane hashing), the average bucket holds N/32 members. At 1M users that is ~31,000 members per bucket.
2. Cascade failure — When buckets exceed a size limit, they get skipped during queries. Items that should match can't find their candidates, become "novel" erroneously, and get added to buckets, making them even bigger.
3. Storage pressure — In Aerospike, each bucket is one record with a CDT Map. A bucket with 31K entries (~1.4 MB) may exceed the write-block-size limit, cause write amplification, and transfer excessive data over the network.
Solution: store only representatives
Each bucket stores one entry per cluster (the cluster leader / representative), not one per user. When a new item arrives:
- If it matches a representative -> resolve to that representative's ID (no bucket write)
- If it's novel -> save record, add self as representative to buckets
The bucket uses set semantics (keyed by memberID, idempotent upsert). The number of entries per bucket is bounded by the number of distinct clusters, not the number of users.
Scaling characteristics
| Scenario | Members per bucket (old) | Representatives per bucket (new) |
|---|---|---|
| 1M users, 1K clusters, Rows=5 | ~31,250 | ~31 |
| 1M users, 10K clusters, Rows=5 | ~31,250 | ~312 |
| 1M users, 1K clusters, Rows=10 | ~977 | ~1 |
The representative count grows with the number of distinct clusters, not the number of users. This is typically orders of magnitude smaller.
Memory estimates (1M users, 1K clusters)
| Component | Old (all members) | New (representatives) |
|---|---|---|
| Bucket data | ~2.7 GB (1M x 40 buckets x 68B) | ~2.7 MB (1K x 40 buckets x 68B) |
| Record data | ~2.0 GB | ~2.0 GB (unchanged) |
| Per-query network | ~55 MB (40 x 1.4 MB) | ~55 KB (40 x 1.4 KB) |
| Aerospike record size | ~1.4 MB (may exceed limit) | ~2 KB |
Candidate ranking (multi-probe)
When querying, the system collects candidates from all bands, counts how many bands each candidate appears in, and ranks them by overlap count (descending). The top MaxTotalCandidates are selected for record loading and verification.
This is important because:
- Candidates sharing more bands are exponentially more likely to be true matches
- Without ranking, a naive "break after N candidates" approach degrades recall from 99.5% to 23.2% once buckets have >100 representatives
- Ranking preserves full recall (all bands searched) while bounding the expensive verification step
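The ranking step described above can be sketched as follows (illustrative function, not the library's internal API):

```go
package main

import (
	"fmt"
	"sort"
)

// rankCandidates counts, per candidate ID, how many bands it appeared in,
// then returns the top maxTotal candidates by overlap count (descending).
func rankCandidates(perBand [][]string, maxTotal int) []string {
	counts := map[string]int{}
	for _, band := range perBand {
		for _, id := range band {
			counts[id]++
		}
	}
	ids := make([]string, 0, len(counts))
	for id := range counts {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if counts[ids[i]] != counts[ids[j]] {
			return counts[ids[i]] > counts[ids[j]]
		}
		return ids[i] < ids[j] // deterministic tie-break
	})
	if len(ids) > maxTotal {
		ids = ids[:maxTotal]
	}
	return ids
}

func main() {
	// "a" appears in 3 bands, "b" in 2, "c" in 1.
	bands := [][]string{{"a", "b"}, {"a", "c"}, {"a"}, {"b"}}
	fmt.Println(rankCandidates(bands, 2)) // [a b]
}
```

All bands are still scanned, so nothing is missed; only the expensive record loading and verification are capped at MaxTotalCandidates.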
Scaling analysis (10M users)
Single record size (Bands=40, Rows=5, Dims=20)
| Component | Bytes |
|---|---|
| Key (base64url SHA256[:16]) | 22 |
| Vector (20 x float64) | 160 |
| Signature (200 x uint64) | 1,600 |
| Group string | ~20 |
| Overhead (headers, bin names) | ~90 |
| Total | ~1.9 KB |
Storage at scale
| Scenario | Records | Buckets | Resolved cache | Total |
|---|---|---|---|---|
| 1M users, 1K clusters | 1 MB | 1 MB | 93 MB | 95 MB |
| 1M users, 10K clusters | 10 MB | 9 MB | 93 MB | 112 MB |
| 10M users, 10K clusters | 10 MB | 9 MB | 934 MB | 953 MB |
| 10M users, 100K clusters | 104 MB | 86 MB | 925 MB | 1.1 GB |
| 10M users, 1M clusters | 1 GB | 858 MB | 841 MB | 2.7 GB |
The dominant cost at 10M users is the resolved cache (~93 bytes per resolved user). Records and buckets are proportional to cluster count, not user count.
Bucket size vs cluster count (Rows=5, 32 slots per band)
| Clusters | Reps/bucket | Bucket size | Fits Aerospike 1MB |
|---|---|---|---|
| 1,000 | 31 | 1.4 KB | yes |
| 10,000 | 312 | 14 KB | yes |
| 100,000 | 3,125 | 137 KB | yes |
| 1,000,000 | 31,250 | 1.4 MB | no |
Aerospike limit: default write-block-size is 1 MB. At >500K distinct clusters, bucket records exceed this. For high-cluster-count use cases, increase write-block-size to 8 MB or increase Rows (more slots per band).
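The Reps/bucket column is simple arithmetic: binary hyperplane banding gives 2^Rows slots per band, so representatives spread over that many buckets. A quick sketch reproducing the table:

```go
package main

import "fmt"

// bucketLoad estimates representatives per bucket for binary hyperplane
// banding: each band has 2^rows possible bucket values.
func bucketLoad(clusters, rows int) int {
	slots := 1 << rows // 2^rows
	return clusters / slots
}

func main() {
	for _, c := range []int{1_000, 10_000, 100_000, 1_000_000} {
		fmt.Printf("%9d clusters, Rows=5 -> ~%d reps/bucket\n", c, bucketLoad(c, 5))
	}
	// Increasing Rows multiplies the slot count: Rows=10 gives 1024 slots.
	fmt.Println(bucketLoad(1_000_000, 10)) // 976
}
```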
Recall vs cluster count
With multi-probe ranking, recall is maintained regardless of cluster count. The system searches all bands and ranks by overlap. MaxTotalCandidates limits only the verification step (record loading + similarity check), not band scanning.
| Clusters | MaxTotalCandidates | Recall (cos=0.5) |
|---|---|---|
| 20 | 100 | 100% |
| 100 | 50 | 97% |
| 200 | 20 | 97% |
Precision guarantee
The algorithm guarantees: if input V is resolved to representative R, then ExactCosine(V, R.Vector) >= CosineThreshold. This is enforced by the exact verification step. The estimated similarity filter (signature comparison) uses a margin to avoid false negatives but the final decision is always exact.
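The decision logic can be paraphrased as follows (the 0.1 margin comes from the Known issues section below; the function is a hypothetical simplification, not the real code path):

```go
package main

import "fmt"

// resolve applies the two-stage filter: a cheap estimated similarity with a
// margin prunes candidates, but only the exact similarity decides a match.
func resolve(estimated, exact, threshold, margin float64) bool {
	if estimated < threshold-margin {
		return false // pruned cheaply; the margin avoids false negatives here
	}
	return exact >= threshold // the final decision is always exact
}

func main() {
	// Estimate dips below threshold but stays within the margin: exact decides.
	fmt.Println(resolve(0.45, 0.55, 0.5, 0.1)) // true
	// Estimate passes, but the exact check fails: no match is reported.
	fmt.Println(resolve(0.55, 0.45, 0.5, 0.1)) // false
}
```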
Per-bucket eviction cap
Both repository implementations accept WithMaxBucketReps(n) to limit the number of representatives per bucket. This prevents bucket records from exceeding Aerospike's write-block-size (default 1 MB) regardless of cluster count.
```go
// Memory (tests)
repo := memory.NewRepository(memory.WithMaxBucketReps(10000))

// Aerospike (production)
repo := aerospike.NewRepository(client, ns, set, aerospike.WithMaxBucketReps(10000))
```
With MaxBucketReps=10000, each bucket record stays under ~450 KB, supporting up to 320K clusters before any eviction occurs. A value of 0 (the default) means unlimited.
When a bucket is at cap, new representatives are silently rejected (memory) or oldest entries are trimmed by key order (Aerospike). Existing entries can always be updated.
API note: userID vs vector-derived ID
The behavioural_id returned by Upsert is derived deterministically from the vector content (SHA256), not from a user identifier. This is intentional: the ID identifies a behavior pattern, not a user. Two different users with identical behavior vectors get the same behavioural_id.
The consumer (scoring engine) maintains the user -> behavioural_id mapping externally by storing the returned ID on the enriched event.
Known issues and mitigations
1. Binary hyperplane bucket cardinality
Issue: With Rows=R, each band of the hyperplane hasher produces only 2^R possible bucket values. For R=5, that's 32 buckets per band. All users map into these 32 slots.
Impact: Low — with representative-based storage, the bucket size is bounded by cluster count (~31 per bucket for 1K clusters), not user count. The 32-slot limit only matters if you have >32K distinct behavioural clusters.
Mitigation: If cluster count is very high, increase Rows (e.g., R=10 gives 1024 slots) and compensate recall loss by increasing Bands.
2. Representative coverage (triangle inequality)
Issue: If user C is similar to user B (cos=0.8), and B was resolved to representative A (cos=0.85), but C has cos=0.65 with A (below threshold), then C won't find its cluster and becomes a spurious new representative.
Impact: Medium — creates slightly more clusters than ideal. More common at threshold boundaries.
Mitigation:
- This naturally self-corrects: C becomes a representative, and future items similar to C will find it.
- Use a lower CosineThreshold/JaccardThreshold to create looser clusters.
- The two-stage filter (estimated + exact similarity) already uses a margin (cosineMargin=0.1, jaccardMargin=0.1) to avoid threshold edge effects on the estimated stage.
3. Cluster drift over time
Issue: The representative is the first item in a cluster. As more items join (resolved to the representative), the cluster's "true center" may drift, but the representative stays fixed.
Impact: Low — the representative's threshold defines cluster boundaries. Items near the boundary may flip between clusters across different representative choices.
Mitigation:
- Accept that clusters are defined by the representative, not a centroid.
- If precise centroids are needed, maintain them in a separate analytics pipeline.
- Bucket TTL (14 days) naturally refreshes representatives.
4. Representative TTL expiration
Issue: Buckets have a TTL (14 days in Aerospike). When a bucket expires, its representative is lost. New items that would have matched must re-discover the cluster.
Impact: Low — the representative's full record has a longer TTL (90 days). The L2 cache (PutValue) also survives bucket expiration. Only truly cold clusters (no queries in 14 days) need re-discovery.
Mitigation:
- L2 cache persists resolved mappings with record TTL (90 days).
- Re-discovery is automatic: the first item in a cluster after expiration becomes the new representative.
5. MinHash vs Hyperplane bucket distribution
Issue: MinHash (dedup) produces full uint64 hash values per row, giving effectively unlimited bucket cardinality. Hyperplane (vector) produces binary 0/1 values, giving only 2^R buckets per band. The same storage interface serves both, but their scaling profiles differ.
Impact: Low — representative-based storage handles both cases. Dedup buckets are naturally small; vector buckets are bounded by cluster count.
Mitigation: No action needed. The representative model adapts to both distributions.
6. Concurrent writes to the same bucket
Issue: Two concurrent Upsert calls may both determine they're "novel" and both add themselves as representatives to the same buckets. This is safe (set semantics, idempotent), but may create two representatives for what could be one cluster.
Impact: Low — group-level lock sharding (64 mutexes) reduces this. It can happen across different pods.
Mitigation:
- Same-group operations are serialized within a process via groupLocks.
- Cross-pod: Aerospike MapPutOp is atomic. Both writes succeed, creating two representatives. Future items may match either one. This is acceptable for probabilistic clustering.
Development
All shortcuts are in the Makefile:
```sh
make test              # Unit tests (no external deps)
make test-integration  # Integration tests (requires Aerospike)
make lint              # golangci-lint
make coverage          # Unit coverage + threshold check (>=70%)
make bench             # Run benchmarks (starts Aerospike automatically)
make bench-baseline    # Rebuild tracked baseline
make bench-compare     # Compare current vs baseline via benchstat
make aerospike-start   # Start Aerospike via podman
make aerospike-stop    # Stop Aerospike
make ci                # lint + coverage
```
License
See LICENSE for details.