lsh

package module
v1.0.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 11, 2026 License: Apache-2.0 Imports: 18 Imported by: 0

README

🚀 High-Performance MinHash LSH for Go

A production-grade Locality Sensitive Hashing (LSH) implementation engineered for high-throughput deduplication, fraud detection, and similarity search. This library focuses on extreme CPU efficiency and zero-allocation hot paths. ✨ Key Features

⚡ Odd-Coefficient Optimization: Replaces expensive modulo arithmetic with implicit 264 overflow. By enforcing odd coefficients, we guarantee a mathematically perfect permutation with zero CPU overhead.

♻️ Zero-Allocation Design: Utilizes sync.Pool for signature buffers and optimized slicing to eliminate GC pressure during heavy ingestion.

🛡️ Hot-Bucket Protection: Includes a "Peek & Stub" strategy for Aerospike/Database layers to prevent OOM crashes when encountering "hot" tokens (millions of matches).

🧩 Scalable Architecture: Decoupled storage interfaces supporting In-Memory and Aerospike (CDT-based) backends out of the box.

🧪 Mathematically Sound: Implements 3-char shingling and MinHash signatures with configurable P=1−(1−sr)b probability curves.

📦 Installation

go get github.com/FrogoAI/lsh

🚀 Quick Start

package main

import (
	"context"
	
	"github.com/FrogoAI/lsh"
	"github.com/FrogoAI/lsh/repositories/memory"
)

func main() {
	// 1. Setup Config
	cfg := &lsh.Config{
		Bands:            20,
		Rows:             5,
		ShingleSize:      3,
		JaccardThreshold: 0.6,
		Seed:             13374269,
	}

	// 2. Initialize Service
	repo := memory.New()
	service := lsh.NewSimilarityService(repo, cfg)

	// 3. Upsert a record
	// Returns a new unique record id, or the existing ID if it's a duplicate.
	id, _ := service.Upsert(context.Background(), "users", "John Doe")
}

📐 Tuning the LSH Curve

The probability of two items sharing a bucket depends on their Jaccard similarity s, the number of bands b, and rows per band r.

Higher Recall (Catch more): Increase Bands.

Higher Precision (Fewer false positives): Increase Rows.

🔬 Performance Optimizations The "Odd Coefficient" Trick

Standard MinHash often uses (ax+b)(modP). We utilize the CPU's natural behavior with 264 overflow. Because we ensure a is always odd, it remains coprime to 264, satisfying the requirement for a linear congruential generator to produce a full-period permutation. Aerospike CDT Implementation

The Aerospike repository uses Map CDTs (Collection Data Types). Instead of fetching a whole list of IDs, we use MapSizeOp to check the density of a bucket server-side. If a bucket is too "hot," we skip it without ever transferring the data over the wire.

Documentation

Index

Constants

View Source
const (
	EnvPrefix = "LSH"
)

Variables

View Source
var (
	ErrSignatureTooShort    = errors.New("signature too short")
	ErrShingleResultIsEmpty = errors.New("error shingle result is empty")
	ErrEmptyInputString     = errors.New("empty input string")
)

Functions

func EstimateJaccard added in v1.0.0

func EstimateJaccard(sig1, sig2 []uint64) float64

Types

type Config

type Config struct {
	Bands              int     `env:"_BANDS" envDefault:"40"`
	Rows               int     `env:"_ROWS" envDefault:"5"`
	ShingleSize        int     `env:"_SHINGLE_SIZE" envDefault:"3"`
	JaccardThreshold   float64 `env:"_JAC_THRESHOLD" envDefault:"0.6"`
	MaxBucketSize      int     `env:"_MAX_BUCKET_SIZE" envDefault:"200"`
	MaxTotalCandidates int     `env:"_MAX_TOTAL_CANDIDATES" envDefault:"100"`
	Seed               int64   `env:"_SEED" envDefault:"13374269"`
}

Config defines the parameters for the LSH (Locality-Sensitive Hashing) pipeline.

- Bands and Rows control how MinHash signatures are split into band keys:

  • signature size = Bands * Rows

  • increasing Bands (with fixed Rows) generally increases recall: more chances for similar items to land in at least one common bucket.

  • increasing Rows (with fixed Bands) generally increases precision: a stricter requirement to match within a band.

    Together they define an *approximate* similarity level where collisions become likely (often estimated as s ≈ (1/Bands)^(1/Rows)). This is a probabilistic candidate generator.

  • JaccardThreshold is the final similarity filter used after candidate generation. We intentionally keep the LSH bucketing stage “looser” (i.e., allowing candidates at a lower similarity) so that *similar* items are saved/located via similar buckets. Then JaccardThreshold decides whether a candidate is “similar enough” to return the same ID or should be treated as different (new) ID.

func GetLSHConfigFromEnv

func GetLSHConfigFromEnv() (*Config, error)

func (*Config) CalculateApproximateThreshold added in v1.0.1

func (c *Config) CalculateApproximateThreshold() float64

CalculateApproximateThreshold computes the approximate Jaccard similarity threshold at which the LSH configuration (Bands and Rows) is most sensitive. This is the point where the probability of two items being hashed to the same bucket begins to rise sharply. The formula is s ≈ (1/B)^(1/R).

func (*Config) HashVersion

func (c *Config) HashVersion(group string) (string, error)

type Hasher

type Hasher struct {
	// contains filtered or unexported fields
}

func NewHasher

func NewHasher(bands, rows int, seed int64) *Hasher

func (*Hasher) ComputeBands

func (h *Hasher) ComputeBands(signature []uint64) ([]string, error)

func (*Hasher) ComputeSignature

func (h *Hasher) ComputeSignature(tokens set.GenericDataSet[string], sig []uint64)

type SimilarityService

type SimilarityService struct {
	// contains filtered or unexported fields
}

func NewSimilarityService

func NewSimilarityService(repo repositories.Storage, config *Config) *SimilarityService

func (*SimilarityService) CalculateJaccardOptimized

func (s *SimilarityService) CalculateJaccardOptimized(sourceSet set.GenericDataSet[string], targetStr string) float64

func (*SimilarityService) GetNewID

func (s *SimilarityService) GetNewID(input string) (string, error)

func (*SimilarityService) Shingle

func (s *SimilarityService) Shingle(input string) set.GenericDataSet[string]

func (*SimilarityService) Upsert

func (s *SimilarityService) Upsert(ctx context.Context, group, input string) (string, error)

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL