minhash

package
v1.0.0-rc.1
Published: May 13, 2026 | License: Apache-2.0 | Imports: 5 | Imported by: 0

Documentation

Overview

Package minhash provides MinHash signature generation for set similarity estimation.

MinHash compresses a set of tokens or shingles into a compact fixed-size signature. The Jaccard similarity between two sets can then be estimated by comparing signatures in O(k) time, where k is the number of hash functions (typically 128).
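For reference, the quantity a signature comparison approximates is the exact Jaccard index |A ∩ B| / |A ∪ B|. The sketch below computes it directly on small token sets. The helper names jaccard and toSet are illustrative and are not part of this package.

```go
package main

import "fmt"

// jaccard returns |A ∩ B| / |A ∪ B| for two token sets.
func jaccard(a, b map[string]struct{}) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 0 // convention chosen for this sketch; the ratio is undefined here
	}
	inter := 0
	for t := range a {
		if _, ok := b[t]; ok {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	return float64(inter) / float64(union)
}

// toSet builds a set from a list of tokens, dropping duplicates.
func toSet(tokens ...string) map[string]struct{} {
	s := make(map[string]struct{}, len(tokens))
	for _, t := range tokens {
		s[t] = struct{}{}
	}
	return s
}

func main() {
	a := toSet("the", "quick", "brown", "fox")
	b := toSet("the", "lazy", "brown", "dog")
	// 2 shared tokens out of 6 distinct tokens.
	fmt.Println(jaccard(a, b))
}
```

The exact computation touches every token of both sets, whereas comparing two signatures costs O(k) regardless of set size.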

This implementation hashes each token once with FNV-1a, then mixes that base hash with a per-hash-function seed through a splitmix64 finalizer. This yields k effectively independent hash values from a single base hash computation.
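The hashing scheme can be sketched with the standard library as follows. This is an illustration only: the seed values, the XOR combination of base hash and seed, and the names splitmix64 and deriveHashes are assumptions, not the package's actual internals.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// splitmix64 applies the mixing (finalizer) step of the SplitMix64 generator.
func splitmix64(x uint64) uint64 {
	x ^= x >> 30
	x *= 0xbf58476d1ce4e5b9
	x ^= x >> 27
	x *= 0x94d049bb133111eb
	x ^= x >> 31
	return x
}

// deriveHashes computes one FNV-1a base hash for token and derives one value
// per seed. Combining base and seed with XOR is an assumption of this sketch.
func deriveHashes(token []byte, seeds []uint64) []uint64 {
	h := fnv.New64a()
	h.Write(token) // Write on a hash.Hash never returns an error
	base := h.Sum64()

	out := make([]uint64, len(seeds))
	for i, seed := range seeds {
		out[i] = splitmix64(base ^ seed)
	}
	return out
}

func main() {
	seeds := []uint64{1, 2, 3, 4} // hypothetical seeds, one per hash function
	fmt.Println(deriveHashes([]byte("shingle"), seeds))
}
```

The token is hashed only once, so the per-token cost is one pass over the bytes plus k cheap integer mixes.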

Constants

const (
	// HeaderSize is the number of bytes for the numHashes uint32 in serialization.
	HeaderSize = 4

	// BytesPerHash is the number of bytes per uint64 hash value in serialization.
	BytesPerHash = 8
)

Variables

var (
	// ErrZeroNumHashes is returned when numHashes is zero.
	ErrZeroNumHashes = errors.New("minhash: numHashes must be positive")

	// ErrSizeMismatch is returned when comparing signatures of different sizes.
	ErrSizeMismatch = errors.New("minhash: signature sizes do not match")

	// ErrNilSignature is returned when a nil signature is provided.
	ErrNilSignature = errors.New("minhash: signature must not be nil")

	// ErrInvalidData is returned when deserialization data is invalid.
	ErrInvalidData = errors.New("minhash: invalid serialized data")
)

Functions

This section is empty.

Types

type Signature

type Signature struct {
	// contains filtered or unexported fields
}

Signature is a thread-safe MinHash signature for Jaccard similarity estimation.

func New

func New(numHashes int) (*Signature, error)

New creates a new MinHash signature with the given number of hash functions. Each minimum is initialized to math.MaxUint64. Returns ErrZeroNumHashes if numHashes is zero.

func (*Signature) Add

func (s *Signature) Add(token []byte)

Add updates all hash function minimums with the given token.

func (*Signature) Bytes

func (s *Signature) Bytes() []byte

Bytes serializes the signature to a compact binary format. Format: [numHashes as uint32 big-endian (4 bytes)] + [mins as []uint64 big-endian (8 bytes each)]. The total length is HeaderSize + Len()*BytesPerHash bytes.
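The layout above can be reproduced with encoding/binary. The sketch below works on a bare []uint64. The encode and decode functions are hypothetical helpers written for illustration, with decode simply inverting the documented format.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	headerSize   = 4 // uint32 count, mirrors HeaderSize
	bytesPerHash = 8 // one uint64 per hash function, mirrors BytesPerHash
)

// encode writes the count as a big-endian uint32, then each minimum as a
// big-endian uint64.
func encode(mins []uint64) []byte {
	buf := make([]byte, headerSize+len(mins)*bytesPerHash)
	binary.BigEndian.PutUint32(buf, uint32(len(mins)))
	for i, m := range mins {
		binary.BigEndian.PutUint64(buf[headerSize+i*bytesPerHash:], m)
	}
	return buf
}

// decode reverses encode and rejects data whose length disagrees with the header.
func decode(data []byte) ([]uint64, error) {
	if len(data) < headerSize {
		return nil, errors.New("data shorter than header")
	}
	n := binary.BigEndian.Uint32(data)
	if uint64(len(data)-headerSize) != uint64(n)*bytesPerHash {
		return nil, errors.New("length does not match header count")
	}
	mins := make([]uint64, n)
	for i := range mins {
		mins[i] = binary.BigEndian.Uint64(data[headerSize+i*bytesPerHash:])
	}
	return mins, nil
}

func main() {
	b := encode([]uint64{1, 2})
	fmt.Printf("%d bytes: % x\n", len(b), b) // 4 + 2*8 = 20 bytes
	mins, err := decode(b)
	fmt.Println(mins, err)
}
```

With the typical 128 hash functions, a serialized signature in this layout occupies 4 + 128*8 = 1028 bytes.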

func (*Signature) Len

func (s *Signature) Len() int

Len returns the number of hash functions in the signature.

func (*Signature) Similarity

func (s *Signature) Similarity(other *Signature) (float64, error)

Similarity returns the estimated Jaccard index between this signature and another. Returns ErrSizeMismatch if the signatures have different sizes, or ErrNilSignature if other is nil.
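The standard MinHash estimator is the fraction of positions at which two signatures hold the same minimum. A minimal sketch on plain slices, assuming that is what Similarity computes, follows. The function estimate and its error values are illustrative and are not the package's API.

```go
package main

import (
	"errors"
	"fmt"
)

// estimate returns the share of slots where both signatures agree, which
// approximates the Jaccard index of the underlying sets.
func estimate(a, b []uint64) (float64, error) {
	if len(a) != len(b) {
		return 0, errors.New("signature sizes do not match")
	}
	if len(a) == 0 {
		return 0, errors.New("empty signature")
	}
	matches := 0
	for i := range a {
		if a[i] == b[i] {
			matches++
		}
	}
	return float64(matches) / float64(len(a)), nil
}

func main() {
	a := []uint64{3, 8, 15, 42}
	b := []uint64{3, 9, 15, 40}
	sim, err := estimate(a, b)
	fmt.Println(sim, err) // 0.5 <nil>
}
```

Both signatures must be built with the same number of hash functions and the same seeds, otherwise matching slots carry no meaning.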
