Documentation
Overview
Package minhash provides MinHash signature generation for set similarity estimation.
MinHash compresses a set of tokens or shingles into a compact fixed-size signature. The Jaccard similarity between two sets can then be estimated by comparing signatures in O(k) time, where k is the number of hash functions (typically 128).
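The comparison step can be sketched as follows: the estimate is the fraction of signature positions whose minimum values agree. This is a minimal illustration that works on raw `[]uint64` slices. The helper name `estimateJaccard` and its error messages are hypothetical and are not part of this package's API.

```go
package main

import (
	"errors"
	"fmt"
)

// estimateJaccard is a hypothetical helper (not the package API).
// It returns the fraction of positions at which two equal-length
// signatures hold the same minimum, which estimates Jaccard similarity.
func estimateJaccard(a, b []uint64) (float64, error) {
	if len(a) != len(b) {
		return 0, errors.New("signature sizes do not match")
	}
	if len(a) == 0 {
		return 0, errors.New("signatures must not be empty")
	}
	matches := 0
	for i := range a { // one pass over k positions: O(k)
		if a[i] == b[i] {
			matches++
		}
	}
	return float64(matches) / float64(len(a)), nil
}

func main() {
	// Toy signatures with k = 4; positions 0 and 2 agree.
	a := []uint64{3, 17, 42, 8}
	b := []uint64{3, 99, 42, 5}
	est, err := estimateJaccard(a, b)
	if err != nil {
		panic(err)
	}
	fmt.Println(est) // prints 0.5
}
```

With k = 128 the estimate moves in steps of 1/128, so a larger k gives a finer estimate at the cost of a larger signature.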
This implementation uses FNV-1a base hashing with per-hash-function seeds mixed via a splitmix64 finalizer to produce k independent hash values from a single base hash computation.
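One way such a scheme might look is sketched below. The base hash uses the standard library's `hash/fnv` FNV-1a implementation, and the mixer is the well-known splitmix64 step. How the package actually derives its seeds and combines them with the base hash is not stated here, so the seed derivation (`splitmix64(uint64(i))`) and the XOR combination are assumptions, as are the function names `baseHash` and `signature`.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// splitmix64 is the standard splitmix64 mixing step.
func splitmix64(x uint64) uint64 {
	x += 0x9e3779b97f4a7c15
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9
	x = (x ^ (x >> 27)) * 0x94d049bb133111eb
	return x ^ (x >> 31)
}

// baseHash computes the 64-bit FNV-1a hash of a token once.
func baseHash(token string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(token)) // hash.Hash writes never return an error
	return h.Sum64()
}

// signature is a hypothetical sketch: for every token it derives k
// hash values from one base hash and keeps the per-position minimum.
func signature(tokens []string, k int) []uint64 {
	mins := make([]uint64, k)
	for i := range mins {
		mins[i] = math.MaxUint64
	}
	for _, tok := range tokens {
		base := baseHash(tok) // single base hash per token
		for i := range mins {
			seed := splitmix64(uint64(i)) // assumed seed derivation
			if h := splitmix64(base ^ seed); h < mins[i] {
				mins[i] = h
			}
		}
	}
	return mins
}

func main() {
	a := signature([]string{"the", "quick", "brown", "fox"}, 128)
	b := signature([]string{"the", "quick", "red", "fox"}, 128)
	matches := 0
	for i := range a {
		if a[i] == b[i] {
			matches++
		}
	}
	fmt.Printf("estimated similarity: %.3f\n", float64(matches)/float64(len(a)))
}
```

The point of the design is cost: each token is hashed byte by byte only once, and the k per-function values come from cheap integer mixing rather than k separate passes over the token.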
Constants
const (
	// HeaderSize is the number of bytes for the numHashes uint32 in serialization.
	HeaderSize = 4

	// BytesPerHash is the number of bytes per uint64 hash value in serialization.
	BytesPerHash = 8
)
Variables
var (
	// ErrZeroNumHashes is returned when numHashes is zero.
	ErrZeroNumHashes = errors.New("minhash: numHashes must be positive")

	// ErrSizeMismatch is returned when comparing signatures of different sizes.
	ErrSizeMismatch = errors.New("minhash: signature sizes do not match")

	// ErrNilSignature is returned when a nil signature is provided.
	ErrNilSignature = errors.New("minhash: signature must not be nil")

	// ErrInvalidData is returned when deserialization data is invalid.
	ErrInvalidData = errors.New("minhash: invalid serialized data")
)
Functions
This section is empty.
Types
type Signature
type Signature struct {
// contains filtered or unexported fields
}
Signature is a thread-safe MinHash signature for Jaccard similarity estimation.
func New
New creates a new MinHash signature with the given number of hash functions. Each minimum is initialized to math.MaxUint64. It returns ErrZeroNumHashes if numHashes is zero.
func (*Signature) Bytes
Bytes serializes the signature to a compact binary format. Format: [numHashes as uint32 big-endian (4 bytes)] + [mins as []uint64 big-endian (8 bytes each)]. The output is therefore HeaderSize + numHashes*BytesPerHash bytes long.
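The layout described above can be reproduced with `encoding/binary`. The sketch below is not the package's implementation: `encode` and `decode` are hypothetical helpers operating on a plain `[]uint64`, the lowercase constants simply mirror HeaderSize and BytesPerHash, and the validation rules in `decode` are assumptions about what counts as invalid data.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	headerSize   = 4 // mirrors HeaderSize: uint32 count
	bytesPerHash = 8 // mirrors BytesPerHash: one uint64 minimum
)

// encode writes the count as a big-endian uint32, then each minimum
// as a big-endian uint64. Hypothetical helper, not the package API.
func encode(mins []uint64) []byte {
	buf := make([]byte, headerSize+len(mins)*bytesPerHash)
	binary.BigEndian.PutUint32(buf, uint32(len(mins)))
	for i, m := range mins {
		binary.BigEndian.PutUint64(buf[headerSize+i*bytesPerHash:], m)
	}
	return buf
}

// decode reverses encode and rejects input whose length does not
// match the count in the header (assumed validation rules).
func decode(data []byte) ([]uint64, error) {
	if len(data) < headerSize {
		return nil, errors.New("invalid serialized data: short header")
	}
	n := binary.BigEndian.Uint32(data)
	if n == 0 || uint64(len(data)-headerSize) != uint64(n)*bytesPerHash {
		return nil, errors.New("invalid serialized data: length mismatch")
	}
	mins := make([]uint64, n)
	for i := range mins {
		mins[i] = binary.BigEndian.Uint64(data[headerSize+i*bytesPerHash:])
	}
	return mins, nil
}

func main() {
	data := encode([]uint64{1, 2})
	fmt.Println(len(data)) // 4 + 2*8 = 20
	mins, err := decode(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(mins)
}
```

A fixed big-endian layout keeps the encoding independent of the host's byte order, so signatures written on one machine can be read on another.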