Documentation ¶
Index ¶
- Constants
- Variables
- func EstimateJaccard(sig1, sig2 []uint64) float64
- type Config
- type Hasher
- type SimilarityService
- func (s *SimilarityService) CalculateJaccardOptimized(sourceSet set.GenericDataSet[string], targetStr string) float64
- func (s *SimilarityService) GetNewID(input string) (string, error)
- func (s *SimilarityService) Shingle(input string) set.GenericDataSet[string]
- func (s *SimilarityService) Upsert(ctx context.Context, group, input string) (string, error)
Constants ¶
const (
EnvPrefix = "LSH"
)
Variables ¶
Functions ¶
func EstimateJaccard ¶ added in v1.0.0
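The page gives only the signature of EstimateJaccard, not its body. As a rough illustration of how a Jaccard estimate is commonly derived from two MinHash signatures, the sketch below counts the fraction of positions where the signatures agree. The helper name `estimateJaccard` and its handling of empty or mismatched inputs are assumptions made for this example, not the package's confirmed behavior.

```go
package main

import "fmt"

// estimateJaccard is a hypothetical stand-in for the package's EstimateJaccard.
// It returns the fraction of signature positions where sig1 and sig2 hold the
// same MinHash value. Returning 0 for empty or mismatched lengths is an
// assumption made for this sketch.
func estimateJaccard(sig1, sig2 []uint64) float64 {
	if len(sig1) == 0 || len(sig1) != len(sig2) {
		return 0
	}
	matches := 0
	for i := range sig1 {
		if sig1[i] == sig2[i] {
			matches++
		}
	}
	return float64(matches) / float64(len(sig1))
}

func main() {
	a := []uint64{1, 2, 3, 4}
	b := []uint64{1, 2, 9, 4}
	// Three of the four positions agree.
	fmt.Println(estimateJaccard(a, b))
}
```

With the default Config, a signature has Bands * Rows = 200 positions, so the estimate moves in steps of 1/200.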
Types ¶
type Config ¶
type Config struct {
Bands int `env:"_BANDS" envDefault:"40"`
Rows int `env:"_ROWS" envDefault:"5"`
ShingleSize int `env:"_SHINGLE_SIZE" envDefault:"3"`
JaccardThreshold float64 `env:"_JAC_THRESHOLD" envDefault:"0.6"`
MaxBucketSize int `env:"_MAX_BUCKET_SIZE" envDefault:"200"`
MaxTotalCandidates int `env:"_MAX_TOTAL_CANDIDATES" envDefault:"100"`
Seed int64 `env:"_SEED" envDefault:"13374269"`
}
Config defines the parameters for the LSH (Locality-Sensitive Hashing) pipeline.
- Bands and Rows control how MinHash signatures are split into band keys:
  - signature size = Bands * Rows
  - increasing Bands (with fixed Rows) generally increases recall: more chances for similar items to land in at least one common bucket.
  - increasing Rows (with fixed Bands) generally increases precision: a stricter requirement to match within a band.
  Together they define an *approximate* similarity level where collisions become likely (often estimated as s ≈ (1/Bands)^(1/Rows)). This is a probabilistic candidate generator.
JaccardThreshold is the final similarity filter applied after candidate generation. We intentionally keep the LSH bucketing stage “looser” (i.e., it admits candidates at a lower similarity) so that *similar* items are stored and found through shared buckets. JaccardThreshold then decides whether a candidate is “similar enough” to return the same ID or should be treated as a different (new) ID.
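To make the Bands and Rows trade-off concrete: under the standard banding analysis, two items with Jaccard similarity s share at least one bucket with probability 1 - (1 - s^Rows)^Bands. The sketch below evaluates that curve for the default values. `collisionProbability` is a hypothetical helper written for this example, and it assumes the package follows the standard banding scheme.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability returns the probability that two items with Jaccard
// similarity s agree on every row of at least one band, given `bands` bands
// of `rows` rows each: 1 - (1 - s^rows)^bands.
// Hypothetical helper; not part of the documented API.
func collisionProbability(s float64, bands, rows int) float64 {
	perBand := math.Pow(s, float64(rows)) // all rows in a single band match
	return 1 - math.Pow(1-perBand, float64(bands))
}

func main() {
	// Defaults from Config: Bands=40, Rows=5.
	for _, s := range []float64{0.3, 0.5, 0.6, 0.8} {
		fmt.Printf("s=%.1f  P(candidate)=%.3f\n", s, collisionProbability(s, 40, 5))
	}
}
```

Raising `bands` pushes the curve up at every similarity (more recall), while raising `rows` pushes it down (more precision), which matches the description of the two parameters.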
func GetLSHConfigFromEnv ¶
func (*Config) CalculateApproximateThreshold ¶ added in v1.0.1
CalculateApproximateThreshold computes the approximate Jaccard similarity threshold at which the LSH configuration (Bands and Rows) is most sensitive. This is roughly the similarity at which the probability of two items sharing at least one bucket begins to rise sharply. The formula is s ≈ (1/B)^(1/R), where B is Bands and R is Rows.
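As a worked instance of the formula, the sketch below evaluates (1/B)^(1/R) directly. `approximateThreshold` is a hypothetical standalone function used for illustration; the method's actual signature and return type are not shown on this page.

```go
package main

import (
	"fmt"
	"math"
)

// approximateThreshold evaluates s ≈ (1/B)^(1/R) for B bands and R rows.
// Hypothetical helper; it assumes bands and rows are both positive.
func approximateThreshold(bands, rows int) float64 {
	return math.Pow(1/float64(bands), 1/float64(rows))
}

func main() {
	// Defaults from Config: Bands=40, Rows=5.
	fmt.Printf("%.3f\n", approximateThreshold(40, 5))
}
```

For the defaults this comes out at roughly 0.48, below the default JaccardThreshold of 0.6, which is consistent with the intent of keeping the bucketing stage looser than the final filter.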
type Hasher ¶
type Hasher struct {
// contains filtered or unexported fields
}
func (*Hasher) ComputeSignature ¶
func (h *Hasher) ComputeSignature(tokens set.GenericDataSet[string], sig []uint64)
type SimilarityService ¶
type SimilarityService struct {
// contains filtered or unexported fields
}
func NewSimilarityService ¶
func NewSimilarityService(repo repositories.Storage, config *Config) *SimilarityService
func (*SimilarityService) CalculateJaccardOptimized ¶
func (s *SimilarityService) CalculateJaccardOptimized(sourceSet set.GenericDataSet[string], targetStr string) float64
func (*SimilarityService) GetNewID ¶
func (s *SimilarityService) GetNewID(input string) (string, error)
func (*SimilarityService) Shingle ¶
func (s *SimilarityService) Shingle(input string) set.GenericDataSet[string]