duplicate

package
v0.100.2
Published: Feb 19, 2026 License: MIT Imports: 11 Imported by: 0

README

AI Duplicate Detection (ai/duplicate)

The duplicate package detects duplicate memos and recommends related ones by combining several features: embedding similarity, tag overlap, and time proximity.

Architecture

classDiagram
    class DuplicateDetector {
        <<interface>>
        +Detect(ctx, req)
        +Merge(ctx, src, tgt)
        +Link(ctx, m1, m2)
    }
    class duplicateDetector {
        -embeddingService
        -store
        -weights Weights
        +Detect()
    }
    class Similarity {
        +CosineSimilarity()
        +TagCoOccurrence()
        +TimeProximity()
    }

    DuplicateDetector <|.. duplicateDetector : implements
    duplicateDetector --> Similarity : uses
  • DuplicateDetector Interface: Defines three core operations: Detect, Merge, Link.
  • Multi-dimensional Similarity Model: Combines vector semantics, tag overlap, and time proximity.

Algorithm

The detector uses a weighted hybrid similarity to calculate a score in the range 0-1:

Score = (Vector * 0.5) + (TagCoOccur * 0.3) + (TimeProx * 0.2)

1. Semantic Similarity (weight 0.5)
  • Calculate Cosine Similarity of text embeddings.
  • Captures semantic-level similarity.
2. Tag Co-occurrence (weight 0.3)
  • Calculate Jaccard Similarity (intersection/union) of tag sets.
  • Captures explicit user categorization similarity.
3. Time Proximity (weight 0.2)
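  • Use exponential decay function: exp(-days_diff / 7).
  • The decay constant is 7 days: the factor falls to 1/e (about 0.37) after 7 days, which corresponds to a half-life of about 4.85 days. Notes closer in time are more relevant.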
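As a worked example of the formula above, the following sketch computes the weighted sum for one candidate. The `weights` type and `hybridScore` function are illustrative names, not the package's own; the factor values are made up for the example.

```go
package main

import "fmt"

// weights mirrors the three documented factors. The type name is illustrative.
type weights struct{ vector, tag, time float64 }

// hybridScore returns the weighted sum of the three factor scores.
func hybridScore(vector, tag, timeProx float64, w weights) float64 {
	return vector*w.vector + tag*w.tag + timeProx*w.time
}

func main() {
	w := weights{vector: 0.5, tag: 0.3, time: 0.2} // documented default weights
	// Example factors: strong semantic match, half the tags shared, same day.
	// 0.92*0.5 + 0.5*0.3 + 1.0*0.2 = 0.46 + 0.15 + 0.20
	fmt.Printf("%.3f\n", hybridScore(0.92, 0.5, 1.0, w)) // prints 0.810
}
```

With these inputs the score is 0.81, which falls in the Related band rather than the Duplicate band, even though the vector similarity alone exceeds 0.9.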

Thresholds

  • Duplicate: Score > 0.9. System prompts user about possible duplicate content.
  • Related: 0.7 < Score <= 0.9. System recommends as related notes.
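The two bands can be expressed as a small classification function. This is a sketch under the boundary conventions stated above (strictly greater than 0.9, strictly greater than 0.7); the `classify` function and its returned label strings are hypothetical and may not match the values the package stores in `SimilarMemo.Level`.

```go
package main

import "fmt"

const (
	duplicateThreshold = 0.9 // same value as the documented DuplicateThreshold
	relatedThreshold   = 0.7 // same value as the documented RelatedThreshold
)

// classify maps a hybrid score to a band. The label strings are placeholders.
func classify(score float64) string {
	switch {
	case score > duplicateThreshold:
		return "duplicate"
	case score > relatedThreshold:
		return "related"
	default:
		return "none"
	}
}

func main() {
	for _, s := range []float64{0.95, 0.9, 0.81, 0.7, 0.4} {
		fmt.Printf("%.2f -> %s\n", s, classify(s))
	}
}
```

Note that a score of exactly 0.9 lands in the related band and a score of exactly 0.7 is ignored.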

Workflow

flowchart TD
    Start[New Memo Input] --> Embed[Generate Embedding]
    Embed --> Search["Vector Search (Top-K Candidates)"]
    Search --> Candidates

    subgraph Similarity Check
        direction TB
        Candidates --> Calc1[Vector Similarity]
        Candidates --> Calc2[Tag Jaccard]
        Candidates --> Calc3[Time Decay]
        Calc1 & Calc2 & Calc3 --> WeightedSum[Weighted Sum]
    end

    WeightedSum --> Score{Score Check}
    Score -->|"> 0.9"| Dup[Mark as Duplicate]
    Score -->|"0.7 to 0.9"| Rel[Mark as Related]
    Score -->|"<= 0.7"| Ignore[Ignore]

    Dup & Rel --> Response[Return Detection Result]
  1. The user inputs memo content.
  2. The system calls Detect, either synchronously or asynchronously.
  3. Detect calculates the hybrid similarity against existing memos.
  4. Detect returns the Top-K results with similarity scores and per-factor breakdowns.
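Steps 3 and 4 can be pictured as a sort-and-bucket pass over already-scored candidates. The sketch below is an assumption about the shape of that step, not the package's implementation: the `scored` type and `rank` function are hypothetical, and the source does not say whether the Top-K limit is applied before or after thresholding (here it is applied to candidates that pass the related threshold).

```go
package main

import (
	"fmt"
	"sort"
)

// scored pairs a memo ID with its hybrid score. Hypothetical type.
type scored struct {
	ID    string
	Score float64
}

// rank sorts candidates by descending score, keeps at most topK that exceed
// the related threshold (0.7), and splits them at the duplicate threshold (0.9).
func rank(cands []scored, topK int) (dups, related []scored) {
	sorted := append([]scored(nil), cands...) // copy so the input is not reordered
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Score > sorted[j].Score })
	kept := 0
	for _, c := range sorted {
		if kept == topK || c.Score <= 0.7 {
			break // everything after this point scores lower still
		}
		if c.Score > 0.9 {
			dups = append(dups, c)
		} else {
			related = append(related, c)
		}
		kept++
	}
	return dups, related
}

func main() {
	cands := []scored{{"a", 0.95}, {"b", 0.50}, {"c", 0.80}, {"d", 0.75}}
	dups, related := rank(cands, 2)
	fmt.Println(dups, related)
}
```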

Core Operations

Detect
  • Generate embedding for new content
  • Vector search for top-K candidates
  • Calculate three-factor similarity (vector, tag, time) for each candidate
  • Categorize into Duplicates or Related
Merge
  • Merge content from source to target memo
  • Merge tags (union, case-insensitive deduplication)
  • Archive source memo (set to ARCHIVED status)
  • Create bidirectional relation between two memos
  • Uses MemoRelationReference type

Documentation

Overview

Package duplicate provides memo duplicate detection for P2-C002. It contains the detector implementation and the similarity calculations.

Constants

const (
	DuplicateThreshold = 0.9 // >90% = duplicate
	RelatedThreshold   = 0.7 // 70-90% = related
	DefaultTopK        = 5
)

Thresholds for duplicate detection.

const TimeDecayDays = 7

TimeDecayDays is the decay period for time proximity calculation.

Variables

var DefaultWeights = Weights{
	Vector:     0.5,
	TagCoOccur: 0.3,
	TimeProx:   0.2,
}

DefaultWeights are the default weights for similarity calculation.

Functions

func CalculateWeightedSimilarity

func CalculateWeightedSimilarity(b *Breakdown, w Weights) float64

CalculateWeightedSimilarity computes weighted similarity from breakdown.

func CosineSimilarity

func CosineSimilarity(a, b []float32) float64

CosineSimilarity calculates cosine similarity between two vectors.

func ExtractTitle

func ExtractTitle(content string) string

ExtractTitle extracts title from memo content (first line).

func FindSharedTags

func FindSharedTags(tags1, tags2 []string) []string

FindSharedTags returns tags that appear in both slices.

func TagCoOccurrence

func TagCoOccurrence(tags1, tags2 []string) float64

TagCoOccurrence calculates Jaccard similarity between two tag sets.
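Jaccard similarity is the size of the intersection divided by the size of the union. The sketch below uses a hypothetical `jaccard` function with exact string matching; returning 0 when both tag sets are empty is an assumption, as the source does not specify that case.

```go
package main

import "fmt"

// toSet converts a slice of tags into a set, dropping repeats.
func toSet(tags []string) map[string]struct{} {
	set := make(map[string]struct{}, len(tags))
	for _, tag := range tags {
		set[tag] = struct{}{}
	}
	return set
}

// jaccard returns |A ∩ B| / |A ∪ B|, or 0 when both sets are empty.
func jaccard(a, b []string) float64 {
	setA, setB := toSet(a), toSet(b)
	if len(setA) == 0 && len(setB) == 0 {
		return 0
	}
	intersection := 0
	for tag := range setA {
		if _, ok := setB[tag]; ok {
			intersection++
		}
	}
	union := len(setA) + len(setB) - intersection
	return float64(intersection) / float64(union)
}

func main() {
	// Two shared tags out of four distinct tags overall.
	fmt.Println(jaccard([]string{"go", "ai", "notes"}, []string{"ai", "notes", "todo"})) // prints 0.5
}
```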

func TimeProximity

func TimeProximity(newTime, candidateTime time.Time) float64

TimeProximity calculates time proximity using exponential decay. Returns 1.0 for same day, decaying exponentially over TimeDecayDays.

func Truncate

func Truncate(content string, maxLen int) string

Truncate truncates content to maxLen characters (Unicode-safe).
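Unicode-safe truncation usually means cutting on rune boundaries rather than bytes, so multi-byte characters are never split. A minimal sketch, assuming a hypothetical `truncateRunes` function that adds no ellipsis (the source does not say whether the package appends one):

```go
package main

import "fmt"

// truncateRunes keeps at most maxLen runes of s, never splitting a
// multi-byte character.
func truncateRunes(s string, maxLen int) string {
	if maxLen <= 0 {
		return ""
	}
	runes := []rune(s)
	if len(runes) <= maxLen {
		return s
	}
	return string(runes[:maxLen])
}

func main() {
	fmt.Println(truncateRunes("日本語テキスト", 3)) // prints 日本語
	fmt.Println(truncateRunes("héllo wörld", 5)) // prints héllo
}
```

Slicing the string by bytes instead (`s[:maxLen]`) could cut a character in half and produce invalid UTF-8.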

Types

type Breakdown

type Breakdown struct {
	Vector     float64 `json:"vector"`
	TagCoOccur float64 `json:"tag_co_occur"`
	TimeProx   float64 `json:"time_prox"`
}

Breakdown shows how similarity was calculated.

type DetectRequest

type DetectRequest struct {
	Title   string   `json:"title"`
	Content string   `json:"content"`
	Tags    []string `json:"tags,omitempty"`
	TopK    int      `json:"top_k,omitempty"`
	UserID  int32    `json:"user_id"`
}

DetectRequest contains input for duplicate detection.

type DetectResponse

type DetectResponse struct {
	Duplicates   []SimilarMemo `json:"duplicates,omitempty"`
	Related      []SimilarMemo `json:"related,omitempty"`
	LatencyMs    int64         `json:"latency_ms"`
	HasDuplicate bool          `json:"has_duplicate"`
	HasRelated   bool          `json:"has_related"`
}

DetectResponse contains detection results.

type DuplicateDetector

type DuplicateDetector interface {
	// Detect finds duplicate and related memos for given content.
	Detect(ctx context.Context, req *DetectRequest) (*DetectResponse, error)

	// Merge merges source memo into target memo.
	Merge(ctx context.Context, userID int32, sourceID, targetID string) error

	// Link creates a bidirectional relation between two memos.
	Link(ctx context.Context, userID int32, memoID1, memoID2 string) error
}

DuplicateDetector detects duplicate and related memos.

func NewDuplicateDetector

func NewDuplicateDetector(s *store.Store, embedding ai.EmbeddingService, model string) DuplicateDetector

NewDuplicateDetector creates a new DuplicateDetector.

func NewDuplicateDetectorWithWeights

func NewDuplicateDetectorWithWeights(s *store.Store, embedding ai.EmbeddingService, model string, weights Weights) DuplicateDetector

NewDuplicateDetectorWithWeights creates a detector with custom weights.

type SimilarMemo

type SimilarMemo struct {
	Breakdown  *Breakdown `json:"breakdown,omitempty"`
	ID         string     `json:"id"`
	Name       string     `json:"name"`
	Title      string     `json:"title"`
	Snippet    string     `json:"snippet"`
	Level      string     `json:"level"`
	SharedTags []string   `json:"shared_tags,omitempty"`
	Similarity float64    `json:"similarity"`
}

SimilarMemo represents a memo similar to the input.
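A client that receives a serialized DetectResponse could decode it as sketched below. The local struct definitions copy only the JSON tags documented above so the example is self-contained; the sample payload, including the `name` and `level` values, is invented for illustration and does not reflect real output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local mirrors of the documented types, reduced to the fields used here.
type breakdown struct {
	Vector     float64 `json:"vector"`
	TagCoOccur float64 `json:"tag_co_occur"`
	TimeProx   float64 `json:"time_prox"`
}

type similarMemo struct {
	Breakdown  *breakdown `json:"breakdown,omitempty"`
	Name       string     `json:"name"`
	Level      string     `json:"level"`
	Similarity float64    `json:"similarity"`
}

type detectResponse struct {
	Duplicates   []similarMemo `json:"duplicates,omitempty"`
	Related      []similarMemo `json:"related,omitempty"`
	HasDuplicate bool          `json:"has_duplicate"`
	HasRelated   bool          `json:"has_related"`
}

// sample is a made-up payload; 0.96*0.5 + 1.0*0.3 + 0.9*0.2 = 0.96.
const sample = `{
  "duplicates": [{
    "name": "memos/example",
    "level": "duplicate",
    "similarity": 0.96,
    "breakdown": {"vector": 0.96, "tag_co_occur": 1.0, "time_prox": 0.9}
  }],
  "has_duplicate": true,
  "has_related": false
}`

// decode parses a JSON-encoded detection response.
func decode(data []byte) (*detectResponse, error) {
	var resp detectResponse
	if err := json.Unmarshal(data, &resp); err != nil {
		return nil, err
	}
	return &resp, nil
}

func main() {
	resp, err := decode([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, m := range resp.Duplicates {
		if m.Breakdown != nil { // breakdown is optional in the JSON
			fmt.Printf("%s: %.2f (vector %.2f)\n", m.Name, m.Similarity, m.Breakdown.Vector)
		}
	}
}
```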

type Weights

type Weights struct {
	Vector     float64
	TagCoOccur float64
	TimeProx   float64
}

Weights for similarity calculation.
