package interlinking
v0.12.0
Published: Mar 10, 2026 License: AGPL-3.0 Imports: 17 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func ComputeOpportunities

func ComputeOpportunities(ctx context.Context, store OpportunityStore, opts ComputeOpportunitiesOptions) error

ComputeOpportunities runs the full interlinking analysis pipeline:

  1. Stream HTML → extract content → build TF-IDF corpus
  2. Find similar pairs above threshold
  3. Filter out pairs that already have an internal link
  4. Enrich with metadata and store results

func CosineDistanceFloat32

func CosineDistanceFloat32(a, b []float32) float64

CosineDistanceFloat32 computes cosine distance between two float32 vectors.
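As an illustration of the quantity involved, cosine distance is commonly defined as 1 minus the cosine similarity. The sketch below uses that definition; the lowercase name cosineDistance is hypothetical, and the handling of mismatched lengths and zero vectors is an assumption, since the documentation does not describe those cases.

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance returns 1 - cos(a, b). It is an illustrative stand-in,
// not the package's implementation. Returning 1 for mismatched lengths
// or zero-norm inputs is an assumption made for this sketch.
func cosineDistance(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 1
	}
	var dot, normA, normB float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		normA += x * x
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 1
	}
	return 1 - dot/(math.Sqrt(normA)*math.Sqrt(normB))
}

func main() {
	fmt.Println(cosineDistance([]float32{2, 0}, []float32{5, 0})) // same direction
	fmt.Println(cosineDistance([]float32{1, 0}, []float32{0, 1})) // orthogonal
}
```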

func DeduplicatePages

func DeduplicatePages(pageMeta map[string]storage.PageMetadata) map[string]bool

DeduplicatePages returns the set of URLs to skip when building the corpus. It applies two deduplication mechanisms:

  1. Canonical: if page A has a non-self canonical pointing to page B (which exists in the crawl), skip A.
  2. Tracking params: after stripping tracking parameters, if multiple URLs resolve to the same normalized URL, keep only the best one (prefer the canonical/clean URL, then highest PageRank).
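The canonical rule (mechanism 1) can be sketched as follows. The pageMeta struct and its Canonical field are hypothetical stand-ins, because the fields of storage.PageMetadata are not shown here; the tracking-parameter rule is left out of this sketch.

```go
package main

import "fmt"

// pageMeta is a hypothetical stand-in for storage.PageMetadata.
type pageMeta struct {
	Canonical string // declared canonical URL, empty if none
}

// skipByCanonical applies only the canonical rule: page A is skipped when
// it declares a canonical B that differs from A and B was itself crawled.
func skipByCanonical(pages map[string]pageMeta) map[string]bool {
	skip := make(map[string]bool)
	for pageURL, meta := range pages {
		if meta.Canonical == "" || meta.Canonical == pageURL {
			continue // no canonical, or self-canonical
		}
		if _, crawled := pages[meta.Canonical]; crawled {
			skip[pageURL] = true
		}
	}
	return skip
}

func main() {
	pages := map[string]pageMeta{
		"/shoes":        {Canonical: "/shoes"},
		"/shoes?sort=1": {Canonical: "/shoes"},
		"/hats":         {Canonical: "/not-crawled"},
	}
	fmt.Println(skipByCanonical(pages))
}
```

In this example only "/shoes?sort=1" is skipped: "/hats" points at a canonical that is absent from the crawl, so it stays in the corpus.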

func NormalizeURL

func NormalizeURL(rawURL string) string

NormalizeURL strips known tracking parameters and normalizes trailing slashes. Returns the cleaned URL string. If parsing fails, returns the original.
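A minimal sketch of this kind of normalization, built on the standard net/url package, is shown below. The tracking-parameter list (utm_* prefixes, gclid, fbclid) and the choice to strip rather than add the trailing slash are assumptions; the package's actual list and slash policy are not documented here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// trackingParams is an assumed list used only for this sketch.
var trackingParams = map[string]bool{"gclid": true, "fbclid": true}

// normalizeURL removes assumed tracking parameters and a trailing slash.
// On a parse error it returns the input unchanged.
func normalizeURL(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return rawURL
	}
	query := u.Query()
	for key := range query {
		if trackingParams[key] || strings.HasPrefix(key, "utm_") {
			query.Del(key)
		}
	}
	u.RawQuery = query.Encode() // Encode also sorts the remaining keys
	if len(u.Path) > 1 {
		u.Path = strings.TrimSuffix(u.Path, "/")
	}
	return u.String()
}

func main() {
	fmt.Println(normalizeURL("https://example.com/blog/?utm_source=news&page=2"))
	// prints https://example.com/blog?page=2
}
```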

func WithVirtualLinks

func WithVirtualLinks(g *storage.PageRankGraph, links []storage.VirtualLink) *storage.PageRankGraph

WithVirtualLinks returns a copy of the graph with additional edges injected.
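The copy-then-inject idea can be sketched with a simple adjacency map. The graph and edge types below are hypothetical; the real storage.PageRankGraph and storage.VirtualLink layouts are not shown in this documentation.

```go
package main

import "fmt"

// edge and graph are hypothetical stand-ins for storage.VirtualLink and
// storage.PageRankGraph.
type edge struct{ From, To string }

type graph struct {
	Out map[string][]string // source URL -> outgoing target URLs
}

// withExtraEdges deep-copies the adjacency lists and appends the extra
// edges to the copy, leaving the original graph untouched.
func withExtraEdges(g *graph, links []edge) *graph {
	out := make(map[string][]string, len(g.Out))
	for from, targets := range g.Out {
		out[from] = append([]string(nil), targets...)
	}
	for _, l := range links {
		out[l.From] = append(out[l.From], l.To)
	}
	return &graph{Out: out}
}

func main() {
	original := &graph{Out: map[string][]string{"/a": {"/b"}}}
	simulated := withExtraEdges(original, []edge{{From: "/a", To: "/c"}})
	fmt.Println(len(original.Out["/a"]), len(simulated.Out["/a"]))
}
```

Copying each slice matters: appending to a shared slice could otherwise write into the original graph's backing array.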

Types

type ComputeOpportunitiesOptions

type ComputeOpportunitiesOptions struct {
	SessionID           string
	Method              string  // "tfidf"
	SimilarityThreshold float64 // default 0.3
	MaxOpportunities    int     // default 1000
	MinCommonTerms      int     // default 3
}

ComputeOpportunitiesOptions controls the interlinking analysis.
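The field comments list defaults (0.3, 1000, 3). One plausible way such defaults are applied is to treat zero values as "unset"; the sketch below assumes that convention. The options type mirrors the struct above, and withDefaults is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// options mirrors ComputeOpportunitiesOptions for illustration only.
type options struct {
	SessionID           string
	Method              string
	SimilarityThreshold float64
	MaxOpportunities    int
	MinCommonTerms      int
}

// withDefaults fills zero-valued fields with the documented defaults.
// Treating zero as "unset" is an assumption of this sketch.
func withDefaults(o options) options {
	if o.Method == "" {
		o.Method = "tfidf"
	}
	if o.SimilarityThreshold <= 0 {
		o.SimilarityThreshold = 0.3
	}
	if o.MaxOpportunities <= 0 {
		o.MaxOpportunities = 1000
	}
	if o.MinCommonTerms <= 0 {
		o.MinCommonTerms = 3
	}
	return o
}

func main() {
	fmt.Printf("%+v\n", withDefaults(options{SessionID: "demo"}))
}
```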

type Corpus

type Corpus struct {
	Vocab    map[string]uint32 // term → ID
	IDF      []float64         // ID → IDF value
	Docs     []Document
	DocCount int
}

Corpus holds all documents and the shared vocabulary/IDF weights.

func BuildCorpus

func BuildCorpus(pages <-chan storage.PageHTMLRow, pageInfo map[string]storage.PageMetadata) (*Corpus, error)

BuildCorpus constructs a TF-IDF corpus from streamed page HTML rows. Workers extract content and tokenize in parallel.
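To make the Vocab and IDF fields concrete, here is a serial sketch that assigns term IDs and computes IDF from already-tokenized documents. The formula ln(N/df) is an assumption (the package may use a smoothed variant), and HTML extraction, tokenization, and the parallel workers are omitted.

```go
package main

import (
	"fmt"
	"math"
)

// buildIDF assigns a dense ID to every distinct term and computes
// idf[id] = ln(N / df), where df counts the documents containing the term.
// The exact IDF formula used by the package is not documented.
func buildIDF(docs [][]string) (map[string]uint32, []float64) {
	vocab := make(map[string]uint32)
	var docFreq []int
	for _, tokens := range docs {
		seen := make(map[uint32]bool) // count each term once per document
		for _, term := range tokens {
			id, ok := vocab[term]
			if !ok {
				id = uint32(len(vocab))
				vocab[term] = id
				docFreq = append(docFreq, 0)
			}
			if !seen[id] {
				seen[id] = true
				docFreq[id]++
			}
		}
	}
	idf := make([]float64, len(docFreq))
	for id, df := range docFreq {
		idf[id] = math.Log(float64(len(docs)) / float64(df))
	}
	return vocab, idf
}

func main() {
	vocab, idf := buildIDF([][]string{{"go", "crawler"}, {"go", "seo", "seo"}})
	for term, id := range vocab {
		fmt.Printf("%s: id=%d idf=%.3f\n", term, id, idf[id])
	}
}
```

A term that appears in every document ("go" here) gets an IDF of 0 under this formula, so it contributes nothing to similarity.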

type Document

type Document struct {
	URL       string
	Title     string
	Lang      string
	PageRank  float64
	WordCount uint32
	TermFreqs map[uint32]float64 // vocab ID → normalized TF-IDF
	Norm      float64            // precomputed L2 norm of TermFreqs
}

Document represents a single page's TF-IDF vector (sparse, top-K terms).
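Storing the L2 norm alongside the sparse vector means a similarity check only needs a dot product over shared term IDs. The sketch below illustrates that with hypothetical names (sparseDoc, newSparseDoc, cosineSimilarity); it is not the package's code.

```go
package main

import (
	"fmt"
	"math"
)

// sparseDoc keeps only the two Document fields needed for similarity.
type sparseDoc struct {
	TermFreqs map[uint32]float64
	Norm      float64
}

// newSparseDoc precomputes the L2 norm of the weight vector.
func newSparseDoc(weights map[uint32]float64) sparseDoc {
	var sumSquares float64
	for _, w := range weights {
		sumSquares += w * w
	}
	return sparseDoc{TermFreqs: weights, Norm: math.Sqrt(sumSquares)}
}

// cosineSimilarity returns the similarity and the number of shared terms.
// It iterates over the smaller map and probes the larger one.
func cosineSimilarity(a, b sparseDoc) (float64, int) {
	if a.Norm == 0 || b.Norm == 0 {
		return 0, 0
	}
	small, large := a.TermFreqs, b.TermFreqs
	if len(large) < len(small) {
		small, large = large, small
	}
	var dot float64
	common := 0
	for id, w := range small {
		if v, ok := large[id]; ok {
			dot += w * v
			common++
		}
	}
	return dot / (a.Norm * b.Norm), common
}

func main() {
	a := newSparseDoc(map[uint32]float64{1: 3, 2: 4})
	b := newSparseDoc(map[uint32]float64{2: 4, 9: 3})
	fmt.Println(cosineSimilarity(a, b))
}
```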

type EmbeddingProvider

type EmbeddingProvider interface {
	Embed(ctx context.Context, texts []string) ([][]float32, error)
	Dimension() int
}

EmbeddingProvider generates vector embeddings from text.
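Because EmbeddingProvider is a small interface, a deterministic fake is easy to write for tests that should not call a remote API. The sketch below redeclares the interface locally so it compiles on its own; fakeProvider and embedAll are hypothetical names, and the byte-sum "embedding" has no semantic meaning.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// EmbeddingProvider is redeclared here so the sketch is self-contained.
type EmbeddingProvider interface {
	Embed(ctx context.Context, texts []string) ([][]float32, error)
	Dimension() int
}

// fakeProvider is a hypothetical test double with a fixed dimension.
type fakeProvider struct{ dim int }

func (f fakeProvider) Dimension() int { return f.dim }

// Embed folds the bytes of each text into a vector of length dim.
func (f fakeProvider) Embed(ctx context.Context, texts []string) ([][]float32, error) {
	if f.dim <= 0 {
		return nil, errors.New("dimension must be positive")
	}
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	out := make([][]float32, len(texts))
	for i, text := range texts {
		vec := make([]float32, f.dim)
		for j := 0; j < len(text); j++ {
			vec[j%f.dim] += float32(text[j])
		}
		out[i] = vec
	}
	return out, nil
}

// embedAll calls any provider and checks every vector has Dimension() entries.
func embedAll(p EmbeddingProvider, texts []string) ([][]float32, error) {
	vecs, err := p.Embed(context.Background(), texts)
	if err != nil {
		return nil, err
	}
	for _, v := range vecs {
		if len(v) != p.Dimension() {
			return nil, fmt.Errorf("got %d dims, want %d", len(v), p.Dimension())
		}
	}
	return vecs, nil
}

func main() {
	vecs, err := embedAll(fakeProvider{dim: 4}, []string{"a", "bc"})
	fmt.Println(vecs, err)
}
```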

type OpenAIProvider

type OpenAIProvider struct {
	APIKey    string
	Model     string
	BatchSize int
}

OpenAIProvider calls the OpenAI embeddings API.

func NewOpenAIProvider

func NewOpenAIProvider(apiKey, model string, batchSize int) *OpenAIProvider

NewOpenAIProvider creates a new OpenAI embedding provider.

func (*OpenAIProvider) Dimension

func (p *OpenAIProvider) Dimension() int

Dimension returns the embedding dimension for the configured model.

func (*OpenAIProvider) Embed

func (p *OpenAIProvider) Embed(ctx context.Context, texts []string) ([][]float32, error)

Embed generates embeddings for a batch of texts. The input is automatically split into sub-batches of at most BatchSize texts.
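The chunking step can be sketched independently of any API call. The chunk helper below is hypothetical, and treating a non-positive size as "one batch" is an assumption rather than documented behavior.

```go
package main

import "fmt"

// chunk splits texts into consecutive sub-slices of at most size elements.
// A non-positive size yields a single batch (an assumption of this sketch).
func chunk(texts []string, size int) [][]string {
	if size <= 0 {
		size = len(texts)
	}
	var batches [][]string
	for start := 0; start < len(texts); start += size {
		end := start + size
		if end > len(texts) {
			end = len(texts)
		}
		batches = append(batches, texts[start:end])
	}
	return batches
}

func main() {
	for i, batch := range chunk([]string{"a", "b", "c", "d", "e"}, 2) {
		fmt.Println(i, batch)
	}
}
```

The sub-slices share the input's backing array, so no text is copied; results from each batch would then be appended in order to keep output aligned with input.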

type OpportunityStore

type OpportunityStore interface {
	StreamPagesHTML(ctx context.Context, sessionID string) (<-chan storage.PageHTMLRow, error)
	LoadInternalLinkSet(ctx context.Context, sessionID string) (map[[2]string]struct{}, error)
	LoadPageMetadata(ctx context.Context, sessionID string) (map[string]storage.PageMetadata, error)
	DeleteInterlinkingOpportunities(ctx context.Context, sessionID string) error
	InsertInterlinkingOpportunities(ctx context.Context, sessionID string, opps []storage.InterlinkingOpportunity) error
}

OpportunityStore is the subset of storage needed by the opportunity finder.
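LoadInternalLinkSet returns a set keyed by [2]string, which makes the "already linked" check a single map lookup. The sketch below assumes the key is ordered as {source, target}; that ordering, the pair type, and the dropLinked helper are all hypothetical.

```go
package main

import "fmt"

// pair is a hypothetical candidate opportunity between two URLs.
type pair struct{ Source, Target string }

// dropLinked removes candidates whose source already links to the target.
// The key order {source, target} is an assumption of this sketch.
func dropLinked(pairs []pair, links map[[2]string]struct{}) []pair {
	var kept []pair
	for _, p := range pairs {
		if _, linked := links[[2]string{p.Source, p.Target}]; linked {
			continue
		}
		kept = append(kept, p)
	}
	return kept
}

func main() {
	links := map[[2]string]struct{}{
		[2]string{"/a", "/b"}: {},
	}
	candidates := []pair{{"/a", "/b"}, {"/b", "/a"}, {"/a", "/c"}}
	fmt.Println(dropLinked(candidates, links))
}
```

Arrays are comparable in Go, so [2]string works directly as a map key without building a concatenated string.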

type SimilarPair

type SimilarPair struct {
	SourceIdx  int
	TargetIdx  int
	Similarity float64
}

SimilarPair represents two documents with high cosine similarity.

func FindSimilarPairs

func FindSimilarPairs(corpus *Corpus, threshold float64, minCommonTerms int) []SimilarPair

FindSimilarPairs finds document pairs above a similarity threshold. Uses an inverted index on high-IDF terms to avoid N² comparisons.
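The inverted-index idea can be sketched as candidate generation: only documents that share a term are ever compared. The version below indexes every term and counts shared terms per pair; restricting the index to high-IDF terms, as the description mentions, and the final similarity check are omitted. The name candidatePairs is hypothetical.

```go
package main

import "fmt"

// candidatePairs builds a term -> documents index, then counts how many
// terms each pair of documents shares. Pairs with fewer than minCommon
// shared terms are dropped. Keys are {lowerIdx, higherIdx}.
func candidatePairs(docs []map[uint32]float64, minCommon int) map[[2]int]int {
	index := make(map[uint32][]int)
	for docIdx, terms := range docs {
		for termID := range terms {
			index[termID] = append(index[termID], docIdx)
		}
	}
	shared := make(map[[2]int]int)
	for _, posting := range index {
		for x := 0; x < len(posting); x++ {
			for y := x + 1; y < len(posting); y++ {
				a, b := posting[x], posting[y]
				if a > b {
					a, b = b, a
				}
				shared[[2]int{a, b}]++
			}
		}
	}
	for key, count := range shared {
		if count < minCommon {
			delete(shared, key)
		}
	}
	return shared
}

func main() {
	docs := []map[uint32]float64{
		{1: 0.5, 2: 0.4, 3: 0.1},
		{2: 0.7, 3: 0.2, 4: 0.3},
		{9: 1.0},
	}
	fmt.Println(candidatePairs(docs, 2))
}
```

Skipping very common terms keeps posting lists short, which is what prevents the inner loops from degenerating back into an all-pairs comparison.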

type SimulateResult

type SimulateResult struct {
	SimulationID  string
	PagesImproved uint32
	PagesDeclined uint32
	AvgDiff       float64
	MaxDiff       float64
	Results       []storage.SimulationResultRow
}

SimulateResult holds the outcome of a PageRank simulation.

func SimulatePageRank

func SimulatePageRank(ctx context.Context, store SimulationStore, sessionID, simID string, links []storage.VirtualLink) (*SimulateResult, error)

SimulatePageRank computes PageRank before/after adding virtual links.
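A before/after comparison can be sketched with a plain power-iteration PageRank over an adjacency map. The damping factor 0.85, the fixed iteration count, and the uniform redistribution of rank from pages without outlinks are conventional choices assumed here, not details taken from the package.

```go
package main

import "fmt"

// pageRank runs a fixed number of power iterations. Rank held by pages
// with no outlinks is spread evenly over all pages.
func pageRank(out map[string][]string, damping float64, iters int) map[string]float64 {
	nodes := make(map[string]bool)
	for from, targets := range out {
		nodes[from] = true
		for _, to := range targets {
			nodes[to] = true
		}
	}
	if len(nodes) == 0 {
		return map[string]float64{}
	}
	n := float64(len(nodes))
	rank := make(map[string]float64, len(nodes))
	for u := range nodes {
		rank[u] = 1 / n
	}
	for i := 0; i < iters; i++ {
		var dangling float64
		for u := range nodes {
			if len(out[u]) == 0 {
				dangling += rank[u]
			}
		}
		next := make(map[string]float64, len(nodes))
		base := (1-damping)/n + damping*dangling/n
		for u := range nodes {
			next[u] = base
		}
		for from, targets := range out {
			if len(targets) == 0 {
				continue
			}
			share := damping * rank[from] / float64(len(targets))
			for _, to := range targets {
				next[to] += share
			}
		}
		rank = next
	}
	return rank
}

func main() {
	before := map[string][]string{"a": {"b"}, "b": {"a"}, "c": {"a"}}
	// Same graph with one virtual link a -> c injected.
	after := map[string][]string{"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
	rb, ra := pageRank(before, 0.85, 50), pageRank(after, 0.85, 50)
	for _, u := range []string{"a", "b", "c"} {
		fmt.Printf("%s: %.4f -> %.4f\n", u, rb[u], ra[u])
	}
}
```

Page "c" has no inbound links in the first graph, so the virtual link raises its score; counting such improvements and declines per page is the kind of summary SimulateResult carries.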

type SimulationStore

type SimulationStore interface {
	LoadPageRankGraph(ctx context.Context, sessionID string) (*storage.PageRankGraph, error)
	InsertSimulation(ctx context.Context, sessionID string, simID string, virtualLinks []storage.VirtualLink, results []storage.SimulationResultRow, meta storage.SimulationMeta) error
}

SimulationStore is the subset of storage needed for PageRank simulation.
