Documentation ¶
Index ¶
- func ComputeOpportunities(ctx context.Context, store OpportunityStore, opts ComputeOpportunitiesOptions) error
- func CosineDistanceFloat32(a, b []float32) float64
- func DeduplicatePages(pageMeta map[string]storage.PageMetadata) map[string]bool
- func NormalizeURL(rawURL string) string
- func WithVirtualLinks(g *storage.PageRankGraph, links []storage.VirtualLink) *storage.PageRankGraph
- type ComputeOpportunitiesOptions
- type Corpus
- type Document
- type EmbeddingProvider
- type OpenAIProvider
- type OpportunityStore
- type SimilarPair
- type SimulateResult
- type SimulationStore
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ComputeOpportunities ¶
func ComputeOpportunities(ctx context.Context, store OpportunityStore, opts ComputeOpportunitiesOptions) error
ComputeOpportunities runs the full interlinking analysis pipeline:
1. Stream HTML → extract content → build TF-IDF corpus
2. Find similar pairs above threshold
3. Filter out pairs that already have an internal link
4. Enrich with metadata and store results
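Step 3 of the pipeline can be illustrated with a small sketch. The `pair` type and `filterLinked` helper are hypothetical; only the `map[[2]string]struct{}` link-set shape comes from `OpportunityStore.LoadInternalLinkSet`. The sketch assumes keys are ordered {source, target} and treats a link in either direction as already linked, which may differ from what the package does.

```go
package main

import "fmt"

// pair is a hypothetical stand-in for a candidate page pair.
type pair struct{ A, B string }

// filterLinked drops pairs that are already connected by an internal link.
// Assumption: link-set keys are {source, target}, and a link in either
// direction counts as "already linked".
func filterLinked(pairs []pair, links map[[2]string]struct{}) []pair {
	var out []pair
	for _, p := range pairs {
		_, fwd := links[[2]string{p.A, p.B}]
		_, rev := links[[2]string{p.B, p.A}]
		if !fwd && !rev {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	links := map[[2]string]struct{}{
		{"/a", "/b"}: {},
	}
	pairs := []pair{{"/a", "/b"}, {"/b", "/a"}, {"/a", "/c"}}
	fmt.Println(filterLinked(pairs, links))
}
```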
func CosineDistanceFloat32 ¶
func CosineDistanceFloat32(a, b []float32) float64
CosineDistanceFloat32 computes the cosine distance between two float32 vectors.
func DeduplicatePages ¶
func DeduplicatePages(pageMeta map[string]storage.PageMetadata) map[string]bool
DeduplicatePages returns the set of URLs to skip when building the corpus. It applies two deduplication mechanisms:
- Canonical: if page A has a non-self canonical pointing to page B (which exists in the crawl), skip A.
- Tracking params: after stripping tracking parameters, if multiple URLs resolve to the same normalized URL, keep only the best one (prefer the canonical/clean URL, then highest PageRank).
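The canonical rule above can be sketched as follows. `pageMeta` is a hypothetical stand-in for `storage.PageMetadata` (whose fields are not shown on this page), and `skipByCanonical` covers only the first mechanism, not the tracking-parameter pass.

```go
package main

import "fmt"

// pageMeta is a hypothetical stand-in for storage.PageMetadata.
type pageMeta struct {
	Canonical string // canonical URL declared by the page, if any
}

// skipByCanonical marks pages whose canonical points to a different page
// that is itself present in the crawl.
func skipByCanonical(pages map[string]pageMeta) map[string]bool {
	skip := make(map[string]bool)
	for url, m := range pages {
		if m.Canonical == "" || m.Canonical == url {
			continue // no canonical, or self-canonical: keep the page
		}
		if _, ok := pages[m.Canonical]; ok {
			skip[url] = true
		}
	}
	return skip
}

func main() {
	pages := map[string]pageMeta{
		"/a":       {Canonical: "/a"},
		"/a?ref=x": {Canonical: "/a"},
		"/b":       {Canonical: "/not-crawled"},
	}
	fmt.Println(skipByCanonical(pages))
}
```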
func NormalizeURL ¶
func NormalizeURL(rawURL string) string
NormalizeURL strips known tracking parameters and normalizes trailing slashes. Returns the cleaned URL string. If parsing fails, returns the original.
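A minimal sketch of this kind of normalization, built on the standard `net/url` package. The `normalize` helper and the `trackingParams` list are assumptions for illustration; the parameters the package actually strips, and whether it removes or adds the trailing slash, are not stated on this page. This sketch removes a trailing slash from any path other than the root.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// trackingParams is an assumed list, not the package's actual one.
var trackingParams = map[string]bool{"gclid": true, "fbclid": true}

// normalize strips assumed tracking parameters and a trailing slash.
// On parse failure it returns the input unchanged.
func normalize(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return raw
	}
	q := u.Query()
	for k := range q {
		if trackingParams[k] || strings.HasPrefix(k, "utm_") {
			q.Del(k)
		}
	}
	u.RawQuery = q.Encode() // Encode sorts the remaining keys
	if len(u.Path) > 1 {
		u.Path = strings.TrimSuffix(u.Path, "/")
	}
	return u.String()
}

func main() {
	fmt.Println(normalize("https://example.com/blog/?utm_source=x&id=7"))
}
```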
func WithVirtualLinks ¶
func WithVirtualLinks(g *storage.PageRankGraph, links []storage.VirtualLink) *storage.PageRankGraph
WithVirtualLinks returns a copy of the graph with additional edges injected.
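The copy-then-inject behavior might look like the sketch below. The `graph` and `virtualLink` types are hypothetical simplifications; the real `storage.PageRankGraph` and `storage.VirtualLink` layouts are not shown on this page. The point illustrated is that the input graph is left unmodified.

```go
package main

import "fmt"

// graph and virtualLink are hypothetical stand-ins for the storage types.
type graph struct {
	Out map[string][]string // source URL -> target URLs
}

type virtualLink struct{ Source, Target string }

// withLinks returns a deep copy of g with the extra edges appended,
// leaving g itself untouched.
func withLinks(g *graph, links []virtualLink) *graph {
	cp := &graph{Out: make(map[string][]string, len(g.Out))}
	for src, targets := range g.Out {
		cp.Out[src] = append([]string(nil), targets...)
	}
	for _, l := range links {
		cp.Out[l.Source] = append(cp.Out[l.Source], l.Target)
	}
	return cp
}

func main() {
	g := &graph{Out: map[string][]string{"/a": {"/b"}}}
	g2 := withLinks(g, []virtualLink{{"/a", "/c"}})
	fmt.Println(len(g.Out["/a"]), len(g2.Out["/a"]))
}
```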
Types ¶
type ComputeOpportunitiesOptions ¶
type ComputeOpportunitiesOptions struct {
	SessionID           string
	Method              string  // "tfidf"
	SimilarityThreshold float64 // default 0.3
	MaxOpportunities    int     // default 1000
	MinCommonTerms      int     // default 3
}
ComputeOpportunitiesOptions controls the interlinking analysis.
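The field comments suggest that zero values fall back to defaults. One way this could work is sketched below; the struct is copied from the declaration above, while `withDefaults` is a hypothetical helper. Treating an empty `Method` as "tfidf" is an assumption of this sketch.

```go
package main

import "fmt"

// ComputeOpportunitiesOptions mirrors the documented struct.
type ComputeOpportunitiesOptions struct {
	SessionID           string
	Method              string
	SimilarityThreshold float64
	MaxOpportunities    int
	MinCommonTerms      int
}

// withDefaults (hypothetical) fills zero-valued fields with the defaults
// named in the field comments.
func withDefaults(o ComputeOpportunitiesOptions) ComputeOpportunitiesOptions {
	if o.Method == "" {
		o.Method = "tfidf" // assumption: empty means TF-IDF
	}
	if o.SimilarityThreshold == 0 {
		o.SimilarityThreshold = 0.3
	}
	if o.MaxOpportunities == 0 {
		o.MaxOpportunities = 1000
	}
	if o.MinCommonTerms == 0 {
		o.MinCommonTerms = 3
	}
	return o
}

func main() {
	opts := withDefaults(ComputeOpportunitiesOptions{SessionID: "s1", MinCommonTerms: 5})
	fmt.Printf("%+v\n", opts)
}
```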
type Corpus ¶
type Corpus struct {
	Vocab    map[string]uint32 // term → ID
	IDF      []float64         // ID → IDF value
	Docs     []Document
	DocCount int
}
Corpus holds all documents and the shared vocabulary/IDF weights.
func BuildCorpus ¶
func BuildCorpus(pages <-chan storage.PageHTMLRow, pageInfo map[string]storage.PageMetadata) (*Corpus, error)
BuildCorpus constructs a TF-IDF corpus from streamed page HTML rows. Workers extract content and tokenize in parallel.
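The parallel tokenization stage can be approximated by a worker pool reading from a channel. In this sketch, `tokenizeAll` and `docFreq` are hypothetical helpers, plain strings stand in for `storage.PageHTMLRow`, and whitespace splitting stands in for the real content extraction. The document frequencies it produces are the usual input to an IDF weight.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// tokenizeAll fans out over `workers` goroutines. Each worker reads texts
// from the channel and emits the set of distinct lower-cased terms per text.
func tokenizeAll(texts <-chan string, workers int) []map[string]bool {
	out := make(chan map[string]bool)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range texts {
				set := make(map[string]bool)
				for _, w := range strings.Fields(strings.ToLower(t)) {
					set[w] = true
				}
				out <- set
			}
		}()
	}
	// Close the results channel once every worker has finished.
	go func() { wg.Wait(); close(out) }()

	var docs []map[string]bool
	for s := range out {
		docs = append(docs, s)
	}
	return docs
}

// docFreq counts in how many documents each term occurs.
func docFreq(docs []map[string]bool) map[string]int {
	df := make(map[string]int)
	for _, d := range docs {
		for term := range d {
			df[term]++
		}
	}
	return df
}

func main() {
	ch := make(chan string, 3)
	ch <- "Go is fast"
	ch <- "go is fun"
	ch <- "rust"
	close(ch)
	fmt.Println(docFreq(tokenizeAll(ch, 2)))
}
```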
type Document ¶
type Document struct {
	URL       string
	Title     string
	Lang      string
	PageRank  float64
	WordCount uint32
	TermFreqs map[uint32]float64 // vocab ID → normalized TF-IDF
	Norm      float64            // precomputed L2 norm of TermFreqs
}
Document represents a single page's TF-IDF vector (sparse, top-K terms).
type EmbeddingProvider ¶
type EmbeddingProvider interface {
	Embed(ctx context.Context, texts []string) ([][]float32, error)
	Dimension() int
}
EmbeddingProvider generates vector embeddings from text.
type OpenAIProvider ¶
OpenAIProvider calls the OpenAI embeddings API.
func NewOpenAIProvider ¶
func NewOpenAIProvider(apiKey, model string, batchSize int) *OpenAIProvider
NewOpenAIProvider creates a new OpenAI embedding provider.
func (*OpenAIProvider) Dimension ¶
func (p *OpenAIProvider) Dimension() int
Dimension returns the embedding dimension for the configured model.
type OpportunityStore ¶
type OpportunityStore interface {
	StreamPagesHTML(ctx context.Context, sessionID string) (<-chan storage.PageHTMLRow, error)
	LoadInternalLinkSet(ctx context.Context, sessionID string) (map[[2]string]struct{}, error)
	LoadPageMetadata(ctx context.Context, sessionID string) (map[string]storage.PageMetadata, error)
	DeleteInterlinkingOpportunities(ctx context.Context, sessionID string) error
	InsertInterlinkingOpportunities(ctx context.Context, sessionID string, opps []storage.InterlinkingOpportunity) error
}
OpportunityStore is the subset of storage needed by the opportunity finder.
type SimilarPair ¶
SimilarPair represents two documents with high cosine similarity.
func FindSimilarPairs ¶
func FindSimilarPairs(corpus *Corpus, threshold float64, minCommonTerms int) []SimilarPair
FindSimilarPairs finds document pairs whose similarity is above the threshold. Uses an inverted index on high-IDF terms to avoid comparing all N² document pairs.
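The inverted-index idea can be sketched as candidate generation: only documents that share a term are ever paired. `candidatePairs` is a hypothetical helper that indexes every term rather than only high-IDF ones, and it assumes each document lists a term ID at most once. A real implementation would then score the surviving candidates with cosine similarity.

```go
package main

import "fmt"

// candidatePairs returns, for each pair of document indices (i < j),
// the number of shared terms, keeping pairs with at least minCommon.
func candidatePairs(docs [][]uint32, minCommon int) map[[2]int]int {
	// Build term ID -> posting list of document indices (ascending).
	index := make(map[uint32][]int)
	for i, terms := range docs {
		for _, t := range terms {
			index[t] = append(index[t], i)
		}
	}
	// Count co-occurrences only within each posting list.
	shared := make(map[[2]int]int)
	for _, posting := range index {
		for x := 0; x < len(posting); x++ {
			for y := x + 1; y < len(posting); y++ {
				shared[[2]int{posting[x], posting[y]}]++
			}
		}
	}
	for p, n := range shared {
		if n < minCommon {
			delete(shared, p)
		}
	}
	return shared
}

func main() {
	docs := [][]uint32{{1, 2, 3}, {2, 3, 4}, {4, 5}}
	fmt.Println(candidatePairs(docs, 2))
}
```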
type SimulateResult ¶
type SimulateResult struct {
	SimulationID  string
	PagesImproved uint32
	PagesDeclined uint32
	AvgDiff       float64
	MaxDiff       float64
	Results       []storage.SimulationResultRow
}
SimulateResult holds the outcome of a PageRank simulation.
func SimulatePageRank ¶
func SimulatePageRank(ctx context.Context, store SimulationStore, sessionID, simID string, links []storage.VirtualLink) (*SimulateResult, error)
SimulatePageRank computes PageRank before/after adding virtual links.
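A before/after comparison can be sketched with a plain power-iteration PageRank. Everything here is an assumption for illustration: the `pageRank` helper, the damping factor 0.85, the fixed iteration count, and the even redistribution of rank from pages without outlinks. The package's actual algorithm and parameters are not described on this page.

```go
package main

import "fmt"

// pageRank runs a fixed number of power iterations.
// Assumptions: damping 0.85, every link target appears in nodes, and
// dangling pages spread their rank evenly over all pages.
func pageRank(nodes []string, out map[string][]string, iters int) map[string]float64 {
	const d = 0.85
	n := float64(len(nodes))
	rank := make(map[string]float64, len(nodes))
	for _, u := range nodes {
		rank[u] = 1 / n
	}
	for it := 0; it < iters; it++ {
		next := make(map[string]float64, len(nodes))
		dangling := 0.0
		for _, u := range nodes {
			if len(out[u]) == 0 {
				dangling += rank[u]
				continue
			}
			share := rank[u] / float64(len(out[u]))
			for _, v := range out[u] {
				next[v] += share
			}
		}
		for _, u := range nodes {
			next[u] = (1-d)/n + d*(next[u]+dangling/n)
		}
		rank = next
	}
	return rank
}

func main() {
	nodes := []string{"/a", "/b", "/c"}
	before := map[string][]string{"/a": {"/b"}, "/b": {"/a"}, "/c": {"/a"}}
	// Same graph plus one virtual link /a -> /c.
	after := map[string][]string{"/a": {"/b", "/c"}, "/b": {"/a"}, "/c": {"/a"}}

	r0, r1 := pageRank(nodes, before, 50), pageRank(nodes, after, 50)
	for _, u := range nodes {
		fmt.Printf("%s diff=%+.4f\n", u, r1[u]-r0[u])
	}
}
```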
type SimulationStore ¶
type SimulationStore interface {
	LoadPageRankGraph(ctx context.Context, sessionID string) (*storage.PageRankGraph, error)
	InsertSimulation(ctx context.Context, sessionID string, simID string, virtualLinks []storage.VirtualLink, results []storage.SimulationResultRow, meta storage.SimulationMeta) error
}
SimulationStore is the subset of storage needed for PageRank simulation.