Documentation ¶
Overview ¶
Package analysis provides advanced content analysis engines for crawled pages.
Index ¶
- func CheckCWVIssues(results []model.LighthouseResult) []model.Issue
- func SortedCategories(sr *ScoreResult) []string
- type CWVSummary
- type ClassifiedPage
- type ContentAnalyzer
- type CrawlBudgetAnalyzer
- type DuplicateDetector
- type LinkGraph
- func (g *LinkGraph) ComputePageRank(iterations int, dampingFactor float64) map[string]float64
- func (g *LinkGraph) Edges() int
- func (g *LinkGraph) ExportDOT(w io.Writer) error
- func (g *LinkGraph) ExportJSON(w io.Writer) error
- func (g *LinkGraph) Nodes() int
- func (g *LinkGraph) OutgoingLinks(url string) []string
- type LinkGraphChecker
- type PageClass
- type RedirectChecker
- type RedirectEntry
- type RedirectMap
- type ScoreResult
- type Technology
- type URLScore
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CheckCWVIssues ¶
func CheckCWVIssues(results []model.LighthouseResult) []model.Issue
CheckCWVIssues examines the aggregated Lighthouse results and returns issues for poor site-wide scores.
func SortedCategories ¶
func SortedCategories(sr *ScoreResult) []string
SortedCategories returns the category names sorted alphabetically. This is useful for deterministic output ordering.
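Iterating over a Go map yields keys in random order, so deterministic reports need an explicit sort. A minimal sketch of the documented behaviour (the helper name and map shape here stand in for the real ScoreResult):

```go
package main

import (
	"fmt"
	"sort"
)

// sortedCategories collects the category names from a score map and
// sorts them alphabetically so output ordering is deterministic.
func sortedCategories(categories map[string]int) []string {
	names := make([]string, 0, len(categories))
	for name := range categories {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	scores := map[string]int{"seo": 90, "content": 80, "links": 70}
	fmt.Println(sortedCategories(scores)) // [content links seo]
}
```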
Types ¶
type CWVSummary ¶
type CWVSummary struct {
	PageCount          int        `json:"page_count"`
	AvgPerformance     float64    `json:"avg_performance"`
	AvgAccessibility   float64    `json:"avg_accessibility"`
	AvgSEO             float64    `json:"avg_seo"`
	AvgBestPractices   float64    `json:"avg_best_practices"`
	WorstPerformance   []URLScore `json:"worst_performance"`
	WorstAccessibility []URLScore `json:"worst_accessibility"`
	PassRate           float64    `json:"pass_rate"`
}
CWVSummary aggregates Core Web Vitals / Lighthouse scores across all audited pages.
func AggregateCWV ¶
func AggregateCWV(results []model.LighthouseResult) *CWVSummary
AggregateCWV computes aggregate Lighthouse scores from the given results. It returns nil if results is empty.
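The aggregation is a straightforward mean over per-page scores plus a pass rate. A sketch under stated assumptions: the stand-in struct and the 0.9 pass threshold are illustrative, not the actual model.LighthouseResult fields or the package's threshold.

```go
package main

import "fmt"

// lighthouseResult is a stand-in for model.LighthouseResult; the real
// field names are an assumption for illustration.
type lighthouseResult struct {
	URL         string
	Performance float64 // 0-1 Lighthouse score
}

// aggregate sketches the averaging AggregateCWV describes: a mean
// performance score plus a pass rate (here, share of pages >= 0.9).
// It returns ok=false for empty input, mirroring the nil return.
func aggregate(results []lighthouseResult) (avg, passRate float64, ok bool) {
	if len(results) == 0 {
		return 0, 0, false
	}
	var sum float64
	var passed int
	for _, r := range results {
		sum += r.Performance
		if r.Performance >= 0.9 {
			passed++
		}
	}
	n := float64(len(results))
	return sum / n, float64(passed) / n, true
}

func main() {
	avg, pass, _ := aggregate([]lighthouseResult{
		{URL: "/", Performance: 0.95},
		{URL: "/about", Performance: 0.65},
	})
	fmt.Printf("avg=%.2f pass=%.2f\n", avg, pass) // avg=0.80 pass=0.50
}
```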
type ClassifiedPage ¶
type ClassifiedPage struct {
	URL   string    `json:"url"`
	Class PageClass `json:"class"`
	Score float64   `json:"score"` // confidence 0-1
}
ClassifiedPage pairs a page URL with its detected class and confidence.
func ClassifyPages ¶
func ClassifyPages(pages []*model.Page) []ClassifiedPage
ClassifyPages classifies each page by applying heuristic rules and returning the highest-scoring class. Pages that match no rules are classified as "other" with a score of 0.
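A sketch of the kind of heuristic rule set this describes. These particular path rules and confidence scores are illustrative assumptions, not the package's actual rules; only the "no match means other with score 0" behaviour is taken from the documentation.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// classify applies illustrative URL-path heuristics and returns the
// matched class with a confidence score; unmatched pages fall through
// to "other" with a score of 0, as ClassifyPages documents.
func classify(rawURL string) (class string, score float64) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "other", 0
	}
	path := strings.ToLower(u.Path)
	switch {
	case path == "" || path == "/":
		return "homepage", 1.0
	case strings.Contains(path, "/blog/"):
		return "blog", 0.9
	case strings.Contains(path, "/product"):
		return "product", 0.8
	case strings.Contains(path, "/contact"):
		return "contact", 0.9
	default:
		return "other", 0 // no rule matched
	}
}

func main() {
	for _, u := range []string{"https://example.com/", "https://example.com/blog/hello"} {
		c, s := classify(u)
		fmt.Printf("%s -> %s (%.1f)\n", u, c, s)
	}
}
```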
type ContentAnalyzer ¶
type ContentAnalyzer struct{}
ContentAnalyzer checks per-page content quality metrics.
func NewContentAnalyzer ¶
func NewContentAnalyzer() *ContentAnalyzer
NewContentAnalyzer returns a new ContentAnalyzer.
func (*ContentAnalyzer) Name ¶
func (c *ContentAnalyzer) Name() string
Name returns the checker name.
type CrawlBudgetAnalyzer ¶
type CrawlBudgetAnalyzer struct{}
CrawlBudgetAnalyzer detects crawl budget waste across a site.
func NewCrawlBudgetAnalyzer ¶
func NewCrawlBudgetAnalyzer() *CrawlBudgetAnalyzer
NewCrawlBudgetAnalyzer returns a new CrawlBudgetAnalyzer.
func (*CrawlBudgetAnalyzer) Name ¶
func (c *CrawlBudgetAnalyzer) Name() string
Name returns the checker name.
type DuplicateDetector ¶
type DuplicateDetector struct {
	// ThinContentThreshold is the minimum word count below which a page is
	// flagged as thin content. Defaults to defaultThinContentThreshold.
	ThinContentThreshold int
}
DuplicateDetector finds exact duplicates, near-duplicates, and thin content across a set of crawled pages.
func NewDuplicateDetector ¶
func NewDuplicateDetector() *DuplicateDetector
NewDuplicateDetector returns a DuplicateDetector with default settings.
func (*DuplicateDetector) Analyze ¶
func (d *DuplicateDetector) Analyze(pages []*model.Page) []model.Issue
Analyze examines all pages for exact duplicates, near-duplicates, and thin content.
func (*DuplicateDetector) CheckSite ¶
CheckSite runs duplicate and thin content detection across all pages.
func (*DuplicateDetector) Name ¶
func (d *DuplicateDetector) Name() string
Name returns the checker name.
type LinkGraph ¶
type LinkGraph struct {
	// contains filtered or unexported fields
}
LinkGraph represents the internal link structure of a website.
func BuildGraph ¶
BuildGraph constructs a LinkGraph from a set of crawled pages. Only internal links (same host) are included in the graph.
func (*LinkGraph) ComputePageRank ¶
func (g *LinkGraph) ComputePageRank(iterations int, dampingFactor float64) map[string]float64
ComputePageRank runs the iterative PageRank algorithm.
Parameters:
- iterations: number of iterations to run (use 0 for default of 20)
- dampingFactor: the damping factor d (use 0 for default of 0.85)
Returns a map of URL to PageRank score.
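The iteration this describes can be sketched standalone. This is the standard update rule with the documented defaults (20 iterations, damping factor 0.85); dangling-node mass redistribution is omitted for brevity, so this is a simplification, not the package's exact implementation.

```go
package main

import "fmt"

// pageRank runs the iterative PageRank update
// rank[v] = (1-d)/N + d * sum(rank[u]/outdeg(u)) over in-neighbours u,
// with zero arguments falling back to the documented defaults.
func pageRank(adj map[string][]string, iterations int, d float64) map[string]float64 {
	if iterations <= 0 {
		iterations = 20
	}
	if d <= 0 {
		d = 0.85
	}
	n := float64(len(adj))
	rank := make(map[string]float64, len(adj))
	for node := range adj {
		rank[node] = 1 / n // uniform initial distribution
	}
	for i := 0; i < iterations; i++ {
		next := make(map[string]float64, len(adj))
		for node := range adj {
			next[node] = (1 - d) / n
		}
		for node, outs := range adj {
			if len(outs) == 0 {
				continue // dangling node; handling omitted
			}
			share := d * rank[node] / float64(len(outs))
			for _, to := range outs {
				next[to] += share
			}
		}
		rank = next
	}
	return rank
}

func main() {
	// Three pages in a cycle: ranks converge to 1/3 each.
	g := map[string][]string{"a": {"b"}, "b": {"c"}, "c": {"a"}}
	fmt.Printf("%.3f\n", pageRank(g, 0, 0)["a"]) // 0.333
}
```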
func (*LinkGraph) ExportJSON ¶
func (g *LinkGraph) ExportJSON(w io.Writer) error
ExportJSON writes the link graph as a JSON adjacency list to w.
func (*LinkGraph) OutgoingLinks ¶
func (g *LinkGraph) OutgoingLinks(url string) []string
OutgoingLinks returns the outgoing internal links for a URL.
type LinkGraphChecker ¶
type LinkGraphChecker struct{}
LinkGraphChecker wraps LinkGraph analysis as a SiteChecker for the audit registry.
func NewLinkGraphChecker ¶
func NewLinkGraphChecker() *LinkGraphChecker
NewLinkGraphChecker returns a new LinkGraphChecker.
func (*LinkGraphChecker) CheckSite ¶
CheckSite builds the link graph, computes PageRank, and generates issues.
func (*LinkGraphChecker) Name ¶
func (c *LinkGraphChecker) Name() string
Name returns the checker name.
type PageClass ¶
type PageClass string
PageClass represents the classification of a page.
const (
	ClassHomepage PageClass = "homepage"
	ClassBlog     PageClass = "blog"
	ClassProduct  PageClass = "product"
	ClassCategory PageClass = "category"
	ClassContact  PageClass = "contact"
	ClassAbout    PageClass = "about"
	ClassLegal    PageClass = "legal"
	ClassAPI      PageClass = "api"
	ClassOther    PageClass = "other"
)
type RedirectChecker ¶
type RedirectChecker struct{}
RedirectChecker implements SiteChecker for redirect-related issues.
func NewRedirectChecker ¶
func NewRedirectChecker() *RedirectChecker
NewRedirectChecker returns a new RedirectChecker.
func (*RedirectChecker) CheckSite ¶
CheckSite runs site-wide redirect analysis and returns any issues found.
func (*RedirectChecker) Name ¶
func (c *RedirectChecker) Name() string
Name returns the checker name.
type RedirectEntry ¶
type RedirectEntry struct {
	From     string   `json:"from"`
	To       string   `json:"to"`
	Chain    []string `json:"chain"`
	Hops     int      `json:"hops"`
	LinkedBy []string `json:"linked_by"`
}
RedirectEntry represents a single redirect, including its full chain, hop count, and the set of pages that link to the redirecting URL.
type RedirectMap ¶
type RedirectMap struct {
	Entries []RedirectEntry `json:"entries"`
}
RedirectMap holds all redirect entries discovered during analysis.
func AnalyzeRedirects ¶
func AnalyzeRedirects(pages []*model.Page) *RedirectMap
AnalyzeRedirects scans all pages for non-empty RedirectChain, builds redirect entries with full chain info, cross-references pages that link to the redirecting URL, and returns the result sorted by hop count descending.
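The final ordering step can be sketched on its own: sorting entries by hop count descending surfaces the longest chains first. The local struct below mirrors the exported RedirectEntry fields.

```go
package main

import (
	"fmt"
	"sort"
)

// redirectEntry mirrors the exported RedirectEntry fields.
type redirectEntry struct {
	From  string
	To    string
	Chain []string
	Hops  int
}

// sortByHops orders entries by hop count descending, so the worst
// redirect chains appear at the top of the report, as AnalyzeRedirects
// documents. SliceStable keeps equal-hop entries in discovery order.
func sortByHops(entries []redirectEntry) {
	sort.SliceStable(entries, func(i, j int) bool {
		return entries[i].Hops > entries[j].Hops
	})
}

func main() {
	entries := []redirectEntry{
		{From: "/a", To: "/b", Hops: 1},
		{From: "/old", To: "/new", Chain: []string{"/old", "/tmp", "/new"}, Hops: 2},
	}
	sortByHops(entries)
	fmt.Println(entries[0].From) // /old
}
```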
type ScoreResult ¶
type ScoreResult struct {
	Overall     int            `json:"overall"`    // 0-100
	Categories  map[string]int `json:"categories"` // category -> score
	Breakdown   map[string]int `json:"breakdown"`  // severity -> count
	TotalPages  int            `json:"total_pages"`
	TotalIssues int            `json:"total_issues"`
}
ScoreResult holds the computed health score for a crawled site.
func ComputeScore ¶
func ComputeScore(result *model.CrawlResult) *ScoreResult
ComputeScore calculates a 0-100 aggregate health score from the issues in a CrawlResult. The overall score starts at 100 and is decremented per issue (critical=-10, warning=-3, info=-1), floored at 0. Category scores are computed independently using the same algorithm.
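The decrement rule above is simple arithmetic and can be sketched directly from the documented weights (critical=-10, warning=-3, info=-1, floored at 0):

```go
package main

import "fmt"

// healthScore applies the documented scoring rule: start at 100,
// subtract 10 per critical, 3 per warning, and 1 per info issue,
// and floor the result at 0.
func healthScore(critical, warning, info int) int {
	score := 100 - 10*critical - 3*warning - 1*info
	if score < 0 {
		return 0
	}
	return score
}

func main() {
	fmt.Println(healthScore(2, 3, 1))  // 100 - 20 - 9 - 1 = 70
	fmt.Println(healthScore(12, 0, 0)) // would be -20, floored to 0
}
```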
type Technology ¶
type Technology struct {
	Name     string `json:"name"`
	Category string `json:"category"` // cms, framework, analytics, cdn, etc.
	Version  string `json:"version,omitempty"`
	Evidence string `json:"evidence"` // what signal detected it
}
Technology represents a detected technology on a website.
func DetectTechnologies ¶
func DetectTechnologies(pages []*model.Page) []Technology
DetectTechnologies scans the provided pages for known technology signatures and returns a deduplicated, sorted list of detected technologies.
Body and asset checks primarily use the first (homepage) page; header checks are applied across all pages for CDN detection.
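Header-based CDN detection of the kind described can be sketched as a lookup against known response-header signatures. The two signatures below (a "cloudflare" Server header, CloudFront's X-Amz-Cf-Id header) are real-world examples, but this table is illustrative, not the package's actual rule set.

```go
package main

import (
	"fmt"
	"strings"
)

// detectCDN inspects response headers for known CDN signatures and
// returns the detected name plus the evidence string that matched.
func detectCDN(headers map[string]string) (name, evidence string, ok bool) {
	server := strings.ToLower(headers["Server"])
	switch {
	case strings.Contains(server, "cloudflare"):
		return "Cloudflare", "Server: " + headers["Server"], true
	case headers["X-Amz-Cf-Id"] != "":
		return "CloudFront", "X-Amz-Cf-Id header present", true
	}
	return "", "", false
}

func main() {
	name, evidence, _ := detectCDN(map[string]string{"Server": "cloudflare"})
	fmt.Println(name, "-", evidence)
}
```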