Documentation ¶
Overview ¶
Package scope provides URL scope checking for the crawler.
Index ¶
- Variables
- func ClassifyURL(urlStr string) string
- func ExtractDomain(urlStr string) (string, error)
- func IsAPIPath(path string) bool
- func IsValidURL(urlStr string) bool
- func MatchPattern(url, pattern string) bool
- func NormalizeURL(rawURL string) (string, error)
- func ResolveURL(baseURL, relativeURL string) (string, error)
- type Checker
- type RuleBuilder
- func (b *RuleBuilder) Build() ScopeRules
- func (b *RuleBuilder) WithAllowedDomains(domains ...string) *RuleBuilder
- func (b *RuleBuilder) WithDefaultExcludes() *RuleBuilder
- func (b *RuleBuilder) WithExcludePatterns(patterns ...string) *RuleBuilder
- func (b *RuleBuilder) WithFollowExternal(follow bool) *RuleBuilder
- func (b *RuleBuilder) WithIncludePatterns(patterns ...string) *RuleBuilder
- func (b *RuleBuilder) WithMaxDepth(depth int) *RuleBuilder
- type ScopeRules
Constants ¶
This section is empty.
Variables ¶
var CommonAPIPatterns = []string{
`/api/`,
`/v[0-9]+/`,
`/graphql`,
`/rest/`,
`/rpc/`,
`/ajax/`,
`/json/`,
`/xml/`,
}
CommonAPIPatterns contains common API path patterns.
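The entries read like regular-expression fragments (`/v[0-9]+/` in particular). As a minimal sketch, assuming they are applied as unanchored regular expressions to a URL path, a check might look like the following. The helper `looksLikeAPIPath` is hypothetical and is not the package's `IsAPIPath`; the pattern list is copied locally so the example is self-contained.

```go
package main

import (
	"fmt"
	"regexp"
)

// commonAPIPatterns is a local copy of the documented CommonAPIPatterns list.
var commonAPIPatterns = []string{
	`/api/`, `/v[0-9]+/`, `/graphql`, `/rest/`, `/rpc/`, `/ajax/`, `/json/`, `/xml/`,
}

// looksLikeAPIPath is a hypothetical helper (an assumption, not the package's
// implementation): it reports whether any pattern, treated as an unanchored
// regular expression, matches the path.
func looksLikeAPIPath(path string) bool {
	for _, p := range commonAPIPatterns {
		if regexp.MustCompile(p).MatchString(path) {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"/api/users", "/v2/items", "/v2", "/about", "/graphql"} {
		fmt.Printf("%-12s %v\n", p, looksLikeAPIPath(p))
	}
}
```

Under this assumption, `/v2` without a trailing slash does not match, because `/v[0-9]+/` requires the closing `/`.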
var DefaultExcludePatterns = []string{
`.*[?&]logout.*`,
`.*[?&]signout.*`,
`.*[?&]exit.*`,
`.*\/logout.*`,
`.*\/signout.*`,
`.*\/delete-account.*`,
`.*\/unsubscribe.*`,
`.*\/reset-password.*`,
`.*\.pdf$`,
`.*\.zip$`,
`.*\.exe$`,
`.*\.dmg$`,
}
DefaultExcludePatterns contains common patterns to exclude.
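As a rough illustration of how such a list could be applied, the sketch below assumes each entry is a regular expression matched against the full URL string. The helpers `compileAll` and `isExcluded` are hypothetical names introduced for this example, and only a few of the documented patterns are copied locally.

```go
package main

import (
	"fmt"
	"regexp"
)

// defaultExcludes is a local copy of a few documented DefaultExcludePatterns entries.
var defaultExcludes = []string{
	`.*[?&]logout.*`,
	`.*\/logout.*`,
	`.*\/delete-account.*`,
	`.*\.pdf$`,
}

// compileAll compiles every pattern once so matching does not recompile per URL.
func compileAll(patterns []string) ([]*regexp.Regexp, error) {
	out := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("compile %q: %w", p, err)
		}
		out = append(out, re)
	}
	return out, nil
}

// isExcluded is a hypothetical helper (an assumption about usage): it reports
// whether any compiled pattern matches the raw URL string.
func isExcluded(rawURL string, res []*regexp.Regexp) bool {
	for _, re := range res {
		if re.MatchString(rawURL) {
			return true
		}
	}
	return false
}

func main() {
	res, err := compileAll(defaultExcludes)
	if err != nil {
		panic(err)
	}
	for _, u := range []string{
		"https://example.com/logout",
		"https://example.com/report.pdf",
		"https://example.com/report.pdf?download=1",
		"https://example.com/docs",
	} {
		fmt.Printf("%-45s excluded=%v\n", u, isExcluded(u, res))
	}
}
```

If the patterns really are matched against the full URL, the `$`-anchored file-extension entries would not match a URL that carries a query string after the extension, as the third example shows.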
Functions ¶
func ClassifyURL ¶
func ClassifyURL(urlStr string) string
ClassifyURL classifies a URL by its likely type.
func ExtractDomain ¶
func ExtractDomain(urlStr string) (string, error)
ExtractDomain extracts the domain from a URL.
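The documentation does not say how the domain is derived. One plausible approach, sketched here with the standard `net/url` package, is to parse the URL and take the lowercased host without its port. The function `hostOf` is a hypothetical stand-in, not the package's implementation, and its error behavior is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// hostOf is a hypothetical stand-in for domain extraction: it returns the
// lowercased host of rawURL with any port removed, or an error if the URL
// cannot be parsed or has no host component.
func hostOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", fmt.Errorf("parse %q: %w", rawURL, err)
	}
	host := u.Hostname() // strips any ":port" suffix
	if host == "" {
		return "", errors.New("URL has no host: " + rawURL)
	}
	return strings.ToLower(host), nil
}

func main() {
	for _, in := range []string{"https://Example.com:8443/path", "/relative/path"} {
		host, err := hostOf(in)
		fmt.Printf("%q -> %q, err=%v\n", in, host, err)
	}
}
```

Note that a scheme-less input such as `example.com/path` parses as a path, so this sketch reports it as having no host.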
func IsValidURL ¶
func IsValidURL(urlStr string) bool
IsValidURL reports whether a URL is valid for crawling.
func MatchPattern ¶
func MatchPattern(url, pattern string) bool
MatchPattern reports whether a URL matches a pattern.
func NormalizeURL ¶
func NormalizeURL(rawURL string) (string, error)
NormalizeURL normalizes a URL for deduplication.
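Which normalization steps the package applies is not documented. The sketch below shows a common set of rules for deduplication, under stated assumptions: lowercase the host, drop default ports and the fragment, sort query parameters, and treat an empty path as `/`. The function `normalize` is hypothetical and may differ from what `NormalizeURL` actually does.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// normalize is a hypothetical sketch of URL normalization for deduplication.
// The chosen rules are assumptions, not the package's documented behavior.
func normalize(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", fmt.Errorf("parse %q: %w", rawURL, err)
	}
	u.Scheme = strings.ToLower(u.Scheme)

	// Lowercase the host and drop the port when it is the scheme default.
	host := strings.ToLower(u.Hostname())
	port := u.Port()
	if (u.Scheme == "http" && port == "80") || (u.Scheme == "https" && port == "443") {
		port = ""
	}
	switch {
	case port != "":
		host = net.JoinHostPort(host, port)
	case strings.Contains(host, ":"):
		host = "[" + host + "]" // restore brackets around a bare IPv6 literal
	}
	u.Host = host

	// Fragments never reach the server, so they are irrelevant for deduplication.
	u.Fragment = ""
	u.RawFragment = ""

	// Values.Encode sorts by key, so parameter order no longer matters.
	// Query silently drops malformed pairs.
	u.RawQuery = u.Query().Encode()

	if u.Path == "" {
		u.Path = "/"
	}
	return u.String(), nil
}

func main() {
	for _, in := range []string{
		"HTTP://Example.COM:80/a?b=2&a=1#frag",
		"https://example.com",
	} {
		out, err := normalize(in)
		fmt.Println(out, err)
	}
}
```

With these rules, `https://example.com/#top` and `https://EXAMPLE.com:443/` reduce to the same string, which is the property a crawler's seen-set relies on.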
func ResolveURL ¶
func ResolveURL(baseURL, relativeURL string) (string, error)
ResolveURL resolves a relative URL against a base URL.
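Relative-reference resolution is available in the standard library through `(*url.URL).ResolveReference`, which follows RFC 3986. The sketch below shows that approach; the function `resolve` is a hypothetical stand-in, and the requirement that the base be absolute is an assumption rather than documented behavior of `ResolveURL`.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve is a hypothetical stand-in that resolves ref against baseURL
// using the standard library's RFC 3986 implementation.
func resolve(baseURL, ref string) (string, error) {
	base, err := url.Parse(baseURL)
	if err != nil {
		return "", fmt.Errorf("parse base %q: %w", baseURL, err)
	}
	if !base.IsAbs() {
		// Assumption: a base without a scheme cannot anchor a crawl link.
		return "", fmt.Errorf("base URL %q is not absolute", baseURL)
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", fmt.Errorf("parse reference %q: %w", ref, err)
	}
	return base.ResolveReference(r).String(), nil
}

func main() {
	base := "https://example.com/docs/guide/"
	for _, ref := range []string{"../api?x=1", "intro.html", "//cdn.example.net/x.js", "https://other.org/"} {
		out, err := resolve(base, ref)
		fmt.Printf("%-26s -> %s (err=%v)\n", ref, out, err)
	}
}
```

Scheme-relative references such as `//cdn.example.net/x.js` inherit the base scheme, and an already absolute reference is returned unchanged.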
Types ¶
type Checker ¶
type Checker struct {
// contains filtered or unexported fields
}
Checker validates URLs against scope rules.
func NewChecker ¶
func NewChecker(targetURL string, rules ScopeRules) (*Checker, error)
NewChecker creates a new scope checker.
func (*Checker) AddAllowedDomain ¶
AddAllowedDomain adds a domain to the allowed list.
func (*Checker) AddExcludePattern ¶
AddExcludePattern adds an exclude pattern.
func (*Checker) AddIncludePattern ¶
AddIncludePattern adds an include pattern.
func (*Checker) SetMaxDepth ¶
SetMaxDepth sets the maximum crawl depth.
type RuleBuilder ¶
type RuleBuilder struct {
// contains filtered or unexported fields
}
RuleBuilder helps build scope rules.
func (*RuleBuilder) Build ¶
func (b *RuleBuilder) Build() ScopeRules
Build returns the configured rules.
func (*RuleBuilder) WithAllowedDomains ¶
func (b *RuleBuilder) WithAllowedDomains(domains ...string) *RuleBuilder
WithAllowedDomains sets allowed domains.
func (*RuleBuilder) WithDefaultExcludes ¶
func (b *RuleBuilder) WithDefaultExcludes() *RuleBuilder
WithDefaultExcludes adds default exclude patterns.
func (*RuleBuilder) WithExcludePatterns ¶
func (b *RuleBuilder) WithExcludePatterns(patterns ...string) *RuleBuilder
WithExcludePatterns adds exclude patterns.
func (*RuleBuilder) WithFollowExternal ¶
func (b *RuleBuilder) WithFollowExternal(follow bool) *RuleBuilder
WithFollowExternal sets whether external links are followed.
func (*RuleBuilder) WithIncludePatterns ¶
func (b *RuleBuilder) WithIncludePatterns(patterns ...string) *RuleBuilder
WithIncludePatterns adds include patterns.
func (*RuleBuilder) WithMaxDepth ¶
func (b *RuleBuilder) WithMaxDepth(depth int) *RuleBuilder
WithMaxDepth sets the maximum crawl depth.
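The method set above suggests a fluent builder: each `With…` call returns the same `*RuleBuilder`, so calls chain, and `Build` returns the finished value. Because the fields of `ScopeRules` and the builder's constructor are not shown in this documentation, the sketch below uses hypothetical lowercase stand-ins (`scopeRules`, `ruleBuilder`, `newRuleBuilder`) with assumed fields, purely to illustrate the pattern.

```go
package main

import "fmt"

// scopeRules is a hypothetical stand-in for ScopeRules; the field names
// are assumptions, since the real fields are not documented here.
type scopeRules struct {
	AllowedDomains  []string
	ExcludePatterns []string
	MaxDepth        int
	FollowExternal  bool
}

// ruleBuilder is a hypothetical stand-in for RuleBuilder.
type ruleBuilder struct{ rules scopeRules }

// newRuleBuilder is an assumed constructor; the real package may differ.
func newRuleBuilder() *ruleBuilder { return &ruleBuilder{} }

// WithAllowedDomains replaces the allowed-domain list ("sets").
func (b *ruleBuilder) WithAllowedDomains(domains ...string) *ruleBuilder {
	b.rules.AllowedDomains = append([]string(nil), domains...)
	return b
}

// WithExcludePatterns appends to the exclude list ("adds").
func (b *ruleBuilder) WithExcludePatterns(patterns ...string) *ruleBuilder {
	b.rules.ExcludePatterns = append(b.rules.ExcludePatterns, patterns...)
	return b
}

// WithMaxDepth sets the maximum crawl depth.
func (b *ruleBuilder) WithMaxDepth(depth int) *ruleBuilder {
	b.rules.MaxDepth = depth
	return b
}

// WithFollowExternal sets whether external links are followed.
func (b *ruleBuilder) WithFollowExternal(follow bool) *ruleBuilder {
	b.rules.FollowExternal = follow
	return b
}

// Build returns the accumulated rules by value.
func (b *ruleBuilder) Build() scopeRules { return b.rules }

func main() {
	rules := newRuleBuilder().
		WithAllowedDomains("example.com", "cdn.example.com").
		WithExcludePatterns(`.*\.pdf$`).
		WithExcludePatterns(`.*\/logout.*`).
		WithMaxDepth(3).
		WithFollowExternal(false).
		Build()
	fmt.Printf("%+v\n", rules)
}
```

In this sketch, "sets" methods overwrite earlier values while "adds" methods accumulate across calls, mirroring the wording of the method descriptions.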