crawler

package v0.5.0

Published: May 11, 2026 License: MIT Imports: 14 Imported by: 0

Documentation

Overview

Package crawler implements a concurrent website crawler with rate limiting, depth control, URL deduplication, and robots.txt compliance.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func FetchSitemapURLs

func FetchSitemapURLs(ctx context.Context, client *http.Client, sitemapURLs []string) []string

FetchSitemapURLs fetches and parses the sitemaps at the given URLs. It supports both sitemap index files and direct URL sets.
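
A minimal sketch, assuming the standard context, net/http, time, and fmt imports; the sitemap URL is a placeholder:

ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

client := &http.Client{Timeout: 10 * time.Second}
urls := crawler.FetchSitemapURLs(ctx, client, []string{"https://example.com/sitemap.xml"})
for _, u := range urls {
	fmt.Println(u)
}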

func ServeDir

func ServeDir(ctx context.Context, dir string) (*http.Server, string, error)

ServeDir starts a temporary HTTP file server for the given directory. Returns the server and its address (host:port). The caller must call srv.Close() when done.
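 
A sketch of serving and crawling a local directory; the directory path, Config values, and ctx are illustrative:

srv, addr, err := crawler.ServeDir(ctx, "./public")
if err != nil {
	log.Fatal(err)
}
defer srv.Close()

pages, err := crawler.New(crawler.Config{MaxDepth: 2, Concurrency: 4}).Crawl(ctx, "http://"+addr+"/")
if err != nil {
	log.Fatal(err)
}
fmt.Println(len(pages), "pages crawled")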

Types

type Config

type Config struct {
	MaxDepth        int
	Concurrency     int
	Timeout         time.Duration
	PageTimeout     time.Duration
	RateLimit       int
	RetryAttempts   int
	RetryDelay      time.Duration
	UserAgent       string
	FollowRedirects int
	RespectRobots   bool
	Exclude         []string
	AuthHeader      string
	AuthValue       string
	CookieJar       http.CookieJar
	AllowPrivateIPs bool // When true, skip SSRF protection for private IPs
}

Config controls crawler behavior.
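
An illustrative configuration. The RateLimit unit and the Exclude matching semantics are not documented here, so those values are assumptions:

cfg := crawler.Config{
	MaxDepth:        3,
	Concurrency:     8,
	Timeout:         2 * time.Minute,  // assumed: overall crawl budget
	PageTimeout:     15 * time.Second, // assumed: per-page fetch budget
	RateLimit:       10,               // assumed: requests per second
	RetryAttempts:   2,
	RetryDelay:      time.Second,
	UserAgent:       "examplebot/0.5",
	FollowRedirects: 5, // assumed: maximum redirects per request
	RespectRobots:   true,
	Exclude:         []string{"/admin"}, // assumed: matching semantics
}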

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler performs concurrent crawling with rate limiting.

func New

func New(cfg Config) *Crawler

New creates a configured Crawler.

func (*Crawler) Crawl

func (c *Crawler) Crawl(ctx context.Context, startURL string) ([]*Page, error)

Crawl starts from the given URL and discovers pages up to MaxDepth. Returns all crawled pages. Safe for concurrent use via internal locking.
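
A typical end-to-end sketch, reusing the illustrative cfg from the Config example above; it also shows the Page fields most callers read:

c := crawler.New(cfg)
pages, err := c.Crawl(ctx, "https://example.com/")
if err != nil {
	log.Fatal(err)
}
for _, p := range pages {
	if p.Error != nil {
		log.Printf("fetch failed for %s: %v", p.URL, p.Error)
		continue
	}
	fmt.Printf("%d %s (depth %d, %d links)\n", p.StatusCode, p.URL, p.Depth, len(p.Links))
}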

type Form

type Form struct {
	Action  string
	Method  string
	ID      string
	Inputs  []FormInput
	HasCSRF bool
}

Form represents an HTML form found on a page.
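
A caller might, for example, flag POST forms that carry no CSRF token; strings.EqualFold avoids assuming how Method is cased:

for _, f := range page.Forms {
	if strings.EqualFold(f.Method, "post") && !f.HasCSRF {
		fmt.Printf("form %q posting to %s has no CSRF token\n", f.ID, f.Action)
	}
}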

type FormInput

type FormInput struct {
	Name     string
	Type     string
	Required bool
	Value    string
}

FormInput represents a form field.

type Link

type Link struct {
	Href     string
	Text     string
	Rel      string
	External bool
	Anchor   bool
	Resource bool   // true for non-anchor resource URLs (img, script, iframe, etc.)
	Tag      string // source element tag (e.g., "img", "script", "iframe")
}

Link represents a hyperlink found on a page.
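
The flags let callers separate link kinds while walking a Page's Links slice, as in this sketch:

for _, l := range page.Links {
	switch {
	case l.Anchor:
		// same-document fragment; usually skipped
	case l.Resource:
		fmt.Printf("resource <%s>: %s\n", l.Tag, l.Href)
	case l.External:
		fmt.Println("external:", l.Href)
	default:
		fmt.Println("internal:", l.Href)
	}
}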

type Page

type Page struct {
	URL          string
	StatusCode   int
	Headers      http.Header
	Body         []byte
	Links        []Link
	Forms        []Form
	Depth        int
	ParentURL    string
	Duration     time.Duration
	Error        error
	AuthRequired bool // true when server returned 401/403
}

Page represents a single crawled page with its metadata.

type ParseResult added in v0.5.0

type ParseResult struct {
	Links    []Link
	Forms    []Form
	ParseErr error // non-nil if HTML parsing encountered an error (partial results still returned)
}

ParseResult holds links/forms extraction results along with any parse error.

func ParseHTML added in v0.5.0

func ParseHTML(pageURL string, body []byte) ParseResult

ParseHTML extracts links and forms, returning partial results even on parse error.
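
A sketch of the partial-results pattern, assuming body holds fetched HTML as []byte: inspect ParseErr, but still use whatever was extracted.

res := crawler.ParseHTML("https://example.com/page", body)
if res.ParseErr != nil {
	log.Printf("parse was incomplete: %v", res.ParseErr)
}
fmt.Printf("extracted %d links and %d forms\n", len(res.Links), len(res.Forms))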

type RobotsCache

type RobotsCache struct {
	// contains filtered or unexported fields
}

RobotsCache caches parsed robots.txt rules per host.

func NewRobotsCache

func NewRobotsCache() *RobotsCache

NewRobotsCache creates an empty robots.txt cache.

func (*RobotsCache) Allowed

func (rc *RobotsCache) Allowed(rawURL, userAgent string) bool

Allowed checks if a URL is permitted by robots.txt rules. Per the standard, if both Allow and Disallow match a path, the longest matching rule wins. If they are the same length, Allow takes precedence.
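
To illustrate the precedence rule, suppose example.com served a hypothetical robots.txt containing "Disallow: /private" and "Allow: /private/public" under "User-agent: *". Then:

rc := crawler.NewRobotsCache()
rc.Fetch(ctx, &http.Client{Timeout: 10 * time.Second}, "https://example.com")
fmt.Println(rc.Allowed("https://example.com/private/public/page", "examplebot")) // true: the Allow rule is longer
fmt.Println(rc.Allowed("https://example.com/private/secret", "examplebot"))      // false: only Disallow matches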

func (*RobotsCache) CrawlDelay added in v0.5.0

func (rc *RobotsCache) CrawlDelay(origin string) time.Duration

CrawlDelay returns the crawl-delay directive for the given origin, or 0 if not set.

func (*RobotsCache) Fetch

func (rc *RobotsCache) Fetch(ctx context.Context, client *http.Client, origin string)

Fetch downloads and parses robots.txt for the given origin.

func (*RobotsCache) Sitemaps

func (rc *RobotsCache) Sitemaps(origin string) []string

Sitemaps returns sitemap URLs declared in robots.txt.
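
Fetch, Sitemaps, CrawlDelay, and FetchSitemapURLs compose into a sitemap discovery step, sketched here with an illustrative origin:

client := &http.Client{Timeout: 10 * time.Second}
rc := crawler.NewRobotsCache()
rc.Fetch(ctx, client, "https://example.com")
urls := crawler.FetchSitemapURLs(ctx, client, rc.Sitemaps("https://example.com"))
fmt.Printf("crawl-delay=%s, %d sitemap URLs\n", rc.CrawlDelay("https://example.com"), len(urls))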

type SitemapURL

type SitemapURL struct {
	Loc        string `xml:"loc"`
	Lastmod    string `xml:"lastmod,omitempty"`
	Changefreq string `xml:"changefreq,omitempty"`
	Priority   string `xml:"priority,omitempty"`
}

SitemapURL represents a single URL entry in a sitemap.
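
The xml tags follow the standard sitemap schema, so a single entry unmarshals directly with encoding/xml; a minimal sketch:

var u crawler.SitemapURL
if err := xml.Unmarshal([]byte(`<url><loc>https://example.com/</loc><lastmod>2026-01-01</lastmod></url>`), &u); err != nil {
	log.Fatal(err)
}
fmt.Println(u.Loc, u.Lastmod) // https://example.com/ 2026-01-01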
