crawler

package
v0.1.0
Published: May 2, 2026 License: MIT Imports: 13 Imported by: 0

Documentation

Overview

Package crawler implements a concurrent website crawler with rate limiting, depth control, URL deduplication, and robots.txt compliance.

Constants

This section is empty.

Variables

This section is empty.

Functions

func FetchSitemapURLs

func FetchSitemapURLs(ctx context.Context, client *http.Client, sitemapURLs []string) []string

FetchSitemapURLs fetches and parses sitemap(s) from the given URLs. Supports both sitemap index files and direct URL sets.

func ServeDir

func ServeDir(ctx context.Context, dir string) (*http.Server, string, error)

ServeDir starts a temporary HTTP file server for the given directory. Returns the server and its address (host:port). The caller must call srv.Close() when done.
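The behavior described for ServeDir can be approximated with the standard library alone. The following is a minimal sketch, not the package's implementation: serveDirSketch and fetchFromTempDir are hypothetical names, the context parameter is omitted, and binding to 127.0.0.1 on an OS-chosen port is an assumption.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"path/filepath"
)

// serveDirSketch is a hypothetical stand-in for ServeDir: it binds an
// ephemeral loopback port and serves dir over HTTP in the background.
func serveDirSketch(dir string) (*http.Server, string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // port 0 lets the OS pick a free port
	if err != nil {
		return nil, "", err
	}
	srv := &http.Server{Handler: http.FileServer(http.Dir(dir))}
	go srv.Serve(ln) // returns http.ErrServerClosed once srv.Close() is called
	return srv, ln.Addr().String(), nil
}

// fetchFromTempDir writes one file into a temporary directory, serves the
// directory, and returns the file body as fetched over HTTP.
func fetchFromTempDir() (string, error) {
	dir, err := os.MkdirTemp("", "servedir")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "page.html"), []byte("hello"), 0o644); err != nil {
		return "", err
	}

	srv, addr, err := serveDirSketch(dir)
	if err != nil {
		return "", err
	}
	defer srv.Close() // the caller is responsible for shutting the server down

	resp, err := http.Get("http://" + addr + "/page.html")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetchFromTempDir()
	fmt.Println(body, err)
}
```

The returned address has the host:port form, so callers prepend the scheme themselves, as the sketch does with "http://".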

Types

type Config

type Config struct {
	MaxDepth        int
	Concurrency     int
	Timeout         time.Duration
	PageTimeout     time.Duration
	RateLimit       int
	RetryAttempts   int
	RetryDelay      time.Duration
	UserAgent       string
	FollowRedirects int
	RespectRobots   bool
	Exclude         []string
	AuthHeader      string
	AuthValue       string
	CookieJar       http.CookieJar
}

Config controls crawler behavior.
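The individual Config fields are not documented here. As an illustration only, the sketch below shows one way values such as Timeout, FollowRedirects, and CookieJar could be applied to an http.Client. The mapping is an assumption (in particular, that FollowRedirects is a maximum redirect count), and newClient and checkHops are hypothetical helpers, not part of this package.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newClient builds an http.Client from three settings that resemble
// Config.Timeout, Config.FollowRedirects, and Config.CookieJar.
// How the crawler actually uses those fields is not documented.
func newClient(timeout time.Duration, maxRedirects int, jar http.CookieJar) *http.Client {
	return &http.Client{
		Timeout: timeout,
		Jar:     jar,
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			// via holds the requests already made, oldest first, so
			// len(via) is the number of the redirect about to be followed.
			if len(via) > maxRedirects {
				return fmt.Errorf("stopped after %d redirects", maxRedirects)
			}
			return nil
		},
	}
}

// checkHops reports what the redirect policy would decide after the
// given number of earlier requests.
func checkHops(c *http.Client, hops int) error {
	return c.CheckRedirect(nil, make([]*http.Request, hops))
}

func main() {
	c := newClient(10*time.Second, 2, nil)
	fmt.Println(c.Timeout, checkHops(c, 2), checkHops(c, 3))
}
```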

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler performs concurrent crawling with rate limiting.
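To illustrate what concurrent crawling with rate limiting can look like, here is a minimal sketch of a worker pool that shares a single ticker. It is not taken from this package: processAll and runDemo are hypothetical names, and expressing the limit as one job start per interval is an assumption about how a setting like RateLimit might be interpreted.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// processAll runs work for every job on `workers` goroutines while
// starting at most one job per interval across all workers.
func processAll(jobs []string, workers int, interval time.Duration, work func(string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	queue := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range queue {
				<-ticker.C // shared ticker: one token per interval
				work(job)
			}
		}()
	}
	for _, j := range jobs {
		queue <- j
	}
	close(queue)
	wg.Wait()
}

// runDemo processes five jobs with three workers and a 10ms interval,
// returning how many jobs ran and how long the whole run took.
func runDemo() (int, time.Duration) {
	var mu sync.Mutex
	done := 0
	start := time.Now()
	processAll([]string{"/a", "/b", "/c", "/d", "/e"}, 3, 10*time.Millisecond, func(string) {
		mu.Lock()
		done++
		mu.Unlock()
	})
	return done, time.Since(start)
}

func main() {
	n, elapsed := runDemo()
	fmt.Println(n, elapsed)
}
```

Concurrency and rate are independent knobs in this design: adding workers lets slow responses overlap, but the ticker still caps how often new requests begin.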

func New

func New(cfg Config) *Crawler

New creates a configured Crawler.

func (*Crawler) Crawl

func (c *Crawler) Crawl(ctx context.Context, startURL string) ([]*Page, error)

Crawl starts from the given URL and discovers pages up to MaxDepth. Returns all crawled pages. Safe for concurrent use via internal locking.
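Depth control and URL deduplication can be pictured as a breadth-first walk with a visited set. The sketch below runs over an in-memory link graph instead of the network and is sequential for clarity. It is not the package's algorithm: crawlGraph, visit, and demoSite are hypothetical, and treating the start URL as depth 0 is an assumption.

```go
package main

import "fmt"

// visit records a discovered URL and the depth at which it was found.
type visit struct {
	URL   string
	Depth int
}

// crawlGraph walks links breadth-first from start, visiting each URL at
// most once and never expanding pages that sit at maxDepth.
func crawlGraph(links map[string][]string, start string, maxDepth int) []visit {
	seen := map[string]bool{start: true}
	queue := []visit{{start, 0}}
	var out []visit
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		out = append(out, cur)
		if cur.Depth == maxDepth {
			continue // do not follow links beyond the depth limit
		}
		for _, next := range links[cur.URL] {
			if !seen[next] {
				seen[next] = true // mark on enqueue so duplicates are never queued
				queue = append(queue, visit{next, cur.Depth + 1})
			}
		}
	}
	return out
}

// demoSite is a tiny made-up site: "/" links to "/a" and "/b", and so on.
func demoSite() map[string][]string {
	return map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/", "/c"},
		"/b": {"/c"},
		"/c": {"/d"},
	}
}

func main() {
	for _, v := range crawlGraph(demoSite(), "/", 2) {
		fmt.Println(v.Depth, v.URL)
	}
}
```

With a depth limit of 2, "/c" is visited once even though two pages link to it, and "/d" is never reached because it would sit at depth 3.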

type Form

type Form struct {
	Action  string
	Method  string
	ID      string
	Inputs  []FormInput
	HasCSRF bool
}

Form represents an HTML form found on a page.

type FormInput

type FormInput struct {
	Name     string
	Type     string
	Required bool
	Value    string
}

FormInput represents a form field.
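How HasCSRF is computed is not documented. One common heuristic, shown here purely as an assumption, is to look for a hidden input whose name mentions "csrf" or "token". The FormInput type below is copied from the declaration above so that the sketch compiles on its own, and hasCSRFToken is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// FormInput mirrors the type documented above.
type FormInput struct {
	Name     string
	Type     string
	Required bool
	Value    string
}

// hasCSRFToken reports whether any hidden input has a name that looks
// like an anti-CSRF token. This is a heuristic and can misclassify forms.
func hasCSRFToken(inputs []FormInput) bool {
	for _, in := range inputs {
		if !strings.EqualFold(in.Type, "hidden") {
			continue
		}
		name := strings.ToLower(in.Name)
		if strings.Contains(name, "csrf") || strings.Contains(name, "token") {
			return true
		}
	}
	return false
}

func main() {
	login := []FormInput{
		{Name: "username", Type: "text", Required: true},
		{Name: "csrf_token", Type: "hidden", Value: "abc123"},
	}
	search := []FormInput{{Name: "q", Type: "text"}}
	fmt.Println(hasCSRFToken(login), hasCSRFToken(search))
}
```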

type Link

type Link struct {
	Href     string
	Text     string
	Rel      string
	External bool
	Anchor   bool
	Resource bool   // true for non-anchor resource URLs (img, script, iframe, etc.)
	Tag      string // source element tag (e.g., "img", "script", "iframe")
}

Link represents a hyperlink or resource reference (such as an img, script, or iframe URL) found on a page.
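A field like External implies that each Href is compared against the page it was found on. The sketch below shows one way to do that with net/url: resolve the href against the page URL, then compare hosts. resolveLink is a hypothetical helper, and host comparison is an assumed criterion, since the package does not document its own rule.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveLink turns href into an absolute URL relative to pageURL and
// reports whether it points at a different host than the page.
func resolveLink(pageURL, href string) (string, bool, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", false, err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", false, err
	}
	abs := base.ResolveReference(ref)
	// Host includes the port, so a different port counts as external here.
	external := !strings.EqualFold(abs.Host, base.Host)
	return abs.String(), external, nil
}

func main() {
	page := "https://example.com/docs/"
	for _, href := range []string{"../about", "//cdn.example.net/app.js"} {
		abs, external, err := resolveLink(page, href)
		fmt.Println(abs, external, err)
	}
}
```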

type Page

type Page struct {
	URL        string
	StatusCode int
	Headers    http.Header
	Body       []byte
	Links      []Link
	Forms      []Form
	Depth      int
	ParentURL  string
	Duration   time.Duration
	Error      error
}

Page represents a single crawled page with its metadata.

type RobotsCache

type RobotsCache struct {
	// contains filtered or unexported fields
}

RobotsCache caches parsed robots.txt rules per host.

func NewRobotsCache

func NewRobotsCache() *RobotsCache

NewRobotsCache creates an empty robots.txt cache.

func (*RobotsCache) Allowed

func (rc *RobotsCache) Allowed(rawURL, userAgent string) bool

Allowed checks if a URL is permitted by robots.txt rules.
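For readers unfamiliar with robots.txt matching, the following is a deliberately simplified sketch of such a check. It is not RobotsCache's implementation: parseDisallow, pathAllowed, and demoRobots are hypothetical, and the sketch ignores Allow rules, wildcards, and group precedence, treating every Disallow value in a matching group as a plain path prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// demoRobots is a made-up robots.txt used by the example.
const demoRobots = `User-agent: *
Disallow: /admin/
Disallow: /tmp/

User-agent: badbot
Disallow: /
`

// parseDisallow collects Disallow prefixes from every group whose
// User-agent line is "*" or equals userAgent (case-insensitive).
func parseDisallow(robots, userAgent string) []string {
	var prefixes []string
	applies := false  // does the current group apply to userAgent?
	inAgents := false // true while reading consecutive User-agent lines
	for _, line := range strings.Split(robots, "\n") {
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // strip comments
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		val = strings.TrimSpace(val)
		switch key {
		case "user-agent":
			if !inAgents {
				applies = false // a new group starts here
			}
			inAgents = true
			if val == "*" || strings.EqualFold(val, userAgent) {
				applies = true
			}
		case "disallow":
			inAgents = false
			if applies && val != "" {
				prefixes = append(prefixes, val)
			}
		default:
			inAgents = false
		}
	}
	return prefixes
}

// pathAllowed reports whether path avoids every disallowed prefix.
func pathAllowed(prefixes []string, path string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(path, p) {
			return false
		}
	}
	return true
}

func main() {
	rules := parseDisallow(demoRobots, "mybot")
	fmt.Println(pathAllowed(rules, "/blog"), pathAllowed(rules, "/admin/users"))
}
```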

func (*RobotsCache) Fetch

func (rc *RobotsCache) Fetch(ctx context.Context, client *http.Client, origin string)

Fetch downloads and parses robots.txt for the given origin.

func (*RobotsCache) Sitemaps

func (rc *RobotsCache) Sitemaps(origin string) []string

Sitemaps returns sitemap URLs declared in robots.txt.

type SitemapURL

type SitemapURL struct {
	Loc        string `xml:"loc"`
	Lastmod    string `xml:"lastmod,omitempty"`
	Changefreq string `xml:"changefreq,omitempty"`
	Priority   string `xml:"priority,omitempty"`
}

SitemapURL represents a single URL entry in a sitemap.
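The xml tags on SitemapURL suggest decoding with encoding/xml. The sketch below shows how a URL set and a sitemap index could both be decoded in one pass. The SitemapURL type is copied from the declaration above so that the example is self-contained; sitemapDoc, sitemapRef, parseSitemap, and the sample documents are hypothetical and not part of this package.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// SitemapURL mirrors the type documented above.
type SitemapURL struct {
	Loc        string `xml:"loc"`
	Lastmod    string `xml:"lastmod,omitempty"`
	Changefreq string `xml:"changefreq,omitempty"`
	Priority   string `xml:"priority,omitempty"`
}

// sitemapRef is one <sitemap> entry inside a sitemap index.
type sitemapRef struct {
	Loc string `xml:"loc"`
}

// sitemapDoc accepts either root element: a <urlset> fills URLs and a
// <sitemapindex> fills Sitemaps.
type sitemapDoc struct {
	URLs     []SitemapURL `xml:"url"`
	Sitemaps []sitemapRef `xml:"sitemap"`
}

const demoURLSet = `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2026-01-01</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`

const demoIndex = `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>`

// parseSitemap returns page URLs and, for index files, the child
// sitemap URLs that would need to be fetched next.
func parseSitemap(data []byte) (pages, children []string, err error) {
	var doc sitemapDoc
	if err := xml.Unmarshal(data, &doc); err != nil {
		return nil, nil, err
	}
	for _, u := range doc.URLs {
		pages = append(pages, strings.TrimSpace(u.Loc))
	}
	for _, s := range doc.Sitemaps {
		children = append(children, strings.TrimSpace(s.Loc))
	}
	return pages, children, nil
}

func main() {
	pages, _, err := parseSitemap([]byte(demoURLSet))
	fmt.Println(pages, err)
	_, children, err := parseSitemap([]byte(demoIndex))
	fmt.Println(children, err)
}
```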
