crawler

package

v0.1.0 (latest) | Published: May 2, 2026 | License: MIT | Imports: 13 | Imported by: 0

Repository

github.com/GrayCodeAI/inspect

Documentation ¶

Overview ¶

Package crawler implements a concurrent website crawler with rate limiting, depth control, URL deduplication, and robots.txt compliance.

Index ¶

func FetchSitemapURLs(ctx context.Context, client *http.Client, sitemapURLs []string) []string
func ServeDir(ctx context.Context, dir string) (*http.Server, string, error)
type Config
type Crawler
- func New(cfg Config) *Crawler
- func (c *Crawler) Crawl(ctx context.Context, startURL string) ([]*Page, error)
type Form
type FormInput
type Link
type Page
type RobotsCache
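- func NewRobotsCache() *RobotsCache
- func (rc *RobotsCache) Allowed(rawURL, userAgent string) bool
- func (rc *RobotsCache) Fetch(ctx context.Context, client *http.Client, origin string)
- func (rc *RobotsCache) Sitemaps(origin string) []string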
type SitemapURL

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func FetchSitemapURLs ¶

func FetchSitemapURLs(ctx context.Context, client *http.Client, sitemapURLs []string) []string

FetchSitemapURLs fetches and parses sitemap(s) from the given URLs. Supports both sitemap index files and direct URL sets.
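The distinction between a sitemap index and a direct URL set can be made by looking at the root element of the XML document. The following is a minimal sketch of that check using only the standard library; `sitemapKind` is a hypothetical helper, not part of this package, and the package may detect the two formats differently.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"errors"
	"io"
)

// sitemapKind (hypothetical helper) returns the local name of the root
// element: "sitemapindex" for an index file, "urlset" for a direct URL set.
func sitemapKind(data []byte) (string, error) {
	dec := xml.NewDecoder(bytes.NewReader(data))
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return "", errors.New("no root element found")
		}
		if err != nil {
			return "", err
		}
		// Skip the XML declaration, comments, and whitespace until the
		// first start element, which is the document root.
		if start, ok := tok.(xml.StartElement); ok {
			return start.Name.Local, nil
		}
	}
}
```

A caller could fetch each child sitemap when the result is "sitemapindex" and collect `<loc>` values directly when it is "urlset".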

func ServeDir ¶

func ServeDir(ctx context.Context, dir string) (*http.Server, string, error)

ServeDir starts a temporary HTTP file server for the given directory. Returns the server and its address (host:port). The caller must call srv.Close() when done.
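One common way to build a temporary file server of this kind is to listen on port 0 so the operating system picks a free port. The sketch below shows that pattern with the standard library only; `serveDirSketch` is a hypothetical stand-in that omits the context parameter, and it is not necessarily how ServeDir is implemented.

```go
package main

import (
	"net"
	"net/http"
)

// serveDirSketch (hypothetical) serves dir over HTTP on a free loopback port
// and returns the server together with its host:port address.
func serveDirSketch(dir string) (*http.Server, string, error) {
	// Port 0 asks the OS to choose an unused port.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, "", err
	}
	srv := &http.Server{Handler: http.FileServer(http.Dir(dir))}
	// Serve blocks until the server is closed, so run it in a goroutine.
	// Its return value (http.ErrServerClosed after Close) is ignored here.
	go srv.Serve(ln)
	return srv, ln.Addr().String(), nil
}
```

As with ServeDir, the caller is expected to call `srv.Close()` when done; the returned address can be prefixed with `http://` to form a start URL.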

Types ¶

type Config ¶

type Config struct {
	MaxDepth        int
	Concurrency     int
	Timeout         time.Duration
	PageTimeout     time.Duration
	RateLimit       int
	RetryAttempts   int
	RetryDelay      time.Duration
	UserAgent       string
	FollowRedirects int
	RespectRobots   bool
	Exclude         []string
	AuthHeader      string
	AuthValue       string
	CookieJar       http.CookieJar
}

Config controls crawler behavior.

type Crawler ¶

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler performs concurrent crawling with rate limiting.

func New ¶

func New(cfg Config) *Crawler

New creates a configured Crawler.

func (*Crawler) Crawl ¶

func (c *Crawler) Crawl(ctx context.Context, startURL string) ([]*Page, error)

Crawl starts from the given URL and discovers pages up to MaxDepth. Returns all crawled pages. Safe for concurrent use via internal locking.
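Depth control and URL deduplication can be illustrated with a breadth-first traversal over an in-memory link graph. This is a sequential sketch under stated assumptions: the start URL is treated as depth 0, `crawlGraph` and `queued` are hypothetical names, and the real Crawl fetches pages over HTTP and runs concurrently.

```go
package main

// queued pairs a URL with the depth at which it was discovered.
type queued struct {
	url   string
	depth int
}

// crawlGraph (hypothetical) visits URLs breadth-first starting at start,
// never visiting a URL twice and never following links past maxDepth.
// links maps each URL to the URLs it links to.
func crawlGraph(links map[string][]string, start string, maxDepth int) []string {
	seen := map[string]bool{start: true} // deduplication set
	queue := []queued{{url: start, depth: 0}}
	var visited []string

	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		visited = append(visited, cur.url)

		// Pages at the depth limit are recorded but their links are not followed.
		if cur.depth >= maxDepth {
			continue
		}
		for _, next := range links[cur.url] {
			if seen[next] {
				continue
			}
			seen[next] = true
			queue = append(queue, queued{url: next, depth: cur.depth + 1})
		}
	}
	return visited
}
```

Marking a URL as seen when it is enqueued, rather than when it is visited, keeps the same URL from being queued twice by two different parents.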

type Form ¶

type Form struct {
	Action  string
	Method  string
	ID      string
	Inputs  []FormInput
	HasCSRF bool
}

Form represents an HTML form found on a page.

type FormInput ¶

type FormInput struct {
	Name     string
	Type     string
	Required bool
	Value    string
}

FormInput represents a form field.

type Link ¶

type Link struct {
	Href     string
	Text     string
	Rel      string
	External bool
	Anchor   bool
	Resource bool   // true for non-anchor resource URLs (img, script, iframe, etc.)
	Tag      string // source element tag (e.g., "img", "script", "iframe")
}

Link represents a hyperlink or embedded resource reference (for example img, script, or iframe) found on a page.
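The documentation does not say how the External field is computed. A plausible reading is "the link resolves to a different host than the page it was found on", which the sketch below implements with net/url; `isExternal` is a hypothetical helper and the host-comparison rule is an assumption.

```go
package main

import (
	"net/url"
	"strings"
)

// isExternal (hypothetical) resolves href against pageURL and reports
// whether the result points at a different host. Assumption: "external"
// means a host mismatch, compared case-insensitively and ignoring the port.
func isExternal(pageURL, href string) (bool, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return false, err
	}
	// Parse resolves relative references such as "../about" or "#top"
	// against the page URL.
	ref, err := base.Parse(href)
	if err != nil {
		return false, err
	}
	return !strings.EqualFold(ref.Hostname(), base.Hostname()), nil
}
```

Under this rule, relative links and fragment-only links are always internal, because they inherit the host of the page.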

type Page ¶

type Page struct {
	URL        string
	StatusCode int
	Headers    http.Header
	Body       []byte
	Links      []Link
	Forms      []Form
	Depth      int
	ParentURL  string
	Duration   time.Duration
	Error      error
}

Page represents a single crawled page with its metadata.

type RobotsCache ¶

type RobotsCache struct {
	// contains filtered or unexported fields
}

RobotsCache caches parsed robots.txt rules per host.

func NewRobotsCache ¶

func NewRobotsCache() *RobotsCache

NewRobotsCache creates an empty robots.txt cache.

func (*RobotsCache) Allowed ¶

func (rc *RobotsCache) Allowed(rawURL, userAgent string) bool

Allowed checks if a URL is permitted by robots.txt rules.
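To make the idea of a robots.txt check concrete, here is a deliberately simplified sketch. It only handles `User-agent` and `Disallow` lines, matches paths by prefix, and merges every group that names the agent or `*`; the full Robots Exclusion Protocol also has `Allow` rules, wildcards, and most-specific-group selection. Both function names are hypothetical, and RobotsCache may behave differently.

```go
package main

import "strings"

// disallowedPrefixes (hypothetical) collects Disallow path prefixes from
// groups whose User-agent is "*" or equals agent (case-insensitive).
func disallowedPrefixes(robots, agent string) []string {
	var prefixes []string
	applies := false  // does the current group apply to agent?
	inAgents := false // are we inside a run of User-agent lines?
	for _, line := range strings.Split(robots, "\n") {
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // strip comments
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		val = strings.TrimSpace(val)
		switch key {
		case "user-agent":
			if !inAgents {
				// First User-agent line after rules starts a new group.
				applies = false
				inAgents = true
			}
			if val == "*" || strings.EqualFold(val, agent) {
				applies = true
			}
		case "disallow":
			inAgents = false
			if applies && val != "" {
				prefixes = append(prefixes, val)
			}
		default:
			inAgents = false
		}
	}
	return prefixes
}

// pathAllowed (hypothetical) reports whether path matches none of the prefixes.
func pathAllowed(prefixes []string, path string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(path, p) {
			return false
		}
	}
	return true
}
```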

func (*RobotsCache) Fetch ¶

func (rc *RobotsCache) Fetch(ctx context.Context, client *http.Client, origin string)

Fetch downloads and parses robots.txt for the given origin.

func (*RobotsCache) Sitemaps ¶

func (rc *RobotsCache) Sitemaps(origin string) []string

Sitemaps returns sitemap URLs declared in robots.txt.
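Sitemap declarations in robots.txt are plain `Sitemap: <url>` lines that are independent of any user-agent group. A minimal sketch of extracting them follows; `sitemapsFromRobots` is a hypothetical helper that works on the file body directly, whereas the Sitemaps method reads from the cache by origin.

```go
package main

import "strings"

// sitemapsFromRobots (hypothetical) returns the URLs of all "Sitemap:"
// lines in a robots.txt body. The field name is matched case-insensitively.
func sitemapsFromRobots(body string) []string {
	var out []string
	for _, line := range strings.Split(body, "\n") {
		// Cut at the first colon only, so the "https://" in the value survives.
		key, val, ok := strings.Cut(line, ":")
		if !ok || !strings.EqualFold(strings.TrimSpace(key), "sitemap") {
			continue
		}
		if val = strings.TrimSpace(val); val != "" {
			out = append(out, val)
		}
	}
	return out
}
```

The resulting slice has the same shape as the sitemapURLs argument of FetchSitemapURLs.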

type SitemapURL ¶

type SitemapURL struct {
	Loc        string `xml:"loc"`
	Lastmod    string `xml:"lastmod,omitempty"`
	Changefreq string `xml:"changefreq,omitempty"`
	Priority   string `xml:"priority,omitempty"`
}

SitemapURL represents a single URL entry in a sitemap.
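The struct tags above map directly onto the child elements of a `<url>` entry, so a URL set can be decoded with encoding/xml. The sketch below uses a local copy of SitemapURL to stay self-contained; the `urlSet` wrapper and `parseURLSet` function are hypothetical and not part of this package.

```go
package main

import "encoding/xml"

// SitemapURL is a local copy of the documented struct.
type SitemapURL struct {
	Loc        string `xml:"loc"`
	Lastmod    string `xml:"lastmod,omitempty"`
	Changefreq string `xml:"changefreq,omitempty"`
	Priority   string `xml:"priority,omitempty"`
}

// urlSet (hypothetical) models the <urlset> root element of a sitemap.
type urlSet struct {
	URLs []SitemapURL `xml:"url"`
}

// parseURLSet (hypothetical) decodes a sitemap URL set into its entries.
func parseURLSet(data []byte) ([]SitemapURL, error) {
	var set urlSet
	if err := xml.Unmarshal(data, &set); err != nil {
		return nil, err
	}
	return set.URLs, nil
}
```

Optional elements that are absent from an entry, such as `<lastmod>`, simply leave the corresponding field as the empty string.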
