crawler

package v0.5.0

Published: May 11, 2026 License: MIT Imports: 14 Imported by: 0

Documentation

Overview

Package crawler implements a concurrent website crawler with rate limiting, depth control, URL deduplication, and robots.txt compliance.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func FetchSitemapURLs

func FetchSitemapURLs(ctx context.Context, client *http.Client, sitemapURLs []string) []string

FetchSitemapURLs fetches and parses the sitemaps at the given URLs. It supports both sitemap index files and direct URL sets.
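
A minimal sketch, assuming the standard context, net/http, time, and fmt imports; the sitemap URL is a placeholder:

ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

client := &http.Client{Timeout: 10 * time.Second}
urls := crawler.FetchSitemapURLs(ctx, client, []string{"https://example.com/sitemap.xml"})
for _, u := range urls {
	fmt.Println(u)
}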

func ServeDir

func ServeDir(ctx context.Context, dir string) (*http.Server, string, error)

ServeDir starts a temporary HTTP file server for the given directory. Returns the server and its address (host:port). The caller must call srv.Close() when done.
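 
A sketch of serving and crawling a local directory; the directory path, Config values, and ctx are illustrative:

srv, addr, err := crawler.ServeDir(ctx, "./public")
if err != nil {
	log.Fatal(err)
}
defer srv.Close()

pages, err := crawler.New(crawler.Config{MaxDepth: 2, Concurrency: 4}).Crawl(ctx, "http://"+addr+"/")
if err != nil {
	log.Fatal(err)
}
fmt.Println(len(pages), "pages crawled")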

Types

type Config

type Config struct {
	MaxDepth        int
	Concurrency     int
	Timeout         time.Duration
	PageTimeout     time.Duration
	RateLimit       int
	RetryAttempts   int
	RetryDelay      time.Duration
	UserAgent       string
	FollowRedirects int
	RespectRobots   bool
	Exclude         []string
	AuthHeader      string
	AuthValue       string
	CookieJar       http.CookieJar
	AllowPrivateIPs bool // When true, skip SSRF protection for private IPs
}

Config controls crawler behavior.
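
An illustrative configuration. The RateLimit unit and the Exclude matching semantics are not documented here, so those values are assumptions:

cfg := crawler.Config{
	MaxDepth:        3,
	Concurrency:     8,
	Timeout:         2 * time.Minute,  // assumed: overall crawl budget
	PageTimeout:     15 * time.Second, // assumed: per-page fetch budget
	RateLimit:       10,               // assumed: requests per second
	RetryAttempts:   2,
	RetryDelay:      time.Second,
	UserAgent:       "examplebot/0.5",
	FollowRedirects: 5, // assumed: maximum redirects per request
	RespectRobots:   true,
	Exclude:         []string{"/admin"}, // assumed: matching semantics
}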

type Crawler

type Crawler struct {
	// contains filtered or unexported fields
}

Crawler performs concurrent crawling with rate limiting.

func New

func New(cfg Config) *Crawler

New creates a configured Crawler.

func (*Crawler) Crawl

func (c *Crawler) Crawl(ctx context.Context, startURL string) ([]*Page, error)

Crawl starts from the given URL and discovers pages up to MaxDepth. Returns all crawled pages. Safe for concurrent use via internal locking.
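
A typical end-to-end sketch, reusing the illustrative cfg from the Config example above; it also shows the Page fields most callers read:

c := crawler.New(cfg)
pages, err := c.Crawl(ctx, "https://example.com/")
if err != nil {
	log.Fatal(err)
}
for _, p := range pages {
	if p.Error != nil {
		log.Printf("fetch failed for %s: %v", p.URL, p.Error)
		continue
	}
	fmt.Printf("%d %s (depth %d, %d links)\n", p.StatusCode, p.URL, p.Depth, len(p.Links))
}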

type Form

type Form struct {
	Action  string
	Method  string
	ID      string
	Inputs  []FormInput
	HasCSRF bool
}

Form represents an HTML form found on a page.
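
A caller might, for example, flag POST forms that carry no CSRF token; strings.EqualFold avoids assuming how Method is cased:

for _, f := range page.Forms {
	if strings.EqualFold(f.Method, "post") && !f.HasCSRF {
		fmt.Printf("form %q posting to %s has no CSRF token\n", f.ID, f.Action)
	}
}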

type FormInput

type FormInput struct {
	Name     string
	Type     string
	Required bool
	Value    string
}

FormInput represents a form field.

type Link

type Link struct {
	Href     string
	Text     string
	Rel      string
	External bool
	Anchor   bool
	Resource bool   // true for non-anchor resource URLs (img, script, iframe, etc.)
	Tag      string // source element tag (e.g., "img", "script", "iframe")
}

Link represents a hyperlink found on a page.
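
The flags let callers separate link kinds while walking a Page's Links slice, as in this sketch:

for _, l := range page.Links {
	switch {
	case l.Anchor:
		// same-document fragment; usually skipped
	case l.Resource:
		fmt.Printf("resource <%s>: %s\n", l.Tag, l.Href)
	case l.External:
		fmt.Println("external:", l.Href)
	default:
		fmt.Println("internal:", l.Href)
	}
}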

type Page

type Page struct {
	URL          string
	StatusCode   int
	Headers      http.Header
	Body         []byte
	Links        []Link
	Forms        []Form
	Depth        int
	ParentURL    string
	Duration     time.Duration
	Error        error
	AuthRequired bool // true when server returned 401/403
}

Page represents a single crawled page with its metadata.

type ParseResult added in v0.5.0

type ParseResult struct {
	Links    []Link
	Forms    []Form
	ParseErr error // non-nil if HTML parsing encountered an error (partial results still returned)
}

ParseResult holds links/forms extraction results along with any parse error.

func ParseHTML added in v0.5.0

func ParseHTML(pageURL string, body []byte) ParseResult

ParseHTML extracts links and forms, returning partial results even on parse error.
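
A sketch of the partial-results pattern, assuming body holds fetched HTML as []byte: inspect ParseErr, but still use whatever was extracted.

res := crawler.ParseHTML("https://example.com/page", body)
if res.ParseErr != nil {
	log.Printf("parse was incomplete: %v", res.ParseErr)
}
fmt.Printf("extracted %d links and %d forms\n", len(res.Links), len(res.Forms))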

type RobotsCache

type RobotsCache struct {
	// contains filtered or unexported fields
}

RobotsCache caches parsed robots.txt rules per host.

func NewRobotsCache

func NewRobotsCache() *RobotsCache

NewRobotsCache creates an empty robots.txt cache.

func (*RobotsCache) Allowed

func (rc *RobotsCache) Allowed(rawURL, userAgent string) bool

Allowed checks if a URL is permitted by robots.txt rules. Per the standard, if both Allow and Disallow match a path, the longest matching rule wins. If they are the same length, Allow takes precedence.
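
To illustrate the precedence rule, suppose example.com served a hypothetical robots.txt containing "Disallow: /private" and "Allow: /private/public" under "User-agent: *". Then:

rc := crawler.NewRobotsCache()
rc.Fetch(ctx, &http.Client{Timeout: 10 * time.Second}, "https://example.com")
fmt.Println(rc.Allowed("https://example.com/private/public/page", "examplebot")) // true: the Allow rule is longer
fmt.Println(rc.Allowed("https://example.com/private/secret", "examplebot"))      // false: only Disallow matches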

func (*RobotsCache) CrawlDelay added in v0.5.0

func (rc *RobotsCache) CrawlDelay(origin string) time.Duration

CrawlDelay returns the crawl-delay directive for the given origin, or 0 if not set.

func (*RobotsCache) Fetch

func (rc *RobotsCache) Fetch(ctx context.Context, client *http.Client, origin string)

Fetch downloads and parses robots.txt for the given origin.

func (*RobotsCache) Sitemaps

func (rc *RobotsCache) Sitemaps(origin string) []string

Sitemaps returns sitemap URLs declared in robots.txt.
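
Fetch, Sitemaps, CrawlDelay, and FetchSitemapURLs compose into a sitemap discovery step, sketched here with an illustrative origin:

client := &http.Client{Timeout: 10 * time.Second}
rc := crawler.NewRobotsCache()
rc.Fetch(ctx, client, "https://example.com")
urls := crawler.FetchSitemapURLs(ctx, client, rc.Sitemaps("https://example.com"))
fmt.Printf("crawl-delay=%s, %d sitemap URLs\n", rc.CrawlDelay("https://example.com"), len(urls))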

type SitemapURL

type SitemapURL struct {
	Loc        string `xml:"loc"`
	Lastmod    string `xml:"lastmod,omitempty"`
	Changefreq string `xml:"changefreq,omitempty"`
	Priority   string `xml:"priority,omitempty"`
}

SitemapURL represents a single URL entry in a sitemap.
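
The xml tags follow the standard sitemap schema, so a single entry unmarshals directly with encoding/xml; a minimal sketch:

var u crawler.SitemapURL
if err := xml.Unmarshal([]byte(`<url><loc>https://example.com/</loc><lastmod>2026-01-01</lastmod></url>`), &u); err != nil {
	log.Fatal(err)
}
fmt.Println(u.Loc, u.Lastmod) // https://example.com/ 2026-01-01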
