Documentation
Overview ¶
FILE: pkg/crawler/crawler.go
FILE: pkg/crawler/output.go
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Crawler ¶
type Crawler struct {
// contains filtered or unexported fields
}
Crawler orchestrates the web crawling process for a single configured site.
func NewCrawler ¶
func NewCrawler(
	appCfg *config.AppConfig,
	siteCfg *config.SiteConfig,
	siteKey string,
	baseLogger *logrus.Entry,
	store storage.VisitedStore,
	fetcher fetch.HTTPFetcher,
	rateLimiter *fetch.RateLimiter,
	crawlCtx context.Context,
	cancelCrawl context.CancelFunc,
	resume bool,
) (*Crawler, error)
NewCrawler creates and initializes a new Crawler instance and its components.
func NewCrawlerWithOptions ¶
func NewCrawlerWithOptions(
	appCfg *config.AppConfig,
	siteCfg *config.SiteConfig,
	siteKey string,
	baseLogger *logrus.Entry,
	store storage.VisitedStore,
	fetcher fetch.HTTPFetcher,
	rateLimiter *fetch.RateLimiter,
	crawlCtx context.Context,
	cancelCrawl context.CancelFunc,
	resume bool,
	opts *CrawlerOptions,
) (*Crawler, error)
NewCrawlerWithOptions creates a new Crawler with optional configuration.
func (*Crawler) FoundSitemap ¶
FoundSitemap implements fetch.SitemapDiscoverer for the RobotsHandler callback. It's called by RobotsHandler when a sitemap URL is found in robots.txt.
func (*Crawler) GetProgress ¶
func (c *Crawler) GetProgress() CrawlerProgress
GetProgress returns the current progress of the crawler.
type CrawlerOptions ¶
type CrawlerOptions struct {
// If nil, the crawler creates its own semaphore based on appCfg.MaxRequests
SharedSemaphore *semaphore.Weighted
}
CrawlerOptions contains optional parameters for NewCrawlerWithOptions.
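The nil fallback described in the SharedSemaphore comment can be illustrated with a minimal sketch. It is not the package's implementation: `slots` is a stdlib-only stand-in for *semaphore.Weighted, `options` mirrors the shape of CrawlerOptions, and `resolveSemaphore` is a hypothetical helper. Treating a nil options pointer the same as a nil SharedSemaphore is also an assumption.

```go
package main

import "fmt"

// slots is a buffered-channel stand-in for *semaphore.Weighted.
// Assumption for this sketch only; the real field uses the
// golang.org/x/sync/semaphore type.
type slots chan struct{}

// options mirrors the shape of CrawlerOptions using the stand-in type.
type options struct {
	SharedSemaphore slots
}

// resolveSemaphore (hypothetical) returns the shared semaphore when one is
// supplied and otherwise creates a private one sized by maxRequests.
func resolveSemaphore(opts *options, maxRequests int) slots {
	if opts != nil && opts.SharedSemaphore != nil {
		return opts.SharedSemaphore
	}
	return make(slots, maxRequests)
}

func main() {
	shared := make(slots, 4)

	// Two crawlers given the same options draw from one pool of 4 slots.
	a := resolveSemaphore(&options{SharedSemaphore: shared}, 10)
	b := resolveSemaphore(&options{SharedSemaphore: shared}, 10)

	// A crawler without options gets its own pool of maxRequests slots.
	c := resolveSemaphore(nil, 10)

	fmt.Println(a == b, cap(a), cap(c)) // prints "true 4 10"
}
```

Sharing one semaphore across several crawlers caps total concurrency for the whole process, while the fallback caps it per crawler.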
type CrawlerProgress ¶
CrawlerProgress contains progress information for a crawler.
type OutputManager ¶
type OutputManager struct {
// contains filtered or unexported fields
}
OutputManager owns all output file handles and metadata collection for a crawl.
func NewOutputManager ¶
func NewOutputManager(log *logrus.Entry, resolved *config.ResolvedSiteConfig, siteCfg *config.SiteConfig, enableTokenCounting bool, siteKey, siteOutputDir string) *OutputManager
NewOutputManager creates an OutputManager without opening files. Call OpenFiles after the output directory is ready (e.g. after cleanSiteOutputDir).
func (*OutputManager) Close ¶
func (om *OutputManager) Close() error
Close syncs and closes all output files and writes the YAML metadata file.
func (*OutputManager) OpenFiles ¶
func (om *OutputManager) OpenFiles(resume bool)
OpenFiles opens all configured output files (TSV, JSONL, chunks). Must be called after the output directory exists and has been cleaned if needed.
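The ordering requirement (prepare the directory, then OpenFiles, then Close at the end) might be wired up roughly as follows. This is a sketch under stated assumptions: `outputLifecycle` repeats only the OpenFiles and Close signatures shown on this page, while `fakeOutput`, `runWithOutput`, and the `prepareDir`/`crawl` callbacks are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// outputLifecycle lists the two OutputManager methods this sketch drives,
// with the signatures shown in this documentation.
type outputLifecycle interface {
	OpenFiles(resume bool)
	Close() error
}

// fakeOutput is a hypothetical stand-in that only records call order.
type fakeOutput struct {
	calls []string
}

func (f *fakeOutput) OpenFiles(resume bool) {
	f.calls = append(f.calls, fmt.Sprintf("OpenFiles(%t)", resume))
}

func (f *fakeOutput) Close() error {
	f.calls = append(f.calls, "Close")
	return nil
}

// runWithOutput (hypothetical) prepares the output directory first, opens
// the output files, runs the crawl, and closes the outputs on the way out.
func runWithOutput(om outputLifecycle, resume bool, prepareDir, crawl func() error) (err error) {
	if prepErr := prepareDir(); prepErr != nil {
		return prepErr // files are never opened if the directory is not ready
	}
	om.OpenFiles(resume)
	defer func() {
		// Keep the crawl error if there is one; otherwise report Close's.
		if closeErr := om.Close(); err == nil {
			err = closeErr
		}
	}()
	return crawl()
}

func main() {
	base, err := os.MkdirTemp("", "crawl-output")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(base)

	om := &fakeOutput{}
	err = runWithOutput(om, false,
		func() error { return os.MkdirAll(filepath.Join(base, "site"), 0o755) },
		func() error { return nil },
	)
	fmt.Println(om.calls, err)
}
```

Deferring Close only after OpenFiles has run avoids closing handles that were never opened when directory preparation fails.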
func (*OutputManager) PagesSaved ¶
func (om *OutputManager) PagesSaved() int
PagesSaved returns the number of pages whose metadata has been collected.
func (*OutputManager) RecordPageOutput ¶
func (om *OutputManager) RecordPageOutput(finalURL, normalizedURL, savedContentPath string, markdownBytes []byte, pageTitle string, currentDepth, imageCount int, taskLog *logrus.Entry)
RecordPageOutput handles all post-save output: TSV write, YAML metadata collection, JSONL write, and chunks write. Called after content is successfully saved to disk. markdownBytes is the already-written markdown content, passed through to avoid re-reading the file.
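Passing markdownBytes through means per-page records can be derived from memory instead of from a second read of the saved file. The sketch below shows that idea for a single JSONL row. The `pageRecord` fields and the `jsonlLine` helper are hypothetical, since the real row layout is not shown in this documentation; only the parameter names are borrowed from RecordPageOutput.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pageRecord is a hypothetical JSONL row. The real field set written by
// OutputManager is not documented here.
type pageRecord struct {
	URL    string `json:"url"`
	Path   string `json:"path"`
	Title  string `json:"title"`
	Depth  int    `json:"depth"`
	Images int    `json:"images"`
	Bytes  int    `json:"bytes"`
}

// jsonlLine (hypothetical) builds one newline-terminated JSONL row from
// values the caller already holds, so the saved file is not read again.
func jsonlLine(finalURL, savedContentPath string, markdownBytes []byte, pageTitle string, currentDepth, imageCount int) ([]byte, error) {
	rec := pageRecord{
		URL:    finalURL,
		Path:   savedContentPath,
		Title:  pageTitle,
		Depth:  currentDepth,
		Images: imageCount,
		Bytes:  len(markdownBytes), // size taken from the in-memory content
	}
	line, err := json.Marshal(rec)
	if err != nil {
		return nil, err
	}
	return append(line, '\n'), nil
}

func main() {
	md := []byte("# A\n")
	line, err := jsonlLine("https://example.com/a", "a.md", md, "A", 1, 0)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(line))
}
```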